Acronym Extraction/Mining Software
Thread poster: Mark
Mark
Local time: 12:05
Italian to English
May 21, 2015

Hi,

I had a thought a while back, about acronyms, and I was just speaking to a colleague of mine who had the same thought independently. How come there isn’t an application (that we know of, at least) that mines acronyms and their potential expanded forms from documents? We can’t be the only people who translate pages full of impenetrable acronyms that you suspect are hiding in plain sight in the text, can we?

It’s not hard to find proposals on how to do this online, with algorithms and suchlike, but the actual applications seem more elusive:

http://research.microsoft.com/en-us/people/hangli/ji-apweb08.pdf
http://ciir.cs.umass.edu/pubfiles/ir-186.pdf

I found a Word add-on that lets you simply extract the acronyms into a table for you to define yourself at a later date, but that seems like a job half done to me.

I suppose it could also be done manually with regular expressions, but I’m not really that confident with them myself. If I imagine that I don't know what HRH stands for, for example:

H|h\w+\sR|r\w+\sH|h

I suspect that’s not right, but the idea is that that should match, say:

her royal highness
hall roof hat
high rumble hip

And then I could work out the rest on my own. In any case, I’m still surprised not to be able to find something to automate this. Am I missing something?
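On the regular expression above: the `|` operator has the lowest precedence, so the pattern as written splits into four separate alternatives (`H`, `h\w+\sR`, `r\w+\sH`, `h`) rather than three words. A character class such as `[Hh]` groups each initial as intended, e.g. `[Hh]\w+\s[Rr]\w+\s[Hh]\w+`. A minimal Python sketch of the same idea follows; the function name `expansion_pattern` and the sample text are illustrative assumptions, not part of any tool mentioned in this thread.

```python
import re

def expansion_pattern(acronym):
    """Build a case-insensitive regex with one word per acronym letter."""
    # For "HRH" this yields: \bH\w*\s+R\w*\s+H\w*\b
    words = [re.escape(letter) + r"\w*" for letter in acronym]
    return re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)

# Invented sample text containing the three phrases from the post.
text = ("Her Royal Highness arrived late. The hall roof hat was a joke; "
        "so was the high rumble hip.")

pattern = expansion_pattern("HRH")
candidates = [m.group(0) for m in pattern.finditer(text)]
print(candidates)
```

This only lists candidate phrases; deciding which one is the real expansion is still left to the translator.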



Igor Kmitowski  Identity Verified
Poland
Local time: 12:05
English to Polish
+ ...
CafeTran can extract acronyms May 21, 2015

Hi Mark,

The feature is available in the free version of CafeTran Espresso.

1. Create a Project with your source document(s).
2. Go to Edit > Find... panel.
3. Select Project Source Segments scope.
4. Select Regular expression box and Extract reg. exp. results box.
5. Type your regular expression and click Find.

CafeTran will list the results in one simple text column that you can save or copy.

Igor
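As an illustration of the kind of expression that could be typed in step 5, here is a hedged Python sketch that pulls runs of capitals out of a text and counts them. The pattern `\b[A-Z][A-Z0-9]{1,5}\b` is an assumption about what counts as an acronym (2 to 6 characters, starting with a capital), and it has not been checked against CafeTran's own regex engine, although it uses only basic syntax that most flavours share.

```python
import re
from collections import Counter

# Assumed definition of an acronym: a capital followed by 1-5 capitals or digits.
ACRONYM = re.compile(r"\b[A-Z][A-Z0-9]{1,5}\b")

def extract_acronyms(text):
    """Return a Counter mapping each acronym-like token to its frequency."""
    return Counter(ACRONYM.findall(text))

# Invented sample sentence.
sample = "The IMF and the EU met; the IMF later briefed HRH. A UN envoy attended."
counts = extract_acronyms(sample)

for acronym, n in sorted(counts.items()):
    print(acronym, n)
```

Ordinary capitalised words ("The") and single letters ("A") are skipped because the pattern needs at least two consecutive capitals or digits.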



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 11:05
Member (2009)
Dutch to English
+ ...
online/local acronym-mining tool w/ expanded form finder May 21, 2015

An acronym-mining tool would indeed be very useful, especially if it was also able to find potential expanded forms. It would also be great if it could be let loose on the internet.

related stuff:

Have you seen András Farkas’s big EU acronym collection?

@ http://www.farkastranslations.com/glossaries.php

"EU acronym collection - NOW AVAILABLE!

Similarly to the glossary, the acronyms were also harvested from aligned document sets using custom software tools made for this purpose, with a limited amount of manual correction. The acronym collection is available as a bilingual or multilingual glossary. In bilingual versions, each entry contains four fields: the acronym in language 1, the full expression in language 1, the acronym in language 2 and the full expression in language 2. E.g. ETO / European telecommunications office / BET / Bureau européen des télécommunications. In some entries, certain fields (full expression in languages other than English) are empty. In most cases, this is because the language in question uses the English acronym and thus the letters of the acronym don't match the full form, which prevents automated recognition. Every English acronym is listed along with the corresponding full English expression, and detailed statistics on other languages are available on request.
The acronym collection covers a vast range of areas, with entries ranging from the British Aluminium Foil Rollers Association (BAFRA) to the International Plant Protection Convention (IPPC, French: CIPV, Convention internationale pour la protection des végétaux). There are about 8,000 entries in all, with the potential to save you untold hours of tedious research. The acronym collection covers all EU languages except Croatian and Irish. The number of entries depends on the language pair requested.
A sample is available here (xls). The sample file contains the English, French and German versions of all the acronyms that start with the letter A.
Formats: tab delimited txt, xls and tmx. Other formats (tbx, xml etc.) available on request. I recommend using this data as a termbase, not a TM (i.e. import it into MultiTerm or the terminology module of your CAT of choice). If your terminology software can't handle synonyms (e.g. two English columns and two French columns in the same table), let me know and I will create a special two-column version that allows both the acronyms and the full forms to be all imported into the same database.
Price: EUR 25 for a bilingual glossary, plus EUR 10 for each additional language."


You might want to ask him about those "custom software tools made for this purpose".

I am also working on my own collection (in my limited spare time): http://www.acronymbook.com
However, my collection methods are much more lo-fi: I simply scour the internet in search of lists of acronyms, and also extract content from existing collections using scraping tools like HTTrack.

Another related idea is to use IntelliWebSearch to search sites like http://www.acronymfinder.com/ with a Windows shortcut, for when you come across a pesky one while translating.

As a happy CafeTran user, I thought I'd also mention the automatic abbreviation extractor, under:

Tools > Abbreviations > Scan Project for abbreviations


Mark
Local time: 12:05
Italian to English
TOPIC STARTER
Thank you, both May 22, 2015

I might have a fiddle with CafeTran at home then (I don’t imagine my employers would bother installing it on the system for the one function; they’re bound to tell me I should be doing it with UltraEdit). CafeTran and CafeTran Espresso are, I gather, different ways of saying the same thing?

Since András is selling the glossaries and suggests that people who want "to extract terminological data from [large amounts of text] to create specialized termbases/glossaries like these […] get in touch", I imagine that he’s decided to keep his methods for himself. Perhaps I’ll contact him to make sure though; it seems to me he could sell his work in another way if he chose to.


FarkasAndras
Local time: 12:05
English to Hungarian
+ ...
regex May 22, 2015

Mark Dobson wrote:

I suppose it could also be done manually with regular expressions, but I’m not really that confident with them myself. If I imagine that I don't know what HRH stands for, for example:

H|h\w+\sR|r\w+\sH|h

I suspect that’s not right, but the idea is that that should match, say:

her royal highness
hall roof hat
high rumble hip

And then I could work out the rest on my own. In any case, I’m still surprised not to be able to find something to automate this. Am I missing something?



That's pretty much the basis of how I did it, although of course there is quite a bit more nuance to it than that. I admit that I didn't do much research to see if there is ready-made open software available for the purpose. If the task is not very complicated, it's often more convenient to write the code yourself than to try and get someone else's code working - you often struggle to get it to run, wonder how you're supposed to use it or whether it's doing exactly what you want it to.
I only went after acronyms that occur along with the expanded form (I don't see much of a point in collecting just an acronym that you will have to research from scratch anyway). The easiest way to do it is to find patterns like this: L1\w+ L2\w+ L3\w+ \(L1L2L3\). I collected acronyms in multiple languages based on aligned texts so there is some wizardry in trying to make sure that they are paired up correctly and there is a lot of fiddling in covering various kinds of unusual cases (E.g. CITES is the Convention on International Trade in Endangered Species, which won't be picked up by a primitive pattern search that doesn't know that "on" and "in" are filler words. Even worse, if your acronym is the framework for international bartending standards or the Organisation Of European Fortune Tellers, a primitive algorithm will chop the first word off. Ask how I know.)
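The "expanded form followed by the acronym in brackets" search described above can be sketched roughly as follows. This is not the script discussed in the post, only a minimal Python illustration: the filler-word list `FILLERS` and the function name `find_defined_acronyms` are assumptions. It walks backwards from each bracketed acronym, matching initials and skipping filler words, and it deliberately reproduces the "chopped first word" failure mentioned above.

```python
import re

# Assumed stop list of filler words that do not contribute a letter.
FILLERS = {"of", "on", "in", "for", "and", "the"}

def find_defined_acronyms(text):
    """Return {acronym: expansion} for 'Expanded Form (EF)' patterns."""
    results = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        acronym = match.group(1)
        words = re.findall(r"[\w'-]+", text[:match.start()])
        remaining = list(acronym.lower())   # letters still to be matched
        expansion = []
        # Walk backwards through the preceding words, consuming letters.
        for word in reversed(words):
            if not remaining:
                break
            if word[0].lower() == remaining[-1]:
                remaining.pop()
                expansion.append(word)
            elif word.lower() in FILLERS:
                expansion.append(word)      # keep the filler, consume no letter
            else:
                break                       # pattern broken, give up
        if not remaining:
            results[acronym] = " ".join(reversed(expansion))
    return results

text = ("It is covered by the Convention on International Trade in "
        "Endangered Species (CITES). The Organisation Of European "
        "Fortune Tellers (OEFT) disagreed.")
found = find_defined_acronyms(text)
print(found)
```

CITES comes out whole because "on" and "in" are treated as fillers, while OEFT comes out as "Of European Fortune Tellers": the filler "Of" happens to supply the initial O, so the real first word is lost. Handling that case needs extra logic beyond this sketch.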
I had a quick look at the linked MS paper, and it looks like they went a LOT deeper down this rabbit hole with AcroMiner than I did. It's a shame they didn't publish the code. I'm not even sure there's any point in publishing what looks like it was intended as a scientific paper and then holding back the actual goods.

Not sure if I want to share my script... It's pretty rough and designed to work with my specific EU files. It could be polished up a little and upgraded to work with other files, but that would be a fair bit of work and it would still be inferior to better researched software like AcroMiner. If no other (better) tool is available online I might be persuaded to do it.


[Edited at 2015-05-22 14:46 GMT]



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 11:05
Member (2009)
Dutch to English
+ ...
Ideally … May 26, 2015

FarkasAndras wrote:

Mark Dobson wrote:

I suppose it could also be done manually with regular expressions, but I’m not really that confident with them myself. If I imagine that I don't know what HRH stands for, for example:

H|h\w+\sR|r\w+\sH|h

I suspect that’s not right, but the idea is that that should match, say:

her royal highness
hall roof hat
high rumble hip

And then I could work out the rest on my own. In any case, I’m still surprised not to be able to find something to automate this. Am I missing something?



That's pretty much the basis of how I did it, although of course there is quite a bit more nuance to it than that. I admit that I didn't do much research to see if there is ready-made open software available for the purpose. If the task is not very complicated, it's often more convenient to write the code yourself than to try and get someone else's code working - you often struggle to get it to run, wonder how you're supposed to use it or whether it's doing exactly what you want it to.
I only went after acronyms that occur along with the expanded form (I don't see much of a point in collecting just an acronym that you will have to research from scratch anyway). The easiest way to do it is to find patterns like this: L1\w+ L2\w+ L3\w+ \(L1L2L3\). I collected acronyms in multiple languages based on aligned texts so there is some wizardry in trying to make sure that they are paired up correctly and there is a lot of fiddling in covering various kinds of unusual cases (E.g. CITES is the Convention on International Trade in Endangered Species, which won't be picked up by a primitive pattern search that doesn't know that "on" and "in" are filler words. Even worse, if your acronym is the framework for international bartending standards or the Organisation Of European Fortune Tellers, a primitive algorithm will chop the first word off. Ask how I know.)
I had a quick look at the linked MS paper, and it looks like they went a LOT deeper down this rabbit hole with AcroMiner than I did. It's a shame they didn't publish the code. I'm not even sure there's any point in publishing what looks like it was intended as a scientific paper and then holding back the actual goods.

Not sure if I want to share my script... It's pretty rough and designed to work with my specific EU files. It could be polished up a little and upgraded to work with other files, but that would be a fair bit of work and it would still be inferior to better researched software like AcroMiner. If no other (better) tool is available online I might be persuaded to do it.


[Edited at 2015-05-22 14:46 GMT]


Ideally, we would take your script, plug it into some kind of (open source) web crawler and scour the internet, and then dump all the data into an open source db (available online, via something like my own fledgling project: http://acronymbook.com ).

If the data was available in some form of delimited UTF-8 text format, people could then download it and convert it for use in their own CAT tools.
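For what it's worth, a delimited UTF-8 export of this kind takes only a few lines. The sketch below writes tab-separated text with Python's standard `csv` module; the column names and the function name `write_glossary` are assumptions, and the two entries are taken from the collection description quoted earlier in the thread.

```python
import csv
import io

def write_glossary(entries, handle):
    """Write {acronym: expansion} pairs as tab-delimited rows with a header."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["acronym", "expansion"])   # assumed column names
    for acronym, expansion in sorted(entries.items()):
        writer.writerow([acronym, expansion])

entries = {
    "IPPC": "International Plant Protection Convention",
    "BAFRA": "British Aluminium Foil Rollers Association",
}

# An in-memory buffer keeps the example self-contained. For a real file, use
# open("acronyms.tsv", "w", encoding="utf-8", newline="") instead.
buffer = io.StringIO()
write_glossary(entries, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

Most CAT tools that import tab-delimited glossaries should be able to read a file like this, though the expected column order varies by tool.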


[Edited at 2015-05-26 14:34 GMT]



neilmac  Identity Verified
Spain
Local time: 12:05
Spanish to English
+ ...
No need May 27, 2015

Just tell the perpetrators (the people who use acronyms without defining them) that it's up to them to define the blessed things. In my experience, it never crosses their minds that they might be a mystery to many.
One of my basic conditions for collaboration is the understanding that only the most common and widely understood acronyms (BBC, EU, USA, IMF...) will be translated, while the authors must take responsibility for their more recondite cousins. KWIM?


FarkasAndras
Local time: 12:05
English to Hungarian
+ ...
maybe Jun 4, 2015

Michael Beijer wrote:

Ideally, we would take your script, plug it into some kind of (open source) web crawler and scour the internet, and then dump all the data into an open source db (available online, via something like my own fledgling project: http://acronymbook.com ).

If the data was available in some form of delimited UTF-8 text format, people could then download it and convert it for use in their own CAT tools.


Not entirely against the idea. If there is a specific "we" that wants to do this and is willing to put in the time, hit me up via email.



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 11:05
Member (2009)
Dutch to English
+ ...
"We" is still just me, and is currently on the back burner, but… Jun 5, 2015

FarkasAndras wrote:

Michael Beijer wrote:

Ideally, we would take your script, plug it into some kind of (open source) web crawler and scour the internet, and then dump all the data into an open source db (available online, via something like my own fledgling project: http://acronymbook.com ).

If the data was available in some form of delimited UTF-8 text format, people could then download it and convert it for use in their own CAT tools.


Not entirely against the idea. If there is a specific "we" that wants to do this and is willing to put in the time, hit me up via email.


…I'll drop you a line when I get around to working on the idea a bit more!

I'm still trying to devise an optimal way to work acronyms and abbreviations into my daily workflow while translating (with CafeTran). I have masses of them, in various formats, but can't figure out a way so they are available when I need them but don't clutter up my view when translating.

I currently have an IntelliWebSearch shortcut set up to simultaneously search several of the leading acronym sites online and I have my massive db as a tab-del glossary in CafeTran.

I think that a collaboratively maintained mega-list would be a very valuable resource for us translators.


Mark
Local time: 12:05
Italian to English
TOPIC STARTER
I was thinking of something more targeted myself. Jun 12, 2015

Michael Beijer wrote:

Ideally, we would take your script, plug it into some kind of (open source) web crawler and scour the internet

Speaking for myself, I was more interested in looking at the source documents of specific translations rather than creating the ultimate repository of internet acronyms. I’m somewhat sceptical about the idea: there is a fair bit of nonsense on the web, isn’t there? I wonder what you would really gain from cataloguing it.

I use those acronym sites very infrequently, imagining that if an acronym is widely used enough to be considered sufficient to express the concept, I’ll be able to find it on my own.


FarkasAndras
Local time: 12:05
English to Hungarian
+ ...
Yes Jun 12, 2015

Limiting any such collection effort to specific reputable sources and/or collecting entries by domain would probably be a good idea. Otherwise there may be too much noise and too little signal coming from the resulting termbase.

Spaddock
United States
Local time: 05:05
MAX - My Acronym eXtractor - an OSX App that extracts acronyms and their definitions from text files Aug 29, 2015

My company has put an OSX App on the App Store that extracts acronyms and their definitions (which are assumed to be found in front of the acronym) from text documents.

You can find it here:
https://itunes.apple.com/us/app/max-my-acronym-extractor/id658390220?mt=12

If you are outside of the countries in which we offer it, send us a message, and we will add your country (we use a default list of countries because we don't want to worry about paying taxes in countries where we don't actually have a great demand).

Here is a link to a video that describes how it works:
https://www.youtube.com/watch?v=SZn-dWag4Uc

We regularly get requests from people who work on PCs (mostly PhD students) to make a PC version, but we don't have such plans at the moment. Usually, people will have a friend/colleague with a Mac and borrow it to create the first draft of the list of acronyms, and this step alone saves tons of time.

Of course, MAX is a computer program and will not find 100% of all acronyms, but it finds the vast majority and saves technical writers tons of time by doing that.

Hope this helps.



Samuel Murray  Identity Verified
Netherlands
Local time: 12:05
Member (2006)
English to Afrikaans
+ ...
MAX isn't quite what we mean Aug 29, 2015

Spaddock wrote:
My company has put an OSX App on the App Store that extracts acronyms and their definitions (which are assumed to be found in front of the acronym) from text documents.


Yes, MAX assumes that the definition appears in front of the acronym. In my target language, we sometimes use the acronym first and put the definition in brackets afterwards, but not always.

Even so, this is not (I think) what most respondents in this thread need. What we are referring to is a situation in which an acronym is used on its own in the text, without being defined, but does have a full form somewhere else in the text.

For example, if the author of the text assumed that his reader knows that ABC means "Apple Bureau for Certification", he might mention "Apple Bureau for Certification" or even "Apple's bureau for certification" somewhere in his document while using "ABC" elsewhere in the same document. What is needed is a program that can guess what "ABC" is the abbreviation of.
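A rough Python sketch of that kind of guessing is given below. It collects every run of capitals in a document and then searches the same document for phrases whose words start with those letters, allowing filler words such as "for" in between and ignoring case and possessives. The filler list, the 2 to 6 letter limit and the function names are all assumptions made for illustration.

```python
import re

# Assumed filler words that may sit between the words of an expansion.
FILLERS = r"(?:(?:of|for|in|on|and|the)\s+)*"

def guess_expansions(acronym, text):
    """Return sorted candidate phrases whose initials spell the acronym."""
    # Each letter starts a word; the word may contain an apostrophe ("Apple's").
    parts = [re.escape(ch) + r"[\w']*" for ch in acronym]
    pattern = r"\b" + (r"\s+" + FILLERS).join(parts) + r"\b"
    return sorted({m.group(0) for m in re.finditer(pattern, text, re.IGNORECASE)})

def guess_all(text):
    """Map every acronym-like token in the text to its candidate expansions."""
    acronyms = set(re.findall(r"\b[A-Z]{2,6}\b", text))
    return {a: guess_expansions(a, text) for a in acronyms}

document = ("ABC issued new rules. The Apple Bureau for Certification was "
            "founded in 1990. Later, Apple's bureau for certification moved.")
guesses = guess_all(document)
print(guesses)
```

Both spellings of the full form are returned as candidates for ABC; an acronym with no matching phrase in the document simply gets an empty list.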


[Edited at 2015-08-29 18:04 GMT]


Spaddock
United States
Local time: 05:05
MAX 2.0 - this is exactly the kind of feedback we need Aug 30, 2015

We have not encountered that problem ourselves (because we always start with a proper draft written by a qualified science writer), but if this is a frequent problem, we can easily add this functionality to MAX and let the regex run all across the text and provide best matches together with a likelihood score.
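One way such a likelihood score could work is sketched below in Python. It is not how MAX works (the post does not say), only a simple heuristic under stated assumptions: a candidate scores one point per occurrence in the text, plus up to one extra point for the share of its words that are capitalised. The weights and the name `rank_candidates` are invented for illustration, and candidates are assumed to be non-empty strings.

```python
from collections import Counter

def rank_candidates(candidates):
    """Rank candidate expansions, best first; scores are heuristic only."""
    counts = Counter(c.lower() for c in candidates)   # frequency, ignoring case
    scores = {}
    for cand in candidates:
        words = cand.split()
        # Share of words starting with a capital (0.0 to 1.0).
        capitalised = sum(w[0].isupper() for w in words) / len(words)
        key = cand.lower()
        score = counts[key] + capitalised
        scores[key] = max(scores.get(key, 0.0), score)
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

candidates = ["Her Royal Highness", "her royal highness",
              "hall roof hat", "high rumble hip"]
ranked = rank_candidates(candidates)
for phrase, score in ranked:
    print(f"{score:.1f}  {phrase}")
```

A real scorer would probably also weigh how close the phrase is to the acronym in the text and whether it recurs across documents.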

How sure could we be that the acronym is indeed found spelled out somewhere in the text? Is it possible that the author never mentions it? Would it, therefore, make sense to add an internet search (based on the content of the text) to see what else it could mean?

Thanks!



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 11:05
Member (2009)
Dutch to English
+ ...
useful and necessary additions Aug 30, 2015

Spaddock wrote:

We have not encountered that problem ourselves (because we always start with a proper draft written by a qualified science writer), but if this is a frequent problem, we can easily add this functionality to MAX and let the regex run all across the text and provide best matches together with a likelihood score.

How sure could we be that the acronym is indeed found spelled out somewhere in the text? Is it possible that the author never mentions it? Would it, therefore, make sense to add an internet search (based on the content of the text) to see what else it could mean?

Thanks!


Hi Spaddock,

Both very useful and necessary additions. Pity your tool doesn't run on Windows. Most translators I know run Windows. Myself included.

Michael

