Terminology Research tools for translators
Thread poster: John Moran

John Moran  Identity Verified
Ireland
Local time: 22:55
Member (2004)
German to English
+ ...
Jun 11, 2010

Hi all,

I have a question regarding tools for doing terminology research. We are thinking of embarking on an academic project to develop a software tool which would be of use to professional translators. I am a translator/PM myself and the ideas are tools that I would find useful in my own work.

Before starting out we need to do some research to prove the novelty of our work, i.e. that it has not been done before. On that note, here are our two ideas, and if anyone has any pointers to software which is the same or even similar I would be very grateful.

1) Tool to download a mid-size website (not www.microsoft.com) and create a corpus of parallel texts which can be searched using a concordancer.

The motivation for this tool was that many smaller companies do not seem to have a consistent way of putting their documents on their website in different languages.
We know of tools to download the website (httrack for example) and the alignment will be done using known algorithms so I guess the question here is: Does anyone know of a tool for translators that does both together?
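
To make the "searched using a concordancer" part concrete, here is a minimal keyword-in-context sketch in Python. It assumes the download and alignment steps have already produced a list of (source, target) segment pairs; the function name `concordance` and the sample segments are hypothetical and only there for illustration.

```python
import re

def concordance(pairs, term, width=30):
    """Return (left, match, right, target) rows for each hit of `term` on the source side."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    rows = []
    for source, target in pairs:
        for m in pattern.finditer(source):
            left = source[max(0, m.start() - width):m.start()]
            right = source[m.end():m.end() + width]
            rows.append((left, m.group(0), right, target))
    return rows

# Hypothetical aligned segments (German source, English target).
pairs = [
    ("Die Pumpe wird mit einem Ventil geliefert.", "The pump is supplied with a valve."),
    ("Das Ventil muss jährlich geprüft werden.", "The valve must be inspected annually."),
    ("Bitte lesen Sie die Anleitung.", "Please read the instructions."),
]

hits = concordance(pairs, "Ventil")
for left, match, right, target in hits:
    # Show the hit in context, followed by the aligned target segment.
    print(f"{left:>30} [{match}] {right:<30} | {target}")
```

A real tool would index the corpus instead of scanning it on every query, but the output format (hit in context plus aligned translation) would be much the same.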

2) Tool to narrow down google search results based on "contextual clues" in the document currently being translated.

This is a harder problem. The idea is to be able to trigger a google search from a CAT tool. The tool would then access all the pages returned from the first (say 20) results and list them in order of their "similarity" to the document being translated. The hard part will be judging similarity of course.
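
One possible baseline for the "similarity" ranking is TF-IDF weighting with cosine similarity. The sketch below is only that, a baseline under assumptions: it presumes the text of each result page has already been fetched and stripped of HTML, and the function names, URLs and sample texts are made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per document."""
    token_lists = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # Smoothed IDF, so words found in every document still carry a small weight.
        vectors.append({w: c * (math.log((1 + n) / (1 + df[w])) + 1) for w, c in tf.items()})
    return vectors

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_by_similarity(document, pages):
    """pages maps URL -> page text. Returns (url, score) pairs, most similar first."""
    urls = list(pages)
    vectors = tfidf_vectors([document] + [pages[u] for u in urls])
    scored = [(u, cosine(vectors[0], v)) for u, v in zip(urls, vectors[1:])]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical document being translated and two hypothetical search results.
document = "The hydraulic pump valve must be inspected for pressure leaks."
pages = {
    "http://example.com/recipes": "The best recipes for a summer picnic must include fresh salad.",
    "http://example.com/pumps": "Hydraulic pump maintenance: check the valve and the pressure seals for leaks.",
}
ranked = rank_by_similarity(document, pages)
for url, score in ranked:
    print(f"{score:.3f}  {url}")
```

Word overlap of this kind says nothing about register or quality of the page, so it would probably be only one of several signals.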

In this case I would be grateful for any pointers to any software that is vaguely similar.

Finally, if anyone has any ideas for terminology software they would like to see please feel free to reply to this post (or if you prefer confidentiality you can find my email address on http://www.scss.tcd.ie/~moranj3/)

Thanks!


 

John Moran  Identity Verified
Ireland
Local time: 22:55
Member (2004)
German to English
+ ...
TOPIC STARTER
Two tools Jun 11, 2010

By the way, I should say we are aware of Babylon and Linguee (http://www.linguee.com), which I really like.

 

Ana Malovrh  Identity Verified
Slovenia
Local time: 23:55
Member (2010)
German to Slovenian
+ ...
some tools I heard of Jun 14, 2010

It's been a while since I attended a conference about building corpora, but I remember they were talking about BootCaT in combination with WordSmith.

I don't know if they are any better but they are being used at universities.


 

FarkasAndras
Local time: 23:55
English to Hungarian
+ ...
Various related projects Jun 14, 2010

John Moran wrote:
1) Tool to download a mid-size website (not www.microsoft.com) and create a corpus of parallel texts which can be searched using a concordancer.

The motivation for this tool was that many smaller companies do not seem to have a consistent way of putting their documents on their website in different languages.
We know of tools to download the website (httrack for example) and the alignment will be done using known algorithms so I guess the question here is: Does anyone know of a tool for translators that does both together?

BootCaT looks a lot like what you're trying to do. I'll check it out myself, here are some thoughts in the meantime.

This is something many people are doing, but they all have their own proprietary solutions they are not too keen on sharing with us... It'd be nice if someone developed an easy-to-use, customizable, free and open source solution.
Ideally, the program would include: 1) a language recognition component that identifies what language each text is in, then 2) a component that tells if two given texts are each other's translations, and then 3) an aligner. 2 is the most difficult task by far, I'd think.
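
Component 1 can be illustrated very crudely by counting stopwords. This is only a toy sketch: the word lists are tiny hypothetical samples, and real language identifiers typically rely on character n-gram models trained on a corpus. The function name `guess_language` is made up for illustration.

```python
import re

# Tiny hypothetical stopword lists, for illustration only.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "to", "in", "with", "for"},
    "de": {"der", "die", "das", "und", "ist", "mit", "für", "nicht"},
    "hu": {"a", "az", "és", "hogy", "nem", "is", "egy", "van"},
}

def guess_language(text):
    """Return the language whose stopwords occur most often, or None if none occur."""
    tokens = re.findall(r"\w+", text.lower())
    scores = {lang: sum(t in words for t in tokens) for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_language("Die Pumpe ist mit einem Ventil ausgestattet und nicht für Öl geeignet."))
print(guess_language("The valve is supplied with the pump."))
print(guess_language("Ez egy könyv, és az nem az enyém."))
```

Components 2 and 3 are much harder and are not covered by this sketch.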

Google and all the MT developers do this, as does Translated: http://mymemory.translated.net/.

Here is yet another project, possibly the most interesting one for you:

http://www.umiacs.umd.edu/~resnik/strand/

http://www.proz.com/forum/internet_for_translators/163544-what_is_strand_structural_translation_recognition_for_acquiring_natural_data_.html

I never checked if they released the software itself or just the URL lists it generated.


For the aligner, check out Hunalign here:
mokk.bme.hu/resources/hunalign (the site seems to be down now and I can't even get at the version cached by Google, hopefully it'll be back up soon)
I'm quite confident in saying that it's the best open source aligner there is. It's an improved version of the Gale-Church algorithm. Unless you've already developed some amazing algorithm, it'd be silly not to use Hunalign.
I wrote a preprocessor/postprocessor/frontend for Hunalign, see here: http://sourceforge.net/projects/aligner/
BTW it's pretty heavy duty stuff, it can easily handle hundreds of thousands of segments in one go.
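
For readers who have not met length-based alignment before, the general idea can be sketched as a small dynamic program in Python. To be clear, this is not Hunalign's code and not the actual Gale-Church cost model (which is probabilistic): the cost function and the penalty values below are arbitrary assumptions chosen for illustration, and the sample sentences are made up.

```python
# Allowed bead shapes (source sentences, target sentences) with ad hoc penalties.
MOVES = {(1, 1): 0.0, (1, 0): 5.0, (0, 1): 5.0, (2, 1): 1.0, (1, 2): 1.0}

def bead_cost(s_chunk, t_chunk, penalty):
    """Penalty plus a relative character-length mismatch (an assumed, simplified cost)."""
    ls = sum(len(s) for s in s_chunk)
    lt = sum(len(t) for t in t_chunk)
    if ls == 0 or lt == 0:
        return penalty  # a sentence with no counterpart
    return penalty + 10.0 * abs(ls - lt) / (ls + lt)

def align_by_length(src, tgt):
    """Return a list of (source_sentences, target_sentences) beads with minimal total cost."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for (di, dj), penalty in MOVES.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                cost = best[i][j] + bead_cost(src[i:ni], tgt[j:nj], penalty)
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (i, j)
    # Walk back from the end to recover the chosen beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    beads.reverse()
    return beads

src = [
    "Die Pumpe ist defekt.",
    "Sie muss ersetzt werden.",
    "Bitte kontaktieren Sie den Kundendienst so bald wie möglich.",
]
tgt = [
    "The pump is defective and must be replaced.",
    "Please contact customer service as soon as possible.",
]
beads = align_by_length(src, tgt)
for s_chunk, t_chunk in beads:
    print(" / ".join(s_chunk), "<=>", " / ".join(t_chunk))
```

In this example the first two German sentences end up merged against the first English one, because that keeps the lengths on both sides close. Hunalign additionally uses dictionary information, which this sketch ignores entirely.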

As long as your project is free and open source, you're more than welcome to use my ideas/code. It's a Windows BAT file, which is surely not what you'll use, but most of what it does is sed commands, which you could easily use on whatever your platform is.
In fact, the two projects could be integrated very easily. If your sw produces a list of URL pairs (i.e. URLs of matching webpages), I could very easily add a line or two to my aligner code which would allow it to download the pages identified by your sw as matching, strip HTML tags, align them and produce a TMX. I'm planning to port the whole thing to perl soon which would make it more platform-independent.
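
The "strip HTML tags ... and produce a TMX" steps could look roughly like the following in Python, using only the standard library. This is a sketch under assumptions, not the BAT/sed code mentioned above: downloading and alignment are left out, the aligned pairs are taken as given, and the names `strip_html` and `build_tmx` as well as the header values are made up for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def strip_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

def build_tmx(pairs, src_lang, tgt_lang):
    """Serialize aligned (source, target) segments as a TMX 1.4 string."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "sketch", "creationtoolversion": "0.1",  # placeholder values
        "segtype": "sentence", "o-tmf": "plain", "adminlang": "en",
        "srclang": src_lang, "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")

page = "<html><head><style>p{}</style></head><body><p>Die <b>Pumpe</b></p><script>x()</script></body></html>"
text = strip_html(page)
tmx_text = build_tmx([(text, "The pump")], "de", "en")
print(tmx_text)
```

Joining text fragments with spaces is a simplification; a real preprocessor would need to respect block-level tags and segment the text into sentences before alignment.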

Some related projects:
http://www.statmt.org/europarl/ - contains some tools including a configurable sentence segmenter.
http://langtech.jrc.it/Documents/JRC-Workshop_2005-09/2005_JRC-Workshop_Varga.pdf - this is a writeup of the whole toolchain/project that Hunalign is part of. A must read in your situation, I'd say. If and when the mokk.bme.hu site comes back online, you can find more info and the actual scripts there. They are all free and open source.


BTW, I'd probably use wget, not HTTrack for the downloads.

[Edited at 2010-06-14 09:24 GMT]


 

John Moran  Identity Verified
Ireland
Local time: 22:55
Member (2004)
German to English
+ ...
TOPIC STARTER
Thanks and a link Jun 23, 2010

Hi all,

Thanks for the valuable information, in particular FarkasAndras.

Sometimes these things are on your own doorstep.

http://bitextor.sourceforge.net is pretty much what I was looking for. It was developed by Mikel Forcada and he wrote a really good paper on it.

http://www.dlsi.ua.es/~mlf/docum/esplagomis10j.pdf

Cheers,

John


 

FarkasAndras
Local time: 23:55
English to Hungarian
+ ...
Thank you for checking back Jun 23, 2010

Nice to hear back from you and thank you for sharing your find.
I skimmed the article, it looks great.
I'll check out both BootCaT and bitextor.
I just checked the wiki page and found out that bitextor can learn how to identify new languages from a corpus, which is more than I was hoping for.
This project has everything you need to set up your own little spider, crawling the web for multilingual content. If you could, say, do a google search to obtain the URLs of a couple of hundred of pages on an area or topic of interest and then have bitextor download and process each page... then you could essentially just provide some keywords, hit Enter and then go and grab some lunch and have a pretty decent TMX ready for you when you come back an hour and a half later.
It looks like the first part (generating a list of URLs that contain a lot of high quality text based on a few keywords) is something BootCaT is good at. If you could somehow get it to focus on multilingual sites, maybe by mixing languages among the keywords or by running the query in both languages separately and looking for sites that pop up in both lists, you could get a really good website list to pass on to bitextor.
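
The "sites that pop up in both lists" idea boils down to intersecting the host names of two result lists. A minimal Python sketch, assuming the two URL lists have already been obtained somehow (the URLs below are made up, and the function names are hypothetical):

```python
from urllib.parse import urlparse

def site_of(url):
    """Reduce a URL to its host name, ignoring a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def multilingual_candidates(results_a, results_b):
    """Return the sites that appear in both result lists."""
    return sorted({site_of(u) for u in results_a} & {site_of(u) for u in results_b})

# Hypothetical results for the same topic queried in German and in English.
results_de = ["http://www.example.com/de/pumpen", "http://example.org/ventile"]
results_en = ["http://example.com/en/pumps", "http://example.net/valves"]

candidates = multilingual_candidates(results_de, results_en)
print(candidates)
```

A site showing up in both lists is only a hint that it is multilingual; the pages would still have to be checked for actual translations.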

I also really hope I can get the URL list or raw data out of these things. I'm not a fan of take-it-or-leave-it autoaligned, pre-pruned TMs, at least not in every case.

[Edited at 2010-06-23 21:12 GMT]


 

