Terminology Research tools for translators
Thread poster: John Moran

John Moran  Identity Verified
Ireland
Local time: 07:48
Member (2004)
German to English
+ ...
Jun 11, 2010

Hi all,

I have a question regarding tools for doing terminology research. We are thinking of embarking on an academic project to develop a software tool which would be of use to professional translators. I am a translator/PM myself, and both ideas are for tools that I would find useful in my own work.

Before starting out we need to do some research to establish the novelty of our work, i.e. that it has not been done before. On that note, here are our two ideas; if anyone has any pointers to software which is the same or even similar, I would be very grateful.

1) Tool to download a mid-size website (not www.microsoft.com) and create a corpus of parallel texts which can be searched using a concordancer.

The motivation for this tool was that many smaller companies do not seem to have a consistent way of putting their documents on their website in different languages.
We know of tools to download the website (HTTrack, for example), and the alignment will be done using known algorithms, so I guess the question here is: Does anyone know of a tool for translators that does both together?

2) Tool to narrow down Google search results based on "contextual clues" in the document currently being translated.

This is a harder problem. The idea is to be able to trigger a Google search from a CAT tool. The tool would then access the pages returned in the first results (say 20) and list them in order of their "similarity" to the document being translated. The hard part will be judging similarity, of course.
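
To make the "similarity" ranking concrete, here is a minimal sketch, assuming the result pages have already been downloaded and reduced to plain text. It scores each page by the cosine similarity of raw word counts against the document being translated. The function names (term_counts, cosine_similarity, rank_pages) are made up for illustration, and a real tool would probably need better weighting (TF-IDF, stopword removal) than this.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two Counter term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_pages(document, pages):
    """Rank pages by similarity to the document.

    `pages` is assumed to be a dict mapping URL -> plain text.
    Returns (score, url) pairs, most similar first.
    """
    doc_vec = term_counts(document)
    scored = [
        (cosine_similarity(doc_vec, term_counts(text)), url)
        for url, text in pages.items()
    ]
    return sorted(scored, reverse=True)
```

Calling rank_pages(source_text, {"http://...": page_text, ...}) would give the list to show in the CAT tool, with the best candidates at the top.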

In this case I would be grateful for pointers to any software that is even vaguely similar.

Finally, if anyone has any ideas for terminology software they would like to see, please feel free to reply to this post (or, if you prefer confidentiality, you can find my email address on http://www.scss.tcd.ie/~moranj3/).

Thanks!



John Moran  Identity Verified
Ireland
Local time: 07:48
Member (2004)
German to English
+ ...
TOPIC STARTER
Two tools Jun 11, 2010

By the way, I should say we are aware of Babylon and Linguee (http://www.linguee.com), which I really like.


GerSi  Identity Verified
Slovenia
Local time: 08:48
Member (2010)
German to Slovenian
+ ...
some tools I heard of Jun 14, 2010

It's been a while since I attended a conference about building corpora, but I remember they were talking about BootCaT in combination with WordSmith.

I don't know if they are any better but they are being used at universities.


FarkasAndras
Local time: 08:48
English to Hungarian
+ ...
Various related projects Jun 14, 2010

John Moran wrote:
1) Tool to download a mid-size website (not www.microsoft.com) and create a corpus of parallel texts which can be searched using a concordancer.

The motivation for this tool was that many smaller companies do not seem to have a consistent way of putting their documents on their website in different languages.
We know of tools to download the website (HTTrack, for example), and the alignment will be done using known algorithms, so I guess the question here is: Does anyone know of a tool for translators that does both together?

BootCaT looks a lot like what you're trying to do. I'll check it out myself; here are some thoughts in the meantime.

This is something many people are doing, but they all have their own proprietary solutions they are not too keen on sharing with us... It'd be nice if someone developed an easy-to-use, customizable, free and open source solution.
Ideally, the program would include: 1) a language recognition component that identifies what language each text is in, then 2) a component that tells if two given texts are each other's translations, and then 3) an aligner. 2 is the most difficult task by far, I'd think.
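
The first component could start out as something as crude as counting stopwords. The sketch below is only an illustration: the tiny STOPWORDS lists and the guess_language function are assumptions made up for this example, not part of any tool mentioned in this thread, and a serious identifier would more likely rely on character n-gram statistics.

```python
import re

# Hypothetical, deliberately tiny stopword lists; a real tool would need
# far larger lists (or n-gram models) and many more languages.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "for"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "mit", "für"},
}


def guess_language(text):
    """Return the language code whose stopwords occur most often in the text.

    Returns None when no stopword from any list is found.
    """
    tokens = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Running this over every downloaded page would at least split the crawl into per-language piles before the harder job of deciding which pages are translations of each other.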

Google and all the MT developers do this, as does Translated: http://mymemory.translated.net/.

Here is yet another project, possibly the most interesting one for you:

http://www.umiacs.umd.edu/~resnik/strand/

http://www.proz.com/forum/internet_for_translators/163544-what_is_strand_structural_translation_recognition_for_acquiring_natural_data_.html

I never checked if they released the software itself or just the URL lists it generated.


For the aligner, check out Hunalign here:
mokk.bme.hu/resources/hunalign (the site seems to be down now and I can't even get at the version cached by Google, hopefully it'll be back up soon)
I'm quite confident in saying that it's the best open source aligner there is. It's an improved version of the Gale-Church algorithm. Unless you've already developed some amazing algorithm, it'd be silly not to use Hunalign.
I wrote a preprocessor/postprocessor/frontend for Hunalign, see here: http://sourceforge.net/projects/aligner/
BTW it's pretty heavy-duty stuff; it can easily handle hundreds of thousands of segments in one go.
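
For anyone unfamiliar with length-based alignment, the rough idea can be sketched as follows. This is a simplified illustration, not Hunalign and not the actual Gale-Church cost function: it runs dynamic programming over character lengths with hand-picked penalties, where Gale-Church uses a probabilistic cost. The BEADS table and the align_by_length function are assumptions made up for this example.

```python
# Allowed alignment "beads": (source sentences, target sentences) consumed,
# with a hand-picked penalty so that 1-1 beads are preferred.
BEADS = {(1, 1): 0.0, (2, 1): 1.0, (1, 2): 1.0, (1, 0): 3.0, (0, 1): 3.0}


def align_by_length(src, tgt):
    """Align two sentence lists using character lengths only.

    Returns a list of (source_indices, target_indices) pairs.
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_s = sum(len(s) for s in src[i:ni])
                len_t = sum(len(t) for t in tgt[j:nj])
                # Relative length difference, between 0 and 1.
                mismatch = abs(len_s - len_t) / max(len_s, len_t, 1)
                candidate = cost[i][j] + penalty + mismatch
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)
    # Walk back from the end to recover the cheapest path.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Hunalign goes well beyond this by also using dictionary information, which is a large part of why it copes with real-world texts so much better than a pure length-based approach.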

As long as your project is free and open source, you're more than welcome to use my ideas/code. It's a Windows BAT file, which is surely not what you'll use, but most of what it does is sed commands, which you could easily use on whatever your platform is.
In fact, the two projects could be integrated very easily. If your software produces a list of URL pairs (i.e. URLs of matching webpages), I could very easily add a line or two to my aligner code which would allow it to download the pages identified by your software as matching, strip HTML tags, align them and produce a TMX. I'm planning to port the whole thing to Perl soon, which would make it more platform-independent.
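
The tag-stripping and TMX-writing steps mentioned above might look roughly like this in Python. It is a sketch using only the standard library, not the code of the aligner front end linked above: the names TextExtractor, strip_html and build_tmx are hypothetical, the header values are placeholders, and the segment pairs are assumed to come from an aligner.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text chunks, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def strip_html(html):
    """Return the text chunks of an HTML page as a list of strings."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.parts


def build_tmx(pairs, src_lang, tgt_lang):
    """Serialize aligned (source, target) segment pairs as a TMX string."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "sketch",        # placeholder value
        "creationtoolversion": "0",      # placeholder value
        "segtype": "sentence",
        "o-tmf": "none",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

The text chunks from strip_html would still need sentence segmentation before alignment; build_tmx only covers the very last step.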

Some related projects:
http://www.statmt.org/europarl/ - contains some tools including a configurable sentence segmenter.
http://langtech.jrc.it/Documents/JRC-Workshop_2005-09/2005_JRC-Workshop_Varga.pdf - this is a writeup of the whole toolchain/project that Hunalign is part of. A must read in your situation, I'd say. If and when the mokk.bme.hu site comes back online, you can find more info and the actual scripts there. They are all free and open source.


BTW, I'd probably use wget, not HTTrack for the downloads.

[Edited at 2010-06-14 09:24 GMT]



John Moran  Identity Verified
Ireland
Local time: 07:48
Member (2004)
German to English
+ ...
TOPIC STARTER
Thanks and a link Jun 23, 2010

Hi all,

Thanks for the valuable information, in particular to FarkasAndras.

Sometimes these things are on your own doorstep.

http://bitextor.sourceforge.net is pretty much what I was looking for. It was developed by Mikel Forcada and he wrote a really good paper on it.

http://www.dlsi.ua.es/~mlf/docum/esplagomis10j.pdf

Cheers,

John


FarkasAndras
Local time: 08:48
English to Hungarian
+ ...
Thank you for checking back Jun 23, 2010

Nice to hear back from you and thank you for sharing your find.
I skimmed the article; it looks great.
I'll check out both BootCaT and bitextor.
I just checked the wiki page and found out that bitextor can learn how to identify new languages from a corpus, which is more than I was hoping for.
This project has everything you need to set up your own little spider, crawling the web for multilingual content. If you could, say, do a Google search to obtain the URLs of a couple of hundred pages on an area or topic of interest and then have bitextor download and process each page... then you could essentially just provide some keywords, hit Enter, go and grab some lunch, and have a pretty decent TMX ready for you when you come back an hour and a half later.
It looks like the first part (generating a list of URLs that contain a lot of high quality text based on a few keywords) is something BootCaT is good at. If you could somehow get it to focus on multilingual sites, maybe by mixing languages among the keywords or by running the query in both languages separately and looking for sites that pop up in both lists, you could get a really good website list to pass on to bitextor.
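
The "sites that pop up in both lists" idea boils down to a set intersection on host names. A minimal sketch, assuming you already have the two URL lists from the separate queries (sites_in_both is a hypothetical helper, not a BootCaT or bitextor function):

```python
from urllib.parse import urlparse


def sites_in_both(urls_lang_a, urls_lang_b):
    """Return the sorted host names that appear in both URL lists."""
    hosts_a = {urlparse(url).netloc.lower() for url in urls_lang_a}
    hosts_b = {urlparse(url).netloc.lower() for url in urls_lang_b}
    return sorted(hosts_a & hosts_b)
```

This would miss companies that host each language on a separate domain, so it is only a first filter.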

I also really hope I can get the URL list or raw data out of these things. I'm not a fan of take-it-or-leave-it autoaligned, pre-pruned TMs, at least not in every case.

[Edited at 2010-06-23 21:12 GMT]



