Any bilingual (French-English) corpus extractor recommended?
Thread poster: BOLDXPRESS
BOLDXPRESS
Canada
Local time: 02:24
English to French
+ ...
Mar 25, 2016

Dear forum members,

I am wondering if someone can recommend a website or software that can help me build a corpus, ideally a bilingual corpus in a specific domain (astronomy, law, medical).
For example, I want to have around 15,000 words (French and English) in context.

Thanks for your help



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 07:24
Member (2009)
Dutch to English
+ ...
You might try these two: Mar 25, 2016

1. https://www.sketchengine.co.uk/
2. http://www.farkastranslations.com/eu_translation_memories.php

Michael


FarkasAndras
Local time: 08:24
English to Hungarian
+ ...
Custom job Mar 25, 2016

I presume you want a sentence-aligned corpus (each FR sentence paired with the corresponding EN sentence).
Do you have the texts? It sounds like you don't, so you'll either need to find a preexisting aligned corpus/TM/database or somebody who will collect texts that meet your criteria and align them for you.
The best collection of free preexisting aligned corpora is this: http://opus.lingfil.uu.se/

If you can't find preexisting aligned corpora, you need an aligner to make your own. There are many out there; search the forums. I wrote one of them, called LF Aligner. Of course you need to find and collect the texts first, which could require other software. Then, after alignment, you might want to filter out certain low-quality segments, and you probably want to do some manual checks and corrections to make sure everything is good, fix errors and possibly redo parts of the alignment. 15,000 words is not a lot, so it's not that big a job, but it could take a while to work out the process itself if you want to do it yourself. It all depends on whether you have the texts, how picky you are about what texts will work and how high quality you need the final corpus to be. If your input texts are crap and you need perfect results, you will tear your hair out.
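
To illustrate the "filter out low-quality segments" step, here is a minimal Python sketch. It is not how any particular aligner works; it assumes the aligned corpus is already loaded as a list of (French, English) sentence pairs, and the function name `filter_pairs` and the ratio thresholds are made up for the example. The heuristic simply drops pairs where one side is empty or where the two sides differ wildly in length, which often signals a misalignment.

```python
def filter_pairs(pairs, min_ratio=0.5, max_ratio=2.0):
    """Keep (fr, en) pairs whose character-length ratio looks plausible.

    The thresholds are illustrative assumptions, not tuned values.
    """
    kept = []
    for fr, en in pairs:
        fr, en = fr.strip(), en.strip()
        if not fr or not en:
            continue  # drop pairs with an empty side
        ratio = len(fr) / len(en)
        if min_ratio <= ratio <= max_ratio:
            kept.append((fr, en))
    return kept


# Toy data: one good pair, one suspicious pair, one orphan segment.
pairs = [
    ("Le contrat est résilié.", "The contract is terminated."),
    ("Oui.", "The parties agree that the agreement shall remain in force until further notice."),
    ("", "Orphan segment."),
]
clean = filter_pairs(pairs)
print(len(clean), "of", len(pairs), "pairs kept")
```

A filter like this only catches the obvious problems, so the manual checks are still needed afterwards.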

Regarding finding texts in your fields: law is easy if you're not extremely picky about specific types of law, medical a wee bit harder but still fairly easy if you're not too particular (see EMEA on OPUS). Astronomy is tougher. Not sure where one would find large amounts of EN-FR astronomy texts. Perhaps there are bilingual Canadian astronomy journals, but that's a long shot. It would be easy to collect sentences or documents that mention astronomy-related subjects by filtering them out of larger general collections, but that might not yield an awfully high quality corpus.
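
The idea of pulling domain sentences out of a larger general collection could look roughly like the Python sketch below. Everything in it is an assumption for illustration: the term list `ASTRO_TERMS` is a tiny hypothetical sample, the helper names are invented, and the corpus is assumed to be a list of (French, English) sentence pairs already in memory. It keeps pairs whose English side contains a domain term and stops once a word budget (15,000 here, as in the original question) would be exceeded.

```python
import re

# Hypothetical, very small term list; a real one would be much longer.
ASTRO_TERMS = {"galaxy", "galaxies", "telescope", "orbit", "comet", "nebula", "supernova"}


def mentions_domain(en_sentence, terms=ASTRO_TERMS):
    """Return True if the English sentence contains any domain term."""
    words = set(re.findall(r"[a-z]+", en_sentence.lower()))
    return bool(words & terms)


def select_domain_pairs(pairs, terms=ASTRO_TERMS, max_words=15000):
    """Collect matching (fr, en) pairs until the English word budget is reached."""
    selected, total = [], 0
    for fr, en in pairs:
        if not mentions_domain(en, terms):
            continue
        n = len(en.split())
        if total + n > max_words:
            break
        selected.append((fr, en))
        total += n
    return selected, total


pairs = [
    ("La comète est visible à l'œil nu.", "The comet is visible to the naked eye."),
    ("Le conseil a adopté le règlement.", "The Council adopted the regulation."),
    ("Le télescope observe des galaxies lointaines.", "The telescope observes distant galaxies."),
]
selected, total = select_domain_pairs(pairs)
print(len(selected), "pairs,", total, "English words")
```

Keyword matching of this kind will also pick up sentences that mention a term only in passing, which is one reason the resulting corpus may not be of very high quality.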


If you can't/don't want to do the legwork, I take jobs like this for a fee. Not sure if anyone else offers this kind of service, frankly. It's kind of a niche activity that I happen to have an interest in. You could call it a hobby that occasionally generates income. Michael linked my website above, some info is available there.

[Edited at 2016-03-25 10:30 GMT]


BOLDXPRESS
Canada
Local time: 02:24
English to French
+ ...
TOPIC STARTER
Thanks for your input Mar 26, 2016

Thanks Andras,


What I actually meant is a site where I can find a large collection of bilingual texts (corpora). My goal is to align them later, but I first need to find the collection of texts in a specialized domain. It doesn't really matter which domain it is.

Thanks a lot


FarkasAndras
Local time: 08:24
English to Hungarian
+ ...
Well... Mar 28, 2016

If you read back your first post, you'll see you asked for a "corpus extractor" and not for texts. If you expect to get help, it pays for you to be clear and comprehensible. Best of luck.

[Edited at 2016-03-29 07:28 GMT]



expressisverbis
Portugal
Local time: 07:24
Member (2015)
English to Portuguese
+ ...
Have you tried Tradooit.com? Mar 28, 2016

BOLDXPRESS wrote:

Thanks Andras,


What I actually meant is a site where I can find a large collection of bilingual texts (corpora). My goal is to align them later, but I first need to find the collection of texts in a specialized domain. It doesn't really matter which domain it is.

Thanks a lot


TradooIT is a computer-assisted translation suite that includes a translation memory, a terminology bank, a bilingual concordancer, a text alignment tool, a pretranslation tool and a Word add-in.
I find it very useful:
http://www.tradooit.com/index.php
You can select your files and align them using the "Import/Align your files" function.



Reed James
Chile
Local time: 03:24
Member (2005)
Spanish to English
Synchroterm Mar 29, 2016

You do have to pay some money for this software, but it takes care of a lot of actions that can be a pain with the other free term extraction tools I've seen. It depends on your priorities and how often you need to extract terms.
