What max size of TMX corpus to run term extract with Multiterm Extract
Thread poster: Happylion
Jan 6, 2010

Hi all,
I'd like to run 'massive' term extraction on huge TMX files (between 80,000 and 150,000 TUs each, weighing roughly 50 to 85 MB) using Multiterm Extract 2007.
The software seems to stop working after about 29% of the extraction pass on such big TMXs.
I tried several times with different TMXs of around the same size, and the progress always stalls at the same stage.
Can anyone advise on the maximum recommended size of a TMX/corpus/bilingual file for term extraction (TE)?
Or suggest a different format that is more likely to let TE run to the end?
Or suggest a tool to split a TMX so that TE can be run on the smaller files?
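
On the splitting question, here is a minimal sketch of one way to cut a TMX into smaller files with Python's standard library. It is only an illustration: the function name `split_tmx`, the default chunk size of 20,000 TUs and the output file naming are assumptions made for this example, not anything Multiterm Extract requires. It also assumes the TMX has the usual `<tmx><header/><body><tu>...` layout without an XML namespace, loads the whole file into memory, and does not copy the DOCTYPE line into the output files.

```python
import copy
import os
import xml.etree.ElementTree as ET


def split_tmx(src_path, out_dir, tus_per_file=20000):
    """Split one TMX file into several smaller TMX files.

    Hypothetical helper: returns the list of file paths written.
    """
    tree = ET.parse(src_path)  # reads the whole file into memory
    root = tree.getroot()
    header = root.find("header")
    body = root.find("body")
    if body is None:
        raise ValueError("no <body> element found; is this a TMX file?")
    units = body.findall("tu")

    os.makedirs(out_dir, exist_ok=True)
    stem = os.path.splitext(os.path.basename(src_path))[0]
    written = []
    for start in range(0, len(units), tus_per_file):
        # Rebuild a small TMX skeleton: same root attributes, same header.
        new_root = ET.Element(root.tag, dict(root.attrib))
        if header is not None:
            new_root.append(copy.deepcopy(header))
        new_body = ET.SubElement(new_root, "body")
        new_body.extend(units[start:start + tus_per_file])

        part = start // tus_per_file + 1
        out_path = os.path.join(out_dir, f"{stem}_part{part:03d}.tmx")
        ET.ElementTree(new_root).write(
            out_path, encoding="utf-8", xml_declaration=True
        )
        written.append(out_path)
    return written
```

Calling `split_tmx("corpus.tmx", "parts", tus_per_file=20000)` on a 150,000-TU file would write eight files into `parts`. Whether chunks of that size get through the extraction pass is something to find by trial; the thread gives no confirmed limit.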

Note that I can afford to wait several hours for a long TE process to complete, but I don't want to wait 24 hours only to find that the process still has not finished...

Any suggestions most welcome!




Yasmina Ait Ali
English to French
Hi Vince Oct 24, 2012

I have the same problem. Did you or somebody else find a solution for this problem?

Thanks in advance.


Nope, no solution Oct 24, 2012

Hi Yasmina,
I'm very surprised to receive a follow-up to such an old post. To be honest, I did not even remember posting this question.
Unfortunately, I never got an answer from anyone.
So instead of automating term harvesting from a big corpus, I manually extracted a few hundred terms.
This is easy to do once you have turned the TMX into an XLS file. First delete the very short segments that contain no valuable information, then sort the rest in alphabetical order and go through the list to pick out the terms by hand.
This may take a few hours, but revising the output of automatic term extraction tools takes a while too. And with manual extraction you can be confident that every term you pick is relevant.
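
For anyone who wants to script the preparation step described above, here is a rough sketch in Python. Several things in it are assumptions for illustration only: the helper names `tmx_to_rows` and `write_csv`, the 4-character cut-off for "very short" segments, and the idea that the first `<tuv>` in each TU is the source and the second is the target. It writes a CSV file rather than a true XLS file, since that needs only the standard library and opens directly in Excel, and it drops exact duplicate pairs, which goes slightly beyond the steps described in the post.

```python
import csv
import xml.etree.ElementTree as ET


def tmx_to_rows(tmx_path, min_chars=4):
    """Return sorted (source, target) pairs, skipping very short sources.

    Assumes the first <tuv> of each <tu> is the source language
    and the second is the target language.
    """
    rows = set()  # a set removes exact duplicate pairs
    for tu in ET.parse(tmx_path).getroot().iter("tu"):
        segs = []
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            # itertext() also collects text nested inside inline tags
            text = "".join(seg.itertext()).strip() if seg is not None else ""
            segs.append(text)
        if len(segs) >= 2 and len(segs[0]) >= min_chars:
            rows.add((segs[0], segs[1]))
    # Case-insensitive alphabetical order on the source segment
    return sorted(rows, key=lambda r: (r[0].lower(), r))


def write_csv(rows, csv_path):
    """Write the pairs to a CSV file that Excel can open."""
    # utf-8-sig adds a BOM so Excel detects the encoding correctly
    with open(csv_path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["source", "target"])
        writer.writerows(rows)
```

A typical call would be `write_csv(tmx_to_rows("corpus.tmx"), "corpus.csv")`. The resulting sheet is only the sorted, filtered list; picking the terms out of it is still the manual part.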

