Still these big databases...
Thread poster: MikeTrans

MikeTrans
Germany
Local time: 13:02
Italian to German
+ ...
Jul 7, 2012

Hello,

This post is not meant as a rant or as a product evaluation. Rather, I'd very much like to see some of CafeTran's flaws fixed soon, especially since it performs very well in areas where other tools just don't help.

Some introduction:
-------------------------------------------------------

While I do have lots of small memories with about 20,000 segments, I very rarely use these on their own.
Usually I use:

- 1 small memory + 350,000 segments from a big memory (read-only) or:
- 1 small memory + 200,000+ segments extracted from a big one containing 1,800,000+ segments

Just extracting what you need from a big memory is very time-consuming, given that the segments are not ordered by topic but by the natural occurrence of the documents processed, so one document may deal with fishing treaties and the next with a regulation in the car industry. It is also very difficult to determine what exactly is in big TMs like DGT, EuroParl etc. I have given up trying; I have no time for this.

Imperfect solution:
---------------------------------------------------

A relatively fast way to extract useful content from a big database, so that CafeTran loads and works faster, is the following:
Use an extraction tool (ExtPhr32, Synchroterm, Multiterm Extract etc.) to extract all single words from the project to be translated, skipping common words. Put these expressions in a list, strip out whatever you think is still too common, then use search/replace to turn every [carriage return] into [space]or[space].

Your list should look like: Expression_1 or Expression_2 or Expression_3 or ...
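
The search/replace step can also be scripted. Here is a minimal sketch in Python, assuming the extracted terms arrive as text with one term per line. The function name `build_or_query` and the `too_common` stop list are made up for illustration and are not part of any of the tools mentioned:

```python
def build_or_query(raw_terms, too_common=()):
    """Join one-term-per-line text into 'term1 or term2 or ...'."""
    skip = {word.lower() for word in too_common}
    seen = set()
    terms = []
    # splitlines() handles both Windows (\r\n) and Unix (\n) line ends
    for line in raw_terms.splitlines():
        term = line.strip()
        key = term.lower()
        # Skip blank lines, stop-listed words and case-insensitive repeats
        if not term or key in skip or key in seen:
            continue
        seen.add(key)
        terms.append(term)
    return " or ".join(terms)


if __name__ == "__main__":
    sample = "Fischereiabkommen\r\nVerordnung\r\n\r\nund\r\nverordnung\r\n"
    print(build_or_query(sample, too_common=["und"]))
```

The resulting string can then be pasted into the search box as described next.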

Now open XBench, load your big TMs and paste your list into the Search box. Don't be afraid: this works, even with 1,000 terms to check. Tick "Words only" and "Show all matches", then run a Power Search.
Next, export all "displayed only" entries to TMX.

This is the fastest way I have found to give CafeTran a 'smaller' TM, but it still takes too long for my taste (about 10 minutes, in which you could have translated x words...), and what's more, the resulting TMX files are still very large.
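
For anyone who prefers a script to a manual export, the same kind of filtering can be sketched in Python. This is not how XBench works internally; it is only an illustration that assumes a standard TMX layout (`<body>` containing `<tu>` units with `<seg>` text). The function name `filter_tmx` is hypothetical. It also parses the whole file in memory, so for TMs with over a million segments a streaming parser such as `ET.iterparse` would probably be the better fit.

```python
import re
import xml.etree.ElementTree as ET


def filter_tmx(tmx_xml, terms):
    """Return a TMX string keeping only units that contain a term as a whole word."""
    if not terms:
        raise ValueError("at least one search term is required")
    # One alternation of all terms, matched case-insensitively as whole words
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    root = ET.fromstring(tmx_xml)
    body = root.find("body")
    for tu in list(body.findall("tu")):
        # itertext() also collects text inside inline tags within <seg>
        text = " ".join("".join(seg.itertext()) for seg in tu.iter("seg"))
        if not pattern.search(text):
            body.remove(tu)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        '<tmx version="1.4"><header/><body>'
        '<tu><tuv xml:lang="en"><seg>Fishing treaty text</seg></tuv>'
        '<tuv xml:lang="de"><seg>Fischereiabkommen</seg></tuv></tu>'
        '<tu><tuv xml:lang="en"><seg>Car industry regulation</seg></tuv>'
        '<tuv xml:lang="de"><seg>Verordnung</seg></tuv></tu>'
        "</body></tmx>"
    )
    print(filter_tmx(sample, ["treaty"]))
```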

Currently, as soon as I receive a project, I start CafeTran pretranslating with a big memory while I evaluate the project and do the related admin work. That puts CafeTran about 20 minutes ahead of me, but I generally catch up quickly, and then there is "dead time": waiting for it to search and display the segment results.

3 Questions:
------------------------------------------------

When pretranslating with a big TM, CafeTran will cache its results somewhere. Can these results be saved somehow, so that if I save and close the program they can be reused when I open CafeTran again?

Can I tell CafeTran to pretranslate from segment no. x to the end, rather than from the beginning?

If I upgrade my RAM from 4 to 8 GB, will searches in big databases become faster, or does that depend purely on the processor type (and not on the RAM, which would make an upgrade futile)? I'm not interested in making my TMs *load* faster; I want the *searches* to be faster.

--------------------------------------------------------------
CafeTran is a very useful CAT tool with features well ahead of some leading products, but it should also handle big databases reasonably fast, which is currently not the case on a mid-range computer (1-year-old laptop, Win7 x64, Intel i5 processor and 4 GB RAM).

Once this problem is solved, it will be an investment in the future for any translator whose databases keep growing and growing...

Greets,
Mike




[Edited at 2012-07-07 13:44 GMT]


 

Meta Arkadia
Local time: 18:02
English to Indonesian
+ ...
Corrupt databases? Jul 7, 2012

Hi Mike,

My procedure is almost exactly the same as yours, or at least it was until recently. Since the MTs became available, I rely more on them than on my Big Mama, and quite often nowadays I don't even load it. This depends on whether it's an ongoing job or not. If it's not, I use the MTs as my Big Mama, and add the finished segments to my Big Mama later, in case there's a follow-up project.

On my blog, I mentioned a few solutions for slow databases. I wouldn't be surprised if "corruption" is the major culprit. My big databases have been used in several CAT tools, and I added several TMs from clients to them. No wonder they are slow. So now I only use my Big Mama if working on a follow-up project, or if I think "perfect" matches are likely to occur. I then set the Big Mama on Read Only and Pretranslate. You don't have to give the pretranslation a head-start, well, not more than a few minutes, according to Igor. If you do need more, I suppose there really is something the matter with the database.

I don't extract terms before I start translating because I consider it to be a waste of time. I add terms on the go, and sometimes check the frequency list to see if it's useful to add a particular term/phrase or not.

I hope Igor or somebody else will answer your first two questions. I use a 27" iMac with a 3 GHz dual-core processor and 12 GB of RAM, 4 GB of which are assigned to Java/CT. In the case of RAM, I think "more is better."

Cheers,

Hans


 

MikeTrans
Germany
Local time: 13:02
Italian to German
+ ...
TOPIC STARTER
No corrupt bases Jul 7, 2012

Hi Hans,

thanks for replying so promptly.
I'm afraid the processing speed has nothing to do with my TMX bases being corrupt: I have worked hard to compile them, removing duplicates, numbers-only segments etc. and eliminating any superfluous attribute fields (actually, there are no fields at all), so their size is optimal, and I only use them read-only to avoid any corruption.
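
To make that kind of cleanup concrete, here is a rough Python sketch. It is only an illustration under assumptions, not the exact procedure used on these bases: it works on plain (source, target) pairs rather than on a TMX file, and the name `clean_pairs` and the numbers-only pattern are made up for the example.

```python
import re

# Matches segments made only of digits, whitespace and common number punctuation
NUMBERS_ONLY = re.compile(r"[\d\s.,:;/%+\-()]+")


def clean_pairs(pairs):
    """Drop duplicate and numbers-only (source, target) pairs, keeping order."""
    seen = set()
    cleaned = []
    for source, target in pairs:
        source, target = source.strip(), target.strip()
        # Empty or numbers-only source segments carry no reusable text
        if not source or NUMBERS_ONLY.fullmatch(source):
            continue
        key = (source, target)
        if key in seen:
            continue  # exact duplicate after trimming whitespace
        seen.add(key)
        cleaned.append(key)
    return cleaned


if __name__ == "__main__":
    pairs = [
        ("Hello", "Hallo"),
        ("12.5 %", "12,5 %"),
        ("Hello", "Hallo"),
        ("Article 3", "Artikel 3"),
    ]
    print(clean_pairs(pairs))
```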

When I'm saying CafeTran is 'slow' in handling large bases, I *only* mean the situation where the sub-segment part of the search is turned on.
Fuzzy search is very fast and helpful, the same is true for searches with CTRL-F.

Things that I did in CafeTran to speed up sub-segment searches also include:

In Options > Memory, I've changed "Subsegment look-up limit" from 15 to 6.
The result: this does not reduce the number of sub-segments, but it does reduce the number of suggested context segments (the segments shown as numbers when you mouse over them).
The search speed increases somewhat.
I've also changed the "Fuzzy match threshold" from 33 to 25, but maybe that was the mistake. I will set it to 75; I shouldn't lose any matches because there are still the sub-segments.

I will also experiment with "Subsegment to Virtual threshold" and "Function words threshold". I've read the description in the Help, but I have no idea what all this means. They could be helpful too.

It's clear that my computer system is no match for yours. On your system, I imagine your searches are almost 'instant', and you can even load your big TMs as "Automatic"?

Cheers,
Mike


 

Igor Kmitowski  Identity Verified
Poland
Local time: 13:02
Member (2016)
English to Polish
+ ...
Big Memory Acceleration Jul 9, 2012

Hi Mike,

Please try the following tips for handling huge TMs:

1. Pretranslate project segments.

The pretranslation mechanism in CafeTran works in the background, so while the computer is doing its job you can translate at your own human pace.

2. Turn off fuzzy subsegment matching.

Please note that with this option off, CafeTran still finds exact matches for subsegments along with fuzzy matches for full segments. Fuzzy subsegment matching requires some extra processing time because it not only looks for context but also tries to infer the meaning from the given context.

3. Reduce subsegment look-up limit.

If you wish to keep fuzzy subsegment matching, you might try changing this value (the default is a maximum of 15 segments to find). If you reduce it to, say, 7, CafeTran will stop searching after the 7th match found. However, the more fuzzy subsegment matches it has, the more accurate the translation guess will be.

4. Upgrade your hardware.

Naturally, more RAM and a faster processor help with handling such huge memories, which in business-grade solutions need almost server-type hardware to be handled smoothly. You don't have to buy a server, but 4 GB+ of RAM is a must, in my opinion.
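
Why a lower look-up limit (tip 3) saves time can be pictured with a small Python sketch. This is only an illustration of the early-stop idea, not CafeTran's actual matching code: the function name `find_subsegment_matches` is hypothetical, and a plain substring test stands in for real subsegment matching.

```python
def find_subsegment_matches(memory, phrase, limit=15):
    """Collect segments containing `phrase`, stopping at `limit` matches."""
    matches = []
    needle = phrase.lower()
    for source, target in memory:
        if needle in source.lower():
            matches.append((source, target))
            if len(matches) >= limit:
                break  # the rest of the memory is never scanned
    return matches


if __name__ == "__main__":
    memory = [("fishing treaty %d" % i, "Fischereiabkommen %d" % i) for i in range(100)]
    print(len(find_subsegment_matches(memory, "treaty", limit=7)))
```

With a frequent phrase the loop ends early, so a smaller limit means less of the memory is scanned. With a rare phrase the whole memory is still searched, which is one reason the speed gain is only partial.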

For the next CT update, I will add the option "Pretranslate from the current segment" to skip pretranslating segments that are already translated.

I hope the acceleration tips will help.

Igor



 

MikeTrans
Germany
Local time: 13:02
Italian to German
+ ...
TOPIC STARTER
@Igor, Jul 9, 2012

Hello Igor,
thank you for your feedback.

I think your point no. 1 is the most convenient for me: if I organize my work carefully while CT is doing its pretranslation, the processing speed will not be an issue. Most of the time, giving CT a small head start yields instant results for most of the first segments you translate.

Igor Kmitowski wrote:

2. Turn off fuzzy subsegment matching.


How do you do this?


Currently, I'm trying another trick: I copy my pesky big TM under another name and load the copy in Fuzzy mode only, while I load the original in Fuzzy+subsegment mode. That way I can still search manually in one of them without disturbing CT while it does its pretranslation. He, he...

RAM increase:
In this specific case, for processing search results from large TMs, which do you think matters more: a faster processor or an upgrade from 4 to 8 GB RAM?
I'm certain that TMs will load faster with 8 GB than with 4, but will CT's searching also get faster? Please let me know.

Greets,
Mike


[Edited at 2012-07-09 11:15 GMT]


 

Igor Kmitowski  Identity Verified
Poland
Local time: 13:02
Member (2016)
English to Polish
+ ...
RAM Increase Jul 9, 2012

To turn off fuzzy subsegment matching, choose Fuzzy only instead of Fuzzy & Subsegments in the Matching type options.

I would go for more RAM first, to increase both loading and searching speed. If possible, try CT on another computer (e.g. a friend's) with a faster processor before you decide to upgrade your hardware. I am almost certain that both more RAM and a better chip will give a noticeable increase in speed. However, give it a try before you make any hardware upgrade decision.

Cheers,
Igor



 

