How can I do concordance search in large TMs?
Thread poster: FarkasAndras

FarkasAndras
Local time: 05:24
English to Hungarian
+ ...
Jun 10, 2009

Specifically, the DGT-TM (Acquis TM). This thing is pretty darn big at a million or so TUs.

My Trados version (7.1) can't handle it reliably, so that's out. Sometimes it works, but it often gives a "Not enough storage space" error despite decent system resources (2 GB physical RAM, 4 GB virtual, 30 GB free HDD space).

I just tried Xbench, and it gave me an out-of-memory error as well, again despite seemingly adequate hardware resources. It seems like Xbench would be great for my purposes, but it choked even with a reduced (50%) TM.

If anyone has tips for making it work with either of these programs, or suggestions for other software that can handle fatties like the DGT-TM, please fire away. Maybe a free demo version of DVX would do the trick? Does it do concordance on large TMs?
Anything that can use TMX, Trados TM, Trados export txt or any other reasonable format will do. I just want concordance search, preferably with fuzzy search, searching multiple terms and showing the top X hits at the same time. Being able to search several TMs at a time would be a plus.



Andy Bell  Identity Verified
Local time: 11:24
Member (2002)
Norwegian to English
+ ...
Reference folder in Wordfast Jun 10, 2009

You might be able to download a demo version of Wordfast (Classic), cut and paste the entire TM into a Word document, and then save it in the reference folder (you'll need to download the manual). Then you could use the reference function rather than the concordance function in this instance. Alternatively, couldn't you simply use the Find function in Word, or in whatever file format you save the TM in? Not sure if this helps; this reply is a little hasty. N.B. If you want to use this as an actual TM, you could perhaps export it to Wordfast and split the TM into two or three. Maybe that would do it.
Best
Andy Bell

[Edited at 2009-06-10 15:26 GMT]



Boyan Brezinsky  Identity Verified
Bulgaria
Local time: 06:24
English to Bulgarian
+ ...
Use some search indexing tools Jun 10, 2009

My first suggestion was - a simple text editor. OK, maybe not as simple as Notepad, because it is very slow with such large files, but there are plenty of adequate editors and viewers.
Then I saw you'd like fuzzy search. For that, a document search utility such as Windows Search, Google Desktop or similar might help. I use Windows Search here and sometimes it seems to find word forms. I still haven't quite figured out how it works, though. Maybe Google Desktop will do better; I have never used it. But I seem to recall that in some previous version it indexed only the beginning of larger documents, so make sure to check for that.
And I just saw there exist some scripts that implement word form search for sites. Maybe one of them could be used for your purposes. But of course, this might be quite a serious hassle.



FarkasAndras
Local time: 05:24
English to Hungarian
+ ...
TOPIC STARTER
thanks for trying to help... Jun 10, 2009

Andy Bell wrote:

You might be able to download a demo version of Wordfast (Classic), cut and paste the entire TM into a Word document, and then save it in the reference folder (you'll need to download the manual). Then you could use the reference function rather than the concordance function in this instance. Alternatively, couldn't you simply use the Find function in Word, or in whatever file format you save the TM in? Not sure if this helps; this reply is a little hasty. N.B. If you want to use this as an actual TM, you could perhaps export it to Wordfast and split the TM into two or three. Maybe that would do it.
Best
Andy Bell

[Edited at 2009-06-10 15:26 GMT]


But I'm afraid Wordfast is woefully inadequate for the task. Last time I tried to use a TM that was mildly on the plump side it shrieked in despair and ran for cover. IIRC concordance was annoyingly slow even with just 80,000 TUs.
It's just nowhere near robust enough to handle the sort of bulk we're talking about here.

Word won't cut it either. It won't even open a file of this size. The thing is close to a GB and - I guess - a couple of hundred thousand pages.



FarkasAndras
Local time: 05:24
English to Hungarian
+ ...
TOPIC STARTER
Google Desktop Jun 10, 2009

Boyan Brezinsky wrote:

My first suggestion was - a simple text editor. OK, maybe not as simple as Notepad, because it is very slow with such large files, but there are plenty of adequate editors and viewers.
Maybe Google Desktop will do better, never used it. But I seem to recall that in some previous version it indexed only the beginning of larger documents, so make sure to check for that.


Well, text editors are out. Way out. I don't know of any that can handle the size, except for vi/vim. It's a primitive-looking command line deal that will handle just about anything. But once we're in command line territory, one might as well just use the grep command to extract the lines that contain the search term, without opening the original file... it would work and at least it would give all the hits in one go but it wouldn't be very convenient to use.
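
For what it's worth, the grep idea can also be sketched in a few lines of Python. This is only a rough sketch, assuming the TM has been exported to plain text with one TU per line; the function name `grep_lines` is made up for illustration and is not part of any tool mentioned in this thread.

```python
from typing import Iterable, Iterator


def grep_lines(lines: Iterable[str], term: str) -> Iterator[str]:
    """Yield every line that contains `term`, ignoring case.

    `lines` can be an open text file: it is consumed one line at a
    time, so memory use stays flat however large the export is.
    """
    needle = term.casefold()
    for line in lines:
        if needle in line.casefold():
            yield line.rstrip("\n")
```

You would pass it an open file, e.g. `grep_lines(open(path, encoding="utf-8"), "Court of Auditors")`, and print whatever comes back. Like grep, it returns all the hits in one go, but every search rereads the whole file and there is no ranking.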

I have and use Google Desktop for general searches and I have thought about using it for this... it's mighty fast and it has a very smart search, and I'm sure it wouldn't have any problem with the amount of data... but I have not heard of any convenient way of limiting Google Desktop searches to a particular folder. It would just spit out results from all over the place. On top of that, if the little section it gives on the results page doesn't contain the text in the other language, you're SOL... opening the file to look it up is not an option.

A friend of mine has a smart little desktop search tool that may be a good solution (can't remember the name off the top of my head). It searches in the files or folders you tell it to and spits out a results page with short sections of text with the search terms highlighted. Still, something that was actually designed for TMs would perhaps be a bit better, if only because it gives you the actual sentence pairs instead of just some random context that may or may not contain the target language term.



Samuel Murray  Identity Verified
Netherlands
Local time: 05:24
Member (2006)
English to Afrikaans
+ ...
Erm, split the file? Jun 10, 2009

FarkasAndras wrote:
Specifically, the DGT-TM (Acquis TM). This thing is pretty darn big at a million or so TUs.


1. In what format is this TM? Can you reduce the file size by simplifying the format (i.e. if it is TMX, can you turn it into CSV)?

2. Is it a single file or multiple files? Can't you split the file(s) into smaller files, which would be less memory intensive?

If you can do the above two things, you can use any desktop search program to index the files.

I just want concordance search, preferably with fuzzy search, searching multiple terms and showing the top X hits at the same time. Being able to search several TMs at a time would be a plus.


I know of no CAT tool that offers fuzzy concordance search... do you know of any?



FarkasAndras
Local time: 05:24
English to Hungarian
+ ...
TOPIC STARTER
responses Jun 10, 2009

Samuel Murray wrote:

FarkasAndras wrote:
Specifically, the DGT-TM (Acquis TM). This thing is pretty darn big at a million or so TUs.


1. In what format is this TM? Can you reduce the file size by simplifying the format (i.e. if it is TMX, can you turn it into CSV)?

2. Is it a single file or multiple files? Can't you split the file(s) into smaller files, which would be less memory intensive?

If you can do the above two things, you can use any desktop search program to index the files.

I just want concordance search, preferably with fuzzy search, searching multiple terms and showing the top X hits at the same time. Being able to search several TMs at a time would be a plus.


I know of no CAT tool that offers fuzzy concordance search... do you know of any?


As I stated in the first post, it's a single TMX. I could split it up of course. It comes in 12 blocks of raw data anyway, and you can compile TMs in any permutation of those, so it's trivial to make 12 component TMs. Well, that gives me an idea: maybe I'll try and feed it to Xbench in nice little spoonfuls. As we all know, Trados can't be bothered to search more than one TM (two, with limitations), but Xbench is not this picky. Maybe it will help, maybe it won't.

I don't want to use a desktop search program for a bunch of reasons, see previous post.

You got me on fuzzy... I think Trados does fuzzy concordance but I may be dead wrong. What really matters to me is that if I search for, say, a 10-word sentence, I want to get hits that contain just 7 (or any) of those words in whatever order. And order the TUs by how good the match is...
This is part of the reason why I don't want to use a text editor.
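
To make that requirement concrete, here is a minimal Python sketch of "N of the query words, in any order, best match first". It is not how Trados or any other CAT tool does it; the names `tokenize` and `rank_by_overlap` and the 0.7 threshold (7 words out of 10) are assumptions for illustration only.

```python
import re
from typing import Iterable, List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lower-case the text and return its distinct words."""
    return set(re.findall(r"\w+", text.casefold()))


def rank_by_overlap(
    query: str,
    segments: Iterable[str],
    min_ratio: float = 0.7,
    top: int = 10,
) -> List[Tuple[float, str]]:
    """Return up to `top` (score, segment) pairs, best match first.

    The score is the share of distinct query words found in the
    segment, regardless of word order.
    """
    query_words = tokenize(query)
    if not query_words:
        return []
    scored = []
    for segment in segments:
        score = len(query_words & tokenize(segment)) / len(query_words)
        if score >= min_ratio:
            scored.append((score, segment))
    # The sort is stable, so equally good hits keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top]
```

This scans every segment on every search, so on a million TUs it would be slow without some kind of index. It also ignores word forms, which matters a lot for a language like Hungarian.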


Come on, people, am I the only one who uses the DGT-TM for concordance? Am I the only one who does EU translation/interpreting? I'm sure someone has this worked out already.



Tangopeter

Local time: 05:24
English to Dutch
+ ...
My Trados had no problems with it... Jun 10, 2009

I did not run into any problems using the DGT-TMs, even with one as the main TM and another one as a reference TM. I use SDL Trados 2007 (8.3), but have used the DGT-TMs in Trados 7.5 as well. My hardware configuration is about the same as yours (2 GB of memory etc.).

I don't know how you proceeded, but here's what I did:
1. Downloaded TM-zips (volume_1 to volume_12.zip) + TMXtract.exe + swt-win32-3218.dll to the same folder.
2. Started TMXtract.exe, selected all 12 zips, chose Source and Target language, and created one target file (TMX) per language combination.
3. Started Trados WB with a new (empty) TM for the appropriate language combination, imported the TMX as version 1.1 (maybe that is important?)

Anyway, that worked like a charm.

Note: The export from TWB as txt is about 1/3 of the TMX size.

HTH



Daniel García
English to Spanish
+ ...
Smaller TMs by Subject Area Jun 10, 2009

What we have done is create smaller TMs for the subject area that we are interested in. People use these smaller TMs and then they have the large ones, if they need them.

You could try DTSearch or Funduc's search and replace but you might want to remove the inline formatting from the segments to improve the search results.

Daniel



FarkasAndras
Local time: 05:24
English to Hungarian
+ ...
TOPIC STARTER
versions Jun 10, 2009

Tangopeter wrote:

I did not run into any problems using the DGT-TMs


Well, I did : ) I imported the TM in the same way as you; there is not much one could do differently.
I have a different Trados version which has issues with giant files like this. Maybe I'll upgrade someday... but the fact that you haven't had issues doesn't mean no one ever will, even with the same Trados version. Sh... Stuff happens, and even more so when you're using SDL products.

Dgmaga: I'm intrigued. Do you mean you chopped up the DGT-TM? I can't see any text field that you could use to identify the different subject areas, only a document identifier. I think it uses CELEX numbers, and as far as I know CELEX numbers give no indication of subject area.

DTSearch might just be the neat little search tool I mentioned earlier... maybe an option. Funduc's search... if it can do many (100,000+) search&replace operations then it may come in handy for other tasks, but I can't see using it for concordance searches.

[Edited at 2009-06-10 17:54 GMT]



FarkasAndras
Local time: 05:24
English to Hungarian
+ ...
TOPIC STARTER
update Jun 10, 2009

I just created 6 TMX files out of the DGT-TM data and created a project with all of them in xbench. It works reasonably well. It takes ages to load the project (well over a minute, but then the files are 800+ MB and they are not indexed), and the searches are just about fast enough to be bearable (3 seconds or so). Again, not indexed...
I searched the string "the" just to torture it: 11 sec, seven hundred thousand hits successfully retrieved and displayed. Not bad.

BTW I just bought a crazy fast SSD drive... I don't even want to think about how long the load and the searches would take with my old, puny little platter drive. I'm pretty sure the program would be unusable.

Anyway, xbench seems like a viable solution, but it's not exactly fast because of the lack of indexing and whatnot. It ate up close to 400 MB of memory as well.


Thanks to everyone for all the help so far, and if anyone has suggestions or comments on how DVX handles large TMs, please share them.



ViktoriaG  Identity Verified
Canada
Local time: 23:24
English to French
+ ...
Split the TM Jun 10, 2009

That is a humongous TM! I am not at all surprised that XBench can't handle half of it - I expect it will not even handle one tenth of it. However, if you split it up into chunks of ten to twenty thousand units each (this means splitting it 50- or 100-fold, no less) and load all of those into an XBench project, XBench should be able to give you good results.
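
The splitting itself could be sketched in Python along these lines. It assumes a plain-text export with one TU per line (cutting a TMX file at arbitrary lines would break the XML), and the names `chunked` and `split_file`, the 20,000-line default and the `.001`, `.002` suffixes are all made up for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(lines: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` lines."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def split_file(path: str, size: int = 20000) -> List[str]:
    """Write path.001, path.002, ... each holding up to `size` lines.

    Returns the names of the files that were written.
    """
    written = []
    with open(path, encoding="utf-8") as source:
        for number, chunk in enumerate(chunked(source, size), start=1):
            part = f"{path}.{number:03d}"
            with open(part, "w", encoding="utf-8") as target:
                target.writelines(chunk)
            written.append(part)
    return written
```

Only one chunk is held in memory at a time, so the size of the original export should not matter.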

I am afraid that no tool will be able to run a TM of that size smoothly.

As for the desktop search utility, one that is free and that I highly recommend is Exalead Desktop at http://www.exalead.com/software/products/desktop-search/ - I have used it with success to search across TMs and bilingual documents. It's pretty fast and searches are highly customizable.



Samuel Murray  Identity Verified
Netherlands
Local time: 05:24
Member (2006)
English to Afrikaans
+ ...
Some answers Jun 10, 2009

FarkasAndras wrote:
As I stated in the first post, it's a single TMX.


Actually, you said no such thing. Following your post, I did spend about 10 minutes on various sites that mention DGT-TM and Acquis TM, but I was unable to determine whether it is one file or multiple files, or what format the TM is in.

I don't want to use a desktop search program for a bunch of reasons, see previous post:
...but I have not heard of any convenient way of limiting Google Desktop searches to a particular folder. ... On top of that, if the little section it gives on the results page doesn't contain the text in the other language, you're SOL.


1. Google Desktop is only one desktop searcher. Try Wilma from Redtree.

2. If you convert the TMX to CSV (or better: tab delimited), then source and target will be on a single line, which hopefully means that both will be displayed in a single search result. TMX was designed by idiots for idiots for idiotic purposes.
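
A conversion like the one in point 2 could be sketched in Python with the standard library's streaming XML parser, so the whole TMX never has to fit in memory. This is an untested-on-DGT-TM sketch: the names `tmx_pairs` and `write_tsv` are hypothetical, and it assumes each `<tu>` holds one `<tuv>` per language, tagged with either `xml:lang` or a plain `lang` attribute, and that language codes are matched exactly (e.g. "EN-GB", not just "EN").

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Iterator, TextIO, Tuple

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_pairs(source, src_lang: str, tgt_lang: str) -> Iterator[Tuple[str, str]]:
    """Stream (source text, target text) pairs from a TMX file.

    `source` is a file name or a binary file object.
    """
    src_lang, tgt_lang = src_lang.lower(), tgt_lang.lower()
    body = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if elem.tag == "body":
                body = elem
            continue
        if elem.tag != "tu":
            continue
        texts = {}
        for tuv in elem.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also picks up text inside inline tags;
                # split/join removes tabs and line breaks from the segment.
                texts[lang] = " ".join("".join(seg.itertext()).split())
        if body is not None:
            body.clear()  # drop finished units so memory stays bounded
        if src_lang in texts and tgt_lang in texts:
            yield texts[src_lang], texts[tgt_lang]


def write_tsv(pairs: Iterable[Tuple[str, str]], out: TextIO) -> None:
    """Write one 'source<TAB>target' line per pair."""
    for source_text, target_text in pairs:
        out.write(f"{source_text}\t{target_text}\n")
```

Inline formatting tags are dropped and only their text is kept, which should also help plain-text search tools.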

What really matters to me is that if I search for, say, a 10-word sentence, I want to get hits that contain just 7 (or any) of those words in whatever order. And order the TUs by how good the match is...


Let us know if you find something that does that.



FarkasAndras
Local time: 05:24
English to Hungarian
+ ...
TOPIC STARTER
decent search Jun 10, 2009

Samuel Murray wrote:

What really matters to me is that if I search for, say, a 10-word sentence, I want to get hits that contain just 7 (or any) of those words in whatever order. And order the TUs by how good the match is...


Let us know if you find something that does that.


Trados, and hopefully all of them.
I was pretty astonished to find that some CATs don't make this the default. I think WF Pro only gives concordance hits by default if it finds a TU with all searched words present (maybe only if they are in the same order), and you have to replace all the spaces with + signs to get what I consider to be normal search results.

Obviously one rarely does concordance on a whole sentence, but a 5-word name of an institution is a frequent scenario, and some minor variation in the name shouldn't stop a good hit showing up at the top of the list.



ViktoriaG  Identity Verified
Canada
Local time: 23:24
English to French
+ ...
Do you mean concordance search? Jun 12, 2009

FarkasAndras wrote:

Trados, and hopefully all of them.

If you mean concordance search, I would just like to say that I found it so inefficient (0 results - I swear I know that word is buried somewhere in this TM, in at least three segments!) that I started using XBench, with awesome results. If it's there, XBench will find it.

Seriously, the only tool out there that is nearly as efficient as what you describe is XBench (to my knowledge - and I tried out many). But no way will it handle millions of TUs in one big file. If you split it all up into many smaller TMs, it will most likely work. You would still have all of the TM, except that XBench will open many smaller files one after the other instead of trying to open a huge one. I suspect it isn't searching in the file that causes accidents but just opening it.

The only other alternative I can suggest is to use an ultra-fast text editor like EmEditor, which can open files of up to 2 GB if I am not mistaken. Then you can search over a text file. Still, that would be a very flat search, and it will not even let you compare two search results...

