Bilingual Concordancer on large TMX files or WordSmith Tools?
Thread poster: Paolo Troiani

Paolo Troiani  Identity Verified
Italy
Local time: 02:51
Member (2010)
English to Italian
+ ...
Oct 17, 2009

Hi,

I'm still doing some research (thank God it's almost finished...) on bilingual aligned corpora.
These corpora are basically just large TMX files.

The problem is that, as far as I know, there is no concordancer able to analyse large TMX files (large = more than 65,000 TUs).
Therefore my idea is:
1) Use WordSmith Tools 5.0
2) In order to process the corpora with WS 5.0, split my bilingual TMX file into two monolingual TMX files.

Questions are:

1) Does it make sense to use WS 5.0, or am I overlooking some other concordance tool?
2) Is it possible to create two "monolingual" TMX memories out of one original bilingual TMX? If yes, how can I do that?

Thank you very much for any comments, suggestions or ideas.



Lutz Molderings  Identity Verified
Germany
Local time: 02:51
Member (2007)
German to English
+ ...
Corsis Oct 18, 2009

When I was at university I used WordSmith Tools 4 for analysing my self-aligned parallel corpus. It all worked pretty well.
When I had almost completed my dissertation, I came across a guy from Heidelberg University who was working on an open-source corpus-processing tool. It does much the same as WordSmith Tools.
I am planning to continue my studies at the beginning of next year and will certainly give this tool a try.

http://sourceforge.net/projects/corsis/

[Edited at 2009-10-18 07:39 GMT]


FarkasAndras
Local time: 02:51
English to Hungarian
+ ...
Analyse? Oct 18, 2009

What sort of analysis do you want to do?
Xbench may be able to do what you need: number of occurrences of a word, word fragment or combination of words in language1 or language2; number of segments where language1 has word x and language2 has (or _doesn't have_) word y. All of these searches would of course also give you the actual text of the segment pairs. Xbench works on large databases; I'm using it on 1.5 million segments now, about 2 million sentences. You just have to split the material up into about 200,000 segments per file and have lots of RAM.
If you split the corpus up into two independent corpora, you won't be able to do what I consider the most relevant searches... You'll probably have to do both, i.e. advanced corpus analysis on the two independent corpora and "correspondence analysis" with xbench or a similar tool on the TMX.
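Just to illustrate what that x-but-not-y "correspondence" search amounts to, here is a sketch on a flat tab-separated export of segment pairs. The file format here is purely an assumption for the example (Xbench does all this through its own interface):

```shell
# Toy tab-separated export of segment pairs: source<TAB>target
# (this flat format is an assumption for the sketch, not Xbench's own)
printf 'the red car\tla macchina rossa\n' >  pairs.txt
printf 'the red house\tla casa\n'         >> pairs.txt
printf 'a blue car\tuna macchina blu\n'   >> pairs.txt

# Segment pairs where the source contains "red" but the target
# does NOT contain "rossa"
awk -F'\t' '$1 ~ /red/ && $2 !~ /rossa/' pairs.txt > mismatches.txt
cat mismatches.txt
```

Only the "red house / la casa" pair survives the filter, i.e. exactly the segments where the expected equivalent is missing.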

Converting TMX to monolingual is possible of course, but it may not be trivial. If the file is large, I'd use Linux command-line tools (do you have a geek friend who'll do it for you?). In my experience, the likes of grep, sed and vim provide the ultimate solution for things like this. In this case: vim to check the structure of the TMX and identify the search terms you need to use, then grep to extract the relevant parts, then possibly sed to remove the unneeded bits (tags etc.).
If you don't know your way around Linux distros and command-line tools, you could just use search and replace in Word and/or do it in Excel, or look for some conversion tool... but none of these is as robust as doing it yourself on the command line. BTW, an Ubuntu live CD would allow you to do this without installing Linux, or you could just look for a port of these tools that works on the OS of your choice.
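As a rough sketch of that grep-then-sed step, assuming a TMX where each <tuv> and its <seg> sit on the same line (real files may wrap differently, which is exactly why you check the structure in vim first):

```shell
# Toy TMX fragment; in a real file, check first whether <seg> really
# shares a line with its <tuv> - this sketch assumes it does
cat > sample.tmx <<'EOF'
<tu>
<tuv xml:lang="en"><seg>Hello world</seg></tuv>
<tuv xml:lang="it"><seg>Ciao mondo</seg></tuv>
</tu>
EOF

# Keep only the English lines, then strip the tags with sed
grep 'xml:lang="en"' sample.tmx | sed -e 's/<[^>]*>//g' > english.txt
cat english.txt
```

Note that the result is plain text rather than a monolingual TMX, which is arguably more convenient anyway for feeding to a monolingual concordancer.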

Edit: BTW, grep could come in very handy for analysis as well. If you have a gigantic text file and want to extract all the lines that contain a given word, grep is your best option. It handles xml files out of the box, and of course you can set it to also extract n lines before and after each hit.
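For example, the before/after context comes from GNU grep's -B and -A flags:

```shell
# Toy text file
cat > big.txt <<'EOF'
line before
the target word is here
line after
unrelated line
EOF

# -B1/-A1: also print 1 line of context before and after each hit
grep -B1 -A1 'target' big.txt > hits.txt
cat hits.txt
```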

[Edited at 2009-10-18 09:01 GMT]


FarkasAndras
Local time: 02:51
English to Hungarian
+ ...
tmx Oct 18, 2009

In the previous post I of course meant to say that grep handles *TMX* files out of the box, not xml. Of course it handles xml as well, but that's beside the point.

BTW, if your corpus is the DGT-TM, you could just generate 6 TMX files, 2 volumes per TMX, and feed them to Xbench. If your source file is some other large bicorpus, you can chop the TMX up - with a unix command-line tool, what else...
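One way that chopping-up might look, sketched with awk under the assumption that every translation unit ends on a line containing </tu> (each chunk would still need the TMX header and <body> wrapper restored by hand before a tool will accept it):

```shell
# Toy corpus: one <tu> per line for simplicity
cat > corpus.tmx <<'EOF'
<tu><tuv xml:lang="en"><seg>one</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>two</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>three</seg></tuv></tu>
EOF

# Write every 2 TUs to a new chunk file: chunk_0.tmx, chunk_1.tmx, ...
awk -v per=2 'BEGIN { part = 0 }
  { print > ("chunk_" part ".tmx") }
  /<\/tu>/ { n++; if (n % per == 0) { close("chunk_" part ".tmx"); part++ } }
' corpus.tmx
wc -l chunk_*.tmx
```

Raising `per` to 200,000 would give file sizes in the range Xbench is said to handle comfortably.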


Paolo Troiani  Identity Verified
Italy
Local time: 02:51
Member (2010)
English to Italian
+ ...
TOPIC STARTER
Reason for using WS 5.0 Oct 19, 2009

First of all, thank you very much, both of you.

So far, Xbench looks like a good idea. As far as I understand (give me some time to practice), it is a very powerful concordancer, and the kind of analysis I am performing is exactly what Farkas was guessing:
"where language1 has word x and language2 has (or _doesn't have_) word y"

The reason for using WS 5.0 is that I would also like to highlight the "time evolution" of a term, i.e. its frequency within a certain period, in order to answer questions like:
1) Is there a change in the use of this term?
2) When is this term most frequent, and why?
It seems to me the plot functionality in WS 5.0 would suit this analysis. Or, once again, is there something else I am overlooking?
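For what it's worth, a crude version of that frequency-over-time count can also be done with grep, assuming the corpus can first be split into one plain-text file per period (the filenames here are made up for the sketch):

```shell
# Toy yearly subcorpora (assumes the corpus has been split by year)
printf 'the term appears here\nno match\nthe term again\n' > corpus_2007.txt
printf 'nothing relevant here\n' > corpus_2008.txt

# Count the lines containing "term" in each period -> a frequency table
for year in 2007 2008; do
  printf '%s\t%s\n' "$year" "$(grep -c 'term' "corpus_$year.txt")"
done > freq.txt
cat freq.txt
```

Note that grep -c counts matching lines, not total occurrences; WS 5.0's plot view remains the more comfortable option for this kind of diachronic picture.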


Tangopeter

Local time: 02:51
English to Dutch
+ ...
Maybe a useful article? Oct 21, 2009

Hi,

I suppose you've read this article?

http://simmer-lossner.blogspot.com/2008/11/practical-use-of-corpora-in-acquiring.html

Just to be sure...



