TMXMerger - any way to see source filenames for merged TMs?
Thread poster: Mercer
Nov 4, 2013

Hi, when merging a large number of small .tmx files into a larger one using TMXMerger, is there any way afterwards to see which file a given segment originally came from?

I merged TMX files that were originally created from individual texts using LF Aligner. I merged them because OmegaT was running out of memory trying to load them all individually, and it seems to have an easier time when they are grouped into a few large files.

I am not seeing how that would be possible, but is there a way to do it?


Didier Briel
France
Member (2007)
English to French
Not without modifying the source code Nov 5, 2013

Mercer wrote:
Hi, when merging a large number of small .tmx files into a larger one using TMXMerger, is there any way afterwards to see which file a given segment originally came from?

Without modifying the source code, I don't think so.

I merged TMX files that were originally created from individual texts using LF Aligner. I merged them because OmegaT was running out of memory trying to load them all individually, and it seems to have an easier time when they are grouped into a few large files.

Have you tried increasing the memory allocated to OmegaT?

Didier


Samuel Murray
Netherlands
Member (2006)
English to Afrikaans
Only if you can pollute the original segments Nov 5, 2013

Mercer wrote:
Hi, when merging a large number of small TMX files into a larger one using TMXMerger, is there any way afterwards to see which file a given segment originally came from?


No, not with TMXMerger.

But if you can edit the original TMX files themselves, then you can add short codes to either the source or the target text of each segment, which will help you identify the origin when you see it in OmegaT's fuzzy match pane.

If it is not simple enough to edit your TMX files, then it may be possible to edit the segments before or during the alignment (I have no idea whether LF Aligner allows you to edit segments before it creates the final TMX file).

I don't use OmegaT often, but in my own CAT tool I often do this: I add e.g. [ENG] to the start of each segment's source text, if e.g. that segment came from an engineering text. This reduces the match percentage, though. If you don't want to reduce match percentages, you can add e.g. [ENG] to the start of each segment's target text, but then you run the risk that some of these tags will end up in your translation without you noticing it.

If you do do this (i.e. add the tag to the target text), then I recommend that you add a custom "tag" in your tag validation settings. In OmegaT, go to Options > Tag Validation, and in the "Regular expression for custom tags" field, type this:

\[.+?\]

When you do tag validation, segments with such left-over tags will be reported.
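
To see what that expression catches before relying on it, here is a minimal Python sketch. It is only an approximation: OmegaT uses Java regular expressions, but for this simple pattern Python's re module behaves the same way. The function name leftover_tags and the sample sentences are made up for illustration.

```python
import re

# Same pattern as suggested for the custom-tag field: a literal "[",
# then as few characters as possible, then a literal "]".
CUSTOM_TAG = re.compile(r"\[.+?\]")


def leftover_tags(target_text):
    """Return every bracketed marker found in a target segment."""
    return CUSTOM_TAG.findall(target_text)


if __name__ == "__main__":
    print(leftover_tags("[ENG] Serrez la vis."))
    print(leftover_tags("Serrez la vis."))
```

Note that the pattern matches any bracketed text, so segments that legitimately contain square brackets would be reported as well.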


Didier Briel
France
Member (2007)
English to French
You could edit changeID or creationID instead Nov 5, 2013

Samuel Murray wrote:
But if you can edit the original TMX files themselves, then you can add short codes to either the source or the target text of each segment, which will help you identify the origin when you see it in OmegaT's fuzzy match pane.

If it is not simple enough to edit your TMX files, then it may be possible to edit the segments before or during the alignment (I have no idea whether LF Aligner allows you to edit segments before it creates the final TMX file).

I don't use OmegaT often, but in my own CAT tool I often do this: I add e.g. [ENG] to the start of each segment's source text, if e.g. that segment came from an engineering text. This reduces the match percentage, though. If you don't want to reduce match percentages, you can add e.g. [ENG] to the start of each segment's target text, but then you run the risk that some of these tags will end up in your translation without you noticing it.

You could edit changeID or creationID instead. That way, you do not change the segment itself, and you can display the origin in the Fuzzy Matches pane.
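
One possible way to do this in bulk is sketched below in Python, using only the standard library. It rests on some assumptions: in TMX files the attributes are written in lowercase (creationid, changeid), and the sketch sets creationid on every tuv element, which may or may not be where your version of OmegaT reads it from. The function name stamp_origin and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample; a real TMX header carries more attributes.
SAMPLE = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en"><seg>Hello</seg></tuv>
<tuv xml:lang="fr"><seg>Bonjour</seg></tuv></tu>
</body></tmx>"""


def stamp_origin(tmx_text, origin):
    """Return the TMX text with creationid set to `origin` on every <tuv>.

    Assumption: the origin marker is wanted on <tuv>, not on <tu>.
    The XML declaration and DOCTYPE, if any, are not kept in the output.
    """
    root = ET.fromstring(tmx_text)
    for tuv in root.iter("tuv"):
        tuv.set("creationid", origin)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(stamp_origin(SAMPLE, "manual_A.tmx"))
```

Run it on each small TMX before merging, with the file name as the origin, and check the result in a text editor.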

Didier


FarkasAndras
English to Hungarian
add separate field Nov 5, 2013

Samuel Murray wrote:

Mercer wrote:
Hi, when merging a large number of small TMX files into a larger one using TMXMerger, is there any way afterwards to see which file a given segment originally came from?


No, not with TMXMerger.

But if you can edit the original TMX files themselves, then you can add short codes to either the source or the target text of each segment, which will help you identify the origin when you see it in OmegaT's fuzzy match pane.

If it is not simple enough to edit your TMX files, then it may be possible to edit the segments before or during the alignment (I have no idea whether LF Aligner allows you to edit segments before it creates the final TMX file).

I don't use OmegaT often, but in my own CAT tool I often do this: I add e.g. [ENG] to the start of each segment's source text, if e.g. that segment came from an engineering text. This reduces the match percentage, though. If you don't want to reduce match percentages, you can add e.g. [ENG] to the start of each segment's target text, but then you run the risk that some of these tags will end up in your translation without you noticing it.

If you do do this (i.e. add the tag to the target text), then I recommend that you add a custom "tag" in your tag validation settings. In OmegaT, go to Options > Tag Validation, and in the "Regular expression for custom tags" field, type this:

\[.+?\]

When you do tag validation, segments with such left-over tags will be reported.

Why not add a text field? That will show up in your CAT and will not affect the matches you get. LF Aligner can do this (you can type in the 'Note' when you generate the TMX and it adds it to every TU).
If the files are not generated by alignment, you need to add this between the opening tu tag and the opening tuv tag: <prop type="Txt::Source">this is where you put the source ID</prop>
Maybe it's possible to just add this to the header instead of adding it to every TU, I don't know.
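
If the files were not produced by alignment, the prop element could be added with a short script instead of by hand. The Python sketch below is only an illustration under stated assumptions: it uses the Txt::Source type shown above, inserts the prop as the first child of each tu (so it sits before the first tuv), and the function name add_source_prop and the sample data are made up.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample; a real TMX header carries more attributes.
SAMPLE = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en"><seg>Hello</seg></tuv>
<tuv xml:lang="hu"><seg>Szia</seg></tuv></tu>
</body></tmx>"""


def add_source_prop(tmx_text, source_id, prop_type="Txt::Source"):
    """Return the TMX text with a <prop> naming the source in every <tu>."""
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):
        prop = ET.Element("prop", {"type": prop_type})
        prop.text = source_id
        tu.insert(0, prop)  # first child, so it precedes every <tuv>
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(add_source_prop(SAMPLE, "report_2013.doc"))
```

The XML declaration and DOCTYPE are not carried over by this approach, so compare the output with the original in a text editor before importing it.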

[Edited at 2013-11-05 09:26 GMT]


Samuel Murray
Netherlands
Member (2006)
English to Afrikaans
@Didier Nov 5, 2013

Didier Briel wrote:
You could edit changeID or creationID instead. That way, you do not change the segment itself, and you can display the origin in the Fuzzy Matches pane.


That's a good idea. If you know how to do that, that would be a good place to put the marker.

FarkasAndras wrote:
Why not add a text field? That will show up in your CAT and will not affect the matches you get. LF Aligner can do this (you can type in the 'Note' when you generate the TMX and it adds it to every TU).


I'm glad to know that LF Aligner has the ability to do this.


Mercer
TOPIC STARTER
Thanks for the answers Nov 5, 2013

Thank you for the answers, I will try these options and post an update. One of the computers this has to run on has very little RAM, so giving more memory to OmegaT is not an option.

Didier Briel wrote:
You could edit changeID or creationID instead. That way, you do not change the segment itself, and you can display the origin in the Fuzzy Matches pane.


Thanks, it is a good idea. I am new to this and was not aware that the information shown in the OmegaT fuzzy match pane could be easily modified. I have tried it now and it looks like this could work, but TMXMerger seems to get rid of all ID tags and notes, so I will try to see if there are other ways to merge the files.

FarkasAndras wrote:
Why not add a text field? That will show up in your CAT and will not affect the matches you get. LF Aligner can do this (you can type in the 'Note' when you generate the TMX and it adds it to every TU).
If the files are not generated by alignment, you need to add this between the opening tu tag and the opening tuv tag: <prop type="Txt::Source">this is where you put the source ID</prop>
Maybe it's possible to just add this to the header instead of adding it to every TU, I don't know.

[Edited at 2013-11-05 09:26 GMT]


Does the LF Aligner batch mode fill the note field automatically? After it is filled how would I get it to show in OmegaT?


Didier Briel
France
Member (2007)
English to French
Configuration of the fuzzy match pane Nov 5, 2013

Thanks, it is a good idea. I am new to this and was not aware that the information shown in the OmegaT fuzzy match pane could be easily modified. I have tried it now and it looks like this could work, but TMXMerger seems to get rid of all ID tags and notes, so I will try to see if there are other ways to merge the files.

As far as I can see in the source code, changeID is preserved, as well as changeDate.

Does the LF Aligner batch mode fill the note field automatically? After it is filled how would I get it to show in OmegaT?

See https://sourceforge.net/p/omegat/feature-requests/598/

Didier


FarkasAndras
English to Hungarian
play around with it Nov 5, 2013

Mercer wrote:

Thank you for the answers, I will try these options and post an update. One of the computers this has to run on has very little RAM, so giving more memory to OmegaT is not an option.

Didier Briel wrote:
You could edit changeID or creationID instead. That way, you do not change the segment itself, and you can display the origin in the Fuzzy Matches pane.


Thanks, it is a good idea. I am new to this and was not aware that the information shown in the OmegaT fuzzy match pane could be easily modified. I have tried it now and it looks like this could work, but TMXMerger seems to get rid of all ID tags and notes, so I will try to see if there are other ways to merge the files.

FarkasAndras wrote:
Why not add a text field? That will show up in your CAT and will not affect the matches you get. LF Aligner can do this (you can type in the 'Note' when you generate the TMX and it adds it to every TU).
If the files are not generated by alignment, you need to add this between the opening tu tag and the opening tuv tag: <prop type="Txt::Source">this is where you put the source ID</prop>
Maybe it's possible to just add this to the header instead of adding it to every TU, I don't know.

[Edited at 2013-11-05 09:26 GMT]


Does the LF Aligner batch mode fill the note field automatically? After it is filled how would I get it to show in OmegaT?


If you created the files with LF Aligner with default settings, then they should all have a Note field containing the name of the input files (such as Englishfile.doc_Frenchfile.doc). Open one of the tmx files with a text editor and see if it has a prop type="Txt::Note" field. Then open the merged tmx and see if the note field is still there (TMXMerger may have removed it). If it's there in the merged tmx, you should be able to see it in OmegaT.
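
With many files, the same check can be scripted rather than done by eye. Below is a minimal Python sketch that counts how many translation units carry a prop of type Txt::Note; the function name note_coverage and the sample data are hypothetical. Running it on an original file and on the merged file should show whether the merge dropped the field.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample: the first unit has a note, the second does not.
SAMPLE = """<tmx version="1.4"><header/><body>
<tu><prop type="Txt::Note">Englishfile.doc_Frenchfile.doc</prop>
<tuv xml:lang="en"><seg>One</seg></tuv>
<tuv xml:lang="fr"><seg>Un</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Two</seg></tuv>
<tuv xml:lang="fr"><seg>Deux</seg></tuv></tu>
</body></tmx>"""


def note_coverage(tmx_text, prop_type="Txt::Note"):
    """Return (units_with_prop, total_units) for one TMX document."""
    root = ET.fromstring(tmx_text)
    units = list(root.iter("tu"))
    with_prop = sum(
        1
        for tu in units
        if any(p.get("type") == prop_type for p in tu.findall("prop"))
    )
    return with_prop, len(units)


if __name__ == "__main__":
    found, total = note_coverage(SAMPLE)
    print(f"{found} of {total} units have a note")
```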

If you're aligning a bunch of files from scratch using the LF Aligner batch mode in V 4.04, specify an output file with --outfile="path\file.txt". Add that to every command and you'll get a single tab delimited file with all the texts in it and the source file names in the third column. You can use search and replace in a text editor to change the text if you want to. Then run the TMX maker with default settings on that file and you should get a single TMX file with all your stuff in it and correct 'Note' fields added to each TU. Check the tmx in a text editor before importing to make sure.
Editing the CreationID is also a reasonable option, but if you already have TMX files with the Note fields from LF Aligner, it should be easier to just use them. This is what the Note field is for (so that you can see which source file the TM hit came from).


Mercer
TOPIC STARTER
Thanks! Nov 6, 2013

Didier Briel wrote:
Does the LF Aligner batch mode fill the note field automatically? After it is filled how would I get it to show in OmegaT?

See https://sourceforge.net/p/omegat/feature-requests/598/

Didier


Thanks for the link, very useful. Is there also a way to configure the search window to show notes, or only the match pane?

FarkasAndras wrote:
If you created the files with LF Aligner with default settings, then they should all have a Note field containing the name of the input files (such as Englishfile.doc_Frenchfile.doc). Open one of the tmx files with a text editor and see if it has a prop type="Txt::Note" field. Then open the merged tmx and see if the note field is still there (TMXMerger may have removed it). If it's there in the merged tmx, you should be able to see it in OmegaT.

If you're aligning a bunch of files from scratch using the LF Aligner batch mode in V 4.04, specify an output file with --outfile="path\file.txt". Add that to every command and you'll get a single tab delimited file with all the texts in it and the source file names in the third column. You can use search and replace in a text editor to change the text if you want to. Then run the TMX maker with default settings on that file and you should get a single TMX file with all your stuff in it and correct 'Note' fields added to each TU. Check the tmx in a text editor before importing to make sure.
Editing the CreationID is also a reasonable option, but if you already have TMX files with the Note fields from LF Aligner, it should be easier to just use them. This is what the Note field is for (so that you can see which source file the TM hit came from).


Thanks, I ended up using the note field. I am not sure why I was losing it and the other metadata fields when using TMXMerger, but that explains why the merged files ended up being significantly smaller than the size of all the individual files. I merged the existing TMX files using a text editor to keep the note field, and it now shows in the OmegaT match pane by following the instructions Didier posted.
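
For anyone who would rather not merge by hand, the text-editor merge can be approximated with a short script. This is only a sketch under stated assumptions, not what TMXMerger does: it keeps the header of the first file, appends every tu from the other files unchanged (props and notes included), does no deduplication, and assumes all files share the same languages and header settings. The function name merge_tmx and the sample data are made up.

```python
import xml.etree.ElementTree as ET

# Two trimmed-down sample documents, each with one unit and a note.
DOC_A = """<tmx version="1.4"><header/><body>
<tu><prop type="Txt::Note">a_en.doc_a_fr.doc</prop>
<tuv xml:lang="en"><seg>One</seg></tuv>
<tuv xml:lang="fr"><seg>Un</seg></tuv></tu>
</body></tmx>"""

DOC_B = """<tmx version="1.4"><header/><body>
<tu><prop type="Txt::Note">b_en.doc_b_fr.doc</prop>
<tuv xml:lang="en"><seg>Two</seg></tuv>
<tuv xml:lang="fr"><seg>Deux</seg></tuv></tu>
</body></tmx>"""


def merge_tmx(documents):
    """Merge TMX documents (as strings), keeping each <tu> intact.

    The header of the first document is reused for the result.
    """
    merged = ET.fromstring(documents[0])
    body = merged.find("body")
    for text in documents[1:]:
        for tu in ET.fromstring(text).iter("tu"):
            body.append(tu)  # the whole unit, props and notes included
    return ET.tostring(merged, encoding="unicode")


if __name__ == "__main__":
    print(merge_tmx([DOC_A, DOC_B]))
```

Because everything is held in memory at once, this may not suit a machine with very little RAM if the combined files are large.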

I am happy this works, thanks!


Didier Briel
France
Member (2007)
English to French
Only the match pane Nov 6, 2013

Mercer wrote:
Thanks for the link, very useful. Is there also a way to configure the search window to show notes, or only the match pane?

Only the match pane.

Didier

