Re-segmentation of TMX files: is there a tool for this?
Thread poster: Hans Lenting

Hans Lenting  Identity Verified
Netherlands
Member (2006)
German to Dutch
+ ...
Sep 5, 2012

Hi,

Sometimes I have TMX files where segmentation isn't ideal. E.g. when translating SDLXLIFF files in another CAT tool I get TUs that contain more than one sentence.

Is there a tool to re-segment TMX files *without further conversion* in such a way that every sentence (and its translation) is placed in its own TU? In an ideal world this tool would use SRX rule files.

Thanks for your suggestions!

Hans

[Edited at 2012-09-05 19:29 GMT]


FarkasAndras
Local time: 10:16
English to Hungarian
+ ...
Hardly Sep 5, 2012

I'd be very surprised if such a tool existed. It wouldn't be too difficult, though, to extract the two texts from the tmx and then segment and align them. All the sw you'd need is available in my projects on sourceforge.net/projects/aligner. The only trouble is that you would lose all metadata, and you might run into some issues as this process is largely untested.


Dominique Pivard  Identity Verified
Local time: 11:16
Finnish to French
TM-driven segmentation in memoQ Sep 6, 2012

Hans Lenting wrote:
Sometimes I have TMX files where segmentation isn't ideal. E.g. when translating SDLXLIFF files in another CAT tool I get TUs that contain more than one sentence.

Is there a tool to re-segment TMX files *without further conversion* in such a way that every sentence (and its translation) is placed in its own TU? In an ideal world this tool would use SRX rule files.

memoQ has a feature called "TM-driven segmentation".



I believe it does what you want to do, but the other way round: rather than adapting the segmentation of your TM to your document, it adapts the segmentation of your document to that of the TM.

The feature works very well and I'm surprised it hasn't been copied by other tools (AFAIK). Maybe it's something you should suggest to Igor.



Hans Lenting  Identity Verified
Netherlands
Member (2006)
German to Dutch
+ ...
TOPIC STARTER
Is this simple? Sep 6, 2012

Dominique Pivard wrote:


I believe it does what you want to do, but the other way round: rather than adapting the segmentation of your TM to your document, it adapts the segmentation of your document to that of the TM.



I'm afraid not ... What I want is: translate SDLXLIFF in another CAT tool (e.g. CafeTran), change multi-sentence segments to segments only containing 1 sentence (part) each.

<tu tuid="1" creationdate="20120905T225134Z" creationid="HL">
<tuv xml:lang="de-DE"><seg>Dies ist der erste Satz. Dies der Zweite. Und dies ist der dritte Satz.</seg></tuv>
<tuv xml:lang="nl-NL"><seg>Dit is de eerste zin. Dit de tweede. En dit is de derde zin.</seg></tuv></tu>

And this is what we want:

<tu tuid="1" creationdate="20120905T225134Z" creationid="HL">
<tuv xml:lang="de-DE"><seg>Dies ist der erste Satz. </seg></tuv>
<tuv xml:lang="nl-NL"><seg>Dit is de eerste zin. </seg></tuv></tu>

<tu tuid="2" creationdate="20120905T225134Z" creationid="HL">
<tuv xml:lang="de-DE"><seg>Dies der Zweite. </seg></tuv>
<tuv xml:lang="nl-NL"><seg>Dit de tweede. </seg></tuv></tu>

<tu tuid="3" creationdate="20120905T225134Z" creationid="HL">
<tuv xml:lang="de-DE"><seg>Und dies ist der dritte Satz.</seg></tuv>
<tuv xml:lang="nl-NL"><seg>En dit is de derde zin.</seg></tuv></tu>

Where the number of sentences per TU is identical in source and target, this process can be automated. In cases where the numbers differ (because of joining/splitting), a simple dialog box should pop up.
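For illustration, here is a minimal Python sketch of that idea. It is not an existing tool: the function names are made up, the sentence splitter is a naive regex standing in for proper SRX rules, and it assumes each <tuv> holds one plain-text <seg> without inline tags. TUs whose sentence counts differ are returned as None, which is where the dialog box would come in.

```python
import copy
import re
import xml.etree.ElementTree as ET

# Naive splitter: break after ., ! or ? when whitespace follows.
# Assumption: a real tool would apply SRX rules here instead.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Return the non-empty sentences found in text."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def resegment_tu(tu):
    """Return one TU per sentence, or None if the counts differ.

    Assumes every <tuv> contains a <seg> with plain text only.
    """
    parts = [split_sentences(tuv.find("seg").text or "")
             for tuv in tu.findall("tuv")]
    counts = {len(p) for p in parts}
    if len(counts) != 1:
        return None  # source and target disagree: needs a human decision
    new_tus = []
    for i in range(counts.pop()):
        # The copy keeps creationdate, creationid and so on.
        # Note that tuid is copied too and is not renumbered here.
        clone = copy.deepcopy(tu)
        for tuv, sentences in zip(clone.findall("tuv"), parts):
            tuv.find("seg").text = sentences[i]
        new_tus.append(clone)
    return new_tus


SAMPLE = """<tu tuid="1" creationid="HL">
<tuv xml:lang="de-DE"><seg>Dies ist der erste Satz. Dies der Zweite. Und dies ist der dritte Satz.</seg></tuv>
<tuv xml:lang="nl-NL"><seg>Dit is de eerste zin. Dit de tweede. En dit is de derde zin.</seg></tuv></tu>"""

for new_tu in resegment_tu(ET.fromstring(SAMPLE)):
    print([seg.text for seg in new_tu.iter("seg")])
```

Copying the whole TU for each sentence is a deliberate choice here: it keeps the metadata attached to every new unit at the cost of duplicated attributes.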

[Edited at 2012-09-06 06:55 GMT]


Adam Łobatiuk  Identity Verified
Poland
Local time: 10:16
Member (2009)
English to Polish
+ ...
Not simple but might work Sep 6, 2012

I think you can use Olifant (part of the free Okapi tools) and its "Split entries on markers" feature. This is what the Help says:


Splits the selected entries according to the split markers in the source and target text. If no entry is selected, the entry where the cursor is located is automatically selected.

The split markers should be the text "[$SPLIT$]". The first segment replaces the text of the original entry, the other segments are copied into new entries.

For example:

Original entry source: "Open File...[$SPLIT$]Open"
Original entry target: "Ouvrir un fichier[$SPLIT$]Ouvrir"

Gives:

Original entry source: "Open File..."
Original entry target: "Ouvrir un fichier"
New entry source: "Open"
New entry target: "Ouvrir"


You would probably have to sort your TMX file for entries which include more than one full stop or some other specific feature, and then replace the full stop with a full stop and the marker. So, it won't be fully automatic, but still might do some work for you.
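The marker step could be scripted before loading the file into Olifant. Below is a rough Python sketch under some assumptions: the function name is made up, it works on the plain text of one source/target pair (getting the text in and out of the TMX is left out), and it treats any full stop, exclamation mark or question mark followed by whitespace as a sentence boundary. Pairs where the two sides disagree on the number of boundaries are left untouched for manual review.

```python
import re

MARKER = "[$SPLIT$]"
# A full stop, exclamation or question mark followed by whitespace and more text.
BOUNDARY = re.compile(r"([.!?])\s+(?=\S)")


def add_markers(source, target):
    """Insert the split marker at sentence boundaries in both texts.

    Returns the pair unchanged when the two sides would not yield the
    same number of pieces, so those entries can be reviewed by hand.
    """
    if len(BOUNDARY.findall(source)) != len(BOUNDARY.findall(target)):
        return source, target
    marked_source = BOUNDARY.sub(lambda m: m.group(1) + MARKER, source)
    marked_target = BOUNDARY.sub(lambda m: m.group(1) + MARKER, target)
    return marked_source, marked_target


print(add_markers("Dies ist der erste Satz. Dies der Zweite.",
                  "Dit is de eerste zin. Dit de tweede."))
```

The whitespace after the punctuation is replaced by the marker, so the resulting entries do not start with a leading space.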



Jabberwock  Identity Verified
Poland
Local time: 10:16
Member (2004)
English to Polish
Alignment Sep 6, 2012

If you have the SDLXLIFF files, you can import them into LiveDocs in memoQ. If you use "Add document" and not "Import bilingual", the source will be re-segmented and most (if not all) segments should be aligned correctly.

Joakim Braun  Identity Verified
Sweden
Local time: 10:16
German to Swedish
+ ...
Yes Sep 6, 2012

That's a simple menu command in my MacOSX app TMXReader.
(Currently in beta and not released, but get back to me if you have a Mac.)

(Of course, this only works if the source and target sentence count is identical.)



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 09:16
Member (2009)
Dutch to English
+ ...
it's a few years later, but ... anyone made any progress re re-segmenting TMXs? May 18

Has anyone made any progress in this area?

I'm currently fiddling with my segmentation rules in Déjà Vu X3, to create special rules for translating patents. However, I then realised that I would also like to edit my big PATENTS.tmx to reflect my new seg rules, and stumbled across this page while googling "Resegment TMX"… (not at all surprised to see your name pop up, Hans).

The gist of my patent-specific seg rules revolves around segmenting the src text at semicolons, and at certain standard phrases, such as "characterised in that...", etc. (In Dutch: daardoor gekenmerkt dat; met het kenmerk dat; gekenmerkt doordat)

So I figured I would need to go through my TMX and re-segment it at the specific markers. However, there is obviously the problem that these markers are only in my source text, so where would I cut the target text in each corresponding TU? A bit of a mystery.
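For the source side only, rules like these can be approximated with a regular expression. The Python sketch below is just an illustration: the function name is made up, the phrase list is the one quoted above and surely incomplete, and breaking just before a marker phrase (rather than after it) is an assumption. It does nothing about the open question of where to cut the target text.

```python
import re

# Phrases quoted in the post; the list is illustrative, not complete.
MARKER_PHRASES = [
    "daardoor gekenmerkt dat",
    "met het kenmerk dat",
    "gekenmerkt doordat",
]

# Break after a semicolon, or just before one of the marker phrases.
BREAK = re.compile(
    r"(?<=;)\s+|\s+(?=(?:"
    + "|".join(re.escape(p) for p in MARKER_PHRASES)
    + r")\b)"
)


def segment_patent_source(text):
    """Split a source-language claim into segments at the break points."""
    return [part.strip() for part in BREAK.split(text) if part.strip()]


claim = ("Inrichting omvattende een huis; een deksel, "
         "met het kenmerk dat het deksel scharniert.")
for segment in segment_patent_source(claim):
    print(segment)
```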

Michael

also discussing this over @ https://groups.yahoo.com/neo/groups/dejavu-l/conversations/messages/137044

[Edited at 2017-05-18 16:18 GMT]



CafeTran Training
Netherlands
Local time: 10:16
Member (2016)
New idea May 20

Michael Joseph Wdowiak Beijer wrote:

Has anyone made any progress in this area?



Hi Michael,

Thank you for bringing this important task back to my attention! I've just posted a new idea here:

https://cafetran.freshdesk.com/support/discussions/topics/6000048846

Let's see what Igor (the developer of CafeTran, who normally is extremely responsive) will answer.

Have a great weekend!

Hans



CafeTran Training
Netherlands
Local time: 10:16
Member (2016)
CafeTran can split TMX units (TUs) now May 25

Yesterday's build of CafeTran offers a useful feature to re-segment TMX files:

https://cafetran.freshdesk.com/support/discussions/topics/6000048934

You can insert the split characters manually or via a replacement action using regular expressions.

[Images: a TU before re-segmentation (a) and after re-segmentation (b)]



Samuel Murray  Identity Verified
Netherlands
Local time: 10:16
Member (2006)
English to Afrikaans
+ ...
I can't help but I can comment May 25

Hans Lenting wrote:
Is there a tool to re-segment TMX files *without further conversion* in such a way that every sentence (and its translation) is placed in its own TU?


1. No, I have had to do this myself, and I could only accomplish it with conversions and with loss of information (i.e. the resulting TM can only be used as a reference TM, not an active TM). What I did in the past was to convert the TMX file to WFC TM format, then convert that using a hack script of mine to PO format, then use "posegment" (from TranslateHouse) to split it into sentences, then use "po2tmx" (also from TranslateHouse) to convert it to TMX (and if I wanted WFC TM format, I'd convert the TMX to WFC TM as well). The posegment program is not interactive: when it encounters a paragraph segment with a dissimilar number of sentences in the source and target fields, it simply creates a sentence segment for each source text sentence, containing the entire paragraph's text as the target text. This allowed a CAT tool to perform fuzzy matching on the sentences, but required the user to manually check the translations (i.e. don't use pre-translation on such a TM). Personally, my solution to that would have been to simply add a single dummy character to each source text (i.e. in cases of dissimilarity), to ensure that it never gives a 100% match to anything.
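To make the fallback behaviour concrete, here is a small Python sketch of the pairing logic as described above. It is not posegment's actual code: the function names and the `dummy` parameter are made up for illustration, and the sentence splitter is a naive regex.

```python
import re

# Naive splitter: break after ., ! or ? when whitespace follows.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Return the non-empty sentences found in text."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def pair_sentences(source, target, dummy=""):
    """Pair source and target sentences one to one where possible.

    When the counts differ, every source sentence is paired with the
    whole target paragraph. An optional dummy character is appended to
    the source in that case, so the entry can never be a 100% match.
    """
    src = split_sentences(source)
    tgt = split_sentences(target)
    if len(src) == len(tgt):
        return list(zip(src, tgt))
    return [(s + dummy, target) for s in src]


print(pair_sentences("A. B.", "X. Y."))
print(pair_sentences("A. B.", "X and Y.", dummy="\u00a4"))
```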

2. Apparently, you can do the conversion using OmegaT, by tricking OmegaT into thinking that you've loaded an old project from the days when OmegaT could not do sentence segmentation. This might be an option for you, since there is no "conversion" to other formats -- OmegaT takes your TMX file and rewrites it. The instructions are given by Didier here, ...but I have just tried it again and can't get it to work (perhaps you can).


[Edited at 2017-05-25 09:34 GMT]



Meta Arkadia
Local time: 15:16
English to Indonesian
+ ...
It can be done, but May 25

Samuel Murray wrote: ...with loss of information


I provided a solution:

  1. Open the TMX file in CafeTran Edit TM
  2. Export it to a two-column HTML file
  3. Open the HTML in Word
  4. Select the SL column, and save it as a txt file. Repeat with the TL column
  5. Use a regex to replace all punctuation in both files with a hard return, or, if you want to keep the punctuation, use a regex that adds a hard return after the punctuation (this may result in empty "rows" which should be deleted, either before or after point 6)
  6. Align using CT's aligner (auto should do) or another aligner
  7. Import the aligned file into a TMX memory

    But you will lose any and all metadata present in the original TMX file.

    Since the source TMX is basically an aligned file, and the operations are performed on both files, aligning shouldn't be a problem. Unless...

    Since those operations usually involve punctuation, there can be problems, like with decimal dots (and God knows what else).

    An advantage of my approach would be that you can do several operations/punctuations at once.

    Still, there are many things that can - and therefore will - go wrong.
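As an illustration of the regex step that keeps the punctuation, here is a short Python sketch. The function name is made up, and the choice of punctuation marks is an assumption. It only breaks when whitespace follows the punctuation mark, which leaves decimal dots alone, but abbreviations such as "e.g." will still be split wrongly.

```python
import re

# Break after . ! ? ; or : only when a space or tab follows, so the dot
# in "3.14" or "1.000" is left alone.
BREAK_AFTER = re.compile(r"([.!?;:])[ \t]+")


def one_sentence_per_line(text):
    """Add a hard return after punctuation and drop empty rows."""
    broken = BREAK_AFTER.sub(r"\1\n", text)
    return "\n".join(
        line.strip() for line in broken.splitlines() if line.strip()
    )


print(one_sentence_per_line("De prijs is 3.14 euro. Dat is weinig."))
```

Running the same function over the SL and TL text files keeps the operation identical on both sides, which is what the alignment step relies on.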

    Cheers,

    Hans


    Samuel Murray  Identity Verified
    Netherlands
    Local time: 10:16
    Member (2006)
    English to Afrikaans
    + ...
    In OmegaT May 25

    Hans Lenting wrote:
    Is there a tool to re-segment TMX files *without further conversion* in such a way that every sentence (and its translation) is placed in its own TU?


    I've found out how to do this in OmegaT.

    1. Open the TMX file in a Unicode-aware text editor and make sure the <header> tag contains segtype="paragraph".

    2. In OmegaT, create a new project (Project > New), and remember to untick "Enable sentence-level segmenting" in the project's properties. Add at least one dummy translatable file to the project (e.g. drag and drop), and translate at least one segment (and save with Ctrl+S). Then close OmegaT.

    3. Rename your TMX file to "project_save.tmx" and put it in the /omegat/ subfolder of the OmegaT project folder (i.e. replace the existing file called "project_save.tmx").

    4. In OmegaT, open the project again, and press Ctrl+E (project properties). Now tick the option "Enable sentence-level segmenting". Translate at least one segment, and save (Ctrl+S).

    5. Close OmegaT again.

    If all goes to plan, your TMX is now sentence segmented. Please try it and let us know if it works.
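Steps 1 and 3 are plain file operations and could be scripted. The Python sketch below is only an illustration: the function names are made up, it assumes a UTF-8 file with double-quoted attributes, and it edits the header as text rather than through an XML parser so that the rest of the file is left as it is. The OmegaT steps (2, 4 and 5) still have to be done by hand.

```python
import re
from pathlib import Path


def set_paragraph_segtype(tmx_text):
    """Force segtype="paragraph" in the <header> tag of a TMX document."""
    match = re.search(r"<header\b[^>]*>", tmx_text)
    if match is None:
        raise ValueError("no <header> element found")
    header = match.group(0)
    if re.search(r'\bsegtype="[^"]*"', header):
        # Replace whatever value is there, e.g. segtype="sentence".
        new_header = re.sub(r'\bsegtype="[^"]*"', 'segtype="paragraph"', header)
    else:
        # No segtype attribute at all: add one.
        new_header = header.replace("<header", '<header segtype="paragraph"', 1)
    return tmx_text[:match.start()] + new_header + tmx_text[match.end():]


def install_as_project_save(tmx_path, project_dir, encoding="utf-8"):
    """Write the adjusted TMX as omegat/project_save.tmx in the project."""
    target = Path(project_dir) / "omegat" / "project_save.tmx"
    text = Path(tmx_path).read_text(encoding=encoding)
    target.write_text(set_paragraph_segtype(text), encoding=encoding)
    return target
```

For example, `install_as_project_save("PATENTS.tmx", "my_project")` would overwrite the existing project_save.tmx, so it is worth keeping a backup of the original TMX first.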

