Re-segmentation of TMX files: is there a tool for this?
Thread poster: Hans Lenting

Hans Lenting  Identity Verified
Netherlands
Member (2006)
German to Dutch
+ ...
Sep 5, 2012

Hi,

Sometimes I have TMX files where segmentation isn't ideal. E.g. when translating SDLXLIFF files in another CAT tool I get TUs that contain more than one sentence.

Is there a tool to re-segment TMX files *without further conversion* in such a way that every sentence (and its translation) is placed in its own TU? In an ideal world this tool would use SRX rule files.

Thanks for your suggestions!

Hans

[Edited at 2012-09-05 19:29 GMT]


FarkasAndras
Local time: 17:26
English to Hungarian
+ ...
Hardly Sep 5, 2012

I'd be very surprised if such a tool existed. It wouldn't be too difficult, though, to extract the two texts from the TMX and then segment and align them. All the software you'd need is available in my projects on sourceforge.net/projects/aligner. The only trouble is that you would lose all metadata, and you might run into some issues as this process is largely untested.


Dominique Pivard  Identity Verified
Local time: 18:26
Finnish to French
TM-driven segmentation in memoQ Sep 6, 2012

Hans Lenting wrote:
Sometimes I have TMX files where segmentation isn't ideal. E.g. when translating SDLXLIFF files in another CAT tool I get TUs that contain more than one sentence.

Is there a tool to re-segment TMX files *without further conversion* in such a way that every sentence (and its translation) is placed in its own TU? In an ideal world this tool would use SRX rule files.

memoQ has a feature called "TM-driven segmentation".

I believe it does what you want to do, but the other way round: rather than adapting the segmentation of your TM to your document, it adapts the segmentation of your document to that of the TM.

The feature works very well and I'm surprised it hasn't been copied by other tools (AFAIK). Maybe it's something you should suggest to Igor.



Hans Lenting  Identity Verified
Netherlands
Member (2006)
German to Dutch
+ ...
TOPIC STARTER
Is this simple? Sep 6, 2012

Dominique Pivard wrote:


I believe it does what you want to do, but the other way round: rather than adapting the segmentation of your TM to your document, it adapts the segmentation of your document to that of the TM.



I'm afraid not ... What I want is this: translate an SDLXLIFF file in another CAT tool (e.g. CafeTran), then change multi-sentence segments into segments containing only one sentence (or sentence part) each. This is what we have:

«tu tuid="1" creationdate="20120905T225134Z" creationid="HL"»
«tuv xml:lang="de-DE"»«seg»Dies ist der erste Satz. Dies der Zweite. Und dies ist der dritte Satz.«/seg»«/tuv»
«tuv xml:lang="nl-NL"»«seg»Dit is de eerste zin. Dit de tweede. En dit is de derde zin.«/seg»«/tuv»«/tu»

And this is what we want:

«tu tuid="1" creationdate="20120905T225134Z" creationid="HL"»
«tuv xml:lang="de-DE"»«seg»Dies ist der erste Satz. «/seg»«/tuv»
«tuv xml:lang="nl-NL"»«seg»Dit is de eerste zin. «/seg»«/tuv»«/tu»

«tu tuid="2" creationdate="20120905T225134Z" creationid="HL"»
«tuv xml:lang="de-DE"»«seg»Dies der Zweite. «/seg»«/tuv»
«tuv xml:lang="nl-NL"»«seg»Dit de tweede. «/seg»«/tuv»«/tu»

«tu tuid="3" creationdate="20120905T225134Z" creationid="HL"»
«tuv xml:lang="de-DE"»«seg»Und dies ist der dritte Satz.«/seg»«/tuv»
«tuv xml:lang="nl-NL"»«seg»En dit is de derde zin.«/seg»«/tuv»«/tu»

Where the number of sentences per TU in both source and target is identical, this process can be automated. In cases where the numbers differ (because of joining/splitting) a simple dialog box should pop up.
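
The automatic case described above could be sketched roughly as follows in Python. This is only an illustration under several assumptions: the TMX uses ordinary angle brackets, the segments contain no inline tags, and a naive "break after . ! ?" rule stands in for real SRX rules, which this sketch does not read. The names `split_sentences` and `resegment_tu` are made up for the example.

```python
import copy
import re
import xml.etree.ElementTree as ET

# Naive stand-in for SRX rules (assumption): break after ., ! or ?
# when followed by whitespace. Abbreviations will be split wrongly.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split plain text into sentences using the naive rule above."""
    return [s for s in SENTENCE_BREAK.split(text.strip()) if s]


def resegment_tu(tu):
    """Return one new <tu> per sentence, or [tu] unchanged if unsafe.

    The TU is left alone when a <seg> is missing or contains inline
    elements, when there is only one sentence, or when the sentence
    counts of the language variants differ (the manual case).
    """
    segs = [tuv.find("seg") for tuv in tu.findall("tuv")]
    if not segs or any(seg is None or len(seg) > 0 for seg in segs):
        return [tu]
    parts = [split_sentences(seg.text or "") for seg in segs]
    n = len(parts[0])
    if n < 2 or any(len(p) != n for p in parts):
        return [tu]
    new_tus = []
    for i in range(n):
        # Copy the whole TU so creationdate, creationid etc. survive.
        new_tu = copy.deepcopy(tu)
        # Drop tuid to avoid duplicate IDs; renumbering is left out here.
        new_tu.attrib.pop("tuid", None)
        for new_tuv, sentences in zip(new_tu.findall("tuv"), parts):
            new_tuv.find("seg").text = sentences[i]
        new_tus.append(new_tu)
    return new_tus
```

To process a whole file, one would parse it with `ET.parse`, call `resegment_tu` on every `tu` under `body`, and collect the TUs that come back unchanged because of differing counts for manual review.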

[Edited at 2012-09-06 06:55 GMT]


Adam Łobatiuk  Identity Verified
Poland
Local time: 17:26
Member (2009)
English to Polish
+ ...
Not simple but might work Sep 6, 2012

I think you can use Olifant (part of the free Okapi tools) and its "Split entries on markers" feature. This is what the Help says:


Splits the selected entries according to the split markers in the source and target text. If no entry is selected, the entry where the cursor is located is automatically selected.

The split markers should be the text "[$SPLIT$]". The first segment replaces the text of the original entry, the other segments are copied into new entries.

For example:

Original entry source: "Open File...[$SPLIT$]Open"
Original entry target: "Ouvrir un fichier[$SPLIT$]Ouvrir"

Gives:

Original entry source: "Open File..."
Original entry target: "Ouvrir un fichier"
New entry source: "Open"
New entry target: "Ouvrir"


You would probably have to sort your TMX file for entries which include more than one full stop or some other specific feature, and then replace the full stop with a full stop and the marker. So, it won't be fully automatic, but still might do some work for you.
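
The marker-insertion step could also be scripted before the TM is opened in Olifant. The following Python sketch is only a rough illustration: `add_split_markers` is a made-up helper, the "break after . ! ?" rule is a crude assumption, and it puts the marker in place of the whitespace after the full stop (as in the Help example) rather than keeping that whitespace. It only touches entries where source and target have the same number of break points.

```python
import re

SPLIT_MARKER = "[$SPLIT$]"

# Whitespace that follows ., ! or ? and precedes more text (assumed rule).
BREAK = re.compile(r"(?<=[.!?])\s+(?=\S)")


def add_split_markers(source, target):
    """Insert the split marker at sentence breaks in both texts.

    Returns the pair unchanged if there is nothing to split or if the
    number of breaks differs, so those entries can be handled by hand.
    """
    src_breaks = len(BREAK.findall(source))
    tgt_breaks = len(BREAK.findall(target))
    if src_breaks == 0 or src_breaks != tgt_breaks:
        return source, target
    return BREAK.sub(SPLIT_MARKER, source), BREAK.sub(SPLIT_MARKER, target)
```

Entries that come back unchanged but contain more than one full stop are the ones that would still need manual attention.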



Jabberwock  Identity Verified
Poland
Local time: 17:26
Member (2004)
English to Polish
Alignment Sep 6, 2012

If you have the SDLXLIFF files, you can import them into LiveDocs in memoQ. If you use "Add document" and not "Import bilingual", the source will be resegmented and most (if not all) segments should be aligned correctly.

Joakim Braun  Identity Verified
Sweden
Local time: 17:26
German to Swedish
+ ...
Yes Sep 6, 2012

That's a simple menu command in my MacOSX app TMXReader.
(Currently in beta and not released, but get back to me if you have a Mac.)

(Of course, this only works if the source and target sentence count is identical.)



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 16:26
Member (2009)
Dutch to English
+ ...
it's a few years later, but ... anyone made any progress re re-segmenting TMXs? May 18

Has anyone made any progress in this area?

I'm currently fiddling with my segmentation rules in Déjà Vu X3, to create special rules for translating patents. However, I then realised that I would also like to edit my big PATENTS.tmx to reflect my new seg rules, and stumbled across this page while googling "Resegment TMX"… (not at all surprised to see your name pop up, Hans).

The gist of my patent-specific seg rules revolves around segmenting the src text at semicolons, and at certain standard phrases, such as "characterised in that...", etc. (In Dutch: daardoor gekenmerkt dat; met het kenmerk dat; gekenmerkt doordat)

So I figured I would need to go through my TMX and re-segment it at the specific markers. However, there is obviously the problem that these markers are only in my source text, so where would I cut the target text in each corresponding TU? A bit of a mystery.
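
One possible way around this, sketched below in Python, is to define a second list of marker phrases for the target language and to split a TU only when both sides break into the same number of parts. This is an assumption-laden illustration, not a tested solution: `make_breaker` and `split_pair` are hypothetical helpers, the break is assumed to fall after a semicolon and before a marker phrase, and the target phrase list would have to be supplied by the translator.

```python
import re


def make_breaker(phrases):
    """Build a regex matching the whitespace at each assumed break point:
    after a semicolon, or directly before one of the given phrases."""
    alternatives = [r"(?<=;)\s+"]
    alternatives += [r"\s+(?=" + re.escape(p) + r")" for p in phrases]
    return re.compile("|".join(alternatives), re.IGNORECASE)


def split_pair(source, target, src_breaker, tgt_breaker):
    """Split source and target; return aligned (source, target) pairs,
    or None when the part counts differ and manual review is needed."""
    src_parts = [p for p in src_breaker.split(source.strip()) if p]
    tgt_parts = [p for p in tgt_breaker.split(target.strip()) if p]
    if len(src_parts) != len(tgt_parts):
        return None
    return list(zip(src_parts, tgt_parts))


# Source phrases from the post; the target list is an assumption.
NL_BREAKER = make_breaker(
    ["daardoor gekenmerkt dat", "met het kenmerk dat", "gekenmerkt doordat"])
EN_BREAKER = make_breaker(["characterised in that"])
```

Pairing the parts by position only works when the translation keeps the same order of clauses, so any TU for which `split_pair` returns `None`, or where the order was changed, would still have to be cut by hand.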

Michael

also discussing this over @ https://groups.yahoo.com/neo/groups/dejavu-l/conversations/messages/137044

[Edited at 2017-05-18 16:18 GMT]



CafeTran Training
Netherlands
Local time: 17:26
Member (2016)
New idea May 20

Michael Joseph Wdowiak Beijer wrote:

Has anyone made any progress in this area?



Hi Michael,

Thank you for bringing this important task back to my attention! I've just posted a new idea here:

https://cafetran.freshdesk.com/support/discussions/topics/6000048846

Let's see what Igor (the developer of CafeTran, who normally is extremely responsive) will answer.

Have a great weekend!

Hans



