Segmentation rule that doesn't split segments at the TAB character?
Thread poster: Hans Lenting

Hans Lenting  Identity Verified
Netherlands
Member (2006)
German to Dutch
Feb 27, 2020

I want to translate MS Word documents like this one:

[Image: sample MS Word document]

There should be no splitting at the TAB characters.

I've tried OmegaT's and LanguageTerminal's SRX files, to no avail.

Could someone please provide an SRX file that offers the option to ignore TAB characters when segmenting?


 

Samuel Murray  Identity Verified
Netherlands
Local time: 01:23
Member (2006)
English to Afrikaans
@Hans Feb 27, 2020

Hans Lenting wrote:
I've tried OmegaT's ... SRX file, to no avail.


Where did you get "OmegaT's SRX file"? I have OmegaT but there is no SRX file here. I know OmegaT's own segmentation file is based off of SRX, but how do you convert OmegaT's segmentation rules to an actual SRX file?

Could someone please provide an SRX file that offers the option to ignore TAB characters when segmenting?


It is my understanding that SRX does not segment on tab by default, which would mean that either tabs are indicated as break positions somewhere in your current SRX file, or your CAT tool automatically segments by tab before processing the SRX file.

In SRX (according to the 2008 specification), tabs are indicated using \t or \u0009. Does your SRX file specify a break at \t or \u0009 at all?
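One way to answer that question without reading the whole file by eye is a small script. The following is only a sketch: the function name `find_tab_rules` and the stripped-down `SAMPLE` document are made up for illustration, and the check is purely textual, so it will not notice rules that match tabs indirectly (for example through `\s`). It assumes that a rule without a `break` attribute counts as a break rule.

```python
import xml.etree.ElementTree as ET

# Textual spellings of a tab that may appear inside an SRX regular expression.
TAB_SPELLINGS = ("\\t", "\\u0009", "\\x09")


def local(tag):
    """Strip the XML namespace so the check works with or without one."""
    return tag.rsplit("}", 1)[-1]


def find_tab_rules(srx_text):
    """Return (break, beforebreak, afterbreak) for every rule that mentions a tab."""
    root = ET.fromstring(srx_text)
    hits = []
    for el in root.iter():
        if local(el.tag) != "rule":
            continue
        parts = {local(child.tag): (child.text or "") for child in el}
        before = parts.get("beforebreak", "")
        after = parts.get("afterbreak", "")
        if any(s in before or s in after for s in TAB_SPELLINGS):
            # Assumption: a missing break attribute means break="yes".
            hits.append((el.get("break", "yes"), before, after))
    return hits


# Hypothetical, stripped-down SRX document (header omitted), for illustration only.
SAMPLE = r"""<srx xmlns="http://www.lisa.org/srx20" version="2.0">
  <body><languagerules><languagerule languagerulename="Default">
    <rule break="no"><beforebreak>\t</beforebreak><afterbreak></afterbreak></rule>
    <rule break="yes"><beforebreak>[.?!]+</beforebreak><afterbreak>\s</afterbreak></rule>
  </languagerule></languagerules></body>
</srx>"""

for rule in find_tab_rules(SAMPLE):
    print(rule)
```

If this reports no rule with break set to "yes", the SRX file itself is probably not what splits at tabs.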

There should be no splitting at the TAB characters.


When I get files like this, I replace the tabs with e.g. {{TAB}} (and mark them as internal text), then do the translation, and then afterwards replace the {{TAB}}s with actual tabs again. Ditto line feeds. Ditto line breaks inside tables, etc. You're very brave to fiddle with your CAT tool's advanced segmentation settings.
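For plain text, that placeholder round trip can be sketched as follows. This is an illustration only: the function names and the `{{LF}}` token are assumptions (the post only mentions `{{TAB}}`), and a real DOCX would need the replacement done in Word or in the document XML rather than on a plain string.

```python
TAB_TOKEN = "{{TAB}}"
LF_TOKEN = "{{LF}}"  # hypothetical token for line feeds, by analogy with {{TAB}}


def protect(text):
    """Replace tabs and line feeds with visible placeholders before translation."""
    if TAB_TOKEN in text or LF_TOKEN in text:
        # Refuse to continue if the source already contains a placeholder,
        # otherwise restore() would turn it into a real tab or line feed.
        raise ValueError("placeholder already occurs in the text")
    return text.replace("\t", TAB_TOKEN).replace("\n", LF_TOKEN)


def restore(text):
    """Turn the placeholders back into real tabs and line feeds after translation."""
    return text.replace(TAB_TOKEN, "\t").replace(LF_TOKEN, "\n")


source = "Name\tAddress\nCity\tCountry"
protected = protect(source)
print(protected)
print(restore(protected) == source)
```

The placeholders still have to be marked as internal (non-translatable) text in the CAT tool so that they survive translation unchanged.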


[Edited at 2020-02-27 12:20 GMT]


 

Hans Lenting  Identity Verified
Netherlands
Member (2006)
German to Dutch
TOPIC STARTER
@Samuel Feb 27, 2020

Samuel Murray wrote:

Where did you get "OmegaT's SRX file"? I have OmegaT but there is no SRX file here. I know OmegaT's own segmentation file is based off of SRX, but how do you convert OmegaT's segmentation rules to an actual SRX file?


Here: https://raw.githubusercontent.com/omegat-org/omegat/master/src/org/omegat/core/segmentation/defaultRules.srx

It is my understanding that SRX does not segment on tab by default, which would mean that either tabs are indicated as break positions somewhere in your current SRX file, or your CAT tool automatically segments by tab before processing the SRX file.


I think that the splitting is related to how tab characters are represented in e.g. MS Word's XML.

Screenshot 2020-02-27 at 13.35.03


 

Samuel Murray  Identity Verified
Netherlands
Local time: 01:23
Member (2006)
English to Afrikaans
@Hans Feb 27, 2020

Hans Lenting wrote:
I think that the splitting is related to how tab characters are represented in e.g. MS Word's XML.
Screenshot 2020-02-27 at 13.35.03


Could be... perhaps you are using a CAT tool that is either unable to ignore the tags between "Cell"s for segmentation purposes or unable to treat those tags as internal text (i.e. line-level tagging) instead of external text (i.e. block-level tagging). But that would be odd, in my opinion, since most text in a DOCX file is surrounded by lots of tags anyway.

Automatic splitting by tab can also be something that is built into the DOCX filter that is used by your CAT tool. The SRX rules are applied to the extracted (filtered) text, so if the filter itself splits by tab, then no amount of SRX fiddling is going to solve it.
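To see what the filter actually receives, one can look inside the DOCX itself. In WordprocessingML a tab in the text is not a character but a `<w:tab/>` element inside a run, so a filter may well treat it as a tag or a break before any SRX rule is applied. The sketch below (the function name `tabs_per_paragraph` is made up) counts those elements per paragraph; it deliberately ignores the `<w:tab>` entries under paragraph properties, which only define tab stops, and it makes no attempt to handle paragraphs nested in text boxes.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used in word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def tabs_per_paragraph(docx_file):
    """Return, for each paragraph, the number of <w:tab/> elements inside its runs."""
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    counts = []
    for paragraph in root.iter(f"{{{W}}}p"):
        count = 0
        for run in paragraph.iter(f"{{{W}}}r"):
            # Only direct children of a run are text tabs; tab stop
            # definitions live under w:pPr/w:tabs and are skipped here.
            count += len(run.findall(f"{{{W}}}tab"))
        counts.append(count)
    return counts
```

Calling `tabs_per_paragraph("sample.docx")` (with a hypothetical file name) on the problem document shows which paragraphs contain tab elements; if the CAT tool splits exactly at those positions regardless of the SRX rules, the filter is the likely culprit.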


 

Michael Beijer  Identity Verified
United Kingdom
Local time: 00:23
Member (2009)
Dutch to English
http://185.13.37.79 Feb 27, 2020

Hans Lenting wrote:

I want to translate MS Word documents like this one:

[Image: sample MS Word document]

There should be no splitting at the TAB characters.

I've tried OmegaT's and LanguageTerminal's SRX files, to no avail.

Could someone please provide an SRX file that offers the option to ignore TAB characters when segmenting?


Hi Hans,

I'd ask the guy in charge of the DGT-OmegaT project. See: http://185.13.37.79/
He knows a lot about SRX rules and about making Studio stuff work with OmegaT.

I think Studio does what you want by default: it leaves tabs in place without segmenting at them. But I can't figure out how to export the default Studio segmentation rules to SRX.

Michael

[Edited at 2020-02-27 21:38 GMT]


 

