Using CafeTran to re-segment a legacy TMX file
Thread poster: CafeTran Training

CafeTran Training
Netherlands
Local time: 07:45
Member (2016)
May 25

Sometimes you receive a legacy TMX file that has been segmented paragraph-wise, while you want to use it for your current translation project that has been segmented after every sentence.

Here are two videos that show you how to re-segment the legacy TMX file (note that TU attributes will get lost, but how relevant is that for legacy TMX files anyway?)

Re-segmenting a TM - Part 1
https://youtu.be/h7xEoARMKB0

Re-segmenting a TM - Part 2
https://youtu.be/gEBbA4okhdk

[Edited at 2017-05-25 18:10 GMT]



CafeTran Training
Netherlands
Local time: 07:45
Member (2016)
TOPIC STARTER
Improved regular expression May 27

Here's an improved regular expression that will cover texts like:

1

Expression: (?<=[a-z%\d\\)][\.\!\?]) (?=([a-z]?[A-Z]))

Result:

2
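For readers who want to try the expression outside CafeTran, here is a minimal Python sketch. The sample text and the helper name `split_sentences` are made up for illustration, and CafeTran's own regex engine may differ from Python's in details.

```python
import re

# The expression from the post, copied verbatim.
PATTERN = re.compile(r"(?<=[a-z%\d\\)][\.\!\?]) (?=([a-z]?[A-Z]))")

def split_sentences(text):
    """Split text at every space matched by PATTERN."""
    # re.sub is used rather than re.split because the pattern contains a
    # capturing group, which re.split would add to its result list.
    # Assumes the input itself contains no newlines.
    return PATTERN.sub("\n", text).split("\n")

sample = "Sales rose by 5%. iPhone demand fell. Was it 2016? Yes (mostly). Done!"
for sentence in split_sentences(sample):
    print(sentence)
```

The expression breaks after a lowercase letter, digit, `%` or `)` followed by `.`, `!` or `?`, provided the next word starts with a capital (optionally preceded by one lowercase letter, as in "iPhone"). It also breaks after abbreviations such as "Dr." in "Dr. Smith", so it needs adapting per text type.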

[Edited at 2017-05-27 08:27 GMT]



Meta Arkadia
Local time: 12:45
English to Indonesian
Improve it a little more. And then again... May 27

CafeTran Training wrote:
Here's an improved regular expression


Using regexes to achieve this will be an endless pain:



My plan still is:

  • Open the wretched TMX file in CafeTran's Edit TMX mode
  • Export as HTML (other export formats may be possible)
  • Open the HTML in Word, save as .docx
  • Select both Word columns one by one, and open them in CafeTran one by one, sentence segmentation enabled
  • Save
  • Align (auto)

    Using the CafeTran segmentation rules is a lot easier than trying to write a rather complicated regex, but the main advantage is that the result will be compliant with the segmentation of the source document(s).
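As a rough stand-in for the export steps in the plan above, the two "columns" can also be pulled out of a TMX file with a short script. This is only a sketch: the function name `tmx_columns` and the sample TMX are invented for illustration, the sample header is stripped down to the bare minimum, and only the `xml:lang` attribute is looked at.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def tmx_columns(tmx_text, src_lang, tgt_lang):
    """Return (sources, targets): two parallel lists of paragraph texts."""
    sources, targets = [], []
    for tu in ET.fromstring(tmx_text).iter("tu"):
        texts = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG, "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text inside inline tags, if any.
                texts[lang] = "".join(seg.itertext()).strip()
        # Keep only units that have both languages, so the lists stay parallel.
        if src_lang in texts and tgt_lang in texts:
            sources.append(texts[src_lang])
            targets.append(texts[tgt_lang])
    return sources, targets

# Hypothetical, minimal sample (a real TMX header carries more attributes).
SAMPLE_TMX = """<tmx version="1.4"><header/><body>
  <tu>
    <tuv xml:lang="en"><seg>It rained. We stayed in.</seg></tuv>
    <tuv xml:lang="nl"><seg>Het regende. We bleven binnen.</seg></tuv>
  </tu>
</body></tmx>"""

sources, targets = tmx_columns(SAMPLE_TMX, "en", "nl")
print(sources)
print(targets)
```

Each list can then be saved as a plain text file and opened in CafeTran with sentence segmentation enabled, as in the plan.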

    I'm still trying to avoid aligning - even though it shouldn't present a problem - but so far no luck.

    Cheers,

    Hans


    Michael Joseph Wdowiak Beijer
    United Kingdom
    Local time: 06:45
    Member (2009)
    Dutch to English
    I'm puzzled May 27

    CafeTran Training wrote:

    Sometimes you receive a legacy TMX file that has been segmented paragraph-wise, while you want to use it for your current translation project that has been segmented after every sentence.

    Here are two videos that show you how to re-segment the legacy TMX file (note that TU attributes will get lost, but how relevant is that for legacy TMX files anyway?)

    Re-segmenting a TM - Part 1
    https://youtu.be/h7xEoARMKB0

    Re-segmenting a TM - Part 2
    https://youtu.be/gEBbA4okhdk

    [Edited at 2017-05-25 18:10 GMT]


    To be honest, I don't understand what you're doing. When you re-segment a TMX file that was segmented on paragraphs in the past, what do you do with instances where the translator merged several sentences in the src into a single sentence in the target? Your whole system seems to be based on the assumption that there will always be a one-to-one correspondence.

    Michael
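One cautious way to handle the concern above is to split a paragraph pair only when both sides yield the same number of sentences, and to keep the paragraph pair intact otherwise. The sketch below assumes a deliberately naive break pattern and a made-up helper name, `resegment_pair`; it is not what the videos do.

```python
import re

# Deliberately naive: break at a space after . ! ? when a capital follows.
SENTENCE_BREAK = re.compile(r"(?<=[.!?]) (?=[A-Z])")

def resegment_pair(source, target):
    """Split a paragraph pair into sentence pairs only if the counts match."""
    src_sentences = SENTENCE_BREAK.split(source)
    tgt_sentences = SENTENCE_BREAK.split(target)
    if len(src_sentences) == len(tgt_sentences):
        return list(zip(src_sentences, tgt_sentences))
    # Counts differ (e.g. sentences were merged): keep the paragraph as is.
    return [(source, target)]

print(resegment_pair("It rained. We stayed in.",
                     "Het regende. We bleven binnen."))
print(resegment_pair("It rained. We stayed in.",
                     "Omdat het regende, bleven we binnen."))
```

Equal counts do not prove that the sentences correspond one to one (a merge and a split in the same paragraph cancel out), so the result still needs checking.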



    CafeTran Training
    Netherlands
    Local time: 07:45
    Member (2016)
    TOPIC STARTER
    Not a problem May 28

    Michael Joseph Wdowiak Beijer wrote:

    To be honest, I don't understand what you're doing. When you re-segment a TMX file that was segmented on paragraphs in the past, what do you do with instances where the translator merged several sentences in the src into a single sentence in the target? Your whole system seems to be based on the assumption that there will always be a one-to-one correspondence.

    Michael


    That's true, Michael. For the cases that you mention, you'll have to use a slightly different procedure:

    https://cafetran.freshdesk.com/support/discussions/topics/6000048974

    Have a nice Sunday!

    Hans



    CafeTran Training
    Netherlands
    Local time: 07:45
    Member (2016)
    TOPIC STARTER
    Improve a little more, and more, and more ... May 28

    Meta Arkadia wrote:

    Using regexes to achieve this will be an endless pain:



    Our whole life can be an endless pain. Luckily, it doesn't have to be...

    Adapting the necessary regular expressions for different types of texts (e.g. German legal texts versus English marketing documents) offers a nice way to optimise the result of the re-segmentation.
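To illustrate what such adapting could look like, here is a small sketch with one break pattern per language. The patterns, the abbreviation lists and the sample sentences are all invented for illustration; real texts will need longer exception lists.

```python
import re

# One break pattern per language. The negative lookbehinds list exceptions
# where a full stop does not end the sentence.
BREAKS = {
    # English: no break after "Dr." or "Mr."
    "en": re.compile(r"(?<!Dr\.)(?<!Mr\.)(?<=[.!?]) (?=[A-Z])"),
    # German: no break after "Dr." or after a digit plus full stop,
    # which is often an ordinal ("3. Mai").
    "de": re.compile(r"(?<!Dr\.)(?<!\d\.)(?<=[.!?]) (?=[A-ZÄÖÜ])"),
}

def split_sentences(text, lang):
    """Split text using the break pattern registered for lang."""
    return BREAKS[lang].split(text)

german = "Dr. Müller kam am 3. Mai. Er blieb."
print(split_sentences(german, "de"))
print(split_sentences(german, "en"))  # wrong rules: breaks after "3."
print(split_sentences("Dr. Smith came on May 3. He stayed.", "en"))
```

The German digit exception has a cost: a sentence that ends in a number ("... im Jahr 2016. Dann ...") will no longer be split, which is the kind of trade-off that has to be tuned per text type.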

    BTW: How do you think CafeTran segments source texts during import? My guess is that it uses regular expressions for this.



    Meta Arkadia
    Local time: 12:45
    English to Indonesian
    Regexes, and quite a bit more May 28

    CafeTran Training wrote:
    How do you think CafeTran segments source texts during import? My guess is that it uses regular expressions for this.


    CafeTran does use regular expressions, but well-structured ones stored in a markup format: SRX files. What you're trying to do is create one regex for all languages. It's doomed.

    My suggestion is to use CafeTran's segmentation rules. That makes sense, because we're going to use the resulting TMX file in CafeTran. And then align the two segmented files. That makes sense, because aligners use "smart" rules nowadays. Just ask Andras. The aligned file will still have to be checked, though, which isn't very difficult, and probably not very time-consuming.

    Cheers,

    Hans



    Jean Dimitriadis
    France
    Local time: 07:45
    Member (2015)
    English to French
    SRX & Regex May 28

    BTW: How do you think CafeTran segments source texts during import? My guess is that it uses regular expressions for this.


    BTW, I think you are right.

    The SRX (Segmentation Rules eXchange) specification states that "the segmentation rules themselves are represented using regular expressions." - http://www.ttt.org/oscarstandards/srx/srx10.html

    CT uses SRX.
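As far as I understand the SRX format, each rule pairs a "before break" regex with an "after break" regex and is marked as either a break or a no-break (exception) rule, and the first rule that matches at a position decides. The sketch below mimics that idea in Python; the rules shown are hypothetical and are not taken from CafeTran's SRX file.

```python
import re
from typing import List, NamedTuple

class Rule(NamedTuple):
    is_break: bool  # True: break here; False: exception, do not break
    before: str     # regex that must match the text ending at the position
    after: str      # regex that must match the text starting at the position

# Hypothetical rules for illustration only.
RULES = [
    Rule(False, r"\b(?:Dr|Mr|Mrs)\.", r"\s"),  # exception: abbreviations
    Rule(True, r"[.!?]", r"\s+[A-Z]"),         # general sentence break
]

def segment(text: str, rules: List[Rule]) -> List[str]:
    """Split text at every position where the first matching rule is a break."""
    compiled = [
        (r.is_break, re.compile("(?:" + r.before + r")\Z"), re.compile(r.after))
        for r in rules
    ]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # first matching rule decides; skip the remaining rules
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments

print(segment("Dr. Smith arrived. He was late.", RULES))
```

Because the exception rule comes first, "Dr." does not trigger a break. Reversing the rule order would make the general rule win and split after "Dr.", which is why rule order matters in this scheme.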



    CafeTran Training
    Netherlands
    Local time: 07:45
    Member (2016)
    TOPIC STARTER
    Demo purposes only May 28

    Meta Arkadia wrote:

    What you're trying to do is create one regex for all languages.


    Of course not. I'm only demonstrating a technique. No way that I'm trying to create a universal regex for segmenting. Like I wrote: different languages and different text types require different segmentation rules.



    CafeTran Training
    Netherlands
    Local time: 07:45
    Member (2016)
    TOPIC STARTER
    Educated guess May 28

    Jean Dimitriadis wrote:

    BTW: How do you think CafeTran segments source texts during import? My guess is that it uses regular expressions for this.


    BTW, I think you are right.

    In the SRX (Segmentation Rules eXchange) specification file format, it is stated "the segmentation rules themselves are represented using regular expressions." - http://www.ttt.org/oscarstandards/srx/srx10.html

    CT uses SRX.


    Yes, Jean, it was an educated guess (I already had a look at these rules in the CT folder long ago).

    @Van den Broek: Actually, it was me who suggested using CT's segmentation rules. But you're welcome at my party too, Hans. (Please bring your own drinks.)

    And, BTW, please feel free to use your own, preferred solution. I'm merely demonstrating some techniques here. If you don't like them, don't use them. Perhaps others can use some snippets that will come in handy someday.



    Meta Arkadia
    Local time: 12:45
    English to Indonesian
    Well... May 28

    CafeTran Training wrote:
    Like I wrote: different languages and different text types require different segmentation rules.


    So far, you used one regex for both source language and target language.

    It cannot be done. Not your way.

    Cheers,

    Hans



