Using CafeTran to re-segment a legacy TMX file
Thread poster: CafeTran Training (X)

CafeTran Training (X)
Netherlands
Local time: 16:24
May 25, 2017

Sometimes you receive a legacy TMX file that was segmented by paragraph, while your current translation project is segmented by sentence.

Here are two videos that show how to re-segment the legacy TMX file. Note that TU attributes will be lost, but how relevant are they for legacy TMX files anyway?

Re-segmenting a TM - Part 1
https://youtu.be/h7xEoARMKB0

Re-segmenting a TM - Part 2
https://youtu.be/gEBbA4okhdk

[Edited at 2017-05-25 18:10 GMT]


 

CafeTran Training (X)
Netherlands
Local time: 16:24
TOPIC STARTER
Improved regular expression May 27, 2017

Here's an improved regular expression that will cover texts like:

[Screenshot: v5yyyftesl7ry4idwy2r.png]

Expression: (?<=[a-z%\d\\)][\.\!\?]) (?=([a-z]?[A-Z]))

Result:

[Screenshot: dpifoncr3p3iaz7deiw6.png]
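For readers who want to try the expression outside CafeTran, here is a minimal sketch in Python. It assumes that Python's re module reads this particular expression the same way CafeTran does, which has not been verified. The sample sentence and the function name split_sentences are made up for illustration, and the capturing group in the lookahead is written as a non-capturing group because re.split would otherwise return the captured text as well.

```python
import re

# The expression from the post above. One change, made only for this sketch:
# the group inside the lookahead is non-capturing, so that re.split returns
# only the sentences and not the captured characters.
BOUNDARY = re.compile(r"(?<=[a-z%\d\\)][\.\!\?]) (?=(?:[a-z]?[A-Z]))")


def split_sentences(paragraph):
    """Split a paragraph at every space the expression accepts as a boundary."""
    return BOUNDARY.split(paragraph)


# Hypothetical sample text: a percent sign, a closing bracket and a word
# that starts with a lower-case letter followed by a capital.
sample = "Sales rose by 5%. The iPhone sold well (as expected). iPads did not."
for sentence in split_sentences(sample):
    print(sentence)
```

The space inside "The iPhone" is not treated as a boundary because no sentence-final punctuation precedes it, while the space before "iPads" is, thanks to the optional lower-case letter in the lookahead.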

[Edited at 2017-05-27 08:27 GMT]


 

Meta Arkadia
Local time: 21:24
English to Indonesian
Improve it a little more. And then again... May 27, 2017

CafeTran Training wrote:
Here's an improved regular expression


Using regexes to achieve this will be an endless pain:

[Screenshot: Paragraph TMX.png]

My plan still is:

  • Open the wretched TMX file in CafeTran's Edit TMX mode
  • Export as HTML (other export formats may be possible)
  • Open the HTML in Word, save as .docx
  • Select both Word columns one by one, and open them in CafeTran one by one, sentence segmentation enabled
  • Save
  • Align (auto)

    Using the CafeTran segmentation rules is a lot easier than trying to write a rather complicated regex, but the main advantage is that the resulting segmentation will be consistent with that of the source document(s).

    I'm still trying to avoid aligning - even though it shouldn't present a problem - but so far no luck.

    Cheers,

    Hans

     

    Michael Beijer
    United Kingdom
    Local time: 15:24
    Member (2009)
    Dutch to English
    I'm puzzled May 27, 2017

    CafeTran Training wrote:

    Sometimes you receive a legacy TMX file that was segmented by paragraph, while your current translation project is segmented by sentence.

    Here are two videos that show how to re-segment the legacy TMX file. Note that TU attributes will be lost, but how relevant are they for legacy TMX files anyway?

    Re-segmenting a TM - Part 1
    https://youtu.be/h7xEoARMKB0

    Re-segmenting a TM - Part 2
    https://youtu.be/gEBbA4okhdk

    [Edited at 2017-05-25 18:10 GMT]


    To be honest, I don't understand what you're doing. When you re-segment a TMX file that was segmented on paragraphs in the past, what do you do with instances where the translator merged several sentences in the source into a single sentence in the target? Your whole system seems to be based on the assumption that there will always be a one-to-one correspondence.
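To make the concern concrete, here is a rough sketch in Python of one possible safeguard: split both sides of each translation unit and pair the sentences only when the counts match, otherwise keep the unit at paragraph level. This is not the procedure shown in the videos. The function names are hypothetical, the splitting expression is the one posted earlier in this thread, and the sketch assumes bilingual units whose segments contain plain text without inline tags.

```python
import re
import xml.etree.ElementTree as ET

# Splitting expression from earlier in the thread (lookahead group omitted).
BOUNDARY = re.compile(r"(?<=[a-z%\d\\)][\.\!\?]) (?=[a-z]?[A-Z])")


def resegment_unit(source, target):
    """Pair sentences one-to-one, or keep the paragraph pair if counts differ."""
    src_parts = BOUNDARY.split(source)
    tgt_parts = BOUNDARY.split(target)
    if len(src_parts) != len(tgt_parts):
        return [(source, target)]  # merged or split sentences: leave as is
    return list(zip(src_parts, tgt_parts))


def resegment_tmx(tmx_text):
    """Return (source, target) pairs for every bilingual <tu> in a TMX string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        # Assumption: plain-text segments; inline tags would need extra handling.
        segs = [seg.text or "" for seg in tu.iter("seg")]
        if len(segs) == 2:
            pairs.extend(resegment_unit(segs[0], segs[1]))
    return pairs


# Hypothetical two-unit TMX: the second target merges the two source sentences.
demo = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en"><seg>It rains. We stay in.</seg></tuv>
<tuv xml:lang="nl"><seg>Het regent. We blijven binnen.</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>It rains. We stay in.</seg></tuv>
<tuv xml:lang="nl"><seg>Het regent, dus we blijven binnen.</seg></tuv></tu>
</body></tmx>"""

for src, tgt in resegment_tmx(demo):
    print(src, "=>", tgt)
```

Equal sentence counts do not prove that the sentences really correspond, so the output of a check like this would still need a manual review.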

    Michael


     

    CafeTran Training (X)
    Netherlands
    Local time: 16:24
    TOPIC STARTER
    Not a problem May 28, 2017

    Michael Joseph Wdowiak Beijer wrote:

    To be honest, I don't understand what you're doing. When you re-segment a TMX file that was segmented on paragraphs in the past, what do you do with instances where the translator merged several sentences in the source into a single sentence in the target? Your whole system seems to be based on the assumption that there will always be a one-to-one correspondence.

    Michael


    That's true, Michael. For the cases that you mention, you'll have to use a slightly different procedure:

    https://cafetran.freshdesk.com/support/discussions/topics/6000048974

    Have a nice Sunday!

    Hans


     

    CafeTran Training (X)
    Netherlands
    Local time: 16:24
    TOPIC STARTER
    Improve a little more, and more, and more ... May 28, 2017

    Meta Arkadia wrote:

    Using regexes to achieve this will be an endless pain:



    Our whole life can be an endless pain. Luckily, it doesn't have to...

    Adapting the necessary regular expressions for different types of texts (e.g. German legal texts versus English marketing documents) offers a nice way to optimise the result of the re-segmentation.

    BTW: How do you think CafeTran segments source texts during import? My guess is that it uses regular expressions for this.


     

    Meta Arkadia
    Local time: 21:24
    English to Indonesian
    Regexes, and quite a bit more May 28, 2017

    CafeTran Training wrote:
    How do you think CafeTran segments source texts during import? My guess is that it uses regular expressions for this.


    CafeTran does use regular expressions, but well-structured ones stored in a markup format, SRX files. What you're trying to do is create one regex for all languages. It's doomed.

    My suggestion is to use CafeTran's segmentation rules. That makes sense, because we're going to use the resulting TMX file in CafeTran. And then align the two segmented files. That makes sense, because aligners use "smart" rules nowadays. Just ask Andras. The aligned file will still have to be checked, though, which isn't very difficult, and probably not very time-consuming.
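The difference between one regex and structured, per-language rules can be sketched in Python as follows. This is only an illustration in the spirit of SRX, not CafeTran's actual rule set or a real SRX parser: the rule lists, the abbreviations in them and the function name segment are all hypothetical. Each rule has a pattern that must match just before a candidate position and one that must match just after it, and the first rule that matches decides whether there is a break.

```python
import re

# Hypothetical, heavily simplified rule sets:
# (is_break, before_break_regex, after_break_regex), tried in order.
RULES = {
    "en": [
        (False, r"\b(?:Dr|Mr|Mrs|etc)\.", r"\s"),  # abbreviations: no break
        (True, r"[.!?]", r"\s+(?=[A-Z])"),
    ],
    "de": [
        (False, r"\b(?:z\.B|Nr|Abs)\.", r"\s"),  # abbreviations: no break
        (False, r"\d\.", r"\s"),  # ordinals such as "3. Mai": no break
        (True, r"[.!?]", r"\s+(?=[A-ZÄÖÜ])"),
    ],
}


def segment(text, lang):
    """Segment text with the rule list for lang; the first matching rule wins."""
    compiled = [
        (is_break, re.compile(f"(?:{before})\\Z"), re.compile(after))
        for is_break, before, after in RULES[lang]
    ]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            if before.search(text[:pos]) and after.match(text[pos:]):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # an earlier no-break rule overrides later break rules
    segments.append(text[start:].strip())
    return segments


german = "Der Termin ist am 3. Mai. Bitte kommen Sie pünktlich."
print(segment(german, "de"))  # German rules keep "3. Mai" together
print(segment(german, "en"))  # English rules break after the ordinal
```

With the German rules the ordinal stays inside its sentence, while the English rules cut the same text after "3.", which is the kind of mismatch a single expression for both languages runs into.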

    Cheers,

    Hans


     

    Jean Dimitriadis
    France
    Local time: 16:24
    Member (2015)
    English to French
    SRX & Regex May 28, 2017

    BTW: How do you think CafeTran segments source texts during import? My guess is that it uses regular expressions for this.


    BTW, I think you are right.

    The SRX (Segmentation Rules eXchange) specification states that "the segmentation rules themselves are represented using regular expressions." - http://www.ttt.org/oscarstandards/srx/srx10.html

    CT uses SRX.


     

    CafeTran Training (X)
    Netherlands
    Local time: 16:24
    TOPIC STARTER
    Demo purposes only May 28, 2017

    Meta Arkadia wrote:

    What you're trying to do is create one regex for all languages.


    Of course not. I'm only demonstrating a technique. No way that I'm trying to create a universal regex for segmenting. Like I wrote: different languages and different text types require different segmentation rules.


     

    CafeTran Training (X)
    Netherlands
    Local time: 16:24
    TOPIC STARTER
    Educated guess May 28, 2017

    Jean Dimitriadis wrote:

    BTW: How do you think CafeTran segments source texts during import? My guess is that it uses regular expressions for this.


    BTW, I think you are right.

    The SRX (Segmentation Rules eXchange) specification states that "the segmentation rules themselves are represented using regular expressions." - http://www.ttt.org/oscarstandards/srx/srx10.html

    CT uses SRX.


    Yes, Jean, it was an educated guess (I already had a look at these rules in the CT folder long ago).

    @Van den Broek: Actually, it was me who suggested using CT's segmentation rules. But you're welcome at my party too, Hans. (Please bring your own drinks.)

    And, BTW, please feel free to use your own, preferred solution. I'm merely demonstrating some techniques here. If you don't like them, don't use them. Perhaps others can use some snippets that will come in handy someday.


     

    Meta Arkadia
    Local time: 21:24
    English to Indonesian
    Well... May 28, 2017

    CafeTran Training wrote:
    Like I wrote: different languages and different text types require different segmentation rules.


    So far, you used one regex for both source language and target language.

    It cannot be done. Not your way.

    Cheers,

    Hans


     

