segmenting xliff files
Thread poster: Pascal Zotto

Pascal Zotto  Identity Verified
Austria
Local time: 16:35
Member (2009)
Dutch to Letzeburgesch
+ ...
Nov 15, 2016

Hi,

I regularly get xliff files from a client. Being xliff files, they are already pre-segmented.

The problem is that the segmentation could be better, but the client does not know how to set segmentation rules correctly. I need Trados 2015 to segment further, which would make translation much faster because more sub-strings would repeat. Asking the client for better segmentation is not an option!

Normally you can achieve this via the TM settings > language resources > segmentation rules, but either my regex is wrong or something else is off, because no further segmentation takes place.

Or does this simply not work for xliff files?

text contains string:



Nora Diaz  Identity Verified
Mexico
Local time: 08:35
Member (2002)
English to Spanish
+ ...
... Nov 15, 2016

If the file comes segmented from your client, then you can't segment it further. How about splitting those segments? You can use the same regex rules to help automate the splitting somewhat with an Autohotkey script.

Here's an example of an Autohotkey script I created for this:

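; Split AFTER the regex pattern. Copy the regex pattern to the clipboard first.
#c::
InputBox,Var,Split segment AFTER the regex pattern, How many times do you want to run the split sequence?
loop, % Var
{
Send {F6}
Send, ^f ; open the Find and Replace box
Sleep 200
Send, ^v ; paste the regex pattern from the clipboard
Sleep 200
Send !s ; search in source
Sleep 200
Send !u ; check the Use: Regex option
Sleep 200
Send !n ; Find Next
Sleep 200
Send !u ; uncheck the Use: Regex option again
Sleep 200
Send {Esc} ; close the Find and Replace box
Sleep 100
Send {Right} ; deselect the match, cursor lands at its end
Sleep 200
Send {Left} ; step one character back, adjust as needed
Sleep 500
Send !t ; split the segment (Alt+T is my custom shortcut)
Sleep 500
}
Return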


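; Split BEFORE the regex pattern. Copy the regex pattern to the clipboard first.
#v::
InputBox,Var,Split segment BEFORE the regex pattern, How many times do you want to run the split sequence?
loop, % Var
{
Send {F6}
Send, ^f ; open the Find and Replace box
Sleep 200
Send, ^v ; paste the regex pattern from the clipboard
Sleep 200
Send !s ; search in source
Sleep 200
Send !u ; check the Use: Regex option
Sleep 200
Send !n ; Find Next
Sleep 200
Send !u ; uncheck the Use: Regex option again
Sleep 200
Send {Esc} ; close the Find and Replace box
Sleep 100
Send {Left} ; deselect the match, cursor lands at its start
Sleep 200
Send {Right} ; step one character forward, adjust as needed
Sleep 500
Send !t ; split the segment (Alt+T is my custom shortcut)
Sleep 500
}
Return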

I should clarify that I use Alt+T as my splitting shortcut in Studio, so that's what "Send !t" matches. Also, the script checks the Use: Regex option and then unchecks it after running the search, so you should start with a Find and Replace box with all the options in their default state.

[Edited at 2016-11-15 19:00 GMT]



Pascal Zotto  Identity Verified
Austria
Local time: 16:35
Member (2009)
Dutch to Letzeburgesch
+ ...
TOPIC STARTER
autohotkey … Nov 16, 2016

I didn't think of using that one.

I guess it would work if the strings I need to look for were not converted to tags in the final sdlxliff. How can I prevent the conversion to tags for some parts of a string?

Also some information for the new users of Autohotkey (http://ahkscript.org/):
(there is a huge tutorial page there as well)

; Split AFTER the Regex pattern. The regex pattern will first need to be copied to the clipboard
#c::
InputBox,Var,Split segment AFTER the regex pattern, How many times do you want to run the split sequence?


here # and c can be replaced by whatever trigger or letter you want:

triggers:
# stands for Windows logo key
! stands for Alt
+ stands for Shift
^ stands for Control

you can also combine them but not use them doubled (++c):
#!c
!+r
#!+x
or whatever you want
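
To double-check what a combination means before putting it into a script, here is a small sketch in Python (not AutoHotkey). It only uses the four symbols listed above; the function name describe_hotkey is made up for this illustration and is not part of AutoHotkey.

```python
# Modifier symbols exactly as listed in the post above.
MODIFIERS = {"#": "Win", "!": "Alt", "+": "Shift", "^": "Ctrl"}


def describe_hotkey(hotkey: str) -> str:
    """Turn a hotkey string such as '#!c' into 'Win+Alt+C'.

    Hypothetical helper for illustration only. Leading modifier symbols
    are decoded; whatever follows them is treated as the key itself.
    """
    parts = []
    for i, ch in enumerate(hotkey):
        is_last = i == len(hotkey) - 1
        if ch in MODIFIERS and not is_last:
            name = MODIFIERS[ch]
            if name in parts:
                # Mirrors the rule above: a modifier must not be doubled.
                raise ValueError(f"modifier {ch!r} used twice in {hotkey!r}")
            parts.append(name)
        else:
            parts.append(hotkey[i:].upper())
            break
    return "+".join(parts)


if __name__ == "__main__":
    for combo in ("#c", "#!c", "!+r", "#!+x"):
        print(combo, "->", describe_hotkey(combo))
```

The same decoding applies to the strings sent inside the loop, so "!t" reads as Alt+T and "^+t" as Ctrl+Shift+T.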

Sleep 500
Send !t
Sleep 500
}


The letters of the commands in between might also need to be adapted so that they trigger the correct fields. The accelerator keys depend on the language you run Trados in.

Nora's example is for the English UI.

German UI:
#c::
InputBox,Var,Split segment AFTER the regex pattern, How many times do you want to run the split sequence?
loop, % Var
{
Send {F6}
Send, ^f ;open find menu
Sleep 200
Send, ^v ;paste from clipboard
Sleep 200
Send !a ;switch to source
Sleep 200
Send !v ;activate use
Sleep 200
Send !w ;Find Next
Sleep 200
Send !v ;deactivate use
Sleep 200
Send {Esc} ;quit find window
Sleep 100
Send {Right} ;deselect the match, cursor lands at its end
Sleep 200
Send {Left} ;step one character back
Sleep 500
Send !+t ;apply segmentation
Sleep 500
}
Return

; Split BEFORE the Regex pattern. The regex pattern will first need to be copied to the clipboard
#v::
InputBox,Var,Split segment BEFORE the regex pattern, How many times do you want to run the split sequence?
loop, % Var
{
Send {F6}
Send, ^f ;open find menu
Sleep 200
Send, ^v ;paste from clipboard
Sleep 200
Send !a ;switch to source
Sleep 200
Send !v ;activate use
Sleep 200
Send !w ;Find Next
Sleep 200
Send !v ;deactivate use
Sleep 200
Send {Esc} ;quit find window
Sleep 100
Send {Left} ;deselect the match, cursor lands at its start
Sleep 200
Send {Right} ;step one character forward
Sleep 500
Send !+t ;apply segmentation
Sleep 500
}
Return

As Nora said, "Send !t" sends Alt+T, her shortcut for splitting the segment at the cursor position. If you have another shortcut set for this command, you will need to replace this part.
For example, Ctrl+Shift+T would be Send ^+t (the German script above sends !+t, i.e. Alt+Shift+T).

Another hint: while the script is running you cannot switch to another window or program. The keystrokes would be sent to that window instead, possibly triggering hotkeys and changing things there, and nothing will happen in Trados until you return to the Trados window. This matters for very long files where you let the script loop many times, e.g. 30 or far more.

Nora: I have an AutoHotkey hotkey mapped to !n that types ñ. Does this interfere with the !n in this script? (I have never used such long scripts or scripts that share shortcut letters, so I don't know.)

I tried this, but it splits one character too far to the right for #v and one too far to the left for #c.
example for both: (regex for Customer)
Customer xy gets segmented
for #v
C
ustomer xy

for #c
Custome
r xy

Thanks,
Pascal

[Edited at 2016-11-16 12:11 GMT]

[Edited at 2016-11-16 12:39 GMT]



Nora Diaz  Identity Verified
Mexico
Local time: 08:35
Member (2002)
English to Spanish
+ ...
Sample regex pattern Nov 16, 2016

Hi Pascal,

Sorry, I should have provided an example. The script is splitting one character off in your example because I use it to split at xml tags that are followed or preceded by one character (a letter), for example in a case like this:

Capture

Here, after the string is found, the cursor is moved to the right to deselect the string, then one character to the left to place the cursor at the right location for the split.

You can adjust the use of the right and left arrows to place the cursor where you need it before splitting the segment.

By the way, great tips on adapting for different language GUIs!

Best,

Nora



Pascal Zotto  Identity Verified
Austria
Local time: 16:35
Member (2009)
Dutch to Letzeburgesch
+ ...
TOPIC STARTER
that explains it Nov 16, 2016

So in fact, if I search for a regex without a tag, I just need to delete the last Right or Left and the Sleep after it.
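
The cursor arithmetic can be checked with a small Python sketch. It is only a model of what the script does in Studio, not Studio itself. It assumes, as Nora describes, that the first arrow key collapses the selection to one end of the match and every further arrow key moves the cursor by one character. The function names are made up for this illustration.

```python
import re


def split_position(segment, pattern, keys):
    """Return the cursor index after a regex find followed by arrow keys.

    Assumption: with the match selected, a first "Right" puts the cursor
    at the end of the match and a first "Left" at its start; later keys
    move the cursor one character each. Returns None if the pattern is
    not found or no keys are given.
    """
    match = re.search(pattern, segment)
    if match is None:
        return None
    cursor = None
    for key in keys:
        if cursor is None:
            cursor = match.end() if key == "Right" else match.start()
        else:
            cursor += 1 if key == "Right" else -1
    return cursor


def split_at(segment, pattern, keys):
    """Split the segment at the modelled cursor position."""
    pos = split_position(segment, pattern, keys)
    if pos is None:
        return None
    return segment[:pos], segment[pos:]


if __name__ == "__main__":
    text = "Customer xy"
    print(split_at(text, "Customer", ["Right", "Left"]))  # as in #c
    print(split_at(text, "Customer", ["Left", "Right"]))  # as in #v
    print(split_at(text, "Customer", ["Right"]))          # #c without the last Left
    print(split_at(text, "Customer", ["Left"]))           # #v without the last Right
```

With both arrow keys the model reproduces the off-by-one splits from my test ("Custome" / "r xy" and "C" / "ustomer xy"). With only the first arrow key the split lands exactly at the end or the start of the match.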

Do you happen to know how to search for tags (in the sense of placeholders) in Trados, or how to prevent them from being turned into placeholders?

Another idea I just had is to run a batch search and replace over all the source xliff files at once (e.g. with Textcrawler), replacing each tag with the same tag preceded by a short string (e.g. splithere). You could then search for that string in Trados and let the segmentation script work on it, splitting once before and once after the string so that it becomes a segment of its own, which Trados will then automatically translate as splithere. Once the translation is done, you take the target xliff files and batch search for all "splithere" strings, replacing them with "" (nothing).
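
As a rough sketch of that marker round trip, here it is in Python instead of Textcrawler. Everything specific in it is an assumption for illustration: the marker string, the tag pattern in the example, the .xlf extension, UTF-8 encoding, and the function names. It keeps a .bak copy of each file before touching it.

```python
import re
from pathlib import Path

# Hypothetical marker; pick a string that never occurs in the real text.
MARKER = "splithere"


def add_markers(xliff_text, tag_pattern, marker=MARKER):
    """Insert the marker in front of every match of tag_pattern."""
    return re.sub(tag_pattern, lambda m: marker + m.group(0), xliff_text)


def strip_markers(xliff_text, marker=MARKER):
    """Remove every occurrence of the marker (run this on the target files)."""
    return xliff_text.replace(marker, "")


def mark_folder(folder, tag_pattern, suffix=".xlf"):
    """Add markers to all files with the given suffix below folder.

    Assumes UTF-8 files. newline="" keeps the original line endings.
    A backup copy named <file>.bak is written first.
    """
    for path in Path(folder).rglob("*" + suffix):
        with open(path, encoding="utf-8", newline="") as f:
            original = f.read()
        with open(str(path) + ".bak", "w", encoding="utf-8", newline="") as f:
            f.write(original)
        with open(path, "w", encoding="utf-8", newline="") as f:
            f.write(add_markers(original, tag_pattern))


if __name__ == "__main__":
    # Made-up inline tag, only to show the effect on one string.
    sample = 'Customer<x id="1"/>xy'
    marked = add_markers(sample, r'<x id="\d+"/>')
    print(marked)
    print(strip_markers(marked))
```

The pattern has to match only the tags inside translatable text. A pattern that is too broad would also hit structural elements of the xliff and damage the file.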

Or would you have another way for solving this?

regards,
Pascal



Nora Diaz  Identity Verified
Mexico
Local time: 08:35
Member (2002)
English to Spanish
+ ...
Regular Studio tags? Nov 16, 2016

Hi Pascal,

Can you share a small sample of your text so I can get a better idea of what you mean re: the tags? Are they regular Studio tags?

Also, a note of caution about a possible issue if the script is run with certain parameters. For example, if you run the script and use "TEXT" as the search string in the example below, with no regex pattern, the following will happen:

Segment 1 TEXT and words
Segment 2 TEXT plus a word

The first instance of the script will split segment 1 between TEXT and "and", so you will get:

Segment 1a TEXT
Segment 1b and words

Now, when the second pass of the loop runs, it will find TEXT in Segment 1a and nothing will happen, because there is nothing to split. I think when that happens, the script will keep looping in this position until it finishes and will never get to segment 2. For this reason, I think it's best to use a regex pattern instead of just a plain string. For example, if instead you make your search string TEXT\s, with a space being required after TEXT, then Segment 1a will no longer match the search criteria and the script will move on to segment 2, as desired.
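
The difference between the two search strings can be tried out in Python. This sketch uses Python's re module, not Studio's own regex engine, and only models "find the first segment that matches". The function name first_hit and the list of segments are made up from the example above.

```python
import re


def first_hit(segments, pattern):
    """Return the index of the first segment the pattern matches, else None."""
    for i, segment in enumerate(segments):
        if re.search(pattern, segment):
            return i
    return None


# The segments as they look after the first split in the example.
segments = ["TEXT", "and words", "TEXT plus a word"]

if __name__ == "__main__":
    print(first_hit(segments, "TEXT"))     # plain string: stuck on segment 1a
    print(first_hit(segments, r"TEXT\s"))  # regex with whitespace: moves on
```

The plain string still hits the already split segment (index 0), while TEXT\s skips it and lands on "TEXT plus a word" (index 2).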

The external search and replace operation sounds like an interesting workaround. I guess there is more of a chance of damaging the file, but if done carefully, it should work, I think.



Pascal Zotto  Identity Verified
Austria
Local time: 16:35
Member (2009)
Dutch to Letzeburgesch
+ ...
TOPIC STARTER
yes regular Studio tags Nov 16, 2016

Hi Nora,

Yes, I mean regular Studio Tags.

example

Thanks for the hint about putting a space after plain text strings.

I use Text Crawler (there are many more apps than this one) as it is straightforward and very easy to get used to. I mainly use it when I need to replace a TM or termbase, or the path to these, in all my projects at once (with around 250 clients there are at least as many permanent projects on my hard disk). You can filter by file type and include subfolders. It first searches all files for plain text or a regex and offers a preview of the strings containing your search terms. Then you can select which files to work on and replace the searched text with a new one. All this without even opening the files.

Of course, always keep a recent copy of the files you work on because, as you said, you can mess up a lot just by having one character too many or one too few in your replacement text.

[Edited at 2016-11-16 21:44 GMT]



Nora Diaz  Identity Verified
Mexico
Local time: 08:35
Member (2002)
English to Spanish
+ ...
Line breaks? Nov 17, 2016

Hi Pascal,

I don't know of any way to search for the tags, but looking at your example it looks as if there may be some soft returns in there that you could use in your regex pattern. Is that correct?



Pascal Zotto  Identity Verified
Austria
Local time: 16:35
Member (2009)
Dutch to Letzeburgesch
+ ...
TOPIC STARTER
no returns or linebreaks Nov 17, 2016

It's only that the tags are so large that Trados automatically wraps the next word onto the next line.
