Find/replace thousand separators and decimal points using regex within Transit
Thread poster: Steveling2
Steveling2
United Kingdom
May 11

Hello,

I've been asked at work to look into ways of finding and replacing thousand separators and decimal points in the target text within Transit (Version 4.0 SP7 Build 1268.14). Essentially we're trying to emulate the auto-localization function in Trados (and no, we can't switch to another software). The suggestions in the Regular Expressions section of Transit's Advanced Features manual work to some extent. For example,

Find: #([0-9]+)0\.#([0-9]+)1
Replace: #0,#1

will convert:
1.0 to 1,0
1.00 to 1,00
1.000 to 1,000

But then it gets trickier:
1.000.000 becomes 1,000.000

What's more, dates are also converted:
01.01.1900 becomes 01,01.1900
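
The behaviour is not specific to Transit. As a rough illustration, here is the same find/replace in Python's `re` module. It assumes that Transit's `#(...)0` and `#0` variables behave like ordinary capture groups and backreferences (`\1`, `\2` in Python); the name `swap_dots` is made up for this sketch.

```python
import re

# Assumed Python equivalent of
#   Find:    #([0-9]+)0\.#([0-9]+)1
#   Replace: #0,#1
PATTERN = re.compile(r"([0-9]+)\.([0-9]+)")

def swap_dots(text: str) -> str:
    """Replace the dot between two runs of digits with a comma."""
    return PATTERN.sub(r"\1,\2", text)

for sample in ["1.0", "1.000", "1.000.000", "01.01.1900"]:
    print(sample, "->", swap_dots(sample))
```

The run of digits after the first dot is consumed by the first match, so the next dot no longer has any unmatched digits in front of it. That is why only the first dot in 1.000.000 and 01.01.1900 changes.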

I've now spent quite some time experimenting, and as Transit appears to use its own flavour of regex there's no point in using a site like regexhero.net for testing. Is there a solution that doesn't ignore a second (or even third) thousand separator, but does ignore the sequence xx.xx.xxxx?

Thanks,
Steve


CafeTran Training
Netherlands
Local time: 08:15
Member (2016)
Exclusion? May 11

Can't you exclude the years and thousand separator from the search string?

Something like:

Find: #([0-9]+)0\.#([0-9]+)1(!.\((19|20)[0-9][0-9])|([0-9]+)))

(You'll have to tweak it.)

You can also close the language pairs in Transit and use a more advanced regular expressions tool like WildEdit: http://textpad.com/products/wildedit/index.html

[Edited at 2017-05-11 15:21 GMT]


Steveling2
United Kingdom
TOPIC STARTER
Some progress with thousand separators May 12

Thank you for your quick reply. I've made some progress on the thousand separators. A small change from

#([0-9]+)0\.#([0-9]+)1
to
#([0-9])0\.#([0-9])1

finds every dot (or comma, as required) between numbers. Using

#0,#1

in Replace changes 1.000.000 to 1,000,000 rather than 1,000.000. However, it will also find and change dates in the xx.xx.xxxx format. Your suggestion of exclusion (or Negation, as Transit calls it) sounded promising. Based on your string I can use

([0-9][0-9]\.[0-9][0-9]\.[0-9]+)

to find the format xx.xx.xxxx (I tweaked your code so that it will find any year). But I haven't found a way of using this to exclude dates in this format, e.g.

#([0-9]+)0\.#([0-9])1(![0-9][0-9]\.[0-9][0-9]\.[0-9]+)
or
(![0-9][0-9]\.[0-9][0-9]\.[0-9]+)#([0-9]+)0\.#([0-9])1
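
For comparison, engines that offer alternation plus a replacement callback can get the "negation" effect by matching the date first and putting it back unchanged. A minimal Python sketch follows. It assumes dates always have the form xx.xx.xxxx and that thousand groups are always exactly three digits; `localize_separators` is a hypothetical name, and nothing here is confirmed to be possible in Transit's own regex flavour.

```python
import re

# Alternative 1: a date such as 01.01.1900, captured so it can be left alone.
# Alternative 2: 1-3 digits followed by one or more ".ddd" groups.
PATTERN = re.compile(
    r"(?<!\d)(\d{2}\.\d{2}\.\d{4})(?!\d)"
    r"|(?<!\d)\d{1,3}(?:\.\d{3})+(?!\d)"
)

def localize_separators(text: str) -> str:
    """Turn dot thousand separators into commas, skipping xx.xx.xxxx dates."""
    def fix(match: re.Match) -> str:
        if match.group(1) is not None:
            # The date alternative matched: return the text untouched.
            return match.group(0)
        return match.group(0).replace(".", ",")
    return PATTERN.sub(fix, text)
```

Because the date alternative is tried first at every position, the number alternative never gets a chance to match part of a date.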

And another problem springs to mind – in files where there are both separators and decimal points, the separators will be the same as the source language's decimal point after the conversion. Then a regex will be required that will convert x,x and ignore any variant of x,xxx (and that's assuming there are only one or two digits after the decimal point in the file).

I seem to be spending a whole lot of time chasing my tail here. And I haven't even looked at date conversion, e.g. 12/05/2017 to 12 May 2017 (or May 12, 2017). I can't be the first Transit user to want to do this, so I can only assume that the lack of an existing solution means that this can't be done in Transit, or at least not in a single find/replace action. Looks like automating these changes needs to be done pre-import.


Extra Consult
Belgium
Local time: 08:15
Member (2008)
English
groupings May 12

We normally do this by searching for a single digit followed by a group of 3 digits to find thousand separators:
(\d)\.(\d{3}) for dot or (\d)\s(\d{3}) for space. Then replace by $1,$2 or $1.$2 as you need.

If you want to make sure you only get this group, you can single them out with the beginning- and end-of-line markers (^ and $), like this:
^(\d)\.(\d{3})$ - this makes sure it only finds a single digit followed by a group of 3 digits when nothing else comes before or after them on the line.

Hope this helps.


Steveling2
United Kingdom
TOPIC STARTER
Flavours of regex May 12

Thanks for your reply. Unfortunately the flavour of regex that Transit uses doesn't include the {n} quantifier, so the options are to use + (1-n instances) or to actually enter the occurrences, in this case

([0-9])\.([0-9][0-9][0-9])

to find 1.000. But this will ignore the second separator in 1.000.000, giving you 1,000.000
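
The second separator is skipped because the three digits after the first dot are consumed by the match, so they cannot also serve as the digit in front of the next dot. In engines that support lookaround (an assumption; Transit's flavour may not), the digits can be checked without being consumed. A small Python sketch, with `dots_to_commas` as an illustrative name:

```python
import re

# Consuming version: Python equivalent of ([0-9])\.([0-9][0-9][0-9])
CONSUMING = re.compile(r"([0-9])\.([0-9][0-9][0-9])")

# Lookaround version: only the dot itself is part of the match, so the
# digits on either side remain available for the next match.
LOOKAROUND = re.compile(r"(?<=[0-9])\.(?=[0-9][0-9][0-9])")

def dots_to_commas(text: str) -> str:
    """Replace every dot that sits between a digit and three more digits."""
    return LOOKAROUND.sub(",", text)

print(CONSUMING.sub(r"\1,\2", "1.000.000"))  # first separator only
print(dots_to_commas("1.000.000"))           # every separator
```

Dates are still affected by the lookaround version: the dot before a four-digit year is also followed by three digits.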

And would your string not need to be (\d{1,3})\.(\d{3}) to find 1,000 and/or 10,000 and/or 100,000? Having just typed up this question, I thought of an answer, which is to use alternatives

(\.([0-9][0-9][0-9])|\.([0-9]+)\.|\.([0-9]+)\.([0-9]+)\.)

This will find 1.000 / 1.000.000 / 1.000.000, but the string is getting messy, and how will alternatives work with variables (do they even)? Fortunately, while testing this with variables I hit on the best solution (so far – with regex you never know):

\.([0-9][0-9][0-9])

By leaving out the number(s) before the separator I get around the issue of the second or third (etc.) separator being ignored. And around the fact that my attempt at using alternatives and variables was a fail:

(\.#([0-9][0-9][0-9])1|\.#([0-9]+)2\.|\.#([0-9]+)3\.#([0-9]+)4\.) is invalid

All I need in the Replace field is \,#1 to change all my thousand separators.
Now I just need to find a way to exclude xx.xx.xxxx as I'm now left with xx.xx,xxxx
Using an exclude at the end of my string

\.#([0-9][0-9][0-9])1(![0-9][0-9]\.[0-9][0-9]\.[0-9]+)

excludes nothing, and sticking it at the beginning

(![0-9][0-9]\.[0-9][0-9]\.[0-9]+)\.#([0-9][0-9][0-9])1

finds nothing, which is a bit of a surprise, as the exclusion surely only applies to the string within the same brackets? If I find a way to do it, I will add it to my post.


Steveling2
United Kingdom
TOPIC STARTER
A solution, if far from perfect May 30

So, having decided that we don't care about changing dates, solving the problem becomes a little easier. Proceed as follows:

1. Open Find/Replace and copy/paste the following regular expressions:

Find field: #(.)1\,#([0-9]+)2
Replace field: #1\*#2

This changes a comma decimal to an asterisk (you can use a different character; just replace the asterisk in the Replace field with the character of your choice). This step is needed because otherwise you will end up with commas for both the DE decimal marker and the EN thousand separator.

Save the regex with a useful name, e.g. 1_Comma2Asterisk_Decimal

2. Now copy/paste

Find field: \.#([0-9][0-9][0-9])1
Replace field: \,#1



Dan Lucas
United Kingdom
Local time: 07:15
Member (2014)
Japanese to English
External regex May 30

Steveling2 wrote:
Proceed as follows:

I realize this comes too late to help Steve, who has already sunk quite a bit of time in this, but in the hope of being useful to others following in his footsteps, I use a stand-alone grep utility (specifically PowerGREP 5) to "preprocess" Transit NXT files (*.ENG).

It works without any apparent problems - but don't process angle brackets! - saves me a lot of time, and allows me to avoid Transit's own, rather clunky, regex implementation.

Dan


Steveling2
United Kingdom
TOPIC STARTER
A solution, if far from perfect - the missing steps May 31

OK, so the last steps were missed off

2. ...concluded

This converts all instances of a dot followed by three digits (most likely to be a thousand separator) to a comma. It's also at this point that step 1 makes sense. If you had already converted the comma decimal marker to a dot (e.g. 1,5% > 1.5%) rather than to an asterisk, you would now have both 1.5% (EN decimal marker) and 1.000.000 (DE thousand separator) in the same text. Instead you should have 1*5% and 1.000.000.

Now that you have pasted the regexes in step two, save them with a meaningful name, e.g.

2_Dot2Comma_Separator

3. In the final step, copy/paste

Find field: #(.)1\*#([0-9]+)2
Replace field: #1\.#2

to convert the asterisks (or your chosen character). Again, save this regex with a meaningful name, e.g.

3_Asterisk2Dot_Decimal

Adding the number of the step to the regex name makes it easier to remember the sequence in which to run them. Simply load and run each regex in order (remembering to tick the "Regular expression" box) - you can try it using the Replace button first, and once you're confident it works for you, you can use Replace all.
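
To make the sequence easier to follow, the three steps can be sketched outside Transit in Python. This assumes that `#(...)1` and `#1` behave like capture groups and backreferences, and that `.` in the Find field matches any single character. The function name `de_to_en_numbers` and the sample values are made up for illustration.

```python
import re

def de_to_en_numbers(text: str, placeholder: str = "*") -> str:
    """Convert German number formatting (1.000.000,50) to English (1,000,000.50)."""
    # Step 1: park the decimal comma on a placeholder character.
    text = re.sub(
        r"(.),([0-9]+)",
        lambda m: m.group(1) + placeholder + m.group(2),
        text,
    )
    # Step 2: a dot followed by three digits is taken to be a thousand separator.
    text = re.sub(r"\.([0-9][0-9][0-9])", r",\1", text)
    # Step 3: turn the placeholder into the English decimal point.
    text = re.sub(r"(.)" + re.escape(placeholder) + r"([0-9]+)", r"\1.\2", text)
    return text

print(de_to_en_numbers("1.000.000,50"))
print(de_to_en_numbers("1,5%"))
```

As in Transit, the placeholder must be a character that does not otherwise appear in front of digits in the text, or step 3 will convert those occurrences too.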

I did have a quick look into creating a macro to run these steps with a keyboard shortcut, but it was a bit flaky, and you have to add the macro manually for each user. It is not available centrally, unlike the saved regexes, which every user can see.

There are some things to note:
1. If the comma or dot is between markup tags, then the markup will be broken
2. Superscript numbers are treated as normal numbers
3. Dates in the xx.xx.xxxx format become xx.xx,xxxx
4. Pretranslated segments will be changed unless you filter them out using the segment filter options on the View tab before running the regexes
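
Note 1 is the kind of problem that an external preprocessing script could sidestep by leaving markup alone. The sketch below is only an assumption about what that might look like: it treats anything between `<` and `>` as a tag, which may not match how Transit actually stores markup, and `replace_outside_tags` is a hypothetical helper.

```python
import re
from typing import Callable

TAG = re.compile(r"(<[^<>]*>)")  # assumed tag shape: <...> with no nesting

def replace_outside_tags(text: str, convert: Callable[[str], str]) -> str:
    """Apply convert() to the text between tags and copy the tags verbatim."""
    parts = TAG.split(text)
    # With one capture group, split() puts the tags at the odd indexes.
    return "".join(
        part if i % 2 else convert(part) for i, part in enumerate(parts)
    )

def dots_to_commas(chunk: str) -> str:
    """Step 2 of the procedure: a dot followed by three digits becomes a comma."""
    return re.sub(r"\.([0-9][0-9][0-9])", r",\1", chunk)

print(replace_outside_tags('<t id="1.000">1.000</t>', dots_to_commas))
```

A number that is itself split across tags would still be converted piece by piece, so this only protects what is inside the tags.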


CafeTran Training
Netherlands
Local time: 08:15
Member (2016)
STAR should fix this May 31

1 and 2 are real killers. Actually, 3 isn't nice either (dates should be recognised and protected). 4 is more or less the responsibility of the user.

Naming the regexes by step is what I did too. However, I saved them in MacroToolworks, for easier updating and more flexibility.

[Edited at 2017-05-31 16:53 GMT]


Steveling2
United Kingdom
TOPIC STARTER
External regex Jun 1

@Dan

Thanks for your recommendation. I'll see how my users get on with the regexes I've set up in Transit. If it really becomes a problem, we might consider a non-Transit solution.

Out of curiosity, if you're manipulating the text post-import, i.e. the *.ENG file, how do you deal with pretranslated segments where the decimal markers and separators are already in the correct format?



Dan Lucas
United Kingdom
Local time: 07:15
Member (2014)
Japanese to English
Not an issue Jun 1

Steveling2 wrote:
Out of curiosity, if you're manipulating the text post-import, i.e. the *.ENG file, how do you deal with pretranslated segments where the decimal markers and separators are already in the correct format?

Steve, in my case I am mostly replacing double-byte Japanese characters (such as "wide" digits) with their "narrow" Latin character set equivalents. Pre-translated segments are already in a western character set, so there is no match, and no replacement is made.
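
As an illustration of this kind of replacement, full-width digits can be mapped to their ASCII equivalents with a translation table. This is a Python sketch rather than the PowerGREP setup described above, and the function name `narrow_digits` is invented for the example.

```python
# Full-width digits occupy U+FF10..U+FF19; map each one to "0".."9".
FULLWIDTH_TO_ASCII = {0xFF10 + i: ord("0") + i for i in range(10)}

def narrow_digits(text: str) -> str:
    """Replace full-width digits with ASCII digits, leaving other text alone."""
    return text.translate(FULLWIDTH_TO_ASCII)

print(narrow_digits("２０１７年"))
```

Text that already uses ASCII digits passes through unchanged, which is consistent with pre-translated segments producing no match. `unicodedata.normalize("NFKC", text)` would also fold full-width Latin letters and other compatibility characters, which may be more than is wanted.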

Also, the client assures me that it doesn't matter what I do to a pre-translated segment, because any changes to such segments are junked when the project goes back into their system. I've been doing this for quite some time now, and no problems have been reported and no complaints have been made.

Regards,
Dan

