Help with RegEx needed to check numbers (SDL Studio)
Thread poster: Andrei Vybornov

Andrei Vybornov  Identity Verified
Russian Federation
Member (2008)
English to Russian
Apr 23, 2010

I was trying to QA a project with a lot of segments like this (fictitious):

Check switch (B31), actuator harness connector (X25) and connector (X33).

There are a lot of codes like B31 and X33 in the file, and I would like to catch all cases where I might have copied them wrongly from the source.
First of all, I was very surprised to discover that Studio did not report anything when I put X34 instead of X33, although I had ticked the "Check numbers" box. Very strange! Maybe Studio does not consider such alphanumeric combinations to be numbers at all and checks only ‘standalone’ numbers. If so, I would like to know what other limitations there may be, because I naturally expected every digit to be checked.

Well, I then decided to write a RegEx to check such alphanumeric codes specifically.
This is what I eventually came up with:

RegEx source: [A-Z][0-9]{1,8}
RegEx target: $0
Condition: Grouped search expression – report if source matches but not target

It seemed to work at first, but then I discovered that in segments with several alphanumeric codes (see the example above) it only checks the first code and ignores the others. So in the example above nothing will be reported if I put (X26) instead of (X25), because (B31) is OK.

Any ideas how I should change the RegEx above to make it work?

Any help will be much appreciated.

Kind regards,
Andrei


 

Vito Smolej
Germany
Member (2004)
English to Slovenian
3 ideas Apr 23, 2010

Andrei Vybornov wrote:

I was trying to QA a project with a lot of segments like this (fictitious):

Check switch (B31), actuator harness connector (X25) and connector (X33).
....
Any help will be much appreciated.

Kind regards,
Andrei


A: If you can edit the source, you could insert a blank between the letter and the number that follows it (you can edit the target back when done). Then the numbers are placeables and it's easier to handle them.

B: Another way, a sort of checksum method, is to count the instances: if you have 623 X25s in the source and 622 of them in the target with one lonely X26 (or one too many), you know what to correct. Of course, finding the needle in this haystack could be a nightmare.

C: I would extract the fingerprint with VB in Excel: one segment > one line > A: B31X25X33. Then it's just a string comparison.
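A minimal sketch of idea C, written in Python rather than VB. It assumes the source and target segments have already been exported as two parallel lists of strings (the export step is not shown), and the names fingerprint and mismatched_segments are made up for illustration. The codes are sorted before joining so that a different word order in the target does not raise a false alarm, which goes slightly beyond the plain string comparison described above.

```python
import re

# Same pattern as in the first post: one capital letter followed by 1-8 digits.
CODE = re.compile(r"[A-Z][0-9]{1,8}")

def fingerprint(segment):
    """Join all codes found in a segment, sorted so that word order does not matter."""
    return "".join(sorted(CODE.findall(segment)))

def mismatched_segments(source_segments, target_segments):
    """Return the 1-based numbers of the segments whose fingerprints differ."""
    pairs = zip(source_segments, target_segments)
    return [
        number
        for number, (src, tgt) in enumerate(pairs, start=1)
        if fingerprint(src) != fingerprint(tgt)
    ]
```

Called on the example segment with (X26) typed instead of (X25) in the target, mismatched_segments would return the number of that segment, so the needle no longer has to be searched for in the whole haystack.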

Re Regex: you may have a variable number of instances, so you may have to test for one first, and then for two ... Don't know if it would work at all...
regards

smo

[Edited at 2010-04-23 20:28 GMT]


 

Andrei Vybornov  Identity Verified
Russian Federation
Member (2008)
English to Russian
TOPIC STARTER
Thank you for the attempt Apr 24, 2010

Thank you Vito, but none of these ideas will actually work.

A: I can’t edit the source in Studio, but even if I could, I wouldn’t. These codes must remain the way they are. If you meant temporarily placing a space between a letter and the following digits in order to run the QA and then deleting the space again, I am afraid it is way too much trouble, especially if I try to automate these replacements (inserting the space and then removing it), because there is always a chance that spaces will be inserted or deleted where they were not intended.

B: Checksum does not guarantee anything at all. I may have correct totals for the whole document, but not for individual segments.

C: Seems to be the most viable option, but I do not know how to "extract the fingerprint". And then again, it has nothing to do with Studio. This method can be applied to any pair of source and target files, even ones that were not translated with a TM tool.

D: …variable number of instances. That’s exactly the point! I do not quite understand what you mean by "testing for one first". If you mean that instead of [A-Z][0-9]{1,8} I should search for e.g. B31 and then for X25, it is out of the question. I have codes ranging from A1 to X100, so you can imagine how many individual checks I would have to run. If by "testing for one first" you meant that I should check the first occurrences in all segments, then the second… I wish I knew how to do it. When I run the QA check it does not stop at each code in each segment so that I could confirm it and move to the next code in that same segment. No, it just checks the entire document and if there is more than one code in a segment it would only check the first one and move on to the next segment. All I want is to make it check and report all codes in each segment.
Still have no idea how to do it :(

Thank you for the attempt anyway. Much appreciated

Kind regards,
Andrei


 

Jaroslaw Michalak  Identity Verified
Poland
Member (2004)
English to Polish
Repeating sequences Apr 24, 2010

Andrei Vybornov wrote:

D: …variable number of instances. That’s exactly the point! I do not quite understand what you mean by "testing for one first". If you mean that instead of [A-Z][0-9]{1,8} I should search for e.g. B31 and then for X25, it is out of the question. I have codes ranging from A1 to X100, so you can imagine how many individual checks I would have to run. If by "testing for one first" you meant that I should check the first occurrences in all segments, then the second… I wish I knew how to do it. When I run the QA check it does not stop at each code in each segment so that I could confirm it and move to the next code in that same segment. No, it just checks the entire document and if there is more than one code in a segment it would only check the first one and move on to the next segment. All I want is to make it check and report all codes in each segment.
Still have no idea how to do it :(


What Vito meant was to have one regex for one sequence in the segment, another regex for two sequences in the segment, etc. It all depends on the maximum number of sequences you can have in a segment.

I am not familiar with the QA syntax (or regex flavor) in Trados, so I cannot give you a specific regex, but for example for three sequences it might be:

([A-Z][0-9]{1,8})(.*)([A-Z][0-9]{1,8})(.*)([A-Z][0-9]{1,8})
$1(match anything here)$3(match anything here)$5

Of course, if the QA does not have an option to accept anything in between the matches, it won't work.

EDIT:

I've read the description here:
http://producthelp.sdl.com/SDL%20Trados%20Studio/client_en/Ref/O-T/Verification/QA_RE.htm

and this might work:

([A-Z][0-9]{1,8})(.*?)([A-Z][0-9]{1,8})(.*?)([A-Z][0-9]{1,8})
$1(.*?)$3(.*?)$5


[Edited at 2010-04-24 10:35 GMT]


 

Andrei Vybornov  Identity Verified
Russian Federation
Member (2008)
English to Russian
TOPIC STARTER
Good try too Apr 24, 2010

Thank you Jabberwock,

But I guess it will not work if the order of words (and codes) in the target segment is different.
If I write a RegEx to spot a certain pattern of characters, I would expect to catch all occurrences, not just the first one. I remember reading something about "eager", "lazy" and "greedy" things at http://www.regular-expressions.info (referenced by Studio Help). Thought it might be related and hoped that someone would clarify that.
PS: I could not get your link to work. Is it in some 'private' area of SDL? I mean maybe you are subscribed to SDL assistance and it is only visible to those subscribed?

Kind regards,
Andrei


 

Jaroslaw Michalak  Identity Verified
Poland
Member (2004)
English to Polish
Workarounds, not solutions Apr 24, 2010

Andrei Vybornov wrote:
But I guess it will not work if the order of words (and codes) in the target segment is different.


Of course, you are right.
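The order problem can be reproduced outside Studio. The following is only a Python sketch of the three-sequence idea, not a statement about how Studio evaluates $1, $3 and $5; the function name target_matches_in_order is hypothetical, and segments with fewer than three codes are simply skipped.

```python
import re

# Three codes with anything (lazily matched) in between, as in the expression above.
SOURCE_RE = re.compile(
    r"([A-Z][0-9]{1,8})(.*?)([A-Z][0-9]{1,8})(.*?)([A-Z][0-9]{1,8})"
)

def target_matches_in_order(source, target):
    """Emulate a target expression of the form $1(.*?)$3(.*?)$5."""
    match = SOURCE_RE.search(source)
    if match is None:
        return True  # fewer than three codes: nothing to check in this sketch
    first, second, third = match.group(1), match.group(3), match.group(5)
    # Build the target pattern from the codes captured in the source.
    target_re = "(.*?)".join(re.escape(code) for code in (first, second, third))
    return re.search(target_re, target) is not None
```

With the example segment as source, a target containing (X26) instead of (X25) fails the check as intended, but so does a correct target in which the three codes merely appear in a different order.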


If I write a RegEx to spot a certain pattern of characters, I would expect to catch all occurrences, not just the first one. I remember reading something about "eager", "lazy" and "greedy" things at http://www.regular-expressions.info (referenced by Studio Help).


I suppose it is a flaw in the RegEx implementation of Studio. As far as I know, the option whether the match is to be single or repeated goes beyond a particular regex (for example, in Perl you have to specify the /g modifier). Laziness/greediness does not apply here, as there are no wildcards in your expression.
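The difference between a single and a repeated match can be shown with the same expression in Python (used here instead of Perl purely as an illustration; it says nothing about Studio's internals). re.search stops at the first match, while re.findall plays the role of the /g modifier.

```python
import re

PATTERN = r"[A-Z][0-9]{1,8}"
segment = "Check switch (B31), actuator harness connector (X25) and connector (X33)."

# Single match: the search stops at the first code it finds.
first_only = re.search(PATTERN, segment).group(0)

# Repeated match: every non-overlapping occurrence is returned, left to right.
all_codes = re.findall(PATTERN, segment)

print(first_only)
print(all_codes)
```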


PS: I could not get your link to work. Is it in some 'private' area of SDL? I mean maybe you are subscribed to SDL assistance and it is only visible to those subscribed?


I have no idea - I do not have an assistance contract (only 2007 license) and I have found the link with Google.


 

Andrei Vybornov  Identity Verified
Russian Federation
Member (2008)
English to Russian
TOPIC STARTER
I was not quite right Apr 24, 2010

I have checked it again on a newly created file and it appears that I did not present the problem correctly. In fact Studio’s QA Checker does check all occurrences. What it does not do is a reverse check. Which is only natural because I need to set a different condition for that.

But let me explain.

Example 1:
Source: Check switch (B31), actuator harness connector (X25) and connector (X33).
Target: Check switch (B31), actuator harness connector (X26) and connector (X33).

The problem is reported correctly: found in source but not in target (X25)

Example 2:
Source: Check switch (B31), actuator harness connector (X25) and connector (X33).
Target: Check switch (B31), actuator harness connector (X25) (X26) and connector (X33).

The problem is not reported

I wrongly assumed that this problem would be reported. My bad.

At first I thought this could be fixed by creating a new RegEx with a different (reverse) condition. Something like:
Condition: Grouped search expression – report if target matches but not source

Alas, there is no such thing.

Switching RegEx source with RegEx target does not help either:

RegEx source: $0
RegEx target: [A-Z][0-9]{1,8}

I have found a solution though, that seems to work.
In addition to the first RegEx (see my first post) I created this one:

RegEx source: [A-Z][0-9]{1,8}
RegEx target: [A-Z][0-9]{1,8}
Condition: Report if both source and target match but with different count

I am not sure if it will cover all possible scenarios, but it works so far.
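To get a feeling for which scenarios the pair of checks covers, here is a rough Python model. It rests on assumptions: check 1 is taken to report every source code that has no identical code in the target, and check 2 is taken to compare the number of matches on each side. How Studio actually implements these conditions is not confirmed anywhere in this thread, and all function names are made up. A third, stricter comparison is added for contrast.

```python
import re
from collections import Counter

CODE = re.compile(r"[A-Z][0-9]{1,8}")

def check_missing(source, target):
    """Model of check 1: source codes with no identical code in the target."""
    target_codes = set(CODE.findall(target))
    return [code for code in CODE.findall(source) if code not in target_codes]

def check_count(source, target):
    """Model of check 2: True when both sides match but the match counts differ."""
    n_src = len(CODE.findall(source))
    n_tgt = len(CODE.findall(target))
    return n_src > 0 and n_tgt > 0 and n_src != n_tgt

def same_codes(source, target):
    """Stricter comparison: every code must occur equally often on both sides."""
    return Counter(CODE.findall(source)) == Counter(CODE.findall(target))
```

Under these assumptions, Example 1 is caught by the first model and Example 2 by the second. A segment such as "Connect (X25) to (X25) via (B31)." translated with (B31), (X25), (B31) would pass both models, since every source code exists in the target and the counts are equal; only the per-code comparison in same_codes flags it.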


Thank you all!
Kind regards,
Andrei


 

Vito Smolej
Germany
Member (2004)
English to Slovenian
checksum method... Apr 24, 2010

Andrei Vybornov wrote:

...
Condition: Report if both source and target match but with different count
...


if ever I saw one. To paraphrase: Checksum does not guarantee anything at all. I may have correct totals for the whole ... segments, but individual entries may be different. Yes. But it helps (g). The same applies to my suggestion, even if admittedly crude (it helped me in some cases).

I hope it helped you land the described check.

Regards


Vito

[Edited at 2010-04-24 17:11 GMT]


 

