Pages in topic:   [1 2] >
Comparing two files with strings and sorting the target file
Thread poster: Dejan Škrebić

Dejan Škrebić  Identity Verified
Bosnia and Herzegovina
Local time: 18:50
English to Serbian
+ ...
Jan 25, 2011

I have two files and need to perform not usual comparison.

Both files are raw text files. One of them is for source (English) language, the other for target language. Both files contain strings in the following form:

MenuItem mtAccount 'Active account'
or:
Panel AccountPopupPanel '', 'For Account Popup'

The source file has strings like 'For Account Popup' in English, the target file in target language. The source file is newer and has more strings. Most of the extra strings are new, but there might be some missing (obsolete) strings also. Also, the strings in the files are not in the same order.

I would like to compare these two files and sort the target file according the source file, not changing the order of the source file. If it is possible to have blank rows (in target file) instead of new strings, that would help. If not, simple row-after-row order would suffice.

I guess this might be possible with use of MS Excel, UltraEdit or some other "simple" tool, but if there is any tool to do it I am willing to learn how to use it.


Direct link Reply with quote
 
FarkasAndras
Local time: 18:50
English to Hungarian
+ ...
Thoughts Jan 25, 2011

This is definitely possible with perl, a programming language designed largely for text manipulation. Of course python, php, C, java and countless other languages can also do the job, but perl is the best suited to the task and it's also the only language I know:)

There is one important thing you left out: are the string IDs unique? I.e. does "MenuItem mtAccount" occur only once in either file?

Now, learning enough perl to be able to do this would likely take you several days, and that's probably overkill, but perhaps I'll try and rig something up. I'm pretty sure you could just load the string IDs from file 1 into an array, put the IDs and strings from file 2 into a hash and mate up the two. Or just iterate through the files in loops.
One big problem is how you separate the string IDs from the strings. The single quotes may be sufficient, but if string IDs may contain single quote characters, you're sort of screwed. Also, if the data contains random junk like the '', in your second example, you're even more screwed.

Anyway, upload the txt files to some online storage service and post links if they're not confidential. That would help test things if I decide to have a crack at this.


Direct link Reply with quote
 
Bernard Lieber  Identity Verified
Local time: 18:50
English to French
+ ...
ExamDiff Pro Jan 25, 2011

Try this link:

http://www.prestosoft.com/edp_examdiffpro.asp

You might be able to download a trail version. Is a pretty powerful tool might suit your need


Direct link Reply with quote
 
Terry Richards
France
Local time: 18:50
French to English
+ ...
In Excel... Jan 25, 2011

1) Open the source file with Excel, it will put one line per cell in column A. You have to select "All files" in the open dialog and (maybe) clear all the delimiter options if it asks you.
2) Open the target file in another copy of Excel
3) Copy/paste Column A of the second file to column B of the first file
4) Sort the two columns independantly (or sort them before the copy/paste)

T.


Direct link Reply with quote
 

Dejan Škrebić  Identity Verified
Bosnia and Herzegovina
Local time: 18:50
English to Serbian
+ ...
TOPIC STARTER
A new idea Jan 25, 2011

@FarkasAndras
Thanks for the suggestion. The files are not confidential, and if I don't manage to do it myself soon, I will ask for your help again.
The strings' IDs are unique. The only repetitive strings are ENDFORM, which can be easily and automatically cut out and put in later on.

@Bernard Lieber
Thanks for the link. I downloaded it and tried. Seems to be a great tool, but I haven't solved my problem still.

@Terry Richards
Thank for the suggestion, but if I understand well, that would change the order in my source file also, which is what I don't want.

But, contemplating on Terry's suggestion (whether I understood him well or not) I found out (or at least think so), a simple procedure. What I actually need is a good sorting tool where I can define a new, custom, sorting order. As the new order I would enter my complete source file. Excel has that option, but its custom list is limited in volume and my source file is about 10,000 lines long. Now, I've cut the excessive columns out of it (for sorting I don't need more than first 10-15 characters), but it's still 10,000 lines. Now I'm off to see if there is so powerful 'sorter' out there


Direct link Reply with quote
 
FarkasAndras
Local time: 18:50
English to Hungarian
+ ...
Whoa Jan 25, 2011

Dejan Škrebić wrote:

What I actually need is a good sorting tool where I can define a new, custom, sorting order. As the new order I would enter my complete source file. Excel has that option

That's news to me. Care to share?


Direct link Reply with quote
 

RieM  Identity Verified
United States
Local time: 12:50
English to Japanese
+ ...
vlookup function Jan 25, 2011

If you are comfortable with Excel functions, try Vlookup, but there are some prerequisites:

1. when you import both files to Excel sheets (doesn't matter whether they are on the same sheet or separate sheets), ID and value have to go into a separate cell. Looks like your files have two types of "delimiters" and so you probably replace them with something consistent, using UltraEdit or other nifty text editor.

2. target text should be sorted in alphabetical order, but I don't think this is necessary as long as using exact match - this sorting can be done before loading the file onto Excel.

Assuming Row1 has labels, and the Column A has Source (=English) string ID, B has Source text, C to F are Blank, G has Target string ID, H has Target text, then go to C2, and type

Vlookup(A2,$G$2:$H$(last row # of target list),1,False)

Next, go to D2 and type
Vlookup(A2, $G$2:$H$(last row # of target list),2,False)

Copy these functions down to the last row.

They mean, if there is a string Id that matches A2, the ID gets copied to C2, and corresponding text gets copied to D2. If both C2 and D2 return #N/A, it simply means Not available - no match.

There are other lookup/search/match functions in Excel, and this is just one example. You can also fancy it up by nesting it within other function(s).

I haven't tested the above myself and this has just come up to my mind, but since I often use Excel functions in the similar situations (this and countif, exact, etc) in order to sort out software strings, they should do what you would like to accomplish, or part of it.

Good luck!

Rie


Direct link Reply with quote
 

Jabberwock  Identity Verified
Poland
Local time: 18:50
Member (2004)
English to Polish
Lookup Jan 25, 2011

You could use MATCH function in Excel to align the source with the target. I.e. first column would be source file, second would be target file, the third would be formula that would pull the match to the second column from the first column...

http://en.kioskea.net/forum/affich-122981-partial-cell-match-of-company-names-in-excel

The problem would be getting the length for comparison right - too short will give you false matches, too long will grab the target string and the match will fail...


Direct link Reply with quote
 

Jabberwock  Identity Verified
Poland
Local time: 18:50
Member (2004)
English to Polish
Formula Jan 25, 2011

The exact formula would be:

=INDIRECT(CONCATENATE("B";MATCH("*"&LEFT(A1; 8)&"*";$B$1:$B$10000;0)))

A1 is the source column cell, this will change automatically when you expand the formula.
8 is the length of the match.
Absolute B1:B10000 is the range for the target column.

What it does is: find the number of the row from the range which contains target which starts with the same string as the source, then return the value of target for that row.


[Edited at 2011-01-25 23:31 GMT]


Direct link Reply with quote
 

Daniel Grau  Identity Verified
Argentina
English to Spanish
Here's my suggestion Jan 26, 2011

As long as you have a way of restoring the original order of the strings, you can shuffle and sort them any way you want. The short recipe:

1) Open the files in Word and format all lines as automatically numbered items:

   1. MenuItem mtAccount 'Active account'
   2. Panel AccountPopupPanel '', 'For Account Popup'
   3. ...

2) Save As Text Only, so the automatic numbers become part of the text.

3) Reopen the files, sort by the second field and compare. This should be saved as a .doc file.

And this is when you realize you need a better separator than a "." (period) as a boundary between automatic numbers from the strings. So we go back to step 1, where you need to Customize the automatic numbering format to something like this:

1 MySeparator MenuItem mtAccount 'Active account'
2 MySeparator Panel AccountPopupPanel '', 'For Account Popup'
3 MySeparator ...

Then, at step 3, do a global replacement (MySeparator > tab) and you'll have your two fields to sort by.

Once you have the final, corrected file, you do this:

4) Sort by the numerical field to restore the original ordering.

5) To delete the numbers, use Convert Text to Table, select the column and delete.

6) Save As Text Only.


Direct link Reply with quote
 

Jabberwock  Identity Verified
Poland
Local time: 18:51
Member (2004)
English to Polish
Not so easy... Jan 26, 2011

@Daniel,

The problem is that there are more source strings than target strings... So sorting them this way will get them misaligned.

For example, the source might be:

Source string one
Source string three
Source string two

but the target has only:

Target string one
Target string two

as the source string three is new.

If you just sort without aligning, you get:

Source string one Target string one
Source string three Target string two
Source string two

which is not what you want...

[Edited at 2011-01-26 07:18 GMT]


Direct link Reply with quote
 
FarkasAndras
Local time: 18:51
English to Hungarian
+ ...
IDs Jan 26, 2011

Jabberwock wrote:

The exact formula would be:

=INDIRECT(CONCATENATE("B";MATCH("*"&LEFT(A1; 8)&"*";$B$1:$B$10000;0)))

A1 is the source column cell, this will change automatically when you expand the formula.
8 is the length of the match.
Absolute B1:B10000 is the range for the target column.

What it does is: find the number of the row from the range which contains target which starts with the same string as the source, then return the value of target for that row.



That's pretty nice. I would not use a string length based guess to separate the ID and the string, though. That's very likely to cause a lot trouble. The OP's example is:
Panel AccountPopupPanel
You'll probably also get
Panel AccountPopupWindow
or
Panel AccountPopupPanelButton1
or any number of similar IDs that will cause false matches, while there may be short IDs that are not matched by this process.

If the ' ' is uniqe, you could just separate the ID and the string off at the first '.
Even MS Word can do this: replace 'with \t'. The closing ' will be pushed to a new cell, and if the stirng contains a ', there will be further tabs inserted, but that's all easily fixed as long as the string ID doesn't contain a ' character. Of course with perl you could just replace the first ' on every line, avoiding the issue.
Copy-paste to Excel and you now have IDs and strings in separate cells, so you can match them precisely with a slightly modified formula.

Of course one would have to see the data to decide how exactly to split the records up, and of course, all these methods sort of fail if you have too much data. Excel 2003 only has 65000 rows and even if you have Excel 2007 or up, you may have trouble if there is a lot of data (500,000 and up). Perl or other programming languages will do 50,000,000 if needed.


Direct link Reply with quote
 

Jabberwock  Identity Verified
Poland
Local time: 18:51
Member (2004)
English to Polish
Matching Jan 26, 2011

The formula I have given above can be improved so that the content is not matched against a string of arbitrary length, but against a common portion.

For that purpose you would have to remove (with regular expression) all text from the source string after the first quote and put it along the source column.

So the columns would look like this:

First column (source): MenuItem mtAccount 'Active account'
Second column (shortened): MenuItem mtAccount
Third column (target): (might not be the same row!) MenuItem mtAccount 'Aktywne konto'
Fourth column (matching formula): =INDIRECT(CONCATENATE("C";MATCH(B1&"*";$C$1:$C$10000;0)))

If you match in the target column the text which is contained in the second column, it should return the translation of the given source string.


Direct link Reply with quote
 

István Hirsch  Identity Verified
Local time: 18:51
English to Hungarian
in Word Jan 26, 2011

If Excel is not convenient for the purpose, try this. It works for small sample files as described, however, differences in settings etc. can raise further questions to solve.

1. Open the source file in Word, create a table with two columns, put source segments in column B (one cell one segment), and number the lines in column A. (Do not use Word Numbering function. Instead, create the column of numbers in Excel and copy/paste this column into column A in Word.)
2. Open the target file and copy/paste the target segments into column B, right under the source segments.
3. Select the target segments you have just pasted, go to Edit/Replace, check Wildcard checkbox and
Find: ([!^13])(*){1,}(’)(*){1,}(’)
Replace with: ^t\1\2#\3\4\5
(as for the quotation marks, it is advisable to copy them from your text into the Find box, rather than entering them by typing).
4. Select column B, go to Table/Sorting and sort the table (in decreasing order) according to column B.
5. Select the table, go to Table/Converting and choosing the tabulator as delimiter, first convert Table to Text, then Text to Table (Number of columns: 3).
6. Delete cell C1 to level source and target columns, and delete # from column C.
7. Select column A (numbers) and restore the original order of source segments by Table/Sorting in increasing order.


Direct link Reply with quote
 

Dejan Škrebić  Identity Verified
Bosnia and Herzegovina
Local time: 18:51
English to Serbian
+ ...
TOPIC STARTER
Seems like good idea... Jan 26, 2011

Hello all,

Thank you all for your help. Some of the advices were really helpful. Finally, I found out that for my case Daniel's advice is the most helpful, a bit customized.

Here's what I've done:

1. Loaded source to Excel,
2. Added left column and filled it with values 1 to ~10000,
3. Sorted the source by column B, alphabetically,
4. Sorted target, alphabetically,
5. Copied target into source column C,
6. Going manually through target column deleting obsolete strings and adding new strings.

Step 6 is important here because:
a. Although I had said opposite, there are strings with same ID. Most of them have the same content also (like Button btOk '', '&OK'), but some of them have the same ID and different content (similar, but different due to context)
b. I have opportunity to check if the strings with same identifier have consistent localisation (eg. Button btOk '', '&OK'; Button btOk '', '&All right'; etc)

Excel is the best tool for this case because a) it easily handles 10,000 strings b) it allows to add or delete a cell in column C without interfering with the content of columns A and B.

There are three steps to do when I finish step 6:
7. Re-sort all content by column A (ordinal numbers for original sequence),
8. Delete column A,
9. Export content to original files.


Direct link Reply with quote
 
Pages in topic:   [1 2] >


To report site rules violations or get help, contact a site moderator:


You can also contact site staff by submitting a support request »

Comparing two files with strings and sorting the target file

Advanced search






TM-Town
Manage your TMs and Terms ... and boost your translation business

Are you ready for something fresh in the industry? TM-Town is a unique new site for you -- the freelance translator -- to store, manage and share translation memories (TMs) and glossaries...and potentially meet new clients on the basis of your prior work.

More info »
PDF Translation - the Easy Way
TransPDF converts your PDFs to XLIFF ready for professional translation.

TransPDF converts your PDFs to XLIFF ready for professional translation. It also puts your translations back into the PDF to make new PDFs. Quicker and more accurate than hand-editing PDF. Includes free use of Infix PDF Editor with your translated PDFs.

More info »



Forums
  • All of ProZ.com
  • Term search
  • Jobs
  • Forums
  • Multiple search