Pages in topic:   [1 2] >
Converting multilingual Termbase into Excel document
Thread poster: Jim Thomson

Jim Thomson
Austria
Local time: 07:14
German to English
+ ...
Mar 10, 2010

Has anyone out there found a relatively simple way of converting multilingual Termbases into Excel files?
I find if there are any missing term entries for a particular language, then all succeeding terms don't appear in their respective "language" column in the Excel doc; they get shifted over to the left.
Is there any way of getting Multiterm to treat empty fields nevertheless as "blanks" when doing the export?
Anyone who can come up with a nice solution for this (why hasn't SDL Trados?!) deserves a Nobel prize of some kind.


 

FarkasAndras
Local time: 07:14
English to Hungarian
+ ...
Try XML Mar 10, 2010

Jim Thomson wrote:

Is there any way of getting Multiterm to treat empty fields nevertheless as "blanks" when doing the export?


No. The tab delimited export is broken.

Jim Thomson wrote:

Anyone who can come up with a nice solution for this (why hasn't SDL Trados?!) deserves a Nobel prize of some kind.

I do have a solution of sorts. Do an XML export, open it with a text editor and check the tags it uses. With some ingenuity, you can use search and replace to fashion it into a tab delimited txt, especially if the structure is simple (not too many fields, no fields with the same name in different parts of the structure). I think I described this in a little more detail in a previous thread here.

When I have some time off work, I will probably write a script that automates this to an extent, maybe someone else already has. Full automation is difficult because of the complex entry structures Multiterm allows.
Anyway, if it's not urgent for you, you could just let me know what tags your export contains and I could write a script for them. If the structure is something trivial like 3 languages and one "comments" field for each entry, I could whip something up soon.


 

Daniel García
English to Spanish
+ ...
Word table export? Mar 10, 2010

Have you tried the Word table export? As far as I remember, you can use it from the Export command from MultiTerm or from Microsoft Word itself using the MultiTerm toolbar.

Depending on the structure of your termbase and on what you are trying to obtain, it might be an easy solution.

Daniel


 

Jim Thomson
Austria
Local time: 07:14
German to English
+ ...
TOPIC STARTER
Modifying XML when some languages are not complete seems not so simple Mar 11, 2010

Word creates a nice-looking dictionary, but it's no use for importing into Excel. The client provided us with their own Excel glossary, which they'd like back once we've expanded it by adding other languages, etc. Converting and importing it is easy; giving it back to them in a similar format appears nigh on impossible.

FarkasAndras, I'd read your previous post on this topic, and although I fully understand the logic of it, in practice it is not so simple. Our termbase has 6 languages, but not all terms are complete (i.e. translations of some terms are missing for some languages). Using the default Multiterm XML export, the file is a mess when viewed in Editor or Word: it is full of unnecessary "transac" information about who created the term, when, who then modified it, etc. Also, any missing languages don't seem to be referred to at all, so I'm not sure how the resulting "cleaned up" xml or txt file can be made to line up neatly in the corresponding language columns in Excel.

Is it possible to create a custom XML export that omits all the info about who created the term, etc.?

FarkasAndras, I very much appreciate your offer to write some kind of script should you have the time, but as you say, every Termbase export is different. I'd show you this particular one if there was some way of attaching files in this forum.

Jim


 

FarkasAndras
Local time: 07:14
English to Hungarian
+ ...
Not too difficult Mar 11, 2010

Jim Thomson wrote:

Using the default Multiterm XML export, the file is a mess when viewed in Editor or Word: it is full of unnecessary "transac" information about who created the term, when, who then modified it, etc. Also, any missing languages don't seem to be referred to at all, so I'm not sure how the resulting "cleaned up" xml or txt file can be made to line up neatly in the corresponding language columns in Excel.


The XML is a mess for human eyes, but it's strictly organised and pretty easy to convert to pretty much any other format. I know there is a lot of "rubbish" in there, but that's not a problem.
I know about the missing terms in various languages, that's the whole point of this workaround (the tab delimited export can't handle them, as you painfully found out).

You just start by putting each new entry ("concept") in a separate line by first removing all line breaks and then replacing all occurrences of <concept> (?) with a line break.
That takes care of separating each entry. Then you pick out the terms with search and replace, one language at a time, leaving all the rubbish behind. When there is no English term for a particular entry, there will be an empty line in the English data set there. You basically end up with 6 separate term lists that are in sync. All you need to do is paste them to adjacent columns in an Excel sheet.
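
For anyone who would rather script this than do it by hand, here is a rough Python sketch of the same idea: one row per entry, one cell per language, and an empty cell wherever a language is missing. Note that it uses an XML parser instead of search and replace, so it is not the procedure described above, only the same outcome. The element names (conceptGrp, languageGrp, language, termGrp, term) and the sample data are assumptions based on the tags mentioned in this thread; XML names are case-sensitive, so check them against your own export first.

```python
import xml.etree.ElementTree as ET

# Made-up sample shaped like the tags quoted in this thread (not a real export).
SAMPLE = """<mtf>
<conceptGrp><concept>1</concept>
<languageGrp><language type="English" lang="EN"/><termGrp><term>solar panel</term></termGrp></languageGrp>
<languageGrp><language type="German" lang="DE"/><termGrp><term>Solarmodul</term></termGrp></languageGrp>
</conceptGrp>
<conceptGrp><concept>2</concept>
<languageGrp><language type="German" lang="DE"/><termGrp><term>Wechselrichter</term></termGrp></languageGrp>
</conceptGrp>
</mtf>"""


def rows_from_export(xml_text, languages, synonym_sep=" | "):
    """Return one row per entry, with one cell per name in `languages`."""
    root = ET.fromstring(xml_text)
    rows = []
    for concept in root.iter("conceptGrp"):
        terms_by_language = {}
        for group in concept.iter("languageGrp"):
            language = group.find("language")
            if language is None:
                continue
            name = language.get("type")
            terms = [term.text or "" for term in group.iter("term")]
            terms_by_language.setdefault(name, []).extend(terms)
        # A language with no terms in this entry yields an empty cell,
        # so later columns are never shifted to the left.
        rows.append(
            [synonym_sep.join(terms_by_language.get(name, [])) for name in languages]
        )
    return rows


def to_tab_delimited(rows, languages):
    """Build tab-delimited text with a header line of language names."""
    lines = ["\t".join(languages)]
    lines.extend("\t".join(row) for row in rows)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    langs = ["English", "German", "Spanish"]
    print(to_tab_delimited(rows_from_export(SAMPLE, langs), langs), end="")
```

The resulting text can be saved as a .txt file and opened in Excel as tab-delimited data. Synonyms end up in the same cell, joined by the separator of your choice.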

Anyway, it looks like I will have the afternoon off today so I might just get this done.
You can upload your XML export to mediafire.com or rapidshare.com and post a link if you want to.
What I need to know is whether there is some fancy tree structure you would want to preserve. If it's just the terms in the 6 languages, no synonyms, no picklist fields, no text fields, then the task is trivially simple. In case of a complicated structure, some individual fine-tuning would be needed to conserve everything.


 

Jim Thomson
Austria
Local time: 07:14
German to English
+ ...
TOPIC STARTER
Still a little bit difficult Mar 11, 2010

FarkasAndras wrote:

Then you pick out the terms with search and replace, one language at a time, leaving all the rubbish behind. When there is no English term for a particular entry, there will be an empty line in the English data set there. You basically end up with 6 separate term lists that are in sync. All you need to do is paste them to adjacent columns in an Excel sheet.


This is the step that I'm not quite clear about. How do you "pick out" the terms using search and replace?

I've uploaded the xml file to our file server. You can login and retrieve it here:

http://www.word-connection.at/english/client-login.html

Login: guest

Password: guest

The termbase has 7 languages but 1 (Spanish) is completely empty. It also has a picklist (Product range), but this export has already been filtered for the product range "Solar", so there's no need for this field to appear in the Excel doc. Just the terms are required.

I'd be very interested (not to mention very thankful!!) to hear how you do it.

Jim


 

FarkasAndras
Local time: 07:14
English to Hungarian
+ ...
Script Mar 11, 2010

Jim Thomson wrote:

FarkasAndras wrote:

Then you pick out the terms with search and replace, one language at a time, leaving all the rubbish behind. When there is no English term for a particular entry, there will be an empty line in the English data set there. You basically end up with 6 separate term lists that are in sync. All you need to do is paste them to adjacent columns in an Excel sheet.


This is the step that I'm not quite clear about. How do you "pick out" the terms using search and replace?


Well, I didn't go into detail because I only figure out the details once I start fiddling with the files. I think I described this in some detail in the old thread, though.
Anyway, I played with this during the afternoon. I have a description, a script and your file.



Description of manual process for those interested:

-> means search and replace, \t and \n stand for tab and line break characters in Notepad++

- Each entry in separate line: delete all line breaks, then <conceptgrp> -> \n
- Synonyms: </term></termGrp><termGrp><term> -> xxxsynonymxxx
- Save under different name.
- Extract languages one by one, e.g. <language type="German" lang="DE"></language><termGrp><term> -> \t
Check for variants, e.g. the language tag is sometimes closed within the first tag itself:
<language type="English" lang="EN"></language><termGrp><term>
<language type="English" lang="EN"/><termGrp><term>

Multiterm also likes to keep you on your toes by varying the order of the attributes, for fun and giggles as far as I can tell:
<language lang="EN" type="English"/><termGrp><term>
<language lang="EN" type="English"></language><termGrp><term>

Then copy to Excel, copy 2nd column back to text editor, replace </term> with \t, copy to Excel, keep 1st column only.
Return to saved version for next language.

- Handle synonyms, i.e. replace xxxsynonymxxx with whatever suits your fancy.
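
The manual steps above can also be approximated with regular expressions. The Python sketch below is only an illustration under a few assumptions: the tag names are taken from the patterns quoted above, the sample data is invented, and, like the manual synonym step, it expects nothing (such as "transac" blocks) to sit between </term> and the next <termGrp>. It is not the code of the actual script.

```python
import re

SYNONYM_MARK = "xxxsynonymxxx"

# Made-up sample covering both tag variants and both attribute orders.
SAMPLE = """<mtf>
<conceptGrp><concept>1</concept>
<languageGrp><language type="English" lang="EN"/><termGrp><term>inverter</term></termGrp><termGrp><term>power inverter</term></termGrp></languageGrp>
<languageGrp><language lang="DE" type="German"></language><termGrp><term>Wechselrichter</term></termGrp></languageGrp>
</conceptGrp>
<conceptGrp><concept>2</concept>
<languageGrp><language type="German" lang="DE"/><termGrp><term>Solarmodul</term></termGrp></languageGrp>
</conceptGrp>
</mtf>"""


def language_marker(language_name):
    """Match the language tag plus <termGrp><term>, whatever the attribute
    order, and whether the tag is self-closed or closed by </language>."""
    return re.compile(
        r'<language\b(?=[^>]*\btype="' + re.escape(language_name) + r'")'
        r"[^>]*?(?:/>|></language>)<termGrp><term>"
    )


def extract_column(xml_text, language_name):
    """Return one cell per entry for one language; '' where it is missing."""
    # Step 1: remove line breaks and indentation between tags.
    flat = re.sub(r">\s+<", "><", xml_text)
    # Step 2: merge synonyms into one cell, joined by a placeholder.
    flat = flat.replace("</term></termGrp><termGrp><term>", SYNONYM_MARK)
    # Step 3: one chunk per entry (the text before the first entry is dropped).
    entries = re.split(r"(?i)<conceptGrp>", flat)[1:]
    marker = language_marker(language_name)
    column = []
    for entry in entries:
        match = marker.search(entry)
        if match is None:
            column.append("")  # keeps this list in sync with the others
        else:
            column.append(entry[match.end():].split("</term>", 1)[0])
    return column
```

Calling extract_column once per language gives term lists of equal length that can be pasted into adjacent Excel columns.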



Script:
I integrated this into my aligner project as I couldn't be bothered to make a separate page for it and it's somewhat related anyway. I tucked it away in the scripts folder so that it wouldn't get in the way for people who just use the aligner. To use, copy MT_XML_converter.bat and deletedupes.bat from scripts to one level further up and run there.

Download:
sourceforge.net/projects/aligner

If someone wants to beta test this (or the aligner), I'm interested in your feedback.



I PMed you a download link for your file.
Obviously, I can take no responsibility for it... I'm pretty certain it's A-OK, but if it's not, I wash my hands. Most entries were duplicated several times, I filtered the dupes out. The unfiltered version is on worksheet 2. Synonyms are separated by xxxsynonymxxx, replace as needed.
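
For the curious, filtering duplicates while keeping the original order can be sketched in a few lines of Python. This is a generic illustration, not a description of how deletedupes.bat works; the function name and the row format (a list of cells per entry) are assumptions.

```python
def drop_duplicate_rows(rows):
    """Keep the first occurrence of each row and preserve the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

Two rows only count as duplicates here if every cell is identical, so an entry that differs in a single language is kept.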


 

FarkasAndras
Local time: 07:14
English to Hungarian
+ ...
Note and flaw, script off for now Mar 11, 2010

Just in case this wasn't clear to everyone: if you have a multiterm termbase you want to export into a spreadsheet format, you can use the script available from sourceforge.net.
It runs in Windows XP (and probably other flavors of Windows as well). Download the zip package, unpack, copy the two bat files from scripts to the main aligner_etc. folder, read the readme, export the termbase (tested on Multiterm 7, default export definition), convert the xml file to UTF-8, copy it to the main folder and run the .bat script.
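
The UTF-8 conversion can be done in most text editors via "Save as" with a different encoding. If you prefer to script it, a minimal Python sketch could look like the one below. It assumes the export is UTF-16, which you should verify for your own file; the function name and file names are made up for illustration.

```python
import re
from pathlib import Path


def convert_to_utf8(source_path, target_path, source_encoding="utf-16"):
    """Re-save a text file as UTF-8 (source encoding is an assumption)."""
    text = Path(source_path).read_text(encoding=source_encoding)
    # If an XML declaration names the old encoding, update it to match.
    text = re.sub(
        r'encoding="[^"]*"', 'encoding="UTF-8"', text, count=1, flags=re.IGNORECASE
    )
    Path(target_path).write_text(text, encoding="utf-8")
```

For example, convert_to_utf8("export.xml", "export_utf8.xml") writes a UTF-8 copy and leaves the original untouched.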

Text fields and picklist values will not be extracted, only the terms themselves.



There was a bit of a hiccup with the script, which I have now fixed. If you downloaded the first version, replace it with the new one.
Now it should work well. There is still no support for extracting other fields, but all index fields and the synonyms within them should be handled correctly. You can check by counting (with search and replace) how many times <term> occurs in the XML and then comparing that to the number of entries+synonyms in the txts.
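
The counting check can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes the extracted data is held as one list of cells per language, with synonyms joined by the xxxsynonymxxx placeholder.

```python
def count_terms_in_xml(xml_text):
    """Count how many times <term> occurs in the exported XML."""
    return xml_text.count("<term>")


def count_terms_in_columns(columns, synonym_mark="xxxsynonymxxx"):
    """Count terms in the extracted lists: every non-empty cell is one term,
    plus one more for each synonym placeholder inside it."""
    total = 0
    for column in columns:
        for cell in column:
            if cell:
                total += 1 + cell.count(synonym_mark)
    return total
```

If the two numbers differ, some terms were lost or counted twice during extraction.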

[Edited at 2010-03-11 20:39 GMT]


 

FarkasAndras
Local time: 07:14
English to Hungarian
+ ...
New version Mar 13, 2010

New version uploaded to sourceforge.net/projects/aligner/
Includes bug fixes and the ability to extract text fields (source, notes, comments etc.)


 

Daniel García
English to Spanish
+ ...
It also works in Windows 7 64 bits Mar 13, 2010


It runs in Windows XP (and probably other flavors of Windows as well).


Hi, Andras,

Just to confirm that it works well in 64-bit Windows 7. It's very neat.

I guess the original poster must have found it useful.

Daniel


 

FarkasAndras
Local time: 07:14
English to Hungarian
+ ...
Thanks Mar 13, 2010

Daniel García wrote:


It runs in Windows XP (and probably other flavors of Windows as well).


Hi, Andras,

Just to confirm that it works well in 64-bit Windows 7. It's very neat.

I guess the original poster must have found it useful.

Daniel



That's nice to know. Microsoft does change the commands that can be used in command line scripts sometimes, and I also had no idea if sed works in a 64-bit environment.

If anyone else has feedback on this script or the aligner, keep it coming here or through sourceforge.


 

Grzegorz Gryc
Local time: 07:14
French to Polish
+ ...
Some workarounds... Mar 14, 2010

FarkasAndras wrote:

Jim Thomson wrote:

Is there any way of getting Multiterm to treat empty fields nevertheless as "blanks" when doing the export?


No. The tab delimited export is broken.


It's just ridiculous that they claim to have a working tab delimited export filter...

You can get a free Across licence.
Generally, their Multiterm XML import filter and CSV/TSV export filters do the job for multilingual termbases.
For bilingual XML you may use Apsic Xbench.

Cheers
GG


 

Jim Thomson
Austria
Local time: 07:14
German to English
+ ...
TOPIC STARTER
Utility not finding the right path to XML file Mar 15, 2010

FarkasAndras wrote:

Download the zip package, unpack, copy the two bat files from scripts to the main aligner_etc. folder, read the readme, export the termbase (tested on Multiterm 7, default export definition), convert the xml file to UTF-8, copy it to the main folder and run the .bat script.


Did all of the above, but I'm getting the error message that "the system cannot find the path" (translated from the German). Have I not created the right folder/path structure? You talk of "the main folder" but I'm not sure what that is as I haven't used your aligner tool before.

Many thanks.
Jim


 

FarkasAndras
Local time: 07:14
English to Hungarian
+ ...
file names, folders Mar 15, 2010

Jim Thomson wrote:

FarkasAndras wrote:

Download the zip package, unpack, copy the two bat files from scripts to the main aligner_etc. folder, read the readme, export the termbase (tested on Multiterm 7, default export definition), convert the xml file to UTF-8, copy it to the main folder and run the .bat script.


Did all of the above, but I'm getting the error message that "the system cannot find the path" (translated from the German). Have I not created the right folder/path structure? You talk of "the main folder" but I'm not sure what that is as I haven't used your aligner tool before.

Many thanks.
Jim

Well, yes, either there is a typo in the file name you gave, or something is not in the right folder. Or maybe the file name contains a space.

- Don't use spaces in file names. BTW don't use accented letters, either. Rename your file if it contains any such characters. Also, leave the dot and the extension off as the script tells you to. I.e. type "SOLAR_complete".
- The folder aligner_1.0_ver.8 is where you should copy the files MT_XML_converter.bat, deletedupes.bat and your UTF-8 XML, alongside the scripts folder and aligner_1.0_ver.8 etc.
(I didn't want to put the exact folder name in the readme as I keep renaming the folder with each new script version and don't want to have to update the readme each time. Perhaps I will just call it "aligner" from the next release on to make the readme a bit easier to follow.)
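
The file name rules above are easy to check before running the script. The Python sketch below is a hypothetical helper, not part of the package; it only tests the three things mentioned: spaces, accented (non-ASCII) characters, and a leftover .xml extension.

```python
def check_base_name(name):
    """Return a list of problems with the name typed into the script."""
    problems = []
    if " " in name:
        problems.append("contains a space")
    if not name.isascii():
        problems.append("contains accented or other non-ASCII characters")
    if name.lower().endswith(".xml"):
        problems.append("includes the .xml extension")
    return problems
```

An empty list means the name follows the rules, e.g. check_base_name("SOLAR_complete").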

To Grzegorz: I didn't know that about Across. If it can handle all the fields, including the ones that occur multiple times at different points of the tree structure (English/Context, French/Context etc.) and provides a tab delimited export, then it can handle some termbases my script can't.
I don't install any new software unless I absolutely must... for people with a similar policy the script can still be useful. It doesn't read or write any files outside of its own folder and needs no install, i.e. it doesn't touch the registry.

[Edited at 2010-03-15 09:11 GMT]

[Edited at 2010-03-15 09:20 GMT]


 

Daniel García
English to Spanish
+ ...
Try removing "scripts\" Mar 15, 2010

Jim Thomson wrote:

Did all of the above, but I'm getting the error message that "the system cannot find the path" (translated from the German). Have I not created the right folder/path structure? You talk of "the main folder" but I'm not sure what that is as I haven't used your aligner tool before.


Hi, Jim,

Try editing the .bat file with Notepad and remove "scripts\" from the beginning of each line that has it.

For instance change this line:

scripts\tr\tr -d \r %file%_mod1.xml

to this:

tr\tr -d \r %file%_mod1.xml

...and so on.

When I ran the script for the first time, I got the error you mentioned several times and removing "scripts\" from the beginning of all lines that had it solved my problem.
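
If the .bat file has many such lines, the edit can be automated. The Python sketch below only illustrates the workaround described above (the function name is made up); keep a backup of the original .bat file, since it is not clear whether this is the right fix for every setup.

```python
def strip_scripts_prefix(bat_text, prefix="scripts\\"):
    """Remove the prefix from the start of every line that begins with it."""
    fixed_lines = []
    for line in bat_text.splitlines():
        if line.startswith(prefix):
            line = line[len(prefix):]
        fixed_lines.append(line)
    return "\n".join(fixed_lines)
```

Lines that do not start with scripts\ are left exactly as they are.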

I thought it was something specific to my environment, but it must be a general issue. Perhaps it's related to the Windows version...

Daniel


 