MultiTerm Convert: non-ASCII chars lost
Thread poster: Jussi Rosti

Jussi Rosti  Identity Verified
Finland
Local time: 18:53
Member (2005)
English to Finnish
Jan 5, 2007

I'm trying to create an MT termbase from a tab-delimited text file. Everything else goes well, but non-ASCII characters are lost.

Example:

"Lisää tähän" becomes "Lis thn"

I'd appreciate your help!



Jussi Rosti  Identity Verified
Finland
Local time: 18:53
Member (2005)
English to Finnish
TOPIC STARTER
Figured out a workaround Jan 5, 2007

1) I replaced the Finnish characters in the txt file with a tag (e.g. ä with -aaaaa).
2) Did the conversion.
3) Reversed the tagging process, after which the XML file was OK.
4) Imported the XML into MT.

This solved the problem.
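
For anyone who wants to script the tagging and untagging steps above, here is a minimal Python sketch. The placeholder tokens and function names are my own assumptions, not anything MultiTerm requires; any ASCII string that never occurs in the glossary would do.

```python
# Hypothetical ASCII placeholders for the four Finnish characters.
# Each token ends with "-" so it cannot run into following letters.
PLACEHOLDERS = {
    "ä": "-aaaaa-",
    "Ä": "-AAAAA-",
    "ö": "-ooooo-",
    "Ö": "-OOOOO-",
}


def encode_placeholders(text):
    """Replace each non-ASCII character with its ASCII placeholder."""
    for char, token in PLACEHOLDERS.items():
        text = text.replace(char, token)
    return text


def decode_placeholders(text):
    """Reverse encode_placeholders() on the converted XML text."""
    for char, token in PLACEHOLDERS.items():
        text = text.replace(token, char)
    return text
```

Run encode_placeholders() on the txt file contents before the conversion and decode_placeholders() on the resulting XML before importing it. The decoded XML then has to be saved in an encoding that the import accepts.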

Still, any hints on how to fix the problem in the first place?



Piotr Bienkowski  Identity Verified
Poland
Local time: 17:53
Member (2005)
English to Polish
Encoding of the text file is the answer Jan 5, 2007

Jussi Rosti wrote:

I'm trying to create a MT from a tab delimited text file. Otherwise all goes well, but non-ASCII characters are lost.

Example:

"Lisää tähän" becomes "Lis thn"

I'd appreciate your help!


Hi Jussi,

You need to know the encoding of the source txt file. If you are not sure, you can install the jEdit text editor (it's free) and open the file in it; it shows the encoding of the open file in the bottom right corner.
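
If you would rather check from a script than install an editor, a rough heuristic can be sketched in Python. This is only a guess-maker, not a reliable detector: it looks for a byte order mark, then tests whether the bytes are valid UTF-8, and otherwise falls back to a Windows code page. The function name and the cp1252 fallback are my assumptions.

```python
import codecs

# Byte order marks that identify a Unicode encoding unambiguously enough
# for this purpose (UTF-32 is deliberately ignored in this sketch).
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data, fallback="cp1252"):
    """Guess the encoding of raw file bytes; the fallback is an assumption."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: most likely a legacy single-byte code page.
        return fallback
```

Pass it the raw bytes of the file, for example the result of open(path, "rb").read(). A pure ASCII file is reported as UTF-8, which is harmless because ASCII is a subset of it.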

When you use Multiterm Convert, you should specify the encoding of the source file.

If it still does not work, be aware that Multiterm Convert likes the Unicode (UTF-16) encoding best, so you could try converting the file to Unicode using another free utility, Rainbow (see http://okapi.sourceforge.net/Release/Rainbow/ReadMe.htm), and only then feed it to Multiterm Convert.
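
As an alternative to a separate utility, the same re-encoding can be sketched in a few lines of Python. This assumes the source file is in the Windows-1252 code page, which is only a guess for a Western European Excel export; the function name is hypothetical. Whether Multiterm Convert then reads the result correctly is something you would have to test.

```python
from pathlib import Path


def reencode_file(src_path, dst_path, src_encoding="cp1252", dst_encoding="utf-16"):
    """Rewrite a text file in another encoding, leaving line endings untouched.

    Working on raw bytes avoids any newline translation. Python's "utf-16"
    codec writes a byte order mark at the start of the output.
    """
    text = Path(src_path).read_bytes().decode(src_encoding)
    Path(dst_path).write_bytes(text.encode(dst_encoding))
```

For example, reencode_file("glossary.txt", "glossary_utf16.txt") with placeholder file names. Passing dst_encoding="utf-8" produces UTF-8 output instead.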

Hope this helps.

Piotr



Jussi Rosti  Identity Verified
Finland
Local time: 18:53
Member (2005)
English to Finnish
TOPIC STARTER
How to specify the encoding in MT Convert? Jan 5, 2007

Thanks for the advice, Piotr!

Piotr Bienkowski wrote:
You must know what encoding is the source txt file.


Hmm... Windows, I guess (it's a standard Excel export).


When you use Multiterm Convert, you should specify the encoding of the source file.


How do I do that? There are very few options that can be set.
This was my first thought, too.



If it still does not work, you should be aware that Multiterm convert likes the Unicode (UTF-16) encoding best, so you could try converting the file to Unicode, using another free utility, Rainbow, see: http://okapi.sourceforge.net/Release/Rainbow/ReadMe.htm, and only then feed it to Multiterm Convert


My second attempt was to export the text as Unicode text. This didn't change anything.



Piotr Bienkowski  Identity Verified
Poland
Local time: 17:53
Member (2005)
English to Polish
Correction Jan 5, 2007

I just checked, and there is no way to specify a code page in Multiterm Convert.

So either your TXT file should be in Unicode, or you should convert it to an Excel file and go from there.

I have yet another approach, which I use for bilingual glossaries: I use a custom-made Perl script to convert the txt file into Multiterm's XML import format, and then I convert the XML file into Unicode, which Multiterm accepts without complaint.

I did not find a way to make the Perl script write Unicode output directly.
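
To give an idea of the approach, here is a rough Python sketch (not the Perl script itself) that turns a two-column tab-delimited glossary into XML and encodes it as UTF-16 in one step. The element names glossary, entry and term, the lang attribute and the language codes are placeholders I made up for illustration. They are NOT Multiterm's actual import schema, so the structure would have to be adapted to whatever your termbase definition expects.

```python
import csv
import io
import xml.etree.ElementTree as ET


def glossary_to_xml_bytes(tsv_text, source_lang="EN", target_lang="FI"):
    """Convert tab-delimited term pairs into UTF-16 encoded XML bytes.

    The XML layout is a hypothetical placeholder, not Multiterm's format.
    """
    root = ET.Element("glossary")
    # QUOTE_NONE keeps quotation marks inside terms as literal characters.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or incomplete lines
        entry = ET.SubElement(root, "entry")
        for lang, term in zip((source_lang, target_lang), row):
            term_el = ET.SubElement(entry, "term", {"lang": lang})
            term_el.text = term.strip()
    body = ET.tostring(root, encoding="unicode")
    declaration = '<?xml version="1.0" encoding="UTF-16"?>\n'
    # The "utf-16" codec prepends a byte order mark.
    return (declaration + body).encode("utf-16")
```

Write the returned bytes to a file opened in binary mode. Because the text is handled as Unicode throughout and only encoded at the very end, no separate conversion pass is needed.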

Regards,

Piotr



Jussi Rosti  Identity Verified
Finland
Local time: 18:53
Member (2005)
English to Finnish
TOPIC STARTER
Thanks for your help Jan 5, 2007

Piotr Bienkowski wrote:

Just checked and there is no way to specify a code page in Multiterm Convert.


So either your TXT file should be in Unicode, or you should convert it to an Excel file and go from there.


I also tried exporting in Unicode, but apparently it didn't work for me...

Since MT is so inflexible, I guess my workaround is good enough for my Finnish purposes. It's quite easy, since I only need to encode and decode four characters (upper- and lower-case ä and ö).

As for languages like Polish with a larger number of "foreign" characters, a Perl script may be a good way to handle the conversion... thanks for the idea! Since leaving the software business, I sometimes forget what a perfect tool Perl is for a linguist...


dnitzpon
Germany
Local time: 17:53
Dutch to German
UTF-8 works in MT2009 May 23, 2014

I don't want to dig up old threads, but this one was on the first page of search results when I encountered the problem myself, so maybe someone else finds this useful:

I had the same problem with Windows encoding for German special characters, but saving the export file as UTF-8 made MT Convert 2009 process the file without problems.

It is... err, strange, however, that MT Convert evidently neither detects the encoding correctly nor offers any option to specify it...

