Latin-1 trouble with unicode (UTF-8) on XP
Thread poster: Dirk Bayer

Dirk Bayer
Local time: 07:57
English to German
Feb 23, 2012

I have recently installed and set up OmegaT 2.3.0_1 for English-German translations on Windows XP (service pack 3). I made a glossary (".tab" file) according to the instructions I found which stipulated to use UTF-8 encoding and carriage return / linefeed combos for linebreaks.

However, the glossary and edit panes do not show the intended German characters (glyphs) but gobbledygook double-byte strings (e.g. "abhÃ¤ngig" instead of "abhängig"). Installing ostensibly true unicode fonts like Bitstream Vera Sans and then setting them as OmegaT's font instead of the "Dialog" font made no difference. Applying those fonts in OpenOffice when viewing the target files likewise made no difference. A CodePage 1252 version of the same glossary displays correctly.

It seems OmegaT, OpenOffice, etc. only display CP 1252 on my PC and typing through the familiar US International Keyboard also only produces CP 1252 encoding. How can I produce UTF-8 encoding if clients ask for it?

I would be most grateful for help with this.


 

Didier Briel  Identity Verified
France
Local time: 13:57
Member (2007)
English to French
Use the right extension for UTF-8 Feb 23, 2012

Dirk Bayer wrote:

I have recently installed and set up OmegaT 2.3.0_1 for English-German translations on Windows XP (service pack 3). I made a glossary (".tab" file) according to the instructions I found which stipulated to use UTF-8 encoding and carriage return / linefeed combos for linebreaks.

However, the glossary and edit panes do not show the intended German characters (glyphs) but gobbledygook double-byte strings (e.g. "abhÃ¤ngig" instead of "abhängig").

The documentation says (Glossaries > File format):
Glossary files can be either in system default encoding (and indicated by the extension .tab) or in UTF-8 (the extension .utf8).

So, simply rename your glossary with a .utf8 extension.

Didier
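The garbling described above is what typically happens when UTF-8 bytes are decoded with a single-byte Windows code page. Here is a minimal Python sketch, assuming the system default encoding is CP 1252 (as it appears to be on the poster's XP machine). `encoding_for_glossary` is a hypothetical helper that mirrors the documented extension rule; it is not OmegaT's actual code.

```python
from pathlib import Path

# Assumption: the "system default encoding" on this machine is CP 1252.
SYSTEM_DEFAULT = "cp1252"


def encoding_for_glossary(filename: str) -> str:
    """Hypothetical helper mirroring the documented rule:
    .utf8 means UTF-8, .tab means the system default encoding."""
    if Path(filename).suffix.lower() == ".utf8":
        return "utf-8"
    return SYSTEM_DEFAULT


# One glossary line (term, tab, translation, CR/LF) saved as UTF-8.
raw = "abhängig\tdependent\r\n".encode("utf-8")

# Wrong extension: the UTF-8 bytes are decoded as CP 1252 and get garbled.
garbled = raw.decode(encoding_for_glossary("glossary.tab"))
# Matching extension: the same bytes decode cleanly.
clean = raw.decode(encoding_for_glossary("glossary.utf8"))

print(garbled.split("\t")[0])  # abhÃ¤ngig
print(clean.split("\t")[0])    # abhängig
```

The bytes on disk are identical in both cases; only the encoding assumed by the reader changes, which is why renaming the file is enough.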


 

esperantisto  Identity Verified
Local time: 15:57
Member (2006)
English to Russian
Or .txt Feb 23, 2012

I have no problem with OmegaT 2.3.x/2.5.x using .txt extension for my glossaries in UTF-8.

the glossary and edit panes do not show the intended German characters (glyphs) but gobbledygook double-byte strings (e.g. "abhÃ¤ngig" instead of "abhängig").


Your glossary is actually in UTF-8 and has the correct format, simply change the extension.
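For anyone who wants to confirm that a plain-text glossary really is UTF-8 before renaming it, one rough check is to attempt a strict UTF-8 decode. This is only a sketch: `looks_like_utf8` is a hypothetical helper, and a file containing nothing but ASCII passes the check under either encoding.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same glossary line saved in the two encodings discussed here.
utf8_glossary = "abhängig\tdependent\r\n".encode("utf-8")
cp1252_glossary = "abhängig\tdependent\r\n".encode("cp1252")

print(looks_like_utf8(utf8_glossary))    # True
print(looks_like_utf8(cp1252_glossary))  # False
```

On a real file, the bytes could be obtained with something like `Path("glossary.utf8").read_bytes()` (the file name is only an example).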


 

Dirk Bayer
Local time: 07:57
English to German
TOPIC STARTER
Oops and thanks! Feb 24, 2012

Oops! How did I miss that? Next question: where can I get a good egg remover for my face? ;-)

Seriously, thanks a million for the prompt and excellent responses! :-)

Simply using an ".utf8" extension on the UTF-8 encoded glossary seems to have worked. Even typing non-English characters in OmegaT now using my preferred method (US International Keyboard) seems to create the same results as insertion from the glossaries, no matter if I use the CP 1252 version of the glossary (with ".tab" file name extension) or the UTF-8 version (with ".utf8" file name extension). Exporting an OmegaT-produced odt file to a utf8 cleartext file from OpenOffice now produces the same output as OmegaT does from a ".utf8" source file to a utf8 target file. It seems as if I could even use good old CP 1252 encoded glossaries (with ".tab" extension!) and leave the unicode worries entirely to OmegaT...

Only my previous attempt to use the UTF-8 version of the glossary with an erroneous ".tab" file name extension seems to produce the weird results I saw previously: garbled displays in both OmegaT and OpenOffice plus a 4-byte string for a single non-English character upon exporting to utf8 cleartext from the OpenOffice odt file created by OmegaT.
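The 4-byte string is consistent with double encoding: the two UTF-8 bytes of one character are misread as two CP 1252 characters, and each of those is then written out as UTF-8 again. A small Python sketch of that chain, assuming CP 1252 was the encoding applied to the mislabeled ".tab" file:

```python
char = "ä"

utf8_bytes = char.encode("utf-8")          # the correct 2-byte UTF-8 form
misread = utf8_bytes.decode("cp1252")      # read as CP 1252: two characters
double_encoded = misread.encode("utf-8")   # written out as UTF-8 again

print(len(utf8_bytes), utf8_bytes.hex(" "))          # 2 c3 a4
print(misread)                                       # Ã¤
print(len(double_encoded), double_encoded.hex(" "))  # 4 c3 83 c2 a4

# Reversing the two steps recovers the original character.
repaired = double_encoded.decode("utf-8").encode("cp1252").decode("utf-8")
print(repaired == char)  # True
```

The argument form `bytes.hex(" ")` needs Python 3.8 or later.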


Remaining questions for any takers:

1.) I can't say I understand how inserting the same code sequences from glossaries with different file name extensions creates such different results (in the days before Unicode a code sequence was what it was and you only had to apply the matching font to it), but consistent glyph displays and consistent underlying codes (as revealed in the cleartext files on a unicode-ignorant classic Mac platform) at least suggest that the intended output is now being created. I wonder if there is a good way for verifying UTF-8 vs CP 1252 encoding in the OpenOffice files since they seem to react identically to font changes no matter what glossary was used in their creation as long as the glossary file name matched the glossary's encoding.

2.) I wonder whether normalizing glossaries to use only straight quotes will now result in matches whether or not the source files contain straight or curly quotes, or whether such normalizing will even be necessary. I previously had mixed results.

This is an amazing forum. :-)


 

esperantisto  Identity Verified
Local time: 15:57
Member (2006)
English to Russian
OOo files… Feb 24, 2012

Dirk Bayer wrote:
I wonder if there is a good way for verifying UTF-8 vs CP 1252 encoding in the OpenOffice files


OOo files are and have always been Unicode-based; there’s nothing to verify.
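For the curious, the claim can still be checked by hand. As far as I know, an .odt file is a ZIP package whose text lives in `content.xml`, and that XML file states its encoding in its first line. The sketch below reads that declaration; `declared_encoding` is a hypothetical helper, and the in-memory ZIP only stands in for a real document.

```python
import io
import re
import zipfile


def declared_encoding(odt_file) -> str:
    """Return the encoding named in content.xml's XML declaration.
    Falls back to UTF-8, the usual XML default when none is declared."""
    with zipfile.ZipFile(odt_file) as package:
        head = package.read("content.xml")[:200]
    match = re.search(rb'encoding=["\']([A-Za-z0-9._-]+)["\']', head)
    return match.group(1).decode("ascii") if match else "UTF-8"


# Stand-in for a real document: an in-memory ZIP with a tiny content.xml.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as package:
    xml = '<?xml version="1.0" encoding="UTF-8"?><text>abhängig</text>'
    package.writestr("content.xml", xml.encode("utf-8"))

print(declared_encoding(sample))  # UTF-8
```

With a real file, a path such as `declared_encoding("target.odt")` could be passed instead (the file name is only an example).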


 

Didier Briel  Identity Verified
France
Local time: 13:57
Member (2007)
English to French
UTF-8 is UTF-8 Feb 24, 2012

Dirk Bayer wrote:
Simply using an ".utf8" extension on the UTF-8 encoded glossary seems to have worked. Even typing non-English characters in OmegaT now using my preferred method (US International Keyboard) seems to create the same results as insertion from the glossaries, no matter if I use the CP 1252 version of the glossary (with ".tab" file name extension) or the UTF-8 version (with ".utf8" file name extension). Exporting an OmegaT-produced odt file to a utf8 cleartext file from OpenOffice now produces the same output as OmegaT does from a ".utf8" source file to a utf8 target file. It seems as if I could even use good old CP 1252 encoded glossaries (with ".tab" extension!) and leave the unicode worries entirely to OmegaT...

OmegaT handles everything in UTF-8 internally. So, if the encoding of the input is correctly identified, the output will be correct, provided the target encoding can represent the required characters. For example, you cannot produce CP 1252 files containing Japanese.
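The limitation is easy to see outside OmegaT as well. A short Python sketch, using a hypothetical `can_encode` helper, that also shows how CP 1252 text can be turned into UTF-8 when a client asks for it:

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the target encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(can_encode("abhängig", "cp1252"))  # True
print(can_encode("日本語", "cp1252"))     # False
print(can_encode("日本語", "utf-8"))      # True

# Converting CP 1252 bytes to UTF-8: decode with the source encoding,
# then encode with the target encoding.
cp1252_bytes = "abhängig".encode("cp1252")
utf8_bytes = cp1252_bytes.decode("cp1252").encode("utf-8")
print(cp1252_bytes.hex(" "))  # 61 62 68 e4 6e 67 69 67
print(utf8_bytes.hex(" "))    # 61 62 68 c3 a4 6e 67 69 67
```

The conversion always works in this direction, since UTF-8 can represent every CP 1252 character; the reverse only works for text that stays within the CP 1252 repertoire.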


I wonder if there is a good way for verifying UTF-8 vs CP 1252 encoding in the OpenOffice files since they seem to react identically to font changes no matter what glossary was used in their creation as long as the glossary file name matched the glossary's encoding.

There's nothing to verify: all OpenOffice.org files are in UTF-8.

I wonder whether normalizing glossaries to use only straight quotes will now result in matches whether or not the source files contain straight or curly quotes, or whether such normalizing will even be necessary. I previously had mixed results.

It depends on plenty of things.
In short, OmegaT has no specific function to understand that a straight quote is the same as a curly one.
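Since OmegaT does not do this itself, any normalizing would have to happen outside it, on the glossary, the source text, or both. A minimal Python sketch of such a preprocessing step; `normalize_quotes` and the set of characters it maps are my own assumptions, not an OmegaT feature:

```python
# Map common curly quotes to their straight equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / typographic apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u201e": '"',   # low double quotation mark, used as opening quote in German
})


def normalize_quotes(text: str) -> str:
    """Return text with curly quotes replaced by straight ones."""
    return text.translate(QUOTE_MAP)


source = "the client\u2019s \u201cfinal\u201d version"
term = "client's"  # glossary entry typed with a straight apostrophe

print(term in source)                    # False
print(term in normalize_quotes(source))  # True
```

Whether altering the source text is acceptable depends on the job, so normalizing only the glossary and accepting some missed matches may be the safer choice.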


This is an amazing forum. :-)

For advanced discussion on OmegaT, the Yahoo support group would still be more suitable.

Didier


 

Dirk Bayer
Local time: 07:57
English to German
TOPIC STARTER
Thanks again! Feb 24, 2012

These are very useful and reassuring confirmations.

Many thanks to both of you. :-)


 

