what's the simplest way to convert a Word file to Unicode?
Thread poster: Vito Smolej

Vito Smolej
Germany
Local time: 10:37
Member (2004)
English to Slovenian
Nov 13, 2008

Once again I got one of those DOC files where the Slovenian č (tsch in German) is steadfastly understood as é. The issue, of course, is the 8-bit Latin-1 (Western) vs. 8-bit Latin-2 (Central European, or CE) character set.
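
To make the mismatch concrete, here is a minimal Python sketch (not from the thread). It assumes the two character sets in question are the Windows code pages cp1250 (Central European) and cp1252 (Western); the function name `misread` is made up for illustration. In this sketch the byte for č comes out as è under cp1252; which accented letter actually shows up depends on the code pages really involved.

```python
# Sketch: the same byte means different letters in different 8-bit code pages.
# Assumption: the text was written as cp1250 and is being read as cp1252.

def misread(text: str, written_as: str = "cp1250", read_as: str = "cp1252") -> str:
    """Encode text in one 8-bit code page and decode the bytes in another."""
    return text.encode(written_as).decode(read_as, errors="replace")


if __name__ == "__main__":
    # Show each Slovenian letter, its cp1250 byte, and what cp1252 makes of it.
    for letter in "čšžČŠŽ":
        raw = letter.encode("cp1250")
        print(letter, raw.hex(), "->", misread(letter))
```

Only č and Č change in this pairing, because š and ž happen to sit at the same byte values in both code pages. Unicode avoids the problem because every letter has its own code point.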

The only sensible long-term solution of course is to have it all in Unicode.

How do you make Word understand that you want the document (with all its frills and formats etc) stored and handled in Unicode?

I have my own convoluted way of doing it, but I am sure there must be a simpler and more elegant way to achieve this.


TiA

Vito



Jing Nie
China
Local time: 16:37
Member (2011)
English to Chinese
You may embed the fonts in the WORD file Nov 14, 2008

Hi Vito,
You may embed the fonts in the Word file:

File > Save As
> click the down arrow next to "Tools" in the upper-right corner
> Save Options
> Embed TrueType fonts
> check the options you need.

Since I do not use an English version of Word, I am not sure the menu names match exactly.



Vito Smolej wrote:

I got again one of those DOC files where the Slovenian č (tsch in German) is steadfastly understood as é. The issue of course is the 8bit Western-1 vs 8bit Western-2 (or CE) character set.

The only sensible long-term solution of course is to have it all in Unicode.

How do you make Word understand that you want the document (with all its frills and formats etc) stored and handled in Unicode?

I have my convoluted way of doing it, but I am sure there must be a more simple and elegant way to achieve this.


TiA

Vito


esperantisto  Identity Verified
Local time: 11:37
Member (2006)
English to Russian
Modern Microsoft Word formats (i.e. 97-2003 and 2007) are all Unicode Nov 14, 2008

You are probably receiving Word 6.0/95 files with the language set incorrectly. Those of us who work with Belarusian/Russian/Ukrainian face this issue regularly. However, the solution is quite simple: write and run a macro that replaces the Western European characters with the Eastern European ones. I can't remember how to do that in Word/VBA, but here is a piece of OOo Basic code to give you a clue; just replace the characters in mCP1251 with the respective Eastern European ones:

Code:
REM ***** Recode from cp1252 to cp1251 for Word and Excel files without a language set.
REM ***** Authors: Dmitry G. Mastrukov and A. Novodroskii, 2002; code corrected by Dmitri Gabinski
REM ***** GPL license


Dim mCP1252(123) As String
Dim mCP1251(123) As String

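Sub Init
REM Chr(157) stands for byte 0x9D, which cp1252 leaves undefined (assumed here to be
REM imported as U+009D); it is paired with the cp1251 letter at the same position.
mCP1252() = Array("€","‚","¸","„","…","†","‡","ˆ","‰","Š","‹","Œ","Ž", _
"‘","’","“","”","•","–","—","™","š","›","œ",Chr(157),"ž", _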
"Ÿ"," ","¡","¢","£","¤","¥","¦","§","¨","©","ª","«", _
"¬","­","®","¯","°","±","²","³","´","µ","¶","·","¸", _
"¹","º","»","¼","½","¾","¿","À","Á","Â","Ã","Ä","Å", _
"Æ","Ç","È","É","Ê","Ë","Ì","Í","Î","Ï","Ð","Ñ","Ò", _
"Ó","Ô","Õ","Ö","×","Ø","Ù","Ú","Û","Ü","Ý","Þ","ß", _
"à","á","â","ã","ä","å","æ","ç","è","é","ê","ë","ì", _
"í","î","ï","ð","ñ","ò","ó","ô","õ","ö","÷","ø","ù", _
"ú","û","ü","ý","þ","ÿ" )
mCP1251() = Array("Ђ","‚","ё","„","…","†","‡","€","‰","Љ","‹","Њ","Ћ", _
"‘","’","“","”","•","–","—","™","љ","›","њ","ќ","ћ", _
"џ"," ","Ў","ў","Ј","¤","Ґ","¦","§","Ё","©","Є","«", _
"¬","­","®","Ї","°","±","І","і","ґ","µ","¶","·","ё", _
"№","є","»","ј","Ѕ","ѕ","ї","А","Б","В","Г","Д","Е", _
"Ж","З","И","Й","К","Л","М","Н","О","П","Р","С","Т", _
"У","Ф","Х","Ц","Ч","Ш","Щ","Ъ","Ы","Ь","Э","Ю","Я", _
"а","б","в","г","д","е","ж","з","и","й","к","л","м", _
"н","о","п","р","с","т","у","ф","х","ц","ч","ш","щ", _
"ъ","ы","ь","э","ю","я" )
End Sub

Sub RecodeAllWriter
    REM Replace every cp1252 character in the current text document with its cp1251 counterpart.
    Dim n As Long
    Dim oDocument As Object
    Dim oReplace As Object
    Init()
    oDocument = ThisComponent
    oReplace = oDocument.createReplaceDescriptor
    For n = LBound(mCP1252()) To UBound(mCP1252())
        oReplace.SearchString = mCP1252(n)
        oReplace.ReplaceString = mCP1251(n)
        oReplace.SearchCaseSensitive = True
        oDocument.replaceAll(oReplace)
    Next n
    MsgBox "Converted"
End Sub

Sub RecodeAllCalc
    REM Apply the same replacements to every sheet of the current spreadsheet.
    Dim n As Long
    Dim m As Long
    Dim oDocument As Object
    Dim oSheet As Object
    Dim oReplace As Object
    Init()
    oDocument = ThisComponent
    REM Loop over the sheets by index instead of running until an error occurs.
    For m = 0 To oDocument.Sheets.getCount() - 1
        oSheet = oDocument.Sheets.getByIndex(m)
        oReplace = oSheet.createReplaceDescriptor
        For n = LBound(mCP1252()) To UBound(mCP1252())
            oReplace.SearchString = mCP1252(n)
            oReplace.ReplaceString = mCP1251(n)
            oReplace.SearchCaseSensitive = True
            oSheet.replaceAll(oReplace)
        Next n
    Next m
    MsgBox "Converted"
End Sub



(Run RecodeAllWriter on a text document or RecodeAllCalc on a spreadsheet, then save to any Unicode-based format; ODF is a good option, as you probably know.)
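
For Central European text, the replacement table does not have to be typed by hand. The following Python sketch (an illustration, not part of the macro above) derives it from the code pages themselves, assuming the text was written as cp1250 but read as cp1252. The names `build_table` and `recode` are hypothetical.

```python
# Sketch: derive a "wrongly read character -> intended character" table
# from two 8-bit code pages instead of listing the pairs manually.
# Assumption: bytes written as cp1250 were interpreted as cp1252.

def build_table(wrong: str = "cp1252", right: str = "cp1250") -> dict:
    """Map each misread character to the character the byte was meant to be."""
    table = {}
    for byte in range(0x80, 0x100):
        raw = bytes([byte])
        try:
            seen = raw.decode(wrong)    # what the reader was shown
            meant = raw.decode(right)   # what the author typed
        except UnicodeDecodeError:
            continue                    # byte is undefined in one of the code pages
        if seen != meant:
            table[ord(seen)] = meant
    return table


def recode(text: str, table: dict = None) -> str:
    """Replace every misread character in a single pass."""
    return text.translate(table if table is not None else build_table())


if __name__ == "__main__":
    print(recode("Koèevje"))
```

Because `str.translate` handles each character exactly once, a replacement can never be picked up and replaced again by a later rule, which is a risk when the substitutions are run one after another.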

[Edited at 2008-11-14 07:29 GMT]


esperantisto  Identity Verified
Local time: 11:37
Member (2006)
English to Russian
Or should it be Southern European? Nov 14, 2008

Sorry, I mixed up Slovenian and Slovak.


Samuel Murray  Identity Verified
Netherlands
Local time: 10:37
Member (2006)
English to Afrikaans
Fiasco, fiasco Nov 14, 2008

Vito Smolej wrote:
I got again one of those DOC files where the Slovenian č (tsch in German) is steadfastly understood as é. The issue of course is the 8bit Western-1 vs 8bit Western-2 (or CE) character set.


Well, it all depends on how this fiasco started. And if it is indeed a fiasco, it stands to reason that the solution will be convoluted, no? How can this happen?

1. Someone edited the file using a non-Unicode font, a bitmapped font or an iconic font.
2. Someone opened a TXT file in MS Word and selected the wrong encoding.
3. Someone opened the TXT file in a text editor, and selected the wrong encoding, and copied/pasted the text in MS Word...
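
If the raw text file is still available, the second and third scenarios can be checked before anything is opened in Word. The sketch below is a rough heuristic in Python, not something from the thread: it decodes the bytes with a few candidate code pages and picks the one that yields the most letters expected in the language. The function name `guess_codepage`, the candidate list and the letter set are all assumptions chosen for Slovenian.

```python
# Sketch: guess which 8-bit code page a byte string was saved in by counting
# how many expected letters each candidate decoding produces.
# The candidates and the expected letters are assumptions for Slovenian text.

def guess_codepage(raw: bytes,
                   candidates=("cp1250", "cp1252", "iso8859_2"),
                   expected: str = "čšžČŠŽ") -> str:
    """Return the candidate whose decoding contains the most expected letters."""
    def score(name: str) -> int:
        decoded = raw.decode(name, errors="replace")
        return sum(decoded.count(ch) for ch in expected)
    # On a tie, max() keeps the first candidate in the list.
    return max(candidates, key=score)


if __name__ == "__main__":
    sample = "čaša žeje".encode("cp1250")
    print(guess_codepage(sample))
```

A short sample with few accented letters can easily produce a tie, so the result is only a hint about which encoding to select in the import dialog.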

How do you suspect it happened? Send me the file and I'll have a look.

What operating system are you using?



Vito Smolej
Germany
Local time: 10:37
Member (2004)
English to Slovenian
TOPIC STARTER
You got it, Sam... Nov 14, 2008

Well, it all depends on how this fiasco started. And if it is indeed a fiasco, it stands to reason that the solution will be convoluted, no? How can this happen?

I have absolutely no idea ... It's a DOC file from the agent (asking for a Dutch spell checker...), and I managed to kill the 8-bit version by exporting to TXT, setting UTF-8 and reimporting. So far I have got away with this murder.
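
The export/reimport round trip boils down to re-encoding the text file. A minimal Python sketch of that single step follows, assuming the exported TXT is in cp1250; the function name `reencode_to_utf8` and the file names are hypothetical. Like the TXT export itself, it keeps only the text, not the formatting.

```python
# Sketch: re-encode an exported plain-text file from an 8-bit code page to UTF-8.
# Assumption: the source file is cp1250; pass another code page name if it is not.
from pathlib import Path


def reencode_to_utf8(src, dst, source_encoding: str = "cp1250", bom: bool = False) -> None:
    """Read src in the given code page and write it to dst as UTF-8."""
    text = Path(src).read_bytes().decode(source_encoding)
    # "utf-8-sig" prepends a byte order mark, which some programs use to detect UTF-8.
    target = "utf-8-sig" if bom else "utf-8"
    # Working on bytes keeps the original line endings untouched.
    Path(dst).write_bytes(text.encode(target))


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    reencode_to_utf8("export.txt", "export_utf8.txt")
```

Decoding with the wrong `source_encoding` does not fail for most bytes; it silently produces the wrong letters, so it is worth checking a few accented words in the result.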

The hateful point is that with Trados exports you always get all those pages and pages and pages of fonts (because in some dark prehistoric time you used Lucida on some text in Swahili, you know what I mean). And eventually you still get that hateful é jumping in your face, saying "you'll never get rid of me, NEVER ..."

The alternative, which I have used several times with success, was to save the file as RTF while explicitly requesting (this takes VB code) "Encoding:=msoEncodingUTF8" or something of this sort. It did not (!) work this time.

I said to myself "...now, this should be simpler than that...", so I have written it up here.

regards

Vito

PS: The system is XP with Word 2003 ... a pretty much plain-vanilla situation. And thanks, everybody, for trying to help me. I feel flattered by the suggestions. But I would rather spend time translating for pay than for fun.


[Edited at 2008-11-14 12:40 GMT]



Vito Smolej
Germany
Local time: 10:37
Member (2004)
English to Slovenian
TOPIC STARTER
Just came in - the original story Nov 14, 2008

Well, it all depends on how this fiasco started. And if it is indeed a fiasco, it stands to reason that the solution will be convoluted, no? How can this happen?



Hi Vito,

Thanks for all files.
I think the encoding was "wrong" in the Word files I sent you because I copied this text from Excel into Word.
And now I need to copy it back into Excel.
Can you tell me how you changed the encoding?

Now we know. On second thought, maybe not.

regards

Vito

[Edited at 2008-11-14 13:04 GMT]



