Error in coding of TM in TMX format from memoQ
Thread poster: Alaina Brantner
Alaina Brantner
United States
Local time: 09:55
Spanish to English
Apr 6, 2016

I am running into an error when trying to open in Trados a TM exported from memoQ in TMX format.

When I try to open this memoQ-created TMX file (created using version 6.0.67), I get the following error:
The translation memory ________, could not be loaded: 'Data at the root level is invalid. Line 1, position 40.'

When I open the TMX TM using Notepad and find line 1, position 40, what I'm seeing there is a vertical rectangle shaped character between the <source> and the <target>. I can remove that character, and then try to open the TMX TM in Trados again, but the rectangles are throughout the file between the source and target, so when I try to open the TM again in Trados, it just flags the next occurrence of that issue.

Has anyone seen this before? My understanding is that memoQ is supposed to be almost fully compatible with Trados, so I'm not understanding why I'm getting this error when trying to open a TMX from memoQ. Is there a setting for memoQ TMs that I should have selected to avoid these rectangles being populated into a file, and are these even the cause of this error?

I checked on the Trados forum as well and a kind responder indicated that this might be an error of having the TM in unicode and not UTF8? Is there a way to export a TM to TMX in UTF8 in memoQ?

http://www.proz.com/post/2539567#2539567

Any help is much appreciated!



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 15:55
Member (2009)
Dutch to English
+ ...
a few pointers Apr 6, 2016

Alaina Brantner wrote:

I am running into an error when trying to open in Trados a TM exported from memoQ in TMX format.

When I try to open this memoQ-created TMX file (created using version 6.0.67), I get the following error:
The translation memory ________, could not be loaded: 'Data at the root level is invalid. Line 1, position 40.'

When I open the TMX TM using Notepad and find line 1, position 40, what I'm seeing there is a vertical rectangle shaped character between the <source> and the <target>. I can remove that character, and then try to open the TMX TM in Trados again, but the rectangles are throughout the file between the source and target, so when I try to open the TM again in Trados, it just flags the next occurrence of that issue.

Has anyone seen this before? My understanding is that memoQ is supposed to be almost fully compatible with Trados, so I'm not understanding why I'm getting this error when trying to open a TMX from memoQ. Is there a setting for memoQ TMs that I should have selected to avoid these rectangles being populated into a file, and are these even the cause of this error?

I checked on the Trados forum as well and a kind responder indicated that this might be an error of having the TM in unicode and not UTF8? Is there a way to export a TM to TMX in UTF8 in memoQ?

http://www.proz.com/post/2539567#2539567

Any help is much appreciated!


Open the TMX in a good text editor (if you don't have one yet, try the free Notepad++) and change the encoding to UTF-8. For some silly reason, memoQ still exports its TMXs to UTF-16 LE with signature. Then try to import it again.
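For anyone who would rather script the conversion than do it by hand in an editor, here is a minimal Python sketch of the same step. The function name and the file names are made up for illustration, and it assumes the export really is UTF-16 with a BOM, as described above. It only re-encodes the text and updates the encoding named in the XML declaration; it does not repair any other problem in the file.

```python
import re
from pathlib import Path


def tmx_utf16_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode a UTF-16 TMX file as UTF-8 and update its XML declaration."""
    # The "utf-16" codec uses the BOM to pick the byte order and strips it.
    # Without a BOM, Python falls back to the platform's native byte order.
    text = src.read_bytes().decode("utf-16")
    # Change only the encoding named in the XML declaration, if there is one.
    text = re.sub(
        r'''^(<\?xml[^>]*?encoding=)(["'])[^"']*\2''',
        r"\g<1>\g<2>UTF-8\g<2>",
        text,
        count=1,
    )
    # Written as UTF-8 without a BOM; line endings are left untouched.
    dst.write_bytes(text.encode("utf-8"))
```

Usage would be something like tmx_utf16_to_utf8(Path("export.tmx"), Path("export-utf8.tmx")), keeping the original file in case the result is not what you expect.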

If that doesn't work, run the TMX through any of these TMX editors, which will fix the problem 99% of the time:

• Heartsome's free TMX editor (https://github.com/heartsome/tmxeditor8) (I recommend this one)
• Apsic Xbench (free or paid) (http://www.xbench.net/)*
• Okapi Olifant TMX editor (free) (http://okapi.sourceforge.net/downloads.html)

Michael

* Note that Xbench might mess up metadata such as the Client/Subject fields. They promised me they are working on fixing this.

[Edited at 2016-04-06 21:26 GMT]



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 15:55
Member (2009)
Dutch to English
+ ...
@Alaina: Apr 6, 2016

If you want you can just send the TMX to me and I will have a look at it. I have Studio 2015 (and memoQ 2015), and so can try it myself and see what is going on.

Michael

[Edited at 2016-04-06 21:27 GMT]


esperantisto  Identity Verified
Local time: 18:55
Member (2006)
English to Russian
+ ...
Bad file Apr 7, 2016

Alaina Brantner wrote:

When I open the TMX TM using Notepad and find line 1, position 40, what I'm seeing there is a vertical rectangle shaped character between the <source> and the <target>


You have a broken file. A valid TMX file must start with a conventional XML header, such as:

Code:
<?xml version="1.0" encoding="UTF-8"?>



Thus, you (or someone who has created it) have to re-export the translation memory to TMX. Of course, you can also try pasting the available data to an existing valid TMX file, but the above fact may also mean that some data have been lost in export.


esperantisto  Identity Verified
Local time: 18:55
Member (2006)
English to Russian
+ ...
Not the encoding Apr 7, 2016

Michael J.W. Beijer wrote:

For some silly reason, memoQ still exports its TMXs to UTF-16 LE with signature.


I don’t understand what you mean by ‘signature’ here, but UTF-16 LE is an absolutely valid option for encoding TMX files (unless Trados is so badly broken that it can’t understand it). Anyway, the issue does not seem to be about the encoding; see my comment above.

[Edited at 2016-04-07 08:42 GMT]



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 15:55
Member (2009)
Dutch to English
+ ...
:) Apr 7, 2016

esperantisto wrote:

Michael J.W. Beijer wrote:

For some silly reason, memoQ still exports its TMXs to UTF-16 LE with signature.


I don’t understand what you mean by ‘signature’ here, but UTF-16 LE is an absolutely valid option for encoding TMX files (unless Trados is so badly broken that it can’t understand it). Anyway, the issue does not seem to be about the encoding; see my comment above.

[Edited at 2016-04-07 08:42 GMT]


By "signature", I meant "byte order mark" (BOM); "signature" is what EmEditor calls it.

Of course it's valid, but since everyone is moving towards UTF-8 these days, by continuing to export to UTF-16 LE, Kilgray is making it a little harder for people who need to go back and forth between several programs. Also, Kilgray isn't entirely consistent with its choice of UTF-16 LE, but that's another story, and not really relevant here.

MJWB



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 15:55
Member (2009)
Dutch to English
+ ...
@Alaina: Apr 7, 2016

Could you maybe take a screenshot showing those "vertical rectangle shaped characters"? That is, where exactly are they located throughout the file?
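If a screenshot is awkward, one way to pin down the locations is to list every character that XML 1.0 does not allow, with its line and column. The Python sketch below does that for text that has already been decoded; the function name is hypothetical, and whether the rectangles in this particular file are such characters is only an assumption until the file has been checked.

```python
import re

# Anything outside the XML 1.0 "Char" production: most control characters,
# surrogates, U+FFFE and U+FFFF.
_INVALID_XML_CHAR = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text: str) -> list[tuple[int, int, str]]:
    """Return (line, column, code point) for each character XML 1.0 forbids."""
    hits = []
    # Split on "\n" only: str.splitlines() would also split on some of the
    # control characters we are looking for and hide them.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in _INVALID_XML_CHAR.finditer(line):
            code_point = f"U+{ord(match.group()):04X}"
            hits.append((line_no, match.start() + 1, code_point))
    return hits
```

Lines and columns are counted from 1, so they can be compared with the position reported in the error message.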

MJWB


esperantisto  Identity Verified
Local time: 18:55
Member (2006)
English to Russian
+ ...
BOM Apr 7, 2016

Michael J.W. Beijer wrote:

By "signature", I meant "byte order mark" (BOM)


A BOM is mandatory for UTF-16 (in order to differentiate between LE and BE). Thus it is its absence, not its presence, that would cause problems (though not necessarily, as many programs assume LE when no BOM is present).
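To check which BOM, if any, a given file actually starts with, the first few bytes can be compared against the known marks. This is a small Python sketch using the constants from the standard codecs module; the function name and the returned labels are my own choice for illustration.

```python
import codecs

# Longer marks come first: the UTF-32 LE mark begins with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

Reading the first four bytes of the file in binary mode and passing them to sniff_bom is enough. A result of None only means that no BOM was found, not that the file is in any particular encoding.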



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 15:55
Member (2009)
Dutch to English
+ ...
Thanks for the lesson. Apr 7, 2016

esperantisto wrote:

Michael J.W. Beijer wrote:

By "signature", I meant "byte order mark" (BOM)


A BOM is mandatory for UTF-16 (in order to differentiate between LE and BE). Thus it is its absence, not its presence, that would cause problems (though not necessarily, as many programs assume LE when no BOM is present).


Still wish Kilgray'd export their TMXs as UTF-8 though, BOM or no BOM.

MJWB


esperantisto  Identity Verified
Local time: 18:55
Member (2006)
English to Russian
+ ...
No difference Apr 7, 2016

Michael J.W. Beijer wrote:

Still wish Kilgray'd export their TMXs as UTF-8 though, BOM or no BOM.


UTF-8 and UTF-16 are just different representations of Unicode, so what’s the difference? A valid TMX file can be read (by a good program) in any encoding.



Kevin Dias
Local time: 00:55
SITE STAFF
File size Apr 7, 2016

esperantisto wrote:
UTF-8 and UTF-16 are just different representations of Unicode, so what’s the difference? A valid TMX file can be read (by a good program) in any encoding.


The file size of UTF-16 files ends up being about double that of UTF-8. I agree with Michael and wish CAT tools would export their TMX files as UTF-8.


esperantisto  Identity Verified
Local time: 18:55
Member (2006)
English to Russian
+ ...
Depends Apr 7, 2016

Kevin Dias wrote:

The file size of UTF-16 files ends up being about double that of UTF-8.


That is generally incorrect. It only happens for Western European languages and for Eastern European ones that use Latin script. For Cyrillic or Greek there is virtually no difference, as the characters are two-byte in either case (or, well, some characters are three-byte for non-Slavic Cyrillic alphabets), and for the Far Eastern scripts UTF-8 (with three bytes per character) means even bigger file sizes than UTF-16 (two bytes).

But the file size is not very important, because the decoded data take the same memory anyway.
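The per-script numbers are easy to verify. The Python sketch below encodes one short word per script and counts the bytes; the sample words and the function name are arbitrary choices for illustration. A real TMX file also contains plenty of ASCII markup, so the ratio for a whole file depends on the mix of markup and text.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in UTF-8 and in UTF-16 (BOM not counted)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


samples = {"English": "translation", "Russian": "перевод", "Japanese": "翻訳"}
for language, word in samples.items():
    print(language, encoded_sizes(word))

# English {'utf-8': 11, 'utf-16': 22}
# Russian {'utf-8': 14, 'utf-16': 14}
# Japanese {'utf-8': 6, 'utf-16': 4}
```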



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 15:55
Member (2009)
Dutch to English
+ ...
What's the difference? Apr 7, 2016

esperantisto wrote:

Michael J.W. Beijer wrote:

Still wish Kilgray'd export their TMXs as UTF-8 though, BOM or no BOM.


UTF-8 and UTF-16 are just different representations of Unicode, so what’s the difference? A valid TMX file can be read (by a good program) in any encoding.



Well, for one, I wouldn't have to convert the darned thing every time it came out of memoQ.

But I can see you have your mind set on disagreeing, so will leave it at this.


esperantisto  Identity Verified
Local time: 18:55
Member (2006)
English to Russian
+ ...
Why? Apr 7, 2016

Michael J.W. Beijer wrote:

Well, for one, I wouldn't have to convert the darned thing every time it came out of memoQ.


Could you explain why you have to convert? I have had no problems using TMX files from memoQ in OmegaT or Wordfast Anywhere without any conversion, although OmegaT uses UTF-8 as its working encoding and so, it seems, does WFA.



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 15:55
Member (2009)
Dutch to English
+ ...
TMLookup expects UTF-8 + memoQ mailing list thread Apr 7, 2016

esperantisto wrote:

Michael J.W. Beijer wrote:

Well, for one, I wouldn't have to convert the darned thing every time it came out of memoQ.


Could you explain why you have to convert? I have had no problems using TMX files from memoQ in OmegaT or Wordfast Anywhere without any conversion, although OmegaT uses UTF-8 as its working encoding and so, it seems, does WFA.


Every time I want to import a memoQ-generated TMX into TMLookup, for example, I need to convert the UTF-16 file to UTF-8, as TMLookup expects UTF-8. There are also a few other programs that expect UTF-8 rather than UTF-16; I just can't think of them at the moment. LF Aligner? Most of András Farkas' other small tools? LogiTerm maybe?

~

See also my thread in the memoQ mailing list:

https://groups.yahoo.com/neo/groups/memoQ/conversations/topics/43131

Subject: Why does memoQ export its TMXs in UTF-16LE?

michaelbeijer:

This is really annoying, as most programs expect UTF-8 these days.

Is there any reason not to switch to UTF-8?

Michael
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
gergely_vandor:

Hi Michael,

Are you saying that other tools have trouble importing a UTF-16 TMX file as it is? Do you have examples? That would be really silly, as most legacy tools (old Trados, old Wordfast, etc.) all used UTF-16. memoQ also uses UTF-16, and it is the #2 mainstream tool. But not supporting a mainstream encoding like UTF-16 is in itself very silly. Probably back in the day we chose UTF-16 to conform to the old Trados and Wordfast, the major tools of the time. (I'm not sure about DVX.)

I'm a UTF-8 fan myself; it is the best encoding in most cases. But not always. Interestingly, for Japanese and probably other Eastern languages, UTF-16 is more efficient (smaller file sizes for the same content typically), and their own Shift-JIS encoding, which is actually extremely widespread, is even more efficient.

I think we could easily switch our TMX export to UTF-8 but we haven't seen enough reason to do so. I love to explain that in software development, when deciding what to develop next, the question is not "why not?", but "why?". The "why not" question has an infinite number of right answers, and you have very finite resources, and you want a good balance between the effort and the benefits. So if your resources are infinite and/or you can stop time, you can ask yourself "why not", but in all other cases, you need very good justification to develop or modify anything.

BR,
Gergely
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
gergely_vandor:

Hi All,

I've now also checked that in memoQ translation packages, the TMX file is already UTF-8, most probably for the smaller file size. Our TMX export was most probably decided to be UTF-16 to ensure maximum compatibility, because at that time, everybody used UTF-16.

For those who aren't experts and find this scary, UTF-16 and UTF-8 are both Unicode encodings and can represent any character you would need. UTF-8 yields smaller files for most languages (not all).

Gergely
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
Grzegorz Gryc:

Hi

Friday, February 19, 2016, 2:13:26 PM, you wrote:

> (...) Probably back in the day we chose UTF-16 to conform to
> the old Trados and Wordfast, the major tools of the time. (I'm not
> sure about DVX.)

In the DVX TMX export wizard, you can select the UTF flavour you
prefer.

BTW, DVX handles UTF-16 TMX files exported from memoQ.

Cheers
GG
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
michaelbeijer:

Hi Gergely,

Hmm, off the top of my head, the main reason I find it slightly annoying at the moment is TMLookup expects UTF-8, so anything exported from memoQ needs to be converted before I can import it into my TMLookup database.

I also keep all my own TMXs and glossaries in UTF-8, so seeing UTF-16 always makes me feel a little ... uncomfortable.

Michael
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *
michaelbeijer:

Hmm, since your packages already use UTF-8, and most people prefer UTF-8 these days, how about switching to ... UTF-8?

Michael


These are the UTF-16 file-saving options in EmEditor:



[Edited at 2016-04-07 10:34 GMT]

