Pages in topic:   [1 2] >
TMX fixer
Thread poster: Samuel Murray

Samuel Murray  Identity Verified
Netherlands
Local time: 09:06
Member (2006)
English to Afrikaans
+ ...
May 31, 2011

G'day everyone

I'm looking for a program that can fix a non-compliant TMX file for me in a jiffy. I'm using Windows XP Pro SP2.

My TMX file contains illegal characters, and some segments contain unterminated entities. The program in which I want to use the TMX file isn't intelligent enough to simply ignore these errors -- no, the program stops (and dies) at every error.

What I need is a program that will fix the TMX. In other words, it should seek out illegal characters and simply delete them (or replace them with placeholder characters), and it should seek out unterminated entities and simply delete or dummy-fix them (e.g. by changing a lone ampersand to an entitised ampersand). Or... it should simply delete the segments that are non-conformant.

Do you know of such a program?

Thanks
Samuel


[Edited at 2011-05-31 11:54 GMT]



Natalie  Identity Verified
Poland
Local time: 09:06
Member (2002)
English to Russian
+ ...

Moderator of this forum
Why not Olifant? May 31, 2011

Hi Samuel, Olifant can do this for you:


How to open a TM that contains invalid XML characters?

Some TMX files may come with control characters that are invalid in an XML document (often found in the RTF-generated style sheet section). When one or more such characters are in the TMX file, an error "hexadecimal value 0xHH is an invalid character" occurs when opening or importing the TMX file.

To open such a TMX file in Olifant:
Select the Import command from the File menu.
Select the path of the TMX file to import.
Click Open.
A TMX Import Options dialog box opens.
Make sure the option Check for invalid characters is set.
Click OK.

A temporary copy of the file is created, in which each invalid XML character is replaced by _#xHHHH_, where HHHH is the Unicode hexadecimal value of the character. The temporary file is then opened. Note that these characters are left like this afterwards: Olifant does not convert them back automatically when saving the file. You have to decide yourself what you want to do with them.

You should use this option only when needed, as it increases the time it takes to open the file.
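
For anyone who wants the same effect outside Olifant, the placeholder substitution described above can be sketched in a few lines of Python. This is only an illustration, not Olifant's actual code: the function name `mark_invalid_chars` is made up, and the character class is taken from the `Char` production of the XML 1.0 specification.

```python
import re

# Everything outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def mark_invalid_chars(text: str) -> str:
    """Replace each invalid XML character with a visible _#xHHHH_ placeholder."""
    return INVALID_XML_CHARS.sub(lambda m: "_#x%04X_" % ord(m.group()), text)


if __name__ == "__main__":
    sample = "<seg>Hello\x0bworld\x01</seg>"
    print(mark_invalid_chars(sample))  # prints <seg>Hello_#x000B_world_#x0001_</seg>
```

Replacing the lambda with an empty string would delete the characters instead of marking them.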



Samuel Murray  Identity Verified
Netherlands
Local time: 09:06
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
Steps? May 31, 2011

Natalie wrote:
Olifant can do this for you.


Can you please tell me which version of Olifant, and what the steps are?

Because I have used Olifant myself (version 308) and it isn't really helpful. In fact, I have a WF TM of 15000 segments, but Olifant only imports 5000 of the segments. And I have checked some of the segments that were not imported -- they contain no errors.



Michael Joseph Wdowiak Beijer  Identity Verified
United Kingdom
Local time: 08:06
Member (2009)
Dutch to English
+ ...
Just a wild guess, but May 31, 2011

I seem to remember running a few TMs (.tmx) through Xbench, which fixed them in the sense that they could then be imported into a program that had previously balked.

Michael



Natalie  Identity Verified
Poland
Local time: 09:06
Member (2002)
English to Russian
+ ...

Moderator of this forum
-- May 31, 2011

I use ver.3.0.8 and quite recently opened tmx files with 78900 and 92K segments without any problems; I haven't heard about any size limitations for tmx files.

The steps are described above (this is a quotation from Olifant help). You can look through your TM and decide if you wish to replace these letters or just delete the segments.



Piotr Bienkowski  Identity Verified
Poland
Local time: 09:06
Member (2005)
English to Polish
+ ...
Okapi Rainbow? May 31, 2011

http://okapi.opentag.com

It has Fix illegal characters under XML utilities, and since TMX is XML it should work for TMX too.

As for unterminated entities, not sure whether Rainbow can fix it, but for example UltraEdit does parse XML (and hence TMX) files, and takes you directly to the place where the problem is, so that you can fix it manually.
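
The "take me to the problem" behaviour does not require a particular editor. As a rough sketch, Python's standard XML parser reports where the first well-formedness error sits. The helper name `first_xml_error` is invented for this example, and it only finds the first error, not all of them.

```python
import xml.etree.ElementTree as ET


def first_xml_error(xml_text: str):
    """Return (line, column, message) for the first parse error, or None if well-formed."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # position of the offending token
        return line, column, str(err)
    return None


if __name__ == "__main__":
    broken = "<body>\n<seg>AT&T</seg>\n</body>"
    print(first_xml_error(broken))
```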

Regards,

Piotr



Samuel Murray  Identity Verified
Netherlands
Local time: 09:06
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
Manually is not an option May 31, 2011

Piotr Bienkowski wrote:
...for example UltraEdit does parse XML (and hence TMX) files, and takes you directly to the place where the problem is, so that you can fix it manually.


Imagine a file with 15000 segments, with 1000 problem segments. Manual editing is not an option, sorry.



Samuel Murray  Identity Verified
Netherlands
Local time: 09:06
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
@Natalie May 31, 2011

Natalie wrote:
I use ver.3.0.8 ... The steps are described above (this is a quotation from Olifant help).


Thanks, I didn't realise the quoted text was the steps.

You can look through your TM and decide if you wish to replace these letters or just delete the segments.


Thanks, but I don't want to "look through the TM". I want to spend as close to zero time on TM admin and most of my time on actual translation.



Samuel Murray  Identity Verified
Netherlands
Local time: 09:06
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
Alternatively May 31, 2011

Alternatively, does anyone know of a list of the most commonly encountered invalid/illegal XML characters? If I can have such a list, then I can write my own checker.


Didier Briel  Identity Verified
France
Local time: 09:06
Member (2007)
English to French
+ ...
The list is short May 31, 2011

Samuel Murray wrote:

Alternatively, does anyone know of a list of the most commonly encountered invalid/illegal XML characters? If I can have such a list, then I can write my own checker.


From bligner source code:

s/&/& amp;/g;
s/lesser_than_char/& lt;/g;
s/greater_than_char/& gt;/g;
s/"/& quot;/g;
s/[\x00-\x08]|\x0B|\x0C|[\x0E-\x1F]//g;

I have added a space between the & and the entity name, otherwise the forum code interprets it. I had also to use a name for the lesser and greater than characters.

(In some circumstances " might not be invalid, for instance, but let's make it simple.)

Note that Rainbow will already clean these, as Piotr has written.
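
Written out without the forum workarounds, the substitutions above might look like this in Python. This is a loose translation for illustration, not the bligner code itself, and `clean_segment_text` is a made-up name. It is meant for plain text that is about to be written into a TMX file; running it over an existing TMX would escape the markup as well.

```python
import re

# Control characters that XML 1.0 forbids (tab, LF and CR are allowed).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")


def clean_segment_text(text: str) -> str:
    """Escape markup characters and drop forbidden control characters in raw text."""
    text = text.replace("&", "&amp;")  # must come first, or the entities added below get re-escaped
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    text = text.replace('"', "&quot;")
    return CONTROL_CHARS.sub("", text)


if __name__ == "__main__":
    print(clean_segment_text('a & b < "c"\x01'))  # prints a &amp; b &lt; &quot;c&quot;
```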

Didier

[Edited at 2011-05-31 15:56 GMT]



Samuel Murray  Identity Verified
Netherlands
Local time: 09:06
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
@Didier May 31, 2011

Didier Briel wrote:
From bligner source code:
s/[\x00-\x08]|\x0B|\x0C|[\x0E-\x1F]//g;


Thanks, this is useful.

s/&/& a m p ;/g;


In my case, since my files are already TMXes, I'm going to have to figure out how to check if something is a valid entity if it starts with an ampersand. This means finding out what characters are not allowed in an entity.


FarkasAndras
Local time: 09:06
English to Hungarian
+ ...
Unusual characters in forum May 31, 2011

Didier Briel wrote:

I have added a space between the & and the entity name, otherwise the forum code interprets it. I had also to use a name for the lesser and greater than characters.


You can just use the character entities in such situations.
So you can write &amp; and it comes through as &
Similarly, you can do &lt; and get < (in this post, I added yet another amp; to make these show up correctly in the forum).
Thanks for the list, though. Perhaps I'll make a simple TMX fixer .exe for the uninitiated among us who might need it. With your list, it'll take 5 minutes.


FarkasAndras
Local time: 09:06
English to Hungarian
+ ...
That works, too May 31, 2011

Samuel Murray wrote:


s/&/& a m p ;/g;


In my case, since my files are already TMXes, I'm going to have to figure out how to check if something is a valid entity if it starts with an ampersand. This means finding out what characters are not allowed in an entity.


You don't necessarily need to check for the validity of character entities. Replacing all & with &amp; "fixes" illegal character entities by converting them to an ampersand entity followed by some random characters. It should make your file pass the import filter, but of course it will also break any correct character entities that might be in the file. At that point, you might as well do s/&\S+//, of course.
If there are clear trends in how your entities are broken, you could try more refined solutions, or you could just pass the file through some filter that converts entities to literal characters, then remove corrupted character entities and convert back. TMX only allows a very limited set of character entities anyway (amp, lt, gt, quot, apos and maybe their numerical versions), so this would be the correct solution.
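
A middle road between the two approaches is to escape only those ampersands that do not begin a recognised entity. The sketch below assumes the only legitimate references are the five predefined entities plus decimal and hexadecimal character references; `escape_lone_ampersands` is a hypothetical helper name.

```python
import re

# An ampersand that does NOT start one of the five predefined entities
# or a decimal/hexadecimal character reference.
LONE_AMPERSAND = re.compile(
    r"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def escape_lone_ampersands(text: str) -> str:
    """Turn every stray '&' into '&amp;' and leave valid references untouched."""
    return LONE_AMPERSAND.sub("&amp;", text)


if __name__ == "__main__":
    print(escape_lone_ampersands("AT&T &amp; R&D &#169; &nbsp;"))
    # prints AT&amp;T &amp; R&amp;D &#169; &amp;nbsp;
```

Note that this does not check whether a numeric reference points at a legal character, so something like a reference to code point 1 would still need separate handling.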



Didier Briel  Identity Verified
France
Local time: 09:06
Member (2007)
English to French
+ ...
The list of XML entities is short May 31, 2011

Samuel Murray wrote:
In my case, since my files are already TMXes, I'm going to have to figure out how to check if something is a valid entity if it starts with an ampersand. This means finding out what characters are not allowed in an entity.

In addition to the list I gave, the only one missing was &apos; (the apostrophe).
Since it is TMX, no other entities can be defined, so you just have to deal with pure XML entities, as defined in the standard:
XML pre-defined entities

FarkasAndras wrote:
You can just use the character entities in such situations.
So you can write &amp; and it comes through as &

Thank you for the reminder.
(I knew, but I was being lazy.)

Didier



Samuel Murray  Identity Verified
Netherlands
Local time: 09:06
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
Limited number of entities... May 31, 2011

FarkasAndras wrote:
TMX only allows a very limited set of character entities anyway (amp, lt, gt, quot, apos and maybe their numerical versions)...


Oh, yeah, I seem to remember something like that now, yes. Well, that is good news then. Makes my life a little simpler. My own TMX fixer will probably just delete misbehaving segments outright, to keep the quality of the TM as high as possible.
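
A "delete misbehaving segments outright" fixer could be sketched roughly as follows. It assumes each translation unit is written as a non-nested `<tu ...>...</tu>` block and that the file defines no custom entities; the function name `drop_bad_units` is invented for the example. Each unit is parsed on its own, and any unit the standard parser rejects (illegal character, unterminated or unknown entity) is removed.

```python
import re
import xml.etree.ElementTree as ET

# Assumes translation units are not nested and are written as <tu ...>...</tu>.
# The \b keeps the pattern from matching <tuv ...> elements.
TU_BLOCK = re.compile(r"<tu\b.*?</tu>", re.DOTALL)


def drop_bad_units(tmx_text: str):
    """Remove every <tu> block that does not parse as well-formed XML.

    Returns the cleaned text and the number of units removed.
    """
    removed = 0

    def keep_or_drop(match):
        nonlocal removed
        try:
            ET.fromstring(match.group())
        except (ET.ParseError, ValueError):
            # ValueError covers text the parser cannot even encode, such as lone surrogates.
            removed += 1
            return ""
        return match.group()

    return TU_BLOCK.sub(keep_or_drop, tmx_text), removed


if __name__ == "__main__":
    sample = (
        '<body><tu><tuv xml:lang="en"><seg>good</seg></tuv></tu>'
        '<tu><tuv xml:lang="en"><seg>AT&T</seg></tuv></tu></body>'
    )
    cleaned, count = drop_bad_units(sample)
    print(count, cleaned)
```

Because only whole units are dropped, the header and the rest of the file are left exactly as they were.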

