Help identify mystery character in SDLXLIFF file (for MS Word)
Thread poster: Samuel Murray

Samuel Murray  Identity Verified
Netherlands
Local time: 19:47
Member (2006)
English to Afrikaans
+ ...
Aug 20, 2015

Hello everyone

Can anyone please tell me what is the character between the two brackets in this file:
http://wikisend.com/download/340442/mystery%20character.zip
and how I can type that character in MS Word, and how I can do a find/replace action with that character in MS Word?

Thanks



Elif Baykara  Identity Verified
Turkey
Local time: 21:47
Member (2015)
German to Turkish
+ ...
A grey square in a box? Aug 20, 2015

I have downloaded the file and this is what I get.

I tried to match it in some fonts, such as Webdings, with no success. It may be another substitute for characters that cannot be shown in a certain font, just like the empty box.



Samuel Murray  Identity Verified
Netherlands
Local time: 19:47
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
EF BB BF Aug 20, 2015

Samuel Murray wrote:
Can anyone please tell me what is the character between the two brackets in this file...?


My hex editor tells me the character is EF BB BF, which, incidentally, is the same as the character at the start of the file, i.e. a UTF8 byte order mark. However, in my SDLXLIFF file, this character occurs in places where I might expect bullets in a bullet list.

I can see the character in MS Word, but I can't copy it to the clipboard, so I'm going to have to learn how to type it in the find/replace box to be able to manipulate it.


Joakim Braun  Identity Verified
Sweden
Local time: 19:47
German to Swedish
+ ...
Would this help Aug 20, 2015

http://wordribbon.tips.net/T009167_Searching_for_Multi-Byte_Hex_Codes.html


Samuel Murray  Identity Verified
Netherlands
Local time: 19:47
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
@Joakim Aug 20, 2015

Joakim Braun wrote:
Would this help?


Nope, I've tried that already, sorry.

One method is to convert the hex code to decimal (e.g. using an online converter) and then type ^u0000 (where 0000 is the decimal code) in the search box. However, the decimal code for BB BF is 48063, and for EF BB BF it is 15711167, and neither of these codes finds the mystery character.

The other method is to type the hex code and then press Alt+X. This method only works if the hex code has four letters/numbers, not more. The Alt+X conversion of BB BF is 뮿, which is not my character either.
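
A minimal Python sketch of the arithmetic above, for anyone who wants to check it outside Word. Python is an assumption here (nobody in the thread used it), and the variable names are made up for illustration. It shows that reading the hex-editor bytes as one big number gives the values that did not work, while decoding the same bytes as UTF-8 gives a different number:

```python
# The three bytes the hex editor shows for the mystery character.
raw = bytes.fromhex("EF BB BF")

# Reading the bytes as one big number gives the values tried above.
as_number_3 = int.from_bytes(raw, "big")      # EF BB BF
as_number_2 = int.from_bytes(raw[1:], "big")  # BB BF
alt_x_guess = chr(as_number_2)                # the character Alt+X makes of BBBF

# Decoding the bytes as UTF-8 gives the actual character and its code point.
char = raw.decode("utf-8")
code_point = ord(char)

print(as_number_3, as_number_2)
print(f"U+{code_point:04X}", code_point)
```

The hex editor shows how the character is stored in UTF-8, whereas ^u and Alt+X expect the Unicode code point, so the two numbers only coincide for plain ASCII characters.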



Adrien Esparron
Local time: 19:47
Member (2007)
German to French
+ ...
In my Ms Word Aug 20, 2015

The mystery character looks like this:

><

I can copy it and replace it with what I want (just a char or a word).

For that I use the default Windows encoding.

Hope this helps!

Regards



Samuel Murray  Identity Verified
Netherlands
Local time: 19:47
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
@Adrien Aug 20, 2015

Adrien Esparron wrote:
The mystery character looks like that:
><
I use for that the Windows encoding by default.


The encoding of the text file is UTF8 with BOM (sorry, perhaps I should have mentioned it, but MS Word is usually pretty good at guessing text files' encoding and I had thought that all installations of MS Word would successfully identify the file as UTF8 with BOM).

If you open a text file that is encoded in one encoding as if it is encoded in another encoding (i.e., what you have done), then different characters will be displayed. If you want to see how this character is displayed in MS Word, then don't select "Windows (Default)" as the encoding, but "Other encoding: Unicode (UTF8)" when opening the file in MS Word.
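
The effect can be reproduced in a few lines of Python (an assumption for illustration; the sample bytes are a made-up stand-in for the file, not its real contents). The same bytes are decoded once as UTF-8 and once as Windows-1252:

```python
# Stand-in for the file: BOM, opening bracket, mystery bytes, closing bracket.
data = b"\xef\xbb\xbf" + b"[" + b"\xef\xbb\xbf" + b"]"

# Opened as UTF-8 ("utf-8-sig" also strips the leading BOM):
# one invisible character remains between the brackets.
as_utf8 = data.decode("utf-8-sig")

# Opened as Windows-1252: every byte becomes its own visible character.
as_cp1252 = data.decode("cp1252")

print(len(as_utf8), len(as_cp1252))
```

Decoded as UTF-8 the text is three characters long; decoded as Windows-1252 it is eight, because each of the three bytes of the mystery character turns into a separate accented letter or symbol.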


Joakim Braun  Identity Verified
Sweden
Local time: 19:47
German to Swedish
+ ...
Reversed byte order? Aug 20, 2015

Samuel Murray wrote:

The Alt+X conversion of BB BF is 뮿, which is not my character either.




And BF BB?
Reversed byte order - worth a try.



Robin Levey
Chile
Local time: 15:47
Spanish to English
+ ...
Zero-Width No-Break Space Aug 20, 2015

In UTF-8, EF BB BF is a zero-width no-break space (see: http://www.fileformat.info/info/charset/UTF-8/list.htm?start=43024 )

In Word, the equivalent character is called “No-Space Non Break” and on my system (Word 2000 / Win XP* ) it can be inserted into a document via the “Insert Symbol” dialogue, “Special Characters” tab, last item in the list. It displays differently to what we see in Samuel’s link, and it has a different hex code: E2 80 8D (again, 3 hex bytes …).

After assigning a key code to this NSNB character (it doesn’t have one by default) I can insert it into a document and replicate something very similar to Samuel’s problem. In contrast to other special characters (eg. ©) I cannot insert this character directly into the “Find” box, using the assigned shortcut, nor can I copy-paste it from the document, as a single character. However, if I know, for example, that it is always preceded by a 'p' and always followed by a ‘q’ I can search for ‘p?q’ (copy-pasted as a 3-character group) and it finds that combination – including the NSNB represented by the ? wildcard for one required character – correctly.

Samuel has said that his mystery character appears in places where he might expect to find a bullet (and there's indeed some typographical logic in the use of this special character in that situation), so maybe there’s a fixed pattern, similar to the one I’ve used above, that he can exploit to do the search. IOW, if Word accepts that the ? wildcard can find Word's E2 80 8D, maybe it will also find Samuel's EF BB BF.

* Other combinations of Word and Windows may give different (or zero) mileage.

HTH
RL



Dan Lucas  Identity Verified
United Kingdom
Local time: 18:47
Member (2014)
Japanese to English
Zero width no-break space? Aug 20, 2015

Samuel Murray wrote:
Can anyone please tell me what is the character between the two brackets in this file:

Emacs thinks it's a ZERO WIDTH NO-BREAK SPACE, as per following dump:

position: 3 of 3 (67%), column: 0
character:  (displayed as ) (codepoint 65279, #o177377, #xfeff)
preferred charset: unicode (Unicode (ISO10646))
code point in charset: 0xFEFF
script: arabic
syntax: w which means: word
to input: type "C-x 8 RET HEX-CODEPOINT" or "C-x 8 RET NAME"
buffer code: #xEF #xBB #xBF
file code: not encodable by coding system iso-latin-1-dos
display: no font available

Character code properties: customize what to show
name: ZERO WIDTH NO-BREAK SPACE
old-name: BYTE ORDER MARK
general-category: Cf (Other, Format)
decomposition: (65279) ('')

As for find and replace, I think ^uxxxx is how to find Unicode characters in MS Word. Perhaps ^u65279 is worth trying.

Regards
Dan



Samuel Murray  Identity Verified
Netherlands
Local time: 19:47
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
Dan licked it (one solution) Aug 20, 2015

Dan Lucas wrote:
Emacs thinks...
... decomposition: (65279) ('')
Perhaps ^u65279 is worth trying.


^u65279 in the Find field works. Thanks, Dan.

So...

To repeat this solution with other characters, one has to either use Emacs, or... reduce a copy of the file to that character only (with known characters on either side of it) and save it as UTF8 plain text, then open it in a hex editor (such as Geoffrey Prewett's 150 KB one), take note of the hex code (in my case EFBBBF), and then find the corresponding HTML entity hex code (in my case #xfeff) and HTML entity decimal code (in my case 65279). One can do this here:

http://www.google.com/search?q=site:.fileformat.info/info/unicode/char/%20efbbbf (for "efbbbf")

To type this character in MS Word, type the HTML entity hex code and press Alt+X (i.e. type FEFF and press Alt+X). To find this character in the find/replace dialog, and presumably also find it in a macro, use the HTML entity decimal code preceded by "^u" (i.e. ^u65279).
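
As an alternative to the online lookup, the same conversion can be sketched in Python (an assumption; not a tool used in this thread). The function name describe_utf8_hex and the dictionary keys are hypothetical, chosen only for this example. It takes the hex bytes from the hex editor and returns the character name, the string to type before Alt+X, and the string for the Find box:

```python
import unicodedata

def describe_utf8_hex(hex_bytes: str) -> dict:
    """Turn hex-editor output for ONE character into the forms Word needs."""
    char = bytes.fromhex(hex_bytes).decode("utf-8")
    if len(char) != 1:
        raise ValueError("expected the bytes of exactly one character")
    code_point = ord(char)
    return {
        "name": unicodedata.name(char, "<unnamed>"),
        "alt_x": f"{code_point:04X}",  # type this in Word, then press Alt+X
        "find": f"^u{code_point}",     # enter this in Word's Find box
    }

info = describe_utf8_hex("EFBBBF")
print(info)
```

This sketch only covers what was tried in the thread; code points above 65535 are outside its scope.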


[Edited at 2015-08-21 07:42 GMT]



Stepan Konev  Identity Verified
Russian Federation
Local time: 21:47
English to Russian
Select and Ctrl+H Aug 20, 2015

Samuel Murray wrote:
To repeat this solution with other characters


To find such characters without a code you need to select one and press Ctrl+H in MS Word. Do not try to copy/paste it – to no avail. Just select and Ctrl+H. The Replace fields appear empty, but when you click Replace all, all such chars disappear.

I see a similar sign in some places within a text translated by qTranslate+Google. But I don't have any idea why this happens...



Samuel Murray  Identity Verified
Netherlands
Local time: 19:47
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
@Stepan (another, simpler solution) Aug 21, 2015

Stepan Konev wrote:
To find such characters without a code you need to select one and press Ctrl+H in MS Word. ... The Replace fields appear empty, but when you click Replace all, all such chars disappear.


Thanks, I didn't even know that I could auto-populate the Find field in MS Word by first selecting the term and then pressing Ctrl+H.

This gave me an idea, though, which works: to find the HTML entity decimal code for the mystery character, simply record a macro with it. In other words, start recording the macro, then select the character, press Ctrl+H, replace it with anything, and stop recording the macro. Then step into the macro, and you'll see the HTML entity decimal code for it: in my case, ChrW(65279).

==

By the way, those Google Translate characters in your screenshot (which I suspect are inserted by Google to help them identify machine translated text while they crawl the web for translations), I simply remove using a macro:

Sub gt_removechars()
    ' Remove every ChrW(8203) (U+200B, zero width space) from the document.
    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Execute FindText:=ChrW(8203), ReplaceWith:="", _
            Replace:=wdReplaceAll
    End With
End Sub
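
For plain text outside Word, the same clean-up can be sketched in Python (an assumption; the macro above is what was actually used). The names INVISIBLES and strip_invisibles are hypothetical, and the sample string is invented. It works on a text string, not on a .docx file:

```python
# Code points named in this thread: 8203 (the Google Translate one) and 65279.
# Mapping a code point to None tells str.translate to delete it.
INVISIBLES = {8203: None, 65279: None}

def strip_invisibles(text: str) -> str:
    """Remove the listed zero-width characters from a string."""
    return text.translate(INVISIBLES)

sample = "machine" + chr(8203) + " translated" + chr(65279) + " text"
cleaned = strip_invisibles(sample)
print(cleaned)
```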

Samuel



Dan Lucas  Identity Verified
United Kingdom
Local time: 18:47
Member (2014)
Japanese to English
Hah, useful Aug 21, 2015

Samuel Murray wrote:
Thanks, I didn't even know that I could auto-populate the Find field in MS Word by first selecting the term and then pressing Ctrl+H.

I wasn't consciously aware of that either! Since Ctrl+H is a common shortcut for find and replace (SDL Studio, Notepad++ and many more) I must have used it in Word many times without really thinking about it. Thank you to Stepan for explicitly pointing it out.

Regards
Dan


Elizabeth Joy Pitt de Morales  Identity Verified
Local time: 19:47
Member (2007)
Spanish to English
+ ...
Thanks! Aug 21, 2015

Stepan Konev wrote:

To find such characters without a code you need to select one and press Ctrl+H in MS Word. Do not try to copy/paste it – to no avail. Just select and Ctrl+H. The Replace fields appear empty, but when you click Replace all, all such chars disappear.



This is extremely valuable information. Thank you!

