Polish forum to be converted to Unicode
Thread poster: Andrew Wright (X)
Andrew Wright (X)
United States
Local time: 09:50
English
Feb 7, 2006

Hello forum goers,

We will be converting this forum to use Unicode as the default character set some time later today. There won't be any need to change your browser's default encoding; it should recognize the page's character set automatically once the conversion takes place.

Once this is done, you will notice that some older posts may no longer be legible. If you view these posts you will see a link reading "Is the text in this post garbled? Click here", or an icon of a speech bubble containing a question mark. Clicking either of these will bring you to a dialogue that will ask you a few questions to determine how to convert that post into Unicode.

For more information on Unicode, see the recently revised Character Set and Localization FAQ:
http://www.proz.com/faq/localization

For more information on the conversion tool see the about page:
http://www.proz.com/?sp=charset_issues

If there is anything else that is not fully explained, please let me know.

Andrew Wright
Site Staff


 
Andrew Wright (X)
United States
Local time: 09:50
English
TOPIC STARTER
Auto converter enabled Feb 8, 2006

I've enabled an auto converter for old posts in this forum. If everything worked right, it should automatically convert most older posts into Unicode, and any that didn't work can still be manually converted via the conversion tool.

Of course, I don't actually speak Polish so I need someone here to tell me if everything looks ok.

-Andrew Wright


 
Magda Dziadosz  Identity Verified
Poland
Local time: 15:50
Member (2004)
English to Polish
Looks OK! Feb 8, 2006

Hi Andrew,
I've made a random check in past threads and it looks very good!
I only noticed one post still a bit garbled, but was able to convert it manually with no prob.

We will let you know if there are any problems (hopefully not).

Best,
Magda


 
Jaroslaw Michalak  Identity Verified
Poland
Local time: 15:50
Member (2004)
English to Polish
SITE LOCALIZER
Not quite... Feb 8, 2006

I have checked several popular topics at random... While most of it looks OK, there are many "š" letters, which do not appear in Polish (s with caron, not to be confused with "ś", s with breve, which is certainly Polish!). It should be converted to "ą" (a with ogonek).

See, for example:
http://www.proz.com/topic/39408
http://www.proz.com/topic/39355
http://www.proz.com/topic/37594


 
Gwidon Naskrent  Identity Verified
Poland
Local time: 15:50
English to Polish
š Feb 8, 2006

there are many "š" letters, which do not appear in Polish


This is what happens when you try to read text encoded in CP-1250 and the browser applies the ISO-8859-2 standard. Other affected letters include ź and ą.

Unrelatedly, ś is not s with breve, but s with acute.
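
The mix-up described here can be reproduced in a few lines. This is only a minimal sketch using Python's standard `cp1250` and `iso8859_2` codecs; the helper name `misread` is made up for illustration and has nothing to do with the site's own converter.

```python
def misread(text: str, written_in: str = "cp1250", read_as: str = "iso8859_2") -> str:
    """Encode text in one code page, then decode the same bytes with another."""
    return text.encode(written_in).decode(read_as)


# Polish letters whose byte values differ between the two code pages.
for letter in "ąśźĄŚŹ":
    code = letter.encode("cp1250")[0]
    print(f"{letter} (Windows-1250 code {code}) is read as {misread(letter)!r}")
```

With these codecs, ą comes out as š. The Windows-1250 codes for ś and ź fall in a range where ISO-8859-2 only has control characters, so those letters typically show up as blanks or boxes rather than as a wrong letter.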


 
Magda Dziadosz  Identity Verified
Poland
Local time: 15:50
Member (2004)
English to Polish
Converting them manually Feb 8, 2006

Hi Jabber,
I've seen such posts as well, and they need to be converted manually by following the "Is this post garbled..." link. It seems that the Windows-1250 encoding needs to be selected each time.

Strange that your post here seems garbled, too...?

Magda


 
Jaroslaw Michalak  Identity Verified
Poland
Local time: 15:50
Member (2004)
English to Polish
SITE LOCALIZER
The basis for conversion? Feb 8, 2006

Well, it means that the pages were automatically converted from ISO-8859-2. However, it seems that it had been agreed (see: http://www.proz.com/post/176328#176328 ) that the Polish forum would use Win-1250, so I suppose _most_ of the posts will be in that encoding.

Therefore, I think that Win-1250 should be used as the conversion base. Of course, it would be even better to detect which encoding was originally used (it should not be that hard, as some of the codes are exclusive...).


 
Andrew Wright (X)
United States
Local time: 09:50
English
TOPIC STARTER
Default switch Feb 8, 2006

Ok, I just switched the default conversion base from ISO-8859-2 to Windows-1250. Let me know if that works out better or worse.

As a side note, detecting the differences between most single-byte character sets (which includes all the ISO-8859-# and windows-125# sets) is a difficult or impossible task for a computer. In Windows-1250 a certain byte might be displayed as 'ą', and in ISO-8859-2 the same byte displays as 'š'. But to the computer all this looks like is "10111001"; without a character set, the computer doesn't know what letter this byte is supposed to be. There are ways to guess based on analysis of the text versus the linguistic properties of the language, but those are too complex to implement here.
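
To illustrate the point that a byte has no meaning without a character set, here is a small sketch that decodes one and the same byte under several code pages. The byte 0xB9 (binary 10111001, decimal 185) and the helper name `interpretations` are chosen purely for illustration; the codec names are Python's standard ones.

```python
def interpretations(raw: bytes, codecs=("cp1250", "iso8859_2", "cp1252")) -> dict:
    """Decode the same bytes under each codec and collect the results."""
    return {codec: raw.decode(codec) for codec in codecs}


one_byte = bytes([0b10111001])  # a single byte, value 185
for codec, text in interpretations(one_byte).items():
    print(f"{codec:10} -> {text}")
```

Each codec yields a different character for the identical byte, which is why the stored data alone cannot say which code page the author used.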

Andrew Wright


 
Jaroslaw Michalak  Identity Verified
Poland
Local time: 15:50
Member (2004)
English to Polish
SITE LOCALIZER
No text analysis, just codes... Feb 8, 2006

It's just a number of codes, so it should be fairly easy to detect which posts are converted incorrectly, assuming that the text was written in Polish (which is a safe assumption for this forum). For example, if after conversion from Win 1250 the given post contains the characters ± or ¶, it means it should have been converted from ISO instead.

In other words, if the Polish post contains the character with the code 177 (±), it was written in ISO. On the other hand, if it contains the character with the code 185 (which shows up as š when read as ISO), we might be pretty sure that it was written in Win 1250. (I hope I got the codes right! Only got win character map here to check...)

I have selected this pair, as it represents the quite frequent Polish letter "ą" in both code pages. There are other pairs, as well...
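
The idea of looking only at the exclusive codes could be sketched roughly as follows. This is an illustration under the assumption stated above (the post is Polish and was written in either ISO-8859-2 or Windows-1250); the names `exclusive_codes` and `guess_encoding` are hypothetical and this is not how the site's converter is known to work.

```python
from typing import Optional

POLISH_LETTERS = "ĄĆĘŁŃÓŚŹŻąćęłńóśźż"


def exclusive_codes():
    """Return (iso_only, win_only): byte values that encode a Polish letter
    in one of the two code pages but not in the other."""
    iso_only, win_only = set(), set()
    for letter in POLISH_LETTERS:
        iso = letter.encode("iso8859_2")[0]
        win = letter.encode("cp1250")[0]
        if iso != win:  # letters such as ł or ż share a code and tell us nothing
            iso_only.add(iso)
            win_only.add(win)
    return iso_only, win_only


ISO_ONLY, WIN_ONLY = exclusive_codes()


def guess_encoding(raw: bytes) -> Optional[str]:
    """Guess the code page of a Polish post from its exclusive byte values.
    Returns None when the bytes give no evidence, or equal evidence, either way."""
    iso_hits = sum(b in ISO_ONLY for b in raw)
    win_hits = sum(b in WIN_ONLY for b in raw)
    if iso_hits > win_hits:
        return "iso8859_2"
    if win_hits > iso_hits:
        return "cp1250"
    return None


sample = "Zażółć gęślą jaźń"
print(guess_encoding(sample.encode("cp1250")))
print(guess_encoding(sample.encode("iso8859_2")))
```

Only ą, ś, ź and their capitals differ between the two code pages, so a post without any of them gives no answer, and a non-Polish post could fool the check.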



[Edited at 2006-02-08 18:13]


 
Gwidon Naskrent  Identity Verified
Poland
Local time: 15:50
English to Polish
+ ...
The full story Feb 9, 2006

You can find the full listing of the Polish letter codes (plus the "section sign") in five encodings at http://www.republika.pl/elgec/m2.htm

Only CP1250 (aka Windows EE) and ISO-8859-2 (aka ISO Latin 2) are used nowadays, though.


[Edited at 2006-02-09 11:18]


 
Andrew Wright (X)
United States
Local time: 09:50
English
TOPIC STARTER
Addition to the system Feb 9, 2006

Hello again,

Just now I've added a new feature to the migration system.

At the moment what is happening in this forum is that old data is pulled from the database and converted to Unicode before it is sent to your browser for display. This is nice, because most older posts remain visible. However, the downside to this approach is that old data remains in non-Unicode format in the database, which means that it will not match a search if the search is done in Unicode.
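
The search problem can be pictured with a small sketch. It assumes, purely for illustration, that a post is stored as Windows-1250 bytes and that a search compares the UTF-8 bytes of the query against the stored bytes; the real database and search code are not shown in this thread, and all names below are hypothetical.

```python
LEGACY = "cp1250"  # assumed encoding of the old, unconverted rows


def render_for_display(stored: bytes) -> str:
    """Convert on the fly for the browser; the stored bytes stay untouched."""
    return stored.decode(LEGACY)


def matches(stored: bytes, query: str) -> bool:
    """Naive byte-level search using a UTF-8 encoded query."""
    return query.encode("utf-8") in stored


old_row = "książka".encode(LEGACY)                  # how the old post sits in storage
migrated = old_row.decode(LEGACY).encode("utf-8")   # the same post stored as Unicode

print(render_for_display(old_row))   # readable once converted for display
print(matches(old_row, "książka"))   # False: the stored bytes differ from the query
print(matches(migrated, "książka"))  # True: found after the stored data is converted
```

The post displays fine either way, but only the migrated copy can be found by a Unicode search, which is why the stored data itself eventually has to be converted.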

So we needed a way to tell the system which automatically converted posts came out legible. But we also didn't want to make this process tedious for whoever was doing it.

So today I've added a link next to the "Is this text garbled?" link that reads "Is the text in this post legible?". Clicking this link uses JavaScript to let the conversion system know that the automatic conversion was correct, and thanks to the JavaScript the user won't actually have to leave the page they are viewing to do it.

If clicking on the link generates any errors on the page, please copy the error and post it in this topic or email it directly to me.

Thanks,
Andrew Wright


 

