How to guess language in cyrillic script?
Thread poster: Jan Sundström

Jan Sundström  Identity Verified
Sweden
Local time: 18:24
English to Swedish
+ ...
Jan 24, 2008

Hi all,

Is there a guide or a concise overview of how to distinguish between different languages when you have a paper document in Cyrillic script in your hand?

Sometimes I get documents where I can't say whether it's Russian, Azeri, Mongolian, Macedonian or [insert exotic language here]...?

It would be very useful to have a quick reference chart of what to look for in order to identify the language.

If we received the documents as files on the computer it would be easy to cut/paste a sentence and search on the internet, or use language guessing software.
But these are mostly diplomas or forms with handwritten entries, stamps, stickers etc, which makes it cumbersome to scan, OCR etc.

I found this extensive alphabet list:
http://en.wikipedia.org/wiki/List_of_Cyrillic_letters

But I'm looking for a set of hard and fast rules, that I can use on the spot. Like: "if you see the letter Y, you can be sure it's the language X".

Is there any website or guide for this, or am I wishing for the impossible?!

/Jan



Rossi Ignatova  Identity Verified
Local time: 17:24
Spanish to Bulgarian
+ ...
Possibly helpful link Jan 24, 2008

Hi Jan,

You may wish to try this link

http://www.library.yale.edu/cataloging/music/cyrillic.htm

Kind regards,

Rossi Ignatova


Marek Daroszewski (MrMarDar)  Identity Verified
Local time: 18:24
English to Polish
+ ...
Language identifier Jan 24, 2008

You might want to try this site:

http://www.xrce.xerox.com/competencies/content-analysis/tools/guesser

It works for the few languages I have tried out of curiosity.

Best,
Marek


mjbjosh
Local time: 18:24
English to Latvian
+ ...
Depends on the writer Jan 24, 2008

I am not familiar with all the languages that you named (also, I think Azeri uses a modified Latin alphabet), but I think it depends on the writer. For example, when I write in Russian, I tend to use a "t" that resembles the Greek "t" rather than the handwritten Cyrillic one, which looks like a Latin "m". The same goes for the Greek "d", whose handwritten Cyrillic counterpart looks rather like the Latin "g".

[Edited at 2008-01-24 21:44]


esperantisto  Identity Verified
Local time: 20:24
Member (2006)
English to Russian
+ ...
I doubt that simple hard rules can be derived. Jan 25, 2008

a) Even within a single language, the script can differ hugely between historical periods: pre-revolutionary Russian script differs drastically from the modern one.
b) The Slavic languages that use Cyrillic natively are one matter: each developed the script in its own way (although, of course, under great influence of Russian). The non-Slavic languages of the ex-USSR, plus Mongolian, are another matter: their scripts were derived from Russian, so they are more uniform on the one hand but more complex on the other.

Well, learn languages, not scripts! It's just like for Latin.

However, many languages have specific letters. Just a couple of tips:

1. If you see Ўў, this may be Belarusian, Uzbek or some language of the Far North of the Russian Federation. I know nothing about the latter, but for the first two:
a) if you also see Ии, that's Uzbek;
b) otherwise, it's Belarusian.

Note: If it's a text from the 1920s, Ў may also be Ossetian, but I doubt you'll encounter it.

2. If Ӕӕ, Ossetian.

3. If Її, Ukrainian (or Ruthenian, but it's a minor language with no official status, not recognized as a separate language in Ukraine).

4. If Ӂӂ, Moldovan (Romanian).
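
For text that is already in electronic form, the four tips above could be sketched in Python roughly as follows. This is only an illustration, not an existing tool: the function name `guess_from_marker_letters` and the `MARKERS` table are made up here, and nothing beyond the letters listed above is covered. Normalizing to NFC first is an added assumption, so that a letter typed as a base letter plus a combining mark is still matched and Й is not mistaken for И.

```python
import unicodedata

# Hypothetical lookup table built only from tips 2-4 above.
# Escapes are used so the Cyrillic code points cannot be confused
# with Latin look-alikes.
MARKERS = {
    "\u04d5": "Ossetian",   # ӕ (tip 2)
    "\u0457": "Ukrainian",  # ї (tip 3)
    "\u04c2": "Moldovan",   # ӂ (tip 4)
}

SHORT_U = "\u045e"  # ў
CYR_I = "\u0438"    # и


def guess_from_marker_letters(text: str) -> list[str]:
    """Return the languages suggested by tips 1-4 (possibly an empty list)."""
    # NFC keeps precomposed letters together (assumption, see lead-in).
    letters = set(unicodedata.normalize("NFC", text).lower())
    candidates = []
    if SHORT_U in letters:
        # Tip 1: Ў together with И points to Uzbek, Ў without И to Belarusian.
        candidates.append("Uzbek" if CYR_I in letters else "Belarusian")
    for letter, language in MARKERS.items():
        if letter in letters:
            candidates.append(language)
    return candidates
```

For instance, `guess_from_marker_letters("Київ")` gives `["Ukrainian"]`, while plain Russian text gives an empty list, meaning that none of the marker letters was found.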



Radica Schenck  Identity Verified
Germany
Local time: 18:24
English to Macedonian
+ ...
F7 for texts in soft copy Jan 26, 2008

If you received the text in, say, Word, select a word from the text, then press F7 and you will get a pop-up message like this: "There is no Thesaurus available for (e.g.) Macedonian"...


As for the table on Wikipedia, it's also a very good source: it tells you, for example, that the Macedonian alphabet is the only alphabet that has the letters Ќ and Ѓ...

The same goes for Ћ and Ђ in Serbian...

Good luck!


Victor Quero  Identity Verified
Local time: 18:24
Serbo-Croat to Spanish
+ ...
Some hints Jan 31, 2008

1. Among the major Slavic languages, only Ukrainian and Belarusian use the letter І і.

2. If you find the letter Є є, it's Ukrainian. 100% sure, since no other language uses it (unless it's a text in Old Church Slavonic, but that would be very odd, and you would recognize it by its medieval look and the letter Ѣ ѣ). Also, only Ukrainian uses Ї ї, and it uses neither Ъ ъ nor Ы ы.

Note: There's a small language called Rusyn, which some consider a dialect of Ukrainian. If you find Є є and Ї ї but also Ъ ъ and Ы ы, it must be Rusyn.

3. If you find a text with both І і and Ў ў, it's Belarusian. Belarusian uses neither Ъ ъ nor Щ щ.

4. Among the Slavic languages, the letter Ј ј is used only by Serbian and Macedonian. (There's a small dialect of Sami which also uses it, but you would recognize it by some letters with a comma-like mark attached: Ӊ ӊ, Ҋ ҋ, Ӆ ӆ).

5. Besides J j, only Serbian and Macedonian have the distinctive letters Љ љ and Њ њ

6. If you find a text with Ћ ћ and Ђ ђ, you can be 100% sure it's Serbian.

7. If you find a text with J j plus Ѓ ѓ and Ќ ќ, you can be 100% sure it's Macedonian.

8. I don't know much about non-Slavic languages which use Cyrillic, but they are often characterized by 'unusual' letters like Ә ә or Ӓ ӓ, and by modifications like Ғ ғ, Ұ ұ (the latter found in Kazakh).

9. If none of the distinctive letters mentioned above is present (І, Є, Ј, Ў, Љ, Ћ, Ќ, or Ә, Ғ, Ұ), then it's most likely Russian or Bulgarian.

10. To tell Russian from Bulgarian: Bulgarian uses the letter Ъ ъ very often, while Russian uses it only in a few specific cases. Bulgarian does not use Ё ё but the combination ьо instead, which is very unusual in Russian (I would say it occurs only in certain foreign words). Unfortunately, Russian Ё ё is most often written simply as Е е.
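
If the text is available electronically, hints 1-10 could be chained into a rough decision procedure like the Python sketch below. Everything named here (`guess_cyrillic_language`, `HARD_SIGN_THRESHOLD`) is hypothetical, and several choices are assumptions rather than part of the hints: the order of the checks, the 1% cut-off for "very often", treating Ѓ or Ќ alone as enough for Macedonian, and treating either Ъ or Ы as enough to prefer Rusyn over Ukrainian.

```python
import unicodedata

# Assumed cut-off for "Bulgarian uses Ъ very often"; the value is arbitrary.
HARD_SIGN_THRESHOLD = 0.01


def guess_cyrillic_language(text: str) -> str:
    """Walk through hints 1-10 and return a best guess as a string."""
    normalized = unicodedata.normalize("NFC", text).lower()
    letters = set(normalized)

    def has_any(chars: str) -> bool:
        return any(c in letters for c in chars)

    # Hint 8: ә, ғ, ұ suggest a non-Slavic language. Checked first
    # because such languages may share other letters with Slavic ones.
    if has_any("\u04d9\u0493\u04b1"):
        return "non-Slavic (for example Kazakh)"
    # Hint 6: ћ, ђ
    if has_any("\u045b\u0452"):
        return "Serbian"
    # Hint 7: ѓ, ќ (assumed sufficient without also checking ј)
    if has_any("\u0453\u045c"):
        return "Macedonian"
    # Hints 4 and 5: ј, љ, њ
    if has_any("\u0458\u0459\u045a"):
        return "Serbian or Macedonian"
    # Hint 2: є, ї; Ukrainian has neither ъ nor ы, so either suggests Rusyn.
    if has_any("\u0454\u0457"):
        return "Rusyn" if has_any("\u044a\u044b") else "Ukrainian"
    # Hints 1 and 3: і (U+0456) and ў
    if "\u045e" in letters:
        return "Belarusian" if "\u0456" in letters else "undetermined"
    if "\u0456" in letters:
        return "Ukrainian or Belarusian"
    # Hints 9 and 10: only Russian and Bulgarian remain.
    if "\u0451" in letters:  # ё
        return "Russian"
    letter_count = sum(1 for c in normalized if c.isalpha())
    hard_sign_share = normalized.count("\u044a") / max(letter_count, 1)
    if "\u044c\u043e" in normalized or hard_sign_share >= HARD_SIGN_THRESHOLD:
        return "Bulgarian"
    return "Russian or Bulgarian"
```

On a very short sample the Ъ frequency test is unreliable (a single Russian word containing Ъ would already exceed the cut-off), so the last step is only meaningful for a paragraph or more of text.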

Hope it helps...

[Edited at 2008-01-31 12:05]

