How to guess the language of a text in Cyrillic script?
Thread poster: Jan Sundström
Is there any guide or smart overview of how to distinguish between different languages when you have a paper in your hand in Cyrillic script?
Sometimes I get documents where I can't say whether it's Russian, Azeri, Mongolian, Macedonian or [insert exotic language here]...?
It would be very useful to have a quick reference chart of what to look for in order to identify which language it is.
If we received the documents as files on the computer it would be easy to cut/paste a sentence and search on the internet, or use language guessing software.
But these are mostly diplomas or forms with handwritten entries, stamps, stickers etc, which makes it cumbersome to scan, OCR etc.
I found this extensive alphabet list:
But I'm looking for a set of hard and fast rules, that I can use on the spot. Like: "if you see the letter Y, you can be sure it's the language X".
Is there any website or guide for this, or am I wishing for the impossible?!
| | mjbjosh
English to Latvian
| Depends on the writer || Jan 24, 2008 |
I am not familiar with all the languages that you named (also, I think Azeri now uses a modified Latin alphabet), but I think it depends on the writer. For example, when I write in Russian, I tend to use a "t" that resembles the Greek "τ" rather than the handwritten Cyrillic one, which looks like a Latin "m". The same goes for "d": I use one like the Greek "δ", whereas the handwritten Cyrillic one looks rather like a Latin "g".
[Edited at 2008-01-24 21:44]
| | esperantisto
English to Russian
| I doubt that simple hard rules can be derived. || Jan 25, 2008 |
a) Even within the same language, there may be huge differences over time: the pre-revolutionary Russian script differs drastically from the modern one.
b) The Slavic languages that use Cyrillic natively are one matter: each developed it in its own way (although, of course, under the great influence of Russian). The non-Slavic languages of the ex-USSR plus Mongolian are another: their scripts were derived from Russian and are more uniform on the one hand but more complex on the other.
Well, learn languages, not scripts! It's the same as with the Latin script.
However, many languages have specific letters. Just a couple of tips:
1. If you see Ўў, this may be Belarusian, Uzbek or some language of the Extreme North of the Russian Federation. I know nothing about the latter, but for the first two:
a) if you also see Ии, that's Uzbek;
b) otherwise, it's Belarusian.
Note: If it's a text from the 1920s, Ў may also be Ossetian, but I doubt you'll encounter it.
2. If Ӕӕ, Ossetian.
3. If Її, Ukrainian (or Ruthenian, but it's a minor language with no official status, not recognized as a separate language in Ukraine).
4. If Ӂӂ, Moldovan (Romanian).
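The four tips above can be written down as a tiny lookup, in case someone wants to try them on a transcribed snippet. This is only a rough sketch under assumptions: the function name `guess_from_marker_letters` is made up for illustration, the tips are simply checked in the order given, and code points are spelled out as escapes because several of these letters have Latin lookalikes.

```python
import unicodedata

# Marker letters from the tips above, written as code points to avoid
# accidentally typing a Latin lookalike.
SHORT_U = "\u040E"    # Cyrillic Ў
LETTER_I = "\u0418"   # Cyrillic И
LIGATURE_AE = "\u04D4"  # Cyrillic Ӕ
LETTER_YI = "\u0407"  # Cyrillic Ї
ZHE_BREVE = "\u04C1"  # Cyrillic Ӂ


def guess_from_marker_letters(text: str) -> str:
    """Apply tips 1-4 in order; return 'undetermined' if none matches."""
    # NFC joins base letter + combining mark into one code point,
    # upper() lets us compare against capital letters only.
    letters = set(unicodedata.normalize("NFC", text).upper())
    if SHORT_U in letters:
        # Tip 1: Uzbek Cyrillic also has И, Belarusian does not.
        return "Uzbek" if LETTER_I in letters else "Belarusian"
    if LIGATURE_AE in letters:
        return "Ossetian"               # tip 2
    if LETTER_YI in letters:
        return "Ukrainian (or Rusyn)"   # tip 3
    if ZHE_BREVE in letters:
        return "Moldovan (Romanian)"    # tip 4
    return "undetermined"
```

A result of "undetermined" only means that none of these four marker letters was found, not that the text is Russian.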
| F7 for texts in soft copy || Jan 26, 2008 |
If you received the text, say, in Word, select a word from the text, then press F7 and you will get a pop-up message like this: "There is no Thesaurus available for (e.g.) Macedonian"...
As for the table on Wikipedia, it's also a very good source: it tells you, for example, that the Macedonian alphabet is the only alphabet that has the letters Ќ and Ѓ...
The same conclusion for Ћ and Ђ for Serbian...
1. Among the major Slavic languages, only Ukrainian and Belarusian use the letter І і.
2. If you find the letter Є є, it's Ukrainian. 100% sure, since no other language uses it (unless it's a text in Old Church Slavonic, but that would be very odd, and you would recognize it by the medieval look and the letter Ѣ ѣ). Also, only Ukrainian uses Ї ї, and it uses neither Ъ ъ nor Ы ы.
Note: There's a small language called Rusyn, which some consider a dialect of Ukrainian. If you find Є є and Ї ї but also Ъ ъ and Ы ы, it must be Rusyn.
3. If you find a language with both І і and Ў ў, it's Belarusian. It uses neither Ъ ъ nor Щ щ.
4. Among Slavic languages, the letter Ј ј is only used by Serbian and Macedonian. (There's a small dialect of Sami which also uses it, but you would recognize it by some letters with a comma-like symbol attached: Ӊ ӊ, Ҋ ҋ, Ӆ ӆ).
5. Besides Ј ј, only Serbian and Macedonian have the distinctive letters Љ љ and Њ њ.
6. If you find a text with Ћ ћ and Ђ ђ, you can be 100% sure it's Serbian.
7. If you find a text with J j plus Ѓ ѓ and Ќ ќ, you can be 100% sure it's Macedonian.
8. I don't know much about non-Slavic languages which use Cyrillic, but they are often characterized by 'unusual' letters like Ә ә or Ӓ ӓ, and by modifications like Ғ ғ, Ұ ұ (the latter found in Kazakh).
9. If none of the distinctive letters mentioned above appears (І, Є, Ј, Ў, Љ, Ћ, Ќ, or Ә, Ғ, Ұ), then most likely it's Russian or Bulgarian.
10. To tell Russian from Bulgarian: Bulgarian uses the letter Ъ ъ very often, while in Russian it's only used in some specific cases. Bulgarian does not use Ё ё, but the combination ьо instead, which is very unusual in Russian (I would say it is only possible in certain foreign words). Unfortunately, Ё ё in Russian is most often written simply as Е е.
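For anyone who does have a snippet typed in, the ten rules above can be strung together roughly as follows. Treat it as a sketch under assumptions, not a reliable detector: the name `guess_language`, the order of the checks and the 1% cut-off for Ъ (`HARD_SIGN_SHARE`) are all made up for illustration, and the non-Slavic check is deliberately placed before the І check because Kazakh also uses І.

```python
import unicodedata

# Code points are spelled out because many of these letters have Latin lookalikes.
DJE, GJE, YE, I_DOT, YI, JE = "\u0402", "\u0403", "\u0404", "\u0406", "\u0407", "\u0408"
LJE, NJE, TSHE, KJE, SHORT_U = "\u0409", "\u040A", "\u040B", "\u040C", "\u040E"
YO, HARD, YERU = "\u0401", "\u042A", "\u042B"      # Ё, Ъ, Ы
SOFT_O = "\u042C\u041E"                            # the pair ЬО
NON_SLAVIC_HINTS = "\u04D8\u0492\u04B0"            # Ә, Ғ, Ұ (rule 8)
HARD_SIGN_SHARE = 0.01  # assumed cut-off for "uses Ъ very often"


def guess_language(text: str) -> str:
    norm = unicodedata.normalize("NFC", text).upper()
    seen = set(norm)
    n_letters = sum(ch.isalpha() for ch in norm)
    if n_letters == 0:
        return "undetermined"
    if TSHE in seen or DJE in seen:
        return "Serbian"                            # rule 6
    if GJE in seen or KJE in seen:
        return "Macedonian"                         # rule 7
    if JE in seen or LJE in seen or NJE in seen:
        return "Serbian or Macedonian"              # rules 4-5
    if any(ch in seen for ch in NON_SLAVIC_HINTS):
        return "non-Slavic (e.g. Kazakh)"           # rule 8
    if YE in seen or YI in seen:
        # rule 2 and its note: Ъ or Ы alongside Є/Ї points to Rusyn
        return "Rusyn" if (HARD in seen or YERU in seen) else "Ukrainian"
    if I_DOT in seen:
        # rule 3, falling back to rule 1
        return "Belarusian" if SHORT_U in seen else "Ukrainian or Belarusian"
    # rules 9-10: only Russian and Bulgarian are left
    if YO in seen:
        return "Russian"
    if SOFT_O in norm:
        return "Bulgarian"
    return "Bulgarian" if norm.count(HARD) / n_letters > HARD_SIGN_SHARE else "Russian"
```

On a single word or a short stamp the last step is little better than a coin toss, since it relies on how often Ъ turns up; it needs at least a sentence or two to mean anything.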
Hope it helps...
[Edited at 2008-01-31 12:05]