Thread poster: Juan Martín Fernández Rowda
As you probably know, Statistical Machine Translation (SMT) needs considerably large amounts of text data to produce good translations. We are talking about millions of words. At the same time, SMT can translate millions of words relatively fast (VERY fast, compared to human translators). In this scenario, and speaking mainly from a linguist's perspective, the challenge is: how can one make any sense of all these millions of words? What do you do if you want to find out whether a corpus is good enough to be used in your MT system? How do you know what to improve if you realize a corpus is not good? How can you know what the main topics covered in your corpus are?
It’s unrealistic to try to understand your corpus by reading every single line or word.
Corpus analysis can help you find answers to these questions. It can also help you understand how your MT system is performing and why. It can even help you understand how your post-editors are performing.
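As a minimal illustration of the kind of analysis I mean — this is a sketch using a tiny hypothetical in-memory corpus, not real MT training data — a few summary statistics (corpus size, vocabulary size, lexical diversity, most frequent words) can be computed in a handful of lines of Python:

```python
from collections import Counter

# Hypothetical toy corpus: in practice these would be millions of lines
# read from your SMT training data files.
corpus_lines = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "machine translation needs large amounts of text data",
]

# Lowercase whitespace tokenization; a real pipeline would use a proper tokenizer.
tokens = [tok for line in corpus_lines for tok in line.lower().split()]

total_tokens = len(tokens)                     # corpus size in running words
vocab = set(tokens)                            # distinct word types
type_token_ratio = len(vocab) / total_tokens   # rough measure of lexical diversity

# The most frequent content words give a first hint at the corpus's topics.
top_words = Counter(tokens).most_common(3)

print(total_tokens, len(vocab))   # 20 15
print(round(type_token_ratio, 2)) # 0.75
print(top_words)                  # [('the', 4), ('sat', 2), ('on', 2)]
```

Even these crude numbers are informative: a very low type/token ratio can signal repetitive text, and a frequency list dominated by out-of-domain terms is an early warning that the corpus may not match your target domain.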
In this post, I cover some analysis techniques and tips that I believe are useful and effective for understanding your corpus better: