Parallel corpus
Thread poster: mari pet

mari pet  Identity Verified
Spain
Local time: 22:01
Member (2012)
Spanish to Slovak
+ ...
Feb 12, 2012

Dear colleagues,

At the moment I am working on my thesis in bilingual lexicography, specifically Polish-Slovak. For my research I will need a parallel PL-SK corpus. To the best of my knowledge none exists, which means I will have to create one. However, I do not want to include only texts translated by me; in fact, I would like to avoid them. So I am looking for parallel texts in Polish and Slovak, ideally with the Polish text as the source text and the Slovak text as the target text.
Here is my request: I would like to know whether any of you knows where I can find this kind of text in Polish and Slovak. Or, if you are a PL-SK translator and have some parallel PL-SK texts that I could use for my research, I would be grateful if you could share some.
I would like to stress that I will use them only for academic purposes.

Thank you for any advice.
marianna

[Edited at 2012-02-12 17:14 GMT]


 

FarkasAndras
Local time: 22:01
English to Hungarian
+ ...
Links Feb 12, 2012

mari pet wrote:

For my research I will need a parallel PL-SK corpus. To my best knowledge there does not exist any, which means I will have to create one.

Polish and Slovak are both European languages of some significance, so there are corpora out there, even free ones. There are also parallel document sets that are a click or two away from becoming parallel corpora.
The most obvious source is EU legislation:
DGT-TM
Europarl corpus
The two should be well over a million sentence pairs.

There's also the EMEA corpus and I'm sure you could find more in 5 minutes.


 

mari pet  Identity Verified
Spain
Local time: 22:01
Member (2012)
Spanish to Slovak
+ ...
TOPIC STARTER
Thanks Feb 17, 2012

Thank you Farkas,

I had a quick look at them and I am not 100% sure, but it looks like these texts are not direct translations of each other but rather translations of English texts, aren't they?

Actually, I am not technically skilled enough to do this myself, so I will need to consult people at my university.

The other thing is that I wanted to avoid EU-related and technical texts and concentrate instead on popular or journalistic texts.

Anyway thank you.


 

FarkasAndras
Local time: 22:01
English to Hungarian
+ ...
Yes Feb 17, 2012

mari pet wrote:

Thank you Farkas,

I had a quick look at them and I am not 100% sure, but it looks like these texts are not direct translations of each other but rather translations of English texts, aren't they?

I'm afraid they are, mostly. Most texts are drafted in English and translated to all other languages from English. Some take a different path but obviously not a lot of them are translated directly from PL to SK or vice versa, and you have no way of finding any that were.


mari pet wrote:
I wanted to avoid EU-related and technical texts and concentrate instead on popular or journalistic texts.

Well, in that case, it'll be a bit tougher. Literature is one obvious source. Literary translations are usually (not always!) translated directly from the source, and older works are out of copyright, so you can just grab them off the internet legally. I'm not sure if Project Gutenberg has books in your languages, so look around.
Another convenient source is film subtitles. Fansubs are really easy to align and available in pretty large quantities in easy-to-process formats, but translation quality and legality are a bit questionable.
Once you have the texts, you just need to align them.

Also google around on the Polish and Slovak internet for parallel corpora. Some universities compile and publish corpora for research, mostly for machine translation development.


 

Michael Beijer  Identity Verified
United Kingdom
Local time: 21:01
Member (2009)
Dutch to English
+ ...
OPUS is always a good place to start... Feb 17, 2012

The 'OPUS open parallel corpus' has a few interesting pl-sk resources. Maybe the OpenSubtitles collections...?

-> opus.lingfil.uu.se



Michael


 

mari pet  Identity Verified
Spain
Local time: 22:01
Member (2012)
Spanish to Slovak
+ ...
TOPIC STARTER
Aligner Feb 19, 2012

I did a test alignment with two small .docx files (translated by me) and it went very well (the final Excel file looks fine). Then I tried it with .txt files extracted from OpenSubtitles, but the alignment failed "probably due to one file being empty or very short". Personally, I think it is because the files are too long (I took the whole files, which have over a million sentences). I think that if I split them into smaller files, it should go well.
The other thing I do not understand is that in the second case the system did not recognize the language codes "pl" and "sk" and treated them as "en" and "hu".

Thank you both.
Andras, your aligner is very helpful, thanks a lot.


 

FarkasAndras
Local time: 22:01
English to Hungarian
+ ...
misc Feb 19, 2012

mari pet wrote:

I did a test alignment with two small .docx files (translated by me) and it went very well (the final Excel file looks fine). Then I tried it with .txt files extracted from OpenSubtitles, but the alignment failed "probably due to one file being empty or very short". Personally, I think it is because the files are too long (I took the whole files, which have over a million sentences). I think that if I split them into smaller files, it should go well.
The other thing I do not understand is that in the second case the system did not recognize the language codes "pl" and "sk" and treated them as "en" and "hu".


A million sentences is a lot... read the readme and switch on "chopping mode" before you try files of that size. With chopping mode on, it should handle 1m segments no problem - please report back. IIRC the largest file I've done so far was 400,000 segments.

At one point, the aligner says "x english sentences read" and "Y hungarian sentences read". Pay no attention to that; your language codes are accepted anyway, and you can see for yourself in the log (it should say "Hunalign dictionary: pl-sk.dic"). The language codes are important for two reasons. First, the segmenter uses the appropriate abbreviation list for segmentation (in your case, it will default to English as there is no pl or sk list shipped with LF Aligner, and perhaps you'll want to align subs without segmenting them in the first place). Second, the autoalignment is done (in part) based on a dictionary. I ship pl-sk dictionary data with Hunalign, but you can replace/expand it if you want to.

If you're grabbing your data from OPUS, you won't need to align it yourself, as they provide autoaligned files. The alignment done with LF Aligner might be marginally better than theirs in the case of general texts, because they use the same autoaligner as LF Aligner (hunalign), but I doubt that they use dictionaries. With subtitles, though, it's probably best to stick with the alignments from OPUS. They have a special workflow for aligning subs that must work better than just feeding the texts to the aligner.
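
For the Moses-format downloads from OPUS, the two plain-text files are already line-aligned: line N of one file corresponds to line N of the other. A minimal Python sketch of merging such a pair into one tab-delimited text, assuming that layout holds; the function names and the sample sentences are made up for illustration and are not part of OPUS or LF Aligner:

```python
from itertools import zip_longest


def pair_moses_lines(src_lines, tgt_lines):
    """Yield (source, target) pairs from two line-aligned iterables of text."""
    missing = object()  # marks the point where one input ran out first
    rows = zip_longest(src_lines, tgt_lines, fillvalue=missing)
    for n, (src, tgt) in enumerate(rows, start=1):
        if src is missing or tgt is missing:
            raise ValueError(f"inputs differ in length at line {n}")
        # Collapse whitespace so stray tabs cannot break the output format.
        src = " ".join(src.split())
        tgt = " ".join(tgt.split())
        if src and tgt:  # skip pairs where either side is empty
            yield src, tgt


def to_tsv(pairs):
    """Render pairs as tab-delimited text, one sentence pair per line."""
    return "".join(f"{src}\t{tgt}\n" for src, tgt in pairs)


if __name__ == "__main__":
    # In practice the two arguments would be open file objects, e.g.
    # open("corpus.pl", encoding="utf-8") and open("corpus.sk", encoding="utf-8")
    # (hypothetical file names).
    polish = ["Dzień dobry.\n", "\n", "Dziękuję.\n"]
    slovak = ["Dobrý deň.\n", "\n", "Ďakujem.\n"]
    print(to_tsv(pair_moses_lines(polish, slovak)), end="")
```

Because the inputs are consumed lazily, the same function should cope with files of a million lines without loading them into memory at once.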


 

Michael Beijer  Identity Verified
United Kingdom
Local time: 21:01
Member (2009)
Dutch to English
+ ...
@mari Feb 19, 2012

I don't understand exactly what you mean by '(...) .txt files extracted from Opensubtitles.'

Aren't there TMXs available from the OPUS site?

Or did you try and align some of the other files? Which ones are you having the problem with? The 'XCES/XML' or the 'Moses' files?

Michael


 

mari pet  Identity Verified
Spain
Local time: 22:01
Member (2012)
Spanish to Slovak
+ ...
TOPIC STARTER
Moses files Feb 20, 2012

I tried it with the Moses files. I thought that .tmx files could be used only with translation memory tools, and I don't know how to make a corpus from them. Sorry, I am not so skilled in this.

 

FarkasAndras
Local time: 22:01
English to Hungarian
+ ...
Convert Feb 20, 2012

mari pet wrote:

I tried it with the Moses files. I thought that .tmx files could be used only with translation memory tools, and I don't know how to make a corpus from them. Sorry, I am not so skilled in this.

You can convert TMX files to whatever format you need. Here's an article I wrote on how to do it. This procedure generates tab-delimited txt files, which you can then process as needed. You can search them with Xbench, which might be sufficient for your purposes; Xbench can give you counts of how many times a given word or expression occurs in the corpus, how many times it co-occurs with another expression in the same language or in the other language, etc.
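
As a rough illustration of the same idea (not the procedure from the article, and not how Xbench works internally), here is a minimal Python sketch that reads PL-SK pairs out of a TMX file and counts in how many segments a term appears on each side. It assumes the usual TMX layout of `tu` / `tuv` / `seg` elements with an `xml:lang` (or older `lang`) attribute and two-letter language codes; the function names and the sample data are hypothetical:

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # how ElementTree names xml:lang


def tmx_pairs(source, src_lang="pl", tgt_lang="sk"):
    """Yield (source, target) segment pairs from a TMX file path or file object."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()[:2]
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() keeps text that sits inside inline formatting tags
                segs[lang] = " ".join("".join(seg.itertext()).split())
        if segs.get(src_lang) and segs.get(tgt_lang):
            yield segs[src_lang], segs[tgt_lang]
        elem.clear()  # free the processed unit; helps with large files


def count_cooccurrence(pairs, src_term, tgt_term):
    """Count segments containing each term, and segment pairs containing both."""
    counts = Counter(source=0, target=0, both=0)
    for src, tgt in pairs:
        in_src = src_term.lower() in src.lower()
        in_tgt = tgt_term.lower() in tgt.lower()
        counts["source"] += in_src
        counts["target"] += in_tgt
        counts["both"] += in_src and in_tgt
    return counts


DEMO_TMX = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="pl"><seg>Dziękuję bardzo.</seg></tuv>
    <tuv xml:lang="sk"><seg>Ďakujem pekne.</seg></tuv></tu>
<tu><tuv xml:lang="pl"><seg>Dziękuję.</seg></tuv>
    <tuv xml:lang="sk"><seg>Vďaka.</seg></tuv></tu>
</body></tmx>"""

if __name__ == "__main__":
    pairs = list(tmx_pairs(io.BytesIO(DEMO_TMX.encode("utf-8"))))
    for src, tgt in pairs:
        print(f"{src}\t{tgt}")  # tab-delimited output, one pair per line
    print(dict(count_cooccurrence(pairs, "dziękuję", "ďakujem")))
```

The matching here is plain case-insensitive substring search per segment, so it counts segments rather than individual occurrences and knows nothing about inflection; for lexicographic work a proper concordancer would be the better tool.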


 

mari pet  Identity Verified
Spain
Local time: 22:01
Member (2012)
Spanish to Slovak
+ ...
TOPIC STARTER
Thank you! Feb 20, 2012

Andras, I will have a look at it in the next few days and will let you know.
A big THANK YOU for all your help and advice!
Regards


 

