Of what use can linguistic corpora be for a translator?
Thread poster: Dominika Lupa
Dominika Lupa
Dominika Lupa  Identity Verified
Poland
Local time: 22:44
English to Polish
+ ...
Jul 2, 2008

Of what use can linguistic corpora be for a translator?

The only idea that comes to mind is that a translator can treat a corpus as a great source of terminological information, since it shows the proper structure of a text and its conventions of style, semantics and the like (within the scope of the chosen corpus, of course).

But in that case, isn't the WWW already a better place to look for such information than the time-consuming creation of corpora?

Any ideas??


 
FarkasAndras
FarkasAndras  Identity Verified
Local time: 22:44
English to Hungarian
+ ...
depends Jul 2, 2008

If you can get your hands on large bilingual (multilingual) parallel corpora like the two EU corpora, they are of great use as a reference translation memory.

It's also fairly easy to "make" such resources if you can get good source material. E.g. I compiled a 70,000-TU English-Hungarian bicorpus/TM of Hungarian legislation - an invaluable tool for translating legal texts.

If you're translating into a B language of yours, a corpus could come in very handy as a collocation/word usage dictionary, esp. if it is made up of/contains a lot of text from the field in question. I have never done that though.
The advantage over the net would be 1) better quality texts and 2) specialized texts.
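
As a rough illustration of the collocation-lookup idea, here is a minimal Python sketch that counts which words directly follow a given word in a set of texts. The function name `right_collocates` and the two sample sentences are invented for the example; a real corpus would hold texts from the field in question, and a dedicated concordancer would do far more than this.

```python
import re
from collections import Counter

def right_collocates(texts, node, top=3):
    """Count the words that immediately follow `node` across all texts."""
    counts = Counter()
    for text in texts:
        # Lowercase and split into simple word tokens (letters and apostrophes only).
        words = re.findall(r"[a-z']+", text.lower())
        for current, following in zip(words, words[1:]):
            if current == node:
                counts[following] += 1
    return counts.most_common(top)

# Hypothetical mini-corpus, made up for this sketch.
sample_texts = [
    "The court shall render a decision within thirty days.",
    "The tribunal may render a judgment or render assistance.",
]
print(right_collocates(sample_texts, "render"))
```

The same loop can be pointed at the word before the node instead of after it to see left-hand collocates.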


 
Niraja Nanjundan (X)
Niraja Nanjundan (X)  Identity Verified
Local time: 02:14
German to English
Parallel texts Jul 2, 2008

Dominika Lupa wrote:

isn't the WWW already a better place to look for such than the time consuming creation of corpora?



Possibly, searching the web for parallel texts in a specialised area could be useful. I don't know if they count as linguistic corpora, though.

[Edited at 2008-07-02 10:35]


 
Dominika Lupa
Dominika Lupa  Identity Verified
Poland
Local time: 22:44
English to Polish
+ ...
TOPIC STARTER
Parallel texts Jul 2, 2008

Indeed,
but what about monolingual corpora?

Corpora are not always bilingual, are they?


 
Susan Welsh
Susan Welsh  Identity Verified
United States
Local time: 16:44
Russian to English
+ ...
What are corpora, anyway? Jul 2, 2008

I've looked about for an understanding of this term, but nothing seems to make much sense. Is it jargon-speak for a glossary?

Thanks!
Susan


 
Giles Watson
Giles Watson  Identity Verified
Italy
Local time: 22:44
Italian to English
In memoriam
A corpus is... Jul 2, 2008

... a collection of texts selected for the purposes of linguistic or terminological analysis.

Last year, the EC made available its translation memory for the entire acquis communautaire in twenty-two languages. The introductory page:

http://langtech.jrc.it/DGT-TM.html

will perhaps help you to understand what people mean by corpora (collections of texts in general) and parallel corpora (bi-texts, translation memories), as well as what you might want to do with them.

One problem with corpora is that the bigger they get, the more "noise" (matches irrelevant to the current search) they tend to contain. On the other hand, if they are more compact and manageable, they may exclude useful information ("selected" is the key concept in the definition above). On a less exalted scale, individual translators face a similar quandary when they are deciding how to organise their own translation memories, which are themselves parallel corpora. Some translators prefer a single "Big Mamma" TM with plenty of fields so that it can be filtered as required but others like separate, leaner memories for each language pair, sector, client and so on.

To answer Dominika's question, yes the WWW is an excellent searchable corpus but it is far from sufficient and not always terribly focused. There is also the issue of source evaluation - just because it googles doesn't mean it's right!

The corpora I use most often are my own translation memories. I use glossaries too, of course, but these tend to be normative, i.e. lists of terms I need to use for a specific translation. The good thing about self-generated TMs is that they can be relied on for quality (one hopes) but they are not necessarily prescriptive. In a nutshell, TMs are invaluable for reminding you that either a) you have to translate a segment in a particular way; or b) you have already translated a segment in ways that perhaps didn't occur to you this time.

Giles

[Edited at 2008-07-02 13:32]


 
Christine Andersen
Christine Andersen  Identity Verified
Denmark
Local time: 22:44
Member (2003)
Danish to English
+ ...
I use one now and then to check usage Jul 2, 2008

I work with closely related languages, and often Danish expressions are fine in English. The British National Corpus is a good place to check that an apparently identical expression really is identical, and not 'source language interference'. Lots of collocations are different...

According to the Concise Oxford Dictionary:
corpus
· n. (pl. corpora or corpuses)
1 a body or collection of written texts. > a collection of written or spoken material in machine-readable form.

A corpus is a collection of representative texts and in some cases transcripts of spoken language, or perhaps a specific domain in the language, which shows words and expressions in context, collocations and lots more, depending on how you use it.

It can be used to compare how often various expressions are used, e.g. 'different to/different from/different than'.
My English teachers only allowed 'different from' - but outside the classroom, 'different to' was heard quite frequently.
The results in the British corpus are:
to: 483
from: 3281
than: 51

It looks as if my teachers had a point! But you would probably get a very different result from an American corpus, and it is useful to know which to use for your particular target group.
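
For anyone who wants to run the same kind of comparison on their own collection of texts, a minimal Python sketch follows. It is not how the BNC lookup works; the function name `phrase_counts` and the three sample sentences are made up for illustration, and the counts it prints describe only that toy sample.

```python
import re

def phrase_counts(texts, phrases):
    """Return how often each phrase occurs as whole words, ignoring case."""
    counts = {}
    for phrase in phrases:
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        counts[phrase] = sum(len(pattern.findall(text)) for text in texts)
    return counts

# Hypothetical sample; replace with the texts of a real corpus.
sample_texts = [
    "This is different from what I expected.",
    "Her view is different to mine, and different from his.",
    "It is no different than before.",
]
variants = ["different to", "different from", "different than"]
for phrase, n in phrase_counts(sample_texts, variants).items():
    print(phrase, n)
```

Raw counts like these only mean something if the texts behind them are representative of the variety and target group you are writing for.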

You can play with the British National Corpus here, free of charge:
http://sara.natcorp.ox.ac.uk/lookup.html

Or there are more facilities (which I have never used) here:
http://www.natcorp.ox.ac.uk/tools/sara/

And NO, the www. is NOT just as good, at least with English. You can never assume that because any particular expression gives a million - or any number of million - hits, it must be correct!
First you have to define 'correct' anyway - UK, US, or any of the other 18-20 varieties on the Microsoft spell checker, for instance.

Googling may give results that you can use, again after sorting them. Often they will be very helpful, but there is no guarantee that the Web is representative, and anyone can hang a website out there, even if the language is a disgrace. That goes for any language, not just English!

Happy translating!


 
Lia Fail (X)
Lia Fail (X)  Identity Verified
Spain
Local time: 22:44
Spanish to English
+ ...
two pertinent issues Jul 2, 2008

Dominika Lupa raised two very pertinent questions:

1. The difference between using a corpus and the WWW.

You don't know what you are mining in the WWW, there is a lot of stuff written by non-natives and non-experts, frequency counts are misleading, information may be unreliable, genres are confused ...

You choose - on the basis of reliable criteria - what goes into a corpus. Hence, if translating a peer-reviewed research article on breast cancer, your corpus would be similar texts of a similar genre (very important) written by a native (if possible).


2. The difference between monolingual and all the other corpora (bilingual, comparable etc).

Bilingual corpora typically imply translation at some stage, and that makes them equivalent, pretty much, to a translation memory. Comparable corpora are two (or more) parallel corpora created independently applying the same criteria and used to compare languages.

I don't see that, ultimately, bilingual corpora are particularly useful, except to see how someone previously translated something (so it's just like a translation memory).

I'm a firm supporter of using a monolingual corpus as a model for guiding translation, editing and revision, and as a learning tool for translators, editors and revisers.

If you want to read more, some colleagues and I are on the point of having an article published in JOSTRANS (the pending July issue) on the subject of using corpora to develop a specialism. http://www.jostrans.org/

We run a workshop (next in Croatia): http://www.metmeetings.org/content/abstracts/workshop_corpus.htm




[Edited at 2008-07-02 14:16]


 
Lia Fail (X)
Lia Fail (X)  Identity Verified
Spain
Local time: 22:44
Spanish to English
+ ...
Not to be confused: TMs and corpora Jul 2, 2008

Giles Watson wrote:

... a collection of texts selected for the purposes of linguistic or terminological analysis.

Last year, the EC made available its translation memory for the entire acquis communautaire in twenty-two languages. The introductory page:

http://langtech.jrc.it/DGT-TM.html

will perhaps help you to understand what people mean by corpora (collections of texts in general) and parallel corpora (bi-texts, translation memories), as well as what you might want to do with them.

One problem with corpora is that the bigger they get, the more "noise" (matches irrelevant to the current search) they tend to contain. On the other hand, if they are more compact and manageable, they may exclude useful information ("selected" is the key concept in the definition above). On a less exalted scale, individual translators face a similar quandary when they are deciding how to organise their own translation memories, which are themselves parallel corpora. Some translators prefer a single "Big Mamma" TM with plenty of fields so that it can be filtered as required but others like separate, leaner memories for each language pair, sector, client and so on.

To answer Dominika's question, yes the WWW is an excellent searchable corpus but it is far from sufficient and not always terribly focused. There is also the issue of source evaluation - just because it googles doesn't mean it's right!

The corpora I use most often are my own translation memories. I use glossaries too, of course, but these tend to be normative, i.e. lists of terms I need to use for a specific translation. The good thing about self-generated TMs is that they can be relied on for quality (one hopes) but they are not necessarily prescriptive. In a nutshell, TMs are invaluable for reminding you that either a) you have to translate a segment in a particular way; or b) you have already translated a segment in ways that perhaps didn't occur to you this time.

Giles

[Edited at 2008-07-02 13:32]


Giles, with all due respect, you are shifting the topic focus from corpora to translation memories:-)

Your definition is very broad. In fact, a corpus is ANY body of text (even a sheaf of photocopies), and the problem with corpus linguistics is that it is still perceived as an academic exercise aimed at analysing language, not mimicking it, which is what a monolingual corpus for translation purposes is aimed at.

I also think that calling a TM a corpus is confusing issues. Yes, indeed, the TM is a body of text, but it is not created for analytical purposes. Yes, we may be guided by it, but although the source may be a good model, the target is not necessarily.

See my post, and the links. My colleagues and I are working hard to shift the use of corpora out of the academic (analytical) sphere and develop a methodology for translators.

We have created our OWN definition of what we call a TRANSLATION corpus (aimed at guiding translation): "a set of good models for your target text". We also know - from experience - that big isn't better, the corpora we work with are typically less than 1m words and that is perfectly adequate PROVIDED what goes in is carefully chosen. I might create a corpus from 10 PDFs cleaned and converted to TXT equal to 50000 words that might be perfectly adequate to translate 5000 words, provided those PDFs are selected with care and are good models for my task. It's not just about key words, it's about choosing reliable sources, identical genres, native speakers (where possible), etc.

There's no point in creating a corpus unless you apply criteria; otherwise, save yourself the trouble and use the WWW. WWW = big and crude, corpus = small and targeted.
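
To give an idea of what consulting such a small, hand-picked corpus can look like once the PDFs have been converted to TXT, here is a minimal Python sketch of a keyword-in-context (KWIC) lookup plus a word count. The function names `kwic` and `word_count` and the two sample sentences are assumptions made for this example; in practice a concordancer does this job, and the texts would be read from the converted files.

```python
import re

def kwic(texts, keyword, width=30):
    """Return keyword-in-context lines: left context, [keyword], right context."""
    lines = []
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for text in texts:
        # Collapse line breaks and repeated spaces left over from PDF conversion.
        flat = " ".join(text.split())
        for match in pattern.finditer(flat):
            left = flat[max(0, match.start() - width):match.start()]
            right = flat[match.end():match.end() + width]
            lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

def word_count(texts):
    """Rough corpus size: whitespace-separated tokens across all texts."""
    return sum(len(text.split()) for text in texts)

# Hypothetical sample texts standing in for carefully selected source documents.
sample_texts = [
    "The patient presented with a palpable mass in the left breast.",
    "A palpable\nmass was detected on examination.",
]
print(word_count(sample_texts), "words")
for line in kwic(sample_texts, "mass"):
    print(line)
```

Lining the hits up on the keyword makes it easy to scan what typically comes before and after it in the model texts.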


 
Alan R King
Alan R King
Local time: 22:44
Basque to English
+ ...
In memoriam
Very interesting discussion Jul 2, 2008

This is turning into an interesting and informative discussion indeed. For now I shall just carry on "listening" - please do continue!

Alan


 
Giles Watson
Giles Watson  Identity Verified
Italy
Local time: 22:44
Italian to English
In memoriam
Corpora and TMs Jul 2, 2008

Lia Fail wrote:

We have created our OWN definition of what we call a TRANSLATION corpus



Ah, there's the rub.

Everybody has his or her or its (the JRC) own definition of what a corpus is. Since none was offered, I took the term in its most general, literal sense: a "body" of texts in one or more languages. It is certainly useful to distinguish bi-texts/translation memories from monolingual compilations but they are both subsets of corpora, not different animals. I also take your point about target texts sometimes being iffy models even if that is really a translation quality assurance and TM maintenance issue.

The corpus approach to teaching described in your article is interesting, though. Back in pre-WWW days when Cobuild, the BNC and other corpora were starting out, the content could be skewed or limited despite their compilers' most valiant efforts. With more material and computer power available, it makes sense to knock up a bespoke corpus and tailor it to your needs. Wordsmith has more to chew on nowadays.

And please don't sell Google short. It's often a handy way of testing collocations, in particular, provided of course that you apply the right filters to your search and take the same precautions you would with any other documentary source.

Giles

[Edited at 2008-07-02 17:10]


 
Lia Fail (X)
Lia Fail (X)  Identity Verified
Spain
Local time: 22:44
Spanish to English
+ ...
translators / TMs vs field experts / corpora Jul 2, 2008

Giles Watson wrote:




I also take your point about target texts sometimes being iffy models even if that is really a translation quality assurance and TM maintenance issue.

(...)

And please don't sell Google short. It's often a handy way of testing collocations, in particular, provided of course that you apply the right filters to your search and take the same precautions you would with any other documentary source.

Giles

[Edited at 2008-07-02 17:10]



On the issue of TMs, and taking the perspective I and my colleagues have adopted with regard to how to BEST use a corpus to guide translation (or editing or revision), the fact is that most TMs are created by translators, and with the best will in the world, they are translators not field experts. Translators don't have the detailed inside information about a discourse community and its genres that the "original" writer is likely to have ... unless they are active field experts AND translators at the same time, which is unusual.

Referring to our approach to corpus, it's because we want to mimic the original writer belonging to a discourse community to which we don't have access that we mine their texts. But before we mine their texts we make absolutely sure we learn a little about their discourse community and how they communicate among themselves (in medicine, for example, via editorials, letters to the editor, case reports, etc, that is, all slightly different genres). We might (if we feel inexpert) even enlist their help in choosing texts to enter the corpus (for example, I successfully edit rock mechanics texts on the basis of a corpus created with the help of my author).

Incidentally, the fact that we use a corpus has actually ensured the creation of a very high quality TM for one particular field I work in with a team of translators. In other words, a carefully mined corpus will lead to quality inputs to a TM that can be leveraged in the future.

That's the theory behind a monolingual corpus used to guide specialist translation. It's an investment in time and effort, and not always worth it, unless one is determined to develop a specialism. There are many shortcut approaches though. The shortest cut of all is Google (I ain't selling it short!): provided one is discerning, it can provide the answer one needs in seconds.


I should add that the notion of corpus is indeed a broad one (a "body" of text). The notion of "corpus linguistics" is an academic one, and the problem we translators and other possible users of corpora outside academia have is that we use corpora but not for the kind of deep linguistic analysis performed by academics. We do analyse, but in a different way (focused on translation), which is why we refer to "corpus-guided translation". In other words, as Giles has implied, we're a bit short on specific terminology to describe corpora other than the BNC and Cobuild and Brown's etc and to describe uses of corpora other than for descriptive and analytical purposes.

[Edited at 2008-07-02 18:45]

See Michael Wilkinson's article in JOSTRANS: http://www.jostrans.org/issue07/art_wilkinson.php
He's an academic but he's using corpora to teach translators (not to analyse text).



[Edited at 2008-07-02 18:52]


 
Giles Watson
Giles Watson  Identity Verified
Italy
Local time: 22:44
Italian to English
In memoriam
Teaching and translating Jul 2, 2008

Lia Fail wrote:

See Michael Wilkinson's article in JOSTRANS: http://www.jostrans.org/issue07/art_wilkinson.php
He's an academic but he's using corpora to teach translators (not to analyse text).



[Edited at 2008-07-02 18:52]


Hi again Lia,

No sweat!

I am utterly convinced that your approach to teaching corpora is a very useful technique for those translation trainers who can handle the software. If I have any doubts, they concern the availability of such paragons ("proven translation ability + specific software expertise + teaching skills" is quite a tall, or to put it another way, expensive, order!).

The other point is that in my main sector (food and wine), it is not at all rare to find good translators who are also original writers on the topic and who can provide valid quality assurance. Bi-texts are very useful indeed. If you insist on original writing only, you will have unrelated texts in the respective languages and this will tend to over-emphasise lexical or phrase-level correspondences. OK, this is a step forward for most translators, particularly if they are learners, but if your bi-texts are good enough, the corpus will also reflect thematic, discourse-level solutions.

One of the big drawbacks of CAT tools in general is their tendency to focus the translator's attention on segments that are at best one sentence long. In an F&W - or any other article-length, journalistic - translation, basic cohesion comes from the introduction of themes that extend over the entire text.

You knew this anyway, though. We're not really arguing: we just have different goals.

Giles


 
Lia Fail (X)
Lia Fail (X)  Identity Verified
Spain
Local time: 22:44
Spanish to English
+ ...
different fields Jul 2, 2008

Giles Watson wrote:

.... in my main sector (food and wine), it is not at all rare to find good translators who are also original writers on the topic and who can provide valid quality assurance.

....

If you insist on original writing only, you will have unrelated texts in the respective languages and this will tend to over-emphasise lexical or phrase-level correspondences.




Of course, when I think corpus I'm thinking largely of applications in my own areas, and I am not at all surprised to hear that there are expert F&W types who are translators ... for fairly obvious reasons:-) In medicine or engineering, however, that might be unlikely.

In medicine also, although a translation may indeed be guided by a relatively unrelated text, this is only relatively, as case reports for example, tend to build on, complement and supplement earlier case reports, so the translation of a case report on a specific disorder can be very well guided by a corpus of previously reported cases.

Indeed, we don't disagree, just that we work in very different fields. Although I still prefer not to refer to a TM as a corpus in the corpus linguistic sense:-)

We use quite a large corpus and a top quality translation memory to guide our team's translations in a particular area of medicine. I use the TM for terminology, but when I want to explore usage, phrases, co-text or context, I go to the corpus. When it's a new sub-field I create a sub-corpus to guide what I will eventually input into the TM.


 
Lia Fail (X)
Lia Fail (X)  Identity Verified
Spain
Local time: 22:44
Spanish to English
+ ...
Corpus article now published and available online Jul 13, 2008

I recently mentioned our upcoming publication:

http://www.jostrans.org/issue10/art_maher.php


 

