Anybody for Corpus Linguistics?
Thread poster: xxxLia Fail
xxxLia Fail  Identity Verified
Spain
Local time: 02:47
Spanish to English
+ ...
Sep 9, 2004

I wonder if anybody is interested in exploring corpus applications to translation?

I studied - and very much enjoyed - CL in 2002/2003. I was particularly interested in possible applications to translation (as opposed to language usage in a more specific sense), and explored this in my thesis. One of my conclusions was that corpus analysis tools - as developed to date - were not particularly suitable for the kind of searches translators have to make. Yet the amount of field-specific knowledge I was able to obtain from an analysis of 500,000 words on macroeconomics was fascinating. I really felt myself to be quite the 'expert', not only in terms of terminology, but also in terms of my command of the style and language appropriate for this field. I didn't 'read' half a million words, I 'analysed' them!

Most CL discussions, in fact, exclude translators (so what's new!). At the same time, there is a minority interest in CL applications for translators, and a number of people are exploring this possible application.

I am still interested in CL, and have investigated a number of tools available from the web. However, I have two problems: first, that I have little time, and second, that I am not exactly computer-friendly, meaning that messing around with software just bores me to death. If anybody else were interested, I would feel a bit more motivated, as we could experiment and exchange ideas and discoveries.

When I have a minute, I will expand a bit more on the possibilities of CL (for those whose interest has been piqued) and add a couple of links.

Just some statements to begin with:

- the web itself is a huge corpus, but we have no tools more specific than search engines for searching it quickly and efficiently for our specific purposes. Search engines facilitate ad hoc searches (often sufficient for translators) but not systematic searches for terminology, for example. (Note that Trados TermExtract seems to offer this possibility; unfortunately, I haven't had time to experiment with it, and it looks complex... my two perennial problems, as mentioned above!)

- ideally, a CL tool should be able to analyse specific texts and extract 'clusters' of words that - due to frequency of occurrence - could rate as 'terms'. We could thus collect a set of texts on our subject area before translating, run a corpus analysis on it, and produce clusters that would, hopefully, equip us with field-relevant language. In other words, it would facilitate/complement the reading process.
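
To make the 'cluster' idea concrete, here is a minimal sketch in Python of frequency-based cluster extraction from a plain-text corpus. It illustrates the principle only, not any particular tool's algorithm, and the corpus file name is a placeholder:

```python
# Minimal sketch: count recurring two- and three-word sequences
# ("clusters") in a plain-text corpus. "macro_corpus.txt" is hypothetical.
import re
from collections import Counter

def ngrams(tokens, n):
    # Consecutive n-word sequences from a token list.
    return zip(*(tokens[i:] for i in range(n)))

with open("macro_corpus.txt", encoding="utf-8") as f:
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", f.read().lower())

counts = Counter()
for n in (2, 3):
    counts.update(" ".join(g) for g in ngrams(tokens, n))

# The most frequent clusters are candidate terms for the field.
for cluster, freq in counts.most_common(20):
    print(f"{freq:5d}  {cluster}")
```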

Looking forward to seeing some interest:-)



[Edited at 2004-09-09 14:58]

[Edited at 2004-09-09 14:59]


Elena Miguel  Identity Verified
Spain
Local time: 02:47
English to Spanish
+ ...
Do you know WordSmith? Sep 9, 2004

It provides frequency lists that are very useful for assessing things such as the type/token ratio in corpora.
Although these analyses have mainly been used in tasks dealing with the authorship of texts, they can also be highly beneficial for analysing specialized languages or jargon.
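
The type/token ratio is simply the number of distinct word forms (types) divided by the number of running words (tokens); a low ratio suggests repetitive, formulaic language. A minimal Python sketch of that textbook definition (not WordSmith's own implementation):

```python
# Type/token ratio: distinct word forms (types) / running words (tokens).
import re

def type_token_ratio(text):
    tokens = re.findall(r"\w+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

print(type_token_ratio("the cat sat on the mat"))  # 5 types / 6 tokens ≈ 0.83
```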
Regards.


Patricia Fierro, M. Sc.  Identity Verified
Ecuador
Local time: 19:47
Member (2004)
English to Spanish
+ ...
A site with Spanish corpus information Sep 9, 2004

Hi,

I thought you might be interested in this site:

http://www.corpusdelespanol.org/

HTH


Henry Dotterer
Local time: 20:47
SITE FOUNDER
I took a course and had similar impressions Sep 9, 2004

Interesting topic, Ailish.

There is a disconnect between the world of translation and the world of corpus linguistics. In many cases, people on both sides are doing the same thing, but calling it something different simply because they don't talk.

Corpus techniques are increasingly being used in translation productivity software, though. You are right to mention that TRADOS Term Extract is one tool in the mix (we'll have a sale later today, by the way). Fusion is another tool that uses corpus linguistics; it can "extract" what its developers call "expressions" (and you call "clusters") from a source file and then "generate" translations by identifying corresponding expressions in the target file. It performs impressively if the documents are large enough. You can try it for free. (There are others, too: Beetext, TermSeekInc, etc.)

On your point about mining parallel corpora on the web: people are working on it, and even doing it themselves in a hands-on fashion. But as far as I know, no one has yet put together an application that will let you enter two URLs and do extraction/generation or segmentation/alignment.
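
As a toy illustration of what such an application would have to do - the URLs below are hypothetical, and pairing sentences by position is a naive stand-in for what real aligners (e.g. Gale-Church, which uses sentence-length statistics) do far more robustly:

```python
# Toy sketch of the "two URLs in, aligned segments out" idea.
import re
from urllib.request import urlopen

def sentences(url):
    html = urlopen(url).read().decode("utf-8", errors="ignore")
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

source = sentences("http://example.com/report_en.html")  # hypothetical
target = sentences("http://example.com/report_fr.html")  # hypothetical

# Naive position-based pairing; real aligners model length and lexicon.
for s, t in zip(source, target):
    print(s, "|||", t)
```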

If you were not averse to programming, I would suggest you look at the Natural Language Toolkit (NLTK), which provides a number of free corpus tools (written in Python) that could be brought together to do useful things in translation. I mention it for others reading this thread.
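
For those who do want to try it, a minimal sketch of the kind of exploration NLTK makes easy (the corpus file name is a placeholder, and the tokenizer model has to be downloaded once):

```python
# Minimal NLTK sketch: load a plain-text file, then explore it with a
# keyword-in-context concordance and a collocation list.
import nltk

nltk.download("punkt")  # word/sentence tokenizer model, fetched once

with open("fieldtexts.txt", encoding="utf-8") as f:  # hypothetical file
    tokens = nltk.word_tokenize(f.read())

text = nltk.Text(tokens)
text.concordance("inflation")  # keyword-in-context lines
text.collocations()            # statistically salient word pairs
```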

At any rate, I share your enthusiasm for what is possible in this field. We will surely be impacted.


Ramesh Madhavan  Identity Verified

Local time: 06:17
English
+ ...
Why don't we start one on our web site? I am willing to contribute for free. Sep 9, 2004



[Edited at 2004-09-09 14:06]


xxxLia Fail  Identity Verified
Spain
Local time: 02:47
Spanish to English
+ ...
TOPIC STARTER
Thanks to all... some replies Sep 9, 2004

WordSmith was precisely the tool I used for my thesis.

I had the luxury of being a student and had the time to explore it - as well as having technical support in the lab - so I got to know it well. But to actually apply it to translation you have to CREATE the corpus first, and generally, translators don't have the time. And that's the major problem with corpus tools.

So the thing is to work with the web in some way - trying to avoid/minimise this time-consuming aspect of creating a corpus - but applying corpus analysis principles.

In reply to Henry about parallel corpora: to me, that's problematic for a translator (i.e. someone who has a job to do within a tight deadline), because rather than finding one TL text that more or less represents your field, you have to find TWO texts, one SL and one TL, and decide whether they are sufficiently similar - and often, the same kinds of texts simply aren't available. So I am willing to settle for obtaining a TL text that will 'brush up'/'enhance'/'inspire' my field-specific language (style and terminology), i.e. jerk knowledge from passive to active memory.

And the big bugbear, Henry! You said: "If you were not averse to programming, I would suggest you look at the Natural Language Toolkit (NLTK), which provides a number of free corpus tools (written in Python) that could be brought together to do useful things in translation. I mention it for others reading this thread."

I am pretty useless with computers!

It seems we both see the tantalising potential of corpus analysis for translation purposes - but it's like being on one side of a big river, wanting to get across but not having the means:-)

Now that I see some interest, I will find a moment to put together what I have found in recent months that I - personally - would like to experiment with.

[Edited at 2004-09-09 14:54]


GoodWords  Identity Verified
Mexico
Local time: 19:47
Spanish to English
+ ...
Human Processing Sep 9, 2004

Ailish Maher wrote:
- the web itself is a huge corpus, but we have no tools more specific than search engines for searching it quickly and efficiently for our specific purposes. Search engines facilitate ad hoc searches (often sufficient for translators) but not systematic searches for terminology, for example. (Note that Trados TermExtract seems to offer this possibility; unfortunately, I haven't had time to experiment with it, and it looks complex... my two perennial problems, as mentioned above!)

- ideally, a CL tool should be able to analyse specific texts and extract 'clusters' of words that - due to frequency of occurrence - could rate as 'terms'. We could thus collect a set of texts on our subject area before translating, run a corpus analysis on it, and produce clusters that would, hopefully, equip us with field-relevant language. In other words, it would facilitate/complement the reading process.


Something that would be hard to automate intelligently - as difficult as automating translation itself, I suggest - is "negative examples".

I'm referring to the process by which we reject a particular word, spelling, phrase or usage because of the company it keeps: we see it predominantly (or exclusively) in texts of dubious literacy, or in non-native texts.


xxxLia Fail  Identity Verified
Spain
Local time: 02:47
Spanish to English
+ ...
TOPIC STARTER
Text selection principles Sep 9, 2004

GoodWords wrote:

Something that would be hard to automate intelligently - as difficult as automating translation itself, I suggest - is "negative examples".

I'm referring to the process by which we reject a particular word, spelling, phrase or usage because of the company it keeps: we see it predominantly (or exclusively) in texts of dubious literacy, or in non-native texts.


Yes indeed, but one of the underlying principles of CA is careful and representative text selection, i.e. the output will only be as good as the texts you analyse. And obviously, from the point of view of obtaining fundamental background information for translating a text, the translator has to use discretion in the kinds of texts he/she analyses.

One problem with parallel texts is that if they are exactly parallel, one of them is a translation of the other. Which is all very well if it's an EU text...


Heinrich Pesch  Identity Verified
Finland
Local time: 03:47
Member (2003)
Finnish to German
+ ...
Would not work for small languages Sep 10, 2004

The amount of text in English available on the net is huge; German, French, Spanish, etc. are still impressive; but Finnish, for example, is tiny, though useful when translating.
What you mean by corpus linguistics reduces, in practice, to finding a parallel text on the subject at hand. In small language communities, it's the translators who are constantly creating these parallel texts to be used in future translations.
Very often the experts no longer publish in their native language, but in English, so the very material that the linguist is looking for does not exist, but is created by translators.


xxxLia Fail  Identity Verified
Spain
Local time: 02:47
Spanish to English
+ ...
TOPIC STARTER
comment for Heinrich Sep 11, 2004

Heinrich Pesch wrote:

The amount of text in English available on the net is huge; German, French, Spanish, etc. are still impressive; but Finnish, for example, is tiny, though useful when translating.
What you mean by corpus linguistics reduces, in practice, to finding a parallel text on the subject at hand. In small language communities, it's the translators who are constantly creating these parallel texts to be used in future translations.
Very often the experts no longer publish in their native language, but in English, so the very material that the linguist is looking for does not exist, but is created by translators.


Hi Heinrich,

Please see my next posting, due shortly. It would be interesting to see whether anything I discovered that focuses on EN has applications to other languages.


RobinB  Identity Verified
Germany
Local time: 02:47
German to English
New book on corpora in translation studies Sep 17, 2004

Ailish,

The following was circulated by e-mail the other day. Maybe of interest to you and others. Maeve is one of the best translation academics working in the UK today.

Robin

==========================================================
Introducing Corpora in Translation Studies

by Maeve Olohan

London & New York, Routledge, 2004.
ISBN 0-415-26885-0 (pbk)
ISBN 0-415-26884-2 (hbk)

"Finally! A consolidated and well-crafted resource explaining the basics (and beyond) of corpus-based translation studies. The comprehensive and critical analysis of work carried out in this area is combined with fascinating case studies and presented in an easy-to-follow manner. The overall package is a must-read for anyone who wants to learn more about the use of corpora in translation." (Lynne Bowker, University of Ottawa, Canada)

"We have been waiting for a book of this kind; a very readable text and a state-of-the-art account of what many would see as the future of empirical studies of translating. It will be of interest not just to those intending to work with corpora but to those who are curious about what such studies can show us about translator behaviour." (Ian Mason, Heriot-Watt University, UK)

The use of corpora in translation studies, both as a tool for translators and as a way of analysing the process of translation, is growing. This book provides a much-needed assessment of how the analysis of corpus data can make a contribution to the study of translation.

The book begins by tracing the introduction and development of corpus methods in translation studies and defining different types of corpora for translation research. Corpus design issues are then addressed and the use of corpora in researching aspects of the translation process is discussed. Tools for data extraction and analysis are introduced and some uses of corpora by translators and in translator training are also considered.

Featuring research questions, case studies, discussion points, methodological issues and assessment of research potential and limitations, the book provides a practical guide to using corpora in translation studies.

Offering a comprehensive account of the use of corpora by today's translators and researchers, Introducing Corpora in Translation Studies is the definitive guide to a fast-developing area of study.

Maeve Olohan is Senior Lecturer at the Centre for Translation & Intercultural Studies in the School of Languages, Linguistics and Cultures, University of Manchester, UK, where she is Programme Director for the MA in Translation Studies. She is editor of Intercultural Faultlines: Research Models in Translation Studies I (2000) and of four volumes of Translation Studies Abstracts (1999-2002).

http://www.monabaker.com/tsresources/newpubs_more.php?id=2271_0_4_0_M


xxxLia Fail  Identity Verified
Spain
Local time: 02:47
Spanish to English
+ ...
TOPIC STARTER
EN corpora & web concordancers Sep 19, 2004

The corpus scenario is very different depending on which language one is interested in. My target language is English, and my major concern is ‘writing the language of the domain’, i.e. writing technically and linguistically correct English, so any research I do is wholly biased in favour of English. Nonetheless, some of the web concordancers listed below can be used for other languages.

1. ONLINE CORPORA in ENGLISH

British National Corpus

http://thetis.bl.uk/lookup.html

To check how a word is used. Very useful, for example, to generally see what ‘company’ a word keeps (e.g. to check which preposition might accompany a particular word). However, the language is standard, so it’s unlikely you will find technical terms.

Collins Cobuild

http://titania.cobuild.collins.co.uk/

This is similar to the BNC. As of writing, I wasn't able to download it, so I don't know if the system is down or what…

Comments: Both are of limited use to native translators, because these corpora generally exclude technical language. On the other hand, they can be used to check usage, and are certainly useful for non-natives (e.g. to see how a general language word is used in context).

2. WEB AS CORPUS

Concordance System

http://www.impact.pe.kr/files4wiki/R_dopamine.html

This apparently is no longer available! It worked really well, although it was rather slow. You can see the example with 'dopamine' and how it brings up clusters and provides a very complete description of how the word is used - extremely useful for a terminological analysis. The excellent thing about it is that it searches the web, and therefore covers all language, not just standard language as in the BNC/Cobuild. It also presents the information in a highly useful way for translators. (I have tried to get more information about it but failed… so if anyone knows anything/can find out anything?)
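
In the same spirit, a bare-bones keyword-in-context concordancer is easy to sketch in Python; the corpus file name is a placeholder, and real concordancers add sorting, clustering and web search on top of this:

```python
# Bare-bones KWIC concordancer: print each hit for a keyword with a
# fixed window of context on either side.
import re

def kwic(text, keyword, width=40):
    for m in re.finditer(rf"\b{re.escape(keyword)}\b", text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()].rjust(width)
        right = text[m.end():m.end() + width]
        print(f"{left}[{m.group()}]{right}")

with open("corpus.txt", encoding="utf-8") as f:  # hypothetical file
    kwic(f.read().replace("\n", " "), "dopamine")
```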

Turbo Lingo

http://www.staff.amu.edu.pl/~sipkadan/lingo.htm

This seems promising. See the concordance and frequency results for an article on 'osteoporosis' obtained via Medline: http://www.rheumatology.org/public/factsheets/osteopor_new.asp?aud=pat. Note that the results page is cluttered with concordances for non-content words. There is a possibility of applying stopwords (like 'a', 'the', 'also', 'is', etc., i.e. function words), but I haven't had the time yet to experiment with this (I would need a stopwords list anyway – if anyone has one for EN?). Scroll right down for a frequency list, listing the words that occur most frequently (again, it would be more useful with the function words excluded). I haven't had the opportunity yet to apply this to a more complex field...
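
Filtering function words out of a frequency list is straightforward once you have a stopword list. A rough Python sketch - the short stopword set here is illustrative only (real EN lists run to a few hundred entries), and the file name is a placeholder:

```python
# Stopword-filtered frequency list over a plain-text file.
import re
from collections import Counter

STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "are",
             "was", "were", "also", "that", "this", "for", "with", "as",
             "on", "be", "by", "it", "its", "not", "but", "can", "may"}

with open("osteoporosis.txt", encoding="utf-8") as f:
    tokens = re.findall(r"[a-z]+", f.read().lower())

freq = Counter(t for t in tokens if t not in STOPWORDS)
for word, n in freq.most_common(25):
    print(f"{n:4d}  {word}")
```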

Spaceless

http://www.spaceless.com/concord/

Here you enter a website and it does various analyses. See results for the same article on ‘osteoporosis’, with keyword ‘bone’, ordered by frequency rather than alphabetically. I just found this recently, so haven’t experimented further, but it also looks promising. Apparently there is no stopword function.

KwiCFinder

http://miniappolis.com/KWiCFinder/KWiCFinder.html

This is a tool that you download, rather than use directly off the web. There are two tools: KwiCFinder itself, which concordances keywords-in-context; and kfNgram, which generates clusters (or n-grams). I have experimented minimally with the first tool, and it works OK, but not with the second one as yet.

ConcAPP

http://www.edict.com.hk/PUB/concapp/

Downloadable tool, similar to KwiCFinder. This one seems to work very like WordSmith.

Comments:

Turbo Lingo and Spaceless certainly seem to offer promise, KwiCFinder/kfNgram and ConcAPP to a lesser extent (as it’s just easier to work off the web). I haven’t had much time to experiment as yet, and the best experiments are ‘real’ ones (i.e. based on some technical job one has in hand), so it would be great if other people also experimented so we could compare notes.

Pity that Concordance System is 'unavailable', as it's certainly the most useful presentation of information from a translator's viewpoint.

Finally, as mentioned, my investigations are biased in favour of English as TL, but it would be interesting to see how these web concordancers work in other languages.

I hope to post something in the How-To section of ProZ in the near future on the subject, probably along the lines of a general introduction to CL for translators and focusing on the web as corpus.

[Edited at 2004-09-20 13:05]


Jeff Allen  Identity Verified
France
Local time: 02:47
Member (2011)
Multiple languages
+ ...
reply on corpus linguistics and translation Oct 9, 2004

I actually do come from both worlds (translation as well as computational linguistics). I was a translator and language professor before moving into the development and implementation of authoring and translation software/systems in 1995. During the past 10 years I was also technical director of the European Language Resources Distribution Agency. Corpus linguistics was one of the main areas of focus for that position.

My web site (http://www.geocities.com/jeffallenpubs) provides a thematically organized view of all of my publications (about 200 articles, conference papers, book chapters, theses, etc.) and discussion list threads (probably over 100) that explain these various tools and language-processing methodologies from different end-user perspectives. Start with the more recent ones, because they are more overview-focused; most of what I wrote up to 1998 was more technically focused, so save that for supplementary in-depth reading. This web site is an attempt to bridge the gap between these various worlds, which have been separate for a long time.

I hope this web site is a useful information resource for you.

Regards,

Jeff Allen
http://www.geocities.com/jeffallenpubs

Member of Editorial Board
MultiLingual Computing & Technology magazine
http://www.multilingual.com/editorialBoard

--------------------
Ailish Maher wrote:

I wonder if anybody is interested in exploring corpus applications to translation?

Most CL discussions, in fact, exclude translators (so what's new!). At the same time, there is a minority interest in CL applications for translators, and a number of people are exploring this possible application.

[Edited at 2004-09-09 14:59]


[Edited at 2004-10-12 12:29]

