https://www.proz.com/forum/translator_resources/328792-corpora_extraction_and_or_analysis_tools_for_professional_translators.html

Corpora extraction and/or analysis tools for professional translators
Thread poster: Emanuele Vacca

Emanuele Vacca
Italy
Local time: 07:10
English to Italian
+ ...
Sep 17, 2018

Dear colleagues,
During my studies in translation I learned about the importance of corpus analysis methods. Lately I have been researching the main tools available, but after testing some of them I concluded that they are probably designed for linguistics researchers rather than translators: they are very hard to use and can only read perfectly aligned plain-text files, which would probably take hundreds of hours of work to create for each text, depending on its length. I have also been looking for tools that automatically extract corpora from the web but, again, I concluded that these are experimental, complex tools that are useless for practical translation purposes.
I am aware that the internet can be used as a giant corpus even without any specific tool (e.g. EUR-Lex is a great multilingual corpus, and there are many multilingual websites), but I would like to ask whether such a tool exists, i.e. a corpus extraction or analysis tool that a freelance translator can use in practice.
Thank you in advance.


 

Michael Beijer
United Kingdom
Local time: 06:10
Member (2009)
Dutch to English
+ ...
I recommend tlCorpus: Sep 17, 2018

from their website:

• Easy-to-use
• Fully internationalized (full Unicode support; supports all languages)
• Immediate search word "collocation summary" feature
• Takes advantage of modern multi-core CPUs for speed gains!
• Auto-detection functionality of languages and encodings helps simplify setup
• Supports text files
• Supports HTML (Web) files
• Supports Microsoft Word .doc, .docx and .rtf files (currently Windows-only)
• Supports PDF files (Thanks to Xpdf. For Mac, must install this pdftotext first.)
• Supports MOBI, EPUB, and CHM files (Requires the free calibre software installed.)
• Integrated 'Web download manager' grabs a file (or files) given a URL or URL pattern. Also supports clipboard URL monitoring.

src: https://tshwanedje.com/corpus/


Michael


 

Clarkalo
United States
Local time: 01:10
Member (2015)
English to Spanish
+ ...
Corpus tools Sep 19, 2018

Hi Emanuele,
Check out BootCaT for creating your own ad hoc corpora from the web, and AntConc or ParaConc for analysing them (e.g. creating word frequency lists or viewing keywords in context (KWIC)). Both are free and relatively easy to learn. There is also another corpus analysis tool I'm aware of called LancsBox, created by Lancaster University in the UK; I have not tried it yet, but it looks promising.
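
For readers unfamiliar with the two analysis features mentioned above, here is a minimal Python sketch of a word frequency list and a keyword-in-context (KWIC) view. It only illustrates the idea and is not how AntConc or any other tool is implemented; the function names (`tokenize`, `frequency_list`, `kwic`), the naive tokenizer and the sample sentence are all assumptions made up for this example.

```python
import re
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercased runs of word characters.
    # \w is Unicode-aware in Python 3, so accented letters are kept.
    return re.findall(r"\w+", text.lower())


def frequency_list(tokens, top=10):
    # Return the `top` most frequent tokens as (token, count) pairs.
    return Counter(tokens).most_common(top)


def kwic(tokens, keyword, window=3):
    # Return (left context, keyword, right context) for every occurrence
    # of `keyword`, with up to `window` tokens on each side.
    keyword = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


# Hypothetical sample text; in practice you would read your own files.
sample = ("The translator checks the corpus. "
          "A corpus helps the translator choose the right term.")
tokens = tokenize(sample)

print(frequency_list(tokens, top=3))
for left, kw, right in kwic(tokens, "corpus", window=2):
    print(f"{left:>20} [{kw}] {right}")
```

Dedicated tools add a great deal on top of this (sorting by context, collocation statistics, support for many file formats), which is why a sketch like this is no substitute for them.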


[Edited at 2018-09-19 21:36 GMT]


 

Daniel Frisano
Czech Republic
Local time: 07:10
Member (2008)
English to Italian
+ ...
Sketch Engine? Sep 20, 2018

Have you tried Sketch Engine? Is it among those you would rank as good for researchers rather than translators?

I am looking for a good corpus analysis tool myself, for learning purposes.


 

Emanuele Vacca
Italy
Local time: 07:10
English to Italian
+ ...
TOPIC STARTER
My opinions Sep 21, 2018

Dear colleagues,
First of all, thank you so much for your kind suggestions; I will try to reply to all of you with my opinion on the suggested software.
tlCorpus: indeed, this software is pretty close to what I am looking for! The UI is simple and the features are generally user-friendly. Unfortunately, I mostly work with .pdf files, and the pdf-to-text converter that should allow tlCorpus to read this format is no longer available (it should be here: https://www.bluem.net/en/mac/packages/). I also tried to build a corpus from .doc and .docx files but, for some reason, the software could not search inside them; I really do not know why. I did not try with .txt files but, as I said before, I cannot spend hundreds of hours converting my .pdf, .doc and .docx files, each hundreds of pages long, into plain text and, above all, aligning them to turn them into parallel corpora. What I understood from my research is that fully automatic alignment methods simply do not exist (of course, I am not 100% sure about this point). But let me ask you, Michael: do you use any corpus analysis and/or extraction tools in your daily work as a freelance professional?
BootCaT: while very good at extracting monolingual text from the web, BootCaT does not seem to be able to extract parallel texts. This paper, https://www.researchgate.net/publication/276146186_Comparable_Corpora_BootCaT, describes a method for creating "comparable corpora", i.e. two different corpora, in two languages, made up of different texts on a common topic (which can actually be very specific, depending on the "seeds" and "tuples" you use to create the corpus). This can be really interesting, but I honestly do not think that a professional translator can spend his or her time manually "comparing" two corpora. That is exactly what we already do every day on the web, only faster. But, of course, I may be wrong. Clarkalo, do you use BootCaT as a freelance translator?
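
As an illustration of the "seeds" and "tuples" mentioned above: seeds are topic-specific terms, and tuples are small random combinations of seeds that are then used as web search queries. The Python sketch below shows only that combination step, under stated assumptions; it is not BootCaT's own code, and the function name `make_tuples`, its parameters and the Italian seed terms are hypothetical.

```python
import itertools
import random


def make_tuples(seeds, tuple_size=3, how_many=5, rng_seed=0):
    # Build every unordered combination of `tuple_size` seeds, shuffle
    # them reproducibly, and keep the first `how_many` as search queries.
    combos = list(itertools.combinations(seeds, tuple_size))
    rng = random.Random(rng_seed)  # fixed seed so runs are repeatable
    rng.shuffle(combos)
    return combos[:how_many]


# Hypothetical seed terms for an Italian corpus on rental contracts.
seeds = ["contratto", "locazione", "inquilino", "canone", "disdetta"]

for query in make_tuples(seeds):
    # Each tuple would be submitted to a search engine as one query.
    print(" ".join(query))
```

Running the same step with a second seed list in the other language would give the queries for the second half of a comparable corpus; the more specific the seeds, the narrower the topic of the resulting texts.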
Sketch Engine: this is more or less what I am looking for; indeed, I had already tried it before opening this thread. The problem with Sketch Engine is that it is pretty expensive, at least for my needs: when my trial period ended, I tried to register for a paid account; the monthly fee depends on the number of words and, if I remember rightly, €30/month would cover about 30M words. But during the trial period I had already reached 10M words, and I had uploaded only a very small part of my personal corpus. However, I think I will give it another try sooner or later.
Oh, I just noticed that I forgot to try LancsBox! I will let you know my impression as soon as I try it.
Thank you again!

[Edited at 2018-09-21 17:34 GMT]


[Edited at 2018-09-21 17:38 GMT]


 

