Tool for extracting (repetitive) terminology from PDF for glossary creation
Verena Schmidt Spain Local time: 18:45 Member (2006) Spanish to German + ...
Feb 8, 2011
Dear colleagues,
For a localization project, I have to create a glossary for a tourism website that contains all the relevant and repetitive terms and slogans (menu items, etc.). I'm just downloading the whole web site into PDF and was wondering whether there is a tool that automatically extracts all the repetitions, tabs and menus from a website, PDF or Word document.
Any ideas?
Regards,
Verena
FarkasAndras Hungary Local time: 19:45 English to Hungarian + ...
Wrong format
Feb 8, 2011
Verena Schmidt wrote:
I'm just downloading the whole web site into PDF
I'd start over and save as HTML files. I'm not sure how you're downloading and saving as PDF, but HTML is the native format, at least the native format your browser or downloader can access. It's also a hell of a lot better for any subsequent processing you'll do. I would use wget, but httrack is probably easier for you to use.
Of course, it would be even better to just get the original data from the website's owner instead of downloading the site yourself.
Once you have your HTML files, you can use tools such as LF Aligner to align them all in one fell swoop.
Extracting terminology automatically won't be easy. I wouldn't bother trying.
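As a rough illustration of why saved HTML is easier to process than PDF, here is a minimal Python sketch that pulls the visible text (menu items, headings, slogans) out of a downloaded page using only the standard library. The names `VisibleTextParser` and `html_to_text_chunks` are made up for this example, and real pages will need more care (encodings, hidden elements, navigation repeated on every page).

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []       # one entry per non-empty text node
        self._skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text_chunks(html):
    """Return the visible text pieces of one HTML document, in order."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.chunks
```

Each saved page can be read with `open(path, encoding="utf-8").read()` and passed to `html_to_text_chunks`; the resulting chunks are a reasonable starting point for counting repeated items.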
Adam Łobatiuk Poland Local time: 19:45 Member (2009) English to Polish + ...
Why PDF?
Feb 8, 2011
Hi Verena,
I'm not sure why you want to use the most troublesome format for localization work. If the client hasn't provided you with source files for the website, you could use a tool like HTTrack Website Copier to download the site.
Still, if you have good reasons to use PDF and can transfer the content to Word or text files, free term extractors were the topic of a recent news story on Proz: http://www.proz.com/translation-news/?p=19987#1677922
Also, if you use Trados, you can analyse the file, and then use the "Export frequent segments" feature, which does what it says. It won't be exactly terminology, but slogans and menu items could be included. Other CATs may have a similar feature.
Good luck
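For readers without a CAT tool, the general idea of listing frequent segments can be approximated in a few lines of Python. This is only a sketch of the concept, not a description of how Trados implements its feature; the function name `frequent_segments` and the `min_count` threshold are assumptions made for this example, and "segment" here simply means one line or sentence of extracted text.

```python
from collections import Counter


def frequent_segments(segments, min_count=2):
    """Return (segment, count) pairs for segments seen at least min_count times.

    Whitespace is normalised so that "Book  now" and "Book now" count as the
    same segment. Results are sorted by count (descending), then alphabetically.
    """
    normalised = (" ".join(s.split()) for s in segments)
    counts = Counter(s for s in normalised if s)
    repeated = [(s, n) for s, n in counts.items() if n >= min_count]
    return sorted(repeated, key=lambda item: (-item[1], item[0]))
```

Fed with the lines of a text export of the site, this tends to surface menu items, button labels and slogans, since those repeat on many pages.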
Verena Schmidt Spain Local time: 18:45 Member (2006) Spanish to German + ...
TOPIC STARTER
Thanks, Adam
Feb 8, 2011
The PDF is just to get a first overview of the whole website, as graphics and images have to be analysed as well. I will convert the PDF with PDFZilla into Word. Right now it's just about extracting the relevant terminology. Thanks a lot for the link, the first tool sounds promising.
Do you mean analyse with Workbench?
Verena Schmidt Spain Local time: 18:45 Member (2006) Spanish to German + ...
TOPIC STARTER
Hi Farkas,
Feb 8, 2011
I'm downloading the site with Adobe Acrobat Pro. You simply type in the website and Adobe creates one document with all the website content. This is fine to get an overview of the terminology and images used throughout the site. Right now this is NOT a translation project. The site has to be analysed regarding its cultural appropriateness and I have to create a glossary with the most relevant and repetitive terms (the client is paying an hourly rate for this type of work).
My workaround is this: Download to PDF -> Convert to Word (with PDFZilla) -> Use the Word file to extract the terminology.
For me this is fine; the only thing missing is a good tool that extracts all the repetitive terminology for me.
FarkasAndras Hungary Local time: 19:45 English to Hungarian + ...
Fine
Feb 8, 2011
Verena Schmidt wrote:
I'm downloading the site with Adobe Acrobat Pro. You simply type in the website and Adobe creates one document with all the website content. This is fine to get an overview of the terminology and images used throughout the site. Right now this is NOT a translation project. The site has to be analysed regarding its cultural appropriateness and I have to create a glossary with the most relevant and repetitive terms (the client is paying an hourly rate for this type of work).
My workaround is this: Download to PDF -> Convert to Word (with PDFZilla) -> Use the Word file to extract the terminology.
For me this is fine; the only thing missing is a good tool that extracts all the repetitive terminology for me.
I see. This solution might be fine if you want to read the site yourself, but I definitely wouldn't use it for any automated processing. It introduces two unnecessary lossy file conversions (HTML->PDF->Word), which is just asking for trouble.
However, if the site is monolingual and you only need to write up a list of relevant terminology, you might as well just do it by hand from the PDF. Automated solutions are pretty much useless anyway, except if you want to compile, say, a list of all words that occur at least 5 times or something crude like that.
If the site is available in two languages, the HTTrack->aligner route is clearly the best solution.
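The crude word list mentioned above (all words that occur at least 5 times) is easy to produce. A minimal Python sketch, assuming the site text has already been saved as plain text; the function name `words_occurring_at_least` is made up for this example, and the result is a raw frequency list, not a glossary: it will be dominated by articles and prepositions unless a stop-word list is added.

```python
import re
from collections import Counter


def words_occurring_at_least(text, min_count=5):
    """Return {word: count} for words appearing at least min_count times.

    Words are lower-cased runs of letters; the pattern [^\\W\\d_]+ matches
    Unicode letters, so umlauts and accented characters stay inside a word.
    """
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    return {word: n for word, n in counts.items() if n >= min_count}
```

Sorting the result with `sorted(result.items(), key=lambda kv: -kv[1])` gives the most frequent words first, which can then be reviewed by hand.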
Adam Łobatiuk Poland Local time: 19:45 Member (2009) English to Polish + ...
Yes, Workbench
Feb 8, 2011
Verena Schmidt wrote:
The PDF is just to get a first overview of the whole website, as graphics and images have to be analysed as well. I will convert the PDF with PDFZilla into Word. Right now it's just about extracting the relevant terminology. Thanks a lot for the link, the first tool sounds promising.
Do you mean analyse with Workbench?
That's correct. I don't have Studio installed right now, but it may have a similar feature.
Michael Beijer United Kingdom Local time: 18:45 Member (2009) Dutch to English + ...
"Extracting terminology from Translation Memory with Similis, step by step.pdf"
Feb 14, 2011
You might want to take a look at:
"Extracting terminology from Translation Memory with Similis, step by step"