Creating a translation memory from PDF documents
Thread poster: Elisa Fernández Vic

Elisa Fernández Vic  Identity Verified
Spain
Local time: 17:48
Member (2015)
English to Spanish
+ ...
Jul 1, 2015

Hello all!
So, I have the following ingredients:
- A number of PDFs in English and Spanish.
- A Mac computer.
- OmegaT 3.1.8 (updating to 3.1.9 right now).
- No idea what I'm doing.
I want to create a translation memory for this project based on the PDF documents. How can I do this? Any help will be much appreciated.
Thanks in advance!



Susan Welsh  Identity Verified
United States
Local time: 11:48
Member (2008)
Russian to English
+ ...
create TM Jul 1, 2015

First you have to convert the PDFs into .DOCX or .ODT format. I do this with ABBYY Finereader, which is software you have to buy. There are others that do the same thing, but that's what I use. (Maybe someone else will suggest something cheaper.)

Then you have to align the two files. LF Aligner is a good tool, and free: https://sourceforge.net/projects/aligner/
There are many others. That will give you your TM.
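
To make "that will give you your TM" concrete, here is a minimal sketch of what an aligner produces at the end: a TMX file in which each translation unit holds one source segment and its translation. It assumes the sentence pairs are already aligned (this is not how LF Aligner itself works internally), and the tool name, the language codes and the example sentences are made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree writes this namespaced attribute out as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="en", tgt_lang="es"):
    """Return a TMX 1.4 document (as a string) from already-aligned segment pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(
        tmx,
        "header",
        {
            "creationtool": "example-script",  # hypothetical tool name
            "creationtoolversion": "0.1",
            "segtype": "sentence",
            "o-tmf": "plain-text",
            "adminlang": "en",
            "srclang": src_lang,
            "datatype": "plaintext",
        },
    )
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        # One <tu> per aligned pair, with one <tuv> per language.
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


# Invented example sentences, standing in for the aligner's output.
pairs = [
    ("Press the red button.", "Pulse el botón rojo."),
    ("Close the cover.", "Cierre la tapa."),
]
tmx_text = build_tmx(pairs)
print(tmx_text)
```

In practice the aligner writes this file for you; the sketch is only meant to show that a TM is nothing more than paired segments in a simple XML wrapper.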



esperantisto  Identity Verified
Local time: 19:48
Member (2006)
English to Russian
+ ...
ABBYY PDF Transformer Jul 1, 2015

Susan Welsh wrote:

I do this with ABBYY Finereader, which is software you have to buy. There are others that do the same thing, but that's what I use. (Maybe someone else will suggest something cheaper.)


If all you need is to extract text from PDF files, ABBYY PDF Transformer may be a solution. It’s actually a trimmed-down version of Finereader, so its price is lower.



Meta Arkadia
Local time: 23:48
English to Indonesian
+ ...
Cheaper like free Jul 1, 2015

Susan Welsh wrote:
...Maybe someone else will suggest something cheaper.

CasualTextractor should do the trick, especially if you extract to plain text, which is good enough for creating a TMX file. And it's free. It doesn't work for scanned ("dead") PDFs, though.



And yes, nothing can beat LF_Aligner, but I'm afraid it doesn't have a graphical interface in the Mac version, so you'll need the Terminal. The instructions Andras provides are very clear, though.

And then there's YouAlign, a free web service that also processes PDFs. That means you'd only have to upload the PDFs. It works very well, but it is perhaps not wise to use if you have signed any NDAs.

Cheers,

Hans


[Edited at 2015-07-01 11:26 GMT]

[Edited at 2015-07-01 11:41 GMT]



Dan Lucas  Identity Verified
United Kingdom
Local time: 16:48
Member (2014)
Japanese to English
Depends on the PDFs Jul 1, 2015

Elisa Fernández Vic wrote:
- A number of PDFs in English and Spanish.

If the PDFs are image-only PDFs, you will have to OCR them as described by others. OCR is not much fun, whatever software you use. Check the output files very carefully for errors.

However, machine-readable PDFs can usually be saved as plain text files. How do you know if it's a machine-readable file? If you can select text with the mouse, it's machine-readable. Sometimes the file is protected from copying or exporting, in which case you're out of luck.

If it's machine-readable and not protected, you can use the entirely free Sumatra PDF: simply choose "Save As..." from the File menu to save text only. I did just that with a publicly available Japanese document. If the formatting is not too complex, saving to text might be both quicker and less effort than OCR.

Regards
Dan




Didier Briel  Identity Verified
France
Local time: 17:48
Member (2007)
English to French
+ ...
LF Aligner Jul 1, 2015

Elisa Fernández Vic wrote:
So, I have the following ingredients:
- A number of PDFs in English and Spanish.
- A Mac computer.
- OmegaT 3.1.8 (updating to 3.1.9 right now).
- No idea what I'm doing.
I want to create a translation memory for this project based on the PDF documents. How can I do this?

What you need is an aligner. You can use LF Aligner:
https://sourceforge.net/projects/aligner/

If your PDFs contain text (not images), you will be able to align directly from the PDF files.

Didier



Milan Condak  Identity Verified
Local time: 17:48
English to Czech
Editable PDF Jul 1, 2015

Susan Welsh wrote:

First you have to convert the PDFs into .DOCX or .ODT format. (Maybe someone else will suggest something cheaper.)


LF Aligner can extract text from an editable PDF and create a TMX file (the linked page is in Czech):

http://www.condak.net/tools/align-sentence/lf-align3-5/cs/02.html


Then you have to align the two files. LF Aligner is a good tool, and free: https://sourceforge.net/projects/aligner/


You can have files in two or in more languages.

http://www.condak.net/tools/align-sentence/lf-align3-5/cs/00.html

Previously, the only option was to import the result into an XLS file:

http://www.condak.net/tools/align-sentence/lf-align3-5/cs/04.html

Now it is also possible to use the built-in alignment editor.

Milan



Elisa Fernández Vic  Identity Verified
Spain
Local time: 17:48
Member (2015)
English to Spanish
+ ...
TOPIC STARTER
LF aligner issues Jul 1, 2015

Hello all,
Thank you very much for your valuable information.
I have managed to convert the files into UTF-8 .txt files and to download LF_aligner. But when I try to align the two files, I get an error that I don't know how to solve. I will copy it as it appears, changing only the client and file names for privacy reasons:

ERROR: Input file not found (No such file or directory) at line 52066
(file: /Users/elisafernandezvic/Desktop/TRADUCCIÓN/CLIENTES/CLIENT/MATERIAL\ DE\ REFERENCIA\ INGLÉS/\(583153876\)\ 3020\ File\ name\ EN.txt)
Try again!

What can I do to solve it? Thank you very much in advance.



Milan Condak  Identity Verified
Local time: 17:48
English to Czech
Short path and short file name in ASCII Jul 1, 2015

Elisa Fernández Vic wrote:

ERROR: Input file not found (No such file or directory) at line 52066
(file: /Users/elisafernandezvic/Desktop/TRADUCCIÓN/CLIENTES/CLIENT/MATERIAL\ DE\ REFERENCIA\ INGLÉS/\(583153876\)\ 3020\ File\ name\ EN.txt)
Try again!

What can I do to solve it? Thank you very much in advance.


Elisa,

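Try a short path with short, ASCII-only file names, e.g. C:\name\EN.txt + second.txt

Possible issues: the accented characters, spaces and parentheses in TRADUCCIÓN/ and INGLÉS/\(583153876\)\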

Milan
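
The advice above (short path, plain ASCII file names) can be sketched as a small helper that copies a file into a short working folder under a simplified name before running the aligner. This is only an illustration: the function names and the working folder are hypothetical, and the thread does not confirm which part of the original path actually triggered the error.

```python
import re
import shutil
import unicodedata
from pathlib import Path


def ascii_name(name):
    """Reduce a file name to ASCII letters, digits, dots, hyphens and underscores."""
    # Split accented letters into base letter + combining mark, then drop the marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace spaces, parentheses and any other unusual characters with underscores.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", stripped)
    return cleaned.strip("_") or "file"


def copy_to_short_path(source, work_dir):
    """Copy `source` into `work_dir` under an ASCII-only name and return the new path."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    target = work_dir / ascii_name(Path(source).name)
    shutil.copy2(source, target)
    return target


print(ascii_name("(583153876) 3020 File name EN.txt"))
# prints: 583153876_3020_File_name_EN.txt
```

On a Mac, something like copy_to_short_path(original_file, Path.home() / "align") would place the copy in a short folder directly under the home directory; the "align" folder name is just an example.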



Elisa Fernández Vic  Identity Verified
Spain
Local time: 17:48
Member (2015)
English to Spanish
+ ...
TOPIC STARTER
Success!! And now... how to merge tmx together? Jul 1, 2015

Thank you! I have managed to create my first translation memory and it seems to work properly! Do I get a cookie?
Next on the list: as I said, I have a bunch of texts to align. With this method, I will end with a bunch of aligned TMX files. Do I just move them all to the TM folder in OmegaT, or do I have to merge them somehow?
Sorry if this is a stupid question - as I said, it's my first time trying to create my own TM from files.



Milan Condak  Identity Verified
Local time: 17:48
English to Czech
Auto sub-folder Jul 1, 2015

Elisa Fernández Vic wrote:


Next on the list: as I said, I have a bunch of texts to align. With this method, I will end with a bunch of aligned TMX files. Do I just move them all to the TM folder in OmegaT,


Elisa,

Put all your relevant TMs into the folder tm\auto\

Then look at "files in project"; you will see whether the TMXs are relevant or not. There is no need to merge TMXs.

Milan
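
If there are many TMX files, the copying step can be scripted. The following is a rough sketch, assuming all the aligned .tmx files sit in one folder; the function name is invented for illustration, and the project path is whatever your own OmegaT project folder is.

```python
import shutil
from pathlib import Path


def install_tmx_files(tmx_dir, project_dir):
    """Copy every .tmx file from `tmx_dir` into `<project_dir>/tm/auto`.

    Returns the names of the copied files, in alphabetical order.
    """
    auto_dir = Path(project_dir) / "tm" / "auto"
    auto_dir.mkdir(parents=True, exist_ok=True)  # create tm/auto if it is missing
    copied = []
    for tmx_file in sorted(Path(tmx_dir).glob("*.tmx")):
        shutil.copy2(tmx_file, auto_dir / tmx_file.name)
        copied.append(tmx_file.name)
    return copied
```

A call such as install_tmx_files("aligned", "my_project") (both folder names are placeholders) leaves the originals untouched and only adds copies under tm/auto, so nothing needs to be merged.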



Elisa Fernández Vic  Identity Verified
Spain
Local time: 17:48
Member (2015)
English to Spanish
+ ...
TOPIC STARTER
Thank you very much! Jul 2, 2015

Milan Condak wrote:

Elisa Fernández Vic wrote:


Next on the list: as I said, I have a bunch of texts to align. With this method, I will end with a bunch of aligned TMX files. Do I just move them all to the TM folder in OmegaT,


Elisa,

put all your relevant TMs into folder tm\auto\

then look at "files in project", you will see if the TMXs are relevant or not. There is no need to merge TMXs.

Milan


So it was actually this easy! Thank you very much for your help!



Vaclav H
Czech Republic
French to Czech
Thx for the topic Jul 3, 2015

and for the answers!
I'm working on my TM and most of the documents are in PDF.
This made it so much easier and faster. Thank you all!

