working with .pdf and .jpg files
Thread poster: Ilze Klotina

Ilze Klotina
Latvia
Local time: 15:55
Japanese to Latvian
+ ...
Sep 16, 2010

I would be grateful for some help concerning the following:

I work as a freelancer, so sometimes I don't really "see" the documents I have to translate. For example, many documents are sent to me as .jpg or .pdf files, i.e., scanned. My brother is a programmer, so I asked him whether I could convert those files somehow. He said that it is possible, but the result won't be as good as I would like it to be. The problem is that it takes a lot of time to translate those scanned files - I have a source text which is not editable and I have to create a translation as a text document. If it is only a one-page certificate then it's fine... but I've also had some quite big (scanned) files, and it really tires me out...
I am working with Open Office and in the latest version there is an option to edit .jpg and .pdf files with Open Office Draw. However, it works very slowly - if the file is big, I have to wait 20 minutes for the program to convert it. And it is impossible to use OmegaT with those files, so it takes a lot of time anyway...

Do any of you know how to convert scanned files to make them editable?



Susan Welsh  Identity Verified
United States
Local time: 08:55
Member (2008)
Russian to English
+ ...
converting PDFs Sep 16, 2010

There is discussion of this in various other forums in the archives, as it is not an issue for OmegaT as such. No CAT tool can work directly with PDFs, to my knowledge.

You can convert PDFs to plain text using Adobe Reader, but the results can be quite poor, especially for something with complex formatting, like a certificate.

I use ABBYY PDF Transformer, which converts PDFs to .rtf or .txt (it says it converts to .doc, but it's not a "real" Microsoft .doc file, it's really .rtf, as I understand it). The results are quite variable. You have the option of converting the file through an OCR procedure, which works best. (The better, more expensive tool for OCR is ABBYY FineReader.) For a rather simply formatted document, this works okay. For a certificate, or anything with lots of graphics and tables and text boxes, I have not found it satisfactory.

I now take the advice of various people from the OmegaT group, and convert the PDF to plain text, then reformat it. It's less grief all around.

Search in the forum archives for "converting PDFs," and you'll find plenty more advice.

Good luck. PDFs are a pain.



Graeme Waller  Identity Verified
Finland
Local time: 15:55
Finnish to English
+ ...
One workaround Sep 16, 2010


Clients often send me scanned PDF files as source files. First I ask them if they can send me a doc, rtf or txt file. If they cannot, I use OCR (Optical Character Recognition) software to get a doc or txt source file, tidy it up to match the original as closely as possible and run a spell check to fix spelling mistakes. Where there are a lot of complex graphics, I set the OCR software to produce unformatted text only. I then closely proofread my source files against the original file.
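
For the proofreading step, a short script can point out likely OCR misreads before the manual check. The following is only a minimal Python sketch, not part of any OCR product mentioned in this thread: the function name `suspicious_tokens` and both heuristics (letters mixed with digits, and the same letter three times in a row) are assumptions chosen for illustration, and they will produce false positives such as "A4" or "2nd".

```python
import re
from typing import List, Tuple

# Heuristic (an assumption): the same letter three or more times in a row,
# e.g. "Alll", is rarely a real word. Digits are excluded so "1000" passes.
REPEATED_LETTER = re.compile(r"([^\W\d_])\1\1")


def looks_mixed(token: str) -> bool:
    """True if the token contains both letters and digits, e.g. 'c1ient'."""
    has_letter = any(ch.isalpha() for ch in token)
    has_digit = any(ch.isdigit() for ch in token)
    return has_letter and has_digit


def suspicious_tokens(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, token) pairs that look like OCR misreads."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for token in re.findall(r"\w+", line):
            if looks_mixed(token) or REPEATED_LETTER.search(token):
                hits.append((number, token))
    return hits


if __name__ == "__main__":
    sample = "The c1ient signed\nthe agreement in 2010.\nAlll rights reserved."
    for number, token in suspicious_tokens(sample):
        print(f"line {number}: {token}")
```

Run on the text exported from the OCR program, it lists the line number and token of each hit. Every hit still has to be compared with the scanned original by eye.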

All this is very time-consuming, but in some cases it works quite well. I explain the necessary preprocessing to the client and charge a higher word rate. If they felt unable to pay the higher rate, I would just have to turn down the work.

Sometimes my OCR software will not accept the pdf file (I have not yet had to process jpegs). I have also found using OpenOffice Draw terribly slow and frustrating. On at least one order I ended up typing in the source file.

If anyone has a workaround, I would be very grateful to hear it too.

By the way, there might be an OCR program with your printer software. I have two OCR programs, both of which came with hardware.

[Edited at 2010-09-16 21:31 GMT]

[Edited at 2010-09-17 12:18 GMT]



Isaac Verdú  Identity Verified
Venezuela
Local time: 09:55
English to Spanish
+ ...
Not exactly cheap, but it can be done. Sep 16, 2010

SDL Trados Studio 2009 works directly with PDFs. It's not quite as comfortable as working with editable formats: it takes a bit longer to convert the document for translation, there are a lot of tags in the editor window, and the output generated is a .doc file that you will have to clean up. Still, it's a fairly painless process.

However, I haven't done this with scanned files, so I'm not sure whether Trados itself performs OCR. What I do know is that Acrobat has a built-in OCR option, which creates a new "text-based" PDF document while keeping everything else as it is. I suppose you could do this, then process it with Trados, and be translating a lot sooner.



John Fossey  Identity Verified
Canada
Local time: 08:55
Member (2008)
French to English
ABBYY Transformer Sep 17, 2010

I have been using ABBYY Transformer for a couple of years, with very good results. However, I have found that it's best not to let it use its "automatic" setup feature but to spend a few minutes going through the documents manually, identifying where the graphics and text are. With some practice I find it reasonably quick - maybe 1 minute or less per page.

Only very occasionally - with a document that is very heavy on images and graphics - can it not produce a document that's quite close to the original. In that case I will only select the text and produce a text only document for the client.

For .jpg and other image files, I have the free version of CutePDF installed as a printer, so will open the image in whichever program suits and print it to a .pdf file, then transform the .pdf with ABBYY Transformer.



Susan Welsh  Identity Verified
United States
Local time: 08:55
Member (2008)
Russian to English
+ ...
ABBYY Transformer @John Sep 17, 2010

John Fossey wrote:

I have found that it's best not to let it use its "automatic" setup feature but to spend a few minutes going through the documents manually, identifying where the graphics and text are.

John, I always use the "manual" setup, but the converted document comes out with the text and graphics already marked off in boxes. I have never figured out what I was supposed to do with them, since they are already there, and--almost always--correct. Yet the document, when opened in Word, is almost invariably messed up in one way or another. What am I missing?



Olieslagers
French Polynesia
Local time: 02:55
Member (2009)
Dutch to French
+ ...
ABBYY PDF transformer Sep 17, 2010

I find ABBYY Transformer a great tool for converting PDFs into RTF, even for graphics, tables and text boxes. But for a good result it is absolutely necessary to use the manual setup and also to check for typos before starting to translate. I have noticed that OCR tools have difficulty telling the end of a line from the end of a sentence...
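
The line-end problem can be reduced with a small script before the text goes into a CAT tool. This is a rough Python sketch under stated assumptions, not a feature of ABBYY or OmegaT: the function name `unwrap_lines` is hypothetical, and the rule used (a line break is kept only after sentence-ending punctuation followed by a capital letter, or at a blank line) is a heuristic that suits many European languages but not, for example, Japanese.

```python
# Characters assumed to end a sentence (an illustrative choice).
SENTENCE_END = (".", "!", "?", ":", ";")


def unwrap_lines(text: str) -> str:
    """Merge lines that OCR broke in the middle of a sentence."""
    merged = []
    current = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line always closes the text collected so far.
            if current:
                merged.append(current)
                current = ""
            continue
        if not current:
            current = line
        elif current.endswith("-") and line[0].islower():
            # Probably a word hyphenated across the break: rejoin it.
            # Limitation: a real compound like "well-\nknown" loses its hyphen.
            current = current[:-1] + line
        elif current.endswith(SENTENCE_END) and line[0].isupper():
            # Looks like a genuine sentence boundary: keep the break.
            merged.append(current)
            current = line
        else:
            # Break in mid-sentence: join with a single space.
            current += " " + line
    if current:
        merged.append(current)
    return "\n".join(merged)


if __name__ == "__main__":
    ocr_text = "This is a sen-\ntence that was\nwrapped by OCR.\nA new sentence."
    print(unwrap_lines(ocr_text))
```

The output has one line per sentence or group of sentences, so a sentence-based segmenter no longer splits a segment at an arbitrary line end. The result should still be checked against the scan, since abbreviations and headings can fool the rule.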


Ilze Klotina
Latvia
Local time: 15:55
Japanese to Latvian
+ ...
TOPIC STARTER
got ABBYY FineReader Sep 20, 2010

Thank you for the suggestions.
I also checked other forum topics and finally bought ABBYY FineReader Pro. It works great, but quite slowly - converting one page takes about 10 min. Is that normal?



Quang Ngo
Local time: 19:55
English to Vietnamese
+ ...
Try Nitro PDF Reader to extract text and images Mar 2, 2011

I would suggest trying Nitro PDF Reader, a powerful tool that can extract text and images from .pdf files. You can download Nitro PDF Reader (100% free) at http://www.nitroreader.com/download/. For translating .pdf files, I use this program to extract the source text into Notepad, and then copy and paste it into a Word file. After some minor modifications, I can start translating with OmegaT. I hope this helps.


