Should I OCR this document?
Thread poster: Mark Connolly

Mark Connolly
Mexico
Spanish to English
Jun 2

I have turned down jobs before because clients tell me not to OCR a pdf document. This time I accepted the job before getting the instructions, and work is thin on the ground. It turns out the document is full of tables that OCR beautifully.

I always OCR without retaining the formatting. Could I get away with it?


 

Kevin Fulton  Identity Verified
United States
Local time: 02:40
German to English
I don't see why not Jun 3

To be honest, I don't understand why a client might not want you to use OCR on a file. After all, how you produce a usable intermediate (i.e. working) document is your business.

However, using OCR isn't always trouble-free.

One problem with using OCR on PDF files is that all sorts of artifacts including hidden tags can be embedded in the converted file which then interfere with successful formatting in Word, for example. There are various utilities available, such as Code Zapper, or TransTools Suite which "clean up" such artifacts and help regularize fonts and spacing. Another issue is faulty character recognition – although a character or word may appear legible to the human eye, it might be misinterpreted during the OCR process. Again, a careful reading of the output document should help eliminate such errors.
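
As a rough illustration of the kind of clean-up involved, here is a minimal Python sketch. It is not how Code Zapper or TransTools work internally; the function name `clean_ocr_text`, the list of invisible characters, and the specific rules (dropping soft hyphens and zero-width characters, rejoining words hyphenated across a line break, collapsing runs of spaces) are assumptions chosen for the example, and it works on plain text only, not on Word formatting tags.

```python
import re

# Invisible characters that can survive OCR/PDF conversion
# (an assumed, non-exhaustive list for illustration).
INVISIBLE = "\u00ad\u200b\u200c\u200d\ufeff"


def clean_ocr_text(text: str) -> str:
    """Hypothetical helper: tidy plain text produced by OCR."""
    # Drop soft hyphens, zero-width characters and byte-order marks.
    text = text.translate({ord(ch): None for ch in INVISIBLE})
    # Rejoin words split by a hyphen at a line break: "recog-\nnition" -> "recognition".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Remove the space left dangling at the end of a line.
    text = re.sub(r" +\n", "\n", text)
    return text.strip()
```

Note that the hyphen rule would also join genuinely hyphenated words that happen to break across lines (such as "well-" followed by "known"), so the output still needs the careful read-through described above.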

If you are using a CAT tool, you don't have many alternatives to using OCR, apart from INFIX, which results in reproducing a translated PDF file after using a CAT tool.

Using OCR to reproduce tables makes perfect sense to me, assuming the process doesn't introduce spacing or formatting errors.

You might ask the client regarding the instruction not to use OCR. It's possible that the client uses DTP and hidden embedded tags interfere with the process. As mentioned above, there are utilities that remedy this issue.


 

finnword1
United States
Local time: 02:40
English to Finnish
+ ...
ignorant clients Jun 3

Ask them to send you the material in text or Word document or to OCR the material themselves.

 

Germaine  Identity Verified
Canada
Local time: 02:40
English to French
+ ...
Agree with Kevin Jun 3

Using Adobe Acrobat (Standard), you can simply "save as" the pdf in one of the various formats offered, including Word and Excel, and most of the time there's little word processing to do. OCR (EN+FR) is also included, should the pdf be a scan.

Sure, the software is pricey at first, but upgrades (and you don't have to buy each and every one) are more affordable. See it as an investment. You'll be surprised by all you can do with it (and even more with Adobe Acrobat Pro). I started with version 4 and I am now using version X. I never regretted buying it. It has been worth every cent!

P.S.: should you buy it, don't forget to install the pdf printer. You'll get better pdfs by "printing" your Word/Excel documents than "saving as".


 

Tom in London
United Kingdom
Local time: 07:40
Member (2008)
Italian to English
I agree with F Jun 4

finnword1 wrote:

Ask them to send you the material in text or Word document or to OCR the material themselves.


Finnword's suggestion is the correct one.


 

LEXpert  Identity Verified
United States
Local time: 01:40
Member (2008)
Croatian to English
+ ...
Be careful what you wish for Jun 4

Tom in London wrote:

finnword1 wrote:

Ask them to send you the material in text or Word document or to OCR the material themselves.


Finnword's suggestion is the correct one.


That often results in a slipshod effort yielding tag soup and horrible segmentation that costs you more time than it saves, especially since, if the client is going to go to the trouble of OCRing for you, they're going to figure that they might as well run it through their CAT tool and knock your price down a bit. 9 times out of 10, I can do a much better job of OCRing a file than the client can.


 

José Henrique Lamensdorf  Identity Verified
Brazil
Local time: 03:40
English to Portuguese
+ ...
Definitely true! Jun 4

LEXpert wrote:

9 times out of 10, I can do a much better job of OCRing a file than the client can.


I always wonder why clients - particularly agencies - "lie" about having done (horrible) OCR work.

They send me a table with a sea of typos, I ask them for the original file, and they say it's all they've got.

Later they ask me to proofread a laid-out PDF to check whether they've put all my translations in the right places.


 

DZiW
Ukraine
English to Russian
+ ...
extra work = extra charge Jun 4

Sometimes I use FreeTM.com (free WordFast Anywhere), which can convert PDFs that are not too complicated or bizarre and send the result to my email inbox; otherwise I have to use FineReader. Either way, I do charge for this, because it takes more time and effort to get the text into shape.

Most clients know very little even about the final translation, so many simply aren't aware of what an editable document is, what the different types of PDF/DJVU are, or why OCR/DTP is needed at all. In this respect, translators work as mentors and educators, teaching the ABC.

In short, clients don't know exactly why they must pay for something they didn't ask for. When I had a similar issue and asked for an editable copy, my client insisted the file must stay intact and very reluctantly sent me a password to unprotect the PDF. I had to explain to him again that a scanned PDF is no different with or without a password, for it's but a set of images, no text. He was surprised and wondered whether translation involves reading from a hardcopy or from the screen. I was ready to cancel the deal when he suddenly replied that he understood the problem--he could only view the file as photos, without being able to select words or add comments... Finally he sent me the original DOC, and once more he was dumbfounded by the question of which final format was required--DOC, PDF or some other... Since there were charts and I didn't want to get into explaining ZIP/RAR, I sent him a DOC, an RTF, and a searchable PDF... He was puzzled and asked whether he had to pay threefold :)

Still, I believe that's much better than "a plain DOC file without tables and graphics", which turned out to be a DOC with scanned handwriting.


 

