Put OCR Software in Your Business Model

By KSL Berlin (Germany, German to English translator) | Published 01/3/2008 | Translation Techniques
Quicklink: http://www.proz.com/doc/1586

Introduction
OCR software has been discussed on a number of occasions on ProZ.com and at ProZ events, usually in the context of how to deal with PDF files. Hector Calabia, Peter Linton and others have made a number of useful contributions in this area in the article knowledgebase, the ProZ forums and at various conferences. There is also some information on the “How To” tab of my profile (The PDF Challenge, Post-processing of OCR text files – click on the headings) which provides additional guidance. However, I think it is worth considering the value of OCR software in a broader business context for the translation business. In this article, I will briefly discuss the use of OCR for document conversion for translation purposes, as others have done, but also the use of OCR to generate additional income for your business and to reduce risk when bidding on translation jobs.

OCR for translation
There are a number of programs available for this purpose, and which one is best for your purposes may depend on the language combinations you deal with and other factors. For years now we have used Abbyy FineReader, because it gave the best test results for the particular set of European languages one of our clients offered. It is also relatively inexpensive and easy to use.

Most OCR conversions of TIFF and PDF documents which we receive from agencies are difficult to use for translation purposes and require significant modification, if they can be made useful at all. Problems are particularly likely where TM tools are to be used or where target texts differ significantly in length from the source (especially when they are longer). The best ways to avoid these problems are:
  • never use automatic settings for OCR conversions, but instead use zone definitions

  • avoid saving the converted texts with full formatting in most cases

  • use a suitable post-OCR workflow to clean up the converted document by joining broken sentences, removing superfluous characters, fixing conversion errors, etc.
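
As a rough illustration of the kind of clean-up meant in the last point, here is a minimal Python sketch. It assumes the OCR result was saved as plain, unformatted text; the function name `clean_ocr_text` and the specific rules are my own illustrative choices, not a feature of FineReader or any other OCR program. The hyphen rule in particular is naive and will also merge genuine hyphenated compounds that happen to break at a line end, so the result still needs checking against the original.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Tidy plain-text OCR output before loading it into a TM tool.

    Hypothetical helper: the rules below are simple heuristics.
    """
    # Normalize line endings, then drop stray control characters
    # (tabs and newlines are kept).
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)

    # Rejoin words split by a hyphen at a line break: "conver-\nsion".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    # Join a single line break that falls mid-sentence, i.e. the line
    # does not end in sentence punctuation. Blank lines (paragraph
    # breaks) are left alone.
    text = re.sub(r"(?<![.!?:;\n])\n(?!\n)", " ", text)

    # Collapse runs of spaces and tabs, then trim each line.
    text = re.sub(r"[ \t]+", " ", text)
    return "\n".join(line.strip() for line in text.split("\n"))


if __name__ == "__main__":
    sample = "This is a bro-\nken sentence that\ncontinues here.\n\nNext  paragraph."
    print(clean_ocr_text(sample))
```

Headings and list items that do not end in punctuation will be joined to the following line by this sketch, so it suits running text better than heavily structured pages.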


If the idea of doing individual zone definitions on each page of a 100-page document is intimidating, take heart. In many cases, programs such as Abbyy FineReader allow you to define OCR templates, which speed up the work considerably. More than one translator I know has become so skilled with these OCR templates, and so good at conversions, that agencies hire them just to do high-quality OCR work. Which brings me to…

OCR as an income-generating activity for the translator or agency
Hardcopy, scanned documents, faxes and PDF documents generally require more work from translators than electronically editable documents, and they call for different, sometimes more fallible quality control measures than a typical workflow for a translator using original electronic documents in a translation memory system. If no conversion is performed, it is more time-consuming to check terminology or use concordances during the translation, and it is also unfortunately too easy for eyes to skip over bits of text. Under time pressure this can lead to very serious problems. Even with conversion, the OCR text requires careful checking against the original document to identify and correct any errors introduced (and there will be some at times with even the best OCR software). So it is not at all unreasonable for a translator to charge a higher rate for dealing with hardcopy, scanned documents, faxes and PDF documents.

There are a number of ways to incorporate these higher charges into your business model. The two obvious ways are a premium (surcharged) word/line/page rate and charging by the hour. I usually offer both options to my clients, with the word/line rate surcharge representing the “fixed” rate and the hourly rate the “flexible” rate, where I make a non-binding estimate and they may end up paying more or less according to the actual effort. For pure OCR conversion jobs where I am not doing the translating, I charge a typical proofreading rate or a bit more, because I go through the entire document and see that it is correctly formatted for translation work and that obvious errors are fixed (e.g. with a basic spellcheck).
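
To make the two options concrete, the arithmetic can be sketched in a few lines of Python. Everything here is an assumption for illustration: the function names are hypothetical, and the word count, rates, surcharge and hours in the usage example are invented figures, not recommended prices.

```python
def fixed_quote(word_count: int, base_rate: float, surcharge: float) -> float:
    """'Fixed' option: per-word rate plus a percentage surcharge.

    surcharge is a fraction, e.g. 0.25 for a 25 % premium (illustrative).
    """
    return round(word_count * base_rate * (1 + surcharge), 2)


def flexible_quote(estimated_hours: float, hourly_rate: float) -> float:
    """'Flexible' option: non-binding estimate based on expected hours."""
    return round(estimated_hours * hourly_rate, 2)


if __name__ == "__main__":
    # Invented example figures: a 2000-word scanned document.
    print("Fixed:   ", fixed_quote(2000, base_rate=0.10, surcharge=0.25))
    print("Flexible:", flexible_quote(estimated_hours=6, hourly_rate=40.0))
```

Showing both figures side by side lets the client choose between a known price and the chance of paying less if the job goes quickly.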

Sometimes I hear that “the client doesn’t want to pay for that”. Well, that’s OK, too. The client has the option of doing the work and doing it right and saving me the effort. The recognition that there is additional effort involved and that this effort should be compensated is important. But usually there is a way to sugar-coat the bitter cost pill, and this is where your marketing savvy comes into play. Some win-win arguments you might present include:

  • the availability of an editable source text the client can use for future versions;

  • the ability to create TM resources from the OCR text (which can save time/money later);

  • potentially better quality assurance, especially with tight deadlines.


Returning a clean, nicely formatted OCR of the source document is excellent advertising. End clients will love you and agencies may recognize your skill at creating documents that don’t go crazy when edited and offer you extra jobs. If your language pair is in low demand or is very competitive, this may fill in the gaps or provide one more way of distinguishing yourself from the pack.

I got started doing OCR work and charging for it after suffering through the conversion of several long PDF documents by more manual methods. I finally wised up, bought FineReader and started to use it with most of the hardcopy, scanned documents, faxes and PDF documents I received simply because it enabled me to use my TM tools and do better quality checks. I started sending the cleaner-looking source texts converted with OCR along with the target text translations, and soon I started getting requests for paid OCR work, and a number of my agency clients began to buy the software and learn to use it with varying degrees of success. Even if they do all the conversion work, I still win if they do it right, because I save time for what I enjoy more – the translation. Some people I know still haven’t learned to do a high-quality OCR (or they don’t care to), but they still use the software effectively in a very important area of their business: quotation and risk limitation.

OCR as a tool for quotation
There are lots of good tools out there for text counting, which is a critical part of the translation business. Some people even still count manually, which, though time-consuming, is not a bad way of checking the numbers from an electronic estimate. A number of factors can result in text counts being too low (due to embedded objects or graphics containing text) or even too high (as is the case with at least one CAT tool when counting RTF and MS Word files). Keep using whichever method you prefer; I won't try to persuade you that any one approach is best. I use a number of methods myself.

When translating larger documents, however, or documents with a complex structure, it is often useful to have a “sanity check” for your text counts. On a number of occasions I have received translation jobs from agency clients where the text count was given as X words, when in fact there were quite a few more words embedded in Excel objects, bitmap graphics, Visio charts, etc. which had not been measured by the method used. In a few cases these clients had to take a loss on the job after giving a fixed-price bid to the end client. Using OCR to check your estimates can prevent such an unfortunate scenario.

To do this, print the document (whatever it is) to a PDF file. Then run the PDF file through an OCR program with automatic settings (to save time – you don’t need to translate this OCR). Save the text and count it. There will probably be a bit more text due to headers or footers or perhaps garbage from graphics, but the results should be close to your other estimate. (You can always subtract an appropriate factor for the text count in headers and footers to improve your OCR estimate.) If there is a major deviation, this is a clear sign that you should take a much closer look at the document(s) before quoting the job.
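
The comparison step can be sketched in Python as follows. This is only an illustration under stated assumptions: the OCR result is available as plain text, words are counted as whitespace-separated tokens (your CAT tool may count differently), and the function names, the `header_footer_words` correction and the 10 % `tolerance` are hypothetical choices of mine, not established thresholds.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens (a simple approximation)."""
    return len(text.split())


def check_estimate(ocr_text: str, other_count: int,
                   header_footer_words: int = 0,
                   tolerance: float = 0.10):
    """Compare an OCR-based word count with another estimate.

    Returns (adjusted OCR count, relative deviation, needs_review).
    header_footer_words and tolerance are illustrative assumptions.
    """
    if other_count <= 0:
        raise ValueError("other_count must be positive")
    # Subtract the words attributed to repeated headers and footers.
    ocr_count = max(count_words(ocr_text) - header_footer_words, 0)
    deviation = (ocr_count - other_count) / other_count
    return ocr_count, deviation, abs(deviation) > tolerance


if __name__ == "__main__":
    ocr_text = "word " * 1300  # stand-in for the saved OCR text
    count, deviation, needs_review = check_estimate(
        ocr_text, other_count=1000, header_footer_words=50)
    print(count, f"{deviation:+.0%}", "review needed" if needs_review else "ok")
```

A positive deviation well above the tolerance suggests text hidden in graphics or embedded objects that the other counting method missed.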

Conclusion
As I see it, OCR software is one of the most essential software tools for a translator today, even more so than CAT software. Not just a tool for recovering “lost” electronic documents or making legacy typed material more accessible for translation work, it also offers possibilities for generating additional jobs and income, differentiating one’s services and reducing risks when quoting large jobs. An essential feature of whatever OCR tool is used should be the ability to choose the text areas to be converted and the order of their conversion (using user-defined zones), and various options for saving the converted text (full page format, limited text formatting and no formatting) are very helpful. Most important of all, though, is a good quality-checking workflow for your OCR documents to ensure that you avoid difficulties in the translation process and that your work has a polished, professional appearance.

OCR software is another good tool for improving your visibility with agencies and end customers and making your work processes easier in an age when many archiving and ERP systems are focused on the retention of PDF documents or TIFFs and even actively discourage saving original formats. The major providers of this software often have free, functional demonstration versions to use before making a purchase decision. Try several options and choose the best one for you. You won’t be sorry.


Articles are copyright © ProZ.com, 1999-2012, except where otherwise indicated. All rights reserved.
Content may not be republished without the consent of ProZ.com.