Is this the solution to formatting problems from OCR?
Thread poster: DJHartmann

DJHartmann  Identity Verified
Australia
Member (2014)
Thai to English
+ ...

MODERATOR
Jun 13, 2016

While I have purchased ABBYY FineReader, I've avoided using it much of the time. After doing (Thai) OCR, fixing spelling errors throughout, getting the source formatting right, and then running the file through my CAT tool, the final MS Word doc ends up with strange formatting problems. I had quizzed ABBYY about why the italics, bold and underlining were locked in some paragraphs but not in others, and posted asking for help here, but it was never solved (my workaround was to type the bold heading, for example, in a new doc and then copy it to the file I was working on!). In addition, many of the agencies I work for now include in their instructions, "DO NOT OCR the source files – they create formatting that is unusable on the back end"!

That said, there are certainly situations where OCR can be very helpful. I'm wondering whether getting the OCR to export as a .txt file and manually inserting formatting would be the best workaround. I had previously tried exporting as plain text, as a Word document, and fixing formatting in the source prior to running the CAT tool, but the final translated doc still had locked bold, italics and underlining, as mentioned earlier.

I've tested this new .txt method on a couple of PDFs and have noticed no problems in the final files. What does everyone else suggest?

Is this the solution to formatting problems from OCR?



telefpro
Local time: 03:52
Portuguese to English
+ ...
formatting problems Jun 14, 2016

Some formatting problems still persist. OCR can't always solve this issue.

esperantisto  Identity Verified
Local time: 01:22
Member (2006)
English to Russian
+ ...
OpenOffice Writer Jun 14, 2016

In my experience, removing unnecessary formatting and fixing other post-OCR problems is easier with Apache OpenOffice / LibreOffice Writer (even when working with MS Word formats), especially with the OOoFBTools add-on. However, I do not work with Thai and have no idea about any implications specific to that language / script. As far as I understand, saving results to plain text is sometimes really the best option for languages of Asia / Far East with complex scripts.

[Edited at 2016-06-14 04:48 GMT]



Christine Andersen  Identity Verified
Denmark
Local time: 00:22
Member (2003)
Danish to English
+ ...
Acrobat or Trados Studio Jun 14, 2016

I can open PDFs in Trados Studio, but threatening to do so made one recent client send me the InDesign file.

The trouble was that there were hard line breaks, often two or three in a sentence, so the source was broken up into tiny segments that could not be merged. Apart from that, I could not create a target file that I could open. The client was happy, but it was sheer guesswork on my part, as I had no WYSIWYG of the document, graphics and formatting, and had to assume the DTP person could make any adjustments.

A file like that is a pain to translate anyway...
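For readers facing the same hard line breaks, here is a minimal sketch of one way to merge them, assuming the source text can first be exported to plain text and that blank lines mark real paragraph breaks. The function name `join_hard_breaks` is made up for illustration. It also assumes a language that separates words with spaces (such as Danish or English); it would not suit Thai, where a space inserted at each break would be wrong.

```python
import re


def join_hard_breaks(text: str) -> str:
    """Merge single line breaks inside a paragraph; keep blank lines as paragraph breaks."""
    # Normalise Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A paragraph boundary is one or more blank (or whitespace-only) lines.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside each paragraph, collapse every run of whitespace to one space.
    joined = [" ".join(p.split()) for p in paragraphs]
    # Drop empty paragraphs and separate the rest with one blank line.
    return "\n\n".join(p for p in joined if p)
```

This does not rejoin words hyphenated across a line break, so the result still needs a read-through before it goes into a CAT tool.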

Sometimes Adobe Acrobat works well with Danish, my source language, if the settings are correct, but if the scan quality is not good, then nothing helps much - Danish has three extra letters, which are very often garbled. A 'search and replace' may or may not help - they are not always garbled consistently!
Then come all the other spelling errors...

Whichever workaround is best for a given situation, I hope translators are making clients aware of the need for a workaround and charging for the time spent re-creating documents and formatting. It should not be included in the same standard word rate as for a simple document in Word!



Tom in London
United Kingdom
Local time: 23:22
Member (2008)
Italian to English
Is this the solution? No- Jun 14, 2016

DJHartmann wrote:

....

Is this the solution to formatting problems from OCR?



None of what you describe has anything to do with translating.



DJHartmann  Identity Verified
Australia
Member (2014)
Thai to English
+ ...

MODERATOR
TOPIC STARTER
No blanket rates Jun 14, 2016

Christine Andersen wrote:

I hope translators are making clients aware of the need for a workaround and charging for the time spent re-creating documents and formatting. It should not be included in the same standard word rate as for a simple document in Word!


I totally agree with this point.

Tom in London wrote:

Nothing worth noting



Thanks for your two cents, Tom.



Katerina O.  Identity Verified
Russian Federation
English to Russian
+ ...
Clear All Formatting Jun 14, 2016

I use the 'Clear All Formatting' function in Word, and then apply styles as necessary. It's not that time-consuming after all.


DJHartmann  Identity Verified
Australia
Member (2014)
Thai to English
+ ...

MODERATOR
TOPIC STARTER
Locked Jun 14, 2016

Katerina O. wrote:

I use the 'Clear All Formatting' function in Word.


Yes, likewise.

However, the bold, italics and underline functions were still locked afterwards.

In certain documents, the document language was locked (as either Thai or Arabic) and I couldn't change it to English after translation to run a spellcheck.



Tina Vonhof  Identity Verified
Canada
Local time: 16:22
Member (2006)
Dutch to English
+ ...
Why bother? Jun 14, 2016

Why would you spend valuable time struggling with formatting in a converted document, which, as Tom points out, has nothing to do with translation? Just open a blank document and start typing!


Anton Konashenok  Identity Verified
Czech Republic
Local time: 00:22
Russian to English
+ ...
Styles? Jun 14, 2016

If the formatting remains locked after being cleared, it is most likely due to Word styles. Open the list of styles used in the document and delete the styles responsible for that formatting.
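To see which styles might be responsible before deleting anything, the style definitions can be read straight out of the file. The following is a minimal sketch, assuming the document is saved as .docx (a zip archive containing `word/styles.xml`); the function name `styles_with_emphasis` is hypothetical. It only reports that a bold, italic or underline property is present in a style, not whether it is switched on or off.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside .docx files.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
# Bold, italic and underline, plus the complex-script variants of bold and italic.
TAGS = ("b", "bCs", "i", "iCs", "u")


def styles_with_emphasis(docx_path):
    """Return {style_id: [property names]} for styles that mention bold/italic/underline."""
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/styles.xml"))
    found = {}
    for style in root.findall("w:style", NS):
        rpr = style.find("w:rPr", NS)
        if rpr is None:
            continue  # this style defines no character formatting
        # Presence only: an element such as <w:b w:val="0"/> is also listed.
        props = [t for t in TAGS if rpr.find(f"w:{t}", NS) is not None]
        if props:
            found[style.get(f"{{{W}}}styleId")] = props
    return found
```

Calling `styles_with_emphasis("converted.docx")` (the file name is just an example) gives a short list of style IDs to look up in Word's style list.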


Tom in London
United Kingdom
Local time: 23:22
Member (2008)
Italian to English
Yes, and..... Jun 14, 2016

Tina Vonhof wrote:

Why would you spend valuable time struggling with formatting in a converted document, which, as Tom points out, has nothing to do with translation? Just open a blank document and start typing!


Yes - and translating

If you use dictation software you can just read out your translation from the PDF and hey presto, it will type itself out in the target language. I too struggled with PDF conversion for a long time until I realised I could do it with dictation.

[Edited at 2016-06-14 15:18 GMT]



DJHartmann  Identity Verified
Australia
Member (2014)
Thai to English
+ ...

MODERATOR
TOPIC STARTER
TM Jun 14, 2016

Tina Vonhof wrote:

Why would you spend valuable time struggling with formatting in a converted document, which, as Tom points out, has nothing to do with translation? Just open a blank document and start typing!


In plenty of situations this is the best option; however, sometimes OCR can be very useful!



DJHartmann  Identity Verified
Australia
Member (2014)
Thai to English
+ ...

MODERATOR
TOPIC STARTER
List of styles? Jun 14, 2016

Anton Konashenok wrote:

If the formatting remains locked after being cleared, it is most likely due to Word styles. Open the list of styles used in the document and delete the styles responsible for that formatting.


I have never found instructions for this. Most point only to the clear formatting icon.

Nevertheless, shouldn't a .txt file be clear of all styles and formatting, and safe to use? While I have my own issues with the MS Word docs, something must be leading the agencies to disallow OCR!



DJHartmann  Identity Verified
Australia
Member (2014)
Thai to English
+ ...

MODERATOR
TOPIC STARTER
Still locked Jun 14, 2016

Well, even using a .txt has caused issues.

It seems to be related to MS Word and Thai fonts, because the Latin fonts can be formatted fine.

My process was as follows:

Exported the OCR as plain text .txt file.

Opened with MS Word.

Bolded the heading of the first line (worked) and then tried to correct the spelling of the first character of the first word. The new text that I typed couldn't be bolded! However, if I typed new text in English, it could!

If anyone can clarify this situation, it'd be very appreciated!
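One possible explanation, which nothing in this thread confirms: in the .docx format, bold is stored as two separate properties, `w:b` for ordinary characters and `w:bCs` for complex-script characters, and Word treats Thai as a complex script. A Thai run that has `w:b` but no `w:bCs` may therefore not appear bold, while English text in the same place does. The sketch below, with the hypothetical name `thai_runs_missing_cs_bold`, looks for such runs, assuming the file is saved as .docx. It checks direct formatting on each run only and does not resolve styles.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside .docx files.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def has_thai(text):
    """True if the text contains a character from the Thai block (U+0E00 to U+0E7F)."""
    return any("\u0e00" <= ch <= "\u0e7f" for ch in text)


def thai_runs_missing_cs_bold(docx_path):
    """Return the text of Thai runs that carry w:b but not w:bCs directly on the run."""
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    suspects = []
    for run in root.iter(f"{{{W}}}r"):
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        rpr = run.find("w:rPr", NS)
        if rpr is None or not has_thai(text):
            continue  # no direct formatting, or no Thai text in this run
        if rpr.find("w:b", NS) is not None and rpr.find("w:bCs", NS) is None:
            suspects.append(text)
    return suspects
```

An empty result does not rule the explanation out, since the bold setting may come from a style rather than from the run itself.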


esperantisto  Identity Verified
Local time: 01:22
Member (2006)
English to Russian
+ ...
File sample Jun 15, 2016

It would be better to share a file (a sample page where the problem appears).
