How best to save PDF conversions in ABBYY Finereader
Thread poster: Susan Welsh

Susan Welsh  Identity Verified
United States
Local time: 07:28
Member (2008)
Russian to English
+ ...
Feb 25, 2017

I have Finereader 10, and have been struggling to get it to work right for me in conversion of PDFs. The users' manual is very unhelpful. Just when I thought I was getting the hang of it, I encountered a bunch of new problems today. What I would like to ask is this: How do you decide, based on the nature of a given PDF, how it should best be saved to Word for the purpose of translation?

* Exact Copy: Almost everyone says you should never do this (except the Finereader support staff, which seems to like it), so I never use it.

* Editable Text: Despite what the manual says about how it is "easy" to edit (although "the formatting may differ slightly from that of the original"), this is not necessarily so. It does retain commands like centering of table columns, which Formatted Text does not. I have decided (tentatively) to use it for documents that have complex tables, which Formatted Text does not do well, but to save the document again as Formatted Text. My client wants the tables and figures in a separate file anyway.

* Formatted Text: Does not retain location of objects on the page (!) and all text will be left-aligned. So if you have a lot of tables that require centered columns, you have to fix them one at a time in Word. Today I encountered new problems I have not noticed before:
1) It was impossible to apply Styles in Word 2010. I use a Style template for this particular customer, and attach the template to the converted document via File > Options > Add-Ins > Manage: Templates > Go, etc. But this time it didn't work. For example, clicking "Normal" in the Style menu did nothing, and the Normal box was not highlighted as it usually is when that style is active. None of the headline/subhead styles did anything. I ended up having to fix everything manually.
2) My client has very specific requirements regarding footnotes. They all have to be converted to endnotes, numbered consecutively regardless of the length of the document. But there can't be any Word coding in there -- so the endnote callout has to be just a superscript number, no code, and the list of endnotes similarly has to use a plain text number. To make things worse, the Finereader conversion made some of the 84 footnote callouts into "real callouts" (coded), and some not. Therefore it was not an easy matter to convert footnotes to endnotes, and then to avoid deleting all the notes in the process of deleting the codes.

* Plain Text: While some people prefer to save everything as plain text and do all the coding in Word, this can be very time-consuming if there are a lot of font changes, and it's easy to miss things. But I found today that it handled BOTH of the problems I discussed above (under Formatted Text) quite nicely. No problem with applying Styles, and the footnotes came out uncoded but still in superscript, as desired. So clearly some formatting is retained. (Centering of table columns is not.)

I would be grateful to have any additional input from expert users!


[Edited at 2017-02-25 20:13 GMT]


 

esperantisto  Identity Verified
Local time: 15:28
Member (2006)
English to Russian
+ ...
Exact copy Feb 25, 2017

I do not know any good reason not to use this option except for some very rare cases when conversion results are really messy. I use, however, ABBYY FineReader 8.0, the best version ever released. And I use Apache OpenOffice / LibreOffice Writer, because working with styles is much better as compared to Word.

 

Anton Konashenok  Identity Verified
Czech Republic
Local time: 13:28
English to Russian
+ ...
Plain text Feb 25, 2017

Exact copy is almost totally unusable - in most cases, any attempts at editing destroy the formatting in the worst way.
Editable copy isn't much better, either.
Even with formatted text, there may be fluctuations in font size, character width, line spacing, etc. across the text.
In my experience, it takes less time to recreate proper formatting by hand from PLAIN TEXT than to remove unnecessary formatting features from editable/formatted text (and you may not even notice those extra features until you put the file through CAT or something).

Your mileage may vary depending on the quality of the original. With styles, I'd recommend working in the opposite way: create a new empty document with all the necessary styles, then paste the OCRed text into it. There may be surprises anyway, though.

I strongly disagree with Esperantisto about the version 8.0. It may be a good solution for source images of excellent quality, but when you have to deal with mobile phone shots of crumpled and faded cash register receipts (OK, OK, I'm exaggerating, but only a little), the recognition quality has been increasing a lot from version to version, at least until version 11.

On second thoughts, it could be worth an effort to write a Word macro that would take formatted/editable text and retain the basic formatting features but strip those that are imposed by Finereader but hardly ever used in real documents (e.g. character width other than 100%), and maybe also "standardize" font sizes, line spacing, paragraph spacing/indents, etc. - e.g. round fractional font size values to the nearest integer and paragraph spacing to a multiple of 6pt.
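A minimal sketch of the "standardizing" rules described above, written in Python rather than as a Word macro, so it shows only the arithmetic. The dictionary describing a text run and the names round_font_size, snap_spacing and normalize_run are hypothetical; they are not part of Finereader or Word, and hooking this up to a real document is left out.

```python
from decimal import Decimal, ROUND_HALF_UP


def _round_half_up(value: float) -> int:
    """Round to the nearest integer, with halves going up (10.5 -> 11)."""
    return int(Decimal(str(value)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def round_font_size(size_pt: float) -> int:
    """Round a fractional font size to the nearest whole point."""
    return _round_half_up(size_pt)


def snap_spacing(spacing_pt: float, step: int = 6) -> int:
    """Snap paragraph spacing to the nearest multiple of `step` points."""
    return _round_half_up(spacing_pt / step) * step


def normalize_run(run: dict) -> dict:
    """Return a cleaned copy of a (hypothetical) run description.

    Basic formatting such as bold or italic is kept as it is; only the
    font size and the character width are standardized.
    """
    cleaned = dict(run)
    cleaned["size_pt"] = round_font_size(run["size_pt"])
    cleaned["char_scale"] = 100  # character width back to 100%
    return cleaned
```

With these rules a 10.5 pt font size becomes 11 pt, and 7.3 pt of paragraph spacing snaps to 6 pt. Whether halves should round up or down is a matter of taste; the sketch rounds them up.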

[Edited at 2017-02-25 23:56 GMT]


 

Susan Welsh  Identity Verified
United States
Local time: 07:28
Member (2008)
Russian to English
+ ...
TOPIC STARTER
to esperantisto and Anton Feb 26, 2017

Thanks for your replies.
I hesitate to delve into Styles using LibreOffice instead of Word, because it has taken me so bloody long to get any sort of mastery of Word styles. I hate to start all over again!
As for attaching the styles template first, in a clean document, and then inserting the converted text: I tried it with the document I'm currently wrestling with (saved as formatted text), and the result was the same as before: It refused to "take" the styles. Maybe there's something particularly weird about this PDF. I don't recall having had this problem before.
I have another PDF to convert for this particular job. Maybe I'll try saving the tables in editable text and the text in plain text. I'm also interested to see what others may have to say.


 

mikhailo
Local time: 15:28
English to Russian
+ ...
re Feb 26, 2017

Susan Welsh wrote:

Thanks for your replies.
I hesitate to delve into Styles using LibreOffice instead of Word, because it has taken me so bloody long to get any sort of mastery of Word styles. I hate to start all over again!
As for attaching the styles template first, in a clean document, and then inserting the converted text: I tried it with the document I'm currently wrestling with (saved as formatted text), and the result was the same as before: It refused to "take" the styles. Maybe there's something particularly weird about this PDF. I don't recall having had this problem before.
I have another PDF to convert for this particular job. Maybe I'll try saving the tables in editable text and the text in plain text. I'm also interested to see what others may have to say.



If your translation is usually shorter than the original, you can use exact copy or (with a true PDF) a workflow with Infix.
If the translation is longer, the best way is to reformat it from scratch, starting from FR's plain text output.


 

Natalie  Identity Verified
Poland
Local time: 13:28
Member (2002)
English to Russian
+ ...

Moderator of this forum
Good morning, Susan Feb 26, 2017

I have used FineReader since version 7, if I remember correctly; I use version 12 now, and I have always been quite happy with the results.

The way of saving the results depends on the document (how large it is, how complicated the formatting is). For a small one-page document, or for larger documents lacking complicated elements, the exact copy usually works fine; the only thing you need to do afterwards is remove the hard page breaks, and in most cases that is enough.

For larger and more complicated documents I use an editable copy; however, I always check the whole document manually, especially the tables, and remove all unneeded elements (for example, converted parts of images).

One of the largest conversions I did this way was a manual on safety: it contained 10 chapters, 30 to 45 pages each, with a 2-column layout and tons of tables and images. The results of the conversion into an editable copy (with some manual checking) were absolutely fine. One smaller chapter was saved as an exact copy (I simply forgot to change the option), and the result was also fine; however, as Russian text is longer than English, parts of the text were hidden beyond the boundaries of the text fields, and this had to be corrected manually afterwards.

So decide this each time, depending on the document.


 

Rolf Keller
Germany
Local time: 13:28
English to German
Don't mix up Style Sheets with Direct Formatting Feb 26, 2017

Susan Welsh wrote:

As for attaching the styles template first, in a clean document, and then inserting the converted text: I tried it with the document I'm currently wrestling with (saved as formatted text), and the result was the same as before: It refused to "take" the styles.


You misunderstood Anton's hint. He meant: "(re-)format the document manually". In order to "take" the styles from your existing style sheet automatically, several conditions must be met:

#1 The original author's document (i.e. .docx or what have you) must include named styles. Very often this condition is not met: many writers don't know anything about named Styles and use Direct Formatting instead.

#2 The software that was used to create the .pdf (e.g. PDFMaker) must be able, and set, to integrate the styles into the .pdf file while preserving their names. Many programs are not able to do that.

#3 FineReader must be able to convert the pdf styles to .docx styles while preserving their names. I'm not sure whether FineReader can do this. Anyway, no OCR function can ever do this because styles as such are invisible.

So, in many cases you will have no success.


 

Susan Welsh  Identity Verified
United States
Local time: 07:28
Member (2008)
Russian to English
+ ...
TOPIC STARTER
to Rolf Feb 26, 2017

Rolf Keller wrote:
You misunderstood Anton's hint. He meant: "(re-)format the document manually".
In order to "take" the styles from your existing style sheet automatically, several conditions must be met:

I don't think I misunderstood Anton. I was referring to this:

Anton Konashenok wrote:
With styles, I'd recommend working in the opposite way: create a new empty document with all the necessary styles, then paste the OCRed text into it. There may be surprises anyway, though.


I was not expecting the styles to be transferred automatically. As you say, Rolf, there's no way of knowing how the original author or DTP person formatted it. My problem was that the formatted document would not "take" my styles applied manually. In other words, with my template attached, I highlight the main headline and click "Heading 1" in the Styles menu. Nothing happens. Same with all the subheads (Heading 2, Heading 3), the "Normal" text, etc.


 

Susan Welsh  Identity Verified
United States
Local time: 07:28
Member (2008)
Russian to English
+ ...
TOPIC STARTER
I tried "exact copy" - and what is "normal" (this is not a psychiatric question!) Feb 26, 2017

Natalie wrote:

One of the largest conversions I did this way, was a manual on safety: it contained 10 chapters, 30 to 45 pages each, with 2-column layout and tons of tables and images. The results of the conversion into an editable copy (with some manual checking) were absolutely fine.


Natalie, you must be a magician! I just tried saving my document as "exact copy" (14,000 words, one-column format, lots of footnotes, tables, and graphics). When I tried to apply my style template, here's what happened: Again, the styles did not "take." The text format remained Calibri 10.5 points, no first line paragraph indent, extra space between paragraphs, although my style calls for Times Roman 12 point, with a first line indent and no space between paragraphs. Clicking "Normal" did nothing at all. When I tried a page with a table, clicking "Normal" made the type print one line on top of another.

One odd thing is that the PDF formats the text as "Body Text (28)," whereas I use "Normal." I don't even know what "Body Text (28)" means; I assume it is based on Normal. But it seems I cannot change it (when saving as exact copy, editable text, or formatted text). In Word, I tried the Find and Replace function, for styles, replacing "Body Text (28)" with "Normal," but all it did was change the leading (space between lines). The font and size remained the same.

Ah, but here's something weird: The dropdown menu in Word for Find and Replace > Styles gives you Body Text (28) with both a character symbol and a paragraph symbol, whereas Normal gives only the paragraph symbol. Maybe that's why it is not changing the font and size. I don't really understand this character/paragraph business in Word. Does this mean that you cannot really set "Normal" to a specific font and size in your new template?


 

Anton Konashenok  Identity Verified
Czech Republic
Local time: 13:28
English to Russian
+ ...
Delete ALL Finereader styles Feb 27, 2017

Susan, Finereader generates a lot of styles - essentially, a separate style for each paragraph looking different from the rest. Your "Body Text (28)" is one of these. Go into the Word style editor for the Finereader-generated document and delete them all. Then you can paste the text into the template (but then you may have to reassign the styles manually anyway).

[Edited at 2017-02-27 14:15 GMT]


 

Susan Welsh  Identity Verified
United States
Local time: 07:28
Member (2008)
Russian to English
+ ...
TOPIC STARTER
@Anton Feb 27, 2017

What you suggest basically means the same as saving the Finereader document as plain text, correct? (I tried it, and that's what it seems to me.) So there's really no advantage to it that I can see.

 

Anton Konashenok  Identity Verified
Czech Republic
Local time: 13:28
English to Russian
+ ...
Not exactly Feb 27, 2017

By doing this, you will delete paragraph-wide styles, but bolding, italics, underlining etc. within paragraphs (that is, not dictated by style) will remain.
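To illustrate the difference, here is a small Python model of a document. It is only a sketch: the paragraph and run dictionaries, the function name strip_generated_styles, and the assumption that Finereader-generated styles all look like "Body Text (28)" (a name followed by a number in parentheses) are mine, not something Word or Finereader defines.

```python
import re

# Assumed naming pattern for generated styles, e.g. "Body Text (28)".
GENERATED_STYLE = re.compile(r".+ \(\d+\)$")


def strip_generated_styles(paragraphs, fallback="Normal"):
    """Reset generated paragraph styles to `fallback`.

    Formatting stored on the runs themselves (bold, italic, underline)
    is copied over untouched, which is what separates this from saving
    as plain text.
    """
    cleaned = []
    for para in paragraphs:
        style = para["style"]
        if GENERATED_STYLE.match(style):
            style = fallback
        cleaned.append({"style": style, "runs": [dict(r) for r in para["runs"]]})
    return cleaned


doc = [
    {"style": "Body Text (28)",
     "runs": [{"text": "See ", "bold": False}, {"text": "Table 1", "bold": True}]},
    {"style": "Heading 1",
     "runs": [{"text": "Results", "bold": False}]},
]
result = strip_generated_styles(doc)
```

In this model the first paragraph ends up as "Normal" while "Table 1" stays bold, and "Heading 1" is left alone because it does not match the assumed pattern.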

 

Susan Welsh  Identity Verified
United States
Local time: 07:28
Member (2008)
Russian to English
+ ...
TOPIC STARTER
Ah! Feb 27, 2017

That's good to know, thanks Anton.

 

