Which format for my need?
Thread poster: Samuel Murray

Samuel Murray  Identity Verified
Netherlands
Local time: 19:40
Member (2006)
English to Afrikaans
+ ...
May 27, 2009

G'day everyone

I have several very old books that I'd like to have in electronic format. I have scanned several of them, but the OCR is only about 95% accurate. This is good enough for making the text searchable, but not good enough for general, overall reference purposes. In other words, it is a good enough accuracy for me to use Ctrl+F to search for something, and in many cases I should be able to deduce the correct spelling of a mis-OCR'ed word (through context), but I would ultimately still need to consult the image file from time to time.

What I want to know is if there is a format that will allow me to have image files as pages but OCR'ed text as searchable content. In other words, if there were such a thing as hidden text in a PDF, then I would have each image as a PDF page and simply put the OCR'ed text of that page as hidden text, so that the page comes up in a search but I would still need to read the page like a paper page.

My current solution is to use a text-like format that allows hyperlinks in it, and then I simply ensure that every page contains a link to the relevant image file. Then I can use Ctrl+F to do a search, and when I find something, I can click the hyperlink, which launches the relevant image file in the default image viewer. This is a workable solution, though far from ideal.
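A minimal sketch of the link-per-page approach described above, assuming HTML as the "text-like format" and assuming the OCR output is already available as one string per page. The function name `build_index` and the file names are hypothetical, chosen only for illustration.

```python
import html


def build_index(pages):
    """Build one searchable HTML document from OCR text.

    `pages` is a list of (image_filename, ocr_text) tuples in page order.
    Each page gets a heading that links to its scanned image, followed by
    the OCR text, so Ctrl+F finds the text and the link opens the scan.
    """
    parts = ["<html><body>"]
    for number, (image_name, text) in enumerate(pages, start=1):
        href = html.escape(image_name, quote=True)
        parts.append(f'<h2><a href="{href}">Page {number}</a></h2>')
        # Escape the OCR text so stray < or & characters do not break the page.
        parts.append(f"<pre>{html.escape(text)}</pre>")
    parts.append("</body></html>")
    return "\n".join(parts)
```

Writing the returned string to a file such as `index.html` in the same folder as the images should give a single document that can be searched in any browser.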

Can what I've described above be done in PDF at all? Or is there another format that will allow this sort of thing?

Thanks!
Samuel



Andreas Nieckele  Identity Verified
Brazil
Local time: 14:40
English to Portuguese
Maybe May 27, 2009

You could try using InDesign or your favorite desktop publishing program and do the following:

- Create a new page, and place a text box containing the text for that page
- On top of this text box, place the image from the page so that it completely blocks the text
- Generate a pdf

I've never tried to do this myself, but I guess in theory it should work.



Adam Łobatiuk  Identity Verified
Poland
Local time: 19:40
Member (2009)
English to Polish
+ ...
Adobe Acrobat (Professional) has an OCR feature May 27, 2009

And it works surprisingly well. You see scanned pages, but the text in the PDF is searchable. See here: http://www.llrx.com/features/adobe8.htm


Narcis Lozano Drago  Identity Verified
Spain
Local time: 19:40
Member (2007)
English to Spanish
+ ...
PDF with transparent text May 27, 2009

The OCR software I use has this option. You can load a PDF, run the recognition, and then export a PDF file in which each page is saved as an image with the recognised text hidden, so that you can easily search it in Adobe Reader.
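For anyone curious how the hidden text works inside the file: PDF has a text rendering mode 3 ("neither fill nor stroke"), which draws text invisibly while leaving it searchable. The sketch below only builds the content-stream fragment for one line of such text; it is not a full PDF writer, and it is not necessarily how any particular OCR program does it. The function names, the font resource name `F1` and the coordinates are assumptions for illustration, and only plain ASCII text is handled.

```python
def escape_pdf_string(text):
    """Escape the characters that are special inside a PDF literal string."""
    return (
        text.replace("\\", "\\\\")  # backslash first, so later escapes survive
        .replace("(", "\\(")
        .replace(")", "\\)")
    )


def invisible_text_ops(text, x, y, font_name="F1", font_size=10):
    """Return content-stream operators that place `text` invisibly at (x, y).

    Assumes a font resource called `font_name` is declared on the page
    (hypothetical here) and that the page image is drawn separately.
    """
    return "\n".join(
        [
            "BT",                                 # begin text object
            f"/{font_name} {font_size} Tf",       # select font and size
            "3 Tr",                               # rendering mode 3: invisible
            f"{x} {y} Td",                        # move to the text position
            f"({escape_pdf_string(text)}) Tj",    # show the string
            "ET",                                 # end text object
        ]
    )
```

In a real searchable PDF, the OCR program emits one such run per recognised word or line, positioned to match the location of the words in the page image.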

Unfortunately, this software (E.Typist, http://mediadrive.jp/products/et/ ) is in Japanese (recognition is multilingual, though). But I guess that there may be other OCR software out there with the same functionality.


Narcis



Jing Nie
China
Local time: 01:40
Member (2011)
English to Chinese
+ ...
I agree May 27, 2009

Adam Łobatiuk wrote:

And it works surprisingly well. You see scanned pages, but the text in the PDF is searchable. See here: http://www.llrx.com/features/adobe8.htm


You can convert the scanned images into a PDF and then use the OCR function in Acrobat. It will not change the font or layout, and all the text will be searchable.



Erik Freitag  Identity Verified
Germany
Local time: 19:40
Member (2006)
Dutch to German
+ ...
ABBYY FineReader May 28, 2009

ABBYY FineReader will do what you want and save the OCRed document as a searchable PDF. You can also copy text from the PDF.


amurati
Local time: 19:40
English to Albanian
+ ...
So far the best tool for OCR is ABBYY FineReader May 28, 2009

efreitag wrote:

ABBYY FineReader will do what you want and save the OCRed document as a searchable PDF. You can also copy text from the PDF.


But the scanned book should have a resolution of more than 300 DPI, so that the image is good enough to be OCRed.



Sergei Leshchinsky  Identity Verified
Ukraine
Local time: 20:40
Member (2008)
English to Russian
+ ...
make two files of each May 28, 2009

Samuel Murray wrote:
What I want to know is if there is a format that will allow me to have image files as pages but OCR'ed text as searchable content.


1) Scan and save to DJVU to have the pages as images. DJVU compression is about 7 times as compact as PDF plus it preserves the paper colour, as it treats the image as several layers: paper, pictures, and text that are processed in different ways.
2) Make a plain TXT for search.
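A rough sketch of how the two-file setup could be used in practice: search the plain TXT, get back page numbers, then open those pages in the DJVU viewer. It assumes the TXT file marks page breaks with a form feed character, which is only one possible convention; the function name `find_pages` is made up for this example.

```python
def find_pages(full_text, term, page_separator="\f"):
    """Return the 1-based numbers of the pages whose OCR text contains `term`.

    `full_text` is the whole book as one string, with pages separated by
    `page_separator` (a form feed by default, which is an assumption).
    The match ignores upper and lower case.
    """
    needle = term.casefold()
    pages = full_text.split(page_separator)
    return [
        number
        for number, page in enumerate(pages, start=1)
        if needle in page.casefold()
    ]
```

Since the OCR is only about 95% accurate, a search like this will miss words that were misrecognised, so it helps to try a shorter fragment of the word as well.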



Samuel Murray  Identity Verified
Netherlands
Local time: 19:40
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
Where can I find a DJVU generator? May 28, 2009

Sergei Leshchinsky wrote:
1) Scan and save to DJVU to have the pages as images. DJVU compression is about 7 times as compact as PDF plus it preserves the paper colour, as it treats the image as several layers: paper, pictures, and text that are processed in different ways.


I'm well aware of DJVU but I have yet to find a DJVU generator that is not experimental. Do you know of stable, usable DJVU generators?



Samuel Murray  Identity Verified
Netherlands
Local time: 19:40
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
Best resolution for OCR May 28, 2009

Ahmet Murati wrote:
efreitag wrote:
ABBYY FineReader will do what you want and save the OCRed document as a searchable PDF. You can also copy text from the PDF.

But the scanned book should have a resolution of more than 300 DPI, so that the image is good enough to be OCRed.


I have found that 300 DPI is about the most economical resolution for OCR. Scanning at 450 DPI will result in an image roughly twice as large (2.25 times as many pixels), and scanning it will also take about twice as long. Scanning at 600 DPI will result in an image four times as large, and scanning it will take about four times as long.
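The arithmetic behind those figures, as a small sketch: the number of pixels grows with the square of the resolution, so the uncompressed image size scales the same way. Scan time is assumed here to scale similarly, which depends on the scanner. The function names are made up for this example; the A5 size of 148 x 210 mm is the standard one.

```python
MM_PER_INCH = 25.4


def relative_pixel_count(dpi, baseline_dpi=300):
    """How many times more pixels a scan at `dpi` has than one at `baseline_dpi`."""
    return (dpi / baseline_dpi) ** 2


def pixel_dimensions(width_mm, height_mm, dpi):
    """Width and height in pixels of a page scanned at `dpi`, rounded."""
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


# Compare the three resolutions for an A5 page (148 x 210 mm).
for dpi in (300, 450, 600):
    width, height = pixel_dimensions(148, 210, dpi)
    print(dpi, width, height, relative_pixel_count(dpi))
```

So 450 DPI gives 2.25 times the pixels of 300 DPI, and 600 DPI gives exactly four times.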

My scanner takes about 30 seconds to scan an A5 page (using the document feeder) at 300 DPI.

My experience (anecdotal) is that I gain about one or two percent in accuracy with 450 DPI as opposed to 300 DPI, and only about another half a percent in accuracy with 600 DPI, so it really isn't worth scanning at resolutions higher than 300 DPI.

I also find that there is very little difference in accuracy whether I scan in full colour or in straight black and white, except if the printed page contains halftone backgrounds, in which case strangely the black and white scan OCRs better (I would have thought the other way round makes more sense). Even so, I find it best to scan in full colour and leave it up to the OCR program to posterise the image if it wants to.

I'm a little annoyed that my scanner has a white background and not black, for black would be somewhat easier to autocrop.

Anyway, I was aware of ABBYY's PDF creator but I did not realise that it can combine text with images. I'll experiment a bit.

