OCR batch program that recreates catalogue structure?
Thread poster: Jan Sundström

Jan Sundström  Identity Verified
Sweden
Local time: 22:47
English to Swedish
Sep 9, 2005

Hi all,

Can you recommend any OCR program with functionality for a really large job?

I have 3200 TIF files, all named "1.TIF", nested in a huge catalogue tree.

I need to OCR them into RTF or similar, and recreate the catalogue structure.

I have Abbyy FineReader 6.0 here, but I haven't yet found settings that recreate the structure, and I don't know whether it is capable of it. Besides, the layout recognition in FineReader is poor; it often chokes on tables or columns.

Do you have any other suggestions?

BTW, the source text is in Swedish, so the OCR should have support for Scandinavian characters.

[Edited at 2005-09-09 08:37]



Jabberwock  Identity Verified
Poland
Local time: 22:47
Member (2004)
English to Polish
It depends on what you want to do with it... Sep 9, 2005

You have not really specified what you mean by "recreate catalogue structure"... Do you want to have 3200 OCRed documents, do you want one document (with some "structure"), or partial documents based on the folders?

I do not know all the OCR programs on the market, but I doubt they will be up to the task. It is rather a question of batch copying/renaming/merging the files before they are fed into the OCR program.

FineReader is not perfect, but I don't think it is much worse than anything else... It is true that you may be forced to visit _each_ page to check the result. However, this gets easier with predefined layouts - if the structures are repeatable, it is faster to load a layout than to recognize one.



Jan Sundström  Identity Verified
Sweden
Local time: 22:47
English to Swedish
TOPIC STARTER
OCR for technical drawings? Sep 9, 2005


Sorry, I'll try to be more specific.

The bulk of the images are just text with no fancy layout.
Ideally, each TIF image should be replaced by one RTF file in the same folder structure. That way, I could pull all RTF files and run them through Trados.

But one portion of the images consists of scanned technical drawings.
What would be the best way to handle those? Is there any OCR program that outputs PDF with an editable text layer?

I read about a program called Image2PDF OCR (Text Layer):

Among the specs are:
...retaining original image and text layout
- Create text-searchable PDF image + text format with compressed hidden text layer
- Make searchable PDF files with OCR technology

Would this be the right way to go?
I realize I probably won't be able to work with the technical drawings in Trados. But at least I want to open them in Acrobat, and translate the text layer. It would be a bit more convenient than opening a flat TIF in Photoshop and replacing the text freehand!

Any input is appreciated.

/Jan



Jabberwock  Identity Verified
Poland
Local time: 22:47
Member (2004)
English to Polish
OK, here it goes... Sep 9, 2005

I don't have much time, so I have tested only parts of the procedure. However, I think it should work:

1. Copy the catalogue structure with files.
2. Using one of the batch renamers listed at
http://www.tucows.com/downloads/Windows/IS-IT/FileManagement/FileRenaming/
rename all the files within their catalogues with a counter, so that they go:
0001.tif, 0002.tif, etc.
3. Copy the catalogue structure again.
4. Rename all the files in the second tree to 0001.rtf, 0002.rtf, etc. (A.F.5 Rename does that, for example).
5. Zip the files in both catalogues.
6. Unpack all the files from the first archive into one directory, with no directory structure (or into several, as a directory with 3,000 files is unwieldy).

Now we have one or more folders with the tif files numbered consecutively.

8. Load the tif files with FineReader in reasonable amounts.
9. OCR.
10. Export the files with the option "Name files as source images".
- You can work on the files here, or place them back:
11. Pack all the new rtf files to the rtf archive so that they overwrite the original files.
12. Unpack the rtf archive with the directory structure.
13. Rename all of them to 1.rtf, if you really want.
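
The rename, flatten, OCR and restore steps above could also be scripted instead of going through a renamer and zip archives. What follows is only a minimal Python sketch, not something tested against FineReader: the function names `flatten_tree` and `restore_tree` and the CSV manifest are made up for illustration, and it assumes the OCR export writes 0001.rtf next to 0001.tif (as with the "Name files as source images" option in step 10). The flat folder should sit outside the source tree.

```python
import csv
import shutil
from pathlib import Path


def flatten_tree(src_root, flat_dir, manifest_path):
    """Copy every .tif under src_root into flat_dir as 0001.tif, 0002.tif, ...

    A CSV manifest records, for each number, the original folder
    (relative to src_root) and the original file name without extension.
    Assumption: flat_dir is NOT located inside src_root.
    """
    src_root, flat_dir = Path(src_root), Path(flat_dir)
    flat_dir.mkdir(parents=True, exist_ok=True)
    tifs = sorted(
        p for p in src_root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".tif"
    )
    with open(manifest_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        for counter, tif in enumerate(tifs, start=1):
            numbered = f"{counter:04d}"
            shutil.copy2(tif, flat_dir / f"{numbered}.tif")
            rel_folder = tif.parent.relative_to(src_root).as_posix()
            writer.writerow([numbered, rel_folder, tif.stem])
    return len(tifs)


def restore_tree(ocr_dir, dest_root, manifest_path, ext=".rtf"):
    """Copy 0001.rtf, 0002.rtf, ... from ocr_dir back into the recorded folders.

    Each file gets its original name back (e.g. 1.rtf). Numbers with no
    OCR output yet are skipped, so the function can be re-run later.
    """
    ocr_dir, dest_root = Path(ocr_dir), Path(dest_root)
    restored = 0
    with open(manifest_path, newline="", encoding="utf-8") as fh:
        for numbered, rel_folder, stem in csv.reader(fh):
            source = ocr_dir / f"{numbered}{ext}"
            if not source.exists():
                continue
            target_dir = dest_root / rel_folder
            target_dir.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source, target_dir / f"{stem}{ext}")
            restored += 1
    return restored
```

Because the manifest keeps the mapping, the second copy of the tree and the zip round trip are not needed in this variant; the originals are only copied, never moved.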

As for the technical drawings, it is hard to say without seeing them. I would still try rtf (Word), as it is possible in FR to recognize a graphic file with text boxes on it (which can be processed with Trados). Just draw a text box ON a picture frame.

[Edited at 2005-09-09 14:07]



Jan Sundström  Identity Verified
Sweden
Local time: 22:47
English to Swedish
TOPIC STARTER
Thanks! Sep 9, 2005


That's brilliant, jabberwock! I'll try that and post my results here after the weekend.



Jabberwock  Identity Verified
Poland
Local time: 22:47
Member (2004)
English to Polish
Update: Sep 9, 2005

Hmm... it was not as brilliant as I thought...

I seem to remember that it was possible, but the zip program I am using now (ZipGenius) is not letting me update the files with the path intact (it adds another file to the archive). Maybe some other zip programs allow it.

However, just in case, I will describe a workaround.

First obtain the list of the rtf files within their structures. There are some programs that let you do just that (piping out of a dir command would not be practical...) Then you edit that list with search and replace functions so you get the result shown below.

Alternatively, you can sometimes export the list of files zipped in the archive. ZipGenius exports the list of files as csv, which is just perfect. After some manipulation, I get a table with rows like this (the paths may be relative or absolute, depending on how you zip the files):

|copy /y|0001.rtf|C:\the\original\path\
|copy /y|0002.rtf|c:\the\second\original\path\

Export it to txt with spaces as separators, rename the txt file to .bat and run it. The file may turn out too large to run in one go, but it is easy enough to split the .bat file beforehand.
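
If the file list is already available as pairs of file name and original folder, the copy lines can be generated rather than edited by hand. This is a rough Python sketch under a few assumptions: the function names are invented for this example, the paths are quoted (my addition, to survive spaces in folder names), and folder names are plain ASCII. Names containing å, ä or ö would need the console code page handled separately, which is not covered here.

```python
from pathlib import Path


def build_copy_commands(rows):
    """Turn (file_name, destination_folder) pairs into 'copy /y' lines."""
    commands = []
    for file_name, folder in rows:
        # Make sure the destination ends with a backslash, as in the rows above.
        if not folder.endswith("\\"):
            folder += "\\"
        commands.append(f'copy /y "{file_name}" "{folder}"')
    return commands


def split_commands(commands, max_lines):
    """Split the command list into chunks of at most max_lines lines."""
    return [commands[i:i + max_lines] for i in range(0, len(commands), max_lines)]


def write_batches(rows, out_dir, max_lines=500):
    """Write restore_01.bat, restore_02.bat, ... into out_dir; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    chunks = split_commands(build_copy_commands(rows), max_lines)
    for index, chunk in enumerate(chunks, start=1):
        bat_path = out_dir / f"restore_{index:02d}.bat"
        # newline="\r\n" makes Python write Windows line endings.
        with open(bat_path, "w", newline="\r\n", encoding="ascii") as fh:
            fh.write("\n".join(chunk) + "\n")
        written.append(bat_path)
    return written
```

The batch files would be run from the folder that holds the numbered rtf files, since the file names in the commands are relative.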


Michael Hesselnberg  Identity Verified
Local time: 22:47
French to German
There is a quite efficient piece of software Sep 10, 2005

which is GEMINI;
you can download a 30-day demo version from http://www.iceni.com

It was recommended to me by ATRIL (DéjàVu) for converting PDF files into RTF files (to work with them in DéjàVu).

HTH
Michael



Hynek Palatin  Identity Verified
Czech Republic
Local time: 22:47
Member (2003)
English to Czech
Directory structure etc. Sep 10, 2005

Jan Sundström wrote:

I read about a program called Image2PDF OCR (Text Layer):

Among the specs are:
...retaining original image and text layout
- Create text-searchable PDF image + text format with compressed hidden text layer
- Make searchable PDF files with OCR technology


I'm not 100% sure, but I think Abbyy FineReader can do the same. I don't use it, but I localized part of it and I think I read about this feature.

Also, the new version has a tool called Automation Manager. It should be able to automatically process the files in subdirectories.

And as for opening several files from subdirectories (in Word, TagEditor, etc.), I often use drag&drop from Total Commander, which can display the contents of all subdirectories by pressing CTRL+B. I'm not saying it's the only and best solution, but it might help you.



Jan Sundström  Identity Verified
Sweden
Local time: 22:47
English to Swedish
TOPIC STARTER
This is what I did in the end... Sep 14, 2005


I decided to go ahead and upgrade my FineReader 6.0 to 8.0 Professional. The Automation Manager is a nice tool, but there are still many functions I miss.

One drawback is that all output ends up in one single directory, even though it can fetch multiple files from a folder tree.

That calls for a "rename - OCR - revert" operation as described above. I downloaded this time-limited shareware that does just what I need:
http://lcen.com/default.asp?id=fren
I simply tack the parent folder names to the file names. This way I can run everything through FineReader and still be able to identify them afterwards.
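
For readers without such a renamer, the same idea of tacking the parent folder names onto the file names can be sketched in a few lines of Python. This is an illustration only, not how the shareware works: the function names and the "__" separator are assumptions, and the paths are taken to be relative, with forward slashes (for example from `Path.relative_to(root).as_posix()`).

```python
from pathlib import PurePosixPath

SEP = "__"  # assumed separator; must not clash with real folder names


def revert_name(flat_name, sep=SEP):
    """Turn 'manuals__pumps__1.rtf' back into the path manuals/pumps/1.rtf."""
    return PurePosixPath(*flat_name.split(sep))


def tack_parents(rel_path, sep=SEP):
    """Turn 'manuals/pumps/1.TIF' into the flat name 'manuals__pumps__1.TIF'."""
    parts = PurePosixPath(rel_path).parts
    flat_name = sep.join(parts)
    # Round-trip check: refuse names that could not be reverted unambiguously,
    # e.g. when a folder name already contains or borders on the separator.
    if revert_name(flat_name, sep) != PurePosixPath(*parts):
        raise ValueError(f"separator {sep!r} is ambiguous for {rel_path!r}")
    return flat_name
```

After OCR, each output such as manuals__pumps__1.rtf can be passed to `revert_name` to find the folder it belongs in. Very deep trees may produce long file names, so it is worth checking a sample first.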

Thanks a lot for pointing me in the right direction!

/Jan

