https://www.proz.com/forum/software_applications/36561-ocr_batch_program_that_recreates_catalogue_structure.html

OCR batch program that recreates catalogue structure?
Thread poster: Jan Sundström
Jan Sundström
Sweden
English to Swedish
Sep 9, 2005

Hi all,

Can you recommend any OCR program with functionality for a really large job?

I have 3200 TIF files, all named "1.TIF", nested in a huge catalogue (directory) tree.

I need to OCR them into RTF or similar, and recreate the catalogue structure.

I have Abbyy FineReader 6.0 here, but I haven't managed to tweak its settings to recreate the structure yet, and I don't know whether it is capable of that. Besides, the layout recognition in FineReader is poor; it often chokes on tables or columns.

Do you have any other suggestions?

BTW, the source text is in Swedish, so the OCR should have support for Scandinavian characters.

[Edited at 2005-09-09 08:37]


 
Jaroslaw Michalak
Poland
Member (2004)
English to Polish
SITE LOCALIZER
It depends on what you want to do with it... Sep 9, 2005

You have not really specified what you mean by "recreate catalogue structure"... Do you want to have 3200 OCRed documents, do you want one document (with some "structure"), or partial documents based on the folders?

I do not know all the OCR programs on the market, but I doubt they will be up to the task. It is more a question of batch copying/renaming/merging the files before they are fed into the OCR program.

FineReader is not perfect, but I don't think it is much worse than anything else... It is true that you may be forced to visit _each_ page to check the result. However, this gets easier with predefined layouts - if the structures are repeatable, it is faster to load a layout than to recognize one.


 
Jan Sundström
Sweden
English to Swedish
TOPIC STARTER
OCR for technical drawings? Sep 9, 2005

Jabberwock wrote:
You have not really specified what you mean by "recreate catalogue structure"... Do you want to have 3200 OCRed documents, do you want one document (with some "structure"), or partial documents based on the folders?

I do not know all the OCR programs in the market, but I doubt they will be up to the task. It should be rather a question of batch copying/renaming/merging files before they are fed into OCR.

FineReader is not perfect, but I don't think it is much worse than anything else... It is true that you may be forced to visit _each_ page to check the result. However, this gets easier with predefined layouts - if the structures are repeatable, it is faster to load a layout than to recognize one.


Sorry, I'll try to be more specific.

The bulk of the images are just text with no fancy layout.
Ideally, each TIF image should be replaced by one RTF file in the same folder structure. That way, I could pull all RTF files and run them through Trados.

But some of the images are scanned technical drawings.
What would be the best way to handle those? Is there an OCR program that outputs PDF with an editable text layer?

I read about a program called Image2PDF OCR (Text Layer):

Among the specs are:
...retaining original image and text layout
- Create text-searchable PDF image + text format with compressed hidden text layer
- Make searchable PDF files with OCR technology

Would this be the right way to go?
I realize I probably won't be able to work with the technical drawings in Trados. But at least I want to open them in Acrobat, and translate the text layer. It would be a bit more convenient than opening a flat TIF in Photoshop and replacing the text freehand!

Any input is appreciated.

/Jan


 
Jaroslaw Michalak
Poland
Member (2004)
English to Polish
SITE LOCALIZER
OK, here it goes... Sep 9, 2005

I don't have much time, so I have tested only parts of the procedure. However, I think it should work.

1. Copy the catalogue structure with files.
2. Rename all the files within their catalogues with a counter, using one of the batch renamers at
http://www.tucows.com/downloads/Windows/IS-IT/FileManagement/FileRenaming/
so that they go 0001.tif, 0002.tif, etc.
3. Copy the catalogue structure again.
4. Rename all the files in the second tree to 0001.rtf, 0002.rtf, etc. (A.F.5 Rename does that, for example).
5. Zip the files in both catalogues.
6. Unpack all the files from the first archive into one directory, with no directory structure (or into several, as a directory with 3,000 files is unwieldy).

Now we have one or more folders with the tif files numbered consecutively.

8. Load the tif files into FineReader in manageable batches.
9. OCR.
10. Export the files with the option "Name files as source images".
- You can work on the files here, or place them back:
11. Pack all the new rtf files to the rtf archive so that they overwrite the original files.
12. Unpack the rtf archive with the directory structure.
13. Rename all of them to 1.rtf, if you really want.
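
Steps 1 to 6 (copy, rename with a counter, flatten) could also be scripted instead of using a renamer and a zip tool. What follows is only a minimal Python sketch, not part of the tested procedure: the function name flatten_tree, the manifest file, and the folder names in the usage note are assumptions made for illustration. It copies each TIF into one flat folder under a consecutive number and records which folder it came from.

```python
import csv
import shutil
from pathlib import Path


def flatten_tree(src_root, flat_dir, manifest_path):
    """Copy every TIF under src_root into flat_dir as 0001.tif, 0002.tif, ...

    Also writes a CSV manifest (flat name, original relative folder) so the
    OCR output can later be put back into the right folder.
    """
    src_root, flat_dir = Path(src_root), Path(flat_dir)
    flat_dir.mkdir(parents=True, exist_ok=True)

    # Collect the images first and sort them so the numbering is repeatable.
    images = sorted(
        p for p in src_root.rglob("*")
        if p.is_file() and p.suffix.lower() in {".tif", ".tiff"}
    )

    rows = []
    for counter, image in enumerate(images, start=1):
        flat_name = f"{counter:04d}.tif"
        shutil.copy2(image, flat_dir / flat_name)
        # Store the folder relative to the root, with forward slashes.
        rows.append((flat_name, image.parent.relative_to(src_root).as_posix()))

    with open(manifest_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return rows
```

A call such as flatten_tree("C:/scans", "C:/flat", "C:/manifest.csv") (hypothetical paths) leaves the original tree untouched. The flat folder should sit outside the source tree so that a second run does not pick up its own copies.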

As for the technical drawings, it is hard to say without seeing them. I would still try RTF (Word), as FineReader can recognize a graphic file with text boxes on it (which can be processed with Trados). Just draw a text box ON a picture frame.

[Edited at 2005-09-09 14:07]


 
Jan Sundström
Sweden
English to Swedish
TOPIC STARTER
Thanks! Sep 9, 2005

Jabberwock wrote:

I don't have much time, so I have tested only parts of the procedure. However, I think it should work

1. Copy the catalogue structure with files.
2. Rename with one of batch renamers:
http://www.tucows.com/downloads/Windows/IS-IT/FileManagement/FileRenaming/
all the files within their catalogues with counter, so they go:
0001.tif, 0002.tif, etc.
3. Copy the catalogue structure again.
4. Rename all the files in the second tree to 0001.rtf, 0002.rtf, etc. (A.F.5 Rename does that, for example).
5. Zip the files in both catalogues.
6. Unpack all the files from the first archive to one directory - no directory structure (or several, as a dir with 3k files is not handsome).

Now we have one (or more) folder with the tif files numbered consecutively.

8. Load the tif files with FineReader in reasonable amounts.
9. OCR.
10. Export the files with the option "Name files as source images".
- You can work on the files here, or place them back:
11. Pack all the new rtf files to the rtf archive so that they overwrite the original files.
12. Unpack the rtf archive with the directory structure.
13. Rename all of them to 1.rtf, if you really want.

As for the technical drawings, it is hard to say without seeing them. I would still try rtf (Word), as it is possible in FR to recognize a graphic file with a text boxes on it (which can be processed with Trados). Just draw a text box ON a picture frame.

[Edited at 2005-09-09 14:07]


That's brilliant, jabberwock! I'll try that and post my results here after the weekend.


 
Jaroslaw Michalak
Poland
Member (2004)
English to Polish
SITE LOCALIZER
Update: Sep 9, 2005

Hmm... it was not as brilliant as I had thought...

I seem to remember that it was possible, but the zip program I am using now (ZipGenius) is not letting me update the files with the path intact (it adds another file to the archive). Maybe some other zip programs allow it.

However, just in case, I will describe a workaround.

First obtain the list of the rtf files within their structures. There are some programs that let you do just that (piping out of a dir command would not be practical...) Then you edit that list with search and replace functions so you get the result shown below.

Alternatively, you can sometimes export the list of files in the archive. ZipGenius exports the list of files as CSV, which is just perfect. After some manipulation, I get a table with rows like this (the paths may be relative or absolute, depending on how you zip the files):

|copy /y|0001.rtf|C:\the\original\path\
|copy /y|0002.rtf|c:\the\second\original\path\

Export it to a txt file with spaces as separators, rename the txt file to .bat and run it. The file may be too large to run in one go, but it is easy enough to split the .bat file beforehand.
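
The table rows above could also be turned into batch lines with a few lines of Python rather than search and replace. This is only a sketch under stated assumptions: the function names build_copy_commands and split_into_batches and the chunk size of 500 are made up for illustration, and the two rows simply repeat the example paths from the table. The quotes around the paths are an addition so that folder names containing spaces survive.

```python
def build_copy_commands(rows):
    """Turn (file name, target folder) pairs into 'copy /y' batch lines."""
    # Quote both paths so that names containing spaces survive.
    return [f'copy /y "{name}" "{folder}"' for name, folder in rows]


def split_into_batches(commands, chunk_size=500):
    """Split the command list into chunks, one chunk per .bat file."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        commands[start:start + chunk_size]
        for start in range(0, len(commands), chunk_size)
    ]


# The two example rows from the table (file name, original folder).
rows = [
    ("0001.rtf", "C:\\the\\original\\path\\"),
    ("0002.rtf", "c:\\the\\second\\original\\path\\"),
]
for line in build_copy_commands(rows):
    print(line)
```

Each chunk returned by split_into_batches can then be saved as its own .bat file, one command per line, which covers the case where a single file would be too large to run.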


 
Michael Hesselnberg (X)
French to German
There is a quite efficient program Sep 10, 2005

which is GEMINI.
You can download a 30-day demo version from http://www.iceni.com

It was recommended to me by ATRIL (DéjàVu) for converting PDF files into RTF files (to work with them in DéjàVu).

HTH
Michael


 
Hynek Palatin
Czech Republic
Member (2003)
English to Czech
Directory structure etc. Sep 10, 2005

Jan Sundström wrote:

I read about a program called Image2PDF OCR (Text Layer):

Among the specs are:
...retaining original image and text layout
- Create text-searchable PDF image + text format with compressed hidden text layer
- Make searchable PDF files with OCR technology


I'm not 100% sure, but I think Abbyy FineReader can do the same. I don't use it, but I localized part of it and I think I read about this feature.

Also, the new version has a tool called Automation Manager. It should be able to automatically process the files in subdirectories.

And as for opening several files from subdirectories (in Word, TagEditor, etc.), I often use drag&drop from Total Commander, which can display the contents of all subdirectories by pressing CTRL+B. I'm not saying it's the only and best solution, but it might help you.


 
Jan Sundström
Sweden
English to Swedish
TOPIC STARTER
This is what I did in the end... Sep 14, 2005

Hynek Palatin wrote:

Jan Sundström wrote:

I read about a program called Image2PDF OCR (Text Layer):

Among the specs are:
...retaining original image and text layout
- Create text-searchable PDF image + text format with compressed hidden text layer
- Make searchable PDF files with OCR technology


I'm not 100% sure, but I think Abbyy FineReader can do the same. I don't use it, but I localized part of it and I think I read about this feature.

Also, the new version has a tool called Automation Manager. It should be able to automatically process the files in subdirectories.

And as for opening several files from subdirectories (in Word, TagEditor, etc.), I often use drag&drop from Total Commander, which can display the contents of all subdirectories by pressing CTRL+B. I'm not saying it's the only and best solution, but it might help you.


I decided to go ahead and upgrade my FineReader 6.0 to 8.0 Professional. The Automation Manager is a nice tool, but there are still many functions I miss.

One drawback is that all output goes into one single directory, even though it can fetch multiple files from a folder tree.

That calls for a "rename - OCR - revert" operation as described above. I downloaded this time-limited shareware, which does just what I need:
http://lcen.com/default.asp?id=fren
I simply tack the parent folder names onto the file names. This way I can run everything through FineReader and still be able to identify the files afterwards.
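
For anyone who prefers a script to a renamer, the same idea (folder names packed into the file name, then unpacked again after OCR) might look roughly like this in Python. It is an illustration only and says nothing about how the shareware works internally; the separator "~" and the function names encode_name and decode_name are assumptions, and the separator must not occur in any folder or file name.

```python
from pathlib import PurePosixPath

SEPARATOR = "~"  # assumption: this character never occurs in a folder name


def encode_name(relative_path):
    """Pack a relative path into one file name: 'a/b/1.TIF' -> 'a~b~1.TIF'."""
    parts = PurePosixPath(relative_path).parts
    if any(SEPARATOR in part for part in parts):
        raise ValueError(f"{relative_path!r} already contains {SEPARATOR!r}")
    return SEPARATOR.join(parts)


def decode_name(flat_name, new_suffix=None):
    """Unpack a flat file name back into its relative path.

    new_suffix optionally swaps the extension, e.g. '.rtf' for OCR output.
    """
    path = PurePosixPath(*flat_name.split(SEPARATOR))
    if new_suffix is not None:
        path = path.with_suffix(new_suffix)
    return path
```

With a hypothetical folder layout, encode_name("manuals/pumps/1.TIF") gives "manuals~pumps~1.TIF", and decode_name("manuals~pumps~1.rtf") gives back the path manuals/pumps/1.rtf, which tells you where the OCR result belongs.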

Thanks a lot for pointing me in the right direction!

/Jan


 

