OCR-ing graphics embedded in Word?
Thread poster: pj-ffm

pj-ffm
Local time: 18:09
German to English
Jun 13, 2011

Hi all,

does anyone know of a product (or if it is at all possible) for extracting the text from graphics directly within Word?

Situation:
The document to be translated contains a large number of screenshots which need to be translated. I obviously don't want to have to type a glossary of source-target words by hand for each...

Due to the large number of graphics (several hundred) I also don't want to have to save/name/OCR each one separately. Managing that would be a nightmare.

Ideally I would be able to select the graphic in question, start a macro/application and end up with the text in another window, or inserted as a text block below the graphic in Word.

Any chance of this?

cheers,
Peter.


 

Jorge Payan  Identity Verified
Colombia
Local time: 11:09
Member (2002)
German to Spanish
+ ...
CodeZapper might do the job Jun 14, 2011

Among other things, it lets you extract the images from a Word file, without any text; you can then use FineReader to OCR the extracted graphics-only file.

You can get CodeZapper here: http://asap-traduction.com/CodeZapper

saludos


 

pj-ffm
Local time: 18:09
German to English
TOPIC STARTER
Will try out a copy Jun 14, 2011

Hi Jorge,

thanks for the suggestion. It seems like a useful tool to have generally so I've requested a copy from the site you linked to.

I'm guessing I'll have to brush up my VBA skills and try and write a macro for kicking off the OCR, grabbing the graphic, naming etc...

cheers,
Peter.


 

István Hirsch  Identity Verified
Local time: 18:09
English to Hungarian
... or try this... Jun 14, 2011

I think it is also possible to convert the whole Word file (as is) into a PDF file (with a free PDF maker or Adobe), OCR this PDF file, and translate the OCR output in Word.

 

pj-ffm
Local time: 18:09
German to English
TOPIC STARTER
Still no luck finding auto-graphic-grab-ocr-from-Word-macro, but... Jun 14, 2011

Well, I've had a helpful email exchange with Dave (CodeZapper); however, it seems it won't really help me with what I want to do.

He also mentioned the possibility of using a dictation solution, but I'm not so sure about how I'd integrate that into my workflow.

Another tip he gave me was a script someone has written here:
http://www.autohotkey.com/forum/topic11186.html
The description says it takes a screen grab, calls an external OCR and generates a text file, pretty close to what I want conceptually and I'll try and investigate when I have time.
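
The flow described there (image in, external OCR call, text file out) could be sketched in Python roughly as follows. This is only an illustration, not the AutoHotkey script itself: it assumes the open-source Tesseract command-line tool is installed and on the PATH, and the function names and the `deu` language code are hypothetical choices for this sketch. The OCR step is passed in as a function so that another engine could be swapped in.

```python
import subprocess
from pathlib import Path


def tesseract_ocr(image_path, lang="deu"):
    """Run the Tesseract CLI on one image and return the recognised text.

    Assumption: the `tesseract` executable is installed and on the PATH.
    """
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout", "-l", lang],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def ocr_to_text_file(image_path, ocr=tesseract_ocr):
    """OCR one image and write the text next to it as <name>.txt."""
    image_path = Path(image_path)
    text = ocr(image_path)
    # Same folder and base name as the image, so the pair stays together.
    out_path = image_path.with_suffix(".txt")
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

Called in a loop over a folder of screenshots, this would leave one `.txt` file beside each image, which at least removes the name/save/open steps for every graphic.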

István, thanks for your input. It would be an idea if the document were nearly all graphics, but in this case it would present an even greater challenge in reformatting the document afterwards (there are lots of awkward tables etc. in addition to the graphics...)

cheers,
Pete.

p.s. what do other translators do when a significant part of a large document is graphics?
I mean, if it's a few graphics I just retype the source and target and charge a supplement, but it makes for horrible workflow and the CAT doesn't really benefit...


 

Natalie  Identity Verified
Poland
Local time: 18:09
Member (2002)
English to Russian
+ ...

Moderator of this forum
Hi Pete Jun 14, 2011

I, for example, own Finereader and use it to perform OCR of images (it gives perfect results). I have never used any images embedded into Word files. Aren't you able to obtain the images in their native format?

Please also take a look at http://www.abbyy.com/screenshot_reader/ - maybe this would be what you need. However, I doubt you would be able to use the embedded images with it.

Natalia


 

Michael Beijer  Identity Verified
United Kingdom
Local time: 17:09
Member (2009)
Dutch to English
+ ...
.doc -> .docx -> .zip Jun 14, 2011

one way of getting at just the images is:

re-save the Word .doc as a .docx, then rename it to a .zip

this will convert it into a zip folder containing all of the images in your Word document


 

Natalie  Identity Verified
Poland
Local time: 18:09
Member (2002)
English to Russian
+ ...

Moderator of this forum
??? Jun 14, 2011

Michael J.W. Beijer wrote:

one way of getting at just the images is:

re-save the Word .doc as a .docx, then rename it to a .zip

this will convert it into a zip folder containing all of the images in your Word document




1) you cannot make a ZIP file by renaming anything
2) if the images are embedded they are part of the doc file
3) what would you expect from a ZIP? It is just an archive


 

pj-ffm
Local time: 18:09
German to English
TOPIC STARTER
The OCR software is not really the issue Jun 14, 2011

Hi Natalie,

Thanks for your suggestion. The OCR software is not really the issue here (I've heard good things about Finereader too), it's more a workflow issue.

The images are embedded in the document, not linked to, i.e. I can click on an image and copy it into the clipboard and paste into a graphics program, but I don't have access to the originals.
Unfortunately, some of them are literally just a few lines in height, so rather than a manageable number of large screen-shots there are hundreds of little ones... D'oh!

- Obviously I want to avoid having to re-type each source word by reading each graphic and then translating (time consuming and error prone)

- I also don't want to have to go through the document and select each graphic-copy-paste into empty file-name/save-open OCR s/w-convert to text-save result file-open in Word-translate-reimport into original doc. (Even if I can save a few steps by cutting/pasting rather than saving as a file each time, it's still an awfully time-consuming process.)

Michael, I'm also not quite sure I understand your approach. If I could get the images out in one go, it might help, but it would still represent a pretty nasty workflow (how would they be named and how would I locate each in the original document to re-insert the translated text after OCRing..?)

Hmm... I guess my utopian idea of a macro for doing this is not so straightforward...

cheers,
Peter.


 

Michael Beijer  Identity Verified
United Kingdom
Local time: 17:09
Member (2009)
Dutch to English
+ ...
let me clarify Jun 14, 2011

I was only trying to point out a way of accessing all of the images embedded in a Word document in a simple way. It really does work*, just try it.

In your specific case however, if the document is decently formatted, you could try:

saving it as a PDF from within Word,
and then import it into ABBYY,
then save it back out to a Word doc, ...

ABBYY will now use OCR on the images inside the document.

Michael

*
"With Word 2007, Microsoft introduced the XML-based .docx file format. The new format is essentially a ZIP container, which contains a series of XML files and any embedded images. To access the embedded images in a .docx file, use the following steps:
If it's not already a .docx file, Open the file in Word 2007 and save the file as a Word Document (*.docx).
Change the file extension on the original file from .docx to .zip."

(http://www.techrepublic.com/photos/save-images-in-microsoft-word-documents-as-separate-files/206113?seq=4)
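
Since a .docx is already a ZIP container, the renaming step can be skipped if you script it. A minimal sketch in Python, using only the standard library; the function name and the output folder are hypothetical, and it assumes the images sit under `word/media/` as described in the quote above:

```python
import zipfile
from pathlib import Path


def extract_docx_images(docx_path, out_dir):
    """Copy every file under word/media/ in a .docx to out_dir.

    A .docx is a ZIP archive, so zipfile can read it without renaming.
    Returns the list of written paths in archive order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Skip everything except the embedded media files.
            if name.startswith("word/media/") and not name.endswith("/"):
                target = out_dir / Path(name).name
                target.write_bytes(archive.read(name))
                written.append(target)
    return written
```

For example, `extract_docx_images("manual.docx", "manual_images")` (both names made up) would put all embedded images into one folder in a single pass.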


[Edited at 2011-06-14 12:52 GMT]


 

Peter Linton  Identity Verified
Local time: 17:09
Swedish to English
+ ...
Tell the customer Jun 14, 2011

Tell the customer about the problem, explain that you are a translator, not a graphics specialist, and would they please send you an editable file.

They may not like it, but they have not fulfilled their side of the bargain. Time to educate the client as diplomatically as possible.


 

pj-ffm
Local time: 18:09
German to English
TOPIC STARTER
Re-educating customers... Jun 14, 2011

Peter Linton wrote:

Tell the customer about the problem, explain that you are a translator, not a graphics specialist, and would they please send you an editable file.

They may not like it, but they have not fulfilled their side of the bargain. Time to educate the client as diplomatically as possible.


Indeed, I have explained that I would charge 50% more or by the hour for doing the graphics.

If it weren't for the fact that I have translated many other documents for this project and have built up a useful TM, I would have refused the job...

I just thought that now, when faced with hundreds of them, would be a good time to look for an efficient and consistent way of dealing with graphics in my workflow.

cheers,
Peter.

p.s. I tried Michael's suggestion about saving as .docx and renaming to .zip, and indeed, in the "\word\media" folder there are all the images (all 292 of them!) saved as "image###.png". I'd have to think how I could make an efficient workflow from here though...
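
One possible starting point for the "which image goes where" problem: the same ZIP container also records where each image is used. A rough Python sketch, under the assumptions that the main text lives in `word/document.xml`, that pictures are referenced with an `r:embed="rId..."` attribute (the prefix Word normally writes), and that `word/_rels/document.xml.rels` maps those ids to the files in `media`. The function name is made up, and older-style pictures or images in headers and footers would not be picked up by this.

```python
import posixpath
import re
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the relationships part of an Office Open XML package.
PKG_REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def images_in_document_order(docx_path):
    """Return media file names in the order they are referenced in the text."""
    with zipfile.ZipFile(docx_path) as archive:
        rels_xml = archive.read("word/_rels/document.xml.rels")
        doc_xml = archive.read("word/document.xml").decode("utf-8")

    # Map relationship ids (e.g. "rId7") to targets (e.g. "media/image3.png").
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in ET.fromstring(rels_xml).iter(PKG_REL_NS + "Relationship")
    }

    # Walk the picture references in reading order and look up each file.
    ordered = []
    for rel_id in re.findall(r'r:embed="([^"]+)"', doc_xml):
        target = targets.get(rel_id)
        if target:
            ordered.append(posixpath.basename(target))
    return ordered
```

The resulting list gives the n-th graphic in the document its file name, so OCR output could be lined up with the graphics in reading order instead of by file name.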


 

István Hirsch  Identity Verified
Local time: 18:09
English to Hungarian
This works for my sample file Jun 15, 2011

1. If there are tab characters in the text, temporarily Replace All of them with a placeholder that does not occur in the text, for example #.
2. Replace ^g with ^t^& (that is: insert a tab in front of each graphic element).
3. Select All.
4. Go to Table/Convert, select Text to Table, and choose: Number of columns: 2, Cell separator: tab (to push the graphic elements into a 2nd column).
5. Now you have all the graphic elements in the 2nd column. Take this column to OCR (keeping its column structure), then replace it with the OCR-ed column. Now you have a table with 2 columns (of course, there can be embedded tables in the first column).
6. Select the table, go to Table/Convert, select Table to Text, choose paragraph mark as the cell separator and uncheck „Embedded tables…” (to keep the embedded tables untouched). This restores the original layout.
7. Replace All # with a tab (to restore the original tabs).


 

pj-ffm
Local time: 18:09
German to English
TOPIC STARTER
Sounds interesting... Jun 15, 2011

István Hirsch wrote:


1. If there are tab characters in the text, temporarily Replace All of them with a placeholder that does not occur in the text, for example #.
2. Replace ^g with ^t^& (that is: insert a tab in front of each graphic element).
3. Select All.
4. Go to Table/Convert, select Text to Table, and choose: Number of columns: 2, Cell separator: tab (to push the graphic elements into a 2nd column).
5. Now you have all the graphic elements in the 2nd column. Take this column to OCR (keeping its column structure), then replace it with the OCR-ed column. Now you have a table with 2 columns (of course, there can be embedded tables in the first column).
6. Select the table, go to Table/Convert, select Table to Text, choose paragraph mark as the cell separator and uncheck „Embedded tables…” (to keep the embedded tables untouched). This restores the original layout.
7. Replace All # with a tab (to restore the original tabs).


Hi István and thanks for the suggestion!

So if I understand correctly, the entire document will be put into a new, all-encompassing giant table with two columns: the second will contain just the graphics, and the first will contain every other document element (text, tables, text boxes, TOCs, links, etc.).

- Just the second column is copy/pasted into the OCR?

- The post-OCR result will be a single column with the text from the graphics in a one-column Word-compatible table

- This column is then pasted over the "graphics" column 2 in the doc? (ideally it needs to be aggregated, so that the text is below, or in some other way, associated with the corresponding graphic, but I guess there could be an enhanced solution involving further cunning search/replace steps...)

If it doesn't mess with the formatting, internal refs etc. and plays nice with Wordfast's segmentation it looks like a big step in the right direction!

I will give it a try when I have a moment and see what it does to the formatting in the rest of the document.

cheers,
Pete.


 

István Hirsch  Identity Verified
Local time: 18:09
English to Hungarian
That's it Jun 15, 2011

Absolutely correct. That is what I suggested and tried out on a sample file which was a mixture of some sentences, a 3 x 3 table and 3 embedded pictures. First, the pictures went one step right into a second column to be OCR-ed as a batch. Then the 2nd column was deleted, and the OCR-ed column was inserted. Then the OCR-ed cells went one step left into their original position.
Of course, this file is far from the complexity of a „real” file, so preliminary trials with a file similar to mine are suggested.


 