help needed on converting PDF to Word format
Thread poster: Luke Mersh

Luke Mersh  Identity Verified
United Kingdom
Local time: 23:21
Spanish to English
Aug 18, 2015

Dear colleagues.
I have been sent some PDFs that are really scanned documents saved as PDFs. My problem is that when I convert them to Word format they are still images, so I am unable to do a word count without re-typing the PDF into a Word document.

Can anybody tell me if there is a way to convert these image-type PDFs into a Word document without re-typing the whole document, so that I can get a word count?

Many thanks



Kevin Dias
Local time: 08:21
SITE STAFF
OCR Aug 19, 2015

Hi Luke,

The technology you are looking for is called OCR (optical character recognition). It will convert scanned PDFs or images to text format. The quality of the extraction will depend both on the tool you use and the quality of the PDF (is the scan clear, is it straight, is it high enough resolution, etc.).
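Since the end goal here is a word count, the counting step can be sketched once OCR has produced plain text. This is only a minimal illustration in Python: the function name `count_words` and the sample string are made up for the example, the hyphen rejoining is a rough heuristic, and the result will not necessarily match the figure Word or a CAT tool reports.

```python
import re


def count_words(ocr_text: str) -> int:
    """Count whitespace-separated tokens in OCR output (rough estimate)."""
    # Heuristic: rejoin words that the scan split across lines with a hyphen.
    # This can also merge genuinely hyphenated words that happen to wrap.
    joined = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", ocr_text)
    # Count only tokens containing at least one letter, digit or underscore,
    # so stray OCR specks such as "|" or "~" are ignored.
    return sum(1 for token in joined.split() if re.search(r"\w", token))


# Hypothetical OCR output, not taken from any real document.
sample = "Paciente con ritmo sinusal nor-\nmal. Frecuencia: 72 lpm | ~"
print(count_words(sample))
```

Numbers are counted as words here, which is how many word counters behave; adjust the filter if your client counts differently.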

The most popular tool for this is probably made by the company Abbyy. There are also a few free options from other companies.

Kevin



Luke Mersh  Identity Verified
United Kingdom
Local time: 23:21
Spanish to English
TOPIC STARTER
OCR Aug 19, 2015

Thank you.

My only concern is that some of the PDFs are ECG graphs with printed results on them.
regards



Chris P.
United Kingdom
Local time: 23:21
German to English
Another two options to play with... Aug 19, 2015

Hello Luke,

Kevin was entirely correct in his previous comment about optical recognition being the only applicable solution to your original question.

Two options you might like to check out (via a web search):

Nuance PDF
Nitro PDF

Whether your ECGs will display correctly is entirely debatable, but these have been the best two performers I'm aware of to date.

And whether you're prepared to upload potentially sensitive docs for conversion remains another matter entirely. That said, trial versions for download are also available for (more discreet) local conversion offline. NDA items should never, ever be trusted to cloud solutions (not to mention TMs, termbases or any other material you'd consider sensitive or private).

On that last note, applicable to all translators generally, never underestimate that cloud=share (you're no longer in control of the information uploaded), period, full stop.

By the same simple equation, convenience=trade-off.

Best of British,
Chris


esperantisto  Identity Verified
Local time: 02:21
Member (2006)
English to Russian
+ ...
How? Aug 19, 2015

Luke Mersh wrote:

when I convert them to Word format they are still like images


And how do you actually perform the conversion?



neilmac  Identity Verified
Spain
Local time: 00:21
Spanish to English
+ ...
Nitro etc Aug 19, 2015

Nitro or OmniPage are the best conversion programs I've found - I find Nitro easier and less complicated to use. However, if they are scanned PDFs the results might never be optimal.
If the texts are short, I might prefer to retype/recreate the texts in Word. Or explain to the client that the format is causing problems and you will need to charge extra, or extend the agreed deadline, unless they can provide you with the text in a more workable format...

http://www.nuance.com/for-individuals/by-product/omnipage/index.htm

[Edited at 2015-08-19 09:32 GMT]



Paula D  Identity Verified
United Kingdom
Local time: 23:21
Member (2013)
Turkish to English
OCR software and different alphabets Aug 19, 2015

In my experience, some are better than others at recognising different alphabets, so you need to try the software on your particular language. They can probably all do English OK, but in my translation language (Turkish) OmniPage is the best one I have found for recognising the characters of the Turkish alphabet.


Andrzej Mierzejewski  Identity Verified
Poland
Local time: 00:21
Polish to English
+ ...
OCR Aug 19, 2015

For the ES-EN language pair, any commercially available OCR software should be OK. AFAIK the prices for a single computer licence are approx. EUR 100 - 150. With that price and with a capacity of approx. 7 pages a minute, the return on investment is very quick. The OCR quality is also very high but it still depends on the image quality.

I've recently worked on a PDF file with English text scanned from 34 paper pages. The character count was 53,500. The spellcheck found no more than 30 errors.
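To put those figures in perspective, here is a back-of-the-envelope calculation in Python. It uses only the numbers quoted in this post; the variable names are just for illustration, and errors flagged by a spellchecker are not the same as all OCR errors (a misread that still forms a valid word would not be caught).

```python
pages = 34               # paper pages in the scanned PDF
chars = 53_500           # character count of the recognised text
spellcheck_errors = 30   # upper bound reported by the spellcheck
pages_per_minute = 7     # approximate OCR throughput quoted above

scan_minutes = pages / pages_per_minute
error_rate = spellcheck_errors / chars
chars_per_error = chars / spellcheck_errors

print(f"OCR time: about {scan_minutes:.1f} minutes")
print(f"Flagged errors: at most {error_rate:.3%} of characters")
print(f"Roughly one flagged error per {chars_per_error:.0f} characters")
```

That works out to under five minutes of processing and, at most, about one flagged error per 1,800 characters.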

Of course, for more complex page layouts (tables etc.), more manual work is still needed in order to give the output DOC file the 'as-original' look.

Let me repeat esperantisto's question:

Luke, what do you mean by: "my problem is that when I convert them to Word format they are still like images, so I am unable to do a word count without re-typing the PDF into a word document."?

That's not an image-to-text conversion, for sure.

Best regards

AM



Rachel Waddington  Identity Verified
United Kingdom
Local time: 23:21
Member (2014)
Dutch to English
+ ...
Do you really need to convert the files? Aug 19, 2015

In this situation (assuming you don't manage to get a good result with the OCR software) I would just quote a rate based upon the target language word count and type the translation into a Word file. Agencies have always been happy with this approach.


Chris P.
United Kingdom
Local time: 23:21
German to English
Very true... Aug 19, 2015

Very true for agency work.

But direct clients can be charged a premium for providing a fully translated "clone" of the original PDF, complete with all charts, diagrams, images etc. perfectly formatted as in the original document - albeit in the docx format that these conversion softwares tend to export.



Rafael Harriet
Spain
Local time: 00:21
German to Spanish
+ ...
One more option Aug 20, 2015

I usually work with ABBYY FineReader and the quality is very good.

My two cents. Good luck!!



Luke Mersh  Identity Verified
United Kingdom
Local time: 23:21
Spanish to English
TOPIC STARTER
Abbyy finereader Aug 20, 2015

After reading your posts, and having already attended a webinar on OCR, I have decided to use the trial of ABBYY FineReader, which seems to do a good job.


Direct link Reply with quote
 

Andrzej Mierzejewski  Identity Verified
Poland
Local time: 00:21
Polish to English
+ ...
just quote a rate based upon the target language...? Aug 20, 2015

Rachel Waddington wrote:

In this situation (assuming you don't manage to get a good result with the OCR software) I would just quote a rate based upon the target language word count and type the translation into a Word file. Agencies have always been happy with this approach.


So, do you think an agency would wait until the job is done, and only then the translator payment would be calculated, and the customer would know how much they should pay?

Well, in my country the agencies normally know the job size, whether in words or characters, when they ask translators for availability. A good agency is expected to have and use a reliable OCR software just in order to tell their customer the price.

It's very unusual for an agency not to know the job size, as the work time and all invoices depend on it. It can happen when all the staff are young and inexperienced - but just once, and not again.

Let's not allow ourselves to do the agencies' job! Remember that they take a significant share of what the customers pay.

Regards

AM

[Edited at 2015-08-20 09:12 GMT]



Adrien Esparron
Local time: 00:21
Member (2007)
German to French
+ ...
Strange Aug 20, 2015

Andrzej Mierzejewski wrote:

A good agency is expected to have and use a reliable OCR software just in order to tell their customer the price.



So "a good agency" should be able to send the translator an editable text, or?

If not, it's not "a good agency". And it's the case.

[Edited at 2015-08-20 09:27 GMT]



Rachel Waddington  Identity Verified
United Kingdom
Local time: 23:21
Member (2014)
Dutch to English
+ ...
Yes Aug 20, 2015

Andrzej Mierzejewski wrote:

Rachel Waddington wrote:

In this situation (assuming you don't manage to get a good result with the OCR software) I would just quote a rate based upon the target language word count and type the translation into a Word file. Agencies have always been happy with this approach.


So, do you think an agency would wait until the job is done, and only then the translator payment would be calculated, and the customer would know how much they should pay?

Well, in my country the agencies normally know the job size, whether in words or characters, when they ask translators for availability. A good agency is expected to have and use a reliable OCR software just in order to tell their customer the price.

It's very unusual for an agency not to know the job size as the work time and all invoices are dependent thereon. Can happen when all the staff is young and unexperienced - but just once, and not again.

Let's not allow ourselves to do the agencies' job! Remember that they take a significant share of what the customers pay.

Regards

AM

[Edited at 2015-08-20 09:12 GMT]


Yes, in cases where the agency cannot provide an editable text I would always propose invoicing based on the target text and this has never been a problem. It's becoming less common nowadays, but still happens occasionally. In any case I would regard it as the agency's job to do the OCRing, not mine. Direct clients are a different thing, obviously.

