How to extract Chinese text from a scanned PDF document?
Thread poster: Yiting "Amy" Hsiao

Yiting "Amy" Hsiao  Identity Verified
United States
Local time: 18:55
English to Chinese
Jan 12, 2014

I received a scanned simplified Chinese document to translate into English. Does anyone know how to extract the simplified Chinese text from the PDF into Word, so that I can translate and edit it (and keep the original format)?

I've tried using Adobe's "save as Word" function, but only 20% of the Chinese characters were kept; the rest came out as garbled codes. I also tried Automator on the Mac, but it didn't work. I also tried submitting it to an online OCR converter, but the result came out as images in Word, which makes it uneditable.

Could any experienced translators give me some help here? Thank you so much!!!
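
To put a number on a result like the "only 20% kept" conversion described above, one option is to measure how much of the exported text is actually Chinese. The following is a minimal sketch, not a feature of any tool mentioned in this thread. The function name `cjk_ratio` is hypothetical, and it assumes the exported text has been pasted or read into a Python string. It only counts the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), so punctuation, digits, and rarer characters count as non-Chinese.

```python
def cjk_ratio(text: str) -> float:
    """Return the share of non-whitespace characters that are CJK ideographs."""
    # Ignore spaces, tabs, and line breaks so layout does not skew the result.
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # Assumption: only the basic CJK Unified Ideographs block is counted.
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


if __name__ == "__main__":
    sample = "中文 ab"
    print(f"{cjk_ratio(sample):.0%} of the characters are Chinese")
```

A low ratio on a document that should be almost entirely Chinese suggests the conversion failed and a proper OCR pass is needed.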


 

esperantisto  Identity Verified
Local time: 04:55
Member (2006)
English to Russian
OCR Jan 12, 2014

Use an OCR program such as ABBYY FineReader or OmniPage.

 

xxxxxLecraxx
Germany
Local time: 03:55
French to German
can't keep the original format Jan 12, 2014

Hello,

I don't think you'll be able to keep the original format. You'll have to format it later. The only procedure I can think of is the following:

1. copy the Chinese text in the PDF
2. paste into Windows Notepad ("Editor") to get a plain text file
3. copy the text in Notepad
4. paste into a Word file

Then you can edit the text, e.g. removing the paragraphs etc., before you start translating. The original format can be restored at the end.
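
The clean-up step mentioned above (removing unwanted breaks from the pasted plain text) can be scripted. This is a minimal sketch under stated assumptions, not part of any tool named in this thread: the function name `merge_wrapped_lines` is hypothetical, blank lines are assumed to mark real paragraph breaks, and lines are joined with no separator because Chinese does not put spaces between words. For Latin-script text you would join with a space instead.

```python
import re


def merge_wrapped_lines(text: str) -> str:
    """Join hard-wrapped lines; keep blank lines as paragraph breaks."""
    # Assumption: one or more blank lines separate real paragraphs.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    merged = []
    for para in paragraphs:
        # Drop stray whitespace at line ends, then glue the lines together.
        lines = [line.strip() for line in para.splitlines() if line.strip()]
        merged.append("".join(lines))
    return "\n\n".join(merged)


if __name__ == "__main__":
    raw = "第一段的\n第二行\n\n第二段"
    print(merge_wrapped_lines(raw))
```

The result can then be pasted into Word or a CAT tool as continuous paragraphs.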


 

Tony M  Identity Verified
France
Local time: 03:55
Member
French to English
Re-type Jan 12, 2014

I don't know anything about the special characteristics of Chinese script, but I somehow suspect it is likely to cause problems for any kind of OCR program — unless maybe one has been developed specially in China, say?

I'd have thought the quickest, cheapest, and simplest solution would have been to simply get the source text re-typed, and then do any formatting manually after translation.

Generally, OCR programmes do not reproduce formatting very well, inasmuch as their output is usually a 'fudge' to produce a facsimile of the formatting, which is a long way from being the same thing as actually reproducing the formatting! What they produce may look fine until you start to translate, and then it can turn out to be a total nightmare and, in my personal experience, waste you a great deal more time than if you had just started with plain, unformatted text.

Again, I don't know about Chinese, but in terms of Western 'Roman alphabet' languages, a good typist can reproduce the source text with only basic formatting more quickly and cheaply than I can do it myself. Of course, if it is not possible to preserve the original formatting, maybe in any case re-typing is unnecessary? Although I like having a source text I can translate by over-typing, in point of fact, it is rarely essential, and one can often translate direct from the PDF original — the time and money saved can then be put to better use attempting to recreate the original formatting.


 

Paulinho Fonseca  Identity Verified
Brazil
Local time: 22:55
Member (2011)
English to Portuguese
How to extract text from PDF...? Jan 12, 2014

I had the same experience with a client last year. I asked the company if they could provide me with the PDF or Word doc, and the answer was no.

I had to review my quotes, as the client wanted both clean and unclean files.
In the end I had to type out the whole doc and then translate it.




 

Flavio Granados  Identity Verified
Venezuela
Local time: 21:55
Member (2006)
English to Spanish
Abbyy Jan 12, 2014

It is almost perfect in my languages. I assume that for Chinese it is better yet. There is a trial version.
Also PlusTools from Wordfast.


 

Phil Hand  Identity Verified
China
Local time: 09:55
Chinese to English
Hanwang Jan 12, 2014

Abbyy is OK, but I find the formatting it does quite irritating.

I use a freebie called 汉王 (Hanwang), which allows you to copy the OCR results (the characters) as text and paste them into a Word file. I then translate the Word file in my CAT tool and reconstruct the format in the English document. Hanwang isn't super-accurate, but it's good enough.


 

Ricardy Ricot  Identity Verified
Local time: 21:55
French to English
Optical character recognition (OCR) Jan 13, 2014

You need to find good OCR software. Or create your own that focuses on Chinese characters.

And well, if OCR won't work, you are left with typing out the original text.

By the way, Marcel, the text on scanned documents can't be copied or pasted. As far as the computer is concerned, it is a picture. Only OCR software can sort it out.


 

xxxxxLecraxx
Germany
Local time: 03:55
French to German
yes Jan 13, 2014

Ricardy Ricot wrote:

By the way, Marcel, the text on scanned documents can't be copied or pasted. As far as the computer is concerned, it is a picture. Only OCR software can sort it out.


You're right. I skipped the word 'scanned', sorry. (:

But how could she even keep 20% of the characters when she tried to save the document as Word? If it was only a picture, it shouldn't be possible at all. Or does Adobe Acrobat Pro come with built-in OCR?


 

Ricardy Ricot  Identity Verified
Local time: 21:55
French to English
Hm Jan 13, 2014

To be frank, Marcel, I do not know. Could be. Because, if a document is scanned, it becomes a picture on the computer.

 

Lincoln Hui  Identity Verified
Hong Kong
Local time: 09:55
Member
Chinese to English
2014 Jan 13, 2014

OCR has been an integral function of Acrobat since 2008.

Heck, the driver suite that my printer comes with has OCR.

[Edited at 2014-01-13 16:45 GMT]


 

wenbuyi
Angola
online pdf text extractor Oct 28, 2015

You can try this free online PDF text extractor (http://www.online-code.net/pdf-to-word.html). It supports extracting Chinese text from PDF documents: just upload your PDF, and the tool extracts the content of all pages as text online.

 

