
Extracting Text from a PDF File
Thread poster: xxxCHENOUMI
xxxCHENOUMI  Identity Verified
English to French
+ ...
Jul 31, 2003

Hi!
What's the easiest way to extract the text from a PDF file?
Any tips will be much appreciated.

TIA,
Sandra



Valentina Pecchiar  Identity Verified
Italy
Member
English to Italian
+ ...
Acrobat or wordfast Jul 31, 2003

IF the text can be extracted at all (sometimes it's not real text but an image, like a picture of the words), Wordfast (even the free trial version) will do the job quite nicely. So will Acrobat (not *Reader*), but personally I have always obtained better results with WF (check out the relevant options available in Pandora's Box).
The quality and direct usability (without much reformatting) of the output depend on the formatting of the original text that has been pdf'ed. So sometimes there is just no quick way to retrieve the text. And sometimes you just have to cry.

HTH

PS Extraction (whatever the method) will lose the formatting of the text (less so if you enable OptimalPDF or something similar in Pandora's Box). If the formatting (or even time!) is an issue, you may be better off scanning the printed PDF page and running it through OCR.




[Edited at 2003-07-31 16:17]



RWSTranslation
Germany
Local time: 17:16
Member (2007)
German to English
+ ...
With Adobe Acrobat (Full version) Jul 31, 2003

Hello,

With the full version, you can save a PDF as RTF or as text (if it was distilled from a native file format).

If you have a scanned file (all the text is bitmap images), you can use OCR software. You may first have to convert your pages to TIFF format (Save As TIFF in the full version) if your OCR software cannot work with PDF files.

It may be easier to ask your client for the source files?!

Hans



Andrzej Lejman  Identity Verified
Local time: 17:16
German to Polish
+ ...
Select, copy and paste Jul 31, 2003

Dear Sandra,

This topic has come up in many threads. Look for earlier postings; it does not make much sense to discuss the same thing over and over.

Regards

Andrzej



GoodWords  Identity Verified
Mexico
Local time: 10:16
Spanish to English
+ ...
No easy way Jul 31, 2003

In Adobe Acrobat Reader, under "Edit", choose "Copy File to Clipboard" or "Select All". Then open a new word processing document and paste.

The rest is all a matter of formatting: eliminating hyphens within words and the paragraph marks that appear at the end of every line, and eliminating running headers and/or footers. If the document has footnotes, there is more work to do.
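
For the running headers and footers, one possible shortcut is to drop any line that repeats verbatim several times in the pasted text. This is only a rough sketch: the function name `drop_repeated_lines` and the threshold of three repeats are illustrative assumptions, and it will miss footers that change from page to page (page numbers, for instance).

```python
from collections import Counter

def drop_repeated_lines(text: str, min_repeats: int = 3) -> str:
    """Remove lines that occur at least `min_repeats` times (likely headers/footers)."""
    lines = text.split("\n")
    # Count non-blank lines, ignoring leading and trailing spaces.
    counts = Counter(line.strip() for line in lines if line.strip())
    repeated = {line for line, n in counts.items() if n >= min_repeats}
    # Keep blank lines and every line that is not a frequent repeat.
    return "\n".join(line for line in lines if line.strip() not in repeated)
```

Check the result by eye afterwards, since a legitimately repeated line (a recurring subheading, say) would be removed as well.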

If you introduce an extra line break between every paragraph, list element and before and after every title (if it's not there already), then you can eliminate all the superfluous line breaks in three easy steps.
1. Change all double line breaks to a unique string (###, say).
2. Change all single line breaks to a single space.
3. Change all instances of your unique string back to a double line break.
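
If you would rather script it than use find & replace, the three steps above can be sketched in Python. The function name `unwrap_lines` is made up for illustration, the marker defaults to the ### suggested above, and the sketch assumes the text already has a blank line between paragraphs. It treats any run of two or more line breaks as a paragraph break, which is slightly more lenient than step 1.

```python
import re

def unwrap_lines(text: str, marker: str = "###") -> str:
    """Join hard-wrapped lines while keeping blank-line paragraph breaks."""
    if marker in text:
        raise ValueError("marker already occurs in the text; pick another one")
    # Normalize Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Step 1: replace each run of two or more line breaks with the marker.
    text = re.sub(r"\n{2,}", marker, text)
    # Step 2: replace every remaining single line break with a space.
    text = text.replace("\n", " ")
    # Step 3: turn the marker back into a double line break.
    return text.replace(marker, "\n\n")
```

For example, `unwrap_lines("First line\nsecond line.\n\nNew para\ncontinues.")` gives two clean paragraphs separated by a blank line.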

[Edited at 2003-07-31 15:33]



Valentina Pecchiar  Identity Verified
Italy
Member
English to Italian
+ ...
Sorry, my post did double up... Jul 31, 2003

...so I deleted the content of the second one!

[Edited at 2003-07-31 16:18]



Carlos Moreno  Identity Verified
Colombia
Local time: 10:16
English to Spanish
+ ...
Adobe Acrobat Reader 6.0 Jul 31, 2003

The latest version of the free reader from Adobe, which has changed its name from Adobe Acrobat Reader to Adobe Reader 6.0, can help you.
This program can read PDF documents as well as e-books, since it now incorporates the Adobe E-book reader, which used to be a separate program. It can also help you make PDFs for free, and even read books aloud!
If the file you need is text, not an image of text, and document content extraction is allowed (you can check this by clicking on the little arrow above the scroll bar), you can simply click "File - Save as Text" or "Edit - Copy to Clipboard".
For now, the Reader is available only in English (15 MB). The versions in other languages that appear on the download page are the older Acrobat 5.1.
Download address is
http://www.adobe.com/products/acrobat/readstep2.html
To be clear, I do not work for Adobe or any of its subsidiaries.
And enjoy your work, as I do mine!



Alexander Chisholm  Identity Verified
Local time: 17:16
Italian to English
+ ...
Some OCR programs Jul 31, 2003

allow you to do this (one may have come bundled with your scanner). I like OmniPage, but FineReader 6 is also OK.

Nigel Skipper
Local time: 17:16
Swedish to English
Freeware PDF 995 Jul 31, 2003

If you use a PC, this freeware is a very useful item to have. It allows you to create PDFs from inside existing applications, and the edit version, PDFedit 995, allows you to extract text from an existing PDF to a Word or text file.
You can download it free of charge from www.pdf995.com

//Regards,

Nigel


xxxLia Fail  Identity Verified
Spain
Local time: 17:16
Spanish to English
+ ...
Various methods Aug 1, 2003

I have recently been downloading PDF material and, since I have limited funds, have tried a couple of options.

1. If the PDF viewer's toolbar has a little T on it, click on this; it lets you select the text, either all at once (scrolling down) or page by page, and then copy it.

2. If this doesn't work, there is an online conversion facility. See this site: http://www.adobe.com/products/acrobat/access_email.html

You may want to clean up the text and save it as Word, although you will lose a lot of the features of the original format.
Use the find & replace function to remove unwanted paragraph breaks and white spaces.

Finally, watch out for headings, tables, boxes and similar inserts, as these are relocated arbitrarily in the copied version. You will need to check that no sentences or words are cut off. Sometimes excess hyphenation also appears, which you will have to correct as well.
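
Part of the hyphenation clean-up can be automated when the hyphen sits at the very end of a line. The following is a minimal sketch, not a complete solution: the function name `join_hyphenated` is made up for illustration, and it assumes the line breaks of the pasted text are still in place (so run it before removing them).

```python
import re

def join_hyphenated(text: str) -> str:
    """Rejoin words that were split by a hyphen at the end of a line."""
    # Match "-" plus a line break only when a letter stands on both sides,
    # so ranges such as "3-\n4" and spaced dashes are left untouched.
    return re.sub(r"(?<=[^\W\d_])-\n(?=[^\W\d_])", "", text)
```

Genuine compounds that happened to break at the hyphen (a "well-" at the end of one line and "known" on the next) will lose their hyphen too, so a manual check is still needed.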


xxxCHENOUMI  Identity Verified
English to French
+ ...
TOPIC STARTER
Thank you All! Aug 3, 2003

Thanks to each one of you for your time and tips!
I'm used to using the Editing tool from Acrobat, cutting and pasting the text, then reconverting it into PDF format.

To Muja, DSC, GoodWords, Carlos, Alexander, Nigel, Ailish, I'll certainly put your advice and recommendations to good use, next time.
Alexander: I have used the OCR option not for PDF files but in WORD.

Since I rarely receive requests for extracting PDF files, I'm faced with the decision whether or not to purchase the whole Adobe Acrobat program now.

Have a nice week,
and
Thank you again!

S.:)

[Edited at 2003-08-03 21:12]

