Extracting Text from a PDF File
Thread poster: CHENOUMI (X)
English to French
Jul 31, 2003

Hi!
What's the easiest way to extract a PDF file?
Any tips will be much appreciated.

TIA,
Sandra


 
Valentina Pecchiar
Italy
English to Italian
Acrobat or Wordfast Jul 31, 2003

If the text can be extracted at all (sometimes it's not real text but an image, like a picture of the words), Wordfast (even the free trial version) will do the job quite nicely. So will Acrobat (not *Reader*), but personally I have always obtained better results with WF (check out the relevant options available in Pandora's Box).
The quality and direct usability (without much reformatting) of the output depend on the formatting of the original text that was converted to PDF. So sometimes there are just no quick ways to retrieve the text. And sometimes all you can do is cry ("hay que llorar").

HTH

PS Extraction (whatever the method) will lose the formatting of the text (less so if you enable OptimalPDF or something similar in Pandora's Box). If the formatting (or even time!) is an issue, you may be better off printing the PDF pages, scanning them and running them through OCR.


CHENOUMI wrote:

Hi!
What's the easiest way to extract a PDF file?
Any tips will be much appreciated.

TIA,
Sandra


[Edited at 2003-07-31 16:17]


 
RWSTranslation
Germany
German to English
With Adobe Acrobat (Full version) Jul 31, 2003

Hello,

With the full version, you can save a PDF as RTF or as text (if it was distilled from a native file format).

If you have a scanned file (all the text is bitmap images), you can use OCR software. You may have to convert your pages to TIFF format first (save as TIFF in the full version) if your OCR software cannot work with PDF files.

Maybe it is easier to ask your client for the source files?

Hans


 
Andrzej Lejman
Poland
German to Polish
Select, copy and paste Jul 31, 2003

Dear Sandra,

There are already a lot of threads on this topic. Look for earlier postings; it does not make much sense to keep discussing the same thing over and over.

Regards

Andrzej


 
Margaret Schroeder
Mexico
Spanish to English
No easy way Jul 31, 2003

In Adobe Acrobat Reader, under "Edit", choose "Copy File to Clipboard" or "Select All". Open a new word processing document and paste.

The rest is all a matter of formatting--eliminating hyphens within words, and paragraph marks that appear at the end of every line. Eliminating running headers and/or footers. If the document has footnotes, there is more work to do.

If you introduce an extra line break between every paragraph and list element, and before and after every title (if it's not there already), then you can eliminate all the superfluous line breaks in three easy steps:
1. Change all double line breaks to a unique string (###, say).
2. Change all single line breaks to a single space.
3. Change all instances of your unique string back to a double line break.
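
For anyone who would rather script the three replacements than run them by hand in a word processor, here is a minimal Python sketch of the same idea. The function name `unwrap_paragraphs` is just made up for illustration, and it assumes the paragraphs are already separated by a blank line and that the `###` placeholder does not occur anywhere in the text.

```python
def unwrap_paragraphs(text: str, marker: str = "###") -> str:
    """Join hard-wrapped lines while keeping blank-line paragraph breaks."""
    if marker in text:
        # The placeholder must be unique, otherwise step 3 would corrupt the text.
        raise ValueError("placeholder already occurs in the text; pick another")
    # Normalise Windows/old Mac line endings so only "\n" is left.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Step 1: protect paragraph breaks (double line breaks) with the placeholder.
    text = text.replace("\n\n", marker)
    # Step 2: every remaining single line break becomes a single space.
    text = text.replace("\n", " ")
    # Step 3: turn the placeholder back into a double line break.
    return text.replace(marker, "\n\n")


sample = "This line was\nwrapped by the PDF.\n\nSecond paragraph\nhere."
print(unwrap_paragraphs(sample))
```

Hyphens inside words, running headers and footnotes are not handled here; those still need to be cleaned up separately.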

[Edited at 2003-07-31 15:33]


 
Valentina Pecchiar
Italy
English to Italian
Sorry, my post did double up... Jul 31, 2003

...so I deleted the content of the second one!

[Edited at 2003-07-31 16:18]


 
Carlos Moreno
Colombia
English to Spanish
Adobe Acrobat Reader 6.0 Jul 31, 2003

The latest version of Adobe's free reader, which has been renamed from Adobe Acrobat Reader to Adobe Reader 6.0, can help you.
This program can read PDF documents as well as e-books, since it now incorporates the Adobe E-book reader, which used to be a separate program. It can also help you make PDFs for free, and even read books aloud!
If the file you need is text, not an image of text, and document content extraction is allowed (you can check this by clicking on the little arrow above the scroll bar), you can simply click "File - Save as Text" or "Edit - Copy to Clipboard".
For now, the Reader is available only in English (15 MB). The versions in other languages listed on the download page are the older Acrobat 5.1.
The download address is
http://www.adobe.com/products/acrobat/readstep2.html
To be clear, I do not work for Adobe or any of its subsidiaries.
And enjoy your work, as I do mine!


 
achisholm
United Kingdom
Italian to English
Some OCR programs Jul 31, 2003

allow you to do this (one may have come bundled with your scanner). I like OmniPage, but FineReader 6 is also OK.

 
Nigel Skipper (X)
Swedish to English
Freeware PDF 995 Jul 31, 2003

If you use a PC, this freeware is very useful to have. It allows you to create PDFs from inside existing applications, and the editing version, PDFedit 995, allows you to extract text from an existing PDF to a Word or text file.
You can download it free of charge from www.pdf995.com

//Regards,

Nigel


 
Lia Fail (X)
Spain
Spanish to English
Various methods Aug 1, 2003

I have recently been downloading PDF material and, since I have limited funds, have tried a couple of options.

1. If the PDF viewer's toolbar has a little T icon, click on it. This lets you select the text, either all at once (scrolling down) or page by page, and then copy it.

2. If this doesn't work, there is an online conversion facility. See this site: http://www.adobe.com/products/acrobat/access_email.html

You may want to clean up the text and save it as Word, although you will lose a lot of the features of the original format.
Use the find & replace function to remove unwanted paragraph breaks and white spaces.

Finally, watch out for headings, tables, boxes and similar inserts as these are relocated arbitrarily in the copied version. You will need to check that no sentences or words are cut off. And sometimes excess hyphenation may appear, which you will also have to correct manually.
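
If there is a lot of text, the find & replace clean-up can also be scripted. The following is only a rough Python sketch under some assumptions: the pasted text has been saved as plain text, the name `tidy_pasted_text` is made up for illustration, and any hyphen at the very end of a line is treated as a line-break hyphen. That last assumption will wrongly join genuine compounds (such as "well-known") that happen to be split across lines, so the result still needs a manual check.

```python
import re


def tidy_pasted_text(text: str) -> str:
    """Remove excess white space and line-end hyphenation from pasted PDF text."""
    # Rejoin words split by a hyphen at the end of a line: "trans-\nlation" -> "translation".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop the space left directly before or after a line break.
    text = re.sub(r" ?\n ?", "\n", text)
    # Reduce three or more line breaks to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = "The trans-\nlation   memory \n\n\n\nwas  saved."
print(tidy_pasted_text(sample))
```

Headings, tables and boxes that were relocated during copying are not something a script like this can detect; those still have to be checked by eye.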


 
CHENOUMI (X)
English to French
TOPIC STARTER
Thank you All! Aug 3, 2003

Thanks to each one of you for your time and tips!
I'm used to using the Editing tool from Acrobat, cutting and pasting the text, then reconverting it into PDF format.

To Muja, DSC, GoodWords, Carlos, Alexander, Nigel, Ailish: I'll certainly put your advice and recommendations to good use next time.
Alexander: I have used the OCR option, not for PDF files but in Word.

Since I rarely receive requests for extracting PDF files, I'm faced with the decision whether or not to purchase the whole Adobe Acrobat program now.

Have a nice week,
and
Thank you again!

S.:)

[Edited at 2003-08-03 21:12]


 

