Extracting text from PDFs in Acrobat
Thread poster: Mary Worby

Mary Worby
United Kingdom
Member
German to English
Nov 4, 2002

Folks,



More and more of my work is coming in as PDF files, which currently means either copying and pasting the text into Word, with all the associated rigmarole of reformatting the text, getting rid of paragraph marks, etc., or just printing the thing out and starting from scratch. Neither is a particularly time-effective solution.



What I would like is a system that allows me to extract the text directly into Word or RTF format. It does not have to be perfectly formatted, but it would be nice to have flowing text which is all in the right order.



I'm tempted to get the full version of Acrobat, which allegedly allows you to save in RTF format. My question is whether this actually works! Does it do what it says on the tin, or are there reasons why this would not be the right way to go?



Thanks in advance for any suggestions!



Mary



Joeri Van Liefferinge
Belgium
Member (2002)
English to Dutch
Full version of Acrobat is not the ideal solution either Nov 4, 2002

The full version allows you to save in RTF format, but it's not marvellous either: there's a hard return after each line, and if there are columns in your document, everything is mixed up.

The best solution is to ask for the original text, but I know that clients often say that they don't have access to that.

I for one treat PDF files like texts I receive on paper or by fax, which means that I charge extra for them.



fwiw





Joeri



Juan Pablo Solvez Beneyto
Spain
Member (2002)
English to Spanish
A few global replacements Nov 4, 2002

Hi All,



I assume that you know how to copy and paste the entire text into Word. From that point, my personal solution for getting reasonably well-flowing text is to make 3 global replacements:



1- Replace every period followed by a paragraph mark with a unique tag like ZZZ.

2- Replace every remaining paragraph mark with nothing.

3- Replace ZZZ with a period followed by a paragraph mark.



And that's all I do globally. Then, I guess you have to take care of the remaining 5% (or whatever) by hand.

Please note that this may not be a good idea if there are many places where a natural paragraph mark is not preceded by a period. The accuracy may vary a lot.
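
For anyone who would rather script it than use Word's Replace dialog, the three replacements can be sketched in Python on text that has already been copied out of the PDF. This is only a rough sketch under a few assumptions: the function name `reflow` is made up for illustration, the tag ZZZ is assumed not to occur in the text, and step 2 replaces the line break with a space instead of nothing so that the last word of one line does not run into the first word of the next.

```python
import re

TAG = "ZZZ"  # placeholder tag; assumed not to occur in the text itself


def reflow(text: str) -> str:
    """Join hard-wrapped lines, keeping only breaks that follow a period."""
    # Step 1: protect every period followed by a line break with the tag.
    text = re.sub(r"\.[ \t]*\n", TAG, text)
    # Step 2: replace each remaining line break with a single space
    # (deleting it outright would glue words together).
    text = re.sub(r"[ \t]*\n[ \t]*", " ", text)
    # Step 3: turn the tag back into a period followed by a line break.
    return text.replace(TAG, ".\n")


sample = "This sentence is\nsplit over lines.\nA new paragraph\nstarts here.\n"
print(reflow(sample))
```

The same caveat applies as for the manual method: a heading or list item that does not end in a period will be merged into the text that follows it.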



Best regards,

Juan Pablo




Al Gallo
English to Spanish
I use Adobe Acrobat 5.0 Nov 5, 2002

Hi Mary,

First of all I click on the small T in the toolbar, then press Ctrl and Alt together and with the mouse I select each full column independently, pasting each column into Word, where I have inserted a 2 column table. I paste the original language on the left. When all is translated, I reformat to imitate the original.

Luck

Al



monitor
English to German
Professional Tool «Gemini Solo» Nov 5, 2002

If you regularly need to extract text and graphics from PDFs, go to www.iceni.com and have a look at what they offer.

Instead of spending the money on Adobe Acrobat, you'd be better off buying Gemini Solo.

It solves pretty much exactly that problem.

Kind Regards

Marcel



Anneken
French to Dutch
Wordfast Nov 5, 2002

I usually extract the text by means of Wordfast. You can download Wordfast for free at http://www.champollion.net. Just follow the guidelines to install it, open the PDF file, open a new Word document and start the Wordfast session (by clicking on the Wordfast button). Normally Wordfast automatically detects that a PDF file is open and asks you whether you want to import it. Simply click yes and wait until Wordfast has imported the entire file. You will have to double-check the document though, since the layout tends to change (titles, columns, etc. appear in a different place), but it is definitely a lot easier than copying, and you don't have the annoying hard returns at the end of each line.



Hope it helps!



Kind regards,



Anneken


Nathalie M. Girard, ALHC
English to French
F.Y.I. Wordfast is no longer *free* Nov 5, 2002

Good morning Anneken



I just wanted to make a little correction on your post, as this change is rather recent:



Wordfast is unfortunately no longer *free*.



You can see the pricing details on the website...



Have a great day everyone!

Nathalie



mckinnc
French to English
Just tried what you suggested in Acrobat Nov 5, 2002

I converted a simple Word file without tables into PDF, then saved it as .rtf. Unfortunately, I lost a lot of formatting information (line breaks, page breaks, etc.).

I then tried it on a typical file that I translate, including tables, footnotes and side boxes overlaid on pages. It was not too bad with a standard Word table but didn't cope with a lot of these other things properly at all.



So it might work for straightforward texts, provided you do some reformatting afterwards. It should, of course, be taken as read that clients provide you with the source files. Anything else is patently stupid.



Evert DELOOF-SYS
Belgium
Member
English to Dutch
Readiris Pro 8 Nov 5, 2002

should do the trick.



Opens PDF documents (even read-only!), and converts them into editable files you can send directly to your favorite application:



http://www.irislink.com/opt/uk/products/readiris/pc/features/index.html



Good luck





[ This Message was edited by: on 2002-11-05 12:08 ]



Mary Worby
United Kingdom
Member
German to English
TOPIC STARTER
So there is no answer! Nov 5, 2002

Thanks to you all for your suggestions. It would appear that there is no easy answer (and there I was hoping that Acrobat would solve all my problems).

I've tried the demo version of Gemini Solo in the past, and found the results less than satisfactory. Obviously, a lot depends on how well the document was created in the first place! But on the short documents I tried, I would have had to do almost as much reformatting as when I've simply copied and pasted the text.

I've also used the global replace methods before, but have found them, as you say, to be effective only for texts in normal paragraphs. If a text has a lot of bullet points or other formatting, they're not much use.
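
For what it's worth, the bullet-point problem can be softened a little if the pasted text is run through a script instead of Word's Replace dialog. What follows is only a sketch in Python, not something tested on real client files: the name `reflow_keep_bullets` and the set of markers treated as bullets are assumptions, and it relies on blank lines surviving the copy and paste to mark paragraph boundaries.

```python
import re

# Assumed bullet/number markers: "-", "*", "•", or "1.", "1)", "1-".
BULLET = re.compile(r"(?:[-*\u2022]|\d+[.)-])\s+")


def reflow_keep_bullets(text: str) -> list[str]:
    """Join wrapped lines, starting a new block at blank lines and bullets."""
    blocks: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        stripped = line.strip()
        # A blank line or a new bullet marker closes the block in progress.
        if current and (not stripped or BULLET.match(stripped)):
            blocks.append(" ".join(current))
            current = []
        if stripped:
            current.append(stripped)
    if current:
        blocks.append(" ".join(current))
    return blocks


pasted = (
    "Intro line one\n"
    "continues here.\n"
    "- first point\n"
    "  wrapped\n"
    "- second point\n"
    "\n"
    "Closing text\n"
    "ends.\n"
)
for block in reflow_keep_bullets(pasted):
    print(block)
```

One known weakness: ordinary text that follows a bullet list without a blank line in between gets appended to the last bullet, so the result still needs checking by eye.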



And yes, the answer would be to get the customer to supply the document in the right format. It's especially frustrating when you're translating something which is patently a Word file converted into PDF, and they claim that there is no original document. Customers, eh, who'd 'ave 'em



Thanks again for all your suggestions, it looks like I may have to head back to the drawing board.



Regards



Mary


Karin Adamczyk
Canada
Member
French to English
No original files not possible Nov 5, 2002

Quote:




And yes, the answer would be to get the customer to supply the document in the right format. It's especially frustrating when you're translating something which is patently a Word file converted into PDF, and they claim that there is no original document. Customers, eh, who'd 'ave 'em







You probably already know this by now, but it is not even possible that there are no original files because Acrobat cannot create files on its own. All PDF documents are generated from some other format. That's the whole idea behind Acrobat -- the resulting documents are intended to be distributed to people who do not have the program that created the original files.



Here is the general description of Acrobat from the Adobe site:



Whether you create business plans, spreadsheets, graphically rich brochures, or Web sites, Adobe® Acrobat® 5.0 software lets you convert any document to an Adobe Portable Document Format (PDF) file. Anyone can open your document across a broad range of hardware and software, and it will look exactly as you intended — with layout, fonts, links, and images intact.



HTH,

Karin Adamczyk


Mary Worby
United Kingdom
Member
German to English
TOPIC STARTER
So true ... Nov 5, 2002

Quote:


You probably already know this by now, but it is not even possible that there are no original files because Acrobat cannot create files on its own.





I know, that's what makes the whole thing so bloomin' frustrating!

Even worse was the one I had recently which was 'the customer has the original files but doesn't want you to have them'! Talk about not making life easy - luckily the job didn't come to anything!

I was just hoping there would be an answer which didn't involve nagging the customer for the original files ...



Evert - thanks for the link. Do you have any experience of the software?



Regards



Mary









[ This Message was edited by: on 2002-11-05 13:47 ]

Karin Adamczyk
Canada
Member
French to English
You don't need to nag Nov 5, 2002

Quote:


I was just hoping there would be an answer which didn't involve nagging the customer for the original files ...





All you need to do is inform your customer of your hourly rate for the extra time involved in extracting and formatting the text.



It's actually quite hilarious how quickly they manage to come up with the original files then!! (This works well for faxed documents too, although with faxes there sometimes really are no original files; even so, some clients will decide to type them up themselves.)



Good luck,

Karin
