How can I mass convert pdf (and other formats) to txt
Thread poster: Timothy Barton

Timothy Barton
Local time: 06:00
French to English
Apr 18, 2006

I use corpus analysis for a lot of technical translations, but unfortunately my corpus analysis tool (WordSmith Tools) does not handle pdfs very well. I wondered whether there's any quick way of converting a folder full of pdfs to txts. I haven't managed to find such a tool on the internet. Any ideas?

Jalapeno
Local time: 05:00
English to German
This might be what you're looking for. Apr 18, 2006

http://www.verypdf.com/pdf2word/pdf-to-doc/pdf-to-txt.html

I googled for "convert pdf to text" + "batch", and came up with this site. It advertises batch conversion.

Hmm, the link doesn't seem to be working.

I've improved on my search anyway: "batch conversion" + "pdf to txt" should yield even better results.

[Edited at 2006-04-18 11:27]



Claudia Iglesias  Identity Verified
Chile
Local time: 01:00
Member (2002)
Spanish to French
ABBYY FineReader 8 Apr 18, 2006

Hi Timothy

I usually don't need to open several PDFs at once, but the PDFs I get sometimes run to 70 pages.
I decided to go for ABBYY FineReader 8, although I consider it expensive (€126), because I was tired of having to print, scan and OCR my documents to get them into Word when free tools couldn't process them.

My life has changed completely. I haven't tried it with several PDFs at once but it's so quick and reliable that I think that it must be the best tool, even if you have to do it one by one.

Claudia



Jan Sundström  Identity Verified
Sweden
Local time: 05:00
English to Swedish
Abbyy FR8 Apr 18, 2006

I have to throw in my vote for FineReader too.

We use it a lot here at our agency, and the time it saved us easily makes up for the cost.

It has a lot of other smart functions too.
For instance, you can separate the text layer from the image layer, so you can edit the text and still keep the layout and illustrations if you have architectural drawings and similar!


If you search for FineReader here on the Forum, you will find other posts where users compared this program with other free- and shareware converters.

Best,

Jan



monitor  Identity Verified
Local time: 05:00
English to German
LogiTerm by Terminotix Apr 18, 2006

Have a look at this!

http://www.terminotix.com

LogiTerm has a built-in conversion tool that converts all PDFs into clean Word RTF.
You can even go further and have it create bitext versions of two PDF documents. It's like aligning in other CAT tools, but much easier.
I've used it many times now and I am still amazed at how well it works.

Kind regards
Marcel


RobinB  Identity Verified
Germany
Local time: 05:00
German to English
ABBYY and Gemini Apr 18, 2006

We also use ABBYY, and Iceni Gemini for those files that ABBYY can't handle.

The cost is peanuts considering the time savings. Now all we need is a tool that will automatically edit the converted PDFs (getting rid of unwanted headers/footers, etc.), restore any garbled formatting (it always happens) and then align the files.

Robin
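
A rough sketch of the header/footer part of that wish, in Python. It assumes the converted text has already been split into pages, each a list of lines, and that a header or footer is simply a first or last line that recurs on most pages. The function name and the 60% threshold are arbitrary choices for illustration, and footers containing a changing page number would not be caught.

```python
from collections import Counter


def strip_repeated_edges(pages, min_share=0.6):
    """Drop first/last lines that recur on at least min_share of the pages.

    pages is a list of pages; each page is a list of text lines.
    """
    # Count how often each line appears as the first or last line of a page.
    first = Counter(page[0].strip() for page in pages if page)
    last = Counter(page[-1].strip() for page in pages if page)
    threshold = min_share * len(pages)
    headers = {line for line, count in first.items() if count >= threshold}
    footers = {line for line, count in last.items() if count >= threshold}

    cleaned = []
    for page in pages:
        lines = list(page)
        if lines and lines[0].strip() in headers:
            lines = lines[1:]
        if lines and lines[-1].strip() in footers:
            lines = lines[:-1]
        cleaned.append(lines)
    return cleaned
```

Garbled formatting and alignment are not covered here; this only illustrates that the repeated-line case is mechanical enough to automate.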



Timothy Barton
Local time: 06:00
French to English
TOPIC STARTER
Further explanation Apr 18, 2006

Claudia Iglesias wrote:

I think that it must be the best tool, even if you have to do it one by one.

Claudia



Thanks for all the tips. I'll look into the different programmes.

I should probably have made it clearer what I'd be using it for. It's not to convert texts I am sent to translate, but to convert texts I have downloaded for corpus analysis. With corpora (plural of "corpus"), the bigger the corpus the better, which means I'm sometimes downloading several hundred texts.

I have recently discovered Google Scholar (scholar.google.com), which appears to be a great tool for translators, and particularly useful for creating corpora, but depending on the subject matter, many of the texts are pdfs. Because of the number of texts I'd be downloading (I can download up to 100 texts in two clicks), converting them one by one is simply not an option.

Tim


xxxLia Fail  Identity Verified
Spain
Local time: 05:00
Spanish to English
Welcome aboard the corpus train, Tim:-) Apr 19, 2006

Glad to see you are developing an interest in corpora. They have huge potential for translators, yet are still a relatively new area of knowledge for them. Please :-) look out for the Med Translators and Editors corpus workshop, to take place in July in Canet del Mar (Barcelona) (details posted as a separate thread).

I created a corpus of 500,000 words on macroeconomics a few years ago, mostly from PDFs. I didn't use a tool; I think I did it one by one, mostly because the texts still have to be cleaned up individually anyway. In other words, the click to save as text is the least of your problems :-)




[Edited at 2006-04-19 15:53]



Samuel Murray  Identity Verified
Netherlands
Local time: 05:00
Member (2006)
English to Afrikaans
PDFTXT1.zip Apr 19, 2006

Timothy Barton wrote:
I wondered whether there's any quick way of converting a folder full of pdfs to txts.


PDFTXT1:
http://www.simtel.net/product.php?url_fb_product_page=51612

PikyBasket:
http://www.conceptworld.com/Piky/piky_features.asp

Use PikyBasket to copy the list of files to the clipboard. Then use the clipboard contents to write a batch file which invokes PDFTXT with each PDF file to save it as TXT.



Jan Sundström  Identity Verified
Sweden
Local time: 05:00
English to Swedish
Can Bitext handle PDF? Apr 20, 2006

monitor wrote:

http://www.terminotix.com

LogiTerm has a built-in conversion tool that converts all PDFs into clean Word RTF.
You can even go further and have it create bitext versions of two PDF documents. It's like aligning in other CAT tools, but much easier.
I've used it many times now and I am still amazed at how well it works.

Kind regards
Marcel



Marcel, are you positive that Bitext can handle PDF directly?!
I don't see this feature documented on their website:
http://www.terminotix.com/eng/produits/txt_lt_c4.htm

If it's true, it's a give-away investment for us... Thanks for the advice!

/Jan



Timothy Barton
Local time: 06:00
French to English
TOPIC STARTER
Howto? May 14, 2006

Samuel Murray wrote:

PDFTXT1:
http://www.simtel.net/product.php?url_fb_product_page=51612

PikyBasket:
http://www.conceptworld.com/Piky/piky_features.asp

Use PikyBasket to copy the list of files to the clipboard. Then use the clipboard contents to write a batch file which invokes PDFTXT with each PDF file to save it as TXT.


Have you tried what you explained? I can't figure out how to do it. I've gone to the folder with my PDF files and done a right-click copy to clipboard, but it just puts D:/pdffiles into the clipboard, not the path of each individual file. Also, when you say "write a batch file", does that mean I have to write my own program? Or do you mean some kind of command within PDFTXT?



Timothy Barton
Local time: 06:00
French to English
TOPIC STARTER
Done it!!! May 14, 2006

With PikyBasket, I actually need to go inside the directory containing my PDF files, select all the files, then right-click, PikyBasket, copy to clipboard. This copies all the file names to the clipboard, separated by hard returns, i.e.:

001.pdf
002.pdf
003.pdf

In a program such as Word or OpenOffice, the list then needs converting (via find and replace) to the following format: 001.pdf && 002.pdf && 003.pdf.

Then in PDF-TXT1, I go into the directory with these files, then type PDF-TXT1 001.pdf && 002.pdf && 003.pdf, and bingo!

The only problem is, it doesn't convert documents well that are in columns. I'm not quite sure why, since if you go into the pdf files and select text, the order in which it selects the text is correct. It's a shame, because otherwise the tool would be fantastic. As it is, it will still be useful for corpus analysis, but not as useful as it could have been.



Samuel Murray  Identity Verified
Netherlands
Local time: 05:00
Member (2006)
English to Afrikaans
How to improve your results May 15, 2006

Timothy Barton wrote:
The only problem is, it doesn't convert documents well that are in columns.


That depends entirely on the PDF-TXT converter. I have seen converters that deliver much better results (free ones, even, but unfortunately I can't remember their names). You could also try a PDF2HTM converter and then an HTM2TXT converter (but again, your mileage will vary depending on the tool).
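
For the second half of that PDF2HTM-then-HTM2TXT route, here is a minimal sketch in Python using only the standard library's html.parser. It is not any particular HTM2TXT product: the class and function names are invented for illustration, and the quality of the result still depends on the HTML the first converter produces.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, starting a new line at common block-level tags."""

    BLOCK_TAGS = {"p", "div", "br", "tr", "li", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML string, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

Whether this fixes the column problem depends on whether the PDF2HTM step writes the columns out in reading order; the sketch only strips the markup.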



Samuel Murray  Identity Verified
Netherlands
Local time: 05:00
Member (2006)
English to Afrikaans
How to write a batch file May 15, 2006

Timothy Barton wrote:
Also, when you say "write a batch file", does that mean I have to write my own program? Or do you mean some kind of command within PDFTXT?


A batch file in MS Windows is a file which executes a series of commands. You type the commands in Notepad, and save it with a .BAT file extension (you may have to disable hiding of file extensions, otherwise Windows names the file foo.BAT.TXT without you knowing it). Then double-click the batch file to execute all the commands in it.

I didn't know about the trick with && in the command, so what I usually did was to have:

PDF-TXT.exe 001.pdf
PDF-TXT.exe 002.pdf
PDF-TXT.exe 003.pdf
PDF-TXT.exe 004.pdf
...etc
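
As a hedged sketch, a batch file like the one above can also be generated instead of typed by hand. The Python below is illustrative only. It assumes, as in the commands above, that the converter is called PDF-TXT.exe and takes one PDF file name as its argument. The function names and the output name convert_all.bat are made up for this example.

```python
from pathlib import Path


def batch_lines(file_names, converter="PDF-TXT.exe"):
    """Return one converter command per PDF, sorted by file name."""
    pdfs = sorted(n for n in file_names if n.lower().endswith(".pdf"))
    # Quote each name so that file names containing spaces stay one argument.
    return [f'{converter} "{name}"' for name in pdfs]


def write_batch_file(pdf_dir, batch_name="convert_all.bat"):
    """Write a .BAT file into pdf_dir with one command per PDF found there."""
    folder = Path(pdf_dir)
    names = [p.name for p in folder.iterdir() if p.is_file()]
    batch_path = folder / batch_name
    # newline="\r\n" gives Windows line endings whatever platform this runs on.
    with open(batch_path, "w", newline="\r\n") as handle:
        for line in batch_lines(names):
            handle.write(line + "\n")
    return batch_path
```

Calling write_batch_file("D:/pdffiles") would write convert_all.bat next to the PDFs. Double-clicking it should then run the conversions one after another, provided the converter is in that folder or on the PATH.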

