How to deal with PDF source files in Linux?
Thread poster: RNAtranslator
RNAtranslator  Identity Verified
Local time: 02:24
English to Spanish
Jan 31, 2008

I have no problem getting text and figures out of a PDF file with pdf-utils or poppler-utils, but italics, boldface, and other formatting are lost. I have tried pdftohtml from poppler-utils, but the results are poor. Is there anything better? Ideally, I would like to convert to OpenDocument format.
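
One way to check whether the inline formatting survives extraction at all is to look at the XML that pdftohtml can emit instead of HTML. The sketch below is only an illustration, not a tested workflow: it assumes that poppler-utils is installed, that `pdftohtml -xml` writes `<stem>.xml`, and that bold and italic runs appear as `<b>` and `<i>` children of `<text>` elements. The function names `pdf_to_xml` and `styled_runs` are made up for this example.

```python
import subprocess
import xml.etree.ElementTree as ET


def pdf_to_xml(pdf_path, out_stem):
    """Run poppler's pdftohtml in XML mode (assumed to write <out_stem>.xml)."""
    # -xml: emit XML instead of HTML; -i: ignore images.
    subprocess.run(["pdftohtml", "-xml", "-i", pdf_path, out_stem], check=True)
    return out_stem + ".xml"


def _collect(el, bold, italic, runs):
    """Walk one element, tracking whether we are inside <b> and/or <i>."""
    bold = bold or el.tag == "b"
    italic = italic or el.tag == "i"
    if el.text and el.text.strip():
        runs.append((el.text.strip(), bold, italic))
    for child in el:
        _collect(child, bold, italic, runs)
        # Text after a child's closing tag carries the parent's style.
        if child.tail and child.tail.strip():
            runs.append((child.tail.strip(), bold, italic))


def styled_runs(xml_text):
    """Return a list of (text, is_bold, is_italic) runs from the XML."""
    root = ET.fromstring(xml_text)
    runs = []
    for text_el in root.iter("text"):
        _collect(text_el, False, False, runs)
    return runs
```

If the runs come back with the right flags, the formatting could in principle be reapplied when building an OpenDocument file; if everything comes back unflagged, the PDF probably encodes bold and italic only through font names, and those would have to be read from the font declarations instead.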

 
Marc P (X)  Identity Verified
Local time: 02:24
German to English
How to deal with PDF source files in Linux? Jan 31, 2008

From my experience, the biggest problem with PDFs isn't the inline formatting; that can be applied later without too much effort. The main problem is preserving the logical structure of complex layouts. My normal procedure is not to attempt to reproduce the layout; instead, I import the text into a two-column table and provide a paragraph-by-paragraph translation in the second column. Even if the layout isn't reproduced, though, contiguous blocks of text still have to be recognized as such.
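
The two-column procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the extracted text is assumed to separate paragraphs with blank lines, the output is a plain HTML table (which a word processor can usually open), and the names `split_paragraphs` and `two_column_table` are invented for the example.

```python
import html
import re


def split_paragraphs(text):
    """Split on blank lines and rejoin hard-wrapped lines inside each block."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [" ".join(block.split()) for block in blocks if block.strip()]


def two_column_table(paragraphs):
    """Build an HTML table: source text on the left, empty cell on the right."""
    rows = "\n".join(
        f"<tr><td>{html.escape(p)}</td><td></td></tr>" for p in paragraphs
    )
    return f'<table border="1">\n{rows}\n</table>'
```

For example, `two_column_table(split_paragraphs(open("source.txt").read()))` would produce one row per paragraph, with the second column left empty for the translation. It does nothing to repair blocks that the extraction step has already split or merged wrongly.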

Adobe Reader's save-as-text function is probably the simplest solution; it has the advantage of generally keeping bits of text together that logically belong together.

kpdf has a neat function that enables you to select text by area rather than by logical structure. This can sometimes be useful with irregular blocks or columns of text.
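
A rough command-line counterpart to selecting by area (not kpdf's own mechanism) is the crop options of poppler's pdftotext. The sketch below assumes poppler-utils is installed and that `-x`, `-y`, `-W` and `-H` give the top-left corner and size of the region at the default resolution; the function names and the coordinate values in the usage note are hypothetical.

```python
import subprocess


def pdftotext_area_cmd(pdf_path, page, x, y, width, height, out_path="-"):
    """Build a pdftotext command that extracts one rectangular area of one page."""
    return [
        "pdftotext",
        "-f", str(page), "-l", str(page),   # first and last page to convert
        "-x", str(x), "-y", str(y),         # top-left corner of the crop area
        "-W", str(width), "-H", str(height),  # crop area width and height
        "-layout",                          # keep the physical layout of the text
        pdf_path,
        out_path,                           # "-" sends the text to stdout
    ]


def extract_area(pdf_path, page, x, y, width, height):
    """Run the command and return the extracted text (needs pdftotext on PATH)."""
    cmd = pdftotext_area_cmd(pdf_path, page, x, y, width, height)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

Something like `extract_area("source.pdf", 2, 50, 100, 250, 600)` would then pull out a single column from page 2, provided the coordinates have first been worked out by trial and error or by measuring in a viewer.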

pdftohtml, which you've tried, is OK for simpler documents but with complex layouts, the resulting structure is a mess, as you've probably discovered already.

KWord probably preserves the most formatting, and probably makes the worst job of it, too.

In technical terms, the best solution may be to do what many Windows users do, and OCR the text. Unfortunately, the only high-quality OCR application for Linux that I'm aware of (Vividata) is extortionately expensive.

If anyone has any more information, I'd be very interested in hearing it.

Marc


 
esperantisto  Identity Verified
Local time: 03:24
Member (2006)
English to Russian
SITE LOCALIZER
ABBYY FineReader is reported to work in Linux via wine Jan 31, 2008

and, possibly, ABBYY PDF Transformer too. Both can convert PDF input to MS Word/Excel (but not to ODF so far).

 

