Experiments in translating PDF files
Thread poster: José Henrique Lamensdorf

José Henrique Lamensdorf  Identity Verified
Brazil
Local time: 23:00
English to Portuguese
+ ...
Sep 12, 2010

As I see it, a forum is a place to exchange experiences, ideas, opinions, etc. This post is about a recent job that I was assigned, completed, and delivered; I got paid, and the client was satisfied with the quality, so none of that is at issue here. As I actually attempted it in three different ways, this is an invitation for colleagues to express their views on the different strategies, and perhaps suggest some of their own. Kindly refrain from promoting your favorite CAT tool, unless it offers features that others don't have and that could materially improve any of these work strategies.

The fact remains that any translator could land a project like this.

The project (actually part of it; there were other files) involved translating a participant's workbook for a training program. It amounted to 5,500 words in a PDF of 78 'artsy' pages, apparently created with InDesign CS3. No, the InDesign files were not available.

First attempt

I used InFix Pro to export all the text to XML, then translated that XML in Word, using WordFast Classic. During the process, it's a DOC file. At the end, WordFast/Word automatically converts that DOC back into XML.

For those unfamiliar with it, InFix tags all the text and the PDF, so that later it can import the XML back into the PDF in the right places, with the original formatting. Then InFix (that's probably where its name comes from) allows fixing the formatting as necessary. At import time, InFix warns about any unavailable characters (from partially embedded fonts) and asks, one by one, which font should replace the affected font.

When I tried to import the XML with InFix (this operation had been previously successful with smaller and possibly simpler files), some minor flaw in the XML prevented it completely. I tried to open that XML with Word, but the same flaw prevented it as well.

I searched the web for some simple XML 'debugger', and couldn't find one. Now I'm not particularly interested in learning about XML. I just wanted to find some 'XML detective' that would open the file, point me to the defective parts of it - which I'd delete and fix manually later. Found some freeware with unfriendly (i.e. requiring good XML knowledge) interfaces, and some commercial software at, e.g. USD 300, that said it would check and fix anything.

Question: Is there any cheap or free software that will simply 'fix' an XML file, no difficult questions asked?

Bottom line is that I had a fairly comprehensive TM for that job.

Second attempt

I used the TM already available with WordFast Pro directly on the PDF file, intending to later fix layout issues with InFix. Translation was not automatic all the way, as segmentation was often different (I mean the InFix-exported XML was segmented differently from the WFpro working on the same PDF directly). Nevertheless, I made it. After WFpro had done its job, it only allowed me to save the converted PDF into a DOC (or was it RTF?) Word file. So I did it. The layout was completely cockeyed: page count almost doubled. Though I could arrange it with InFix, it would involve an immense amount of work.

Question: Without bragging about other features, what CAT tools - if any - work directly inside a PDF file? Do they leave the PDF layout relatively unchanged? ... i.e. is everything that was, say, on page 47, still on page 47 after translation?

Third and final attempt

By this time I had all the translation memory for this job carved in stone on my mind. So I used InFix to manually overwrite all the text in the PDF with the corresponding translation. At first InFix warned me every time I typed a character unavailable in the partially embedded fonts; plenty of them, as the translation was from accent-less English into lotsa-diacritics Portuguese. I had in front of me a handwritten list of all the fonts used in the PDF (InFix provides that), and the equivalents I intended to use for each of them. I soon learned to change all fonts first upon opening another page.
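
The warnings about characters missing from a partially embedded font can be anticipated before typing anything. A minimal sketch in Python, assuming the set of characters contained in the subset font is already known; the `available` string below is a made-up example, not data from the actual job, and the function name is hypothetical:

```python
def missing_characters(translation, available):
    """Return the characters used in `translation` that are absent from `available`."""
    have = set(available)
    # Whitespace is not a glyph problem, so it is skipped.
    needed = {ch for ch in translation if not ch.isspace()}
    return sorted(needed - have)


# Hypothetical subset font that holds only unaccented letters and basic punctuation.
available = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.,"
print(missing_characters("Avaliação do participante", available))
```

Any character the function returns would need a replacement font, which is the decision InFix otherwise asks for one character at a time.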

One thing I noticed was that InDesign does font effects by creating 2-3 layers, viz. one for shadow, another for outline, etc. often splitting a word into parts on the lower layers. The worst is when a word or a phrase is fitted around a circle, so it gets split into individual letters, sometimes a pair of them, I guess when manual kerning made them closer together.

Question: Does any CAT tool handle such 'artsy' stuff well?


Just for comparison, my "old way" of doing it, either from PDF or hardcopy.

I first OCR'd the whole pub into plain TXT to translate. Then I'd get all illustrations, either by scanning or extracting them with whatever program that managed to do it from a PDF page. Then I assembled a publication in PageMaker, and put each page of the original as background. I used this background as a template/guide to rebuild the whole translated publication, making heavy use of PM's Master Styles for formatting the plain TXT translation, brought in via copy&paste from Windows Notepad. After a page was completed, I deleted that original 'guide page' from the background.

Nevertheless, I reckon that anyone being less than a speed demon with PageMaker would take too long to do it. So it was a rather 'personal' solution.


Inputs are welcome, as I think that we, translators, will be getting more and more PDFs to translate. I've seen many that consider this format a curse. On the bright side, we should take that as a blessing, since we don't have to figure out how to replicate the original DTP artist's somersaults, nor learn the nuts & bolts of all DTP/graphic editing software packages in the market.



Tom in London
United Kingdom
Local time: 02:00
Member (2008)
Italian to English
PDF Sep 12, 2010

My only comment is:

If they give you a file to translate in pdf format, give them the translation in pdf. Maybe that will stop them.

The only reason why a client would give you a pdf is because they don't possess the original document, and probably didn't write it. I've had a few examples of clients sending me a pdf, the original text of which I've then found on other people's websites.



[Edited at 2010-09-12 16:29 GMT]



José Henrique Lamensdorf  Identity Verified
Brazil
Local time: 23:00
English to Portuguese
+ ...
TOPIC STARTER
But of course! Sep 12, 2010

Tom in London wrote:
My only comment is:
If they give you a file to translate in pdf format, give them the translation in pdf too. Maybe that will stop them.

[Edited at 2010-09-12 16:28 GMT]


In this case they DID want a (translated) PDF, and absolutely nothing else. This kind of request comes up quite often. The client couldn't care less how the original PDF was created, nor how the PDF we deliver will be created. They also couldn't care less what - if any at all - CAT tool we use. All they want is a well translated file that looks just like the original, and that will glitchlessly open in Acrobat Reader, at least as much as the original one did.

It's up to us - translators - to find the shortest and safest route between the original PDF and the translated one. This is the issue here, not stopping them from anything.


Btw, in this specific project there were also PPT and DOC files to be translated. They wanted translated PPT and DOC files, respectively. Amazingly, although the translation requested was EN-PT, the PPTs had been assembled with an Italian version of PowerPoint, and there were some Japanese fonts still embedded, though I didn't find any text left using them. A typically globalized thing.



Paulo Eduardo - Pro Knowledge  Identity Verified
Brazil
Local time: 23:00
Member (2008)
Portuguese to English
+ ...
ABBYY Sep 12, 2010

José Henrique, ABBYY solves some of it, with language and font recognition too...
But tables still have to be put in order.
So, a combination of ABBYY, Acrobat Writer and WereCat (and time) would do the trick.
Best of luck José and Tom.



José Henrique Lamensdorf  Identity Verified
Brazil
Local time: 23:00
English to Portuguese
+ ...
TOPIC STARTER
Interesting points - some questions Sep 12, 2010

Paulo Eduardo - Pro Knowledge wrote:
José Henrique, ABBYY solves some of it, with language and font recognition too...
But tables still have to be put in order.
So, a combination of ABBYY, Acrobat Writer and WereCat (and time) would do the trick.
Best of luck José and Tom.


OK, ABBYY is an OCR program. Though I've been using OmniPage, my new multifunctional gizmo came with ABBYY bundled, so it's worth checking. I thought OCR would be mostly for a scanned PDF, while this one had apparently been distilled from InDesign.

Next, AFAIK Acrobat Writer is one of the two ways (the other is Acrobat Distiller) to get a PDF from any application capable of printing to a PostScript printer. Do the newer versions also allow editing PDFs?

Finally, I knew WereCat as a utility - no longer supported - for converting PowerPoint (not PDF) files to Word for translation with a CAT tool. Could you sketch a workflow like the one you propose? A simple thing... a sequence of what software does what?



Tony M  Identity Verified
France
Local time: 03:00
Member
French to English
+ ...
Workflow Sep 12, 2010

OK, here goes for part of your query.

Yes, ABBYY is great for both 'image' and 'text' PDFs, and recovers graphics elements etc. that you can re-use.

It does, however, do crazy things, creating dozens of different styles to attempt to mimic the page layout; it also has a nasty habit of using columns to try and mimic tables and tabulations; all of this does create a lot of extra work, trying to put it right, but when it works, the results can be a very good facsimile of the original.

As for WereCat, it really is an excellent utility; there are two versions, 'red' and 'blue': one of them is specifically for extracting text from PPTs, the other for extracting it from text boxes etc. in .DOCs. This can be useful, as ABBYY seems to put some text in boxes (I haven't investigated enough to find out when / why).

So I do things in this order:

PDF > DOC conversion using ABBYY

(at this point, it's worth sorting out as many of the formatting problems as you can, to avoid problems in your CAT tool, which sometimes has trouble with segmentation across column boundaries, unwanted hard returns, etc.)

Extraction of text from text boxes (if necessary) into an auxiliary .DOC file using Werecat.

Translate both doc files

Re-insert text from auxiliary file into main file (Werecat)

Tidy up .DOC file

Convert back to PDF (I use PRIMO PDF from Nitro, there's a free version available).

I've not so far had major issues with fonts or special text effects, but I can imagine that would be a real pain!

Good luck!

PS:

Maybe some enterprising person could offer a bureau service for part of this process?



Jabberwock  Identity Verified
Poland
Local time: 03:00
Member (2004)
English to Polish
FineReader is fine Sep 12, 2010

José Henrique Lamensdorf wrote:

OK, ABBYY is an OCR program. Though I've been using OmniPage, my new multifunctional gizmo came with ABBYY bundled, so it's worth checking. I thought OCR would be mostly for a scanned PDF, while this one had apparently been distilled from InDesign.



Actually, it converts any PDF into editable form much better than any automatic converter.

The trick is to manually micromanage the text boxes, tables etc. It is additional effort, but it eliminates many formatting problems that Tony mentioned (except the column division, but you can get around that using tables or boxes).

If you are handy with styles in Word, it might make sense to use a less "facsimile" conversion. FineReader allows four degrees of layout similarity, from a very faithful (but heavily formatted) rendition of the original page to plain text export.



Nikita Kobrin  Identity Verified
Lithuania
Local time: 04:00
Member (2010)
English to Russian
+ ...
PDF by its nature is NOT suitable for translation Sep 12, 2010

José Henrique Lamensdorf wrote:

what CAT tools - if any - work directly inside a PDF file? Do they leave the PDF layout relatively unchanged?


Such a CAT tool doesn't exist, and I doubt one will appear in the near future. I.e., there's no CAT tool that can ADEQUATELY treat ANY document in PDF.


José Henrique Lamensdorf wrote:

I think that we, translators, will be getting more and more PDFs to translate.


We should not. At least we should do our best to prevent that from being so. The matter is that PDF by its nature is against such processes as translation. Unfortunately there are a lot of computer-illiterate people who don't understand that...


xxxNMR
France
Local time: 03:00
French to Dutch
+ ...
I am doing experiments right now Sep 12, 2010

I get excellent results with Wordfast 2.4.0.1.
Workflow:
1) create environment in WF
2) import PDF file, this creates a Word file. Lay-out and images are preserved. It is in the right order, but sometimes some characters aren't recognized (for instance "f"), which makes me think that WF proceeds by using OCR.
3) translate using WF Pro
4) save the translation (in Word, but also keep the TXML file, one never knows)
5) do some lay-out and tidying-up (joining sentences, etc.)
6) conversion of the Word file into PDF (print on disk) by using a PDF creator

Alternatively, 3) translate directly in the Word file.

My settings:
Word 2003 on a Vista computer
WF 2.4.0.1 (the latest version as far as I know)
Scansoft Expert PDF 4 (didn't try Acrobat Writer)

I don't know InFix but I wonder if this would be useful, especially if the target text is longer than the source text. Acrobat is not a DTP-program and never will be. (the full version has only very limited DTP functions).

Pagemaker/InDesign won't give good results because you will have to re-create the file. (although I worked as a DTP specialist and would be able to do so).

Note: in all cases, one of the problems might be the images (low density in the PDF file).


José Henrique Lamensdorf wrote:

I think that we, translators, will be getting more and more PDFs to translate.

I am sure we'll have to. Being able to do something other translators can't do can be your main marketing argument.


[Edited at 2010-09-12 20:45 GMT]

Update: as for the "artsy" stuff, I have two or three lines which are oblique and not recognized by the process. I will transform them into horizontal texts, so I take some loss of functions for granted. In the same way, my TM is not very "beautiful" because some sentences are cut into two or three parts. Experiments are going on.

[Edited at 2010-09-13 07:33 GMT]


xxxOlaf
Local time: 03:00
English to German
XML debugging -- Internet Explorer Sep 13, 2010

José Henrique Lamensdorf wrote:
Question: Is there any cheap or free software that will simply 'fix' an XML file, no difficult questions asked?

One of the quickest and easiest ways to test the XML compliance of an XML file is to drag and drop it to an open Internet Explorer window. If the XML file is malformed it won't load completely and towards the end of the file IE will display an error message letting you know where the error occurred.
Most of the time it was the use of left or right angle brackets or an ampersand in the text. These characters need to be written as XML entities: &lt; &gt; and &amp;
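
For anyone who would rather script that step, here is a minimal sketch in Python. The standard library's `xml.sax.saxutils.escape` turns these three characters into their entities. It is meant for the text between tags only; running it over a whole XML file would escape the tags as well. The sample segment is invented for illustration.

```python
from xml.sax.saxutils import escape

# A translated segment containing all three problem characters.
segment = "Profit & loss < 5% > target"

# escape() replaces & first, then < and >, so entities are not double-escaped.
print(escape(segment))
```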

It's also relatively easy to write a VBA macro with the MSXML library that will tell you exactly in what line the error occurred.
For more information see A Beginner's Guide to the XML DOM -- Dealing with Failure. I used a modified version of the second listing in that section to get the error information from the ParseError object and to replace the offending character if possible.
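
The same kind of check can be done without IE or VBA. A minimal sketch in Python, using the standard library parser rather than the MSXML approach described above; the function name and the sample string are invented for illustration:

```python
import xml.etree.ElementTree as ET


def find_xml_error(xml_text):
    """Return None if xml_text is well-formed, else (line, column, message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # position is a (line, column) tuple; lines are counted from 1.
        line, column = err.position
        return line, column, str(err)
    return None


# The bare ampersand on the second line makes this sample malformed.
broken = "<doc>\n<seg>Research & development</seg>\n</doc>"
print(find_xml_error(broken))
```

Like the IE trick, this reports only the first error the parser hits, so a badly damaged file may need several passes of fix and re-check.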


FarkasAndras
Local time: 03:00
English to Hungarian
+ ...
ABBYY Sep 13, 2010

Jabberwock wrote:

José Henrique Lamensdorf wrote:

OK, ABBYY is an OCR program. Though I've been using OmniPage, my new multifunctional gizmo came with ABBYY bundled, so it's worth checking. I thought OCR would be mostly for a scanned PDF, while this one had apparently been distilled from InDesign.



Actually, it converts any PDF into editable form much better than any automatic converter.


Well, the fact that the best option for handling PDF is OCR says a lot about the usefulness of the format itself. It's just preposterous.
For God's sake people, forget about PDF already!

Note: just a few days ago, I found a nice little open source command line tool in the xpdf pdf viewer project that extracts text from pdf files into formatted or unformatted txt. I'm using it in a project for automatic pdf->txt conversion, and in some ways, it works better than Acrobat Reader's own save as text feature (better with weird characters).
http://www.foolabs.com/xpdf/download.html
Windows binaries: ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.02pl4-win32.zip
The tool is called pdftotext, sample command:
pdftotext -layout -enc UTF-8 input/file.pdf output/file.txt

I for one don't trust OCR to recognize all characters correctly, so I prefer solutions like this. The formatting may be off, tables and columns may get mangled badly, but at least the words themselves should be what they are in the pdf.
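
For batch work, the sample command can be wrapped in a small script. A minimal sketch in Python, assuming `pdftotext` is on the PATH; the function names and file paths are illustrative. It also relies on pdftotext separating pages with a form feed character, which makes it possible to check what text ended up on which page:

```python
import subprocess


def pdftotext_command(pdf_path, txt_path):
    """Build the same command line as the sample command."""
    return ["pdftotext", "-layout", "-enc", "UTF-8", pdf_path, txt_path]


def split_pages(text):
    """Split pdftotext output into pages on the form feed character."""
    pages = text.split("\f")
    # The last page normally ends with a form feed too, leaving an empty tail.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def convert(pdf_path, txt_path):
    """Run pdftotext and return the extracted text, one string per page."""
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    with open(txt_path, encoding="utf-8") as handle:
        return split_pages(handle.read())
```

Calling `convert("input/file.pdf", "output/file.txt")` would then give a list in which item 46 holds the text of page 47, assuming the file exists and the output folder is writable.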


[Edited at 2010-09-13 08:36 GMT]



Samuel Murray  Identity Verified
Netherlands
Local time: 03:00
Member (2006)
English to Afrikaans
+ ...
@José Sep 13, 2010

José Henrique Lamensdorf wrote:
Is there any cheap or free software that will simply 'fix' an XML file, no difficult questions asked?


Dunno. But "XML debugger" may not get you what you're looking for. Try "XML validator". Oh, and have you tried these: http://www.w3.org/XML/Schema.html ?

I used the TM already available with WordFast Pro directly on the PDF file, intending to later fix layout issues with InFix.


AFAIK, WFP extracts the text and tries to guess the layout, but that's about it. The more complex your PDF, the less likely that a CAT tool will guess the formatting and layout correctly.

One thing I noticed was that InDesign does font effects by creating 2-3 layers, viz. one for shadow, another for outline, etc. often splitting a word into parts on the lower layers.


Not just InDesign does this. Another common gripe with translating PDF files is that graphic artists create font effects by doing similar things manually, instead of using the font effects library of their DTP tool.

...and put each page of the original as background. I used this background as a template/guide to rebuild the whole translated publication...


This is similar to what I do when I have to rebuild a PDF file's layout in MS Word. I first type all the text into Notepad, then I take a screenshot of the PDF page and then I use that page as a non-washed watermark in MS Word to guide me in beating the typed text into the original layout. Lastly, I remove the watermark.

Inputs are welcome, as I think that we, translators, will be getting more and more PDFs to translate.


My concern is that as with any other format, the client may make small changes in mid-project, and if your translation process relies on the original format staying the same, then you're going to run into problems.

Also, if a client wants the translator to deliver a PDF file, he might not use the PDF file as-is. Well, his DTP guy may import the PDF directly and use it as-is, but it is also equally likely that the DTP guy is going to open the PDF file in his secondary monitor and then copy/paste the contents into the final version of the DTP file... and in that case, you might as well have delivered the translation in a two-column format (source left, translation right).



Samuel Murray  Identity Verified
Netherlands
Local time: 03:00
Member (2006)
English to Afrikaans
+ ...
Micromanagement in an OCR program Sep 13, 2010

Jabberwock wrote:
The trick is to manually micromanage the text boxes, tables etc. It is additional effort, but it eliminates many formatting problems that Tony mentioned (except the column division, but you can get around that using tables or boxes).


Absolutely. The little bit of extra effort in selecting chunks of text manually (instead of letting the OCR program guess which bits of text belong together) will result in an OCR'ed file that is far easier to handle.

The same goes for extracting text from PDF using Acrobat Reader. You can either just go Ctrl+A and copy everything, but that will result in some misplaced text and reverse-order paragraphs, or you can use the block-select tool in Reader 5.05 or older to select blocks of text and copy them manually, which will ensure that text that should stay together stays together. Newer versions of Reader don't seem to have this block-select feature (I wonder why...).



Marlene Blanshay  Identity Verified
Canada
Local time: 21:00
Member (2009)
French to English
+ ...
I use a service Sep 13, 2010

OCR now. It does a really good job...you can buy credits for a really small fee and get a large number of jobs.


Heinrich Pesch  Identity Verified
Finland
Local time: 04:00
Member (2003)
Finnish to German
+ ...
Thou shalt not send PDFs to translators Sep 13, 2010

This is the eleventh commandment, but unfortunately it is little known.

I would extract the text either using ABBYY FineReader or manually. Some customers prefer translations where the text chunks are grouped according to page and column etc.
But if the customer cannot create the final file himself, I would format the text in Word so that it is clear what belongs where. I would not split sentences. After that they can do what they want with it.

If the job is too complicated I let it pass.

Regards
Heinrich

