https://www.proz.com/forum/cat_tools_technical_help/53766-which_tool_to_align_pdf_files_and_create_tm.html

Which tool to align PDF files and create TM?
Thread poster: Jan Sundström
Jan Sundström
Jan Sundström  Identity Verified
Sweden
Local time: 18:48
English to Swedish
+ ...
Aug 22, 2006

Hi all,

Which tool do you use to align two PDF files (source/target) and create a TM with minimum fuss?

OCR with a third-party application and then just use Trados WinAlign, or is there anything better/more direct?

If it's good, it's worth paying for, so the price tag is not a problem.

What are your experiences?


 
Natalie
Natalie  Identity Verified
Poland
Local time: 18:48
Member (2002)
English to Russian
+ ...

Moderator of this forum
SITE LOCALIZER
Hi Jan Aug 22, 2006

PDF files are not intended for aligning or whatever. To learn more please refer to http://www.proz.com/doc/128

Natalia


 
RobinB
RobinB  Identity Verified
United States
Local time: 11:48
German to English
"Minimum fuss" Aug 22, 2006

Jan Sundström wrote: Which tool do you use to align two PDF files (source/target) and create a TM with minimum fuss?


I suppose it depends on your concept of "minimum"....

There isn't any quick and easy way to convert and align PDFs, AFAIK. We use either the ABBYY PDF converter or Iceni Gemini - some PDFs convert better with ABBYY, others with Gemini. But whichever you use, there will still be quite a lot of cleaning up to do before you align, for example:

- eliminating unnecessary page headers/footers
- eliminating "hard" line breaks in sentences
- eliminating "manual" line break hyphens
- reformatting: it often happens that multiple columns get screwed up. The same applies to tables.
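
A minimal sketch of the first three clean-up steps above, assuming the converter has already produced plain text with one string per page. The function name and the `min_repeats` threshold are illustrative, not part of ABBYY or Gemini. Column and table repair is not attempted here.

```python
import re
from collections import Counter


def clean_extracted_pages(pages, min_repeats=3):
    """Clean text extracted from a PDF, given one string per page."""
    # Assumption: a line that recurs on several pages is a running header/footer.
    counts = Counter()
    for page in pages:
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    repeated = {line for line, n in counts.items() if n >= min_repeats}

    kept = []
    for page in pages:
        for line in page.splitlines():
            if line.strip() not in repeated:
                kept.append(line.rstrip())
    text = "\n".join(kept)

    # Rejoin words hyphenated across a line break: "trans-\nlation" -> "translation".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single line breaks inside a paragraph into spaces; keep blank lines.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text
```

Two limits worth knowing: the hyphen rule also glues genuine compounds that happen to break at the hyphen, and footers that change per page (page numbers, for example) are not caught by the repeat count, so a manual check is still needed.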

Additionally, depending on the font used in the DTP system, you might find yourself confronted with ligatures, which tend to leave spaces in the middle of words when you convert. These have to be identified (manually!) and eliminated.

These are just some of the issues relating to PDFs created from DTP systems. The quality of PDFs created by scanning in paper copy will depend critically on the OCR system you use.

Then there are the TM-specific edits you need to do before it's worth aligning. For Trados, for example, this means inserting non-breaking spaces after colons, inserting hard line breaks if the sentence ends in a figure, that sort of thing.

And then you can move on to the joy of alignment...


 
Heinrich Pesch
Heinrich Pesch  Identity Verified
Finland
Local time: 19:48
Member (2003)
Finnish to German
+ ...
Don't bother Aug 22, 2006

If the documents are longer than a few pages, it is not worth the effort. I once tried to align two Word files, which I had translated mostly myself, but the resulting TM was not usable. You can retranslate the file by copying and pasting from the translation; that will deliver a decent TM.
Regards
Heinrich


 
Barnaby Capel-Dunn
Barnaby Capel-Dunn  Identity Verified
Local time: 18:48
French to English
Logiterm Aug 22, 2006

Hi Jan
Logiterm does this without any trouble - but of course you need Logiterm! (www.terminotix.com)


 
RobinB
RobinB  Identity Verified
United States
Local time: 11:48
German to English
Logiterm experience Aug 22, 2006

Barnaby Capel-Dunn wrote:
Logiterm does this without any trouble - but of course you need Logiterm! (www.terminotix.com)


Hi Barnaby,

Several colleagues have recently recommended Logiterm to me - so forcefully that I'll probably buy a copy for evaluation. Perhaps you could let us know your experience with this software. Can it handle complex formats in PDFs (multiple columns, tables, that sort of thing)? How good is it at converting special characters? How long does it take to make a bitext out of e.g. two 200 page PDFs (one in each language)?

TIA,
Robin


 
Viktoria Gimbe
Viktoria Gimbe  Identity Verified
Canada
Local time: 12:48
English to French
+ ...
AutoUnbreak Aug 22, 2006

First off, I find that OCR isn't the way to go here unless your PDFs are image files (scans, faxes, etc.). If they are text PDFs, this is what I do:

I use AutoUnbreak (look it up on Google, it's free and does the trick nicely). First, I select all in the PDF and copy. Then I paste it into AutoUnbreak. The software only takes 65000 characters at once, so with longer PDFs, you may want to break it up into several smaller sections. AutoUnbreak removes carriage returns and creates RTF text, so you keep the formatting, but the unnecessary carriage returns are removed. You paste this into a Word document. You repeat the procedure with the target text and paste it into another empty Word file. Then, you simply align with your usual align tool (I use WinAlign, it works well for this purpose).
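
For longer PDFs, the splitting into sections could be scripted. The sketch below is only an illustration: the 65000 figure comes from the limit mentioned above, while the function name and the choice to cut at blank lines are assumptions, and it is not known whether AutoUnbreak counts characters in exactly this way.

```python
def split_into_chunks(text, limit=65000):
    """Split text into pieces of at most `limit` characters, preferring paragraph breaks."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = para if not current else current + "\n\n" + para
        if len(candidate) <= limit:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single paragraph longer than the limit is cut hard.
        while len(para) > limit:
            chunks.append(para[:limit])
            para = para[limit:]
        current = para
    if current:
        chunks.append(current)
    return chunks
```

Cutting at paragraph boundaries keeps sentences whole, so source and target chunks stay easier to match up afterwards.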

Once aligned, all you have to do is check that the alignment is done OK (sometimes, a source segment is broken up into two target segments and vice versa). Once the alignment is satisfactory, export the aligned file pair (or project) and voilà!
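
To illustrate what an exported file pair can look like, here is a sketch that writes checked segment pairs as TMX, the XML interchange format that many CAT tools can import. This is not WinAlign's own export format; the function name, the tool name in the header, and the default language codes are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def build_tmx(pairs, src_lang="en", tgt_lang="sv"):
    """Build a minimal TMX 1.4 document from (source, target) segment pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "alignment-sketch",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        # One translation unit per aligned pair, one variant per language.
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

Calling `build_tmx([("Hello", "Hej")])` returns the document as a string, which can then be saved with UTF-8 encoding and imported into the TM.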

I have used this several times to create TMs using parallel texts from government websites prior to starting an assignment and it helps a lot.

Good luck!


 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 18:48
Member (2006)
English to Afrikaans
+ ...
OCR and align with PlusTools Aug 22, 2006

Jan Sundström wrote:
Which tool do you use to align two PDF files (source/target) and create a TM with minimum fuss?


Today I had to do just that. I have an OCR scanner with document feeder so that I can scan multiple pages quickly. I then extracted the segments using Wordfast, and aligned the two files using PlusTools.

I don't know how good WinAlign is... is it user-friendly? PlusTools's align feature basically puts the text into a two column table with one row per segment, and you have keyboard shortcuts for merging and splitting cells (Alt+S to split at cursor point, and Alt+M to merge the text with the cell beneath it). I've seen aligners that work with the mouse, where you have to draw lines from one side of the screen to the other, but those are IMO hopelessly too cumbersome.
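
The table-with-split-and-merge idea can be sketched as a small data structure. This is only a model of the approach described above, not PlusTools code; the class and method names are made up, and the assumption that cells below an edit shift up or down within their own column is mine.

```python
class AlignmentTable:
    """Two-column alignment grid: one column per language, one cell per segment."""

    def __init__(self, source, target):
        self.columns = {"source": list(source), "target": list(target)}

    def split(self, column, row, offset):
        """Split a cell at a character offset; later cells in that column shift down."""
        cells = self.columns[column]
        text = cells[row]
        cells[row:row + 1] = [text[:offset].rstrip(), text[offset:].lstrip()]

    def merge(self, column, row):
        """Merge a cell with the one beneath it; later cells in that column shift up."""
        cells = self.columns[column]
        cells[row:row + 2] = [cells[row] + " " + cells[row + 1]]

    def rows(self):
        """Return (source, target) pairs, padding the shorter column with ''."""
        src, tgt = self.columns["source"], self.columns["target"]
        n = max(len(src), len(tgt))
        pad = lambda cells: cells + [""] * (n - len(cells))
        return list(zip(pad(src), pad(tgt)))
```

Because each edit only moves cells within one column, a single split or merge at the point of misalignment realigns every row below it, which is what makes the keyboard-driven approach quick.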

What other free aligners are there? I know of http://sourceforge.net/projects/bitext2tmx and the old Cypressoft aligner. What others are there?


 
Barnaby Capel-Dunn
Barnaby Capel-Dunn  Identity Verified
Local time: 18:48
French to English
To Robin Aug 23, 2006

Will get back to you later in the day!

 
Barnaby Capel-Dunn
Barnaby Capel-Dunn  Identity Verified
Local time: 18:48
French to English
Re Logiterm Aug 23, 2006

Robin,
I've been in touch with Logiterm and this is what they say:

Can it handle complex formats in PDFs (multiple columns, tables, that sort of thing)?
Yes, it can handle PDFs with columns, tables, etc. Keep in mind though, that PDFs are the most difficult format to handle and it may cause certain misalignments. The best way of seeing how it handles complicated PDF files is to either ask for a 30-day trial of LogiTerm or send the files to Terminotix so they process them.

How good is it at converting special characters?
Which special characters are we talking about? The Professional Edition handles Latin-only languages.

How long does it take to make a bitext out of e.g. two 200 page PDFs (one in each language)?
I just created a bitext with a 155 page French document and its corresponding 153-page English document and it took 25 seconds.

I hope this is of some use to you?
I personally am a great fan of Logiterm. I must admit I don't use all its features by a long chalk but IMO it's worth its price for its alignment tool and its Logitrans component alone.
Best
Barnaby


 
Jan Sundström
Jan Sundström  Identity Verified
Sweden
Local time: 18:48
English to Swedish
+ ...
TOPIC STARTER
Logiterm exporting to TM Aug 25, 2006

Hi all,

I've been reading the Logiterm specs, but the part on exporting to a TM is very brief:

"Compatible with translation memories
[...]
Inversely, you can also take one or more bitexts and create
documents that can be imported into a translation memory."

This is the function that I'm specifically looking for.
A question to the ones of you who have tried this:
Is the alignment/creation of TM through Logiterm more accurate or user-friendly compared to WinAlign or other similar alignment tools?

It seems like a very competent tool, but the documentation on exactly which charsets are supported is a bit sketchy. Will it handle Scandinavian characters (åäö), and not mangle them while exporting?!

I guess the best way to find out is to download the demo, but if you have any user experience, it would be valuable to find out first!

Thanks a lot,

Jan


 
Jan Sundström
Jan Sundström  Identity Verified
Sweden
Local time: 18:48
English to Swedish
+ ...
TOPIC STARTER
AutoUnbreak - a detour? Aug 25, 2006

Viktoria Gimbe wrote:
I use AutoUnbreak (look it up on Google, it's free and does the trick nicely). First, I select all in the PDF and copy. Then I paste it into AutoUnbreak. The software only takes 65000 characters at once, so with longer PDFs, you may want to break it up into several smaller sections. AutoUnbreak removes carriage returns and creates RTF text, so you keep the formatting, but the unnecessary carriage returns are removed. You paste this into a Word document.


Without having tried AutoUnbreak, this sounds like a "poor man's solution" for what can be done in fewer steps with commercial software.

My guess is that the copy-paste way is also very vulnerable to tables, inserted text blocks etc.

Converting a PDF with tagged text into an RTF document can just as well be achieved by saving the PDF as RTF in Acrobat 7.0, with just a few mouse clicks. Recent versions of Acrobat interpret line breaks very well, so extra carriage returns hardly occur anyway.

ABBYY PDF Transformer also does this virtually automatically, with hardly any erroneous CRs. I'm not sure if ABBYY is just an "optical" character recognition tool, but my guess is that it extracts tagged text directly, rather than interpreting it optically.

Anyway, these points aside, I was imagining a tool that would merge the conversion and alignment of a PDF into one step, rather than handling them as separate tasks. And it seems that Logiterm is the closest match so far...

/Jan


 
Jan Sundström
Jan Sundström  Identity Verified
Sweden
Local time: 18:48
English to Swedish
+ ...
TOPIC STARTER
Some answers Aug 25, 2006

Hi,

I got a reply to some of my own questions by writing to Terminotix:

****************
1) LogiTerm's Professional edition supports Scandinavian characters.

2) All alphabets using ISO Latin 1 characters are supported.

3) All file formats [sic!], except for QuarkXpress and Framemaker. PDF image files are, of course, not supported, all others are. When a PDF was scanned in an image file, we recommend that it be rescanned using, for example, Omnipage.

****************
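
As a quick way to check a text against point 2 before aligning, here is a small sketch that lists characters falling outside ISO Latin 1 (ISO 8859-1). The function name is made up, and how LogiTerm handles encodings internally is not known from the reply; this only tests the character set itself. Swedish å, ä and ö are all within Latin 1.

```python
def non_latin1_characters(text):
    """Return the distinct characters in `text` that ISO 8859-1 (Latin-1) cannot encode."""
    bad = []
    for ch in text:
        try:
            ch.encode("latin-1")
        except UnicodeEncodeError:
            # Keep first-seen order and skip repeats.
            if ch not in bad:
                bad.append(ch)
    return bad
```

An empty result means every character fits in Latin 1; anything returned (typographic dashes or the euro sign, for instance) is worth checking after a trial export.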

Good enough, I'm downloading the demo

/Jan


 
Barnaby Capel-Dunn
Barnaby Capel-Dunn  Identity Verified
Local time: 18:48
French to English
From a Logiterm user Aug 25, 2006

Hope you enjoy it Jan! I certainly like it a lot. Let us know your impressions.
Best
Barnaby


 

