Best way to add a bilingual pdf file into a TM
ND1169
Oct 11, 2012
English to Japanese

Hi, I have some previously translated files, outsourced to an unknown company, that exist only as PDFs, and I would like to incorporate them into the new translation memory I have made. However, both the source language (English) and the target language (Japanese) are in the same PDF file. Here is a screen

What is the most efficient way to go about doing this?

Obviously the easiest way would be to get in contact with whoever translated this and get a TMX file, but that's not possible.

Should I just copy and paste like a madman into Excel? That seems very time-consuming but possible.

I've tried to align the files, but the aligner puts both languages in both source and target, and because these are PDFs with lots of tables etc., I've had so much formatting trouble that it doesn't seem worth my time. I've also tried LF Aligner and I get the same formatting errors, but I can at least get the data into an XLS file. Is there an easy way to select all Roman characters or all Japanese characters in an Office application like Excel or Word and delete them? For example, select the target row, then delete all Japanese characters?

Or should I just convert it to .txt and/or .doc first and try it that way? The problem of both languages appearing in both source and target won't go away, though.

Thanks for any and all help!

Separate Oct 11, 2012
English to Hungarian
+ ...

One thing you could try to separate the EN and the JP text is to OCR it with ABBYY or some other OCR software and hope that the text colours are recognized consistently. Then you could use "select text with similar formatting" in Word to select all the JP text and cut and paste it to a new doc.
Theoretically, it should also be possible to separate the texts based on their differing character sets using regex, e.g. in Notepad++. If the JP text is always on the second half of each line, the regex would be:
If this works, you end up with a tab-separated text file which is ready for converting into a TMX or importing into Studio as a bilingual file.
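
For anyone who prefers a script to Notepad++, here is a rough Python sketch of the same character-set idea. It assumes each extracted line holds the English first and the Japanese second, and it simply splits at the first Japanese character. The names `JP_CHAR`, `split_en_jp` and `to_tsv` and the Unicode ranges chosen are illustrative assumptions, not the regex referred to above.

```python
import re

# Assumed ranges: CJK punctuation, hiragana, katakana, CJK ideographs
# and full-width/half-width forms. Adjust if your files use other blocks.
JP_CHAR = re.compile(r"[\u3001-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uff00-\uffef]")


def split_en_jp(line):
    """Split a line at its first Japanese character.

    Returns (english, japanese), or None when the line has no Japanese
    text or starts with Japanese (so there is no English half).
    """
    line = line.strip()
    match = JP_CHAR.search(line)
    if match is None or match.start() == 0:
        return None
    english = line[:match.start()].strip()
    japanese = line[match.start():].strip()
    return english, japanese


def to_tsv(lines):
    """Build tab-separated EN/JP rows, skipping lines that cannot be split."""
    rows = []
    for line in lines:
        pair = split_en_jp(line)
        if pair is not None:
            rows.append(pair[0] + "\t" + pair[1])
    return "\n".join(rows)
```

Lines that the function skips (English only, Japanese only, or Japanese first) would still need to be checked by hand, and any Latin text embedded in the Japanese half stays in the Japanese column.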

Both solutions are error-prone, depending on what your file looks like.
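
If you do end up with a clean tab-separated file, turning it into a TMX is the easy part. A minimal sketch using only Python's standard library follows; the function name `tsv_to_tmx`, the language codes `en-US` and `ja-JP`, and the header values are assumptions you would adapt to your own TM.

```python
import xml.etree.ElementTree as ET


def tsv_to_tmx(tsv_text, src_lang="en-US", tgt_lang="ja-JP"):
    """Convert 'source<TAB>target' lines into a TMX 1.4 string."""
    tmx = ET.Element("tmx", version="1.4")
    # Header attributes required by TMX 1.4; the values are placeholders.
    ET.SubElement(tmx, "header", {
        "creationtool": "tsv2tmx-sketch",
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "tsv",
        "adminlang": "en-US",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for line in tsv_text.splitlines():
        parts = line.split("\t")
        # Skip rows that do not have exactly two non-empty columns.
        if len(parts) != 2 or not all(p.strip() for p in parts):
            continue
        tu = ET.SubElement(body, "tu")
        for lang, text in zip((src_lang, tgt_lang), parts):
            tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
            ET.SubElement(tuv, "seg").text = text.strip()
    return ET.tostring(tmx, encoding="unicode")
```

Write the returned string to a UTF-8 file with a `.tmx` extension and check a few units in your CAT tool before importing the whole thing.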

[Edited at 2012-10-11 10:04 GMT]

Stanislaw Czech, MCIL
United Kingdom
Local time: 19:51
Member (2006)
English to Polish
+ ...
I would outsource it Oct 11, 2012

I don't know how useful such a TM would be to you. If you expect a lot of matches, then it could make sense to outsource the job to someone who will copy the text into two columns in Excel. Once that is done, you could take over and align the text.

In my experience, there are websites where you can find someone willing to take on such a simple job for a fraction of a translator's hourly pay.

Alternatively, you could do nothing: just copy these files into a single directory and search their contents (for instance in Windows Explorer) when you need to find a particular term.

Good luck

Samuel Murray
Local time: 20:51
Member (2006)
English to Afrikaans
+ ...
What does your OCR program do? Oct 11, 2012

ND1169 wrote:
However both the source language (English) and the target language (Japanese) are in the same pdf file. Here is a screen.

Some OCR programs can extract text from PDF perfectly and still attempt to format the text like in the PDF. Does your OCR program successfully format the Japanese text in blue? If so, you can use MS Word's advanced find/replace function to remove all black text and all blue text in two versions of the file, and then align it.

