Best way to add a bilingual pdf file into a TM
Thread poster: ND1169

Local time: 01:41
English to Japanese
Oct 11, 2012

Hi, I have some previously translated files, outsourced to an unknown company, that exist only as PDFs. I would like to incorporate them into a new translation memory I have made. However, both the source language (English) and the target language (Japanese) are in the same PDF file. Here is a screen

What is the most efficient way to go about doing this?

Obviously the easiest way would be to get in contact with whoever translated this and get a TMX file, but that's not possible.

Should I just copy and paste like a madman into Excel? That seems very time-consuming, but possible.

I've tried to align the files, but the aligner puts both languages in both source and target, and since these are PDFs with lots of tables etc., I've had formatting trouble that doesn't seem worth my time. I've also tried LF Aligner and I get the same formatting errors, though I can at least get the data into an XLS file. Is there an easy way to select all Roman characters or all Japanese characters in an Office application like Excel or Word and delete them? For example, select the target row, then delete all Japanese characters?

Or should I just convert it to .txt and/or .doc first and try it that way? That said, the problem of both languages appearing in both source and target won't go away.

Thanks for any and all help!


Local time: 18:41
English to Hungarian
Separate Oct 11, 2012

One thing you could try to separate the EN and the JP text is to OCR it with ABBYY or some other OCR software and hope that the text colours are recognized consistently. Then you could use "select text with similar formatting" in Word to select all the JP text and cut and paste it to a new doc.
Theoretically, it should also be possible to separate the texts based on their differing character sets using regex, e.g. in Notepad++. If the JP text is always on the second half of each line, the regex would be:
If this works, you end up with a tab-separated text file which is ready for converting into a TMX or importing into Studio as a bilingual file.
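A minimal sketch of the character-set idea in Python rather than Notepad++, assuming each line holds the English text first and the Japanese text second. The Unicode ranges used to define "Japanese character" and the function names `split_line` and `to_tsv` are assumptions made for illustration, not something taken from the poster's file or from any tool mentioned here.

```python
import re

# Assumed definition of "Japanese character" for this sketch:
# CJK punctuation, hiragana and katakana (U+3001-U+30FF),
# CJK ideographs (U+3400-U+4DBF, U+4E00-U+9FFF),
# and full-width / half-width forms (U+FF00-U+FFEF).
JP_CHARS = "\u3001-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uff00-\uffef"
FIRST_JP = re.compile(f"[{JP_CHARS}]")


def split_line(line):
    """Split one line at the first Japanese character.

    Returns (english, japanese); japanese is "" if none was found.
    """
    line = line.strip()
    match = FIRST_JP.search(line)
    if match is None:
        return line, ""
    return line[:match.start()].rstrip(), line[match.start():]


def to_tsv(lines):
    """Build tab-separated source/target rows, skipping incomplete lines."""
    rows = []
    for line in lines:
        source, target = split_line(line)
        if source and target:
            rows.append(f"{source}\t{target}")
    return "\n".join(rows)


if __name__ == "__main__":
    sample = [
        "Press the power button. 電源ボタンを押します。",
        "No Japanese on this line",
    ]
    print(to_tsv(sample))
```

Because the split happens only at the first Japanese character, Latin product names inside the Japanese half stay with the target text, but any Japanese character inside the English half would cut the line in the wrong place, so the output still needs a manual check.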

Both solutions are error prone depending on what your file looks like.

[Edited at 2012-10-11 10:04 GMT]


Stanislaw Czech, MCIL  Identity Verified
United Kingdom
Local time: 17:41
Member (2006)
English to Polish
I would outsource it Oct 11, 2012

I don't know how useful such a TM would be to you. If you expect a lot of matches, then it could make sense to outsource the job to someone who will copy the text into two columns in Excel. Once that is done, you could take over and align the text.

In my experience, there are websites where you can find someone willing to undertake such a simple job at a fraction of a translator's hourly pay.

Alternatively, you could do nothing: copy these files into a single directory and run a search on the contents of the files (for instance in Windows Explorer) when you need to find a particular term.

Good luck


Samuel Murray  Identity Verified
Local time: 18:41
Member (2006)
English to Afrikaans
What does your OCR program do? Oct 11, 2012

ND1169 wrote:
However both the source language (English) and the target language (Japanese) are in the same pdf file. Here is a screen.

Some OCR programs can extract text from a PDF perfectly and still attempt to format the text as it appears in the PDF. Does your OCR program successfully format the Japanese text in blue? If so, you can use MS Word's advanced find/replace function to remove all the black text from one copy of the file and all the blue text from another copy, and then align the two.

