Alignment, term extraction, or manual concordance searches? Questions about speeding up workflow
Thread poster: Natron

Natron
Japan
Local time: 13:44
English to Japanese
May 21, 2013

Hello, I’m very new to translation and I use memoQ. I’m looking for advice from seasoned translators on the best plan of attack, so I don’t waste time experimenting when I could be translating. My current job is to translate procedures for a specific piece of software (Primavera P6). Since the software is made by Oracle, I figured they probably already have translated manuals for my language pair online. A quick search and success! They do exist. (Thank you, fellow translators, for doing the hard work for me!)
Eng
Jap

This will help immensely with terminology, because I’m not an expert on project management and a lot of the terms, settings, and options are already translated. The manuals are PDFs, but in what I’d call “clean pdf” format.

I want a clean list of these terms and options, but there doesn’t seem to be an easy way to automate this. Maybe I just don’t have enough experience parsing files. I’m trying to avoid running manual concordance searches as much as possible to save time (and my wrist). I don’t think Oracle would provide me with source files, glossaries, TBs, or TMs, and they’re certainly not obligated to do so.

My question is what can I do from here!?

In the little time I spent experimenting, I first converted the PDFs to docx using Acrobat.

Since I am using memoQ, I thought the LiveAlign feature might help, so that I don’t have to search repeatedly or spend a lot of time manually aligning text that I may never end up using.

The really good thing here is that memoQ at least tells me whether there is a match without my having to run a search. However, the alignment is SO bad that I often have to navigate the PDFs manually anyway, which defeats the purpose of saving time. I tried changing a bunch of the settings to improve the alignment, but to no avail.

Next I tried LF Aligner 4.04. Thank you so much for this great tool! It did a much better job, though the result still wasn’t perfect. I went ahead and added the .tmx to a separate TM anyway. This helps a little but isn’t a complete solution.

Maybe alignment is difficult because Japanese doesn’t have any delimiters? Not to mention the added problem of converted PDFs? Do I need to experiment more with segmentation rules? Do I need to study file pre-processing more before alignment?

Next, instead of alignment, I tried extracting terms so I could focus on the main terms used in these documents. However, when I ran term extraction, I got a lot of very basic English phrases:

the
in the
and
you can
you
click the
you want to
of the

Again, not very helpful, but maybe I just don’t have the settings configured properly in memoQ.

So finally, it seems the best solution is to skip the fancy alignment and build my own TB as I translate, navigating the PDFs manually after concordance searches. It doesn’t waste that much time, but it feels like an inconvenience since the translations are already there.

My wrist and I would thank you if you have any better ideas or tips.


 

FarkasAndras
Local time: 06:44
English to Hungarian
LF Aligner May 21, 2013

Natron wrote:
Next I tried LF Aligner 4.04. Thank you so much for this great tool! It did a much better job, though the result still wasn’t perfect. I went ahead and added the .tmx to a separate TM anyway. This helps a little but isn’t a complete solution.

Maybe alignment is difficult because Japanese doesn’t have any delimiters? Not to mention the added problem of converted PDFs? Do I need to experiment more with segmentation rules? Do I need to study file pre-processing more before alignment?


Hi, I'm the author of LF Aligner.
I don't speak Japanese and I don't have time to look at the pdfs now, so here are some general remarks:
- PDF is a very difficult format to work with, but you can often improve results somewhat. There are two pdf-txt conversion modes in LF Aligner and a separate third option (see the readme and the setup file). Try them all to see which one works best with your files. Then see if there is some systematic error you can correct before alignment (e.g. you may be able to remove page headers/footers with regex search&replace).
- LF Aligner has a very primitive Japanese sentence segmenter. If there is easy room for improvement, let me know (e.g. Japanese has some character that marks the end of the sentence that LF Aligner doesn't know about).
- Perfect results are usually impossible with autoalignment, but "near perfect" (95%+) is feasible if the input files are good (PDF is rarely good). Doing a full manual correction is usually not worth the time it takes. You can do a rough manual correction in the GUI editor and leave it at that. You can always come back to the txt/xls later to look up things if you need to.
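
As a rough illustration of the sentence segmentation point, here is a minimal Python sketch that splits Japanese text on common sentence-final punctuation. The punctuation set and the function name `split_japanese_sentences` are assumptions made for this example; this is not LF Aligner's actual segmenter, and it will split inside quoted speech.

```python
import re

# Sentence-final marks assumed for this sketch: ideographic full stop,
# fullwidth and ASCII exclamation/question marks. Closing quotes and
# brackets directly after the mark stay attached to the sentence.
_SENTENCE_RE = re.compile(r"[^。！？!?]*[。！？!?]+[」』）)]*|[^。！？!?]+$")


def split_japanese_sentences(text: str) -> list[str]:
    """Split text into sentences, one line at a time."""
    sentences = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        for match in _SENTENCE_RE.finditer(line):
            sentence = match.group().strip()
            if sentence:
                sentences.append(sentence)
    return sentences


print(split_japanese_sentences("[保存]をクリックします。次に、P6を起動します。"))
# ['[保存]をクリックします。', '次に、P6を起動します。']
```

Text without a final punctuation mark (a heading, for instance) is kept as its own segment rather than dropped.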


 

István Lengyel
Hungary
Local time: 06:44
English to Hungarian
fixing the alignment on the fly and stop word lists May 22, 2013

Hi Natron,

In memoQ, my suggestion would be the following: after the alignment, open the aligner interface, fix the segment pair in the middle, and then realign. Then fix the midpoint of the first half and the midpoint of the second half, realign again, and so on. You'll see the alignment improve quickly.
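
To make the order of manual fixes concrete, here is a small Python sketch that lists which segment positions to check first: the middle, then the middle of each half, and so on. It only computes positions; the realignment itself happens in memoQ. The function name `bisection_order` and its parameters are hypothetical.

```python
from collections import deque


def bisection_order(n_segments: int, max_fixes: int) -> list[int]:
    """Return 0-based segment indices to fix, midpoints first."""
    order = []
    queue = deque([(0, n_segments - 1)])
    while queue and len(order) < max_fixes:
        lo, hi = queue.popleft()
        if lo > hi:
            continue  # empty range, nothing to fix here
        mid = (lo + hi) // 2
        order.append(mid)
        # Queue both halves so each level is handled before going deeper.
        queue.append((lo, mid - 1))
        queue.append((mid + 1, hi))
    return order


print(bisection_order(1000, 7))
# [499, 249, 749, 124, 374, 624, 874]
```

With 1,000 segments, seven manual fixes already anchor the alignment roughly every 125 segments, which is why the improvement shows up quickly.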

As for the term extraction, you need to use a stop word list. There is a default EN stop word list included - just select that when you run term extraction. You can add your own words that you don't want to see, and then you won't get these as terms.
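
To show why a stop word list removes candidates like "the" or "click the", here is a minimal Python sketch of frequency-based term extraction that rejects any candidate starting or ending with a stop word. The tiny `STOP_WORDS` set and the function `candidate_terms` are assumptions for illustration; memoQ's own list and extraction logic are not reproduced here.

```python
import re
from collections import Counter

# A tiny illustrative stop word list, not memoQ's default EN list.
STOP_WORDS = {"the", "in", "and", "you", "can", "click", "want", "to", "of", "a"}


def candidate_terms(text, max_len=3, min_count=2):
    """Count 1- to max_len-word candidates, skipping stop word edges."""
    words = re.findall(r"[A-Za-z][A-Za-z0-9-]*", text.lower())
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            # Reject candidates that start or end with a stop word.
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            counts[" ".join(gram)] += 1
    return [(term, c) for term, c in counts.most_common() if c >= min_count]


sample = ("Click the Activity Details tab. In the Activity Details tab, "
          "you can assign the resource.")
for term, count in candidate_terms(sample):
    print(count, term)
```

On the sample text, "activity details tab" survives as a candidate while "the", "click the", and "you can" never appear.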

István


 

Natron
Japan
Local time: 13:44
English to Japanese
TOPIC STARTER
perfect results are not worth it May 23, 2013

I guess I should probably give up on having automated pdf alignment in the near future. The more I look into this, the more it seems to be not worth the effort.


Thank you, István. I knew I just didn't have something set up correctly; I'm still very new to translation and memoQ. I don't know why I didn't see the stop word list option at the bottom. I guess I missed it because the default list wasn't selected by default. That makes much more sense. Always nice to get terminology down before I start translating.


And Farkas, first of all, thank you again for your work on LF Aligner. I will let you know if I find something that may help.

FarkasAndras wrote:
(e.g. you may be able to remove page headers/footers with regex search&replace).


Yes, this was useful for headers/footers and did clean things up somewhat.
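
For anyone trying the same cleanup, here is a minimal Python sketch of regex-based header/footer removal on text converted from PDF. The patterns are hypothetical examples and would need adjusting to whatever actually repeats in a given file; note that the bare-number pattern would also delete any body line consisting only of digits.

```python
import re

# Hypothetical patterns for lines that repeat on every page.
HEADER_FOOTER_PATTERNS = [
    re.compile(r"^\s*\d+\s*$"),                        # bare page number
    re.compile(r"^\s*Page \d+( of \d+)?\s*$", re.I),   # "Page 3 of 120"
    re.compile(r"^\s*Copyright ©.*$", re.I),           # copyright footer
]


def strip_headers_footers(text: str) -> str:
    """Drop every line that matches one of the patterns above."""
    kept = [
        line for line in text.splitlines()
        if not any(p.match(line) for p in HEADER_FOOTER_PATTERNS)
    ]
    return "\n".join(kept)


sample = "Page 1 of 2\nClick Save.\n12\nCopyright © 2013 Example\nClose the window."
print(strip_headers_footers(sample))
# Click Save.
# Close the window.
```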

On a totally separate note, part of the problem comes from the Japanese language itself, particularly in software manual translation. Many Japanese sentences have English text or acronyms mixed in, which makes things difficult. I ran into trouble with this when I tried to separate Japanese and English in a translation that had both languages inside a single PDF.

The docx conversion did not produce a very clean XML format. I tried LibreOffice to see if I could convert the PDFs to an ODF XML format, but it interpreted them as drawings and saved them as .odg. The good news is that Japanese and English text is usually styled differently, so each has different attributes. The bad news is that in the PDF-to-ODF conversion, any line break or even a change of font (English and Japanese used together, etc.) created an entirely new text block.

I ended up using XSLT to extract the English and Japanese text separately, but the sentences are cut off at seemingly random points by new lines, and I'm not sure I got everything. Most likely, any Japanese sentences with English words in them lost those English words.
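
One possible way around the lost English words, sketched in Python rather than XSLT: first rejoin the text fragments into sentences, and only then sort each whole sentence by script, so that an acronym inside a Japanese sentence stays with it. This is an illustrative heuristic under stated assumptions (fragments arrive in reading order, and sentences end with punctuation); the function names are hypothetical. Unpunctuated lines such as headings will be merged into the following sentence.

```python
import re

# Hiragana, katakana, and common CJK ideographs: a rough test for
# "contains Japanese", not a complete definition.
_JAPANESE_CHAR = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")
_SENTENCE_END = "。．.!?！？"


def merge_fragments(fragments):
    """Rejoin text blocks until a sentence-final mark is reached."""
    merged, buffer = [], ""
    for frag in fragments:
        frag = frag.strip()
        if not frag:
            continue
        # Insert a space only between two ASCII pieces (English words).
        if buffer and buffer[-1].isascii() and frag[0].isascii():
            buffer += " "
        buffer += frag
        if buffer[-1] in _SENTENCE_END:
            merged.append(buffer)
            buffer = ""
    if buffer:
        merged.append(buffer)  # keep any unterminated trailing text
    return merged


def split_by_language(sentences):
    """A sentence with any Japanese character counts as Japanese."""
    japanese, english = [], []
    for sentence in sentences:
        target = japanese if _JAPANESE_CHAR.search(sentence) else english
        target.append(sentence)
    return japanese, english


blocks = ["Click the", "Save button.", "P6", "を起動します。"]
ja, en = split_by_language(merge_fragments(blocks))
print(ja)  # ['P6を起動します。']
print(en)  # ['Click the Save button.']
```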

I will keep experimenting, but there's only so much time I can put into that when there are translation due dates ahead.

Thank you again for your responses.


 

Fernando Toledo
Germany
Local time: 06:44
German to Spanish
May 24, 2013



[Edited at 2013-05-24 16:48 GMT]


 

