Import EDICT or other dictionary into memoQ as TM or other resource
Thread poster: pmzeitler
pmzeitler
United States
Japanese to English
Jun 3, 2016

I'm a relative neophyte to translation memory software, but I've tinkered with the memoQ trial long enough to know I want to go forward using it. However, I can't shake the feeling that I'm either missing some element of the software that would make my life easier, or that I'm using the software incorrectly.

Here's my current process: (SL Japanese, TL American English)
1. Scan the page using a flatbed scanner, and OCR it in FineReader. Repeat for all pages in the project.
2. Load the OCR output (plaintext) into memoQ and import my existing TM files.
3. Line by line read the SL text and compose a translation based on suggestions provided by the TM and/or going out to Bing Translate API (non-optimal solution as it's metered access and I'm poor at the moment after shelling out for memoQ, plus it requires an Internet connection).
3a. If I encounter a kanji I'm not familiar with (there are a few of these, given that I'm in the third year of a 4-year college level Japanese program), copy it to the clipboard and paste into a dictionary app (outside of memoQ, local to the computer) for a definition.
3b. Manually create a new TM entry (CTRL-Q?) for the kanji and any variants of its appearance (e.g. for adjectivals, both the -i and -ku forms).
4. Refine the translation for readability.

Where I think I'm Doing It Wrong is in steps 3a and 3b. I think there may be a way to have memoQ do the lookup based on a dictionary file stored locally. Such a file exists-- Jim Breen's EDICT file-- but if memoQ can read its format, it is not obvious to me how. EDICT is in a somewhat proprietary XML format, but Mr. Breen does provide the DTD, so I can probably extract the data and transform it into a well-known import format that memoQ can digest.

(Another post on this topic, which yielded no answers on how this extraction could be done, was looking to extract the data from WWWJDIC, which violated that package's licensing. Based on my reading of the EDICT license, I believe that transforming the data does NOT violate the license, and that even if I were to distribute the data in memoQ-readable format, I would still not be in violation of the EDICT license, assuming I provided the appropriate attribution. If I am in error on this point as well, please say so.)

This may seem like I'm too deep in the weeds on this. I expect I am, actually. However, even with the manual entry into the TM, using memoQ has dramatically accelerated the pace at which I can produce work. For a ten-page project I had to complete last semester, it took me roughly twenty hours to do the translation by hand, with only an electronic dictionary available to me. I chewed through a lot of time in kanji identification and lookup. As a test of memoQ, I loaded three pages (of a different source text of equivalent difficulty) in, and was through two of them in less than an hour, even with the copy-lookup method. I'm sticking with the software, but if it can give me that little push I need to stay in one program, I'll be even happier with it.

Thank you for your time and insight.



Michael Joseph Wdowiak Beijer
United Kingdom
Dutch to English
short on time at the moment, but ... Jun 3, 2016

No time right now, but it ought to be relatively simple to convert the data (e.g. this file: http://ftp.monash.edu.au/pub/nihongo/edict.zip ) into a format memoQ can understand/import. You should also ask over at https://groups.yahoo.com/neo/groups/memoQ/info

I also vaguely remember that quite a few Japanese/English translators hang out over at https://groups.google.com/forum/#!forum/felix-users (where you also might want to ask, as someone there might already have converted the data)
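
One possible shape for that conversion, as a minimal sketch in Python rather than a tested recipe. It assumes the legacy EDICT plain-text layout (one entry per line, `HEADWORD [READING] /gloss/gloss/`, encoded as EUC-JP) and writes tab-separated rows. Whether memoQ accepts this exact layout for a term base import is an assumption to check in your version; the function names and file paths are made up for illustration.

```python
import re

# Assumed legacy EDICT line layout: HEADWORD [READING] /gloss 1/gloss 2/
# The reading is absent for kana-only entries.
EDICT_LINE = re.compile(
    r"^(?P<head>\S+)(?:\s+\[(?P<reading>[^\]]+)\])?\s+/(?P<glosses>.*)/\s*$"
)


def parse_edict_line(line):
    """Return (headword, reading, glosses) or None if the line does not match."""
    match = EDICT_LINE.match(line)
    if match is None:
        return None
    glosses = [
        g for g in match.group("glosses").split("/")
        # Drop the "(P)" priority marker and EDICT2-style entry IDs.
        # Caveat: a gloss that itself contains "/" would be split in two.
        if g and g != "(P)" and not g.startswith("EntL")
    ]
    return match.group("head"), match.group("reading") or "", glosses


def to_tsv_row(entry):
    """Join one parsed entry into a source<TAB>reading<TAB>target row."""
    headword, reading, glosses = entry
    return "\t".join([headword, reading, "; ".join(glosses)])


def convert_file(src_path, dst_path):
    """Convert an EDICT file (assumed EUC-JP) to a UTF-8 tab-separated file."""
    with open(src_path, encoding="euc_jp") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            entry = parse_edict_line(line.rstrip("\n"))
            if entry is not None:
                dst.write(to_tsv_row(entry) + "\n")
```

A call such as `convert_file("edict", "edict.tsv")` would then produce a file with one row per dictionary entry; both file names are placeholders.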

Michael


MaxOO
Japan
For the time being Jun 5, 2016

Hi,

I'm also a Japanese-English translator, and have been trying to convert EDICT into some kind of text file for use in CATs.

Since I haven't found a solution yet, I recommend using a Japanese/English dictionary viewer called Logophile (though you may already know it ...).

It can import EDICT, and it can be triggered by a copy-and-paste action. So, if you encounter an unfamiliar Japanese word or phrase, just copy it in memoQ and switch to that program's window.


Cheers,
M


pmzeitler
United States
Japanese to English
TOPIC STARTER
Further questions Jun 9, 2016

Thanks for the insight, folks.

Right now I'm using Takoboto, a free Win10 dictionary that uses EDICT as its primary data source. I'd like to be able to have everything integrated into memoQ, but that may not be something I can do-- I may want to look into writing a plugin or some other extension to do that.

One of the things that memoQ doesn't really do right now (or that I'm missing) is understand conjugations and politeness levels; for example, 思う and 思います are functionally equivalent but have differing politeness levels. This is something that Takoboto and my previous dictionary app (imiwa? on iOS) did handle properly. Is this just something I need to teach memoQ?


MaxOO
Japan
Simple terms and CAT tools Jun 11, 2016

>> Right now I'm using Takoboto, a free Win10 dictionary that uses EDICT as its primary data source. I'd like to be able to have everything integrated into memoQ, but that may not be something I can do-- I may want to look into writing a plugin or some other extension to do that.

Another dictionary viewer that can be used to convert EDICT into a text file is DDWin. To do so, however, you first have to convert the EDICT data into EPWING format, which I've been unable to do so far. DDWin can export the whole of the imported EPWING data as plain text, which you can then compile for use in memoQ.

>> One of the things that memoQ doesn't really do right now (or that I'm missing) is understand conjugations and politeness levels; for example, 思う and 思います are functionally equivalent but have differing politeness levels. This is something that Takoboto and my previous dictionary app (imiwa? on iOS) did handle properly. Is this just something I need to teach memoQ?

If you want memoQ to catch (or ignore) all the conjugations such as 思う and 思います, a possible solution is to add 思う as a term with the 50% prefix match setting (or possibly as a fuzzy term). However, there is a downside: the prefix is so short (50% of 思う is just 思) that memoQ will also catch 思考, 思想, etc., which are often irrelevant and can make the translation results pane very "noisy."

To be more precise, you may need to add all these variations as synonyms (alternative entries).
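
Typing those variants by hand gets tedious, so here is a rough sketch in Python of how a few of them could be generated for pasting into a term base as synonyms. It is deliberately simplified: you have to say whether a verb is godan or ichidan, irregular words (する, 来る, いい and so on) are not handled, and only a handful of forms are produced. The function names are made up for illustration.

```python
# Final kana of a godan verb -> final kana of its ます-stem.
GODAN_MASU_STEM = {
    "う": "い", "く": "き", "ぐ": "ぎ", "す": "し", "つ": "ち",
    "ぬ": "に", "ぶ": "び", "む": "み", "る": "り",
}


def polite_variants(dictionary_form, verb_class):
    """Return the plain form plus a few polite forms of a regular verb."""
    if verb_class == "godan":
        last = dictionary_form[-1]
        if last not in GODAN_MASU_STEM:
            raise ValueError("not a godan dictionary form: " + dictionary_form)
        stem = dictionary_form[:-1] + GODAN_MASU_STEM[last]
    elif verb_class == "ichidan":
        if not dictionary_form.endswith("る"):
            raise ValueError("not an ichidan dictionary form: " + dictionary_form)
        stem = dictionary_form[:-1]
    else:
        raise ValueError("verb_class must be 'godan' or 'ichidan'")
    return [dictionary_form, stem + "ます", stem + "ません", stem + "ました"]


def i_adjective_variants(adjective):
    """Return the -i form plus the -ku, past and negative forms."""
    if not adjective.endswith("い"):
        raise ValueError("not an i-adjective: " + adjective)
    stem = adjective[:-1]
    return [adjective, stem + "く", stem + "かった", stem + "くない"]


if __name__ == "__main__":
    print(polite_variants("思う", "godan"))
    print(i_adjective_variants("高い"))
```

Each list can then be entered as alternative source terms for one target term, which avoids the over-matching that a one-character prefix causes.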

Please also note that memoQ and other so-called CAT tools are "character/word-based" matching tools and do not consider the functional or semantic equivalence of words or phrases. When Japanese is the source language in particular, they perform termbase look-ups on a "character" basis, because Japanese has no word separator like the white space used in English.
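
To make the "noise" concrete, here is a toy model in Python of character-based prefix matching on unsegmented text. It is only an illustration of the idea and not memoQ's actual algorithm; the `prefix_hits` function and its `ratio` parameter are hypothetical.

```python
import math


def prefix_hits(term, text, ratio=0.5):
    """Return every position in text where the leading part of term occurs.

    The prefix length is ratio * len(term), rounded up, with a minimum of
    one character. No word boundaries are used, because unsegmented
    Japanese text does not provide any.
    """
    prefix_length = max(1, math.ceil(len(term) * ratio))
    prefix = term[:prefix_length]
    return [i for i in range(len(text)) if text.startswith(prefix, i)]


if __name__ == "__main__":
    sentence = "彼の思想について思います"
    # The dictionary form 思う does not occur literally in the sentence,
    # yet its one-character prefix 思 is found twice: once inside the
    # unrelated noun 思想 and once in the wanted verb form 思います.
    print("思う" in sentence)
    print(prefix_hits("思う", sentence))
```

With a two-character term, a 50% prefix is a single kanji, so every compound that starts with that kanji becomes a hit.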

You may also want to try a desktop Japanese-English machine translation tool that "ignores" or "absorbs" conjugations and also comes with built-in glossaries with millions of entries, such as Honyaku-Pikaichi (翻訳ピカイチ) or Korya-Eiwa (こりゃ英和).

I personally use memoQ and such an MT tool side by side as needed, just to proceed without having to spend time and labor on adding simple, basic terms into a memoQ termbase, which is really laborious.

[Edited at 2016-06-11 17:32 GMT]


MaxOO
Japan
Complete procedure Jun 23, 2016

Here is a complete procedure for converting EDICT into text format.

1. Download DDWin (dic viewer).
http://homepage2.nifty.com/ddwin/

2. Download EDICT 2 in EPWING format, and click on the supplied app file to extract data.
http://www.vector.co.jp/soft/data/writing/se369320.html

3. Open DDWin, go to ファイル > 辞書をサーチする, select サーチするドライブ ("C" etc.), and set サーチする深さ (if you downloaded the EDICT file to the download folder, "5" should be enough).

4. When the EDICT tab appears, click the 全文 tab above the search box, then click the 検索 button with the search box left blank.

5. When the search results appear, go to ファイル > 編集 > エディタ起動. Select ファイル名 (the export path). Then, in the 出力する内容 box, select 該当項目すべて and click OK.

Done.


The exported text file looks like this:

防水 ぼうすい [n,vs,adj-no](P)
waterproofing; making watertight

防水シート ぼうすいシート [n]
(See 防水布) waterproof sheet; tarpaulin; tarp; flysheet (of a tent)


You will still need to do some reorganizing and reformatting before importing it into memoQ.
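
As a starting point for that reformatting, here is a minimal Python sketch. It assumes the export looks exactly like the two sample entries above (a headword line, then a gloss line, with entries separated by blank lines) and turns each entry into a tab-separated row. Real exports may contain entries that do not fit this pattern, and whether memoQ accepts this exact column layout for import is an assumption to verify; the function names are made up for illustration.

```python
import re


def parse_ddwin_export(text):
    """Split exported text into (headword, reading, glosses) tuples."""
    entries = []
    # Entries are assumed to be separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) < 2:
            continue  # skip anything without both a headword and a gloss line
        head_parts = lines[0].split()
        headword = head_parts[0]
        # Kana-only entries may have no separate reading, so the second
        # token could already be the "[n]"-style tag block.
        if len(head_parts) > 1 and not head_parts[1].startswith("["):
            reading = head_parts[1]
        else:
            reading = ""
        glosses = [g.strip() for g in " ".join(lines[1:]).split(";") if g.strip()]
        entries.append((headword, reading, glosses))
    return entries


def to_tsv(entries):
    """Render entries as headword<TAB>reading<TAB>glosses, one per line."""
    return "\n".join(
        "\t".join([headword, reading, "; ".join(glosses)])
        for headword, reading, glosses in entries
    )


if __name__ == "__main__":
    sample = (
        "防水 ぼうすい [n,vs,adj-no](P)\n"
        "waterproofing; making watertight\n"
        "\n"
        "防水シート ぼうすいシート [n]\n"
        "(See 防水布) waterproof sheet; tarpaulin; tarp; flysheet (of a tent)\n"
    )
    print(to_tsv(parse_ddwin_export(sample)))
```

Cross-references such as "(See 防水布)" are left inside the first gloss here; stripping them would be a further cleanup step.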

