Text extraction tool
Thread poster: Andrey Slyadnev

Andrey Slyadnev  Identity Verified
Russian Federation
Local time: 22:27
English to Russian
Jan 5, 2015

Hi folks,

I have an En-Ru bilingual text in one pdf file which is doc convertable
and I would like to align both languages to build a TMX.

Is there any tool available that could extract English or Russian text from
a bilingual file?

I would really appreciate any tips or leads.

Best,
Andrey



FarkasAndras
Local time: 16:27
English to Hungarian
Wrong question Jan 5, 2015

I can answer the question you asked: yes, there is a tool designed to extract both the English and the Russian text from your file in preparation for alignment.
However, that is the wrong question and thus the wrong answer. Given that we are talking about only one file and not 100 or 1000, a manual solution is bound to be faster. You haven't told us anything about the file, so specific advice is impossible. Whatever you mean by "doc convertable" is only clear to you. All pdf files can be converted to doc in a couple of ways, and most of the time the result is crap. Go the doc route or go straight to text: try Ctrl-A, Ctrl-C and copy all text to a text editor. See if it looks acceptable. Also try File/Save As/Text and compare the result. You can also download and try pdftotext.exe, which is part of xpdf.
When you have a reasonable text file to work with, you can start thinking about separating the two texts, which may be fairly easy. One trivial solution is, if they are not mixed on the same line:
Replace all \t with a space, then replace
^(.*[принятойонституционным])
with
\t\1
in a good text editor e.g. Notepad++.
Then all Russian sentences should have a tab in front of them. Copy to Excel: column A is English, B is Russian. It should require very little cleanup.
Again, this is just one option. You gave us no information so we can't give you much back.
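
For readers who would rather script it than use an editor, the two replacements above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the character class is widened to the full Russian alphabet (the class in the post is a partial sample), and the function name and sample text are made up for the example.

```python
import re

# Assumption: the full Russian alphabet as a character class.
# Ё/ё sit outside the А-я range, so they are listed separately.
CYRILLIC = "А-Яа-яЁё"

def mark_russian_lines(text: str) -> str:
    """Prefix every line that contains a Cyrillic letter with a tab."""
    # Step 1: existing tabs would be confused with the marker, so replace them.
    text = text.replace("\t", " ")
    # Step 2: same idea as replacing ^(.*[...]) with \t\1 in the editor.
    # MULTILINE makes ^ match at the start of every line.
    return re.sub(rf"^(.*[{CYRILLIC}])", r"\t\1", text, flags=re.MULTILINE)

sample = "Annual report\nГодовой отчёт\nPage 3\tof 10\n"
marked = mark_russian_lines(sample)
print(marked)
```

As in the editor version, this only works if no line mixes both languages.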



[Edited at 2015-01-05 09:20 GMT]



Samuel Murray  Identity Verified
Netherlands
Local time: 16:27
Member (2006)
English to Afrikaans
What I would do Jan 5, 2015

Andrey Slyadnev wrote:
I have an En-Ru bilingual text in one pdf file ... and I would like to align both languages to build a TMX.


I know of no tool that can extract such text easily.

What I would do is OCR the PDF file using an OCR program that is capable of reading two languages, e.g. ABBYY FineReader. I would save the OCR'ed text as plain text, and then make two copies of it (one for EN, one for RU). Then I would use regex to remove all Latin characters from the RU file, and remove all Cyrillic characters from the EN file, and then align them in an interactive alignment tool (my personal preference being PlusTools). The advantage of using an OCR program to extract the text is that it will fix most line break errors (otherwise you'd have to fix them manually).
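
The "two copies" step could look something like this in Python. It is a minimal sketch, not a tested workflow: the character classes are assumptions (plain A-Z/a-z for Latin, the Unicode Cyrillic block U+0400 to U+04FF for Cyrillic), and accented Latin letters would need a wider class.

```python
import re

# Assumed character classes; adjust to the actual content of the file.
LATIN = re.compile(r"[A-Za-z]")
CYRILLIC = re.compile(r"[\u0400-\u04FF]")

def split_by_script(text: str) -> tuple[str, str]:
    """Return (EN copy, RU copy) by deleting the other script's letters."""
    en_copy = CYRILLIC.sub("", text)  # drop Cyrillic letters for the EN file
    ru_copy = LATIN.sub("", text)     # drop Latin letters for the RU file
    return en_copy, ru_copy

text = "Revenue grew 5%.\nВыручка выросла на 5%.\n"
en_copy, ru_copy = split_by_script(text)
print(repr(en_copy))
print(repr(ru_copy))
```

Digits, spaces and punctuation belong to neither class, so they survive in both copies and have to be cleaned up before or during alignment.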



FarkasAndras
Local time: 16:27
English to Hungarian
I wouldn't Jan 5, 2015

Samuel Murray wrote:

Andrey Slyadnev wrote:
I have an En-Ru bilingual text in one pdf file ... and I would like to align both languages to build a TMX.


I know of no tool that can extract such text easily.

What I would do is OCR the PDF file using an OCR program that is capable of reading two languages, e.g. ABBYY FineReader. I would save the OCR'ed text as plain text, and then make two copies of it (one for EN, one for RU). Then I would use regex to remove all Latin characters from the RU file, and remove all Cyrillic characters from the EN file, and then align them in an interactive alignment tool (my personal preference being PlusTools). The advantage of using an OCR program to extract the text is that it will fix most line break errors (otherwise you'd have to fix them manually).

That's fraught with possible problems. First, OCR is never perfect, so if the file is not a scanned document, OCR should be avoided. Fixing the line break issue is easy in Word or a text editor (replace two successive line breaks with a special string, remove all line breaks, replace your special string with a line break). Second, OCR programs work better if they know what character set to expect, so a mixed file is likely to produce worse output than usual. Third, just removing Cyrillic/Latin is bound to generate imperfect results, possibly terrible ones. Numbers from the RU text will be left in the EN text, and the RU text will likely contain some stray Latin characters that will be left in the EN and removed from the RU. Add to that the errors from OCR and you have a mess. See a possible solution in my previous post.
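
The three-step line break fix mentioned above can be sketched in Python like this. The placeholder string is arbitrary (any string that cannot occur in the text will do), and one detail differs from the description: single line breaks are replaced with a space rather than deleted outright, so that words at line ends do not get glued together.

```python
# Hypothetical marker; pick anything that never occurs in the real text.
PLACEHOLDER = "@@PARA@@"

def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines while keeping paragraph boundaries."""
    text = text.replace("\r\n", "\n")          # normalise Windows line ends
    text = text.replace("\n\n", PLACEHOLDER)   # 1. protect paragraph breaks
    text = text.replace("\n", " ")             # 2. remove the remaining breaks
    return text.replace(PLACEHOLDER, "\n")     # 3. restore paragraph breaks

wrapped = "The company was\nfounded in 1990.\n\nIt closed\nin 2013."
print(unwrap_paragraphs(wrapped))
```

This assumes paragraphs are separated by exactly one empty line; runs of three or more line breaks would need an extra normalisation step.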



Samuel Murray  Identity Verified
Netherlands
Local time: 16:27
Member (2006)
English to Afrikaans
@Farkas Jan 5, 2015

FarkasAndras wrote:
Then all Russian sentences should have a tab in front of them.


Assuming that all Russian "sentences" end in a full stop, and assuming that every Russian sentence is followed by one English sentence, and not followed by anything else.

Again, this is just one option. You gave us no information so we can't give you much back.


I agree... it really depends on what the file looks like. My only assumption in my advice was that the Russian content was in the same sequence as the English content, but whether they are in separate columns, or within the same paragraphs, I don't know, and that would make a huge difference.

That's fraught with possible problems. First, OCR is never perfect, so if the file is not a scanned document, OCR should be avoided. Fixing the line break issue is easy in Word or a text editor (replace two successive line breaks with a special string, remove all line breaks, replace your special string with a line break).


My experience with PDF files has been that an OCR program is more capable of recognising blocks of text than a manual find/replace activity. Your method only works on clean paragraph text, and not e.g. if there are many lists or tables.

Second, OCR programs work better if they know what character set to expect, so a mixed file is likely to produce worse output than usual.


Well, that is why you need to use an OCR program that allows the user to indicate what languages the file is in. I must add that I have no experience in extraction from PDFs with multiple character sets, and only with PDFs with multiple languages.

Third, just removing Cyrillic/Latin is bound to generate imperfect results, possibly terrible ones. Numbers from the RU text will be left in the EN text, and the RU text will likely contain some stray Latin characters that will be left in the EN and removed from the RU.


That is true, but that is why you need an interactive aligner... so that you can visually separate the good content from the bad content. Some of the problems that you mention can also be solved with slightly more complex find/replace actions. For example, one can assume that any Latin "word" (numbers or letters, plus spaces or punctuation marks) that has at least two Cyrillic "words" on both sides is most probably part of the RU text and not a portion of EN text.
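
The "two Cyrillic words on both sides" rule is easier to express in a script than in a find/replace dialog. Below is a rough Python sketch of one way to read it. The tokenisation (split on whitespace), the Cyrillic test (any character from U+0400 to U+04FF) and the function name are all assumptions made for illustration.

```python
import re

CYRILLIC = re.compile(r"[\u0400-\u04FF]")  # assumed range for Cyrillic letters

def russian_tokens(text: str, context: int = 2) -> list[str]:
    """Keep Cyrillic tokens, plus non-Cyrillic runs embedded in Russian text."""
    tokens = text.split()
    is_cyr = [bool(CYRILLIC.search(t)) for t in tokens]
    kept = []
    i = 0
    while i < len(tokens):
        if is_cyr[i]:
            kept.append(tokens[i])
            i += 1
            continue
        # Find the end of this run of non-Cyrillic tokens (Latin words, numbers).
        j = i
        while j < len(tokens) and not is_cyr[j]:
            j += 1
        before = is_cyr[max(0, i - context):i]
        after = is_cyr[j:j + context]
        # Keep the run only if `context` Cyrillic tokens surround it on both sides.
        if len(before) == context and all(before) and len(after) == context and all(after):
            kept.extend(tokens[i:j])
        i = j
    return kept

line = "Компания выпустила iPhone 6 в сентябре прошлого года. The company released it."
print(" ".join(russian_tokens(line)))
```

A number or Latin name at the very start or end of a Russian sentence fails the test and is dropped, so the output still needs a visual check in the aligner.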



Andrey Slyadnev  Identity Verified
Russian Federation
Local time: 22:27
English to Russian
TOPIC STARTER
Adding more details Jan 5, 2015

FarkasAndras, Samuel, thank you, guys, for your prompt responses
and sharing your expertise.

I have about 50 issues of a bilingual magazine of a company that ceased to exist in 2013.
The original files are in pdf but they can be easily converted into doc files without any effect on the original layout or formatting. However, the layout is what makes alignment a challenge. The Russian content either follows the English content or they are in separate columns.

I tried TextMiningTool to extract text from the pdf; however, in this case it has no advantage over just saving the same pdf as a txt file.

Hope this will give a better idea now.



Samuel Murray  Identity Verified
Netherlands
Local time: 16:27
Member (2006)
English to Afrikaans
Perhaps if Farkas could see the file... Jan 5, 2015

Andrey Slyadnev wrote:
I have about 50 issues of a bilingual magazine... converted into doc files without any effect on the original layout or formatting. However, the layout is what makes alignment a challenge. The Russian content either follows the English content or they are in separate columns.


One reason why the layout could be a "challenge" is that the DOC version uses text boxes to position the text correctly, and it is very difficult to copy/paste text from text boxes in MS Word (in my experience).

What I would do is to run the DOC file through a CAT tool's "text extraction" utility, so that all paragraphs/sentences that occur in the document are converted to a plain text file (Wordfast Classic's "Extract" feature works without limitation in demo mode).

Then, I would do what I proposed earlier, namely to create two versions of the file, and remove all Latin letters from the RU file (in MS Word, the FIND string is [A-zÀ-ÿ], with wildcards enabled), and remove all Cyrillic letters from the EN file (in MS Word, the FIND string is [Ѐ-ӿԀ-ԓ], with wildcards enabled), and then manually clean the two extraction files a bit (as there will be left-over punctuation marks and numbers in the sections from which text was deleted).

Then use an aligner to align the files. Unfortunately this is going to take some manual fixing.

== Added:

A second option is to perform the text extraction, and then simply create two identical copies of the extracted text, and then align them. Then you simply delete all Latin chunks of text from the one column and all Cyrillic chunks of text from the other column, while aligning the remainder of the segments. It may be marginally faster to do that than to attempt to delete the Latin and Cyrillic text using find/replace.



[Edited at 2015-01-05 12:38 GMT]



FarkasAndras
Local time: 16:27
English to Hungarian
nope Jan 5, 2015

Samuel Murray wrote:

FarkasAndras wrote:
Then all Russian sentences should have a tab in front of them.


Assuming that all Russian "sentences" end in a full stop, and assuming that every Russian sentence is followed by one English sentence, and not followed by anything else.


No, my solution doesn't rely on any of those assumptions. All it needs is: no mixed text in any single line, all RU lines contain at least one Cyrillic character and EN lines have no Cyrillic characters.

I would have to see the files to come up with specific solutions, but dual column files are usually a royal pain to process. Depending on how the file was created, various bits of text might end up in the wrong place after export/conversion. I would probably discard files with a two-column layout unless they're indispensable.

[Edited at 2015-01-05 13:23 GMT]



Andrey Slyadnev  Identity Verified
Russian Federation
Local time: 22:27
English to Russian
TOPIC STARTER
Expanding technical limits Feb 10, 2015

Samuel, Andras,

just wanted to drop a line to share the results of my experiment and some thoughts.

I tried two different methods to make a TM from a bilingual (En-Ru) text with a complex layout:

1. Manual picking of individual language segments and their alignment with alignment tools.
vs
2. Text extraction, removal of Cyrillic characters from the En file and vice versa for the Ru file, and then alignment with alignment tools (Samuel's way).

Both methods turned out to be quite time consuming. As a matter of fact, it took me about 3 hours just to sort out segments properly into English and Russian files as per Method 1 (total word count per file: 7K words). On the other hand, Method 2 required a lot of effort at the alignment stage as most of the segments were poorly aligned due to their irregular placement in the original text. I admit that advanced alignment skills would have somewhat increased the performance of Method 2 but would not have eliminated the effect of the text layout.

The alignment task I tried to tackle is probably an extreme case for the available tools and may require a different approach (technology). If I had to design a proper solution for this case, I would look for a technology capable of language recognition, parsing and tokenization to automatically identify and align sentences in two given languages based on 2 or more keywords, entity names or numerals, regardless of text formatting and layout.
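
To make the idea of layout-independent alignment a little more concrete, here is a toy Python sketch that pairs sentences purely by the numerals they share. It is a hypothetical illustration, not an existing tool: the function names, the greedy matching and the sample sentences are all invented, and a usable aligner would also need keywords, entity names and a bilingual dictionary.

```python
import re

NUMERAL = re.compile(r"\d+(?:[.,]\d+)?")  # integers and simple decimals

def anchors(sentence: str) -> set[str]:
    """Numerals in a sentence; these look the same in both languages."""
    return set(NUMERAL.findall(sentence))

def pair_by_anchors(en_sentences, ru_sentences):
    """Greedily pair each EN sentence with the unused RU sentence sharing most numerals."""
    pairs, used = [], set()
    for en in en_sentences:
        en_anchors = anchors(en)
        best, best_score = None, 0
        for idx, ru in enumerate(ru_sentences):
            if idx in used:
                continue
            score = len(en_anchors & anchors(ru))
            if score > best_score:
                best, best_score = idx, score
        if best is not None:  # sentences without a shared numeral stay unpaired
            used.add(best)
            pairs.append((en, ru_sentences[best]))
    return pairs

en = ["In 2012 the plant produced 350 units.", "The office opened in 1998."]
ru = ["Офис открылся в 1998 году.", "В 2012 году завод выпустил 350 единиц."]
pairs = pair_by_anchors(en, ru)
for en_s, ru_s in pairs:
    print(en_s, "|", ru_s)
```

Because the pairing ignores position, it does not care whether the Russian text follows the English or sits in another column, but sentences without numerals are simply left out.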



FarkasAndras
Local time: 16:27
English to Hungarian
You didn't try Feb 10, 2015

... the solution that is most likely to work:

When you have a reasonable text file to work with, you can start thinking about separating the two texts, which may be fairly easy. One trivial solution is, if they are not mixed on the same line:
Replace all \t with a space, then replace
^(.*[принятойонституционным])
with
\t\1
in a good text editor e.g. Notepad++.
Then all Russian sentences should have a tab in front of them. Copy to Excel: column A is English, B is Russian. It should require very little cleanup.

Obviously, replace the string of Cyrillic characters in the [] with the whole alphabet.



Andrey Slyadnev  Identity Verified
Russian Federation
Local time: 22:27
English to Russian
TOPIC STARTER
Need to be educated Feb 11, 2015

Andras, excuse my ignorance but I would appreciate it if you could explain the steps described below in more detail. You suggested that all of them could be performed in Notepad++?

Step 1: Replace all \t with a space
(what is the objective?)

Step 2: Replace ^(.*[абвгдежзийклмнопрст]) with \t\1
(Does this code replace all Cyrillic characters in the [] with \t\1?
What is \t\1 and what is the difference between \t and \t\1?)

I tried to perform these steps on a sample bilingual text in Notepad++ but got a "0 occurrences was replaced" message. Am I doing something wrong?



FarkasAndras
Local time: 16:27
English to Hungarian
regex Feb 11, 2015

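These are regular expressions, so enable regular expressions in the Notepad++ search and replace dialog when running them. With regular expressions switched off, the search string is treated as literal text, which is why you get 0 replacements.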

The first one removes tab characters because we'll be using those to separate the texts in the next step. Existing tabs would mess things up.


\t\1 replaces the text with a tab and everything that was enclosed in () in the search box. I.e. in this case it inserts a tab at the beginning of every line that has a Cyrillic character in it. This way you can copy the text to Excel and separate the texts there.
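
If a spreadsheet is not at hand, the separation step can also be scripted. The Python sketch below takes text in which Russian lines already start with a tab and sorts the lines into two lists; the function name and sample are made up, and pairing the lists row by row assumes the English and Russian lines correspond one to one, which is not guaranteed for a real file.

```python
from itertools import zip_longest

def split_marked_lines(marked: str) -> tuple[list[str], list[str]]:
    """Split tab-marked text into (English lines, Russian lines)."""
    en_lines, ru_lines = [], []
    for line in marked.splitlines():
        if not line.strip():
            continue  # skip empty lines
        if line.startswith("\t"):
            ru_lines.append(line[1:])  # drop the marker tab
        else:
            en_lines.append(line)
    return en_lines, ru_lines

marked = "Annual report\n\tГодовой отчёт\nContents\n\tСодержание\n"
en_lines, ru_lines = split_marked_lines(marked)

# Assumption: line N of one list translates line N of the other.
rows = list(zip_longest(en_lines, ru_lines, fillvalue=""))
for en_line, ru_line in rows:
    print(f"{en_line}\t{ru_line}")
```

The printed rows are tab-separated, so they can be pasted into a spreadsheet or fed to an aligner for checking.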

