How to extract text from this bilingual text to create TMX
Thread poster: gianghl1983

gianghl1983
Vietnam
Local time: 20:39
English to Vietnamese
Dec 6, 2015

Dear Prozers,

I have a huge bilingual text (English-Vietnamese) in the format below: every 20 English sentences are followed by their 20 Vietnamese translations (about 40,000 sentences in total).

I am using EmEditor and Notepad++ to process the text. My strategy is to select every other block of 20 lines and paste them into an English file, then delete those lines and save the rest as a Vietnamese file, so that I can create a TMX from the two files. However, I could not find a suitable filter or regex to do the job.

Does anybody know how to solve my problem?

Thank you so much in advance!

[sentence id="1"] Another method is to have a central control unit allocate time based on the priority requirements of each transmission , as in the 100 VGAnyLAN demand priority access method .[/sentence]
[sentence id="2"] I 'm using the same interface function here , but I want it to do something different for my new type . .[/sentence]
...
[sentence id="1"] Một phương pháp khác nhờ một đơn vị điều khiển trung tâm cấp phát thời gian dựa trên độ ưu tiên về đường truyền , như trong phương pháp 100 VGAnyLAN .[/sentence]
[sentence id="2"] Tôi đang dùng hàm giao diện đó ở đây nhưng muốn làm một điều gì đó khác hẳn cho kiểu mới của tôi .[/sentence]
...
[sentence id="21"] A back - end MTA ( message transfer agent ) that transfers messages from the UA and delivers them to mail servers .[/sentence]
[sentence id="22"] A backbone can link multiple networks in the campus environment or connect networks over wide area network links .[/sentence]
...
[sentence id="21"] Trình ứng dụng back - end MTA ( đại lý truyền thư tín ) truyền tải các thông điệp từ UA và phân phối chúng đến các mail server .[/sentence]
[sentence id="22"] Một đường trục có thể nối nhiều mạng trong một môi trường campus hay nối kết các mạng trong diện rộng .[/sentence]
....


 

Soonthon LUPKITARO(Ph.D.)  Identity Verified
Thailand
Local time: 20:39
Member (2004)
English to Thai
+ ...
Excel sorting by condition Dec 6, 2015

gianghl1983 wrote:

Dear Prozers,

I have a huge bilingual text (English-Vietnamese) in the format below: every 20 English sentences are followed by their 20 Vietnamese translations (about 40,000 sentences in total).

I am using EmEditor and Notepad++ to process the text. My strategy is to select every other block of 20 lines and paste them into an English file, then delete those lines and save the rest as a Vietnamese file, so that I can create a TMX from the two files. However, I could not find a suitable filter or regex to do the job.

Does anybody know how to solve my problem?
....


Looking at your text file, I suggest the following:
1. Copy all the text into Excel cells.
2. In Excel, select rows by condition, i.e. extract the source rows in fixed-size blocks into a new file. With the 20-line blocks in your file, that means rows 1-20, 41-60, 81-100, and so on. Do the same for the target text, i.e. rows 21-40, 61-80, 101-120, and so on.
3. You now have 2 new Excel files.
4. If you have a CAT tool, use it to align the 2 files and create a TMX translation memory.
[If you have mastered MS Word, you can also align the text using macros.]
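
For readers who would rather script the row selection than do it in Excel, here is a minimal Python sketch of the same idea. It assumes the lines strictly alternate in fixed-size blocks (source block, then target block); the function name split_alternating_blocks and the demo data are made up for illustration.

```python
from typing import List, Tuple


def split_alternating_blocks(lines: List[str], block_size: int = 20) -> Tuple[List[str], List[str]]:
    """Split lines that alternate in blocks: source block, target block, source block, ..."""
    source: List[str] = []
    target: List[str] = []
    for index, line in enumerate(lines):
        # Blocks 0, 2, 4, ... hold the source language; blocks 1, 3, 5, ... the target.
        if (index // block_size) % 2 == 0:
            source.append(line)
        else:
            target.append(line)
    return source, target


# Tiny demonstration with a block size of 2 instead of 20 (placeholder data).
demo = ["EN-1", "EN-2", "VI-1", "VI-2", "EN-3", "EN-4", "VI-3", "VI-4"]
english, vietnamese = split_alternating_blocks(demo, block_size=2)
print(english)
print(vietnamese)
```

Because the two lists come out in the same order, line N of one already corresponds to line N of the other, provided the blocks in the file really are complete.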

Soonthon L.


 

FarkasAndras
Local time: 15:39
English to Hungarian
+ ...
Post file Dec 6, 2015

If you post the file I'll have a look. It sounds like it has a good structure so it should take 10 minutes to process. Try to avoid aligning the files with a CAT as all it will do is mess things up. This should be converted straight into a table and then into a TMX.
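
As a rough illustration of the "table, then TMX" route, the sketch below builds a small TMX 1.4 document from a list of (source, target) pairs using only the Python standard library. The function name build_tmx and the header values such as "example-script" are placeholders chosen for this example, not anything a particular CAT tool requires; the sample pair is sentence 22 from the first post.

```python
import xml.etree.ElementTree as ET

# ElementTree writes this qualified name out as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="en", tgt_lang="vi"):
    """Return a TMX 1.4 document (as a string) with one <tu> per (source, target) pair."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",  # placeholder value
        "creationtoolversion": "0.1",      # placeholder value
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


pairs = [(
    "A backbone can link multiple networks in the campus environment or connect networks over wide area network links .",
    "Một đường trục có thể nối nhiều mạng trong một môi trường campus hay nối kết các mạng trong diện rộng .",
)]
tmx_text = build_tmx(pairs)
print(tmx_text)
```

To save the result, writing tmx_text to a UTF-8 file with an XML declaration in front should be enough for most tools, though that is worth checking against the CAT tool you import into.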

 

gianghl1983
Vietnam
Local time: 20:39
English to Vietnamese
TOPIC STARTER
I found a solution! Dec 6, 2015

Thank you all for your suggestions. I found a solution to my problem:

- Use EmEditor to strip the markup, so that I have 20 English sentences followed by 20 Vietnamese sentences...
- Copy and paste the result into Excel.
- Use Kutools for Excel (30-day trial) and its Select Interval Rows & Columns feature to select 20 rows at an interval of 20, which leaves only the English (or only the Vietnamese) content.
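
The markup-stripping step can also be sketched in Python. This assumes every sentence sits on its own line in exactly the [sentence id="N"] ... [/sentence] form shown in the first post; the helper name strip_markup is made up for this example, and lines that do not match (such as the "..." separators) simply return None.

```python
import re

# One sentence per line: [sentence id="N"] text [/sentence]
SENTENCE_RE = re.compile(r'\[sentence id="(\d+)"\]\s*(.*?)\s*\[/sentence\]')


def strip_markup(line):
    """Return (id, text) for a sentence line, or None if the line is not a sentence."""
    match = SENTENCE_RE.fullmatch(line.strip())
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


sample = ('[sentence id="22"] A backbone can link multiple networks in the campus '
          'environment or connect networks over wide area network links .[/sentence]')
print(strip_markup(sample))
print(strip_markup("..."))
```

Keeping the id alongside the text makes it possible to check afterwards that each English sentence is paired with the Vietnamese sentence carrying the same id.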

Have a nice day!


 

Mikhail Zavidin
Ukraine
Local time: 16:39
English to Russian
+ ...
The following works in Notepad++ Dec 6, 2015

You could try deleting the English text using the find-and-replace dialogue. Just enter:

To find: (?:\[sentence id=\"\d+\"\].*\[\/sentence\]\r\n){20}((?:\[sentence id=\"\d+\"\].*\[\/sentence\]\r\n){20})

Replace with: \1

Set the cursor to the beginning of the first line and press the Replace All button.

This will remove every block of 20 English lines from your file.

Then, to get the English text file, cut the first 20 lines of the original file (keep them aside, since they are the first English block), set the cursor to the beginning of the first line and run Replace All again.
After that, delete the last 20 Vietnamese segments, which remain at the end of the file, and paste the 20 saved lines back at the top.

Don't forget to select the regular expression search mode.
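
For anyone who wants to try the same replacement outside Notepad++, here is a hedged Python equivalent. It builds the same kind of pattern with a configurable block size and accepts both \r\n and \n line endings, which is an addition to the expression above; the function name drop_first_block_of_each_pair and the demo text are made up for illustration.

```python
import re


def drop_first_block_of_each_pair(text, block_size=20):
    """Delete the first block of every pair of blocks, keeping the second block."""
    # One sentence line, ending in a Windows (\r\n) or Unix (\n) line break.
    line = r'\[sentence id="\d+"\].*\[/sentence\]\r?\n'
    pattern = re.compile(r'(?:%s){%d}((?:%s){%d})' % (line, block_size, line, block_size))
    return pattern.sub(r'\1', text)


def make_line(sentence_id, text):
    return '[sentence id="%d"] %s[/sentence]\n' % (sentence_id, text)


# Demonstration with a block size of 2 instead of 20 (placeholder sentences).
demo_text = "".join([
    make_line(1, "EN-1"), make_line(2, "EN-2"),
    make_line(1, "VI-1"), make_line(2, "VI-2"),
    make_line(3, "EN-3"), make_line(4, "EN-4"),
    make_line(3, "VI-3"), make_line(4, "VI-4"),
])
vietnamese_only = drop_first_block_of_each_pair(demo_text, block_size=2)
print(vietnamese_only)
```

In Python the dot does not match a line break unless re.DOTALL is set, so each .* stays within its own line and the {20} counter really counts lines.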

Hope this helps.

[Edited at 2015-12-06 14:26 GMT]


 

2nl (X)  Identity Verified
Netherlands
Local time: 15:39
Very important and generic techniques! Dec 6, 2015

Hi gianghl1983 and Mikhail,

These two totally different approaches are fascinating and very useful. They can come in handy in many workflows.

It's very important that all the settings in the tool you use are absolutely correct.

So, if you would be willing to post some screenshots here or even a screencast on YouTube, that would be great and I'd be much obliged.

Hans


 

FarkasAndras
Local time: 15:39
English to Hungarian
+ ...
nice Dec 6, 2015

Mikhail Zavidin wrote:

You could try deleting the English text using the find-and-replace dialogue. Just enter:

To find: (?:\[sentence id=\"\d+\"\].*\[\/sentence\]\r\n){20}((?:\[sentence id=\"\d+\"\].*\[\/sentence\]\r\n){20})

Replace with: \1

Set the cursor to the beginning of the first line and press the Replace All button.

This will remove every block of 20 English lines from your file.

Then, to get the English text file, cut the first 20 lines of the original file (keep them aside, since they are the first English block), set the cursor to the beginning of the first line and run Replace All again.
After that, delete the last 20 Vietnamese segments, which remain at the end of the file, and paste the 20 saved lines back at the top.

Don't forget to select the regular expression search mode.

Hope this helps.

This is all academic because the problem has been solved, but your regex-fu is strong.
I guess you could have just written \[sentence id.*?\/sentence\] to shorten it a bit, but it's a solid concept for a quirky file like this. I rarely think to write a regex that spans several lines (mostly because I'm used to processing text line by line with Perl, I guess).
You probably need the ? after the .* to stop it gobbling up all the text it can, though.
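
Whether the ? is needed depends on whether the dot is allowed to match line breaks. The small Python experiment below illustrates the difference; Python's re module is used here as a stand-in, and Notepad++ has a comparable ". matches newline" option, so it is worth checking how that box is set before relying on either form.

```python
import re

greedy = r'\[sentence id.*/sentence\]'
lazy = r'\[sentence id.*?/sentence\]'

two_lines = '[sentence id="1"] one[/sentence]\n[sentence id="2"] two[/sentence]\n'
one_line = '[sentence id="1"] one[/sentence][sentence id="2"] two[/sentence]'

# The dot stops at line breaks by default, so greedy .* still matches line by line.
greedy_per_line = len(re.findall(greedy, two_lines))                    # 2 matches
# Two sentences on the same line: greedy swallows both, lazy keeps them apart.
greedy_same_line = len(re.findall(greedy, one_line))                    # 1 match
lazy_same_line = len(re.findall(lazy, one_line))                        # 2 matches
# Once the dot may match line breaks, greedy .* runs to the last closing tag.
greedy_dotall = len(re.findall(greedy, two_lines, flags=re.DOTALL))     # 1 match
lazy_dotall = len(re.findall(lazy, two_lines, flags=re.DOTALL))         # 2 matches

print(greedy_per_line, greedy_same_line, lazy_same_line, greedy_dotall, lazy_dotall)
```

So with one sentence per line and the dot restricted to a single line, the greedy form in the original expression behaves the same as the lazy one; the ? only starts to matter if lines can hold several sentences or the dot can cross line breaks.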

[Edited at 2015-12-06 17:51 GMT]


 

