How to extract content from SGML to create TMX file
Thread poster: gianghl1983

gianghl1983
Vietnam
Local time: 05:55
English to Vietnamese
Sep 16, 2015

Dear ProZ users,

Recently, I got a bilingual text in SGML, shown below (about 40,000 English-Vietnamese sentences). In the code below, I replaced the < and > symbols with [[[ and ]]] respectively.

I want to put them in my TM but could not find a way to convert this file to TMX.

Does anyone know a solution for this?

Thank you!
------------------------------------------

[[[doc id='N0001']]]
[[[head]]]
[[[title]]]What is a Fenqing ?[[[/title]]]
[[[corpus url='http://code.google.com/p/evbcorpus/']]]EVBCorpus[[[/corpus]]]
[[[author email='hungnq@uit.edu.vn']]]Quoc-Hung Ngo, Werner Winiwarter[[[/author]]]
[[[citation]]]Quoc-Hung Ngo, Werner Winiwarter, (2012). "Building an English-Vietnamese Bilingual Corpus for Machine Translation", International Conference on Asian Language Processing 2012 (IALP 2012), pp. 157-160, Ha Noi, Vietnam[[[/citation]]]
[[[/head]]]
[[[text]]]
[[[spair id='1']]]
[[[s id='en1']]]What is a Fenqing ?[[[/s]]]
[[[s id='vn1']]]Fenqing là gì ?[[[/s]]]
[[[/spair]]]
[[[spair id='2']]]
[[[s id='en2']]]Fenqing is a Chinese word which literally means " angry youth " .[[[/s]]]
[[[s id='vn2']]]Fenqing là một từ tiếng Hoa mà nghĩa đen là " thanh niên phẫn nộ " .[[[/s]]]
[[[/spair]]]
[[[spair id='3']]]
[[[s id='en3']]]This word has many translations in English such as cynical youth , young nationalists , hysterical youth and angry young men .[[[/s]]]
[[[s id='vn3']]]Từ này có nhiều cách dịch sang tiếng Anh như là thanh niên hoài nghi , thanh niên theo chủ nghĩa dân tộc , thanh niên cuồng loạn và thanh niên tức giận .[[[/s]]]
[[[/spair]]]
[[[spair id='4']]]
....
[[[/text]]]
[[[/doc]]]



Michael Beijer  Identity Verified
United Kingdom
Local time: 23:55
Member (2009)
Dutch to English
+ ...
a few options Sep 16, 2015

Extract every line from your file that contains "s id='vn" and save it as a new txt file. This is your Vietnamese half.

Extract every line from your file that contains "s id='en" and save it as a new txt file. This is your English half.

This can easily be done in EmEditor (using the Filter Toolbar).

Convert the two files into a tab-delimited txt file. There are many ways to do this. I have an EmEditor macro for this, but there is also a little tool in LF Aligner's "grab bag" (available on sourceforge) that can do this. Or just copy paste them both into a new file.

Then use e.g. the open source Heartsome TMX editor to convert this into a TMX (Tools > Convert to TMX).
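The extraction and pairing steps above can also be scripted. What follows is a minimal Python sketch, not a tested converter. It assumes the real file uses < and > (the [[[ and ]]] were only substituted for posting) and that each English line is followed by the Vietnamese line with the same number. The names `extract_pairs` and `to_tab_delimited` are made up for illustration.

```python
import re

# Matches one sentence line such as <s id='en1'>...</s>.
# Group 1 is the language code, group 2 the pair number, group 3 the text.
SENTENCE = re.compile(r"<s id='(en|vn)(\d+)'>(.*?)</s>")

def extract_pairs(sgml_text):
    """Return (english, vietnamese) tuples in file order."""
    pairs = []
    pending = None  # (number, english text) waiting for its Vietnamese half
    for lang, number, text in SENTENCE.findall(sgml_text):
        if lang == "en":
            pending = (number, text.strip())
        elif pending is not None and pending[0] == number:
            pairs.append((pending[1], text.strip()))
            pending = None
    return pairs

def to_tab_delimited(pairs):
    """Join each pair with a tab, one pair per line."""
    return "\n".join(f"{en}\t{vn}" for en, vn in pairs)

sample = """<spair id='1'>
<s id='en1'>What is a Fenqing ?</s>
<s id='vn1'>Fenqing là gì ?</s>
</spair>"""

print(to_tab_delimited(extract_pairs(sample)))
```

Pairing by position rather than by number alone is deliberate: if the numbering restarts in every doc element, a lookup keyed only on the number would overwrite earlier sentences. Save the result as UTF-8 so the Vietnamese characters survive.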

Or, let me do it for you (for £30/hour)


[Edited at 2015-09-16 09:59 GMT]



Meta Arkadia
Local time: 05:55
English to Indonesian
+ ...
Solution, sort of Sep 16, 2015

You'll have to mark-down it first, I suppose. I used MDEdit for it, but there are dozens of (free) mark-down apps. You'll then have to convert the resulting bitext to TMX, which you can do in CafeTran, probably in other CAT tools as well.



I hope somebody comes up with an easier solution, because it looks like it'll need some more editing.

Cheers,

Hans (who loves problems, other people's problems)



Soonthon LUPKITARO(Ph.D.)  Identity Verified
Thailand
Local time: 05:55
Member (2004)
English to Thai
+ ...
MS Word text to table convert Sep 16, 2015

Michael Beijer wrote:

Extract every line from your file that contains "s id='vn" and save it as a new txt file. This is your Vietnamese half.

Extract every line from your file that contains "s id='en" and save it as a new txt file. This is your Vietnamese [sic. English] half.

This can easily be done in EmEditor (using the Filter Toolbar).



I love the simple MS Word text-to-table converter.
1. First, I convert the source (Vietnamese?) part into a table in MS Word.
2. Insert a whole column to the right, then copy and paste the English text into the cells of that column.
3. Select the table and convert it to text. Save as text (Unicode encoding).
4. Make a translatable Trados bilingual file (tab-delimited format, with the target preset as confirmed).
5. Then I get the TM or TMX as desired.

Soonthon L.



xxx2nl  Identity Verified
Netherlands
Local time: 00:55
My attempt Sep 16, 2015

Use a good editor that can delete lines that contain a certain search string (e.g. TextWrangler for Mac; I'm positive that similar editors exist for other operating systems).

Delete all lines that contain "garbage", so that you keep what's valuable:

[[[doc id='N0001']]]
[[[text]]]
[[[spair id='1']]]
[[[s id='en1']]]What is a Fenqing ?[[[/s]]]
[[[s id='vn1']]]Fenqing là gì ?[[[/s]]]
[[[/spair]]]
[[[spair id='2']]]
[[[s id='en2']]]Fenqing is a Chinese word which literally means " angry youth " .[[[/s]]]
[[[s id='vn2']]]Fenqing là một từ tiếng Hoa mà nghĩa đen là " thanh niên phẫn nộ " .[[[/s]]]
[[[/spair]]]
[[[spair id='3']]]
[[[s id='en3']]]This word has many translations in English such as cynical youth , young nationalists , hysterical youth and angry young men .[[[/s]]]
[[[s id='vn3']]]Từ này có nhiều cách dịch sang tiếng Anh như là thanh niên hoài nghi , thanh niên theo chủ nghĩa dân tộc , thanh niên cuồng loạn và thanh niên tức giận .[[[/s]]]
[[[/spair]]]
[[[spair id='4']]]
....
[[[/text]]]
[[[/doc]]]

Perform many Find and Replace operations to convert the tags between [[[ and ]]] to their "corresponding" TMX tags:



[TMX example: the markup was lost when the post was rendered. Only the segment text survives, shown here as English / Vietnamese pairs.]

en: What is a Fenqing ?
vn: Fenqing là gì ?

en: Fenqing is a Chinese word which literally means " angry youth " .
vn: Fenqing là một từ tiếng Hoa mà nghĩa đen là " thanh niên phẫn nộ " .

en: This word has many translations in English such as cynical youth , young nationalists , hysterical youth and angry young men .
vn: Từ này có nhiều cách dịch sang tiếng Anh như là thanh niên hoài nghi , thanh niên theo chủ nghĩa dân tộc , thanh niên cuồng loạn và thanh niên tức giận .
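Instead of many find-and-replace passes, the TMX markup can be generated from the sentence pairs. The Python sketch below is an illustration under assumptions, not the poster's method: it uses the standard `xml.etree.ElementTree` module, fills the header with the attributes that TMX 1.4 requires, and uses placeholder values (the tool name `sgml2tmx-sketch` is hypothetical). The target language defaults to `vi`, the ISO code for Vietnamese, rather than the `vn` used in the corpus ids.

```python
import xml.etree.ElementTree as ET

# ElementTree serialises this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def build_tmx(pairs, src_lang="en", tgt_lang="vi"):
    """Return a TMX document (as a string) with one <tu> per sentence pair."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "sgml2tmx-sketch",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "sgml",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")

pairs = [("What is a Fenqing ?", "Fenqing là gì ?")]
print(build_tmx(pairs))
```

Letting an XML library write the file also takes care of escaping characters such as & and < inside segments, which manual find-and-replace tends to miss. When writing the string to disk, use UTF-8.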



[Edited at 2015-09-16 18:08 GMT]



Michael Beijer  Identity Verified
United Kingdom
Local time: 23:55
Member (2009)
Dutch to English
+ ...
I don't see how this will work. Sep 16, 2015

Meta Arkadia wrote:

You'll have to mark-down it first, I suppose. I used MDEdit for it, but there are dozens of (free) mark-down apps. You'll then have to convert the resulting bitext to TMX, which you can do in CafeTran, probably in other CAT tools as well.



I hope somebody comes up with an easier solution, because it looks like it'll need some more editing.

Cheers,

Hans (who loves problems, other people's problems)


I don't see how this will work. If you remove all the markup, you have also removed the info you need to convert it into its two languages. The text in your screenshot is all run together. How are you going to turn that into vn-en?

Michael



SDL Community  Identity Verified
United Kingdom
Local time: 00:55
English
I like these puzzles too... Sep 16, 2015

... maybe try this if you have Studio.

Create a filetype for this sgml... simple with two rules:

//s (always translatable)
//* (Don't translate)

Then open the file in Studio and save it. Now you have an SDLXLIFF in which the source and target sentences alternate, all in the source column, throughout the file.

Now use the SDLXLIFF Converter for MSOffice (installed with Studio since 2011) and convert the SDLXLIFF to Excel.

Now you have an Excel file with an ID column, a source column (populated) and a target column (and an empty notes column).

Use this formula in the target column:

=IF(ISEVEN(A3),B3,"")

This will look at the ID column (column A) and check if it's an even number or not. If it is then it will copy the contents in the cell. If it's an odd number it puts nothing. Once you've done this, copy the formula down the spreadsheet. Now copy all of column C (target column) and paste as plain text to remove the formulas.

Now you have a spreadsheet with every other row containing source on the left and target on the right. So filter on the target column and sort in alphabetical order. Now just delete all the rows with nothing in the target.

Now you have a simple spreadsheet you can drag into the Glossary Converter and convert to TMX.
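The even/odd trick can be sketched outside Excel as well. The Python below is an illustration only. It assumes the formula sits one row above the cells it references (so the row with an odd ID picks up the text of the even row beneath it) and that the IDs run 1, 2, 3, ... in file order; `pair_alternating` is a made-up name.

```python
def pair_alternating(rows):
    """rows is a list of (id, text) tuples in file order.

    Returns (source, target) tuples, taking each even-numbered row as the
    target of the row directly above it.
    """
    pairs = []
    for (id_a, text_a), (id_b, text_b) in zip(rows, rows[1:]):
        # Same test as =IF(ISEVEN(A3),B3,""): if the following row's ID is
        # even, its text becomes the target for the current row.
        if id_b % 2 == 0:
            pairs.append((text_a, text_b))
    return pairs

rows = [
    (1, "What is a Fenqing ?"),
    (2, "Fenqing là gì ?"),
    (3, "Fenqing is a Chinese word ."),
    (4, "Fenqing là một từ tiếng Hoa ."),
]
for source, target in pair_alternating(rows):
    print(source, "|", target)
```

Rows whose follower has an odd ID are simply skipped, which corresponds to deleting the rows with an empty target in the spreadsheet.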

Worked nicely, and easily, with your sample text

Regards

Paul
SDL Community Support



Meta Arkadia
Local time: 05:55
English to Indonesian
+ ...
Bitext and regex Sep 16, 2015

Michael Beijer wrote:
I don't see how this will work. If you remove all the markup, you have also removed the info you need to convert it into its two languages. The text in your screenshot is all run together. How are you going to turn that into vn-en?


It looks like bitext, so the lot should be proceeded by the code. Unfortunately, I don't know how to do that, especially not for Vietnamese. We'll have to ask Andras.

SDL Community wrote:
This will look at the ID column (column A) and check if it's an even number or not. If it is then it will copy the contents in the cell. If it's an odd number it puts nothing.


With my first try, I got something similar, if not easier:



However, I encountered encoding problems that I didn't want to try to solve because I used Mac-only apps nobody else uses, and my Vietnamese is lousy at that.

How did you get rid of all those superfluous spaces? Did you regex them away, or did they disappear automagically? What are they doing there anyway?
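If the spaces in question are the ones before punctuation in the sample (as in "Fenqing ?"), a regex can remove them. This is a hedged Python sketch: it assumes those spaces come from tokenisation of the corpus and only touches whitespace before common punctuation marks; the spaces inside quotation marks are left alone because opening and closing quotes cannot be told apart reliably. `tighten_punctuation` is a made-up name.

```python
import re

def tighten_punctuation(segment):
    """Remove whitespace that directly precedes , . ; : ? or !"""
    return re.sub(r"\s+([,.;:?!])", r"\1", segment)

print(tighten_punctuation("What is a Fenqing ?"))
print(tighten_punctuation("cynical youth , young nationalists ."))
```

Whether to apply this before building the TM is a judgment call, since it changes the segments relative to the published corpus.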

Cheers,

Hans



Meta Arkadia
Local time: 05:55
English to Indonesian
+ ...
pro=pre Sep 16, 2015

Meta Arkadia wrote:
proceeded by the code. Unfortunately, I don't know how to do that, especially not for Vietnamese


... and if I edit it, it'll have to be vetted again. They don't trust me over here. And right they are.

H.



gianghl1983
Vietnam
Local time: 05:55
English to Vietnamese
TOPIC STARTER
Thank you all! Sep 16, 2015


Thank you all. I followed Michael Beijer's method with EmEditor and could easily extract the text content into separate Vietnamese and English files.

Many thanks!



SDL Community  Identity Verified
United Kingdom
Local time: 00:55
English
I didn't have to ;-) Sep 16, 2015

Meta Arkadia wrote:

How did you get rid of all those superfluous spaces? Did you regex them away, or did they disappear automagically? What are they doing there anyway?



There weren't any in my excel file. Finished TM in Studio looks ok too... no encoding issues.



Regards

Paul
SDL Community Support



Adrien Esparron
Local time: 00:55
Member (2007)
German to French
+ ...
A little Macro... Sep 16, 2015

to clean the DOC :

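Sub RemoveTags()
    ' Delete every <...> tag from the active Word document.
    Dim MyRange As Range

    Set MyRange = ActiveDocument.Range
    With MyRange.Find
        ' In wildcard mode, \< and \> match literal angle brackets
        ' and * matches the characters between them.
        Do While .Execute(FindText:="(\<*\>)", _
                          MatchWildcards:=True, _
                          Wrap:=wdFindStop, Forward:=True) = True
            MyRange.Delete
        Loop
    End With

End Sub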

Then you have a "clean" text: just remove (or correct) what you want (if needed). Select the text and convert it to a table. Use one of the tools already mentioned to create a TMX.
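The same clean-up can be sketched in Python. Because stripping every tag blindly also keeps the header text (title, corpus name, authors, citation), this illustrative version drops the head block first. It assumes the real file uses < and > and that the header is wrapped in a head element as in the sample; `strip_markup` is a made-up name.

```python
import re

def strip_markup(sgml_text):
    """Return the non-empty text lines with the header and all tags removed."""
    # Drop the whole <head>...</head> block (title, corpus, author, citation).
    without_head = re.sub(r"<head>.*?</head>", "", sgml_text, flags=re.DOTALL)
    # Remove every remaining <...> tag.
    text_only = re.sub(r"<[^>]*>", "", without_head)
    return [line.strip() for line in text_only.splitlines() if line.strip()]

sample = """<doc id='N0001'>
<head>
<title>What is a Fenqing ?</title>
</head>
<text>
<spair id='1'>
<s id='en1'>What is a Fenqing ?</s>
<s id='vn1'>Fenqing là gì ?</s>
</spair>
</text>
</doc>"""

for line in strip_markup(sample):
    print(line)
```

Once the tags are gone, the only thing telling English from Vietnamese is the strict alternation of lines, so any stray line in the output will shift every pair after it.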

Done!


What is a Fenqing ?
EVBCorpus
Quoc-Hung Ngo, Werner Winiwarter
Quoc-Hung Ngo, Werner Winiwarter, (2012). "Building an English-Vietnamese Bilingual Corpus for Machine Translation", International Conference on Asian Language Processing 2012 (IALP 2012), pp. 157-160, Ha Noi, Vietnam

What is a Fenqing ?
Fenqing là gì ?
Fenqing is a Chinese word which literally means " angry youth " .
Fenqing là một từ tiếng Hoa mà nghĩa đen là " thanh niên phẫn nộ " .
This word has many translations in English such as cynical youth , young nationalists , hysterical youth and angry young men .
Từ này có nhiều cách dịch sang tiếng Anh như là thanh niên hoài nghi , thanh niên theo chủ nghĩa dân tộc , thanh niên cuồng loạn và thanh niên tức giận .


Regards


[Edited at 2015-09-16 20:07 GMT]



Michael Beijer  Identity Verified
United Kingdom
Local time: 23:55
Member (2009)
Dutch to English
+ ...
I ♥ EmEditor Sep 16, 2015


Yes, that Filter Toolbar in EmEditor is priceless. Note that it also allows you to filter negatively. So much quicker and easier than messing around with Macros. The whole thing would take maybe 5 minutes in EmEditor.



Meta Arkadia
Local time: 05:55
English to Indonesian
+ ...
Yes there are Sep 17, 2015

SDL Community wrote:
There weren't any in my excel file. Finished TM in Studio looks ok too




And I only have those encoding issues in a crazy editor, and I'm sure I can avoid them, but I'll have to find out how.* It's worth the trouble because of all the (AppleScript) goodies.

But those spaces are everywhere...

Anyway, Michael's solution works. No need to think. Good.

*EDIT: Open as *.rtf does the trick.

Cheers,

Hans

[Edited at 2015-09-17 02:28 GMT]

