Wikipedia as a glossary source
Thread poster: FarkasAndras

FarkasAndras
Local time: 06:10
English to Hungarian
+ ...
May 10, 2009

I'm sure most of you know that most Wikipedia articles link to articles on the same topic in other languages (scroll down and look for the vertical list of language links on the left-hand side).

I'm sure very few of you know that Wikipedia offers a variety of periodically updated dumps (database downloads) at http://download.wikimedia.org/ . I've looked around and found a way (two, actually) of generating giant glossaries out of the dumps. For example, of the 120,000 articles in Hungarian, about 77,000 link to English articles. Using incantations and voodoo magic, I generated a 77,000-entry two-column Excel spreadsheet out of that information, extracted from the dump. Obviously, part of it is close to useless, but there is lots of good stuff in there: close to 4,000 animal and plant species, all the towns of any size in border areas with Hungarian/Romanian, Slovakian etc. dual names, lots of science and technology terms, and generally everything that tends to be in Wikipedia.

If any of the Hungarian colleagues working with English, Spanish or Italian want the file(s) and have some interesting glossaries or TMs to offer in return, hit me up via PM. Make sure it's something that's not easily available online and at least mildly useful for a generalist En-Es-(It)->Hu translator. If you'd like me to generate glossaries for other language pairs, we may be able to work something out - but it may take ages, as I have a lot of work to do. For the same reason, I may not check back on this thread for a while.


Here's how you do it, in very broad terms. The two methods start with "import the SQL file" and "install Linux (or Linux command-line tools)" respectively, so most people will not be able to do this. Those who are computer-savvy enough to do it for themselves don't need much explanation, so I'm not going to do a detailed writeup. This is long enough as it is.

1. SQL
I know exactly nothing about this. If you are different, use the langlinks.sql and page.sql files. Langlinks.sql contains pairs of IDs and article titles, and page.sql contains the list of IDs and article titles in the other language. Pair them up, substituting article titles for the IDs, and away you go. Special characters like Hungarian accented letters get corrupted, though...

2. XML
Get the full XML dump ( pages-articles.xml.bz2 ). Use the file of the more obscure language; in the case of English-Hungarian, use the Hungarian dump, which I will use in this example. These files are large. Very large. The Hungarian XML, which is on the small side among Wikipedia dumps, is about 700 MB and about 15 million lines. Microsoft Word is not exactly the right tool for doing anything with it. The only text editor I have found that can open it is vim, and even vim is not much use unless you enjoy reading XML files in your free time. You can chop it up with a terminal command, but handling the resulting files would be cumbersome.
Instead, I used grep to extract the relevant lines from the XML and create a txt file from them, along with the line number of each. For example, [[en: is the keyword to look for to get English links, and <title> is the tag that identifies article titles. Just look at the dump file with vim to see what it looks like.
The resulting files are a lot smaller and easier to process with more everyday tools like search and replace.
Import the files to Excel, with the line numbers in a separate column. Clean up the files; remove the tags and remove the bulk of the irrelevant entries from the English link list.
Then you can use the line numbers to pair up the two languages: obviously, every English link goes with the Hungarian title before it in terms of row number. Just merge all the data into one spreadsheet, putting the row numbers in one column, Hungarian in another and English in a third. Order the whole thing alphabetically and then shift up the English column by one cell.
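The extract-and-pair workflow described above can be sketched in a few lines of Python (an illustration only, not the author's actual grep/Excel steps; the file path is hypothetical, and the line-number bookkeeping from the post is done implicitly by tracking the most recent title):

```python
import re

def extract_pairs(dump_path):
    """Scan the XML dump line by line and pair each [[en:...]]
    interlanguage link with the article title that precedes it."""
    title_re = re.compile(r"<title>(.*?)</title>")
    enlink_re = re.compile(r"\[\[en:(.*?)\]\]")
    pairs = []
    current_title = None
    with open(dump_path, encoding="utf-8") as f:
        for line in f:
            m = title_re.search(line)
            if m:
                current_title = m.group(1)
                continue
            m = enlink_re.search(line)
            if m and current_title:
                # every English link goes with the title before it
                pairs.append((current_title, m.group(1)))
                current_title = None
    return pairs
```

The two-column result can then be written out as tab-delimited text and opened in a spreadsheet, as in the post.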
The fun part of this is that this method allows you to get extra information and subsets of data fairly easily. E.g. Wikipedia contains the scientific names of plants and animals, represented in the XML with a unique tag. So I just extracted that information with grep as well and now I have the latin names in a third column beside English and Hungarian, and a smaller, more manageable file in case I'm translating some biology-related material. The same could be done with any term that occurs in the actual text of the articles; it would be trivially easy to generate a glossary out of the titles of the articles that contain the word "astronomy" for an astronomy-related job.
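The keyword-based subsetting mentioned above (e.g. keeping only articles that mention "astronomy") could look something like this; a sketch under the same assumptions as before, not the author's actual commands:

```python
import re

def extract_subset(dump_path, keyword):
    """Keep only title/translation pairs whose article body
    contains the given keyword, to build a topic-specific glossary."""
    title_re = re.compile(r"<title>(.*?)</title>")
    enlink_re = re.compile(r"\[\[en:(.*?)\]\]")
    subset = []
    title, en, seen_keyword = None, None, False
    with open(dump_path, encoding="utf-8") as f:
        for line in f:
            m = title_re.search(line)
            if m:
                # a new article starts: flush the previous one if it matched
                if title and en and seen_keyword:
                    subset.append((title, en))
                title, en, seen_keyword = m.group(1), None, False
                continue
            m = enlink_re.search(line)
            if m:
                en = m.group(1)
            if keyword in line:
                seen_keyword = True
        if title and en and seen_keyword:
            subset.append((title, en))
    return subset
```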

If all of this is Greek to you, don't ask me to explain it... it would take hours of time that I don't have. If you get most of it and only need help with using grep or the like, just google it. That's what I did a dozen times before I worked out how to get it done. Have fun data mining!



Hepburn  Identity Verified
France
Local time: 06:10
English to French
+ ...
Data mining May 11, 2009

I can only congratulate you on your endeavour and on the time you kindly took to explain all the arduous details of the operation for our benefit. I thought it was a great idea and jumped on the bandwagon quite cheerfully, took long pauses, read things over, frowned, then gave up. You had warned us: not for the faint-hearted!

So, I now quietly tiptoe away, leaving all the venerable Linux glossary fiends to wrench all this ore from Wikipedia and hit gold, I mean huge glossaries.

Amazing, truly amazing that such a task can be carried out to the end...

Congratulations for your idea, patience, willpower and also your generosity in sharing.

Claudette (feeling like a minute worm)



Jaroslaw Michalak  Identity Verified
Poland
Local time: 06:10
Member (2004)
English to Polish
Hmm... May 11, 2009

Looks like a Perl job... I might look into it. Could you share a fragment of the source files? I don't fancy downloading half a gig just to have a look...


Samuel Murray  Identity Verified
Netherlands
Local time: 06:10
Member (2006)
English to Afrikaans
+ ...
I have also done this, and so has Francis May 11, 2009

FarkasAndras wrote:
Using incantations and voodoo magic, I generated a 77,000 word two-column Excel spreadsheet out of that information, extracted from the dump.


My method was to start with a list of English words, then attempt to visit the Wikipedia articles with those words as their names, and then extract the translation of each term from the downloaded page. So my method works not on a dump but on the online version. I suppose it could be done on the dump as well.

http://leuce.com/tempfile/omtautoit/wikterma.zip

Francis, the Apertium guy, also did something like what you did, although I think his method must have been a lot more sophisticated than mine. He uses it to get glossaries for his MT research. His site is at:

http://www.apertium.org/



Jaroslaw Michalak  Identity Verified
Poland
Local time: 06:10
Member (2004)
English to Polish
First try... May 11, 2009

Below is a very simple script extracting an English-Polish glossary in the form of tab-delimited text. Please note that it is still wildly inadequate (it only saves the /last/ category, produces ugly formatting, etc.); it is more of a proof of concept.

Naturally, the mined data is pretty bad as well - for the glossary to be meaningful at all, it needs to be heavily edited.

Code:


open (INFILE, "D:\\Temp\\test.xml") or die "Cannot open input file: $!";

open (OUTFILE, ">D:\\Temp\\output.txt") or die "Cannot open output file: $!";

while (<INFILE>) {
    # a new article begins: store its title and reset the other fields
    if (/<title>(.*?)<\/title>/) {
        $Polish = $1;
        $English = "";
        $Category = "";
    }
    # interlanguage link to the English article
    if (/\[\[en:(.*?)\]\]/) {
        $English = $1;
    }
    # category link (the last one on the page wins)
    if (/\[\[Kategoria:(.*?)\]/) {
        $Category = $1;
    }
    # the next page starts: write out the previous article if it had an English link
    if ((/<page>/) && ( $English )) {
        print OUTFILE "$Polish\t$English\t$Category\n";
    }
}




[Edited at 2009-05-11 15:05 GMT]



Jaroslaw Michalak  Identity Verified
Poland
Local time: 06:10
Member (2004)
English to Polish
Suggestions are welcome! May 11, 2009

Naturally, I have posted the above hoping for suggestions that might improve the code. Latin names and further categorization are the obvious ones, I bet there is more to get there...

Looking at the data, I tend to think that using one „Big Daddy” glossary does not make sense, possibly only subset searches will be useful. Those, however, are pretty trivial to do.

On the bright side, Perl is quite fast, as usual. With ActivePerl on WinXP, on a quite old machine, analyzing 40 MB of bz2 (150 MB of raw data) takes about 30 seconds and produces about 20k entries. Modified searches can therefore be done practically on the fly.



Samuel Murray  Identity Verified
Netherlands
Local time: 06:10
Member (2006)
English to Afrikaans
+ ...
I also created a script for it... May 11, 2009

FarkasAndras wrote:
Here's how you do it, in very broad terms. ... Get the full XML dump ( pages-articles.xml.bz2 ). ...


Okay, I wrote a little script that extracts terms from the XML dump. The output looks like this:

Afrika :: Africa
Lys van Afrikaanse digters :: List of Afrikaans language poets
Alice :: Alice, Eastern Cape
Australasië :: Australasia
Asië :: Asia
Antofagasta :: Antofagasta
Apostoliese Geloofsending :: Apostolic Faith Mission
Argitektuur :: Architecture
ABC :: ABC
Algebra :: Algebra
AR :: AR

In the above example I downloaded the afwiki file (Afrikaans), and in my script I specified "en" as the target language code. The Afrikaans file is small, with only 25,000 items. The script extracted 13,000 terms from the Afrikaans in about 1 minute.

I'm sure that with tinkering the script can be improved to extract more terms, but at the moment it is very unforgiving (i.e. if the article isn't written exactly the way it expects, it skips it and moves on to the next item).

With a little hacking the script would also be able to extract more than one language, but the language codes would have to be filled in alphabetically. I haven't written that multilanguage version yet.

The script requires Windows, and if you want to customise it, you also need to install the scripting interpreter, AutoIt (freeware). The script asks for your source file, your output file, and your target language code. The output file is UTF-8 with BOM.

http://www.vertaal.org/tempfile/wikterma_xml.zip (300 KB)



FarkasAndras
Local time: 06:10
English to Hungarian
+ ...
TOPIC STARTER
big daddy May 11, 2009

Jabberwock wrote:

Looking at the data, I tend to think that using one „Big Daddy” glossary does not make sense, possibly only subset searches will be useful. Those, however, are pretty trivial to do.

That's my thinking as well, although it also depends on your workflow. MultiTerm, for example, is quite happy to work with 80,000 terms. Searches on opening a new TU are not annoyingly slow at all, even with my slow notebook HD. Of course, slimming down the termbase might make sense just to limit the number of junk hits you get.

Jabberwock, do I just save the script as a UTF-8 or ANSI txt with a .pl extension and run it as a Perl script? I know nothing about programming, but I do have ActivePerl and I have run scripts before... I guess it should work more or less like that.
If anyone gets a foolproof, ready-for-primetime script going, please post it with a brief description/instructions. I guess customizing the search terms (languages or keywords from the text) should be very easy, and if the output is a two-column txt with line breaks between entries and a tab or special character between terms, it should be very convenient to process. It wouldn't hurt if the script could add the line number of each line, so that data extracted at different times (different languages, subsets, extra info like scientific names etc.) can easily be merged into one spreadsheet.



Jaroslaw Michalak  Identity Verified
Poland
Local time: 06:10
Member (2004)
English to Polish
Should work... May 11, 2009

Saving in ANSI should do. I've realized that I named the variables rather mindlessly; of course the script works without problems with any language (i.e. „Polish" is the source and „English" is the target; they can be switched after extraction). The only thing that needs localization is the word for „Category" („Kategoria" in my example). Of course, the file names (and paths) are hardcoded, so they should be changed too (mind the double backslashes).

Line numbers would be counterproductive, I'm afraid, as they would differ between subsequent dumps (assuming we're talking about source file lines). I suppose the title of the page would have to do as a unique identifier, as I assume there are no two different pages with the same title (in a given language). Therefore all data should be merged on the source term (extracted from the page title).



FarkasAndras
Local time: 06:10
English to Hungarian
+ ...
TOPIC STARTER
numbers, names, methods May 11, 2009

Jabberwock wrote:

Saving in ANSI should do. I've realized that I named the variables rather mindlessly; of course the script works without problems with any language (i.e. „Polish" is the source and „English" is the target; they can be switched after extraction). The only thing that needs localization is the word for „Category" („Kategoria" in my example). Of course, the file names (and paths) are hardcoded, so they should be changed too (mind the double backslashes).

Line numbers would be counterproductive, I'm afraid, as they would differ between subsequent dumps (assuming we're talking about source file lines). I suppose the title of the page would have to do as a unique identifier, as I assume there are no two different pages with the same title (in a given language). Therefore all data should be merged on the source term (extracted from the page title).


Thanks.

Yes, line numbers would get messed up with new dumps, but if you decide to download a new dump, just start over. I don't think I or anyone else will download new dumps every month and still want to merge new data into files generated from previous dumps.
I think line numbers are better than just the names, because anyone can use them to merge in new data with a spreadsheet program (via alphabetic sorting) without having to write and use a different script.
You know what, just slap it on there for my sake : )

[Edited at 2009-05-11 17:43 GMT]



Jaroslaw Michalak  Identity Verified
Poland
Local time: 06:10
Member (2004)
English to Polish
OK! May 11, 2009

Just to make sure I understand correctly: you want a line number from the source file for each term, presumably the one where the „title" tag appeared?

By the way, I have downloaded the full Polish dump and made a Big Daddy out of it (500k terms, about five minutes).

Then, just for fun, I loaded it into ApSIC Xbench. Surprisingly, it works quite well! I know that Xbench does not index the database, so I expected it to choke, but the response was satisfactory. I've queried several technical terms and it did quite well... if you don't mind that it throws in some Star Wars vocabulary.

Well, it's not exactly what we had in mind, but it really does not get simpler than that - with some freebies you get a huge glossary in ten minutes... And it is sometimes even usable!



FarkasAndras
Local time: 06:10
English to Hungarian
+ ...
TOPIC STARTER
Yes May 11, 2009

Pretty much, yes. The line number of the title AND the line number of anything else you are extracting. That allows you to, say, get English-Polish from the Polish file, then get German later and add it to the same data to create a three-column glossary instead of two separate two-column files. Or, if you find some other worthy information, you can add that on as well (categories, words that just occur in the text etc.) to create subsets and add info.
I mostly have what I want already, so I might not even use the script but it never hurts to have the option.

BTW, you can eliminate some of the rubbish in the data by sorting the table alphabetically by the two languages in turn and removing unneeded chunks. A lot of it comes at the beginning (numbers), at C for "Category: XX" entries (which you could delete, or just remove the "Category:" bit), at "user" and maybe at "wikipedia". I think I removed about 5% from mine... maybe worth your time, maybe not. I think Excel 2007 can handle 500,000 lines; Excel 2003 and OOo can't.

BTW I'm envious of the Polish Wikipedia... a lot bigger than Hungarian.



Jaroslaw Michalak  Identity Verified
Poland
Local time: 06:10
Member (2004)
English to Polish
Editing May 11, 2009

Below is the script with line numbers.

I would do the merging by sorting on a unique key (i.e. if the base file is Polish and I extract English and then German, I sort on the Polish title tag), then pasting the relevant column from the other file. Please note that for merging you have to output the empty translations as well, so you have to skip the second condition (&& $Target) - then you get the same number of entries.
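The merge-on-unique-key step described here could be sketched as follows (illustrative Python, not part of the Perl script; the function and data are hypothetical). Each extraction is a list of (source title, translation) pairs, and empty translations are kept so every title lines up:

```python
def merge_on_title(pairs_en, pairs_de):
    """Merge two extractions (e.g. English and German links from the
    Polish dump) on the source title, which acts as the unique key."""
    de = dict(pairs_de)
    merged = []
    for title, en in sorted(pairs_en):
        # missing translations become empty strings, keeping rows aligned
        merged.append((title, en, de.get(title, "")))
    return merged
```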

As for the rubbish you describe, I did not get it at all - the script extracts only the title, the relevant translation and the category. Only the numbers remained, and I got rid of those in the corrected version of the script. In general, I would remove as much as possible automatically - doing it manually means it might need repeating in the future (e.g. for other languages).

Excel is fine for most tasks, but for quick and dirty table manipulation I prefer CSVEd:
http://csved.sjfrancke.nl/

Code:

open (INFILE, "D:\\Temp\\test.xml") or die "Cannot open input file: $!";

open (OUTFILE, ">D:\\Temp\\output.txt") or die "Cannot open output file: $!";

while (<INFILE>) {

    $count++;

    # a new article: remember its title and line number, reset the rest
    if (/<title>(.*?)<\/title>/) {
        $Source = $1;
        $SourceCount = $count;
        $Target = "";
        $TargetCount = "";
        $Category = "";
        $Species = "";
    }
    # interlanguage link to the English article
    if (/\[\[en:(.*?)\]\]/) {
        $Target = $1;
        $TargetCount = $count;
    }
    # category link (the last one on the page wins)
    if (/\[\[Kategoria:(.*?)\]/) {
        $Category = $1;
    }
    # scientific (Wikispecies) name, if present
    if (/Wikispecies=(.*?)\n/) {
        $Species = $1;
    }
    # next page starts: write out the previous article, skipping
    # titles that begin with a digit (dates, years etc.)
    if ((/<page>/) && ( $Target ) && ( $Source !~ /^[0-9]/ )) {
        print OUTFILE "$SourceCount\t$Source\t$TargetCount\t$Target\t$Category\t$Species\n";
    }
}



[Edited at 2009-05-11 20:24 GMT]

