Seeking feedback on online dictionary-generating tool
Thread poster: Thomas Johansson
Thomas Johansson  Identity Verified
Peru
Local time: 00:10
English to Swedish
Dec 11, 2009

Hello,

I would like to present a new "dictionary-generating" tool I am building online. I have just now uploaded a first experimental version of it and _would appreciate feedback_ from my colleagues here.

The tool, which is located online at my website, is designed to enable visitors to generate dictionaries between any combination of (supported) languages and have them displayed in full on a web page on the site. The dictionaries are generated based on language data taken from the site's own database and can be customized in various ways for specific purposes.

I believe these sorts of dictionaries may be useful, for instance, for language studies, language teaching, traveling and certain kinds of language research.

The tool is meant to be a free online resource available for anyone to use.

You'll find shortcuts to specific language combinations at http://www.thomasteahouse.com/dict/index.php. Otherwise, the main form is at http://www.thomasteahouse.com/dict/dict.php?start .

One important objective of this tool is that people should be able to customize their dictionaries to very specific needs. At this moment, for instance, a person can:
- select what sorts of data should be displayed in each dictionary entry
- filter and group dictionary entries (still very limited capacity)
- specify various design settings to make dictionaries look the way he or she wants

Well, everything is still at a very experimental level.

Feel free to have a look and, again, I would very much appreciate your feedback.

Thank you for your attention,

Thomas Johansson


[Edited at 2009-12-11 06:34 GMT]



 
RobertNJoung (X)
Local time: 06:10
Swedish to English
great initiative and resource! Dec 13, 2009

Indeed very useful, keep up the good work!

Thanks,

Robert


 
FarkasAndras  Identity Verified
Local time: 06:10
English to Hungarian
How does it work? Dec 13, 2009

Does it do multi-word expressions?
Does it extract word pairs from a bicorpus (TM)?
If so, how well does it scale?
If it can handle a couple of million sentence pairs, you could feed it the EU stuff (DGT-TM and europarl corpus) and get sizeable dictionaries in many languages.

If you allow users to upload TMs (tab separated or TMX) and process them, the thing could work as a sort of terminology extractor. Perhaps introduce a mechanism for filtering out the uninteresting terms, say, the ones that occur in a 3000-word dictionary acquired from whatever source (3000 most frequent words in large corpus, say).
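The filtering idea above could be sketched roughly as follows. This is only an illustration, not how the tool works: the function names `most_frequent_words` and `filter_uninteresting` are made up for the example, tokenization is a naive whitespace split, and a real frequency list would come from a large corpus rather than a few sentences.

```python
from collections import Counter


def most_frequent_words(corpus_sentences, n=3000):
    """Return the n most frequent lowercase tokens of a reference corpus."""
    counts = Counter(
        token
        for sentence in corpus_sentences
        for token in sentence.lower().split()
    )
    return {word for word, _ in counts.most_common(n)}


def filter_uninteresting(term_pairs, common_words):
    """Keep pairs whose source term has at least one token outside the common-word list."""
    kept = []
    for source, target in term_pairs:
        tokens = source.lower().split()
        if any(token not in common_words for token in tokens):
            kept.append((source, target))
    return kept
```

A pair such as ("the", "el") would be dropped once "the" is in the common-word list, while ("central bank", "banco central") would survive as long as one of its words is not in the list.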

ATM I don't really have an opinion on it because I have no idea if it's designed to do anything that could be useful for me as a translator.
If it could extract word pairs from TMs, maybe I'd find a use for it at some point.


 
Thomas Johansson  Identity Verified
Peru
Local time: 00:10
English to Swedish
TOPIC STARTER
clarifications Dec 13, 2009

FarkasAndras, thank you for your great input and ideas.

Before I respond to your questions, here are some clarifications:

At this early stage, the system will work best mainly for _language learners_ who are looking for targeted dictionary lists to print out and use for building new vocabulary.

(But yes, at later stages, the system should also be usable by persons looking for more specialized terminology, e.g. translators.)

Try for instance http://thomasteahouse.com/dict/dict.php?start&dict&l0=ch&l1=en to get a Chinese-English dictionary. You can play with the settings to include more types of data in the records, filter records, and adjust the design (colors, fonts etc.).

Maybe I should also clarify that the system is in principle not limited in capacity, scope or objective to any particular group of languages, e.g. EU languages. It can in principle handle any language, and it is actually my hope to gradually add new languages.

(My personal little fancy here will be to add a wide range of "exotic", minority and ancient languages to the system. Imagine dictionaries from Tsutujil to Lao, Ancient Greek to Gamo-Gofa, and so on. One objective of the system is to make it a useful online tool for documenting language vocabularies.)

In specific response to some of the questions you mention:

Multi-word expressions:
Yes, in principle it can handle terms consisting of single words, multiple words, phrases, sentences, affixes, etc., but I am not sure I understand exactly what you have in mind here... The system will, for instance, be able to filter, sort and group dictionary entries based on whether the terms are single words, multiple words, sentences, etc. (I need to do some additional work on this aspect though.)

Extracting word pairs from files (e.g. TMs):
This sort of functionality is planned to be implemented.
In rough outline, individual users will be able to upload glossary files (e.g. bilingual TMs) and have the term pairs extracted and inserted into the system.
Site users will be able to generate dictionary outputs from the terms contributed by the entire user community or, if they like, from terms contributed by specific users only.
I personally hope this part of the functionality will become a useful resource for people to be able to share terminology for such purposes as translations, language-learning, etc.
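As a rough sketch of what such an upload step might look like (this functionality is only planned, so everything here is an assumption): the function names `pairs_from_tsv` and `pairs_from_tmx` are hypothetical, the tab-separated format is assumed to hold source and target in the first two columns, and the TMX reader only looks at `tu`, `tuv` and `seg` elements and matches language codes exactly.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def pairs_from_tsv(text):
    """Read (source, target) pairs from tab-separated lines, skipping incomplete rows."""
    pairs = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2 and fields[0].strip() and fields[1].strip():
            pairs.append((fields[0].strip(), fields[1].strip()))
    return pairs


def pairs_from_tmx(tmx_text, source_lang, target_lang):
    """Read (source, target) segment pairs from a TMX document given as a string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segments[lang] = "".join(seg.itertext()).strip()
        if source_lang in segments and target_lang in segments:
            pairs.append((segments[source_lang], segments[target_lang]))
    return pairs
```

Note that a code such as "en-GB" would not match "en" in this sketch; a real importer would probably need to normalize language codes and record which user contributed each pair.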

Scaling:
Not entirely sure what you mean... I have no measure of its overall capacity available right now. This should depend on the server and database system, etc. What you see on the site is just a preliminary test version with some preliminary functionality.

I have put up some general information about the system for reference at http://thomasteahouse.com/dict/index.php.


[Edited at 2009-12-13 18:33 GMT]

[Edited at 2009-12-13 19:57 GMT]


 
FarkasAndras  Identity Verified
Local time: 06:10
English to Hungarian
Scaling and other things Dec 13, 2009

Thomas Johansson wrote:

Scaling:
Not entirely sure what you mean... I have no measure of its overall capacity available right now. This should depend on the server and database system, etc.


Exactly that... whether the hardware and software you're throwing at this now or in the future is capable of chewing up a million sentence pairs in reasonable time.

By multiword entries I meant whether it will identify "European Central Bank" as a phrase in a text if it occurs frequently enough, and pair it up with "Banco Central Europeo". This would obviously increase the amount of computing power needed quite significantly. Going from single words to 3-word expressions is a ~6 times increase in potential pairs.
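To get a feel for how the number of candidates grows, here is a small sketch that counts them for one sentence pair. It assumes that candidates are contiguous word sequences of up to `max_len` words and that every source candidate is paired with every target candidate; the helper names are made up, and the exact growth factor depends on sentence length and on how candidates are defined.

```python
def ngrams(tokens, max_len=3):
    """Return all contiguous word sequences of 1..max_len tokens, shortest first."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(tokens) - n + 1)
    ]


def candidate_pair_count(source_sentence, target_sentence, max_len):
    """Count the source/target candidate combinations for one sentence pair."""
    source_candidates = ngrams(source_sentence.split(), max_len)
    target_candidates = ngrams(target_sentence.split(), max_len)
    return len(source_candidates) * len(target_candidates)
```

Comparing `candidate_pair_count(src, tgt, 1)` with `candidate_pair_count(src, tgt, 3)` over a sample of real sentence pairs would give a more realistic estimate of the extra work than any single figure.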

Anyway, best of luck with the project. I like the spirit very much.
What you are doing is a distant relative of http://mymemory.translated.net/, a large multilingual online TM by an agency. They crawl the web for multilingual content, align it and make it searchable. The system also tries to identify what the equivalent of the search term is in the target text, with awful results.
You may also want to take a look at the tools here: http://mokk.bme.hu/resources/
You'll need something like hunmorph to stem words in languages that use suffixes heavily, unless you're doing that already. I'm sure it increases the quality of the output greatly.
And I'd warmly recommend hunalign if you plan to automatically align bilingual content. I use it to align texts myself, and it works really well.
Incidentally, it is also capable of producing a dictionary from the raw bilingual content (it can roughly align two texts, generate a dictionary and use it to refine the alignment in a second pass, printing the dictionary as well).

[Edited at 2009-12-13 21:17 GMT]


 

