ProZ.com global directory of translation services
Thread poster: cyrine84
build a model language
cyrine84
Tunisia
Jul 26, 2010

Hi all,
I want to build a 5-gram language model, so I would like to know whether these three steps are necessary, and why:
1. Tokenize the training data
2. Filter out long sentences
3. Lowercase the training data

Can anyone help me?

Best regards


plotinus
Brazil
It depends
Aug 12, 2010

Hi cyrine84,

A language model is not a one-size-fits-all tool that you can apply to every situation. In short, a language model will (and should) only model the characteristics that you want it to model. The order of the model (you want a 5-gram model) is subject to the same consideration: for many languages, such a high order would result either in an extremely large model in terms of bytes (which can make it slow or even impossible to load into memory) or in a model that needs to be pruned.
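To make the size issue concrete, here is a minimal sketch that counts n-grams of order 1 to 5 on a toy corpus. The corpus, the function names, and whitespace tokenization are assumptions made for illustration only. It reports, per order, how many distinct n-grams there are compared with the total number of n-gram occurrences.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))

def count_ngrams(sentences, n):
    """Count n-grams over whitespace-tokenized sentences (toy tokenizer)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.split(), n))
    return counts

# Toy corpus, invented for this illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat saw the dog on the mat",
]

# For each order, store (distinct n-grams, total n-gram occurrences).
stats = {}
for n in range(1, 6):
    counts = count_ngrams(corpus, n)
    stats[n] = (len(counts), sum(counts.values()))

for n, (distinct, total) in stats.items():
    print(f"{n}-grams: {distinct} distinct out of {total} occurrences")
```

As the order goes up, a larger share of the n-grams is seen only once, so the model has to store close to one entry per position in the training data. On a real corpus this is what makes a high-order model very large unless it is pruned.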

That said, filtering out long sentences is usually not necessary for language models, unless you are concerned about the time it takes to build the model (it does matter a lot, however, if you want candidates for statistical machine translation). The reason is that you will usually filter out uncommon n-grams anyway (i.e., n-grams whose frequencies fall below a given value, usually obtained with an estimator).
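A rough sketch of both kinds of filtering follows. The names `filter_long` and `prune`, the toy corpus, and the fixed thresholds are assumptions for illustration; in practice the frequency cutoff would come from an estimator rather than a hard-coded constant.

```python
from collections import Counter

def filter_long(sentences, max_tokens):
    """Keep only sentences with at most max_tokens whitespace tokens."""
    return [s for s in sentences if len(s.split()) <= max_tokens]

def count_bigrams(sentences):
    """Count adjacent token pairs over whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts

def prune(counts, min_count):
    """Drop every n-gram seen fewer than min_count times."""
    return Counter({g: c for g, c in counts.items() if c >= min_count})

# Toy corpus, invented for this illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat saw the dog on the mat",
]

# Hypothetical thresholds, chosen only to make the effect visible.
short_only = filter_long(corpus, max_tokens=6)
bigrams = count_bigrams(corpus)
pruned = prune(bigrams, min_count=2)

print(len(corpus), "sentences,", len(short_only), "after length filtering")
print(len(bigrams), "bigrams,", len(pruned), "after pruning")
```

Bigrams are used here only to keep the toy example readable; the same pruning applies to any order.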

Lowercasing the training data is usually done both to make the language model smaller and faster and to make it more general and less context-specific. If you are not sure whether to lowercase (i.e., if case is not significant for what you want to model), you should lowercase the data.
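The effect of lowercasing on model size can be seen with a small sketch like the one below. The corpus and the `unigram_counts` helper are invented for illustration, and tokenization is plain whitespace splitting.

```python
from collections import Counter

def unigram_counts(sentences, lowercase):
    """Count whitespace tokens, optionally lowercasing each sentence first."""
    counts = Counter()
    for sentence in sentences:
        if lowercase:
            sentence = sentence.lower()
        counts.update(sentence.split())
    return counts

# Toy corpus with case variants of the same words (invented).
corpus = [
    "The model is small",
    "the Model is fast",
    "THE model is general",
]

cased = unigram_counts(corpus, lowercase=False)
lowered = unigram_counts(corpus, lowercase=True)

print("distinct tokens, case kept: ", len(cased))
print("distinct tokens, lowercased:", len(lowered))
```

Case variants such as "The", "the" and "THE" collapse into one entry, so the vocabulary shrinks and each remaining entry is backed by more observations. The cost is that any distinction carried by case is lost.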

Could you be a little more specific about what you want to model and why?

- Tiago

