SQL for personal glossaries -- does it make sense?
Thread poster: opolt

opolt  Identity Verified
Germany
Local time: 22:41
English to German
+ ...
Aug 2, 2010

Hi all,

I am considering moving to an SQL database for my personal glossaries (probably MySQL, PostgreSQL, or maybe SQLite), but before making this step, I'd like to hear from other freelancers who might have some experience with this kind of setup. (BTW I'm sure many agencies are using such databases, but frankly I'd prefer to hear from people working in a home office context, because that's the situation I'm in, which is quite different form a mulit-user office.)

Looking around for anything in the way of advice and references specifically related to glossaries, I haven't found too much, though I have read this article: http://www.proz.com/translation-articles/articles/225/
So I'm finding myself in the situation mentioned at the end of the article, i.e. I'm not satisfied with plain text or spread sheet files anymore. I find this approach far too inflexible, cumbersome, and error-prone. It "doesn't scale well", as they say.

I believe I'm roughly aware of the main advantages and disadvantages of SQL versus file based storage, and I'm not a computer newbie. I know there's going to be some admin work ahead for me, but I'm quite sure that in the long run, switching to SQL will pay off in huge way, given that over the years, one accumulates thousands and thousands of entries in many different fields, together with abbreviations, context, explanations, possibly client names, etc. (well that's at least how my glossaries are shaped -- they aren't simple two-column tables). Querying the entire data set in one go, and in a flexible way, looks very enticing to me. Plus, once set up, one could use the database engine for other purposes.

(For the record, I'm mainly a Linux user but that shouldn't matter too much in this context, given that the available engines should run on all major platforms.)

Right now I'm not much of a TM/CAT user, but that might change in the future; therefore, I would be grateful if someone could give me an overview of the CAT tools out there which rely on SQL databases (and, if at all possible, which SQL implementation).

So what do you think, does it make sense to set up SQL for a single freelancer, or would it be pure overkill and too much of a hassle? What are your experiences? How does the overhead compare to the convenience once you have set up the whole thing? Am I mad to even consider this?

Thanks in advance for any comments and advice.


Direct link Reply with quote
 

Stanislav Pokorny  Identity Verified
Czech Republic
Local time: 22:41
English to Czech
+ ...
Why reinvent the wheel? Aug 2, 2010

Hello Opolt,
most CAT tools have a terminology database management component that will do exactly what you need.
If you want to use a server-client approach, the CAT tools developer offer advanced server solutions (e.g. SDL has MultiTerm Server) that will enable you to manage use your termbases remotely.

Although the servers do cost a lot, the time, efforts and money you will invest to develop a solution of your own will not be much less.


Direct link Reply with quote
 

Peter Linton  Identity Verified
Local time: 21:41
Member (2002)
Swedish to English
+ ...
Facilities Aug 2, 2010

I admire your courage, but Stanislav has a point about not reinventing wheels. More importantly, these terminology management systems are tightly integrated and have facilities that I doubt you could program in SQL. A good example of such a facility is the ability to get a transition from a database and feed it directly into Microsoft Word. Of course you could always copy and paste, but that involves several keystrokes.

Check out Trados MultiTerm. Not cheap, but a vast improvement on lists and spreadsheets.


Direct link Reply with quote
 

opolt  Identity Verified
Germany
Local time: 22:41
English to German
+ ...
TOPIC STARTER
@Stanislav: Backend is what matters to me Aug 2, 2010

Stanislav Pokorny wrote:

Hello Opolt,
most CAT tools have a terminology database management component that will do exactly what you need.
If you want to use a server-client approach, the CAT tools developer offer advanced server solutions (e.g. SDL has MultiTerm Server) that will enable you to manage use your termbases remotely.

Although the servers do cost a lot, the time, efforts and money you will invest to develop a solution of your own will not be much less.


Thanks a lot for the reply, Stanislav! -- I had gathered as much, but the thing is, I'm somewhat wary of proprietary solutions when it comes to the backend (which normally affects the data format), not only because I have been "educated" computer-wise on Linux, but also because of costly updates, portability and future compatibility.

The main advantage of those CAT DB tools, it seems to me, would be that they plug right into your CAT tool, and that you don't have to fiddle around to get a decent front-end. If (only if) they can be set up with a popular standard SQL backend, such as MySQL, I would certainly consider this as a solution. Obviously this saves a lot of time, and, as you say, one doesn't have to reinvent the wheel.

(Which is the format/backend used by MultiTerm anyway?)

OTOH, there are many graphical tools out there to set up SQL querying front-ends (including in OpenOffice). Both front and backends (servers) are available for free. And my impression was that it wouldn't be overly difficult so build such a front-end for a maximum of, say 10 fields per glossary entry -- maybe I'm wrong.

I still have to investigate this a little further.



[Edited at 2010-08-02 15:57 GMT]


Direct link Reply with quote
 

Joakim Braun  Identity Verified
Sweden
Local time: 22:41
German to Swedish
+ ...
Go for it! Aug 2, 2010

A basic terminology database is simple stuff in MySQL.

You'll probably have more trouble installing/compiling the components you need (Apache, MySQL, PHP with a bunch of additions...) than actually building the database and interface, which could be done in a day if you keep it basic.

When you start thinking in terms of MySQL tables, though, you might want to consider stuff like having the references in a separate table. That way, you can add an arbitrary number of references to an entry.

You can write export functions to dump the database in whatever format is useful (tab-delimited, XML etc.) and readable by CAT tools.

Long term you'll hate having to go to a web browser to maintain and use the database, so you will need to write a small native application (Windows, MacOSX) to interface with it. I've done this on MacOSX and it's not difficult. You get much better tie-in with word processors etc. that way, and also a cleaner and more functional UI. (Yes, you could have a "look up a transition for the selected word in the database and paste the result" functionality... )

In summary, if you're an IT oriented guy, who thinks stuff like this is fun, don't pay too much attention to the "reinvent the wheel" comments. Even with a somewhat more advanced implementation, this is not a hugely complicated project.


Direct link Reply with quote
 

Stanislav Pokorny  Identity Verified
Czech Republic
Local time: 22:41
English to Czech
+ ...
Replies Aug 2, 2010

Hello Olaf,
here are a few replies.

opolt wrote:
Thanks a lot for the reply, Stanislav! -- I had gathered as much, but the thing is, I'm somewhat wary of proprietary solutions when it comes to the backend (which normally affects the data format), not only because I have been "educated" computer-wise on Linux, but also because of costly updates, portability and future compatibility.

It will not hurt to consider these issues, but if you are a freelancer, I wouldn't bother too much with them. Most terminology management tools (if not all of them) offer wide range of data exchange possibilities (TBX, Excel sheets, plain TXT etc.), so the migration from one system to another is not that difficult.

The main advantage of those CAT DB tools, it seems to me, would be that they plug right into your CAT tool, and that you don't have to fiddle around to get a decent front-end.

Exactly. This is called active terminology recognition. Read: terminology is recognized automatically for the individual segments.

If (only if) they can be set up with a popular standard SQL backend, such as MySQL, I would certainly consider this as a solution. Obviously this saves a lot of time, and, as you say, one doesn't have to reinvent the wheel.

Sorry, but I didn't investigate them that deep. Never needed that.

(Which is the format/backend used by MultiTerm anyway?)

Microsoft Jet.

OTOH, there are many graphical tools out there to set up SQL querying front-ends (including in OpenOffice). Both front and backends (servers) are available for free. And my impression was that it wouldn't be overly difficult so build such a front-end for a maximum of, say 10 fields per glossary entry -- maybe I'm wrong.

Yes, there may be and probably are some, but you still lose the active terminology recognition feature, unless you are a skilled programmer and the CAT tool developer released his SDK.

[Upraveno: 2010-08-02 16:16 GMT]


Direct link Reply with quote
 

Joakim Braun  Identity Verified
Sweden
Local time: 22:41
German to Swedish
+ ...
Table suggestions Aug 2, 2010

table 1 (language)
* uid
* language name

table 2 (glossary entry)
* uid
* source language uid
* source text
* target language uid
* target text
* comments...

table 3 (reference entry)
* uid
* glossary entry uid
* source language uid
* source text
* target language uid
* target text
* comments, source URL...
* uploaded file blob...


Direct link Reply with quote
 

opolt  Identity Verified
Germany
Local time: 22:41
English to German
+ ...
TOPIC STARTER
Clarifications Aug 2, 2010

Thanks a lot for your input, Stanislav and especially Joakim!

Maybe I should clarify a bit: I just want so set up an SQL server program on my local home server, and build a simple interface/front-end with one of the available programs out there, such as this one: http://www.glom.org/wiki/index.php?title=Screenshots (this is a Linux-based PostgreSQL front-end)

I want my growing glossary database to be absolutely future compatible and non-proprietory -- I'm talking decades here, not years. I don't want my hard-earned data to be in a format which isn't accessible after some software firm goes bust, or is bought up by AOL, or because someone decides to drop Linux support, or whatever.

What I don't want (and can't) is to program a tool which inserts words into my word processor, or anything which is even remotely comparable to what CAT programs are able to do. I just want to get away from text files and spreadsheets, which I find unwieldly. I just want to be able to search those data and have them displayed in a certain fashion -- primitive, some may say, but sufficient for my purposes. Once the SQL setup is ready, it should be relatively easy to feed the data to spreadsheets again, or to tables, text files, and convert to from there to PDF, HTML, or to print them, whatever.

Now, if there are CAT tools (for Linux) relying on SQL or comparatively flexible, open backends, I would consider buying one of those CAT tools, of course. That would allow me to get the best of both worlds, using the same database for both application (simple query and CAT), if possible.

@Joakim: I don't see why I should have to use PHP and Apache? As I said, there are tools to build front-ends (and without having to rely on browsers).

[Edited at 2010-08-02 16:48 GMT]


Direct link Reply with quote
 

Stanislav Pokorny  Identity Verified
Czech Republic
Local time: 22:41
English to Czech
+ ...
About CATs and platforms Aug 2, 2010

Hello Olaf,
actually very few CAT tools are platform-independent (I can think of WordFast, Swordfish, and OmegaT; all based on Java), while only OmegaT is open-source. However, I don't know if it uses SQL as its database format.


Direct link Reply with quote
 

Joakim Braun  Identity Verified
Sweden
Local time: 22:41
German to Swedish
+ ...
So Aug 2, 2010

Interesting! Didn't know about Glom. I'm only familiar with accessing MySQL through HTTP, whether on the same computer or not. That setup assumes a HTTP server running on the server hardware and PHP (or something) to talk to MySQL.

In any case, once you get the data into a native OS app, as you do with Glom, it shouldn't be too hard to hack together word insertion, "live" term lookup etc.


Direct link Reply with quote
 

Rossana Triaca  Identity Verified
Uruguay
Local time: 17:41
Member (2002)
English to Spanish
CAT Tools... Aug 2, 2010

What you want *is* to make a CAT tool, albeit a very basic one (in the end, they are all just databases). I sincerely don't understand why, if you are so IT-oriented, you haven't explored the different solutions offered in this area, particularly when you have such long-term goals that are the very be-all and end-all of all CAT tools (my databases are approaching the 10 year mark and, goodness gracious, I honestly can't live without them). CAT tools shine after extensive population and polishing, and the time factor takes care of this almost effortlessly.

As has been mentioned, you'd be hard press to find a CAT tool that doesn't have a decent data exchange format (to import-output the databases), so the risk of getting tied-up to a proprietary format is really non-existent, particularly for freelance in-house volumes; your biggest of the big TM won't get over 1 GB in your whole life; it is, after all plain text. (Disclaimer: This could change in the future (i.e., global-access databases with multimedia support, my personal holy grail), but if it does, everyone in the industry would have to migrate and you can be sure there will be some type of automation available. For what is worth, I doubt this will ever come to fruition).

As for specific recommendations, everyone has preferences that are mostly biased by their workload (the CAT tool the majority of their clients use or require, the type of files they receive, etc.), or your workflow (what you're used to and what you can manage in terms of complexity or environment), so everyone will steer to different solutions. Me? I'd advice you to try the 30-days free trial of DVX using CrossOver Office (but talk to Atril if you like the software and plan to purchase it since I'm not sure the dongle is working under Linux right now). DVX is SQL based, and it allows you to execute SQL commands within the same software to tinker with your databases. Mind you, I still believe it's a little bit under the hood for a CAT beginner, but it's good to know it's there for the long term.

In any event, if you take the plunge and enter the wonderful world of CAT tools, you're in for a fantastic surprise


Direct link Reply with quote
 

opolt  Identity Verified
Germany
Local time: 22:41
English to German
+ ...
TOPIC STARTER
Reply to Rossana Aug 2, 2010

Rossana, yours is a very inspiring post with some good points, which encourages me to delve a bit deeper into the world of CAT (which I haven't done very much so far -- see below), but maybe I should explain a bit where I come from (this is going OT a bit for which I apologize beforehand):

Rossana Triaca wrote:

What you want *is* to make a CAT tool, albeit a very basic one (in the end, they are all just databases). I sincerely don't understand why, if you are so IT-oriented, you haven't explored the different solutions offered in this area, particularly when you have such long-term goals that are the very be-all and end-all of all CAT tools (my databases are approaching the 10 year mark and, goodness gracious, I honestly can't live without them). CAT tools shine after extensive population and polishing, and the time factor takes care of this almost effortlessly.


Though my main point was about glossaries, yes, I'm soon going to investigate one of the commercial CAT options on Linux. As you can see, I don't know much about what's out there -- that's why I'm asking.

However I'm not looking for perfection, Rossana, only for some common sense in the form of a software solution I have been using computers since 1993 or so and my experience shows that the software industry in general can't be relied on when it comes to future compatibility and open formats -- though it constantly claims to solve all of the problems of humankind within the next 3 years. Based on that, my attitude is much less optimistic than yours wrt to compatibility and easy data exchange.

(As an aside: yes, I'm quite IT-oriented, but OTOH rather sceptical when it comes to language automation. I was talking to an electrical engineer back in the mid nineties[!] who told me that MT was going to take over real soon now ... And despite having learned my IT stuff, I'm still old-fashioned enough to value creativity much more than, say, fuzzy matches. Maybe it's because I have dedicated so much time to computers, I'm aware of their limits a bit more than others; my thinking has always been that binary logic has very little to do with the way human languages work.)

There are other reasons why I haven't dedicated much time to CAT tools so far, including they type of my work and clients, and my decision to choose Linux as my main platform many years ago: until quite recently, CAT tools weren't available under Linux at all.

Be this as it may, I'm primarly interested in a pure glossary solution right now. I still often find myself entering terms manually after reading an article, for instance. (Or, just now I'm wondering which is the best way to keep a local copy of my little Proz glossary.) Now, if the CAT tools available can solve that problem for me, fine, maybe I'm going to use one of them.



As has been mentioned, you'd be hard press to find a CAT tool that doesn't have a decent data exchange format (to import-output the databases), so the risk of getting tied-up to a proprietary format is really non-existent, particularly for freelance in-house volumes; your biggest of the big TM won't get over 1 GB in your whole life; it is, after all plain text. (Disclaimer: This could change in the future (i.e., global-access databases with multimedia support, my personal holy grail), but if it does, everyone in the industry would have to migrate and you can be sure there will be some type of automation available. For what is worth, I doubt this will ever come to fruition).


Well, "decent data exchange format", yes, but by that I understand CSV and plain text files, spreadsheet formats, HTML tables, and the like (which are different from databases). I want to get away from these formats, managing the entries as a proper database, which should be robust, non-proprietary, future compatible (i.e. direct migration with no ex-/import, within 20 years), and allow me to query/enter data in a flexible way, from different frontends (such as, well, different CAT tools). I'm not so sure current CAT tool databases could be relied on for that. But maybe they can -- again, that's why I'm asking. I mean, I'm really interested to learn about the database backends used by those programs. Most of them use TMX of course, but AFAIK TMX is not very useful for glossaries, and is difficult to query by any tool which is not a CAT tool


As for specific recommendations, everyone has preferences that are mostly biased by their workload (the CAT tool the majority of their clients use or require, the type of files they receive, etc.), or your workflow (what you're used to and what you can manage in terms of complexity or environment), so everyone will steer to different solutions. Me? I'd advice you to try the 30-days free trial of DVX using CrossOver Office (but talk to Atril if you like the software and plan to purchase it since I'm not sure the dongle is working under Linux right now). DVX is SQL based, and it allows you to execute SQL commands within the same software to tinker with your databases. Mind you, I still believe it's a little bit under the hood for a CAT beginner, but it's good to know it's there for the long term.


I didn't know DVX is SQL based, which in itself is very good news. OTOH CrossOver isn't really for me; if I need Windows, I reboot into that. There's also Wordfast for Linux; I have read that it works with MS Word only (but not OOffice) even on Linux; if true, that would be most pathetic. OmegaT, which I use, doesn't rely on any database engine that I'm aware of. Right now my best option seems to be Heartsome Translation Studio which claims built-in SQL compatibility. I think I'm going to test that one first.

Cheers, opolt


[Edited at 2010-08-02 22:21 GMT]

[Edited at 2010-08-02 22:23 GMT]


Direct link Reply with quote
 

Rossana Triaca  Identity Verified
Uruguay
Local time: 17:41
Member (2002)
English to Spanish
CATs rule :) Aug 3, 2010

Glad my post was of some use, although you made me feel old with the comment about using computers since the nineties... my first computer was a Spectrum when they were just the latest fad

(...) that the software industry in general can't be relied on when it comes to future compatibility and open formats -- though it constantly claims to solve all of the problems of humankind within the next 3 years. Based on that, my attitude is much less optimistic than yours wrt to compatibility and easy data exchange.


The thing is, you have to consider that language industry standards are quite a novelty yet; XLIFF has been around for about 8 years, and noone can predict if encoding/data exchange standards will change. I don't think that expecting future software to support new data exchange standards it's optimism on my part, rather that as long as the big players have a vested interest in this they will push for them (and for big I mean L-3 Comm, Lionbridge, etc.), not out of the goodness of their hearts but rather to profit from them as clients or providers (Lionbridge now has his own CAT tool based on the old Logoport, Geoworkz, which I honestly don't like one bit but hey, they are the industry leaders if you omit the US military and, guess what, they included a native XLIFF editor so you too can migrate to their platform).

(As an aside: yes, I'm quite IT-oriented, but OTOH rather sceptical when it comes to language automation (...) my thinking has always been that binary logic has very little to do with the way human languages work.)


You see, I don't know when or where this idea that MT equal CATs got entrenched in the collective psyche; let me assure you they are completely different things, and that the automation applies to your workflow and quality assurance procedures and *not* the language itself. The result always feeds on your own labor and is hence limited by your skill; "fuzzy matches" is just a glorified way of saying "a search result of your own archives using loose constraints" (which you control in their entirety).

Now, if the CAT tools available can solve that problem for me, fine, maybe I'm going to use one of them.


They can do much more than that, but even if you only consider the purpose you have in mind (using your example), they can remember exactly the issue of the magazine of the term you found, the author of the article, which industry it applied to, in short, anything and everything you want to include when you enter the term in your database. And the fantastic thing about a good CAT is that it will remind you of this little term exactly when you need it (and even then it's just a suggestion you can ignore).

Well, "decent data exchange format", yes, but by that I understand CSV and plain text files, spreadsheet formats, HTML tables, and the like (which are different from databases). I want to get away from these formats, managing the entries as a proper database, which should be robust, non-proprietary, future compatible (i.e. direct migration with no ex-/import, within 20 years), and allow me to query/enter data in a flexible way, from different frontends (such as, well, different CAT tools).


By "decent data exchange" I was thinking XLIFF/XML; the rest are nice but they don't pass muster. As for the rest, I don't think any CAT tool can offer you that (future compatibility for 20 years? Not without a crystal ball into the encoding standards future I fear), but I don't think you can develop it in-house either.

Right now my best option seems to be Heartsome Translation Studio which claims built-in SQL compatibility. I think I'm going to test that one first.


I'm afraid I know little of HTS (here's an interesting review) since my Linux days are long past due to business and gaming reasons , but by all means I think that any new(ish) CAT will include the basics (although converting every file that comes your way so you can process them under Linux would surely get old for me, but then again, I deal with proprietary formats most of the time.)


Direct link Reply with quote
 

opolt  Identity Verified
Germany
Local time: 22:41
English to German
+ ...
TOPIC STARTER
@Rossana Aug 4, 2010

Thanks for your latest input (and for the link), Rossana, which I have read carefully. We have gone a bit OT now, which is really the only reason why I'm not going to reply to each of your points.

Re the industry standards topic, by and large I disagree. Witness the "browser wars", for instance. Standards are useful and in most cases can be adhered to, and interoperability, exchangeability can be implemented -- if the companies are willing to. Often they are not, though technically it's not difficult in most cases. And if you have been using computers that long, you should remember some cases of "Well, how do I open that old file now?" Anyway, I just wanted to make the point that one should try to keep one's data in a format which promises to be viable long-term.


(As an aside: yes, I'm quite IT-oriented, but OTOH rather sceptical when it comes to language automation (...) my thinking has always been that binary logic has very little to do with the way human languages work.)


You see, I don't know when or where this idea that MT equal CATs got entrenched in the collective psyche; let me assure you they are completely different things, and that the automation applies to your workflow and quality assurance procedures and *not* the language itself. The result always feeds on your own labor and is hence limited by your skill; "fuzzy matches" is just a glorified way of saying "a search result of your own archives using loose constraints" (which you control in their entirety).


Yes, search results, and indexing, which overall is some rather old-fashioned technology -- only that it's now much better integrated with some applications, and more feasible owing to powerful computers. -- I'm not toally unfamiliar with CAT, Rossana; CAT tools are useful as automated search tools, my only question was if they're the best solution for glossaries. And I wasn't suggesting MT and CAT are the same. Of course they're not. I only wanted to make the point that languages don't follow boolean logic; hence the invention of the term "fuzzy match", which in itself is a bit "fuzzy". Or "approximate", as they say. Only that languages aren't subject to "approximate logic" either; it's like saying that psychology, for instance, or human societies "favour discrete values over true/false decisions" -- which would be pure nonsense. (This reminds me of that joke among programmers where they refer to the "world out there" as "the big blue room".)


Well, "decent data exchange format", yes, but by that I understand CSV and plain text files, spreadsheet formats, HTML tables, and the like (which are different from databases). I want to get away from these formats, managing the entries as a proper database, which should be robust, non-proprietary, future compatible (i.e. direct migration with no ex-/import, within 20 years), and allow me to query/enter data in a flexible way, from different frontends (such as, well, different CAT tools).


By "decent data exchange" I was thinking XLIFF/XML; the rest are nice but they don't pass muster. As for the rest, I don't think any CAT tool can offer you that (future compatibility for 20 years? Not without a crystal ball into the encoding standards future I fear), but I don't think you can develop it in-house either.


Well, SQL has been out for decades now, that's why I was considering it. And XML and XLIFF are basically just text files, with all the pros and cons already mentioned (and some additional ones). I was really asking about databases.

Sea como sea, un saludo amable a la República Oriental




[Edited at 2010-08-04 11:27 GMT]


Direct link Reply with quote
 


To report site rules violations or get help, contact a site moderator:


You can also contact site staff by submitting a support request »

SQL for personal glossaries -- does it make sense?

Advanced search






Anycount & Translation Office 3000
Translation Office 3000

Translation Office 3000 is an advanced accounting tool for freelance translators and small agencies. TO3000 easily and seamlessly integrates with the business life of professional freelance translators.

More info »
CafeTran Espresso
You've never met a CAT tool this clever!

Translate faster & easier, using a sophisticated CAT tool built by a translator / developer. Accept jobs from clients who use SDL Trados, MemoQ, Wordfast & major CAT tools. Download and start using CafeTran Espresso -- for free

More info »



Forums
  • All of ProZ.com
  • Term search
  • Jobs
  • Forums
  • Multiple search