Brewing an open source CAT: Anyone had experience with XML/SGML?
Thread poster: Florian v. Savigny

Florian v. Savigny  Identity Verified
Local time: 22:59
English to German
Apr 25, 2003

Hi there,



I'm just posting this to find out if there are more translators out there trying to brew their "home-grown" solutions for translating.



I've been using SGML for various purposes over several years now (own texts, terminology management (albeit only data storage, thus far), "synoptic" translations, management of linguistic knowledge).



What would interest me a lot would be the possibility to use it simply to create an open source CAT solution, for which SGML/XML would seem the natural language to use. I would think that the following requirements are already fulfilled:



- the editing environment (I'd favour Emacs' psgml mode, but I think there are plenty of SGML/XML editors around)

- tools to convert the SGML source into whatever one needs (especially RTF, Postscript, PDF and HTML would be quite simple)





while the following tasks would be at least quite straightforward to tackle:



- fragmentation of a source text

- spellchecking (at least inside Emacs)



- integration with terminology management would be a bit more work, but could be made to work with different systems, via a standard interface



The hard bits would be the translation memory and all that matching, I suppose, since that would have to be realised in a truly time-efficient manner.





A more advanced (but maybe impossible) thing would be the fragmentation of already formatted text, leaving all formatting information untouched. The charm of that, certainly, would be the possibility of editing any kind of file without having to use the program itself. I am wondering whether this could be achieved by enclosing the formatting information in marked sections (such as CDATA), but that has several problems:

1, I suppose it does not work with binary formats (is that so? I mean it should be no problem with RTF, but one with Word, for instance)

2, you could not easily add any formatting of your own in the translation, nor could you leave out any (if you have no idea what it is, you can't edit it), and that would mean you could not use italics where they aren't in the source text.

3, you'd have to know at least how characters are represented for every single format, to distinguish them from formatting stuff. Diacritics are possibly particularly arbitrary.
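
To make the marked-section idea concrete, here is a minimal sketch in Python for the RTF case only. It assumes that RTF formatting can be recognised purely lexically (control words, control symbols and group braces) and that hex escapes such as \'fc are characters in the Windows-1252 code page; both are simplifying assumptions, and the function name `encapsulate_rtf` is made up for illustration. It is not a full RTF tokenizer.

```python
import re
from xml.sax.saxutils import escape

# Control words (\b, \fs24, \li-360), other control symbols (\~, \*) and group braces.
MARKUP_RE = re.compile(r"\\[A-Za-z]+-?\d* ?|\\[^A-Za-z']|[{}]")
# RTF hex escape for one 8-bit character, e.g. \'fc
HEXCHAR_RE = re.compile(r"\\'([0-9A-Fa-f]{2})")


def encapsulate_rtf(rtf, codepage="cp1252"):
    """Wrap RTF formatting in CDATA marked sections, leaving the text editable.

    Hypothetical helper: the code page is an assumption and would have to be
    read from the RTF header in a real tool.
    """
    # Point 3 above: characters must be told apart from formatting first.
    rtf = HEXCHAR_RE.sub(
        lambda m: bytes.fromhex(m.group(1)).decode(codepage), rtf
    )
    parts = []  # list of (kind, string), kind is "fmt" or "text"
    pos = 0
    for m in MARKUP_RE.finditer(rtf):
        if m.start() > pos:
            parts.append(("text", rtf[pos:m.start()]))
        if parts and parts[-1][0] == "fmt":
            # Merge adjacent formatting tokens into one marked section.
            parts[-1] = ("fmt", parts[-1][1] + m.group(0))
        else:
            parts.append(("fmt", m.group(0)))
        pos = m.end()
    if pos < len(rtf):
        parts.append(("text", rtf[pos:]))
    return "".join(
        "<![CDATA[" + s + "]]>" if kind == "fmt" else escape(s)
        for kind, s in parts
    )


print(encapsulate_rtf(r"{\b Gr\'fc\'dfe} aus Bielefeld"))
```

The sketch also shows problem 2 in practice: the translator sees the bold run only as an opaque marked section and cannot add a new one without knowing RTF.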



The benefits of an open source system in general would be the possibility to apply all sorts of modifications once it is there. The benefits of SGML/XML would be the possibility to re-use the source for different purposes (such as producing several formats, making different outputs for proof-reading, and the like), and maybe a high degree of stability.



If anybody feels inspired by or interested in anything of the above, please feel free to add your own ideas. I'm keen to know if anyone has similar ideas.



Florian



Ralf Lemster  Identity Verified
Germany
Local time: 22:59
English to German
+ ...
OmegaT Apr 26, 2003

Hi Florian,

Have a look at this thread - I guess you might be interested to look at OmegaT, for example.



HTH - Ralf

[ This Message was edited by: Lemster on 2003-04-26 06:17]


xxxMarc P  Identity Verified
Local time: 22:59
German to English
+ ...
Open source/XML Apr 26, 2003

Hallo Florian,



Take a look at the "Linux for translators" resource pages at



www.marcprior.de/linux/linux.html



In particular, OmegaT at



www.marcprior.de/OmegaT/OmegaT.html



Quite a lot of progress in the direction you envisage has already been made with OmegaT. Why not try it out - I would be interested to hear your comments.



Regards,

Marc




Florian v. Savigny  Identity Verified
Local time: 22:59
English to German
TOPIC STARTER
Comments on OmegaT May 6, 2003

Marc wrote:



<quote>

Quite a lot of progress in the direction you envisage has already been made with OmegaT. Why not try it out - I would be interested to hear your comments.

</quote>





Well, first of all, congratulations on providing a free and open source CAT to the community. Though Java is not my kind of thing, I do appreciate its advantage of cross-platform availability.



I have installed and had a closer look at OmegaT, and the files it produces, and these are the questions it gave rise to:



- OmegaT does not keep a complete "bilingual" file, but it compiles the target file from the translation memory and the source file. Why exactly did you choose that approach?



- In translating HTML, non-ASCII characters are stored in the TM as very weird things (e.g. German u umlaut (ü) is represented as "ü"), but correctly transformed into HTML entities (&uuml; in this case) in the target file. Just curiosity, but why?



- inline markup seems to be represented in an abbreviated form, i.e. two <a ... > tags in one segment will be represented in the TM as <a0> and <a1>, respectively. It is easy to guess that the numbers refer to the first and the second <a>-tag in the source segment, and that these references will be resolved in the compiling process.



- to summarise: can OmegaT, by virtue of its design, deal with other markup than SGML/XML, e.g. RTF or TeX? Your solution seems to rely on SGML/XML markup in the source rather than using it to encapsulate alien markup. (Well, that's my impression.)



I have decided to comment on other issues in a separate article (those that are less technical).



Cheers, Florian



Florian v. Savigny  Identity Verified
Local time: 22:59
English to German
TOPIC STARTER
OmegaT: user interface May 6, 2003

Quote:


Take a look at the "Linux for translators" resource pages at







I forgot to say that these pages are truly ingenious. Thanks a lot for providing them!





Now I'll go on commenting on the more apparent features of OmegaT:



- first of all, I have to say the user interface is very intuitive. I personally like to be able, however, to make modifications (such as changing the sizes of the windows).



- I was puzzled by the mode of segmenting. With a text file, the segments were lines, which is complete nonsense. With an HTML or OO file, the segments were paragraphs, which is much too big to yield any matches at all. I think this is a point that really needs fixing. I think that sentences should be the default segments.



- an "advanced" feature which would be nice would be the possibility to change the size of individual segments by hand.



- there's one thing I haven't quite understood, and that's whether matches will only be found in the same document (since the TM has a double purpose: it also serves to document the translation - thus, "alien" matches would not seem to fit there). I think TMs should also provide the possibility to contain matches which are not from the text itself.





That's all I can see so far. What makes me most sceptical is the impression that it is restricted to SGML/XML marked up files. Apart from that, I think it is a real feat, and go on developing it!



Florian

xxxMarc P  Identity Verified
Local time: 22:59
German to English
+ ...
OmegaT May 7, 2003

Hallo Florian



Thanks for taking the time to try out OmegaT. I'll try to answer your questions one by one. The developer, by the way, is Keith Godfrey. I am responsible for documentation, publicity, etc.



>Though Java is not my kind of thing<



It is not my kind of thing either. In fact, I had been working on a CAT tool of my own, first in ELF (the Applixware macro language), and later in tcl/tk, before I became involved in OmegaT. The most important thing is of course the cross-platform nature. Almost all open-source applications are written for Linux; almost all translators use Windows. So with a nascent open-source CAT tool, we have the unusual situation that developers (existing and potential) are likely to be using a different platform to that of the users. If OmegaT had been a Windows-only application, it would have been no use to me.



>OmegaT does not keep a complete "bilingual" file, but it compiles the target file from the translation memory and the source file.<



That is correct. You'll have to ask the program's developer why he chose this solution. However, I believe other CAT apps use the same approach. Déjà Vu, perhaps?



>In translating HTML, non-ASCII characters are stored in the TM as very weird things (e.g. German u umlaut (ü) is represented as "ü"), but correctly transformed into HTML entities (&uuml; in this case) in the target file.<



The weird things are Unicode. OmegaT's internal translation memory format is TMX, the industry standard, and TMX is Unicode.
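
A small Python sketch of what was probably going on, assuming the TMX file is stored as UTF-8 and was viewed in an editor set to Latin-1 (the thread only says "Unicode", so the exact encodings are an assumption). The second function shows the kind of character-to-entity mapping that would produce &uuml; in an HTML target file; both function names are made up for illustration.

```python
from html.entities import codepoint2name


def as_seen_in_latin1(text):
    """Show how UTF-8 encoded text looks when misread as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def to_html_entities(text):
    """Replace non-ASCII characters by named HTML entities where one exists,
    and by numeric character references otherwise."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)
        elif code in codepoint2name:
            out.append("&" + codepoint2name[code] + ";")
        else:
            out.append("&#%d;" % code)
    return "".join(out)


print(as_seen_in_latin1("ü"))     # two characters instead of one
print(to_html_entities("Grüße"))
```

The stored data is the same in both cases; only the way the bytes are interpreted for display differs.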



>inline markup seems to be represented in an abbreviated form (...)<



Your description is broadly correct. Font tags are , others . The tags are numbered by tag type, not by instance, so if two different words are in bold in a sentence, they will both have the tag pair e.g. . You should find a more detailed but translator-oriented description on handling the tags in the documentation.
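
The abbreviation scheme can be sketched roughly as follows in Python. This is an illustration under assumptions, not OmegaT's actual code: it assumes that identical opening tags share one placeholder (as with two bold words) while tags that differ in their attributes get consecutive numbers (as with the two <a ...> tags observed above). The function names are hypothetical.

```python
import re

# Matches an opening or closing tag: optional "/", tag name, optional attributes.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>")


def abbreviate_tags(segment):
    """Replace inline tags by short numbered placeholders such as <a0>.

    Returns the abbreviated segment and a table mapping each placeholder
    name to the original opening tag.
    """
    table = {}      # placeholder name -> original opening tag
    seen = {}       # original opening tag -> placeholder name
    counters = {}   # tag name -> next free number
    open_tags = []  # stack of (tag name, placeholder) not yet closed

    def repl(match):
        closing, name, _attrs = match.groups()
        if closing:
            # Pair the closing tag with the most recent open tag of that name.
            for i in range(len(open_tags) - 1, -1, -1):
                if open_tags[i][0] == name:
                    return "</%s>" % open_tags.pop(i)[1]
            return match.group(0)  # stray closing tag: leave untouched
        original = match.group(0)
        if original not in seen:
            number = counters.get(name, 0)
            counters[name] = number + 1
            seen[original] = "%s%d" % (name, number)
            table[seen[original]] = original
        open_tags.append((name, seen[original]))
        return "<%s>" % seen[original]

    return TAG_RE.sub(repl, segment), table


def restore_tags(segment, table):
    """Resolve placeholders back into the original tags (the compile step)."""
    def repl(match):
        closing, placeholder, _attrs = match.groups()
        if placeholder not in table:
            return match.group(0)
        if closing:
            return "</%s>" % TAG_RE.match(table[placeholder]).group(2)
        return table[placeholder]

    return TAG_RE.sub(repl, segment)


source = 'See <a href="x.html">this</a> and <a href="y.html">that</a>, <b>now</b> or <b>later</b>.'
short, table = abbreviate_tags(source)
print(short)
print(restore_tags(short, table) == source)
```

Because the translator only ever sees the placeholders, the target segment can reorder them freely, and the compile step puts the full tags back.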



>Can OmegaT, by virtue of its design, deal with other markup than SGML/XML, e.g. RTF or TeX?<



OmegaT currently handles plain text, HTML, and StarOffice/OpenOffice.org Writer. The arrangement is ideally suited to other XML formats, and also similar non-XML arrangements like Applix Words and DTP tagged file formats. Writing such parsers should require very little time and programming skill.



>I personally like to be able, however, to make modifications (such as changing the sizes of the windows).<



You have probably noticed that OmegaT changes the size of the windows automatically. There are limitations to this mechanism, although I must say that I didn't encounter them until I had used OmegaT on a daily basis for at least six months. User-driven resizing is probably a function available in the Java API.



>I was puzzled by the mode of segmenting.<



OmegaT segments by the paragraph. Always. This is unusual for a CAT tool, and most people who have used a different CAT tool don't like it - for the reason you give. Personally, though, I believe it is better to keep paragraphs together, in order to provide more context. The ideal solution, as I see it, would be to retain paragraph-level segmenting but for the search mechanism to split the segment up into sentences before searching for fuzzy matches.
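
As a rough illustration of that "ideal solution", here is a Python sketch that keeps the paragraph as the segment but splits it into sentences for the fuzzy search only. The sentence splitter (break after ., ! or ? followed by whitespace), the similarity measure (difflib's ratio) and the 0.7 threshold are all assumptions chosen for brevity; nothing here reflects how OmegaT's matcher actually works.

```python
import re
from difflib import SequenceMatcher

# Naive sentence boundary: whitespace that follows ., ! or ?
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")


def split_sentences(paragraph):
    """Split a paragraph into sentences (abbreviations are not handled)."""
    return [s for s in SENTENCE_END_RE.split(paragraph.strip()) if s]


def sentence_matches(paragraph, memory, threshold=0.7):
    """Look up each sentence of a paragraph-level segment in a memory.

    `memory` maps source sentences to translations. Returns one tuple
    (sentence, matched source or None, translation or None, score) per sentence.
    """
    results = []
    for sentence in split_sentences(paragraph):
        best_source, best_score = None, 0.0
        for source in memory:
            score = SequenceMatcher(None, sentence.lower(), source.lower()).ratio()
            if score > best_score:
                best_source, best_score = source, score
        if best_source is not None and best_score >= threshold:
            results.append((sentence, best_source, memory[best_source], best_score))
        else:
            results.append((sentence, None, None, best_score))
    return results


memory = {"Open the file.": "Öffnen Sie die Datei."}
for row in sentence_matches("Open the file. Close the window.", memory):
    print(row)
```

Comparing every sentence against every memory entry is quadratic and would be too slow for a large TM; a real implementation would need some kind of index.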



>With a text file, the segments were lines, which is complete nonsense.<



I would guess that the reason for this was that your text file had hard line breaks.



>an "advanced" feature which would be nice would be the possibility to change the size of individual segments by hand.<



See above. The problem is essentially that OmegaT does not perform any segmenting; instead it uses the existing paragraph elements as the segment boundaries. There is no way the existing segment boundaries can be moved, because there aren't any (unless, of course, you open the source text and insert a paragraph break). I honestly have no idea how difficult it would be to change the segmenting mechanism, but before boundaries can be moved, they obviously have to be created in the first place.



>there's one thing I haven't quite understood, and that's whether matches will only be found in the same document<



Yes and no. OmegaT accesses translation memories in two locations:



- project_save.tmx (by default in the directory /omegat)

- external memories, any number of them, e.g. from previous projects, by default in the directory /tm.



Fuzzy match searches search all of these memories. The keyword search function (the equivalent of Trados "Concordance"), however, only searches project_save.tmx. In order for the data in external memories to be accessible to the keyword search function, the original source file(s) of the TM must be included in the project, as well as the TMs derived from them. For this purpose, OmegaT provides a mechanism for presenting these files for searches, but not for translation. From a translator's perspective, this means that if you want to use - or, more precisely, have full access to - a TM for, for example, one customer, you would create a project for that customer and then add any future orders/files for that customer to the project already created. In terms of file management, this is not a user-friendly solution and needs to be fixed.
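
For anyone who wants to look inside these memories with their own scripts, the following Python sketch reads source/target pairs from TMX files. The directory layout (an omegat subdirectory holding project_save.tmx and a tm subdirectory with external memories) follows the description above; the function names and the handling of both the older lang attribute and xml:lang are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# ElementTree expands the predefined xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def parse_tmx(xml_data, source_lang, target_lang):
    """Return (source, target) pairs from TMX data given as str or bytes."""
    root = ET.fromstring(xml_data)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline elements.
                segs[lang.lower()] = "".join(seg.itertext())
        src = segs.get(source_lang.lower())
        tgt = segs.get(target_lang.lower())
        if src is not None and tgt is not None:
            pairs.append((src, tgt))
    return pairs


def load_project_memories(project_dir, source_lang, target_lang):
    """Collect pairs from the project memory and all external memories."""
    project_dir = Path(project_dir)
    files = [project_dir / "omegat" / "project_save.tmx"]
    files += sorted((project_dir / "tm").glob("*.tmx"))
    pairs = []
    for tmx_file in files:
        if tmx_file.is_file():
            pairs.extend(parse_tmx(tmx_file.read_bytes(), source_lang, target_lang))
    return pairs
```

A call such as load_project_memories("my_project", "EN-US", "DE-DE") would then give one list covering both locations, which is roughly what the fuzzy match search sees.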



>That's all I can see so far. What makes me most sceptical is the impression that it is restricted to SGML/XML marked up files.<



Providing dedicated filters to proprietary formats is a huge amount of programming effort, and a massive waste of resources in an open-source project. OmegaT relies on OpenOffice.org's excellent filters to MS Word and RTF. I have been using this procedure for months now: open Word file in OOo, convert to OOo (.sxw) format, import into OmegaT; follow the procedure in reverse after translation.



>Apart from that, I think it is a real feat, and go on developing it!<



It may not be perfect, but a bird in the hand is worth two in the bush, as they say. Development has slowed at the moment due to lack of time/resources. My real hope for OmegaT is that it will prove to be the link between the open-source programming and mainstream translation communities.



Schöne Grüße aus Bergisch Gladbach,



Marc




Florian v. Savigny  Identity Verified
Local time: 22:59
English to German
TOPIC STARTER
OmegaT: further discussion of "software-political" stuff May 7, 2003

Dear Marc,



thank you for answering so quickly. Though there are quite a few interesting points, I understand it is better to discuss the more technical stuff with Keith Godfrey. I'd be very grateful if you could provide me with his email address (or maybe he'd like to participate here?).



Here, I'll focus on the more apparent stuff:





> we have the unusual situation that developers (existing and potential) are likely to be using a different platform to that of the users. <



This does sound very wise. I hardly use Windows at all, but I can imagine Java is the most widespread scripting language there (it does take long to start, though, but not to run).



> The weird things are Unicode.<



Oops. I should have guessed.



> You have probably noticed that OmegaT changes the size of the windows automatically. <



Well, what I am mainly thinking of is that segmenting also has the function of providing a more ergonomic environment (i.e., you do not overlook sentences and so on); and I imagine that translators will sit for hours staring at their screen with the OmegaT window. I think that this merits special attention to customization functions (e.g. how about tearing off windows, changing colours of the background or the frames?)



> The ideal solution, as I see it, would be to retain paragraph-level segmenting but for the search mechanism to split the segment up into sentences before searching for fuzzy matches.<



I couldn't agree more! Yes!



(With both, I mean, but I maintain this should be at the choice of the user.)



>>With a text file, the segments were lines, which is complete nonsense.<



I would guess that the reason for this was that your text file had hard line breaks.<



Well, yes, but that is not unusual for text files (esp. if they were written with an editor); a paragraph break would be a sequence of two line breaks. I personally have never seen soft line breaks except in the text files exported by word processors and some email programs.
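
The convention described here (a blank line ends a paragraph, a single line break is just wrapping) takes only a few lines to implement. The Python sketch below is an illustration of that convention only; the function name is made up, and it deliberately ignores cases such as indented lists or preformatted blocks, where joining lines would be wrong.

```python
import re

# One or more blank (or whitespace-only) lines separate paragraphs.
PARAGRAPH_BREAK_RE = re.compile(r"\n(?:[ \t]*\n)+")


def paragraphs_from_hard_wrapped(text):
    """Turn hard-wrapped plain text into one string per paragraph."""
    text = text.replace("\r\n", "\n").strip()
    paragraphs = []
    for block in PARAGRAPH_BREAK_RE.split(text):
        lines = [line.strip() for line in block.split("\n") if line.strip()]
        if lines:
            # A single line break inside a paragraph is treated as a space.
            paragraphs.append(" ".join(lines))
    return paragraphs


sample = "First line\nof paragraph one.\n\nSecond paragraph\nhere.\n"
print(paragraphs_from_hard_wrapped(sample))
```

With a reader like this, a text file written in Emacs would yield paragraph-level segments just like an HTML or OpenOffice.org file.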



> I honestly have no idea how difficult it would be to change the segmenting mechanism, but before boundaries can be moved, they obviously have to be created in the first place. <



Well, the code that handles text files must have a segmenting mechanism, if a primitive one. How about starting there? - Honestly, I think the segmenting mechanism (and its flexibility) is crucial for the success of the program.



> Providing dedicated filters to proprietary formats is a huge amount of programming effort, and a massive waste of resources in an open-source project.<



Hmm, maybe, but aren't you contradicting what you said above:



> The arrangement is ideally suited to other XML formats, and also similar non-XML arrangements like Applix Words and DTP tagged file formats. Writing such parsers should require very little time and programming skill. <



> OmegaT relies on OpenOffice.org's excellent filters to MS Word and RTF. <



One suggestion: since OpenOffice is open source as well, parts of it should be freely reusable (don't know about its licence, though). How about extracting its filters and making OmegaT use them "in the background" (I know: this is Unix philosophy; and I have no idea whether it is possible to start system processes from inside Java code)? The reason I am asking is that OO takes hours to start on my desktop, 95% of which are presumably redundant.



> My real hope for OmegaT is that it will prove to be the link between the open-source programming and mainstream translation communities. <



I think you have taken the right direction in choosing Java, OO, Win-typical keybindings, an intuitive interface, and I think it is really an interesting project. I am just sorry I cannot contribute due to my complete ignorance of Java.



> Schöne Grüße aus Bergisch Gladbach, <



Auch aus Bielefeld!



Florian


xxxMarc P  Identity Verified
Local time: 22:59
German to English
+ ...
More OmegaT May 7, 2003

>This does sound very wise.<



Keith did in fact produce a forerunner to OmegaT in C (or C++, I don't know which). The two languages are quite similar and he did consider switching back to C at some stage.



>I can imagine Java is the most widespread scripting language there (it does take long to start, though, but not to run).<



I don't have an IT background but technically, I don't think Java is a scripting language. At any rate, it does rely on an interpreter and you are right in implying that that results in a performance hit. There are trade-offs in terms of the multi-platform capability and also in the ease with which things like GUIs can be produced. If performance became an issue, I think a switch back to C would not be too difficult. There are programs which supposedly will convert Java to native C code; I am sceptical that they deliver as promised but the fact that they exist at all suggests that porting is not so difficult. We have one OmegaT user with Mac OS X who is talking about porting OmegaT to native Mac code.



>Well, what I am mainly thinking of is that segmenting also has the function of providing a more ergonomic environment<



I would also like to see more in the way of ergonomic controls. A couple of points about that:



- The current 1.0.2 version is a massive improvement over previous versions - if you had seen the February 2001 release, you would think the current version fantastic! (Read my review in Gabe Bokor's Translation Journal if you like. There have been six or seven releases since then.)



- A lot of the look and feel of Java applications is controlled at Java, not application level. This has both benefits and drawbacks. Some things are in fact quite difficult to change at application level.



>Well, yes, but that is not unusual for text files (esp. if they were written with an editor); a paragraph break would be a sequence of two line breaks.<



I can't remember when I last processed a plain text file in OmegaT. Have you tried opening and saving in a different editor?



>Well, the code that handles text files must have a segmenting mechanism, if a primitive one.<



It has a segmenting mechanism, yes, but it doesn't insert any segment boundaries. It recognizes the existing paragraph boundaries. I've no doubt that a solution could be found, but it is somewhat more fundamental than moving boundaries in, say, Trados, where you effectively tell Trados NOT to do something it would otherwise do.



>How about starting there? - Honestly, I think the segmenting mechanism (and its flexibility) is crucial for the success of the program.<



I suspect in fact that it would be easier to change the fuzzy search mechanism to regard text strings between full stops as search units. But that's just a hunch.



>Hmm, maybe, but aren't you contradicting what you said above:<



I'm not sure where the contradiction is. If you mean: easy to add filters for Applix or DocBook, for example, but not for Word, no, that's not a contradiction. Any format that is XML or XML-like is easy to add. Any format that is not open/documented involves a tremendous amount of work. For this reason, MS Word's file format is one of the world's most valuable secrets.



RTF lies somewhere in between. It is a text-based format, but it does not resemble XML. But a) it is documented (albeit badly) and b) Java contains an integral RTF interpreter, so I imagine an RTF filter could be written relatively easily. But it would be a different solution to the current filters.



>since OpenOffice is open source as well, parts of it should be freely reusable (don't know about its licence, though). How about extracting its filters and make OmegaT use them "in the background"<



Yes, we've already thought of doing that. It's not as easy as it sounds, though. If it were, someone (like the KWord or Abiword teams) would already have done it.



>The reason I am asking is that OO takes hours to start on my desktop<



And on mine. But you only need start it once, in order to convert Word files to .sxw, and once again, to convert them back. And once it's loaded, I don't find it THAT slow. I do in fact keep it, and a copy of the source file, open so that I can see the layout.



>I am just sorry I cannot contribute due to my complete ignorance of Java.<



Right now, we do need developers, it's true. But we also need testers, documentation authors, translators, word-spreaders, and a whole lot besides.



Marc



Florian v. Savigny  Identity Verified
Local time: 22:59
English to German
TOPIC STARTER
OmegaT: clarifications May 7, 2003

Dear Marc,



> At any rate, it does rely on an interpreter and you are right in implying that that results in a performance hit.<



In my impression, it wasn't perceptible once the application was running; it was just the startup that took long.



> If performance became an issue, I think a switch back to C would not be too difficult. <



Well, I think it was not really an important point I made there, but C is always a good choice on any kind of *nix platform (or so it seems to me; C programs seem to be running fast and smooth).



>>Well, yes, but that is not unusual for text files (esp. if they were written with an editor); a paragraph break would be a sequence of two line breaks.<



I can't remember when I last processed a plain text file in OmegaT. Have you tried opening and saving in a different editor? <



No. Hard line breaks are what Emacs inserts in text-mode, and I think all the documentation you find in, say, Linux HOWTO files is formatted that way.



> It has a segmenting mechanism, yes, but it doesn't insert any segment boundaries. It recognizes the existing paragraph boundaries. I've no doubt that a solution could be found, but it is somewhat more fundamental than moving boundaries in, say, Trados, where you effectively tell Trados NOT to do something it would otherwise do. <



I'm not sure whether I understand exactly what this means, but I think segment boundaries should be something the CAT tool is able to insert at the user's choice, and something that has completely disappeared in the target file. Something independent from source text markup, in other words.



> I'm not sure where the contradiction is. If you mean: easy to add filters for Applix or DocBook, for example, but not for Word, no, that's not a contradiction. Any format that is XML or XML-like is easy to add. Any format that is not open/documented involves a tremendous amount of work. <



Oh, I see. That's what wasn't entirely clear to me.



BTW, my idea of a DejaVu-like CAT tool would be of one that merely distinguishes between markup and text, such that it does not really need to understand (parse) the markup. But I'll discuss this technical question (maybe naive of me, but promising) in other forums also.



I agree basically about what you say about OO. [Maybe it would help to contact the developers?]



> Right now, we do need developers, it's true. But we also need testers, documentation authors, translators, word-spreaders, and a whole lot besides.<



I'll do my very best





Florian



[Thanks for your private email!]

