Multilingual digital library - looking for ideas
Thread poster: Alan Campbell

Alan Campbell  Identity Verified
Local time: 05:52
Russian to English
+ ...
Jul 31, 2006

I wasn't quite sure in which forum to post this, so hopefully I've made the right choice.

A client of mine has asked that I do some research into how they might go forward in terms of languages. I’ve done a fair bit of work already and I thought it might make an interesting discussion for the forum.

I’ll be as brief as I can and look forward to ideas and suggestions.

First of all, my client: it is a non-profit organisation that has put together what is essentially a digital multimedia library charting the history of the European Union. It includes press articles, speeches, treaties, interviews, letters, etc. The content is available over the web free of charge and they try to maintain as objective an approach as possible.

The interface is currently in English and French with the German version now under way. Content is presented in its original language and, where possible, in translation. Translations of content are dealt with externally through agencies and this is a costly affair. I myself work for them as a freelance translator, but am in the unique position of being based in-house. As I know how the internal structure works and have an interest in IT, they considered me a good candidate for this research project.

All content is accompanied by synopses and titles in both the language of the original and in translation.

The aim of the digital library is to serve as a research tool at all levels, from primary schools through to universities, and to serve as a repository for historical and cultural content on the EU.

I seem to be coming to the conclusion that translating the content should be stopped and all our resources spent on translating the interface, synopses and titles into as many European languages as possible. The search engine would also need to work in all EU languages, and I wondered if anyone had any experience of such a thing. Basically, a search for, say, the Treaty of Rome would show results for content on that topic in whatever language it happened to be in. That way, researchers could find a document on the Treaty of Rome in French, read the synopsis in the language used to make the search, and decide for themselves whether the document is relevant or not.
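To make the idea concrete, here is a minimal sketch of "multilingual access to content": each catalogue record carries titles and synopses in several languages, the query is matched against the ones in the searcher's language, and the hit is returned whatever language the document itself is in. The record layout, the `search` function and the sample entries are all invented for illustration; this is not how the client's system works.

```python
from dataclasses import dataclass


@dataclass
class Record:
    doc_id: str
    doc_lang: str   # language of the document itself
    titles: dict    # language code -> title in that language
    synopses: dict  # language code -> synopsis in that language


# Hypothetical catalogue entries, invented for illustration only.
CATALOGUE = [
    Record(
        doc_id="doc-001",
        doc_lang="fr",
        titles={"fr": "Signature des traités de Rome",
                "en": "Signing of the Treaty of Rome"},
        synopses={"fr": "Article de presse sur la signature.",
                  "en": "Press article on the signing."},
    ),
    Record(
        doc_id="doc-002",
        doc_lang="de",
        titles={"de": "Rede zur Montanunion",
                "en": "Speech on the Coal and Steel Community"},
        synopses={"de": "Rede aus dem Jahr 1951.",
                  "en": "Speech given in 1951."},
    ),
]


def search(query, ui_lang):
    """Match the query against titles and synopses in the user's language and
    return hits regardless of the language of the underlying document."""
    needle = query.casefold()
    hits = []
    for rec in CATALOGUE:
        title = rec.titles.get(ui_lang, "")
        synopsis = rec.synopses.get(ui_lang, "")
        if needle in (title + " " + synopsis).casefold():
            hits.append((rec.doc_id, rec.doc_lang, title, synopsis))
    return hits
```

With this data, `search("Treaty of Rome", "en")` finds the French document and shows its English title and synopsis. The quality of the result depends entirely on the translated titles and synopses, which is where the translation budget would go.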

And that leads to the meat of the matter. If the researcher decides from the synopsis that the text is relevant, but it is in a foreign language, what’s the next step?

Basically my client cannot afford to translate all content into all languages, not least of all because of the copyright issue that this raises. So, what’s the researcher to do? One idea would be for us to provide some sort of machine translation widget and include a note to the effect that the MT content was for personal use only. Or the MT content could be put online and improved upon in a Wiki-like fashion.

I’m really keen to hear from anyone who has any ideas or examples of similar such operations where multilingual access to content is an issue.

If anyone would care to see my research notes, you can browse through them in my Backpack blog-like page:

http://cams.campbell.backpackit.com/pub/563064

Feel free to swing as wide as you like with any comments or suggestions. Nothing is too wacky.

Thanks

Cams


Claudio Chagas
Brazil
Local time: 01:52
English to Portuguese
+ ...
Wiki-based system Jul 31, 2006

Alan Campbell wrote:
...the MT content could be put online and improved upon in a Wiki-like fashion.
Cams


Hi Alan,

From your explanation you seem to have covered all the possibilities. I like the collaborative approach to put content online on some kind of wiki-based system, and make it available under an Attribution-NonCommercial-ShareAlike licence. Check out the Creative Commons website for more details.



Magda Dziadosz  Identity Verified
Poland
Local time: 06:52
Member (2004)
English to Polish
+ ...
What about Jul 31, 2006

- charging researchers for translated versions of articles;
- finding sponsors to finance translation;
- providing a link to ProZ.com directory so the researchers can find translators?



Best,
Magda



Alan Campbell  Identity Verified
Local time: 05:52
Russian to English
+ ...
TOPIC STARTER
Thanks Jul 31, 2006

Good call on the Creative Commons thing. I didn't know about ShareAlike and will follow it up for sure.

I'm not sure that charging for translations would work, since my client is an NGO and therefore is not engaged in any commercial activity. However, it might be possible to allow researchers to request translations and have some sort of system in place for that.

I had considered using the resources (budget and manpower) to ensure that documents could be found and their content understood through having high-quality translations of the descriptions and titles. That's why I think that the content itself should be provided in its original language.

Seeking sponsorship is a good idea, although it might be difficult to find. I'm positive that my client would not agree to having advertisements embedded in the content and am not sure that, under their legal status, they would be entitled to do so even if they wanted to. Perhaps some funding in an academic sense could be found, although I know that academic institutes are usually seeking sponsorship themselves. Maybe some kind of partnership with translation schools could be worked out.

Anyway, just thinking out loud. Thank you both for taking the time to read through my post and for your contribution.

Alan


xxxmediamatrix
Local time: 00:52
Spanish to English
+ ...
My experience ... Jul 31, 2006

Hi Alan,

Some years ago (1997-99) I was involved in setting up a project bearing some similarities to yours - and also some substantial differences.

From what I gather from your description, similarities included:

- pan-European content and user-base
- very diverse range of documentation (legacy paper documents --> web content --> DVDs --> you name it!)
- multi-lingual content (literally all East and West European languages)
- multi-lingual user base (the organization's own staff, supposed to be bilingual eng/fre, plus staff of the organization's members, i.e. professional people working in 57 different countries)
- requirement to catalogue large quantities of documentation available already in multiple languages (EU Directives, Council Decisions, etc.), as well as the organization's own output (official languages eng/fre, working languages ger/rus) and significant quantities of documents held by the member organizations in their respective national languages.


Differences included:

- closed user group (initially, in-house only, now extended to the organization's extranet - that still makes for several thousand accredited users - but access for children, or for the general public, was never a consideration)
- it could be assumed that all users would be capable of using an interface provided in just English and French
- the organization had virtually unlimited resources (manpower and cash) and a very strong IT department (although the initial doc centre project development was outsourced)

My involvement began at the earliest stages of the project design and prototype implementation phase, as representative of a user service that would be both a content provider and an intensive user of the system (I also had relevant experience as head of the organization's publications service).

So, what do I recall of that project?

First, the eternal arguments! The first thing the bosses did (one might say, their first mistake...) was to hire someone to serve as project manager, and they chose a charming fellow fresh out of university with a shiny degree in librarianship. As was to be expected, the main requirements were actually in the fields of project management, workflow analysis and IT, not librarianship! *

It was decided early on that document retrieval should be based on full text searching. But then the project manager decided to add a 3-tier classification system (partly, I think, to justify his appointment to the project and partly as a result of pressure from one user department with specific requirements). So, when anyone registered a document it had to have a class, sub-class, sub-sub-class. Well, OK. But the search facility was then set up in such a way that you could only find a specific item if you knew which class, sub-class and sub-sub-class it had been registered under. Consequently, the only person who could find document XYZ was ... the person who had registered it! I eventually persuaded the developers to arrange to ignore (flatten) the hierarchy when searching - but it took 6 months' hard haggling to get them to see the need for this.
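A rough sketch of what "flattening" means in practice, assuming each document stores its three-level classification as a path: the search looks at every document by default and treats the classification only as an optional narrowing filter. The documents, class names and the `search` function are hypothetical, not taken from the system described.

```python
# Hypothetical documents; "path" is (class, sub-class, sub-sub-class).
DOCUMENTS = [
    {"id": "A1", "path": ("legal", "directives", "broadcasting"),
     "text": "Directive on television without frontiers"},
    {"id": "B7", "path": ("technical", "standards", "audio"),
     "text": "Notes on television audio standards"},
]


def search(term, path_prefix=()):
    """Full-text search over all documents. The classification is ignored
    unless the caller passes a path prefix to narrow the results."""
    term = term.casefold()
    prefix = tuple(path_prefix)
    return [
        doc["id"]
        for doc in DOCUMENTS
        if term in doc["text"].casefold()
        and doc["path"][:len(prefix)] == prefix
    ]
```

Here `search("television")` returns both documents without the user knowing where either was filed, while `search("television", ("legal",))` narrows to the legal class for those who do want the hierarchy.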

Other user-friendliness considerations (and arguments) were of the kind: if X is searching for 'TV reality shows', and chooses to display results regardless of the document language, then it would be helpful if translations of the same document were listed together in the search results, as sub-items of a 'hit'. I never did get that one implemented...

Coming back to Alan's project, I would agree wholeheartedly with the ideas expressed at the end of his most-recent blog entry:

"(...) it's not about providing access to multilingual content, but rather about providing multilingual access to content.

(...) the solution is simply to stop translating the content. Let the institutions do the translating as they pretty much have to, but we don't; we're just a repository of information, a digital library, and how many traditional libraries are there in the world that go about translating their content?"

Very very true. It is inconceivable that any NGO could handle the translation, into every potential target language, of everything that's available in all languages in its field of activity. The NGO should concentrate on what it does best - and any self-respecting NGO will pride itself on having its finger on the pulse and being aware of who's saying/writing what about them or their field of activity. Making that information available to a large multi-lingual, multi-cultural community is task enough alone...

So, yes, I would support your approach in respect of concentrating on multi-lingual access. And where a demand arises for translations of specific items from the catalogue, then your NGO can use its limited resources to satisfy real demand rather than 'first-guessing' what users 'might' want some time in the future.

As for obtaining translations, one option might be a form of sponsorship. Suppose Norwegian user X finds in your catalogue a reference to a useful document, YYY, but it's in French; (s)he requests a translation of document YYY into Norwegian. Ok, you go to the publisher of YYY (assumed to be an organization active in the field of interest to your NGO) and propose to them that you will arrange for a translation of their document into Norwegian if they cover half the cost and agree to you cataloguing it afterwards and making it available to your users. Many organizations will be more than happy to comply with such a request, as it gives them greater exposure at half the usual cost. (This is a variant of a scheme I offer here: http://www.mediamatrix.cl/GenInfo/ServicesTranslation_eng.asp ).

To close, you are fortunate to be involved in such a project and I'd recommend you make the most of it. There is a growing market for expertise in documentation management - success could carry you far. Whatever course you, or your client, take - good luck with it!

MediaMatrix

* To be fair, the librarian was a fast learner and the project did eventually work more or less as intended - although I don't know what its current status is as I left the organization while it was being deployed.

PS Why don't you give us a link to the NGO's site? - we'd get a better idea of what it's all about and other ideas might come to mind ...



Alan Campbell  Identity Verified
Local time: 05:52
Russian to English
+ ...
TOPIC STARTER
Thank you! Aug 1, 2006

Thank you so much for your response mediamatrix and for taking the time to read through my material. I'm constantly amazed at the giving nature of strangers on the Internet!

I did not provide a link to my client's site as I was not sure of the protocol on these forums as regards spamming, etc. However, you have put my mind at rest on that score, so here is the site of the organisation:

www.cvce.lu

And the site of their digital library:

www.ena.lu

Your comments on full text searching are of interest to me as this is something that I am not sure that our techies fully understand. To be fair, I myself am not sure how our site works (or, from personal experience, does NOT work) in terms of search and I really must talk to some people and figure that out. If you have any examples of a multilingual search that finds texts in languages other than the search language, I'd be keen to see how they operate. I presume this would be achieved by having the same multilingual keywords for each language version of a document? Or would this be done using doctype declarations? As I say, examples would be most useful to me.

Your idea of having content translation done according to real demand is a good one, as indeed is approaching publishers to arrange to split the cost. It does get into the realms of copyright though, as much of the content is licensed (such as newspaper articles) and I'm sure that the terms of the licenses would not include translation. That is what made me think of providing users the means to have content machine translated whilst covering my client's back with a EULA to the effect that translations are for personal use only, quality is not guaranteed, etc. That way, the user gets a fair idea of the content at a deeper level than s/he would from the synopsis provided by my client, and can then make a decision about how to proceed in terms of arranging for a full translation.

The further I get with my research, the more interesting it becomes!

There are some initiatives from the European Commission with which my client has been involved with a view to finding partnerships and funding, and they should carry on with that aspect of things.

It's certainly an interesting place to work, even if I am kind of shooting myself in my freelance translator's foot!

Alan


xxxmediamatrix
Local time: 00:52
Spanish to English
+ ...
Further thoughts ... Aug 1, 2006

First, thanks for the links to the project sites. Not only do they contain a lot of very interesting information, they are very nicely presented (although they are very slow when accessed from this part of the world). Strongly recommended to all Proziens with an interest in Europe!

It is clear that the organization has some quite skilled IT and graphics people. On the other hand, as you commented yourself in your first post:

To be fair, I myself am not sure how our site works (or, from personal experience, does NOT work) in terms of search ...


As a newcomer to the site, I would tend to agree. As with most reference works I am unfamiliar with, the first thing I did was dig around to see what www.ena.lu has to say about a topic I know something about. I went looking for information about the UK joining the EU. Working with the English interface I did a search for 'UK adhesion' - and got a list, in totally random order, with entries like the following (the absence of upper case and punctuation is as per the website, and hampers legibility):
...
demonstration against uk membership of the common market
the uk joins the european common market 1 january 1973
'the european island' from le figaro
speech by tony benn 17 march 1972
...

There is evidently a lack of guidance as to the drafting of document titles and/or catalogue data ... The key thing in any such project is not 'how do we get documents INTO the system?' but, rather, 'how can we make sure users can GET STUFF OUT?' Not only 'something' relevant to their enquiry but, to the greatest extent possible, 'everything' that's relevant - and as little irrelevant material (clutter) as possible. At the end of the day, that comes down to strong rules for cataloguing.

I clicked "'the european island' from le figaro" and got a translation of the article, followed by a complete reference to the source. Great! Well, almost - actually this document was found only because the word 'adhÉsion' appears in the French caption; 'adhesion' is not in the English translation or caption (because the correct word is 'accession'). If the system had been accent-sensitive, I'd not have found it.
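The accent behaviour described here can be reproduced with a small sketch, assuming the search folds accents by decomposing each character and dropping the combining marks before comparing. The `fold` and `matches` helpers and the sample caption are made up for illustration; I don't know how www.ena.lu actually does it.

```python
import unicodedata


def fold(text):
    """Strip accents and case: decompose characters (NFD), drop the
    combining marks, then casefold what is left."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def matches(query, text, accent_sensitive=False):
    """Substring match, with or without accent folding."""
    if accent_sensitive:
        return query.casefold() in text.casefold()
    return fold(query) in fold(text)


# Hypothetical French caption, invented for the example.
caption = "L'adhÉsion du Royaume-Uni au Marché commun"
```

With folding, `matches("adhesion", caption)` is true, which is the accidental hit described above; with `accent_sensitive=True` the same query finds nothing.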

I did another search, for 'UK pound'. It came up with two documents - one of which "Note secrète de la Commission européenne sur les problèmes financiers du Royaume-Uni (Septembre 1969)" does not contain the word 'pound' either in the text or in the caption/title. There is no English translation of this document. So, although it was relevant to my search, I'm left wondering how the document was found ...

Your comments on full text searching are of interest to me as this is something that I am not sure that our techies fully understand. (...)


I would tend to agree that your IT colleagues may be out of their depth in the search domain as applied to document resources - perhaps not as regards the programming aspects but in deciding what data needs to go in the catalogue and the search logistics as seen from a user's standpoint. Indeed, there is something of a contradiction between the claims here: http://www.cvce.lu/mce.cfm :

"It is by using the most advanced technologies that the CVCE has been able to overcome constraints imposed by the geographical distribution of sources and to promote, organise and disseminate knowledge, thus becoming a virtual forum for knowledge on the history of Europe, a model for the development of new ways of accessing knowledge."

and the practical reality in www.ena.lu .

Your comments on full text searching are of interest to me as this is something that I am not sure that our techies fully understand (...) If you have any examples of a multilingual search that finds texts in languages other than the search language, I'd be keen to see how they operate. I presume this would be achieved by having the same multilingual keywords for each language version of a document? Or would this be done using doctype declarations? As I say, examples would be most useful to me.


I think you might be confusing two concepts here. A full-text search does not use keywords (except, perhaps, as a top-level filter, of the kind "find all docs containing 'football' except those keyworded as 'legal'") - it looks directly at document content (and/or catalogue content - titles, summaries, etc.), in much the same way as Google works (and in Google, the 'keyword filter' part is the optional search language selector).
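The distinction can be sketched as follows, assuming a simple inverted index built from the document text, with the assigned keywords used only as an optional filter on top. The documents, keywords and function names are invented for the example.

```python
import re
from collections import defaultdict

# Hypothetical documents: free text plus keywords assigned by a cataloguer.
DOCS = {
    1: {"text": "Football broadcasting rights in Europe", "keywords": {"sport"}},
    2: {"text": "Court ruling on football transfer rules", "keywords": {"legal", "sport"}},
    3: {"text": "Agricultural policy overview", "keywords": {"policy"}},
}


def build_index(docs):
    """Inverted index: every word in the text -> ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for word in re.findall(r"\w+", doc["text"].casefold()):
            index[word].add(doc_id)
    return index


INDEX = build_index(DOCS)


def full_text_search(word, exclude_keyword=None):
    """Look the word up in the text index, then optionally drop documents
    that carry a given keyword (the top-level filter)."""
    hits = INDEX.get(word.casefold(), set())
    if exclude_keyword is not None:
        hits = {d for d in hits if exclude_keyword not in DOCS[d]["keywords"]}
    return sorted(hits)
```

`full_text_search("football")` returns documents 1 and 2 because the word is in their text; `full_text_search("football", exclude_keyword="legal")` returns only document 1. Nobody had to keyword anything as 'football' for the search to work.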

The way I wanted our system to work (and bear in mind that it was 8 years ago... the technology has moved on a lot since then) was that when a full-text search on document content came up with a hit, it then searched the catalogue to find translations of the same text, detected by their sharing a common ID; in the simplest case, if the first hit was doc ID '1234_en', then the French and German translations would be found with IDs '1234_fr' and '1234_de'.* The difficult part was getting colleagues to set up the IDs properly, partly because there was no easy way to check whether a document currently being registered was new material or a translation of something that had already been registered. (There was also a problem when we had several translations of the same document in the same language, or a full translation and a translated abstract.)
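In its simplest form the common-ID idea might look like the sketch below, assuming IDs of the form '<number>_<language code>' as in the '1234_en' example. The helper names and the catalogue contents are hypothetical, and the sketch ignores the awkward cases just mentioned (several translations in one language, or a full translation alongside a translated abstract).

```python
# Hypothetical catalogue of registered document IDs.
CATALOGUE = ["1234_en", "1234_fr", "1234_de", "5678_fr"]


def split_id(doc_id):
    """Split '1234_en' into the shared base ID and the language code."""
    base, _, lang = doc_id.rpartition("_")
    return base, lang


def translations_of(hit_id, catalogue):
    """Other language versions of the same document, found via the shared base ID."""
    base, _ = split_id(hit_id)
    return sorted(d for d in catalogue if d != hit_id and split_id(d)[0] == base)


def group_hits(hit_ids, catalogue):
    """List each document once, with its translations as sub-items of the hit."""
    grouped = {}
    for hit in hit_ids:
        base, _ = split_id(hit)
        if base not in grouped:
            grouped[base] = {"hit": hit, "translations": translations_of(hit, catalogue)}
    return grouped
```

`translations_of("1234_en", CATALOGUE)` gives `["1234_de", "1234_fr"]`, and `group_hits` collapses a result list containing both '1234_en' and '1234_fr' into a single entry, which is the "sub-items of a hit" display I never managed to get implemented.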

That is what made me think of providing users the means to have content machine translated whilst covering my client's back with a EULA to the effect that translations are for personal use only, quality is not guaranteed, etc. That way, the user gets a fair idea of the content at a deeper level than s/he would from the synopsis provided by my client, and can then make a decision about how to proceed in terms of arranging for a full translation.


On my first reading of your post yesterday, I thought (but didn't write in my post) "you would be well-advised to steer clear of MT in this project". Having now seen the quality of your client's sites, I'm now certain that you should avoid MT. As you say, "... quality is not guaranteed ..."; I would add ' ... to the extent that it may do more harm than good.'. With all due respect to MT professionals who may be reading this, MT would be a blot on the www.ena.lu landscape. As a means of determining the usefulness of a given document I doubt that it would be more helpful than a short, well-worded synopsis delivered by your client in, say, 3 or 4 primary European languages, coupled with high-quality catalogue data. And let's not forget that a user can copy any document they think 'might' be interesting and submit it to any MT system they care to trust, but that would be entirely under their own responsibility and they would have no cause to complain to your client if the result was less than satisfactory.

MediaMatrix

* The principle is the same as a method I sometimes use when searching the Internet for the translation of an obscure term: I Google the term in the source language, scan the URLs in the search results until I find one that includes a language code, copy that URL, change the language code to my target language - and, fingers crossed, it may come up with a translation of the searched page containing the term I'm looking for.
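For the curious, the URL trick can be automated along these lines, assuming the language code appears as a separate token in the URL (between slashes, or set off by '_', '.', '-' or '='). The function name and the example.org URLs are invented; real sites vary a lot, so treat it as a sketch only.

```python
import re


def swap_language_code(url, source, target):
    """Replace the first standalone language code in a URL, e.g. '/en/' -> '/fr/'
    or 'report_en.htm' -> 'report_fr.htm'. Returns the URL unchanged if the
    source code is not found as a separate token."""
    pattern = re.compile(
        r"(?<=[/_.\-=])" + re.escape(source) + r"(?=[/_.\-&]|$)"
    )
    return pattern.sub(target, url, count=1)


# Hypothetical URLs, for illustration only.
print(swap_language_code("http://example.org/en/docs/page.html", "en", "fr"))
print(swap_language_code("http://example.org/docs/report_en.htm", "en", "de"))
```

Whether the rewritten URL actually exists is, as with the manual method, a matter of crossed fingers.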


Claudio Chagas
Brazil
Local time: 01:52
English to Portuguese
+ ...
User point of view Aug 1, 2006

mediamatrix is indeed an excellent consultant regarding the information architecture for this project.

I visited the site, and as a first time user I found it extremely difficult to navigate and slow (I would refrain from using Flash, unless programmers can make it load faster).

Also, unless you know what you're after and are very persistent, you won't get much out of this site. There's a wealth of information on this site, which often results in a large amount of information being crammed into a small section.

Having used the ProQuest Information and Learning site when I was at uni, I found it much more flexible and user-friendly. I would recommend that you check other similar sites to see what works best in each case.


xxxmediamatrix
Local time: 00:52
Spanish to English
+ ...
Another idea ... Aug 9, 2006

Hi Alan,

You might find this interesting: http://www.proz.com/topic/52766

MediaMatrix



Alan Campbell  Identity Verified
Local time: 05:52
Russian to English
+ ...
TOPIC STARTER
Duly acknowledged Aug 10, 2006

Hi mediamatrix

Many apologies for not adding to the thread after your wonderful response. I fully intended to do that but have simply been too busy with other life things with which I shan't bore you on the forum. My research continues and some of the points raised in your extremely helpful posts will most assuredly find their way into my research. I'm meeting with my boss tomorrow to discuss my progress and the final report is due for the end of the month. I shall endeavour fully to digest your post tomorrow after our meeting and shall respond in due course.

I guess I should have responded, even if it were only to acknowledge and thank you for your post and to let you know that I intended to respond more fully when time permitted. I hope you stick around and watch my progress as I'm sure we're on the same wavelength and your input is very much appreciated.

Thank you

Alan



Alan Campbell  Identity Verified
Local time: 05:52
Russian to English
+ ...
TOPIC STARTER
Follow up Aug 11, 2006

My meeting went well and my research seems to be going in the right direction.

I spoke to some people in the IT department yesterday and they are awaiting installation of a Google Mini. I hadn't heard of such a thing, but some research shows that this could be the very thing that will get the content to a high level of findability.

And so I guess that much of the discussion as regards search is no longer quite so relevant; however, I shall respond to some points that stood out from your post.

I did the same searches that you did and got the same results (which in its own way is promising I guess).

the absence of upper case and punctuation is as per the website, and hampers legibility


You have my full agreement on that score!

I did another search, for 'UK pound'. It came up with two documents - one of which "Note secrète de la Commission européenne sur les problèmes financiers du Royaume-Uni (Septembre 1969)" does not contain the word 'pound' either in the text or in the caption/title. There is no English translation of this document. So, although it was relevant to my search, I'm left wondering how the document was found ...


I found the same text and noted straight away that there was no indication of where the document is from in terms of the structure of the content (the arborescence on the left). One of the icons above the text is provided in order to Locate [the document] in the Documentary Resources, and on so doing, I found that it is in a section called "The International Role of the Pound Sterling". I'm still not quite sure why that one particular text was found when others were not, but at least there seems to be a connection to the word "pound".

I think you might be confusing two concepts here. A full-text search does not use keywords (except, perhaps, as a top-level filter, of the kind "find all docs containing 'football' except those keyworded as 'legal'") - it looks directly at document content (and/or catalogue content - titles, summaries, etc.), in much the same way as Google works (and in Google, the 'keyword filter' part is the optional search language selector).


Thank you for your explanation of the two concepts. I use desktop search a lot and so am familiar with full-text searching and how useful it can be. I'm fairly familiar with how keywords work, although I know that achieving high rankings can sometimes seem to be more of an art than a science.

The way I wanted our system to work (and bear in mind that it was 8 years ago... the technology has moved on a lot since then) was that when a full-text search on document content came up with a hit, it then searched the catalogue to find translations of the same text, detected by their sharing a common ID; in the simplest case, if the first hit was doc ID '1234_en', then the French and German translations would be found with IDs '1234_fr' and '1234_de'.* The difficult part was getting colleagues to set up the IDs properly, partly because there was no easy way to check whether a document currently being registered was new material or a translation of something that had already been registered. (There was also a problem when we had several translations of the same document in the same language, or a full translation and a translated abstract.)


That sounds like an ideal system. I have discussed the concept of "descripteurs" with our documentalist, but haven't quite grasped it yet. I should sit down with her and get to the nub of it - she really knows how it works and has done wonders with the Eurovoc thesaurus. I have a lot to learn there.
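A minimal sketch of how thesaurus descriptors can support cross-language retrieval in general, assuming each concept has a language-independent ID with labels in several languages and each document is tagged with concept IDs rather than words. The IDs 'C001' and 'C002', the labels and the document names are all made up; they are not real Eurovoc entries, and this is not a description of how our site works.

```python
# Hypothetical thesaurus: concept ID -> label per language.
LABELS = {
    "C001": {"en": "pound sterling", "fr": "livre sterling", "de": "Pfund Sterling"},
    "C002": {"en": "accession to the European Union",
             "fr": "adhésion à l'Union européenne"},
}

# Hypothetical catalogue: document -> descriptors assigned by the documentalist.
DOC_DESCRIPTORS = {
    "doc-fr-17": {"C001"},
    "doc-en-42": {"C002"},
}


def concepts_for(term):
    """Concept IDs whose label in any language contains the search term."""
    t = term.casefold()
    return {
        cid
        for cid, labels in LABELS.items()
        if any(t in label.casefold() for label in labels.values())
    }


def search_by_descriptor(term):
    """Documents tagged with any concept matching the term, whatever their language."""
    wanted = concepts_for(term)
    return sorted(doc for doc, descs in DOC_DESCRIPTORS.items() if descs & wanted)
```

Under these assumptions a search for 'pound' finds a French document that never contains the English word, because the match happens on the descriptor and not on the text.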

Having now seen the quality of your client's sites, I'm now certain that you should avoid MT... As a means of determining the usefulness of a given document I doubt that it would be more helpful than a short, well-worded synopsis delivered by your client in, say, 3 or 4 primary European languages, coupled with high-quality catalogue data. And let's not forget that a user can copy any document they think 'might' be interesting and submit it to any MT system they care to trust, but that would be entirely under their own responsibility and they would have no cause to complain to your client if the result was less than satisfactory.


That was precisely the conclusion that I came to as well. I'm proposing that the synopses and titles be well worded and high quality and available in at least four languages. The resources would be better spent getting that and the searching right so that content can be found and its relevancy evaluated. What the user does with it after that is at his own discretion and, as you say, the responsibility is out of my client's hands.

Where the MT idea comes into play is when I consider what action, if any, my client should take with content when a translation is required. I think it will still be important for my client to offer a means of providing content in translation and those are the areas that I have to study. As you suggested, second guessing is not the best way forward, and blanket translation of content simply cannot continue. We could, of course, simply leave the content as it is and mention nothing about translation, but that's not the way my client wants to go. I think there should be a system in place whereby users could request that a particular document be translated, so that at least we can remove the guessing element.

What I need to consider then is what to do when a translation is requested. MT is one idea, though perhaps any MT content should be kept separate from www.ena.lu so as not to blot the landscape (a concern I agree with, by the way). MT content could be used as a basis for wiki-style improvements.

Another idea I had is to use the education system. Since one of my client's main target audiences is students and teachers, why not take advantage of that and form partnerships with translation schools in various European countries? Then we could ask that our content be translated by students, whilst in return we provide some sort of training through translation memories, glossaries and online tutorials that focus on translation technology.

I'm keen to hear any ideas about how else we might fill the need to have content translated for the community, bearing in mind that our content is provided free of charge and has quality hallmarks.

The principle is the same as a method I sometimes use when searching the Internet for the translation of an obscure term: I Google the term in the source language, scan the URLs in the search results until I find one that includes a language code, copy that URL, change the language code to my target language - and, fingers crossed, it may come up with a translation of the searched page containing the term I'm looking for.


I do that almost every day using site:europa.eu.int searches (or site:ec.europa.eu as it is now becoming). It's a powerful and intuitive way of finding translated texts, but I'm not sure how that could work in my client's system, being as it is Flash based and direct links to documents are not available.

Right, time for me to shut down for the weekend and spend some family time. I'll be logging in over the weekend and back in the office on Monday for a few days, then off on holiday for a week. Responses will be most gratefully received, and I will reply as soon as I can, even if only to say thanks and that I'll be back later. I really appreciate the responses I've had on the forum and frankly am quite overwhelmed by the generosity shown here. You have my thanks!

Alan



