Grabbing raw text from Studio / extracting raw text from sdlxliff
Thread poster: FarkasAndras

FarkasAndras  Identity Verified
Local time: 13:33
English to Hungarian
Feb 24, 2015

I'm working on developing a workflow for running texts through Studio for segmentation/text extraction (for subsequent alignment). The goal is to dump a bunch of files (doc, xls, ppt etc.) into Studio and have raw, clean, segmented text come out the other end without any fuss. I can import the files and get the text out by copy-pasting from the editor, but that's not a reasonable option if there are many files. I want a solution that processes any number of files in one go. I tried a couple of OpenExchange apps hoping to extract the raw text from sdlxliff files with them, but they don't seem to work very well.
Now I'm thinking about just writing an sdlxliff->txt converter myself. Has anyone worked with sdlxliff before? It seems that it would just be a case of discarding everything that's not between <source> </source> tags, removing all other tags to get rid of inline formatting tags, and then converting any character references to literal characters. It seems like I could just process files line by line because the opening and closing source tags will always be on the same line (no line breaks inside segments). Does all of that sound right?
The sdlxliff I checked has the same text repeated in source tags and seg-source tags. It looks like they will be the same under all normal circumstances, so I think I will stick with using source.
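
For illustration, the three steps described above (keep only the source elements, drop inline tags, decode character references) can be sketched with an XML parser rather than line-by-line matching, which means the result does not depend on where the line breaks fall. This is a minimal sketch in Python, not a tested converter: the function names and the sample document are made up for the example, and the sample is a simplified stand-in, not a real Studio file. The element name is a parameter so that seg-source can be tried instead of source.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace, e.g. '{urn:...}source' -> 'source'."""
    return tag.rsplit("}", 1)[-1]


def extract_segments(xml_text, wanted="source"):
    """Return the plain text of every <wanted> element, in document order.

    Inline formatting tags are dropped but the text inside them is kept,
    and character references are decoded by the XML parser itself.
    """
    root = ET.fromstring(xml_text)
    segments = []
    for elem in root.iter():
        if local_name(elem.tag) == wanted:
            text = "".join(elem.itertext())   # text of the element and its children
            text = " ".join(text.split())     # collapse line breaks and extra spaces
            if text:
                segments.append(text)
    return segments


# Hypothetical, heavily simplified sample (an assumption, not real Studio output).
SAMPLE = """<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2">
  <file original="demo.doc" source-language="en-US">
    <body>
      <trans-unit id="1">
        <source>Hello <g id="5">bold</g> world &amp; more.</source>
        <seg-source>Hello <g id="5">bold</g> world &amp; more.</seg-source>
      </trans-unit>
      <trans-unit id="2">
        <source>Caf&#233; prices
rose.</source>
      </trans-unit>
    </body>
  </file>
</xliff>"""

if __name__ == "__main__":
    for segment in extract_segments(SAMPLE):
        print(segment)
```

The second sample unit deliberately contains a line break inside the source element, the case a purely line-based approach would miss. Whether real files ever contain such breaks is not something this sketch can confirm.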


 

SDL Community  Identity Verified
United Kingdom
Local time: 13:33
English
If you use... Feb 24, 2015

... the API then you could probably extend your app to any filetype supported by Studio. You could offer it as an alternative alignment tool built into Studio and accessible via the ribbon. This is probably the most sensible way to get at the text you want because the API will parse the files correctly irrespective of the filetype.

The APIs are freely accessible and if you need help we do have a developer community you can use... also free.

Regards

Paul
SDL Community Support


 

FarkasAndras  Identity Verified
Local time: 13:33
English to Hungarian
TOPIC STARTER
That's an interesting option... Feb 24, 2015

But not one I'm likely to take. Learning the API is probably more trouble than it's worth, and in the end one would need to click around in Studio on the famous ribbon to export the files to txt. If I just integrate the converter into my own code, it can be called automatically if an sdlxliff is selected. Obviously, that requires manually importing the files into Studio beforehand to get the text loaded into an sdlxliff, but that might be a good compromise.
It would be interesting if I could use the API to write an app that works as a hands-off converter.
I.e. if I go
sdlxliff_converter.exe en infile.doc outfile.txt
in the console, it should call up Studio, import the doc file as an English source file and export its contents into outfile.txt. Ideally, it wouldn't add the file to any project and it wouldn't leave other files behind.
It seems that this is probably possible with the project automation API, although given Studio's excruciatingly slow startup, automatically opening/closing Studio separately for each file would probably end up being slower than just opening it by hand and batch adding all the files in one go.

I'm not sure what you mean by an alternative alignment tool built into Studio. That I could write an openexchange app that converts the files to txt and then calls my aligner on the txt files? I guess that might be possible but I have no interest in doing it. People would expect it to be an "SDL/Studio aligner", which LF Aligner definitely isn't. It doesn't look pretty enough and it's generally too industrial strength.

I'm also not sure what you mean by extending my app to any filetype supported by Studio. If you mean aligning ppt files or similar, then yes, that's part of the goal. Improved handling of doc/docx/etc is also attractive, as the converters I have integrated into LF Aligner don't do a stellar job on messy files.
It seems that processing sdlxliff isn't hard at all, and that would open that door completely without messing with the API. Sdlxliff is the native format, so Studio converts everything it can import into sdlxliff, doesn't it?
If you mean extracting text from ttx and bilingual word and similar bilingual formats other than sdlxliff, I have no real interest in messing with those.

[Edited at 2015-02-24 14:32 GMT]


 

SDL Community  Identity Verified
United Kingdom
Local time: 13:33
English
I guess I was thinking a couple of things... Feb 24, 2015

... but they were just ideas. You do whatever you want.

  1. Integration means Studio users who like your tool might find it beneficial, as no import/export would be needed as a separate operation. Call it what you like... it wouldn't be a Studio aligner, it would be yours and just called through the Studio interface. Xbench isn't the "Studio Xbench" just because they created a plugin for Studio, is it?
  2. Of course you could use the API as a separate converter... there are apps and many custom solutions built for specific customers that do this already, so there's no need to even start Studio at all. But obviously you still need a Studio license to run it.
  3. By extending the filetype support, yes that's what I meant. IDML, INX, PPTX, XML.... whatever!
  4. If you need SDLXLIFF without using the API then users still need to create the projects with all those terrible features... although I have no idea why you'd do these one at a time. Maybe a limitation you're imposing? You could probably use a PowerShell script to do this without too much hassle. We have an open-source PowerShell capability as well if you're interested. Might be easier for you if you don't want to dig into the API.

Anyway... it was just an idea seeing as you asked in here about using SDLXLIFF as the way to get at text from files you don't currently support.

Regards

Paul
SDL Community Support


 

FarkasAndras  Identity Verified
Local time: 13:33
English to Hungarian
TOPIC STARTER
Thanks Feb 24, 2015

I'll probably do a quick and dirty sdlxliff converter for the time being, and rely on manually importing files. Yes, it's a bit of a pain but it's the easiest option for now.
When I have some time to kill, I'll look into using PowerShell or the API... maybe. Perl is the only computer language I speak, so a project like this would take a fair bit of learning/research.
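
For illustration, the quick and dirty route (import the files into Studio by hand, then convert all the resulting sdlxliff files in one go) could look roughly like the sketch below. It is written in Python purely as an example, and the function names, the command-line arguments and the one-segment-per-line output layout are all assumptions rather than anything agreed in this thread.

```python
import sys
import xml.etree.ElementTree as ET
from pathlib import Path


def source_segments(path):
    """Return the text of every <source> element in one file, inline tags dropped."""
    segments = []
    for elem in ET.parse(str(path)).getroot().iter():
        # Compare the local name only, so the XML namespace does not matter.
        if elem.tag.rsplit("}", 1)[-1] == "source":
            text = " ".join("".join(elem.itertext()).split())
            if text:
                segments.append(text)
    return segments


def convert_folder(in_dir, out_dir):
    """Write one UTF-8 .txt file (one segment per line) per .sdlxliff file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for xliff in sorted(Path(in_dir).glob("*.sdlxliff")):
        # demo.doc.sdlxliff -> demo.doc.txt
        target = out_dir / xliff.with_suffix(".txt").name
        target.write_text("\n".join(source_segments(xliff)) + "\n", encoding="utf-8")
        written.append(target)
    return written


if __name__ == "__main__":
    if len(sys.argv) == 3:
        for path in convert_folder(sys.argv[1], sys.argv[2]):
            print(path)
    else:
        print("usage: convert_folder.py INPUT_DIR OUTPUT_DIR")
```

Pointing it at a project's source-language folder would then give one txt file per imported document, ready to hand to an aligner. It only reads the sdlxliff files and leaves the Studio project untouched.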


 

