How to handle a large XML file
Thread poster: Peter Sass

Peter Sass
Germany
Local time: 14:37
Member
English to German
+ ...
Jan 22, 2015

Hi there,

From a client I've received a single large XML file (700,000 words according to Trados Studio) containing a whole website.
For sure, this must be split up using an XML split programme.

1) Should the splitting preferably be done on the client side, so as to make sure they can piece it together again from the translation files, or could I do this just as well?

2) Which XML split programme (preferably freeware or shareware) would you recommend?

3) Is there any other way to 'shrink' the file?

Thanks for your advice in advance!



Samuel Murray  Identity Verified
Netherlands
Local time: 14:37
Member (2006)
English to Afrikaans
+ ...
Post in the Trados forum Jan 22, 2015

Peter Sass wrote:
From a client I've received a single large XML file (700,000 words according to Trados Studio) containing a whole website. For sure, this must be split up using an XML split programme.


If your CAT tool can handle it, why do you need to split it? I suggest you also post this question in the Trados forum.

That said, if it were me, I would try to split it by section or by page, since it is a web site with (presumably) separate pages. Sorry, I know of no XML splitter (yet... in your position I would be googling like crazy and installing a whole range of programs just to try them out).

Are you sure about the word count?



Sergei Leshchinsky  Identity Verified
Ukraine
Local time: 15:37
Member (2008)
English to Russian
+ ...
... Jan 22, 2015

Sam, most of today's websites are databases, so it is quite difficult to split a solid mass of data. However, any database can be exported into a flat file (ttx, xml). The topic starter seems to have this exported content at hand. I think any CAT tool can handle it today (all you need is a fast PC, like an i5 or i7 with an SSD and a lot of RAM, which is a must today to avoid latency, as the TMs are quite huge). It will take time and the PC will appear to hang, but it will do it (you can have a coffee or walk the dog in the meantime). Then, in the CAT tool, it will turn into a database again and will work much faster than the source flat file. Also, it can be split into smaller CAT files (there is a corresponding tool for SDL Studio).

[Edited at 2015-01-22 21:21 GMT]



Sergei Leshchinsky  Identity Verified
Ukraine
Local time: 15:37
Member (2008)
English to Russian
+ ...
... Jan 22, 2015

Samuel Murray wrote:
That said, if it were me, I would try to split it by section or by page, since it is a web site with (presumably) separate pages. Sorry, I know of no XML splitter (yet... in your position I would be googling like crazy and installing a whole range of programs just to try them out).

Are you sure about the word count?

XML is a TEXT file; it can be split into smaller text files even with a DOS command, and then merged back into one file afterwards.

As to the word count, it may turn out smaller in the end. I would not judge before I have the file at hand.



Samuel Murray  Identity Verified
Netherlands
Local time: 14:37
Member (2006)
English to Afrikaans
+ ...
DOS command won't split XML smartly Jan 22, 2015

Sergei Leshchinsky wrote:
Samuel Murray wrote:
Sorry, I know of no XML splitter...

XML is a TEXT file; it can be split into smaller text files even with a DOS command, and then merged back into one file afterwards.


No, a DOS command might split a piece of translatable text right down the middle (in fact, the DOS commands that I know will happily split a word in two). Or it might split a tag in two, which would cause the CAT tool to misinterpret the tag (or worse: try to fix it). And even if it doesn't split a segment or a tag in two, it might not split nested tags cleanly, which may also affect the way the CAT tool interprets the XML.
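To make the risk concrete, here is a minimal Python sketch of what a structure-blind split does. The tiny `SAMPLE` document, the chunk size of 40 characters and the helper names are all made up for illustration; the client's real file will look different. It cuts the text at fixed offsets, the way a plain text splitter would, and then asks Python's standard XML parser whether each piece is still well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document standing in for the client's export.
SAMPLE = (
    "<site><page id='1'><title>Home</title></page>"
    "<page id='2'><title>About us</title></page></site>"
)


def naive_split(text, chunk_size):
    """Cut text every chunk_size characters, ignoring XML structure."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def is_well_formed(fragment):
    """Return True if the fragment parses as a complete XML document."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


chunks = naive_split(SAMPLE, 40)
for chunk in chunks:
    # Each piece is shown next to the parser's verdict on it.
    print(repr(chunk), is_well_formed(chunk))
```

Joining the chunks back together restores the original text exactly, which is the point Sergei makes, but the individual pieces are what the CAT tool would have to open in between.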



Peter Sass
Germany
Local time: 14:37
Member
English to German
+ ...
TOPIC STARTER
Thanks so far Jan 23, 2015

..for all your comments!
Actually, the problem is that Trados Studio cannot process the file properly because it is simply too big (and yes I do have a proper PC with i5 processor + 8 GB RAM).

From previous website translations I recollect that there would normally be a set of separate translation files that followed the structure of the website.
As far as I have looked into the matter so far, one needs a proper XML split programme to preserve this structure (header tags etc.), so I couldn't just split the XML file in a text editor.
I'll see what the client thinks..
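For illustration only, a structure-preserving split could look roughly like the Python sketch below. It assumes the file has a single root element whose direct children are the natural units (for example one element per page); the element names, the `lang` attribute and the function name `split_by_children` are invented for the example. It also loads the whole document into memory and drops anything outside the root element, such as the XML declaration or comments, so it is a sketch of the idea rather than a replacement for a dedicated tool.

```python
import copy
import xml.etree.ElementTree as ET


def split_by_children(root, per_part):
    """Return new root elements, each holding up to per_part children.

    Every part repeats the original root tag and attributes, so each
    part is a complete, well-formed document on its own.
    """
    children = list(root)
    parts = []
    for start in range(0, len(children), per_part):
        part = ET.Element(root.tag, dict(root.attrib))
        part.text = root.text
        for child in children[start:start + per_part]:
            part.append(copy.deepcopy(child))
        parts.append(part)
    return parts


# Hypothetical sample; the real export will have its own element names.
SAMPLE = (
    "<site lang='en'><page id='1'>Home</page>"
    "<page id='2'>About</page><page id='3'>Contact</page></site>"
)

root = ET.fromstring(SAMPLE)
parts = split_by_children(root, 2)
for index, part in enumerate(parts, start=1):
    print(index, ET.tostring(part, encoding="unicode"))
```

Because every part carries its own copy of the root element, no tag is ever cut in half, which is the difference from splitting in a text editor.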



Sergei Leshchinsky  Identity Verified
Ukraine
Local time: 15:37
Member (2008)
English to Russian
+ ...
... Jan 23, 2015

http://www.hongkiat.com/blog/split-large-xml-for-wordpress/
https://www.npmjs.com/package/xml-splitter
http://www.xponentsoftware.com/XmlSplit.aspx


FarkasAndras
Local time: 14:37
English to Hungarian
+ ...
Oof Jan 23, 2015

I'd be wary about splitting a huge XML with a random tool off the internet. After you stitch it back together at the end (I presume you plan to do that), it may not be exactly the same as before. The client's software might complain about it. Perhaps you could ask the client to export the site in several reasonably-sized chunks, and note that the alternative option is for you to use xml splitter XXX.
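One way to check this before delivery is a round-trip test on the untranslated file: split it, merge the parts straight back, and compare the result with the original. The Python sketch below assumes the parts each repeat the original root element around a subset of its children; the sample strings and the names `merge_parts` and `same_xml` are hypothetical. It compares canonical forms (via `xml.etree.ElementTree.canonicalize`, available from Python 3.8), so it ignores differences such as quote style. It does not prove the files are byte-identical, and the client's software may still care about the XML declaration or encoding.

```python
import xml.etree.ElementTree as ET

# Hypothetical original and the parts a splitter might have produced.
ORIGINAL = (
    "<site lang='en'><page id='1'>Home</page>"
    "<page id='2'>About</page></site>"
)
PARTS = [
    "<site lang='en'><page id='1'>Home</page></site>",
    "<site lang='en'><page id='2'>About</page></site>",
]


def merge_parts(part_strings):
    """Rebuild one document from parts that share the same root element."""
    roots = [ET.fromstring(text) for text in part_strings]
    merged = ET.Element(roots[0].tag, dict(roots[0].attrib))
    merged.text = roots[0].text
    for root in roots:
        for child in root:
            merged.append(child)
    return merged


def same_xml(first, second):
    """Compare two XML strings by their canonical (C14N 2.0) form."""
    return ET.canonicalize(first) == ET.canonicalize(second)


merged_text = ET.tostring(merge_parts(PARTS), encoding="unicode")
print(same_xml(ORIGINAL, merged_text))
```

If the round trip on the source file already differs, that is a good argument for asking the client to do the export in chunks instead.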


SDL Community  Identity Verified
United Kingdom
Local time: 14:37
English
I can recommend... Jan 24, 2015

... this program for splitting XML files: http://www.xponentsoftware.com/XmlSplit.aspx

It's not freeware, but it is not expensive and is very capable. I used it to split the IATE TBX files here: http://multifarious.filkin.com/2014/07/13/what-a-whopper/

I guess 700,000 words is a lot for one file! I've never tried to handle anything that large in the Studio Editor, but I can imagine it would be a fruitless and frustrating exercise.

We also have the Split and Merge tool on the OpenExchange, but you'd have to process the XML first to split the SDLXLIFF, and I don't know how much success you'd have handling that even without opening it in the Editor.

Regards

Paul

