How to handle large XML file
Thread poster: Peter Sass

Peter Sass
Germany
Local time: 07:50
Member
English to German
+ ...
Jan 22, 2015

Hi there,

From a client I've received a single large XML file (700,000 words according to Trados Studio) containing a whole website.
For sure, this must be split up using an XML split programme.

1) Should the splitting preferably be done on the client side, to make sure they can piece it together again from the translation files, or could I do this just as well?

2) Which XML split programme (preferably freeware or shareware) would you recommend?

3) Is there any other way to 'shrink' the file?

Thanks for your advice in advance!


 

Samuel Murray  Identity Verified
Netherlands
Local time: 07:50
Member (2006)
English to Afrikaans
+ ...
Post in the Trados forum Jan 22, 2015

Peter Sass wrote:
From a client I've received a single large XML file (700,000 words according to Trados Studio) containing a whole website. For sure, this must be split up using an XML split programme.


If your CAT tool can handle it, why do you need to split it? I suggest you post this question also in the Trados forum.

That said, if it was me, I would try to split it by section or by page, since it is a web site with (presumably) separate pages. Sorry, I know of no XML splitter (yet... as I would have been googling like crazy and installing a whole range of programs just to try it out).

Are you sure about the word count?


 

Sergei Leshchinsky  Identity Verified
Ukraine
Local time: 08:50
Member (2008)
English to Russian
+ ...
... Jan 22, 2015

Sam, most of today's websites are database-driven, so it is quite difficult to split such a solid mass of data. However, any database can be exported into a flat file (ttx, xml), and the topic starter seems to have this exported content at hand. I think any CAT tool can handle it today (all you need is a fast PC, such as an i5 or i7 with an SSD and a lot of RAM, which is a must today to avoid latency, as TMs are quite huge). It will take time and the PC will appear to hang, but it will get there (you can have a coffee or walk the dog in the meantime). Then, in the CAT tool, it will turn into a database again and will work much faster than the source flat file. It can also be split into smaller CAT files (there is a corresponding tool for SDL Studio).

[Edited at 2015-01-22 21:21 GMT]


 

Sergei Leshchinsky  Identity Verified
Ukraine
Local time: 08:50
Member (2008)
English to Russian
+ ...
... Jan 22, 2015

Samuel Murray wrote:
That said, if it was me, I would try to split it by section or by page, since it is a web site with (presumably) separate pages. Sorry, I know of no XML splitter (yet... as I would have been googling like crazy and installing a whole range of programs just to try it out).

Are you sure about the word count?

XML is a TEXT file; it can be split into smaller text files even with a DOS command, and then merged back into one file afterwards.

As for the word count, it may turn out to be smaller in the end. I would not judge before I have the file at hand.
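
One rough way to sanity-check the word count outside the CAT tool is to count only the words in the text between the tags. The following is a minimal Python sketch, not what Trados Studio does: the function name and the sample markup are made up for illustration, and a CAT tool's own count depends on its XML filter settings (which elements and attributes it treats as translatable), so the numbers will not match exactly.

```python
import re
import xml.etree.ElementTree as ET

def count_text_words(xml_string):
    """Count whitespace-separated words in text nodes only.

    Tag names and attribute values are ignored, so markup does not
    inflate the count.
    """
    root = ET.fromstring(xml_string)
    text = " ".join(root.itertext())
    return len(re.findall(r"\S+", text))

# Hypothetical sample; a real export will look different.
sample = (
    "<site><page id='1'><title>Hello world</title>"
    "<body>Two more words</body></page></site>"
)
print(count_text_words(sample))  # prints 5
```

If this figure is far below what the CAT tool reports, the tool is probably counting markup or non-translatable fields as text.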


 

Samuel Murray  Identity Verified
Netherlands
Local time: 07:50
Member (2006)
English to Afrikaans
+ ...
DOS command won't split XML smartly Jan 22, 2015

Sergei Leshchinsky wrote:
Samuel Murray wrote:
Sorry, I know of no XML splitter...

XML is a TEXT file; it can be split into smaller text files even with a DOS command, and then merged back into one file afterwards.


No, a DOS command might split a piece of translatable text right down the middle (in fact, the DOS commands that I know will happily split a word in two). Or it might split a tag in two, which would cause the CAT tool to misinterpret the tag (or worse: try to fix it). And even if it doesn't split a segment or a tag in two, it might not split nested tags cleanly, which may also affect the way the CAT tool interprets the XML.
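
To illustrate the difference, here is a minimal Python sketch of a split that only cuts between complete elements, so that no tag or segment is broken. It rests on several assumptions: the pages are direct children of the root element, the file fits into memory, and the sample markup and the function name `split_by_children` are invented for the example. It also ignores anything outside the root element (XML declaration, DOCTYPE, comments) and may rewrite namespace prefixes, so it is not a drop-in replacement for a dedicated splitter.

```python
import xml.etree.ElementTree as ET

def split_by_children(xml_string, children_per_chunk):
    """Split an XML document into smaller well-formed documents.

    Each chunk repeats the root element (tag and attributes) and holds
    up to `children_per_chunk` complete child elements.
    """
    root = ET.fromstring(xml_string)
    children = list(root)
    chunks = []
    for start in range(0, len(children), children_per_chunk):
        # New root with the same tag and attributes as the original.
        chunk_root = ET.Element(root.tag, root.attrib)
        chunk_root.extend(children[start:start + children_per_chunk])
        chunks.append(ET.tostring(chunk_root, encoding="unicode"))
    return chunks

# Hypothetical sample; a real website export will be structured differently.
sample = (
    "<site lang='en'><page>One</page><page>Two</page>"
    "<page>Three</page></site>"
)
for chunk in split_by_children(sample, 2):
    print(chunk)
```

Every chunk is a complete XML document in its own right, which is what a CAT tool's XML filter expects.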


 

Peter Sass
Germany
Local time: 07:50
Member
English to German
+ ...
TOPIC STARTER
Thanks so far Jan 23, 2015

...for all your comments!
Actually, the problem is that Trados Studio cannot process the file properly because it is simply too big (and yes, I do have a proper PC with an i5 processor and 8 GB RAM).

From previous website translations I recall that there would normally be a set of separate translation files that followed the structure of the website.
As far as I have looked into the matter so far, one needs a proper XML split programme to preserve this structure (header tags etc.), so I couldn't just split the XML file in a text editor.
I'll see what the client thinks...


 

Sergei Leshchinsky  Identity Verified
Ukraine
Local time: 08:50
Member (2008)
English to Russian
+ ...
... Jan 23, 2015

http://www.hongkiat.com/blog/split-large-xml-for-wordpress/
https://www.npmjs.com/package/xml-splitter
http://www.xponentsoftware.com/XmlSplit.aspx


 

FarkasAndras
Local time: 07:50
English to Hungarian
+ ...
Oof Jan 23, 2015

I'd be wary about splitting a huge XML with a random tool off the internet. After you stitch it back together at the end (I presume you plan to do that), it may not be exactly the same as before. The client's software might complain about it. Perhaps you could ask the client to export the site in several reasonably-sized chunks, and note that the alternative option is for you to use xml splitter XXX.
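
One way to reduce that risk is to do a dry run on the untranslated file: split it, merge the pieces straight back, and compare the result with the original before any translation starts. Below is a minimal Python sketch of such a check. The function names and the sample chunks are made up, it assumes every chunk repeats the same root element, and it compares canonical XML (via `xml.etree.ElementTree.canonicalize`, available from Python 3.8), so it will not notice differences in the XML declaration, DOCTYPE, or whitespace around text, which the client's software might still care about.

```python
import xml.etree.ElementTree as ET

def merge_chunks(chunk_strings):
    """Rebuild one document from chunks that share the same root element."""
    first = ET.fromstring(chunk_strings[0])
    merged = ET.Element(first.tag, first.attrib)
    for chunk in chunk_strings:
        # Append all complete child elements of this chunk, in order.
        merged.extend(list(ET.fromstring(chunk)))
    return ET.tostring(merged, encoding="unicode")

def same_xml(a, b):
    """Compare two documents in canonical form, ignoring surrounding whitespace."""
    return (ET.canonicalize(a, strip_text=True)
            == ET.canonicalize(b, strip_text=True))

# Hypothetical original and the chunks a splitter might have produced.
original = (
    "<site lang='en'><page>One</page><page>Two</page>"
    "<page>Three</page></site>"
)
chunks = [
    '<site lang="en"><page>One</page><page>Two</page></site>',
    '<site lang="en"><page>Three</page></site>',
]
print(same_xml(original, merge_chunks(chunks)))  # prints True
```

If the dry run does not come back identical, that is worth raising with the client before the job starts rather than after delivery.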

 

SDL Community  Identity Verified
United Kingdom
Local time: 07:50
English
I can recommend... Jan 24, 2015

... this program for splitting XML files: http://www.xponentsoftware.com/XmlSplit.aspx

It's not freeware, but it's not expensive and it's very capable. I used it to split the IATE TBX files here: http://multifarious.filkin.com/2014/07/13/what-a-whopper/

I guess 700,000 words is a lot for one file! I've never tried to handle anything that large in the Studio Editor but I can imagine it would be a fruitless and frustrating exercise.

We also have the Split and Merge tool on the OpenExchange, but you'd have to process the XML first in order to split the SDLXLIFF, and I don't know how much success you'd have handling that even without opening it in the Editor.

Regards

Paul


 

