Wednesday, October 11, 2006

Reading xml fast

Using the System.Diagnostics.Stopwatch class, I've run an interesting experiment on how to read a 1 MB xml file fast. For this experiment, I made a small console application; for each of the code examples below, I traverse the xml tree and write the values to the console. Finally, I tried a smaller 5 KB file to see whether the results hold for smaller files.
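The timing itself can be sketched roughly as below. This is a minimal sketch, not the code from the experiment: the `Timing` class and its `Measure` method are hypothetical names, and it uses lambda-friendly `Action`, which assumes a newer .NET version than was current when this was written.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper (not from the original experiment):
// runs an action once and returns how long it took.
public static class Timing
{
    public static TimeSpan Measure(Action action)
    {
        Stopwatch watch = Stopwatch.StartNew();  // create and start in one call
        action();                                // the code under test
        watch.Stop();
        return watch.Elapsed;                    // same value shown as watch.Elapsed in the results
    }
}
```

A call such as `Timing.Measure(() => Console.WriteLine("hello"))` returns the elapsed time as a `TimeSpan`. A single run like this includes one-off costs such as JIT compilation, so repeating the measurement is probably worthwhile for the small file.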

The xml is quite simple and looks something like this:

  • xml
    • entry
      • name
      • adress
      • (...)


As a reference, I started off with a simple stream read:

using (StreamReader reader = File.OpenText("c:\\testfile2.xml"))
{
    string input = null;
    while ((input = reader.ReadLine()) != null)
        Console.WriteLine(input);   // put each line on the screen
}

And the result:
watch.Elapsed = {00:00:02.3910558}

The result is pretty much as expected: about 2.4 seconds to read all of the lines and put them on the screen.


Next up is the xml reader. It traverses the xml in a straightforward way and reads all of the nodes.

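FileStream stream = new FileStream("c:\\testfile2.xml", FileMode.Open);
XmlReader reader = new XmlTextReader(stream);
while (reader.Read())   // visit every node in document order
    Console.WriteLine(reader.Value);   // empty for nodes without a value, such as elements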

And the result:
watch.Elapsed = {00:00:12.7904473}

About 12.8 seconds is roughly what I expected. There's some overhead in finding the xml nodes, but it all seems pretty much as expected. If we tried this experiment over the internet or a slower network, I believe the xml reader overhead would not be that visible.
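To illustrate what "finding the xml nodes" involves, here is a minimal sketch of a forward-only read that keeps only the text nodes. The class and method names are hypothetical, and it collects the values in a list instead of writing them to the console so the result is easy to check; it also uses `XmlReader.Create` rather than `XmlTextReader`.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical sketch: forward-only traversal with XmlReader.
public static class XmlReaderSketch
{
    // Returns the content of every text node, in document order.
    public static List<string> ReadTextValues(TextReader input)
    {
        List<string> values = new List<string>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (reader.Read())                        // advance to the next node
            {
                if (reader.NodeType == XmlNodeType.Text) // skip elements, whitespace, etc.
                    values.Add(reader.Value);
            }
        }
        return values;
    }
}
```

For a file on disk, something like `XmlReaderSketch.ReadTextValues(File.OpenText("c:\\testfile2.xml"))` should work the same way.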


Next, we try the XPath approach:

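FileStream stream = new FileStream("c:\\testfile2.xml", FileMode.Open);
XPathDocument document = new XPathDocument(stream);   // loads the whole file into memory
XPathNavigator navigator = document.CreateNavigator();
XPathNodeIterator node = navigator.Select("xml/entry");   // every entry under the root
while (node.MoveNext())
    Console.WriteLine(node.Current.Value);   // write the entry's text to the console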

And the result:
watch.Elapsed = {00:03:29.5681325}

I knew it would take some time, but three and a half minutes is not acceptable. XPath is still a favourite of mine, as it makes it possible to navigate the xml tree in an absolutely beautiful way.
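As an illustration of that kind of navigation, here is a minimal sketch that selects every entry and then picks out its name child with a second, relative expression. The class and method names are hypothetical, and the element names are assumed from the structure outlined earlier.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

// Hypothetical sketch: navigating the tree with XPath expressions.
public static class XPathSketch
{
    // Returns the text of the name element of every xml/entry node.
    public static List<string> SelectNames(TextReader input)
    {
        XPathDocument document = new XPathDocument(input);
        XPathNavigator navigator = document.CreateNavigator();
        XPathNodeIterator entries = navigator.Select("xml/entry");

        List<string> names = new List<string>();
        while (entries.MoveNext())
        {
            // Relative expression, evaluated from the current entry.
            XPathNavigator name = entries.Current.SelectSingleNode("name");
            if (name != null)
                names.Add(name.Value);
        }
        return names;
    }
}
```

The appeal is that changing what gets selected is only a matter of changing the expression strings, not the traversal code.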


Next is the dataset approach.

DataSet ds = new DataSet();
ds.ReadXml("c:\\testfile2.xml");   // load the file; without this the data set has no tables
foreach (DataTable tbl in ds.Tables)
    foreach (DataRow dr in tbl.Rows)
        for (...){...}

And the result:
watch.Elapsed = {00:00:03.6352829}

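I'm a bit surprised. The dataset seems to be the fastest way of traversing an xml file: it is only about 1.2 seconds slower than the plain stream read, which doesn't parse the xml at all. Note that dataset navigation can be cumbersome when the data set contains a lot of tables, like this one.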
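A full traversal of a data set might look roughly like the sketch below. It is an assumption about how the inner loop could be written, not the code that was measured; the class and method names are hypothetical, and the values are collected in a list rather than written to the console.

```csharp
using System.Collections.Generic;
using System.Data;
using System.IO;

// Hypothetical sketch: walking every table, row and column of a DataSet.
public static class DataSetSketch
{
    public static List<string> ReadAllValues(TextReader input)
    {
        DataSet ds = new DataSet();
        ds.ReadXml(input);   // infers tables and columns from the xml structure

        List<string> values = new List<string>();
        foreach (DataTable tbl in ds.Tables)
            foreach (DataRow dr in tbl.Rows)
                foreach (DataColumn col in tbl.Columns)
                    values.Add(dr[col].ToString());
        return values;
    }
}
```

With nested xml, the inferred schema tends to add extra tables and key columns to link parents and children, which is where the navigation starts to get cumbersome.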

A smaller xml file

I've also tried the same approaches on a 5 KB xml file. Here are the results, and now it turns out XPath is the fastest method:

watch.Elapsed = {00:00:00.0144736}

watch.Elapsed = {00:00:00.0302896}

watch.Elapsed = {00:00:00.0151563}

watch.Elapsed = {00:00:00.0225523}
