By Django Johnson


2013-08-29 18:36:14 8 Comments

I need some help doing the following:

I want to stream parse a large XML file (4 GB) with PHP. I can't use SimpleXML or DOM because they load the entire file into memory, so I need something that can stream the file.

How can I do this in PHP?

What I am trying to do is navigate through a series of <doc> elements and write some of their children to a new XML file.

The XML file I am trying to parse looks like this:

<feed>
    <doc>
        <title>Title of first doc is here</title>
        <url>URL is here</url>
        <abstract>Abstract is here...</abstract>
        <links>
            <sublink>Link is here</sublink>
            <sublink>Link is here</sublink>
            <sublink>Link is here</sublink>
            <sublink>Link is here</sublink>
            <sublink>Link is here</sublink>
        </links>
    </doc>
    <doc>
        <title>Title of second doc is here</title>
        <url>URL is here</url>
        <abstract>Abstract is here...</abstract>
        <links>
            <sublink>Link is here</sublink>
            <sublink>Link is here</sublink>
            <sublink>Link is here</sublink>
            <sublink>Link is here</sublink>
            <sublink>Link is here</sublink>
        </links>
    </doc>
</feed>

I'm trying to get / copy all the children of each <doc> element into a new XML file except the <links> element and its children.

So I want the new XML file to look like:

<doc>
    <title>Title of first doc is here</title>
    <url>URL is here</url>
    <abstract>Abstract is here...</abstract>
</doc>
<doc>
    <title>Title of second doc is here</title>
    <url>URL is here</url>
    <abstract>Abstract is here...</abstract>
</doc>

I would greatly appreciate any and all help in streaming / stream parsing / stream reading the original XML file and then writing some of its contents to a new XML file in PHP.

2 comments

@higuaro 2013-08-29 20:28:27

As you stated, a DOM parser is not an option in this scenario: the file will not fit in memory, and even if it did, it would be slow because the entire file has to be loaded before you can iterate through it. For this case you should try a SAX parser (event/stream oriented): add handlers for the tags you're interested in (doc, title, url, abstract) and, for every event, append the node found to the new XML file.

Here you have more information:

What is the fastest XML parser in PHP?

Here is a (not tested) sample of what the code would be:

<?php
    $file = "bigfile.xml";
    $fh = fopen("out.xml", 'a') or die("can't open file");
    $currentNodeTag = "";
    $tags = array("doc", "title", "url", "abstract");

    function startElement($parser, $name, $attrs) {
        global $tags, $fh, $currentNodeTag;

        // Names arrive upper-cased because case folding is on by default
        $name = strtolower($name);
        if (in_array($name, $tags)) {
            $currentNodeTag = $name;
            fwrite($fh, sprintf("<%s>", $name));
        }
    }

    function endElement($parser, $name) {
        global $tags, $fh, $currentNodeTag;

        $name = strtolower($name);
        if (in_array($name, $tags)) {
            fwrite($fh, sprintf("</%s>\n", $name));
            $currentNodeTag = "";
        }
    }

    function characterData($parser, $data) {
        global $fh, $currentNodeTag;

        // Only write text that belongs to one of the wanted tags
        if (!empty($currentNodeTag)) {
            // The parser has already decoded entities, so escape &, < and > again
            fwrite($fh, htmlspecialchars($data, ENT_NOQUOTES));
        }
    }

    $xmlParser = xml_parser_create();
    xml_set_element_handler($xmlParser, "startElement", "endElement");
    xml_set_character_data_handler($xmlParser, "characterData");

    if (!($fp = fopen($file, "r"))) {
        die("could not open XML input");
    }

    // Feed the file to the parser in 4 KB chunks so it never sits in memory whole
    while ($data = fread($fp, 4096)) {
        if (!xml_parse($xmlParser, $data, feof($fp))) {
            die(sprintf("XML error: %s at line %d",
                        xml_error_string(xml_get_error_code($xmlParser)),
                        xml_get_current_line_number($xmlParser)));
        }
    }

    xml_parser_free($xmlParser);
    fclose($fp);
    fclose($fh);
?>

@Django Johnson 2013-08-29 23:24:28

I'm getting an error with the code that I can't seem to fix. It also doesn't make sense. The error I am getting is: `PHP Parse error: syntax error, unexpected ';' in /Users/irfanm/Desktop/mamp/xml2.php on line 12`.

@DeeDee 2013-08-29 19:08:56

Here's a college try. This assumes a file is being used, and that you want to write to a file:

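<?php

// Children of <doc> to copy; anything else (such as <links>) is skipped
$interestingNodes = array('title', 'url', 'abstract');
$xmlObject = new XMLReader();
$xmlObject->open('bigolfile.xml');

$xmlOutput = new XMLWriter();
$xmlOutput->openURI('destfile.xml');
$xmlOutput->setIndent(true);
$xmlOutput->setIndentString("   ");
$xmlOutput->startDocument('1.0', 'UTF-8');
// Assumption: a wrapper root keeps the output well-formed
// (the sample output in the question has no root element)
$xmlOutput->startElement('feed');

while ($xmlObject->read()) {
    if ($xmlObject->nodeType == XMLReader::ELEMENT) {
        if ($xmlObject->name == 'doc') {
            $xmlOutput->startElement('doc');
        } elseif (in_array($xmlObject->name, $interestingNodes)) {
            // readString() returns the text content of the current element
            $xmlOutput->writeElement($xmlObject->name, $xmlObject->readString());
        }
    } elseif ($xmlObject->nodeType == XMLReader::END_ELEMENT
              && $xmlObject->name == 'doc') {
        $xmlOutput->endElement(); // close the doc node
    }
}

$xmlOutput->endElement(); // close the wrapper root
$xmlObject->close();
$xmlOutput->endDocument();
$xmlOutput->flush();
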
?>
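
Another way to keep memory flat, offered as a rough sketch and not tested against a 4 GB file: let XMLReader walk the file and expand only one <doc> at a time into a small DOM subtree with XMLReader::expand(). The function name copyDocs, its parameters, and the <feed> wrapper element in the output are hypothetical choices made for illustration.

```php
<?php
// Sketch only: copyDocs() and its parameters are hypothetical names.
// Streams $inputPath with XMLReader and copies the child elements listed
// in $keep from every <doc> into $outputPath.
// Returns the number of <doc> elements written.
function copyDocs($inputPath, $outputPath, array $keep)
{
    $reader = new XMLReader();
    if (!$reader->open($inputPath)) {
        throw new RuntimeException("Cannot open $inputPath");
    }

    $writer = new XMLWriter();
    $writer->openURI($outputPath);
    $writer->setIndent(true);
    $writer->startDocument('1.0', 'UTF-8');
    // Assumption: wrap the output in a single root so it stays well-formed.
    $writer->startElement('feed');

    $dom = new DOMDocument();
    $count = 0;

    // Advance to the first <doc> element.
    while ($reader->read() && $reader->name !== 'doc') {
        // keep reading
    }

    while ($reader->name === 'doc') {
        // Only this one <doc> subtree is materialised in memory.
        $doc = simplexml_import_dom($dom->importNode($reader->expand(), true));

        $writer->startElement('doc');
        foreach ($doc->children() as $child) {
            if (in_array($child->getName(), $keep, true)) {
                $writer->writeElement($child->getName(), (string) $child);
            }
        }
        $writer->endElement(); // </doc>
        $count++;

        // Jump to the next <doc>, skipping the subtree just handled.
        if (!$reader->next('doc')) {
            break;
        }
    }

    $writer->endElement(); // </feed>
    $writer->endDocument();
    $writer->flush();
    $reader->close();

    return $count;
}
```

Called as copyDocs('bigolfile.xml', 'destfile.xml', array('title', 'url', 'abstract')), this should leave <links> out of every copied <doc>. Using next('doc') instead of read() means the reader does not step through each <doc>'s children a second time after they have been expanded.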

@Django Johnson 2013-08-29 19:32:47

What was in your latest edit? I can't tell the difference between this current version and the version I was reading before.

@Django Johnson 2013-08-29 19:33:20

This looks like exactly what I was looking for, thank you. I will try it out later tonight and let you know what happens.

@DeeDee 2013-08-29 19:33:24

I closed the <?php tag

@Django Johnson 2013-08-29 19:35:33

Ah, okay. So no changes to the code. I will try it later tonight and let you know how it goes.

@Django Johnson 2013-08-29 23:13:40

I got an error when I tried to run a file with only your code in it. Here is the error I got in my error log: PHP Fatal error: Call to undefined method XMLReader::startElement() in xml.php on line 15

@DeeDee 2013-08-30 02:54:16

Yep, I got my $xmlObject and $xmlOutput variables confused. Try my edit!

@Django Johnson 2013-08-31 14:59:07

Ah, okay. The resulting file is created with a doctype and an unclosed <doc tag, but I am getting another error now: Call to undefined method XMLReader::endElement() in xml.php on line 22. Sorry for all the trouble and thank you for the help!

@DeeDee 2013-08-31 22:06:53

Hahaha, I got in trouble with my variable names. They're very easy to confuse when working outside of an IDE. Try the latest update :)

@Django Johnson 2013-09-03 14:51:57

Ah, no worries. Thank you for the help. The latest update is running without error, but all I am getting is <doc/> elements, one on each line.
