PHP DOMDocument ignores first table's closing tag

  • #10656
    tvanc
    Participant

    I was writing a tool to convert HTML tables to CSV and noticed some bizarre behavior. Given this code:

    $html = <<<HTML
    <table>
    <tr><td>A</td><td>Rose</td></tr>
    </table>
    
    <h1>Leave me behind</h1>
    
    <table>
    <tr><td>By</td><td>Any</td></tr>
    </table>
    
    <table>
    <tr><td>Other</td><td>Name</td></tr>
    </table>
    HTML;
    
    $dom = new \DOMDocument();
    $dom->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
    
    foreach ($dom->getElementsByTagName('table') as $table) {
        foreach ($table->getElementsByTagName('tr') as $row) {
            echo trim($row->nodeValue) . PHP_EOL;
        }
    }
    

    I would expect output like this:

    ARose
    ByAny
    OtherName
    

    But what I get is this:

    ARose
    ByAny
    OtherName
    ByAny
    OtherName
    

    I get the same result if I omit the first closing tag. It appears DOMDocument is nesting the second and third <table> inside the first.
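
    One way to check the nesting directly, as a minimal sketch: print the parent of each <table>. The shortened two-table input and the $parents variable are illustrative assumptions, not part of the original script. If the second table really is nested, its parent will not be the document node.

```php
<?php
// Sketch: list the parent node of every <table> to see how the fragment was nested.
$html = <<<HTML
<table>
<tr><td>A</td><td>Rose</td></tr>
</table>

<h1>Leave me behind</h1>

<table>
<tr><td>By</td><td>Any</td></tr>
</table>
HTML;

$dom = new \DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);

$parents = [];
foreach ($dom->getElementsByTagName('table') as $table) {
    // The document node reports "#document"; an element reports its tag name.
    $parents[] = $table->parentNode->nodeName;
}

echo implode(', ', $parents) . PHP_EOL;
```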

    Indeed, if I use XPath to select only the rows that are immediate children of each table, I get the correct output:

    $xpath = new \DOMXPath($dom);
    
    foreach ($dom->getElementsByTagName('table') as $table) {
        foreach ($xpath->query('./tr', $table) as $row) {
            echo trim($row->nodeValue) . PHP_EOL;
        }
    }
    
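    Since the goal was CSV output, here is a rough sketch of how the same immediate-children query might feed fputcsv(). The helper name tablesToCsv() and the php://memory buffer are assumptions made for illustration; the empty escape argument assumes PHP 7.4 or later.

```php
<?php
// Sketch only: tablesToCsv() is a hypothetical helper, not the original tool.
function tablesToCsv(string $html): string
{
    $dom = new \DOMDocument();
    $dom->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
    $xpath = new \DOMXPath($dom);

    // Buffer the CSV in memory so it can be returned as a string.
    $stream = fopen('php://memory', 'w+');

    foreach ($dom->getElementsByTagName('table') as $table) {
        // './tr' matches only rows that are direct children of this table,
        // so rows of any table nested inside it are not repeated here.
        foreach ($xpath->query('./tr', $table) as $row) {
            $cells = [];
            foreach ($xpath->query('./td|./th', $row) as $cell) {
                $cells[] = trim($cell->textContent);
            }
            // Explicit separator, enclosure and (empty) escape arguments.
            fputcsv($stream, $cells, ',', '"', '');
        }
    }

    rewind($stream);
    $csv = stream_get_contents($stream);
    fclose($stream);

    return $csv;
}

echo tablesToCsv(
    '<table><tr><td>A</td><td>Rose</td></tr></table>'
    . '<table><tr><td>By</td><td>Any</td></tr></table>'
);
```

    Markup that contains an explicit <tbody> would need './tbody/tr' as well, since './tr' only looks at direct children.
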
    #10657
    ken-lee
    Participant

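    Wrap your $html in <body> and </body>. With LIBXML_HTML_NOIMPLIED and no single root element, the parser appears to treat the first <table> as the root and attach everything after it as its children; an explicit <body> gives it a real root.

    Revised code (note: I commented out the $stream lines):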

    <?php
    $html = <<<HTML
    <body>
    <table>
    <tr><td>A</td><td>Rose</td></tr>
    </table>
    
    <h1>Leave me behind</h1>
    
    <table>
    <tr><td>By</td><td>Any</td></tr>
    </table>
    
    <table>
    <tr><td>Other</td><td>Name</td></tr>
    </table>
    </body>
    HTML;
    
    $dom = new \DOMDocument();
    $dom->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
    
    $tables = $dom->getElementsByTagName('table');
    // $stream = \fopen('php://output', 'w+');
    
    for ($i = 0; $i < $tables->length; ++$i) {
        $rows = $tables->item($i)->getElementsByTagName('tr');
    
        for ($j = 0; $j < $rows->length; ++$j) {
            echo trim($rows->item($j)->nodeValue) . "<br><br>";
        }
    }
    
    // fclose($stream);
    ?>
    

    Alternatively, change

    $dom->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
    

    to

    $dom->loadHTML($html, LIBXML_HTML_NODEFDTD);
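
    As a rough check of this second option, the sketch below parses a shortened version of the fragment without LIBXML_HTML_NOIMPLIED and prints each table's parent. The compact input string and the $parents and $fragment variables are assumptions for illustration. Because the parser now adds <html> and <body> itself, the last loop shows one possible way to serialize only the original fragment again.

```php
<?php
// Sketch: without LIBXML_HTML_NOIMPLIED the parser supplies <html><body> itself.
$html = '<table><tr><td>A</td><td>Rose</td></tr></table>'
      . '<h1>Leave me behind</h1>'
      . '<table><tr><td>By</td><td>Any</td></tr></table>';

$dom = new \DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD);

$parents = [];
foreach ($dom->getElementsByTagName('table') as $table) {
    $parents[] = $table->parentNode->nodeName;
}
echo implode(', ', $parents) . PHP_EOL;

// Serialize only the children of the implied <body>, dropping the wrappers.
$body = $dom->getElementsByTagName('body')->item(0);
$fragment = '';
foreach ($body->childNodes as $child) {
    $fragment .= $dom->saveHTML($child);
}
echo $fragment . PHP_EOL;
```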
    