First getElementsByTagName() returns all elements in HTML (Strange behaviour)

I am using PHP to parse HTML provided to me by WordPress.

This is a post's HTML as returned by WordPress:

<p>Test</p> 
<p>
    <img class="alignnone size-thumbnail wp-image-39" src="img.png"/>
</p> 
<p>Ok.</p>

This is my parsing function (with debugging left in):

function get_parsed_blog_post()
{
    $html = ob_wp_content(false);

    print_r(htmlspecialchars($html));
    echo '<hr/><hr/><hr/>';

    $parse = new DOMDocument();
    $parse->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    $xpath = new DOMXpath($parse);
    $ps = $xpath->query('//p');

    foreach ($ps as $p) 
    {
        $imgs = $p->getElementsByTagName('img');

        print($imgs->length);
        echo '<br/>';

        if ($imgs->length > 0)
        {
            $p->setAttribute('class', 'image-content');

            foreach ($imgs as $img)
            {
                $img->removeAttribute('class');
            }
        }        
    }

    $htmlFinal = $parse->saveHTML();

    print_r(htmlspecialchars($htmlFinal));
    echo '<hr/><hr/><hr/>';

    return $htmlFinal;
}

The purpose of this code is to remove the classes WordPress adds to the <img>s, and to give any <p> that contains an image the class image-content.

And this returns:

1
1
0
<p class="image-content">Test
<p class="image-content">
    <img src="img.png">
</p>
<p>Ok.</p></p>

Somehow, it has wrapped the first occurrence of <p> around my entire parsed post, causing the first <p> to have the image-content class incorrectly applied. Why is this happening? How do I stop it?

1 comment

  1. METHOD 1

    To keep your code as close to the original as possible, I made a few changes to get it working.

    If you print out each $p, you will see that the first element contains all of your HTML. The simplest solution is to add an empty <p> before your HTML and skip it inside the foreach loop.

    function get_parsed_blog_post()
    {
        $page_content_html = ob_wp_content(false);
        $html = "<p></p>".$page_content_html;
        print_r(htmlspecialchars($html));
        echo '<hr/><hr/><hr/>';
    
        $parse = new DOMDocument();
        $parse->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    
        $xpath = new DOMXpath($parse);
        $ps = $xpath->query('//p');
        $i = 0;
        foreach ($ps as $p) 
        {
            if($i != 0) {
                $imgs = $p->getElementsByTagName('img');
    
                print($imgs->length);
                echo '<br/>';
    
                if ($imgs->length > 0)
                {
                    $p->setAttribute('class', 'image-content');
    
                    foreach ($imgs as $img)
                    {
                        $img->removeAttribute('class');
                    }
                }
            }
            $i++;
        }
    
        $htmlFinal = $parse->saveHTML();
    
        print_r(htmlspecialchars($htmlFinal));             
        echo '<hr/><hr/><hr/>';
    
        return $htmlFinal;
    }
    

    Total execution time in seconds: 0.00034999847412109
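
    A variation on the same idea, as a minimal sketch: instead of a sacrificial empty <p>, wrap the content in a single <div> so the parser has one root element, then serialise only that wrapper's children. The function name mark_image_paragraphs() and the sample markup are assumptions for illustration and are not part of the original code; the sketch also assumes plain ASCII content and does not deal with character encoding.

    ```php
    <?php
    // Hypothetical helper (name and sample input are illustrative only).
    function mark_image_paragraphs(string $html): string
    {
        $doc = new DOMDocument();
        // Give the fragment exactly one top-level element before parsing,
        // so the first <p> is not promoted to the document root.
        $doc->loadHTML(
            '<div>' . $html . '</div>',
            LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
        );

        foreach ($doc->getElementsByTagName('p') as $p) {
            $imgs = $p->getElementsByTagName('img');
            if ($imgs->length > 0) {
                $p->setAttribute('class', 'image-content');
                foreach ($imgs as $img) {
                    $img->removeAttribute('class');
                }
            }
        }

        // Serialise the children of the wrapper, leaving the <div> itself out.
        $out = '';
        foreach ($doc->documentElement->childNodes as $child) {
            $out .= $doc->saveHTML($child);
        }
        return $out;
    }

    $sample = '<p>Test</p><p><img class="alignnone" src="img.png"/></p><p>Ok.</p>';
    $result = mark_image_paragraphs($sample);
    echo $result, PHP_EOL;
    ```

    With this approach there is no counter to maintain in the loop, and no extra empty <p> ends up in the returned markup.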

    METHOD 2

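    The problem is caused by the LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD flags. With LIBXML_HTML_NOIMPLIED the parser does not add the implied <html> and <body> elements, so the first <p> ends up as the root element and the parent of everything after it. You can strip the document tags without these flags instead, like this: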

    function get_parsed_blog_post()
    {
        $page_content_html = ob_wp_content(false);
        $doc = new DOMDocument();
        $doc->loadHTML($page_content_html);
        foreach ($doc->getElementsByTagName('p') as $paragraph) {
            $imgs = $paragraph->getElementsByTagName('img');
            if ($imgs->length > 0)
            {
                $paragraph->setAttribute('class', 'image-content');

                foreach ($imgs as $img)
                {
                    $img->removeAttribute('class');
                }
            }
        }

        /* REMOVING DOCTYPE, HTML AND BODY TAGS */

        // Removing DOCTYPE
        $doc->removeChild($doc->doctype);

        // Removing HTML tag
        $doc->replaceChild($doc->firstChild->firstChild, $doc->firstChild);

        // Removing Body Tag
        $html = $doc->getElementsByTagName("body")->item(0);
        $fragment = $doc->createDocumentFragment();
        while ($html->childNodes->length > 0) {
            $fragment->appendChild($html->childNodes->item(0));
        }
        $html->parentNode->replaceChild($fragment, $html);

        $htmlFinal = $doc->saveHTML();

        print_r(htmlspecialchars($htmlFinal));
        echo '<hr/><hr/><hr/>';

        return $htmlFinal;
    }
    

    Total execution time in seconds: 0.00026822090148926
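
    A shorter way to get the same result as Method 2, as a minimal sketch: rather than removing the doctype, <html> and <body> nodes from the document, leave the tree alone and serialise only the children of <body>. The function name parse_without_flags() and the sample markup are assumptions for illustration, not part of the original code, and the sketch assumes non-empty ASCII input.

    ```php
    <?php
    // Hypothetical helper (name and sample input are illustrative only).
    function parse_without_flags(string $html): string
    {
        $doc = new DOMDocument();
        // No flags: the parser adds <html> and <body> itself,
        // so the <p> elements stay siblings inside <body>.
        $doc->loadHTML($html);

        foreach ($doc->getElementsByTagName('p') as $p) {
            $imgs = $p->getElementsByTagName('img');
            if ($imgs->length > 0) {
                $p->setAttribute('class', 'image-content');
                foreach ($imgs as $img) {
                    $img->removeAttribute('class');
                }
            }
        }

        $body = $doc->getElementsByTagName('body')->item(0);
        if ($body === null) {
            return '';
        }

        // Passing a node to saveHTML() serialises just that node,
        // so the doctype and wrapper tags never reach the output.
        $out = '';
        foreach ($body->childNodes as $child) {
            $out .= $doc->saveHTML($child);
        }
        return $out;
    }

    $sample = '<p>Test</p><p><img class="alignnone" src="img.png"/></p><p>Ok.</p>';
    $result = parse_without_flags($sample);
    echo $result, PHP_EOL;
    ```

    The trade-off is one extra loop at the end in exchange for not restructuring the document before saving it.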
