Changing a DOMNode's tag name creates anomalies in the saved HTML

I’m in the process of writing a WordPress post parser for my personal website, but I’m hitting some behaviour I can’t explain.

Here’s the code:

// WordPress uses <p></p> sections for new lines
$sections = $doc->getElementsByTagName('p');

foreach ($sections as $section) 
{
    $hasChildren = $section->hasChildNodes();
    $contents = $section->nodeValue;

    // If we have text, assume we are a paragraph (for the time being)
    if (!empty($contents))
    {
        $section->setAttribute('class', 'post-inner-content-paragraph');
    }
    elseif ($hasChildren)
    {
        $section->setAttribute('class', 'post-inner-content-media');
        $section = change_tag_name($section, 'div');

        $imgs = $section->getElementsByTagName('img');
        foreach ($imgs as $img)
        {
            $img->removeAttribute('class');
        }
    }
    else
    {
        $section->setAttribute('class', 'post-inner-content-empty');
    }        
}

change_tag_name:

function change_tag_name($node, $name) 
{
    $doc = $node->ownerDocument;

    $newnode = $doc->createElement($name);

    foreach ($node->childNodes as $child)
    {
        $child = $doc->importNode($child, true);
        $newnode->appendChild($child);
    }

    if ($node->hasAttributes())
    {
        foreach ($node->attributes as $attr)
        {
            $name = $attr->nodeName;
            $value = $attr->nodeValue;
            $newnode->setAttribute($name, $value);
        }
    }

    $node->parentNode->replaceChild($newnode, $node);

    return $newnode;
}

There’s no way for a <p> block to be passed in as a section and NOT get a class attribute assigned to it, however:

[Screenshot: parsed post]

The highlighted <p> block doesn’t have a class!

Here’s the HTML loaded into the DOMDocument $doc: http://pastebin.com/biVSyWn9

Here’s the HTML leaving my parse function: http://pastebin.com/RhzgeWAS

I can’t find any reason why this particular <p> block isn’t being given a class.


2 comments

  1. I ran this using DOMDocument (assuming that you’re using it for parsing). I also commented out your change_tag_name function since the source code for that was not posted.
    It works. I got class attributes added to all the <p> tags.

    Now, as to why it doesn’t work for you, I can think of only two reasons:

    • The end tag of the <p> just before the failing one is not recognized for some reason, so the parser reads the next <p> tag as part of the previous <p>.
    • The change_tag_name function may be doing something which it is not intended to do (highly unlikely, but that is something you may want to rule out).
  2. Solution

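    You have to traverse the node list backwards to make the kind of changes I want to make. crnix’s answer helped identify that the problem occurred with replaceChild inside the change_tag_name function. getElementsByTagName returns a live list, so replacing a <p> removes it from the list and shifts the remaining items down, which makes a forward foreach skip the next one. Changing my foreach loop to the following fixed the issue: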

    $sections = $doc->getElementsByTagName('p');
    $i = $sections->length - 1;
    while ($i > -1)
    {
        $section = $sections->item($i);
    
        // Change tag name of section
    
        $i--;
    }
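
An alternative to iterating backwards, sketched here rather than taken from the original post, is to copy the node list into a plain array before changing anything. The array keeps its length and order when a <p> is replaced, so an ordinary foreach still visits every element. The helper name retag and the sample HTML are assumptions made for illustration; retag is a simplified stand-in for change_tag_name that moves the children instead of importing copies.

```php
<?php
// Hypothetical helper (not the poster's change_tag_name): swap an element
// for a new one with a different tag name, keeping children and attributes.
function retag(DOMElement $node, string $name): DOMElement
{
    $new = $node->ownerDocument->createElement($name);

    // appendChild() moves a node, so firstChild advances on every pass.
    while ($node->firstChild !== null) {
        $new->appendChild($node->firstChild);
    }

    // Copy every attribute (class, id, ...) onto the replacement element.
    foreach ($node->attributes as $attr) {
        $new->setAttribute($attr->nodeName, $attr->nodeValue);
    }

    $node->parentNode->replaceChild($new, $node);
    return $new;
}

// Sample input, assumed for this sketch.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><p class="a">one</p><p>two</p><p>three</p></body></html>');

// Snapshot: a plain array does not shrink when a <p> leaves the tree.
$sections = iterator_to_array($doc->getElementsByTagName('p'));

$changed = 0;
foreach ($sections as $section) {
    retag($section, 'div');
    $changed++;
}

$remaining = $doc->getElementsByTagName('p')->length;
$divs = $doc->getElementsByTagName('div');
```

Whether this reads better than the backwards while loop is a matter of taste. The snapshot costs one extra array but lets the loop body stay exactly as it was in the question.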
    
