WordPress the_content() formatting: naked text nodes

I was looking at the source output of the WordPress function the_content() and noticed that the HTML structure was strange:

<div>
    <p> <inline element> 'text node' </inline element> </p>
    'text node'
    <p> <inline element> 'text node' </inline element> </p>
    'text node'
</div>

I was editing textContent with a PHP DOM parser and found that every text node outside an inline element was not wrapped in a p tag.
Those text nodes are direct children of the div that contains the content.
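
A minimal sketch of how such naked text nodes could be listed, assuming PHP's DOM extension is available. The sample markup and the variable names are hypothetical stand-ins for the real the_content() output, not taken from an actual post.

```php
<?php
// Hypothetical sample markup mirroring the structure described above.
$html = '<div><p><em>inside em</em></p>naked one<p><strong>inside strong</strong></p>naked two</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

// Select text nodes that are direct children of a <div> and are not
// whitespace-only; text inside <p>, <em> or <strong> is not matched.
$naked = [];
foreach ($xpath->query('//div/text()[normalize-space()]') as $node) {
    $naked[] = trim($node->textContent);
}

print_r($naked);
```

If the query returns anything, those strings sit directly in the wrapping div rather than in a paragraph.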

I was wondering whether this is my fault or whether WordPress just produces bad output. It seems unlikely that such a widely used CMS would have such a basic formatting issue.

EDIT:
I still don’t know whether other theme developers have run into this issue with WordPress.
In any case, I wrote a small snippet to fix it.

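function setDOM() {
    // Raw post content; get_the_content() does not apply the 'the_content' filters.
    $html = get_the_content();
    // Collapse runs of whitespace into single spaces (note the backslash in \s).
    $html = trim( preg_replace( '/\s+/', ' ', $html ) );

    $dom = new DOMDocument;
    // loadHTML() wraps the fragment in a full <html><body> document.
    $dom->loadHTML( $html );

    $xpath = new DOMXPath( $dom );
    $textNodes = $xpath->query( '//text()' );

    // Parents whose text is left alone. 'p' is included so that text already
    // inside a paragraph is not wrapped in a second <p>.
    $skipParents = array( 'p', 'em', 'strong', 'a', 'dt' );

    foreach ( $textNodes as $textNode ) {
        $parent = $textNode->parentNode;
        $txt = $textNode->textContent;

        // Skip whitespace-only nodes and text inside the excluded parents.
        if ( trim( $txt ) === '' || in_array( $parent->nodeName, $skipParents, true ) ) {
            continue;
        }

        // Wrap the naked text node in a new <p> element.
        $newP = $dom->createElement( 'p' );
        $newP->appendChild( $dom->createTextNode( $txt ) );
        $parent->replaceChild( $newP, $textNode );
    }

    return $dom;
}

$dom = setDOM();

echo $dom->saveHTML();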

I am admittedly a PHP novice, and any tips or feedback on that snippet would be appreciated.
