I am using PHP to parse HTML provided to me by WordPress.
This is a post’s HTML as returned by WordPress:
<p>Test</p>
<p>
<img class="alignnone size-thumbnail wp-image-39" src="img.png"/>
</p>
<p>Ok.</p>
This is my parsing function (with debugging left in):
function get_parsed_blog_post()
{
    $html = ob_wp_content(false);
    print_r(htmlspecialchars($html));
    echo '<hr/><hr/><hr/>';
    $parse = new DOMDocument();
    $parse->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXpath($parse);
    $ps = $xpath->query('//p');
    foreach ($ps as $p)
    {
        $imgs = $p->getElementsByTagName('img');
        print($imgs->length);
        echo '<br/>';
        if ($imgs->length > 0)
        {
            $p->setAttribute('class', 'image-content');
            foreach ($imgs as $img)
            {
                $img->removeAttribute('class');
            }
        }
    }
    $htmlFinal = $parse->saveHTML();
    print_r(htmlspecialchars($htmlFinal));
    echo '<hr/><hr/><hr/>';
    return $htmlFinal;
}
The purpose of this code is to remove the classes WordPress adds to the <img> elements, and to give any <p> that contains an image the class image-content.
And this returns:
1
1
0
<p class="image-content">Test
<p class="image-content">
<img src="img.png">
</p>
<p>Ok.</p></p>
Somehow, it has wrapped the first <p> around my entire parsed post, causing the first <p> to have the image-content class incorrectly applied. Why is this happening? How do I stop it?
METHOD 1
To keep your code as close to the original as possible, I made a few changes to get it working.
If you print out each $p, you will see that the first element contains all of your HTML. The simplest solution is to add a blank <p> before your HTML and skip it in the foreach loop.
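The blank-<p> workaround described above could look roughly like the sketch below. The function name add_image_content_class() and its $html parameter are hypothetical stand-ins for the question's get_parsed_blog_post() and ob_wp_content() call; the rest follows the original loop.

```php
<?php
// Hypothetical helper: same logic as the question's function, but the HTML
// is passed in as a parameter and a blank <p> is prepended as a placeholder.
function add_image_content_class(string $html): string
{
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $doc = new DOMDocument();
    // The blank <p> comes first, so it (not the first real paragraph)
    // is the element that may end up wrapping everything else.
    $doc->loadHTML('<p></p>' . $html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    $paragraphs = $doc->getElementsByTagName('p');
    $placeholder = $paragraphs->item(0);

    foreach ($paragraphs as $p) {
        // Skip the placeholder paragraph that was prepended above.
        if ($p->isSameNode($placeholder)) {
            continue;
        }
        $imgs = $p->getElementsByTagName('img');
        if ($imgs->length > 0) {
            $p->setAttribute('class', 'image-content');
            foreach ($imgs as $img) {
                $img->removeAttribute('class');
            }
        }
    }

    return $doc->saveHTML();
}
```

Note that the placeholder <p> is still part of the returned markup, so it may need to be stripped before the result is output.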
METHOD 2
The problem is caused by the flags LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD: with no implied <html> and <body> wrappers, the first <p> ends up as the parent of everything after it. You can load the HTML without these flags and strip the document tags yourself.
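As a minimal sketch of this second approach, the HTML can be loaded with the default options and only the children of <body> serialized afterwards. The function name add_image_content_class_without_flags() and its $html parameter are hypothetical; in the question's code the markup would come from ob_wp_content(false).

```php
<?php
// Hypothetical helper: parse with default options, then return only the
// markup inside <body>, so no doctype/html/body tags leak into the result.
function add_image_content_class_without_flags(string $html): string
{
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $doc = new DOMDocument();
    $doc->loadHTML($html); // no flags: libxml adds the doctype, <html> and <body>
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('p') as $p) {
        $imgs = $p->getElementsByTagName('img');
        if ($imgs->length > 0) {
            $p->setAttribute('class', 'image-content');
            foreach ($imgs as $img) {
                $img->removeAttribute('class');
            }
        }
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize each child of <body> separately to drop the wrapper tags.
    $result = '';
    foreach ($body->childNodes as $child) {
        $result .= $doc->saveHTML($child);
    }
    return $result;
}
```

Because the paragraphs stay siblings inside <body>, only the <p> that actually contains an <img> should receive the image-content class.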