aren't becomes arenâ€™t and various other silliness.
Here’s the code; it runs within WordPress to automate removing an element from several hundred posts.
function removeImageFromPages() {
    // Grab every page except a few excluded IDs
    $pages = get_pages(array('exclude' => '802,6,4'));
    foreach ($pages as $page) {
        if ($page->post_content == '') { continue; }
        $doc = new DOMDocument('1.0', 'UTF-8');
        $post_content = stripslashes($page->post_content);
        @$doc->loadHTML($post_content);
        $content = $doc->saveXML();
        echo($content); exit; // dump the first page's markup for inspection
    }
}
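For context, here is roughly where that loop is headed once the dump looks right. This is only a sketch: I'm assuming the target element is an img tag (the function name suggests as much) and that the cleaned markup would be written back with wp_update_post. Note the saveXML() call is still subject to the encoding trouble described below.
$imgs = $doc->getElementsByTagName('img');
$toRemove = array();
foreach ($imgs as $img) { $toRemove[] = $img; } // the node list is live, so collect first
foreach ($toRemove as $img) {
    $img->parentNode->removeChild($img);
}
wp_update_post(array(
    'ID' => $page->ID,
    'post_content' => $doc->saveXML(), // still mangles encoding, and includes the added wrapper
));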
Originally the post content I’m manipulating was stored in a custom CMS. The initial scrape was done with DOMDoc, without any encoding issues. However, there seems to be some kind of trouble the second time around. All headers on everything are set as UTF-8, but I’m not very experienced with encoding. The first time, it was a pure HTML scrape. Now, I’m dealing with values directly from the database. What am I missing? (And is DOMDoc even the right tool for this job?)
Update – I’m still having the problem, but have new information.
If I print/echo/var_dump the content directly from WordPress ($page->post_content), there is no issue. Once it goes through $doc->saveXML or $doc->saveHTML, the characters become confused. They don’t become predictably confused, though.
$doc->loadHTML($page->post_content);
echo($doc->saveXML());
Yields arenâ€™t. However,
$doc->loadHTML($page->post_content);
$ps = $doc->getElementsByTagName('p');
echo($ps->item(3)->nodeValue);
echo($doc->saveXML($ps->item(3)));
Yields arenât (in both echoes).
Also, if I copy/paste a string from the document directly into the function, it works perfectly. It’s only when dealing with values passed from WordPress.
Going through the comments on the PHP documentation page for DOMDocument::loadHTML, it appears that loadHTML does not respect the encoding you might have set on the DOMDocument. Instead, it reads the encoding from the meta tag in the HTML. With the original scraping, I presume you were dealing with complete pages, including meta tags.
The post_content of a WordPress page, however, as far as I know, is only a document fragment, not a complete HTML page (or did you change that?). So now loadHTML can’t figure out the encoding from the content, defaults to ISO 8859-1, and screws everything up. Not to mention it adds a doctype and html and body tags etc. around the fragment.
I’m not entirely sure if DOMDocument is the right tool here, but I’m not sure what the alternatives are in your case (apart from regular expressions, obviously).
What you can probably do, though, is wrap a simple HTML structure around the post content, including a meta tag to make sure it’s UTF-8, before you pass it to loadHTML(), and then use XPath to save just the body of it.
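Something like this, as a rough sketch (parsePostContent is a made-up name; adapt it to your loop above):
function parsePostContent($post_content) {
    $doc = new DOMDocument('1.0', 'UTF-8');
    // Declare the charset up front so loadHTML() has a meta tag to read.
    $wrapped = '<html><head>'
             . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
             . '</head><body>' . $post_content . '</body></html>';
    @$doc->loadHTML($wrapped);
    // ... remove the unwanted element here ...
    // Serialize only the children of body, discarding the wrapper added above.
    $xpath = new DOMXPath($doc);
    $output = '';
    foreach ($xpath->query('//body/node()') as $node) {
        $output .= $doc->saveXML($node);
    }
    return $output;
}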