DOMDocument changing characters

aren't becomes arenât, and various other silliness.

Here’s the code; it runs inside WordPress to automate removing an element from several hundred posts.

function removeImageFromPages() {
    $pages = get_pages(array('exclude' => '802,6,4'));
    foreach ($pages as $page) {
        if ($page->post_content == '') { continue; }
        $doc = new DOMDocument('1.0', 'UTF-8');
        $post_content = stripslashes($page->post_content);
        @$doc->loadHTML($post_content);
        $content = $doc->saveXML();
        echo $content; exit; // debugging: dump the first non-empty page and stop
    }
}

The post content I’m manipulating was originally stored in a custom CMS. The initial scrape was done with DOMDoc without any encoding issues, but there seems to be some kind of trouble the second time around. All headers everywhere are set to UTF-8, though I’m not very experienced with encoding. The first time it was a pure HTML scrape; now I’m dealing with values straight from the database. What am I missing? (And is DOMDoc even the right tool for this job?)
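
One way to narrow it down (a diagnostic sketch only; the search string is just whatever garbled word you know occurs in the content) is to dump the raw bytes WordPress hands back and see whether the apostrophe is genuinely stored as UTF-8:

$raw = $page->post_content;
$pos = strpos($raw, 'aren'); // assumes this substring actually occurs
// A genuine UTF-8 right single quote is the byte sequence E2 80 99:
echo bin2hex(substr($raw, $pos, 8)); // look for e28099 in the output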

Update – I’m still having the problem, but have new information.

If I print/echo/var_dump the content directly from WordPress ($page->post_content), there is no issue. Once it goes through $doc->saveXML or $doc->saveHTML, the characters become confused. They don’t become predictably confused, though.

$doc->loadHTML($page->post_content);
echo($doc->saveXML());

Yields aren’t. However,

$doc->loadHTML($page->post_content);
$ps = $doc->getElementsByTagName('p');
echo($ps->item(3)->nodeValue);
echo($doc->saveXML($ps->item(3)));

Yields arenât (in both echoes).
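
That stray â looks like the classic signature of UTF-8 bytes being read as ISO-8859-1: the curly apostrophe ’ is the three bytes E2 80 99 in UTF-8, and E2 on its own is â in Latin-1. The garbling can be reproduced without DOMDocument at all (this just simulates the misread; it isn’t what my code does):

$utf8 = "aren’t"; // the ’ is the byte sequence E2 80 99 in UTF-8
// Pretend those bytes are ISO-8859-1 and re-encode them as UTF-8:
echo mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
// Prints arenât (the 80 and 99 bytes become invisible control characters)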

Also, if I copy/paste a string from the document directly into the function, it works perfectly. The problem only appears with values passed from WordPress.

1 comment

  1. Going through the comments on the PHP documentation page for DOMDocument::loadHTML, it appears that loadHTML does not respect the encoding you might have set on the DOMDocument.

    Instead, it will read it from the meta tag in the HTML. With the original scraping, I presume you were dealing with complete pages including meta tags.

    The post_content of a WordPress page, however, is as far as I know only a document fragment, not a complete HTML page (or did you change that?). So now it can’t figure out the encoding from the content, defaults to ISO-8859-1, and screws everything up. Not to mention it adds a doctype, html and body tags, etc. around the fragment.

    I’m not entirely sure if DOMDocument is the right tool here, but I’m not sure what the alternatives are in your case (apart from regular expressions, obviously).

    What you can probably do, though, is wrap a simple HTML structure around the post content, including a meta tag to make sure it’s read as UTF-8, before you pass it to loadHTML(), and then use XPath to save just the body of it.
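
    Something along these lines, perhaps (untested, and the helper name is just illustrative):

    function parse_fragment_utf8($fragment) {
        // The meta tag tells libxml the real encoding before it reads the content.
        $html = '<html><head>'
              . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
              . '</head><body>' . $fragment . '</body></html>';
        $doc = new DOMDocument('1.0', 'UTF-8');
        @$doc->loadHTML($html);
        return $doc;
    }

    $doc   = parse_fragment_utf8($page->post_content);
    $xpath = new DOMXPath($doc);
    $out   = '';
    // Serialize only the children of <body>, discarding the wrapper again:
    foreach ($xpath->query('//body/node()') as $node) {
        $out .= $doc->saveXML($node);
    }
    echo $out;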