Need PHP to encode special characters but not HTML tags, for inclusion in a WordPress extended RSS file

I’ve written a script that exports all the users, blogs and replies from an existing (non-WordPress) site to a WordPress extended RSS file, so they can be imported easily into a new WordPress installation as part of a migration. This works well until it reaches a particular blog post with a special punctuation mark in a French or French Canadian phrase.

XML Parsing Error: not well-formed
Location: http://example.com/wordpress_xml/export-to-wp.php
Line Number 2000, Column 270:* ... <i>l'art du d\uffffplacement</i> ... 

I’ve cropped the full error above. Instead of \uffff a character similar to a comma is shown. In the PHP code I have the HTML of the blog post in a string. I need to encode this type of character without encoding any of the HTML tags, and after a lot of searching I have so far drawn a blank. Has anyone already done something like this?

2 comments

  1. For Latin-1 you can escape characters easily with:

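    // The /e modifier was removed in PHP 7, so a callback does the replacement instead.
    $html = preg_replace_callback('/[\x80-\xFF]/', function ($m) { return '&#x' . dechex(ord($m[0])) . ';'; }, $html);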
    

    For UTF-8 it’s a bit more involved:

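    // Match every non-ASCII character. With /u, \w also matches Unicode letters,
    // so a pattern like (?!\w)\p{L} would never match anything.
    $html = preg_replace_callback('/[^\x00-\x7F]/u', "xmlent", $html);
    function xmlent($m) {
        // UCS-4BE (rather than UCS-2BE) also covers characters outside the BMP.
        $str = mb_convert_encoding($m[0], "UCS-4BE", "UTF-8");
        return "&#x" . ltrim(bin2hex($str), "0") . ";";
    }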
    
  2. After discovering that the problem was caused by accented characters, I found the following functions posted on php.net. They worked for my case: the export file I generated imported cleanly into a WordPress blog.

    function xmlentities($string) {
        // Function from: http://php.net/manual/en/function.htmlentities.php
        // Posted by: snevi at im dot com dot ve 22-Jul-2008 01:10
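        // Works byte by byte (Windows-1252 input). The /e modifier was removed in PHP 7, so use a callback.
        $string = preg_replace_callback('/[^\x09\x0A\x0D\x20-\x7F]/', function ($m) { return _privateXMLEntities($m[0]); }, $string);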
        return $string;
    }
    
    function _privateXMLEntities($num) {
        // Function from: http://php.net/manual/en/function.htmlentities.php
        // Posted by: snevi at im dot com dot ve 22-Jul-2008 01:10
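        // Windows-1252 bytes 128-159 mapped to the numeric entities of their Unicode code points.
        $chars = array(
        128 => '&#8364;',
        130 => '&#8218;',
        131 => '&#402;',
        132 => '&#8222;',
        133 => '&#8230;',
        134 => '&#8224;',
        135 => '&#8225;',
        136 => '&#710;',
        137 => '&#8240;',
        138 => '&#352;',
        139 => '&#8249;',
        140 => '&#338;',
        142 => '&#381;',
        145 => '&#8216;',
        146 => '&#8217;',
        147 => '&#8220;',
        148 => '&#8221;',
        149 => '&#8226;',
        150 => '&#8211;',
        151 => '&#8212;',
        152 => '&#732;',
        153 => '&#8482;',
        154 => '&#353;',
        155 => '&#8250;',
        156 => '&#339;',
        158 => '&#382;',
        159 => '&#376;');
        $num = ord($num);
        // Bytes 129, 141, 143, 144 and 157 have no Windows-1252 mapping, so fall back to the raw number.
        return isset($chars[$num]) ? $chars[$num] : "&#".$num.";";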
    }
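
The answers above work on single-byte input or rely on the mbstring extension. As a rough sketch of the same idea for UTF-8 input without mbstring, the following replaces every non-ASCII character with a decimal character reference and leaves the ASCII markup (the HTML tags) alone. The function names `utf8CodePoint` and `encodeNonAscii` are made up for this example, and it assumes PHP 7 or later and that the blog HTML is already valid UTF-8.

```php
<?php
// Hypothetical helpers for illustration; not taken from either answer above.

/**
 * Decode one UTF-8 encoded character into its Unicode code point.
 */
function utf8CodePoint(string $char): int
{
    $bytes = array_values(unpack('C*', $char));
    $count = count($bytes);
    if ($count === 1) {
        return $bytes[0]; // plain ASCII
    }
    // The lead byte of an n-byte sequence carries (7 - n) payload bits.
    $codePoint = $bytes[0] & (0xFF >> ($count + 1));
    for ($i = 1; $i < $count; $i++) {
        // Each continuation byte contributes its low six bits.
        $codePoint = ($codePoint << 6) | ($bytes[$i] & 0x3F);
    }
    return $codePoint;
}

/**
 * Replace every non-ASCII character in a UTF-8 string with a numeric
 * character reference, leaving ASCII (including HTML tags) untouched.
 */
function encodeNonAscii(string $html): string
{
    $result = preg_replace_callback(
        '/[^\x00-\x7F]/u',
        function (array $m): string {
            return '&#' . utf8CodePoint($m[0]) . ';';
        },
        $html
    );
    // With the /u modifier, preg_replace_callback() returns null on invalid UTF-8.
    if ($result === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $result;
}

echo encodeNonAscii("<i>l'art du d\u{E9}placement</i>"), "\n";
// prints: <i>l'art du d&#233;placement</i>
```

If the mbstring extension is available, `mb_ord()` (PHP 7.2+) could stand in for the hand-rolled decoder. Input that is actually Latin-1 or Windows-1252 would need converting to UTF-8 first, otherwise the function throws.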