I’ve written a script that exports all the users, blogs and replies from an existing (non-WordPress) site to a WordPress eXtended RSS (WXR) file, so they can be imported into a new WordPress installation as part of a migration. This works well until it reaches a particular blog post that contains a special punctuation mark in a French or French Canadian phrase.
XML Parsing Error: not well-formed
Location: http://example.com/wordpress_xml/export-to-wp.php
Line Number 2000, Column 270:* ... <i>l'art du duffffplacement</i> ...
I’ve cropped the full error above. Instead of uffff, a character similar to a comma is shown. In the PHP code I have the HTML of the blog post in a string. I need to encode this type of character without encoding any of the HTML tags, and after a lot of searching I have so far drawn a blank. Has anyone already done something like this?
For Latin-1 you can escape characters easily with:
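A minimal sketch of that idea, assuming the HTML really is ISO-8859-1 encoded. The function name `escape_latin1_non_ascii` is hypothetical, not something from WordPress or PHP itself. Every byte from 0x80 to 0xFF becomes a numeric character reference, and since tags are plain ASCII they pass through untouched:

```php
<?php
// Hypothetical helper; assumes $html is ISO-8859-1 (Latin-1) encoded.
function escape_latin1_non_ascii(string $html): string
{
    // In Latin-1 every byte 0x80-0xFF is exactly one character, and its
    // Unicode code point equals the byte value, so ord() is enough.
    return preg_replace_callback(
        '/[\x80-\xFF]/',
        function (array $m): string {
            return '&#' . ord($m[0]) . ';';
        },
        $html
    );
}

// Example: the "é" in "déplacement" is the single byte 0xE9 in Latin-1.
echo escape_latin1_non_ascii("<i>l'art du d\xE9placement</i>"), "\n";
```

If the source text is actually Windows-1252, bytes 0x80 to 0x9F (curly quotes and similar) would come out as control-character references, so it is probably worth checking the real encoding first.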
For UTF-8 it’s a bit more involved:
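One way this could look, as a sketch that assumes the string is valid UTF-8. The function name `escape_utf8_non_ascii` is hypothetical. Each non-ASCII character is a sequence of 2 to 4 bytes, so the code point has to be rebuilt from the lead byte and the continuation bytes before it can be written as a numeric character reference:

```php
<?php
// Hypothetical helper; assumes $html is valid UTF-8.
function escape_utf8_non_ascii(string $html): string
{
    $out = preg_replace_callback(
        '/[^\x00-\x7F]/u', // with /u, one match is one whole non-ASCII character
        function (array $m): string {
            $bytes = array_values(unpack('C*', $m[0]));
            $n = count($bytes);
            // Lead byte: mask off the length marker bits (110, 1110 or 11110).
            $cp = $bytes[0] & (0xFF >> ($n + 1));
            // Each continuation byte contributes its low 6 bits.
            for ($i = 1; $i < $n; $i++) {
                $cp = ($cp << 6) | ($bytes[$i] & 0x3F);
            }
            return '&#' . $cp . ';';
        },
        $html
    );
    if ($out === null) {
        // preg_replace_callback() returns null when the input is not valid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $out;
}

// Example: "é" is the two bytes 0xC3 0xA9 in UTF-8.
echo escape_utf8_non_ascii("<i>l'art du d\xC3\xA9placement</i>"), "\n";
```

If the mbstring extension is available, `mb_encode_numericentity()` may be a shorter route to the same result.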
After discovering that the problem was caused by accented characters, I found the following functions posted on php.net. They worked for my case, and the export file I generated imported cleanly into a WordPress blog.