How does WordPress support Unicode?

I’ve been looking through WordPress’s code, and it mostly uses PHP’s regular string functions, like strlen, strpos, etc. Yet I know WordPress supports UTF-8, so how does it do that?

Does it overload the regular string functions with multibyte string functions?

If so, is that a good idea in practice? If not, then how do they do it?

2 comments

  1. WordPress is written in PHP. As far as I know, PHP strings have no notion of character encoding (UTF-8, UTF-16, …). A string is just a sequence of bytes, and functions like strlen and strpos work on bytes rather than characters.
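
    To make the byte-versus-character point concrete, here is a minimal sketch. It assumes PHP 7 or later (for the `\u{...}` escape) and that the optional mbstring extension is installed; the variable name `$word` is just for illustration and has nothing to do with WordPress itself.

    ```php
    <?php
    // "é" is a single character, but UTF-8 encodes it as two bytes (0xC3 0xA9).
    $word = "caf\u{00E9}"; // the string "café"

    // strlen() counts bytes, so the accented letter is counted twice.
    echo strlen($word), "\n"; // 5

    // mb_strlen() counts characters when told the encoding (assumes mbstring is loaded).
    if (function_exists('mb_strlen')) {
        echo mb_strlen($word, 'UTF-8'), "\n"; // 4
    }

    // The raw bytes PHP is actually holding.
    echo bin2hex($word), "\n"; // 636166c3a9
    ```

    As long as the code only stores, concatenates, and echoes such strings, the byte count never matters, which is why plain string functions mostly get away with it.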

    The actual encoding and decoding is done by your browser. When you write a post, your browser sends the text you just entered as UTF-8 to the server. WordPress just stores it in the database.

    The encoding is specified in WordPress’ HTML code:

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    

    This instructs the browser to use “UTF-8” for all text on the page. This includes the actual text on the page as well as all input fields.

    So, WordPress doesn’t handle UTF-8 itself. It lets the browser handle it. (This also means that if you specified different encodings for the backend and the frontend pages, you’d get garbage text on the frontend.)
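
    The garbage-text effect can be reproduced in a few lines. This is only a sketch under assumptions: it needs the mbstring extension, the variable names are made up, and `mb_convert_encoding` is used here merely to simulate a browser that decodes the stored bytes as ISO-8859-1 instead of UTF-8.

    ```php
    <?php
    // Bytes a browser would submit for "é" from a page declared as UTF-8.
    $submitted = "\xC3\xA9";

    // PHP passes the bytes through untouched; nothing gets re-encoded on the way.
    $stored = $submitted;
    var_dump($stored === $submitted); // bool(true)

    // Simulate a page that declares ISO-8859-1: every byte becomes its own character.
    $misread = mb_convert_encoding($stored, 'UTF-8', 'ISO-8859-1');

    echo mb_strlen($stored, 'UTF-8'), "\n";  // 1 character when read as UTF-8
    echo mb_strlen($misread, 'UTF-8'), "\n"; // 2 characters when misread ("Ã©")
    ```

    The stored bytes are identical in both cases. Only the declared charset decides whether the reader sees one letter or two junk characters.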

    As a note: Unlike PHP, MySQL is UTF-8 aware. So, for example, a search for non-ASCII characters yields the correct result because the search is handled by MySQL rather than WordPress.

  2. No, WordPress doesn’t support Unicode natively. But it’s an easy fix.

    Open up wp-config.php in your WordPress root directory.

    Find these lines:

    /** Database Charset to use in creating database tables. */
    define('DB_CHARSET', 'utf8');
    
    /** The Database Collate type. Don't change this if in doubt. */
    define('DB_COLLATE', '');
    

    Just comment ’em out:

    /** Database Charset to use in creating database tables. */
    //define('DB_CHARSET', 'utf8');
    
    /** The Database Collate type. Don't change this if in doubt. */
    //define('DB_COLLATE', '');
    

    Now it won’t display Unicode characters as a ? … so you’re all set.