Comparing Unicode Characters in PHP

I am unable to compare two Unicode characters that, to my mind, should be exactly the same. I suspect they are somehow encoded differently, but I don’t know how to convert them to the same encoding.

The characters I want to compare are from the Myanmar Unicode block. I’m running WordPress on PHP 5 and am trying to write a custom plugin to handle Myanmar Unicode. All my files are encoded in UTF-8, but I don’t know what encoding WordPress uses internally.

Here is what I’m doing:

function myFunction( $inputText ) {
    $outputText = '';
    $inputTextArray = str_split($inputText);
    foreach($inputTextArray as $char) {
        if ($char == "က") // U+1000, a character from the Myanmar Unicode block 
            $outputText .= $char;
    }
    return $outputText;
}
add_filter( 'the_content', 'myFunction');

At this stage, the function is only supposed to return က wherever it appears in the content. However, it never returns anything but an empty string, even when က is clearly present in the post content. If I change the character to any Latin character, the function works as expected.

So, my question is: how do I encode these characters (either $char or "က") so that they compare equal when $char contains this character?

1 comment

  1. str_split is not Unicode-aware: it splits the string into single bytes, so each multibyte character is broken into pieces, none of which equals the whole character. Use either the multibyte string functions or preg_split with the /u modifier:

    $inputTextArray = preg_split("//u", $inputText, -1, PREG_SPLIT_NO_EMPTY);
    

    http://codepad.viper-7.com/ErFwcy
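
    To see why the original comparison never succeeds, here is a minimal sketch comparing the two splitting approaches. It assumes the source is UTF-8 and writes U+1000 as explicit bytes so the snippet also works on PHP 5; the variable names are only illustrative.

    ```php
    <?php
    // U+1000 (က) is encoded in UTF-8 as the three bytes E1 80 80.
    $char = "\xE1\x80\x80";

    // str_split works on bytes, so the single character becomes three
    // one-byte strings, none of which equals the full character.
    $bytes = str_split($char);

    // preg_split with the /u modifier treats the subject as UTF-8 and
    // splits between code points, so the character stays whole.
    $chars = preg_split('//u', $char, -1, PREG_SPLIT_NO_EMPTY);

    var_dump(count($bytes));        // int(3)
    var_dump(count($chars));        // int(1)
    var_dump($chars[0] === $char);  // bool(true)
    ```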

    Using the multibyte function mb_substr_count, you can shorten your code too, like this:

    function myFunction( $inputText ) {
        return str_repeat("က", mb_substr_count($inputText, "က"));
    }
    

    Or, using a regular expression:

    preg_match_all("/က/u", $inputText, $match);
    $output = implode("", $match[0]);
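
    The regular-expression variant could be wrapped into a complete function along these lines. This is only a sketch: the name keep_only_char and the $target parameter are hypothetical, not part of the original code, and it assumes the content is UTF-8.

    ```php
    <?php
    // Hypothetical helper: keeps every occurrence of $target in $inputText
    // and drops everything else.
    function keep_only_char($inputText, $target)
    {
        // preg_quote guards against $target being a regex metacharacter.
        $pattern = '/' . preg_quote($target, '/') . '/u';

        // With the /u modifier, preg_match_all returns false when the
        // subject is not valid UTF-8, so treat that case as "no matches".
        if (preg_match_all($pattern, $inputText, $matches) === false) {
            return '';
        }

        return implode('', $matches[0]);
    }

    $ka = "\xE1\x80\x80"; // U+1000 (က) as UTF-8 bytes
    $result = keep_only_char("a" . $ka . "b" . $ka, $ka); // two copies of က
    ```

    In a plugin, a small wrapper that fixes $target could then be registered with add_filter( 'the_content', ... ) in the same way as the original myFunction.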