Is the sanitize_title_with_dashes formatting function too liberal in the characters it accepts?

sanitize_title_with_dashes (see code below for reference) is the function WordPress uses to format “pretty” URLs. However, contrary to the function’s comment header, it lets through much more than alphanumeric characters, underscores (_) and dashes (-). Signs like ° also survive: utf8_uri_encode() turns them into percent-encoded octets, and the final character filter keeps both % and the hex digits.

How would I go about really allowing only alphanumeric characters and dashes?

/**
 * Sanitizes title, replacing whitespace with dashes.
 *
 * Limits the output to alphanumeric characters, underscore (_) and dash (-).
 * Whitespace becomes a dash.
 *
 * @since 1.2.0
 *
 * @param string $title The title to be sanitized.
 * @return string The sanitized title.
 */
function sanitize_title_with_dashes($title) {
    $title = strip_tags($title);
    // Preserve escaped octets.
    $title = preg_replace('|%([a-fA-F0-9][a-fA-F0-9])|', '---$1---', $title);
    // Remove percent signs that are not part of an octet.
    $title = str_replace('%', '', $title);
    // Restore octets.
    $title = preg_replace('|---([a-fA-F0-9][a-fA-F0-9])---|', '%$1', $title);

    $title = remove_accents($title);
    if (seems_utf8($title)) {
        if (function_exists('mb_strtolower')) {
            $title = mb_strtolower($title, 'UTF-8');
        }
        $title = utf8_uri_encode($title, 200);
    }

    $title = strtolower($title);
    $title = preg_replace('/&.+?;/', '', $title); // kill entities
    $title = str_replace('.', '-', $title);
    $title = preg_replace('/[^%a-z0-9 _-]/', '', $title);
    $title = preg_replace('/\s+/', '-', $title);
    $title = preg_replace('|-+|', '-', $title);
    $title = trim($title, '-');

    return $title;
}
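
For comparison, here is a minimal sketch of a stricter variant, assuming plain PHP with no WordPress functions loaded. `strict_slug` is a hypothetical helper name, not a WordPress API. It skips the octet-preserving and URI-encoding steps entirely, so anything outside a-z and 0-9 (including underscores and non-ASCII bytes) ends up as a dash.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress): reduce a title to
 * lowercase ASCII letters, digits and dashes only.
 */
function strict_slug(string $title): string {
    $title = strip_tags($title);
    // Drop HTML entities such as &amp; or &#176; instead of keeping their names.
    $title = preg_replace('/&[a-z0-9#]+;/i', '', $title);
    $title = strtolower($title);
    // Every run of characters outside a-z and 0-9 collapses into a single dash.
    // Without the /u modifier the pattern works on bytes, so multi-byte
    // UTF-8 characters are replaced as well.
    $title = preg_replace('/[^a-z0-9]+/', '-', $title);
    // Remove leading and trailing dashes.
    return trim($title, '-');
}

echo strict_slug('Hello, World!'), "\n";   // hello-world
echo strict_slug('25°C outside'), "\n";    // 25-c-outside
```

Accented letters are dropped rather than transliterated here, so inside WordPress you would probably want to run remove_accents() on the title first. Hooking something like this into the `sanitize_title` filter is possible, but the right priority depends on your setup and is left as an assumption.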

1 comment

  1. Consider this function as a rough placeholder. It has more flaws than you might imagine … 🙂
    There are many plugins to improve the conversion for different languages and needs. You may take a look at my plugin Germanix to see how this could be done.