Regular expression to replace words in web pages

I am looking for a regular expression (PHP) to find and replace some words in a web page. However, it must not replace words inside every HTML tag: only words in italic <i>, bold <b>, and plain text should be replaced.

Example:

word: “hello” (case insensitive)

<a href="#">Hello</a> im a writer that i like to say hello everyday. <b>Hello</b> Spiderman.

Expected replacements: the word inside the anchor must not be replaced; only the plain-text hello and <b>Hello</b> may be replaced.

I tested some regular expressions, but none of them works properly:

1) from SMART SEO LINKS (WP plugin)

$reg = '/(?!(?:[^<\[]+[>\]]|[^>\]]+<\/a>))\b($word)\b/Imsu';

It doesn’t work well: sometimes it deletes the content and puts the symbol “>” there.
I made some modifications to this regexp, removing “?!” or “?:” (I don’t know what they mean), but then it stopped working.

2) Others I’ve tried:

$reg = "/<([\w]+)[^>]*>\b('.$word.')\b<\/\1>/Imsu";
$reg = '/<+\s*\/\s\b('.$word.')\b[^>]\/\s>+/I';

These do not replace anything.

$reg = '/<(\w+)[^>]*>\b('.$name.')\b<\/\1>/Imsu';

This one sometimes works.

The truth is that I’m not a regexp expert. I have spent a few days testing and trying to create a new regexp, but I am not getting the results I need.

The replacement will be used in a WP plugin, where it sometimes affects the template or other plugins, or where the DOM isn’t well formed.

Does anyone have any idea why these don’t work correctly? Thanks.

1 comment

  1. Try a combination of these patterns

    $reg = '/(?:<(\w+)[^>]*>)?\bhello\b(?!<\/a>)(<\/\1>)?/i';
    $reg0 = '/<\w[^>]*\bhello\b[^>]*>/Ui';
    

    Example

    $word = preg_quote('hello','/'); // to avoid PCRE injection
    $str = '<a href="hello.php">Hello</a> I say hello everyday. <b>Hello</b> Spiderman.';
    $reg = '/(?:<(\w+)[^>]*>)?\b'.$word.'\b(?!<\/a>)(<\/\1>)?/i';
    $reg0 = '/<\w[^>]*\b'.$word.'\b[^>]*>/Ui';
    
    // hide the word inside a matched tag behind the placeholder !X!
    function handler($m) { return str_replace($GLOBALS["word"],'!X!',$m[0]); }
    
    $str = preg_replace_callback($reg0,'handler',$str); // replace "hello" inside tags with !X!
    $str = preg_replace($reg,'[deleted]',$str); // delete "hello" elsewhere
    $str = str_replace('!X!',$word,$str); // put "hello" back inside the tags
    print_r($str);
    

    Result

    <a href="hello.php">Hello</a> I say [deleted] everyday. [deleted] Spiderman.
    

    Notes to your question

    Explanation

    See the PCRE documentation on assertions: the negative lookbehind (?<! cannot be used to match <a href="#">, because that tag does not have a fixed length, so the pattern fails to compile. Therefore I used the negative lookahead (?! to check for </a> after hello. The optional groups at the beginning and end of the pattern take in any surrounding HTML tag, so each match is replaced together with its tags, except where </a> follows the word.
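
    To make the difference concrete, here is a minimal sketch (the sample string is only an assumption modelled on the question). It shows a lookbehind with a variable-length body failing to compile, and the lookahead form working:

    ```php
    <?php
    // Hypothetical sample input, modelled on the question's example.
    $str = '<a href="#">Hello</a> I say hello everyday.';

    // A lookbehind containing [^>]* has no fixed length, so PCRE refuses to
    // compile it: preg_match() returns false (the warning is silenced with @).
    $lookbehind = @preg_match('/(?<!<a[^>]*>)\bhello\b/i', $str);
    var_dump($lookbehind); // bool(false)

    // The lookahead version compiles and skips any "hello" directly followed by </a>.
    $out = preg_replace('/\bhello\b(?!<\/a>)/i', '[deleted]', $str);
    echo $out, "\n"; // <a href="#">Hello</a> I say [deleted] everyday.
    ```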

    The trick to avoid replacing hello inside tags is to first swap those occurrences for some unique string (say !X!), then do the actual replacement, and finally swap !X! back to hello. It may not be the best solution, but it works.
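
    The three steps above could also be packaged as a function. This is only a sketch: replaceOutsideTags, its parameters, and the !X! placeholder are hypothetical names, and it assumes the placeholder never occurs in the page. Unlike the example above, it leaves wrapping tags such as <b> in place and restores the hidden words with their original letter case:

    ```php
    <?php
    // Hypothetical helper; name, signature and placeholder are illustrative assumptions.
    function replaceOutsideTags(string $html, string $word, string $replacement): string
    {
        $quoted = preg_quote($word, '/');
        $placeholder = '!X!'; // assumed never to occur in $html

        // Step 1: hide the word wherever it appears inside a tag (attributes etc.),
        // remembering the exact spelling so it can be restored unchanged.
        $hidden = [];
        $html = preg_replace_callback(
            '/<[^>]+>/',
            function (array $m) use ($quoted, $placeholder, &$hidden): string {
                return preg_replace_callback(
                    '/\b' . $quoted . '\b/i',
                    function (array $w) use ($placeholder, &$hidden): string {
                        $hidden[] = $w[0];
                        return $placeholder;
                    },
                    $m[0]
                );
            },
            $html
        );

        // Step 2: replace the word in text, except directly before </a>.
        $html = preg_replace('/\b' . $quoted . '\b(?!<\/a>)/i', $replacement, $html);

        // Step 3: put the hidden occurrences back in their original order.
        return preg_replace_callback(
            '/' . preg_quote($placeholder, '/') . '/',
            function () use (&$hidden): string {
                return array_shift($hidden);
            },
            $html
        );
    }

    $in = '<a href="hello.php">Hello</a> I say hello everyday. <b>Hello</b> Spiderman.';
    $out = replaceOutsideTags($in, 'hello', '[deleted]');
    echo $out, "\n";
    ```

    Like the original pattern, this only protects a word that sits directly before </a>; anchors with longer link text would need a different approach.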

    Why your regexps didn’t work

    You used the I modifier (at the end of your pattern). Modifiers are case-sensitive: i means case-insensitive evaluation, while I is not a valid modifier; see the list of pattern modifiers in the PHP manual. I believe the \b (word boundary) in your patterns is redundant.
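
    A quick way to check the modifier behaviour (a small sketch; the sample strings are made up for illustration):

    ```php
    <?php
    // 'I' is not a PCRE modifier in PHP, so the pattern is rejected:
    // preg_match() returns false (warning silenced with @ for the demo).
    $bad = @preg_match('/hello/I', 'Hello');
    var_dump($bad);  // bool(false)

    // Lowercase 'i' enables case-insensitive matching.
    $good = preg_match('/hello/i', 'Hello');
    var_dump($good); // int(1)

    // The same applies to U and u: uppercase U makes quantifiers ungreedy,
    // lowercase u treats pattern and subject as UTF-8.
    preg_match('/<.+>/', '<b>x</b>', $greedy);
    preg_match('/<.+>/U', '<b>x</b>', $lazy);
    echo $greedy[0], "\n"; // <b>x</b>
    echo $lazy[0], "\n";   // <b>
    ```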