Regex PHP – Find a substring inside a <div></div> tag

First of all: I know I should not use regex to parse HTML. I’ve read it a zillion times already. But the tool I have for the job is regex-based, so I cannot use an HTML parser or anything else. Anyway, I appreciate the concern, but if what I need can be done with regex, great. If not, sorry, we’ll have to drop this feature.

The question is:

Short explanation: I need a regular expression that returns a substring contained inside specific tags in a PHP-generated webpage (WordPress, for what it’s worth).

Long explanation: I need to find every instance of a game’s name (in this example, Batman: Arkham City) located inside the various <div class="post-bodycopy clearfix"> elements on my page. That is, I only want the game’s name where it appears in the post body, not in the post title, the sidebar, or anywhere else. Then I’ll replace the name with a link using preg_replace or something similar.

I’ve searched the web for a similar question, but I could only find “gimme everything that is inside these tags” questions.

Here is a typical post from within my generated code:

<div class="post-268445 post hentry category-world-community-gamer tag-games tag-geral tag-lancamentos tag-noticias tag-pc tag-ps3 tag-xb360" id="post-268445">
<div class="post-kicker"><?php get_cat_icon(); ?><a href="http://www.gameblogs.com.br/category/world-community-gamer/" title="World Community Gamer" onclick="return TrackClick('http://www.gameblogs.com.br/category/world-community-gamer/','')"><img src="http://www.gameblogs.com.br/wp-content/uploads/world-community-gamer.png" width="48" height="48" alt="" title="World Community Gamer" /></a></div>
<div class="post-headline">     <h2>    <a href="http://www.worldcommunitygamer.com/2011/10/data-para-batman-arkham-city-no-pc.html?utm_source=gameblogs&utm_campaign=data-para-batman-arkham-city-no-pc" rel="bookmark" title="Permanent Link to Data para Batman: Arkham City no PC" target="_blank" onclick="return TrackClick('http://www.worldcommunitygamer.com/2011/10/data-para-batman-arkham-city-no-pc.html?utm_source=gameblogs&utm_campaign=data-para-batman-arkham-city-no-pc','')">Data para Batman: Arkham City no PC</a></h2>   </div>
<div class="post-byline"><img src="http://www.gameblogs.com.br/wp-content/themes/atahualpa353/images/icons/user.gif" alt="" /> <a href="http://www.gameblogs.com.br/author/_otaviofqueiroz/" title="Posts de @_otaviofqueiroz" onclick="return TrackClick('http://www.gameblogs.com.br/author/_otaviofqueiroz/','')">@_otaviofqueiroz</a>, do <img src="http://www.gameblogs.com.br/wp-content/themes/atahualpa353/images/icons/home.gif" alt="" /> <a href="http://www.worldcommunitygamer.com/" target="_blank" target="_blank" onclick="return TrackClick('http://www.worldcommunitygamer.com/','')">WCG | World Community Gamer: Jogos, Análises e Tecnologia</a>, <img src="http://www.gameblogs.com.br/wp-content/themes/atahualpa353/images/icons/calendar_month.png" alt="" /> 18/10/11 | Compartilhe: <a href="http://twitter.com/share" class="twitter-share-button" data-url="http://www.worldcommunitygamer.com/2011/10/data-para-batman-arkham-city-no-pc.html?utm_source=gameblogs&utm_campaign=data-para-batman-arkham-city-no-pc" data-text="WCG | World Community Gamer: Jogos, Análises e Tecnologia: Data para Batman: Arkham City no PC" data-count="horizontal" data-via="GameBlogsBR" data-lang="fr" target="_blank" onclick="return TrackClick('http://twitter.com/share','')">Tweet</a><script type="text/javascript" src="http://platform.twitter.com/widgets.js"></script></div><div class="post-bodycopy clearfix"><p> <a href="http://www.worldcommunitygamer.com/2011/10/data-para-batman-arkham-city-no-pc.html" imageanchor="1" style="margin-left: 1em; margin-right: 1em;" target="_blank" onclick="return TrackClick('http://www.worldcommunitygamer.com/2011/10/data-para-batman-arkham-city-no-pc.html','')"><img src="/wp-content/plugins/wordpress-image-resizer/thumb/phpThumb.php?fltr=usm&src=http://2.bp.blogspot.com/-9oKlgIND3qY/Tp3Aimju2nI/AAAAAAAABxA/Q585nqpdsRI/s1600/batman_arkham_city_screens16-620x348.jpg&w=200" align='left'></a>
<p>A Warner divulgou a data de lançamento para Batman: Arkham City no PC. O jogo que terá a sua versão para os consoles (PS3 e Xbox 360) lançada nessa sexta-feira, chegará as lojas na versão PC no dia 18 de Novembro. Apesar da demora [...]<br /><a href=http://www.worldcommunitygamer.com/2011/10/data-para-batman-arkham-city-no-pc.html?utm_source=gameblogs&utm_campaign=data-para-batman-arkham-city-no-pc>[continua no site original...]</a></p></div>
<div class="post-footer"><img src="http://www.gameblogs.com.br/wp-content/themes/atahualpa353/images/icons/tag.gif" alt="" /> <a href="http://www.gameblogs.com.br/tag/games/" rel="tag" onclick="return TrackClick('http://www.gameblogs.com.br/tag/games/','')">Games</a>, <a href="http://www.gameblogs.com.br/tag/geral/" rel="tag" onclick="return TrackClick('http://www.gameblogs.com.br/tag/geral/','')">Geral</a>, <a href="http://www.gameblogs.com.br/tag/lancamentos/" rel="tag" onclick="return TrackClick('http://www.gameblogs.com.br/tag/lancamentos/','')">lançamentos</a>, <a href="http://www.gameblogs.com.br/tag/noticias/" rel="tag" onclick="return TrackClick('http://www.gameblogs.com.br/tag/noticias/','')">Notícias</a>, <a href="http://www.gameblogs.com.br/tag/pc/" rel="tag" onclick="return TrackClick('http://www.gameblogs.com.br/tag/pc/','')">PC</a>, <a href="http://www.gameblogs.com.br/tag/ps3/" rel="tag" onclick="return TrackClick('http://www.gameblogs.com.br/tag/ps3/','')">PS3</a>, <a href="http://www.gameblogs.com.br/tag/xb360/" rel="tag" onclick="return TrackClick('http://www.gameblogs.com.br/tag/xb360/','')">XB360</a><br>Todos os posts do <a href="http://www.gameblogs.com.br/category/world-community-gamer/" onclick="return TrackClick('http://www.gameblogs.com.br/category/world-community-gamer/','')">World Community Gamer</a></div></div><!-- / Post -->

I’ve already tried the following for the find:

$<div class="post-bodycopy clearfix">(.+?)(Batman: Arkham City)(.+?)(?=<div class="post-footer">)$s

Meaning: find the div opening tag, followed by anything, followed by Batman: Arkham City, followed by anything, up to the opening div tag of the post footer, with the s modifier so that the match can span multiple lines.

And the following for the replace:

<div class="post-bodycopy clearfix">/1<a href="http://www.mylink">Batman: Arkham City</a>/3

For some reason the regex works at http://regexlib.com, returning all the expected parts, but not on my live website. It must be some minor issue.

However, I am sure that my solution is not the most elegant (or the least server-intensive) way to find such a substring, since I capture several parts just to change one of them.
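
One thing worth checking, assuming the find and replace strings are handed to preg_replace() exactly as written above: PHP writes backreferences in the replacement as $1 or \1, so /1 and /3 would be inserted as literal text. The sketch below uses a shortened, made-up HTML fragment ($html) and # as the delimiter; both are illustrative choices, not part of the real setup.

```php
<?php
// Hypothetical, shortened stand-in for one generated post.
$html = '<div class="post-headline"><h2>Data para Batman: Arkham City no PC</h2></div>'
      . '<div class="post-bodycopy clearfix"><p>A Warner divulgou a data de' . "\n"
      . 'Batman: Arkham City no PC.</p></div>' . "\n"
      . '<div class="post-footer">Games</div>';

// Same idea as the pattern above; "#" is the delimiter, "s" lets "." match newlines.
$find = '#<div class="post-bodycopy clearfix">(.+?)(Batman: Arkham City)(.+?)(?=<div class="post-footer">)#s';

// Backreferences are written $1 / $3 (or \1 / \3), not /1 and /3.
$replace = '<div class="post-bodycopy clearfix">$1'
         . '<a href="http://www.mylink">Batman: Arkham City</a>$3';

$result = preg_replace($find, $replace, $html);
echo $result, "\n";
```

Only the first occurrence in each post body gets linked this way, because the rest of the body is consumed by the third group. The headline is left alone since the match can only start at the post-bodycopy div.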

Is there a cleverer way to achieve this? Please?

Thanks a lot!

3 comments

  1. I put together an example here with the following regex in PHP:

    '|(<div class="post-bodycopy clearfix">)(.*?)(Batman: Arkham City)(.*?)(</div>)|e'
    

    I added a Batman: Arkham City at the bottom of the html string, just to test. It seems to be working. Let me know.
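
    The e modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so on current PHP versions a pattern like this one would need preg_replace_callback() instead. A minimal sketch, assuming the goal is the link replacement from the question (the replacement is not shown above); the s modifier is added so that . can cross line breaks, and $html is a made-up fragment:

    ```php
    <?php
    // Hypothetical fragment standing in for the page source.
    $html = "<h2>Batman: Arkham City</h2>\n"
          . '<div class="post-bodycopy clearfix"><p>Data para' . "\n"
          . 'Batman: Arkham City no PC</p></div>';

    // Same five groups as the pattern above, with "s" in place of "e".
    $pattern = '|(<div class="post-bodycopy clearfix">)(.*?)(Batman: Arkham City)(.*?)(</div>)|s';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) {
            // $m[1]..$m[5] are the capture groups; only group 3 is wrapped in a link.
            return $m[1] . $m[2]
                 . '<a href="http://www.mylink">' . $m[3] . '</a>'
                 . $m[4] . $m[5];
        },
        $html
    );
    echo $result, "\n";
    ```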

  2. If you insist on using regex, and your <div class="post-bodycopy clearfix">...</div> elements will never contain any nested DIVs, here is a double-callback solution that should do a decent job:

    // Linkify title inside post-bodycopy DIV text.
    function p($text) {
        global $title, $link;
        // Set title to be found and linkify URL address.
        $title = 'Batman: Arkham City';
        $link = 'http://www.mylink';
        // Match non-nested "post-bodycopy" class DIV element.
        $re = '%<div class="post-bodycopy clearfix">(.+?)</div>%si';
        return preg_replace_callback($re, 'p_cb', $text);
    }
    function p_cb($matches) {
        // Match tag (in $1) and non-tag stuff (in $2).
        $re = '%
              ( </?\w+   # Either $1: An open or close tag.
                (?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s<>]+))?)*
                \s*/?>
              )
            | ( [^<]+ )  # Or $2: Non-tag stuff.
            %x';
        $matches[1] = preg_replace_callback($re, 'p_cb_cb', $matches[1]);
        return '<div class="post-bodycopy clearfix">'. $matches[1] .'</div>';
    }
    function p_cb_cb($matches) {
        global $title, $link;
        # Return open and close tags unchanged.
        if (isset($matches[1]) && $matches[1]) return $matches[1];
        # Process non-tag text, converting text to link.
        $matches[2] = str_replace(
            $title,
            '<a href="'. $link .'">'. $title .'</a>',
            $matches[2]);
        return $matches[2];
    }
    

    The p() function processes the HTML file contents. Its regex matches the <div class="post-bodycopy clearfix">...</div> element and passes the DIV contents to the p_cb() callback function. This first callback function then walks/processes the contents of the DIV using a regex which matches either open or close tags (into capture group $1), or non-tag stuff (into capture group $2). This in turn calls a second callback function p_cb_cb() which simply returns open and close tags (in $1) unchanged and then uses str_replace() to convert all instances of the $title text into the desired link.

    Note that your HTML markup is not valid. It has many unquoted tag attribute values (which should be quoted).
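
    The same two-pass idea can also be sketched with closures, which avoids the global variables. This is only an illustration under the same no-nested-DIVs assumption; the function name linkify_title() and the sample $html are made up for the example.

    ```php
    <?php
    // Two-pass linkify without globals: $title and $link are passed in and
    // captured by the closures. All names here are illustrative.
    function linkify_title($html, $title, $link) {
        // Pass 1: each non-nested "post-bodycopy" DIV (open tag, body, close tag).
        $divRe = '%(<div class="post-bodycopy clearfix">)(.+?)(</div>)%si';
        // Pass 2: either a tag (group 1) or a run of non-tag text (group 2).
        $tokenRe = '%(</?\w+(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s<>]+))?)*\s*/?>)|([^<]+)%';

        return preg_replace_callback($divRe, function ($div) use ($tokenRe, $title, $link) {
            $body = preg_replace_callback($tokenRe, function ($tok) use ($title, $link) {
                if ($tok[1] !== '') {
                    return $tok[1]; // tags pass through unchanged
                }
                // Plain text: wrap every occurrence of the title in a link.
                return str_replace($title, '<a href="' . $link . '">' . $title . '</a>', $tok[2]);
            }, $div[2]);
            return $div[1] . $body . $div[3];
        }, $html);
    }

    // Made-up input: the title appears in a heading, in an attribute, and twice in body text.
    $html = '<h2>Batman: Arkham City</h2>'
          . '<div class="post-bodycopy clearfix">'
          . '<a href="/x" title="Batman: Arkham City"><img src="a.jpg"></a>'
          . '<p>Batman: Arkham City no PC. Batman: Arkham City!</p></div>';

    $out = linkify_title($html, 'Batman: Arkham City', 'http://www.mylink');
    echo $out, "\n";
    ```

    Because tags are returned untouched, the title inside the title="..." attribute is not linkified; only the two occurrences in the paragraph text are.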

  3. $title = 'Batman: Arkham City';

    search: {(?<=<div class="post-bodycopy clearfix">)(.+?)($title)(.+?)(?=<div class="post-footer">)}s

    replace:
    \1<a href="http://www.mylink">\2</a>\3
    or
    $1<a href="http://www.mylink">$2</a>$3

    Edit
    You can try the following. A PHP example is here: http://ideone.com/JtH4s

    $title = 'Batman: Arkham City';
    $divclass = 'post-bodycopy clearfix';
    
    $rxtag =
    '<
     (?:
         \?php\s+.*?\?
      |  (?:
           (?:
               (?:script|style)\s*
             | (?:script|style)\s+(?:".*?"|\'.*?\'|[^>]*?)+\s*
           )> .*? </(?:script|style)\s*
         )
      |  (?:
             /?[A-Za-z_:][\w:.-]*\s*/?
           |  [A-Za-z_:][\w:.-]*\s+(?:".*?"|\'.*?\'|[^>]*?)+\s*/?
           | !(?:DOCTYPE.*?|--.*?--)
         )
     )
     >
    ';
    
    // Or,
    // $rxtag_optional = '<[^<>]+?>';
    // $rxtag = $rxtag_optional;
    
    
    
    $rxmain =
    "~(?xs:
       ( <div (?=\s)[^>]*
              (?<=\s) class \s* = \s* \" \s* (?i-x:$divclass) \s* \"
              [^>]* (?<!/)
         >
         (?:
             (?! </?div | (?-x:$title))
             (?> $rxtag  | [^<] | <)
         )*?
       )
       ( (?-x:$title) )
       (
          (?: (?!</?div) (?> $rxtag  | [^<] | <) )*?
          </div \s*>
       )
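     )~";
    // The closing delimiter follows the group directly; a line break between
    // them would sit outside (?x...) and be matched as a literal newline.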
    
    //print "$rxmain\n\n";
    
    $count = 0;
    
    $newhtml = preg_replace( $rxmain,
                             '$1<a href="http://www.mylink">$2</a>$3',
                             $html,
                             1,
                             $count );