First of all: I know I should not use regex to parse HTML. I’ve read it a zillion times already. But the tool I have for the job is regex-based, so I cannot use an HTML parser or anything else. Anyway, I appreciate the concern, but if what I need can be done with regex, great. If not, sorry, we’ll have to drop this feature.
The question is:
Short explanation: I need a regex that returns a substring contained inside tags in a PHP-generated webpage (WordPress, for what it’s worth).
Long explanation: I need to find every instance of a game’s name (in this example, Batman: Arkham City) that is located inside the various <div class="post-bodycopy clearfix"> elements that exist in my page. That is, I only want the game’s name where it appears inside the post body, not in the post title, the sidebar, or anywhere else. Then I’ll replace the name with a link using preg_replace or something similar.
I’ve searched the web for a similar question, but I could only find “give me everything that is inside these tags” questions.
Here is a typical post from within my generated code:
<div class="post-268445 post hentry category-world-community-gamer tag-games tag-geral tag-lancamentos tag-noticias tag-pc tag-ps3 tag-xb360" id="post-268445">
<div class="post-kicker"><?php get_cat_icon(); ?><a href="http://www.gameblogs.com.br/category/world-community-gamer/" title="World Community Gamer" onclick="return TrackClick('http://www.gameblogs.com.br/category/world-community-gamer/','')"><img src="http://www.gameblogs.com.br/wp-content/uploads/world-community-gamer.png" width="48" height="48" alt="" title="World Community Gamer" /></a></div>
<div class="post-headline"> <h2> <a href="http://www.worldcommunitygamer.com/2011/10/data-para-batman-arkham-city-no-pc.html?utm_source=gameblogs&utm_campaign=data-para-batman-arkham-city-no-pc" rel="bookmark" title="Permanent Link to Data para Batman: Arkham City no PC" target="_blank" onclick="return TrackClick('http://www.worldcommunitygamer.com/2011/10/data-para-batman-arkham-city-no-pc.html?utm_source=gameblogs&utm_campaign=data-para-batman-arkham-city-no-pc','')">Data para Batman: Arkham City no PC</a></h2> </div>
<div class="post-byline"><img src="http://www.gameblogs.com.br/wp-content/themes/atahualpa353/images/icons/user.gif" alt="" /> <a href="http://www.gameblogs.com.br/author/_otaviofqueiroz/" title="Posts de @_otaviofqueiroz" onclick="return TrackClick('http://www.gameblogs.com.br/author/_otaviofqueiroz/','')">@_otaviofqueiroz</a>, do <img src="http://www.gameblogs.com.br/wp-content/themes/atahualpa353/images/icons/home.gif" alt="" /> <a href="http://www.worldcommunitygamer.com/" target="_blank" target="_blank" onclick="return TrackClick('http://www.worldcommunitygamer.com/','')">WCG | World Community Gamer: Jogos, Análises e Tecnologia</a>, <img src="http://www.gameblogs.com.br/wp-content/themes/atahualpa353/images/icons/calendar_month.png" alt="" /> 18/10/11 | Compartilhe: <a href="http://twitter.com/share" class="twitter-share-button" data-url="http://www.worldcommunitygamer.com/2011/10/data-para-batman-arkham-city-no-pc.html?utm_source=gameblogs&utm_campaign=data-para-batman-arkham-city-no-pc" data-text="WCG | World Community Gamer: Jogos, Análises e Tecnologia: Data para Batman: Arkham City no PC" data-count="horizontal" data-via="GameBlogsBR" data-lang="fr" target="_blank" onclick="return TrackClick('http://twitter.com/share','')">Tweet</a><script type="text/javascript" src="http://platform.twitter.com/widgets.js"></script></div><div class="post-bodycopy clearfix"><p> <a href="http://www.worldcommunitygamer.com/2011/10/data-para-batman-arkham-city-no-pc.html" imageanchor="1" style="margin-left: 1em; margin-right: 1em;" target="_blank" onclick="return TrackClick('http://www.worldcommunitygamer.com/2011/10/data-para-batman-arkham-city-no-pc.html','')"><img src="/wp-content/plugins/wordpress-image-resizer/thumb/phpThumb.php?fltr=usm&src=http://2.bp.blogspot.com/-9oKlgIND3qY/Tp3Aimju2nI/AAAAAAAABxA/Q585nqpdsRI/s1600/batman_arkham_city_screens16-620x348.jpg&w=200" align='left'></a>
<p>A Warner divulgou a data de lançamento para Batman: Arkham City no PC. O jogo que terá a sua versão para os consoles (PS3 e Xbox 360) lançada nessa sexta-feira, chegará as lojas na versão PC no dia 18 de Novembro. Apesar da demora [...]<br /><a href=http://www.worldcommunitygamer.com/2011/10/data-para-batman-arkham-city-no-pc.html?utm_source=gameblogs&utm_campaign=data-para-batman-arkham-city-no-pc>[continua no site original...]</a></p></div>
<div class="post-footer"><img src="http://www.gameblogs.com.br/wp-content/themes/atahualpa353/images/icons/tag.gif" alt="" /> <a href="http://www.gameblogs.com.br/tag/games/" rel="tag" onclick="return TrackClick('http://www.gameblogs.com.br/tag/games/','')">Games</a>, <a href="http://www.gameblogs.com.br/tag/geral/" rel="tag" onclick="return TrackClick('http://www.gameblogs.com.br/tag/geral/','')">Geral</a>, <a href="http://www.gameblogs.com.br/tag/lancamentos/" rel="tag" onclick="return TrackClick('http://www.gameblogs.com.br/tag/lancamentos/','')">lançamentos</a>, <a href="http://www.gameblogs.com.br/tag/noticias/" rel="tag" onclick="return TrackClick('http://www.gameblogs.com.br/tag/noticias/','')">Notícias</a>, <a href="http://www.gameblogs.com.br/tag/pc/" rel="tag" onclick="return TrackClick('http://www.gameblogs.com.br/tag/pc/','')">PC</a>, <a href="http://www.gameblogs.com.br/tag/ps3/" rel="tag" onclick="return TrackClick('http://www.gameblogs.com.br/tag/ps3/','')">PS3</a>, <a href="http://www.gameblogs.com.br/tag/xb360/" rel="tag" onclick="return TrackClick('http://www.gameblogs.com.br/tag/xb360/','')">XB360</a><br>Todos os posts do <a href="http://www.gameblogs.com.br/category/world-community-gamer/" onclick="return TrackClick('http://www.gameblogs.com.br/category/world-community-gamer/','')">World Community Gamer</a></div></div><!-- / Post -->
I’ve already tried the following for the find:
$<div class="post-bodycopy clearfix">(.+?)(Batman: Arkham City)(.+?)(?=<div class="post-footer">)$s
Meaning: find the div opening tag, followed by anything, followed by Batman: Arkham City, followed by anything, up to the opening div tag of the post footer, with the s modifier so that the dot also matches line breaks.
And the following for the replace:
<div class="post-bodycopy clearfix">/1<a href="http://www.mylink">Batman: Arkham City</a>/3
For some reason the regex works in http://regexlib.com, returning all the expected parts, but not in my live website. It must be some minor issue.
However, I am sure that my solution is not the most elegant (or the least server-intensive) way to find such a substring, since I capture several parts just to change one of them.
Is there a cleverer way to achieve this? Please?
Thanks a lot!
I put together an example here with the following regex in PHP:
I added a Batman: Arkham City at the bottom of the HTML string, just to test. It seems to be working. Let me know.
If you insist on using regex, and your <div class="post-bodycopy clearfix">...</div> elements will never contain any nested DIVs, here is a double-callback solution that should do a decent job:

The p() function processes the HTML file contents. Its regex matches the <div class="post-bodycopy clearfix">...</div> element and passes the DIV contents to the p_cb() callback function. This first callback function then walks the contents of the DIV using a regex which matches either open or close tags (into capture group $1), or non-tag stuff (into capture group $2). This in turn calls a second callback function, p_cb_cb(), which simply returns open and close tags (in $1) unchanged and then uses str_replace() to convert all instances of the $title text into the desired link.

Note that your HTML markup is not valid. It has many unquoted tag attribute values (which should be quoted).
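A minimal sketch of the double-callback flow described above, reusing the names p(), p_cb(), and p_cb_cb(). The regexes, the extra $title and $url parameters, and the link URL are assumptions made for illustration. It relies on the stated condition that the post-body DIV contains no nested DIVs, and additionally assumes that no attribute value contains a literal > character.

```php
<?php
// Second callback: tags arrive in $m[1], plain text in $m[2].
function p_cb_cb(array $m, string $title, string $url): string
{
    if ($m[1] !== '') {
        return $m[1]; // open or close tag: return it unchanged
    }
    // Plain text between tags: turn every occurrence of the title into a link.
    $anchor = '<a href="' . $url . '">' . $title . '</a>';
    return str_replace($title, $anchor, $m[2]);
}

// First callback: walk the DIV contents as a sequence of tags and text runs.
function p_cb(array $m, string $title, string $url): string
{
    $inner = preg_replace_callback(
        '%(<[^>]*>)|([^<]+)%',
        function (array $t) use ($title, $url): string {
            return p_cb_cb($t, $title, $url);
        },
        $m[2]
    );
    // Put the opening and closing DIV tags back around the processed contents.
    return $m[1] . $inner . $m[3];
}

// Entry point: find each post-body DIV and hand its contents to p_cb().
function p(string $html, string $title, string $url): string
{
    $re = '%(<div class="post-bodycopy clearfix">)(.*?)(</div>)%s';
    $result = preg_replace_callback(
        $re,
        function (array $m) use ($title, $url): string {
            return p_cb($m, $title, $url);
        },
        $html
    );
    return $result ?? $html; // fall back to the input if PCRE reports an error
}

// Usage with a hypothetical fragment: the title inside the title="..." attribute
// and the one in the headline sit outside the text runs that get rewritten.
$sample = '<h2>Batman: Arkham City</h2>'
        . '<div class="post-bodycopy clearfix">'
        . '<p title="Batman: Arkham City">Batman: Arkham City no PC</p></div>';
echo p($sample, 'Batman: Arkham City', 'http://www.mylink'), "\n";
```

Because tags are returned untouched, a title that appears inside an attribute value is never rewritten, which is the main advantage over a single search-and-replace pattern.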
$title = 'Batman: Arkham City';
search:
{(?<=<div class="post-bodycopy clearfix">)(.+?)($title)(.+?)(?=<div class="post-footer">)}s
replace:
\1<a href="http://www.mylink">\2</a>\3
or
$1<a href="http://www.mylink">$2</a>$3
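As a rough sketch of how the search and replace strings above could be passed to preg_replace(): the helper name link_title(), the ~ delimiter, and the use of preg_quote() are assumptions added for illustration, and the URL is assumed to contain no $ or backslash characters (which preg_replace() would interpret in the replacement string).

```php
<?php
// Hypothetical helper; the name link_title() is not from the original answer.
function link_title(string $html, string $title, string $url): string
{
    // preg_quote() escapes regex metacharacters in the title ('~' is the delimiter).
    $search = '~(?<=<div class="post-bodycopy clearfix">)(.+?)('
            . preg_quote($title, '~')
            . ')(.+?)(?=<div class="post-footer">)~s';

    // $1 and $3 put back the text around the title; $2 is the title itself.
    $replace = '$1<a href="' . $url . '">$2</a>$3';

    // preg_replace() returns null on a PCRE error; fall back to the input then.
    return preg_replace($search, $replace, $html) ?? $html;
}

// Usage with a hypothetical, shortened post: only the body occurrence is linked.
$html = '<div class="post-headline">Data para Batman: Arkham City no PC</div>'
      . '<div class="post-bodycopy clearfix"><p>A data de Batman: Arkham City no PC.</p></div>' . "\n"
      . '<div class="post-footer">Games</div>';
echo link_title($html, 'Batman: Arkham City', 'http://www.mylink'), "\n";
```

Two limitations follow from the pattern itself: it links only the first occurrence of the title in each post body, and if a body does not contain the title, the lazy .+? can run past that post's footer into the next post.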
Edit
You can try the below. Example php is here http://ideone.com/JtH4s