I’m trying to get the value of the href attribute of the first <a> tag in a post that links to an image.
This is what I have so far:
$pattern = "/<a.+href=('|\")(.*?).(bmp|gif|jpeg|jpg|png)('|\").*>/i";
$output = preg_match_all($pattern, $post->post_content, $matches);
$first_link = $matches[1][0];
However, this does not work.
I have code to get the src value of an <img> tag, which does work:
$pattern = "/<img.+src=['\"]([^'\"]+)['\"].*>/i";
$output = preg_match_all($pattern, $post->post_content, $matches);
$first_img = $matches[1][0];
As I’m no expert with regular expressions, or PHP in general, I have no idea what I’m doing wrong.
Also, I couldn’t find any decent, organized guide to regular expressions, so a link to one would be useful as well!
This isn’t a problem you should be solving with regular expressions. If you want to parse HTML, what you need is an HTML parser and PHP already has one for you that works great!
See the DOMDocument documentation for more details.
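To make that concrete, here is a minimal sketch of how this could look with DOMDocument. The function name first_image_link and the sample HTML are made up for illustration. It also assumes an "image link" means an href that ends in one of the extensions from the question, so a URL with a query string after the extension would not match.

```php
<?php
// Hypothetical helper (not from the original post): returns the href of the
// first <a> whose target ends in an image extension, or null if there is none.
function first_image_link(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Post content is usually an HTML fragment, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Anchors come back in document order, so the first hit is the one we want.
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if (preg_match('/\.(bmp|gif|jpe?g|png)$/i', $href)) {
            return $href;
        }
    }
    return null;
}

// Sample input; in the question this would be $post->post_content.
$html = '<p><a href="/about">About</a> <a href="/img/cat.png">Cat</a></p>';
echo first_image_link($html), "\n"; // prints /img/cat.png
```

The regular expression is still there, but it only has to check a single attribute value, not parse the markup around it.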
You should use a DOM parser for this. If you can use 3rd party libraries, check out this one. It makes your task incredibly easy:
If you cannot use this library for one reason or another, using PHP’s built-in DOM module is still a better option than regular expressions.
Just some notes about your regular expression:
I can’t test it now as I don’t have PHP here, but correct these issues and maybe your problem is already solved. Also check the pattern modifier /U, which toggles the default “greediness”.
This problem, however, has been solved many times, so you should use an existing solution (a DOM parser). For example, you’re not permitting quotes in the href (which is probably OK for href, but later you’ll copy and paste your regex to parse another HTML attribute where quotes are valid characters).
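If you want to see what greediness does in practice, here is a small sketch. The sample HTML and the simplified pattern are made up for the demonstration; this is not the pattern from the question.

```php
<?php
// Two links on one line, as they might appear in post content.
$html = '<a href="one.png">1</a> <a href="two.png">2</a>';

// Greedy (default): ".+" grabs as much as it can, so the match starts at the
// first "<a" but the captured href belongs to the last link.
preg_match('/<a.+href="([^"]+)"/', $html, $greedy);

// With /U the quantifiers are lazy by default, so ".+" stops at the first
// href it can reach.
preg_match('/<a.+href="([^"]+)"/U', $html, $lazy);

echo $greedy[1], "\n"; // prints two.png
echo $lazy[1], "\n";   // prints one.png
```

The same effect is available per quantifier by writing .+? without /U, which is usually easier to read than flipping the default for the whole pattern.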