Regular Expressions + preg_match_all – Getting the value of an attribute

I’m trying to get the value of the href attribute of the first <a> tag in a post that links to an image.
This is what I have so far:

$pattern = "/<a.+href=('|\")(.*?).(bmp|gif|jpeg|jpg|png)('|\").*>/i";
$output = preg_match_all($pattern, $post->post_content, $matches);
$first_link = $matches[1][0];

However, this does not work.

I have a code to get the src value of an <img> tag which does work:

$pattern = "/<img.+src=['\"]([^'\"]+)['\"].*>/i";
$output = preg_match_all($pattern, $post->post_content, $matches);
$first_img = $matches[1][0];

As I’m no expert in regular expressions, or in PHP in general, I have no idea what I’m doing wrong.

I also couldn’t find a decent, organized guide to regular expressions, so a link to one would be useful as well!

3 comments

  1. This isn’t a problem you should be solving with regular expressions. If you want to parse HTML, what you need is an HTML parser and PHP already has one for you that works great!

    $html = <<<HTML
    <a href="http://somesillyexample.com/some/silly/path/to/a/file.jpeg">
    HTML;
    
    $dom = new DOMDocument;
    $dom->loadHTML($html); // load HTML from a string
    $elements = $dom->getElementsByTagName('a'); // get all elements with an 'a' tag in the DOM
    foreach ($elements as $node) {
        /* If the element has an href attribute let's get it */
        if ($node->hasAttribute('href')) {
            echo $node->getAttribute('href') . "\n";
        }
    }
    /*
    will output:
    
    http://somesillyexample.com/some/silly/path/to/a/file.jpeg
    */
    

    See the DOMDocument documentation for more details.
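
    The loop above prints every href. To get only the first link that points to an image, which is what the question asks for, the same approach could be narrowed down roughly like this. This is a sketch, not a definitive implementation: `first_image_link` is a hypothetical helper name, the extension list is taken from the question's pattern, and the sample markup is made up.

    ```php
    <?php
    // Hypothetical helper: returns the href of the first <a> whose URL path
    // ends in an image extension, or null when there is no such link.
    function first_image_link(string $html): ?string
    {
        if ($html === '') {
            return null; // loadHTML() rejects an empty string
        }

        $dom = new DOMDocument;
        // Post content is rarely valid HTML, so collect libxml's parse
        // warnings internally instead of printing them.
        $previous = libxml_use_internal_errors(true);
        $dom->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        foreach ($dom->getElementsByTagName('a') as $node) {
            $href = $node->getAttribute('href'); // '' when the attribute is missing
            // Test only the path so a query string cannot hide the extension.
            $path = (string) parse_url($href, PHP_URL_PATH);
            if (preg_match('/\.(bmp|gif|jpe?g|png)$/i', $path)) {
                return $href;
            }
        }

        return null;
    }

    // Made-up sample markup; in the question this would be $post->post_content.
    $sample = '<p><a href="/about">About</a> '
            . '<a href="http://somesillyexample.com/some/silly/path/to/a/file.jpeg">pic</a></p>';
    $first_link = first_image_link($sample);
    ```

    The regular expression here is applied to a single, already extracted attribute value, which is a much easier job than matching it inside raw HTML.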

  2. You should use a DOM parser for this. If you can use third-party libraries, check out PHP Simple HTML DOM Parser. It makes your task incredibly easy:

    $html = new simple_html_dom();
    $html->load($post->post_content);
    
    $anchor = $html->find('a', 0);
    $first_link = $anchor->href;
    

    If you cannot use this library for one reason or another, using PHP’s built-in DOM module is still a better option than regular expressions.

  3. Just some notes about your regular expression:

     "/<a.+href=('|\")(.*?).(bmp|gif|jpeg|jpg|png)('|\").*>/i"
          ^ that's greedy, should be +?
         ^ that's any char, should be a not-closing-tag character: [^>]
    
     "/<a.+href=('|\")(.*?).(bmp|gif|jpeg|jpg|png)('|\").*>/i"
                ^^^^^^ for readability use ['\"]
    
     "/<a.+href=('|\")(.*?).(bmp|gif|jpeg|jpg|png)('|\").*>/i"
                           ^ that's any char, you probably wanted \.
    
     "/<a.+href=('|\")(.*?).(bmp|gif|jpeg|jpg|png)('|\").*>/i"
                        ^^ that's ungreedy (good!)       ^ see above (greedy any char)
    

    I can’t test it now as I don’t have PHP here, but correct these issues and your problem may already be solved. Also check out the U pattern modifier, which inverts the default greediness of quantifiers.

    However, this problem has been solved many times, so you should use one of the existing solutions (a DOM parser). For example, your pattern does not permit quotes inside the href value. That is probably fine for href, but later you’ll copy and paste your regex to parse another HTML attribute where quotes are valid characters.
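
    Put together, the notes above might lead to a pattern along these lines. Treat it as a sketch with the same limits as any regex-on-HTML approach: `first_image_href` is a hypothetical helper name and the sample markup is made up. Note that the whole URL now sits in a single capturing group, so `$matches[1][0]` is the link itself; in the original pattern, group 1 was the opening quote.

    ```php
    <?php
    // Hypothetical helper: returns the first href ending in an image
    // extension, or null when nothing matches.
    function first_image_href(string $content): ?string
    {
        // <a\s          an <a> tag (the whitespace keeps <abbr> etc. out)
        // [^>]*?        lazily skip other attributes without leaving the tag
        // href=['\"]    opening quote as a character class
        // ( ... )       group 1: the URL, ending in an escaped dot + extension
        // ['\"][^>]*>   closing quote and the rest of the tag
        $pattern = "/<a\s[^>]*?href=['\"]([^'\"]*?\.(?:bmp|gif|jpe?g|png))['\"][^>]*>/i";

        if (preg_match_all($pattern, $content, $matches) > 0) {
            return $matches[1][0];
        }

        return null;
    }

    // Made-up sample markup; in the question this would be $post->post_content.
    $sample = '<p><a href="/about">About</a> '
            . '<a class="x" href=\'http://example.com/img/cat.png\'><img src="t.png"></a></p>';
    $first_link = first_image_href($sample);
    ```

    This still ignores unquoted attribute values, URLs with a query string after the extension, and look-alike attributes such as data-href, which is one more reason to prefer a DOM parser.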