Replacing links to attachment pages with links to files, with(out) regex

I tried to do it myself, but I have a really hard time with regex (regex is hell, I just don’t understand it).

Anyway, I think this would be useful to some people. The context is that we use a plugin to display the images of a post in a lightbox. It works only if the link points to the file (the image) and NOT to the attachment page.

Most of the time, authors forget to check, the lightbox doesn’t pop up, and everyone is disappointed.

Bad HTML (note the word attachment in the href, and that the href doesn’t end with .jpg):

 <p>
   <a href="http://domain.ru/uncategorized/208/e-poha-restavratsii/attachment/cyril_gassiline_photo" rel="attachment wp-att-209"><img class="alignnone size-large wp-image-209" title="cyril_gassiline_photo" src="http://domain.ru/wp-content/uploads/2012/07/cyril_gassiline_photo-615x409.jpg" alt="" width="615" height="409" /></a>
 </p>

So I guess we could do something like this (just the logic):

function remove_bad_img_links($content) {
$matches = array();
$check_for_attachment_word = preg_match('/<a[\s]+[^>]*href\s*=\s*(attachment)(["\']+)([^>]+?)(\1|>)/i', $content, $matches);
if ( $check_for_attachment ) {

       $image_url = preg_match('/<img[\s]+[^>]*src\s*=\s*(["\']+)([^>]+?)(\1|>)/i', $content, $matches);

       preg_replace('/<a[\s]+[^>]*href\s*=\s*(attachment)(["\']+)([^>]+?)(\1|>)/i', $content, $image_url);
     }
return $content;
}
add_filter( 'the_content', 'remove_bad_img_links' );

(I know the code is wrong, ^^, I really don’t get regex at all)

To produce the corrected HTML (the value of the img src has replaced the initial value of the a href, which was detected as wrong because it contains the word attachment; or simply do it all the time):

 <p>
   <a href="http://domain.ru/wp-content/uploads/2012/07/cyril_gassiline_photo-615x409.jpg" rel="attachment wp-att-209"><img class="alignnone size-large wp-image-209" title="cyril_gassiline_photo" src="http://domain.ru/wp-content/uploads/2012/07/cyril_gassiline_photo-615x409.jpg" alt="" width="615" height="409" /></a>
 </p>

But writing this, I wonder whether it’s really possible, considering that a post could have several occurrences of those links…

Just asking for reference: is it possible, is it a good idea, and what would the regex look like?

I could do it in JS, but it would be better for SEO to do it in the generated code.
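
For reference, a minimal sketch of what a regex version could look like. It relies on preg_replace_callback(), which runs once per match, so several links in one post are not a problem. It assumes the markup always has the shape of the example above (double-quoted attributes, the img directly inside the a, href containing /attachment/), and the function name swap_attachment_hrefs() is made up for illustration. Anything that deviates from that shape is left untouched.

```php
<?php
// Hypothetical helper, not part of WordPress: rewrites every
// <a href=".../attachment/..."><img src="..."></a> pair found in $content.
function swap_attachment_hrefs( $content ) {
    // Group 1: start of the <a> tag up to the opening quote of href
    // Group 2: the href value (must contain /attachment/)
    // Group 3: rest of the <a> tag plus the <img> tag up to the opening quote of src
    // Group 4: the src value
    $pattern = '~(<a\s[^>]*?href=")([^"]*/attachment/[^"]*)("[^>]*>\s*<img\s[^>]*?src=")([^"]+)~i';

    // The callback is called once for each matching link.
    return preg_replace_callback( $pattern, function ( $m ) {
        // Same markup, but the href value is replaced by the src value.
        return $m[1] . $m[4] . $m[3] . $m[4];
    }, $content );
}
```

Text links and links that already point to a file do not match the pattern, so they pass through unchanged.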

EDIT:

More research tells me that “Thou shalt not use regular expressions to parse HTML”. A better approach is to use the DOM:
https://stackoverflow.com/questions/3820666/grabbing-the-href-attribute-of-an-a-element

Definitely, I was thinking about it the wrong way; it’s very bad to parse HTML with regex: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html. I will try to use the DOM and will come back with an update.
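
As a first step in that direction, here is a minimal DOM sketch in plain PHP (no WordPress functions) that lists the href and the image src for every link wrapping an image. The function name find_image_links() is invented for this example, and it assumes the PHP DOM extension is available.

```php
<?php
// Hypothetical helper: returns one array( 'href' => ..., 'src' => ... )
// entry for every <a> element that contains an <img>.
function find_image_links( $html ) {
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors( true );
    // The <?xml ...> prefix makes libxml read the string as UTF-8.
    $dom->loadHTML( '<?xml encoding="UTF-8">' . $html );
    libxml_clear_errors();
    libxml_use_internal_errors( $previous );

    $pairs = array();
    foreach ( $dom->getElementsByTagName( 'a' ) as $a ) {
        // First <img> inside this link, or null for a text link.
        $img = $a->getElementsByTagName( 'img' )->item( 0 );
        if ( null === $img ) {
            continue;
        }
        $pairs[] = array(
            'href' => $a->getAttribute( 'href' ),
            'src'  => $img->getAttribute( 'src' ),
        );
    }
    return $pairs;
}
```

Each returned pair can then be checked for the word attachment in its href to decide whether the link needs fixing.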

1 comment

  1. So I have worked out a way to find links to attachment pages for images and replace them, not just with the URL from the img tag, but with the URL of the large version of the image file.

    I’m not using regex because it’s evil and I don’t want to make kittens die.

    It is working for me, so I am sharing it as a starting point for anyone who wants to do the same (and for possible improvements or corrections from more talented people):

    First I use a function to get the ID of a file from its URL (I took it from here: Turn a URL into an Attachment / Post ID)

    // TOOL
    // GET ATTACHMENT ID FROM ITS URL
    function get_attachment_id( $url ) {

        $dir = wp_upload_dir();
        $dir = trailingslashit($dir['baseurl']);

        // ONLY URLS INSIDE THE UPLOADS FOLDER CAN BE ATTACHMENTS
        if( false === strpos( $url, $dir ) )
            return false;

        $file = basename($url);

        $query = array(
            'post_type' => 'attachment',
            'fields' => 'ids',
            'meta_query' => array(
                array(
                    'value' => $file,
                    'compare' => 'LIKE',
                )
            )
        );

        // FIRST TRY: THE URL IS THE ORIGINAL (FULL SIZE) FILE
        $query['meta_query'][0]['key'] = '_wp_attached_file';
        $ids = get_posts( $query );

        foreach( $ids as $id ) {
            // KEEP THE RESULT IN A VARIABLE SO IT CAN BE CHECKED BEFORE USE
            $src = wp_get_attachment_image_src($id, 'full');
            if( $src && $url == $src[0] )
                return $id;
        }

        // SECOND TRY: THE URL IS ONE OF THE RESIZED VERSIONS
        $query['meta_query'][0]['key'] = '_wp_attachment_metadata';
        $ids = get_posts( $query );

        foreach( $ids as $id ) {

            $meta = wp_get_attachment_metadata($id);
            // SOME ATTACHMENTS HAVE NO RESIZED VERSIONS AT ALL
            if( empty( $meta['sizes'] ) )
                continue;

            foreach( $meta['sizes'] as $size => $values ) {
                $src = wp_get_attachment_image_src($id, $size);
                if( $values['file'] == $file && $src && $url == $src[0] )
                    return $id;
            }
        }

        return false;
    }
    

    Then I use this function to return the innerHTML of the <a> node I will find (found here: https://stackoverflow.com/questions/2087103/innerhtml-in-phps-domdocument ):

    // GET INNER HTML OF A NODE
    function DOMinnerHTML($element) 
    { 
        $innerHTML = ""; 
        $children = $element->childNodes; 
        foreach ($children as $child) 
        { 
            $tmp_dom = new DOMDocument(); 
            $tmp_dom->appendChild($tmp_dom->importNode($child, true)); 
            $innerHTML.=trim($tmp_dom->saveHTML()); 
        } 
        return $innerHTML; 
    } 
    

    Now, the actual function that will be added as a filter to the_content:

    function remove_bad_img_links($content) {
        $dom = new DOMDocument();
        // THIS IS HACK TO LOAD STRING WITH CORRECT ENCODING
        // JUST OUTPUT <!--?xml encoding="UTF-8"--> IN HTML SO NO HARM
        $dom->loadHTML( '<?xml encoding="UTF-8">' . $content );
        // REMEMBER IF AT LEAST ONE HREF WAS CHANGED
        $changed = false;

        // GET ALL <a> NODE
        foreach ( $dom->getElementsByTagName('a') as $node ) {
            // GET HREF
            $link_href = $node->getAttribute( 'href' );
            // CHECK IF THE WORD attachment IS IN HREF, SKIP THE LINK IF NOT
            if ( false === strpos( $link_href, 'attachment' ) ) {
                continue;
            }
            // USE INNER OF THIS <a> NODE AS NEW DOC TO EXTRACT IMG
            $inner = DOMinnerHTML($node);
            if ( '' === $inner ) {
                continue;
            }
            $dom_node = new DOMDocument();
            $dom_node->loadHTML($inner);
            // EXTRACT IMG AND GET SRC OF IT
            // ASSUMING THERE IS ONLY ONE IMAGE ...
            $img_node_link = '';
            foreach ( $dom_node->getElementsByTagName('img') as $img_node ) {
                $img_node_link = $img_node->getAttribute( 'src' );
            }
            // SKIP LINKS THAT DO NOT WRAP AN IMAGE
            if ( '' === $img_node_link ) {
                continue;
            }
            // GET ID OF THE IMAGE VIA CUSTOM FUNCTION
            $img_id = get_attachment_id( $img_node_link );
            // GET ARRAY OF THE IMAGE VIA BUILTIN FUNCTION
            $img_array = $img_id ? wp_get_attachment_image_src( $img_id, 'large' ) : false;
            if ( $img_array ) {
                // REPLACE HREF WITH NEW SOURCE
                $node->setAttribute( 'href', $img_array[0] );
            } else {
                // FALLBACK IF CUSTOM FUNCTION DONT RETURN ID
                $node->setAttribute( 'href', $img_node_link );
            }
            $changed = true;
        }
        // SAVE MODIFIED DOM ONCE, ONLY IF SOMETHING CHANGED
        if ( $changed ) {
            $content = $dom->saveHTML();
        }
        // RETURN CONTENT
        return $content;
    }
    // APPLY FILTER
    add_filter( 'the_content', 'remove_bad_img_links' );
    

    I have tested it on posts with anything from no images to several images, without any problems so far. All images have the correct href destination and, when possible, this href points to the large version of the file.
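
One thing to watch with the approach above: calling saveHTML() on the whole document also returns the doctype and the <html> and <body> wrappers that loadHTML() adds. A possible variation, sketched here under assumptions, reads the <img> directly from the <a> node and serializes only the original content. The names rewrite_attachment_links() and $resolver are invented for this sketch; $resolver stands in for the get_attachment_id() / wp_get_attachment_image_src() lookup so the code can run outside WordPress. It also assumes the content has no stray closing </div> that would end the temporary wrapper early.

```php
<?php
// Hypothetical variation on remove_bad_img_links(). $resolver is an assumed
// callable that maps an image URL to a better URL, or returns null.
function rewrite_attachment_links( $content, $resolver = null ) {
    if ( false === strpos( $content, 'attachment' ) ) {
        return $content; // nothing to fix, skip parsing entirely
    }

    $dom      = new DOMDocument();
    $previous = libxml_use_internal_errors( true ); // keep parser warnings quiet
    // Temporary <div> wrapper so the original nodes are easy to find again.
    $dom->loadHTML( '<?xml encoding="UTF-8"><div>' . $content . '</div>' );
    libxml_clear_errors();
    libxml_use_internal_errors( $previous );

    $wrapper = $dom->getElementsByTagName( 'div' )->item( 0 );
    foreach ( $wrapper->getElementsByTagName( 'a' ) as $a ) {
        $img = $a->getElementsByTagName( 'img' )->item( 0 );
        // Only touch links that wrap an image and mention "attachment".
        if ( null === $img || false === strpos( $a->getAttribute( 'href' ), 'attachment' ) ) {
            continue;
        }
        $src    = $img->getAttribute( 'src' );
        $better = $resolver ? call_user_func( $resolver, $src ) : null;
        // Fall back to the img src when no better URL is available.
        $a->setAttribute( 'href', $better ? $better : $src );
    }

    // Serialize only the wrapper's children: no doctype, <html> or <body>.
    $out = '';
    foreach ( $wrapper->childNodes as $child ) {
        $out .= $dom->saveHTML( $child );
    }
    return $out;
}
```

Registered on the_content without a second argument, this would always use the img src as the new href; passing a resolver built on the two functions above would bring back the switch to the large version.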