I tried to do this myself, but I have a really hard time with regex (regex is hell; I just don't understand it).
Anyway, I think this would be useful to some people. The context is that we use a plugin to display images in posts in a lightbox. It works only if the link points to the file (the image) and NOT to the attachment page.
Most of the time, authors forget to check, the lightbox doesn't pop up, and everyone is disappointed.
Bad HTML (note the word attachment in the href, and that the href doesn't end with .jpg):
<p>
<a href="http://domain.ru/uncategorized/208/e-poha-restavratsii/attachment/cyril_gassiline_photo" rel="attachment wp-att-209"><img class="alignnone size-large wp-image-209" title="cyril_gassiline_photo" src="http://domain.ru/wp-content/uploads/2012/07/cyril_gassiline_photo-615x409.jpg" alt="" width="615" height="409" /></a>
</p>
So I guess we could do something like this (in terms of logic):
function remove_bad_img_links($content) {
    $matches = array();
    $check_for_attachment_word = preg_match('/<a[\s]+[^>]*href\s*=\s*(attachment)(["\']+)([^>]+?)(\1|>)/i', $content, $matches);
    if ( $check_for_attachment ) {
        $image_url = preg_match('/<img[\s]+[^>]*src\s*=\s*(["\']+)([^>]+?)(\1|>)/i', $content, $matches);
        preg_replace('/<a[\s]+[^>]*href\s*=\s*(attachment)(["\']+)([^>]+?)(\1|>)/i', $content, $image_url);
    }
    return $content;
}
add_filter( 'the_content', 'remove_bad_img_links' );
(I know the code is wrong ^^; I really don't get regex at all.)
To produce corrected HTML (the value of the img src has replaced the initial value of the a href, which was detected as wrong because it contains the word attachment; or simply do it every time):
<p>
<a href="http://domain.ru/wp-content/uploads/2012/07/cyril_gassiline_photo-615x409.jpg" rel="attachment wp-att-209"><img class="alignnone size-large wp-image-209" title="cyril_gassiline_photo" src="http://domain.ru/wp-content/uploads/2012/07/cyril_gassiline_photo-615x409.jpg" alt="" width="615" height="409" /></a>
</p>
But while writing this, I ask myself whether it's really possible, considering that a post could have several occurrences of those links…
Just asking for reference: is it possible, is it a good idea, and what would the regex look like?
I could do it in JS, but it would be better for SEO to do it in the generated markup.
EDIT:
More research tells me that “Thou shalt not use regular expressions to parse HTML”. A better approach is to use the DOM:
https://stackoverflow.com/questions/3820666/grabbing-the-href-attribute-of-an-a-element
Definitely, I was thinking about it the wrong way; it's very bad to parse HTML with regex: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html. I will try to use the DOM and will come back with an update.
So I have completed a way to find links to attachment pages for images and replace them not just with the URL from the img tag, but with the URL of the large version of the image file. I'm not using regex because it's evil and I don't want to make kittens die.
It's working for me, so I'm sharing it as a starting point for anyone who wants to do the same (and for possible improvements or corrections from more talented people):
First, I use a function to get the ID of a file from its URL (I took it from here: Turn a URL into an Attachment / Post ID)
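A minimal sketch of such a lookup might look like the following. It is not the code from the linked answer: it assumes WordPress 4.0 or later, where the core function attachment_url_to_postid() exists, and the two helper names (strip_image_size_suffix and get_attachment_id_from_url) are made up for illustration. The small regex here only touches a URL string, never HTML, and it would wrongly trim an original file whose own name happens to end in something like -800x600.

```php
<?php
// Hypothetical helper: drop a trailing "-WIDTHxHEIGHT" size suffix so that
// ".../photo-615x409.jpg" becomes ".../photo.jpg" (the original upload).
function strip_image_size_suffix( $url ) {
    return preg_replace( '/-\d+x\d+(?=\.[a-z0-9]+$)/i', '', $url );
}

// Hypothetical wrapper: resolve an image URL to its attachment ID.
// Returns 0 when no attachment matches, or when run outside WordPress.
function get_attachment_id_from_url( $url ) {
    if ( ! function_exists( 'attachment_url_to_postid' ) ) {
        return 0;
    }
    return (int) attachment_url_to_postid( strip_image_size_suffix( $url ) );
}
```

The size suffix is stripped first because the lookup is done against the URL of the original upload, not the URL of a resized copy.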
Then I use this function to return the innerHTML of the <a> node I will find (found here: https://stackoverflow.com/questions/2087103/innerhtml-in-phps-domdocument):
Now, the actual function that will be added as a filter to the_content:
I have tested on posts ranging from no images to several images, without any problems so far. All images have the correct href destination and, when possible, this href points to the large version of the file.
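For anyone who wants to try the DOM approach, here is a rough sketch of the innerHTML helper and the the_content filter described above. It is an illustration under assumptions, not the exact code I run: the names dom_inner_html and fix_attachment_links are made up, the attachment ID is read from the wp-image-N class on the img (a shortcut, not the URL lookup mentioned earlier), and the WordPress calls are guarded so the function also runs outside WordPress, where it simply falls back to the img src.

```php
<?php
// Return the inner HTML of a node by serialising each of its children.
function dom_inner_html( DOMNode $node ) {
    $html = '';
    foreach ( $node->childNodes as $child ) {
        $html .= $node->ownerDocument->saveHTML( $child );
    }
    return $html;
}

// Point links to attachment pages at the image file instead.
function fix_attachment_links( $content ) {
    if ( false === strpos( $content, '/attachment/' ) ) {
        return $content; // nothing to fix, leave the markup untouched
    }

    $doc  = new DOMDocument();
    $prev = libxml_use_internal_errors( true ); // tolerate fragments and HTML5 tags
    // The XML prolog hints UTF-8; the wrapper <div> lets us extract the fragment later.
    $doc->loadHTML( '<?xml encoding="utf-8"?><div>' . $content . '</div>' );
    libxml_clear_errors();
    libxml_use_internal_errors( $prev );

    foreach ( $doc->getElementsByTagName( 'a' ) as $link ) {
        $img = $link->getElementsByTagName( 'img' )->item( 0 );
        if ( ! $img || false === strpos( $link->getAttribute( 'href' ), '/attachment/' ) ) {
            continue; // not an image link to an attachment page
        }

        $target = $img->getAttribute( 'src' ); // fallback: the displayed file

        // Assumption: inside WordPress, the ID is available as "wp-image-N".
        if ( function_exists( 'wp_get_attachment_image_src' )
            && preg_match( '/\bwp-image-(\d+)\b/', $img->getAttribute( 'class' ), $m ) ) {
            $large = wp_get_attachment_image_src( (int) $m[1], 'large' );
            if ( $large ) {
                $target = $large[0]; // URL of the large version
            }
        }

        $link->setAttribute( 'href', $target );
    }

    // The first <div> in document order is our wrapper.
    return dom_inner_html( $doc->getElementsByTagName( 'div' )->item( 0 ) );
}

if ( function_exists( 'add_filter' ) ) {
    add_filter( 'the_content', 'fix_attachment_links' );
}
```

Because the content goes through DOMDocument, the returned markup is re-serialised (for example, `<img ... />` comes back as `<img ...>`), so posts without attachment links are returned as-is to avoid touching them needlessly.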