Matching shortcodes without regex

I've read a lot that regex is not the smartest way to parse and manipulate HTML, and that you should use DOMDocument instead. I have refactored some code from the docs and from here, and created two functions to split the_content() into text and <a> tags. The first function removes the given tag and returns the remaining content; the second returns only the matching tags, without the other content.

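function get_content_without( $html, $tag )
{
    $dom = new DOMDocument;
    // Note: loadHTML() wraps a fragment in a doctype and <html><body> elements.
    $dom->loadHTML( $html );

    // $tag is an XPath expression here, e.g. '//a'.
    $dom_x_path = new DOMXPath( $dom );
    // Re-run the query after each removal until no matching node is left.
    while ($node = $dom_x_path->query( $tag )->item(0)) {
        $node->parentNode->removeChild( $node );
    }
    return $dom->saveHTML();
}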

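function get_html_tag_content( $html, $tag )
{
    $document = new DOMDocument();
    $document->loadHTML( $html );

    // $tag is a plain tag name here, e.g. 'a' (not an XPath expression).
    $tags = [];
    // getElementsByTagName() always returns a DOMNodeList, so no extra check is needed.
    foreach ( $document->getElementsByTagName( $tag ) as $element ) {
        // Serialize each matching element, including its children.
        $tags[] = $document->saveHTML( $element );
    }
    return $tags;
}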

Proof of concept (here we split the text from the <a> tag):

$html = '<a href="http://localhost/wordpress/image3/tags-sidebar/" rel="attachment wp-att-731">
        <img src="http://localhost/wordpress/wp-content/uploads/2014/12/tags-sidebar.jpg" alt="tags sidebar" width="318" height="792" class="alignright size-full wp-image-731" />
    </a>
    Cras malesuada turpis et augue feugiat, eget mollis tellus elementum. 
    Nunc posuere mattis arcu, ut varius ipsum molestie in. 
    Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; 
    Morbi ultricies tincidunt odio nec suscipit. Sed porttitor metus ut tincidunt interdum. 
    Etiam lobortis mollis augue at aliquam. Nunc venenatis elementum quam sed elementum. 
    Pellentesque congue pellentesque orci, vel convallis augue semper vitae';

?><pre><?php var_dump(get_html_tag_content($html, 'a')); ?></pre><?php  
?><pre><?php var_dump(get_content_without($html, '//a')); ?></pre><?php 

My question is: is there something similar to match and remove shortcodes in WordPress? The built-in functions in WordPress are really crappy and match all shortcodes.

I have found many examples using regex, but none using the DOM. Here are two examples of shortcodes

How do I match the audio shortcode, and how do I match the gallery shortcode? Is this possible using the DOM instead of regex, and if so, how?

1 comment

  1. It’s not possible to isolate a shortcode using just the DOM.

    The characters [ and ] have no special meaning in HTML or XML. So to the DOM parser, [shortcode] is no different than ipsum in your sample text above. It’s just another piece of the text node, so the only way to locate those is via the string functions, such as using a regex.
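To illustrate the point, here is a minimal sketch assuming a made-up piece of content with a [gallery] shortcode (the sample string and the ids are hypothetical). It shows that the DOM knows no element called gallery, and that the shortcode only turns up as part of a text node, where plain string functions such as strpos() have to do the locating.

```php
<?php
// Hypothetical sample content; the shortcode attributes are made up.
$html = '<p>Intro text [gallery ids="1,2,3"] and more text.</p>';

$dom = new DOMDocument();
libxml_use_internal_errors( true ); // keep parser notices out of the output
$dom->loadHTML( $html );
libxml_clear_errors();

// The DOM has no element named "gallery": the brackets are just text.
$as_element = $dom->getElementsByTagName( 'gallery' )->length;

// Walk every text node and look for the shortcode with string functions.
$xpath = new DOMXPath( $dom );
$found = [];
foreach ( $xpath->query( '//text()' ) as $text_node ) {
    $text  = $text_node->nodeValue;
    $start = strpos( $text, '[gallery' );
    if ( false === $start ) {
        continue;
    }
    // Naive end detection: the first "]" after the opening bracket.
    $end = strpos( $text, ']', $start );
    if ( false === $end ) {
        continue;
    }
    $found[] = substr( $text, $start, $end - $start + 1 );
}

var_dump( $as_element, $found );
```

This is deliberately naive: it would also match a tag such as [gallery_extra], it stops at the first ] even if one appears inside an attribute value, and it finds only the first occurrence per text node. Handling those cases is where a regex usually comes back in.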

    Web Components (custom elements together with Shadow DOM) are the up-and-coming standard for what are essentially native HTML shortcodes. As of today, native support is spotty. If you wanted to replace your shortcodes with something the DOM can parse, this would be the way to go.
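
As a rough sketch of what "DOM parseable" could look like on the PHP side: if the content used an element instead of a bracket shortcode, the same DOMDocument approach from the question would find and remove it. The <wp-gallery> tag and its ids attribute are invented for this example and are not part of WordPress.

```php
<?php
// Hypothetical markup: <wp-gallery> and its "ids" attribute are invented
// for this sketch.
$html = '<p>Intro text.</p><wp-gallery ids="1,2,3"></wp-gallery><p>More text.</p>';

$dom = new DOMDocument();
// libxml reports unknown tag names as errors; collect them silently.
libxml_use_internal_errors( true );
$dom->loadHTML( $html );
libxml_clear_errors();

// Copy the live node list to an array so that removing nodes is safe.
$galleries = iterator_to_array( $dom->getElementsByTagName( 'wp-gallery' ) );

$ids = [];
foreach ( $galleries as $gallery ) {
    // Read the attribute, then drop the element from the document.
    $ids[] = $gallery->getAttribute( 'ids' );
    $gallery->parentNode->removeChild( $gallery );
}

$remaining = $dom->getElementsByTagName( 'wp-gallery' )->length;
var_dump( $ids, $remaining );
```

Whether swapping shortcodes for elements like this is practical depends on the editor and the theme, so treat it as an illustration of the idea rather than a recommendation.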