Regex match and replace advanced

I’m trying to write a little WordPress plugin to support some migrated content.

The syntax highlighter expects (for proper highlighting):

<pre lang='something'>
  <code>
    The code...
  </code>
</pre>

However, my markdown code has the following:

<pre>
  <code>
    :::something
    The code...
  </code>
</pre>

I think you can see where this is going. What I want to achieve is this:

  1. :::something should be removed, and the <pre> tag should be updated to <pre lang="something">.
  2. If :::something does not exist, the <pre> tag should be <pre lang="plain">
  3. There may be multiple occurrences per page that need to be updated.

What would a PHP function that achieves the above look like?

function set_syntax_lang($content) {
  // Do stuff here
  return $new_content;
}

What I gathered so far is this regex:

/<pre.*>\s*<code>\s*:::(\w)/

With preg_match, this even gives me the actual syntax indicator (something), but I don’t know how to update the <pre> tag correctly.

It’s been a very long time since I coded PHP and regexes are not really my strong suit. So all help is appreciated.

3 comments

  1. Finding :::something

    preg_replace( '/<pre(.*>\s*<code>\s*):::(\w+)/U', '<pre lang="$2"$1' , $html );
    

    This is an edge case, but normally I would advise you NOT to use regex for HTML (bobince, anyone?).

    Also, next time try to be less verbose in your question. It took me longer to read it than to write this answer.

    Finding code without :::something

    preg_replace( '/<pre>(\s*<code>)(?!\s*:::\w+)/', '<pre lang="plain">$1' , $html );
    

    Fixing <code>

    preg_replace( array( '/(<pre.*>)\s*<code>/U' , '/<\/code>\s*(<\/pre>)/U' ),
                  '$1' , $html );
    // Completely untested
    
  2. You answered most of your question in the steps you gave. Break it down into those chunks — FIRST see if you have :::something, THEN update your <pre> tag and REPEAT.

    You’ll have a much easier time of it if you use the DOM instead of regex. It will make the job of navigating through the <pre> and <code> tags very simple. As has been said many, many times here, HTML is not a regular language, so a regular expression cannot parse it correctly. Even for a limited subset of HTML, it’s really not the right tool. The regex for :::something is trivial once you use the DOM to get the text between <code> and </code>: /:::(\w+)/
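
    A minimal sketch of that DOM approach, assuming PHP's DOM extension is available. The function name set_syntax_lang_dom() and the wrapper <div> are illustrative choices, not something from the question, and the marker is only looked for in the first text node inside <code>:

    ```php
    <?php
    // Sketch only: set_syntax_lang_dom() is a hypothetical name, and wrapping the
    // fragment in a <div> is an assumption made so it can be extracted again.
    function set_syntax_lang_dom(string $content): string
    {
        $doc = new DOMDocument();
        libxml_use_internal_errors(true); // silence warnings about imperfect markup
        // The XML encoding hint is a common workaround so UTF-8 input is read as UTF-8.
        $doc->loadHTML('<?xml encoding="utf-8"?><div>' . $content . '</div>');
        libxml_clear_errors();

        foreach ($doc->getElementsByTagName('pre') as $pre) {
            $lang = 'plain'; // default when no :::marker is found
            $code = $pre->getElementsByTagName('code')->item(0);
            $text = $code !== null ? $code->firstChild : null;
            if ($text instanceof DOMText
                && preg_match('/^\s*:::(\w+)/', $text->nodeValue, $m)) {
                $lang = $m[1];
                // Drop the marker together with the whitespace in front of it.
                $text->nodeValue = substr($text->nodeValue, strlen($m[0]));
            }
            $pre->setAttribute('lang', $lang);
        }

        // Serialise only the children of the wrapper <div>, not the whole document.
        $wrapper = $doc->getElementsByTagName('div')->item(0);
        $out = '';
        foreach ($wrapper->childNodes as $child) {
            $out .= $doc->saveHTML($child);
        }
        return $out;
    }
    ```

    Note that the parser re-serialises the markup, so whitespace and entity encoding in the output may differ slightly from the input.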

  3. First of all, some points I stumbled over:

    /<pre.*>\s*<code>\s*:::(\w)/
         ^ 
    

    According to your question, there never is a space in there if you make use of :::something. But you add it into your regex. I wonder why.

    /<pre.*>\s*<code>\s*:::(\w)/
                             ^ 
    

    If the language specifier is longer than one character (which I assume), you must write that into the regex, like \w+ for one or more word characters.

    The rest looks like you already have nearly everything, except probably the replacement:

    $result = preg_replace( '((<pre)(>\s*<code>\s*):::(\w+))', '$1 lang="$3"$2' , $subject );
    

    Hopefully this helps.
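
    For reference, the three requirements from the question could be handled in a single pass with preg_replace_callback. This is only a sketch: it assumes every block starts with a bare <pre> (no attributes) directly followed by <code>, as in the question's examples, and it has not been run against real WordPress content.

    ```php
    <?php
    // Sketch of the set_syntax_lang() function asked for in the question.
    // Assumption: blocks always open with a bare <pre> followed by <code>.
    function set_syntax_lang(string $content): string
    {
        // Group 1: whitespace between <pre> and <code>, kept as-is.
        // Group 2: language name after ":::" (missing when there is no marker).
        $pattern = '~<pre>(\s*)<code>(?:\s*:::(\w+))?~';

        $result = preg_replace_callback(
            $pattern,
            function (array $m): string {
                // Fall back to "plain" when no :::marker was captured.
                $lang = $m[2] ?? 'plain';
                return '<pre lang="' . $lang . '">' . $m[1] . '<code>';
            },
            $content
        );

        // preg_replace_callback returns null on a PCRE error; keep the input then.
        return $result ?? $content;
    }
    ```

    Because the callback runs for every match, multiple blocks on one page are all updated. In a plugin it could presumably be attached with add_filter('the_content', 'set_syntax_lang'), though the right priority relative to the highlighter would need checking.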