Regex match and replace advanced

February 18, 2023

I’m trying to write a little WordPress plugin to support some migrated content.

The syntax highlighter expects (for proper highlighting):

<pre lang='something'>
  <code>
    The code...
  </code>
</pre>

However, my markdown code has the following:

<pre>
  <code>
    :::something
    The code...
  </code>
</pre>

I think you can see where this is going. What I want to achieve is this:

:::something should be removed, and the <pre> tag should be updated to <pre lang="something">.
If :::something does not exist, the <pre> tag should be <pre lang="plain">
There may be multiple occurrences per page that need to be updated.

What would a PHP function achieving the above look like?

function set_syntax_lang($content) {
  // Do stuff here
  return $new_content;
}

What I gathered so far is this regex:

/<pre.*>\s*<code>\s*:::(\w)/

Using preg_match, this even gives me the actual syntax indicator (something), but I don’t know how to update the <pre> tag correctly.

It’s been a very long time since I coded PHP and regexes are not really my strong suit. So all help is appreciated.


3 comments

Anonymous says:

February 18, 2023 at 10:01 am
Finding :::something
```
preg_replace( '/<pre([^>]*>\s*<code>\s*):::(\w+)/', '<pre lang="$2"$1' , $html );
```
This is an edge case, but normally I would advise you NOT to use regex for HTML (bobince, anyone?).

Also, next time try to be less verbose in your question. It took me more time to read it than to write this answer.

Finding code without :::something
```
preg_replace( '/<pre(>\s*<code>)(?!\s*:::\w)/', '<pre lang="plain"$1' , $html );
```
Fixing <code>
```
preg_replace( array( '/(<pre.*>)\s*<code>/U' , '/<\/code>\s*(<\/pre>)/U' ),
              '$1' , $html );
//> Completely untested
```
Anonymous says:

February 18, 2023 at 10:01 am

You answered most of your question in the steps you gave. Break it down into those chunks — FIRST see if you have :::something, THEN update your <pre> tag and REPEAT.

You’ll have a much easier time of it if you use the DOM instead of regex. It will make the job of navigating through the <pre> and <code> tags very simple. As has been said many, many times here, HTML is not a regular language, so a regular expression cannot parse it correctly. Even for a limited subset of HTML, it’s really not the right tool. The regex for :::something is trivial once you use the DOM to get the text between <code> and </code>: /:::(\w+)/
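The DOM approach described above could be sketched roughly as follows. This is only an illustration under a few assumptions: the function name set_syntax_lang_dom and the wrapper <div> are made up for the example, the input is taken to be a UTF-8 HTML fragment, and the :::something marker is assumed to be the first text inside <code>.

```php
<?php
// Hypothetical helper: the name and the wrapper-div trick are assumptions, not part of the original question.
function set_syntax_lang_dom(string $content): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings about fragment markup
    // The XML encoding hint keeps UTF-8 intact; the wrapper div gives us one node to serialize from.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $content . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('pre') as $pre) {
        $lang = 'plain'; // default when no :::marker is present
        $code = $pre->getElementsByTagName('code')->item(0);
        if ($code !== null && $code->firstChild instanceof DOMText) {
            $text = $code->firstChild;
            // Marker must be the first non-whitespace text inside <code>.
            if (preg_match('/^\s*:::(\w+)[ \t]*\R?/', $text->nodeValue, $m)) {
                $lang = $m[1];
                $text->nodeValue = substr($text->nodeValue, strlen($m[0]));
            }
        }
        $pre->setAttribute('lang', $lang);
    }

    // Serialize only the children of the wrapper div, not the html/body that libxml adds.
    $out = '';
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Note that serializing through the DOM may normalize the markup slightly (attribute quoting, entity escaping), so the output is not guaranteed to be byte-for-byte identical outside the changed parts.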

Anonymous says:

February 18, 2023 at 10:01 am
First of all, some points I noticed:
```
/<pre.*>\s*<code>\s*:::(\w)/
     ^ 
```
According to your question, there is never a space in there if you make use of :::something, yet you allow for it in your regex. I wonder why.
```
/<pre.*>\s*<code>\s*:::(\w)/
                         ^ 
```
If the language specifier is longer than one character (which I assume), you must write that into the regex, like \w+ for one or more word characters.

The rest looks like you already have everything, except perhaps the replacement:
```
$result = preg_replace( '((<pre)(>\s*<code>\s*):::(\w+))', '$1 lang="$3"$2' , $subject );
```
Hopefully this helps.
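Putting the pieces together, a single pass with preg_replace_callback can handle both the :::something case and the plain fallback. The following is a minimal sketch, not a definitive implementation: it assumes every block opens with a bare <pre> followed by <code>, and that the marker, if present, is the first thing inside <code>. The function name comes from the question; the pattern itself is my own assumption.

```php
<?php
function set_syntax_lang(string $content): string
{
    // Group 1: whitespace and <code> (kept as-is). Group 2: the optional language marker.
    $pattern = '~<pre>(\s*<code>\s*)(?::::(\w+)\s*)?~';

    $result = preg_replace_callback(
        $pattern,
        function (array $m): string {
            // Group 2 is absent from $m when there is no :::marker.
            $lang = $m[2] ?? 'plain';
            return '<pre lang="' . $lang . '">' . $m[1];
        },
        $content
    );

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $content;
}
```

Because only a bare <pre> matches, running the function twice does not add a second lang attribute. In a WordPress plugin this would typically be hooked up with add_filter('the_content', 'set_syntax_lang');.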