Most efficient way to search for values from CPT in Post content

I have a Glossary CPT. Title, editor, slug and featured image.

I would like to add a filter to Post content (and content of my other CPT “species”) which looks through all of the content in each post and replaces any words found in the Glossary with a link to the glossary entry.

Read More

For example…

Glossary entry: test

Post content: <p>This is a test post to explain what I mean.</p>

Outcome after filter: <p>This is a <a href="/glossary/t/test/">test</a> post to explain what I mean.</p>.

There may be up to 2000 glossary entries at any given time however, so I’m worried about efficiency.


EDIT: THIS CODE NEEDS SOME REWORKING!

I’ve started to use code as suggested by Matthew Boynes.

I have adapted it somewhat, as I was receiving an error with my $wordlist variable.

More information on said error is available here.

function add_glossary_links($content) {
    global $wpdb, $wordlist;
    if ( !$wordlist && !$wordlist = get_option('wordlist') ) {
        mysql_query('SET SESSION group_concat_max_len = 100000');
        $wordlist = $wpdb->get_var('SELECT GROUP_CONCAT(DISTINCT post_title SEPARATOR "|") AS list FROM '.$wpdb->posts.' WHERE post_status="publish" AND post_type="glossary" AND post_parent>0');
        add_option('wordlist', $wordlist);
    }

    $wordlist = str_replace(array(" ", "'", ".", "/"), array("s", "'", ".", "/"), $wordlist);

    $content = preg_replace_callback(
        '/b('.$wordlist.')b/i',
        create_function(
            '$matches',
            'return "<a href="/glossary/" . strtolower(substr($matches[0],0,1) . "/" . $matches[0]) . "/">" . $matches[0] . "</a>";'
        ),
        $content
    );

    return preg_replace('/(<[^<]+)<as.*?>(.*?)</a>/si','$1$2', $content);
}

add_filter( 'the_content', 'add_glossary_links' );

The function above works, to an extent. I now have two issues:

  1. The function is linking everything:

    [<a href="/glossary///"></a>caption <a href="/glossary///"></a>id="<a href="/glossary///"></a>attachment_542" <a href="/glossary///"></a>align="<a href="/glossary///"></a>alignleft" <a href="/glossary///"></a>width="<a href="/glossary///"></a>125" <a href="/glossary///"></a>caption="<a href="/glossary///"></a>Amazonas <a href="/glossary///"></a>Magazine - <a href="/glossary///"></a>now <a href="/glossary///"></a>in <a href="/glossary///"></a>English!"]<<a href="/glossary///"></a>a <a href="/glossary///"></a>href="<a href="/glossary///"></a>http://<a href="/glossary///"></a>www.<a href="/glossary///"></a>seriouslyfish.<a href="/glossary///"></a>com/<a href="/glossary///"></a>dev/<a href="/glossary///"></a>wp-<a href="/glossary///"></a>content/<a href="/glossary///"></a>uploads/<a href="/glossary///"></a>2011/<a href="/glossary///"></a>12/<a href="/glossary///"></a>Amazonas-<a href="/glossary///"></a>English-<a href="/glossary///"></a>1.<a href="/glossary///"></a>jpg"><<a href="/glossary///"></a>img <a href="/glossary///"></a>class="<a href="/glossary///"></a>size-<a href="/glossary///"></a>thumbnail <a href="/glossary///"></a>wp-<a href="/glossary///"></a>image-<a href="/glossary///"></a>542" <a href="/glossary///"></a>title="<a href="/glossary///"></a>Amazonas <a href="/glossary///"></a>English" <a href="/glossary///"></a>src="<a href="/glossary///"></a>http://<a href="/glossary///"></a>www.<a href="/glossary///"></a>seriouslyfish.<a href="/glossary///"></a>com/<a href="/glossary///"></a>dev/<a href="/glossary///"></a>wp-<a href="/glossary///"></a>content/<a href="/glossary///"></a>uploads/<a href="/glossary///"></a>2011/<a href="/glossary///"></a>12/<a href="/glossary///"></a>Amazonas-<a href="/glossary///"></a>English-<a href="/glossary///"></a>1-<a href="/glossary///"></a>288x381.<a href="/glossary///"></a>jpg" <a href="/glossary///"></a>alt="<a href="/glossary///"></a>Amazonas <a href="/glossary///"></a>English" <a href="/glossary///"></a>width="<a href="/glossary///"></a>125" <a href="/glossary///"></a>height="<a href="/glossary///"></a>165" /></a<a href="/glossary///"></a>>[/caption<a href="/glossary///"></a>]
    <<a href="/glossary///"></a>p><a href="/glossary///"></a>Edited <a href="/glossary///"></a>by <a href="/glossary///"></a>Hans-<a href="/glossary///"></a>Georg <a href="/glossary///"></a>Evers, <a href="/glossary///"></a>the <a href="/glossary///"></a>magazine &#<a href="/glossary///"></a>8216;<a href="/glossary///"></a>Amazonas&#<a href="/glossary///"></a>8217; <a href="/glossary///"></a>has <a href="/glossary///"></a>been <a href="/glossary///"></a>widely-<a href="/glossary///"></a>regarded <a href="/glossary///"></a>as <a href="/glossary///"></a>among <a href="/glossary///"></a>the <a href="/glossary///"></a>finest <a href="/glossary///"></a>regular <a href="/glossary///"></a>publications <a href="/glossary///"></a>in <a href="/glossary///"></a>the <a href="/glossary///"></a>hobby <a href="/glossary///"></a>since <a href="/glossary///"></a>its <a href="/glossary///"></a>launch <a href="/glossary///"></a>in <a href="/glossary///"></a>2005, <a href="/glossary///"></a>an <a href="/glossary///"></a>impressive <a href="/glossary///"></a>achievment <a href="/glossary///"></a>considering <a href="/glossary///"></a>it&#<a href="/glossary///"></a>8217;<a href="/glossary///"></a>s <a href="/glossary///"></a>only <a href="/glossary///"></a>been <a href="/glossary///"></a>published <a href="/glossary///"></a>in <a href="/glossary///"></a>German <a href="/glossary///"></a>to <a href="/glossary///"></a>date. <a href="/glossary///"></a>The <a href="/glossary///"></a>long-<a href="/glossary///"></a>awaited <a href="/glossary///"></a>English <a href="/glossary///"></a>version <a href="/glossary///"></a>is <a href="/glossary///"></a>just <a href="/glossary///"></a>about <a href="/glossary///"></a>to <a href="/glossary///"></a>launch, <a href="/glossary///"></a>and <a href="/glossary///"></a>we <a href="/glossary///"></a>think <a href="/glossary///"></a>a <a href="/glossary///"></a>subscription <a href="/glossary///"></a>should <a href="/glossary///"></a>be <a href="/glossary///"></a>top <a href="/glossary///"></a>of <a href="/glossary///"></a>any <a href="/glossary///"></a>serious <a href="/glossary///"></a>fishkeeper&#<a href="/glossary///"></a>8217;<a href="/glossary///"></a>s <a href="/glossary///"></a>Xmas <a href="/glossary///"></a>list&#<a href="/glossary///"></a>8230;</p<a href="/glossary///"></a>>
    <<a href="/glossary///"></a>p><a href="/glossary///"></a>The <a href="/glossary///"></a>magazine <a href="/glossary///"></a>is <a href="/glossary///"></a>published <a href="/glossary///"></a>in <a href="/glossary///"></a>a <a href="/glossary///"></a>bi-<a href="/glossary///"></a>monthly <a href="/glossary///"></a>basis <a href="/glossary///"></a>and <a href="/glossary///"></a>the <a href="/glossary///"></a>English <a href="/glossary///"></a>version <a href="/glossary///"></a>launches <a href="/glossary///"></a>with <a href="/glossary///"></a>the <a href="/glossary///"></a>January/<a href="/glossary///"></a>February <a href="/glossary///"></a>2012 <a href="/glossary///"></a>issue <a href="/glossary///"></a>with <a href="/glossary///"></a>distributors <a href="/glossary///"></a>already <a href="/glossary///"></a>organised <a href="/glossary///"></a>in <a href="/glossary///"></a>the <a href="/glossary///"></a>United <a href="/glossary///"></a>States, <a href="/glossary///"></a>Canada, <a href="/glossary///"></a>the <a href="/glossary///"></a>United <a href="/glossary/k/kingdom/">Kingdom</a>, <a href="/glossary///"></a>South <a href="/glossary///"></a>Africa, <a href="/glossary///"></a>Australia, <a href="/glossary///"></a>and <a href="/glossary///"></a>New <a href="/glossary///"></a>Zealand. <a href="/glossary///"></a>There <a href="/glossary///"></a>are <a href="/glossary///"></a>also <a href="/glossary///"></a>mobile <a href="/glossary///"></a>apps <a href="/glossary///"></a>availablen <a href="/glossary///"></a>which <a href="/glossary///"></a>allow <a href="/glossary///"></a>digital <a href="/glossary///"></a>subscribers <a href="/glossary///"></a>to <a href="/glossary///"></a>read <a href="/glossary///"></a>on <a href="/glossary///"></a>portable <a href="/glossary///"></a>devices.</p<a href="/glossary///"></a>>
    <<a href="/glossary///"></a>p><a href="/glossary///"></a>It&#<a href="/glossary///"></a>8217;<a href="/glossary///"></a>s <a href="/glossary///"></a>fair <a href="/glossary///"></a>to <a href="/glossary///"></a>say <a href="/glossary///"></a>that <a href="/glossary///"></a>there <a href="/glossary///"></a>currently <a href="/glossary///"></a>exists <a href="/glossary///"></a>no <a href="/glossary///"></a>better <a href="/glossary///"></a>publication <a href="/glossary///"></a>for <a href="/glossary///"></a>dedicated <a href="/glossary///"></a>hobbyists <a href="/glossary///"></a>with <a href="/glossary///"></a>each <a href="/glossary///"></a>issue <a href="/glossary///"></a>featuring <a href="/glossary///"></a>cutting-<a href="/glossary///"></a>edge <a href="/glossary///"></a>articles <a href="/glossary///"></a>on <a href="/glossary///"></a>fishes, <a href="/glossary///"></a>invertebrates, <a href="/glossary/a/aquatic/">aquatic</a> <a href="/glossary///"></a>plants, <a href="/glossary///"></a>field <a href="/glossary///"></a>trips <a href="/glossary///"></a>to <a href="/glossary///"></a>tropical<a href="/glossary///"></a> destinations<a href="/glossary///"></a> plus<a href="/glossary///"></a> the<a href="/glossary///"></a> latest<a href="/glossary///"></a> in<a href="/glossary///"></a> husbandry<a href="/glossary///"></a> and<a href="/glossary///"></a> breeding<a href="/glossary///"></a> breakthroughs<a href="/glossary///"></a> by<a href="/glossary///"></a> expert<a href="/glossary///"></a> aquarists<a href="/glossary///"></a>, all<a href="/glossary///"></a> accompanied<a href="/glossary///"></a> by<a href="/glossary///"></a> excellent<a href="/glossary///"></a> photography<a href="/glossary///"></a> throughout<a href="/glossary///"></a>.</p<a href="/glossary///"></a>>
    <<a href="/glossary///"></a>p><a href="/glossary///"></a>U.<a href="/glossary///"></a>S. <a href="/glossary///"></a>residents <a href="/glossary///"></a>can <a href="/glossary///"></a>subscribe <a href="/glossary///"></a>to <a href="/glossary///"></a>the <a href="/glossary///"></a>printed <a href="/glossary///"></a>edition <a href="/glossary///"></a>for <a href="/glossary///"></a>just $<a href="/glossary///"></a>29 <a href="/glossary///"></a>USD <a href="/glossary///"></a>per <a href="/glossary///"></a>year, <a href="/glossary///"></a>which <a href="/glossary///"></a>also <a href="/glossary///"></a>includes <a href="/glossary///"></a>a <a href="/glossary///"></a>free <a href="/glossary///"></a>digital <a href="/glossary///"></a>subscription, <a href="/glossary///"></a>with <a href="/glossary///"></a>the <a href="/glossary///"></a>same <a href="/glossary///"></a>offer <a href="/glossary///"></a>available <a href="/glossary///"></a>to <a href="/glossary///"></a>Canadian <a href="/glossary///"></a>readers <a href="/glossary///"></a>for $<a href="/glossary///"></a>41 <a href="/glossary///"></a>USD <a href="/glossary///"></a>or <a href="/glossary///"></a>overseas <a href="/glossary///"></a>subscribers <a href="/glossary///"></a>for $<a href="/glossary///"></a>49 <a href="/glossary///"></a>USD. <a href="/glossary///"></a>Please <a href="/glossary///"></a>see <a href="/glossary///"></a>the <<a href="/glossary///"></a>a <a href="/glossary///"></a>href="<a href="/glossary///"></a>http://<a href="/glossary///"></a>www.<a href="/glossary///"></a>amazonasmagazine.<a href="/glossary///"></a>com/"><a href="/glossary///"></a>Amazonas <a href="/glossary///"></a>website</a<a href="/glossary///"></a>> for<a href="/glossary///"></a> further<a href="/glossary///"></a> information<a href="/glossary///"></a> and<a href="/glossary///"></a> a<a href="/glossary///"></a> sample<a href="/glossary///"></a> digital<a href="/glossary///"></a> issue<a href="/glossary///"></a>!</p<a href="/glossary///"></a>>
    <<a href="/glossary///"></a>p><a href="/glossary///"></a>Alternatively, <a href="/glossary///"></a>subscribe <a href="/glossary///"></a>directly <a href="/glossary///"></a>to <a href="/glossary///"></a>the <a href="/glossary///"></a>print <a href="/glossary///"></a>version <<a href="/glossary///"></a>a <a href="/glossary///"></a>href="<a href="/glossary///"></a>https://<a href="/glossary///"></a>www.<a href="/glossary///"></a>amazonascustomerservice.<a href="/glossary///"></a>com/<a href="/glossary///"></a>subscribe/<a href="/glossary///"></a>index2.<a href="/glossary///"></a>php"><a href="/glossary///"></a>here</a<a href="/glossary///"></a>> or<a href="/glossary///"></a> digital<a href="/glossary///"></a> version<a href="/glossary///"></a> <<a href="/glossary///"></a>a <a href="/glossary///"></a>href="<a href="/glossary///"></a>https://<a href="/glossary///"></a>www.<a href="/glossary///"></a>amazonascustomerservice.<a href="/glossary///"></a>com/<a href="/glossary///"></a>subscribe/<a href="/glossary///"></a>digital.<a href="/glossary///"></a>php"><a href="/glossary///"></a>here</a<a href="/glossary///"></a>>.</p<a href="/glossary///"></a>>

  2. This may be a pushing my luck, but would it be possible to make this code case-sensitive, i.e. content only matches wordlist if case is the same, unless it’s the first word of a sentence?

Thanks in advance,

Related posts

Leave a Reply

3 comments

  1. This was a fun project. I think I could write an entire chapter in a book on this (in fact, I practically did below). Anyway, I messed around with a bunch of ways to do this, and here’s the simplest way I came up with:

    function add_glossary_links($content) {
        global $wpdb, $wordlist;
        if ( !$wordlist && !$wordlist = get_option('wordlist') ) {
            mysql_query('SET SESSION group_concat_max_len = 100000');
            $wordlist = $wpdb->get_var('SELECT GROUP_CONCAT(DISTINCT word SEPARATOR "|") AS `list` FROM '.$wpdb->prefix.'glossary');
            add_option('wordlist', $wordlist);
        }
        $content = preg_replace_callback(
            '/b('.$wordlist.')b/i',
            create_function(
                '$matches',
                'return "<a href="/glossary/".strtolower(substr($matches[0],0,1)."/".$matches[0])."/">".$matches[0]."</a>";'
            ),
            $content
        );
        return preg_replace('/(<[^<]+)<as.*?>(.*?)</a>/si','$1$2', $content);
    }
    add_filter( 'the_content', 'add_glossary_links' );
    

    Discussion

    Assuming a table prefix_glossary of words, we’re selecting a pipe-delimited list of all records. This will later become part of the regular expression when we search our post for the words. Before we do this, we’re setting a session variable for group_concat_max_length, which by default truncates to 1024 characters. 100,000 characters gives you room for 10,000 9-letter words (and 1 10 letter word, to be precise), so you can up that if you need more.

    In preg_replace_callback, we’re replacing all instances of the words with themselves wrapped in anchor tags. Adjust the HREF as necessary.

    This might have malformed some HTML (e.g. wrapping a word in an image alt with an anchor tag), so the next preg_replace fixes that. If you want, you could do another one to ensure that words already wrapped in anchor tags don’t get wrapped in another anchor tag.

    For performance testing, I built a glossary table of 1,001 random words 5-15 characters long. Performance does indeed suffer if the SELECT GROUP_CONCAT... results are not cached in some way, to the tune of a full second. As you can see, what I’ve done here is stored it in an option (of course, you’d want to update that option any time your glossary is updated). You could take any number of steps to boost performance here, and this is just one suggestion. Using apache benchmark to test, my page loads increased an average of 6.8% (with a very large standard deviation, so it’s actually not a statistically significant difference, but I digress), which is on the fence; whether that is crucial or not is based entirely on how much traffic the site gets. Although, if it got a lot of traffic, one would hope it would have caching enabled anyway, in which a case the difference would be completely negligible.

    Conclusion

    This works, but you’d want to revisit the approach if that 6.8% was critical. Specifically, you could consider doing the replacements with javascript. You could do it in similar fashion, writing the wordlist out as a JS variable. This way your server doesn’t have to handle the search processing and you let your users’ machines do all the work.

    Alternatives

    Another way I tried doing this, which was significantly faster but required more coding, was to store an index when a post was published or updated. In that case, I searched the dictionary against every word in the post and stored the matches in an index table. Then on page load, replace the shortened wordlist in the post like above, so then I was only processing the list of known matches instead of the full list (so, say 5 words instead of 1000). If you wanted to explore that route, you need two tables:

    CREATE TABLE `wp_glossary` (
      `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
      `word` tinytext,
      `definition` text,
      PRIMARY KEY (`id`)
    ) ENGINE=MyISAM AUTO_INCREMENT=1002 DEFAULT CHARSET=latin1;
    
    CREATE TABLE `wp_posts_glossary` (
      `post_id` int(11) unsigned NOT NULL,
      `glossary_id` int(11) unsigned NOT NULL
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
    

    And you’d create the index on post update using something like:

    global $wpdb, $post;
    $wpdb->query('DELETE FROM '.$wpdb->prefix.'posts_glossary WHERE post_id='.$post->ID);
    $sql_regex = rtrim(preg_replace('/[Ws]*(w+)[Ws]*/','$1|',$post->post_content),'|');
    $wpdb->query('INSERT INTO '.$wpdb->prefix.'posts_glossary (post_id,glossary_id)
        SELECT '.$post->ID.',ID FROM '.$wpdb->prefix.'glossary WHERE word REGEXP("^'.$sql_regex.'$")';
    

    I used my 1,001-word glossary and the 2,215-word first chapter of Moby Dick to test the indexing and it averaged ~1.5 seconds to build. Therefore, if you went this route, you should consider leveraging wp_cron so you don’t have to wait for it to index after you hit “Update” or “Publish”.

    Fun stuff. Hope this helps!

    Cheers~

  2. The solution above are pretty good, and the performance concerns are right.
    But the “caching” solution is really simple, so you can do it without much effort:
    https://codex.wordpress.org/Transients_API
    you should cache the list of glossary terms with a large timeframe (you can expect it being quite immutable: when you really need to update, use delete_transient())

    you can then modify the above script ( https://wordpress.stackexchange.com/a/42008/12843 )
    caching $wordlist

    function add_glossary_links($content) {
        global $wpdb;
        $wordlist = get_transient('glossary_list');
        if ( !$wordlist && !$wordlist = get_option('wordlist') ) {
            mysql_query('SET SESSION group_concat_max_len = 100000');
            $wordlist = $wpdb->get_var('SELECT GROUP_CONCAT(DISTINCT word SEPARATOR "|") AS `list` FROM '.$wpdb->prefix.'glossary');
            set_transient('wordlist', $wordlist, 60*60*24*2);
        }
        $content = preg_replace_callback( /*continues...*/
    

    this will make mysql queries very rare.
    However, if your wordlist is huge, the processing will still be a bit slow: to make it faster, use “standard” wordpress caching plugin, you won’t look back!

  3. Take a look at the Linkify Text plugin. It seems like it’s doing a very similar thing, and you’d just have to modify it to use your glossary CPT for the terms instead of it’s internal bank of terms.

    You could run a caching plugin like WP Super Cache to avoid performance issues.