Filter/Remove HTML Elements on all posts and pages

I’m working on a migration from Drupal to WP. The database conversion and import went smoothly, but there is a lot of ‘crap’ in each post content such as divs with inline styles. Basically, in each post (over 800 of them) I need to sort through them, remove all div tags but keep the actual content between the div tags.

Examples
A post with content like this:

<div class="contentHeader" style="clear: both; min-height: 40px; margin: 12px 0px 9px 9px; color: #f16000; font-family: Arial; font-size: 16px; font-weight: bold; text-align: left;">
<div class="title entry-title" style="font-family: Arial; font-size: 24px; line-height: 22px; color: #f16000;"><span style="font-size: 13px; color: #333333; font-family: 'Trebuchet MS', Arial, Helvetica, sans-serif;">Dear Neil: I am 55, and find myself single all over again. Trying to find a relationship is radically different than it was when I was in my 20s. I want to remarry, but it's harder to date at this age, and it is very difficult to evaluate whether someone would be compatible with me. I know I'm not as “hot” as I used to be, and the people I'm meeting aren't likely to win “sexiest man alive” contests anytime soon as well. Is there anything that could help me evaluate whether someone is a good potential intimate partner for me? There are millions of us in the second half of our lives trying to find each other. Can you help?</span>
<div class="articlemain" style="min-height: 1365px; color: #333333; font-family: 'Trebuchet MS', Arial, Helvetica, sans-serif; text-align: left;">
<div class="hnews hentry item">
<div class="content" style="font-size: 13px; padding: 17px 0px 17px 9px;">
<div class="entry-content">
<div class="articleparagraph">More content.....
</div>
</div>
</div>
</div>
</div>
</div>
</div>

I need to run some sort of script (with regex?) that will remove the ‘crap’ but keep the text between div and span tags:

Dear Neil: I am 55, and find myself single all over again. Trying to find a relationship is radically different than it was when I was in my 20s. I want to remarry, but it's harder to date at this age, and it is very difficult to evaluate whether someone would be compatible with me. I know I'm not as “hot” as I used to be, and the people I'm meeting aren't likely to win “sexiest man alive” contests anytime soon as well. Is there anything that could help me evaluate whether someone is a good potential intimate partner for me? There are millions of us in the second half of our lives trying to find each other. Can you help?

More content.....

Any ideas on the best way to accomplish this? Help is greatly appreciated.

1 comment

  1. Here’s an example function that might help you accomplish something like that. Basically, it fetches a couple of posts, loops through them, modifies the post_content field, and stores the changes.

    function wpse_87695_clean_post_content() {
        $posts = get_posts(array(
            'post_type' => array('post', 'page'),
            'post_status' => 'publish',
            /*
            'meta_query' => array(
                array(
                    'key' => '_wpse_87695_processed',
                    'value' => true,
                    'compare' => '!='
                )
            ),
            */
        ));
    
        foreach ($posts as $p) {
            $p->post_content = wpse_87695_filter_content($p->post_content);
            wp_update_post($p);
    
            // update_post_meta($p->ID, '_wpse_87695_processed', true);
        }
    
        die();
    }
    add_action('wp', 'wpse_87695_clean_post_content');
    
    function wpse_87695_filter_content($content) {
        return strip_tags($content); // wp_filter_nohtml_kses might be a more WordPress-friendly way to do this
    }
    

    First, you will want to refine the get_posts arguments so that they return only the posts you need to clean. You will probably also want to limit the number of posts per run, as you are unlikely to be able to process 800 posts at once, though set_time_limit can help increase the number of posts you can process in one go, depending on your configuration.

    Ideally you would also want to mark the posts already processed in some way, for example using update_post_meta, as this will allow you to filter them out using a meta_query key in the arguments array. That way you could process e.g. 50 posts at a time, reloading the page until all posts have been processed. I commented it out in my example code as I think it’ll need some more work.
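
    A rough sketch of that batching idea, under some assumptions: the helper names wpse_87695_batch_args and wpse_87695_clean_batch and the batch size of 50 are mine, not part of WordPress. I also used a 'NOT EXISTS' comparison instead of '!=', since as far as I can tell posts that were never marked have no meta row to compare against. Treat it as a starting point rather than a finished implementation.

    ```php
    <?php
    // Hypothetical helper: builds the get_posts() arguments for one batch.
    function wpse_87695_batch_args($batch_size = 50) {
        return array(
            'post_type'      => array('post', 'page'),
            'post_status'    => 'publish',
            'posts_per_page' => $batch_size,
            'meta_query'     => array(
                array(
                    'key'     => '_wpse_87695_processed',
                    // Match only posts that have not been marked yet.
                    'compare' => 'NOT EXISTS',
                ),
            ),
        );
    }

    // Hypothetical helper: cleans one batch and marks each post as processed.
    // It relies on WordPress functions, so only call it inside WordPress.
    function wpse_87695_clean_batch($batch_size = 50) {
        $posts = get_posts(wpse_87695_batch_args($batch_size));

        foreach ($posts as $p) {
            $p->post_content = wpse_87695_filter_content($p->post_content);
            wp_update_post($p);
            update_post_meta($p->ID, '_wpse_87695_processed', true);
        }

        // A return value of 0 means there is nothing left to process.
        return count($posts);
    }
    ```

    You would then reload the page (or call wpse_87695_clean_batch() again) until it returns 0.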

    Doing this work in a shared hosting environment might be very slow due to memory consumption and execution time limits. It’s also very likely that at some point (due to human error) you’ll corrupt data and have to start over, so I recommend that you run this on a backup of the database, preferably on a local machine.

    An alternative, which would free you from having to run the conversion in batches by hand, is to set up a small piece of JavaScript somewhere in the admin that repeatedly loads a certain URL, which in turn runs the above one post at a time until all posts have been processed.

    Also, the filter function I supplied (wpse_87695_filter_content), as you can see, is extremely rudimentary. All it does is run strip_tags() on the post_content to strip out all HTML in there. You will likely have to use regular expressions or an HTML parser, depending on your specific needs. For example, you will probably need to remove the excess newlines and make sure the paragraphs are separated by exactly two newlines.
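
    As one possible refinement, here is a minimal regex-based sketch that drops only the div and span wrappers, keeps everything else, and tidies the newlines. The function name wpse_87695_strip_wrappers is hypothetical, and the patterns assume reasonably well-formed markup with no ">" character inside attribute values. An HTML parser would be more robust for messier input.

    ```php
    <?php
    // Hypothetical replacement for the body of wpse_87695_filter_content().
    function wpse_87695_strip_wrappers($content) {
        // Treat every opening or closing div as a paragraph boundary.
        $content = preg_replace('#</?div\b[^>]*>#i', "\n\n", $content);

        // Drop span tags entirely, keeping the text inside them.
        $content = preg_replace('#</?span\b[^>]*>#i', '', $content);

        // Normalize line endings and remove trailing spaces or tabs on each line.
        $content = str_replace(array("\r\n", "\r"), "\n", $content);
        $content = preg_replace('/[ \t]+\n/', "\n", $content);

        // Collapse runs of three or more newlines into exactly two.
        $content = preg_replace('/\n{3,}/', "\n\n", $content);

        return trim($content);
    }
    ```

    Other tags such as links or emphasis pass through untouched, so this is less destructive than strip_tags().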

    Alternative approach

    Another solution could be to perform the filtering when the data is being output by WordPress. When you call the_content in your templates, WordPress will fetch $post->post_content and run a few filters on it using apply_filters('the_content', $post->post_content). This allows you to register the function I outlined above as a filter for all post content, by calling add_filter('the_content', 'wpse_87695_filter_content');.

    While this approach will save you the trouble of having to iterate over all the posts and update the database manually, it will require the same effort when it comes to writing a good filter function. Also, it will run every time a post is displayed, including for posts completely unrelated to your current import, unless you define some form of exception. Thus, it can be considered more of a quick fix, of course depending on the nature of your data. Perhaps you could store the result of the most important filtering in the database, and handle the rest at output time using WordPress filters, if that helps you in some way.
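
    One way such an exception could look, purely as a sketch: it assumes the imported posts were flagged during the migration with a meta key, here the made-up name _wpse_87695_imported. Nothing in the question says such a flag exists, so you would need to substitute whatever distinguishes your imported posts (a date range, a category, and so on). The wrapper name wpse_87695_maybe_filter_content is also hypothetical.

    ```php
    <?php
    // Hypothetical wrapper: filters only posts flagged as imported.
    function wpse_87695_maybe_filter_content($content) {
        // Assumption: imported posts carry the meta key '_wpse_87695_imported'.
        if (!get_post_meta(get_the_ID(), '_wpse_87695_imported', true)) {
            return $content; // Leave unrelated posts untouched.
        }

        return wpse_87695_filter_content($content);
    }

    // Register the filter only when running inside WordPress.
    if (function_exists('add_filter')) {
        add_filter('the_content', 'wpse_87695_maybe_filter_content');
    }
    ```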