I’m working on a migration from Drupal to WP. The database conversion and import went smoothly, but there is a lot of ‘crap’ in each post content such as divs with inline styles. Basically, in each post (over 800 of them) I need to sort through them, remove all div tags but keep the actual content between the div tags.
Examples
A post with content like this:
<div class="contentHeader" style="clear: both; min-height: 40px; margin: 12px 0px 9px 9px; color: #f16000; font-family: Arial; font-size: 16px; font-weight: bold; text-align: left;">
<div class="title entry-title" style="font-family: Arial; font-size: 24px; line-height: 22px; color: #f16000;"><span style="font-size: 13px; color: #333333; font-family: 'Trebuchet MS', Arial, Helvetica, sans-serif;">Dear Neil: I am 55, and find myself single all over again. Trying to find a relationship is radically different than it was when I was in my 20s. I want to remarry, but it's harder to date at this age, and it is very difficult to evaluate whether someone would be compatible with me. I know I'm not as âhotâ as I used to be, and the people I'm meeting aren't likely to win âsexiest man aliveâ contests anytime soon as well. Is there anything that could help me evaluate whether someone is a good potential intimate partner for me? There are millions of us in the second half of our lives trying to find each other. Can you help?</span>
<div class="articlemain" style="min-height: 1365px; color: #333333; font-family: 'Trebuchet MS', Arial, Helvetica, sans-serif; text-align: left;">
<div class="hnews hentry item">
<div class="content" style="font-size: 13px; padding: 17px 0px 17px 9px;">
<div class="entry-content">
<div class="articleparagraph">More content.....
</div>
</div>
</div>
</div>
</div>
</div>
</div>
I need to run some sort of script (with regex?) that will remove the ‘crap’ but keep the text between the div and span tags:
Dear Neil: I am 55, and find myself single all over again. Trying to find a relationship is radically different than it was when I was in my 20s. I want to remarry, but it's harder to date at this age, and it is very difficult to evaluate whether someone would be compatible with me. I know I'm not as âhotâ as I used to be, and the people I'm meeting aren't likely to win âsexiest man aliveâ contests anytime soon as well. Is there anything that could help me evaluate whether someone is a good potential intimate partner for me? There are millions of us in the second half of our lives trying to find each other. Can you help?
More content.....
Any ideas on the best way to accomplish this? Help is greatly appreciated.
Here’s an example function that might help you accomplish something like that. Basically, it fetches a couple of posts, loops through them, modifies the post_content field, and stores the changes.
First, you will want to refine the get_posts arguments so that they return only the posts you need to clean. You will probably also want to limit the number of posts, as you are unlikely to be able to process 800 posts at once, though set_time_limit can help increase the number of posts you can process in one run, depending on your configuration.
Ideally you would also mark the posts that have already been processed in some way, for example using update_post_meta, as this will allow you to filter them out using a meta_query key in the arguments array. That way you could process e.g. 50 posts at a time, reloading the page until all posts have been processed. I commented it out in my example code as I think it’ll need some more work.
Doing this work on a shared hosting environment might be very slow due to memory consumption and the execution time limit. It is also very likely that at some point (due to human error) you’ll corrupt data and have to start over, so I recommend that you run this on a backup database, preferably on a local machine.
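The batch idea described above could be sketched roughly as follows. This is only an illustration, not the original example code: the function name wpse_87695_clean_batch, the meta key _wpse_87695_cleaned and the batch size are assumptions, and the stand-in filter simply strips all tags. It relies on the WordPress functions get_posts, wp_update_post, wp_slash and update_post_meta, so it has to run inside WordPress (or against stubs).

```php
<?php
// Stand-in for the filter function, used only if none is defined yet.
if ( ! function_exists( 'wpse_87695_filter_content' ) ) {
    function wpse_87695_filter_content( $content ) {
        return strip_tags( $content ); // Rudimentary: drops every HTML tag.
    }
}

/**
 * Cleans one batch of posts and returns how many were processed.
 * The meta key and the default batch size are assumptions for this sketch.
 */
function wpse_87695_clean_batch( $batch_size = 50 ) {
    set_time_limit( 0 ); // Some hosts ignore or disable this.

    $posts = get_posts( array(
        'post_type'      => 'post',
        'post_status'    => 'any',
        'posts_per_page' => $batch_size,
        // Skip posts that were already marked as processed.
        'meta_query'     => array(
            array(
                'key'     => '_wpse_87695_cleaned',
                'compare' => 'NOT EXISTS',
            ),
        ),
    ) );

    foreach ( $posts as $post ) {
        $clean = wpse_87695_filter_content( $post->post_content );

        wp_update_post( array(
            'ID'           => $post->ID,
            // wp_update_post() unslashes its input, so slash the content first.
            'post_content' => wp_slash( $clean ),
        ) );

        // Mark the post so the next batch does not pick it up again.
        update_post_meta( $post->ID, '_wpse_87695_cleaned', 1 );
    }

    return count( $posts );
}
```

Calling wpse_87695_clean_batch() repeatedly, for example on each page reload, until it returns 0 would work through all posts in chunks. As said above, try it on a backup database first.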
An alternative way, which would free you from having to run the conversion in batches, is to set up a small JavaScript snippet that repeatedly loads a certain URL from somewhere in the admin, which in turn runs the above one post at a time until all posts have been processed.
Also, the filter function I supplied (wpse_87695_filter_content), as you can see, is extremely rudimentary. All it does is run strip_tags() on the post_content to strip out all HTML in there. You will likely have to use regular expressions or an HTML parser, depending on your specific needs. For example, you will probably need to remove the excess newlines and make sure the paragraphs are separated by exactly two newlines.
Alternative approach
Another solution could be to perform the filtering when the data is being output by WordPress. When you call the_content in your templates, WordPress will fetch $post->post_content and run a few filters on it using apply_filters('the_content', $post->post_content). This allows you to register the function I outlined above as a filter for all post content, by calling add_filter('the_content', 'wpse_87695_filter_content');.
While this approach will save you the trouble of having to iterate over all the posts and update the database manually, it will require the same effort when it comes to writing a good filter function. Also, it will run every time a post is displayed, including for posts completely unrelated to your current import, unless you define some form of exception. Thus, it can be considered more of a quick fix, of course depending on the nature of your data. Perhaps you could store the result of the most important filtering in the database and handle the rest using WordPress filters, if that helps you in some way.
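To make both points a bit more concrete, here is a rough sketch of a filter that removes only div and span tags (keeping the text between them) and is registered on the_content with a simple exception. The function names wpse_87695_strip_wrappers and wpse_87695_filter_imported_only and the meta key _wpse_87695_imported are made up for this example, and the regex assumes no attribute value contains a > character. An HTML parser would be more robust for messier markup.

```php
<?php
/**
 * Removes <div> and <span> tags (including inline styles) but keeps
 * the text between them, then tidies up the leftover blank lines.
 */
function wpse_87695_strip_wrappers( $content ) {
    // Drop opening and closing div/span tags together with their attributes.
    $content = preg_replace( '#</?(?:div|span)\b[^>]*>#i', '', $content );

    // Normalize line endings and trim trailing whitespace on each line.
    $content = str_replace( array( "\r\n", "\r" ), "\n", $content );
    $content = preg_replace( '/[ \t]+$/m', '', $content );

    // Collapse three or more newlines into exactly two.
    $content = preg_replace( "/\n{3,}/", "\n\n", $content );

    return trim( $content );
}

/**
 * the_content callback that only touches posts carrying the
 * hypothetical '_wpse_87695_imported' meta flag.
 */
function wpse_87695_filter_imported_only( $content ) {
    if ( ! get_post_meta( get_the_ID(), '_wpse_87695_imported', true ) ) {
        return $content; // Not part of the import: leave untouched.
    }
    return wpse_87695_strip_wrappers( $content );
}

// Guarded so the file can also be loaded outside WordPress, e.g. in a test.
if ( function_exists( 'add_filter' ) ) {
    // Priority 5 runs before wpautop(), which WordPress hooks at priority 10.
    add_filter( 'the_content', 'wpse_87695_filter_imported_only', 5 );
}
```

The same wpse_87695_strip_wrappers function could just as well be used for the one-off database update instead of the output filter. The meta flag is only one possible exception; checking the post date or a category would serve the same purpose.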