How can I filter Microsoft Word gunk from pasted content?

I have some users who post to a group blog by cutting and pasting from Word, but their pastes include things like:

<!– /* Font Definitions */ @font-face {font-family:”Cambria Math”; panose-1:2 4 5 3 5 4 6 3 2 4; mso-font-charset:1; mso-generic-font-family:roman; mso-font-format:other; mso-font-pitch:variable; mso-font-signature:0 0 0 0 0 0;} @font-face {font-family:Calibri; panose-1:2 15 5 2 2 2 4 3 2 4; mso-font-charset:0; mso-generic-font-family:swiss; mso-font-pitch:variable; mso-font-signature:-520092929 1073786111 9 0 415 0;} @font-face {font-family:”Trebuchet MS”; panose-1:2 11 6 3 2 2 2 2 2 4; mso-font-charset:0; mso-generic-font-family:swiss; mso-font-pitch:variable; mso-font-signature:647 0 0 0 159 0;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {mso-style-unhide:no; mso-style-qformat:yes; mso-style-parent:”"; margin-top:0in; margin-right:0in; margin-bottom:10.0pt; margin-left:0in; line-height:115%; mso-pagination:widow-orphan; font-size:12.0pt; font-family:”Trebuchet MS”,”sans-serif”; mso-fareast-font-family:Calibri; mso-fareast-theme-font:minor-latin; mso-bidi-font-family:”Times New Roman”; mso-bidi-theme-font:minor-bidi; color:black;} p {mso-style-noshow:yes; mso-style-priority:99; mso-margin-top-alt:auto; margin-right:0in; mso-margin-bottom-alt:auto; margin-left:0in; mso-pagination:widow-orphan; font-size:12.0pt; font-family:”Times New Roman”,”serif”; mso-fareast-font-family:”Times New Roman”;} .MsoChpDefault {mso-style-type:export-only; mso-default-props:yes; font-size:12.0pt; mso-ansi-font-size:12.0pt; mso-bidi-font-size:12.0pt; mso-ascii-font-family:”Trebuchet MS”; mso-fareast-font-family:Calibri; mso-fareast-theme-font:minor-latin; mso-hansi-font-family:”Trebuchet MS”; mso-bidi-font-family:”Times New Roman”; mso-bidi-theme-font:minor-bidi; color:black;} .MsoPapDefault {mso-style-type:export-only; margin-bottom:10.0pt; line-height:115%;} @page WordSection1 {size:8.5in 11.0in; margin:1.0in 1.0in 1.0in 1.0in; mso-header-margin:.5in; mso-footer-margin:.5in; mso-paper-source:0;} div.WordSection1 {page:WordSection1;} –>

/* Style Definitions */
table.MsoNormalTable
{mso-style-name:”Table Normal”;
mso-tstyle-rowband-size:0;
mso-tstyle-colband-size:0;
mso-style-noshow:yes;
mso-style-priority:99;
mso-style-qformat:yes;
mso-style-parent:”";
mso-padding-alt:0in 5.4pt 0in 5.4pt;
mso-para-margin-top:0in;
mso-para-margin-right:0in;
mso-para-margin-bottom:10.0pt;
mso-para-margin-left:0in;
line-height:115%;
mso-pagination:widow-orphan;
font-size:11.0pt;
font-family:”Calibri”,”sans-serif”;
mso-ascii-font-family:Calibri;
mso-ascii-theme-font:minor-latin;
mso-fareast-font-family:”Times New Roman”;
mso-fareast-theme-font:minor-fareast;
mso-hansi-font-family:Calibri;
mso-hansi-theme-font:minor-latin;
mso-bidi-font-family:”Times New Roman”;
mso-bidi-theme-font:minor-bidi;}

What can I do to filter out code like this automatically?

4 comments

  1. I’d suggest using Andrew Ozz’s TinyMCE Advanced plugin. It lets you add a ‘Paste from Word’ option that takes care of all of that for you.

    If you’d rather not install a plugin, you can also filter the content yourself before it is saved, like this:

    function get_rid_of_mso_junk( $content ){
      return preg_replace( '@(mso|panose)[^:]{1,25}:[^;]+;(\s+)?(\n+)?@i', '', $content );
    }
    
    add_filter( 'content_save_pre', 'get_rid_of_mso_junk' );
    

    To strip other unwanted declarations, add their prefixes to the first capturing group in that regex, e.g. (mso|panose|other-junk|annoyance).
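
    As a rough sketch of how that extension could look, the pattern can be built from a list of prefixes instead of being edited by hand. The function name strip_word_declarations and the extra 'tab' prefix are made up for illustration. The pattern needs at least one character between the prefix and the colon, so each entry has to be a true prefix of the property name:

    ```php
    <?php
    // Hypothetical helper (not part of WordPress): removes CSS declarations
    // whose property name starts with one of the given prefixes.
    function strip_word_declarations( $content, $prefixes = array( 'mso', 'panose' ) ) {
        // Escape each prefix so it is matched literally inside the pattern.
        $escaped = array();
        foreach ( $prefixes as $prefix ) {
            $escaped[] = preg_quote( $prefix, '@' );
        }
        $pattern = '@(' . implode( '|', $escaped ) . ')[^:]{1,25}:[^;]+;(\s+)?(\n+)?@i';
        return preg_replace( $pattern, '', $content );
    }

    // Default prefixes: only the mso-* declarations are dropped.
    echo strip_word_declarations( 'mso-style-parent:""; margin-top:0in; mso-pagination:widow-orphan; font-size:12.0pt;' ), "\n";
    // margin-top:0in; font-size:12.0pt;

    // Extra prefix 'tab' (an assumption) also drops tab-stops.
    echo strip_word_declarations( 'tab-stops:.5in; color:black;', array( 'mso', 'panose', 'tab' ) ), "\n";
    // color:black;

    // Inside WordPress the helper could be hooked the same way as above.
    if ( function_exists( 'add_filter' ) ) {
        add_filter( 'content_save_pre', 'strip_word_declarations' );
    }
    ```

    One caveat: the pattern matches anywhere in the text and is case-insensitive, so it can also swallow a selector such as p.MsoNormal when a colon follows within 25 characters. Test it against real pastes before relying on it.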

  2. I’ve worked with clients who face this problem a lot. The trick, I’ve found, is to copy-paste into HTML view and then switch back to the Visual editor to tweak formatting if necessary.

    This is also necessary if copy-pasting from another website. Sometimes you’ll accidentally pull in class definitions and in-line styling from the external source and that can break the display if you don’t have those same classes or styles set up or supported by your site.

    Another option would be to expose your users to Windows Live Writer. It’s a Microsoft product that’s completely free, plays nicely with copy-paste from Word, and can interact with WordPress – you write your post, edit your post, use the built-in spell checker, format the post to display exactly how you want, then click “Publish” to push your post to WordPress via XMLRPC. It’s a fairly sound system that makes it incredibly easy to teach a first-time blogger how to blog … particularly because the UI is so similar to Word to begin with.

  3. For anyone looking for a solution to this problem, I did something like this:

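    function delete_between($beginning, $end, $string) {
        $beginningPos = strpos($string, $beginning);
        // strpos() returns 0 for a match at the very start, so compare with false explicitly.
        if ($beginningPos === false) {
            return $string;
        }
        // Only look for the end marker after the beginning marker.
        $endPos = strpos($string, $end, $beginningPos + strlen($beginning));
        if ($endPos === false) {
            return $string;
        }
    
        // Everything from the start of $beginning through the end of $end.
        $textToDelete = substr($string, $beginningPos, ($endPos + strlen($end)) - $beginningPos);
    
        return str_replace($textToDelete, '', $string);
    }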
    
    function clean_content( $content ){
        // Only clean on the blog home page and on single post views.
        if( is_home() || is_single() ){
            $content = delete_between('<!--[if gte mso', ';}', $content);
        }
        return $content;
    }
    
    add_filter( 'the_content', 'clean_content' );
    add_filter( 'the_excerpt', 'clean_content' );
    

    You can replace the marker strings passed to delete_between with whatever you want to strip. This worked for me.
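
    If a post contains more than one such block, a looping variant may help. The following is only a sketch: the name delete_all_between is made up, and using <![endif]--> as the end marker assumes the pasted conditional comments keep their usual terminator.

    ```php
    <?php
    // Hypothetical variant: removes every span that starts with $beginning
    // and ends with $end, not just the first one.
    function delete_all_between( $beginning, $end, $string ) {
        $offset = 0;
        // Compare against false explicitly so a match at position 0 still counts.
        while ( ( $start = strpos( $string, $beginning, $offset ) ) !== false ) {
            $stop = strpos( $string, $end, $start + strlen( $beginning ) );
            if ( $stop === false ) {
                break; // Unterminated block: leave the remainder untouched.
            }
            // Cut from the start of $beginning through the end of $end.
            $string = substr_replace( $string, '', $start, $stop + strlen( $end ) - $start );
            $offset = $start;
        }
        return $string;
    }

    $sample = '<!--[if gte mso 9]>a{b:c;}<![endif]--><p>Hello</p>'
            . '<!--[if gte mso 10]>d{e:f;}<![endif]--><p>World</p>';

    echo delete_all_between( '<!--[if gte mso', '<![endif]-->', $sample ), "\n";
    // <p>Hello</p><p>World</p>
    ```

    It can be called from clean_content in place of delete_between; the markers are ordinary strings, so other start and end pairs work the same way.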