Remove hidden formatting when users paste text from MS Word into TinyMCE

Around a fifth of the post submissions that I receive contain ridiculous amounts of hidden formatting.

For example, here is some of it from a recent post:

<!--[if gte mso 9]><xml>
<w:WordDocument>
<w:View>Normal</w:View>
<w:Zoom>0</w:Zoom>
<w:TrackMoves/>
<w:TrackFormatting/>
<w:PunctuationKerning/>
<w:ValidateAgainstSchemas/>
<w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid>
<w:IgnoreMixedContent>false</w:IgnoreMixedContent>
<w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText>
<w:DoNotPromoteQF/>
<w:LidThemeOther>EN-US</w:LidThemeOther>
<w:LidThemeAsian>X-NONE</w:LidThemeAsian>
<w:LidThemeComplexScript>X-NONE</w:LidThemeComplexScript>
<w:Compatibility>
<w:BreakWrappedTables/>
<w:SnapToGridInCell/>
<w:WrapTextWithPunct/>
<w:UseAsianBreakRules/>
<w:DontGrowAutofit/>
<w:SplitPgBreakAndParaMark/>
<w:EnableOpenTypeKerning/>
<w:DontFlipMirrorIndents/>
<w:OverrideTableStyleHps/>
</w:Compatibility>
<m:mathPr>
<m:mathFont m:val="Cambria Math"/>
<m:brkBin m:val="before"/>
<m:brkBinSub m:val="--"/>
<m:smallFrac m:val="off"/>
<m:dispDef/>
<m:lMargin m:val="0"/>
<m:rMargin m:val="0"/>
<m:defJc m:val="centerGroup"/>
<m:wrapIndent m:val="1440"/>
<m:intLim m:val="subSup"/>
<m:naryLim m:val="undOvr"/>
</m:mathPr></w:WordDocument>
</xml><![endif]--><!--[if gte mso 9]><xml>
<w:LatentStyles DefLockedState="false" DefUnhideWhenUsed="true"
DefSemiHidden="true" DefQFormat="false" DefPriority="99"
LatentStyleCount="267">
<w:LsdException Locked="false" Priority="0" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Normal"/>
<w:LsdException Locked="false" Priority="9" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="heading 1"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 2"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 3"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 4"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 5"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 6"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 7"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 8"/>
<w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 9"/>
<w:LsdException Locked="false" Priority="39" Name="toc 1"/>
<w:LsdException Locked="false" Priority="39" Name="toc 2"/>
<w:LsdException Locked="false" Priority="39" Name="toc 3"/>
<w:LsdException Locked="false" Priority="39" Name="toc 4"/>

The full dump is actually 650 lines long.

Also, Word-specific class attributes are added to ordinary tags, for example:

<p class="MsoNormal">

Upon further research, it appears that this happens when an author pastes content from MS Word directly into the TinyMCE visual editor. As described elsewhere:

The bad news isn’t evident until someone attempts to view that page
with a different browser and the page is totally misformatted or
appears blank. Ironically, this latter scenario happens most often
when the page is viewed in Microsoft Internet Explorer [Good!].

One way to solve it may be to use the "Paste from Word" button.

However, that is not a viable solution when 20% of submissions have this issue. Is there any way to strip this nonsense formatting upon paste?

2 comments

  1. I am interpreting the question to mean that you already have Word markup in your post and so you need to clean that up via PHP. If so…

    1. You can see the code that cleans up Word content here:
      http://core.trac.wordpress.org/browser/trunk/src/wp-includes/js/tinymce/plugins/paste/editor_plugin_src.js#L375
      That is JavaScript. With some work, you could convert that to PHP.
    2. PHP Tidy, if available, will clean that up.
    3. I believe that HTML Tidy can do it.
    4. strip_tags will just get rid of the code. (Tested)
    5. wp_kses will get rid of much of it but will take some tweaking to
      work well, at least as indicated by my simple test. Maybe with the right arguments it can do what you want.
  2. Here is a “zero development solution”:
    I would instruct your users that if they paste content from Word, they should paste it into the HTML tab, not the “Visual” tab. They can switch to the Visual tab afterwards. This will paste only the visible text, not its markup.
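
To give a rough idea of what such a cleanup could look like, here is a minimal JavaScript sketch. It is not the code from the paste plugin linked above. The function name `stripWordMarkup` is made up for illustration, and it only targets the three things shown in the question: the `<!--[if gte mso 9]>` conditional comments, namespaced Office tags such as `<w:View>` or `<o:p>`, and `class` attributes whose single value starts with `Mso`. Real Word output contains more than this (inline `mso-` styles, for instance), so treat it as a starting point only.

```javascript
// Hypothetical helper, not part of TinyMCE or WordPress.
// Strips the most common MS Word paste artifacts from an HTML string.
function stripWordMarkup(html) {
  return html
    // Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->.
    // Ordinary comments (e.g. <!--more-->) are deliberately left alone.
    .replace(/<!--\[if[\s\S]*?<!\[endif\]-->/gi, '')
    // Namespaced Office tags such as <o:p>, </o:p>, <w:View>, <m:mathPr/>.
    // Only the tags are removed; any text between them is kept.
    .replace(/<\/?[a-z][\w-]*:[^>]*>/gi, '')
    // class attributes whose only value is an Mso* class,
    // e.g. class="MsoNormal" or class=MsoNormal.
    .replace(/\s+class=("|')?Mso[^"'\s>]*\1/gi, '');
}

// Example input modelled on the markup quoted in the question.
var pasted =
  '<!--[if gte mso 9]><xml><w:WordDocument><w:View>Normal</w:View>' +
  '</w:WordDocument></xml><![endif]-->' +
  '<p class="MsoNormal">Hello <o:p></o:p>world</p>';

console.log(stripWordMarkup(pasted));
```

If the editor's paste plugin offers a pre-processing hook for pasted content (TinyMCE's paste plugin has a `paste_preprocess` setting for this), a function like the one above could be applied there so the markup never reaches the post. How that setting is passed to the editor from WordPress depends on the version in use, so that part is left as an assumption here.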
