PHP library that can merge stylesheet with inline style

I am working with HTML documents generated by Microsoft Word 2007/2010. Besides generating incredibly dirty HTML, Word also tends to use both block (stylesheet) and inline styles. I am looking for a PHP library that would merge the style block into the already existing inline style attributes.

I have HTML converted from Word that I will be sending through XML-RPC. The PHP library is needed to merge the stylesheet with the inline styles so the formatting is preserved. I want to call this library after the request is received by XML-RPC and before it reaches the kses filter, so the styling is not stripped off.

Example

If the original html is:

    <html>
    <head>
    <style>
    .normaltext {color:black;font-weight:normal;font-size:10pt}
    .important {color:red;font-weight:bold;font-size:11pt}
    </style>
    </head>
    <body>
    <p class="normaltext" style="font-family:arial">
    Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    In ut erat id dui mollis faucibus. Mauris eu neque et eros tempus placerat.
    <span class="important">Nam in purus nisi</span>, vitae dictum ligula.
    Morbi mattis eros eget diam vulputate imperdiet.
    <span class="important" style="color:green">Integer</span> a metus eros.
    Sed iaculis porta imperdiet.
    </p>
    </body>
    </html>

Should become:

    <html>
    <head>
    </head>
    <body>
    <p style="font-family:arial;color:black;font-weight:normal;font-size:10pt">
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
    In ut erat id dui mollis faucibus. Mauris eu neque et eros tempus placerat. 
    <span style="color:red;font-weight:bold;font-size:11pt">Nam in purus nisi</span>, vitae dictum ligula. 
    Morbi mattis eros eget diam vulputate imperdiet. 
    <span style="color:green;font-weight:bold;font-size:11pt">Integer</span> a metus eros. 
    Sed iaculis porta imperdiet.
    </p>
    </body>
    </html>
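
A minimal sketch of the merge shown above, using PHP's DOM extension. It assumes the stylesheet only contains simple .class { ... } rules (no element selectors, no cascade or specificity), and the function names parse_declarations and inline_class_styles are made up for illustration. Existing inline values win over class values, which is what the example output expects.

```php
<?php
// Hypothetical helper: turn "color:red; font-size:11pt" into ['color' => 'red', ...].
function parse_declarations(string $css): array
{
    $props = [];
    foreach (explode(';', $css) as $declaration) {
        if (strpos($declaration, ':') === false) {
            continue; // skip empty or malformed fragments
        }
        [$name, $value] = explode(':', $declaration, 2);
        $props[strtolower(trim($name))] = trim($value);
    }
    return $props;
}

// Hypothetical helper: fold simple ".class { ... }" rules into style attributes.
function inline_class_styles(string $html): string
{
    libxml_use_internal_errors(true); // Word markup is rarely valid HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // 1. Collect the class rules, then drop the <style> blocks.
    $rules = [];
    foreach (iterator_to_array($xpath->query('//style')) as $style) {
        preg_match_all('/\.([\w-]+)\s*\{([^}]*)\}/', $style->textContent, $matches, PREG_SET_ORDER);
        foreach ($matches as $match) {
            $rules[$match[1]] = array_merge($rules[$match[1]] ?? [], parse_declarations($match[2]));
        }
        $style->parentNode->removeChild($style);
    }

    // 2. Merge the rules into each element that has a class attribute.
    foreach (iterator_to_array($xpath->query('//*[@class]')) as $element) {
        $fromClasses = [];
        foreach (preg_split('/\s+/', trim($element->getAttribute('class'))) as $class) {
            $fromClasses = array_merge($fromClasses, $rules[$class] ?? []);
        }
        // Array union: inline declarations come first and are never overwritten.
        $merged = parse_declarations($element->getAttribute('style')) + $fromClasses;

        $pairs = [];
        foreach ($merged as $name => $value) {
            $pairs[] = $name . ':' . $value;
        }
        $element->removeAttribute('class');
        if ($pairs) {
            $element->setAttribute('style', implode(';', $pairs));
        }
    }

    return $doc->saveHTML();
}
```

It could be called as inline_class_styles($wordHtml), where $wordHtml holds the markup received over XML-RPC. Note that DOMDocument normalizes the document on output (for example, it adds a DOCTYPE), so the result will not be byte-for-byte identical to the input outside the changed attributes.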

Goal

The end goal is to preserve all style and formatting from a Word-generated HTML file and send it to WordPress, where it can be edited with TinyMCE. If there is an alternative to what I am describing (it must be done on the server side), I am willing to accept it as the answer as well.

2 comments

  1. I am trying to replicate the output you provided in your example above, but I am only able to achieve output along the lines of:

    <p class=MsoNormal>
        <span class=MsoIntenseReference>
            <span style='color:red;text-transform:none;letter-spacing:0pt;font-weight:normal;text-decoration:none'>
                Red Example text
            </span>
        </span>
    </p>
    

    As you can see, Microsoft Word (2010) inserts predefined class names for the paragraph and span tags, and it also wraps the span containing the text in another span.

    How were you able to assign a class name to the span that wraps your text?

    For reference, I am saving my HTML file as “Web Page, Filtered”, with “Filtered” being the key to removing the “dirty” formatting Word would otherwise apply to the document.

    If I can replicate the same output you are getting in your example above, then we may be able to work towards an easier solution.

    PS. I apologize that this response to your question is posted as an answer; I am seemingly not able to post a comment. I do intend to follow this up with additional commentary that works towards a complete answer, as I have some suggestions to make once I get further insight into my initial question above!

    UPDATE

    NOTE: This is intended as a guide to set you off on the right path, so the code provided below consists of examples that are missing some functionality, which you will need to write.

    Ideally, you want your XML-RPC script to process the content you feed it in two steps:

    1) Search for inline styles and replace them with ones that are compatible with WordPress, using regular expressions (RegEx).

    2) Post your newly sanitized content to your blog in the form of a post.

    Since you won’t know the exact inline-style format that your MS Word document will output, you can use RegEx to search and replace text between characters when it meets certain criteria.

    Take this for example:

    <span style="color:green">Integer</span>

    Through RegEx you might search for the word “green” between <span and >, and where you find a match, replace the whole opening tag with your desired inline style:

    <span class="green" style="color:green;font-weight:bold;font-size:10pt">
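
    The search and replace described above could be sketched with preg_replace_callback. This is only an illustration: the function name replace_span_styles and the $replacements map are hypothetical, and the pattern assumes the style attribute value is quoted and contains no nested quotes.

    ```php
    <?php
    // Hypothetical map: colour keyword => full opening tag to substitute.
    $replacements = [
        'green' => '<span class="green" style="color:green;font-weight:bold;font-size:10pt">',
    ];

    // Hypothetical helper: rewrite opening <span> tags whose style sets a known colour.
    function replace_span_styles(string $html, array $replacements): string
    {
        return preg_replace_callback(
            // Match an opening <span ...> tag and capture its style value.
            '/<span\b[^>]*\bstyle\s*=\s*(["\'])([^"\']*)\1[^>]*>/i',
            function (array $m) use ($replacements) {
                foreach ($replacements as $keyword => $tag) {
                    // The lookbehind keeps "background-color" from matching.
                    $pattern = '/(?<![\w-])color\s*:\s*' . preg_quote($keyword, '/') . '\b/i';
                    if (preg_match($pattern, $m[2])) {
                        return $tag;
                    }
                }
                return $m[0]; // leave spans without a known colour untouched
            },
            $html
        );
    }

    echo replace_span_styles('<span style="color:green">Integer</span>', $replacements);
    // prints <span class="green" style="color:green;font-weight:bold;font-size:10pt">Integer</span>
    ```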

    To make this inline styling available in the post editor screen in the WordPress dashboard, you will need to add some extra options to the TinyMCE editor’s styles dropdown, which would look something like:

        array(
            'title' => 'Bold Green Text',
            'classes' => 'green',
            'inline' => 'span',
            'styles' => array(
                'color' => 'green',
                'fontWeight' => 'bold',
                'fontSize' => '10pt'
            )
        )
    

    You can read more about that at,

    1) HERE

    2) AND HERE

    Essentially, the custom styles you add should match the ones you are producing via your RegEx function.
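
    One way such a style could be registered is through the tiny_mce_before_init and mce_buttons_2 filters. Treat this as a sketch: the function names are hypothetical, and the add_filter calls are guarded so the file also loads outside WordPress.

    ```php
    <?php
    // Hypothetical callback: add the "Bold Green Text" entry to the TinyMCE settings.
    function my_add_green_style_format(array $init): array
    {
        $formats = [
            [
                'title'   => 'Bold Green Text',
                'inline'  => 'span',
                'classes' => 'green',
                'styles'  => [
                    'color'      => 'green',
                    'fontWeight' => 'bold',
                    'fontSize'   => '10pt',
                ],
            ],
        ];
        // TinyMCE expects style_formats as a JSON-encoded string here.
        $init['style_formats'] = json_encode($formats);
        return $init;
    }

    // Hypothetical callback: show the styles dropdown in the second toolbar row.
    function my_show_styleselect(array $buttons): array
    {
        array_unshift($buttons, 'styleselect');
        return $buttons;
    }

    // Only register the hooks when running inside WordPress.
    if (function_exists('add_filter')) {
        add_filter('tiny_mce_before_init', 'my_add_green_style_format');
        add_filter('mce_buttons_2', 'my_show_styleselect');
    }
    ```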

    Your XML-RPC script (for example, post-via-xmlrpc.php) would then look something along the lines of:

    <?php
    
    // Your RegEx function for processing your source file.
    // Takes the raw MS Word HTML and returns the cleaned-up markup.
    
    function sanitize_content($content) {
    
        // do your regular expression stuff here
    
        return $content;
    
    }
    
    // Your XML-RPC function (requires the PHP xmlrpc and curl extensions)
    
    function wpPostXMLRPC($title, $content, $rpcurl, $username, $password, $categories = array(1)) {
    
        $categories = implode(",", $categories);
    
        // Sanitize the content before it is embedded in the request
        $sanitized_content = sanitize_content($content);
        $XML = "<title>$title</title>" . "<category>$categories</category>" . $sanitized_content;
    
        $params = array('', '', $username, $password, $XML, 1);
        $request = xmlrpc_encode_request('blogger.newPost', $params);
    
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_POSTFIELDS, $request);
        curl_setopt($ch, CURLOPT_URL, $rpcurl);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_TIMEOUT, 1);
        $response = curl_exec($ch);
        curl_close($ch);
    
        return $response;
    }
    
    // Example content; replace this with your MS Word HTML
    $content = '<span class="important">example content is here</span>';
    
    // Do stuff here to initiate your post function
    
    ?>
    

    For this example, you can see I’ve included the $content string within the script, but of course you would want to pass your MS Word HTML file to this variable instead; you can do this via a form, by file path, and so on.

    Assuming your post-via-xmlrpc.php is accessible via your localhost, you would run this process by visiting:

    http://localhost/post-via-xmlrpc.php

    The most difficult part of this entire process is really your regular expression (RegEx) search-and-replace function. It would need to find <body> and remove everything before it, find </body> and remove everything after it, remove both <body> and </body>, and then parse through the remaining content, replacing inline styles as required.
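
    The first part of that step, keeping only what sits between <body> and </body>, might look like the sketch below. The function name extract_body is hypothetical, and the pattern assumes a single body element.

    ```php
    <?php
    // Hypothetical helper: return only the markup between <body ...> and </body>.
    function extract_body(string $html): string
    {
        // "s" lets "." span newlines; "i" tolerates <BODY> as Word may write it.
        if (preg_match('/<body\b[^>]*>(.*)<\/body\s*>/is', $html, $m)) {
            return trim($m[1]);
        }
        return $html; // no body element found: return the input unchanged
    }

    echo extract_body("<html><head><title>Doc</title></head>\n<body lang=EN-US>\n<p>Hi</p>\n</body></html>");
    // prints <p>Hi</p>
    ```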

    There really isn’t any need to mess around with another PHP library when it can all be done from a self-contained XML-RPC script designed to sanitize your input.

  2. Check out:

    Porting code from either of the sources to PHP, or using any of the available APIs should do the trick of getting your CSS styling inline.

    If you’re OK with styles being out of line but don’t want TinyMCE to kill them off, and that is the sole reason you want to do this, you may like to approach the question more directly.

    TinyMCE has a valid_children configuration, which would allow styles to remain. By adding +body[style] you should be able to get style blocks through.

    http://codex.wordpress.org/TinyMCE#Customize_TinyMCE_with_Filters

    The keep_styles option should also help, as well as paste_remove_styles. Check out the defaults here http://core.trac.wordpress.org/browser/tags/3.3.1/wp-includes//class-wp-editor.php#L271

    You would hook to the tiny_mce_before_init filter and alter the values.

    http://core.trac.wordpress.org/browser/tags/3.3.1/wp-includes//class-wp-editor.php#L396
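
    Put together, a tiny_mce_before_init callback along these lines could apply the three settings mentioned above. It is a sketch only: the function name my_keep_word_styles is hypothetical, and whether these values are enough depends on the TinyMCE version bundled with your WordPress.

    ```php
    <?php
    // Hypothetical callback: relax the TinyMCE defaults so style information survives.
    function my_keep_word_styles(array $init): array
    {
        // Append to any existing valid_children value instead of overwriting it.
        $extra = '+body[style]';
        $init['valid_children'] = empty($init['valid_children'])
            ? $extra
            : $init['valid_children'] . ',' . $extra;

        $init['keep_styles'] = true;          // keep inline styles when editing
        $init['paste_remove_styles'] = false; // do not strip styles on paste
        return $init;
    }

    // Only register the hook when running inside WordPress.
    if (function_exists('add_filter')) {
        add_filter('tiny_mce_before_init', 'my_keep_word_styles');
    }
    ```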