Retrieving Snapshot HTML for Google

I am using AJAX to call a server-side file that uses WordPress to build a page's content and return it, which I then use to populate fields. What I am confused about is how to create the snapshot, and what I have to do, besides using #!, to let Google know I am creating one. Also, why do I need to do this? The _escaped_fragment_ part is a little unclear to me too, and I was hoping for a more detailed explanation. Does anyone have a tutorial that walks through this process for a setup similar to mine?

David

1 comment

  1. Google’s crawlers don’t typically run your JavaScript. They hit your page, scrape your HTML, and move on. This is much more efficient than loading your page and all of its resources, running your JavaScript, guessing at when everything finished loading, and then scraping data out of the DOM.

    If your site uses AJAX to populate the page with content, this is a problem for Google and others. Your page is effectively empty… devoid of any content… in its HTML state. It requires your JavaScript to fill it in. Since the crawlers don’t run your JavaScript, your page isn’t all that useful to the crawler.

    These days, there are an awful lot of sites that blur the line between web-based applications and content-driven sites. These sites (like yours) require client-side code to run to get the content. Google doesn’t have the resources to do this on every site they encounter, but they did provide an option. That’s the info you found about escaped fragments.

    Google has given you the opportunity to do the work of scraping the full finished DOM for them. They have put the CPU and memory burden of running your JavaScript back on you. You signal to Google that your site supports this by using links that contain #!. Google sees this and knows that it can request the same page with everything after #! (which a browser never sends to the server) moved into a ?_escaped_fragment_= query parameter. When that request arrives, your server should generate a snapshot of the complete finished DOM, after your JavaScript has run.
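
    To make the #! to ?_escaped_fragment_= mapping concrete, here is a minimal sketch of the rewrite the crawler performs. The function name toEscapedFragmentUrl and the example.com URLs are made up for illustration, and the set of characters escaped here is a simplification of what the scheme describes, so treat it as an approximation rather than Google's actual code.

    ```javascript
    // Sketch of the URL rewrite a crawler applies under the #! scheme.
    // toEscapedFragmentUrl is a hypothetical name, not part of any library.
    function toEscapedFragmentUrl(prettyUrl) {
      const idx = prettyUrl.indexOf('#!');
      if (idx === -1) return prettyUrl; // no hashbang, nothing to rewrite

      const base = prettyUrl.slice(0, idx);
      const fragment = prettyUrl.slice(idx + 2);

      // Simplified escaping (assumption): only characters that would change
      // the meaning of the query string are percent-encoded.
      const escaped = fragment.replace(/[%#&+ ]/g, (c) =>
        '%' + c.charCodeAt(0).toString(16).toUpperCase().padStart(2, '0'));

      // Append to an existing query string if the URL already has one.
      const separator = base.includes('?') ? '&' : '?';
      return base + separator + '_escaped_fragment_=' + escaped;
    }
    ```

    So a link like http://example.com/#!/about would be requested from your server as http://example.com/?_escaped_fragment_=/about.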

    The good news is that these days you don’t have to hack together a lot of code to do it. I’ve written a server to do this using PhantomJS. (I’m trying to get permission to open the source code up, but it’s in legal limbo, sorry!) Basically, PhantomJS is a full WebKit web browser, but it runs without a GUI. You can use PhantomJS to load your site, run all the JavaScript, and then, when it’s ready, scrape the HTML back out of the page and send that version to Google. This doesn’t require you to do anything special, other than fix your routing to point requests with _escaped_fragment_ at your snapshot server.
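
    The routing step might look something like the sketch below. It assumes a JavaScript environment with the standard URL class (Node, for example); your WordPress setup would do the equivalent check in PHP or in the web server's rewrite rules. The name snapshotTarget is hypothetical.

    ```javascript
    // Decide whether a request came from a crawler using the #! scheme and,
    // if so, rebuild the "pretty" URL that the snapshot server should render.
    // snapshotTarget is a hypothetical helper, not part of any library.
    function snapshotTarget(requestUrl) {
      const url = new URL(requestUrl);
      if (!url.searchParams.has('_escaped_fragment_')) {
        return null; // ordinary visitor: serve the normal page
      }
      // searchParams.get() returns the value already percent-decoded.
      const fragment = url.searchParams.get('_escaped_fragment_');
      url.searchParams.delete('_escaped_fragment_');
      url.hash = '#!' + fragment;
      return url.toString();
    }
    ```

    When this returns a URL, hand that URL to the snapshot server and send its HTML back as the response. When it returns null, handle the request as usual.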

    You can do this in about 20 lines of code. PhantomJS even has a mini web server built into it, but its authors recommend against using it in production.
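
    As a rough idea of what those lines could look like (this is not the server mentioned above, just a sketch), the script below loads a URL in PhantomJS and prints the rendered HTML. The fixed 2000 ms wait is an assumption standing in for a real "is the page done?" check, and takeSnapshot is a made-up name.

    ```javascript
    // Load a page, wait for client-side code to fill it in, and return
    // the finished DOM as an HTML string via the done(err, html) callback.
    // `page` is expected to be a PhantomJS webpage object.
    function takeSnapshot(page, url, settleMs, done) {
      page.open(url, function (status) {
        if (status !== 'success') {
          done(new Error('failed to load ' + url));
          return;
        }
        // Crude assumption: all AJAX content has arrived after settleMs.
        setTimeout(function () {
          done(null, page.content);
        }, settleMs);
      });
    }

    // This part only runs inside PhantomJS, e.g.:
    //   phantomjs snapshot.js "http://example.com/#!/about"
    if (typeof phantom !== 'undefined') {
      var url = require('system').args[1];
      takeSnapshot(require('webpage').create(), url, 2000, function (err, html) {
        if (err) {
          console.error(err.message);
          phantom.exit(1);
          return;
        }
        console.log(html);
        phantom.exit(0);
      });
    }
    ```

    A fixed delay is the weakest part of this. A sturdier version would have your page set a flag once its AJAX calls finish and have the script poll for that flag before reading page.content.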

    I hope this helps clear up some confusion!