How Does WordPress Block Search Engines?

If you go to the WordPress admin and then Settings -> Privacy, there are two options asking whether you want to allow your blog to be crawled by search engines, one of which is:

I would like to block search engines,
but allow normal visitors


How does WordPress actually block search bots/crawlers from indexing the site when it is live?


  1. According to the Codex, it’s just robots meta tags, robots.txt, and suppression of pingbacks:

    Causes <meta name='robots' content='noindex,nofollow' /> to be generated into the <head> section (if wp_head is used) of your site’s source, causing search engine spiders to ignore your site.

    Causes hits to robots.txt to send back:

    User-agent: *
    Disallow: /

    Note: The above only works if WordPress is installed in the site root and no robots.txt exists.

    These are “guidelines” that all friendly bots will follow. A malicious spider searching for e-mail addresses or forms to spam will not be affected by these settings.
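The guidelines above can be sketched with Python's standard urllib.robotparser module, which performs the sort of check a well-behaved crawler runs before fetching a URL (example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# The robots.txt body WordPress serves when the privacy option is enabled.
rules = "User-agent: *\nDisallow: /\n"

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A polite crawler asks the parser before fetching any URL; with
# "Disallow: /" every path is off limits for every user agent.
print(parser.can_fetch("Googlebot", "https://example.com/"))          # False
print(parser.can_fetch("Googlebot", "https://example.com/any/post"))  # False
```

A bot that never performs this check is simply unaffected, which is why these are only guidelines.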

  2. With a robots.txt (if WordPress is installed in the site root):

     User-agent: *
     Disallow: /
    

    or (from here)

    I would like to block search engines, but allow normal visitors –
    check this for these results:

    • Causes <meta name='robots' content='noindex,nofollow' /> to be generated into the <head> section (if wp_head is used) of your site’s source, causing search engine spiders to ignore your site.

    • Causes hits to robots.txt to send back:

          User-agent: *
          Disallow: /

      Note: The above only works if WordPress is installed in the site root and no robots.txt exists.

    • Stops pings to ping-o-matic and any other RPC ping services specified in the Update Services section of Administration > Settings > Writing. This works by having the function privacy_ping_filter() remove the sites to ping from the list. The filter is added by add_filter('option_ping_sites', 'privacy_ping_filter'); in the default filters. When the generic_ping function attempts to get the “ping_sites” option, this filter blocks it from returning anything.

    • Hides the Update Services option entirely on the Administration > Settings > Writing panel with the message “WordPress is not notifying any Update Services because of your blog’s privacy settings.”
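One quick way to confirm the first effect on a rendered page is to look for the robots meta tag in the HTML. A minimal sketch using Python's standard html.parser (the sample markup is an assumption of what wp_head emits, not output captured from a real site):

```python
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Records the content attribute of a <meta name='robots'> tag, if present."""
    def __init__(self):
        super().__init__()
        self.robots_content = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("name") == "robots":
            self.robots_content = attributes.get("content")

# Assumed sample of the <head> output with the privacy setting enabled.
page = ("<html><head>"
        "<meta name='robots' content='noindex,nofollow' />"
        "</head><body></body></html>")

finder = RobotsMetaFinder()
finder.feed(page)
print(finder.robots_content)  # noindex,nofollow
```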

  3. You can’t actually block bots and crawlers from reading a publicly available site; if a person with a browser can see it, then a bot or crawler can see it (caveat below).

    However, there is something called the Robots Exclusion Standard (or robots.txt standard), which allows you to indicate to well-behaved bots and crawlers that they shouldn’t index your site. This site, as well as Wikipedia, provides more information.

    The caveat to the comment above (that a bot can see whatever your browser can see) is this: most simple bots do not include a JavaScript engine, so anything the browser renders as a result of JavaScript code will probably not be seen by a bot. I would suggest that you don’t use this as a way to avoid indexing, since the robots.txt standard does not rely on the presence of JavaScript to work correctly.

    One last comment: bots are free to ignore this standard. Those that do are badly behaved. The bottom line is that anything that can read your HTML can do what it likes with it.