What is a good robots.txt?

What is the “best” setup for robots.txt?
I’m using the following permalink structure /%category%/%postname%/.

My robots.txt currently looks like this (copied from somewhere a long time ago):

User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /trackback
Disallow: /comments
Disallow: /category/*/*
Disallow: */trackback
Disallow: */comments
  1. I want my comments to be indexed, so I can remove the comment rules, right?
  2. Do I want to disallow indexing categories because of my permalink structure?
  3. An article can have several tags and appear in multiple categories, which may cause duplicate content in search engines like Google. How should I work around this?

Would you change anything else here?


6 comments

  1. FWIW, trackback URLs issue redirects and have no content, so they won’t get indexed.

    And at the risk of not answering the question, RE your points 2 and 3:

    http://googlewebmastercentral.blogspot.com/2008/09/demystifying-duplicate-content-penalty.html

    Put otherwise, I think you’re wasting your time worrying about dup content, and your robots.txt should be limited to:

    User-agent: *
    Disallow: /cgi-bin
    Disallow: /wp-admin
    Disallow: /wp-content/cache
    
  2. A lot of time has passed since this question and answer were posted, and things have changed considerably since then. The typical recommendation to disallow crawlers from wp-content/themes, wp-content/plugins, wp-content/cache, wp-includes, and any other directory containing CSS or JS files needed by the site is no longer valid.

    Take Google, for example. Googlebot used to render websites without CSS and without JS, but that is no longer the case. Googlebot now fetches the full document and checks things like responsiveness and the number, location, and size of scripts. So Google does not like it when you disallow Googlebot from accessing CSS and JS files. That means you should not disallow wp-content/themes, wp-content/plugins, wp-content/cache, or wp-includes, because all of those folders can serve CSS and JS files.
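    That concern can be sanity-checked with a parser before deploying a rule set. The sketch below uses Python's standard-library robots.txt parser (the domain and theme path are made-up examples) to show that a theme stylesheet is blocked under the old restrictive rules but allowed under the WordPress default:

    ```python
    from urllib.robotparser import RobotFileParser

    def allowed(rules: str, url: str) -> bool:
        """Parse a robots.txt rule set and ask whether Googlebot may fetch url."""
        rp = RobotFileParser()
        rp.parse(rules.splitlines())
        return rp.can_fetch("Googlebot", url)

    # Old-style restrictive rules vs. the WordPress default.
    old_rules = "User-agent: *\nDisallow: /wp-content/themes\n"
    new_rules = "User-agent: *\nDisallow: /wp-admin/\n"

    # Hypothetical theme stylesheet Googlebot needs in order to render the page.
    css = "https://example.com/wp-content/themes/my-theme/style.css"

    print(allowed(old_rules, css))  # False: the stylesheet is blocked
    print(allowed(new_rules, css))  # True: Googlebot can fetch the CSS
    ```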

    From my point of view, the best robots.txt file today is the one created by WordPress by default (the robots.txt below has been the default since WP 4.0):

    User-agent: *
    Disallow: /wp-admin/
    

    If you have a cgi-bin folder, it may be a good idea to disallow it:

    User-agent: *
    Disallow: /wp-admin/
    Disallow: /cgi-bin/
    

    And if you use a sitemap, it is a good idea to include a sitemap reference in robots.txt (you still need to submit the sitemap manually to Google and Bing Webmaster Tools, but the reference can be useful to other crawlers):

    User-agent: *
    Disallow: /wp-admin/
    Disallow: /cgi-bin/
    
    Sitemap: http://example.com/sitemap.xml
    

    That covers the general case. Specific websites may need to disallow other folders and files, which should be considered case by case. For example, you may need or want to disallow a specific plugin folder:

    User-agent: *
    Disallow: /wp-admin/
    Disallow: /wp-content/plugins/plugin-folder/
    

    To modify robots.txt, use the robots_txt filter (creating a physical robots.txt file stops WordPress from being able to handle robots.txt requests itself). For example:

    add_filter( 'robots_txt', function( $output ) {
        $output .= "Disallow: /cgi-bin/\n";
        $output .= "Disallow: /wp-content/plugins/plugin-folder-i-want-to-block/\n";
        $output .= "\nSitemap: " . site_url( 'sitemap.xml' ) . "\n";
        return $output;
    } );
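    For reference, assuming the additions above, the output served at /robots.txt would look roughly like this (the default lines WordPress prepends vary by version, and newer versions also allow admin-ajax.php, so treat this as a sketch rather than exact output):

    ```
    User-agent: *
    Disallow: /wp-admin/
    Disallow: /cgi-bin/
    Disallow: /wp-content/plugins/plugin-folder-i-want-to-block/

    Sitemap: http://example.com/sitemap.xml
    ```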
    
  3. With a little bit of help, this is now mine (not too different from everyone else's, apparently):

    User-agent: *
    Allow: /

    Disallow: /wp-content/
    Disallow: /wp-admin/
    Disallow: /cat/
    Disallow: /key/
    Disallow: /*?
    Disallow: /*.js$
    Disallow: /*.inc$
    Disallow: /*.css$
    Disallow: /cgi-bin
    Disallow: /wp-admin
    Disallow: /wp-includes
    Disallow: /wp-content/plugins
    Disallow: /wp-content/cache
    Disallow: /wp-content/themes
    
    User-agent: Mediapartners-Google
    Allow: /

    User-agent: Adsbot-Google
    Allow: /

    User-agent: Googlebot-Image
    Allow: /

    User-agent: Googlebot-Mobile
    Allow: /

    #User-agent: ia_archiver-web.archive.org
    #Disallow: /
    
    Sitemap: YOURSITENAME.HERE
    
  4. You should follow Joost de Valk’s current approach where very little is blocked in robots.txt, but also understand that each site will have a uniquely appropriate policy that will need to be reviewed and changed over time.

    Many of the answers given here previously are dated and will result in SEO self-sabotage, since Google now checks for “mobile friendliness.” Today Googlebot tries to load everything a normal browser does, including fonts, images, JavaScript, and CSS assets from /wp-content, /themes, /plugins, etc. (Morten Rand-Hendriksen recently blogged about this.)

    You can use Google’s “mobile friendly” site checker to find out if your robots.txt file is sabotaging your site. If you use Google Webmaster Tools you should receive alerts and emailed notices if there is a big problem.

    Unless you are careful to make sure no key presentational or interactive assets are being loaded from disallowed folders, this is probably the bare minimum every WordPress install is safe with:

    User-agent: *
    Disallow: /wp-admin
    

    And don’t forget to add a sitemap:

    Sitemap: http://yoursite.com/sitemap.xml
    

    Unfortunately, this more open policy recreates the potential for problems that formerly led people to be more restrictive with robots.txt, such as plugin and theme developers including indexable pages with links back to their own sites. There is nothing to be done about this unless you are able to pore over all third-party code with a fine-tooth comb and move or remove anything you don't want indexed.

  5. FYI, always begin your permalink structure with a number. From experience it speeds up the page, because WordPress can quickly differentiate between a page and a post (I also read that somewhere else, then tried it, and it's true). So http://example.com/%month%/%post%… will be fine.

    I am just going to copy what I have. A lot of research went into this, and it's probably overkill! It does help Google recognize what the main keywords of your site are, as seen in Google Webmaster Tools. Hope it helps:

    User-agent: *
    Allow: /
    Disallow: /wp-admin
    Disallow: /wp-includes
    Disallow: /wp-content/plugins
    Disallow: /wp-content/cache
    Disallow: /wp-content/themes
    Disallow: /cgi-bin/
    Sitemap: Url to sitemap1
    Sitemap: Url to sitemap2
    
    User-agent: Googlebot
    # disallow all files ending with these extensions
    Disallow: /*.js$
    Disallow: /*.inc$
    Disallow: /*.css$
    Disallow: /*.cgi$
    Disallow: /*.wmv$
    Disallow: /*.ico$
    Disallow: /*.opml$
    Disallow: /*.shtml$
    Disallow: /*.jpg$
    Disallow: /*.cgi$
    Disallow: /*.xhtml$
    Disallow: /wp-*
    Allow: /wp-content/uploads/ 
    
    # allow google image bot to search all images
    User-agent: Googlebot-Image
    Allow: /*
    
    User-agent:  *
    Disallow: /about/
    Disallow: /contact-us/
    Disallow: /wp-admin/
    Disallow: /wp-includes/
    Disallow: /wp-
    
    # disallow archiving site
    User-agent: ia_archiver
    Disallow: /
    
    # disable duggmirror
    User-agent: duggmirror
    Disallow: /
    
    User-agent: Googlebot
    Disallow: /*.js$
    Disallow: /*.inc$
    Disallow: /*.css$
    Disallow: /*.wmv$
    Disallow: /*.cgi$
    Disallow: /*.xhtml$
    
    # Google AdSense
    User-agent: Mediapartners-Google*
    Disallow:
    Allow: /*
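    A note on the Disallow: /*.js$ style of line used above: the '*' wildcard and the '$' end-of-URL anchor are extensions supported by Google (and some other major crawlers), not part of the original 1996 robots exclusion protocol, so bots that only implement the original protocol will match these characters literally. A minimal sketch of the extended matching logic (plain Python, not any official parser):

    ```python
    import re

    def rule_to_regex(pattern: str) -> "re.Pattern":
        """Compile a Google-style robots.txt path pattern: '*' matches any
        run of characters, and a trailing '$' anchors the match at the end."""
        anchored = pattern.endswith("$")
        body = pattern[:-1] if anchored else pattern
        regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
        return re.compile("^" + regex + ("$" if anchored else ""))

    print(bool(rule_to_regex("/*.css$").match("/wp-content/themes/style.css")))  # True
    print(bool(rule_to_regex("/*.css$").match("/style.css?ver=2")))              # False: '$' excludes the query string
    print(bool(rule_to_regex("/wp-*").match("/wp-includes/js/jquery.js")))       # True
    ```

    Note that Disallow: /*.css$ would not stop a compliant crawler from fetching /style.css?ver=2, which is one reason versioned asset URLs can slip through such rules.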