What is the “best” setup for robots.txt?
I’m using the following permalink structure: /%category%/%postname%/
My robots.txt currently looks like this (copied from somewhere a long time ago):
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /trackback
Disallow: /comments
Disallow: /category/*/*
Disallow: */trackback
Disallow: */comments
- I want my comments to be indexed, so I can remove the two comments rules.
- Do I want to disallow indexing categories because of my permalink structure?
- An article can have several tags and be in multiple categories. This may cause duplicates in search providers like Google. How should I work around this?
Would you change anything else here?
FWIW, trackback URLs issue redirects and have no content, so they won’t get indexed.
And at the risk of not answering the question, RE your points 2 and 3:
http://googlewebmastercentral.blogspot.com/2008/09/demystifying-duplicate-content-penalty.html
Put otherwise, I think you’re wasting your time worrying about dup content, and your robots.txt should be limited to:
A lot of time has passed since this question and its answers were posted, and things have changed a lot. The typical recommendation to disallow crawler access to wp-content/themes, wp-content/plugins, wp-content/cache, wp-includes, and any other directory containing CSS or JS files the site needs is no longer valid.
For example, take Google. Googlebot used to process pages without CSS and JavaScript, but that is no longer the case. Googlebot now fetches the full document and checks things like responsiveness and the number, location, and size of scripts. Google therefore does not like it when you disallow Googlebot from accessing CSS and JS files. That means you should not disallow wp-content/themes, wp-content/plugins, wp-content/cache, or wp-includes, because all of those folders can serve CSS and JS files.
In my view, the best robots.txt file today is the one WordPress creates by default (the robots.txt below is the default since WP 4.0):
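
As a rough sketch of why such a minimal policy is safe for rendering, the check below assumes the default amounts to `User-agent: *` plus `Disallow: /wp-admin/` (an assumption, not quoted from this answer; your WordPress version may emit additional lines). It uses Python's standard `urllib.robotparser`; the example.com URLs and file names are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed minimal policy (not quoted from the answer): block only the admin area.
MINIMAL_POLICY = """\
User-agent: *
Disallow: /wp-admin/
"""

# Hypothetical asset URLs of the kind a theme, a plugin, or core might serve.
ASSETS = [
    "https://example.com/wp-content/themes/mytheme/style.css",
    "https://example.com/wp-content/plugins/myplugin/script.js",
    "https://example.com/wp-includes/js/jquery/jquery.js",
]


def is_allowed(robots_txt, url, agent="Googlebot"):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ASSETS:
        print(url, is_allowed(MINIMAL_POLICY, url))
    print("admin:", is_allowed(MINIMAL_POLICY, "https://example.com/wp-admin/options.php"))
```

Under this assumed policy every asset URL stays crawlable while the admin area does not, which is the property the answer argues for.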
If you have a cgi-bin folder, it may be a good idea to disallow it:
And if you use a sitemap, it is a good idea to include a sitemap reference in robots.txt (you still need to manually submit the sitemap to Google and Bing Webmaster Tools, but the reference can be useful to other crawlers):
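
To illustrate how a crawler can pick up the sitemap reference, here is a minimal sketch using Python's standard `urllib.robotparser` (its `site_maps()` method requires Python 3.8 or later). The rules and the sitemap URL are hypothetical placeholders, not the ones from this answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: a cgi-bin rule plus a sitemap reference.
# Replace the sitemap URL with the one your site actually serves.
ROBOTS_WITH_SITEMAP = """\
User-agent: *
Disallow: /cgi-bin/

Sitemap: https://example.com/sitemap.xml
"""


def declared_sitemaps(robots_txt):
    """Return the sitemap URLs declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    print(declared_sitemaps(ROBOTS_WITH_SITEMAP))
```

The `Sitemap` line is independent of any `User-agent` group, so it can sit anywhere in the file.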
That is the general case. Specific websites may need to disallow other folders and files, which should be evaluated case by case. For example, you may need or want to disallow a specific plugin folder:
To modify the robots.txt, use the robots_txt filter (if a physical robots.txt file exists, WordPress can no longer handle robots.txt itself). For example:
Have you looked at Yoast’s WordPress SEO plugin? It definitely handles robots.txt issues.
With a little bit of help, this is now mine (not too different from everyone else’s, apparently):
You should follow Joost de Valk’s current approach, where very little is blocked in robots.txt, but also understand that each site will have a uniquely appropriate policy that needs to be reviewed and changed over time.
Many of the answers given here previously are dated and will result in SEO self-sabotage, since Google now checks for “mobile friendliness”. Today Googlebot tries to load everything a normal browser does, including fonts, images, JavaScript, and CSS assets from /wp-content, /themes, /plugins, etc. (Morten Rand-Hendriksen recently blogged about this.)
You can use Google’s “mobile friendly” site checker to find out if your robots.txt file is sabotaging your site. If you use Google Webmaster Tools, you should receive alerts and emailed notices if there is a big problem.
Unless you are careful to make sure no key presentational or interactive assets are being loaded from disallowed folders, this is probably the bare minimum every WordPress install is safe with:
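
Alongside Google's own tools, a quick local audit can flag assets that a restrictive file would hide from a crawler. The sketch below uses Python's standard `urllib.robotparser`, which, as far as I know, does plain prefix matching and does not implement wildcard patterns such as `*/trackback`, so treat its verdicts as an approximation of Googlebot's behavior. The rules are a subset of the older restrictive style discussed in this thread; the asset URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Older, restrictive style of rules (illustrative subset, no wildcard lines).
OLD_STYLE_POLICY = """\
User-agent: *
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/themes
"""

# Hypothetical assets a page might load while being rendered.
ASSETS = [
    "https://example.com/wp-content/themes/mytheme/style.css",
    "https://example.com/wp-content/uploads/2015/01/photo.jpg",
    "https://example.com/wp-includes/js/jquery/jquery.js",
]


def blocked_assets(robots_txt, asset_urls, agent="Googlebot"):
    """Return the subset of asset_urls that `agent` may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in asset_urls if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    for url in blocked_assets(OLD_STYLE_POLICY, ASSETS):
        print("blocked:", url)
```

Here the theme stylesheet and the core script are reported as blocked while the uploaded image is not; an empty result for your own asset list suggests the file is not hiding anything the page needs.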
And don’t forget to add a sitemap:
Unfortunately, this more open policy recreates the potential for other problems that formerly led people to be more restrictive with robots.txt, such as plugin and theme developers including indexable pages with links back to their own sites. There is nothing to be done about this unless you are able to pore over all third-party code with a fine-tooth comb and move or remove things you don’t want indexed.
FYI, ALWAYS begin your permalink with a number. From experience, it speeds up the page because WordPress can quickly differentiate between a page and a post (I also read that somewhere else, then tried it, and it’s true). So
http://example.com/%month%/%post%
…will be fine.
I am just going to copy what I have. A lot of research went into this. It’s probably overkill! It does help Google recognize what the main keywords of your site are, as seen in Google Webmaster Tools. Hope it helps.