I’m using WordPress with custom permalinks, and I want to disallow my posts but leave my category pages accessible to spiders. Here are some examples of what the URLs look like:
Category page: somesite.com/2010/category-name/
Post: somesite.com/2010/category-name/product-name/
So, I’m curious whether there is some kind of regex-style rule that leaves the page at /category-name/ allowed while disallowing anything one level deeper (the second example).
Any ideas? Thanks! 🙂
Some information that might help.
There is no official standards body or RFC for the robots.txt protocol. It was created by consensus in June 1994 by members of the robots mailing list (robots-request@nexor.co.uk). The directives naming the parts that should not be accessed go in a file called robots.txt in the top-level directory of the website. The patterns in robots.txt are matched by simple substring comparison, so take care to append the final ‘/’ character to patterns meant to match directories; otherwise all files whose names start with that substring will match, rather than just those in the intended directory.
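For example, using the question’s URL scheme (a minimal illustration; /2010-archive.html is a hypothetical page):

```
User-agent: *
# No trailing slash: plain substring prefix, so this would also block a
# page named /2010-archive.html, not just URLs under the /2010/ directory
Disallow: /2010

# Trailing slash: only URLs inside the /2010/ directory are blocked
Disallow: /2010/
```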
There’s no 100% sure way to exclude your pages from being found, other than not to publish them at all, of course.
See:
http://www.robotstxt.org/robotstxt.html
There is no Allow field in the consensus, and regex patterns are not part of it either.
From the Robots Consensus:
This is currently a bit awkward, as there is no “Allow” field. The easy way is to put all files to be disallowed into a separate directory, say “stuff”, and leave the one file in the level above this directory:
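(Per the robotstxt.org FAQ linked above; the /~joe/ paths are its example, not the asker’s site.)

```
User-agent: *
Disallow: /~joe/stuff/
```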
Alternatively you can explicitly disallow all disallowed pages:
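(Again roughly per that FAQ, with its illustrative file names.)

```
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
```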
A Possible Solution:
Use .htaccess with SetEnvIf to keep search robots out of a specific folder while blocking bad robots outright; a sketch follows the link below.
See: http://www.askapache.com/htaccess/setenvif.html
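A minimal sketch of that approach (the user-agent strings are illustrative placeholders, not a vetted blocklist; the Order/Allow/Deny directives are Apache 2.2 syntax, while newer Apache versions use Require instead):

```
# .htaccess — flag unwanted bots by their User-Agent, then deny them
SetEnvIfNoCase User-Agent "BadBot"      bad_bot
SetEnvIfNoCase User-Agent "EvilScraper" bad_bot

Order Allow,Deny
Allow from all
Deny from env=bad_bot
```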
Would something like the following do the trick? You might need to explicitly allow certain folders under /2010/category-name/ as well. But according to this article, the Allow field is not within the standard, so some crawlers might not support it.
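A sketch of one way to write it, assuming the non-standard * wildcard that major crawlers such as Googlebot support (category-name is the placeholder from the question):

```
User-agent: *
Disallow: /2010/category-name/*/
```

And with an explicit Allow for a folder beneath it (allowed-folder is a hypothetical name):

```
User-agent: *
Allow: /2010/category-name/allowed-folder/
Disallow: /2010/category-name/*/
```

If the category names vary, a rule like Disallow: /2010/*/*/ would block anything two levels below /2010/ while leaving the category pages themselves crawlable.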
EDIT: I just found another resource to be used within each page. This page explains it well:
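Presumably the per-page resource meant here is the robots meta tag (the link itself is not shown above). It goes in each post’s head:

```
<meta name="robots" content="noindex, follow">
```

In WordPress, a theme could emit it only on single posts, for example via the is_single() conditional in header.php:

```
<?php if ( is_single() ) : ?>
<meta name="robots" content="noindex, follow">
<?php endif; ?>
```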