Detecting a WordPress URL without doing a full HTTP GET?

I’m trying to write a oneboxing routine that gives WordPress blog entries special treatment. So given a simple, unadorned URL in content, such as

http://blog.stackoverflow.com/2011/03/a-new-name-for-stack-overflow-with-surprise-ending/

How would I detect that this is a WordPress installation, ideally without doing a full HTTP GET on every URL I see?

There are certainly common conventions for WordPress URLs that we could start with, which would eliminate at least some URLs from contention. In this case it is …

http://example.com/year/month/slug-goes-here

But that isn’t a universal constant either.
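A minimal sketch of such a URL-shape pre-filter, in Python. The pattern and the function name `looks_like_date_permalink` are assumptions made for illustration. It only recognizes the date-based permalink shape shown above (with an optional day segment), so a miss proves nothing and a hit is merely a hint that a network probe may be worthwhile.

```python
import re
from urllib.parse import urlsplit

# Matches /YYYY/MM/slug or /YYYY/MM/DD/slug, with an optional trailing slash.
# Assumption: the blog lives at the site root; subdirectory installs won't match.
DATE_PERMALINK = re.compile(r"^/(\d{4})/(\d{2})/(?:(\d{2})/)?([^/]+)/?$")


def looks_like_date_permalink(url: str) -> bool:
    """Return True if the URL path has the /year/month/slug shape."""
    path = urlsplit(url).path
    return DATE_PERMALINK.match(path) is not None
```

Because this check needs no network access, it can run on every URL before deciding whether a HEAD request is worth sending.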

I tried looking at the headers of that URL using HTTP HEAD, and I see:

Connection:Keep-Alive
Content-Encoding:gzip
Content-Length:18340
Content-Type:text/html; charset=UTF-8
Date:Thu, 07 Jun 2012 07:07:38 GMT
Keep-Alive:timeout=15, max=100
Server:Apache/2.2.9 (Ubuntu) DAV/2 PHP/5.2.6-2ubuntu4.2 with Suhosin-Patch mod_ssl/2.2.9 OpenSSL/0.9.8g
Vary:Cookie,Accept-Encoding
WP-Super-Cache:Served legacy cache file
X-Pingback:http://blog.stackoverflow.com/xmlrpc.php
X-Powered-By:PHP/5.2.6-2ubuntu4.2

I don’t think relying on the presence of WP-Super-Cache would be particularly reliable, and that’s the only thing I see in the headers that would help, so maybe there are zero common HTTP headers in a WordPress install?

6 comments

  1. From my experience and a quick code search, WP does not deliberately identify itself in its headers. However, some headers seem distinctive enough and are not likely to be customized.

    The response to a HEAD request for /wp-login.php will contain the following for a .org install:

     Set-Cookie: wordpress_test_cookie=WP+Cookie+check; path=/
    

    And for .com:

    Set-Cookie: wordpress_test_cookie=WP+Cookie+check; path=/; domain=.wordpress.com
    

    The cookie name is customizable by defining the TEST_COOKIE constant, but the WP Cookie check string is hardcoded in core, as is the setcookie() call that sends it in the file’s source.

    For locating wp-login.php there are some URL shortcuts, implemented in wp_redirect_admin_locations() since WP 3.4 (see ticket #19607):

    /login on the site’s root does a 302 redirect to wp-login.php, wherever it is.

    So the only scenario that cannot be reliably detected is when WP is installed in and confined to a subdirectory, without being used to manage the site’s root at all.
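The wp-login.php check from this answer could be sketched as follows in Python. The helper names `has_wp_test_cookie` and `fetch_login_cookies` are hypothetical, and the sketch assumes wp-login.php sits at the site root. It matches on the hardcoded cookie value rather than the cookie name, since the name can be changed via TEST_COOKIE.

```python
import http.client
from urllib.parse import urlsplit

# The cookie value is hardcoded in core; "+" is the URL-encoded space.
WP_COOKIE_MARKER = "WP+Cookie+check"


def has_wp_test_cookie(set_cookie_values):
    """Return True if any Set-Cookie header value carries the WP test cookie value."""
    return any(WP_COOKIE_MARKER in value for value in set_cookie_values)


def fetch_login_cookies(site_root, timeout=5.0):
    """Send HEAD /wp-login.php and return all Set-Cookie header values.

    Assumption: wp-login.php is at the site root. http.client does not
    follow redirects, so a redirect response is returned as-is.
    """
    parts = urlsplit(site_root)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", "/wp-login.php")
        response = conn.getresponse()
        return response.headers.get_all("Set-Cookie") or []
    finally:
        conn.close()
```

A caller would combine the two, for example `has_wp_test_cookie(fetch_login_cookies("http://blog.stackoverflow.com/"))`. Keeping the header check separate from the network call makes it easy to test against captured headers.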

  2. Look at the X-Pingback header, which gives the location of xmlrpc.php. Then send a HEAD request to wp-feed.php in the same directory as xmlrpc.php (this works even for subdirectory installations). On WordPress, the response will carry a Location header containing the string feed.

    In your example for blog.stackoverflow.com you’ll get:

    HTTP/1.1 301 Moved Permanently
    Date: Thu, 07 Jun 2012 07:30:10 GMT
    Server: Apache/2.2.9 (Ubuntu) DAV/2 PHP/5.2.6-2ubuntu4.2 with Suhosin-Patch mod_ssl/2.2.9 OpenSSL/0.9.8g
    X-Powered-By: PHP/5.2.6-2ubuntu4.2
    Location: http://blog.stackoverflow.com/feed/
    Vary: Accept-Encoding
    Content-Type: text/html; charset=UTF-8
    

    The mere existence of a file named xmlrpc.php is not proof enough; anybody can give a file that name.

    Caveat: The X-Pingback header can be disabled by filtering 'wp_headers'. So my suggestion is not bullet-proof.

    Related: Steps to Take to Hide the Fact a Site is Using WordPress?
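A rough Python sketch of the two pieces of this answer: deriving the wp-feed.php URL from the X-Pingback value, and judging the redirect. Both function names are hypothetical. Ignoring a Location that merely repeats wp-feed.php (for example an http-to-https redirect) is a guard added here, not something the answer specifies.

```python
from urllib.parse import urljoin, urlsplit


def feed_probe_url(x_pingback: str) -> str:
    """Build the wp-feed.php URL that sits next to the advertised xmlrpc.php."""
    return urljoin(x_pingback, "wp-feed.php")


def redirect_points_to_feed(status: int, location) -> bool:
    """Return True if the HEAD response is a redirect whose target path contains 'feed'.

    Added guard (an assumption): the probe file name itself is removed first,
    so a redirect back to wp-feed.php does not count as a match.
    """
    if status not in (301, 302) or not location:
        return False
    path = urlsplit(location).path
    return "feed" in path.replace("wp-feed.php", "")
```

With the headers from the question, `feed_probe_url("http://blog.stackoverflow.com/xmlrpc.php")` yields the URL to send the HEAD request to, and the status and Location of that response are then passed to `redirect_points_to_feed`.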

  3. Append the URL with ?page_id=-1 and do an HTTP HEAD request for that.

    On self-installed WordPress blogs, this will result in a 404 response.

    On wordpress.com blogs, this will result in a 301 response (which ends up at a 200 response if you follow the redirect).

    On non-WordPress sites, you should get a 200 response (assuming the original URL without the query string gave you a 200) – the query string should make no difference.

    Example with a HEAD request for http://blog.stackoverflow.com/2011/03/a-new-name-for-stack-overflow-with-surprise-ending/?page_id=-1:

    HTTP/1.1 404 Not Found
    Server: Apache/2.2.9 (Ubuntu) DAV/2 PHP/5.2.6-2ubuntu4.2 with Suhosin-Patch mod_ssl/2.2.9 OpenSSL/0.9.8g
    Content-Encoding: gzip
    Vary: Cookie,Accept-Encoding
    Cache-Control: no-cache, must-revalidate, max-age=0
    Last-Modified: Thu, 07 Jun 2012 08:53:01 GMT
    Date: Thu, 07 Jun 2012 08:53:01 GMT
    Keep-Alive: timeout=15, max=100
    Expires: Wed, 11 Jan 1984 05:00:00 GMT
    Pragma: no-cache
    Connection: Keep-Alive
    X-Powered-By: PHP/5.2.6-2ubuntu4.2
    X-Pingback: http://blog.stackoverflow.com/xmlrpc.php
    Content-Type: text/html; charset=UTF-8
    

    Example with a HEAD request for http://dailycrave.wordpress.com/2012/06/01/three-cheese-grilled-pizza/?page_id=-1 (follow redirects turned off):

    HTTP/1.1 301 Moved Permanently
    X-Pingback: http://dailycrave.wordpress.com/xmlrpc.php
    Server: nginx
    Expires: Wed, 11 Jan 1984 05:00:00 GMT
    X-Hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
    Location: http://dailycrave.wordpress.com/2012/06/01/three-cheese-grilled-pizza/
    Pragma: no-cache
    Cache-Control: no-cache, must-revalidate, max-age=60
    Connection: close
    Last-Modified: Thu, 07 Jun 2012 09:01:09 GMT
    Content-Type: text/html; charset=UTF-8
    Date: Thu, 07 Jun 2012 09:01:09 GMT
    

    (Note the X-Hacker easter egg!)

    If you follow the 301 redirect for the wordpress.com blog, you end up with this:

    HTTP/1.1 200 OK
    Server: nginx
    Vary: Accept-Encoding, Cookie
    Last-Modified: Thu, 07 Jun 2012 09:48:26 GMT
    Cache-Control: max-age=172, must-revalidate
    Connection: close
    Date: Thu, 07 Jun 2012 09:50:34 GMT
    Transfer-Encoding: Identity
    Content-Encoding: gzip
    Link: <http://wp.me/pXGqK-27g>; rel=shortlink
    X-Pingback: http://dailycrave.wordpress.com/xmlrpc.php
    Content-Type: text/html; charset=UTF-8
    X-Nananana: Batcache
    X-Hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
    

    Note the “Link” header containing the http://wp.me/ URL, which seems to be common to all wordpress.com hosted blogs and could be used to identify them.

    I believe this works because passing ?page_id=-1 in the URL overrides the default routing from the URL segments. There will not be a page with ID of -1, and so a 404/redirect is served instead.
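The decision table from this answer could be sketched in Python like this. The function names and the returned labels are made up for illustration, and the mapping simply encodes the observations above (404, 301, 200) without claiming they hold for every site. The status code is expected to come from a HEAD request with redirect following turned off.

```python
from urllib.parse import urlsplit, urlunsplit


def with_page_id_probe(url: str) -> str:
    """Append page_id=-1 to the URL's query string, keeping any existing query."""
    parts = urlsplit(url)
    query = parts.query + "&page_id=-1" if parts.query else "page_id=-1"
    return urlunsplit(parts._replace(query=query))


def classify_probe(status: int) -> str:
    """Map the status of the probe response to a guess (labels are illustrative)."""
    if status == 404:
        return "self-hosted WordPress"
    if status == 301:
        return "wordpress.com"
    if status == 200:
        return "probably not WordPress"
    return "unknown"


def has_wpme_shortlink(link_header) -> bool:
    """Check a Link header value for a wp.me shortlink, as seen on wordpress.com blogs."""
    if not link_header:
        return False
    return "wp.me/" in link_header and "shortlink" in link_header
```

Note that the 200 case only means something if the original URL, without the query string, also returned 200.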

  4. WP Super Cache is not present on every WordPress installation, nor is there any fixed format for the URLs. While the permalink settings page does offer some preset URL schemes, anyone can use a custom URL scheme instead. For example, if someone decides to use only the page/post name in the URL, it is more or less impossible to tell from the URL that it is a WordPress website.

    The presence of xmlrpc.php can be used for detection, but again, this can be disabled.

    And finally, even if you do a full GET on the URL, it is still not 100% possible to detect whether the page is built using WordPress. It all depends on the theme template and how it was developed.

    One fairly reliable way is to look for the presence of wp-login and wp-admin. Even these could be moved, but this is the approach I’d go for.

  5. Two alternatives to the other answers. One option is to set your own WordPress header by dropping this in your theme’s functions.php:

    // Send a custom response header on every front-end request.
    add_action('template_redirect', 'add_wp_header');
    function add_wp_header() {
        header('Type: WordPress');
    }
    

    Another option is the WPScan fingerprinter (written in Ruby). It goes through several steps to try to figure out whether WordPress is being used, such as looking for the plugin directory, theme name, meta tags, readme, etc. (I have no idea how accurate this actually is): http://code.google.com/p/wpscan/source/browse/#svn%2Ftrunk%2Flib%2Fwpscan

  6. How about sending a HEAD request to one of the files whose name starts with the prefix wp-?
    Ideally look at wp-login.php. If it exists, the website is very likely running WordPress.