
The Indie Publisher's mod_rewrite Recipe Book

The Apache module mod_rewrite is a powerful tool in the webmaster's toolbox. However, the Apache documentation for mod_rewrite seems to be aimed at large educational and corporate institutions. Independent publishers have special needs which that documentation doesn't address directly. In particular, mod_rewrite is an important part of any growth strategy, because it allows a smooth transition to new architectures as your site grows.

This article is a collection of recipes using mod_rewrite which I use myself or developed in response to questions on user forums. These recipes are meant to be stored in a file named ".htaccess" in your public_html directory (or a subdirectory of public_html, depending on where you want the rules to apply).

RewriteBase mysteries

mod_rewrite has a reputation for being voodoo magic, and one of its darkest voodoo secrets is the RewriteBase directive. Because the proper use of RewriteBase depends on the configuration of your web host and the particular directory in which the rule will be applied, I can't include a RewriteBase directive in these recipes that would be appropriate for every situation. Often you can get away without using a RewriteBase, but you will need to test on your server to find out for sure. If a recipe does not work without a RewriteBase, and the .htaccess file is in your public_html folder, add the following:

RewriteBase /

If the .htaccess file is in a subfolder, use that subfolder as the rewrite base. For example, if .htaccess is in public_html/folder1/folder2, use

RewriteBase /folder1/folder2/

Advanced recipes require Regular Expressions

Most useful mod_rewrite recipes depend on Regular Expressions. To keep this article from getting too long, I'm not explaining them here, but Wikipedia has a good article on Regular Expressions.
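
If regular expressions are new to you, the following key covers the handful of constructs that appear in the recipes below; the rule at the end is hypothetical, included only to show the pieces in context:

# ^      matches the beginning of the requested path
# $      matches the end of the requested path
# \.     matches a literal dot (an unescaped . matches any single character)
# .*     matches any run of characters, possibly empty
# [^/]+  matches one or more characters other than a slash
# (...)  captures whatever it matches, for reuse as $1, $2, and so on
RewriteRule ^([^/]+)\.html$ page.php?name=$1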

Basic redirect (invisible)

Let's start off with a simple but useful recipe for the case when a file is renamed. Renaming a file does not have to break any existing links to it. Instead, use an invisible redirect from the old file name to the new file name. It looks like this:

RewriteEngine on
RewriteRule ^oldname\.html$ newname.html

When this rule is used, a reader who browses to the URL http://example.com/oldname.html will actually see the content at http://example.com/newname.html. But the browser location bar will continue to show the old name.
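
The same technique extends to a whole folder. As a rough sketch, assuming you have moved the contents of oldfolder to newfolder (both names are placeholders), a captured group carries the rest of the path across:

RewriteEngine on
RewriteRule ^oldfolder/(.*)$ newfolder/$1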

Basic redirect (visible)

The invisible redirect ensures that your readers will still find your content after a file is renamed, but they will not find out about the new name and will continue to use the "wrong" name for bookmarks. Search engines will also continue to use the old name. When you want to publicize the new name, you make the change visible using an external redirect. It looks like this:

RewriteEngine on
RewriteBase /
RewriteRule ^oldname\.html$ newname.html [R=301,L]

The "R" is a special directive which tells Apache to use an external redirect. When this happens, the web server actually sends a message to the user's web browser notifying it that the address has changed and specifying the new address. In response, the browser will request the new address and update the location bar to show that address. This happens transparently to the user, so all they are aware of is that the address in the location bar changes.

Canonical URL

Most web hosts will make sure that anybody who tries to access your web page at www.example.com will see the same page as they would see at example.com (without the "www"), but it can be advantageous for your search engine rankings to use one or the other consistently. The following recipe does an external redirect that removes the "www", so your site will always be accessed as example.com:

RewriteEngine on
RewriteCond %{HTTP_HOST} ^www\.mysite\.com
RewriteRule ^(.*)$ http://mysite.com/$1 [R=301,L]

If you prefer to include the "www", use this recipe instead:

RewriteEngine on
RewriteCond %{HTTP_HOST} ^mysite\.com
RewriteRule ^(.*)$ http://www.mysite.com/$1 [R=301,L]

In both cases, replace "mysite" with your own domain name.
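
If you maintain more than one site, a generic variant (shown here only as a sketch) strips a leading "www." from whichever host name was requested, so the domain does not have to be typed into the rule at all. The %1 refers to the text captured by the RewriteCond:

RewriteEngine on
RewriteCond %{HTTP_HOST} ^www\.(.+)$ [NC]
RewriteRule ^(.*)$ http://%1/$1 [R=301,L]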

Blocking bad robots

Although "good" robots like Googlebot, Yahoo! Slurp, and MSNBot help attract readers, there are also "bad" bots crawling the Internet which may be used to harvest email addresses for spammers or steal your content. One technique for blocking these bad bots is to use the robots.txt exclusion standard, but some bad bots ignore robots.txt. For these situations, there is a recipe using mod_rewrite:

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtractor [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [NC]
RewriteRule ^.* - [F,L]

This recipe blocks two bad bots. It can be extended by adding additional lines modelled after the second line. (There is a very extensive bad bot blocker using this technique on the Bluehost forum).
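
As an illustration of the pattern, the version below blocks two more bots; the names "BadBot" and "SiteSucker" are only placeholders, so substitute user agent strings you actually see in your logs. Note that every condition except the last one carries the OR flag:

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtractor [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^BadBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSucker [NC]
RewriteRule ^.* - [F,L]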

Surviving the Slashdot effect

Popularity can be harmful to the health of your web server. If you ever publish an article or blog post that gets linked from one of several very popular sites, the much-higher-than-average traffic can send response times on your server through the roof. In a situation like this it helps to know about a free caching service, the Coral Content Distribution Network, which is accessed simply by adding a suffix to your URL. For example, if you expect the content at

http://www.example.com/article.html

to be very popular, you can direct people instead to

http://www.example.com.nyud.net:8080/article.html

This will cause the server at nyud.net to cache a copy of your article and serve it to visitors, decreasing the load on your server by orders of magnitude. The following recipe uses mod_rewrite to redirect visitors to nyud.net automatically if they are coming from one of five popular sites:

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} !^CoralWebPrx
RewriteCond %{QUERY_STRING} !(^|&)coral-no-serve$
RewriteCond %{HTTP_REFERER} ^http://(www\.)?digg\.com [OR]
RewriteCond %{HTTP_REFERER} ^http://(www\.)?slashdot\.org [OR]
RewriteCond %{HTTP_REFERER} ^http://(www\.)?del\.icio\.us [OR]
RewriteCond %{HTTP_REFERER} ^http://(www\.)?engadget\.com [OR]
RewriteCond %{HTTP_REFERER} ^http://(www\.)?boingboing\.net
RewriteRule ^(.*)$ http://www.mysite.com.nyud.net:8080/$1 [R,L]

To use this recipe, replace "www.mysite.com" with your domain and add the recipe to the .htaccess file in your public_html folder.

[Note: nyud.net was down last time I tested it. Don't use this recipe without first testing the server.]

Hotlink protection

Some people have the annoying habit of stealing your bandwidth by linking directly to popular images on your server. If you study your log files and notice this happening, take action by using mod_rewrite for hotlink protection. What this means is that when serving up image files, Apache will check the referring URL, that is, the web page requesting the file. If that page is not on a list of sites you specify, the image file will be blocked. The recipe looks like this:

RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?mysite\.com/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://(www\.)?myothersite\.com/.*$ [NC]
RewriteRule .*\.(gif|jpg|jpeg|png)$ - [F]

You can put as many servers as you like on the "approved" list by adding more lines like line 3.
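
A common variant, sketched below, serves a small placeholder image instead of an error. The file name nohotlink.png is hypothetical; create whatever placeholder you like in your public_html folder. The extra condition keeps the placeholder itself from being rewritten, which would otherwise cause a loop:

RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?mysite\.com/ [NC]
RewriteCond %{REQUEST_URI} !nohotlink\.png$
RewriteRule \.(gif|jpg|jpeg|png)$ nohotlink.png [L]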

Search engine friendly URLs

As you begin using programs written in PHP (for example) to serve dynamic content that is pulled from a database, your URLs may become much more complex. A URL for an article from a content management system in a particular category and subcategory might look like this:

http://www.example.com/index.php?cat=foo&subcat=bar

We would like to allow readers to find this article using the more friendly URL

http://www.example.com/foo/bar

The mod_rewrite recipe to accomplish this task is fairly simple:

RewriteEngine on
RewriteCond %{QUERY_STRING} ^$
RewriteRule ^([^/]+)/([^/]+)$ index.php?cat=$1&subcat=$2

Although these are called search engine friendly URLs, a primary goal of using them is to make your URLs friendly to your readers (easy to type and remember). In the opinion of the author, Google and Yahoo employ plenty of clever PhDs who can write a web robot that crawls your site successfully even if you use dynamic URLs. Use friendly URLs anyway if they improve the experience for your readers.
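
One refinement worth considering, again just a sketch of the same recipe, is to tolerate an optional trailing slash so that /foo/bar and /foo/bar/ both reach the same article:

RewriteEngine on
RewriteCond %{QUERY_STRING} ^$
RewriteRule ^([^/]+)/([^/]+)/?$ index.php?cat=$1&subcat=$2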

Static to dynamic transition

Let's say that you begin publishing using a standard static HTML approach. Later you realize that you need to include dynamic content, such as site navigation, in your static HTML, but you want readers and search engines to continue to find your content at the URLs of the static files. Using mod_rewrite, it's possible to redirect your existing static HTML pages to similarly-named PHP files which can add whatever dynamic content you require.

Although it would be possible to do this fairly simply on a case-by-case basis, this recipe attempts to solve the problem in the general case by redirecting all HTML files to similarly-named PHP files, if they exist. Thus if the user tries to load example.com/foo.html, and there is a file in the same folder named foo.html.php, foo.html.php will be loaded instead. If foo.html.php does not exist, foo.html will load as usual. In either case, the user's location bar will display example.com/foo.html, so they will be completely unaware of the rewrite. The recipe looks like this:

RewriteEngine on
RewriteCond %{REQUEST_FILENAME}.php -f
RewriteRule ^(.*\.html) $1.php

Erasing the query string

In some cases you may have alternate versions of an article intended for different media, where the alternate version is selected with a query string. If you later simplify your site and eliminate the alternate version, search engines may continue to point to the "wrong" version of the page. In that case you would like to do an external redirect so the search engines pick up the right version of the page. For example, let's say you started out with a printer-friendly version of each article, accessed with a query string as follows:

http://www.example.com/article.html?media=print

The search engines picked up this URL, but now you have decided to eliminate it and use a standard version without a query string:

http://www.example.com/article.html

To redirect to the standard version of the page, publicizing the new URL, use the following recipe:

RewriteEngine on
RewriteBase /
RewriteCond %{QUERY_STRING} ^media=print$
RewriteRule ^article\.html$ article.html? [R=301,L]

The trailing "?" on the substitution is what erases the query string; without it, mod_rewrite would pass the original query string through to the new URL. This rule seems to require a rewrite base. If you use this rule in a subdirectory, change the rewrite base appropriately.
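
If you retired the printer-friendly version of every article at once, the same idea generalizes, as a sketch, to any .html page; the capture keeps the path and the trailing "?" still discards the query string:

RewriteEngine on
RewriteBase /
RewriteCond %{QUERY_STRING} ^media=print$
RewriteRule ^(.*\.html)$ $1? [R=301,L]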

Content for mobile devices

More and more of your readers will be using mobile devices in the future, and you can improve their experience of your site by designing content specifically for those devices. Showing how to design that content is beyond the scope of this article, but once you have mobile content it's easy to use mod_rewrite to send mobile devices to it automatically. The following recipe simply ORs together a series of conditions, each of which matches one mobile device. If there is another mobile device you need to match, it is easy to add a line for it. The only tricky feature of this recipe is that you need to prevent an infinite loop; that is the purpose of the last rewrite condition:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Windows\ CE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} BlackBerry [NC,OR]
RewriteCond %{HTTP_USER_AGENT} NetFront [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Opera\ Mini [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Palm\ OS [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Blazer [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Elaine [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WAP.*$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Plucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} AvantGo [NC]
RewriteCond $1 !^mobile
RewriteRule ^(.*)$ mobile/$1 [L]

Store this recipe in the .htaccess file in your public_html folder after creating your mobile content in the folder public_html/mobile. You can test the recipe using the Opera Mini Simulator.
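
To match one more device, say one whose user agent contains "iPhone", add a condition above the final user agent line, remembering that every user agent condition except the last needs the OR flag. The end of the recipe would then read:

RewriteCond %{HTTP_USER_AGENT} iPhone [NC,OR]
RewriteCond %{HTTP_USER_AGENT} AvantGo [NC]
RewriteCond $1 !^mobile
RewriteRule ^(.*)$ mobile/$1 [L]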

Debugging: infinite loops and MaxRedirects

Consider the following mod_rewrite recipe:

RewriteEngine on
RewriteRule ^([^/]*)$ index.php?cat=$1 [L]

This looks innocent enough. It is meant to match a category entered as a URL and rewrite it to a dynamic PHP script with a query string. Thus

http://www.example.com/foo

would redirect to

http://www.example.com/index.php?cat=foo

However, a problem is lurking here. The rewritten path, "index.php", still matches the pattern "^([^/]*)$" on the left hand side. Because mod_rewrite in a per-directory context re-applies the rule set to the rewritten URL, this recipe will create an infinite loop.

The solution (in this case) is to test the query string:

RewriteEngine on
RewriteCond %{QUERY_STRING} ^$
RewriteRule ^([^/]*)$ index.php?cat=$1 [L]

In general, be aware that you may have to choose patterns carefully so that the replacement on the right hand side does not itself match the pattern on the left hand side; otherwise you can cause an infinite loop. Newer versions of Apache limit the number of internal redirects (via the MaxRedirects rewrite option or, in later versions, the LimitInternalRecursion directive) and will terminate the loop before it hangs the server, but the reader will see an internal server error, so loops should still be avoided.
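
Another common guard, sketched below, is to skip the rewrite whenever the request already corresponds to a real file or directory, so that index.php (or any other existing file) is never rewritten a second time:

RewriteEngine on
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^([^/]*)$ index.php?cat=$1 [L]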

Further reading

This article is just a recipe book and doesn't provide detailed technical explanations of all the recipes. Although you can accomplish a great deal with mod_rewrite by studying these examples and modifying them experimentally, to master mod_rewrite you should read the documentation at Apache.org.



Copyright © 2006-2008 Stephen Jungels. Written permission is required to repost or reprint this article


Last modified: Mon Oct 26 10:31:23 CDT 2009