A Simple Guide To Robots.txt
Robots, also known as web spiders, crawlers, or any number of other insect-themed names, are automated programs that travel from website to website, mapping pages and gathering data.
They have many different uses, ranging from legitimate web indexing and search engine development (which I will be talking more about later) to less scrupulous ones: harvesting email addresses and other personal information that can be sold on to spam agencies, and facilitating fraud and identity theft.
The standard for robot exclusion was agreed on 30th June 1994 by the majority of robot authors and other interested parties. Though it was launched as a standard, it isn't backed by any official body, and there is no rule forcing a robot to obey the file, so some robots may ignore it regardless of what you put in it.
Robots.txt can be used to direct the flow of these automated scripts around your website. It can tell a robot to avoid a page, a subdomain, or an entire site. It is important to note, however, that while a robot may avoid a page, the page and the robots.txt file itself are still public, so it is not a good idea to use robots.txt to hide data.
Robots.txt is an essential tool for anyone looking to optimise website 'crawling'. Whether you create the file yourself or have a professional company do it for you, making sure the flow of robots is properly controlled is well worth the effort.
The structure of a robots.txt file is very simple, but if it isn't written correctly, robots will simply ignore it.
The example below shows a robots.txt file that tells all robots to ignore the entire site. If you want to add reminders to yourself about what your specific directives do, you can use a '#' at the start of a line to denote a comment.
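A minimal version might look like this (the directives are part of the standard; only the comment text is illustrative):

```
# Block every robot from the entire site
User-agent: *
Disallow: /
```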
The next example tells robots to ignore any pages whose paths start with '/fried chicken/', and also shows how comments can be used.
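A sketch of that file, reusing the '/fried chicken/' path from above:

```
# Keep all robots out of the fried chicken section
# (everything else remains crawlable)
User-agent: *
Disallow: /fried chicken/
```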
You can also use the 'User-agent' line to tell all robots to avoid a page or directory while exempting one specific robot from that rule. This is particularly useful if you have your own robot that you want to trawl a page or subdomain that you would rather other robots ignored. The example below shows this.
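A sketch of that setup, where 'MyBot' is a hypothetical name standing in for your own crawler (under the original standard, an empty Disallow line means nothing is disallowed):

```
# All other robots: stay out of the private area
User-agent: *
Disallow: /private/

# MyBot (our own crawler) may crawl everything
User-agent: MyBot
Disallow:
```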
When we start to talk about robots and SEO, the advantages of being able to tell a search engine spider to avoid a specific page or set of pages are quite clear.
If your site has a login page or a 'contact us' page that doesn't add value to the site and doesn't need indexing by a search engine, it is more beneficial to tell the spiders to ignore those pages, which will, in turn, let them spend their 'crawl quota' on pages of higher value.
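As a sketch, with hypothetical paths standing in for your own low-value pages:

```
# Don't spend crawl quota on pages that add no search value
User-agent: *
Disallow: /login/
Disallow: /contact-us/
```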
So when you’re looking to make the most of this very simple tool, you need to remember a few simple rules.
- You only want the robots to index the pages on your site that add value.
- If you don’t want your customers visiting a page, you probably don’t want robots going there either.
- Remember that any page on your site that isn’t adding value as it’s crawled is consuming ‘crawl quota’ that could be spent somewhere more worthwhile.
Learn to create even a simple robots.txt file to tell the search engines where to look, and you could notice a marked improvement in the value of your site.