There are multiple ways in which you can control, regulate and even manipulate how search engine bots crawl and index your website, and how it’s then presented to end users in SERPs.
You usually don’t need to worry about these unless you’re operating a really large website that could benefit greatly from advanced crawl budget optimisation techniques. In most cases, it’s enough to leave the default settings alone.
But if, for whatever reason, you want to tweak how your website gets crawled, the best place to start is the robots.txt file.
In this post, we’ll be looking at what a robots.txt file is, what it does, when you need it (and, more importantly, when you don’t), the best robots.txt practices and some better, case-specific solutions.
Just take a look at the file name robots.txt and you’ll know what it is.
Robots.txt is, simply put, a regular text file that contains directives. These directives, when followed by search engine bots, tell them if, when and how to crawl your website.
The robots.txt file is the face of the entire Robots Exclusion Protocol (or, the Robots Exclusion Standard) that is used by search engines around the world. So, this is what typically happens when a search engine bot (or, a user agent) comes ‘crawling’ to your website:
1. It validates the domain and the website presence.
2. It immediately checks if your website contains a robots.txt file in the root directory.
3. If it finds a robots.txt file, it obeys whatever directives are in there.
4. If it doesn’t, normal crawling operations resume.
1. If your website has a robots.txt file, it’ll be visible to everyone out there. The logic is simple – the robots.txt file is essentially the entry door to your website, and everyone can see it.
2. The directives inside the robots.txt file aren’t always followed by bad actors. Data scraping bots, for example, couldn’t care less about your robots.txt file. You can, however, rest easy because mainstream search engines follow these directives to the letter.
3. If you’ve got subdomains, each one will require a separate robots.txt file.
Robots.txt, as we said earlier, is a part of the REP. It has one main job – to tell search engine bots which pages/directories to block from being crawled.
There are additional directives that a robots.txt file can contain. Common examples include the robots.txt crawl-delay and sitemap directives.
Before we go any further, it’s important to make a clear distinction between crawling and indexing (trust me, many SEOs use these terms liberally and interchangeably – not okay at all!).
Crawling, simply put, is the process through which search engine bots make sense of the code of your website. It includes pretty much everything (except, usually, scripts) that there is on your website. When we say a website is crawled, we mean that search engines have found, assessed and read its code successfully.
Indexing is the process through which search engines add the crawled pages to their own indexes. Each search engine then chooses how and when to display the indexed data when relevant search queries come their way.
Even though these are pretty basic concepts, the importance of proper crawling and indexation is often underestimated by marketers and businesses. Our technical SEO audit is designed specifically to root out all crawlability and indexation issues (along with other performance factors). It’s a smooth process that every website needs to undergo on a regular basis (if they want to rank, that is). To know more, do feel free to get in touch with us.
The best way to check your indexation status is to visit your Google Search Console dashboard. Here’s a quick tutorial for that.
A faster way to find indexation results is to run a site:yourdomain search on Google. For example, we could find the indexation status for our pages at site:hqseo.co.uk.
In most cases, the robots.txt file is nothing more than a short, sweet couple of lines of directives – like the example you’ll see just below.
The most common directive used in robots.txt files is the disallow directive. It tells search engine bots NOT to crawl the specified URL/directory. It can be followed by the allow directive, which overrides the disallow directive for specified sub-directories or URLs. Googlebot responds to the allow directive fairly well.
Here’s an example:
User-agent: *
Disallow: [directory/file that shouldn't be crawled]
Allow: [directory/file that should override that Disallow directive and be crawled]
The short answer is – no.
Most websites can do without a robots.txt file. In fact, we’ve seen countless instances of SEO problems that arise from faulty robots.txt files. If you want to take one thing away from this page, take this: there’s always a good chance that you don’t need a robots.txt file at all (especially if your website doesn’t have thousands of pages). And when you do, let experts handle it.
Here’s what Google has to say about this issue.
So, yeah – there’s the answer you were looking for. Robots.txt has the potential to block your entire website from being crawled if implemented wrongly. Fiddle with it only when you know what you’re doing. If you don’t know where to start, our mini technical SEO audit has you covered.
There are multiple scenarios in which adding crawl directives directly to the robots.txt makes sense. Here are some of them:
Every decent website has internal search functionality. If your website is built the right way, internal search results shouldn’t be public (and hence visible to search engine bots) in the first place. But if they still are, you can add a disallow directive to block them from being crawled. Here’s an example:
User-agent: *
Disallow: /*?s=*
#Alter the disallow directive as per your website's search URL format.
This robots.txt disallow directive asks ALL search engines not to crawl URLs that contain ‘?s=’ (a typical characteristic of internal search result URLs).
More junk pages for bots to crawl = fewer important pages crawled per cycle.
By blocking pages that are unimportant from the SEO/UX points of view, you keep each crawl cycle short, sweet and to-the-point. Crawl budget optimisation plays a huge role in how efficiently your website responds to various SEO efforts. Even though Google has said it’s overrated, it’s important to note that their perspective is based on their bots getting smarter every year – something we have no control over.
This isn’t ideal – but many websites do it on a regular basis. If you’ve got a page that has a good deal of duplicate content, you can block it directly from being crawled via the robots.txt file. I, however, don’t recommend this method (nor does Google).
Duplicate content can cause enormous SEO damage that can be avoided with a thorough technical SEO audit and sitewide content audit. An easier, much more sensible way is to use rel=canonical tags to tell search engine bots the original source of the content, as shown in the sketch below.
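Here’s what that looks like – a minimal sketch with a placeholder URL, placed in the <head> of the duplicate page and pointing at the original version:
<!-- In the <head> of the duplicate page – example.com is a placeholder -->
<link rel="canonical" href="https://www.example.com/original-page/" />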
Most websites have a lot of back-end clutter that doesn’t need to be crawled at all. Admin pages or under-construction pages – Google has no business knowing what goes on there (and don’t worry, these have zero SEO value because they won’t rank for anything worth ranking for). These should go straight into a disallow directive in your robots.txt file.
Other examples include PDF files, downloadable content files, certain scripts and so on.
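Here’s a rough sketch of what that could look like – every path below is a placeholder (for instance, /wp-admin/ only applies if you’re on WordPress), so swap in your own back-end locations and file types:
User-agent: *
Disallow: /wp-admin/
Disallow: /under-construction/
Disallow: /*.pdf$
#Placeholder paths – adjust to your own admin areas and downloadable files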
Should you disallow pages with sensitive data in robots.txt? There are exceptions – but in most cases, no. You shouldn’t.
The best way to deal with such pages is to let them get crawled but NOT indexed. Thus, you should ideally be using the noindex meta directive on such pages.
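A minimal sketch of what that looks like – the tag sits in the <head> section of the page in question:
<!-- In the <head> of the page that shouldn't be indexed -->
<meta name="robots" content="noindex">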
Password-protected pages like My Account, Cart or Orders will automatically be blocked from being crawled (but the stuff that’s not behind the pay/login wall WILL be indexed, as long as there’s no noindex directive).
The Crawl-delay Directive
Since the robots.txt file is the first real piece of your website search engines will have access to, you can control the crawl rate directly with the robots.txt crawl-delay directive. Again, I don’t advise doing this unless absolutely necessary and your website contains thousands of pages. There are easier ways to achieve this from your Search Console dashboard, as well.
Be careful with the robots.txt crawl-delay directive. It can seriously hurt your SEO if there are many, many pages to be crawled.
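If you do decide to use it, the sketch below shows the idea – note that Googlebot ignores the crawl-delay directive entirely (Bing, for instance, does respect it), so the bot name and the delay value here are purely illustrative:
User-agent: Bingbot
Crawl-delay: 10
#Asks Bingbot to wait 10 seconds between requests – Googlebot ignores this directive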
Crawlability issues are at the top of our priority list when we run detailed technical SEO audits on websites. Sure enough, crawl-delay problems are right up there – something most webmasters aren’t even aware of. If your website is facing any crawlability or indexation issues despite top-notch on-page SEO, it’s time to run a thorough health check. Get in touch with us to request a free, fully customised proposal.
The robots.txt sitemap directive comes in handy when you want to submit your sitemaps to search engines other than Google (assuming you’ve already submitted your sitemaps to Google).
The sitemap directive tells search engine bots where your sitemap is located. The syntax for adding a sitemap to the robots.txt file will be discussed in the syntax section of this post. If you aren’t sure why websites need sitemaps, do read through our simple, intuitive sitemap guide.
Ideally, you should be submitting your sitemaps to Google through Search Console – because you’ll already be using it to its full extent. Most websites don’t feel the need to use webmaster consoles for other search engines – Bing, for one. Instead of creating dozens of accounts and adding umpteen unnecessary verification codes to your website, you can choose to direct those bots to your sitemap via the robots.txt sitemap directive.
To check whether your website already has a robots.txt file, just look in the root (/) directory via your favourite FTP client. Or, more easily, visit the /robots.txt path on your website. For example, www.example.com/robots.txt.
We started with this – robots.txt is nothing special in terms of what it is.
It’s just a simple text file. You can create a robots.txt file in your favourite HTML/text editor without any issues. Most FTP clients allow you to create text files online. To avoid common robots.txt errors, follow this guide compiled by Google.
The robots.txt file MUST always be placed in the root directory of your website (topmost level). Essentially, it should be readable at /robots.txt.
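To make that concrete (example.com being a placeholder domain):
https://www.example.com/robots.txt #correct – sits at the root of the domain
https://www.example.com/files/robots.txt #won't be read by crawlers
https://blog.example.com/robots.txt #the subdomain needs its own file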
Writing a robots.txt file doesn’t take too long, so long as you’re using the right syntax. Here are some common directives that you’ll probably be using:
The user-agent directive specifies which search engine bot the directives that follow are issued for.
Remember – there are no closing tags in a robots.txt file. All the directives sandwiched between two different user-agent directives are directed at the first user-agent.
Googlebot, for example, is the user-agent for Google. Here’s the complete list of common crawlers and how they need to be addressed. Here’s how you can, for example, tell Googlebot to crawl all pages and files on your website:
User-agent: Googlebot
Disallow:
The robots.txt disallow is a simple directive followed by the directory path/URL. It tells the specified user-agent not to crawl that file/URL/directory.
Here’s how, for example, you can ask ALL crawlers to disallow the archives directory:
User-agent: *
Disallow: /archives
#Paths in robots.txt directives are case-sensitive
The allow directive is not commonly used. Googlebot will override the disallow directive for paths that are explicitly placed under an allow directive.
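Here’s a sketch of the two working together – the directory and file names are placeholders:
User-agent: Googlebot
Disallow: /media/
Allow: /media/logo.png
#Everything under /media/ is blocked except logo.png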
At times, the files/URLs that need to be blocked from being crawled aren’t all in the same directory.
For example, you may want to stop bots from crawling ALL SVG images on your website. In this case, the wildcard character * can be used. It applies across the board to every file/URL/directory that matches the pattern. For example, the following directive will disallow all files that contain “.svg” in their names:
User-agent: *
Disallow: /*.svg
Similarly, the $ character can be used to mark the end of a URL/file name. For example, the following code will disallow all URLs that end with .aspx:
User-agent: *
Disallow: /*.aspx$
We discussed this directive a few points ago.
The robots.txt crawl-delay directive tells bots how much time should pass between two successive crawl attempts. I’ll reiterate this: increasing the robots.txt crawl-delay can have an adverse effect on your crawl budget optimisation.
The following robots.txt crawl-delay directive tells Twitterbot not to crawl ANY directory except posts, and to let 15 seconds pass between two crawl attempts:
User-agent: Twitterbot
Disallow: /
Allow: /posts
Crawl-delay: 15
Use the robots.txt sitemap directive to direct search engine bots to your various sitemaps. All three major search engines – Google, Bing and Yahoo! – support this directive.
If you want to add a sitemap to robots.txt file, here’s how it should be done:
User-agent: *
Disallow:
Sitemap: https://hqseo.co.uk/sitemap.xml
Note: Not all search engine bots/crawlers follow the robots.txt sitemap directive.
A page that cannot be crawled is essentially unimportant from Google’s point of view.
Some points to note here:
1. A page disallowed from being crawled can still end up indexed (if other pages link to it, for example). Such pages typically show up in search results without a description snippet.
2. The disallow directive essentially stops pages from passing on any SEO power/link juice. If there are backlinks pointing to such pages, the link juice passed to your website through the blocked pages will quickly diminish, because Google can’t see what those links are all about. In such cases, it’s a good idea to reassign those backlinks to other ‘crawled’ pages. Similarly, any outbound/internal links on disallowed pages will essentially be useless in terms of passing the link juice on.
The noindex and nofollow tags are meta robots directives.
These work much like robots.txt directives – except they are placed in the code of individual pages, so they can’t be applied to sitewide locations from a single file. The noindex tag tells search engine bots not to index the page/element/file. The nofollow tag, on the other hand, tells these bots not to ‘follow’ the links on the page. The page-level nofollow tag applies to all links on the page, regardless of how the individual links are marked up.
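To illustrate the difference between the page-level tag and marking up an individual link (the URL is a placeholder):
<!-- Page-level: none of the links on this page will be followed -->
<meta name="robots" content="nofollow">
<!-- Link-level: only this particular link won't be followed -->
<a href="https://www.example.com/some-page/" rel="nofollow">Example link</a>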
We won’t discuss the noindex and nofollow tags here in detail. Instead, we’ll take a look at two common scenarios and how you should treat them so as to maintain the best SEO practices.
Add it to the robots.txt disallow directive, or password-protect it. Keep in mind, though, that the page can still end up indexed in Google even if you have the noindex tag in place (because Googlebot, blocked from crawling the page, won’t be able to read that tag at all).
That’s alright – just use the noindex tag in the head section of that particular page. It’ll be crawled but not indexed.
Search Console gives you an option to validate robots.txt files with ease.
Robots.txt is important in its own right – it gives you, as a webmaster, a virtually unlimited ability to fine-tune the way your website is crawled by search engine bots. It acts as the first barrier to entry to your website for bots, and that’s pretty cool (given that it’s just a text file, not even code).
Important as it is, a badly put-together robots.txt file can cause enormous harm to your website’s crawlability. I know I’m repeating myself when I say this, but it’s worth it: You don’t need a robots.txt file unless you have some pretty unique requirements.
HQ SEO’s technical SEO audit is something that takes this angle into account really well. As an agency, we focus on discovering SEO problems – not inventing them when they aren’t there. Check out our complete process if you want to learn more. To request a free, customised proposal, scroll down and fill in the proposal form.