Technical SEO

Robots.txt – Is It Necessary? – Find Out If Your Website Needs A Robots.txt File

by Tom Buckland Updated On February 5, 2019


There are multiple ways in which you can control, regulate and even manipulate how search engine bots crawl and index your website, and how it’s presented to end users in SERPs.

You usually don’t need to worry about these unless you’re operating a really large website that could benefit significantly from advanced crawl budget optimisation techniques. In most cases, it’s enough to leave the default settings alone.

But if, for whatever reason, you want to tweak the way your website is crawled, the best place to start is the robots.txt file.

In this post, we’ll be looking at what a robots.txt file is, what it does, when you need it (and, more importantly, when you don’t), the best robots.txt practices and some better, case-specific solutions.

What Is a robots.txt File?

Just take a look at the file name robots.txt and you’ll know what it is.

Robots.txt is, simply put, a regular text file that contains directives. These directives, when followed by search engine bots, tell them if, when and how to crawl your website.  

The robots.txt file is the face of the entire Robots Exclusion Protocol (or Robots Exclusion Standard) used by search engines around the world. So, this is what typically happens when a search engine bot (or user agent) comes ‘crawling’ to your website:

1. It validates the domain and the website presence.

2. It immediately checks if your website contains a robots.txt file in the root directory.

3. If it finds a robots.txt file, it obeys whatever directives are in there.

4. If it doesn’t find one, it proceeds to crawl the website normally.
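Step 3 is easy to simulate. Python’s standard urllib.robotparser module, for instance, implements the same check-and-obey logic (the domain and paths below are placeholders):

```python
# A sketch of how a well-behaved bot applies robots.txt directives,
# using Python's standard urllib.robotparser module.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Anything under /private/ is off-limits; everything else is fair game.
print(parser.can_fetch("*", "https://www.example.com/private/report.html"))  # False
print(parser.can_fetch("*", "https://www.example.com/blog/post.html"))       # True
```

Real crawlers implement their own parsers, but the decision flow – fetch robots.txt, parse it, check each URL against the rules – is the same.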

Robots.txt – Some Important Points to Note

1. If your website has a robots.txt file, it’ll be visible to everyone out there. The logic is simple – the robots.txt file is essentially the door of entry to your website. Everyone can see it.

2. The directives inside the robots.txt file aren’t always followed – bad actors such as data scraping bots pay no attention to them whatsoever. You can, however, rest easy knowing that mainstream search engines follow these directives to the letter.

3. If you’ve got subdomains, each one will require a separate robots.txt file.

 

What Does Robots.txt Exactly Do?


Robots.txt, as we said earlier, is a part of the REP. It has one main job – to tell search engine bots which pages/directories to block from being crawled.

There are additional directives that a robots.txt file can contain. Common examples include robots.txt crawl-delay and robots.txt sitemap.

Before we go any further, it’s important to make a clear distinction between crawling and indexing (trust me, many SEOs use these terms liberally and interchangeably – not okay at all!).

 

What is Crawling?

Crawling, simply put, is the process through which search engine bots make sense of the code of your website. It covers pretty much everything on your website (except, usually, scripts). When we say a website is crawled, we mean that search engines have found, assessed and read its code successfully.

 

What is Indexing?

Indexing is the process through which search engines add the crawled pages to their own indexes. Each search engine then chooses how and when to display the indexed data when relevant search queries come their way.

Even though these are pretty basic concepts, the importance of proper crawling and indexation is often underestimated by marketers and businesses. Our technical SEO audit is designed specifically to root out all crawlability and indexation issues (along with other performance factors). It’s a smooth process that every website needs to undergo on a regular basis (if they want to rank, that is). To know more, do feel free to get in touch with us.

 

Know If Your Pages Have Been Losing Traffic Due to Keyword Cannibalisation.

Duplicate Content Is Worse Than Bad. Find Out If Your Website Has Any.

Zero In On Indexation Issues & Penalty Events.

Pages Not Getting Crawled? Find Out Why.

Unearth Invisible Technical SEO Problems.

 

How Do I Know My Website Is Crawled and/or Indexed?

The best way to do this is to visit your Google Search Console dashboard. Here’s a quick tutorial for that.

A faster way to check indexation is to run a site:yourdomain search on Google. For example, we can check the indexation status of our pages by searching site:hqseo.co.uk.

 

What Does Robots.txt Look Like? (Robots.txt Example)

In most cases, the robots.txt file is nothing more than a short, sweet couple of lines of directives.

The most common directive used in robots.txt files is the disallow directive. It tells search engine bots NOT to crawl the specific URL/directory. It can be followed by the allow directive that overrides the disallow directive for specified sub-directories or URLs. Googlebot responds to the allow directive fairly well.

Here’s an example:

User-agent: *
Disallow: [directory/file that shouldn't be crawled]
Allow: [directory/file that should override that Disallow directive and be crawled]
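Filled in with hypothetical paths, that pattern could look like this – an entire directory is blocked, while a single page inside it stays crawlable:

```
User-agent: *
Disallow: /private/
Allow: /private/press-kit.html
```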

 

Robots.txt – Is It Absolutely Necessary?

The short answer is – no.

Most websites can do without a robots.txt file. In fact, we’ve seen countless instances of SEO problems that arise from faulty robots.txt files. If you want to take one thing away from this page, take this: there’s always a good chance that you don’t need a robots.txt file at all (especially if your website doesn’t have thousands of pages). And when you do, let experts handle it.

Here’s what Google has to say about this issue.

So, yeah – there’s the answer you were looking for. Robots.txt has the potential to block your entire website from being crawled if implemented wrongly. Fiddle with it only when you know what you’re doing. If you don’t know where to start, our mini technical SEO audit has you covered.

 

When Should You Use a Robots.txt File?

There are multiple scenarios in which adding crawl directives directly to the robots.txt makes sense. Here are some of them:

 

Keeping Search Results Private


Every decent website has internal search functionality. If your website is built the right way, internal search results shouldn’t be public (and hence visible to search engine bots) in the first place. But if they still are, you can add a disallow directive to block them from being crawled. Here’s an example:

User-agent: *
Disallow: /*?s=*
#Alter the disallow directive as per your website's search URL format. 

 

This robots.txt disallow directive asks ALL search engine bots not to crawl URLs that contain ‘?s=’ (a typical marker of internal search result URLs).

Prevent Unnecessary Crawling

The more pages there are to crawl, the fewer of your important pages get crawled in each cycle.

By blocking pages that are unimportant from the SEO/UX points of view, you keep each crawl cycle short, sweet and to-the-point. Crawl budget optimisation plays a huge role in how efficiently your website responds to various SEO efforts. Even though Google has said it’s overrated, it’s important to note that their perspective is based on their bots getting smarter every year – something we have no control over.

 

Dealing with Duplicate Content Issues


This isn’t ideal – but many websites do it on a regular basis. If you’ve got a page that has a good deal of duplicate content, you can block it directly from being crawled via the robots.txt file. I, however, don’t recommend this method (nor does Google).

Duplicate content can cause enormous SEO damage that can be avoided with a thorough technical SEO audit and sitewide content audit. An easier, much more sensible way is to use rel=canonical tags to tell search engine bots where the original source of the content lives.

 

Keeping Certain Directories/Files Off-Limits for Google (And Other Search Engines)


Most websites have a lot of back-end clutter that doesn’t need to be crawled at all. Admin pages or under-construction pages – Google has no business knowing what goes on there (and don’t worry, these have zero SEO value because they won’t rank for anything worth ranking for). These should go straight to the disallow directive in your robots.txt file.

Other examples include PDF files, downloadable content files, certain scripts and so on.

 

Should You disallow Pages With Sensitive Data in Robots.txt?

There are exceptions – but in most cases, no. You shouldn’t.

The best way to deal with such pages is to let them get crawled but NOT indexed. Thus, you should ideally be using the noindex meta directive on such pages.

Password-protected pages like My Account, Cart or Orders will automatically be blocked from being crawled (but the stuff that’s not behind the pay/login wall WILL be indexed, as long as there’s no noindex directive).

 

Adjusting the Crawl Rate Via the crawl-delay Directive

Since the robots.txt file is the first real piece of your website search engines have access to, you can control the crawl rate directly with the robots.txt crawl-delay directive. Again, I don’t advise doing this unless it’s absolutely necessary and your website contains thousands of pages. Note that Googlebot ignores the crawl-delay directive altogether – for Google, the crawl rate settings in your Search Console dashboard are the way to go.

Be careful with the robots.txt crawl-delay directive. It can seriously hurt your SEO if there are many, many pages to be crawled.

Crawlability issues are at the top of our priority list when we run detailed technical SEO audits on websites. Sure enough, crawl delay problems are right up there – something most webmasters aren’t even aware of. If your website is facing any crawlability or indexation issues despite top-notch on-page SEO, it’s time to run a thorough health check. Get in touch with us to request a free, fully customised proposal.

 

Adding Sitemaps Via Robots.txt

The robots.txt sitemap comes in handy when you want to submit your sitemaps to search engines other than Google (assuming you’ve already submitted sitemaps to Google).

[Screenshot: Walmart’s robots.txt file, listing its sitemaps]

The sitemap directive tells search engine bots where your sitemap is located. The syntax for adding a sitemap to robots.txt file will be discussed in the syntax section of this post. If you aren’t sure why websites need sitemaps, do read through our simple, intuitive sitemap guide.

Ideally, you should be submitting your sitemaps to Google through Search Console – because you’ll already be using it to its full extent. Most websites don’t feel the need to use webmaster consoles for other search engines – Bing’s, for one. Instead of creating dozens of accounts and adding umpteen unnecessary verification codes to your websites, you can simply direct those bots to your sitemap via the robots.txt sitemap directive.

 

Does Your Website Already Have a Robots.txt File?

Just check the / directory via your favourite FTP client. Or, more easily, visit the /robots.txt path on your website. For example, www.example.com/robots.txt.

 

How to Create a Robots.txt File for Your Website?

We started with this – robots.txt is nothing special in terms of what it is.

It’s just a simple text file. You can create a robots.txt file in your favourite HTML/text editor without any issues. Most FTP clients allow you to create text files online. To avoid common robots.txt errors, follow this guide compiled by Google.

 

 

Where Should the Robots.txt File Be Placed?

The robots.txt file MUST always be placed in the root directory of your website (topmost level). Essentially, it should be readable at /robots.txt.

Note: The file name must be exactly robots.txt – and nothing else. The name is case sensitive, so Robots.txt, for example, won’t work.
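In other words, for any page on a given host, a crawler derives the robots.txt location from the scheme and host alone. A quick sketch (the URLs below are placeholders):

```python
# Crawlers always look for robots.txt at the root of the host -
# never in a subdirectory.
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://www.example.com/blog/post?id=1"))
# https://www.example.com/robots.txt

# Each subdomain is a separate host, so it needs its own file:
print(robots_url("https://shop.example.com/cart"))
# https://shop.example.com/robots.txt
```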

 

Important Robots.txt Syntax And Directives

Writing a robots.txt file doesn’t take too long, so long as you’re using the right syntax. Here are some common directives that you’ll probably be using:

 

User Agent

The user-agent directive specifies which crawler the directives that follow are issued for.

Remember – there are no closing tags in a robots.txt file. All the directives sandwiched between two different user-agent lines are directed at the first user-agent.

Googlebot, for example, is the user-agent for Google. Here’s the complete list of common crawlers and how they need to be addressed. Here’s how you can, for example, tell Googlebot to crawl all pages and files on your website:

User-agent: Googlebot
Disallow: 

 

Disallow

The robots.txt disallow is a simple directive followed by the directory path/URL.

It tells the specified user-agent not to crawl the file/URL/directory.

Here’s how, for example, you can ask ALL crawlers to stay out of the /archives directory.

User-agent: *
Disallow: /archives
#Paths in robots.txt directives are case sensitive

 

Allow

Not commonly used. The allow directive tells Googlebot to override a broader disallow directive for the paths explicitly listed under it.

 

Regular Expressions (Robots.txt Wildcard Syntax)

At times, the files/URLs that need to be blocked from crawling aren’t all in the same directory.

For example, you may want to stop bots from crawling ALL svg images on your website. In this case, the * character can be used as a wildcard. It applies across the board to every file/URL/directory that matches the pattern. For example, the following directive will disallow all URLs that contain “.svg”.

User-agent: *
Disallow: /*.svg

 

Similarly, the $ character can be used to mark the end of a URL/file name. For example, the following code will disallow all URLs that end with .aspx

User-agent: *
Disallow: /*.aspx$
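Despite the heading, robots.txt patterns aren’t full regular expressions – only * (any sequence of characters) and a trailing $ (end of URL) are supported. To make the matching rules concrete, here’s a small illustrative Python helper (our own sketch, not part of any official parser):

```python
import re

def path_matches(pattern: str, path: str) -> bool:
    """Check a URL path against a robots.txt pattern.

    '*' matches any character sequence, a trailing '$' anchors the
    end of the URL, and plain patterns match as prefixes.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then restore '*' as 'match anything'.
    regex = "^" + re.escape(pattern).replace(r"\*", ".*")
    if anchored:
        regex += "$"
    return re.search(regex, path) is not None

print(path_matches("/*.svg", "/images/logo.svg"))   # True
print(path_matches("/*.aspx$", "/page.aspx"))       # True
print(path_matches("/*.aspx$", "/page.aspx?id=1"))  # False (doesn't end in .aspx)
print(path_matches("/archives", "/archives/2019"))  # True (prefix match)
```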

 

Crawl Delay (Robots.txt)

We discussed this directive a few points ago.

The robots.txt crawl-delay directive tells bots how much time should pass between two successive crawl attempts. I’ll reiterate this: increasing the robots.txt crawl-delay can seriously throttle your crawl budget.

The following robots.txt crawl-delay directive tells Twitterbot not to crawl ANY directory except posts, and to let 15 seconds pass between two crawl attempts.

User-agent: Twitterbot
Disallow: /
Allow: /posts
Crawl-delay: 15
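As an aside, Python’s standard urllib.robotparser module can read crawl-delay values back out, which is a handy way to sanity-check a file like the one above (note that this parser applies rules top-down, first match wins, so it won’t reproduce Googlebot-style allow-overrides-disallow behaviour):

```python
# Parse the Twitterbot example above and read back its rules
# using Python's standard urllib.robotparser module.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Twitterbot
Disallow: /
Allow: /posts
Crawl-delay: 15
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.crawl_delay("Twitterbot"))  # 15
# 'Disallow: /' is the first rule matching this path:
print(parser.can_fetch("Twitterbot", "https://www.example.com/private/x"))  # False
```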

 

Sitemap (Robots.txt)

Use the robots.txt sitemap directive to direct search engine bots to your various sitemaps. All three major search engines – Google, Bing and Yahoo! – support this directive.

If you want to add a sitemap to robots.txt file, here’s how it should be done:

 

How to Add Sitemaps to Robots.txt?

User-agent: *
Disallow: 
Sitemap: https://hqseo.co.uk/sitemap.xml

 

Note: Not all search engine bots/crawlers follow the robots.txt sitemap directive.

 

How Does Robots.txt Impact Page SEO?

A page that cannot be crawled is essentially unimportant from Google’s point of view.

Some points to note here:

1. A page disallowed from being crawled can still be indexed – typically when other pages link to it. In search results, such pages appear without a description snippet, since Google can’t read their content.

2. The disallow directive essentially stops pages from passing on any SEO power/link juice. If there are backlinks pointing to such pages, the link juice passed to your website through the blocked pages will quickly drop off, because Google can’t see what those links are all about. In such cases, it’s a good idea to reassign those backlinks to other ‘crawled’ pages. Similarly, any outbound/internal links on disallowed pages will essentially be useless in terms of passing link juice on.

 

Robots.txt vs Meta Robots (Robots.txt vs Noindex)

The noindex and nofollow tags are meta robot directives.

These work much like robots.txt directives – except they are placed in the code of individual pages, so they can’t influence sitewide locations. The noindex tag tells search engine bots not to index the page/element/file. The nofollow tag, on the other hand, tells these bots not to ‘follow’ the links on the page. The nofollow tag applies to all links on the page and overrides the attributes of individual links (if any).

We won’t discuss the noindex and nofollow tags here in detail. Instead, we’ll take a look at two common scenarios and how you should treat them so as to maintain the best SEO practices.

 

I want to Disallow A Page From Being Crawled.

Add it to the robots.txt disallow directive, or password-protect it. The page will still be indexed in Google even if you have the noindex tag in place (because the Googlebot won’t be able to read that tag at all).

 

I Don’t Want Google to Index a Page. I Still Want it to Be Crawled, Though.

That’s alright – just use the noindex tag in the head section of that particular page. It’ll be crawled but not indexed.
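In generic HTML, that tag sits in the page’s head section and looks like this:

```html
<head>
  <!-- Crawlable, but kept out of the search index -->
  <meta name="robots" content="noindex">
</head>
```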

 

Validating the Robots.txt File

Google Search Console includes a robots.txt Tester that lets you validate your robots.txt file with ease.

 

Conclusion

Robots.txt is important in its own right – it gives you, as a webmaster, a virtually unlimited ability to fine-tune the way your website is crawled by search engine bots. It acts as the first barrier to entry to your website for bots, and that is pretty cool (given that it’s just a text file, not even code).

Important as it is, a badly put-together robots.txt file can cause enormous harm to your website’s crawlability. I know I’m repeating myself when I say this, but it’s worth it: You don’t need a robots.txt file unless you have some pretty unique requirements.

HQ SEO’s technical SEO audit is something that takes this angle into account really well. As an agency, we focus on discovering SEO problems – not inventing them when they aren’t there. Check out our complete process if you want to learn more. To request a free, customised proposal, scroll down and fill in the proposal form.

 

