
Robots.txt File – A Major SEO Hack often ignored

May 8, 2017

You might have heard about the robots.txt file even if you are not very much into SEO. To be honest, when it comes to SEO you must take care of even the smallest of things. Today the competition is much bigger, and search engine optimization is getting more and more complex as search bots get smarter and look for the very best. One great search engine optimization hack is the proper use of the robots.txt file. If your robots.txt file is properly optimized for search engines, you can improve your SEO ranking a great deal.


What is robots.txt file?

Even people who own a couple of websites sometimes do not know about the existence of the robots.txt file. This is largely because the file is often created automatically: many platforms generate one for you, so if there is a website, there will usually be a robots.txt file associated with it.

It is also called the robots exclusion protocol, and it tells the search bots which parts of your website to crawl. It can allow or disallow the web crawlers (the programs that traverse the web) to index your pages.


It is a plain text file with some generic settings. Based on the settings in the robots.txt file, it becomes easier for search engines to crawl your website. As the file directs how search bots flow through your site, having a good robots.txt file is essential for good SEO.

First, let us discuss how the robots.txt file affects the search engine bots.

How do search engine bots crawl websites, and how can the robots.txt file affect it?

Your website probably has a lot of pages. All good websites do, and so do blogs. If you think your website is small, check again; it is most likely full of pages. So the search engine bots take time to crawl and index those pages, and the more time required to crawl the website, the bigger the penalty in terms of ranking. To make your site rank higher, it is better to keep the crawl time as short as possible. For that, we must make sure that the search engines crawl only those parts of the website that are essential and would boost the SEO.


There are two important concepts related to web crawling that we must include here.


Crawl Rate Limit

Crawling is the main job of Googlebot, and the crawl rate limit makes sure that no single website hogs all of Googlebot's time and resources. The crawl rate limit puts a check on the number of simultaneous connections the search bot can open to a website, along with the time between fetches.

The bot can increase or decrease the crawl rate based on a number of factors. If the response rate of the site is really high, the crawl rate will go up; if the site goes down or is slow to respond, the crawl rate limit is lowered and crawling slows down.

The next important thing is the crawl demand.


Crawl Demand

Sometimes, even when the crawl rate limit is not reached, there will still be little crawling activity from the Googlebot. This is because the crawl demand is very low. So when does Google decide that there is little need to crawl your pages? If you are thinking this is a very unfair thing, do not worry, it is not. Crawl demand depends upon two factors.

Popularity

Of course, the more popular URLs are crawled much more than unpopular ones that do not have much traffic or many updates.

Staleness

Google's systems try to prevent URLs from becoming stale in the index, so if there is no new activity on a page, there is little demand and no need to crawl it again and again.

Which content does not need to be crawled?

You might be wondering which parts of a site we would not want indexed by the search engines. What about duplicate pages? If the same content exists on two different pages, for example one for the normal web view and another for printing, why would you need it crawled twice? Similarly, long and complicated downloadable forms do not need to be crawled.

Low quality content that you cannot yet take off your site can also be hidden from the crawlers with the robots.txt file.

Another example of a page that does not need any web crawling is the thank-you page that you might show to subscribers.

Other examples may be hacked pages, or pages whose content you have moved to a new location. The same goes for pages with information that is very old and outdated: until you update them or remove them permanently, it is better to prevent the web crawlers from crawling them and wasting the precious crawl limit on useless stuff.
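
As a rough sketch, a robots.txt file that keeps crawlers away from this kind of content might look like the following. The directories /print/, /thank-you/ and /old-archive/ are only placeholder names; substitute the actual paths used on your own site.

User-agent: *
Disallow: /print/
Disallow: /thank-you/
Disallow: /old-archive/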

By now you should have a good idea of how important the robots.txt file is and what a major and unique role it can play in determining the SEO performance of your site.

Where is the robots.txt file located and how do you access it?

As I have mentioned before, almost every website has a robots.txt file. It can be found in the root directory where all the site's files are located. If you want to quickly view what your robots.txt looks like, type the URL of your website followed by /robots.txt and it will open in the browser window. For example, if you want to see the robots.txt file of Facebook, here is what you have to type.

https://www.facebook.com/robots.txt

A long file pops up. Do not get scared; you really do not need to create anything this big and huge. Usually a robots.txt file is two or three lines, and that is complete. Take a look at another file at this link.

http://www.fashionkibatain.com/robots.txt

This robots.txt file is also complete and perfectly good.

You can check out the file for any website using the /robots.txt link. You might sometimes find an empty file, although I doubt any good website would leave it that way. You might also come across an Error 404, and that would mean some trouble on the backend.

An important point for all WordPress users: you will not find robots.txt in your root directory. In WordPress the robots.txt file is created virtually.

Creating and updating robots.txt files

As the robots.txt file is text based, it can be created using any plain text editor like Notepad or TextEdit. But remember, do not use a rich editor like MS Word because it adds extra formatting information to the text.

For starters, let us look at a very basic but complete robots.txt file.
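
Here is a sketch of what such a minimal file looks like; the asterisk and the empty Disallow value are explained just below.

User-agent: *
Disallow: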

There are two main reserved words in this file: User-agent and Disallow. The line User-agent: * refers to the search bots. When you use the asterisk (*), it means you are allowing all the search bots to crawl your web pages, and usually that is the best choice. There are four main search engine bots.

  • Googlebot is the most commonly used bot and is made by Google.
  • Msnbot is also widely used and is the project of MSN.
  • Slurp is the search engine bot of Yahoo.
  • TEOMA is the search engine bot of Ask.

Apart from these four main bots there are other bots too, but they are not commonly used. In fact, it is usually best to leave the user-agent with an asterisk; that way all the bots are allowed to access the site.
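
If you ever do need bot-specific rules, you can address a single crawler by name in its own group. This is only an illustrative sketch, and the /drafts/ folder is a made-up path:

User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow:

Here Googlebot is asked to skip the hypothetical /drafts/ folder, while every other bot may crawl the whole site.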

The second line of the basic file is the Disallow statement. (You can also use an Allow statement instead.) In the example above nothing is written after the Disallow option, which means the whole site is okay for the web crawler to crawl through. So you see, just these two lines carry a lot of meaning and context, and under certain circumstances they let you do a lot of controlling.
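
To see how much difference that single value makes, compare this with a file that blocks the whole site; the only change is a single forward slash:

User-agent: *
Disallow: /

With the slash, every crawler is told to stay away from the entire site, which is something you would normally want only on a staging or development copy.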

Now let us come to the main part of this article: SEO optimization of the robots.txt file.

Optimizing the robots.txt file to make it SEO friendly

Before we proceed further, let me point out some common mistakes people make while using the robots.txt file. Remember, the goal is to improve the crawling of your site and make it more SEO friendly. Be very careful about what you disallow, because you do not want any important or useful page to be blocked from being indexed by the search engine bots. Do not confuse the allow and disallow keywords; they are sometimes wrongly used interchangeably, and one single mistake of this kind can have serious consequences, with very important parts of your site becoming invisible to the search engines.

SEO Friendly robots.txt

As we have mentioned before, there are some parts of the site that we do not want the web crawler to waste its time on. We also want to have a quick response time for our site in order to make Google happy and to make it rank higher in the SERP.

Which parts of the site should be blocked to make an SEO friendly robots.txt?

First, block the pages that are not visible to users. There is no need for pages that users never see to be crawled and indexed by the search engine bots.

The most common example is blocking the administration login page of the website. The wp-admin pages of a WordPress site are of no use to end users, so you can disallow them. Here is what the syntax would look like.

User-agent: *

Disallow: /wp-admin/

Allow: /wp-admin/admin-ajax.php

Apart from User-agent, Allow and Disallow, there are some other reserved words that can be used in the robots.txt file, such as Noindex.

Here is the complete list of words used in robots.txt and their uses, taken from Search Engine Land. An example that puts these rules together follows the list.

Robots.txt syntax

  • User-Agent: the robot the following rule applies to (e.g. “Googlebot,” etc.)
  • Disallow: the pages you want to block the bots from accessing (as many disallow lines as needed)
  • Noindex: the pages you want a search engine to block AND not index (or de-index if previously indexed). Unofficially supported by Google; unsupported by Yahoo and Live Search.
  • Each User-Agent/Disallow group should be separated by a blank line; however no blank lines should exist within a group (between the User-agent line and the last Disallow).
  • The hash symbol (#) may be used for comments within a robots.txt file, where everything after # on that line will be ignored. May be used either for whole lines or end of lines.
  • Directories and filenames are case-sensitive: “private”, “Private”, and “PRIVATE” are all uniquely different to search engines.

Adding a Sitemap directive to robots.txt: an SEO friendly move

Having a sitemap is in itself a great way to tell the search engine bots about the pages on your site. A sitemap clearly informs the web crawler which pages need to be indexed. It is very good SEO practice to add the sitemap to the robots.txt file. Add the following line to your robots.txt file to include the sitemap location.

Sitemap: sitemap_location

This serves as an auto-discovery mechanism that lets the bots find the sitemap of the website. Here are the sitemap directives of our website, www.thenextrex.com; you can check out our site's full robots.txt file in the same way.

Sitemap: http://www.thenextrex.com/sitemap_index.xml

Sitemap: http://www.thenextrex.com/post-sitemap.xml

Sitemap: http://www.thenextrex.com/page-sitemap.xml

Sitemap: http://www.thenextrex.com/attachment-sitemap.xml

Sitemap: http://www.thenextrex.com/team_member-sitemap.xml

Sitemap: http://www.thenextrex.com/team-sitemap.xml

Sitemap: http://www.thenextrex.com/category-sitemap.xml

Sitemap: http://www.thenextrex.com/post_tag-sitemap.xml

There are a couple of other directives too that are used in robots.txt for advanced options.

Crawl Delay

Some websites experience a high amount of traffic, so there is the crawl delay directive, which asks web crawlers to wait between requests so that crawling does not add to the load from that traffic. (Note that Googlebot ignores Crawl-delay, so this directive mainly matters for other bots.) Here is the syntax of this directive:

User-agent: msnbot
Crawl-delay: 5

or

User-agent: googlebot
Crawl-delay: 4

Tools available for testing the robots.txt file

Like every other piece of code, the robots.txt file needs to be tested. There are a number of tools available for testing the robots.txt file you have written. Even if your robots.txt file is only two lines long, do not under any circumstances use it without running it through a testing tool. There may be a small syntax error you have overlooked, leaving the robots.txt file useless without you knowing about it.

One commonly used tool is the one provided with Google Webmaster Tools (Search Console).

It is called the robots.txt Tester tool. Log in to your Webmaster account and choose the website property. Then choose the Crawl option from the menu.


The robots.txt Tester option will appear in the menu. Select it and the tester opens. If there is any code already in the editor, delete it, then copy the contents of your file and paste them into the editor provided. Now click the Test button at the bottom. If all is well, the Test button will change to Allowed; otherwise, any mistakes will be highlighted.



Once your code passes the test, copy it into the robots.txt file and upload the new SEO friendly file to your root directory. In this way you are adding a powerful SEO hack to your site that can greatly affect the SEO of your pages, although the results will take a few days to start showing.
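
As a final sketch, a complete SEO friendly robots.txt for a WordPress site, combining the directives covered in this guide, could look something like this. The domain in the Sitemap line is only a placeholder; replace it with your own sitemap URL.

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://www.example.com/sitemap_index.xml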

So this was our long and insightful guide to the correct use of the robots.txt file. It is a resource available to all website owners, so why not make full use of it? As I said in the beginning, the competition to reach the top of the search engine results page is tougher than ever, and in such a race even the smallest things count.

So, do you have any experience dealing with this file for your website? Is your file SEO friendly? Let us know by commenting below.


 
