Web crawlers usually crawl every page they can reach on your website and index it. So how do we tell a crawler to stay away from unwanted or private pages and information on our website? By implementing robots.txt!
What is robots.txt?
Robots.txt is a guide that tells web crawlers which parts of your website they may crawl. It is a plain text file used to restrict the crawling, and therefore the indexing, of certain sections or pages of a website.
For example, you own an educational website but do not want confidential data, such as student details or account info, to be indexed. This task can be done using robots.txt. In the image below, you can see that robots.txt blocks the indexing of site sections like temp files, private files, and parts of databases.
When to use robots.txt?
Below are a few use cases where a robots.txt file is useful:
- When a website has a sitemap
- To prevent server overload when several bots crawl the website simultaneously, by adding a crawl delay
- To prevent duplicate search results of the same pages on Google search
- To save crawl budget by preventing indexing of unimportant pages
- Blocking a few site locations, files, and media from indexing
If there are no such use cases on your website, robots.txt is not required.
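As a rough illustration of the crawl-delay use case, the sketch below parses a small, made-up robots.txt with Python's standard urllib.robotparser module and reads the delay a well-behaved crawler would wait between requests. The bot name ExampleBot, the /tmp/ path, and the 10-second value are assumptions for illustration only, and not every crawler honors Crawl-delay.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: ask every bot to wait 10 seconds between requests
# and to stay out of a temporary-files section.
RULES = """\
User-agent: *
Crawl-delay: 10
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Delay (in seconds) that applies to a bot with no group of its own.
delay = parser.crawl_delay("ExampleBot")
tmp_allowed = parser.can_fetch("ExampleBot", "http://www.mywebsite.com/tmp/cache.html")

print(delay)        # 10
print(tmp_allowed)  # False
```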
Implementation of Robots.txt
The robots.txt of any site lives at the root of the domain, and you can view it by visiting a URL such as https://mywebsite.com/robots.txt. The file must always be placed in the root directory of the website, because that is the only location where web crawlers look for it.
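Because the file always sits at the root, the robots.txt location for any page can be derived from the page URL alone. A minimal sketch, where the helper name robots_url and the shop subdomain are assumptions made up for this example:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any subdomain or port);
    # replace the path and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://mywebsite.com/blog/post?id=7"))
# https://mywebsite.com/robots.txt
print(robots_url("https://shop.mywebsite.com/cart"))
# https://shop.mywebsite.com/robots.txt
```

The second call shows why each subdomain needs its own file: the host is part of the robots.txt address.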
Sample of a robots.txt file
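Consider the below example of a robots.txt file for your website http://www.mywebsite.com/.

    User-agent: Googlebot
    Disallow: /nogooglebot/

    User-agent: *
    Allow: /

    Sitemap: http://www.mywebsite.com/sitemap.xml

The above code in the robots.txt file means: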
- User-agent: Googlebot - the rules in this group apply to Google's crawler, Googlebot.
- Disallow - specifies the pages that you do not want bots to crawl. So, Googlebot will not be allowed to crawl any URL starting with http://www.mywebsite.com/nogooglebot/.
- User-agent: * - all other user agents are allowed to crawl the site.
- Sitemap - the website's sitemap file is located at http://www.mywebsite.com/sitemap.xml. Linking the XML sitemap makes it easy for web crawlers to find it, which can speed up indexing of the site.
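The rules described above can be checked with Python's standard urllib.robotparser module. This is only a sketch: the rule text is written out from the description in this section, SomeOtherBot and the page names are invented for the example, and site_maps() needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# The sample rules, written out from the description in this section.
SAMPLE = """\
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: http://www.mywebsite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())

base = "http://www.mywebsite.com"
googlebot_blocked = parser.can_fetch("Googlebot", base + "/nogooglebot/page.html")
googlebot_allowed = parser.can_fetch("Googlebot", base + "/about.html")
other_bot_allowed = parser.can_fetch("SomeOtherBot", base + "/nogooglebot/page.html")
sitemaps = parser.site_maps()

print(googlebot_blocked)  # False: Googlebot may not crawl /nogooglebot/
print(googlebot_allowed)  # True: the rest of the site is open to Googlebot
print(other_bot_allowed)  # True: other bots fall under "User-agent: *"
print(sitemaps)           # ['http://www.mywebsite.com/sitemap.xml']
```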
Testing and inspecting Robots.txt
Since robots.txt is essential for indexing your website correctly, any error in setting up this file can cause serious SEO trouble or even get your website deindexed. Testing it before uploading is therefore crucial. Luckily, there are great tools available for auditing robots.txt.
Google Robots.txt tester
Google has a free robots.txt Tester tool available in Google Search Console. Select the registered website from the list, then find the robots.txt Tester link under the Crawl section. After a scan, Google reports any errors and warnings in detail. Run a round of testing before uploading the robots.txt file.
Screaming Frog SEO Spider
This application is an SEO auditing tool used to improve the on-site SEO of a website. It can be used for multiple SEO audit tasks, such as finding broken links, generating sitemaps, and reviewing robots.txt files.
To use it, download the application first. Crawling up to 500 URLs is free. For more URLs, users will need to opt for a paid version.
Steps to test site Robots.txt
- Crawl the website or URL: open the SEO Spider app, enter the site URL in the search field, and hit ‘Start’.
- Switch to the ‘Response Codes’ tab and select the ‘Blocked by Robots.txt’ filter.
- After the whole site is crawled, disallowed URLs appear with the status ‘Blocked by Robots.txt’. The ‘Matched Robots.txt Line’ column shows the line number and the disallow path of the robots.txt entry that blocked each URL.
- To export the results in bulk, use the ‘Bulk Export -> Response Codes -> Internal & External -> Blocked by Robots.txt Inlinks’ report.
There are many more tools available online with which we can test robots.txt.
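A draft can also be sanity-checked locally before upload. The sketch below is one possible approach, not a replacement for the tools above: it uses Python's standard urllib.robotparser, and the helper find_rule_failures, the /private/ path, and the expectations table are all hypothetical. Python's parser does simple prefix matching, so rules that rely on wildcards may behave differently than they do for search engine crawlers.

```python
from urllib.robotparser import RobotFileParser

def find_rule_failures(robots_text, expectations, base="http://www.mywebsite.com"):
    """Return the (agent, path) pairs whose crawlability differs from what was expected."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    failures = []
    for agent, path, should_be_allowed in expectations:
        if parser.can_fetch(agent, base + path) != should_be_allowed:
            failures.append((agent, path))
    return failures

# Hypothetical draft to be tested before it goes live.
DRAFT = """\
User-agent: *
Disallow: /private/
"""

# (user agent, path, expected to be crawlable?)
EXPECTATIONS = [
    ("Googlebot", "/", True),
    ("Googlebot", "/private/report.html", False),
    ("Googlebot", "/Private/report.html", True),  # paths are matched case-sensitively
]

failures = find_rule_failures(DRAFT, EXPECTATIONS)
print(failures or "all expectations hold")
```

Keeping the expectations in a table makes it easy to rerun the same checks whenever the file changes.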
Best practices to implement Robots.txt
The right implementation of robots.txt will give the best outcomes in terms of SEO. Below are the best practices you should apply while implementing robots.txt on your website.
- Place the robots.txt file in the root directory of the website so that web crawlers can find it. The file name is case-sensitive, so name it robots.txt, in lowercase.
- All important pages must be accessible to the crawler for indexing. Unimportant pages can be blocked to save the crawl budget.
- Never rely on a robots.txt file to disallow or hide private files with sensitive information. The file is publicly readable, so listing such paths in it reveals them to attackers. For such use cases, use password protection methods.
- Do not use the noindex directive in robots.txt, as it is no longer supported. Use the robots meta tag or the X-Robots-Tag HTTP header instead.
- If the website has subdomains, use a separate robots.txt file for each subdomain.
- Add a sitemap link to the bottom of the robots.txt file.
- Do not block the website's JS or CSS files using robots.txt.
- Paths in robots.txt rules are case-sensitive, so match the capitalization of directory and file names exactly.
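To illustrate the X-Robots-Tag practice from the list above, here is a minimal sketch of a WSGI application that adds the header to some responses. The /drafts/ prefix, the app itself, and the headers_for helper are assumptions for illustration; a real site would normally set this header in its web server or framework configuration. Note that a crawler can only see the header if the page is not blocked in robots.txt.

```python
from wsgiref.util import setup_testing_defaults

NOINDEX_PREFIX = "/drafts/"  # hypothetical section to keep out of the index

def app(environ, start_response):
    """Tiny WSGI app that marks some responses as noindex via an HTTP header."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/plain; charset=utf-8")]
    if path.startswith(NOINDEX_PREFIX):
        # Ask compliant crawlers not to index this response.
        headers.append(("X-Robots-Tag", "noindex"))
    start_response("200 OK", headers)
    return [b"hello\n"]

def headers_for(path):
    """Call the app directly (no network) and return its response headers as a dict."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, response_headers, exc_info=None):
        captured.update(response_headers)

    app(environ, start_response)
    return captured

print(headers_for("/drafts/post.html").get("X-Robots-Tag"))  # noindex
print(headers_for("/about.html").get("X-Robots-Tag"))        # None
```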
It is important to keep your robots.txt file updated whenever new pages are added or when there is any change in the website. There are a few limitations:
- Blocked pages can still appear in search results if they are linked from a crawlable page
- The maximum robots.txt file size that Google processes is 500 KiB
- Google generally caches robots.txt for up to 24 hours
In general, robots.txt is good for a website’s SEO but not mandatory. Implementing it correctly helps search engines crawl the site efficiently, but, as noted above, it is not a substitute for proper access control on sensitive content.