What is robots.txt?

The Robots Exclusion Standard (robots.txt) is a protocol that websites use to control how web crawlers and robots access their domain. It is essentially a whitelist/blacklist that allows publishers to identify known web crawlers and robots and determine how they interact with the site. Unfortunately, only web crawlers and robots that adhere to the standard will comply with the file's directives. Malicious web crawlers and robots are unlikely to adhere to it, and as such, robots.txt should not be used as a security measure.

What do I need to do?

The robots.txt file is optional. To implement robots.txt, place the file at the root of your domain (http://rootdomain.com/robots.txt). The file covers a single origin, meaning you would need a separate robots.txt for each protocol, port, or subdomain. If robots.txt does not exist on a domain, all web crawlers and robots are allowed full access.
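
For illustration, a minimal robots.txt that explicitly grants every crawler full access (equivalent to having no robots.txt at all) would look like this, where the empty Disallow directive means "disallow nothing":

User-agent: *
Disallow: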

The robots.txt file is a plain text file that contains the user-agent of the web crawler or robot, followed by an allow or disallow directive.

For example (http://rootdomain.com/robots.txt):

User-agent: Googlebot
Allow: /help
Disallow: /private

The above example tells Google's web crawler (Googlebot) that it may crawl and index http://rootdomain.com/help but not http://rootdomain.com/private.

Note: Googlebot will still be able to access https://rootdomain.com/private, http://rootdomain.com:8080/private, and http://subdomain.rootdomain.com/private, because each of these is a different origin and would need its own robots.txt.

Why do I need to implement robots.txt?

The short answer is that you do not need to implement robots.txt.

However, for ad delivery, you need to ensure that Google's Ad Exchange crawler is able to crawl the ads.txt file, as well as any pages that contain ads. If you have implemented robots.txt, you will need to confirm that the Mediapartners-Google user-agent does not have Disallow directives applied to the ads.txt file or to any pages with ads.

Remove the following lines from robots.txt, if they exist:

User-agent: Mediapartners-Google
Disallow: /
Disallow: /ads.txt
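
Alternatively, if you prefer to be explicit, one possible approach is to grant Mediapartners-Google unrestricted access while keeping your existing restrictions for other crawlers. A sketch (the /private path is only a placeholder carried over from the earlier example):

User-agent: Mediapartners-Google
Allow: /

User-agent: *
Disallow: /private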

What are other uses for robots.txt?

You can use robots.txt to prevent web crawlers and robots from indexing test pages or a staging site. These pages are not intended for the public, and you would not want visitors landing on them because they appeared on a search engine's results page.
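
For example, assuming your staging content lives under a /staging/ path (a hypothetical path used here purely for illustration), the following would ask all compliant crawlers to stay out of it:

User-agent: *
Disallow: /staging/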

Googlebot limits how many pages it will crawl on a site; this limit is referred to as the crawl budget. If your site is very large and you are having issues getting all of your pages indexed, you can use robots.txt to block low-traffic or other low-priority pages. This leaves Googlebot more of its budget for the pages that matter most.
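
A sketch of what that might look like, assuming hypothetical low-priority sections at /archive/ and /tags/:

User-agent: Googlebot
Disallow: /archive/
Disallow: /tags/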

Additionally, robots.txt can be used to exclude multimedia resources such as images or PDFs from being indexed. This also frees up crawl budget for the rest of your site. You can use Google Search Console to see exactly what Googlebot is indexing.
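
As an illustration, the following would block Googlebot from crawling PDF files and keep Googlebot-Image away from the site entirely; note that the * and $ wildcards are supported by Googlebot but are not part of the original standard, so other crawlers may ignore them:

User-agent: Googlebot
Disallow: /*.pdf$

User-agent: Googlebot-Image
Disallow: /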