What is the robots.txt file?
Robots.txt is a text file webmasters create to instruct web robots (search engine crawlers) which pages on a website to crawl and which to skip.
The robots.txt file is primarily used to specify which parts of your website should be crawled by spiders or web crawlers. It can specify different rules for different spiders.
Googlebot is an example of a spider. It’s deployed by Google to crawl the web and record information about websites so Google knows how to rank them in search results.
- Example of Robots.txt file URL: https://www.xyz.com/robots.txt
- Blocking all web crawlers from all content
Using this syntax in a robots.txt file would tell all web crawlers not to crawl any pages of the website, including the homepage.
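A minimal sketch of that rule (the `*` wildcard matches every user-agent, and `Disallow: /` covers the entire site):

```
# Block every crawler from the whole site
User-agent: *
Disallow: /
```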
- Allowing all web crawlers access to all content
Using this syntax in a robots.txt file tells web crawlers to crawl all pages of the website, including the homepage.
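The corresponding rule looks like this; an empty `Disallow` value means nothing is blocked:

```
# Allow every crawler to access everything
User-agent: *
Disallow:
```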
- Blocking a specific web crawler from a specific folder
This syntax tells only Google’s crawler (user-agent name Googlebot) not to crawl any pages that contain the URL string.
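A sketch of that rule, using `/example-subfolder/` as a hypothetical folder name (any path under it is blocked for Googlebot only; other crawlers are unaffected):

```
# Block only Googlebot from one folder (folder name is illustrative)
User-agent: Googlebot
Disallow: /example-subfolder/
```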
- Blocking a specific web crawler from a specific web page
This syntax tells only Bing’s crawler (user-agent name Bingbot) to avoid crawling the specific page.
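A sketch of that rule, with a hypothetical page path; the `Disallow` value points at a single page rather than a folder:

```
# Block only Bingbot from one page (path is illustrative)
User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html
```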
There are two important considerations when using /robots.txt:
- Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email-address harvesters used by spammers, will pay no attention to it.
- The /robots.txt file is a publicly available file. Anyone can see what sections of your server you don’t want robots to use.
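Well-behaved crawlers check these rules before fetching a page. A minimal sketch of that check using Python's standard-library `urllib.robotparser` (the rules and URLs here are illustrative, not from a real site; normally the parser would fetch https://www.xyz.com/robots.txt itself):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse robots.txt rules directly from a list of lines
# (instead of fetching them over the network)
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A compliant crawler asks before fetching each URL
print(rp.can_fetch("Googlebot", "https://www.xyz.com/private/page.html"))  # False
print(rp.can_fetch("Googlebot", "https://www.xyz.com/public/page.html"))   # True
```

Since the rules above use `User-agent: *`, the same answers would come back for any crawler name passed to `can_fetch`.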
March 2, 2020