robots.txt

The "robots.txt" file is a simple text file that allows website owners to tell the web crawlers, also called robots or spiders, which pages or sections of their website should not be crawled or indexed by search engines. The file is stored in the root directory of a website. Its name and location must have a certain format so that it is recognised by the web crawlers.

The robots.txt file uses a simple plain-text format in which each line contains one instruction for web crawlers. The most common instruction is "User-agent", which specifies which web crawler the rules that follow apply to. For example, "User-agent: Googlebot" applies the subsequent rules to Google's crawler, Googlebot.

Another important instruction is "Disallow", which tells the web crawler not to crawl a certain page or directory. For example, "Disallow: /password-protected-page" asks crawlers not to crawl the path "/password-protected-page" on the website.

A robots.txt file can look like this, for example:

User-agent: *
Disallow: /secret-page
Disallow: /folder/

This asks all web crawlers not to crawl the URL "/secret-page" or anything under "/folder/". The user agent "*" means that the rules apply to all web crawlers.
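For crawlers that do honour the file, such rules can be evaluated with Python's standard-library module urllib.robotparser, which reads robots.txt directives and answers whether a given URL may be fetched. The following is a minimal sketch applied to the example above; the host example.com is only a placeholder.

from urllib import robotparser

# Parse the example rules from above.
rules = """User-agent: *
Disallow: /secret-page
Disallow: /folder/""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(rules)

# A well-behaved crawler checks every URL before fetching it.
print(parser.can_fetch("Googlebot", "https://example.com/secret-page"))      # False
print(parser.can_fetch("Googlebot", "https://example.com/folder/page.html")) # False
print(parser.can_fetch("Googlebot", "https://example.com/about.html"))       # True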

It is important to know that the robots.txt file is only a request; web crawlers are not obliged to follow it. Some crawlers ignore the instructions in a robots.txt file or do not support the file at all, and a malicious user can simply disregard it and still access the listed pages. The robots.txt file is therefore not a secure way to protect sensitive pages or data. It is only a hint for web crawlers, so sensitive areas of a website should be protected by other means, such as authentication and access controls.
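As a minimal sketch of that limitation (again with example.com as a placeholder host): nothing on the client side enforces robots.txt, so an ordinary HTTP request for a disallowed path goes through unless the server itself restricts access.

import urllib.error
import urllib.request

# The Disallow rule above is purely advisory: this request is sent regardless.
url = "https://example.com/secret-page"
try:
    with urllib.request.urlopen(url) as response:
        print(response.status, "- the page was served despite the Disallow rule")
except urllib.error.HTTPError as err:
    # Only real access controls on the server (e.g. authentication) cause this.
    print("Access refused by the server:", err.code)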
