
The “robots.txt” file

The "robots.txt" file is a standard used by websites to communicate with web crawlers and other automated agents, such as search engine bots. It is a text file placed in the root directory of a website and contains directives that specify which parts of the site should not be crawled or indexed by search engines.

Webmasters use the "robots.txt" file to control how web crawlers access their site. The file can address specific user agents or set rules for all bots at once. For instance, it can tell crawlers to stay out of particular directories or away from particular files on the site.
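
To get a feel for how a crawler consults these rules, here is a minimal Python sketch using the standard urllib.robotparser module; the domain, user-agent name, and path are placeholders, not values taken from this article.

from urllib import robotparser

# Point the parser at the site's robots.txt (placeholder domain).
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # fetch and parse the file

# Ask whether a given user agent may crawl a given URL.
if rp.can_fetch("MyCrawler", "https://www.example.com/private/report.html"):
    print("Allowed to crawl")
else:
    print("Blocked by robots.txt")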

A basic example of a "robots.txt" file is as follows:

User-agent: *
Disallow: /private/
Disallow: /restricted/

In this example, the asterisk (*) after "User-agent" signifies that the directives that follow apply to all web crawlers. The "Disallow" lines list directories that crawlers should not visit.
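
You can verify this behaviour without fetching anything by feeding the example above to Python's urllib.robotparser; the paths being checked below are made up for illustration.

from urllib import robotparser

rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Disallow: /restricted/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)  # parse the directives shown above

print(rp.can_fetch("AnyBot", "/private/report.html"))  # False - blocked
print(rp.can_fetch("AnyBot", "/restricted/data.csv"))  # False - blocked
print(rp.can_fetch("AnyBot", "/blog/welcome.html"))    # True  - crawlable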

A robots.txt file, placed in the root directory of your website, regulates how search engine bots crawl your site. Compliance with these rules varies among search engines, but specifying which bots may crawl and which files or folders are off limits makes crawling noticeably more efficient: it conserves server resources and steers bots toward the parts of the site you actually want explored.

To control crawling more precisely, the "$" sign can be used to anchor a rule to the end of a URL. For example, Disallow: /images/abc.jpg$ blocks exactly /images/abc.jpg while still permitting the crawling of URLs that continue past that point, such as /images/abc.jpg?v=960x700.

The wildcard character "*" matches any sequence of characters, so it can be placed within a path to write broader or narrower rules, adding considerable flexibility to your robots.txt file.
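
One way to reason about "*" and "$" is to translate a pattern into a regular expression, which is roughly what wildcard-aware crawlers do. The Python sketch below is only an illustration of the matching behaviour described above (the standard-library urllib.robotparser does simple prefix matching and does not handle these wildcards); the rules and paths are invented.

import re

def rule_to_regex(pattern):
    # "*" matches any run of characters; a trailing "$" anchors the
    # pattern to the end of the URL path.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile("^" + regex + ("$" if anchored else ""))

jpg_rule = rule_to_regex("/images/abc.jpg$")
print(bool(jpg_rule.match("/images/abc.jpg")))            # True  - disallowed
print(bool(jpg_rule.match("/images/abc.jpg?v=960x700")))  # False - still crawlable

pdf_rule = rule_to_regex("/*.pdf$")
print(bool(pdf_rule.match("/downloads/guide.pdf")))       # True  - disallowed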

In addition, it is advisable to include the path to your sitemap.xml file in the robots.txt file. Doing so helps bots discover all of your pages and supports a thorough indexing process.

Example:

User-agent: *    # Allow all bots or specify bot name
Disallow: /abc/folder
Disallow: /abc.php
Allow: /abc/xyz.php
Sitemap: https://voloevents.com/sitemap.xml

Other Samples:

User-agent: Googlebot
Disallow: /not-for-google
User-agent: DuckDuckBot
Disallow: /not-for-duckduckgo
Sitemap: https://www.yourwebsite.com/sitemap.xml
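
Because each User-agent group is evaluated on its own, the same URL can get different answers depending on which bot is asking. Here is a quick check of the sample above with Python's urllib.robotparser; the domain is the placeholder from the sample, and site_maps() requires Python 3.8 or newer.

from urllib import robotparser

sample = [
    "User-agent: Googlebot",
    "Disallow: /not-for-google",
    "User-agent: DuckDuckBot",
    "Disallow: /not-for-duckduckgo",
    "Sitemap: https://www.yourwebsite.com/sitemap.xml",
]

rp = robotparser.RobotFileParser()
rp.parse(sample)

# The same path gets different answers depending on the bot asking.
print(rp.can_fetch("Googlebot", "/not-for-google"))    # False - blocked for Google
print(rp.can_fetch("DuckDuckBot", "/not-for-google"))  # True  - DuckDuckGo may crawl it
print(rp.site_maps())  # ['https://www.yourwebsite.com/sitemap.xml']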

If you want to tell Googlebot not to crawl your WordPress admin area, for example, your block would look like this:

User-agent: Googlebot
Disallow: /wp-admin/

For example, if you wanted to allow all search engines to crawl your entire site, your block would look like this:

User-agent: *
Allow: /

If you wanted to block all search engines from crawling your site, your block would look like this:

User-agent: *
Disallow: /

For example, if you want to prevent Googlebot from accessing every post on your blog except for one, your directive might look like this:

User-agent: Googlebot
Disallow: /blog
Allow: /blog/example-post
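
Which rule wins when both lines match? Google documents a longest-match rule: the most specific matching path takes precedence, and Allow wins a tie. The Python sketch below hard-codes the two rules from the block above purely to illustrate that logic; real parsers may resolve conflicts differently, so it is worth testing with the search engine's own tools.

# Google-style precedence: longest matching pattern wins, Allow wins ties.
rules = [("disallow", "/blog"), ("allow", "/blog/example-post")]

def is_allowed(path):
    matches = [(len(pattern), kind == "allow")
               for kind, pattern in rules
               if path.startswith(pattern)]
    if not matches:
        return True              # no rule applies -> crawlable
    matches.sort(reverse=True)   # longest pattern first; Allow beats Disallow on ties
    return matches[0][1]

print(is_allowed("/blog/example-post"))  # True  - the Allow rule is more specific
print(is_allowed("/blog/another-post"))  # False - only the Disallow rule matches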
