What Is A Robots.txt File Complete Guide

david

3 years ago

Complete Guide about Robots.txt File will be described in this article. Did you know that you may decide down to the page level who crawls and indexes your website? This is accomplished by using a file called Robots.txt.

A straightforward text file named Robots.txt can be found in the root directory of your website. It instructs “robots” (like search engine spiders) which pages on your website to crawl and which sites to skip. The Robots.txt file allows you a lot of control over how Google and other search engines view your site, while not being necessary.

What Is A Robots.txt File Complete Guide

In this article, you can know about What Is A Robots.txt File Complete Guide here are the details below;

When properly applied, this can boost crawling and even have an effect on SEO. But how can you actually make a useful Robots.txt file? How do you use it after it’s been created? What errors should you avoid when utilizing it, too?

I’ll cover all you need to know about using the Robots.txt file on your blog in this post.

Let’s start:

What is a Robots.txt file?

Programmers and engineers developed “robots” or “spiders” in the early years of the internet to crawl and index online pages. These machines are also referred to as “user-agents.”

These robots occasionally found their way onto pages that the site owners didn’t want indexed. a private website or a site that is still being built, for instance.

Martijn Koster, a Dutch engineer who developed the first search engine in history (Aliweb), proposed a set of guidelines that any robot should follow to address this issue. In February 1994, a proposal for these standards was made.

A group of robot authors and early web pioneers came to an agreement on the specifications on June 30, 1994.

As part of the “Robots Exclusion Protocol” (REP), certain guidelines were adopted.

This protocol is implemented in the Robots.txt file.

Every valid crawler or spider must abide by a set of regulations that are outlined in the REP. Every reputable robot, including the MSNbot and Googlebot, must abide by the instructions in the Robots.txt file if they state that a web page should not be indexed.

Note: You may find a list of reliable crawlers here.

Remember that some malicious robots, such as spyware, email harvesters, and so on, might not adhere to these protocols. Due to this, pages that you have blacklisted using Robots.txt may nevertheless experience bot traffic.

There are other robots that don’t adhere to REP norms and aren’t employed in any dubious activities.

Visit this link to view the robots.txt for any website:

http://[website_domain]/robots.txt

Here is an example of the Robots.txt file from Facebook:

File Facebook Robot.txt

Here is the Robots.txt file for Google:

Robots.txt file for Google

Use of Robots.txt

A website’s robots.txt file is not required. Without this file, your website will still rank well and continue to expand.

However, there are certain advantages to adopting the Robots.txt:

Discourage bots from crawling private folders: Despite not being ideal, forbidding bots from accessing private files will make it considerably more difficult for them to be indexed, at least by trusted bots (like search engine spiders).

Control resource: Each time a bot crawls your website, bandwidth and server resources are used that may be better utilized by actual visitors. This can increase prices for websites with a lot of material and detract from the experience of actual visitors. To save resources, you can use Robots.txt to prevent access to scripts, unnecessary photos, etc.

Prioritize important pages: You want search engine spiders to focus their energy on important sites on your site (like content pages), rather than wasting time on irrelevant pages (like search query results). You can determine which pages bots prioritize by cutting off such pointless pages.

How to find your Robots.txt file

Robots.txt is a straightforward text file, as the name would imply.

The root directory of your website houses this file. Simply launch your FTP client and go to your website directory under public_html to find it.

File Robots.txt

This text file is incredibly little; mine is barely over 100 bytes.

Use any text editor, such as Notepad, to open it. You might see something similar to this:

File Opening

There’s a potential that the root directory of your site contains no Robots.txt file. In this scenario, a Robots.txt file must be manually created by you.

This is how:

How to create a Robot.txt file

Opening a text editor and saving an empty file as robots.txt will create Robots.txt in no time at all because it is just a basic text file.

Build robots

Use your preferred FTP program to access your web server to upload this file (I suggest WinSCP for this). After that, access the root directory of your site and the public_html folder.

The root directory of your website can be located just inside the public_html folder, depending on how your web server is set up. It might also be a folder inside of that.

Drag and drop the Robots.txt file into the root directory of your website once it is open.

The Robots.txt file can also be created directly from your FTP editor.

Open your site’s root directory, then use Right Click -> Create New File to accomplish this.

Type “robots.txt” (without the quotations) in the dialog box, then click OK.

Create a blank file.

There need to be a fresh robots.txt file there:

Fresh Robots

Last but not least, confirm that the Robots.txt file has the appropriate file permissions set. You only want to be able to access and write to the file as the owner, not to anybody else or the general public.

You should see “0644” listed as the authorization code in your Robots.txt file.

If it doesn’t, click the “File permissions” option in the context menu of your Robots.txt file.

File Access Rights

There you have it—a Robots.txt file that works perfectly!

What can you do with this file, though?

I’ll then demonstrate some typical instructions you can use to restrict access to your website.

How to Use of Robots.txt

Keep in mind that robot interaction with your website is largely governed by the Robots.txt file.

Do you want to restrict search engines’ access to your complete website? Change the permissions in Robots.txt by yourself.

Would you like to prevent Bing from crawling your contact page? The same applies to you.

Although the Robots.txt file won’t help your SEO on its own, you can use it to manage how crawlers behave on your website.

Simply open the file in your FTP editor and put the text there to add or alter it. The changes you make will be immediately visible after you save the file.

You can use the following commands in your Robots.txt file:

1. Block all bots from your site

Do you want to prevent any robots from seeing your website?

Include the following code in your Robots.txt file:

In the actual file, it would appear like this:

Keep All Robots Off Your Website

Simply said, this command instructs all user agents (*) to refrain from accessing any files or folders on your website.

Here is a detailed explanation of everything that is taking place:

User-agent:* – The asterisk (*) is a ‘wild-card’ character that is applicable to any object (such as a file name or, in this example, a bot). Your computer will display every file with the.txt extension if you search for “*.txt” in the search box. The asterisk in this case denotes that your command is applicable to every user-agent.

Disallow: / – The robots.txt command “Disallow” prevents a bot from crawling a folder. You can tell that you are using this command on the root directory by the single forward slash (/).

Nota: If you manage any type of private website, like a membership site, this is great. However, keep in mind that doing so will prevent even respectable bots like Google from indexing your website. Take care when using.

2. Block all bots from accessing a specific folder

What happens if you wish to stop bots from indexing and crawling a particular folder?

Take the /images folder, for instance.

Here is an example of the command to use to prevent bots from accessing the /images folder:

Block All Robot Access To A Particular Folder

If you have a resource folder that you don’t want to be flooded with requests from robot crawlers, this command can be helpful. This might be a folder containing useless scripts, stale photos, etc.

Note that the /images folder is only being used as an example. I’m not advocating that you prevent crawling of that folder by bots. It depends on the goal you’re pursuing.

Be cautious when using this command because search engines often dislike webmasters that prevent their bots from exploring non-image directories. Below, I’ve provided various substitutes for Robots.txt for preventing search engines from indexing particular pages.

3. Block specific bots from your site

What if you want to prevent a certain robot, like Googlebot, from visiting your website?

The command is as follows:

[Name of the robot]

Refuse: /

For instance, use the following to prevent Googlebot from visiting your website:

Exclude a Subset of Robots from Your Site

Every trustworthy bot or user-agent has a unique name. For instance, Google’s spider is referred to as “Googlebot”. Both “msnbot” and “bingbot” are managed by Microsoft. The Yahoo! bot is known as “Yahoo! Slurp”.

Use this page to find the precise names of several user-agents, including Googlebot, Bingbot, and others.

The aforementioned command would prevent a specific bot from accessing your entire site. Only Googlebot is used as an illustration. Most of the time, you wouldn’t want to prevent Google from indexing your website. To keep the bots that benefit you visiting to your site while preventing those that do not, is one specific use case for restricting specific bots.

4. Block a specific file from being crawled

You have precise control over the files and folders you want to restrict access to for robots thanks to the Robots Exclusion Protocol.

The command to use to prevent any robot from crawling a file is as follows:

Userland Agent: *

/[folder_name]/[file_name.extension] is forbidden.

Therefore, you would use the following command to prevent the “img_0001.png” file from being opened from the “images” folder:

Blocking a certain file from being scanned

5. Block access to a folder but allow a file to be indexed

Bots are prevented from accessing a folder or file by the “Disallow” command.

The opposite is true with the command “Allow”.

If the “Disallow” command targets a specific file, the “Allow” command takes precedence.

This implies that you can restrict access to a folder while still allowing user-agents to view a specific file inside of it.

The format is as follows:

Userland Agent: *

Refuse: /[folder_name]/

Permit: [folder name]/[file extension]/

For instance, the following structure would be used to prevent Google from crawling the “images” folder but still allowing it access to the “img_0001.png” file it contains:

It would appear as follows for the aforementioned example:

Blocking a folder’s access but allowing a file to be indexed

This would prevent the indexing of any pages in the /search/ directory.

What if you wanted to prevent all sites with a particular extension from being indexed, such “.php” or “.png”?

Apply this:

Userland Agent: *

Refrain from: /*.extension$

Here, the dollar ($) symbol denotes the end of the URL, meaning that the extension is the final string.

Here is how to utilize the “.js” extension to ban all web pages that contain Javascript:

User Agent Block

If you wish to prevent bots from crawling your scripts, this command is especially effective.

6. Stop bots from crawling your site too frequently

You might have seen this command in the examples above:

Userland Agent: *

20 seconds for crawling

This command instructs all bots to delay every request for a crawl by at least 20 seconds.

On large websites with frequently updated material (like Twitter), the crawl-delay command is widely utilized. This command instructs bots to delay making further requests for a specific period of time.

By doing this, the server is prevented from being overloaded by several simultaneous requests from various bots.

For instance, the Robots.txt file for Twitter specifies that bots must wait at least one second between requests:

Robots.txt file for Twitter

Even for certain bots, the crawl delay is adjustable. This prevents an excessive number of bots from simultaneously crawling your website.

You might, for instance, have the following series of instructions:

Set of Orders

Note: Unless you manage a large website (like Twitter) where thousands of new pages are created every minute, you probably won’t need to use this command.

Common mistakes to avoid when using Robots.txt

A strong tool for managing bot behavior on your site is the Robots.txt file.

If not used properly, it can potentially result in SEO disaster. The fact that there are numerous falsehoods concerning Robots.txt circulating online doesn’t help.

The following errors should never be made when using Robots.txt:

Making 1: Using Robots.txt to prevent content from being indexed

Genuine bots won’t crawl a folder if you “Disallow” it in the Robots.txt file.

This nevertheless indicates two things:

Bots WILL crawl the information in the folder that has been connected from outside sources. For example, bots will follow through and index a file if another website contains a link to it.

Robots.txt directives are typically disregarded by rogue bots, such as spammers, spyware, malware, etc., which index your material anyhow.

Because of this, Robots.txt is a subpar tool for blocking content from being indexed.

Use the’meta noindex’ tag as an alternative instead.

In pages you don’t wish to be indexed, add the next tag:

“robots” meta tag with “noindex” content

The best way to prevent a page from being indexed is with this SEO-friendly technique, albeit it still doesn’t stop spammers.

Note: You may do this without modifying any code if you use a WordPress plugin like Yoast SEO or All in One SEO. For instance, you can apply the noindex tag to individual posts or pages in the Yoast SEO plugin as follows:

Simply access the post/page in question and select the Yoast SEO box’s cog. Then select ‘Meta robots index’ from the dropdown menu.

Additionally, starting on September 1st, Google will no longer support the usage of “noindex” in robots.txt files. More details can be found in this SearchEngineLand article.

Making 2: Using Robots.txt to Protect private content

Blocking the directory using a Robots.txt file will assist, but it is insufficient if you have private content, such as PDFs for an email course.

This is why:

If your content contains links from other websites, it can still be indexed. And malicious bots will continue to crawl it.

Keeping all private stuff behind a login is a preferable approach. This will make sure that neither malicious bots nor actual bots will be able to access your material.

The drawback is that it adds another hurdle for your guests to clear. Your stuff will be safer though.

Mistake 3: Using Robots.txt to stop duplicate content from getting indexed

When it comes to SEO, duplicate material is a big no-no.

Robots.txt is not the answer, though, as it does not prevent this content from being indexed. Once more, there is no assurance that this content won’t be discovered by search engine spiders from outside sources.

The following 3 methods also use duplicate content:

Remove duplicate content; doing so will completely remove the content. Although not ideal, this indicates that you are directing search engines to 404 sites. Consequently, deletion is not advised.

Use a 301 redirect – A 301 redirect notifies users and search engines that a page has moved. To direct visitors to your original material, just add a 301 redirect to any duplicate content.

301 redirects are’meta’ versions of the rel=”canonical” tag, which you should add. The “rel=canonical” element informs Google of the URL of the original version of a given page. As in the following code:

A link with the target “http://example.com/original-page.html” rel=”canonical” is displayed.

identifies original-page.html as the “original” version of the duplicate page, letting Google know. Yoast SEO or All in One SEO make it simple to add this tag if you use WordPress.

Use the rel=”canonical” tag if you want users to be able to access the duplicate material. Use a 301 redirect if you don’t want site visitors or search engine robots to view the information.

Both should be implemented with caution because they will affect your SEO.

Over to you

The Robots.txt file can help you control how web crawlers and other bots interact with your site. They can improve your rankings and make your site easier to crawl when applied properly.

Use this manual to learn how Robots.txt functions, how to install it, and some typical applications for it. And stay clear of the errors we covered above.