Friday, February 5, 2016

What is robots.txt

Robots Exclusion Standard


When primitive robots were first created, some of them would crash servers by requesting too many pages too quickly. A robots exclusion standard was crafted to allow you to tell any robot (or all of them) that you do not want some of your pages indexed or that you do not want your links followed. You can do this via a meta tag in the HTML of the page, such as

<meta name="robots" content="noindex,nofollow">

or by creating a robots.txt file that gets placed in the root of your website. The goal of either method is to tell the robots where NOT to go. The official robots exclusion protocol document is located at the following URL.

You do not need to use a robots.txt file. By default, search engines will index your site. If you do create a robots.txt file, it goes in the root level of your domain using robots.txt as the file name.
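Because the file always lives at the root of the domain under that fixed name, its address can be derived from any page URL on the site. The following is a minimal Python sketch of that rule; the function name robots_txt_url and the example.com address are hypothetical and used only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; drop the page's own path, query
    # string, and fragment, then point at /robots.txt in the root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URL, used only to show the call.
print(robots_txt_url("https://www.example.com/shop/cart.php?size=large"))
```

Note that each host name has its own robots.txt, so a subdomain would resolve to a different file than the main domain.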

This allows all robots to index everything:

User-agent: *

Disallow:

This blocks all robots from your entire site:

User-agent: *

Disallow: /

You can also disallow a folder or a single file in the robots.txt file. This disallows a folder:

User-agent: *

Disallow: /projects/

This disallows a file:

User-agent: *

Disallow: /cheese/please.html
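One way to check rules like these before publishing them is the robots.txt parser in the Python standard library, urllib.robotparser. The sketch below combines the folder rule and the file rule above into a single group and asks whether a few URLs may be fetched. The helper name is_allowed, the bot name ExampleBot, and the example.com URLs are hypothetical. The rules are written without blank lines inside the group, because this parser treats a blank line as the end of a group.

```python
from urllib.robotparser import RobotFileParser

# The folder rule and the file rule from above, combined into one group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /projects/
Disallow: /cheese/please.html
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given rules."""
    parser = RobotFileParser()
    # parse() accepts the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs, used only to exercise the rules.
for url in (
    "http://www.example.com/projects/plan.html",
    "http://www.example.com/cheese/please.html",
    "http://www.example.com/cheese/other.html",
):
    print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

This parser only reflects how a well-behaved robot should read the file; it says nothing about what a particular search engine will actually do.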

If you add a robots.txt user-agent section for a specific search engine (e.g. User-agent: Googlebot), that search engine will follow only its own section and ignore the more general rules located in the section for all search engines (User-agent: *).
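The override behavior can be seen with a small Python sketch that again uses the standard library parser. The two groups below are made-up rules for illustration, as are the helper name can_fetch and the bot name ExampleBot: the general group blocks /projects/, while the Googlebot group blocks only /cheese/.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one general group and one Googlebot-specific group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /projects/

User-agent: Googlebot
Disallow: /cheese/
"""


def can_fetch(user_agent, url):
    """Return True if user_agent may fetch url under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


PROJECT_URL = "http://www.example.com/projects/plan.html"
CHEESE_URL = "http://www.example.com/cheese/please.html"

# Googlebot reads only its own group, so /projects/ is open to it.
print("Googlebot ", can_fetch("Googlebot", PROJECT_URL), can_fetch("Googlebot", CHEESE_URL))
# Any other robot falls back to the general group.
print("ExampleBot", can_fetch("ExampleBot", PROJECT_URL), can_fetch("ExampleBot", CHEESE_URL))
```

The practical consequence is that any general rule you still want a specific engine to obey has to be repeated inside that engine's own section.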

One problem many dynamic sites have is sending search engines multiple URLs with nearly identical content. If you have products in different sizes and colors, or with other small differences, you can easily generate lots of near-duplicate content, which may prevent search engines from fully indexing your site.

If you place your variables at the start of your URLs, then you can easily block all of the sorting options using only a few disallow lines. For example, the following would block search engines from indexing any URLs that start with ‘cart.php?size’ or ‘cart.php?color’.

User-agent: *

Disallow: /cart.php?size
Disallow: /cart.php?color

Notice how there is no trailing slash at the end of the above disallow lines. Disallow rules match by prefix, so the engines will not index any URL that starts with one of those strings. If there were a trailing slash, the rule would only block URLs inside that specific folder.
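The prefix behavior described above can be sketched in a few lines of Python. This is a simplified model that assumes plain prefix matching with no wildcards; the function name is_blocked and the sample URLs are hypothetical.

```python
def is_blocked(path_and_query, disallow_rules):
    """Simplified prefix matching: blocked if the URL starts with any rule."""
    # An empty Disallow value blocks nothing, so empty rules are skipped.
    return any(rule and path_and_query.startswith(rule) for rule in disallow_rules)


CART_RULES = ["/cart.php?size", "/cart.php?color"]

# The variable comes first in the URL, so the rule matches.
print(is_blocked("/cart.php?size=large&item=5", CART_RULES))
# The same variable later in the URL is not covered by a prefix rule.
print(is_blocked("/cart.php?item=5&size=large", CART_RULES))

# Trailing slash: "/projects/" matches only URLs inside the folder,
# while "/projects" matches anything that begins with that string.
print(is_blocked("/projects-archive.html", ["/projects/"]))
print(is_blocked("/projects-archive.html", ["/projects"]))
```

The second call shows why the position of the variable matters: once the sort option moves away from the start of the URL, a plain prefix rule no longer reaches it.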

If the sort options were at the end of the URL, you would either need to create an exceptionally long robots.txt file or place robots noindex meta tags inside the sort pages. You can also name a specific user agent, such as Googlebot, instead of using the asterisk wildcard. Many bad bots will ignore your robots.txt file and/or harvest the blocked information, so do not rely on robots.txt to keep people from finding confidential information.

Googlebot also supports wildcards in the robots.txt. The following would stop Googlebot from reading any URL that includes the string ‘sort=’ no matter where that string occurs in the URL:

User-agent: Googlebot

Disallow: /*sort=

In 2006 Yahoo! also added robots.txt wildcard support. Their example pages are useful for helping you understand how to structure your robots.txt file.

You have to be careful when changing your robots.txt file, because the following code

Disallow: /*page

also blocks a file like beauty-pageants.php from being indexed in Google, because the file name contains the string ‘page’.
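A quick way to catch this kind of accidental match is to test a wildcard rule against sample URLs before deploying it. The Python sketch below is a rough approximation, not Google's actual matcher: it assumes that an asterisk stands for any run of characters and that the rule must match from the start of the URL path. The function names and sample paths are hypothetical.

```python
import re


def wildcard_rule_to_regex(rule):
    """Translate a Disallow rule containing '*' into a compiled regex."""
    # Escape the literal pieces and let each '*' match any characters.
    pattern = ".*".join(re.escape(piece) for piece in rule.split("*"))
    return re.compile(pattern)


def wildcard_blocks(rule, path_and_query):
    """Return True if the rule matches from the start of the URL path."""
    return wildcard_rule_to_regex(rule).match(path_and_query) is not None


# The sort rule catches the string wherever it appears in the URL.
print(wildcard_blocks("/*sort=", "/products.php?color=red&sort=price"))
# The page rule also catches an unrelated file name containing "page".
print(wildcard_blocks("/*page", "/beauty-pageants.php"))
print(wildcard_blocks("/*page", "/products.php"))
```

Running a list of your real URLs through a check like this makes overly broad patterns visible before any search engine sees them.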

Google’s Webmaster Tools show you which pages Google has tried to crawl that you have blocked via robots.txt, and they include a robots.txt testing tool that shows whether a specific URL would be blocked by your robots.txt file.

In 2007 Google released an unavailable_after meta tag, which tells Google to stop showing a URL in its search results after a specific date. I do not recommend using this. Instead, if one of your old URLs ranks where you would like a new one to rank, I recommend 301 redirecting it to the URL you want to rank.

