Robots Exclusion Standard
When primitive robots were first created, some of them would crash servers by requesting too many pages too quickly. A robots exclusion standard was crafted to allow you to tell any robot (or all of them) that you do not want some of your pages indexed or that you do not want your links followed. You can do this via a meta tag on the page or by creating a robots.txt file that gets placed in the root of your website. The goal of either of these methods is to tell the robots where NOT to go. The official robots exclusion protocol document is located at the following URL:
http://www.robotstxt.org/wc/exclusion.html
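For example, a page you did not want indexed or followed could carry a robots meta tag like the following in its <head> section (a minimal illustration; noindex blocks indexing, nofollow blocks link following, and either value can be used on its own):
<meta name="robots" content="noindex, nofollow">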
You do not need to use a robots.txt file. By default, search engines will index your site. If you do create a robots.txt file, it goes in the root level of your domain using robots.txt as the file name.
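For example, on a hypothetical domain such as example.com, the file would need to live at:
http://www.example.com/robots.txt
A robots.txt file placed anywhere other than the root will be ignored.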
This allows all robots to index everything:
User-agent: *
Disallow:
This disallows all robots from your entire site:
User-agent: *
Disallow: /
You also can disallow a folder or a single file in the robots.txt file. This disallows a folder:
User-agent: *
Disallow: /projects/
This disallows a file:
User-agent: *
Disallow: /cheese/please.html
If you make a robots.txt user-agent command for a specific search engine (e.g., User-agent: Googlebot), the associated search engine will ignore the more general rules located in the section for all search engines (User-agent: *).
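As an illustration of that behavior (the folder names here are hypothetical), in the following file Googlebot would obey only its own section and block /images/, while ignoring the general /private/ rule; any rule you still want applied to Googlebot has to be repeated in its own section:
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /images/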
One problem many dynamic sites have is sending search engines multiple URLs with nearly identical content. If you have products in different sizes and colors, or other small differences, it is likely that you could generate lots of near-duplicate content, which will prevent search engines from fully indexing your site.
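As a hypothetical example of the problem, a single product might be reachable at all of the following URLs, each showing essentially the same page:
cart.php?size=small&item=219
cart.php?size=large&item=219
cart.php?color=red&item=219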
If you place your variables at the start of your URLs, then you can easily block all of the sorting options using only a few disallow lines. For example, the following would block search engines from indexing any URLs that start with ‘cart.php?size’ or ‘cart.php?color’.
User-agent: *
Disallow: /cart.php?size
Disallow: /cart.php?color
Notice how there is no trailing slash at the end of the above disallow lines. That means the engines will not index any URL that starts with that string. If there were a trailing slash, search engines would only block that specific folder.
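To make the distinction concrete (the path names here are hypothetical), the first rule below blocks any URL whose path begins with /projects, including /projects.html and /projects-archive/, while the second blocks only URLs inside the /projects/ folder:
Disallow: /projects
Disallow: /projects/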
If the sort options were at the end of the URL, you would either need to create an exceptionally long robots.txt file or place the robots noindex meta tag inside the sort pages. You also can specify any specific user agent, such as Googlebot, instead of using the asterisk wildcard. Many bad bots will ignore your robots.txt file and/or harvest the blocked information, so you do not want to use robots.txt to block individuals from finding confidential information.
Googlebot also supports wildcards in the robots.txt file. The following would stop Googlebot from reading any URL that includes the string ‘sort=’ no matter where that string occurs in the URL:
User-agent: Googlebot
Disallow: /*sort=
In 2006 Yahoo! also added robots.txt wildcard support. Their example page is useful for helping you understand how to structure your robots.txt file:
http://www.ysearchblog.com/archives/000372.html
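Assuming Yahoo!'s crawler is addressed by its usual robots.txt user agent name, Slurp, a wildcard rule equivalent to the Googlebot example above might look like this (a sketch, not copied from Yahoo!'s documentation):
User-agent: Slurp
Disallow: /*sort=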
You have to be careful when changing your robots.txt file, because the following line
Disallow: /*page
also blocks a file like beauty-pageants.php from being indexed in Google.
Google’s Webmaster Toolset shows you which pages Google has tried to crawl that you have already blocked via robots.txt, and it includes a robots.txt testing tool that will show you whether a specific URL would end up being blocked by your robots.txt file.
In 2007 Google released an unavailable_after meta tag, which tells Google not to crawl a URL after a specific date. I do not recommend using this. Instead, if one of your old URLs ranks where you would like a new one to rank, I recommend 301 redirecting it to the URL you want to rank.
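For reference, the tag is addressed to Googlebot and takes a date; a sketch along the lines of Google's announcement (the date here is only a placeholder) looks like this:
<meta name="googlebot" content="unavailable_after: 25-Aug-2007 15:00:00 EST">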