Creating a Robots.txt file
Some people believe that they should create different pages for
different search engines, each page optimized for one keyword
and for one search engine. Now, while I don't recommend that
people create different pages for different search engines, if
you do decide to create such pages, there is one issue that you
need to be aware of.
These pages, although optimized for different search engines,
often turn out to be pretty similar to each other. The search
engines now have the ability to detect when a site has created
such similar looking pages and are penalizing or even banning
such sites. To keep your site from being penalized for spamming,
you need to prevent each search engine's spider from indexing
pages that are not meant for it, i.e. you need to prevent
AltaVista from indexing pages meant for Google and vice versa.
The best way to do that is to use a robots.txt file.
You should create a robots.txt file using a text editor like
Windows Notepad. Don't use your word processor to create such a
file, since it may save formatting codes along with the text and
the robots.txt file must be plain text.
Here is the basic syntax of the robots.txt file:
User-Agent: [Spider Name]
Disallow: [File Name]
For instance, to tell AltaVista's spider, Scooter, not to spider
the file named myfile1.html residing in the root directory of
the server, you would write
User-Agent: Scooter
Disallow: /myfile1.html
To tell Google's spider, called Googlebot, not to spider the
files myfile2.html and myfile3.html, you would write
User-Agent: Googlebot
Disallow: /myfile2.html
Disallow: /myfile3.html
You can, of course, put multiple User-Agent statements in the
same robots.txt file. Hence, to tell AltaVista not to spider the
file named myfile1.html, and to tell Google not to spider the
files myfile2.html and myfile3.html, you would write
User-Agent: Scooter
Disallow: /myfile1.html

User-Agent: Googlebot
Disallow: /myfile2.html
Disallow: /myfile3.html
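One way to see how these two records are read is to feed them to a robots.txt parser. The following is a minimal sketch using the parser in Python's standard library; the names ROBOTS_TXT and may_fetch are made up for this illustration, and a real spider's own parser may differ in its details.

```python
from urllib.robotparser import RobotFileParser

# The two-record robots.txt file from the example above.
ROBOTS_TXT = """\
User-Agent: Scooter
Disallow: /myfile1.html

User-Agent: Googlebot
Disallow: /myfile2.html
Disallow: /myfile3.html
"""


def may_fetch(robots_txt: str, spider: str, path: str) -> bool:
    """Return True if the named spider is allowed to fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(spider, path)


# Print the verdict for every spider and file combination.
for spider in ("Scooter", "Googlebot"):
    for path in ("/myfile1.html", "/myfile2.html", "/myfile3.html"):
        verdict = "allowed" if may_fetch(ROBOTS_TXT, spider, path) else "blocked"
        print(f"{spider:10} {path:15} {verdict}")
```

Each spider obeys only the record that names it, so Scooter is blocked from myfile1.html alone, while Googlebot is blocked from myfile2.html and myfile3.html.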
If you want to prevent all robots from spidering the file named
myfile4.html, you can use the * wildcard character in the
User-Agent line, i.e. you would write
User-Agent: *
Disallow: /myfile4.html
However, under the original robots.txt standard you cannot use
the wildcard character in the Disallow line; the value there is
treated as a literal path prefix.
Once you have created the robots.txt file, you should upload it
to the root directory of your domain. Uploading it to any
sub-directory won't work - the robots.txt file needs to be in
the root directory.
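To make the location rule concrete, here is a small sketch that works out where a spider would look for the robots.txt file, given the address of any page on the site. The function name robots_txt_url and the host www.example.com are placeholders chosen for this illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt address for the site hosting page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.example.com/pages/myfile1.html"))
```

However deep the page sits in the directory tree, the result always points at the root of the domain, which is why a copy uploaded to a sub-directory is never read.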
I won't discuss the syntax and structure of the robots.txt file
any further - you can get the complete specifications from here.
Now we come to how the robots.txt file can be used to prevent
your site from being penalized for spamming in case you are
creating different pages for different search engines. What you
need to do is to prevent each search engine from spidering pages
which are not meant for it.
For simplicity, let's assume that you are targeting only two
keywords: "tourism in Australia" and "travel to Australia".
Also, let's assume that you are targeting only three of the
major search engines: AltaVista, HotBot and Google.
Now, suppose you name the files by the following convention:
the individual words of the keyword for which the page is being
optimized are separated by hyphens, and the first two letters of
the name of the search engine for which the page is being
optimized are added at the end.
Hence, the files for AltaVista are
tourism-in-australia-al.html
travel-to-australia-al.html
The files for HotBot are
tourism-in-australia-ho.html
travel-to-australia-ho.html
The files for Google are
tourism-in-australia-go.html
travel-to-australia-go.html
As I noted earlier, AltaVista's spider is called Scooter and
Google's spider is called Googlebot.
A list of spiders for the major search engines can be found here.
Now, we know that HotBot uses Inktomi and from this list, we
find that Inktomi's spider is called Slurp.
Using this knowledge, here's what the robots.txt file should
contain:
User-Agent: Scooter
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
Disallow: /tourism-in-australia-go.html
Disallow: /travel-to-australia-go.html

User-Agent: Slurp
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-go.html
Disallow: /travel-to-australia-go.html

User-Agent: Googlebot
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
When you put the above lines in the robots.txt file, you
instruct each search engine not to spider the files meant for
the other search engines.
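With more keywords or more search engines, typing these records by hand becomes error-prone. As a rough sketch, and assuming the file-naming convention described above, the file could be generated instead. The names SPIDERS, KEYWORDS, page_name and build_robots_txt are invented for this illustration.

```python
# Maps each two-letter file suffix to the spider of the matching search engine.
SPIDERS = {"al": "Scooter", "ho": "Slurp", "go": "Googlebot"}
KEYWORDS = ["tourism in Australia", "travel to Australia"]


def page_name(keyword: str, suffix: str) -> str:
    """Build a file name such as tourism-in-australia-al.html."""
    return "-".join(keyword.lower().split()) + "-" + suffix + ".html"


def build_robots_txt(spiders: dict, keywords: list) -> str:
    """Create one record per spider that disallows every other engine's pages."""
    records = []
    for own_suffix, spider in spiders.items():
        lines = [f"User-Agent: {spider}"]
        for other_suffix in spiders:
            if other_suffix == own_suffix:
                continue  # a spider may index the pages written for it
            for keyword in keywords:
                lines.append(f"Disallow: /{page_name(keyword, other_suffix)}")
        records.append("\n".join(lines))
    # Records are separated by a blank line.
    return "\n\n".join(records) + "\n"


print(build_robots_txt(SPIDERS, KEYWORDS))
```

Adding a fourth search engine then only requires one more entry in SPIDERS, and every record is rebuilt consistently.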
When you have finished creating the robots.txt file,
double-check to ensure that you have not made any errors
anywhere in it. A small error can have disastrous consequences -
a search engine may spider files which are not meant for it, in
which case it can penalize your site for spamming, or, it may
not spider any files at all, in which case you won't get top
rankings in that search engine.
A useful tool to check the syntax of your robots.txt file can
be found here. While it will help you correct syntax errors
in the robots.txt file, it won't help you correct any logical
errors, for which you will still need to go through the
robots.txt file thoroughly, as mentioned above.
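Part of that logical check can be automated. The sketch below, which relies on the parser in Python's standard library, confirms that each spider can reach only the pages carrying its own two-letter suffix. The names SPIDER_SUFFIX, PAGES and find_logical_errors are hypothetical, and the parser only approximates how a real spider reads the file.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-Agent: Scooter
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
Disallow: /tourism-in-australia-go.html
Disallow: /travel-to-australia-go.html

User-Agent: Slurp
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-go.html
Disallow: /travel-to-australia-go.html

User-Agent: Googlebot
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
"""

# Each spider and the file suffix of the pages written for it.
SPIDER_SUFFIX = {"Scooter": "al", "Slurp": "ho", "Googlebot": "go"}
PAGES = [
    f"/{slug}-{suffix}.html"
    for slug in ("tourism-in-australia", "travel-to-australia")
    for suffix in SPIDER_SUFFIX.values()
]


def find_logical_errors(robots_txt: str) -> list:
    """List (spider, page, allowed) triples that contradict the intended plan."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    errors = []
    for spider, suffix in SPIDER_SUFFIX.items():
        for page in PAGES:
            meant_for_spider = page.endswith(f"-{suffix}.html")
            allowed = parser.can_fetch(spider, page)
            if allowed != meant_for_spider:
                errors.append((spider, page, allowed))
    return errors


print(find_logical_errors(ROBOTS_TXT) or "no logical errors found")
```

An empty result means every spider sees exactly its own pages; any triple in the list points at a Disallow line that is missing or misplaced.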
About the author:
Article by Sumantra Roy. Sumantra is one of the most respected
and recognized search engine positioning specialists on the
Internet. For more articles on search engine placement,
subscribe to his 1st Search Ranking Newsletter by going to:
http://the-easy-way.com/newsletter.html