Why Robots.txt?

I am sure that many of you have heard of the file named
robots.txt (also called a "robot exclusion file"). But what does
this file actually do? Basically, you can think of a robots.txt
file as a list of rules that well-behaved search engine spiders
follow when they crawl your site. A robots.txt file gives you,
the Webmaster, a say in what does and does not get indexed when
spiders come to your little corner of the web.

Okay, I
can hear a few people asking why anyone would want to keep some
things from being indexed. Isn't the goal to get indexed? Well,
yes and no. There are quite a few instances when blocking spider
access to certain areas or pages is almost a must. Here are
several examples of what a person might want to restrict access
to: temporary files or directories, presentations, pages meant
to be read in a specific order, testing directories, or the
cgi-bin directory. As you can see from just these few examples,
there are definitely files that you would want to keep from
being indexed.

While there is a Meta
tag (<meta name="Robots" content="attributes">) available
that does, in essence, the same thing as a robots.txt file, it
is not currently supported by every search engine. Another
drawback is that the tag needs to go on every page you do not
want indexed, as opposed to one central point of control.

Writing 101

All right, I have given you a
few vague examples of what might be included in such a file.
There is never going to be a set list of things that should and
should not be indexed; a robots.txt file needs to be tailored
to your site and your content. There is, however, a very
specific format that needs to be followed when creating a
robots.txt file.

Step 1: First, a robots.txt file
needs to be created in Unix format, that is, with Unix line
endings. This ensures that no carriage returns are inserted into
your file. I would suggest looking at Notepad++, my personal
favorite text editor because of the number of languages and
formats it supports. Notepad++ can put a document directly into
Unix format when you select "Convert to Unix Format" from the
"Format" menu. Other plain text editors should be able to
achieve the same result; however, stay away from editors like
WordPad or Microsoft Word when creating your robots.txt file. I
also do not recommend using HTML editors for this task.

Step
2: Now let's begin adding some content to our file. A
robots.txt record is made up of two fields. The first is the
User-agent line, which specifies the spider/robot that we intend
to limit or allow. An example of this would be:

    User-agent: googlebot

In addition to allowing or restricting specific spiders, you can
use a wildcard to target all spiders coming to your site. To do
this, simply use an asterisk (*) as your User-agent. Example:

    User-agent: *

Step 3: Now we
will begin to disallow our desired content. Either a file or a
whole directory can be kept from being indexed with a robots.txt
file. We do this with the second line of our file, the Disallow
directive. Here is an example for a directory:

    Disallow: /cgi-bin/

Or for a file:

    Disallow: /temp/temp.html

Moreover, you are not limited to just one Disallow per
User-agent; in fact, you can get pretty granular about what you
give spiders access to. Just make sure that you give each
Disallow its own line. If you leave the Disallow field empty
(i.e. Disallow: ), you are giving permission for all files and
directories to be indexed.

One word
of caution when writing your robots exclusion file: if you are
not careful, you can shut off one or all spiders' access to your
site completely. This happens when you prohibit access at the
root level by using a slash (/). Example:

    Disallow: /

If you were to use the asterisk wildcard as your User-agent with
the above example, you would block all search engines from every
part of your site.

Step
4: That is all there is to creating a robots.txt file. The
next step is to upload it to the root directory of your site
(www.yoursite.com/), so that it can be reached at
www.yoursite.com/robots.txt. Make sure that you upload it in
ASCII mode, just like all other text files, and you are done.

Step 5: Writing a robots.txt file is pretty straightforward once
you get comfortable with the file's format. Once your file is
complete and uploaded, it is good practice to have it validated;
you can do this through www.searchengineworld.com.

Notes:
Aside from spider-specific rules, you can also add comments to
your robots.txt file. This is done with the pound sign (#).
Though you can place a comment after the Disallow field, it is
not recommended. Instead, begin each comment on a new line
starting with the pound sign. Example:

    # Just making a comment
    User-agent: googlebot
    Disallow: /cgi-bin/

If you are hesitant
about the different steps involved in creating a robots.txt
file, there are applications available that will help you
through the creation process. One application that does this is
RoboGen from Rietta Solutions. RoboGen provides you with an
Explorer-like view that lets you browse the files and
directories that you want to restrict access to and creates the
robot exclusion file as you go.

In Closing

As with all things, there are going to
be some drawbacks you will need to contend with. With the
robots.txt file, it is the road map effect: for those who want
to see what you do not want made publicly available, the file
provides a prime place to begin looking. Since all robot
exclusion files are named the same and are always in the same
place, people who go probing will know where to find it.

Still, the pros outweigh
the cons. By having a robots.txt file present on your site, you
keep important or private information from ending up in a search
engine's cache, where it would be publicly available to a mass
audience. This is what the file is there for. If, on the other
hand, you have something that not only needs to be kept private
but also needs to be protected, you should make sure that access
is restricted through much more secure and appropriate means.
Robot exclusion files were designed as a method for Webmasters
to limit the access robots have to their sites, providing robots
with one central place to look when they begin the task of
indexing. To this end, the file serves its purpose extremely
well, and when used properly it makes the job of a Webmaster
much easier.
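
If you would rather check your rules before uploading them, here is a minimal sketch in Python. It is not part of the steps above, only one way to try them out. It assumes Python 3 and its standard-library urllib.robotparser module, and the helper names (to_unix_bytes, save_robots_txt, is_allowed) are made up for this illustration. It takes the example rules from Steps 2 and 3, forces Unix line endings as described in Step 1, and asks the parser which paths a spider may fetch.

```python
from urllib.robotparser import RobotFileParser

# Example rules only; the two paths are the ones used earlier in this article.
RULES = """\
# Keep all spiders out of the script directory and one temp page
User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/temp.html
"""


def to_unix_bytes(text):
    """Return the rules as ASCII bytes with Unix (LF-only) line endings."""
    unix_text = text.replace("\r\n", "\n").replace("\r", "\n")
    return unix_text.encode("ascii")


def save_robots_txt(path, text=RULES):
    """Write the file in binary mode so no carriage returns are added."""
    with open(path, "wb") as handle:
        handle.write(to_unix_bytes(text))


def is_allowed(rules, user_agent, url_path):
    """Ask Python's standard-library parser whether a spider may fetch a path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url_path)


if __name__ == "__main__":
    # Hypothetical paths to check against the example rules.
    for path in ("/index.html", "/cgi-bin/search.pl", "/temp/temp.html"):
        print(path, is_allowed(RULES, "googlebot", path))
```

Running the script prints each path followed by True or False. With the rules above, /index.html should come back as allowed and the two disallowed paths as blocked. Keep in mind that Python's parser is only one interpretation of the rules, so treat this as a sanity check and not as a guarantee of how any particular search engine will behave.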

About the author:
Matt Benya is a co-owner of Primate Studios (www.primatestudios.com), an independent
development house focusing on CGI illustration, Web design, and
multimedia. With 20+ years of art experience and a degree in
network administration, Matt is well suited to translate your
needs to the Web.