Anatomy of a search engine crawler
Many people who run a search don't understand how the results
end up there. Some think that sites are submitted, while others
know that a piece of software finds the pages. This article
explains one piece of that puzzle: the search engine crawler.
Today's search engines rely on software packages called spiders
or robots. These automated tools search the web to discover new
pages.
A brief history of search crawlers
The first crawler was the World Wide Web Wanderer, which appeared
in 1993. It was developed at MIT and its initial purpose was to
measure the growth of the web. Soon after, however, an index was
generated from the results, effectively the first "search
engine."
Since then, crawlers have evolved. Initially, crawlers were
simple creatures, only able to index specific bits of web page
data such as meta tags. Soon, however, search engines realized
that a truly effective crawler needs to be able to index other
information, including visible text, alt text, images and even
non-HTML content such as PDFs, word processor documents and
more.
How a crawler works
Generally, the crawler gets a list of URLs to visit and store.
The crawler doesn't rank the pages; it only goes out and gets
copies, which it stores or forwards to the search engine to
index and rank later according to various criteria.
Search crawlers are also smart enough to follow links they find
on pages. They may follow these links as they find them, or they
may store them and visit them later.
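The fetch, store and follow-links cycle described above can be
sketched roughly as follows. This is a minimal illustration, not
how any particular search engine does it: the `crawl` function,
the `LinkExtractor` class and the `fetch` callable are all
hypothetical names, and a real crawler would add politeness
delays, robots.txt checks and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Visit pages breadth-first; return {url: html} for every page stored.

    `fetch` is a hypothetical callable that takes a URL and returns the
    page's HTML as text, or None when the page cannot be retrieved.
    """
    frontier = deque(seed_urls)  # links stored to be visited later
    seen = set(seed_urls)        # avoids queueing the same URL twice
    stored = {}

    while frontier and len(stored) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        stored[url] = html       # keep a copy; ranking happens elsewhere

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return stored
```

Passing `fetch` in as a parameter keeps the sketch independent of
any network library; a dictionary's `get` method is enough to try
it against a handful of in-memory pages.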
To date there are dozens of crawlers regularly indexing the web.
Some are specialized crawlers such as image indexers, while
others are more general and therefore better known.
Some of the best-known crawlers include Googlebot (from
Google), MSNBot (from MSN) and Slurp (from Yahoo!). There is also
the Teoma crawler (from Ask Jeeves), as well as an assortment of
crawlers from other engines, such as shopping engines, blog
search engines and more.
Generally, when a crawler comes to visit a site, it requests a
file called "robots.txt." This file tells the search crawler
which files it can request, and which files or directories it's
not allowed to visit.
The file can also be used to limit specific spiders' access to
any or all of the site, and to control how often the crawler
visits the site by limiting its speed or the times when the
crawler can visit. (Yahoo!'s Slurp and MSNBot both support the
"Crawl-delay" directive, which tells the crawlers to slow down
their crawling.)
It's not imperative that a site have a robots.txt file, however,
as a crawler will assume it is OK to index the site if there
isn't such a file.
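As a rough illustration of how these rules are read, the sketch
below uses Python's standard `urllib.robotparser` module. The
robots.txt contents, the example.com URLs and the 10-second delay
are made up for the example; an empty rule set stands in here for
a site that has no robots.txt file at all.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for an example site.
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# Slurp is kept out of /private/ and asked to wait between requests.
slurp_private = rules.can_fetch("Slurp", "http://www.example.com/private/a.html")
slurp_home = rules.can_fetch("Slurp", "http://www.example.com/index.html")
slurp_delay = rules.crawl_delay("Slurp")

# Every other crawler falls under the "*" section.
other_cgi = rules.can_fetch("Googlebot", "http://www.example.com/cgi-bin/form")

# With no rules at all, everything is treated as allowed.
no_rules = RobotFileParser()
no_rules.parse([])
anything = no_rules.can_fetch("Googlebot", "http://www.example.com/any/page.html")

print(slurp_private, slurp_home, slurp_delay, other_cgi, anything)
```

With these made-up rules the script prints
`False True 10 False True`.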
Generally, today's crawlers behave like stripped-down versions
of web browsers. Some, like Googlebot, see a page much as a
text-based web browser such as Lynx does. Therefore, one of the
tools you can use to verify a site is the Lynx browser. By
loading the site in the browser you can see essentially what the
crawler "sees." You can then look for errors in the pages as
well as any navigation problems the crawler may come up against.
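If Lynx isn't at hand, a very rough approximation of a text-only
view can be sketched with Python's standard `html.parser`
module. The `TextOnlyView` class and `text_view` function are
hypothetical names, and this is only an assumption about what a
text-oriented crawler keeps (visible text and image alt text); it
is not how Lynx or any real crawler renders a page.

```python
from html.parser import HTMLParser


class TextOnlyView(HTMLParser):
    """Keeps visible text and image alt text; skips scripts and styles."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.chunks.append(f"[{alt}]")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def text_view(html):
    """Return the page reduced to its text, joined with single spaces."""
    parser = TextOnlyView()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Running `text_view` on a page whose navigation exists only as
images without alt text makes the problem obvious: the links
simply vanish from the output.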
One other thing you may notice, as you view your web server log
reports, is that some crawlers come many different times and
with many different configurations.
Yahoo!'s Slurp, for example, emulates many different operating
systems, from Windows 98 to Windows XP, and many different
browsers, from Internet Explorer to Mozilla. MSNBot also works
like this, emulating different operating systems and browsers.
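To see this in your own logs, you could tally visits by crawler
name. The sketch below assumes the server writes the common
"combined" log format, where the user agent is the last quoted
field on each line; the `count_crawler_visits` function is a
hypothetical helper, and the list of names is illustrative rather
than complete.

```python
import re
from collections import Counter

# Names to look for in the user-agent field. These come from the crawlers
# mentioned in this article; real user-agent strings vary, so treat the
# list as illustrative.
CRAWLER_NAMES = ("Googlebot", "Slurp", "msnbot")

# Assumption: "combined" log format, user agent is the last quoted field.
USER_AGENT_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_crawler_visits(log_lines):
    """Return a Counter mapping crawler name to number of requests."""
    visits = Counter()
    for line in log_lines:
        match = USER_AGENT_FIELD.search(line)
        if not match:
            continue
        agent = match.group(1).lower()
        for name in CRAWLER_NAMES:
            if name.lower() in agent:
                visits[name] += 1
                break
    return visits
```

Grouping on the full user-agent string instead of the crawler
name would show the different configurations one crawler reports.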
They do this to ensure compatibility. After all, the search
engines want to be sure that the majority of their users find a
site which they can use. Therefore, as a design tip, you should
test your site against various platforms and browsers as well.
You don't have to use the variety that the search engines use,
but you should test against Internet Explorer, Netscape and
Firefox. Also, you should try your site on other platforms such
as a Mac or Linux just to ensure compatibility.
You may also notice, upon reviewing your reports, that crawlers
like Googlebot will visit often and request the same page(s)
repeatedly. This is common, as crawlers want to be sure the site
is stable and also to measure the page's change frequency.
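One simple way to picture measuring change frequency is to
fingerprint the copy fetched on each visit and count how often
the fingerprint differs from the previous one. This is only a
sketch under that assumption; the `fingerprint` and `change_rate`
names are hypothetical, and search engines don't publish how
they actually do it.

```python
import hashlib


def fingerprint(html):
    """Return a short, stable digest of a page's content."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def change_rate(snapshots):
    """Fraction of consecutive visits on which the page content differed.

    `snapshots` is the page's HTML as fetched on each visit, oldest first.
    """
    prints = [fingerprint(s) for s in snapshots]
    if len(prints) < 2:
        return 0.0
    changes = sum(1 for old, new in zip(prints, prints[1:]) if old != new)
    return changes / (len(prints) - 1)
```

A page that returned the same content on five visits in a row
would score 0.0, while one that changed on every visit would
score 1.0.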
If your site goes down temporarily when a crawler visits
repeatedly like this, don't worry. The crawlers are smart enough
to leave, come back later and try again. If, however, they
continue to find the site down, or slow to respond, they may opt
to stay away for longer periods, or index the site more slowly.
This can negatively impact your site's performance in the search
engines.
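The "come back later, then stay away longer" behavior resembles
a backoff schedule. The sketch below doubles the wait after each
failed visit up to a cap. The function name, the one-hour base
and the one-week cap are all assumptions chosen for illustration;
actual crawler revisit policies aren't public.

```python
def next_visit_delay(consecutive_failures, base_hours=1.0, max_hours=168.0):
    """Hours to wait before the next visit.

    The wait doubles after each consecutive failed visit and is capped
    at `max_hours` (one week by default, an arbitrary choice).
    """
    if consecutive_failures <= 0:
        return base_hours
    return min(base_hours * 2 ** consecutive_failures, max_hours)
```

Under these assumed numbers, a site that failed three visits in
a row would be retried after 8 hours, and a site that kept
failing would eventually be retried only once a week.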
As time goes on, we'd expect these spiders to become even more
advanced. As new authoring technology or new indexing options
become available, the search crawlers will be adapted. Remember,
the goal of all the search engines is to have the most complete
index of files found on the web. This means they want to be able
to index more than just web pages.
So as you are designing your site, be sure to keep the crawlers
in mind. Don't build your site for crawlers; build it for users,
but be sure to test it thoroughly so that the crawlers see what
you want them to without hindrances or roadblocks. Remember, the
crawler is a site owner's best friend.
About the author: Rob Sullivan - SEO Specialist and Internet
Marketing Consultant. Any reproduction of this article needs to
have an html link pointing to http://www.textlinkbrokers.com