Some people believe that they should create different pages for different search engines, each page optimized for one keyword and for one search engine. While I don't recommend creating such pages, if you do decide to create them, there is one issue that you need to be aware of.
These pages, although optimized for different search engines, often turn out to be pretty similar to each other. The search engines can now detect when a site has created such similar-looking pages and will penalize or even ban such sites. To prevent your site from being penalized for spamming, you need to stop each search engine's spider from indexing the pages which are not meant for it, i.e. you need to prevent AltaVista from indexing pages meant for Google and vice-versa. The best way to do that is to use a robots.txt file.
You should create the robots.txt file using a plain text editor like Windows Notepad. Don't use a word processor to create such a file - word processors tend to add formatting that the spiders cannot parse.
Here is the basic syntax of the robots.txt file:

User-Agent: [Spider Name]
Disallow: [File Name]
For instance, to tell AltaVista's spider, Scooter, not to spider the file named myfile1.html residing in the root directory of the server, you would write

User-Agent: Scooter
Disallow: /myfile1.html
To tell Google's spider, called Googlebot, not to spider the files myfile2.html and myfile3.html, you would write

User-Agent: Googlebot
Disallow: /myfile2.html
Disallow: /myfile3.html
You can, of course, put multiple User-Agent records in the same robots.txt file; separate each record from the next with a blank line. Hence, to tell AltaVista not to spider the file named myfile1.html, and to tell Google not to spider the files myfile2.html and myfile3.html, you would write

User-Agent: Scooter
Disallow: /myfile1.html

User-Agent: Googlebot
Disallow: /myfile2.html
Disallow: /myfile3.html
If you want to prevent all robots from spidering the file named myfile4.html, you can use the * wildcard character in the User-Agent line, i.e. you would write

User-Agent: *
Disallow: /myfile4.html
However, you cannot use the wildcard character in the Disallow line. Each Disallow value is simply treated as a path prefix - for example, Disallow: / would block your entire site.
Once you have created the robots.txt file, you should upload it to the root directory of your domain. Uploading it to any sub-directory won't work - the robots.txt file needs to be in the root directory.
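If you want to confirm that the file is actually being served from the root, a quick check is to request it the way a spider would. Here is a minimal sketch in Python (www.yourdomain.com is a placeholder - substitute your own domain):

import urllib.error
import urllib.request

# A spider always looks for robots.txt at the root of the domain.
url = "http://www.yourdomain.com/robots.txt"

try:
    with urllib.request.urlopen(url) as response:
        print("robots.txt found at the root:")
        print(response.read().decode("utf-8", errors="replace"))
except urllib.error.HTTPError as err:
    print("robots.txt is not being served from the root (HTTP %d)" % err.code)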
I won't discuss the syntax and structure of the robots.txt file any further - you can get the complete specifications from here.
Now we come to how the robots.txt file can be used to prevent your site from being penalized for spamming in case you are creating different pages for different search engines. What you need to do is prevent each search engine from spidering the pages which are not meant for it.
For simplicity, let's assume that you are targeting only two keywords: "tourism in Australia" and "travel to Australia". Also, let's assume that you are targeting only three of the major search engines: AltaVista, HotBot and Google.
Now, suppose you have used the following convention for naming the files: each page is named by joining the individual words of the keyword for which it is optimized with hyphens, followed by the first two letters of the name of the search engine for which it is optimized.
Hence, the files for AltaVista are

tourism-in-australia-al.html
travel-to-australia-al.html

The files for HotBot are

tourism-in-australia-ho.html
travel-to-australia-ho.html

The files for Google are

tourism-in-australia-go.html
travel-to-australia-go.html
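To see the convention in action, here is a small illustrative sketch (Python is assumed purely for illustration) that generates exactly these file names from the keywords and the two-letter engine prefixes:

# Generate the page file names from the naming convention described above.
keywords = ["tourism in Australia", "travel to Australia"]
engines = {"AltaVista": "al", "HotBot": "ho", "Google": "go"}

for engine, prefix in engines.items():
    for keyword in keywords:
        filename = keyword.lower().replace(" ", "-") + "-" + prefix + ".html"
        print(engine + ": " + filename)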
As I noted earlier, AltaVista's spider is called Scooter and Google's spider is called Googlebot. A list of spiders for the major search engines can be found here. Now, we know that HotBot uses Inktomi, and from this list, we find that Inktomi's spider is called Slurp. Using this knowledge, here's what the robots.txt file should contain:
User-Agent: Scooter
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
Disallow: /tourism-in-australia-go.html
Disallow: /travel-to-australia-go.html

User-Agent: Slurp
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-go.html
Disallow: /travel-to-australia-go.html

User-Agent: Googlebot
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
When you put the above lines in the robots.txt file, you instruct each search engine not to spider the files meant for the other search engines.
When you have finished creating the robots.txt file, double-check to ensure that you have not made any errors anywhere in it. A small error can have disastrous consequences - a search engine may spider files which are not meant for it, in which case it can penalize your site for spamming, or it may not spider any files at all, in which case you won't get top rankings in that search engine.
A useful tool to check the syntax of your robots.txt file can be found here. While it will help you correct syntactical errors in the robots.txt file, it won't help you correct any logical errors, for which you will still need to go through the robots.txt file thoroughly, as mentioned above.
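The syntax checker won't catch a mix-up over which spider is blocked from which pages, but you can sanity-check that logic yourself by running the rules through a robots.txt parser and confirming that each spider may fetch only the two pages meant for it. Here is a minimal sketch using Python's built-in urllib.robotparser module (www.yourdomain.com is a placeholder for your own domain):

import urllib.robotparser

# The rules from the example above, held in a string so the check can run
# locally; for a live site you could instead call set_url() with the address
# of your real robots.txt and then read().
robots_txt = """\
User-Agent: Scooter
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
Disallow: /tourism-in-australia-go.html
Disallow: /travel-to-australia-go.html

User-Agent: Slurp
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-go.html
Disallow: /travel-to-australia-go.html

User-Agent: Googlebot
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

spiders = ["Scooter", "Slurp", "Googlebot"]
pages = ["tourism-in-australia-al.html", "travel-to-australia-al.html",
         "tourism-in-australia-ho.html", "travel-to-australia-ho.html",
         "tourism-in-australia-go.html", "travel-to-australia-go.html"]

# Each spider should be allowed to fetch only the two pages meant for it.
for spider in spiders:
    allowed = [page for page in pages
               if parser.can_fetch(spider, "http://www.yourdomain.com/" + page)]
    print(spider, "may fetch:", allowed)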
This article may be re-published as long as the following resource box is included at the end of the article and as long as you link to the email address and the URL mentioned in the resource box:
Article by Sumantra Roy. Sumantra is one of the most respected and recognized search engine positioning specialists on the Internet. For more articles on search engine placement, subscribe to his 1st Search Ranking Newsletter by sending a blank email to mailto:1stSearchRanking.999.99@optinpro.com or by going to http://www.1stSearchRanking.net