Dangerous User-Agents and Robots.txt Files
WWW robots, also called spiders, wanderers or crawlers, are
programs that navigate many pages on the World Wide Web by
recursively retrieving linked pages.
In 1994 there were occasions where robots visited WWW servers
where they weren’t welcome for various reasons. Robots traversed
parts of WWW servers that weren’t suitable. These incidents showed
the need for a mechanism by which WWW servers can indicate to
robots which parts of their server should not be accessed.
Robots.txt files (often mistakenly called robot.txt, in the
singular) are created by webmasters to mark the files and
directories of a web site that search engine spiders should not
access. A robots.txt checker is a validator that analyzes the
syntax of a robots.txt file to see whether its format is valid as
established by the robot exclusion standard or whether it contains
errors.
Simple Usage – How do you check the robots.txt file format?
Simply enter the full URL of the robots.txt file you want to
analyze and hit Enter.
Powerful Checker – The checker finds mistyped words, logical
errors and syntax errors, and it gives you useful optimization
tips.
Precise – Both the robot exclusion standard rules and
spider-specific extensions are considered.
In general, a robot is a program that automatically traverses the
web’s hypertext structure by retrieving a document and then
recursively retrieving all documents that it references.
The word “agent” has many meanings in computing these days,
specifically:
Intelligent Agents – These are programs that help users with
things such as choosing a product, filling in a form or finding
things. They generally have little to do with networking.
Autonomous Agents – These are programs that travel between
sites, deciding for themselves what to do and when to move. They
can travel only between special servers.
User-Agents – This is the technical name for a program that
performs networking tasks for a user, such as Web User-Agents
like Microsoft Internet Explorer and email User-Agents like
Qualcomm Eudora.
Robots can be used for a number of purposes, such as indexing,
link validation, HTML validation and mirroring. Information about
each robot is stored in an individual file, with several HTML
tables providing different views of the data:
• View their names – Alkaline, ABCdatos BotLink, Arale, Calif,
Checkbox, EbiNess, etc.
• View their type details – for example, for Alkaline: its
purpose is indexing, its availability is binary and its platform
is UNIX or Windows 95/NT. These details are presented in table
format.
• View their contact details – for example, Agent: alkalineBOT,
Host: *, Email: dblock@vestris.com. These details are also
presented in a table.
A plain-text overview is useful for browsers without table
support, and the combined raw data is available in machine-readable
format as a text file.
Robots decide for themselves where to visit. Indexing services
also allow you to submit URLs manually, which are then queued and
visited by the robot. Other sources of URLs are sometimes used,
such as scanners of USENET postings. Given those starting points,
a robot can select URLs to visit and index, and to parse and use
as a source of new URLs. When an indexing robot knows about a
document, it may decide to parse it and insert it into its
database. Some robots index only the HTML titles, while others
parse the entire HTML and index all the words.
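The traversal described above can be sketched in a few lines of Python. This is only an illustration, not how any particular indexing robot works: the PAGES dictionary stands in for the web, and the names LinkParser and crawl are made up for this example.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML body (stands in for real HTTP fetches).
PAGES = {
    "/index.html": '<title>Home</title><a href="/a.html">A</a><a href="/b.html">B</a>',
    "/a.html": '<title>Page A</title><a href="/b.html">B</a>',
    "/b.html": '<title>Page B</title><a href="/index.html">Home</a>',
}


class LinkParser(HTMLParser):
    """Collects the page title and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(start_urls):
    """Visit each reachable page once, starting from the submitted URLs."""
    queue = deque(start_urls)
    seen = set(start_urls)
    index = {}
    while queue:
        url = queue.popleft()
        html = PAGES.get(url)
        if html is None:  # unknown URL: nothing to index
            continue
        parser = LinkParser()
        parser.feed(html)
        parser.close()
        index[url] = parser.title  # "insert into database": here, the title only
        for link in parser.links:  # parsed links become new URLs to visit
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index
```

Calling crawl(["/index.html"]) visits all three sample pages exactly once, even though they link back to each other, because every URL is recorded in the seen set before it is queued.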
Want to register your page with a robot? Contact Indiaseos.com.
How you do it depends on the service; many services have a link
to a URL submission form on their search page, or give more
information in their help pages.
To find out whether you have been visited by a robot, check your
server logs for sites that retrieve many documents, particularly
in a short time. If your server supports User-agent logging,
check for retrievals with unusual User-agent header values.
Finally, if you notice a site repeatedly checking for the file
‘/robots.txt’, chances are that it is a robot too.
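As a rough sketch of these log checks, the Python below counts requests per host and notes which hosts asked for /robots.txt. It assumes the Apache/NCSA "combined" log format, ignores timestamps (so it does not measure how quickly the requests arrived), and uses made-up sample entries; the function name find_robot_suspects and the min_hits threshold are illustrative choices only.

```python
import re
from collections import Counter

# Assumes the Apache/NCSA "combined" log format; adjust the pattern for other servers.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def find_robot_suspects(lines, min_hits=3):
    """Return {host: user_agent} for hosts that fetched /robots.txt
    or made at least min_hits requests."""
    hits = Counter()
    asked_robots = set()
    agents = {}
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that are not in the assumed format
        host = m["host"]
        hits[host] += 1
        agents[host] = m["agent"]
        if m["path"] == "/robots.txt":
            asked_robots.add(host)
    return {
        host: agents[host]
        for host in hits
        if host in asked_robots or hits[host] >= min_hits
    }


# Made-up sample entries, for illustration only.
SAMPLE = [
    '10.0.0.5 - - [01/Jan/2024:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 64 "-" "ExampleBot/1.0"',
    '10.0.0.5 - - [01/Jan/2024:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '10.0.0.9 - - [01/Jan/2024:10:00:02 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
```

With the sample entries, find_robot_suspects(SAMPLE) reports only 10.0.0.5, because that host requested /robots.txt. A host that turns up here is a candidate for a closer look, not proof of a robot.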
A robot is traversing my whole site too fast! This is called
“rapid-fire”. First check whether it is really a problem by
checking the load on your server and monitoring your server’s
error log and, if you can, its concurrent connections. If your
server is, say, a desktop PC or another low-performance machine,
the problems manifest themselves as refused connections,
performance slowdowns or, in extreme cases, a system crash. If
this happens, there are a few things you need to do. Most
importantly, start logging information: when you noticed the
problem, what your logs say, and so on. This will help you later.
Next, find out where the robot came from, which IP addresses it
used and so on. If you can identify the site, email the person
responsible and ask them what’s up. If you cannot, look on their
own site for contact details, or send mail to an address at their
domain.
Is the robot not on the list? Mail indiaseos.com! If we can’t
help you, we can at least make a note of it for the future.
The method used to exclude robots from a server is to create a
file on the server that specifies an access policy for robots.
This file must be accessible through HTTP at the local URL
“/robots.txt”. This approach is very simple and can be
implemented on any existing WWW server, and a robot can find the
access policy with only a single document retrieval.
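Because the policy always lives at the same local URL, a robot can work out where to look from any page address on the site. A minimal Python sketch, in which the function name robots_url and the example.com addresses are placeholders chosen for this illustration:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the /robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

For example, robots_url("http://example.com/links/listing.html?page=2") gives "http://example.com/robots.txt", so one retrieval per site is enough to learn the policy.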
The choice of this URL was motivated by several criteria:
• The filename should fit within the file naming restrictions of
all common operating systems and should indicate the purpose of
the file.
• The filename should be easy to remember.
• The filename extension should not require extra server
configuration.
• The likelihood of a clash with existing files should be
minimal.
/robots.txt files on your server (used to keep robots from
scanning your site):
| User-agent: * |
(the value of this field is the name of the robot that the record
describes the access policy for; if the value is ‘*’, the record
describes the default access policy for any robot that has not
matched any of the other records) |
| Disallow: / |
(the value of this field is a partial URL that is not to be
visited; it can be a full path or a partial path, and an empty
value means that all URLs can be retrieved; there should be at
least one Disallow field in each record) |
Look at the following examples
User-agent: BadBot
Disallow: / |
(here the BadBot robot is not allowed to see anything.
The slash is shorthand for “all directories”) |
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/ |
(all robots can visit every directory except the three mentioned) |
User-agent: BadBot
Disallow: / |
(a blank line starts a new User-agent record) |
User-agent: *
Disallow: /private/
|
(BadBot is not allowed to see anything; all other robots can see
everything except the private folder) |
User-agent: WeirdBot
Disallow: /tmp/
Disallow: /private/
Disallow: /links/listing.html |
(this keeps WeirdBot from visiting the tmp directory, the private
directory and the listing page in the links directory) |
User-agent: *
Disallow: /tmp/
Disallow: /private/ |
(all other robots can view everything except the tmp and private
directories; listing the directories again may look inefficient,
but it is correct) |
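Records like the ones above can be tried out with Python’s standard urllib.robotparser module. The sketch below parses the rules from a string instead of fetching them from a server; the helper name can_visit and the robot name GoodBot are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# One of the combinations shown above: BadBot is shut out entirely,
# every other robot is kept out of /private/ only.
RULES = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def can_visit(agent, path, rules=RULES):
    """Return True if the rules let the named robot retrieve the path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)
```

Here can_visit("BadBot", "/index.html") is False, while can_visit("GoodBot", "/index.html") is True and can_visit("GoodBot", "/private/notes.html") is False. To test a live site instead, the same class offers set_url() and read() to download the file first.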
Have a look at these invalid entries
User-agent: *
Disallow / |
(this entry is wrong because the colon after Disallow is
missing) |
User-agent: *
Disallow: * |
(to disallow everything, use a slash, which indicates the root
directory, not an asterisk) |
User-agent: sidewiner
Disallow: /tmp/ |
(the User-agent name is misspelled; check your server logs for
the correct name) |
User-agent: *
Disallow: /tmp/
User-agent: Weirdbot
Disallow: /links/listing.html
Disallow: /tmp/ |
(robots that read the file from top to bottom stop at the first
record that applies to them, so Weirdbot may stop at the first
record, *, instead of seeing its special entry) |
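Some of the mistakes listed above can be caught mechanically. The Python sketch below is a deliberately small checker, not the checker described earlier in this article: it looks only at the User-agent and Disallow fields, ignores any other field, and cannot detect misspelled robot names or badly ordered records. The function name check_robots_txt is made up for the example.

```python
def check_robots_txt(text):
    """Return a list of (line_number, message) for problems found in text."""
    problems = []
    seen_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip():
            seen_agent = False  # a blank line ends the current record
            continue
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue  # comment-only line
        if ":" not in line:
            problems.append((number, "missing colon after field name"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            seen_agent = True
        elif field == "disallow":
            if not seen_agent:
                problems.append((number, "Disallow before any User-agent"))
            if value == "*":
                problems.append((number, "use / to disallow everything, not *"))
        # Other fields (spider-specific extensions) are not checked here.
    return problems
```

For the first invalid entry, check_robots_txt("User-agent: *\nDisallow /\n") reports the missing colon on line 2; a valid record such as "User-agent: *\nDisallow: /tmp/\n" yields an empty list.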