
Dangerous User-Agents and Robots.txt Files

WWW robots, also called spiders, wanderers, or crawlers, are programs that navigate many pages on the World Wide Web by recursively retrieving linked pages.

In 1994 there were occasions when robots visited WWW servers where they were not welcome for various reasons, or traversed parts of servers that were not suitable. These incidents showed the need for a mechanism that lets WWW servers indicate to robots which parts of the server should not be accessed.

Robots.txt files (often incorrectly called robot.txt, in the singular) are created by webmasters to mark the files and directories of a web site that search engine spiders should not access. A robots.txt checker is a validator that analyzes the syntax of a robots.txt file to see whether its format is valid as established by the robot exclusion standard, or whether it contains errors.

Simple Usage:

How do you check a robots.txt file's format? Very simple: just enter the full URL of the robots.txt file you want to analyze and hit Enter.
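As an illustration only (not the checker described here), a minimal Python sketch that downloads a robots.txt file from a full URL using the standard library; the URL is a placeholder:

from urllib.request import urlopen

# Hypothetical URL: replace it with the robots.txt you want to analyze.
url = "https://www.example.com/robots.txt"
with urlopen(url) as response:
    content = response.read().decode("utf-8", errors="replace")

print(content)  # the raw rules, ready to be inspected or validated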

Powerful Checker:

The checker finds mistyped words, logical errors, and syntax errors, and it gives you useful optimization tips.
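The following Python sketch shows the kind of check such a validator performs; it is an assumption about how a simple checker could work, not the checker offered here. It flags lines that are missing a colon or that use an unrecognised field name:

# Field names accepted by this simple sketch (the core fields plus the
# widely used Sitemap and Crawl-delay extensions).
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def check_robots_txt(text):
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue                           # blank lines separate records
        if ":" not in line:
            problems.append("line %d: missing colon in %r" % (number, line))
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field not in KNOWN_FIELDS:
            problems.append("line %d: unknown field %r" % (number, field))
    return problems

For example, check_robots_txt("User-agent: *\nDisallow /") reports the missing colon, just like one of the invalid entries shown near the end of this article; it could equally be run on the content downloaded in the previous sketch.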

Precise:

Both the rules of the robot exclusion standard and spider-specific extensions are taken into account.

In general, a robot is a program that automatically traverses the web's hypertext structure by retrieving a document and then recursively retrieving all documents it references.

The word “agent” has many meanings in computing these days.

Intelligent Agents:

These are programs that help a user choose a product, guide a user through filling in a form, or help a user find things. They have something to do with the networking side.

Autonomous Agents:

These are programs that travel between sites, deciding for themselves what to do and when to move on. They can travel only between special servers.

User-Agents:

This is the technical name for a program that performs networking tasks for a user, such as a web user-agent like Microsoft Internet Explorer or an email user-agent like Qualcomm Eudora.
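A short Python sketch, purely illustrative, of where the user-agent name appears in practice: every HTTP client identifies itself with a User-Agent header on each request. The URL and the agent string below are placeholders:

from urllib.request import Request, urlopen

request = Request(
    "https://www.example.com/",
    headers={"User-Agent": "ExampleBot/1.0 (+https://www.example.com/bot)"},
)
with urlopen(request) as response:
    print(response.status)   # the server's logs will record "ExampleBot/1.0 ..."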

Robots can be used for a number of purposes, such as indexing, link validation, HTML validation, mirroring, and so on. Information about each robot is stored in an individual file, with several HTML tables providing different views of the data:

View their names - Alkaline, ABCdatos BotLink, Arale, Calif, Checkbox, EbiNess, etc.
View their type details - shown in a table; for example, Alkaline's purpose is indexing, its availability is binary, and its platform is UNIX or Windows 95/NT.
View their contact details - these include the agent name (alkalineBOT), host (*), and email (dblock@vestris.com), all presented in a table.

An overview in plain text is useful for browsers without table support, and the combined raw data is available in a machine-readable text file.

Robots decide for themselves where to visit. Indexing services also allow you to submit URLs manually; these are queued and later visited by the robot. Other sources of URLs can be used as well, such as scanners that read USENET postings. Given those starting points, a robot selects URLs to visit and index, parsing each document and using it as a source of new URLs. Once an indexing robot knows about a document, it may decide to parse it and insert it into its database; some index only HTML titles, while others parse the entire HTML or index every word.
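A minimal Python sketch of that traversal, assuming a placeholder start URL and a small page limit: it retrieves a document, extracts its links, and queues them as sources of new URLs, which is the core loop of an indexing robot:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href values of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

queue = deque(["https://www.example.com/"])    # placeholder starting point
seen = set()

while queue and len(seen) < 10:                # small limit for the sketch
    url = queue.popleft()
    if url in seen:
        continue
    seen.add(url)
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    parser = LinkParser()
    parser.feed(html)
    for link in parser.links:
        next_url = urljoin(url, link)
        if next_url.startswith(("http://", "https://")):   # skip mailto:, javascript:, etc.
            queue.append(next_url)             # each page is a source of new URLs

A real indexing robot would also respect robots.txt, throttle its requests, and store the parsed documents in a database.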

Want to register your page with a robot? Contact Indiaseos.com.

It depends on the service; many services have a link to a URL submission form on their search page, or more information in their help pages.

To find out whether you have been visited by a robot, check your server logs for sites that retrieve many documents, especially in a short time. If your server supports User-Agent logging, check for retrievals with unusual User-Agent header values. Finally, if you notice a site repeatedly requesting the file ‘/robots.txt’, chances are that is a robot too.
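A small Python sketch of that log check, assuming a common (combined) access-log format and a placeholder log path: it lists the client addresses that requested /robots.txt, which is a strong hint that a robot has visited:

robot_hits = {}
with open("/var/log/apache2/access.log") as log:   # placeholder path
    for line in log:
        if '"GET /robots.txt' in line:
            address = line.split()[0]              # client address is the first field
            robot_hits[address] = robot_hits.get(address, 0) + 1

for address, count in sorted(robot_hits.items(), key=lambda item: -item[1]):
    print(address, count)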

A robot is traversing my whole site too fast! This is called “rapid-fire”. First check whether it is actually a problem by looking at the load on your server, monitoring your server's error log and, if you can, the number of concurrent connections. Rapid-fire problems show up as refused connections, performance slowdowns and, in extreme cases, a system crash; a low-performance desktop PC serving the site is especially vulnerable. If this happens, there are a few things to do. Most importantly, start logging information: when you noticed the traffic, what your logs say, and so on; this will help you later. Next, find out where the robot came from and what IP addresses it used. If you can identify the operator, email them and ask what is going on; if not, try their own site for contact details, or send mail to their domain.
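A hedged Python sketch of that rapid-fire check, again assuming the combined log format (timestamp in square brackets) and a placeholder log path: it counts requests per client per minute so the busiest visitors stand out:

from collections import Counter

hits = Counter()
with open("/var/log/apache2/access.log") as log:    # placeholder path
    for line in log:
        fields = line.split()
        if not fields or "[" not in line:
            continue
        address = fields[0]                          # client address
        start = line.find("[")
        minute = line[start + 1:start + 18]          # e.g. 10/Oct/2024:13:55
        hits[(address, minute)] += 1

for (address, minute), count in hits.most_common(5):
    print("%s made %d requests during %s" % (address, count, minute))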

Is a robot missing from the list? Mail indiaseos.com! If we can't help you, at least we can make a note of it for the future.

The method used to exclude robots from a server is to create a file on the server that specifies an access policy for robots. The file must be accessible via HTTP at the local URL “/robots.txt”. This approach is very simple, can be implemented on any existing WWW server, and lets a robot find the access policy with only a single document retrieval.
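A minimal Python sketch of how a well-behaved robot uses this file, via the standard urllib.robotparser module: one retrieval of /robots.txt, after which every candidate URL is checked against the parsed policy before being fetched. The site and robot names are placeholders:

from urllib import robotparser

policy = robotparser.RobotFileParser()
policy.set_url("https://www.example.com/robots.txt")
policy.read()                                        # the single retrieval

for path in ("/", "/private/report.html", "/tmp/scratch.txt"):
    url = "https://www.example.com" + path
    allowed = policy.can_fetch("ExampleBot", url)
    print(path, "allowed" if allowed else "disallowed")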

The choice of this URL was guided by several criteria:

• The filename should fit the file-naming limitations of all common operating systems and should indicate the purpose of the file.
• The filename should be easy to remember.
• No extra server configuration should be required for the filename extension.
• The possibility of a clash with existing files should be minimal.

/robots.txt files on your server (used to prevent robots from scanning your site):

User-agent: *

(the value of this field is the name of the robot whose access policy the record describes; if the value is ‘*’, the record gives the default access policy for any robot that has not matched any of the other records)

Disallow: /

(this field gives a partial URL that is not to be visited; it can be a partial path or a full path, and an empty value means that any URL may be retrieved. At least one Disallow field must be present in each record)

Look at the following examples:

User-agent: BadBot
Disallow: /

(here the BadBot robot is not allowed to see anything;
the slash is shorthand for “all directories”)

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/


(all robots can visit every directory except the three mentioned)

User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/

(BadBot is not allowed to see anything, while all other robots can see everything except the private folder; a blank line starts a new user-agent record)

User-agent: WeirdBot
Disallow: /tmp/
Disallow: /private/
Disallow: /links/listing.html

(this record keeps WeirdBot from visiting the tmp directory, the private directory, and the listing page in the links directory)

User-agent: *
Disallow: /tmp/
Disallow: /private/

(all other robots can view everything except the tmp and private directories; repeating the Disallow lines already listed for WeirdBot is somewhat inefficient, but it works)

Have a look at some invalid entries:

User-agent: *
Disallow /

(note that this is a wrong entry, because the colon after Disallow is missing)

User-agent: *
Disallow: *

(if you want to disallow everything, use a slash, which indicates the root directory)

User-agent: sidewiner
Disallow: /tmp/

(a misspelled user-agent name; have a look at your server logs for the correct spelling)

User-agent: *
Disallow: /tmp/
User-agent: Weirdbot
Disallow: /links/listing.html
Disallow: /tmp/

(robots read from top to bottom and stop at the first record that matches them, so WeirdBot stops at the ‘*’ record instead of reaching its special entry; list the more specific user-agent records before the ‘*’ record)

 
