PositionCare.com major engines


  did you know...
robots.txt is a file that tells search engines what to index and what not to index. In theory it's a good way of keeping search engines away from pages you don't want them to find, but in practice, because the file has to sit in the public folder of your site where anyone can read it, using it to hide pages from spiders is very insecure.

In addition to the robots.txt file -- which allows you to concisely specify instructions for a large number of files on your web site -- you can use the robots META tag for fine-grain control over individual pages on your site. To implement this, simply add specific META tags to HTML pages to control how each individual page is indexed.
  avoiding the index - robots.txt - META robots tag


What if certain parts of your Web site contain confidential information? Perhaps your site is still under construction, and you want to get comments from a few people, but you aren't ready to tell the whole world that the pages are live.

You may prefer that some or all of your pages not be indexed. You can prevent spiders, also known as robots, from accessing parts of your site by using a robots.txt file, which tells spiders that your Web site, or specific parts of your site, are off-limits.

 


WARNING: The robots.txt file is not a shield against unauthorized entry. Please do not post material to the Internet that absolutely should not be seen by an unauthorized person.

Compliance with this standard is voluntary; it is not enforced by any organization or by law. Keep in mind that while most major search engines respect it, a crawler can simply ignore the standard and access the file anyway.

The robots.txt file

The "Robot Exclusion Standard" specifies a protocol for site administrators to direct the actions of so-called "robots" that crawl the Web and index Web sites. You do this with a small text file named "robots.txt", which contains the instructions you post for visiting robots. You can exclude a particular crawler or all crawlers (that follow the standard) from your entire site, from particular directories, or from particular files.

This file needs to be placed in the top level of your server's document space, so that it is served as /robots.txt. If your site is hosted at an ISP, you may need to ask the ISP's webmaster for help with this.
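
Because the file lives at the top level of the site, a crawler can work out where to look for it from any page address. The following is a minimal sketch of that lookup, not taken from the standard itself; the function name robots_txt_url and the example.com address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the address where a crawler would look for robots.txt."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port), replace the path
    # with /robots.txt, and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page address, used only as an example.
print(robots_txt_url("http://www.example.com/images/personal/photo.html"))
# prints http://www.example.com/robots.txt
```

A robots.txt file placed in a subdirectory (for example /images/robots.txt) would never be requested by a crawler that follows this convention.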

If you decide to use robot exclusion, keep in mind that Web server software often comes with a directory indexing feature. If your server has that feature, and it happens to be in effect, then any crawler that comes to your site could grab everything right out of the index, even if you had set up for robot exclusion. So the first thing you have to do is shut off the directory indexing feature.

To exclude your site from all web crawlers, create a file named robots.txt that states:

User-agent: *
Disallow: /

To exclude just one crawler (e.g. AltaVista's "Scooter"), your file should read:

User-agent: scooter
Disallow: /

To limit the exclusion to a particular directory or file, put its path after "Disallow:". For instance:

User-agent: *
Disallow: /images/personal/
Disallow: /cgi-bin/

In this example, two directory paths under the server root are excluded. You need a separate Disallow line for every path you want to exclude, and you may not have blank lines within a record, as blank lines are used to delimit multiple records. The "*" in the User-agent field is a special value meaning "any robot"; it cannot be used as a wildcard anywhere else in the record.
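
Before publishing a robots.txt file, it can help to check how a standard-following crawler would read it. The sketch below feeds the directory example to Python's standard-library urllib.robotparser module; the crawler name "ExampleBot" and the example.com addresses are placeholders, and a real crawler may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The directory-exclusion example from the text.
RULES = """\
User-agent: *
Disallow: /images/personal/
Disallow: /cgi-bin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(RULES)
for path in ("/index.html", "/images/personal/me.jpg", "/cgi-bin/search"):
    # "ExampleBot" is a hypothetical crawler name; it matches the "*" record.
    allowed = parser.can_fetch("ExampleBot", "http://www.example.com" + path)
    print(path, "allowed" if allowed else "excluded")
```

Each Disallow value is treated as a path prefix, so everything under /images/personal/ and /cgi-bin/ is reported as excluded while /index.html remains allowed.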

To allow a single robot complete access and exclude all others:

User-agent: Lycos
Disallow:

User-agent: *
Disallow: /
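
The effect of these two records can be checked the same way with Python's urllib.robotparser. This is only a sketch: "OtherBot" and the example.com address are invented for the test, and the records are written with a blank line between them, which is how the standard delimits records.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: Lycos
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

url = "http://www.example.com/index.html"  # hypothetical page
# An empty Disallow value means nothing is excluded for that robot.
print("Lycos:", parser.can_fetch("Lycos", url))        # prints Lycos: True
# Every other robot falls through to the "*" record, which excludes everything.
print("OtherBot:", parser.can_fetch("OtherBot", url))  # prints OtherBot: False
```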


You can also use META tags, without your ISP's cooperation, to exclude crawlers from particular pages. Not all indexing robots observe these META tags, however.

  • NOINDEX prevents anything on the page from being indexed.
  • NOFOLLOW prevents the crawler from following the links on the page and indexing the linked pages.
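  • NOIMAGEINDEX prevents the images on the page from being indexed, but the text on the page can still be indexed. This is an engine-specific extension, not one of the core directives described under "Meta Tag Structure".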

Excluding search crawlers from specific files can give you a way to assert some control over the visitor's experience at your site. For instance, if you wanted to hold a trivia contest, you could put robot exclusion on the pages with the answers, so people wouldn't be able to find those pages randomly -- they'd only find the pages with the questions.

One last point: not all robots adhere to the Robot Exclusion Standard, so if you have material you really want to keep away from all search engines, you should consider arranging for some kind of password protection.

Additional Information
To learn more about this subject, we recommend the Web document titled Robot Exclusion Standard Revisited.


The Robots META Tag

The Robots meta tag enables HTML authors and content providers to tell visiting robots whether a document may be indexed or used to traverse additional links. It differs from the robots.txt file in that you don't need access to the Web server's root directory.

Meta Tag Placement
Like any meta tag, the robots meta tag should be placed in the HEAD section of an HTML document:

<html>
<head>
<meta name="robots" content="noindex,nofollow">
<meta name="description" content="This webpage...">
<title>Welcome to my Page</title>
</head>
<body>
.
.
.
</body>
</html>

Meta Tag Structure
The content of the Robots meta tag includes directives separated by commas. The currently defined directives are [NO]INDEX and [NO]FOLLOW. The INDEX directive specifies whether a robot should index the page, and the FOLLOW directive specifies whether a robot should follow links on the page. The defaults are INDEX and FOLLOW. Examples:

<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">

NOTE: The name of the tag and the content are not case sensitive.
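
To illustrate how a robot might apply these rules, here is a minimal sketch that reads the Robots meta tag from a page using Python's standard-library html.parser. The class and function names are invented for this example; it handles only the INDEX and FOLLOW directives described above, and real crawlers may behave differently.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects INDEX/FOLLOW decisions from <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        # Defaults when no tag or directive is present: INDEX and FOLLOW.
        self.index = True
        self.follow = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        # The tag name and the content are not case sensitive.
        if attr.get("name", "").lower() != "robots":
            return
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive == "noindex":
                self.index = False
            elif directive == "index":
                self.index = True
            elif directive == "nofollow":
                self.follow = False
            elif directive == "follow":
                self.follow = True


def robots_directives(html: str) -> tuple[bool, bool]:
    """Return (may_index, may_follow) for the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow


page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
print(robots_directives(page))  # prints (False, False)
```

A page with no Robots meta tag yields (True, True), matching the stated defaults.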


© 1999-2007 PositionCare.com. All Rights Reserved.