Confused About "crawler", "robot", and "spider"?


  • Robot: Any program that visits the web automatically. This includes search engine crawlers, but also many other programs, such as email scrapers and site testers.
  • Crawler: The term for the special kind of robot that search engines use.
  • Spider: A term that many professional SEOs use. It is synonymous with "crawler", but apparently does not sound as non-threatening and marketing-friendly as "crawler".



robots.txt Maker/Generator


Generate effective robots.txt files that help ensure Google and other search engines are crawling and indexing your site properly.

Default -  All Robots are:  
Crawl-Delay:
Sitemap: (leave blank for none)
     
Specific Search Robots:
  • Google: googlebot
  • MSN Search: msnbot
  • Yahoo: yahoo-slurp
  • Ask/Teoma: teoma
  • Cuil: twiceler
  • GigaBlast: gigabot
  • Scrub The Web: scrubby
  • DMOZ Checker: robozilla
  • Nutch: nutch
  • Alexa/Wayback: ia_archiver
  • Baidu: baiduspider
  • Naver: naverbot, yeti

Specific Special Bots:
  • Google Image: googlebot-image
  • Google Mobile: googlebot-mobile
  • Yahoo MM: yahoo-mmcrawler
  • MSN PicSearch: psbot
  • SingingFish: asterias
  • Yahoo Blogs: yahoo-blogs/v3.9

Restricted Directories: The path is relative to root and must contain a trailing "/"
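
As a rough illustration of what a generator like this does with its settings, here is a minimal Python sketch. The function name build_robots_txt and its parameters are hypothetical (they are not taken from this tool); the sketch only covers the "all robots" case and enforces the trailing "/" rule stated above.

```python
def build_robots_txt(default_allow=True, restricted_dirs=(),
                     crawl_delay=None, sitemap=None):
    """Return robots.txt text for all robots (User-agent: *).

    Hypothetical helper: the parameter names are assumptions made
    for this example, not the generator's real fields.
    """
    lines = ["User-agent: *"]
    if not default_allow:
        # Refuse everything.
        lines.append("Disallow: /")
    elif restricted_dirs:
        for path in restricted_dirs:
            # Paths are relative to root and must end with "/".
            if not (path.startswith("/") and path.endswith("/")):
                raise ValueError(f"bad directory path: {path!r}")
            lines.append(f"Disallow: {path}")
    else:
        # An empty Disallow blocks nothing.
        lines.append("Disallow:")
    if crawl_delay is not None:
        lines.append(f"Crawl-delay: {crawl_delay}")
    if sitemap:
        lines.extend(["", f"Sitemap: {sitemap}"])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(restricted_dirs=["/cgi-bin/", "/tmp/"],
                           crawl_delay=10,
                           sitemap="http://scanftree.in/sitemap1.xml"))
```

Rejecting a path without the trailing "/" up front is a design choice for the sketch: a rule such as "/tmp" would also match "/tmp-old/" by prefix, which is usually not what was intended.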
 
 
 
 
 
   



Why do I want a robots.txt?


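  • It saves bandwidth by keeping robots away from pages they do not need to fetch.
  • It gives you a very basic level of protection: well-behaved robots stay out of the paths you list, but the file is only advisory and does not stop robots that ignore it.
  • It cleans up your logs.
  • It can help prevent spam and the penalties associated with duplicate content.
  • It's good practice.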



Some robots.txt Examples


This example tells all robots that they can visit all files: the wildcard * matches every robot, and the empty Disallow blocks nothing:
User-agent: *
Disallow:

This example tells all robots to stay out of a website:
User-agent: *
Disallow: /

This example tells all robots not to enter three directories:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
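
One way to check a rule set like this before publishing it is Python's standard urllib.robotparser module. The sketch below parses the three-directory rules from a string; the robot name "anybot" and the two page URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
"""

parser = RobotFileParser()
# parse() takes the file's lines, so no network request is needed.
parser.parse(rules.splitlines())

# "anybot" is a hypothetical robot name; it falls under "User-agent: *".
blocked = parser.can_fetch("anybot", "http://scanftree.in/tmp/cache.html")
allowed = parser.can_fetch("anybot", "http://scanftree.in/index.html")

print("tmp page fetchable:", blocked)
print("index page fetchable:", allowed)
```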

This example tells all robots to stay away from one specific file:
User-agent: *
Disallow: /directory/file.html

This example blocks all files of a specific file type (for example, .gif):
User-agent: *
Disallow: /*.gif$

This example blocks access to all URLs that include a question mark (?) (for example, http://scanftree.in/search?find=ankit kumar singh):
User-agent: *
Disallow: /*?
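
To see how the * and $ patterns in the last two examples behave, the following Python sketch translates a rule path into a regular expression. It is a simplified model of the matching, assuming * means "any run of characters" and a trailing $ means "end of URL"; the helper names rule_to_regex and is_blocked are hypothetical, and real crawlers may differ in details.

```python
import re


def rule_to_regex(rule_path):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the
    pattern to the end of the URL path.
    """
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape each literal piece, then join the pieces with '.*'.
    body = ".*".join(re.escape(part) for part in rule_path.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(rule_path, url_path):
    # re.match anchors at the start, so an unanchored rule is a prefix match.
    return rule_to_regex(rule_path).match(url_path) is not None


if __name__ == "__main__":
    for rule, path in [("/*.gif$", "/images/logo.gif"),
                       ("/*.gif$", "/images/logo.gif?size=2"),
                       ("/*?", "/search?find=ankit"),
                       ("/*?", "/search")]:
        print(rule, path, is_blocked(rule, path))
```

Note that the $ anchor matters: without it, the .gif rule would also match any longer URL that merely contains ".gif".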

This example tells all robots to stay out of a website, except the Google robot, which may access everything but the /private/ folder:
User-agent: googlebot
Disallow: /private/

User-agent: *
Disallow: /
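
The per-robot behaviour of a file like this can be checked with Python's standard urllib.robotparser. This is only a sketch: the robot name "otherbot" and the page paths are invented for the example.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

site = "http://scanftree.in"
# googlebot uses its own group; every other robot uses the * group.
google_public = parser.can_fetch("googlebot", site + "/index.html")
google_private = parser.can_fetch("googlebot", site + "/private/notes.html")
other_public = parser.can_fetch("otherbot", site + "/index.html")

print("googlebot, public page:", google_public)
print("googlebot, private page:", google_private)
print("otherbot, public page:", other_public)
```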

Some crawlers support a Sitemap directive, and multiple Sitemap lines are allowed in the same robots.txt:
Sitemap: http://scanftree.in/sitemap1.xml
Sitemap: http://scanftree.in/sitemap2.xml
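
Sitemap lines can also be read back programmatically. The sketch below uses the site_maps() method of Python's standard urllib.robotparser, which assumes Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow:

Sitemap: http://scanftree.in/sitemap1.xml
Sitemap: http://scanftree.in/sitemap2.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# site_maps() returns the listed URLs in file order, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```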

Several major crawlers support a Crawl-delay parameter, set to the number of seconds to wait between successive requests to the same server. Googlebot ignores this directive, so the example names msnbot instead:
User-agent: msnbot
Crawl-delay: 10

User-agent: *
Crawl-delay: 20
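
A polite robot would read the delay that applies to it and wait that long between requests. Here is a minimal sketch using crawl_delay() from Python's standard urllib.robotparser (available since Python 3.6); the rules are of the same shape as the example above, and "otherbot" is a made-up robot name.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: msnbot
Crawl-delay: 10

User-agent: *
Crawl-delay: 20
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The named group wins for msnbot; any other robot gets the * value.
msnbot_delay = parser.crawl_delay("msnbot")
other_delay = parser.crawl_delay("otherbot")

print("msnbot waits", msnbot_delay, "seconds")
print("otherbot waits", other_delay, "seconds")
```

A crawler built on this would call time.sleep() with the returned value between successive requests to the same server.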