Confused between "crawler", "robot", and "spider"?
- Robot: any automated program that goes out onto the web to do work. This includes search engine crawlers, but also many other programs, such as email scrapers, site testers, and so on.
- Crawler: the specific kind of robot that search engines use to discover and index pages.
- Spider: a term many professional SEOs use. It is synonymous with "crawler", but apparently doesn't sound as non-threatening and marketing-friendly as "crawler".
robots.txt Maker/Generator
Generate effective robots.txt files that help ensure Google and other search engines are crawling and indexing your site properly.
Why do I want a robots.txt?
- It saves your bandwidth by keeping robots out of areas they don't need to crawl.
- It gives you a very basic level of protection (it is not a security measure, but it keeps well-behaved robots away from areas you'd rather they skip).
- It keeps your server logs cleaner.
- It can help prevent the spam and search penalties associated with duplicate content.
- It's good site-maintenance practice.
Some robots.txt Examples
This example tells all robots that they can visit all files, because the wildcard * applies the rule to every robot and an empty Disallow value blocks nothing:
User-agent: *
Disallow:
This example tells all robots to stay out of the entire website:
User-agent: *
Disallow: /
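To see how a well-behaved crawler interprets these two rule sets, here is a minimal sketch using Python's standard urllib.robotparser module; the bot name and the URL being checked are placeholders.
from urllib.robotparser import RobotFileParser

allow_all = ["User-agent: *", "Disallow:"]
deny_all = ["User-agent: *", "Disallow: /"]

# An empty Disallow value blocks nothing, so every URL is fetchable.
rp = RobotFileParser()
rp.parse(allow_all)
print(rp.can_fetch("AnyBot", "http://scanftree.in/some/page.html"))  # True

# "Disallow: /" blocks the whole site for every robot.
rp = RobotFileParser()
rp.parse(deny_all)
print(rp.can_fetch("AnyBot", "http://scanftree.in/some/page.html"))  # False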
This example tells all robots not to enter three directories:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
This example tells all robots to stay away from one specific file:
User-agent: *
Disallow: /directory/file.html
This example tells all robots not to crawl files of a specific type (for example, .gif):
User-agent: *
Disallow: /*.gif$
This example tells all robots not to crawl any URL that includes a question mark (?) (for example, http://scanftree.in/search?find=ankit kumar singh):
User-agent: *
Disallow: /*?
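The * and $ wildcards in the two examples above are pattern-matching extensions honored by major crawlers such as Googlebot. As a rough illustration of how such patterns match, here is a small Python sketch that translates a Disallow pattern into a regular expression; the example paths are made up.
import re

def pattern_to_regex(pattern):
    # '*' matches any run of characters; a trailing '$' anchors the
    # pattern to the end of the URL (path plus query string).
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.compile("^" + regex + ("$" if anchored else ""))

def is_blocked(pattern, path):
    return pattern_to_regex(pattern).match(path) is not None

# Disallow: /*.gif$  -- any URL ending in .gif
print(is_blocked("/*.gif$", "/images/logo.gif"))      # True
print(is_blocked("/*.gif$", "/images/logo.gif?v=2"))  # False
# Disallow: /*?      -- any URL containing a question mark
print(is_blocked("/*?", "/search?find=ankit"))        # True
print(is_blocked("/*?", "/search"))                   # False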
This example tells all robots to stay out of the website, while allowing only Google's robot to access everything except the /private/ folder:
User-agent: googlebot
Disallow: /private/
User-agent: *
Disallow: /
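Python's urllib.robotparser is a quick way to check how this group resolves for different bots; in this sketch the URLs and the "OtherBot" name are placeholders.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: googlebot",
    "Disallow: /private/",
    "",
    "User-agent: *",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("googlebot", "http://scanftree.in/page.html"))       # True
print(rp.can_fetch("googlebot", "http://scanftree.in/private/a.html"))  # False
print(rp.can_fetch("OtherBot", "http://scanftree.in/page.html"))        # False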
Some crawlers support a Sitemap directive, and multiple sitemaps can be listed in the same robots.txt:
Sitemap: http://scanftree.in/sitemap1.xml
Sitemap: http://scanftree.in/sitemap2.xml
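On Python 3.8 or newer, urllib.robotparser also exposes any Sitemap lines it finds via site_maps(); a minimal sketch reusing the two sitemap URLs above:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "Sitemap: http://scanftree.in/sitemap1.xml",
    "Sitemap: http://scanftree.in/sitemap2.xml",
    "User-agent: *",
    "Disallow:",
])
print(rp.site_maps())
# ['http://scanftree.in/sitemap1.xml', 'http://scanftree.in/sitemap2.xml']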
Several major crawlers support a Crawl-delay parameter, set to the number of seconds to wait between successive requests to the same server:
User-agent: googlebot
Crawl-delay: 10
User-agent: *
Crawl-delay: 20
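urllib.robotparser (Python 3.6 and newer) can also report the Crawl-delay that applies to a given user agent; a minimal sketch using the rules above, with "OtherBot" as a placeholder name:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: googlebot",
    "Crawl-delay: 10",
    "",
    "User-agent: *",
    "Crawl-delay: 20",
])

print(rp.crawl_delay("googlebot"))  # 10
print(rp.crawl_delay("OtherBot"))   # 20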