Confused between "crawler", "robot", and "spider"?
- Robot: Any program that visits websites automatically. This includes search engine crawlers, but also many other programs, such as email scrapers, site testers, and so on.
- Crawler: The specific kind of robot that search engines use to discover and fetch pages.
- Spider: A term many professional SEOs use; it's synonymous with "crawler", but apparently doesn't sound as non-threatening and marketing-friendly as "crawler".
Generate effective robots.txt files that help ensure Google and other search engines crawl and index your site properly.
Why do I want a robots.txt?
- It saves your bandwidth: robots stop requesting pages you have excluded.
- It gives you a very basic level of protection (robots.txt is advisory only; it will not stop robots that choose to ignore it).
- It cleans up your logs.
- It can prevent penalties associated with duplicate content.
- It's good site-administration practice.
Some robots.txt Examples
This example tells all robots that they can visit all files: the wildcard * addresses every robot, and the empty Disallow blocks nothing:
User-agent: *
Disallow:
This example tells all robots to stay out of the entire website:
User-agent: *
Disallow: /
This example tells all robots not to enter three directories:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
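Rules like these can be checked programmatically. As a quick sketch, Python's standard `urllib.robotparser` can parse a robots.txt body from a string (the robot name and URLs below are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# The same three-directory robots.txt as in the example above.
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# /tmp/ is disallowed for every robot; everything else is allowed.
print(rp.can_fetch("mybot", "http://example.com/tmp/page.html"))  # False
print(rp.can_fetch("mybot", "http://example.com/index.html"))     # True
```

Note that `urllib.robotparser` implements the original prefix-matching rules, so it is a good fit for plain Disallow paths like these.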
This example tells all robots to stay away from one specific file:
User-agent: *
Disallow: /directory/file.html
This example blocks files of a specific file type (for example, .gif); the $ anchors the match to the end of the URL. Note that the * and $ wildcards in paths are extensions honored by major crawlers such as Googlebot, not part of the original robots.txt standard:
User-agent: *
Disallow: /*.gif$
This example blocks access to all URLs that include a question mark (?), for example http://scanftree.in/search?find=ankit kumar singh:
User-agent: *
Disallow: /*?
This example tells all robots to stay out of the website, while letting Google's robot access everything except the /private/ directory (a robot obeys the most specific User-agent group that matches it and ignores the rest):
User-agent: googlebot
Disallow: /private/

User-agent: *
Disallow: /
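The per-robot precedence can be verified with the same stdlib parser; this is a small sketch with hypothetical URLs and robot names:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot matches its own group, so only /private/ is off limits to it.
print(rp.can_fetch("googlebot", "http://example.com/page.html"))       # True
print(rp.can_fetch("googlebot", "http://example.com/private/a.html"))  # False

# Any other robot falls through to the * group and is blocked entirely.
print(rp.can_fetch("otherbot", "http://example.com/page.html"))        # False
```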
Some crawlers support a Sitemap directive, and multiple Sitemaps may appear in the same robots.txt:
Sitemap: http://scanftree.in/sitemap1.xml
Sitemap: http://scanftree.in/sitemap2.xml
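Sitemap lines are independent of any User-agent group, and on Python 3.8+ the stdlib parser exposes them via `site_maps()`; a minimal sketch:

```python
from urllib.robotparser import RobotFileParser

rules = """\
Sitemap: http://scanftree.in/sitemap1.xml
Sitemap: http://scanftree.in/sitemap2.xml

User-agent: *
Disallow:
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# site_maps() (Python 3.8+) returns every Sitemap URL found, or None if none.
print(rp.site_maps())
# ['http://scanftree.in/sitemap1.xml', 'http://scanftree.in/sitemap2.xml']
```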
Several major crawlers (for example Bing and Yandex) support a Crawl-delay parameter, set to the number of seconds to wait between successive requests to the same server; note that Googlebot ignores this directive, so the first group below has no effect on Google:
User-agent: googlebot
Crawl-delay: 10

User-agent: *
Crawl-delay: 20
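Even though Google ignores it, the value is easy to read out of a robots.txt; on Python 3.6+ the stdlib parser exposes it via `crawl_delay()` (robot names here are illustrative):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: googlebot
Crawl-delay: 10

User-agent: *
Crawl-delay: 20
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# crawl_delay() (Python 3.6+) returns the delay for the matching group,
# falling back to the * group for unlisted robots.
print(rp.crawl_delay("googlebot"))  # 10
print(rp.crawl_delay("otherbot"))   # 20
```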