Wednesday, August 17, 2011

About Robots (robots.txt and meta robots tag)

Sometimes you may not want some of your web pages to be crawled by bots. In this post I am sharing two methods you can use to keep a page out of Google's database.
1. Create a robots.txt file.

2. Add a robots meta tag to the page.

Robots.txt file:-
Whenever a robot (such as Googlebot) crawls a website, it first looks for a robots.txt file. This file tells the robot not to crawl the URLs listed in it. If the robots.txt file is not present, the robot assumes it can crawl all the pages on the website. Keep in mind that robots.txt controls crawling only; a blocked URL can still show up in search results if other pages link to it. Let's see how to create the file.
We have two options for creating the robots.txt file:
1. Using a third-party application.
2. Manually.

In the first method you can use a robots.txt generator to produce the file. Click Here to see a robots.txt generator.

Manually:-
In this method we will learn how to create the robots.txt file manually. Every robots.txt file must set two parameters:
User-agent: <the bot/agent you want to restrict, e.g. Googlebot, Slurp (Yahoo), msnbot, or "*" (asterisk) to address the bots of all search engines>
Disallow: <URL path or name of the folder you want to disallow>
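If you prefer to generate the file rather than type it, the two parameters above can be written out by a few lines of Python. This is only a minimal sketch: the function name build_robots_txt and the sample rules are made up for illustration and are not part of any tool mentioned in this post.

```python
def build_robots_txt(rules):
    """Build robots.txt text from (user_agent, [disallowed paths]) pairs.

    `rules` is a hypothetical input format chosen for this sketch.
    """
    blocks = []
    for user_agent, paths in rules:
        # Each block starts with the bot name, followed by its Disallow lines.
        lines = [f"User-agent: {user_agent}"]
        lines.extend(f"Disallow: {path}" for path in paths)
        blocks.append("\n".join(lines))
    # Blocks for different bots are separated by a blank line.
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    sample_rules = [
        ("*", ["/images/"]),
        ("Googlebot", ["/project/test.html"]),
    ]
    print(build_robots_txt(sample_rules))
```

Save the printed text as robots.txt and it can be uploaded like a hand-written file.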


Examples:-
To disallow a folder for all search engines:
User-agent: *
Disallow: /images/


To disallow a folder for a particular search engine:

User-agent: Googlebot
Disallow: /images/


To disallow the entire site for all search engines:

User-agent: *
Disallow: /


To disallow a particular file:

User-agent: *
Disallow: /project/test.html


To disallow a particular image:

User-agent: *
Disallow: /image.abc.jpg


Follow the steps:-
1. Create a blank file named robots.txt and open it with Notepad.
2. Edit the file using the instructions above.
3. Upload the file to the root directory of the website.
4. Click Here to check if your robots.txt file is working. This website will show the pages which you have disallowed/allowed.
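You can also test your rules on your own computer before uploading. The sketch below uses Python's standard urllib.robotparser module to report which URLs a given bot may fetch. The helper name check_urls, the sample rules, and the www.example.com addresses are assumptions made up for this illustration, and Python's parser may not match every detail of how a real search engine reads the file.

```python
from urllib.robotparser import RobotFileParser


def check_urls(robots_txt, user_agent, urls):
    """Return {url: True/False} telling whether user_agent may fetch each URL."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no upload is needed.
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch(user_agent, url) for url in urls}


# Sample rules for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /images/
Disallow: /project/test.html
"""

if __name__ == "__main__":
    test_urls = [
        "http://www.example.com/index.html",
        "http://www.example.com/images/logo.png",
        "http://www.example.com/project/test.html",
    ]
    for url, allowed in check_urls(SAMPLE, "Googlebot", test_urls).items():
        print("allowed " if allowed else "blocked ", url)
```

Note that this parser only recognizes the hyphenated spelling User-agent, so a line written as useragent is silently ignored and nothing gets blocked.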


Click Here to see Google's robots.txt file,
or type inurl:robots.txt in the Google search box to see the robots.txt files of different websites.


Robots Meta Tag:-
This is the second method by which you can keep a page away from robots (crawlers). In this method we add an extra meta tag with the name "robots". It is useful when you want to exclude only one or two pages of your website: for those pages you can add the meta tag directly in their head section.
Example:-
<head>
<meta name="robots" content="noindex, nofollow" >
</head>
This tag tells all search engines not to index this page (noindex) and not to follow the links on it (nofollow).
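To confirm that a page really carries the tag, you can read it back with a short script. This is a rough sketch built on Python's standard html.parser module; the class name RobotsMetaParser and the function robots_directives are hypothetical names invented for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        # Directives are comma separated, e.g. "noindex, nofollow".
        content = attributes.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of robots directives declared in the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<head><meta name="robots" content="noindex, nofollow" ></head>'
    print(sorted(robots_directives(page)))
```

If the attribute name is misspelled, the script finds no directives at all, which is a quick way to catch a typo in the tag.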
