I had a frantic client calling to ask how their pages had been indexed by Google when the site itself was not even live. Sure enough, Googlebot had sneaked onto our staging server and indexed all of those pages. We did find a way to get them removed from Google's index, and here is what we did to avoid similar mishaps in the future.
To prevent Google or any other search engine from accessing the entire site, or any particular folder of the site, we need to specify that in a file called robots.txt placed in the top-level directory of the website.
Here are some examples of preventing bots from indexing our pages using the robots.txt file:
Prevent Robots from Indexing the Entire Site:
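A minimal robots.txt for this case looks like the following. Note that it relies on crawlers honouring the Robots Exclusion Protocol; badly behaved bots will simply ignore it:

User-agent: *
Disallow: /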
Allow Robots to Index Everything Except the Specified Pages/Directories:
User-agent: *
Disallow: /Private Folder/
Note that crawlers only read robots.txt from the root of the host, so dropping a copy of it into an individual folder has no effect. To keep a particular folder out of the index, add a Disallow rule for it to the root robots.txt, as in the sketch below.
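As a concrete sketch, assuming the folder we want to hide is called /staging/ (the name is only an illustration), the root robots.txt would contain:

User-agent: *
Disallow: /staging/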
We can also use .htaccess authentication to prevent the site from being indexed by search engines. For that, we create a .htaccess file under our root directory and write the following (the AuthUserFile path points to the file that stores the user credentials):
AuthType Basic
AuthName "FORBIDDEN AREA"
AuthUserFile /var/www/html/yourProjectFolder/credentials.httpd
Require valid-user
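To create that credentials file, assuming the htpasswd utility that ships with Apache is available on the server, something like the following should do (the username "staging" is only an example; the command will prompt for a password):

htpasswd -c /var/www/html/yourProjectFolder/credentials.httpd staging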
Hope this helps others avoid this kind of situation.