GOOGLE ENDS SUPPORT FOR UNSUPPORTED RULES IN ROBOTS.TXT
Google has announced that it will stop supporting undocumented and unsupported rules in robots.txt, as per its official statement. The change takes effect on September 1, 2019. Google's official tweet reads:
“Today we’re saying goodbye to undocumented and unsupported rules in robots.txt
If you were relying on these rules, learn about your options in our blog post.”
Google's announcement states that, in order to maintain a healthy ecosystem and prepare for potential future releases, it is retiring all unsupported and unpublished rules, such as noindex, from that date. Until now, Google has generally obeyed the noindex directive in robots.txt: according to a Stone Temple study it worked in 11 out of 12 cases, but since the results are not 100% reliable, it cannot be depended on.
Robots.txt is a text file used to tell crawlers such as Googlebot which pages may be crawled and which should be left alone. Many organisations use it to keep confidential areas of a site from being scanned by search engines and other web crawlers, and it also helps prevent a site from being overloaded with requests. Blocking a page this way does not hide it from Google, however: the file only stops Googlebot from crawling the content, and the URL itself can still end up in the index.
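For illustration, a minimal robots.txt might look like the following; the blocked path and sitemap URL here are hypothetical:

    # Keep all crawlers out of the admin area (hypothetical path)
    User-agent: *
    Disallow: /admin/

    # Point crawlers at the sitemap
    Sitemap: https://www.example.com/sitemap.xml

The file must live at the root of the site (for example, https://www.example.com/robots.txt), which is the only location crawlers check for it.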
WAYS TO CONTROL CRAWLING
Noindex in robots meta tags – The noindex directive is supported both in HTML, as a robots meta tag, and in HTTP response headers, as X-Robots-Tag. It is the most reliable way to keep a URL out of Google's index when the page can still be crawled; both forms are sketched below.
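HTML form, added inside the page's <head>:

    <meta name="robots" content="noindex">

HTTP header form, sent with the server's response, which is useful for non-HTML resources such as PDFs:

    X-Robots-Tag: noindex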
HTTP 404 and 410 status codes – Both codes tell crawlers that a page no longer exists, so URLs that return them are dropped from Google's index once they have been recrawled; a server-level sketch follows below.
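One way to return a 410 for a retired page is a server rule. A minimal sketch using Apache's mod_alias in an .htaccess file, with a hypothetical path:

    # Respond with HTTP 410 Gone for a page that has been removed
    Redirect gone /old-promo-page.html

Other servers have equivalents; the point is simply that the URL answers with a 404 or 410 rather than a 200.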
Password protection – Placing a page behind a login keeps it out of Google's index, since crawlers cannot authenticate to read the content; a minimal sketch follows below.
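As an example, HTTP Basic authentication can be enabled with an Apache .htaccess file; the AuthUserFile path here is hypothetical:

    # Require a valid login before serving anything in this directory
    AuthType Basic
    AuthName "Restricted area"
    AuthUserFile /var/www/private/.htpasswd
    Require valid-user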
Usage of Disallow – Search engines can only index pages they are able to read, so disallowing a page in robots.txt normally keeps its content out of the index. However, a search engine can still index a disallowed URL, without reading its content, based on links from other pages; the aim here is to reduce the page's visibility rather than to guarantee its removal. A sketch follows below.
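A sketch of a single-page Disallow rule, with a hypothetical path:

    User-agent: *
    # The content will not be crawled, but the bare URL can still be
    # indexed if other sites link to it
    Disallow: /internal-dashboard.html

One caveat worth noting: Disallow and a noindex meta tag should not be combined on the same page, because a page that is blocked from crawling is never fetched, so Google cannot see its noindex tag.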