Over the past couple of days, I thought of a new way to discover which online search engines are indexing pages I have excluded them from visiting using standard meta tags and/or a robots.txt file. These methods instruct a search engine's crawler not to list the specified pages in its results for others to see.
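For anyone unfamiliar with these exclusion methods, here is what the two standard forms look like. This is just the conventional robots.txt syntax and robots meta tag; the `/private/` path is a made-up example.

```
# robots.txt at the site root: ask all crawlers to stay out of /private/
User-agent: *
Disallow: /private/
```

```
<!-- Or, in the <head> of an individual page: ask engines not to index it -->
<meta name="robots" content="noindex">
```

Note the two do slightly different things: the robots.txt rule asks crawlers not to fetch the page at all, while the meta tag allows the fetch but asks that the page not be indexed.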
Well, several search engines, even some of the most popular, do not seem to follow these rules as faithfully as one might think. Quite some time ago, I wrote HoneyPot scripts to detect whether search engine bots were visiting the pages I had designated no-index. But that approach blocked many bots, including some of the top search engines in the world, simply because they visited the excluded pages. And just because an engine's bot visited a page does not mean the page was indexed or saved.
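The honeypot approach boils down to recording whoever requests an excluded page. A minimal sketch of that logging step might look like this; the function and field names are my own invention, and a real deployment would hook this into the web server rather than call it directly.

```python
import datetime

def log_excluded_hit(user_agent, remote_addr, log):
    """Record one visit to a no-index page for later review.

    In a real honeypot this would be called from the request handler
    serving the excluded URL; here `log` is just an in-memory list.
    """
    entry = {
        "time": datetime.datetime.utcnow().isoformat(),
        "user_agent": user_agent,
        "remote_addr": remote_addr,
    }
    log.append(entry)
    return entry

# Example: a (fictional) crawler fetches an excluded page.
hits = []
log_excluded_hit("ExampleBot/1.0 (+http://example.com/bot)", "203.0.113.5", hits)
```

The weakness described above is visible here: the log proves a fetch happened, but says nothing about whether the fetched page was ever indexed.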
Which brings me to my newer idea. I created specific excluded pages containing unique short alphanumeric (numbers and letters) codes. If a search engine honors the exclusion and does not index the pages, a search for that unique code should return no results. If results do appear, the search engine did not abide by the exclusion and should be watched closely.
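Generating the unique codes is the easy part. A sketch of one way to do it, using Python's standard `secrets` module so the markers are hard to guess; the length and alphabet are arbitrary choices of mine.

```python
import secrets
import string

# Lowercase letters plus digits: 36 symbols, so a 12-character
# marker has 36**12 possibilities -- effectively unique on the web.
ALPHABET = string.ascii_lowercase + string.digits

def make_marker(length=12):
    """Generate a random alphanumeric code to embed in an excluded page."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

marker = make_marker()
```

One marker per excluded page (recorded alongside the page's URL) would also tell you *which* excluded page a misbehaving engine picked up.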
I'm sure this is not a new tactic for catching bots that index pages excluded by various means, but it should prove an interesting test to see how it pans out. I will post results after some time has passed, giving various search engines a chance to crawl the pages and index them if they will. The process could be made partially autonomous by writing a script that periodically checks the major search engines for results containing the unique code.
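The core check in such a script could be as small as the sketch below. How you fetch a results page varies by engine (and their terms of service), so I leave fetching out and only show the detection step; the function name is hypothetical.

```python
def marker_leaked(results_html, marker):
    """Return True if the unique marker appears in a search results page,
    suggesting the excluded page was indexed after all.

    Caveat: many engines echo the query string back in the results page,
    so a production check should look at actual result entries rather
    than a bare substring match.
    """
    return marker.lower() in results_html.lower()

# Toy examples standing in for fetched results pages.
leaked = marker_leaked("<li><a href='/private/x'>...x7k2p9qa4m3z...</a></li>",
                       "x7k2p9qa4m3z")
clean = marker_leaked("<p>Your search did not match any documents.</p>",
                      "x7k2p9qa4m3z")
```

Run on a schedule (cron, for instance), this would flag a non-compliant engine soon after it indexes a marker, rather than waiting for a manual search.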