Quote:
Atomicorp Candidate #21
regarding #21:
might it be useful to write a wrapper that does DNS lookups for the characteristic googlebots?
something that has the logic in it to verify real bots by the user-agent and only whitelist it if it matches.
http://www.google.com/support/webmaster ... swer=80553
http://www.google.com/support/webmaster ... er=1061943

Yeah, I'm still trying to figure out a way to do this that doesn't kill the user's experience with the server. DNS lookups on source IPs *before* serving up content will make the server appear really slow to the user (the server isn't actually slow, it's just waiting for the lookup to complete).
So we don't want that. Any thoughts on how you would like to see candidate #21 work? Here's my current working idea:
1) If the UA claims it's googlebot, then don't serve up content until we finish the steps below:
2) Do a reverse lookup on the IP. If the name says it's Google, then do a forward lookup on that name and make sure the FQDN resolves back to the same IP.
If it really is a Google bot:
auto-whitelist the IP (don't shun it)
If it is NOT really a Google bot:
block and shun the host
Optional: If it is a search engine, don't block anything it does either (let google/bing/etc. do whatever it wants to the system)
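To make the steps above concrete, here is a rough Python sketch of the check, not how ASL does or will do it. The function names and the returned labels are made up for illustration, and the suffix list assumes genuine crawler hostnames end in googlebot.com or google.com. The resolvers are passed in as parameters so the logic can be tried without touching real DNS.

```python
import socket

# Assumption: real Googlebot PTR names end in one of these suffixes.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_real_googlebot(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """True only if ip reverse-resolves to a Google name that resolves back to ip."""
    try:
        hostname = reverse(ip)[0]          # PTR lookup: IP -> name
    except OSError:                        # herror/gaierror are OSError subclasses
        return False
    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False
    try:
        addresses = forward(hostname)[2]   # forward lookup: name -> IPv4 list
    except OSError:
        return False
    return ip in addresses                 # the name must map back to the same IP

def classify(user_agent, ip, verify=is_real_googlebot):
    """Hypothetical decision step: only clients claiming to be googlebot are checked."""
    if "googlebot" not in user_agent.lower():
        return "normal"
    return "whitelist" if verify(ip) else "block_and_shun"
```

Note that socket.gethostbyname_ex only returns IPv4 addresses, so an IPv6 crawler would need getaddrinfo instead.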
That may work, and if you slow down Google, that's probably not so bad. We could also cache the IP for a period of time and only look it back up on an interval, so this doesn't happen every time the IP connects. I don't think the optional step (don't block it either) should be the default: not shunning a search engine is a good idea, but not blocking it may not be.
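The caching idea could look something like the sketch below. This is only an illustration: VerdictCache, cached_check, and the one-day default TTL are hypothetical, and the verify argument stands for whatever DNS check ends up being used.

```python
import time

class VerdictCache:
    """Remembers per-IP verification results for a limited time."""

    def __init__(self, ttl_seconds=86400, clock=time.monotonic):
        self.ttl = ttl_seconds      # assumed default: re-verify once a day
        self.clock = clock          # injectable so expiry can be tested
        self._entries = {}          # ip -> (verdict, expiry time)

    def get(self, ip):
        entry = self._entries.get(ip)
        if entry is None:
            return None
        verdict, expires = entry
        if self.clock() >= expires:
            del self._entries[ip]   # stale entry: force a fresh lookup
            return None
        return verdict

    def put(self, ip, verdict):
        self._entries[ip] = (verdict, self.clock() + self.ttl)

def cached_check(ip, cache, verify):
    """Run the slow DNS verification only when there is no fresh cached verdict."""
    verdict = cache.get(ip)
    if verdict is None:
        verdict = verify(ip)
        cache.put(ip, verdict)
    return verdict
```

With this in place only the first request from an IP in each interval pays for the lookups; everything after that is a dictionary hit.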
Let me explain, and please lend me your thoughts:
We have rules that try to trick search engines when they are being used to do bad things. For example, if a bad guy is carrying out a "google hack" to try to find vulnerable applications, we return a 404 for those searches, so the bad guy doesn't think you are vulnerable (and moves on). And that "bad guy" might be a fully automated worm, so you definitely want them to get a 404 and leave.
If you have shunning set up, ASL will also shun the search engine. So maybe we make that class of rules not shun by default, but still "block". The same is true for DLP rules: we don't want Google caching sensitive data from your server, but we don't want to shun it either.
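As a sketch of that "block but don't shun" default, something like the following could express the policy. The function and the never_block_search_engines switch are hypothetical names, not existing ASL settings; the switch models the optional "don't block either" behavior and is off by default.

```python
def action_for(rule_matched, verified_search_engine,
               never_block_search_engines=False):
    """Return the block/shun decision for one request (illustrative only)."""
    if not rule_matched:
        return {"block": False, "shun": False}
    if verified_search_engine:
        # Verified crawlers are never shunned; blocking stays on unless
        # the optional "let it do anything" switch is enabled.
        return {"block": not never_block_search_engines, "shun": False}
    # Everyone else gets the normal treatment when shunning is set up.
    return {"block": True, "shun": True}
```

So a verified googlebot tripping a "google hack" or DLP rule would still get its 404, but its IP would stay off the shun list.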
Are there any rules you think a search engine should never be blocked on? (Not shunned; we can handle that differently.) My gut says no for the security rules, although maybe for the spam rules.
Thoughts appreciated.