We've got Semrush blocked with a 503 right at the top of config.php ... a dumb place, in theory, but we don't even consider allowing them access any longer, so it's about the first thing that happens. I can't remember how we decided, but that's what it is.
We have a "bad_bots.php" that's included a little further down in most scripts/pages. Among the highlights:
$banned_uas = array(
"compatible; synapse",
"seokicks",
"ahrefs",
"linkdex",
"hubspot",
'phantomjs'
);
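The loop that actually tests this array isn't quoted here, so the following is only a minimal sketch of how such a check might look. The helper name is_banned_ua() and the plain case-insensitive substring match are assumptions, not necessarily what our bad_bots.php does.

```php
<?php
// Hypothetical helper (not from the original file): true if the User-Agent
// contains any of the banned fragments, ignoring case.
function is_banned_ua($user_agent, array $banned_uas) {
    foreach ($banned_uas as $fragment) {
        // stripos() is case-insensitive, so 'ahrefs' also matches "AhrefsBot"
        if (stripos($user_agent, $fragment) !== false) {
            return true;
        }
    }
    return false;
}

$banned_uas = array("compatible; synapse", "seokicks", "ahrefs", "linkdex", "hubspot", "phantomjs");

// A request with no User-Agent header is treated as an empty string here.
$user_agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (is_banned_ua($user_agent, $banned_uas)) {
    header("HTTP/1.0 403 Forbidden");
    die();
}
```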
Anything matching one of those gets a 403. I seem to remember mentioning PhantomJS earlier ... it may also be blocked in Apache config/.htaccess, so possibly neither it nor Hubspot is needed in this array. That might also be true for this next one:
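// a request with no User-Agent header is treated as an empty string
$user_agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
// default User-Agent of older Go net/http clients: answer with an empty 204
if (stristr($user_agent, 'Go 1.1 package http')) {
    header("HTTP/1.0 204 No Content");
    die();
}
// one specific, outdated Chrome build: answer with a 503
if (stristr($user_agent, "Chrome/60.0.3112.113")) {
    header("HTTP/1.0 503 Service unavailable");
    die();
}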
The second one's interesting, maybe. We were seeing lots of bogus simultaneous requests from distributed IPs with this UA string. Given that Chrome forces updates in most environments, and that at the time that rule was written very few stats labs reported anyone still on Chrome 60, we put that in place.
Finally:
// we have three blocks like this, wrapped in tests of sys_getloadavg().
// this list is for the lightest loads; if the load avg. is higher these numbers
// are *lower* ($cos == "chance of success")
$cos_bing =
$cos_slurp =
$cos_sougou =
$cos_yandex = 99;
$cos_dotbot = 19;
$cos_unknown = 8;
$cos_MJ12 = 70;
$limits_array = array(
'bingbot' => 'bing',
'slurp' => 'slurp',
'sogou' => 'sougou',
'sougou' => 'sougou',
'yandex' => 'yandex',
'alexabot' => 'bing',
'MJ12' => 'MJ12',
'dotbot' => 'dotbot',
'mail.ru_bot' => 'dotbot',
'netseer' => 'unknown',
'xovi' => 'unknown',
// group names must match a $cos_* variable above, and that one is spelled "sougou"
'easou' => 'sougou',
'crawl' => 'sougou',
'spider' => 'sougou',
'iOpus' => 'dotbot',
'seznambot' => 'sougou'
);
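// a request with no User-Agent header is treated as an empty string
$user_agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
foreach ($limits_array as $ua => $chance) {
    if (stripos($user_agent, $ua) === false) {
        continue;
    }
    // a group name with no matching $cos_* variable is skipped rather than compared against null
    if (!isset(${"cos_" . $chance})) {
        continue;
    }
    $cos_var = ${"cos_" . $chance};
    $brandom = rand(0, 100);
    // the higher the group's chance of success, the less often the roll exceeds it
    if ($brandom > $cos_var) {
        header("HTTP/1.0 503 Service unavailable");
        die('Server too busy. Please try again later.');
    }
}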
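For anyone wondering how the load tiers hang together, here's a rough sketch of one way to pick the numbers from sys_getloadavg(). The function name cos_for_load(), the thresholds (2.0 and 5.0) and every number in the two busier tiers are made up for illustration; only the lightest-load row matches what's quoted above.

```php
<?php
// Hypothetical tier table. Only the first row comes from the post; the load
// thresholds and the numbers in the two busier tiers are invented.
function cos_for_load($load) {
    if ($load < 2.0) {
        // lightest load
        return array('bing' => 99, 'slurp' => 99, 'sougou' => 99, 'yandex' => 99,
                     'MJ12' => 70, 'dotbot' => 19, 'unknown' => 8);
    }
    if ($load < 5.0) {
        // moderate load: everything gets a lower chance of success
        return array('bing' => 90, 'slurp' => 90, 'sougou' => 90, 'yandex' => 90,
                     'MJ12' => 40, 'dotbot' => 10, 'unknown' => 4);
    }
    // heavy load
    return array('bing' => 60, 'slurp' => 60, 'sougou' => 60, 'yandex' => 60,
                 'MJ12' => 10, 'dotbot' => 2, 'unknown' => 1);
}

// sys_getloadavg() returns the 1-, 5- and 15-minute load averages. It is not
// available on every platform, so fall back to "no load" when it is missing.
$samples = function_exists('sys_getloadavg') ? sys_getloadavg() : false;
$load = is_array($samples) ? $samples[0] : 0.0;

// $cos['bing'], $cos['dotbot'] etc. now hold the numbers for the current load.
$cos = cos_for_load($load);
```

Keeping the numbers in an array keyed by group name, rather than in separate $cos_* variables, means a misspelled group shows up as a missing key that isset() can catch.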
Dunno that the PHB would really like us ever serving a 503 to Bing/Yahoo, but I don't recall asking; if the load's high enough, there's a chance they'll see one (and about a one percent chance that they will even if load is light ... that might bear reconsideration). If Google sees any 5XX it's most likely an actual server problem (and they do seem to think they see some ... I think it's bad chars in the auto-generated page URIs we feed them). I need to add that to the bug DB, actually ....
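On the one-percent question, a quick sketch of the arithmetic, assuming the roll is rand(0, 100) compared with ">" as in the loop. The function name throttled_outcomes() is made up for this example.

```php
<?php
// Hypothetical helper: of the 101 values that rand(0, 100) can return,
// count how many are greater than $chance (i.e. would produce a 503).
function throttled_outcomes($chance) {
    $count = 0;
    for ($roll = 0; $roll <= 100; $roll++) {
        if ($roll > $chance) {
            $count++;
        }
    }
    return $count;
}

$light_load_bing = throttled_outcomes(99);   // only a roll of 100 exceeds 99
$never_throttled = throttled_outcomes(100);  // no roll exceeds 100
$unknown_bots    = throttled_outcomes(8);    // rolls 9 through 100
```

So a setting of 99 means 1 roll in 101 ends in a 503, and bumping a group to 100 would take it out of the lottery entirely.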