I want to do a partial match. What's best to do it with: PHP or MySQL?

schwim

Hi there everyone,

I'm writing a bot blocking script and I'm to the point where I'm writing the admin interface where they can enter data that the script will check for partial matches on the HTTP info, and if it finds a match, it stops page load.

For instance, if the HTTP_REFERER has "viagra" somewhere in it, or the REMOTE_ADDR contains "208.109.390.", then it will stop the loading process.

I know I could do this with LIKE '%viagra%', but is it the best way to do it? I know REGEX is the darling of many, but I don't know if it's better to use for something like this.

Any thoughts and suggestions would be greatly appreciated.

thanks,
json

MarkR

A bot-blocking script? You clearly have to use PHP code to block a bot - MySQL has no way of knowing if a bot is accessing it.

Remember that you can only block polite, nice bots that you probably don't care about that much anyway. The really unpleasant ones will get through anyway.

Consider using an alternative approach. I don't think blocking bad bots by IP address, referrer or anything else is likely to work.

Mark

schwim

Hi there Mark,

My http ref logs with over 1,000 fake referers in 24 hours with links to viagra sites would beg to differ 🙂

This portion of the script simply combats referer spam. It's a great way to block it, as they're trying to get link seniority for search matches. To do that, they have to use the terms in the links.

My question remains. I am asking in regards to the best way to match my phrase in the db with the bot's data.

Their HTTP_REFERER is: http:// www . onlinepharm . com/ viagra-tramadol-online-at-the-best-prices.html

one of my entries to block by is: viagra

I want the script to rotate through the list of entries to block by (viagra, tramadol, cialis, pharmacy) and determine if we have a match.

What is the best way for me to do this?

thanks,
json

MarkR

Even if you block a request by sending back (e.g.) a 403 response, it will still be logged in your access logs.

Your best bet is to configure your log analysis software to ignore such bogus referrers.

Alternatively, it IS possible to configure Apache to selectively log requests, but it's a bad idea to do so because it's so easy to get it wrong and have nothing logged.

They're not going to get any benefit from referrer spamming you unless you publish your referrer logs, which presumably you won't, or at least you'll publish them somewhere not indexed and/or restricted, and with spam filtered out.

Mark

schwim

Hi there Mark,

Wow, I hope you'll stop trying to talk me out of this and help me 🙂

1) The server loads a lot of images and data for these bots. Loading a small text explanation instead, and then appending their IP to the .htaccess file, will save incredible amounts of bandwidth.

2) The bots aren't checking to see if I'm sharing referers publicly. I've never done this, and in spite of this, my referer logs are full. They're not taking the time to check for this before spamming. That would take too much time, and they're not into spending a lot of time.

3) Adding the IP to the .htaccess file once a match is determined will stop them from showing up in my logs, as I'm not using the Apache logs, but an integrated system through PHP.

Can anyone please tell me the best way via php or mysql to determine a match anywhere in an entry, like in the examples above?

thanks,
json

MarkR

I recommend you do it in PHP, with a file which is require()'d as soon as possible in your page before most of the rest of the code is loaded.

Have an array (or something) of regular expressions that you want to block. If you see such a pattern in the referrer, send back a content-type text/plain 403 error.

Presumably the problem stems from the fact that:
- The bots are actively spidering your site
- Your server is doing work to provide HTML for them
- Sending back no meaningful response to the bots will stop them being able to find additional pages, thus will block further requests.

You must realise that a bot which just tries to hit your site regardless of the response, will still be logged regardless of what you do in PHP.

You could possibly set some Apache variable from PHP which could tell it not to log it- using custom logging - but I'm not sure.

Bots can get a list of URLs on your site which are public anyway:
- They could run their own spider with a "clean" referrer which determines URLs of pages to hit with the dirty one
- They could consult Google to find valid pages to referrer-spam

I've done bot-blocking before now (by user agent) - the trick is:
- Block the bot as early as possible - before you connect to any databases or open too many files.
- Send back a meaningful but terse response - ensure the status is something like 400 or 403 - send back a text/plain human readable response with a brief description of the error (but don't be TOO specific)
- Log the error if you can - but concisely - so you can monitor the blocking.
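
A minimal sketch of the terse response described in the list above, assuming PHP 7.1 or later. The function names, the log wording and the contact address are made up for illustration; the response is built separately from being sent so the pieces can be checked on their own.

```php
<?php
// Hypothetical helper: describe the block response without sending it.
function build_block_response(string $contact = 'webmaster@example.com'): array
{
    return [
        'status'  => 403,
        'headers' => ['Content-Type: text/plain; charset=utf-8'],
        // Brief and human readable, but it does not say which rule matched.
        'body'    => "403 Forbidden\nYour request was refused. "
                   . "If you think this is a mistake, contact {$contact}.\n",
    ];
}

// Hypothetical helper: log one concise line, send the response, and stop.
function send_block_response(array $response, string $reason): void
{
    error_log('botblocker: refused request (' . $reason . ')');
    http_response_code($response['status']);
    foreach ($response['headers'] as $line) {
        header($line);
    }
    echo $response['body'];
    exit;
}
```

Calling send_block_response(build_block_response(), 'referer') before any database or session code keeps the refused request cheap.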

Mark

MarkR

One other tip is to do as FEW regular expression matches as possible. You can for instance, make regular expressions like:

/viagra|cialis|pharmacy/i

Which is very efficient compared to doing the matches individually.
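
As a rough illustration of the single-pattern approach, using the three terms above and a made-up function name:

```php
<?php
// One alternation instead of one preg_match() call per term.
// The trailing "i" makes the match case-insensitive.
const BLOCKED_TERMS = '/viagra|cialis|pharmacy/i';

// Hypothetical helper: true if any blocked term occurs anywhere in $referer.
function referer_is_blocked(string $referer): bool
{
    // preg_match() takes the pattern first and the string to search second.
    return preg_match(BLOCKED_TERMS, $referer) === 1;
}

var_dump(referer_is_blocked('http://example.com/cheap-VIAGRA-online.html')); // bool(true)
var_dump(referer_is_blocked('http://example.com/news.html'));                // bool(false)
```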

Mark

schwim

Hi Mark,

Thanks very much for that:

Are you saying that I can insert all of the rows (viagra,cialis,etc) into a single regex expression, saving resources?

So if I'm matching the referer, I would do it like

if(preg_match("/viagra|cialis|pharmacy/i")){
    echo "Matched";
  }

This is my first dealings with regex in php, so I might have messed that up, but from the PHP site, it looks like that's how I would implement it.

thanks,
json

MarkR

You'd need to put the referrer in as a parameter to preg_match. Really.

If you can do it with a single preg_match, that will be very efficient.

You should require() the file before you require() anything else which does anything resource intensive (e.g. connect to db). So your pages might look like:

require('botblocker.php'); // Block bots
require('common.php'); // Setup session, database etc.
// ... rest of page

botblocker.php should call exit if it detects that the request is a bad bot (after setting the status to 400 and printing an appropriate message, of course)

Calling exit means that the session, database setup code etc won't ever be run, which means those requests are very cheap.
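
One way botblocker.php might look, as a sketch only. The pattern is hard-coded, should_block() is a made-up name, and the sketch sends 403 where the text above mentions 400. The decision lives in a function that takes a $_SERVER-style array, so it can be tried out without a web server.

```php
<?php
// botblocker.php (sketch). Include this before anything expensive.

// Hypothetical helper: decide from a $_SERVER-style array whether to refuse.
function should_block(array $server): bool
{
    // The Referer header is optional, so fall back to an empty string.
    $referer = $server['HTTP_REFERER'] ?? '';
    return $referer !== ''
        && preg_match('/viagra|cialis|pharmacy/i', $referer) === 1;
}

// Only act when running under a web server, not from the command line.
if (PHP_SAPI !== 'cli' && should_block($_SERVER)) {
    http_response_code(403);
    header('Content-Type: text/plain; charset=utf-8');
    echo "Forbidden\n";
    exit; // Nothing after this runs: no session, no database connection.
}
```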

Mark

schwim

Hi Mark,

I've got the page set as the first thing to be run, and currently, it's working fine, in that it doesn't do anything other than run this script if the visitor gets flagged. I still have to add the 400, as I've not done that part yet.

if(preg_match("$HTTP_REFERER","/viagra|cialis|pharmacy/i")){
    echo "Matched";
  }

Would that do it?

thanks,
json

schwim

I've got it checking properly (I think):

/* SECOND CHECK START - Referer String matches */

$query = "SELECT data FROM zap_blacklist_referer";
$result = mysql_query( $query);
while( $row=mysql_fetch_assoc($result) )
{
	$referer_data[] = $row['data'];
}

mysql_free_result( $result );

$referer_array = implode("|", $referer_data);

if (preg_match("/".$referer_array."/i","$HTTP_REFERER",$matched)){
	sleep(12);
	echo("<html>
	<head>
	<title>Dude, you sank my battleship!</title>
	</head>
	<body>
	<center><h1>Maybe you should stop visiting those types of sites...</h1></center>
	<p><center>The page load was halted because of your referer string.<br>
	This is the phrase that resulted in the match: ".$matched[0]."</center></p>
	<p><center>Have we screwed up?  Simply <a href='mailto:mail@domain.com'>contact us</a>.<br>
	We will fix it as quickly as possible, and we are very sorry for the inconvenience.</center></p>
	</body>
	</html>");
	exit;
}

/* SECOND CHECK END - Referer String matches */

thanks,
json

MarkR

There are lots of things wrong with this:

1) Don't read the list of banned referrers from a database - that makes maintenance more complicated and uses more resources. Hard code it (put it in a separate file if you like) so you can do this check cheaply, without connecting to or querying your db.
2) The referrer is NOT determined in PHP from $HTTP_REFERER (use $_SERVER['HTTP_REFERER'] instead).
3) Do NOT put sleep(12) in there - this just ties up precious resources on your server keeping the page alive. The spammer bot really doesn't care.
4) Don't put such a silly message - remember that the message MIGHT be innocently sent to legitimate users (say a viagra-related site creates a genuine link to yours).
5) Don't put so much information in there - don't tell them what matched. Do give an email address.
6) You don't need to send an HTML response; prefer plain text, it's shorter. Remember to set the content type if you do this (this applies even to unsuccessful requests, e.g. a 403).
7) You still haven't set the status to 403 - Do that DEFINITELY.
8) Rather than looping through many regexps, simply make a single regexp which matches them all (as I previously suggested).

Doing these things will produce not only a more polite, more technically correct response, but also vastly reduce the resource usage to do so.

Remember that during your sleep(12), the page is probably holding on to a mysql connection. Ideally it should not need one at all.

Mark

MarkR

One more thing: remember that these checks are likely to run on EVERY hit by a legitimate user, so they need to create as little performance impact as possible.

Doing the checks with a hard-coded regexp (ideally just one, rather than a loop) will reduce resource usage to a minimum.

Mark

schwim

Hi there Mark,

thank you very much for your reply.

1) it's called from the db because it's a script that will be released and the point of the script is to centrally control administration of this and make it easy to control. If I ask them to hard code it in, I might as well ask them to write the script themselves.

2) My referer is determined by $HTTP_REFERER, since I define it at the top of the page by $HTTP_REFERER = $_SERVER['HTTP_REFERER'];

3) I understand. This was a holdover from a previous author's script, and I thought maybe that they did this to prevent flooding.

4) I will be very sure to have a less silly message here.

5) I will drop the match

6) Would setting it to plain text really save that much in the way of resources?

7) I most definitely will set it to forbidden.

8) I thought I was using only one regex. I create an array and the array contains every blocked phrase in the db. Where is the loop? Do you mean the loop that grabs all of the phrases from the db?

Addendum: I guess I could create a flat file that got written to for the regex check, if you think it would make that large of a difference.

Thanks very much for taking the time to let me know how I can fix the problems and make the script better.

thanks,
json

Piranha

Just an idea regarding database vs. flat file. If I understand it correctly, you need the script to do two things: be easy to administer and use few resources.

Well, it is easier to administer if you store the strings in the database, and it takes fewer resources to use a flat file. Why not use both?

Use a database to administer it, and generate a flat file from the database. You could then either have a button to generate the flat file, or do it on a schedule (every minute, every hour, at midnight on the first Friday of the month, or whatever suits). Then you get the benefit of both approaches.

The only problem I can see is how to handle requests while the file is being generated, but I'm sure that can be solved without any big problem.
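
A sketch of how the generated file could be written so that a half-written version is never read: write to a temporary file in the same directory, then rename() it over the old one (on POSIX filesystems that swap is atomic). The function name and the file layout are assumptions, not part of the script discussed here, and the phrases would really come from the admin table.

```php
<?php
// Hypothetical generator: turn the admin phrases into one pattern and
// store it as a small PHP file that the checker can simply include.
function write_pattern_file(string $path, array $phrases): bool
{
    // Drop empty entries; an empty alternative would match every referer.
    $phrases = array_filter($phrases, function ($p) { return $p !== ''; });
    if (!$phrases) {
        return false;
    }
    // Escape regex metacharacters so phrases are matched literally.
    $quoted  = array_map(function ($p) { return preg_quote($p, '/'); }, $phrases);
    $pattern = '/' . implode('|', $quoted) . '/i';
    $code    = '<?php return ' . var_export($pattern, true) . ";\n";

    // Write next to the target, then swap it in with a single rename().
    $tmp = tempnam(dirname($path), 'pat');
    if ($tmp === false || file_put_contents($tmp, $code) === false) {
        return false;
    }
    return rename($tmp, $path);
}

// Usage with made-up phrases and a temporary location.
$file = sys_get_temp_dir() . '/blocked_pattern_' . getmypid() . '.php';
write_pattern_file($file, ['viagra', 'tramadol', 'cialis']);
$pattern = include $file;
echo $pattern, "\n"; // /viagra|tramadol|cialis/i
```

The blocking script then only needs the include and one preg_match(), with no database connection for the check itself.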

halojoy

MarkR wrote:
You could possibly set some Apache variable from PHP which could tell it not to log it - using custom logging - but I'm not sure.

I know it may not solve the issue of blocking bots, but I'd like to share how I use an Apache env variable in my custom logging.

I do not want to log my own local PHP work into the same logfile as all the other public requests, because the file would quickly get very big and become difficult to search when I need to. So I use separate logfiles. It works great.

Here is an excerpt from my Apache httpd.conf:

<IfModule log_config_module>

LogFormat "%{%y%m%d_%H:%M.%S}t=%a %{Referer}i -> %U" referer
LogFormat "%{%y%m%d_%H:%M.%S}t=%a %>s_%r %B %{Referer}i" myformat

SetEnvIf Remote_Addr 127.0.0.1 LOCAL
SetEnvIf Remote_Addr 192.168.0. LOCAL

CustomLog logs/referer.log referer env=!LOCAL
CustomLog logs/access.log myformat env=!LOCAL
CustomLog logs/local.log myformat env=LOCAL

</IfModule>

More on the SetEnvIf directive, from the documentation for Apache's mod_setenvif module:

Apache Module mod_setenvif

Description: Allows the setting of environment variables based on characteristics of the request

======================

SetEnvIf Directive
Description: Sets environment variables based on attributes of the request
Syntax: SetEnvIf attribute regex [!]env-variable[=value] [[!]env-variable[=value]] ...
Context: server config, virtual host, directory, .htaccess
Override: FileInfo
Status: Base

Module: mod_setenvif

The SetEnvIf directive defines environment variables based on attributes of the request.

The attribute specified in the first argument can be one of three things:

1) An HTTP request header field (see RFC2616 for more information about these); for example: Host, User-Agent, Referer, and Accept-Language. A regular expression may be used to specify a set of request headers.

2) One of the following aspects of the request:
- Remote_Host - the hostname (if available) of the client making the request
- Remote_Addr - the IP address of the client making the request
- Server_Addr - the IP address of the server on which the request was received (only with versions later than 2.0.43)
- Request_Method - the name of the method being used (GET, POST, et cetera)
- Request_Protocol - the name and version of the protocol with which the request was made (e.g., "HTTP/0.9", "HTTP/1.1", etc.)
- Request_URI - the resource requested on the HTTP request line -- generally the portion of the URL following the scheme and host portion without the query string

3) The name of an environment variable in the list of those associated with the request. This allows SetEnvIf directives to test against the result of prior matches. Only those environment variables defined by earlier SetEnvIf[NoCase] directives are available for testing in this manner. 'Earlier' means that they were defined at a broader scope (such as server-wide) or previously in the current directive's scope. Environment variables will be considered only if there was no match among request characteristics and a regular expression was not used for the attribute.

halojoy

Hi again,

There is a PHP function, apache_getenv, that might be able to check such a SetEnvIf variable, as I mentioned in my previous post.

There is also a corresponding function to set an env variable: apache_setenv

apache_getenv

apache_getenv — Get an Apache subprocess_env variable
Description
string apache_getenv ( string $variable [, bool $walk_to_top] )

Get an Apache environment variable as specified by variable.

This function requires Apache 2 otherwise it's undefined.

apache_setenv

apache_setenv — Set an Apache subprocess_env variable
Description
bool apache_setenv ( string $variable, string $value [, bool $walk_to_top] )

apache_setenv() sets the value of the Apache environment variable specified by variable.
Note: When setting an Apache environment variable, the corresponding $_SERVER variable is not changed. 
Parameters

variable
The environment variable that's being set. 
value
The new variable value.
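
Putting the two posts together, something like the following might work, though it is an untested sketch: PHP sets an Apache variable when it refuses a request, and a CustomLog line with env=! skips requests that carry it. The variable name SPAMBOT and the function name are made up, and apache_setenv() only exists when PHP runs as an Apache module, hence the function_exists() check.

```php
<?php
// Matching httpd.conf line (assumption, adjust to the real log setup):
//   CustomLog logs/access.log myformat env=!SPAMBOT

// Hypothetical helper: flag the current request so Apache can skip logging it.
// Returns false when apache_setenv() is unavailable (CLI, FastCGI, etc.).
function flag_request_for_apache(string $name = 'SPAMBOT'): bool
{
    if (!function_exists('apache_setenv')) {
        return false;
    }
    return apache_setenv($name, '1');
}
```

The call would go just before the blocking script sends its 403 and exits.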

MarkR

schwim wrote:
Hi there Mark,
6) Would setting it to plain text really save that much in the way of resources?

Probably not, but it would make the script cleaner.

7) I most definitely will set it to forbidden.

Good.

8) I thought I was using only one regex. I create an array and the array contains every blocked phrase in the db. Where is the loop? Do you mean the loop that grabs all of the phrases from the db?

So I see, you are.

Remember that a regex can contain special characters other than |. You could conceivably build a regex in that fashion, but you might want to do other things inside the regex.

I really recommend that you don't build that regex at all at runtime, just create the regex by hand (and test it - have a test harness).
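
To illustrate the point about special characters: if the pattern is assembled from stored phrases after all, each phrase can be passed through preg_quote() first. This is only a sketch; build_pattern() is a made-up name and the IP prefix is an arbitrary example.

```php
<?php
// Hypothetical helper: join phrases into one alternation, escaping
// characters such as "." that would otherwise act as regex operators.
function build_pattern(array $phrases): string
{
    $quoted = array_map(function ($p) { return preg_quote($p, '/'); }, $phrases);
    return '/' . implode('|', $quoted) . '/i';
}

$raw    = '/' . implode('|', ['10.1.']) . '/i'; // "." matches any character
$quoted = build_pattern(['10.1.']);               // "." matches only a dot

var_dump(preg_match($raw, '10312.example') === 1);    // bool(true): false positive
var_dump(preg_match($quoted, '10312.example') === 1); // bool(false)
var_dump(preg_match($quoted, '10.1.2.3') === 1);      // bool(true)
```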

Addendum: I guess I could create a flat file that got written to for the regex check, if you think it would make that large of a difference.

If it's connecting to the db vs not connecting, not connecting is definitely better.

I don't quite understand the central administration issue, as surely you maintain the .php scripts, and you would be maintaining the contents of that table, so it's no different.

I imagine this stuff will change infrequently so it can be managed by your normal software release process (depending on how stringent your release process is).

Mark

schwim

Hi there guys, and thanks to everyone for your suggestions and help.

Well, I've got quite an enigma wrapped in a conundrum if I want to get rid of the db connection during monitoring because I already connect to pull the tracking data.

The server data (referer, page request, remote addr) gets placed into the database. Then, with the same connection, I get the block data.

So if I get rid of the connection, I also need to change the way that the script tracks visitors. This means I pretty much need to alter the whole front end of the script 🙁 I'm not savvy enough with flat files to do everything I need it to do, but I'll look into it.

thanks again for all your help.

thanks,
json

MarkR

You can still connect to the database, provided you do so AFTER you've decided that the request is acceptable and you're not going to block it as spam.

As you're connecting to the database in a function in a require()'d file, you can simply require this one after the check has been made.

The script which does the check may need a little maintenance from time to time (e.g. to change the pattern(s)), but that can (I hope) be done by your normal software release process.

Mark