How to filter HTML?

CmdGabriel

Hi,
To secure text input against SQL injection in my programs, I borrowed a little function like this:

function str_clean($textstring)
{
	//1. replace blanks with underscores
	$textstring=str_replace(" ","_",$textstring);
	//2. replace the SQL string delimiter (') with an acute accent
	$textstring=str_replace("'","´",$textstring);
	//3. undo magic quotes first, so the string is not escaped twice
	if( get_magic_quotes_gpc() )
	{
		  $textstring = stripslashes( $textstring );
	}
	//use mysql_real_escape_string if it exists
	if( function_exists( "mysql_real_escape_string" ) )
	{
		  $textstring = mysql_real_escape_string( $textstring );
	}
	//for PHP version < 4.3.0 use addslashes
	else
	{
		  $textstring= addslashes( $textstring );
	}
	return $textstring;
}

But the question is: how can I write a function that allows only specific HTML tags (<b>, <br>)?
Is there anything ready-to-use out there?

regards
Gabriel

MarkR

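Your approach is flawed, because it changes some characters into others (blanks become underscores, apostrophes become acute accents), so the stored text no longer matches what the user typed.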

Generally, the database should store every character verbatim. You should therefore escape strings correctly at the point where you put them into the database, and nowhere else.
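
One common way to get this verbatim storage is to let the database driver do the quoting through a prepared statement. The following is only a sketch under assumptions that are not in the original post: it uses PDO with an in-memory SQLite database, and the table name comments and the helpers save_comment and load_comments are made up for illustration.

```php
<?php
// Assumption: the pdo_sqlite extension is available. Any PDO driver works the same way.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE comments (body TEXT)'); // hypothetical table

// Store the text exactly as typed; the placeholder keeps data out of the SQL text.
function save_comment(PDO $pdo, string $body): void
{
    $stmt = $pdo->prepare('INSERT INTO comments (body) VALUES (:body)');
    $stmt->execute([':body' => $body]);
}

// Read all stored bodies back as a plain list of strings.
function load_comments(PDO $pdo): array
{
    return $pdo->query('SELECT body FROM comments')->fetchAll(PDO::FETCH_COLUMN);
}

$input = "It's a <b>test</b> with blanks and 'quotes'";
save_comment($pdo, $input);
$stored = load_comments($pdo);
```

With this approach no character has to be replaced before storing: blanks and apostrophes should come back unchanged, and any HTML filtering can be treated as a separate step.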

To strip HTML tags, you can use strip_tags; however, this function is limited in what it can achieve.
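
As a small illustration of both the use and the limits of strip_tags, here is a sketch with a made-up input string. The second argument lists the tags to keep.

```php
<?php
// Made-up input: an allowed tag carrying an event handler, plus two unwanted tags.
$input = '<b onclick="alert(1)">bold</b><br><i>italic</i><script>evil()</script>';

// Keep only <b> and <br>; all other tags are removed.
$result = strip_tags($input, '<b><br>');

// Note what survives: the onclick attribute on <b> stays in place, and the
// text inside <i> and <script> is kept even though the tags themselves are gone.
echo $result, "\n";
```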

I've spent a long time trying to create a function which can safely and correctly (in all cases) remove unwanted HTML elements and attributes from an HTML fragment.

My current implementation now works, but it is very complicated: I use DOM to parse the HTML fragment (respecting the encoding, of course), then traverse the DOM making modifications as necessary (removing unwanted elements and attributes; some elements need to have their contents preserved, others need to be removed wholesale).

It's not even obvious what you have to remove to prevent unauthorised scripts:
- script elements, obviously!
- some attributes, not so obviously (think onclick, onload, etc.)
- anchors and other types of links need to be vetted for having legitimate targets (no javascript: etc.)
- some other attributes that would otherwise be allowed (style, specifically) need to be checked for executable content. Some browsers allow you to put script in CSS (notably MSIE and Mozilla, but through different vendor-specific syntaxes).
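
To make the link-vetting point concrete, here is a minimal sketch of a scheme check. The function name has_safe_scheme and the list of accepted schemes are assumptions for illustration, not part of any library, and the check expects an attribute value whose entities have already been decoded (as a DOM parser delivers it). It does not cover the style attribute.

```php
<?php
// Illustrative list of schemes accepted for link targets.
const SAFE_SCHEMES = ['http', 'https', 'mailto'];

function has_safe_scheme(string $url): bool
{
    // Browsers tolerate whitespace and control characters around and inside
    // the scheme, so drop them before looking at it.
    $url = (string) preg_replace('/[\x00-\x20]+/', '', $url);

    // If the value starts with "scheme:", the scheme must be on the list.
    if (preg_match('/^([a-z][a-z0-9+.\-]*):/i', $url, $m)) {
        return in_array(strtolower($m[1]), SAFE_SCHEMES, true);
    }

    // No scheme at all: treat it as a relative URL.
    return true;
}
```

The check is deliberately a whitelist: an unknown scheme is rejected rather than guessed at.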

My recommendation is that you have whitelists of acceptable elements and attributes and remove anything else. strip_tags does this for elements but doesn't touch attributes, so JavaScript can still easily be inserted (unless you strip ALL tags).
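
The whitelist idea can be sketched with PHP's DOM extension. This is not the implementation described above, only a simplified illustration under stated assumptions: the input is UTF-8, the names ALLOWED_TAGS, DROP_WITH_CONTENT, sanitize_fragment and clean_children are made up, and the whitelist allows no attributes at all, so link and style checks are not needed here.

```php
<?php
// Illustrative whitelist: tag name => list of attributes allowed on it.
const ALLOWED_TAGS = ['b' => [], 'br' => []];
// Elements removed together with their contents.
const DROP_WITH_CONTENT = ['script', 'style'];

function clean_children(DOMNode $parent): void
{
    // Iterate over a snapshot, because the loop modifies the tree.
    foreach (iterator_to_array($parent->childNodes) as $node) {
        if ($node->nodeType === XML_TEXT_NODE) {
            continue; // plain text is kept
        }
        if (!($node instanceof DOMElement)) {
            $parent->removeChild($node); // comments and other node types
            continue;
        }
        $tag = strtolower($node->nodeName);
        if (in_array($tag, DROP_WITH_CONTENT, true)) {
            $parent->removeChild($node); // removed wholesale
            continue;
        }
        clean_children($node);
        if (!array_key_exists($tag, ALLOWED_TAGS)) {
            // Unknown element: keep its (already cleaned) contents, drop the tag.
            while ($node->firstChild) {
                $parent->insertBefore($node->firstChild, $node);
            }
            $parent->removeChild($node);
            continue;
        }
        // Allowed element: remove every attribute that is not whitelisted.
        foreach (iterator_to_array($node->attributes) as $attr) {
            if (!in_array(strtolower($attr->nodeName), ALLOWED_TAGS[$tag], true)) {
                $node->removeAttributeNode($attr);
            }
        }
    }
}

function sanitize_fragment(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world fragments are rarely valid
    // The meta tag declares the encoding; the div gives the fragment a known root.
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '<div>' . $html . '</div>'
    );
    libxml_clear_errors();
    $root = $doc->getElementsByTagName('div')->item(0);
    clean_children($root);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$clean = sanitize_fragment('<b onclick="alert(1)">hi</b><script>x()</script><i>there</i><br>');
```

In this sketch the onclick attribute and the whole script element should disappear, while the text inside the unknown <i> element is kept. Anything beyond this (allowed attributes such as href, the style attribute, unusual encodings) needs the additional checks listed above.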

There isn't a magic bullet; this is genuinely very difficult.

Mark