I'm trying to write a small PHP-based search engine, but I've gotten stuck... I need some way to strip all HTML tags, PHP code, etc., leaving just the plain text. So far, I have the following:
$temp = file_get_contents($file);
$temp = preg_replace("/(\n|\r\n)/", " ", $temp);
$temp = preg_replace("/<\?.*?(?<!\\\\)\?>/", "", $temp);
$temp = preg_replace("/<.*?>/", "", $temp);
echo "$temp<br>";
(the second search string is supposed to be "/<\?.*?(?<!\\)\?>/", but I can't get it to post correctly on the forum)
(not the original code, but gives a basic idea of my problem)
The first replace removes all line endings and inserts a space as a separator. The second replace removes all PHP code (the part of the string that begins with <? and ends with the next unescaped ?>). The problem is that when the file being indexed contains a line like the third replace, the second replace stops partway through it: it sees the ?> inside the "/<.*?>/" string and interprets it as the end of the PHP script. Even though ?> isn't a widely used string (outside of the obvious use, terminating a PHP script...), I'd like to figure out a better way of finding and removing a PHP script from a file.
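One direction that might avoid the regex problem entirely is to let PHP's own tokenizer decide where a script ends, since it already knows that a ?> inside a quoted string does not close the block. The following is only a rough sketch, assuming the tokenizer extension is available (it normally is); strip_php_code() and plain_text() are made-up helper names, not anything from the original code. Note that token_get_all() only treats a bare <? as an opening tag when short_open_tag is enabled, so the sample uses <?php.

```php
<?php
// Hypothetical helper: keep only the text that lies outside PHP tags.
function strip_php_code(string $source): string
{
    $text = '';
    foreach (token_get_all($source) as $token) {
        // Tokens are either single-character strings or [id, text, line] arrays.
        // T_INLINE_HTML is everything the PHP parser would pass through as output.
        if (is_array($token) && $token[0] === T_INLINE_HTML) {
            $text .= $token[1];
        }
    }
    return $text;
}

// Hypothetical helper: drop PHP code first, then HTML tags, then collapse whitespace.
function plain_text(string $source): string
{
    $text = strip_tags(strip_php_code($source));
    return trim(preg_replace('/\s+/', ' ', $text));
}

// The embedded script contains "?>" inside a string literal, as in the third replace.
$sample = "<p>Hello</p>\n<?php \$t = preg_replace(\"/<.*?>/\", \"\", \$t); ?>\n<b>world</b>";
echo plain_text($sample), "\n"; // prints "Hello world"
```

Because the tokenizer follows the same rules as the PHP parser, a ?> inside a // comment would still end the script here, just as it does when PHP runs the file.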
I'll probably wake up tomorrow with the answer (Murphy's Law applied to forums: the faster you post, the sooner the answer hits you and you feel like an idiot for not seeing it sooner). I'm sure somebody has some perfect code to do this... please help...