I realized long ago that concocting a regex to check for a valid URL is a pain in the ass, so I was delighted by the advent of [man]filter_var[/man]. However, I've been looking into this function's output lately and find it kind of confusing. Some example code:
$urls = array(
    "",
    "Buy It Now",
    "localhost/foo/bar",
    "blarg",
    "blarg/",
    "blarg/some/path/file.ext",
    "http://google.com",
    "http://google.com/",
    "http://google.com/some/path.ext",
    "http://google.com/some/path.ext?foo=bar",
    "example.com",
    "example.com/",
    "example.com/some/path/file.ext",
    "example.com/some/path/file.ext?foo=bar",
    "example.com:1234",
    "example.com:1234/",
    "example.com:1234/some/path/file.ext",
    "example.com:1234/some/path/file.ext?foo=bar",
    "//foobar.com",
    "//foobar.com/",
    "//foobar.com/path/file.txt",
    "//cdn.example.com/js_file.js"
);

function check_url($url) {
    echo "checking $url\n";
    return filter_var($url, FILTER_VALIDATE_URL, FILTER_FLAG_HOST_REQUIRED | FILTER_FLAG_PATH_REQUIRED);
}

foreach ($urls as $url) {
    echo "url: $url\n";
    echo check_url($url) ? "PASS" : "FAIL";
    echo "\n\n";
}
NOTE: I do provide a couple of flags and FILTER_FLAG_SCHEME_REQUIRED is NOT one of them. That said, here's the output when run using PHP 7.0.15:
url:
checking
FAIL
url: Buy It Now
checking Buy It Now
FAIL
url: localhost/foo/bar
checking localhost/foo/bar
FAIL
url: blarg
checking blarg
FAIL
url: blarg/
checking blarg/
FAIL
url: blarg/some/path/file.ext
checking blarg/some/path/file.ext
FAIL
url: http://google.com
checking http://google.com
FAIL
url: http://google.com/
checking http://google.com/
PASS
url: http://google.com/some/path.ext
checking http://google.com/some/path.ext
PASS
url: http://google.com/some/path.ext?foo=bar
checking http://google.com/some/path.ext?foo=bar
PASS
url: example.com
checking example.com
FAIL
url: example.com/
checking example.com/
FAIL
url: example.com/some/path/file.ext
checking example.com/some/path/file.ext
FAIL
url: example.com/some/path/file.ext?foo=bar
checking example.com/some/path/file.ext?foo=bar
FAIL
url: example.com:1234
checking example.com:1234
FAIL
url: example.com:1234/
checking example.com:1234/
FAIL
url: example.com:1234/some/path/file.ext
checking example.com:1234/some/path/file.ext
FAIL
url: example.com:1234/some/path/file.ext?foo=bar
checking example.com:1234/some/path/file.ext?foo=bar
FAIL
url: //foobar.com
checking //foobar.com
FAIL
url: //foobar.com/
checking //foobar.com/
FAIL
url: //foobar.com/path/file.txt
checking //foobar.com/path/file.txt
FAIL
url: //cdn.example.com/js_file.js
checking //cdn.example.com/js_file.js
FAIL
Note that anything without a scheme FAILs. I also tried changing my filter_var line to this:
return filter_var($url, FILTER_VALIDATE_URL, FILTER_FLAG_HOST_REQUIRED | FILTER_FLAG_PATH_REQUIRED ^ FILTER_FLAG_SCHEME_REQUIRED);
This makes no difference whatsoever in the output.
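One possible reason the `^` variant changes nothing: in PHP, `^` binds tighter than `|`, and XOR toggles a bit rather than clearing it, so XOR-ing in a flag that was not set turns it on. Below is a minimal sketch with stand-in constants. The `DEMO_*` names and values are made up for illustration and are not the real `FILTER_FLAG_*` values.

```php
<?php
// Hypothetical stand-in flags; each one occupies its own bit.
const DEMO_SCHEME = 1; // 0b001
const DEMO_HOST   = 2; // 0b010
const DEMO_PATH   = 4; // 0b100

// `^` has higher precedence than `|`, so this parses as
// DEMO_HOST | (DEMO_PATH ^ DEMO_SCHEME).
$xored = DEMO_HOST | DEMO_PATH ^ DEMO_SCHEME;

// XOR switched the scheme bit ON, because it was not set beforehand.
var_dump(($xored & DEMO_SCHEME) !== 0); // bool(true)

// Clearing a bit is done with AND-NOT instead.
$cleared = (DEMO_HOST | DEMO_PATH | DEMO_SCHEME) & ~DEMO_SCHEME;
var_dump(($cleared & DEMO_SCHEME) !== 0); // bool(false)
```

Even with the bit properly cleared, the result may well be the same: as far as I can tell from the PHP manual, FILTER_FLAG_SCHEME_REQUIRED and FILTER_FLAG_HOST_REQUIRED were deprecated in PHP 7.3 on the grounds that FILTER_VALIDATE_URL implies them anyway, which would be consistent with every scheme-less URL failing above.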
If I alter my check_url function to prepend a default scheme of http:// if none is specified, the results change, but there are still problems. Here's the new function:
function check_url($url) {
    // check for a scheme first; if it's missing, add a default one
    // (https? matches http or https; the original s* also matched "httpsss")
    if (preg_match('#^https?://#', $url)) {
        $checkme = $url;
    } else {
        $checkme = "http://" . $url;
    }
    echo "checking $checkme\n";
    return filter_var($checkme, FILTER_VALIDATE_URL, FILTER_FLAG_HOST_REQUIRED | FILTER_FLAG_PATH_REQUIRED);
}
The output:
url:
checking http://
FAIL
url: Buy It Now
checking http://Buy It Now
FAIL
url: localhost/foo/bar
checking http://localhost/foo/bar
PASS
url: blarg
checking http://blarg
FAIL
url: blarg/
checking http://blarg/
PASS
url: blarg/some/path/file.ext
checking http://blarg/some/path/file.ext
PASS
url: http://google.com
checking http://google.com
FAIL
url: http://google.com/
checking http://google.com/
PASS
url: http://google.com/some/path.ext
checking http://google.com/some/path.ext
PASS
url: http://google.com/some/path.ext?foo=bar
checking http://google.com/some/path.ext?foo=bar
PASS
url: example.com
checking http://example.com
FAIL
url: example.com/
checking http://example.com/
PASS
url: example.com/some/path/file.ext
checking http://example.com/some/path/file.ext
PASS
url: example.com/some/path/file.ext?foo=bar
checking http://example.com/some/path/file.ext?foo=bar
PASS
url: example.com:1234
checking http://example.com:1234
FAIL
url: example.com:1234/
checking http://example.com:1234/
PASS
url: example.com:1234/some/path/file.ext
checking http://example.com:1234/some/path/file.ext
PASS
url: example.com:1234/some/path/file.ext?foo=bar
checking http://example.com:1234/some/path/file.ext?foo=bar
PASS
url: //foobar.com
checking http:////foobar.com
FAIL
url: //foobar.com/
checking http:////foobar.com/
FAIL
url: //foobar.com/path/file.txt
checking http:////foobar.com/path/file.txt
FAIL
url: //cdn.example.com/js_file.js
checking http:////cdn.example.com/js_file.js
FAIL
Some weird successes:
url: blarg/
checking http://blarg/
PASS
url: blarg/some/path/file.ext
checking http://blarg/some/path/file.ext
PASS
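These two passes look less odd once the host is pulled out on its own: `blarg` is a syntactically valid host name in the same way `localhost` is, and the filter appears to check syntax only, not whether the name resolves. Here is a rough sketch, assuming you additionally want to require a dot in the host. `host_has_dot` is a hypothetical helper, not part of PHP.

```php
<?php
// Hypothetical helper: true when the URL's host contains at least one dot.
function host_has_dot(string $url): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    // parse_url() gives null when no host is present and false for
    // seriously malformed URLs, so only accept a real string.
    return is_string($host) && strpos($host, '.') !== false;
}

var_dump(parse_url('http://blarg/', PHP_URL_HOST)); // string(5) "blarg"
var_dump(host_has_dot('http://blarg/'));             // bool(false)
var_dump(host_has_dot('http://example.com/'));       // bool(true)
```

Note that a check like this would also reject `http://localhost/foo/bar`, so whether it is appropriate depends on what the URLs are for.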
Also, I'm not sure what to do about these failures:
url: //foobar.com
checking http:////foobar.com
FAIL
url: //foobar.com/
checking http:////foobar.com/
FAIL
url: //foobar.com/path/file.txt
checking http:////foobar.com/path/file.txt
FAIL
url: //cdn.example.com/js_file.js
checking http:////cdn.example.com/js_file.js
FAIL
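The `http:////` in these failures comes from prepending `http://` to input that already starts with `//`. One possible approach is sketched here, under the assumption that scheme-relative input should default to `http`. `add_default_scheme` is a hypothetical name.

```php
<?php
// Hypothetical helper: make sure the URL carries a scheme before validating.
function add_default_scheme(string $url, string $scheme = 'http'): string
{
    if (preg_match('#^https?://#i', $url)) {
        return $url;                  // already has http:// or https://
    }
    if (strpos($url, '//') === 0) {
        return $scheme . ':' . $url;  // scheme-relative: only "http:" is missing
    }
    return $scheme . '://' . $url;    // bare host or path
}

echo add_default_scheme('//foobar.com/path/file.txt'), "\n"; // http://foobar.com/path/file.txt
echo add_default_scheme('example.com/'), "\n";               // http://example.com/
echo add_default_scheme('http://google.com/'), "\n";         // http://google.com/
```

The returned string can then be handed to filter_var as in check_url above. `//foobar.com` with no trailing slash would presumably still fail while FILTER_FLAG_PATH_REQUIRED is set, just as `http://google.com` does.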
I'm starting to think filter_var is not as legit as I want it to be. Does anyone else have suspicions of this function? Should I report any of this to the PHP devs? Or perhaps there is some detail in the RFC that I'm overlooking?