filter_var results a bit weird for FILTER_VALIDATE_URL

sneakyimp · May 10, 2017

I realized very long ago that concocting a regex to check for a valid url is a pain in the ass. I was delighted by the advent of filter_var. However, I've been looking into the output of this function lately and find that its output is kinda confusing. Some example code:

$urls = array(
        "",
        "Buy It Now",
        "localhost/foo/bar",
        "blarg",
        "blarg/",
        "blarg/some/path/file.ext",
        "http://google.com",
        "http://google.com/",
        "http://google.com/some/path.ext",
        "http://google.com/some/path.ext?foo=bar",
        "example.com",
        "example.com/",
        "example.com/some/path/file.ext",
        "example.com/some/path/file.ext?foo=bar",
        "example.com:1234",
        "example.com:1234/",
        "example.com:1234/some/path/file.ext",
        "example.com:1234/some/path/file.ext?foo=bar",
        "//foobar.com",
        "//foobar.com/",
        "//foobar.com/path/file.txt",
        "//cdn.example.com/js_file.js"

);

function check_url($url) {
        echo "checking $url\n";
        return filter_var($url, FILTER_VALIDATE_URL, FILTER_FLAG_HOST_REQUIRED | FILTER_FLAG_PATH_REQUIRED);
}

foreach ($urls as $url) {
        echo "url: $url\n";
        echo check_url($url) ? "PASS" : "FAIL";
        echo "\n\n";
}

NOTE: I do provide a couple of flags and FILTER_FLAG_SCHEME_REQUIRED is NOT one of them. That said, here's the output when run using PHP 7.0.15:

url: 
checking 
FAIL

url: Buy It Now
checking Buy It Now
FAIL

url: localhost/foo/bar
checking localhost/foo/bar
FAIL

url: blarg
checking blarg
FAIL

url: blarg/
checking blarg/
FAIL

url: blarg/some/path/file.ext
checking blarg/some/path/file.ext
FAIL

url: http://google.com
checking http://google.com
FAIL

url: http://google.com/
checking http://google.com/
PASS

url: http://google.com/some/path.ext
checking http://google.com/some/path.ext
PASS

url: http://google.com/some/path.ext?foo=bar
checking http://google.com/some/path.ext?foo=bar
PASS

url: example.com
checking example.com
FAIL

url: example.com/
checking example.com/
FAIL

url: example.com/some/path/file.ext
checking example.com/some/path/file.ext
FAIL

url: example.com/some/path/file.ext?foo=bar
checking example.com/some/path/file.ext?foo=bar
FAIL

url: example.com:1234
checking example.com:1234
FAIL

url: example.com:1234/
checking example.com:1234/
FAIL

url: example.com:1234/some/path/file.ext
checking example.com:1234/some/path/file.ext
FAIL

url: example.com:1234/some/path/file.ext?foo=bar
checking example.com:1234/some/path/file.ext?foo=bar
FAIL

url: //foobar.com
checking //foobar.com
FAIL

url: //foobar.com/
checking //foobar.com/
FAIL

url: //foobar.com/path/file.txt
checking //foobar.com/path/file.txt
FAIL

url: //cdn.example.com/js_file.js
checking //cdn.example.com/js_file.js
FAIL

Note that anything without a scheme FAILs. I also tried changing my filter_var line to this:

return filter_var($url, FILTER_VALIDATE_URL, FILTER_FLAG_HOST_REQUIRED | FILTER_FLAG_PATH_REQUIRED ^ FILTER_FLAG_SCHEME_REQUIRED);

This makes no difference whatsoever in the output.
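Separately from what the filter does with its flags, it may be worth checking what that flag expression evaluates to. In PHP, ^ binds tighter than |, and XOR toggles a bit rather than clearing it. A minimal sketch, using made-up flag values (the DEMO_* constants are placeholders for illustration, not the real FILTER_FLAG_* values):

```php
<?php
// Hypothetical bit flags for illustration only; not the real FILTER_FLAG_* values.
const DEMO_SCHEME = 1;
const DEMO_HOST   = 2;
const DEMO_PATH   = 4;

// ^ has higher precedence than |, so this is DEMO_HOST | (DEMO_PATH ^ DEMO_SCHEME).
$xored = DEMO_HOST | DEMO_PATH ^ DEMO_SCHEME;

// XOR toggles: the scheme bit was not set before, so it is now switched ON.
$schemeBitSet = ($xored & DEMO_SCHEME) !== 0;

// To clear a bit, AND with its complement instead.
$cleared = (DEMO_SCHEME | DEMO_HOST | DEMO_PATH) & ~DEMO_SCHEME;

var_dump($xored, $schemeBitSet, $cleared);
```

Clearing a flag this way only matters if the filter actually consults that flag, of course.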

If I alter my check_url function to prepend a default scheme of http:// if none is specified, the results change, but there are still problems. Here's the new function:

function check_url($url) {
        // check for scheme first, if it's missing then add it
        if (preg_match('#^https?://#i', $url)) {
                $checkme = $url;
        } else {
                $checkme = "http://" . $url;
        }
        echo "checking $checkme\n";
        return filter_var($checkme, FILTER_VALIDATE_URL, FILTER_FLAG_HOST_REQUIRED | FILTER_FLAG_PATH_REQUIRED);
}

The output:

url: 
checking http://
FAIL

url: Buy It Now
checking http://Buy It Now
FAIL

url: localhost/foo/bar
checking http://localhost/foo/bar
PASS

url: blarg
checking http://blarg
FAIL

url: blarg/
checking http://blarg/
PASS

url: blarg/some/path/file.ext
checking http://blarg/some/path/file.ext
PASS

url: http://google.com
checking http://google.com
FAIL

url: http://google.com/
checking http://google.com/
PASS

url: http://google.com/some/path.ext
checking http://google.com/some/path.ext
PASS

url: http://google.com/some/path.ext?foo=bar
checking http://google.com/some/path.ext?foo=bar
PASS

url: example.com
checking http://example.com
FAIL

url: example.com/
checking http://example.com/
PASS

url: example.com/some/path/file.ext
checking http://example.com/some/path/file.ext
PASS

url: example.com/some/path/file.ext?foo=bar
checking http://example.com/some/path/file.ext?foo=bar
PASS

url: example.com:1234
checking http://example.com:1234
FAIL

url: example.com:1234/
checking http://example.com:1234/
PASS

url: example.com:1234/some/path/file.ext
checking http://example.com:1234/some/path/file.ext
PASS

url: example.com:1234/some/path/file.ext?foo=bar
checking http://example.com:1234/some/path/file.ext?foo=bar
PASS

url: //foobar.com
checking http:////foobar.com
FAIL

url: //foobar.com/
checking http:////foobar.com/
FAIL

url: //foobar.com/path/file.txt
checking http:////foobar.com/path/file.txt
FAIL

url: //cdn.example.com/js_file.js
checking http:////cdn.example.com/js_file.js
FAIL

Some weird successes:

url: blarg/
checking http://blarg/
PASS

url: blarg/some/path/file.ext
checking http://blarg/some/path/file.ext
PASS

Also, I'm not sure what to do about these failures:

url: //foobar.com
checking http:////foobar.com
FAIL

url: //foobar.com/
checking http:////foobar.com/
FAIL

url: //foobar.com/path/file.txt
checking http:////foobar.com/path/file.txt
FAIL

url: //cdn.example.com/js_file.js
checking http:////cdn.example.com/js_file.js
FAIL

I'm starting to think filter_var is not as legit as I want it to be. Does anyone else have suspicions of this function? Should I report any of this to the PHP devs? Or perhaps there is some detail in the RFC that I'm overlooking?

NogDog · May 10, 2017

So I started thinking about validating URLs in a way analogous to the best way to validate email addresses, which is to send an email to it with a verification link. In this case, see if we get a "meaningful" response for the url. The downside would be net lag.

<?php

function validate_url($url, $add_prefix=false, $debug=false)
{
    $url = trim($url);
    if($add_prefix and !preg_match('#^https?://#i', $url)) {
        $url = 'http://' . ltrim($url, '/');
    }
    if($debug) { echo "DEBUG: $url\n"; }
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_NOBODY => true,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER => true
    ));
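    $result = curl_exec($ch);
    if($debug) { echo "DEBUG: $result\n"; }
    // curl_exec() returns false on connection or DNS failure
    if($result === false) { return false; }
    // accept any 2xx or 3xx status line, whatever the HTTP version
    return (bool)preg_match('#^HTTP/\S+ [23]\d\d\b#mi', $result);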
}

$tests = array(
    'http://www.google.com/',
    'https://www.google.com/search?q=php',
    'http://www.xixixixixixixixllllllllll.xyzzz/fubar',
    'http://www.google.com/sisisisisisisisisisisi.php',
    'foobar',
    'localhost',
    'google.com/'
);

foreach(array(null, true) as $prefix_flag) {
    echo "flag is: ".var_export($prefix_flag, 1)."\n";
    foreach ($tests as $url) {
        echo "\nTesting '$url': \n  ";
        echo validate_url($url, $prefix_flag, true) ? 'valid' : 'invalid';
        echo "\n";
    }
}

sneakyimp · May 10, 2017

Thanks for your response, but the point of me using filter_var is to prevent unnecessary connection attempts if the supplied value doesn't look like a valid url -- I'm trying to optimize a process that might have to fetch a million remote urls as quickly as possible. If we try fetching too many invalid remote urls from a machine, we can get banned by fail2ban or something. If the remote_url is not valid at all, we want to skip it entirely.

NogDog · May 10, 2017

Yeah, I was just sort of curious what it would be like to do it that way. I realize that if you just want to check the format, it's not necessarily ideal.

Unfortunately, I suspect you run into the same thing with trying to validate email address formats: the official specifications are complex with many variations, yet real-world browsers and such may allow things that technically are "invalid", and so forth. As a result, you may need to err on the side of leniency in order to avoid false negatives.

dalecosp · May 10, 2017

sneakyimp;11061859 wrote:

Some weird successes:

url: blarg/
checking http://blarg/
PASS

url: blarg/some/path/file.ext
checking http://blarg/some/path/file.ext
PASS

While the public Internet infrastructure does use TLDs to assist user-agents in resolving hostnames, valid hostnames do not require a TLD. And a URL can point to a valid hostname (without a TLD). http://blargh is valid on any machine that can resolve the name "blargh" ... as mine does now, after a quick hack to HOSTS.

What this means for your system remains to be seen; I should think you would want to ensure that the hostname is a FQDN ... you might be able to use a regex or a DB lookup (IETF or some other organization may have a list or DB of TLDs ... ) for that ... hmm ....

Given then that "http://blargh" is valid, no surprise on the 2nd one. The function makes no presumptions about file extensions (and who could blame it for that?)

Also, I'm not sure what to do about these failures:

url: //foobar.com
checking http:////foobar.com
FAIL

url: //foobar.com/
checking http:////foobar.com/
FAIL

url: //foobar.com/path/file.txt
checking http:////foobar.com/path/file.txt
FAIL

url: //cdn.example.com/js_file.js
checking http:////cdn.example.com/js_file.js
FAIL

If you'll remove the extra two "//" characters, this should work. No bugs to be seen here.
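A rough sketch of that idea, assuming the goal is only to give each candidate an explicit scheme before validating it. The helper name add_default_scheme and the default of http are assumptions made for illustration, and it only makes sense for host-style URLs (something like mailto: would be mangled):

```php
<?php
// Hypothetical helper: give a candidate string an explicit scheme before
// handing it to FILTER_VALIDATE_URL. The default scheme is an assumption.
function add_default_scheme(string $url, string $defaultScheme = 'http'): string
{
    $url = trim($url);
    if (preg_match('#^[a-z][a-z0-9+.\-]*://#i', $url)) {
        return $url;                          // already has scheme://
    }
    if (strpos($url, '//') === 0) {
        return $defaultScheme . ':' . $url;   // scheme-relative: //host/path
    }
    return $defaultScheme . '://' . $url;     // bare host/path
}

var_dump(filter_var(add_default_scheme('//foobar.com/path/file.txt'), FILTER_VALIDATE_URL));
```

Because the scheme-relative case is handled before the bare-host case, //foobar.com/ gains "http:" rather than "http://" and the doubled slashes never appear.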

sneakyimp · May 10, 2017

NogDog wrote:
Unfortunately, I suspect you run into the same thing with trying to validate email address formats: the official specifications are complex with many variations, yet real-world browsers and such may allow things that technically are "invalid", and so forth. As a result, you may need to err on the side of leniency in order to avoid false negatives.

It is precisely this complexity which makes it such a good idea for the talented PHP devs to solve the issue. The docs say:

Validates value as URL (according to http://www.faqs.org/rfcs/rfc2396), optionally with required components. Beware a valid URL may not specify the HTTP protocol http:// so further validation may be required to determine the URL uses an expected protocol, e.g. ssh:// or mailto:. Note that the function will only find ASCII URLs to be valid; internationalized domain names (containing non-ASCII characters) will fail.

Aside from RFC 2396, I've also seen RFC 3986, dated January 2005, which obsoletes the earlier RFC and then 6874 and 7320 which seem even newer. Seems to me the function (or at least the docs) could use updating? 2396 was issued in August 1998 -- almost 20 years ago. 3986 was issued 12 years ago. I have real trouble comprehending these documents in any actionable way and was rather hoping for some insight from this august community of smarties.

I'm also wondering if anyone else can confirm that the FILTER_FLAG_SCHEME_REQUIRED flag does not work for this function. I was under the impression, but perhaps mistaken, that the function should ignore the absence of a scheme by supplying flags but omitting FILTER_FLAG_SCHEME_REQUIRED.

dalecosp;11061879 wrote:
While the public Internet infrastructure does use TLDs to assist user-agents in resolving hostnames, valid hostnames do not require a TLD. And a URL can point to a valid hostname (without a TLD). http://blargh is valid on any machine that can resolve the name "blargh" ... as mine does now, after a quick hack to HOSTS.

Thank you for this clarification. I can easily see how such a url might refer to some other host on my LAN or local net. I was under the (perhaps mistaken) impression that a URL had to be universally meaningful. Upon reflection, I see there's no basis for this belief.

dalecosp;11061879 wrote:
What this means for your system remains to be seen; I should think you would want to ensure that the hostname is a FQDN ... you might be able to use a regex or a DB lookup (IETF or some other organization may have an list or DB of TLDs ... ) for that ... hmm ....

My application will want to avoid any non-FQDN links, if only to prevent potential exploits or probing of my internal network. Not sure what sort of regex to use (as I mentioned, such parsing is tricky). I wonder if there's some kind of RFC discussing valid domain names? :rolleyes:

parse_url seems useful:

$url = "http://blarg/some/path/file.txt";
$v = parse_url($url);
var_dump($v);

result:

array(3) {
  'scheme' =>
  string(4) "http"
  'host' =>
  string(5) "blarg"
  'path' =>
  string(19) "/some/path/file.txt"
}
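Building on that parse_url output, a shape-only check for "host looks like an FQDN" might be sketched as follows. The function name is made up, the rules (at least two labels, letters/digits/hyphens, non-numeric last label) are assumptions rather than anything taken from an RFC verbatim, and nothing here consults DNS or a TLD list:

```php
<?php
// Hypothetical helper: does the URL's host look like a fully qualified
// domain name? This checks shape only; it does not consult DNS or a TLD list.
function host_looks_like_fqdn(string $url): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '' || strlen($host) > 253) {
        return false;
    }
    // Reject IP literals; only names are of interest here.
    if (filter_var(trim($host, '[]'), FILTER_VALIDATE_IP) !== false) {
        return false;
    }
    $labels = explode('.', rtrim($host, '.'));
    if (count($labels) < 2) {
        return false;                       // e.g. "blarg" or "localhost"
    }
    foreach ($labels as $label) {
        // Letters, digits, hyphens; 1-63 chars; no leading/trailing hyphen.
        if (!preg_match('/^(?!-)[a-z0-9-]{1,63}(?<!-)$/iD', $label)) {
            return false;
        }
    }
    // Assumption: a real TLD is never purely numeric.
    return !preg_match('/^\d+$/D', $labels[count($labels) - 1]);
}

var_dump(host_looks_like_fqdn('http://blarg/some/path/file.txt'));
var_dump(host_looks_like_fqdn('http://example.com/some/path/file.txt'));
```

A name that passes this check can still resolve to an internal address, so it would not by itself prevent probing of a private network.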

dalecosp;11061879 wrote:
If you'll remove the extra two "//" characters, this should work. No bugs to be seen here.

I still think it's a bit wrong that these are FAILed by the function. Apparently this would be in violation of RFC 3986, section 4.2.

Do we think there are any issues I should take to the PHP-DEV mailing list here? Seems to me there are a few:
* functionality is out of date and should at least support RFC 3986 specs
* omission of flag FILTER_FLAG_SCHEME_REQUIRED doesn't relax requirement for a scheme.
* shouldn't urls starting with // be valid?

Weedpacket · May 11, 2017

sneakyimp wrote:
dalecosp wrote:
If you'll remove the extra two "//" characters, this should work. No bugs to be seen here.

I still think it's a bit wrong that these are FAILed by the function. Apparently this would be in violation of RFC 3986, section 4.2.

That section doesn't have anything to do with the presence of ////....

I've been trying to make sense of the flags as well - the fact that the filter won't say why a given URL failed doesn't help. The only one I can seem to turn off is PATH_REQUIRED (which then allows http://www.google.com).

But since many many valid URLs may not be valid for your purpose, it might be better to write something around parse_url and validate each piece as appropriate. There seem to be quite a few problems with FILTER_VALIDATE_URL.
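As a sketch of what "something around parse_url" could look like, here is one that also reports why a string was rejected. The function name, the reason strings, and the default list of allowed schemes are all assumptions for illustration:

```php
<?php
// Hypothetical helper: returns null when the URL is acceptable for fetching,
// otherwise a short reason string. The allowed schemes are an assumption.
function why_not_fetchable(string $url, array $schemes = ['http', 'https'])
{
    $parts = parse_url($url);
    if ($parts === false) {
        return 'unparseable';
    }
    if (!isset($parts['scheme'])) {
        return 'missing scheme';
    }
    if (!in_array(strtolower($parts['scheme']), $schemes, true)) {
        return 'unexpected scheme';
    }
    if (!isset($parts['host']) || $parts['host'] === '') {
        return 'missing host';
    }
    if (isset($parts['user']) || isset($parts['pass'])) {
        return 'embedded credentials';
    }
    return null;
}

foreach (['http://google.com', 'example.com/', '//foobar.com/', 'ftp://example.com/'] as $candidate) {
    echo $candidate, ' => ', var_export(why_not_fetchable($candidate), true), "\n";
}
```

Each check is a separate, named rule, so rules that do not suit a given application can be dropped or tightened without guessing what the filter is doing internally.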

dalecosp · May 11, 2017

Weedpacket;11061897 wrote:
There seem to be quite a few problems with FILTER_VALIDATE_URL.

The major flaw there seems to be exactly what Sneakyimp was saying ... it's based on a fairly ancient RFC. Half of those are "doesn't support $foo" (IDN, Hebrew, Tel, phar, IPv6, etc.) ... most of which are implicit or even explicit in the manual.

Derokorian · May 11, 2017

sneakyimp;11061891 wrote:
I'm also wondering if anyone else can confirm that the FILTER_FLAG_SCHEME_REQUIRED flag does not work for this function. I was under the impression, but perhaps mistaken, that the function should ignore the absence of a scheme by supplying flags but omitting FILTER_FLAG_SCHEME_REQUIRED.

It says in the docs that SCHEME and HOST are both applied by default.

php.net wrote:
5.2.1 FILTER_VALIDATE_URL now defaults to FILTER_FLAG_SCHEME_REQUIRED and FILTER_FLAG_HOST_REQUIRED.

In fact we can see here that those constants aren't even referenced while path and query are. If scheme or host are not present, it fails.

Derokorian · May 11, 2017

Weedpacket;11061897 wrote:
But since many many valid URLs may not be valid for your purpose, it might be better to write something around parse_url and validate each piece as appropriate. There seem to be quite a few problems with FILTER_VALIDATE_URL.

In fact, filter_var VALIDATE_URL uses parse_url under the hood.

Weedpacket · May 11, 2017

Yes; and for the other half, RFC3986 notes among its lists of changes that its predecessor didn't allow about:, and that it had trouble distinguishing "host" and path.

Really it shouldn't be too hard to translate the ABNF of 3986 into a parser to replace whatever PHP's using now (just checked: the filter basically calls parse_url, which is hand-rolled; https://bugs.php.net/bug.php?id=72301 is right about there being a re2c-based parser and how nothing seems to use it). Attempting to change it might cause squawks about BC breakage, but I suspect that it's not that well-used as it is so won't be breaking anything that is not probably broken already.

Wow:
I'm not seeing something.
BRB

Weedpacket · May 11, 2017

That's why FILTER_FLAG_SCHEME_REQUIRED/FILTER_FLAG_HOST_REQUIRED don't do anything.

diff museum.php.net/php5/php-5.2.0/ext/filter/logical_filters.c museum.php.net/php5/php-5.2.1/ext/filter/logical_filters.c > diff.txt

...
488,491c457,460
<	if (
< 		((flags & FILTER_FLAG_SCHEME_REQUIRED) && url->scheme == NULL) ||
< 		((flags & FILTER_FLAG_HOST_REQUIRED) && url->host == NULL) ||
< 		((flags & FILTER_FLAG_PATH_REQUIRED) && url->path == NULL) ||
< 		((flags & FILTER_FLAG_QUERY_REQUIRED) && url->query == NULL)
<	) {
<bad_url:
<		php_url_free(url);
<		RETURN_VALIDATION_FAILED
---
>	if (
> 		url->scheme == NULL ||
> 		/* some schemas allow the host to be empty */
> 		(url->host == NULL && (strcmp(url->scheme, "mailto") && strcmp(url->scheme, "news") && strcmp(url->scheme, "file"))) ||
> 		((flags & FILTER_FLAG_PATH_REQUIRED) && url->path == NULL) || ((flags & FILTER_FLAG_QUERY_REQUIRED) && url->query == NULL)
>	) {
>bad_url:
>		php_url_free(url);
>		RETURN_VALIDATION_FAILED

Thanks to Ilia: https://bugs.php.net/bug.php?id=39898

They're not "on by default", they're hardcoded and factored out.

Absolute URIs only, please. I insist.
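The effect of that hardcoding can be observed from userland. A small sketch, using only the default call and FILTER_FLAG_PATH_REQUIRED (the example.com strings are just sample inputs):

```php
<?php
// Results keyed by input; false means the filter rejected the string.
$results = [];
$candidates = ['http://example.com', 'http://example.com/', 'example.com/', '//example.com/'];
foreach ($candidates as $candidate) {
    $results[$candidate] = [
        // No flags at all: scheme and host are still demanded.
        'default'       => filter_var($candidate, FILTER_VALIDATE_URL),
        // The path flag is one the filter does consult.
        'path_required' => filter_var($candidate, FILTER_VALIDATE_URL, FILTER_FLAG_PATH_REQUIRED),
    ];
}
var_dump($results);
```

Only the first input changes between the two columns; the scheme-less inputs are rejected either way.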

sneakyimp · May 11, 2017

Weedpacket;11061897 wrote:
That section doesn't have anything to do with the presence of ////....

In my first round of tests, filter_var is flunking //example.com/:

// bool(false)
var_dump(filter_var("//cdn.example.com/js_file.js", FILTER_VALIDATE_URL, FILTER_FLAG_HOST_REQUIRED | FILTER_FLAG_PATH_REQUIRED));

Weedpacket;11061897 wrote:
There seem to be quite a few problems with FILTER_VALIDATE_URL.

I'd like to gingerly approach the PHP devs about fixing it. Perhaps we can make our own list of flagrant problems.

Derokorian;11061903 wrote:
It says in the docs that SCHEME and HOST are both applied by default.

Wondering a) if flags actually work and b) is it possible to ignore the scheme? I think flags may not work. I'm mistrustful of docs.

Derokorian;11061903 wrote:
In fact we can see here that those constants aren't even referenced while path and query are. If scheme or host are not present, it fails.

PHP source code relies very heavily on macros and its parameter parsing code is pretty funky but yeah, I don't see those constants referenced in that function's source.

Weedpacket;11061907 wrote:
Yes; and for the other half, RFC3986 notes among its lists of changes that its predecessor didn't allow about:, and that it had trouble distinguishing "host" and path.

Really it shouldn't be too hard to translate the ABNF of 3986 into a parser to replace whatever PHP's using now (just checked: the filter basically calls parse_url, which is hand-rolled; https://bugs.php.net/bug.php?id=72301 is right about there being a re2c-based parser and how nothing seems to use it). Attempting to change it might cause squawks about BC breakage, but I suspect that it's not that well-used as it is so won't be breaking anything that is not probably broken already.

Thanks for this diligence, although I'm not exactly sure I follow the part about parsers and BC breakage.

Weedpacket;11061907 wrote:
Wow:
I'm not seeing something.
BRB

:eek:

Weedpacket;11061909 wrote:

That's why FILTER_FLAG_SCHEME_REQUIRED/FILTER_FLAG_HOST_REQUIRED don't do anything.

diff museum.php.net/php5/php-5.2.0/ext/filter/logical_filters.c museum.php.net/php5/php-5.2.1/ext/filter/logical_filters.c > diff.txt

...
488,491c457,460
<	if (
< 		((flags & FILTER_FLAG_SCHEME_REQUIRED) && url->scheme == NULL) ||
< 		((flags & FILTER_FLAG_HOST_REQUIRED) && url->host == NULL) ||
< 		((flags & FILTER_FLAG_PATH_REQUIRED) && url->path == NULL) ||
< 		((flags & FILTER_FLAG_QUERY_REQUIRED) && url->query == NULL)
<	) {
<bad_url:
<		php_url_free(url);
<		RETURN_VALIDATION_FAILED
---
>	if (
> 		url->scheme == NULL ||
> 		/* some schemas allow the host to be empty */
> 		(url->host == NULL && (strcmp(url->scheme, "mailto") && strcmp(url->scheme, "news") && strcmp(url->scheme, "file"))) ||
> 		((flags & FILTER_FLAG_PATH_REQUIRED) && url->path == NULL) || ((flags & FILTER_FLAG_QUERY_REQUIRED) && url->query == NULL)
>	) {
>bad_url:
>		php_url_free(url);
>		RETURN_VALIDATION_FAILED

Thanks to Ilia: https://bugs.php.net/bug.php?id=39898

They're not "on by default", they're hardcoded and factored out.

Absolute URIs only, please. I insist.

This is not cool. I suppose I'll send an email to PHP devs about this now ten-year-old bug. This function looks pretty broken to me. If anyone has any requests or wants to chime in on what we would like from this function, I'll try and roll it into a polite email calling for some action.

Weedpacket · May 12, 2017

Hah, it looks like Derokorian and I managed to be in the same place at the same time. I took longer to write things up.

First of all, I think a working filter should validate against RFC3986: it's not just a cosmetic difference, as there are problems with RFC2396 that RFC3986 itself calls out.

Validating against RFC3986 will allow specifying that the filter validates URIs, not merely URI-references, which is probably what was intended.

The validation rule becomes "all URIs have a scheme part" (and in fact that's the only part it requires), followed by further checks such as "if the scheme is followed by '//' then an authority part must be present", and so forth.

It's not the place for a URL validator to pick and choose based on specific schemes (http: vs. file: vs. mailto: etc.). The "decision made" in the bug tracker:

pajoye@php.net wrote:
- remove option of having host & scheme optional
- No option to make them optional (what you have is not a url then)

is itself refuted in Ilia's fix (Oops, sometimes a URL doesn't have a host part! Quick, bodge a special case! Have we got them all?)

Yes, maybe 'Every developer expects "blahblub" to be not valid.' but the fact is that RFC3986 allows an empty string as a valid URI-reference. Picking and choosing about which bits to accept or deny isn't going to produce anything intuitive because it would require the user to intuit which bits the implementer chose.

The problem with validating a URI-reference is that it could be a relative URI. By definition, such a beast can't be validated without resolving it against the base URI it is supposed to be relative to. So if relative URIs are to be accepted, it would be necessary for the user to pass a suitable base URI (http://www.example.com/user, mailto:, or whatever), and apply the RFC's resolution rules, failing the input string if resolution fails to produce a valid URI.
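For a feel of what "apply the RFC's resolution rules" involves, here is a sketch of RFC 3986 section 5.2 (transform references, merge paths, remove dot segments). The function names are invented for illustration, the base is assumed to be a well-formed absolute URI, and no component validation is attempted:

```php
<?php
// Split a URI-reference with the RFC 3986 Appendix B regex; absent parts become null.
function split_ref(string $ref): array
{
    preg_match('~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~s', $ref, $m);
    $m += array_fill(0, 10, '');   // trailing unmatched groups are otherwise missing
    return [
        'scheme'    => $m[1] !== '' ? $m[2] : null,
        'authority' => $m[3] !== '' ? $m[4] : null,
        'path'      => $m[5],
        'query'     => $m[6] !== '' ? $m[7] : null,
        'fragment'  => $m[8] !== '' ? $m[9] : null,
    ];
}

// Section 5.2.4: interpret "." and ".." segments.
function remove_dot_segments(string $in): string
{
    $out = [];
    while ($in !== '') {
        if (strpos($in, '../') === 0) { $in = substr($in, 3); }
        elseif (strpos($in, './') === 0 || strpos($in, '/./') === 0) { $in = substr($in, 2); }
        elseif ($in === '/.') { $in = '/'; }
        elseif (strpos($in, '/../') === 0) { $in = substr($in, 3); array_pop($out); }
        elseif ($in === '/..') { $in = '/'; array_pop($out); }
        elseif ($in === '.' || $in === '..') { $in = ''; }
        else {
            // Move the first segment (with its leading "/", if any) to the output.
            $next = strpos($in, '/', 1);
            $out[] = $next === false ? $in : substr($in, 0, $next);
            $in = $next === false ? '' : substr($in, $next);
        }
    }
    return implode('', $out);
}

// Sections 5.2.2 and 5.3: resolve $ref against $base and recompose.
function resolve_reference(string $base, string $ref): string
{
    $b = split_ref($base);
    $r = split_ref($ref);
    if ($r['scheme'] !== null) {
        $t = $r;
        $t['path'] = remove_dot_segments($r['path']);
    } elseif ($r['authority'] !== null) {
        $t = ['scheme' => $b['scheme'], 'authority' => $r['authority'],
              'path' => remove_dot_segments($r['path']), 'query' => $r['query']];
    } elseif ($r['path'] === '') {
        $t = ['scheme' => $b['scheme'], 'authority' => $b['authority'], 'path' => $b['path'],
              'query' => $r['query'] !== null ? $r['query'] : $b['query']];
    } else {
        if ($r['path'][0] === '/') {
            $merged = $r['path'];
        } elseif ($b['authority'] !== null && $b['path'] === '') {
            $merged = '/' . $r['path'];
        } else {
            // Everything in the base path up to and including its last "/".
            $slash = strrpos($b['path'], '/');
            $merged = ($slash === false ? '' : substr($b['path'], 0, $slash + 1)) . $r['path'];
        }
        $t = ['scheme' => $b['scheme'], 'authority' => $b['authority'],
              'path' => remove_dot_segments($merged), 'query' => $r['query']];
    }
    $uri = $t['scheme'] . ':';
    if ($t['authority'] !== null) { $uri .= '//' . $t['authority']; }
    $uri .= $t['path'];
    if ($t['query'] !== null) { $uri .= '?' . $t['query']; }
    if ($r['fragment'] !== null) { $uri .= '#' . $r['fragment']; }
    return $uri;
}

echo resolve_reference('http://a/b/c/d;p?q', '../g'), "\n";
```

The sample call uses the base URI from the RFC's own examples in section 5.4. The resolved string is what would then go through the absolute-URI check.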

Any URI that passes the grammar test should not be invalidated on additional grounds if the scheme is unrecognised by the implementer (if they don't recognise the scheme, how do they know what's valid and what's not?) If the implementer does recognise the scheme, then additional checks may be performed. That means implementing the relevant sections of RFC7230 (http:), RFC6068 (mailto:), RFC8089 (file:), RFC5538 (news:) and so on, depending on how energetic the implementer feels. The more esoteric the scheme, the more likely the user is in a better position to validate, and can use FILTER_CALLBACK for further refinement. It's not up to the implementer to discard a "pkcs11:" URI (RFC7512) just because it doesn't include a host part and they've never personally seen one themselves.
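The FILTER_CALLBACK route might look something like this for mailto:. It is a deliberately simplified sketch: validate_mailto is a hypothetical name, only a single address is handled, and the full RFC6068 grammar is not attempted:

```php
<?php
// Hypothetical scheme-specific refinement: accept only mailto: URIs whose
// address part looks like an email address. Returns the value or false.
function validate_mailto($value)
{
    if (!is_string($value) || stripos($value, 'mailto:') !== 0) {
        return false;
    }
    // Drop any ?subject=... headers, then decode %-escapes in the address.
    $address = rawurldecode(explode('?', substr($value, 7), 2)[0]);
    return filter_var($address, FILTER_VALIDATE_EMAIL) !== false ? $value : false;
}

// filter_var hands the string to the callback and returns whatever it returns.
$checked = filter_var(
    'mailto:someone@example.com?subject=hi',
    FILTER_CALLBACK,
    ['options' => 'validate_mailto']
);
var_dump($checked);
```

The same shape would work for any other scheme the caller understands better than the generic filter does.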

I think the strongest (most useful and least surprising) filter would be to make it validate against RFC3986's "URI" production (basically, absolute-URI with optional #fragment). More generic would be FILTER_FLAG_RELATIVE_URI allowing URI-references ("blahblub", "localhost", "" and all). Being able to provide a base URI the filter can resolve them against would allow validating URI-references as stringently as absolute URIs.

This has given me an itch to scratch.

sneakyimp · May 12, 2017

Weedpacket, that all sounds right to me -- what I understand of it anyway. Please tell me you are writing this function

Weedpacket · May 16, 2017

sneakyimp;11061969 wrote:
Please tell me you are writing this function

Well, since you mentioned it... here's an experimental thing to see what sort of work it would entail.

<?php
namespace URI;

class URIException extends \DomainException
{
}

class URI
{

public static function validate(string $str): bool
{
	try
	{
		$parse = static::parse($str);
	}
	catch(URIException $e)
	{
		return false;
	}
	return true;
}

protected $scheme;

protected $userinfo;

protected $host;

protected $port;

protected $path;

protected $query;

protected $fragment;

private static function bad_pct_encoded($str): bool
{
	// Look for a % not followed by two hexdigits.
	return (bool)preg_match('/%(?![A-F0-9]{2})/i', $str);
}

private static function proper_path(string $path): bool
{
	if($path[0] == '/')
	{
		$path = substr($path, 1);
	}
	$parts = explode('/', $path);
	if($parts[0] == '')
	{
		return false;
	}
	foreach($parts as $part)
	{
		if(self::bad_pct_encoded($part))
		{
			return false;
		}
	}
	return true;
}

private static function proper_host(string $host): bool
{
	// Regular name
	if(!preg_match('/[^a-z0-9\-._~!$&\'()*+,;=:@%]/i', $host) && !self::bad_pct_encoded($host))
	{
		return true;
	}
	// IPv4 literal
	if(preg_match('/^([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)$/', $host, $match))
	{
		return !($match[1] > 255 || $match[2] > 255 || $match[3] > 255 || $match[4] > 255);
	}
	// Other IP address literals
	if($host[0] != '[' || $host[-1] != ']')
	{
		return false;
	}
	$host = substr($host, 1, -1);
	// IPvFuture
	if($host[0] == 'v' || $host[0] == 'V')
	{
		return (bool)preg_match('/^v[0-9a-f]+[!$&\'()*+,\-.0-9:;=_a-z~]+$/i', $host);
	}
	// IPv6
	// Regex from http://home.deds.nl/~aeron/regex/
	return (bool)preg_match('/^(((?=.*(::))(?!.*\3.+\3))\3?|([\dA-F]{1,4}(\3|:\b|$)|\2))(?4){5}((?4){2}|(((2[0-4]|1\d|[1-9])?\d|25[0-5])\.?\b){4})$/i', $host);
}

private static function parse_authority(string $authority): array
{
	// Peel any userinfo off start of authority.
	if(preg_match('/^([a-z0-9\-._~!$&\'()*+,;=:%]*)@/i', $authority, $match))
	{
		$authority = substr($authority, strlen($match[0]));
		$userinfo = $match[1];
	}
	else
	{
		$userinfo = null;
	}
	if(!preg_match('/^[!$%&\'()*+,\-.0-9:;=\[\]_a-z~]*$/i', $authority))
	{
		throw new URIException('Invalid authority part');
	}
	// Peel any port off end of authority
	if(preg_match('/(?<!:):([0-9]+)$/', $authority, $match))
	{
		$host = substr($authority, 0, -strlen($match[0]));
		// The port is a number...
		$port = (int)$match[1];
		if($port > 65535)
		{
			throw new URIException('Invalid port number');
		}
		// ...But is represented by a string of digits.
		$port = (string)$port;
	}
	else
	{
		$host = $authority;
		$port = null;
	}

	if($userinfo !== null && (self::bad_pct_encoded($userinfo) || preg_match('/[^a-z0-9\-._~!$&\'()*+,;=:@%]/i', $userinfo)))
	{
		throw new URIException('Invalid userinfo part');
	}
	if(!self::proper_host($host))
	{
		throw new URIException('Invalid host part');
	}

	return [$userinfo, $host, $port];
}

public static function parse(string $str): URI
{

	// Remove this test if and when IRIs are implemented.
	if(preg_match('/[^ -~]/', $str))
	{
		throw new URIException('URI supports ASCII only');
	}

	// Initial parsing regexp from RFC3986 Appendix B
	// Slightly modified to make scheme mandatory (we only deal with
	// absolute URIs, not URI-references)
	if(!preg_match('~^([^:/?#]+):(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~', $str, $match))
	// parentheses---1--------1-2--3-------32-4------45--6-----65-7-8--87--
	{
		throw new URIException('Invalid URI');
	}

	$scheme = $match[1];
	$authority = ($match[2] ?? '') !== '' ? $match[3] : null; // unmatched middle groups are '', not unset
	$path = $match[4];
	$query = ($match[5] ?? '') !== '' ? $match[6] : null;
	$fragment = ($match[7] ?? '') !== '' ? $match[8] : null;

	if(!preg_match('/^[a-z][a-z0-9\-+.]*$/i', $scheme))
	{
		throw new URIException('Invalid scheme component');
	}
	if($query !== null && (self::bad_pct_encoded($query) || preg_match('/[^a-z0-9\-._~!$&\'()*+,;=:@%\/?]/i', $query)))
	{
		throw new URIException('Invalid query component');
	}
	if($fragment !== null && (self::bad_pct_encoded($fragment) || preg_match('/[^a-z0-9\-._~!$&\'()*+,;=:@%\/?]/i', $fragment)))
	{
		throw new URIException('Invalid fragment component');
	}

	if($authority !== null && $authority !== '') // hmm...
	{
		[$userinfo, $host, $port] = self::parse_authority($authority);
	}
	else
	{
		$userinfo = $host = $port = null;
		if(!($path == '' || $path == '/' || self::proper_path($path)))
		{
			throw new URIException('Invalid path component');
		}
	}

	return new static($scheme, $userinfo, $host, $port, $path, $query, $fragment);
}

private static function canonical_pct_encoding($str)
{
	return ($str === null) ? null : preg_replace_callback('/%[0-9a-f]{2}/i', function ($s)
	{
		return strtoupper($s[0]);
	}, $str);
}

protected function __construct($scheme, $userinfo, $host, $port, $path, $query, $fragment)
{
	// Scheme names are canonically lower case. (RFC3986 §3.1, para.2)
	$scheme = strtolower($scheme);
	// Host names are canonically lower case (§3.2.2) apart from %-encoding
	if($host !== null)
	{
		$host = strtolower($host);
	}
	// %-encoding canonically uses uppercase A-F (§2.1)
	$userinfo = self::canonical_pct_encoding($userinfo);
	$query = self::canonical_pct_encoding($query);
	$fragment = self::canonical_pct_encoding($fragment);
	$host = self::canonical_pct_encoding($host);
	$path = self::canonical_pct_encoding($path);

	$this->scheme = $scheme;
	$this->userinfo = $userinfo;
	$this->host = $host;
	$this->port = $port;
	$this->path = $path;
	$this->query = $query;
	$this->fragment = $fragment;
}

public function __toString()
{
	$uri = $this->scheme . ':';
	if($this->host !== null)
	{
		$uri .= '//';
		if($this->userinfo !== null)
		{
			$uri .= $this->userinfo . '@';
		}
		$uri .= $this->host;
		if($this->port !== null)
		{
			$uri .= ':' . $this->port;
		}
	}
	$uri .= $this->path;
	if($this->query !== null)
	{
		$uri .= '?' . $this->query;
	}
	if($this->fragment !== null)
	{
		$uri .= '#' . $this->fragment;
	}
	return $uri;
}
}


class HttpUri extends URI
{

// http-URI = "http:" "//" authority path-abempty [ "?" query ] [ "#"
// fragment ]
public function parse(string $str): HttpUri
{
	parent::parse($str);
	// We have to have an authority (rather, a host, because we've parsed
	// userinfo and port already).
	if($this->scheme !== 'http' && $this->scheme !== 'https')
	{
		throw new URIException('scheme part does not identify http(s) URI');
	}
	if($this->host === null)
	{
		throw new URIException('http(s) URIs require an authority part');
	}
}
}

sneakyimp · May 16, 2017

Weedpacket;11062021 wrote:
Well, since you mentioned it... here's an experimental thing to see what sort of work it would entail.

Amazing!

Trying to use it (PHP 7.0.15-0ubuntu0.16.04.4) like so:

require_once "weedpacket.php";

$urls = array(
        "",
        "Buy It Now",
        "localhost/foo/bar",
        "blarg",
        "blarg/",
        "blarg/some/path/file.ext",
        "http://google.com",
        "http://google.com/",
        "http://google.com/some/path.ext",
        "http://google.com/some/path.ext?foo=bar",
        "example.com",
        "example.com/",
        "example.com/some/path/file.ext",
        "example.com/some/path/file.ext?foo=bar",
        "example.com:1234",
        "example.com:1234/",
        "example.com:1234/some/path/file.ext",
        "example.com:1234/some/path/file.ext?foo=bar",
        "//foobar.com",
        "//foobar.com/",
        "//foobar.com/path/file.txt",
        "//cdn.example.com/js_file.js"

);

foreach($urls as $u) {
	echo "url: $u\n";
	try {
		$v = URI\URI::parse($u);
		echo "OK?\n";
	} catch (Exception $e) {
		echo "EXCEPTION: " . $e->getMessage() . "\n";
	}
	echo "\n\n"; 
}

yields some errors:

PHP Parse error:  syntax error, unexpected '=' in /home/jaith/2017-05-16-url-check/weedpacket.php on line 179

Changing that line to this:

list($userinfo, $host, $port) = self::parse_authority($authority);

solves the problem, at least temporarily. I'm not sure my change preserves the intended operation.
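Assuming the original line used the short [...] = destructuring form (I haven't confirmed that; it would explain a parse error at '=' on PHP 7.0, since that syntax only arrived in 7.1), list() assigns exactly the same values. A tiny sketch, with parse_authority_stub as a hypothetical stand-in for the real method:

```php
<?php
// Hypothetical stand-in for URI::parse_authority(); returns [userinfo, host, port].
function parse_authority_stub(string $authority): array
{
    return [null, $authority, null];
}

// list() destructuring parses on PHP 7.0 as well as on later versions.
list($userinfo, $host, $port) = parse_authority_stub('example.com');
var_dump($userinfo, $host, $port);
```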

But then I get this error:

PHP Fatal error:  Cannot make static method URI\URI::parse() non static in class URI\HttpUri in /home/jaith/2017-05-16-url-check/weedpacket.php on line 275

The problem seems to be that URI::parse is a static function and called statically, whereas HttpUri::parse is not static and refers to $this so it cannot be static.

I modified the HttpUri class to try and remedy the situation:

class HttpUri extends URI
{

// http-URI = "http:" "//" authority path-abempty [ "?" query ] [ "#"
// fragment ]
public static function parse(string $str): HttpUri
{
    $v = parent::parse($str);
    // We have to have an authority (rather, a host, because we've parsed
    // userinfo and port already).
    // Reject unless the scheme is http or https.
    if($v->scheme !== 'http' && $v->scheme !== 'https')
    {
        throw new URIException('scheme part does not identify http(s) URI');
    }
    if($v->host === null)
    {
        throw new URIException('http(s) URIs require an authority part');
    }
}
}

But this leads to complaints about the type hinting:

PHP Fatal error:  Declaration of URI\HttpUri::parse(string $str): URI\HttpUri must be compatible with URI\URI::parse(string $str): URI\URI in /home/jaith/2017-05-16-url-check/weedpacket.php on line 275

If I remove the return type hint from URI::parse, things seem better.

The output:

url: 
EXCEPTION: Invalid URI. Preliminary regex failed.


url: Buy It Now
EXCEPTION: Invalid URI. Preliminary regex failed.


url: localhost/foo/bar
EXCEPTION: Invalid URI. Preliminary regex failed.


url: blarg
EXCEPTION: Invalid URI. Preliminary regex failed.


url: blarg/
EXCEPTION: Invalid URI. Preliminary regex failed.


url: blarg/some/path/file.ext
EXCEPTION: Invalid URI. Preliminary regex failed.


url: http://google.com
OK?


url: http://google.com/
OK?


url: http://google.com/some/path.ext
OK?


url: http://google.com/some/path.ext?foo=bar
OK?


url: example.com
EXCEPTION: Invalid URI. Preliminary regex failed.


url: example.com/
EXCEPTION: Invalid URI. Preliminary regex failed.


url: example.com/some/path/file.ext
EXCEPTION: Invalid URI. Preliminary regex failed.


url: example.com/some/path/file.ext?foo=bar
EXCEPTION: Invalid URI. Preliminary regex failed.


url: example.com:1234
OK?


url: example.com:1234/
OK?


url: example.com:1234/some/path/file.ext
OK?


url: example.com:1234/some/path/file.ext?foo=bar
OK?


url: //foobar.com
EXCEPTION: Invalid URI. Preliminary regex failed.


url: //foobar.com/
EXCEPTION: Invalid URI. Preliminary regex failed.


url: //foobar.com/path/file.txt
EXCEPTION: Invalid URI. Preliminary regex failed.


url: //cdn.example.com/js_file.js
EXCEPTION: Invalid URI. Preliminary regex failed.
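A side note on the static and return-type errors: one way to sidestep both on PHP 7.0 might be to leave the return type off parse() and build the object with "new static", so that a subclass calling parent::parse() gets back an instance of itself. This is only a minimal sketch under those assumptions. BaseUri and HttpOnlyUri are hypothetical stand-ins, the scheme split is deliberately crude, and none of it is Weedpacket's code.

```php
<?php
// Hypothetical stand-ins for URI and HttpUri; only the scheme is parsed here.
class BaseUri
{
    protected $scheme;

    protected function __construct($scheme)
    {
        $this->scheme = strtolower($scheme);
    }

    // No return type: PHP 7.0 has no covariant return types, so a subclass
    // could not narrow ": BaseUri" to ": HttpOnlyUri".
    public static function parse(string $str)
    {
        if (!preg_match('~^([a-z][a-z0-9+.\-]*):~i', $str, $match)) {
            throw new DomainException('no scheme found');
        }
        // "new static" instantiates the class that parse() was called on.
        return new static($match[1]);
    }

    public function getScheme()
    {
        return $this->scheme;
    }
}

class HttpOnlyUri extends BaseUri
{
    public static function parse(string $str)
    {
        // parent:: forwards the late static binding, so $v is an HttpOnlyUri.
        $v = parent::parse($str);
        // Reject only when the scheme is neither http nor https.
        if ($v->scheme !== 'http' && $v->scheme !== 'https') {
            throw new DomainException('scheme part does not identify http(s) URI');
        }
        return $v;
    }
}
```

With this shape, HttpOnlyUri::parse('https://example.com/') returns an HttpOnlyUri, while BaseUri::parse() on the same string returns a plain BaseUri.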

sneakyimp · May 16, 2017

The modified code:

namespace URI;

class URIException extends \DomainException
{
}

class URI
{

public static function validate(string $str): bool
{
    try
    {
        $parse = static::parse($str);
    }
    catch(URIException $e)
    {
        return false;
    }
    return true;
}

protected $scheme;

protected $userinfo;

protected $host;

protected $port;

protected $path;

protected $query;

protected $fragment;

private static function bad_pct_encoded($str): bool
{
    // Look for a % not followed by two hexdigits.
    return (bool)preg_match('/%(?![A-F0-9]{2})/i', $str);
}

private static function proper_path(string $path): bool
{
    if($path[0] == '/')
    {
        $path = substr($path, 1);
    }
    $parts = explode('/', $path);
    if($parts[0] == '')
    {
        return false;
    }
    foreach($parts as $part)
    {
        if(!self::bad_pct_encoded($part))
        {
            return false;
        }
    }
    return true;
}

private static function proper_host(string $host): bool
{
    // Regular name
    if(!preg_match('/[^a-z0-9\-._~!$&\'()*+,;=:@%]/i', $host) && !self::bad_pct_encoded($host))
    {
        return true;
    }
    // IPv4 literal
    if(preg_match('/^([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)$/', $host, $match))
    {
        return !($match[1] > 255 || $match[2] > 255 || $match[3] > 255 || $match[4] > 255);
    }
    // Other IP address literals
    // substr() rather than $host[-1]: negative string offsets need PHP 7.1+.
    if($host[0] != '[' || substr($host, -1) != ']')
    {
        return false;
    }
    $host = substr($host, 1, -1);
    // IPvFuture
    if($host[0] == 'v' || $host[0] == 'V')
    {
        return (bool)preg_match('/^v[0-9a-f]+[!$&\'()*+,\-.0-9:;=_a-z~]+$/i', $host);
    }
    // IPv6
    // Regex from http://home.deds.nl/~aeron/regex/
    return (bool)preg_match('/^(((?=.*(::))(?!.*\3.+\3))\3?|([\dA-F]{1,4}(\3|:\b|$)|\2))(?4){5}((?4){2}|(((2[0-4]|1\d|[1-9])?\d|25[0-5])\.?\b){4})$/i', $host);
}

private static function parse_authority(string $authority): array
{
    // Peel any userinfo off start of authority.
    if(preg_match('/^([a-z0-9\-._~!$&\'()*+,;=:%]*)@/i', $authority, $match))
    {
        $authority = substr($authority, strlen($match[0]));
        $userinfo = $match[1];
    }
    else
    {
        $userinfo = null;
    }
    if(!preg_match('/^[!$%&\'()*+,\-.0-9:;=\[\]_a-z~]*$/i', $authority))
    {
        throw new URIException('Invalid authority part');
    }
    // Peel any port off end of authority
    if(preg_match('/(?<!:):([0-9]+)$/', $authority, $match))
    {
        $host = substr($authority, 0, -strlen($match[0]));
        // The port is a number...
        $port = (int)$match[1];
        if($port > 65535)
        {
            throw new URIException('Invalid port number');
        }
        // ...But is represented by a string of digits.
        $port = (string)$port;
    }
    else
    {
        $host = $authority;
        $port = null;
    }

    if($userinfo !== null && (self::bad_pct_encoded($userinfo) || preg_match('/[^a-z0-9\-._~!$&\'()*+,;=:@%]/i', $userinfo)))
    {
        throw new URIException('Invalid userinfo part');
    }
    if(!self::proper_host($host))
    {
        throw new URIException('Invalid host part');
    }

    return [$userinfo, $host, $port];
}

public static function parse(string $str)
{

    // Remove this test if and when IRIs are implemented.
    if(preg_match('/[^ -~]/', $str))
    {
        throw new URIException('URI supports ASCII only');
    }

    // Initial parsing regexp from RFC3986 Appendix B
    // Slightly modified to make scheme mandatory (we only deal with
    // absolute URIs, not URI-references)
    if(!preg_match('~^([^:/?#]+):(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~', $str, $match))
    // parentheses---1--------1-2--3-------32-4------45--6-----65-7-8--87--
    {
        throw new URIException('Invalid URI. Preliminary regex failed.');
    }

    $scheme = $match[1];
    $authority = isset($match[2]) ? $match[3] : null;
    $path = $match[4];
    $query = isset($match[5]) ? $match[6] : null;
    $fragment = isset($match[7]) ? $match[8] : null;

    if(!preg_match('/^[a-z][a-z0-9\-+.]*$/i', $scheme))
    {
        throw new URIException('Invalid scheme component');
    }
    if($query !== null && (self::bad_pct_encoded($query) || preg_match('/[^a-z0-9\-._~!$&\'()*+,;=:@%\/?]/i', $query)))
    {
        throw new URIException('Invalid query component');
    }
    if($fragment !== null && (self::bad_pct_encoded($fragment) || preg_match('/[^a-z0-9\-._~!$&\'()*+,;=:@%\/?]/i', $fragment)))
    {
        throw new URIException('Invalid fragment component');
    }

    if($authority !== null && $authority !== '') // hmm...
    {
        list($userinfo, $host, $port) = self::parse_authority($authority);
    }
    else
    {
        $userinfo = $host = $port = null;
        if(!($path == '' || $path == '/' || !self::proper_path($path)))
        {
            throw new URIException('Invalid path component');
        }
    }

    return new self($scheme, $userinfo, $host, $port, $path, $query, $fragment);
}

private static function canonical_pct_encoding($str)
{
    return ($str === null) ? null : preg_replace_callback('/%[0-9a-f]{2}/i', function ($s)
    {
        return strtoupper($s[0]);
    }, $str);
}

protected function __construct($scheme, $userinfo, $host, $port, $path, $query, $fragment)
{
    // Scheme names are canonically lower case. (RFC3986 §3.1, para.2)
    $scheme = strtolower($scheme);
    // Host names are canonically lower case (§3.2.2) apart from %-encoding
    if($host !== null)
    {
        $host = strtolower($host);
    }
    // %-encoding canonically uses uppercase A-F (§2.1)
    $userinfo = self::canonical_pct_encoding($userinfo);
    $query = self::canonical_pct_encoding($query);
    $fragment = self::canonical_pct_encoding($fragment);
    $host = self::canonical_pct_encoding($host);
    $path = self::canonical_pct_encoding($path);

    $this->scheme = $scheme;
    $this->userinfo = $userinfo;
    $this->host = $host;
    $this->port = $port;
    $this->path = $path;
    $this->query = $query;
    $this->fragment = $fragment;
}

public function __toString()
{
    $uri = $this->scheme . ':';
    if($this->host !== null)
    {
        $uri .= '//';
        if($this->userinfo !== null)
        {
            $uri .= $this->userinfo . '@';
        }
        $uri .= $this->host;
        if($this->port !== null)
        {
            $uri .= ':' . $this->port;
        }
    }
    $uri .= $this->path;
    if($this->query !== null)
    {
        $uri .= '?' . $this->query;
    }
    if($this->fragment !== null)
    {
        $uri .= '#' . $this->fragment;
    }
    return $uri;
}
}


class HttpUri extends URI
{

// http-URI = "http:" "//" authority path-abempty [ "?" query ] [ "#"
// fragment ]
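public static function parse(string $str): HttpUri
{
    $v = parent::parse($str);
    // Reject unless the scheme is http or https.
    if($v->scheme !== 'http' && $v->scheme !== 'https')
    {
        throw new URIException('scheme part does not identify http(s) URI');
    }
    // We have to have an authority (rather, a host, because we've parsed
    // userinfo and port already).
    if($v->host === null)
    {
        throw new URIException('http(s) URIs require an authority part');
    }
    // URI::parse() builds a plain URI, so rebuild it as an HttpUri to
    // satisfy the declared return type.
    return new self($v->scheme, $v->userinfo, $v->host, $v->port, $v->path, $v->query, $v->fragment);
}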
}

sneakyimp · May 16, 2017

OK I've been running some tests and things are looking mostly good, but I see some behavior which seems noteworthy:
* urls containing spaces, including trailing spaces, are passed as valid, e.g.:

$url = "http://example.com/lots of  space ";

Is this intentional? I see some discussion of whitespace in Appendix C of RFC 3986, but I kind of thought that spaces would become + or %20 in a proper URL. Seems really weird to be talking about urls broken by line wraps.
* urls with newline or tab chars have incorrect exception message:

URI supports ASCII only
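Both observations seem to trace back to the single [^ -~] test in URI::parse: a space sits inside the range from space to tilde and so is accepted, while tab and newline are ASCII but fall outside that range and so get the "ASCII only" message. A rough sketch of how the checks might be separated follows. The function name check_uri_characters and the two extra messages are my own assumptions, not part of the posted class.

```php
<?php
// Hypothetical helper: returns null when every character is acceptable,
// otherwise a message naming the kind of offending character.
function check_uri_characters(string $str)
{
    // Bytes above 0x7F are not ASCII at all.
    if (preg_match('/[\x80-\xFF]/', $str)) {
        return 'URI supports ASCII only';
    }
    // ASCII control characters (tab, newline, and so on) plus DEL.
    if (preg_match('/[\x00-\x1F\x7F]/', $str)) {
        return 'URI contains a control character';
    }
    // A literal space is printable ASCII but has no place in the RFC 3986
    // grammar; it would have to be percent-encoded.
    if (strpos($str, ' ') !== false) {
        return 'URI contains an unencoded space';
    }
    return null;
}
```

Whether a space should be rejected outright or quietly encoded is a design choice; this sketch only reports it.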

Also, NO ONE has responded to my email to the PHP internals list. If anyone is a member of this list, feel free to chime in and support me.

Derokorian · May 16, 2017

sneakyimp;11062035 wrote:
OK I've been running some tests and things are looking mostly good, but I see some behavior which seems noteworthy:
* urls containing spaces, including trailing spaces, are passed as valid, e.g.:
$url = "http://example.com/lots of  space ";
Is this intentional? I see some discussion of whitespace in Appendix C of RFC 3986, but I kind of thought that spaces would become + or %20 in a proper URL. Seems really weird to be talking about urls broken by line wraps.
* urls with newline or tab chars have incorrect exception message:
URI supports ASCII only
Also, NO ONE has responded to my email to the PHP internals list. If anyone is a member of this list, feel free to chime in and support me.

What's not valid about it? That it hasn't been encoded with + or %20 yet? I'm not sure what the problem is; every tool I know of, including curl, wget, and browsers, will automatically change that at the time of the request.
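To make the distinction concrete: the RFC 3986 grammar has no literal space, so a strict validator would only accept the string after it has been encoded. A minimal sketch of encoding a path segment by segment with rawurlencode(), which yields %20 where urlencode() would yield +. The helper name encode_path is hypothetical.

```php
<?php
// Hypothetical helper: percent-encode each path segment, keeping the slashes.
function encode_path(string $path): string
{
    return implode('/', array_map('rawurlencode', explode('/', $path)));
}

echo 'http://example.com' . encode_path('/lots of  space '), "\n";
// prints http://example.com/lots%20of%20%20space%20
```

Note that this would double-encode a path that already contains %-escapes, so it only suits raw, not-yet-encoded input.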
