Weedpacket;11044815 wrote:Well, you'll need to make sure your literals are encoded in the source code using UTF-8 (whether your command line does that I don't know, but I suspect it may not).
As I am using Ubuntu, and based on most of the behavior I've seen, I think that I have managed to properly encode my source as UTF-8. The mb_* functions seem to properly recognize å as lowercase and Å as uppercase and convert between the two. I have also used xxd to inspect the contents of the text files I have created containing UTF-8 strings and these appear to match the binary encodings that I've seen in various UTF-8 charts. I believe I mentioned this above.
Weedpacket;11044815 wrote:If you want to see if a single character appears at least three times consecutively you don't need to see how many times it appears, just whether it appears at least three times: [font=monospace]/(.)\1\1/u[/font] would be sufficient (this would also avoid matching "qwerqwerqwer").
Thanks for pointing out that these regexes will also match repeated patterns. The OWASP guidelines don't seem to take exception to that kind of repetition so I expect I'll adopt your pattern.
Weedpacket;11044815 wrote:
I suppose you could use the Unicode datasets to extract a comprehensive list of characters and their categories.
Oh my. I was sniffing around those files a little bit and immediately became aware that Unicode is quite complicated.
Weedpacket;11044815 wrote:Generate a UTF-8-encoded list of characters:
for($i = 0; $i < 65536; $i++)
{
    $l = chr($i >> 8);
    $r = chr($i & 255);
    $o = iconv('UCS-2', 'UTF-8', $l.$r);
    echo dechex($i)," = ",$o,"\n";
}
Thanks for that bit of code. I'm not at all sure what the bitshifting is about, but it looks to me like you are traversing the ordinal space of UCS-2 (old-school UTF-16??) chars and generating their UTF-8 counterparts using iconv. Interestingly, that code generates an E_NOTICE for 2048 of those ordinals. Here are a few:
d8, d9, da, db, dc, dd, de, df, 1d8, 1d9, 1da, 1db, 1dc, 1dd, 1de, 1df, 2d8, 2d9, 2da, 2db, 2dc, 2dd, 2de, 2df...
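For what it's worth, 2048 is exactly the size of the UTF-16 surrogate range (U+D800 to U+DFFF), and the ordinals listed above all have d8 through df in the low byte, which would fit iconv reading plain 'UCS-2' in little-endian byte order on this machine. That is a guess on my part, not something I have confirmed. A minimal sketch under that assumption, naming the byte order explicitly as 'UCS-2BE' and skipping surrogates (the function name bmp_code_point_to_utf8 is made up for illustration):

```php
<?php
// Hypothetical helper: convert one BMP code point (0..0xFFFF) to UTF-8.
// Returns null for surrogates, which are not characters on their own.
function bmp_code_point_to_utf8($cp) {
    if ($cp >= 0xD800 && $cp <= 0xDFFF) {
        return null;
    }
    // pack('n', ...) emits an unsigned 16-bit big-endian value,
    // matching the explicit 'UCS-2BE' source encoding.
    return iconv('UCS-2BE', 'UTF-8', pack('n', $cp));
}

$skipped = 0;
for ($i = 0; $i < 65536; $i++) {
    $o = bmp_code_point_to_utf8($i);
    if ($o === null) {
        $skipped++;   // count the surrogates instead of converting them
        continue;
    }
    echo dechex($i), " = ", $o, "\n";
}
echo "skipped ", $skipped, " surrogates\n";
```

If the byte-order guess is right, the skipped count should line up with the number of notices from the original loop.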
I'm guessing you stopped at 65536 because that exhausts the UCS-2 ordinal space, since it's only a 2-byte encoding scheme. Note that this would mean that we never get around to 😼, the cat face with wry smile char (U+1F63C).
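To reach characters beyond U+FFFF such as 😼, one option would be a 4-byte source encoding instead of UCS-2. A small sketch, assuming the system's iconv supports 'UTF-32BE' (the function name code_point_to_utf8 is hypothetical):

```php
<?php
// Hypothetical helper: convert any Unicode scalar value to UTF-8.
function code_point_to_utf8($cp) {
    // Reject surrogates and values outside the Unicode range.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        return null;
    }
    // pack('N', ...) emits an unsigned 32-bit big-endian value.
    return iconv('UTF-32BE', 'UTF-8', pack('N', $cp));
}

echo bin2hex(code_point_to_utf8(0x1F63C)), "\n"; // f09f98bc
```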
I doubt I'll attempt the repeated char checker on the entire UTF-8 space. I've tested this function with 😼 and it seems to recognize when there are 3 in a row and when there are not:
function string_repeats_char_too_much($str) {
    // TRUE when any single character occurs three times in a row
    return preg_match('/(.)\1\1/u', $str) === 1;
}
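A few quick checks of that pattern, written with byte escapes so the result does not depend on how the source file is encoded. The last pair is my own illustration of why the /u modifier matters: U+10410 happens to encode as F0 90 90 90, so a byte-wise match sees three identical bytes inside a single character.

```php
<?php
$cat = "\xF0\x9F\x98\xBC";      // U+1F63C, cat face with wry smile
$a_ring = "\xC3\xA5";           // U+00E5, å
$deseret = "\xF0\x90\x90\x90";  // U+10410, one character, three 0x90 bytes

var_dump(preg_match('/(.)\1\1/u', 'aaab'));                     // int(1)
var_dump(preg_match('/(.)\1\1/u', 'qwerqwerqwer'));             // int(0)
var_dump(preg_match('/(.)\1\1/u', $a_ring . $a_ring . $a_ring)); // int(1)
var_dump(preg_match('/(.)\1\1/u', $cat . $cat));                // int(0)
var_dump(preg_match('/(.)\1\1/u', $cat . $cat . $cat));         // int(1)

// With /u the subject is one character; without it, PCRE matches bytes.
var_dump(preg_match('/(.)\1\1/u', $deseret));                   // int(0)
var_dump(preg_match('/(.)\1\1/', $deseret));                    // int(1)
```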
I'm trying to use the Unicode preg matching to concoct a regex that will match not just Latin digits (0-9) but also Chinese numerals, Japanese numerals, etc.:
function string_has_digit($str) {
    // TRUE when the string contains at least one character in Unicode category N (Number)
    return preg_match("/\p{N}+/u", $str) === 1;
}
It seems to work for Latin digits, but curiously, 七, the Japanese char for 7, returns false:
var_dump(string_has_digit("七"));
Using my string_has_digit function against your loop above, I get 691 chars that return TRUE. ㈦ is one of these, 七 is not. WEIRD.
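A possible explanation, as far as I can tell from the Unicode character data: 七 (U+4E03) has general category Lo (Other Letter), like other CJK ideographs, while ㈦ (U+3226) is No (Other Number), so \p{N} only catches the latter. A sketch of a workaround, assuming a hand-picked and deliberately incomplete list of Han numerals is acceptable (the function name and the list are mine, and the file must be saved as UTF-8):

```php
<?php
// Hypothetical variant of string_has_digit that also accepts common Han numerals.
function string_has_digit_or_han_numeral($str) {
    // \p{N} covers the Nd, Nl and No categories; the listed ideographs are
    // category Lo and have to be named explicitly. The list is not exhaustive.
    return preg_match('/[\p{N}一二三四五六七八九十百千]/u', $str) === 1;
}

var_dump(string_has_digit_or_han_numeral("七"));   // bool(true)
var_dump(string_has_digit_or_han_numeral("㈦"));   // bool(true)
var_dump(string_has_digit_or_han_numeral("abc"));  // bool(false)
```

The obvious downside is that ordinary words containing these ideographs would also count as "having a digit", so whether this is wanted for a password check is a judgment call.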