regular expression to check file name

dberebi · Nov 17, 2007

I'm working on a CMS that should let users choose the file name of the published content:
www.domain.com/cms/file-name.php

I want to allow only alphanumeric characters, underscores and hyphens, so I used this regex:
eregi('^{[a-z]{1}[\w-]{0,}$',$_POST['fname'])}

but this is always false....
why???

YAOMK · Nov 17, 2007

preg_match('/^[a-zA-Z0-9\-\_]*\Z/', $_POST['fname'])

dberebi · Nov 17, 2007

But what does the '\Z' do?
I also want the first character to be a letter.

YAOMK · Nov 17, 2007

The \Z marks the true end in Perl-based regular expressions. It is popular belief that the $ marks the end, but in reality it does not.

$pattern = '/^[a-zA-Z][a-zA-Z0-9\-\_]*\Z/';

dberebi · Nov 17, 2007

So is '^' the true start?
And what is the difference between '$' and '\Z'?

thanks for replying...

dberebi · Nov 17, 2007

I checked, and I should also use '\A' instead of '^':

^ Match the beginning of the line
. Match any character (except newline)
$ Match the end of the line (or before newline at the end)
\b Match a word boundary
\B Match a non-(word boundary)
\A Match only at beginning of string
\Z Match only at end of string, or before newline at the end
\z Match only at end of string
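
The end anchors in the list above can be compared directly with PHP's preg functions. The following is a minimal sketch, assuming PCRE's default (non-multi-line) mode; the subject string and array keys are made up for illustration.

```php
<?php
// Subject with a trailing newline, as a form or script could submit it.
$subject = "abc\n";

// Each pattern tries to match exactly "abc" using a different end anchor.
$patterns = [
    'dollar'  => '/^abc$/',    // $  : end of string, or before a final newline
    'upper_Z' => '/\Aabc\Z/',  // \Z : end of string, or before a final newline
    'lower_z' => '/\Aabc\z/',  // \z : very end of string only
];

$results = [];
foreach ($patterns as $name => $pattern) {
    // preg_match() returns 1 on a match and 0 on no match.
    $results[$name] = preg_match($pattern, $subject);
}

print_r($results);
```

Only the \z variant rejects the trailing newline; $ and \Z both accept it.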

NogDog · Nov 17, 2007

As the preg functions are preferred over the ereg functions, I would suggest:

if(preg_match('/^[a-z][-\w]*$/i', $_POST['fname']))
{
   // valid
}
else
{
   // invalid
}
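
One caveat with the pattern above: in default mode `$` also matches just before a trailing newline, so a value such as "file-name\n" would pass. A minimal sketch of a stricter check, assuming a hypothetical helper named is_valid_fname (the name is not from this thread) and swapping the anchors for \A and \z:

```php
<?php
// Hypothetical helper; the function name is an assumption for illustration.
// Rule: a letter first, then any number of letters, digits,
// underscores or hyphens.
function is_valid_fname(string $fname): bool
{
    // \A and \z anchor to the absolute start and end of the string,
    // so a trailing newline is rejected as well.
    return preg_match('/\A[a-z][-\w]*\z/i', $fname) === 1;
}

var_dump(is_valid_fname('file-name_2'));  // bool(true)
var_dump(is_valid_fname("file-name\n"));  // bool(false)
var_dump(is_valid_fname('2file'));        // bool(false)
```

Whether the extra strictness matters depends on how the name is used afterwards; for a value that becomes part of a file path it seems the safer default.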

laserlight · Nov 17, 2007

"It is popular belief that the $ marks the end, but in reality it does not."

It does mark the end if it is not in multi-line mode via the m modifier.

YAOMK · Nov 17, 2007

I read somewhere that the caret ^ and dollar $ (more so the dollar) are vulnerable when using regular expressions to validate user input. My understanding is that the anchors \A and \Z match against the whole input, unlike their counterparts; you can read more about it here.

"It does mark the end if it is not in multi-line mode via the m modifier."

That is interesting. I'm aware of the difference between multi-line and single-line modes, but in the discussion I observed, their examples didn't include the m modifier. I guess their code could have been buggy.

laserlight · Nov 17, 2007

"I read somewhere that the caret ^ and dollar $ (more so the dollar) are vulnerable when using regular expressions to validate user input."

I believe you interpreted the article wrongly. What vulnerability does it state?

"Interesting, I'm aware of the difference between multi-line and single-line modes, but in the discussion I observed, their examples didn't include the m modifier."

Perhaps you missed:

"In Perl, you do this by adding an m after the regex code, like this: m/^regex$/m;."

If you are talking about:

"Let's see what happens when we try to match ^4$ to 749\n486\n4 (where \n represents a newline character) in multi-line mode."

Then I note that ^4$ is an incomplete PCRE pattern to begin with since it lacks delimiters, and for this example a correct full pattern would be: /^4$/m

If this is what you refer to by "vulnerability", then I must say it is not: clearly the third line matches the pattern. If one intends to match the entire string against '4', then of course this is wrong in multi-line mode. On the other hand, if one wants to match a line against '4', then /\A4\Z/m is wrong. So, perhaps regex is inherently vulnerable?
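
The example discussed above can be reproduced with preg_match. This is a small sketch, assuming the subject string from the quoted article; the variable names are illustrative only.

```php
<?php
// "749", "486" and "4" on three separate lines.
$subject = "749\n486\n4";

// Multi-line mode: ^ and $ match at every line boundary,
// so the third line ("4") satisfies the pattern.
$multiline = preg_match('/^4$/m', $subject);

// Default mode: ^ and $ anchor to the whole string,
// which is not just "4", so there is no match.
$default = preg_match('/^4$/', $subject);

// \A and \Z are unaffected by the m modifier and
// still refer to the whole string.
$absolute = preg_match('/\A4\Z/m', $subject);

var_dump($multiline, $default, $absolute);  // int(1), int(0), int(0)
```

Which of these results is "correct" depends on whether the intent is to match a line or the entire string.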

YAOMK · Nov 17, 2007

Laserlight - Thanks for elaborating on the subject. Unfortunately I can't seem to find the discussion where the issue was raised.

From what I recall, their pattern was well formed, with Perl-like delimiters etc., but without \n or m. The vulnerability they discussed was that part of the string was not being matched against, just as you showed with multi-line content and vice versa. My guess is that I missed the m or something to that effect.

NogDog · Nov 17, 2007

If using the PCRE (preg_*) regexp functions, unless you use the "m" modifier (after the closing delimiter), the "^" will be the very beginning of the string and the "$" will be the very end of the string (or just before a terminating newline, unless the D modifier is set), regardless of the number of newlines.

m (PCRE_MULTILINE)

By default, PCRE treats the subject string as consisting of a single "line" of characters (even if it actually contains several newlines). The "start of line" metacharacter (^) matches only at the start of the string, while the "end of line" metacharacter ($) matches only at the end of the string, or before a terminating newline (unless D modifier is set). This is the same as Perl.

When this modifier is set, the "start of line" and "end of line" constructs match immediately following or immediately before any newline in the subject string, respectively, as well as at the very start and end. This is equivalent to Perl's /m modifier. If there are no "\n" characters in a subject string, or no occurrences of ^ or $ in a pattern, setting this modifier has no effect.

(From the PCRE modifiers page)
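
The "unless D modifier is set" clause and the final sentence of the quoted text can be checked with a few calls. A minimal sketch, with made-up subject strings and variable names:

```php
<?php
$withNewline = "name\n";

// Default: $ also matches just before a terminating newline.
$plain = preg_match('/^name$/', $withNewline);

// D (PCRE_DOLLAR_ENDONLY): $ matches only at the very end of the string.
$dollarEndOnly = preg_match('/^name$/D', $withNewline);

// With no newline in the subject, the m modifier changes nothing.
$noNewlineDefault   = preg_match('/^name$/', 'name');
$noNewlineMultiline = preg_match('/^name$/m', 'name');

// int(1), int(0), int(1), int(1)
var_dump($plain, $dollarEndOnly, $noNewlineDefault, $noNewlineMultiline);
```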

YAOMK · Nov 17, 2007

Thanks for your insight Nog.