[RESOLVED] preg_replace and regex help!

corcode

Hey.

So I've got a text file and I'm looking to parse through its contents line by line and read certain bits of data into a database.

To this end I've read the contents of the file into an array and I'm going through it line by line, identifying the patterns using preg_replace so as to filter out the bits I don't need, and assign the values of what I do need to a variable.

The basic structure is this:

File.txt contains this:

User 1: corcode (1500 points)
User 2: corcode2 (500 points)
User 3: corcode2 (600 points)
User 4: corcode2 (2400 points)

The info I'm looking from each line is the User number, the user name, and the number of points.

For every line of the file that's like this, I'm using the preg_match function as follows:

preg_match('/User (?P<user_num>\w+): (?P<username>\w+).(?P<points>\w+) points/', $file[$line], $user);

This, for the most part, allows me to store the values using the results, $user[user_num], $user[username] and $user[points].

The problem is that there's no real format to what the usernames in the text file have to adhere to, so usernames like cor.cod.e aren't caught.

I then used \S instead of \w in the username subpattern, but there are also usernames which include whitespaces (eg: "cor code");

This has me kind of stumped, it's not essential that I use a named sub pattern, so long as I can get the info I need in whatever way that works I'm happy enough, but I'm pretty new to reg expressions and can't think of how I would catch all 3 types of patterns.

Is there an easy way that I'm overlooking to catch:

User 1: Corcode2 (1500 points)
User 2: Cor Code5 (1500 points)
User 3: Cor-code99 (1500 points)
User 4: Cor.co.de (1500 points)

The rest of the format never changes, the username is always prefixed by "User $no: " and suffixed by " (xxxx points)", is there a way to catch everything in between there (not including the whitespaces following the ":" and before the "(")?

Also, I'm having a bit of bother with escaping metacharacters at another part of each file.

There's a list of payments, which I'm looking to break down and enter individually.

In the file, they'd be displayed as:

$6.50+$4.30+$3.20 USD

But the amounts are variable, so I was looking to use subpatterns <payment1>, <payment2>, <payment3> to get each individual one as a decimal figure without the dollar sign (6.50, 4.30, 3.20). I've tried backslashing the dollar/plus signs and every combination of \Q's, and looked up loads of tutorials, but just can't see anything that readily applies.

Any help at all is much appreciated!
Thanks in advance, corcode.

PradeepKr

Let's work step by step,

Take a backup of your file, and then,

In place of this,

preg_match('/User (?P<user_num>\w+): (?P<username>[^(]+).(?P<points>\w+) points/', $file[$line], $user);

Try this,

preg_match('/User\s*(?P<user_num>\w+):\s*(?P<username>[^(]+)\((?P<points>\d+)\s*points\)/', $file[$line], $user);

Note: You might need to trim spaces from username, trim($user['username'])
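
A minimal sketch of the pattern above in a loop, assuming the sample lines from the question. The $lines array is a hypothetical stand-in for the $file array, and the trim() call removes the trailing space that [^(]+ picks up before the "(".

```php
<?php
// Sample lines from the question; $lines stands in for the real $file array.
$lines = [
    'User 1: Corcode2 (1500 points)',
    'User 2: Cor Code5 (1500 points)',
    'User 3: Cor-code99 (1500 points)',
    'User 4: Cor.co.de (1500 points)',
];

$pattern = '/User\s*(?P<user_num>\w+):\s*(?P<username>[^(]+)\((?P<points>\d+)\s*points\)/';

$users = [];
foreach ($lines as $line) {
    // preg_match returns 1 on a match, 0 otherwise
    if (preg_match($pattern, $line, $user)) {
        $users[] = [
            'user_num' => $user['user_num'],
            // [^(]+ also swallows the space before "(", so trim it off
            'username' => trim($user['username']),
            'points'   => (int) $user['points'],
        ];
    }
}

foreach ($users as $u) {
    echo $u['user_num'], ' | ', $u['username'], ' | ', $u['points'], PHP_EOL;
}
```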

nevvermind

First case:

preg_match('#User (?P<user_num>\d*): (?P<username>.*)(?= \(\d* points\)) \((?P<points>\d*)#', $string, $user);

Let's break it down! 🆒 We'll discard the "?P<whatever>" just to have a clear idea.

User  - matches that literal string AND the space after it. The same goes for every other space in the expression: it matches a literal space.
The regex above uses "#" as its delimiter - just my fancy.
(\d*) - matches 0 or more (*) digits (\d)
(.*)(?= \(\d* points\)) - matches ANY (.) characters, 0 or more times (*), IF followed (?=) by a space, a left-bracket, 0 or more digits, a space, the letters p,o,i,n,t,s and a right-bracket. Prototype: (<expression_matched_if_followed_by_expression_1>)(?=<expression_1>). That's "positive lookahead".
\((\d*) - matches a left-bracket and 0 or more digits. I chose not to match the " points" part here, because the lookahead above already checks it for us.
You must remember, and I think you know this, that ANY space in the pattern is matched literally; it's not just separation between expressions. So the regex "hello, world" will not match "hello,world".
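
To see the breakdown in action, here is a small sketch that runs the first-case pattern on one of the sample lines from the question. The variable names are only illustrative, and it assumes exactly one space on either side of the username, as in the sample file.

```php
<?php
$string  = 'User 2: Cor Code5 (1500 points)';
$pattern = '#User (?P<user_num>\d*): (?P<username>.*)(?= \(\d* points\)) \((?P<points>\d*)#';

$matched = preg_match($pattern, $string, $user);

// The lookahead stops .* right before " (1500 points)",
// so the username comes out without a trailing space.
echo $user['user_num'], PHP_EOL; // 2
echo $user['username'], PHP_EOL; // Cor Code5
echo $user['points'], PHP_EOL;   // 1500

// Spaces in a pattern are literal: this returns 0 (no match).
$spaceCheck = preg_match('#hello, world#', 'hello,world');
```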

Second case:

preg_match_all('#(?<=\$)\d+\.?\d+#', $string, $payment);

Break down:

Here, we're using preg_match_all.
(?<=\$)(\d+\.?\d+) - matches 1 or more (+) digits (\d), followed by 0 or 1 "." (you could have "$30" w\o the decimal point), followed by 1 or more digits; all that IF preceded by the dollar character ($). Prototype: (?<=<expression_1>)(<expression_matched_IF_preceded_by_expression_1>). That's "positive lookbehind".

It's not the case to use associative arrays, so you can sum up like this:

echo $payment[0][0] + $payment[0][1] + $payment[0][2];
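
Put together, the second case might look like the sketch below, using the payment string from the question. The array_sum() variant is just one way to total an unknown number of payments. The last pattern, \d+(?:\.\d+)?, is my own assumption rather than part of the answer above: \d+\.?\d+ needs at least two digits, so a single-digit amount like "$5" would be skipped.

```php
<?php
$string = '$6.50+$4.30+$3.20 USD';

// $payment[0] receives every full match, without the dollar signs
preg_match_all('#(?<=\$)\d+\.?\d+#', $string, $payment);

// Sum however many payments were found
$total = array_sum(array_map('floatval', $payment[0]));
printf("%.2f\n", $total); // 14.00

// Hypothetical variant: the decimal part is optional as a whole,
// so single-digit amounts such as "$5" are matched too.
preg_match('#(?<=\$)\d+\.?\d+#', '$5', $strict);
preg_match('#(?<=\$)\d+(?:\.\d+)?#', '$5', $relaxed);
```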

Late again, I see. Sorry, I was writing this post, but in the meantime, PradeepKr has answered. I'll post nonetheless. 😃
And yes, do sanitize and trim the user names.

----------------------LATER EDIT---------------------------

Having a great deal of time on my hands, I've peeked at PradeepKr's regex.
He said to trim the matched username, of course, after the matching has occurred. That needn't be the case if we change his regex to this, adding "\s":
{User\s*(\w+):\s*([^(]+)\s\((\d+)\s*points\)}

When I said to trim the usernames, I was referring to doing that BEFORE adding the username to the text file. When a user inputs his nickname with spaces before or after, you should first trim the input text (along with sanitizing it to avoid security problems), THEN add it to the text file. You mustn't let this happen in your txt:

User 1:          Corcode2         (1500 points)

Note that PradeepKr's regex will break if the nickname includes a left bracket, because of the "[^(]" in his code. It won't work on:
User 1: Corcode(2) (1500 points)
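
A quick sketch comparing the two patterns from this thread on that line. The variable names are only illustrative.

```php
<?php
$line = 'User 1: Corcode(2) (1500 points)';

// PradeepKr's pattern: [^(]+ stops at the first "(", which is inside the nickname
$negated = '/User\s*(?P<user_num>\w+):\s*(?P<username>[^(]+)\((?P<points>\d+)\s*points\)/';
$negatedResult = preg_match($negated, $line, $a);

// Lookahead pattern: .* backtracks until " (<digits> points)" follows
$lookahead = '#User (?P<user_num>\d*): (?P<username>.*)(?= \(\d* points\)) \((?P<points>\d*)#';
$lookaheadResult = preg_match($lookahead, $line, $b);

echo $negatedResult, PHP_EOL;   // 0 (no match)
echo $lookaheadResult, PHP_EOL; // 1
echo $b['username'], PHP_EOL;   // Corcode(2)
```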

nevvermind

Can't edit my post anymore. :rolleyes:

For the second case, you can discard the second pair of brackets. They're redundant.
The regex should be this:
(?<=\$)\d+\.?\d+

corcode

Excellent!

It's great to have (two) answers on the plate for me but as well as that a bit of insight as to how to do it for myself, cheers nevermind. I've been developing in php for a while now but never really had much cause to look beyond preg_replace's use other than for checking emails, which was copied and pasted from somewhere one time or another, made progress in learning it yesterday but when it came to anything tricky or conditional, I was left wanting in the syntax...I guess the only thing for it is practice.

Thanks again!

nevvermind

Heh, you'd be surprised. Until I read your post, I didn't know about the "?P<user_num>" thingy and assoc arrays.

nrg_alpha

nevvermind;10962329 wrote:
...Until I read your post, I didn't know about the "?P<user_num>" thingy

Hehe..that 'thingy' is known as a named capture. Internally, the regex engine still assigns each named capture its usual index number: the entire match is stored at index 0 and your first named capture is still stored at index 1, but you can access it by the name you gave it instead of its index number.
As a side note on naming convention, PHP 4 and 5 treat names that are numbers differently (so I read). Best practice is to avoid numeric named captures altogether.

Numeric capture names == no no
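
A small sketch of the name/index point, using a simplified pattern of my own on a line from the question. Each named capture shows up in the result array twice, once under its name and once under its index.

```php
<?php
preg_match(
    '/User (?P<user_num>\d+): (?P<username>\w+)/',
    'User 1: corcode (1500 points)',
    $user
);

echo $user[0], PHP_EOL;          // User 1: corcode  (the entire match)
echo $user['user_num'], PHP_EOL; // 1
echo $user[1], PHP_EOL;          // 1  (same capture, by index)
echo $user['username'], PHP_EOL; // corcode
echo $user[2], PHP_EOL;          // corcode  (same capture, by index)
```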

nevvermind;10962308 wrote:
The regex should be this:
(?<=\$)\d+\.?\d+

In the event lookaround assertions give people the heebie-jeebies.. an alternative (at least in this case anyway):

\$\K\d+\.?\d+

P.S I have not been coding in PHP (nor used regex) for the last 7 months... man, am I ever RUSTY! lol Slowly getting back into the swing of things..

Shake it off..... shake it off...

nevvermind

Thanks, man, for the "\K" thingy. :p That's also something I didn't know.

Found some info about it and sounds promising. Note from php.net: "...since PHP 5.2.4".

You're right about the second regex. The strings are too simple to use conditionals, so a mere "#\$(\d+\.?\d+)#" or "#\$\K\d+\.?\d+#" should suffice.

nrg_alpha

nevvermind;10962429 wrote:
The strings are too simple to use conditionals, so a mere "#\$(\d+\.?\d+)#" or "#\$\K\d+\.?\d+#" should suffice.

Agreed... All in all, there are many ways to skin a cat, so to speak. In that first example, "#\$(\d+\.?\d+)#", the solution definitely works, albeit at the cost of a capture (which, in this case, isn't really necessary). By using \K, we in essence ditch what was matched before it (in this case, the dollar sign) and continue matching from there, giving us a clean value at index 0 (without any capturing fuss).

But yeah, some say po-tae-toe, others say po-ta-toe 😉
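
For anyone comparing the two, a short sketch on the payment string from the original question. The variable names are only illustrative, and \K needs PHP 5.2.4 or later, as noted above.

```php
<?php
$string = '$6.50+$4.30+$3.20 USD';

// Capture-group version: index 0 keeps the dollar sign, index 1 holds the clean value
preg_match_all('#\$(\d+\.?\d+)#', $string, $withCapture);

// \K version: everything matched before \K is dropped, so index 0 is already clean
preg_match_all('#\$\K\d+\.?\d+#', $string, $withK);

print_r($withCapture[0]); // $6.50, $4.30, $3.20
print_r($withCapture[1]); // 6.50, 4.30, 3.20
print_r($withK[0]);       // 6.50, 4.30, 3.20
```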