what's wrong (reg.expr.)

twopeak · Dec 31, 2002

Why would it be incorrect? To me it seems very logical!
The parameter tells the script to return the weather for Antwerp rather than the general weather page, which is what you get if you leave the parameter out.
Besides, that can't be what's wrong: $buffer does contain all the necessary information!
It's really the regular expression that is wrong!

xblue · Dec 31, 2002

I'd agree, since it should make a difference whether you include a local file or some content via http.

Now, do you get any output at all? (I think you probably should?)

What struck me most: are you sure eregi_replace is what you need? The way your pattern looks, you'd only replace <title>some text</title> with some text, leaving all the rest as it is.

preg_match() or, for things that might be in there more than once, preg_match_all() might be more useful since you can read all matches into an array for later use.
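
To make the difference concrete, here is a minimal sketch. The sample string is made up for illustration and is not the real weather page:

```php
<?php
// Hypothetical sample input, not taken from the real weather page.
$html = '<b>one</b> and <b>two</b>';

// preg_match() stops after the first match.
preg_match('/<b>(.*?)<\/b>/', $html, $first);
echo $first[1], "\n";   // one

// preg_match_all() collects every match; with the default flags,
// $all[1] is the list of texts captured by the first group.
$count = preg_match_all('/<b>(.*?)<\/b>/', $html, $all);
echo $count, "\n";      // 2
print_r($all[1]);       // the two captured texts: one, two
```

Both functions take the pattern first, then the subject string, then the variable that receives the matches.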

twopeak · Jan 2, 2003

ok, this is what I've got now...

$buffer = "<html><head><title>something</title></head><body>hello, this is some stupid text<!-- start forecast by LOC code (smb) --> this is the text I would like to extract <!-- end forecast by LOC code (smb) --> and this is some more stupid text <br><i>containing HTML and stuff, because I got it from a HTTP request</body></html>";

echo htmlspecialchars($buffer);
echo "<br><br><hr><br><hr><br><br>";
// begin = "<!-- start forecast by LOC code (smb) -->"
// end = "<!-- end forecast by LOC code (smb) -->"


$bericht = preg_match_all("/.*start forecast(.*)end forecast.*/", "AAAAAA\\1", $buffer);
echo htmlspecialchars($bericht);
echo "<hr><hr>";
preg_match_all('/<title>(.*?)title>/', $buffer, $aMatches);
echo $aMatches[0];
echo "<hr>";
echo $aMatches[1];

None of these work...
The first one returns "0", so it is probably telling me there are no matches (but there are...).
The second one prints "Array" twice, which makes me think that either this isn't the array I expect or that the array element doesn't exist...

Does it matter what the .* has to match? I mean, what if it contains something that could be interpreted as a regexp
or anything else?

twopeak · Jan 2, 2003

with
print_r(array_values($Matches));

I found out I do have array entries, but it's all empty!
Why?

xblue · Jan 3, 2003

Hi again,

this is from the manual:

$matches[0] is the first set of matches, and $matches[0][0] has text matched by full pattern, $matches[0][1] has text matched by first subpattern and so on. Similarly, $matches[1] is the second set of matches, etc.

(http://www.php.net/manual/en/function.preg-match-all.php)

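One caveat: that passage describes the PREG_SET_ORDER flag. With the default, PREG_PATTERN_ORDER, it is the other way round: $matches[0] is an array of all full-pattern matches and $matches[1] is an array of everything matched by the first subpattern. So, since there is probably only one title, $matches[0][0] should contain the complete match and $matches[1][0] what was matched by (.*?). Either way, $matches[0] and $matches[1] are themselves arrays, which is why echo just prints "Array".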
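
Rather than guessing at the layout, it can be checked directly. The sketch below runs the same pattern with the default flag and with PREG_SET_ORDER. As far as I can tell, the passage quoted from the manual describes the PREG_SET_ORDER layout. The shortened $buffer and the closing `<\/title>` in the pattern are my own simplifications:

```php
<?php
// Shortened, hypothetical version of the test string.
$buffer = '<html><head><title>something</title></head></html>';

// Default flag (PREG_PATTERN_ORDER): one sub-array per capture group.
preg_match_all('/<title>(.*?)<\/title>/', $buffer, $byPattern);
echo $byPattern[0][0], "\n";   // <title>something</title>
echo $byPattern[1][0], "\n";   // something

// PREG_SET_ORDER: one sub-array per match.
preg_match_all('/<title>(.*?)<\/title>/', $buffer, $bySet, PREG_SET_ORDER);
echo $bySet[0][0], "\n";       // <title>something</title>
echo $bySet[0][1], "\n";       // something
```

In both layouts the first index gives an array, so echo on $matches[0] alone can only print "Array".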

twopeak · Jan 3, 2003

the problem is that it's all empty!
probably meaning it's empty..

xblue · Jan 3, 2003

Hi,

what is this:
preg_match_all("/.*start forecast(.*)end forecast.*/", "AAAAAA\\1", $buffer);
???

Mind the syntax:

preg_match_all ( string pattern, string subject, array matches [, int flags])

Thus, you are looking for your pattern in the string AAAAAA\1 and saving the matches in $buffer, which overwrites the former value of $buffer.

I checked your code without that line (and slightly modified) and it's doing fine:

$buffer = "<html><head><title>something</title></head><body>hello, this is some stupid text<!-- start forecast by LOC code (smb) --> this is the text I would like to extract <!-- end forecast by LOC code (smb) --> and this is some more stupid text <br><i>containing HTML and stuff, because I got it from a HTTP request</body></html>";

preg_match('/<title>(.*?)<\/title>/', $buffer, $aMatches);
echo htmlspecialchars($aMatches[0]);
echo "<hr>";
echo htmlspecialchars($aMatches[1]);
echo "<hr>";
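
The same argument order (pattern, subject, matches) applies to the forecast block. A minimal sketch, assuming the markers appear exactly as in your test string and everything sits on one line; the shortened $buffer is hypothetical:

```php
<?php
// Shortened, hypothetical version of the test string (single line).
$buffer = "<body>hello<!-- start forecast by LOC code (smb) --> this is the text I would like to extract <!-- end forecast by LOC code (smb) --> more text</body>";

// The parentheses around "smb" are escaped so they match literally
// instead of opening a capture group.
$pattern = '/<!-- start forecast by LOC code \(smb\) -->(.*?)<!-- end forecast by LOC code \(smb\) -->/';

$bericht = '';
if (preg_match($pattern, $buffer, $m)) {
    // $m[1] is the text between the two markers.
    $bericht = trim($m[1]);
}
echo htmlspecialchars($bericht), "\n";   // this is the text I would like to extract
```

Note that $buffer is only read here; the matches go into a separate variable, so the page content is not overwritten.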

milind24 · Jan 3, 2003

Hello ,

First, change the variable name $buffer to something else, or immediately assign the value of $buffer to another variable.
Second, preg_match_all stores its results in a multidimensional array, not a flat one.

Try the following code, it's working:

$buffer = "<html><head><title>something</title></head><body>hello, this is some stupid text<!-- start forecast by LOC code (smb) --> this is the text I would like to extract <!-- end forecast by LOC code (smb) --> and this is some more stupid text <br><i>containing HTML and stuff, because I got it from a HTTP request</body></html>";
$myvar = $buffer;
echo htmlspecialchars($buffer);
echo " <hr> <hr> ";
// begin = "<!-- start forecast by LOC code (smb) -->"
// end = "<!-- end forecast by LOC code (smb) -->"

$bericht = preg_match_all("/.*start forecast(.*)end forecast.*/", "AAAAAA\\1", $buffer);

echo htmlspecialchars($bericht);
echo "<hr><hr>";
//preg_match_all('/<title>(.*?)title>/', $buffer, $aMatches);
preg_match_all("/(<([\w]+)[^>]*>)(.*)(<\/\\2>)/",$myvar, $aMatches);

for ($i=0; $i< count($aMatches[0]); $i++)
{
    echo "Matched text: ".$aMatches[0][$i]." ";
    echo "part 1: ".$aMatches[1][$i]." ";
    echo "part 2: ".$aMatches[3][$i]." ";
    echo "part 3: ".$aMatches[4][$i]." ";
}

twopeak · Jan 3, 2003

uhmmmmm

It didn't work... I still get an empty array with 4 indexes!

preg_match_all("/(<([\w]+)[^>]*>)(.*)(<\/\\2>)/",$HTML_file, $aMatches);

for ($i=0; $i< count($aMatches[0]); $i++)
{
    echo "Matched text: ".$aMatches[0][$i]." ";
    echo "part 1: ".$aMatches[1][$i]." ";
    echo "part 2: ".$aMatches[3][$i]." ";
    echo "part 3: ".$aMatches[4][$i]." ";
}
print_r(array_values($aMatches));
?>

Because I was getting an empty page with only the number of matches, I tried printing all the array values, which showed me that I got an empty array as a result...

twopeak · Jan 3, 2003

Could it be that I have problems because there are newlines inside that HTML code?
I can extract the title, and I can extract from
"start forecast" to "LOC code" (only "by" is in between),
but when I try from "start forecast" to "end forecast"
I have a problem!

twopeak · Jan 4, 2003

In case someone wonders what the problem was:
the pattern needed the s modifier after the closing delimiter (/.../s), so that . also matches newlines and the match can span multiple lines!

I found it through one of the sites linked in the user comments of the PHP manual.
That site is about Perl, but I've got a solution now.
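
For anyone who wants to see the effect, here is a minimal sketch. The page fragment and the shortened markers are hypothetical; the only difference between the two calls is the s modifier:

```php
<?php
// Hypothetical page fragment with line breaks between the markers.
$buffer = "<!-- start forecast -->\nsunny\nwith clouds\n<!-- end forecast -->";

// Without s, the dot does not match "\n", so the pattern cannot get
// from the start marker to the end marker.
$without = preg_match('/start forecast -->(.*?)<!-- end forecast/', $buffer, $m1);

// With s, the dot also matches newlines.
$with = preg_match('/start forecast -->(.*?)<!-- end forecast/s', $buffer, $m2);

echo $without, "\n";       // 0
echo $with, "\n";          // 1
echo trim($m2[1]), "\n";   // "sunny" and "with clouds" on two lines
```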

twopeak · Jan 15, 2003

This is a very good reply I got in a newsgroup a while ago, might help people who do the same
it's solved now!!!

I had a doozy of a time matching HTML from pages for a long time, and there
were 3 things that helped me.

Use non-greedy .'s.. .*? is your friend.

if you have to match over multiple lines, use 'ms' --> s/fjk(.*?)kd/ms

Normalize the webpage before you match against it, here is what I
commonly do:
function normalize_page ( &$page ) {
    $page = preg_replace( "/\r/", "", $page);               // strip out carriage returns
    $page = preg_replace( "/\n/", "", $page);               // strip out newlines
    $page = preg_replace( "/\t/", "", $page);               // bye bye tabs
    //$page = preg_replace( "/ /", " ", $page);             // use this with some caution
    $page = preg_replace( "|</(.*?)>|", "</$1>\n", $page);  // put every closing tag at the end of a line
    $page = preg_replace( "|><|", ">\n<", $page);           // also break up directly adjacent tags
}

Normalizing a page before reg'exing it can also help when html formats
change "just a little", a space here, or a tab/newline there, and your regex
is broken. However, normalizing the data first helps to keep your regexes
functioning.

Just my $.02.

--Brian
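
The first tip in the reply above (non-greedy dots) is easy to check on a small string. A minimal sketch; the table-cell input is made up for illustration:

```php
<?php
// Hypothetical input with two cells on one line.
$page = '<td>first</td><td>second</td>';

// Greedy: .* runs to the LAST </td> it can find.
preg_match('/<td>(.*)<\/td>/', $page, $greedy);

// Non-greedy: .*? stops at the FIRST </td>.
preg_match('/<td>(.*?)<\/td>/', $page, $lazy);

echo $greedy[1], "\n";   // first</td><td>second
echo $lazy[1], "\n";     // first
```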
