Reading text between HTML tags - can you help me?

virva

I have succeeded in reading text from another web site and printing it on my own page (I have permission). I did this by reading the text between certain line numbers. Here's the code:

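// Read the remote page into an array, one line per element.
// file() takes no mode argument, so the "r" is dropped.
$fcontents = file("http://www.mol.fi/Tietoa/Ammatti/00/3/0/1/30110.html");

// Print the lines with array indexes 12 to 53.
$n = 54;
$i = 12;
while ($i < $n) {
    echo $fcontents[$i];
    $i++;
}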

So, now I would like to do the same by searching for HTML tags in the text and printing everything between them. Is that possible, and if so, how do I do it?

Weedpacket

It depends on how easy it is to identify the wanted text. Get all of the page into a single string (join it with '', say).

The preg_match() function will prove useful. To find the stuff between <tag> and </tag> the expression would be

preg_match('#<tag[^>]*>((?:(?!</tag>).)*)</tag>#is', $html, $match);

and the bit between the first <tag>/</tag> pair in the page will be in $match[1].

If there is more than one such pair, you could use preg_match_all() to get all of them and pick the one you want out of the resulting array, or use something like strchr() or preg_match() (again) to narrow down where you search first. (For example, if what you want is in the body, but there's a risk that something in the head might match, use the regexp above to match the text between <body> and </body> - then you don't have to worry about the contents of the <head> element.)
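
A minimal sketch of both ideas, assuming a small made-up page in place of the real one (the sample markup and the variable names are illustrative only). The \b after the tag name is an addition so that <p does not also match tags such as <pre>:

```php
<?php
// Hypothetical sample page; in practice $html would be the fetched page
// joined into a single string.
$html = '<html><head><title>Demo</title></head>'
      . '<body><p class="intro">First</p><p>Second</p></body></html>';

// Narrow the search to the body first, so nothing in the head can match.
// Fall back to the whole page if no <body> pair is found.
$body = preg_match('#<body\b[^>]*>(.*?)</body>#is', $html, $m) ? $m[1] : $html;

// Collect the contents of every <p>...</p> pair inside the body.
preg_match_all('#<p\b[^>]*>((?:(?!</p>).)*)</p>#is', $body, $matches);

// $matches[1] holds the captured text of each pair, in page order.
$paragraphs = $matches[1];
echo $paragraphs[1], "\n";   // prints "Second"
```

Picking the wanted pair by index only works while the page layout stays the same, so narrowing the search area first is usually the sturdier of the two.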

Some people have tried to use the XML functions - I wouldn't: even strict HTML 4.01 is not valid XML, and most HTML out there doesn't comply with any HTML standard. The XML functions will choke.

virva

Hmm... I tried it, but it doesn't work, at least not like this:

$fd = fopen ("http://www.mol.fi/Tietoa/Ammatti/00/3/0/1/30110.html", "r");
while (!feof ($fd)) {
$html = fgets($fd, 4096);

preg_match('#<A NAME="työtehtävät"[^>]*>((?:(?!<A NAME="kelpoisuusehdot">).)*)<A NAME="kelpoisuusehdot">#is', $html, $match);

echo $match[1];

Weedpacket

Originally posted by virva
Hmm... I tried it, but it doesn't work, at least not like this:

Okay, I have a bit more information now, and can give a simpler answer. Since only one anchor element can have a given name (and I'm guessing that the työtehtävät anchor doesn't have any other attributes), the pattern can be simplified:

'#<a name="työtehtävät">(.*)<a name="kelpoisuusehdot">#is'

Note that this assumes that the two tags are exactly <a name="työtehtävät"> and <a name="kelpoisuusehdot"> (apart from maybe using upper case letters). If either tag has other attributes following the name, then putting a [^>]* before the corresponding > will take care of them.

Be aware also that it doesn't keep the <a> tags themselves. If you want those as well, move the parentheses outwards to enclose them (just keep them inside the # delimiters).
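
Both variations can be sketched like this, assuming a made-up fragment with plain ASCII anchor names ("start" and "end" are placeholders, not the names on the real page):

```php
<?php
// Hypothetical fragment standing in for the real page.
$html = '<A NAME="start" class="x">Start</A><p>Wanted text</p><A NAME="end">End</A>';

// [^>]* before each closing > tolerates extra attributes after the name.
$inner = '#<a name="start"[^>]*>(.*)<a name="end"[^>]*>#is';
preg_match($inner, $html, $m);
echo $m[1], "\n";    // the text between the two anchor tags

// Parentheses moved outwards: the capture now includes both anchor tags.
$outer = '#(<a name="start"[^>]*>.*<a name="end"[^>]*>)#is';
preg_match($outer, $html, $m2);
echo $m2[1], "\n";   // the same text, with the tags on either side
```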

One last thing - this will only match if the entire thing you want to match (tags included) arrives in a single chunk, because of the way you're reading the page in: fgets() returns at most one line (up to 4095 bytes) per call, and you test each one separately. Instead, concatenate the chunks as you read them and test the concatenated string, so that with each chunk you test the entire page read so far. If the pattern is in the page, then eventually it will be in the string (even if it spreads across more than one chunk). To do that it's just a matter of replacing $html = with $html .=. Oh, and replacing fgets() with fread() might speed things up a bit, 'cos it won't keep stopping on every newline character.

intenz

Try...

$fp = fopen('bla.html', 'r');
$fcontents = fread($fp, filesize('bla.html'));
fclose($fp);
// ereg() was removed in PHP 7; preg_match() is its replacement.
preg_match('#<title>(.*)</title>#is', $fcontents, $ple);
echo $ple[1];

Weedpacket

Originally posted by intenz
Try...

You can't filesize() a remote file.

intenz

I know that.
Maybe the file is local?

If the file is remote, replace this line...

$fcontents = fread($fp,filesize('bla.html'));

with...

$fcontents = fread($fp,10000);

Weedpacket

Originally posted by intenz
I know that.
Maybe the file is local?

Not with a name starting with "http://" it's not.

If file remote, replace this line...

Only if you're absolutely certain that the text you're matching WILL appear in its entirety in the first 10000 bytes of the file. I checked: at the moment the file is 4810 bytes long, but who am I to assume that it will never get longer?

Why live dangerously?

virva

Hey, it works! Thanks a lot.😃

But... now it prints the text twice, why's that?
Here's the code:

$fd = fopen ("http://www.mol.fi/Tietoa/Ammatti/00/3/0/1/30110.html", "r");
while (!feof ($fd)) {
$html .= fread($fd, 4096);

preg_match('#<a name="työtehtävät">(.*)<a name="kelpoisuusehdot">#is', $html, $match);
echo $match[1];

}

intenz

Try...

<?
$fp = fopen('http://www.mol.fi/Tietoa/Ammatti/00/3/0/1/30110.html','r');
$fcontent = fread($fp, 10000);
preg_match('#<a name="työtehtävät">(.*)<a name="kelpoisuusehdot">#is', $fcontent, $match);
echo $match[1];
?>

Weedpacket

Originally posted by intenz
Try...
<?
$fcontent = fread($fp, 10000);
?>

Why?

intenz

Because the file is remote and I don't know the filesize?

Weedpacket

Originally posted by intenz
Because the file is remote and I don't know the filesize?

So how do you know it's not going to be more than 10000 bytes?

virva

Okay, guys. It's not worth arguing about the file size. It seems that 4096 is enough for now, and I don't think it's going to get bigger. But can you help me with the last question, which was: why does it now print the text twice?
Thanks a lot.

Weedpacket

Originally posted by virva
Okay, guys. It's not worth arguing about the file size. It seems that 4096 is enough for now, and I don't think it's going to get bigger. But can you help me with the last question, which was: why does it now print the text twice?
Thanks a lot.

Whups, didn't spot you'd said that.

Tuning the size of the chunk is possible. The smaller the chunk, the less has to be read into memory before a match is found; the larger the chunk, the fewer times the loop has to run. It's a tradeoff, and 4096 is a good size in this case.

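Yeah, it will print twice when (1) the bit you're matching appears entirely in the first chunk, and (2) the file is read in two chunks. Because $html keeps growing, the pattern matches again on the second pass through the loop and the echo runs a second time. That's what's happening here.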
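
The effect can be reproduced without touching the network. This is only a sketch: the page below is made up and padded with spaces so that it spans two 4096-byte chunks, and str_split() stands in for the successive fread() calls:

```php
<?php
// Hypothetical page: the wanted section sits at the start, followed by
// enough padding to push the total length past 4096 bytes.
$page = '<a name="start">wanted<a name="end">' . str_repeat(' ', 5000);
$pattern = '#<a name="start">(.*)<a name="end">#is';

$html = '';
$hits = 0;
foreach (str_split($page, 4096) as $chunk) {
    $html .= $chunk;            // same accumulation as $html .= fread($fd, 4096)
    if (preg_match($pattern, $html, $match)) {
        $hits++;                // the loop in the question echoes $match[1] here
    }
}
echo $hits, "\n";               // prints 2: one match per pass through the loop
```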

What you want is to stop reading once you've found a match and jump out of the loop (once you've found it there's no point in continuing to search!). For this the break; statement was designed:

// Initialise both variables to suppress possible warnings.
$match = array();
$html = '';
$fd = fopen("http://www.mol.fi/Tietoa/Ammatti/00/3/0/1/30110.html", "r");
while (!feof($fd))
{   $html .= fread($fd, 4096);
    // If we find what we're looking for - stop looking
    if(preg_match('#<a name="työtehtävät">(.*)<a name="kelpoisuusehdot">#is', $html, $match))
        break;
}
fclose($fd);

if(isset($match[1])) // Found a match
{   echo $match[1];
}
else
{   // Problem - we read the entire
    // file but didn't find what we're looking for.
}
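
The same loop can also be wrapped in a function and tried against an in-memory stream before pointing it at the real page. This is a sketch only: read_until_match() is a hypothetical helper name, and the anchor names and padding are made up so that the match spans two chunks:

```php
<?php
// Hypothetical helper: read a stream in chunks until the pattern matches.
// Returns the first captured group, or null if the stream ends first.
function read_until_match($fd, string $pattern, int $chunkSize = 4096): ?string
{
    $html = '';
    while (!feof($fd)) {
        $chunk = fread($fd, $chunkSize);
        if ($chunk === false) {
            break;              // read error: stop rather than loop forever
        }
        $html .= $chunk;
        if (preg_match($pattern, $html, $match)) {
            return $match[1];   // found it, so stop reading
        }
    }
    return null;
}

// In-memory stream standing in for the remote page.
$fd = fopen('php://memory', 'w+');
fwrite($fd, str_repeat('x', 5000) . '<a name="start">wanted<a name="end">tail');
rewind($fd);

$found = read_until_match($fd, '#<a name="start">(.*)<a name="end">#is');
fclose($fd);
echo $found === null ? 'no match' : $found, "\n";
```

Returning null for "not found" keeps the caller's check as simple as the isset() test above.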

virva

Yeah. I haven't done very much programming, so that 'break' just didn't spring to mind... 🙂

But thank you VERY much for your help. Of course, if you still want to share your knowledge, I posted another question about variables. Just in case you would like to answer that one... 😉