how to get contents of an HTML tag

sfullman · Aug 10, 2008

I developed a function using regex, but it's running into problems.
What I want to do is find a tag on an HTML page (let's say "div") with an attribute (let's say "id") having a specific value (let's say "content"), and return the inner HTML of that div:

//example
function xml_read_tag($string, $tagname, $attribute, $value){
   //black box here :-)
   return $contents;  //where contents is the innerHTML of <div id="content">...</div> in this case

}

Does php have any native functions that do just this?

Thanks in advance!!
Samuel

sneakyimp · Aug 11, 2008

PHP has DOM functions. I'm not sure how handy they will be. You might also get pretty far using pattern matching.

sfullman · Aug 11, 2008

This appears to be the ticket. Could you give me an example of how you'd do something akin to getTagHTMLByAttribute(), i.e. return the inner HTML content (outer HTML content would be even better)? There's a lot of documentation to wade through here.

Thanks,
Samuel

sneakyimp · Aug 11, 2008

Sorry! I haven't waded through it yet either.

bradgrafelman · Aug 13, 2008

Here's an example:

$doc = new DOMDocument;
$doc->loadHTMLFile('myfile.html');
$node = $doc->getElementById('content');

echo $node->nodeValue;

Note that the above code (specifically, the getElementById() method) requires valid HTML content (e.g. a valid DTD); if you don't have valid HTML (who would do such a thing, though?!), you could use XPath:

$doc = new DOMDocument;
$doc->loadHTMLFile('myfile.html');

$xpath = new DOMXPath($doc);
$node = $xpath->query('*/div[@id="content"]')->item(0);

echo $node->nodeValue;
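
A note on the path used above: with no context node, `*/div` only matches a div that is a direct child of the root element, so a div nested inside `<body>` may not be found. The sketch below uses `//` to search the whole document and wraps the idea in the xml_read_tag() signature from the original question. The inline sample markup is an assumption for illustration, and the function assumes $value contains no double quotes, since it is inserted into the XPath expression unescaped. Like the code above, it returns text content only, not markup.

```php
<?php
// Returns the text content of the first <$tagname> whose $attribute
// equals $value, or null when nothing matches.
// Assumption: $value contains no double quotes (no XPath escaping is done).
function xml_read_tag($html, $tagname, $attribute, $value)
{
    $doc = new DOMDocument;

    // Keep parser warnings about sloppy markup out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // "//" searches every level of the document, not just below the root.
    $query = sprintf('//%s[@%s="%s"]', $tagname, $attribute, $value);
    $nodes = $xpath->query($query);

    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->nodeValue;
}

// Sample markup, made up for this example.
$html = '<html><body><div id="nav">menu</div>'
      . '<div id="content"><p>Hello</p></div></body></html>';

echo xml_read_tag($html, 'div', 'id', 'content'), "\n";
```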

sfullman · Aug 13, 2008

hi bradgrafelman,

the xpath coding produces this error:

Parse error: syntax error, unexpected T_OBJECT_OPERATOR in /home/rbase/dev/devteam/development/site_spider/xpath.php on line 6

(which is this line):

$node = $xpath->query('*/div[@id="content"]')->item(0);

the other method gives various errors also..

am running php 4x I believe. suggestions?

scrupul0us · Aug 13, 2008

try:

$node = $xpath->query('*/div[@id="content"]')->item[0];

bradgrafelman · Aug 13, 2008

sfullman wrote:
am running php 4x I believe. suggestions?

Upgrade to a version of PHP that is still supported (e.g. PHP5). :p In all seriousness, you should consider upgrading ASAP.

EDIT: Actually, now that I test this myself, I'm not sure you're able to use DOMDocument like this on PHP4 at all.

kiwibrit · Oct 21, 2011

Hmm.

Using this web page as a test subject (confirmed valid here) I have problems.

$node = $dom->getElementById('intwrap');

throws two warnings:
ID navigation already defined in http://www............., line: 32

ID maincontent already defined in http://www.........., line: 96

but then does give me the text content in full. However there are no line breaks, and no html tags, which I need.

I can get rid of the warnings with:

$dom = new DOMDocument;
libxml_use_internal_errors(true); 
$dom->loadHTMLFile("$url");
libxml_use_internal_errors(false);
$node = $dom->getElementById('intwrap');
echo $node->nodeValue;
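
As a variation on the snippet above, the suppressed messages can be collected and inspected rather than thrown away, using libxml_get_errors() and libxml_clear_errors(). This is only a sketch: the inline $html string stands in for the real page, and which messages libxml reports for it may vary between versions.

```php
<?php
// Stand-in markup with a repeated id, made up for this example.
$html = '<html><body><div id="navigation">a</div>'
      . '<div id="navigation">b</div>'
      . '<div id="intwrap">text</div></body></html>';

$dom = new DOMDocument;

// Buffer libxml messages instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($html);
$errors = libxml_get_errors();   // array of LibXMLError objects
libxml_clear_errors();           // empty the buffer for the next parse
libxml_use_internal_errors($previous);

// Report whatever the parser complained about, if anything.
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

$node = $dom->getElementById('intwrap');
echo $node->nodeValue, "\n";
```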

I suspect that I will get the tags from

echo $node->nodeContent;

But at the moment I am getting response problems. If I sort it, I will come back and leave a note for others with similar problems. If anybody comes up with a bright idea in the meantime, I'd be grateful.

[edit] Argh! have just discovered nodeContent doesn't exist. Suggestions, anyone?

johanafm · Oct 21, 2011

Yes, $node->nodeValue only includes the text nodes inside $node. For the outer HTML of element $el in document $doc, use

$doc->saveHTML($el);

For inner HTML, take the length $len of $el->nodeName, then take the substr of the outer HTML from $len + 2 to -($len + 3), since the start tag has two extra characters, < and >, and the end tag has three: <, / and >. This assumes you're not dealing with a text node, since text nodes have no tags (there is no <#text> or </#text>). On the other hand, text nodes contain nothing but character data anyway, so in those cases $el->nodeValue is all you need.
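
The substring arithmetic above counts only the tag name, so it would miscount a start tag that carries attributes (such as id="content"). One alternative, sketched here on the assumption of PHP 5.3.6 or later (where saveHTML() accepts a node), is to serialise each child node and join the results. inner_html() is a hypothetical helper name and the sample markup is made up.

```php
<?php
// Hypothetical helper: concatenates the serialised children of $el.
// Requires PHP 5.3.6+ because saveHTML() is called with a node argument.
function inner_html(DOMNode $el)
{
    $html = '';
    foreach ($el->childNodes as $child) {
        $html .= $el->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument;
$doc->loadHTML(
    '<html><body><div id="content" class="main">'
    . '<p>Hello <b>world</b></p></div></body></html>'
);

$el = $doc->getElementById('content');
echo inner_html($el), "\n";
```

Because the wrapping tag is never serialised, its attributes do not affect the result.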

kiwibrit · Oct 21, 2011

Thanks for that. I thought that if I incorporated your suggestion into my code, then for outer HTML I should have

$dom = new DOMDocument;
libxml_use_internal_errors(true); 
$dom->loadHTMLFile("$url");
libxml_use_internal_errors(false);
$node = $dom->getElementById('intwrap');
echo $dom->saveHTML($node);

But that gives me an error - DOMDocument::saveHTML() expects exactly 0 parameters, 1 given

Have I misunderstood you?

Weedpacket · Oct 21, 2011

No, but the DOMDocument::saveHTML method only got its optional parameter in PHP 5.3.6, so you must be using an older version. If DOMDocument::saveXML doesn't work instead, you could look at importing the node into a new document and then converting the new document to a string.
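
A rough sketch of the import approach, for versions where saveHTML() takes no argument: deep-copy the node into an empty document and serialise that document. outer_html_via_import() is a hypothetical name and the sample markup is made up. The exact whitespace in the result may differ between libxml versions.

```php
<?php
// Hypothetical helper: returns the outer HTML of $node without passing
// a node to saveHTML(), so it does not depend on PHP 5.3.6.
function outer_html_via_import(DOMNode $node)
{
    $copy = new DOMDocument;
    // importNode(..., true) makes a deep copy owned by the new document.
    $copy->appendChild($copy->importNode($node, true));
    return $copy->saveHTML();
}

$dom = new DOMDocument;
$dom->loadHTML(
    '<html><body><div id="intwrap"><p>Some <em>content</em></p></div>'
    . '<div id="footer">skip me</div></body></html>'
);

$node = $dom->getElementById('intwrap');
echo outer_html_via_import($node);
```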

Are you needing to turn it into a string with tags and all, or is that just for examining the content and you'll actually be using the DOM object itself in the real code? If it's the latter then you don't need to convert it into a string in the first place.

kiwibrit · Oct 21, 2011

Good point - I am using 5.3.5 locally for testing, and that matches my web host, so it is what I should code for. Having done what is recommended elsewhere in the forum, and not started a new thread, I can't mark it as resolved - but it certainly has been for me, thank you very much indeed. This achieves what I was trying to do:

$dom = new DOMDocument;
libxml_use_internal_errors(true); 
$dom->loadHTMLFile("$url");
libxml_use_internal_errors(false);
$node = $dom->getElementById('intwrap');
$content = $dom->saveXML($node);
$content = str_replace("\n", "", $content);
$file = 'test.txt';
file_put_contents ($file, $content);
header('Location: test.txt');

Our company site, which I coded, is being massively re-skinned, and being given a CMS. I'm happy with the work being outsourced, since I am now semi-retired and working just a couple of days a week. The new contractor (whom I helped choose and get along well with) was asking for content.

As a first step, a lot of content needs to come across from the old site. For the most part, on each page the content lives inside a div with the id 'intwrap'. So I am now able to offer them the guts of each page's content coding, including paths to images and so forth.
