In short (items 1, 2 and 4 are explained in more detail below):
1. Create a new DOM document: $doc = new DOMDocument; $doc->loadHTML($htmlString);
2. write a function "traverse" that can walk any such tree rooted at $root
3. To perform replacements in the entire document body get the (first) body element from the dom document, $body,
and call traverse($body). The tree passed to traverse is then rooted at the body node
4. rewrite your existing string replacing function to handle DOM nodes rather than just plain text
1. The DOMDocument class will parse html code into a DOM tree representation, which you may then traverse. Also note that even if you pass only an html snippet such as
<div>text<img></div>
the DOMDocument will add doctype, html and body elements around it. This means that you can still use the first body element to process all contents.
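As a minimal sketch of step 1, assuming the snippet from above as input (the variable names $htmlString and $body are only illustrative), loading a fragment and fetching the body element could look like this:

```php
<?php
// Hypothetical input: an HTML fragment without html or body wrappers.
$htmlString = '<div>text<img></div>';

$doc = new DOMDocument();
$doc->loadHTML($htmlString);

// The parser has wrapped the fragment, so a body element is available.
$body = $doc->getElementsByTagName('body')->item(0);

echo $body->nodeName, "\n";             // body
echo $body->firstChild->nodeName, "\n"; // div
```

Real-world markup often triggers parser warnings; calling libxml_use_internal_errors(true) before loadHTML is one common way to keep them out of the output.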
2. Traversing trees is so commonplace that the different methods of doing so have their own names (depth-first: pre-order, in-order and post-order, as well as breadth-first). It shouldn't take long to write the code to handle the tree traversal, but if you get stuck, the algorithms are widely documented online.
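One possible shape for such a function is a recursive pre-order, depth-first walk. This is only a sketch: the $visit callback and the $depth parameter are my own additions for illustration, not something the approach requires.

```php
<?php
// Pre-order, depth-first walk: visit the node itself, then each child in turn.
// $visit is a hypothetical callback that receives the node and its depth.
function traverse(DOMNode $root, callable $visit, int $depth = 0): void
{
    $visit($root, $depth);

    $child = $root->firstChild;
    while ($child !== null) {
        // Remember the sibling first, in case the callback moves or removes $child.
        $next = $child->nextSibling;
        traverse($child, $visit, $depth + 1);
        $child = $next;
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<div><p>one</p><p>two</p></div>');
$body = $doc->getElementsByTagName('body')->item(0);

// Record an indented outline of the tree rooted at the body element.
$names = [];
traverse($body, function (DOMNode $node, int $depth) use (&$names) {
    $names[] = str_repeat('  ', $depth) . $node->nodeName;
});
echo implode("\n", $names), "\n";
```

Text nodes show up in the outline under the name #text.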
In order to know what to do, you also need to understand the structure of the DOM tree. Each node in the DOM tree is called a DOM node, and these can be of different types. Each such node has a nodeType property indicating the type of node. These are the node type values, their pre-defined PHP constants and their common names.
1 XML_ELEMENT_NODE - Element
2 XML_ATTRIBUTE_NODE - Attribute
3 XML_TEXT_NODE - Text
4 XML_CDATA_SECTION_NODE - CDATA section
5 XML_ENTITY_REF_NODE - Entity reference
7 XML_PI_NODE - Processing instruction
8 XML_COMMENT_NODE - Comment
9 XML_DOCUMENT_NODE - XML document
12 XML_NOTATION_NODE - Notation
Assuming you are working with html and that you do not need to deal with nested documents (in iframes), you should only have to deal with text nodes and element nodes.
As an example, consider the DOM tree structure of this html code:
<div>
Have you seen this?
</div>
<div>
<img src="http://example.com/thumb.jpg">
http://example.com/thumb.jpg
</div>
The top level consists of 3 nodes:
- XML_ELEMENT_NODE (div)
- XML_TEXT_NODE (white-space between divs)
- XML_ELEMENT_NODE (div)
The first div contains
- XML_TEXT_NODE (Have you seen this?)
Text nodes contain only text and thus can have no child nodes
The second div contains
- XML_TEXT_NODE (white-space)
- XML_ELEMENT_NODE (img with src="http://example.com/thumb.jpg")
- XML_TEXT_NODE (http://example.com/thumb.jpg)
By only processing text nodes for replacement, the above image node will no longer present a problem.
The traverse function has to check the type of each node as it walks the tree. If it is of type text, it is a target for your url to image or anchor replacement.
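A rough sketch of that type check, run against the example markup from above. The function name collectTextNodes is hypothetical, and skipping whitespace-only text nodes is my own assumption to keep the result short:

```php
<?php
// Collect every non-blank text node below $root; all other node types are
// only descended into, never collected.
function collectTextNodes(DOMNode $root, array &$found): void
{
    if ($root->nodeType === XML_TEXT_NODE) {
        if (trim($root->nodeValue) !== '') {
            $found[] = $root;
        }
        return; // text nodes have no children
    }
    for ($child = $root->firstChild; $child !== null; $child = $child->nextSibling) {
        collectTextNodes($child, $found);
    }
}

$html = <<<'HTML'
<div>
Have you seen this?
</div>
<div>
<img src="http://example.com/thumb.jpg">
http://example.com/thumb.jpg
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$body = $doc->getElementsByTagName('body')->item(0);

$found = [];
collectTextNodes($body, $found);
foreach ($found as $textNode) {
    echo trim($textNode->nodeValue), "\n";
}
```

The src attribute of the img element never reaches $found, because attributes are not text nodes. Collecting the nodes first and modifying them afterwards also avoids changing the tree while it is being walked.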
4. However, you will no longer be able to simply replace the text as you did previously. If you replace the content of a text node containing 'http://example.com' with '<a href="http://example.com">example.com</a>', you have performed string replacement rather than inserted an actual anchor. I.e. when you eventually retrieve the html code, the markup characters will have been escaped and it will look like
&lt;a href=...&gt;
Rewrite your existing string replacing function so that instead of simply replacing strings, it actually creates dom nodes to represent those. In short, it will have to be able to break up a text node into three new nodes: text node (text before replacement) + element node (anchor or image) + text node (text following replacement)
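The three-way split could be sketched as follows. The helper name linkifyFirstUrl and the URL pattern are assumptions of mine, and for brevity the anchor simply uses the full URL as its link text; your existing replacement rules would go in their place.

```php
<?php
// Hypothetical helper: replace the first URL inside a text node with an
// actual <a> element, keeping the surrounding text as separate text nodes.
function linkifyFirstUrl(DOMText $text): bool
{
    $value = $text->nodeValue;
    // Simplistic URL pattern, for illustration only.
    if (!preg_match('~https?://\S+~', $value, $m, PREG_OFFSET_CAPTURE)) {
        return false;
    }
    [$url, $pos] = $m[0]; // matched URL and its byte offset in $value

    $doc = $text->ownerDocument;
    $parent = $text->parentNode;

    // Build the anchor as real DOM nodes instead of a markup string.
    $anchor = $doc->createElement('a');
    $anchor->setAttribute('href', $url);
    $anchor->appendChild($doc->createTextNode($url));

    // text before + element node + text after, inserted in document order.
    $parent->insertBefore($doc->createTextNode(substr($value, 0, $pos)), $text);
    $parent->insertBefore($anchor, $text);
    $parent->insertBefore($doc->createTextNode(substr($value, $pos + strlen($url))), $text);
    $parent->removeChild($text);

    return true;
}

$doc = new DOMDocument();
$doc->loadHTML('<p>See http://example.com for details</p>');
$p = $doc->getElementsByTagName('p')->item(0);

linkifyFirstUrl($p->firstChild);
echo $doc->saveHTML($p), "\n";
```

After the call, the paragraph holds three children: a text node, the a element and another text node. A text node with several URLs would need the helper applied again to the trailing text node.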