Problems converting text to image or link: double jeopardy in play

schwim

Hi there everyone!

I've got an issue with a function I'm using. Its intention is to find URLs in a string of text and convert image URLs into <img> tags and other links into <a href> tags. The issue I'm running into is that it sometimes does a very poor job: it tries to apply both conversions to an image, or a normal link ends up encapsulated in an img tag. This seems to happen when more than one URL is passed in a string of text. It starts to mess with the URL inside the img or a href tag. I'm wondering if there's something I can do to make sure that if it does one thing, it doesn't try to do another.

Here's the function:

/* Convert urls and images into html links. */
function UrlsInText($text){
	$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
	preg_match_all($reg_exUrl, $text, $matches);
	$usedPatterns = array();
	foreach($matches[0] as $pattern){
		if(!array_key_exists($pattern, $usedPatterns)){
			$usedPatterns[$pattern]=true;
			// now try to catch last thing in text
			$pattern2 = substr($pattern, -3);
			if($pattern2 == "gif" || $pattern2 == "peg" || $pattern2 == "jpg" || $pattern2 == "png"){
				$text = str_replace($pattern, '<img src="'.$pattern.'" width="400" alt="Automagically inserted image!">', $text);
			}else{
				$text = str_replace($pattern, '<a href="'.$pattern.'">'.$pattern.'</a>', $text);
			}
		}
	}
	return $text;
}

Thanks for your time!

johanafm

Your code has no problem with multiple urls in the same text as far as I can tell. But you may run into problems when parsing html, because of naive matching and lack of context awareness.

For example

<div>
Have you seen this?
</div>
<div>
    <img src="http://example.com/thumb.jpg">
</div>

<div>
    The original image can be found <a href="http://example.com/images/image.jpg">here</a>!
</div>

would turn into

<div>
Have you seen this?
</div>
<div>
    <img src="<a href="http://example.com/thumb.jpg">">http://example.com/thumb.jpg"></a>
</div>

<div>
    The original image can be found <a href="<a href="http://example.com/images/image.jpg">here</a>!">http://example.com/images/image.jpg">here</a>!</a>
</div>

The first replacement happens because the first match for your pattern is this:

http://example.com/thumb.jpg">

Lack of context here refers to the fact that you are replacing text inside the src attribute of an img. Also note that the match includes the end delimiter of the src attribute (") and the end delimiter of the img tag (>), because the pattern keeps matching until it hits whitespace.

And the second match is against

http://example.com/images/image.jpg">here</a>!

Similarly, you replace things inside the href attribute of the anchor. Note that the match here also includes the following, shown together after the list:
- end delimiter for href attribute: "
- end delimiter for anchor start tag: >
- anchor text: here
- closing tag for anchor: </a>
- text immediately trailing the anchor end tag: !

">here</a>!

If these are the issues you refer to, then perhaps you should use PHP DOM to construct a DOM tree, and then only perform replacements in text nodes?
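
To see the two over-long matches for yourself, here is a small sketch that runs the pattern from the original function against a shortened version of the example markup. The variable $html and its contents are made up for the demonstration.

```php
<?php
// The pattern is copied unchanged from UrlsInText().
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";

// Demo input (assumption): one img tag and one anchor, as in the example above.
$html = '<img src="http://example.com/thumb.jpg">' . "\n"
      . '<a href="http://example.com/images/image.jpg">here</a>!';

preg_match_all($reg_exUrl, $html, $matches);

// \S* keeps consuming until the next whitespace, so each match
// runs past the closing quote of the attribute.
foreach ($matches[0] as $match) {
    echo $match, "\n";
}
// Prints:
// http://example.com/thumb.jpg">
// http://example.com/images/image.jpg">here</a>!
```

Neither match ends in gif, peg, jpg or png, which is why both get wrapped in an anchor rather than an img tag.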

schwim

Ooh, that's exactly what I was seeing!

<img src="<a href="http://example.com/thumb.jpg">">http://example.com/thumb.jpg"></a>

If these are the issues you refer to, then perhaps you should use PHP DOM to construct a DOM tree, and then only perform replacements in text nodes?

Forgive my ignorance, but what would this entail? I see from the php manual page:

The DOM extension allows you to operate on XML documents through the DOM API with PHP 5.

I don't understand at all what this means. I'd really love to modify this to handle these issues properly as the function gets used a bunch.

johanafm

In short (more detail on steps 1, 2 and 4 below):
1. Create a new DOM document: $doc = new DOMDocument; $doc->loadHTML($htmlString);
2. Write a function "traverse" that can walk any such tree rooted at $root.
3. To perform replacements in the entire document body, get the (first) body element from the DOM document, $body, and call traverse($body). The tree passed to traverse is then rooted at the body node.
4. Rewrite your existing string-replacing function to handle DOM nodes rather than just plain text.

1. The DOMDocument class will parse html code into a DOM tree representation, which you may then traverse. Also note that even if you pass only an html snippet such as

<div>text<img></div>

the DOMDocument will wrap it in a doctype plus html and body elements. This means that you can still use the first body element to process all contents.
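
A minimal sketch of step 1, assuming the snippet above as input. The call to libxml_use_internal_errors() is my addition; it only keeps parser warnings about sloppy markup from being printed.

```php
<?php
// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML('<div>text<img></div>');

// The parser has wrapped the snippet, so a body element exists
// even though the input never mentioned one.
$body = $doc->getElementsByTagName('body')->item(0);

// Passing a node to saveHTML() serializes just that subtree.
echo $doc->saveHTML($body), "\n";
```

Everything you originally passed in now lives under $body, so that is the node to hand to your traversal function.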

2. Traversing trees is so commonplace that the different methods of doing so have their own names (depth-first: pre-order, in-order, post-order, as well as breadth-first). It shouldn't take long to write the tree traversal code, but if you get stuck, the algorithms should be all over the net.

In order to know what to do, you also need to understand the structure of the DOM tree. Each node in the DOM tree is called a DOM node, and these can be of different types. Each such node has a nodeType property indicating its type. These are the nodeType values, their predefined PHP constants and their common names.

1 XML_ELEMENT_NODE - Element
2 XML_ATTRIBUTE_NODE - Attribute
3 XML_TEXT_NODE - Text
4 XML_CDATA_SECTION_NODE - CDATA section
5 XML_ENTITY_REF_NODE - Entity reference
7 XML_PI_NODE - Processing instruction
8 XML_COMMENT_NODE - Comment
9 XML_DOCUMENT_NODE - XML document
12 XML_NOTATION_NODE - Notation

Assuming you are working with HTML and that you do not need to deal with nested documents (in iframes), you should only have to deal with text nodes and element nodes.

Consider the DOM tree structure of this HTML code:

<div>
Have you seen this?
</div>
<div>
    <img src="http://example.com/thumb.jpg">
http://example.com/thumb.jpg
</div>

The top level consists of 3 nodes:
- XML_ELEMENT_NODE (div)
- XML_TEXT_NODE (white-space between divs)
- XML_ELEMENT_NODE (div)

The first div contains
- XML_TEXT_NODE (Have you seen this?)

Text nodes contain only text and thus can have no child nodes

The second div contains
- XML_TEXT_NODE (white-space)
- XML_ELEMENT_NODE (img with src="http://example.com/thumb.jpg")
- XML_TEXT_NODE (http://example.com/thumb.jpg)

By only processing text nodes for replacement, the above image node will no longer present a problem.
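
If you want to inspect this structure yourself, something like the following sketch lists the children of each div along with their type. The $names map and $summary array are just helpers for this demo. Note that the HTML parser may keep or drop some whitespace-only text nodes, so the exact list can differ slightly from the one above.

```php
<?php
libxml_use_internal_errors(true);

$html = <<<HTML
<div>
Have you seen this?
</div>
<div>
    <img src="http://example.com/thumb.jpg">
http://example.com/thumb.jpg
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);

// Readable names for the two node types that matter here.
$names = [XML_ELEMENT_NODE => 'element', XML_TEXT_NODE => 'text'];

$summary = [];
foreach ($doc->getElementsByTagName('div') as $div) {
    foreach ($div->childNodes as $child) {
        $type = $names[$child->nodeType] ?? 'other';
        // Show the tag name for elements and the trimmed text for text nodes.
        $label = $child->nodeType === XML_TEXT_NODE
            ? '"' . trim($child->nodeValue) . '"'
            : $child->nodeName;
        $summary[] = "$type: $label";
    }
}

echo implode("\n", $summary), "\n";
```

The URL in the src attribute never shows up as a text node, only the URL written between the tags does.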

The traverse function has to check the type of each node as it walks the tree. If it is of type text, it is a target for your url to image or anchor replacement.
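
One possible shape for such a traverse function is sketched below. It is a depth-first, pre-order walk that only collects the text nodes rather than changing them on the spot; the name collectTextNodes and the demo markup are my own choices, not anything prescribed by the DOM extension.

```php
<?php
/**
 * Walk the tree rooted at $root depth-first and return all text nodes
 * in document order. Attributes are never visited, so URLs inside
 * src or href values are left alone.
 */
function collectTextNodes(DOMNode $root): array
{
    $found = [];
    foreach ($root->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $found[] = $child;
        } elseif ($child->nodeType === XML_ELEMENT_NODE) {
            // Descend into elements; other node types (comments etc.) are skipped.
            $found = array_merge($found, collectTextNodes($child));
        }
    }
    return $found;
}

// Usage with a shortened version of the example markup (assumption).
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML(
    '<div>Have you seen this?</div>'
    . '<div><img src="http://example.com/thumb.jpg"> http://example.com/thumb.jpg</div>'
);
$body = $doc->getElementsByTagName('body')->item(0);

$texts = [];
foreach (collectTextNodes($body) as $textNode) {
    $value = trim($textNode->nodeValue);
    if ($value !== '') {
        $texts[] = $value;
    }
}
print_r($texts);
```

Collecting the nodes first and modifying them afterwards avoids changing a childNodes list while you are still looping over it. You would probably also want to skip descending into existing a, script and style elements, so that a URL already inside a link is not wrapped a second time.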

4. However, you will no longer be able to simply replace the text as you did previously. If you replace a text node containing 'http://example.com' with '<a href="http://example.com">example.com</a>' you have performed string replacement rather than inserting an actual anchor. I.e. when you eventually retrieve the html code, it will look like

&lt;a href=... &lt;

Rewrite your existing string-replacing function so that instead of simply replacing strings, it actually creates DOM nodes to represent the replacements. In short, it will have to be able to break up a text node into three new nodes: text node (text before replacement) + element node (anchor or image) + text node (text following replacement).
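
Here is a rough sketch of what that rewrite could look like. It is not the only way to do it: instead of splitting the text node by offset, it cuts the text into pieces with preg_split() and rebuilds them as real nodes in a document fragment. The function name replaceUrlsInTextNode is made up, and the URL pattern is a simplified stand-in for the original one that stops at whitespace, quotes and angle brackets (trailing punctuation such as a full stop would still be swallowed).

```php
<?php
/**
 * Replace one text node with a mix of text, <img> and <a> nodes.
 * Hypothetical helper; pattern and name are assumptions for this sketch.
 */
function replaceUrlsInTextNode(DOMText $textNode): void
{
    $doc = $textNode->ownerDocument;

    // With PREG_SPLIT_DELIM_CAPTURE the result alternates:
    // even index = plain text, odd index = matched URL.
    $parts = preg_split(
        '~((?:https?|ftps?)://[^\s<>"]+)~',
        $textNode->nodeValue,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );
    if ($parts === false || count($parts) < 2) {
        return; // no URL in this text node
    }

    $fragment = $doc->createDocumentFragment();
    foreach ($parts as $i => $part) {
        if ($part === '') {
            continue;
        }
        if ($i % 2 === 0) {
            // Text before, between or after URLs stays a text node.
            $fragment->appendChild($doc->createTextNode($part));
        } elseif (preg_match('~\.(?:gif|jpe?g|png)$~i', $part)) {
            $img = $doc->createElement('img');
            $img->setAttribute('src', $part);
            $img->setAttribute('width', '400');
            $img->setAttribute('alt', 'Automagically inserted image!');
            $fragment->appendChild($img);
        } else {
            $a = $doc->createElement('a');
            $a->setAttribute('href', $part);
            $a->appendChild($doc->createTextNode($part));
            $fragment->appendChild($a);
        }
    }

    // Swapping in the fragment inserts its children in place of the old node.
    $textNode->parentNode->replaceChild($fragment, $textNode);
}

// Usage on a made-up paragraph.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML('<p>See http://example.com/a.png and http://example.com/page now</p>');
$p = $doc->getElementsByTagName('p')->item(0);
replaceUrlsInTextNode($p->firstChild);
echo $doc->saveHTML($p), "\n";
```

Because the img and a elements are created through the document, they come out as real tags when you serialize, not as escaped text.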