I've run into a tricky (for me) regular expression problem. I'm working with text that's been formatted by TinyMCE into regular HTML. The text will typically contain a bunch of <p>-wrapped text, but possibly also some <ul> and <ol> sets.
I need to split the text into an array of chunks of text, where each chunk is either a paragraph (without the <p> tags) or a ul or ol (WITH the <ul>/<ol> tags). From this I do further formatting before delivering to templates.
So for example, source text might look like:
<p>First paragraph.</p>
<ul>
<li>List item one</li>
<li>List item two</li>
</ul>
<p>A new paragraph.</p>
And the array after splitting would need to look like:
$text = array(
'First paragraph.',
'<ul>
<li>List item one</li>
<li>List item two</li>
</ul>',
'A new paragraph.');
I've been using preg_match_all to break apart everything within <p> tags. To have this work on <ul>s I must first ADD <p> tags to the <ul>s. So I've been using this function:
function splitText($text)
{
    // First wrap each ul and ol in extra p tags so the next step doesn't skip them.
    $pattern = array("/<ul>(.+)<\/ul>/Us", "/<ol>(.+)<\/ol>/Us");
    $replace = array("<p><ul>$1</ul></p>", "<p><ol>$1</ol></p>");
    $text = preg_replace($pattern, $replace, $text);
    // Grab paragraph contents. "." matches any character; the s modifier lets "." match newlines too;
    // the U modifier makes the match ungreedy, otherwise the whole text block is treated as a single paragraph.
    preg_match_all("/<p>(.+)<\/p>/Us", $text, $postText);
    return $postText[1];
}
The problem is, the first pattern is too ungreedy when it comes to nested <ul>s - it seems to bail out at the first nested </ul>.
I tried removing the "U", making it greedy, but then I run into a problem if I have several sets of <ul>s with <p> paragraphs in between - it treats the several sets as one big nested-ul-with-paragraphs-inside-it.
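To make the two failure modes concrete, here is a small reproduction. The sample string is made up for illustration (one nested list, a paragraph, then a second list); the two patterns are the ungreedy one from my function and the same pattern with the "U" removed.

```php
<?php
// Hypothetical sample input: one nested list, a paragraph, then a second list.
$sample = '<ul><li>a<ul><li>b</li></ul></li></ul><p>x</p><ul><li>c</li></ul>';

// Ungreedy (U): each match stops at the first closing ul tag it meets.
preg_match_all('/<ul>(.+)<\/ul>/Us', $sample, $ungreedy);

// Greedy (no U): a single match runs from the first opening ul to the last closing ul.
preg_match_all('/<ul>(.+)<\/ul>/s', $sample, $greedy);

print_r($ungreedy[0]);
print_r($greedy[0]);
```

The ungreedy version returns the outer list cut off right after the inner </ul>, plus the second list. The greedy version returns the entire string as one match, paragraph included.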
What I need is something in between - that will "know" when it's arrived at the closing </ul> of a set of nested <ul>s, without continuing on and including subsequent <ul>s.
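One direction that might give this "in between" behaviour is PCRE's recursive subpattern syntax, where (?2) re-enters capture group 2 so that a list can contain further lists. The following is only a rough sketch, not a vetted solution: the function name splitTextRecursive is made up, and it assumes bare <ul>/<ol> tags without attributes (as in my patterns above) and properly nested markup. Any text outside a <p> or a list is simply skipped.

```php
<?php
// Sketch only: splitTextRecursive is a hypothetical name, and the pattern
// assumes attribute-free ul/ol tags and well-formed nesting.
function splitTextRecursive(string $html): array
{
    $pattern = '~
          <p>(.*?)</p>                  # group 1: paragraph contents, tags excluded
        | (                             # group 2: one complete, balanced list
            <(?:ul|ol)>
            (?:
                [^<]++                  # plain text up to the next tag
              | (?2)                    # a nested list: recurse into group 2
              | <(?!/?(?:ul|ol)>)       # a "<" that does not start a list tag
            )*+
            </(?:ul|ol)>
          )
    ~xs';

    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $chunks = [];
    foreach ($matches as $m) {
        // Group 2 is non-empty only when the list branch matched.
        $chunks[] = ($m[2] ?? '') !== '' ? $m[2] : $m[1];
    }
    return $chunks;
}

// Hypothetical input: nested list, paragraph, second list.
$sample = '<ul><li>a<ul><li>b</li></ul></li></ul><p>x</p><ul><li>c</li></ul>';
print_r(splitTextRecursive($sample));
```

With that sample the nested list should come back as one chunk, followed by the paragraph text and then the second list. If the markup can carry attributes or be malformed, a real HTML parser such as PHP's DOMDocument is probably the safer route than any regex.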
Any of you regexp wizards out there have a tip or two you'd like to share?