Splitting HTML text into paragraphs/array - regexp

bunner_bob · Mar 24, 2008

I'm using preg_split to split HTML text composed of multiple paragraphs into an array of separate paragraphs, minus the paragraph tags. The way I'm doing it feels clunky (though it DOES work) - I figure there must be a more elegant approach. I tried preg_match_all but I ended up with an array of arrays - not sure why but I must have done something wrong.

Here's how I have been doing it:

$this->text = preg_split("/<p>/", $text, -1, PREG_SPLIT_NO_EMPTY);

foreach ($this->text as $k => $v)
{
	$this->text[$k] = preg_replace("/<\/p>/", '', $v); // get rid of end "p" tags
}

This was my attempt with preg_match_all:

$pattern = "/<p>.*<\/p>/";

preg_match_all($pattern,  $text, $this->text);

But instead of netting me an array where each element was a paragraph, it gave me an array with one element, and THAT element was an array where each element was a paragraph - i.e. it nested it one extra level deep. Also, I'd love to strip off the <p> tags in the process. I tried parentheses around the "contents" - i.e. (.*) - but that gave me another nested array element. Obviously I don't understand preg_match_all.

I'm thinking preg_split is the way to go but I wish it were possible to strip off the end </p> tags at the same time.

Any tips?
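
For what it's worth, the one-pass preg_split the question asks about can be sketched by splitting on both the opening and the closing tag at once. This is only a rough sketch: the sample $text is made up for illustration, and the pattern assumes plain <p> tags without attributes.

```php
<?php
// Hypothetical sample input, not taken from the original post.
$text = "<p>First paragraph.</p>\n<p>Second paragraph.</p>";

// Split on either <p> or </p>, swallowing whitespace around the tags.
// PREG_SPLIT_NO_EMPTY drops the empty pieces left between adjacent tags.
$paragraphs = preg_split('#\s*</?p>\s*#', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($paragraphs);
```

Because the tags themselves are the delimiters, no second preg_replace pass is needed. A tag such as <p class="x"> would not be matched by this pattern.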

NogDog · Mar 24, 2008

The multi-dimensional array is the way preg_match_all() works: the full-pattern matches are in an array under the [0] index, the matches of the first sub-pattern under the [1] index, and so forth. You could therefore do:

preg_match_all('#<p>(.*)</p>#Us', $text, $matches); #U modifier: ungreedy "*", s: '.' includes newline
$this->text = $matches[1]; // matches of the first parenthesized sub-pattern
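
To make the layout of $matches concrete, here is a small sketch run on a made-up two-paragraph string (the sample text and the variable names $withTags and $paragraphs are assumptions for illustration only):

```php
<?php
// Hypothetical sample input, not from the original post.
$text = "<p>First paragraph.</p>\n<p>Second\nparagraph.</p>";

preg_match_all('#<p>(.*)</p>#Us', $text, $matches);

// $matches[0] holds the full-pattern matches (tags included),
// $matches[1] holds what the first parenthesized group captured.
$withTags   = $matches[0];
$paragraphs = $matches[1];

print_r($withTags);
print_r($paragraphs);
```

This default layout is PREG_PATTERN_ORDER. Passing PREG_SET_ORDER as the fourth argument groups the results per match instead, which is still nested, just the other way around.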

halojoy · Mar 24, 2008

Hi,
I have used this in a script of mine.
At first I had the same trouble as you.

<?php // code by halojoy 2008-03-24

// get a match of at least 1 char, none of which is '<',
// located between <p> and </p>

$find = '#<p>([^<]+)</p>#'; // a PCRE pattern needs delimiters; '#' avoids escaping the '/'


//////  Another way for any tag //////

// find contents between TD tags in HTML CODE
$code = 'my html table';

$s1 = '<td>';                  // any opentag, this case TD
$s2 = '</' . substr( $s1, 1 ); // the matching closing tag to $s1: drop the '<', giving '</td>'

// search for this
$find = '#' . $s1 . '([^<]+)' . $s2 . '#';

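// first match
preg_match($find, $code, $pm);
$first_match = isset($pm[1]) ? trim( $pm[1] ) : null;  // trim away any spaces; null if nothing matched

// all matches
preg_match_all($find, $code, $pm_all);
$all_matches = $pm_all[1]; // [1] holds the captured contents, [0] the full matches with tags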

// DISCLAIMER!
// unfortunately, will not give correct result if the text between TD-tags 
// has some other <tag> inside for example:
//    <td> here is some <b>text</b> with one tag in   </td>
// ... and is CASE sensitive:   <TD>will not be found ....

?>

As you can see, this will not work for every situation.
But my $find can be fixed.
Maybe somebody can help us make a better PREG.

What is needed is a $find that takes everything until the correct matching closing tag is found.
Search for:
<td> or <TD>
get ( anything that is not: </td> or </TD> )
</td> or </TD>

Regards
halojoy
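
One possible way to get the $find described above is a lazy match plus the i and s modifiers. This is a sketch rather than a complete HTML parser: the sample $code is invented for illustration, and nested tables (a <td> inside a <td>) would still confuse it.

```php
<?php
// Hypothetical sample markup for illustration.
$code = '<table><tr>'
      . '<td> here is some <b>text</b> with one tag in   </td>'
      . '<TD class="x">second</TD>'
      . '</tr></table>';

// i: case-insensitive, so <TD> matches as well as <td>
// s: "." also matches newlines
// [^>]* allows attributes inside the opening tag
// (.*?) is lazy, so it stops at the first closing </td> it meets
$find = '#<td\b[^>]*>(.*?)</td>#is';

preg_match_all($find, $code, $pm_all);
$all_matches = array_map('trim', $pm_all[1]); // captured cell contents, trimmed

print_r($all_matches);
```

Inner tags such as <b> are kept as part of the captured text, since the lazy (.*?) only stops at the closing </td>.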

bunner_bob · Mar 24, 2008

NogDog wrote:

preg_match_all('#<p>(.*)</p>#Us', $text, $matches); #U modifier: ungreedy "*", s: '.' includes newline

Hey - that's nice! I guess I had some aversion to only using "part" of an array (don't want to waste that code - kids on the other side of the world going without and all...)

I assume you're using # as your pattern delimiter so you don't have to escape the / before the p.

Adding the "s" made it "actually work" for me - I was trying to figure out how to do that. Basically it changes the behavior of "." within the expression, even though it is outside the expression itself. Is that right?

At first I didn't understand the U ungreedy modifier. It seemed to work fine whether I had it in there or not. But I realized it wasn't splitting it into multiple paragraphs, rather just considering the whole thing a single paragraph. Since I was re-paragraphing after my processing, the end result looked the same either way until I tried to do something that counted the number of paragraphs, and I realized there was only one.

Thanks again - very very helpful!

Bob
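
The effect of the U modifier described above can be checked by counting matches with and without it. A minimal sketch on an assumed three-paragraph string:

```php
<?php
// Hypothetical sample input with three paragraphs.
$text = "<p>One</p>\n<p>Two</p>\n<p>Three</p>";

// Greedy: ".*" runs to the LAST </p>, so everything is one big match.
$greedyCount = preg_match_all('#<p>(.*)</p>#s', $text, $greedy);

// Ungreedy (U): ".*" stops at the FIRST </p>, giving one match per paragraph.
$lazyCount = preg_match_all('#<p>(.*)</p>#Us', $text, $lazy);

echo $greedyCount, "\n"; // 1
echo $lazyCount, "\n";   // 3
```

Writing the group as (.*?) without the U modifier has the same effect for this pattern, since the ? makes just that one quantifier lazy.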
