I've got a large chunk of HTML, all in one big string.
I know there are parts of the HTML, like comments, that repeat multiple times. An example of this is "<!-- HEADLINE AND ARTICLE -->".
I'm trying to remove repeated pieces like this. (I'm indexing the page, and multiple identical sections throw the indexing off.)
I tried the basic approach:
preg_match_all('@(.{15,}).*\1@si', $page_data, $duplicates);
I would like it to match only runs of 15 or more characters, so it doesn't match multiple uses of "the" and things like that.
I've tried different flags and got different results, but nothing near what I'm looking for.
The code above produces NO matches (not to mention that it runs through a TON of HTML in about 1/3 of a second).
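For anyone who wants to reproduce this, here is a minimal sketch of the call on a tiny made-up sample (the sample string is an assumption for illustration, not my real data). It also checks the return value, because as far as I understand preg_match_all() returns false on an error such as hitting the backtrack limit, and that could look the same as "no matches":

```php
<?php
// Made-up sample input; the real $page_data is a much larger HTML string.
$page_data = "<!-- HEADLINE AND ARTICLE -->\n<h1>First</h1>\n"
           . "<!-- HEADLINE AND ARTICLE -->\n<h1>Second</h1>\n";

// (.{15,}) captures at least 15 characters; \1 requires the same text again later.
$count = preg_match_all('@(.{15,}).*\1@si', $page_data, $duplicates);

if ($count === false) {
    // false means the regex engine failed; preg_last_error() gives the reason,
    // for example PREG_BACKTRACK_LIMIT_ERROR on very large inputs.
    echo 'regex error code: ', preg_last_error(), "\n";
} else {
    // htmlspecialchars() keeps comments and tags visible when viewed in a browser.
    echo htmlspecialchars(print_r($duplicates[1], true)), "\n";
}
```

On the sample, the captured group should start with the repeated comment. Whether the same call succeeds on the full page presumably depends on the input size and the PCRE limits configured in php.ini.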
I'm not exactly sure how the backreferences work. With \\1 instead of \1, I get an array that looks like:
(
[0] => Array
(
[0] =>
[1] =>
[2] =>
[3] =>
[4] =>
[5] =>
[6] =>
[7] =>
[8] =>
[9] =>
[10] =>
[11] =>
[12] =>
[13] =>
[14] =>
[15] =>
)
[1] => Array
(
[0] =>
[1] =>
[2] =>
[3] =>
[4] =>
[5] =>
[6] =>
[7] =>
[8] =>
[9] =>
[10] =>
[11] =>
[12] =>
[13] =>
[14] =>
[15] =>
)
)
Can anyone help?