My apologies for the lack of information.
As you might have guessed, this is a website for a newspaper. The pages (hundreds, mind you) are in basic html format. The front page news, for example, runs all of its stories in one html doc (as opposed to a separate doc for each story).
Each section is separated by html comments like <!-- ARTICLE 1 --> (and, by the way, the same comment appears at both the top and the bottom of each article). Between those html comments are the article's headline, lead, and story information (including photos, etc).
The php script is supposed to parse this news page and generate a web page containing a preview of each story (a headline, a thumbnail of a photo, and a lead into the story). If the user wants to read further, they click a "read more" link that takes them to the html page.
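Just to make the goal concrete, the preview page could be assembled from each article's pieces once they've been pulled out. This is only a rough sketch under my assumptions about the data (the array keys, function name, and link target are made up for illustration):

```php
<?php
// Hypothetical sketch: turn one article's pieces into a preview block.
// $article keys and $pageUrl are assumptions, not the actual script.
function makePreview(array $article, string $pageUrl): string
{
    $headline = htmlspecialchars($article['headline']);
    $lead     = htmlspecialchars($article['lead']);
    $thumb    = htmlspecialchars($article['thumbnail']);

    return "<div class=\"preview\">\n"
         . "  <h2>$headline</h2>\n"
         . "  <img src=\"$thumb\" alt=\"\">\n"
         . "  <p>$lead <a href=\"$pageUrl\">read more</a></p>\n"
         . "</div>\n";
}
?>
```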
Traditionally, I have managed this "preview" page on my own...meaning it was a separate html document (I simply do the copy-and-paste routine from the news page). I am trying to automate that process and effectively eliminate the time spent on that page.
These html comments in the news page were put there to make editing easier (I work strictly in html, and won't use a wysiwyg editor). After we started heavily implementing php on the site, I realized I didn't have much in these news pages to distinguish each article, besides the html tags.
I can parse each story individually (extract the headline and story info) if I can first extract just the content between the html comments, but it becomes difficult if I'm scanning for headlines in the entire page (with all the articles).
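In case it helps to see the idea, here's a minimal sketch of that first step, assuming the comments really are identical above and below each article (e.g. <!-- ARTICLE 1 --> ... <!-- ARTICLE 1 -->). A back-referencing regex can grab just the content between each matching pair, so the headline scanning only ever sees one article at a time (the file name is an assumption):

```php
<?php
// Minimal sketch, assuming identical "<!-- ARTICLE n -->" comments open
// and close every article. 'frontpage.html' is a placeholder name.
$html = file_get_contents('frontpage.html');

// \1 refers back to the captured article number, so the closing comment
// must match the opening one. The "s" modifier lets "." span newlines.
preg_match_all(
    '/<!--\s*ARTICLE\s+(\d+)\s*-->(.*?)<!--\s*ARTICLE\s+\1\s*-->/s',
    $html,
    $matches,
    PREG_SET_ORDER
);

foreach ($matches as $m) {
    $articleNumber = $m[1]; // e.g. "1"
    $articleHtml   = $m[2]; // everything between the two matching comments

    // From here the headline, lead, photo, etc. can be pulled out of
    // $articleHtml alone, instead of scanning the whole page.
}
?>
```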
Long story....whew! Hope that explains my situation better. I don't want to go back and change the html pages, because there are hundreds of them.
Thanks for the help thus far.