regex question

scrupul0us · Feb 26, 2007

im not sure how that gets me the data here:

DATA

i need the tag to be the ending delimiter there as the ending would include things i do not want

NogDog · Feb 26, 2007

That's called "moving target requirements." Let us know what the exact requirements are, then we can provide an exact solution.

scrupul0us · Feb 26, 2007

i reference post three and post five:

#3:
e.g if i wanted to match:
dataother html and line breaksdata

#5:
no... i want the parts called data:
DATA...(other html and linebreaks)...DATA

SO

VARIABLE 1...(other html and linebreaks)...VARIABLE 2

i need to get variable 1 and variable 2

nogdog... i like ya man, but if u need more clarification :rolleyes:

NogDog · Feb 26, 2007

Does the DATA end only at a </strong> or <br> tag, or at any <...> tag?

scrupul0us · Feb 26, 2007

just in those two unique instances: </strong> and <br>

theres nothing else i care about in the page

NogDog · Feb 26, 2007

See if this works (I've not tested it):


$pattern = '@<strong(?:\s[^>]*)?>(.*?)(?:</strong>|<br\s*/?>)@is';
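A minimal sketch of how a pattern like this might be applied with preg_match_all(). The markup in $html is made up for illustration (it is not the real page), and the second alternative is written as <br\s*/?> on the assumption that the intent is to stop at a <br> or <br /> tag.

```php
<?php
// Hypothetical sample markup, not taken from the real page.
$html = "<td><strong>1 - 31<br>\nCervical Health Awareness Month</strong></td>\n"
      . "<td><strong class=\"x\">5 - 11 <br />Some Other Week</strong></td>";

// Capture from an opening <strong ...> up to the first </strong> or <br> variant.
$pattern = '@<strong(?:\s[^>]*)?>(.*?)(?:</strong>|<br\s*/?>)@is';

preg_match_all($pattern, $html, $matches);

// $matches[1] holds the captured text of each match.
$data = array_map('trim', $matches[1]);
print_r($data);
```

Note that only the text up to the first delimiter after each <strong> is captured; anything between the <br> and the closing </strong> is skipped.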

scrupul0us · Feb 26, 2007

That worked very well... however, i am somewhat guilty of the 'moving target requirements'

really what im trying to do is parse this page

ideally my output would be a multi-dimensional array where each month's events would be loaded into the array

we'll use the first event as an example... i need:

-the numbers (1 - 31)
-the name (Cervical Health Awareness Month)
-the website (www.nccc-online.org/awareness.php)

i was trying to regex the page, but im horrible at this, and going through it one step at a time is sure to drive everyone crazy... so in short that's what im trying to do

Weedpacket · Feb 26, 2007

Why reinvent the parser? Why not just use the DOM extension's loadHTML() method?

scrupul0us · Feb 27, 2007

well im not sure if it makes a whole ton of difference, but if u look at the HTML its not exactly compliant and its working in quirks mode

ive honestly never used the loadHTML function or any of the DOM functions... that being said, im off to learn something new

scrupul0us · Feb 27, 2007

just using the basic function i get errors:

$html = file_get_contents("http://www.healthfinder.gov/library/nho/nho.asp?year=2007");
$dom = new DOMDocument;
$dom->loadHTML($html);

i get a TON of entity mismatch errors within the document b/c it isnt a compliant site

Weedpacket · Feb 27, 2007

Suppress the error messages with @. The document is still loaded and parsed (which it would have to be for the error messages to be generated). Most of those errors are corrected in the process. Some of the errors need to be cut down to make the code manageable. Who uses <font> tags these days? preg_replace('!</?font.*?>!','', $html). What's left should parse legibly. A lot of the grief comes from braindead constructs like ...
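As an alternative to @, libxml's own error buffer can be switched on so that parse warnings are collected rather than printed. A rough sketch, assuming a small made-up fragment in place of the real page (the fragment and the variable names are for illustration only):

```php
<?php
// Made-up fragment standing in for the real page.
$html = '<table><tr><td width="284" valign="top"><font size="2">'
      . '<strong>1 - 31 Example Awareness Month</strong></font></td></tr></table>';

// Strip opening and closing font tags before parsing.
$html = preg_replace('!</?font.*?>!i', '', $html);

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument;
$doc->loadHTML($html);

// Inspect the collected errors if wanted, then clear the buffer
// and restore the earlier setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

echo count($errors) . " parse problem(s) reported\n";
$title = $doc->getElementsByTagName('strong')->item(0)->nodeValue;
echo $title . "\n";
```

The advantage over @ is that the messages are still available in $errors if something needs debugging.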

Each item is in a td element that has attributes width="284" and valign="top". You're after the content of the strong element each contains. (Note I said "element", not "tag" - two different things.)

$itemXpath = new DOMXPath($doc);
$items = $itemXpath->query("//td[@width='284'][@valign='top']/strong");

The result of that is a DOMNodeList that can be traversed.

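$doc = new DOMDocument;
// Strip the font tags, then load the page with parse warnings suppressed.
$html = preg_replace('!</?font.*?>!','', $html);
@$doc->loadHTML($html);

// Select the strong element inside each item's td.
$itemXpath = new DOMXPath($doc);
$items = $itemXpath->query("//td[@width='284'][@valign='top']/strong");

// Print the text content of every matched element.
for ($i = 0; $i < $items->length; $i++) {
    echo $items->item($i)->nodeValue . "\n";
}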

The $items do contain more detail than is just contained in the nodeValue, but given the mess the code is to begin with, it may not be trivial to separate the date from the title.
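One possible way to pull the pieces apart, as a sketch only. It assumes (none of this is confirmed against the real page) that each strong element's text starts with a day or day range followed by the event name, and that the first link in the same td is the event's website. The markup in $html and the array keys 'days', 'name' and 'url' are made up for illustration.

```php
<?php
// Hypothetical markup; the real page's layout may differ.
$html = '<table><tr><td width="284" valign="top">'
      . '<strong>1 - 31<br>Cervical Health Awareness Month</strong><br>'
      . '<a href="http://www.nccc-online.org/awareness.php">www.nccc-online.org/awareness.php</a>'
      . '</td></tr></table>';

$doc = new DOMDocument;
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$events = array();
foreach ($xpath->query("//td[@width='284'][@valign='top']/strong") as $strong) {
    // Assumed shape of the text: a day or day range, then the event name.
    $text = trim($strong->nodeValue);
    if (!preg_match('/^(\d{1,2}(?:\s*-\s*\d{1,2})?)\s*(.+)$/s', $text, $m)) {
        continue; // text does not start with a date, so skip it
    }
    // Assumed: the first link inside the same td is the event's website.
    $link = $xpath->query('.//a/@href', $strong->parentNode)->item(0);
    $events[] = array(
        'days' => $m[1],
        'name' => trim($m[2]),
        'url'  => $link ? $link->nodeValue : null,
    );
}
print_r($events);
```

Grouping the events by month would need to key off whatever marks the month headings in the page, which this sketch does not attempt.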

scrupul0us · Feb 27, 2007

awesome dude... ill give it the ol' college try when i get home this evening.. thanks for the direction
