Need a regular expression to extract an article body

Anon

Hi everyone,

I need help forming a regular expression that can extract an article body. For example articles on ZDNet or Cnews ..etc..

I know that the body of an article is usually a continuous run of words. Take this article, for example:

http://www.zdnet.com/zdnn/stories/news/0,4586,5099135,00.html?chkpt=zdhpnews01

the beginning of an article is like this ..

"Intel will release new chips at the Comdex trade show, its first low-power designs for super-thin servers squeezed into cabinets by the dozens, a source familiar with the plan said."

the ending is like this..

"A lot of those guys were coming on stream just when the economy decided to take its downward turn," Brookwood said, but also, the power-saving difference just wasn't that big between Intel and Transmeta."

So, how do I extract everything in between, including the start and end blocks of text shown above? 🙂

Does anyone know how to tell PHP to start extracting the moment it sees a continuous stream of text? Characters such as < > , " ' ; would be the exceptions: they should be allowed too, because the article sometimes contains a paragraph tag or a break tag.

I need a generic solution that works for most articles.

thank you very very much 🙂

[deleted]

Cool, another vincent :-)

When will this forum start forcing unique usernames?

"For example articles on ZDNet or Cnews ..etc.. "

don't they have copyright?

Anyway, you can't just grab the quoted text, because there may be quotes inside the text too.

You'll have to find some HTML markers that define the text.

amcgrath

stop talking to ourselves and answer the damned question 😉 lol

Anon

mm... wait.. don't worry about copyright.. that's not the issue here, actually.

And I don't mean quoted text.
I only quoted it to show you guys that the article is one big chunk of text with not much HTML inside.

So can we use a regular expression to say: start extracting at a run of xxx characters of plain text (meaning no HTML or other rubbish), and keep extracting until the last run of xxx characters of plain text?

get it ? 🙂

anyway i check out some books..

take a look at this

^[a-zA-Z0-9 \f\r\t\n\r.-\$]{200}.*[a-zA-Z0-9 \f\r\t\n\r.-\$]{200}$

What do you guys think of that reg exp? I think it's wrong somewhere, but you can see what I'm trying to achieve here, right? 🙂

I'm basically telling it to extract something that begins with 200 characters that are each a-z, A-Z, a digit, whitespace, or one of . - $ and so on,
and that ends with another 200 characters from the same set.
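
A sketch of the same idea with preg_match(), assuming a PHP build with PCRE. In the pattern above, `.-\$` inside the brackets is read as a range running from `.` to `$`, which is out of order, so this version puts the hyphen last to make it a literal. It also drops the `^` and `$` anchors (a real page starts with HTML, not text) and adds the `s` modifier so `.` can cross newlines. The function name `extractTextRun` and the `$minRun` parameter are made up for illustration.

```php
<?php
// Hypothetical helper: returns the span that starts and ends with at least
// $minRun "plain text" characters, or null when the page has no such span.
function extractTextRun(string $text, int $minRun = 200): ?string
{
    // Letters, digits, whitespace, dot, dollar sign and hyphen.
    // The hyphen comes last so it is a literal, not a range operator.
    $class = '[a-zA-Z0-9 \f\r\t\n.$-]';
    $pattern = '/' . $class . '{' . $minRun . '}.*' . $class . '{' . $minRun . '}/s';

    if (preg_match($pattern, $text, $m) === 1) {
        return $m[0];
    }
    return null;
}
```

Because `.*` is greedy, the match runs from the first long text run to the last one on the page, so any long text block in a footer would be pulled in as well.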

🙂

Anon

So far I have managed to get to this:

$text = eregi_replace("<head[^<>]*>.*</head>"," ",$text);
$text = eregi_replace("<script[^>]*>.*</script>"," ",$text);
$text = eregi_replace("<style[^>]*>.*</style>"," ",$text);
$text = eregi_replace("(<[a-z0-9 ]+>)","\\1 ",eregi_replace("(</[a-z0-9 ]+>)","\\1 ",$text));

eregi("([a-z0-9\$!,\"\'\/ <>\f\r\t\n\r.-]{255}.*[a-z0-9\$,\"\'\/ <>\f\r\t\n\r.-]{255})",$text,$text2);

The first few lines just clean the page up a bit before I start extracting.
The important line is the last one. It works, but it's still not perfect. Does any expert want to make it better?

Oh, and it's kind of slow too.
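
For anyone trying this on a current PHP: the ereg family was deprecated in PHP 5.3 and removed in PHP 7, so the cleanup step needs preg_replace() there. A rough equivalent is sketched below; the function name `cleanPage` is made up, and unlike the code above it uses a non-greedy `.*?` so each match stops at the first closing tag, and it puts a space after every tag, including tags that carry attributes.

```php
<?php
// Hypothetical helper: a PCRE version of the cleanup step.
function cleanPage(string $html): string
{
    // Drop <head>, <script> and <style> elements together with their content.
    // i = case-insensitive, s = the dot also matches newlines.
    $html = preg_replace('#<(head|script|style)\b[^>]*>.*?</\1\s*>#is', ' ', $html) ?? $html;

    // Put a space after every remaining tag so that words from
    // neighbouring elements do not run together later on.
    return preg_replace('#(</?[a-z0-9][^>]*>)#i', '$1 ', $html) ?? $html;
}
```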

cordex

I've used the following code for a bit.

I think I stole most of the code from one of the Code Libraries, but I don't remember who wrote it ... sorry to whomever I am not giving credit to.

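function grabnews($rtrn_st = "")
{
    // "LOCATION" is a placeholder for the file path or URL of the page.
    if (!($myFile = fopen("LOCATION", "r"))) {
        // Return the notice instead of calling exit, which would stop the whole script.
        return "News is Being Updated, Please Be Patient.";
    }
    // Read the whole page into one string.
    $myLine = "";
    while (!feof($myFile)) {
        $myLine .= fgets($myFile, 255);
    }
    fclose($myFile);
    $start = "UNIQUE TEXT AT BEGINNING OF ARTICLE (look for a unique set of HTML tags that precede it or something)";
    $end = "UNIQUE TEXT AT END OF ARTICLE";
    // strpos() returns false when a marker is missing; fall back to the default then.
    $start_position = strpos($myLine, $start);
    if ($start_position === false) {
        return $rtrn_st;
    }
    $start_position += strlen($start);
    // Look for the end marker only after the start marker.
    $end_position = strpos($myLine, $end, $start_position);
    if ($end_position === false) {
        return $rtrn_st;
    }
    // Keep only the text between the two markers (the markers are plain text, not regexes).
    $tick_st = substr($myLine, $start_position, $end_position - $start_position);
    // I used this section to kill excess HTML and such
    return ($tick_st);
}// end grabnews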

Anon

thanks a lot ben for the code.. 🙂

but my problem is, there is no unique text to indicate beginning and ending of an article 🙂 ..

which is why i need some help to form a regular expression..

The only way I can think of to detect an article is that it will have block after block of text with not much HTML in between, so I'm using that as the signal for what to extract. 🙂
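
One way to sketch that "blocks of text" idea without a single giant regular expression: cut the page at block-level tags, strip what is left of the markup, and keep everything from the first long block to the last long block. This is only a heuristic under stated assumptions; the function name `guessArticleBody`, the tag list, and the 200-character threshold are all made up for illustration and would need tuning per site.

```php
<?php
// Hypothetical heuristic: the article is taken to be the stretch from the
// first block of at least $minChars characters to the last such block.
function guessArticleBody(string $html, int $minChars = 200): string
{
    // Remove script and style content; strip_tags() would keep it as text.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', ' ', $html) ?? $html;

    // Cut at block-level tags. Inline tags such as <b> or <a> stay inside a block.
    $chunks = preg_split('#</?(?:p|div|td|tr|table|ul|ol|li|h[1-6]|br|hr)\b[^>]*>#i', $html) ?: [];

    $blocks = [];
    foreach ($chunks as $chunk) {
        // Drop the remaining inline tags and collapse runs of whitespace.
        $text = trim((string) preg_replace('/\s+/', ' ', strip_tags($chunk)));
        if ($text !== '') {
            $blocks[] = $text;
        }
    }

    // Find the first and the last block that is long enough.
    $first = null;
    $last = null;
    foreach ($blocks as $i => $text) {
        if (strlen($text) >= $minChars) {
            if ($first === null) {
                $first = $i;
            }
            $last = $i;
        }
    }
    if ($first === null) {
        return '';
    }

    // Short blocks between two long ones (a one-line quote, say) are kept.
    return implode("\n\n", array_slice($blocks, $first, $last - $first + 1));
}
```

Working block by block also avoids the heavy backtracking that a pattern of the form "long run, then `.*`, then long run" can cause on a large page.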

But it still ain't perfect... arrghh..

🙂

[deleted]

Shouldn't you be looking for markers that define where the text starts, like a table cell? If the layout of the page is always the same, there must be something that always precedes the text. Then you can grab the content from there.

PS. and you should worry about copyright. Not because I am a saint or anything, but because the fines for publishing copyrighted material can get high. Very high. To the point where you have to close your website because of it.

Anon

I'm building a search engine.
So if copyright is a problem for me, it would be a problem for Google and AltaVista too 🙂

Anyway, as I said before, I'm trying to get one regular expression that will do the job for any article,
so there isn't much I can rely on as a unique marker.
And even if there is one, it will be different for each site, and then you would need one reg exp per site.

So the only way I can think of, as I mentioned before, is to scan for a block of text and then keep extracting until the blocks of text stop.. unless someone out there can think of something better.. hehe

[deleted]

If it's for a search engine, then why bother to grab only the article text? Just grab the full text, strip all the HTML tags and put the entire text in the engine. You're only looking for matches so the bit of extra content doesn't matter.

As for the displaying of a 'preview' on the search results page, use what google etc do, search for the keywords and display that part of the text that has the keywords in it. Nobody will ever know you also grabbed the rest.
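
A minimal sketch of that keyword preview, assuming the page has already been fetched into a string. The function name `keywordSnippet` and the `$radius` parameter are made up for illustration; the search is case-insensitive and works on bytes, so multibyte text would need the mb_* functions instead.

```php
<?php
// Hypothetical helper: returns roughly $radius characters of context on each
// side of the first occurrence of $keyword, or null if it does not occur.
function keywordSnippet(string $html, string $keyword, int $radius = 60): ?string
{
    if ($keyword === '') {
        return null;
    }
    // Strip the markup and collapse whitespace before searching.
    $text = trim((string) preg_replace('/\s+/', ' ', strip_tags($html)));

    $pos = stripos($text, $keyword);
    if ($pos === false) {
        return null;
    }

    $start = max(0, $pos - $radius);
    $length = ($pos - $start) + strlen($keyword) + $radius;
    $snippet = substr($text, $start, $length);

    // Mark the sides where text was cut off.
    $prefix = $start > 0 ? '...' : '';
    $suffix = ($start + $length) < strlen($text) ? '...' : '';
    return $prefix . $snippet . $suffix;
}
```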

Anon

well, looks like you are trying to ask me to ignore the problem 🙂

Anyway, the more important part is actually the preview, or summary 🙂 I want to display the first 250 characters of the article ACCURATELY, which is why I need to extract the exact article, so that I can cut off the top part and use it as the summary.
Then I can also index the article accurately, rather than picking up extra keywords that would cause inaccuracy.. 🙂
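
Once the article text is isolated, the summary step itself is short. A sketch, assuming the text is single-byte (multibyte text would need mb_substr() and friends); the function name `makeSummary` is made up, and the cut is moved back to the last space so a word is not chopped in half.

```php
<?php
// Hypothetical helper: first $limit characters of the article, cut at a word boundary.
function makeSummary(string $article, int $limit = 250): string
{
    // Strip any leftover tags and collapse whitespace.
    $text = trim((string) preg_replace('/\s+/', ' ', strip_tags($article)));
    if (strlen($text) <= $limit) {
        return $text;
    }

    $cut = substr($text, 0, $limit);
    // Back up to the last space so the summary ends on a whole word.
    $lastSpace = strrpos($cut, ' ');
    if ($lastSpace !== false) {
        $cut = substr($cut, 0, $lastSpace);
    }
    return $cut . '...';
}
```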

[deleted]

I'm just wondering if there is a problem in the first place. I think you are doing a lot of work for nothing.

Anyway, try something like this:
preg_match_all("/(([a-zA-Z0-9]+?[ ,.-\/\'`\"\$]+){10,})/s", $sText, $aMatches, PREG_PATTERN_ORDER);

It will help to use strip_tags() to get rid of all the HTML first.
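
A sketch of how the two steps could fit together: strip the tags, collect every run of "word plus separator" pairs, and keep the longest run. The function name `longestTextRun` is made up, and the pattern is a variation on the one above rather than the same expression: the hyphen is placed last in the class so it is a literal, and `\s` is added so line breaks count as separators.

```php
<?php
// Hypothetical helper: longest run of at least ten "word + separator" pairs
// in the tag-free text, or an empty string when there is no such run.
function longestTextRun(string $html): string
{
    $sText = strip_tags($html);

    // Single-quoted so that $ and " need no PHP escaping.
    preg_match_all('/(?:[a-zA-Z0-9]+[ ,.\/\'`"$\s-]+){10,}/', $sText, $aMatches);

    $best = '';
    foreach ($aMatches[0] as $run) {
        if (strlen($run) > strlen($best)) {
            $best = $run;
        }
    }
    return trim($best);
}
```

Note that strip_tags() leaves no space where a tag used to be, so text from adjacent elements can end up glued together unless spaces are added after the tags first.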

Anon

thanks a lot vince 🙂 will try it out ...

Well, I will let you know the details of my project later if you are interested 🙂

Please email me so I can contact you directly, if you don't mind. I will still need to bug you once in a while, but if you'd rather not, then never mind.. 🙂

I'm working on my final year project anyway. It's an article search engine.

Anon

yo vince... the reg exp you gave me has some problem...

Parse error: parse error, expecting `','' or `';''

mm.. but I'm not sure which part is causing the problem..
🙂

[deleted]

Probably something other than my regexp, because that worked in my test :-)

btw you can email me through phpbuilder