Advice concerning site search 'context extraction'

tisource

I'm currently working on a PHP script (for a newspaper site) that searches an index for matching articles. It then grabs a small portion of the story that shows where the match occurred in the story body.

The code works perfectly, except for one scenario. There are typically no right answers with programming (meaning, an end result can be derived by many different means), but some solutions are definitely better than others.

For example, in the newspaper realm, users often search for articles (or obits, weddings and such) that contain specific names. Let's say a user searches the site's archives for "john doe."

The script runs the query, finds the articles, and begins extracting "snippets" of the story (so many words to the left and right of the queried word) where the query word was found (think Google or Yahoo).

In this example, let's say this is a portion of the story in the archive:

"The city council unanimously agreed with John Doe's proposal to add street lights to the intersection of first and main."

My search code first looks for the word "John" (first word in the query) and grabs something like this:

"...agreed with John Doe's proposal..."

It then looks at the second word in the query, "Doe", and grabs this:

"...with John Doe's proposal to..."

You see, there is redundancy here. If I display the results together, it reads like this:

"...agreed with John Doe's proposal"..."with John Doe's proposal to..."

It is highly redundant and looks very stupid.

I'm looking for a clean but scalable solution. In other words, I would like an approach that works no matter how many words are in my query.

How would you suggest I approach this?

sfullman

Hmm... computers are stupid, so you're almost going to have to know the string positions to merge those together. It should actually be very simple: say string 1 (the one with John) starts at position 1000 in the story and string 2 starts at 1006. Just write a function to merge them.
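Here is a rough sketch of that merge idea, not a drop-in solution. It assumes you already record each snippet as a [start, end) pair of character offsets into the story body. The function names mergeRanges and buildSnippets are made up for illustration, and the code assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: merge overlapping [start, end) character ranges.
function mergeRanges(array $ranges): array
{
    // Sort by start offset so overlapping ranges end up next to each other.
    usort($ranges, function ($a, $b) {
        return $a[0] <=> $b[0];
    });

    $merged = [];
    foreach ($ranges as [$start, $end]) {
        $last = count($merged) - 1;
        if ($last >= 0 && $start <= $merged[$last][1]) {
            // Overlaps or touches the previous range: extend that range.
            $merged[$last][1] = max($merged[$last][1], $end);
        } else {
            // No overlap: start a new range.
            $merged[] = [$start, $end];
        }
    }
    return $merged;
}

// Hypothetical helper: cut one snippet per merged range out of the body.
function buildSnippets(string $body, array $ranges): string
{
    $parts = [];
    foreach (mergeRanges($ranges) as [$start, $end]) {
        $parts[] = '...' . trim(substr($body, $start, $end - $start)) . '...';
    }
    return implode(' ', $parts);
}
```

With the positions from the example, the ranges starting at 1000 and 1006 collapse into one range, so the shared text is shown only once. Because the merge works on a list of ranges, it does not care how many query words produced them.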

If this helps, send me some of your code; I'd be interested in seeing your approach.

Another, slower method is word splits:

$array=preg_split('/\s+/',$string);

You then iterate through the array, but the splitting will cost you time-wise if you get too fancy.
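A minimal sketch of the word-split variant follows, again only as an illustration. The function name wordSnippets and the $radius parameter (words of context on each side) are assumptions, not anything from the original script. Matching is a plain case-insensitive substring test, so "John" would also hit "Johnson", and the code assumes PHP 7.1 or later and non-empty query words.

```php
<?php
// Hypothetical helper: word-window snippets with overlapping windows merged.
function wordSnippets(string $body, array $queryWords, int $radius = 2): string
{
    $words = preg_split('/\s+/', trim($body));
    $lastIndex = count($words) - 1;

    // Collect a [first, last] word-index window around every matching word.
    $windows = [];
    foreach ($words as $i => $word) {
        foreach ($queryWords as $needle) {
            if (stripos($word, $needle) !== false) {
                $windows[] = [max(0, $i - $radius), min($lastIndex, $i + $radius)];
                break; // one window per word position is enough
            }
        }
    }

    // Windows are already ordered by position, so merge them in one pass.
    $merged = [];
    foreach ($windows as [$first, $last]) {
        $n = count($merged) - 1;
        if ($n >= 0 && $first <= $merged[$n][1] + 1) {
            // Overlapping or adjacent windows become one window.
            $merged[$n][1] = max($merged[$n][1], $last);
        } else {
            $merged[] = [$first, $last];
        }
    }

    // Turn each merged window back into text.
    $parts = [];
    foreach ($merged as [$first, $last]) {
        $parts[] = '...' . implode(' ', array_slice($words, $first, $last - $first + 1)) . '...';
    }
    return implode(' ', $parts);
}
```

For the council sentence in the question and the query words "john" and "doe", the two windows overlap and the result is a single snippet: "...agreed with John Doe's proposal to...".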

Sam

tisource

Splicing the two strings together wasn't something I had considered.

I have actually considered eliminating the second string if other query words were in string one.

In other words, "Doe" is found in extracted string 1, so kill string 2, because it's not needed... but that isn't necessarily the cleanest or most efficient approach.
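For reference, the elimination idea could be sketched roughly as below. It assumes one extracted snippet per query word, passed in as a word => snippet array. The name dropCoveredSnippets is hypothetical, and the code assumes PHP 7.0 or later.

```php
<?php
// Hypothetical helper: drop a snippet when its query word already
// appears in a snippet that was kept earlier.
function dropCoveredSnippets(array $snippetsByWord): array
{
    $kept = [];
    foreach ($snippetsByWord as $word => $snippet) {
        $covered = false;
        foreach ($kept as $existing) {
            // Cast the key: PHP turns numeric string keys into integers.
            if (stripos($existing, (string) $word) !== false) {
                $covered = true;
                break;
            }
        }
        if (!$covered) {
            $kept[] = $snippet;
        }
    }
    return $kept;
}
```

It handles any number of query words, but it throws away the extra context from the dropped snippet (the trailing "to" in the example), and it only checks that the word appears somewhere in a kept snippet, not that it is the same occurrence in the story.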