I'm currently working on a PHP script (for a newspaper site) that searches an index for matching articles. It then grabs a small portion of the story that shows where the match occurred in the story body.
The code works perfectly, except for one scenario. There are typically no right answers with programming (meaning, an end result can be derived by many different means), but some solutions are definitely better than others.
For example, in the newspaper realm, users often search for articles (or obits, weddings and such) that contain specific names. Let's say a user searches the site's archives for "john doe."
The script runs the query, finds the articles, and begins extracting "snippets" of the story (a fixed number of words to the left and right of the queried word) wherever a query word was found (think Google or Yahoo results).
In this example, let's say this is a portion of the story in the archive:
"The city council unanimously agreed with John Doe's proposal to add street lights to the intersection of first and main."
My search code first looks for the word "John" (first word in the query) and grabs something like this:
"...agreed with John Doe's proposal..."
It then looks at the second word in the query, "Doe," and grabs this:
"...with John Doe's proposal to..."
You see, there is redundancy here. If I display the results together, it reads like this:
"...agreed with John Doe's proposal... ...with John Doe's proposal to..."
It is highly redundant and looks very stupid.
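To make the problem concrete, here is a stripped-down sketch of the per-word extraction described above. This is not my actual code: the function name `snippet_for_term`, the substring matching, and the two-word radius are all made up for illustration.

```php
<?php
// Hypothetical helper (illustration only): returns a snippet of $radius words
// on each side of the first word that contains $term, case-insensitively.
function snippet_for_term(string $body, string $term, int $radius): ?string
{
    // Split the story body into words on whitespace.
    $words = preg_split('/\s+/', trim($body), -1, PREG_SPLIT_NO_EMPTY);

    foreach ($words as $i => $word) {
        if (stripos($word, $term) !== false) {
            // Clamp the window to the bounds of the story.
            $start = max(0, $i - $radius);
            $end   = min(count($words) - 1, $i + $radius);
            $slice = array_slice($words, $start, $end - $start + 1);
            return '...' . implode(' ', $slice) . '...';
        }
    }

    return null; // term not found in this story
}

$body = "The city council unanimously agreed with John Doe's proposal "
      . "to add street lights to the intersection of first and main.";

// Each query word is handled on its own, so overlapping windows are never merged.
$snippets = [];
foreach (['John', 'Doe'] as $term) {
    $snippet = snippet_for_term($body, $term, 2);
    if ($snippet !== null) {
        $snippets[] = $snippet;
    }
}

echo implode(' ', $snippets), "\n";
```

Because each query word gets its own window, the two snippets cover almost the same words, which is exactly the redundancy I'm trying to get rid of.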
I'm looking for a clean but scalable solution. In other words, I would like an approach that works for any number of words in the query, not just two.
How would you suggest I approach this?