If you are serious about serving relevant search results at high speed, you should look at search indexers. There are open-source solutions that should give you results in a fraction of a second for the amount of data you describe; Lucene and Sphinx are just two options. Google around and read up on the different solutions. Apache Solr (which uses Lucene) might do the trick for you, since you can query it over HTTP using XML or JSON.
Installing and configuring a new system takes some time, though. So you might instead try what I describe below first, since it requires little work to test. I don't know whether it would improve speed at all, so you'd have to try it out and see what happens. Also note that once you've processed all existing texts as outlined below, you'll have to do the same every time a new text is stored. But if it gives fast enough search results, the one-time cost of processing each text should be acceptable.
- Go through all texts that need to be searchable and break each one up into individual words. Store those words in a separate words table, where "all" most likely need not mean every word: for one, you can omit any word of 3 characters or less, which matches what MySQL's full-text index skips by default anyway (ft_min_word_len defaults to 4; check the docs for other databases).
- Create a text_words table with the columns word_id, text_id and word_count.
- Whenever someone performs a search, filter on the words table, join with text_words, and pull out a decent subset of text_ids with their summed word counts; then join that subset against the texts table. That is, along the lines of
SELECT id, title, body, MATCH(title, body) AGAINST ('the words here') AS score
FROM (
    SELECT text_id, SUM(word_count) AS wordcount
    FROM words w
    INNER JOIN text_words tw ON w.id = tw.word_id
    WHERE word IN ('the', 'words', 'here')
    -- AND word_count > ???
    GROUP BY text_id
    ORDER BY wordcount DESC
    LIMIT 50
) tmp
INNER JOIN texts t ON t.id = tmp.text_id
WHERE MATCH(title, body) AGAINST ('the words here')
ORDER BY score DESC;
Do note that there is no weighting of keywords in the inner query, although that may of course be added. Nor does the inner query differentiate between title and body when it comes to weight; this may also be added if necessary. You'd have to try it out yourself, but it's possible this approach gives good enough results from the inner query while sufficiently cutting down the data fed to the full-text search. You may also want to tweak the row limit of the inner query.
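The one-time indexing step described in the bullets above can be sketched as follows. This is a minimal illustration using SQLite in memory rather than MySQL (the SQL is the same in spirit), and the index_text helper is a hypothetical name, not anything built in:

```python
import re
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE texts (id INTEGER PRIMARY KEY, title TEXT, body TEXT);
    CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE text_words (word_id INTEGER, text_id INTEGER, word_count INTEGER);
""")

def index_text(conn, text_id, title, body):
    """Tokenize one text and store per-word counts, skipping words of 3 chars or less."""
    tokens = re.findall(r"[a-z0-9]+", (title + " " + body).lower())
    counts = Counter(t for t in tokens if len(t) > 3)
    for word, count in counts.items():
        # reuse the word row if the word was already seen in another text
        conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
        (word_id,) = conn.execute(
            "SELECT id FROM words WHERE word = ?", (word,)).fetchone()
        conn.execute(
            "INSERT INTO text_words (word_id, text_id, word_count) VALUES (?, ?, ?)",
            (word_id, text_id, count))

conn.execute("INSERT INTO texts VALUES (1, 'Sample title', 'some words, words again')")
index_text(conn, 1, "Sample title", "some words, words again")
```

This is what you'd run once per text when it is stored, in addition to the normal insert into texts.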
Another thing you might do is start by retrieving the maximum word count for each word in the search.
SELECT MAX(word_count) AS wordcount, word_id
FROM words w
INNER JOIN text_words tw ON tw.word_id = w.id
WHERE word IN ('the', 'words', 'here')
GROUP BY word_id;
Then use those max word counts in the inner query, dividing each by 2 or possibly an even lower number. The reason is that if the max word count for some word is 100 in a single text, it's unlikely you need to search all texts where it occurs only 5 times; dividing by 2 means you start checking texts where it is found at least 50 times. The inner query would be modified along these lines, where the cutoffs 50, 10 and 25 are just example values, each half of that word's maximum from the query above:
SELECT text_id, SUM(word_count) AS wordcount
FROM words w
INNER JOIN text_words tw ON w.id = tw.word_id
WHERE (word = 'the' AND word_count >= 50)
   OR (word = 'words' AND word_count >= 10)
   OR (word = 'here' AND word_count >= 25)
GROUP BY text_id
ORDER BY wordcount DESC
LIMIT 50
This will require an index on words (word) as well as a composite INDEX (word_id, word_count) on the text_words table.
If the cutoff is 5 for 'the' and 10 for 'words', then a text containing 'words' 12 times and 'the' 4 times would get a wordcount sum of only 12, since the 'the' row is excluded by its too-low word count.
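Putting the two steps together, here is a minimal sketch of the cutoff logic, again using SQLite with a few rows of toy data (the table names follow the ones above; the halving divisor is the assumption being demonstrated):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE text_words (word_id INTEGER, text_id INTEGER, word_count INTEGER);
""")
# toy data: 'the' peaks at 100 occurrences in text 10, 'words' at 20 in text 12
conn.executemany("INSERT INTO words (id, word) VALUES (?, ?)",
                 [(1, "the"), (2, "words")])
conn.executemany("INSERT INTO text_words VALUES (?, ?, ?)",
                 [(1, 10, 100), (1, 11, 4), (2, 11, 12), (2, 12, 20)])

# step 1: max word count per search word, cutoff = half the maximum
cutoffs = {word_id: max_count // 2
           for max_count, word_id in conn.execute(
               "SELECT MAX(word_count), word_id FROM words w "
               "INNER JOIN text_words tw ON tw.word_id = w.id "
               "WHERE word IN ('the', 'words') GROUP BY word_id")}

# step 2: aggregate per text, keeping only rows at or above each word's cutoff
# (interpolating is fine here: the cutoffs are integers we just computed)
where = " OR ".join("(word_id = %d AND word_count >= %d)" % (wid, c)
                    for wid, c in sorted(cutoffs.items()))
rows = conn.execute(
    "SELECT text_id, SUM(word_count) AS wordcount FROM text_words "
    "WHERE " + where + " GROUP BY text_id ORDER BY wordcount DESC").fetchall()
```

With this data the cutoffs come out as 50 for 'the' and 10 for 'words', so text 11's four occurrences of 'the' are excluded from its sum, exactly as in the example above.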
If this approach is fast enough but somewhat lacking in relevance, you could fire off a separate process that performs the same query using nothing but the built-in full-text indexing, and use its result for subsequent queries (2nd page etc.), which might lead to slightly different search results the second time around.