Currently, my table has about 80,000 records (all text, mainly song lyrics), totalling about 130 megs in a table dump.
If someone wants to find the lyrics for a particular song, they can search by artist, album, or songtitle, or search the entire database, which scans all 80,000 lyric records via the MySQL LIKE operator.
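(That whole-database search boils down to something like the query below; the leading wildcard is what forces the full scan. $SEARCH_TERM is just a placeholder for the user's input.)
SELECT id, artist, songtitle, album
FROM lyrics
WHERE songbody LIKE '%$SEARCH_TERM%'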
My theory is that string comparison is vastly slower than integer comparison; so much slower, in fact, that I think a different method may work faster. It consists of:
Two tables.
One to store ALL unique words found in the entire database (I have a script that does this; see attached). This also has an id column (for reference in the other table). The table looks like this:
[size=1]
mysql> describe lyrics_word_index;
+-------------+----------+------+-----+---------+----------------+
| Field       | Type     | Null | Key | Default | Extra          |
+-------------+----------+------+-----+---------+----------------+
| id          | int(16)  |      | PRI | NULL    | auto_increment |
| unique_word | char(25) |      | UNI |         |                |
+-------------+----------+------+-----+---------+----------------+
[/size]
The other table will be a relationship table. It will have a link to an id on the unique word index table, a link to an id on the lyrics table, and a relevance value (calculated at indexing time). It looks like this:
[size=1]
mysql> describe lyrics_related_id;
+---------------+---------+------+-----+---------+----------------+
| Field         | Type    | Null | Key | Default | Extra          |
+---------------+---------+------+-----+---------+----------------+
| table_id      | int(16) |      | PRI | NULL    | auto_increment |
| related_id    | int(16) |      | MUL | 0       |                |
| word_index_id | int(16) |      | MUL | 0       |                |
| relevance     | double  |      | MUL | 0       |                |
+---------------+---------+------+-----+---------+----------------+
[/size]
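(For anyone who wants to reproduce this, the CREATE TABLE statements reconstructed from the describes above would be roughly as follows; the key names are made up.)
CREATE TABLE lyrics_word_index (
    id          INT(16)  NOT NULL AUTO_INCREMENT,
    unique_word CHAR(25) NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY uw (unique_word)
);

CREATE TABLE lyrics_related_id (
    table_id      INT(16) NOT NULL AUTO_INCREMENT,
    related_id    INT(16) NOT NULL DEFAULT 0,
    word_index_id INT(16) NOT NULL DEFAULT 0,
    relevance     DOUBLE  NOT NULL DEFAULT 0,
    PRIMARY KEY (table_id),
    KEY rid (related_id),
    KEY wid (word_index_id),
    KEY rel (relevance)
);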
The idea is that I will issue something like this:
SELECT a.id, artist, songtitle, album, songbody
FROM
lyrics AS a, lyrics_word_index AS b, lyrics_related_id AS c
WHERE
a.id = c.related_id
AND
c.word_index_id = b.id
AND
b.unique_word = '$UNIQUE_WORD'
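With the keys shown in the describes, every step of that join should be an index lookup. An easy sanity check is to run EXPLAIN on it and make sure no row in the plan says ALL (i.e. a full scan); 'hello' here is just an example word:
EXPLAIN SELECT a.id, artist, songtitle, album, songbody
FROM lyrics AS a, lyrics_word_index AS b, lyrics_related_id AS c
WHERE b.unique_word = 'hello'
AND c.word_index_id = b.id
AND a.id = c.related_id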
Advantages:
Everything is indexed, whereas the LIKE option requires a full table scan.
Relevance is already calculated prior to the SELECT, so ranking is just an ORDER BY on a precomputed column (sketch below).
(I'll note here that I have tried converting the entire table to use MySQL FULLTEXT, and the fastest query I could achieve was ~7 seconds, which is completely unacceptable.)
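(For reference, that FULLTEXT attempt was a MATCH ... AGAINST query along these lines:)
SELECT id, artist, songtitle, album
FROM lyrics
WHERE MATCH(songbody) AGAINST('$SEARCH_TERM')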
At about 1.2 million records in the related_id table and 90,000 records in the unique word table, a query like the one above takes about 0.17 seconds (extrapolating linearly, 8 million records would take ~1.13 seconds, though since the lookups go through B-tree indexes the real growth should be slower than linear).
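Here's what I have in mind for ranked, multi-word searches; this is an untested sketch with 'word1'/'word2' as placeholder search terms (the HAVING clause forces every word to match):
SELECT a.id, artist, songtitle, SUM(c.relevance) AS score
FROM lyrics AS a, lyrics_word_index AS b, lyrics_related_id AS c
WHERE b.unique_word IN ('word1', 'word2')
AND c.word_index_id = b.id
AND a.id = c.related_id
GROUP BY a.id, artist, songtitle
HAVING COUNT(DISTINCT b.id) = 2
ORDER BY score DESC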
Disadvantages:
The related_id table is going to be huge. It holds roughly (average distinct words per song) * 80,000 songs; at around 100 distinct words per song, that means I'm looking at about 8 million records in the related_id table.
Indexing, currently, is VERY slow: it literally takes me 2 days. See attached for the script; if anyone can help me make it faster, that would be nice. I'm seriously thinking about converting it to a C program, because at about 5 minutes per 200 records, the full 80,000 works out to ~33 hours of one continuous keep-alive connection (which I can't seem to figure out how to achieve, so I have to set up cronjobs to do 200 records at a time... sigh).
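Before going all the way to C, the cheapest speedup I can think of is doing fewer, bigger statements: one multi-row INSERT per song instead of one INSERT per word. Something like this (the values are made up for illustration):
-- the UNIQUE key on unique_word makes INSERT IGNORE safe to re-run:
-- already-seen words are silently skipped instead of erroring out
INSERT IGNORE INTO lyrics_word_index (unique_word)
VALUES ('hello'), ('world'), ('goodbye');

-- one statement per song for the relationship rows, not one per word
INSERT INTO lyrics_related_id (related_id, word_index_id, relevance)
VALUES (42, 101, 0.50),
       (42, 102, 0.25),
       (42, 103, 0.25);
Going further, writing the rows out to a tab-delimited file and bulk-loading with LOAD DATA INFILE should be faster still, and sidesteps the keep-alive problem entirely.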
Any thoughts or ideas on how to improve this scheme, or to get away from using LIKE?
thanks for all who respond/suggest,
cheers,
k.
ps- the attachment is a .txt file because the forum doesn't allow .php uploads.