Find similar records in a database

starbbs

I have an idea that has been troubling me for months, and I really have no good idea how to work it out. Maybe you gurus can help me with this.

I would like to have a function/class that can help me find similar/matching records in a table. The idea is to compare each record with every other record and find probable matches.

Ex:

Each record has Name, Address, Place of residence, Date of birth, License plate, Car brand.

I want to turn this function loose on the database. When it finds, for example, an Opel, it also finds the other records that contain an Opel. And when this Opel and its license plate have been entered twice with different drivers, it finds that match automatically because two fields are the same (brand and license plate).

Or: it returns records that show that the same person, with a different birthday, always drives an Opel.

Get it ?

This function is self-supporting, so no real input is needed. Maybe in the function itself you can define a 'weight' of importance for each field it scans, so that, for example, a birthday and a license plate are more important than a brand.

Can someone help me with this ?

AstroTeg

Sounds like a combination of exact matches and fuzzy matches. The exact matches are pretty easy - just simple SQL queries. The fuzzy matches are more difficult. You need fuzziness to handle classic misspellings and renaming of popular names. You probably want a way to find a driver of an Opel with a first name of Robert, Rob, Bob, or Roberto and maybe a matching last name. Same with the address. What if they spell out Avenue or just enter Ave?

For these, you'll need to implement an algorithm to consolidate the names and addresses into something basic enough you can then query for matches. Basically, you need to build a generic key that holds just enough information about your record yet removes as many of the spelling and naming inconsistencies as possible. One trick is to remove vowels and some letter combinations. Another is to replace common variants of a name with one canonical name and then reduce from there.
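
A minimal sketch of that kind of key generation, not Oracle's actual algorithm. The function name `make_key` and the two lookup tables are made up for illustration; a real system would need much larger tables and would have to handle non-ASCII letters, which this version simply drops.

```python
import re

# Hypothetical lookup tables: map common variants to one canonical spelling.
CANONICAL_NAMES = {"rob": "robert", "bob": "robert", "roberto": "robert"}
CANONICAL_WORDS = {"ave": "avenue", "st": "street", "rd": "road"}


def make_key(text, synonyms):
    """Reduce free text to a compact key that tolerates spelling variants."""
    # Lowercase and keep only runs of ASCII letters and digits.
    words = re.findall(r"[a-z0-9]+", text.lower())
    parts = []
    for word in words:
        # Replace known variants ("bob", "ave") with the canonical word.
        word = synonyms.get(word, word)
        # Keep the first character, drop every later vowel.
        reduced = word[0] + re.sub(r"[aeiou]", "", word[1:])
        # Collapse repeated letters, e.g. "nn" becomes "n".
        reduced = re.sub(r"(.)\1+", r"\1", reduced)
        parts.append(reduced)
    return "".join(parts).upper()


print(make_key("Bob", CANONICAL_NAMES))
print(make_key("Roberto", CANONICAL_NAMES))
print(make_key("12 Main Ave.", CANONICAL_WORDS))
```

Here "Bob", "Rob", "Robert" and "Roberto" all end up with the same key, and so do "12 Main Ave." and "12 main avenue", so an exact comparison on the key finds them.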

Oracle has some good documentation on how they do it. Trick is I can't seem to find it out in the public (did a quick google search). You have to be logged into Metalink to get at the info (look for address key or customer key generation).

starbbs

Thanks for your answer, but I would also like some strategic planning on how best to do this.

The fuzzy search is already worked out; I have it working in a function.

For these, you'll need to implement an algorithm to consolidate the names and addresses into something basic enough you can then query for matches.

Can you explain how to do this in terms of planning?

Like, select table, select field, get a record, put all words in an array, compare them, do the fuzzy, remember this record and compare it with all other records. If match ? put in array, goto the second record bla bla bla.... etc

Oracle has some good documentation on how they do it. Trick is I can't seem to find it out in the public (did a quick google search). You have to be logged into Metalink to get at the info (look for address key or customer key generation).

I searched and searched but no luck... I did find something on fulltext indexes and the MATCH option in MySQL. I don't think that is the way to do this.

Thanks

AstroTeg

The approach I'm talking about uses keys. It depends on how you plan things out, but the system I've worked with had a customer key to describe the name of the company, an address key to describe the address/location, and a person key to describe the person's first and last name. These keys get tied to the records (either in their own column or in a separate table - your design call on this). Then it's a matter of doing direct matching on the keys themselves. For those keys that match each other, you could consider those records duplicates or the same.

I'm not sure what you mean by "how to do this in terms of planning?"

It would just be a simple SQL query to compare one key with another. Or you could mix it up and compare one record's multiple keys in one query using AND conditions in your WHERE clause. Or you really could mix it up and concatenate the multiple keys to make one key and then compare this new key to the concatenated keys in your tables. That could actually work out pretty well depending on your needs...
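
A rough sketch of both variants, using Python's built-in sqlite3 module and an in-memory database. The table `records`, the columns `person_key` and `plate_key`, and the sample rows are all assumptions made up for this example; the same SQL idea should carry over to MySQL or Oracle with minor syntax changes (string concatenation differs between engines).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: each record carries its precomputed keys.
conn.execute(
    "CREATE TABLE records ("
    "id INTEGER PRIMARY KEY, driver TEXT, person_key TEXT, plate_key TEXT)"
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?, ?)",
    [
        (1, "Robert Smith", "RBRTSMTH", "AB123C"),
        (2, "Bob Smith", "RBRTSMTH", "AB123C"),
        (3, "Anna Jones", "ANJNS", "XY987Z"),
    ],
)


def matches_by_and(conn, record_id):
    """Other records whose keys ALL equal the keys of the given record."""
    rows = conn.execute(
        """
        SELECT other.id
        FROM records AS rec, records AS other
        WHERE rec.id = ?
          AND other.id <> rec.id
          AND other.person_key = rec.person_key
          AND other.plate_key = rec.plate_key
        ORDER BY other.id
        """,
        (record_id,),
    ).fetchall()
    return [row[0] for row in rows]


def matches_by_concat(conn, record_id):
    """Same comparison, done on one concatenated key ('||' joins strings in SQLite)."""
    rows = conn.execute(
        """
        SELECT other.id
        FROM records AS rec, records AS other
        WHERE rec.id = ?
          AND other.id <> rec.id
          AND other.person_key || '|' || other.plate_key
              = rec.person_key || '|' || rec.plate_key
        ORDER BY other.id
        """,
        (record_id,),
    ).fetchall()
    return [row[0] for row in rows]


print(matches_by_and(conn, 1))
print(matches_by_concat(conn, 3))
```

If you go the concatenated route, it is probably better to store the combined key in its own column so it can be indexed, rather than building it inside the query every time.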

starbbs

What I meant is:

Is the best approach to get record 1, compare it with all other records, find all matches, determine the percentage of a match, and then go to record 2, compare it with all other records, etc etc etc?

Or some kind of function that takes random record ids and compares each randomly picked record with all other records?

Let's say that we have 1000 records. Isn't it a better approach to first get record 500, do its thing, then record 1, then record 1000:

500 compare......
1 compare
1000 compare
499 compare
2 compare
999 compare

I hope you understand what I mean.

I think the strategy for comparing is this:

Find out how many records we have

1: Get a NEW record id
2: Get all text/string fields from the record and put them in $orig_string
3: remove all common words like: and, this etc from $orig_string
4: remove garbage in $orig_string
5: format and delimit all found words in $orig_string
6: Get a NEW record id, not the same as the first of course.
7: same as 2, put in $new_string
8: same as 3 ""
9: same as 4 ""
10: same as 5 ""
11: compare $orig_string and $new_string
12: get word 1 from $orig_string and compare this with all found words in $new_string
13: found match? (fuzzy OR 100%) remember the record id and maybe some other information, because we found a match!
14: get word 2 from $orig_string
15: same as 13
16: etc etc, after doing all found words in $orig_string select a new record that isn't already processed
17: same as 1

Can you advise me on this approach?

AstroTeg

Keep this in mind: What is the MOST expensive (in terms of processing time) operation your code will have?

Database querying. Every single query will ding you in processing time. Also, the complexity of the query will ding you too. But here, I don't think you'll have a very complex query.

With that in mind, check out your steps:

2: Get all text/string fields from the record and put them in $orig_string
3: remove all common words like: and, this etc from $orig_string
4: remove garbage in $orig_string
5: format and delimit all found words in $orig_string

Couldn't you just save the resulting string in a field somewhere in your database? And when any of the record's data changes, you can update this string on the fly. THEN you can build your search string and search it against a list of these already keyed strings, leaving you with a pretty simple text search which could probably be indexed to speed up the search.
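
Something along these lines, as a sketch only. It uses Python's sqlite3 module; the table layout, the `match_string` column, the stop-word list and the helper names are all assumptions for illustration, and your existing fuzzy function could replace the exact comparison.

```python
import re
import sqlite3

STOP_WORDS = {"and", "this", "the"}  # assumed list of common words to drop


def build_match_string(*fields):
    """Steps 2-5 in one place: join the fields, clean them, delimit the words."""
    words = re.findall(r"[a-z0-9]+", " ".join(fields).lower())
    return " ".join(w for w in words if w not in STOP_WORDS)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    "id INTEGER PRIMARY KEY, name TEXT, brand TEXT, plate TEXT, match_string TEXT)"
)
# The index lets the database look up a prepared string without scanning every row.
conn.execute("CREATE INDEX idx_records_match ON records (match_string)")


def save_record(conn, record_id, name, brand, plate):
    """Insert or update a record and refresh its stored string at the same time."""
    key = build_match_string(name, brand, plate)
    conn.execute(
        "INSERT OR REPLACE INTO records VALUES (?, ?, ?, ?, ?)",
        (record_id, name, brand, plate, key),
    )


def find_same(conn, name, brand, plate):
    """Build the search string once, then let one indexed query do the comparing."""
    key = build_match_string(name, brand, plate)
    rows = conn.execute(
        "SELECT id FROM records WHERE match_string = ? ORDER BY id", (key,)
    ).fetchall()
    return [row[0] for row in rows]


save_record(conn, 1, "J. Smith", "Opel", "AB-123-C")
print(find_same(conn, "j smith", "OPEL", "ab 123 c"))
```

The cleaning work happens once per insert or update instead of once per comparison, which is the whole point.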

Remember to keep it simple.

With this:

500 compare......
1 compare
1000 compare
499 compare
2 compare
999 compare

You sound like you're trying to do a tree search. But this is what your database engine is supposed to be doing for you. You're working against your database and not letting it do its job by using this approach (indexing the column will allow the database to automatically do some nifty search tricks).
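
As a hedged illustration of letting the database do the walking: instead of picking records in some clever order and comparing each one against all the others, a single grouped query can report every set of records that share a key. This sketch assumes SQLite through Python's sqlite3 module and a hypothetical `match_key` column; the sample data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, match_key TEXT)")
# With the column indexed, the engine decides how to search; no manual ordering needed.
conn.execute("CREATE INDEX idx_records_key ON records (match_key)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [(1, "A"), (2, "B"), (3, "A"), (4, "C"), (5, "B"), (6, "A")],
)


def duplicate_groups(conn):
    """Return {match_key: [ids]} for every key used by more than one record."""
    rows = conn.execute(
        """
        SELECT match_key, id
        FROM records
        WHERE match_key IN (
            SELECT match_key FROM records
            GROUP BY match_key
            HAVING COUNT(*) > 1
        )
        ORDER BY match_key, id
        """
    ).fetchall()
    groups = {}
    for key, record_id in rows:
        groups.setdefault(key, []).append(record_id)
    return groups


print(duplicate_groups(conn))
```

One query replaces the whole "record 500, record 1, record 1000" loop, and every record is only looked at by the engine, not by your code.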