l008com wrote: I have a very simple table of addresses:
name, address, city, state are the four columns
There are 120,000 rows in this table, and I want to select 30 of them at random. But because there are WAY more Cali and Texas entries than other states, I want a query that only selects one row per state. In other words, I want the state column to be distinct, but not the other columns. Here is my query as it stands:
It works great, but in any given group of 30 I might get 8 California addresses. What I'm trying to get is random groups of 30 addresses, from 30 different states, every time.
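To make the requirement concrete, here is a minimal sketch of "one random row per state, then 30 random states per draw", using Python's sqlite3 as a stand-in for MySQL. The table name addresses and the sample data are assumptions (the original query isn't shown); SQLite spells the function RANDOM() where MySQL spells it RAND(), and the window function needs SQLite 3.25+ or MySQL 8+:

```python
import sqlite3

# Stand-in data with the four columns from the post; the table name
# "addresses" is an assumption. 40 synthetic states, 10 rows each.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (name TEXT, address TEXT, city TEXT, state TEXT)")
rows = [(f"name{i}", f"{i} Main St", "Town", f"ST{i % 40:02d}") for i in range(400)]
conn.executemany("INSERT INTO addresses VALUES (?,?,?,?)", rows)

# Pick one random row per state, then 30 random states from those winners.
# (In MySQL 8+, replace RANDOM() with RAND().)
sample = conn.execute("""
    SELECT name, address, city, state FROM (
        SELECT name, address, city, state,
               ROW_NUMBER() OVER (PARTITION BY state ORDER BY RANDOM()) AS rn
        FROM addresses
    ) AS t
    WHERE rn = 1
    ORDER BY RANDOM()
    LIMIT 30
""").fetchall()

print(len(sample), len({row[3] for row in sample}))  # 30 rows, 30 distinct states
```

Note this still sorts every row randomly within each state partition, so it has the same performance caveat discussed in the reply below a certain table size.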
Selecting rows at random is generally not a fast thing to do, and with 120,000 rows to go through, this will definitely NOT be fast. select * from table order by rand() is suitable for small sets only, say 100 to 1,000 rows max. Anything bigger than that and your database will not be able to give you a result quickly.
If you want good performance, you might want to look at pre-creating your results and then just walking through them with an autoinc field in a dummy table (or use a sequence if you're on pgsql or oracle...).
In this case I'd make one table for each heavily populated state, then start grouping the smaller states together. Insert the entries in random order, and give each table a column to order by. Using a PHP script or something like it, go round-robin from one table to the next. Then select all of that into a summary table, one table at a time, and cluster on the id field.
It's a lot of setup, but after that you can just pick a random starting id and grab the next 30 rows. E.g. select * from biggiantmastermutanttable where id between 23401 and 23430, and you're sure to get a random sample with no two states repeating.
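The round-robin setup described above can be sketched end to end with Python's sqlite3 (table and column names are assumptions; the per-state tables become in-memory buckets, which is the same idea at small scale). Any 30 consecutive ids then hold 30 distinct states, as long as at least 30 states are still contributing rows at that point in the interleaving:

```python
import sqlite3
import random

# Stand-in source table; 40 synthetic states, 10 rows each, so every
# bucket stays populated and the no-repeat guarantee holds throughout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (name TEXT, address TEXT, city TEXT, state TEXT)")
rows = [(f"name{i}", f"{i} Main St", "Town", f"ST{i % 40:02d}") for i in range(400)]
conn.executemany("INSERT INTO addresses VALUES (?,?,?,?)", rows)

# Bucket rows by state and shuffle each bucket (the "entries in random
# order" step), then interleave the buckets round-robin.
buckets = {}
for row in conn.execute("SELECT name, address, city, state FROM addresses"):
    buckets.setdefault(row[3], []).append(row)
for b in buckets.values():
    random.shuffle(b)

order = list(buckets.values())
random.shuffle(order)
interleaved = []
while any(order):
    for b in order:
        if b:
            interleaved.append(b.pop())

# Materialize the precomputed "master" table with a sequential id.
conn.execute("""CREATE TABLE master
    (id INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT, state TEXT)""")
conn.executemany("INSERT INTO master (name, address, city, state) VALUES (?,?,?,?)",
                 interleaved)

# Sampling afterwards is just a cheap 30-row slice at a random offset.
total = conn.execute("SELECT COUNT(*) FROM master").fetchone()[0]
start = random.randint(1, total - 29)
sample = conn.execute("SELECT * FROM master WHERE id BETWEEN ? AND ?",
                      (start, start + 29)).fetchall()
print(len(sample), len({r[4] for r in sample}))  # 30 rows, 30 distinct states
```

With unequal state counts the guarantee weakens near the end of the interleaving, once fewer than 30 buckets still have rows; that's the point of grouping the smaller states together in the real setup.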
Recreate the master table every few hours / days / weeks / months as needed.