Getting duplicate results on joined query

schwim

Hi there folks,

I've got an issue with a query that a member helped me build on this forum. The query is supposed to pull links from the database when they match either the primary category, which is stored in the link's own table, or a secondary cat, which is stored in linkcats.

$q_urls = "
	SELECT shortenified_urls.*
	FROM shortenified_urls 
		LEFT JOIN linkcats 
			ON linkcats.link_id = shortenified_urls.id
		WHERE 
			(shortenified_urls.primary_cat = $vu_cat_view 
			OR linkcats.cat_id = $vu_cat_view)
		AND shortenified_urls.is_social = '1' 
		AND shortenified_urls.active = '1' 
		ORDER BY date_added DESC
	";

The problem that I'm having is that it seems to be duplicating the record any time that it's matched with the primary_cat in the link's record. It seems to be showing the proper count for any secondary cat matches. Here's a screenie of what's happening (in the screenshot, the bold cat is a primary_cat match and the non-bold ones following it are secondary cats found in linkcats):

http://en.zimagez.com/zimage/screenshot-10162014-111452am.php

I'm not sure why this is happening. Any help on figuring it out would be greatly appreciated!

NogDog

Well, maybe the quick-and-dirty fix would be to use SELECT DISTINCT?

schwim

Thanks a bunch for your help, Nog.

I tried that initially, but that borked my pagination system. I can start working on fixing that with DISTINCT but thought there might be something in the query that was obviously off that might get me back on track.

schwim

NogDog;11043475 wrote:
Well, maybe the quick-and-dirty fix would be to use SELECT DISTINCT?

Playing with DISTINCT, I'm having trouble getting all the info from the row, which I need. Do I have to run a second query inside the loop to get the rest of the information, or can I tell it DISTINCT id and still get all the rest of the info from the row (*)? Neither DISTINCT id, * nor *, DISTINCT id is working for me.

sneakyimp

If you want to use DISTINCT, then you have to carefully select which columns you want to select from your JOIN query. I'm also wondering if perhaps you might want to use an INNER JOIN rather than a LEFT JOIN. A left join can produce records when there is no matching record in the table joined. In your case, you SELECT * FROM shortenified_urls LEFT JOIN linkcats -- this might return records in shortenified_urls that have no counterpart at all in linkcats. An INNER JOIN would return only records that had a match in both tables.

If you are getting 'duplicates' then you need to be more specific about what exactly is getting duplicated. If your pagination is off for some reason, then you may need to adjust the other query that is responsible for paginating stuff. You'll need to be more specific if you want real answers here.
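To make the duplication concrete, here is a minimal sketch using Python's built-in sqlite3 module in place of the real PHP/MySQL setup. The table layout and sample rows are assumptions reconstructed from the query in the first post, not the poster's actual schema. It shows that the LEFT JOIN returns one row per linkcats row, so a link that matches on primary_cat and has two secondary cats comes back twice. It also shows one possible way to avoid that: testing the secondary cats with an EXISTS subquery instead of joining. Note that in this sample data an INNER JOIN would drop link 3, which matches on primary_cat but has no linkcats rows at all.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with assumed, simplified tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE shortenified_urls (
            id INTEGER PRIMARY KEY,
            primary_cat INTEGER,
            is_social TEXT,
            active TEXT,
            date_added TEXT
        );
        CREATE TABLE linkcats (link_id INTEGER, cat_id INTEGER);
        -- Link 1: primary cat 5, secondary cats 6 and 7
        -- Link 2: primary cat 9, secondary cat 5
        -- Link 3: primary cat 5, no secondary cats
        INSERT INTO shortenified_urls VALUES
            (1, 5, '1', '1', '2014-10-01'),
            (2, 9, '1', '1', '2014-10-02'),
            (3, 5, '1', '1', '2014-10-03');
        INSERT INTO linkcats VALUES (1, 6), (1, 7), (2, 5);
    """)
    return conn


# Same shape as the original query: one output row per joined linkcats row.
JOIN_QUERY = """
    SELECT shortenified_urls.id
    FROM shortenified_urls
        LEFT JOIN linkcats ON linkcats.link_id = shortenified_urls.id
    WHERE (shortenified_urls.primary_cat = ? OR linkcats.cat_id = ?)
        AND shortenified_urls.is_social = '1'
        AND shortenified_urls.active = '1'
    ORDER BY date_added DESC
"""

# Alternative: no join, so each link can appear at most once.
EXISTS_QUERY = """
    SELECT shortenified_urls.id
    FROM shortenified_urls
    WHERE (shortenified_urls.primary_cat = ?
           OR EXISTS (SELECT 1 FROM linkcats
                      WHERE linkcats.link_id = shortenified_urls.id
                        AND linkcats.cat_id = ?))
        AND shortenified_urls.is_social = '1'
        AND shortenified_urls.active = '1'
    ORDER BY date_added DESC
"""


def link_ids(conn, sql, cat):
    """Run one of the queries for a category, binding it as a parameter."""
    return [row[0] for row in conn.execute(sql, (cat, cat))]


if __name__ == "__main__":
    conn = build_demo_db()
    print(link_ids(conn, JOIN_QUERY, 5))    # link 1 is listed twice
    print(link_ids(conn, EXISTS_QUERY, 5))  # each link is listed once
```

The category is bound as a parameter here rather than interpolated into the SQL string; the same idea applies to prepared statements in PHP.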

schwim

Thanks very much for your help, Sneaky. What I ended up doing to get rid of the dupes was to use GROUP BY and then actually duplicate the query for the purposes of pagination. As I read more, I know that this is not a satisfactory resolution, but have not figured out how to get rid of them without doing that.

I did read the man pages and some tutorials on the various types of JOIN, but it didn't stick well enough to notice what you pointed out. I will give the original query another go swapping the JOIN types and see how I fare.

Thanks again for taking the time to help!

sneakyimp

It's not clear what database engine you are using, but if you are using MySQL, you might consider using the FOUND_ROWS() function that it offers. The basic idea is that you add the keyword SQL_CALC_FOUND_ROWS to your query right after the word SELECT. You can then use the results of your first query to output a particular page. You can then run another query which will tell you how many rows would have been found without the LIMIT clause:

SELECT SQL_CALC_FOUND_ROWS * FROM tbl_name WHERE id > 100 LIMIT 10;
SELECT FOUND_ROWS();

It can be very helpful.

NogDog

PostgreSQL does something similar with OVER(), e.g.:

SELECT *, count(*) OVER() AS full_count FROM . . . LIMIT 20
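For reference, the window-function form is count(*) OVER (): the aggregate is computed over the whole result set before LIMIT is applied, so every returned row carries the full count. Below is a minimal sketch, run against SQLite through Python's sqlite3 module rather than PostgreSQL. It assumes SQLite 3.25 or newer for window function support, and the links table and function name are made up for illustration.

```python
import sqlite3


def build_demo_db(row_count=45):
    """Create a hypothetical links table with ids 1..row_count."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE links (id INTEGER PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO links (id) VALUES (?)",
        [(i,) for i in range(1, row_count + 1)],
    )
    return conn


def page_with_total(conn, limit, offset):
    """Return (ids on this page, total row count) from a single query."""
    rows = conn.execute(
        """
        SELECT id, count(*) OVER () AS full_count
        FROM links
        ORDER BY id
        LIMIT ? OFFSET ?
        """,
        (limit, offset),
    ).fetchall()
    # With no rows on the page there is no full_count to read,
    # so this sketch simply reports 0 in that case.
    total = rows[0][1] if rows else 0
    return [row[0] for row in rows], total


if __name__ == "__main__":
    conn = build_demo_db()
    ids, total = page_with_total(conn, 20, 0)
    print(len(ids), total)
```

One caveat of this approach: a page past the end of the result set returns no rows, so it cannot report the total either.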

johanafm

schwim;11043661 wrote:
What I ended up doing to get rid of the dupes was to use GROUP BY and then actually duplicate the query for the purposes of pagination. As I read more, I know that this is not a satisfactory resolution, but have not figured out how to get rid of them without doing that.

sneakyimp;11043689 wrote:
if you are using MySQL, you might consider using the FOUND_ROWS() function

(MySQL) From what I read several years ago, duplicate selects were supposed to be (potentially) MUCH more efficient than the SQL_CALC_FOUND_ROWS approach. If you want a definitive up-to-date answer, you'd have to either clock the two approaches or find recent information that matches whatever MySQL version you are using. Things might have changed.

The reason is (used to be?) that SQL_CALC_FOUND_ROWS will perform full table scans, whereas the two queries

SELECT count(*) …;
SELECT … LIMIT …;

will use an index (if possible).

This means that if some query runs too slowly without an index, only the double-query approach will benefit from adding one. Unless you have a sufficiently small data set, all queries will run slowly without proper indexing, which means you are likely to need the double-query approach in most cases.
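A minimal sketch of the double-query approach, again using Python's sqlite3 module as a stand-in for MySQL. The table name tbl_name and the id > 100 condition are borrowed from the FOUND_ROWS() example earlier in the thread; the function name and page arithmetic are assumptions for illustration. The point is that both queries share one WHERE clause, so the count always describes the same set of rows that is being paged through. As far as I know, MySQL 8.0.17 and later also mark SQL_CALC_FOUND_ROWS and FOUND_ROWS() as deprecated, which is another argument for this pattern.

```python
import sqlite3

# Shared by both queries so the count and the page cannot drift apart.
WHERE_CLAUSE = "WHERE id > ?"


def build_demo_db(row_count=250):
    """Create a hypothetical tbl_name with ids 1..row_count."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tbl_name (id INTEGER PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO tbl_name (id) VALUES (?)",
        [(i,) for i in range(1, row_count + 1)],
    )
    return conn


def fetch_page(conn, min_id, page, per_page):
    """Return (ids on the page, total matching rows, number of pages)."""
    # Query 1: count only, which can be answered from an index.
    total = conn.execute(
        f"SELECT count(*) FROM tbl_name {WHERE_CLAUSE}", (min_id,)
    ).fetchone()[0]
    # Query 2: the rows for the requested page (pages are 1-based).
    rows = conn.execute(
        f"SELECT id FROM tbl_name {WHERE_CLAUSE} ORDER BY id LIMIT ? OFFSET ?",
        (min_id, per_page, (page - 1) * per_page),
    ).fetchall()
    page_count = -(-total // per_page)  # ceiling division
    return [row[0] for row in rows], total, page_count


if __name__ == "__main__":
    conn = build_demo_db()
    ids, total, page_count = fetch_page(conn, 100, 2, 10)
    print(ids, total, page_count)
```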

johanafm

I just found an article from 2007 with a recent reply. The reply argues that when both approaches need full table scans, SQL_CALC_FOUND_ROWS (SCFR) has to be faster (one vs. two full table scans). He also claims that such a scenario might lead to double-query being the most efficient for the first N pages and SCFR for the last M pages.

While I agree with his conclusion that you might need to test against your particular set of data, I probably wouldn't resort to using a slow SCFR over a slow double-query. In this case I'd rather start looking for ways of making either approach "not slow" instead. For example:

1. I'd definitely try adding indices to avoid full table scans (make double-query fast).

2. If 1. fails for some reason, it might be possible to avoid more than one use of SCFR or one count query, or at the very least limit a count query to only new data. How to make this work would depend on:
   - Are updates and inserts done at known times, so that you don't have to deal with table changes during the use of a paginated set?
   - Are updates and inserts performed infrequently enough that they usually will not disturb paginated sets, and can you make such inserts/updates invalidate pre-calculated sizes for paginated sets (like any other cache)?
   - Could a third query (or second query if using SCFR) with something like MAX(updated_time), which has to be done against an indexed column, be used to verify that no new data has arrived? When new data appears in the table, this threshold could be used to handle the new data, such as not including it in the original query, or including it in the original query (which may lead to data moving around across pages) and inexpensively updating the total row count.

3. It might make more sense to present a user with an unpaginated set. If there is little data, but the queries take a lot of time, sending all data to the user is probably smarter. And if there is a lot of data, but the user is likely to be on a fast connection and decent screen size, it may also be appropriate to send all data at once. It could be that you should only avoid the unpaginated result for cell phone users.

4. Perform one full table scan without limit to retrieve all primary keys for the result set, then store these in a separate search table, properly ordered. For subsequent queries, you'd simply fetch from the search table and join on whatever table(s) hold the original data.
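The search-table idea in the last suggestion can be sketched as follows, using Python's sqlite3 module. All names here (links, search_results, pos, the two functions) and the sample data are hypothetical, and a real version would also need some way to key the stored results per search or per user and to expire them. The expensive ordered query runs once; each later page is a cheap range lookup on the stored position, joined back to the original table.

```python
import sqlite3


def build_demo_db():
    """Create a hypothetical links table; link 4 is inactive."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE links (id INTEGER PRIMARY KEY, url TEXT, "
        "active TEXT, date_added TEXT)"
    )
    conn.executemany(
        "INSERT INTO links VALUES (?, ?, ?, ?)",
        [
            (i, f"http://example.com/{i}", "0" if i == 4 else "1",
             f"2014-10-0{i}")
            for i in range(1, 8)
        ],
    )
    return conn


def build_search_table(conn):
    """Run the expensive query once and store the ordered primary keys."""
    ids = [
        row[0]
        for row in conn.execute(
            "SELECT id FROM links WHERE active = '1' "
            "ORDER BY date_added DESC, id DESC"
        )
    ]
    conn.execute("DROP TABLE IF EXISTS search_results")
    conn.execute(
        "CREATE TABLE search_results ("
        "pos INTEGER PRIMARY KEY, link_id INTEGER NOT NULL)"
    )
    # pos records each key's rank in the ordered result, starting at 1.
    conn.executemany(
        "INSERT INTO search_results (pos, link_id) VALUES (?, ?)",
        enumerate(ids, start=1),
    )
    return len(ids)  # total row count, known without a second count query


def fetch_page_from_search_table(conn, page, per_page):
    """Fetch one page (1-based) by position range, joined to the data."""
    first = (page - 1) * per_page + 1
    last = page * per_page
    return conn.execute(
        """
        SELECT links.id, links.url
        FROM search_results
            JOIN links ON links.id = search_results.link_id
        WHERE search_results.pos BETWEEN ? AND ?
        ORDER BY search_results.pos
        """,
        (first, last),
    ).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(build_search_table(conn))
    print(fetch_page_from_search_table(conn, 2, 4))
```

Because pos is the primary key of the search table, each page lookup is an index range scan regardless of how slow the original query was.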