[RESOLVED] INNER JOIN problem

Square1

Hi All:

I am in the midst of rewriting my site from ASP classic into PHP (MSSQL > MySQL) and may have painted myself into a corner with this problem. Hope the wisdom of this board could solve this!

I originally (MS SQL) had a table of athletes. It had their name, sport, playing status (retired, active etc.), 12 columns for up to 12 different teams they played on and individual columns for each type of award they may have won. So there was a HOF column, for example, and if this athlete was a Baseball HOFer it would say "Baseball" in this column for this athlete's record.

When rewriting the code to PHP I felt there was a better way to store this info. The problem with the old was that I was limited, especially in "teams", the number of options I can apply to any one athlete by the number of columns I had. I had 12 team columns, meaning I can use "NY Yankees" in team column "Team1", "Baltimore Orioles" in "Team2" and so on for "Reggie Jackson". But what about if an athlete was on more than 12 teams? Problem. Plus, the athlete table got very unwieldy with all these columns.

I thought I solved this problem by setting up a PlayerDetails table. This table would keep a separate record for each detail of the player and link to the Athlete table via the playerID hash. So now, there would be 2 separate records for the above example. One record would have "Reggie Jackson", the "Team" column would have "NY Yankees" and the playerID column would have his hash that matched the hash for "Reggie Jackson" in the Athlete table. I would have a second record the same as this with the exception of the team being "Baltimore Orioles". As Reggie was on 5 different teams and has 4 awards and records, there are a total of 9 records for "Reggie Jackson" all with the same playerID number linking this PlayerDetails table to the Athlete table.

Here is the FROM portion of my query to show you what I am doing:

PlayerDetails INNER JOIN ( Sports INNER JOIN ( Locations INNER JOIN ( Players INNER JOIN ( Promoters INNER JOIN  Events ON  Promoters.prID =  Events.prID) ON  Players.plID =  Events.plID) ON  Locations.loID =  Events.loID) ON  Sports.SportCode =  Players.SportCode) ON PlayerDetails.plID =  Players.plID

Now the problem...

As "Reggie Jackson" has 9 records in the PlayerDetails table, every time I search for "Reggie Jackson" his result is repeated 9 times. If I rewrite the FROM portion not to link PlayerDetails.plID = Players.plID everything is fine (with the exception that I cannot get results for "Team" or "Awards and Records" searches as expected). In hindsight, I see why this is happening. But I HOPE there is a way to remedy this without having to rewrite a TREMENDOUS amount of code!

Thanks!

johanafm

First off, I'd recommend rewriting the query from the unreadable nested joins to a straightforward join. I.e.

FROM Player
	INNER JOIN PlayerDetails ON Players.plID = PlayerDetails.plID
	INNER JOIN Sports.SportCode ON Players.SportCode = Sports.SportCode
	-- and so on

That way you can easily see which columns are used to link what tables. Well, at least it's easier to read for me. Should you find your way easier to read, by all means, stick with it.

As for your issue, I think you'd have to be more specific. If you want search for and display players, search the players table. As soon as you link to PlayerDetails, you will obviously get one entry per player detail that exist for each player. Also, if I understand you correctly, PlayerDetails contains the teams a person has been on, so it would make more sense to restructure this into two new tables: Teams and TeamPlayer, where Teams contains the team names and an id (and possibly other info) and the TeamPlayer contains a player id and a team id.

However, if you wish to do something like listing each player name and next to that all teams he has played for, in one row, you could use the aggregate function GROUP_CONCAT(), which works like php's implode() function on the specified field. Do note that you will also need to add a GROUP BY clause to the query, and that every selected field must be either part of an aggregate function or part of the group by clause.
With MySQL, you will need to take extra care, since MySQL doesn't enforce this SQL standards rule, at least not unless you have specified to run MySQL in strict mode in mysql.ini.

SELECT p.name, GROUP_CONCAT(t.team)
FROM Players P
	INNER JOIN TeamPlayer TP ON P.plID = TP.plID
	INNER JOIN Teams T ON t.id = TP.team_id
GROUP BY p.name

Square1

Thank you johanafm

Let me start by cleaning up the INNER JOIN. What am I doing wrong with the following? I tried it w/o the "AND" as well but no luck

			FROM
				Events
				INNER JOIN Events ON Players.plID =  Events.plID AND
				INNER JOIN Events ON Promoters.prID = Events.prID AND
			  	Players
				INNER JOIN Players ON PlayerDetails.plID = Players.plID AND
				INNER JOIN Players ON Sports.SportCode =  Players.SportCode AND
				Locations
				INNER JOIN Locations ON  Locations.loID =  Events.loID

johanafm

Just drop the ands.

Square1

johanafm;10989595 wrote:
Just drop the ands.

Thank you. I did that (and dropped "Players" and "Locations") and that worked and it it MUCH more user friendly.

As for my original problem of duplicate results records from the Events table I tried a simple solution... DISTINCT eventsID (the Events table auto increment) and that SEEMS to work.

But another problem reared its head. The results are taking about 15 seconds to appear. I will start a clean thread for this. Thanks johanafm.

johanafm

Square1;10989608 wrote:
simple solution... DISTINCT eventsID (the Events table auto increment) and that SEEMS to work.

Once again, you are (probably not) selecting the correct data or are joining unnecessary tables.

Let's say you have 3 tables, event(id), player(id, event_id), player_detail(id, player_id) and the first table contains (1), i.e. one event with id: 1, the second contains (1, 1), (2, 1) and the third contains (1, 1), (2, 1). Reading this out, you have one event with two players, and the first of those players has two player details.

Actually, I doubt you have this strcuture, but it's just to be used as an example to explain what happens when selecting data, with or without dsitinct and group by. Also note that I do not bother with names for events, players and teams. I keep it down to ids.

If you do

SELECT event.id AS event
FROM event
INNER JOIN player ON event.id = player.event_id

you will get

output wrote:
1
1

If you also select player.id

SELECT event.id AS event, player.id AS player
FROM event
INNER JOIN player ON event.id = player.event_id

you'd get

output wrote:
1, 1
1, 2

That is, no matter what data is selected, you will have 2 rows. However, if you add DISTINCT to the first query, you'd get only one row instead of 2. In the second case, you'd still get 2 rows, since 1,1 is not the same as 1,2. That is, each ROW is still DISTINCT from all other rows.

But, if you only want distinct event ids, why are you even joining them to the player table? Why not simply

SELECT id FROM event

Since I do not know what queries you are actually executing since you only show half of them, and since I can't guess at the queries purpose (I don't care at all about sports so I can't even begin to relate to what you might want to do), I cannot say for certain that you are doing something wrong, but my instinct tells me it's likely.

My hunch tells me you are
1. either joining in unnecessary tables, not selecting the appropriate fields and/or not using aggregate functions and a group by clause properly,
2. or do not have an appropriate schema

For example, in my example above, there'd be no purpose at all to join all three tables, since a player has only participated with one team in any particular event, even though the result for player 1 would be

event  player  detail
1	1	1
1	1	2

And in this case I'd say that the error is in the schema. The player should be linked to teams via a link table and the teams should be linked to events via a link table. The team_player table might contain (player_id, team_id, start_date, end_date), with a primary key of (player_id, team_id, start_date).
The event table should contain start and end dates for each event, which means that the link table team_event only contains (event_id, team_id), with the primary key containing both fields.

Now the schema is correct. If you want to list events between certain dates, you'd still not link anything to the events table. If you however want to show the same list of events along with their teams, you might

SELECT event.id, GROUP_CONCAT(team.id)
FROM event
INNER JOIN team_event ON event.id = team_event.event_id
INNER JOIN team ON team.id = team_event.team_id
WHERE event.start_date BETWEEN @date_one AND @date_two
GROUP BY event.id

Or the same as above, except that the list contains event, team

SELECT event.id, team.id
FROM event
INNER JOIN team_event ON event.id = team_event.event_id
INNER JOIN team ON team.id = team_event.team_id
WHERE event.start_date BETWEEN @date_one AND @date_two

But once again, if you don't want team info, just event info, then do not join in team data.

Square1

Didn't know you reply to a "Resolved" topic!

Thank you for the, excellent , detailed answer.

I believe it is a schema issue. Very similar to the first code example you are giving, I am getting duplicate records appearing due to the PlayersDetails table linking to the Players table via the same PlID (player ID #) for multiple records. The following is a sample from the PlayersDetails table

(PlayerID) 1, (Team) Yankees
(PlayerID) 1, (Team) Orioles
(PlayerID) 1, (Team) Athletics

So when I do a search for PlayerID 1, 3 duplicate records appear (assuming I don't filter the search further by team).

Your answer said "The team_player table might contain (player_id, team_id, start_date, end_date), with a primary key of (player_id, team_id, start_date)."
This is how I had it originally coded except that where you have "team_id", I had 12 columns "team_id_1", "team_id_2" etc so I can assign multiple teams to a single player. But the table got very large with all those columns so I tried to break the teams out into their own PlayerDetails table. But I am seeing this may not work! Additionally, this PlayerDetails table has about 45,000 records and between this and about 5 INNER JOINS it is making the results take very long to appear.

johanafm

Square1;10989909 wrote:
(PlayerID) 1, (Team) Yankees
(PlayerID) 1, (Team) Orioles
(PlayerID) 1, (Team) Athletics

So when I do a search for PlayerID 1, 3 duplicate records appear (assuming I don't filter the search further by team).

The point here is that IF you join the tables holding information about player and what teams he's played for, you should get those 3 records.

If you just want to retrieve the player information, you should NOT join in the table for players' team information. You never retrieve more information than you want to deal with. This is what you should do

SELECT player.id
FROM player

There is no reason at all to perform two needless joins and DISTINCT away the result of the join.

SELECT DISTINCT player.id
FROM player
INNER JOIN player_team AS pt ON player.id = pt.team_id
INNER JOIN team AS t ON pt.team_id = t.id

However, if you do need information from another table, you obviously need the join. But it doesn't necessarily have to mean that it has to return several rows. You can use aggregate functions (sum, avg, group_concat and others) and a group by clause to push groups of data into one row. Just be aware that every single thing in the select clause (the selected fields) MUST be either part of the group by clause or be part of an aggregate function.

SELECT players.id, GROUP_CONCAT(teams.name SEPARATOR ', ')
FROM players
INNER JOIN player_teams AS pt ON players.id = pt.player_id
INNER JOIN teams ON pt.team_id = teams.id
GROUP BY players.id

which would give you

pl.id   team.name
1       Yankees, Orioles, Athletics

In this example two fields are selected: players.id and teams.name. players.id is part of the group by clause and teams.name is part of an aggregate functions, so all selected fields are either in the group by clause or part of an aggregate function.

Square1;10989909 wrote:
This is how I had it originally coded except that where you have "team_id", I had 12 columns "team_id_1", "team_id_2" etc

Which breaks 1NF (First Normal Form) - see Database Normalization.

Square1;10989909 wrote:
so I can assign multiple teams to a single player. But the table got very large with all those columns so I tried to break the teams out into their own PlayerDetails table. But I am seeing this may not work!

Why wouldn't it work? This is exactly the way to go

CREATE TABLE players (
	id	INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
	name,
	etc
);

CREATE TABLE teams (
	id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
	name VARCHAR(10),
	etc,
);

CREATE TABLE player_team (
	player_id	INT UNSIGNED,
	team_id		INT UNSIGNED,
	start_date	DATE,
	end_date	DATE,
	PRIMARY KEY (player_id, team_id, start_date)
	FOREIGN KEY (player_id) REFERENCES players(id),
	FOREIGN KEY (team_id) REFERENCES teams(id),
);

Square1;10989909 wrote:
Additionally, this PlayerDetails table has about 45,000 records and between this and about 5 INNER JOINS it is making the results take very long to appear.

It should not take long at all. Assuming you have indices on all relevant fields, these joins should be very quick. Note that both primary and foreign keys are automatically indexed.
The only joins on the above tables would involve primary and foreign keys, which means that all those indices are covered. But you may want to search on start_date (WHERE start_date ....) and in that case you might need to add an index specifically for start_date since a composite key such as (player_id, team_id, start_date) can be used for player_id alone, a combination of player_id and team_id or a combination of all 3 values. But it cannot be used for team_id or start_date or a combination of team_id or start_date. Do note that team_id allready has its own index due to team_id being a foreign key.

Make use of EXPLAIN to see the query execution plan, indices involved and other useful information. It's a great help for identifying efficiency problems.

Square1

Wow... I never heard of 1NF. I read the link you provided and yes, going with 12 team columns is a no-no. So I guess I am on the right track but am not executing it correctly.

I removed the DISTINCT. I added GROUP BY the table ID and that cleared up the duplicate issue. But the SSSSSLLLLLOOOOOWWWWWNNNNNEEEEESSSS is still a problem.

I have this problem address in a separate thread. I am indexing the table with 45,000 rows but it is still showing that is it reading 45,000 rows on EXPLAIN.