Add AI PK even if you will never use it in a query?

Good2CU

If I have a table with 2 or 3 fields that need to be unique for a record (either Unique Index or Compound/Composite Key) should I still add an auto increment primary key? The primary key would never be queried so it seems like a waste of space. Is this beneficial for performance/scaleability?

Let's say I have a Many:Many table that is combining attributes from 2 separate Unique Attribute tables, and I will want to query on column 1 to get the attributes in column 2.

NogDog

As far as my understanding of data normalization goes, you should be fine with that as long as no other fields in the table describe only one of those two key fields (i.e. any other non-key fields should describe some aspect of the combined key fields, not just one of them). So for me, the question would be whether or not you ever foresee a circumstance where you would want to have a column in another table point to a record in this table, in which case I would then probably want a single-column primary key that can be used as the foreign key.

But I'll be curious to see what any of the real DB experts have to say. 🙂

sneakyimp

Without a primary key or a single column that uniquely identifies each record, updates and deletes require that you specify more than one field to identify the record you are hoping to edit/delete. This can be a pain in the ass if you end up typing a lot of sql commands to remove individual records because the commands are just longer:

DELETE FROM mytable WHERE col1='foo' AND col2='BAR';

Regarding performance, this could be considerably slower than just deleting/updating a record based on its primary key. The question arises -- are col1 and col2 indexed?

If you'd like to prevent duplicate records in this table, you could possibly define a unique index for multiple columns:

ALTER TABLE mytable ADD UNIQUE INDEX(col1, col2, col3);

Such an index would result in exceptions/errors being thrown if you try to insert a duplicate record.

If you are worried about the primary key being a 'waste of space' I would imagine it probably doesn't take a lot of space but would depend on how big a number you use for your primary key. Offhand, I don't know of the MySQL storage engine stores integer primary keys as 32bits or 64 bits (depending on the definition) or whether the db engine stores an integer or as a string representation of an integer. If they are stored as strings and you allow a 10-digit value, you're only talking about 10MB for a million-record table. If they are stored as 8-btye integers, it's slightly less I think. Not a lot in either case. You could create a table and fill it with data to see the size difference. I don't expect it would be much.

If you have a 2-column table and are always querying to find something based on the contents of column1, then column1 should be indexed. You probably don't need to index column2. There is some overhead in maintaining both this index and the index corresponding to a primary key, but I don't expect it would add much overhead.

Good2CU

Thanks, interesting stuff.

Deletes would be rare, and updates likely non-existent. If I were to move or just delete historical data, I could still do it based on a timestamp which I almost always like to use in each table. The AI PK wouldn't correspond to anything, so I'm not sure how helpful that would be for deletes unless it's just doing a range similar to with a timestamp.

If it's possible, I'll do a unique index on multiple fields and another index on the primary field that would be used in the where clause.

Another thought I had was that if you use a AI PK, can mysql still go and insert data on the actual hard drives based on the order of the meaningful field that will be used in the where clause? Say you have a city_id field with some other id field combining to be unique. As you add new records over time you will have the old city_id's repeated many times, so it'd be nice to keep them in order on the hard drive so they can be looked up faster with a recursive lookup rather than looking through each row or some other method of indexing which would seem slower and/or with more overhead.

NogDog

As long as you are indexing your tables according to how they'll be used, you shouldn't have to worry about how it's stored on disk -- the indexes are there in order to optimize searches and sorts so you don't have to. 🙂 (In fact, you probably cannot control how it's stored on disk even if you wanted to: it's usually pretty difficult to anticipate just how any DBMS is going to actually store the data.

Weedpacket

Note that a primary key can involve more than one column; any set of columns that is guaranteed to have a unique set of values for every row (without including any columns that aren't necessary to achieve this) is a candidate primary key.

If there is no such set of columns then it means you're happy to have duplicate rows.

johanafm

First, lots of theory. Finally, an example which may relate to your data. If you want more specific help, provide us with the specifics of your data. The first thing to realize is that there is no one "correct way of doing things". There are only good and bad ways of doing things in relation to specific data and specific usage of that data. With the exception of database normalization.

sneakyimp;11014367 wrote:
DELETE FROM mytable WHERE col1='foo' AND col2='BAR';
are col1 and col2 indexed?

As Weedpacket points out, and in line with the original question, I'd say these two columns would indeed be the primary key, if no AI PK is chosen instead. If this is the case, they should be defined as primary key with either one of

PRIMARY KEY (col1, col2)
PRIMARY KEY (col2, col1)

and a primary key is always indexed, just like a foreign key (FK) is.

If an AI PK is chosen, then you should have at least one of

UNIQUE INDEX (col1, col2)
UNIQUE INDEX (col2, col1)

since you definitely want the DB to ensure that the data is indeed unique if it's supposed to be unique. However, speed may still be an issue.

You have no control over the order in which things are stored on disk - that's what indexes are for. The SQL standard even states that a table is in no particular order unless you specify an ORDER BY clause. Also, since you may either wish to retrieve rows based on different columns, or wish to order by different columns, there is no one way to order a table to cater to many different such orderings.

If you have both an AI PK and a UNIQUE INDEX (col1, col2), the AI PK can be faster. It generally comes down to key size.

CREATE TABLE tbl (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
a1 VARCHAR(20),
a2 VARCHAR(255),
UNIQUE INDEX (a1, a2),
UNIQUE INDEX (a2, a1)
);

INSERT INTO tbl(a1, a2) VALUES
('a', 'a'),
('b', 'a'),
('c', 'a'),
('c', 'b');

DESCRIBE SELECT *
FROM tbl
WHERE a1 = 'b' AND a2='a';

DESCRIBE SELECT *
FROM tbl
WHERE id=2;

The id index is 4 bytes, the (a1, a2) is 281 bytes.

The reason key size matters is that disk access is done one block at a time, and roughly takes 5-8 ms. Thus, the more index keys that fit in one block, the fewer block accesses are needed to find the one you need. If you have your index keys spread over too many blocks, a second level index (which indexes the index itself) can be used to cut down on block accesses to insanly many to 2. If a second level index still would mean insanely many block accesses, a third level could cut this down to 3. The theory on indexing is more complex than that, and if you're interested you should be able to find books or online reasources dealing with this topic. But the cool thing is that you don't have to: The DB manages this for you, courtesy of people who do nothing else 🙂

If you wish to check what actual performance difference there is between something like the above two selects, write a php script that fills the above table with a few million rows of data, and then arbitrarly choose a few different rows and time how long it takes to retrieve them using both of the two indexes one at the time. Perhaps you'd need to disable query caching to get proper results, or switch between which key you retrive a row by first.

I believe Brad brilliantly summarized wether to choose to add an AI PK or not: If you ever foresee a circumstance where you would want to have a column in another table point to a record in this table
I'd like to emphasize one of the main concepts of DB normalization: never store duplicate data.
Generally, if you have a table storing "real data" (i.e. not a link table), it will be referenced in other tables. Thus if you have a table of places (country, city) and then want to store events (country, city, event name, date) you'd have duplicate storage of both countries and then (country, city). If a country changes names, you'd have to change everywhere. Missing one place => DB in inconsistent state.

An AI PK would obviously still be duplicated in referencing tables. But it never changes. Now you just change country names in once place if needed. Also, storing a 4 byte integer many times takes less space than storing the country name everywhere.

If you're dealing with a link table event(city_id, event_id, date), which references city (id, country_id, city name) and event_info (id, event name, event info). Then, you'd only need to store city_id with the event, not both country and city, since the city_id uniquely identifies a country.

This link table has a candidate key of (city_id, event_id, date). Since this information will probably never be referenced from other places there is no need to add an AI PK. Moreover, you will only access this table either through the other tables or directly by selecting city, date and event from lists which show the city and event names to the user while it uses their ids when querying the database. Some people always add an AI PK though. The extra space is usually of little concern.

Compound keys: It is generally being a good idea to let the database ensure that constraints that are supposed to be there (uniqueness of col1, col2) actually are there in reality. But there are other upsides and that's when a user needs to search data himself. The user rarely knows the id of what they want, so first they search for col1=@val1 AND col2=@val2. Once they have the information, the AI PK can be used subsequently to keep track of that data. It's also of use for range searches: col1=@val1 AND col2>@val2 means that col1 is used to find all rows matching that, and then the rest of the index is used for the range.

Index (col1, col2) can be used to search for col1, or col1 AND col2, but not col2 alone. For that you'd need an index on col2 as well. To make use of an index if you search for col2=@val2 AND col1>@val1, you'd need an index of (col2, col1). Thus, both indices may be needed.

Storing indices takes time. Given your use case with few or no updates and few deletes, having multiple indices is not detrimental for performance. But if you have lots of inserts, updates and deletes, then too many indices will result in performance degradation.

sneakyimp;11014367 wrote:
Offhand, I don't know of the MySQL storage engine stores integer primary keys as 32bits or 64 bits (depending on the definition) or whether the db engine stores an integer or as a string representation of an integer.

The anwer is here. INT is 4 bytes, BIGINT is 8. The storage used is entirely dependent on the data type specified for the column. You can INSERT INTO tbl (stringcolumn) VALUES (1), in which case the integer 1 in the insert query is converted into the string '1'.

sneakyimp;11014367 wrote:
Not a lot in either case.

As Sneaky points out, if you have a table where each row uses up 100 bytes, adding a 4-byte key means a 4% additional storage. On the other hand, if that also means only storing 4 bytes whenever a row is referenced by an FK instead of storing a 20 character string (20-60 bytes if UTF-8 is used), then you will actually be saving a lot of space instead.

Good2CU;11014375 wrote:
Another thought I had was that if you use a AI PK, can mysql still go and insert data on the actual hard drives based on the order of the meaningful field that will be used in the where clause? Say you have a city_id field with some other id field combining to be unique. As you add new records over time you will have the old city_id's repeated many times, so it'd be nice to keep them in order on the hard drive so they can be looked up faster

Keeping rows "in order" is exactly what indices do. Let's go with two tables

CREATE TABLE country (
  id INT UNSIGNED AI PK,
  country VARCHAR,
  INDEX (country)
) 
CREATE TABLE city (
  id INT UNSIGNED AI PK,
  country_id INT UNSIGNED,
  city VARCHAR,
  INDEX (city)
  FOREIGN KEY (country_id) REFERENCES country(id)
)

And your base data is

INSERT INTO country (country) VALUES
('Canadia'),
('USA');

INSERT INTO city (country_id, city) VALUES
(1, 'Montréal'),
(2, 'Vancouver'),
(2, 'New York');

Even with a later

INSERT INTO city (country_id, city) VALUES (1, 'Jasper')

your index will take care of storing things for quick retrieval due to the index on country_id. This index is implicit since you have a FOREIGN KEY for this field. Thus, when you

SELECT *
FROM city
INNER JOIN country ON country.id = country_id
WHERE country='Canadia';

The database will start by finding matching rows for country = 'Candia'. Since there is an index on country, this will be used. Once it finds the one matching row, it has access to its id. Then it joins on city using the id and retrieves the 3 rows having country_id = 1. Except for the first part where it locates the row for Canada, this would be the same as

SELECT *
FROM city
WHERE country_id = 1;

You can see this by adding DESCRIBE in front of "SELECT". If you now CREATE INDEX cid_city ON city(country_id, city) and modify the query above to

SELECT *
FROM city
WHERE country_id = 1 AND city like 'J%';

The key used for the city table would be changed to cid_city, since the DB can then make use of the compound index on (country_id, city).

Derokorian

OMG johanfm, that was such a great and detailed explanation. Thank you so much for taking the time.

Some people always add an AI PK though.

I'm one of these people, and my thought process behind it is as simple as: How do I know what requirements may arise in the future during the life cycle of said application.

Weedpacket

Derokorian wrote:
I'm one of these people, and my thought process behind it is as simple as: How do I know what requirements may arise in the future during the life cycle of said application.

I tend to go this way as well, but if there is an integer UNIQUE NOT NULL column already present then it makes sense to use that. Should it turn out later not to be "unique" after all, then given that it used to be unique, it's more likely because there is now something being duplicated that should be in another table.

tomhath

NogDog;11014387 wrote:
(In fact, you probably cannot control how it's stored on disk even if you wanted to: it's usually pretty difficult to anticipate just how any DBMS is going to actually store the data.

Some let you force the data to be stored in the same order as one of the indexes so data retrieval is faster. SQL Server calls it a "clustered index".