First, lots of theory. Finally, an example which may relate to your data. If you want more specific help, provide us with the specifics of your data. The first thing to realize is that there is no one "correct way of doing things". There are only good and bad ways of doing things in relation to specific data and specific usage of that data. With the exception of database normalization.
sneakyimp;11014367 wrote:
DELETE FROM mytable WHERE col1='foo' AND col2='BAR';
are col1 and col2 indexed?
As Weedpacket points out, and in line with the original question, I'd say these two columns would indeed be the primary key, if no AI PK is chosen instead. If this is the case, they should be defined as primary key with either one of
PRIMARY KEY (col1, col2)
PRIMARY KEY (col2, col1)
and a primary key is always indexed, just like a foreign key (FK) is.
If an AI PK is chosen, then you should have at least one of
UNIQUE INDEX (col1, col2)
UNIQUE INDEX (col2, col1)
since you definitely want the DB to ensure that the data is indeed unique if it's supposed to be unique. However, speed may still be an issue.
You have no control over the order in which things are stored on disk - that's what indexes are for. The SQL standard even states that a table is in no particular order unless you specify an ORDER BY clause. Also, since you may either wish to retrieve rows based on different columns, or wish to order by different columns, there is no one way to order a table to cater to many different such orderings.
If you have both an AI PK and a UNIQUE INDEX (col1, col2), the AI PK can be faster. It generally comes down to key size.
CREATE TABLE tbl (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
a1 VARCHAR(20),
a2 VARCHAR(255),
UNIQUE INDEX (a1, a2),
UNIQUE INDEX (a2, a1)
);
INSERT INTO tbl(a1, a2) VALUES
('a', 'a'),
('b', 'a'),
('c', 'a'),
('c', 'b');
DESCRIBE SELECT *
FROM tbl
WHERE a1 = 'b' AND a2='a';
DESCRIBE SELECT *
FROM tbl
WHERE id=2;
The id index is 4 bytes, the (a1, a2) is 281 bytes.
The reason key size matters is that disk access is done one block at a time, and roughly takes 5-8 ms. Thus, the more index keys that fit in one block, the fewer block accesses are needed to find the one you need. If you have your index keys spread over too many blocks, a second level index (which indexes the index itself) can be used to cut down on block accesses to insanly many to 2. If a second level index still would mean insanely many block accesses, a third level could cut this down to 3. The theory on indexing is more complex than that, and if you're interested you should be able to find books or online reasources dealing with this topic. But the cool thing is that you don't have to: The DB manages this for you, courtesy of people who do nothing else 🙂
If you wish to check what actual performance difference there is between something like the above two selects, write a php script that fills the above table with a few million rows of data, and then arbitrarly choose a few different rows and time how long it takes to retrieve them using both of the two indexes one at the time. Perhaps you'd need to disable query caching to get proper results, or switch between which key you retrive a row by first.
I believe Brad brilliantly summarized wether to choose to add an AI PK or not: If you ever foresee a circumstance where you would want to have a column in another table point to a record in this table
I'd like to emphasize one of the main concepts of DB normalization: never store duplicate data.
Generally, if you have a table storing "real data" (i.e. not a link table), it will be referenced in other tables. Thus if you have a table of places (country, city) and then want to store events (country, city, event name, date) you'd have duplicate storage of both countries and then (country, city). If a country changes names, you'd have to change everywhere. Missing one place => DB in inconsistent state.
An AI PK would obviously still be duplicated in referencing tables. But it never changes. Now you just change country names in once place if needed. Also, storing a 4 byte integer many times takes less space than storing the country name everywhere.
If you're dealing with a link table event(city_id, event_id, date), which references city (id, country_id, city name) and event_info (id, event name, event info). Then, you'd only need to store city_id with the event, not both country and city, since the city_id uniquely identifies a country.
This link table has a candidate key of (city_id, event_id, date). Since this information will probably never be referenced from other places there is no need to add an AI PK. Moreover, you will only access this table either through the other tables or directly by selecting city, date and event from lists which show the city and event names to the user while it uses their ids when querying the database. Some people always add an AI PK though. The extra space is usually of little concern.
Compound keys: It is generally being a good idea to let the database ensure that constraints that are supposed to be there (uniqueness of col1, col2) actually are there in reality. But there are other upsides and that's when a user needs to search data himself. The user rarely knows the id of what they want, so first they search for col1=@val1 AND col2=@val2. Once they have the information, the AI PK can be used subsequently to keep track of that data. It's also of use for range searches: col1=@val1 AND col2>@val2 means that col1 is used to find all rows matching that, and then the rest of the index is used for the range.
Index (col1, col2) can be used to search for col1, or col1 AND col2, but not col2 alone. For that you'd need an index on col2 as well. To make use of an index if you search for col2=@val2 AND col1>@val1, you'd need an index of (col2, col1). Thus, both indices may be needed.
Storing indices takes time. Given your use case with few or no updates and few deletes, having multiple indices is not detrimental for performance. But if you have lots of inserts, updates and deletes, then too many indices will result in performance degradation.
sneakyimp;11014367 wrote:
Offhand, I don't know of the MySQL storage engine stores integer primary keys as 32bits or 64 bits (depending on the definition) or whether the db engine stores an integer or as a string representation of an integer.
The anwer is here. INT is 4 bytes, BIGINT is 8. The storage used is entirely dependent on the data type specified for the column. You can INSERT INTO tbl (stringcolumn) VALUES (1), in which case the integer 1 in the insert query is converted into the string '1'.
sneakyimp;11014367 wrote:
Not a lot in either case.
As Sneaky points out, if you have a table where each row uses up 100 bytes, adding a 4-byte key means a 4% additional storage. On the other hand, if that also means only storing 4 bytes whenever a row is referenced by an FK instead of storing a 20 character string (20-60 bytes if UTF-8 is used), then you will actually be saving a lot of space instead.
Good2CU;11014375 wrote:
Another thought I had was that if you use a AI PK, can mysql still go and insert data on the actual hard drives based on the order of the meaningful field that will be used in the where clause? Say you have a city_id field with some other id field combining to be unique. As you add new records over time you will have the old city_id's repeated many times, so it'd be nice to keep them in order on the hard drive so they can be looked up faster
Keeping rows "in order" is exactly what indices do. Let's go with two tables
CREATE TABLE country (
id INT UNSIGNED AI PK,
country VARCHAR,
INDEX (country)
)
CREATE TABLE city (
id INT UNSIGNED AI PK,
country_id INT UNSIGNED,
city VARCHAR,
INDEX (city)
FOREIGN KEY (country_id) REFERENCES country(id)
)
And your base data is
INSERT INTO country (country) VALUES
('Canadia'),
('USA');
INSERT INTO city (country_id, city) VALUES
(1, 'Montréal'),
(2, 'Vancouver'),
(2, 'New York');
Even with a later
INSERT INTO city (country_id, city) VALUES (1, 'Jasper')
your index will take care of storing things for quick retrieval due to the index on country_id. This index is implicit since you have a FOREIGN KEY for this field. Thus, when you
SELECT *
FROM city
INNER JOIN country ON country.id = country_id
WHERE country='Canadia';
The database will start by finding matching rows for country = 'Candia'. Since there is an index on country, this will be used. Once it finds the one matching row, it has access to its id. Then it joins on city using the id and retrieves the 3 rows having country_id = 1. Except for the first part where it locates the row for Canada, this would be the same as
SELECT *
FROM city
WHERE country_id = 1;
You can see this by adding DESCRIBE in front of "SELECT". If you now CREATE INDEX cid_city ON city(country_id, city) and modify the query above to
SELECT *
FROM city
WHERE country_id = 1 AND city like 'J%';
The key used for the city table would be changed to cid_city, since the DB can then make use of the compound index on (country_id, city).