We're told "Select Count() is BAD!". I assume this is the reason I see this in a stats routine here:

   if (is_numeric($product)) {
      $sql = "select id from stats where listing=$product and time >= $start and time <= $end;";
      $result = $db->query($sql);
      echo "<p>Total Views for dates indicated: {$result->num_rows}</p>";
   }

Would "select count(id) be faster"? ID is a primary/auto-incremented key.

Any other thoughts? This takes about 10 shades of infinity to run.
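
For concreteness, here's roughly what the count-based variant I'm asking about would look like (same $db connection and variables as above; the exact query is just a sketch):

   if (is_numeric($product)) {
      // let MySQL count the matching rows instead of returning them all
      $sql = "select count(id) as views from stats where listing=$product and time >= $start and time <= $end;";
      $result = $db->query($sql);
      $row = $result->fetch_assoc();
      echo "<p>Total Views for dates indicated: {$row['views']}</p>";
   }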

    Speed-wise, I don't know for sure, but resource-wise it seems like doing select count(id) would be less intensive, as it would not have to store all the matching rows somewhere, ready to be retrieved by a fetch loop.

    IIRC, MySQL claims that it optimizes select count(*) for efficiency, so that might be the best option if using MySQL (though it may depend on the storage engine?).

      Very little appreciable difference, I'm afraid:

      Mysql size of result set test, START at 1552416197
      Total Views for dates indicated: 186304
      MySQL size of result set test, End at 1552416219
      
      MySQL size of result set test, TIME: 22 seconds
      
      Mysql count(primary_key) test, START at 1552416219
      Total Views for dates indicated: 186304
      MySQL count(primary_key) test, End at 1552416240
      
      MySQL count(primary_key) test, TIME: 21 seconds  

      This is MySQL 5.6 on FreeBSD 10.4. The table in question contains about 14.8M rows, and as shown, the result set is 186K. System load average was moderately light (0.30). RAM:

      real memory  = 4294967296 (4096 MB)
      avail memory = 2967719936 (2830 MB)

        Oh... if we're talking about that sort of response time, I suspect the place for optimization is the WHERE clause: ensure there are indexes on all the columns being used there, and consider the order in which they are checked (e.g. which would produce the smaller subset: the check on listing, or the date range?). Might be time for an EXPLAIN?
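
        Something like this, just to see which indexes are candidates (the listing value and timestamps below are placeholders):

        EXPLAIN SELECT id
          FROM stats
         WHERE listing = 12345
           AND time >= 1552000000
           AND time <= 1552416000;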

        NogDog

        "EXPLAIN"

        Hmm.

        Looks OK if we're inquiring about one listing ID (3 possible indexes), but not if we leave that off (as in, how many pageloads for ALL products in the given time period: NULL indexes).

        I'll report back Real Soon Now(tm), LOL ...

          The way I see it, it isn't true that "Select Count() is BAD!" I suspect this arose because people were using MySQL with a storage engine for which a SELECT COUNT(*) of the entire table would return a cached row count; then they switched to another storage engine and lo, it was slow, because there was no such cached row count, so they jumped to this conclusion. Since the issue is the full table scan, replacing that COUNT(*) with a query that retrieves the entire table, and then taking the row count of the result set, isn't necessarily going to be any faster, and it might even be slower. I think the fast solutions tend to involve things like caching the row count yourself and periodically synchronising it by doing an out-of-band full table scan with COUNT(*).
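
          As a rough sketch of that caching idea (the table and column names here are made up, not anything from your schema):

          CREATE TABLE stats_rowcount (total BIGINT UNSIGNED NOT NULL);
          INSERT INTO stats_rowcount (total) VALUES (0);

          -- out-of-band refresh, run periodically (e.g. from cron)
          UPDATE stats_rowcount SET total = (SELECT COUNT(*) FROM stats);

          -- cheap read for the application
          SELECT total FROM stats_rowcount;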

          But this is not what you're doing here. You're not doing a COUNT(*) to get the row count of the entire table; rather, you have a particular result set and want to count its rows. I'm inclined to say that COUNT(*) would be the right solution here, as it expresses that you wish to count the number of rows of the result set; COUNT(id), with id being a non-NULL primary key, should work too and corresponds more closely to the current query, but it also introduces the requirement that id be non-null and indexed, otherwise you might not be counting the rows you want to count, or the lack of an index could introduce a time penalty.

          Speaking of indexes, though, my feeling is that this could ultimately be why the query is slow: it looks like listing is a foreign key, so presumably it is already indexed, but if time isn't indexed, a full table scan may be needed to determine the rows that meet your time constraints, and that could be slow for a sufficiently large table.
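
          If that turns out to be the case, I'd start with something along these lines (the index name is just a placeholder):

          SHOW INDEXES FROM stats;

          -- if nothing covers time on its own:
          CREATE INDEX idx_stats_time ON stats (time);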

          laserlight

          Listing and time are indexed together, so a query on a specific listing is pretty quick by comparison. Doing "all listings in a given time interval" is painfully slow, though. It appears from EXPLAIN that doing "all listings" has NO key available, so I've got to figure out what the optimal keying structure will be so that either type of query could work.

          mysql> describe stats;
          +-----------+--------------+------+-----+---------+----------------+
          | Field     | Type         | Null | Key | Default | Extra          |
          +-----------+--------------+------+-----+---------+----------------+
          | id        | int(11)      | NO   | PRI | NULL    | auto_increment |
          | listing   | int(11)      | YES  | MUL | NULL    |                |
          | referer   | varchar(512) | YES  |     | NULL    |                |
          | time      | int(11)      | YES  |     | NULL    |                |
          | is_bot    | tinyint(4)   | YES  |     | NULL    |                |
          +-----------+--------------+------+-----+---------+----------------+
          
          mysql> show indexes from stats;
          +----------------+------------+--------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
          | Table          | Non_unique | Key_name           | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
          +----------------+------------+--------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
          | stats          |          0 | PRIMARY            |            1 | id          | A         |    14878754 |     NULL | NULL   |      | BTREE      |         |               |
          | stats          |          1 | listing_time_where |            1 | listing     | A         |      826597 |     NULL | NULL   | YES  | BTREE      |         |               |
          | stats          |          1 | listing_time_where |            2 | time        | A         |    14878754 |     NULL | NULL   | YES  | BTREE      |         |               |
          +----------------+------------+--------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+

            Indexing was indeed the issue.

            I think we had assumed that this:

            create index listing_time_where on stats (listing,time);

            would do what we wanted. A test index:

            create index foobar on stats (time);

            took nearly 14 minutes to build, but now select count(id) from stats where time > $start and time < $end takes about 1 second.

            Thank you, gentlemen!

              Yeah, if you do a multi-column index, it will only work optimally if you do, in fact, filter on those columns in that order (or on the first n columns of the index, at least). The index can sort of be thought of as a separate table sorted on the specified columns, as if you did an order by listing, time in this case; so if you're searching only on time, it essentially has to do a full scan, or at least go through every listing and then jump to the relevant time range within each listing.
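
              To illustrate with the indexes you now have (the listing value and timestamps are placeholders):

              -- can use the (listing, time) index: the leading column is constrained
              SELECT COUNT(*) FROM stats
               WHERE listing = 12345 AND time BETWEEN 1552000000 AND 1552416000;

              -- can still use it: listing alone is a leftmost prefix of that index
              SELECT COUNT(*) FROM stats WHERE listing = 12345;

              -- cannot use it effectively: time is not a leftmost prefix,
              -- which is why the separate index on (time) was needed
              SELECT COUNT(*) FROM stats
               WHERE time BETWEEN 1552000000 AND 1552416000;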
