How to speed up queries on a large table with 220 million rows (9 GB of data)?
We have a social site where members can rate each other for compatibility or matching. This `user_match_ratings` table contains over 220 million rows (9 GB of data, almost 20 GB in indexes). Queries against this table routinely show up in slow.log (threshold > 2 seconds) and are the most frequently logged slow queries in the system:
Query_time: 3  Lock_time: 0  Rows_sent: 3  Rows_examined: 1051
select rating, count(*) as tally from user_match_ratings where rated_user_id = 395357 group by rating;

Query_time: 4  Lock_time: 0  Rows_sent: 3  Rows_examined: 1294
select rating, count(*) as tally from user_match_ratings where rated_user_id = 4182969 group by rating;

Query_time: 3  Lock_time: 0  Rows_sent: 3  Rows_examined: 446
select rating, count(*) as tally from user_match_ratings where rated_user_id = 630148 group by rating;

Query_time: 5  Lock_time: 0  Rows_sent: 3  Rows_examined: 3788
select rating, count(*) as tally from user_match_ratings where rated_user_id = 1835698 group by rating;

Query_time: 17  Lock_time: 0  Rows_sent: 3  Rows_examined: 4311
select rating, count(*) as tally from user_match_ratings where rated_user_id = 1269322 group by rating;
- protocol version: 10
- version: 5.0.77-log
- version bdb: Sleepycat Software: Berkeley DB 4.1.24: (January 29, 2009)
- version compile machine: x86_64
- version compile os: redhat-linux-gnu
SHOW COLUMNS FROM user_match_ratings;
╔═══════════════╦════════════╦══════╦═════╦═════════╦════════════════╗
║ Field         ║ Type       ║ Null ║ Key ║ Default ║ Extra          ║
╠═══════════════╬════════════╬══════╬═════╬═════════╬════════════════╣
║ id            ║ int(11)    ║ NO   ║ PRI ║ NULL    ║ auto_increment ║
║ rater_user_id ║ int(11)    ║ NO   ║ MUL ║ NULL    ║                ║
║ rated_user_id ║ int(11)    ║ NO   ║ MUL ║ NULL    ║                ║
║ rating        ║ varchar(1) ║ NO   ║     ║ NULL    ║                ║
║ created_at    ║ datetime   ║ NO   ║     ║ NULL    ║                ║
╚═══════════════╩════════════╩══════╩═════╩═════════╩════════════════╝
select * from mutual_match_ratings where id=221673540;
╔═══════════╦═══════════════╦═══════════════╦════════╦═════════════════════╗
║ id        ║ rater_user_id ║ rated_user_id ║ rating ║ created_at          ║
╠═══════════╬═══════════════╬═══════════════╬════════╬═════════════════════╣
║ 221673540 ║ 5699713       ║ 3890950       ║ N      ║ 2013-04-09 13:00:38 ║
╚═══════════╩═══════════════╩═══════════════╩════════╩═════════════════════╝
The table has 3 indexes set up:
- single index on `rated_user_id`
- composite index on `rater_user_id, created_at`
- composite index on `rated_user_id, rater_user_id`
show index from user_match_ratings;
╔════════════════════╦════════════╦═══════════════════════════╦══════════════╦═══════════════╦═══════════╦═════════════╦══════════╦════════╦══════╦════════════╦═════════╗
║ Table              ║ Non_unique ║ Key_name                  ║ Seq_in_index ║ Column_name   ║ Collation ║ Cardinality ║ Sub_part ║ Packed ║ Null ║ Index_type ║ Comment ║
╠════════════════════╬════════════╬═══════════════════════════╬══════════════╬═══════════════╬═══════════╬═════════════╬══════════╬════════╬══════╬════════════╬═════════╣
║ user_match_ratings ║ 0          ║ PRIMARY                   ║ 1            ║ id            ║ A         ║ 220781193   ║ NULL     ║ NULL   ║      ║ BTREE      ║         ║
║ user_match_ratings ║ 1          ║ user_match_ratings_index1 ║ 1            ║ rater_user_id ║ A         ║ 11039059    ║ NULL     ║ NULL   ║      ║ BTREE      ║         ║
║ user_match_ratings ║ 1          ║ user_match_ratings_index1 ║ 2            ║ created_at    ║ A         ║ 220781193   ║ NULL     ║ NULL   ║      ║ BTREE      ║         ║
║ user_match_ratings ║ 1          ║ user_match_ratings_index2 ║ 1            ║ rated_user_id ║ A         ║ 4014203     ║ NULL     ║ NULL   ║      ║ BTREE      ║         ║
║ user_match_ratings ║ 1          ║ user_match_ratings_index2 ║ 2            ║ rater_user_id ║ A         ║ 220781193   ║ NULL     ║ NULL   ║      ║ BTREE      ║         ║
║ user_match_ratings ║ 1          ║ user_match_ratings_index3 ║ 1            ║ rated_user_id ║ A         ║ 2480687     ║ NULL     ║ NULL   ║      ║ BTREE      ║         ║
╚════════════════════╩════════════╩═══════════════════════════╩══════════════╩═══════════════╩═══════════╩═════════════╩══════════╩════════╩══════╩════════════╩═════════╝
Even with the indexes these queries are slow.
Would separating this table/data onto another database, on a server that has enough RAM to store this data in memory, speed up these queries? Is there anything in the way the tables/indexes are set up that we can improve upon to make these queries faster?
Currently we have 16 GB of memory; however, we are looking into either upgrading the existing machine to 32 GB or adding a new machine with at least that much, and maybe solid state drives as well.
Your question is fascinating. I am very interested in how you managed to get results in <= 2 seconds, because I have a table with only 20 million records and a `SELECT QUERY` still takes 30 seconds. Would you please share your current solution? P.S. Your question forced me to join this community (y) ;)
Look at the indexes on the table you are querying; often a lot of improvements can be made by creating the appropriate index. Not always, but I've seen a lot of instances where queries are made fast by providing an index on the columns in the WHERE clause of a query, especially as a table grows larger and larger.
Thoughts on the issue, thrown in random order:
The obvious index for this query is `(rated_user_id, rating)`. A query that gets data for only one of a million users and needs 17 seconds is doing something wrong: it reads from the `(rated_user_id, rater_user_id)` index and then reads from the table the (hundreds to thousands of) values for the `rating` column, since `rating` is not in any index. So the query has to read many rows of the table, which are located in many different disk locations.
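A minimal sketch of adding that covering index (the name `idx_rated_rating` is illustrative):

```sql
-- Covers: SELECT rating, COUNT(*) ... WHERE rated_user_id = ? GROUP BY rating
-- so the query can be answered from the index alone, without table reads.
ALTER TABLE user_match_ratings
    ADD INDEX idx_rated_rating (rated_user_id, rating);
```

Note that building an index over 220 million rows will take a while, so schedule it accordingly.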
Before you start adding numerous indexes to the tables, try to analyze the performance of the whole database and the whole set of slow queries, and examine again your choices of datatypes, the engine you use and the configuration settings.
Consider moving to a newer version of MySQL: 5.1, 5.5 or even 5.6 (also consider the Percona and MariaDB versions). There are several benefits, as bugs have been corrected, the optimizer has been improved, and you can set the threshold for slow queries to less than 1 second (like 10 milliseconds). This will give you far better info about slow queries.
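For example, once on 5.1.21 or later (5.0 only accepts whole seconds):

```sql
-- Fractional long_query_time is supported from MySQL 5.1.21 onwards;
-- this logs every query slower than 10 milliseconds.
SET GLOBAL long_query_time = 0.01;
```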
The choice of datatype for the `rating` column: why `VARCHAR(1)`? Why not `CHAR(1)`? Why not `TINYINT`? Changing it will save you some space, both in the table and in the indexes that (will) include that column. A `VARCHAR(1)` column needs one more byte than `CHAR(1)`, and if they are utf8, the (VAR)CHAR columns will need 3 (or 4) bytes, instead of 1 for `TINYINT`.
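A sketch of the two conversions (the `TINYINT` variant assumes the application can map rating letters to numbers, e.g. 'N' to 0):

```sql
-- Smallest change: drop VARCHAR's length byte.
ALTER TABLE user_match_ratings
    MODIFY rating CHAR(1) NOT NULL;

-- Bigger saving, if a letter-to-number mapping is acceptable:
-- ALTER TABLE user_match_ratings
--     MODIFY rating TINYINT NOT NULL;
```

Either ALTER rebuilds the 220-million-row table, so plan for the downtime.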
How much performance impact or storage wastage, in percentage terms, do you get if you use the wrong datatype?
@FlyingAtom It depends on the case, but for some indexed columns that still need to be scanned (for example, when you don't have a WHERE clause but are only retrieving that column), the engine might decide to scan the index instead of the table, and if you optimize your datatype down to half the size, the scan would be twice as fast and the response would be half the size. If you are still scanning the table instead of an index (for example, when you retrieve more columns than just the ones in the index), the benefits would be less significant.
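One way to put a number on the storage side (a sketch; it reports the current data and index footprint, which you can compare before and after a datatype change):

```sql
-- information_schema.TABLES is available from MySQL 5.0 onwards.
SELECT table_name,
       data_length  / 1024 / 1024 AS data_mb,
       index_length / 1024 / 1024 AS index_mb
  FROM information_schema.tables
 WHERE table_schema = DATABASE()
   AND table_name   = 'user_match_ratings';
```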
I handled tables for the German government, some with 60 million records.
We had a lot of these tables.
And we often needed to know the total number of rows in a table.
After talking with Oracle and Microsoft programmers, we were not so happy...
So we, the group of database programmers, decided that record one of every table would always be the record in which the total number of records is stored. We updated this number on every INSERT or DELETE of rows.
We tried all the other ways. This is by far the fastest way.
We have used this approach since 1998 and have never had a wrong row count, in any of our multi-million-record tables.
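This answer keeps the count inside record one of the same table; a variant of the same idea with a separate counter table maintained by triggers (all names here are illustrative; triggers need MySQL 5.0+) might look like this:

```sql
-- Hypothetical counter table: one row per counted table.
CREATE TABLE row_counters (
    table_name VARCHAR(64) NOT NULL PRIMARY KEY,
    total      BIGINT      NOT NULL
);

-- Keep the counter in step with every INSERT and DELETE.
CREATE TRIGGER user_match_ratings_ai AFTER INSERT ON user_match_ratings
FOR EACH ROW
    UPDATE row_counters SET total = total + 1
     WHERE table_name = 'user_match_ratings';

CREATE TRIGGER user_match_ratings_ad AFTER DELETE ON user_match_ratings
FOR EACH ROW
    UPDATE row_counters SET total = total - 1
     WHERE table_name = 'user_match_ratings';
```

Reading the total is then a single-row primary-key lookup instead of a full count.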
I'd suggest looking into some of the features introduced in the last 18 years. Among others, `count(*)` has seen some improvements.
I would try partitioning on rating types, like:
`mutual_match_ratings_N`, `mutual_match_ratings_S`, etc.
You would have to perform a query for each type, but perhaps that's faster than the current approach. Give it a try; see the sketch after this answer.
This assumes you have a fixed number of rating types, and that you don't need this table for other queries that would be worse with this new structure.
If that's the case, you should look for another approach, or maintain two copies of the table (your initial table and the partitioned ones), if that's affordable in terms of space and maintainability (or application logic).
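A sketch of the per-type tally under that split, using the table names above (only two types shown; repeat one branch per rating type):

```sql
-- Each branch is a single-table COUNT(*) against a much smaller table.
SELECT 'N' AS rating, COUNT(*) AS tally
  FROM mutual_match_ratings_N
 WHERE rated_user_id = 395357
UNION ALL
SELECT 'S' AS rating, COUNT(*) AS tally
  FROM mutual_match_ratings_S
 WHERE rated_user_id = 395357;
```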