Can MySQL reasonably perform queries on billions of rows?
I am planning on storing scans from a mass spectrometer in a MySQL database and would like to know whether storing and analyzing this amount of data is remotely feasible. I know performance varies wildly depending on the environment, but I'm looking for the rough order of magnitude: will queries take 5 days or 5 milliseconds?
Each input file contains a single run of the spectrometer; each run comprises a set of scans, and each scan has an ordered array of datapoints. There is a bit of metadata, but the majority of the file consists of arrays of 32- or 64-bit ints or floats.
|----------------+-------------------------------|
| OS             | Windows 2008 64-bit           |
| MySQL version  | 5.5.24 (x86_64)               |
| CPU            | 2x Xeon E5420 (8 cores total) |
| RAM            | 8GB                           |
| SSD filesystem | 500 GiB                       |
| HDD RAID       | 12 TiB                        |
|----------------+-------------------------------|
There are some other services running on the server using negligible processor time.
|------------------+--------------|
| number of files  | ~16,000      |
| total size       | 1.3 TiB      |
| min size         | 0 bytes      |
| max size         | 12 GiB       |
| mean             | 800 MiB      |
| median           | 500 MiB     |
| total datapoints | ~200 billion |
|------------------+--------------|
The total number of datapoints is a very rough estimate.
I'm planning on doing things "right" (i.e. normalizing the data like crazy) and so would have a `runs` table, a `spectra` table with a foreign key to `runs`, and a `datapoints` table with a foreign key to `spectra`.
The 200 Billion datapoint question
I am going to be analyzing across multiple spectra and possibly even multiple runs, resulting in queries which could touch millions of rows. Assuming I index everything properly (which is a topic for another question) and am not trying to shuffle hundreds of MiB across the network, is it remotely plausible for MySQL to handle this?
The scan data will be coming from files in the XML-based mzML format. The meat of this format is in the `<binaryDataArrayList>` elements where the data is stored. Each scan produces >= 2 `<binaryDataArray>` elements which, taken together, form a 2-dimensional (or more) array of the form `[[123.456, 234.567, ...], ...]`.
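Per the mzML specification, each `<binary>` payload is base64 text wrapping little-endian IEEE-754 floats of the declared precision (payloads may additionally be zlib-compressed, which this sketch ignores). A minimal decoder, with made-up sample values:

```python
import base64
import struct

def decode_binary_array(b64_text, precision_bits=64):
    """Decode an mzML <binary> payload: base64 text wrapping
    little-endian IEEE-754 floats (32- or 64-bit)."""
    raw = base64.b64decode(b64_text)
    count = len(raw) // (precision_bits // 8)
    fmt = "<%d%s" % (count, "d" if precision_bits == 64 else "f")
    return list(struct.unpack(fmt, raw))

# Round-trip a small m/z array to sanity-check the decoder.
mz = [123.456, 234.567, 345.678]
encoded = base64.b64encode(struct.pack("<3d", *mz)).decode("ascii")
print(decode_binary_array(encoded, 64))  # → [123.456, 234.567, 345.678]
```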
These data are write-once, so update performance and transaction safety are not concerns.
My naïve plan for a database schema is:
The `runs` table:

| column name | type        |
|-------------+-------------|
| id          | PRIMARY KEY |
| start_time  | TIMESTAMP   |
| name        | VARCHAR     |
|-------------+-------------|
The `spectra` table:

| column name    | type        |
|----------------+-------------|
| id             | PRIMARY KEY |
| name           | VARCHAR     |
| index          | INT         |
| spectrum_type  | INT         |
| representation | INT         |
| run_id         | FOREIGN KEY |
|----------------+-------------|
The `datapoints` table:

| column name | type        |
|-------------+-------------|
| id          | PRIMARY KEY |
| spectrum_id | FOREIGN KEY |
| mz          | DOUBLE      |
| num_counts  | DOUBLE      |
| index       | INT         |
|-------------+-------------|
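As a quick structural sanity check of that schema (SQLite is used here purely because it is easy to run; in MySQL you would pick explicit types such as INT vs. BIGINT and FLOAT vs. DOUBLE, and note that `index` is a reserved word and must be quoted):

```python
import sqlite3

# Structural sketch of the proposed three-table schema.
ddl = """
CREATE TABLE runs (
    id          INTEGER PRIMARY KEY,
    start_time  TIMESTAMP,
    name        VARCHAR(255)
);
CREATE TABLE spectra (
    id              INTEGER PRIMARY KEY,
    name            VARCHAR(255),
    "index"         INTEGER,
    spectrum_type   INTEGER,
    representation  INTEGER,
    run_id          INTEGER REFERENCES runs(id)
);
CREATE TABLE datapoints (
    id          INTEGER PRIMARY KEY,
    spectrum_id INTEGER REFERENCES spectra(id),
    mz          DOUBLE,
    num_counts  DOUBLE,
    "index"     INTEGER
);
"""
conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # → ['datapoints', 'runs', 'spectra']
```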
Is this reasonable?
So, as you may have been able to infer, I am the programmer, not the biologist in the lab, so I don't know the science nearly as well as the actual scientists.
Here's a plot of a single spectrum (scan) of the kind of data with which I'll be dealing:
The goal of the software is to figure out where and how significant the peaks are. We use a proprietary software package to figure this out now, but we want to write our own analysis program (in R) so we know what the heck is going on under the sheets. As you can see, the vast majority of the data are uninteresting, but we don't want to throw out potentially-useful data which our algorithm missed. Once we have a list of probable peaks with which we're satisfied, the rest of the pipeline will use that peak list rather than the raw list of datapoints. I suppose that it would be sufficient to store the raw datapoints as a big blob, so they can be reanalyzed if need be, but keep only the peaks as distinct database entries. In that case, there would be only a couple dozen peaks per spectrum, so the crazy scaling stuff shouldn't be as much of an issue.
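The actual analysis is planned in R, but the "keep only the peaks" idea can be sketched with a deliberately naive local-maxima pass (no smoothing or centroiding; the threshold and the sample signal below are made up):

```python
def find_peaks(intensities, threshold):
    """Indices of local maxima above a noise threshold -- a naive
    stand-in for real peak-picking (no smoothing, no centroiding)."""
    peaks = []
    for i in range(1, len(intensities) - 1):
        y = intensities[i]
        if y > threshold and y > intensities[i - 1] and y >= intensities[i + 1]:
            peaks.append(i)
    return peaks

signal = [0, 1, 0, 0, 9, 2, 0, 0, 5, 0]
print(find_peaks(signal, threshold=3))  # → [4, 8]
```

Only these couple-dozen peak rows per spectrum would become database entries; the raw datapoints stay in a blob or file for reanalysis.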
Since this is raw A/D polling mass spectrometer data, it seems really dumb to store it in the database. I would take my raw data, dump it, process it, and store the processed RESULTS in a database. The results would be (a) waveforms stored one waveform per row, (b) other data associated with those waveforms like calibration curves, and (c) results rows in the database. This would cut billions of rows of bloat from your design. When you want to re-run an initial analysis, you would effectively be editing some parameters, running a giant compute operation, and storing the new results in the db.
I am not very familiar with your needs, but perhaps storing each data point in the database is a bit of overkill. It sounds almost like taking the approach of storing an image library by storing each pixel as a separate record in a relational database.
As a general rule, storing binary data in databases is wrong most of the time; there is usually a better way to solve the problem. While it is not inherently wrong to store binary data in a relational database, the disadvantages often outweigh the gains. Relational databases, as the name alludes to, are best suited for storing relational data. Binary data is not relational: it adds size (often significantly) to databases, can hurt performance, and may lead to questions about maintaining billion-record MySQL instances. The good news is that there are databases especially well suited to storing binary data. One of them, while not always readily apparent, is your file system! Simply come up with a directory and file naming structure for your binary files, and store those paths in your MySQL DB together with any other data which may yield value through querying.
Another approach would be using a document-based storage system for your datapoints (and perhaps spectra) data, and using MySQL for the runs (or perhaps putting the runs into the same DB as the others).
Why is it considered wrong to store binary data in a database? (Asking partially because I am curious but also because I can think of a use case for it.)
If the binary data has no value individually, it should not be stored as a unique row. Pixel 500x325 on an image is irrelevant.
That's a very good point. We should probably keep the raw files around in case we need to pull stuff out again later, but the analogy to storing images is a great one. We won't need access to each datapoint ever (unless we're redoing the peak extraction), so simply storing the extracted statistical info would be much better.
I once worked with a very large (Terabyte+) MySQL database. The largest table we had was literally over a billion rows. This was using MySQL 5.0, so it's possible that things may have improved.
It worked. MySQL processed the data correctly most of the time. It was extremely unwieldy though. (If you want six sigma-level availability with a terabyte of data, don't use MySQL. We were a startup that had no DBA and limited funds.)
Just backing up and storing the data was a challenge. It would take days to restore the table if we needed to.
We had numerous tables in the 10-100 million row range. Any significant joins to the tables were too time consuming and would take forever. So we wrote stored procedures to 'walk' the tables and process joins against ranges of 'id's. In this way we'd process the data 10-100,000 rows at a time (Join against id's 1-100,000 then 100,001-200,000, etc). This was significantly faster than joining against the entire table.
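The range-walk trick above can be sketched like this (sqlite3 stands in for MySQL so the example is runnable; the SQL pattern is the same, whether driven from a stored procedure or client-side):

```python
import sqlite3

# Join in id windows instead of against the entire table at once.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE big_a (id INTEGER PRIMARY KEY, v INTEGER);
CREATE TABLE big_b (id INTEGER PRIMARY KEY, w INTEGER);
""")
conn.executemany("INSERT INTO big_a VALUES (?, ?)",
                 [(i, i * 2) for i in range(1, 1001)])
conn.executemany("INSERT INTO big_b VALUES (?, ?)",
                 [(i, i * 3) for i in range(1, 1001)])

chunk, total = 100, 0
for lo in range(1, 1001, chunk):
    hi = lo + chunk - 1
    rows = conn.execute(
        """SELECT a.v, b.w FROM big_a a JOIN big_b b ON a.id = b.id
           WHERE a.id BETWEEN ? AND ?""", (lo, hi)).fetchall()
    total += len(rows)   # process this window, then move to the next

print(total)  # → 1000
```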
Using indexes on very large tables that aren't based on the primary key is also much more difficult. MySQL 5.0 stores indexes in two pieces -- it stores secondary indexes (anything other than the primary index) as references to the primary key values. So indexed lookups are done in two parts: first MySQL goes to the secondary index and pulls from it the primary key values that it needs to find, then it does a second lookup on the primary key index to find where those rows are.
The net of this is that for very large tables (1-200 Million plus rows) indexing against tables is more restrictive. You need fewer, simpler indexes. And doing even simple select statements that are not directly on an index may never come back. Where clauses must hit indexes or forget about it.
But all that being said, things did actually work. We were able to use MySQL with these very large tables and do calculations and get answers that were correct.
Trying to do analysis on 200 billion rows of data would require very high-end hardware and a lot of hand-holding and patience. Just keeping the data backed up in a format that you could restore from would be a significant job.
I agree with srini.venigalla's answer that normalizing the data like crazy may not be a good idea here. Doing joins across multiple tables with that much data will open you up to the risk of filesorts, which could mean some of your queries would just never come back. Denormalizing with simple, integer keys would give you a better chance of success.
Everything we had was InnoDB. Regarding MyISAM vs. InnoDB: The main thing would be to not mix the two. You can't really optimize a server for both because of the way MySQL caches keys and other data. Pick one or the other for all the tables in a server if you can. MyISAM may help with some speed issues, but it may not help with the overall DBA work that needs to be done - which can be a killer.
normalizing the data like crazy
Normalizing the data like crazy may not be the right strategy in this case. Keep your options open by storing the data both in normalized form and in the form of materialized views highly suited to your application. The key in this type of application is NOT writing ad hoc queries. Query modeling is more important than data modeling. Start with your target queries and work towards the optimum data model.
Is this reasonable?
I would also create an additional flat table with all data.
run_id | spectrum_id | data_id | <data table columns..> |
I would use this table as the primary source of all queries. The reason is to avoid having to do any joins. Joins without indexing will make your system unusable, and having indexes on such huge files will be equally terrible.
The strategy is: query the above table first, dump the results into a temp table, and join the temp table with the lookup tables of Run and Spectrum to get the data you want.
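That strategy can be sketched end to end (sqlite3 again stands in for MySQL so the example is runnable; the table and column names follow the flat-table layout above, and the data is made up):

```python
import sqlite3

# Filter the big flat table first, dump hits into a temp table,
# then join the (small) temp table against the lookup tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE flat (run_id INT, spectrum_id INT, data_id INT, mz DOUBLE);
CREATE TABLE runs (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO runs VALUES (1, 'run-a'), (2, 'run-b');
INSERT INTO flat VALUES (1, 10, 100, 500.1), (1, 10, 101, 999.9),
                        (2, 20, 200, 500.2);
CREATE TEMP TABLE hits AS
    SELECT * FROM flat WHERE mz BETWEEN 500 AND 501;
""")
rows = conn.execute("""
    SELECT r.name, h.mz FROM hits h JOIN runs r ON r.id = h.run_id
    ORDER BY h.mz""").fetchall()
print(rows)  # → [('run-a', 500.1), ('run-b', 500.2)]
```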
Have you analyzed your Write needs vs Read needs? It will be very tempting to ditch SQL and go to non-standard data storage mechanisms. In my view, it should be the last resort.
To accelerate write speeds, you may want to try the HandlerSocket method. Percona, if I remember correctly, packages HandlerSocket in their install package. (No relation to Percona!)
The short answer is a qualified yes -- as the number of rows grows, the precise schema, datatypes and operations you choose grow in importance.
How much you normalize your data depends on the operations you plan to perform on the stored data. Your 'datapoints' table in particular seems problematic -- are you planning on comparing the nth point from any given spectra with the mth of any other? If not, storing them separately could be a mistake. If your datapoints do not stand alone but make sense only in the context of their associated spectra you don't need a PRIMARY KEY -- a foreign key to the spectra and an 'nth' column (your 'index' column?) will suffice.
Define the inter- and intra-spectrum operations you must perform and then figure out the cheapest way to accomplish them. If equality is all that's needed they may be denormalized -- possibly with some pre-calculated statistical metadata that assist your operations. If you do absolutely need in-SQL access to individual datapoints ensure you reduce the size of each row to the bare minimum number of fields and the smallest datatype possible.
The largest MySQL database I've ever personally managed was ~100 million rows. At this size you want to keep your rows and thus your fields fixed-size -- this allows MySQL to efficiently calculate the position of any row in the table by multiplying the row number by the fixed row size (think pointer arithmetic) -- though the exact details depend on which storage engine you plan on using. Use MyISAM if you can get away with it; what it lacks in reliability it makes up for in speed, and in your situation it should suffice. Replace variable-size fields such as VARCHAR with CHAR(n) and use RTRIM() on your read queries.
Once your table rows are fixed-width you can reduce the number of bytes by carefully evaluating MySQL's integer datatypes (some of which are non-standard). Every 1-byte savings you can eke out by converting a 4-byte INT into a 3-byte MEDIUMINT saves you ~1MB per million rows -- meaning less disk I/O and more effective caching. Use the smallest possible datatypes that you can get away with. Carefully evaluate the floating point types and see if you can replace 8-byte DOUBLEs with 4-byte FLOATs or even <8 byte fixed-point NUMERICs. Run tests to ensure that whatever you pick doesn't bite you later.
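A quick back-of-the-envelope check of the sizing claims above (the sample m/z value is made up):

```python
import struct

# Each byte shaved off a column saves ~1 MB per million rows.
rows = 1_000_000
print((4 - 3) * rows / 1e6, "MB saved per million rows (INT -> MEDIUMINT)")

# FLOAT halves DOUBLE's storage (4 vs 8 bytes) at the cost of precision:
# round-trip a typical m/z value through a 4-byte float and inspect the error.
mz = 1234.56789
as_float = struct.unpack("<f", struct.pack("<f", mz))[0]
print(abs(as_float - mz))  # small in absolute terms, but nonzero
```

Whether that error is acceptable depends entirely on the instrument's actual resolution, which is exactly why the answer says to run tests before committing to a datatype.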
Depending on the expected properties of your dataset and the operations required there may be further savings in more unusual encodings of your values (expected patterns/repetitions that can be encoded as an index into a set of values, raw data that may only meaningfully contribute to metadata and be discarded, etc) -- though exotic, unintuitive, destructive optimizations are only worthwhile when every other option has been tried.
Most importantly, no matter what you end up doing, do not assume you have picked the perfect schema and then blindly begin dumping tens of millions of records in. Good designs take time to evolve. Create a large but manageable (say, 1-5%) set of test data and verify the correctness and performance of your schema. See how different operations perform (http://dev.mysql.com/doc/refman/5.0/en/using-explain.html) and ensure that you balance your schema to favor the most frequent operations.
Did I say short? Whoops. Anyways, good luck!
It would seem that the only reason to shred the data point data out of the XML (as opposed to the metadata like the time and type of run) and into a database form is when you are analyzing the spectra across arrays - i.e. perhaps finding all runs with a certain signature. Only you know your problem domain right now, but this could be akin to storing music sampled at 96kHz with 1 sample per row. I'm not sure size is the issue more than how the data is used. Querying across the data would be equivalent to asking the relative amplitude 2 minutes into the song across all songs by The Beatles. If you know the kind of analyses which might be performed, it's quite possible that performing these on the signals and storing those in the metadata about the run might make more sense.
I'm also not sure if your source data is sparse. It's completely possible that a spectrum in the database should only include non-zero entries while the original XML does include zero-entries, and so your total number of rows could be much less than in the source data.
So, like many questions, before asking about MySQL handling your model, stepping back and looking at the model and how it is going to be used is probably more appropriate than worrying about performance just yet.
After reviewing your question updates, I think a model where the binary data is stored as a BLOB, or just as a pointer to the file, is sufficient; then work on modifying your model to store data about the significant peaks which have been identified when the data is first read.
I run a web analytics service with about 50 database servers, each one containing many tables over 100 million rows, and several that tend to be over a billion rows, sometimes up to two billion (on each server).
The performance here is fine. It is very normalized data. However, my main concern on reading this is that you'll be well over the 4.2 billion row mark for these tables (maybe not "runs", but probably the other two), which means you'll need to use BIGINT instead of INT for the primary/foreign keys.
MySQL performance with BIGINT fields in an indexed column is ridiculously horrible compared to INT. I made the mistake of doing this once with a table I thought might grow over this size, and once it hit a few hundred million rows the performance was simply abysmal. I don't have raw numbers but when I say bad, I mean Windows ME bad.
This column was the primary key. We converted it back to be just an INT and presto magico, the performance was good again.
All of our servers at the time were on Debian 5 and with MySQL 5.0. We have since upgraded to Debian 6 and Percona MySQL 5.5, so things may have improved since then. But based on my experience here, no, I don't think it will work very well.
Whether or not it works, you're always going to run into the same problem with a single monolithic storage medium: disks are slow. At 100 MB/s (pretty good for spinning media) it takes 3 hours just to read a 1TB table; that's assuming no analysis or seeking or other delays slow you down.
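The arithmetic behind that figure:

```python
# Sequential read time for a 1 TB table at 100 MB/s,
# ignoring seeks, analysis, and other delays.
table_bytes = 1e12        # 1 TB (decimal)
throughput = 100e6        # 100 MB/s
hours = table_bytes / throughput / 3600
print(round(hours, 1))  # → 2.8
```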
This is why very nearly every "big data" installation uses some sort of distributed data store. You can spend 8 times as much money building one super amazing computer to run your DB, but if you have a lot of data that can be scanned in parallel, you're almost always better off distributing the load across the 8 cheaper computers.
Projects like Hadoop were built specifically for purposes like this. You build a cluster of a whole bunch of inexpensive computers, distribute the data across all of them, and query them in parallel. It's just one of half a dozen solutions all built around this same idea, but it's a very popular one.
Hm... I see only two reasons why you would choose this kind of data structure:
- you really need to do any datapoint vs any datapoint queries
- you intend to perform all your logic in SQL
Now, I would suggest taking a long, hard look into your requirements and verifying that at least one of the above assumptions is true. If neither is true, you are just making things slower. For this kind of dataset, I would suggest first finding out how the data is expected to be accessed, what kind of accuracy you will need, etc. -- and then designing your database around those requirements.
P.S.: Keep in mind that you will need at least 36+5 bytes per data point, so with 200B datapoints that should give you at least 8.2 TB required space.
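Reproducing that estimate (the 36+5 bytes per row is the answerer's own overhead figure):

```python
# ~41 bytes per datapoint row times 200 billion rows.
bytes_per_row = 36 + 5
rows = 200e9
tb = bytes_per_row * rows / 1e12
print(round(tb, 1))  # → 8.2 (TB)
```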
P.P.S.: You don't need the `id` column in the `datapoints` table; `PRIMARY KEY (spectrum_id, index)` probably suffices (just beware that `index` may be a reserved word).
DO NOT DO THIS IN MYSQL WITH DATA STORED ON A SINGLE DISK. Just reading that amount of data from a single medium will take hours. You need to SCALE OUT, NOT UP.
And you need to denormalize your data if you want to do effective data analysis. You are not designing an online system here. You want to crunch numbers; design accordingly.
Original answer below the line.
The answer will vary depending on your queries, MySQL may not be the best tool for this job. You may want to look at solution you can scale "out" and not "up". If you are willing to put in some effort maybe you should look on a Map Reduce solution such as Hadoop.
So, the solution will depend on if this is a one-shot thing and if you want to reasonably support ad hoc queries.
No one has mentioned sharding yet, thus my suggestion: take a look at massively sharded MySQL solutions. For example, see this highly regarded Tumblr presentation.
The concept is:
- Instead of one extra large database
- Use many small ones holding parts of the original data
However, there will be troubles if you need to run queries over different shards.
If anyone is interested, I made a hello-world sharding application a while ago. It is discussed here in a blog post. I used RavenDB and C#, but the details are irrelevant and the idea is the same.