How to quickly search through a very large list of strings / records on a database
I have the following problem: I have a database containing more than 2 million records. Each record has a string field X and I want to display a list of records for which field X contains a certain string. Each record is about 500 bytes in size.
To make it more concrete: in the GUI of my application I have a text field where I can enter a string. Above the text field I have a table displaying the (first N, e.g. 100) records that match the string in the text field. When I type or delete one character in the text field, the table content must be updated on the fly.
I wonder if there is an efficient way of doing this using appropriate index structures and / or caching. As explained above, I only want to display the first N items that match the query. Therefore, for N small enough, it should not be a big issue loading the matching items from the database. Besides, caching items in main memory can make retrieval faster.
I think the main problem is how to find the matching items quickly, given the pattern string. Can I rely on some DBMS facilities, or do I have to build some in-memory index myself? Any ideas?
I have run a first experiment. I have split the records into different text files (at most 200 records per file) and put the files in different directories (I used the content of one data field to determine the directory tree). I end up with about 50000 files in about 40000 directories. I have then run Lucene to index the files. Searching for a string with the Lucene demo program is pretty fast. Splitting and indexing took a few minutes: this is totally acceptable for me because it is a static data set that I want to query.
The next step is to integrate Lucene in the main program and use the hits returned by Lucene to load the relevant records into main memory.
2 million records * 500 bytes = 1 GB of data. That's a **lot** of data to search, whichever way you go about it - is each value of X likely to be unique, or will you have many records with the same value of X?
That would also be a **lot** of data to attempt to store in memory as a cache for quick retrieval. That would equate to more than 1GB per user session.
It is a desktop application. Values in the records are not necessarily unique. Also, I am searching for substring not for an exact match.
@maple_shaft: I would only cache the records that I have accessed recently. If I change the query string and a record still matches, it is still in the cache.
@Mark Bannister: I will have possibly many records matching the query string (substring). But I can fix a number N, e.g. 100, and only retrieve the first N records. As the user types in more characters, the result set should contain only a few items, or even one item only.
@Giorgio, you misunderstood my question. I did not ask how many records would match the search criteria; I asked whether each value of X is likely to be unique - ie. how many **distinct** values of X are likely: eg. 2 million or 20?
@Mark Bannister: Ok, I thought I had already answered (see my comment above: values in the records are not necessarily unique). But, I expect each distinct value to be repeated very few times, like less than 100. That would make roughly 20 000 distinct values, but probably more, because on average each distinct value should not occur more than 3, 4 times. I hope this helps.
If your database does not support full text search, there is a way you can do this but it would take me some effort to explain it and it requires coding. If you are interested please let me know.
Instead of putting your data inside the DB, you can keep them as a set of documents (text files) separately and keep the link (path/url etc.) in the DB.
This is essential because, SQL query by design will be very slow both in sub-string search as well as retrieval.
Now, your problem is formulated as, having to search the text files which contains the set of strings. There are two possibilities here.
Sub-string match If your text blobs is a single sting or word (without any white space) and you need to search arbitrary sub-string within it. In such cases you need to parse every file to find best possible files that matches. One uses algorithms like Boyer Moor algorithm. See this and this for details. This is also equivalent to grep - because grep uses similar stuff inside. But you may still make at least 100+ grep (worst case 2 million) before returning.
Indexed search. Here you are assuming that text contains set of words and search is limited to fixed word lengths. In this case, document is indexed over all the possible occurrences of words. This is often called "Full Text search". There are number of algorithms to do this and number of open source projects that can be used directly. Many of them, also support wild card search, approximate search etc. as below :
a. Apache Lucene : http://lucene.apache.org/java/docs/index.html
b. OpenFTS : http://openfts.sourceforge.net/
c. Sphinx http://sphinxsearch.com/
Most likely if you need "fixed words" as queries, the approach two will be very fast and effective.
This is an interesting concept but it seems unlikely that a developer can easily search through 1GB of textual data faster and more efficiently than a database engine. Much smarter people than you and I have laboured over query optimizers to do just that and it is a bit naive to think that you can somehow do that more efficiently.
@maple_shaft The examples i have given are not RDBMS database engines. They are more like "search engines" if you would like to call it. There is a huge conceptual difference between picking up a list out of an index(or hash table) versus searching through 1GB of data all over again every time a query fires. So what i am suggesting is not a minor tweak.
This seems an interesting idea but I wonder how it would work. I would have more than 2 000 000 files, each about half a kilobyte in size. Or are you suggesting to have more than one record per file? What would be the difference wrt a database?
I'm not convinced that this would necessarily perform any better than, say, SQL fulltext index.
@Giorgio - yes that is how full text search engines would work. The key difference here is a pre-indexed pages vs. the in memory search (again for every time a query comes).
@KirkBroadhurst MySQL's Full Text search - (see http://dev.mysql.com/doc/refman/4.1/en/fulltext-search.html) is conceptual same example as i have shown with Lucane or Sphinx. The other answers were discussing the "in-memory indexes" which is quite unnecessary.
@Dipan Mehta: I am trying out Lucene and it seems very easy to use. So I could put the records in text files and use Lucene to search through them. Then I just load into main memory the first N records that have matched. When I have more news I will report.