I have a problem which I cannot seem to get around no matter how hard I try.
This company works in market analysis, and have pretty large tables (300K - 1M rows) and MANY columns (think 250-300) which we do some calculations on.
IÂ´ll try to get straight to the problem:
The problem is the filtering of the data. All databases IÂ´ve tried so far are way too slow to select data and return it.
At the moment I am storing the entire table in memory and filtering using dynamic LINQ.
However, while this is quite fast (about 100 ms to filter 250 000 rows) I need better results than this...
Is there any way I can change something in my code (not the data model) which could speed the filtering up?
I have tried using:
DataTable.Select which is slow. Dynamic LINQ which is better, but still too slow. Normal LINQ (just for testing purposes) which almost is good enough. Fetching from MySQL and do the processing later on which is badass slow.
At the beginning of this project we thought that some high-performance database would be able to handle this, but I tried:
H2 (IKVM) HSQLDB (compiled ODBC-driver) CubeSQL MySQL SQL SQLite ...
And they are all very slow to interface .NET and get results from.
I have also tried splitting the data into chunks and combining them later in runtime to make the total amount of data which needs filtering smaller.
Is there any way in this universe I can make this faster?
Thanks in advance!
I just want to add that I have not created this database in question.
To add some figures, if I do a simple select of 2 field in the database query window (SQLyog) like this (visit_munic_name is indexed):
SELECT key1, key2 FROM table1 WHERE filter1 = filterValue1
It takes 125 milliseconds on 225639 rows.
Why is it so slow? I have tested 2 different boxes.
Of course they must change someting, obviously?
You do not explain what exactly you want to do, or why filtering a lot of rows is important. Why should it matter how fast you can filter 1M rows to get an aggregate if your database can precalculate that aggregate for you? In any case it seems you are using the wrong tools for the job.
On one hand, 1M rows is a small number of rows for most databases. As long as you have the proper indexes, querying shouldn't be a big problem. I suspect that either you do not have indexes on your query columns or you want to perform ad-hoc queries on non-indexed columns.
Furthermore, it doesn't matter which database you use if your data schema is wrong for the job. Analytical applications typically use star schemas to allow much faster queries for a lot more data than you describe.
All databases used for analysis purposes use special data structures which require that you transform your data to a form they like. For typical relational databases you have to create star schemas that are combined with cubes to precalculate aggregates. Column databases store data in a columnar format usually combined with compression to achieve fast analytical queries, but they require that you learn to query them in their own language, which may be very different than the SQL language most people are accustomed to.
On the other hand, the way you query (LINQ or DataTable.Select or whatever) has minimal effect on performance. Picking the proper data structure is much more important.
For instance, using a Dictionary<> is much faster than using any of the techniques you mentioned. A dictionary essentially checks for single values in memory. Executing DataTable.Select without indexes, using LINQ to Datasets or to Objects is essentially the same as scanning all entries of an array or a List<> for a specific value,because that is what all these methods do - scan an entire list sequentially.
The various LINQ providers do not do the job of a database. They do not optimize your queries. They just execute what you tell them to execute. Even doing a binary search on a sorted list is faster than using the generic LINQ providers.
There are various things you can try, depending on what you need to do:
If you want a more repeatable process you can either create the appropriate star schemas in a relational database or use a columnar database. In either case you will have to write the scripts to load your data in the proper structures.
If you are creating your own application you really need to investigate the various algorithms and structures used by other similar tools either for in memory.