I really need an expert's help to answer my query.
Here is the scenario:
I'm using an SQL SELECT query to retrieve a million records.
I need to perform sorting and grouping on the resulting records, which I'm storing in a DataTable (in one execution) and then looping through to group and sort.
I know this is so childish and not the right way to process it.
How can I manage the million records effectively and apply the grouping and sorting to them?
I really need help here. I've heard of executing the SELECT query batch-wise, but how do I implement the grouping and sorting when I don't have the entire data in hand?
I cannot use SQL ORDER BY and GROUP BY directly; that is against my requirement.
Here is what I'm doing right now:
I have the following objects, i.e. the column names for grouping and sorting:
List<Group> groupList;
List<Sort> sortList;
DataTable reportData; // Holds the entire record set from the DB
I'm looping through 'reportData' row by row and comparing the current and previous rows for the custom grouping and sorting. I would like to know how the same can be done with batch-wise execution, or whether there is any alternative solution.
I need to perform sorting and grouping on the resulting records which
I'm storing in a DataTable (in one execution) and looping through it
for grouping and sorting it.
What for?
Seriously.
Do not pull the data and then try playing smart with a stupid object model behind it (and DataSets are not particularly smart, sorry).
Group and sort in your SELECT statement, pull the data already grouped and sorted, and be done with it.
A million records was a small amount of data for SQL Server when the original version was released (4.2 it was, a port of Sybase SQL Server) 17 years or so ago. These days it is something that likely fits into the processor's third-level cache, and is nothing a proper SQL Server even realizes it has just processed.
SQL is particularly good at doing projections, and ever since they introduced MARS you can even run multiple queries over one connection, which comes in handy here.
So, go back, throw away the DataSet and the "I'll try to program a sort algorithm" approach, and write proper SQL statements to pull the data as you need it.
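For instance, a minimal sketch of letting the server do the grouping and sorting (the table and column names Orders, Region, and Amount are illustrative assumptions, not taken from the question):

using System;
using System.Data.SqlClient;

class GroupedReport
{
    static void Main()
    {
        // MultipleActiveResultSets=True enables MARS, so several readers can share one connection.
        var connStr = "Server=.;Database=Reports;Integrated Security=true;MultipleActiveResultSets=True";
        const string sql = @"SELECT Region, COUNT(*) AS OrderCount, SUM(Amount) AS Total
                             FROM Orders
                             GROUP BY Region
                             ORDER BY Region";

        using (var conn = new SqlConnection(connStr))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Rows arrive already grouped and sorted; no client-side pass over a million rows.
                    Console.WriteLine($"{reader.GetString(0)}: {reader.GetInt32(1)} orders, {reader.GetDecimal(2)} total");
                }
            }
        }
    }
}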
Sounds like you should implement Partition Pruning. Partitioning will allow for a separation of content like you are requesting in order to have faster queries.
If I understood correctly, in your case, I would create a temporary database table with the structure I want especially to cover my grouping.
Then I would select the records from the main tables and insert them into the temporary one, applying all modifications including grouping.
An index matching how you want them sorted should also be applied.
After that, just select from this table, do what you have to do, and finally if the data are not needed any more, delete the temporary table.
I would choose the above solution because a million records in memory smells like trouble to me...
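A rough sketch of that idea against SQL Server (the table and column names SourceDocuments, DocumentTypeID, and DocumentName are assumptions for illustration):

using System.Data.SqlClient;

class TempTableReport
{
    static void Run(string connStr)
    {
        const string sql = @"
            SET NOCOUNT ON;

            SELECT DocumentTypeID, DocumentName, COUNT(*) AS DocCount
            INTO #GroupedDocs                           -- temporary table, dropped with the connection
            FROM SourceDocuments
            GROUP BY DocumentTypeID, DocumentName;

            CREATE INDEX IX_GroupedDocs_Sort ON #GroupedDocs (DocumentName);  -- index for the sort order

            SELECT DocumentTypeID, DocumentName, DocCount
            FROM #GroupedDocs
            ORDER BY DocumentName;";

        using (var conn = new SqlConnection(connStr))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Consume the already grouped and sorted rows here.
                }
            }
        }
    }
}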
For example (these snippets assume reportData has been projected into an IEnumerable of document objects):
1. Let's assume that you would like to group them by their DocumentTypeID
var groupByType = reportData.GroupBy(g=>g.DocumentTypeID);
2. Sorting Alphabetically
var sortAlphabetically = reportData.OrderBy(g=>g.DocumentName);
3. Grouping and Sorting
var groupAndSort = reportData.GroupBy(g=>g.DocumentTypeID)
                             .Select(grp=>grp.OrderBy(d=>d.DocumentName));
4. Sort and Group
var groupAndSort = reportData.OrderBy(g=>g.DocumentName)
.GroupBy(g=>g.DocumentTypeID);
5. Multiple Grouping and sorting
var multipleGroupAndSort = reportData.GroupBy(g=>new { g.DocumentTypeID, g.CreatedOnDate.Month })
                                     .Select(grp=>grp.OrderBy(d=>d.DocumentName));
so on and so forth...
But I would still discourage bringing a million rows into the application. It will cost memory. There are, of course, ways to manage it through stored procedures etc.
This may be a dumb question, but I wanted to be sure. I am creating a WinForms app, and using a C# OleDbConnection to connect to an MS Access database. Right now, I am using a "SELECT * FROM table_name" and looping through each row to see if it is the row with the criteria I want, then breaking out of the loop if it is. I wonder if the performance would be improved if I used something like "SELECT * FROM table_name WHERE id=something", so basically using a "WHERE" clause instead of looping through every row?
The best way to validate the performance of anything is to test. Otherwise, a lot of assumptions are made about what is the best versus the reality of performance.
With that said, 100% of the time using a WHERE clause will be better than retrieving the data and then filtering via a loop. This is for a few different reasons, but ultimately you are filtering the rows on the server before retrieving them, versus retrieving all of the rows and then filtering them out yourself. Relational data should be dealt with according to set logic, which is how a WHERE clause works: against the data set. The loop is not set logic; it compares each individual row, expensively, discarding those that don't meet the criteria.
Don’t take my word for it though. Try it out. Especially try it out when your app has a lot of data in the table.
Yes, of course.
If you have an Access database file, say shared on a folder, and you deploy your .NET desktop application to each workstation?
And furthermore, say the table has 1 million rows.
If you do this:
SELECT * from tblInvoice WHERE InvoiceNumber = 123245
Then ONLY one row is pulled down the network pipe - and this holds true EVEN if the table has 1 million rows. To traverse and pull 1 million rows is going to take a HUGE amount of time, but if you add criteria to your select, then it would be in this case about 1 million times faster to pull one row as opposed to the whole table.
And say this is/was multi-user? Then again, even on a network, ONLY ONE record that meets your criteria will be pulled. The only requirement for this "one row pull" over the network? The Access data engine needs a usable index on that criterion. Of course, by default the PK column (ID) always has that index, so no worries there. But if, as per above, we are pulling invoice numbers from a table, then having an index on that column (InvoiceNumber) is required for the data engine to pull only one row. If no index can be used, then all rows are pulled behind the scenes until a match occurs, and over a network this means significant amounts of data will be pulled without that index (or if local, then pulled from the file on the disk).
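A minimal sketch of what that parameterised pull might look like from the .NET side (the connection string, OleDb provider, and column types are assumptions; the point is that the WHERE clause, backed by the index on InvoiceNumber, travels to the data engine):

using System.Data;
using System.Data.OleDb;

class InvoiceLookup
{
    static DataRow FindInvoice(string dbPath, int invoiceNumber)
    {
        var connStr = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + dbPath;
        const string sql = "SELECT * FROM tblInvoice WHERE InvoiceNumber = ?";   // OleDb uses positional parameters

        using (var conn = new OleDbConnection(connStr))
        using (var cmd = new OleDbCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@InvoiceNumber", invoiceNumber);
            conn.Open();

            var table = new DataTable();
            new OleDbDataAdapter(cmd).Fill(table);   // only the matching row(s) cross the network
            return table.Rows.Count > 0 ? table.Rows[0] : null;
        }
    }
}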
Problem summary:
C# (MVC), entity framework 5.0 and Oracle.
I have a couple of million rows in a view which joins two tables.
I need to populate dropdownlists with filter possibilities.
The options in these dropdownlists should reflect the actual contents
of the view for that column, distinct.
I want to update the dropdownlists whenever you select something, so
that the new options reflect the filtered content, preventing you
from choosing something that would give 0 results.
It's slow.
Question: what's the right way of getting these dropdownlists populated?
Now for more detail.
-- Goal of the page --
The user is presented with some dropdownlists that filter the data in a grid below. The grid represents a view (see "Database") where the results are filtered.
Each dropdownlist represents a filter for a column of the view. Once something is selected, the rest of the page updates. The other dropdownlists now contain the possible values for their corresponding columns that comply with the filter that was just applied in the first dropdownlist.
Once the user has selected a couple of filters, he/she presses the search button and the grid below the dropdownlists updates.
-- Database --
I have a view that selects almost all columns from two tables, nothing fancy there. Like this:
SELECT tbl1.blabla, tbl2.blabla etc etc
FROM table1 tbl1, table2 tbl2
WHERE tbl1.bvz_id = tbl2.id AND tbl1.einddatum IS NULL;
There is a total of 22 columns: 13 VARCHARs (mostly small, 1 - 20, though one of them has a size of 2000!), 6 DATEs and 3 NUMBERs (one of them size 38 and one of them 15,2).
There are a couple of indexes on the tables, including the relevant IDs for the WHERE clause.
Important thing to know: I cannot change the database. Maybe set an index here and there, but nothing major.
-- Entity Framework --
I created a database-first EDMX in my solution and also mapped the view. There are also classes for both tables, but I need data from both of them, so I don't know if I need those. The problem with selecting things from either table alone would be that you can't apply half of the filtering, but maybe there are smart ways I didn't think of yet.
-- View --
My view is strongly bound to a viewmodel. In there I have an IEnumerable for each dropdownlist. The getter for each of these gets its data from a single IEnumerable called NameOfViewObjects. Like this:
public string SelectedColumn1 { get; set; }

private IEnumerable<SelectListItem> column1Options;
public IEnumerable<SelectListItem> Column1Options
{
    get
    {
        if (column1Options == null)
        {
            column1Options = NameOfViewObjects.Select(item => item.Column1).Distinct()
                .Select(item => new SelectListItem
                {
                    Value = item,
                    Text = item,
                    Selected = item.Equals(SelectedColumn1, StringComparison.InvariantCultureIgnoreCase)
                });
        }
        return column1Options;
    }
}
The two solutions I've tried are:
- 1 -
Selecting all the columns I need for the dropdownlists in a LINQ query (the 2000-character varchar is not one of them and there are only 2 date columns), doing a distinct on them and putting the results into a HashSet. Then I set NameOfViewObjects to point towards this HashSet. I have to wait about 2 minutes for that to complete, but after that, populating the dropdownlists is almost instant (maybe a second for each of them).
model.Beslissingen = new HashSet<NameOfViewObject>(dbBes.NameOfViewObject
.DistinctBy(item => new
{
item.VarcharColumn1,
item.DateColumn1,
item.DateColumn2,
item.VarcharColumn2,
item.VarcharColumn3,
item.VarcharColumn4,
item.VarcharColumn5,
item.VarcharColumn6,
item.VarcharColumn7,
item.VarcharColumn8
}
)
);
The big problem here is that the NameOfViewObject object is probably quite large, and even though using distinct here results in fewer than 100,000 rows, it still uses over 500 MB of memory. This is unacceptable, because there will be a lot of users using this screen (a lot would be... 10 max, 5 on average simultaneously).
- 2 -
The other solution is to use the same LINQ query and point NameOfViewObjects towards the IQueryable it produces. This means that every time the view wants to bind a dropdownlist to an IEnumerable, it will fire a query that finds the distinct values for that column in a table with millions of rows, where most likely the column it's getting the values from is not indexed. This takes around 1 minute for each dropdownlist (I have 10), so that takes ages.
Don't forget: I need to update the dropdownlists every time one of them has its selection changed.
-- Question --
So I'm probably going at this the wrong way, or maybe one of these solutions should be combined with indexing all of the columns I use. Maybe I should use another way to store the data in memory, so it only uses a little, but there must be someone out there who has done this before and figured out something smart. Can you please tell me what would be the best way to handle a situation like this?
Acceptable performance:
- having to wait for a while (2 minutes) while the page loads, but everything is fast after that
- having to wait for a couple of seconds every time a dropdownlist changes
- the page does not use more than 500 MB of memory
Of course you should have indexes on all columns and combinations in WHERE clauses. No index means table scan and O(N) query times. Those cannot scale under any circumstance.
You do not need millions of entries in a drop down. You need to be smarter about filtering the database down to manageable numbers of entries.
I'd take a page from Google. Their type ahead helps narrow down the entire Internet graph into groups of 25 or 50 per page, with the most likely at the top. Maybe you could manage that, too.
Perhaps a better answer is something like a search engine. If you were a Java developer you might try Lucene/SOLR and indexing. I don't know what the .NET equivalent is.
The first thing you need to check is your DB: make sure you have the right indexes and entity relations in place.
Next, if you want to dynamically build your filter options, then you need to run the query with the existing filters to obtain what the next filter can be. There are several ways to do this.
Firstly, you can query the data and extract the values from what is returned. This has a huge load time and wastes time returning data you don't want (unless you are live-updating the results with the filter and don't have paging, in which case you might as well just get all the data and use LINQ to Objects to filter).
A second option is to have a parallel query for each filter that returns the possible filter values: filter A = all possible values of A from the data, filter B = all possible values of B when filtered by A in the data, C = all possible values of C when filtered by A & B in the data, etc. This is better than the first, but not by much.
Another option is to use aggregates to speed things up, i.e. you have a parallel query as above, but instead of returning the data you return how many records would be returned. Aggregate functions are always quicker, so this will cut your load time dramatically, but you are still repeatedly querying a huge dataset, so it won't be exactly nippy.
You can tweak this further using EXISTS to just return a 0 or 1 (a rough sketch follows this list).
In this case you would look at a table with all possible filters and then remove the ones with no values according to the parallel query.
The next option, which will be the fastest by a mile, is to cache the filters in the DB in a separate table.
Then you can query that and say: from Cache where filter = ABC select D. The problem with this is maintaining the cache, which you would have to do in the DB as part of the save functions, triggers, etc.
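As a rough illustration of the aggregate/EXISTS idea against an EF context (the entity and property names ViewRows, ColumnA, ColumnB, and the helper that lists candidate values are assumptions for the sketch):

// All values filter B could ever take (small, known up front, e.g. from a lookup table).
IEnumerable<string> allPossibleValuesOfB = GetAllPossibleValuesOfB();

// Keep only the options that would still return rows under the current filter on A.
// Any() translates to an EXISTS query, which is cheaper than pulling the rows themselves.
var optionsForB = allPossibleValuesOfB
    .Where(b => db.ViewRows.Any(r => r.ColumnA == selectedA && r.ColumnB == b))
    .ToList();

// Count() works the same way if you also want to show how many rows each option would yield.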
Another solution that can be added in addition to the previous suggestions is to use the /*+ result_cache */ hint, if your version of Oracle supports it (Oracle version 11g or later). If the output of the query is small enough for a drop-down list, then when a user enters criteria that matches the same criteria another user used, the results are returned in a few milliseconds instead of a few seconds or minutes. Result cache is wonderful for queries that return a small set of rows out of millions.
select /*+ result_cache */ item_desc from some_table where item_id ...
The result cache is automatically flushed when any insert/updates/deletes occur on the database tables.
I've done something 'kind of' similar in the past - if you can add a table to the database then I'd explore introducing a 'scratchpad' type table where results are temporarily stored as the user refines their search. Since multiple users could be working simultaneously the table would have to have an additional column for identifying the user.
I'd think you'd see some performance benefit since all processing is kept server-side and your app would simply be pulling data from this table. Since you're adding this table you would also have total control over it.
Essentially I'd imagine the program flow would go something like:
User selects some filters and clicks 'Search'.
Server populates scratchpad table with results from that search.
App populates results grid from scratchpad table.
User further refines search and clicks 'Search'.
Server removes/adds rows to scratchpad table as necessary.
App populates results grid from scratchpad table.
And so on.
Rather than having all the users results in one 'scratchpad' table you could possibly explore having temporary 'scratchpad' tables per user.
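A rough sketch of what populating such a scratchpad could look like with EF's raw-SQL support (the table, column, and context names SEARCH_SCRATCHPAD, USER_ID, MY_VIEW, and MyDbContext are assumptions, not part of the original schema):

// Clear this user's previous results, then store the newly filtered set server-side.
public void RefreshScratchpad(MyDbContext db, string userId, string filterA)
{
    db.Database.ExecuteSqlCommand(
        "DELETE FROM SEARCH_SCRATCHPAD WHERE USER_ID = {0}", userId);

    db.Database.ExecuteSqlCommand(
        @"INSERT INTO SEARCH_SCRATCHPAD (USER_ID, COL1, COL2)
          SELECT {0}, v.COL1, v.COL2
          FROM MY_VIEW v
          WHERE v.COL1 = {1}", userId, filterA);
}

// The grid and the dropdownlists then read only from the much smaller scratchpad, e.g.:
// SELECT DISTINCT COL2 FROM SEARCH_SCRATCHPAD WHERE USER_ID = :user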
Here is the problem,
I have a SQL DB with your regular customer, product, order schema, but HUGE (each table has tens of millions of rows). There is also a large table, order_email (approx. 100 million rows). This table holds all email communication associated with an order. I have implemented a full-text search (Lucene) on top of order_email, which works fine.
Now I want to extend the email search functionality to filter this based on other domain objects, i.e. to answer queries like:
show customers who sent an email with the phrase 'never gonna give you up'
show orders which has an associated email with the phrase 'more ponies'.
The implementation is to do an intersection/join of the Lucene result and a SQL result, but I can't think of a way to do this without running into issues due to the SIZE of the tables and index involved.
My failing approaches:
Brute force. Adding most of my DB columns as Lucene fields. This is equivalent to denormalizing my entire DB and creating a Lucene index (size in terabytes) with all columns as fields. Performance sucks and the cost is prohibitive.
Getting the Lucene result set, getting the OrderIDs from it, and querying the DB like SELECT * FROM Order WHERE OrderID IN (OrderIDs from Lucene). This doesn't work because the email search could yield a million OrderIDs, which makes the SQL query perform poorly, if it runs at all.
Doing the join in application code, by iterating over the SQL result and the Lucene result. This means that, depending on the size of the results, a single query could load two multi-million-row datasets and iterate over them, thrashing CPU and memory.
Thoughts on how I can structure this join/intersection of 2 large datasets?
p.s: first one to suggest hadoop is a rotten egg. Wish I could, but we don't have the budget for more hardware.
Like OzrenTkalcecKrznaric said in the comments to the question, paging is your friend. (Remember that the single most powerful algorithm ever developed is "divide and conquer".)
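A rough sketch of what paging the intersection might look like with Lucene.NET 3.x-style calls (the searcher, connection string, field name "OrderID", and page size are all assumptions, not the poster's code):

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Linq;
using Lucene.Net.Search;

class LucenePagedJoin
{
    static List<DataRow> SearchOrders(IndexSearcher searcher, Query query, string connStr)
    {
        const int pageSize = 1000;
        var allMatches = new List<DataRow>();

        for (int offset = 0; ; offset += pageSize)
        {
            // Ask Lucene for one page of hits at a time instead of materialising millions of ids.
            var hits = searcher.Search(query, offset + pageSize).ScoreDocs
                               .Skip(offset).Take(pageSize).ToList();
            if (hits.Count == 0) break;

            var orderIds = hits.Select(h => searcher.Doc(h.Doc).Get("OrderID"));

            // Join each small page against SQL; an IN list of ~1000 numeric ids stays cheap.
            var sql = "SELECT * FROM [Order] WHERE OrderID IN (" + string.Join(",", orderIds) + ")";
            using (var conn = new SqlConnection(connStr))
            using (var adapter = new SqlDataAdapter(sql, conn))
            {
                var page = new DataTable();
                adapter.Fill(page);
                allMatches.AddRange(page.Rows.Cast<DataRow>());
            }
        }
        return allMatches;
    }
}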
I have a Windows application written in C# that needs to load 250,000 rows from the database and provide a "search as you type" feature, which means that as soon as the user types something in a text box, the application needs to search all 250,000 records (which are, by the way, a single column with 1,000 characters per row) using a LIKE-style search and display the found records.
The approach I followed was:
1- The application loads all the records into a typed List<EmployeesData>
while (objSQLReader.Read())
{
lstEmployees.Add(new EmployeesData(
Convert.ToInt32(objSQLReader.GetString(0)),
objSQLReader.GetString(1),
objSQLReader.GetString(2)));
}
2- In the TextChanged event, using LINQ, I search (with a combination of regular expressions) and attach the IEnumerable<EmployeesData> to a ListView which is in virtual mode.
String strPattern = "(?=.*wood*)(?=.*james*)";
IEnumerable<EmployeesData> lstFoundItems = from objEmployee in lstEmployees
where Regex.IsMatch(objEmployee.SearchStr, strPattern, RegexOptions.IgnoreCase)
select objEmployee;
lstFoundEmployees = lstFoundItems;
3- The RetrieveVirtualItem event is handled to display the items in the ListView.
e.Item = new ListViewItem(new String[] {
lstFoundEmployees.ElementAt(e.ItemIndex).DateProjectTaskClient,
e.ItemIndex.ToString() });
Though lstEmployees loads relatively fast (1.5 seconds) from SQL Server, the search on TextChanged takes more than 7 minutes using LINQ. Searching through SQL Server directly by performing a LIKE search takes less than 7 seconds.
What am I doing wrong here? How can I make this search faster (no more than 2 seconds)? This is a requirement from my client. So, any help is highly appreciated. Please help...
Does the database column that stores the text data have an index on it? If so, something similar to the trie structure that Nicholas described is already in use. Indexes in SQL Server are implemented using B+ trees, which have an average search time on the order of log base 2 of n, where n is the number of records in the tree. This means that if you have 250,000 records in the table, the number of operations required to search is log base 2 of 250,000, or approximately 18 operations.
When you load all of the information into a data reader and then use a LINQ expression, it's a linear operation, O(n), where n is the length of the list. So worst case, it's going to be 250,000 operations. If you use a DataView there will be indexes that can be used to help with searching, which will drastically improve performance.
At the end of the day, if there will not be too many requests submitted against the database server, leverage the query optimizer to do this. As long as the LIKE operation isn't performed with a wildcard at the front of the string (i.e. LIKE '%some_string', which negates the use of an index) and there is an index on the table, you will have really fast performance. If there are just too many requests that will be submitted to the database server, either put all of the information into a DataView so an index can be used, or use a dictionary as Tim suggested above, which has a search time of O(1) (on the order of one), assuming the dictionary is implemented using a hash table.
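A small sketch of the DataView idea (the column name SearchStr and the table variable employeesTable are assumptions; note that a leading wildcard in the filter would again defeat the index):

// Sorting on the column builds an internal index the DataView can use.
var view = new DataView(employeesTable) { Sort = "SearchStr ASC" };

// RowFilter supports LIKE-style expressions; a trailing wildcard keeps the lookup index-friendly.
view.RowFilter = "SearchStr LIKE 'james%'";

foreach (DataRowView row in view)
{
    // bind or display the matching rows here
}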
You'd be wanting to preload things and build yourself a data structure called a trie.
It's memory-intensive, but it's what the doctor ordered in this case.
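A bare-bones sketch of such a trie (an illustration, not the answerer's implementation): each prefix maps straight to the indices of matching records, so the TextChanged handler only walks a dictionary instead of scanning 250,000 strings.

using System.Collections.Generic;

class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public List<int> RecordIndices = new List<int>();   // records reachable through this prefix
}

class Trie
{
    private readonly TrieNode root = new TrieNode();

    public void Add(string text, int recordIndex)
    {
        var node = root;
        foreach (var ch in text.ToLowerInvariant())
        {
            TrieNode next;
            if (!node.Children.TryGetValue(ch, out next))
                node.Children[ch] = next = new TrieNode();
            node = next;
            node.RecordIndices.Add(recordIndex);
        }
    }

    public List<int> Search(string prefix)
    {
        var node = root;
        foreach (var ch in prefix.ToLowerInvariant())
            if (!node.Children.TryGetValue(ch, out node))
                return new List<int>();                  // no record starts with this prefix
        return node.RecordIndices;
    }
}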
See my answer to this question. If you need instant response (i.e. as fast as a user types), loading the data into memory can be a very attractive option. It may use a bit of memory, but it is very fast.
Even though there are many characters (250K records * 1000), how many unique values are there? An in-memory structure based off of keys with pointers to records matching those keys really doesn't have to be that big, even accounting for permutations of those keys.
If the data truly won't fit into memory or changes frequently, keep it in the database and use SQL Server Full-Text Indexing, which will handle searches such as this much better than a LIKE. This assumes a fast connection from the application to the database.
Full-Text Indexing offers a powerful set of operators/expressions which can be used to make searches more intelligent. It's available with the free SQL Server Express Edition, which will handle up to 10 GB of data.
If the records can be sorted, you may want to go with a binary search, which is much, much faster for large data sets. There are several implementations in .NET collections, like List<T> and Array.
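A minimal sketch of the binary-search idea (assuming the EmployeesData.SearchStr values from the question, and that the lookup is an exact or prefix match rather than a substring search):

// Sort once up front, using the same comparer the search will use.
var sorted = lstEmployees.Select(e => e.SearchStr)
                         .OrderBy(s => s, StringComparer.OrdinalIgnoreCase)
                         .ToList();

// BinarySearch returns the index of a match, or the bitwise complement of the insertion point.
int idx = sorted.BinarySearch("james wood", StringComparer.OrdinalIgnoreCase);
bool found = idx >= 0;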
I have a page with 26 sections, one for each letter of the alphabet. I'm retrieving a list of manufacturers from the database and, for each one, creating a link using a different field in the database. So currently, I leave the connection open, then do a new SELECT for each letter, WHERE the Name is LIKE that letter. It's very slow, though.
What's a better way to do this?
TIA
Since you are going to fetch them all anyway, you might find it faster to fetch them in one go and split them into letter-groups in the code.
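Something like this, for instance (the manufacturer type, its properties, and RenderLink are assumptions for the sketch):

var byLetter = allManufacturers
    .GroupBy(m => char.ToUpperInvariant(m.Name[0]))   // first letter of the name
    .OrderBy(g => g.Key);

foreach (var group in byLetter)
{
    // emit the section heading for group.Key, then one link per manufacturer
    foreach (var m in group.OrderBy(x => x.Name))
        RenderLink(m.Name, m.LinkField);               // RenderLink and LinkField are placeholders
}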
Looking at it from the other end, why do you need to fetch all the lists just to build a set of links? Shouldn't you fetch a single letter when its link is clicked?
It sounds like you are doing up to 26 queries, which will never be fast. Often a single db query can take at least 40 ms, due to network latency, establishing connection, etc. So, doing this 26 times means that it will take around 40 x 26 ms, or more than one second. Of course, it can take much longer depending on your schema, data set, hardware, etc., but this is a rule of thumb that gives you a rough idea of the impact of queries on overall page render time.
One way I deal with this kind of situation is to use a DataTable. Fetch all the records into the DataTable, and then you can iterate through the alphabet, and use the Select method to filter.
DataTable myData = GetMyData();
foreach(string letter in lettersOfTheAlphabet)
{
    DataRow[] rows = myData.Select(String.Format("Name LIKE '{0}%'", letter));
    //create your links here from the filtered rows
}
Depending on your model layer you may wish to filter in a different way, but this is the basic idea that should improve the performance a lot.
Assuming you are querying to determine which letters are used, so that you know which links to render, you could actually just query for the letters themselves, like this:
select distinct substring(ManufacturerName, 1, 1) as FirstCharacter
from MyTable
order by 1
Get one result set from one query and split that up. There is quite a lot of overhead going out to the database 26 times to do basically the same work!
You could probably do it smarter with a stored procedure. Let the SP return all the information you need in one call, and suddenly you only have one database interaction instead of 26...
Bring back all the items in one set (dataset, etc.), either through a stored procedure or a query, including the field left(col1,1), and sort by that field:
select left(col1,1) as LetterGroup, col1, url_column from table1 order by left(col1,1)
Then loop through the whole result set, changing sections when the letter changes.
The first letter sucks (sorry) as a discriminator. You do not actually need to split them (you could just ask for "where name like 'a%'"), but whatever you run for that gives you on average a 1/26 or so split of the names. Not extremely efficient.
What do you mean by "creating a link - using a different field in the Database"? This sounds like a bad design to me.
There are a couple of ways you can do this: 1) create a view in your DB that has all the manufacturers and their website links, and then continue to hit the view for each letter; 2) select all the manufacturers once, store them in a .NET DataSet, and then use that DataSet to populate your links.
This seems dirty to me, but you could create a first-letter CHAR column and a trigger to populate it. Have the first letter of the manufacturer name stored in that column and index it. Then select * from table where FirstLetter = 'A'.
Or create a lookup table with rows A - Z and set up foreign key in the manufacturer table. Again you would probably need a trigger to update this information. Then you could inner join the lookup table to the manufacturer table.
Then instead of putting 26 datasets in the page, have a list of links (A-Z) which select and show each dataset one at a time.
If I read you right, you're making a query for every manufacturer to get the "different field" you need to construct the link. If so, that's your problem, not the 26 alphabetic queries (though those don't help).
In a case like that, the faster way is this one query:
SELECT manufacturer_name, manufacturer_id, different_field
FROM manufacturers m
INNER JOIN different_field_table d
ON m.manufacturer_id = d.manufacturer_id
ORDER BY manufacturer_name
In your server code, loop through the records as usual. If you want, emit a heading when the first letter of the manufacturer_name changes (a rough sketch follows at the end of this answer).
For additional speed:
Put that in a stored procedure.
Index different_field_table on manufacturer_id.
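For instance, the heading-on-letter-change loop could look roughly like this (WriteHeading and WriteLink are placeholders for however you render markup, and cmd is assumed to run the query above):

string currentLetter = null;
using (var reader = cmd.ExecuteReader())
{
    while (reader.Read())
    {
        var name = reader.GetString(0);                      // manufacturer_name
        var letter = name.Substring(0, 1).ToUpperInvariant();

        if (letter != currentLetter)                         // first letter changed: start a new section
        {
            currentLetter = letter;
            WriteHeading(letter);
        }
        WriteLink(name, reader["different_field"]);          // link built from the joined field
    }
}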