How do you efficiently implement cursor-based pagination with EF? Traditionally, Skip and Take are the common way to do it, but for scenarios where data is added and removed frequently, offset-based pagination is not the best way to go.
To put things in context, suppose you need to page through a huge list of products: you can store the last product id and add a where clause asking for ids greater than (or less than) the stored value. Things get complicated when you need to offer sorting by criteria like price or date added, where many items can have equal values; then a plain greater-than or less-than comparison is no longer enough.
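To be concrete, the id-only cursor I described looks roughly like this (a sketch; the Product entity, lastProductId and pageSize are placeholders):

// Keyset/cursor paging on the primary key alone; lastProductId comes from the previous page.
var page = context.Products
    .Where(p => p.Id > lastProductId)
    .OrderBy(p => p.Id)
    .Take(pageSize)
    .ToList();

This works while the sort key is unique, which is exactly what breaks down for columns like price.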
LINQ has SkipWhile and TakeWhile, but these work over in-memory objects, not over SQL; still, I could go that route if a decent solution comes to mind or via a smart answer/comment. I am trying to implement GraphQL pagination as per Relay.js.
Thanks in advance
Related
I really need an expert's help to answer my query.
Here is the scenario:
I'm using an SQL SELECT query to retrieve a million records.
I need to perform sorting and grouping on the resulting records, which I'm storing in a DataTable (in one execution)
and then looping through for grouping and sorting.
I know this is naive and not the right way to process it.
How can I manage the million records effectively and apply the grouping and sorting to them?
I really need help here. I've heard of executing the SELECT query batch-wise, but how do I implement the grouping and sorting when I don't have the entire data set in hand?
I cannot use SQL ORDER BY and GROUP BY directly; that's against my requirements.
Here is what I'm doing right now:
I have the following objects, i.e. the column names for grouping and sorting:
List<Group> groupList;   // column names to group by
List<Sort> sortList;     // column names to sort by
DataTable reportData;    // holds the entire record set from the DB
I'm looping through 'reportData' row by row and comparing the current and previous rows to do the custom grouping and sorting. I would like to know how the same can be done with batch-wise execution, or whether there is an alternative solution.
I need to perform sorting and grouping on the resultant records which
I'm storing in a DataTable (in one execution) and looping through it
for grouping and sorting.
What for?
Seriously.
Do not pull the data and then try playing smart with a stupid object model behind it (and DataSets are not particularly smart, sorry).
Group and sort in your SELECT statement, pull the data already grouped and sorted, and be done with it.
A million records was a small amount of data for SQL Server when the original version was released (4.2, a port of Sybase SQL Server) some 17 years ago. These days it likely fits into a processor's third-level cache and is nothing a proper SQL server even notices it has processed.
SQL is particularly good at doing projections, and ever since they introduced MARS you can even run multiple queries over one connection, which comes in handy here.
So, go back, throw away the DataSet and the "I'll try to program a sort algorithm myself" approach, and write proper SQL statements that pull the data as you need it.
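As a rough sketch of what that looks like from C# (the table and column names here are invented for illustration), stream the already-grouped result instead of buffering it in a DataTable:

// using System.Data.SqlClient;
// Sketch only: "Reports" and its columns are placeholder names.
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    @"SELECT DocumentTypeID, COUNT(*) AS DocumentCount
      FROM Reports
      GROUP BY DocumentTypeID
      ORDER BY DocumentTypeID", conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // One already-grouped, already-sorted row at a time; nothing is buffered client-side.
            var typeId = reader.GetInt32(0);
            var count = reader.GetInt32(1);
        }
    }
}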
Sounds like you should look at partition pruning. Partitioning separates the content the way you are requesting, so queries touch only the relevant partitions and run faster.
If I understood correctly, in your case I would create a temporary database table with the structure I want, specifically to cover my grouping.
Then I would select the records from the main tables and insert them into the temporary one, applying all modifications, including the grouping.
A specific index matching the desired sort order should also be applied.
After that, just select from this table, do what you have to do, and finally, if the data is not needed any more, delete the temporary table.
I would choose this solution because a million records in memory smells like trouble to me...
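A rough sketch of that flow from C# (the staging table, source table and columns are all made-up names; treat it as illustrative only):

// using System.Data.SqlClient;
using (var conn = new SqlConnection(connectionString))
{
    conn.Open();

    // 1. Build a session-scoped temp table holding the pre-grouped rows.
    new SqlCommand(@"SELECT DocumentTypeID, COUNT(*) AS DocumentCount
                     INTO #GroupedReport
                     FROM Reports
                     GROUP BY DocumentTypeID", conn).ExecuteNonQuery();

    // 2. Index it the way the rows should come back sorted.
    new SqlCommand("CREATE INDEX IX_Grouped ON #GroupedReport (DocumentTypeID)",
                   conn).ExecuteNonQuery();

    // 3. Read the prepared rows; the #temp table is dropped automatically when the connection closes.
    using (var reader = new SqlCommand(
        "SELECT * FROM #GroupedReport ORDER BY DocumentTypeID", conn).ExecuteReader())
    {
        while (reader.Read())
        {
            // process one grouped, sorted row
        }
    }
}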
For example (these snippets assume reportData is exposed as an enumerable of objects with the relevant properties):
1. Let's assume that you would like to group them by their DocumentTypeID:
var groupByType = reportData.GroupBy(g => g.DocumentTypeID);
2. Sorting alphabetically:
var sortAlphabetically = reportData.OrderBy(g => g.DocumentName);
3. Grouping and sorting (sort the documents within each group):
var groupAndSort = reportData.GroupBy(g => g.DocumentTypeID)
                             .Select(grp => grp.OrderBy(d => d.DocumentName));
4. Sort and group:
var sortAndGroup = reportData.OrderBy(g => g.DocumentName)
                             .GroupBy(g => g.DocumentTypeID);
5. Multiple grouping and sorting (use a composite key):
var multipleGroupAndSort = reportData
    .GroupBy(g => new { g.DocumentTypeID, g.CreatedOnDate.Month })
    .Select(grp => grp.OrderBy(d => d.DocumentName));
and so on and so forth...
But I would still discourage bringing a million rows into the application; it will cost memory. There are of course ways to manage it through stored procedures etc.
I'm developing a tax calculation system that applies various taxes based on a set of supplied criteria.
The information frequently changes, so I'm trying to create a way to store all these logic rules in the database.
As you can imagine, there is a lot of compound logic involved in applying taxes.
For example, a tax might only apply if A is true, B is less than 100, and C equals 7.
My current design is terrible.
I have a few database columns for very common criteria filtering, such as location and tax year.
For more complex logic, I have a column that holds JavaScript, and in code, I run an interpreter to filter the results. Performance and maintainability suck.
I'd like to improve this design by making the logic entirely data-driven, but I'm having trouble figuring out how to correctly represent this logic within a relational database. What is a good way to model this logic in the database?
I have worked on a similar issue for over a year now, for a manufacturing cost generation application. It takes in loads of product design data and, based on the design and other inventory considerations such as quantity, bulk purchase options, part supplier, electrical ratings etc., produces a list of direct materials, labour and costs.
I knew from the onset that what I needed was some kind of query language rather than a computational one, and it had to be scripted, not compiled. But I have yet to find a perfect solution:
METHOD 1 - SQL
I created tables that represent my objects and columns that represent their properties, and then manually typed all the required SQL SELECT statements into an item_rules table. What I did was first save the object into the database, and then I did:
rules = SELECT * FROM item_rules
foreach (rules as _rule)
{
    count = SELECT COUNT(*) FROM (_rule[select_statement]) AS T1
    if (count > 0) itemlist.add(_rule[item_that_satisfy_rule])
}
What this does is take each rule in the item_rules table and run it against my object, which is now in the tables, e.g. SELECT * FROM my_object WHERE A=5 AND B>10. If the rule matches, I get a positive count, and then I know I should include the corresponding rule item in my items list.
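In C# terms the loop above comes out roughly as follows (a sketch with ADO.NET; the item_rules columns follow the description above, everything else is a placeholder):

// using System.Data.SqlClient;
var matchedItems = new List<string>();
using (var conn = new SqlConnection(connectionString))
{
    conn.Open();

    // Load the rules: the stored SELECT text plus the item it unlocks.
    var rules = new List<(string Sql, string Item)>();
    using (var reader = new SqlCommand(
        "SELECT select_statement, item_that_satisfy_rule FROM item_rules", conn).ExecuteReader())
    {
        while (reader.Read())
            rules.Add((reader.GetString(0), reader.GetString(1)));
    }

    // Run each rule's SELECT against the saved object and keep the item if any row matches.
    foreach (var rule in rules)
    {
        var countSql = $"SELECT COUNT(*) FROM ({rule.Sql}) AS T1";
        var count = (int)new SqlCommand(countSql, conn).ExecuteScalar();
        if (count > 0)
            matchedItems.Add(rule.Item);
    }
}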
METHOD 2 - NCALC
Instead of storing the queries in SQL format, I found the NCalc open-source expression-parsing library. NCalc takes a string expression and optional variables and computes a result. The string expressions can be stored as plain text on the filesystem.
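A minimal NCalc sketch (the expression text and variable names are just examples):

// using NCalc;
// The rule text could come from a file or a database column instead of a literal.
var expression = new Expression("A = 5 and B > 10");
expression.Parameters["A"] = 5;
expression.Parameters["B"] = 42;
bool applies = (bool)expression.Evaluate();   // true here, so the rule's item would be included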
METHOD 3 - EXCEL
Excel is actually a very good piece of software for doing data lookups. You can create the formulas in Excel, feed data from your application into it, and let Excel run the formulas to give you the results. The advantage is that many people know how to use Excel, so different people can maintain it.
But like I said, none of these is perfect for me. I am just sharing, and hopefully we can get better recommendations.
If you are going to go with Jake's approach, you can use dynamic SQL too.
I have to write the code for the following method:
public IEnumerable<Product> GetProducts(int pageNumber, int pageSize, string sortKey, string sortDirection, string locale, string filterKey, string filterValue)
The method will be used by a web UI and must support pagination, sorting and filtering. The database (SQL Server 2008) has ~250,000 products. My question is the following: where do I implement the pagination, sorting and filtering logic? Should I do it in a T-SQL stored procedure or in the C# code?
I think that it is better if I do it in T-SQL but I will end up with a very complex query. On the other hand, doing that in C# implies that I have to load the entire list of products, which is also bad...
Any idea what is the best option here? Am I missing an option?
You would definitely want to have the DB do this for you. Moving ~250K records up from the database for each request will be a huge overhead. If you are using LINQ-to-SQL, the Skip and Take methods will do this (here is an example), but I don't know exactly how efficient they are.
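A minimal sketch of that (db is a hypothetical LINQ-to-SQL DataContext and the Product properties are just placeholders):

// Filtering, sorting and paging pushed down to SQL Server by LINQ-to-SQL.
var page = db.Products
    .Where(p => p.Locale == locale)           // filtering
    .OrderBy(p => p.Name)                     // sorting
    .Skip((pageNumber - 1) * pageSize)        // skip the earlier pages
    .Take(pageSize)                           // take just one page
    .ToList();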
I think another (and potentially the best) option is to use some higher-level framework that shields you from the complexity of query writing. Entity Framework, NHibernate and LINQ (to SQL) help you a lot. That said, the database is typically the best place to do it in your case.
Just today I implemented pagination for my website. I did it with a stored procedure even though I am using Entity Framework. I found that executing a complex query is better than fetching all records and doing pagination in code. So do it with a stored procedure.
And looking at the method signature you attached, I have implemented mine in the same way.
I would definitely do it in a stored procedure, something along the lines of:
SELECT * FROM (
    SELECT
        ROW_NUMBER() OVER (ORDER BY Quantity) AS row, *
    FROM Products
) AS a
WHERE row BETWEEN 11 AND 20
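The 11 and 20 above correspond to page 2 with a page size of 10; in practice you would derive the window from the method's paging arguments and pass it to the procedure as parameters rather than hard-coding it, roughly:

// Sketch: compute the ROW_NUMBER() window from pageNumber/pageSize.
int firstRow = (pageNumber - 1) * pageSize + 1;   // page 2, size 10 -> 11
int lastRow = pageNumber * pageSize;              // page 2, size 10 -> 20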
If you are using LINQ, then the Skip and Take methods will take care of this for you.
Definitely in the DB for preference, if at all possible.
Sometimes you can mix things up a bit. For example, if the results are returned from a database function (not a stored procedure; functions can be parts of larger queries in ways that stored procedures cannot), then you can have another function order and paginate, or perhaps have LINQ to SQL or similar request a page of results from that function, producing the correct SQL as needed.
If you can at least get the ordering done in the database, and users will usually only want the first few pages (as quite often happens in real use), then you can at least have reasonable performance for those cases, as only enough rows to skip to, and then take, the wanted rows need to be loaded from the DB. You of course still need to test that performance is reasonable in those rare cases where someone really does ask for page 1,2312!
Still, that's only a compromise for cases where paging is very difficult indeed; as a rule, always page in the DB unless it's either extremely difficult for some reason or the total number of rows is guaranteed to be low.
Just a random query regarding Microsoft Velocity.
Scenario:
Say I want ALL orders from my database. In SQL this is fine: I can do SELECT OrderId, TotalCost... FROM Orders. This is one round trip to my database, and everyone is happy.
Now, if I'm using Memcached or (as I'm using now) Microsoft Velocity (CTP3), there is no easy way to do this. The two options I see are (in pseudocode):
FOR EACH OrderId
    Order = cache.TryGet(OrderId)
    IF Order IS NULL
        Order = db.Get(OrderId)
END FOR EACH
which would be LOADS of roundtrips.
Also, consider I want to get orders by Customer
SQL: SELECT OrderId, ..., TotalCost FROM Orders WHERE CustomerId = MyCustomerId
One round trip, everyone is happy.
Using a cached solution, there are really two options I see:
Solution 1:
DOES CustomerOrderIdsForCustomerId EXIST?
NO:
    POPULATE CustomerOrderIdsForCustomerId FROM DATABASE
YES:
    FOR EACH OrderId IN CustomerOrderIdsForCustomerId
        Order = cache.TryGet(OrderId)
        IF Order IS NULL
            Order = db.Get(OrderId)
    END FOR EACH
Solution 2 is to hold a serialized list of all the customer's orders in its own cache object. That reduces round trips, but just seems lame.
Can someone shed light on this situation please?
Just because you have a cache doesn't mean you have to use it for every query! In this instance, as you've already identified, it's not really helping you, and I'd probably go straight to the database for this sort of thing.
It depends a bit on your application though - if you think customers are regularly going to be looking at their order history, or you have some function that's analysing orders to see what products are hot, then you might want to use some caching to keep load off your SQL Server. In that case, I'd probably hold either a DataTable of the orders or a collection of Order objects in the cache, and query it with LINQ to show the orders for a customer.
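For example, assuming a collection of all orders (or of one customer's orders) has already been put into the cache under some key, the per-customer view is just a LINQ query over that cached collection (the key and property names here are illustrative):

// Sketch only: "allOrders" is a hypothetical cache key holding a List<Order>.
var orders = (List<Order>)cache.Get("allOrders");
var customerOrders = orders
    .Where(o => o.CustomerId == myCustomerId)
    .OrderByDescending(o => o.OrderDate)
    .ToList();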
Keep in mind that a cache is not supposed to be the permanent store for any data (orders in your case). In this case the cache can help remove some of the load from your DB server, but something has to load the orders into the cache before you can retrieve them. That being said, here are a couple of options to consider, if you are using Velocity, that avoid having to loop through a collection. However, you will always have to figure out a way to deal with data that is not in the cache.
Option 1: Use Regions
You can create a region and get all the objects from that region with one call. In your scenario, you could create an Orders region where you can store all the orders and then use the GetObjectsInRegion method to get all the orders in the cache. Note however that this brings back all the orders in the cache... which might or might not have all the orders that you have in the database.
Option 2: Use Regions And Tags
Velocity lets you tag objects that you put in the cache regions and then retrieve them using those tags. So, in your scenario you could tag the order objects with an "order" tag and then use the GetObjectsByTag method to retrieve them. Since you can use multiple tags, you could also tag them with their customer id tag and then pull them out that way.
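A rough sketch of both options (based on the AppFabric/Velocity caching client; exact namespaces and class names have shifted between the CTPs, so treat the names as approximate):

// using Microsoft.ApplicationServer.Caching;  (the Velocity CTPs used a different namespace)
DataCache cache = new DataCacheFactory().GetCache("default");
cache.CreateRegion("Orders");

// Tag each order with its customer id when putting it into the region.
cache.Put(order.OrderId.ToString(), order,
          new[] { new DataCacheTag("customer:" + order.CustomerId) }, "Orders");

// Option 1: everything currently cached in the region.
var allCachedOrders = cache.GetObjectsInRegion("Orders");

// Option 2: only the orders tagged with a particular customer id.
var customerOrders = cache.GetObjectsByTag(
    new DataCacheTag("customer:" + myCustomerId), "Orders");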
These 2 options come with some caveats, so be sure to read up on the documentation:
Velocity Tag-Based Methods
I want a data structure that will allow querying how many items arrived in the last X minutes. An item may just be a simple identifier or a more complex data structure; preferably the item's timestamp will be stored in the item itself rather than outside it (in a hash or similar, since I wouldn't want problems with multiple items having the same timestamp).
So far it seems that with LINQ I could easily filter items with a timestamp greater than a given time and aggregate a count, though I'm hesitant to work .NET 3.5-specific stuff into my production environment yet. Are there any other suggestions for a similar data structure?
The other part I'm interested in is aging old data out. If I'm only going to be asking for counts of items from less than 6 hours ago, I would like anything older than that to be removed from my data structure, because this may be a long-running program.
A simple linked list can be used for this.
Basically, you add new items to the end and remove items that are too old from the start; it is a cheap data structure.
Example code:
list.push_end(new_data)
while list.head.age >= age_limit:
    list.pop_head()
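In C# the same idea falls out of a Queue<T> almost for free. A minimal sketch, assuming a placeholder item type that carries its own timestamp:

// using System; using System.Collections.Generic; using System.Linq;
// TimestampedItem is a hypothetical type with a Timestamp property.
var window = new Queue<TimestampedItem>();

// New items go on the tail.
window.Enqueue(newItem);

// Age out anything older than the retention limit from the head (the head is always oldest).
var cutoff = DateTime.UtcNow - TimeSpan.FromHours(6);
while (window.Count > 0 && window.Peek().Timestamp < cutoff)
    window.Dequeue();

// "How many in the last X minutes" (for any X up to the retention limit):
var since = DateTime.UtcNow - TimeSpan.FromMinutes(30);
int count = window.Count(i => i.Timestamp >= since);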
If the list will be busy enough to warrant chopping off larger pieces than one at a time, then I agree with dmo, use a tree structure or something similar that allows pruning on a higher level.
I think that an important consideration will be the frequency of querying vs. adding/removing. If you will do frequent querying (especially if you'll have a large collection) a B-tree may be the way to go:
http://en.wikipedia.org/wiki/B-tree
You could have some thread go through and clean up this tree periodically or make it part of the search (again, depending on the usage). Basically, you'll do a tree search to find the spot "x minutes ago", then count the number of children on the nodes with newer times. If you keep the number of children under the nodes up to date, this sum can be done quickly.
a cache with sliding expiration will do the job ....
stuff your items in and the cache handles the aging ....
http://www.sharedcache.com/cms/