Aging Data Structure in C#

I want a data structure that will allow querying how many items were added in the last X minutes. An item may be a simple identifier or a more complex structure; preferably the timestamp lives inside the item rather than outside it (in a hash or similar), since I wouldn't want problems with multiple items sharing the same timestamp.
So far it seems that with LINQ I could easily filter items with a timestamp greater than a given time and aggregate a count, though I'm hesitant to work .NET 3.5-specific features into my production environment yet. Are there any other suggestions for a similar data structure?
The other part that I'm interested in is aging old data out. If I'm only going to be asking for counts of items less than 6 hours old, I would like anything older than that to be removed from the data structure, because this may be a long-running program.
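(For reference, the LINQ filter-and-count mentioned above would be roughly the following one-liner, where items, Timestamp, and x are assumed names:)
int count = items.Count(i => i.Timestamp >= DateTime.UtcNow - TimeSpan.FromMinutes(x));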

A simple linked list can be used for this.
Basically you add new items to the end and remove items that are too old from the start; it is a cheap data structure.
Example code (C#, assuming each item exposes a Timestamp property):
list.AddLast(newItem);                                                          // newest items go at the tail
while (list.First != null && DateTime.UtcNow - list.First.Value.Timestamp >= ageLimit)
    list.RemoveFirst();                                                         // drop expired items from the head
If the list will be busy enough to warrant chopping off larger pieces than one at a time, then I agree with dmo: use a tree structure or something similar that allows pruning at a higher level.
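Wrapping that idea up, a minimal sketch of a buffer that both answers "how many items in the last X?" and ages out anything older than a retention window; the type and member names (AgingBuffer, CountSince, and so on) are made up for illustration:
using System;
using System.Collections.Generic;

// Hypothetical wrapper: a time-ordered buffer that answers "how many items in the
// last X?" and ages out anything older than a fixed retention window.
class AgingBuffer<T>
{
    private readonly LinkedList<KeyValuePair<DateTime, T>> _items =
        new LinkedList<KeyValuePair<DateTime, T>>();
    private readonly TimeSpan _maxAge;

    public AgingBuffer(TimeSpan maxAge)
    {
        _maxAge = maxAge;
    }

    public void Add(T item)
    {
        _items.AddLast(new KeyValuePair<DateTime, T>(DateTime.UtcNow, item)); // newest at the tail
        Prune();
    }

    public int CountSince(TimeSpan window)
    {
        Prune();
        DateTime cutoff = DateTime.UtcNow - window;
        int count = 0;
        // Entries are in arrival order, so walk from the newest end and stop at the
        // first one that falls outside the window.
        for (var node = _items.Last; node != null && node.Value.Key >= cutoff; node = node.Previous)
            count++;
        return count;
    }

    private void Prune()
    {
        DateTime limit = DateTime.UtcNow - _maxAge;
        while (_items.First != null && _items.First.Value.Key < limit)
            _items.RemoveFirst();   // drop expired items from the head
    }
}
Usage would be something like buffer.Add(item) on arrival and buffer.CountSince(TimeSpan.FromMinutes(30)) when querying.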

I think that an important consideration will be the frequency of querying vs. adding/removing. If you will do frequent querying (especially if you'll have a large collection) a B-tree may be the way to go:
http://en.wikipedia.org/wiki/B-tree
You could have some thread go through and clean up this tree periodically, or make the cleanup part of the search (again, depending on the usage). Basically, you'll do a tree search to find the spot "x minutes ago", then count the entries on the newer side of that spot. If each node keeps an up-to-date count of the entries beneath it, this sum can be computed quickly.
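For illustration, a sketch of just the counting step, shown with a plain binary search tree keyed by timestamp rather than a full B-tree; the node layout and names are assumptions, and the code that keeps SubtreeCount up to date on insert/delete is omitted:
class Node
{
    public DateTime Timestamp;
    public Node Left, Right;
    public int SubtreeCount;        // 1 + size of Left + size of Right, maintained on insert/delete
}

// Count how many stored timestamps are >= cutoff, i.e. "items in the last X minutes".
static int CountNewerThan(Node node, DateTime cutoff)
{
    if (node == null)
        return 0;
    if (node.Timestamp >= cutoff)
    {
        // This node and everything in its right subtree are newer; recurse only into the left side.
        int rightCount = node.Right != null ? node.Right.SubtreeCount : 0;
        return 1 + rightCount + CountNewerThan(node.Left, cutoff);
    }
    // This node is too old; only the right subtree can contain newer entries.
    return CountNewerThan(node.Right, cutoff);
}
The cost is proportional to the height of the tree rather than to the number of items counted.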

A cache with sliding expiration will do the job: stuff your items in and the cache handles the aging.
http://www.sharedcache.com/cms/
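For illustration only, here is roughly what that looks like with the built-in System.Runtime.Caching.MemoryCache (used here as a stand-in for the linked SharedCache product; the key and item are placeholders):
using System;
using System.Runtime.Caching;

// Each item goes in under its own key; the policy evicts it automatically once it
// has gone six hours without being touched. Note that sliding expiration resets on
// every read; use AbsoluteExpiration if eviction should depend only on insertion time.
var policy = new CacheItemPolicy { SlidingExpiration = TimeSpan.FromHours(6) };
MemoryCache.Default.Add("item-42", myItem, policy);

// Rough count of items still alive:
long count = MemoryCache.Default.GetCount();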

Related

Entity Framework cursor based pagination

How do you efficiently implement cursor-based pagination with EF? Traditionally Skip and Take are the common way to do it, but for scenarios where data is added and removed frequently, traditional offset pagination is not the best way to go.
To put things in context, suppose you need to list a huge catalogue of products. You can store the last product id and use a where clause asking for ids greater than (or less than) the stored value. Things get complicated when you need to sort by criteria like price or date added, where many items can share the same value; then greater than or less than alone is not enough.
LINQ has SkipWhile and TakeWhile, but these work over objects, not over SQL. I can go for them if a decent solution comes to mind, or via a smart answer/comment. I am trying to implement GraphQL pagination as per Relay.js.
Thanks in advance
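For what it's worth, a sketch of the keyset ("seek") approach for the composite-sort case described above, assuming an EF entity with Price and Id properties and a cursor taken from the last row of the previous page (all names are hypothetical):
// Page ordered by Price, with Id as a tiebreaker so the cursor is unambiguous
// even when many products share the same price.
var page = dbContext.Products
    .Where(p => p.Price > lastPrice ||
               (p.Price == lastPrice && p.Id > lastId))
    .OrderBy(p => p.Price)
    .ThenBy(p => p.Id)
    .Take(pageSize)
    .ToList();
For a descending sort or backwards paging the comparisons flip; the cursor handed back to the client is just the (Price, Id) pair of the last row, which maps naturally onto a Relay-style opaque cursor.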

Efficient C# data structure for insertion, deletion and rearranging at given indexes [closed]

I'm looking for an efficient data structure in C# that allows me to keep a list of items ordered (by the user) without duplicates.
What I mean by "ordered by the user", i.e.:
Insert element 1.
Insert element 2 before element 1.
Insert element 3 between 1 and 2. Then rearrange at will.
I will need the order to be continuously updated in a database upon change so that I can load it at start.
Operations I need:
Insert at a given index
Delete at a given index
Move from index x to index y (this could be expressed as a delete followed by an insert if there is no loss in performance)
All of these operations will be frequent and equally important.
I assume by "efficient" you mean asymptotically efficient. If that's not the case, then clarify the question.
The combination of indexing and arbitrary insertion is a tricky one.
List<T>s -- which are just a thin wrapper over arrays -- have O(1) insertion/deletion at the end, O(n) insertion/deletion at the beginning, and O(1) indexing. Checking uniqueness is O(n).
Linked lists have O(1) insertion/deletion provided you already know where you want to put the item, but O(n) indexing to find that location. Checking uniqueness is O(n)
Balanced binary trees have O(lg n) insertion and deletion and indexing if you're clever. Checking uniqueness is O(n). More exotic data structures like finger trees, skiplists, etc, are similar.
Hash sets have O(1) insertion and deletion but no indexing; checking uniqueness is O(1).
There is no single data structure that fits your needs. My advice is:
Embrace immutability. Write an immutable data structure that meets your needs. It will be easier to reason about.
Write a combination of a balanced binary tree -- red-black, AVL, etc -- and a hash set. The hash set is used only for uniqueness checking. The BBT has the number of items below it in every node; this facilitates indexing. The insertion and deletion algorithms are as normal for your BBT, except that they also rewrite the spine of the tree to ensure that the item count is updated correctly.
This will give you O(1) uniqueness checking and O(lg n) indexing, inserting and deleting.
I note that this data structure gives you O(1) answers to the question "is this item in the collection?" but O(n) answers to the question "where is it?" so if you need the inverse indexing operation to be fast, you have a much larger problem on your hands.
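If it helps, here is a rough sketch of the subtree-count bookkeeping described above. Rebalancing and deletion are omitted for brevity (so this version degrades to O(n) in the worst case), and all type and member names are made up; a production version would hang the same Count field off a red-black or AVL node:
using System;
using System.Collections.Generic;

class IndexedSet<T>
{
    private class Node
    {
        public T Value;
        public Node Left, Right;
        public int Count = 1;              // items in this subtree, including this node
    }

    private Node _root;
    private readonly HashSet<T> _seen = new HashSet<T>();   // O(1) uniqueness check

    public int Count { get { return _root == null ? 0 : _root.Count; } }

    public void InsertAt(int index, T item)
    {
        if (index < 0 || index > Count)
            throw new ArgumentOutOfRangeException("index");
        if (!_seen.Add(item))
            throw new ArgumentException("Duplicate item.");
        _root = InsertAt(_root, index, item);
    }

    public T GetAt(int index)
    {
        if (index < 0 || index >= Count)
            throw new ArgumentOutOfRangeException("index");
        Node node = _root;
        while (true)
        {
            int leftCount = node.Left == null ? 0 : node.Left.Count;
            if (index < leftCount) { node = node.Left; }
            else if (index == leftCount) { return node.Value; }
            else { index -= leftCount + 1; node = node.Right; }
        }
    }

    private static Node InsertAt(Node node, int index, T item)
    {
        if (node == null)
            return new Node { Value = item };
        int leftCount = node.Left == null ? 0 : node.Left.Count;
        if (index <= leftCount)
            node.Left = InsertAt(node.Left, index, item);
        else
            node.Right = InsertAt(node.Right, index - leftCount - 1, item);
        node.Count++;                      // rewrite the spine: keep subtree sizes correct
        return node;                       // NOTE: no rebalancing here
    }
}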
I think I would just use a List and accept the O(n) Contains, or keep a separate HashSet for uniqueness. List does all the other operations nicely. "Nicely" in the sense that the operations are all there, even though most are O(n). Even at 10,000 items, O(n) is pretty fast. The database calls are going to be the slowest part by far (try async).
class MyCollection<T> : IList<T>
{
    private readonly IList<T> _list = new List<T>();

    public void Insert(int index, T item)
    {
        if (this.Contains(item))                      // O(n) duplicate check
            throw new ArgumentException("Duplicate item.", "item");
        _list.Insert(index, item);
        // make database call
    }

    // implement all the other members of IList<T> with database calls
}
This has kind of turned into two questions: one for the database layer and one for the in-memory collection. However, I think you can practically bring it back down to a single question if you let the database layer become your source of truth.
The reason I say this is that with roughly 100 items as the maximum likely number of active items in your list, you can pretty much ignore asymptotic complexity. Performance-wise, the most important thing to focus on when you've got this many items is round-trips across network connections (e.g. to the database).
Here's a fairly simple approach you can use. It's similar to something I've done in the past, with similar requirements. (I can't remember if it's exactly the same or not, but close enough.)
Use a numeric Order column to determine the order of your items within the given list. int should be just fine.
When you remove an item, decrement the orders of all items in the same list after that item. This can be done with a single UPDATE statement in SQL.
When you add an item, give it an Order value based on the location it's added at, and increment the orders of all items in the same list after that item (again, with a single Update statement).
When you move an item to a different location, change its Order and then increment or decrement the Orders of all the items between its starting and ending positions.
Every time a change is made, re-load the entire list of items, in order, from the database to display to the user.
You may want to use stored procs so that each of these operations is a single round-trip, and definitely use a transaction to avoid race conditions (a sketch of the removal step follows below).
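As a sketch, the "remove an item" step from the list above might look like this with plain ADO.NET; the table, column, and parameter names are made up:
using System.Data.SqlClient;

static void RemoveItem(string connectionString, int listId, int removedOrder)
{
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();
        using (var tx = conn.BeginTransaction())
        {
            var delete = new SqlCommand(
                "DELETE FROM ListItems WHERE ListId = @listId AND [Order] = @order", conn, tx);
            delete.Parameters.AddWithValue("@listId", listId);
            delete.Parameters.AddWithValue("@order", removedOrder);
            delete.ExecuteNonQuery();

            // Close the gap so Order values stay contiguous (the single UPDATE mentioned above).
            var shift = new SqlCommand(
                "UPDATE ListItems SET [Order] = [Order] - 1 WHERE ListId = @listId AND [Order] > @order",
                conn, tx);
            shift.Parameters.AddWithValue("@listId", listId);
            shift.Parameters.AddWithValue("@order", removedOrder);
            shift.ExecuteNonQuery();

            tx.Commit();
        }
    }
}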
An approach like this will easily scale for individual users editing individual lists. If you need scalability in terms of concurrent users, it's likely that another strategy like a NoSQL store is going to be the way to go. If you need to scale on many concurrent users editing the same list, things get really complicated and you may need to implement message buses and other goodness. If you find that you need to scale to tens of thousands of items in the list, you'll need to rethink your UI and how it communicates with the server (e.g. you won't want to load the entire list into memory). But when each of the operations is performed manually by a user, worrying about your in-memory data structure isn't going to get you where you want to be in any of these cases.
In terms of data structures, a linked list is fast for inserts and deletes assuming you have a direct reference to the nodes (in this case, you would want a doubly-linked list). I haven't used the built-in .NET LinkedList, but it seems it has some efficiency problems. You may want to just use a normal List if you have issues with LinkedList (it really depends on how "efficient" you need this to be). See List time complexities here.
As for saving it, all you need to do is save the index in your database and fill your collection from an query with an ORDER BY on start.
EDIT:
And for duplicate management, you could maintain a HashSet to check for duplicates and prevent insertion.

Recommended pattern/strategy to compare two sets of data (new vs existing)... remaining with new data that's not in existing

I have an ETL process/job that fetches database data from a source to a destination in a scheduled way.
[Source data] is updated regularly with new data from some external source. [Destination data] is a subset of [Source data] that is used downstream by business.
The constraint on [Destination data] is that it should not contain duplicates (which may occur, for example, if a job fails and a new extraction is run after some data has already been imported).
The job imports 1000 records at a time
The Scheduler/Job has other responsibilities and other data it works on
One of my "feasible" options involves:
fetching ALL the projected composite/key columns from the destination,
doing a comparison with the new 1000 loaded records (still a lot of records),
then saving the new [Source data] that is not in the [Destination data].
I would imagine that the data structure containing the existing [Destination data] would be a HashSet of composite keys, for example HashSet<(int, string, string)>, where the three data items uniquely identify a record.
I would then get the 1000 records, loop through them, comparing with the HashSet.
I fear working with too much data in-memory.
Any advice on a better approach, or would this be the most efficient way to do it?
Just to share: I found a similar question with a comprehensive answer. It's in Java but easily translates to C#.
Still open to any alternatives; otherwise I'll mark this one as the answer and flag the question as a duplicate.
...we could sort all elements by their ID (a one-time O(n log n) cost) in ascending order, and iterate over them using an O(n) algorithm that skips elements as long as they are larger than the current element from the other sequence. This is better, but still not optimal.
The optimal solution is to create a hash set of IDs of the bs set. This does not require sorting of both sets, and allows linear-time membership test. There is a one-time O(n) cost to assemble the set of IDs.
HashSet<Integer> bIds = new HashSet<>(bs.size());
for (B b : bs)
    bIds.add(b.getId());          // one-time O(|bs|) cost to build the ID set
for (A a : as)
    if (bIds.contains(a.getId()))
        cs.add(a);                // constant-time membership test per element of as
The total complexity of this solution is O(|as| + |bs|).
https://softwareengineering.stackexchange.com/a/258325/132218
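Translated to C# for the ETL case above (the record and property names are made up; the composite key is whatever uniquely identifies a destination row):
using System.Collections.Generic;
using System.Linq;

// Build the set of composite keys already present in the destination (one-time O(n)),
// then keep only the newly loaded records whose key is not in that set.
var existingKeys = new HashSet<(int, string, string)>(
    destinationRows.Select(d => (d.Id, d.Code, d.Region)));

var toInsert = newBatch
    .Where(s => !existingKeys.Contains((s.Id, s.Code, s.Region)))
    .ToList();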

How to store a sparse boolean vector in a database?

Let's say I have a book with ~2^40 pages. Each day, I read a random chunk of contiguous pages (sometimes including some pages I've already read). What's the smartest way to store and update the information of "which pages I've read" in a (SQLite) database ?
My current idea is to store [firstChunkPage, lastChunkPage] entries in a table, but I'm not sure about how to update this efficiently.
Should I first check for every possible overlap and then update?
Should I just insert my new range and then merge overlapping entries (perhaps multiple times, because multiple overlaps can occur)? I'm not sure how to build such a SQL query.
This looks like a pretty common problem, so I'm wondering if anyone knows a 'recognized' solution for it.
Any help or idea is welcome!
EDIT: The reading isn't actually random; the number of chunks is expected to stay pretty much constant and very small compared to the number of pages.
Your idea to store ranges of (firstChunkPage, lastChunkPage) pairs should work if data is relatively sparse.
Unfortunately, queries like you mentioned:
SELECT count(*) FROM table
WHERE firstChunkPage <= page AND page <= lastChunkPage
cannot work effectively, unless you use spatial indexes.
For SQLite, you should use the R-Tree module, which implements support for this kind of index. Quote:
An R-Tree is a special index that is designed for doing range queries. R-Trees are most commonly used in geospatial systems where each entry is a rectangle with minimum and maximum X and Y coordinates. ... For example, suppose a database records the starting and ending times for a large number of events. A R-Tree is able to quickly find all events, for example, that were active at any time during a given time interval, or all events that started during a particular time interval, or all events that both started and ended within a given time interval.
With R-Tree, you can very quickly identify all overlaps before inserting new range and replace them with new combined entry.
To create your RTree index, use something like this:
CREATE VIRTUAL TABLE demo_index USING rtree(
id, firstChunkPage, lastChunkPage
);
For more information, read documentation.
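A hedged sketch of the "find overlaps, then replace with one merged entry" step from C#, using Microsoft.Data.Sqlite against the virtual table above; the connection handling, variable names, and id allocation are assumptions:
using System;
using System.Collections.Generic;
using Microsoft.Data.Sqlite;

// Find every stored range that overlaps the newly read chunk [newFirst, newLast] and
// compute the bounds of the merged entry. Deleting the old rows and inserting the
// single combined range (inside a transaction) is the follow-up step.
static (long First, long Last, List<long> OverlappingIds) FindOverlaps(
    SqliteConnection conn, long newFirst, long newLast)
{
    var find = conn.CreateCommand();
    find.CommandText =
        "SELECT id, firstChunkPage, lastChunkPage FROM demo_index " +
        "WHERE firstChunkPage <= $last AND lastChunkPage >= $first";
    find.Parameters.AddWithValue("$first", newFirst);
    find.Parameters.AddWithValue("$last", newLast);

    long mergedFirst = newFirst, mergedLast = newLast;
    var ids = new List<long>();
    using (var reader = find.ExecuteReader())
    {
        while (reader.Read())
        {
            ids.Add(reader.GetInt64(0));
            mergedFirst = Math.Min(mergedFirst, reader.GetInt64(1));
            mergedLast = Math.Max(mergedLast, reader.GetInt64(2));
        }
    }
    return (mergedFirst, mergedLast, ids);
}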

fastest way to search huge list of big texts

I have a Windows application written in C# that needs to load 250,000 rows from a database and provide a "search as you type" feature, which means as soon as the user types something in a text box, the application needs to search all 250,000 records (which are, by the way, a single column with 1,000 characters per row) using a LIKE-style search and display the found records.
The approach I followed was:
1- The application loads all the records into a typed List<EmployeesData>
while (objSQLReader.Read())
{
    lstEmployees.Add(new EmployeesData(
        Convert.ToInt32(objSQLReader.GetString(0)),
        objSQLReader.GetString(1),
        objSQLReader.GetString(2)));
}
2- In the TextChanged event, using LINQ (in combination with a regular expression), I search and attach the resulting IEnumerable<EmployeesData> to a ListView which is in Virtual Mode.
String strPattern = "(?=.*wood*)(?=.*james*)";
IEnumerable<EmployeesData> lstFoundItems = from objEmployee in lstEmployees
                                           where Regex.IsMatch(objEmployee.SearchStr, strPattern, RegexOptions.IgnoreCase)
                                           select objEmployee;
lstFoundEmployees = lstFoundItems;
3- The RetrieveVirtualItem event is handled to display the items in the ListView.
e.Item = new ListViewItem(new String[] {
    lstFoundEmployees.ElementAt(e.ItemIndex).DateProjectTaskClient,
    e.ItemIndex.ToString() });
Though lstEmployees loads relatively fast from SQL Server (about 1.5 seconds), the search on TextChanged takes more than 7 minutes using LINQ. Searching SQL Server directly with a LIKE query takes less than 7 seconds.
What am I doing wrong here? How can I make this search faster (no more than 2 seconds)? This is a requirement from my client, so any help is highly appreciated. Please help...
Does the database column that stores the text data have an index on it? If so, something similar to the trie structure that Nicholas described is already in use. Indexes in SQL Server are implemented using B+ trees, whose average search time is proportional to the height of the tree, on the order of log base 2 of n, where n is the number of records. This means that if you have 250,000 records in the table, the number of operations required to search is log base 2 of 250,000, or approximately 18 operations.
When you load all of the information into a data reader and then use a LINQ expression, it's a linear operation, O(n), where n is the length of the list. So worst case, it's going to be 250,000 operations. If you use a DataView there will be indexes that can be used to help with searching, which will drastically improve performance.
At the end of the day, if there will not be too many requests submitted against the database server, leverage the query optimizer to do this. As long as the LIKE operation isn't performed with a wildcard at the front of the string (i.e. LIKE '%some_string', which negates the use of an index) and there is an index on the table, you will have really fast performance. If there are just too many requests that will be submitted to the database server, either put all of the information into a DataView so an index can be used, or use a dictionary as Tim suggested above, which has a search time of O(1) (on the order of one), assuming the dictionary is implemented using a hash table.
You'd be wanting to preload things and build yourself a data structure called a trie.
It's memory-intensive, but it's what the doctor ordered in this case.
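As a sketch of how a trie could be applied here, assuming each row's text is tokenized into words up front (all names are hypothetical); lookups then cost time proportional to the typed prefix, not to the 250,000 rows:
using System.Collections.Generic;

// Word-level trie mapping prefixes to the records that contain a word with that prefix.
class Trie
{
    private class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public readonly HashSet<int> RecordIds = new HashSet<int>();   // records reachable via this prefix
    }

    private readonly Node _root = new Node();

    public void AddWord(string word, int recordId)
    {
        Node node = _root;
        foreach (char c in word.ToLowerInvariant())
        {
            Node child;
            if (!node.Children.TryGetValue(c, out child))
                node.Children[c] = child = new Node();
            node = child;
            node.RecordIds.Add(recordId);     // every prefix of the word points back at the record
        }
    }

    public HashSet<int> FindByPrefix(string prefix)
    {
        Node node = _root;
        foreach (char c in prefix.ToLowerInvariant())
            if (!node.Children.TryGetValue(c, out node))
                return new HashSet<int>();    // no record contains a word with this prefix
        return node.RecordIds;
    }
}
Build it once after loading (call AddWord for every word of every row, passing the row's index), then on TextChanged intersect the FindByPrefix results for each word the user has typed.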
See my answer to this question. If you need instant response (i.e. as fast as a user types), loading the data into memory can be a very attractive option. It may use a bit of memory, but it is very fast.
Even though there are many characters (250K records * 1000), how many unique values are there? An in-memory structure based off of keys with pointers to records matching those keys really doesn't have to be that big, even accounting for permutations of those keys.
If the data truly won't fit into memory or changes frequently, keep it in the database and use SQL Server Full-Text Indexing, which handles searches like this much better than a LIKE. This assumes a fast connection from the application to the database.
Full-Text Indexing offers a powerful set of operators/expressions which can be used to make searches more intelligent. It's available with the free SQL Server Express Edition, which will handle up to 10 GB of data.
If the records can be sorted, you may want to go with a binary search, which is much, much faster for large data sets. There are several implementations in the .NET collections, such as List<T>.BinarySearch and Array.BinarySearch.
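For example, if the searchable text is sorted once up front, a prefix lookup could be done like this (lstEmployees, SearchStr, and prefix are assumed names; note this only covers prefix matches, not arbitrary substrings):
using System;
using System.Collections.Generic;
using System.Linq;

// Sort the searchable strings once, then binary-search for where the typed prefix
// would sit; every item starting with that prefix follows from there.
List<string> sorted = lstEmployees
    .Select(e => e.SearchStr)
    .OrderBy(s => s, StringComparer.OrdinalIgnoreCase)
    .ToList();

int pos = sorted.BinarySearch(prefix, StringComparer.OrdinalIgnoreCase);
if (pos < 0)
    pos = ~pos;                                   // first index whose value is >= prefix
var matches = new List<string>();
for (int i = pos; i < sorted.Count &&
     sorted[i].StartsWith(prefix, StringComparison.OrdinalIgnoreCase); i++)
    matches.Add(sorted[i]);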
