Determining differences between generic lists - C#

There are probably 10 duplicates of this question, but I would like to know if there is a better way than the one I am currently using. This is a small example showing how I'm determining the differences:
//let t1 be a representation of the IDs in the database.
List<int> t1 = new List<int>() { 5, 6, 7, 8 };
//let t2 be the list of IDs that are in memory.
//these changes need to be reflected in the database.
List<int> t2 = new List<int>() { 6, 8, 9, 10 };
var hash = new HashSet<int>(t1);
var hash2 = new HashSet<int>(t2);
//determines which IDs need to be removed from the database
hash.ExceptWith(t2);
//determines which IDs need to be added to the database.
hash2.ExceptWith(t1);
//remove contents of hash from the database
//add contents of hash2 to the database
I want to know if I can determine what to add and remove in ONE operation instead of the two that I currently have to do. Is there any way to increase the performance of this operation? Keep in mind that in the actual database situation there are hundreds of thousands of IDs.
EDIT Or, as a second question: is there a LINQ query that I can run directly against the database, so I can just supply the new list of IDs and have it automatically remove/add itself? (using MySQL)
CLARIFICATION I know I need two SQL queries (or a stored procedure). The question is whether I can determine the differences between the lists in one action, and whether it can be done faster than this.
EDIT2
This operation from SPFiredrake appears to be faster than my HashSet version - however, I have no idea how to determine which IDs to add and which to remove from the database. Is there a way to include that information in the operation?
t1.Union(t2).Except(t1.Intersect(t2))
EDIT3
Never mind, I forgot that this statement in fact has the problem of deferred execution, although in case anyone is wondering, I solved my prior problem with it by using a custom comparer and an added variable indicating which list each element came from.
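For reference, that comparer approach might look something like this (a rough sketch with hypothetical names, not the exact code I used):
class TaggedId
{
    public int Id;
    public string Source; // "db" for t1, "memory" for t2
}

class TaggedIdComparer : IEqualityComparer<TaggedId>
{
    public bool Equals(TaggedId x, TaggedId y) => x.Id == y.Id;
    public int GetHashCode(TaggedId obj) => obj.Id.GetHashCode();
}

// Tag each ID with its origin, then run the same XOR-style set operation.
// IDs unique to t1 come out with Source == "db" (remove those); IDs unique
// to t2 come out with Source == "memory" (insert those).
var comparer = new TaggedIdComparer();
var tagged1 = t1.Select(i => new TaggedId { Id = i, Source = "db" }).ToList();
var tagged2 = t2.Select(i => new TaggedId { Id = i, Source = "memory" }).ToList();
var diff = tagged1.Union(tagged2, comparer)
                  .Except(tagged1.Intersect(tagged2, comparer), comparer);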

Ultimately, you're going to use a full outer join (which in LINQ terms is two GroupJoins). However, we ONLY care about values that don't have a matching record on the other side: a null right value (left outer join) indicates a removal, and a null left value (right outer join) indicates an addition. So to get it to work this way, we just perform two left outer joins (swapping the inputs for the second one to emulate the right outer join) and concat them together (you could use Union, but that's unnecessary since we'll be getting rid of any duplicates anyway).
List<int> t1 = new List<int>() { 5, 6, 7, 8 };
List<int> t2 = new List<int>() { 6, 8, 9, 10 };
var operations =
    t1.GroupJoin(
        t2,
        t1i => t1i,
        t2i => t2i,
        (t1i, t2join) => new { Id = t1i, Action = !t2join.Any() ? "Remove" : null })
    .Concat(
        t2.GroupJoin(
            t1,
            t2i => t2i,
            t1i => t1i,
            (t2i, t1join) => new { Id = t2i, Action = !t1join.Any() ? "Insert" : null }))
    .Where(tr => tr.Action != null);
This will give you the combined result set. Then, you can feed this data into a stored procedure that removes values that already exist in the table and adds the rest (or into two lists to run removals and additions against). Either way, still not the cleanest way to do it, but at least this gets you thinking.
Edit: My original solution was to separate out the two lists based on which action was needed, which is why it's so ghastly. The same can be done using a one-liner (not caring about which action to take, however), although I think you'll still suffer from the same issues (using LINQ [enumeration] as opposed to HashSets [a hash collection]).
// XOR of sets = (A | B) - (A & B), - being set difference (Except)
t1.Union(t2).Except(t1.Intersect(t2))
I'm sure it'll still be slower than using the HashSets, but give it a shot anyway.
Edit: Yes, it is faster, because it doesn't actually do anything with the collection until you enumerate over it (either in a foreach or by materializing it into a concrete data type, i.e. a List<>, an array, etc.). It's still going to take extra time to sort out which ones to add/remove, and that's ultimately the problem. I was able to get comparable speed by breaking it down into the two queries below, but getting the results into the in-memory world (via ToList()) made it slower than the HashSet version:
t1.Except(t2); // .ToList() slows these down
t2.Except(t1);
Honestly, I would handle it on the SQL side. In the stored proc, store all the values in a table variable with another column indicating addition or removal (based on whether the value already exists in the table). Then you can just do a bulk deletion/insertion by joining back to this table variable.
Edit: Thought I'd expand on what I meant by sending the full list to the database and having it handled in the sproc:
var toModify = t1.Union(t2).Except(t1.Intersect(t2));
var mods = string.Join(",", toModify.ToArray());
// Pass mods (comma separated list) to your sproc.
Then, in the stored procedure, you would do this:
-- @delimitedIDs: some unbounded text type, in case you have a LOT of records
-- I use XQuery to build the table (found it's faster than some other methods)
DECLARE @idTable TABLE (ID int, AddRecord bit)
DECLARE @xmlString XML
SET @xmlString = CAST('<NODES><NODE>' + REPLACE(@delimitedIDs, ',', '</NODE><NODE>') + '</NODE></NODES>' AS XML)

INSERT INTO @idTable (ID)
SELECT node.value('.', 'int')
FROM @xmlString.nodes('//NODE') AS xs(node)

UPDATE id
SET AddRecord = CASE WHEN someTable.ID IS NULL THEN 1 ELSE 0 END
FROM @idTable id LEFT OUTER JOIN [SomeTable] someTable ON someTable.ID = id.ID

DELETE a
FROM [SomeTable] a JOIN @idTable b ON b.ID = a.ID AND b.AddRecord = 0

INSERT INTO [SomeTable] (ID)
SELECT ID FROM @idTable WHERE AddRecord = 1
Admittedly, this just inserts some ID, it doesn't actually add any other information. However, you can still pass in XML data to the sproc and use XQuery in a similar fashion to get the information you'd need to add.
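On the C# side, building that XML payload (instead of the comma-separated list) might look something like this (a sketch; the element names match the sproc above, and System.Xml.Linq is assumed):
// Build <NODES><NODE>5</NODE>...</NODES> from the IDs to modify.
var xmlPayload = new XElement("NODES",
    toModify.Select(id => new XElement("NODE", id))).ToString();
// Pass xmlPayload to the sproc as an XML-typed parameter in place of @delimitedIDs.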

Even if you replace it with a LINQ version, you still need two operations.
Let's assume you are doing this using pure SQL.
You would probably need two queries:
one for removing the records,
another one for adding them (see the sketch below).
Using LINQ, the code would be much more complicated and less readable than your solution.
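For illustration, the two statements might be built like this (a minimal sketch, assuming a hypothetical items table and the toRemove/toAdd sets computed as in the question; inlining is tolerable here only because the values are ints we computed ourselves, otherwise parameterize):
var deleteSql = $"DELETE FROM items WHERE id IN ({string.Join(",", toRemove)})";
var insertSql = "INSERT INTO items (id) VALUES " +
                string.Join(",", toAdd.Select(id => $"({id})"));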

Related

Query ODataV4 connected service with LINQ - Get last record from table

I'm trying to query my OData web service from a C# application.
When I do the following:
var SecurityDefs = from SD in nav.ICESecurityDefinition.Take(1)
                   orderby SD.Entry_No descending
                   select SD;
I get an exception because .top() and .orderby are not supposed to be used together.
I need to get the last record in the dataset, and only the last.
The purpose is to get the last used entry number in a ledger and then continue creating new entries, incrementing the found entry number.
I can't seem to find anything online that explains how to do this.
It's very important that the service only returns the last record from the feed, since speed is paramount in this solution.
I get an exception because .top() and .orderby are not supposed to be used together.
Where did you read that? In general, .top() or .Take() should ONLY be used in conjunction WITH .orderby(); otherwise the record being retrieved is not guaranteed to be repeatable or predictable.
Probably the compounding issue here is mixing query and fluent expression syntax, which is valid, but you have to understand the order of precedence.
Your syntax is taking 1 record, then applying a sort order... you might find it easier to start with a query like this:
// build your query
var SecurityDefsQuery = from SD in nav.ICESecurityDefinition
                        orderby SD.Entry_No descending
                        select SD;
// Take the first item from the list, if it exists; will be a single record.
var SecurityDefs = SecurityDefsQuery.FirstOrDefault();
// Take an array of only the first record, if it exists
var SecurityDefsDeferred = SecurityDefsQuery.Take(1);
This can be executed on a single line using brackets, but you can see how the query is the same in both cases. SecurityDefs in this case is a single ICESecurityDefinition-typed record, whereas SecurityDefsDeferred is an IQueryable<ICESecurityDefinition> that only has a single record.
If you only need the record itself, you can use this one-liner:
var SecurityDefs = (from SD in nav.ICESecurityDefinition
                    orderby SD.Entry_No descending
                    select SD).FirstOrDefault();
You can execute the same query using fluent notation as well:
var SecurityDefs = nav.ICESecurityDefinition.OrderByDescending(sd => sd.Entry_No)
                                            .FirstOrDefault();
In both cases, .Take(1) or .top() is being implemented through .FirstOrDefault(). You have indicated that speed is important, so use .First() or .FirstOrDefault() instead of .Single() or .SingleOrDefault(), because the Single variants will actually request .Take(2) and will throw an exception if more than one result comes back (.Single() also throws when there are no results at all).
The OrDefault variants on both of these queries will not impact the performance of the query itself and should have a negligible effect on your code; use the one that is appropriate for the logic that consumes the returned record and for whether you need to handle the case when there is no existing record.
If the record being returned has many columns, and you are only interested in the Entry_No column value, then perhaps you should simply query for that specific value itself:
Query expression:
var lastEntryNo = (from SD in nav.ICESecurityDefinition
                   orderby SD.Entry_No descending
                   select SD.Entry_No).FirstOrDefault();
Fluent expression:
var lastEntryNo = nav.ICESecurityDefinition.OrderByDescending(sd => sd.Entry_No)
                                           .Select(sd => sd.Entry_No)
                                           .FirstOrDefault();
If speed is paramount, then look at providing a specific custom endpoint on the service to serve the record, or do not process the Entry_No in the client at all; make that the job of the code that receives data from the client, and compute it at the time the entries are inserted.
Making the query perform faster is not the silver bullet you might be looking for, though. Even if this is highly optimised, your current pattern means that any number of clients could all call the service to get the current value of Entry_No, meaning all of them would start incrementing from the same value.
If you MUST increment the Entry_No from the client, then you should look at putting a custom endpoint on the service to simply return the next Entry_No to use. This should be optimistic, meaning that you don't care whether the Entry_No actually gets used in the end, but you can implement the endpoint such that every call increments the field in the database and returns the next value.
It's getting a bit beyond the scope of your initial post, but SQL Server now has support for Sequences, which formalise this type of logic from a database and schema point of view. Using a Sequence simplifies how we manage these types of incrementations from the client, because we no longer rely on data updates being committed to the table before the client can increment the next record (which is what your TOP/ORDER BY DESC solution is trying to do).
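For illustration, consuming such a sequence from C# might look like this (a sketch only; dbo.EntryNoSequence is a hypothetical sequence created on the server with CREATE SEQUENCE, and System.Data.SqlClient is assumed):
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("SELECT NEXT VALUE FOR dbo.EntryNoSequence", conn))
{
    conn.Open();
    // Each call advances the sequence atomically, so concurrent clients
    // can never be handed the same Entry_No.
    var nextEntryNo = Convert.ToInt32(cmd.ExecuteScalar());
}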

SQL generated from LINQ not consistent

I am using the Telerik Open/Data Access ORM against an Oracle database.
Why do these two statements result in different SQL commands?
Statement #1
IQueryable<WITransmits> query = from wiTransmits in uow.DbContext.StatusMessages
                                select wiTransmits;
query = query.Where(e => e.MessageID == id);
Results in the following SQL
SELECT
a."MESSAGE_ID" COL1,
-- additional fields
FROM "XFE_REP"."WI_TRANSMITS" a
WHERE
a."MESSAGE_ID" = :p0
Statement #2
IQueryable<WITransmits> query = from wiTransmits in uow.DbContext.StatusMessages
                                select new WITransmits
                                {
                                    MessageID = wiTransmits.MessageID,
                                    Name = wiTransmits.Name
                                };
query = query.Where(e => e.MessageID == id);
Results in the following SQL
SELECT
a."MESSAGE_ID" COL1,
-- additional fields
FROM "XFE_REP"."WI_TRANSMITS" a
The query generated by the second statement (#2) obviously returns EVERY record in the table when I only want the one. Millions of records make this prohibitive.
Telerik Data Access will try to split each query into database-side and client-side (or in-memory LINQ if you prefer it).
Having a projection with select new is a sure trigger that will make everything in your LINQ expression tree after the projection go to the client side.
Meaning that in your second case you have an inefficient LINQ query, as any filtering is applied in memory, after you have already transported a lot of unnecessary data.
If you want to compose LINQ expressions the way it is done in case 2, you can append the Select clause last, or explicitly convert the result to IEnumerable<T> to make it obvious that any further processing will be done in memory.
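In other words, something like this should keep the filter on the database side (a sketch based on the query from the question):
var query = uow.DbContext.StatusMessages
    .Where(e => e.MessageID == id)   // translated to SQL
    .Select(e => new WITransmits     // projection applied last
    {
        MessageID = e.MessageID,
        Name = e.Name
    });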
The first query returns the full object defined, so any additional limitations (like Where) can be appended to it before it is actually being run. Therefore the query can be combined as you showed.
The second one returns a new object, which can be whatever type and contain whatever information. Therefore the query is sent to the database as "return everything" and after the objects have been created all but the ones that match the Where clause are discarded.
Even if the type were the same in both of them, think of this situation:
var query = from wiTransmits in uow.DbContext.StatusMessages
            select new WITransmits
            {
                MessageID = wiTransmits.MessageID * 4 - 2,
                Name = wiTransmits.Name
            };
How would you combine the Where query now? Sure, you could go through the code inside the new object creation and try to move it outside, but since it can contain anything, that is not feasible. What if the calculation is some lookup function? What if it's not deterministic?
Therefore if you create new objects based on the database objects there will be a border where the objects will be retrieved and then further queries will be done in memory.

MVC Efficient dynamic EF database calls

I'm working on an MVC app. A lot of the views are search-engine-like, allowing the user to select parameters to get the data they want.
I am looking for an efficient way to make dynamic calls to the database so that I can retrieve only the wanted data, instead of pulling a bunch of data and sorting it out.
So far I have been using Dynamic LINQ, but I have had huge trouble working with it, and I was wondering if there is anything better and less troublesome. The only constraints are that I know the fields of the table I'm querying, and I need to be able to use operators like >, < or =.
EDIT
Here's a table example based on my app:
TABLE CARD
(
CARD_IDE INT NOT NULL IDENTITY,
CARD_NAME VARCHAR(50) NOT NULL,
CARD_NUMBER NUMERIC(4) NOT NULL,
CARD_COLOR VARCHAR(10),
CARD_MANA_COST VARCHAR(30),
CARD_MANA_CONVT VARCHAR(3),
CARD_TYPE VARCHAR(50),
CARD_POWER VARCHAR(2),
CARD_TOUGH VARCHAR(2),
CARD_RARTY VARCHAR(1) NOT NULL,
CARD_TEXT_ABILT VARCHAR(800),
CARD_TEXT_FLAVR VARCHAR(800),
CARD_ARTST_NAME VARCHAR(100),
CARD_SET_IDE INT NOT NULL,
CARD_FLAG_FACE INT NOT NULL DEFAULT 0,
CARD_CHILD_IDE INT,
CARD_MASTER_IDE INT,
CARD_UNIT_COST NUMERIC(5,2) NOT NULL DEFAULT 0
)
And a few examples:
A user looks for any card whose type is "Creature" (string), number is 3 and card set IDE is 6;
Any cards which contain the word Rat;
All the cards of color Blue and White whose unit cost is higher than 3.00;
Any cards whose power is less than 3 but higher than 1.
EDIT 2
After much research (and thanks to Chris Pratt below), I've managed to dig into something.
Based on this:
First I create my object context like this:
var objectContext = ((IObjectContextAdapter) mDb).ObjectContext;
Then I create the ObjectSet:
ObjectSet<CARD> priceList = objectContext.CreateObjectSet<CARD>();
Then I check if any values have been chosen by the user:
if (keyValuePair.Key == CARDNAME)
{
    queryToLoad = TextBuilder.BuildQueryStringForTextboxValue(keyValuePair.Value);
    //valuesToUse.Add("CARD_NAME.Contains(\"" + queryToLoad + "\")");
    priceList = priceList.Where(_item => _item.CARD_NAME.Contains(queryToLoad)) as ObjectSet<PRICE_LIST>;
}
Here, queryToLoad is in fact the value to look for. Example: if my user searches for an Angel, queryToLoad will be "Angel". I'm trying to get to the result without having to rewrite my whole code.
And then I gather the result in a List like this:
listToReturn.AddRange(priceList.ToList());
HOWEVER: I have a problem using this approach. When the priceList = priceList.Where(_item => _item.CARD_NAME.Contains(queryToLoad)) as ObjectSet<PRICE_LIST>; line executes, the value is always null and I don't know why.
There is no way to optimize something that is inherently dynamic in nature. The only thing you can do in your app is feed whatever filters the end-user chooses into a Where clause and let Entity Framework fetch the result from your database in the most efficient way it deems fit.
However, there's some things you can potentially do if there's certain known constraints. At the very least, if you know which fields will be searched on, you can add proper indexes to your database so that searches on those particular fields will be optimized. You can reach a point of over-optimization, though (if you index every field, you might as well not have an index). It's usually better to monitor the queries the database handles and add indexes for the most used fields based on real-world user behavior.
You can also investigate using stored procedures for your queries. Depending on the complexity and number of filters being applied, creating a stored procedure to handle the queries could be difficult, but if you can condense the logic enough to make it possible, using a stored procedure will highly optimize the queries.
UPDATE
Here's what I mean by building a query. I'll take your first use case scenario above. You have three filters: type, number, and IDE. The user may specify any one or two, all three, or none. (We'll assume that cardType is a string and cardNumber and cardIde are nullable ints.)
var cards = db.Cards;
if (!string.IsNullOrEmpty(cardType))
{
cards = cards.Where(m => m.Type == cardType);
}
if (cardNumber.HasValue)
{
cards = cards.Where(m => m.Number == cardNumber);
}
if (cardIde.HasValue)
{
cards = cards.Where(m => m.IDE == cardIde);
}
return View(cards);
Entity Framework will not actually issue the query until you do something that requires the data from the query (iterate over the list, count the items, etc.). Up until that point, anything additional you do to the DbSet is just tacked on to the query EF will eventually send. You can therefore build your query by conditionally handling each potential filter one at a time before finally utilizing the end result.
UPDATE #2
Sorry, I apparently hadn't consumed enough coffee yet when I did my last update. There's obviously a type issue there. If you store db.Cards into cards it will be a DbSet<Card> type, while the result of any call to Where would be an IQueryable<Card> type, which pretty obviously can't be stored in the same variable.
So you just need to initially type the variable as something that works for both. IQueryable<Card> keeps the query composable on the database side (IEnumerable<Card> would also compile, but then the Where calls would run as in-memory LINQ and pull the whole table first):
IQueryable<Card> cards = db.Cards;
Then you shouldn't get any type errors.

LINQ Intersection of two different types

I have two lists of different types. I need to remove the elements from list1 that are not in list2, where the list2 element satisfies certain criteria.
Here is what I tried; it seems to work, but each element is listed twice.
var filteredTracks =
    from mtrack in mTracks
    join ftrack in tracksFileStatus on mtrack.id equals ftrack.Id
    where mtrack.id == ftrack.Id && ftrack.status == "ONDISK" && ftrack.content_type_id == 234
    select mtrack;
Ideally I don't want to create a new copy of the filteredTracks; is it possible to modify mTracks in place?
If you're getting duplicates, it's because your id fields are not unique in one or both of the two sequences. Also, you don't need to say where mtrack.id == ftrack.Id since that condition already has to be met for the join to succeed.
I would probably use loops here, but if you are dead set on LINQ, you may need to group tracksFileStatus by its Id field. It's hard to tell from what you posted.
As far as "modifying mTracks in place", this is probably not possible or worthwhile (I'm assuming that mTracks is some type derived from IEnumerable<T>). If you're worried about the efficiency of this approach, then you may want to consider using another kind of data structure, like a dictionary with Id values as the keys.
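That said, if mTracks is concretely a List<T>, the in-place removal the question asks about might look like this (a sketch, assuming integer IDs and the member names from the question):
// IDs that are ONDISK with the right content type; the HashSet gives O(1)
// lookups and also collapses any duplicate Ids in tracksFileStatus.
var validIds = new HashSet<int>(
    tracksFileStatus
        .Where(f => f.status == "ONDISK" && f.content_type_id == 234)
        .Select(f => f.Id));

// Mutates mTracks in place; no join, so no duplicated elements.
mTracks.RemoveAll(m => !validIds.Contains(m.id));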
Since the Q was about lists primarily...
This is probably better LINQ-wise:
var test = (from m in mTracks
            from f in fTracks
            where m.Id == f.Id && ...
            select m);
However, you should optimize. E.g.:
Are your lists sorted? If they are, see e.g. Best algorithm for synchronizing two IList in C# 2.0.
If it's coming from the DB (it's not clear here), then you need to build your LINQ query based on the SQL / relations and indexes you have in the DB and go a somewhat different route.
If I were you, I'd make a query (for each of the lists, presuming it's not DB-bound) so that the tracks are sorted in the first place (sorted on whatever is used to compare them, usually),
then enumerate in parallel (using enumerators), comparing other things in the process (like in that link).
That's likely the most efficient way (see the sketch below).
If/when it comes from the database, optimize at the 'source', i.e. fetch data already sorted and filtered as much as you can. Basically, build the SQL first, or inspect the SQL generated by the LINQ query (let me know if you need the link).
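A rough sketch of that parallel enumeration over two lists pre-sorted by their IDs (MTrack is a stand-in for the question's track type; member names are taken from the question):
var result = new List<MTrack>();
using (var me = mTracks.GetEnumerator())
using (var fe = fTracks.GetEnumerator())
{
    bool hasM = me.MoveNext(), hasF = fe.MoveNext();
    while (hasM && hasF)
    {
        if (me.Current.id < fe.Current.Id) hasM = me.MoveNext();
        else if (me.Current.id > fe.Current.Id) hasF = fe.MoveNext();
        else
        {
            // IDs match: apply the remaining criteria, then advance both
            // sides (which also skips duplicate matches).
            if (fe.Current.status == "ONDISK" && fe.Current.content_type_id == 234)
                result.Add(me.Current);
            hasM = me.MoveNext();
            hasF = fe.MoveNext();
        }
    }
}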

Collecting metadata into table

I have tabular data that passes through a C# program, and I need to collect some metadata about it before finishing. The metadata is always counts based on fields of the data. Also, I need them all grouped by one field in the data. Periodically, I need to add new counts to this collection of metadata.
I've been researching it for a little while, and I think what makes sense is to rework my program to store the data as a DataTable, then run LINQ queries on the table. The problem I'm having is being able to put the different counts into one table-like structure and then write that out.
I might run a query like this:
var query01 =
    from record in records.AsEnumerable()
    group record by record.Field<String>("Association Key") into associationsGroup
    select new { AssociationKey = associationsGroup.Key, Count = associationsGroup.Count<DataRow>() };
to get a count of all of the records grouped by the field Association Key. I'm also going to want another count, grouped in the same way:
var query02 =
    from record in records.AsEnumerable()
    where record.Field<String>("Number 9") == "yes"
    group record by record.Field<String>("Association Key") into associationsGroup
    select new { AssociationKey = associationsGroup.Key, Number9Count = associationsGroup.Count<DataRow>() };
And so on.
I thought about trying to Union-chain the queries, but I was having trouble getting them to union since I'm projecting into anonymous types. I couldn't figure out how to do it differently to make the union work.
So, how can I collect my metadata into one table-like structure?
They're not going to union because you have different types. Add both Number9Count and Count to both anonymous types and try the union again.
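For example, if both projections expose the same property names and types in the same order, the compiler treats them as the same anonymous type and the union compiles (a sketch based on the two queries above):
var query01 =
    from record in records.AsEnumerable()
    group record by record.Field<String>("Association Key") into g
    select new { AssociationKey = g.Key, Count = g.Count(), Number9Count = 0 };

var query02 =
    from record in records.AsEnumerable()
    where record.Field<String>("Number 9") == "yes"
    group record by record.Field<String>("Association Key") into g
    select new { AssociationKey = g.Key, Count = 0, Number9Count = g.Count() };

// Same anonymous type on both sides, so this compiles. Note you still get
// two rows per key (one carrying each count), which you would merge with a
// further GroupBy on AssociationKey.
var combined = query01.Union(query02);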
I ended up solving the problem by creating a class that holds the set of records I need as a DataTable. A user can add queries through a method taking a Func<DataRow, bool> argument. The method constructs the query, supplying that argument as the where clause and maintaining the same grouping and properties in the resulting anonymous-typed object.
When retrieving the results, the class iterates over each query stored and enters the results into a new DataTable.
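A rough sketch of that shape (names are hypothetical; assumes the "Association Key" column from the queries above):
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

class MetadataCollector
{
    private readonly DataTable _records;
    private readonly List<KeyValuePair<string, Func<DataRow, bool>>> _counts =
        new List<KeyValuePair<string, Func<DataRow, bool>>>();

    public MetadataCollector(DataTable records) { _records = records; }

    // Each registered count becomes one column in the output table.
    public void AddCount(string name, Func<DataRow, bool> predicate)
    {
        _counts.Add(new KeyValuePair<string, Func<DataRow, bool>>(name, predicate));
    }

    public DataTable GetResults()
    {
        var results = new DataTable();
        results.Columns.Add("AssociationKey", typeof(string));
        foreach (var count in _counts)
            results.Columns.Add(count.Key, typeof(int));

        foreach (var g in _records.AsEnumerable()
                                  .GroupBy(r => r.Field<string>("Association Key")))
        {
            var row = results.NewRow();
            row["AssociationKey"] = g.Key;
            foreach (var count in _counts)
                row[count.Key] = g.Count(count.Value);
            results.Rows.Add(row);
        }
        return results;
    }
}
// e.g. collector.AddCount("Number9Count", r => r.Field<string>("Number 9") == "yes");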
