Efficient method of finding matching vectors by criteria - c#

I have a dictionary where the key is defined by a Vector2, and I am trying to perform a function involving keys of matching Y-values. (Building a graph)
Right now I am using two foreach loops, one to go through each entry and the second to find keys of matching criteria.
foreach (KeyValuePair<Vector2, TransportData> entry in transportDictionary)
{
    // For every entry in the dictionary...
    Vector2 forpos = entry.Key;
    foreach (KeyValuePair<Vector2, TransportData> searchEntry in transportDictionary)
    {
        // ...go through every entry in the dictionary
        if (searchEntry.Key.y == forpos.y && searchEntry.Key.x != forpos.x)
        {
            // Found a matching y value at a different x (so it doesn't match itself):
            // pass the two matched keys as arguments
            DoSomething(forpos, searchEntry.Key);
        }
    }
    DoSomethingElse(forpos); // functions need to be run on every entry individually too
}
It works, but it is horribly inefficient, and I foresee this dictionary having over a thousand entries. With a small test set of 50 entries, this operation already takes an unacceptably long time.
How can I optimize this operation? (or am I doing something fundamentally wrong?)
If it helps with finding a method, the x and y coordinates of every Vector2 in this application will always be an integer.
--edit--
I need to run a function on every entry anyway, so it isn't necessary to subset the starting dictionary.

One idea would be to first filter your transportDictionary down to only those items that have at least one matching Key.y, and then just deal with a list of keys (since that's all you seem to need).
Then you could also change the second foreach to only compare to keys that have a Y match, so you aren't looping through all of the keys in each loop:
Finally, you could also remove all the items you've just processed as you go, so you aren't iterating over them multiple times (of course, I don't know what DoSomething() does...if you need to iterate over matches more than once then this wouldn't work):
List<Vector2> allKeysThatHaveAMatch = transportDictionary
    .Where(current => transportDictionary.Count(other => current.Key.y == other.Key.y) > 1)
    .Select(item => item.Key)
    .ToList();

while (allKeysThatHaveAMatch.Any())
{
    // Get the first key
    var currentKey = allKeysThatHaveAMatch.First();

    // Get all matching keys (skipping the current key itself)
    var matchingKeys = allKeysThatHaveAMatch
        .Skip(1)
        .Where(candidateKey => candidateKey.y == currentKey.y)
        .ToList();

    // Do something with each match
    foreach (var matchingKey in matchingKeys)
    {
        DoSomething(currentKey, matchingKey);
    }

    // Remove the key we just processed
    allKeysThatHaveAMatch.Remove(currentKey);
}
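Since the coordinates are always integers, another angle (a minimal sketch, assuming Unity's Vector2 with float x/y fields and the DoSomething/DoSomethingElse calls from the question) is to bucket the keys by their y value once, replacing the N² scan with one pass plus per-bucket work:

// Bucket all keys by their (integer) y value in one pass: O(N).
var rowBuckets = new Dictionary<int, List<Vector2>>();
foreach (var key in transportDictionary.Keys)
{
    int row = (int)key.y; // coordinates are always whole numbers per the question
    if (!rowBuckets.TryGetValue(row, out var bucket))
        rowBuckets[row] = bucket = new List<Vector2>();
    bucket.Add(key);
}

// Each key now only needs to be compared against its own bucket.
foreach (var entry in transportDictionary)
{
    Vector2 forpos = entry.Key;
    foreach (var candidate in rowBuckets[(int)forpos.y])
    {
        if (candidate.x != forpos.x)
            DoSomething(forpos, candidate);
    }
    DoSomethingElse(forpos); // still runs once per entry
}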

Related

Optimum way to validate DataTable for duplicate or invalid fields in a specific column with LINQ

I am trying to find the best way to determine if a DataTable
Contains duplicate data in a specific column
or
If the fields within said column are not found in an external Dictionary<string, string> and the resulting value matches a string literal.
This is what I've come up with:
List<string> dtSKUsColumn = _dataTable.Select()
                                      .Select(x => x.Field<string>("skuColumn"))
                                      .ToList();
bool hasError = dtSKUsColumn.Distinct().Count() != dtSKUsColumn.Count() ||
                !_dataTable.AsEnumerable()
                           .All(r => allSkuTypes
                               .Any(s => s.Value == "normalSKU" &&
                                         s.Key == r.Field<string>("skuColumn")));
allSkuTypes is a Dictionary<string, string> where the key is the SKU itself, and the value is the SKU type.
I cannot just operate on a 'distinct' _dataTable, because there is a column that must contain identical fields (Said column cannot be removed and inferred, since I need to preserve the state of _dataTable).
So my question:
Am I handling this in the best possible way, or is there a simpler and faster method?
UPDATE:
The DataTable is not obtained via an SQL query; rather, it is generated by a set of rules from a spreadsheet or CSV. I have to make do with only the allSkuTypes and _dataTable objects as my only 'outside information.'
Your solution is not optimal.
Let N = _dataTable.Rows.Count and M = allSkuTypes.Count. Your algorithm has O(2 * N) space complexity (the memory allocated by the ToList and Distinct calls) and O(N * M) time complexity (due to linear search in allSkuTypes for each _dataTable record).
Here is IMO the optimal solution. It uses single pass over the _dataTable records, a HashSet<string> for detecting the duplicates and TryGetValue method of the Dictionary for checking the second rule, thus ending up with O(N) space and time complexity:
var dtSkus = new HashSet<string>();
bool hasError = false;
foreach (var row in _dataTable.AsEnumerable())
{
    var sku = row.Field<string>("skuColumn");
    string type;
    if (!dtSkus.Add(sku) || !allSkuTypes.TryGetValue(sku, out type) || type != "normalSKU")
    {
        hasError = true;
        break;
    }
}
The additional benefit is that you have at hand the row that broke a rule, and the code can easily be modified to take different actions depending on which rule is broken, to collect/count only the first or all invalid records, etc.
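For instance, a variation that collects every offending row instead of stopping at the first one (a sketch reusing the same column name and rules as above) might look like:

var dtSkus = new HashSet<string>();
var invalidRows = new List<DataRow>();
foreach (var row in _dataTable.AsEnumerable())
{
    var sku = row.Field<string>("skuColumn");
    string type;
    // Same two rules: duplicate SKU, or SKU missing / not of type "normalSKU"
    if (!dtSkus.Add(sku) || !allSkuTypes.TryGetValue(sku, out type) || type != "normalSKU")
        invalidRows.Add(row);
}
bool hasError = invalidRows.Count > 0;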

How do I make a streaming LINQ expression that delivers the filtered out items as well as the filtered items?

I am transforming an Excel spreadsheet into a list of "Elements" (this is a domain term). During this transformation, I need to skip the header rows and throw out malformed rows that cannot be transformed.
Now comes the fun part. I need to capture those malformed records so that I can report on them. I constructed a crazy LINQ statement (below). These are extension methods hiding the messy LINQ operations on the types from the OpenXml library.
var elements = sheet
    .Rows()              // <-- BEGIN sheet data transform
    .SkipColumnHeaders()
    .ToRowLookup()
    .ToCellLookup()
    .SkipEmptyRows()     // <-- END sheet data transform
    .ToElements(strings) // <-- BEGIN domain transform
    .RemoveBadRecords(out discard)
    .OrderByCompositeKey();
The interesting part starts at ToElements, where I transform the row lookup to my domain object list (details: it's called an ElementRow, which is later transformed into an Element). Bad records are created with just a key (the Excel row index) and are uniquely identifiable vs. a real element.
public static IEnumerable<ElementRow> ToElements(this IEnumerable<KeyValuePair<UInt32Value, Cell[]>> map)
{
    return map.Select(pair =>
    {
        try
        {
            return ElementRow.FromCells(pair.Key, pair.Value);
        }
        catch (Exception)
        {
            return ElementRow.BadRecord(pair.Key);
        }
    });
}
Then, I want to remove those bad records (it's easier to collect all of them before filtering). That method is RemoveBadRecords, which started like this...
public static IEnumerable<ElementRow> RemoveBadRecords(this IEnumerable<ElementRow> elements)
{
    return elements.Where(el => el.FormatId != 0);
}
However, I need to report the discarded elements! And I don't want to muddy my transform extension method with reporting. So, I went to the out parameter (taking into account the difficulties of using an out param in an anonymous block)
public static IEnumerable<ElementRow> RemoveBadRecords(this IEnumerable<ElementRow> elements, out List<ElementRow> discard)
{
    var temp = new List<ElementRow>();
    var filtered = elements.Where(el =>
    {
        if (el.FormatId == 0) temp.Add(el);
        return el.FormatId != 0;
    });
    discard = temp;
    return filtered;
}
And, lo! I thought I was hardcore and would have this working in one shot...
var discard = new List<ElementRow>();
var elements = data
    /* snipped long LINQ statement */
    .RemoveBadRecords(out discard)
    /* snipped long LINQ statement */

discard.ForEach(el => failures.Add(el));
foreach (var el in elements)
{
    /* do more work, maybe add more failures */
}
return new Result(elements, failures);
But, nothing was in my discard list at the time I looped through it! I stepped through the code and realized that I successfully created a fully-streaming LINQ statement.
The temp list was created
The Where filter was assigned (but not yet run)
And the discard list was assigned
Then the streaming thing was returned
When discard was iterated, it contained no elements, because the elements weren't iterated over yet.
Is there a way to fix this problem using the thing I constructed? Do I have to force an iteration of the data before or during the bad record filter? Is there another construction that I've missed?
Some Commentary
Jon mentioned that the assignment was happening. I simply wasn't waiting for it. If I check the contents of discard after the iteration of elements, it is, in fact, full! So, I don't actually have an assignment problem. Unless I take Jon's advice on what's good/bad to have in a LINQ statement.
When the statement was actually iterated, the Where clause ran and temp filled up, but discard was never assigned again!
It doesn't need to be assigned again - the existing list which will have been assigned to discard in the calling code will be populated.
However, I'd strongly recommend against this approach. Using an out parameter here is really against the spirit of LINQ. (If you iterate over your results twice, you'll end up with a list which contains all the bad elements twice. Ick!)
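To make that double-enumeration hazard concrete, here is a tiny self-contained sketch (hypothetical numbers, not from the question): the Where predicate, and therefore its side effect, runs again on every enumeration of the sequence.

var temp = new List<int>();
var evens = Enumerable.Range(1, 4).Where(n =>
{
    if (n % 2 != 0) temp.Add(n); // the side effect runs on *every* enumeration
    return n % 2 == 0;
});

Console.WriteLine(evens.Count()); // 2 -- temp now holds { 1, 3 }
Console.WriteLine(evens.Count()); // 2 -- temp now holds { 1, 3, 1, 3 }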
I'd suggest materializing the query before removing the bad records - and then you can run separate queries:
var allElements = sheet
    .Rows()
    .SkipColumnHeaders()
    .ToRowLookup()
    .ToCellLookup()
    .SkipEmptyRows()
    .ToElements(strings)
    .ToList();

var goodElements = allElements.Where(el => el.FormatId != 0)
                              .OrderByCompositeKey();
var badElements = allElements.Where(el => el.FormatId == 0);
By materializing the query in a List<>, you only process each row once in terms of ToRowLookup, ToCellLookup etc. It does mean you need enough memory to hold all the elements at once, of course. There are alternative approaches (such as taking an action on each bad element while filtering it), but they're still likely to end up being fairly fragile.
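For illustration, one such action-based variant might look like the following sketch (a hypothetical overload, not one of the original extension methods). It reports each bad element through a callback as the sequence streams, but the callback fires again on every enumeration, which is exactly the fragility mentioned above:

public static IEnumerable<ElementRow> RemoveBadRecords(
    this IEnumerable<ElementRow> elements, Action<ElementRow> onBadRecord)
{
    foreach (var el in elements)
    {
        if (el.FormatId == 0)
            onBadRecord(el); // side effect: runs again on every enumeration
        else
            yield return el;
    }
}

// Usage in the pipeline:
//     .ToElements(strings)
//     .RemoveBadRecords(el => failures.Add(el))
//     .OrderByCompositeKey();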
EDIT: Another option as mentioned by Servy is to use ToLookup, which will materialize and group in one go:
var lookup = sheet
    .Rows()
    .SkipColumnHeaders()
    .ToRowLookup()
    .ToCellLookup()
    .SkipEmptyRows()
    .ToElements(strings)
    .OrderByCompositeKey()
    .ToLookup(el => el.FormatId == 0);
Then you can use:
foreach (var goodElement in lookup[false])
{
    ...
}
and
foreach (var badElement in lookup[true])
{
    ...
}
Note that this performs the ordering on all elements, good and bad. An alternative is to remove the ordering from the original query and use:
foreach (var goodElement in lookup[false].OrderByCompositeKey())
{
    ...
}
I'm not personally wild about grouping by true/false - it feels like a bit of an abuse of what's normally meant to be a key-based lookup - but it would certainly work.

IEnumerable<T>.Union(IEnumerable<T>) overwrites contents instead of unioning

I've got a collection of items (ADO.NET Entity Framework), and need to return a subset as search results based on a couple different criteria. Unfortunately, the criteria overlap in such a way that I can't just take the collection Where the criteria are met (or drop Where the criteria are not met), since this would leave out or duplicate valid items that should be returned.
I decided I would do each check individually, and combine the results. I considered using AddRange, but that would result in duplicates in the results list (and my understanding is it would enumerate the collection every time - am I correct/mistaken here?). I realized Union does not insert duplicates, and defers enumeration until necessary (again, is this understanding correct?).
The search is written as follows:
IEnumerable<MyClass> Results = Enumerable.Empty<MyClass>();
IEnumerable<MyClass> Potential = db.MyClasses.Where(x => x.Y); // Precondition
int parsed_id;

// For each searchable value
foreach (var selected in SelectedValues1)
{
    IEnumerable<MyClass> matched = Potential.Where(x => x.Value1 == selected);
    Results = Results.Union(matched); // This is where the problem is
}

// Ellipsed....

foreach (var selected in SelectedValuesN) // Happens to be integer
{
    if (!int.TryParse(selected, out parsed_id))
        continue;
    IEnumerable<MyClass> matched = Potential.Where(x => x.ValueN == parsed_id);
    Results = Results.Union(matched); // This is where the problem is
}
It seems, however, that Results = Results.Union(matched) is working more like Results = matched. I've stepped through with some test data and a test search. The search asks for results where the first field is -1, 0, 1, or 3. This should return 4 results (two 0s, a 1 and a 3). The first iteration of the loops works as expected, with Results still being empty. The second iteration also works as expected, with Results containing two items. After the third iteration, however, Results contains only one item.
Have I just misunderstood how .Union works, or is there something else going on here?
Because of deferred execution, by the time you eventually consume Results, it is the union of many Where queries all of which are based on the last value of selected.
So you have
Results = Potential.Where(selected)
                   .Union(Potential.Where(selected))
                   .Union(Potential.Where(selected))...
and all the selected values are the same.
You need to create a var currentSelected = selected inside your loop and pass that to the query. That way each value of selected will be captured individually and you won't have this problem.
You can do this much more simply:
Results = SelectedValues.SelectMany(s => Potential.Where(x => x.Value == s));
(this may return duplicates)
Or
Results = Potential.Where(x => SelectedValues.Contains(x.Value));
As pointed out by others, your LINQ expression is a closure. This means your variable selected is captured by the LINQ expression in each iteration of your foreach-loop. The same variable is used in each iteration of the foreach, so it will end up having whatever the last value was. To get around this, you will need to declare a local variable within the foreach-loop, like so:
// For each searchable value
foreach (var selected in SelectedValues1)
{
    var localSelected = selected;
    Results = Results.Union(Potential.Where(x => x.Value1 == localSelected));
}
It is much shorter to just use .Contains():
Results = Results.Union(Potential.Where(x => SelectedValues1.Contains(x.Value1)));
Since you need to query multiple SelectedValues collections, you could put them all inside their own collection and iterate over that as well, although you'd need some way of matching the correct field/property on your objects.
You could possibly do this by storing your lists of selected values in a Dictionary with the name of the field/property as the key. You would use Reflection to look up the correct field and perform your check. You could then shorten the code to the following:
// Store each of your searchable lists here, keyed by field/property name
Dictionary<string, IEnumerable<object>> DictionaryOfSelectedValues = ...;

Type t = typeof(MyClass);

// For each list of searchable values
foreach (var selectedValues in DictionaryOfSelectedValues) // Returns KeyValuePair<TKey, TValue>
{
    // Try to get a property for this key
    PropertyInfo prop = t.GetProperty(selectedValues.Key);
    IEnumerable<object> localSelected = selectedValues.Value;

    if (prop != null)
    {
        Results = Results.Union(Potential.Where(x =>
            localSelected.Contains(prop.GetValue(x, null))));
    }
    else // If it's not a property, check if the entry is for a field
    {
        FieldInfo field = t.GetField(selectedValues.Key);
        if (field != null)
        {
            Results = Results.Union(Potential.Where(x =>
                localSelected.Contains(field.GetValue(x))));
        }
    }
}
No, your use of Union is absolutely correct.
The only thing to keep in mind is that it excludes duplicates based on the equality comparer. Do you have sample data?
Okay, I think you are having a problem because Union uses deferred execution.
What happens if you do,
var unionResults = Results.Union(matched).ToList();
Results = unionResults;

Remove KeyValue from ObservableCollection most effective way?

I have the following code to remove a group from the collection. Technically there should be no duplicates, but it removes them all anyway. Is there any trick with LINQ to do .Remove.Where..?
public void DeleteGroup(KeyValuePair<int, string> group)
{
    while (this.Groups.Any(g => g.Key.Equals(group.Key)))
    {
        var groupToRemove = this.Groups.First(g => g.Key.Equals(group.Key));
        this.Groups.Remove(groupToRemove);
    }
}
Assuming you are passing in a KeyValuePair with the same Key and the same Value, this is the most efficient way possible with an ObservableCollection.
public void DeleteGroup2(KeyValuePair<int, string> group)
{
    Groups.Remove(group);
}
This works because KeyValuePair is a structure, and the default equality comparison applied by Remove compares both the Key and the Value data members of the structure.
Again, this will work just fine if you pass in the exact same Key and Value that is contained in the Groups ObservableCollection; if the Value does not match, it will not work.
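A quick illustration of that equality behavior (a minimal sketch):

var groups = new ObservableCollection<KeyValuePair<int, string>>
{
    new KeyValuePair<int, string>(1, "alpha")
};

// Same Key but a different Value: nothing is removed, Remove returns false.
bool removedByKeyOnly = groups.Remove(new KeyValuePair<int, string>(1, "beta"));  // false

// Same Key and same Value: the item is removed, Remove returns true.
bool removedExact = groups.Remove(new KeyValuePair<int, string>(1, "alpha"));     // true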
Behind the scenes an ObservableCollection is pretty much a list, so it will have to iterate over every item performing the equality check. The same is true for the code you posted; just because it uses LINQ doesn't mean it's any more efficient. It's not like the LINQ Where clause is using any indexing the way it would with LINQ to SQL.
public void DeleteGroup3(KeyValuePair<int, string> groupToDelete)
{
    var itemsToDelete =
        (
            from g in Groups
            where g.Key == groupToDelete.Key
            select g
        ).ToList(); // materialize first, so Groups isn't modified while enumerating it

    foreach (var kv in itemsToDelete)
    {
        Groups.Remove(kv);
    }
}
This would probably be the most efficient method using LINQ if you want to guarantee that you remove all items, even those with duplicate keys.
public void DeleteGroup4(KeyValuePair<int, string> group)
{
    List<int> keyIndexes = new List<int>();
    int maxIndex = Groups.Count;
    for (int i = 0; i < maxIndex; i++)
    {
        if (Groups[i].Key == group.Key)
        {
            keyIndexes.Add(i);
        }
    }

    int indexOffset = 0;
    foreach (int index in keyIndexes)
    {
        Groups.RemoveAt(index - indexOffset);
        indexOffset++;
    }
}
This should have the best performance of all of them if you have multiple items with the same key or you don't know the exact same Key Value pair as the original.
I believe your DeleteGroup method is Big O of 2N²: the outer Any while-loop is N, and inside it First is N and Remove is N. Take the outer loop times the sum of the inside and you get 2N².
DeleteGroup2 is Big O of N and has the best performance of all of them. The drawback is that you need to know both the Key and the Value, not just the Key. It will also only remove the first item it finds; it won't delete duplicate items with the same Key and the same Value.
DeleteGroup3 is Big O of N + N²: N for the select, and in the worst case your key is in there N times, so N² for the removal.
DeleteGroup4 is Big O of 2N: N to find the indexes, and in the worst case, if all items have the same key, N to remove each of them, as RemoveAt is Big O of 1. This has the best performance if you only know the Key and you may have multiple items with the same Key.
If you know for a fact that you won't have duplicate items, I would use DeleteGroup2. If duplicates are possible, DeleteGroup4 should have the best performance.
On a side note, if you won't have duplicates and you don't necessarily know both the Key and the Value, you can still use the best performing option of DeleteGroup2, but create a class called KeyValueIntString with properties Key and Value. Then override the Equals method so that it only compares the Key property, unlike the KeyValuePair struct which compares both the Key and the Value data members. Then you can use the ObservableCollection.Remove method without having to know the stored value, i.e. you could pass in an instance of KeyValueIntString that has the Key set but leave the Value property unset.
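Here is a rough sketch of that idea (hypothetical class, overriding Equals and GetHashCode so that only Key participates in equality; GetHashCode must stay consistent with Equals):

public class KeyValueIntString
{
    public int Key { get; set; }
    public string Value { get; set; }

    public override bool Equals(object obj)
    {
        var other = obj as KeyValueIntString;
        return other != null && other.Key == Key; // compare Key only, ignore Value
    }

    public override int GetHashCode()
    {
        return Key;
    }
}

// Usage: removes the first item whose Key matches, regardless of its Value.
// Groups.Remove(new KeyValueIntString { Key = 5 });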
After commenting I decided to add the most readable method, although it has worse performance: Big O of N². The select, ToList and ForEach passes are each N, and the Remove inside the ForEach is N per iteration, which dominates.
public void DeleteGroup5(KeyValuePair<int, string> groupToDelete)
{
    (
        from g in Groups
        where g.Key == groupToDelete.Key
        select g
    ).ToList().ForEach(g => Groups.Remove(g));
}

List of classes question c#

I have a class containing many variables, something like this:
class test
{
    internal int x, y;
    internal string z;
}
I created a list of this class: List<test> c.
I want to do the following:
test if all the list items contain the same x
get the list's item that has z = "try"
I need a quick and fast way instead of iterating through all the items.
Any suggestions, please?
LINQ to Objects is your friend. For the first:
bool allSameX = list.All(t => t.x == list[0].x);
For the second:
test firstTry = list.First(t => t.z == "try");
// Or, if the element might not be present:
test firstTryOrNull = list.FirstOrDefault(t => t.z == "try");
The first one depends on there being at least one value of course. Alternatives might be:
bool allSameX = !list.Select(t => t.x)
                     .Distinct()
                     .Skip(1)
                     .Any();
In other words, once you've gone past the first distinct value of x, there shouldn't be any more. One nice aspect of this is that as soon as it spots the second distinct value, it will stop looking - as does the first line (the All version) of course.
LINQ is wonderfully flexible, and well worth looking into closely.
EDIT: If you need to do the latter test ("find an element with a particular value for z") for multiple different values, you might want a dictionary or a lookup, e.g.
// If there are duplicate z values
var lookup = list.ToLookup(t => t.z);

// If z values are distinct
var dictionary = list.ToDictionary(t => t.z);
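A quick usage sketch for both (assuming the test class from the question):

// Lookup: yields zero or more matches, and an empty sequence for an absent key
foreach (var item in lookup["try"])
{
    // handle each item whose z == "try"
}

// Dictionary: exactly one item per key; TryGetValue avoids a KeyNotFoundException
test match;
if (dictionary.TryGetValue("try", out match))
{
    // use match
}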
Without some pre-work, there's no way of performing the queries you want without iterating over at least some of the list.
You can use LINQ. Here is a link to small examples that will help you a lot in the future too: http://msdn.microsoft.com/en-us/vcsharp/aa336746
You could implement a custom collection class instead of a list and put the search smarts into it (see the sketch after this list), e.g.:
add a method AllItemsHaveSameX() and a private bool field allItemsHaveSameX;
expose a dictionary keyed by the search strings, mapping to the item that has that value.
When adding/removing items you would:
re-evaluate allItemsHaveSameX;
add/remove from your private dictionary.
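A rough sketch of that idea (hypothetical names; it stores the item itself rather than its index to keep the bookkeeping simple, and omits SetItem/ClearItems overrides for brevity):

public class TestCollection : Collection<test>
{
    private readonly Dictionary<string, test> byZ = new Dictionary<string, test>();
    private bool allItemsHaveSameX = true;

    public bool AllItemsHaveSameX() { return allItemsHaveSameX; }

    public test FindByZ(string z)
    {
        test item;
        return byZ.TryGetValue(z, out item) ? item : null;
    }

    protected override void InsertItem(int index, test item)
    {
        base.InsertItem(index, item);
        byZ[item.z] = item;
        if (Count > 1)
        {
            // Compare against any pre-existing item (at 0, or at 1 if we inserted at 0)
            var existing = this[index == 0 ? 1 : 0];
            if (item.x != existing.x) allItemsHaveSameX = false;
        }
    }

    protected override void RemoveItem(int index)
    {
        byZ.Remove(this[index].z);
        base.RemoveItem(index);
        // Removing an item can restore the invariant, so re-check (O(N) here):
        allItemsHaveSameX = Count == 0 || this.All(t => t.x == this[0].x);
    }
}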
