I have a pretty standard block of code that compares two lists and returns the items from the first list that have no match in the second list (the "unmatched" query).
But when both lists are large-ish, the unmatched query takes 3 minutes to execute.
// names.Count ~ 91k
var names = xxxxx.ToList();
// namePhonetics.Count ~ 91k
var namePhonetics = yyyyy.ToList();
// this line takes 3 minutes to run
var namesMissingPhonetics = names.Where(n => !namePhonetics.Any(np => np.NameId == n.Id)).ToList();
What can I do to increase the performance of this?
Try this:
var namePhoneticsDict = yyyyy.ToDictionary(x => x.NameId, x => x);
var namesMissingPhonetics = names.Where(n => !namePhoneticsDict.ContainsKey(n.Id)).ToList();
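Note that ToDictionary throws if two phonetics share the same NameId. If duplicates are possible, a HashSet is enough for a pure membership test - a sketch, assuming Id/NameId are ints:
// Build a set of the NameIds that have phonetics (duplicates are simply ignored),
// then keep the names whose Id is not in the set. Contains is O(1).
var phoneticIds = new HashSet<int>(namePhonetics.Select(np => np.NameId));
var namesMissingPhonetics = names.Where(n => !phoneticIds.Contains(n.Id)).ToList();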
I have to process large CSV files (up to tens of GB) that look like this:
Key,CompletedA,CompletedB
1,true,NULL
2,true,NULL
3,false,NULL
1,NULL,true
2,NULL,true
I have a parser that yields parsed lines as IEnumerable<Record>, so that it reads only one line at a time into memory.
Now I have to group records by Key and check whether the columns CompletedA and CompletedB have a value somewhere within the group. On the output I need the records whose group does not have both CompletedA and CompletedB.
In this case it is the record with key 3.
However, there are many similar computations running over the same dataset, and I don't want to iterate over it multiple times.
I think I can convert the IEnumerable into an IObservable and use Reactive Extensions to find the records.
Is it possible to do this in a memory-efficient way with a simple LINQ expression over the IObservable collection?
Provided that Key is an integer, we can try using a Dictionary and a single scan:
// value: 0b00 - neither A nor B
//        0b01 - A only
//        0b10 - B only
//        0b11 - both A and B
Dictionary<int, byte> Status = new Dictionary<int, byte>();

var query = File
    .ReadLines(@"c:\MyFile.csv")
    .Where(line => !string.IsNullOrWhiteSpace(line))
    .Skip(1) // skip the header line
    .Select(line => YourParserHere(line));

foreach (var record in query) {
    int mask = (record.CompletedA != null ? 1 : 0) |
               (record.CompletedB != null ? 2 : 0);

    if (Status.TryGetValue(record.Key, out var value))
        Status[record.Key] = (byte)(value | mask);
    else
        Status.Add(record.Key, (byte)mask);
}
// All keys whose value is not 3 == 0b11, i.e. groups missing A, B, or both
var missingAorB = Status
    .Where(pair => pair.Value != 3)
    .Select(pair => pair.Key);
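With the sample data above, Status ends up as { 1: 3, 2: 3, 3: 1 }, so only key 3 is reported. The whole thing is a single O(n) pass that stores just one byte per distinct key.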
I think this will do what you need:
var result =
    source
        .GroupBy(x => x.Key)
        .SelectMany(xs =>
            (xs.Any(x => x.CompletedA == true) && xs.Any(x => x.CompletedB == true))
                ? new List<Record>()
                : xs.ToList());
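One caveat: LINQ-to-Objects GroupBy buffers the entire source before yielding any group, so this materializes the whole dataset in memory - worth keeping in mind for tens-of-GB files.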
Using Rx doesn't help here.
Yes, the Rx library is well suited for this kind of synchronous enumerate-once/calculate-many operation. You can use a Subject<Record> as the one-to-many propagator: attach the various Rx operators to it, feed it the records from the source enumerable, and finally collect the results from the attached operators, which by then will have completed. Here is the basic pattern:
IEnumerable<Record> source = GetRecords();
var subject = new Subject<Record>();
var task1 = SomeRxTransformation1(subject);
var task2 = SomeRxTransformation2(subject);
var task3 = SomeRxTransformation3(subject);
source.ToObservable().Subscribe(subject); // This line does all the work
var result1 = task1.Result;
var result2 = task2.Result;
var result3 = task3.Result;
SomeRxTransformation1, SomeRxTransformation2, etc. are methods that accept an IObservable<Record> and return some generic Task. Their signatures should look like this:
Task<TResult> SomeRxTransformation1(IObservable<Record> source);
For example, the special grouping you want to do will require a transformation like the following:
Task<Record[][]> GroupByKeyExcludingSomeGroups(IObservable<Record> source)
{
    return source
        .GroupBy(record => record.Key)
        .Select(grouped => grouped.ToArray())
        .Merge()
        .Where(array => !(array.Any(r => r.CompletedA != null)
                       && array.Any(r => r.CompletedB != null)))
        .ToArray()
        .ToTask();
}
When you incorporate it into the pattern, it will look like this:
Task<Record[][]> task1 = GroupByKeyExcludingSomeGroups(subject);
source.ToObservable().Subscribe(subject); // This line does all the work
Record[][] result1 = task1.Result;
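For reference, the pattern above relies on these namespaces from the System.Reactive NuGet package:
using System.Reactive.Linq;             // ToObservable, GroupBy, Merge, ToArray
using System.Reactive.Subjects;         // Subject<T>
using System.Reactive.Threading.Tasks;  // ToTask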
What I have in my DataTable:
Resource
1 // 1 represent normal
1
2 // 2 represent sql
2
3 // 3 css
4 // 4 unicode
4
4
How can I perform a calculation so that I can display the values in a textbox like this?
normal 2
sql 2
css 1
unicode 3
total hits 9
What I've tried so far:
var result = my_datatable.AsEnumerable().Sum(x => Convert.ToInt32(x["Resource"]));
string result2 = result.ToString();
totalTxtBox.Text = result2;
But it sums the whole column (the output is 24 instead of 9).
Use the following example:
int[] res = { 1, 1, 2, 2, 3, 4, 4, 4 };
var words = res.GroupBy(x => x);

foreach (var x in words)
{
    Console.WriteLine(x.Key + "-" + x.Count());
}
It will print this output:
1-2
2-2
3-1
4-3
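To get the labelled output from your question, you can map codes to names and count rows - a sketch, where the label dictionary is an assumption based on the comments in your data:
// Hypothetical code-to-name map; adjust to your real resource types.
var labels = new Dictionary<int, string>
{
    { 1, "normal" }, { 2, "sql" }, { 3, "css" }, { 4, "unicode" }
};
int[] res = { 1, 1, 2, 2, 3, 4, 4, 4 };
foreach (var g in res.GroupBy(x => x).OrderBy(g => g.Key))
{
    Console.WriteLine(labels[g.Key] + " " + g.Count());
}
// Total hits is the row count, not the sum of the codes.
Console.WriteLine("total hits " + res.Length);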
You can try to use Distinct():
var result = my_datatable.AsEnumerable().Distinct().Sum(x => Convert.ToInt32(x["Resource"]));
Distinct() returns distinct elements from a sequence, using the default equality comparer to compare values.
For more info: https://msdn.microsoft.com/en-us/library/bb348436(v=vs.110).aspx
You can use this example as a reference. First, I get the distinct values from the DataTable, convert them to a list, and then use the Sum function.
Updated Answer:
DataTable my_datatable = new DataTable();
my_datatable.Columns.Add("Value", typeof(int));
my_datatable.Columns.Add("Type", typeof(string));
my_datatable.Rows.Add(1, "Normal");
my_datatable.Rows.Add(1, "Normal");
my_datatable.Rows.Add(2, "SQL");
my_datatable.Rows.Add(2, "SQL");
my_datatable.Rows.Add(3, "CSS");
my_datatable.Rows.Add(4, "UNICODE");
my_datatable.Rows.Add(4, "UNICODE");
my_datatable.Rows.Add(4, "UNICODE");
var distinctIds = my_datatable.AsEnumerable()
    .Select(s => new {
        value = s.Field<int>("Value"),
    })
    .Distinct().ToList();

int total = distinctIds.Sum(item => item.value);
I figured it out myself.
Use this LINQ query to filter for the value you want:
int normalcount = my_datatable
.AsEnumerable()
.Where(r => r.Field<string>("Resource") == "1")
.Count();
Change the filter to count a different value, according to your column values:
.Where(r => r.Field<string>("Resource") == "2")
.Count();
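As an aside, the Where(...).Count() pair can be collapsed into a single Count that takes a predicate:
int normalcount = my_datatable
    .AsEnumerable()
    .Count(r => r.Field<string>("Resource") == "1");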
I have a List<foo> that has two properties:
start_time and end_time
Assume that we have 100 records in that list. How can I check whether all intervals are of equal length? In other words, I'd like to know if the difference between end_time and start_time is the same for all foo objects.
Is it possible to achieve this in a single LINQ line?
Thanks, appreciate it.
Sure, you can write something like this:
var list = new List<foo>();
var areAllEqual = list.GroupBy(l => (l.end_time - l.start_time)).Count() == 1;
Alternatively, if you want to do more with that information:
var differences = list.GroupBy(l => (l.end_time - l.start_time)).ToList();
var numDifferences = differences.Count();
var areAllEqual = numDifferences == 1;
var firstDifference = differences.First().Key;
var allDifferences = differences.Select(g => g.Key);
Something like this should work:
var first = items.First();
var duration = first.end_time - first.start_time;
var allEqual = items.All(i => duration == (i.end_time - i.start_time));
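If you want it as a single expression that also survives an empty list (First() throws on an empty sequence), a sketch assuming items is a List<foo>:
var allEqual = items.Count == 0
    || items.All(i => (i.end_time - i.start_time) == (items[0].end_time - items[0].start_time));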
I have around 200K records in a list, and I'm looping through them to form another collection. This works fine on my local 64-bit Win 7 machine, but when I move it to a Windows Server 2008 R2, it takes a lot longer. There is a difference of almost an hour!
I tried looking at Compiled Queries and am still figuring it out.
For various reasons, we can't do a database join and retrieve the child values.
Here is the code:
//listOfDetails is another collection
List<SomeDetails> myDetails = null;
foreach (CustomerDetails myItem in customerDetails)
{
    var myList = from ss in listOfDetails
                 where ss.CustomerNumber == myItem.CustomerNum
                    && ss.ID == myItem.ID
                 select ss;

    myDetails = myList.ToList();
    myItem.SomeDetails = myDetails;
}
I would do this differently:
var lookup = listOfDetails.ToLookup(x => new { x.CustomerNumber, x.ID });
foreach (var item in customerDetails)
{
    var key = new { CustomerNumber = item.CustomerNum, item.ID };
    item.SomeDetails = lookup[key].ToList();
}
The big benefit of this code is that it only has to loop through the listOfDetails once to build the lookup - which is nothing more than a hash map. After that we just get the values using the key, which is very fast as that is what hash maps are built for.
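Anonymous types also make good lookup keys: the compiler generates Equals and GetHashCode that compare each property, so two separately created instances with the same CustomerNumber and ID are equal keys. Just make sure the property names match on both sides, which is why the key above spells out CustomerNumber = item.CustomerNum.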
I don't know why you have the difference in performance, but you should be able to make that code perform better.
//listOfDetails is another collection
List<SomeDetails> myDetails = ...;
var detailsGrouped = myDetails.ToLookup(x => new { x.CustomerNumber, x.ID });

foreach (CustomerDetails myItem in customerDetails)
{
    var myList = detailsGrouped[new { CustomerNumber = myItem.CustomerNum, myItem.ID }];
    myItem.SomeDetails = myList.ToList();
}
The idea here is to avoid the repeated looping on myDetails, and build a hash based lookup instead. Once that is built, it is very cheap to do a lookup.
The inner ToList() is forcing an evaluation on each loop iteration, which has got to hurt. SelectMany might let you avoid the ToList, something like this:
var details = customerDetails
    .SelectMany( item => listOfDetails
        .Where( detail => detail.CustomerNumber == item.CustomerNum )
        .Where( detail => detail.ID == item.ID ) );
If you first get all the SomeDetails and then assign them to the items, it might speed up. Or it might not. You should really profile to see where the time is being taken.
I think you'd probably benefit from a join here, so:
var mods = customerDetails
    .Join(
        listOfDetails,
        x => Tuple.Create(x.ID, x.CustomerNum),
        x => Tuple.Create(x.ID, x.CustomerNumber),
        (a, b) => new { custDet = a, listDet = b })
    .GroupBy(x => x.custDet)
    .Select(g => new { custDet = g.Key, items = g.Select(x => x.listDet).ToList() });

foreach (var mod in mods)
{
    mod.custDet.SomeDetails = mod.items;
}
I didn't compile this code...
With a join, the matching of items from one list against another is done by building a hashtable-like collection (a Lookup) of the second list in O(n) time. Then it's a matter of iterating the first list and pulling items from the Lookup. As pulling data from a hashtable is O(1), the iterate/match phase also takes only O(n), as does the subsequent GroupBy. So in all the operation should take ~O(3n), which is equivalent to O(n), where n is the length of the longer list.
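One caveat: an inner Join silently drops customers that have no matching details. If those should end up with an empty list instead, a GroupJoin variant would do it - a sketch, using the same assumed properties as above:
var pairs = customerDetails
    .GroupJoin(
        listOfDetails,
        c => Tuple.Create(c.ID, c.CustomerNum),
        d => Tuple.Create(d.ID, d.CustomerNumber),
        (c, ds) => new { Customer = c, Details = ds.ToList() });

foreach (var p in pairs)
{
    // Details is empty (not missing) when nothing matched.
    p.Customer.SomeDetails = p.Details;
}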
I am trying to parse a 500K text file.
This is more of a learning exercise - I know there are other ways to get my result.
I may be using LINQ incorrectly, as I'm still a bit new to it.
My PC is fast.
I'm sure I'm making one of the "classic errors" here - so my QUESTION IS: which one is it, and can I correct my logic, or is this a bad fit for LINQ altogether?
var lines = File.ReadAllLines(@"C:\Users\aanodide\Desktop\APIUserGuide.txt");

// add line numbers
var qa = lines
    .Select((c, i) => new
    {
        i = i,
        c = c
    });

var qb = qa.Skip(2312); // defs start at > 2312
var qc = qb.Where( c => Regex.IsMatch(c.c, @"(\w+): ([a-zA-Z])?(.*)") );
var qd = qc.Where( c => c.c.StartsWith("API Name:") );
var qd_desc = qc.Where( c => c.c.StartsWith("Description:") ).Select( d => d.i );

var qe = qd.Select( c => new {
    i = c.i,
    c = c.c,
    d = qd_desc.First(e => e > c.i) // --> IF I COMMENT THIS OUT, IT RUNS FAST, IN A FRACTION OF A SECOND <--
});
// Take(1) -> .013s
// Take(10) -> .070s
// Take(20) -> .446s
// Take(40) -> 1.63s
// Take(80) -> 6.49s
foreach (var element in qe.Take(50))
{
Console.WriteLine (element.i);
}
As Mark noted, the whole query is re-evaluated when you call First(). And First() is called once for every item in qd - meaning the entire qc chain, regex matching included, is re-run for every item in qd.
To fix it, you can ToList() qd_desc and call First() on that list. Then it will only be evaluated once.
i.e.
var qd_desc = qc.Where( c => c.c.StartsWith("Description:") ).Select( d => d.i ).ToList();
The whole query is executed (i.e., the sequence is iterated) when you call First(). When you call Skip(), Where(), Select(), or any other operator that returns an IEnumerable, the query won't execute until you iterate over the sequence. The reason it happens on First() (or any operator that returns a single item) is that you're demanding a specific item right then, so the query has to run to produce that result.
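A minimal illustration of that deferral:
var numbers = new List<int> { 1, 2, 3 };
var query = numbers.Where(n => { Console.WriteLine("checking " + n); return n > 1; });
// Nothing has printed yet - Where is deferred.
var first = query.First();  // prints "checking 1", "checking 2", then stops
var again = query.First();  // re-runs the whole pipeline from the start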