Applying "where" clause (LINQ) on even rows only [C#] - c#

I have a string having two different types of data in alternating rows (i.e. two rows make one record). I want to select only those records where length of 2nd (i.e. even row) is less than 1000.
I have tried this, but it selects only the even rows and discards the odd rows:
var lessthan1000Length = recordsFile.Where((src, index) => src.Length<1000 && index%2 != 0);
Sample data from recordsFile
2012-12-04 | 10:45 AM | Lahore
Added H2SO4 in the solution. Kept it in the lab temperature for 10 minutes
2012-12-04 | 10:55 AM | Lahore
Observed the pH of the solution.
2012-12-04 | 11:20 AM | Lahore
Neutralized the solution to maintain the pH in 6-8 range
Thanks for your guidance.
P.S: Kindly note that the results are required in the form of List<string> as we have to make a new dataset from it.

var odds = recordsFile.Where((str, index) => index % 2 == 0);
var evens = recordsFile.Where((str, index) => index % 2 == 1);
var records = odds.Zip(evens, (odd, even) => new { odd, even })
.Where(pair => pair.even.Length < 1000);
foreach (var record in records)
Console.WriteLine(record);
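Since the question asks for a List<string>, here is a minimal sketch of the same Zip idea that flattens the surviving pairs back into lines. The sample lines come from the question; maxLength is a hypothetical stand-in for the 1000-character limit, lowered so that the filter visibly drops a record.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sample lines from the question; in practice these would come from recordsFile.
var recordsFile = new List<string>
{
    "2012-12-04 | 10:45 AM | Lahore",
    "Added H2SO4 in the solution. Kept it in the lab temperature for 10 minutes",
    "2012-12-04 | 10:55 AM | Lahore",
    "Observed the pH of the solution.",
};

// Hypothetical limit (the question uses 1000), lowered for the demo.
const int maxLength = 40;

var headers = recordsFile.Where((line, index) => index % 2 == 0);
var bodies = recordsFile.Where((line, index) => index % 2 == 1);

List<string> kept = headers
    .Zip(bodies, (header, body) => new { header, body })
    .Where(pair => pair.body.Length < maxLength)
    .SelectMany(pair => new[] { pair.header, pair.body }) // back to one line per entry
    .ToList();

kept.ForEach(Console.WriteLine);
```

Only the second record is printed here, because its second line is shorter than the demo limit.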

List<string> result = recordsFile
.Select( (str, index) => new {str, index})
.GroupBy(x => x.index / 2, x => x.str)
.Where(g => g.Last().Length < 1000)
.Select(g => g.First() + g.Last())
.ToList();

Alexander's answer seems to work fine.
Alternatively, you can create a method to turn a sequence (with an even number of terms) into a sequence of pairs. I guess something like:
static IEnumerable<Tuple<T, T>> PairUp<T>(this IEnumerable<T> src)
{
using (var e = src.GetEnumerator())
{
while (e.MoveNext())
{
var first = e.Current;
if (!e.MoveNext())
throw new InvalidOperationException("Count of source must be even"); // OR: yield break; OR yield return Tuple.Create(first, default(T)); yield break;
var second = e.Current;
yield return Tuple.Create(first, second);
}
}
}
With that you could do recordsFile.PairUp().Where(t => t.Item2.Length < 1000) or similar.
Edit: Since you want the two "parts" concatenated as strings, that would be recordsFile.PairUp().Where(t => t.Item2.Length < 1000).Select(t => t.Item1 + t.Item2).

If you use Microsoft's Reactive Framework team's "Interactive Extensions" you get a nice extension method that can help you.
var query =
from pair in lines.Buffer(2)
where pair[1].Length < 1000
select pair;
var results = query.ToList();
From your sample data I get this:
Just NuGet "Ix-Main" to get the extension methods - there are a lot more there than just .Buffer and many of them are super useful.
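If an extra package is not an option, the built-in Enumerable.Chunk method (available from .NET 6) splits a sequence into arrays of a fixed size in much the same way. The following is only a sketch under that assumption; the sample lines and the limit of 20 are illustrative, not taken from the question.

```csharp
using System;
using System.Linq;

string[] lines =
{
    "header 1", "short body",
    "header 2", "a considerably longer body line",
};

// Enumerable.Chunk requires .NET 6 or later.
// The limit of 20 is an illustrative stand-in for the question's 1000.
var shortRecords = lines
    .Chunk(2)   // each chunk is a string[] holding up to 2 lines
    .Where(pair => pair.Length == 2 && pair[1].Length < 20)
    .ToList();

foreach (var pair in shortRecords)
    Console.WriteLine($"{pair[0]} / {pair[1]}");
```

The pair.Length == 2 check guards against a trailing unpaired line when the input has an odd number of rows.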

Related

How to detect lines that are unique in large file using Reactive Extensions

I have to process large CSV files (up to tens of GB), that looks like this:
Key,CompletedA,CompletedB
1,true,NULL
2,true,NULL
3,false,NULL
1,NULL,true
2,NULL,true
I have a parser that yields parsed lines as IEnumerable<Record>, so that I read only one line at a time into memory.
Now I have to group records by Key and check whether the columns CompletedA and CompletedB have a value within the group. The output should contain the records whose group does not have both CompletedA and CompletedB.
In this case it is the record with key 3.
However, many similar processing steps run over the same dataset, and I don't want to iterate over it multiple times.
I think I can convert the IEnumerable into an IObservable and use Reactive Extensions to find the records.
Is it possible to do it in a memory-efficient way with a simple LINQ expression over the IObservable collection?
Providing that Key is an integer we can try using a Dictionary and one scan:
// value: 0b00 - neither A nor B
// 0b01 - A only
// 0b10 - B only
// 0b11 - Both A and B
Dictionary<int, byte> Status = new Dictionary<int, byte>();
var query = File
.ReadLines(@"c:\MyFile.csv")
.Where(line => !string.IsNullOrWhiteSpace(line))
.Skip(1) // skip header
.Select(line => YourParserHere(line));
foreach (var record in query) {
int mask = (record.CompletedA != null ? 1 : 0) |
(record.CompletedB != null ? 2 : 0);
if (Status.TryGetValue(record.Key, out var value))
Status[record.Key] = (byte) (value | mask);
else
Status.Add(record.Key, (byte) mask);
}
// All keys whose value is not 3 == 0b11 (i.e. keys missing A, B, or both)
var missingAorB = Status
.Where(pair => pair.Value != 3)
.Select(pair => pair.Key);
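To make the one-scan idea easy to try without a CSV file, here is a self-contained sketch over the sample rows from the question. The tuple shape (Key, CompletedA, CompletedB) is a hypothetical stand-in for the parsed Record, and the sketch assumes that "completed" means the value is true rather than merely non-null.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the parser output: (Key, CompletedA, CompletedB).
var records = new (int Key, bool? CompletedA, bool? CompletedB)[]
{
    (1, true, null), (2, true, null), (3, false, null),
    (1, null, true), (2, null, true),
};

// Bit 0 is set once CompletedA is true for a key, bit 1 once CompletedB is.
// Assumption: only true counts as completed (null and false do not).
var status = new Dictionary<int, byte>();
foreach (var record in records)
{
    int mask = (record.CompletedA == true ? 1 : 0)
             | (record.CompletedB == true ? 2 : 0);
    status.TryGetValue(record.Key, out byte seen); // seen stays 0 for a new key
    status[record.Key] = (byte)(seen | mask);
}

// Keys whose bits never reached 0b11 are missing A, B, or both.
var incompleteKeys = status
    .Where(pair => pair.Value != 3)
    .Select(pair => pair.Key)
    .ToList();

Console.WriteLine(string.Join(", ", incompleteKeys));
```

Memory use grows with the number of distinct keys rather than the number of rows, which is the point of keeping one byte per key.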
I think this will do what you need:
var result =
source
.GroupBy(x => x.Key)
.SelectMany(xs =>
(xs.Select(x => x.CompletedA).Any(x => x != null && x == true) && xs.Select(x => x.CompletedB).Any(x => x != null && x == true))
? new List<Record>()
: xs.ToList());
Using Rx doesn't help here.
Yes, the Rx library is well suited for this kind of synchronous enumerate-once/calculate-many operation. You can use a Subject<Record> as the one-to-many propagator, then you should attach various Rx operators to it, then you should feed it with the records from the source enumerable, and finally you'll collect the results from the attached operators that will now be completed. Here is the basic pattern:
IEnumerable<Record> source = GetRecords();
var subject = new Subject<Record>();
var task1 = SomeRxTransformation1(subject);
var task2 = SomeRxTransformation2(subject);
var task3 = SomeRxTransformation3(subject);
source.ToObservable().Subscribe(subject); // This line does all the work
var result1 = task1.Result;
var result2 = task2.Result;
var result3 = task3.Result;
The SomeRxTransformation1, SomeRxTransformation2 etc are methods that accept an IObservable<Record>, and return some generic Task. Their signature should look like this:
Task<TResult> SomeRxTransformation1(IObservable<Record> source);
For example the special grouping you want to do will require a transformation like the following:
Task<Record[][]> GroupByKeyExcludingSomeGroups(IObservable<Record> source)
{
return source
.GroupBy(record => record.Key)
.Select(grouped => grouped.ToArray())
.Merge()
.Where(array => array.All(r => !r.CompletedA && !r.CompletedB))
.ToArray()
.ToTask();
}
When you incorporate it into the pattern, it will look like this:
Task<Record[][]> task1 = GroupByKeyExcludingSomeGroups(subject);
source.ToObservable().Subscribe(subject); // This line does all the work
Record[][] result1 = task1.Result;

Partitioning observables in C#

I'm looking for some way of splitting an observable sequence into separate sequences that I can process independently based on a given predicate. Something like this would be ideal:
var (evens, odds) = observable.Partition(x => x % 2 == 0);
var strings = evens.Select(x => x.ToString());
var floats = odds.Select(x => x / 2.0);
The closest I've been able to come up with is doing two where filters, but that requires evaluating the condition and processing the source sequence twice, which I'm not wild about.
observable = observable.Publish().RefCount();
var strings = observable.Where(x => x % 2 == 0).Select(x => x.ToString());
var floats = observable.Where(x => x % 2 != 0).Select(x => x / 2.0);
F# seems to have good support for this with Observable.partition<'T> and Observable.split<'T,'U1,'U2>, but I've not been able to find anything equivalent for C#.
A GroupBy may remove the "observe twice" restriction, though you'll still end up with Where clauses:
public static class X
{
public static (IObservable<T> trues, IObservable<T> falsies) Partition<T>(this IObservable<T> source, Func<T, bool> partitioner)
{
var x = source.GroupBy(partitioner).Publish().RefCount();
var trues = x.Where(g => g.Key == true).Merge();
var falsies = x.Where(g => g.Key == false).Merge();
return (trues, falsies);
}
}
How about something like
var (odds,evens) = (collection.Where(a=> a % 2 == 1), collection.Where(a=> a % 2 == 0));?
or if you want to partition based on one condition
Func<int,bool> predicate = a => a%2==0;
var (odds,evens) = (collection.Where(a=> !predicate(a)), collection.Where(a=> predicate(a)));
I don't think there is a way around iterating the items twice with this approach. The alternative is a method that accepts a predicate and the two separate collections and populates them in a single pass with a foreach or for loop.
Something like this:
var collection = new[] { 1, 2, 3, 4, 5, 6, 7, 8, 9};
Func<int,bool> predicate = a => a%2==0;
var odds = new List<int>();
var evens = new List<int>();
Action<List<int>, List<int>, Func<int, bool>> partition = (collection1, collection2, pred) =>
{
foreach (int element in collection)
{
if (pred(element))
{
collection1.Add(element);
}
else
{
collection2.Add(element);
}
}
};
partition(evens, odds, predicate);
Expanding on the last idea, are you looking for something like this?
public static (ObservableCollection<T>, ObservableCollection<T>) Partition<T>(this ObservableCollection<T> collection, Func<T, bool> predicate)
{
var collection1 = new ObservableCollection<T>();
var collection2 = new ObservableCollection<T>();
foreach (T element in collection)
{
if (predicate(element))
{
collection1.Add(element);
}
else
{
collection2.Add(element);
}
}
return (collection1, collection2);
}
Warming the source sequence by using the RefCount operator is not a good idea, because the source sequence may start emitting elements before all subscriptions to derived sequences are in place. In that case some of the emitted elements could be lost. A safer approach is to postpone warming the source sequence until all observers have been subscribed. Here is an example of how to do it:
var published = observable.Publish(); // Make sure not to warm it too early
var strings = published.Where(x => x % 2 == 0).Select(x => x.ToString());
var floats = published.Where(x => x % 2 != 0).Select(x => x / 2.0);
strings.Subscribe(x => Console.WriteLine(x));
floats.Subscribe(x => Console.WriteLine(x));
published.Connect(); // Now that all subscriptions are in place, it's time to warm it
await published; // Wait for the completion of the source sequence
You could make the above code a bit less repetitive by using the LookupObservable<TSource, TKey> class, that is included in an answer of a relevant question. This class was implemented because creating multiple Where subsequences can be quite inefficient, in case the total number of subsequences is large (because each element emitted by the source will be checked for numerous conditions). In your case you have only two subsequences, one for the key true and one for the key false, so using the LookupObservable class is less compelling. In any case, here is a usage example:
var published = observable.Publish(); // Make sure not to warm it too early
var lookup = new LookupObservable<int, bool>(published, x => x % 2 == 0);
var strings = lookup[true].Select(x => x.ToString());
var floats = lookup[false].Select(x => x / 2.0);
strings.Subscribe(x => Console.WriteLine(x));
floats.Subscribe(x => Console.WriteLine(x));
published.Connect(); // Now that all subscriptions are in place, it's time to warm it
await published; // Wait for the completion of the source sequence

How to calculate a running total using linq

I have a linq query result as shown in the image. In the final query (not shown) I am grouping by Year by LeaveType. However I want to calculate a running total for the leaveCarriedOver per type over years. That is, sick LeaveCarriedOver in 2010 becomes "opening" balance for sick leave in 2011 plus the one for 2011.
I have done another query on the shown result list that looks like:
var leaveDetails1 = (from l in leaveDetails
select new
{
l.Year,
l.LeaveType,
l.LeaveTaken,
l.LeaveAllocation,
l.LeaveCarriedOver,
RunningTotal = leaveDetails.Where(x => x.LeaveType == l.LeaveType).Sum(x => x.LeaveCarriedOver)
});
where leaveDetails is the result from the image.
The resulting RunningTotal is not cumulative as expected. How can I achieve my initial goal? I'm open to any ideas; my last resort would be to do it in JavaScript in the front end. Thanks in advance.
The simple implementation is to get the list of possible totals first then get the sum from the details for each of these categories.
Getting the distinct list of Year and LeaveType is a group-by followed by selecting the first element of each group. We group by a Tuple<int, string>, where the int is the year and the string is the LeaveType, and keep the first element of each group.
var distinctList = leaveDetails1.GroupBy(data => new Tuple<int, string>(data.Year, data.LeaveType)).Select(data => data.FirstOrDefault()).ToList();
Then we want a total for each of these elements, so we project that list into the identifying pair (Year and LeaveType) plus the total, which adds a third value to give a Tuple<int, string, int>.
var totals = distinctList.Select(data => new Tuple<int, string, int>(data.Year, data.LeaveType, leaveDetails1.Where(detail => detail.Year == data.Year && detail.LeaveType == data.LeaveType).Sum(detail => detail.LeaveCarriedOver))).ToList();
Reading the line above, you can see that it takes each distinct Year and LeaveType pair, creates a new object that stores the Year and LeaveType for reference, and sets the last int to the Sum of the filtered details for that Year and LeaveType.
If I completely understand what you are trying to do then I don't think I would rely on the built in LINQ operators exclusively. I think (emphasis on think) that any combination of the built in LINQ operators is going to solve this problem in O(n^2) run-time.
If I were going to implement this in LINQ then I would create an extension method for IEnumerable that is similar to the Scan function in reactive extensions (or find a library out there that has already implemented it):
public static class EnumerableExtensions
{
public static IEnumerable<TAccumulate> Scan<TSource, TAccumulate>(
this IEnumerable<TSource> source,
TAccumulate seed,
Func<TAccumulate, TSource, TAccumulate> accumulator)
{
// Validation omitted for clarity.
foreach(TSource value in source)
{
seed = accumulator.Invoke(seed, value);
yield return seed;
}
}
}
Then this should do it around O(n log n) (because of the order by operations):
leaveDetails
.OrderBy(x => x.LeaveType)
.ThenBy(x => x.Year)
.Scan(new {
Year = 0,
LeaveType = "Seed",
LeaveTaken = 0,
LeaveAllocation = 0.0,
LeaveCarriedOver = 0.0,
RunningTotal = 0.0
},
(acc, x) => new {
x.Year,
x.LeaveType,
x.LeaveTaken,
x.LeaveAllocation,
x.LeaveCarriedOver,
RunningTotal = x.LeaveCarriedOver + (acc.LeaveType != x.LeaveType ? 0 : acc.RunningTotal)
});
You don't say, but I assume the data is coming from a database; if that is the case then you could get leaveDetails back already sorted and skip the sorting here. That would get you down to O(n).
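For anyone who wants to try the Scan idea in isolation, here is a self-contained sketch. It uses a local function instead of an extension method and tuples instead of anonymous types so that it fits in one file; the leave rows are made-up sample data, not the rows from the question's image.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical rows: (Year, LeaveType, LeaveCarriedOver).
var leaveDetails = new (int Year, string LeaveType, double LeaveCarriedOver)[]
{
    (2011, "Sick", 2.0), (2010, "Annual", 5.0),
    (2010, "Sick", 3.0), (2011, "Annual", 1.5),
};

var runningTotals = Scan(
        leaveDetails.OrderBy(x => x.LeaveType).ThenBy(x => x.Year),
        (Year: 0, LeaveType: "", RunningTotal: 0.0),
        (acc, x) => (
            Year: x.Year,
            LeaveType: x.LeaveType,
            // Restart the total whenever the leave type changes.
            RunningTotal: x.LeaveCarriedOver
                + (acc.LeaveType == x.LeaveType ? acc.RunningTotal : 0.0)))
    .ToList();

foreach (var row in runningTotals)
    Console.WriteLine($"{row.LeaveType} {row.Year}: {row.RunningTotal}");

// Local-function variant of a Scan operator: yields every intermediate accumulator.
static IEnumerable<TAcc> Scan<TSource, TAcc>(
    IEnumerable<TSource> source, TAcc seed, Func<TAcc, TSource, TAcc> accumulator)
{
    foreach (var value in source)
    {
        seed = accumulator(seed, value);
        yield return seed;
    }
}
```

With this sample data the Annual rows accumulate to 5 and then 6.5, and the Sick rows restart at 3 and reach 5.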
If you don't want to create an extension method (or go find one) then this will achieve the same thing (just in an uglier way).
var temp = new
{
Year = 0,
LeaveType = "Who Cares",
LeaveTaken = 3,
LeaveAllocation = 0.0,
LeaveCarriedOver = 0.0,
RunningTotal = 0.0
};
var runningTotals = (new[] { temp }).ToList();
runningTotals.RemoveAt(0);
foreach(var l in leaveDetails.OrderBy(x => x.LeaveType).ThenBy(x => x.Year))
{
var s = runningTotals.LastOrDefault();
runningTotals.Add(new
{
l.Year,
l.LeaveType,
l.LeaveTaken,
l.LeaveAllocation,
l.LeaveCarriedOver,
RunningTotal = l.LeaveCarriedOver + (s == null || s.LeaveType != l.LeaveType ? 0 : s.RunningTotal)
});
}
This should also be O(n log n) or O(n) if you can pre-sort leaveDetails.
If I understand the question you want something like
decimal RunningTotal = 0;
var results = leaveDetails
.GroupBy(r=>r.LeaveType)
.Select(r=> new
{
Dummy = RunningTotal = 0 ,
results = r.OrderBy(o=>o.Year)
.Select(l => new
{
l.Year,
l.LeaveType ,
l.LeaveAllocation,
l.LeaveCarriedOver,
RunningTotal = (RunningTotal = RunningTotal + l.LeaveCarriedOver )
})
})
.SelectMany(a=>a.results).ToList();
This is basically using the Select<TSource, TResult> overload to calculate the running balance, but first grouped by LeaveType so we can reset the RunningTotal for every LeaveType, and then ungrouped at the end.
You need the SUM window function here, which is not supported by EF Core or earlier versions of EF. So just write the SQL and run it via Dapper:
SELECT
l.Year,
l.LeaveType,
l.LeaveTaken,
l.LeaveAllocation,
l.LeaveCarriedOver,
SUM(l.LeaveCarriedOver) OVER (PARTITION BY l.LeaveType ORDER BY l.Year) AS RunningTotal
FROM leaveDetails l
Or, if you are using EF Core, use package linq2db.EntityFrameworkCore
var leaveDetails1 = from l in leaveDetails
select new
{
l.Year,
l.LeaveType,
l.LeaveTaken,
l.LeaveAllocation,
l.LeaveCarriedOver,
RunningTotal = Sql.Ext.Sum(l.LeaveCarriedOver).Over().PartitionBy(l.LeaveType).OrderBy(l.Year).ToValue()
};
// switch to alternative LINQ translator
leaveDetails1 = leaveDetails1.ToLinqToDB();

How do I use LINQ to find 5 elements in a row that match one predicate, but where the sixth element doesn't?

I'm trying to learn LINQ and it seems that finding a series of 'n' elements that match a predicate should be possible but I can't seem to figure out how to approach the problem.
My solution actually needs a second, different predicate to test the 'end' of the sequence, but finding the first element that doesn't pass a test, after a sequence of at least 5 elements that do pass the test, would also be interesting.
Here is my naive non-LINQ approach....
int numPassed = 0;
for (int i = 0; i < array.Count - 1; i++ )
{
if (FirstTest(array[i]))
{
numPassed++;
}
else
{
numPassed = 0;
}
if ((numPassed > 5) && SecondTest(array[i + 1]))
{
foundindex = i;
break;
}
}
A performant LINQ solution is possible but frankly quite ugly. The idea is to isolate subsequences that match the description (a series of N items matching a predicate that ends when an item is found that matches a second predicate) and then select the first of these that has a minimum length.
Let's say that the parameters are:
var data = new[] { 0, 1, 1, 1, 0, 0, 2, 2, 2, 2, 2 };
Func<int, bool> acceptPredicate = i => i != 0;
// The reverse of acceptPredicate, but could be otherwise
Func<int, bool> rejectPredicate = i => i == 0;
Isolating subsequences is possible with GroupBy and a bunch of ugly stateful code (here's the inherent awkwardness -- you have to keep non-trivial state). The idea is to group by an artificial and arbitrary "group number", choosing a different number whenever we move from a subsequence that might be acceptable to one that definitely is not acceptable and when the reverse happens as well:
var acceptMode = false;
var groupCount = 0;
var groups = data.GroupBy(i => {
if (acceptMode && rejectPredicate(i)) {
acceptMode = false;
++groupCount;
}
else if (!acceptMode && acceptPredicate(i)) {
acceptMode = true;
++groupCount;
}
return groupCount;
});
The last step (finding the first group of acceptable length) is easy, but there is one last pitfall: making sure that you don't select one of the groups that do not satisfy the stated condition:
var result = groups.Where(g => !rejectPredicate(g.First()))
.FirstOrDefault(g => g.Count() >= 5);
All of the above is achieved with a single pass over the source sequence.
Note that this code will accept a sequence of items that also ends the source sequence (i.e. it does not terminate because we found an item that satisfies rejectPredicate but because we ran out of data). If you don't want this a slight modification will be required.
Not elegant, but this will work:
var indexList = array
.Select((x, i) => new
{ Item = x, Index = i })
.Where(item =>
item.Index + 5 < array.Length &&
FirstTest(array[item.Index]) &&
FirstTest(array[item.Index+1]) &&
FirstTest(array[item.Index+2]) &&
FirstTest(array[item.Index+3]) &&
FirstTest(array[item.Index+4]) &&
SecondTest(array[item.Index+5]))
.Select(item => item.Index);
Instead of trying to combine existing extension methods, it is much cleaner to use an enumerator.
Example:
IEnumerable<T> MatchThis<T>(IEnumerable<T> source,
Func<T, bool> first_predicate,
Int32 times_match,
Func<T, bool> second_predicate)
{
var found = new List<T>();
using (var en = source.GetEnumerator())
{
while(en.MoveNext() && found.Count < times_match)
if (first_predicate((T)en.Current))
found.Add((T)en.Current);
else
found.Clear();
if (found.Count < times_match && !en.MoveNext() || !second_predicate((T)en.Current))
return Enumerable.Empty<T>();
found.Add((T)en.Current);
return found;
}
}
Usage:
var valid_seq = new Int32[] {800, 3423, 423423, 1, 2, 3, 4, 5, 200, 433, 32};
var result = MatchThis(valid_seq, e => e<100, 5, e => e>100);
Result:
var result = array.GetSixth(FirstTest).FirstOrDefault(SecondTest);
internal static class MyExtensions
{
internal static IEnumerable<T> GetSixth<T>(this IEnumerable<T> source, Func<T, bool> predicate)
{
var counter=0;
foreach (var item in source)
{
if (counter==5) yield return item;
counter = predicate(item) ? counter + 1 : 0;
}
}
}
It looks like you want six consecutive elements, the first five of which match predicate1 and the last of which (the sixth) matches predicate2. Your non-LINQ version works fine; LINQ is a somewhat awkward fit for this case, and trying to solve the problem in a single LINQ query makes it harder. Here is a (maybe) cleaner LINQ solution:
int continuous = 5;
var temp = array.Select(n => FirstTest(n) ? 1 : 0);
var result = array.Where((n, index) =>
index >= continuous
&& SecondTest(n)
&& temp.Skip(index - continuous).Take(continuous).Sum() == continuous)
.FirstOrDefault();
Things will be easier if you have MoreLINQ's Batch method.
Like others have mentioned, LINQ is not the ideal solution to this kind of pattern matching need. But still, it is possible, and it doesn't have to be ugly:
Func<int, bool> isBody = n => n == 8;
Func<int, bool> isEnd = n => n == 2;
var requiredBodyLength = 5;
// Index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
int[] sequence = { 6, 8, 8, 9, 2, 1, 8, 8, 8, 8, 8, 8, 8, 2, 5 };
// ^-----------^ ^
// Body End
// First we stick an index on each element, since that's the desired output.
var indexedSequence = sequence.Select((n, i) => new { Index = i, Number = n }).ToArray();
// Scroll to the right to see comments
var patternMatchIndexes = indexedSequence
.Select(x => indexedSequence.Skip(x.Index).TakeWhile(x2 => isBody(x2.Number))) // Skip all the elements we already processed and try to match the body
.Where(body => body.Count() == requiredBodyLength) // Filter out any body sequences of incorrect length
.Select(body => new { BodyIndex = body.First().Index, EndIndex = body.Last().Index + 1 }) // Prepare the index of the first body element and the index of the end element
.Where(x => x.EndIndex < sequence.Length && isEnd(sequence[x.EndIndex])) // Make sure there is at least one element after the body and that it's an end element
.Select(x => x.BodyIndex) // There may be more than one matching pattern, get all their indices
.ToArray();
//patternMatchIndexes.Dump(); // Uncomment in LINQPad to see results
Note that this implementation is not performant at all, it is only meant as a teaching aid to show how something can be done in LINQ despite the unsuitability of solving it that way.
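For comparison, here is a single-pass sketch that avoids the repeated Skip calls. It is not LINQ at all, just a plain loop in a local function, and it assumes the goal is the index of the end element that follows a run of at least minBodyLength body elements. FindEndIndex and its parameter names are hypothetical; the sequence is the one from the example above.

```csharp
using System;
using System.Collections.Generic;

int[] sequence = { 6, 8, 8, 9, 2, 1, 8, 8, 8, 8, 8, 8, 8, 2, 5 };

// Returns the index of the first element that satisfies isEnd and directly
// follows a run of at least minBodyLength elements satisfying isBody,
// or -1 when there is no such element. All names here are illustrative.
static int FindEndIndex<T>(IEnumerable<T> source, Func<T, bool> isBody,
                           Func<T, bool> isEnd, int minBodyLength)
{
    int run = 0;   // length of the current run of body elements
    int index = 0; // position of the current element in the source
    foreach (var item in source)
    {
        if (run >= minBodyLength && isEnd(item))
            return index;
        run = isBody(item) ? run + 1 : 0;
        index++;
    }
    return -1;
}

int endIndex = FindEndIndex(sequence, n => n == 8, n => n == 2, 5);
Console.WriteLine(endIndex); // 13: the 2 that follows the run of eights
```

Each element is visited once, so the cost is linear in the length of the source, and the loop stops as soon as a match is found.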

How to optimise this LINQ Query

I have this query
Dasha.Where(x => x[15] == 9).ForEachWithIndex((x,i) => dd[Sex[i]][(int)x[16]]++);
This query finds the elements in Dasha whose value at index 15 is 9 and, for each match, increments the corresponding dd[Sex[i]][x[16]] value.
Here Dasha is double[100][50], dd is double[2][10], and Sex is a byte[] whose values can only be 0 (male) or 1 (female).
x[15] can only be between 0 and 9 (both inclusive). The same rule applies to x[16].
It is giving me right results.
I tried optimising this to
Dasha.ForEachWithIndex((x,i) =>
{
if(x[15] == 9)
dd[Sex[i]][(int)x[16]]++;
});
This is giving me wrong results. What am I doing wrong?
My ForEachWithIndex is like
static void ForEachWithIndex<T>(this IEnumerable<T> enu, Action<T, int> action)
{
int i = 0;
foreach(T item in enu)
action(item, i++);
}
This is just a partial answer (too long for a comment) in regards to
Dasha.ForEachWithIndex((x,i) => {
if(x[15] == 9)
dd[Sex[i]][(int)x[16]]++; });
This is giving me wrong results. What am I doing wrong?
In the first case you filter the Dasha list of 100 items down to n items, then you iterate over these n items.
In the second case you iterate over all 100 items, so the index will be different, and the value you get from Sex[i] for each row will be different.
e.g.
Dasha[0] != Dasha.Where(x => x[15] == 9)[0]
unless Dasha[0][15] == 9
You need to save original indexes before Where:
Dasha.Select((x,i) => new {x = x, i = i})
.Where(a => a.x[15] == 9)
.ToList() // ForEach is a List<T> method, not a LINQ operator
.ForEach(a => dd[Sex[a.i]][(int)a.x[16]]++);
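The effect of capturing the index before filtering is easier to see on a tiny data set. The sketch below is only an illustration: rows, sex and counts are made-up stand-ins for Dasha, Sex and dd, with columns 0 and 1 playing the roles of x[15] and x[16].

```csharp
using System;
using System.Linq;

// Each row is { flag, bucket }, standing in for x[15] and x[16].
double[][] rows =
{
    new[] { 0.0, 1.0 },
    new[] { 9.0, 2.0 },
    new[] { 9.0, 0.0 },
};
byte[] sex = { 0, 1, 1 };   // sex[i] belongs to rows[i]

var counts = new int[2][] { new int[3], new int[3] };

var matches = rows
    .Select((row, index) => (row, index))   // remember the original position
    .Where(t => t.row[0] == 9);

foreach (var (row, index) in matches)
    counts[sex[index]][(int)row[1]]++;

Console.WriteLine(string.Join(",", counts[0]));
Console.WriteLine(string.Join(",", counts[1]));
```

Both matching rows are counted under sex 1 here. Had the index been taken after the Where, the first match would have been given index 0 and counted under sex[0] instead.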
The following will give you the same result as the first query.
int counter=0;
Dasha.ForEachWithIndex((x,i) =>
{
if(x[15] == 9)
{
dd[Sex[counter]][(int)x[16]]++;
counter++;
}
});
