How to GroupBy objects by numeric values with tolerance factor? - c#

I have a C# list of objects with the following simplified data:
ID, Price
2, 80.0
8, 44.25
14, 43.5
30, 79.98
54, 44.24
74, 80.01
I am trying to group these by price, using the lowest price in each group as its key, while allowing for a tolerance factor.
For example, with tolerance = 0.02, my expected result would be:
44.24 -> 8, 54
43.5 -> 14
79.98 -> 2, 30, 74
How can I do this while achieving good performance for large datasets?
Is LINQ the way to go in this case?

It seemed to me that if you have a large data set you'll want to avoid the straightforward solution of sorting the values and then collecting them as you iterate through the sorted list, since sorting a large collection can be expensive. The most efficient solution I could think of which doesn't do any explicit sorting was to build a tree where each node contains the items where the key falls within a "contiguous" range (where all the keys are within tolerance of each other) - the range for each node expands every time an item is added which falls outside the range by less than tolerance. I implemented a solution - which turned out to be more complicated and interesting than I expected - and based on my rough benchmarking it looks like doing it this way takes about half as much time as the straightforward solution.
Here's my implementation as an extension method (so you can chain it, although, like the normal GroupBy method, it iterates the source completely as soon as the resulting IEnumerable is iterated).
public static IEnumerable<IGrouping<double, TValue>> GroupWithTolerance<TValue>(
this IEnumerable<TValue> source,
double tolerance,
Func<TValue, double> keySelector)
{
if(source == null)
throw new ArgumentNullException("source");
return GroupWithToleranceHelper<TValue>.Group(source, tolerance, keySelector);
}
private static class GroupWithToleranceHelper<TValue>
{
public static IEnumerable<IGrouping<double, TValue>> Group(
IEnumerable<TValue> source,
double tolerance,
Func<TValue, double> keySelector)
{
Node root = null, current = null;
foreach (var item in source)
{
var key = keySelector(item);
if(root == null) root = new Node(key);
current = root;
while(true){
if(key < current.Min - tolerance) { current = (current.Left ?? (current.Left = new Node(key))); }
else if(key > current.Max + tolerance) {current = (current.Right ?? (current.Right = new Node(key)));}
else
{
current.Values.Add(item);
if(current.Max < key){
current.Max = key;
current.Redistribute(tolerance);
}
if(current.Min > key) {
current.Min = key;
current.Redistribute(tolerance);
}
break;
}
}
}
if (root != null)
{
foreach (var entry in InOrder(root))
{
yield return entry;
}
}
else
{
//Return an empty collection
yield break;
}
}
private static IEnumerable<IGrouping<double, TValue>> InOrder(Node node)
{
if(node.Left != null)
foreach (var element in InOrder(node.Left))
yield return element;
yield return node;
if(node.Right != null)
foreach (var element in InOrder(node.Right))
yield return element;
}
private class Node : IGrouping<double, TValue>
{
public double Min;
public double Max;
public readonly List<TValue> Values = new List<TValue>();
public Node Left;
public Node Right;
public Node(double key) {
Min = key;
Max = key;
}
public double Key { get { return Min; } }
IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
public IEnumerator<TValue> GetEnumerator() { return Values.GetEnumerator(); }
public IEnumerable<TValue> GetLeftValues(){
return Left == null ? Values : Values.Concat(Left.GetLeftValues());
}
public IEnumerable<TValue> GetRightValues(){
return Right == null ? Values : Values.Concat(Right.GetRightValues());
}
public void Redistribute(double tolerance)
{
if(this.Left != null) {
this.Left.Redistribute(tolerance);
if(this.Left.Max + tolerance > this.Min){
this.Values.AddRange(this.Left.GetRightValues());
this.Min = this.Left.Min;
this.Left = this.Left.Left;
}
}
if(this.Right != null) {
this.Right.Redistribute(tolerance);
if(this.Right.Min - tolerance < this.Max){
this.Values.AddRange(this.Right.GetLeftValues());
this.Max = this.Right.Max;
this.Right = this.Right.Right;
}
}
}
}
}
You can switch double to another type if you need to (I so wish C# had a numeric generic constraint).
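For comparison, the straightforward sort-and-collect approach mentioned at the start of this answer might look like the sketch below. It is not the tree implementation above. The `ToleranceGrouping` class and `GroupSorted` method are hypothetical names, and the sketch assumes `decimal` keys (to sidestep floating-point rounding at the tolerance boundary) and chained grouping, where a new group starts only when the gap to the previous key exceeds the tolerance.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ToleranceGrouping
{
    // Hypothetical helper: sorts by key, then sweeps once over the sorted items.
    // A new group starts whenever the gap to the previous key exceeds the tolerance.
    // Keys are chained, so the first and last key of one group may differ by more
    // than the tolerance. Each group's key is its lowest key.
    public static List<KeyValuePair<decimal, List<T>>> GroupSorted<T>(
        IEnumerable<T> source, decimal tolerance, Func<T, decimal> keySelector)
    {
        var result = new List<KeyValuePair<decimal, List<T>>>();
        List<T> current = null;
        decimal previousKey = 0m;
        foreach (var item in source.OrderBy(keySelector))
        {
            var key = keySelector(item);
            if (current == null || key - previousKey > tolerance)
            {
                current = new List<T>();
                result.Add(new KeyValuePair<decimal, List<T>>(key, current));
            }
            current.Add(item);
            previousKey = key;
        }
        return result;
    }
}
```

With the sample data from the question and a tolerance of 0.02, this yields three groups keyed 43.5, 44.24 and 79.98. The cost is dominated by the sort.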

The most straight-forward approach is to design your own IEqualityComparer<double>.
public class ToleranceEqualityComparer : IEqualityComparer<double>
{
public double Tolerance { get; set; } = 0.02;
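public bool Equals(double x, double y)
{
// Symmetric check, so Equals(x, y) always agrees with Equals(y, x).
return Math.Abs(x - y) <= Tolerance;
}
// A constant hash code forces GroupBy to call Equals against the existing groups for every item.
public int GetHashCode(double obj) => 1;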
}
Use it like so:
var dataByPrice = data.GroupBy(d => d.Price, new ToleranceEqualityComparer());
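One thing to keep in mind with a tolerance-based comparer is that "equal within tolerance" is not transitive, so the groups can depend on the order of the input. The sketch below illustrates this with a hypothetical `AbsToleranceComparer` (a symmetric variant written for this example, not the class above) and keys that are exactly representable as doubles. It assumes LINQ to Objects, where each group keeps the first key it saw.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer for illustration: two keys are "equal" when they
// differ by at most the tolerance.
public sealed class AbsToleranceComparer : IEqualityComparer<double>
{
    private readonly double _tolerance;
    public AbsToleranceComparer(double tolerance) { _tolerance = tolerance; }
    public bool Equals(double x, double y) { return Math.Abs(x - y) <= _tolerance; }
    // Constant hash code so that every key is compared with Equals.
    public int GetHashCode(double obj) { return 1; }
}

public static class ComparerDemo
{
    // Counts the groups GroupBy produces for the given key order.
    public static int CountGroups(IEnumerable<double> keys, double tolerance)
    {
        return keys.GroupBy(k => k, new AbsToleranceComparer(tolerance)).Count();
    }
}
```

With a tolerance of 0.5, the keys 1.0, 1.5, 2.0 produce two groups (2.0 is too far from the first group's key, 1.0), while the same keys in the order 1.5, 1.0, 2.0 produce a single group.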

Here is a new implementation that passed unit tests that the other two solutions failed. It has the same signature as the currently accepted answer. The unit tests checked that no group had a spread (largest key minus smallest key) greater than the tolerance, and that the number of items grouped matched the number of items provided.
How to use
var values = new List<Tuple<double, string>>
{
new Tuple<double, string>(113.5, "Text Item 1"),
new Tuple<double, string>(109.62, "Text Item 2"),
new Tuple<double, string>(159.06, "Text Item 3"),
new Tuple<double, string>(114, "Text Item 4")
};
var groups = values.GroupWithTolerance(5, a => a.Item1).ToList();
Extension Method
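/// <summary>
/// Groups the items of a sequence so that all keys within a group fall within the given tolerance of each other.
/// </summary>
/// <typeparam name="TValue">The type of the items in the sequence.</typeparam>
/// <param name="source">The items to group.</param>
/// <param name="tolerance">The maximum allowed difference between the smallest and largest key in a group.</param>
/// <param name="keySelector">Extracts the numeric key from an item.</param>
/// <returns>The groups; each group's key is the average of the keys it contains.</returns>
/// <exception cref="ArgumentNullException"><paramref name="source"/> is null.</exception>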
public static IEnumerable<IGrouping<double, TValue>> GroupWithTolerance<TValue>(
this IEnumerable<TValue> source,
double tolerance,
Func<TValue, double> keySelector
)
{
var sortedValuesWithKey = source
.Select((a, i) => Tuple.Create(a, keySelector(a), i))
.OrderBy(a => a.Item2)
.ToList();
var diffsByIndex = sortedValuesWithKey
.Skip(1)
//i will start at 0 but we are targeting the diff between 0 and 1.
.Select((a, i) => Tuple.Create(i + 1, sortedValuesWithKey[i + 1].Item2 - sortedValuesWithKey[i].Item2))
.ToList();
var groupBreaks = diffsByIndex
.Where(a => a.Item2 > tolerance)
.Select(a => a.Item1)
.ToHashSet();
var groupKeys = new double[sortedValuesWithKey.Count];
void AddRange(int startIndex, int endIndex)
{
//If there is just one value in the group, take a short cut.
if (endIndex - startIndex == 0)
{
groupKeys[sortedValuesWithKey[startIndex].Item3] = sortedValuesWithKey[startIndex].Item2;
return;
}
var min = sortedValuesWithKey[startIndex].Item2;
var max = sortedValuesWithKey[endIndex].Item2;
//If the range is within tolerance, we are done with this group.
if (max - min < tolerance)
{
//Get the average value of the group and assign it to all elements.
var rangeValues = new List<double>(endIndex - startIndex);
for (var x = startIndex; x <= endIndex; x++)
rangeValues.Add(sortedValuesWithKey[x].Item2);
var average = rangeValues.Average();
for (var x = startIndex; x <= endIndex; x++)
groupKeys[sortedValuesWithKey[x].Item3] = average;
return;
}
//The range is not within tolerance and needs to be divided again.
//Find the largest gap and divide.
double maxDiff = -1;
var splitIndex = -1;
for (var i = startIndex; i < endIndex; i++)
{
var currentDif = diffsByIndex[i].Item2;
if (currentDif > maxDiff)
{
maxDiff = currentDif;
splitIndex = i;
}
}
AddRange(startIndex, splitIndex);
AddRange(splitIndex + 1, endIndex);
}
var groupStartIndex = 0;
for (var i = 1; i < sortedValuesWithKey.Count; i++)
{
//There isn't a group break here, at least not yet, so continue.
if (!groupBreaks.Contains(i))
continue;
AddRange(groupStartIndex, i - 1);
groupStartIndex = i;
}
//Add the last group's keys if we haven't already.
if (groupStartIndex < sortedValuesWithKey.Count)
AddRange(groupStartIndex, sortedValuesWithKey.Count - 1);
return sortedValuesWithKey.GroupBy(a => groupKeys[a.Item3], a => a.Item1);
}
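The two properties the unit tests checked can be expressed as a small validation helper. This is only a sketch of such a check, not the original tests. `GroupingChecks` and `IsValidGrouping` are hypothetical names, and the check accepts a spread exactly equal to the tolerance.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupingChecks
{
    // Hypothetical check mirroring the two properties described above:
    // every group's spread (max key minus min key) is within the tolerance,
    // and the groups together contain exactly the expected number of items.
    public static bool IsValidGrouping<T>(
        IEnumerable<IGrouping<double, T>> groups,
        Func<T, double> keySelector,
        double tolerance,
        int expectedItemCount)
    {
        var total = 0;
        foreach (var group in groups)
        {
            var keys = group.Select(keySelector).ToList();
            if (keys.Count == 0) return false;
            if (keys.Max() - keys.Min() > tolerance) return false;
            total += keys.Count;
        }
        return total == expectedItemCount;
    }
}
```

It can be pointed at the result of any of the GroupWithTolerance implementations, for example `GroupingChecks.IsValidGrouping(groups, a => a.Item1, 5, values.Count)` for the usage sample above.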

Related

C# Find continuous values in List quickly

I am working on a project that plots mouse tracking data. The MouseInfo class is defined like this:
public class MouseInfo {
public readonly long TimeStamp;
public readonly int PosX;
public readonly int PosY;
public int ButtonsDownFlag;
}
I need a way to extract from a List<MouseInfo> the mouse positions whose ButtonsDownFlag forms a run of at least two consecutive 1s, and group them together, so that I can distinguish clicks from drags; the result will then be used for plotting.
My current approach is to iterate through the list and add the found values one by one to other lists, which is slow and expensive, and the code looks messy. Is there a more elegant way to do it? Would LINQ help?
For example, given the recording below:
(t1, x1, y1, 0)
(t2, x2, y2, 1)
(t3, x3, y3, 1)
(t4, x4, y4, 0)
(t5, x5, y5, 1)
(t6, x6, y6, 0)
(t7, x7, y7, 1)
(t8, x8, y8, 1)
(t9, x9, y9, 1)
(ta, xa, ya, 0)
(tb, xb, yb, 2) <- Yes, ButtonsDownFlag can be 2 for right clicks, or even 3 when both buttons are down
(tc, xc, yc, 0)
(td, xd, yd, 2)
(te, xe, ye, 2)
I want two lists (or a similar representation), which are
((t2, x2, y2), (t2, x3, y3), (t7, x7, y7), (t7, x8, y8), (t7, x9, y9))
and
((x5, y5, 1), (xb, yb, 2), (xd, yd, 2), (xe, ye, 2))
Note:
In the first list, the TimeStamp of each subsequent element is changed to the first element's TimeStamp, so that I can group them later when plotting.
In the second list, I don't care about TimeStamp but I do care about ButtonsDownFlag.
I don't mind if ButtonsDownFlag exists in the first list, or TimeStamp in the second list.
Consecutive "right clicks" are treated as separate right clicks rather than a "right drag".
You can use LINQ to produce one list of all events that are part of a drag sequence and a separate list of individual click events.
List<MouseInfo> mainList = new List<MouseInfo>();
//populate mainList with some data...
List<MouseInfo> dragList = mainList.Where
(
// the indexed overload of Where supplies the position i, avoiding repeated IndexOf scans
(x, i) =>
// check if the left button is pressed
x.ButtonsDownFlag == 1
// then check if either the previous or the following element is also a left press
&&
(
// if this isn't the first element in the list, check the previous one
(i > 0 && mainList[i - 1].ButtonsDownFlag == 1)
// if this isn't the last element in the list, check the next one
|| (i < mainList.Count - 1 && mainList[i + 1].ButtonsDownFlag == 1)
)
).ToList();
List<MouseInfo> clickList = mainList.Where
(
(x, i) =>
// a right-button or both-buttons press always counts as an individual click
x.ButtonsDownFlag == 2 || x.ButtonsDownFlag == 3
// a left press is a click only if neither neighbour is also a left press
||
(
x.ButtonsDownFlag == 1
&& (i == 0 || mainList[i - 1].ButtonsDownFlag != 1)
&& (i == mainList.Count - 1 || mainList[i + 1].ButtonsDownFlag != 1)
)
).ToList();
This approach does have the limitation of not "labelling" each sequence of drags with the same timestamp.
An alternative would be to do this logic at point of data capture. When each data point is captured, if it has a "ButtonDown" value, check the previous data point. If that data point is also a "ButtonDown" add them both (or however many you end up with in the sequence) to your "dragList", otherwise add it to the "clickList".
For this option I would also add some logic to separate your different drag sequences. You did this by changing the timestamp of the subsequent points; I would instead make "dragList" a dictionary, with each drag sequence stored under its own distinct key.
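The dictionary idea could be sketched as follows. This is one possible reading of the suggestion, applied to an already recorded list rather than at capture time. `MousePoint` is a hypothetical stand-in for MouseInfo, `SplitDrags` is an assumed name, and the sketch assumes the timestamp of the first point of each drag is unique so it can serve as the dictionary key.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the MouseInfo class from the question.
public class MousePoint
{
    public long TimeStamp;
    public int PosX;
    public int PosY;
    public int ButtonsDownFlag;
}

public static class DragSplitter
{
    // Returns the drag sequences keyed by the timestamp of their first point
    // and appends individual clicks to the supplied list.
    public static Dictionary<long, List<MousePoint>> SplitDrags(
        List<MousePoint> points, List<MousePoint> clicks)
    {
        var drags = new Dictionary<long, List<MousePoint>>();
        int i = 0;
        while (i < points.Count)
        {
            var point = points[i];
            if (point.ButtonsDownFlag != 1)
            {
                // Anything other than a left press is a click, unless no button is down.
                if (point.ButtonsDownFlag != 0) clicks.Add(point);
                i++;
                continue;
            }
            // Measure the run of consecutive left presses starting here.
            int start = i;
            while (i < points.Count && points[i].ButtonsDownFlag == 1) i++;
            if (i - start >= 2)
                drags[points[start].TimeStamp] = points.GetRange(start, i - start);
            else
                clicks.Add(points[start]);
        }
        return drags;
    }
}
```

Because each drag lives under its own key, there is no need to overwrite the timestamps of the later points in a sequence.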
I don't think this is too easy to follow, but it is similar to how you might handle this in APL (I used Excel to work it out). I also won't promise how fast this is - generally foreach is faster than LINQ, even if only by a small amount.
Using extension methods to implement APL's scan and compress operators and to append/prepend to IEnumerables:
public static IEnumerable<TResult> Scan<T, TResult>(this IEnumerable<T> src, TResult seed, Func<TResult, T, TResult> combine) {
foreach (var s in src) {
seed = combine(seed, s);
yield return seed;
}
}
public static IEnumerable<T> Compress<T>(this IEnumerable<bool> bv, IEnumerable<T> src) {
var srce = src.GetEnumerator();
foreach (var b in bv) {
srce.MoveNext();
if (b)
yield return srce.Current;
}
}
public static IEnumerable<T> Prepend<T>(this IEnumerable<T> rest, params T[] first) => first.Concat(rest);
public static IEnumerable<T> Append<T>(this IEnumerable<T> rest, params T[] last) => rest.Concat(last);
You can filter the list to groups of drags and what's not in a drag:
// create a terminal MouseInfo for scanning along the moves
var mterm = new MouseInfo { t = 0, x = 0, y = 0, b = 4 };
// find the drags for button 1 except the first row
var bRestOfDrag1s = moves.Append(mterm).Zip(moves.Prepend(mterm), (dm, em) => dm.b == 1 && dm.b == em.b).ToList();
// number the drags by finding the drag beginnings
var iLastDragNums = bRestOfDrag1s.Zip(bRestOfDrag1s.Skip(1), (fm, gm) => (!fm && gm)).Scan(0, (a, hm) => hm ? a + 1 : a).ToList();
// find the drags
var bInDrag1s = bRestOfDrag1s.Zip(bRestOfDrag1s.Skip(1), (fm, gm) => (fm || gm));
// number each drag row by its drag number
var dnmiDrags = bInDrag1s.Compress(Enumerable.Range(0, moves.Count)).Select(idx => new { DragNum = iLastDragNums[idx], mi = moves[idx] });
// group by drag number and smear first timestamp along drags
var drags = dnmiDrags.GroupBy(dnmi => dnmi.DragNum)
.Select(dnmig => dnmig.Select(dnmi => dnmi.mi).Select(mi => new MouseInfo { t = dnmig.First().mi.t, x = mi.x, y = mi.y, b = mi.b }).ToList()).ToList();
var clicks = bInDrag1s.Select(b => !b).Compress(moves).Where(mi => mi.b != 0).ToList();
When done, drags contains a List<List<MouseInfo>> where each sub-list is a drag. You can use SelectMany instead of the last (outside) Select to get just a flat List<MouseInfo> instead.
clicks will contain a List<MouseInfo> with just the clicks.
Note that I abbreviated the MouseInfo field names.
BTW, using a for loop is considerably faster:
var inDrag = false;
var drags = new List<MouseInfo>();
var clicks = new List<MouseInfo>();
var beginTime = 0L;
for (var i = 0; i < moves.Count; ++i) {
var curMove = moves[i];
var wasDrag = inDrag;
inDrag = curMove.b == 1 && (inDrag || (i + 1 < moves.Count ? moves[i + 1].b == 1 : false));
if (inDrag) {
if (wasDrag)
drags.Add(new MouseInfo { t = beginTime, x = curMove.x, y = curMove.y, b = curMove.b });
else {
drags.Add(curMove);
beginTime = curMove.t;
}
}
else {
if (curMove.b != 0)
clicks.Add(curMove);
}
}
Just trying to share some knowledge: I found that GroupAdjacent solved my problem very well (along with some tweaks for the plotting at a later stage).
The performance is surely not the best (compared to a for loop), but I feel the code is more elegant!
Reference: https://blogs.msdn.microsoft.com/ericwhite/2008/04/20/the-groupadjacent-extension-method/
public static class LocalExtensions {
public static IEnumerable<IGrouping<TKey, TSource>> GroupAdjacent<TSource, TKey>(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector) {
TKey last = default(TKey);
bool haveLast = false;
List<TSource> list = new List<TSource>();
foreach (TSource s in source) {
TKey k = keySelector(s);
if (haveLast) {
if (!k.Equals(last)) {
yield return new GroupOfAdjacent<TSource, TKey>(list, last);
list = new List<TSource>();
list.Add(s);
last = k;
} else {
list.Add(s);
last = k;
}
} else {
list.Add(s);
last = k;
haveLast = true;
}
}
if (haveLast)
yield return new GroupOfAdjacent<TSource, TKey>(list, last);
}
}
class GroupOfAdjacent<TSource, TKey> : IEnumerable<TSource>, IGrouping<TKey, TSource> {
public TKey Key { get; set; }
private List<TSource> GroupList { get; set; }
IEnumerator IEnumerable.GetEnumerator() {
return ((IEnumerable<TSource>)this).GetEnumerator();
}
IEnumerator<TSource> IEnumerable<TSource>.GetEnumerator() {
foreach (TSource s in GroupList)
yield return s;
}
public GroupOfAdjacent(List<TSource> source, TKey key) {
GroupList = source;
Key = key;
}
}
And my working code for testing:
private class MouseInfo {
public readonly long TimeStamp;
public readonly int PosX;
public readonly int PosY;
public int ButtonsDownFlag;
public MouseInfo(long t, int x, int y, int flag) {
TimeStamp = t;
PosX = x;
PosY = y;
ButtonsDownFlag = flag;
}
public override string ToString() {
return $"({TimeStamp:D2}: {PosX:D3}, {PosY:D4}, {ButtonsDownFlag})";
}
}
public Program() {
List<MouseInfo> mi = new List<MouseInfo>(14);
mi.Add(new MouseInfo(1, 10, 100, 0));
mi.Add(new MouseInfo(2, 20, 200, 1));
mi.Add(new MouseInfo(3, 30, 300, 1));
mi.Add(new MouseInfo(4, 40, 400, 0));
mi.Add(new MouseInfo(5, 50, 500, 1));
mi.Add(new MouseInfo(6, 60, 600, 0));
mi.Add(new MouseInfo(7, 70, 700, 1));
mi.Add(new MouseInfo(8, 80, 800, 1));
mi.Add(new MouseInfo(9, 90, 900, 1));
mi.Add(new MouseInfo(10, 100, 1000, 0));
mi.Add(new MouseInfo(11, 110, 1100, 2));
mi.Add(new MouseInfo(12, 120, 1200, 0));
mi.Add(new MouseInfo(13, 130, 1300, 2));
mi.Add(new MouseInfo(14, 140, 1400, 2));
var groups = mi.GroupAdjacent(x => x.ButtonsDownFlag);
List<List<MouseInfo>> drags = groups.Where(x => x.Key == 1 && x.Count() > 1).Select(x => x.ToList()).ToList();
foreach (var d in drags)
foreach (var item in d)
Console.Write($"{item} ");
Console.WriteLine();
List<List<MouseInfo>> clicks = groups.Where(x => x.Key > 1 || (x.Key == 1 && x.Count() == 1)).Select(x => x.ToList()).ToList();
foreach (var d in clicks) {
foreach (var item in d)
Console.Write($"{item} ");
Console.WriteLine();
}
}
[MTAThread]
static void Main(string[] args) {
var x = new Program();
Console.ReadLine();
return;
}

Identifying strings and manipulating them correctly

To preface this, I am pulling records from a database. The CaseNumber column holds a unique identifier. However, multiple cases related to ONE event have very similar case numbers, in which the last two digits increase by one from each case to the next. Example:
TR42X2330789
TR42X2330790
TR42X2330791
TR51C0613938
TR51C0613939
TR51C0613940
TR51C0613941
TR51C0613942
TR52X4224749
As you can see, we would have to group these records into three groups. Currently my function is really messy, and it does not account for the scenario in which a group of case numbers is immediately followed by another group of case numbers. I was wondering if anybody had suggestions on how to tackle this. I was thinking about putting all the case numbers in an array.
int i = 1;
string firstCaseNumber = string.Empty;
string previousCaseNumber = string.Empty;
if (i == 1)
{
firstCaseNumber = texasHarrisPublicRecordInfo.CaseNumber;
i++;
}
else if (i == 2)
{
string previousCaseNumberCode = firstCaseNumber.Remove(firstCaseNumber.Length - 3);
int previousCaseNumberTwoCharacters = Int32.Parse(firstCaseNumber.Substring(Math.Max(0, firstCaseNumber.Length - 2)));
string currentCaseNumberCode = texasHarrisPublicRecordInfo.CaseNumber.Remove(texasHarrisPublicRecordInfo.CaseNumber.Length - 3);
int currentCaselastTwoCharacters = Int32.Parse(texasHarrisPublicRecordInfo.CaseNumber.Substring(Math.Max(0, texasHarrisPublicRecordInfo.CaseNumber.Length - 2)));
int numberPlusOne = previousCaseNumberTwoCharacters + 1;
if (previousCaseNumberCode == currentCaseNumberCode && numberPlusOne == currentCaselastTwoCharacters)
{
//Group offense here
i++;
needNewCriminalRecord = false;
}
else
{
//NewGRoup here
}
previousCaseNumber = texasHarrisPublicRecordInfo.CaseNumber;
i++;
}
else
{
string beforeCaseNumberCode = previousCaseNumber.Remove(previousCaseNumber.Length - 3);
int beforeCaselastTwoCharacters = Int32.Parse(previousCaseNumber.Substring(Math.Max(0, previousCaseNumber.Length - 2)));
string currentCaseNumberCode = texasHarrisPublicRecordInfo.CaseNumber.Remove(texasHarrisPublicRecordInfo.CaseNumber.Length - 3);
int currentCaselastTwoCharacters = Int32.Parse(texasHarrisPublicRecordInfo.CaseNumber.Substring(Math.Max(0, texasHarrisPublicRecordInfo.CaseNumber.Length - 2)));
int numberPlusOne = beforeCaselastTwoCharacters + 1;
if (beforeCaseNumberCode == currentCaseNumberCode && numberPlusOne == currentCaselastTwoCharacters)
{
i++;
needNewCriminalRecord = false;
}
else
{
needNewCriminalRecord = true;
}
}
If you do not really care about performance, you can use the LINQ .GroupBy() and .ToDictionary() methods to create a dictionary of lists. Something along the lines of:
string[] values =
{
"TR42X2330789",
"TR42X2330790",
"TR42X2330791",
"TR51C0613938",
"TR51C0613939",
"TR51C0613940",
"TR51C0613941",
"TR51C0613942",
"TR52X4224749"
};
Dictionary<string, List<string>> grouppedValues = values.GroupBy(v =>
new string(v.Take(9).ToArray()), // key - first 9 chars
v => v) // value
.ToDictionary(g => g.Key, g => g.ToList());
foreach (var item in grouppedValues)
{
Console.WriteLine(item.Key + " " + item.Value.Count);
}
Output :
TR42X2330 3
TR51C0613 5
TR52X4224 1
I would create a general-purpose extension method:
static IEnumerable<IEnumerable<T>> GroupByConsecutiveKey<T, TKey>(this IEnumerable<T> list, Func<T, TKey> keySelector, Func<TKey, TKey, bool> areConsecutive)
{
using (var enumerator = list.GetEnumerator())
{
TKey previousKey = default(TKey);
var currentGroup = new List<T>();
while (enumerator.MoveNext())
{
if (!areConsecutive(previousKey, keySelector(enumerator.Current)))
{
if (currentGroup.Count > 0)
{
yield return currentGroup;
currentGroup = new List<T>();
}
}
currentGroup.Add(enumerator.Current);
previousKey = keySelector(enumerator.Current);
}
if (currentGroup.Count != 0)
{
yield return currentGroup;
}
}
}
And now you would use it like:
var grouped = data.GroupByConsecutiveKey(item => item, (k1, k2) => Consecutive(k1, k2));
A quick hack for the areConsecutive predicate could be:
public static bool Consecutive(string s1, string s2)
{
if (s1 == null || s2 == null)
return false;
if (s1.Substring(0, s1.Length - 2) != s2.Substring(0, s2.Length - 2))
return false;
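var end1 = s1.Substring(s1.Length - 2, 2);
var end2 = s2.Substring(s2.Length - 2, 2);
// Compare the two-character suffixes as numbers; comparing only the final
// characters would wrongly treat e.g. "38" and "49" as consecutive.
int n1, n2;
if (!int.TryParse(end1, out n1) || !int.TryParse(end2, out n2))
return false;
return Math.Abs(n1 - n2) == 1;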
}
Note that I am assuming the key can take any shape. If the alphanumeric code always follows the same pattern, you can probably make this method a whole lot prettier, or just use regular expressions.
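A regular-expression variant might look like the sketch below. It assumes every case number is an arbitrary prefix followed by a trailing run of digits (for example "TR42X" + "2330789"), which is a guess based on the sample data. `CaseNumbers`, `AreConsecutive` and `Group` are hypothetical names. Unlike the quick hack above, it only treats a number as consecutive when it is exactly one higher than the previous one.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaseNumbers
{
    // Lazy prefix, then the trailing run of digits up to the end of the string.
    private static readonly Regex Pattern =
        new Regex(@"^(?<prefix>.*?)(?<number>\d+)$");

    // True when both values share the same prefix and the second number
    // is exactly one higher than the first.
    public static bool AreConsecutive(string previous, string current)
    {
        if (previous == null || current == null) return false;
        var a = Pattern.Match(previous);
        var b = Pattern.Match(current);
        if (!a.Success || !b.Success) return false;
        if (a.Groups["prefix"].Value != b.Groups["prefix"].Value) return false;
        long n1, n2;
        if (!long.TryParse(a.Groups["number"].Value, out n1)) return false;
        if (!long.TryParse(b.Groups["number"].Value, out n2)) return false;
        return n2 == n1 + 1;
    }

    // Splits an ordered sequence of case numbers into runs of consecutive values.
    public static List<List<string>> Group(IEnumerable<string> caseNumbers)
    {
        var groups = new List<List<string>>();
        string previous = null;
        foreach (var caseNumber in caseNumbers)
        {
            if (groups.Count == 0 || !AreConsecutive(previous, caseNumber))
                groups.Add(new List<string>());
            groups[groups.Count - 1].Add(caseNumber);
            previous = caseNumber;
        }
        return groups;
    }
}
```

For the nine sample case numbers from the question, this produces three groups of 3, 5 and 1 entries.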

How to convert a multiple rank array using ConvertAll()?

I want to use ConvertAll like this:
var sou = new[,] { { true, false, false }, { true, true, true } };
var tar = Array.ConvertAll<bool, int>(sou, x => (x ? 1 : 0));
but I get a compiler error:
cannot implicitly convert type bool[,] to bool[]
You could write a straightforward conversion extension:
public static class ArrayExtensions
{
public static TResult[,] ConvertAll<TSource, TResult>(this TSource[,] source, Func<TSource, TResult> projection)
{
if (source == null)
throw new ArgumentNullException("source");
if (projection == null)
throw new ArgumentNullException("projection");
var result = new TResult[source.GetLength(0), source.GetLength(1)];
for (int x = 0; x < source.GetLength(0); x++)
for (int y = 0; y < source.GetLength(1); y++)
result[x, y] = projection(source[x, y]);
return result;
}
}
Sample usage would look like this:
var tar = sou.ConvertAll(x => x ? 1 : 0);
The downside is that if you wanted to do any other transforms besides projection, you would be in a pickle.
Alternatively, if you want to be able to use LINQ operators on the sequence, you can do that easily with regular LINQ methods. However, you would still need a custom implementation to turn the sequence back into a 2D array:
public static T[,] To2DArray<T>(this IEnumerable<T> source, int rows, int columns)
{
if (source == null)
throw new ArgumentNullException("source");
if (rows < 0 || columns < 0)
throw new ArgumentException("rows and columns must be non-negative integers.");
var result = new T[rows, columns];
if (columns == 0 || rows == 0)
return result;
int column = 0, row = 0;
foreach (T element in source)
{
if (column >= columns)
{
column = 0;
if (++row >= rows)
throw new InvalidOperationException("Sequence elements do not fit the array.");
}
result[row, column++] = element;
}
return result;
}
This would allow a great deal more flexibility as you can operate on your source array as an IEnumerable{T} sequence.
Sample usage:
var tar = sou.Cast<bool>().Select(x => x ? 1 : 0).To2DArray(sou.GetLength(0), sou.GetLength(1));
Note that the initial cast is required to transform the sequence from IEnumerable paradigm to IEnumerable<T> paradigm since a multidimensional array does not implement the generic IEnumerable<T> interface. Most of the LINQ transforms only work on that.
If your array is of unknown rank, you can use this extension method (which depends on the MoreLinq NuGet package). I'm sure it can be optimized a lot, but it works for me.
using MoreLinq;
using System;
using System.Collections.Generic;
using System.Linq;
public static class ArrayExtensions
{
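public static Array ConvertAll<TOutput>(this Array array, Converter<object, TOutput> converter)
{
// Write into a new array of the target element type; storing TOutput values
// back into the source array fails whenever the element types differ.
var lengths = Enumerable.Range(0, array.Rank).Select(r => array.GetLength(r)).ToArray();
var result = Array.CreateInstance(typeof(TOutput), lengths);
foreach (int[] indices in GenerateIndices(array))
{
result.SetValue(converter.Invoke(array.GetValue(indices)), indices);
}
return result;
}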
private static IEnumerable<int[]> GenerateCartesianProductOfUpperBounds(IEnumerable<int> upperBounds, IEnumerable<int[]> existingCartesianProduct)
{
if (!upperBounds.Any())
return existingCartesianProduct;
var slice = upperBounds.Slice(0, upperBounds.Count() - 1);
var rangeOfIndices = Enumerable.Range(0, upperBounds.Last() + 1);
IEnumerable<int[]> newCartesianProduct;
if (existingCartesianProduct.Any())
newCartesianProduct = rangeOfIndices.Cartesian(existingCartesianProduct, (i, p1) => new[] { i }.Concat(p1).ToArray()).ToArray();
else
newCartesianProduct = rangeOfIndices.Select(i => new int[] { i }).ToArray();
return GenerateCartesianProductOfUpperBounds(slice, newCartesianProduct);
}
private static IEnumerable<int[]> GenerateIndices(Array array)
{
var upperBounds = Enumerable.Range(0, array.Rank).Select(r => array.GetUpperBound(r));
return GenerateCartesianProductOfUpperBounds(upperBounds, Array.Empty<int[]>());
}
}

Grouping by an unknown initial prefix

Say I have the following array of strings as an input:
foo-139875913
foo-aeuefhaiu
foo-95hw9ghes
barbazabejgoiagjaegioea
barbaz8gs98ghsgh9es8h
9a8efa098fea0
barbaza98fyae9fghaefag
bazfa90eufa0e9u
bazgeajga8ugae89u
bazguea9guae
aifeaufhiuafhe
There are 3 different prefixes used here, "foo-", "barbaz" and "baz" - however these prefixes are not known ahead of time (they could be something completely different).
How could you establish what the different common prefixes are so that they could then be grouped by? This is made a bit tricky since in the data I've provided there's two that start with "bazg" and one that starts "bazf" where of course "baz" is the prefix.
What I've tried so far is sorting them into alphabetical order, and then looping through them in order and counting how many characters in a row are identical to the previous. If the number is different or when 0 characters are identical, it starts a new group. The problem with this is it falls over at the "bazg" and "bazf" problem I mentioned earlier and separates those into two different groups (one with just one element in it)
Edit: Alright, let's throw a few more rules in:
Longer potential groups should generally be preferred over shorter ones, unless there is a closely matching group of less than X characters difference in length. (So where X is 2, baz would be preferred over bazg)
A group must have at least Y elements in it or not be a group at all
It's okay to simply throw away elements that don't match any of the 'groups' to within the rules above.
To clarify the first rule in relation to the second: if X was 0 and Y was 2, then the two 'bazg' entries would be in a group, and the 'bazf' entry would be thrown away because it's on its own.
Well, here's a quick hack, probably O(something_bad):
IEnumerable<Tuple<String, IEnumerable<string>>> GuessGroups(IEnumerable<string> source, int minNameLength=0, int minGroupSize=1)
{
// TODO: error checking
return InnerGuessGroups(new Stack<string>(source.OrderByDescending(x => x)), minNameLength, minGroupSize);
}
IEnumerable<Tuple<String, IEnumerable<string>>> InnerGuessGroups(Stack<string> source, int minNameLength, int minGroupSize)
{
if(source.Any())
{
var tuple = ExtractTuple(GetBestGroup(source, minNameLength), source);
if (tuple.Item2.Count() >= minGroupSize)
yield return tuple;
foreach (var element in GuessGroups(source, minNameLength, minGroupSize))
yield return element;
}
}
Tuple<String, IEnumerable<string>> ExtractTuple(string prefix, Stack<string> source)
{
return Tuple.Create(prefix, PopWithPrefix(prefix, source).ToList().AsEnumerable());
}
IEnumerable<string> PopWithPrefix(string prefix, Stack<string> source)
{
while (source.Any() && source.Peek().StartsWith(prefix))
yield return source.Pop();
}
string GetBestGroup(IEnumerable<string> source, int minNameLength)
{
var s = new Stack<string>(source);
var counter = new DictionaryWithDefault<string, int>(0);
while(s.Any())
{
var g = GetCommonPrefix(s);
if(!string.IsNullOrEmpty(g) && g.Length >= minNameLength)
counter[g]++;
s.Pop();
}
return counter.OrderBy(c => c.Value).Last().Key;
}
string GetCommonPrefix(IEnumerable<string> coll)
{
return (from len in Enumerable.Range(0, coll.Min(s => s.Length)).Reverse()
let possibleMatch = coll.First().Substring(0, len)
where coll.All(f => f.StartsWith(possibleMatch))
select possibleMatch).FirstOrDefault();
}
public class DictionaryWithDefault<TKey, TValue> : Dictionary<TKey, TValue>
{
TValue _default;
public TValue DefaultValue {
get { return _default; }
set { _default = value; }
}
public DictionaryWithDefault() : base() { }
public DictionaryWithDefault(TValue defaultValue) : base() {
_default = defaultValue;
}
public new TValue this[TKey key]
{
get { return base.ContainsKey(key) ? base[key] : _default; }
set { base[key] = value; }
}
}
Example usage:
string[] input = {
"foo-139875913",
"foo-aeuefhaiu",
"foo-95hw9ghes",
"barbazabejgoiagjaegioea",
"barbaz8gs98ghsgh9es8h",
"barbaza98fyae9fghaefag",
"bazfa90eufa0e9u",
"bazgeajga8ugae89u",
"bazguea9guae",
"9a8efa098fea0",
"aifeaufhiuafhe"
};
GuessGroups(input, 3, 2).Dump();
Ok, well as discussed, the problem wasn't initially well defined, but here is how I'd go about it.
Create a tree T
Parse the list, for each element:
    for each letter in that element
        if a branch labelled with that letter exists then
            Increment the counter on that branch
            Descend that branch
        else
            Create a branch labelled with that letter
            Set its counter to 1
            Descend that branch
This gives you a tree where each of the leaves represents a word in your input. Each of the non-leaf nodes has a counter representing how many leaves are (eventually) attached to that node. Now you need a formula to weight the length of the prefix (the depth of the node) against the size of the prefix group. For now:
S = (a * d) + (b * q) // d = depth, q = quantity, a, b coefficients you'll tweak to get desired behaviour
So now you can iterate over each of the non-leaf nodes and assign them a score S. Then, to work out your groups you would:
For each non-leaf node
    Assign score S
    Insertion sort the node into a list, so the head is the highest scoring node
Starting at the root of the tree, traverse the nodes
    If the node is the highest scoring node in the list
        Mark it as a prefix
        Remove all nodes from the list that are a descendant of it
        Pop itself off the front of the list
        Return up the tree
This should give you a list of prefixes. The last part feels like some clever data structures or algorithms could speed it up (removing all the children feels particularly weak, but if your input size is small, I guess speed isn't too important).
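A rough sketch of the tree idea is given below. It is one interpretation rather than a definitive implementation. The selection step is simplified to "take candidates in descending score order, skipping any that is an ancestor or descendant of a prefix already chosen". `PrefixTrie` and `FindPrefixes` are hypothetical names, the `minGroupSize` parameter is an added assumption, and the coefficients a and b still need tuning as described above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PrefixTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public int Count; // number of input strings that pass through this node
    }

    public static List<string> FindPrefixes(
        IEnumerable<string> items, double a, double b, int minGroupSize)
    {
        // Build the tree, incrementing the counter on every branch we descend.
        var root = new Node();
        foreach (var item in items)
        {
            var node = root;
            foreach (var ch in item)
            {
                Node child;
                if (!node.Children.TryGetValue(ch, out child))
                {
                    child = new Node();
                    node.Children.Add(ch, child);
                }
                child.Count++;
                node = child;
            }
        }

        // Score every non-leaf node: S = (a * depth) + (b * quantity).
        var candidates = new List<KeyValuePair<string, double>>();
        Collect(root, "", 0, a, b, minGroupSize, candidates);

        // Simplified selection: best score first, never choosing two prefixes
        // where one is an ancestor of the other.
        var chosen = new List<string>();
        foreach (var candidate in candidates.OrderByDescending(pair => pair.Value))
        {
            var prefix = candidate.Key;
            bool related = chosen.Any(p =>
                p.StartsWith(prefix, StringComparison.Ordinal) ||
                prefix.StartsWith(p, StringComparison.Ordinal));
            if (!related) chosen.Add(prefix);
        }
        return chosen;
    }

    private static void Collect(Node node, string prefix, int depth, double a, double b,
        int minGroupSize, List<KeyValuePair<string, double>> output)
    {
        foreach (var entry in node.Children)
        {
            var child = entry.Value;
            var childPrefix = prefix + entry.Key;
            if (child.Children.Count > 0 && child.Count >= minGroupSize)
                output.Add(new KeyValuePair<string, double>(
                    childPrefix, a * (depth + 1) + b * child.Count));
            Collect(child, childPrefix, depth + 1, a, b, minGroupSize, output);
        }
    }
}
```

For the inputs "foo-1", "foo-2" and "bar" with a = b = 1 and a minimum group size of 2, the only prefix selected is "foo-". Ties between equally scored candidates are broken arbitrarily in this sketch.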
I'm wondering if your requirements aren't off. It seems as if you are looking for a specific grouping size as opposed to specific key size requirements. Below is a program that will, based on a specified group size, break up the strings into the largest possible groups, up to and including the group size specified. So if you specify a group size of 5, it will group items on the smallest key possible to make a group of size 5. In your example it would group foo- as f, since there is no need to make a more complex key as an identifier.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace ConsoleApplication2
{
class Program
{
/// <remarks><c>true</c> in returned dictionary key are groups over <paramref name="maxGroupSize"/></remarks>
public static Dictionary<bool,Dictionary<string, List<string>>> Split(int maxGroupSize, int keySize, IEnumerable<string> items)
{
var smallItems = from item in items
where item.Length < keySize
select item;
var largeItems = from item in items
where keySize < item.Length
select item;
var largeItemsq = (from item in largeItems
let key = item.Substring(0, keySize)
group item by key into x
select new { Key = x.Key, Items = x.ToList() } into aGrouping
group aGrouping by aGrouping.Items.Count() > maxGroupSize into x2
select x2).ToDictionary(a => a.Key, a => a.ToDictionary(a_ => a_.Key, a_ => a_.Items));
if (smallItems.Any())
{
var smallestLength = items.Aggregate(int.MaxValue, (acc, item) => Math.Min(acc, item.Length));
var smallItemsq = (from item in smallItems
let key = item.Substring(0, smallestLength)
group item by key into x
select new { Key = x.Key, Items = x.ToList() } into aGrouping
group aGrouping by aGrouping.Items.Count() > maxGroupSize into x2
select x2).ToDictionary(a => a.Key, a => a.ToDictionary(a_ => a_.Key, a_ => a_.Items));
return Combine(smallItemsq, largeItemsq);
}
return largeItemsq;
}
static Dictionary<bool, Dictionary<string,List<string>>> Combine(Dictionary<bool, Dictionary<string,List<string>>> a, Dictionary<bool, Dictionary<string,List<string>>> b) {
var x = new Dictionary<bool,Dictionary<string,List<string>>> {
{ true, null },
{ false, null }
};
foreach(var condition in new bool[] { true, false }) {
var hasA = a.ContainsKey(condition);
var hasB = b.ContainsKey(condition);
x[condition] = hasA && hasB ? a[condition].Concat(b[condition]).ToDictionary(c => c.Key, c => c.Value)
: hasA ? a[condition]
: hasB ? b[condition]
: new Dictionary<string, List<string>>();
}
return x;
}
public static Dictionary<string, List<string>> Group(int maxGroupSize, IEnumerable<string> items, int keySize)
{
var toReturn = new Dictionary<string, List<string>>();
var both = Split(maxGroupSize, keySize, items);
if (both.ContainsKey(false))
foreach (var key in both[false].Keys)
toReturn.Add(key, both[false][key]);
if (both.ContainsKey(true))
{
var keySize_ = keySize + 1;
var xs = from needsFix in both[true]
select needsFix;
foreach (var x in xs)
{
var fixedGroup = Group(maxGroupSize, x.Value, keySize_);
toReturn = toReturn.Concat(fixedGroup).ToDictionary(a => a.Key, a => a.Value);
}
}
return toReturn;
}
static Random rand = new Random(unchecked((int)DateTime.Now.Ticks));
const string allowedChars = "aaabbbbccccc"; // "aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ";
static readonly int maxAllowed = allowedChars.Length - 1;
static IEnumerable<string> GenerateText()
{
var list = new List<string>();
for (int i = 0; i < 100; i++)
{
var stringLength = rand.Next(3,25);
var chars = new List<char>(stringLength);
for (int j = stringLength; j > 0; j--)
chars.Add(allowedChars[rand.Next(0, allowedChars.Length)]); // upper bound is exclusive, so Length (not Length - 1) lets the last character be picked
var newString = chars.Aggregate(new StringBuilder(), (acc, item) => acc.Append(item)).ToString();
list.Add(newString);
}
return list;
}
static void Main(string[] args)
{
// runs 1000 times over autogenerated groups of sample text.
for (int i = 0; i < 1000; i++)
{
var s = GenerateText();
Go(s);
}
Console.WriteLine();
Console.WriteLine("DONE");
Console.ReadLine();
}
static void Go(IEnumerable<string> items)
{
var dict = Group(3, items, 1);
foreach (var key in dict.Keys)
{
Console.WriteLine(key);
foreach (var item in dict[key])
Console.WriteLine("\t{0}", item);
}
}
}
}

Get previous and next item in an IEnumerable using LINQ

I have an IEnumerable of a custom type (which I got from a SelectMany).
I also have an item (myItem) in that IEnumerable, and I want the items immediately before and after it.
Currently, I'm doing it like this:
var previousItem = myIEnumerable.Reverse().SkipWhile(
i => i.UniqueObjectID != myItem.UniqueObjectID).Skip(1).FirstOrDefault();
I can get the next item by simply omitting the .Reverse().
Or, I could:
int index = myIEnumerable.ToList().FindIndex(
i => i.UniqueObjectID == myItem.UniqueObjectID);
and then use .ElementAt(index +/- 1) to get the previous or next item.
Which is better between the two options?
Is there an even better option available?
"Better" includes a combination of performance (memory and speed) and readability; with readability being my primary concern.
First off
"Better" includes a combination of performance (memory and speed)
In general you can't have both. The rule of thumb is that if you optimise for speed it will cost you memory, and if you optimise for memory it will cost you speed.
There is a better option, that performs well on both memory and speed fronts, and can be used in a readable manner (I'm not delighted with the function name, however, FindItemReturningPreviousItemFoundItemAndNextItem is a bit of a mouthful).
So, it looks like it's time for a custom find extension method, something like . . .
public static IEnumerable<T> FindSandwichedItem<T>(this IEnumerable<T> items, Predicate<T> matchFilling)
{
if (items == null)
throw new ArgumentNullException("items");
if (matchFilling == null)
throw new ArgumentNullException("matchFilling");
return FindSandwichedItemImpl(items, matchFilling);
}
private static IEnumerable<T> FindSandwichedItemImpl<T>(IEnumerable<T> items, Predicate<T> matchFilling)
{
using(var iter = items.GetEnumerator())
{
T previous = default(T);
while(iter.MoveNext())
{
if(matchFilling(iter.Current))
{
yield return previous;
yield return iter.Current;
if (iter.MoveNext())
yield return iter.Current;
else
yield return default(T);
yield break;
}
previous = iter.Current;
}
}
// If we get here nothing has been found so return three default values
yield return default(T); // Previous
yield return default(T); // Current
yield return default(T); // Next
}
You can cache the result of this to a list if you need to refer to the items more than once, but it returns the found item, preceded by the previous item, followed by the following item. e.g.
var sandwichedItems = myIEnumerable.FindSandwichedItem(item => item.objectId == "MyObjectId").ToList();
var previousItem = sandwichedItems[0];
var myItem = sandwichedItems[1];
var nextItem = sandwichedItems[2];
The defaults to return if it's the first or last item may need to change depending on your requirements.
Hope this helps.
For readability, I'd load the IEnumerable into a linked list:
var e = Enumerable.Range(0,100);
var itemIKnow = 50;
var linkedList = new LinkedList<int>(e);
var listNode = linkedList.Find(itemIKnow);
var next = listNode.Next.Value; //probably a good idea to check for null
var prev = listNode.Previous.Value; //ditto
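The null checks mentioned in the comments could look something like this. It is a minimal sketch: TryGetNeighbours is a made-up helper name, and falling back to default(T) at either end of the list is an assumption about what you want there.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LinkedListNeighbours
{
    // Hypothetical helper: returns false when the item is not in the sequence.
    // previous/next are left at default(T) when the item is first/last.
    public static bool TryGetNeighbours<T>(IEnumerable<T> source, T item, out T previous, out T next)
    {
        previous = default(T);
        next = default(T);

        // LinkedList<T>.Find returns null when the value is not found.
        var node = new LinkedList<T>(source).Find(item);
        if (node == null)
            return false;

        // Previous is null on the first node, Next is null on the last node.
        if (node.Previous != null)
            previous = node.Previous.Value;
        if (node.Next != null)
            next = node.Next.Value;
        return true;
    }
}
```

Keep in mind that copying into a LinkedList still walks and stores the whole sequence, so this is about readability rather than speed.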
By creating an extension method for establishing context to the current element you can use a Linq query like this:
var result = myIEnumerable.WithContext()
.Single(i => i.Current.UniqueObjectID == myItem.UniqueObjectID);
var previous = result.Previous;
var next = result.Next;
The extension would be something like this:
public class ElementWithContext<T>
{
public T Previous { get; private set; }
public T Next { get; private set; }
public T Current { get; private set; }
public ElementWithContext(T current, T previous, T next)
{
Current = current;
Previous = previous;
Next = next;
}
}
public static class LinqExtensions
{
public static IEnumerable<ElementWithContext<T>>
WithContext<T>(this IEnumerable<T> source)
{
T previous = default(T);
T current = source.FirstOrDefault();
// Concat rather than Union: Union is a set operation and would silently drop duplicate elements
foreach (T next in source.Concat(new[] { default(T) }).Skip(1))
{
yield return new ElementWithContext<T>(current, previous, next);
previous = current;
current = next;
}
}
}
You could cache the enumerable in a list
var myList = myIEnumerable.ToList();
iterate over it by index
for (int i = 0; i < myList.Count; i++)
then the current element is myList[i], the previous element is myList[i-1], and the next element is myList[i+1]
(Don't forget about the special cases of the first and last elements in the list.)
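One way to handle those special cases, as a rough sketch: IndexNeighbours.TryFind is a hypothetical helper, and returning default(T) for a missing neighbour is just one possible choice.

```csharp
using System;
using System.Collections.Generic;

public static class IndexNeighbours
{
    // Hypothetical helper: finds the first element matching the predicate and
    // returns its neighbours, using default(T) where a neighbour does not exist.
    public static bool TryFind<T>(IList<T> list, Func<T, bool> match, out T previous, out T next)
    {
        previous = default(T);
        next = default(T);
        for (int i = 0; i < list.Count; i++)
        {
            if (!match(list[i]))
                continue;
            if (i > 0)
                previous = list[i - 1];   // no previous element at index 0
            if (i < list.Count - 1)
                next = list[i + 1];       // no next element at the last index
            return true;
        }
        return false; // nothing matched
    }
}
```

With the list from the question this would be called as IndexNeighbours.TryFind(myList, x => x.UniqueObjectID == myItem.UniqueObjectID, out previousItem, out nextItem).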
You are really overcomplicating things.
Sometimes a plain for loop is the better tool, and I think it gives a clearer implementation of what you are trying to do:
var myList = myIEnumerable.ToList();
for (int i = 0; i < myList.Count; i++)
{
if (myList[i].UniqueObjectID == myItem.UniqueObjectID)
{
// Wrap around at both ends; adding Count keeps the index non-negative when i is 0.
previousItem = myList[(i - 1 + myList.Count) % myList.Count];
nextItem = myList[(i + 1) % myList.Count];
break;
}
}
Here is a LINQ extension method that returns the current item, along with the previous and the next. It yields ValueTuple<T, T, T> values to avoid allocations. The source is enumerated once.
/// <summary>
/// Projects each element of a sequence into a tuple that includes the previous
/// and the next element.
/// </summary>
public static IEnumerable<(T Previous, T Current, T Next)> WithPreviousAndNext<T>(
this IEnumerable<T> source, T firstPrevious = default, T lastNext = default)
{
ArgumentNullException.ThrowIfNull(source);
(T Previous, T Current, bool HasPrevious) queue = (default, firstPrevious, false);
foreach (var item in source)
{
if (queue.HasPrevious)
yield return (queue.Previous, queue.Current, item);
queue = (queue.Current, item, true);
}
if (queue.HasPrevious)
yield return (queue.Previous, queue.Current, lastNext);
}
Usage example:
var source = Enumerable.Range(1, 5);
Console.WriteLine($"Source: {String.Join(", ", source)}");
var result = source.WithPreviousAndNext(firstPrevious: -1, lastNext: -1);
Console.WriteLine($"Result: {String.Join(", ", result)}");
Output:
Source: 1, 2, 3, 4, 5
Result: (-1, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, -1)
To get the previous and the next of a specific item, you could use tuple deconstruction:
var (previous, current, next) = myIEnumerable
.WithPreviousAndNext()
.First(e => e.Current.UniqueObjectID == myItem.UniqueObjectID);
CPU
Depends entirely on where the object is in the sequence. If it is located at the end, I would expect the second option to be faster by more than a factor of 2 (but only a constant factor). If it is located at the beginning, the first will be faster because you don't traverse the whole list.
Memory
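The first is iterating the sequence without saving the sequence, so the memory hit will be very small. That only holds for the next-item query, though: the Reverse() call used for the previous item has to buffer the whole sequence before it can yield anything. The second solution will take as much memory as the length of the list * references + objects + overhead.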
I thought I would try to answer this using Zip from Linq.
string[] items = {"nought","one","two","three","four"};
var item = items[2];
var sandwiched =
items
.Zip( items.Skip(1), (previous,current) => new { previous, current } )
.Zip( items.Skip(2), (pair,next) => new { pair.previous, pair.current, next } )
.FirstOrDefault( triplet => triplet.current == item );
This will return an anonymous type {previous,current,next}.
Unfortunately this will only work for indexes 1, 2 and 3, because the first and last items have no triplet in which they are the current element.
string[] items = {"nought","one","two","three","four"};
var item = items[4];
var pad1 = Enumerable.Repeat( "", 1 );
var pad2 = Enumerable.Repeat( "", 2 );
var padded = pad1.Concat( items );
var next1 = items.Concat( pad1 );
var next2 = items.Skip(1).Concat( pad2 );
var sandwiched =
padded
.Zip( next1, (previous,current) => new { previous, current } )
.Zip( next2, (pair,next) => new { pair.previous, pair.current, next } )
.FirstOrDefault( triplet => triplet.current == item );
This version will work for all indexes.
Both versions use lazy evaluation courtesy of Linq.
Here are some extension methods as promised. The names are generic, so they can be reused with any type, and there are lookup overloads for finding the item whose previous or next neighbour you need. I would benchmark the solutions and then see where you could squeeze cycles out.
public static class ExtensionMethods
{
public static T Previous<T>(this List<T> list, T item) {
var index = list.IndexOf(item) - 1;
return index > -1 ? list[index] : default(T);
}
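public static T Next<T>(this List<T> list, T item) {
var found = list.IndexOf(item);
// IndexOf returns -1 for a missing item, which would otherwise make Next return list[0]
return found > -1 && found + 1 < list.Count ? list[found + 1] : default(T);
}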
public static T Previous<T>(this List<T> list, Func<T, Boolean> lookup) {
var item = list.SingleOrDefault(lookup);
var index = list.IndexOf(item) - 1;
return index > -1 ? list[index] : default(T);
}
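public static T Next<T>(this List<T> list, Func<T,Boolean> lookup) {
var item = list.SingleOrDefault(lookup);
var found = list.IndexOf(item);
// IndexOf returns -1 when nothing matched, which would otherwise make Next return list[0]
return found > -1 && found + 1 < list.Count ? list[found + 1] : default(T);
}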
public static T PreviousOrFirst<T>(this List<T> list, T item) {
if(list.Count() < 1)
throw new Exception("No array items!");
var previous = list.Previous(item);
return previous == null ? list.First() : previous;
}
public static T NextOrLast<T>(this List<T> list, T item) {
if(list.Count() < 1)
throw new Exception("No array items!");
var next = list.Next(item);
return next == null ? list.Last() : next;
}
public static T PreviousOrFirst<T>(this List<T> list, Func<T,Boolean> lookup) {
if(list.Count() < 1)
throw new Exception("No array items!");
var previous = list.Previous(lookup);
return previous == null ? list.First() : previous;
}
public static T NextOrLast<T>(this List<T> list, Func<T,Boolean> lookup) {
if(list.Count() < 1)
throw new Exception("No array items!");
var next = list.Next(lookup);
return next == null ? list.Last() : next;
}
}
And you can use them like this.
var previous = list.Previous(obj);
var next = list.Next(obj);
var previousWithLookup = list.Previous((o) => o.LookupProperty == otherObj.LookupProperty);
var nextWithLookup = list.Next((o) => o.LookupProperty == otherObj.LookupProperty);
var previousOrFirst = list.PreviousOrFirst(obj);
var nextOrLast = list.NextOrLast(obj);
var previousOrFirstWithLookup = list.PreviousOrFirst((o) => o.LookupProperty == otherObj.LookupProperty);
var nextOrLastWithLookup = list.NextOrLast((o) => o.LookupProperty == otherObj.LookupProperty);
I use the following technique:
var items = new[] { "Bob", "Jon", "Zac" };
var sandwiches = items
.Sandwich()
.ToList();
Which produces three tuples: (null, Bob, Jon), (Bob, Jon, Zac), (Jon, Zac, null).
Notice that there are nulls for the first Previous value, and the last Next value.
It uses the following extension method:
public static IEnumerable<(T Previous, T Current, T Next)> Sandwich<T>(this IEnumerable<T> source, T beforeFirst = default, T afterLast = default)
{
var sourceList = source.ToList();
T previous = beforeFirst;
T current = sourceList.FirstOrDefault();
foreach (var next in sourceList.Skip(1))
{
yield return (previous, current, next);
previous = current;
current = next;
}
yield return (previous, current, afterLast);
}
If you need it for every element in myIEnumerable, I'd just iterate through it once, keeping references to the two most recently seen elements. In the body of the loop I'd do the processing for the more recent of those two: the current element is its successor and the older one is its predecessor.
If you need it for only one element I'd choose your first approach.
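For the every-element case, the loop described above might look roughly like this. It is only a sketch: ForEachWithNeighbours is a made-up name, and passing default(T) where there is no neighbour is an assumption.

```csharp
using System;
using System.Collections.Generic;

public static class SlidingWindow
{
    // Hypothetical helper: calls process(previous, element, next) once per element.
    // The source is enumerated only once and nothing is buffered.
    public static void ForEachWithNeighbours<T>(IEnumerable<T> source, Action<T, T, T> process)
    {
        T older = default(T);   // predecessor of the element waiting to be processed
        T newer = default(T);   // the element waiting to be processed
        bool hasPending = false;

        foreach (var current in source)
        {
            // The pending element can be processed now that its successor is known.
            if (hasPending)
                process(older, newer, current);
            older = newer;
            newer = current;
            hasPending = true;
        }

        // The last element has no successor.
        if (hasPending)
            process(older, newer, default(T));
    }
}
```

Processing lags one step behind the enumeration, which is why the last element is handled after the loop.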
