Grouping by an unknown initial prefix - c#

Say I have the following array of strings as an input:
foo-139875913
foo-aeuefhaiu
foo-95hw9ghes
barbazabejgoiagjaegioea
barbaz8gs98ghsgh9es8h
9a8efa098fea0
barbaza98fyae9fghaefag
bazfa90eufa0e9u
bazgeajga8ugae89u
bazguea9guae
aifeaufhiuafhe
There are 3 different prefixes used here, "foo-", "barbaz" and "baz" - however these prefixes are not known ahead of time (they could be something completely different).
How could you establish what the different common prefixes are so that they could then be grouped by? This is made a bit tricky since in the data I've provided there's two that start with "bazg" and one that starts "bazf" where of course "baz" is the prefix.
What I've tried so far is sorting them into alphabetical order, and then looping through them in order and counting how many characters in a row are identical to the previous. If the number is different or when 0 characters are identical, it starts a new group. The problem with this is it falls over at the "bazg" and "bazf" problem I mentioned earlier and separates those into two different groups (one with just one element in it)
Edit: Alright, let's throw a few more rules in:
Longer potential groups should generally be preferred over shorter ones, unless there is a closely matching group of less than X characters difference in length. (So where X is 2, baz would be preferred over bazg)
A group must have at least Y elements in it or not be a group at all
It's okay to simply throw away elements that don't match any of the 'groups' to within the rules above.
To clarify the first rule in relation to the second, if X was 0 and Y was 2, then the two 'bazg' entries would be in a group, and the 'bazf' would be thrown away because it's on its own.

Well, here's a quick hack, probably O(something_bad):
IEnumerable<Tuple<String, IEnumerable<string>>> GuessGroups(IEnumerable<string> source, int minNameLength=0, int minGroupSize=1)
{
// TODO: error checking
return InnerGuessGroups(new Stack<string>(source.OrderByDescending(x => x)), minNameLength, minGroupSize);
}
IEnumerable<Tuple<String, IEnumerable<string>>> InnerGuessGroups(Stack<string> source, int minNameLength, int minGroupSize)
{
if(source.Any())
{
var tuple = ExtractTuple(GetBestGroup(source, minNameLength), source);
if (tuple.Item2.Count() >= minGroupSize)
yield return tuple;
foreach (var element in GuessGroups(source, minNameLength, minGroupSize))
yield return element;
}
}
Tuple<String, IEnumerable<string>> ExtractTuple(string prefix, Stack<string> source)
{
return Tuple.Create(prefix, PopWithPrefix(prefix, source).ToList().AsEnumerable());
}
IEnumerable<string> PopWithPrefix(string prefix, Stack<string> source)
{
while (source.Any() && source.Peek().StartsWith(prefix))
yield return source.Pop();
}
string GetBestGroup(IEnumerable<string> source, int minNameLength)
{
var s = new Stack<string>(source);
var counter = new DictionaryWithDefault<string, int>(0);
while(s.Any())
{
var g = GetCommonPrefix(s);
if(!string.IsNullOrEmpty(g) && g.Length >= minNameLength)
counter[g]++;
s.Pop();
}
return counter.OrderBy(c => c.Value).Last().Key;
}
string GetCommonPrefix(IEnumerable<string> coll)
{
return (from len in Enumerable.Range(0, coll.Min(s => s.Length)).Reverse()
let possibleMatch = coll.First().Substring(0, len)
where coll.All(f => f.StartsWith(possibleMatch))
select possibleMatch).FirstOrDefault();
}
public class DictionaryWithDefault<TKey, TValue> : Dictionary<TKey, TValue>
{
TValue _default;
public TValue DefaultValue {
get { return _default; }
set { _default = value; }
}
public DictionaryWithDefault() : base() { }
public DictionaryWithDefault(TValue defaultValue) : base() {
_default = defaultValue;
}
public new TValue this[TKey key]
{
get { return base.ContainsKey(key) ? base[key] : _default; }
set { base[key] = value; }
}
}
Example usage:
string[] input = {
"foo-139875913",
"foo-aeuefhaiu",
"foo-95hw9ghes",
"barbazabejgoiagjaegioea",
"barbaz8gs98ghsgh9es8h",
"barbaza98fyae9fghaefag",
"bazfa90eufa0e9u",
"bazgeajga8ugae89u",
"bazguea9guae",
"9a8efa098fea0",
"aifeaufhiuafhe"
};
GuessGroups(input, 3, 2).Dump();

Ok, well as discussed, the problem wasn't initially well defined, but here is how I'd go about it.
Create a tree T
Parse the list, for each element:
    for each letter in that element
        if a branch labelled with that letter exists then
            Increment the counter on that branch
            Descend that branch
        else
            Create a branch labelled with that letter
            Set its counter to 1
            Descend that branch
This gives you a tree where each of the leaves represents a word in your input. Each of the non-leaf nodes has a counter representing how many leaves are (eventually) attached to that node. Now you need a formula to weight the length of the prefix (the depth of the node) against the size of the prefix group. For now:
S = (a * d) + (b * q) // d = depth, q = quantity, a, b coefficients you'll tweak to get desired behaviour
So now you can iterate over each of the non-leaf nodes and assign them a score S. Then, to work out your groups you would:
For each non-leaf node
    Assign score S
    Insertion sort the node into a list, so the head is the highest scoring node
Starting at the root of the tree, traverse the nodes
    If the node is the highest scoring node in the list
        Mark it as a prefix
        Remove all nodes from the list that are a descendant of it
        Pop itself off the front of the list
        Return up the tree
This should give you a list of prefixes. The last part feels like some clever data structures or algorithms could speed it up (removing all the descendants feels particularly weak, but if your input size is small, I guess speed isn't too important).
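The tree-and-score idea above can be sketched roughly as follows. This is only an illustration under assumptions: the names TrieNode, PrefixGuesser and GuessPrefixes are made up, nodes with fewer than minGroupSize words are ignored (rule Y from the question), and the selection step is a simplified greedy pick by score (skipping any candidate that is an ancestor or descendant of an already chosen prefix) rather than the exact traversal described.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper type: one node per letter, with a counter of words passing through it.
class TrieNode
{
    public int Count;
    public readonly Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
}

static class PrefixGuesser
{
    // a weights the prefix length (depth d), b weights the group size (quantity q).
    public static List<string> GuessPrefixes(IEnumerable<string> words, double a, double b, int minGroupSize)
    {
        // Build the counting tree: create or follow one branch per letter, incrementing its counter.
        var root = new TrieNode();
        foreach (var word in words)
        {
            var node = root;
            foreach (var c in word)
            {
                if (!node.Children.TryGetValue(c, out var child))
                    node.Children[c] = child = new TrieNode();
                child.Count++;
                node = child;
            }
        }

        // Score every node that has enough words below it: S = (a * d) + (b * q).
        var scored = new List<(string Prefix, double Score)>();
        Collect(root, "", a, b, minGroupSize, scored);

        // Greedy selection (an assumption): best score first, skip anything overlapping a chosen prefix.
        var chosen = new List<string>();
        foreach (var candidate in scored.OrderByDescending(s => s.Score))
        {
            bool overlaps = chosen.Any(p =>
                p.StartsWith(candidate.Prefix, StringComparison.Ordinal) ||
                candidate.Prefix.StartsWith(p, StringComparison.Ordinal));
            if (!overlaps)
                chosen.Add(candidate.Prefix);
        }
        return chosen;
    }

    static void Collect(TrieNode node, string prefix, double a, double b, int minGroupSize,
                        List<(string Prefix, double Score)> scored)
    {
        foreach (var pair in node.Children)
        {
            // A child never has more words than its parent, so small branches can be pruned here.
            if (pair.Value.Count < minGroupSize)
                continue;
            var childPrefix = prefix + pair.Key;
            scored.Add((childPrefix, a * childPrefix.Length + b * pair.Value.Count));
            Collect(pair.Value, childPrefix, a, b, minGroupSize, scored);
        }
    }
}
```

With the sample input from the question, hand-picked weights a = 1 and b = 1.2, and a minimum group size of 2, this should come out as "barbaz", "foo-" and "baz". The result is quite sensitive to the weights, so they would need tuning for other data.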

I'm wondering if your requirements aren't off. It seems as if you are looking for a specific grouping size as opposed to specific key size requirements. Below is a program that will, based on a specified group size, break up the strings into the largest possible groups up to, and including, the group size specified. So if you specify a group size of 5, then it will group items on the smallest key possible to make a group of size 5. In your example it would group foo- as f since there is no need to make a more complex key as an identifier.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace ConsoleApplication2
{
class Program
{
/// <remarks><c>true</c> in returned dictionary key are groups over <paramref name="maxGroupSize"/></remarks>
public static Dictionary<bool,Dictionary<string, List<string>>> Split(int maxGroupSize, int keySize, IEnumerable<string> items)
{
var smallItems = from item in items
where item.Length < keySize
select item;
var largeItems = from item in items
where keySize <= item.Length // include items exactly keySize long; a strict < would silently drop them
select item;
var largeItemsq = (from item in largeItems
let key = item.Substring(0, keySize)
group item by key into x
select new { Key = x.Key, Items = x.ToList() } into aGrouping
group aGrouping by aGrouping.Items.Count() > maxGroupSize into x2
select x2).ToDictionary(a => a.Key, a => a.ToDictionary(a_ => a_.Key, a_ => a_.Items));
if (smallItems.Any())
{
var smallestLength = items.Aggregate(int.MaxValue, (acc, item) => Math.Min(acc, item.Length));
var smallItemsq = (from item in smallItems
let key = item.Substring(0, smallestLength)
group item by key into x
select new { Key = x.Key, Items = x.ToList() } into aGrouping
group aGrouping by aGrouping.Items.Count() > maxGroupSize into x2
select x2).ToDictionary(a => a.Key, a => a.ToDictionary(a_ => a_.Key, a_ => a_.Items));
return Combine(smallItemsq, largeItemsq);
}
return largeItemsq;
}
static Dictionary<bool, Dictionary<string,List<string>>> Combine(Dictionary<bool, Dictionary<string,List<string>>> a, Dictionary<bool, Dictionary<string,List<string>>> b) {
var x = new Dictionary<bool,Dictionary<string,List<string>>> {
{ true, null },
{ false, null }
};
foreach(var condition in new bool[] { true, false }) {
var hasA = a.ContainsKey(condition);
var hasB = b.ContainsKey(condition);
x[condition] = hasA && hasB ? a[condition].Concat(b[condition]).ToDictionary(c => c.Key, c => c.Value)
: hasA ? a[condition]
: hasB ? b[condition]
: new Dictionary<string, List<string>>();
}
return x;
}
public static Dictionary<string, List<string>> Group(int maxGroupSize, IEnumerable<string> items, int keySize)
{
var toReturn = new Dictionary<string, List<string>>();
var both = Split(maxGroupSize, keySize, items);
if (both.ContainsKey(false))
foreach (var key in both[false].Keys)
toReturn.Add(key, both[false][key]);
if (both.ContainsKey(true))
{
var keySize_ = keySize + 1;
var xs = from needsFix in both[true]
select needsFix;
foreach (var x in xs)
{
var fixedGroup = Group(maxGroupSize, x.Value, keySize_);
toReturn = toReturn.Concat(fixedGroup).ToDictionary(a => a.Key, a => a.Value);
}
}
return toReturn;
}
static Random rand = new Random(unchecked((int)DateTime.Now.Ticks));
const string allowedChars = "aaabbbbccccc"; // "aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ";
static readonly int maxAllowed = allowedChars.Length; // exclusive upper bound for rand.Next, so the last character can be picked too
static IEnumerable<string> GenerateText()
{
var list = new List<string>();
for (int i = 0; i < 100; i++)
{
var stringLength = rand.Next(3,25);
var chars = new List<char>(stringLength);
for (int j = stringLength; j > 0; j--)
chars.Add(allowedChars[rand.Next(0, maxAllowed)]);
var newString = chars.Aggregate(new StringBuilder(), (acc, item) => acc.Append(item)).ToString();
list.Add(newString);
}
return list;
}
static void Main(string[] args)
{
// runs 1000 times over autogenerated groups of sample text.
for (int i = 0; i < 1000; i++)
{
var s = GenerateText();
Go(s);
}
Console.WriteLine();
Console.WriteLine("DONE");
Console.ReadLine();
}
static void Go(IEnumerable<string> items)
{
var dict = Group(3, items, 1);
foreach (var key in dict.Keys)
{
Console.WriteLine(key);
foreach (var item in dict[key])
Console.WriteLine("\t{0}", item);
}
}
}
}

Related

Split a list of objects into sub-lists of contiguous elements using LINQ?

I have a simple class Item:
public class Item
{
public int Start { get; set;}
public int Stop { get; set;}
}
Given a List<Item> I want to split this into multiple sublists of contiguous elements. e.g. a method
List<Item[]> GetContiguousSequences(Item[] items)
Each element of the returned list should be an array of Item such that list[i].Stop == list[i+1].Start for each element
e.g.
{[1,10], [10,11], [11,20], [25,30], [31,40], [40,45], [45,100]}
=>
{{[1,10], [10,11], [11,20]}, {[25,30]}, {[31,40],[40,45],[45,100]}}
Here is a simple (and not guaranteed bug-free) implementation that simply walks the input data looking for discontinuities:
List<Item[]> GetContiguousSequences(Item []items)
{
var ret = new List<Item[]>();
var i1 = 0;
for(var i2=1;i2<items.Length;++i2)
{
//discontinuity
if(items[i2-1].Stop != items[i2].Start)
{
var num = i2 - i1;
ret.Add(items.Skip(i1).Take(num).ToArray());
i1 = i2;
}
}
//end of array
ret.Add(items.Skip(i1).Take(items.Length-i1).ToArray());
return ret;
}
It's not the most intuitive implementation and I wonder if there is a way to have a neater LINQ-based approach. I was looking at Take and TakeWhile thinking to find the indices where discontinuities occur but couldn't see an easy way to do this.
Is there a simple way to use IEnumerable LINQ algorithms to do this in a more descriptive (not necessarily performant) way?
I set up a simple test case here: https://dotnetfiddle.net/wrIa2J
I'm really not sure this is much better than your original, but for the purpose of another solution the general process is
Use Select to project a list working out a grouping
Use GroupBy to group by the above
Use Select again to project the grouped items to an array of Item
Use ToList to project the result to a list
public static List<Item[]> GetContiguousSequences2(Item []items)
{
var currIdx = 1;
return items.Select( (item,index) => new {
item = item,
index = index == 0 || items[index-1].Stop == item.Start ? currIdx : ++currIdx
})
.GroupBy(x => x.index, x => x.item)
.Select(x => x.ToArray())
.ToList();
}
Live example: https://dotnetfiddle.net/mBfHru
Another way is to do an aggregation using Aggregate. This means maintaining a final Result list and a Curr list where you can aggregate your sequences, adding them to the Result list as you find discontinuities. This method looks a little closer to your original
public static List<Item[]> GetContiguousSequences3(Item []items)
{
var res = items.Aggregate(new {Result = new List<Item[]>(), Curr = new List<Item>()}, (agg, item) => {
if(!agg.Curr.Any() || agg.Curr.Last().Stop == item.Start) {
agg.Curr.Add(item);
} else {
agg.Result.Add(agg.Curr.ToArray());
agg.Curr.Clear();
agg.Curr.Add(item);
}
return agg;
});
res.Result.Add(res.Curr.ToArray()); // Remember to add the last group
return res.Result;
}
Live example: https://dotnetfiddle.net/HL0VyJ
You can implement ContiguousSplit as a coroutine (an iterator method): loop over source and either add the item to the current range or, on a discontinuity, return the range and start a new one.
private static IEnumerable<Item[]> ContiguousSplit(IEnumerable<Item> source) {
List<Item> current = new List<Item>();
foreach (var item in source) {
if (current.Count > 0 && current[current.Count - 1].Stop != item.Start) {
yield return current.ToArray();
current.Clear();
}
current.Add(item);
}
if (current.Count > 0)
yield return current.ToArray();
}
then if you want materialization
List<Item[]> GetContiguousSequences(Item []items) => ContiguousSplit(items).ToList();
Your solution is okay. I don't think that LINQ adds any simplification or clarity in this situation. Here is a fast solution that I find intuitive:
static List<Item[]> GetContiguousSequences(Item[] items)
{
var result = new List<Item[]>();
int start = 0;
while (start < items.Length) {
int end = start + 1;
while (end < items.Length && items[end].Start == items[end - 1].Stop) {
end++;
}
int len = end - start;
var a = new Item[len];
Array.Copy(items, start, a, 0, len);
result.Add(a);
start = end;
}
return result;
}

Dictionary: find min/max values within a range of given keys and returning their keys

I have a ConcurrentDictionary with 500,000 items.
Keys are integers, values are Single.
for instance:
1, 8.65
2, 7.65
3, 8.89
4, 8.90
5, 7.95
...
500000, 7.68
How can I retrieve the min and max values within a specified key range of this dictionary, along with their respective keys?
Example: finding min/max data value between key=25 and key=477 and returning their keys.
I found some LINQ examples but the author warned it's potentially slower than foreach, and not doing exactly what I would like.
https://social.msdn.microsoft.com/Forums/vstudio/en-US/774aa579-2bc9-4458-93f4-af4b94169e7c/get-min-and-max-values-in-dictionary?forum=csharpgeneral
Performance is critical in my application.
Update 1:
I want to know the keys corresponding to the max/min.
The dictionary contains a time series. The values (Single) are ordered in time by their key: the higher the key, the more recent the data.
Update 2: benchmarks
I made a few benchmarks filling a concurrent dictionary with 929,452 records.
My CPU is i7-8550U, that means it has boost on single thread (3.8GHz) and lowers its frequency when the 4 cores (8 threads) run, roughly 2.6 GHz. So, I never expect multithread to be 4 times faster than single thread.
For each item of the dictionary, I look backward for the maximum of the previous 800 records.
Release build mode, x64:
Single thread, for loop: 14149 ms
Multithread, parallelfor loop: 4731 ms
Single thread, linq ONLY 1000 records: 17609 ms. Sorry LINQ.
LINQ is out. I will definitely go for the "for loop". Now I'd like to compare ConcurrentDictionary and a list with the for loop.
Update 3: simplification and benchmarks
Modifying my code using other containers. All are thread-safe for reading (if no modification by other thread at the same time).
Concurrent dictionary 1-thread of my objects (datetime, 2D-single): 14682 ms
List of my objects (datetime, 2D-single): 2071 ms
Concurrent dictionary 4-threads: 4611 ms
Array of objects (datetime, 2D-single): 1030 ms
Array of 1D-single (x4) and array 1D-datetime: 784 ms
Array of 1D-single (x4) and array 1D-datetime 4 threads: 229 ms.
In order to keep my input objects read-only and as fast as possible, I will have to write the processing results in another object. It's another theme now.
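As a rough illustration of the array-based variant benchmarked above, a single pass that tracks positions instead of values gives both extremes and their keys. This is a sketch under assumptions: the names RangeScan and FindMinMax are hypothetical, the key is taken to be the array index, and both range bounds are inclusive.

```csharp
using System;

static class RangeScan
{
    // Returns the indices of the smallest and largest value in values[from..to], both bounds inclusive.
    public static (int MinIndex, int MaxIndex) FindMinMax(float[] values, int from, int to)
    {
        if (values == null) throw new ArgumentNullException(nameof(values));
        if (from < 0 || to >= values.Length || from > to)
            throw new ArgumentOutOfRangeException(nameof(from));

        int minIndex = from, maxIndex = from;
        for (int i = from + 1; i <= to; i++)
        {
            // Strict comparisons keep the earliest index when values tie.
            if (values[i] < values[minIndex]) minIndex = i;
            if (values[i] > values[maxIndex]) maxIndex = i;
        }
        return (minIndex, maxIndex);
    }
}
```

A call such as `var (lo, hi) = RangeScan.FindMinMax(values, 25, 477);` then gives the two keys, and `values[lo]` and `values[hi]` give the min and max themselves.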
I'm not sure dictionary knows to optimize based on any relationship the keys may have.
As such, I think you're going to have to do the optimizing yourself. With one pass through the dictionary, you should be able to:
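// The values are Single, so track min and max as float.
float max = float.MinValue;
float min = float.MaxValue;
foreach (var k in dictionary.Keys) {
    // Skip keys outside the requested range.
    if (k < minIndex || k > maxIndex) continue;
    max = Math.Max(max, dictionary[k]);
    min = Math.Min(min, dictionary[k]);
}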
Now if your dictionary is sorted ahead of time, meaning key '50' will always be before key '60', you can abort as soon as possible and start as late as possible.
You should in fact see SortedDictionary
SINCE you updated your description
Use a SortedList, k is the index number of the list and the value is your double.
The Where will return all elements with keys in your range, and then the Max() and Min() methods will return the corresponding min and max values in the range.
var data = new Dictionary<int, double>();
for (int i = 1; i <= 10; i++)
{
data.Add(i, i * 1.1);
}
var minKey = 3;
var maxKey = 7;
var max = data.Where(x => x.Key >= minKey && x.Key <= maxKey).Max(y => y.Value);
var min = data.Where(x => x.Key >= minKey && x.Key <= maxKey).Min(y => y.Value);
Edit: Extension Method
If you're going to be using this a lot, you could turn it into an extension method so you can call it easily on any dictionary of type Dictionary<int, double>.
public static class Extensions
{
public static double GetMaxInRange(this Dictionary<int, double> data, int minKey, int maxKey)
{
return data.Where(x => x.Key >= minKey && x.Key <= maxKey).Max(y => y.Value);
}
public static double GetMinInRange(this Dictionary<int, double> data, int minKey, int maxKey)
{
return data.Where(x => x.Key >= minKey && x.Key <= maxKey).Min(y => y.Value);
}
}
Call it like this:
var max = data.GetMaxInRange(3, 7);
var min = data.GetMinInRange(3, 7);
Edit2:
If you want the KeyValuePair<int, double>, then this would be an option.
public static class Extensions
{
public static KeyValuePair<int, double> GetMaxInRange(this Dictionary<int, double> data, int minKey, int maxKey)
{
return data.Where(x => x.Key >= minKey && x.Key <= maxKey).OrderByDescending(y => y.Value).FirstOrDefault();
}
public static KeyValuePair<int, double> GetMinInRange(this Dictionary<int, double> data, int minKey, int maxKey)
{
return data.Where(x => x.Key >= minKey && x.Key <= maxKey).OrderBy(y => y.Value).FirstOrDefault();
}
}
Following is a LinqPad5 example, but don't you want something like this?
var inst = new Dictionary<int, double>();
inst.Add(1, 82.65);
inst.Add(2, 8.65);
inst.Add(3, 8.89);
inst.Add(4, 84.90);
inst.Add(5, 7.95);
var min = inst.Where(x => x.Value > 8).Min(x => x.Value);
Console.WriteLine(min);
var max = inst.Where(x => x.Value < 80).Max(x => x.Value);
Console.WriteLine(max);
Or if you're looking for the key you could do something like this:
var min = inst.Where(x => x.Value > 8).OrderBy(x => x.Value).First();
Console.WriteLine(min.Key);
var max = inst.Where(x => x.Value < 80).OrderByDescending(x => x.Value).First();
Console.WriteLine(max.Key);
However... there is a catch with the latter. How can you be certain that the first key is the one you need? (but that's not my issue.. just a side question)
If you only need the max and min, you can use:
Dim myResult = Aggregate order In myDict Into Max(order.Value), Min(order.Value)
'myResult.max for max and myResult.min as min
If you also want the matching entries (key and value) for the min and max, you could try this:
Dim myMinResult = From dic In myDic Where dic.Value = (Aggregate dicAgg In myDic Into Min(dicAgg.Value))
Dim myMaxResult = From dic In myDic Where dic.Value = (Aggregate dicAgg In myDic Into Max(dicAgg.Value))
MessageBox.Show("Min = key : " & myMinResult(0).Key.ToString & ", Value : " & myMinResult(0).Value.ToString)
MessageBox.Show("Max = key : " & myMaxResult(0).Key.ToString & ", Value : " & myMaxResult(0).Value.ToString)
I think this extension method for Dictionary can help you
static class DctExt {
public static void GetKeysByValueInRange(this Dictionary<int,float> baseDct, int start, int end, out List<int> byMinValue, out List<int> byMaxValue) {
byMinValue = new List<int>();
byMaxValue = new List<int>();
float max = GetMaxValue(baseDct, start, end);
float min = GetMinValue(baseDct, start, end);
foreach (KeyValuePair<int, float> kvp in baseDct) {
if(kvp.Value == min) {
byMinValue.Add(kvp.Key);
}
else if(kvp.Value == max) {
byMaxValue.Add(kvp.Key);
}
}
}
private static float GetMaxValue(Dictionary<int,float> baseDct, int start, int end) {
List<float> valuesOnRange = GetSpecificRange(baseDct, start, end);
return valuesOnRange.Max();
}
private static float GetMinValue(Dictionary<int,float> baseDct, int start, int end) {
List<float> valuesOnRange = GetSpecificRange(baseDct, start, end);
return valuesOnRange.Min();
}
private static List<float> GetSpecificRange(Dictionary<int,float> dct, int start, int end) {
List<float> res = new List<float>();
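// Look values up by key (start..end inclusive); ElementAt(i) treats i as a position and Dictionary order is not guaranteed.
for (int key = start; key <= end; key++) {
if (dct.TryGetValue(key, out float value)) res.Add(value);
}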
return res;
}
}
Here is the usage below
private static void Main() {
Dictionary<int, float> dct = new Dictionary<int, float> {
{1, 8.65f},
{2, 7.65f},
{3, 7.65f},
{4, 8.90f},
{5, 7.95f}
};
List<int> keysByMax = new List<int>();
List<int> keysByMin = new List<int>();
dct.GetKeysByValueInRange(1, 4, out keysByMin, out keysByMax);
foreach (var item in keysByMin) {
Console.Write($"min {item} ");
// prints min 2 min 3
}
Console.WriteLine();
foreach (var item in keysByMax) {
Console.Write($"max {item} ");
//prints max 4
}
Console.ReadLine();
}
Here is a class that encapsulates a List<T> and a ReaderWriterLock, it is thread safe to use, and will perform much better than a ConcurrentDictionary for ranged queries. It will perform even better if single-element operations are avoided, so that the ReaderWriterLock is not acquired multiple times during a search or bulk-update. For example instead of:
for (int i = 25; i < 477; i++)
{
if (list[i] > maxValue)
{
maxValue = list[i];
maxIndex = i;
}
}
...it is preferable to do it like this:
foreach (var entry in list.GetRange(25, 477))
{
if (entry.Value > maxValue)
{
maxValue = entry.Value;
maxIndex = entry.Index;
}
}
...because the method GetRange acquires and releases the lock only once. Not only is this faster, but the results will also be more consistent, because it is guaranteed that no updates will happen during the enumeration of the range.
public class ConcurrentList<T> : IEnumerable<T>
{
private readonly List<T> _list;
private readonly ReaderWriterLock _lock = new ReaderWriterLock();
public ConcurrentList()
{
_list = new List<T>();
}
public ConcurrentList(IEnumerable<T> collection)
{
_list = new List<T>(collection);
}
public int Count => ReadSafe(list => list.Count);
public T this[int index]
{
get => ReadSafe(list => list[index]);
set => WriteSafe(list => list[index] = value);
}
public IEnumerable<(int Index, T Value)> GetRange(int from, int to)
{
using (new DisposableReader(_lock))
{
for (int i = from; i < to; i++)
{
yield return (i, _list[i]);
}
}
}
public void Add(T item) => WriteSafe(list => list.Add(item));
public void AddRange(IEnumerable<T> r) => WriteSafe(list => list.AddRange(r));
public void Clear() => WriteSafe(list => list.Clear());
public void UpdateRange(IEnumerable<(int Index, T Value)> changes)
{
WriteSafe(list =>
{
foreach (var change in changes)
{
list[change.Index] = change.Value;
}
});
}
public IEnumerator<T> GetEnumerator()
{
using (new DisposableReader(_lock))
{
foreach (var item in _list)
{
yield return item;
}
}
}
IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
public TResult ReadSafe<TResult>(Func<List<T>, TResult> function)
{
_lock.AcquireReaderLock(Timeout.Infinite);
try
{
return function(_list);
}
finally
{
_lock.ReleaseReaderLock();
}
}
public void WriteSafe(Action<List<T>> action)
{
_lock.AcquireWriterLock(Timeout.Infinite);
try
{
action(_list);
}
finally
{
_lock.ReleaseWriterLock();
}
}
private struct DisposableReader : IDisposable
{
private readonly ReaderWriterLock _lock;
public DisposableReader(ReaderWriterLock obj)
{
_lock = obj;
_lock.AcquireReaderLock(Timeout.Infinite);
}
public void Dispose() => _lock.ReleaseReaderLock();
}
}
I have used helper methods for acquiring and releasing the lock, to avoid repeating the try - finally block in every property and method. Of course this is not necessary, it is just a matter of style.

How to compare two csv files by 2 columns?

I have 2 csv files
1.csv
spain;russia;japan
italy;russia;france
2.csv
spain;russia;japan
india;iran;pakistan
I read both files and add data to lists
var lst1= File.ReadAllLines("1.csv").ToList();
var lst2= File.ReadAllLines("2.csv").ToList();
Then I find all unique strings from both lists and add it to result lists
var rezList = lst1.Except(lst2).Union(lst2.Except(lst1)).ToList();
rezlist contains this data
[0] = "italy;russia;france"
[1] = "india;iran;pakistan"
Now I want to compare, and apply Except and Union, by the second and third columns of each row.
1.csv
spain;russia;japan
italy;russia;france
2.csv
spain;russia;japan
india;iran;pakistan
I think I need to split all rows by symbol ';' and make all 3 operations (except, distinct and union) but cannot understand how.
rezlist must contain
india;iran;pakistan
I added class
class StringLengthEqualityComparer : IEqualityComparer<string>
{
public bool Equals(string x, string y)
{
...
}
public int GetHashCode(string obj)
{
...
}
}
StringLengthEqualityComparer stringLengthComparer = new StringLengthEqualityComparer();
var rezList = lst1.Except(lst2,stringLengthComparer ).Union(lst2.Except(lst1,stringLengthComparer),stringLengthComparer).ToList();
Your question is not very clear: for instance, is india;iran;pakistan the desired result primarily because russia is at element[1]? Isn't it also included because element [2] pakistan does not match france and japan? Even though that's unclear, I assume the desired result comes from either situation.
Then there is this: "find all unique strings from both lists", which changes the nature of the problem dramatically. So, I take it that the desired result arises because "iran" appears in column[1] and nowhere else in column[1] in either file, and even if it did, that row would still be unique due to "pakistan" in col[2].
Also note that a data sample of 2 leaves room for a fair amount of error.
Trying to do it in one step makes it very confusing. Since eliminating dupes found in 1.CSV is pretty easy, do it first:
// parse "1.CSV"
List<string[]> lst1 = File.ReadAllLines(@"C:\Temp\1.csv").
Select(line => line.Split(';')).
ToList();
// parse "2.CSV"
List<string[]> lst2 = File.ReadAllLines(@"C:\Temp\2.csv").
Select(line => line.Split(';')).
ToList();
// extracting once speeds things up in the next step
// and leaves open the possibility of iterating in a method
List<List<string>> tgts = new List<List<string>>();
tgts.Add(lst1.Select(z => z[1]).Distinct().ToList());
tgts.Add(lst1.Select(z => z[2]).Distinct().ToList());
var tmpLst = lst2.Where(x => !tgts[0].Contains(x[1]) ||
!tgts[1].Contains(x[2])).
ToList();
That results in the items which are not in 1.CSV (no matching text in Col[1] nor Col[2]). If that is really all you need, you are done.
Getting unique rows within 2.CSV is trickier because you have to actually count the number of times each Col[1] item occurs to see if it is unique; then repeat for Col[2]. This uses GroupBy:
var unique = tmpLst.
GroupBy(g => g[1], (key, values) =>
new GroupItem(key,
values.ToArray()[0],
values.Count())
).Where(q => q.Count == 1).
GroupBy(g => g.Data[2], (key, values) => new
{
Item = string.Join(";", values.ToArray()[0]),
Count = values.Count()
}
).Where(q => q.Count == 1).Select(s => s.Item).
ToList();
The GroupItem class is trivial:
class GroupItem
{
public string Item { set; get; } // debug aide
public string[] Data { set; get; }
public int Count { set; get; }
public GroupItem(string n, string[] d, int c)
{
Item = n;
Data = d;
Count = c;
}
public override string ToString()
{
return string.Join(";", Data);
}
}
It starts with tmpLst and gets the rows with a unique element at [1]. It uses a class for storage since at this point we need the array data for further review.
The second GroupBy acts on those results, this time looking at col[2]. Finally, it selects the joined string data.
Results
Using 50,000 random items in File1 (1.3 MB), 15,000 in File2 (390 kb). There were no naturally occurring unique items, so I manually made 8 unique in 2.CSV and copied 2 of them into 1.CSV. The copies in 1.CSV should eliminate 2 of the 8 unique rows in 2.CSV, making the expected result 6 unique rows:
NepalX and ItalyX were the repeats in both files and they correctly eliminated each other.
With each step it is scanning and working with less and less data, which seems to make it pretty fast for 65,000 rows / 130,000 data elements.
Your GetHashCode() method in the EqualityComparer is buggy. Fixed version:
public int GetHashCode(string obj)
{
return obj.Split(';')[1].GetHashCode();
}
Now the result is correct:
// one result: "india;iran;pakistan"
btw. "StringLengthEqualityComparer" is not a good name ;-)
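For completeness, a comparer along these lines could look like the sketch below. It is not the asker's elided implementation: ColumnComparer and CsvDiff are hypothetical names, and which columns define equality is an assumption the caller has to make, since the question leaves that open. The important property is that Equals and GetHashCode are derived from the same key.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer: two rows are equal when the selected columns match.
class ColumnComparer : IEqualityComparer<string>
{
    private readonly int[] _columns;

    public ColumnComparer(params int[] columns) { _columns = columns; }

    // Builds the comparison key from the selected columns only.
    private string Key(string row)
    {
        var parts = row.Split(';');
        // Rows with too few columns contribute an empty value for the missing column.
        return string.Join(";", _columns.Select(c => c < parts.Length ? parts[c] : ""));
    }

    public bool Equals(string x, string y)
    {
        if (x == null || y == null) return x == y;
        return Key(x) == Key(y);
    }

    // Must agree with Equals: hash the same key that Equals compares.
    public int GetHashCode(string row) => row == null ? 0 : Key(row).GetHashCode();
}

static class CsvDiff
{
    // Rows whose key columns appear in only one of the two lists.
    public static List<string> SymmetricDifference(
        IEnumerable<string> a, IEnumerable<string> b, IEqualityComparer<string> comparer)
    {
        var listA = a.ToList();
        var listB = b.ToList();
        return listA.Except(listB, comparer)
                    .Union(listB.Except(listA, comparer), comparer)
                    .ToList();
    }
}
```

With the sample files, `new ColumnComparer(1)` (second column only, matching the fix above) leaves just india;iran;pakistan, while `new ColumnComparer(1, 2)` also keeps italy;russia;france because russia;france never occurs in 2.csv.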
private List<string> GetUnion(List<string> lst1, List<string> lst2)
{
    List<string> lstUnion = new List<string>();
    foreach (string value in lst1)
    {
        // Split once; columns 2 and 3 form the comparison key.
        string[] columns = value.Split(';');
        string valueColumn2 = columns[1];
        string valueColumn3 = columns[2];
        string result = lst2.FirstOrDefault(s => s.Contains(";" + valueColumn2 + ";" + valueColumn3));
        if (result != null && !lstUnion.Contains(result))
        {
            lstUnion.Add(result);
        }
    }
    // Return the matches instead of discarding them.
    return lstUnion;
}
class Program
{
static void Main(string[] args)
{
var lst1 = File.ReadLines(@"D:\test\1.csv").Select(x => new StringWrapper(x)).ToList();
var lst2 = File.ReadLines(@"D:\test\2.csv").Select(x => new StringWrapper(x));
var set = new HashSet<StringWrapper>(lst1);
set.SymmetricExceptWith(lst2);
foreach (var x in set)
{
Console.WriteLine(x.Value);
}
}
}
struct StringWrapper : IEquatable<StringWrapper>
{
public string Value { get; }
private readonly string _comparand0;
private readonly string _comparand14;
public StringWrapper(string value)
{
Value = value;
var split = value.Split(';');
_comparand0 = split[0];
_comparand14 = split[14];
}
public bool Equals(StringWrapper other)
{
return string.Equals(_comparand0, other._comparand0, StringComparison.OrdinalIgnoreCase)
&& string.Equals(_comparand14, other._comparand14, StringComparison.OrdinalIgnoreCase);
}
public override bool Equals(object obj)
{
if (ReferenceEquals(null, obj)) return false;
return obj is StringWrapper && Equals((StringWrapper) obj);
}
public override int GetHashCode()
{
unchecked
{
return ((_comparand0 != null ? StringComparer.OrdinalIgnoreCase.GetHashCode(_comparand0) : 0)*397)
^ (_comparand14 != null ? StringComparer.OrdinalIgnoreCase.GetHashCode(_comparand14) : 0);
}
}
}

How do I compare items from a list to all others without repetition?

I have a collection of objects (let's call them MyItem) and each MyItem has a method called IsCompatibleWith which returns a boolean saying whether it's compatible with another MyItem.
public class MyItem
{
...
public bool IsCompatibleWith(MyItem other) { ... }
...
}
A.IsCompatibleWith(B) will always be the same as B.IsCompatibleWith(A). If for example I have a collection containing 4 of these, I am trying to find a LINQ query that will run the method on each distinct pair of items in the same collection. So if my collection contains A, B, C and D I wish to do the equivalent of:
A.IsCompatibleWith(B); // A & B
A.IsCompatibleWith(C); // A & C
A.IsCompatibleWith(D); // A & D
B.IsCompatibleWith(C); // B & C
B.IsCompatibleWith(D); // B & D
C.IsCompatibleWith(D); // C & D
The code initially used was:
var result = from item in myItems
from other in myItems
where item != other &&
item.IsCompatibleWith(other)
select item;
but of course this will still do both A & B and B & A (which is not required and not efficient). Also it's probably worth noting that in reality these lists will be a lot bigger than 4 items, hence the desire for an optimal solution.
Hopefully this makes sense... any ideas?
Edit:
One possible query -
MyItem[] items = myItems.ToArray();
bool compatible = (from item in items
from other in items
where
Array.IndexOf(items, item) < Array.IndexOf(items, other) &&
!item.IsCompatibleWith(other)
select item).FirstOrDefault() == null;
Edit2: In the end switched to using the custom solution from LukeH as it was more efficient for bigger lists.
public bool AreAllCompatible()
{
using (var e = myItems.GetEnumerator())
{
var buffer = new List<MyItem>();
while (e.MoveNext())
{
if (buffer.Any(item => !item.IsCompatibleWith(e.Current)))
return false;
buffer.Add(e.Current);
}
}
return true;
}
Edit...
Judging by the "final query" added to your question, you need a method to determine if all the items in the collection are compatible with each other. Here's how to do it reasonably efficiently:
bool compatible = myItems.AreAllItemsCompatible();
// ...
public static bool AreAllItemsCompatible(this IEnumerable<MyItem> source)
{
using (var e = source.GetEnumerator())
{
var buffer = new List<MyItem>();
while (e.MoveNext())
{
foreach (MyItem item in buffer)
{
if (!item.IsCompatibleWith(e.Current))
return false;
}
buffer.Add(e.Current);
}
}
return true;
}
Original Answer...
I don't think there's an efficient way to do this using only the built-in LINQ methods.
It's easy enough to build your own though. Here's an example of the sort of code you'll need. I'm not sure exactly what results you're trying to return so I'm just writing a message to the console for each compatible pair. It should be easy enough to change it to yield the results that you need.
using (var e = myItems.GetEnumerator())
{
var buffer = new List<MyItem>();
while (e.MoveNext())
{
foreach (MyItem item in buffer)
{
if (item.IsCompatibleWith(e.Current))
{
Console.WriteLine(item + " is compatible with " + e.Current);
}
}
buffer.Add(e.Current);
}
}
(Note that although this is reasonably efficient, it does not preserve the original ordering of the collection. Is that an issue in your situation?)
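The buffering loop above can also be packaged as a reusable iterator that hands back each unordered pair once, leaving the caller to decide what to do with it. This is only a sketch: `PairHelpers` and `DistinctPairs` are made-up names, not part of LINQ, and it assumes value tuples (C# 7 or later) are available.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PairHelpers
{
    // Yields each unordered pair exactly once: (A,B) but never (B,A) or (A,A).
    public static IEnumerable<(T First, T Second)> DistinctPairs<T>(IEnumerable<T> source)
    {
        var buffer = new List<T>();
        foreach (T current in source)
        {
            // Pair the new item with everything that was seen before it.
            foreach (T earlier in buffer)
                yield return (earlier, current);
            buffer.Add(current);
        }
    }
}
```

With that in place, something like `PairHelpers.DistinctPairs(myItems).Where(p => p.First.IsCompatibleWith(p.Second))` would run the check once per pair. For A, B, C, D the pairs come out as AB, AC, BC, AD, BD, CD, which is the different ordering mentioned in the note above.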
This should do it:
var result = from item in myItems
from other in myItems
where item != other &&
myItems.IndexOf(item) < myItems.IndexOf(other) &&
item.IsCompatibleWith(other)
select item;
But I don't know whether this makes it faster, because the query has to look up both indices for every pair of rows.
Edit:
If you have an index in MyItem, you should use that instead of IndexOf. You can also remove "item != other" from the where clause, since it is now redundant.
Here's an idea:
Implement IComparable so that your MyItem becomes sortable, then run this linq-query:
var result = from item in myItems
from other in myItems
where item.CompareTo(other) < 0 &&
item.IsCompatibleWith(other)
select item;
If your MyItem collection is small enough, you can store the results of item.IsCompatibleWith(otherItem) in a boolean array:
var itemCount = myItems.Count();
var compatibilityTable = new bool[itemCount, itemCount];
var itemsToCompare = new List<MyItem>();
var i = 0;
var j = 0;
foreach (var item in myItems)
{
j = 0;
foreach (var other in itemsToCompare)
{
compatibilityTable[i,j] = item.IsCompatibleWith(other);
compatibilityTable[j,i] = compatibilityTable[i,j];
j++;
}
itemsToCompare.Add(item);
i++;
}
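var result = myItems.Where((item, index) =>
{
// Keep an item only if it is compatible with every other item.
// The diagonal is skipped because an item is never compared with itself.
for (var k = 0; k < itemCount; k++)
{
if (k != index && !compatibilityTable[index, k])
return false;
}
return true;
});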
So, we have
IEnumerable<MyItem> MyItems;
To get all the combinations we could use a function like this.
//returns all the k sized combinations from a list
public static IEnumerable<IEnumerable<T>> Combinations<T>(IEnumerable<T> list,
int k)
{
if (k == 0) return new[] {new T[0]};
return list.SelectMany((l, i) =>
Combinations(list.Skip(i + 1), k - 1).Select(c => (new[] {l}).Concat(c))
);
}
We can then apply this function to our problem like this.
var combinations = Combinations(MyItems, 2).Select(c => c.ToList<MyItem>());
var result = combinations.Where(c => c[0].IsCompatibleWith(c[1]));
This will perform IsCompatibleWith on all the combinations without repetition.
You could of course perform the checking inside the Combinations function. For further work you could make the Combinations function into an extension that takes a delegate with a variable number of parameters for several lengths of k.
EDIT: As I suggested above, if you wrote these extension methods
public static class Extensions
{
// Extension methods must be static members of a static class.
public static IEnumerable<IEnumerable<T>> Combinations<T>(this IEnumerable<T> list, int k)
{
if (k == 0) return new[] { new T[0] };
return list.SelectMany((l, i) =>
list.Skip(i + 1).Combinations<T>(k - 1)
.Select(c => (new[] { l }).Concat(c)));
}
// Returns every distinct pair for which the filter returns true.
public static IEnumerable<Tuple<T, T>> Combinations<T>(this IEnumerable<T> list,
Func<T, T, bool> filter)
{
return list.Combinations(2).Where(c =>
filter(c.First(), c.Last())).Select(c =>
Tuple.Create<T, T>(c.First(), c.Last()));
}
}
Then in your code you could do the rather more elegant (IMO)
var compatibleTuples = myItems.Combinations((a, b) => a.IsCompatibleWith(b));
then get at the compatible items with
foreach(var t in compatibleTuples)
{
// use t.Item1 and t.Item2 here
}

Decorate-Sort-Undecorate, how to sort an alphabetic field in descending order

I've got a large set of data for which computing the sort key is fairly expensive. What I'd like to do is use the DSU pattern where I take the rows and compute a sort key. An example:
        Qty  Name      Supplier
Row 1:   50  Widgets   IBM
Row 2:   48  Thingies  Dell
Row 3:   99  Googaws   IBM
To sort by Quantity and Supplier I could have the sort keys: 0050 IBM, 0048 Dell, 0099 IBM. The numbers are right-aligned and the text is left-aligned, everything is padded as needed.
If I need to sort by the Quantity in descending order I can just subtract the value from a constant (say, 10000) to build the sort keys: 9950 IBM, 9952 Dell, 9901 IBM.
How do I quickly/cheaply build a descending key for the alphabetic fields in C#?
[My data is all 8-bit ASCII w/ISO 8859 extension characters.]
Note: In Perl, this could be done by bit-complementing the strings:
$subkey = $string ^ ( "\xFF" x length $string );
Porting this solution straight into C# doesn't work:
subkey = encoding.GetString(encoding.GetBytes(stringval).
Select(x => (byte)(x ^ 0xff)).ToArray());
I suspect because of the differences in the way that strings are handled in C#/Perl. Maybe Perl is sorting in ASCII order and C# is trying to be smart?
Here's a sample piece of code that tries to accomplish this:
System.Text.ASCIIEncoding encoding = new System.Text.ASCIIEncoding();
List<List<string>> sample = new List<List<string>>() {
new List<string>() { "", "apple", "table" },
new List<string>() { "", "apple", "chair" },
new List<string>() { "", "apple", "davenport" },
new List<string>() { "", "orange", "sofa" },
new List<string>() { "", "peach", "bed" },
};
foreach(List<string> line in sample)
{
StringBuilder sb = new StringBuilder();
string key1 = line[1].PadRight(10, ' ');
string key2 = line[2].PadRight(10, ' ');
// Comment the next line to sort desc, desc
key2 = encoding.GetString(encoding.GetBytes(key2).
Select(x => (byte)(x ^ 0xff)).ToArray());
sb.Append(key2);
sb.Append(key1);
line[0] = sb.ToString();
}
List<List<string>> output = sample.OrderBy(p => p[0]).ToList();
return;
You can get to where you want, although I'll admit I don't know whether there's a better overall way.
The problem you have with the straight translation of the Perl method is that .NET simply will not allow you to be so laissez-faire with encoding. However, if as you say your data is all printable ASCII (i.e. it consists of characters with Unicode codepoints in the range 32..127; note that there is no such thing as '8-bit ASCII'), then you can do this:
key2 = encoding.GetString(encoding.GetBytes(key2).
Select(x => (byte)(32+95-(x-32))).ToArray());
In this expression I have been explicit about what I'm doing:
Take x (which I assume to be in 32..127)
Map the range to 0..95 to make it zero-based
Reverse by subtracting from 95
Add 32 to map back to the printable range
It's not very nice but it does work.
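As a rough illustration of the mapping described above, the same arithmetic can be applied to the characters directly, without the byte round trip. `DescendingKey.Reverse` is a hypothetical helper name. It assumes every character is already in the range 32..127, that the keys are padded to equal length, and that the reversed keys are compared ordinally, since a culture-sensitive comparison does not follow code point order.

```csharp
using System;
using System.Linq;

public static class DescendingKey
{
    // Mirrors each character within 32..127, so ' ' swaps with (char)127, '!' with '~', and so on.
    // Assumption: the input contains no characters outside that range.
    public static string Reverse(string key)
    {
        return new string(key.Select(c => (char)(32 + 95 - (c - 32))).ToArray());
    }
}
```

Sorting the padded keys "table", "chair" and "davenport" ascending by `DescendingKey.Reverse(key)` with `StringComparer.Ordinal` should then give table, davenport, chair, which is descending order for the original text.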
Just write an IComparer that would work as a chain of comparators.
If a stage compares equal, it should pass evaluation on to the next key part. If the result is less than or greater than, just return it.
You need something like this:
int comparison = 0;
for (int i = 0; i < n; i++)
{
// comparisonSign[i] is +1 for an ascending key part and -1 for a descending one
comparison = a[i].CompareTo(b[i]) * comparisonSign[i];
if (comparison != 0)
return comparison;
}
return comparison;
Or even simpler, you can go with:
list.OrderBy(i=>i.ID).ThenBy(i=>i.Name).ThenByDescending(i=>i.Supplier);
The first call returns an IOrderedEnumerable<>, which can then be sorted by additional fields.
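Applied to the sample rows from the question, the chained calls might look like the sketch below. The tuple shape and the names `SortDemo` and `SortRows` are assumptions made for illustration, and the ordinal comparer is one possible choice for the supplier field.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SortDemo
{
    // Sorts by Supplier descending, then by Qty ascending, without building string keys.
    public static List<(int Qty, string Name, string Supplier)> SortRows(
        IEnumerable<(int Qty, string Name, string Supplier)> rows)
    {
        return rows
            .OrderByDescending(r => r.Supplier, StringComparer.Ordinal)
            .ThenBy(r => r.Qty)
            .ToList();
    }
}
```

For the three rows in the question this should return Widgets (IBM, 50), Googaws (IBM, 99), then Thingies (Dell, 48).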
Answering my own question (but not satisfactorily). To construct a descending alphabetic key I used this code and then appended this subkey to the search key for the object:
if ( reverse )
subkey = encoding.GetString(encoding.GetBytes(subkey)
.Select(x => (byte)(0x80 - x)).ToArray());
rowobj.sortKey.Append(subkey);
Once I had the keys built, I couldn't just do this:
rowobjList.Sort();
Because the default comparator isn't in ASCII order (which my 0x80 - x trick relies on). So then I had to write an IComparable<RowObject> that used the Ordinal sorting:
public int CompareTo(RowObject other)
{
return String.Compare(this.sortKey, other.sortKey,
StringComparison.Ordinal);
}
This seems to work. I'm a little dissatisfied because it feels clunky in C# with the encoding/decoding of the string.
If the key computation is expensive, why compute a key at all? String comparison is not free either: it is an expensive loop through the characters, and it is not going to perform any better than a custom comparison loop.
In this test, sorting with a custom comparison performs about 3 times better than DSU.
Note that DSU key computation is not measured in this test, it's precomputed.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using Microsoft.VisualStudio.TestTools.UnitTesting;
namespace DSUPatternTest
{
[TestClass]
public class DSUPatternPerformanceTest
{
public class Row
{
public int Qty;
public string Name;
public string Supplier;
public string PrecomputedKey;
public void ComputeKey()
{
// Do not need StringBuilder here, String.Concat does better job internally.
PrecomputedKey =
Qty.ToString().PadLeft(4, '0') + " "
+ Name.PadRight(12, ' ') + " "
+ Supplier.PadRight(12, ' ');
}
public bool Equals(Row other)
{
if (ReferenceEquals(null, other)) return false;
if (ReferenceEquals(this, other)) return true;
return other.Qty == Qty && Equals(other.Name, Name) && Equals(other.Supplier, Supplier);
}
public override bool Equals(object obj)
{
if (ReferenceEquals(null, obj)) return false;
if (ReferenceEquals(this, obj)) return true;
if (obj.GetType() != typeof (Row)) return false;
return Equals((Row) obj);
}
public override int GetHashCode()
{
unchecked
{
int result = Qty;
result = (result*397) ^ (Name != null ? Name.GetHashCode() : 0);
result = (result*397) ^ (Supplier != null ? Supplier.GetHashCode() : 0);
return result;
}
}
}
public class RowComparer : IComparer<Row>
{
public int Compare(Row x, Row y)
{
int comparision;
comparision = x.Qty.CompareTo(y.Qty);
if (comparision != 0) return comparision;
comparision = x.Name.CompareTo(y.Name);
if (comparision != 0) return comparision;
comparision = x.Supplier.CompareTo(y.Supplier);
return comparision;
}
}
[TestMethod]
public void CustomLoopIsFaster()
{
var random = new Random();
var rows = Enumerable.Range(0, 5000).Select(i =>
new Row
{
Qty = (int) (random.NextDouble()*9999),
Name = random.Next().ToString(),
Supplier = random.Next().ToString()
}).ToList();
foreach (var row in rows)
{
row.ComputeKey();
}
var dsuSw = Stopwatch.StartNew();
var sortedByDSU = rows.OrderBy(i => i.PrecomputedKey).ToList();
var dsuTime = dsuSw.ElapsedMilliseconds;
var customSw = Stopwatch.StartNew();
var sortedByCustom = rows.OrderBy(i => i, new RowComparer()).ToList();
var customTime = customSw.ElapsedMilliseconds;
Trace.WriteLine(dsuTime);
Trace.WriteLine(customTime);
CollectionAssert.AreEqual(sortedByDSU, sortedByCustom);
Assert.IsTrue(dsuTime > customTime * 2.5);
}
}
}
If you need to build a sorter dynamically you can use something like this:
var comparerChain = new ComparerChain<Row>()
.By(r => r.Qty, false)
.By(r => r.Name, false)
.By(r => r.Supplier, false);
var sortedByCustom = rows.OrderBy(i => i, comparerChain).ToList();
Here is a sample implementation of comparer chain builder:
public class ComparerChain<T> : IComparer<T>
{
private List<PropComparer<T>> Comparers = new List<PropComparer<T>>();
public int Compare(T x, T y)
{
foreach (var comparer in Comparers)
{
var result = comparer._f(x, y);
if (result != 0)
return result;
}
return 0;
}
public ComparerChain<T> By<Tp>(Func<T,Tp> property, bool descending) where Tp:IComparable<Tp>
{
Comparers.Add(PropComparer<T>.By(property, descending));
return this;
}
}
public class PropComparer<T>
{
public Func<T, T, int> _f;
public static PropComparer<T> By<Tp>(Func<T,Tp> property, bool descending) where Tp:IComparable<Tp>
{
Func<T, T, int> ascendingCompare = (a, b) => property(a).CompareTo(property(b));
Func<T, T, int> descendingCompare = (a, b) => property(b).CompareTo(property(a));
return new PropComparer<T>(descending ? descendingCompare : ascendingCompare);
}
public PropComparer(Func<T, T, int> f)
{
_f = f;
}
}
It works a little bit slower, maybe because of the property-binding delegate calls.
