How to compare two csv files by 2 columns? - c#

I have 2 csv files
1.csv
spain;russia;japan
italy;russia;france
2.csv
spain;russia;japan
india;iran;pakistan
I read both files and add the data to lists:
var lst1= File.ReadAllLines("1.csv").ToList();
var lst2= File.ReadAllLines("2.csv").ToList();
Then I find all strings that are unique to either list and add them to a result list:
var rezList = lst1.Except(lst2).Union(lst2.Except(lst1)).ToList();
rezList contains this data:
[0] = "italy;russia;france"
[1] = "india;iran;pakistan"
Now I want to compare, and apply Except and Union, by the second and third columns of each row.
1.csv
spain;russia;japan
italy;russia;france
2.csv
spain;russia;japan
india;iran;pakistan
I think I need to split all rows on the ';' character and perform all 3 operations (Except, Distinct and Union), but I cannot understand how.
rezList must contain
india;iran;pakistan
I added a class:
class StringLengthEqualityComparer : IEqualityComparer<string>
{
public bool Equals(string x, string y)
{
...
}
public int GetHashCode(string obj)
{
...
}
}
StringLengthEqualityComparer stringLengthComparer = new StringLengthEqualityComparer();
var rezList = lst1.Except(lst2,stringLengthComparer ).Union(lst2.Except(lst1,stringLengthComparer),stringLengthComparer).ToList();

Your question is not very clear: for instance, is india;iran;pakistan the desired result primarily because russia is at element[1]? Isn't it also included because element[2] pakistan does not match france and japan? Even though that's unclear, I assume the desired result comes from either situation.
Then there is this: "find all unique strings from both lists", which changes the nature of the problem dramatically. So I take it that the desired result arises because "iran" appears nowhere else in column[1] in either file, and even if it did, that row would still be unique due to "pakistan" in col[2].
Also note that a data sample of 2 leaves room for a fair amount of error.
Trying to do it in one step makes it very confusing. Since eliminating dupes found in 1.CSV is pretty easy, do it first:
// parse "1.CSV"
List<string[]> lst1 = File.ReadAllLines(@"C:\Temp\1.csv").
Select(line => line.Split(';')).
ToList();
// parse "2.CSV"
List<string[]> lst2 = File.ReadAllLines(@"C:\Temp\2.csv").
Select(line => line.Split(';')).
ToList();
// extracting once speeds things up in the next step
// and leaves open the possibility of iterating in a method
List<List<string>> tgts = new List<List<string>>();
tgts.Add(lst1.Select(z => z[1]).Distinct().ToList());
tgts.Add(lst1.Select(z => z[2]).Distinct().ToList());
var tmpLst = lst2.Where(x => !tgts[0].Contains(x[1]) ||
!tgts[1].Contains(x[2])).
ToList();
That results in the rows of 2.CSV for which at least one of Col[1] and Col[2] has no matching text in 1.CSV. If that is really all you need, you are done.
Getting unique rows within 2.CSV is trickier because you have to actually count the number of times each Col[1] item occurs to see if it is unique; then repeat for Col[2]. This uses GroupBy:
var unique = tmpLst.
GroupBy(g => g[1], (key, values) =>
new GroupItem(key,
values.ToArray()[0],
values.Count())
).Where(q => q.Count == 1).
GroupBy(g => g.Data[2], (key, values) => new
{
Item = string.Join(";", values.ToArray()[0]),
Count = values.Count()
}
).Where(q => q.Count == 1).Select(s => s.Item).
ToList();
The GroupItem class is trivial:
class GroupItem
{
public string Item { set; get; } // debug aide
public string[] Data { set; get; }
public int Count { set; get; }
public GroupItem(string n, string[] d, int c)
{
Item = n;
Data = d;
Count = c;
}
public override string ToString()
{
return string.Join(";", Data);
}
}
It starts with tmpLst and gets the rows with a unique element at [1]. It uses a class for storage since at this point we need the array data for further review.
The second GroupBy acts on those results, this time looking at col[2]. Finally, it selects the joined string data.
Results
Using 50,000 random items in File1 (1.3 MB), 15,000 in File2 (390 KB). There were no naturally occurring unique items, so I manually made 8 unique in 2.CSV and copied 2 of them into 1.CSV. The copies in 1.CSV should eliminate 2 of the 8 unique rows in 2.CSV, making the expected result 6 unique rows.
NepalX and ItalyX were the repeats in both files and they correctly eliminated each other.
With each step it is scanning and working with less and less data, which seems to make it pretty fast for 65,000 rows / 130,000 data elements.

Your GetHashCode() method in the EqualityComparer is buggy. Fixed version:
public int GetHashCode(string obj)
{
return obj.Split(';')[1].GetHashCode();
}
Now the result is correct:
// one result: "india;iran;pakistan"
By the way, "StringLengthEqualityComparer" is not a good name ;-)
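For reference, here is a minimal sketch of what a complete comparer along these lines might look like. The names ColumnKeyComparer and CsvDiff are hypothetical, and the sketch assumes every row has enough ';'-separated fields for the chosen column indexes. Equals and GetHashCode are both derived from the same key, which is the property the fix above restores.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer: two rows are equal when the chosen columns match.
public sealed class ColumnKeyComparer : IEqualityComparer<string>
{
    private readonly int[] _columns;

    public ColumnKeyComparer(params int[] columns)
    {
        _columns = columns;
    }

    // Builds the comparison key from the selected fields only.
    // Assumption: each row has at least as many fields as the largest index.
    private string Key(string row)
    {
        string[] parts = row.Split(';');
        return string.Join(";", _columns.Select(c => parts[c]));
    }

    public bool Equals(string x, string y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return Key(x) == Key(y);
    }

    // Rows that compare equal must produce equal hash codes,
    // so the hash is taken from the same key that Equals uses.
    public int GetHashCode(string row)
    {
        return row == null ? 0 : Key(row).GetHashCode();
    }
}

public static class CsvDiff
{
    // Rows whose key (built from the given columns) occurs in only one list.
    public static List<string> SymmetricDifference(
        IEnumerable<string> a, IEnumerable<string> b, params int[] columns)
    {
        var cmp = new ColumnKeyComparer(columns);
        return a.Except(b, cmp).Union(b.Except(a, cmp), cmp).ToList();
    }
}
```

With the question's sample data, CsvDiff.SymmetricDifference(lst1, lst2, 1) compares on the second column only and leaves just india;iran;pakistan, while passing 1, 2 requires both columns to match and also keeps italy;russia;france.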

// Returns the rows of lst2 whose second and third columns match a row in lst1.
private List<string> GetUnion(List<string> lst1, List<string> lst2)
{
List<string> lstUnion = new List<string>();
foreach (string value in lst1)
{
// Split once and reuse the parts.
string[] columns = value.Split(';');
string valueColumn2 = columns[1];
string valueColumn3 = columns[2];
string result = lst2.FirstOrDefault(s => s.Contains(";" + valueColumn2 + ";" + valueColumn3));
if (result != null)
{
if (!lstUnion.Contains(result))
{
lstUnion.Add(result);
}
}
}
// The original version built this list but never returned it.
return lstUnion;
}

class Program
{
static void Main(string[] args)
{
var lst1 = File.ReadLines(@"D:\test\1.csv").Select(x => new StringWrapper(x)).ToList();
var lst2 = File.ReadLines(@"D:\test\2.csv").Select(x => new StringWrapper(x));
var set = new HashSet<StringWrapper>(lst1);
set.SymmetricExceptWith(lst2);
foreach (var x in set)
{
Console.WriteLine(x.Value);
}
}
}
struct StringWrapper : IEquatable<StringWrapper>
{
public string Value { get; }
private readonly string _comparand0;
private readonly string _comparand14;
public StringWrapper(string value)
{
Value = value;
var split = value.Split(';');
_comparand0 = split[0];
_comparand14 = split[14];
}
public bool Equals(StringWrapper other)
{
return string.Equals(_comparand0, other._comparand0, StringComparison.OrdinalIgnoreCase)
&& string.Equals(_comparand14, other._comparand14, StringComparison.OrdinalIgnoreCase);
}
public override bool Equals(object obj)
{
if (ReferenceEquals(null, obj)) return false;
return obj is StringWrapper && Equals((StringWrapper) obj);
}
public override int GetHashCode()
{
unchecked
{
return ((_comparand0 != null ? StringComparer.OrdinalIgnoreCase.GetHashCode(_comparand0) : 0)*397)
^ (_comparand14 != null ? StringComparer.OrdinalIgnoreCase.GetHashCode(_comparand14) : 0);
}
}
}

Related

How to dynamically GroupBy using Linq

There are several similar sounding posts, but none that do exactly what I want.
Okay, so imagine that I have the following data structure (simplified for this LinqPad example)
public class Row
{
public List<string> Columns { get; set; }
}
public List<Row> Data
=> new List<Row>
{
new Row { Columns = new List<string>{ "A","C","Field3"}},
new Row { Columns = new List<string>{ "A","D","Field3"}},
new Row { Columns = new List<string>{ "A","C","Field3"}},
new Row { Columns = new List<string>{ "B","D","Field3"}},
new Row { Columns = new List<string>{ "B","C","Field3"}},
new Row { Columns = new List<string>{ "B","D","Field3"}},
};
For the property "Data", the user will tell me which column ordinals to GroupBy; they may say "don't group by anything", or they may say "group by Column[1]" or "group by Column[0] and Column[1]".
If I want to group by a single column, I can use:
var groups = Data.GroupBy(d => d.Columns[i]);
And if I want to group by 2 columns, I can use:
var groups = Data.GroupBy(d => new { A = d.Columns[i1], B = d.Columns[i2] });
However, the number of columns is variable (zero -> many); Data could contain hundreds of columns and the user may want to GroupBy dozens of columns.
So the question is, how can I create this GroupBy at runtime (dynamically)?
Thanks
Griff
With that Row data structure, what you are asking for is relatively easy.
Start by implementing a custom IEqualityComparer<IEnumerable<string>>:
public class ColumnEqualityComparer : EqualityComparer<IEnumerable<string>>
{
public static readonly ColumnEqualityComparer Instance = new ColumnEqualityComparer();
private ColumnEqualityComparer() { }
public override int GetHashCode(IEnumerable<string> obj)
{
if (obj == null) return 0;
// You can implement better hash function
int hashCode = 0;
foreach (var item in obj)
hashCode ^= item != null ? item.GetHashCode() : 0;
return hashCode;
}
public override bool Equals(IEnumerable<string> x, IEnumerable<string> y)
{
if (x == y) return true;
if (x == null || y == null) return false;
return x.SequenceEqual(y);
}
}
Now you can have a method like this:
public IEnumerable<IGrouping<IEnumerable<string>, Row>> GroupData(IEnumerable<int> columnIndexes = null)
{
if (columnIndexes == null) columnIndexes = Enumerable.Empty<int>();
return Data.GroupBy(r => columnIndexes.Select(c => r.Columns[c]), ColumnEqualityComparer.Instance);
}
Note the grouping Key type is IEnumerable<string> and contains the selected row values specified by the columnIndexes parameter, that's why we needed a custom equality comparer (otherwise they will be compared by reference, which doesn't produce the required behavior).
For instance, to group by columns 0 and 2 you could use something like this:
var result = GroupData(new [] { 0, 2 });
Passing null or empty columnIndexes will effectively produce single group, i.e. no grouping.
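As a possible alternative to the custom comparer, the selected column values can be joined into a single string key, so the default string equality does the work. The sketch below is only an illustration: DynamicGrouping and CountByColumns are hypothetical names, and it assumes the separator character never occurs inside a column value.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class DynamicGrouping
{
    // Unit separator; assumed never to appear inside a column value.
    private const string Separator = "\u001F";

    // Groups rows by the values at the given column indexes and
    // returns how many rows fall into each group.
    public static Dictionary<string, int> CountByColumns(
        IEnumerable<IList<string>> rows, params int[] columnIndexes)
    {
        return rows
            .GroupBy(r => string.Join(Separator, columnIndexes.Select(c => r[c])))
            .ToDictionary(g => g.Key, g => g.Count());
    }
}
```

Grouping the six sample rows by columns 0 and 1 gives four groups with counts 2, 1, 2 and 1; passing no indexes gives one group containing all six rows, because every key is the empty string.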
You can use a recursive function to build a dynamic lambda expression, but you must hard-code the columns in the function.

How to count occurrences of a number stored in a file containing multiple delimiters?

This is my input, stored in a file:
50|Carbon|Mercury|P:4;P:00;P:1
90|Oxygen|Mars|P:10;P:4;P:00
90|Serium|Jupiter|P:4;P:16;P:10
85|Hydrogen|Saturn|P:00;P:10;P:4
Now I take P:4 from my first row, then P:00, and so on, and want to count the occurrences of each in every other row, so the expected output is:
P:4 3 (found in 2nd row, 3rd row, 4th row (last cell))
P:00 2 (found in 2nd row, 4th row)
P:1 0 (no occurrences)
P:10 1
P:16 0
etc.
Likewise, I would like to print the occurrences of each and every proportion.
So far I have succeeded in splitting row by row and storing the data in my class objects like this:
public class Planets
{
//My rest fields
public string ProportionConcat { get; set; }
public List<proportion> proportion { get; set; }
}
public class proportion
{
public int Number { get; set; }
}
I have already filled my planet objects as shown below, and my final list of planet objects looks like this:
List<Planets> Planets = new List<Planets>();
Planets[0]:
{
Number:50
name: Carbon
object:Mercury
ProportionConcat:P:4;P:00;P:1
proportion[0]:
{
Number:4
},
proportion[1]:
{
Number:00
},
proportion[2]:
{
Number:1
}
}
Etc...
I know I can loop through and perform the search and count, but then 2 to 3 loops will be required and the code will be a little messy, so I want some better code to do this.
Now how do I search for and count every other proportion in my planet list object?
Well, if you have parsed the proportions, you can create a new class for the output data:
// Class to store the result
public class Values
{
public int Count; // number of times the proportion occurs
public readonly HashSet<int> Rows = new HashSet<int>(); // row numbers in which it occurs
/// <summary>Records one more occurrence of the proportion.</summary>
/// <param name="rowNumber">Number of the row in which the proportion occurs</param>
public void Increment(int rowNumber)
{
++Count; // one more occurrence
Rows.Add(rowNumber); // remember the row it occurred in
}
}
And use this code to fill it. I'm not sure it's "messy", and I don't see the need to complicate the code with LINQ. What do you think about it?
// Dictionary holding the results: the key is the proportion, the value records how often it occurs and in which rows
var result = new Dictionary<int, Values>();
for (var i = 0; i < Planets.Count; i++) // for rather than foreach, so that i is the row number
{
var planet = Planets[i];
foreach (var proportion in planet.proportion)
{
if (!result.ContainsKey(proportion.Number)) // first time we see this proportion
result.Add(proportion.Number, new Values()); // add it with an empty result object
result[proportion.Number].Increment(i); // count the occurrence and record the row number
}
}
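The same idea can be sketched end to end, starting from the raw lines rather than from pre-parsed objects. This is only an illustration under a few assumptions: ProportionCounter and RowsByProportion are hypothetical names, the proportions are taken to be the last '|'-separated field, and the tokens are kept as strings (so "P:00" stays distinct from "P:0") instead of being parsed to int as above.

```csharp
using System.Collections.Generic;

public static class ProportionCounter
{
    // Maps each proportion token (e.g. "P:4") to the zero-based
    // numbers of the rows it appears in.
    public static Dictionary<string, List<int>> RowsByProportion(IList<string> lines)
    {
        var result = new Dictionary<string, List<int>>();
        for (int i = 0; i < lines.Count; i++)
        {
            // Assumption: the proportions are the last '|'-separated field.
            string[] fields = lines[i].Split('|');
            string last = fields[fields.Length - 1];
            foreach (string token in last.Split(';'))
            {
                List<int> rows;
                if (!result.TryGetValue(token, out rows))
                {
                    rows = new List<int>();
                    result.Add(token, rows);
                }
                rows.Add(i);
            }
        }
        return result;
    }
}
```

For the four sample lines, "P:4" maps to rows 0 to 3 and "P:16" to row 2 only. Subtracting one from each list's Count gives the number of further rows a proportion occurs in, which appears to be what the question is after.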
You can use var count = Regex.Matches(lineString, input).Count;. Try this example
var list = new List<string>
{
"50|Carbon|Mercury|P:4;P:00;P:1",
"90|Oxygen|Mars|P:10;P:4;P:00",
"90|Serium|Jupiter|P:4;P:16;P:10",
"85|Hydrogen|Saturn|P:00;P:10;P:4"
};
int totalCount;
var result = CountWords(list, "P:4", out totalCount);
Console.WriteLine("Total Found: {0}", totalCount);
foreach (var foundWords in result)
{
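Console.WriteLine("{0}: {1}", foundWords.LineNumber, foundWords.Found); // printing the object itself would only show its type name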
}
public class FoundWords
{
public string LineNumber { get; set; }
public int Found { get; set; }
}
private List<FoundWords> CountWords(List<string> words, string input, out int total)
{
total = 0;
int[] index = {0};
var result = new List<FoundWords>();
foreach (var f in words.Select(word => new FoundWords {Found = Regex.Matches(word, input).Count, LineNumber = "Line Number: " + (index[0] + 1)}))
{
result.Add(f);
total += f.Found;
index[0]++;
}
return result;
}
I made a DotNetFiddle for you here: https://dotnetfiddle.net/z9QwmD
string raw =
@"50|Carbon|Mercury|P:4;P:00;P:1
90|Oxygen|Mars|P:10;P:4;P:00
90|Serium|Jupiter|P:4;P:16;P:10
85|Hydrogen|Saturn|P:00;P:10;P:4";
string[] splits = raw.Split(
new string[] { "|", ";", "\r\n", "\n" }, // "\r\n" first, so Windows line endings leave no stray '\r' on a token
StringSplitOptions.None
);
foreach (string p in splits.Where(s => s.ToUpper().StartsWith(("P:"))).Distinct())
{
Console.WriteLine(
string.Format("{0} - {1}",
p,
splits.Count(s => s.ToUpper() == p.ToUpper())
)
);
}
Basically, you can use .Split to split on multiple delimiters at once, it's pretty straightforward. After that, everything is gravy :).
Obviously my code simply outputs the results to the console, but that part is fairly easy to change. Let me know if there's anything you didn't understand.

fill in delimited sequence

I have a file that contains a list of delimited sequence numbers as a record key. I need to fill in the missing sequence. So if I have
8
8.2
8.3.4.1
I need to add
8.1
8.3
8.3.1
8.3.2
8.3.3
8.3.4
I have come up with a few algorithms but they're all horribly complex and have too many cases. Is there an easy way to do this or do I have to plod through? I'm using c# but Java would do.
Not sure if my solution is easy to understand, but let's try. The idea is to recursively insert the missing sequences between existing ones.
First, you need to parse your file to create a list of items representing the existing sequences. Every item should have a reference to the next one (the linked-list idea).
public class Item
{
public int Value { get; set; }
public Item SubItem { get; set; }
public Item NextItem { get; set; }
public Item(int value, Item subItem)
{
Value = value;
SubItem = subItem;
}
public Item CreatePreviousItem()
{
if (SubItem == null)
{
return Value == 1 ? null : new Item(Value - 1, null);
}
return new Item(Value, SubItem.CreatePreviousItem());
}
public bool IsItemMissingPrior(Item item)
{
if (item == null)
{
return false;
}
return
item.Value - Value > 1
|| (SubItem == null && item.SubItem != null && item.SubItem.Value > 1) //edge case
|| (SubItem != null && SubItem.IsItemMissingPrior(item.SubItem));
}
public override string ToString()
{
return Value + (SubItem != null ? "." + SubItem : "");
}
}
Assuming that the sequences are delimited by newlines, you can use the following Parse method:
private List<Item> Parse(string s)
{
var result = new List<Item>();
var numberLines = s.Split(new[] {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries);
foreach (var numberLine in numberLines)
{
var numbers = numberLine.Split(new[] {'.'}).Reverse();
Item itemInstance = null;
foreach (var number in numbers)
{
itemInstance = new Item(Convert.ToInt32(number), itemInstance);
}
if (result.Count > 0)
{
result.Last().NextItem = itemInstance;
}
result.Add(itemInstance);
}
return result;
}
Here is a recursive method which inserts the missing sequences between two existing ones:
private void UpdateSequence(Item item)
{
if (item.IsItemMissingPrior(item.NextItem))
{
var inBetweenItem = item.NextItem.CreatePreviousItem();
inBetweenItem.NextItem = item.NextItem;
item.NextItem = inBetweenItem;
UpdateSequence(item);
}
}
And finally the use case:
var inputItems = Parse(inputString);
foreach (var item in inputItems)
{
UpdateSequence(item);
}
That's it. To see the result, you just need to get the first item from the list and keep moving forward using NextItem property. For example
var displayItem = inputItems.FirstOrDefault();
while (displayItem != null)
{
Console.WriteLine(displayItem.ToString());
displayItem = displayItem.NextItem;
}
Hope it helps.
One easy way (though not optimal in some cases) is to maintain a set of existing keys. For each key in your initial sequence, add all the preceding keys to that set. This can be done in two loops: in the inner loop you add the key to the set and decrease its last number by one while that number is greater than zero; in the outer loop you shorten the key by one level.
Then you just need to output the set in sorted order.
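A minimal sketch of this set-based approach might look like the following. SequenceFiller is a hypothetical name, and the sketch makes one assumption that goes beyond the description: top-level numbers are not back-filled, because the question's expected output does not list 1 to 7. Keys are compared number by number, so 8.10 sorts after 8.9.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SequenceFiller
{
    public static List<string> Fill(IEnumerable<string> keys)
    {
        var seen = new HashSet<string>();
        foreach (string key in keys)
        {
            int[] parts = key.Split('.').Select(s => int.Parse(s)).ToArray();
            // Outer loop: shorten the key by one level each time.
            for (int len = parts.Length; len >= 1; len--)
            {
                int last = parts[len - 1];
                // Assumption: single-level keys are kept but not back-filled.
                int lowest = len == 1 ? last : 1;
                // Inner loop: walk the last number down to 'lowest'.
                for (int n = last; n >= lowest; n--)
                {
                    IEnumerable<int> candidate = parts.Take(len - 1).Concat(new[] { n });
                    seen.Add(string.Join(".", candidate));
                }
            }
        }
        return seen.OrderBy(k => k, Comparer<string>.Create(CompareKeys)).ToList();
    }

    // Numeric, level-by-level comparison; a shorter key sorts before its children.
    private static int CompareKeys(string a, string b)
    {
        int[] x = a.Split('.').Select(s => int.Parse(s)).ToArray();
        int[] y = b.Split('.').Select(s => int.Parse(s)).ToArray();
        for (int i = 0; i < Math.Min(x.Length, y.Length); i++)
        {
            if (x[i] != y[i]) return x[i].CompareTo(y[i]);
        }
        return x.Length.CompareTo(y.Length);
    }
}
```

Calling Fill on 8, 8.2 and 8.3.4.1 returns 8, 8.1, 8.2, 8.3, 8.3.1, 8.3.2, 8.3.3, 8.3.4, 8.3.4.1, that is, the three original keys plus the six missing ones from the question.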

Grouping by an unknown initial prefix

Say I have the following array of strings as an input:
foo-139875913
foo-aeuefhaiu
foo-95hw9ghes
barbazabejgoiagjaegioea
barbaz8gs98ghsgh9es8h
9a8efa098fea0
barbaza98fyae9fghaefag
bazfa90eufa0e9u
bazgeajga8ugae89u
bazguea9guae
aifeaufhiuafhe
There are 3 different prefixes used here, "foo-", "barbaz" and "baz" - however these prefixes are not known ahead of time (they could be something completely different).
How could you establish what the different common prefixes are so that they could then be grouped by? This is made a bit tricky since in the data I've provided there's two that start with "bazg" and one that starts "bazf" where of course "baz" is the prefix.
What I've tried so far is sorting them into alphabetical order, and then looping through them in order and counting how many characters in a row are identical to the previous. If the number is different or when 0 characters are identical, it starts a new group. The problem with this is it falls over at the "bazg" and "bazf" problem I mentioned earlier and separates those into two different groups (one with just one element in it)
Edit: Alright, let's throw a few more rules in:
Longer potential groups should generally be preferred over shorter ones, unless there is a closely matching group of less than X characters difference in length. (So where X is 2, baz would be preferred over bazg)
A group must have at least Y elements in it or not be a group at all
It's okay to simply throw away elements that don't match any of the 'groups' to within the rules above.
To clarify the first rule in relation to the second: if X was 0 and Y was 2, then the two 'bazg' entries would be in a group, and the 'bazf' would be thrown away because it's on its own.
Well, here's a quick hack, probably O(something_bad):
IEnumerable<Tuple<String, IEnumerable<string>>> GuessGroups(IEnumerable<string> source, int minNameLength=0, int minGroupSize=1)
{
// TODO: error checking
return InnerGuessGroups(new Stack<string>(source.OrderByDescending(x => x)), minNameLength, minGroupSize);
}
IEnumerable<Tuple<String, IEnumerable<string>>> InnerGuessGroups(Stack<string> source, int minNameLength, int minGroupSize)
{
if(source.Any())
{
var tuple = ExtractTuple(GetBestGroup(source, minNameLength), source);
if (tuple.Item2.Count() >= minGroupSize)
yield return tuple;
foreach (var element in GuessGroups(source, minNameLength, minGroupSize))
yield return element;
}
}
Tuple<String, IEnumerable<string>> ExtractTuple(string prefix, Stack<string> source)
{
return Tuple.Create(prefix, PopWithPrefix(prefix, source).ToList().AsEnumerable());
}
IEnumerable<string> PopWithPrefix(string prefix, Stack<string> source)
{
while (source.Any() && source.Peek().StartsWith(prefix))
yield return source.Pop();
}
string GetBestGroup(IEnumerable<string> source, int minNameLength)
{
var s = new Stack<string>(source);
var counter = new DictionaryWithDefault<string, int>(0);
while(s.Any())
{
var g = GetCommonPrefix(s);
if(!string.IsNullOrEmpty(g) && g.Length >= minNameLength)
counter[g]++;
s.Pop();
}
return counter.OrderBy(c => c.Value).Last().Key;
}
string GetCommonPrefix(IEnumerable<string> coll)
{
return (from len in Enumerable.Range(0, coll.Min(s => s.Length)).Reverse()
let possibleMatch = coll.First().Substring(0, len)
where coll.All(f => f.StartsWith(possibleMatch))
select possibleMatch).FirstOrDefault();
}
public class DictionaryWithDefault<TKey, TValue> : Dictionary<TKey, TValue>
{
TValue _default;
public TValue DefaultValue {
get { return _default; }
set { _default = value; }
}
public DictionaryWithDefault() : base() { }
public DictionaryWithDefault(TValue defaultValue) : base() {
_default = defaultValue;
}
public new TValue this[TKey key]
{
get { return base.ContainsKey(key) ? base[key] : _default; }
set { base[key] = value; }
}
}
Example usage:
string[] input = {
"foo-139875913",
"foo-aeuefhaiu",
"foo-95hw9ghes",
"barbazabejgoiagjaegioea",
"barbaz8gs98ghsgh9es8h",
"barbaza98fyae9fghaefag",
"bazfa90eufa0e9u",
"bazgeajga8ugae89u",
"bazguea9guae",
"9a8efa098fea0",
"aifeaufhiuafhe"
};
GuessGroups(input, 3, 2).Dump();
Ok, well as discussed, the problem wasn't initially well defined, but here is how I'd go about it.
Create a tree T
Parse the list, for each element:
    for each letter in that element
        if a branch labeled with that letter exists then
            Increment the counter on that branch
            Descend that branch
        else
            Create a branch labelled with that letter
            Set its counter to 1
            Descend that branch
This gives you a tree where each of the leaves represents a word in your input. Each of the non-leaf nodes has a counter representing how many leaves are (eventually) attached to that node. Now you need a formula to weight the length of the prefix (the depth of the node) against the size of the prefix group. For now:
S = (a * d) + (b * q) // d = depth, q = quantity, a, b coefficients you'll tweak to get desired behaviour
So now you can iterate over each of the non-leaf node and assign them a score S. Then, to work out your groups you would
For each non-leaf node
    Assign score S
    Insertion sort the node in to a list, so the head is the highest scoring node
Starting at the root of the tree, traverse the nodes
    If the node is the highest scoring node in the list
        Mark it as a prefix
        Remove all nodes from the list that are a descendant of it
        Pop itself off the front of the list
        Return up the tree
This should give you a list of prefixes. The last part feels like some clever data structures or algorithms could speed it up (removing all the descendants feels particularly weak, but if your input size is small, I guess speed isn't too important).
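The tree-building and scoring steps could be sketched roughly as follows. This is an illustration only: PrefixNode, ScoredPrefix and PrefixTree are hypothetical names, a leaf is taken to be any node without children, and the final traversal that picks the prefixes and discards their descendants is left out.

```csharp
using System.Collections.Generic;
using System.Linq;

public sealed class PrefixNode
{
    public int Count; // number of input words that pass through this node
    public readonly Dictionary<char, PrefixNode> Children = new Dictionary<char, PrefixNode>();
}

public sealed class ScoredPrefix
{
    public string Prefix;
    public int Count;
    public double Score;
}

public static class PrefixTree
{
    // Builds the counted tree: one branch per letter, one counter per branch.
    public static PrefixNode Build(IEnumerable<string> words)
    {
        var root = new PrefixNode();
        foreach (string word in words)
        {
            PrefixNode node = root;
            foreach (char c in word)
            {
                PrefixNode child;
                if (!node.Children.TryGetValue(c, out child))
                {
                    child = new PrefixNode();
                    node.Children.Add(c, child);
                }
                child.Count++;
                node = child;
            }
        }
        return root;
    }

    // Scores every non-leaf node with S = a * depth + b * count,
    // highest score first.
    public static List<ScoredPrefix> Score(PrefixNode root, double a, double b)
    {
        var result = new List<ScoredPrefix>();
        Collect(root, "", a, b, result);
        return result.OrderByDescending(p => p.Score).ToList();
    }

    private static void Collect(PrefixNode node, string prefix, double a, double b, List<ScoredPrefix> result)
    {
        foreach (KeyValuePair<char, PrefixNode> pair in node.Children)
        {
            string childPrefix = prefix + pair.Key;
            PrefixNode child = pair.Value;
            if (child.Children.Count > 0) // skip leaves
            {
                result.Add(new ScoredPrefix
                {
                    Prefix = childPrefix,
                    Count = child.Count,
                    Score = a * childPrefix.Length + b * child.Count
                });
            }
            Collect(child, childPrefix, a, b, result);
        }
    }
}
```

For the words bazf, bazg and bazgx with a = 1 and b = 2, the highest scoring prefix is baz (depth 3, count 3, score 9), ahead of bazg (depth 4, count 2, score 8). Raising a relative to b would favour the longer prefix instead.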
I'm wondering if your requirements aren't off. It seems as if you are looking for a specific group size as opposed to specific key size requirements. Below is a program that will, based on a specified group size, break up the strings into the largest possible groups up to and including the group size specified. So if you specify a group size of 5, it will group items on the smallest key possible to make a group of size 5. In your example it would group foo- as f, since there is no need to make a more complex key as an identifier.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace ConsoleApplication2
{
class Program
{
/// <remarks><c>true</c> in returned dictionary key are groups over <paramref name="maxGroupSize"/></remarks>
public static Dictionary<bool,Dictionary<string, List<string>>> Split(int maxGroupSize, int keySize, IEnumerable<string> items)
{
var smallItems = from item in items
where item.Length < keySize
select item;
var largeItems = from item in items
where keySize <= item.Length // <= so that items exactly keySize long are not silently dropped
select item;
var largeItemsq = (from item in largeItems
let key = item.Substring(0, keySize)
group item by key into x
select new { Key = x.Key, Items = x.ToList() } into aGrouping
group aGrouping by aGrouping.Items.Count() > maxGroupSize into x2
select x2).ToDictionary(a => a.Key, a => a.ToDictionary(a_ => a_.Key, a_ => a_.Items));
if (smallItems.Any())
{
var smallestLength = items.Aggregate(int.MaxValue, (acc, item) => Math.Min(acc, item.Length));
var smallItemsq = (from item in smallItems
let key = item.Substring(0, smallestLength)
group item by key into x
select new { Key = x.Key, Items = x.ToList() } into aGrouping
group aGrouping by aGrouping.Items.Count() > maxGroupSize into x2
select x2).ToDictionary(a => a.Key, a => a.ToDictionary(a_ => a_.Key, a_ => a_.Items));
return Combine(smallItemsq, largeItemsq);
}
return largeItemsq;
}
static Dictionary<bool, Dictionary<string,List<string>>> Combine(Dictionary<bool, Dictionary<string,List<string>>> a, Dictionary<bool, Dictionary<string,List<string>>> b) {
var x = new Dictionary<bool,Dictionary<string,List<string>>> {
{ true, null },
{ false, null }
};
foreach(var condition in new bool[] { true, false }) {
var hasA = a.ContainsKey(condition);
var hasB = b.ContainsKey(condition);
x[condition] = hasA && hasB ? a[condition].Concat(b[condition]).ToDictionary(c => c.Key, c => c.Value)
: hasA ? a[condition]
: hasB ? b[condition]
: new Dictionary<string, List<string>>();
}
return x;
}
public static Dictionary<string, List<string>> Group(int maxGroupSize, IEnumerable<string> items, int keySize)
{
var toReturn = new Dictionary<string, List<string>>();
var both = Split(maxGroupSize, keySize, items);
if (both.ContainsKey(false))
foreach (var key in both[false].Keys)
toReturn.Add(key, both[false][key]);
if (both.ContainsKey(true))
{
var keySize_ = keySize + 1;
var xs = from needsFix in both[true]
select needsFix;
foreach (var x in xs)
{
var fixedGroup = Group(maxGroupSize, x.Value, keySize_);
toReturn = toReturn.Concat(fixedGroup).ToDictionary(a => a.Key, a => a.Value);
}
}
return toReturn;
}
static Random rand = new Random(unchecked((int)DateTime.Now.Ticks));
const string allowedChars = "aaabbbbccccc"; // "aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ";
static readonly int maxAllowed = allowedChars.Length - 1;
static IEnumerable<string> GenerateText()
{
var list = new List<string>();
for (int i = 0; i < 100; i++)
{
var stringLength = rand.Next(3,25);
var chars = new List<char>(stringLength);
for (int j = stringLength; j > 0; j--)
chars.Add(allowedChars[rand.Next(0, maxAllowed)]);
var newString = chars.Aggregate(new StringBuilder(), (acc, item) => acc.Append(item)).ToString();
list.Add(newString);
}
return list;
}
static void Main(string[] args)
{
// runs 1000 times over autogenerated groups of sample text.
for (int i = 0; i < 1000; i++)
{
var s = GenerateText();
Go(s);
}
Console.WriteLine();
Console.WriteLine("DONE");
Console.ReadLine();
}
static void Go(IEnumerable<string> items)
{
var dict = Group(3, items, 1);
foreach (var key in dict.Keys)
{
Console.WriteLine(key);
foreach (var item in dict[key])
Console.WriteLine("\t{0}", item);
}
}
}
}

Case insensitive group on multiple columns

Is there any way to do a LINQ2SQL query doing something similar to this:
var result = source.GroupBy(a => new { a.Column1, a.Column2 });
or
var result = from s in source
group s by new { s.Column1, s.Column2 } into c
select new { Column1 = c.Key.Column1, Column2 = c.Key.Column2 };
but with ignoring the case of the contents of the grouped columns?
You can pass StringComparer.InvariantCultureIgnoreCase to the GroupBy extension method.
var result = source.GroupBy(a => new { a.Column1, a.Column2 },
StringComparer.InvariantCultureIgnoreCase);
Or you can use ToUpperInvariant on each field, as suggested by Hamlet Hakobyan in a comment. I recommend ToUpperInvariant or ToUpper rather than ToLower or ToLowerInvariant because upper-casing is optimized for programmatic comparison.
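The upper-casing variant could be sketched like this for in-memory data. Record and CaseInsensitiveGrouping are hypothetical names, null columns are treated as empty strings, and whether a LINQ to SQL provider translates ToUpperInvariant has not been checked here (ToUpper may be the safer choice for a database query).

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type standing in for the question's source rows.
public sealed class Record
{
    public string Column1 { get; set; }
    public string Column2 { get; set; }
}

public static class CaseInsensitiveGrouping
{
    // Groups on upper-cased copies of both columns, so the default
    // equality of the anonymous key type ignores case differences.
    public static Dictionary<string, int> CountGroups(IEnumerable<Record> source)
    {
        return source
            .GroupBy(r => new
            {
                Column1 = (r.Column1 ?? "").ToUpperInvariant(),
                Column2 = (r.Column2 ?? "").ToUpperInvariant()
            })
            .ToDictionary(g => g.Key.Column1 + "|" + g.Key.Column2, g => g.Count());
    }
}
```

Rows (Foo, Bar) and (FOO, bar) land in the same group, keyed FOO|BAR here, while (foo, Baz) forms a group of its own.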
I couldn't get NaveenBhat's solution to work, getting a compile error:
The type arguments for method
'System.Linq.Enumerable.GroupBy<TSource,TKey>(System.Collections.Generic.IEnumerable<TSource>,
System.Func<TSource,TKey>,
System.Collections.Generic.IEqualityComparer<TKey>)' cannot be
inferred from the usage. Try specifying the type arguments explicitly.
To make it work, I found it easiest and clearest to define a new class to store my key columns (GroupKey), then a separate class that implements IEqualityComparer (KeyComparer). I can then call
var result= source.GroupBy(r => new GroupKey(r), new KeyComparer());
The KeyComparer class does compare the strings with the InvariantCultureIgnoreCase comparer, so kudos to NaveenBhat for pointing me in the right direction.
Simplified versions of my classes:
private class GroupKey
{
public string Column1{ get; set; }
public string Column2{ get; set; }
public GroupKey(SourceObject r) {
this.Column1 = r.Column1;
this.Column2 = r.Column2;
}
}
private class KeyComparer: IEqualityComparer<GroupKey>
{
bool IEqualityComparer<GroupKey>.Equals(GroupKey x, GroupKey y)
{
if (!x.Column1.Equals(y.Column1, StringComparison.InvariantCultureIgnoreCase)) return false;
if (!x.Column2.Equals(y.Column2, StringComparison.InvariantCultureIgnoreCase)) return false;
return true;
//my actual code is more complex than this, more columns to compare
//and handles null strings, but you get the idea.
}
int IEqualityComparer<GroupKey>.GetHashCode(GroupKey obj)
{
return 0.GetHashCode() ; // forces calling Equals
//Note, it would be more efficient to do something like
//string hcode = Column1.ToLower() + Column2.ToLower();
//return hcode.GetHashCode();
//but my object is more complex than this simplified example
}
}
I had the same issue grouping by the values of DataRow objects from a Table, but I just used .ToString() on the DataRow object to get past the compiler issue, e.g.
MyTable.AsEnumerable().GroupBy(
dataRow => dataRow["Value"].ToString(),
StringComparer.InvariantCultureIgnoreCase)
instead of
MyTable.AsEnumerable().GroupBy(
dataRow => dataRow["Value"],
StringComparer.InvariantCultureIgnoreCase)
I've expanded on Bill B's answer to make things a little more dynamic and to avoid hardcoding the column properties in the GroupKey and IEqualityComparer<>.
private class GroupKey
{
public List<string> Columns { get; } = new List<string>();
public GroupKey(params string[] columns)
{
foreach (var column in columns)
{
// Using 'ToUpperInvariant()' if user calls Distinct() after
// the grouping, matching strings with a different case will
// be dropped and not duplicated
Columns.Add(column.ToUpperInvariant());
}
}
}
private class KeyComparer : IEqualityComparer<GroupKey>
{
bool IEqualityComparer<GroupKey>.Equals(GroupKey x, GroupKey y)
{
for (var i = 0; i < x.Columns.Count; i++)
{
if (!x.Columns[i].Equals(y.Columns[i], StringComparison.OrdinalIgnoreCase)) return false;
}
return true;
}
int IEqualityComparer<GroupKey>.GetHashCode(GroupKey obj)
{
var hashcode = obj.Columns[0].GetHashCode();
for (var i = 1; i < obj.Columns.Count; i++)
{
var column = obj.Columns[i];
// *397 is normally generated by ReSharper to create more unique hash values
// So I added it here
// (do keep in mind that multiplying each hash code by the same prime is more prone to hash collisions than using a different prime initially)
hashcode = (hashcode * 397) ^ (column != null ? column.GetHashCode() : 0);
}
return hashcode;
}
}
Usage:
var result = source.GroupBy(r => new GroupKey(r.Column1, r.Column2, r.Column3), new KeyComparer());
This way, you can pass any number of columns into the GroupKey constructor.
