Lambda expression to find difference - c#

With the following data
string[] data = { "a", "a", "b" };
I'd very much like to find duplicates and get this result:
a
I tried the following code
var a = data.Distinct().ToList();
var b = a.Except(a).ToList();
Obviously this didn't work. I can see what is happening above, but I'm not sure how to fix it.

When runtime is no problem, you could use
var duplicates = data.Where(s => data.Count(t => t == s) > 1).Distinct().ToList();
Good old O(n²) =)
Edit: Now for a better solution. =)
If you define a new extension method like
static class Extensions
{
    public static IEnumerable<T> Duplicates<T>(this IEnumerable<T> input)
    {
        HashSet<T> hash = new HashSet<T>();
        foreach (T item in input)
        {
            if (!hash.Contains(item))
            {
                // First occurrence: remember it.
                hash.Add(item);
            }
            else
            {
                // Already seen: it's a duplicate.
                yield return item;
            }
        }
    }
}
you can use
var duplicates = data.Duplicates().Distinct().ToArray();

Use the group by stuff; the performance of these methods is reasonably good. The only concern is the large memory overhead if you are working with big data sets.
from g in (from x in data group x by x)
where g.Count() > 1
select g.Key;
Or, if you prefer extension methods:
data.GroupBy(x => x)
    .Where(x => x.Count() > 1)
    .Select(x => x.Key);
Where Count() == 1, those are your distinct items; where Count() > 1, those are your duplicate items.
Since LINQ queries are lazily evaluated, if you don't want to reevaluate the computation you can do this:
var g = (from x in data group x by x).ToList(); // grouping result

// duplicates
var duplicates = from x in g
                 where x.Count() > 1
                 select x.Key;

// distinct
var distinct = from x in g
               where x.Count() == 1
               select x.Key;
Creating the grouping builds a set of sets. Assuming the set has O(1) insertion, the running time of the group-by approach is O(n). The constant cost per operation is somewhat high, but it should equate to near-linear performance.

Sort the data, iterate through it, and remember the last item. When the current item is the same as the last, it's a duplicate. This can be easily implemented either iteratively or using a lambda expression in O(n·log(n)) time, as in the sketch below.
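A minimal sketch of that sort-and-scan approach (my own illustration, assuming the element type works with the default comparer):

// Sort-and-scan duplicate detection: O(n log n) for the sort,
// then a single linear pass comparing neighbours.
static List<T> SortedDuplicates<T>(IEnumerable<T> input)
{
    var sorted = input.OrderBy(x => x).ToList(); // requires System.Linq
    var result = new List<T>();
    for (int i = 1; i < sorted.Count; i++)
    {
        // Equal neighbours mean a duplicate; the extra check against the
        // last result entry ensures each duplicate is reported only once.
        if (Equals(sorted[i], sorted[i - 1]) &&
            (result.Count == 0 || !Equals(result[result.Count - 1], sorted[i])))
        {
            result.Add(sorted[i]);
        }
    }
    return result;
}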

Related

Linq: Find elements of 2 list with different values but same index

I have two lists of points (List<Point>) for the coordinates of some label elements: one list for before and one for after they were moved, so the indexes refer to the same label elements. I want to compare the elements at each index and see which had their points changed.
List<int> changedIndexes = new List<int>();
for (int i = 0; i < labelLocationsBefore.Count; i++)
{
    if (labelLocationsBefore[i].X != labelLocationsAfter[i].X || labelLocationsBefore[i].Y != labelLocationsAfter[i].Y)
    {
        changedIndexes.Add(i);
    }
}
That is what this loop does. But how can I convert this into a LINQ expression and retrieve the changed labels' indexes?
You are looking for the overload of the Select method which takes a Func<TSource, int, TResult>, where the second argument is the index:
changedIndexes = labelLocationsBefore
    .Select((point, idx) => new { point, idx })
    .Where(p => p.point.X != labelLocationsAfter[p.idx].X ||
                p.point.Y != labelLocationsAfter[p.idx].Y)
    .Select(p => p.idx)
    .ToList();
One option is to use Enumerable.Zip to join the two collections, then Select to get the index of each joined pair, then filter appropriately:
var changedIndexes = labelLocationsBefore
    .Zip(labelLocationsAfter, (before, after) => before.Equals(after))
    .Select((equal, index) => new { Moved = !equal, Index = index })
    .Where(result => result.Moved)
    .Select(result => result.Index)
    .ToList();
This snippet has a few nice properties (it's expression-based, easy to read, and avoids repetition), but it's necessarily more cumbersome and less performant than a straight for loop, because it must produce the moved/index pair for every before/after pair of points, even those where simply determining that they have not moved would be enough to disregard them.

LINQ avoid specific grouping

In my polynomial class, all the terms consist of a List of tuples (double, uint) representing the coefficient and the exponent: a real and a natural number. The + operator implementation works great, but I was wondering if I could avoid writing grouping.Sum(s => s.Item1) twice. It somehow feels wrong, but I can't seem to find a way to circumvent it.
Here is the code:
public static tuplePolynomial operator +(tuplePolynomial tp1, tuplePolynomial tp2)
{
    tuplePolynomial Result = new tuplePolynomial();
    Result.Terms =
    (
        from t in tp1.Terms.Concat(tp2.Terms)
        group t by t.Item2 into grouping
        where grouping.Sum(s => s.Item1) != 0.0
        select new Tuple<double, uint>(grouping.Sum(s => s.Item1), grouping.Key)
    ).ToList();
    return Result;
}
I actually merge the two polynomials' terms and group the terms with the same exponents to sum them. I then filter out the terms whose coefficients sum to zero. Terms is of type List<Tuple<double,uint>>.
This is easy with the let clause:
from t in tp1.Terms.Concat(tp2.Terms)
group t by t.Item2 into grouping
let sum = grouping.Sum(s => s.Item1)
where sum != 0.0
select new Tuple<double, uint>(sum, grouping.Key)
You could just move your where condition outside and apply it after projecting the new tuples.
Then you only apply the Sum operator on each group once and filter the resulting zero sums out before you call ToList.
The code would look something like:
Result.Terms = ( ... ).Where(t => t.Item1 != 0).ToList();
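Spelled out, that might look something like this (a sketch reconstructed from the original query, with the where moved after the projection):

Result.Terms =
(
    from t in tp1.Terms.Concat(tp2.Terms)
    group t by t.Item2 into grouping
    select new Tuple<double, uint>(grouping.Sum(s => s.Item1), grouping.Key)
)
.Where(t => t.Item1 != 0.0) // drop terms whose coefficients cancelled out
.ToList();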

Check array for duplicates, return only items which appear more than once

I have a text document of emails such as
Google12#gmail.com,
MyUSERNAME#me.com,
ME#you.com,
ratonabat#co.co,
iamcool#asd.com,
ratonabat#co.co,
I need to check said document for duplicates and create a unique array from that (so if "ratonabat#co.co" appears 500 times, in the new array it will only appear once).
Edit:
For an example:
username1#hotmail.com
username2#hotmail.com
username1#hotmail.com
username1#hotmail.com
username1#hotmail.com
username1#hotmail.com
This is my "data" (either in an array or text document, I can handle that)
I want to be able to see if there's a duplicate in that, and move the duplicate ONCE to another array. So the output would be
username1#hotmail.com
You can simply use Linq's Distinct extension method:
var input = new string[] { ... };
var output = input.Distinct().ToArray();
You may also want to consider refactoring your code to use a HashSet<string> instead of a simple array, as it will gracefully handle duplicates.
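For instance (a minimal sketch; HashSet<string> simply ignores repeated additions):

// A HashSet keeps only one copy of each address automatically.
var unique = new HashSet<string>(input);
// Add returns false if the item was already present.
bool added = unique.Add("ratonabat#co.co");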
To get an array containing only those records which are duplicates is a little more complex, but you can still do it with a little LINQ:
var output = input.GroupBy(x => x)
                  .Where(g => g.Skip(1).Any())
                  .Select(g => g.Key)
                  .ToArray();
Explanation:
.GroupBy: group identical strings together.
.Where: filter the groups by the following criterion.
.Skip(1).Any(): return true if there are 2 or more items in the group. This is equivalent to .Count() > 1, but slightly more efficient because it stops counting after it finds a second item (the same short-circuiting idea is generalized in the sketch after this list).
.Select: return only the key string (rather than the whole group).
.ToArray: convert the result set to an array.
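If you use that trick often, the short-circuiting count can be wrapped in a small helper. This is a hypothetical extension of my own, not part of LINQ:

public static class CountExtensions
{
    // True if the sequence has at least n items; enumerates at most n of them,
    // unlike Count(), which always walks the whole sequence.
    public static bool HasAtLeast<T>(this IEnumerable<T> source, int n)
    {
        return source.Skip(n - 1).Any();
    }
}

With it, .Where(g => g.HasAtLeast(2)) reads much like .Where(g => g.Count() > 1) but stops enumerating each group early.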
Here's another solution using a custom extension method:
public static class MyExtensions
{
    public static IEnumerable<T> Duplicates<T>(this IEnumerable<T> input)
    {
        var seen = new HashSet<T>();     // every item encountered so far
        var reported = new HashSet<T>(); // duplicates already yielded
        foreach (var x in input)
        {
            // Add returns false when the item was already present, so this
            // yields each duplicated item exactly once.
            if (!seen.Add(x) && reported.Add(x))
                yield return x;
        }
    }
}
And then you can call this method like this:
var output = input.Duplicates().ToArray();
I haven't benchmarked this, but it should be more efficient than the previous method.
You can use the built-in .Distinct() method. By default the comparisons are case sensitive; if you want to make them case insensitive, use the overload that takes a comparer and pass a case-insensitive string comparer.
List<string> emailAddresses = GetListOfEmailAddresses();
string[] uniqueEmailAddresses = emailAddresses.Distinct(StringComparer.OrdinalIgnoreCase).ToArray();
EDIT: After your clarification, I see you only want to list the duplicates.
string[] duplicateAddresses = emailAddresses
    .GroupBy(address => address,
             (key, rows) => new { Key = key, Count = rows.Count() },
             StringComparer.OrdinalIgnoreCase)
    .Where(row => row.Count > 1)
    .Select(row => row.Key)
    .ToArray();
To select emails which occur more than once:
var dupEmails = from emails in File.ReadAllText(path).Split(',').GroupBy(x => x)
                where emails.Count() > 1
                select emails.Key;
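One caveat: splitting the raw file contents on ',' leaves any surrounding whitespace or newlines attached to each address. A variant that trims entries first might look like this (a sketch, not tested against your file):

var dupEmails = from email in File.ReadAllText(path)
                                  .Split(',')
                                  .Select(e => e.Trim())    // strip stray whitespace/newlines
                                  .Where(e => e.Length > 0) // skip the empty entry after a trailing comma
                                  .GroupBy(x => x)
                where email.Count() > 1
                select email.Key;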

Optimization: How should I optimize the Linq Concat of collections? C#

Is there any way I can optimize this:
public static IEnumerable<IEnumerable<int>> GenerateCombinedPatterns
    (IEnumerable<IEnumerable<int>> patterns1,
     IEnumerable<IEnumerable<int>> patterns2)
{
    return patterns1
        .Join(patterns2, p1key => 1, p2key => 1, (p1, p2) => p1.Concat(p2))
        .Where(r => r.Sum() <= stockLen)
        .AsParallel()
        as IEnumerable<IEnumerable<int>>;
}
If you're looking for every combination, use SelectMany instead, usually performed with multiple "from" clauses:
return from p1 in patterns1
       from p2 in patterns2
       let combination = p1.Concat(p2)
       where combination.Sum() <= stockLen
       select combination;
That's without any parallelism though... depending on the expected collections, I'd probably just parallelize at one level, e.g.
return from p1 in patterns1.AsParallel()
       from p2 in patterns2
       let combination = p1.Concat(p2)
       where combination.Sum() <= stockLen
       select combination;
Note that there's no guarantee as to the order in which the results come out with the above - you'd need to tweak it if you wanted the original ordering.
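One such tweak (my own sketch, not part of the answer above) is PLINQ's AsOrdered, which preserves the source order at some cost in parallel efficiency:

return from p1 in patterns1.AsParallel().AsOrdered() // keep results in source order
       from p2 in patterns2
       let combination = p1.Concat(p2)
       where combination.Sum() <= stockLen
       select combination;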
There is no point in making the query parallel at the very end. Update: Jon was right; my initial solution was incorrect, and it turns out my corrected solution was essentially the same as his.
public static IEnumerable<IEnumerable<int>> GenerateCombinedPatterns
    (IEnumerable<IEnumerable<int>> patterns1,
     IEnumerable<IEnumerable<int>> patterns2)
{
    var parallel1 = patterns1.AsParallel();
    return parallel1.SelectMany(p1 => patterns2.Select(p2 => p1.Concat(p2)))
                    .Where(r => r.Sum() <= stockLen);
}

Return Modal Average in LINQ (Mode)

I am not sure if CopyMost is the correct term to use here, but it's the term my client used ("CopyMost Data Protocol"). Sounds like he wants the mode? I have a set of data:
Increment   Value
   .02        1
   .04        1
   .06        1
   .08        2
   .10        2
I need to return the Value which occurs the most (the "CopyMost"). In this case, that value is 1. Right now I had planned on writing an extension method for IEnumerable to do this for integer values. Is there something built into LINQ that already does this easily? Or is it best for me to write an extension method that would look something like this:
records.CopyMost(x => x.Value);
EDIT
Looks like I am looking for the modal average. I've provided an updated answer that allows for a tiebreaker condition. It's meant to be used like this, and is generic.
records.CopyMost(x => x.Value, x => x == 0);
In this case x.Value would be an int, and if the count of 0s was the same as the counts of 1s and 3s, it would tiebreak on 0.
Well, here's one option:
var query = (from item in data
             group 1 by item.Value into g
             orderby g.Count() descending
             select g.Key).First();
Basically we're using GroupBy to group by the value - but all we're interested in for each group is the size of the group and the key (which is the original value). We sort the groups by size, and take the first element (the one with the most elements).
Does that help?
Jon beat me to it, but the term you're looking for is Modal Average.
Edit:
If I'm right in thinking that it's the modal average you need, then the following should do the trick:
var i = (from t in data
         group t by t.Value into aggr
         orderby aggr.Count() descending
         select aggr.Key).First();
This method has been updated several times in my code over the years. It's become a very important method, and it is much different than it used to be. I wanted to provide the most up-to-date version in case anyone is looking to add CopyMost or a modal average as a LINQ extension.
One thing I did not think I would need was a tiebreaker of some sort. I have now overloaded the method to include a tiebreaker.
public static K CopyMost<T, K>(this IEnumerable<T> records, Func<T, K> propertySelector, Func<K, bool> tieBreaker)
{
    // Group by the selected property and pair each group with its size.
    var grouped = records.GroupBy(x => propertySelector(x)).Select(x => new { Group = x, Count = x.Count() });
    var maxCount = grouped.Max(x => x.Count);
    var subGroup = grouped.Where(x => x.Count == maxCount);
    if (subGroup.Count() == 1)
        return subGroup.Single().Group.Key;
    else
        // Several keys tie for the top count; let the caller's tiebreaker decide.
        return subGroup.Where(x => tieBreaker(x.Group.Key)).Single().Group.Key;
}
The above assumes the caller supplies a legitimate tiebreaker condition. You may want to check whether the tiebreaker actually selects a value and, if not, throw an exception (a defensive sketch follows below). And here's my normal method.
public static K CopyMost<T, K>(this IEnumerable<T> records, Func<T, K> propertySelector)
{
    return records.GroupBy(x => propertySelector(x)).OrderByDescending(x => x.Count()).Select(x => x.Key).First();
}
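For reference, here is a minimal defensive sketch of the tiebreaker overload mentioned above. The name CopyMostChecked is hypothetical (chosen to avoid colliding with the overload), and it is my illustration rather than the author's code: it materializes the groups once and throws if the tiebreaker does not pick exactly one key.

public static K CopyMostChecked<T, K>(this IEnumerable<T> records, Func<T, K> propertySelector, Func<K, bool> tieBreaker)
{
    // Materialize once so the groups aren't re-evaluated below.
    var grouped = records.GroupBy(propertySelector)
                         .Select(g => new { g.Key, Count = g.Count() })
                         .ToList();
    var maxCount = grouped.Max(x => x.Count);
    var tied = grouped.Where(x => x.Count == maxCount).ToList();
    if (tied.Count == 1)
        return tied[0].Key;

    var winners = tied.Where(x => tieBreaker(x.Key)).ToList();
    if (winners.Count != 1)
        throw new InvalidOperationException("Tiebreaker did not select exactly one value.");
    return winners[0].Key;
}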
