How to find Complement of two HashSets - c#

I have two HashSets – setA and setB.
How can we find the complement of setA and setB?
Is the code for intersection the best way to find intersection?
CODE
string stringA = "A,B,A,A";
string stringB = "C,A,B,D";
HashSet<string> setA = new HashSet<string>(stringA.Split(',').Select(t => t.Trim()));
HashSet<string> setB = new HashSet<string>(stringB.Split(',').Select(t => t.Trim()));
//Intersection - Present in A and B
HashSet<string> intersectedSet = new HashSet<string>( setA.Intersect(setB));
//Complemenet - Present in A; but not present in B
UPDATE:
Use OrdianlIgnoreCase for ignoring case sensitvity How to use HashSet<string>.Contains() method in case -insensitive mode?
REFERENCE:
What is the difference between HashSet<T> and List<T>?
Intersection of multiple lists with IEnumerable.Intersect()
Comparing two hashsets
Compare two hashsets?
Quickest way to find the complement of two collections in C#

1 - How can we find the complement of setA and setB?
Use HashSet<T>.Except Method
//Complemenet - Present in A; but not present in B
HashSet<string> ComplemenetSet = new HashSet<string>(setA.Except(setB));
try it with following string.
string stringA = "A,B,A,E";
string stringB = "C,A,B,D";
ComplementSet will contain E
2 - Is the code for intersection the best way to find intersection?
Probably, YES

You can use Except to get the complement of A or B. To get a symmetric complement, use SymmetricExceptWith.
setA.SymmetricExceptWith(setB);
Note that this modifies setA. To get the intersection, there are two methods: Intersect, which creates a new HashSet, and IntersectWith, which modifies the first:
// setA and setB unchanged
HashSet<string> intersection = setA.Intersect(setB);
// setA gets modified and holds the result
setA.IntersectWith(setB);

Related

How to combine items in List<string> to make new items efficiently

I have a case where I have the name of an object, and a bunch of file names. I need to match the correct file name with the object. The file name can contain numbers and words, separated by either hyphen(-) or underscore(_). I have no control of either file name or object name. For example:
10-11-12_001_002_003_13001_13002_this_is_an_example.svg
The object name in this case is just a string, representing an number
10001
I need to return true or false if the file name is a match for the object name. The different segments of the file name can match on their own, or any combination of two segments. In the example above, it should be true for the following cases (not every true case, just examples):
10001
10002
10003
11001
11002
11003
12001
12002
12003
13001
13002
And, we should return false for this case (among others):
13003
What I've come up with so far is this:
public bool IsMatch(string filename, string objectname)
{
var namesegments = GetNameSegments(filename);
var match = namesegments.Contains(objectname);
return match;
}
public static List<string> GetNameSegments(string filename)
{
var segments = filename.Split('_', '-').ToList();
var newSegments = new List<string>();
foreach (var segment in segments)
{
foreach (var segment2 in segments)
{
if (segment == segment2)
continue;
var newToken = segment + segment2;
newSegments.Add(newToken);
}
}
return segments.Concat(newSegments).ToList();
}
One or two segments combined can make a match, and that is enought. Three or more segments combined should not be considered.
This does work so far, but is there a better way to do it, perhaps without nesting foreach loops?
First: don't change debugged, working, sufficiently efficient code for no reason. Your solution looks good.
However, we can make some improvements to your solution.
public static List<string> GetNameSegments(string filename)
Making the output a list puts restrictions on the implementation that are not required by the caller. It should be IEnumerable<String>. Particularly since the caller in this case only cares about the first match.
var segments = filename.Split('_', '-').ToList();
Why ToList? A list is array-backed. You've already got an array in hand. Just use the array.
Since there is no longer a need to build up a list, we can transform your two-loop solution into an iterator block:
public static IEnumerable<string> GetNameSegments(string filename)
{
var segments = filename.Split('_', '-');
foreach (var segment in segments)
yield return segment;
foreach (var s1 in segments)
foreach (var s2 in segments)
if (s1 != s2)
yield return s1 + s2;
}
Much nicer. Alternatively we could notice that this has the structure of a query and simply return the query:
public static IEnumerable<string> GetNameSegments(string filename)
{
var q1= filename.Split('_', '-');
var q2 = from s1 in q1
from s2 in q1
where s1 != s2
select s1 + s2;
return q1.Concat(q2);
}
Again, much nicer in this form.
Now let's talk about efficiency. As is often the case, we can achieve greater efficiency at a cost of increased complication. This code looks like it should be plenty fast enough. Your example has nine segments. Let's suppose that nine or ten is typical. Our solutions thus far consider the ten or so singletons first, and then the hundred or so combinations. That's nothing; this code is probably fine. But what if we had thousands of segments and were considering millions of possibilities?
In that case we should restructure the algorithm. One possibility would be this general solution:
public bool IsMatch(HashSet<string> segments, string name)
{
if (segments.Contains(name))
return true;
var q = from s1 in segments
where name.StartsWith(s1)
let s2 = name.Substring(s1.Length)
where s1 != s2
where segments.Contains(s2)
select 1; // Dummy. All we care about is if there is one.
return q.Any();
}
Your original solution is quadratic in the number of segments. This one is linear; we rely on the constant order contains operation. (This assumes of course that string operations are constant time because strings are short. If that's not true then we have a whole other kettle of fish to fry.)
How else could we extract wins in the asymptotic case?
If we happened to have the property that the collection was not a hash set but rather a sorted list then we could do even better; we could binary search the list to find the start and end of the range of possible prefix matches, and then pour the list into a hashset to do the suffix matches. That's still linear, but could have a smaller constant factor.
If we happened to know that the target string was small compared to the number of segments, we could attack the problem from the other end. Generate all possible combinations of partitions of the target string and check if both halves are in the segment set. The problem with this solution is that it is quadratic in memory usage in the size of the string. So what we'd want to do there is construct a special hash on character sequences and use that to populate the hash table, rather than the standard string hash. I'm sure you can see how the solution would go from there; I shan't spell out the details.
Efficiency is very much dependent on the business problem that you're attempting to solve. Without knowing the full context/usage it's difficult to define the most efficient solution. What works for one situation won't always work for others.
I would always advocate to write working code and then solve any performance issues later down the line (or throw more tin at the problem as it's usually cheaper!) If you're having specific performance issues then please do tell us more...
I'm going to go out on a limb here and say (hope) that you're only going to be matching the filename against the object name once per execution. If that's the case I reckon this approach will be just about the fastest. In a circumstance where you're matching a single filename against multiple object names then the obvious choice is to build up an index of sorts and match against that as you were already doing, although I'd consider different types of collection depending on your expected execution/usage.
public static bool IsMatch(string filename, string objectName)
{
var segments = filename.Split('-', '_');
for (int i = 0; i < segments.Length; i++)
{
if (string.Equals(segments[i], objectName)) return true;
for (int ii = 0; ii < segments.Length; ii++)
{
if (ii == i) continue;
if (string.Equals($"{segments[i]}{segments[ii]}", objectName)) return true;
}
}
return false;
}
If you are willing to use the MoreLINQ NuGet package then this may be worth considering:
public static HashSet<string> GetNameSegments(string filename)
{
var segments = filename.Split(new char[] {'_', '-'}, StringSplitOptions.RemoveEmptyEntries).ToList();
var matches = segments
.Cartesian(segments, (x, y) => x == y ? null : x + y)
.Where(z => z != null)
.Concat(segments);
return new HashSet<string>(matches);
}
StringSplitOptions.RemoveEmptyEntries handles adjacent separators (e.g. --). Cartesian is roughly equivalent to your existing nested for loops. The Where is to remove null entries (i.e. if x == y). Concat is the same as your existing Concat. The use of HashSet allows for your Contains calls (in IsMatch) to be faster.

Performing function on each array element, returning results to new array

I'm a complete Linq newbie here, so forgive me for a probably quite simple question.
I want to perform an operation on every element in an array, and return the result of each of these operations to a new array.
For example, say I have an array or numbers and a function ToWords() that converts the numbers to their word equivalents, I want to be able to pass in the numbers array, perform the ToWords() operation on each element, and pass out a string[]
I know it's entirely possible in a slightly more verbose way, but in my Linq adventures I'm wondering if it's doable in a nice one-liner.
You can use Select() to transform one sequence into another one, and ToArray() to create an array from the result:
int[] numbers = { 1, 2, 3 };
string[] strings = numbers.Select(x => ToWords(x)).ToArray();
It's pretty straight forward. Just use the Select method:
var results = array.Select(ToWords).ToArray();
Note that unless you need an array you don't have to call ToArray. Most of the time you can use lazy evaluation on an IEnumerable<string> just as easily.
There are two different approaches - you can use Select extension method or you can use select clause:
var numbers = new List<int>();
var strings1 = from num in numbers select ToWords(num);
var strings2 = numbers.Select(ToWords);
both of them will return IEnumerable<>, which you can cast as you need (for example, with .ToArray() or .ToList()).
You could do something like this :
public static string[] ProcessStrings(int[] intList)
{
return Array.ConvertAll<int, string>(intList, new Converter<int, string>(
delegate(int number)
{
return ToWords();
}));
}
If it is a list then :
public static List<string> ProcessStrings(List<int> intList)
{
return intList.ConvertAll<string>(new Converter<int, string>(
delegate(int number)
{
return ToWords();
}));
}
Straight simple:
string[] wordsArray = array.ToList().ConvertAll(n => ToWords(n)).ToArray();
If you are OK with Lists, rather than arrays, you can skip ToList() and ToArray().
Lists are much more flexible than arrays, I see no reason on earth not to use them, except for specific cases.

How to sort list of list by count?

In C#:
List<List<Point>> SectionList = new List<List<Point>>();
SectionList contains lists of points where each sub list varies in how many points it contains.
What I'm trying to figure out is how to sort SectionList by the Count of sub lists in descending order.
So if SectionList had 3 lists of points, after sorting, SectionList[0] would contain the highest Count value of all 3 lists.
Thanks,
Mythics
This should work:
SectionList.Sort((a,b) => a.Count - b.Count);
The (a,b) => a.Count - b.Count is a comparison delegate. The Sort method calls it with pairs of lists to compare, and the delegate that returns a negative number if a is shorter than b, a positive number if a is longer than b, and zero when the two lists are of the same length.
var sortedList = SectionList.OrderByDescending(l=>l.Count()).ToList();
You could create a custom comparer.
public class ListCountComparer : IComparer<IList> {
public int Compare(IList x, IList y) {
return x.Count.CompareTo(y.Count);
}
}
Then you can sort your list like this:
SectionList.Sort(new ListCountComparer());
Hope this helps :)

c# - BinarySearch StringList with wildcard

I have a sorted StringList and wanted to replace
foreach (string line3 in CardBase.cardList)
if (line3.ToLower().IndexOf((cardName + Config.EditionShortToLong(edition)).ToLower()) >= 0)
{
return true;
}
with a binarySearch, since the cardList ist rather large(~18k) and this search takes up around 80% of the time.
So I found the List.BinarySearch-Methode, but my problem is that the lines in the cardList look like this:
Brindle_Boar_(Magic_2012).c1p247924.prod
But I have no way to generate the c1p... , which is a problem cause the List.BinarySearch only finds exact matches.
How do I modify List.BinarySearch so that it finds a match if only a part of the string matches?
e. g.
searching for Brindle_Boar_(Magic_2012) should return the position of Brindle_Boar_(Magic_2012).c1p247924.prod
List.BinarySearch will return the ones complement of the index of the next item larger than the request if an exact match is not found.
So, you can do it like this (assuming you'll never get an exact match):
var key = (cardName + Config.EditionShortToLong(edition)).ToLower();
var list = CardBase.cardList;
var index = ~list.BinarySearch(key);
return index != list.Count && list[index].StartsWith(key);
BinarySearch() has an overload that takes an IComparer<T> has second parameter, implement a custom comparer and return 0 when you have a match within the string - you can use the same IndexOf() method there.
Edit:
Does a binary search make sense in your scenario? How do you determine that a certain item is "less" or "greater" than another item? Right now you only provide what would constitute a match. Only if you can answer this question, binary search applies in the first place.
You can take a look at the C5 Generic Collection Library (you can install it via NuGet also).
Use the SortedArray(T) type for your collection. It provides a handful of methods that could prove useful. You can even query for ranges of items very efficiently.
var data = new SortedArray<string>();
// query for first string greater than "Brindle_Boar_(Magic_2012)" an check if it starts
// with "Brindle_Boar_(Magic_2012)"
var a = data.RangeFrom("Brindle_Boar_(Magic_2012)").FirstOrDefault();
return a.StartsWith("Brindle_Boar_(Magic_2012)");
// query for first 5 items that start with "Brindle_Boar"
var b = data.RangeFrom("string").Take(5).Where(s => s.StartsWith("Brindle_Boar"));
// query for all items that start with "Brindle_Boar" (provided only ascii chars)
var c = data.RangeFromTo("Brindle_Boar", "Brindle_Boar~").ToList()
// query for all items that start with "Brindle_Boar", iterates until first non-match
var d = data.RangeFrom("Brindle_Boar").TakeWhile(s => s.StartsWith("Brindle_Boar"));
The RageFrom... methods perform a binary search, find the first element greater than or equal to your argument, that returns an iterator from that position

IEnumerable<IEnumerable<int>> - no duplicate IEnumerable<int>s

I'm trying to find a solution to this problem:
Given a IEnumerable< IEnumerable< int>> I need a method/algorithm that returns the input, but in case of several IEnmerable< int> with the same elements only one per coincidence/group is returned.
ex.
IEnumerable<IEnumerable<int>> seqs = new[]
{
new[]{2,3,4}, // #0
new[]{1,2,4}, // #1 - equals #3
new[]{3,1,4}, // #2
new[]{4,1,2} // #3 - equals #1
};
"foreach seq in seqs" .. yields {#0,#1,#2} or {#0,#2,#3}
Sould I go with ..
.. some clever IEqualityComparer
.. some clever LINQ combination I havent figured out - groupby, sequenceequal ..?
.. some seq->HashSet stuff
.. what not. Anything will help
I'll be able to solve it by good'n'old programming but inspiration is always appreciated.
Here's a slightly simpler version of digEmAll's answer:
var result = seqs.Select(x => new HashSet<int>(x))
.Distinct(HashSet<int>.CreateSetComparer());
Given that you want to treat the elements as sets, you should have them that way to start with, IMO.
Of course this won't help if you want to maintain order within the sequences that are returned, you just don't mind which of the equal sets is returned... the above code will return an IEnumerable<HashSet<int>> which will no longer have any ordering within each sequence. (The order in which the sets are returned isn't guaranteed either, although it would be odd for them not to be return in first-seen-first-returned basis.)
It feels unlikely that this wouldn't be enough, but if you could give more details of what you really need to achieve, that would make it easier to help.
As noted in comments, this will also assume that there are no duplicates within each original source array... or at least, that they're irrelevant, so you're happy to treat { 1 } and { 1, 1, 1, 1 } as equal.
Use the correct collection type for the job. What you really want is ISet<IEnumerable<int>> with an equality comparer that will ignore the ordering of the IEnumerables.
EDITED:
You can get what you want by building your own IEqualityComparer<IEnumerable<int>> e.g.:
public class MyEqualityComparer : IEqualityComparer<IEnumerable<int>>
{
public bool Equals(IEnumerable<int> x, IEnumerable<int> y)
{
return x.OrderBy(el1 => el1).SequenceEqual(y.OrderBy(el2 => el2));
}
public int GetHashCode(IEnumerable<int> elements)
{
int hash = 0;
foreach (var el in elements)
{
hash = hash ^ el.GetHashCode();
}
return hash;
}
}
Usage:
var values = seqs.Distinct(new MyEqualityComparer()).ToList();
N.B.
this solution is slightly different from the one given by Jon Skeet.
His answer considers sublists as sets, so basically two lists like [1,2] and [1,1,1,2,2] are equal.
This solution don't, i.e. :
[1,2,1,1] is equal to [2,1,1,1] but not to [2,2,1,1], hence basically the two lists have to contain the same elements and in the same number of occurrences.

Categories