Enumerable.Except with IEqualityComparer

Enumerable.Except with IEqualityComparer - c#

I have two string arrays, newArray and oldArray, and I want to use Enumberable.Except method to remove all items that are in newArray that are also in oldArray and then write the result to a csv file.
However, I need to use a custom comparer in order to check for formatting similarities(if there is a new line character in one array and not the other, I don't want this item being written to the file).
My code as of now:
string newString = File.ReadAllText(csvOutputFile1);
string[] newArray = newString.Split(new string[] {sentinel}, StringSplitOptions.RemoveEmptyEntries);
string oldString = File.ReadAllText(csvOutputFile2);
string[] oldArray = oldString.Split(new string[] { sentinel }, StringSplitOptions.None);
IEnumerable<string> differnceQuery = newArray.Except(oldArray, new Comparer());
using (var wtr = new StreamWriter(diffFile))
{
foreach (var s in differnceQuery)
{
wtr.WriteLine(s.Trim() + "#!#");
}
}
and the custom comparer class:
class Comparer : IEqualityComparer<string>
{
public bool Equals(string x, string y)
{
x = x.ToString().Replace(" ", "").Replace("\n", "").Replace("\r", "");
y = y.ToString().Replace(" ", "").Replace("\n", "").Replace("\r", "");
if (x == y)
return true;
else
return false;
}
public int GetHashCode(string row)
{
int hCode = row.GetHashCode();
return hCode;
}
}
The resulting file is not omitting the formatting difference items between the two arrays. So although it catches items that are in the newArray but not in the oldArray(like it should), it is also putting in items that are only different because of a \n or something even though in my custom comparer I am removing them.
The thing I really don't understand is when I debug and step through my code, I can see each pair of items being analyzed in my custom comparer class, but only when they are equal terms. If for example the string "This is\nthe 1st term" is in newArray and the string "This is the first array" is in oldArray, the debugger doesn't even enter the comparer class and instead jumps straight to the writeline part of my code in the main class.

simply: your hash-code does not correctly mirror your equality method. Strings like "a b c" and "abc" would return different values from GetHashCode, so it would never get around to testing Equals. GetHashCode must return the same result for any two values that could be equal. It is not, however, necessary that two strings that are not equal return different hash-codes (although it is highly desirable, otherwise everything will go into the same hash-bucket).
I guess you could use:
// warning: probably not very efficient
return x.Replace(" ", "").Replace("\n", "").Replace("\r", "").GetHashCode();
but that looks pretty expensive (lots of potential for garbage strings to be generated all the time)

Related

How to combine items in List<string> to make new items efficiently

I have a case where I have the name of an object, and a bunch of file names. I need to match the correct file name with the object. The file name can contain numbers and words, separated by either hyphen(-) or underscore(_). I have no control of either file name or object name. For example:
10-11-12_001_002_003_13001_13002_this_is_an_example.svg
The object name in this case is just a string, representing an number
10001
I need to return true or false if the file name is a match for the object name. The different segments of the file name can match on their own, or any combination of two segments. In the example above, it should be true for the following cases (not every true case, just examples):
10001
10002
10003
11001
11002
11003
12001
12002
12003
13001
13002
And, we should return false for this case (among others):
13003
What I've come up with so far is this:
public bool IsMatch(string filename, string objectname)
{
var namesegments = GetNameSegments(filename);
var match = namesegments.Contains(objectname);
return match;
}
public static List<string> GetNameSegments(string filename)
{
var segments = filename.Split('_', '-').ToList();
var newSegments = new List<string>();
foreach (var segment in segments)
{
foreach (var segment2 in segments)
{
if (segment == segment2)
continue;
var newToken = segment + segment2;
newSegments.Add(newToken);
}
}
return segments.Concat(newSegments).ToList();
}
One or two segments combined can make a match, and that is enought. Three or more segments combined should not be considered.
This does work so far, but is there a better way to do it, perhaps without nesting foreach loops?

First: don't change debugged, working, sufficiently efficient code for no reason. Your solution looks good.
However, we can make some improvements to your solution.
public static List<string> GetNameSegments(string filename)
Making the output a list puts restrictions on the implementation that are not required by the caller. It should be IEnumerable<String>. Particularly since the caller in this case only cares about the first match.
var segments = filename.Split('_', '-').ToList();
Why ToList? A list is array-backed. You've already got an array in hand. Just use the array.
Since there is no longer a need to build up a list, we can transform your two-loop solution into an iterator block:
public static IEnumerable<string> GetNameSegments(string filename)
{
var segments = filename.Split('_', '-');
foreach (var segment in segments)
yield return segment;
foreach (var s1 in segments)
foreach (var s2 in segments)
if (s1 != s2)
yield return s1 + s2;
}
Much nicer. Alternatively we could notice that this has the structure of a query and simply return the query:
public static IEnumerable<string> GetNameSegments(string filename)
{
var q1= filename.Split('_', '-');
var q2 = from s1 in q1
from s2 in q1
where s1 != s2
select s1 + s2;
return q1.Concat(q2);
}
Again, much nicer in this form.
Now let's talk about efficiency. As is often the case, we can achieve greater efficiency at a cost of increased complication. This code looks like it should be plenty fast enough. Your example has nine segments. Let's suppose that nine or ten is typical. Our solutions thus far consider the ten or so singletons first, and then the hundred or so combinations. That's nothing; this code is probably fine. But what if we had thousands of segments and were considering millions of possibilities?
In that case we should restructure the algorithm. One possibility would be this general solution:
public bool IsMatch(HashSet<string> segments, string name)
{
if (segments.Contains(name))
return true;
var q = from s1 in segments
where name.StartsWith(s1)
let s2 = name.Substring(s1.Length)
where s1 != s2
where segments.Contains(s2)
select 1; // Dummy. All we care about is if there is one.
return q.Any();
}
Your original solution is quadratic in the number of segments. This one is linear; we rely on the constant order contains operation. (This assumes of course that string operations are constant time because strings are short. If that's not true then we have a whole other kettle of fish to fry.)
How else could we extract wins in the asymptotic case?
If we happened to have the property that the collection was not a hash set but rather a sorted list then we could do even better; we could binary search the list to find the start and end of the range of possible prefix matches, and then pour the list into a hashset to do the suffix matches. That's still linear, but could have a smaller constant factor.
If we happened to know that the target string was small compared to the number of segments, we could attack the problem from the other end. Generate all possible combinations of partitions of the target string and check if both halves are in the segment set. The problem with this solution is that it is quadratic in memory usage in the size of the string. So what we'd want to do there is construct a special hash on character sequences and use that to populate the hash table, rather than the standard string hash. I'm sure you can see how the solution would go from there; I shan't spell out the details.

Efficiency is very much dependent on the business problem that you're attempting to solve. Without knowing the full context/usage it's difficult to define the most efficient solution. What works for one situation won't always work for others.
I would always advocate to write working code and then solve any performance issues later down the line (or throw more tin at the problem as it's usually cheaper!) If you're having specific performance issues then please do tell us more...
I'm going to go out on a limb here and say (hope) that you're only going to be matching the filename against the object name once per execution. If that's the case I reckon this approach will be just about the fastest. In a circumstance where you're matching a single filename against multiple object names then the obvious choice is to build up an index of sorts and match against that as you were already doing, although I'd consider different types of collection depending on your expected execution/usage.
public static bool IsMatch(string filename, string objectName)
{
var segments = filename.Split('-', '_');
for (int i = 0; i < segments.Length; i++)
{
if (string.Equals(segments[i], objectName)) return true;
for (int ii = 0; ii < segments.Length; ii++)
{
if (ii == i) continue;
if (string.Equals($"{segments[i]}{segments[ii]}", objectName)) return true;
}
}
return false;
}

If you are willing to use the MoreLINQ NuGet package then this may be worth considering:
public static HashSet<string> GetNameSegments(string filename)
{
var segments = filename.Split(new char[] {'_', '-'}, StringSplitOptions.RemoveEmptyEntries).ToList();
var matches = segments
.Cartesian(segments, (x, y) => x == y ? null : x + y)
.Where(z => z != null)
.Concat(segments);
return new HashSet<string>(matches);
}
StringSplitOptions.RemoveEmptyEntries handles adjacent separators (e.g. --). Cartesian is roughly equivalent to your existing nested for loops. The Where is to remove null entries (i.e. if x == y). Concat is the same as your existing Concat. The use of HashSet allows for your Contains calls (in IsMatch) to be faster.

c# Linq Except Not Returning List of Different Values

I am trying to find the differences in two lists. List, "y" should have 1 unique value when compared to list "x". However, Except, does not return the difference. The, "differences" list's count always equals 0.
List<EtaNotificationUser> etaNotifications = GetAllNotificationsByCompanyIDAndUserID(PrevSelectedCompany.cmp_ID);
IEnumerable<string> x = etaNotifications.OfType<string>();
IEnumerable<string> y = EmailList.OfType<string>();
IEnumerable<string> differences = x.Except(y, new StringLengthEqualityComparer()).ToList();
foreach(string diff in differences)
{
addDiffs.Add(diff);
}
After reading a few posts and articles on the post, I created a custom comparer. The comparer looks at string length (kept it simple for testing) and obtains the Hashcode, since these are two objects of a different type (even though I convert their types to string), I thought it may have been the issue.
class StringLengthEqualityComparer : IEqualityComparer<string>
{
public bool Equals(string x, string y)
{
return x.Length == y.Length;
}
public int GetHashCode(string obj)
{
return obj.Length;
}
}
This is my first time using Except. Sounds like a great, optimized way of comparing two lists, but I can't get it to work.
Update
X - Should hold Email Addresses from the database.
GetAllNotificationsByCompanyIDAndUserID - brings back email values from the DB.
Y - Should hold all Email Addresses in the UI Grid.
What I am trying to do is detect if a new e-mail has been added to the grid. So at this point X will have the saved values from past entries. Y will have any new e-mail addresses add by the user and have not been saved yet.
I have verified this is all working correctly.

The problem is here:
IEnumerable<string> x = etaNotifications.OfType<string>();
but etaNotifications is a List<EtaNotificationUser>, none of which can be a string since string is sealed. OfType returns all instances that are of the given type - it does not "convert" each member to that type.
So x will always be empty.
Maybe you want:
IEnumerable<string> x = etaNotifications.Select(e => e.ToString());
if EtaNotificationUser has overridden ToString to give you the value you want to compare. If the value you want to compare is in a property you can use:
IEnumerable<string> x = etaNotifications.Select(e => e.EmailAddress);
or some other property.
You'll likely have to do something similar for y (unless EmailList is already a List<string> which I doubt).

Assuming you have verified that your two enumerables x and y actually contain the strings you expect them to, I believe your problem is with your string comparer. According to the docs, Enumerable.Except "Produces the set difference of two sequences. The set difference is the members of the first sequence that don't appear in the second sequence." But your equality comparer equates all strings with the same length. Thus, if a string in the first sequence happens to have the same length as a string in the second, it will not be found as different using your comparer.
Update: yup, I just tested it:
public class StringLengthEqualityComparer : IEqualityComparer<string>
{
public bool Equals(string x, string y)
{
return x.Length == y.Length;
}
public int GetHashCode(string obj)
{
return obj.Length;
}
}
string [] array1 = new string [] { "foo", "bar", "yup" };
string[] array2 = new string[] { "dll" };
int diffCount;
diffCount = 0;
foreach (var diff in array1.Except(array2, new StringLengthEqualityComparer()))
{
diffCount++;
}
Debug.Assert(diffCount == 0); // No assert.
diffCount = 0;
foreach (var diff in array1.Except(array2))
{
diffCount++;
}
Debug.Assert(diffCount == 0); // Assert b/c diffCount == 3.
There is no assert with the custom comparer but there is with the standard.

Compare two values using RegEx

If I have two values eg/ABC001 and ABC100 or A0B0C1 and A1B0C0, is there a RegEx I can use to make sure the two values have the same pattern?

Well, here's my shot at it. This doesn't use regular expressions, and assumes s1 and s2 only contain numbers or digits:
public static bool SamePattern(string s1, string s2)
{
if (s1.Length == s2.Length)
{
char[] chars1 = s1.ToCharArray();
char[] chars2 = s2.ToCharArray();
for (int i = 0; i < chars1.Length; i++)
{
if (!Char.IsDigit(chars1[i]) && chars1[i] != chars2[i])
{
return false;
}
else if (Char.IsDigit(chars1[i]) != Char.IsDigit(chars2[i]))
{
return false;
}
}
return true;
}
else
{
return false;
}
}
A description of the algorithm is as follows:
If the strings have different lengths, return false.
Otherwise, check the characters in the same position in both strings:
If they are both digits or both numbers, move on to the next iteration.
If they aren't digits but aren't the same, return false.
If one is a digit and one is a number, return false.
If all characters in both strings were checked successfully, return true.

If you don't know the pattern in advance, but are only going to encounter two groups of characters (alpha and digits), then you could do the following:
Write some C# that parsed the first pattern, looking at each char and determine if it's alpha, or digit, then generate a regex accordingly from that pattern.
You may find that there's no point writing code to generate a regex, as it could be just as simple to check the second string against the first.
Alternatively, without regex:
First check the strings are the same length.
Then loop through both strings at the same time, char by char. If char[x] from string 1 is alpha, and char[x] from string two is the same, you're patterns are matching.
Try this, it should cope if a string sneaks in some symbols. Edited to compare character values ... and use Char.IsLetter and Char.IsDigit
private bool matchPattern(string string1, string string2)
{
bool result = (string1.Length == string2.Length);
char[] chars1 = string1.ToCharArray();
char[] chars2 = string2.ToCharArray();
for (int i = 0; i < string1.Length; i++)
{
if (Char.IsLetter(chars1[i]) != Char.IsLetter(chars2[i]))
{
result = false;
}
if (Char.IsLetter(chars1[i]) && (chars1[i] != chars2[i]))
{
//Characters must be identical
result = false;
}
if (Char.IsDigit(chars1[i]) != Char.IsDigit(chars2[i]))
result = false;
}
return result;
}

Consider using Char.GetUnicodeCategory
You can write a helper class for this task:
public class Mask
{
public Mask(string originalString)
{
OriginalString = originalString;
CharCategories = originalString.Select(Char.GetUnicodeCategory).ToList();
}
public string OriginalString { get; private set; }
public IEnumerable<UnicodeCategory> CharCategories { get; private set; }
public bool HasSameCharCategories(Mask other)
{
//null checks
return CharCategories.SequenceEqual(other.CharCategories);
}
}
Use as
Mask mask1 = new Mask("ab12c3");
Mask mask2 = new Mask("ds124d");
MessageBox.Show(mask1.HasSameCharCategories(mask2).ToString());

I don't know C# syntax but here is a pseudo code:
split the strings on ''
sort the 2 arrays
join each arrays with ''
compare the 2 strings

A general-purpose solution with LINQ can be achieved quite easily. The idea is:
Sort the two strings (reordering the characters).
Compare each sorted string as a character sequence using SequenceEquals.
This scheme enables a short, graceful and configurable solution, for example:
// We will be using this in SequenceEquals
class MyComparer : IEqualityComparer<char>
{
public bool Equals(char x, char y)
{
return x.Equals(y);
}
public int GetHashCode(char obj)
{
return obj.GetHashCode();
}
}
// and then:
var s1 = "ABC0102";
var s2 = "AC201B0";
Func<char, double> orderFunction = char.GetNumericValue;
var comparer = new MyComparer();
var result = s1.OrderBy(orderFunction).SequenceEqual(s2.OrderBy(orderFunction), comparer);
Console.WriteLine("result = " + result);
As you can see, it's all in 3 lines of code (not counting the comparer class). It's also very very easily configurable.
The code as it stands checks if s1 is a permutation of s2.
Do you want to check if s1 has the same number and kind of characters with s2, but not necessarily the same characters (e.g. "ABC" to be equal to "ABB")? No problem, change MyComparer.Equals to return char.GetUnicodeCategory(x).Equals(char.GetUnicodeCategory(y));.
By changing the values of orderFunction and comparer you can configure a multitude of other comparison options.
And finally, since I don't find it very elegant to define a MyComparer class just to enable this scenario, you can also use the technique described in this question:
Wrap a delegate in an IEqualityComparer
to define your comparer as an inline lambda. This would result in a configurable solution contained in 2-3 lines of code.

C# matching two text files, case sensitive issue

What I have is two files, sourcecolumns.txt and destcolumns.txt. What I need to do is compare source to dest and if the dest doesn't contain the source value, write it out to a new file. The code below works except I have case sensitive issues like this:
source: CPI
dest: Cpi
These don't match because of captial letters, so I get incorrect outputs. Any help is always welcome!
string[] sourcelinestotal =
File.ReadAllLines("C:\\testdirectory\\" + "sourcecolumns.txt");
string[] destlinestotal =
File.ReadAllLines("C:\\testdirectory\\" + "destcolumns.txt");
foreach (string sline in sourcelinestotal)
{
if (destlinestotal.Contains(sline))
{
}
else
{
File.AppendAllText("C:\\testdirectory\\" + "missingcolumns.txt", sline);
}
}

You could do this using an extension method for IEnumerable<string> like:
public static class EnumerableExtensions
{
public static bool Contains( this IEnumerable<string> source, string value, StringComparison comparison )
{
if (source == null)
{
return false; // nothing is a member of the empty set
}
return source.Any( s => string.Equals( s, value, comparison ) );
}
}
then change
if (destlinestotal.Contains( sline ))
to
if (destlinestotal.Contains( sline, StringComparison.OrdinalIgnoreCase ))
However, if the sets are large and/or you are going to do this very often, the way you're going about it is very inefficient. Essentially, you're doing an O(n2) operation -- for each line in the source you compare it with, potentially, all lines in the destination. It would be better to create a HashSet from the destination columns with a case insenstivie comparer and then iterate through your source columns checking if each one exists in the HashSet of the destination columns. This would be an O(n) algorithm. note that Contains on the HashSet will use the comparer you provide in the constructor.
string[] sourcelinestotal =
File.ReadAllLines("C:\\testdirectory\\" + "sourcecolumns.txt");
HashSet<string> destlinestotal =
new HashSet<string>(
File.ReadAllLines("C:\\testdirectory\\" + "destcolumns.txt"),
StringComparer.OrdinalIgnoreCase
);
foreach (string sline in sourcelinestotal)
{
if (!destlinestotal.Contains(sline))
{
File.AppendAllText("C:\\testdirectory\\" + "missingcolumns.txt", sline);
}
}
In retrospect, I actually prefer this solution over simply writing your own case insensitive contains for IEnumerable<string> unless you need the method for something else. There's actually less code (of your own) to maintain by using the HashSet implementation.

Use an extension method for your Contains. A brilliant example was found here on stack overflow Code isn't mine, but I'll post it below.
public static bool Contains(this string source, string toCheck, StringComparison comp)
{
return source.IndexOf(toCheck, comp) >= 0;
}
string title = "STRING";
bool contains = title.Contains("string", StringComparison.OrdinalIgnoreCase);

If you do not need case sensitivity, convert your lines to upper case using string.ToUpper before comparison.

How to search a string in String array

I need to search a string in the string array. I dont want to use any for looping in it
string [] arr = {"One","Two","Three"};
string theString = "One"
I need to check whether theString variable is present in arr.

Well, something is going to have to look, and looping is more efficient than recursion (since tail-end recursion isn't fully implemented)... so if you just don't want to loop yourself, then either of:
bool has = arr.Contains(var); // .NET 3.5
or
bool has = Array.IndexOf(arr, var) >= 0;
For info: avoid names like var - this is a keyword in C# 3.0.

Every method, mentioned earlier does looping either internally or externally, so it is not really important how to implement it. Here another example of finding all references of target string
string [] arr = {"One","Two","Three"};
var target = "One";
var results = Array.FindAll(arr, s => s.Equals(target));

Does it have to be a string[] ? A List<String> would give you what you need.
List<String> testing = new List<String>();
testing.Add("One");
testing.Add("Two");
testing.Add("Three");
testing.Add("Mouse");
bool inList = testing.Contains("Mouse");

bool exists = arr.Contains("One");

I think it is better to use Array.Exists than Array.FindAll.

Its pretty simple. I always use this code to search string from a string array
string[] stringArray = { "text1", "text2", "text3", "text4" };
string value = "text3";
int pos = Array.IndexOf(stringArray, value);
if (pos > -1)
{
return true;
}
else
{
return false;
}

If the array is sorted, you can use BinarySearch. This is a O(log n) operation, so it is faster as looping. If you need to apply multiple searches and speed is a concern, you could sort it (or a copy) before using it.

Each class implementing IList has a method Contains(Object value). And so does System.Array.

Why the prohibition "I don't want to use any looping"? That's the most obvious solution. When given the chance to be obvious, take it!
Note that calls like arr.Contains(...) are still going to loop, it just won't be you who has written the loop.
Have you considered an alternate representation that's more amenable to searching?
A good Set implementation would perform well. (HashSet, TreeSet or the local equivalent).
If you can be sure that arr is sorted, you could use binary search (which would need to recurse or loop, but not as often as a straight linear search).

You can use Find method of Array type. From .NET 3.5 and higher.
public static T Find<T>(
T[] array,
Predicate<T> match
)
Here is some examples:
// we search an array of strings for a name containing the letter “a”:
static void Main()
{
string[] names = { "Rodney", "Jack", "Jill" };
string match = Array.Find (names, ContainsA);
Console.WriteLine (match); // Jack
}
static bool ContainsA (string name) { return name.Contains ("a"); }
Here’s the same code shortened with an anonymous method:
string[] names = { "Rodney", "Jack", "Jill" };
string match = Array.Find (names, delegate (string name)
{ return name.Contains ("a"); } ); // Jack
A lambda expression shortens it further:
string[] names = { "Rodney", "Jack", "Jill" };
string match = Array.Find (names, n => n.Contains ("a")); // Jack

At first shot, I could come up with something like this (but it's pseudo code and assuming you cannot use any .NET built-in libaries). Might require a bit of tweaking and re-thinking, but should be good enough for a head-start, maybe?
int findString(String var, String[] stringArray, int currentIndex, int stringMaxIndex)
{
if currentIndex > stringMaxIndex
return (-stringMaxIndex-1);
else if var==arr[currentIndex] //or use any string comparison op or function
return 0;
else
return findString(var, stringArray, currentIndex++, stringMaxIndex) + 1 ;
}
//calling code
int index = findString(var, arr, 0, getMaxIndex(arr));
if index == -1 printOnScreen("Not found");
else printOnScreen("Found on index: " + index);

In C#, if you can use an ArrayList, you can use the Contains method, which returns a boolean:
if MyArrayList.Contains("One")

You can check the element existence by
arr.Any(x => x == "One")

it is old one ,but this is the way i do it ,
enter code herevar result = Array.Find(names, element => element == "One");

I'm surprised that no one suggested using Array.IndexOf Method.
Indeed, Array.IndexOf has two advantages :
It allows searching if an element is included into an array,
It gets at the same time the index into the array.
int stringIndex = Array.IndexOf(arr, theString);
if (stringIndex >= 0)
{
// theString has been found
}
Inline version :
if (Array.IndexOf(arr, theString) >= 0)
{
// theString has been found
}

Using Contains()
string [] SomeArray = {"One","Two","Three"};
bool IsExist = SomeArray.Contains("One");
Console.WriteLine("Is string exist: "+ IsExist);
Using Find()
string [] SomeArray = {"One","Two","Three"};
var result = Array.Find(SomeArray, element => element == "One");
Console.WriteLine("Required string is: "+ result);
Another simple & traditional way, very useful for beginners to build logic.
string [] SomeArray = {"One","Two","Three"};
foreach (string value in SomeArray) {
if (value == "One") {
Console.WriteLine("Required string is: "+ value);
}
}

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Enumerable.Except with IEqualityComparer - c#

Related

How to combine items in List<string> to make new items efficiently

c# Linq Except Not Returning List of Different Values

Compare two values using RegEx

C# matching two text files, case sensitive issue

How to search a string in String array

Categories

Resources