C# matching two text files, case sensitive issue


What I have is two files, sourcecolumns.txt and destcolumns.txt. What I need to do is compare source to dest and if the dest doesn't contain the source value, write it out to a new file. The code below works except I have case sensitive issues like this:
source: CPI
dest: Cpi
These don't match because of capital letters, so I get incorrect outputs. Any help is always welcome!
string[] sourcelinestotal =
File.ReadAllLines("C:\\testdirectory\\" + "sourcecolumns.txt");
string[] destlinestotal =
File.ReadAllLines("C:\\testdirectory\\" + "destcolumns.txt");
foreach (string sline in sourcelinestotal)
{
if (destlinestotal.Contains(sline))
{
}
else
{
File.AppendAllText("C:\\testdirectory\\" + "missingcolumns.txt", sline);
}
}

You could do this using an extension method for IEnumerable<string> like:
public static class EnumerableExtensions
{
public static bool Contains( this IEnumerable<string> source, string value, StringComparison comparison )
{
if (source == null)
{
return false; // nothing is a member of the empty set
}
return source.Any( s => string.Equals( s, value, comparison ) );
}
}
then change
if (destlinestotal.Contains( sline ))
to
if (destlinestotal.Contains( sline, StringComparison.OrdinalIgnoreCase ))
However, if the sets are large and/or you are going to do this very often, the way you're going about it is very inefficient. Essentially, you're doing an O(n²) operation: for each line in the source you compare it with, potentially, all lines in the destination. It would be better to create a HashSet from the destination columns with a case-insensitive comparer and then iterate through your source columns, checking whether each one exists in the HashSet of the destination columns. This would be an O(n) algorithm. Note that Contains on the HashSet will use the comparer you provide in the constructor.
string[] sourcelinestotal =
File.ReadAllLines("C:\\testdirectory\\" + "sourcecolumns.txt");
HashSet<string> destlinestotal =
new HashSet<string>(
File.ReadAllLines("C:\\testdirectory\\" + "destcolumns.txt"),
StringComparer.OrdinalIgnoreCase
);
foreach (string sline in sourcelinestotal)
{
if (!destlinestotal.Contains(sline))
{
// Append a line break so each missing value lands on its own line
File.AppendAllText("C:\\testdirectory\\" + "missingcolumns.txt", sline + Environment.NewLine);
}
}
In retrospect, I actually prefer this solution over simply writing your own case insensitive contains for IEnumerable<string> unless you need the method for something else. There's actually less code (of your own) to maintain by using the HashSet implementation.
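If the explicit loop isn't needed, the same comparison can probably be expressed with Enumerable.Except, which builds a set internally and accepts a comparer. This is only a sketch: the names ColumnDiff and FindMissing are made up for illustration, and it assumes it is acceptable for repeated source values to collapse, since Except returns each distinct value once.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original answer.
public static class ColumnDiff
{
    // Returns the source values that have no case-insensitive match in dest.
    // Except yields each distinct value once, so repeated source lines collapse.
    public static List<string> FindMissing(IEnumerable<string> source, IEnumerable<string> dest)
    {
        return source.Except(dest, StringComparer.OrdinalIgnoreCase).ToList();
    }
}
```

The result could then be written in one call with File.WriteAllLines, passing the two File.ReadAllLines arrays from the question as source and dest, which also puts each value on its own line.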

Use an extension method for your Contains. A good example was found elsewhere on Stack Overflow. The code isn't mine, but I'll post it below.
public static bool Contains(this string source, string toCheck, StringComparison comp)
{
return source.IndexOf(toCheck, comp) >= 0;
}
string title = "STRING";
bool contains = title.Contains("string", StringComparison.OrdinalIgnoreCase);

If you do not need case sensitivity, convert your lines to upper case using string.ToUpper before comparison.

Related

Enumerable.Except with IEqualityComparer

I have two string arrays, newArray and oldArray, and I want to use the Enumerable.Except method to remove all items in newArray that are also in oldArray, and then write the result to a CSV file.
However, I need to use a custom comparer in order to check for formatting similarities (if there is a newline character in one array and not the other, I don't want this item being written to the file).
My code as of now:
string newString = File.ReadAllText(csvOutputFile1);
string[] newArray = newString.Split(new string[] {sentinel}, StringSplitOptions.RemoveEmptyEntries);
string oldString = File.ReadAllText(csvOutputFile2);
string[] oldArray = oldString.Split(new string[] { sentinel }, StringSplitOptions.None);
IEnumerable<string> differnceQuery = newArray.Except(oldArray, new Comparer());
using (var wtr = new StreamWriter(diffFile))
{
foreach (var s in differnceQuery)
{
wtr.WriteLine(s.Trim() + "#!#");
}
}
and the custom comparer class:
class Comparer : IEqualityComparer<string>
{
public bool Equals(string x, string y)
{
x = x.ToString().Replace(" ", "").Replace("\n", "").Replace("\r", "");
y = y.ToString().Replace(" ", "").Replace("\n", "").Replace("\r", "");
if (x == y)
return true;
else
return false;
}
public int GetHashCode(string row)
{
int hCode = row.GetHashCode();
return hCode;
}
}
The resulting file is not omitting the items that differ only in formatting between the two arrays. So although it catches items that are in newArray but not in oldArray (like it should), it is also including items that differ only because of a \n or similar, even though my custom comparer removes those characters.
The thing I really don't understand is when I debug and step through my code, I can see each pair of items being analyzed in my custom comparer class, but only when they are equal terms. If for example the string "This is\nthe 1st term" is in newArray and the string "This is the first array" is in oldArray, the debugger doesn't even enter the comparer class and instead jumps straight to the writeline part of my code in the main class.

Simply put, your hash-code does not correctly mirror your equality method. Strings like "a b c" and "abc" would return different values from GetHashCode, so it would never get around to testing Equals. GetHashCode must return the same result for any two values that could be equal. It is not, however, necessary that two strings that are not equal return different hash-codes (although it is highly desirable, otherwise everything will go into the same hash-bucket).
I guess you could use:
// warning: probably not very efficient
return row.Replace(" ", "").Replace("\n", "").Replace("\r", "").GetHashCode();
but that looks pretty expensive (lots of potential for garbage strings to be generated all the time)
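To make the rule concrete, here is a minimal sketch of a comparer whose Equals and GetHashCode share one normalization step, so two strings that compare equal always hash the same. The class name IgnoreFormattingComparer and the Normalize helper are hypothetical, and the null handling is an assumption the question does not specify.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical name; treats strings as equal when they differ only by spaces, '\n' or '\r'.
public sealed class IgnoreFormattingComparer : IEqualityComparer<string>
{
    // Single definition of "the same", used by both Equals and GetHashCode.
    private static string Normalize(string s)
    {
        return s == null ? null : s.Replace(" ", "").Replace("\n", "").Replace("\r", "");
    }

    public bool Equals(string x, string y)
    {
        return string.Equals(Normalize(x), Normalize(y), StringComparison.Ordinal);
    }

    public int GetHashCode(string s)
    {
        // Hash the normalized form so equal values land in the same bucket.
        string normalized = Normalize(s);
        return normalized == null ? 0 : normalized.GetHashCode();
    }
}
```

Passing an instance of this class to newArray.Except(oldArray, ...) should then let Except reach Equals for pairs such as "a b c" and "abc". The allocation cost mentioned above still applies, because every call normalizes its arguments again.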

Split string extension with generic type?

I would like to create a Split extension that would allow me to split any string to a strongly-typed list. I have a head start, but since I was going to reuse this in many projects, I would like to get input from the community (and so you can add it to your own toolbox ;) Any ideas from here?
public static class Converters
{
public static IEnumerable<T> Split<T>(this string source, char delimiter)
{
var type = typeof(T);
//SPLIT TO INTEGER LIST
if (type == typeof(int))
{
return source.Split(delimiter).Select(n => int.Parse(n)) as IEnumerable<T>;
}
//SPLIT TO FLOAT LIST
if (type == typeof(float))
{
return source.Split(delimiter).Select(n => float.Parse(n)) as IEnumerable<T>;
}
//SPLIT TO DOUBLE LIST
if (type == typeof(double))
{
return source.Split(delimiter).Select(n => double.Parse(n)) as IEnumerable<T>;
}
//SPLIT TO DECIMAL LIST
if (type == typeof(decimal))
{
return source.Split(delimiter).Select(n => decimal.Parse(n)) as IEnumerable<T>;
}
//SPLIT TO DATE LIST
if (type == typeof(DateTime))
{
return source.Split(delimiter).Select(n => DateTime.Parse(n)) as IEnumerable<T>;
}
//USE DEFAULT SPLIT IF NO SPECIAL CASE DEFINED
return source.Split(delimiter) as IEnumerable<T>;
}
}

I'd add a parameter for the conversion function:
public static IEnumerable<T> Split<T>(this string source, Func<string, T> converter, params char[] delimiters)
{
return source.Split(delimiters).Select(converter);
}
And you can call it as
IEnumerable<int> ints = "1,2,3".Split<int>(int.Parse, ',');
I would also consider renaming it to avoid confusion with the String.Split instance method since this complicates overload resolution, and behaves differently to the others.
EDIT: If you don't want to specify the conversion function, you could use type converters:
public static IEnumerable<T> SplitConvert<T>(this string str, params char[] delimiters)
{
var converter = TypeDescriptor.GetConverter(typeof(T));
if (converter.CanConvertFrom(typeof(string)))
{
return str.Split(delimiters).Select(s => (T)converter.ConvertFromString(s));
}
else throw new InvalidOperationException("Cannot convert type");
}
This allows the conversion to be extended to other types rather than relying on a pre-defined list.

Although I agree with Lee’s suggestion, I personally don’t think it’s worth defining a new extension method for something that may trivially be achieved with standard LINQ operations:
IEnumerable<int> ints = "1,2,3".Split(',').Select(int.Parse);

public static IEnumerable<T> Split<T>
(this string source, char delimiter,Converter<string,T> func)
{
return source.Split(delimiter).Select(n => func(n));
}
Example:
"...".Split<int>(',',p=>int.Parse(p))
Or you can use Convert.ChangeType without defining a function:
public static IEnumerable<T> Split<T>(this string source, char delimiter)
{
return source.Split(delimiter).Select(n => (T)Convert.ChangeType(n, typeof(T)));
}

I don't like this method. Parsing data types from strings (a sort of deserialization) is a very type- and content- sensitive process when you're dealing with data types more complex than an int. For example, DateTime.Parse parses the date using the current culture, so your method is not going to provide consistent or reliable output for dates across systems. It also tries to parse the date at all costs so it might skip through what might be considered bad input in some situations.
The goal of splitting any string into a strongly typed list cannot really be accomplished with a single method that uses hard-coded conversions, especially if your goal is broad usability, even if you update it repeatedly with new conversions. The best way to go about it is just to write "1,2,3".Split(',').Select(x => whatever), like Douglas suggested above. It's also very clear what sort of conversion is taking place.
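As an illustration of the culture concern, the following sketch keeps the plain Split plus Select approach but passes an explicit culture and, for dates, an explicit format. The class and method names (ParsingExamples, ParseInvariantDoubles, ParseExactDates) are invented for this example; the choice of the invariant culture is an assumption about what the input looks like.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helpers showing culture-explicit parsing.
public static class ParsingExamples
{
    // Parses doubles with the invariant culture, so '.' is always the decimal separator.
    public static List<double> ParseInvariantDoubles(string text, char delimiter)
    {
        return text.Split(delimiter)
                   .Select(s => double.Parse(s, CultureInfo.InvariantCulture))
                   .ToList();
    }

    // Accepts dates only in the exact format given, instead of guessing.
    public static List<DateTime> ParseExactDates(string text, char delimiter, string format)
    {
        return text.Split(delimiter)
                   .Select(s => DateTime.ParseExact(s, format, CultureInfo.InvariantCulture))
                   .ToList();
    }
}
```

Both methods throw a FormatException on input that does not fit, which is usually preferable to silently accepting something ambiguous.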

Compare two values using RegEx

If I have two values, e.g. ABC001 and ABC100, or A0B0C1 and A1B0C0, is there a RegEx I can use to make sure the two values have the same pattern?

Well, here's my shot at it. This doesn't use regular expressions, and assumes s1 and s2 only contain letters or digits:
public static bool SamePattern(string s1, string s2)
{
if (s1.Length == s2.Length)
{
char[] chars1 = s1.ToCharArray();
char[] chars2 = s2.ToCharArray();
for (int i = 0; i < chars1.Length; i++)
{
if (!Char.IsDigit(chars1[i]) && chars1[i] != chars2[i])
{
return false;
}
else if (Char.IsDigit(chars1[i]) != Char.IsDigit(chars2[i]))
{
return false;
}
}
return true;
}
else
{
return false;
}
}
A description of the algorithm is as follows:
If the strings have different lengths, return false.
Otherwise, check the characters in the same position in both strings:
If they are both digits, move on to the next iteration.
If the first one isn't a digit and the two characters aren't the same, return false.
If one is a digit and the other is not, return false.
If all characters in both strings were checked successfully, return true.

If you don't know the pattern in advance, but are only going to encounter two groups of characters (alpha and digits), then you could do the following:
Write some C# that parses the first pattern, looks at each char to determine whether it's alpha or a digit, and then generates a regex accordingly from that pattern.
You may find that there's no point writing code to generate a regex, as it could be just as simple to check the second string against the first.
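For completeness, generating the regex from the first value might look like the sketch below. It assumes the interpretation used in the first answer: digits may vary, while every other character must match exactly. PatternMatcher, BuildPattern and SamePattern are hypothetical names.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: derives a regex from one value and tests another against it.
public static class PatternMatcher
{
    // Digits in the template become \d; any other character must match literally.
    public static Regex BuildPattern(string template)
    {
        var sb = new StringBuilder("^");
        foreach (char c in template)
        {
            sb.Append(char.IsDigit(c) ? @"\d" : Regex.Escape(c.ToString()));
        }
        sb.Append(@"\z"); // anchor at the very end so the lengths must agree
        return new Regex(sb.ToString());
    }

    public static bool SamePattern(string s1, string s2)
    {
        return BuildPattern(s1).IsMatch(s2);
    }
}
```

With this sketch, "ABC001" would produce the pattern ^ABC\d\d\d\z. As noted above, the direct character-by-character check does the same job without building a regex at all.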
Alternatively, without regex:
First check the strings are the same length.
Then loop through both strings at the same time, char by char. If char[x] from string 1 is alpha and char[x] from string 2 is the same character (or both are digits), your patterns match at that position.
Try this, it should cope if a string sneaks in some symbols. Edited to compare character values ... and use Char.IsLetter and Char.IsDigit
private bool matchPattern(string string1, string string2)
{
// Bail out early: indexing chars2 below would throw if string2 were shorter
if (string1.Length != string2.Length) return false;
bool result = true;
char[] chars1 = string1.ToCharArray();
char[] chars2 = string2.ToCharArray();
for (int i = 0; i < string1.Length; i++)
{
if (Char.IsLetter(chars1[i]) != Char.IsLetter(chars2[i]))
{
result = false;
}
if (Char.IsLetter(chars1[i]) && (chars1[i] != chars2[i]))
{
//Characters must be identical
result = false;
}
if (Char.IsDigit(chars1[i]) != Char.IsDigit(chars2[i]))
result = false;
}
return result;
}

Consider using Char.GetUnicodeCategory
You can write a helper class for this task:
public class Mask
{
public Mask(string originalString)
{
OriginalString = originalString;
CharCategories = originalString.Select(Char.GetUnicodeCategory).ToList();
}
public string OriginalString { get; private set; }
public IEnumerable<UnicodeCategory> CharCategories { get; private set; }
public bool HasSameCharCategories(Mask other)
{
//null checks
return CharCategories.SequenceEqual(other.CharCategories);
}
}
Use as
Mask mask1 = new Mask("ab12c3");
Mask mask2 = new Mask("ds124d");
MessageBox.Show(mask1.HasSameCharCategories(mask2).ToString());

I don't know C# syntax but here is a pseudo code:
split the strings on ''
sort the 2 arrays
join each arrays with ''
compare the 2 strings

A general-purpose solution with LINQ can be achieved quite easily. The idea is:
Sort the two strings (reordering the characters).
Compare each sorted string as a character sequence using SequenceEqual.
This scheme enables a short, graceful and configurable solution, for example:
// We will be using this in SequenceEqual
class MyComparer : IEqualityComparer<char>
{
public bool Equals(char x, char y)
{
return x.Equals(y);
}
public int GetHashCode(char obj)
{
return obj.GetHashCode();
}
}
// and then:
var s1 = "ABC0102";
var s2 = "AC201B0";
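// Order by the character itself: char.GetNumericValue returns -1 for every letter,
// which would leave the letters in their original order and break the permutation check
Func<char, char> orderFunction = c => c;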
var comparer = new MyComparer();
var result = s1.OrderBy(orderFunction).SequenceEqual(s2.OrderBy(orderFunction), comparer);
Console.WriteLine("result = " + result);
As you can see, it's all in 3 lines of code (not counting the comparer class). It's also easily configurable.
The code as it stands checks if s1 is a permutation of s2.
Do you want to check if s1 has the same number and kind of characters with s2, but not necessarily the same characters (e.g. "ABC" to be equal to "ABB")? No problem, change MyComparer.Equals to return char.GetUnicodeCategory(x).Equals(char.GetUnicodeCategory(y));.
By changing the values of orderFunction and comparer you can configure a multitude of other comparison options.
And finally, since I don't find it very elegant to define a MyComparer class just to enable this scenario, you can also use the technique described in this question:
Wrap a delegate in an IEqualityComparer
to define your comparer as an inline lambda. This would result in a configurable solution contained in 2-3 lines of code.

Can I use LINQ to strip repeating spaces from a string?

A quick brain teaser: given a string
This   is a  string   with  repeating   spaces
What would be the LINQ expression to end up with
This is a string with repeating spaces
Thanks!
For reference, here's one non-LINQ way:
private static IEnumerable<char> RemoveRepeatingSpaces(IEnumerable<char> text)
{
bool isSpace = false;
foreach (var c in text)
{
if (isSpace && char.IsWhiteSpace(c)) continue;
isSpace = char.IsWhiteSpace(c);
yield return c;
}
}

This is not a linq type task, use regex
string output = Regex.Replace(input," +"," ");
Of course you could use linq to apply this to a collection of strings.
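Applying the same replacement to a collection could be sketched as follows. SpaceCollapser and CollapseAll are made-up names, and the pattern is the one from the answer above, so only the space character is collapsed (not tabs or other whitespace).

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: runs the regex from the answer over each string in a sequence.
public static class SpaceCollapser
{
    // Created once and reused for every line.
    private static readonly Regex RepeatedSpaces = new Regex(" +");

    public static List<string> CollapseAll(IEnumerable<string> lines)
    {
        return lines.Select(line => RepeatedSpaces.Replace(line, " ")).ToList();
    }
}
```

Here LINQ only handles the iteration over the collection; the per-string work is still done by the regex.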

public static string TrimInternal(this string text)
{
var trimmed = text.Where((c, index) => !char.IsWhiteSpace(c) || (index != 0 && !char.IsWhiteSpace(text[index - 1])));
return new string(trimmed.ToArray());
}

Since nobody seems to have given a satisfactory answer, I came up with one. Here's a string-based solution (.Net 4):
public static string RemoveRepeatedSpaces(this string s)
{
return s[0] + string.Join("",
s.Zip(
s.Skip(1),
(x, y) => x == y && y == ' ' ? (char?)null : y));
}
However, this is just a general case of removing repeated elements from a sequence, so here's the generalized version:
public static IEnumerable<T> RemoveRepeatedElements<T>(
this IEnumerable<T> s, T dup)
{
return s.Take(1).Concat(
s.Zip(
s.Skip(1),
(x, y) => x.Equals(y) && y.Equals(dup) ? (object)null : y)
.OfType<T>());
}
Of course, that's really just a more specific version of a function that removes all consecutive duplicates from its input stream:
public static IEnumerable<T> RemoveRepeatedElements<T>(this IEnumerable<T> s)
{
return s.Take(1).Concat(
s.Zip(
s.Skip(1),
(x, y) => x.Equals(y) ? (object)null : y)
.OfType<T>());
}
And obviously you can implement the first function in terms of the second:
public static string RemoveRepeatedSpaces(this string s)
{
return string.Join("", s.RemoveRepeatedElements(' '));
}
BTW, I benchmarked my last function against the regex version (Regex.Replace(s, " +", " ")) and they were within nanoseconds of each other, so the extra LINQ overhead is negligible compared to the extra regex overhead. When I generalized it to remove all consecutive duplicate characters, the equivalent regex (Regex.Replace(s, "(.)\\1+", "$1")) was 3.5 times slower than my LINQ version (string.Join("", s.RemoveRepeatedElements())).
I also tried the "ideal" procedural solution:
public static string RemoveRepeatedSpaces(string s)
{
StringBuilder sb = new StringBuilder(s.Length);
char lastChar = '\0';
foreach (char c in s)
if (c != ' ' || lastChar != ' ')
sb.Append(lastChar = c);
return sb.ToString();
}
This is more than 5 times faster than a regex!

In practice, I would probably just use your original solution or regular expressions (if you want a quick & simple solution). A geeky approach that uses lambda functions would be to define a fixed point operator:
T FixPoint<T>(T initial, Func<T, T> f) {
T current = initial;
do {
initial = current;
current = f(initial);
} while (!EqualityComparer<T>.Default.Equals(initial, current)); // '!=' does not compile for an unconstrained T
return current;
}
This keeps calling the operation f repeatedly until the operation returns the same value that it got as an argument. You can think of the operation as a generalized loop - it is quite useful, though I guess it is too geeky to be included in .NET BCL. Then you can write:
string res = FixPoint(original, s => s.Replace("  ", " "));
It is not as efficient as your original version, but unless there are too many spaces it should work fine.

LINQ is by definition related to enumerables (i.e. collections, lists, arrays). You could transform your string into a collection of chars and select the non-space ones, but this is definitely not a job for LINQ.

Paul Creasey's answer is the way to go.
If you want to treat tabs as whitespace as well, go with:
text = Regex.Replace(text, "[ \t]+", " "); // no '|' inside a character class, or literal pipes would be replaced too
UPDATE:
The most logical way to solve this problem while satisfying the "using LINQ" requirement has been suggested by both Hasan and Ani. However, notice that these solutions involve accessing a character in a string by index.
The spirit of the LINQ approach is that it can be applied to any enumerable sequence. Because any reasonably efficient solution to this problem requires maintaining some kind of state (with Ani's and Hasan's solutions it's easy to miss this fact as the state is already maintained within the string itself), a generic approach that accepts any sequence of items is likely going to be much more straightforward to implement using procedural code.
This procedural code may then be abstracted into a method that looks like a LINQ-style method, of course. But I would not recommend tackling a problem like this with the attitude of "I want to use LINQ in this solution" from the get-go, because it will impose very awkward restrictions on your code.
For what it's worth, here's how I'd implement the general idea.
public static IEnumerable<T> StripConsecutives<T>(this IEnumerable<T> source, T value, IEqualityComparer<T> comparer)
{
// null-checking omitted for brevity
using (var enumerator = source.GetEnumerator())
{
if (enumerator.MoveNext())
{
yield return enumerator.Current;
}
else
{
yield break;
}
T prev = enumerator.Current;
while (enumerator.MoveNext())
{
T current = enumerator.Current;
if (comparer.Equals(prev, value) && comparer.Equals(current, value))
{
// This is a consecutive occurrence of value --
// moving on...
}
else
{
yield return current;
}
prev = current;
}
}
}

Split to list, filter, then rejoin, 2 lines of code...
var test = "   Alpha  Beta   Tango  ";
var l = test.Split(' ').Where(s => !string.IsNullOrEmpty(s));
var result = string.Join(" ", l);
// result = "Alpha Beta Tango"
Refactoring as an extension method:
using System.Linq;
void Main()
{
var test = "   Alpha  Beta   Tango  ";
var result = test.RemoveRepeatedSpaces();
// result = "Alpha Beta Tango";
}
static class Extensions
{
public static string RemoveRepeatedSpaces(this string s)
{
if (s == null)
return string.Empty;
var l = s.Split(' ').Where(a => !string.IsNullOrEmpty(a));
return string.Join(" ", l);
}
}

How to search a string in String array

I need to search for a string in a string array. I don't want to use any for loop to do it.
string [] arr = {"One","Two","Three"};
string theString = "One";
I need to check whether theString variable is present in arr.

Well, something is going to have to look, and looping is more efficient than recursion (since tail-end recursion isn't fully implemented)... so if you just don't want to loop yourself, then either of:
bool has = arr.Contains(var); // .NET 3.5
or
bool has = Array.IndexOf(arr, var) >= 0;
For info: avoid names like var - this is a keyword in C# 3.0.

Every method mentioned earlier loops either internally or externally, so it doesn't really matter which one you use. Here is another example, which finds all occurrences of the target string:
string [] arr = {"One","Two","Three"};
var target = "One";
var results = Array.FindAll(arr, s => s.Equals(target));

Does it have to be a string[] ? A List<String> would give you what you need.
List<String> testing = new List<String>();
testing.Add("One");
testing.Add("Two");
testing.Add("Three");
testing.Add("Mouse");
bool inList = testing.Contains("Mouse");

bool exists = arr.Contains("One");

I think it is better to use Array.Exists than Array.FindAll.
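A short sketch of that suggestion: Array.Exists returns a bool and stops at the first match, whereas Array.FindAll scans the whole array and allocates a result array. The wrapper names ArraySearch and ContainsString are hypothetical, and ordinal (case-sensitive) comparison is an assumption.

```csharp
using System;

// Hypothetical wrapper around Array.Exists.
public static class ArraySearch
{
    // True as soon as one element equals target; later elements are not examined.
    public static bool ContainsString(string[] arr, string target)
    {
        return Array.Exists(arr, s => string.Equals(s, target, StringComparison.Ordinal));
    }
}
```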

It's pretty simple. I always use this code to search for a string in a string array:
string[] stringArray = { "text1", "text2", "text3", "text4" };
string value = "text3";
int pos = Array.IndexOf(stringArray, value);
if (pos > -1)
{
return true;
}
else
{
return false;
}

If the array is sorted, you can use BinarySearch. This is an O(log n) operation, so it is faster than looping. If you need to run multiple searches and speed is a concern, you could sort the array (or a copy) before searching it.
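A minimal sketch of the sort-then-search idea, using Array.Sort and Array.BinarySearch with the same comparer for both steps. The names SortedSearch, SortedCopy and ContainsSorted are invented here, and the ordinal comparer is an assumption; what matters is that sorting and searching agree on the ordering.

```csharp
using System;

// Hypothetical helper for repeated lookups in a sorted copy of an array.
public static class SortedSearch
{
    // Returns a sorted copy so the caller's array is left untouched.
    public static string[] SortedCopy(string[] arr)
    {
        var copy = (string[])arr.Clone();
        Array.Sort(copy, StringComparer.Ordinal);
        return copy;
    }

    // The array must already be sorted with the same comparer used here.
    public static bool ContainsSorted(string[] sorted, string target)
    {
        return Array.BinarySearch(sorted, target, StringComparer.Ordinal) >= 0;
    }
}
```

Sorting costs O(n log n), so this only pays off when the same array is searched many times.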

Each class implementing IList has a method Contains(Object value). And so does System.Array.

Why the prohibition "I don't want to use any looping"? That's the most obvious solution. When given the chance to be obvious, take it!
Note that calls like arr.Contains(...) are still going to loop, it just won't be you who has written the loop.
Have you considered an alternate representation that's more amenable to searching?
A good Set implementation would perform well. (HashSet, TreeSet or the local equivalent).
If you can be sure that arr is sorted, you could use binary search (which would need to recurse or loop, but not as often as a straight linear search).

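You can use the Find method of the Array type. Array.Find has been available since .NET 2.0; the lambda syntax in the last example needs C# 3.0 (.NET 3.5) or higher.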
public static T Find<T>(
T[] array,
Predicate<T> match
)
Here are some examples:
// we search an array of strings for a name containing the letter “a”:
static void Main()
{
string[] names = { "Rodney", "Jack", "Jill" };
string match = Array.Find (names, ContainsA);
Console.WriteLine (match); // Jack
}
static bool ContainsA (string name) { return name.Contains ("a"); }
Here’s the same code shortened with an anonymous method:
string[] names = { "Rodney", "Jack", "Jill" };
string match = Array.Find (names, delegate (string name)
{ return name.Contains ("a"); } ); // Jack
A lambda expression shortens it further:
string[] names = { "Rodney", "Jack", "Jill" };
string match = Array.Find (names, n => n.Contains ("a")); // Jack

At first shot, I could come up with something like this (assuming you cannot use any of the .NET built-in libraries). It searches recursively instead of looping, so it recurses once per element and is only practical for small arrays. Might require a bit of tweaking and re-thinking, but should be good enough for a head start, maybe?
static int FindString(string target, string[] stringArray, int currentIndex)
{
// Ran past the last element: the target is not in the array
if (currentIndex >= stringArray.Length)
return -1;
// Or use any other string comparison op or function
if (stringArray[currentIndex] == target)
return currentIndex;
// Pass currentIndex + 1 (not currentIndex++, which would pass the old value and never terminate)
return FindString(target, stringArray, currentIndex + 1);
}
//calling code
int index = FindString(theString, arr, 0);
if (index == -1) Console.WriteLine("Not found");
else Console.WriteLine("Found on index: " + index);

In C#, if you can use an ArrayList, you can use the Contains method, which returns a boolean:
bool found = MyArrayList.Contains("One");

You can check the element existence by
arr.Any(x => x == "One")

This is an old question, but this is the way I do it:
var result = Array.Find(arr, element => element == "One");

I'm surprised that no one suggested using Array.IndexOf Method.
Indeed, Array.IndexOf has two advantages :
It allows searching if an element is included into an array,
It gets at the same time the index into the array.
int stringIndex = Array.IndexOf(arr, theString);
if (stringIndex >= 0)
{
// theString has been found
}
Inline version :
if (Array.IndexOf(arr, theString) >= 0)
{
// theString has been found
}

Using Contains()
string [] SomeArray = {"One","Two","Three"};
bool IsExist = SomeArray.Contains("One");
Console.WriteLine("Is string exist: "+ IsExist);
Using Find()
string [] SomeArray = {"One","Two","Three"};
var result = Array.Find(SomeArray, element => element == "One");
Console.WriteLine("Required string is: "+ result);
Another simple & traditional way, very useful for beginners to build logic.
string [] SomeArray = {"One","Two","Three"};
foreach (string value in SomeArray) {
if (value == "One") {
Console.WriteLine("Required string is: "+ value);
}
}
