How to not include line breaks when comparing two strings - c#

i am comparing updates to two strings. i did a:
string1 != string2
and they turn out different. I put them in the "Add Watch" and i see the only difference is one has line breaks and the other doesnt'.:
string1 = "This is a test. \nThis is a test";
string2 = "This is a test. This is a test";
i basically want to do a compare but dont include line breaks. So if line break is the only difference then consider them equal.

A quick and dirty way, when performance isn't much of an issue:
string1.Replace("\n", "") != string2.Replace("\n", "")

I'd suggest regex to reduce every space, tab, \r, \n to a single space :
Regex.Replace(string1, #"\s+", " ") != Regex.Replace(string2, #"\s+", " ")

Assuming:
The sort of direct char-value-for-char-value comparison of != and == is what is wanted here, except for the matter of newlines.
The strings are, or may, be large enough or compared often enough to make just replacing "\n" with an empty string too inefficient.
Then:
public bool LinelessEquals(string x, string y)
{
//deal with quickly handlable cases quickly.
if(ReferenceEquals(x, y))//same instance
return true; // - generally happens often in real code,
//and is a fast check, so always worth doing first.
//We already know they aren't both null as
//ReferenceEquals(null, null) returns true.
if(x == null || y == null)
return false;
IEnumerator<char> eX = x.Where(c => c != '\n').GetEnumerator();
IEnumerator<char> eY = y.Where(c => c != '\n').GetEnumerator();
while(eX.MoveNext())
{
if(!eY.MoveNext()) //y is shorter
return false;
if(ex.Current != ey.Current)
return false;
}
return !ey.MoveNext(); //check if y was longer.
}
This is defined as equality rather than inequality, so you could easily adapt it to be an implementation of IEqualityComparer<string>.Equals. Your question for a linebreak-less string1 != string2 becomes: !LinelessEquals(string1, string2)

Here's an equality comparer for strings that ignores certain characters, such as \r and \n.
This implementation doesn't allocate any heap memory during execution, helping its performance. It also avoids virtual calls through IEnumerable and IEnumerator.
public sealed class SelectiveStringComparer : IEqualityComparer<string>
{
private readonly string _ignoreChars;
public SelectiveStringComparer(string ignoreChars = "\r\n")
{
_ignoreChars = ignoreChars;
}
public bool Equals(string x, string y)
{
if (ReferenceEquals(x, y))
return true;
if (x == null || y == null)
return false;
var ix = 0;
var iy = 0;
while (true)
{
while (ix < x.Length && _ignoreChars.IndexOf(x[ix]) != -1)
ix++;
while (iy < y.Length && _ignoreChars.IndexOf(y[iy]) != -1)
iy++;
if (ix >= x.Length)
return iy >= y.Length;
if (iy >= y.Length)
return false;
if (x[ix] != y[iy])
return false;
ix++;
iy++;
}
}
public int GetHashCode(string obj)
{
throw new NotSupportedException();
}
}

A cleaner approach would be to use:
string1.Replace(Environment.NewLine, String.Empty) != string2.Replace(Environment.NewLine, String.Empty);

This is a generalized and tested version of Jon Hannas answer.
/// <summary>
/// Compares two character enumerables one character at a time, ignoring those specified.
/// </summary>
/// <param name="x"></param>
/// <param name="y"></param>
/// <param name="ignoreThese"> If not specified, the default is to ignore linefeed and newline: {'\r', '\n'} </param>
/// <returns></returns>
public static bool EqualsIgnoreSome(this IEnumerable<char> x, IEnumerable<char> y, params char[] ignoreThese)
{
// First deal with quickly handlable cases quickly:
// Same instance - generally happens often in real code, and is a fast check, so always worth doing first.
if (ReferenceEquals(x, y))
return true; //
// We already know they aren't both null as ReferenceEquals(null, null) returns true.
if (x == null || y == null)
return false;
// Default ignore is newlines:
if (ignoreThese == null || ignoreThese.Length == 0)
ignoreThese = new char[] { '\r', '\n' };
// Filters by specifying enumerator.
IEnumerator<char> eX = x.Where(c => !ignoreThese.Contains(c)).GetEnumerator();
IEnumerator<char> eY = y.Where(c => !ignoreThese.Contains(c)).GetEnumerator();
// Compares.
while (eX.MoveNext())
{
if (!eY.MoveNext()) //y is shorter
return false;
if (eX.Current != eY.Current)
return false;
}
return !eY.MoveNext(); //check if y was longer.
}

string1.replace('\n','') != string2.replace('\n','')

Cant you just strip out the line breaks before comparing the strings?
E.g. (pseudocode)...
string1.replace('\n','') != string2.replace('\n','')

Here's a version in VB.net based on Drew Noakes answer
Dim g_sIgnore As String = vbSpace & vbNewLine & vbTab 'String.Format("\n\r\t ")
Public Function StringCompareIgnoringWhitespace(s1 As String, s2 As String) As Boolean
Dim i1 As Integer = 0
Dim i2 As Integer = 0
Dim s1l As Integer = s1.Length
Dim s2l As Integer = s2.Length
Do
While i1 < s1l AndAlso g_sIgnore.IndexOf(s1(i1)) <> -1
i1 += 1
End While
While i2 < s2l AndAlso g_sIgnore.IndexOf(s2(i2)) <> -1
i2 += 1
End While
If i1 = s1l And i2 = s2l Then
Return True
Else
If i1 < s1l AndAlso i2 < s2l AndAlso s1(i1) = s2(i2) Then
i1 += 1
i2 += 1
Else
Return False
End If
End If
Loop
Return False
End Function
I also tested it with
Try
Debug.Assert(Not StringCompareIgnoringWhitespace("a", "z"))
Debug.Assert(Not StringCompareIgnoringWhitespace("aa", "zz"))
Debug.Assert(StringCompareIgnoringWhitespace("", ""))
Debug.Assert(StringCompareIgnoringWhitespace(" ", ""))
Debug.Assert(StringCompareIgnoringWhitespace("", " "))
Debug.Assert(StringCompareIgnoringWhitespace(" a", "a "))
Debug.Assert(StringCompareIgnoringWhitespace(" aa", "aa "))
Debug.Assert(StringCompareIgnoringWhitespace(" aa ", " aa "))
Debug.Assert(StringCompareIgnoringWhitespace(" aa a", " aa a"))
Debug.Assert(Not StringCompareIgnoringWhitespace("a", ""))
Debug.Assert(Not StringCompareIgnoringWhitespace("", "a"))
Debug.Assert(Not StringCompareIgnoringWhitespace("ccc", ""))
Debug.Assert(Not StringCompareIgnoringWhitespace("", "ccc"))
Catch ex As Exception
Console.WriteLine(ex.ToString)
End Try

I've run into this problem a number of times when I'm writing unit tests that need to compare multiple line expected strings with the actual output strings.
For example, if I'm writing a method that outputs a multi-line string I care about what each line looks like, but I don't care about the particular newline character used on a Windows or Mac machine.
In my case I just want to assert that each line is equal in my unit tests and bail out if one of them isn't.
public static void AssertAreLinesEqual(string expected, string actual)
{
using (var expectedReader = new StringReader(expected))
using (var actualReader = new StringReader(actual))
{
while (true)
{
var expectedLine = expectedReader.ReadLine();
var actualLine = actualReader.ReadLine();
Assert.AreEqual(expectedLine, actualLine);
if(expectedLine == null || actualLine == null)
break;
}
}
}
Of course, you could also make the method a little more generic and write to return a bool instead.
public static bool AreLinesEqual(string expected, string actual)
{
using (var expectedReader = new StringReader(expected))
using (var actualReader = new StringReader(actual))
{
while (true)
{
var expectedLine = expectedReader.ReadLine();
var actualLine = actualReader.ReadLine();
if (expectedLine != actualLine)
return false;
if(expectedLine == null || actualLine == null)
break;
}
}
return true;
}
What surprises me most is that there isn't a method like this included in any unit testing framework I've used.

I've had this issue with line endings in an unit test.
//compare files ignoring line ends
org.junit.Assert.assertEquals(
read.readPayload("myFile.xml")
.replace("\n", "")
.replace("\r", ""),
values.getFile()
.replace("\n", "")
.replace("\r", ""));
I usually do not like to make this kind of comparison (comparing the whole file), as a better approach would be validating the fields. But it answers this question here, as it removes line endings for most of the systems (the replace calls is the trick).
PS: read.readPayload reads a text file from the resources folder and puts it into a String, and values is a structure that contains a String with the raw content of a file (as String) in its attributes.
PS2: No performance was considered, since it was just an ugly fix for unit test

Related

How to validate or compare string by omitting certain part of it

I have a string as below
"a1/type/xyz/parts"
The part where 'xyz' exists is dynamic and varies accordingly at any size. I want to compare just the two strings are equal discarding the 'xyz' portion exactly.
For example I have string as below
"a1/type/abcd/parts"
Then my comparison has to be successful
I tried with regular expression as below. Though my knowledge on regular expressions is limited and it did not work. Probably something wrong in the way I used.
var regex = #"^[a-zA-Z]{2}/\[a-zA-Z]{16}/\[0-9a-zA-Z]/\[a-z]{5}/$";
var result = Regex.Match("mystring", regex).Success;
Another idea is to get substring of first and last part omitting the unwanted portion and comparing it.
The comparison should be successful by discarding certain portion of the string with effective code.
Comparison successful cases
string1: "a1/type/21412ghh/parts"
string2: "a1/type/eeeee122ghh/parts"
Comparison failure cases:
string1: "a1/type/21412ghh/parts"
string2: "a2/type/eeeee122ghh/parts/mm"
In short "a1/type/abcd/parts" in this part of string the non-bold part is static always.
Honestly, you could do this using regex, and pull apart the string. But you have a specified delimiter, just use String.Split:
bool AreEqualAccordingToMyRules(string input1, string input2)
{
var split1 = input1.Split('/');
var split2 = input2.Split('/');
return split1.Length == split2.Length // strings must have equal number of sections
&& split1[0] == split2[0] // section 1 must match
&& split1[1] == split2[1] // section 2 must match
&& split1[3] == split2[3] // section 4 must match
}
You can try Split (to get parts) and Linq (to exclude 3d one)
using System.Linq;
...
string string1 = "a1/type/xyz/parts";
string string2 = "a1/type/abcd/parts";
bool result = string1
.Split('/') // string1 parts
.Where((v, i) => i != 2) // all except 3d one
.SequenceEqual(string2 // must be equal to
.Split('/') // string2 parts
.Where((v, i) => i != 2)); // except 3d one
Here's a small programm using string functions to compare the parts before and after the middle part:
public class Program
{
public static void Main(string[] args)
{
Console.WriteLine(CutOutMiddle("a1/type/21412ghh/parts"));
Console.WriteLine("True: " + CompareNoMiddle("a1/type/21412ghh/parts", "a1/type/21412ghasdasdh/parts"));
Console.WriteLine("False: " + CompareNoMiddle("a1/type/21412ghh/parts", "a2/type/21412ghh/parts/someval"));
Console.WriteLine("False: " + CompareNoMiddle("a1/type/21412ghh/parts", "a1/type/21412ghasdasdh/parts/someappendix"));
}
private static bool CompareNoMiddle(string s1, string s2)
{
var s1CutOut = CutOutMiddle(s1);
var s2CutOut = CutOutMiddle(s2);
return s1CutOut == s2CutOut;
}
private static string CutOutMiddle(string val)
{
var fistSlash = val.IndexOf('/', 0);
var secondSlash = val.IndexOf('/', fistSlash+1);
var thirdSlash = val.IndexOf('/', secondSlash+1);
var firstPart = val.Substring(0, secondSlash);
var secondPart = val.Substring(thirdSlash, val.Length - thirdSlash);
return firstPart + secondPart;
}
}
returns
a1/type/parts
True: True
False: False
False: False
This solution should cover your case, as said by others, if you have a delimiter use it. In the function below you could change int skip for string ignore or something similar and within the comparison loop if(arrayStringOne[i] == ignore) continue;.
public bool Compare(string valueOne, string valueTwo, int skip) {
var delimiterOccuranceOne = valueOne.Count(f => f == '/');
var delimiterOccuranceTwo = valueTwo.Count(f => f == '/');
if(delimiterOccuranceOne == delimiterOccuranceTwo) {
var arrayStringOne = valueOne.Split('/');
var arrayStringTwo = valueTwo.Split('/');
for(int i=0; i < arrayStringOne.Length; ++i) {
if(i == skip) continue; // or instead of an index you could use a string
if(arrayStringOne[i] != arrayStringTwo[i]) {
return false;
}
}
return true;
}
return false;
}
Compare("a1/type/abcd/parts", "a1/type/xyz/parts", 2);

Check if string contains characters in certain order in C#r

I have a code that's working right now, but it doesn't check if the characters are in order, it only checks if they're there. How can I modify my code so the the characters 'gaoaf' are checked in that order in the string?
Console.WriteLine("5.feladat");
StreamWriter sw = new StreamWriter("keres.txt");
sw.WriteLine("gaoaf");
string s = "";
for (int i = 0; i < n; i++)
{
s = zadatok[i].nev+zadatok[i].cim;
if (s.Contains("g") && s.Contains("a") && s.Contains("o") && s.Contains("a") && s.Contains("f") )
{
sw.WriteLine(i);
sw.WriteLine(zadatok[i].nev + zadatok[i].cim);
}
}
sw.Close();
You can convert the letters into a pattern and use Regex:
var letters = "gaoaf";
var pattern = String.Join(".*",letters.AsEnumerable());
var hasletters = Regex.IsMatch(s, pattern, RegexOptions.IgnoreCase);
For those that needlessly avoid .*, you can also solve this with LINQ:
var ans = letters.Aggregate(0, (p, c) => p >= 0 ? s.IndexOf(c.ToString(), p, StringComparison.InvariantCultureIgnoreCase) : p) != -1;
If it is possible to have repeated adjacent letters, you need to complicate the LINQ solution slightly:
var ans = letters.Aggregate(0, (p, c) => {
if (p >= 0) {
var newp = s.IndexOf(c.ToString(), p, StringComparison.InvariantCultureIgnoreCase);
return newp >= 0 ? newp+1 : newp;
}
else
return p;
}) != -1;
Given the (ugly) machinations required to basically terminate Aggregate early, and given the (ugly and inefficient) syntax required to use an inline anonymous expression call to get rid of the temporary newp, I created some extensions to help, an Aggregate that can terminate early:
public static TAccum AggregateWhile<TAccum, T>(this IEnumerable<T> src, TAccum seed, Func<TAccum, T, TAccum> accumFn, Predicate<TAccum> whileFn) {
using (var e = src.GetEnumerator()) {
if (!e.MoveNext())
throw new Exception("At least one element required by AggregateWhile");
var ans = accumFn(seed, e.Current);
while (whileFn(ans) && e.MoveNext())
ans = accumFn(ans, e.Current);
return ans;
}
}
Now you can solve the problem fairly easily:
var ans2 = letters.AggregateWhile(-1,
(p, c) => s.IndexOf(c.ToString(), p+1, StringComparison.InvariantCultureIgnoreCase),
p => p >= 0
) != -1;
Why not something like this?
static bool CheckInOrder(string source, string charsToCheck)
{
int index = -1;
foreach (var c in charsToCheck)
{
index = source.IndexOf(c, index + 1);
if (index == -1)
return false;
}
return true;
}
Then you can use the function like this:
bool result = CheckInOrder("this is my source string", "gaoaf");
This should work because IndexOf returns -1 if a string isn't found, and it only starts scanning AFTER the previous match.

RegEx.Replace is only replacing first occurrence, need all

I'm having an issue with Regex.Replace in C# as it doesn't seem to be replacing all occurrences of the matched pattern.
private string ReplaceBBCode(string inStr)
{
var outStr = Regex.Replace(inStr, #"\[(b|i|u)\](.*?)\[/\1\]", #"<$1>$2</$1>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
outStr = Regex.Replace(outStr, "(\r|\n)+", "<br />");
return outStr;
}
The input string:
[b]Saint Paul's Food Kitchen[/b] [b] [/b]Saint Paul's food kitchen opens weekly to provide food to those in need.
The result:
<b>Saint Paul's Food Kitchen</b> [b] [/b]Saint Paul's food kitchen opens weekly to provide food to those in need.
I've tested this in regexhero.net and it works exactly as it should there.
EDIT:
Sorry, copied the wrong version of the function. It now shows the correct code, that behaves incorrectly for me.
The output I'm getting is completely different from what you say you're getting, but
The biggest problem I see, is that you probably don't want your regex to be greedy.
try replacing the .* with .*?
No need for Regex:
private static string ReplaceBBCode(string inStr)
{
return inStr.Replace("[b]", "<b>").Replace("[/b]", "</b>")
.Replace("[i]", "<i>").Replace("[/i]", "</i>")
.Replace("[u]", "<u>").Replace("[/u]", "</u>")
.Replace("\r\n", "\n")
.Replace("\n", "<br />");
}
I like this one better:
private static string ReplaceBBCode(string inStr)
{
StringBuilder outStr = new StringBuilder();
bool addBR = false;
for(int i=0; i<inStr.Length; i++){
if (addBR){
outStr.Append("<br />");
addBR = false;
}
if (inStr[i] == '\r' || inStr[i] == '\n'){
if (!addBR)
addBR = true;
}
else {
addBR = false;
if (i+2 < inStr.Length && inStr[i] == '['
&& (inStr[i+1] == 'b' || inStr[i+1] == 'i' || inStr[i+1] == 'u')
&& inStr[i+2] == ']'){
outStr.Append("<").Append(inStr[i+1]).Append(">");
i+=2;
}
else if(i+3 < inStr.Length && inStr[i] == '[' && inStr[i+1] == '/'
&& (inStr[i+2] == 'b' || inStr[i+2] == 'i' || inStr[i+2] == 'u')
&& inStr[i+3] == ']'){
outStr.Append("</").Append(inStr[i+2]).Append(">");
i+=3;
}
else
outStr.Append(inStr[i]);
}
}
return outStr.ToString();
}
This solved the issue, it also handles nested tags. Not sure why, but rebuilding over and over it still was causing errors. Its possible our VS2010 is corrupted and not building properly, or that the framework is corrupted. Not sure what the cause of the problem is, but this solved it:
private string ReplaceBBCode(string inStr)
{
var outStr = inStr;
var bbre = new Regex(#"\[(b|i|u)\](.*?)\[/\1\]", RegexOptions.IgnoreCase | RegexOptions.Multiline);
while( bbre.IsMatch(outStr))
outStr = bbre.Replace(outStr, #"<$1>$2</$1>");
outStr = Regex.Replace(outStr, "(\r|\n)+", "<br />");
return outStr;
}

Improve performance of sorting files by extension

With a given array of file names, the most simpliest way to sort it by file extension is like this:
Array.Sort(fileNames,
(x, y) => Path.GetExtension(x).CompareTo(Path.GetExtension(y)));
The problem is that on very long list (~800k) it takes very long to sort, while sorting by the whole file name is faster for a couple of seconds!
Theoretical, there is a way to optimize it: instead of using Path.GetExtension() and compare the newly created extension-only-strings, we can provide a Comparison than compares the existing filename strings starting from the LastIndexOf('.') without creating new strings.
Now, suppose i found the LastIndexOf('.'), i want to reuse native .NET's StringComparer and apply it only to the part on string after the LastIndexOf('.'), to preserve all culture consideration. Didn't found a way to do that.
Any ideas?
Edit:
With tanascius's idea to use char.CompareTo() method, i came with my Uber-Fast-File-Extension-Comparer, now it sorting by extension 3x times faster! it even faster than all methods that uses Path.GetExtension() in some manner. what do you think?
Edit 2:
I found that this implementation do not considering culture since char.CompareTo() method do not considering culture, so this is not a perfect solution.
Any ideas?
public static int CompareExtensions(string filePath1, string filePath2)
{
if (filePath1 == null && filePath2 == null)
{
return 0;
}
else if (filePath1 == null)
{
return -1;
}
else if (filePath2 == null)
{
return 1;
}
int i = filePath1.LastIndexOf('.');
int j = filePath2.LastIndexOf('.');
if (i == -1)
{
i = filePath1.Length;
}
else
{
i++;
}
if (j == -1)
{
j = filePath2.Length;
}
else
{
j++;
}
for (; i < filePath1.Length && j < filePath2.Length; i++, j++)
{
int compareResults = filePath1[i].CompareTo(filePath2[j]);
if (compareResults != 0)
{
return compareResults;
}
}
if (i >= filePath1.Length && j >= filePath2.Length)
{
return 0;
}
else if (i >= filePath1.Length)
{
return -1;
}
else
{
return 1;
}
}
Create a new array that contains each of the filenames in ext.restofpath format (or some sort of pair/tuple format that can default sort on the extension without further transformation). Sort that, then convert it back.
This is faster because instead of having to retrieve the extension many times for each element (since you're doing something like N log N compares), you only do it once (and then move it back once).
Not the most memory efficient but the fastest according to my tests:
SortedDictionary<string, List<string>> dic = new SortedDictionary<string, List<string>>();
foreach (string fileName in fileNames)
{
string extension = Path.GetExtension(fileName);
List<string> list;
if (!dic.TryGetValue(extension, out list))
{
list = new List<string>();
dic.Add(extension, list);
}
list.Add(fileName);
}
string[] arr = dic.Values.SelectMany(v => v).ToArray();
Did a mini benchmark on 800k randomly generated 8.3 filenames:
Sorting items with Linq to Objects... 00:00:04.4592595
Sorting items with SortedDictionary... 00:00:02.4405325
Sorting items with Array.Sort... 00:00:06.6464205
You can write a comparer that compares each character of the extension. char has a CompareTo(), too (see here).
Basically you loop until you have no more chars left in at least one string or one CompareTo() returns a value != 0.
EDIT: In response to the edits of the OP
The performance of your comparer method can be significantly improved. See the following code. Additionally I added the line
string.Compare( filePath1[i].ToString(), filePath2[j].ToString(),
m_CultureInfo, m_CompareOptions );
to enable the use of CultureInfo and CompareOptions. However this slows down everything compared to a version using a plain char.CompareTo() (about factor 2). But, according to my own SO question this seems to be the way to go.
public sealed class ExtensionComparer : IComparer<string>
{
private readonly CultureInfo m_CultureInfo;
private readonly CompareOptions m_CompareOptions;
public ExtensionComparer() : this( CultureInfo.CurrentUICulture, CompareOptions.None ) {}
public ExtensionComparer( CultureInfo cultureInfo, CompareOptions compareOptions )
{
m_CultureInfo = cultureInfo;
m_CompareOptions = compareOptions;
}
public int Compare( string filePath1, string filePath2 )
{
if( filePath1 == null || filePath2 == null )
{
if( filePath1 != null )
{
return 1;
}
if( filePath2 != null )
{
return -1;
}
return 0;
}
var i = filePath1.LastIndexOf( '.' ) + 1;
var j = filePath2.LastIndexOf( '.' ) + 1;
if( i == 0 || j == 0 )
{
if( i != 0 )
{
return 1;
}
return j != 0 ? -1 : 0;
}
while( true )
{
if( i == filePath1.Length || j == filePath2.Length )
{
if( i != filePath1.Length )
{
return 1;
}
return j != filePath2.Length ? -1 : 0;
}
var compareResults = string.Compare( filePath1[i].ToString(), filePath2[j].ToString(), m_CultureInfo, m_CompareOptions );
//var compareResults = filePath1[i].CompareTo( filePath2[j] );
if( compareResults != 0 )
{
return compareResults;
}
i++;
j++;
}
}
}
Usage:
fileNames1.Sort( new ExtensionComparer( CultureInfo.GetCultureInfo( "sv-SE" ),
CompareOptions.StringSort ) );
the main problem here is that you are calling Path.GetExtension multiple times for each path. if this is doing a quicksort then you could expect Path.GetExtension to be called anywhere from log(n) to n times where n is the number of items in your list for each item in the list. So you are going to want to cache the calls to Path.GetExtension.
if you were using linq i would suggest something like this:
filenames.Select(n => new {name=n, ext=Path.GetExtension(n)})
.OrderBy(t => t.ext).ToArray();
this ensures that Path.GetExtension is only called once for each filename.

Eric Lippert's challenge "comma-quibbling", best answer?

I wanted to bring this challenge to the attention of the stackoverflow community. The original problem and answers are here. BTW, if you did not follow it before, you should try to read Eric's blog, it is pure wisdom.
Summary:
Write a function that takes a non-null IEnumerable and returns a string with the following characteristics:
If the sequence is empty the resulting string is "{}".
If the sequence is a single item "ABC" then the resulting string is "{ABC}".
If the sequence is the two item sequence "ABC", "DEF" then the resulting string is "{ABC and DEF}".
If the sequence has more than two items, say, "ABC", "DEF", "G", "H" then the resulting string is "{ABC, DEF, G and H}". (Note: no Oxford comma!)
As you can see even our very own Jon Skeet (yes, it is well known that he can be in two places at the same time) has posted a solution but his (IMHO) is not the most elegant although probably you can not beat its performance.
What do you think? There are pretty good options there. I really like one of the solutions that involves the select and aggregate methods (from Fernando Nicolet). Linq is very powerful and dedicating some time to challenges like this make you learn a lot. I twisted it a bit so it is a bit more performant and clear (by using Count and avoiding Reverse):
public static string CommaQuibbling(IEnumerable<string> items)
{
int last = items.Count() - 1;
Func<int, string> getSeparator = (i) => i == 0 ? string.Empty : (i == last ? " and " : ", ");
string answer = string.Empty;
return "{" + items.Select((s, i) => new { Index = i, Value = s })
.Aggregate(answer, (s, a) => s + getSeparator(a.Index) + a.Value) + "}";
}
Inefficient, but I think clear.
public static string CommaQuibbling(IEnumerable<string> items)
{
List<String> list = new List<String>(items);
if (list.Count == 0) { return "{}"; }
if (list.Count == 1) { return "{" + list[0] + "}"; }
String[] initial = list.GetRange(0, list.Count - 1).ToArray();
return "{" + String.Join(", ", initial) + " and " + list[list.Count - 1] + "}";
}
If I was maintaining the code, I'd prefer this to more clever versions.
How about this approach? Purely cumulative - no back-tracking, and only iterates once. For raw performance, I'm not sure you'll do better with LINQ etc, regardless of how "pretty" a LINQ answer might be.
using System;
using System.Collections.Generic;
using System.Text;
static class Program
{
public static string CommaQuibbling(IEnumerable<string> items)
{
StringBuilder sb = new StringBuilder('{');
using (var iter = items.GetEnumerator())
{
if (iter.MoveNext())
{ // first item can be appended directly
sb.Append(iter.Current);
if (iter.MoveNext())
{ // more than one; only add each
// term when we know there is another
string lastItem = iter.Current;
while (iter.MoveNext())
{ // middle term; use ", "
sb.Append(", ").Append(lastItem);
lastItem = iter.Current;
}
// add the final term; since we are on at least the
// second term, always use " and "
sb.Append(" and ").Append(lastItem);
}
}
}
return sb.Append('}').ToString();
}
static void Main()
{
Console.WriteLine(CommaQuibbling(new string[] { }));
Console.WriteLine(CommaQuibbling(new string[] { "ABC" }));
Console.WriteLine(CommaQuibbling(new string[] { "ABC", "DEF" }));
Console.WriteLine(CommaQuibbling(new string[] {
"ABC", "DEF", "G", "H" }));
}
}
If I was doing a lot with streams which required first/last information, I'd have thid extension:
[Flags]
public enum StreamPosition
{
First = 1, Last = 2
}
public static IEnumerable<R> MapWithPositions<T, R> (this IEnumerable<T> stream,
Func<StreamPosition, T, R> map)
{
using (var enumerator = stream.GetEnumerator ())
{
if (!enumerator.MoveNext ()) yield break ;
var cur = enumerator.Current ;
var flags = StreamPosition.First ;
while (true)
{
if (!enumerator.MoveNext ()) flags |= StreamPosition.Last ;
yield return map (flags, cur) ;
if ((flags & StreamPosition.Last) != 0) yield break ;
cur = enumerator.Current ;
flags = 0 ;
}
}
}
Then the simplest (not the quickest, that would need a couple more handy extension methods) solution will be:
public static string Quibble (IEnumerable<string> strings)
{
return "{" + String.Join ("", strings.MapWithPositions ((pos, item) => (
(pos & StreamPosition.First) != 0 ? "" :
pos == StreamPosition.Last ? " and " : ", ") + item)) + "}" ;
}
Here as a Python one liner
>>> f=lambda s:"{%s}"%", ".join(s)[::-1].replace(',','dna ',1)[::-1]
>>> f([])
'{}'
>>> f(["ABC"])
'{ABC}'
>>> f(["ABC","DEF"])
'{ABC and DEF}'
>>> f(["ABC","DEF","G","H"])
'{ABC, DEF, G and H}'
This version might be easier to understand
>>> f=lambda s:"{%s}"%" and ".join(s).replace(' and',',',len(s)-2)
>>> f([])
'{}'
>>> f(["ABC"])
'{ABC}'
>>> f(["ABC","DEF"])
'{ABC and DEF}'
>>> f(["ABC","DEF","G","H"])
'{ABC, DEF, G and H}'
Here's a simple F# solution, that only does one forward iteration:
let CommaQuibble items =
let sb = System.Text.StringBuilder("{")
// pp is 2 previous, p is previous
let pp,p = items |> Seq.fold (fun (pp:string option,p) s ->
if pp <> None then
sb.Append(pp.Value).Append(", ") |> ignore
(p, Some(s))) (None,None)
if pp <> None then
sb.Append(pp.Value).Append(" and ") |> ignore
if p <> None then
sb.Append(p.Value) |> ignore
sb.Append("}").ToString()
(EDIT: Turns out this is very similar to Skeet's.)
The test code:
let Test l =
printfn "%s" (CommaQuibble l)
Test []
Test ["ABC"]
Test ["ABC";"DEF"]
Test ["ABC";"DEF";"G"]
Test ["ABC";"DEF";"G";"H"]
Test ["ABC";null;"G";"H"]
I'm a fan of the serial comma: I eat, shoot, and leave.
I continually need a solution to this problem and have solved it in 3 languages (though not C#). I would adapt the following solution (in Lua, does not wrap answer in curly braces) by writing a concat method that works on any IEnumerable:
function commafy(t, andword)
andword = andword or 'and'
local n = #t -- number of elements in the numeration
if n == 1 then
return t[1]
elseif n == 2 then
return concat { t[1], ' ', andword, ' ', t[2] }
else
local last = t[n]
t[n] = andword .. ' ' .. t[n]
local answer = concat(t, ', ')
t[n] = last
return answer
end
end
This isn't brilliantly readable, but it scales well up to tens of millions of strings. I'm developing on an old Pentium 4 workstation and it does 1,000,000 strings of average length 8 in about 350ms.
public static string CreateLippertString(IEnumerable<string> strings)
{
char[] combinedString;
char[] commaSeparator = new char[] { ',', ' ' };
char[] andSeparator = new char[] { ' ', 'A', 'N', 'D', ' ' };
int totalLength = 2; //'{' and '}'
int numEntries = 0;
int currentEntry = 0;
int currentPosition = 0;
int secondToLast;
int last;
int commaLength= commaSeparator.Length;
int andLength = andSeparator.Length;
int cbComma = commaLength * sizeof(char);
int cbAnd = andLength * sizeof(char);
//calculate the sum of the lengths of the strings
foreach (string s in strings)
{
totalLength += s.Length;
++numEntries;
}
//add to the total length the length of the constant characters
if (numEntries >= 2)
totalLength += 5; // " AND "
if (numEntries > 2)
totalLength += (2 * (numEntries - 2)); // ", " between items
//setup some meta-variables to help later
secondToLast = numEntries - 2;
last = numEntries - 1;
//allocate the memory for the combined string
combinedString = new char[totalLength];
//set the first character to {
combinedString[0] = '{';
currentPosition = 1;
if (numEntries > 0)
{
//now copy each string into its place
foreach (string s in strings)
{
Buffer.BlockCopy(s.ToCharArray(), 0, combinedString, currentPosition * sizeof(char), s.Length * sizeof(char));
currentPosition += s.Length;
if (currentEntry == secondToLast)
{
Buffer.BlockCopy(andSeparator, 0, combinedString, currentPosition * sizeof(char), cbAnd);
currentPosition += andLength;
}
else if (currentEntry == last)
{
combinedString[currentPosition] = '}'; //set the last character to '}'
break; //don't bother making that last call to the enumerator
}
else if (currentEntry < secondToLast)
{
Buffer.BlockCopy(commaSeparator, 0, combinedString, currentPosition * sizeof(char), cbComma);
currentPosition += commaLength;
}
++currentEntry;
}
}
else
{
//set the last character to '}'
combinedString[1] = '}';
}
return new string(combinedString);
}
Another variant - separating punctuation and iteration logic for the sake of code clarity. And still thinking about perfomrance.
Works as requested with pure IEnumerable/string/ and strings in the list cannot be null.
public static string Concat(IEnumerable<string> strings)
{
return "{" + strings.reduce("", (acc, prev, cur, next) =>
acc.Append(punctuation(prev, cur, next)).Append(cur)) + "}";
}
private static string punctuation(string prev, string cur, string next)
{
if (null == prev || null == cur)
return "";
if (null == next)
return " and ";
return ", ";
}
private static string reduce(this IEnumerable<string> strings,
string acc, Func<StringBuilder, string, string, string, StringBuilder> func)
{
if (null == strings) return "";
var accumulatorBuilder = new StringBuilder(acc);
string cur = null;
string prev = null;
foreach (var next in strings)
{
func(accumulatorBuilder, prev, cur, next);
prev = cur;
cur = next;
}
func(accumulatorBuilder, prev, cur, null);
return accumulatorBuilder.ToString();
}
F# surely looks much better:
let rec reduce list =
match list with
| [] -> ""
| head::curr::[] -> head + " and " + curr
| head::curr::tail -> head + ", " + curr :: tail |> reduce
| head::[] -> head
let concat list = "{" + (list |> reduce ) + "}"
Disclaimer: I used this as an excuse to play around with new technologies, so my solutions don't really live up to the Eric's original demands for clarity and maintainability.
Naive Enumerator Solution
(I concede that the foreach variant of this is superior, as it doesn't require manually messing about with the enumerator.)
public static string NaiveConcatenate(IEnumerable<string> sequence)
{
StringBuilder sb = new StringBuilder();
sb.Append('{');
IEnumerator<string> enumerator = sequence.GetEnumerator();
if (enumerator.MoveNext())
{
string a = enumerator.Current;
if (!enumerator.MoveNext())
{
sb.Append(a);
}
else
{
string b = enumerator.Current;
while (enumerator.MoveNext())
{
sb.Append(a);
sb.Append(", ");
a = b;
b = enumerator.Current;
}
sb.AppendFormat("{0} and {1}", a, b);
}
}
sb.Append('}');
return sb.ToString();
}
Solution using LINQ
public static string ConcatenateWithLinq(IEnumerable<string> sequence)
{
return (from item in sequence select item)
.Aggregate(
new {sb = new StringBuilder("{"), a = (string) null, b = (string) null},
(s, x) =>
{
if (s.a != null)
{
s.sb.Append(s.a);
s.sb.Append(", ");
}
return new {s.sb, a = s.b, b = x};
},
(s) =>
{
if (s.b != null)
if (s.a != null)
s.sb.AppendFormat("{0} and {1}", s.a, s.b);
else
s.sb.Append(s.b);
s.sb.Append("}");
return s.sb.ToString();
});
}
Solution with TPL
This solution uses a producer-consumer queue to feed the input sequence to the processor, whilst keeping at least two elements buffered in the queue. Once the producer has reached the end of the input sequence, the last two elements can be processed with special treatment.
In hindsight there is no reason to have the consumer operate asynchronously, which would eliminate the need for a concurrent queue, but as I said previously, I was just using this as an excuse to play around with new technologies :-)
public static string ConcatenateWithTpl(IEnumerable<string> sequence)
{
var queue = new ConcurrentQueue<string>();
bool stop = false;
var consumer = Future.Create(
() =>
{
var sb = new StringBuilder("{");
while (!stop || queue.Count > 2)
{
string s;
if (queue.Count > 2 && queue.TryDequeue(out s))
sb.AppendFormat("{0}, ", s);
}
return sb;
});
// Producer
foreach (var item in sequence)
queue.Enqueue(item);
stop = true;
StringBuilder result = consumer.Value;
string a;
string b;
if (queue.TryDequeue(out a))
if (queue.TryDequeue(out b))
result.AppendFormat("{0} and {1}", a, b);
else
result.Append(a);
result.Append("}");
return result.ToString();
}
Unit tests elided for brevity.
Late entry:
public static string CommaQuibbling(IEnumerable<string> items)
{
string[] parts = items.ToArray();
StringBuilder result = new StringBuilder('{');
for (int i = 0; i < parts.Length; i++)
{
if (i > 0)
result.Append(i == parts.Length - 1 ? " and " : ", ");
result.Append(parts[i]);
}
return result.Append('}').ToString();
}
public static string CommaQuibbling(IEnumerable<string> items)
{
int count = items.Count();
string answer = string.Empty;
return "{" +
(count==0) ? "" :
( items[0] +
(count == 1 ? "" :
items.Range(1,count-1).
Aggregate(answer, (s,a)=> s += ", " + a) +
items.Range(count-1,1).
Aggregate(answer, (s,a)=> s += " AND " + a) ))+ "}";
}
It is implemented as,
if count == 0 , then return empty,
if count == 1 , then return only element,
if count > 1 , then take two ranges,
first 2nd element to 2nd last element
last element
Here's mine, but I realize it's pretty much like Marc's, some minor differences in the order of things, and I added unit-tests as well.
using System;
using NUnit.Framework;
using NUnit.Framework.Extensions;
using System.Collections.Generic;
using System.Text;
using NUnit.Framework.SyntaxHelpers;
namespace StringChallengeProject
{
[TestFixture]
public class StringChallenge
{
[RowTest]
[Row(new String[] { }, "{}")]
[Row(new[] { "ABC" }, "{ABC}")]
[Row(new[] { "ABC", "DEF" }, "{ABC and DEF}")]
[Row(new[] { "ABC", "DEF", "G", "H" }, "{ABC, DEF, G and H}")]
public void Test(String[] input, String expectedOutput)
{
Assert.That(FormatString(input), Is.EqualTo(expectedOutput));
}
//codesnippet:93458590-3182-11de-8c30-0800200c9a66
public static String FormatString(IEnumerable<String> input)
{
if (input == null)
return "{}";
using (var iterator = input.GetEnumerator())
{
// Guard-clause for empty source
if (!iterator.MoveNext())
return "{}";
// Take care of first value
var output = new StringBuilder();
output.Append('{').Append(iterator.Current);
// Grab next
if (iterator.MoveNext())
{
// Grab the next value, but don't process it
// we don't know whether to use comma or "and"
// until we've grabbed the next after it as well
String nextValue = iterator.Current;
while (iterator.MoveNext())
{
output.Append(", ");
output.Append(nextValue);
nextValue = iterator.Current;
}
output.Append(" and ");
output.Append(nextValue);
}
output.Append('}');
return output.ToString();
}
}
}
}
How about skipping complicated aggregation code and just cleaning up the string after you build it?
public static string CommaQuibbling(IEnumerable<string> items)
{
var aggregate = items.Aggregate<string, StringBuilder>(
new StringBuilder(),
(b,s) => b.AppendFormat(", {0}", s));
var trimmed = Regex.Replace(aggregate.ToString(), "^, ", string.Empty);
return string.Format(
"{{{0}}}",
Regex.Replace(trimmed,
", (?<last>[^,]*)$", #" and ${last}"));
}
UPDATED: This won't work with strings with commas, as pointed out in the comments. I tried some other variations, but without definite rules about what the strings can contain, I'm going to have real problems matching any possible last item with a regular expression, which makes this a nice lesson for me on their limitations.
I quite liked Jon's answer, but that's because it's much like how I approached the problem. Rather than specifically coding in the two variables, I implemented them inside of a FIFO queue.
It's strange because I just assumed that there would be 15 posts that all did exactly the same thing, but it looks like we were the only two to do it that way. Oh, looking at these answers, Marc Gravell's answer is quite close to the approach we used as well, but he's using two 'loops', rather than holding on to values.
But all those answers with LINQ and regex and joining arrays just seem like crazy-talk! :-)
I don't think that using a good old array is a restriction. Here is my version using an array and an extension method:
public static string CommaQuibbling(IEnumerable<string> list)
{
string[] array = list.ToArray();
if (array.Length == 0) return string.Empty.PutCurlyBraces();
if (array.Length == 1) return array[0].PutCurlyBraces();
string allExceptLast = string.Join(", ", array, 0, array.Length - 1);
string theLast = array[array.Length - 1];
return string.Format("{0} and {1}", allExceptLast, theLast)
.PutCurlyBraces();
}
public static string PutCurlyBraces(this string str)
{
return "{" + str + "}";
}
I am using an array because of the string.Join method and because if the possibility of accessing the last element via an index. The extension method is here because of DRY.
I think that the performance penalities come from the list.ToArray() and string.Join calls, but all in one I hope that piece of code is pleasent to read and maintain.
I think Linq provides fairly readable code. This version handles a million "ABC" in .89 seconds:
using System.Collections.Generic;
using System.Linq;
namespace CommaQuibbling
{
internal class Translator
{
public string Translate(IEnumerable<string> items)
{
return "{" + Join(items) + "}";
}
private static string Join(IEnumerable<string> items)
{
var leadingItems = LeadingItemsFrom(items);
var lastItem = LastItemFrom(items);
return JoinLeading(leadingItems) + lastItem;
}
private static IEnumerable<string> LeadingItemsFrom(IEnumerable<string> items)
{
return items.Reverse().Skip(1).Reverse();
}
private static string LastItemFrom(IEnumerable<string> items)
{
return items.LastOrDefault();
}
private static string JoinLeading(IEnumerable<string> items)
{
if (items.Any() == false) return "";
return string.Join(", ", items.ToArray()) + " and ";
}
}
}
You can use a foreach, without LINQ, delegates, closures, lists or arrays, and still have understandable code. Use a bool and a string, like so:
public static string CommaQuibbling(IEnumerable items)
{
StringBuilder sb = new StringBuilder("{");
bool empty = true;
string prev = null;
foreach (string s in items)
{
if (prev!=null)
{
if (!empty) sb.Append(", ");
else empty = false;
sb.Append(prev);
}
prev = s;
}
if (prev!=null)
{
if (!empty) sb.Append(" and ");
sb.Append(prev);
}
return sb.Append('}').ToString();
}
public static string CommaQuibbling(IEnumerable<string> items)
{
var itemArray = items.ToArray();
var commaSeparated = String.Join(", ", itemArray, 0, Math.Max(itemArray.Length - 1, 0));
if (commaSeparated.Length > 0) commaSeparated += " and ";
return "{" + commaSeparated + itemArray.LastOrDefault() + "}";
}
Here's my submission. Modified the signature a bit to make it more generic. Using .NET 4 features (String.Join() using IEnumerable<T>), otherwise works with .NET 3.5. Goal was to use LINQ with drastically simplified logic.
static string CommaQuibbling<T>(IEnumerable<T> items)
{
int count = items.Count();
var quibbled = items.Select((Item, index) => new { Item, Group = (count - index - 2) > 0})
.GroupBy(item => item.Group, item => item.Item)
.Select(g => g.Key
? String.Join(", ", g)
: String.Join(" and ", g));
return "{" + String.Join(", ", quibbled) + "}";
}
There's a couple non-C# answers, and the original post did ask for answers in any language, so I thought I'd show another way to do it that none of the C# programmers seems to have touched upon: a DSL!
(defun quibble-comma (words)
(format nil "~{~#[~;~a~;~a and ~a~:;~#{~a~#[~; and ~:;, ~]~}~]~}" words))
The astute will note that Common Lisp doesn't really have an IEnumerable<T> built-in, and hence FORMAT here will only work on a proper list. But if you made an IEnumerable, you certainly could extend FORMAT to work on that, as well. (Does Clojure have this?)
Also, anyone reading this who has taste (including Lisp programmers!) will probably be offended by the literal "~{~#[~;~a~;~a and ~a~:;~#{~a~#[~; and ~:;, ~]~}~]~}" there. I won't claim that FORMAT implements a good DSL, but I do believe that it is tremendously useful to have some powerful DSL for putting strings together. Regex is a powerful DSL for tearing strings apart, and string.Format is a DSL (kind of) for putting strings together but it's stupidly weak.
I think everybody writes these kind of things all the time. Why the heck isn't there some built-in universal tasteful DSL for this yet? I think the closest we have is "Perl", maybe.
Just for fun, using the new Zip extension method from C# 4.0:
private static string CommaQuibbling(IEnumerable<string> list)
{
IEnumerable<string> separators = GetSeparators(list.Count());
var finalList = list.Zip(separators, (w, s) => w + s);
return string.Concat("{", string.Join(string.Empty, finalList), "}");
}
private static IEnumerable<string> GetSeparators(int itemCount)
{
while (itemCount-- > 2)
yield return ", ";
if (itemCount == 1)
yield return " and ";
yield return string.Empty;
}
return String.Concat(
"{",
input.Length > 2 ?
String.Concat(
String.Join(", ", input.Take(input.Length - 1)),
" and ",
input.Last()) :
String.Join(" and ", input),
"}");
I have tried using foreach. Please let me know your opinions.
private static string CommaQuibble(IEnumerable<string> input)
{
var val = string.Concat(input.Process(
p => p,
p => string.Format(" and {0}", p),
p => string.Format(", {0}", p)));
return string.Format("{{{0}}}", val);
}
public static IEnumerable<T> Process<T>(this IEnumerable<T> input,
Func<T, T> firstItemFunc,
Func<T, T> lastItemFunc,
Func<T, T> otherItemFunc)
{
//break on empty sequence
if (!input.Any()) yield break;
//return first elem
var first = input.First();
yield return firstItemFunc(first);
//break if there was only one elem
var rest = input.Skip(1);
if (!rest.Any()) yield break;
//start looping the rest of the elements
T prevItem = first;
bool isFirstIteration = true;
foreach (var item in rest)
{
if (isFirstIteration) isFirstIteration = false;
else
{
yield return otherItemFunc(prevItem);
}
prevItem = item;
}
//last element
yield return lastItemFunc(prevItem);
}
Here are a couple of solutions and testing code written in Perl based on the replies at http://blogs.perl.org/users/brian_d_foy/2013/10/comma-quibbling-in-perl.html.
#!/usr/bin/perl
use 5.14.0;
use warnings;
use strict;
use Test::More qw{no_plan};
sub comma_quibbling1 {
my (#words) = #_;
return "" unless #words;
return $words[0] if #words == 1;
return join(", ", #words[0 .. $#words - 1]) . " and $words[-1]";
}
sub comma_quibbling2 {
return "" unless #_;
my $last = pop #_;
return $last unless #_;
return join(", ", #_) . " and $last";
}
is comma_quibbling1(qw{}), "", "1-0";
is comma_quibbling1(qw{one}), "one", "1-1";
is comma_quibbling1(qw{one two}), "one and two", "1-2";
is comma_quibbling1(qw{one two three}), "one, two and three", "1-3";
is comma_quibbling1(qw{one two three four}), "one, two, three and four", "1-4";
is comma_quibbling2(qw{}), "", "2-0";
is comma_quibbling2(qw{one}), "one", "2-1";
is comma_quibbling2(qw{one two}), "one and two", "2-2";
is comma_quibbling2(qw{one two three}), "one, two and three", "2-3";
is comma_quibbling2(qw{one two three four}), "one, two, three and four", "2-4";
It hasn't quite been a decade since the last post so here's my variation:
public static string CommaQuibbling(IEnumerable<string> items)
{
var text = new StringBuilder();
string sep = null;
int last_pos = items.Count();
int next_pos = 1;
foreach(string item in items)
{
text.Append($"{sep}{item}");
sep = ++next_pos < last_pos ? ", " : " and ";
}
return $"{{{text}}}";
}
In one statement:
public static string CommaQuibbling(IEnumerable<string> inputList)
{
return
String.Concat("{",
String.Join(null, inputList
.Select((iw, i) =>
(i == (inputList.Count() - 1)) ? $"{iw}" :
(i == (inputList.Count() - 2) ? $"{iw} and " : $"{iw}, "))
.ToArray()), "}");
}
In .NET Core we can leverage SkipLast and TakeLast.
public static string CommaQuibblify(IEnumerable<string> items)
{
var head = string.Join(", ", items.SkipLast(2).Append(""));
var tail = string.Join(" and ", items.TakeLast(2));
return '{' + head + tail + '}';
}
https://dotnetfiddle.net/X58qvZ

Categories