I want an efficient way of grouping strings whilst keeping duplicates and order.
Something like this
1100110002200 -> 101020
I tried this previously
_case.GroupBy(c => c).Select(g => g.Key)
but I got 102
But this gives me what I want, I just want to optimize it, so I wouldn't have to scour the entire list each time
static List<char> group(string _case)
{
var groups = new List<char>();
for (int i = 0; i < _case.Length; i++)
{
if (groups.LastOrDefault() != _case[i])
groups.Add(_case[i]);
}
return groups;
}
While I like the elegant solution of rshepp, it turns out that the very basic code can run even 5 times faster than that.
public static string Simplify2(string str)
{
if (string.IsNullOrEmpty(str)) { return str; }
StringBuilder sb = new StringBuilder();
char last = str[0];
sb.Append(last);
foreach (char c in str)
{
if (last != c)
{
sb.Append(c);
last = c;
}
}
return sb.ToString();
}
You could create a method that loops each character and checks the previous character for equality. If they aren't the same, append/yield return the character. This is pretty easy to do with Linq.
public static string Simplify(string str)
{
return string.Concat(str.Where((c, i) => i == 0 || c != str[i - 1]));
}
Usage:
string simplified = Simplify("1100110002200");
// 101020
In my testing, my method and yours are roughly equal in speed, mine being insignificantly slower after 10 million executions (4260ms vs 4241ms).
However, my method returns the result as a string whereas yours doesn't. If you need to convert your result back to a string (which is likely) then my method is indeed much faster/more efficient (4260ms vs 6569ms).
Related
I have a follow string example
0 0 1 2.33 4
2.1 2 11 2
There are many ways to convert it to an array, but I need the fastest one, because files can contain 1 billion elements.
string can contain an indefinite number of spaces between numbers
i'am trying
static void Main()
{
string str = "\n\n\n 1 2 3 \r 2322.2 3 4 \n 0 0 ";
byte[] byteArray = Encoding.ASCII.GetBytes(str);
MemoryStream stream = new MemoryStream(byteArray);
var values = ReadNumbers(stream);
}
public static IEnumerable<object> ReadNumbers(Stream st)
{
var buffer = new StringBuilder();
using (var sr = new StreamReader(st))
{
while (!sr.EndOfStream)
{
char digit = (char)sr.Read();
if (!char.IsDigit(digit) && digit != '.')
{
if (buffer.Length == 0) continue;
double ret = double.Parse(buffer.ToString() , culture);
buffer.Clear();
yield return ret;
}
else
{
buffer.Append(digit);
}
}
if (buffer.Length != 0)
{
double ret = double.Parse(buffer.ToString() , culture);
buffer.Clear();
yield return ret;
}
}
}
There are a few things you can do to improve the performance of your code. First, you can use the Split method to split the string into an array of strings, where each element of the array is a number in the string. This will be faster than reading each character of the string one at a time and checking if it is a digit.
Next, you can use double.TryParse to parse each element of the array into a double, rather than using double.Parse and catching any potential exceptions. TryParse will be faster because it does not throw an exception if the string is not a valid double.
Here is an example of how you could implement this:
public static IEnumerable<double> ReadNumbers(string str)
{
string[] parts = str.Split(new[] {' ', '\n', '\r', '\t'}, StringSplitOptions.RemoveEmptyEntries);
foreach (string part in parts)
{
if (double.TryParse(part, NumberStyles.Any, CultureInfo.InvariantCulture, out double value))
{
yield return value;
}
}
}
I'd rather suggest the simpliest solution first and haunt for nano-seconds if there really is a problem with that code.
var doubles = myInput.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Select(x => double.Parse(x, whateverCulture))
Do that for every line in your file, not for the entire file at once, as reading such a huge file at once may crush your memory.
Pretty easy to understand. Afterwards perform a benchmark-test with your data and see if it really affects performance when trying to parse the data. However chances are the actual bottleneck is reading that huge file- which essentially is a IO-thing.
You can improve your solution by trying to avoid creating many objects on the heap. Especially buffer.ToString() is called repeatedly and creates new strings. You can use a ReadOnlySpan<char> struct to slice the string and at the same time avoid heap allocations. A span provides pointers into the original string without making copies of it or parts of it when slicing.
Also do not return the doubles as object, as this will box them. I.e., it will store them on the heap. See: Boxing and Unboxing (C# Programming Guide). If you prefer your solution over mine, use an IEnumerable<double> as return type of your method.
The use of ReadOnlySpans; however, has the disadvantage that it cannot be used in iterator methods. The reason is that a ReadOnlySpan must be allocated on the stack, but an iterator method wraps its state in a class. If you try, you will get the Compiler Error CS4013:
Instance of type cannot be used inside a nested function, query expression, iterator block or async method
Therefore, we must either store the numbers in a collection or consume them in-place. Since I don't know what you want to do with the numbers, I use the former approach:
public static List<double> ReadNumbers(string input)
{
ReadOnlySpan<char> inputSpan = input.AsSpan();
int start = 0;
bool isNumber = false;
var list = new List<double>(); // Improve by passing the expected maximum length.
int i;
for (i = 0; i < inputSpan.Length; i++) {
char c = inputSpan[i];
bool isDigit = Char.IsDigit(c);
if (isDigit && !isNumber) {
start = i;
isNumber = true;
} else if (isNumber && !isDigit && c != '.') {
isNumber = false;
if (Double.TryParse(inputSpan[start..i], CultureInfo.InvariantCulture, out double d)) {
list.Add(d);
}
}
}
if (isNumber) {
if (Double.TryParse(inputSpan[start..i], CultureInfo.InvariantCulture, out double d)) {
list.Add(d);
}
}
return list;
}
inputSpan[start..i] creates a slice as ReadOnlySpan<char>.
Test
string str = "\n\n\n 1 2 3 \r 2322.2 3 4 \n 0 0 ";
foreach (double d in ReadNumbers(str)) {
Console.WriteLine(d);
}
But whenever you are asking for speed, you must run benchmarks to compare the different approaches. Very often what seems a superior solution may fail in the benchmark.
See also: All About Span: Exploring a New .NET Mainstay
If I have two values eg/ABC001 and ABC100 or A0B0C1 and A1B0C0, is there a RegEx I can use to make sure the two values have the same pattern?
Well, here's my shot at it. This doesn't use regular expressions, and assumes s1 and s2 only contain numbers or digits:
public static bool SamePattern(string s1, string s2)
{
if (s1.Length == s2.Length)
{
char[] chars1 = s1.ToCharArray();
char[] chars2 = s2.ToCharArray();
for (int i = 0; i < chars1.Length; i++)
{
if (!Char.IsDigit(chars1[i]) && chars1[i] != chars2[i])
{
return false;
}
else if (Char.IsDigit(chars1[i]) != Char.IsDigit(chars2[i]))
{
return false;
}
}
return true;
}
else
{
return false;
}
}
A description of the algorithm is as follows:
If the strings have different lengths, return false.
Otherwise, check the characters in the same position in both strings:
If they are both digits or both numbers, move on to the next iteration.
If they aren't digits but aren't the same, return false.
If one is a digit and one is a number, return false.
If all characters in both strings were checked successfully, return true.
If you don't know the pattern in advance, but are only going to encounter two groups of characters (alpha and digits), then you could do the following:
Write some C# that parsed the first pattern, looking at each char and determine if it's alpha, or digit, then generate a regex accordingly from that pattern.
You may find that there's no point writing code to generate a regex, as it could be just as simple to check the second string against the first.
Alternatively, without regex:
First check the strings are the same length.
Then loop through both strings at the same time, char by char. If char[x] from string 1 is alpha, and char[x] from string two is the same, you're patterns are matching.
Try this, it should cope if a string sneaks in some symbols. Edited to compare character values ... and use Char.IsLetter and Char.IsDigit
private bool matchPattern(string string1, string string2)
{
bool result = (string1.Length == string2.Length);
char[] chars1 = string1.ToCharArray();
char[] chars2 = string2.ToCharArray();
for (int i = 0; i < string1.Length; i++)
{
if (Char.IsLetter(chars1[i]) != Char.IsLetter(chars2[i]))
{
result = false;
}
if (Char.IsLetter(chars1[i]) && (chars1[i] != chars2[i]))
{
//Characters must be identical
result = false;
}
if (Char.IsDigit(chars1[i]) != Char.IsDigit(chars2[i]))
result = false;
}
return result;
}
Consider using Char.GetUnicodeCategory
You can write a helper class for this task:
public class Mask
{
public Mask(string originalString)
{
OriginalString = originalString;
CharCategories = originalString.Select(Char.GetUnicodeCategory).ToList();
}
public string OriginalString { get; private set; }
public IEnumerable<UnicodeCategory> CharCategories { get; private set; }
public bool HasSameCharCategories(Mask other)
{
//null checks
return CharCategories.SequenceEqual(other.CharCategories);
}
}
Use as
Mask mask1 = new Mask("ab12c3");
Mask mask2 = new Mask("ds124d");
MessageBox.Show(mask1.HasSameCharCategories(mask2).ToString());
I don't know C# syntax but here is a pseudo code:
split the strings on ''
sort the 2 arrays
join each arrays with ''
compare the 2 strings
A general-purpose solution with LINQ can be achieved quite easily. The idea is:
Sort the two strings (reordering the characters).
Compare each sorted string as a character sequence using SequenceEquals.
This scheme enables a short, graceful and configurable solution, for example:
// We will be using this in SequenceEquals
class MyComparer : IEqualityComparer<char>
{
public bool Equals(char x, char y)
{
return x.Equals(y);
}
public int GetHashCode(char obj)
{
return obj.GetHashCode();
}
}
// and then:
var s1 = "ABC0102";
var s2 = "AC201B0";
Func<char, double> orderFunction = char.GetNumericValue;
var comparer = new MyComparer();
var result = s1.OrderBy(orderFunction).SequenceEqual(s2.OrderBy(orderFunction), comparer);
Console.WriteLine("result = " + result);
As you can see, it's all in 3 lines of code (not counting the comparer class). It's also very very easily configurable.
The code as it stands checks if s1 is a permutation of s2.
Do you want to check if s1 has the same number and kind of characters with s2, but not necessarily the same characters (e.g. "ABC" to be equal to "ABB")? No problem, change MyComparer.Equals to return char.GetUnicodeCategory(x).Equals(char.GetUnicodeCategory(y));.
By changing the values of orderFunction and comparer you can configure a multitude of other comparison options.
And finally, since I don't find it very elegant to define a MyComparer class just to enable this scenario, you can also use the technique described in this question:
Wrap a delegate in an IEqualityComparer
to define your comparer as an inline lambda. This would result in a configurable solution contained in 2-3 lines of code.
I saw this question that asks given a string "smith;rodgers;McCalne" how can you produce a collection. The answer to this was to use String.Split.
If we don't have Split() built in what do you do instead?
Update:
I admit writing a split function is fairly easy. Below is what I would've wrote. Loop through the string using IndexOf and extract using Substring.
string s = "smith;rodgers;McCalne";
string seperator = ";";
int currentPosition = 0;
int lastPosition = 0;
List<string> values = new List<string>();
do
{
currentPosition = s.IndexOf(seperator, currentPosition + 1);
if (currentPosition == -1)
currentPosition = s.Length;
values.Add(s.Substring(lastPosition, currentPosition - lastPosition));
lastPosition = currentPosition+1;
} while (currentPosition < s.Length);
I took a peek at SSCLI implementation and its similar to the above except it handles way more use cases and it uses an unsafe method to determine the indexes of the separators before doing the substring extraction.
Others have suggested the following.
An Extension Method that uses an Iterator Block
Regex suggestion (no implementation)
Linq Aggregate method
Is this it?
It's reasonably simple to write your own Split equivalent.
Here's a quick example, although in reality you'd probably want to create some overloads for more flexibility. (Well, in reality you'd just use the framework's built-in Split methods!)
string foo = "smith;rodgers;McCalne";
foreach (string bar in foo.Split2(";"))
{
Console.WriteLine(bar);
}
// ...
public static class StringExtensions
{
public static IEnumerable<string> Split2(this string source, string delim)
{
// argument null checking etc omitted for brevity
int oldIndex = 0, newIndex;
while ((newIndex = source.IndexOf(delim, oldIndex)) != -1)
{
yield return source.Substring(oldIndex, newIndex - oldIndex);
oldIndex = newIndex + delim.Length;
}
yield return source.Substring(oldIndex);
}
}
You make your own loop to do the split. Here is one that uses the Aggregate extension method. Not very efficient, as it uses the += operator on the strings, so it should not really be used as anything but an example, but it works:
string names = "smith;rodgers;McCalne";
List<string> split = names.Aggregate(new string[] { string.Empty }.ToList(), (s, c) => {
if (c == ';') s.Add(string.Empty); else s[s.Count - 1] += c;
return s;
});
Regex?
Or just Substring. This is what Split does internally
A quick brain teaser: given a string
This is a string with repeating spaces
What would be the LINQ expressing to end up with
This is a string with repeating spaces
Thanks!
For reference, here's one non-LINQ way:
private static IEnumerable<char> RemoveRepeatingSpaces(IEnumerable<char> text)
{
bool isSpace = false;
foreach (var c in text)
{
if (isSpace && char.IsWhiteSpace(c)) continue;
isSpace = char.IsWhiteSpace(c);
yield return c;
}
}
This is not a linq type task, use regex
string output = Regex.Replace(input," +"," ");
Of course you could use linq to apply this to a collection of strings.
public static string TrimInternal(this string text)
{
var trimmed = text.Where((c, index) => !char.IsWhiteSpace(c) || (index != 0 && !char.IsWhiteSpace(text[index - 1])));
return new string(trimmed.ToArray());
}
Since nobody seems to have given a satisfactory answer, I came up with one. Here's a string-based solution (.Net 4):
public static string RemoveRepeatedSpaces(this string s)
{
return s[0] + string.Join("",
s.Zip(
s.Skip(1),
(x, y) => x == y && y == ' ' ? (char?)null : y));
}
However, this is just a general case of removing repeated elements from a sequence, so here's the generalized version:
public static IEnumerable<T> RemoveRepeatedElements<T>(
this IEnumerable<T> s, T dup)
{
return s.Take(1).Concat(
s.Zip(
s.Skip(1),
(x, y) => x.Equals(y) && y.Equals(dup) ? (object)null : y)
.OfType<T>());
}
Of course, that's really just a more specific version of a function that removes all consecutive duplicates from its input stream:
public static IEnumerable<T> RemoveRepeatedElements<T>(this IEnumerable<T> s)
{
return s.Take(1).Concat(
s.Zip(
s.Skip(1),
(x, y) => x.Equals(y) ? (object)null : y)
.OfType<T>());
}
And obviously you can implement the first function in terms of the second:
public static string RemoveRepeatedSpaces(this string s)
{
return string.Join("", s.RemoveRepeatedElements(' '));
}
BTW, I benchmarked my last function against the regex version (Regex.Replace(s, " +", " ")) and they were were within nanoseconds of each other, so the extra LINQ overhead is negligible compared to the extra regex overhead. When I generalized it to remove all consecutive duplicate characters, the equivalent regex (Regex.Replace(s, "(.)\\1+", "$1")) was 3.5 times slower than my LINQ version (string.Join("", s.RemoveRepeatedElements())).
I also tried the "ideal" procedural solution:
public static string RemoveRepeatedSpaces(string s)
{
StringBuilder sb = new StringBuilder(s.Length);
char lastChar = '\0';
foreach (char c in s)
if (c != ' ' || lastChar != ' ')
sb.Append(lastChar = c);
return sb.ToString();
}
This is more than 5 times faster than a regex!
In practice, I would probably just use your original solution or regular expressions (if you want a quick & simple solution). A geeky approach that uses lambda functions would be to define a fixed point operator:
T FixPoint<T>(T initial, Func<T, T> f) {
T current = initial;
do {
initial = current;
current = f(initial);
} while (initial != current);
return current;
}
This keeps calling the operation f repeatedly until the operation returns the same value that it got as an argument. You can think of the operation as a generalized loop - it is quite useful, though I guess it is too geeky to be included in .NET BCL. Then you can write:
string res = FixPoint(original, s => s.Replace(" ", " "));
It is not as efficient as your original version, but unless there are too many spaces it should work fine.
Linq is by definition related to enumerable (i.e. collections, list, arrays). You could transorm your string into a collection of char and select the non space one but this is definitevly not a job for Linq.
Paul Creasey's answer is the way to go.
If you want to treat tabs as whitespace as well, go with:
text = Regex.Replace(text, "[ |\t]+", " ");
UPDATE:
The most logical way to solve this problem while satisfying the "using LINQ" requirement has been suggested by both Hasan and Ani. However, notice that these solutions involve accessing a character in a string by index.
The spirit of the LINQ approach is that it can be applied to any enumerable sequence. Because any reasonably efficient solution to this problem requires maintaining some kind of state (with Ani's and Hasan's solutions it's easy to miss this fact as the state is already maintained within the string itself), a generic approach that accepts any sequence of items is likely going to be much more straightforward to implement using procedural code.
This procedural code may then be abstracted into a method that looks like a LINQ-style method, of course. But I would not recommend tackling a problem like this with the attitude of "I want to use LINQ in this solution" from the get-go because it will impose very awkward restriction on your code.
For what it's worth, here's how I'd implement the general idea.
public static IEnumerable<T> StripConsecutives<T>(this IEnumerable<T> source, T value, IEqualityComparer<T> comparer)
{
// null-checking omitted for brevity
using (var enumerator = source.GetEnumerator())
{
if (enumerator.MoveNext())
{
yield return enumerator.Current;
}
else
{
yield break;
}
T prev = enumerator.Current;
while (enumerator.MoveNext())
{
T current = enumerator.Current;
if (comparer.Equals(prev, value) && comparer.Equals(current, value))
{
// This is a consecutive occurrence of value --
// moving on...
}
else
{
yield return current;
}
prev = current;
}
}
}
Split to list, filter, then rejoin, 2 lines of code...
var test = " Alpha Beta Tango ";
var l = test.Split(' ').Where(s => !string.IsNullOrEmpty(s));
var result = string.Join(" ", l);
// result = "Alpha Beta Tango"
Refactoring as an extension method:
using Extensions;
void Main()
{
var test = " Alpha Beta Tango ";
var result = test.RemoveRepeatedSpaces();
// result = "Alpha Beta Tango";
}
static class Extentions
{
public static string RemoveRepeatedSpaces(this string s)
{
if (s == null)
return string.Empty;
var l = s.Split(' ').Where(a => !string.IsNullOrEmpty(a));
return string.Join(" ", l);
}
}
What's the cleanest/best way in C# to convert something like 400AMP or 6M to an integer? I won't always know what the suffix is, and I just want whatever it is to go away and leave me with the number.
You could use a regular expression:
Regex reg = new Regex("[0-9]*");
int result = Convert.ToInt32(reg.Match(input));
Okay, here's a long-winded solution which should be reasonably fast. It's similar to Guffa's middle answer, but I've put the conditions inside the body of the loop as I think that's simpler (and allows us to fetch the character just once). It's a matter of personal taste really.
It deliberately doesn't limit the number of digits that it matches, because if the string is an integer which overflows Int32, I think I'd rather see an exception than just a large integer :)
Note that this also handles negative numbers, which I don't think any of the other solutions so far do...
using System;
class Test
{
static void Main()
{
Console.WriteLine(ParseLeadingInt32("-1234AMP"));
Console.WriteLine(ParseLeadingInt32("+1234AMP"));
Console.WriteLine(ParseLeadingInt32("1234AMP"));
Console.WriteLine(ParseLeadingInt32("-1234"));
Console.WriteLine(ParseLeadingInt32("+1234"));
Console.WriteLine(ParseLeadingInt32("1234"));
}
static int ParseLeadingInt32(string text)
{
// Declared before loop because we need the
// final value
int i;
for (i=0; i < text.Length; i++)
{
char c = text[i];
if (i==0 && (c=='-' || c=='+'))
{
continue;
}
if (char.IsDigit(c))
{
continue;
}
break;
}
return int.Parse(text.Substring(0, i));
}
}
It's possibly not the cleanest method, but it's reasonably simple (a one liner) and I would imagine faster than a regex (uncompiled, for sure).
var str = "400AMP";
var num = Convert.ToInt32(str.Substring(0, str.ToCharArray().TakeWhile(
c => char.IsDigit(c)).Count()));
Or as an extension method:
public static int GetInteger(this string value)
{
return Convert.ToInt32(str.Substring(0, str.ToCharArray().TakeWhile(
c => char.IsDigit(c)).Count()));
}
Equivalently, you could construct the numeric string from the result of the TakeWhile function, as such:
public static int GetInteger(this string value)
{
return new string(str.ToCharArray().TakeWhile(
c => char.IsNumber(c)).ToArray());
}
Haven't benchmarked them, so I wouldn't know which is quicker (though I'd very much suspect the first). If you wanted to get better performance, you would just convert the LINQ (extension method calls on enumerables) to a for loop.
Hope that helps.
There are several options...
Like using a regular expression:
int result = int.Parse(Regex.Match(input, #"^\d+").Groups[0].Value);
Among the fastest; simply looping to find digits:
int i = 0;
while (i < input.Length && Char.IsDigit(input, i)) i++;
int result = int.Parse(input.Substring(0, i));
Use LastIndexOfAny to find the last digit:
int i = input.LastIndexOfAny("0123456789".ToCharArray()) + 1;
int result = int.Parse(input.Substring(0, i));
(Note: breaks with strings that has digits after the suffix, like "123asdf123".)
Probably fastest; parse it yourself:
int i = 0;
int result = 0;
while (i < input.Length) {
char c = input[i];
if (!Char.IsDigit(c)) break;
result *= 10;
result += c - '0';
i++;
}
If all you want to do is remove an unknown postfix from what would otherwise be an int, here is how I would do it:
I like a utility static method I call IsInt(string possibleInt) which will, as the name implies, return True if the string will parse into an int. You could write this same static method into your utility class (if it's not there already) and try:
`string foo = "12345SomePostFix";
while (!Tools.ToolBox.IsInt(foo))
{
foo = foo.Remove(foo.Length - 1);
}
int fooInt = int.Parse(foo);`