C#: Remove Excess Text From String - c#

Okay, so after looking around here on SO, I have found a solution that meets about 95% of my requirement, although I believe it may need to be redone at this point.
ISSUE
Say I have a value range supplied as "1000 - 1009 ABC1 ABC SOMETHING ELSE" where I just need the 1000 - 1009 part. I need to be able to remove excess characters from the string supplied, even if they truly are accepted characters, but only if they are part of secondary strings with text. (Sorry if that description seems odd, my mind isn't full power today.)
CURRENT SOLUTION
I currently have a simple method utilizing Linq to return only accepted characters, however this will return "1000 - 10091" which is not the range I am needing. I've thought about looping through the strings individual characters and comparing to previous characters as I go using IsDigit and IsLetter to my advantage, but then comes the issue of replacing the unacceptable characters or removing them. I think if I gave it a day or two I could figure it out with a clear mind, but it needs to be done by the end of the day, and I am banging my head against the keyboard.
void RemoveExcessText(ref string val) {
string allowedChars = "0123456789-+>";
val = new string(val.Where(c => allowedChars.Contains(c)).ToArray());
}
// Alternatively?
char previousChar = ' ';
for (int i = 0; i < val.Length; i++) {
if (char.IsLetter(val[i])) {
previousChar = val[i];
val.Remove(i, 1);
} else if (char.IsDigit(val[i])) {
if (char.IsLetter(previousChar)) {
val.Remove(i, 1);
}
}
}
But how do I calculate white space and leave in the +, -, and > charactrers? I am losing my mind on this one today.

Why not use a regular expression?
Regex.Match("1000 - 1009 ABC1 ABC SOMETHING ELSE", #"^(\d+)([\s\-]+)(\d+)");
Should give you what you want
I made a fiddle

You use a regular expression with a capturing group:
Regex r = new Regex("^(?<v>[-0-9 ]+?)");
This means "from the start of the input string (^) match [0 to 9 or space or hyphen] and keep going for as many occurrences of these characters as are available (+?) and store it into variable v (?)"
We get it out like this:
r.Matches(input)[0].Groups["v"].Value
Note though that if the input string doesn't match, the match collection will be 0 long and a call to [0] will crash. To this end you might want to robust it up with some extra error checking:
MatchCollection mc = r.Matches(input);
if(mc.Length > 0)
MessageBox.Show(mc[0].Groups["v"].Value;

You could match this with a regular expression. \d{1,4} means match a decimal digit at least once up to 4 times. Followed by space, hyphen, space, and 1 to 4 digits again, then anything else. Only the part inside parenthesis is output in your results.
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
var pattern = #"(^\d{1,4} - \d{1,4}).*";
string input = ("1000 - 1009 ABC1 ABC SOMETHING ELSE");
string replacement = "$1";
string result = Regex.Replace(input, pattern, replacement);
Console.WriteLine(result);
}
}
https://dotnetfiddle.net/cZGlX4

Related

How to remove multiple, repeating & unnecessary punctuation from string in C#?

Considering strings like this:
"This is a string....!"
"This is another...!!"
"What is this..!?!?"
...
// There are LOTS of examples of weird/angry sentence-endings like the ones above.
I want to replace the unnecessary punctuation at the end to make it look like this:
"This is a string!"
"This is another!"
"What is this?"
What I basically do is:
- split by space
- check if last char in string contains a punctuation
- start replacing with the patterns below
I have tried a very big ".Replace(string, string)" function, but it does not work - there has to be a simpler regex I guess.
Documentation:
Returns a new string in which all occurrences of a specified string in the current instance are replaced with another specified string.
As well as:
Because this method returns the modified string, you can chain together successive calls to the Replace method to perform multiple replacements on the original string.
Anything is wrong here.
EDIT: ALL the proposed solutions work fine! Thank you very much!
This one was the best suited solution for my project:
Regex re = new Regex("[.?!]*(?=[.?!]$)");
string output = re.Replace(input, "");
Your solution works almost fine (demo), the only issue is when the same sequence could be matched starting at different spots. For example, ..!?!? from your last line is not part of the substitution list, so ..!? and !? get replaced by two separate matches, producing ?? in the output.
It looks like your strategy is pretty straightforward: in a chain of multiple punctuation characters the last character wins. You can use regular expressions to do the replacement:
[!?.]*([!?.])
and replace it with $1, i.e. the capturing group that has the last character:
string s;
while ((s = Console.ReadLine()) != null) {
s = Regex.Replace(s, "[!?.]*([!?.])", "$1");
Console.WriteLine(s);
}
Demo
Simply
[.?!]*(?=[.?!]$)
should do it for you. Like
Regex re = new Regex("[.?!]*(?=[.?!]$)");
Console.WriteLine(re.Replace("This is a string....!", ""));
This replaces all punctuations but the last with nothing.
[.?!]* matches any number of consecutive punctuation characters, and the (?=[.?!]$) is a positive lookahead making sure it leaves one at the end of the string.
See it here at ideone.
Or you can do it without regExps:
string TrimPuncMarks(string str)
{
HashSet<char> punctMarks = new HashSet<char>() {'.', '!', '?'};
int i = str.Length - 1;
for (; i >= 0; i--)
{
if (!punctMarks.Contains(str[i]))
break;
}
// the very last punct mark or null if there were no any punct marks in the end
char? suffix = i < str.Length - 1 ? str[str.Length - 1] : (char?)null;
return str.Substring(0, i+1) + suffix;
}
Debug.Assert("What is this?" == TrimPuncMarks("What is this..!?!?"));
Debug.Assert("What is this" == TrimPuncMarks("What is this"));
Debug.Assert("What is this." == TrimPuncMarks("What is this."));

Get sub-strings from a string that are enclosed using some specified character

Suppose I have a string
Likes (20)
I want to fetch the sub-string enclosed in round brackets (in above case its 20) from this string. This sub-string can change dynamically at runtime. It might be any other number from 0 to infinity. To achieve this my idea is to use a for loop that traverses the whole string and then when a ( is present, it starts adding the characters to another character array and when ) is encountered, it stops adding the characters and returns the array. But I think this might have poor performance. I know very little about regular expressions, so is there a regular expression solution available or any function that can do that in an efficient way?
If you don't fancy using regex you could use Split:
string foo = "Likes (20)";
string[] arr = foo.Split(new char[]{ '(', ')' }, StringSplitOptions.None);
string count = arr[1];
Count = 20
This will work fine regardless of the number in the brackets ()
e.g:
Likes (242535345)
Will give:
242535345
Works also with pure string methods:
string result = "Likes (20)";
int index = result.IndexOf('(');
if (index >= 0)
{
result = result.Substring(index + 1); // take part behind (
index = result.IndexOf(')');
if (index >= 0)
result = result.Remove(index); // remove part from )
}
Demo
For a strict matching, you can do:
Regex reg = new Regex(#"^Likes\((\d+)\)$");
Match m = reg.Match(yourstring);
this way you'll have all you need in m.Groups[1].Value.
As suggested from I4V, assuming you have only that sequence of digits in the whole string, as in your example, you can use the simpler version:
var res = Regex.Match(str,#"\d+")
and in this canse, you can get the value you are looking for with res.Value
EDIT
In case the value enclosed in brackets is not just numbers, you can just change the \d with something like [\w\d\s] if you want to allow in there alphabetic characters, digits and spaces.
Even with Linq:
var s = "Likes (20)";
var s1 = new string(s.SkipWhile(x => x != '(').Skip(1).TakeWhile(x => x != ')').ToArray());
const string likes = "Likes (20)";
int likesCount = int.Parse(likes.Substring(likes.IndexOf('(') + 1, (likes.Length - likes.IndexOf(')') + 1 )));
Matching when the part in paranthesis is supposed to be a number;
string inputstring="Likes (20)"
Regex reg=new Regex(#"\((\d+)\)")
string num= reg.Match(inputstring).Groups[1].Value
Explanation:
By definition regexp matches a substring, so unless you indicate otherwise the string you are looking for can occur at any place in your string.
\d stand for digits. It will match any single digit.
We want it to potentially be repeated several times, and we want at least one. The + sign is regexp for previous symbol or group repeated 1 or more times.
So \d+ will match one or more digits. It will match 20.
To insure that we get the number that is in paranteses we say that it should be between ( and ). These are special characters in regexp so we need to escape them.
(\d+) would match (20), and we are almost there.
Since we want the part inside the parantheses, and not including the parantheses we tell regexp that the digits part is a single group.
We do that by using parantheses in our regexp. ((\d+)) will still match (20), but now it will note that 20 is a subgroup of this match and we can fetch it by Match.Groups[].
For any string in parantheses things gets a little bit harder.
Regex reg=new Regex(#"\((.+)\)")
Would work for many strings. (the dot matches any character) But if the input is something like "This is an example(parantesis1)(parantesis2)", you would match (parantesis1)(parantesis2) with parantesis1)(parantesis2 as the captured subgroup. This is unlikely to be what you are after.
The solution can be to do the matching for "any character exept a closing paranthesis"
Regex reg=new Regex(#"\(([^\(]+)\)")
This will find (parantesis1) as the first match, with parantesis1 as .Groups[1].
It will still fail for nested paranthesis, but since regular expressions are not the correct tool for nested paranthesis I feel that this case is a bit out of scope.
If you know that the string always starts with "Likes " before the group then Saves solution is better.

How to find repeatable characters

I can't understand how to solve the following problem:
I have input string "aaaabaa" and I'm trying to search for string "aa" (I'm looking for positions of characters)
Expected result is
0 1 2 5
aa aabaa
a aa abaa
aa aa baa
aaaab aa
This problem is already solved by me using another approach (non-RegEx).
But I need a RegEx I'm new to RegEx so google-search can't help me really.
Any help appreciated! Thanks!
P.S.
I've tried to use (aa)* and "\b(\w+(aa))*\w+" but those expressions are wrong
You can solve this by using a lookahead
a(?=a)
will find every "a" that is followed by another "a".
If you want to do this more generally
(\p{L})(?=\1)
This will find every character that is followed by the same character. Every found letter is stored in a capturing group (because of the brackets around), this capturing group is then reused by the positive lookahead assertion (the (?=...)) by using \1 (in \1 there is the matches character stored)
\p{L} is a unicode code point with the category "letter"
Code
String text = "aaaabaa";
Regex reg = new Regex(#"(\p{L})(?=\1)");
MatchCollection result = reg.Matches(text);
foreach (Match item in result) {
Console.WriteLine(item.Index);
}
Output
0
1
2
5
The following code should work with any regular expression without having to change the actual expression:
Regex rx = new Regex("(a)\1"); // or any other word you're looking for.
int position = 0;
string text = "aaaaabbbbccccaaa";
int textLength = text.Length;
Match m = rx.Match(text, position);
while (m != null && m.Success)
{
Console.WriteLine(m.Index);
if (m.Index <= textLength)
{
m = rx.Match(text, m.Index + 1);
}
else
{
m = null;
}
}
Console.ReadKey();
It uses the option to change the start index of a regex search for each consecutive search. The actual problem comes from the fact that the Regex engine, by default, will always continue searching after the previous match. So it will never find a possible match within another match, unless you instruct it to by using a Look ahead construction or by manually setting the start index.
Another, relatively easy, solution is to just stick the whole expression in a forward look ahead:
string expression = "(a)\1"
Regex rx2 = new Regex("(?=" + expression + ")");
MatchCollection ms = rx2.Matches(text);
var indexes = ms.Cast<Match>().Select(match => match.Index);
That way the engine will automatically advance the index by one for every match it finds.
From the docs:
When a match attempt is repeated by calling the NextMatch method, the regular expression engine gives empty matches special treatment. Usually, NextMatch begins the search for the next match exactly where the previous match left off. However, after an empty match, the NextMatch method advances by one character before trying the next match. This behavior guarantees that the regular expression engine will progress through the string. Otherwise, because an empty match does not result in any forward movement, the next match would start in exactly the same place as the previous match, and it would match the same empty string repeatedly.
Try this:
How can I find repeated characters with a regex in Java?
It is in java, but the regex and non-regex way is there. C# Regex is very similar to the Java way.

.Net Removing all the first 0 of a string

I got the following :
01.05.03
I need to convert that to 1.5.3
The problem is I cannot only trim the 0 because if I got :
01.05.10
I need to convert that to 1.5.10
So, what's the better way to solve that problem ? Regex ? If so, any regex example doing that ?
Expanding on the answer of #FrustratedWithFormsDesigner:
string Strip0s(string s)
{
return string.Join<int>(".", from x in s.Split('.') select int.Parse(x));
}
Regex-replace
(?<=^|\.)0+
with the empty string. The regex is:
(?<= # begin positive look-behind (i.e. "a position preceded by")
^|\. # the start of the string or a literal dot †
) # end positive look-behind
0+ # one or more "0" characters
† note that not all regex flavors support variable-length look-behind, but .NET does.
If you expect this kind of input: "00.03.03" and want to to keep the leading zero in this case (like "0.3.3"), use this expression instead:
(?<=^|\.)0+(?=\d)
and again replace with the empty string.
From the comments (thanks Kobi): There is a more concise expression that does not require look-behind and is equivalent to my second suggestion:
\b0+(?=\d)
which is
\b # a word boundary (a position between a word char and a non-word char)
0+ # one or more "0" characters
(?=\d) # positive look-ahead: a position that's followed by a digit
This works because the 0 happens to be a word character, so word boundaries can be used to find the first 0 in a row. It is a more compatible expression, because many regex flavors do not support variable-length look-behind, and some (like JavaScript) no look-behind at all.
You could split the string on ., then trim the leading 0s on the results of the split, then merge them back together.
I don't know of a way to do this in a single operation, but you could write a function that hides this and makes it look like a single operation. ;)
UPDATE:
I didn't even think of the other guy's regex. Yeah, that will probably do it in a single operation.
Here's another way you could do what FrustratedWithFormsDesigner suggests:
string s = "01.05.10";
string s2 = string.Join(
".",
s.Split('.')
.Select(str => str.TrimStart('0'))
.ToArray()
);
This is almost the same as dtb's answer, but doesn't require that the substrings be valid integers (it would also work with, e.g., "000A.007.0HHIMARK").
UPDATE: If you'd want any strings consisting of all 0s in the input string to be output as a single 0, you could use this:
string s2 = string.Join(
".",
s.Split('.')
.Select(str => TrimLeadingZeros(str))
.ToArray()
);
public static string TrimLeadingZeros(string text) {
int number;
if (int.TryParse(text, out number))
return number.ToString();
else
return text.TrimStart('0');
}
Example input/output:
00.00.000A.007.0HHIMARK // input
0.0.A.7.HHIMARK // output
There's also the old-school way which probably has better performance characteristics than most other solutions mentioned. Something like:
static public string NormalizeVersionString(string versionString)
{
if(versionString == null)
throw new NullArgumentException("versionString");
bool insideNumber = false;
StringBuilder sb = new StringBuilder(versionString.Length);
foreach(char c in versionString)
{
if(c == '.')
{
sb.Append('.');
insideNumber = false;
}
else if(c >= '1' && c <= '9')
{
sb.Append(c);
insideNumber = true;
}
else if(c == '0')
{
if(insideNumber)
sb.Append('0');
}
}
return sb.ToString();
}
string s = "01.05.10";
string newS = s.Replace(".0", ".");
newS = newS.StartsWith("0") ? newS.Substring(1, newS.Length - 1) : newS;
Console.WriteLine(newS);
NOTE: You will have to thoroughly check for possible input combination.
This looks like it is a date format, if so I would use Date processing code
DateTime time = DateTime.Parse("01.02.03");
String newFormat = time.ToString("d.M.yy");
or even better
String newFormat = time.ToShortDateString();
which will respect you and your clients culture setting.
If this data is not a date then don't use this :)
I had a similar requirement to parse a string with street adresses, where some of the house numbers had leading zeroes and I needed to remove them while keeping the rest of the text intact, so I slightly edited the accepted answer to meet my requirements, maybe someone finds it useful. Basically doing the same as accepted answer, with the difference that I am checking if the string part can be parsed as an integer, and defaulting to the string value when false;
string Strip0s(string s)
{
int outputValue;
return
string.Join(" ",
from x in s.Split(new[] { ' ' })
select int.TryParse(x, out outputValue) ? outputValue.ToString() : x);
}
Input: "Islands Brygge 34 B 07 TV"
Output: "Islands Brygge 34 B 7 TV"

Formatting sentences in a string using C#

I have a string with multiple sentences. How do I Capitalize the first letter of first word in every sentence. Something like paragraph formatting in word.
eg ."this is some code. the code is in C#. "
The ouput must be "This is some code. The code is in C#".
one way would be to split the string based on '.' and then capitalize the first letter and then rejoin.
Is there a better solution?
In my opinion, when it comes to potentially complex rules-based string matching and replacing - you can't get much better than a Regex-based solution (despite the fact that they are so hard to read!). This offers the best performance and memory efficiency, in my opinion - you'll be surprised at just how fast this'll be.
I'd use the Regex.Replace overload that accepts an input string, regex pattern and a MatchEvaluator delegate. A MatchEvaluator is a function that accepts a Match object as input and returns a string replacement.
Here's the code:
public static string Capitalise(string input)
{
//now the first character
return Regex.Replace(input, #"(?<=(^|[.;:])\s*)[a-z]",
(match) => { return match.Value.ToUpper(); });
}
The regex uses the (?<=) construct (zero-width positive lookbehind) to restrict captures only to a-z characters preceded by the start of the string, or the punctuation marks you want. In the [.;:] bit you can add the extra ones you want (e.g. [.;:?."] to add ? and " characters.
This means, also, that your MatchEvaluator doesn't have to do any unnecessary string joining (which you want to avoid for performance reasons).
All the other stuff mentioned by one of the other answerers about using the RegexOptions.Compiled is also relevant from a performance point of view. The static Regex.Replace method does offer very similar performance benefits, though (there's just an additional dictionary lookup).
Like I say - I'll be surprised if any of the other non-regex solutions here will work better and be as fast.
EDIT
Have put this solution up against Ahmad's as he quite rightly pointed out that a look-around might be less efficient than doing it his way.
Here's the crude benchmark I did:
public string LowerCaseLipsum
{
get
{
//went to lipsum.com and generated 10 paragraphs of lipsum
//which I then initialised into the backing field with #"[lipsumtext]".ToLower()
return _lowerCaseLipsum;
}
}
[TestMethod]
public void CapitaliseAhmadsWay()
{
List<string> results = new List<string>();
DateTime start = DateTime.Now;
Regex r = new Regex(#"(^|\p{P}\s+)(\w+)", RegexOptions.Compiled);
for (int f = 0; f < 1000; f++)
{
results.Add(r.Replace(LowerCaseLipsum, m => m.Groups[1].Value
+ m.Groups[2].Value.Substring(0, 1).ToUpper()
+ m.Groups[2].Value.Substring(1)));
}
TimeSpan duration = DateTime.Now - start;
Console.WriteLine("Operation took {0} seconds", duration.TotalSeconds);
}
[TestMethod]
public void CapitaliseLookAroundWay()
{
List<string> results = new List<string>();
DateTime start = DateTime.Now;
Regex r = new Regex(#"(?<=(^|[.;:])\s*)[a-z]", RegexOptions.Compiled);
for (int f = 0; f < 1000; f++)
{
results.Add(r.Replace(LowerCaseLipsum, m => m.Value.ToUpper()));
}
TimeSpan duration = DateTime.Now - start;
Console.WriteLine("Operation took {0} seconds", duration.TotalSeconds);
}
In a release build, the my solution was about 12% faster than the Ahmad's (1.48 seconds as opposed to 1.68 seconds).
Interestingly, however, if it was done through the static Regex.Replace method, both were about 80% slower, and my solution was slower than Ahmad's.
Here's a regex solution that uses the punctuation category to avoid having to specify .!?" etc. although you should certainly check if it covers your needs or set them explicitly. Read up on the "P" category under the "Supported Unicode General Categories" section located on the MSDN Character Classes page.
string input = #"this is some code. the code is in C#? it's great! In ""quotes."" after quotes.";
string pattern = #"(^|\p{P}\s+)(\w+)";
// compiled for performance (might want to benchmark it for your loop)
Regex rx = new Regex(pattern, RegexOptions.Compiled);
string result = rx.Replace(input, m => m.Groups[1].Value
+ m.Groups[2].Value.Substring(0, 1).ToUpper()
+ m.Groups[2].Value.Substring(1));
If you decide not to use the \p{P} class you would have to specify the characters yourself, similar to:
string pattern = #"(^|[.?!""]\s+)(\w+)";
EDIT: below is an updated example to demonstrate 3 patterns. The first shows how all punctuations affect casing. The second shows how to pick and choose certain punctuation categories by using class subtraction. It uses all punctuations while removing specific punctuation groups. The third is similar to the 2nd but using different groups.
The MSDN link doesn't spell out what some of the punctuation categories refer to, so here's a breakdown:
P: all punctuations (comprises all of the categories below)
Pc: underscore _
Pd: dash -
Ps: open parenthesis, brackets and braces ( [ {
Pe: closing parenthesis, brackets and braces ) ] }
Pi: initial single/double quotes (MSDN says it "may behave like Ps/Pe depending on usage")
Pf: final single/double quotes (MSDN Pi note applies)
Po: other punctuation such as commas, colons, semi-colons and slashes ,, :, ;, \, /
Carefully compare how the results are affected by these groups. This should grant you a great degree of flexibility. If this doesn't seem desirable then you may use specific characters in a character class as shown earlier.
string input = #"foo ( parens ) bar { braces } foo [ brackets ] bar. single ' quote & "" double "" quote.
dash - test. Connector _ test. Comma, test. Semicolon; test. Colon: test. Slash / test. Slash \ test.";
string[] patterns = {
#"(^|\p{P}\s+)(\w+)", // all punctuation chars
#"(^|[\p{P}-[\p{Pc}\p{Pd}\p{Ps}\p{Pe}]]\s+)(\w+)", // all punctuation chars except Pc/Pd/Ps/Pe
#"(^|[\p{P}-[\p{Po}]]\s+)(\w+)" // all punctuation chars except Po
};
// compiled for performance (might want to benchmark it for your loop)
foreach (string pattern in patterns)
{
Console.WriteLine("*** Current pattern: {0}", pattern);
string result = Regex.Replace(input, pattern,
m => m.Groups[1].Value
+ m.Groups[2].Value.Substring(0, 1).ToUpper()
+ m.Groups[2].Value.Substring(1));
Console.WriteLine(result);
Console.WriteLine();
}
Notice that "Dash" is not capitalized using the last pattern and it's on a new line. One way to make it capitalized is to use the RegexOptions.Multiline option. Try the above snippet with that to see if it meets your desired result.
Also, for the sake of example, I didn't use RegexOptions.Compiled in the above loop. To use both options OR them together: RegexOptions.Compiled | RegexOptions.Multiline.
You have a few different options:
Your approach of splitting the string, capitalizing and then re-joining
Using regular expressions to perform a replace of the expressions (which can be a bit tricky for case)
Write a C# iterator that iterates over each character and yields a new IEnumerable<char> with the first letter after a period in upper case. May offer benefit of a streaming solution.
Loop over each char and upper-case those that appear immediately after a period (whitespace ignored) - a StringBuffer may make this easier.
The code below uses an iterator:
public static string ToSentenceCase( string someString )
{
var sb = new StringBuilder( someString.Length );
bool wasPeriodLastSeen = true; // We want first letter to be capitalized
foreach( var c in someString )
{
if( wasPeriodLastSeen && !c.IsWhiteSpace )
{
sb.Append( c.ToUpper() );
wasPeriodLastSeen = false;
}
else
{
if( c == '.' ) // you may want to expand this to other punctuation
wasPeriodLastSeen = true;
sb.Append( c );
}
}
return sb.ToString();
}
I don't know why, but I decided to give yield return a try, based on what LBushkin had suggested. Just for fun.
static IEnumerable<char> CapitalLetters(string sentence)
{
//capitalize first letter
bool capitalize = true;
char lastLetter;
for (int i = 0; i < sentence.Length; i++)
{
lastLetter = sentence[i];
yield return (capitalize) ? Char.ToUpper(sentence[i]) : sentence[i];
if (Char.IsWhiteSpace(lastLetter) && capitalize == true)
continue;
capitalize = false;
if (lastLetter == '.' || lastLetter == '!') //etc
capitalize = true;
}
}
To use it:
string sentence = new String(CapitalLetters("this is some code. the code is in C#.").ToArray());
Do your work in a StringBuffer.
Lowercase the whole thing.
Loop through and uppercase leading chars.
Call ToString.

Categories