Removing numbers from text using C#

I have a text file to process that contains some numbers. I want only the text in it and nothing else. I managed to remove the punctuation marks, but how do I remove the numbers? I want to do this in C#.
Also, I want to remove words longer than 10 characters. How do I do that using regular expressions?

You can do this with a regex:
string withNumbers = "text 123 with 456 numbers"; // any string with numbers
string withoutNumbers = Regex.Replace(withNumbers, "[0-9]", "");
Use this regex to remove words with more than 10 characters:
\b\w{11,}\b
The {11,} quantifier sets only a minimum length, so no maximum is needed. Do not put a space inside the braces: {10, 100} is not parsed as a quantifier.
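A minimal sketch that combines both steps from the question, assuming C# 9 or later for top-level statements. CleanText and its maxWordLength parameter are hypothetical names introduced here, and the whitespace clean-up at the end is an addition that the question did not ask for.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the original answer): strips digits,
// drops words longer than maxWordLength, then tidies the whitespace.
string CleanText(string text, int maxWordLength = 10)
{
    // Remove every ASCII digit.
    string noDigits = Regex.Replace(text, "[0-9]", "");

    // {n,} means "n or more": match whole words longer than maxWordLength.
    string longWord = @"\b\w{" + (maxWordLength + 1) + @",}\b";
    string noLongWords = Regex.Replace(noDigits, longWord, "");

    // Collapse the gaps left behind by the removed tokens.
    return Regex.Replace(noLongWords, @"\s+", " ").Trim();
}

Console.WriteLine(CleanText("call 911 for extraordinarily big fires"));
// prints "call for big fires"
```

Digits are removed before the length check, so a token such as abc12345678 is judged by its remaining letters only.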

To check that the input contains only letters and nothing else (because I see you also want to remove the punctuation marks):
Regex.IsMatch(input, @"^[a-zA-Z]+$");

You can also use string.Join:
string s = "asdasdad34534t3sdf43534";
s = string.Join(null, System.Text.RegularExpressions.Regex.Split(s, "[\\d]"));

The Regex.Replace method should do the trick.
// regex to match any digit
var regex = new Regex(@"\d");
// replace all matches in input with empty string
var output = regex.Replace(input, String.Empty);

Related

Regex only letters except set of numbers

I'm using Replace(@"[^a-zA-Z]+", "");
to leave only letters, but there are a few sequences of digits that I want to keep as well, e.g. 122456 and 112466. I'm having trouble keeping digits only when they form one of these sequences.
Example input:
abc 1239 asm122456000
Desired output:
abcasm122456
I tried this: ([^a-zA-Z])+|(?!122456)
My answer doesn't use Replace(), but achieves a similar result:
(?:[a-zA-Z]+|\d{6})
This matches, in a non-capturing group, either a run of alphabetic characters or exactly six digits.
Regex 101 & Test Result
Join all the matching values into a single string.
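using System.Linq;
using System.Text.RegularExpressions;
Regex regex = new Regex("(?:[a-zA-Z]+|\\d{6})");
string input = "abc 1239 asm122456000";
// Cast<Match>() keeps this working on older frameworks where MatchCollection is not generic
var matches = regex.Matches(input).Cast<Match>();
// Joining zero matches yields an empty string, so no count check is needed
string output = String.Join("", matches.Select(x => x.Value));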
Sample .NET Fiddle
Alternatively, use .Split() and .All(). Note that this only drops tokens made up entirely of digits, so the trailing 000 is kept:
string input = "abc 1239 asm122456000";
string output = string.Join("", input.Split().Where(x => !x.All(char.IsDigit)));
.NET Fiddle
It is very simple: you need to match and capture what you need to keep, and just match what you need to remove, and then utilize a backreference to the captured group value in the replacement pattern to put it back into the resulting string.
Here is the regex:
(122456|112466)|[^a-zA-Z]
See the regex demo. Details:
(122456|112466) - Capturing group with ID 1: either of the two alternatives
| - or
[^a-zA-Z] - a char other than an ASCII letter (use \P{L} if you need to match any char other than any Unicode letter).
Note the removed + quantifier as [^A-Za-z] also matches digits.
You need to use $1 in the replacement:
var result = Regex.Replace(text, @"(122456|112466)|[^a-zA-Z]", "$1");
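The capture-and-restore replacement described above can be tried on the input from the question. This is a minimal sketch assuming C# 9 or later for top-level statements; KeepLettersAndSequences is a hypothetical helper name.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: keeps ASCII letters plus the two listed digit sequences.
string KeepLettersAndSequences(string text)
{
    // Group 1 captures a sequence to keep. The second alternative matches any
    // single non-letter; group 1 is empty then, so "$1" replaces it with nothing.
    return Regex.Replace(text, @"(122456|112466)|[^a-zA-Z]", "$1");
}

Console.WriteLine(KeepLettersAndSequences("abc 1239 asm122456000"));
// prints "abcasm122456"
```

The order of the alternatives matters: the sequences come first so that they are tried before the single-character alternative can consume their first digit.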

Regex string should contain double quotes and comma-separated numbers only

I have the following string:
"56565665,5656565,5656556"
I want to check that the string contains only double quotes, commas and numbers.
For that I tried the regex ^"\d+\, but it only matches the first number.
I am completely new to regex.
You can use ^"[\d,]+"$
See it here on regex101
In C# it would look like this due to escaping chars
using System.Text.RegularExpressions;
...
Console.WriteLine(Regex.IsMatch(@"""56565665,5656565,5656556""", @"^""[\d,]+""$"));
Console.WriteLine(Regex.IsMatch(@"""56565665,5656565;5656556""", @"^""[\d,]+""$"));
Update due to question in comments about how to use it with a variable:
var str = @"""56565665,5656565,5656556""";
// var str = "\"56565665,5656565,5656556\""; <- Alternative way of escaping "
Console.WriteLine(Regex.IsMatch(str, @"^""[\d,]+""$"));
Try this; I hope it works:
"^[0-9,]+$"
You can try the pattern below:
^"[0-9]+(,[0-9]+)*"$
or even
\A"[0-9]+(,[0-9]+)*"\z // Wiktor Stribiżew's idea, see his comment below
E.g.
string source = @"""123,456,789""";
string pattern = @"\A""[0-9]+(,[0-9]+)*""\z";
bool result = Regex.IsMatch(source, pattern);
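As a rough sketch, the same pattern can be wrapped in a helper and run over several inputs at once. It assumes C# 9 or later for top-level statements; IsQuotedNumberList and the samples array are illustrative names, not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: true when the whole string is a double-quoted,
// comma-separated list of non-empty digit runs.
bool IsQuotedNumberList(string source) =>
    Regex.IsMatch(source, @"\A""[0-9]+(,[0-9]+)*""\z");

string[] samples =
{
    "\"123\"", "\"123,456\"", "\"1,2,3,4\"",
    "\",\"", "\",123\"", "\"123,\"", "\"123,,456\""
};

// Print each sample next to the validation result.
foreach (string sample in samples)
    Console.WriteLine($"{sample} - {IsQuotedNumberList(sample)}");
```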
Tests:
"123" - true // just a number
"123,456" - true // two numbers separated by comma
"1,2,3,4" - true // four numbers separated by comma
"," - false // just a comma, no numbers
",123" - false // leading comma
"123," - false // trailing comma
"123,,456" - false // double comma

Get a special substring in C#

I need to extract a substring from an existing string. The string starts with uninteresting characters (including commas, spaces and numbers) and ends with ", 123," or ", 57," or something similar, where the number can change. I only need the number.
Thanks
public static void Main(string[] args)
{
string input = "This is 2 much junk, 123,";
var match = Regex.Match(input, @"(\d+),$"); // Ends with at least one digit
// followed by comma,
// grab the digits.
if(match.Success)
Console.WriteLine(match.Groups[1]); // Prints '123'
}
Regex to match numbers: Regex regex = new Regex(@"\d+");
Source (slightly modified): Regex for numbers only
I think this is what you're looking for:
Remove all non numeric characters from a string using Regex
using System.Text.RegularExpressions;
...
string newString = Regex.Replace(oldString, "[^.0-9]", "");
(If you don't want to allow the decimal delimiter in the final result, remove the . from the regular expression above).
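Try something like this (TakeWhile stops at the first non-numeric character, so it only collects digits at the start of the string):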
String numbers = new String(yourString.TakeWhile(x => char.IsNumber(x)).ToArray());
You can use \d+ to match all runs of digits within a given string.
So your code would be
var lst = Regex.Matches(inp, @"\d+")
.Cast<Match>()
.Select(x => x.Value);
lst now contains all the numbers.
But if your input always has the format given in your question, you don't need a regex:
int start = input.LastIndexOf(", ") + 2; // first character after the last ", "
string number = input.Substring(start, input.LastIndexOf(",") - start);
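A minimal sketch of the regex approach wrapped in a reusable helper, assuming C# 9 or later for top-level statements and input that ends with the digits followed by a comma. TryExtractTrailingNumber is a hypothetical name chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: finds the digits that sit directly before the final comma.
bool TryExtractTrailingNumber(string input, out string number)
{
    // (\d+) captures one or more digits; ",$" requires a comma at the end of the input.
    Match match = Regex.Match(input, @"(\d+),$");
    number = match.Success ? match.Groups[1].Value : string.Empty;
    return match.Success;
}

if (TryExtractTrailingNumber("This is 2 much junk, 123,", out string digits))
    Console.WriteLine(digits); // prints "123"
```

The earlier 2 in the input is skipped because it is not followed by a comma at the end of the string.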

C#: Why can't I split the string?

string myNumber = "3.44";
Regex regex1 = new Regex(".");
string[] substrings = regex1.Split(myNumber);
foreach (var substring in substrings)
{
Console.WriteLine("The string is : {0} and the length is {1}",substring, substring.Length);
}
Console.ReadLine();
I tried to split the string by ".", but the split returns 4 empty strings. Why?
. means "any character" in regular expressions. So don't split using a regex - split using String.Split:
string[] substrings = myNumber.Split('.');
If you really want to use a regex, you could use:
Regex regex1 = new Regex(@"\.");
The @ makes it a verbatim string literal, to stop you from having to escape the backslash. The backslash within the string itself is an escape for the dot within the regex parser.
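The options can be compared side by side in a short sketch, assuming C# 9 or later for top-level statements. The Regex.Escape variant is an addition not mentioned above; it is useful when the separator comes from a variable.

```csharp
using System;
using System.Text.RegularExpressions;

string myNumber = "3.44";

// Plain string split on the literal character.
string[] byChar = myNumber.Split('.');

// Regex split with the dot escaped so that it matches only a literal period.
string[] byRegex = Regex.Split(myNumber, @"\.");

// Regex.Escape produces the escaped pattern for an arbitrary separator.
string[] byEscaped = Regex.Split(myNumber, Regex.Escape("."));

Console.WriteLine(string.Join(" | ", byChar));    // prints "3 | 44"
Console.WriteLine(string.Join(" | ", byRegex));   // prints "3 | 44"
Console.WriteLine(string.Join(" | ", byEscaped)); // prints "3 | 44"
```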
The easiest solution would be: string[] val = myNumber.Split('.');
. is a reserved character in regex. If you literally want to match a period, try:
Regex regex1 = new Regex(@"\.");
However, you're better off simply using myNumber.Split('.');
The dot matches a single character, without caring what that character
is. The only exception are newline characters.
Source: http://www.regular-expressions.info/dot.html
Therefore your code splits the string at every character.
Use this instead:
string[] substr = num.Split('.');
Keep it simple and use the String.Split() method:
string[] substrings = myNumber.Split('.');
It has another overload that allows specifying split options:
public string[] Split(
char[] separator,
StringSplitOptions options
)
You don't need a regex; you can do that with the Split method of the string object:
string myNumber = "3.44";
String[] substrings = myNumber.Split('.');
foreach (var substring in substrings)
{
Console.WriteLine("The string is : {0} and the length is {1}",substring, substring.Length);
}
Console.ReadLine();
The period "." is being interpreted as any single character instead of a literal period.
Instead of using regular expressions you could just do:
string[] substrings = myNumber.Split('.');
In Regex patterns, the period character matches any single character. If you want the Regex to match the actual period character, you must escape it in the pattern, like so:
@"\."
Now, this case is somewhat simple for Regex matching; you could instead use String.Split() which will split based on the occurrence of one or more static strings or characters:
string[] substrings = myNumber.Split('.');
try
Regex regex1 = new Regex(@"\.");
EDIT: Er... I guess under a minute after Jon Skeet is not too bad, anyway...
You'll want to place an escape character before the "." - like this "\\."
"." in a regex matches any character, so if you pass 4 characters to a regex with only ".", it will return four empty strings. Check out this page for common operators.
Try
Regex regex1 = new Regex("[.]");

Regex to match alphanumeric and spaces

What am I doing wrong here?
string q = "john s!";
string clean = Regex.Replace(q, @"([^a-zA-Z0-9]|^\s)", string.Empty);
// clean == "johns". I want "john s";
just a FYI
string clean = Regex.Replace(q, @"[^a-zA-Z0-9\s]", string.Empty);
would actually be better like this (note that \w also keeps underscores and non-ASCII letters and digits):
string clean = Regex.Replace(q, @"[^\w\s]", string.Empty);
This:
string clean = Regex.Replace(dirty, "[^a-zA-Z0-9\x20]", String.Empty);
\x20 is ascii hex for 'space' character
you can add more individual characters that you want to be allowed.
If you want for example "?" to be ok in the return string add \x3f.
I got it:
string clean = Regex.Replace(q, @"[^a-zA-Z0-9\s]", string.Empty);
Didn't know you could put \s in the brackets
The following regex is for space inclusion in textbox.
Regex r = new Regex("^[a-zA-Z\\s]+");
r.IsMatch(textbox1.Text);
This works fine for me.
I suspect ^ doesn't work the way you think it does outside of a character class.
What you're telling it to do is replace everything that isn't an alphanumeric with an empty string, OR any leading space. I think what you mean to say is that spaces are ok to not replace - try moving the \s into the [] class.
There appear to be two problems.
You're using the ^ outside a [] which matches the start of the line
You're not using a * or + which means you will only match a single character.
I think you want the following regex: @"([^a-zA-Z0-9\s])+"
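The difference between the two patterns can be seen by running both on the string from the question. This is a minimal sketch assuming C# 9 or later for top-level statements; the variable names original and corrected are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

string q = "john s!";

// Original pattern: "not alphanumeric" OR "whitespace at the start of the string".
// The space in the middle is not alphanumeric, so the first alternative removes it.
string original = Regex.Replace(q, @"([^a-zA-Z0-9]|^\s)", string.Empty);

// With \s inside the negated class, whitespace is excluded from removal.
string corrected = Regex.Replace(q, @"[^a-zA-Z0-9\s]", string.Empty);

Console.WriteLine(original);  // prints "johns"
Console.WriteLine(corrected); // prints "john s"
```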
The regex below keeps spaces and supports letters from different cultures:
string input = "78-selim güzel667.,?";
Regex regex = new Regex(@"[^\w\x20]|[\d]");
var result= regex.Replace(input,"");
//selim güzel
The circumflex inside the square brackets means all characters except the subsequent range. Outside of square brackets, a circumflex anchors the match to the start of the string instead.
This regex will help you to filter if there is at least one alphanumeric character and zero or more special characters i.e. _ (underscore), \s whitespace, -(hyphen)
string comparer = "string you want to compare";
Regex r = new Regex(@"^([a-zA-Z0-9]+[_\s-]*)+$");
if (!r.IsMatch(comparer))
{
return false;
}
return true;
Create a set using [a-zA-Z0-9]+ for alphanumeric characters; the "+" quantifier at the end of the set makes sure that there is at least one alphanumeric character in the comparer string.
Create another set [_\s-]* for special characters; the "*" quantifier allows zero or more special characters in the comparer string.
Pack these sets into a capture group ([a-zA-Z0-9]+[_\s-]*)+ to say that the whole comparer string must consist of one or more repetitions of this sequence.
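A short sketch of how this validation pattern behaves on a few inputs, assuming C# 9 or later for top-level statements. IsValid is a hypothetical wrapper name and the sample strings are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the pattern from the answer.
bool IsValid(string comparer) =>
    Regex.IsMatch(comparer, @"^([a-zA-Z0-9]+[_\s-]*)+$");

Console.WriteLine(IsValid("john s"));   // prints "True"
Console.WriteLine(IsValid("john_s-1")); // prints "True"
Console.WriteLine(IsValid("_john"));    // prints "False" (must start with an alphanumeric)
Console.WriteLine(IsValid("john s!"));  // prints "False" ('!' is not allowed)
```

Because the pattern nests one quantifier inside another, long inputs that fail to match may take noticeably longer to reject, so it is probably best suited to short strings.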
[RegularExpression(@"^[A-Z]+[a-zA-Z""'\s-]*$")]
Above syntax also accepts space
