Can someone help me out with some regex?

Can someone help me out with some regex? - c#

I'm trying to get an int[] from the following string. I was initially doing a Regex().Split() but as you can see there isn't a definite way to split this string. I could split on [a-z] but then I have to remove the commas afterwards. It's late and I can't think of a nice way to do this.
"21,false,false,25,false,false,27,false,false,false,false,false,false,false,false"
Any suggestions?

Split on the following:
[^\d]+
You may get an empty element at the beginning if your string doesn't begin with a number and/or an empty element at the end if your string doesn't end with a number. If this happens you can trim those elements. Better yet, if Regex().Split() supports some kind of flag to not return empty strings, use that.

A simple solution here is to match the string with the pattern \d+.
Sometimes it is easier to split by the values you don't want to match, but here it serves as an anti-pattern.:
MatchCollection numbers = Regex.Matches(s, #"\d+");
Another solution is to use regular String.Split and some LINQ magic. First we split by commas, and then check that all letters in a given token are digits:
var numbers = s.Split(',').Where(word => word.All(Char.IsDigit));
It is an commonly ignored fact the \d in .Net matches all Unicode digits, and not just [0-9] (try ١٢٣ in your array, just for fun). A more robust solution is to try to parse each token according to your defined culture, and return the valid numbers. This can easily be adapted to support decimals, exponents or ethnic numeric formats:
static IEnumerable<int> GetIntegers(string s)
{
CultureInfo culture = CultureInfo.InvariantCulture;
string[] tokens = s.Split(',');
foreach (string token in tokens)
{
int number;
if (Int32.TryParse(token, NumberStyles.Integer, culture, out number))
yield return number;
}
}

Related

Can anyone improve this 'list of IP addresses' regex? [duplicate]

What is the regular expression to validate a comma delimited list like this one:
12365, 45236, 458, 1, 99996332, ......

I suggest you to do in the following way:
(\d+)(,\s*\d+)*
which would work for a list containing 1 or more elements.

This regex extracts an element from a comma separated list, regardless of contents:
(.+?)(?:,|$)
If you just replace the comma with something else, it should work for any delimiter.

It depends a bit on your exact requirements. I'm assuming: all numbers, any length, numbers cannot have leading zeros nor contain commas or decimal points. individual numbers always separated by a comma then a space, and the last number does NOT have a comma and space after it. Any of these being wrong would simplify the solution.
([1-9][0-9]*,[ ])*[1-9][0-9]*
Here's how I built that mentally:
[0-9] any digit.
[1-9][0-9]* leading non-zero digit followed by any number of digits
[1-9][0-9]*, as above, followed by a comma
[1-9][0-9]*[ ] as above, followed by a space
([1-9][0-9]*[ ])* as above, repeated 0 or more times
([1-9][0-9]*[ ])*[1-9][0-9]* as above, with a final number that doesn't have a comma.

Match duplicate comma-delimited items:
(?<=,|^)([^,]*)(,\1)+(?=,|$)
Reference.
This regex can be used to split the values of a comma delimitted list. List elements may be quoted, unquoted or empty. Commas inside a pair of quotation marks are not matched.
,(?!(?<=(?:^|,)\s*"(?:[^"]|""|\\")*,)(?:[^"]|""|\\")*"\s*(?:,|$))
Reference.

/^\d+(?:, ?\d+)*$/

i used this for a list of items that had to be alphanumeric without underscores at the front of each item.
^(([0-9a-zA-Z][0-9a-zA-Z_]*)([,][0-9a-zA-Z][0-9a-zA-Z_]*)*)$

You might want to specify language just to be safe, but
(\d+, ?)+(\d+)?
ought to work

I had a slightly different requirement, to parse an encoded dictionary/hashtable with escaped commas, like this:
"1=This is something, 2=This is something,,with an escaped comma, 3=This is something else"
I think this is an elegant solution, with a trick that avoids a lot of regex complexity:
if (string.IsNullOrEmpty(encodedValues))
{
return null;
}
else
{
var retVal = new Dictionary<int, string>();
var reFields = new Regex(#"([0-9]+)\=(([A-Za-z0-9\s]|(,,))+),");
foreach (Match match in reFields.Matches(encodedValues + ","))
{
var id = match.Groups[1].Value;
var value = match.Groups[2].Value;
retVal[int.Parse(id)] = value.Replace(",,", ",");
}
return retVal;
}
I think it can be adapted to the original question with an expression like #"([0-9]+),\s?" and parse on Groups[0].
I hope it's helpful to somebody and thanks for the tips on getting it close to there, especially Asaph!

In JavaScript, use split to help out, and catch any negative digits as well:
'-1,2,-3'.match(/(-?\d+)(,\s*-?\d+)*/)[0].split(',');
// ["-1", "2", "-3"]
// may need trimming if digits are space-separated

The following will match any comma delimited word/digit/space combination
(((.)*,)*)(.)*

Why don't you work with groups:
^(\d+(, )?)+$

If you had a more complicated regex, i.e: for valid urls rather than just numbers. You could do the following where you loop through each element and test each of them individually against your regex:
const validRelativeUrlRegex = /^(^$|(?!.*(\W\W))\/[a-zA-Z0-9\/-]+[^\W_]$)/;
const relativeUrls = "/url1,/url-2,url3";
const startsWithComma = relativeUrls.startsWith(",");
const endsWithComma = relativeUrls.endsWith(",");
const areAllURLsValid = relativeUrls
.split(",")
.every(url => validRelativeUrlRegex.test(url));
const isValid = areAllURLsValid && !endsWithComma && !startsWithComma

Extract all numbers from string

Let's say I have a string such as 123ad456. I want to make a method that separates the groups of numbers into a list, so then the output will be something like 123,456.
I've tried doing return Regex.Match(str, #"-?\d+").Value;, but that only outputs the first occurrence of a number, so the output would be 123. I also know I can use Regex.Matches, but from my understanding, that would output 123456, not separating the different groups of numbers.
I also see from this page on MSDN that Regex.Match has an overload that takes the string to find a match for and an int as an index at which to search for the match, but I don't see an overload that takes in the above in addition to a parameter for the regex pattern to search for, and the same goes for Regex.Matches.
I guess the approach to use would be to use a for loop of some sort, but I'm not entirely sure what to do. Help would be greatly appreciated.

All you have to to use Matches instead of Match. Then simply iterate over all matches:
string result = "";
foreach (Match match in Regex.Matches(str, #"-?\d+"))
{
result += match.result;
}

You may iterate over string data using foreach and use TryParse to check each character.
foreach (var item in stringData)
{
if (int.TryParse(item.ToString(), out int data))
{
// int data is contained in variable data
}
}

Using a combination of string.Join and Regex.Matches:
string result = string.Join(",", Regex.Matches(str, #"-?\d+").Select(m => m.Value));
string.Join performs better than continually appending to an existing string.

\d+ is the regex for integer numbers;
//System.Text.RegularExpressions.Regex
resultString = Regex.Match(subjectString, #"\d+").Value;
returns a string with the very first occurence of a number in subjectString.
Int32.Parse(resultString) will then give you the number.

Extracting and Manipulating Strings in C#.Net

We have a requirement to extract and manipulate strings in C#. Net. The requirement is - we have a string
($name$:('George') AND $phonenumer$:('456456') AND
$emailaddress$:("test#test.com"))
We need to extract the strings between the character - $
Therefore, in the end, we need to get a list of strings containing - name, phonenumber, emailaddress.
What would be the ideal way to do it? are there any out of the box features available for this?
Regards,
John

The simplest way is to use a regular expression to match all non-whitespace characters between $ :
var regex=new Regex(#"\$\w+\$");
var input = "($name$:('George') AND $phonenumer$:('456456') AND $emailaddress$:(\"test#test.com\"))";
var matches=regex.Matches(input);
This will return a collection of matches. The .Value property of each match contains the matching string. \$ is used because $ has special meaning in regular expressions - it matches the end of a string. \w means a non-whitespace character. + means one or more.
Since this is a collection, you can use LINQ on it to get eg an array with the values:
var values=matches.OfType<Match>().Select(m=>m.Value).ToArray();
That array will contain the values $name$,$phonenumer$,$emailaddress$.
Capture by name
You can specify groups in the pattern and attach names to them. For example, you can group the field name values:
var regex=new Regex(#"\$(?<name>\w+)\$");
var names=regex.Matches(input)
.OfType<Match>()
.Select(m=>m.Groups["name"].Value);
This will return name,phonenumer,emailaddress. Parentheses are used for grouping. (?<somename>pattern) is used to attach a name to the group
Extract both names and values
You can also capture the field values and extract them as a separate field. Once you have the field name and value, you can return them, eg as an object or anonymous type.
The pattern in this case is more comples:
#"\$(?<name>\w+)\$:\(['""](?<value>.+?)['""]\)"
Parentheses are escaped because we want them to match the values. Both ' and " characters are used in values, so ['"] is used to specify a choice of characters. The pattern is a literal string (ie starts with #) so the double quotes have to be escaped: ['""] . Any character has to be matched .+ but only up to the next character in the pattern .+?. Without the ? the pattern .+ would match everything to the end of the string.
Putting this together:
var regex = new Regex(#"\$(?<name>\w+)\$:\(['""](?<value>.+?)['""]\)");
var myValues = regex.Matches(input)
.OfType<Match>()
.Select(m=>new { Name=m.Groups["name"].Value,
Value=m.Groups["value"].Value
})
.ToArray()
Turn them into a dictionary
Instead of ToArray() you could convert the objects to a dictionary with ToDictionary(), eg with .ToDictionary(it=>it.Name,it=>it.Value). You could omit the select step and generate the dictionary from the matches themselves :
var myDict = regex.Matches(input)
.OfType<Match>()
.ToDictionary(m=>m.Groups["name"].Value,
m=>m.Groups["value"].Value);
Regular expressions are generally fast because they don't split the string. The pattern is converted to efficient code that parses the input and skips non-matching input immediatelly. Each match and group contain only the index to their starting and ending character in the input string. A string is only generated when .Value is called.
Regular expressions are thread-safe, which means a single Regex object can be stored in a static field and reused from multiple threads. That helps in web applications, as there's no need to create a new Regex object for each request
Because of these two advantages, regular expressions are used extensively to parse log files and extract specific fields. Compared to splitting, performance can be 10 times better or more, while memory usage remains low. Splitting can easily result in memory usage that's multiple times bigger than the original input file.
Can it go faster?
Yes. Regular expressions produce parsing code that may not be as efficient as possible. A hand-written parser could be faster. In this particular case, we want to start capturing text if $ is detected up until the first $. This can be done with the following method :
IEnumerable<string> GetNames(string input)
{
var builder=new StringBuilder(20);
bool started=false;
foreach(var c in input)
{
if (started)
{
if (c!='$')
{
builder.Append(c);
}
else
{
started=false;
var value=builder.ToString();
yield return value;
builder.Clear();
}
}
else if (c=='$')
{
started=true;
}
}
}
A string is an IEnumerable<char> so we can inspect one character at a time without having to copy them. By using a single StringBuilder with a predetermined capacity we avoid reallocations, at least until we find a key that's larger than 20 characters.
Modifying this code to extract values though isn't so easy.

Here's one way to do it, but certainly not very elegant. Basically splitting the string on the '$' and taking every other item will give you the result (after some additional trimming of unwanted characters).
In this example, I'm also grabbing the value of each item and then putting both in a dictionary:
var input = "($name$:('George') AND $phonenumer$:('456456') AND $emailaddress$:(\"test#test.com\"))";
var inputParts = input.Replace(" AND ", "")
.Trim(')', '(')
.Split(new[] {'$'}, StringSplitOptions.RemoveEmptyEntries);
var keyValuePairs = new Dictionary<string, string>();
for (int i = 0; i < inputParts.Length - 1; i += 2)
{
var key = inputParts[i];
var value = inputParts[i + 1].Trim('(', ':', ')', '"', '\'', ' ');
keyValuePairs[key] = value;
}
foreach (var kvp in keyValuePairs)
{
Console.WriteLine($"{kvp.Key} = {kvp.Value}");
}
// Wait for input before closing
Console.WriteLine("\nDone!\nPress any key to exit...");
Console.ReadKey();
Output

Regular Expression start-end position

I need to make a validation in a string with positional information using regex.
Example:
020005254877841100557810AAAAAA841158891BBBB
I need to get match position 5 until 10 and has to be only numbers.
How can I do this using RegEx?

hmm does it really have to be regex?
I would have done like this.
var myString = "020005254877841100557810AAAAAA841158891BBBB";
var isValid = myString.Substring(4, 5).All(Char.IsDigit);

If you absolutely HAVE to use regex, here it is.... (Though I think you should go with something like Jonas W's answer).
Match m = Regex.Match(myString, "^.{4}\d{5}.*");
if(m.Success){
//do stuff
}
The regex means, "from the beginning of the string (^), match 4 of any character (.{4}), then five digits, (\d{5}), then however many of any other characters (.*)"

If you really feel you must use a regex this will be the one:
"^....\d{5}"
You can also do multiple checks like this:
"^....(\d{5}|\D{5})
That one will match all numbers or all non-digit characters, but not a mix or anything with whitespace.

To build on what FailedDev mentioned.
string myString = "020005254877841100557810AAAAAA841158891BBBB";
myString.Substring(5, 5); //will get the substring at index 5 to 10.
double Num;
bool isNum = double.TryParse(myString, out Num); //returns true of all numbers
Hope that helps,

How to extract decimal number from string in C#

string sentence = "X10 cats, Y20 dogs, 40 fish and 1 programmer.";
string[] digits = Regex.Split (sentence, #"\D+");
For this code I get these values in the digits array
10,20,40,1
string sentence = "X10.4 cats, Y20.5 dogs, 40 fish and 1 programmer.";
string[] digits = Regex.Split (sentence, #"\D+");
For this code I get these values in the digits array
10,4,20,5,40,1
But I would like to get like
10.4,20.5,40,1
as decimal numbers. How can I achieve this?

Small improvement to #Michael's solution:
// NOTES: about the LINQ:
// .Where() == filters the IEnumerable (which the array is)
// (c=>...) is the lambda for dealing with each element of the array
// where c is an array element.
// .Trim() == trims all blank spaces at the start and end of the string
var doubleArray = Regex.Split(sentence, #"[^0-9\.]+")
.Where(c => c != "." && c.Trim() != "");
Returns:
10.4
20.5
40
1
The original solution was returning
[empty line here]
10.4
20.5
40
1
.

The decimal/float number extraction regex can be different depending on whether and what thousand separators are used, what symbol denotes a decimal separator, whether one wants to also match an exponent, whether or not to match a positive or negative sign, whether or not to match numbers that may have leading 0 omitted, whether or not extract a number that ends with a decimal separator.
A generic regex to match the most common decimal number types is provided in Matching Floating Point Numbers with a Regular Expression:
[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?
I only changed the capturing group to a non-capturing one (added ?: after (). It matches
If you need to make it even more generic, if the decimal separator can be either a dot or a comma, replace \. with a character class (or a bracket expression) [.,]:
[-+]?[0-9]*[.,]?[0-9]+(?:[eE][-+]?[0-9]+)?
^^^^
Note the expressions above match both integer and floats. To match only float/decimal numbers make sure the fractional pattern part is obligatory by removing the second ? after \. (demo):
[-+]?[0-9]*\.[0-9]+(?:[eE][-+]?[0-9]+)?
^
Now, 34 is not matched: is matched.
If you do not want to match float numbers without leading zeros (like .5) make the first digit matching pattern obligatory (by adding + quantifier, to match 1 or more occurrences of digits):
[-+]?[0-9]+\.[0-9]+(?:[eE][-+]?[0-9]+)?
^
See this demo. Now, it matches much fewer samples:
Now, what if you do not want to match <digits>.<digits> inside <digits>.<digits>.<digits>.<digits>? How to match them as whole words? Use lookarounds:
[-+]?(?<!\d\.)\b[0-9]+\.[0-9]+(?:[eE][-+]?[0-9]+)?\b(?!\.\d)
And a demo here:
Now, what about those floats that have thousand separators, like 12 123 456.23 or 34,345,767.678? You may add (?:[,\s][0-9]+)* after the first [0-9]+ to match zero or more sequences of a comma or whitespace followed with 1+ digits:
[-+]?(?<![0-9]\.)\b[0-9]+(?:[,\s][0-9]+)*\.[0-9]+(?:[eE][-+]?[0-9]+)?\b(?!\.[0-9])
See the regex demo:
Swap a comma with \. if you need to use a comma as a decimal separator and a period as as thousand separator.
Now, how to use these patterns in C#?
var results = Regex.Matches(input, #"<PATTERN_HERE>")
.Cast<Match>()
.Select(m => m.Value)
.ToList();

try
Regex.Split (sentence, #"[^0-9\.]+")

You'll need to allow for decimal places in your regular expression. Try the following:
\d+(\.\d+)?
This will match the numbers rather than everything other than the numbers, but it should be simple to iterate through the matches to build your array.
Something to keep in mind is whether you should also be looking for negative signs, commas, etc.

Check the syntax lexers for most programming languages for a regex for decimals.
Match that regex to the string, finding all matches.

If you have Linq:
stringArray.Select(s=>decimal.Parse(s));
A foreach would also work. You may need to check that each string is actually a number (.Parse does not throw en exception).

Credit for following goes to #code4life. All I added is a for loop for parsing the integers/decimals before returning.
public string[] ExtractNumbersFromString(string input)
{
input = input.Replace(",", string.Empty);
var numbers = Regex.Split(input, #"[^0-9\.]+").Where(c => !String.IsNullOrEmpty(c) && c != ".").ToArray();
for (int i = 0; i < numbers.Length; i++)
numbers[i] = decimal.Parse(numbers[i]).ToString();
return numbers;
}

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Can someone help me out with some regex? - c#

Related

Can anyone improve this 'list of IP addresses' regex? [duplicate]

Extract all numbers from string

Extracting and Manipulating Strings in C#.Net

Regular Expression start-end position

How to extract decimal number from string in C#

Categories

Resources