Split String At Every Non-Letter/Non-Number Character - c#

Imagine a string that contains special characters like $§%%,., numbers and letters.
I want to receive the letter and number chunks of an arbitrary string as an array of strings.
A good solution seems to be the use of regex, but I don't know how to express [numbers and letters].
// example
"abc" = {"abc"};
"ab .c" = {"ab", "c"}
"ab123,cd2, ,,%&$§56" = {"ab123", "cd2", "56"}
// try
string input = "jdahs32455$§&%$§df233§$fd";
string[] output = input.Split(Regex("makejunksfromstring"));

To extract chunks of 1 or more letters/digits you may use
[A-Za-z0-9]+ # ASCII only letters/digits
[\p{L}0-9]+ # Any Unicode letters and ASCII only digits
[\p{L}\p{N}]+ # Any Unicode letters/digits
C# usage:
string[] output = Regex.Matches(input, @"[\p{L}\p{N}]+").Cast<Match>().Select(x => x.Value).ToArray();
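As a rough sketch of how this fits together, the snippet below wraps the last pattern in a helper and runs it on the three examples from the question. The names ChunkExtractor and ExtractChunks are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ChunkExtractor
{
    // Hypothetical helper: returns every maximal run of Unicode letters/digits.
    public static string[] ExtractChunks(string input) =>
        Regex.Matches(input, @"[\p{L}\p{N}]+")
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    public static void Main()
    {
        // The three sample inputs from the question.
        foreach (var sample in new[] { "abc", "ab .c", "ab123,cd2, ,,%&$§56" })
        {
            Console.WriteLine(string.Join("|", ExtractChunks(sample)));
        }
        // Prints:
        // abc
        // ab|c
        // ab123|cd2|56
    }
}
```

Matching the chunks directly avoids the empty entries that splitting on the separators tends to leave at the start or end of the array.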

Yes, regex is indeed a good solution for this.
And in fact, to just match all standard words in the input sequence, this is all you need:
(\w+)
Let me quickly explain
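\w matches any word character. With RegexOptions.ECMAScript it is equivalent to [a-zA-Z0-9_] (a through z, A through Z, 0-9 or _); by default, .NET also matches Unicode letters, decimal digits and connector punctuation. You might want to go with [a-zA-Z0-9] instead to avoid that underscore.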
Wrapping an expression in () means that you want to capture that part as a group.
The + means that you want sequences of 1 or more of the preceding characters.
Refer to a regular expression cheat sheet to see all the possibilities, such as
https://cheatography.com/davechild/cheat-sheets/regular-expressions/
Or any that you find online.
Also there are tools available to quickly test out your regular expressions, such as
https://regex101.com/ (quite well visualised matching)
or http://regexstorm.net/tester specifically for .NET
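To make the difference between the character classes concrete, here is a small sketch that runs three variants on one made-up input. Note that in .NET, \w is broader than the ASCII class unless RegexOptions.ECMAScript is set. The helper name Words and the sample string are assumptions for illustration only.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordCharDemo
{
    // Hypothetical helper: joins all matches of a pattern with '|'.
    public static string Words(string input, string pattern,
                               RegexOptions options = RegexOptions.None) =>
        string.Join("|", Regex.Matches(input, pattern, options)
                              .Cast<Match>()
                              .Select(m => m.Value));

    public static void Main()
    {
        string input = "foo_bar baz42 größe";

        // Default \w: keeps the underscore and non-ASCII letters.
        Console.WriteLine(Words(input, @"\w+"));                          // foo_bar|baz42|größe

        // ECMAScript mode: \w behaves like [a-zA-Z0-9_].
        Console.WriteLine(Words(input, @"\w+", RegexOptions.ECMAScript)); // foo_bar|baz42|gr|e

        // Explicit ASCII class without the underscore.
        Console.WriteLine(Words(input, @"[a-zA-Z0-9]+"));                 // foo|bar|baz42|gr|e
    }
}
```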

Related

Tamil language full-word search with .NET Regex

I have a Grid filled with Tamil words and a search string. I need to implement a full-word search through the Grid records. I'm using .NET Regex class for that approach. It sounds pretty simple, what I used to do is:
string pattern = @"\b" + searchText + @"\b";
It works as expected in Latin languages but for Tamil, this expression returns strange results. I have read about Unicode characters in regular expressions but that doesn't seem quite helpful to me. What I probably need is to determine where is the word boundary found and why.
As an example:
For the "\bஅம்மா\b" pattern Regex found matches in
அம்மாவிடம் and அம்மாக்கள் records but not in the original அம்மா record.
The last char in the word "அம்மா" is U+0BBE TAMIL VOWEL SIGN AA, which is a combining mark (in regex, it can be matched with \p{M}).
As \b only matches between the start/end of the string and a word char, or between a word char and a non-word char, it won't match between that trailing mark, which \w does not treat as a word char here, and a following non-word char or the end of the string.
The usual workaround in this case is to replace \b with lookarounds:
var pattern = $@"(?<!\w){searchText}(?!\w)";
Here, (?<!\w) fails the match if there is a word char before searchText and (?!\w) fails the match if there is a word char after the text to find. Note you may also use Regex.Escape(searchText) if the text can contain special regex chars.
Or, if you want to avoid matching when inside base letters/diacritics, use
var pattern = $@"(?<![\p{{L}}\p{{M}}]){searchText}(?![\p{{L}}\p{{M}}])";
The (?<![\p{L}\p{M}]) and (?![\p{L}\p{M}]) lookarounds work similarly to the ones above, except that they fail the match if there is a letter or a combining mark on either side of the search phrase.
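A minimal sketch of the letter/mark lookaround variant, combined with Regex.Escape, might look like this. The class and method names are hypothetical; the three test words are the records mentioned in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WholeWordSearch
{
    // Hypothetical helper: true if searchText occurs in text with no letter
    // or combining mark directly before or after it.
    public static bool ContainsWholeWord(string text, string searchText)
    {
        // Regex.Escape guards against regex metacharacters in the search text.
        string pattern =
            $@"(?<![\p{{L}}\p{{M}}]){Regex.Escape(searchText)}(?![\p{{L}}\p{{M}}])";
        return Regex.IsMatch(text, pattern);
    }

    public static void Main()
    {
        string search = "அம்மா";
        Console.WriteLine(ContainsWholeWord("அம்மா", search));      // True
        Console.WriteLine(ContainsWholeWord("அம்மாவிடம்", search)); // False
        Console.WriteLine(ContainsWholeWord("அம்மாக்கள்", search)); // False
    }
}
```

The doubled braces are needed because the pattern is an interpolated string: {{L}} yields a literal {L} in the final regex.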

Regex searching for string that contains 3 or more digits

I'm trying to find a way to extract a word from a string only if it contains 3 or more digits/numbers in that word. It would also need to return the entire text like
TX-23443 or FUX3329442 etc...
From what I found
\w*\d\w*
won't return any of the letters before the dash, as in the first example.
All the examples I found online don't seem to be working for me. Any help is appreciated!
If I understand your question correctly, you want to find every string that contains 3 or more consecutive digits, such as TX-23443 or FUX3329442, and extract it whole even if it contains a - in the middle. Here is a solution that might help:
string InpStr = "TX-23443 or FUX3329442";
MatchCollection ms = Regex.Matches(InpStr, @"[A-Za-z-]*\d{3,}");
foreach(Match m in ms)
{
Console.WriteLine(m);
}
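To see what this pattern does and does not return, the sketch below runs it on a slightly extended input and, for comparison, also runs the word-boundary pattern from the next answer. The extra word A1B2C3 and all class/method names are assumptions added for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class DigitWords
{
    private static string[] Find(string input, string pattern) =>
        Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value).ToArray();

    // Three or more consecutive digits, optionally preceded by letters/hyphens.
    public static string[] WithConsecutiveDigits(string input) =>
        Find(input, @"[A-Za-z-]*\d{3,}");

    // Whole \w-words that contain at least three digits anywhere.
    public static string[] WithAnyThreeDigits(string input) =>
        Find(input, @"\b(?:[A-Za-z_]*\d){3}\w*\b");

    public static void Main()
    {
        string input = "TX-23443 or FUX3329442 or A1B2C3";

        // Keeps the "TX-" prefix, but skips A1B2C3 (digits not consecutive).
        Console.WriteLine(string.Join(", ", WithConsecutiveDigits(input)));
        // TX-23443, FUX3329442

        // Finds A1B2C3, but the hyphen is not a word char, so "TX-" is cut off.
        Console.WriteLine(string.Join(", ", WithAnyThreeDigits(input)));
        // 23443, FUX3329442, A1B2C3
    }
}
```

Which of the two fits depends on whether "3 or more digits" means consecutive digits and on whether a hyphen counts as part of the word.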
This one should do the trick assuming your "words" have only the standard latin word characters: A-Z, a-z, 0-9 and _.
Regex word_with_3_digits = new Regex(@"(?#!cs word_with_3_digits Rev:20161129_0600)
# Match word having at least three digits.
\b # Anchor to word boundary.
(?: # Loop to find three digits.
[A-Za-z_]* # Zero or more non-digit word chars.
\d # Match one digit at a time.
){3} # End loop to find three digits.
\w* # Match remainder of word.
\b # Anchor to word boundary.
", RegexOptions.IgnorePatternWhitespace);
In javascript I would write a regex like this:
\S*\d{3,}\S*
I've prepared an online test.
Try this:
string strToCount = "Asd343DSFg534434";
int count = Regex.Matches(strToCount,"[0-9]").Count;
This one seems to be working for me even if there is a dash at the end as well.
[-]\w[-]\d{3,}[-]\w*[-]\w

Regex - Get all words that are not wrapped with a "/"

I'm really trying to learn regex, so here it goes.
I would really like to get all words in a string which do not have a "/" on either side.
For example, I need to do this to:
"Hello Great /World/"
I need to have the results:
"Hello"
"Great"
Is this possible in regex, and if so, how do I do it? I think I would like the results to be stored in a string array :)
Thank you
Just use this regular expression \b(?<!/)\w+(?!/)\b:
var str = "Hello Great /World/ /I/ am great too";
var words = Regex.Matches(str, @"\b(?<!/)\w+(?!/)\b")
.Cast<Match>()
.Select(m=>m.Value)
.ToArray();
This will get you:
Hello
Great
am
great
too
var newstr = Regex.Replace("Hello Great /World/", @"/(\w+?)/", "");
If you really want an array of strings:
var words = Regex.Matches(newstr, @"\w+")
.Cast<Match>()
.Select(m => m.Value)
.ToArray();
I would first split the string into the array, then filter out matching words. This solution might also be cleaner than a big regexp, because you can spot the requirements for "word" and the filter better.
The big regexp solution would be something like word boundary - not a slash - many no-whitespaces - not a slash - word boundary.
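A minimal sketch of the split-then-filter idea, assuming that words are separated by single spaces and that a token counts as "wrapped" when it starts or ends with a slash (the helper name is hypothetical):

```csharp
using System;
using System.Linq;

public static class SlashFilter
{
    // Hypothetical helper: split on spaces, then drop tokens touching a '/'.
    public static string[] UnwrappedWords(string input) =>
        input.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
             .Where(t => !t.StartsWith("/", StringComparison.Ordinal)
                      && !t.EndsWith("/", StringComparison.Ordinal))
             .ToArray();

    public static void Main()
    {
        var words = UnwrappedWords("Hello Great /World/ /I/ am great too");
        Console.WriteLine(string.Join(" ", words)); // Hello Great am great too
    }
}
```

This keeps the definition of a "word" (the split) separate from the exclusion rule (the filter). Punctuation attached to a token, such as a trailing comma, is not handled here and would need an extra trimming step.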
I would use a regex replace to replace all /[a-zA-Z]+/ with '' (nothing), then get all words.
Try this one:
(\s(?<!/)([A-Za-z]+)(?!/))|((?<!/)([A-Za-z]+)(?!/)\s)
Using this example excerpt:
The /character/ "_" (underscore/under-strike) can be /used/ in /variable/ names /in/ many /programming/ /languages/, while the /character/ "/" (slash/stroke/solidus) is typically not allowed.
...this expression matches any string of letters, numbers, underscores, or apostrophes (a fairly typical idea of a "word" in English) that is not wrapped with a "/", i.e. that does not have a / character both before and after it:
\b([\w']+)\b(?<=(?<!/)\1|\1(?!/))
...and is the purest form, using only one character class to define "word" characters. It matches the example as follows:
Matched        Not Matched
-------------  -------------
The            character
_              used
underscore     variable
under          in
strike         programming
can            languages
be             character
in             stroke
names
many
while
the
slash
solidus
is
typically
not
allowed
If excluding /stroke/ is not desired, then adding a bit to the end limitation will allow it, depending upon how you want to define the beginning of a "next" word:
\b([\w']+)\b(?<=(?<!/)\1|\1(?!/([^\w])))
changes (?!/) to (?!/([^\w])), which allows /something/ if it does have a letter, number, or underscore immediately after it. This would move stroke from the "Not Matched" to the "Matched" list, above.
note: \w matches uppercase or lowercase letters, numbers and the underscore character
If you want to alter your concept for "word" from the above, simply exchange the characters and shorthand character classes contained in the [\w'] part of the expression to something like [a-zA-Z'] to exclude digits or [\w'-] to include hyphens, which would capture under-strike as a single match, rather than two separate matches:
\b([\w'-]+)\b(?<=(?<!/)\1|\1(?!/([^\w])))
IMPORTANT ALTERNATIVE!!! (I think)
I just thought of an alternative to matching any words that are not wrapped with / symbols: simply consume all of these symbols and the words that are surrounded by them (splitting). This has a few benefits: no lookaround means this could be used in more contexts (older JavaScript engines do not support lookbehind, and some flavors of regex don't support lookaround at all) while increasing efficiency; also, using a split expression gives a String array directly:
string input = "The /character/ \"_\" (underscore/under-strike) can be..."; //etc...
// Non-capturing groups: Regex.Split would otherwise add captured text to the result.
string[] resultsArray = Regex.Split(input, @"(?:[^\w'-]+?(?:/[\w]+/)?)+");
voila!

Regex match if a string has length 2 and contains 1 letter and 1 number

Guys I hate Regex and I suck at writing.
I have a string that is space separated and contains several codes that I need to pull out. Each code is marked by beginning with a capital letter and ending with a number. The code is only two characters long.
I'm trying to create an array of strings from the initial string and I can't get the regular expression right.
Here is what I have
String[] test = Regex.Split(originalText, "([a-zA-Z0-9]{2})");
I also tried:
String[] test = Regex.Split(originalText, "([A-Z]{1}[0-9]{1})");
I don't have any experience with Regex as I try to avoid writing them whenever possible.
Anyone have any suggestions?
Example input:
AA2410 F7 A4 Y7 B7 A 0715 0836 E0.M80
I need to pull out F7, A4, B7. E0 should be ignored.
You want to collect the results, not split on them, right?
Regex regexObj = new Regex(@"\b[A-Z][0-9]\b");
MatchCollection allMatchResults = regexObj.Matches(subjectString);
should do this. The \bs are word boundaries, making sure that only entire strings (like A1) are extracted, not substrings (like the A1 in TWA101).
If you also need to exclude "words" with non-word characters in them (like E0.M80 in your comment), you need to define your own word boundary, for example:
Regex regexObj = new Regex(@"(?<=^|\s)[A-Z][0-9](?=\s|$)");
Now A1 only matches when surrounded by whitespace (or start/end-of-string positions).
Explanation:
(?<= # Assert that we can match the following before the current position:
^ # Start of string
| # or
\s # whitespace.
)
[A-Z] # Match an uppercase ASCII letter
[0-9] # Match an ASCII digit
(?= # Assert that we can match the following after the current position:
\s # Whitespace
| # or
$ # end of string.
)
If you also need to find non-ASCII letters/digits, you can use
\p{Lu}\p{N}
instead of [A-Z][0-9]. This finds all uppercase Unicode letters and Unicode digits (like Ä٣), but I guess that's not really what you're after, is it?
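As an illustrative sketch, the two ASCII patterns above can be compared on the example input from the question. The class and method names are made up for this example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CodeExtractor
{
    private static string[] Find(string text, string pattern) =>
        Regex.Matches(text, pattern).Cast<Match>().Select(m => m.Value).ToArray();

    // Codes delimited by ordinary word boundaries.
    public static string[] ByWordBoundary(string text) =>
        Find(text, @"\b[A-Z][0-9]\b");

    // Codes delimited by whitespace or the start/end of the string.
    public static string[] ByWhitespace(string text) =>
        Find(text, @"(?<=^|\s)[A-Z][0-9](?=\s|$)");

    public static void Main()
    {
        string input = "AA2410 F7 A4 Y7 B7 A 0715 0836 E0.M80";

        // '.' is a non-word char, so \b also accepts the E0 in E0.M80.
        Console.WriteLine(string.Join(" ", ByWordBoundary(input))); // F7 A4 Y7 B7 E0

        // The whitespace lookarounds reject E0 because '.' follows it.
        Console.WriteLine(string.Join(" ", ByWhitespace(input)));   // F7 A4 Y7 B7
    }
}
```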
Do you mean that each code looks like "A00"?
Then this is the regex:
"[A-Z][0-9][0-9]"
Very simple... By the way, there's no point writing {1} in a regex. [0-9]{1} means "match exactly one digit", which is exactly like writing [0-9].
Don't give up, simple regexes make perfect sense.
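This should be ok:
String[] all_codes = Regex.Matches(originalText, @"\b[A-Z]\d\b").Cast<Match>().Select(m => m.Value).ToArray();
It gives you an array with all codes consisting of a capital letter followed by a digit, delimited by any kind of word boundary (whitespace etc.). Note that Regex.Split with the same pattern would return the text between the codes instead.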

regex for capturing digits and digit ranges

I have the following string:
Fat mass loss was 2121,323.222 greater for GPLC (2–2.4kg vs. 0.5kg)
I want to capture
2121,323.222
2–2.4
0.5
i.e. I want the above three results from the string.
Can anyone help me with this regex?
I noticed that the hyphen in 2–2.4kg is not really a hyphen; it is U+2013 EN DASH.
So, here is another regex in C#:
@"[0-9]+([,.\u2013-][0-9]+)*"
Test
MatchCollection matches = Regex.Matches("Fat mass loss was 2121,323.222 greater for GPLC (2–2.4kg vs. 0.5kg)", @"[0-9]+([,.\u2013-][0-9]+)*");
foreach (Match m in matches) {
Console.WriteLine(m.Groups[0]);
}
Here are the results. My console does not support printing the Unicode char U+2013, so it shows "?", but it is properly matched.
2121,323.222
2?2.4
0.5
Okay I didn't notice the C# tag until now. I will leave the answer but I know that's not what you expected, see if you can do something with it. Perhaps the title should have mentioned the programming language?
Sure:
Fat mass loss was (.*) greater for GPLC \((.*) vs. (.*)kg\)
Find your substrings in \1, \2 and \3.
If for Emacs, swap all parentheses and escaped parentheses.
How about something like this:
^.*((?:\d+,)*\d+(?:\.\d+)?).*(\d+(?:\.\d+)?(?:-\d+(?:\.\d+))?).*(\d+(?:\.\d+)).*$
A little more general, I think. I'm a little concerned about .* being greedy.
Fat mass loss was 2121,323.222 greater
for GPLC (2–2.4kg vs. 0.5kg)
a generalized extractor:
/\D+?([\d\,\.\-]+)/g
explanation:
/ # start pattern
\D+ # 1 or more non-digits
( # capture group 1
[\d,.-]+ # character class, 1 or more of digits, comma, period, hyphen
) # end capture group 1
/g # trailing regex g modifier (make regex continue after last match)
sorry I don't know c# well enough for a full writeup, but the pattern should plug right in.
see: http://www.radsoftware.com.au/articles/regexsyntaxadvanced.aspx for some implementation examples.
I came out with something like this atrocity:
-?\d(?:,?\d)*(?:\.(?:\d(?:,?\d)*\d|\d))?(?:[–-]-?\d(?:,?\d)*(?:\.(?:\d(?:,?\d)*\d|\d))?)?
In it, -?\d(?:,?\d)*(?:\.(?:\d(?:,?\d)*\d|\d))? is repeated twice, with [–-] in the middle (note that – is a long hyphen).
This should take care of dots and commas outside of numbers, eg: hello,23,45.2-7world - will capture 23,45.2-7.
It looks like you're trying to find all numbers in the string (possibly with commas inside the number), and all ranges of numbers such as "2-2.4". Here is a regex that should work:
\d+(?:[,.-]\d+)*
From C# 3, you can use it like this:
var input = "Fat mass loss was 2121,323.222 greater for GPLC (2-2.4kg vs. 0.5kg)";
var pattern = @"\d+(?:[,.-]\d+)*";
var matches = Regex.Matches(input, pattern);
// Declare the loop variable as Match so that .Value is available.
foreach (Match match in matches)
    Console.WriteLine(match.Value);
Hmm, this is a tricky question, especially because the input string contains unicode character – (EN DASH) instead of - (HYPHEN-MINUS). Therefore the correct regex to match the numbers in the original string would be:
\d+(?:[\u2013,.]\d+)*
A more generic approach would be:
\d+(?:[\p{Pd}\p{Pc}\p{Po}]\d+)*
which matches dash punctuation, connector punctuation and other punctuation between the digit groups.
An implementation in C# would look like this:
string input = "Fat mass loss was 2121,323.222 greater for GPLC (2–2.4kg vs. 0.5kg)";
try {
Regex rx = new Regex(@"\d+(?:[\p{Pd}\p{Pc}\p{Po}\p{C}]\d+)*", RegexOptions.IgnoreCase | RegexOptions.Multiline);
Match match = rx.Match(input);
while (match.Success) {
// matched text: match.Value
// match start: match.Index
// match length: match.Length
match = match.NextMatch();
}
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
Let's try this one :
(?=\d)([0-9,.-]+)(?<=\d)
It captures all expressions containing only :
"[0-9,.-]" characters,
must start with a digit "(?=\d)",
must finish with a digit "(?<=\d)"
It works with a single digit expression and does not include beginning or trailing [.,-].
Hope this helps.
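A short sketch of how this lookaround pattern behaves in C#, using a made-up input with ASCII hyphens and stray trailing punctuation (the class and method names are hypothetical):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NumberRanges
{
    // Runs of digits, commas, dots and hyphens that start and end with a digit.
    public static string[] Extract(string input) =>
        Regex.Matches(input, @"(?=\d)([0-9,.-]+)(?<=\d)")
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    public static void Main()
    {
        // Assumed sample text; note the comma after 1,234.5 and the final dot.
        string input = "Values: 1,234.5, then 2-2.4kg, and 7.";
        Console.WriteLine(string.Join(" | ", Extract(input)));
        // 1,234.5 | 2-2.4 | 7
    }
}
```

The greedy character class first grabs the trailing comma or dot, and the (?<=\d) lookbehind then forces it to back off to the last digit. The class as written contains only the ASCII hyphen, so an en dash (U+2013) as in the original string would split a range into two matches unless it is added to the class.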
I got the solution to my problem.
The following is the Regex that gave my desired result:
(([0-9]+)([–.,-]*))+
