Tidy up a string - c#

I'm looking for the best solution, performance-wise, to rebuild a string by removing anything that is not a complete word. An acceptable word in this instance is a whole word that contains no numbers and doesn't start with a forward slash or a backslash. So letters only, but it can include hyphens and apostrophes.
For example:
string str = @"\DR1234 this is a word, 123456, frank's place DA123 SW1 :50:/";
Using the above I'd need a new string that returns the following:
Str = "this is a word, frank's place"
I've done some research on Regex, but I can't find anything that would do what I need.
Final Code Snippet
var resultSet = Regex.Matches(item.ToLower(), @"(?:^|\s)(?![\\\/])(?!-+(?:\s|$))(?!'+(?:\s|$))(?!(?:[a-z'-]*?-){3,})(?!(?:[a-z'-]*?'){2,})[a-z'-]+[,.]?(?=\s|$)")
.Cast<Match>()
.Select(m => m.Value).ToArray();
Thanks for all your input guys - proves what a great site this is

Description
Based on your comments: A word in this instance is:
a whole word without numbers
doesn't start with a forward slash, or a back slash
just letters only
can include hyphen and apostrophes
The character class that covers all the word characters by your definition would be [a-z'-]+, and that group could be surrounded by whitespace or the start/end of the string. Your sample also shows a comma, so I'm presuming a word may be followed by a comma or dot, as long as that is followed by whitespace.
This regex will:
collect all substrings defined as words [a-z'-]+
allow a comma or dot after a word, but not inside or at the start of a word
reject substrings that consist entirely of hyphens
reject substrings that consist entirely of apostrophes
prevent words from having 3 or more hyphens
prevent words from having 2 or more apostrophes
(?:^|\s)(?![\\\/])(?!-+(?:\s|$))(?!'+(?:\s|$))(?!(?:[a-z'-]*?-){3,})(?!(?:[a-z'-]*?'){2,})[a-z'-]+[,.]?(?=\s|$)
Expanded explanation
(?:^|\s) match the start of the string or a white space. This eliminates the need to test for word boundary which is problematic for strings like "abdc-egfh"
(?![\\\/]) prevent the word from starting with a \ or /, however this is overkill as the character class doesn't allow it either
(?!-+(?:\s|$)) prevent strings which are all hyphens
(?!'+(?:\s|$)) prevent strings which are all apostrophes
(?!(?:[a-z'-]*?-){3,}) prevent strings which have 3 or more hyphens
(?!(?:[a-z'-]*?'){2,}) prevent strings which have 2 or more apostrophes
[a-z'-]+[,.]?(?=\s|$) match the word followed by some optional punctuation, and ensure this is followed by either a space or the end of a string
Examples
I'm not a C# programmer, but getting an array of matches from a code block like the one covered in the question "Return a array/list using regex", combined with this regular expression, will probably work for you. Note that this expression assumes you'll use the case-insensitive option.
Sample Text
\DR1234 - this is a word, 123456, frank's place DA123 SW1 :50:/ one-hyphen two-hyphens-here I-have-three-hyphens
Matches
[0] => this
[1] => is
[2] => a
[3] => word,
[4] => frank's
[5] => place
[6] => one-hyphen
[7] => two-hyphens-here
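For C# readers, the expression above could be wired up roughly as follows. This is only a sketch, not the answerer's code: the names WordTidy and Tidy are made up for illustration, RegexOptions.IgnoreCase stands in for the case-insensitive option mentioned above, and Trim() is assumed to be wanted because each match includes the whitespace consumed by (?:^|\s).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordTidy
{
    // Pattern copied from the answer above; IgnoreCase is the assumed "case insensitive option".
    private static readonly Regex Word = new Regex(
        @"(?:^|\s)(?![\\\/])(?!-+(?:\s|$))(?!'+(?:\s|$))(?!(?:[a-z'-]*?-){3,})(?!(?:[a-z'-]*?'){2,})[a-z'-]+[,.]?(?=\s|$)",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: keeps only the accepted words and rejoins them with single spaces.
    public static string Tidy(string input)
    {
        return string.Join(" ", Word.Matches(input)
            .Cast<Match>()
            .Select(m => m.Value.Trim())); // each match carries its leading whitespace
    }

    public static void Main()
    {
        string str = @"\DR1234 this is a word, 123456, frank's place DA123 SW1 :50:/";
        Console.WriteLine(Tidy(str)); // this is a word, frank's place
    }
}
```

Joining with a single space normalizes the spacing between words, which is what the asker's expected output shows.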

The regex \b\w+\b will match words or, if you're more picky, then \b[a-zA-Z]+\b won't include numbers or _s
http://rubular.com/r/uOVvPTb5nh
It looks like you want to allow 's and ,s, so the regex \b[a-zA-Z,']+\b will do an okay job at that, but it will also let slip through any number of things that you might not want, such as
,','hello''',World
or, in c#,
string str = @"\DR1234 this is a word, 123456, frank's place DA123 SW1 :50:/";
Regex r = new Regex(@"\b[a-zA-Z,']+\b");
string newStr = string.Join(" ", r.Matches(str).Cast<Match>().Select(m => m.Value).ToArray());

The pattern [a-z\s,']+ is what you're looking for. Here is the code example:
string pattern = @"[a-z\s,']+";
string input = @"\DR1234 this is a word, 123456, frank's place DA123 SW1 :50:/";
Match match = Regex.Match(input, pattern);
while (match.Success){
Console.WriteLine(match.Value);
match = match.NextMatch();
}

Related

Split String At Every Non-Letter/Non-Number Character

Imagine a string that contains special characters like $§%%,., numbers and letters.
I want to receive the letter and number chunks of an arbitrary string as an array of strings.
A good solution seems to be the use of regex, but I don't know how to express [numbers and letters]
// example
"abc" = {"abc"};
"ab .c" = {"ab", "c"}
"ab123,cd2, ,,%&$§56" = {"ab123", "cd2", "56"}
// try
string input = "jdahs32455$§&%$§df233§$fd";
string[] output = input.Split(Regex("makejunksfromstring"));
To extract chunks of 1 or more letters/digits you may use
[A-Za-z0-9]+ # ASCII only letters/digits
[\p{L}0-9]+ # Any Unicode letters and ASCII only digits
[\p{L}\p{N}]+ # Any Unicode letters/digits
See a regex demo.
C# usage:
string[] output = Regex.Matches(input, @"[\p{L}\p{N}]+").Cast<Match>().Select(x => x.Value).ToArray();
Yes, regex is indeed a good solution for this.
And in fact, to just match all standard words in the input sequence, this is all you need:
(\w+)
Let me quickly explain
\w matches any word character. In ASCII terms that is [a-zA-Z0-9_] (a through z, A through Z, 0-9, or _); note that in .NET \w also matches Unicode letters and digits unless RegexOptions.ECMAScript is set. You might want to go with [a-zA-Z0-9] instead to avoid that underscore.
Wrapping an expression in () means that you want to capture that part as a group.
The + means that you want sequences of 1 or more of the preceding characters.
Refer to a regular expression cheat sheet to see all the possibilities, such as
https://cheatography.com/davechild/cheat-sheets/regular-expressions/
Or any that you find online.
Also there are tools available to quickly test out your regular expressions, such as
https://regex101.com/ (quite well visualised matching)
or http://regexstorm.net/tester specifically for .NET
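To make the difference between \w+ and [a-zA-Z0-9]+ concrete, here is a small C# sketch run against the question's examples. The class and method names (ChunkDemo, Chunks) are invented for illustration, and it uses Regex.Matches rather than Split.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ChunkDemo
{
    // Hypothetical helper: returns every run of ASCII letters/digits in the input.
    public static string[] Chunks(string input)
    {
        return Regex.Matches(input, @"[a-zA-Z0-9]+")
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join("|", Chunks("ab123,cd2, ,,%&$§56"))); // ab123|cd2|56
        Console.WriteLine(string.Join("|", Chunks("ab .c")));               // ab|c
        // \w also accepts the underscore, so it keeps "a_b" in one piece...
        Console.WriteLine(Regex.Match("a_b", @"\w+").Value);                // a_b
        // ...while the explicit class treats the underscore as a separator.
        Console.WriteLine(string.Join("|", Chunks("a_b")));                 // a|b
    }
}
```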

C# Regular expression to squeeze word where every character is separated by a space

I'm trying to write a regular expression to transform words written like "H e l l o Everyone" to "Hello Everyone".
If it is words separated by spaces like "Hello everyone, how are you?", nothing should happen.
Basically, runs of single characters should be squeezed together to make a word, and we only treat a run as following this pattern if it has more than 2 characters.
If it is like "a b cdef" - Nothing should happen
But "a b c def" -> "abc def"
I tried something like this "^\w(?:(\s)\w)*$" but it is matching with "Hello world" as well.
And also, I'm not sure on how to squeeze these single characters.
Any help is greatly appreciated.
Thanks!
I suggest matching chunks of single word chars separated by single whitespaces and then removing the spaces inside each match within a match evaluator.
The regex is
(?<!\S)\w(?:\s\w){2,}(?!\S)
See its demo at RegexStorm. The (?<!\S) and (?!\S) make sure these chunks are enclosed with whitespaces (or are at string start/end).
Details:
(?<!\S) - a negative lookbehind making sure there is a whitespace or start of string immediately before the current location
\w - a word char (letter/digit/underscore, to match a letter, use \p{L} instead)
(?:\s\w){2,} - 2 or more sequences of:
\s - a whitespace
\w - a word char
(?!\S) - a negative lookahead making sure there is a whitespace or end of string immediately after the current location
See the C# demo:
var res = Regex.Replace(s, @"(?<!\S)\w(?:\s\w){2,}(?!\S)", m =>
new string(m.Value
.Where(c => !Char.IsWhiteSpace(c))
.ToArray()));
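As a self-contained illustration, the match-evaluator approach can be wrapped in a small helper and checked against the inputs from the question. The names Squeezer and Squeeze are made up here; the pattern is the one given above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class Squeezer
{
    // Hypothetical helper: removes the whitespace inside every run of
    // three or more single word characters separated by single whitespaces.
    public static string Squeeze(string s)
    {
        return Regex.Replace(s, @"(?<!\S)\w(?:\s\w){2,}(?!\S)",
            m => new string(m.Value.Where(c => !char.IsWhiteSpace(c)).ToArray()));
    }

    public static void Main()
    {
        Console.WriteLine(Squeeze("H e l l o Everyone")); // Hello Everyone
        Console.WriteLine(Squeeze("a b cdef"));           // a b cdef (only two single chars)
        Console.WriteLine(Squeeze("a b c def"));          // abc def
    }
}
```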
If you're looking for a pure regex solution,
Regex.Replace(s, @"(?<=^\w|(\s\w)+)\s(?=(\w\s)+|\w$)", string.Empty);
This replaces every space that has at least one space-and-letter pair on each side with nothing (with a little extra to handle the start/end of the string).

I want only matching string using regex

I have a string "myname 18-may 1234" and I want only "myname" from whole string using a regex.
I tried using the \b(^[a-zA-Z]*)\b regex and that gave me "myname" as a result.
But when the string changes to "1234 myname 18-may" the regex does not return "myname". Please suggest the correct way to select only "myname" whole word.
Is it also possible - given the string in
"1234 myname 18-may" format - to get myname only, not may?
UPDATE
Judging by your feedback to your other question you might need
(?<!\p{L})\p{L}+(?!\p{L})
ORIGINAL ANSWER
I have come up with a lighter regex that relies on the specific nature of your data (just a couple of words in the string, only one is whole word):
\b(?<!-)\p{L}+\b
See demo
Or even a more restrictive regex that finds a match only between (white)spaces and string start/end:
(?<=^|\s)\p{L}+(?=\s|$)
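A short C# sketch of how this whitespace-delimited pattern behaves on both orderings from the question. The class and method names (WholeWord, FirstLetterOnlyWord) are hypothetical and only serve the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WholeWord
{
    // Hypothetical helper: returns the first run of letters that is delimited by
    // whitespace or the string boundaries on both sides, or null if there is none.
    public static string FirstLetterOnlyWord(string input)
    {
        Match m = Regex.Match(input, @"(?<=^|\s)\p{L}+(?=\s|$)");
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(FirstLetterOnlyWord("myname 18-may 1234")); // myname
        // "may" is skipped because it is preceded by a hyphen, not whitespace.
        Console.WriteLine(FirstLetterOnlyWord("1234 myname 18-may")); // myname
    }
}
```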
The following regex is context-dependent:
\p{L}+(?=\s+\d{1,2}-\p{L}{3}\b)
See demo
This will match only the word myname.
The regex means:
\p{L}+ - Match 1 or more Unicode letters...
(?=\s+\d{1,2}-\p{L}{3}\b) - until it finds 1 or more whitespaces (\s+) followed with 1 or 2 digits, followed with a hyphen and 3 Unicode letters (\p{L}{3}) which is a whole word (\b). This construction is a positive look-ahead that only checks if something can be found after the current position in the string, but it does not "consume" text.
Since the date may come before the string, you can add an alternation:
\p{L}+(?=[ ]+\d{1,2}-\p{L}{3}\b)|(?<=\d{1,2}-\p{L}{3}[ ]+)\p{L}+
See another demo
The (?<=\d{1,2}-\p{L}{3}[ ]+) is a look-behind that checks for (almost) the same thing as the look-ahead, but before the myname.
Here is a solution without regex:
string input = "myname 18-may 1234";
string result = input.Split(' ').Where(x => x.All(y => char.IsLetter(y))).FirstOrDefault();
Do a replace using this regex:
(\s*\d+\-.{3}\s*|\s*.{3}\-\d+\s*)|(\s*\d+\s*)
you will end up with just your name.
Demo

Regex for catching word with special characters between letters

I am new to regex, I'm programming an advanced profanity filter for a commenting feature (in C#). Just to save time, I know that all filters can be fooled, no matter how good they are, you don't have to tell me that. I'm just trying to make it a bit more advanced than basic word replacement. I've split the task into several separate approaches and this is one of them.
What I need is a specific piece of regex, that catches strings such as these:
s_h_i_t
s h i t
S<>H<>I<>T
s_/h_/i_/t
s***h***i***t
you get the idea.
I guess what I'm looking for is a regex that says "one or more characters that are not alphanumeric". This should include both spaces and all special characters that you can type on a standard (western) keyboard. If possible, it should also include line breaks, so it would catch things like
s
h
i
t
There should always be at least one of the characters present, to avoid likely false positives such as in
Finish it.
This will of course mean that things like
sh_it
will not be caught, but as I said, it doesn't matter, it doesn't have to be perfect. All I need is the regex, I can do the splitting of words and inserting the regex myself. I have the RegexOptions.IgnoreCase option set in my C# code, so character case in the actual word is not an issue. Also, this regex shouldn't worry about "leetspeek", i.e. some of the actual letters of the word being replaced by other characters:
sh1t
I have a different approach that deals with that.
Thank you in advance for your help.
Let's see if this regex works for you:
/\w(?:_|\W)+/
Alright, HamZa's answer worked. However I ran into a programmatic problem while working on the solution. When I was replacing just the words, I always knew the length of the word. So I knew exactly how many asterisks to replace it with. If I'm matching shit, I know I need to put 4 asterisks. But if I'm matching s[^a-z0-9]+h[^a-z0-9]+[^a-z0-9]+i[^a-z0-9]+t, I might catch s#h#i#t or I may catch s------h------i--------t. In both cases the length of the matched text will differ wildly from that of the pattern. How can I get the actual length of the matched string?
\bs[\W_]*h[\W_]*i[\W_]*t[\W_]*(?!\w)
This matches the letters with any run of non-word characters or underscores between them, including whitespace and line breaks.
\b (word boundary) ensures that Finish it won't match
(?!\w) ensures that sh ituuu won't match; you may want to remove/modify that, as s_hittt will not match either. \bs[\W_]*h[\W_]*i[\W_]*t+[\W_]*(?!\w) will match the word with a repeated last character
The modification \bs[\W_]*h[\W_]*i[\W_]*t[\W_]*?(?!\w) makes the last character class non-greedy, so in sh it&&& only sh it will match
\bs[\W\d_]*h[\W\d_]*i[\W\d_]*t+[\W\d_]*?(?!\w) will match sh1i444t (digits between characters)
EDIT:
(?!\w) is a negative lookahead. It basically checks that your match is not followed by a word character (word characters are [A-Za-z0-9_]). It has a length of 0, which means it won't be included in the match. If you want to catch words like "shi*tface" you'll have to remove it.
( http://www.regular-expressions.info/lookaround.html )
A word boundary (\b) matches a place where a word starts or ends; its length is 0, which means that it matches between characters
[\W] is a negated character class; I think it's equal to [^a-zA-Z0-9_] or [^\w]
You want to match words where each letter is separated with the identical non-word char(s).
You can use
\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b
See the regex demo. (I added (?!\n) to make the regex work for each line as if it were a separate string.) Details:
\b - word boundary
\p{L} - a letter
(?=([\W_]+)) - a positive lookahead that matches a location that is immediately followed with any non-word or _ char (captured into Group 1)
(?:\1\p{L})+ - one or more repetitions of a sequence of the same char captured into Group 1 and a letter
\b - word boundary.
To check if there is such a pattern in a string, you can use
var HasSpamWords = Regex.IsMatch(text, @"\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b");
To return all occurrences in a string, you can use
var results = Regex.Matches(text, @"\b\p{L}(?=([\W_]+))(?:\1\p{L})+\b")
.Cast<Match>()
.Select(x => x.Value)
.ToList();
See the C# demo.
Getting the length of each string is easy if you get Match.Length and use .Select(x => x.Length). If you need to get the length of the string with all special chars removed, simply use .Select(x => x.Value.Count(c => char.IsLetter(c))) (see this C# demo).
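For the follow-up about replacing a match with the right number of asterisks, one possible sketch uses a match evaluator and Match.Length. Everything named here (Masker, BuildPattern, Mask, and the harmless sample word "darn") is an assumption for illustration; the separator class [\W_]+ follows the question's "at least one non-alphanumeric character" requirement rather than the [\W_]* used in some of the answers.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class Masker
{
    // Hypothetical helper: letters of the word, each pair separated by one or more
    // non-word characters or underscores, not followed by another word character.
    public static string BuildPattern(string word)
    {
        var letters = word.Select(c => Regex.Escape(c.ToString()));
        return @"\b" + string.Join(@"[\W_]+", letters) + @"(?!\w)";
    }

    // Replaces each match with as many asterisks as the matched text is long.
    public static string Mask(string text, string word)
    {
        return Regex.Replace(text, BuildPattern(word),
            m => new string('*', m.Length), RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(Mask("d_a_r_n it", "darn"));  // ******* it
        Console.WriteLine(Mask("d---a-r##n!", "darn")); // **********!
        Console.WriteLine(Mask("garden", "darn"));      // garden (no separators, no match)
    }
}
```

Because the evaluator sees the actual Match, the number of asterisks follows the matched text and not the length of the pattern.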

Regex - Get all words that are not wrapped with a "/"

I'm really trying to learn regex, so here it goes.
I would really like to get all words in a string which do not have a "/" on either side.
For example, I need to do this to:
"Hello Great /World/"
I need to have the results:
"Hello"
"Great"
Is this possible in regex? If so, how do I do it? I think I would like the results to be stored in a string array :)
Thank you
Just use this regular expression \b(?<!/)\w+(?!/)\b:
var str = "Hello Great /World/ /I/ am great too";
var words = Regex.Matches(str, @"\b(?<!/)\w+(?!/)\b")
.Cast<Match>()
.Select(m=>m.Value)
.ToArray();
This will get you:
Hello
Great
am
great
too
var newstr = Regex.Replace("Hello Great /World/", @"/(\w+?)/", "");
If you really want an array of strings
var words = Regex.Matches(newstr, @"\w+")
.Cast<Match>()
.Select(m => m.Value)
.ToArray();
I would first split the string into the array, then filter out matching words. This solution might also be cleaner than a big regexp, because you can spot the requirements for "word" and the filter better.
The big regexp solution would be something like word boundary - not a slash - many no-whitespaces - not a slash - word boundary.
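A rough C# sketch of the split-then-filter idea, under the assumption that words are separated by single spaces and that "wrapped" means a token starts or ends with a slash. The names SlashFilter and UnwrappedWords are invented for the example.

```csharp
using System;
using System.Linq;

public static class SlashFilter
{
    // Hypothetical helper: splits on spaces, then drops every token that has
    // a "/" on either side.
    public static string[] UnwrappedWords(string input)
    {
        return input
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(w => !w.StartsWith("/", StringComparison.Ordinal)
                     && !w.EndsWith("/", StringComparison.Ordinal))
            .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", UnwrappedWords("Hello Great /World/"))); // Hello, Great
    }
}
```

The filter condition sits in one readable place, so changing what counts as a "word" does not require touching a regular expression.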
I would use a regex replace to replace all /[a-zA-Z]/ with '' (nothing) then get all words
Try this one : (Click here for a demo)
(\s(?<!/)([A-Za-z]+)(?!/))|((?<!/)([A-Za-z]+)(?!/)\s)
Using this example excerpt:
The /character/ "_" (underscore/under-strike) can be /used/ in /variable/ names /in/ many /programming/ /languages/, while the /character/ "/" (slash/stroke/solidus) is typically not allowed.
...this expression matches any string of letters, numbers, underscores, or apostrophes (fairly typical idea of a "word" in English) that does not have a / character both before and after it - wrapped with a "/"
\b([\w']+)\b(?<=(?<!/)\1|\1(?!/))
...and is the purest form, using only one character class to define "word" characters. It matches the example as follows:
Matched Not Matched
------------- -------------
The character
_ used
underscore variable
under in
strike programming
can languages
be character
in stroke
names
many
while
the
slash
solidus
is
typically
not
allowed
If excluding /stroke/, is not desired, then adding a bit to the end limitation will allow it, depending upon how you want to define the beginning of a "next" word:
\b([\w']+)\b(?<=(?<!/)\1|\1(?!/([^\w])))
changes (?!/) to (?!/([^\w])), which allows /something/ if it does have a letter, number, or underscore immediately after it. This would move stroke from the "Not Matched" to the "Matched" list, above.
note: \w matches uppercase or lowercase letters, numbers and the underscore character
If you want to alter your concept for "word" from the above, simply exchange the characters and shorthand character classes contained in the [\w'] part of the expression to something like [a-zA-Z'] to exclude digits or [\w'-] to include hyphens, which would capture under-strike as a single match, rather than two separate matches:
\b([\w'-]+)\b(?<=(?<!/)\1|\1(?!/([^\w])))
IMPORTANT ALTERNATIVE!!! (I think)
I just thought of an alternative to Matching any words that are not wrapped with / symbols: simply consume all of these symbols and words that are surrounded in them (splitting). This has a few benefits: no lookaround means this could be used in more contexts (JavaScript does not support lookbehind and some flavors of regex don't support lookaround at all) while increasing efficiency; also, using a split expression means a direct result of a String array:
string input = "The /character/ \"_\" (underscore/under-strike) can be..."; //etc...
// Non-capturing groups are used because Regex.Split adds any captured text to its result.
string[] resultsArray = Regex.Split(input, @"(?:[^\w'-]+?(?:/[\w]+/)?)+");
voila!
