How to separate words by excluding square brackets in C# using regex

I have a sentence like "Hello there [Gesture : 2.5] How are you" and I have to separate the words while skipping the whole square-bracketed part. For example, "Hello there How are you".
I tried to separate the words before the colon but that's not what I want. This is the code I've tried.
MatchCollection matches2 = Regex.Matches(avatarVM.AvatarText, @"([\w\s\[\]]+)");
The above code only separates the words before ":", which also includes the opening square bracket and the word after it. I want to avoid the whole square brackets.

Perhaps invert the problem and concentrate on what you want to remove rather than what you want to keep. For example, this will match the brackets and a space either side, and replace with a single space:
// Hello there How are you
var output = Regex.Replace("Hello there [Gesture : 2.5] How are you", @" \[.+\] ", " ");
If required, you could use a slightly more complicated version that can handle the square brackets not necessarily being surrounded by spaces, for example at the start or end of the input string:
var output = Regex.Replace(
    "Hello there How are you [Gesture : 2.5]", // input string
    @"[^\w]{0,1}\[.+\]([^\w]){0,1}", // our pattern to match the brackets and what's in between them
    "$1"); // replace with the first capture group - in this case the character that comes after
And if you wanted to you could use the overload of Replace taking a MatchEvaluator delegate to have more control over how it is replaced in the string and with what depending on what your needs are.
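As a rough illustration of the MatchEvaluator overload mentioned above, here is a minimal sketch. The class and method names (BracketStripper, Strip) are made up for this example, and the rule it applies (a bracketed fragment between words becomes one space, a fragment at either end of the input is dropped) is an assumption about what you might want, not the only sensible choice.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BracketStripper
{
    // Hypothetical helper: removes every [...] fragment together with the
    // whitespace around it. The lambda is the MatchEvaluator and decides the
    // replacement per match: nothing at the start or end of the input,
    // a single space anywhere else.
    public static string Strip(string input)
    {
        return Regex.Replace(input, @"\s*\[[^\]]*\]\s*", m =>
        {
            bool atStart = m.Index == 0;
            bool atEnd = m.Index + m.Length == input.Length;
            return (atStart || atEnd) ? string.Empty : " ";
        });
    }

    public static void Main()
    {
        Console.WriteLine(Strip("Hello there [Gesture : 2.5] How are you"));
        Console.WriteLine(Strip("[Gesture : 2.5] Hello there"));
    }
}
```

Because the pattern uses [^\]]* rather than .+, it should also cope with several bracketed fragments in one string without swallowing the text between them.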

Assuming you want the fragments before and after the brackets as separate entries in a collection:
Regex.Split(avatarVM.AvatarText, @"\[[^\]]+\]");
This will also work if there are multiple bracketed fragments in the string. You may want to .Trim() each entry.
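A small sketch of the split-and-trim idea, in case it helps. The names FragmentSplitter and SplitOutsideBrackets are invented for the example, and dropping empty entries is an extra assumption on my part (a bracket at the very end of the input otherwise leaves an empty string in the result).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class FragmentSplitter
{
    // Hypothetical helper: splits on every [...] fragment, trims each piece,
    // and drops pieces that are empty after trimming.
    public static string[] SplitOutsideBrackets(string input)
    {
        return Regex.Split(input, @"\[[^\]]+\]")
            .Select(part => part.Trim())
            .Where(part => part.Length > 0)
            .ToArray();
    }

    public static void Main()
    {
        string text = "Hello there [Gesture : 2.5] How are you [Wave : 1]";
        foreach (string part in SplitOutsideBrackets(text))
        {
            Console.WriteLine(part);
        }
    }
}
```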

var output = Regex.Replace("Hello there [Gesture : 2.5] How are you", @"\[.*?\] ", string.Empty);

Related

How to do the complicated replacement using regex in C#

I encountered a deeply nested curly braces string, like this:
{{{text1},{text2}},{{text3},{text4}}}
I just want to keep the innermost curly braces and replace the other curly braces with square brackets, so the result will look like this:
[[{text1},{text2}],[{text3},{text4}]]
How can I do this replacement with the Regex.Replace() function in C#?
thanks
This takes two replacements: first replace every { that is followed by another { with [, and second replace every } that is preceded by a non-word boundary \B with ]. Try this C# code,
string input = "{{{text1},{text2}},{{text3},{text4}}}";
Regex regex = new Regex("{(?={)");
string result = regex.Replace(input, "[");
regex = new Regex("\\B}");
result = regex.Replace(result, "]");
Console.WriteLine("Result: " + result);
Prints,
Result: [[{text1},{text2}],[{text3},{text4}]]
You can also use a positive lookbehind, (?<=})}, instead of \\B} for the second replacement. I deliberately avoided it to keep the solution simple and to make it work in languages that don't support lookbehinds, but (?<=})} is strictly better than \\B}, because \\B} only works as intended when the text inside the innermost braces ends with a word character. Choose as you like.
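For comparison, a minimal sketch of the lookbehind variant. BraceConverter and ToBrackets are names made up for this example; the second input is an assumed edge case (inner text ending in punctuation) where the \B version would also convert the inner closing braces.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BraceConverter
{
    // Hypothetical helper: keeps the innermost {...} pairs and turns every
    // outer brace into a square bracket.
    public static string ToBrackets(string input)
    {
        // An opening brace directly followed by another opening brace is an outer one.
        string step1 = Regex.Replace(input, @"\{(?=\{)", "[");
        // A closing brace directly preceded by another closing brace is an outer one.
        return Regex.Replace(step1, @"(?<=\})\}", "]");
    }

    public static void Main()
    {
        Console.WriteLine(ToBrackets("{{{text1},{text2}},{{text3},{text4}}}"));
        Console.WriteLine(ToBrackets("{{a!},{b?}}"));
    }
}
```

Note that the lookbehind inspects the original input, not the partially replaced output, which is why a run of three closing braces still turns into one } followed by two ].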

Regex to find anything after ']' and before '['

I have a regex working to find anything between the square brackets in a text file, which is this:
Regex squareBrackets = new Regex(@"\[(.*?)\]");
And I want to create a regex that is basically the opposite way round to select whatever is after what's in the square brackets. So I thought just swap them round?
Regex textAfterTitles = new Regex(@"\](.*?)\[");
But this does not work and Regex's confuse me - can anyone help?
Cheers
You can use a lookbehind:
var textAfterTitles = new Regex(@"(?<=\[(.*?)\]).*");
You can combine it with a negated character class if you have multiple such bracketed groups. For example, given:
var input = "before [one] inside [two] after";
if you wanted to match " inside " and " after", you could do this:
new Regex(@"(?<=\[(.*?)\])[^\[]*");
The same \[(.*?)] regex (I'd just remove the redundant escaping backslash before ]), or the even better \[([^]]*)], can be used to split the text and get the text outside [...] (if used with the RegexOptions.ExplicitCapture option):
var data = "A bracket is a tall punctuation mark[1] typically used in matched pairs within text,[2] to set apart or interject other text.";
Console.WriteLine(String.Join("\n", Regex.Split(data, @"\[([^]]*)]", RegexOptions.ExplicitCapture)));
Output of the C# demo:
A bracket is a tall punctuation mark
typically used in matched pairs within text,
to set apart or interject other text.
The RegexOptions.ExplicitCapture flag makes the capturing group inside the pattern non-capturing, and thus, the captured text is not output into the resulting split array.
If you do not have to keep the same regex, just remove the capture group, use \[[^]]*] for splitting.
You can try this one
\]([^\]]*)\[

Regex - Get all words that are not wrapped with a "/"

I'm really trying to learn regex, so here goes.
I would really like to get all words in a string which do not have a "/" on either side.
For example, I need to do this to:
"Hello Great /World/"
I need to have the results:
"Hello"
"Great"
Is this possible in regex, and if so, how do I do it? I think I would like the results to be stored in a string array :)
Thank you
Just use this regular expression \b(?<!/)\w+(?!/)\b:
var str = "Hello Great /World/ /I/ am great too";
var words = Regex.Matches(str, @"\b(?<!/)\w+(?!/)\b")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();
This will get you:
Hello
Great
am
great
too
var newstr = Regex.Replace("Hello Great /World/", @"/(\w+?)/", "");
If you really want an array of strings:
var words = Regex.Matches(newstr, @"\w+")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();
I would first split the string into the array, then filter out matching words. This solution might also be cleaner than a big regexp, because you can spot the requirements for "word" and the filter better.
The big regexp solution would be something like word boundary - not a slash - many no-whitespaces - not a slash - word boundary.
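A minimal sketch of the split-then-filter approach, under two assumptions that are mine rather than the question's: words are separated by single spaces, and "wrapped" means the token both starts and ends with "/". The names SlashFilter and UnwrappedWords are hypothetical.

```csharp
using System;
using System.Linq;

public static class SlashFilter
{
    // Hypothetical helper: splits on spaces, then drops every token that is
    // wrapped in slashes (starts AND ends with '/').
    public static string[] UnwrappedWords(string input)
    {
        return input
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(token => !(token[0] == '/' && token[token.Length - 1] == '/'))
            .ToArray();
    }

    public static void Main()
    {
        foreach (string word in UnwrappedWords("Hello Great /World/ /I/ am great too"))
        {
            Console.WriteLine(word);
        }
    }
}
```

If a slash on just one side should also disqualify a word, changing && to || in the filter is the only edit needed, which is the kind of readability the split-first approach is meant to buy.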
I would use a regex replace to replace all /[a-zA-Z]+/ with '' (nothing), then get all words.
Try this one:
(\s(?<!/)([A-Za-z]+)(?!/))|((?<!/)([A-Za-z]+)(?!/)\s)
Using this example excerpt:
The /character/ "_" (underscore/under-strike) can be /used/ in /variable/ names /in/ many /programming/ /languages/, while the /character/ "/" (slash/stroke/solidus) is typically not allowed.
...this expression matches any string of letters, numbers, underscores, or apostrophes (a fairly typical idea of a "word" in English) that is not wrapped with "/", that is, it does not have a / character both before and after it:
\b([\w']+)\b(?<=(?<!/)\1|\1(?!/))
...and is the purest form, using only one character class to define "word" characters. It matches the example as follows:
Matched        Not Matched
-------------  -------------
The            character
_              used
underscore     variable
under          in
strike         programming
can            languages
be             character
in             stroke
names
many
while
the
slash
solidus
is
typically
not
allowed
If excluding /stroke/ is not desired, then adding a bit to the end limitation will allow it, depending upon how you want to define the beginning of a "next" word:
\b([\w']+)\b(?<=(?<!/)\1|\1(?!/([^\w])))
This changes (?!/) to (?!/([^\w])), which allows /something/ if it does have a letter, number, or underscore immediately after it. This would move stroke from the "Not Matched" to the "Matched" list, above.
note: \w matches uppercase or lowercase letters, numbers and the underscore character
If you want to alter your concept for "word" from the above, simply exchange the characters and shorthand character classes contained in the [\w'] part of the expression to something like [a-zA-Z'] to exclude digits or [\w'-] to include hyphens, which would capture under-strike as a single match, rather than two separate matches:
\b([\w'-]+)\b(?<=(?<!/)\1|\1(?!/([^\w])))
IMPORTANT ALTERNATIVE!!! (I think)
I just thought of an alternative to matching any words that are not wrapped with / symbols: simply consume all of these symbols and the words that are surrounded by them (splitting). This has a few benefits: no lookaround means this could be used in more contexts (older JavaScript engines do not support lookbehind and some flavors of regex don't support lookaround at all) while increasing efficiency; also, using a split expression means a direct result of a String array:
string input = "The /character/ \"_\" (underscore/under-strike) can be..."; //etc...
// Non-capturing groups (?:...) are used because Regex.Split adds any captured text to the result array.
string[] resultsArray = Regex.Split(input, @"(?:[^\w'-]+?(?:/[\w]+/)?)+");
voila!

Format String : Parsing

I have a parsing question. I have a paragraph which has instances of :  word  . So basically it has a colon, two spaces, a word (could be anything), then two more spaces.
So when I have those instances, I want to convert the string so that:
1. There is a new line character after the : and after the word.
2. The double space after the word is removed.
3. All double spaces are replaced with new line characters.
I don't know exactly how to go about this. I'm using C#. Bullet point 2 above is what I'm having a hard time with.
Thanks
Assuming your original string is exactly in the form you described, this will do:
var newString = myString.Trim().Replace("  ", "\n");
The Trim() removes leading and trailing whitespace, taking care of the spaces at the end of the string.
Then, the Replace replaces each remaining "  " (two space characters) with a "\n" new line character.
The result is assigned to the newString variable. This is needed, as myString will not change - as strings in .NET are immutable.
I suggest you read up on the String class and all its methods and properties.
You can try
var str = ":  first  :  second  ";
var result = Regex.Replace(str, ":\\s{2}(?<word>[a-zA-Z0-9]+)\\s{2}",
":\n${word}\n");
Using RegularExpressions will give you exact matches on what you are looking for.
The regex match for a colon, two spaces, a word, then two more spaces is:
Dim reg As New Regex(":  [a-zA-Z]*  ")
[a-zA-Z] will look for any character within the alphabetical range. You can append 0-9 as well if you accept numbers within the word. The * afterwards indicates that there can be 0 or more instances of the preceding value.
[a-zA-Z]* will attempt to do a full match of any set of contiguous alpha characters.
Upon further reading, you may use [\w] in place of [a-zA-Z0-9] if that's what you are looking for. This will match any 'word' character.
source: http://msdn.microsoft.com/en-us/library/ms972966.aspx
You can retrieve all the matches using reg.Matches(inputString).
Review http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.replace.aspx for more information on regular expression replacements and your options from there out
edit: Before I was using \s to search for spaces. This will match any whitespace character including tabs, new lines and other. That is not what we want, so I reverted it back to search for exact space characters.
You can use string.TrimEnd - http://msdn.microsoft.com/en-us/library/system.string.trimend.aspx - to trim spaces at the end of the string.
The following is an example using Regular Expressions. See also this question for more info.
Basically the pattern string tells the regex to look for a colon followed by two spaces. Then we save in a capture group named "word" whatever the word is surrounded by two spaces on either side. Finally two more spaces are specified to finish the pattern.
The replace uses a lambda which says for every match, replace it with a colon, a new line, the "lone" word, and another newline.
string Paragraph = "Jackdaws love my big sphinx of quartz:  fizz  The quick onyx goblin jumps over the lazy dwarf. Where:  buzz  The crazy dogs.";
string Pattern = @":  (?<word>\S*)  ";
string Result = Regex.Replace(Paragraph, Pattern, m =>
    {
        // The named group "word" holds the lone word between the double spaces.
        var LoneWord = m.Groups["word"].Value;
        return ":" + Environment.NewLine + LoneWord + Environment.NewLine;
    },
    RegexOptions.IgnoreCase);
Input
Jackdaws love my big sphinx of quartz:  fizz  The quick onyx goblin jumps over the lazy dwarf. Where:  buzz  The crazy dogs.
Output
Jackdaws love my big sphinx of quartz:
fizz
The quick onyx goblin jumps over the lazy dwarf. Where:
buzz
The crazy dogs.
Note, for item 3 on your list, if you also want to replace individual occurrences of two spaces with newlines, you could do this:
Result = Result.Replace("  ", Environment.NewLine);

Ensure a whitespace exists after any occurrence of a specific char in a string

if anyone can help me with this one...
I have a string named text. I need to ensure that after every occurrence of a specific character " there is a whitespace. If there is not a whitespace, then I need to insert one.
I am not sure on the best approach to accomplish this in c#, I think regular expressions may be the way to go, but I do not have enough knowledge about regular expressions to do this...
If anyone can help it would be appreciated.
// rule: all 'a's must be followed by space.
// 'a's that are already followed by space must
// remain the same.
String text = "banana is a fruit";
text = Regex.Replace(text, @"a(?!\s)", x => x + " ");
// at this point, text contains: ba na na is a fruit
The regular expression a(?!\s) searches for an 'a' that is not followed by whitespace. The lambda expression x => x + " " tells the replace function to replace any occurrence of 'a' without following whitespace with 'a' followed by a space.
So, assuming you have your string:
string text = "12345,123523, 512,5,23, 18";
I assume here to put a space after each comma that has no whitespace after it. Adapt as you like.
You can use a regular expression replace:
text = Regex.Replace(text, @",(?!\s)", ", ");
This regular expression just searches for a comma that is not followed by any whitespace (space, tab, etc.) and replaces it by the same comma and a single space.
We can do a little better, still:
text = Regex.Replace(text, @"(?<=,)(?!\s)", " ");
which leaves the comma intact and only replaces the empty space between a comma and a following non-whitespace character by a single space. Essentially it just inserts the new space, if needed.
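To tie this back to the original question, where the character is a double quote rather than a comma, here is a minimal sketch that builds the same zero-width pattern for an arbitrary character. SpaceEnsurer and EnsureSpaceAfter are hypothetical names; Regex.Escape is used so that characters with a special meaning in regex (such as "." or "?") are treated literally.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpaceEnsurer
{
    // Hypothetical helper: inserts a single space after every occurrence of
    // 'marker' that is not already followed by whitespace. The pattern matches
    // the empty position after the marker, so the marker itself is untouched.
    public static string EnsureSpaceAfter(string text, char marker)
    {
        string pattern = "(?<=" + Regex.Escape(marker.ToString()) + @")(?!\s)";
        return Regex.Replace(text, pattern, " ");
    }

    public static void Main()
    {
        Console.WriteLine(EnsureSpaceAfter("12345,123523, 512,5,23, 18", ','));
        Console.WriteLine(EnsureSpaceAfter("a\"b\" c", '"'));
    }
}
```

One thing to be aware of: a marker at the very end of the string also gets a space appended, since the end of input is not whitespace. Whether that is desirable depends on your data.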
