Align space and punctuation with one regex - c#

I have a couple of strings like:
test
test. hi
test, hello.(actual whitespace)
hello -this is a test
hello v2 , i am a " test" as well
I'd like to align punctuation following some set of rules like:
commas should have a trailing space but not a leading one
hyphens should have spaces on both sides if there is at least one space on any side of it in the original string
dots should have a trailing space, except when they are at the end of the string
quotation marks (single and double) should not have spaces on opening/closing sides
etc etc (other rules will be added as needed, covering first 4 is enough for this case)
So the output will be like:
test
test. hi
test, hello.
hello - this is a test
hello v2, i am a "test" as well
My questions are: is it possible to do it in one go, with a single regex instead of creating a regex for each case, and if yes, what would that regex be? Is there a more efficient way of doing it than a single regex (if that is possible), especially considering that I'm already iterating through the whole string to remove some special Unicode characters?

Using the MatchEvaluator delegate version of Regex.Replace, you can use a Regex to find problematic punctuation, and then use conditional logic to return the proper result. This doesn't handle rule 4 - it isn't easy to recognize open versus close quotes in a regular expression.
List<String> src;
var p = new Regex(@"\s*,\s*|\s+-\s*|-\s+|\s*\.\s+(?=.)", RegexOptions.Compiled);
var ans = src.Select(s => p.Replace(s, m => {
    var mv = m.Value.Trim();
    return mv == "," ? ", " : mv == "-" ? " - " : mv == "." ? ". " : mv;
})).ToList();
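As a rough sketch of how the snippet above behaves on the sample strings from the question: the class and method names here (PunctuationAligner, Align) are made up for illustration, and the Trim() call is an added assumption so that trailing whitespace after a final dot is dropped. Rule 4 (quotes) is still not handled.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PunctuationAligner
{
    // Same pattern as in the answer: commas, hyphens that touch a space,
    // and dots that are followed by whitespace plus at least one more character.
    private static readonly Regex Punctuation =
        new Regex(@"\s*,\s*|\s+-\s*|-\s+|\s*\.\s+(?=.)", RegexOptions.Compiled);

    public static string Align(string input)
    {
        // Assumption: leading/trailing whitespace is unwanted, so trim before matching.
        return Punctuation.Replace(input.Trim(), m =>
        {
            var mv = m.Value.Trim();
            return mv == "," ? ", " : mv == "-" ? " - " : mv == "." ? ". " : mv;
        });
    }
}
```

With this, Align("hello -this is a test") gives "hello - this is a test" and Align("hello v2 , i am a \" test\" as well") gives "hello v2, i am a \" test\" as well" (the space inside the quotes is left alone).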

Related

How separate words by excluding square brackets in C# using regex

I have a sentence like "Hello there [Gesture : 2.5] How are you" and I have to separate the words while skipping the whole square-bracketed part. For example: "Hello there How are you".
I tried to separate the words before the colon but that's not what I want. This is the code I've tried.
MatchCollection matches2 = Regex.Matches(avatarVM.AvatarText, @"([\w\s\[\]]+)");
The above code only separates the words before ":", which also includes the opening square bracket and the word after it. I want to skip the whole bracketed section.
Perhaps invert the problem and concentrate on what you want to remove rather than what you want to keep. For example, this will match the brackets and a space either side, and replace with a single space:
// Hello there How are you
var output = Regex.Replace("Hello there [Gesture : 2.5] How are you", @" \[.+\] ", " ");
If required, you could use a slightly more complicated version that can handle the square brackets not necessarily being surrounded by spaces, for example at the start or end of the input string:
var output = Regex.Replace(
    "Hello there How are you [Gesture : 2.5]", // input string
    @"[^\w]{0,1}\[.+\]([^\w]){0,1}", // our pattern to match the brackets and what's in between them
    "$1"); // replace with the first capture group - in this case the character that comes after
And if you wanted to you could use the overload of Replace taking a MatchEvaluator delegate to have more control over how it is replaced in the string and with what depending on what your needs are.
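A minimal sketch of that MatchEvaluator variant, under some assumptions of mine: the names BracketStripper and Strip are hypothetical, and the rule "drop the tag entirely at the start or end of the string, otherwise leave a single space" is just one possible policy.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BracketStripper
{
    // A bracketed tag plus any whitespace directly around it.
    private static readonly Regex Tag = new Regex(@"\s*\[[^\]]*\]\s*");

    public static string Strip(string input)
    {
        return Tag.Replace(input, m =>
        {
            bool atStart = m.Index == 0;
            bool atEnd = m.Index + m.Length == input.Length;
            // At either end of the string nothing needs to be joined, so emit nothing;
            // in the middle, keep one space between the neighbouring words.
            return atStart || atEnd ? "" : " ";
        });
    }
}
```

Because the evaluator sees the Match object, it can decide per occurrence what to emit, which a plain replacement string cannot do.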
Assuming you want the fragments before and after the brackets as separate entries in a collection:
Regex.Split(avatarVM.AvatarText, @"\[[^\]]+\]");
This will also work if there are multiple bracketed fragments in the string. You may want to .Trim() each entry.
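For illustration, the split plus the suggested Trim() might be combined like this. BracketSplitter and Fragments are hypothetical names, and dropping empty fragments is an extra assumption on my part.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BracketSplitter
{
    public static string[] Fragments(string input)
    {
        return Regex.Split(input, @"\[[^\]]+\]")   // cut at every [ ... ] block
            .Select(part => part.Trim())            // remove the spaces left around each cut
            .Where(part => part.Length > 0)         // assumption: empty fragments are not wanted
            .ToArray();
    }
}
```

For "Hello there [Gesture : 2.5] How are you" this yields the two entries "Hello there" and "How are you".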
var output = Regex.Replace("Hello there [Gesture : 2.5] How are you", @"\[.*?\] ", string.Empty);

C# Regex How to replace multiple quotation marks respectively in a string with « and »

I want to replace the quotation marks in some strings. However, opening and closing quotes have to be replaced with « and » respectively. It is not guaranteed that the string will begin and end with quotation marks.
For example I have this string:
"THIS IS "inner1" THE MAIN "inner2" SENTENCE"
I want to change it to:
«THIS IS «inner1» THE MAIN «inner2» SENTENCE»
SOLUTION:
With much help from musefan (the code is a bit different from his original solution, since it is not guaranteed that the string will begin and end with quotation marks). It does not link the pairs of quotation marks in any way; instead it replaces quotes that follow or are followed by whitespace, and then checks the first and last characters of the string and replaces them if necessary.
using System;
public class Test
{
    public static void Main()
    {
        string input = "\"THIS IS \"inner1\" THE MAIN \"inner2\" SENTENCE\"";
        string result = input;
        //Replace quotes that follow space with « and replace quotes that precede space with »
        result = result.Replace(" \"", " «").Replace("\" ", "» ");
        //if first character is " then replace with «
        if (result.Substring(0, 1) == "\"")
            result = "«" + result.Substring(1);
        //get last character of the string
        char last = result[result.Length - 1];
        //if it is " then replace it with »
        if (last.ToString() == "\"")
            result = result.Remove(result.Length - 1) + "»";
        Console.WriteLine(result);
    }
}
The main problem is: how do you know when a quote should be the start of a new set, or the end of an existing one? There are many possible use cases that might require different handling.
So, I have made the assumption that you are going to use space characters to work out if the quote is the start of a new set, or if it is the end of an existing one. The reason for this assumption is that it is the most obvious logic to ensure you get the desired result.
With that in mind, it becomes very simple:
// First remove the outer quotes, we will manually change them at the end.
string result = input.Substring(1, input.Length - 2);
// Replace quotes that follow space with « and replace quotes that precede space with »
result = result.Replace(" \"", " «").Replace("\" ", "» ");
// Add the outer chevrons around the result.
result = string.Format("«{0}»", result);
Here is a working example.
Disclaimer: Please keep in mind that this answer is provided based on the sample data you have given. There are many possible inputs where it may be required to re-think the rules/logic in order to achieve the desired result. However, I cannot cater for that without knowing those additional requirements.
Feel free to edit your question if you have more specific requirements and I will try to update my answer, however you may need to prompt me with a comment so I know you have changed your requirements.
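The same whitespace-based assumption can also be written as a single regex, which covers the start and end of the string without separate checks. This is only a sketch: QuoteConverter and ToGuillemets are hypothetical names, and a quote with non-whitespace characters on both sides is deliberately left untouched.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteConverter
{
    // open:  a quote at the start of the string or right after whitespace
    // close: a quote right before whitespace or at the end of the string
    private static readonly Regex Quote =
        new Regex(@"(?<open>(?<=^|\s)"")|(?<close>""(?=\s|$))");

    public static string ToGuillemets(string input)
    {
        // The named group that succeeded tells us which kind of quote was matched.
        return Quote.Replace(input, m => m.Groups["open"].Success ? "«" : "»");
    }
}
```

Like the answers above, this relies on spaces to tell opening from closing quotes, so it shares their limitations on inputs that do not follow that convention.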

How to use regex to match anything from A to B, where B is not preceded by C

I'm having a hard time with this one. First off, here is the difficult part of the string I'm matching against:
"a \"b\" c"
What I want to extract from this is the following:
a \"b\" c
Of course, this is just a substring from a larger string, but everything else works as expected. The problem is making the regex ignore the quotes that are escaped with a backslash.
I've looked into various ways of doing it, but nothing has gotten me the correct results. My most recent attempt looks like this:
"((\"|[^"])+?)"
In various tests online, this works the way it should - but when I build my ASP.NET page, it cuts off at the first ", leaving me with just the letter a, a white space and a backslash.
The logic behind the pattern above is to capture all instances of \" or something that is not ". I was hoping this would search for \", making sure to find those first - but I got the feeling that this is overridden by the second part of the expression, which is only 1 single character. A single backslash does not match 2 characters (\"), but it will match as a non-". And from there, the next character will be a single ", and the matching is completed. (This is just my hypothesis on why my pattern is failing.)
Any pointers on this one? I have tried various combinations with "look"-methods in regex, but I didn't really get anywhere. I also get the feeling that is what I need.
ORIGINAL ANSWER
To match a string like a \"b\" c, you need to use following regex declaration:
(?:\\"|[^"])+
var rx = new Regex(@"(?:\\""|[^""])+");
See RegexStorm demo
Here is an IDEONE demo:
var str = "a \\\"b\\\" c";
Console.WriteLine(str);
var rx = new Regex(@"(?:\\""|[^""])+");
Console.WriteLine(rx.Match(str).Value);
Please note the @ in front of the string literal: it makes it a verbatim string literal, in which we double the quotes to match literal quotes and use single escape backslashes instead of double ones. This makes regexps easier to read and maintain.
If you want to match any escaped entities in your input string, you can use:
var rx = new Regex(@"[^""\\]*(?:\\.[^""\\]*)*");
See demo on RegexStorm
UPDATE
To match the quoted strings, just add quotes around the pattern:
var rx = new Regex(@"""(?<res>[^""\\]*(?:\\.[^""\\]*)*)""");
This pattern yields much better performance than Tim Long's suggested regex, see RegexHero test results:
The following expression worked for me:
"(?<Result>(\\"|.)*)"
The expression matches as follows:
An opening quote (literal ")
A named capture (?<name>pattern) consisting of:
Zero or more occurrences * of literal \" or (|) any single character (.)
A final closing quote (literal ")
Note that the * (zero or more) quantifier is greedy: it first consumes as much as it can and then backtracks until the final quote can be matched by the literal " rather than by the "any single character" . part, so the match ends at the last quote on the line.
I used ReSharper 9's built-in Regular Expression validator to develop the expression and verify the results:
I have used the "Explicit Capture" option to reduce cruft in the output (RegexOptions.ExplicitCapture).
One thing to note is that I am matching the whole string, but I am only capturing the substring, using a named capture. Using named captures is a really useful way to get at the results you want. In code, it might look something like this:
static string MatchQuotedString(string input)
{
    const string pattern = @"""(?<Result>(\\""|.)*)""";
    const RegexOptions options = RegexOptions.ExplicitCapture;
    Regex regex = new Regex(pattern, options);
    var matches = regex.Match(input);
    var substring = matches.Groups["Result"].Value;
    return substring;
}
Optimization: If you are planning on using the regex a lot, you could factor it out into a field and use the RegexOptions.Compiled option. This pre-compiles the expression and gives you faster throughput at the expense of longer initialization.
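A sketch of what that might look like, here using the unrolled pattern from the other answer rather than the one directly above (my choice, not something either answer prescribes). QuotedStringMatcher and Extract are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedStringMatcher
{
    // Built once and reused; RegexOptions.Compiled trades start-up time for faster matching.
    private static readonly Regex Quoted = new Regex(
        @"""(?<res>[^""\\]*(?:\\.[^""\\]*)*)""",
        RegexOptions.Compiled);

    public static string Extract(string input)
    {
        // Returns the text between the first pair of unescaped quotes,
        // or an empty string when there is no match.
        return Quoted.Match(input).Groups["res"].Value;
    }
}
```

Calling Extract on the text x = "a \"b\" c" rest returns a \"b\" c, with the escaped quotes kept as part of the result.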

Regex - Get all words that are not wrapped with a "/"

I'm really trying to learn regex so here it goes.
I would really like to get all words in a string which do not have a "/" on either side.
For example, I need to do this to:
"Hello Great /World/"
I need to have the results:
"Hello"
"Great"
Is this possible in regex? If so, how do I do it? I think I would like the results to be stored in a string array :)
Thank you
Just use this regular expression \b(?<!/)\w+(?!/)\b:
var str = "Hello Great /World/ /I/ am great too";
var words = Regex.Matches(str, @"\b(?<!/)\w+(?!/)\b")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();
This will get you:
Hello
Great
am
great
too
var newstr = Regex.Replace("Hello Great /World/", @"/(\w+?)/", "");
If you really want an array of strings
var words = Regex.Matches(newstr, @"\w+")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();
I would first split the string into the array, then filter out matching words. This solution might also be cleaner than a big regexp, because you can spot the requirements for "word" and the filter better.
The big regexp solution would be something like word boundary - not a slash - many no-whitespaces - not a slash - word boundary.
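The split-then-filter idea could be sketched roughly as follows. SlashFilter and UnwrappedWords are hypothetical names, and the sketch assumes words are separated by single spaces and that "wrapped" means a slash at both ends of the token.

```csharp
using System;
using System.Linq;

public static class SlashFilter
{
    public static string[] UnwrappedWords(string input)
    {
        return input
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            // Drop tokens that start and end with a slash, e.g. "/World/".
            .Where(word => !(word.StartsWith("/", StringComparison.Ordinal)
                          && word.EndsWith("/", StringComparison.Ordinal)))
            .ToArray();
    }
}
```

For "Hello Great /World/" this returns the array { "Hello", "Great" }. Punctuation attached to a word (for example "/World/,") would need extra handling.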
I would use a regex replace to replace all /[a-zA-Z]+/ with '' (nothing), then get all words.
Try this one:
(\s(?<!/)([A-Za-z]+)(?!/))|((?<!/)([A-Za-z]+)(?!/)\s)
Using this example excerpt:
The /character/ "_" (underscore/under-strike) can be /used/ in /variable/ names /in/ many /programming/ /languages/, while the /character/ "/" (slash/stroke/solidus) is typically not allowed.
...this expression matches any string of letters, numbers, underscores, or apostrophes (a fairly typical idea of a "word" in English) that does not have a / character both before and after it, that is, any word not wrapped in "/":
\b([\w']+)\b(?<=(?<!/)\1|\1(?!/))
...and is the purest form, using only one character class to define "word" characters. It matches the example as follows:
Matched         Not Matched
-------------   -------------
The             character
_               used
underscore      variable
under           in
strike          programming
can             languages
be              character
in              stroke
names
many
while
the
slash
solidus
is
typically
not
allowed
If excluding /stroke/ is not desired, then adding a bit to the end limitation will allow it, depending upon how you want to define the beginning of a "next" word:
\b([\w']+)\b(?<=(?<!/)\1|\1(?!/([^\w])))
This changes (?!/) to (?!/([^\w])), which allows /something/ if it has a letter, number, or underscore immediately after it. This would move stroke from the "Not Matched" list to the "Matched" list above.
Note: \w matches uppercase and lowercase letters, digits, and the underscore character.
If you want to alter your concept for "word" from the above, simply exchange the characters and shorthand character classes contained in the [\w'] part of the expression to something like [a-zA-Z'] to exclude digits or [\w'-] to include hyphens, which would capture under-strike as a single match, rather than two separate matches:
\b([\w'-]+)\b(?<=(?<!/)\1|\1(?!/([^\w])))
IMPORTANT ALTERNATIVE!!! (I think)
I just thought of an alternative to Matching any words that are not wrapped with / symbols: simply consume all of these symbols and words that are surrounded in them (splitting). This has a few benefits: no lookaround means this could be used in more contexts (JavaScript does not support lookbehind and some flavors of regex don't support lookaround at all) while increasing efficiency; also, using a split expression means a direct result of a String array:
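string input = "The /character/ \"_\" (underscore/under-strike) can be..."; //etc...
// Non-capturing groups: Regex.Split would otherwise add the captured text to the result array.
string[] resultsArray = Regex.Split(input, @"(?:[^\w'-]+?(?:/[\w]+/)?)+");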
voila!

Format String : Parsing

I have a parsing question. I have a paragraph which has instances of :  word  . So basically it has a colon, two spaces, a word (could be anything), then two more spaces.
So when I have those instances I want to convert the string so I have
A new line character after : and the word.
Removed the double space after the word.
Replace all double spaces with new line characters.
I don't know exactly how to go about this. I'm using C#. Bullet point 2 above is the part I'm having a hard time with.
Thanks
Assuming your original string is exactly in the form you described, this will do:
var newString = myString.Trim().Replace("  ", "\n"); // the first argument is two spaces
The Trim() removes leading and trailing whitespace, taking care of your spaces at the end of the string.
Then, the Replace replaces each remaining "  " (two space characters) with a "\n" new line character.
The result is assigned to the newString variable. This is needed, as myString will not change - as strings in .NET are immutable.
I suggest you read up on the String class and all its methods and properties.
You can try
var str = ":  first  :  second  "; // two spaces after each colon and after each word
var result = Regex.Replace(str, ":\\s{2}(?<word>[a-zA-Z0-9]+)\\s{2}",
    ":\n${word}\n");
Using RegularExpressions will give you exact matches on what you are looking for.
The regex match for a colon, two spaces, a word, then two more spaces is:
Dim reg As New Regex(":  [a-zA-Z]*  ") ' two spaces on each side of the word
[a-zA-Z] will look for any character within the alphabetical range. You can append 0-9 as well if you accept numbers within the word. The * afterwards indicates that there can be 0 or more instances of the preceding value.
[a-zA-Z]* will attempt to do a full match of any set of contiguous alpha characters.
Upon further reading, you may use [\w] in place of [a-zA-Z0-9] if that's what you are looking for. This will match any 'word' character.
source: http://msdn.microsoft.com/en-us/library/ms972966.aspx
You can retrieve all the matches using reg.Matches(inputString).
Review http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.replace.aspx for more information on regular expression replacements and your options from there out
edit: Before I was using \s to search for spaces. This will match any whitespace character including tabs, new lines and other. That is not what we want, so I reverted it back to search for exact space characters.
You can use string.TrimEnd - http://msdn.microsoft.com/en-us/library/system.string.trimend.aspx - to trim spaces at the end of the string.
The following is an example using Regular Expressions. See also this question for more info.
Basically the pattern string tells the regex to look for a colon followed by two spaces. Then we save in a capture group named "word" whatever the word is surrounded by two spaces on either side. Finally two more spaces are specified to finish the pattern.
The replace uses a lambda which says for every match, replace it with a colon, a new line, the "lone" word, and another newline.
string Paragraph = "Jackdaws love my big sphinx of quartz:  fizz  The quick onyx goblin jumps over the lazy dwarf. Where:  buzz  The crazy dogs.";
string Pattern = @":  (?<word>\S*)  "; // colon, two spaces, the word, two spaces
string Result = Regex.Replace(Paragraph, Pattern, m =>
{
    var LoneWord = m.Groups["word"].Value;
    return ":" + Environment.NewLine + LoneWord + Environment.NewLine;
},
RegexOptions.IgnoreCase);
Input
Jackdaws love my big sphinx of quartz:  fizz  The quick onyx goblin jumps over the lazy dwarf. Where:  buzz  The crazy dogs.
Output
Jackdaws love my big sphinx of quartz:
fizz
The quick onyx goblin jumps over the lazy dwarf. Where:
buzz
The crazy dogs.
Note, for item 3 on your list, if you also want to replace individual occurrences of two spaces with newlines, you could do this:
Result = Result.Replace("  ", Environment.NewLine); // the first argument is two spaces
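Putting the two steps together, a compact sketch might look like this. ColonWordFormatter and Format are hypothetical names; the pattern writes the double spaces as " {2}" so they cannot be lost to whitespace collapsing, and "\n" is used instead of Environment.NewLine purely to keep the result platform-independent.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ColonWordFormatter
{
    public static string Format(string text)
    {
        // Step 1: colon, two spaces, a word, two spaces -> colon, newline, word, newline.
        string result = Regex.Replace(text, @": {2}(?<word>\S+) {2}", ":\n${word}\n");
        // Step 2: every remaining pair of spaces becomes a newline.
        return result.Replace(new string(' ', 2), "\n");
    }
}
```

The order matters here: running step 2 first would consume the double spaces that step 1 relies on to recognise the colon-word pattern.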
