Format String : Parsing

I have a parsing question. I have a paragraph which has instances of :  word  . So basically it has a colon, two spaces, a word (could be anything), then two more spaces.
When I find those instances, I want to convert the string so that:
1. There is a new line character after the : and after the word.
2. The double space after the word is removed.
3. All remaining double spaces are replaced with new line characters.
I don't know exactly how to go about this. I'm using C#. Item 2 above is the part I'm having a hard time with.
Thanks

Assuming your original string is exactly in the form you described, this will do:
var newString = myString.Trim().Replace("  ", "\n");
The Trim() removes leading and trailing whitespace, taking care of the spaces at the end of the string.
Then, the Replace replaces each remaining "  " (two space characters) with a "\n" new line character.
The result is assigned to the newString variable. This is needed because myString itself will not change: strings in .NET are immutable.
I suggest you read up on the String class and all its methods and properties.

You can try
var str = ":  first  :  second  ";
var result = Regex.Replace(str, ":\\s{2}(?<word>[a-zA-Z0-9]+)\\s{2}",
    ":\n${word}\n");

Using RegularExpressions will give you exact matches on what you are looking for.
The regex match for a colon, two spaces, a word, then two more spaces is:
Dim reg As New Regex(":  [a-zA-Z]*  ")
[a-zA-Z] will look for any character within the alphabetical range. You can append 0-9 as well if you accept numbers within the word. The * afterwards indicates that there can be 0 or more instances of the preceding value.
[a-zA-Z]* will attempt to do a full match of any set of contiguous alpha characters.
Upon further reading, you may use [\w] in place of [a-zA-Z0-9] if that's what you are looking for. This will match any 'word' character.
source: http://msdn.microsoft.com/en-us/library/ms972966.aspx
You can retrieve all the matches using reg.Matches(inputString).
Review http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.replace.aspx for more information on regular expression replacements and your options from there out
edit: Before I was using \s to search for spaces. This will match any whitespace character including tabs, new lines and other. That is not what we want, so I reverted it back to search for exact space characters.

You can use string.TrimEnd - http://msdn.microsoft.com/en-us/library/system.string.trimend.aspx - to trim spaces at the end of the string.

The following is an example using Regular Expressions. See also this question for more info.
Basically the pattern string tells the regex to look for a colon followed by two spaces. Then we save in a capture group named "word" whatever the word is surrounded by two spaces on either side. Finally two more spaces are specified to finish the pattern.
The replace uses a lambda which says for every match, replace it with a colon, a new line, the "lone" word, and another newline.
string Paragraph = "Jackdaws love my big sphinx of quartz:  fizz  The quick onyx goblin jumps over the lazy dwarf. Where:  buzz  The crazy dogs.";
string Pattern = @":  (?<word>\S*)  ";
string Result = Regex.Replace(Paragraph, Pattern, m =>
{
    // Read the captured word by its group name.
    var LoneWord = m.Groups["word"].Value;
    return ":" + Environment.NewLine + LoneWord + Environment.NewLine;
},
RegexOptions.IgnoreCase);
Input
Jackdaws love my big sphinx of quartz:  fizz  The quick onyx goblin jumps over the lazy dwarf. Where:  buzz  The crazy dogs.
Output
Jackdaws love my big sphinx of quartz:
fizz
The quick onyx goblin jumps over the lazy dwarf. Where:
buzz
The crazy dogs.
Note, for item 3 on your list, if you also want to replace individual occurrences of two spaces with newlines, you could do this:
Result = Result.Replace("  ", Environment.NewLine);
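Putting the three requested steps together, a minimal sketch could look like the following. The class and method names are made up for illustration, "\n" is used instead of Environment.NewLine so the result is the same on every platform, and \S+ assumes the word contains no whitespace.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ColonWordFormatter
{
    // Hypothetical helper, not part of any library.
    public static string Format(string paragraph)
    {
        // Items 1 and 2: colon, two spaces, a word, two spaces
        // becomes colon, newline, word, newline.
        string step = Regex.Replace(paragraph, @":  (?<word>\S+)  ", ":\n${word}\n");

        // Item 3: every remaining double space becomes a newline.
        return step.Replace("  ", "\n");
    }

    public static void Main()
    {
        Console.WriteLine(Format("Where:  buzz  The crazy  dogs."));
    }
}
```

The regex pass runs first so that the double spaces around the word are consumed by the pattern before the plain Replace touches whatever double spaces are left.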

Related

How separate words by excluding square brackets in C# using regex

I have a sentence like "Hello there [Gesture : 2.5] How are you" and I have to separate the words while skipping everything inside the square brackets. For example "Hello there How are you".
I tried to separate the words before the colon but that's not what I want. This is the code I've tried.
MatchCollection matches2 = Regex.Matches(avatarVM.AvatarText, @"([\w\s\[\]]+)");
The above code only separates the words before ":", and the match also includes the opening square bracket and the word after it. I want to skip the square brackets and their contents entirely.

Perhaps invert the problem and concentrate on what you want to remove rather than what you want to keep. For example, this will match the brackets and a space either side, and replace with a single space:
// Hello there How are you
var output = Regex.Replace("Hello there [Gesture : 2.5] How are you", @" \[.+\] ", " ");
If required, you could use a slightly more complicated version that can handle the square brackets not necessarily being surrounded by spaces, for example at the start or end of the input string:
var output = Regex.Replace(
    "Hello there How are you [Gesture : 2.5]", // input string
    @"[^\w]{0,1}\[.+\]([^\w]){0,1}", // our pattern to match the brackets and what's in between them
    "$1"); // replace with the first capture group - in this case the character that comes after
And if you wanted to you could use the overload of Replace taking a MatchEvaluator delegate to have more control over how it is replaced in the string and with what depending on what your needs are.

Assuming you want the fragments before and after the brackets as separate entries in a collection:
Regex.Split(avatarVM.AvatarText, @"\[[^\]]+\]");
This will also work if there are multiple bracketed fragments in the string. You may want to .Trim() each entry.

var output = Regex.Replace("Hello there [Gesture : 2.5] How are you", @"\[.*?\] ", string.Empty);
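If the goal is the individual words rather than a cleaned-up sentence, the removal and the split can be combined. This is only a sketch: the class and method names are hypothetical, and it assumes brackets are never nested.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BracketStripper
{
    // Hypothetical helper, not part of any library.
    public static string[] WordsOutsideBrackets(string input)
    {
        // Replace each [...] segment with a space so neighbouring words don't fuse.
        string withoutBrackets = Regex.Replace(input, @"\[[^\]]*\]", " ");

        // A null separator array makes Split use whitespace as the delimiter;
        // RemoveEmptyEntries drops the gaps left by consecutive spaces.
        return withoutBrackets.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        string[] words = WordsOutsideBrackets("Hello there [Gesture : 2.5] How are you");
        Console.WriteLine(string.Join(", ", words));
    }
}
```

Using [^\]]* instead of .+ keeps each match inside a single pair of brackets, so several bracketed segments in one string are removed separately.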

How to correctly represent a whitespace character

I wanted to know how to represent a whitespace character in C#. I found the empty string representation string.Empty. Is there anything like that that represents a whitespace character?
I would like to do something like this:
test.ToLower().Split(string.Whitespace)
//test.ToLower().Split(Char.Whitespace)

Which whitespace character? The empty string is pretty unambiguous - it's a sequence of 0 characters. However, " ", "\t" and "\n" are all strings containing a single character which is characterized as whitespace.
If you just mean a space, use a space. If you mean some other whitespace character, there may well be a custom escape sequence for it (e.g. "\t" for tab) or you can use a Unicode escape sequence ("\uxxxx"). I would discourage you from including non-ASCII characters in your source code, particularly whitespace ones.
EDIT: Now that you've explained what you want to do (which should have been in your question to start with) you'd be better off using Regex.Split with a regular expression of \s which represents whitespace:
Regex regex = new Regex(@"\s");
string[] bits = regex.Split(text.ToLower());
See the Regex Character Classes documentation for more information on other character classes.
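One detail worth knowing about splitting on \s: it matches a single whitespace character, so a run of spaces or tabs yields empty entries. A possible variation, sketched here with hypothetical names, trims the input and splits on \s+ instead.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceSplitDemo
{
    // Hypothetical helper, not part of any library.
    public static string[] SplitOnAnyWhitespace(string text)
    {
        // Trim first so leading or trailing whitespace does not produce empty
        // entries; \s+ treats a whole run of whitespace as one separator.
        return Regex.Split(text.Trim(), @"\s+");
    }

    public static void Main()
    {
        // Plain \s: the two spaces produce an empty entry between "a" and "b".
        Console.WriteLine(Regex.Split("a  b", @"\s").Length);

        // \s+ with Trim: spaces, tabs and newlines all act as separators.
        Console.WriteLine(string.Join("|", SplitOnAnyWhitespace(" a b\t\tc\nd ")));
    }
}
```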

No, there is no such constant.

The whitespace char can be referenced by its ASCII code.
Character 32 represents a space, therefore:
char space = (char)32;
For example, you can use this approach to produce the desired number of spaces anywhere you want:
int length = 4; // desired number of spaces (4 is just an example value)
string padding = string.Empty.PadRight(length, (char)32);

I had the same problem, so what I did was create a string containing a space and index that character.
string text = "Hello Morning Good Night";
char space = text[5]; // index 5 is the space after "Hello"
Now whenever I need a whitespace character, I pull it from that reference in memory.

Which whitespace character? The most common is the normal space, which is between each word in my sentences. This is just " ".

Using regular expressions, you can represent any whitespace character with the metacharacter "\s"
MSDN Reference

You can always use the Unicode escape for the character. For me personally this is the clearest solution:
var space = "\u0020";

Regex - Get all words that are not wrapped with a "/"

I'm really trying to learn regex, so here it goes.
I would really like to get all words in a string which do not have a "/" on either side.
For example, I need to do this to:
"Hello Great /World/"
I need to have the results:
"Hello"
"Great"
Is this possible in regex, and if so, how do I do it? I think I would like the results to be stored in a string array :)
Thank you

Just use this regular expression \b(?<!/)\w+(?!/)\b:
var str = "Hello Great /World/ /I/ am great too";
var words = Regex.Matches(str, @"\b(?<!/)\w+(?!/)\b")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();
This will get you:
Hello
Great
am
great
too

var newstr = Regex.Replace("Hello Great /World/", @"/(\w+?)/", "");
If you really want an array of strings:
var words = Regex.Matches(newstr, @"\w+")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

I would first split the string into the array, then filter out matching words. This solution might also be cleaner than a big regexp, because you can spot the requirements for "word" and the filter better.
The big regexp solution would be something like word boundary - not a slash - many no-whitespaces - not a slash - word boundary.
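The split-then-filter idea could be sketched as below. The names are hypothetical, and the sketch assumes a "word" is simply a whitespace-delimited token, so punctuation stays attached to its word (a token such as /languages/, with a trailing comma would not be filtered out).

```csharp
using System;
using System.Linq;

public static class SlashFilter
{
    // Hypothetical helper, not part of any library.
    public static string[] WordsNotWrappedInSlashes(string input)
    {
        return input
            // A null separator array splits on whitespace.
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            // Keep only the tokens that are not of the form /something/.
            .Where(token => !IsWrapped(token))
            .ToArray();
    }

    private static bool IsWrapped(string token)
    {
        // Requires at least "//" so a lone "/" is not treated as wrapped.
        return token.Length >= 2 && token[0] == '/' && token[token.Length - 1] == '/';
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", WordsNotWrappedInSlashes("Hello Great /World/")));
    }
}
```

Keeping the "is this token wrapped?" rule in its own method makes it easy to adjust the definition later, for example to strip trailing punctuation before testing.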

I would use a regex replace to replace all /[a-zA-Z]+/ with '' (nothing), then get all the words.

Try this one:
(\s(?<!/)([A-Za-z]+)(?!/))|((?<!/)([A-Za-z]+)(?!/)\s)

Using this example excerpt:
The /character/ "_" (underscore/under-strike) can be /used/ in /variable/ names /in/ many /programming/ /languages/, while the /character/ "/" (slash/stroke/solidus) is typically not allowed.
...this expression matches any string of letters, numbers, underscores, or apostrophes (fairly typical idea of a "word" in English) that does not have a / character both before and after it - wrapped with a "/"
\b([\w']+)\b(?<=(?<!/)\1|\1(?!/))
...and is the purest form, using only one character class to define "word" characters. It matches the example as follows:
Matched        Not Matched
-------------  -------------
The            character
_              used
underscore     variable
under          in
strike         programming
can            languages
be             character
in             stroke
names
many
while
the
slash
solidus
is
typically
not
allowed
If excluding /stroke/ is not desired, then adding a bit to the end limitation will allow it, depending upon how you want to define the beginning of a "next" word:
\b([\w']+)\b(?<=(?<!/)\1|\1(?!/([^\w])))
This changes (?!/) to (?!/([^\w])), which allows /something/ if it has a letter, number, or underscore immediately after it. This would move stroke from the "Not Matched" to the "Matched" list above.
note: \w matches uppercase or lowercase letters, numbers and the underscore character
If you want to alter your concept for "word" from the above, simply exchange the characters and shorthand character classes contained in the [\w'] part of the expression to something like [a-zA-Z'] to exclude digits or [\w'-] to include hyphens, which would capture under-strike as a single match, rather than two separate matches:
\b([\w'-]+)\b(?<=(?<!/)\1|\1(?!/([^\w])))
IMPORTANT ALTERNATIVE!!! (I think)
I just thought of an alternative to Matching any words that are not wrapped with / symbols: simply consume all of these symbols and words that are surrounded in them (splitting). This has a few benefits: no lookaround means this could be used in more contexts (JavaScript does not support lookbehind and some flavors of regex don't support lookaround at all) while increasing efficiency; also, using a split expression means a direct result of a String array:
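string input = "The /character/ \"_\" (underscore/under-strike) can be..."; //etc...
// Non-capturing groups (?:...) are used because Regex.Split includes any captured text in its result.
string[] resultsArray = Regex.Split(input, @"(?:[^\w'-]+?(?:/[\w]+/)?)+");
Voila!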

What will be the regular expression for parsing all words except last one?

I want to match the whole string except the last word. i.e. for
This is my house
the matched string should be This is my. What will be the regular expression for this?

This should do it:
^([\w ]*) [\w]+$
^ is the start of the line
([\w ]*) is your group of any number of word characters and spaces
The space and [\w]+ match a space followed by one or more word characters (the last word)
$ is the end of the line.
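In C#, the part you want is the first capture group of that pattern. A small sketch, with a hypothetical class and method name, returning an empty string when the pattern does not match (for example, when the input is a single word):

```csharp
using System;
using System.Text.RegularExpressions;

public static class AllButLastWord
{
    // Hypothetical helper, not part of any library.
    public static string Extract(string line)
    {
        // Group 1 holds everything before the final space and last word.
        Match match = Regex.Match(line, @"^([\w ]*) [\w]+$");
        return match.Success ? match.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(Extract("This is my house"));
    }
}
```

Because [\w ] only allows word characters and spaces, input containing punctuation will not match at all; widen the character class if that matters for your data.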

You really don't need regexp for this task, delete everything from the last whitespace to the end of the string and you'll have what you need.

Personally, I'd go with something less opaque for such a simple task:
var words = Regex.Split("this is my house", @"\s");
var allButLastWord = string.Join(" ",words.Take(words.Length-1));

If it ends with a whitespace (as per example), you can define it as:
^(.*)\s
This will also drop the whitespace at the end, which I believe is the desired effect.

You could just use Split and do something like this -
var text = "This is my house";
var arr = text.Split(' ');
var newtext = String.Join(" ",arr.Take(arr.Length-1));

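string GetAllWordsExceptLast(string original)
{
    original = original.Trim();
    // LastIndexOf returns -1 when there is no space (a single word or an empty string).
    int lastSpace = original.LastIndexOf(' ');
    return lastSpace < 0 ? string.Empty : original.Substring(0, lastSpace);
}
Unless you're really determined to use regular expressions, that is. They just seem like overkill for such a simple operation.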

A robust solution trimming leading spaces:
(?!\s)(.+)(?=\s)
See it at work on regex101.com
Example: "5 hours", " 5 hours", "twenty five minutes", " twenty five minutes"
Explanation
Negative Lookahead (?!\s)
  Assert that the regex below does not match
    \s matches any whitespace character (equal to [\r\n\t\f\v ])
1st Capturing Group (.+)
  .+ matches any character (except for line terminators)
    + Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
Positive Lookahead (?=\s)
  Assert that the regex below matches
    \s matches any whitespace character (equal to [\r\n\t\f\v ])

Ensure a whitespace exists after any occurrence of a specific char in a string

if anyone can help me with this one...
I have a string named text. I need to ensure that after every occurrence of a specific character " there is a whitespace. If there is not a whitespace, then I need to insert one.
I am not sure on the best approach to accomplish this in c#, I think regular expressions may be the way to go, but I do not have enough knowledge about regular expressions to do this...
If anyone can help it would be appreciated.

// rule: all 'a's must be followed by space.
// 'a's that are already followed by space must
// remain the same.
String text = "banana is a fruit";
text = Regex.Replace(text, @"a(?!\s)", x => x.Value + " ");
// at this point, text contains: ba na na is a fruit
The regular expression a(?!\s) searches for an 'a' that is not followed by whitespace. The lambda expression x => x.Value + " " tells the replace function to replace every such 'a' with an 'a' followed by a space.

So, assuming you have your string:
string text = "12345,123523, 512,5,23, 18";
I assume here to put a space after each comma that has no whitespace after it. Adapt as you like.
You can use a regular expression replace:
Regex.Replace(text, @",(?!\s)", ", ");
This regular expression just searches for a comma that is not followed by any whitespace (space, tab, etc.) and replaces it by the same comma and a single space.
We can do a little better, still:
Regex.Replace(text, @"(?<=,)(?!\s)", " ");
which leaves the comma intact and only replaces the empty space between a comma and a following non-whitespace character by a single space. Essentially it just inserts the new space, if needed.
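One thing to watch with (?!\s): it also succeeds at the very end of the string, so a trailing comma gets a space appended. If that is not wanted, a possible variation (an assumption about the desired behaviour, not part of the answer above) is to require a non-whitespace character after the comma with (?=\S). The class and method names here are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommaSpacing
{
    // Hypothetical helper, not part of any library.
    public static string EnsureSpaceAfterComma(string text)
    {
        // (?<=,) : the position right after a comma.
        // (?=\S) : the next character exists and is not whitespace.
        // The match is zero-width, so the replacement only inserts a space.
        return Regex.Replace(text, @"(?<=,)(?=\S)", " ");
    }

    public static void Main()
    {
        Console.WriteLine(EnsureSpaceAfterComma("12345,123523, 512,5,23, 18"));
    }
}
```

Note that Regex.Replace returns a new string; assign the result to a variable, since the original string is not modified.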
