C# how to separate a string by numbering (1. 2. ...) - c#

I have this string "1: This 2: Is 3: A 4: Test" and would like to split it based on the numbering, like this:
"1: This"
"2: Is"
"3: A"
"4: Test"
I think this should be possible with a regular expression, but unfortunately I don't understand much about it.
My attempt, string[] result = Regex.Split(input, @"\D+");, returns only the numbers, without the colon and the text after it.

You can use
string[] result = Regex.Split(text, @"(?!^)(?=(?<!\d)\d+:)");
See this regex demo. Note that the (?<!\d) negative lookbehind is necessary when a bullet number has two or more digits. Details:
(?!^) - not at the start of string
(?=(?<!\d)\d+:) - the position that is immediately followed with one or more digits (not preceded with any digit) and a : char.
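To show why the lookbehind matters, here is a minimal sketch that runs the same pattern on an input with two-digit numbers. The helper name SplitByNumbering and the sample text are made up for illustration, and the code assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

// Split before each "<digits>:" marker, except at the very start of the string.
// The (?<!\d) lookbehind stops the engine from splitting between "1" and "0" in "10:".
static string[] SplitByNumbering(string text) =>
    Regex.Split(text, @"(?!^)(?=(?<!\d)\d+:)");

string[] parts = SplitByNumbering("9: Nine 10: Ten 11: Eleven");
foreach (string part in parts)
{
    // Each piece keeps the space that preceded the next marker, so trim it.
    Console.WriteLine(part.Trim());
}
// 9: Nine
// 10: Ten
// 11: Eleven
```

The split is zero-width, so nothing is removed from the input; that is why the pieces still carry their trailing space.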

If you use a capture group () like this:
string[] result = Regex.Split(str, @"(\d+:)");
the captured values will be added to the array too. All that is left is to join each marker (odd index) with the text that follows it (the next even index); index 0 is skipped because it is empty:
List<string> values = new();
for (int i = 1; i < result.Length; i += 2)
{
values.Add(result[i] + result[i + 1]);
}
There are probably cleaner ways to do this, but this works.

Using \D+ matches 1 or more non digits, and will therefore match : This to split on.
Instead of using split, you can also match the parts:
\b[0-9]+:.*?(?=\b[0-9]+:|$)
The pattern matches:
\b A word boundary to prevent a partial word match
[0-9]+: Match 1+ digits and :
.*? Match as few characters as possible
(?=\b[0-9]+:|$) Positive lookahead, assert either 1+ digits and : or the end of the string to the right
.NET regex demo
Example in C#:
string str = "1: This 2: Is 3: A 4: Test";
string pattern = @"\b[0-9]+:.*?(?=\b[0-9]+:|$)";
MatchCollection matchList = Regex.Matches(str, pattern);
string[] result = matchList.Cast<Match>().Select(match => match.Value).ToArray();
Array.ForEach(result, Console.WriteLine);
Output
1: This
2: Is
3: A
4: Test

Split by space then take each second item. Because if you define the word as something delimited by (white)space, '1.' or '2.' are words too, and you aren't able to distinguish them.
string[] split = content.Split(' ', StringSplitOptions.None);
string[] result = new string[split.Length / 2];
for (int i = 1; i < split.Length; i = i + 2) result[i / 2] = split[i];

Related

Replacing multiple occurrences of string using string builder by regex pattern matching

We are trying to replace all matching patterns (regex) in a string builder with their respective "groups".
Firstly, we are trying to find the count of all occurrences of that pattern and loop through them (count - termination condition). For each match we are assigning the match object and replace them using their respective groups.
Here only the first occurrence is replaced and the other matches are never replaced.
*str* - contains the actual string
Regex - ('.*')\s*=\s*(.*)
To match pattern:
'nam_cd'=isnull(rtrim(x.nam_cd),''),
'Company'=isnull(rtrim(a.co_name),'')
Pattern : created using https://regex101.com/
*matches.Count* - gives the correct count (here 2)
String pattern = @"('.*')\s*=\s*(.*)";
MatchCollection matches = Regex.Matches(str, pattern);
StringBuilder sb = new StringBuilder(str);
Match match = Regex.Match(str, pattern);
for (int i = 0; i < matches.Count; i++)
{
String First = String.Empty;
Console.WriteLine(match.Groups[0].Value);
Console.WriteLine(match.Groups[1].Value);
First = match.Groups[2].Value.TrimEnd('\r');
First = First.Trim();
First = First.TrimEnd(',');
Console.WriteLine(First);
sb.Replace(match.Groups[0].Value, First + " as " + match.Groups[1].Value + " ,", match.Index, match.Groups[0].Value.Length);
match = match.NextMatch();
}
Current output:
SELECT DISTINCT
isnull(rtrim(f.fleet),'') as 'Fleet' ,
'cust_clnt_id' = isnull(rtrim(x.cust_clnt_id),'')
Expected output:
SELECT DISTINCT
isnull(rtrim(f.fleet),'') as 'Fleet' ,
isnull(rtrim(x.cust_clnt_id),'') as 'cust_clnt_id'
A regex solution like this is too fragile. If you need to parse any arbitrary SQL, you need a dedicated parser. There are examples on how to parse SQL properly in Parsing SQL code in C#.
If you are sure there are no "wild", unbalanced ( and ) in your input, you may use a regex as a workaround, for a one-off job:
var result = Regex.Replace(s, @"('[^']+')\s*=\s*(\w+\((?>[^()]+|(?<o>\()|(?<-o>\)))*\))", "\n $2 as $1");
See the regex demo.
Details
('[^']+') - Capturing group 1 ($1): ', 1 or more chars other than ' and then '
\s*=\s* - = enclosed with 0+ whitespaces
(\w+\((?>[^()]+|(?<o>\()|(?<-o>\)))*\)) - Capturing group 2 ($2):
\w+ - 1+ word chars
\((?>[^()]+|(?<o>\()|(?<-o>\)))*\) - a (...) substring with any amount of balanced (...)s inside (see my explanation of this pattern).
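As a rough illustration, the sketch below runs that pattern over the two sample lines from the question. The replacement string drops the leading newline to keep the result compact, and the variable names are made up. A likely reason the original loop only replaced the first match is that every match.Index was computed against the unmodified string, so it goes stale once the StringBuilder changes length; a single Regex.Replace call avoids that bookkeeping.

```csharp
using System;
using System.Text.RegularExpressions;

string sql = "'nam_cd'=isnull(rtrim(x.nam_cd),''),\n'Company'=isnull(rtrim(a.co_name),'')";

// Group 1: the quoted alias. Group 2: a function call with balanced parentheses.
string pattern = @"('[^']+')\s*=\s*(\w+\((?>[^()]+|(?<o>\()|(?<-o>\)))*\))";

// Regex.Replace rewrites every match in one pass over the input.
string rewritten = Regex.Replace(sql, pattern, "$2 as $1");
Console.WriteLine(rewritten);
```

The trailing commas between columns are outside the matches, so they stay where they were.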

RegEx string between N and (N+1)th Occurrence

I am attempting to find nth occurrence of sub string between two special characters. For example.
one|two|three|four|five
Say I am looking for the string between the nth and (n+1)th occurrence of the '|' character, here the 2nd and 3rd, which turns out to be 'three'. I want to do it using RegEx. Could someone guide me?
My Current Attempt is as follows.
string subtext = "zero|one|two|three|four";
Regex r = new Regex(@"(?:([^|]*)|){3}");
var m = r.Match(subtext).Value;
If you have full access to C# code, you should consider a mere splitting approach:
var idx = 2; // Might be user-defined
var subtext = "zero|one|two|three|four";
var result = subtext.Split('|').ElementAtOrDefault(idx);
Console.WriteLine(result);
// => two
A regex can be used if you have no access to code (if you use some tool that is powered with .NET regex):
^(?:[^|]*\|){2}([^|]*)
See the regex demo. It matches
^ - start of string
(?:[^|]*\|){2} - exactly 2 (adjust the number as you need) sequences of:
[^|]* - zero or more chars other than |
\| - a | symbol
([^|]*) - Group 1 (access via .Groups[1]): zero or more chars other than |
C# code to test:
var pat = $@"^(?:[^|]*\|){{{idx}}}([^|]*)";
var m = Regex.Match(subtext, pat);
if (m.Success) {
Console.WriteLine(m.Groups[1].Value);
}
// => two
See the C# demo
If a tool does not let you access captured groups, turn the initial part into a non-consuming lookbehind pattern:
(?<=^(?:[^|]*\|){2})[^|]*
^^^^^^^^^^^^^^^^^^^^
See this regex demo. The (?<=...) positive lookbehind only checks for a pattern presence immediately to the left of the current location, and if the pattern is not matched, the match will fail.
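The two approaches can be put side by side in a small sketch. The helper names FieldBySplit and FieldByRegex are hypothetical, both treat idx as zero-based, and both return null when the field does not exist. The code assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Zero-based field lookup by splitting; null when idx is out of range.
static string FieldBySplit(string text, int idx) =>
    text.Split('|').ElementAtOrDefault(idx);

// Same lookup with the lookbehind pattern; the whole match is the field.
static string FieldByRegex(string text, int idx)
{
    // {{{idx}}} renders as {2} for idx = 2 inside the interpolated string.
    string pattern = $@"(?<=^(?:[^|]*\|){{{idx}}})[^|]*";
    Match m = Regex.Match(text, pattern);
    return m.Success ? m.Value : null;
}

string subtext = "zero|one|two|three|four";
Console.WriteLine(FieldBySplit(subtext, 2)); // two
Console.WriteLine(FieldByRegex(subtext, 2)); // two
```

The lookbehind version relies on .NET supporting variable-length lookbehind, which many other regex flavors do not.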
Use this:
(?:.*?\|){n}(.[^|]*)
where n is the number of times you need to skip your special character. The first capturing group will contain the result.
Demo for n = 2
Use this regex and then select the n-th match (in this case 2) from the Matches collection:
string subtext = "zero|one|two|three|four";
Regex r = new Regex(@"(?<=\|)[^\|]*");
var m = r.Matches(subtext)[2];

search string for everything before a set of characters in C#

I'm looking for a way to search a string for everything before a set of characters in C#. For Example, if this is my string value:
This is is a test.... 12345
I want to build a new string with all of the characters before "12345".
So my new string would equal "This is is a test.... "
Is there a way to do this?
I've found Regex examples where you can focus on one character but not a sequence of characters.
You don't need to use a Regex:
public string GetBitBefore(string text, string end)
{
var index = text.IndexOf(end);
if (index == -1) return text;
return text.Substring(0, index);
}
You can use a lazy quantifier to match anything, followed by a lookahead:
var match = Regex.Match("This is is a test.... 12345", @".*?(?=\d{5})");
where:
.*? lazily matches everything (up to the lookahead)
(?=…) is a positive lookahead: the pattern must be matched, but is not included in the result
\d{5} matches exactly five digits. I'm assuming this is your lookahead; you can replace it
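If the terminator is a fixed piece of text instead of "any five digits", the same lazy-match-plus-lookahead idea can be wrapped in a helper. This is only a sketch: TextBefore is a made-up name, and returning the whole input when the marker is missing is an assumption about the desired behavior. It assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

// Returns everything before the first occurrence of marker,
// or the whole input when the marker is absent.
static string TextBefore(string input, string marker)
{
    // Regex.Escape makes the marker match literally, even if it
    // contains characters such as '.' or '+'.
    string pattern = "^.*?(?=" + Regex.Escape(marker) + ")";
    Match m = Regex.Match(input, pattern, RegexOptions.Singleline);
    return m.Success ? m.Value : input;
}

Console.WriteLine("[" + TextBefore("This is is a test.... 12345", "12345") + "]");
// [This is is a test.... ]
```

RegexOptions.Singleline lets the dot cross line breaks, so the helper also works on multi-line input.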
You can do so with help of regex lookahead.
.*(?=12345)
Example:
var data = "This is is a test.... 12345";
var rxStr = ".*(?=12345)";
var rx = new System.Text.RegularExpressions.Regex (rxStr,
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
var match = rx.Match(data);
if (match.Success) {
Console.WriteLine (match.Value);
}
The code snippet above will print everything up to 12345:
This is is a test....
For more detail, see regex positive lookahead
This should get you started:
var reg = new Regex("^(.+)12345$");
var match = reg.Match("This is is a test.... 12345");
var group = match.Groups[1]; // This is is a test....
Of course you'd want to do some additional validation, but this is the basic idea.
^ means start of string
$ means end of string
The asterisk tells the engine to attempt to match the preceding token zero or more times. The plus tells the engine to attempt to match the preceding token once or more
{min,max} indicate the minimum/maximum number of matches.
\d matches a single character that is a digit, \w matches a "word character" (alphanumeric characters plus underscore), and \s matches a whitespace character (includes tabs and line breaks).
[^a] is a negated character class: it matches any character except a
The dot matches a single character, except line break characters
In your case there are many ways to accomplish the task.
E.g. matching everything up to the first digit: ^[^\d]*
If you know the set of characters and they are not only digit, don't use regex but IndexOf(). If you know the separator between first and second part as "..." you can use Split()
Take a look at this snippet:
class Program
{
static void Main(string[] args)
{
string input = "This is is a test.... 12345";
// Here we call Regex.Matches.
MatchCollection matches = Regex.Matches(input, @"(?<MySentence>(\w+\s*)*)(?<MyNumberPart>\d*)");
foreach (Match item in matches)
{
Console.WriteLine(item.Groups["MySentence"]);
Console.WriteLine("******");
Console.WriteLine(item.Groups["MyNumberPart"]);
}
Console.ReadKey();
}
}
You could just split, not as optimal as the indexOf solution
string value = "oiasjdoiasj12345";
string end = "12345";
string result = value.Split(new string[] { end }, StringSplitOptions.None)[0]; // Take the first part of the result; not the quickest, but fairly simple

Retrieve Alphabet with white space

I would like to retrieve the alphabet only but the code is not enough to make it.
What am I missing?
[A-Öa-ö]+$
16440 dallas
23941 cityO < You also have white space after "O"
931 00 Texas
10581 New Orleans
It's because you specify a range from the ASCII character table, and åäö do not come directly after Z in that table.
You can see it here: http://www.asciitable.com/
So what you need is a regex that specifies those separately:
[A-Za-zåäöÅÄÖ]+$
So the complete regex is:
var re = new Regex("([A-Za-zåäöÅÄÖ]+)$", RegexOptions.Multiline);
var matches = re.Matches(data);
Console.WriteLine(matches[0].Groups[1].Value);
However, since you want to allow white spaces within the name (as for "New Orleans") you need to allow it, simply include it in the regex:
var re = new Regex("([A-Za-zåäöÅÄÖ ]+)$", RegexOptions.Multiline);
Unfortunately that also includes white spaces in the beginning and the end:
" New Orleans "
To fix that you start by making the quantifier lazy (non-greedy), i.e. telling it to use as few characters as possible:
new Regex("([A-Za-zåäöÅÄÖ ]+?)$", RegexOptions.Multiline)
The problem with that is that it does not match any line other than New Orleans. Don't ask me why. To fix that I told the regex that there must be a whitespace character between the digits and the text and that there may be whitespace after the text:
var re = new Regex("\\s([A-Za-zåäöÅÄÖ ]+?)[\\s]*$", RegexOptions.Multiline);
which works with all lines.
Regex breakdown:
\\s A single whitespace (which should not be included in the match since it's not in the parenthesis expression)
([A-Za-zåäöÅÄÖ ]+?)
Find a character which is either a letter of the alphabet or a space
+ there must be one or more
? makes the preceding + lazy (non-greedy), so it takes as few characters as possible.
[\\s]*
[\\s] Find a whitespace character
* There must be zero or more of it
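To check the final pattern against all four sample lines, something like the following can be used. The sample data is joined with "\n" here, which is an assumption about the line endings, and the names list is only there to collect the results. It assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string data = "16440 dallas\n23941 cityO \n931 00 Texas\n10581 New Orleans";

var re = new Regex("\\s([A-Za-zåäöÅÄÖ ]+?)[\\s]*$", RegexOptions.Multiline);

var names = new List<string>();
foreach (Match m in re.Matches(data))
{
    // Group 1 holds the name without the leading space or trailing whitespace.
    names.Add(m.Groups[1].Value);
}

// Brackets make any stray whitespace visible.
foreach (string name in names)
{
    Console.WriteLine("[" + name + "]");
}
// [dallas]
// [cityO]
// [Texas]
// [New Orleans]
```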
Alternative
As an alternative to regex you can do something like this:
public IEnumerable<string> GetCodes(string data)
{
var lines = data.Split(new[] { Environment.NewLine }, StringSplitOptions.None);
foreach (var line in lines)
{
for (var i = 0; i < line.Length; i++)
{
if (!char.IsLetter(line[i]))
continue;
var text = line.Substring(i).TrimEnd(' ');
yield return text;
break;
}
}
}
Which is invoked like:
var codes = GetCodes(yourData).ToList();
In C#, you can use the \p{L} Unicode category class to match any Unicode letter. You may match zero or more whitespace characters with \s*. End of string is $ (or \Z or \z). The word you need can be captured, and this capture can easily be retrieved from the match result via GroupCollection.
Thus, you can use
(\p{L}+)\s*$
or - if you plan to match specific Finnish, etc. letters:
(?i)([A-ZÅÄÖ]+)\s*$
See the regex demo
C# demo:
var strs = new string[] {"16440 dallas", "23941 cityO ", "931 00 Texas", "10581 New Orleans"};
foreach (var s in strs) {
var match = Regex.Match(s, @"(\p{L}+)\s*$");
if (match.Success)
{
Console.WriteLine(match.Groups[1].Value);
}
}

Removing words with special characters in them

I have a long string composed of a number of different words.
I want to go through all of them, and if the word contains a special character or number (except '-'), or starts with a Capital letter, I want to delete it (the whole word not just that character). For all intents and purposes 'foreign' letters can count as special characters.
The obvious solution is to run a loop through each word (after splitting it) and then a loop through each character - but I'm hoping there's a faster way of doing it? Perhaps using Regex but I've almost no experience with it.
Thanks
ADDED:
(What I want for example:)
Input: "this Is an Example of 5 words in an input like-so from example.com"
Output: {this,an,of,words,in,an,input,like-so,from}
(What I've tried so far)
List<string> response = new List<string>();
string[] splitString = text.Split(' ');
foreach (string s in splitString)
{
bool add = true;
foreach (char c in s.ToCharArray())
{
if (!(c.Equals('-') || (Char.IsLetter(c) && Char.IsLower(c))))
{
add = false;
break;
}
}
if (add)
{
response.Add(s);
}
}
Edit 2:
For me a word should be a number of characters (a..z) separated by a space. ,/./!/... at the end shouldn't count for the 'special character' condition (which is really mostly just to remove urls or the like)
So:
"I saw a dog. It was black!"
should result in
{saw,a,dog,was,black}
So you want to find all "words" that only contain characters a-z or -, for words that are separated by spaces?
A regex like this will find such words:
(?<!\S)[a-z-]+(?!\S)
To also allow for words that end with single punctuation, you could use:
(?<!\S)[a-z-]+(?=[,.!?:;]?(?!\S))
Example (ideone):
var re = @"(?<!\S)[a-z-]+(?=[,.!?:;]?(?!\S))";
var str = "this, Is an! Example of 5 words in an input like-so from example.com foo: bar?";
var m = Regex.Matches(str, re);
Console.WriteLine("Matched: ");
foreach (Match i in m)
Console.Write(i + " ");
Notice the punctuation in the string.
Output:
Matched:
this an of words in an input like-so from foo bar
How about this?
(?<=^|\s+)(?<word>[a-z-]+)(?=$|\s+)
Edit: Meant (?<=^|\s+)(?<word>[a-z\-]+)(?=(?:\.|,|!|\.\.\.)?(?:$|\s+))
Rules:
Word can only be preceded by start of line or some number of whitespace characters
Word can only be followed by end of line or some number of whitespace characters (Edit supports words ending with periods, commas, exclamation points, and ellipses)
Word can only contain lower case (latin) letters and dashes
The named group containing each word is "word"
Have a look at Microsoft's How to: Search Strings Using Regular Expressions (C# Programming Guide) - it's about regexes in C#.
List<string> strings = new List<string>() {"asdf", "sdf-sd", "sdfsdf"};
for (int i = strings.Count - 1; i >= 0; i--)
{
if (strings[i].Contains("-"))
{
strings.RemoveAt(i);
}
}
This could be a starting point. Right now it only checks for "." as a special character. This outputs: "this an of words in an like-so from"
string pattern = @"[A-Z]\w+|\w*[0-9]+\w*|\w*[\.]+\w*";
string line = "this Is an Example of 5 words in an in3put like-so from example.com";
System.Text.RegularExpressions.Regex r = new System.Text.RegularExpressions.Regex(pattern);
line = r.Replace(line,"");
You can do this in two ways, the white-list way and the black-list way. With a white-list you define the set of characters that you consider acceptable; with a black-list it's the opposite.
Let's assume the white-list way and that you accept only the characters a-z, A-Z and the - character. Additionally, you have the rule that the first character of a word cannot be an upper-case character.
With this you can do something like this:
string target = "This is a white-list example: (Foo, bar1)";
var matches = Regex.Matches(target, @"(?:\b)(?<Word>[a-z]{1}[a-zA-Z\-]*)(?:\b)");
string[] words = matches.Cast<Match>().Select(m => m.Value).ToArray();
Console.WriteLine(string.Join(", ", words));
Outputs:
// is, a, white-list, example
You can use look-aheads and look-behinds to do this. Here's a regex that matches your example:
(?<=\s|^)[a-z-]+(?=\s|$)
The explanation is: match one or more alphabetic characters (lowercase only, plus hyphen), as long as what comes before the characters is whitespace (or the start of the string), and as long as what comes after is whitespace or the end of the string.
All you need to do now is plug that into System.Text.RegularExpressions.Regex.Matches(input, regexString) to get your list of words.
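For example, a minimal sketch of that last step, using the sample input from the question (the variable names are arbitrary, and the code assumes C# 9 or later for top-level statements):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "this Is an Example of 5 words in an input like-so from example.com";

// A word qualifies only if it consists of lowercase letters or hyphens and is
// bounded by whitespace or the string edges on both sides.
string[] words = Regex.Matches(input, @"(?<=\s|^)[a-z-]+(?=\s|$)")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

Console.WriteLine(string.Join(",", words));
// this,an,of,words,in,an,input,like-so,from
```

Note that this version does not yet tolerate trailing punctuation such as "dog." from the second edit of the question.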
Reference: http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet
