Regex Word splitting in C#

Regex Word splitting in C# - c#

I know similar questions have been asked before, but I can't find one that is like mine, or enough like mine to help me out :). So essentially I want to split up a string which contains a bunch of words, and I don't want to return any characters that are not words (this is the key problem I am struggling with, ignoring characters). This is how I define the problem:
What constitutes a word is a string of any character a-zA-Z only
(no numbers or anything else)
In between any word, there can be any number of random other characters
I want to get back a string[] containing only the words
eg: text: "apple^&**^orange1247pear"
I want to return: apple, orange, pear in an array.
The closest I have found I suppose is this:
Regex.Split("apple^orange7pear",#"([a-zA-Z]*)")
Which splits out the apple/orange/pear, but also returns a bunch of other junk and blank strings.
Anyone know how to stop the split function from returning certain parts of the string, or is that not possible?
Thanks in advance for any help you give me :)

Split should match the tokens between your words. In your regex you've added a group around the word, so it is included in the result, but that isn't desired in this case. Note that this regex matches anything besides valid words - anything that isn't an ASCII letter:
string[] words = Regex.Split(str, "[^a-zA-Z]+");
Another option is to match the words directly:
MatchCollection matches = Regex.Matches(str, "[a-zA-Z]+");
string[] words2 = matches.Cast<Match>().Select(m => m.Value).ToArray();
The second option is probably clearer, and will not include blank elements on the start or end of the array.

var splits = Regex.Split("aaa $$$bbb ccc", #"[^A-Za-z]+");
But to include non-latin letters, I would use this:
var splits = Regex.Split("aaa $$$bbb ccc", #"\P{L}+");

Try this:
Regex.Matches("kalle kula(/()&//()nisse8978971", #"[A-Za-z]+")
Using Matches() will collect only the words, Split() will divide the string which is not what you want.

The second option Kobi listed is better and easier to control. I use the following regular expression to locate common entities such as words, numbers, email addresses in a string it will.
var regex = new Regex(#"[\p{L}\p{N}\p{M}]+(?:[-.'´_#][\p{L}|\p{N}|\p{M}]+)*", RegexOptions.Compiled);

Related

C# Regex, extract strings by reference string

Edit: Solution by #Heinzi
https://stackoverflow.com/a/1731641/87698
I got two strings, for example someText-test-stringSomeMoreText? and some kind of pattern string like this one {0}test-string{1}?.
I'm trying to extract the substrings from the first string that match the position of the placeholders in the second string.
The resulting substrings should be: someText- and SomeMoreText.
I tried to extract with Regex.Split("someText-test-stringSomeMoreText?", "[.]*test-string[.]*\?". However this doesn't work.
I hope somebody has another idea...

One option you have is to use named groups:
(?<prefix>.*)test-string(?<suffix>.*)\?
This will return 2 groups containing the wanted prefix and the suffix.
var match = Regex.Match("someText-test-stringSomeMoreText?",
#"(?<prefix>.*)test-string(?<suffix>.*)\?");
Console.WriteLine(match.Groups["prefix"]);
Console.WriteLine(match.Groups["suffix"]);

I got a solution, at least its a bit dynamical.
First I split up the pattern string {0}test-string{1}? with
string[] patternElements = Regex.Split(inputPattern, #"(\\\{[a-zA-Z0-9]*\})");
Then I spit up the input string someText-test-stringSomeMoreText? with
string[] inputElements = inputString.Split(patternElements, StringSplitOptions.RemoveEmptyEntries);
Now the inputElements are the text pieces corresponding to the placeholders {0},{1},...

Matching a substring of any length and characters using RegEx

I would like to be able to match and then extract all substrings in the following string using regex in c#:
"2012-05-15 00:49:02 192.168.100.10 POST /Microsoft-Server-ActiveSync/default.eas User=nikced&DeviceId=ApplDNWGRKZQDTC0&DeviceType=iPhone&Cmd=Ping&Log=V121_Sst8_LdapC0_LdapL0_RpcC31_RpcL50_Hb3540_Erq1_Pk1728465481_S2_ 443 redcloud\nikced 94.234.170.42 Apple-iPhone4C1/902.179 200 0 64 3140491"
Since it's a logfile it the regex should be able to handle any line that is of a similar type.
In this case, the preferred output to a collection should be:
2012-05-15
00:49:02
192.168.100.10
/Microsoft-Server-ActiveSync/default.eas
User=nikced&DeviceId=ApplDNWGRKZQDTC0&DeviceType=iPhone&Cmd=Ping&Log=V121_Sst8_LdapC0_LdapL0_RpcC31_RpcL50_Hb3540_Erq1_Pk1728465481_S2_
443
redcloud\nikced
94.234.170.42
Apple-iPhone4C1/902.179
200
0
64
3140491
Appreciate any answer using C#, .net and Regex to extract the above substrings into a collection (MatchCollection preferred). All log lines follows the same format and pattern.

Incredibly complex regex incoming:
logFile.Split(' ');

This will give you an array that you can iterate through to retrieve all of the "lines" which are separated by a space
string[] lines = log.Split(' ');

You don't need to use a Regex. You can simply use String.Split Method, and specify space as separator:
string [] substrings = line.Split(new Char [] {' '});
If you need to identify the kind of each part, then you should specify what you need to find, and a regex can be created for it.
Anyway, if you really want to use a Regex, do this:
Regex re = new Regex (#"(?:(?<s>[^ ]+)(?: |$))*");
This will give you all the captures in the "s" group, when you call the Match method.
As the OP pointed out in a comment that the separator can be anything appart from a single space, then the possible separators should be included in the (?: |$) and the [^ ] parts of the expression. I.e. if space as well as tab are possible separators, replace that part with (?: |\t|$) and [^ \t]. If you need to accept more than one of those characters as separators, add a + after the () group:
(?:(?<s>[^ \t]+)(?: |\t|$)+)*

The fastest and most obvious way is to use String.Split:
string[] substrings = result = line->Split( nullptr, StringSplitOptions::RemoveEmptyEntries );
But if you insist on a MatchCollection then this will do what you want
MatchCollection ^ substrings = Regex.Matches(line, "\\S+")

Really, you just need to break this down into the parts.
First, the date. Will it always be in YYYY-MM-DD format? Could it be possible that it will be different based on region/culture settings?
(?<LogDate>dddd-dd-dd)
Next, you have the time. Same thing:
(?<LogTime>dd:dd:dd)
Next, I'm assuming this is the web method that was actually called? Not entirely sure, since you haven't really explained how the data is laid out. However, I'm assuming it's either going to be either POST or GET, so that's what we're going to do next...
(?<LogMethod>POST|GET)
Just do this for every part of the log line you're interested in, and you'll be set. IE:
(?<LogDate>dddd-dd-dd) (?<LogTime>dd:dd:dd) (?<LogMethod>POST|GET)...
If you want to anchor to the start/end of the line, be sure to use ^ and $ respectively. When you get the Matches, you can get the values from each group by indexing the Groups property with the named group (such as match.Groups["LogMethod"].Value). Good luck!

Regex - Get all words that are not wrapped with a "/"

Im really trying to learn regex so here it goes.
I would really like to get all words in a string which do not have a "/" on either side.
For example, I need to do this to:
"Hello Great /World/"
I need to have the results:
"Hello"
"Great"
is this possible in regex, if so, how do I do it? I think i would like the results to be stored in a string array :)
Thank you

Just use this regular expression \b(?<!/)\w+(?!/)\b:
var str = "Hello Great /World/ /I/ am great too";
var words = Regex.Matches(str, #"\b(?<!/)\w+(?!/)\b")
.Cast<Match>()
.Select(m=>m.Value)
.ToArray();
This will get you:
Hello
Great
am
great
too

var newstr = Regex.Replace("Hello Great /World/", #"/(\w+?)/", "");
If you realy want an array of strings
var words = Regex.Matches(newstr, #"\w+")
.Cast<Match>()
.Select(m => m.Value)
.ToArray();

I would first split the string into the array, then filter out matching words. This solution might also be cleaner than a big regexp, because you can spot the requirements for "word" and the filter better.
The big regexp solution would be something like word boundary - not a slash - many no-whitespaces - not a slash - word boundary.

I would use a regex replace to replace all /[a-zA-Z]/ with '' (nothing) then get all words

Try this one : (Click here for a demo)
(\s(?<!/)([A-Za-z]+)(?!/))|((?<!/)([A-Za-z]+)(?!/)\s)

Using this example excerpt:
The /character/ "_" (underscore/under-strike) can be /used/ in /variable/ names /in/ many /programming/ /languages/, while the /character/ "/" (slash/stroke/solidus) is typically not allowed.
...this expression matches any string of letters, numbers, underscores, or apostrophes (fairly typical idea of a "word" in English) that does not have a / character both before and after it - wrapped with a "/"
\b([\w']+)\b(?<=(?<!/)\1|\1(?!/))
...and is the purest form, using only one character class to define "word" characters. It matches the example as follows:
Matched Not Matched
------------- -------------
The character
_ used
underscore variable
under in
strike programming
can languages
be character
in stroke
names
many
while
the
slash
solidus
is
typically
not
allowed
If excluding /stroke/, is not desired, then adding a bit to the end limitation will allow it, depending upon how you want to define the beginning of a "next" word:
\b([\w']+)\b(?<=(?<!/)\1|\1(?!/([^\w]))).
changes (?!/) to (?!/([^\w])), which allows /something/ if it does have a letter, number, or underscore immediately after it. This would move stroke from the "Not Matched" to the "Matched" list, above.
note: \w matches uppercase or lowercase letters, numbers and the underscore character
If you want to alter your concept for "word" from the above, simply exchange the characters and shorthand character classes contained in the [\w'] part of the expression to something like [a-zA-Z'] to exclude digits or [\w'-] to include hyphens, which would capture under-strike as a single match, rather than two separate matches:
\b([\w'-]+)\b(?<=(?<!/)\1|\1(?!/([^\w])))
IMPORTANT ALTERNATIVE!!! (I think)
I just thought of an alternative to Matching any words that are not wrapped with / symbols: simply consume all of these symbols and words that are surrounded in them (splitting). This has a few benefits: no lookaround means this could be used in more contexts (JavaScript does not support lookbehind and some flavors of regex don't support lookaround at all) while increasing efficiency; also, using a split expression means a direct result of a String array:
string input = "The /character/ "_" (underscore/under-strike) can be..."; //etc...
string[] resultsArray = Regex.Split(input, #"([^\w'-]+?(/[\w]+/)?)+");
voila!

read serial numbers with regex and place them as individual elements of a List<string>

I am parsing an excel document and one column includes n number of serial numbers for each row, separated by a white space.
sample serials: 1108656 1108657 1108658 1108659 1108660 1108661 1108662 1108663 1108664 1108665 1108666
how could I use regex to analyze that string and return a List or IEnumerable where each serial number in the sample is an individual element?
serials are between 5 and 8 numbers long.
I am using C# and the .Net Regex.

If the string is simply numbers separated by spaces I'd suggest going with String.Split method like this :
string[] mySerialNumbers = searchString.Split(new char[]{' '});
See the documentation of String.Split.
To have the result as an IEnumerable you can simply create a List<string> with the result of String.Split like this :
List<string> mySerialNumbers = new List<string>(searchString.Split(new char[]{' '});
Edit:
After reading the comment, the Regex way would indeed kind of validate the input to make sure no other characters are there, which is a good thing. The regex for this would be as simple as this :
foreach(Match match in Regex.Matches("1108656 1108657 1108658 1108659", "[0-9]{5,8}"))
{
// Do something with match.Value here like : int.Parse(match.Value)
}
The regex expression [0-9]{5,8} means any digit repeated between 5 and 8 times. Of course this Regex is really simple and will simply capture the good things. For example as tring with 1234567 abcd 7654321 will not give any error, it will simply capture the 2 numbers and silently ignore the letters. You could make a much more complex regex to validate even better. This could be a solid starting reference for regex : http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet

Regex.Spilt method can be used here
string[] SerialNum = Regex.Split("yourStringVar", " ")

C# reliable way to pattern match?

At the moment I am trying to match patterns such as
text text date1 date2
So I have regular expressions that do just that. However, the issue is for example if users input data with say more than 1 whitespace or if they put some of the text in a new line etc the pattern does not get picked up because it doesn't exactly match the pattern set.
Is there a more reliable way for pattern matching? The goal is to make it very simple for the user to write but make it easily matchable on my end. I was considering stripping out all the whitespace/newlines etc and then trying to match the pattern with no spaces i.e. texttextdate1date2.
Anyone got any better solutions?
Update
Here is a small example of the pattern I would need to match:
FIND me#test.com 01/01/2010 to 10/01/2010
Here is my current regex:
FIND [A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4} [0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4} to [0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}
This works fine 90% of the time, however, if users submit this information via email it can have all different kinds of formatting and HTML I am not interested in. I am using a combination of the HtmlAgilityPack and a HTML tag removing regex to strip all the HTML from the email, but even at that I can't seem to get a match on some occassions.
I believe this could be a more parsing related question than pattern matching, but I think maybe there is a better way of doing this...

To match at least one or more whitespace characters (space, tab, newline), use:
\s+
Substitute the above wherever you have the physical space in your pattern and you should be fine.

Example of matching multiple groups in a text with multiple whitespaces and/or newlines.
var txt = "text text date1\ndate2";
var matches = Regex.Match(txt, #"([a-z]+)\s+([a-z]+)\s+([a-z0-9]+)\s+([a-z0-9]+)", RegexOptions.Singleline);
matches.Groups[n].Value with n from 1 to 4 will contain your matches.

I would split the string into a string array and match each resulting string to the necessary Regular Expression.

\b(text)[\s]+(text)[\s]+(date1)[\s]+(date2)\b

Its a nasty expression but here is something that will work for the input you provided:
^(\w+)\s+([\w#.]+)\s+(\d{2}\/\d{2}\/\d{4})[^\d]+(\d{2}\/\d{2}\/\d{4})$
This will work with variable amounts of whitespace between the capture groups as well.

Through ORegex you can tokenize your string and just pattern match on token sequences:
var tokens = input.Split(new[]{' ','\t','\n','\r'}, StringSplitOptions.RemoveEmptyEntries);
var oregex = new ORegex<string>("{0}{0}{1}{1}", IsText, IsDate);
var matches = oregex.Matches(tokens); //here is your subsequence tokens.
...
public bool IsText(string str)
{
...
}
public bool IsDate(string str)
{
...
}

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Regex Word splitting in C# - c#

var splits = Regex.Split("aaa $$$bbb ccc", #"[^A-Za-z]+"); But to include non-latin letters, I would use this: var splits = Regex.Split("aaa $$$bbb ccc", #"\P{L}+");

Try this: Regex.Matches("kalle kula(/()&//()nisse8978971", #"[A-Za-z]+") Using Matches() will collect only the words, Split() will divide the string which is not what you want.

The second option Kobi listed is better and easier to control. I use the following regular expression to locate common entities such as words, numbers, email addresses in a string it will. var regex = new Regex(#"[\p{L}\p{N}\p{M}]+(?:[-.'´_#][\p{L}|\p{N}|\p{M}]+)*", RegexOptions.Compiled);

Related

C# Regex, extract strings by reference string

Matching a substring of any length and characters using RegEx

Regex - Get all words that are not wrapped with a "/"

read serial numbers with regex and place them as individual elements of a List<string>

C# reliable way to pattern match?

Categories

Resources