Match but exclude a string using C# regular expression [duplicate] - c#

Say I have the string "User Name:firstname.surname" contained in a larger string. How can I use a regular expression to get just the firstname.surname part?
Every method I have tried returns the string "User Name:firstname.surname", and then I have to do a string replace on "User Name:" to turn it into an empty string.
Could back references be of use here?
Edit:
The longer string could also contain "Account Name: firstname.surname", which is why I want to match the "User Name:" part of the string as well, to get just that value.

I like to use named groups:
Match m = Regex.Match("User Name:first.sur", @"User Name:(?<name>\w+\.\w+)");
if (m.Success)
{
    string name = m.Groups["name"].Value;
}
Putting the ?<something> at the beginning of a group in parentheses (e.g. (?<something>...)) allows you to get the value from the match using something as a key (e.g. from m.Groups["something"].Value)
If you didn't want to go to the trouble of naming your groups, you could say
Match m = Regex.Match("User Name:first.sur", @"User Name:(\w+\.\w+)");
if (m.Success)
{
    string name = m.Groups[1].Value;
}
and just get the first thing that matches. (Note that the first parenthesized group is at index 1; the whole expression that matches is at index 0)

You could also try the concept of "lookaround". This is a kind of zero-width assertion, meaning it checks the characters around the match without consuming them, so they are not included in the result.
In your case, we could take a positive lookbehind: we want what's behind the target string "firstname.surname" to be equal to "User Name:".
Positive lookbehind operator: (?<=StringBehind)StringWeWant
This can be achieved like this, for instance (a little Java example, using string replace):
String test = "Account Name: firstname.surname; User Name:firstname.surname";
String regex = "(?<=User Name:)firstname.surname";
String replacement = "James.Bond";
System.out.println(test.replaceAll(regex, replacement));
This replaces only the "firstname.surname" strings that are preceded by "User Name:", without replacing "User Name:" itself, which the regex checks for but does not include in the match.
OUTPUT: Account Name: firstname.surname; User Name:James.Bond
That is, if the language you're using supports this kind of operation (.NET's regex engine does).
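Since the question is about C#, here is a minimal sketch of the same positive lookbehind using the .NET Regex class. The class name LookbehindDemo and the method name ExtractUserName are made up for illustration, and the \w+\.\w+ part assumes the name is two runs of word characters separated by a dot.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookbehindDemo
{
    // Returns the name that directly follows "User Name:", or null if none is found.
    // The lookbehind (?<=User Name:) must be satisfied, but it is not part of m.Value.
    public static string ExtractUserName(string input)
    {
        Match m = Regex.Match(input, @"(?<=User Name:)\w+\.\w+");
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        string test = "Account Name: firstname.surname; User Name:james.bond";
        Console.WriteLine(ExtractUserName(test)); // james.bond
    }
}
```

The "Account Name: firstname.surname" part is skipped because the text before it is not "User Name:".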

Make a group with parentheses, then get it from the Match.Groups collection, like this:
string s = "User Name:firstname.surname";
Regex re = new Regex(@"User Name:(.*\..*)");
Match match = re.Match(s);
if (match.Success)
{
    MessageBox.Show(match.Groups[1].Value);
}
(note: the first group, with index 0, is the whole match)

All regular expression libraries I have used allow you to define groups in the regular expression using parentheses, and then access that group from the result.
So, your regexp might look like: User Name:([^.\s]+\.[^.\s]+)
The complete match is group 0. The part that matches inside the parentheses is group 1.

Related

Using Regex to extract part of a string from a HTML/text file

I have a C# regular expression to match author names in a text document that is written as:
"author":"AUTHOR'S NAME"
The regex is as follows:
new Regex("\"author\":\"[A-Za-z0-9]*\\s?[A-Za-z0-9]*")
This returns "author":"AUTHOR'S NAME. However, I don't want the quotation marks or the word Author before. I just want the name.
Could anyone help me get the expected value please?
Use regex groups to get a part of the string. ( ) acts as a capture group and can be accessed through the Groups property of the match.
.Groups[0] holds the whole match
.Groups[1] holds the first group (and so on)
string pattern = "\"author\":\"([A-Za-z0-9]*\\s?[A-Za-z0-9]*)\"";
var match = Regex.Match("\"author\":\"Name123\"", pattern);
string authorName = match.Groups[1].Value;
You can also use look-around approach to only get a match value:
var txt = "\"author\":\"AUTHOR'S NAME\"";
var rgx = new Regex(@"(?<=""author"":"")[^""]+(?="")");
var result = rgx.Match(txt).Value;
result will be AUTHOR'S NAME.
(?<="author":") checks that "author":" comes right before the match, [^"]+ matches one or more characters other than a double quote (which looks safe, since you only want to match alphanumerics and spaces between the quotes), and (?=") checks for the trailing quote.
My regex runs at 555,020 iterations per second with this input string, which should suffice.

Regex matching too many groups

I'm using C# Regex class. I'm trying to split two strings from one. The source (input) string is constructed in following way:
first part must match PO|P|S|[1-5] (in regex syntax).
second part can be VP|GZ|GAR|PP|NAD|TER|NT|OT|LO (again, regex syntax). Second part can occur zero or one time.
Acceptable examples are "PO" (one group), "POGAR" (both groups PO+GAR), "POT" (P+OT)...
So I've used the following regex expression:
Regex r = new Regex("^(?<first>PO|P|S|[1-5])(?<second>VP|GZ|GAR|PP|NAD|TER|NT|OT|LO)?$");
Match match = r.Match(potentialToken);
When potentialToken is "PO", it returns 3 groups! How come? I am expecting just one group (first).
match.Groups are {"PO","PO",""}
Named groups are OK - match.Groups["first"] returns 1 instance, while match.Groups["second"].Success is false.
When using the numbered groups, the first group is always the complete matched (sub)string (cf. docs - "the first element of the GroupCollection object returned by the Groups property contains a string that matches the entire regular expression pattern"), i.e. in your case PO.
The second element in Groups is the capture of your first named group, and the third element is the capture of your second named group - just like the two captures you can retrieve by name. If you check Success of the numbered groups, you will see that the last element (the one matching your second named group) has a Success value of false, as well. You can interpret this as "the group exists, but it did not match anything".
To confirm this, have a look at the output of this testing code:
using System;
using System.Text.RegularExpressions;
class Program
{
    static void Main()
    {
        Regex r = new Regex("^(?<first>PO|P|S|[1-5])(?<second>VP|GZ|GAR|PP|NAD|TER|NT|OT|LO)?$");
        Match match = r.Match("PO");
        for (int i = 0; i < match.Groups.Count; i++) {
            Console.WriteLine(string.Format("{0}: {1}; {2}", i, match.Groups[i].Success, match.Groups[i].Value));
        }
    }
}
It prints:
0: True; PO
1: True; PO
2: False; 
A regex match always has one group, "Group 0" at index 0, even if the pattern has no capturing groups.
"Group 0" is equal to the whole match the regex has made (Match.Value).
So in your case you get 3 groups: "Group 0" + "Group first" + "Group second". As mentioned, "Group second" is optional, so when it doesn't take part in the match the .NET regex engine sets its Success to false. There is nothing surprising here; this is the expected behavior.
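If the goal is to see only the groups that actually took part in the match, one option is to filter on Group.Success. The following is a sketch under that assumption; the class name GroupDemo and the method name SuccessfulGroups are hypothetical, while the pattern is the one from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class GroupDemo
{
    static readonly Regex TokenRegex =
        new Regex("^(?<first>PO|P|S|[1-5])(?<second>VP|GZ|GAR|PP|NAD|TER|NT|OT|LO)?$");

    // Returns name -> value for every named group that participated in the match.
    // Group "0" (the whole match) is skipped; an empty dictionary means no match.
    public static Dictionary<string, string> SuccessfulGroups(string token)
    {
        var result = new Dictionary<string, string>();
        Match m = TokenRegex.Match(token);
        if (!m.Success) return result;

        foreach (string name in TokenRegex.GetGroupNames())
        {
            if (name == "0") continue;          // the whole match, not a user-defined group
            Group g = m.Groups[name];
            if (g.Success) result[name] = g.Value;
        }
        return result;
    }

    static void Main()
    {
        foreach (var pair in SuccessfulGroups("POT"))
            Console.WriteLine(pair.Key + " = " + pair.Value);
    }
}
```

For "PO" this yields only first; for "POT" the engine backtracks from PO to P so that OT can match the second group.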

what is a good pattern to process each individual regex match through a method

I'm trying to figure out a pattern where I run a regex match on a long string, and each time it finds a match, it runs a replace on it. The thing is, the replace will vary depending on the matched value. This new value will be determined by a method. For example:
var matches = Regex.Match(myString, myPattern);
while(matches.Success){
Regex.Replace(myString, matches.Value, GetNewValue(matches.Groups[1]));
matches = matches.NextMatch();
}
The problem (I think) is that if I run the Regex.Replace, all of the match indexes get messed up so the result ends up coming out wrong. Any suggestions?
If you replace each pattern with a fixed string, Regex.Replace does that for you. You don't need to iterate the matches:
Regex.Replace(myString, myPattern, "replacement");
Otherwise, if the replacement depends upon the matched value, use the MatchEvaluator delegate, as the 3rd argument to Regex.Replace. It receives an instance of Match and returns string. The return value is the replacement string. If you don't want to replace some matches, simply return match.Value:
string myString = "aa bb aa bb";
string myPattern = @"\w+";
string result = Regex.Replace(myString, myPattern,
    match => match.Value == "aa" ? "0" : "1");
Console.WriteLine(result);
// 0 1 0 1
If you really need to iterate the matches and replace them manually, you need to start replacement from the last match towards the first, so that the index of the string is not ruined for the upcoming matches. Here's an example:
// Requires: using System.Linq; using System.Text;
var matches = Regex.Matches(myString, myPattern);
// Work from the last match to the first so earlier indexes stay valid
var matchesFromEndToStart = matches.Cast<Match>().OrderByDescending(m => m.Index);
var sb = new StringBuilder(myString);
foreach (var match in matchesFromEndToStart)
{
    if (IsGood(match))
    {
        sb.Remove(match.Index, match.Length)
          .Insert(match.Index, GetReplacementFor(match));
    }
}
Console.WriteLine(sb.ToString());
Just be careful that your matches do not contain nested instances. If so, you either need to remove matches which are inside another match, or rerun the regex pattern to generate new matches after each replacement. I still recommend the second approach, which uses the delegate.
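To connect the delegate approach back to the question's loop, which passes a captured group to GetNewValue, here is a small sketch. The {word} placeholder pattern, the class name EvaluatorDemo, and the body of GetNewValue are all assumptions made up for illustration; only Regex.Replace with a MatchEvaluator is the technique described above.

```csharp
using System;
using System.Text.RegularExpressions;

static class EvaluatorDemo
{
    // Hypothetical stand-in for the asker's GetNewValue: computes the replacement text.
    static string GetNewValue(string captured)
    {
        return captured.ToUpperInvariant();
    }

    // Replaces every {word} placeholder with a value computed from the captured word.
    // Regex.Replace calls the lambda once per match, so no index bookkeeping is needed.
    public static string ReplaceAll(string input)
    {
        return Regex.Replace(input, @"\{(\w+)\}", m => GetNewValue(m.Groups[1].Value));
    }

    static void Main()
    {
        Console.WriteLine(ReplaceAll("Hello {name}, welcome to {place}"));
        // Hello NAME, welcome to PLACE
    }
}
```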
If I understand your question correctly, you want to perform a replace based on a constant Regular Expression, but the replacement text you use will change based on the actual text that the regex matches on.
The Matches method of the Regex class (not the Match method) returns a collection of all the matches of your regex within the input string. Each Match in it contains information like the position within the string (Index), the matched value (Value) and the length of the match (Length). If you iterate over this collection with a foreach loop you should be able to treat each match individually and perform some string manipulations where you can dynamically modify the replacement value.
I would use something like
Regex regEx = new Regex("some.*?pattern");
string input = "someBLAHpattern!";
foreach (Match match in regEx.Matches(input))
{
DoStuffWith(match.Value);
}

what will be the best way to parse string inside 2 characters

I have this string:
"Network adapter 'Realtek PCIe GBE Family Controller' on local host"
what will be the best way to return only the string between "'" ? (Realtek PCIe GBE Family Controller)
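If you're comfortable with regular expressions, you could use a pattern like:
/'[^']*'/
to match the quoted text. Note that this match includes the quotes themselves; put the inner part in a capture group, as in '([^']*)', to get only what is between them.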
You can use regular expressions, like this:
var s = "hello 'world' hehe";
var m = Regex.Match(s, "'([^']*)'");
string res = null;
if (m.Success) {
res = m.Groups[1].ToString();
}
Console.WriteLine(res);
The key to the solution is this regular expression:
'([^']*)'
It starts the match when it finds a single quote, and continues until it finds the closing quote, capturing everything in between. The captured group is then retrieved through the Regex API. Note that the capturing groups that you define start at index 1; index zero is reserved to mean "the entire match".
Take a look at the demo on ideone.
You can use the Substring() method to chop it up.
string tempStr = str.Substring(str.IndexOf("'") + 1);
string yourStr = tempStr.Substring(0, tempStr.IndexOf("'"));
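The Substring version assumes both quotes are present; IndexOf returns -1 when they are not, which leads to wrong results or an exception. A slightly more defensive sketch might look like this (the class name QuoteDemo and the method name Between are made up for illustration):

```csharp
using System;

static class QuoteDemo
{
    // Returns the text between the first two occurrences of delimiter,
    // or null if the string does not contain two of them.
    public static string Between(string s, char delimiter)
    {
        int start = s.IndexOf(delimiter);
        if (start < 0) return null;

        int end = s.IndexOf(delimiter, start + 1);
        if (end < 0) return null;

        return s.Substring(start + 1, end - start - 1);
    }

    static void Main()
    {
        string input = "Network adapter 'Realtek PCIe GBE Family Controller' on local host";
        Console.WriteLine(Between(input, '\''));
        // Realtek PCIe GBE Family Controller
    }
}
```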

Split name with a regular expression

I'm trying to come up with regular expression which will split full names.
The first part is validation - I want to make sure the name matches the pattern "Name Name" or "Name MI Name", where MI can be one character optionally followed by a period. This weeds out complex names like "Jose Jacinto De La Pena" - and that's fine. The expression I came up with is ^([a-zA-Z]+\s)([a-zA-Z](\.?)\s){0,1}([a-zA-Z'-]+)$ and it seems to do the job.
But how do I modify it to split the name into two parts only? If middle initial is present, I want it to be a part of the first "name", in other words "James T. Kirk" should be split into "James T." and "Kirk". TIA.
Just add some parentheses
^(([a-z]+\s)([a-z](\.?))\s){0,1}([a-z'-]+)$
Your match will be in group 1 now
string resultString = null;
try {
resultString = Regex.Match(subjectString, #"^(([a-z]+\s)([a-z](\.?))\s){0,1}([a-z'-]+)$", RegexOptions.IgnoreCase).Groups[1].Value;
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
Also, I made the regex case insensitive so that you can make it shorter (no a-zA-Z but a-z)
Update 1
The numbered groups don't work well when there is no initial, so I wrote the regex from scratch
^(\w+\s(\w\.\s)?)(\w+)$
\w stands for any word character, and this is maybe what you need (you can replace it with a-z if that works better)
Update 2
There is a nice feature in C# where you can name your captures
^(?<First>\w+\s(?:\w\.\s)?)(?<Last>\w+)$
Now you can refer to the group by name instead of number (think it's a bit more readable)
var subjectString = "James T. Kirk";
Regex regexObj = new Regex(#"^(?<First>\w+\s(?:\w\.\s)?)(?<Last>\w+)$", RegexOptions.IgnoreCase);
var groups = regexObj.Match(subjectString).Groups;
var firstName = groups["First"].Value;
var lastName = groups["Last"].Value;
You can accomplish this by making what is currently your second capturing group a non-capturing group, by adding ?: just after its opening parenthesis, and then moving that entire group into the end of the first group. Dropping the parentheses around \.? as well gives the following:
^([a-zA-Z]+\s(?:[a-zA-Z]\.?\s)?)([a-zA-Z'-]+)$
Note that I also replaced the {0,1} with ?, because they are equivalent.
This will result in two capturing groups, one for the first name and middle initial (if it exists), and one for the last name.
I'm not sure if you want this way, but there is a method of doing it without regular expressions.
If the name is in the form of Name Name then you could do this:
// fullName is a string that has the full name, in the form of 'Name Name'
string firstName = fullName.Split(' ')[0];
string lastName = fullName.Split(' ')[1];
And if the name is in the form of Name MI. Name then you can do this:
string firstName = fullName.Split('.')[0] + ".";
string lastName = fullName.Split('.')[1].Trim();
Hope this helps!
Just put the optional part in the first capturing group:
(?i)^([a-z]+(?:\s[a-z]\.?)?)\s([a-z'-]+)$
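As a rough sketch of how that last pattern could be used from C#, the following wraps it in a helper that validates and splits in one step. The class name NameSplitter, the method name TrySplit, and the group names First and Last are assumptions added for illustration; RegexOptions.IgnoreCase stands in for the inline (?i).

```csharp
using System;
using System.Text.RegularExpressions;

static class NameSplitter
{
    // Same shape as the pattern above, with the two groups named for readability.
    static readonly Regex NamePattern = new Regex(
        @"^(?<First>[a-z]+(?:\s[a-z]\.?)?)\s(?<Last>[a-z'-]+)$",
        RegexOptions.IgnoreCase);

    // Returns false (and null outputs) when the name is not "Name Name" or "Name MI Name".
    public static bool TrySplit(string fullName, out string first, out string last)
    {
        Match m = NamePattern.Match(fullName);
        first = m.Success ? m.Groups["First"].Value : null;
        last = m.Success ? m.Groups["Last"].Value : null;
        return m.Success;
    }

    static void Main()
    {
        string first, last;
        if (TrySplit("James T. Kirk", out first, out last))
            Console.WriteLine(first + " | " + last); // James T. | Kirk
    }
}
```

Names with more parts, such as "Jose Jacinto De La Pena", fail the match, which is the validation behavior the question asks for.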
