extract last match from string in c# - c#

i have strings in the form [abc].[some other string].[can.also.contain.periods].[our match]
i now want to match the string "our match" (i.e. without the brackets), so i played around with lookarounds and whatnot. i now get the correct match, but i don't think this is a clean solution.
(?<=\.?\[) starts with '[' or '.['
([^\[]*) our match, i couldn't find a way to not use a negated character group
`.*?` non-greedy did not work as expected with lookarounds,
it would still match from the first match
(matches might contain escaped brackets)
(?=\]$) string ends with an ]
language is .net/c#. if there is an easier solution not involving a regex i'd be also happy to know
what really irritates me is the fact, that i cannot use (.*?) to capture the string, as it seems non-greedy does not work with lookbehinds.
i also tried: Regex.Split(str, #"\]\.\[").Last().TrimEnd(']');, but i'm not really pround of this solution either

The following should do the trick. Assuming the string ends after the last match.
string input = "[abc].[some other string].[can.also.contain.periods].[our match]";
var search = new Regex("\\.\\[(.*?)\\]$", RegexOptions.RightToLeft);
string ourMatch = search.Match(input).Groups[1]);

Assuming you can guarantee the input format, and it's just the last entry you want, LastIndexOf could be used:
string input = "[abc].[some other string].[can.also.contain.periods].[our match]";
int lastBracket = input.LastIndexOf("[");
string result = input.Substring(lastBracket + 1, input.Length - lastBracket - 2);

With String.Split():
string input = "[abc].[some other string].[can.also.contain.periods].[our match]";
char[] seps = {'[',']','\\'};
string[] splitted = input.Split(seps,StringSplitOptions.RemoveEmptyEntries);
you get "out match" in splitted[7] and can.also.contain.periods is left as one string (splitted[4])
Edit: the array will have the string inside [] and then . and so on, so if you have a variable number of groups, you can use that to get the value you want (or remove the strings that are just '.')
Edited to add the backslash to the separator to treat cases like '\[abc\]'
Edit2: for nested []:
string input = #"[abc].[some other string].[can.also.contain.periods].[our [the] match]";
string[] seps2 = { "].["};
string[] splitted = input.Split(seps2, StringSplitOptions.RemoveEmptyEntries);
you our [the] match] in the last element (index 3) and you'd have to remove the extra ]

You have several options:
RegexOptions.RightToLeft - yes, .NET regex can do this! Use it!
Match the whole thing with greedy prefix, use brackets to capture the suffix that you're interested in
So generally, pattern becomes .*(pattern)
In this case, .*\[([^\]]*)\], then extract what \1 captures (see this on rubular.com)
References
regular-expressions.info/Grouping with brackets

Related

C# Extract part of the string that starts with specific letters

I have a string which I extract from an HTML document like this:
var elas = htmlDoc.DocumentNode.SelectSingleNode("//a[#class='a-size-small a-link-normal a-text-normal']");
if (elas != null)
{
//
_extractedString = elas.Attributes["href"].Value;
}
The HREF attribute contains this part of the string:
gp/offer-listing/B002755TC0/
And I'm trying to extract the B002755TC0 value, but the problem here is that the string will vary by its length and I cannot simply use Substring method that C# offers to extract that value...
Instead I was thinking if there's a clever way to do this, to perhaps a match beginning of the string with what I search?
For example I know for a fact that each href has this structure like I've shown, So I would simply match these keywords:
offer-listing/
So I would find this keyword and start extracting the part of the string B002755TC0 until the next " / " sign ?
Can someone help me out with this ?
This is a perfect job for a regular expression :
string text = "gp/offer-listing/B002755TC0/";
Regex pattern = new Regex(#"offer-listing/(\w+)/");
Match match = pattern.Match(text);
string whatYouAreLookingFor = match.Groups[1].Value;
Explanation : we just match the exact pattern you need.
'offer-listing/'
followed by any combination of (at least one) 'word characters' (letters, digits, hyphen, etc...),
followed by a slash.
The parenthesis () mean 'capture this group' (so we can extract it later with match.Groups[1]).
EDIT: if you want to extract also from this : /dp/B01KRHBT9Q/
Then you could use this pattern :
Regex pattern = new Regex(#"/(\w+)/$");
which will match both this string and the previous. The $ stands for the end of the string, so this literally means :
capture the characters in between the last two slashes of the string
Though there is already an accepted answer, I thought of sharing another solution, without using Regex. Just find the position of your pattern in the input + it's lenght, so the wanted text will be the next character. to find the end, search for the first "/" after the begining of the wanted text:
string input = "gp/offer-listing/B002755TC0/";
string pat = "offer-listing/";
int begining = input.IndexOf(pat)+pat.Length;
int end = input.IndexOf("/",begining);
string result = input.Substring(begining,end-begining);
If your desired output is always the last piece, you could also use split and get the last non-empty piece:
string result2 = input.Split(new string[]{"/"},StringSplitOptions.RemoveEmptyEntries)
.ToList().Last();

Get sub-strings from a string that are enclosed using some specified character

Suppose I have a string
Likes (20)
I want to fetch the sub-string enclosed in round brackets (in above case its 20) from this string. This sub-string can change dynamically at runtime. It might be any other number from 0 to infinity. To achieve this my idea is to use a for loop that traverses the whole string and then when a ( is present, it starts adding the characters to another character array and when ) is encountered, it stops adding the characters and returns the array. But I think this might have poor performance. I know very little about regular expressions, so is there a regular expression solution available or any function that can do that in an efficient way?
If you don't fancy using regex you could use Split:
string foo = "Likes (20)";
string[] arr = foo.Split(new char[]{ '(', ')' }, StringSplitOptions.None);
string count = arr[1];
Count = 20
This will work fine regardless of the number in the brackets ()
e.g:
Likes (242535345)
Will give:
242535345
Works also with pure string methods:
string result = "Likes (20)";
int index = result.IndexOf('(');
if (index >= 0)
{
result = result.Substring(index + 1); // take part behind (
index = result.IndexOf(')');
if (index >= 0)
result = result.Remove(index); // remove part from )
}
Demo
For a strict matching, you can do:
Regex reg = new Regex(#"^Likes\((\d+)\)$");
Match m = reg.Match(yourstring);
this way you'll have all you need in m.Groups[1].Value.
As suggested from I4V, assuming you have only that sequence of digits in the whole string, as in your example, you can use the simpler version:
var res = Regex.Match(str,#"\d+")
and in this canse, you can get the value you are looking for with res.Value
EDIT
In case the value enclosed in brackets is not just numbers, you can just change the \d with something like [\w\d\s] if you want to allow in there alphabetic characters, digits and spaces.
Even with Linq:
var s = "Likes (20)";
var s1 = new string(s.SkipWhile(x => x != '(').Skip(1).TakeWhile(x => x != ')').ToArray());
const string likes = "Likes (20)";
int likesCount = int.Parse(likes.Substring(likes.IndexOf('(') + 1, (likes.Length - likes.IndexOf(')') + 1 )));
Matching when the part in paranthesis is supposed to be a number;
string inputstring="Likes (20)"
Regex reg=new Regex(#"\((\d+)\)")
string num= reg.Match(inputstring).Groups[1].Value
Explanation:
By definition regexp matches a substring, so unless you indicate otherwise the string you are looking for can occur at any place in your string.
\d stand for digits. It will match any single digit.
We want it to potentially be repeated several times, and we want at least one. The + sign is regexp for previous symbol or group repeated 1 or more times.
So \d+ will match one or more digits. It will match 20.
To insure that we get the number that is in paranteses we say that it should be between ( and ). These are special characters in regexp so we need to escape them.
(\d+) would match (20), and we are almost there.
Since we want the part inside the parantheses, and not including the parantheses we tell regexp that the digits part is a single group.
We do that by using parantheses in our regexp. ((\d+)) will still match (20), but now it will note that 20 is a subgroup of this match and we can fetch it by Match.Groups[].
For any string in parantheses things gets a little bit harder.
Regex reg=new Regex(#"\((.+)\)")
Would work for many strings. (the dot matches any character) But if the input is something like "This is an example(parantesis1)(parantesis2)", you would match (parantesis1)(parantesis2) with parantesis1)(parantesis2 as the captured subgroup. This is unlikely to be what you are after.
The solution can be to do the matching for "any character exept a closing paranthesis"
Regex reg=new Regex(#"\(([^\(]+)\)")
This will find (parantesis1) as the first match, with parantesis1 as .Groups[1].
It will still fail for nested paranthesis, but since regular expressions are not the correct tool for nested paranthesis I feel that this case is a bit out of scope.
If you know that the string always starts with "Likes " before the group then Saves solution is better.

C# creating a string which will be parsed, based on user input fails when they enter a tokenizing character

I know what is going on, but i was trying to make it so that my .Split() ignores certain characters.
sample:
1|2|3|This is a string|type:1
the parts "This is a string" is user input The user could enter in a splitting character, | in this case, so i wanted to escape it with \|. It still seems to split based on that. This is being done on the web, so i was thinking that a smart move might actually be just JSON.encode(user_in) to get around it?
1|2|3| This is \|a string|type:1
Still splits on the escaped character because i didnt define it as a special case. How would i get around this issue?
you could use Regex.Split instead and then split on | not preceded by a .
// -- regex for | not preceded by a \
string input = #"1|2|3|This is a string\|type:1";
string pattern = #"(?<!\\)[|]";
string[] substrings = Regex.Split(input, pattern);
foreach (string match in substrings)
{
Console.WriteLine("'{0}'", match);
}
You can replace your delimiter with something special first, next split it and finally replace it back.
var initial = #"1|2|3|This is \| a string|type:1";
var modified = initial.Replace(#"\|", "###");
IEnumerable<string> result = modified.Split('|');
result = result.Select(i => i.Replace("###", #"\|"));

Using Regex to match quoted string with embedded, non-escaped quotes

I am trying to match a string in the following pattern with a regex.
string text = "'Emma','The Last Leaf','Gulliver's travels'";
string pattern = #"'(.*?)',?";
foreach (Match match in Regex.Matches(text,pattern,RegexOptions.IgnoreCase))
{
Console.WriteLine(match + " " + match.Index);
Console.WriteLine(match.Groups[1].Captures[0]);
}
This matches "Emma" and "The Last leaf" correctly, however the third match is "Gulliver". But the desired match is "Gulliver's travels". How can I build a regex for a patterns like this?
Since , is your delimiter, you can try changing your pattern like this. It should work.
string pattern = #"'(.*?)'(?:,|$)";
The way this works is, it looks for a single quote followed by a comma or end of the line.
I think this can work '(.*?)',|'(.*)' as regular expression.
you may consider to use look behind /look ahead:
"(?<=^'|',').*?(?='$|',')"
test with grep:
kent$ echo "'Emma','The Last Leaf','Gulliver's travels'"|grep -Po "(?<=^'|',').*?(?='$|',')"
Emma
The Last Leaf
Gulliver's travels
You can't, if you have single-quote delimited strings and Gulliver's contains a single, unescaped quote there's no way to distinguish it from the end of a string. You could always just split it by commas and trim 's from either side but I'm not sure that's what you want:
string text = "'Emma','The Last Leaf','Gulliver's travels'";
foreach(string s in text.split(new char[] {','})) {
Console.WriteLine(s.Trim('\''));
}

Regex to match alphanumeric and spaces

What am I doing wrong here?
string q = "john s!";
string clean = Regex.Replace(q, #"([^a-zA-Z0-9]|^\s)", string.Empty);
// clean == "johns". I want "john s";
just a FYI
string clean = Regex.Replace(q, #"[^a-zA-Z0-9\s]", string.Empty);
would actually be better like
string clean = Regex.Replace(q, #"[^\w\s]", string.Empty);
This:
string clean = Regex.Replace(dirty, "[^a-zA-Z0-9\x20]", String.Empty);
\x20 is ascii hex for 'space' character
you can add more individual characters that you want to be allowed.
If you want for example "?" to be ok in the return string add \x3f.
I got it:
string clean = Regex.Replace(q, #"[^a-zA-Z0-9\s]", string.Empty);
Didn't know you could put \s in the brackets
The following regex is for space inclusion in textbox.
Regex r = new Regex("^[a-zA-Z\\s]+");
r.IsMatch(textbox1.text);
This works fine for me.
I suspect ^ doesn't work the way you think it does outside of a character class.
What you're telling it to do is replace everything that isn't an alphanumeric with an empty string, OR any leading space. I think what you mean to say is that spaces are ok to not replace - try moving the \s into the [] class.
There appear to be two problems.
You're using the ^ outside a [] which matches the start of the line
You're not using a * or + which means you will only match a single character.
I think you want the following regex #"([^a-zA-Z0-9\s])+"
bottom regex with space, supports all keyboard letters from different culture
string input = "78-selim güzel667.,?";
Regex regex = new Regex(#"[^\w\x20]|[\d]");
var result= regex.Replace(input,"");
//selim güzel
The circumflex inside the square brackets means all characters except the subsequent range. You want a circumflex outside of square brackets.
This regex will help you to filter if there is at least one alphanumeric character and zero or more special characters i.e. _ (underscore), \s whitespace, -(hyphen)
string comparer = "string you want to compare";
Regex r = new Regex(#"^([a-zA-Z0-9]+[_\s-]*)+$");
if (!r.IsMatch(comparer))
{
return false;
}
return true;
Create a set using [a-zA-Z0-9]+ for alphanumeric characters, "+" sign (a quantifier) at the end of the set will make sure that there will be at least one alphanumeric character within the comparer.
Create another set [_\s-]* for special characters, "*" quantifier is to validate that there can be special characters within comparer string.
Pack these sets into a capture group ([a-zA-Z0-9]+[_\s-]*)+ to say that the comparer string should occupy these features.
[RegularExpression(#"^[A-Z]+[a-zA-Z""'\s-]*$")]
Above syntax also accepts space

Categories