C# Extract part of the string that starts with specific letters - c#

I have a string which I extract from an HTML document like this:
var elas = htmlDoc.DocumentNode.SelectSingleNode("//a[@class='a-size-small a-link-normal a-text-normal']");
if (elas != null)
{
//
_extractedString = elas.Attributes["href"].Value;
}
The HREF attribute contains this part of the string:
gp/offer-listing/B002755TC0/
I'm trying to extract the B002755TC0 value, but the string varies in length, so I cannot simply call Substring with fixed positions.
Instead, I was wondering whether there is a clever way to do this, perhaps by matching the part of the string that precedes the value I'm looking for.
For example, I know for a fact that each href has the structure shown above, so I would simply match this keyword:
offer-listing/
I would find this keyword and then extract the part of the string that follows it (B002755TC0) up to the next "/" character.
Can someone help me out with this?

This is a perfect job for a regular expression :
string text = "gp/offer-listing/B002755TC0/";
Regex pattern = new Regex(@"offer-listing/(\w+)/");
Match match = pattern.Match(text);
string whatYouAreLookingFor = match.Groups[1].Value;
Explanation: we match exactly the pattern you need:
'offer-listing/'
followed by one or more 'word characters' (letters, digits, and underscores; note that \w does not match a hyphen),
followed by a slash.
The parentheses () mean 'capture this group' (so we can extract it later with match.Groups[1]).
EDIT: if you want to extract also from this : /dp/B01KRHBT9Q/
Then you could use this pattern :
Regex pattern = new Regex(@"/(\w+)/$");
which will match both this string and the previous. The $ stands for the end of the string, so this literally means :
capture the characters in between the last two slashes of the string
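As a minimal sketch of how the second pattern could be wrapped, the helper below checks Match.Success before reading the group, so an href that does not match is reported rather than silently yielding an empty string. TryExtractId is a hypothetical name, not part of the answer, and the snippet is written as a C# 9 top-level program.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: captures the text between the last two slashes.
// Returns false (and an empty id) when the href does not end in "/<word>/".
static bool TryExtractId(string href, out string id)
{
    Match match = Regex.Match(href, @"/(\w+)/$");
    id = match.Success ? match.Groups[1].Value : string.Empty;
    return match.Success;
}

foreach (string href in new[] { "gp/offer-listing/B002755TC0/", "/dp/B01KRHBT9Q/", "no-id-here" })
{
    Console.WriteLine(TryExtractId(href, out string id)
        ? $"{href} -> {id}"
        : $"{href} -> (no match)");
}
```

The first two inputs should yield B002755TC0 and B01KRHBT9Q; the third has no slashes and is reported as a non-match.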

Though there is already an accepted answer, I thought I'd share another solution that doesn't use Regex. Find the position of your pattern in the input and add its length; the wanted text starts at that index. To find the end, search for the first "/" after the beginning of the wanted text:
string input = "gp/offer-listing/B002755TC0/";
string pat = "offer-listing/";
int beginning = input.IndexOf(pat) + pat.Length;
int end = input.IndexOf("/", beginning);
string result = input.Substring(beginning, end - beginning);
If your desired output is always the last piece, you could also use Split and take the last non-empty piece (Last() requires using System.Linq):
string result2 = input.Split(new string[] { "/" }, StringSplitOptions.RemoveEmptyEntries)
    .Last();
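The IndexOf approach assumes the marker is always present; IndexOf returns -1 when it is not, which would shift the start index to the wrong place. A hedged sketch of a guarded version follows. TryExtractAfter is a hypothetical helper, and taking the rest of the string when there is no closing slash is an assumption, not something the answer specifies.

```csharp
using System;

// Hypothetical helper: returns the text between `marker` and the next '/'.
static bool TryExtractAfter(string input, string marker, out string value)
{
    value = string.Empty;
    int markerAt = input.IndexOf(marker, StringComparison.Ordinal);
    if (markerAt < 0) return false;          // marker not present

    int start = markerAt + marker.Length;
    int end = input.IndexOf('/', start);
    if (end < 0) end = input.Length;         // assumption: no closing slash means "take the rest"

    value = input.Substring(start, end - start);
    return value.Length > 0;
}

Console.WriteLine(
    TryExtractAfter("gp/offer-listing/B002755TC0/", "offer-listing/", out string found)
        ? found
        : "(not found)");
```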

Related

Match Characters after last dot in string

I have a string and I want to get the words after the last dot in the string.
Example:
input string = "XimEngine.DynamicGui.PickKind.DropDown";
Result:
DropDown
There's no need for Regex; let's find the last . and take the Substring after it:
string result = input.Substring(input.LastIndexOf('.') + 1);
If input doesn't contain a '.', the entire input will be returned.
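The no-dot case works because LastIndexOf returns -1 when nothing is found, so the start index becomes 0. A small sketch of the edge cases, where AfterLastDot is a hypothetical wrapper around the one-liner above:

```csharp
using System;

// Hypothetical wrapper around the Substring/LastIndexOf one-liner.
static string AfterLastDot(string input) =>
    input.Substring(input.LastIndexOf('.') + 1);

Console.WriteLine(AfterLastDot("XimEngine.DynamicGui.PickKind.DropDown")); // DropDown
Console.WriteLine(AfterLastDot("NoDots"));                                 // NoDots
Console.WriteLine(AfterLastDot("Trailing.").Length);                       // 0
```

A trailing dot yields an empty string rather than an exception, since Substring accepts a start index equal to the string's length.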
Not a RegEx answer, but you could do:
var result = input.Split('.').Last();
In Regex you can tell the parser to work from the end of the string/buffer by specifying the option RightToLeft.
With that option we can write an ordinary forward pattern that finds a period (\.) and then captures the text we are interested in into group 1 ((\w+)).
var str = "XimEngine.DynamicGui.PickKind.DropDown";
Console.WriteLine(Regex.Match(str,
@"\.(\w+)",
RegexOptions.RightToLeft).Groups[1].Value);
Outputs to console:
DropDown
Working from the end of the string means we don't have to deal with anything that precedes the text we want to extract.

Extract value from a string in C# from a specific position

I have bunch of files in a folder and I am looping through them.
How do I extract the value from the below example? I need the value 0519 only.
DOC 75-20-0519-1.PDF
The code below returns the whole part, including the -1.
Convert.ToInt32(Path.GetFileNameWithoutExtension(objFile).Split('-')[2]);
Appreciate any help.
You can try regular expressions in order to match the value.
pattern:
[0-9]+ - one or more digits
(?=[^0-9][0-9]+$) - followed by not a digit and one or more digits and end of string
code:
using System.Text.RegularExpressions;
...
string file = "DOC 75-20-0519-1.PDF";
// "0519"
string result = Regex
.Match(Path.GetFileNameWithoutExtension(file), @"[0-9]+(?=[^0-9][0-9]+$)")
.Value;
If Split('-') fails, and you have an entire string as a result, it seems that you have a wrong delimiter. It can be, say, one of the dashes:
"DOC 75–20–0519–1.PDF"; // n-dash
"DOC 75—20—0519—1.PDF"; // m-dash
You can use REGEX for this
Match match = Regex.Match("DOC 75-20-0519-1.PDF", @"DOC\s+\d+\-\d+\-(\d+)\-\d+", RegexOptions.IgnoreCase);
string data = match.Groups[1].Value;
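If the file names really do contain en or em dashes, as one answer above suspects, the original Split approach can be kept by splitting on all three dash characters. This is only a sketch: ThirdField is a hypothetical helper, and the field position (index 2) is taken from the question's own code.

```csharp
using System;
using System.IO;

// Hypothetical helper: returns the third dash-separated field of the file name.
static string ThirdField(string fileName)
{
    string name = Path.GetFileNameWithoutExtension(fileName);
    // Split on hyphen-minus, en dash (U+2013) and em dash (U+2014).
    string[] parts = name.Split('-', '\u2013', '\u2014');
    return parts.Length > 2 ? parts[2] : string.Empty;
}

Console.WriteLine(ThirdField("DOC 75-20-0519-1.PDF"));                // 0519
Console.WriteLine(ThirdField("DOC 75\u201320\u20130519\u20131.PDF")); // 0519
```

The helper returns a string so the leading zero survives; passing the result to Convert.ToInt32 would give 519.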

Check if an expression is a match with regex

In C# I have two strings: [I/text] and [S/100x20].
So, the first one is [I/ followed by text and ending in ].
And the second is [S/ followed by an integer, then x, then another integer, and ending in ].
I need to check if a given string matches one of these formats. I tried the following:
(?<word>.*?) and (?<word>[0-9]x[0-9])
But this does not seem to work and I am missing the [I/...] and [S/...] parts.
How can I do this?
This should do nicely:
Regex rex = new Regex(@"\[I/[^\]]+\]|\[S/\d+x\d+\]");
If the text in [I/text] is supposed to include only alphanumeric characters, then @Oleg's use of \w instead of [^\]] would be better. Also, + means there must be at least one character from the preceding character class, whereas * allows zero of them. Adjust as needed.
And use:
string testString1 = "[I/text]";
if(rex.IsMatch(testString1))
{
// should match..
}
string testString2 = "[S/100x20]";
if(rex.IsMatch(testString2))
{
// should match..
}
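The following regex does it:
"(\[I/\w+\])|(\[S/\d+x\d+\])"
The first alternative, (\[I/\w+\]), matches the [I/text] form, and the second, (\[S/\d+x\d+\]), matches the [S/100x20] form. Without ^ and $ anchors it also matches when one of these forms occurs inside a longer string.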
use http://regexr.com?34543 to play with your expressions
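Note that IsMatch succeeds if the pattern is found anywhere in the input. If the whole string must be in one of the two formats, a sketch with explicit anchors could look like this (the variable name whole is illustrative only):

```csharp
using System;
using System.Text.RegularExpressions;

// \A and \z anchor the match to the very start and end of the input.
Regex whole = new Regex(@"\A(?:\[I/[^\]]+\]|\[S/\d+x\d+\])\z");

Console.WriteLine(whole.IsMatch("[I/text]"));        // True
Console.WriteLine(whole.IsMatch("[S/100x20]"));      // True
Console.WriteLine(whole.IsMatch("see [I/text] ok")); // False
Console.WriteLine(whole.IsMatch("[S/100x]"));        // False
```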

Using Regex to determine if string contains a repeated sequence of a particular substring with comma separators and nothing else

I want to find if a string contains a repeated sequence of a known substring (with comma separators) and nothing else and return true if this is the case; otherwise false. For example: the substring is "0,8"
String A: "0,8,0,8,0,8,0,8" returns true
String B: "0,8,0,8,1,0,8,0" returns false because of '1'
I tried using the C# string functions Contains but it does not suit my requirements. I am totally new to regular expression but I feel it should be powerful enough to do this. What RegEx should I use to do this?
The pattern for a string containing nothing but a repeated number of a given substring (possibly zero of them, resulting in an empty string) is \A(?:substring goes here)*\z. The \A matches the beginning of the string, the \z the end of the string, and the (?:...)* matches 0 or more copies of anything matching the thing between the colon and the close parenthesis.
But your string doesn't actually match \A(?:0,8)*\z, because of the extra commas; an example that would match is "0,80,80,80,8". You need to account for the commas explicitly with something like \A0,8(?:,0,8)*\z.
You can build such a thing in C# thus:
string OkSubstring = "0,8";
string aOk = "0,8,0,8,0,8,0,8";
string bOk = "0,8,0,8,1,0,8,0";
Regex OkRegex = new Regex(@"\A" + OkSubstring + "(?:," + OkSubstring + @")*\z");
OkRegex.IsMatch(aOk); // True
OkRegex.IsMatch(bOk); // False
That hard-codes the comma-delimiter; you could make it more general. Or maybe you just need the literal regex. Either way, that's the pattern you need.
EDIT Changed the anchors per Mike Samuel's suggestion.
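One way the pattern might be generalized, as a sketch only: take the delimiter as a parameter and pass both pieces through Regex.Escape so that characters such as . or | in the substring are treated literally. IsRepeatedSequence is a hypothetical name.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: true when `input` is one or more copies of `unit`
// joined by `delimiter`, and nothing else.
static bool IsRepeatedSequence(string input, string unit, string delimiter)
{
    string u = Regex.Escape(unit);
    string d = Regex.Escape(delimiter);
    return Regex.IsMatch(input, @"\A" + u + "(?:" + d + u + @")*\z");
}

Console.WriteLine(IsRepeatedSequence("0,8,0,8,0,8,0,8", "0,8", ",")); // True
Console.WriteLine(IsRepeatedSequence("0,8,0,8,1,0,8,0", "0,8", ",")); // False
Console.WriteLine(IsRepeatedSequence("a.b|a.b", "a.b", "|"));         // True
Console.WriteLine(IsRepeatedSequence("axb|a.b", "a.b", "|"));         // False
```

Like the pattern in the answer, this requires at least one copy of the substring, so an empty input returns false.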

extract last match from string in c#

i have strings in the form [abc].[some other string].[can.also.contain.periods].[our match]
i now want to match the string "our match" (i.e. without the brackets), so i played around with lookarounds and whatnot. i now get the correct match, but i don't think this is a clean solution.
(?<=\.?\[) starts with '[' or '.['
([^\[]*) our match, i couldn't find a way to not use a negated character group
`.*?` non-greedy did not work as expected with lookarounds,
it would still match from the first match
(matches might contain escaped brackets)
(?=\]$) string ends with an ]
language is .net/c#. if there is an easier solution not involving a regex i'd be also happy to know
what really irritates me is the fact, that i cannot use (.*?) to capture the string, as it seems non-greedy does not work with lookbehinds.
i also tried: Regex.Split(str, @"\]\.\[").Last().TrimEnd(']');, but i'm not really proud of this solution either
The following should do the trick. Assuming the string ends after the last match.
string input = "[abc].[some other string].[can.also.contain.periods].[our match]";
var search = new Regex("\\.\\[(.*?)\\]$", RegexOptions.RightToLeft);
string ourMatch = search.Match(input).Groups[1].Value;
Assuming you can guarantee the input format, and it's just the last entry you want, LastIndexOf could be used:
string input = "[abc].[some other string].[can.also.contain.periods].[our match]";
int lastBracket = input.LastIndexOf("[");
string result = input.Substring(lastBracket + 1, input.Length - lastBracket - 2);
With String.Split():
string input = "[abc].[some other string].[can.also.contain.periods].[our match]";
char[] seps = {'[',']','\\'};
string[] splitted = input.Split(seps,StringSplitOptions.RemoveEmptyEntries);
you get "our match" in splitted[6], and can.also.contain.periods is left as one string (splitted[4])
Edit: the array will have the string inside [] and then . and so on, so if you have a variable number of groups, you can use that to get the value you want (or remove the strings that are just '.')
Edited to add the backslash to the separator to treat cases like '\[abc\]'
Edit2: for nested []:
string input = @"[abc].[some other string].[can.also.contain.periods].[our [the] match]";
string[] seps2 = { "].["};
string[] splitted = input.Split(seps2, StringSplitOptions.RemoveEmptyEntries);
you get "our [the] match]" in the last element (index 3), and you'd have to remove the extra ]
You have several options:
RegexOptions.RightToLeft - yes, .NET regex can do this! Use it!
Match the whole thing with greedy prefix, use brackets to capture the suffix that you're interested in
So generally, pattern becomes .*(pattern)
In this case, .*\[([^\]]*)\], then extract what \1 captures (see this on rubular.com)
References
regular-expressions.info/Grouping with brackets
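Both options listed above can be tried side by side in C#. This is a sketch using the sample input from the question; the variable names are illustrative, and the RightToLeft line reuses the same bracket pattern without the greedy prefix.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "[abc].[some other string].[can.also.contain.periods].[our match]";

// Option 1: scan from the end of the string, so the first hit is the last group.
string viaRightToLeft =
    Regex.Match(input, @"\[([^\]]*)\]", RegexOptions.RightToLeft).Groups[1].Value;

// Option 2: a greedy .* prefix swallows everything up to the last '['.
string viaGreedyPrefix =
    Regex.Match(input, @".*\[([^\]]*)\]").Groups[1].Value;

Console.WriteLine(viaRightToLeft);  // our match
Console.WriteLine(viaGreedyPrefix); // our match
```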
