Regular expression to use which matches text before .html and after / - c#

With this string
http://sfsdf.com/sdfsdf-sdfsdf/sdf-as.html
I need to get sdf-as
with this
hellow-1/yo-sdf.html
I need yo-sdf

This should get you what you need:
Regex re = new Regex(@"/([^/]*)\.html$");
Match match = re.Match("http://sfsdf.com/sdfsdf-sdfsdf/sdf-as.html");
Console.WriteLine(match.Groups[1].Value); //Or do whatever you want with the value
This needs using System.Text.RegularExpressions; at the top of the file to work.

There are many ways to do this. The following uses lookarounds to match only the filename portion. It also handles input that contains no / at all:
string[] urls = {
    @"http://sfsdf.com/sdfsdf-sdfsdf/sdf-as.html",
    @"hellow-1/yo-sdf.html",
    @"noslash.html",
    @"what-is/this.lol",
};
foreach (string url in urls) {
    Console.WriteLine("[" + Regex.Match(url, @"(?<=/|^)[^/]*(?=\.html$)") + "]");
}
This prints:
[sdf-as]
[yo-sdf]
[noslash]
[]
How the pattern works
There are 3 parts:
(?<=/|^) : a positive lookbehind to assert that we're preceded by a slash /, or we're at the beginning of the string
[^/]* : match anything but slashes
(?=\.html$): a positive lookahead to assert that we're followed by ".html" (literally on the dot)
References
regular-expressions.info/Lookarounds, Anchors
A non-regex alternative
Knowing regex is good, and it can do wonderful things, but you should always know how to do basic string manipulations without it. Here's a non-regex solution:
static String getFilename(String url, String ext) {
if (url.EndsWith(ext)) {
int k = url.LastIndexOf("/");
return url.Substring(k + 1, url.Length - ext.Length - k - 1);
} else {
return "";
}
}
Then you'd call it as:
getFilename(url, ".html")
API links
String.Substring, EndsWith, and LastIndexOf
Attachments
Source code and output on ideone.com

Try this:
string url = "http://sfsdf.com/sdfsdf-sdfsdf/sdf-as.html";
Match match = Regex.Match(url, @"/([^/]+)\.html$");
if (match.Success)
{
string result = match.Groups[1].Value;
Console.WriteLine(result);
}
Result:
sdf-as
However, it would be a better idea to use the System.Uri class to parse the string so that you correctly handle things like http://example.com/foo.html?redirect=bar.html.
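As a rough illustration of the System.Uri suggestion, the sketch below parses the URL first and only then takes the file name from the path, so a query string such as ?redirect=bar.html is never looked at. The class and method names (UrlFileName, Get) are made up for this example, and it assumes the input is an absolute URL; relative inputs like hellow-1/yo-sdf.html simply return an empty string here. Note that Path.GetFileNameWithoutExtension strips any extension, not only .html.

```csharp
using System;
using System.IO;

public static class UrlFileName
{
    // Hypothetical helper: returns the last path segment without its extension.
    // Returns "" when the input cannot be parsed as an absolute URI.
    public static string Get(string url)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
        {
            return "";
        }
        // AbsolutePath excludes the query string and the fragment.
        return Path.GetFileNameWithoutExtension(uri.AbsolutePath);
    }

    public static void Main()
    {
        Console.WriteLine(Get("http://sfsdf.com/sdfsdf-sdfsdf/sdf-as.html"));    // sdf-as
        Console.WriteLine(Get("http://example.com/foo.html?redirect=bar.html")); // foo
    }
}
```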

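using System.Text.RegularExpressions;
// Verbatim string, so the backslashes reach the regex engine unchanged.
Regex pattern = new Regex(@".*/([a-z\-]+)\.html");
Match match = pattern.Match("http://sfsdf.com/sdfsdf-sdfsdf/sdf-as.html");
if (match.Success)
{
    // Groups[1] holds the captured file name; match.Value would be the whole URL.
    Console.WriteLine(match.Groups[1].Value);
}
else
{
    Console.WriteLine("Not found :(");
}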

This one makes the slash and dot parts optional, and allows the file to have any extension:
new Regex(@"^(.*/)?(?<fileName>[^/]*?)(\.[^/.]*)?$", RegexOptions.ExplicitCapture);
But I still prefer Substring(LastIndexOf(...)) because it is far more readable.
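For completeness, here is one way the named group in that pattern could be read back. This is only a sketch: the wrapper class and method names (NamedGroupDemo, GetFileName) are invented for the example, and the pattern is the one from this answer, written as a verbatim string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // With ExplicitCapture only the named group is captured;
    // the plain parentheses act as non-capturing groups.
    private static readonly Regex FileNamePattern = new Regex(
        @"^(.*/)?(?<fileName>[^/]*?)(\.[^/.]*)?$",
        RegexOptions.ExplicitCapture);

    // Hypothetical helper: returns the file name without directory or extension.
    public static string GetFileName(string url)
    {
        Match match = FileNamePattern.Match(url);
        return match.Success ? match.Groups["fileName"].Value : "";
    }

    public static void Main()
    {
        Console.WriteLine(GetFileName("http://sfsdf.com/sdfsdf-sdfsdf/sdf-as.html")); // sdf-as
        Console.WriteLine(GetFileName("what-is/this.lol"));                           // this
        Console.WriteLine(GetFileName("noslash"));                                    // noslash
    }
}
```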

Related

Regex to parse URL from an excel formula

I have a formula in excel which upon reading from C# code looks like this
"=HYPERLINK(CONCATENATE(\"https://abc.efghi.rtyui.com/#/wqeqwq/\",#REF!,\"/asdasd\"), \"View asdas\")"
I want to use regex to fetch the URL from this string, i.e.
https://abc.efghi.rtyui.com/#/wqeqwq/#REF!/asdasd
The url can be different but the format of the formula will remain the same.
"=HYPERLINK(CONCATENATE(\"{SOME_STRING}\",#REF!,\"{SOME_STRING}\"), \"View asdas\")"
Try it like this:
(?<=HYPERLINK\(CONCATENATE\(")[^"]+
Demo
The positive lookbehind lets us exclude the part in front of the URL from the full match.
If there can be an arbitrary amount of whitespace in between, add some \s*, e.g. see this example that also shows the escaped = at the beginning of the string.
Sample Code:
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string pattern = @"(?<=HYPERLINK\(CONCATENATE\("")[^""]+";
string input = @"=HYPERLINK(CONCATENATE(""https://abc.efghi.rtyui.com/#/wqeqwq/"",#REF!,""/asdasd""), ""View asdas"")";
RegexOptions options = RegexOptions.Multiline;
foreach (Match m in Regex.Matches(input, pattern, options))
{
Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
}
}
}
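The \s* remark above could look roughly like the following. This is a sketch under the assumption that whitespace may appear after each opening parenthesis; the class and method names (HyperlinkPrefix, FirstUrlPart) are made up. It relies on the .NET regex engine accepting variable-length lookbehind, which many other engines do not.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HyperlinkPrefix
{
    // \s* inside the lookbehind tolerates spaces after each "(".
    private const string Pattern = @"(?<=HYPERLINK\(\s*CONCATENATE\(\s*"")[^""]+";

    // Hypothetical helper: returns the first quoted argument of CONCATENATE, or "".
    public static string FirstUrlPart(string formula)
    {
        Match m = Regex.Match(formula, Pattern);
        return m.Success ? m.Value : "";
    }

    public static void Main()
    {
        string input = @"=HYPERLINK( CONCATENATE( ""https://abc.efghi.rtyui.com/#/wqeqwq/"",#REF!,""/asdasd""), ""View asdas"")";
        Console.WriteLine(FirstUrlPart(input)); // https://abc.efghi.rtyui.com/#/wqeqwq/
    }
}
```

As with the sample code above, this returns only the part of the URL before #REF!.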
Addendum: Here is another technique that uses capturing groups and regex Replace to extract the resulting URL string (after CONCATENATE would have happened):
^\=HYPERLINK\(CONCATENATE\("([^"]+)",([^,]+),"([^"]+)".*$
Demo2
string pattern = @"^\=HYPERLINK\(CONCATENATE\(""([^""]+)"",([^,]+),""([^""]+)"".*$";
string substitution = @"$1$2$3";
string input = @"=HYPERLINK(CONCATENATE(""https://abc.efghi.rtyui.com/#/wqeqwq/"",#REF!,""/asdasd""), ""View asdas"")";
Regex regex = new Regex(pattern);
string result = regex.Replace(input, substitution, 1);
You can extract the URL from the formula using capturing groups in regular expression as given below:
string inputString = "=HYPERLINK(CONCATENATE(\"https://abc.efghi.rtyui.com/#/wqeqwq/\",#REF!,\"/asdasd\"), \"View asdas\")";
string regex = "CONCATENATE\\(\"([\\S]+)\",#REF!,\"([\\S]+)\"\\)";
Regex substringRegex = new Regex(regex, RegexOptions.IgnoreCase);
Match substringMatch = substringRegex.Match(inputString);
if (substringMatch.Success)
{
string url = substringMatch.Groups[1].Value + "#REF!" + substringMatch.Groups[2].Value;
}
I have defined two capturing groups in my regular expression. One for extracting part of the URL before #REF! and the other for extracting part of the URL after #REF!. Then I am concatenating all the extracted parts with #REF! to get the final URL.

Match text separated by SINGLE forward slash only

I am trying to split strings similar to this using Regex.Split:
https://www.linkedin.com/in/someone
To return this:
https://www.linkedin.com
in
someone
Effectively, ignoring double forward slash and only worrying about a single forward slash.
I know I should be using something like this /(?!/) negative look ahead - but can't get it to work.
This is not a duplicate of this Similar Question, because if you run that regular expression through Regex.Split, it does not give the required result.
How about this: (?<!/)/(?!/)
Breaking it down:
(?<!/): negative lookbehind for / characters
/: match a single / character
(?!/): negative lookahead for / characters
Taken together, we match a / character that has no / immediately before it and no / immediately after it.
Example usage:
string text = "https://www.linkedin.com/in/someone";
string[] tokens = Regex.Split(text, "(?<!/)/(?!/)");
foreach (var token in tokens)
{
Console.WriteLine($"Token: {token}");
}
Output:
Token: https://www.linkedin.com
Token: in
Token: someone
You can also do it with Regex.Matches:
string pattern = @"([^\/]+(\/{2,}[^\/]+)?)";
string input = @"https://www.linkedin.com/in/someone";
foreach(Match match in Regex.Matches(input, pattern)) {
Console.WriteLine(match);
}
Output :
https://www.linkedin.com
in
someone
As mentioned by @Panagiotis Kanavos in the comments section above, why make things complicated when you can use the Uri Class:
Provides an object representation of a uniform resource identifier (URI) and easy access to the parts of the URI.
public static void Main()
{
Uri myUri = new Uri("https://www.linkedin.com/in/someone");
string host = myUri.Scheme + Uri.SchemeDelimiter + myUri.Host;
Console.WriteLine(host);
}
OUTPUT:
https://www.linkedin.com
DEMO:
dotNetFiddle
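The Uri example above prints only the scheme and host. To get all three tokens the question asks for, something like the following might work. It is a sketch that assumes the input is always an absolute URL; the class and method names (UriTokens, Split) are invented for the example. Note that AbsolutePath is returned in escaped form, so percent-encoded characters stay encoded.

```csharp
using System;
using System.Collections.Generic;

public static class UriTokens
{
    // Hypothetical helper: returns "scheme://host" followed by each path segment.
    public static List<string> Split(string url)
    {
        var uri = new Uri(url);
        var tokens = new List<string>();
        // Scheme plus authority, e.g. "https://www.linkedin.com".
        tokens.Add(uri.GetLeftPart(UriPartial.Authority));
        // "/in/someone" becomes "in", "someone"; empty entries are dropped.
        tokens.AddRange(uri.AbsolutePath.Split(
            new[] { '/' }, StringSplitOptions.RemoveEmptyEntries));
        return tokens;
    }

    public static void Main()
    {
        foreach (string token in Split("https://www.linkedin.com/in/someone"))
        {
            Console.WriteLine(token);
        }
    }
}
```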

How to remove substring after occurrence of certain characters in a string?

I have the requirement as follows:
input => "Employee.Addresses[].Address.City"
output => "Employee.Addresses[].City"
(Address is removed which is present after [].)
input => "Employee.Addresses[].Address.Lanes[].Lane.Name"
output => "Employee.Addresses[].Lanes[].Name"
(Address is removed which is present after []. and Lane is removed which is present after [].)
How to do this in C#?
private static IEnumerable<string> Filter(string input)
{
var subWords = input.Split('.');
bool skip = false;
foreach (var word in subWords)
{
if (skip)
{
skip = false;
}
else
{
yield return word;
}
if (word.EndsWith("[]"))
{
skip = true;
}
}
}
And now you use it like this:
var filtered = string.Join(".", Filter(input));
How about a regular expression?
Regex rgx = new Regex(@"(?<=\[\])\..+?(?=\.)");
string output = rgx.Replace(input, String.Empty);
Explanation:
(?<=\[\]) //positive lookbehind for the brackets
\. //match literal period
.+? //match any character at least once, but as few times as possible
(?=\.) //positive lookahead for a literal period
Your description of what you need is lacking. Please correct me if I have understood it incorrectly.
You need to find the pattern "[]." and then remove everything after this pattern until the next dot .
If this is the case, I believe using a Regular Expression could solve the problem easily.
So, the pattern "[]." can be written in a regular expression as
"\[\]\."
Then you need to find everything after this pattern until the next dot: ".*?\." (The .*? matches any character as few times as possible, i.e. it is non-greedy and stops at the first dot it finds).
So, the whole pattern would be:
var regexPattern = @"\[\]\..*?\.";
And you want to replace all matches of this pattern with "[]." (i.e. removing what was matched after the brackets, up to and including the dot).
So you call the Replace method in the Regex class:
var result = Regex.Replace(input, regexPattern, "[].");
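A small variant of the regex answers, shown only as a sketch: it uses the negated class [^.]+ instead of a lazy quantifier, so each match is limited to exactly one dot-separated segment. The class and method names (SegmentRemover, Remove) are hypothetical. A segment after [] is removed only when another dot follows it, so a trailing segment such as the final Address in "Employee.Addresses[].Address" is left alone.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SegmentRemover
{
    // After a literal "[]", match one ".Segment" provided another "." follows.
    private static readonly Regex AfterBrackets =
        new Regex(@"(?<=\[\])\.[^.]+(?=\.)");

    // Hypothetical helper: removes the segment that follows each "[]".
    public static string Remove(string input)
    {
        return AfterBrackets.Replace(input, string.Empty);
    }

    public static void Main()
    {
        Console.WriteLine(Remove("Employee.Addresses[].Address.City"));
        // Employee.Addresses[].City
        Console.WriteLine(Remove("Employee.Addresses[].Address.Lanes[].Lane.Name"));
        // Employee.Addresses[].Lanes[].Name
    }
}
```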

c# regular expressions find and extract number of given length

I have a string such as:
"12/11/2015: Liefertermin 71994 : 30.11.2015 -> 27.11.2015"
And I want to extract the substring 71994, which will always be a number of 5 digits
I have tried the following with no success:
private string FindDispo_InInfo()
{
Regex pattern = new Regex("^[0-9]{5,5}$");
Match match = pattern.Match(textBox1.Text);
string stDispo = match.Groups[0].Value;
return stDispo;
}
Replace the anchors ^ and $ with a word boundary \b and use a verbatim string literal:
Regex pattern = new Regex(@"\b[0-9]{5}\b");
And you can access the value using match.Value:
string stDispo = match.Value;
Fixed code:
private static string FindDispo_InInfo(string text)
{
Regex pattern = new Regex(@"\b[0-9]{5}\b");
Match match = pattern.Match(text);
if (match.Success)
return match.Value;
else
return string.Empty;
}
And here is a C# demo:
Console.WriteLine(FindDispo_InInfo("12/11/2015: Liefertermin 71994 : 30.11.2015 -> 27.11.2015"));
// => 71994
However, creating a Regex object inside the method on every call can be inefficient. It is better to declare it as a private static readonly field and reuse it inside the method as many times as necessary.
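The static field suggestion could be sketched like this. The class and member names (DispoFinder, FiveDigits, FindDispo) are made up for the example; the pattern is the one from the fixed code above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DispoFinder
{
    // Constructed once and reused by every call.
    private static readonly Regex FiveDigits = new Regex(@"\b[0-9]{5}\b");

    // Hypothetical helper: returns the first standalone 5-digit number, or "".
    public static string FindDispo(string text)
    {
        Match match = FiveDigits.Match(text);
        return match.Success ? match.Value : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(FindDispo("12/11/2015: Liefertermin 71994 : 30.11.2015 -> 27.11.2015"));
        // 71994
    }
}
```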
What you need is (\d{5}), which captures a run of five digits. Note that without word boundaries it will also match five digits inside a longer number.

regex replace matches with function and delete other matches

I have a string like the one below and I want to replace the FieldNN instances with the output from a function.
So far I have been able to replace the NN instances with the output from the function. But I am not sure how I can delete the static "field" portion with the same regex.
input string:
(Field30="2010002257") and Field1="yuan" not Field28="AAA"
required output:
(IncidentId="2010002257") and Author="yuan" not Recipient="AAA"
This is the code I have so far:
public string translateSearchTerm(string searchTerm) {
string result = "";
result = Regex.Replace(searchTerm.ToLower(), @"(?<=field).*?(?=\=)", delegate(Match Match) {
string fieldId = Match.ToString();
return String.Format("_{0}", getFieldName(Convert.ToInt64(fieldId)));
});
log.Info(String.Format("result={0}", result));
return result;
}
which gives:
(field_IncidentId="2010002257") and field_Author="yuan" not field_Recipient="aaa"
The issues I would like to resolve are:
Remove the static "field" prefixes from the output.
Make the regex case-insensitive on the "FieldNN" parts and not lowercase the quoted text portions.
Make the regex more robust so that the quoted string parts can use either double or single quotes.
Make the regex more robust so that spaces are ignored: FieldNN = "AAA" vs. FieldNN="AAA"
I really only need to address the first issue, the other three would be a bonus but I could probably fix those once I have discovered the right patterns for whitespace and quotes.
Update
I think the pattern below solves issues 2. and 4.
result = Regex.Replace(searchTerm, @"(?<=\b(?i:field)).*?(?=\s*\=)", delegate(Match Match)
To fix first issue use groups instead of positive lookbehind:
public string translateSearchTerm(string searchTerm) {
string result = "";
result = Regex.Replace(searchTerm.ToLower(), @"field(.*?)(?=\=)", delegate(Match Match) {
string fieldId = Match.Groups[1].Value;
return getFieldName(Convert.ToInt64(fieldId));
});
log.Info(String.Format("result={0}", result));
return result;
}
In this case the "field" prefix is part of each match, so it is replaced along with the number.
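Putting the capture-group idea together with the case and whitespace wishes from the question, a combined version might look like the sketch below. The dictionary is a hypothetical stand-in for the question's getFieldName, filled only with the three IDs visible in the sample output. The pattern matches the word "field" case-insensitively, captures the digits, and uses a lookahead so that optional spaces before = are tolerated but left in place. Because ToLower is not called, the quoted values keep their case. It makes no attempt to skip a FieldNN= sequence that happens to appear inside a quoted value.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SearchTermTranslator
{
    // Hypothetical lookup standing in for getFieldName from the question.
    private static readonly Dictionary<long, string> FieldNames =
        new Dictionary<long, string>
        {
            { 30, "IncidentId" },
            { 1, "Author" },
            { 28, "Recipient" }
        };

    // "field" in any case, then the numeric id, when followed by optional spaces and "=".
    private static readonly Regex FieldPattern =
        new Regex(@"\bfield(\d+)(?=\s*=)", RegexOptions.IgnoreCase);

    public static string Translate(string searchTerm)
    {
        return FieldPattern.Replace(searchTerm, match =>
        {
            long id = Convert.ToInt64(match.Groups[1].Value);
            string name;
            // Unknown ids are left untouched rather than throwing.
            return FieldNames.TryGetValue(id, out name) ? name : match.Value;
        });
    }

    public static void Main()
    {
        Console.WriteLine(Translate("(Field30=\"2010002257\") and Field1=\"yuan\" not Field28=\"AAA\""));
        // (IncidentId="2010002257") and Author="yuan" not Recipient="AAA"
    }
}
```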
