Custom Regex for Parsing Custom Fields in HTML String

I am sending some HTML in a hidden field, and on the server side I parse it with a regex. Currently I am able to parse
<div id="4059">asd</div>
and the code below gives me "id" in match.Groups[2] and "4059" in match.Groups[4]; "div" comes at the first index and the third is empty.
string regex2 = @"<(?<Tag_Name>(a)|div)\b[^>]*?\b(?<URL_Type>(?(1)id))\s*=\s*(?:""(?<URL>(?:\\""|[^""])*)""|'(?<URL>(?:\\'|[^'])*)')";
var matches = Regex.Matches(myDiv, regex2, RegexOptions.IgnoreCase | RegexOptions.Singleline);
var links = new List<string>();
foreach (Match item in matches)
{
if (item.Groups[2].Value == "div")
{
employee.ID = item.Groups[4].Value;
}
}
Can someone please edit this regex,
<(?<Tag_Name>(a)|div)\b[^>]*?\b(?<URL_Type>(?(1)id))\s*=\s*(?:""(?<URL>(?:\\""|[^""])*)""|'(?<URL>(?:\\'|[^'])*)')
so that I could parse
<div id="5094" fieldA="asd" fieldB="def" fieldC="ghi"></div>
and so that further fields could be added too.
I should also mention that I am working on a custom control and I CANNOT USE HTML AGILITY PACK, as its assemblies conflict when I add it to my project.

If you already know that the string contains only <div field="value" field="value" ...></div> (i.e. there's nothing but this div in the string), then just simplify your regex to pick out the field and value, and run it in a loop:
string regstr = @"\s+(?<field>[^\s=]+)\s*=\s*""(?<value>[^""]+)""";
var reg = new Regex(regstr);
var m = reg.Match(myDiv);
while (m.Success)
{
// m.Groups["field"] and m.Groups["value"] hold your field and value
// get the next match
m = m.NextMatch();
}
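A self-contained sketch of the loop above, assuming the input is a single div as in the question and that every value is double-quoted. The names AttributeParser and ParseAttributes are hypothetical, and unlike the pattern above this one also accepts empty values:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AttributeParser
{
    // Matches name="value" pairs that are preceded by whitespace.
    private static readonly Regex FieldRegex =
        new Regex(@"\s+(?<field>[^\s=]+)\s*=\s*""(?<value>[^""]*)""");

    // Collects every field/value pair found in the tag into a dictionary.
    public static Dictionary<string, string> ParseAttributes(string div)
    {
        var fields = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        for (Match m = FieldRegex.Match(div); m.Success; m = m.NextMatch())
        {
            fields[m.Groups["field"].Value] = m.Groups["value"].Value;
        }
        return fields;
    }

    public static void Main()
    {
        var fields = ParseAttributes(
            "<div id=\"5094\" fieldA=\"asd\" fieldB=\"def\" fieldC=\"ghi\"></div>");
        foreach (var pair in fields)
        {
            Console.WriteLine(pair.Key + " = " + pair.Value);
        }
    }
}
```

With a dictionary, employee.ID can be read as fields["id"], and new fields need no change to the regex.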

Related

regex variable from script 32/34 characters

With the following code I am trying to get the data from the script variable. I'm interested in the text between the double quotes:
var code = "a37965dcd8421328a767c697448ed735";
XPathResult xpathResult = geckoWebBrowser1.Document.EvaluateXPath("/html/body/table[3]/tbody/tr[1]/td[2]/script");
var foundNodes = xpathResult.GetNodes();
foreach (var node in foundNodes)
{
var x = node.TextContent; // get text text contained by this node (including children)
GeckoHtmlElement element = node as GeckoHtmlElement; //cast to access.. inner/outerHtml
string inner = element.InnerHtml;
string outer = element.OuterHtml;
String pattent = ".[0-9a-zA-Z]{34}$.";
Match match = Regex.Match(inner, pattent);
Is the regex correct? What am I doing wrong?

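Try [0-9a-zA-Z]{32,34} as your pattern instead of .[0-9a-zA-Z]{34}$.
The . at each end should be removed: each one demands an extra character, and the one after $ can never match because nothing follows the end of the input.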

You can test the pattern like this:
bool result = Regex.Match(inner, @"^[0-9a-zA-Z]{32,34}$").Success;
Console.WriteLine(result);
If result is true, the match succeeded.
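Note that a pattern anchored with ^ and $ only succeeds when the whole input is the code. If inner still holds the full script text, as in the question, a capture group between the quotes may be closer to what was asked. A minimal sketch, assuming the script has the shape var code = "..."; the names ScriptCodeExtractor and ExtractCode are hypothetical:

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptCodeExtractor
{
    // Captures 32 to 34 alphanumeric characters enclosed in double quotes.
    private static readonly Regex CodeRegex =
        new Regex(@"""(?<code>[0-9a-zA-Z]{32,34})""");

    // Returns the captured code, or null when the script contains none.
    public static string ExtractCode(string script)
    {
        Match match = CodeRegex.Match(script);
        return match.Success ? match.Groups["code"].Value : null;
    }

    public static void Main()
    {
        string script = "var code = \"a37965dcd8421328a767c697448ed735\";";
        Console.WriteLine(ExtractCode(script)); // prints a37965dcd8421328a767c697448ed735
    }
}
```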

Replace single group via RegEx in all matches

I have a text containing HTML elements, where the hyperlinks contain not URLs but IDs of the items the hyperlinks should open. Now I'm trying to get all those IDs and replace them with new IDs. The scenario is that all IDs have changed; I have a dictionary mapping "oldId -> newId" and need to apply it to the text.
This input
Some text some text <a href = "##1234"> stuff stuff stuff <a href="##9999"> xxxx
With this Dictionary mapping
1234 -> 100025
9999 -> 100026
Should generate this output
Some text some text <a href = "##100025"> stuff stuff stuff <a href="##100026"> xxxx
So far I have this:
var textContent = "...";
var regex = new Regex(@"<\s*a\s+href\s*=\s*""##(?<RefId>\d+)""\s*\\?\s*>");
var matches = regex.Matches(textContent);
foreach (var match in matches.Cast<Match>())
{
var id = -1;
if (Int32.TryParse(match.Groups["RefId"].Value, out id))
{
int newId;
// idDictionary contains the mapping from old id to new id
if (idDictionary.TryGetValue(id, out newId))
{
// Now replace the id of the current match with the new id
}
}
}
How do I replace the IDs now?

Don't parse HTML with regular expressions.
But if you must, if you're trying to perform a replacement, use the Replace method.
var updatedContent = regex.Replace(textContent, match =>
{
var id = -1;
if (Int32.TryParse(match.Groups["RefId"].Value, out id))
{
int newId;
// idDictionary contains the mapping from old id to new id
if (idDictionary.TryGetValue(id, out newId))
{
// Now replace the id of the current match with the new id
return newId.ToString();
}
}
// No change
return match.Value;
});
Edit: As you've pointed out, this replaces the entire match. Whoops.
Firstly, change your regular expression so the thing you'll be replacing is the entire match:
@"(?<=<\s*a\s+href\s*=\s*""##)(?<RefId>\d+)(?=""\s*\\?\s*>)"
This matches just a string of digits, but ensures it has the HTML tag before and after it.
It should now do what you want, but for tidiness you can replace (?<RefId>\d+) with just \d+ (as you don't need the group any more) and match.Groups["RefId"].Value with just match.Value.
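Put together, the lookaround version might look like the sketch below. It assumes the dictionary maps int to int as in the question; the names LinkIdRewriter and Rewrite are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkIdRewriter
{
    // Matches only the digits; the lookbehind and lookahead assert the surrounding tag.
    private static readonly Regex IdRegex =
        new Regex(@"(?<=<\s*a\s+href\s*=\s*""##)\d+(?=""\s*\\?\s*>)");

    // Replaces every known old id with its new id and leaves unknown ids alone.
    public static string Rewrite(string text, IDictionary<int, int> idMap)
    {
        return IdRegex.Replace(text, match =>
        {
            int oldId, newId;
            if (int.TryParse(match.Value, out oldId) && idMap.TryGetValue(oldId, out newId))
            {
                return newId.ToString();
            }
            return match.Value; // unknown id: leave unchanged
        });
    }

    public static void Main()
    {
        var idMap = new Dictionary<int, int> { { 1234, 100025 }, { 9999, 100026 } };
        string input = "Some text some text <a href = \"##1234\"> stuff stuff stuff <a href=\"##9999\"> xxxx";
        Console.WriteLine(Rewrite(input, idMap));
    }
}
```

For the sample input from the question this produces the expected output, with ##100025 and ##100026 in place of the old IDs.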

Just use a callback in Replace.
regex.Replace(textContent, delegate(Match m) {
int id = -1, newId;
if (Int32.TryParse(m.Groups["RefId"].Value, out id)) {
if (idDictionary.TryGetValue(id, out newId))
return newId.ToString();
}
return m.Value; // if TryGetValue fails, return the match
});

Unless you are pulling the new IDs from the HTML as well, I don't see why you can't just use a direct String.Replace here:
var html = "Some text some text <a href = '##1234'> stuff stuff stuff <a href='##9999'> xxxx";
var mappings = new Dictionary<string, string>()
{
{ "1234", "100025" },
{ "9999", "100026" },
...
};
foreach (var map in mappings)
{
html = html.Replace("##" + map.Key, "##" + map.Value);
}

How to get Last Index Of '\' or '//', whichever comes last?

I want to get the last index of a separator character in a URL that comes from the database, where the separator is '\' or '//'.
For example, I have strings like these:
Administration\Masters\EmployeePulseDetailsMaster.aspx
Administration/Masters/SearchKnowYourCollegues.aspx
Administration//SMS//PushSMS.aspx
I am using this code:
foreach (var item in SessionClass.UserDetails.SubModules)
{
if (Request.RawUrl.Contains(item.PageURL.Substring(item.PageURL.LastIndexOf('\\') + 1))
|| Request.RawUrl.Contains(item.PageURL.Substring(item.PageURL.LastIndexOf('/') + 1)))
{
Response.RedirectPermanent("~/Login.aspx");
}
}

You can use a regular expression to find the last occurrence of any character in a group by constructing a regular expression that looks like this:
[target-group][^target-group]*$
In your case, the target group is [/\\], so the search would look like this:
var match = Regex.Match(s, @"[/\\][^/\\]*$");
Here is a running example:
var data = new[] {
@"quick/brown/fox"
, @"jumps\over\the\lazy\dog"
, @"Administration\Masters\EmployeePulseDetailsMaster.aspx"
, @"Administration/Masters/SearchKnowYourCollegues.aspx"
, @"Administration//SMS//PushSMS.aspx"
};
foreach (var s in data) {
var m = Regex.Match(s, @"[/\\][^/\\]*$");
if (m.Success) {
Console.WriteLine(s.Substring(m.Index+1));
}
}
This prints
fox
dog
EmployeePulseDetailsMaster.aspx
SearchKnowYourCollegues.aspx
PushSMS.aspx
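If a regex is not a requirement, String.LastIndexOfAny answers the title question directly. This is a different technique from the regex above, shown as a minimal sketch; the names PathTail and LastSegment are hypothetical:

```csharp
using System;

public static class PathTail
{
    private static readonly char[] Separators = { '/', '\\' };

    // Returns the text after the last '/' or '\', or the whole string when neither occurs.
    public static string LastSegment(string path)
    {
        int index = path.LastIndexOfAny(Separators);
        return path.Substring(index + 1); // index is -1 when no separator is found
    }

    public static void Main()
    {
        Console.WriteLine(LastSegment(@"Administration\Masters\EmployeePulseDetailsMaster.aspx"));
        Console.WriteLine(LastSegment("Administration/Masters/SearchKnowYourCollegues.aspx"));
        Console.WriteLine(LastSegment("Administration//SMS//PushSMS.aspx"));
    }
}
```

Because index is -1 when no separator is present, Substring(index + 1) returns the whole string in that case, so no separate check is needed.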

I guess you want to determine whether the name of the current page is in the list SessionClass.UserDetails.SubModules. In that case I'd use Request.Url.Segments.Last() to get only the name of the current page (e.g. PushSMS.aspx) and System.IO.Path.GetFileName to get the file name of each URL. GetFileName works with / or \:
string pageName = Request.Url.Segments.Last();
bool anyMatch = SessionClass.UserDetails.SubModules
.Any(module => pageName == System.IO.Path.GetFileName(module.PageURL));
if(anyMatch) Response.RedirectPermanent("~/Login.aspx");
You need to add using System.Linq; for Enumerable.Any.

Process part of the regex match before replacing it

I'm writing a function that will parse a file similar to an XML file from a legacy system.
....
<prod pid="5" cat='gov'>bla bla</prod>
.....
<prod cat='chi'>etc etc</prod>
....
.....
I currently have this code:
buf = Regex.Replace(entry, "<prod(?:.*?)>(.*?)</prod>", "<span class='prod'>$1</span>");
Which was working fine until it was decided that we also wanted to show the categories.
The problem is, categories are optional and I need to run the category abbreviation through a SQL query to retrieve the category's full name.
eg:
SELECT * FROM cats WHERE abbr='gov'
The final output should be:
<span class='prod'>bla bla</span><span class='cat'>Government</span>
Any idea on how I could do this?
Note1: The function is done already (except this part) and working fine.
Note2: Cannot use XML libraries, regex has to be used

Regex.Replace has an overload that takes a MatchEvaluator, which is basically a Func<Match, string>. So, you can dynamically generate a replacement string.
buf = Regex.Replace(entry, @"<prod(?<attr>.*?)>(?<text>.*?)</prod>", match => {
var attrText = match.Groups["attr"].Value;
var text = match.Groups["text"].Value;
// Now, parse your attributes (the input string attrText must be passed to Matches)
var attributes = Regex.Matches(attrText, @"(?<name>\w+)\s*=\s*(['""])(?<value>.*?)\1")
.Cast<Match>()
.ToDictionary(
m => m.Groups["name"].Value,
m => m.Groups["value"].Value);
string category;
if (attributes.TryGetValue("cat", out category))
{
// Your SQL here etc...
var label = GetLabelForCategory(category);
return String.Format("<span class='prod'>{0}</span><span class='cat'>{1}</span>", WebUtility.HtmlEncode(text), WebUtility.HtmlEncode(label));
}
// Generate the result string
return String.Format("<span class='prod'>{0}</span>", WebUtility.HtmlEncode(text));
});
This should get you started.

Get href from html using mshtml in C#

I am trying to get the href link out of the following HTML code using mshtml in C# (WPF).
<a class="button_link" href="https://rhystowey.com/account/confirm_email/2842S-B2EB5-136382?t=1&sig=b0dbd522380a21007d8c375iuc583f46a90365d9&iid=am-130280753913638201274485430&ac=1&uid=1284488216&nid=18+308" style="border:none;color:#0084b4;text-decoration:none;color:#ffffff;font-size:13px;font-weight:bold;font-family:'Helvetica Neue', Helvetica, Arial, sans-serif;">Confirm your account now</a>
I have tried using the following code to make this work by using mshtml in C# (WPF) but I have failed miserably.
HTMLDocument mdoc = (HTMLDocument)browser.Document;
string innerHtml = mdoc.body.outerText;
string str = "https://rhystowey.com/account/confirm_email/";
int index = innerHtml.IndexOf(str);
innerHtml = innerHtml.Remove(0, index + str.Length);
int startIndex = innerHtml.IndexOf("\"");
string str3 = innerHtml.Remove(startIndex, innerHtml.Length - startIndex);
string thelink = "https://rhystowey.com/account/confirm_email/" + str3;
Can someone please help me to get this to work.

Use this:
var ex = new Regex("href=\"(.*)\" style");
var tag = "<a class=\"button_link\" href=\"https://rhystowey.com/account/confirm_email/2842S-B2EB5-136382?t=1&sig=b0dbd522380a21007d8c375iuc583f46a90365d9&iid=am-130280753913638201274485430&ac=1&uid=1284488216&nid=18+308\" style=\"border:none;color:#0084b4;text-decoration:none;color:#ffffff;font-size:13px;font-weight:bold;font-family:'Helvetica Neue', Helvetica, Arial, sans-serif;\">Confirm your account now</a>";
var address = ex.Match(tag).Groups[1].ToString();
But you should extend it with a check of Match.Success, because when nothing matches, Groups[1] is an unsuccessful group whose value is an empty string.
In your example
HTMLDocument mdoc = (HTMLDocument)browser.Document;
string innerHtml = mdoc.body.outerHTML; // outerText holds only the visible text, without the href attributes
var ex = new Regex("href=\"([^\"\"]+)\"");
var address = ex.Match(innerHtml).Groups[1].ToString();
will match the first href="...". Or you select all occurrences:
var matches = (from Match match in ex.Matches(innerHtml) select match.Groups[1].Value).ToList();
This will give you a List<string> with all the links in your HTML. To filter this, you can either go this way
var wantedMatches = matches.Where(m => m.StartsWith("https://rhystowey.com/account/confirm_email/"));
which is more flexible because you could check against a list of start strings or whatever. Or you do it in your regex, which should perform better:
var ex = new Regex("href=\"(https://rhystowey\\.com/account/confirm_email/[^\"\"]+)\"");
Putting it all together, as far as I understand what you want:
var ex = new Regex("href=\"(https://rhystowey\\.com/account/confirm_email/[^\"\"]+)\"");
var matches = (from Match match in ex.Matches(innerHtml)
where match.Groups.Count >= 1
select match.Groups[1].Value).ToList();
var firstAddress = matches.FirstOrDefault();
firstAddress holds your link, if there is one.

If your link will always start with the same path and isn't repeated on the page, you can use this (untested):
var match = Regex.Match(html, @"href=""(?<href>https\:\/\/rhystowey\.com\/account\/confirm_email\/[^""]+)""");
if (match.Success)
{
var href = match.Groups["href"].Value;
....
}
