How to find and replace href values on links using AngleSharp? - c#

I have a snippet of HTML that contains some links whose hrefs start with a hash symbol (#), like the following:
Getting Started
I'm new to AngleSharp and am trying to use it to find these links, replace their hrefs with new values, and then return the updated HTML markup.

The beauty of AngleSharp is that you can essentially fall back to any JS solution - as AngleSharp exposes the W3C DOM API (which is also used by JS). All you'd need to do is replace certain camelCase with PascalCase and use standard .NET tools instead of things from JS.
Let's take for instance How to Change All Links with javascript (sorry, was the first hit on my Google search) and use this as a starting point.
var context = BrowsingContext.New(Configuration.Default);
var document = await context.OpenAsync(res => res.Content(""));
var anchors = document.GetElementsByTagName("a");
for (var i = 0; i < anchors.Length; i++)
{
    var anchor = anchors[i] as IHtmlAnchorElement;
    anchor.Href = "http://example.com/?redirect=" + anchor.Href;
}
So in our case we are not interested in the same transformation, but quite a similar one. We could do:
for (var i = 0; i < anchors.Length; i++)
{
    var anchor = anchors[i] as IHtmlAnchorElement;
    if (anchor.GetAttribute("href")?.StartsWith("#") ?? false)
    {
        anchor.Href = "your-new-value";
    }
}
The reason is that Href is always normalized (i.e., a full URL), so an attribute value of "#foo" may look like "http://example.com/path#foo". By looking at the raw attribute value instead, we can rely on the value still starting with the hash symbol.

Related

Extracting only tags from html text file

I'm working on a steganography method which hides text within HTML tags.
For example, given this tag: <heEAd>, I have to extract every character within the tag and then
analyze the case of each letter: if it is a capital, the bit is set to 1, otherwise 0. I also want to check the end, i.e. whether it sees the matching closing /head tag.
Here is the code:
WebClient client = new WebClient();
String htmlCode = client.DownloadString("url");
String Tags = "";
for (int i = 0; i < htmlCode.Length; i++) {
    if (htmlCode[i] == '<') {
        if (htmlCode[i] == '>')
            continue;
        else {
            Tags += htmlCode[i];
        }
    }
}
That logic is terrible, but how do I use IndexOf and LastIndexOf to get the desired substring? I tried to use them, but I'm missing something because of my limited knowledge of C#.
I think you need to use a regex.
I once tried to do this with Substring and it was a lot of work. Later I decided to use a regex, and it was much easier than the first approach.
var regex = new Regex(@"(?<=<head>).*(?=</head>)");
return regex.Matches(strInput);
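As a minimal sketch of the bit extraction the question describes (not part of the original answer), a regex can capture each opening tag's name and a second step can map letter case to bits. The helper names OpeningTagNames and CaseBits are hypothetical, the sample string is made up, and the code assumes C# 9 top-level statements. Closing tags are ignored here; checking them against their opening tags would be a separate step.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the names of all opening tags, keeping their letter case.
string[] OpeningTagNames(string html) =>
    Regex.Matches(html, @"<([A-Za-z][A-Za-z0-9]*)[^>]*>")
         .Cast<Match>()
         .Select(m => m.Groups[1].Value)
         .ToArray();

// Hypothetical helper: maps each letter to a bit, uppercase -> '1', lowercase -> '0'.
// Non-letter characters (such as the digit in h1) are skipped.
string CaseBits(string tagName) =>
    new string(tagName.Where(c => char.IsLetter(c))
                      .Select(c => char.IsUpper(c) ? '1' : '0')
                      .ToArray());

string sample = "<heEAd><TiTle>demo</TiTle></heEAd>";
foreach (var name in OpeningTagNames(sample))
    Console.WriteLine($"{name} -> {CaseBits(name)}");
// prints:
// heEAd -> 00110
// TiTle -> 10100
```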

Using C# and Regex to find and surround all words and numbers within some html text with a span

I need to surround every word in loaded HTML text with a span which will uniquely identify every word. The problem is that some content is not being handled by my regex pattern. My current problems include...
1) Special HTML character entities like &rdquo; and &ldquo; are treated as words.
2) Currency values, e.g. $2,500, end up as "2" "500" (I need "$2,500").
3) Double-hyphenated words, e.g. one-legged-man, end up as "one-legged" "man".
I'm new to regular expressions and, after looking at various other posts, have derived the following pattern, which seems to work for everything except the above exceptions.
What I have so far is:
string pattern = @"(?<!<[^>]*?)\b('\w+)|(\w+['-]\w+)|(\w+')|(\w+)\b(?![^<]*?>)";
string newText = Regex.Replace(oldText, pattern, delegate(Match m) {
    wordCnt++;
    return "<span data-wordno='" + wordCnt.ToString() + "'>" + m.Value + "</span>";
});
How can I fix/extend the above pattern to cater for these problems, or should I be using a different approach altogether?
A fundamental problem that you're up against here is that HTML is not a "regular language". This means that HTML is complex enough that, whatever regular expression you write, it is always possible to come up with valid HTML that the expression handles incorrectly. It isn't a matter of writing a better regular expression; this is a problem that regex can't solve.
What you need is a dedicated html parser. You could try this nuget package. There are many others, but HtmlAgilityPack is quite popular.
Edit: Below is an example program using HtmlAgilityPack. When an HTML document is parsed, the result is a tree (aka the DOM). In the DOM, text is stored inside text nodes. So something like <p>Hello World</p> is parsed into a node that represents the p tag, with a child text node that holds the "Hello World". So what you want to do is find all the text nodes in your document and then, for each node, split the text into words and surround the words with spans.
You can search for all the text nodes using an xpath query. The xpath I have below is /html/body//*[not(self::script)]/text(), which avoids the html head and any script tags in the body.
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        var doc = new HtmlDocument();
        doc.Load(args[0]);
        var wordCount = 0;
        // Select every text node in the body, skipping text directly inside script tags
        var nodes = doc.DocumentNode
            .SelectNodes("/html/body//*[not(self::script)]/text()");
        foreach (var node in nodes)
        {
            var words = node.InnerHtml.Split(' ');
            var surroundedWords = words.Select(word =>
            {
                if (String.IsNullOrWhiteSpace(word))
                {
                    return word;
                }
                else
                {
                    return $"<span data-wordno={wordCount++}>{word}</span>";
                }
            });
            // Join with a space to restore the separators removed by Split
            var newInnerHtml = String.Join(" ", surroundedWords);
            node.InnerHtml = newInnerHtml;
        }
        Console.WriteLine(doc.DocumentNode.InnerHtml);
    }
}
Fix 1) by adding negative look-behind assertions, (?<!\&). I believe they are needed at the beginning of the 1st, 3rd, and 4th alternatives in the original pattern above.
Fix 2) by adding a new alternative, |(\$?(\d+[,.])+\d+), at the end of the pattern. This also handles non-dollar and decimal-pointed numbers at the same time.
Fix 3) by enhancing the (\w+['-]\w+) alternative to read instead ((\w+['-])+\w+).
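The three fixes can be illustrated with a simplified pattern of my own, not the exact pattern from the question or from the fixes above. It is a sketch that assumes the input is a tag-free text fragment (for example, the content of a single text node), so the tag look-arounds are left out; it also does not handle words with a leading or trailing apostrophe. The name WrapWords is hypothetical, and the code assumes C# 9 top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: wraps every word of a tag-free text fragment in a numbered span.
string WrapWords(string text)
{
    // Assumed pattern, built for this sketch:
    //   (?<![&#\w])                a token may not follow '&', '#', or a word character,
    //                              which skips entities such as &rdquo; or &#8221;
    //   \$?\d+(?:[,.]\d+)*(?!\w)   numbers and currency values such as $2,500 or 3.14
    //   \w+(?:['-]\w+)*            words, including one-legged-man and don't
    const string pattern = @"(?<![&#\w])(?:\$?\d+(?:[,.]\d+)*(?!\w)|\w+(?:['-]\w+)*)";

    int wordCount = 0;
    return Regex.Replace(text, pattern, m =>
        $"<span data-wordno='{++wordCount}'>{m.Value}</span>");
}

Console.WriteLine(WrapWords("&ldquo;A one-legged-man paid $2,500&rdquo;"));
```

With this input, the two entities are left untouched and four spans are produced, for A, one-legged-man, paid, and $2,500.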

Scrape HTML for label then value in separate DIV tags

I am scraping a database of products and I am able to get all the HTML and retrieve most values as they have some unique items. However I am stuck on some areas that have common tags.
Example:
<div class="label">Name:</div><div class="value">John</div>
<div class="label">Age:</div><div class="value">24</div>
Any ideas on how I could get those labels and associated values?
I am using HTMLAgilityPack for the rest if there is something in there that may help.
Use XPath to get the divs with class "label" and the divs with class "value":
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(yourHtml);
Dictionary<string, string> dict = new Dictionary<string, string>();
// Pair every div with class "label" with the div with class "value" at the same position
int cnt = 1;
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='label']"))
{
    // The parentheses make [n] select the n-th match in document order
    var val = doc.DocumentNode.SelectSingleNode("(//div[@class='value'])[" + cnt + "]").InnerText;
    cnt++; // advance for every label so labels and values stay aligned
    if (!dict.ContainsKey(node.InnerText)) // dictionary takes unique keys only
    {
        dict.Add(node.InnerText, val);
    }
}
You could try this:
Int32 endingIndex;
var Name1 = GetTextBetween(yourHtml, "<div class=\"label\">", "</div><div class=\"value\">", 0, out endingIndex);
var Value1 = GetTextBetween(yourHtml, "<div class=\"value\">", "</div>", endingIndex, out endingIndex);
var Name2 = GetTextBetween(yourHtml, "<div class=\"label\">", "</div><div class=\"value\">", endingIndex, out endingIndex);
var Value2 = GetTextBetween(yourHtml, "<div class=\"value\">", "</div>", endingIndex, out endingIndex);
// Returns the text between startText and endText, searching from startIndex.
// indexOfEndText receives the absolute position of endText, so it can seed the next call.
// No error handling: both markers are assumed to be present.
public static String GetTextBetween(String allDataToParse, String startText, String endText, Int32 startIndex, out Int32 indexOfEndText)
{
    var contentStart = allDataToParse.IndexOf(startText, startIndex) + startText.Length;
    indexOfEndText = allDataToParse.IndexOf(endText, contentStart);
    return allDataToParse.Substring(contentStart, indexOfEndText - contentStart);
}
Although XPath always sounds like a great idea, when you're scraping data you can't rely on the HTML to be well formed. Many webpages break their HTML regularly to make scraping harder. Even though Mark's code looks awkward, it's actually more robust in some cases.
As sad as it sounds, you can only rely on consistency in the target document when the provider has proven reliable over a long period of time. Ideally, I'd use a regular expression to search specifically for the tags I want. Here's a good starting point:
Regular expression for extracting tag attributes
Unfortunately, only you know the exact quirks of the document you're working on. A simple solution, like the one Mark proposes, will likely work if the page you're viewing is reliable. And frankly, it's less likely to be fragile and crash unexpectedly.
If you use the HTML document parsing code that HatSoft suggests, your program may work great on most documents, but in my experience websites will throw errors randomly, change their layout unexpectedly, or sometimes your network code will only receive a partial string. Perhaps this is okay, but I'd suggest you try both approaches and see what is more reliable for you.

extract text from <p>...</p> tag or directly from an HTML file

I have an HTML page that contains some filenames that I want to download from a web server.
I need to read these filenames in order to create a list that will be passed to my web application, which downloads the files from the server. These filenames have some extension.
I have dug into this topic but haven't found anything except:
Regex can't be used to parse HTML.
Use HTML Agility Pack
Is there no other way to search an HTML file for text that matches a pattern like filename.ext?
Sample HTML that contains filename -
<p class=3DMsoNormal style=3D'margin-top:0in;margin-right:0in;margin-bottom=:0in; margin-left:1.5in;margin-bottom:.0001pt;text-indent:-.25in;line-height:normal;mso-list:l1 level3 lfo8;tab-stops:list 1.5in'><![if !supportLists]> <span style=3D'font-family:"Times New Roman","serif";mso-fareast-font-family:"Times New Roman"'><span style=3D'mso-list:Ignore'>1.<span style=3D'font:7.0pt "Times New Roman"'>
</span></span></span><![endif]><span style=3D'font-family:"Times New Roman","serif"; mso-fareast-font-family:"Times New Roman"'>**13572_PostAccountingReport_2009-06-03.acc**<o:p></o:p></span></p>
I can't use HTML Agility Pack because I'm not allowed to download or use any external application or tool.
Can't this be achieved with some other logic?
This is what I have done so far:
string pageSource = "";
string geturl = @"C:\Documents and Settings\NASD_Download.mht";
WebRequest getRequest = WebRequest.Create(geturl);
WebResponse getResponse = getRequest.GetResponse();
using (StreamReader sr = new StreamReader(getResponse.GetResponseStream()))
{
    pageSource = sr.ReadToEnd();
    pageSource.Replace("=", "");
}
var fileNames = from Match m in Regex.Matches(pageSource, @"[0-9]+_+[A-Za-z]+_+[0-9]+-+[0-9]+-+[0-9]+.+[a-z]")
                select m.Value;
foreach (var s in fileNames)
    Response.Write(s);
Because of some "=" characters occurring in every file name, I'm not able to get the filename. How can I remove the occurrences of "=" in the pageSource string?
Thanks in advance
Akhil
Well, knowing that regex aren't ideal to find values in HTML:
var files = [];
var p = document.getElementsByTagName('p');
for (var i = 0; i < p.length; i++) {
    var match = p[i].innerHTML.match(/\s(\S+\.ext)\s/);
    if (match)
        files.push(match[1]);
}
Note:
Read the comments to the question.
If the extension can be anything, you can use this:
var files = [];
var p = document.getElementsByTagName('p');
for (var i = 0; i < p.length; i++) {
    var match = p[i].innerHTML.match(/\b(\S+\.\S+)\b/);
    console.log(match);
    if (match)
        files.push(match[1]);
}
document.getElementById('result').innerHTML = files + "";
But this is really not reliable.
Well, you can use regular expressions to extract stuff that looks like file names. Since, as you correctly point out, regular expressions do not parse HTML, you might get false positives, i.e., you might get results that look like file names but are not.
Let's take an example:
string html = @"<p class=3DMsoNormal ...etc...";
var fileNames = from Match m in Regex.Matches(html, @"\b[A-Za-z0-9_-]+\.[A-Za-z0-9_-]{3}\b")
                select m.Value;
foreach (var s in fileNames)
    Console.WriteLine(s);
Console.ReadLine();
This will return
1.5in
1.5in
7.0pt
13572_PostAccountingReport_2009-06-03.acc
You see, HTML stuff that looks like a file name will be returned. Of course, you could refine the regular expression (for example, replace + with {3,}, so that at least three characters are required for the part before the dot) so that the false positives in this example are filtered out. Still, it's always going to be an approximate result, not an exact one.
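A minimal sketch of that refinement, combined with one way to deal with the "=" characters from the question. It assumes those characters are quoted-printable artifacts, as is usual in .mht files: a "=" at the end of a line is a soft line break and "=3D" stands for a literal "=". The helper names and the shortened sample string are hypothetical, and the code assumes C# 9 top-level statements.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: undoes soft line breaks and "=3D" only.
// This is not a full quoted-printable decoder.
string UndoQuotedPrintable(string text) =>
    text.Replace("=\r\n", "").Replace("=\n", "").Replace("=3D", "=");

// Requires at least three characters before the dot and exactly three after it.
string[] FindFileNames(string html) =>
    Regex.Matches(html, @"\b[A-Za-z0-9_-]{3,}\.[A-Za-z0-9_-]{3}\b")
         .Cast<Match>()
         .Select(m => m.Value)
         .ToArray();

// Made-up sample with a soft line break inside the file name.
string raw = "<p class=3DMsoNormal style=3D'margin-left:1.5in;font:7.0pt'>" +
             "13572_PostAccounting=\r\nReport_2009-06-03.acc<o:p></o:p></p>";

// Strings are immutable: Replace returns a new string, so the result must be kept.
string cleaned = UndoQuotedPrintable(raw);
foreach (var name in FindFileNames(cleaned))
    Console.WriteLine(name); // prints 13572_PostAccountingReport_2009-06-03.acc
```

Note that calling pageSource.Replace("=", "") without assigning the result leaves pageSource unchanged, which is one reason the "=" characters survive in the question's code.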
It may be impossible to get the file names using a generic pattern because of values such as 1.5in, -.25in, 7.0pt and the like. Try to be more specific (if possible), for example:
/[a-z0-9_-]+\.[a-z]+/gi or
/>[a-z0-9_-]+\.[a-z]+</gi (markup included) or even
/>\d+_PostAccountingReport_\d+-\d+-\d+\.[a-z]+</gi

Find/parse server-side <?abc?>-like tags in html document

I guess I need some regex help. I want to find all tags like <?abc?> so that I can replace each one with the result of the code run inside it. I just need help regexing the tag/code string, not parsing the code inside :p.
<b><?abc print 'test' ?></b> would result in <b>test</b>
Edit: Not specifically but in general, matching (<?[chars] (code group) ?>)
This will build up a new copy of the string source, replacing <?abc code?> with the result of process(code):
Regex abcTagRegex = new Regex(@"\<\?abc(?<code>.*?)\?>");
StringBuilder newSource = new StringBuilder();
int curPos = 0;
foreach (Match abcTagMatch in abcTagRegex.Matches(source)) {
    string code = abcTagMatch.Groups["code"].Value;
    string result = process(code);
    // Copy the text between the previous tag and this one (second argument is a length)
    newSource.Append(source.Substring(curPos, abcTagMatch.Index - curPos));
    newSource.Append(result);
    curPos = abcTagMatch.Index + abcTagMatch.Length;
}
newSource.Append(source.Substring(curPos));
source = newSource.ToString();
N.B. I've not been able to test this code, so some of the functions may be slightly the wrong name, or there may be some off-by-one errors.
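The same replacement can also be sketched with Regex.Replace and a match evaluator, which takes care of the position bookkeeping. This is an alternative to the loop above, not the answerer's code. Process is a hypothetical stand-in for the real interpreter that only understands print '...', and the code assumes C# 9 top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the real interpreter: it only understands print '...'.
string Process(string code)
{
    var print = Regex.Match(code, @"print\s+'([^']*)'");
    return print.Success ? print.Groups[1].Value : string.Empty;
}

// Replaces every <?abc ... ?> tag with the result of running its code.
// Singleline lets the code inside a tag span several lines.
string ExpandTags(string source) =>
    Regex.Replace(source, @"<\?abc(?<code>.*?)\?>",
                  m => Process(m.Groups["code"].Value),
                  RegexOptions.Singleline);

Console.WriteLine(ExpandTags("<b><?abc print 'test' ?></b>")); // prints <b>test</b>
```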
var regex = new Regex(@"<\?(\w+) (\w+) (.+?)\?>");
This will take this source
<b><?abc print 'test' ?></b>
and break it up like this:
Value: <?abc print 'test' ?>
SubMatch: abc
SubMatch: print
SubMatch: 'test'
These can then be sent to a method that handles them differently depending on what the parts are.
If you need more advanced syntax handling, you need to go beyond regex, I believe.
I designed a template engine using Antlr, but that's way more complex ;)
var exp = new Regex(@"<\?abc print '(.+)' \?>");
str = exp.Replace(str, "$1");
Something like this should do the trick. Change the regexes as you see fit.
