Add querystring to all anchor links in HTML body - c#

In C# given a string which contains HTML what is the best way to automatically add the query string data test=1 to the end of every hyperlink? It should only modify the url inside the href attribute for anchor links (eg not do it for image urls etc).
An example would be:
Input
Visit <a href="http://www.test.com">http://www.test.com</a> today
and see what deals we have.
Output
Visit <a href="http://www.test.com?test=1">http://www.test.com</a> today
and see what deals we have.
This seems to be a bit tricky, and I'm not sure where the best place to start would be. Any help appreciated!

HTML Agility Pack is a very fine library for parsing HTML.
A sample that gets all of the text in an HTML document:
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("your path (local or web)");
var result = doc.DocumentNode.SelectNodes("//body//text()"); // returns an HtmlNodeCollection
foreach (var node in result)
{
    string text = node.InnerText; // the text you want
}
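For the original question, one approach is to select every anchor that has an href via HtmlAgilityPack and rewrite that attribute. Below is a minimal sketch: the query-string logic is plain .NET, `AppendTestParam` is a hypothetical helper name, and the HtmlAgilityPack wiring (which needs the NuGet package) is shown in comments.

```csharp
using System;

static class LinkRewriter
{
    // Append test=1 to a URL, respecting any query string already present.
    public static string AppendTestParam(string href)
    {
        if (string.IsNullOrEmpty(href)) return href;
        var separator = href.Contains("?") ? "&" : "?";
        return href + separator + "test=1";
    }
}

// With HtmlAgilityPack (NuGet) the helper would be applied roughly like this:
//
//   var doc = new HtmlAgilityPack.HtmlDocument();
//   doc.LoadHtml(htmlString);
//   var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
//   if (anchors != null)
//       foreach (var a in anchors)
//           a.SetAttributeValue("href", LinkRewriter.AppendTestParam(a.GetAttributeValue("href", "")));
//   var rewritten = doc.DocumentNode.OuterHtml;
```

Note that SelectNodes returns null rather than an empty collection when nothing matches, hence the null check before the loop.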


How to fix Improper Neutralization of Script-Related HTML Tags in a Web Page (Basic XSS)?
@Html.Raw(Model.FooterHtml)
Without seeing an explicit example of an HTML string you'd want to sanitize and the anticipated output post-sanitization, I can only provide a general suggestion that you leverage an HTML sanitization library.
It's a good idea to sanitize raw HTML when you receive it and before you store it, but if you're about to render HTML that is untrusted and has already been stored, you can perform sanitization in your controller when you generate your model and before you return it to your view.
https://github.com/mganss/HtmlSanitizer
Usage
Install the HtmlSanitizer NuGet package. Then:
var sanitizer = new HtmlSanitizer();
var html = @"<script>alert('xss')</script><div onload=""alert('xss')"""
    + @" style=""background-color: test"">Test<img src=""test.gif"""
    + @" style=""background-image: url(javascript:alert('xss')); margin: 10px""></div>";
var sanitized = sanitizer.Sanitize(html, "http://www.example.com");
Assert.That(sanitized, Is.EqualTo(@"<div style=""background-color: test"">"
    + @"Test<img style=""margin: 10px"" src=""http://www.example.com/test.gif""></div>"));
The above library offers a demo at https://xss.ganss.org/
and a Fiddle at https://dotnetfiddle.net/892nOk

How to extract JSON embedded on a HTML page using C#

The JSON I wish to use is embedded in an HTML page. Within a <script> tag on the page there is a statement:
<script>
jsonRAW = {... heaps of JSON... }
Is there a parser to extract this from the HTML? I have looked at Json.NET, but it expects the JSON to be reasonably well formed on its own.
You can try HTML Agility Pack, which can be downloaded as a NuGet package.
After installing it, here is a tutorial on how to use HTML Agility Pack.
The link has more info, but in code it works like this:
var urlLink = "http://www.google.com/jsonPage"; // 1. Specify the url where the JSON is to be read.
var web = new HtmlWeb();                        // 2. Init the HtmlWeb client.
var doc = web.Load(urlLink);                    // 3. Load the url.
if (doc.ParseErrors != null)
{
    // 4. Check for any errors and deal with them.
}
doc.DocumentNode.SelectSingleNode(""); // 5. Access the DOM via an XPath expression.
There are other things in between but this should get you started.
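Once you have the raw script text (for example from doc.DocumentNode.SelectSingleNode("//script").InnerText), you can cut out the assignment's balanced-brace span and hand it to a JSON parser. A sketch under the question's jsonRAW naming; the helper name and its simplification (it does not skip braces inside string literals) are assumptions of mine.

```csharp
using System;

static class EmbeddedJson
{
    // Returns the balanced { ... } block that follows "jsonRAW", or null.
    // Simplification: braces inside JSON string literals are not skipped.
    public static string Extract(string scriptText)
    {
        var marker = scriptText.IndexOf("jsonRAW", StringComparison.Ordinal);
        if (marker < 0) return null;
        var start = scriptText.IndexOf('{', marker);
        if (start < 0) return null;

        var depth = 0;
        for (var i = start; i < scriptText.Length; i++)
        {
            if (scriptText[i] == '{') depth++;
            else if (scriptText[i] == '}' && --depth == 0)
                return scriptText.Substring(start, i - start + 1);
        }
        return null; // braces never balanced
    }
}
```

The extracted string is ordinary JSON, so Json.NET's JObject.Parse (or System.Text.Json's JsonDocument.Parse) can take it from there.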

C# Html parsing

I'm trying to parse HTML in my C# project, without success. I am using the HtmlAgilityPack library to do so; I can get some of the HTML body text, but not all of it, for some reason.
I need to grab the div with the ID latestPriceSection and filter it down to the USD value from https://www.monero.how/widget
My function (doesn't work)
public void getXMRRate()
{
    HtmlWeb web = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument document = web.Load("https://www.monero.how/widget");
    HtmlNode[] nodes = document.DocumentNode.SelectNodes("//a").Where(x => x.InnerHtml.Contains("latestPriceSection")).ToArray();
    foreach (HtmlNode item in nodes)
    {
        Console.WriteLine(item.InnerHtml);
    }
}
Your function doesn't work because the widget is populated by script: the div contains nothing when the page first loads, so you can't scrape this information with HAP. Find a web service that can give you the information you need.
Alternatively, you can use Selenium to get the HTML after the page has run its scripts. Or use the WebBrowser class, but that requires a Windows Forms application where a form hosts the WebBrowser control.
You need to retrieve the JSON data from https://www.monero.how/widgetLive.json, because the widget uses this resource in an Ajax request.
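A sketch of consuming that endpoint: the fetch itself is left in a comment (it needs a network call), and the parsing uses System.Text.Json. The property name "usd" is an assumption here; inspect the actual widgetLive.json payload for the real field names.

```csharp
using System;
using System.Text.Json;

static class MoneroRate
{
    // Pull a USD rate out of a JSON payload. "usd" is an assumed property
    // name; check the real widgetLive.json response for the actual keys.
    public static decimal ParseUsd(string json)
    {
        using var doc = JsonDocument.Parse(json);
        return doc.RootElement.GetProperty("usd").GetDecimal();
    }
}

// The fetch itself (not run here) would be something like:
//   using var http = new System.Net.Http.HttpClient();
//   var json = await http.GetStringAsync("https://www.monero.how/widgetLive.json");
//   var rate = MoneroRate.ParseUsd(json);
```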

Parse webpage with Fragment identifier in URL, using HTML Agility Pack

I want to parse a webpage with a fragment identifier (#) in the URL, e.g. http://steamcommunity.com/market/search?q=appid%3A570+uncommon#p4
When I use my browser (Google Chrome), I get a different result for each identifier (#p1, #p2, #p3), but when I use HTML Agility Pack, I always get the first page, regardless of the page identifier.
string sURL = "http://steamcommunity.com/market/search?q=appid%3A570+uncommon#p";
var wClient = new WebClient();
var html = new HtmlAgilityPack.HtmlDocument();
html.LoadHtml(wClient.DownloadString(sURL + i));
I understand that something like Ajax is used here and that, in fact, only one page exists. How can I fix this and get results from the other pages using C#?
Like David said,
use URL : http://steamcommunity.com/market/search/render/?query=appid%3A570%20uncommon&search_descriptions=0&start=30&count=10
where start is the index of the first item and count is the number of items you want.
The result is JSON, so, stating the obvious, you only want to use its results_html field.
Side note: in your Chrome browser, press F12, click the Network tab, and you will see the request and response being made.
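Paging then reduces to varying start while keeping count fixed. A small sketch of building the request URL from the answer's parameters (the helper name is made up):

```csharp
using System;

static class SteamMarket
{
    // Build the paged render URL from the answer: start is the index of the
    // first item, count the number of items per request.
    public static string BuildPageUrl(int start, int count) =>
        "http://steamcommunity.com/market/search/render/" +
        $"?query=appid%3A570%20uncommon&search_descriptions=0&start={start}&count={count}";
}

// Each response is JSON; the HTML fragment sits in its "results_html"
// property, which can then be fed to HtmlDocument.LoadHtml(...).
```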

What is the best way to get all webpage links with Html Agility Pack?

I am trying to get all links from a webpage with Html Agility Pack. After sending the web URL (cnn.com) I get this list (returned by the Html Agility Pack class):
What is the best way to get all of this page's links, given that some of those links start with "/" and not with the page address?
That's what I use in cases like these:
protected Uri GetAbsoluteUri(string linkUri)
{
    var uri = new Uri(linkUri, UriKind.RelativeOrAbsolute);
    return uri.IsAbsoluteUri ? uri : new Uri(PageUri, uri);
}
The code above assumes that:
linkUri is the value of an anchor's href attribute
PageUri is a System.Uri object that represents the Absolute Uri of the current page
Those links that don't start with http:// are relative to the current address (http://cnn.com), so you can prepend it to get the full address. As for links that represent JavaScript functions, there's not much HTML Agility Pack can do, since it only parses HTML.
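The resolution described above is exactly what System.Uri's two-argument constructor does; a quick illustration using the cnn.com example (the /world path is made up):

```csharp
using System;

var page = new Uri("http://cnn.com/");                                 // absolute page address
var link = new Uri("/world/article.html", UriKind.RelativeOrAbsolute); // an href value
var absolute = link.IsAbsoluteUri ? link : new Uri(page, link);
Console.WriteLine(absolute); // http://cnn.com/world/article.html
```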
