How to extract JSON embedded on a HTML page using C#

The JSON I wish to use is embedded in an HTML page. Within a script tag on the page there is a statement:
<script>
jsonRAW = {... heaps of JSON... }
Is there a parser to extract this from HTML? I have looked at Json.NET, but it requires its input to be reasonably well-formed JSON.

You can try the HTML Agility Pack, which can be installed as a NuGet package.
After installing it, see the HTML Agility Pack tutorial for more information; in code it works like this:
using System.Linq;
using HtmlAgilityPack;

var urlLink = "http://www.google.com/jsonPage"; // 1. URL of the page that embeds the JSON.
var web = new HtmlWeb(); // 2. Create the HTML Agility Pack web client.
var doc = web.Load(urlLink); // 3. Download and parse the page.
if (doc.ParseErrors != null && doc.ParseErrors.Any()) { // 4. Check for parse errors and deal with them.
}
// 5. Query the DOM with XPath, e.g. the script element whose text mentions jsonRAW.
var scriptNode = doc.DocumentNode.SelectSingleNode("//script[contains(., 'jsonRAW')]");
There are other things in between but this should get you started.
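One of the "things in between" is cutting the object literal out of the script text once you have it (for example, from the InnerText of the node returned by SelectSingleNode). The following is a minimal sketch of one way to do that, not the only one. It assumes the literal is valid JSON with double-quoted keys and strings, and it uses System.Text.Json rather than Json.NET. The EmbeddedJson class and TryExtractObject method are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.Text.Json;

public static class EmbeddedJson
{
    // Finds variableName in the script text and copies the first balanced {...}
    // block that follows it. Braces inside double-quoted strings are ignored.
    public static bool TryExtractObject(string script, string variableName, out string json)
    {
        json = string.Empty;

        int nameIndex = script.IndexOf(variableName, StringComparison.Ordinal);
        if (nameIndex < 0) return false;

        int start = script.IndexOf('{', nameIndex);
        if (start < 0) return false;

        int depth = 0;
        bool inString = false;
        bool escaped = false;

        for (int i = start; i < script.Length; i++)
        {
            char c = script[i];
            if (inString)
            {
                // Track backslash escapes so that \" does not end the string.
                if (escaped) escaped = false;
                else if (c == '\\') escaped = true;
                else if (c == '"') inString = false;
            }
            else if (c == '"') inString = true;
            else if (c == '{') depth++;
            else if (c == '}')
            {
                depth--;
                if (depth == 0)
                {
                    json = script.Substring(start, i - start + 1);
                    return true;
                }
            }
        }
        return false; // The braces never balanced.
    }

    public static void Main()
    {
        // Sample script text; a real page would supply this instead.
        string script = "jsonRAW = {\"name\": \"a}b\", \"items\": [1, 2, 3]};";
        if (TryExtractObject(script, "jsonRAW", out string json))
        {
            using JsonDocument parsed = JsonDocument.Parse(json);
            Console.WriteLine(parsed.RootElement.GetProperty("items").GetArrayLength()); // prints 3
        }
    }
}
```

If the embedded literal uses JavaScript-only syntax such as single-quoted strings or unquoted keys, System.Text.Json will reject it and a more tolerant parser would be needed.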

Related

How to fix Improper Neutralization of Script-Related HTML Tags in a Web Page (Basic XSS)?

@Html.Raw(Model.FooterHtml)

Without seeing an explicit example of an HTML string you'd want to sanitize and the anticipated output post-sanitization, I can only provide a general suggestion that you leverage an HTML sanitization library.
It's a good idea to sanitize raw HTML when you receive it and before you store it, but if you're about to render HTML that is untrusted and has already been stored, you can perform sanitization in your controller when you generate your model and before you return it to your view.
https://github.com/mganss/HtmlSanitizer
Usage
Install the HtmlSanitizer NuGet package. Then:
var sanitizer = new HtmlSanitizer();
var html = @"<script>alert('xss')</script><div onload=""alert('xss')"""
    + @"style=""background-color: test"">Test<img src=""test.gif"""
    + @"style=""background-image: url(javascript:alert('xss')); margin: 10px""></div>";
var sanitized = sanitizer.Sanitize(html, "http://www.example.com");
Assert.That(sanitized, Is.EqualTo(@"<div style=""background-color: test"">"
    + @"Test<img style=""margin: 10px"" src=""http://www.example.com/test.gif""></div>"));
The above library offers a demo at https://xss.ganss.org/
and a Fiddle at https://dotnetfiddle.net/892nOk

C# Html parsing

I'm trying to parse HTML in my C# project, without success. I am using the HtmlAgilityPack library, and I can get some of the HTML body text but not all of it.
I need to grab the div with the ID latestPriceSection from https://www.monero.how/widget and extract the USD value.
My function (doesn't work):
public void getXMRRate()
{
    HtmlWeb web = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument document = web.Load("https://www.monero.how/widget");
    HtmlNode[] nodes = document.DocumentNode.SelectNodes("//a").Where(x => x.InnerHtml.Contains("latestPriceSection")).ToArray();
    foreach (HtmlNode item in nodes)
    {
        Console.WriteLine(item.InnerHtml);
    }
}

Your function doesn't work because the widget is populated by a script. The div is empty when the page is first loaded, so you can't use HAP to scrape this information. Find a web service that can give you the data you need.
Alternatively, you can use Selenium to get the HTML after the page's script has run. Or you can use the WebBrowser class, but that requires a Windows Forms application whose form hosts the WebBrowser control.

You need to retrieve the JSON data from https://www.monero.how/widgetLive.json, because the widget loads its values from this resource with an Ajax request.
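As a rough sketch of that approach, the snippet below downloads the resource with HttpClient and lists whatever top-level fields it contains. The schema of the feed is not shown in this thread, so the code deliberately assumes no field names; you would pick out the USD value once you have seen the actual output. The WidgetFeed class and ListTopLevelFields method are hypothetical names used only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

public static class WidgetFeed
{
    // Returns each top-level property of a JSON object as "name = raw JSON value".
    // Returns an empty list when the root is not an object.
    public static List<string> ListTopLevelFields(string json)
    {
        var fields = new List<string>();
        using JsonDocument doc = JsonDocument.Parse(json);
        if (doc.RootElement.ValueKind != JsonValueKind.Object) return fields;

        foreach (JsonProperty property in doc.RootElement.EnumerateObject())
        {
            fields.Add(property.Name + " = " + property.Value.GetRawText());
        }
        return fields;
    }

    public static async Task Main()
    {
        using var client = new HttpClient();
        // URL taken from the answer above; its response format is an unknown here.
        string json = await client.GetStringAsync("https://www.monero.how/widgetLive.json");
        foreach (string field in ListTopLevelFields(json))
        {
            Console.WriteLine(field);
        }
    }
}
```

Keeping the parsing in its own method means it can be tried against a saved copy of the response without touching the network.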

Download HTML doesn't retrieve dynamically generated elements

For documentation purposes I need the HTML of each "...Controller" element on the following website:
https://apipivotdev.azurewebsites.net/swagger/ui/index
This doesn't work:
string html;
using (var wc = new WebClient())
    html = wc.DownloadString(url);
because it only returns a few divs at the top of the document.
Parsing the underlying JSON and merging it with the HTML would be too time-consuming, and again, I just need the HTML.
Question: How do I get the complete web page in C#?

get query results from a web site in c#

I am using C# and have the IMEI number of a phone. I need to get the details of the phone from the http://www.imei.info website in my C# application.
When I go to the website and search for my phone's IMEI number, I see my phone details at the URL http://www.imei.info/?imei=356061042215493.
How can I do this in my C# application?

You can build the URL at run time, download the HTML page, and then parse it with HtmlAgilityPack to extract the information you want. See the code below as an example; you can then process the returned nodes to extract your information.
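// Requires: using System; using System.Collections.Generic; using System.Linq; using System.Net; using HtmlAgilityPack;
private const string WebPage = "http://www.imei.info/?imei="; // Base URL taken from the question.

private List<HtmlNode> GetPageData(string imei)
{
    HtmlDocument doc = new HtmlDocument();
    string strPage;
    using (WebClient webClient = new WebClient())
    {
        // Append the IMEI to the base URL and download the result page.
        strPage = webClient.DownloadString(string.Format("{0}{1}", WebPage, Uri.EscapeDataString(imei)));
    }
    doc.LoadHtml(strPage);
    // Change the XPath below to match the page's actual markup.
    HtmlNodeCollection cells = doc.DocumentNode.SelectNodes("//table[@class='sortable autostripe']//tbody//tr//td");
    // SelectNodes returns null when nothing matches, so fall back to an empty list.
    return cells != null ? cells.ToList() : new List<HtmlNode>();
}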

Unless they have an API, you will need to read the page details with an XML parser such as LINQ to XML or XmlReader.

See WebClient.DownloadString and HtmlAgilityPack

Selenium C# Dynamic Meta Tags

I'm using Selenium for C# to serve fully rendered JavaScript applications to Google's spiders and to users with JavaScript disabled. I am using ASP.NET MVC to serve the pages from my controller. I need to be able to generate dynamic meta tags before the content is served to the caller. For example, the following pseudocode:
var pageSource = driver.PageSource; // This is where i get my page content
var meta = driver.findElement(By.tagname("meta.description")).getAttribute("content");
meta.content = "My New Meta Tag Value Here";
return driver.PageSource; // return the page source with edited meta tags to the client
I know how to return the page source to the caller, and I am already doing this, but I can't seem to find the right selector to edit the meta tags before I push the content back to the requester. How would I accomplish this?

Selenium doesn't have a feature specifically for this. But technically, you can change meta tags with JavaScript, so you can use Selenium's IJavaScriptExecutor in C#.
If the page is using jQuery, here's one way to do it:
// requires: using OpenQA.Selenium;
// new content to swap in
string newContent = "My New Meta Tag Value Here";
// jQuery call that does the swapping; change the selector to match the meta tag you want to edit
string changeMetasScript = "$('meta[name=author]').attr('content', arguments[0]);";
// execute with the JavaScript executor
IJavaScriptExecutor js = driver as IJavaScriptExecutor;
js.ExecuteScript(changeMetasScript, newContent);
