How to get only plain text from HTML using C#?

Hi guys.
I'm trying to create an app that finds the most frequently used words in a string.
In my case, the string is HTML.
I can already get the HTML from a URI, for example "https://www.bbc.com/news/world-middle-east-57327591":
var url = "https://www.bbc.com/news/world-middle-east-57327591";
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);
The html variable holds the same HTML as the page source, which is fine.
But how do I get rid of all the styles, scripts, and additional information, and get only the plain text in a string variable?
I want my application to work not only for BBC HTML, but for any HTML I can get from the net.
My idea is that I should get the text from every element such as <div>, <p>, <b>, <i>, <a>, because not all of the text is stored in <p>.

As per this answer, try the following:
var url = "https://www.bbc.com/news/world-middle-east-57327591";
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);
// Create a regex pattern that selects all HTML tag elements
string pattern = @"<(.|\n)*?>";
// Replace all tag elements found using that regex with nothing
return Regex.Replace(html, pattern, string.Empty);
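Note that stripping tags alone leaves the contents of <script> and <style> blocks in the result, which the question wanted removed. Below is a minimal sketch of one way to handle that and then count words, using only the base class library. The names PlainTextSketch, ToPlainText and TopWords are made up for this example, and the regex cleanup is a rough approximation that can fail on malformed markup, so treat it as a starting point rather than a robust solution.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of any library.
public static class PlainTextSketch
{
    // Rough regex-based cleanup; an HTML parser copes better with malformed markup.
    public static string ToPlainText(string html)
    {
        // Drop <script> and <style> blocks together with their contents.
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        // Drop HTML comments.
        text = Regex.Replace(text, @"<!--.*?-->", " ", RegexOptions.Singleline);
        // Replace the remaining tags with a space so adjacent words stay separated.
        text = Regex.Replace(text, @"<[^>]+>", " ");
        // Turn entities such as &amp; into the characters they stand for.
        text = WebUtility.HtmlDecode(text);
        // Collapse runs of whitespace into single spaces.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    // Returns the most frequent words (lower-cased) with their counts.
    public static List<KeyValuePair<string, int>> TopWords(string text, int count)
    {
        return Regex.Matches(text.ToLowerInvariant(), @"\p{L}+")
            .Cast<Match>()
            .GroupBy(m => m.Value)
            .OrderByDescending(g => g.Count())
            .ThenBy(g => g.Key, StringComparer.Ordinal)
            .Take(count)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .ToList();
    }
}
```

With the html variable from the question, PlainTextSketch.TopWords(PlainTextSketch.ToPlainText(html), 10) would give the ten most frequent words. You would probably also want to filter out stop words such as "the" and "and".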

Related

Search for specific text in html response (ASP.NET Core)

I need to search for a specific word in an HTML web page.
I'm trying to do this using C# (ASP.NET Core).
The idea is to get the URL and the word to search for from the View via JS,
and then, in the response, show the word if it exists, or show something else if it doesn't.
I wrote a method that gets the HTML code of the page. Here is the code:
[HttpPost]
public JsonResult SearchWord([FromBody] RequestModel model)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(model.adress);
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    Stream receiveStream = response.GetResponseStream();
    StreamReader readStream = null;
    if (response.CharacterSet == null)
    {
        readStream = new StreamReader(receiveStream);
    }
    else
    {
        readStream = new StreamReader(receiveStream, Encoding.GetEncoding(response.CharacterSet));
    }
    string data = readStream.ReadToEnd();
    string strRegex = model.word;
    response.Close();
    readStream.Close();
    return Json(data);
}
But how do I search for the word correctly?

To add to the previous answer, you can use Regex.Match. Something like:
// Build the pattern from the search word; Regex.Escape keeps it literal.
string pattern = @"(\w+)\s+(" + Regex.Escape(strRegex) + ")";
// Instantiate the regular expression object.
Regex r = new Regex(pattern, RegexOptions.IgnoreCase);
// Match the regular expression pattern against your html data.
Match m = r.Match(data);
if (m.Success)
{
    // Add your logic here
}
NOTE: There are quite a few things you can do to optimize your code, especially in how you handle the stream reader. I would read the response in chunks and try to match each chunk.

You will not be able to do much with simple pattern matching; check out this SO classic: RegEx match open tags except XHTML self-contained tags. Consider using a web scraping library like html-agility-pack if you want to do some serious scraping. If you only want to search for a single word in a web page, no matter whether it sits in markup, CDATA, etc., just read the whole response into one string and use string.Contains or Regex.
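As a minimal sketch of that last suggestion, the two helpers below search the downloaded page text for a word. WordSearchSketch and its method names are made up for this example; data and word stand for the page content and the search term from the question. The whole-word variant assumes the search term starts and ends with a letter or digit, since \b only matches at word boundaries.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of any library.
public static class WordSearchSketch
{
    // True if the text contains the term anywhere, even inside a longer word or a tag.
    public static bool ContainsText(string data, string word)
    {
        return data.IndexOf(word, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // True only if the term appears as a whole word.
    public static bool ContainsWholeWord(string data, string word)
    {
        // Regex.Escape keeps characters like '.' or '+' in the search term literal.
        string pattern = @"\b" + Regex.Escape(word) + @"\b";
        return Regex.IsMatch(data, pattern, RegexOptions.IgnoreCase);
    }
}
```

In the controller from the question, you could then return Json(WordSearchSketch.ContainsWholeWord(data, model.word)) instead of returning the whole page.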

C# load html source as string

I'm trying to make an application that reads certain text from a website.
using AngleSharp.Parser.Html;
...
var source = @"
<html>
<head>
</head>
<body>
<td class=""period_slot_1"">
<strong>TG</strong>
</body>
</html>";
var parser = new HtmlParser();
var document = parser.Parse(source);
var strong = document.QuerySelector("strong");
MessageBox.Show(strong.TextContent); // Display text
After some googling, I got the code above working. I copied and pasted part of the HTML into a variable to see if I could get the value I'm looking for.
It does get the value I want, which is the string "TG".
However, the website will show a different value in place of "TG" every time, so I need my program to read the website's HTML directly at that moment.
Is it possible to load the whole HTML source into the source variable and make this work? If so, how can I do it, and what would be the best way to get what I want?
Thank you so much for reading the question.

I assume you mean that you want to read directly from a page on the internet via a URL. In that case you can do:
WebClient myClient = new WebClient();
Stream response = myClient.OpenRead("http://yahoo.com");
StreamReader reader = new StreamReader(response);
string source = reader.ReadToEnd();
var parser = new HtmlParser();
var document = parser.Parse(source);
var p = document.QuerySelector("p");
// I used 'p' instead of 'strong' because there's no
//strong on that page
MessageBox.Show(p.TextContent); // Display text
response.Close();

How to properly get the content of a website?

I'm trying to read the content of the page and extract some information. But sometimes I get stuff like: &nbsp;Aur&eacute;lie (Verschuere)
I already do this:
string siteContent = "";
using (System.Net.WebClient client = new System.Net.WebClient())
{
    client.Encoding = System.Text.Encoding.UTF8;
    siteContent = client.DownloadString(edtReadFromUrl.Text);
}
It works when there are UTF-8 characters. Can't I get readable text with no HTML in it? That would be even easier.
Edit: This is not the duplicate someone marked it as. The other solution returns strange characters too.

You could use an HTML parser to extract the text. For instance, with HtmlAgilityPack, you could do:
HtmlDocument doc = new HtmlDocument();
string html;
using (var wc = new WebClient())
{
    html = wc.DownloadString("http://www.bbc.co.uk/news");
}
doc.LoadHtml(html);
// InnerText returns the text of the body with the tags removed
string text = doc.DocumentNode.Element("html").Element("body").InnerText;
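The strange characters in the question look like HTML entities rather than an encoding problem, and tag-stripped text can still contain them. A minimal sketch of decoding them with WebUtility.HtmlDecode from System.Net follows. EntitySketch and Decode are names made up for this example, and replacing the non-breaking space with a normal space is my own assumption about what "readable" means here.

```csharp
using System;
using System.Net;

// Hypothetical helper class, not part of any library.
public static class EntitySketch
{
    public static string Decode(string fragment)
    {
        // HtmlDecode converts named and numeric entities to characters.
        string decoded = WebUtility.HtmlDecode(fragment);
        // &nbsp; decodes to U+00A0 (non-breaking space); swap it for a plain space.
        return decoded.Replace('\u00A0', ' ').Trim();
    }
}
```

For the sample from the question, EntitySketch.Decode("&nbsp;Aur&eacute;lie (Verschuere)") gives "Aurélie (Verschuere)".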

jquery using c# webclient .net namespace?

I have the following jQuery code that relates to an HTML document from a website.
$
Anything is appreciated,
Salute.

From what I can remember, using the HtmlAgilityPack:
var rawText = "<html><head></head><body><div id='container'><article><p>stuff<p></article><article><p>stuff2</p></article></div></body></html>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(rawText);
var stuff = doc.DocumentNode.Descendants("div")
.SelectMany(div => div.Descendants("article"));
var length = stuff.Count();
var textValues = stuff.Select(a => a.InnerHtml).ToList();
Output:
length: 2
textValues: List<String> (2 items)
<p>stuff<p>
<p>stuff2</p>
To get the HTML, instead of hardcoding it as above, use the WebClient class, since it has a simpler API than WebRequest.
var client = new WebClient();
var html = client.DownloadString("http://yoursite.com/file.html");

To answer your question specifically in terms of the System.Net namespace, you would do this:
See the following page for how to use the WebRequest class itself to get the content:
http://msdn.microsoft.com/en-us/library/456dfw4f%28v=vs.110%29.aspx
Next, after you get the content back, you need to parse it using the HTML Agility Pack, found here: http://htmlagilitypack.codeplex.com/
As for how one would translate the jQuery into C#, here is an untested example:
var doc = new HtmlDocument();
doc.Load(@"D:\test.html"); // you can also use a memory stream instead
var container = doc.GetElementbyId("continer");
foreach (HtmlNode node in container.Elements("img"))
{
    HtmlAttribute valueAttribute = node.Attributes["value"];
    if (valueAttribute != null) Console.WriteLine(valueAttribute.Value);
}
In your case, the attributes you want after you find the element are alt, src, and href.
It will take you about a day to learn the Agility Pack, but it's mature, fast, and well liked by the community.

Read specific div from HttpResponse

I am sending one HttpWebRequest and reading the response.
I get the full page in the response.
I want to get one div, with the id rate, from the response.
So how can I match that pattern?
My code is like:
HttpWebRequest WebReq = (HttpWebRequest)WebRequest.Create("http://www.domain.com/");
HttpWebResponse WebResp = (HttpWebResponse)WebReq.GetResponse();
Stream response = WebResp.GetResponseStream();
StreamReader data = new StreamReader(response);
string result = data.ReadToEnd();
I get a response like:
<HTML><BODY><div id="rate">Todays rate 55 Rs.</div></BODY></HTML>
I want to read the content of the rate div, i.e. I should get "Todays rate 55 Rs."
So how can I write a regex for this?

The HTML Agility Pack can load and parse the file for you, no need for messy streams and responses:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://jsbin.com/owobe3");
HtmlNode rateNode = doc.DocumentNode.SelectSingleNode("//div[@id='rate']");
string rate = rateNode.InnerText;

You should read the entire response and then use something like the Html Agility Pack to parse it and extract the bits you want with an XPath-like syntax:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(result);
var output = doc.DocumentNode.SelectSingleNode("//div[@id='rate']").InnerHtml;
Don't use regular expressions!

If the text "Todays rate" appears only once, you can do it like this:
Todays rate \d+ Rs\.
Otherwise, you can add the div tag to your regex.
Edit: Sorry, I don't have a regex tool installed locally.
You need to use a group and read the value from it. It will look like this:
<div id="rate">(?<group>[^<]*)</div>
I don't know if it works, but the idea should carry over.
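For completeness, here is a minimal sketch of the named-group idea in C#. RateSketch and ExtractRate are names made up for this example. It assumes the div has no nested tags and that id is the first attribute, written with double quotes as in the sample response. The parser-based answers are safer for anything less regular.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of any library.
public static class RateSketch
{
    // Returns the text inside <div id="rate">...</div>, or null if it is not found.
    public static string ExtractRate(string html)
    {
        Match m = Regex.Match(
            html,
            @"<div\s+id=""rate""[^>]*>(?<text>[^<]*)</div>",
            RegexOptions.IgnoreCase);
        // The named group "text" holds everything between the opening and closing tag.
        return m.Success ? m.Groups["text"].Value.Trim() : null;
    }
}
```

With the result string from the question, RateSketch.ExtractRate(result) returns "Todays rate 55 Rs." for the sample response.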
