Parse HTML With C# [closed] - c#

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
I'd like to parse an HTML page using C#. The pages contain a lot of HTML tags; here's a sample from one of them:
<span class=text14 id="article_content"><!-- RELEVANTI_ARTICLE_START --><span ></b>The
most important component for <a
class=bluelink href="http://www.ynetnews.com/articles/0,7340,L-
3284752,00.html%20"' onmouseover='this.href=unescape(this.href)'
target=_blank>Israel</a>'s
security is its special relations with the American administration, and especially with its generous purse. When the Netanyahu government launches a great outcry against the <a ...
but I'd only like to get the content wrapped by the <span class=text14 id="article_content"> tag.
At first I thought about using regular expressions (preg_match style), but then realized that's not efficient at all.
I later read about Html Agility Pack and FizzlerEx.
I'd like to know whether it's possible to get the text wrapped by the specific tag I've mentioned using these tools, and I'd be grateful if someone could tell me how fast this task could be performed.

It's pretty straightforward using Html Agility Pack:
var markup = @"<span class=text14 id=""article_content""><!-- RELEVANTI_ARTICLE_START --><span ></b>The most important component for <a class=bluelink href=""http://www.ynetnews.com/articles/0,7340,L-3284752,00.html%20""' onmouseover='this.href=unescape(this.href)' target=_blank>Israel</a>'s security is its special relations with the American administration, and especially with its generous purse. When the Netanyahu government launches a great outcry against the</span>";
var doc = new HtmlDocument();
doc.LoadHtml(markup);
var content = doc.GetElementbyId("article_content").InnerText;
Console.WriteLine(content);
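A slightly more defensive variant of the same lookup, as a minimal sketch: it assumes the HtmlAgilityPack NuGet package is referenced, and the names `ArticleExtractor` and `GetTextById` are made up for illustration. `GetElementbyId` returns null when no element has the requested id, so the sketch checks for that before reading `InnerText`, and it decodes entities such as `&amp;` with `HtmlEntity.DeEntitize`.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper, not part of Html Agility Pack itself.
public static class ArticleExtractor
{
    // Returns the decoded text of the element with the given id,
    // or null when the document contains no such element.
    public static string GetTextById(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // GetElementbyId returns null if the id is not found.
        HtmlNode node = doc.GetElementbyId(id);
        if (node == null)
        {
            return null;
        }

        // InnerText keeps entities like &amp; as written; DeEntitize decodes them.
        return HtmlEntity.DeEntitize(node.InnerText).Trim();
    }
}
```

Calling `ArticleExtractor.GetTextById(markup, "article_content")` on the markup above should give the same article text as the snippet in the answer, with a null result instead of an exception on pages that lack the span.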

Related

Getting data of HTML DIV tag without using Regular Expressions [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 8 years ago.
Hello, all respected experts,
I have a question about C#/.NET. I have an HTML page
and I want to extract data from its DIV tag. This is a sample of the HTML:
<div class="clr fleft">
<strong class="xx-large">033 111 22222</strong>
</div>
Now I want to get the numbers inside the element with class "xx-large".
I'd appreciate some help doing it.
You can use HtmlAgilityPack:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlstring);
Using XPath:
var data = doc.DocumentNode.SelectSingleNode("//*[@class='xx-large']").InnerText;
Using LINQ:
var data = doc.DocumentNode.Descendants()
.Where(x => x.Attributes["class"] != null && x.Attributes["class"].Value == "xx-large")
.First()
.InnerText;
As far as I know, you can't access them with C# alone (your server-side code). You must write some JavaScript to do this. (Your JavaScript code doesn't need any regex.)
All you need is a library with predefined parsers. You can use the Beautiful Soup parser (originally written in Python; it can be interfaced with C#); see how it's done at http://ashomtwit.espace-technologies.com/4499480-BeautifulSoup_and_ASP_NET_C_.html or you can choose an alternative package. These libraries have predefined regular expressions and methods to open web pages and collect the information. They are simple to use.

reading div after javascript load [closed]

Closed 8 years ago.
I'm having problems parsing information from a forum.
Here are some examples:
Easy
Hard
It would be really easy to get the information, since it is displayed in the div with id = "poe-popup-container".
The problem is that the div is only populated when the browser actually shows you the information. This can be easily reproduced by making your browser height really small and looking in the HTML code for the div: it will be empty, but as soon as you scroll down to see the item, it will change.
I'm trying to read the nodes inside the div with HtmlAgilityPack. The problem is that, as I explained, it only has content when the browser decides you need that information.
So, when you try to download the HTML, the div is empty.
I've tried to download the page with the web browser too, but the same thing happens.
I'm trying to use the following code:
string page = System.Text.Encoding.UTF8.GetString(Webclient.DownloadData("http://www.pathofexile.com/forum/view-thread/966384"));
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[@id='poe-popup-container']");
MessageBox.Show(node.InnerHtml);
You're trying to do the impossible. JavaScript is executed in the browser. HtmlAgilityPack is a library just for parsing static HTML; it can't execute JavaScript.
So why don't you look into browser automation instead? Try, for example, http://watin.org/

I need to extract the url inside the string [closed]

Closed 9 years ago.
I need to extract the URLs inside a string.
In my case the HTML text is stored in the DB. When I get that text, I need to find all the URLs in it and insert them into another table. Can you give me a way to find the URLs in SQL or C#?
This is a regular expression to find URLs in text:
Regex regx = new Regex("http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);
MatchCollection matches = regx.Matches(txt);
One possible way to do it is with regular expressions. The first option is to extract the HTML from the DB, then use a regular expression to find the links directly. The second option is to locate the link tags first, then extract the URL from each of them (again using regular expressions).
Here you can find information about how to use Regular Expressions in C#:
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx
On the other hand, writing the correct Regular Expression may not be so easy (it depends on how complex the URL is), but you should take a look at this question: regular expression for url
Also, here you can find a lot of information about regular expressions in general (keep in mind that there are some applications like RegexBuddy, that can help you a lot when it comes to testing your regular expressions): http://www.regular-expressions.info/
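The first option described above (running a regular expression directly over the stored HTML) could look roughly like the sketch below. It uses only the .NET base class library. The class name `UrlExtractor`, the method `ExtractUrls`, and the deliberately loose pattern are assumptions for illustration; the pattern is not the regex from the answer above and will not cover every valid URL.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class UrlExtractor
{
    // Loose pattern (an assumption): "http://" or "https://" followed by
    // any run of characters that are not whitespace, quotes, or angle brackets.
    private static readonly Regex UrlPattern = new Regex(
        @"https?://[^\s""'<>]+",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns every URL-like substring found in the text, in order of appearance.
    public static List<string> ExtractUrls(string text)
    {
        var urls = new List<string>();
        foreach (Match match in UrlPattern.Matches(text))
        {
            // Strip punctuation that usually belongs to the surrounding sentence.
            urls.Add(match.Value.TrimEnd('.', ',', ';', ')'));
        }
        return urls;
    }
}
```

Each returned string can then be inserted into the second table with an ordinary parameterized INSERT. Stopping at quotes and angle brackets keeps attribute values such as `href="..."` from swallowing the rest of the tag.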

.Net how to play audio in the root of the website with System.Windows.Media.MediaPlayer() [closed]

Closed 9 years ago.
I use the code below in .NET. It works fine. The problem is that I want this audio to play from the root of the website. What changes should I make for this? Thanks.
var sample = new System.Windows.Media.MediaPlayer();
sample.Open(new System.Uri(@"D:\voices\1.wav"));
sample.Play();
In a web application, this might look something like this:
sample.Open(new System.Uri(Server.MapPath("~/") + @"\voices\1.wav"));
I say might because that all depends on whether or not the voices folder exists in the root of the website. Additionally, you should probably leverage Path.Combine instead:
var path = Path.Combine(Server.MapPath("~/"), "voices", "1.wav");
sample.Open(new System.Uri(path));
Finally, I don't know what sample is, but the Open method may not work in a website. I'm making the assumption you know what Open does and whether or not it can work in a website.
Use Server.MapPath("~/"); that's the filesystem root of your website.
Dim path As String = Server.MapPath("~/") & "/voices/1.wav"
Note that the result is a filesystem path (with backslashes), not a URI.
Hope it helps.

have to extract data from a word file [closed]

Closed 9 years ago.
I have a peculiar problem: I have to extract information from a Word file. Say, for example, I have a resume and need to extract the name, email address, phone number, address, university, experience, etc.
Everyone may have their resume in a different format. So is there any way I can programmatically extract the information I need?
I need this information to fill in a registration form.
Even if at first you might be attracted by the idea of using COM Interop and ASP.NET, don't do it.
http://support.microsoft.com/kb/257757
That said, it's important to know which version of Word we are talking about. Newer formats can be treated as a zip archive containing XML files, and there are good, free libraries for them.
http://docx.codeplex.com/
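To illustrate the "zip containing XML files" point without any third-party library, here is a minimal sketch using only the .NET base class library. It assumes the input is a .docx (Office Open XML) file, not a legacy .doc, and that the main body lives in the `word/document.xml` entry. The names `DocxText` and `ExtractText` are hypothetical. It only pulls out plain text; working out which line is a name or an address remains a separate problem.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration only.
public static class DocxText
{
    // WordprocessingML namespace used by elements such as <w:p> and <w:t>.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the document text, one line per <w:p> paragraph.
    public static string ExtractText(Stream docxStream)
    {
        using (var zip = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
            {
                throw new InvalidDataException("word/document.xml not found; is this a .docx file?");
            }

            using (Stream xml = entry.Open())
            {
                XDocument doc = XDocument.Load(xml);

                // A paragraph's text is split across several <w:t> runs; join them.
                var paragraphs = doc.Descendants(W + "p")
                    .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)));

                return string.Join(Environment.NewLine, paragraphs);
            }
        }
    }
}
```

A file on disk can be passed in with `File.OpenRead("resume.docx")`, where the file name is only an example. Documents with nested paragraphs (text boxes, for instance) may produce repeated text with this simple traversal.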
Convert the Word document to HTML using Aspose for .NET.
Then you can use regular expressions to search the converted Word and/or PDF documents.
Or you can use HtmlAgilityPack to parse the generated HTML documents and search for specific sections/paths.
PS:
If you have a regex for email that's shorter than one page, then the regex is incorrect.
Phone numbers should be manageable, as long as you only have to support one country.
As for name and address, good luck with that.
Edit:
Like this
VB.NET:
Dim doc As New Aspose.Words.Document("filename.docORdocx")
doc.Save("filename.html", Aspose.Words.SaveFormat.Html)
C#:
Aspose.Words.Document doc = new Aspose.Words.Document("filename.docORdocx");
doc.Save("filename.html", Aspose.Words.SaveFormat.Html);
The component is here:
http://www.aspose.com/.net/word-component.aspx
To find out what a valid email address is, read RFC 822:
http://www.faqs.org/rfcs/rfc822.html
