I am looking for a way to deserialize a set of offline HTML pages that have schema.org microdata embedded. How could I do this in C#? I have found Bam.Net.Schema.Org but there is almost no code that teaches me how to use it.
I have found several "parsers" for Node.js, but they are imperfect and not something I could use from C#, at least not in a way I would prefer (semantic-schema-parser and node-microdata-scraper).
Suggestions are welcome. Should I simply create my own?
You can use the Html Agility Pack to parse the HTML and then query the DOM for schema.org values.
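For example, a rough sketch of that approach, assuming the HtmlAgilityPack NuGet package; the file path is just a placeholder for one of your offline pages:

using System;
using HtmlAgilityPack;

class MicrodataDump
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.Load(@"C:\pages\product.html"); // placeholder path to one of the offline pages

        // Elements that open a schema.org item carry itemscope and itemtype attributes.
        var items = doc.DocumentNode.SelectNodes("//*[@itemscope and @itemtype]");
        if (items == null) return;

        foreach (HtmlNode item in items)
        {
            string type = item.GetAttributeValue("itemtype", "");
            // A property's value sits either in a content attribute or in the inner text.
            var props = item.SelectNodes(".//*[@itemprop]");
            if (props == null) continue;
            foreach (HtmlNode prop in props)
            {
                string name = prop.GetAttributeValue("itemprop", "");
                string value = prop.GetAttributeValue("content", prop.InnerText.Trim());
                Console.WriteLine("{0}: {1} = {2}", type, name, value);
            }
        }
    }
}

Nested items are not handled here; a full deserializer would recurse into child itemscope elements and map them onto your Bam.Net.Schema.Org (or hand-rolled) types.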
I have a web page whose entire HTML I need to parse to find any special tagging.
For example, I want to pull out all the *.*.* elements on that page. What's the best way to do this in C#?
However, these strings are dynamic because they result from a search query, so I can't just pull the source code and look for them, since they live in scripts that get pulled in dynamically.
Is there a way to get these strings? I just need to check whether they are in my existing list. Maybe Selenium, or some other engine I'm not aware of, or another good approach?
Thanks!
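A rough sketch of the Selenium route mentioned in the question: drive a real browser so the scripts run, then inspect the rendered page source. This assumes the Selenium WebDriver and ChromeDriver NuGet packages; the URL and the strings being checked are placeholders.

using System;
using System.Linq;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class RenderedSourceCheck
{
    static void Main()
    {
        string[] knownStrings = { "1.2.3", "4.5.6" }; // placeholder for your existing list

        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl("http://example.com/search?q=widgets"); // placeholder URL
            // Give the page's scripts a moment to finish; a WebDriverWait on a known element is more robust.
            System.Threading.Thread.Sleep(2000);

            // PageSource reflects the DOM after the scripts have run, unlike a plain HTTP download.
            string rendered = driver.PageSource;
            foreach (string s in knownStrings.Where(rendered.Contains))
                Console.WriteLine("Found: " + s);
        }
    }
}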
I am trying to automate the testing of web forms. To that end I need to know how to use C# to dynamically locate input tags within the HTML page then assign values to them. I don't want to use XPath, because each time I will be using a different web form. I want to pass the web form's URL to Selenium and then automatically populate the fields. I've heard of HTMLAgilityPack. Would that help me? If so, how can I use it?
I appreciate your help.
I may have missed a crucial part of your question; however, have you looked at Selenium WebDriver?
If you write a test that handles a generic web form, you can back it with dynamic data and cater for changes in the page by using data-driven tests. I've written tests for many pages and there are always common actions, but I still cater for each page differently, as each page has different things on it!
[EDIT]
Following on from your comments, I think looking into Selenium would be a good idea. The way to handle different pages is to have these element definitions ready in a 'definitions' class for each page. That way once you know what the page is, you just use the correct class for your definitions. It is best to know what elements you are going to be interacting with in your tests before the tests run. The point of automated UI testing is for a known set of actions to be performed and a correct result achieved.
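A rough sketch of the "definitions class per page" idea, assuming Selenium WebDriver; the locators, URL and test values are made up for illustration:

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

// One definitions class per page: the test data changes, the locators stay here.
static class LoginPageDefinition
{
    public static readonly By UserName = By.Id("username");
    public static readonly By Password = By.Id("password");
    public static readonly By Submit   = By.Id("submit");
}

class FormFillTest
{
    static void Main()
    {
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl("http://example.com/login"); // placeholder URL
            // The values would normally come from your data source (CSV, database, test case...).
            driver.FindElement(LoginPageDefinition.UserName).SendKeys("testuser");
            driver.FindElement(LoginPageDefinition.Password).SendKeys("secret");
            driver.FindElement(LoginPageDefinition.Submit).Click();
        }
    }
}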
I would suggest you look up some tutorials, such as this one, and you can see my blog, though I wrote that when I was initially learning WatiN and later replaced it with Selenium (I like it better :P).
Html Agility Pack
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH or XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams).
HtmlDocument doc = new HtmlDocument();
doc.Load(path);
// SelectNodes returns null when no <input> nodes match, hence the guard (needs System.Linq).
foreach (HtmlNode input in doc.DocumentNode.SelectNodes("//input") ?? Enumerable.Empty<HtmlNode>())
{
    // Your Code...
}
First of all, I hope my question doesn't bother you. I really need to get an idea of how I can accomplish this, but unfortunately I'm really a beginner; I'm still crawling when it comes to programming. I'm struggling to learn it the best way I can, and I'll thank you for any help you give me.
Here's the task: I was asked to find a way to collect some data from a website using a C# application. This will be done every day, in order to update the data we'll use to calculate a financial index.
I know my question might sound vague; even telling me how I can be more precise will help. I know I seem desperate, but putting aside the personal issues, my scholarship kind of depends on it.
Thanks in advance! (Please don't mind the bad English; I'm Brazilian and my English might not be that good yet.)
First, your English is fine. In fact, I thought you were a native speaker until you said otherwise.
The term you're looking for is 'site scraping'. See this question: Options for HTML scraping?. The second answer points to the Html Agility Pack library, which you can use.
Now, there are two possibilities here. The first is you have to parse the HTML and scrape your data out of it. This is more computationally intensive and depends on the layout of the page. If they change the way the site looks, it could break the scraper.
The second possibility is they provide some XML or JSON web service you can consume. In this case you aren't scraping anything, but are rather using a true data feed. If the layout of the site changes, you will not break. Whether your target site supports this form of data feed is up to the site.
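If such a feed exists, consuming it is usually only a few lines. A sketch under the assumption of a hypothetical XML feed (the URL and element names are invented):

using System;
using System.Xml.Linq;

class FeedReader
{
    static void Main()
    {
        // XDocument.Load accepts a URL directly; this endpoint and its schema are hypothetical.
        XDocument feed = XDocument.Load("http://example.com/api/quotes.xml");
        foreach (XElement quote in feed.Descendants("quote"))
        {
            Console.WriteLine("{0} = {1}", (string)quote.Attribute("symbol"), quote.Value);
        }
    }
}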
If I understand your question, you're being asked to do some Web Scraping, where you 1) download the contents of a web page and 2) try to parse data from that content.
For step #1, you should look into using a WebClient object in C# to download the HTML from the web page. You can give a WebClient object the URL you want to download the content from and obtain a String containing the content (probably HTML) of the URL.
How you go about doing step #2 depends on what content is present at the web site. If you know of certain patterns you're looking for in the HTML, you can search the HTML string using various methods. A more general solution for parsing HTML data can be found through using the Html Agility Pack, which will let you handle the HTML as a tree structure (DOM).
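Putting the two steps together, a rough sketch assuming the Html Agility Pack; the URL and XPath expression are placeholders you would replace to match the real page's structure:

using System;
using System.Net;
using HtmlAgilityPack;

class Scraper
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // Step 1: download the raw HTML as a string.
            string html = client.DownloadString("http://example.com/prices"); // placeholder URL

            // Step 2: parse it into a DOM and query it with XPath.
            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            var cells = doc.DocumentNode.SelectNodes("//table[@id='prices']//td"); // placeholder XPath
            if (cells != null)
                foreach (HtmlNode cell in cells)
                    Console.WriteLine(cell.InnerText.Trim());
        }
    }
}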
Use the WebClient class to get the page.
Turn the HTML into XML.
Use XPath to select the data you are interested in.
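One way to do the "HTML into XML" step is to let the Html Agility Pack clean the markup and emit well-formed XML, then query it with XPath via LINQ to XML. A sketch; the URL and XPath are illustrative:

using System;
using System.IO;
using System.Net;
using System.Xml.Linq;
using System.Xml.XPath;
using HtmlAgilityPack;

class HtmlToXml
{
    static void Main()
    {
        string html = new WebClient().DownloadString("http://example.com/data"); // placeholder URL

        // OptionOutputAsXml makes the document save itself as well-formed XML.
        var doc = new HtmlDocument { OptionOutputAsXml = true };
        doc.LoadHtml(html);
        var writer = new StringWriter();
        doc.Save(writer);

        XDocument xml = XDocument.Parse(writer.ToString());
        foreach (XElement e in xml.XPathSelectElements("//span[@class='price']")) // placeholder XPath
            Console.WriteLine(e.Value);
    }
}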
Ok, this is a pretty straightforward app design, and a lot of the code you need already exists for reuse. Since you're a beginner, I'll break what you need to do into steps and recommend approaches.
1) You will use classes from System.Net to pull the web pages (WebClient being the easiest to use). You will want to have this part of the program run on a timer if you can (using the scheduled jobs feature of the OS) and have it just pull the pages and drop them in a folder; a minimal sketch of this fetch-and-drop step follows after this list.
2) You have a second job which runs separately, pulling unread files from that folder, parsing them (the Html Agility Pack library is best for this) and then storing them in an index of some kind (Lucene is best for that).
3) You have a front end application of some sort (web or desktop) which queries that index for the information you're looking for.
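Here is the sketch of step 1 referred to above, meant to be run as a console app from the OS scheduler (Task Scheduler, cron, etc.); the URLs and the drop folder are placeholders:

using System;
using System.IO;
using System.Net;

class FetchJob
{
    static void Main()
    {
        string[] urls = { "http://example.com/page1", "http://example.com/page2" }; // placeholder URLs
        string dropFolder = @"C:\scraper\incoming";                                 // placeholder folder
        Directory.CreateDirectory(dropFolder);

        using (var client = new WebClient())
        {
            for (int i = 0; i < urls.Length; i++)
            {
                string html = client.DownloadString(urls[i]);
                // Timestamped file names keep each day's pull separate for the parsing job (step 2).
                string file = Path.Combine(dropFolder,
                    string.Format("{0:yyyyMMdd_HHmmss}_{1}.html", DateTime.UtcNow, i));
                File.WriteAllText(file, html);
            }
        }
    }
}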
I am looking for a C# library that would translate the HTML code (and the CSS specified in the code) into a DOM tree for simpler parsing. I am looking for something similar to this one (which is in PHP):
http://simplehtmldom.sourceforge.net/
Of course I know I could embed a browser control, but I am looking for something more efficient.
Check out the HTML Agility Pack. It hasn't been updated in a while, but it still works very well.
I second Mr. Dorman on the HtmlAgilityPack. I did a brief blog post on web scraping some time ago; it mentions the 'pack, but mostly discusses other details. Depending on your application, it might be of some use.
We have used the Html Agility Pack here in our project to extract specific HTML tags with a given set of attributes using XPath, and it has never failed us.
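For example, an XPath query that filters on tag name and attributes at the same time (the class name, attribute and file name are invented):

using System;
using HtmlAgilityPack;

class AttributeQuery
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.Load("page.html"); // placeholder file

        // Select only <div> elements that have both the given class and a data-id attribute.
        var nodes = doc.DocumentNode.SelectNodes("//div[@class='article' and @data-id]");
        if (nodes == null) return;
        foreach (HtmlNode node in nodes)
            Console.WriteLine(node.GetAttributeValue("data-id", ""));
    }
}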
There is no way to get the DOM with the styles applied like that. The only option is the Selenium framework, which works with a real browser.
I want my ASP.NET C# application to be multi-language. I was planning to do this with an XML file. The thing is, I don't have any experience with this. I mean, how do I start? Is it a good idea to store the languages in an XML file? And how do I set the values in code for, e.g., my menu buttons? I'd like to work with XML because I've never worked with XML before, and I want to learn how to deal with cases like this.
You want to look into RESX resource files. These are XML files that can contain texts (and images) and they have standardized handling of localization/translations.
Support for this is built right into ASP.NET. There is a guide for how to use it and set it up at: http://msdn.microsoft.com/en-us/library/fw69ke6f(VS.80).aspx.
The walkthrough is pretty detailed and should help you understand the concepts. My preferred method is the one described a bit further down in the document, in the section "Explicit Localization with ASP.NET". Using it you will get a set of XML files with your texts and translations in a fully standardized format.
Do you know about the .NET Forms automatic translation resources (based on .resx)?
You're in luck: this sort of thing is built directly into .NET.
The way it's done is that for every page you have a language-specific .resx file,
e.g.
Homepage.aspx
Homepage.aspx.cs
Homepage.aspx.en.resx
Homepage.aspx.fr.resx
You simply figure out dynamically which resource file to use, and all the appropriate labels come through in French, for example.
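A rough sketch of what the code-behind side can look like; the control name, resource key and culture-selection logic are all made up, and assume Homepage.aspx.resx / Homepage.aspx.fr.resx files that each contain a "WelcomeLabel.Text" key, with WelcomeLabel being a Label declared in the markup:

using System;

public partial class Homepage : System.Web.UI.Page
{
    protected override void InitializeCulture()
    {
        // Pick the UI culture however you like: query string, cookie, browser header...
        UICulture = Request.QueryString["lang"] ?? "en";
        base.InitializeCulture();
    }

    protected void Page_Load(object sender, EventArgs e)
    {
        // With UICulture set to "fr", ASP.NET resolves this key from Homepage.aspx.fr.resx.
        WelcomeLabel.Text = (string)GetLocalResourceObject("WelcomeLabel.Text");
    }
}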
Helpful Tutorials and Videos
A Simple Example
Good luck.
If internationalization in .NET is something you want to get into seriously, you might want to consider this
(and no - I have no stake in it)