Parse and extract required text from text files using C#

I have some text files with useful data wrapped in HTML tags like <td>, <span>, etc. I want to write a program that extracts the data between the tags.
The text files contain other junk data too. I would also like to store the extracted data in a SQL table. Can anyone point me in the right direction?

Don't mention Regex and HTML in the same question on this site -- it's a sin!! ;-)
You likely want the HTML Agility Pack.
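To make that suggestion concrete, here is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names (CellExtractor, ExtractCellText) are made up for illustration, and the XPath only covers the <td> and <span> tags mentioned in the question.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the "HtmlAgilityPack" NuGet package is installed

public static class CellExtractor // hypothetical helper, not part of the library
{
    // Returns the trimmed, entity-decoded text of every <td> and <span> element.
    public static List<string> ExtractCellText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // the parser tolerates junk text around the markup

        var results = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//td | //span");
        if (nodes == null)
        {
            return results;
        }

        foreach (HtmlNode node in nodes)
        {
            // InnerText drops the tags; DeEntitize turns &amp; etc. back into characters.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0)
            {
                results.Add(text);
            }
        }
        return results;
    }
}
```

Note that a <span> nested inside a <td> would be reported twice (once as part of the cell, once on its own), so the XPath may need narrowing for real files. Each returned string could then be written to the SQL table with a parameterized INSERT.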

Related

How to include all resources into one html file?

Is there a C# library or free tool that can convert an HTML file with many referenced resources into a single "all-in-one" HTML file?
The main goal is to end up with only one file, which means I need to inline:
External JavaScript files: this probably means replacing every 'script' tag that has a 'src' attribute with a 'script' tag whose content is read from the referenced file.
Images: replace src="picture.png" with a data URI, something like src="data:image/png;base64,encodedContent..."
CSS files
Maybe I forgot something :)
This HTML file must be readable in all browsers, which is why I cannot use the MHT file format (unreadable on Safari, iPad...)
You can use the HTML Agility Pack to read and write the HTML document. It supports XPath, so you can get a list of the nodes you want to modify.
Using this, changing the src attribute of the image tags should be easy. You can also get a list of the external JS references, read each file, and update the script tag accordingly.
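The data URI part needs only the base class library. This is a rough sketch; the DataUri class and its small extension-to-MIME-type table are assumptions for illustration, not an existing API.

```csharp
using System;
using System.IO;

public static class DataUri // hypothetical helper
{
    // Builds a data URI from raw bytes and an explicit MIME type.
    public static string FromBytes(byte[] content, string mimeType)
    {
        return "data:" + mimeType + ";base64," + Convert.ToBase64String(content);
    }

    // Reads a file and guesses the MIME type from its extension.
    // The table below is deliberately tiny; extend it for the formats you use.
    public static string FromFile(string path)
    {
        string mimeType;
        switch (Path.GetExtension(path).ToLowerInvariant())
        {
            case ".png": mimeType = "image/png"; break;
            case ".jpg":
            case ".jpeg": mimeType = "image/jpeg"; break;
            case ".gif": mimeType = "image/gif"; break;
            default: mimeType = "application/octet-stream"; break;
        }
        return FromBytes(File.ReadAllBytes(path), mimeType);
    }
}
```

With the HTML Agility Pack, each node returned by SelectNodes("//img[@src]") could then be updated with SetAttributeValue("src", DataUri.FromFile(localPath)), where localPath is whatever the original src resolves to on disk.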

Parse HTML Page, include all css styles

I want to send a complete HTML page as an email and include all of its CSS styles in the email. Is there a library that produces a single HTML page with all the CSS styles correctly included? (Consider that CSS files can import other CSS files, which also have to be opened and included.)
Any help is appreciated.
The closest I came to this was having to send a control without includes: I built it as a server control that reads the CSS files (and JS files) and writes them out. For an entire page, however, you might have more difficulty.
I do not believe there is anything that does this. If you can read the entire page source, find-and-replace may be your easiest answer: find each CSS tag and replace it with the contents of the file that the tag references.
You could try using this: http://martinnormark.com/move-css-inline-premailer-net
I'm in the process of testing it myself in order to generate some Word documents with inline styles. I will add the results later.
Update:
It works, although Word 2010 applies some of the inline styles as it likes. I didn't try with previous versions of Word.

Convert a website to Html to Excel

I have the below as part of a web application (ASP.NET). Is there any way to convert it to Excel? The problem is printing it: landscape is the desired format, and the people in the organization are very novice users, so to improve usability I want it to print in landscape. I tried ActiveX and SendKeys commands. This works, but it is not what I want.
Please help.
Two ideas:
(1) Copy + paste. If there is too much noise then try (2).
(2) Copy the HTML into a text file, read the text file into a program, have the program parse the HTML in the way that you need it. From there you should be able to format it as needed or just flat out copy/paste it into Excel.
Are your pages table-driven (<table><tr>...) and not laid out with CSS? If so, this has been asked and answered before. Excel can consume tabular HTML like a champ.

How to create a word document using html written in C#

I am creating a C# application that has to create a Word document.
I'm using Microsoft.Office.Interop.Word to do this and I've successfully managed to output some Word documents, but creating the content through code is very time-consuming work.
I noticed that Word is able to open HTML pages and show them as normal content, so I created a simple test table in HTML and inserted it into the Word document. But when I output the document the obvious happened: the tags were still there! Word did not interpret the tags as HTML. It just output exactly what I put in there.
How can I tell Word to interpret the text as HTML?
edit: (through the C# code, of course)
edit 2: Please note that I'm parsing through some data to make this, so I will end up with about 4 pages of the same table/HTML, and I need to be able to tell Word to start a new page each time I've finished a loop. So an HTML-only method will probably not work.
If you're only wanting to output simple HTML content as a Word document, you could always cheat and write out the HTML content with a .doc extension.
Word will open that just fine.
If you need to add a page break, you can use a CSS page-break-before, like so:
<br style="page-break-before: always;"/>
If you're set on using Interop, having read up a little bit, this post states that you need a converter to insert HTML, and the converters are only accessible when:
you paste HTML from the Clipboard
open/insert HTML from a file
So, this answer looks like it provides a clipboard-based solution: Adding html text to Word using Interop
However, if there's any money to spend on the project, I can heartily recommend Aspose.Words which will do all of this for you.
As requested by the OP, and to make it easier for others to find this solution, here is the answer I posted as a comment (plus extra results from testing):
When opening an HTML file, MS Word honors the CSS properties page-break-before and page-break-after. There is a caveat, however:
In "Web Layout" view, page breaks are never shown (this doesn't mean that they aren't there), just like browsers don't "show" them. And Word opens HTML files in Web Layout view by default (which makes sense). You need to print the document or switch to some other view (typically "Print Layout") to see your breaks in all their glory.
So, saving an HTML file with a .doc extension is a viable solution (also tested: Word opens it properly despite the extension).
Note: all the testing was done on MS Word 2003 using this snippet: <html>asdf<br style="page-break-before: always;">new page!</html>
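For the looped, multi-page case from the question, the same trick could be sketched roughly like this. HtmlDocWriter and its methods are hypothetical names; the only thing taken from the answers above is the page-break <br> and the idea of saving HTML with a .doc extension.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class HtmlDocWriter // hypothetical helper
{
    // The page-break element suggested in the answers above.
    private const string PageBreak = "<br style=\"page-break-before: always;\"/>";

    // Joins the HTML fragments into one document, with a page break between pages.
    public static string BuildHtml(IEnumerable<string> pages)
    {
        var sb = new StringBuilder("<html><body>");
        bool first = true;
        foreach (string page in pages)
        {
            if (!first)
            {
                sb.Append(PageBreak); // no break before the very first page
            }
            sb.Append(page);
            first = false;
        }
        sb.Append("</body></html>");
        return sb.ToString();
    }

    // Writes the HTML to disk; pass a path ending in .doc so Word opens it.
    public static void Save(string path, IEnumerable<string> pages)
    {
        File.WriteAllText(path, BuildHtml(pages), Encoding.UTF8);
    }
}
```

Usage would be something like HtmlDocWriter.Save("report.doc", tables), where tables holds one HTML table string per loop iteration.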
Don't build the document in code. Create it in Word as a template or mail merge template, and then use code to merge or replace the field data.
See this answer here
MS Word Office Automation - Filling Text Form Fields And Check Box Form Fields And Mail Merge
And See this from the mothership:
http://msdn.microsoft.com/en-us/library/ff433638.aspx
If you don't want to use an external lib, Interop is too slow for you and neither pure HTML nor mail merge template are flexible enough, you could write your content as text or HTML into one or more files (using C#), create a VBA macro in a Word document which by itself creates a second Word document, reads the content files and does any formatting you want afterwards.
You can run this macro programmatically by starting Word using the command line switch /m.
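Launching Word that way from C# might look like the sketch below. The class name, the document path and the macro name are placeholders; it assumes Word is installed and that the macro lives in the document being opened (or in a global template).

```csharp
using System;
using System.Diagnostics;

public static class WordMacroLauncher // hypothetical helper
{
    // Builds the command line: winword.exe "<document>" /m<MacroName>
    public static ProcessStartInfo BuildStartInfo(string documentPath, string macroName)
    {
        return new ProcessStartInfo
        {
            FileName = "winword.exe",
            // The macro name follows /m directly, with no space in between.
            Arguments = "\"" + documentPath + "\" /m" + macroName,
            // Shell execution lets Windows locate winword.exe even when it is not on PATH.
            UseShellExecute = true
        };
    }

    // Starts Word and blocks until it exits.
    // Word stays open unless the macro itself quits the application.
    public static void Run(string documentPath, string macroName)
    {
        using (Process word = Process.Start(BuildStartInfo(documentPath, macroName)))
        {
            word?.WaitForExit();
        }
    }
}
```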
Another possible approach, if your html is xhtml (i.e. XML compliant), you could use XSLT to convert it to a Word XML format. But this would take a LOOOOOOOOOOONG time to code.
If you don't have to use HTML as the starting point you could simply build the Word XML document yourself rather than using XSLT, which would be easier. Time consuming but possible - it's something I do quite a lot in my work.
If a third party component is an option I would recommend the stuff from Aspose.
I have been pretty happy with their tools so far. The API is a little messy but everything works as one would expect.

How to extract keywords from HTML page in C#?

Basically I want to extract keywords or words or tokens that are present in the webpage after removing the stopwords. Does anybody know how to do this? Code in C# would be appreciated.
Use an HTML parsing library like the HTML Agility Pack.
Once you load an HTML document with it, you can query it with XPath syntax. It exposes the HTML in a similar way to an XmlDocument.
The HTML Agility Pack that Oded mentions will help you get at the plain text inside the HTML, but to extract keywords from the webpage after removing the stopwords you'll need to do more work. There's a good informative answer from Joseph Turian to this question: How do I extract keywords used in text?
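As a starting point for the "more work" part, here is a naive sketch that counts word frequencies after dropping stopwords. It assumes the plain text has already been pulled out of the HTML; the KeywordExtractor name and the very short stopword list are placeholders, and raw frequency is only a crude stand-in for real keyword extraction.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordExtractor // hypothetical helper
{
    // Tiny illustrative stopword list; a real one would be much longer.
    private static readonly HashSet<string> StopWords = new HashSet<string>(StringComparer.Ordinal)
    {
        "a", "an", "and", "the", "of", "to", "in", "is", "it", "on", "for", "with"
    };

    // Returns the most frequent non-stopword tokens with their counts.
    // Ties are broken alphabetically so the result is deterministic.
    public static List<KeyValuePair<string, int>> TopKeywords(string text, int count)
    {
        return Regex.Matches(text.ToLowerInvariant(), @"\p{L}+") // runs of letters
            .Cast<Match>()
            .Select(m => m.Value)
            .Where(w => w.Length > 1 && !StopWords.Contains(w))
            .GroupBy(w => w)
            .OrderByDescending(g => g.Count())
            .ThenBy(g => g.Key, StringComparer.Ordinal)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .Take(count)
            .ToList();
    }
}
```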
