Get a subsection of HTML document

I am trying to get a subsection of an HTML page. The functionality I am looking for is similar to the one implemented on most blogs. Usually, on the main page of the blog, you only see a section of the post, and when you click on the title you get the full content of that blog post.
There must be code that exists to get that subsection without breaking the HTML.
Does anyone know of good .NET code that does that?
EDIT: I need to keep the HTML formatting of the content, so stripping all the HTML isn't really an option. I wouldn't mind taking a fixed-length substring of the content (i.e. the first 800 characters or so) but then not breaking the HTML would be a nightmare.
Thanks!

I would first strip the HTML from the content string (see "How can I strip HTML tags from a string in ASP.NET?") and then take a left substring of the resulting string.
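The "left" step can be sketched as below, assuming the HTML has already been stripped and only plain text remains. The class and method names (Teaser, Left) are hypothetical, and cutting at the last space plus appending "..." is an illustrative choice rather than anything the answer prescribes.

```csharp
using System;

public static class Teaser
{
    // Returns at most maxChars of plain text, cut at a word boundary where possible.
    // Assumes maxChars is not negative.
    public static string Left(string plainText, int maxChars)
    {
        if (string.IsNullOrEmpty(plainText) || plainText.Length <= maxChars)
            return plainText ?? string.Empty;

        string cut = plainText.Substring(0, maxChars);
        int lastSpace = cut.LastIndexOf(' ');
        if (lastSpace > 0)
            cut = cut.Substring(0, lastSpace); // avoid ending in the middle of a word

        return cut.TrimEnd() + "...";
    }
}
```

Note that this loses all formatting, so it only fits when a plain-text teaser is acceptable.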

Usually this works by taking a substring of the contents of that blog post before the blog post is rendered into html.

That wouldn't be done by cutting the page output directly (messing with the HTML).
Handle that with server-side code displaying a trim of the blog content.

Usually the way that's done isn't by chunking off a piece of the HTML. Rather, there's a database that contains the blog posts, and the main page has its own HTML/CSS which dynamically loads only the first X paragraphs of each blog post.

To my mind the "simplest thing that could possibly work" would be to scan the blog post that you want to summarize until you get to the first close-paragraph </p> tag.
Don't be tempted to scan the HTML with a regex.
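A minimal sketch of that scan, using a plain string search instead of a regex. The names Excerpt and FirstParagraph are hypothetical, and the sketch assumes the post starts with a top-level paragraph.

```csharp
using System;

public static class Excerpt
{
    // Returns everything up to and including the first </p>,
    // or the whole input when no closing paragraph tag is found.
    public static string FirstParagraph(string html)
    {
        const string closeTag = "</p>";
        int index = html.IndexOf(closeTag, StringComparison.OrdinalIgnoreCase);
        return index < 0 ? html : html.Substring(0, index + closeTag.Length);
    }
}
```

If the first paragraph sits inside another element such as a div or blockquote, that wrapper is left unclosed, so this only suits posts whose paragraphs are at the top level.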

Related

Allowing and finding links while removing HTML

I recently asked a couple of questions on here related to two subjects
1) Stopping HTML that a user may post in a text field from rendering as HTML on a web page
2) Detect links in a string and where they start and end
I am having problems trying to put the two together.
Overall, I have a text box that a user can type into. They are allowed to type in anything they want.
When it is posted to the server, I want to seek out all links in that text and save them to a database table. Then I want to show the text they typed on the web page without any HTML except what I put in myself.
So if they type www.google.com, I will turn it into http://www.google.com
I can do that no problem. However if they type something like <p style="margin-left:50px">www.google.com</p> it will find the link, change the link, but the web page will turn the margin bit into actual HTML.
I was recommended to use HTML encoding, however if I do it AFTER I have saved the links into the database, the indices are off (start and length of where the links are in the text).
If I do the HTML encoding BEFORE I save the links, the links may get messed up. If they type in
<a href="www.google.com">www.google.com</a>
it will encode the text, and the link my regex finds is
www.google.com&quot;&gt;www.google.com&lt;/a&gt
I either need to improve my regex, or find another way
For reference my regex is
@"((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:#=.+?,##%&~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])"

If I understood this correctly, you need to display any other HTML tag the user may type in as-is. Try replacing the < and > characters with &lt; and &gt; respectively.
If you do this before you run the regex replace, it should sort out your issue.
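A rough sketch of that order of operations: encode first, then detect links in the encoded text. The link pattern here is a deliberately simplified stand-in (host names only), not the asker's regex, and the names LinkText and Render are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkText
{
    // Simplified link pattern (an assumption for illustration): scheme or "www." plus a host name.
    private static readonly Regex Link = new Regex(
        @"\b((?:https?://|www\.)[a-z0-9.-]+[a-z0-9])",
        RegexOptions.IgnoreCase);

    public static string Render(string userInput)
    {
        // 1. Neutralize markup first so user-typed tags display as literal text.
        string safe = userInput.Replace("&", "&amp;")
                               .Replace("<", "&lt;")
                               .Replace(">", "&gt;");

        // 2. Find links in the already-encoded text and wrap them in anchors.
        return Link.Replace(safe, m =>
        {
            bool bare = m.Value.StartsWith("www.", StringComparison.OrdinalIgnoreCase);
            string url = bare ? "http://" + m.Value : m.Value;
            return "<a href=\"" + url + "\">" + m.Value + "</a>";
        });
    }
}
```

Because detection runs on the encoded string, any start and length offsets saved to the database refer to the encoded text, which keeps them consistent with what is later displayed.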

Parsing HTML - Getting the paragraph with the most text

I am trying to parse an HTML page (the page isn't known and changes often, but it is always a news site). Basically, I need to pull the news out of a bunch of code downloaded from the site, which I'm trying to do with a regex like this:
Match m = Regex.Match(x.Result, @"<p>(.+?)</p>");
Obvious bad idea - it pulls down anything tagged as a paragraph.
Any better ways to pull a news article or large body of text, separated from the code, from a website?

Well, this may not be exactly what you want (you haven't provided a lot of detail), but you can strip all tags from a page with a pair of simple regexes.
Remove javascript and CSS:
<(script|style).*?</\1>
Remove tags
<.*?>
Credit goes to this existing answer. What you will be left with is the "plain text" from the page.
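Applied from C#, the two patterns above might look like the sketch below. The names PlainText and FromHtml are hypothetical; the Singleline and IgnoreCase options and the final entity decoding are additions that the answer does not mention.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class PlainText
{
    public static string FromHtml(string html)
    {
        // Drop <script> and <style> elements together with their contents.
        // Singleline lets '.' match across line breaks.
        string noCode = Regex.Replace(html, @"<(script|style).*?</\1>", "",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        // Drop every remaining tag.
        string noTags = Regex.Replace(noCode, @"<.*?>", "", RegexOptions.Singleline);

        // Turn entities such as &amp; back into characters.
        return WebUtility.HtmlDecode(noTags).Trim();
    }
}
```

This is a blunt instrument: a literal < or > inside script code or attribute values can confuse both patterns, so an HTML parser is the safer route for untrusted or messy pages.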

Best method or control to display text from a file in an asp.net webpage

This may be a totally newbie question, but here it goes. I have an ASP.NET web page on which I need to display text from a .txt file. I am trying to figure out the best control or method for this. I looked at using an iframe, but it does a very poor job of displaying the text from the file (for instance, no word wrap). I don't really expect anyone to solve this for me completely, but if you have any suggestions or know of any tutorials or explanations where someone has done this, I would be very grateful.
Thanks

You can, for example, add a Literal control, assign File.ReadAllText("yourfile.txt") to its Text property (ReadAllLines returns a string array, which cannot be assigned directly), and replace \r\n with <br />.

You should just read the text-file in code (using a streamreader for example). Once you have that text, just output it to your web page.
If you're using web forms you could place a label and then set the text of that label.
If you're using MVC you could put it in the ViewBag and then in your view output the value from the ViewBag (or use a custom viewmodel)

You could use a Literal or Label control. Make sure that the control that you use encodes the text in order to avoid XSS vulnerabilities (or encode the text manually if necessary).
It might as well be necessary to substitute line endings with <br/> tags.
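Putting the answers together, a minimal sketch could read the file, encode it, and then convert line endings. The names TextFileHtml, ToHtml, and FromFile are hypothetical.

```csharp
using System;
using System.IO;
using System.Net;

public static class TextFileHtml
{
    // Encodes plain text, then converts line endings to <br /> tags.
    // Encoding must happen first, otherwise the inserted tags would be encoded too.
    public static string ToHtml(string text)
    {
        return WebUtility.HtmlEncode(text)
            .Replace("\r\n", "<br />")
            .Replace("\n", "<br />");
    }

    // Reads the whole file as a single string and converts it.
    public static string FromFile(string path)
    {
        return ToHtml(File.ReadAllText(path));
    }
}
```

In a Web Forms page the result could then be assigned to a Literal's Text property; since the text is already encoded here, the control should not encode it a second time.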

unclosed tags in html ASP.Net MVC

I have a product description that is filled in through CKEditor, and I show part of that description on the page. The problem is that cutting it short breaks the markup.
Suppose CKEditor created <p>blahblah</p> and I cut the text at the length limit my code has; then, logically, the p tag is not closed. Is there something I can do here?
Can I close the tag, or can I get just the text and append it inside a div I create? How can I do that?

So the issue is that you need to display an excerpt of the full description? Is it feasible, in that case, to strip the HTML and display only a certain number of characters?

If you need to extract text from HTML or fix the HTML, you can try using the Html Agility Pack.
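For comparison, here is a hand-rolled sketch that does not use the Html Agility Pack: it copies the markup until a budget of visible characters is spent and then closes whatever tags are still open. All names (HtmlExcerpt, Truncate, VoidTags) are hypothetical, and it assumes reasonably well-formed input such as editor output; comments containing ">" or tag names followed by a line break are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class HtmlExcerpt
{
    // Elements that never take a closing tag (a partial list, for illustration).
    private static readonly HashSet<string> VoidTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "br", "hr", "img", "input" };

    public static string Truncate(string html, int maxTextChars)
    {
        var output = new StringBuilder();
        var open = new Stack<string>();   // names of tags still waiting for a close
        int textCount = 0, i = 0;

        while (i < html.Length && textCount < maxTextChars)
        {
            if (html[i] == '<')
            {
                int end = html.IndexOf('>', i);
                if (end < 0) break;       // malformed tail: drop it
                string tag = html.Substring(i, end - i + 1);
                string name = tag.Trim('<', '>', '/').Split(' ')[0];

                if (tag.StartsWith("</"))
                {
                    if (open.Count > 0 &&
                        string.Equals(open.Peek(), name, StringComparison.OrdinalIgnoreCase))
                        open.Pop();
                }
                else if (!tag.StartsWith("<!") && !tag.EndsWith("/>") && !VoidTags.Contains(name))
                {
                    open.Push(name);
                }
                output.Append(tag);       // tags are copied but do not use up the budget
                i = end + 1;
            }
            else
            {
                int stop = i + 1;
                if (html[i] == '&')       // keep an entity such as &amp; in one piece
                {
                    int semi = html.IndexOf(';', i);
                    if (semi > 0 && semi - i <= 10) stop = semi + 1;
                }
                output.Append(html, i, stop - i);
                textCount++;
                i = stop;
            }
        }

        // Close whatever is still open, innermost first.
        while (open.Count > 0)
            output.Append("</").Append(open.Pop()).Append('>');
        return output.ToString();
    }
}
```

For example, truncating <p>Hello <b>world</b> again</p> to 8 visible characters yields <p>Hello <b>wo</b></p>. A real parser is still the sturdier choice when the input may be malformed.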

Safe HTML in ASP.NET Controls

I'm sure this is a common question...
I want the user to be able to enter and format a description.
Right now I have a multiline textbox that they can enter plain text into. It would be nice if they could do a little html formatting. Is this something I am going to have to handle? Parse out the input and only validate if there are "safe" tags like <ul><li><b> etc?
I am saving this description in an SQL db. In order to display this HTML properly do I need to use a literal on the page and just dump it in the proper area or is there a better control for what I am doing?
Also, is there a free control like the one on SO for user input/minor editing?

Have a look at the AntiXSS library. The current release (3.1) has a method called GetSafeHtmlFragment, which can be used to do the kind of parsing you're talking about.
A Literal is probably the correct control for outputting this HTML, as the Literal just outputs what's put into it and lets the browser render any HTML. Labels will output all the markup including tags.
The AJAX Control Toolkit has a text editor.

"Also, is there a free control like the one on SO for user input/minor editing?"
Stackoverflow uses the WMD control and markdown as explained here:
https://blog.stackoverflow.com/2008/09/what-was-stack-overflow-built-with/

You will need to check which tags are entered to avoid cross-site scripting (XSS) attacks and the like. You could use a regex to check that any tags are on a whitelist you maintain and strip out any others.
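One way to sketch the whitelist idea is to encode everything and then restore only a fixed set of attribute-free tags, so anything not on the list stays visible as text instead of being executed. The names TagWhitelist and Sanitize and the particular tag list are assumptions for illustration; this is not a replacement for a vetted sanitizer.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TagWhitelist
{
    // Matches encoded forms of allowed tags without attributes, e.g. &lt;b&gt; or &lt;/li&gt;.
    private static readonly Regex Allowed = new Regex(
        @"&lt;(/?)(b|i|u|p|ul|ol|li)&gt;", RegexOptions.IgnoreCase);

    public static string Sanitize(string input)
    {
        // Encode everything, then turn only the whitelisted tags back into markup.
        string encoded = WebUtility.HtmlEncode(input);
        return Allowed.Replace(encoded, "<$1$2>");
    }
}
```

Tags that carry attributes, such as a b element with an onclick handler, do not match the pattern and therefore remain encoded. This variant displays disallowed tags as text rather than stripping them.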
You can check out this link for a list of rich text editors.

In addition to the other answers, you will need to set ValidateRequest="false" in the @Page directive of the page that contains the textbox. This turns off the standard ASP.NET validation that prevents HTML from being posted from a textbox. You should then use your own validation routine, such as the one @PhilPursglove mentions.
