Parsing HTML - Getting the paragraph with the most text - c#

I am trying to parse an HTML page (the page isn't known in advance and changes often, but it is always a news site). Basically, I need to pull the news out of a bunch of code downloaded from the site, which I'm trying to do with a regex like this:
Match m = Regex.Match(x.Result, @"<p>(.+?)</p>");
Obvious bad idea - it pulls down anything tagged as a paragraph.
Any better ways to pull a news article or large body of text, separated from the code, from a website?

Well, this may not be exactly what you want (you haven't provided a lot of detail), but you can strip all tags from a page with a pair of simple regexes.
Remove javascript and CSS:
<(script|style).*?</\1>
Remove tags
<.*?>
Credit goes to this existing answer. What you will be left with is the "plain text" from the page.
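As a rough sketch of how the two patterns above could be applied in C#: the class and method names here (HtmlText, Strip) are made up for illustration, and the entity decoding and whitespace collapsing are extra steps I am assuming you would want, not part of the original answer.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Removes <script>/<style> blocks first, then every remaining tag.
    public static string Strip(string html)
    {
        // Singleline lets '.' span newlines, since script blocks usually cover many lines.
        const RegexOptions Options = RegexOptions.Singleline | RegexOptions.IgnoreCase;
        string noCode = Regex.Replace(html, @"<(script|style).*?</\1>", " ", Options);
        string noTags = Regex.Replace(noCode, @"<.*?>", " ", Options);

        // Assumed extra step: decode entities such as &amp; and collapse leftover whitespace.
        string decoded = WebUtility.HtmlDecode(noTags);
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

Note that this returns all visible text on the page, navigation and footers included, so it does not by itself isolate the article body.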

Related

Storing a document with html to Elastic Search

I am trying to index some content in Elasticsearch after stripping out the HTML. I failed to find proper examples after searching.
I have seen this:
http://elasticsearch-users.115913.n3.nabble.com/Strip-HTML-on-indexing-does-not-store-results-td3039614.html
and this:
https://github.com/elastic/elasticsearch/issues/1026
No follow up. My question: should I strip out the HTML before indexing, or is there something built into Elasticsearch to get this done?
You can use the html_strip character filter. It makes sure that what you search is only the text of the HTML (and not the div or body tag texts), and it still gives you back the HTML in the result.

Allowing and finding links while removing HTML

I recently asked a couple of questions on here related to two subjects
1) Stopping HTML that a user may post in a text field from rendering as HTML on a web page
2) Detect links in a string and where they start and end
I am having problems trying to put the two together.
Over all, I have a text box that a user can type into. They are allowed to type in anything they want.
When posted to the server, I want to seek out all links in that text and save them to a database table. Then I want to show the text they typed on the webpage without any HTML except what I put in myself.
So if they type www.google.com, I will turn it into http://www.google.com
I can do that no problem. However if they type something like <p style="margin-left:50px">www.google.com</p> it will find the link, change the link, but the web page will turn the margin bit into actual HTML.
I was recommended to use HTML encoding, however if I do it AFTER I have saved the links into the database, the indices are off (start and length of where the links are in the text).
If I do the HTML encoding BEFORE I save the links, the links may get messed up. If they type in
www.google.com
It will encode the text and the link my regex expression will find is
www.google.com">www.google.com</a&gt
I either need to improve my regex, or find another way
For reference my regex is
@"((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])"
If I understood this correctly, you need to display any other HTML tag the user may type in as-is. Try replacing the < and > characters with &lt; and &gt; respectively.
If you do this before you run the regex replace, it should sort out your issue.
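A minimal sketch of that order of operations, assuming a deliberately simplified link pattern rather than the regex from the question (the LinkFormatter class and its pattern are hypothetical). Because the pattern never accepts '&', '"' or ';', the encoded brackets cannot end up inside a detected link:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkFormatter
{
    // Hypothetical, simplified link pattern (not the asker's regex):
    // it stops at '&', '"' and whitespace, so encoded brackets are never swallowed.
    private static readonly Regex Link = new Regex(
        @"(?:https?://|www\.)[a-z0-9._/-]*[a-z0-9/]",
        RegexOptions.IgnoreCase);

    public static string Format(string userText)
    {
        // Step 1: neutralise user markup by encoding the angle brackets.
        string safe = userText.Replace("<", "&lt;").Replace(">", "&gt;");

        // Step 2: wrap each link in an anchor that we generate ourselves.
        return Link.Replace(safe, m =>
        {
            string url = m.Value;
            string href = url.StartsWith("www.", StringComparison.OrdinalIgnoreCase)
                ? "http://" + url
                : url;
            return "<a href=\"" + href + "\">" + url + "</a>";
        });
    }
}
```

If the link positions are saved to the database, they would need to be measured against the encoded string (the value of safe here) so that the stored start and length still line up with what is displayed.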

Implementing paging in Sitecore content pages

I have a section on my website where I plan to add a lot of text-based content, and rather than display this all at once it would be nice if I could add paging on just these pages. If possible, I would like to put all of my content within one content item and have the paging work automatically, building a URL along the lines of http://example.org/articles/title?page=2 or similar.
I've stumbled across an article that mentions paging with Sitecore items and this seems rather close to what I require, although mine requires pagination on a single content item, rather than multiple items. Can someone help me adapt this article towards my needs (if it's on the right track of where I should be looking)?
Is it possible to do this with a Sitecore content item?
http://briancaos.wordpress.com/2010/09/10/create-a-google-style-paging-component-in-c/
I think you'd either want to create your own WebControl and define a custom Render() method that reads the query string to write out the correct information, or you could even do it all in a Sublayout (a user control ASCX file). I've done this before by adding in a custom tag in the Rich text editor via Sitecore (I think I used <hr class="page-break" />) then in my ASCX I'd look for that HTML tag and split the content into chunks from that. I think my solution also used jQuery for some of it but you could probably do it with C# too.
Edit:
You'd want to split the tasks up and have the "paged" content as well as a list of pages (like the article you referenced) so you can easily generate the page buttons. Both of these could be done in two separate repeaters.
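To make the splitting step concrete, here is a small sketch of the page-break idea in plain C#. The ContentPager class is hypothetical, and it assumes the marker appears in the field exactly as written in the answer above; the rich text editor may emit slightly different markup, in which case the marker string would need adjusting.

```csharp
using System;

public static class ContentPager
{
    // Marker inserted by the editor in the rich text field, as suggested in the answer.
    private const string PageBreak = "<hr class=\"page-break\" />";

    // Splits the field's HTML into pages at every marker.
    public static string[] SplitPages(string html)
    {
        return html.Split(new[] { PageBreak }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Returns the 1-based page requested in the query string, clamped to a valid page.
    public static string GetPage(string html, string pageParameter)
    {
        string[] pages = SplitPages(html);
        if (pages.Length == 0) return string.Empty;

        int page;
        if (!int.TryParse(pageParameter, out page)) page = 1; // missing or invalid: first page
        page = Math.Max(1, Math.Min(page, pages.Length));
        return pages[page - 1];
    }
}
```

In the sublayout's code-behind you would pass in the field value and Request.QueryString["page"], and SplitPages(...).Length gives the number of page links to generate.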
You can split the text from a single field into the different pages using the approach described here: Split html string to page. All you need to do after that is read the query string and display the appropriate block.
If I understand you correctly you have an Item in Sitecore that has x number of text fields and you only want a subset of those displayed depending on input in the querystring ?
In its simplest form you want a sublayout that handles that.
Basically I'd imagine you having fields called Text1, Text2, Text3 etc.
This .ascx could then retrieve the data for the fields you want using the control and add them.
Then you could use the code from the article to generate the paging links.
This should be simple enough, but I'd say it would be a better idea to have an item in Sitecore and use its children as the data you want viewed and paged.
It's nicer because if you start out with 5 "page" fields and suddenly want 10, your item will keep on growing, where children can be added without bloating the parent page. Plus the user could then order the children how he sees fit.
I hope this helps a bit.

Safe HTML in ASP.NET Controls

I'm sure this is a common question...
I want the user to be able to enter and format a description.
Right now I have a multiline textbox that they can enter plain text into. It would be nice if they could do a little html formatting. Is this something I am going to have to handle? Parse out the input and only validate if there are "safe" tags like <ul><li><b> etc?
I am saving this description in an SQL db. In order to display this HTML properly do I need to use a literal on the page and just dump it in the proper area or is there a better control for what I am doing?
Also, is there a free control like the one on SO for user input/minor editing?
Have a look at the AntiXSS library. The current release (3.1) has a method called GetSafeHtmlFragment, which can be used to do the kind of parsing you're talking about.
A Literal is probably the correct control for outputting this HTML, as the Literal just outputs what's put into it and lets the browser render any HTML. Labels will output all the markup including tags.
The AJAX Control Toolkit has a text editor.
"Also, is there a free control like the one on SO for user input/minor editing?"
Stackoverflow uses the WMD control and markdown as explained here:
https://blog.stackoverflow.com/2008/09/what-was-stack-overflow-built-with/
You will need to check what tags are entered to avoid cross-site scripting attacks etc. You could use a regex to check that any tags are on a 'whitelist' you have and strip out any others.
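One way to sketch the whitelist idea, as a variant rather than a literal strip: encode everything first, then re-enable only bare whitelisted tags. The SafeHtml class and the tag list are assumptions for illustration; tags that carry attributes stay encoded and show up as text.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class SafeHtml
{
    // Hypothetical whitelist; only bare tags without attributes are re-enabled.
    private static readonly Regex AllowedTag = new Regex(
        @"&lt;(/?)(b|i|ul|li|p)&gt;",
        RegexOptions.IgnoreCase);

    public static string Clean(string input)
    {
        // Encode everything first so no user markup survives by default.
        string encoded = WebUtility.HtmlEncode(input);
        // Then turn the whitelisted tags, and only those, back into real markup.
        return AllowedTag.Replace(encoded, "<$1$2>");
    }
}
```

Encoding first means a tag that is not on the list can never be reassembled from leftover fragments, which is the usual weakness of removing tags in a single regex pass. It is still only a sketch; it does not check that tags are balanced.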
You can check out this link for a list of rich text editors.
In addition to the other answers, you will need to set ValidateRequest="false" in the @Page directive of the page that contains the textbox. This turns off the standard ASP.NET validation that prevents HTML from being posted from a textbox. You should then use your own validation routine, such as the one @PhilPursglove mentions.

Get a subsection of HTML document

I am trying to get a subsection of an HTML page. The functionality I am looking for is similar to the one implemented on most blogs. Usually, on the main page of the blog, you only see a section of the post, and when you click on the title you get the full content of that blog post.
There must be code that exists to get that subsection without breaking the HTML.
Does anyone know of good .NET code that does that?
EDIT: I need to keep the HTML formatting of the content, so stripping all the HTML isn't really an option. I wouldn't mind taking a fixed-length substring of the content (i.e. the first 800 characters or so) but then not breaking the HTML would be a nightmare.
Thanks!
I would strip the HTML from the content string first (How can I strip HTML tags from a string in ASP.NET?), then take the leftmost characters of the resulting string.
Usually this works by taking a substring of the contents of that blog post before the blog post is rendered into html.
That wouldn't be done by cutting the page output directly (messing with the HTML).
Handle that with server-side code displaying a trim of the blog content.
Usually the way that's done isn't by chunking off a piece of the HTML. Rather, there's a database that contains the blog posts, and the main page has its own HTML/CSS which dynamically loads only the first X paragraphs of each blog post.
To my mind the "simplest thing that could possibly work" would be to scan the blog post that you want to summarize until you get to the first close-paragraph </p> tag.
Don't be tempted to scan the HTML with a regex.
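A minimal sketch of the first-paragraph approach using a plain string search instead of a regex. The Excerpt class is hypothetical, and it assumes the post's HTML is well formed and that its first paragraph is not nested inside another unclosed element.

```csharp
using System;

public static class Excerpt
{
    // Returns the post up to and including the first closing </p> tag,
    // or the whole post when it contains no closing paragraph tag.
    public static string FirstParagraph(string postHtml)
    {
        const string Close = "</p>";
        int end = postHtml.IndexOf(Close, StringComparison.OrdinalIgnoreCase);
        return end < 0 ? postHtml : postHtml.Substring(0, end + Close.Length);
    }
}
```

Because the cut lands right after a closing tag, inline formatting inside the paragraph is kept intact. A post whose first paragraph sits inside a wrapper such as a div would still leave that wrapper open.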
