Storing a document with html to Elastic Search - c#

I am trying to index some content into Elasticsearch after stripping out the HTML. I have searched but could not find any proper examples.
I have seen this:
http://elasticsearch-users.115913.n3.nabble.com/Strip-HTML-on-indexing-does-not-store-results-td3039614.html
and this:
https://github.com/elastic/elasticsearch/issues/1026
Neither had a follow-up. My question: should I strip the HTML before indexing, or does Elasticsearch have something of its own to get this done?

You can use the html_strip character filter. It makes sure that your search runs only against the text of the HTML (and not against tag names such as div or body), while the stored document still comes back as the original HTML.
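As a rough sketch of how that could be wired up from C#, the helper below builds the index settings as JSON and sends them with a plain HttpClient. It assumes Elasticsearch 7 or later (typeless mappings); the analyzer name html_text, the field name content, and the class HtmlIndexSetup are made up for illustration, and a client library such as NEST would express the same settings differently.

```csharp
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

public static class HtmlIndexSetup
{
    // Builds the index body: a custom analyzer that runs the html_strip
    // character filter before tokenizing, applied to one text field.
    // "html_text" and "content" are hypothetical names.
    public static string BuildIndexBody()
    {
        var body = new
        {
            settings = new
            {
                analysis = new
                {
                    analyzer = new
                    {
                        html_text = new
                        {
                            type = "custom",
                            char_filter = new[] { "html_strip" },
                            tokenizer = "standard",
                            filter = new[] { "lowercase" }
                        }
                    }
                }
            },
            mappings = new
            {
                properties = new
                {
                    content = new { type = "text", analyzer = "html_text" }
                }
            }
        };
        return JsonSerializer.Serialize(body);
    }

    // Creates the index by PUTting the body to the index URL,
    // for example http://localhost:9200/documents (assumed address).
    public static Task<HttpResponseMessage> CreateIndexAsync(HttpClient client, string indexUrl)
    {
        var payload = new StringContent(BuildIndexBody(), Encoding.UTF8, "application/json");
        return client.PutAsync(indexUrl, payload);
    }
}
```

Documents indexed into such a field keep their HTML in _source, so nothing needs to be stripped on the C# side before indexing.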

Related

Allowing and finding links while removing HTML

I recently asked a couple of questions on here related to two subjects:
1) Stopping HTML that a user posts in a text field from rendering as HTML on a web page
2) Detecting links in a string and where they start and end
I am having problems trying to put the two together.
Overall, I have a text box that a user can type into. They are allowed to type in anything they want.
When the text is posted to the server, I want to find all links in it and save them to a database table. Then I want to show the text they typed on the web page without any HTML except what I put in myself.
So if they type www.google.com, I will turn it into http://www.google.com
I can do that no problem. However if they type something like <p style="margin-left:50px">www.google.com</p> it will find the link, change the link, but the web page will turn the margin bit into actual HTML.
I was recommended to use HTML encoding, however if I do it AFTER I have saved the links into the database, the indices are off (start and length of where the links are in the text).
If I do the HTML encoding BEFORE I save the links, the links may get messed up. If they type in
www.google.com
It will encode the text and the link my regex expression will find is
www.google.com">www.google.com</a&gt
I either need to improve my regex, or find another way.
For reference, my regex is:
@"((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:#=.+?,##%&~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])"
If I understood this correctly, you need to display any other HTML tag the user may type in as-is. Try replacing the < and > characters with &lt; and &gt; respectively.
If you do this before you run the regex replace, it should sort out your issue.
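A minimal sketch of that order of operations, replacing < and > with the entities &lt; and &gt; first and only then looking for links. The link pattern here is a deliberately simplified stand-in, not the asker's regex: it leaves out & and ; so a match cannot run into a following entity, which also means query strings containing & are cut short. The class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkText
{
    // Simplified link pattern (an assumption): scheme or "www.", a dotted
    // host name, and an optional path that excludes '&' and ';'.
    private static readonly Regex Link = new Regex(
        @"(?:(?:https?|ftp)://|www\.)[a-z0-9-]+(?:\.[a-z0-9-]+)+(?:/[a-z0-9/_.~%#=+?-]*)?",
        RegexOptions.IgnoreCase);

    // Neutralizes user-typed tags so they display as text.
    public static string EscapeAngleBrackets(string input)
    {
        return input.Replace("<", "&lt;").Replace(">", "&gt;");
    }

    // Escapes first, then wraps each link found in the escaped text
    // in an anchor tag, adding http:// to bare "www." links.
    public static string Render(string userText)
    {
        string escaped = EscapeAngleBrackets(userText);
        return Link.Replace(escaped, m =>
        {
            string href = m.Value.StartsWith("www.", StringComparison.OrdinalIgnoreCase)
                ? "http://" + m.Value
                : m.Value;
            return "<a href=\"" + href + "\">" + m.Value + "</a>";
        });
    }
}
```

If the link positions have to be stored, calling Link.Matches on the escaped string gives Index and Length values relative to the same text that is later displayed, so the offsets stay consistent.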

Parsing HTML - Getting the paragraph with the most text

I am trying to parse an HTML page (the page isn't known in advance and changes often, but it is always a news site). Basically, I need to pull the news out of a bunch of code downloaded from the site, which I'm trying to do with a regex like this:
Match m = Regex.Match(x.Result, @"<p>(.+?)</p>");
Obviously a bad idea: it pulls down anything tagged as a paragraph.
Any better ways to pull a news article or large body of text, separated from the code, from a website?
Well, this may not be exactly what you want (you haven't provided a lot of detail), but you can strip all tags from a page with a pair of simple regexes.
Remove javascript and CSS:
<(script|style).*?</\1>
Remove tags
<.*?>
Credit goes to this existing answer. What you will be left with is the "plain text" from the page.
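Applied in C#, the two patterns might look like the sketch below. The class name HtmlText is made up, and the entity decoding and whitespace collapsing are extra steps that the answer does not mention. RegexOptions.Singleline is needed so that .*? can cross line breaks inside script and style blocks.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // The two patterns from the answer above.
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<(script|style).*?</\1>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);
    private static readonly Regex AnyTag = new Regex(@"<.*?>", RegexOptions.Singleline);
    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string Strip(string html)
    {
        // Remove script/style blocks together with their contents.
        string text = ScriptOrStyle.Replace(html, " ");
        // Replace every remaining tag with a space so adjacent words stay apart.
        text = AnyTag.Replace(text, " ");
        // Added steps (assumption): decode entities and tidy up whitespace.
        text = WebUtility.HtmlDecode(text);
        return Whitespace.Replace(text, " ").Trim();
    }
}
```

This is a blunt tool: it will not cope with a literal > inside an attribute value or with malformed markup, so an HTML parser is the safer choice for pages you do not control.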

Decoding Anchor Tag in String

I have a string returned from a 3rd party API, that contains fully formed anchor tags (on most occasions). The html appears to be fully formed and correct.
I want to decode this and output it into an MVC view as a valid anchor tag, however HtmlDecode does not seem to convert the anchor tag into a link.
I am simply outputting the text like this:
<p>@HttpUtility.HtmlDecode(Model.Description)</p>
but the text comes out with anchor tag included, like this;
This is a test description. Check here - New York Times for more information
Am I expecting too much of HtmlDecode?
Use @Html.Raw(), which writes the string to the page without HTML-encoding it:
<p>@Html.Raw(Model.Description)</p>
http://msdn.microsoft.com/en-us/library/gg480740%28v=vs.118%29.aspx

unclosed tags in html ASP.Net MVC

I have a product description that is filled in through CKEditor. I show part of the description on the page, and the truncation causes a problem.
Suppose CKEditor created <p>blahblah</p> and I cut the text at the length limit in my code; then logically the p tag is not closed. So there are two things I can do:
close the tag, or get the text out of the tags and append it inside a div that I create. How can I do that?
So the issue is that you need to display an excerpt of the full description? Is it feasible, in that case, to just strip the HTML and display a certain number of characters?
If you need to extract text from HTML or fix the HTML, you can try using Html Agility Pack.
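A sketch of the strip-and-truncate option using only the base class library. The tag removal is a crude regex and the names Excerpt and FromHtml are made up for illustration; Html Agility Pack could replace the stripping step for messy markup.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class Excerpt
{
    // Returns at most maxLength characters of plain text taken from html,
    // cut at the last space before the limit and followed by "...".
    public static string FromHtml(string html, int maxLength)
    {
        // Crude tag removal (assumption: reasonably well-formed editor output).
        string text = Regex.Replace(html, "<[^>]*>", " ");
        text = WebUtility.HtmlDecode(text);
        text = Regex.Replace(text, @"\s+", " ").Trim();

        if (text.Length <= maxLength)
        {
            return text;
        }

        string cut = text.Substring(0, maxLength);
        int lastSpace = cut.LastIndexOf(' ');
        if (lastSpace > 0)
        {
            // Avoid ending the excerpt in the middle of a word.
            cut = cut.Substring(0, lastSpace);
        }
        return cut + "...";
    }
}
```

Because the result is plain text with entities decoded, it should go through the view's normal encoding (a plain @ expression in Razor) rather than being written out raw.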

Get a subsection of HTML document

I am trying to get a subsection of an HTML page. The functionality I am looking for is similar to the one implemented on most blogs. Usually, on the main page of the blog, you only see a section of the post, and when you click on the title you get the full content of that blog post.
There must be code that exists to get that subsection without breaking the HTML.
Does anyone know of good .NET code that does that?
EDIT: I need to keep the HTML formatting of the content, so stripping all the HTML isn't really an option. I wouldn't mind taking a fixed-length substring of the content (i.e. the first 800 characters or so) but then not breaking the HTML would be a nightmare.
Thanks!
I would first strip the HTML from the content string (see "How can I strip HTML tags from a string in ASP.NET?") and then take the leftmost characters of the resulting string.
Usually this works by taking a substring of the contents of that blog post before the blog post is rendered into html.
That wouldn't be done by cutting the page output directly (messing with the HTML).
Handle that with server-side code displaying a trim of the blog content.
Usually the way that's done isn't by chunking off a piece of the HTML. Rather, there's a database that contains the blog posts, and the main page has its own HTML/CSS which dynamically loads only the first X paragraphs of each blog post.
To my mind the "simplest thing that could possibly work" would be to scan the blog post that you want to summarize until you get to the first close-paragraph </p> tag.
Don't be tempted to scan the HTML with a regex.
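The first-paragraph idea can be sketched with a plain string search, no regex involved. The names PostSummary and FirstParagraph are made up, and the sketch assumes the post starts with an ordinary paragraph: it does not handle </p> appearing inside a comment or script, or a first paragraph that holds only an image.

```csharp
using System;

public static class PostSummary
{
    // Returns the post's HTML up to and including the first closing
    // paragraph tag, or the whole post if there is none.
    public static string FirstParagraph(string postHtml)
    {
        const string closeTag = "</p>";
        int end = postHtml.IndexOf(closeTag, StringComparison.OrdinalIgnoreCase);
        if (end < 0)
        {
            return postHtml;
        }
        return postHtml.Substring(0, end + closeTag.Length);
    }
}
```

Since the cut always lands right after a closing tag, inline formatting inside the first paragraph survives intact.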
