Remove HTML from string - c#

I am trying to strip the HTML markup from my RSS feed. I cannot work out how to change the code below so that it removes the HTML.
var rssFeed = XElement.Parse(e.Result);
var currentFeed = this.DataContext as app.ViewModels.FeedViewModel;
var items = from item in rssFeed.Descendants("item")
select new ATP_Tennis_App.ViewModels.FeedItemViewModel()
{
Title = item.Element("title").Value,
DatePublished = DateTime.Parse(item.Element("pubDate").Value),
Url = item.Element("link").Value,
Description = item.Element("description").Value
};
foreach (var item in items)
currentFeed.Items.Add(item);

Just use the following code:
var withHtml = "<p>hello <b>there</b></p>";
var withoutHtml = Regex.Replace(withHtml, "<.+?>", string.Empty);
This strips the HTML tags and leaves only the text, so the result is "hello there".
You can copy and use this function:
string RemoveHtmlTags(string html) {
return Regex.Replace(html, "<.+?>", string.Empty);
}
Your code will look something like this:
var rssFeed = XElement.Parse(e.Result);
var currentFeed = this.DataContext as app.ViewModels.FeedViewModel;
var items = from item in rssFeed.Descendants("item")
select new ATP_Tennis_App.ViewModels.FeedItemViewModel()
{
Title = RemoveHtmlTags(item.Element("title").Value),
DatePublished = DateTime.Parse(item.Element("pubDate").Value),
Url = item.Element("link").Value,
Description = RemoveHtmlTags(item.Element("description").Value)
};

You can use this code sample; it works fine on my side:
public static string RemoveHTMLTags(string value)
{
string step1 = Regex.Replace(value, "<[^>]*>", " ");
string step2 = HttpUtility.HtmlDecode(step1);
return step2;
}
I hope this code helps you.
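The two answers above can be combined into one helper. The following is only a minimal sketch, not a full HTML sanitizer: the class name HtmlText and the method name StripHtml are made up for illustration, and it uses System.Net.WebUtility in place of HttpUtility so that it does not depend on System.Web. A regex cannot cope with every kind of malformed markup, so treat this as suitable for simple feed descriptions only.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper: removes tags, decodes entities, tidies whitespace.
    public static string StripHtml(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        // Replace every tag with a space so adjacent words do not run together.
        string noTags = Regex.Replace(html, "<[^>]*>", " ");

        // Turn entities such as &amp; back into plain characters.
        string decoded = WebUtility.HtmlDecode(noTags);

        // Collapse the runs of whitespace left behind by the removed tags.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

In the question's query this would be called as Title = HtmlText.StripHtml(item.Element("title").Value), and likewise for Description.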

Use the following utility method:
HttpUtility.HtmlDecode(string);
Please don't refer to this answer anymore.

Related

How can I get all HTML attributes with GeckoFX/C#

In C# with GeckoFX, I have not found a method that returns all attributes of an element.
To work around this, I wrote a JavaScript function. Here is my code:
GeckoWebBrowser GeckoBrowser = ....;
GeckoNode NodeElement = ....; // HTML element where to find all HTML attributes
string JSresult = "";
string JStext = @"
function getElementAttributes(element)
{
var AttributesAssocArray = {};
for (var index = 0; index < element.attributes.length; ++index) { AttributesAssocArray[element.attributes[index].name] = element.attributes[index].value; };
return JSON.stringify(AttributesAssocArray);
}
getElementAttributes(this);
";
using (AutoJSContext JScontext = new AutoJSContext(GeckoBrowser.Window.JSContext)) { JScontext.EvaluateScript(JStext, (nsISupports)NodeElement.DomObject, out JSresult); }
Do you have any other suggestions for achieving this in C# (without JavaScript)?
The property GeckoElement.Attributes gives access to an element's attributes.
For example (this code is untested and uncompiled):
public string GetElementAttributes(GeckoElement element)
{
var result = new StringBuilder();
foreach(var a in element.Attributes)
{
result.Append(String.Format(" {0} = '{1}' ", a.NodeName, a.NodeValue));
}
return result.ToString();
}

Working with Regex to get 2 strings out of a source code [duplicate]

I am using WebRequest to download the source of a page, and then I need to use a regex to grab the following string and store it in a variable:
U_nQgAjU_tdUnfcA7lT5opoTLyLdslWDTpiNzcdkLoHlobS_HbujMw..
I also need:
bpvsid=nvnN2JFJqJc.&dcz=1
Both come out of:
<td style="cursor:pointer;" class="" onclick="NewWindow('U_nQgAjU_tdUnfcA7lT5opoTLyLdslWDTpiNzcdkLoHlobS_HbujMw..', 'bpvsid=nvnN2JFJqJc.&dcz=1', 'bpvstage_edit', '1200', '800')" onmouseout="HideHover();"><img src="gfx/info.gif" alt="" tipwidth="450" ajaxtip="openajax.php?target=modules/bpv/bpvstage_hover_info.php&rid=&oid=&bpvsid=&bpvname=" /></td>
It keeps giving me errors such as "not enough )'s".
Thanks in advance.
My current code, which is probably wrong in every way, as I am really new to this:
Regex rx = new Regex("(?<=class=\"\" onclick=\"NewWindow(').*(?=')");
longId = (rx.Match(textBox2.Text).Value);
textBox1.Text = longId;
var match = Regex.Match(s, @"onclick=""NewWindow\('([^']*)',\s*'([^']*)',.*");
if (match.Success)
{
string longId = match.Groups[1].Value;
string other = match.Groups[2].Value;
}
That will give you two groups with values:
U_nQgAjU_tdUnfcA7lT5opoTLyLdslWDTpiNzcdkLoHlobS_HbujMw..
bpvsid=nvnN2JFJqJc.&dcz=1
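The regex NewWindow\('([^']*)', '([^']*) will match what you require. The two required strings will be in Groups[1] and Groups[2]. Note the verbatim string (@"..."), which is needed so that \( is not treated as a C# escape sequence.
var match = Regex.Match(textBox2.Text, @"NewWindow\('([^']*)', '([^']*)");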
var id1 = match.Groups[1].Value;
var id2 = match.Groups[2].Value;
Note that you could also simply use string functions instead of a regex:
var s = "<td style=\"cursor:pointer;\" class=\"\" onclick=\"NewWindow('U_nQgAjU_tdUnfcA7lT5opoTLyLdslWDTpiNzcdkLoHlobS_HbujMw..', 'bpvsid=nvnN2JFJqJc.&dcz=1', 'bpvstage_edit', '1200', '800')\" onmouseout=\"HideHover();\"><img src=\"gfx/info.gif\" alt=\"\" tipwidth=\"450\" ajaxtip=\"openajax.php?target=modules/bpv/bpvstage_hover_info.php&rid=&oid=&bpvsid=&bpvname=\" /></td>";
var tmp = s.Substring(s.IndexOf("NewWindow('")).Split('\'');
var value1 = tmp[1]; // U_nQgAjU_tdUnfcA7lT5opoTLyLdslWDTpiNzcdkLoHlobS_HbujMw..
var value2 = tmp[3]; // bpvsid=nvnN2JFJqJc.&dcz=1
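The regex approach from the answers above can also be wrapped in a small helper with named groups, which makes the calling code a little easier to read. This is only a sketch: the class name OnclickParser, the method TryParse, and the group names id and query are invented for illustration, and the pattern assumes the first two arguments are single-quoted and contain no embedded quotes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OnclickParser
{
    // Captures the first two single-quoted arguments of NewWindow(...).
    private static readonly Regex NewWindowArgs = new Regex(
        @"NewWindow\(\s*'(?<id>[^']*)'\s*,\s*'(?<query>[^']*)'");

    // Hypothetical helper: returns false (and nulls) when no call is found.
    public static bool TryParse(string html, out string id, out string query)
    {
        Match m = NewWindowArgs.Match(html ?? string.Empty);
        id = m.Success ? m.Groups["id"].Value : null;
        query = m.Success ? m.Groups["query"].Value : null;
        return m.Success;
    }
}
```

The opening parenthesis is escaped as \( inside a verbatim string, which avoids the "not enough )'s" error from the question.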
I would use HtmlAgilityPack to parse HTML, then this non-regex approach works:
string html = // get your html ...
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html); // doc.Load can also consume a response-stream directly
var result = Enumerable.Empty<string>();
var firstTD = doc.DocumentNode.SelectSingleNode("//td"); // null when the document has no <td>
if (firstTD != null)
{
if (firstTD.Attributes.Contains("onclick"))
{
string onclick = firstTD.Attributes["onclick"].Value;
int newWindowIndex = onclick.IndexOf("newWindow(", StringComparison.OrdinalIgnoreCase);
if (newWindowIndex >= 0)
{
string functionBody = onclick.Substring(newWindowIndex + "newWindow(".Length);
string[] tokens = functionBody.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries);
result = tokens.Take(2).Select(s => s.Trim(' ', '\''));
}
}
}

OptionOutputOriginalCase not working in HtmlAgilityPack

I am trying to replace some text in an HTML string using HtmlAgilityPack and insert ASP.NET user controls, but the output HTML comes out in lower case. Any idea how to get the output in its original case?
Code:
public static string ConvertPageTitlesToCMSTitle(string htmlstring, string themeSlug)
{
var htmlDoc = new HtmlAgilityPack.HtmlDocument()
{
OptionOutputOriginalCase = true,
OptionWriteEmptyNodes = true
};
htmlDoc.LoadHtml(htmlstring);
var stPageTitleTags = htmlDoc.DocumentNode.SelectNodes("//stpagetitle");
foreach (var stPageTitleTag in stPageTitleTags)
{
var pageTitle = Strings.StripHTML(stPageTitleTag.InnerText);
pageTitle = pageTitle.Trim();
var pageId = CreateUpdateContentPageInDb(pageTitle, themeSlug, null, null);
var widgetControl = string.Format("<widget:PageTitleDisplay runat=\"server\" PageId=\"{0}\" Editable=\"True\" />", pageId);
htmlDoc.DocumentNode.InnerHtml = htmlDoc.DocumentNode.InnerHtml.Replace(stPageTitleTag.OuterHtml, widgetControl);
}
return htmlDoc.DocumentNode.OuterHtml;
}
As a workaround, you could create a text node instead of an HTML node:
foreach (var stPageTitleTag in stPageTitleTags)
{
var pageTitle = Strings.StripHTML(stPageTitleTag.InnerText);
pageTitle = pageTitle.Trim();
var pageId = CreateUpdateContentPageInDb(pageTitle, themeSlug, null, null);
var widgetControl = string.Format("<widget:PageTitleDisplay runat=\"server\" PageId=\"{0}\" Editable=\"True\" />", pageId);
// creating a text node
var widget = htmlDoc.CreateTextNode(widgetControl);
// replacing the <stpagetitle> node with the new one (ReplaceChild must be called on the parent)
stPageTitleTag.ParentNode.ReplaceChild(widget, stPageTitleTag);
}
This should get the output you want.

C# HtmlDecode Specific tags only

I have a large HTML-encoded string and I want to decode only specific whitelisted HTML tags.
Is there a way to do this in C#? WebUtility.HtmlDecode() decodes everything.
I am looking for an implementation of DecodeSpecificTags() that will pass the test below.
[Test]
public void DecodeSpecificTags_SimpleInput_True()
{
string input = "&lt;span&gt;i am &lt;strong color=blue&gt;very&lt;/strong&gt; big &lt;br&gt;man.&lt;/span&gt;";
string output = "&lt;span&gt;i am <strong color=blue>very</strong> big <br>man.&lt;/span&gt;";
List<string> whiteList = new List<string>(){ "strong","br" } ;
Assert.IsTrue(DecodeSpecificTags(whiteList,input) == output);
}
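You could do something like this:
public string DecodeSpecificTags(List<string> whiteListedTagNames, string encodedInput)
{
foreach (string s in whiteListedTagNames)
{
// Match an encoded opening or closing tag, e.g. &lt;strong ...&gt; or &lt;/strong&gt;
string regex = @"&lt;\s*/?\s*" + Regex.Escape(s) + @"\b.*?&gt;";
// Decode only the matched tag; everything else stays encoded
encodedInput = Regex.Replace(encodedInput, regex, m => WebUtility.HtmlDecode(m.Value));
}
return encodedInput;
}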
A better approach could be to use an HTML parser such as HtmlAgilityPack, CsQuery, or NSoup to find the specific elements and decode them in a loop.
Check this for links and examples of parsers.
Here is how I did it using CsQuery:
string input = "&lt;span&gt;i am &lt;strong color=blue&gt;very&lt;/strong&gt; big &lt;br&gt;man.&lt;/span&gt;";
string output = "&lt;span&gt;i am <strong color=blue>very</strong> big <br>man.&lt;/span&gt;";
var decoded = HttpUtility.HtmlDecode(output);
var encoded =input ; // HttpUtility.HtmlEncode(decoded);
Console.WriteLine(encoded);
Console.WriteLine(decoded);
var doc=CsQuery.CQ.CreateDocument(decoded);
var paras=doc.Select("strong").Union(doc.Select ("br")) ;
var tags=new List<KeyValuePair<string, string>>();
var counter=0;
foreach (var element in paras)
{
HttpUtility.HtmlEncode(element.OuterHTML).Dump();
var key ="---" + counter + "---";
var value= HttpUtility.HtmlDecode(element.OuterHTML);
var pair= new KeyValuePair<String,String>(key,value);
element.OuterHTML = key ;
tags.Add(pair);
counter++;
}
var finalstring= HttpUtility.HtmlEncode(doc.Document.Body.InnerHTML);
finalstring.Dump();
foreach (var element in tags)
{
finalstring=finalstring.Replace(element.Key,element.Value);
}
Console.WriteLine(finalstring);
Or you could use HtmlAgilityPack with a blacklist or a whitelist, depending on your requirement. I am using the blacklist approach.
My blacklisted tags are stored in a text file, for example "script|img".
public static string DecodeSpecificTags(this string content, List<string> blackListedTags)
{
if (string.IsNullOrEmpty(content))
{
return content;
}
blackListedTags = blackListedTags.Select(t => t.ToLowerInvariant()).ToList();
var decodedContent = HttpUtility.HtmlDecode(content);
var document = new HtmlDocument();
document.LoadHtml(decodedContent);
decodedContent = blackListedTags.Select(blackListedTag => document.DocumentNode.Descendants(blackListedTag))
.Aggregate(decodedContent,
(current1, nodes) =>
nodes.Select(htmlNode => htmlNode.WriteTo())
.Aggregate(current1,
(current, nodeContent) =>
current.Replace(nodeContent, HttpUtility.HtmlEncode(nodeContent))));
return decodedContent;
}

How to separate img tags in description of xml (RSS FEED)

I am unable to retrieve the images contained in the description of RSS feed items.
I am using the following code to retrieve the information:
var rssFeed = from el in doc.Elements("rss").Elements("channel").Elements("item")
orderby datetime(el.Element("pubDate").Value) descending
select new
{
Title = el.Element("title").Value,
Link = el.Element("link").Value,
Description =el.Element("description").Value,
PubDate = datetime(el.Element("pubDate").Value),
};
When the description is displayed, the text and the image are shown together.
I want to separate the text from the image in the description. Can you please let me know how to proceed?
RSS feed used: http://news.yahoo.com/rss/
var rssFeed = from el in doc.Elements("rss").Elements("channel").Elements("item")
orderby datetime(el.Element("pubDate").Value) descending
select new
{
Title = el.Element("title").Value,
Link = el.Element("link").Value,
Description =replace_other(el.Element("description").Value),
Image= regex(el.Element("description").Value),
PubDate = datetime(el.Element("pubDate").Value),
};
lvFeed.DataSource = rssFeed;
lvFeed.DataBind();
}
protected string regex(string source)
{
var reg1 = new Regex("src=(?:\"|\')?(?<imgSrc>[^>]*[^/]\\.(?:jpg|bmp|gif|png))(?:\"|\')?");
var match1 = reg1.Match(source);
if (match1.Success)
{
Uri UrlImage = new Uri(match1.Groups["imgSrc"].Value, UriKind.Absolute);
return UrlImage.ToString();
}
else
{
return null;
}
}
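The answer above calls replace_other for the text part but does not show it. The following sketch is one possible way to split a description into its image URL and its plain text. It is not the answerer's code: the class FeedDescription and both method names are hypothetical, and it assumes the src attribute is quoted.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class FeedDescription
{
    // Matches the quoted src attribute of the first <img> tag.
    private static readonly Regex ImgSrc = new Regex(
        "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"'](?<src>[^\"']+)[\"']",
        RegexOptions.IgnoreCase);

    // Matches any tag, used to drop markup from the text part.
    private static readonly Regex AnyTag = new Regex("<[^>]*>");

    // Returns the first image URL, or null when the description has no <img>.
    public static string ExtractImageUrl(string description)
    {
        Match m = ImgSrc.Match(description ?? string.Empty);
        return m.Success ? m.Groups["src"].Value : null;
    }

    // Returns the description with all tags removed and entities decoded.
    public static string ExtractText(string description)
    {
        string noTags = AnyTag.Replace(description ?? string.Empty, " ");
        string decoded = WebUtility.HtmlDecode(noTags);
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

In the query, Description could then be set from FeedDescription.ExtractText(...) and Image from FeedDescription.ExtractImageUrl(...), both applied to el.Element("description").Value.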
