How to parse the text out of HTML in C#

I have an HTML string like this:
"This is <h4>Some</h4> Text" + Environment.NewLine +
"This is some more <h5>text</h5>"
I want to extract only the text, so the result should be:
"This is Some Text" + Environment.NewLine +
"This is some more text"
How do I do this?

Use HtmlAgilityPack:
string html = @"This is <h4>Some</h4> Text" + Environment.NewLine +
    "This is some more <h5>text</h5>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
// InnerText returns the concatenated text of all nodes, without the tags
var str = doc.DocumentNode.InnerText;

A simple approach using a regex: Regex.Replace(source, "<.*?>", string.Empty);
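
For inputs as simple as the one in the question, the regex approach can be wrapped in a small helper that also decodes entities such as &amp;. This is only a sketch: the names HtmlText and StripTags are made up for illustration, and a regex is a heuristic rather than an HTML parser, so HtmlAgilityPack remains the safer choice for real pages.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper; not part of any library.
public static class HtmlText
{
    // Removes anything that looks like a tag, then decodes HTML entities.
    // Limitation: it does not understand comments, <script> bodies,
    // or a '>' character inside an attribute value.
    public static string StripTags(string html)
    {
        string withoutTags = Regex.Replace(html, "<[^>]*>", string.Empty);
        return WebUtility.HtmlDecode(withoutTags);
    }
}
```

Calling HtmlText.StripTags on each line of the question's input leaves the surrounding text and the Environment.NewLine separators untouched.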

Related

C# creating an HTML line with escaping

I'm writing a loop in which each iteration emits a fairly long line of HTML for the page. I've tried various combinations of @ and """ but I just can't get the hang of it.
This is what I've got now, but the single quotes are causing problems on the page, so I want to change them all to double quotes, the way attribute values are normally quoted in HTML:
sOutput += "<div class='item link-item " + starOrBullet + "'><a href='" + appSet + linkID + "&TabID=" + tabID + "' target=’_blank’>" + linkText + "</a></div>";
The variables are:
starOrBullet
appSet
linkID
tabID (NOT &TabID=)
linkText
BTW, appSet = "http://linktracker.swmed.org:8020/LinkTracker/Default.aspx?LinkID="
Can someone help me here?

You have to escape each double quote (") as \".
In your case:
sOutput += "<div class=\"item link-item " + starOrBullet + "\"><a href=\"" + appSet + linkID + "&TabID=" + tabID + "\" target=\"_blank\">" + linkText + "</a></div>";
If you concatenate many strings, as you do in a loop, use a StringBuilder for performance reasons.
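
The StringBuilder suggestion could look roughly like the sketch below. The class and method names (LinkHtml, AppendLink) are hypothetical, and the HTML-encoding of the values goes beyond what the question asked for; it is included on the assumption that the link text or URL may contain characters such as & or ".

```csharp
using System.Net;
using System.Text;

// Hypothetical helper; not part of any library.
public static class LinkHtml
{
    // Appends one <div><a>...</a></div> line to the builder.
    // Values are HTML-encoded so that quotes or ampersands in the data
    // cannot break the surrounding markup.
    public static void AppendLink(StringBuilder sb, string starOrBullet,
                                  string appSet, string linkID, string tabID,
                                  string linkText)
    {
        string href = appSet + linkID + "&TabID=" + tabID;
        sb.Append("<div class=\"item link-item ")
          .Append(WebUtility.HtmlEncode(starOrBullet))
          .Append("\"><a href=\"")
          .Append(WebUtility.HtmlEncode(href))
          .Append("\" target=\"_blank\">")
          .Append(WebUtility.HtmlEncode(linkText))
          .Append("</a></div>");
    }
}
```

Create one StringBuilder before the loop, call AppendLink once per item, and call ToString() once at the end to get sOutput.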

You can use a verbatim string (prefixed with @) and escape a double quote by doubling it, so each embedded quote is written as "".
string mystring = @"This is \t a ""verbatim"" string";
Note that backslash sequences such as \t are not processed in a verbatim string; they stay as literal characters.
You can also break a long string across several lines in any of the following ways:
Method 1
string mystring = @"First Line
Second Line
Third Line";
Method 2
string mystring = "First Line \n" +
    "Second Line \n" +
    "Third Line \n";
Method 3
var mystring = String.Join(
    Environment.NewLine,
    "First Line",
    "Second Line",
    "Third Line");

Make a habit of using the C# HTML classes to generate HTML instead of concatenating strings.
See these links for more information:
https://dejanstojanovic.net/aspnet/2014/june/generating-html-string-in-c/
https://learn.microsoft.com/en-us/dotnet/api/system.web.ui.htmltextwriter
The code below generates the HTML from your question:
// Requires: using System.IO; using System.Text; using System.Web.UI; using System.Web.UI.HtmlControls;
protected void Page_Load(object sender, EventArgs e)
{
    string starOrBullet = "star-link";
    string appSet = "http://linktracker.swmed.org:8020/LinkTracker/Default.aspx?LinkID=";
    string LinkID = "2";
    string tabID = "1";
    string linkText = "linkText_Here";
    string sOutput = string.Empty;
    StringBuilder sbControlHtml = new StringBuilder();
    using (StringWriter stringWriter = new StringWriter())
    {
        using (HtmlTextWriter htmlWriter = new HtmlTextWriter(stringWriter))
        {
            // Generate the container div control
            HtmlGenericControl divControl = new HtmlGenericControl("div");
            divControl.Attributes.Add("class", string.Format("item link-item {0}", starOrBullet));
            // Generate the link control
            HtmlGenericControl linkControl = new HtmlGenericControl("a");
            linkControl.Attributes.Add("href", string.Format("{0}{1}&TabID={2}", appSet, LinkID, tabID));
            linkControl.Attributes.Add("target", "_blank");
            linkControl.InnerText = linkText;
            // Add linkControl to the container div
            divControl.Controls.Add(linkControl);
            // Render the HTML string and dispose of the control
            divControl.RenderControl(htmlWriter);
            sbControlHtml.Append(stringWriter.ToString());
            divControl.Dispose();
        }
    }
    sOutput = sbControlHtml.ToString();
}

Extract value from input element string array

I have a string array read from the <td> cells of a datatable. Each entry looks like this:
"<input id=\"item_Job_ID\" name=\"item.Job_ID\" type=\"text\" value=\"5036\">"
How can I get only the value from it in C#?
I tried Split("\\"), which doesn't work. Can I use LINQ to extract the value?
Thank you in advance.

I think this will work for you:
string inputstr = "<input id=\"item_Job_ID\" name=\"item.Job_ID\" type=\"text\" value=\"5036\">";
// Split on quotes, equals signs, and spaces; the token after "value" is the attribute's value
var splitdataList = inputstr.Split(new string[] { "\"", "=", " " }, StringSplitOptions.RemoveEmptyEntries).ToList();
var value = splitdataList.Contains("value") ? splitdataList[splitdataList.IndexOf("value") + 1] : ""; // value is "5036"

Use Html Agility Pack:
HtmlDocument doc = new HtmlDocument();
string htmlContent = "<input id=\"item_Job_ID\" name=\"item.Job_ID\" type=\"text\" value=\"5036\">";
doc.LoadHtml(htmlContent);
HtmlNode inputNode = doc.DocumentNode.FirstChild;
string value = inputNode.GetAttributeValue("value", "0");
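
If adding a dependency is not an option, a regex can pull out the attribute for input as regular as the question's. This is a sketch under stated assumptions: the attribute value is double-quoted, the attribute is preceded by whitespace, and the names InputValue and TryGetValue are invented for this example.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; not part of any library.
public static class InputValue
{
    // Finds value="..." in a single tag and returns its contents.
    // Returns false (and an empty string) when no such attribute exists.
    public static bool TryGetValue(string inputTag, out string value)
    {
        // Verbatim string: "" stands for one literal double quote.
        Match m = Regex.Match(inputTag, @"\svalue\s*=\s*""([^""]*)""",
                              RegexOptions.IgnoreCase);
        value = m.Success ? m.Groups[1].Value : string.Empty;
        return m.Success;
    }
}
```

To process the whole array with LINQ, one option is to Select over the entries and call TryGetValue for each of them.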

Parse XML With Additional String

I need to parse XML that is embedded in an email body, with extra text before and after it.
I've tried the HTML Agility Pack, but it does not remove the non-XML text.
So how do I clean up a string that contains a complete XML document surrounded by other text?
var bodyXmlPart = @"Hi please see below client <?xml version=""1.0"" encoding=""UTF-8""?>" +
    "<ac_application>" +
    " <primary_applicant_data>" +
    " <first_name>Ross</first_name>" +
    " <middle_name></middle_name>" +
    " <last_name>Geller</last_name>" +
    " <ssn>123456789</ssn>" +
    " </primary_applicant_data>" +
    "</ac_application> thank you, \n john ";
// How do I clean up the XML part of the body before loading it?
// This will fail:
var xDoc = XDocument.Parse(bodyXmlPart);

If the body can contain any XML, not just ac_application, you can use the following code:
var bodyXmlPart = @"Hi please see below client " +
    "<ac_application>" +
    " <primary_applicant_data>" +
    " <first_name>Ross</first_name>" +
    " <middle_name></middle_name>" +
    " <last_name>Geller</last_name>" +
    " <ssn>123456789</ssn>" +
    " </primary_applicant_data>" +
    "</ac_application> thank you, \n john ";
StringBuilder pattern = new StringBuilder();
Regex regex = new Regex(@"<\?xml.*\?>", RegexOptions.Singleline);
var match = regex.Match(bodyXmlPart);
if (match.Success) // There is an XML declaration
{
    pattern.Append(@"<\?xml.*");
}
Regex regexFirstTag = new Regex(@"\s*<(\w+:)?(\w+)>", RegexOptions.Singleline);
var match1 = regexFirstTag.Match(bodyXmlPart);
if (match1.Success) // The XML has a body and we found the first tag
{
    pattern.Append(match1.Value.Trim().Replace(">", @"\>" + ".*"));
    string firstTag = match1.Value.Trim();
    Regex regexFullXmlBody = new Regex(pattern.ToString() + @"<\/" + firstTag.Trim('<','>') + @"\>", RegexOptions.None);
    var matchBody = regexFullXmlBody.Match(bodyXmlPart);
    if (matchBody.Success)
    {
        string xml = matchBody.Value;
    }
}
This code can extract any XML, not just ac_application.
It assumes that the body always contains an XML declaration.
The code looks for the XML declaration and then finds the first tag following it. That first tag is treated as the root element and is used to extract the entire XML.

I'd probably do something like this...
using System.Diagnostics;
using System.Text.RegularExpressions;
namespace Test {
    class Program {
        static void Main(string[] args) {
            var bodyXmlPart = @"Hi please see below client <?xml version=""1.0"" encoding=""UTF-8""?>" +
                "<ac_application>" +
                " <primary_applicant_data>" +
                " <first_name>Ross</first_name>" +
                " <middle_name></middle_name>" +
                " <last_name>Geller</last_name>" +
                " <ssn>123456789</ssn>" +
                " </primary_applicant_data>" +
                "</ac_application> thank you, \n john ";
            Regex regex = new Regex(@"(?<pre>.*)(?<xml>\<\?xml.*</ac_application\>)(?<post>.*)", RegexOptions.Singleline);
            var match = regex.Match(bodyXmlPart);
            if (match.Success) {
                Debug.WriteLine($"pre={match.Groups["pre"].Value}");
                Debug.WriteLine($"xml={match.Groups["xml"].Value}");
                Debug.WriteLine($"post={match.Groups["post"].Value}");
            }
        }
    }
}
This outputs...
pre=Hi please see below client
xml=<?xml version="1.0" encoding="UTF-8"?><ac_application> <primary_applicant_data> <first_name>Ross</first_name> <middle_name></middle_name> <last_name>Geller</last_name> <ssn>123456789</ssn> </primary_applicant_data></ac_application>
post= thank you,
john
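
A regex-free variant of the same idea is to cut the string at the first "<?xml" (or the first '<' when there is no declaration) and at the last '>', then let XDocument.Parse validate the result. This is a sketch, not a robust mail parser: it assumes the surrounding prose contains no stray '<' or '>' characters, and the names EmbeddedXml and Extract are made up for the example.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical helper; not part of any library.
public static class EmbeddedXml
{
    // Returns the XML document embedded in free text.
    // Throws FormatException when no candidate XML span is found;
    // XDocument.Parse throws XmlException when the span is not well-formed.
    public static XDocument Extract(string body)
    {
        int start = body.IndexOf("<?xml", StringComparison.Ordinal);
        if (start < 0) start = body.IndexOf('<');   // no declaration present
        int end = body.LastIndexOf('>');
        if (start < 0 || end < start)
            throw new FormatException("No XML found in the text.");
        return XDocument.Parse(body.Substring(start, end - start + 1));
    }
}
```

Because the parser does the validation, text that merely resembles XML raises an exception instead of being accepted silently.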

Regex replace and getting: CS1056 Unexpected character '$'

I am trying to convert BBCode to HTML, and I found this nifty little class packed with regexes that does just that.
public static string ConvertBBCodeToHTML(string str)
{
    Regex exp;
    // format the bold tags: [b][/b]
    // becomes: <strong></strong>
    exp = new Regex(@"[b](.+?)[/b]");
    str = exp.Replace(str, "<strong>$1</strong>");
    // format the italic tags: [i][/i]
    // becomes: <em></em>
    exp = new Regex(@"[i](.+?)[/i]");
    str = exp.Replace(str, "<em>$1</em>");
    // format the underline tags: [u][/u]
    // becomes: <u></u>
    exp = new Regex(@"[u](.+?)[/u]");
    str = exp.Replace(str, "<u>$1</u>");
    // format the strike tags: [s][/s]
    // becomes: <strike></strike>
    exp = new Regex(@"[s](.+?)[/s]");
    str = exp.Replace(str, "<strike>$1</strike>");
    // format the url tags: [url=www.website.com]my site[/url]
    // becomes: <a href="www.website.com">my site[/url]
    exp = new Regex(@"[url=([^]]+)]([^]]+)[/url]");
    str = exp.Replace(str, "<a href="$1">$2[/url]");
    // format the img tags:
    // becomes: <img src="www.website.com/img/image.jpeg">
    exp = new Regex(@"[img]([^]]+)[/img]");
    str = exp.Replace(str, "<img src="$1">");
    // format img tags with alt: [img=www.website.com/img/image.jpeg]this is the alt text[/img]
    // becomes: <img src="www.website.com/img/image.jpeg" alt="this is the alt text">
    exp = new Regex(@"[img=([^]]+)]([^]]+)[/img]");
    str = exp.Replace(str, "<img src="$1" alt="$2">");
    // format the colour tags: [color=red][/color]
    // becomes: <font color="red"></font>
    // supports UK English and US English spelling of colour/color
    exp = new Regex(@"[color=([^]]+)]([^]]+)[/color]");
    str = exp.Replace(str, "<font color="$1">$2</font>");
    exp = new Regex(@"[colour=([^]]+)]([^]]+)[/colour]");
    str = exp.Replace(str, "<font color="$1">$2</font>");
    // format the size tags: [size=3][/size]
    // becomes: <font size="+3"></font>
    exp = new Regex(@"[size=([^]]+)]([^]]+)[/size]");
    str = exp.Replace(str, "<font size=" +$1">$2</font>");
    // lastly, replace any new line characters with
    str = str.Replace("rn", "rn");
    return str;
}
The problem is that I get error CS1056 (Unexpected character '$') on the regex replace calls, even though the code looks perfectly valid.

You need to escape the embedded double quotes " in strings like:
"<a href="$1">$2[/url]"
They should be:
"<a href=\"$1\">$2[/url]"
Or, with a verbatim string literal:
@"<a href=""$1"">$2[/url]"

You can use single quotes around the attribute values inside the string, like below:
exp = new Regex(@"[url=([^]]+)]([^]]+)[/url]");
str = exp.Replace(str, "<a href='$1'>$2[/url]");
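
Apart from the quoting, the patterns as posted leave the square brackets unescaped, so [b] is read as a character class matching a single b rather than the literal tag. A minimal sketch of three of the rules with escaped brackets and escaped quotes follows. The class name BBCode and method name ToHtml are invented for the example, closing the link with </a> is an assumption about the intended output, and the input is not HTML-encoded, which real user content would need.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; not part of any library.
public static class BBCode
{
    public static string ToHtml(string str)
    {
        // \[ and \] match literal brackets instead of opening a character class.
        str = Regex.Replace(str, @"\[b\](.+?)\[/b\]", "<strong>$1</strong>");
        str = Regex.Replace(str, @"\[i\](.+?)\[/i\]", "<em>$1</em>");
        // In a verbatim string, "" is one literal double quote,
        // so the compiler no longer sees a stray $ outside the string.
        str = Regex.Replace(str, @"\[url=([^\]]+)\]([^\]]+)\[/url\]",
                            @"<a href=""$1"">$2</a>");
        return str;
    }
}
```

The remaining rules (img, color, size) would follow the same escaping pattern.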

C# Beginner: Delete ALL between two characters in a string (Regex?)

I have a string containing HTML code. I want to remove all HTML tags, that is, all characters between < and >.
This is my code snippet:
WebClient wClient = new WebClient();
SourceCode = wClient.DownloadString( txtSourceURL.Text );
txtSourceCode.Text = SourceCode;
//remove here all between "<" and ">"
txtSourceCodeFormatted.Text = SourceCode;
I hope somebody can help me.

Try this:
txtSourceCodeFormatted.Text = Regex.Replace(SourceCode, "<.*?>", string.Empty);
But, as others have mentioned, handle with care.

Building on Ravi's answer, you can use
string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();
and then, to collapse runs of whitespace into a single space:
string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");
