I am using HtmlAgilityPack to format HTML for a text file. <br> nodes are replaced with '\r\n' so the text stays formatted in the file. I want all the line breaks before the first actual character to be removed, but my code doesn't do that. For the test below, the original output is three empty lines followed by "Hello" and "Check" on separate lines; the expected output is just "Hello" and "Check" on separate lines, with no leading blank lines.
string html = "<br><br><br>Hello<br>Check";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
var nodes = doc.DocumentNode.SelectNodes("//br");
if (nodes != null)
{
    foreach (var node in nodes)
    {
        node.ParentNode.ReplaceChild(doc.CreateTextNode("\r\n"), node);
    }
}
html = doc.DocumentNode.InnerText.TrimStart('r','n');
OutputLog.WriteLine("trimmed: " + html);
I want all the line breaks before the first actual character to be removed,
but my code doesn't do that.
You could easily do this using Regex.
string html = "<br><br><br>Hello<br>Check";
html = Regex.Replace(html, "^(?:<br>)+", ""); // Returns "Hello<br>Check"
Then you can process your html as you want.
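One thing worth checking in the question's code: TrimStart('r','n') trims the letters r and n, not carriage return and line feed. A minimal sketch of the trim with the escape characters, assuming the text has already been extracted with "\r\n" in place of each <br> (the class and method names are made up for illustration):

```csharp
using System;

public static class LeadingBreakTrimmer
{
    // Removes every carriage return and line feed that appears before
    // the first other character. Line breaks later in the text are kept.
    public static string TrimLeadingLineBreaks(string text)
    {
        if (text == null) throw new ArgumentNullException("text");
        return text.TrimStart('\r', '\n');
    }
}
```

Called on the InnerText from the question, this would leave "Hello" and "Check" on two lines with nothing in front of them.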
I have the HTML string below, and I am trying to identify the <br> tags at the start and at the end of the whole text inside it, using the following code:
var htmlString = "<p><span><br> text <b>text <br></b>text <br></span></p>";
var document = new HtmlDocument();
document.LoadHtml(htmlString);
var nodes = document.DocumentNode.SelectNodes("//br");
But this returns all the <br> nodes, whereas I want only the ones at the start and at the end of the whole text in the HTML string below:
<p><span><br> text <b> text <br></b>text <br></span></p>
I expect 2 nodes but get 3, because the <br> tag in the middle of the text is counted as well.
Could anyone please help me achieve this? Many thanks in advance.
You can use the Split method to solve your problem. The suggestion below prints the text between the first and the last <br> tags; you can adapt the output to your requirements. A regex pattern might also work.
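const string tag = "<br>";
// The string[] overload of Split is available on all .NET versions.
var splitHtmlString = htmlString.Split(new[] { tag }, StringSplitOptions.None);
StringBuilder builder = new StringBuilder();
// Skip the first and last pieces: they lie outside the outermost <br> tags.
for (int i = 1; i < splitHtmlString.Length - 1; i++)
{
    builder.Append(splitHtmlString[i]);
    builder.Append(tag);
}
// Drop the trailing <br> appended by the last iteration (if there was one).
if (builder.Length >= tag.Length)
{
    builder.Remove(builder.Length - tag.Length, tag.Length);
}
Console.WriteLine(builder.ToString());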
Output: text <b>text <br></b>text
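As a rough alternative to the Split loop above, the same text can be cut out with IndexOf and LastIndexOf. This is only a sketch under the assumption that the tag is written exactly as "<br>" (no attributes, no "<br/>" form); the class and method names are hypothetical.

```csharp
using System;

public static class BrBoundaryExtractor
{
    // Returns the text between the first and the last occurrence of tag,
    // or an empty string when fewer than two occurrences are found.
    public static string Between(string html, string tag)
    {
        int first = html.IndexOf(tag, StringComparison.OrdinalIgnoreCase);
        int last = html.LastIndexOf(tag, StringComparison.OrdinalIgnoreCase);
        if (first < 0 || last <= first)
        {
            return string.Empty;
        }

        int start = first + tag.Length;
        if (last < start)
        {
            return string.Empty; // occurrences overlap, nothing in between
        }
        return html.Substring(start, last - start);
    }
}
```

For the HTML string in the question, Between(htmlString, "<br>") should give the same result as the loop: the inner <br> is kept and only the outermost two act as boundaries.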
You can load your string into an HtmlDocument and select the nodes you need, using the HtmlAgilityPack library:
HtmlDocument document = new HtmlDocument();
document.LoadHtml("your html code");
var htmlTag = document.DocumentNode.SelectNodes("//br");
Hi, I have a resume in HTML format.
I am reading the file using StreamReader, and I am removing the tags using the method below.
using (StreamReader sr = new StreamReader("\\Myfile.html"))
{
    String line = sr.ReadToEnd();
    string jj = Regex.Replace(line, "<.*?>", String.Empty);
}
It works very well.
However, my requirement is to get only the data inside the body tag,
without the body tag itself and without any tags inside it.
Don't use Regex for HTML/XML parsing. Use an HTML/XML parser. These questions explain well why you should not use it:
RegEx match open tags except XHTML self-contained tags
Can you provide some examples of why it is hard to parse XML and HTML with a regex?
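You can load the string into an HTML document using the HTML Agility Pack.
Here is a little example of how to do it:
public string ReplacePElement()
{
    HtmlDocument doc = new HtmlDocument();
    doc.Load(htmlFile); // htmlFile: path to the HTML file
    // "//body" finds the body element anywhere in the document;
    // SelectNodes returns null when nothing matches.
    var bodies = doc.DocumentNode.SelectNodes("//body");
    if (bodies != null)
    {
        foreach (HtmlNode p in bodies)
        {
            // p.InnerText is the body content with the tags removed
        }
    }
    return doc.DocumentNode.OuterHtml;
}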
// This code does not use the HTML Agility Pack HtmlDocument.
1) I have the HTML of some elements in a string.
2) I want to write it into a System.Windows.Forms.HtmlDocument, but that is not possible because its constructor is not accessible.
System.Windows.Forms.HtmlDocument document = new HtmlDocument(); // not allowed
document.Write(htmlstring);
foreach (HtmlElement element in document.All)
{
    string size = element.GetAttribute("font-size");
    string font = element.GetAttribute("color");
    string fontfamily = element.GetAttribute("font-family");
}
Question 1: How do I define the constructor in line number 1 of the code?
Question 2: I did a little research and found that HtmlAgilityPack is related to defining the constructor, but it is confusing because HtmlAgilityPack also contains a definition of HtmlDocument. How do I use HtmlAgilityPack to get the attributes?
It can be solved entirely with HtmlAgilityPack. I replaced the System.Windows.Forms.HtmlDocument code with an HtmlAgilityPack.HtmlDocument:
HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(HTML);
IEnumerable<HtmlNode> spans = document.DocumentNode.Descendants("span");
foreach (var element in spans)
{
    // Assumes every span has a style attribute listing font-family,
    // font-size and color, in that order, separated by ';'.
    string style = element.Attributes["style"].Value;
    string[] styles = style.Split(';');
    richTextBox1.Text += "\n" + styles[0].Replace("font-family:", "");
    richTextBox1.Text += "\n" + styles[1].Replace("font-size:", "");
    richTextBox1.Text += "\n" + styles[2].Replace("color:", "");
}
I have an XML file that I loop through; when a child node matches, I read the value of an attribute. The problem is that the values to match contain a * or ? character, in a regex-like wildcard style. Can someone tell me how to do this? For example, if a request comes in for g.portal.com, it should match the second node. I am using .NET 2.0.
Below is my XML file:
<Test>
    <Test Text="portal.com" Sample="1" />
    <Test Text="*.portal.com" Sample="201309" />
    <Test Text="portal-0?.com" Sample="201309" />
</Test>
XmlDocument xDoc = new XmlDocument();
xDoc.Load(PathToXMLFile);
foreach (XmlNode node in xDoc.DocumentElement.ChildNodes)
{
    if (node.Attributes["Sample"].InnerText == value)
    {
    }
}
What you need to do is first convert each Text attribute into a valid Regex pattern and then use it to match your input. Something like this:
string input = "g.portal.com";
XmlNode foundNode = null;
foreach (XmlNode node in xDoc.DocumentElement.ChildNodes)
{
    string value = node.Attributes["Text"].Value;
    // Escape regex metacharacters, then turn the escaped wildcards back into regex tokens.
    string pattern = Regex.Escape(value)
        .Replace(@"\*", ".*")
        .Replace(@"\?", ".");
    if (Regex.IsMatch(input, "^" + pattern + "$"))
    {
        foundNode = node;
        break; // remove if you want to continue searching
    }
}
After executing the above code, foundNode should contain the second node from the xml file.
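The conversion step can also be pulled out into a small helper so it is easy to try against a few hosts. This is only a sketch of the same Escape-then-Replace idea; the class name, method names, and the case-insensitive option are my own assumptions, not something the question asks for.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WildcardMatcher
{
    // Turns a wildcard such as "*.portal.com" into an anchored regex pattern.
    // '*' matches any run of characters (including none), '?' exactly one.
    public static string ToRegexPattern(string wildcard)
    {
        string escaped = Regex.Escape(wildcard);
        return "^" + escaped.Replace(@"\*", ".*").Replace(@"\?", ".") + "$";
    }

    // Assumption: host names are compared case-insensitively.
    public static bool IsMatch(string input, string wildcard)
    {
        return Regex.IsMatch(input, ToRegexPattern(wildcard), RegexOptions.IgnoreCase);
    }
}
```

With the sample XML, "g.portal.com" matches "*.portal.com" but not "portal.com", and "portal-01.com" matches "portal-0?.com".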
So you have an XML file that sets up patterns, right? You'll want to feed those patterns into Regexes and then stream a number of requests through them. Did I get that correct?
Assuming the XML file doesn't change, it only needs to be processed once into the corresponding Regexes. For example, *.portal.com would translate to
new Regex("\\w+\\.portal\\.com");
You'll just have to escape the dots, replace * with \\w+ and ? with \\w, if I guessed the semantics of your match patterns correctly.
Look up the correct replacements at http://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx
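To make the "process the patterns once" idea concrete, here is a hedged sketch that compiles each wildcard into a Regex up front and then runs requests through the list. It uses the \w+ and \w reading of * and ? guessed in this answer, and it adds ^ and $ anchors so a pattern has to cover the whole host. The class name and its members are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public sealed class HostPatternSet
{
    // Each entry keeps the original wildcard next to its compiled regex.
    private readonly List<KeyValuePair<string, Regex>> _compiled =
        new List<KeyValuePair<string, Regex>>();

    public HostPatternSet(IEnumerable<string> wildcardPatterns)
    {
        foreach (string wildcard in wildcardPatterns)
        {
            // Escape first, then map the escaped wildcards to word-character tokens.
            string body = Regex.Escape(wildcard)
                .Replace(@"\*", @"\w+")
                .Replace(@"\?", @"\w");
            Regex regex = new Regex("^" + body + "$", RegexOptions.IgnoreCase);
            _compiled.Add(new KeyValuePair<string, Regex>(wildcard, regex));
        }
    }

    // Returns the first wildcard whose regex matches the host, or null if none does.
    public string FindFirstMatch(string host)
    {
        foreach (KeyValuePair<string, Regex> entry in _compiled)
        {
            if (entry.Value.IsMatch(host))
            {
                return entry.Key;
            }
        }
        return null;
    }
}
```

Note the difference from the .* variant: because \w does not match a dot, "a.b.portal.com" is not matched by "*.portal.com" here. Which behaviour is right depends on what the patterns are meant to cover.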
I am trying to compare two strings, but I just realized that one of them already has some HTML formatting.
How can I get these two strings to match when doing string1 == string2? (NOTE: I don't know upfront what the HTML formatting is going to be.)
string1 = "This is a test";
string2 = "<font color=\"black\" size=\"1\">This is a test</font>";
Load the html into Html Agility Pack, and extract only the text.
string html = "<html><body><div>test</div></body></html>";
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
string text = document.DocumentNode.InnerText;
This will not remove the content of <script> nodes, but you can easily remove the script nodes first.
string newText = System.Text.RegularExpressions.Regex.Replace(OldHtmlTextHere, "<[^>]*>", string.Empty);
Check out System.Web.HttpUtility.HtmlDecode.
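Putting the last two suggestions together, a comparison could strip the tags and decode entities on both sides before comparing. This is a minimal sketch for simple fragments like the one in the question, not a general HTML parser. It uses System.Net.WebUtility.HtmlDecode so that it does not need a reference to System.Web; the class and method names are made up.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTextComparer
{
    // Removes anything that looks like a tag, decodes entities such as &amp;,
    // and trims surrounding whitespace.
    public static string ToPlainText(string html)
    {
        string withoutTags = Regex.Replace(html, "<[^>]*>", string.Empty);
        return WebUtility.HtmlDecode(withoutTags).Trim();
    }

    // Compares two strings by their plain-text content only.
    public static bool TextEquals(string a, string b)
    {
        return string.Equals(ToPlainText(a), ToPlainText(b), StringComparison.Ordinal);
    }
}
```

For markup that is nested, malformed, or contains script blocks, loading both strings into Html Agility Pack and comparing InnerText is probably the safer route.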