Render or convert Html to 'formatted' Text (.NET)

I'm importing some data from another test/bug tracking tool into TFS, and I would like to convert its description, which is in simple HTML, to a plain string in which the 'layout' of the HTML is preserved.
For example:
<body>
<ol>
<li>Log on with user Acme &amp; Co.</li>
<li>Navigate to the details tab</li>
<li>Check the official name</li>
</ol>
<br>
<br>
Expected Result:<br>
official name is filled in<br>
<br>
Actual Result:<br>
The &amp;-sign is not shown correctly<br>
See attachment.
</body>
Would become plain text with newlines inserted and HTML-entities translated like:
1. Log on with user Acme & Co.
2. Navigate to the details tab
3. Check the official name
Expected Result:
official name is filled in
Actual Result:
The &-sign is not shown correctly
See attachment
I can currently replace some tags with newlines using a regex and strip the rest, but replacing the HTML entities and handling things like <ol> and <ul> feels like re-inventing something (a browser?). So I was wondering whether someone has done this before me. I couldn't find anything using Google.

Rather than a regex, you could try loading it into the HTML Agility Pack. If it were XHTML, then an XSLT transformation might be a good option.
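A minimal sketch of the HTML Agility Pack route, assuming the HtmlAgilityPack NuGet package is referenced. The HtmlToText class and its Walk helper are hypothetical names for illustration, not part of the library; entity decoding is delegated to System.Net.WebUtility.HtmlDecode. It only handles <br>, <li> and plain text, which is roughly what the sample in the question needs.

```csharp
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // assumption: NuGet package "HtmlAgilityPack" is installed

// Hypothetical helper class, not part of the library.
static class HtmlToText
{
    public static string Convert(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var sb = new StringBuilder();
        Walk(doc.DocumentNode, sb);
        return sb.ToString().Trim();
    }

    static void Walk(HtmlNode parent, StringBuilder sb)
    {
        int item = 0; // running number for the <li> children of an <ol>
        foreach (HtmlNode child in parent.ChildNodes)
        {
            if (child.NodeType == HtmlNodeType.Text)
            {
                // Decode entities (&amp; becomes &) and collapse source whitespace.
                string text = WebUtility.HtmlDecode(child.InnerText);
                sb.Append(Regex.Replace(text, @"\s+", " ").Trim());
            }
            else if (child.NodeType == HtmlNodeType.Element)
            {
                switch (child.Name)
                {
                    case "br":
                        sb.Append('\n');
                        break;
                    case "li":
                        item++;
                        sb.Append(parent.Name == "ol" ? item + ". " : "- ");
                        Walk(child, sb);
                        sb.Append('\n');
                        break;
                    default:
                        Walk(child, sb); // any other tag: keep only its text
                        break;
                }
            }
        }
    }
}
```

Because each text node is trimmed, spaces between inline elements (for example around <b>) are lost; a fuller converter would need to track inline versus block elements.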

In the end, once I got more comfortable with TFS, I customized the work item type to include a new HTML Field, and just copied the contents into that field.
This solution was so much better, because we could now see the intended formatting of the field.

Related

Not recognizing <strong> tag in @Html.Raw in ASP.NET MVC C#

I am using ASP.NET MVC. When I use the <strong> tag in @Html.Raw, it does not appear in the desired <div>.
As shown here:
<div class="mt-4 current-cursor">
@Html.Raw("<strong>OKK</strong> <p><ul><li style='font-size:18px;'>1.Test1</li><li>2.Test2</li></p>")
</div>
The result that it displays for me is as below, that is, it does not recognize the <strong> tag at all.

Html.Raw does not interpret anything at all. It just writes the given string unencoded into the output document.
So if it doesn't look right in your case, you possibly have some CSS in that page that causes it to look as it does. You could use F12 (Developer Tools, depending on your browser) to inspect the "OKK" for details.
BTW, the other tags in your example also look wrong (which could also be an issue given existing CSS in the page).
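To make the "unencoded" point concrete, here is a small sketch that contrasts the raw string with an HTML-encoded copy. It uses System.Net.WebUtility.HtmlEncode as a stand-in for the encoding a normal Razor expression applies; Razor's own encoder may differ in details, and the RawVersusEncoded class is a hypothetical name.

```csharp
using System.Net;

// Hypothetical helper class for illustration only.
static class RawVersusEncoded
{
    // What Html.Raw sends to the browser: the string, untouched.
    public static string Raw(string markup) => markup;

    // Roughly what a plain Razor expression would send instead:
    // angle brackets become entities, so the tag shows up as literal text.
    public static string Encoded(string markup) => WebUtility.HtmlEncode(markup);
}
```

For "<strong>OKK</strong>", Raw returns the string unchanged, while Encoded returns "&lt;strong&gt;OKK&lt;/strong&gt;". If the page source contains the first form, the tag reached the browser and the styling is what needs inspecting.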
In my case, for example, using some (other) arbitrary styles, your code looks like this:

Selenium - Discerning Between Identical <articles>, C#

I'm trying to have my program sit on a webpage and wait for a specific tagName within an article to appear. The problem is that I need Selenium to check that the article contains two tagNames before clicking it, and that's where I'm stumped. The way I have my code set up right now, it doesn't click anywhere. It just sits on the page, I suspect because there's more than one article with the same main tagName that I'm trying to find. Here's the HTML:
<article>
<div class ="inner-article">
<a href ="/shop/shirts/iycbmgtqw/x9vdawcjg" style="height:150px;">
<img alt="Xrtqh7ar444" height="150" src="//d17ol771963kd3.cloudfront.net/120885/vi/xrTQH7Ar444.jpg" width="150">
</a>
<h1>
EXAMPLE_CODE
</h1>
<p>
EXAMPLE_COLOUR
</p>
</div>
</article>
All other items on this page have an identical class, and some have identical tagNames. I want to search for a specific combination of two tagNames in an article. I realize XPath is an option, but I would like to code it before knowing an XPath, where the name of the item is the only available information.
And here's the code I'm working with at the moment:
driver.Manage().Timeouts().ImplicitlyWait(TimeSpan.FromMinutes(10));
IWebElement test = driver.FindElement(By.TagName(textBox12.Text));
test.Click();
where textBox12.Text is "EXAMPLE_CODE". Am I correct in assuming that WebDriver doesn't click anything because there is more than one element with the tagName "EXAMPLE_CODE", and is there a possible way to make it first look for "EXAMPLE_CODE" and then check the secondary: "EXAMPLE_COLOUR"?
Thanks!!

You are using By.TagName incorrectly. The tag name refers to the type of element you are trying to find: for a link it is 'a', and for a div it is 'div'. The correct way of finding a link by tag name would be By.TagName("a").
You need to match on text, and for that you will need to use XPath. Assuming that the code is unique, you should try:
XPath to get the code href -- //div[@class='inner-article']/h1/a[.='EXAMPLE_CODE']
XPath to get the color href -- //div[@class='inner-article']/h1/a[.='EXAMPLE_CODE']/following-sibling::a
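As a sketch of the "check two values before clicking" idea, the expression below selects the link inside the one <article> whose <h1> text and <p> text both match. It assumes the markup is exactly as posted in the question (text directly inside <h1> and <p>). ArticleFinder, BuildXPath and FindLink are hypothetical names, and the XPath is exercised here with System.Xml.XPath on a well-formed copy of the markup rather than with a live browser.

```csharp
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical helper class for illustration only.
static class ArticleFinder
{
    // Selects the <a> inside the <article> whose <h1> equals code
    // and whose <p> equals colour (surrounding whitespace ignored).
    // Note: values containing an apostrophe would break this expression.
    public static string BuildXPath(string code, string colour) =>
        "//article[.//h1[normalize-space()='" + code + "']" +
        " and .//p[normalize-space()='" + colour + "']]//a";

    // Runs the expression against well-formed XHTML; returns null when nothing matches.
    public static XElement FindLink(string xhtml, string code, string colour) =>
        XDocument.Parse(xhtml).XPathSelectElement(BuildXPath(code, colour));
}
```

With Selenium, the same string could be passed to By.XPath, for example driver.FindElement(By.XPath(ArticleFinder.BuildXPath(textBox12.Text, "EXAMPLE_COLOUR"))).Click();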

How to find a matching closing tag in html string?

Imagine the following HTML:
<div>
<b></b>
<div>
<table>...</table>
</div>
</div> <!-- this one -->
...
How could I find the matching closing tag for the first opening div tag? Is there a regex that could find it? I guess this is quite a common requirement, but I'm struggling to find anything straightforward, just full-blown HTML parsers.

No.
Use a full-blown HTML parser. There's a reason they exist.

Use Html Agility Pack.

I'm assuming that you have tokenized the HTML tags. Now create a stack: every time you see an opening tag, push it, and every time you see a closing tag, pop and check that the tag you popped matches the closing tag. The closing tag that empties the stack is the match for the first opening tag.
But there are already HTML parsers for this, so search for one on CodePlex.
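The push/pop idea above can be sketched as follows. This is a rough illustration under stated assumptions, not a replacement for a parser: the regex tokenizer is naive (it would be fooled by tags inside comments, scripts or attribute values), the list of void elements is partial, and TagMatcher and FindMatchingClose are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration only.
static class TagMatcher
{
    // Void elements never get a closing tag, so they are not pushed (partial list).
    static readonly HashSet<string> VoidTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "br", "hr", "img", "input", "meta", "link" };

    // Returns the index of the closing tag that matches the first opening
    // tag in html, or -1 if no such tag is found.
    public static int FindMatchingClose(string html)
    {
        var open = new Stack<string>();
        // Group 1: "/" for a closing tag. Group 2: tag name. Group 3: "/" for <x/>.
        var tags = Regex.Matches(html, @"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>");
        foreach (Match m in tags)
        {
            string name = m.Groups[2].Value;
            bool closing = m.Groups[1].Value == "/";
            bool selfClosing = m.Groups[3].Value == "/" || VoidTags.Contains(name);

            if (!closing)
            {
                if (!selfClosing) open.Push(name);
            }
            else if (open.Count > 0 &&
                     string.Equals(open.Peek(), name, StringComparison.OrdinalIgnoreCase))
            {
                open.Pop();
                // The stack is empty again: this closes the first opening tag.
                if (open.Count == 0) return m.Index;
            }
        }
        return -1;
    }
}
```

For the HTML in the question, the returned index is the position of the final </div>. Closing tags that do not match the top of the stack are simply ignored here, which is one of the places a real parser does considerably more work.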

Well, you need to have a 'clear' view of the syntax! However, regular expressions are very limited in scope, and I wouldn't recommend using them for multi-line or nested tag syntax.
Instead, you need to track each tag (open/close) and use a 'handler' to deal with your request. You could use Lex/Yacc tools, but this may be overkill. Depending on the language you use, you may already have modules for this purpose (like HTMLParser in Python).

There's always LinqToXml if you want to parse HTML and don't need every little detail.

HTML Agility Pack Fix <li> list order

I have been trying to use the HTML Agility Pack to parse HTML into valid XHTML to go into a larger XML file. For the most part this works; however, lists end up formatted like:
<ul>
<li>item1
<li>item2
</li></li>
</ul>
As opposed to what I would expect:
<ul>
<li>item1</li>
<li>item2</li>
</ul>
Unfortunately, this format with nested li tags doesn't pass the schema validation, which I have no control over. Does anyone know a simple way to correct this, either through the HTML Agility Pack or an alternative? Preferably in .NET.

I found an alternative to the Agility Pack called HTML Tidy (http://tidy.sourceforge.net/). I actually used the .NET port called Tidy.NET (http://sourceforge.net/projects/tidynet/), and this seemed to fix my issue.

I found your question on other sites as well. The HTML you are trying to parse is:
<UL>
<LI>NVQ Level 3 in Fabrication and Welding Engineering
<LI>Level 3 Certificate in Engineering
<LI>Level 2 Key Skill in Application of Number
<LI>Level 2 Key Skill in Communication
<LI>Level 2 Key Skill in Information Technology
<LI>Level 2 Key Skill in Working with Others
<LI>Level 2 Key Skill in Improving Own Learning & Performance</LI></UL>
What I notice is that the first <li> is the parent of the other <li>s.
One approach I would take is to keep the first <li> and its text (it's a TextNode for HAP), save the other <li> children, remove them from the parent, and insert them (while formatting them) after the parent node.
You might have to do this recursively. Here is a peek at my solution for an HTML sanitizer class: HTML Agility Pack strip tags NOT IN whitelist
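Before re-parenting nodes by hand, it may be worth trying the parser's own switch for this. HtmlDocument has an OptionFixNestedTags setting, described in the library as partially fixing LI, TR, TH and TD nesting errors, and an OptionOutputAsXml setting for XML-style output. The sketch below assumes the HtmlAgilityPack NuGet package; ListFixer is a hypothetical name, and whether the result passes a given schema depends on the library version and the input, so the output should be inspected rather than trusted.

```csharp
using System.IO;
using HtmlAgilityPack; // assumption: NuGet package "HtmlAgilityPack" is installed

// Hypothetical helper class for illustration only.
static class ListFixer
{
    public static string ToXhtml(string html)
    {
        var doc = new HtmlDocument();
        // Both options must be set before the HTML is loaded.
        doc.OptionFixNestedTags = true; // ask the parser to repair nested <li> tags
        doc.OptionOutputAsXml = true;   // write closed, XML-style markup
        doc.LoadHtml(html);

        using (var writer = new StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }
}
```

If the nesting is still wrong afterwards, the manual approach of moving the child <li> nodes after their parent remains the fallback.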

HtmlNode ul = _sourceForm.SelectSingleNode("//ul");
HtmlNodeCollection childList = ul.ChildNodes;
Then you can loop through the child list to grab the text elements you are interested in.

Parsing HTML and pulling down a drop down

I am writing some code in C# that connects to a website, reads the HTML into my application using System.IO, and then parses it.
What I want to do now is this: there is a drop-down (combobox) on this site that has 2 static values. I want my code to pick the 2nd option in the combobox and then parse the resulting HTML from the postback.
Any Ideas?
Ya the 2 selects are always the same.
Spamming software? Uh... No. It parses a video game website for player stats and I have full permission from the vendor to do so.
Yes, I agree about the web services, but they don't exist. I have already written the HTML parser and it works great. However, I need to change this drop-down to get more data.

I'd use HtmlAgilityPack and the HtmlAgilityPack.AddOns.FormProcessor for that.

Say the code looks like this:
What color is your favorite?: <br/>
<form method="post" action="post.php">
<select name="color">
<option>AliceBlue</option>
<option>AntiqueWhite</option>
<option>Aqua</option>
</select><br/>
<input type="submit" value="Submit"/>
</form>
You would want to POST to post.php the argument "color" with the value "Aqua" (or whatever select value you want).
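A minimal sketch of that POST using System.Net.Http, assuming the form above. The FormPoster class is a hypothetical name, and the real site's URL and field names are not given in the question, so they have to be supplied by the caller.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper class for illustration only.
static class FormPoster
{
    // Builds the form body for the <select name="color"> field.
    public static FormUrlEncodedContent BuildBody(string color) =>
        new FormUrlEncodedContent(new Dictionary<string, string> { ["color"] = color });

    // Posts the selected value to the form's action URL and returns the
    // resulting HTML, which can then go through the existing parser.
    public static async Task<string> PostAsync(HttpClient client, Uri action, string color)
    {
        using (HttpResponseMessage response = await client.PostAsync(action, BuildBody(color)))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

If the page is an ASP.NET WebForms postback, hidden fields such as __VIEWSTATE usually have to be read from the original page and sent along with the selected value.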
