Comparison of HTML including handling singleton elements - c#

I know that this has to be something with a simple solution, but I'm finding myself banging my head against it. I'm trying to write regression tests for some HTML pages generated by my company's application. They're unlikely to change frequently, but we do want checks to ensure that the correct page is displayed for every country. My impulse is to pull the HTML from the approved pages and then use Selenium to check the values. The problem I'm running into is that pulling the HTML up on different browsers yields different results when it comes to singleton elements, both the void ones and the ones that simply don't require an ending tag such as <P> and <HR>. Thus, I can't just do a text compare, and even packages such as HtmlDiff show that there's a change.
Due to the occasional lack of closing tags, my attempt to fix things by pulling the text into an XML document and then re-exporting it failed. I've had some small success with monkeying with the input to add closing tags, but I'm not an HTML or XML expert, so it feels like I'm trying to patch things with band-aids that may or may not distort the results.
Is there a simple and free solution I can use for comparing two HTML pages with the same style and check for actual equivalence despite differences in singleton elements?

One approach is to use PhantomJS and write custom JavaScript to check that the pages conform to what you expect.
(In general, any headless browser can be helpful for this task.)
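If you would rather stay in C#, another option is to normalize both documents before comparing them, so that <br> and <br /> (and unclosed <p> elements) serialize the same way. A minimal sketch using HtmlAgilityPack; the file names and the exact set of options are assumptions, not a tested recipe:

using System;
using System.IO;
using HtmlAgilityPack;

static string NormalizeHtml(string html)
{
    var doc = new HtmlDocument();
    doc.OptionWriteEmptyNodes = true;   // serialize void elements in XML style, e.g. <br />
    doc.OptionAutoCloseOnEnd = true;    // emit closing tags for nodes left unclosed in the source
    doc.LoadHtml(html);
    return doc.DocumentNode.OuterHtml;
}

// Compare the normalized forms instead of the raw page source (file names are invented).
bool equivalent = string.Equals(
    NormalizeHtml(File.ReadAllText("approved.html")),
    NormalizeHtml(File.ReadAllText("rendered.html")),
    StringComparison.Ordinal);

Whitespace differences may still need separate handling; this only addresses how singleton elements are serialized.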

Related

Selenium Tests fail for different reasons in C#

I am trying to fix a number of unit tests which use Selenium Edge Web Driver for C#.
Sometimes the tests run through once without failing, but if you run them again they fail. However, the failures are not consistent and the reasons vary: a test might time out, fail to find an element, or fail to find the title of a document on the page.
I have tried all sorts of things, such as waiting for the page to load with wait-until conditions, but this is unreliable if the elements cannot be found.
Has anyone else experienced this and how was it solved?
This situation is known as flaky tests. There could be several reasons:
Network latency (timeouts)
Data-related issues
Unstable waits
Dynamic HTML content for which your code does not have flexible selectors
So without the actual code it is not possible to comment further.
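For the unstable-waits case, an explicit wait that polls for a condition is usually more reliable than a fixed sleep. A minimal sketch in C# Selenium; the element id and timeout are invented, and driver is assumed to be your existing Edge WebDriver instance:

using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

// Poll for up to 10 seconds until the element can be found, rather than failing immediately.
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
IWebElement button = wait.Until(d => d.FindElement(By.Id("submit-button")));  // invented id
button.Click();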

Which web page element tag is better to use to locate and get a value with WebDriver

I am using Selenium for UI tests with C#. As I start testing the front end, I have to decide how to "mark" my web elements so that tests are easy to write and maintain.
So if I have <input> or <div> or any other element, which is better to use: id="element_id", name="element_name", class="class_name", or just XPath?
Or something else?
Normally, the way to identify a unique element in HTML is the id attribute. Most sites take for granted that an element's id is unique.
Schema markup is mostly used when you are identifying "items" on your site, which can usually be described that way.
Note that class is not unique at all: it is mostly used for styling, and you can use the same class on multiple elements as well as multiple classes on a single element.
I would suggest speaking to your Development team.
From a developer's perspective, if I am going to change the UI to fix a bug or enhance the look and feel, those changes can affect your Selenium UI tests. The change least likely to happen is to id="element_id". Developers usually play around with CSS, so if you are finding an element with Find(By.CssSelector("cssselectors")) and the CSS has changed, your test is going to fail, and so will Find(By.XPath("//xpath")).
I would say go with Find(By.Id("element_id"));
However, if everything changes, then you will have to change everything in your tests.
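As a small illustration of the id-first advice (the id value is invented and driver is assumed to be an existing IWebDriver):

using OpenQA.Selenium;

// Locate by a stable, unique id and read the field's current value.
IWebElement emailInput = driver.FindElement(By.Id("signup-email"));   // invented id
string currentValue = emailInput.GetAttribute("value");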

How to do white-listed HTML encoding in ASP.NET MVC 5?

I'm working on a mini-CMS module for one of my projects, where users are allowed to edit content in markdown. I'm using markdown-it for parsing and showing a preview.
I thought a lot about how to send the input to the server and how to store it in the database. To avoid duplicating the markdown parsing on the server side, I concluded I would send both the markdown and the parsed HTML to the server. I think the added overhead is minimal nowadays, even on a site with heavy editing.
So at the final stage I still need to validate the HTML sent to the server, as it can be the security weak point of the system. I've read a lot about Microsoft's AntiXSS implementation and how it is (or was) quite unusable for such scenarios because it was too greedy. For example, I found this article, which even includes helper code (using HtmlAgilityPack) for a usable sanitizing implementation.
Unfortunately I haven't found anything newer than 2013 on this topic. How, at present, should I do proper HTML encoding with a whitelist of allowed tags and attributes while staying safe from any kind of XSS attack? Is code like that in the article still needed, or are there built-in solutions?
Also, if my choice of client-side markdown parsing is not viable, what are the other options? What I want to avoid is duplicating all kinds of markdown logic on both the client and the server; for example, I've written several custom extensions for markdown-it.
If you allow HTML to be edited on the client and stored on the server, you are basically opening up a can of worms. This applies to client-side HTML editors, and also to your use case where you want to save HTML generated from markdown. The problem is that a malicious user may send any HTML to your backend, not just HTML that can actually be generated from the markdown. HTML code in this case is plain user input and as such must not be trusted.
Say you want to implement whitelisting of tags and tag attributes, the HtmlAgilityPack way. Consider a simple link in HTML. You obviously want to allow the <a> tag, and also its href attribute, so that links are possible. But what about <a href="javascript:alert(1)">? That is an obvious XSS vector, and it is just one example; a naive whitelist would be vulnerable in numerous ways.
Even worse, you probably want to render user-supplied HTML on the client before a server round trip (something like a preview), and also save it to your database and render it after downloading it again. For this you have to turn off request validation and automatic encoding, as those would make it impossible.
So you have a few options that could actually work to prevent XSS.
Client-side sanitization: You could use the client-side sanitizer from Google Caja (only the JavaScript library, not the whole thing) to remove JavaScript from any HTML content. Before displaying such HTML (before previewing it on the client, or before displaying HTML downloaded from the server), you would run it through Caja, which strips any JavaScript and thus eliminates XSS. It works reasonably well in my experience: it removes JavaScript from CSS too, and also the trivial vectors such as href, src, script tags, and event attributes (onclick, onmouseover, etc.). Another similar library is HTML Purify, but that only works in newer browsers and does not remove JavaScript from CSS (because that does not work in newer browsers anyway).
Server-side sanitization: You could also use Caja on the server side properly, but that's probably way too difficult and hard to maintain for your use case, and if only this is implemented, previews on the client (without a server round trip) would still be vulnerable to DOM XSS.
Content-Security-Policy: You could use the Content-Security-Policy response header to disable all inline JavaScript on your website. One drawback is that it has implications for your client-side architecture (you cannot have inline JavaScript at all), and browser support is limited; in unsupported browsers your page will in fact be vulnerable to XSS. However, the latest versions of all current major browsers support Content-Security-Policy, so it is indeed a good option.
Separate frame: You could serve unsafe HTML from a different origin (i.e. a different subdomain) and accept the risk of XSS on that origin. However, cross-frame communication would still be a problem, and so would authentication and/or CSRF, depending on the solution. This is kind of the old-school way; the options above are probably better for your use case.
You could also use a combination of these for defense in depth.
I ended up using the code in the article, with one important change: I removed style attributes from the whitelist completely. I don't need them; the styling I allow can be achieved with classes. Style attributes are also dangerous and hard to encode/escape properly. Now I feel the code is safe enough for my current purposes.
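For reference, a minimal sketch of that kind of whitelist sanitizer using HtmlAgilityPack. The tag and attribute lists here are illustrative assumptions, not the article's actual code, and real-world sanitization has more edge cases than this:

using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HtmlWhitelist
{
    // Illustrative whitelist; adjust to the markup your markdown actually produces.
    private static readonly HashSet<string> AllowedTags = new HashSet<string>(
        StringComparer.OrdinalIgnoreCase)
        { "p", "br", "em", "strong", "ul", "ol", "li", "a", "code", "pre", "blockquote" };

    private static readonly HashSet<string> AllowedAttributes = new HashSet<string>(
        StringComparer.OrdinalIgnoreCase) { "href", "class" };

    public static string Sanitize(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        foreach (var node in doc.DocumentNode.Descendants().ToList())
        {
            if (node.NodeType != HtmlNodeType.Element)
                continue;

            if (!AllowedTags.Contains(node.Name))
            {
                node.Remove();                      // drop disallowed elements entirely
                continue;
            }

            foreach (var attr in node.Attributes.ToList())
            {
                bool allowed = AllowedAttributes.Contains(attr.Name);
                bool scriptUrl = (attr.Value ?? string.Empty).TrimStart()
                    .StartsWith("javascript:", StringComparison.OrdinalIgnoreCase);
                if (!allowed || scriptUrl)
                    node.Attributes.Remove(attr);   // strip disallowed or javascript: attributes
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}

A maintained sanitizer library, if one fits your stack, is usually preferable to hand-rolled code like this.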

What is the best practice to handle dangerous characters in asp.net?

see example: asp.net sign up form
Should you:
use JavaScript to prevent them from entering it into the textbox in the first place?
have a general function that does a find and replace on the server side?
The problem with #1 is that it will increase page load time.
ASP .NET handles potentially dangerous characters for you, by default since ASP .NET 2.0. From Request Validation in ASP.NET:
Request validation is a feature in ASP.NET that examines an HTTP request and determines whether it contains potentially dangerous content. In this context, potentially dangerous content is any HTML markup or JavaScript code in the body, header, query string, or cookies of the request. ASP.NET performs this check because markup or code in the URL query string, cookies, or posted form values might have been added for malicious purposes.
Request validation helps prevent this kind of attack. If ASP.NET detects any markup or code in a request, it throws a "potentially dangerous value was detected" error and stops page processing.
Perhaps the most important bit of this is that it happens on the server; regardless of the client accessing your application, they cannot just turn off JavaScript to work around it.
Solution number 1 won't increase load time by much.
You should ALWAYS use solution number 2 along with solution number 1, because users can turn off JavaScript in their browsers.
You accept them like regular characters on the write side. When rendering, you encode your output. You have to encode it anyway, regardless of security, so that special characters display correctly.
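A minimal sketch of that encode-at-render-time idea (the input string is invented for illustration; HttpUtility.HtmlEncode is the stock encoder in System.Web):

using System.Web;

// Store the raw characters; encode only when writing them into a page,
// so the markup displays literally instead of being interpreted.
string storedInput = "<b>hello & goodbye</b>";              // invented user input
string safeForHtml = HttpUtility.HtmlEncode(storedInput);   // "&lt;b&gt;hello &amp; goodbye&lt;/b&gt;"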
What is the best practice to handle dangerous characters in asp.net?
I did not watch the screencast you link to (questions should be self-contained anyway), but there are no dangerous characters; it all depends on the context. Take Stack Overflow, for example: it lets me input the characters Dangerous!'); DROP TABLE Questions--. Nothing dangerous there.
ASP.NET itself will do its best to prevent malicious input at the HTTP level: it won't let any user access files like web.config or files outside your web root.
As soon as you start doing something with user input, it's up to you. There's no silver bullet, no one rule that fits them all. If you're going to display the user input as HTML, you'll have to make sure you only allow harmless markup tags without any scriptable attributes. If you're allowing users to upload images, make sure only images get uploaded. If you're going to send input to an RDBMS, be sure to escape characters that have meaning for the database manipulation language.
And so on.
ALWAYS validate input on the server; this should not even be a discussion, just do it!
Client-side validation is just eye candy for the user, but the server is where it counts!
Thinking that
ASP .NET handles potentially dangerous characters for you, by default since ASP .NET 2.0. From Request Validation in ASP.NET:
is like thinking that a solid door will keep a thief out. It won't. It will only slow him down. You have to know what the most common attack vectors are and what the possible solutions are. You must understand that every, EVERY variable (field/property) you write into HTML/CSS/JavaScript is a potential attack vector that must be sanitized (through the use of appropriate libraries, like some methods included in newer MVC.NET, or at least the <%: %> of ASP.NET 4.0), no exceptions; every, EVERY query you execute is a potential attack vector that must be sanitized through the exclusive use of an ORM and parameterized queries, no exceptions. No passwords must be saved in the db. And tons of other similar things. It isn't very difficult, but laziness, complacency, and ignorance will make it harder (if not nearly impossible). If it isn't you who introduces the hole, then it's the programmer on your left, or the programmer on your right. There is no hope.
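A minimal sketch of the parameterized-query point; the connection string, table, and input value are invented for illustration:

using System.Data.SqlClient;

string connectionString = "...";                             // invented placeholder
string userInput = "Dangerous!'); DROP TABLE Questions--";   // the harmless example from above

// The value travels as a parameter, never spliced into the SQL text,
// so the database treats it purely as data.
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("SELECT Id FROM Users WHERE UserName = @name", conn))
{
    cmd.Parameters.AddWithValue("@name", userInput);
    conn.Open();
    object id = cmd.ExecuteScalar();
}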

Reducing a large single page AJAX application (jQuery, ASP.net)

I'm currently building a single page AJAX application. It's a large "sign-up form" that's been built as a multi-step wizard with multiple branches and different verbiage based on what choices the user makes. At the end of the form is an editable review page. Once the user submits the form, it sends a rather large email to us, and a small email to them. It's sort of like a very boring choose your own adventure book.
Feature creep has pushed the size of this app beyond the abilities of the current architecture, and it's too slow to use on slower computers (not good for a web app), especially those running Internet Explorer. It currently has 64 individual steps, 5,400 DOM elements, and the .aspx file alone weighs in at 300 KB (4,206 LOC). Loading the app takes anywhere from 1.5 seconds on a fast machine running Firefox 3 to 20 seconds on a slower machine running IE7. Moving between steps takes about the same amount of time.
So let's recap the features:
Multi-step, multi-path wizard-style form (64 steps)
Current step is shown in a fashion similar to this: http://codylindley.com/CSS/325/css-step-menu
Multiple validated fields
Changing verbiage based on user choices
Final, editable review page
I'm using jQuery 1.3.2 and the following plugins:
jQuery Form Wizard Plugin
jQuery clueTip plugin
jQuery sexycombo
jQuery meioMask plugin
As well as some custom script for loading the verbiage from an XML file, running the review page and some aesthetic accoutrements.
I don't have this posted anywhere public, but I'm mostly looking for tips on how to approach this sort of project and make it lightweight and extensible. If anyone has ideas about tools, tutorials, or technologies, that's what I'm looking for. I'm a pretty novice programmer (I'm mostly a CSS/xHTML/design guy), so speak gently. I just need a good plan of attack to make this app faster. Any ideas?
One way would be to break apart the steps into multiple pages / requests. To do this you would have to store the state of the previous pages somewhere. You could use a database to do this or some other method.
Another way would be to dynamically load the parts you need via AJAX. This won't help with the 5,400 DOM elements, though, but it would help with the initial page load.
Based on the question comments, a quick way to "solve" this problem is to make a C# class that mirrors all the fields in your form. Something like this:
public class MySurvey
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    // and so on...
}
Then you would store this in the session (to keep it easy... I know it's not the "best" way), like this:
public MySurvey Survey
{
    get
    {
        var survey = Session["MySurvey"] as MySurvey;
        if (survey == null)
        {
            survey = new MySurvey();
            Session["MySurvey"] = survey;
        }
        return survey;
    }
}
This way you'll always have a non-null Survey object you can work with.
The next step would be to break that big form into smaller pages, let's say step1.aspx, step2.aspx, step3.aspx, etc. All these pages would inherit from a common base page that includes the property above. After this, all you'd need to do is post the form on step1.aspx back and save its values to Survey, similar to what you're doing now but for each small piece. When done, redirect (Response.Redirect("~/stepX.aspx")) to the next page. The info from the previous pages stays in the session object. If they close the browser, they won't be able to get back to it, though.
Rather than saving it to the session, you could save it in a database or a cookie, but cookies are limited to about 4 KB, so it may not fit.
I agree with PBZ; saving the individual steps would be ideal. You can, however, do this with AJAX. That said, it would require some work that sounds like it might be outside your skill set of mostly front-end development: you'd probably need to create a new database row tied to the user's session ID and update that row every time they click to the next step. You could even tie it to their IP address, so if the whole thing blows up they can come back, hit "remember me?", and have your application retrieve it.
As far as optimizing the existing structure goes, jQuery itself carries a fair amount of overhead, and adding a lot of jQuery plugins doesn't help. I'm not saying it's bad, because it saves you a lot of time, but there are instances where you use a plugin for just one of its many features and could replace the entire plugin with a few lines of plain jQuery-enabled JavaScript.
As for minimizing the individual DOM elements, the step I mentioned above could help slim that down, because you're probably loading a lot of extra functionality from those plugins that you may or may not need.
On the back end, I'd have to see the source to see how to tell you to optimize it, but it sounds like there's a lot of redundancy in individual steps, some of that can probably be trimmed down into functions that include a little recursion, or at the least delegate some of the tasks to one another.
I wish I could help more but without digging through your source I can only suggest basic strategies. Best of luck, though!
Agreed, break up the steps. 5,400 elements is too many.
There are a few options if you need to keep it on one page.
AJAX requests to get back either raw HTML, or an array of objects to parse into HTML or DOM
Frames or Iframes
JavaScript to set innerHTML or manipulate the DOM based on the current step. Note with this option IE7 and especially IE6 will have memory leaks. Google IE6 JavaScript memory leaks for more info.
Use document.write to include only the .js file(s) needed for the current step.
HTH.
Sounds like mostly a jQuery optimization problem.
First suggestion: switch as many selectors as you can to ID selectors. I've seen speedups of 200-300x just by moving to id-attribute-only selection.
Second suggestion is more of a plan of attack. Since IE is your main problem area, I suggest using the IE8 debugger. You just need to hit f12 in IE8... Tabs 3 and 4 are script and profiler respectively.
Once you've done as much of #1 as you think you can, to get a starting point, just go to profiler, hit start profiling, do some slow action on the webpage, and then stop profiling. You will see your longest method calls, and just work your way through it.
For finer testing/dev, go to the Script tab. Breakpoints, locals, etc. are there for analysis. You can dev/test changes via the immediate window, i.e. put a breakpoint where you want to change a function, trigger the function, and execute your JavaScript in the immediate window instead of the defined JavaScript.
When you think you have something figured out, profile your changes to make sure they are really improvements. Just start the profiler, run the old code, stop it and note your benchmark. Then re-start the profiler and use the immediate window to execute your altered function.
That's about it. If that flow can't take you far enough, as mentioned above, jQuery itself (and hence its plugins) is not terribly performant, and replacing it with standard JavaScript will speed everything up. If your plugins benchmark slowly, look at replacing them with other plugins.
