Decode HTML 5 Character set

Decode HTML 5 Character set - c#

I am unable to decode the following HTMl 5 code 10&colon;00 AM in my c# code, after using HttpUtility.HtmlDecode("10&colon;00 AM"); i get the same Output instead of seried output "10:00 AM".
However when i use other HTML character sets like & or > then HttpUtility.HtmlDecode gives the desired output, is there a way to decode HTML5 character sets in c#
I have also tried with System.Net.WebUtility.HtmlDecode, System.Uri.UnescapeDataString yet the same output

As commented by Svein this is an issue with the .NET Framework not supporting HTML5 entities.
Since the .NET Framework has gone open source, you can check the code and change it to reflect the necessary changes, as someone did already. If you check out that pull request, you see the problem: there is a breaking change between HTML4 entities and HTML5 entities, which they didn't agree on how to fix. That simply means that the .NET Framework will not support HTML5 entities until a design decision is made.
For you, in the meantime, you could take the diff of the commit, and create your own HTML5 entity parser (which is simply a string replacement and some dictionary lookup).

Created a custom decoder https://github.com/rolwincrasta/HTML5Decode
Reference https://github.com/dotnet/corefx/pull/13152

Related

How to stop .NET Core from mutilating file URIs

The Uri constructor seems to be doing a lot of additional work when handling file: URIs, sometimes unfortunately to one's disadvantage. For example, file:///a%A4b is interpreted as file:///a%A4b/a%A4b via AbsoluteUri (and file://%2Fa%A4b/a%A4b in ToString() for some reason), and so is apparently every file URI that does not start with a drive letter and contains non-ASCII (even percent-encoded) characters.
Is it possible to disable this behaviour of file: URIs? It seems it has to be done globally, since I tried using different parameters in the constructor and it didn't work as well. I am fine with disabling any sort of special handling of file: URIs, since even (valid to my knowledge) URIs like file:a throw an exception due to that.
The issue seems to only crop up only in .NET Core up to 3.1. In .NET Framework or .NET 5, new Uri("file:///a%A4b") works as expected. Is there a way to get around this issue without upgrading or switching to .NET Framework?

This is a known issue in .NET Core prior to .NET 5. You will need to update to .NET 5.
Relevant links:
https://github.com/dotnet/runtime/issues/1031
https://github.com/dotnet/runtime/pull/36429
https://github.com/dotnet/docs/issues/19965

No luck with HTML decoding based on safe HTML tags (vb.net or c#)

I've spent quite a bit of time trying to figure out the best way to handle this. I'm HTML encoding rich text from untrusted user input prior to storing it in the database.
I've bounce back and forth between multiple discussions, and it seems the safest method is to:
HTML encode absolutely everything, and only decode based on a white/safe list prior to sending it back to the client.
However, I'm also seeing strong suggestions for using http://htmlagilitypack.codeplex.com/
This compares user input against your safe/white list.
I've read:
C# HtmlDecode Specific tags only
https://eksith.wordpress.com/2011/06/14/whitelist-santize-htmlagilitypack/
And really, about 10 other posts and have become frustrated because now I can't figure out the best way to handle this.
I've tried using regular expressions to use regex replace methods:
For Each tag In AcceptableTags.Split(CChar("|")).ToList()
pattern = "<" + "\s*/?\s*" + tag + ".*?" + ">"
Regex = New Regex(pattern)
input = Regex.Replace(input, pattern)
Next
This doesn't seems to work well at all.
Is there someone out there who has a tried and true method with an example implementation they wouldn't mind sharing? I'll take c# or vb.net.

Depends on your data. Whitelist on the initial validation is fine if, for example, you're trying to avoid HTML in a phone number. On the other hand, if you can't be specific about what's in and what's out then just leave it "raw".
It's highly unlikely that storing encoded data in a database is the correct thing to do.
Any system of even marginal complexity will have non-HTML clients it will have to serve data to. When you do have an HTML client, you need to escape the output appropriate to HTML. Same for XML. Similarly, if you decide today you like JSON better, you'll encode to that. CSV? No problem - put quotes around your values (and escape any quotes) in case they have commas. Use parameters when doing SQL. Get the idea?
TL;DR;
Whitelist input if you can
Saving specifically encoded data is probably wrong
Always, always, always escape appropriate to your output
Never try and do your own escaping - always use a trusted library. You will never do a good enough job.

Xml Reading with backward Compatibility

The XML that i am currently working is directly formed using XML serializer (Serializing Class and its nested counter parts)
Also if there is an addition of a new Property is directly handled by the serializer but the problem comes when there is a deletion of property (value Type) or removal of and entire class or addition of class
I wish to read the old as well as the new XML files.... I cant seem to figure out how..
Process
Some ways
But i don't think these are good for a maintainable code
1) Make the custom XML parser (this will be less flexible as every time the change is done the parser has to be updated and hence tested again).
2) Use multiple Models then migrate from old to new (Taking essential components)
3) Export Old file and import the new file (This will also require another XML file and may b related to point 2)
4) Any other means (Please suggest)
I am not well versed with XML and its versioning.
Also is XML a good choice for this or Any other file type/DB that i can use in place of XML
Any help in this regard would be helpful.

In most ways, XmlSerializer already has pretty good version support built in. In most cases, if you add or remove elements it isn't a problem: extra (unexpected) data will be silently ignored - or put into the [XmlAnyElement] / [XmlAnyAttribute] member (if one) for round-trip. Any missing data just won't be initialized. The only noticeable problem is with sub-types, but adding and removing sub-types (or entire types) is going to be fairly fundamental to any serializer. One common option in the case of sub-types is: use a single model, but just don't remove any sub-types (adding sub-types is fine, assuming you don't need to be forwards compatible). However if this is not possible, the multiple models (model per revision) is not a bad approach.

I usually follow your solution "#2" where I namespace version my models (Myapp.Models.V1.MyModel), this way you can maintain backward compatibility with clients still using the older schema (or in your case, loading an older file).
As suggested in the comments, you can use a simple attribute on the root node to determine the version, and use either xmlreader, or even a simple regex on the first line of the file to read the version number.
As far as your second question, about file type/db, depending on your needs, I would highly recommend looking at a document database like MongoDB or RavenDB, as implementation is straightforward/simple, and does not require the use of an ORM tool like entity framework to handle proper separation of concerns. If you need something portable, in the cases such as desktop app "save file", SqlLite is a good file based databases, but you will likely want to use an ORM for mapping your model to your database.
Links:
MongoDB: http://www.mongodb.org/
RavenDB: http://ravendb.net/
Sqllite: http://www.sqlite.org/

Decoding all HTML Entities

I'm looking for some function that will decode a good amount of HTML entities.
Reason is I am working on some code to take HTML content and turning it into plain text, the issue that I have is a lot of entities do not get converted using HttpUtility.HtmlDecode.
Some examples of entities I'm concerned about are , &, ©.
This is for .net 3.5.

Then maybe you will need the HttpUtility.HtmlDecode?.
It should work, you just need to add a reference to System.Web.
At least this was the way in .Net Framework < 4.
For example the following code:
MessageBox.Show(HttpUtility.HtmlDecode("&©"));
Worked and the output was as expected (ampersand and copyright symbol).
Are you sure the problem is within HtmlDecode and not something else?
UPDATE: Another class capable of doing the job, WebUtility (again HtmlDecode method) came in the newer versions of .Net. However, there seem to be some problems with it. See the HttpUtility vs. WebUtility question.

Use WebUtility.HtmlDecode included in .Net 4
For example, if I run in a console app:
Console.WriteLine(WebUtility.HtmlDecode(" , &, ©"));
I get , &, c

Email templating engine

I am about to write a simple email manager for the site I'm working on (asp.net/c#); the site sends out various emails, like on account creation, news, some user actions, etc. So there will be some email templates with placeholders like [$FirstName] which will be replaced by actual values. Pretty standard stuff. I'm just wondering if someone can advise on existing code - again, i need something very simple, without many bells/whistles, and obviously with source code (and free)
Any ideas/comments will be highly appreciated!
Thanks,
Andrey

There are several threads on Stack Overflow about this already, but I ended up rolling my own solution from various suggestions there.
I used this FormatWith extension method to take care of simple templating, and then I made a basic Email base class to take care of common tasks, like pulling in an appropriate template and replacing all the requisite info, as well as providing a Send() method.
All the emails I need to send have their own subclass deriving from the base, and define things unique to them, such as TemplateText, BindingData, Recipients, and Subject. Having them each in their own class makes them very easy to unit test idependently of the rest of the app.
So that your app can work with these email classes without really caring which one it's using, it's also a good idea to implement an interface, with any shared methods (the only one I cared about was Send()), so then your app can instantiate whatever email class it wants and work with them in the same way. Maybe generics could be used, too, but this was what I came up with.
IEmail email = new MyEmailClass();
email.Send();
Edit: There are many more suggestions here: Can I set up HTML/Email Templates with ASP.NET?

I always do the following. Templates = text string with {#} placeholders. To use a template I load the string (from whatever store) and then call string.Format(template,param1,param2..)
Simple and works well. When you need something stronger you can move to a framework of some kind but string.format has always worked well for me.
note
Alison R's link takes this method to the next step using 3.5's anonymous types to great effect. If you are 3.5 I recommend using the FormatWith there (I will) otherwise this way works well.

Having just done this myself, there is some great information at: Sending Email Both in HTML and Plain Text. Best part is, you don't need anything other than .NET.
Essentially, you create an HTML page (AKA, your formatted e-mail) with the tags that you want to replace (in the case of this solution, tags will be in the format of: <%TAGNAME%>). You then utilize the information found at the above website to create a mail template with the tags filled with the appropriate data, and the injections will be done for you into your HTML template. Then, you just use the SMTP classes built into .NET and send the mail on its way. It's very simple and straightforward.
Let me know if you have any additional questions. Hope that helps!

If you are using ASP.NET, you already have a templating engine available to you. Simply create an ASP.NET page that will produce the results for you (using whatever controls, etc, etc) you want, as well as setting the ContentType of the response to the appropriate type (either text or HTML, depending on the email format)
Make sure that this url is not publically exposed.
Then, in your code, you would create an HttpWebRequest/HttpWebResponse or WebClient and then fetch the URL and get the contents. The ASP.NET engine will process the request and return your formatted results, which you can then email.
If you want something simpler, why not use a RegEx and match? Just make sure you have a fairly unique identifer for your fields (prefix and suffix, which you can guarantee will never be used, or, you can at least write an escape sequence for it) and you could easily use the Match method to do the replace.
The only "gotcha" to the RegEx approach is that if you have to do embedded templating, then that's going to require a little more work.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Decode HTML 5 Character set - c#

Created a custom decoder https://github.com/rolwincrasta/HTML5Decode Reference https://github.com/dotnet/corefx/pull/13152

Related

How to stop .NET Core from mutilating file URIs

No luck with HTML decoding based on safe HTML tags (vb.net or c#)

Xml Reading with backward Compatibility

Decoding all HTML Entities

Email templating engine

Categories

Resources