Decoding all HTML Entities - c#

I'm looking for some function that will decode a good amount of HTML entities.
The reason is that I am working on some code that takes HTML content and turns it into plain text; the issue I have is that a lot of entities do not get converted by HttpUtility.HtmlDecode.
Some examples of entities I'm concerned about are &nbsp;, &amp;, &copy;.
This is for .NET 3.5.

Then maybe you need HttpUtility.HtmlDecode?
It should work, you just need to add a reference to System.Web.
At least this was the way in .Net Framework < 4.
For example the following code:
MessageBox.Show(HttpUtility.HtmlDecode("&amp;&copy;"));
Worked and the output was as expected (ampersand and copyright symbol).
Are you sure the problem is within HtmlDecode and not something else?
UPDATE: Another class capable of doing the job, WebUtility (again HtmlDecode method) came in the newer versions of .Net. However, there seem to be some problems with it. See the HttpUtility vs. WebUtility question.

Use WebUtility.HtmlDecode, included in .NET 4.
For example, if I run this in a console app:
Console.WriteLine(WebUtility.HtmlDecode("&nbsp;, &amp;, &copy;"));
I get  , &, © (the first character is a non-breaking space).
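To check what the decoder really produced (a non-breaking space is easy to mistake for an ordinary space in console output), the decoded string can be dumped as code points. This is only a minimal sketch; the class and method names are made up for illustration, and the input is assumed to be the three entities &nbsp;, &amp; and &copy;.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical helper, not part of the framework.
public static class DecodeInspector
{
    // Decodes the HTML text and lists each resulting character as U+XXXX.
    public static string Describe(string html)
    {
        string decoded = WebUtility.HtmlDecode(html);
        var parts = new List<string>();
        foreach (char c in decoded)
        {
            parts.Add(string.Format("U+{0:X4}", (int)c));
        }
        return string.Join(" ", parts.ToArray());
    }
}
```

Calling DecodeInspector.Describe("&nbsp;, &amp;, &copy;") should list U+00A0 for the non-breaking space, U+0026 for the ampersand and U+00A9 for the copyright sign, with the commas and spaces in between.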

Related

How to stop .NET Core from mutilating file URIs

The Uri constructor seems to be doing a lot of additional work when handling file: URIs, sometimes unfortunately to one's disadvantage. For example, file:///a%A4b is interpreted as file:///a%A4b/a%A4b via AbsoluteUri (and file://%2Fa%A4b/a%A4b in ToString() for some reason), and so is apparently every file URI that does not start with a drive letter and contains non-ASCII (even percent-encoded) characters.
Is it possible to disable this behaviour of file: URIs? It seems it has to be done globally, since I tried using different parameters in the constructor and that didn't work either. I am fine with disabling any sort of special handling of file: URIs, since even (valid to my knowledge) URIs like file:a throw an exception because of it.
The issue seems to crop up only in .NET Core up to 3.1. In .NET Framework or .NET 5, new Uri("file:///a%A4b") works as expected. Is there a way to get around this issue without upgrading or switching to .NET Framework?
This is a known issue in .NET Core prior to .NET 5. You will need to update to .NET 5.
Relevant links:
https://github.com/dotnet/runtime/issues/1031
https://github.com/dotnet/runtime/pull/36429
https://github.com/dotnet/docs/issues/19965

Decode HTML 5 Character set

I am unable to decode the HTML5 entity in 10&colon;00 AM in my C# code. After calling HttpUtility.HtmlDecode("10&colon;00 AM"); I get the same string back instead of the desired output "10:00 AM".
However, when I use other HTML entities like &amp; or &gt;, HttpUtility.HtmlDecode gives the desired output. Is there a way to decode HTML5 entities in C#?
I have also tried System.Net.WebUtility.HtmlDecode and System.Uri.UnescapeDataString, with the same result.
As commented by Svein this is an issue with the .NET Framework not supporting HTML5 entities.
Since the .NET Framework has gone open source, you can check the code and change it to reflect the necessary changes, as someone did already. If you check out that pull request, you see the problem: there is a breaking change between HTML4 entities and HTML5 entities, which they didn't agree on how to fix. That simply means that the .NET Framework will not support HTML5 entities until a design decision is made.
In the meantime, you could take the diff of that commit and create your own HTML5 entity parser (which is simply a string replacement and some dictionary lookup).
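A minimal sketch of that "string replacement and some dictionary lookup" idea might look like the following. It is not the code from the pull request: the class name and the four-entry table are assumptions made for illustration, and a real decoder would need the full HTML5 list of well over 2,000 names.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the framework.
public static class Html5EntityDecoder
{
    // Deliberately tiny sample table of HTML5-only names.
    // Values must not contain '&', or the second pass could decode them again.
    private static readonly Dictionary<string, string> ExtraEntities =
        new Dictionary<string, string>(StringComparer.Ordinal)
        {
            { "colon", ":" },
            { "comma", "," },
            { "lpar", "(" },
            { "rpar", ")" },
        };

    private static readonly Regex NamedEntity =
        new Regex(@"&([A-Za-z][A-Za-z0-9]*);", RegexOptions.Compiled);

    public static string Decode(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return html;
        }

        // Pass 1: replace the names found in the table, leave everything else untouched.
        string replaced = NamedEntity.Replace(html, delegate(Match m)
        {
            string value;
            return ExtraEntities.TryGetValue(m.Groups[1].Value, out value) ? value : m.Value;
        });

        // Pass 2: let the framework handle the entities it already knows.
        return WebUtility.HtmlDecode(replaced);
    }
}
```

With this sketch, Html5EntityDecoder.Decode("10&colon;00 AM") yields "10:00 AM". Running the table pass first keeps an escaped ampersand intact: "&amp;colon;" decodes to the literal text "&colon;" rather than to a colon.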
Created a custom decoder https://github.com/rolwincrasta/HTML5Decode
Reference https://github.com/dotnet/corefx/pull/13152

.Net 2.0 HttpUtility.UrlEncode issue

I am working on a project in .NET 2.0. It must stay on .NET 2.0; I have no way around this, as this is what the customer wants.
I am trying to URL-encode a string with this call:
HttpUtility.UrlEncode(key);
However, I get the message
HttpUtility does not contain a definition for UrlEncode
Looking at MSDN here https://msdn.microsoft.com/en-us/library/system.web.httputility.urlencode(v=vs.80).aspx I see that this should be easily possible.
I have my using statement bringing in System.Web and it is in my references too.
Any ideas on what I need to do?
If you are having trouble with System.Web, use the alternative method shown in this blog: Html-and-Uri-String-Encoding-without-SystemWeb
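One alternative that does not need System.Web at all is Uri.EscapeDataString, which has been in the System assembly since .NET 2.0. Whether this is the technique the linked blog describes is an assumption; the wrapper class below is hypothetical and only a sketch. Note that it encodes a space as %20, whereas HttpUtility.UrlEncode produces "+".

```csharp
using System;

// Hypothetical wrapper, not part of the framework.
public static class UrlEncodingSketch
{
    // Percent-encodes a value for use in a query string without referencing System.Web.
    public static string Encode(string value)
    {
        return Uri.EscapeDataString(value ?? string.Empty);
    }
}
```

For example, UrlEncodingSketch.Encode("a b&c=d") returns "a%20b%26c%3Dd".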

What is __argvalue?

Also, there is one other thing that is an lvalue in VC#, though it's a language extension - __argvalue().
Source
That was the only Google result for __argvalue.
I tried it in LINQPad and it doesn't seem to exist.
I can definitively state that there is no __argvalue in C# as of .NET Framework 4.0. The compiler binary contains a table of tokens. You can find the other hidden __ prefixed keywords starting at 0x00009840. However, a search of the entire binary shows that there is no __argvalue token.
The author of that comment may have been referring to __arglist, which can be an lvalue.

c# - error compiling targeting Compact Net Framework 3.5 - No overload for method 'GetString' takes '1' arguments

I actually have two questions regarding the same problem but I think it is better to separate them since I don't think they are related.
Background:
I am writing Windows Mobile software in VB.NET which, among its tasks, needs to connect to a mail server to send and retrieve e-mails. As a result, I also need a MIME parser (for decoding and encoding the e-mails) in order to get the attachments. At first I thought I would write a small "hack" to handle this (using ordinary string parsing), but then I saw a project written in C# at CodeProject which I decided to use in my solution. I don't know much about C#, so I simply made a class library out of the classes and used it in my VB.NET project. This library works very nicely when I am targeting the .NET Framework on normal Windows computers; however, when I tried to build the same library targeting the .NET Compact Framework, I ran into trouble. This is natural, since the Compact Framework is more limited, but I actually didn't get that many errors: only two, although repeated in various places in the code.
One of the errors is the one cited in the subject of this question, i.e. "No overload for method 'GetString' takes '1' arguments". As mentioned above, I don't know much about C#, so I converted the class with the error into VB.NET online, but I still don't understand much. Here is the function which gives the error indicated above:
public virtual string DecodeToString(string s)
{
    byte[] b = DecodeToBytes(s);
    if (m_charset != null)
    {
        //ERROR ON THIS LINE
        return System.Text.Encoding.GetEncoding(m_charset).GetString(b);
    }
    else
    {
        m_charset = System.Text.Encoding.Default.BodyName;
        //ERROR ON THIS LINE
        return System.Text.Encoding.Default.GetString(b);
    }
}
If the complete source-code is needed for this class, then I can post it in another message in this thread or you can find it by downloading the code at the web-site mentioned above and by having a look at the class named MimeCode.cs.
Anyone who can help me out? Can I rewrite above function somehow to overcome this problem?
I thank you in advance for your help.
Kind regards and a Happy New Year to all of you.
Rgds,
moster67
CF .NET requires you to use the signature Encoding.GetString(byte[] bytes, int index, int count), so try using:
...GetString(b, 0, b.Length);
If you look up the Encoding class on MSDN you'll find information about the availability of methods in the compact framework.
http://msdn.microsoft.com/en-us/library/system.text.encoding.default.aspx
In your case the System.Text.Encoding.Default property is supported by the .NET Compact Framework 3.5, 2.0, 1.0 so you should be all set.
But here is the thing: Microsoft sometimes drops methods from the class implementation, or, to be precise, overloads.
Looking at the documentation
http://msdn.microsoft.com/en-us/library/system.text.encoding.getstring.aspx you can tell by looking at the icons (small images to the left) that while the .NET Compact Framework supports the encoding class, some overloads got removed.
When you pass the byte[] array to the GetString method, it cannot find that overload so you have to add an int offset and int count.
The compact framework probably doesn't support the overload that takes only a byte array. Try the overload that takes the byte array, a start index and a count and give it 0 as the start index and b.Length as the length.
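Applied to the function from the question, the suggestion in these answers could look like the sketch below. The class name and the body of DecodeToBytes are placeholders added only so the example compiles on its own; the real class has its own MIME decoding there. The BodyName assignment from the original else branch is left out because it is a separate Compact Framework problem.

```csharp
using System.Text;

// Hypothetical stand-in for the original MIME decoder class.
public class MimeDecoderSketch
{
    protected string m_charset;

    // Placeholder: the real class performs the MIME decoding here.
    protected virtual byte[] DecodeToBytes(string s)
    {
        return Encoding.UTF8.GetBytes(s);
    }

    public virtual string DecodeToString(string s)
    {
        byte[] b = DecodeToBytes(s);

        // Pick the encoding once, then use the three-argument overload,
        // which is the one the Compact Framework provides.
        Encoding encoding = m_charset != null
            ? Encoding.GetEncoding(m_charset)
            : Encoding.Default;

        return encoding.GetString(b, 0, b.Length);
    }
}
```

The three-argument overload also exists in the full .NET Framework, so the same source should build for both targets.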
Thanks to Michael, John and Rune for replying to my question. By using your suggestions, I resolved the problem and I managed to compile/build the library targeting the CF.NET 3.5. Thanks also to ctacke for editing my question and making it more readable.
BTW, as mentioned in my first post, I had another problem which I meant to ask in another thread and which did not permit me to build the library for CF.NET, namely the line:
m_charset = System.Text.Encoding.Default.BodyName;
In this case, the problem is that CF.NET does not recognize "BodyName". I couldn't find any alternative or workaround to get the character set in use (BodyName retrieves this information), so in the end I simply assigned it a fixed value (iso-8859-1). Unfortunately this means that the library won't handle all the different character sets out there, but at least the code won't break and I was able to compile it. In any case, it's enough for me, since my application does not care about the text messages: it uses e-mails to send and get attachments (similar to Gmail-drive but with my own provider).
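The fixed-value workaround can be softened a little by trying the charset declared in the message first and using iso-8859-1 only as a last resort. This is a hedged sketch rather than the code actually used in the library: the class and method names are hypothetical, and which charset names a given Compact Framework device supports is not something this example can guarantee.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the original library.
public static class CharsetFallbackSketch
{
    // Fixed default, as chosen in the post above.
    private const string FallbackCharset = "iso-8859-1";

    // Returns the encoding for the given charset name, or the fallback
    // when the name is missing or not supported on this platform.
    public static Encoding Resolve(string charset)
    {
        if (!string.IsNullOrEmpty(charset))
        {
            try
            {
                return Encoding.GetEncoding(charset);
            }
            catch (ArgumentException)
            {
                // Unknown charset name: fall through to the default.
            }
            catch (NotSupportedException)
            {
                // Charset not available on this platform: fall through.
            }
        }
        return Encoding.GetEncoding(FallbackCharset);
    }
}
```

The result can then be passed to GetString(b, 0, b.Length), so a message that declares a supported charset is still decoded correctly.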
Thank you once again.
Kind regards,
moster67
