Decode HTML string in c# [duplicate]

How do I decode this string 'Sch\u00f6nen' (@"Sch\u00f6nen") in C#? I've tried HttpUtility, but it doesn't give me the result I need, which is "Schönen".

Regex.Unescape did the trick:
System.Text.RegularExpressions.Regex.Unescape(@"Sch\u00f6nen");
Note that you need to be careful when testing your variants or writing unit tests: "Sch\u00f6nen" is already "Schönen", because the compiler resolves the escape. You need @ in front of the string literal to keep \u00f6 as literal text in the string.
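To make the difference between the two literals concrete, here is a minimal sketch. The class and method names (UnescapeDemo, Decode) are made up for illustration; only Regex.Unescape comes from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnescapeDemo
{
    // Turns literal escape text (backslash, 'u', four hex digits) into the character it names.
    // Regex.Unescape also interprets other regex escapes such as \n and \t,
    // so this suits input where backslashes only appear as part of escapes.
    public static string Decode(string escaped) => Regex.Unescape(escaped);
}
```

UnescapeDemo.Decode(@"Sch\u00f6nen") returns "Schönen". The verbatim literal @"Sch\u00f6nen" has 12 characters (the backslash and the hex digits are real characters), while the regular literal "Sch\u00f6nen" has 7, because the compiler has already replaced the escape.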

If you landed on this question because you see "Sch\u00f6nen" (or similar \uXXXX values in a string constant): this is not an encoding. It is a way to represent Unicode characters as escape sequences, similar to how a string represents a new line by \n and a carriage return by \r.
I don't think you have to decode.
string unicodestring = "Sch\u00f6nen";
Console.WriteLine(unicodestring);
This prints Schönen.

I wrote some code that converts Unicode escape sequences to actual characters. (But the best answer in this topic works fine and is less complex.)
string stringWithUnicodeSymbols = @"{""id"": 10440119, ""photo"": 10945418, ""first_name"": ""\u0415\u0432\u0433\u0435\u043d\u0438\u0439""}";
// Regex.Split also returns the captured group, so every odd index holds the four hex digits of one escape.
var parts = Regex.Split(stringWithUnicodeSymbols, @"\\u([a-fA-F0-9]{4})");
string outString = "";
for (int i = 0; i < parts.Length; i++)
{
    if (i % 2 == 1)
    {
        // Captured hex digits: convert them to the corresponding UTF-16 character.
        outString += (char) Convert.ToUInt16(parts[i], 16);
    }
    else
    {
        // Ordinary text between escapes is copied unchanged.
        outString += parts[i];
    }
}

Related

Removing Escape Characters for a string

I am having a bit of a problem with escape characters in a string that I am reading from a txt file.
They cause an error later in my program, so they need to be removed, but I can't seem to filter them out:
public static List<string> loadData(string type)
{
    List<string> dataList = new List<string>();
    try
    {
        string path = Path.Combine(Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location), "Data");
        string text = File.ReadAllText(path + type);
        string[] dataArray = text.Split(',');
        foreach (var data in dataArray)
        {
            string dataUnescaped = Regex.Unescape(data);
            if (!string.IsNullOrEmpty(dataUnescaped) && (!dataUnescaped.Contains(@"\r") || (!dataUnescaped.Contains(@"\n"))))
            {
                dataList.Add(data);
            }
        }
        return dataList;
    }
    catch(Exception e)
    {
        Console.WriteLine(e);
        return dataList;
    }
}
I have tried text.Replace(@"\r\n")
and an if statement, but I just can't seem to remove them from my string.
Any ideas will be appreciated.

If you add the @ sign before a string literal, you get a verbatim string, in which backslashes are not treated as escape characters.
So if you wanted a path without @, you would need to do this:
string s = "c:\\myfolder\\myfile.txt";
But if you add the @ before your \r\n, then instead of the escape sequence for a Windows newline you get the literal four-character text "\r\n" (backslash, r, backslash, n).
So this removes all occurrences of that literal text, not the newlines you want to remove:
text.Replace(@"\r\n")
To fix that, you would need to use:
text = text.Replace(Environment.NewLine, string.Empty);
You can use Environment.NewLine instead of \r and \n because Environment knows which OS you are currently running on and changes the replaced characters depending on that.
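One caveat: Environment.NewLine reflects the OS the program runs on, not the OS that wrote the file, so a file with Windows line endings read on Linux would keep its \r\n pairs. A minimal sketch that removes both characters regardless of platform follows; the names NewLineDemo and StripLineBreaks are hypothetical.

```csharp
using System;

public static class NewLineDemo
{
    // Removes carriage returns and line feeds individually,
    // so "\r\n", "\n" and stray "\r" are all handled.
    // Note the regular (non-verbatim) literals: "\r" is one character here.
    public static string StripLineBreaks(string text) =>
        text.Replace("\r", string.Empty).Replace("\n", string.Empty);
}
```

NewLineDemo.StripLineBreaks("a\r\nb\nc") returns "abc". Because strings are immutable, the result has to be assigned or returned; calling Replace without using its return value changes nothing.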

Remove control characters sequence from string EOT comma ETX

I have some xml files where some control sequences are included in the text: EOT,ETX(anotherchar)
The other char following EOT comma ETX is not always present and not always the same.
Actual example:
<FatturaElettronicaHeader xmlns="">
</F<EOT>‚<ETX>èatturaElettronicaHeader>
Where <EOT> is the 04 char and <ETX> is 03. As I have to parse the xml this is actually a big issue.
Is this some kind of encoding I never heard about?
I have tried to remove all the control characters from my string but it will leave the comma that is still unwanted.
If I use Encoding.ASCII.GetString(file); the unwanted characters will be replaced with a '?' that is easy to remove but it will still leave some unwanted characters causing parse issues:
<BIC></WBIC> something like this.
string xml = Encoding.ASCII.GetString(file);
xml = new string(xml.Where(cc => !char.IsControl(cc)).ToArray());
I hence need to remove all this kind of control character sequences to be able to parse this kind of files and I'm unsure about how to programmatically check if a character is part of a control sequence or not.

I found out that there are two bad patterns in my files: the first is the one in the title and the second is EOT<.
In order to make it work I looked at this thread: Remove substring that starts with SOT and ends EOT, from string
and modified the code a little
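private static string RemoveInvalidCharacters(string input)
{
    while (true)
    {
        var start = input.IndexOf('\u0004');
        if (start == -1) break;
        // Second pattern: EOT followed by '<' (remove both characters).
        if (start + 1 < input.Length && input[start + 1] == '<')
        {
            input = input.Remove(start, 2);
            continue;
        }
        // First pattern: EOT, comma-like character, ETX, one trailing character.
        if (start + 2 < input.Length && input[start + 2] == '\u0003')
        {
            input = input.Remove(start, Math.Min(4, input.Length - start));
            continue;
        }
        // Unknown pattern: drop the lone EOT so the loop always makes progress.
        input = input.Remove(start, 1);
    }
    return input;
}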
A further cleanup with this code:
static string StripExtended(string arg)
{
    StringBuilder buffer = new StringBuilder(arg.Length); // Max length
    foreach (char ch in arg)
    {
        UInt16 num = Convert.ToUInt16(ch); // In .NET, chars are UTF-16
        // The basic characters have the same code points as ASCII, and the extended characters are bigger
        if ((num >= 32u) && (num <= 126u)) buffer.Append(ch);
    }
    return buffer.ToString();
}
And now everything looks fine to parse.

Sorry for the delay in responding,
but in my opinion the root of the problem might be incorrect decoding of a p7m file.
I think the xml file you are trying to sanitize was originally a .xml.p7m file.
I believe the correct way to sanitize the file is to use a library such as Bouncy Castle (available for Java and .NET) and its CmsSignedData class.
CmsSignedData cmsObj = new CmsSignedData(content);
if (cmsObj.SignedContent != null)
{
    using (var stream = new MemoryStream())
    {
        cmsObj.SignedContent.Write(stream);
        content = stream.ToArray();
    }
}

Replacing anchor/link in text

I'm having issues doing a find/replace type of action in my function. I'm extracting the <a href="link">anchor from an article and replacing it with this format: [link anchor]. The link and anchor will be dynamic, so I can't hard-code the values. What I have so far is:
public static string GetAndFixAnchor(string articleBody, string articleWikiCheck) {
    string theString = string.Empty;
    switch (articleWikiCheck) {
        case "id|wpTextbox1":
            StringBuilder newHtml = new StringBuilder(articleBody);
            Regex r = new Regex(@"\<a href=\""([^\""]+)\"">([^<]+)");
            string final = string.Empty;
            foreach (var match in r.Matches(theString).Cast<Match>().OrderByDescending(m => m.Index))
            {
                string text = match.Groups[2].Value;
                string newHref = "[" + match.Groups[1].Index + " " + match.Groups[1].Index + "]";
                newHtml.Remove(match.Groups[1].Index, match.Groups[1].Length);
                newHtml.Insert(match.Groups[1].Index, newHref);
            }
            theString = newHtml.ToString();
            break;
        default:
            theString = articleBody;
            break;
    }
    Helpers.ReturnMessage(theString);
    return theString;
}
Currently, it just returns the article as it originally was, with the traditional anchor format: <a href="link">anchor
Can anyone see what I have done wrong?
Regards

If your input is HTML, you should consider using a corresponding parser, HtmlAgilityPack being really helpful.
As for the current code, it looks too verbose. You may use a single Regex.Replace to perform the search and replace in one pass:
public static string GetAndFixAnchor(string articleBody, string articleWikiCheck) {
    if (articleWikiCheck == "id|wpTextbox1")
    {
        return Regex.Replace(articleBody, @"<a\s+href=""([^""]+)"">([^<]+)", "[$1 $2]");
    }
    else
    {
        // Helpers.ReturnMessage(articleBody); // Uncomment if it is necessary
        return articleBody;
    }
}
See the regex demo.
The <a\s+href="([^"]+)">([^<]+) regex matches <a, 1 or more whitespaces, href=", then captures into Group 1 any one or more chars other than ", then matches "> and then captures into Group 2 any one or more chars other than <.
The [$1 $2] replacement replaces the matched text with [, Group 1 contents, space, Group 2 contents and a ].
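As a quick check of what that replacement produces, here is a small sketch with a made-up input. The class and method names are hypothetical, and the second variant (which also consumes the closing tag) is an assumption about what may be wanted, not part of the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // Same pattern and replacement as in the answer above.
    public static string ToWikiLinks(string html) =>
        Regex.Replace(html, @"<a\s+href=""([^""]+)"">([^<]+)", "[$1 $2]");

    // Variant that also matches the closing </a>, so it does not remain in the output.
    public static string ToWikiLinksClosed(string html) =>
        Regex.Replace(html, @"<a\s+href=""([^""]+)"">([^<]+)</a>", "[$1 $2]");
}
```

For the input Read <a href="http://example.com">the docs</a> now., ToWikiLinks returns Read [http://example.com the docs]</a> now. (the closing tag is outside the match and is left in place), while ToWikiLinksClosed returns Read [http://example.com the docs] now.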

Updated (corrected the regex to support whitespace and new lines)
You can try this expression:
Regex r = new Regex(@"<[\s\n]*a[\s\n]*(([^\s]+\s*[ ]*=*[ ]*[\s|\n*]*('|"").*\3)[\s\n]*)*href[ ]*=[ ]*('|"")(?<link>.*)\4[.\n]*>(?<anchor>[\s\S]*?)[\s\n]*<\/[\s\n]*a>");
It will match your anchors even if they are split across multiple lines. It is so long because it allows whitespace between the tags and their values, and .NET regular expressions do not support subroutines, so the [\s\n]* part has to be repeated multiple times.
You can see a working sample at dotnetfiddle
You can use it in your example like this:
public static string GetAndFixAnchor(string articleBody, string articleWikiCheck) {
    if (articleWikiCheck == "id|wpTextbox1")
    {
        return Regex.Replace(articleBody,
            @"<[\s\n]*a[\s\n]*(([^\s]+\s*[ ]*=*[ ]*[\s|\n*]*('|"").*\3)[\s\n]*)*href[ ]*=[ ]*('|"")(?<link>.*)\4[.\n]*>(?<anchor>[\s\S]*?)[\s\n]*<\/[\s\n]*a>",
            "[${link} ${anchor}]");
    }
    else
    {
        return articleBody;
    }
}

Auto quotes around string in c# - build in method?

Is there a built-in method that adds quotes around a string in C#?

Do you mean just adding quotes? Like this?
text = "\"" + text + "\"";
? I don't know of a built-in method to do that, but it would be easy to write one if you wanted to:
public static string SurroundWithDoubleQuotes(this string text)
{
    return SurroundWith(text, "\"");
}

public static string SurroundWith(this string text, string ends)
{
    return ends + text + ends;
}
That way it's a little more general:
text = text.SurroundWithDoubleQuotes();
or
text = text.SurroundWith("'"); // For single quotes
I can't say I've needed to do this often enough to make it worth having a method though...

string quotedString = string.Format("\"{0}\"", originalString);

Yes, using concatenation and escaped characters:
myString = "\"" + myString + "\"";
Maybe an extension method:
public static string Quoted(this string str)
{
    return "\"" + str + "\"";
}
Usage:
var s = "Hello World";
Console.WriteLine(s.Quoted());

No, but you can write your own method or create an extension method:
string AddQuotes(string str)
{
    return string.Format("\"{0}\"", str);
}

Using Escape Characters
Just prefix the special character with a backslash, which is known as an escape character.
Simple Examples
string MyString = "Hello";
Response.Write(MyString);
This would print:
Hello
But:
string MyString = "The man said \"Hello\"";
Response.Write(MyString);
Would print:
The man said "Hello"
Alternative
You can use the @ prefix to create a verbatim string literal, in which backslashes are not treated as escape characters; see this link:
http://www.kowitz.net/archive/2007/03/06/the-c-string-literal
Inside a verbatim string, you write two double quotes to represent one double quote. For example:
string MyString = @"The man said ""Hello"" and went on his way";
Response.Write(MyString);
Outputs:
The man said "Hello" and went on his way

I'm a bit of a C# novice myself, so have at me, but I have this in a catch-all utility class 'cause I miss Perl:
// overloaded quote - if no quote chars spec'd, use ""
public static string quote(string s) {
    return quote(s, "\"\"");
}

// quote a string
// q = two quote chars, like "", '', [], (), {} ...
// or another quoted string (quote-me-like-that)
public static string quote(string s, string q) {
    if(q.Length == 0) // no quote chars, use ""
        q = "\"\"";
    else if(q.Length == 1) // one quote char, double it - your mileage may vary
        q = q + q;
    else if(q.Length > 2) // longer string == quote-me-like-that
        q = q.Substring(0, 1) + q.Substring(q.Length - 1, 1);
    if(s.Length == 0) // nothing to quote, return empty quotes
        return q;
    return q[0] + s + q[1];
}
Use it like this:
quote("this with default");
quote("not recommended to use one char", "/");
quote("in square brackets", "[]");
quote("quote me like that", "{like this?}");
Returns:
"this with default"
/not recommended to use one char/
[in square brackets]
{quote me like that}

In my case I wanted to add quotes only if the string was not already surrounded by quotes, so I did this
(it is slightly different from what I actually did, so it's untested):
public static string SurroundWith(this string text, string ends)
{
    if (!(text.StartsWith(ends) && text.EndsWith(ends)))
    {
        return string.Format("{1}{0}{1}", text, ends);
    }
    else
    {
        return text;
    }
}

There is no built-in method that does what you require.
There is a SplitQuotes method that does something related:
Input - This is a "very long" string
Output - This, is, a, very long, string
When you get a string from a textbox or some other control, it comes with quotes.
If you still want to place quotes, you can use this kind of method:
private string PlaceQuotes(string str, int startPosition, int lastPosition)
{
    string quotedString = string.Empty;
    string replacedString = str.Replace(str.Substring(0, startPosition),str.Substring(0, startPosition).Insert(startPosition, "'")).Substring(0, lastPosition).Insert(lastPosition, "'");
    return String.Concat(replacedString, str.Remove(0, replacedString.Length));
}

Modern C# version below. Using string.Create() we avoid unnecessary allocations:
public static class StringExtensions
{
    public static string Quote(this string s) => Surround(s, '"');

    public static string Surround(this string s, char c)
    {
        return string.Create(s.Length + 2, s, (chars, state) =>
        {
            chars[0] = c;
            state.CopyTo(chars.Slice(1));
            chars[^1] = c;
        });
    }
}

Replacing each letter of the alphabet in a string?

That's what I've written so far:
string omgwut;
omgwut = textBox1.Text;
omgwut = omgwut.Replace(" ", "snd\\space.wav");
omgwut = omgwut.Replace("a", "snd\\a.wav");
Now, the problem is that this code would turn
"snd\space.wav"
into
"snd\spsnd\a.wavce.wsnd\a.wavv"
in line four. Not what I'd want! Now I know I'm not good at C#, so that's why I'm asking.
Solutions would be great! Thanks!

You'll still need to finish the getSoundForChar() function, but this should do what you're asking. I'm not sure, though, that what you're asking will do what you want, i.e., play the sound for the associated character. You might be better off putting them in a List<string> for that.
StringBuilder builder = new StringBuilder();
foreach (char c in textBox1.Text)
{
    string sound = getSoundForChar( c );
    builder.Append( sound );
}
string omgwut = builder.ToString();
Here's a start:
public string getSoundForChar( char c )
{
    string sound = null;
    if (c == ' ')
    {
        sound = "snd\\space.wav";
    }
    // ... handle other special characters with further else-if branches
    else
    {
        sound = string.Format( "snd\\{0}.wav", c );
    }
    return sound;
}

The problem is that you are doing multiple passes of the data. Try just stepping through the characters of the string in a loop and replacing each 'from' character by its 'to' string. That way you're not going back over the string and re-doing those characters already replaced.
Also, create a separate output string or array, instead of modifying the original. Ideally use a StringBuilder, and append the new string (or the original character if not replacing this character) to it.

I do not know of a way to simultaneously replace different characters in C#.
You could loop over all characters and build a result string from that (use a stringbuilder if the input string can be long). For each character, you append its replacement to the result string(builder).
But what are you trying to do? I cannot think of a useful application of appending file paths without any separator.
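The single-pass idea from these answers can be sketched as follows. It returns one file path per input character as a List<string>, so the paths stay separated. The names SoundMapper and ToSoundFiles and the lookup table are hypothetical; only the snd\ naming scheme comes from the question.

```csharp
using System;
using System.Collections.Generic;

public static class SoundMapper
{
    // Hypothetical table for characters whose file name differs from the character itself.
    private static readonly Dictionary<char, string> Special = new Dictionary<char, string>
    {
        [' '] = @"snd\space.wav",
    };

    // Walks the input once; text that was already produced is never scanned again,
    // so the 'a' inside "space.wav" cannot be replaced a second time.
    public static List<string> ToSoundFiles(string text)
    {
        var files = new List<string>();
        foreach (char c in text)
        {
            files.Add(Special.TryGetValue(c, out var path) ? path : $@"snd\{c}.wav");
        }
        return files;
    }
}
```

SoundMapper.ToSoundFiles("a b") yields three entries: snd\a.wav, snd\space.wav and snd\b.wav.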
