HTML ASCII Code for £ - c#

In my XML document, I have to pass a query string containing the "£" character and retrieve the job count results.
When the query string contains "£", the job postings are not counted correctly and the result is always 0.
Please let me know what I need to put in the XML document in place of "£".
Part of Query string with £:
( [salary]: \\\"less than £10,000 \\\" )

I had a look here: http://htmlarrows.com/currency/
Try one of these:
&pound;
U+000A3
&#163;
&#xA3;

In a query string in a URI, non-ASCII characters should be escaped using the %NN convention. The Unicode codepoint for a pound sign (£) -- not to be confused with # which some Americans refer to as a pound sign -- is 163. The UTF-8 encoding of that is the two byte sequence C2-A3, so in a URI you should write %C2%A3.
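In C#, the percent-encoding described above does not have to be done by hand. A minimal sketch, assuming the query value is built in code before it is written into the XML document (the class and method names are made up for illustration):

```csharp
using System;

static class QueryEscaping
{
    // Percent-encodes a single value for use inside a URI query string.
    // Uri.EscapeDataString encodes non-ASCII characters as their UTF-8 bytes,
    // so the pound sign becomes %C2%A3.
    public static string EscapeValue(string value)
    {
        return Uri.EscapeDataString(value);
    }
}
```

With this, QueryEscaping.EscapeValue("£") gives %C2%A3. If the resulting URI is then embedded in XML, any & separators between query parameters still need to be written as &amp;.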

Related

22021: invalid byte sequence for encoding "UTF8": 0x00

I am doing a bulk import to PostgreSQL from C# and one of the records gives me this error:
22021: invalid byte sequence for encoding "UTF8": 0x00
I googled it and the general advice is that this refers to a null field but in my instance this is not the case. I tracked down the string that causes the error and it is this:
Addresses the following: Let $A$ be a Banach algebra, and let $\sum:\0\rightarrow I\rightarrow\mathfrak A\overset\pi\to\longrightarrow A\rightarrow 0$ be an extension of $A$, where $\mathfrak A$ is a Banach algebra and $I$ is a closed ideal in $\mathfrak A$.
I am reading this from an XML file and have UTF-8 defined on the file stream.
The escaped string on my deserialized C# class is:
"Addresses the following: Let $A$ be a Banach algebra, and let $\\sum\\:\\0\\rightarrow I\\rightarrow\\mathfrak A\\overset\\pi\\to\\longrightarrow A\\rightarrow 0$ be an extension of $A$, where $\\mathfrak A$ is a Banach algebra and $I$ is a closed ideal in $\\mathfrak A$."
Obviously something is not right with the string. I am guessing some sort of mathematical symbols should be there, but what exactly about this is breaking the import and making PostgreSQL report that it is a null field? What format should that be read in?
If I manually overwrite this field the import works, so it is 100% an issue with this string.
Since it's a bulk import, I'm assuming you're creating a file or some kind of big string to send to Postgres? In that case the strings probably have escape characters enabled, as opposed to executing this via, say, a prepared statement. So it's probably that \0 in your string that Postgres is escaping and interpreting as a 0x00.
From the docs: https://www.postgresql.org/docs/8.3/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS
PostgreSQL also accepts "escape" string constants, which are an extension to the SQL standard. An escape string constant is specified by writing the letter E (upper or lower case) just before the opening single quote, e.g. E'foo'. (When continuing an escape string constant across lines, write E only before the first opening quote.) Within an escape string, a backslash character (\) begins a C-like backslash escape sequence, in which the combination of backslash and following character(s) represents a special byte value. \b is a backspace, \f is a form feed, \n is a newline, \r is a carriage return, \t is a tab. Also supported are \digits, where digits represents an octal byte value, and \xhexdigits, where hexdigits represents a hexadecimal byte value. (It is your responsibility that the byte sequences you create are valid characters in the server character set encoding.) Any other character following a backslash is taken literally. Thus, to include a backslash character, write two backslashes (\\). Also, a single quote can be included in an escape string by writing \', in addition to the normal way of ''.
So if your bulk statement is prepending strings with E, like E'hello', don't do that.
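If the import goes through a format where backslashes have to stay escape characters (for example the text format of COPY, which is only an assumption here since the question does not say how the bulk data is sent), the other option is to double the backslashes in each value so that \0 arrives as two ordinary characters. A rough sketch with a hypothetical helper name:

```csharp
using System.Text;

static class CopyTextEscaper
{
    // Escapes one field value for a backslash-escaped text format:
    // backslash, tab, newline and carriage return are written as \\, \t, \n, \r.
    public static string Escape(string value)
    {
        var sb = new StringBuilder(value.Length + 8);
        foreach (char c in value)
        {
            switch (c)
            {
                case '\\': sb.Append(@"\\"); break;
                case '\t': sb.Append(@"\t"); break;
                case '\n': sb.Append(@"\n"); break;
                case '\r': sb.Append(@"\r"); break;
                default: sb.Append(c); break;
            }
        }
        return sb.ToString();
    }
}
```

After escaping, the LaTeX fragment \sum\:\0\rightarrow is sent as \\sum\\:\\0\\rightarrow, so the server no longer sees a \0 escape. Sending the values as parameters of a prepared statement avoids the problem entirely.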

How do I replace all special characters with their respective hex codes?

I have an XML file and it contains multiple special characters.
I want to replace all the special characters with their respective hex codes.
So & becomes &#x0026 and so on. But only special characters.
Please help.
You can use HttpUtility.HtmlDecode to decode special characters. More in the official documentation: https://learn.microsoft.com/en-us/dotnet/api/system.web.httputility.htmldecode
But you cannot use this method on the whole XML string, because < and > will be replaced. So you need to apply it only to the text nodes and attribute values.
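Note that HtmlDecode goes from entities back to characters, and HtmlEncode produces named or decimal references rather than hex ones. If hex codes in the &#x0026 style are really required, one option is a small hand-written encoder applied to each text node or attribute value. This is only a sketch: the class name is made up, and the choice of what counts as "special" (non-ASCII plus the five XML markup characters) is an assumption.

```csharp
using System.Text;

static class HexEntityEncoder
{
    // Replaces non-ASCII characters and the XML markup characters & < > " '
    // with hexadecimal character references such as &#x0026;.
    // Throws ArgumentException if the text contains an unpaired surrogate.
    public static string ToHexEntities(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            int codePoint = char.ConvertToUtf32(text, i);
            if (char.IsSurrogatePair(text, i))
            {
                i++; // the code point used two UTF-16 units
            }

            bool special = codePoint > 127 || codePoint == '&' || codePoint == '<'
                || codePoint == '>' || codePoint == '"' || codePoint == '\'';

            if (special)
            {
                sb.Append("&#x").Append(codePoint.ToString("X4")).Append(';');
            }
            else
            {
                sb.Append((char)codePoint);
            }
        }
        return sb.ToString();
    }
}
```

For example, HexEntityEncoder.ToHexEntities("a & £") gives a &#x0026; &#x00A3;. Run it on node values only, never on the serialized document, or the markup itself will be encoded.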

WriteAllText, Character Encoding, £ and?

Take the following example:
string testfile1 = Path.Combine(HttpRuntime.AppDomainAppPath, "folder\\" + "test1.txt");
if (!System.IO.File.Exists(testfile1))
{
    System.IO.File.WriteAllText(testfile1, "£100", System.Text.Encoding.ASCII);
}
string testfile2 = Path.Combine(HttpRuntime.AppDomainAppPath, "folder\\" + "test2.txt");
if (!System.IO.File.Exists(testfile2))
{
    System.IO.File.WriteAllText(testfile2, "£100", System.Text.Encoding.UTF8);
}
Note the encoding. The first outputs ?100. The second outputs £100.
I know the encoding is different, but can somebody explain why ASCII encoding can't write the £ sign?
ASCII doesn't include the "£" character. That is - there is no byte value (nor a multiple byte value - they don't exist in ASCII) that denotes that symbol. So it shows you a "?" to tell you that. UTF8, on the other hand, does include it.
See here a list of all of the printable characters in ASCII.
If you must use ASCII, consider using "GBP" as mentioned here for Pound sterling. (Also might be relevant: Extended ASCII.)
How ASCII handles certain characters depended largely on the code page in use. £ isn't a character that is required or used universally within the Latin alphabet, so it didn't appear in the standard ASCII set.
Look at this article or this one on code pages to see how the character limitation was resolved and for an idea as to why it won't show up everywhere.
As Hans pointed out, ASCII was designed for Americans and uses only code points 0-127; the negligible rest of the English-speaking world can live with that unless they try to use obscure symbols like £, whose code points lie outside the range 0-127. I presume you live in the UK and aim only at customers from the UK or Western Europe. Don't use Encoding.ASCII but Encoding.Default, which would be code page 1252 in the UK (not in Turkey, of course). You get real ASCII for every character in the ASCII range 0-127 but can also use characters in the range 128-255, where the pound symbol lives. But note: if someone tries to read the file assuming it is encoded in UTF8, the £ sign will be garbled, since its single byte (0xA3) is not a valid UTF8 sequence on its own. This is indicated by some weird glyph like �.
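The difference between the encodings is easiest to see by looking at the bytes each one produces for £. A small sketch (the helper name is made up; ISO-8859-1 is used here as a stand-in for a single-byte Western European code page):

```csharp
using System;
using System.Text;

static class PoundBytes
{
    // Returns the bytes that `encoding` produces for `text`, formatted as hex.
    public static string Hex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }
}
```

PoundBytes.Hex(Encoding.ASCII, "£") gives 3F, which is the question mark substituted for a character ASCII cannot represent. Encoding.UTF8 gives C2-A3, and Encoding.GetEncoding("ISO-8859-1") gives the single byte A3.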

Detect Special Characters in a text in C#

In my program, I'm going to process some strings. These strings can be in any language (e.g. Japanese, Portuguese, Mandarin, English, etc.).
Sometimes these strings may contain HTML special characters such as the trademark symbol (™), registered symbol (®), copyright symbol (©), etc.
Then I am going to generate an Excel sheet with these details. But when there is a special character, the Excel file is created but cannot be opened because it appears to be corrupted.
So I encoded each string before writing it to Excel. But then all the strings except the English ones were encoded. The picture shows that the asset description, which is Japanese text, was also converted into encoded text. But I wanted to encode special characters only.
゜祌づ りゅ氧廩, 駤びょ菣 鏥こ埣槎で is converted to its HTML-encoded form, but I only wanted to encode the special characters.
So what I need is to identify whether the string contains that kind of special character. Since I am dealing with multiple languages, is there any way to identify whether a string contains HTML special characters?
Try this using the Regex.IsMatch Method:
string str = "*!#©™®";
var regx = new Regex("[^a-zA-Z0-9_.]");
if (regx.IsMatch(str))
{
    Console.WriteLine("Special character(s) detected.");
}
Try the Regex.Replace method:
// Replace letters and numbers with nothing then check if there are any characters left.
// The only characters will be something like $, #, or ^.
//
// [\p{L}\p{Nd}]+ checks for words/numbers in any language.
if (!string.IsNullOrWhiteSpace(Regex.Replace(input, @"([\p{L}\p{Nd}]+)", "")))
{
    // Do whatever with the string.
}
I suppose that you could start by treating your string as a Char array
https://msdn.microsoft.com/en-us/library/system.char(v=vs.110).aspx
Then you can examine each character in turn. Indeed, on a second read of that documentation page, why not use this:
string s = "Sometime these strings may contain some HTML special characters like trademark symbol(™), registered symbol(®), Copyright symbol(©) and etc.゜祌づ りゅ氧廩, 駤びょ菣 鏥こ埣槎で";
Char[] ca = s.ToCharArray();
foreach (Char c in ca)
{
    if (Char.IsSymbol(c))
        Console.WriteLine("found symbol:{0} ", c);
}
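Building on the per-character idea, the same check can be used to encode only the symbols and leave letters from any language untouched. This is a sketch, not a tested fix for the Excel problem: the class name is made up, and it assumes that "special" means Unicode category OtherSymbol (which covers ™, ® and ©). That category is narrower than Char.IsSymbol, which also reports characters such as ゜.

```csharp
using System.Globalization;
using System.Text;

static class SymbolEncoder
{
    // Replaces characters in the Unicode "Symbol, other" category with decimal
    // character references and copies everything else through unchanged.
    // Characters outside the BMP (surrogate pairs) are left as they are.
    public static string EncodeSymbolsOnly(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.OtherSymbol)
            {
                sb.Append("&#").Append((int)c).Append(';');
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

For example, SymbolEncoder.EncodeSymbolsOnly("Acme™ 日本") gives Acme&#8482; 日本, with the Japanese text passed through as is.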

Escape string from file

I have to parse some files that contain some string that has characters in them that I need to escape. To make a short example you can imagine something like this:
var stringFromFile = "This is \\n a test \\u0085";
Console.WriteLine(stringFromFile);
The above results in the output:
This is \n a test \u0085
but I want the escape sequences in the text interpreted (that is, the text unescaped). How do I do this in C#? The text contains unicode characters too.
To make clear; The above code is just an example. The text contains the \n and unicode \u00xx characters from the file.
Example of the file contents:
Fisika (vanaf Grieks, \u03C6\u03C5\u03C3\u03B9\u03BA\u03CC\u03C2,
\"Natuurlik\", en \u03C6\u03CD\u03C3\u03B9\u03C2, \"Natuur\") is die
wetenskap van die Natuur
Try Regex.Unescape(string).
It should be the right way.
Don't use the @ symbol -- this interprets the string as 100% literal. Just take it off and all shall be well.
EDIT
I may have been a bit hasty with my reply. I think what you're asking is: how can I have C# turn the literal string '\n' into a newline, when read from a file (similar question for other escaped literals).
The answer is: you write it yourself. You need to search for "\\n" and convert it to "\n". Keep in mind that in C#, it's the compiler not the language that changes your strings into actual literals, so there's not some library call to do this (actually there could be -- someone look this up, quick).
EDIT
Aha! Eureka! Behold:
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.unescape.aspx
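For text that has already been read from the file, Regex.Unescape can be called directly on the string. A minimal sketch (the wrapper class is hypothetical and only there to keep the example self-contained):

```csharp
using System.Text.RegularExpressions;

static class FileTextUnescaper
{
    // Converts escape sequences that appear literally in the text
    // (such as \n or \u03C6) into the characters they stand for.
    public static string Unescape(string raw)
    {
        return Regex.Unescape(raw);
    }
}
```

Given the two characters \ and n, this returns a real newline, and \u03C6 becomes φ. Keep in mind that Regex.Unescape follows regular expression escape rules rather than C# string literal rules, so it is worth checking it against the sequences that actually occur in the files.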
Since you are reading the string from a file, \n is not read as a unicode character but rather as two characters \ and n.
I would say you probably need a search and replace function to convert the string "\n" to its unicode character '\n' and so on.
I don't think there's any easy way to do this. Because it's the job of lexical analyzer to parse literals.
I would try generating and compiling a class via CodeDOM with the string inserted there as constant. It's not very fast but it will do all escaping.
