Algorithm to encode URL [duplicate] - c#

This question already has answers here:
URL Encoding using C#
(14 answers)
Closed 9 years ago.
Is there some algorithm in C# to encode a URL with symbols so that it displays correctly in a web browser?
Something like Base64.

The Standard (RFC 3986 aka STD 66) lays it out for you. In particular, §2 and 2.1:
2. Characters
The URI syntax provides a method of encoding data, presumably for the
sake of identifying a resource, as a sequence of characters. The URI
characters are, in turn, frequently encoded as octets for transport
or presentation. This specification does not mandate any particular
character encoding for mapping between URI characters and the octets
used to store or transmit those characters. When a URI appears in a
protocol element, the character encoding is defined by that protocol;
without such a definition, a URI is assumed to be in the same
character encoding as the surrounding text.
The ABNF notation defines its terminal values to be non-negative
integers (codepoints) based on the US-ASCII coded character set
[ASCII]. Because a URI is a sequence of characters, we must invert
that relation in order to understand the URI syntax. Therefore, the
integer values used by the ABNF must be mapped back to their
corresponding characters via US-ASCII in order to complete the syntax
rules.
A URI is composed from a limited set of characters consisting of
digits, letters, and a few graphic symbols. A reserved subset of
those characters may be used to delimit syntax components within a
URI while the remaining characters, including both the unreserved set
and those reserved characters not acting as delimiters, define each
component's identifying data.
2.1. Percent-Encoding
A percent-encoding mechanism is used to represent a data octet in a
component when that octet's corresponding character is outside the
allowed set or is being used as a delimiter of, or within, the
component. A percent-encoded octet is encoded as a character
triplet, consisting of the percent character "%" followed by the two
hexadecimal digits representing that octet's numeric value. For
example, "%20" is the percent-encoding for the binary octet
"00100000" (ABNF: %x20), which in US-ASCII corresponds to the space
character (SP). Section 2.4 describes when percent-encoding and
decoding is applied.
pct-encoded = "%" HEXDIG HEXDIG
The uppercase hexadecimal digits 'A' through 'F' are equivalent to
the lowercase digits 'a' through 'f', respectively. If two URIs
differ only in the case of hexadecimal digits used in percent-encoded
octets, they are equivalent. For consistency, URI producers and
normalizers should use uppercase hexadecimal digits for all percent-
encodings.
In general, the only characters that may freely be represented in a URL without being percent-encoded are:
The unreserved characters. These are the US-ASCII (7-bit) characters:
A-Z
a-z
0-9
-._~
The reserved characters, when used within their role in the grammar of a URL and its scheme. These reserved characters are:
:/?#[]@!$&'()*+,;=
Any other character, per the standard, must be properly percent-encoded.
Further note that a URL may only contain characters drawn from the US-ASCII character set (0x00-0x7F): if your URL contains characters outside that range of code points, those characters will need to be suitably encoded for representation in US-ASCII (e.g., via HTML/XML entity references). Further, your application is responsible for interpreting such.
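In C#, you don't need to hand-roll the percent-encoding: on modern .NET, Uri.EscapeDataString applies the RFC 3986 rules above, leaving the unreserved characters alone and percent-encoding everything else (non-ASCII characters are first encoded as UTF-8 octets). A minimal sketch:

using System;

class UrlEncodingDemo
{
    static void Main()
    {
        // Unreserved characters pass through; everything else is %-encoded.
        Console.WriteLine(Uri.EscapeDataString("price = 100% & more"));
        // -> price%20%3D%20100%25%20%26%20more

        // Non-ASCII goes through UTF-8 first, then each octet is %-encoded.
        Console.WriteLine(Uri.EscapeDataString("café")); // -> caf%C3%A9
    }
}

Note that Uri.EscapeDataString is intended for individual URL components (a query value, a path segment), not for whole URLs, precisely because it also escapes the reserved delimiters.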


regex \X is not recognized by .NET 6.0 Regex.Replace, is there an alternative? [duplicate]

According to http://www.regular-expressions.info,
You can consider \X the Unicode version of the dot in regex engines that use plain ASCII.
Does this mean that it will match any possible Unicode code point?
The site's description is pretty good:
\X Matches a single Unicode grapheme, whether encoded as a single code point or multiple code points using combining marks. A grapheme most closely resembles the everyday concept of a "character". \X matches à encoded as U+0061 U+0300, à encoded as U+00E0, ©, etc.
So, the thing that makes it Unicode-aware is that it can match several code points when those combine to a single visible "thing" (grapheme).
See Wikipedia's page on Combining Characters for more detail; it lists the U+0300 code point mentioned above, for instance.
From Perl regex manual:
This matches a Unicode extended grapheme cluster.
\X matches quite
well what normal (non-Unicode-programmer) usage would consider a
single character. As an example, consider a G with some sort of
diacritic mark, such as an arrow. There is no such single character in
Unicode, but one can be composed by using a G followed by a Unicode
"COMBINING UPWARDS ARROW BELOW", and would be displayed by
Unicode-aware software as if it were a single character.
Mnemonic: eXtended Unicode character.
And from PCRE man pages (2012):
PCRE implements a simpler version of \X than Perl, which changed to make \X match what Unicode calls an "extended grapheme cluster".
This is more complicated than an extended Unicode sequence, which is
what PCRE matches.
[...]
\X an extended Unicode sequence
[...]
The \X escape matches any number of Unicode characters that form an extended Unicode sequence. \X is equivalent to
(?>\PM\pM*)
That is, it matches a character without the "mark" property,
followed by zero or more characters with the "mark" property, and
treats the sequence as an atomic group (see below). Characters with
the "mark" property are typically accents that affect the
preceding character. None of them have codepoints less than 256, so in
8-bit non-UTF-8 mode \X matches any one character.
Note that recent versions of Perl have changed \X to match what
Unicode calls an "extended grapheme cluster", which has a more
complicated definition.
A later version of the PCRE man pages (2015):
Extended grapheme clusters
The \X escape matches any number of Unicode characters that form
an "extended grapheme cluster", and treats the sequence as an atomic
group (see below). Up to and including release 8.31, PCRE matched
an earlier, simpler definition that was equivalent to
(?>\PM\pM*)
That is, it matched a character without the "mark" property,
followed by zero or more characters with the "mark" property.
Characters with the "mark" property are typically non-spacing
accents that affect the preceding character.
This simple definition was extended in Unicode to include more
complicated kinds of composite character by giving each character a
grapheme breaking property, and creating rules that use these
properties to define the boundaries of extended grapheme
clusters. In releases of PCRE later than 8.31, \X matches one of
these clusters.
\X always matches at least one character. Then it decides whether
to add additional characters according to the following rules for
ending a cluster:
End at the end of the subject string.
Do not end between CR and LF; otherwise end after any control character.
Do not break Hangul (a Korean script) syllable sequences. Hangul characters are of five types: L, V, T, LV, and LVT. An L
character may be followed by an L, V, LV, or LVT character; an LV or
V character may be followed by a V or T character; an LVT or T
character may be followed only by a T character.
Do not end before extending characters or spacing marks. Characters with the "mark" property always have the "extend"
grapheme breaking property.
Do not end after prepend characters.
Otherwise, end the cluster.
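Back to the .NET question: the .NET regex engine has no \X, but the simpler pre-8.31 PCRE definition quoted above translates directly into .NET syntax, and System.Globalization.StringInfo walks true grapheme clusters (on .NET 5+ it follows the extended grapheme cluster rules). A hedged sketch of both:

using System;
using System.Globalization;
using System.Text.RegularExpressions;

class GraphemeDemo
{
    static void Main()
    {
        string s = "a\u0300bc"; // "à" as base letter + combining grave accent

        // Approximation of \X: one non-mark character plus any number of
        // combining marks, as an atomic group -- the simpler PCRE definition.
        foreach (Match m in Regex.Matches(s, @"(?>\P{M}\p{M}*)"))
            Console.WriteLine(m.Value); // "à", "b", "c"

        // True text elements (grapheme clusters) via StringInfo:
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
            Console.WriteLine(e.GetTextElement());
    }
}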

Why do some vendors map Unicode characters to another character set (code page)?

I'm reading a book which talks about text encoding in .NET:
There are two categories of text encoding in .NET:
• Those that map Unicode characters to another character set
• Those that use standard Unicode encoding schemes
The first category contains legacy encodings such as IBM’s EBCDIC and 8-bit character sets with extended characters in the upper-128 region that were popular prior to Unicode (identified by a code page). In the second category are UTF-8, UTF-16, and UTF-32.
I'm confused about the first category, the code page part. I have read some questions on Stack Overflow, but none of them is the same as the question I'm going to ask. My question is:
Why do some vendors need to map Unicode characters to another character set? From my understanding, Unicode can cover the characters of almost every language in the world, so why reinvent the wheel by mapping Unicode characters to another character set? For example, line feed in Unicode is U+000A; why would you want to map it to some other character? Just stick to the Unicode standard, and you can represent all kinds of characters in binary.
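For what such a mapping looks like in practice, here is a small sketch (assuming .NET Core/.NET 5+, where the legacy code pages come from the System.Text.Encoding.CodePages package) that maps Unicode text onto a legacy code page and back:

using System;
using System.Text;

class CodePageDemo
{
    static void Main()
    {
        // Needed on .NET Core / .NET 5+ to access legacy code pages;
        // on .NET Framework they are available by default.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding windows1252 = Encoding.GetEncoding(1252); // Western European
        byte[] bytes = windows1252.GetBytes("café");       // é -> single byte 0xE9

        Console.WriteLine(BitConverter.ToString(bytes));   // 63-61-66-E9
        Console.WriteLine(windows1252.GetString(bytes));   // café

        // Characters the code page cannot represent are replaced (here with '?').
        Console.WriteLine(windows1252.GetString(windows1252.GetBytes("日")));
    }
}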

How to Determine Unicode Characters from a UTF-16 String?

I have string that contains an odd Unicode space character, but I'm not sure what character that is. I understand that in C# a string in memory is encoded using the UTF-16 format. What is a good way to determine which Unicode characters make up the string?
This question was marked as a possible duplicate to
Determine a string's encoding in C#
It's not a duplicate of this question because I'm not asking about what the encoding is. I already know that a string in C# is encoded as UTF-16. I'm just asking for an easy way to determine what the Unicode values are in the string.
The BMP characters (values 0x0000-0xFFFF) each fit in a single 2-byte UTF-16 code unit, so there's a good bit of coverage there. Characters from the Chinese, Thai, and even Mongolian alphabets are there, so if you're not an encoding expert, you might be forgiven if your code only handles BMP characters. But all the same, a character like the one at http://www.fileformat.info/info/unicode/char/10330/index.htm (U+10330) won't be correctly handled by code that assumes every character fits into two bytes.
Unicode identifies characters as numeric code points. Not all code points actually refer to characters, however, because Unicode has the concept of combining characters (which I don’t know much about). Still, each Unicode string, even some invalid ones (e.g., an illegal sequence of combining characters), can be thought of as a list of code points (numbers).
In the UTF-16 encoding, each code point is encoded as a 2- or 4-byte sequence. In .NET, a Char corresponds to either a single 2-byte UTF-16 code unit or half of a 4-byte sequence. When a Char holds half of a 4-byte sequence, it is considered a “surrogate” because it only has meaning when combined with the other Char of its pair. To get started with inspecting your .NET string, you can have .NET tell you the code points it contains, automatically combining surrogate pairs where necessary. .NET provides Char.ConvertToUtf32, which is described the following way:
Converts the value of a UTF-16 encoded character or surrogate pair at a specified position in a string into a Unicode code point.
The documentation for Char.ConvertToUtf32(String s, Int32 index) states that an ArgumentException is thrown for the following case:
The specified index position contains a surrogate pair, and either the first character in the pair is not a valid high surrogate or the second character in the pair is not a valid low surrogate.
Thus, you can go character by character in a string and find all of the Unicode code points with the help of Char.IsHighSurrogate() and Char.ConvertToUtf32(). When you don’t encounter a high surrogate, the current character fits in one Char and you only need to advance one Char in your string. If you do encounter a high surrogate, the character requires two Char and you need to advance by two:
// Requires using System.Collections.Generic; for IEnumerable<int>.
static IEnumerable<int> GetCodePoints(string s)
{
    // Step by one char for BMP characters, by two for surrogate pairs;
    // ConvertToUtf32 combines a pair into a single code point.
    for (var i = 0; i < s.Length; i += char.IsHighSurrogate(s[i]) ? 2 : 1)
    {
        yield return char.ConvertToUtf32(s, i);
    }
}
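For example, applied to a string that mixes a BMP character with a surrogate pair:

foreach (int cp in GetCodePoints("a\U0001F01C"))
{
    Console.WriteLine("U+{0:X4}", cp); // U+0061, then U+1F01C
}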
When you say “from a UTF-16 String”, that might imply that you have read in a series of bytes formatted as UTF-16. If that is the case, you would need to convert them to a .NET string before passing to the above method (note that the UTF-16 encoding is exposed as Encoding.Unicode; there is no Encoding.UTF16):
GetCodePoints(Encoding.Unicode.GetString(myUtf16Blob));
Another note: depending on how you build your String instance, it is possible that it contains an illegal sequence of Char values with regard to surrogate pairs. For such strings, Char.ConvertToUtf32() will throw an exception when one is encountered. However, I think that Encoding.GetString() will always either return a valid string or throw an exception. So, generally, as long as your String instances come from “good” sources, you needn’t worry about Char.ConvertToUtf32() throwing (unless you pass in arbitrary values for the index offset, because your offset might land in the middle of a surrogate pair).

List of ignorable characters for string comparison

Culture sensitive comparison in C# does not take into account "ignorable characters":
Character sets include ignorable characters. The Compare(String, String) method does not consider such characters when it performs a culture-sensitive comparison. For example, a culture-sensitive comparison of "animal" with "ani-mal" (using a soft hyphen, or U+00AD) indicates that the two strings are equivalent, as the following example shows.
Where can I find complete list of such characters and maybe some details of comparison of strings containing ignorable characters?
All Unicode code points have a "default ignorable" property that is specified by the Unicode Consortium; I would be very surprised if the .NET concept of ignorable characters were in any way different from the value of that property.
The definitive resource on which characters are default-ignorable is the Unicode standard, specifically section 5.21 (link to chapter 5 PDF for Unicode v6.2.0).
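A small sketch of the behavior the .NET documentation describes, using the soft hyphen (U+00AD) mentioned there:

using System;
using System.Globalization;

class IgnorableDemo
{
    static void Main()
    {
        string a = "animal";
        string b = "ani\u00ADmal"; // soft hyphen inserted

        // Culture-sensitive comparison ignores the soft hyphen -> 0 (equal)
        Console.WriteLine(string.Compare(a, b, CultureInfo.InvariantCulture, CompareOptions.None));

        // Ordinal comparison counts every code unit -> non-zero
        Console.WriteLine(string.Compare(a, b, StringComparison.Ordinal));
    }
}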

How do I create a string with a surrogate pair inside of it?

I saw this post on Jon Skeet's blog where he talks about string reversing. I wanted to try the example he showed myself, but it seems to work... which leads me to believe that I have no idea how to create a string that contains a surrogate pair which will actually cause the string reversal to fail. How does one actually go about creating a string with a surrogate pair in it so that I can see the failure myself?
The simplest way is to use \U######## where the U is capital, and the # denote exactly eight hexadecimal digits. If the value exceeds 0000FFFF hexadecimal, a surrogate pair will be needed:
string myString = "In the game of mahjong \U0001F01C denotes the Four of circles";
You can check myString.Length to see that the one Unicode character occupies two .NET Char values. Note that the char type has a couple of static methods that will help you determine if a char is a part of a surrogate pair.
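For example, the \U0001F01C character above occupies two Char values, which those helpers can identify:

string s = "\U0001F01C";                       // one character, two chars
Console.WriteLine(s.Length);                   // 2
Console.WriteLine(char.IsHighSurrogate(s[0])); // True
Console.WriteLine(char.IsLowSurrogate(s[1]));  // True
Console.WriteLine(char.IsSurrogatePair(s, 0)); // True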
If you use a .NET language that does not have something like the \U######## escape sequence, you can use the method ConvertFromUtf32, for example:
string fourCircles = char.ConvertFromUtf32(0x1F01C);
Addition: If your C# source file has an encoding that allows all Unicode characters, like UTF-8, you can just put the character directly in the file (by copy-paste). For example:
string myString = "In the game of mahjong 🀜 denotes the Four of circles";
The character is UTF-8 encoded in the source file (in my example) but will be UTF-16 encoded (surrogate pairs) when the application runs and the string is in memory.
(Not sure if Stack Overflow software handles my mahjong character correctly. Try clicking "edit" to this answer and copy-paste from the text there, if the "funny" character is not here.)
The term "surrogate pair" refers to a means of encoding Unicode characters with high code-points in the UTF-16 encoding scheme (see this page for more information);
In the Unicode character encoding, characters are mapped to values between 0x000000 and 0x10FFFF. Internally, a UTF-16 encoding scheme is used to store strings of Unicode text in which two-byte (16-bit) code sequences are considered. Since two bytes can only contain the range of characters from 0x0000 to 0xFFFF, some additional complexity is used to store values above this range (0x010000 to 0x10FFFF).
This is done using pairs of code points known as surrogates. The surrogate characters are classified in two distinct ranges known as low surrogates and high surrogates, depending on whether they are allowed at the start or the end of the two-code sequence.
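The arithmetic is simple enough to sketch by hand (my own illustration of the UTF-16 rules, cross-checked against char.ConvertFromUtf32):

int codePoint = 0x1F01C;                  // above the BMP
int v = codePoint - 0x10000;              // 20-bit value to split

char high = (char)(0xD800 + (v >> 10));   // top 10 bits    -> 0xD83C
char low  = (char)(0xDC00 + (v & 0x3FF)); // bottom 10 bits -> 0xDC1C

string manual = new string(new[] { high, low });
Console.WriteLine(manual == char.ConvertFromUtf32(codePoint)); // True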
Try this yourself:
// Int32.Parse with NumberStyles requires using System.Globalization;
String surrogate = "abc" + Char.ConvertFromUtf32(Int32.Parse("2A601", NumberStyles.HexNumber)) + "def";
Char[] surrogateArray = surrogate.ToCharArray();
Array.Reverse(surrogateArray); // splits the surrogate pair, producing an invalid string
String surrogateReversed = new String(surrogateArray);
or this, if you want to stick with the blog example (which uses a combining accent rather than a surrogate pair):
String surrogate = "Les Mise" + Char.ConvertFromUtf32(Int32.Parse("0301", NumberStyles.HexNumber)) + "rables";
Char[] surrogateArray = surrogate.ToCharArray();
Array.Reverse(surrogateArray); // the combining accent now attaches to the wrong letter
String surrogateReversed = new String(surrogateArray);
and then check the string values with the debugger. Jon Skeet is damn right... strings and dates seem easy, but they are absolutely NOT.
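If you want a reversal that survives both cases, one option (a sketch, not a tuned implementation) is to reverse by text element rather than by Char, so surrogate pairs and combining sequences stay together:

using System;
using System.Collections.Generic;
using System.Globalization;

class SafeReverse
{
    // Reverse grapheme by grapheme instead of char by char.
    static string ReverseTextElements(string s)
    {
        var elements = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
            elements.Add(e.GetTextElement());
        elements.Reverse();
        return string.Concat(elements);
    }

    static void Main()
    {
        string s = "Les Mise\u0301rables";
        Console.WriteLine(ReverseTextElements(s)); // "selbarésiM seL": the accent stays on its 'e'
    }
}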
