Let's assume that we have an encrypted byte stream and a suspected decryption key. I want to decrypt the message with the key and validate the result.
How can I validate the result?
The only known thing about the plain text is that it should contain a human-language paragraph (one or more). We cannot assume anything more about this text.
I want to develop or use an algorithm that will test the output of the decryption and give me a prediction of whether the decryption was successful or not.
The algorithm must work with all human languages (it won't be specific to one language).
Is this possible? What do you think?
Step 0
Decrypt the cipher-text (encrypted) byte array to obtain a plain-text (decrypted) byte array.
If authenticated encryption is used, then decrypting with a wrong key will fail outright.
If proper padding (PKCS#7/PKCS#5) is used, then decrypting with a wrong key will fail with very high probability, because the padding will not decrypt properly.
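For example, a minimal C# sketch of this step, assuming AES-CBC with PKCS#7 padding (the cipherText/key/iv names and the TryDecrypt helper are placeholders, not an established API):
using System;
using System.Security.Cryptography;

static byte[]? TryDecrypt(byte[] cipherText, byte[] key, byte[] iv)
{
    using var aes = Aes.Create();
    aes.Key = key;
    aes.IV = iv;
    aes.Padding = PaddingMode.PKCS7;
    using var decryptor = aes.CreateDecryptor();
    try
    {
        return decryptor.TransformFinalBlock(cipherText, 0, cipherText.Length);
    }
    catch (CryptographicException)
    {
        // A wrong key almost always produces garbage that fails the padding check.
        return null;
    }
}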
Step 1
Decode the byte array into a char array using the proper character encoding and DecoderExceptionFallback (CodingErrorAction.REPORT in Java).
If the decrypted byte array contains a sequence of bytes that does not represent a valid character, then decoding will fail. Assuming that the initial data is proper text in the same encoding, the decrypted byte array will contain invalid byte sequences only if the wrong key was used.
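In C#, a minimal sketch of this step might look like the following (assuming UTF-8 plain text; the TryDecode helper name is mine):
using System.Text;

static string? TryDecode(byte[] plainBytes)
{
    // DecoderFallback.ExceptionFallback makes GetString throw instead of
    // silently substituting replacement characters for invalid bytes.
    var strict = Encoding.GetEncoding("utf-8",
        EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
    try
    {
        return strict.GetString(plainBytes);
    }
    catch (DecoderFallbackException)
    {
        // Invalid byte sequence: almost certainly a wrong key.
        return null;
    }
}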
Step 2
Actually, the first two steps will expose a wrong key with very high probability.
Now, in the unlikely situation where a wrong key is used, decryption miraculously results in properly padded data, and the decoded data contains only valid byte sequences for the selected character encoding, you have textual data and can use two simple (but still empirical) ideas that do not require dictionaries or online access:
In most sane natural languages the words are separated by white-space.
In most sane natural languages the words are made of letters.
The Unicode General Category property is very helpful in determining the type of a character without being specific to a single language, and most regex implementations allow you to specify patterns in terms of Unicode categories.
First, split the text by Separator and Punctuation Unicode categories. The result is a list of "words" devoid of white-space and punctuation.
Second, match each word against a Letter+ pattern. The ratio of words that match to words that do not is high for any natural text. It can be high for specifically constructed text-like gibberish too, but it will certainly be low for a random sequence of characters.
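Here is what those two steps might look like in C# — a minimal sketch; the 0.8 acceptance ratio is an arbitrary assumption you would want to tune on sample data:
using System.Linq;
using System.Text.RegularExpressions;

static bool LooksLikeNaturalText(string text)
{
    // Split on Separator (\p{Z}) and Punctuation (\p{P}) categories,
    // plus \s to catch control characters like newlines.
    string[] words = Regex.Split(text, @"[\p{Z}\p{P}\s]+")
        .Where(w => w.Length > 0).ToArray();
    if (words.Length == 0) return false;

    // Count words made purely of Letter-category characters.
    int letterWords = words.Count(w => Regex.IsMatch(w, @"^\p{L}+$"));
    return (double)letterWords / words.Length > 0.8;
}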
You can analyse the text and then calculate the letter frequencies. If the letter frequency is way off the chart, you could say that the decryption has gone wrong. And if you mix this with the occurrences of spaces, you have a reasonably solid way of saying whether the decryption was successful.
Wikipedia on Letter frequency
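A rough, language-agnostic sketch of this idea in C# (the thresholds are assumptions to calibrate on sample data, not established constants):
using System.Linq;

static bool FrequenciesLookNatural(string text)
{
    if (text.Length == 0) return false;

    // In natural text, letters dominate and separators occur regularly;
    // in random character data, neither holds.
    double letters = text.Count(char.IsLetter) / (double)text.Length;
    double spaces = text.Count(char.IsWhiteSpace) / (double)text.Length;
    return letters > 0.6 && spaces > 0.05 && spaces < 0.35;
}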
There is no way to tell whether a bytestream contains a message in a human language without any more assumptions. First and foremost, you absolutely need to know the encoding (or a few possible encodings).
Then I am 99.9% sure that there is no generic way to infer whether a large group of (e.g.) ASCII chars has meaning in any human language without the use of some kind of dictionary. If you could narrow it down to a language family, maybe you would be able to detect grammatical constructs - but I am really just speculating. Even if it is possible, designing the heuristics will not be a trivial task.
That said, I can only second the suggestions in the comments: use Wikipedia! Create your own dictionary from it or use it online - either way, I believe it's your best bet.
Related
I am writing a text-processing Windows app in C#. The app processes many plain text files to count characters, words, etc. To do this, the app iterates over the characters in each file. I am finding that some text files represent accented letters such as á by using the Unicode character U+00E1 (small letter A with acute) while others use a simple unaccented a (U+0061, small letter A) followed by a U+0301 (combining acute accent). There's no visual difference in how the text is rendered on screen by Notepad or other editors I've used, but the underlying character stream is obviously different.
I would like to detect and treat these two situations in the same way. In other words, I would like my app to combine a letter followed by a combining codepoint into the equivalent self-contained character. For example, I'd like to combine the sequence U+0061 U+0301 into U+00E1. As far as I know, there is no simple algorithm to do this, other than a large and error-prone lookup table for all the possible combinations of plain letters and combining characters.
Is there a simpler and more direct algorithm to perform this combination?
You're referring to Unicode normalization forms. That page goes into some interesting detail, but the gist is that representing e.g. accented letters as a single codepoint (e.g. á as U+00E1) is Normalization Form C, or NFC, and as separate codepoints (e.g. á as U+0061 U+0301) is NFD.
Section 3.11 of the Unicode specification goes into the gory details of how to implement it, with some extra details here.
Luckily, you don't need to implement this yourself: string.Normalize() already exists.
"\u00E1".Normalize(NormalizationForm.FormD); // \u0061\u0301
"\u0061\u0301".Normalize(NormalizationForm.FormC); // \u00E1
That said, we've only just scratched the surface of what a "character" is. A good illustration of this uses emoji, but it applies to various scripts as well: there are modern scripts where normal characters are composed of two codepoints, and no single combined codepoint is available. This pops up in e.g. Tamil and Thai, as well as some eastern European languages (IIRC).
My favourite example is 👩🏽🚒, or "Woman Firefighter: Medium Skin Tone". Want to guess how that's encoded? That's right, 4 different code points: U+1F469 U+1F3FD U+200D U+1F692.
U+1F469 is 👩, the Woman emoji.
U+1F3FD is "Emoji Modifier Fitzpatrick Type-4", which modifies the previous emoji to give a brown skin tone 👩🏽, rendered as 🏽 when it appears on its own.
U+200D is a "zero-width joiner", which is used to glue codepoints together into the same character.
U+1F692 is 🚒, the Fire Engine emoji.
So you take a woman, add a brown skin tone, glue her to a fire engine, and you get a woman brown-skinned firefighter.
(Just for fun, try pasting 👩🏽🚒 into various editors and then using backspace on it. If it's rendered properly, some editors turn it into 👩🏽 and then 👩 and then delete it, while others skip various parts. Either way, selecting it treats it as a single character. This mirrors how editing complex characters works in some scripts.)
(Another fun nugget are the flag emoji. Unicode defines "Regional Indicator Symbol Letters A-Z" (U+1F1E6 through U+1F1FF), and a flag is encoded as the country's ISO 3166-1 alpha-2 letter country code using these indicator symbols. So 🇺🇸 is 🇺 followed by 🇸. Paste the 🇸 after the 🇺 and a flag appears!)
Of course, if you're iterating over this codepoint-by-codepoint you're going to visit U+1F469 U+1F3FD U+200D U+1F692 individually, which probably isn't what you want.
If you're iterating this char-by-char you're going to do even worse due to surrogate pairs: those codepoints such as U+1F469 are simply too large to represent using a single 16-bit char, so we need to use two of them. This means that if you try to iterate over U+1F469, you'll actually find you've got two chars: 0xD83D (the high surrogate) and 0xDC69 (the low surrogate).
Instead, we need to introduce extended grapheme clusters, which represent what you'd traditionally think of as a single character. Again there's a bunch of complexity if you want to do this yourself, and again someone's helpfully done it for you: StringInfo.GetTextElementEnumerator. Note that this was a bit buggy pre-.NET 5, and didn't properly handle all EGCs.
In .NET 5, however:
// Number of chars, as 3 of the codepoints need to use surrogate pairs when
// encoded with UTF-16
"👩🏽🚒".Length; // 7
// Number of Unicode codepoints
"👩🏽🚒".EnumerateRunes().Count(); // 4
// Number of extended grapheme clusters
GetTextElements("👩🏽🚒").Count(); // 1
public static IEnumerable<string> GetTextElements(string s)
{
    TextElementEnumerator charEnum = StringInfo.GetTextElementEnumerator(s);
    while (charEnum.MoveNext())
    {
        yield return charEnum.GetTextElement();
    }
}
I've used emoji as an understandable example here, but these issues also crop up in modern scripts, and people working with text need to be aware of them.
I am trying to figure out the best way to create a function that is equivalent to String.Replace("oldValue","newValue");
that can handle surrogate pairs.
My concern is that, if there are surrogate pairs in the string and the search string happens to match part of a surrogate pair, the replacement could split the pair and corrupt the data.
So my high level question is: Is String.Replace(string oldValue, string newValue); a safe operation when it comes to Unicode and surrogate pairs?
If not, what would be the best path forward? I am familiar with the StringInfo class that can split these strings into elements and such. I'm just unsure of how to go about the replace when passing in strings for the old and new values.
Thanks for the help!
It's safe, because strings in .NET are internally UTF-16. A Unicode code point can be represented by one or two UTF-16 code units, and a .NET char is one such code unit.
When a code point is represented by two units, the first unit is called a high surrogate and the second a low surrogate. What's important in the context of this question is that surrogate units belong to a specific range, U+D800 - U+DFFF. This range is used only to represent surrogate pairs; a single unit in this range has no meaning on its own and is invalid.
For that reason, it's not possible for a valid UTF-16 string to match "part" of a surrogate pair in another valid UTF-16 string.
Note that a .NET string can also represent an invalid UTF-16 sequence. If any argument to Replace is invalid, then it can indeed split a surrogate pair. But garbage in, garbage out, so I don't consider this a problem in the given case.
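A quick sketch illustrating both halves of this (the string literals are just examples):
using System;

string s = "\uD83D\uDC69"; // 👩 U+1F469, encoded as a surrogate pair

// Any *valid* UTF-16 search string can never match half of the pair,
// so ordinary replacements leave it intact:
Console.WriteLine(s.Replace("a", "b") == s); // True

// But an *invalid* search string (a lone high surrogate) does split it,
// leaving a meaningless lone low surrogate behind: garbage in, garbage out.
string broken = s.Replace("\uD83D", "");
Console.WriteLine(broken.Length); // 1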
I have string that contains an odd Unicode space character, but I'm not sure what character that is. I understand that in C# a string in memory is encoded using the UTF-16 format. What is a good way to determine which Unicode characters make up the string?
This question was marked as a possible duplicate to
Determine a string's encoding in C#
It's not a duplicate of this question because I'm not asking about what the encoding is. I already know that a string in C# is encoded as UTF-16. I'm just asking for an easy way to determine what the Unicode values are in the string.
The BMP characters each fit in 2 bytes (values 0x0000-0xFFFF), so there's a good bit of coverage there. Characters from the Chinese, Thai, even Mongolian alphabets are there, so if you're not an encoding expert, you might be forgiven if your code only handles BMP characters. But all the same, characters like the one presented here http://www.fileformat.info/info/unicode/char/10330/index.htm won't be correctly handled by code that assumes every character fits into two bytes.
Unicode seems to identify characters as numeric code points. Not all code points actually refer to characters, however, because Unicode has the concept of combining characters (which I don’t know much about). However, each Unicode string, even some invalid ones (e.g., illegal sequence of combining characters), can be thought of as a list of code points (numbers).
In the UTF-16 encoding, each code point is encoded as a 2- or 4-byte sequence. In .net, a Char roughly corresponds to either a 2-byte UTF-16 sequence or half of a 4-byte UTF-16 sequence. When a Char contains half of a 4-byte sequence, it is considered a "surrogate" because it only has meaning when combined with the other Char it must be kept with. To get started with inspecting your .net string, you can get .net to tell you the code points contained in the string, automatically combining surrogate pairs together if necessary. .net provides Char.ConvertToUtf32, which is described as follows:
Converts the value of a UTF-16 encoded character or surrogate pair at a specified position in a string into a Unicode code point.
The documentation for Char.ConvertToUtf32(String s, Int32 index) states that an ArgumentException is thrown for the following case:
The specified index position contains a surrogate pair, and either the first character in the pair is not a valid high surrogate or the second character in the pair is not a valid low surrogate.
Thus, you can go character by character in a string and find all of the Unicode code points with the help of Char.IsHighSurrogate() and Char.ConvertToUtf32(). When you don’t encounter a high surrogate, the current character fits in one Char and you only need to advance one Char in your string. If you do encounter a high surrogate, the character requires two Char and you need to advance by two:
static IEnumerable<int> GetCodePoints(string s)
{
    for (var i = 0; i < s.Length; i += char.IsHighSurrogate(s[i]) ? 2 : 1)
    {
        yield return char.ConvertToUtf32(s, i);
    }
}
When you say “from a UTF-16 String”, that might imply that you have read in a series of bytes formatted as UTF-16. If that is the case, you would need to convert that to a .net string before passing to the above method:
GetCodePoints(Encoding.Unicode.GetString(myUtf16Blob));
Another note: depending on how you build your String instance, it is possible that it contains an illegal sequence of Char with regard to surrogate pairs. For such strings, Char.ConvertToUtf32() will throw an exception when one is encountered. However, I think that Encoding.GetString() will always either return a valid string or throw an exception. So, generally, as long as your String instances come from "good" sources, you needn't worry about Char.ConvertToUtf32() throwing (unless you pass in random values for the index offset and your offset lands in the middle of a surrogate pair).
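A hypothetical usage example of the GetCodePoints helper above — the sample string hides a no-break space (U+00A0), the kind of "odd" space character the question asks about:
using System;

foreach (int codePoint in GetCodePoints("a\u00A0b"))
{
    Console.WriteLine($"U+{codePoint:X4}");
}
// Output:
// U+0061
// U+00A0
// U+0062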
I'm trying to find a way to count special characters that are formed by more than one char, but I have found no solution online!
For e.g. I want to count the string "வாழைப்பழம". It actually consists of 6 Tamil characters, but it is 9 characters when we use the normal way to find the length. I am wondering whether Tamil is the only script that causes this problem, and whether there is a solution to this. I'm currently trying to find a solution in C#.
Thank you in advance =)
Use StringInfo.LengthInTextElements:
var text = "வாழைப்பழம";
Console.WriteLine(text.Length); // 9
Console.WriteLine(new StringInfo(text).LengthInTextElements); // 6
The explanation for this behaviour can be found in the documentation of String.Length:
The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.
A minor nitpick: strings in .NET use UTF-16, not UTF-8
When you're talking about the length of a string, there are several different things you could mean:
Length in bytes. This is the old C way of looking at things, usually.
Length in Unicode code points. This gets you closer to modern times and should be how string lengths are treated, except it isn't.
Length in UTF-8/UTF-16 code units. This is the most common interpretation, deriving from 1. Certain characters take more than one code unit in those encodings which complicates things if you don't expect it.
Count of visible “characters” (graphemes). This is usually what people mean when they say characters or length of a string.
In your case your confusion stems from the difference between 4. and 3. 3. is what C# uses, 4. is what you expect. Complex scripts such as Tamil use ligatures and diacritics. Ligatures are contractions of two or more adjacent characters into a single glyph – in your case ழை is a ligature of ழ and ை – the latter of which changes the appearance of the former; வா is also such a ligature. Diacritics are ornaments around a letter, e.g. the accent in à or the dot above ப்.
The two cases I mentioned both result in a single grapheme (what you perceive as a single character), yet each needs two actual chars. So you end up with three more code points in the string than graphemes.
One thing to note: For your case the distinction between 2. and 3. is irrelevant, but generally you should keep it in mind.
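To make the four interpretations concrete, here is a small sketch using the string from the question (EnumerateRunes needs .NET Core 3.0 or later; the expected values follow from the discussion above):
using System;
using System.Globalization;
using System.Linq;
using System.Text;

var text = "வாழைப்பழம";
Console.WriteLine(Encoding.UTF8.GetByteCount(text));          // 27 — 1. bytes (in UTF-8)
Console.WriteLine(text.EnumerateRunes().Count());             // 9  — 2. code points
Console.WriteLine(text.Length);                               // 9  — 3. UTF-16 code units
Console.WriteLine(new StringInfo(text).LengthInTextElements); // 6  — 4. graphemes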
I use some code to encrypt & decrypt strings in C#, but I want a good one that can generate an encrypted string containing only letters or numbers, not any other characters (+, /, ...).
Is there a good one for that?
You could use any encryption algorithm, then encode the result. Once you have binary data, you can push it out to any textual format. The result of an encryption algorithm is going to be a series of bytes, anyhow, so any textual representation is simply an encoding.
Hexadecimal would be fairly large, depending on your encrypted data. Base64 would almost encode it the way you want, except for the / and + symbols. Base32 would probably be the way to go, because it is A-Z, 2-7 and = for padding.
If you want to custom tailor your own encoding scheme, that is also an option, and would be very easy to implement. For example, you could take Base32, and replace the padding with 8, then you'd have just A-Z, 2-8.
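For illustration, a sketch of the hexadecimal route, which trivially meets the letters-and-digits-only requirement (the byte array is a stand-in for real ciphertext):
using System;

byte[] encrypted = { 0x4D, 0x7A, 0xFF, 0x01 }; // stand-in for your encryption output

// Hex uses only 0-9 and A-F, at the cost of two characters per byte.
string hex = Convert.ToHexString(encrypted);                       // "4D7AFF01" (.NET 5+)
string hexOld = BitConverter.ToString(encrypted).Replace("-", ""); // older frameworks

// Round-trip back to bytes before decrypting:
byte[] again = Convert.FromHexString(hex);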