Is there a way to check if the bytes in a byte[] form a valid string, i.e. whether the array contains ASCII characters only?
if (isValidASCII(myByteArray)) {
....
}
Something I could use like the above example, but with the actual functionality filled in.
Well, if
contains ASCII characters only
means symbols with codes within [32..126] (corresponding characters are [' '..'~']) - standard ASCII table characters with control ones excluded:
Boolean isAscii = myByteArray.All(b => b >= 32 && b <= 126);
However, a valid string being defined like that can well appear to be
"$d|1 ?;)y" // this is a valid ASCII-characters-based string
or alike. If you want to wrap this simple Linq into a method (All requires using System.Linq):
public static bool isValidASCII(IEnumerable<byte> source) {
    if (null == source)
        return true; // or false, or throw an exception - your choice
    return source.All(b => b >= 32 && b <= 126);
}
...
if (isValidASCII(myByteArray)) {
...
}
Literally answering your question:
Yes. It's not only very easy, it's trivial:
Boolean isValidAscii(Byte[] bytes) {
return true;
}
This is most likely not what you're looking for.
ASCII is a table that maps byte values to characters - strictly speaking it only defines values 0-127, but in the loose everyday sense where every byte value maps to some character, every byte represents a valid character.
So the question really is: what are you looking for? What is a valid ASCII character, in your opinion?
Once you define that, it's easy to code:
Boolean iFindThisAValidAsciiCharacter(Char c) {
//your logic here
}
Boolean isValidAscii(Byte[] bytes) {
return bytes.All(b => iFindThisAValidAsciiCharacter((char)b)); // All() needs using System.Linq
}
The trick, of course, is in your definition of what you consider valid.
I advise you to take a step back and consider why you want "valid ASCII" in the first place. In this brave new world of internationalization and Unicode, it sounds very unlikely that what you are trying to do will accomplish what you want to accomplish.
First you can convert your byte array into a string using this:
System.Text.Encoding.ASCII.GetString(BytesArray);
then check if it is valid or not.
string asciiString = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(BYTEVARIABLE));
now check if the string has some chars that have been changed to '?' - if yes, it wasn't ASCII-only.
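To make that check concrete: Encoding.ASCII.GetString replaces any byte above 127 with '?', so re-encoding the decoded string and comparing it with the original bytes tells you whether everything survived. A minimal sketch (the name IsAsciiRoundTrip is made up here, not part of the original answer):
using System.Linq;
using System.Text;
static bool IsAsciiRoundTrip(byte[] bytes)
{
    // Non-ASCII bytes decode to '?' (0x3F), so re-encoding
    // will no longer match the original bytes.
    string decoded = Encoding.ASCII.GetString(bytes);
    byte[] reencoded = Encoding.ASCII.GetBytes(decoded);
    return bytes.SequenceEqual(reencoded);
}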
I have this text Grou00dfbeerenstrau00dfe and I need to convert it to Großbeerenstraße,
and also Eichstu00e4tt to Eichstätt.
But I can't completely understand and solve this, for these reasons:
ONLY some characters (special characters) are converted, not the whole text
Unicode-escaped texts usually have escape sequences like \u00df, not a bare u00df
Could you please help me to convert correctly back to its original states?
Basically, how can I convert when there is no escape character?
NOTE: If you must know, I'm sending some special charactered strings into some system. I cannot touch this system but when I request back the same string from that system, it converts Großbeerenstraße to Grou00dfbeerenstrau00dfe and so on.
Based on David's idea of looking for u and checking if the following 4 characters are valid hex digits, it would look something like this (needs System.Text and System.Globalization):
public string FixGermanUnicode(string input) {
var output = new StringBuilder();
for (var i = 0; i < input.Length; i++) {
if (i < input.Length - 4 && input[i] == 'u' && input[i + 1] == '0'
&& int.TryParse(input.Substring(i + 1, 4), NumberStyles.HexNumber, null, out var code)) {
try {
output.Append(char.ConvertFromUtf32(code));
i += 4;
} catch (ArgumentOutOfRangeException) {
//not a valid unicode character
output.Append(input[i]);
}
} else {
output.Append(input[i]);
}
}
return output.ToString();
}
Console.WriteLine(FixGermanUnicode("Grou00dfbeerenstrau00dfe"));
Really, it checks for u0 to prevent cases where the next 4 characters form a valid hex number but should not have been replaced. That will work for German at least, since all the special characters in German have Unicode code points starting with 0.
This will also catch scenarios where the following 4 characters are valid hex digits, but the resulting number is not a valid Unicode character.
While I completely agree with @Gabriel Luci's answer, I would like to point out a more concise implementation of the same idea (it needs the System.Text.RegularExpressions namespace):
readonly static string unicodePattern = @"u0[0-9a-fA-F]{3}";
public static string FixGermanUnicode(string input)
{
return Regex.Replace(input, unicodePattern, match =>
{
var digits = match.Value.Substring(1);
try
{
return char.ConvertFromUtf32(int.Parse(digits, System.Globalization.NumberStyles.AllowHexSpecifier));
}
catch (ArgumentOutOfRangeException)
{
//not a valid unicode character
return match.Value;
}
});
}
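Usage is the same as for the loop-based version:
Console.WriteLine(FixGermanUnicode("Grou00dfbeerenstrau00dfe")); // Großbeerenstraße
Console.WriteLine(FixGermanUnicode("Eichstu00e4tt")); // Eichstätt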
This question may reveal my ignorance regarding character encoding, so if it does, I would greatly appreciate information to correct that.
I am relaying strings from new applications to an old application. The old application only accepts ASCII characters (http://www.asciitable.com/). The old application also does not support certain characters such as backslashes. The new applications support more or less anything.
Let's say I have the string:
"Whatever - 1_夜_💦💦💦"
I need to convert that to something with only ASCII characters. For example, maybe something like:
"Whatever - 1_\u001cY_=???=???=???"
Then I want to replace the remaining illegal characters with substitution strings.
Ideally, any character that is encoded to ASCII should be decodable. That is, every unique input string must have a unique output string (two different inputs never produce the same result), so that an algorithm could convert the output string back to the input string.
This is what I've tried:
static string ConvertToAscii(string str)
{
var return_string = "";
foreach (var c in str)
{
if ((int)c < 128)
{
return_string += c;
}
else
{
var charBytes = BitConverter.GetBytes(c);
var ascii = Encoding.ASCII.GetString(charBytes);
return_string += ascii;
}
}
return return_string;
}
When I use this with the string I mentioned above, I get:
"Whatever - 1_\u001cY_=???=???=???"
That seems great - however, the "\u001cY" is apparently a single character, rather than a collection of ASCII characters. So my target database rejects it, and I am not able to figure out how to remove the "\" while leaving the remaining characters.
How can I convert any string into a collection of ASCII characters?
The easiest approach is to Base64 all the bytes, since you don't seem to care how the strings are represented:
Convert.ToBase64String( Encoding.Unicode.GetBytes("Whatever - 1_夜_💦💦💦"))
will produce a result that is guaranteed to be ASCII (even printable ASCII) - for your string the result would be "VwBoAGEAdABlAHYAZQByACAALQAgADEAXwAcWV8APdim3D3Yptw92Kbc".
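Decoding is the mirror image, so the transformation is fully reversible (assuming encoded holds the Base64 output from above):
byte[] bytes = Convert.FromBase64String(encoded);
string original = Encoding.Unicode.GetString(bytes); // "Whatever - 1_夜_💦💦💦"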
Here is code similar to what I ended up using to convert everything to ASCII:
internal static string ConvertToAscii(string str)
{
var returnStringBuilder = new StringBuilder();
foreach (var c in str)
{
if (char.IsControl(c))
{
// Control character
continue;
}
if (c < 127)
{
// ASCII Character
returnStringBuilder.Append(c);
}
else
{
returnStringBuilder.Append("U+" + ((int) c).ToString("X4"));
}
}
return returnStringBuilder.ToString();
}
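Note that this encoding is only reversible if the original text never legitimately contains a "U+XXXX" sequence, and the control characters that were skipped are lost for good. Under that assumption, a possible inverse (a sketch, not part of the original answer) would be:
using System;
using System.Text.RegularExpressions;
internal static string ConvertFromAscii(string str)
{
    // Restore each escaped UTF-16 code unit; surrogate pairs
    // recombine automatically once both halves are back.
    return Regex.Replace(str, @"U\+([0-9A-F]{4})",
        m => ((char)Convert.ToInt32(m.Groups[1].Value, 16)).ToString());
}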
I have a six digit unicode character, for example U+100000 which I wish to make a comparison with a another char in my C# code.
My reading of the MSDN documentation is that this character cannot be represented by a char, and must instead be represented by a string.
a Unicode character in the range U+10000 to U+10FFFF is not permitted in a character literal and is represented using a Unicode surrogate pair in a string literal
I feel that I'm missing something obvious, but how can you get the following comparison to work correctly:
public bool IsCharLessThan(char myChar, string upperBound)
{
return myChar < upperBound; // will not compile as a char is not comparable to a string
}
Assert.IsTrue(AnExample('\u0066', "\u100000"));
Assert.IsFalse(AnExample("\u100000", "\u100000")); // again won't compile as this is a string and not a char
edit
OK, I think I need two methods, one to accept chars and another to accept 'big chars', i.e. strings. So:
public bool IsCharLessThan(char myChar, string upperBound)
{
return true; // every char is less than a BigChar
}
public bool IsCharLessThan(string myBigChar, string upperBound)
{
return string.Compare(myBigChar, upperBound) < 0;
}
Assert.IsTrue(AnExample('\u0066', "\u100000"));
Assert.IsFalse(AnExample("\u100022", "\u100000"));
To construct a string with the Unicode code point U+10FFFF using a string literal, you need to work out the surrogate pair involved.
In this case, you need:
string bigCharacter = "\uDBFF\uDFFF";
Or you can use char.ConvertFromUtf32:
string bigCharacter = char.ConvertFromUtf32(0x10FFFF);
It's not clear what you want your method to achieve, but if you need it to work with characters not in the BMP, you'll need to make it accept int instead of char, or a string.
As per the documentation for string, if you want to iterate over characters in a string as full Unicode values, use TextElementEnumerator or StringInfo.
Note that you do need to do this explicitly. If you just use ordinal values, it will check UTF-16 code units, not the UTF-32 code points. For example:
string text = "\uF000";
string upperBound = "\uDBFF\uDFFF";
Console.WriteLine(string.Compare(text, upperBound, StringComparison.Ordinal));
This prints out a value greater than zero, suggesting that text is greater than upperBound here. Instead, you should use char.ConvertToUtf32:
string text = "\uF000";
string upperBound = "\uDBFF\uDFFF";
int textUtf32 = char.ConvertToUtf32(text, 0);
int upperBoundUtf32 = char.ConvertToUtf32(upperBound, 0);
Console.WriteLine(textUtf32 < upperBoundUtf32); // True
So that's probably what you need to do in your method. You might want to use StringInfo.LengthInTextElements to check that the strings really are single UTF-32 code points first.
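Putting that together, a sketch of such a method (the name IsCodePointLessThan and the argument validation are my additions, not part of the original answer):
using System;
using System.Globalization;
static bool IsCodePointLessThan(string ch, string upperBound)
{
    // Each argument must contain exactly one code point,
    // possibly encoded as a surrogate pair.
    if (new StringInfo(ch).LengthInTextElements != 1 ||
        new StringInfo(upperBound).LengthInTextElements != 1)
        throw new ArgumentException("Expected a single code point.");
    return char.ConvertToUtf32(ch, 0) < char.ConvertToUtf32(upperBound, 0);
}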
From https://msdn.microsoft.com/library/aa664669.aspx, you have to use \U with full 8 hex digits. So for example:
string str1 = "\U0001F300";
string str2 = "\uD83C\uDF00";
bool eq = str1 == str2;
using the 🌀 (U+1F300 CYCLONE) emoji.
This is a spin-off from the discussion in some other question.
Suppose I've got to parse a huge number of very long strings. Each string contains a sequence of doubles (in text representation, of course) separated by whitespace. I need to parse the doubles into a List<double>.
The standard parsing technique (using string.Split + double.TryParse) seems to be quite slow: for each of the numbers we need to allocate a string.
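For reference, that baseline looks something like this (a sketch, assuming s holds one input line; needs System.Collections.Generic and System.Globalization):
var result = new List<double>();
foreach (var token in s.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
{
    // every token is a freshly allocated string - that allocation is the cost in question
    if (double.TryParse(token, NumberStyles.Float, CultureInfo.InvariantCulture, out var d))
        result.Add(d);
}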
I tried to do it the old C-like way: compute the indices of the beginning and the end of the substrings containing the numbers, and parse them "in place", without creating additional strings. (See http://ideone.com/Op6h0; the relevant part is shown below.)
int startIdx, endIdx = 0;
while(true)
{
startIdx = endIdx;
// no find_first_not_of in C#
while (startIdx < s.Length && s[startIdx] == ' ') startIdx++;
if (startIdx == s.Length) break;
endIdx = s.IndexOf(' ', startIdx);
if (endIdx == -1) endIdx = s.Length;
// how to extract a double here?
}
There is an overload of string.IndexOf, searching only within a given substring, but I failed to find a method for parsing a double from substring, without actually extracting that substring first.
Does anyone have an idea?
There is no managed API to parse a double from a substring. My guess is that allocating the string will be insignificant compared to all the floating point operations in double.Parse.
Anyway, you can save the allocation by creating a "buffer" string of length 100 once, consisting of whitespace only. Then, for every number you want to parse, you copy its chars into this buffer string using unsafe code and pad the rest with whitespace. For parsing you can use NumberStyles.AllowTrailingWhite, which causes trailing whitespace to be ignored.
Getting a pointer to a string is actually a fully supported operation:
string l_pos = new string(' ', 100); //don't write to a shared string!
unsafe
{
    fixed (char* l_pSrc = l_pos)
    {
        // copy the current number's characters (s, startIdx, endIdx
        // from the question's loop) into the buffer...
        for (int j = 0; j < endIdx - startIdx; j++)
            l_pSrc[j] = s[startIdx + j];
        // ...and blank the rest so leftovers from the previous number are erased
        for (int j = endIdx - startIdx; j < l_pos.Length; j++)
            l_pSrc[j] = ' ';
    }
}
// NumberStyles.Float includes AllowTrailingWhite, so the padding is ignored
double d = double.Parse(l_pos, NumberStyles.Float, CultureInfo.InvariantCulture);
C# has special syntax to bind a string to a char*.
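Note that on .NET Core 2.1 and later this workaround is no longer needed: double.TryParse has overloads accepting ReadOnlySpan<char>, so the loop from the question can parse each number in place without allocating a substring (a sketch using the question's startIdx/endIdx):
var slice = s.AsSpan(startIdx, endIdx - startIdx);
if (double.TryParse(slice, NumberStyles.Float, CultureInfo.InvariantCulture, out double value))
    result.Add(value); // no intermediate string allocated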
If you want to do it really fast, I would use a state machine.
It could look like this (a sketch, not a complete parser):
enum State
{
    Separator, Sign, Mantissa, Exponent // etc.
}
State currentState = State.Separator;
int prefix, exponent, mantissa;
foreach (var ch in inputString)
{
    switch (currentState)
    {
        // set the new currentState depending on ch and currentState,
        // accumulating prefix/exponent/mantissa along the way
        case State.Separator:
            GotNewDouble(prefix, exponent, mantissa);
            break;
        // ... remaining states
    }
}
Part of our app parses RTF documents and we've come across a special character that is not translating well. When viewed in Word the character is an ellipsis (…), and it's encoded in the RTF as \'85.
In our VB code we converted the hex 85 to the int 133 and then did Chr(133) to return the ellipsis.
Here's the code in C# - problem is this doesn't work for values above 127. Any ideas?
Calling code :
// S is Hex number!!!
return Convert.ToChar(HexStringToInt(s)).ToString();
Helper method:
private static int HexStringToInt(string hexString)
{
int i;
try
{
i = Int32.Parse(hexString, NumberStyles.HexNumber);
}
catch (Exception ex)
{
throw new ApplicationException("Error trying to convert hex value: " + hexString, ex);
}
return i;
}
This looks like a character encoding issue to me. In Unicode, code points 128-159 are C1 control characters rather than printable symbols, so converting the value 133 directly yields U+0085 (a control character), not an ellipsis.
You need to convert it to a character using the proper decoding first; Convert.ToChar appears to treat the value as UTF-16.
Sometimes there's a manual bit manipulation hack to convert the character from upper ASCII to the appropriate unicode char, but since the ellipsis wasn't in most of the widely used extended ASCII codepages, that's unlikely to work here.
What you really want to do is use the Encoding.GetString(Byte[]) method, with the proper encoding. Put your value into a byte array, then GetString to get the C# native string for the character.
You can learn more about RTF character encodings on the RTF Wikipedia page.
FYI: The horizontal ellipsis is character U+2026 (pdf).
Your original code works perfectly fine for me. It is able to convert any hex from 00 to FF into the appropriate character. Using VS2008.
private static int HexStringToInt(string hexString)
{
    try
    {
        // base-16 parse of the hex string
        return Convert.ToInt32(hexString, 16);
    }
    catch (FormatException ex)
    {
        throw new ArgumentException("Is not a valid hex number.", "hexString", ex);
    }
    // Convert.ToInt32 also throws OverflowException
    // if the value doesn't fit in an int
}
My guess would be that a Char in .NET is actually two bytes (16 bits), as they are UTF-16 encoded. Maybe you are only catching/writing the first byte of the value?
Basically, are you doing something with the char value afterwards that assumes it is 8-bits instead of 16, and is therefore truncating it?
You are probably using the default character encoding when reading in the RTF file, which is UTF-8, when the RTF file is actually stored using the "windows-1252" extended ASCII latin encoding.
C# strings use the 16-bit Unicode (UTF-16) character format. Translating Windows-1252 character 0x85 to its Unicode equivalent involves a non-trivial mapping, since the code points (character numbers) are very different. Luckily Windows can do the work for you.
You can change the way the characters are converted when reading in the text by explicitly specifying the source encoding when opening the stream.
using System.IO;
using System.Text;
using (TextReader tr = new StreamReader(path_to_RTF_file, Encoding.GetEncoding(1252)))
{
// Read from the file as usual.
}
Here's some rough code that should work for you:
// Convert hex number, which represents an RTF code-page escaped character,
// to the desired character (uses '85' from your example as a literal):
var number = int.Parse("85", System.Globalization.NumberStyles.HexNumber);
Debug.Assert(number <= byte.MaxValue);
byte[] bytes = new byte[1] { (byte)number };
char[] chars = Encoding.GetEncoding(1252).GetString(bytes).ToCharArray();
// or, use:
// char[] chars = Encoding.Default.GetString(bytes).ToCharArray();
string result = new string(chars);
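For the \'85 from the question, result comes out as the horizontal ellipsis U+2026:
Console.WriteLine(result); // …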
Just use this function I modified (very slightly) from Chris' website:
private static string charScrubber(string content)
{
StringBuilder sbTemp = new StringBuilder(content.Length);
foreach (char currentChar in content)
{
if (currentChar != 127 && currentChar > 1)
{
sbTemp.Append(currentChar);
}
}
content = sbTemp.ToString();
return content;
}
You can modify the currentChar condition to remove whatever characters need to be eliminated (as written here, you will not get any 0x00 or 0x01 characters, nor the (char)127 / 0x7F character).
ASCII/Hex table here: http://www.cs.mun.ca/~michael/c/ascii-table.html
Chris' site: http://seattlesoftware.wordpress.com/2008/09/11/hexadecimal-value-0-is-an-invalid-character/
-Tom