Sorting through byte arrays - c#

My program sends data from one application to another in a byte array. I want to pull sections of the data out to store in different variables. For instance, the first 7 bytes of the array hold the symbol data; the next section is a number whose length I don't know, because it varies with each message sent. Before I send the data, I separate the sections with commas. My issue is setting up a loop that stops at the commas so I can copy each section into its own variable. Any ideas would help. Thanks.

You need to know what encoding you have, since a comma is not always the same byte value in different encoding schemes. If efficiency matters, you can parse the byte array directly as bytes, but decoding it to a string first is easier. Alternatively, you could create a class on both ends that has the properties you need and is marked [Serializable].
If for whatever reason you don't want to do that, you can easily parse the byte array like this:
UTF8Encoding encoding = new UTF8Encoding();
string s = encoding.GetString(byteArray);
string[] values = s.Split(new char[] {','});
//then do something with the values
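For the loop the question asks about, here is a minimal sketch that walks the byte array directly and cuts a field at each comma byte. It assumes the data was encoded as UTF-8 (or ASCII), where the comma is always the single byte 0x2C and never appears inside a multi-byte sequence. ByteSplitter and SplitFields are hypothetical names, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ByteSplitter
{
    // Splits 'data' at each delimiter byte and decodes every section as UTF-8.
    // Assumption: the delimiter never occurs inside a field's own content.
    public static List<string> SplitFields(byte[] data, byte delimiter = (byte)',')
    {
        var fields = new List<string>();
        int start = 0;
        for (int i = 0; i <= data.Length; i++)
        {
            // A field ends at a delimiter or at the end of the array.
            if (i == data.Length || data[i] == delimiter)
            {
                fields.Add(Encoding.UTF8.GetString(data, start, i - start));
                start = i + 1;
            }
        }
        return fields;
    }
}
```

Each returned string can then be parsed on its own, for example with int.Parse for the numeric section.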

The data is just complicated to handle as a byte array, as it's really encoded text. Just decode it (using the encoding that you used to turn it into a byte array) and split it:
string[] parts = Encoding.UTF8.GetString(data).Split(',');
Now you can get each part and parse it:
int symbol = Int32.Parse(parts[0]);
int count = Int32.Parse(parts[1]);

I recommend defining an object model that represents the data that you need to send, and then using some serialization framework to convert this to/from a byte array.
See for example http://msdn.microsoft.com/en-us/library/ms973893.aspx
Another topic which may be interesting for you is data contracts in .NET.
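As a rough illustration of the object-model approach, here is a sketch using a data contract and DataContractSerializer from System.Runtime.Serialization. The Quote class and its Symbol and Count properties are assumptions made up for this example; the real message would carry whatever fields the two applications agree on.

```csharp
using System;
using System.IO;
using System.Runtime.Serialization;

// Hypothetical message type; the property names are illustrative only.
[DataContract]
public class Quote
{
    [DataMember] public string Symbol { get; set; }
    [DataMember] public int Count { get; set; }
}

public static class QuoteSerializer
{
    // Serializes the object to XML and returns the raw bytes to send.
    public static byte[] ToBytes(Quote quote)
    {
        var serializer = new DataContractSerializer(typeof(Quote));
        using (var stream = new MemoryStream())
        {
            serializer.WriteObject(stream, quote);
            return stream.ToArray();
        }
    }

    // Rebuilds the object on the receiving side.
    public static Quote FromBytes(byte[] data)
    {
        var serializer = new DataContractSerializer(typeof(Quote));
        using (var stream = new MemoryStream(data))
        {
            return (Quote)serializer.ReadObject(stream);
        }
    }
}
```

With this in place neither side needs to know about commas or field lengths; both just share the Quote type.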

Related

1:1 decoding of UTF-8 octets for visualization

I'm making a tool (C#, WPF) for viewing binary data which may contain embedded text. It's traditional for such data viewers to use two vertical columns, one displaying the hexadecimal value of each byte and the other displaying the ASCII character corresponding to each byte, if printable.
I've been thinking it would be nice to support display of embedded text using non-ASCII encodings as well, in particular UTF-8 and UTF-16. The issue is that UTF code points don't map 1:1 with octets. I would like to keep the output grid-aligned according to its location in the data, so I need every octet to map to something to appear in the corresponding cell in the grid. What I'm thinking is that the end octet of each code point will map to the resulting Unicode character, and lead bytes map to placeholders that vary with sequence length (perhaps circled forms and use color to distinguish them from the actual encoded characters), and continuation and invalid bytes similarly to placeholders.
struct UtfOctetVisualization
{
enum Classification
{
Ascii,
NonAscii,
LeadByteOf2,
LeadByteOf3,
LeadByteOf4,
Continuation,
Error
}
Classification OctetClass;
int CodePoint; // valid only when OctetClass == Ascii or NonAscii
}
The Encoding.UTF8.GetString() method doesn't provide any information about the location each resulting character came from.
I could use Encoding.UTF8.GetDecoder() and call Convert passing a single byte at a time so that the completed output parameter gives a classification for each octet.
But in both methods, in order to have handling of invalid characters, I would need to implement a DecoderFallback class? This looks complicated.
Is there a simple way to get this information using the APIs provided with .NET (in System.Text or otherwise)? Using System.Text.Decoder, what would the fallback look like that fills in an output array shared with the decoder?
Or is it more feasible to write a custom UTF-8 recognizer (finite state machine)?
How about encoding one character at a time so that you can capture the number of bytes each character occupies? Something like this:
string data = "hello????";
byte[] buffer = new byte[Encoding.UTF8.GetByteCount(data)];
int bufferIndex = 0;
for(int i = 0; i < data.Length; i++)
{
int bytes = Encoding.UTF8.GetBytes(data, i, 1, buffer, bufferIndex);
Console.WriteLine("Character: {0}, Position: {1}, Bytes: {2}", data[i], i, bytes);
bufferIndex += bytes;
}
Fiddle: https://dotnetfiddle.net/poohHM
Those "????" in the string are supposed to be multi-byte characters, but SO doesn't let me paste them in. See the Fiddle.
I don't think this is going to work out the way you want when you mix binary data with characters, as @Jon has pointed out. You'll see something, but it may not be what you expect, because the decoder won't be able to tell which bytes are supposed to be characters.
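On the question's last option, a hand-written recognizer is fairly small. The sketch below is one possible take, not a definitive implementation: it classifies every octet using the categories from the question's struct, puts the decoded code point on the final octet of each sequence, and flags a single byte as Error before resynchronizing. The type names (OctetClass, OctetInfo, Utf8OctetClassifier) are hypothetical, and the choice to flag only one byte per malformed sequence is an assumption about what the viewer should show.

```csharp
using System;

public enum OctetClass { Ascii, NonAscii, LeadByteOf2, LeadByteOf3, LeadByteOf4, Continuation, Error }

public struct OctetInfo
{
    public OctetClass Class;
    public int CodePoint; // meaningful only when Class is Ascii or NonAscii
}

public static class Utf8OctetClassifier
{
    // Smallest code point that legitimately needs a sequence of the given length.
    private static readonly int[] MinCodePoint = { 0, 0, 0x80, 0x800, 0x10000 };

    public static OctetInfo[] Classify(byte[] data)
    {
        var result = new OctetInfo[data.Length];
        int i = 0;
        while (i < data.Length)
        {
            byte b = data[i];
            int length, codePoint;
            if (b < 0x80) { length = 1; codePoint = b; }
            else if ((b & 0xE0) == 0xC0) { length = 2; codePoint = b & 0x1F; }
            else if ((b & 0xF0) == 0xE0) { length = 3; codePoint = b & 0x0F; }
            else if ((b & 0xF8) == 0xF0) { length = 4; codePoint = b & 0x07; }
            else { length = 0; codePoint = 0; } // stray continuation or invalid byte

            // Every byte after the lead must look like 10xxxxxx.
            bool ok = length > 0 && i + length <= data.Length;
            for (int k = 1; ok && k < length; k++)
            {
                byte next = data[i + k];
                if ((next & 0xC0) != 0x80) ok = false;
                else codePoint = (codePoint << 6) | (next & 0x3F);
            }
            // Reject overlong forms, surrogates, and values beyond U+10FFFF.
            if (ok && (codePoint < MinCodePoint[length] || codePoint > 0x10FFFF
                       || (codePoint >= 0xD800 && codePoint <= 0xDFFF)))
                ok = false;

            if (!ok)
            {
                result[i].Class = OctetClass.Error; // flag one byte, then resynchronize
                i++;
                continue;
            }

            if (length == 1)
            {
                result[i].Class = OctetClass.Ascii;
            }
            else
            {
                result[i].Class = length == 2 ? OctetClass.LeadByteOf2
                                : length == 3 ? OctetClass.LeadByteOf3
                                : OctetClass.LeadByteOf4;
                for (int k = 1; k < length - 1; k++)
                    result[i + k].Class = OctetClass.Continuation;
                result[i + length - 1].Class = OctetClass.NonAscii;
            }
            result[i + length - 1].CodePoint = codePoint;
            i += length;
        }
        return result;
    }
}
```

Because the output array has exactly one entry per input byte, it can drive the grid cells directly without any DecoderFallback.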

Determine what Byte array is?

I have recently started learning C# Networking and I was wondering how would you tell if the received Byte array is a file or a string?
A byte array is just a byte array. It's just got data in.
How you interpret that data is up to you. What's the difference between a text file and a string, for example?
Fundamentally, if your application needs to know how to interpret the data, you've got to put that into the protocol.
A byte array is just a byte array. However, you could make the original byte array include a byte that describes what type it is (assuming you are the originator of it). Then you find this descriptor byte and use it to make decisions.
Strings are encoded byte arrays; files can contain strings and/or binary data.
ASCII strings use byte values between 0-127 to represent characters and control codes. For UTF8 people have written validation routines (https://stackoverflow.com/a/892443/884862).
You'd have to check the array for all of the string encoding characteristics before you could assume it's a binary file.
Edit: here's an SO question about classifying a file type, "Using .NET, how can you find the mime type of a file based on the file signature not the extension", which uses a signature (the first X bytes) of the file to determine its MIME type.
No, you can't. Data is data; you must layer some form of protocol on top of your network communication. It will need to say something like: "If the first byte I see is a 1, the next four bytes represent an int; if I see a 2, read the next byte, and that is the length of the text string that follows..."
A much easier solution than inventing your own protocol is to use a prebuilt one that gives you a higher-level abstraction, like WCF, so you don't need to deal with byte arrays.
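If you do roll your own, the descriptor-byte idea can be sketched roughly as follows: one byte for the payload kind, a 4-byte length, then the payload. The PayloadKind values and the Framing class are made up for this example, and the layout is only one possible convention (BinaryWriter writes the length as a little-endian Int32).

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical type tags agreed on by sender and receiver.
public enum PayloadKind : byte { Text = 1, File = 2 }

public static class Framing
{
    // Frame layout: [1 byte kind][4 byte little-endian length][payload bytes]
    public static byte[] Encode(PayloadKind kind, byte[] payload)
    {
        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream))
        {
            writer.Write((byte)kind);
            writer.Write(payload.Length);
            writer.Write(payload);
            writer.Flush();
            return stream.ToArray();
        }
    }

    // Reads the tag and length back, then returns exactly that many payload bytes.
    public static PayloadKind Decode(byte[] frame, out byte[] payload)
    {
        using (var reader = new BinaryReader(new MemoryStream(frame)))
        {
            var kind = (PayloadKind)reader.ReadByte();
            int length = reader.ReadInt32();
            payload = reader.ReadBytes(length);
            if (payload.Length != length)
                throw new EndOfStreamException("Truncated frame.");
            return kind;
        }
    }
}
```

The receiver then switches on the returned kind: decode the payload as a string for Text, or write it to disk for File.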
Not quite a "file"; an array just contains data. You should loop through that array and write out the data.
Try this:
foreach(string data in array)
{
Console.WriteLine(data);
}
Now, if it doesn't contain strings but other data, you can simply use:
foreach(var data in array)
{
Console.WriteLine(data.ToString());
}

StringBuilder append byte without formatting

DateTime todayDateTime = DateTime.Now;
StringBuilder todayDateTimeSB = new StringBuilder("0");
todayDateTimeSB.Append(todayDateTime.ToString("MMddyyyy"));
long todayDateTimeLongValue = Convert.ToInt64(todayDateTimeSB.ToString());
// convert to byte array packed decimal
byte[] packedDecValue = ToComp3UsingStrings(todayDateTimeLongValue);
// append each byte to the string builder
foreach (byte b in packedDecValue)
{
sb.Append(b); // bytes 56-60
}
sb.Append(' ', 37);
The above code takes the current date and time, formats it into a long value, and passes that to a method which converts it to a packed decimal format. I know this part works, since when I step through the code the byte array has the correct hex values for all of the bytes I am expecting.
However, the code above is where I am having issues. I have researched and found that StringBuilder's .Append(byte) actually calls ToString() on that byte, which alters the value of the byte when it is added to the string. The question is: how do I tell the StringBuilder to take the byte as is and store it in memory without formatting or altering the value? I know there is also .AppendFormat(), which has several overloads that take an IFormatProvider and give lots of options for formatting, but I don't see any way to tell it NOT to format or alter the value of the data.
You can cast the byte to a char:
sb.Append((char)b);
You can also use an ASCIIEncoding to convert all the bytes at once:
string s = Encoding.ASCII.GetString(packedDecValue);
sb.Append(s);
As noted, in a Unicode world, bytes (octets) are not characters. The CLR works with Unicode characters and represents them internally in the UTF-16 encoding. A StringBuilder builds a UTF-16 encoded Unicode string.
Once you have that UTF-16 string, however, you can re-encode it, using, say UTF-8 or the ASCIIEncoding. However, in both of those, code points 0x0080 and higher will not be left as-is.
UTF-8 uses 2 octets for code points 0x0080–0x07FF; 3 octets for code points 0x0800–0xFFFF and so on. http://en.wikipedia.org/wiki/UTF-8#Description
The ASCII encoding is worse: per the documentation, code points outside 0x0000–0x007F are simply chucked:
If you use the default encoder returned by the Encoding.ASCII property or the
ASCIIEncoding constructor, characters outside that range are replaced with a
question mark (?) before the encoding operation is performed.
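If you need to send a stream of octets unscathed, you are better off using a System.IO.MemoryStream and writing the raw bytes to it directly (or through a BinaryWriter), since StreamReader and StreamWriter are text-oriented and apply an encoding to everything that passes through them.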
You can then access the MemoryStream's backing store via its GetBuffer() method or its ToArray() method. GetBuffer() gives you a reference to the actual backing store. However, it likely contains allocated but unused bytes, so you need to check the stream's Length and Capacity. ToArray() allocates a new array and copies the actual stream content into it, so the array reference you receive is the correct length.
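A minimal sketch of that approach for the record in the question, assuming the surrounding text fields are plain ASCII. RecordBuilder and its parameters are hypothetical names; the packed bytes would come from the asker's own ToComp3UsingStrings method.

```csharp
using System;
using System.IO;
using System.Text;

public static class RecordBuilder
{
    // Builds one record: ASCII text, then the packed bytes untouched, then padding.
    public static byte[] Build(string textPrefix, byte[] packedBytes, int trailingSpaces)
    {
        using (var stream = new MemoryStream())
        {
            // Text is encoded explicitly; assumption: it contains only ASCII characters.
            byte[] prefix = Encoding.ASCII.GetBytes(textPrefix);
            stream.Write(prefix, 0, prefix.Length);

            // Raw bytes are written as-is, so values above 0x7F survive.
            stream.Write(packedBytes, 0, packedBytes.Length);

            for (int i = 0; i < trailingSpaces; i++)
                stream.WriteByte((byte)' ');

            // ToArray() returns exactly Length bytes, without unused capacity.
            return stream.ToArray();
        }
    }
}
```

The result stays a byte[] from start to finish, so no string conversion ever gets a chance to alter the packed decimal bytes.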

Byte array replace byte with byte sequence efficiency: iterate and copy versus SelectMany

I'm dealing with a byte array that comprises a text message, but some of the characters in the message are control characters (i.e. less than 0x20) and I want to replace them with sequences of characters that are human readable when decoded into ASCII (for instance, 0x09 would display [TAB] instead of actually being a tab character). So as I see it, I have three options:
1. Decode the whole thing into an ASCII string, then use String.Replace() to swap out what I want. The problem with this is that the characters seem to just be decoded as the unprintable box character or question marks, thus losing their actual byte values.
2. Iterate through the byte array looking for any of my control characters and performing an array insert operation (make new larger array, copy existing pieces in, write new pieces).
3. Use Array.ToList<byte>() to convert the byte array to a List, then use IEnumerable.SelectMany() to transform the control characters into sequences of readable characters which SelectMany will then flatten back out for me.
So the question is, which is the best option in terms of efficiency? I don't really have a good feel for the performance implications of the IEnumerable lambda operations. I believe option 1 is out as functionally unworkable, but I could be wrong.
Try
// your byte array for the message
byte[] TheMessage = ...;
// a string representation of your message (the control characters 0x01 ... 0x1F are NOT altered)
string MessageString = Encoding.ASCII.GetString(TheMessage);
// replace whatever you want...
MessageString = MessageString.Replace (" ", "x").Replace ( "\n", " " )...
// the replaced message back as byte array
byte[] TheReplacedMessage= Encoding.ASCII.GetBytes(MessageString.ToCharArray());
EDIT:
Sample for replacing an 8 Bit byte value
MessageString = MessageString.Replace ( Encoding.ASCII.GetString (new byte[] {0xF7}), " " )...
Regarding the performance
I am not 100% sure whether it is the fastest approach. We tried several approaches, though our requirement was to replace "byte arrays of 1-n bytes" within the original byte array. This came out the fastest and cleanest for our use case (1 MB - 1 GB files).
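One caveat with the string route: Encoding.ASCII.GetString turns every byte above 0x7F into '?', so a value like 0xF7 cannot be matched after decoding. A byte-level pass avoids the issue. The sketch below is a variation on the question's second option that writes into a growing MemoryStream instead of re-copying arrays; the ControlCharExpander name and the replacement table are assumptions for illustration, and no performance claim is made without measuring.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class ControlCharExpander
{
    // Hypothetical mapping from control bytes to readable ASCII tokens.
    private static readonly Dictionary<byte, byte[]> Replacements = new Dictionary<byte, byte[]>
    {
        { 0x09, Encoding.ASCII.GetBytes("[TAB]") },
        { 0x0A, Encoding.ASCII.GetBytes("[LF]") },
        { 0x0D, Encoding.ASCII.GetBytes("[CR]") },
    };

    // Copies the message in one pass, expanding mapped bytes and leaving
    // every other byte (including values above 0x7F) exactly as it was.
    public static byte[] Expand(byte[] message)
    {
        using (var output = new MemoryStream(message.Length))
        {
            foreach (byte b in message)
            {
                byte[] replacement;
                if (Replacements.TryGetValue(b, out replacement))
                    output.Write(replacement, 0, replacement.Length);
                else
                    output.WriteByte(b);
            }
            return output.ToArray();
        }
    }
}
```

The stream grows as needed, so the input is read once and each output byte is written once, apart from the occasional internal buffer resize.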

create object with its property values from flat file, need implementation ideas

I have a flat file where the data is not delimited in any way.
The file contains one large string, and each row is represented by 180 chars.
Each column value is defined by a length in chars.
I have to create an object for each row, parse the 180 chars and fill
properties of the created object with the parsed values.
How can I solve this problem without constantly using Substring or something similar?
Maybe there is a nice solution with LINQ?
Thanks a lot.
Solution 1 - Super fast but unsafe:
Create your class with [StructLayout(LayoutKind.Sequential)] and all other unmanaged code markings for length. Your strings will be char arrays but can be exposed as strings after loading.
Read 180 bytes into a byte array of the same size and pin it inside a fixed block
Convert the pointer to an IntPtr and use Marshal.PtrToStructure() to load an object of your class
Solution 2 - Loading logic in the class:
Create a constructor in your class that accepts a byte[] and populates the object's properties using Convert.ToXxx or Encoding.ASCII.GetString(), assuming the data is ASCII
Read 180 bytes and pass them to the constructor to create an object
If you have to serialise back to byte[], implement a ToByteArray() method and again use Convert.ToXxx or Encoding.ASCII.GetBytes() to write the bytes.
Enhancement to solution 2:
Create custom attributes and decorate your classes with them so that you can have a factory that reads the metadata and inflates your objects from the byte array for you. This is most useful if you have more than a couple of such classes.
Alternative to solution 2:
You may pass a stream instead of a byte array, which is faster. Here you would use BinaryReader and BinaryWriter to read and write values. Strings, however, are a bit tricky, since BinaryWriter.Write(string) writes a length prefix as well.
Use a StringReader to parse your text; then you won't have to use Substring. LINQ won't help you here.
I agree with OJ, but even with a StringReader you will still need the position of each individual value to parse it out of the string. There is nothing wrong with Substring; just make sure you use named constants for the start index and length of each field. Example:
private const int VAR_START_INDEX = 0;
private const int VAR_LENGTH = 4;
string data = "thisisthedata";
string value = data.Substring(VAR_START_INDEX, VAR_LENGTH);
// value is now "this"
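Extending that idea to whole rows, here is a rough sketch that slices a 180-character row into an object using named constants. The Row class, its two fields, and their offsets are invented for this example; the real layout comes from the file's specification.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical row type; the actual columns depend on the file format.
public class Row
{
    public string Name;
    public int Quantity;
}

public static class FixedWidthParser
{
    private const int RowLength = 180;

    // Assumed layout: chars 0-19 hold a name, chars 20-25 hold a number.
    private const int NameStart = 0, NameLength = 20;
    private const int QuantityStart = 20, QuantityLength = 6;

    // Yields one Row per complete 180-character block of the input.
    public static IEnumerable<Row> Parse(string content)
    {
        for (int offset = 0; offset + RowLength <= content.Length; offset += RowLength)
        {
            yield return new Row
            {
                Name = content.Substring(offset + NameStart, NameLength).TrimEnd(),
                Quantity = int.Parse(
                    content.Substring(offset + QuantityStart, QuantityLength).Trim(),
                    CultureInfo.InvariantCulture)
            };
        }
    }
}
```

Since Parse returns an IEnumerable<Row>, LINQ operators such as Where or ToList can be applied to the parsed rows afterwards, even though the slicing itself still relies on Substring.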
This library can help you http://f2enum.codeplex.com/
