I want to transmit binary data over a noisy channel.
I read that a good ECC algorithm to detect errors is Reed-Solomon.
The problem is I don't understand the input for this algorithm.
Here is my naive, failed attempt with ZXing.Net:
int[] toEncode = { 123, 232, 432 };
var gf = GenericGF.AZTEC_DATA_12;
ReedSolomonEncoder rse = new ReedSolomonEncoder(gf);
rse.encode(toEncode, 2);
ReedSolomonDecoder rsd = new ReedSolomonDecoder(gf);
rsd.decode(toEncode, 2);
Please explain to me the input for the encoder and decoder.
Is this the implementation you are using here: ReedSolomonEncoder.cs?
If so, to encode N integers with M data correction integers, you need to pass an array of length N+M. Your data should be in the first N indices and the codes look to be added at the end in the final M entries.
Also, note that the encoder in that version imposes an additional restriction.
Update: a more recent version is here: http://zxingnet.codeplex.com/. Its most recent version of ReedSolomonEncoder.cs does not have this restriction.
This class implements Reed-Solomon encoding schemes used in processing QR codes. A very brief description of Reed Solomon encoding is here: Reed-Solomon Codes.
An encoding choice of "QR_CODE_FIELD_256" (which is probably a reasonable choice for you) means that error correction codes are being generated on byte-sized chunks ("symbols") of your message, which means your maximum message length (data to encode plus error correction codes) is 255 bytes long. If you are sending more data you will need to break it into chunks.
Update 2: Using QR_CODE_FIELD_256, your integers need to be between 0 and 255 as well, so to encode a general byte stream, you need to put each byte into a separate integer in the integer array, pass the int array (plus space for error correction codes) through the encoder, then reconvert to a (larger) byte array. And the reverse for decoding.
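For example, here is a minimal end-to-end sketch along those lines (my own code, using the class and field names from the question; the exact decode signature varies a little between ZXing.Net versions, so treat it as an outline rather than a drop-in):
using ZXing.Common.ReedSolomon; // namespace used by recent ZXing.Net builds

byte[] data = { 0x12, 0x34, 0x56 };      // N = 3 data bytes
const int eccCount = 2;                  // M = 2 error correction symbols (corrects up to 1 bad symbol)

// Widen each byte into an int (0..255) and leave M empty slots at the end.
int[] message = new int[data.Length + eccCount];
for (int i = 0; i < data.Length; i++)
    message[i] = data[i];

var field = GenericGF.QR_CODE_FIELD_256; // byte-sized symbols, max 255 per block
new ReedSolomonEncoder(field).encode(message, eccCount);

// ... transmit 'message'; the channel may corrupt up to eccCount/2 symbols ...

// The decoder repairs the array in place; depending on the ZXing.Net version it
// throws a ReedSolomonException or returns false when there are too many errors.
new ReedSolomonDecoder(field).decode(message, eccCount);

byte[] recovered = new byte[data.Length];
for (int i = 0; i < data.Length; i++)
    recovered[i] = (byte)message[i];     // the first N entries are the original data again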
I would like to get the number of bytes in a string using UTF-8 encoding without explicitly creating an array of bytes (because I do not need to use the array, just the number of bytes). Is this possible? My question is almost exactly this one but with C# instead of Java.
Thanks!
You can use the method GetByteCount to get the number of bytes that the string would produce with a given encoding.
var byteCount = System.Text.Encoding.UTF8.GetByteCount("myString");
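For example (the sample string is mine; the accented character is what makes the counts differ):
Console.WriteLine("héllo".Length);                                  // 5 characters
Console.WriteLine(System.Text.Encoding.UTF8.GetByteCount("héllo")); // 6 bytes, since 'é' encodes to two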
.Net's implementation of Convert.FromBase64String() can be fed an invalid input and will produce output without throwing a FormatException. Most invalid input is caught correctly but this type is not. In this case FromBase64String will ignore the last 4 bits of the input string. This can cause multiple inputs to decode to the same byte array.
The issue is demonstrated in this code:
var validBase64 = Convert.FromBase64String("TQ==");//assigned length 1 byte array with single byte 0x4d
var invalidBase64 = Convert.FromBase64String("TR==");//also assigned the same
var validConvertedBack = Convert.ToBase64String(validBase64);//assigned TQ==
var invalidConvertedBack = Convert.ToBase64String(invalidBase64);//also assigned TQ==
This occurs because of how the final encoded symbol is decoded: the first 2 bits of 'Q' (01) complete the single output byte, but the remaining 4 bits are ignored. Any final character whose 6-bit value starts with 01, such as 'R' (010001), will therefore decode to the same byte array.
The Base64 encoding RFC does not define decoding, only encoding, so .NET is not in violation of that spec. However, the spec does warn that a similar issue is dangerous:
> The padding step in base 64 and base 32 encoding can, if improperly
> implemented, lead to non-significant alterations of the encoded data.
> For example, if the input is only one octet for a base 64 encoding,
> then all six bits of the first symbol are used, but only the first
> two bits of the next symbol are used. These pad bits MUST be set to
> zero by conforming encoders, which is described in the descriptions
> on padding below. If this property do not hold, there is no
> canonical representation of base-encoded data, and multiple base-
> encoded strings can be decoded to the same binary data. If this
> property (and others discussed in this document) holds, a canonical
> encoding is guaranteed.
Is Microsoft aware of this functionality? If not, could this be a security issue?
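For what it's worth, a workaround (my own sketch, not an official API) is to re-encode the decoded bytes and require that this reproduces the original string, which rejects non-canonical padding bits:
static bool TryDecodeCanonicalBase64(string input, out byte[] bytes)
{
    bytes = Convert.FromBase64String(input);  // still throws FormatException for other invalid input
    // Canonical input round-trips exactly; "TR==" decodes to 0x4D but re-encodes to "TQ==", so it is rejected.
    return Convert.ToBase64String(bytes) == input;
}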
I'm making a tool (C#, WPF) for viewing binary data which may contain embedded text. It's traditional for such data viewers to use two vertical columns, one displaying the hexadecimal value of each byte and the other displaying the ASCII character corresponding to each byte, if printable.
I've been thinking it would be nice to support display of embedded text using non-ASCII encodings as well, in particular UTF-8 and UTF-16. The issue is that UTF code points don't map 1:1 to octets. I would like to keep the output grid-aligned according to its location in the data, so I need every octet to map to something that appears in the corresponding cell of the grid. What I'm thinking is that the end octet of each code point will map to the resulting Unicode character, lead bytes will map to placeholders that vary with sequence length (perhaps circled forms, colored to distinguish them from actual decoded characters), and continuation and invalid bytes will similarly map to placeholders.
struct UtfOctetVisualization
{
    enum Classification
    {
        Ascii,
        NonAscii,
        LeadByteOf2,
        LeadByteOf3,
        LeadByteOf4,
        Continuation,
        Error
    }

    Classification OctetClass;
    int CodePoint; // valid only when OctetClass == Ascii or NonAscii
}
The Encoding.UTF8.GetString() method doesn't provide any information about the location each resulting character came from.
I could use Encoding.UTF8.GetDecoder() and call Convert passing a single byte at a time so that the completed output parameter gives a classification for each octet.
But with either approach, in order to handle invalid characters, I would apparently need to implement a DecoderFallback class, which looks complicated.
Is there a simple way to get this information using the APIs provided with .NET (in System.Text or otherwise)? Using System.Text.Decoder, what would the fallback look like that fills in an output array shared with the decoder?
Or is it more feasible to write a custom UTF-8 recognizer (finite state machine)?
How about converting one character at a time, so that you can capture the number of bytes each character occupies? Something like this:
string data = "hello????";
byte[] buffer = new byte[Encoding.UTF8.GetByteCount(data)];
int bufferIndex = 0;
for(int i = 0; i < data.Length; i++)
{
int bytes = Encoding.UTF8.GetBytes(data, i, 1, buffer, bufferIndex);
Console.WriteLine("Character: {0}, Position: {1}, Bytes: {2}", data[i], i, bytes);
bufferIndex += bytes;
}
Fiddle: https://dotnetfiddle.net/poohHM
Those ???" in the string are supposed to be multi-byte characters, but SO dosent let me paste them in. See the Fiddle.
I dont this this is going to workout the way you want when you mix binary stuff with characters as #Jon has pointed out. I mean you'll see something, but it may not be what you expect, because the encoder wont be able to distinguish what bytes are supposed to be characters.
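If the custom recognizer mentioned at the end of the question turns out to be the simpler route, the per-octet classification itself is just a few bit tests; here is a minimal sketch (my own names, loosely mirroring the struct in the question; a full validator still needs a small state machine to catch overlong forms, lone surrogates and truncated sequences):
enum UtfOctetClass { Ascii, LeadByteOf2, LeadByteOf3, LeadByteOf4, Continuation, Error }

static UtfOctetClass ClassifyOctet(byte b)
{
    if (b <= 0x7F) return UtfOctetClass.Ascii;                  // 0xxxxxxx: a complete ASCII character
    if ((b & 0xC0) == 0x80) return UtfOctetClass.Continuation;  // 10xxxxxx
    if ((b & 0xE0) == 0xC0) return UtfOctetClass.LeadByteOf2;   // 110xxxxx
    if ((b & 0xF0) == 0xE0) return UtfOctetClass.LeadByteOf3;   // 1110xxxx
    if ((b & 0xF8) == 0xF0) return UtfOctetClass.LeadByteOf4;   // 11110xxx
    return UtfOctetClass.Error;                                 // 0xF8..0xFF never occur in valid UTF-8
}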
Suppose there is a string containing 255 characters, and there is a fixed-length byte pattern, say 64-128 bytes. I want to "dissolve" that 255-character string, byte by byte, into the fixed-length byte pattern. The byte pattern is like a formula-based "hash" or something similar, into which a formula-based algorithm dissolves the bytes. Later, when I need to extract the dissolved bytes from that fixed-length pattern, I would use the same algorithm's reverse, or extract, function. The algorithm works with special keys or passwords and uses them to dissolve the bytes into the pattern; the same keys are used to extract the bytes, with their original values, from the pattern. I ask for help from the coders here. Please also guide me through the steps so that I can understand what needs to be done. I only know VB.NET and C#.
For instance:
I have these three characters: "A", "B", "C"
The formula based fixed length super pattern (works like a whirlpool) is:
AJE83HDL389SB4VS9L3
Now I wish to "dissolve" or "submerge" the characters "A", "B", "C", one by one, into the above pattern to change it completely. After dissolving the characters, the super pattern changes drastically, just like a hash:
EJS83HDLG89DB2G9L47
I would be able to extract the characters, from the last dissolved to the first, by using an extraction algorithm and the original keys that were used to dissolve the characters into this super pattern. Each character insertion and removal has a unique pattern state.
After extraction of all the characters, the super pattern goes back to its original state. This happens as the extraction algorithm removes the characters:
AJE83HDL389SB4VS9L3
This looks a lot like your previous question(s). The problem with them is that you seem to start asking from a half-baked solution.
So, what do you really want? Input, output, constraints?
To encrypt a string, use encryption (Rijndael). To transform the resulting byte[] data into a string (for transport), use Base64.
If you're happy having the 'keys' for the individual bits of data being determined for you, this can be done similarly to a one-time-pad (though it's not one-time!) - generate a random string as your 'base', then xor your data strings with it. Each output is the 'key' to get the original data back, and the 'base' doesn't change. This doesn't result in output data that's any smaller than the input, however (and this is impossible in the general case anyway), if that's what you're going for.
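As a rough illustration of that XOR idea (a sketch with my own names, not a recommendation for real security):
// XOR the data against a fixed random 'base'; applying the same operation to the
// output with the same base returns the original bytes. Output is never smaller than input.
static byte[] XorWithBase(byte[] data, byte[] baseBytes)
{
    var result = new byte[data.Length];
    for (int i = 0; i < data.Length; i++)
        result[i] = (byte)(data[i] ^ baseBytes[i % baseBytes.Length]);
    return result;
}

byte[] baseBytes = { 0x5A, 0x13, 0xC7, 0x88 };   // the fixed random 'base' (generated once in practice)
byte[] key = XorWithBase(System.Text.Encoding.UTF8.GetBytes("ABC"), baseBytes);
// XorWithBase(key, baseBytes) returns the bytes of "ABC" again.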
Like your previous question, you're not really being clear about what you want. Why not just ask a question about how to achieve your end goals, and let people provide answers describing how, or tell you why it's not possible.
Here are two cases:

1. Lossless compression (the exact bytes are decoded from the compressed info). Here Shannon's source coding theorem states that no algorithm can compress data to a rate better than the information entropy of the source predicts (a rough sketch of that bound follows this list).

2. Lossy compression (some original bytes are lost forever in the compression scheme, as in JPG image files; remember the 'image quality' setting?). With this type of compression you can keep making the scheme better and better, with the penalty that you lose more and more of the original bytes. (Taken to the extreme, you get compression down to zero bytes, where nothing is restored afterwards; that scheme has already been invented too: the magical DELETE button, which moves the information into a black hole. Sorry for the sarcasm.)
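Here is the rough sketch promised above (my own code, treating the buffer as an i.i.d. byte source and using the usual formula: bits per byte = -sum of p*log2(p)):
// Estimate the smallest size, in bits, any lossless scheme could reach for this buffer
// if each byte were drawn independently from its observed distribution.
static double EntropyLowerBoundBits(byte[] data)
{
    var counts = new int[256];
    foreach (byte b in data) counts[b]++;

    double bitsPerByte = 0;
    foreach (int c in counts)
    {
        if (c == 0) continue;
        double p = (double)c / data.Length;
        bitsPerByte -= p * Math.Log(p, 2);    // -p * log2(p)
    }
    return bitsPerByte * data.Length;
}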
I am working in C#, trying the code below:
byte[] buffer = new byte[str.Length];
buffer = Encoding.UTF8.GetBytes(str);
In str I've got lengthy data, but I'm having a problem getting the complete encoded bytes.
Please tell me what's going wrong and how I can overcome this problem.
Why are you creating a new byte array and then ignoring it? The value of buffer before the call to GetBytes is being replaced with a reference to a new byte array returned by GetBytes.
However, you shouldn't expect the UTF-8 encoded version of a string to be the same length in bytes as the original string's length in characters, unless it's all ASCII. Any character over U+007F takes up at least 2 bytes.
What's the bigger picture here? What are you trying to achieve, and why does the length of the byte array matter to you?
The proper use is:
byte[] buffer = Encoding.UTF8.GetBytes(str);
In general, you should not make any assumptions about length/size/count when working with encoding, bytes and chars/strings. Let the Encoding objects do their work and then query the resulting objects for that info.
Having said that, I don't believe there is an inherent length restriction for the encoding classes. I have several production apps doing the same work in the opposite direction (bytes encoded to chars) which are processing byte arrays in the 10s of megabytes.
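To illustrate that with a quick sketch of my own, the byte count simply differs from the character count for non-ASCII data, and nothing gets cut off:
string str = new string('\u00E9', 100000);                              // 100,000 'é' characters
byte[] buffer = System.Text.Encoding.UTF8.GetBytes(str);               // 200,000 bytes, not str.Length
Console.WriteLine(System.Text.Encoding.UTF8.GetByteCount(str));        // 200000
Console.WriteLine(System.Text.Encoding.UTF8.GetString(buffer) == str); // True: the round trip is complete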