Shorten String from Byte Array - C#

I have a structure that I am converting to a byte array of length 37, then to a string from that.
I am writing a very basic activation type library, and this string will be passed between people. So I want to shorten it from length 37 to something more manageable to type.
Right now:
Convert the structure to a byte array,
Convert the byte array to a base-64 string (which is still too long) - roughly as in the sketch below.
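Here is a sketch of that pipeline; the ActivationData fields are made up, since the real structure isn't shown:

using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct ActivationData            // hypothetical stand-in for the real 37-byte structure
{
    public int CustomerId;
    public long ExpiryTicks;
    // ... more fields, 37 bytes in total ...
}

static class ActivationKey
{
    public static string ToKeyString(ActivationData data)
    {
        // Step 1: convert the structure to a byte array.
        byte[] bytes = new byte[Marshal.SizeOf(typeof(ActivationData))];
        GCHandle handle = GCHandle.Alloc(bytes, GCHandleType.Pinned);
        try
        {
            Marshal.StructureToPtr(data, handle.AddrOfPinnedObject(), false);
        }
        finally
        {
            handle.Free();
        }

        // Step 2: convert the byte array to a base-64 string (37 bytes -> 52 chars).
        return Convert.ToBase64String(bytes);
    }
}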
What is a good way to shorten this string, yet still maintain the data stored in it?
Thanks.

In the general case, going from an arbitrary byte[] to a string requires more data, since we assume we want to avoid non-printable characters. The only way to reduce it is to compress before the base-whatever (you can get a little higher than base-64, but not much - and it certainly isn't any more "friendly") - but compression won't really kick in for such a short size. Basically, you can't do that. You are trying to fit a quart in a pint pot, and that doesn't work.
You may have to rethink your requirements. Perhaps save the BLOB internally, and issue a shorter token (maybe 10 chars, maybe a guid) that is a key to the actual BLOB.
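A minimal sketch of that token idea, assuming an in-memory store (a real implementation would use a database table):

using System;
using System.Collections.Generic;

class ActivationStore
{
    // Hypothetical in-memory store; in practice this would be a database table.
    private readonly Dictionary<string, byte[]> _blobs = new Dictionary<string, byte[]>();

    // Store the full 37-byte payload and hand back a short, typeable token.
    public string Issue(byte[] payload)
    {
        // 10 chars taken from a GUID is just one way to get a short, unique-enough key.
        string token = Guid.NewGuid().ToString("N").Substring(0, 10).ToUpperInvariant();
        _blobs[token] = payload;
        return token;
    }

    public byte[] Redeem(string token) => _blobs[token];
}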

Data compression may be a possibility to check out, but you can't just compress a 40-byte message to 6 bytes (for example).
If the space of possible strings/types is limited, map them to a list (information coding).

I don't know of anything better than base-64 if you actually have to pass the value around and if users have to type it in.
If you have a central data store they can all access, you could just give them the ID of the row where you saved it. This of course depends on how "secret" this data needs to be.
But I suspect that if you're trying to use this for activation, you need them to have an actual value.
How will the string be passed? Can you expect users to perhaps just copy/paste? Maybe some time spent on clearing up superfluous line breaks that come from an email reader or even your "Copy from here" and "Copy to here" lines might bear more fruit!

Can the characters in your string contain non-printable chars? If so, you don't need to base64-encode the bytes; you can simply create the string from them directly (saving the 33% base-64 overhead):
string str = new string(byteArray.Select(b => (char)b).ToArray()); // requires using System.Linq
Also, are the values in the byte array restricted somehow? If they fall into a certain range (i.e., not all of the 256 possible values are used), you can consider packing two bytes into each character of the string.
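As one concrete special case of that packing idea: if every byte fits in 4 bits (values 0-15), two bytes can share a single character. A rough sketch (the class name and the even-length assumption are mine):

using System;
using System.Linq;

static class NibblePacker
{
    // Pack pairs of values 0-15 into one char each (assumes even length and all values < 16).
    public static string Pack(byte[] data) =>
        new string(Enumerable.Range(0, data.Length / 2)
                             .Select(i => (char)((data[2 * i] << 4) | data[2 * i + 1]))
                             .ToArray());

    // Reverse: split each char back into its two 4-bit values.
    public static byte[] Unpack(string packed) =>
        packed.SelectMany(c => new[] { (byte)(c >> 4), (byte)(c & 0x0F) }).ToArray();
}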

If you really have 37 bytes of non-redundant information, then you are out of luck. Compression may help in some cases, but if this is an activation key, I would recommend having keys of the same length (and compression will not enforce this).
If this code is going to be passed over e-mail, then I see no problem in having an even larger key. Another option might be to insert hyphens every 5-or-so characters, to break it into smaller chunks (e.g. XXXXX-XXXXX-XXXXX-XXXXX-XXXXX).
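The hyphen grouping is easy to do mechanically; a small sketch, assuming the key is already an alphanumeric string:

using System;
using System.Linq;

static class KeyFormatter
{
    // Insert a hyphen after every groupSize characters: "ABCDE-FGHIJ-KLMNO-PQRST-UVWXY".
    public static string Group(string key, int groupSize = 5) =>
        string.Join("-",
            Enumerable.Range(0, (key.Length + groupSize - 1) / groupSize)
                      .Select(i => key.Substring(i * groupSize,
                                                 Math.Min(groupSize, key.Length - i * groupSize))));
}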

Use a 160-bit hash and hope for no collisions? It would be much shorter. If you can use a look-up table, just use a 128-bit or even 64-bit incremental value - much, much shorter than your 37 bytes.
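For reference, a 160-bit SHA-1 digest in C# looks like the sketch below - but remember that a hash is one-way, so this only works if you never need to reconstruct the original 37 bytes from the key:

using System;
using System.Security.Cryptography;

static class ActivationHashing
{
    // Derive a 160-bit (20-byte) SHA-1 digest from the 37-byte payload.
    public static string ShortKey(byte[] payload)
    {
        using (var sha1 = SHA1.Create())
        {
            byte[] digest = sha1.ComputeHash(payload);   // 20 bytes = 160 bits
            return Convert.ToBase64String(digest);       // 28 chars instead of 52
        }
    }
}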

Related

What is the most significant byte of 160 bit hash for arithmetic operations?

Could somebody help me understand which is the most significant byte of a 160-bit (SHA-1) hash?
I have C# code which calls the cryptography library to calculate a hash code from a data stream. As the result I get a 20-byte C# array. Then I calculate another hash code from another data stream, and then I need to place the hash codes in ascending order.
Now I'm trying to understand how to compare them correctly. Apparently I need to subtract one from the other and then check whether the result is negative, positive or zero. Technically I have two 20-byte arrays. Looked at from the memory perspective, they have the least significant byte at the beginning (lower memory address) and the most significant byte at the end (higher memory address). Looked at from the human-reading perspective, on the other hand, the most significant byte is at the beginning and the least significant byte is at the end, and if I'm not mistaken that is the order used for comparing GUIDs. Of course the two approaches give different orderings. Which way is considered the right or conventional one for comparing hash codes? It is especially important in our case because we are thinking about implementing a distributed hash table which should be compatible with existing ones.
You should think of the initial hash as just bytes, not a number. If you're trying to order them for indexed lookup, use whatever ordering is simplest to implement - there's no general purpose "right" or "conventional" here, really.
If you've got some specific hash table you want to be "compatible" with (not even sure what that would mean) you should see what approach to ordering that hash table takes, assuming it's even relevant. If you've got multiple tables you need to be compatible with, you may find you need to use different ordering for different tables.
Given the comments, you're trying to work with Kademlia, which, based on this document, treats the hashes as big-endian numbers:
Kademlia follows Pastry in interpreting keys (including nodeIDs) as bigendian numbers. This means that the low order byte in the byte array representing the key is the most significant byte and so if two keys are close together then the low order bytes in the distance array will be zero.
That's just an arbitrary interpretation of the bytes - so long as everyone uses the same interpretation, it will work... but it would work just as well if everyone decided to interpret them as little-endian numbers.
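If you do want an ordering, note that for equal-length arrays the big-endian interpretation is just a lexicographic comparison starting at index 0. A minimal comparer sketch (assuming both hashes are the same length):

using System.Collections.Generic;

class BigEndianHashComparer : IComparer<byte[]>
{
    public int Compare(byte[] x, byte[] y)
    {
        // Index 0 is treated as the most significant byte, as in the Kademlia/Pastry convention.
        for (int i = 0; i < x.Length; i++)
        {
            int diff = x[i].CompareTo(y[i]);
            if (diff != 0) return diff;
        }
        return 0;
    }
}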
You can use SequenceEqual to compare byte arrays for equality; see the following links for more details:
How to compare two arrays of bytes
Comparing two byte arrays in .NET
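For plain equality (rather than ordering), SequenceEqual from LINQ is enough - here hashA and hashB stand for your two 20-byte arrays:

using System.Linq;

bool sameHash = hashA.SequenceEqual(hashB);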

Encrypt larger string to smaller string using C# like reverse md5

We are trying to convert the text "HELLOWORLDTHISISALARGESTRINGCONTENT" into something smaller. When we hash it with MD5 we get 16 bytes, but since that is a one-way operation we cannot decrypt it back. Is there any other way to convert this large string to a smaller one and then recover the same data? If so, please let us know how to do it.
Thanks in advance.
Most compression algorithms won't be able to do much with a sequence that short (and may actually make it bigger) - so no: there isn't much you can do to magically shrink it. Your best bet would probably be to just generate a guid, store the full value keyed against the guid (in a database or whatever), and then use the short value as a one-time usage key to look up the long value (and then erase the record).
It heavily depends on the input data. In general - in the worst case - you can't reduce the size of a string through compression if the input data is short and has high entropy.
Hashing is the wrong approach, as a hash function maps large input data to a short output but does not (by itself) guarantee that you can't find a second set of data that maps to the same string.
What you can try is to implement a compression algorithm or a lookup table.
Compression can be done with ziplib or any other compression library (just google for it). The lookup approach requires a second place to store the lookup information. For example, when you get the first input string, you map it to the number 1 and save the information "1 maps to {input data}" somewhere else. For every subsequent data set you add another mapping entry. If the set of possible input data is finite, this approach may save you space.
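A sketch of that lookup approach - each distinct input string gets the next integer, and the mapping (here just kept in memory) is what lets you turn the number back into the string later:

using System.Collections.Generic;

class StringCodebook
{
    private readonly Dictionary<string, int> _toCode = new Dictionary<string, int>();
    private readonly List<string> _toValue = new List<string>();

    // Map an input string to a small integer code, adding it on first sight.
    public int Encode(string value)
    {
        if (!_toCode.TryGetValue(value, out int code))
        {
            code = _toValue.Count;
            _toCode[value] = code;
            _toValue.Add(value);
        }
        return code;
    }

    public string Decode(int code) => _toValue[code];
}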

Storing 16 bytes of String array in 4 bytes memory, (compression) in RFID Tags

I hope this question is not too vague. I am working on an RFID project and I am using passive tags. These tags store only 4 bytes of data - 32 bits. I am trying to store more information, as a string, in the tag's data bank. I searched the internet for string compression algorithms but didn't find any that were suitable. Could someone please guide me through this issue? How can I save more data in this 4-byte data bank - should I use some other storage strategy, and if so, what? I am using C# on a handheld Windows CE device.
I'd appreciate it if someone could help me.
It depends on your tag. For example, the Alien Higgs-3 tag (http://www.alientechnology.com/docs/products/Alien-Technology-Higgs-3-ALN-9662-Short.pdf) has EPC memory; I think you are currently using the EPC memory, but you can also use the tag's User Memory. Then you don't have to compress anything - just use the User Memory. That said, I would rather not store much data on the tag itself: I store my own 32-bit code on the tag and map it to the full data in my software, saving the data on disk. That is safer too.
There is obviously no compression that can reduce arbitrary 16-byte values to 4-byte values. That's mathematically impossible; see the pigeonhole principle for details.
Store the actual data in some kind of database and have the 4 bytes encode an integer that acts as a key for the row you want to refer to - for example an auto-increment primary key, or an index into an array. That works for up to 4 billion rows.
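A sketch of that idea, with a dictionary standing in for the database table; only the 4-byte key ever goes onto the tag:

using System;
using System.Collections.Generic;

class TagRegistry
{
    private readonly Dictionary<uint, string> _rows = new Dictionary<uint, string>();
    private uint _nextId;

    // Register the full payload and return the 4-byte key to write to the tag.
    public byte[] Register(string payload)
    {
        uint id = _nextId++;
        _rows[id] = payload;
        return BitConverter.GetBytes(id);                // exactly 4 bytes
    }

    // Read the 4 bytes back from the tag and resolve the full payload.
    public string Resolve(byte[] tagBytes) => _rows[BitConverter.ToUInt32(tagBytes, 0)];
}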
If you have less than 2^32 strings, simply enumerate them and then save the strings index (in your "dictionary") inside your 4 byte "Data Bank".
A compression scheme can't guarantee such high compression ratios.
The only way I can think of with 32-bits is to store an int in the 32-bits, and construct a local/remote URL out of it, which points to the actual data.
You could also make the stored value point to entries in a local look-up table on the device.
Unless you know a lot about the format of your string, it is impossible to do this. This is evident from the pigeonhole principle: you have a theoretical 2^128 different 16-byte strings, but only 2^32 different values to choose from.
In other words, no compression algorithm will guarantee that an arbitrary string in your possible input set will map to a 4-byte value in the output set.
It may be possible to devise an algorithm which will work in your particular case, but unless your data set is sufficiently restricted (at most 1 in 79,228,162,514,264,337,593,543,950,336 possible strings may be valid) and has a meaningful structure, then your only option is to store some mapping externally.

Compress small string

Is there any way to compress small strings (86 chars) into something smaller?
#a#1\s\215\c\6\-0.55955,-0.766462,0.315342\s\1\x\-3421.-4006,3519.-4994,3847.1744,sbs
The only way I can see is to replace recurring characters with a single unique character.
But I can't find anything about that on Google.
Thanks for any reply.
http://en.wikipedia.org/wiki/Huffman_coding
Huffman coding would probably be a pretty good start. In general, the idea is to replace individual characters with the smallest bit pattern needed to reproduce the original string or dataset.
You'll want to run a statistical analysis on a variety of 'small strings' to find the most common characters, so that the most common characters are represented with the shortest unique bit patterns. You may also want to make up an 'example' small string containing every character that will need to be represented (like a-z0-9#.0-).
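The statistical-analysis step is straightforward with LINQ; a small sketch that counts character frequencies across a set of sample strings (building the actual Huffman tree is left out):

using System.Collections.Generic;
using System.Linq;

static class FrequencyAnalysis
{
    // Count how often each character appears across a corpus of sample strings,
    // most common first - those should get the shortest Huffman codes.
    public static IEnumerable<KeyValuePair<char, int>> CharFrequencies(IEnumerable<string> samples) =>
        samples.SelectMany(s => s)
               .GroupBy(c => c)
               .Select(g => new KeyValuePair<char, int>(g.Key, g.Count()))
               .OrderByDescending(kv => kv.Value);
}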
I took your example string of 85 bytes (not 83 since it was copied verbatim from the post, perhaps with some intended escapes not processed). I compressed it using raw deflate, i.e. no zlib or gzip headers and trailers, and it compressed to 69 bytes. This was done mostly by Huffman coding, though also with four three-byte backward string references.
The best way to compress this sort of thing is to use everything you know about the data. There appears to be some structure to it and there are numbers coded in it. You could develop a representation of the expected data that is shorter. You can encode it as a stream of bits, and the first bit could indicate that what follows is straight bytes in the case that the data you got was not what was expected.
Another approach would be to take advantage of previous messages. If this message is one of a stream of messages, and they all look similar to each other, then you can make a dictionary of previous messages to use as a basis for compression, which can be reconstructed at the other end from the previous messages received. That may offer dramatically improved compression if the messages really are similar.
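For what it's worth, .NET's DeflateStream produces raw deflate output (no zlib or gzip headers or trailers), so a minimal compression sketch looks like this - though, as noted above, don't expect much gain on inputs this small:

using System.IO;
using System.IO.Compression;
using System.Text;

static class SmallStringDeflate
{
    public static byte[] Compress(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var output = new MemoryStream())
        {
            // DeflateStream writes raw deflate data, with no zlib/gzip wrapper.
            using (var deflate = new DeflateStream(output, CompressionMode.Compress))
                deflate.Write(raw, 0, raw.Length);
            return output.ToArray();
        }
    }
}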
You should look up run-length encoding. Here is a demonstration:
rrrrrunnnnnn becomes 5r1u6n. What happened? Repetitions are collapsed: for x consecutive occurrences of r, write xr.
Now what if some of the characters are digits? Then instead of writing the count x as a digit, use the character whose ASCII value is x. For example,
if you have 43 consecutive P, write +P because '+' has ASCII code 43. If you have 49 consecutive y, write 1y because '1' has ASCII code 49.
Now the catch, which you will find with all compression algorithms, is if you have a string with little or no repetitions. Then in that case your code may be longer than the original word. But that's true for all compression algorithms.
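A rough sketch of the refined scheme above (the run length stored as the character whose code is the count), with the assumption that no run exceeds 255 characters:

using System.Text;

static class RunLength
{
    // Encode each run of a character c with length n as (char)n followed by c.
    public static string Encode(string input)
    {
        var sb = new StringBuilder();
        int i = 0;
        while (i < input.Length)
        {
            char c = input[i];
            int run = 1;
            while (i + run < input.Length && input[i + run] == c && run < 255)
                run++;
            sb.Append((char)run).Append(c);
            i += run;
        }
        return sb.ToString();
    }

    // Decode pairs of (count char, value char) back into the original string.
    public static string Decode(string encoded)
    {
        var sb = new StringBuilder();
        for (int i = 0; i + 1 < encoded.Length; i += 2)
            sb.Append(encoded[i + 1], encoded[i]);
        return sb.ToString();
    }
}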
NOTE:
I don't encourage using Huffman coding because even if you use the Ziv-Lempel implementation, it's still a lot of work to get it right.

Dissolve string bytes into a fixed length formula based pattern by using keys, and even extract those bytes

Suppose there is a string containing 255 characters, and there is a fixed-length byte pattern of, say, 64-128 bytes. I want to "dissolve" that 255-character string, byte by byte, into the fixed-length byte pattern. The byte pattern is like a formula-based "hash" or something similar, into which a formula-based algorithm dissolves the bytes. Later, when I need to extract the dissolved bytes from that fixed-length pattern, I would use the same algorithm's reverse, or extract, function. The algorithm works with special keys or passwords: it uses them to dissolve the bytes into the pattern, and the same keys are used to extract the bytes, with their original values, from the pattern. I am asking the coders here for help. Please also guide me through the steps so I can understand what needs to be done. I only know VB.NET and C#.
For instance:
I have this three characters: "A", "B", "C"
The formula based fixed length super pattern (works like a whirlpool) is:
AJE83HDL389SB4VS9L3
Now I wish to "dissolve", "submerge" the characters "A", "B", "C", one by one into the above pattern to change it completely. After dissolving the characters, the super pattern changes drastically, just like the hash:
EJS83HDLG89DB2G9L47
I would be able to extract the characters, from the last dissolved character back to the first, by using an extraction algorithm and the original keys which were used to dissolve the characters into this super pattern. After the extraction of all the characters, the super pattern resets to its original initial state. Each character insert and removal has a unique pattern state.
After extraction of all characters, the super pattern goes back to the original state. This happens upon the removal of the character by the extraction algo:
AJE83HDL389SB4VS9L3
This looks a lot like your previous question(s). The problem with them is that you seem to start asking from a half-baked solution.
So, what do you really want? Input, output, constraints?
To encrypt a string, use encryption (Rijndael). To transform the resulting byte[] data into a string (for transport), use base64.
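A hedged sketch of that suggestion using the framework's AES (Rijndael) classes; key and IV handling is deliberately simplified here (the key must be 16, 24 or 32 bytes and the IV 16 bytes):

using System;
using System.Security.Cryptography;
using System.Text;

static class StringCipher
{
    // Encrypt the text with AES (the Rijndael cipher) and return base64 for transport.
    public static string Encrypt(string plainText, byte[] key, byte[] iv)
    {
        using (var aes = Aes.Create())
        using (var encryptor = aes.CreateEncryptor(key, iv))
        {
            byte[] plain = Encoding.UTF8.GetBytes(plainText);
            byte[] cipher = encryptor.TransformFinalBlock(plain, 0, plain.Length);
            return Convert.ToBase64String(cipher);
        }
    }

    public static string Decrypt(string base64, byte[] key, byte[] iv)
    {
        using (var aes = Aes.Create())
        using (var decryptor = aes.CreateDecryptor(key, iv))
        {
            byte[] cipher = Convert.FromBase64String(base64);
            byte[] plain = decryptor.TransformFinalBlock(cipher, 0, cipher.Length);
            return Encoding.UTF8.GetString(plain);
        }
    }
}

Note that encryption never makes the data shorter - the ciphertext (and its base64 form) will be at least as long as the input.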
If you're happy having the 'keys' for the individual bits of data being determined for you, this can be done similarly to a one-time-pad (though it's not one-time!) - generate a random string as your 'base', then xor your data strings with it. Each output is the 'key' to get the original data back, and the 'base' doesn't change. This doesn't result in output data that's any smaller than the input, however (and this is impossible in the general case anyway), if that's what you're going for.
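A minimal sketch of that xor idea (not a real one-time pad, as noted, since the random 'base' is reused; it also assumes the data is no longer than the base):

using System.Linq;

static class XorPad
{
    // XOR the data with a fixed random 'base'; applying the same operation again restores it.
    public static byte[] Apply(byte[] data, byte[] basePad) =>
        data.Select((b, i) => (byte)(b ^ basePad[i])).ToArray();
}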
Like your previous question, you're not really being clear about what you want. Why not just ask a question about how to achieve your end goals, and let people provide answers describing how, or tell you why it's not possible.
Here are the two cases:
Lossless compression (the exact bytes are decoded from the compressed data). In this case Shannon entropy clearly states that there can't be any algorithm that compresses data to rates better than the information entropy predicts.
Lossy compression (some original bytes are lost forever in the compression scheme, as used in JPG image files - remember the 'image quality' setting?). With this type of compression you can make better and better compression schemes, with the penalty that you lose more and more of the original bytes.
(Taken to the extreme of compressing everything down to zero bytes, from which nothing can be restored afterwards - but that compression has already been invented: the magical DELETE button, which moves information into a black hole. Sorry for the sarcasm.)
