C# Compress a string of characters

Say 1 character is 1 byte, so 10 characters take 10 bytes.
I have a sentence of 20 characters that I need to upload to a server, and the limit is only 10 bytes. How do I compress this 20-byte sentence down to 10 bytes?
Is there any way I can do this in C#?
EDIT
I have a 170-character sentence that I need to compress so that it takes up about 130 characters. I am uploading this sentence to a third-party server, so I don't have any control over the server. Can this be done?

Well you can't do it in a guaranteed way, no. There are far more possible sequences of 20 bytes than there are sequences of 10 bytes - so you can't possibly compress every sequence of 20 bytes reversibly into 10 bytes.
In general, compression doesn't work very well with very small inputs.
If you know that all your input will actually be (say) A-Z and space (i.e. 27 characters) then that's 5 bits... so you only need 100 bits in total. That's still a bit more than the 80 bits you have available, so you still couldn't guarantee to represent all sentences. You could make "common" characters shorter than "unusual" characters though, and get many sentences to work that way.
It's hard to be more specific without knowing what you really need to achieve, given the impossibility of the original requirement.
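
The 5-bits-per-character arithmetic above can be made concrete with a small sketch. It assumes the input contains only a space and the uppercase letters A-Z; the class name FiveBitPacker and its methods are hypothetical, chosen for illustration. A 20-character sentence packs into 100 bits, which rounds up to 13 bytes, still more than the 10 bytes allowed.

```csharp
using System;

public static class FiveBitPacker
{
    // Assumed alphabet: space plus A-Z, 27 symbols, so every index fits in 5 bits.
    private const string Alphabet = " ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Packs each character into 5 bits, most significant bit first.
    public static byte[] Pack(string text)
    {
        var output = new byte[(text.Length * 5 + 7) / 8];
        int bitPos = 0;
        foreach (char c in text)
        {
            int code = Alphabet.IndexOf(c);
            if (code < 0)
                throw new ArgumentException("Unsupported character: " + c);
            for (int bit = 4; bit >= 0; bit--, bitPos++)
            {
                if (((code >> bit) & 1) != 0)
                    output[bitPos / 8] |= (byte)(0x80 >> (bitPos % 8));
            }
        }
        return output;
    }

    // Reverses Pack. The character count must be supplied by the caller,
    // because the padding bits in the last byte carry no length information.
    public static string Unpack(byte[] data, int charCount)
    {
        var chars = new char[charCount];
        int bitPos = 0;
        for (int i = 0; i < charCount; i++)
        {
            int code = 0;
            for (int bit = 0; bit < 5; bit++, bitPos++)
            {
                int b = (data[bitPos / 8] >> (7 - bitPos % 8)) & 1;
                code = (code << 1) | b;
            }
            chars[i] = Alphabet[code];
        }
        return new string(chars);
    }
}
```

Getting below 13 bytes for a given sentence would need a variable-length code that gives common characters fewer bits, and even then only some sentences would fit.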

What you want should be possible most of the time, but I can guarantee problems. If you wrote a method using the GZipStream class, it could take this 170 byte string you have and reduce it. Like most people have said, the compression ratio really depends on the content itself.
Just as a test:
I took a string of "0123456789" repeating 17 times (for 170 characters), gzipped it and it reduced to 21 characters.
If I take a string of 170 zeros and gzip it, it gets reduced to 12 characters.
I took 170 bytes of random code, and it gets reduced down to 79 characters.
So in these cases, it would compress it down to fit into your space requirements; but there's no way to predict when and how often it wouldn't. The compression ratio may end up being 1:1 and there is an inherent overhead in creating the block structure, so it could actually result in compressed length of slightly larger than the original. Then again, you may have to base64 encode the whole thing to make it store properly in the DB, so that would increase your overhead even more.
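
A minimal sketch of the kind of GZipStream test described above, so the ratio can be measured on real input before relying on it. The class name GzipProbe is hypothetical, and UTF-8 is an assumed text encoding; the original answer does not say which encoding was used.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipProbe
{
    // Compresses the UTF-8 bytes of the text with gzip.
    public static byte[] Compress(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress, true))
            {
                gzip.Write(raw, 0, raw.Length);
            } // disposing the GZipStream flushes the final block and trailer
            return buffer.ToArray();
        }
    }

    // Restores the original text from gzip-compressed bytes.
    public static string Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Comparing GzipProbe.Compress(sentence).Length with the 130-byte budget for each sentence, and falling back to some other plan when it does not fit, is probably the most that can be promised. If the result must travel as text, Convert.ToBase64String adds roughly a third on top.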

You can't: the compression ratio depends on the content of the string itself.
And even if you can compress the sequence, the decompression has to be implemented on the server too. But if you have access to the server, you can simply split the sequence into several parts.

You have a serious problem here. Twenty bytes is 160 bits, which gives 2^160 possible messages. Ten bytes is 80 bits, which gives only 2^80 possible messages. Unless you have some way to reduce your source message space to at most 2^80 possible messages, you cannot do this.

If the messages are static, pass indices into an array containing the different messages it could be sending instead of passing the messages. If they're dynamic, then it's simply not possible unless you can limit yourself to a limited subset of ASCII and store multiple characters in one byte, or the string is extremely repetitive in which case you could consider Run-Length Encoding.

Related

Identifying repeating sequences of data in byte array

Given a sample of hexadecimal data, I would like to identify UNKNOWN sequences of bytes that are repeated throughout the sample. (Not searching for a known string or value) I am attempting to reverse engineer a network protocol, and I am working on determining data structures within the packet. As an example of what I'm trying to do (albeit on a smaller scale):
(af:b6:ea:3d:83:02:00:00):{21:03:00:00}:[b3:49:96:23:01]
{21:03:00:00}:(af:b6:ea:3d:83:02:00:00):01:42:00:00:00:00:01:57
And
(38:64:88:6e:83:02:00:00):{26:03:00:00}:[b3:49:96:23:01]
{26:03:00:00}:(38:64:88:6e:83:02:00:00):01:42:00:00:00:00:00:01
Obviously, these are easy to spot by eye, but patterns that are hundreds of chars into the data are not. I'm not expecting a magic bullet for the solution, just a nudge in the right direction, or even better, a premade tool.
I'm currently needing this for a C# project, but I am open to any and all tools.

If you have no idea what you are looking for, you could get an idea of the layout of the data by performing a negative entropy analysis on a reasonably large sample of conversations to see the length of the records and sub-records.
If the data is structured with repeated sequences of roughly the same length and content type, you should see clusters of values with nearly the same negative entropy around the length of the record and sub-records.
For example, if you put a basic file with a lot of the same data through that, you should see values around the average record length with comparable negentropies (e.g. if you use a CSV file with an average line length of 117 bytes, you might see 115, 116, 117 and 119 with the highest negentropy), and values around the most common field lengths with the same negentropy.
You might also do a byte occurrence scan to see which byte values are likely separators.
There is a free hex editor with sources which does that for you (hexplorer, in the Crypto/Find Pattern menu). You may have to change the default font through Options to actually see something in the UI.
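
A byte occurrence scan, plus a naive search for repeated fixed-length windows, can be sketched in a few lines. This is only a starting point under stated assumptions: the names ByteScan, Histogram and RepeatedSequences are hypothetical, the window length has to be guessed by the caller, and overlapping occurrences are counted.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ByteScan
{
    // Counts how often each byte value (0-255) occurs in the sample.
    // Unusually frequent values are candidates for separators or padding.
    public static int[] Histogram(byte[] sample)
    {
        var counts = new int[256];
        foreach (byte b in sample) counts[b]++;
        return counts;
    }

    // Counts every window of `length` consecutive bytes and returns the ones
    // seen more than once, keyed by their hex form (for example "21-03-00-00").
    public static Dictionary<string, int> RepeatedSequences(byte[] sample, int length)
    {
        var seen = new Dictionary<string, int>();
        for (int i = 0; i + length <= sample.Length; i++)
        {
            string key = BitConverter.ToString(sample, i, length);
            int n;
            seen.TryGetValue(key, out n);
            seen[key] = n + 1;
        }
        return seen.Where(p => p.Value > 1)
                   .ToDictionary(p => p.Key, p => p.Value);
    }
}
```

Running RepeatedSequences for several window lengths (say 4 through 8) over a few captured packets should surface fields like the {21:03:00:00} group from the question. For large captures a suffix array or a rolling hash would scale better than this dictionary of strings.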

Compress small string

Is there any way to compress small strings (86 chars) into something smaller?
#a#1\s\215\c\6\-0.55955,-0.766462,0.315342\s\1\x\-3421.-4006,3519.-4994,3847.1744,sbs
The only way I can see is to replace recurring characters with a unique character.
But I can't find anything about that on Google.
Thanks for any reply.

http://en.wikipedia.org/wiki/Huffman_coding
Huffman coding would probably be pretty good start. In general the idea is to replace individual characters with the smallest bit pattern needed to replicate the original string or dataset.
You'll want to run a statistical analysis on a variety of 'small strings' to find the most common characters, so that the more common characters are represented with the shortest unique bit patterns. You could also make up an 'example' small string containing every character that will need to be represented (like a-z0-9#.0-).

I took your example string of 85 bytes (not 83 since it was copied verbatim from the post, perhaps with some intended escapes not processed). I compressed it using raw deflate, i.e. no zlib or gzip headers and trailers, and it compressed to 69 bytes. This was done mostly by Huffman coding, though also with four three-byte backward string references.
The best way to compress this sort of thing is to use everything you know about the data. There appears to be some structure to it and there are numbers coded in it. You could develop a representation of the expected data that is shorter. You can encode it as a stream of bits, and the first bit could indicate that what follows is straight bytes in the case that the data you got was not what was expected.
Another approach would be to take advantage of previous messages. If this message is one of a stream of messages, and they all look similar to each other, then you can make a dictionary of previous messages to use as a basis for compression, which can be reconstructed at the other end from the previous messages received. That may offer dramatically improved compression if the messages really are similar.

You should look up RUN-LENGTH ENCODING. Here is a demonstration:
rrrrrunnnnnn BECOMES 5r1u6n. How? Collapse repetitions: for x consecutive r, write xr.
Now what if some of the characters are digits? Then instead of writing the number x, write the character whose ASCII value is x. For example,
if you have 43 consecutive P, write +P because '+' has ASCII code 43. If you have 49 consecutive y, write 1y because '1' has ASCII code 49.
Now the catch, which you will find with all compression algorithms, is if you have a string with little or no repetitions. Then in that case your code may be longer than the original word. But that's true for all compression algorithms.
NOTE:
I don't encourage using Huffman coding because even if you use the Ziv-Lempel implementation, it's still a lot of work to get it right.
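
The run-length scheme from this answer can be sketched as follows. This version writes the count as a decimal number, so it assumes the input itself contains no digits; the class name RunLength is hypothetical. The character-code variant described above would replace the decimal count with a single character.

```csharp
using System;
using System.Text;

public static class RunLength
{
    // Replaces each run of a repeated character with "<count><char>".
    public static string Encode(string text)
    {
        var sb = new StringBuilder();
        int i = 0;
        while (i < text.Length)
        {
            char c = text[i];
            int run = 1;
            while (i + run < text.Length && text[i + run] == c) run++;
            sb.Append(run).Append(c);
            i += run;
        }
        return sb.ToString();
    }

    // Reverses Encode. Assumes the original text contained no digits.
    public static string Decode(string encoded)
    {
        var sb = new StringBuilder();
        int i = 0;
        while (i < encoded.Length)
        {
            int run = 0;
            while (i < encoded.Length && encoded[i] >= '0' && encoded[i] <= '9')
            {
                run = run * 10 + (encoded[i] - '0');
                i++;
            }
            if (i >= encoded.Length)
                throw new FormatException("Count without a following character.");
            sb.Append(encoded[i], run);
            i++;
        }
        return sb.ToString();
    }
}
```

RunLength.Encode("rrrrrunnnnnn") gives "5r1u6n", while RunLength.Encode("abc") gives "1a1b1c", which is twice as long as the input. That is the catch mentioned above for strings with few repetitions.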

Shorten String from Byte Array

I have a structure that I am converting to a byte array of length 37, then to a string from that.
I am writing a very basic activation type library, and this string will be passed between people. So I want to shorten it from length 37 to something more manageable to type.
Right now:
Convert the structure to a byte array,
Convert the byte array to a base 64 string (which is still too long).
What is a good way to shorten this string, yet still maintain the data stored in it?
Thanks.

In the general case, going from an arbitrary byte[] to a string requires more data, since we assume we want to avoid non-printable characters. The only way to reduce it is to compress before the base-whatever (you can get a little higher than base-64, but not much - and it certainly isn't any more "friendly") - but compression won't really kick in for such a short size. Basically, you can't do that. You are trying to fit a quart in a pint pot, and that doesn't work.
You may have to rethink your requirements. Perhaps save the BLOB internally, and issue a shorter token (maybe 10 chars, maybe a guid) that is a key to the actual BLOB.

Data compression may be a possibility to check out, but you can't just compress a 40-byte message to 6 bytes (for example).
If the space of possible strings/types is limited, map them to a list (information coding).

I don't know of anything better than base-64 if you actually have to pass the value around and if users have to type it in.
If you have a central data store they can all access, you could just give them the ID of the row where you saved it. This of course depends on how "secret" this data needs to be.
But I suspect that if you're trying to use this for activation, you need them to have an actual value.
How will the string be passed? Can you expect users to perhaps just copy/paste? Maybe some time spent on clearing up superfluous line breaks that come from an email reader or even your "Copy from here" and "Copy to here" lines might bear more fruit!

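Can your string contain non-printable characters? If so, you don't need to base64-encode the bytes; you can simply create the string from them directly, which avoids base64's roughly 33% size overhead:
string str = new string(byteArray.Select(b => (char)b).ToArray()); // needs using System.Linq; Cast<char>() would throw InvalidCastException on a byte[]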
Also, are the values in the byte array restricted somehow? If they fall into a certain range (i.e., not all of the 256 possible values), you can consider stuffing two of each in each character of the string.

If you really have 37 bytes of non-redundant information, then you are out of luck. Compression may help in some cases, but if this is an activation key, I would recommend having keys of same length (and compression will not enforce this).
If this code is going to be passed over e-mail, then I see no problem in having an even larger key. Another option might be to insert hyphens every 5-or-so characters, to break it into smaller chunks (e.g. XXXXX-XXXXX-XXXXX-XXXXX-XXXXX).

Use a 160bit hash and hope no collisions? It would be much shorter. If you can use a look-up table, just use a 128 or even 64bit incremental value. Much much shorter than your 37 chars.

Using my own security algorithm on a tcp connection

I know there are several techniques out there to encrypt data. I am not familiar with them, so I have been thinking about a way to make my application more secure. I basically have a server application and a client application. The client application sends data to the server app over TCP. If you are familiar with this protocol, you'll know that everything that gets written to the network stream will be received by the other party. I am basically sending bytes, so my algorithm is something like this:
I have a byte array ready to be sent to the server. I modify that byte array: to all the bytes that happen to have values greater than 0 and less than 50, add 5. To all the bytes that are greater than 49 and less than 100, add 2. And keep doing the same thing for the rest of the bytes.
Then on the server side I will apply the reverse technique.
Will this be secure? How would someone sniffing packets be able to find out what I am sending?
Edit
Thanks guys for the help. I have been thinking about algorithms and I came up with several:
technique 1
Let's say I want to send the byte[] {5,89,167,233,23,48,79}
First step: I will add a random byte at index 0 of the array,
so now the new byte array is {X, 5, 89, 167, 233, 23, 48, 79}.
Let's assume that X came out to be 75.
If X is greater than -1 and less than 50, I will apply algorithm number 2 to it. If it is greater than 49 and less than 100, I will apply algorithm 3 to it, etc.
In this case we will use algorithm 3:
Algorithm 3 basically changes the order of every 3 consecutive bytes, so the bytes that I will actually send are: {X, 167 (last item of the first three consecutive bytes), 5 (first item), 89 (second item), 48 (last item of the next three consecutive bytes), 233 (first), 23 (second), null, 79, null}
Get rid of the null bytes in order to get {X, 167, 5, 89, 48, 233, 23, 79}
------->
Now the server will get {X, 167, 5, 89, 48, 233, 23, 79}. Recall that X was 75, therefore it will apply algorithm 3 to decrypt. It will be basically the same thing in reverse order.
So it will produce {5 (the second item of the first three consecutive bytes), 89 (the last item), 167 (the first item of those first three bytes),
233 (the second item of the next three bytes), 23, 48,
79}
Then the server will have 5, 89, 167, 233, 23, 48, 79.
If X had been 1, for instance, I would do the same thing but in chunks of 2 instead of chunks of three, basically flipping pairs of bytes. If X had been 130, I would do the same thing in chunks of 4, and so on.
I am not going to post the next technique. I may come up with several techniques; I love algorithms, lol.
I have to say that I agree with all of you; let me show you why...
I think I have to think about what a hacker would do. I would probably be a bad hacker since I don't know about encryption, but I thought about this. OK, I am a hacker and I want to be able to see what is being sent through the network. If I see {X, 167, 5, 89, 48, 233, 23, 79}, I will not be able to conclude anything, right? But since I am a clever hacker, I get hold of the program that streams those bytes in order to try to figure it out. I will then use the program to send something easy, such as a file that contains the bytes {0, 1, 2, 3, 4, 5, 6}.
By sending these bytes several times, the hacker is going to be able to see stuff like:
{45, 1, 0, 3, 2, 5, 4, 6}
then maybe
{44, 1, 0, 3, 2, 5, 4, 6}
.... etc
From that point of view, I now understand why it might be easy to figure out.

Good encryption CAN'T depend on the algorithm; it must depend on the key! The encryption algorithms are well-known standards and rely on the secrecy of the key, not of the algorithm!

First off, your scheme can't decrypt since e.g. 47 becomes 52 and 50 becomes 52, also. Second, it's insecure since anybody with your algorithm can easily decode your ciphertext (well, at least as well as you can, given that not even you can decode all messages). Moreover, a simple frequency-based approach would work since this is essentially a substitution cipher...
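
The collision mentioned above is easy to reproduce. The sketch below implements only the two rules the question spells out (add 5 to values 1 through 49, add 2 to values 50 through 99) and, as an assumption, leaves every other value unchanged, since the question does not specify the remaining ranges. The names ToyCipher, Scramble and PreimagesOf are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class ToyCipher
{
    // The two rules stated in the question. Values outside 1..99 are passed
    // through unchanged, which is an assumption made for this sketch.
    public static byte Scramble(byte b)
    {
        if (b > 0 && b < 50) return (byte)(b + 5);
        if (b > 49 && b < 100) return (byte)(b + 2);
        return b;
    }

    // Lists every input byte that scrambles to the given output byte.
    // More than one entry means the receiver cannot decode unambiguously.
    public static List<byte> PreimagesOf(byte output)
    {
        var inputs = new List<byte>();
        for (int value = 0; value <= 255; value++)
        {
            if (Scramble((byte)value) == output) inputs.Add((byte)value);
        }
        return inputs;
    }
}
```

ToyCipher.PreimagesOf(52) returns both 47 and 50, so a server that receives 52 has no way to tell which byte was sent.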

"I am not familiar with them so I been thinking on a way to make my application more secure."
Stop right there. Being unfamiliar with solutions that already exist isn't a reason to try to invent, from scratch, a new solution that is at least as secure as those solutions that do already exist. Your efforts should be directed towards becoming familiar with at least one of those solutions, e.g. SSL. I assure you it is infinitely more secure than anything you are likely to come up with in the short term.
And of course as you have just published your algorithm, it is already insecure.

Dissolve string bytes into a fixed length formula based pattern by using keys, and even extract those bytes

Suppose there is a string containing 255 characters, and there is a fixed-length byte pattern of, say, 64-128 bytes. I want to "dissolve" that 255-character string, byte by byte, into the fixed-length byte pattern. The byte pattern is like a formula-based "hash" or something similar, into which a formula-based algorithm dissolves the bytes. Later, when I need to extract the dissolved bytes from that fixed-length pattern, I would use the reverse of the same algorithm, an extract function. The algorithm works with special keys or passwords and uses them to dissolve the bytes into the pattern; the same keys are used to extract the bytes with their original values from the pattern. I ask for help from the coders here. Please also guide me through the steps so that I am able to understand what has to be done. I only know VB .NET and C#.
For instance:
I have this three characters: "A", "B", "C"
The formula based fixed length super pattern (works like a whirlpool) is:
AJE83HDL389SB4VS9L3
Now I wish to "dissolve", "submerge" the characters "A", "B", "C", one by one into the above pattern to change it completely. After dissolving the characters, the super pattern changes drastically, just like the hash:
EJS83HDLG89DB2G9L47
I would be able to extract the characters, from the last dissolved character back to the first, by using an extraction algorithm and the original keys which were used to dissolve the characters into this super pattern. Each character insertion and removal has a unique pattern state.
After all the characters have been extracted, the super pattern goes back to its original initial state:
AJE83HDL389SB4VS9L3

This looks a lot like your previous question(s). The problem with them is that you seem to start asking from a half-baked solution.
So, what do you really want? Input , Output, Constraints?
To encrypt a string, use encryption (Rijndael, i.e. AES). To transform the resulting byte[] data to a string (for transport), use base64.
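
A minimal sketch of that encrypt-then-base64 route, using the Aes class from System.Security.Cryptography with its default CBC mode and PKCS7 padding. The class name AesText, the choice to prepend the random IV to the ciphertext, and the way the key is supplied are all assumptions made for illustration; the sketch also omits authentication, which a real design would probably want.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class AesText
{
    // Encrypts UTF-8 text with AES and returns base64 of IV + ciphertext.
    // The key must be 16, 24 or 32 bytes long.
    public static string Encrypt(string plainText, byte[] key)
    {
        using (Aes aes = Aes.Create())
        {
            aes.Key = key;
            aes.GenerateIV(); // fresh random IV for every message
            byte[] iv = aes.IV;
            using (ICryptoTransform encryptor = aes.CreateEncryptor())
            {
                byte[] plain = Encoding.UTF8.GetBytes(plainText);
                byte[] cipher = encryptor.TransformFinalBlock(plain, 0, plain.Length);
                byte[] result = new byte[iv.Length + cipher.Length];
                Buffer.BlockCopy(iv, 0, result, 0, iv.Length);
                Buffer.BlockCopy(cipher, 0, result, iv.Length, cipher.Length);
                return Convert.ToBase64String(result);
            }
        }
    }

    // Reverses Encrypt: splits off the IV, then decrypts the remainder.
    public static string Decrypt(string base64, byte[] key)
    {
        byte[] data = Convert.FromBase64String(base64);
        using (Aes aes = Aes.Create())
        {
            aes.Key = key;
            byte[] iv = new byte[aes.BlockSize / 8];
            Buffer.BlockCopy(data, 0, iv, 0, iv.Length);
            aes.IV = iv;
            using (ICryptoTransform decryptor = aes.CreateDecryptor())
            {
                byte[] plain = decryptor.TransformFinalBlock(
                    data, iv.Length, data.Length - iv.Length);
                return Encoding.UTF8.GetString(plain);
            }
        }
    }
}
```

Note that the output is longer than the input, not shorter: encryption hides the content but does not make 255 characters fit into 64-128 bytes.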

If you're happy having the 'keys' for the individual bits of data being determined for you, this can be done similarly to a one-time-pad (though it's not one-time!) - generate a random string as your 'base', then xor your data strings with it. Each output is the 'key' to get the original data back, and the 'base' doesn't change. This doesn't result in output data that's any smaller than the input, however (and this is impossible in the general case anyway), if that's what you're going for.
Like your previous question, you're not really being clear about what you want. Why not just ask a question about how to achieve your end goals, and let people provide answers describing how, or tell you why it's not possible.
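
The XOR-with-a-fixed-base idea from this answer fits in one small method. The names XorMask and Apply are hypothetical, and the base is assumed to be at least as long as the data. As the answer says, the output is exactly as long as the input, and because the base is reused this is not a secure one-time pad.

```csharp
using System;

public static class XorMask
{
    // XORs each data byte with the corresponding byte of the fixed base.
    // Applying the method twice with the same base restores the original.
    public static byte[] Apply(byte[] data, byte[] baseBytes)
    {
        if (baseBytes.Length < data.Length)
            throw new ArgumentException("Base must be at least as long as the data.");
        var result = new byte[data.Length];
        for (int i = 0; i < data.Length; i++)
        {
            result[i] = (byte)(data[i] ^ baseBytes[i]);
        }
        return result;
    }
}
```

Here XorMask.Apply(data, baseBytes) produces the per-message 'key', and XorMask.Apply(key, baseBytes) gives the data back.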

There are two cases here.
Lossless compression (the exact bytes are recovered from the compressed data):
In this case Shannon's source coding theorem says that no algorithm can, on average, compress data below the size its information entropy predicts.
Lossy compression (some of the original bytes are lost forever, as in JPG image files; remember the 'image quality' setting?):
With this type of compression you can keep making the output smaller, at the price of losing more and more of the original bytes.
(Taken to the extreme, you can compress to zero bytes, after which zero bytes are restored. That compression has already been invented: the magical DELETE button, which moves information into a black hole. Sorry for the sarcasm.)
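
For the lossless case, a rough feel for the limit comes from estimating the entropy of the data. The sketch below computes the order-0 Shannon entropy in bits per byte from byte frequencies; the class name Entropy is hypothetical. This estimate treats bytes as independent, so it is only a bound for coders that encode one byte at a time. Data with repeated sequences can compress further than it suggests.

```csharp
using System;

public static class Entropy
{
    // Order-0 Shannon entropy of the data, in bits per byte (0.0 to 8.0).
    public static double BitsPerByte(byte[] data)
    {
        if (data.Length == 0) return 0.0;
        var counts = new int[256];
        foreach (byte b in data) counts[b]++;
        double h = 0.0;
        foreach (int c in counts)
        {
            if (c == 0) continue;
            double p = (double)c / data.Length;
            h -= p * Math.Log(p, 2);
        }
        return h;
    }
}
```

Multiplying Entropy.BitsPerByte(data) by data.Length and dividing by 8 gives an approximate byte count for a symbol-by-symbol coder. If that number is far above the 64-128 byte target, no reversible scheme of that kind will fit.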
