Using my own security algorithm on a TCP connection - C#

I know there are several techniques out there to encrypt data. I am not familiar with them, so I have been thinking of a way to make my application more secure. I basically have a server application and a client application; the client sends data to the server. If you are familiar with this protocol, you'll know that everything that gets written to the network stream is received by the other party. I am basically sending bytes, so my algorithm is something like this:
I have a byte array ready to be sent to the server. Modify that byte array: to all bytes that happen to have values greater than 0 and less than 50, add 5; to all bytes greater than 49 and less than 100, add 2; and keep doing the same for the rest of the byte ranges.
Then on the server side I will apply the reverse transformation.
Will this be secure? How would someone sniffing packets be able to find out what I am sending?
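For reference, a minimal sketch of the transform described above, assuming the same shift-by-range rule continues for the remaining byte values:

```csharp
// Sketch of the range-based shift described in the question.
// Note it is not reversible: 47 + 5 = 52 and 50 + 2 = 52 collide.
static byte Obfuscate(byte b)
{
    if (b > 0 && b < 50) return (byte)(b + 5);
    if (b > 49 && b < 100) return (byte)(b + 2);
    return b; // the remaining ranges would get their own offsets
}
```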
Edit
Thanks, guys, for the help. I have been thinking about algorithms and came up with several:
Technique 1
Let's say I want to send the byte[] {5, 89, 167, 233, 23, 48, 79}.
First step: I will insert a random byte at index 0 of the array,
so now the new byte array is {X, 5, 89, 167, 233, 23, 48, 79}.
Let's assume that X came out to be 75.
If X is greater than -1 and less than 50, I will apply algorithm 2 to the array. If it is greater than 49 and less than 100, I will apply algorithm 3 to it... etc.
In this case we will use algorithm 3:
so algorithm 3 will basically change the order of every 3 consecutive bytes (last, first, second). Padded with nulls, the array becomes {X, 167 (last item of the first three consecutive bytes), 5 (first item), 89 (second item), 48 (last item of the next three consecutive bytes), 233 (first), 23 (second), null, 79, null}.
Get rid of the null bytes in order to get the bytes I actually send: {X, 167, 5, 89, 48, 233, 23, 79}.
------->
Now the server will get {X, 167, 5, 89, 48, 233, 23, 79}. Recall that X was 75, therefore it will apply algorithm 3 to decrypt; it is basically the same thing in reverse order.
So it will produce {5 (the second item of the first three consecutive bytes), 89 (the last item), 167 (the first item of those three bytes), 233 (the second item of the next three bytes), 23 (the last item), 48 (the first item), 79}.
Then the server will have {5, 89, 167, 233, 23, 48, 79}.
If X had been 1, for instance, I would do the same thing but in chunks of 2 instead of 3, basically swapping pairs of bytes. If X had been 130, I would do the same thing in chunks of 4, and so on.
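Here is a rough sketch of technique 1 in code (the class and method names are made up for illustration, and the trailing partial chunk is copied unchanged instead of being padded with nulls):

```csharp
using System;
using System.Linq;

static class Technique1
{
    static readonly Random Rng = new Random();

    // marker byte ranges pick the chunk size, as described above
    static int ChunkSizeFor(byte marker) =>
        marker < 50 ? 2 : marker < 100 ? 3 : 4;

    public static byte[] Encode(byte[] data)
    {
        byte marker = (byte)Rng.Next(0, 256);
        int n = ChunkSizeFor(marker);
        var output = new byte[data.Length + 1];
        output[0] = marker;
        for (int i = 0; i < data.Length; i += n)
        {
            int len = Math.Min(n, data.Length - i);
            if (len == n)
            {
                // rotate the chunk right by one: last byte moves to the front
                output[i + 1] = data[i + len - 1];
                for (int j = 0; j < len - 1; j++)
                    output[i + 2 + j] = data[i + j];
            }
            else
            {
                // trailing partial chunk is copied unchanged
                Array.Copy(data, i, output, i + 1, len);
            }
        }
        return output;
    }

    public static byte[] Decode(byte[] received)
    {
        int n = ChunkSizeFor(received[0]);
        var data = received.Skip(1).ToArray();
        var output = new byte[data.Length];
        for (int i = 0; i < data.Length; i += n)
        {
            int len = Math.Min(n, data.Length - i);
            if (len == n)
            {
                // undo the rotation: first byte moves back to the end
                for (int j = 0; j < len - 1; j++)
                    output[i + j] = data[i + 1 + j];
                output[i + len - 1] = data[i];
            }
            else
            {
                Array.Copy(data, i, output, i, len);
            }
        }
        return output;
    }
}
```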
I am not going to post the next technique here. I may come up with several techniques; I love algorithms, lol.
I have to say that I agree with all of you; let me show you why...
I think I have to consider what a hacker would do. I would probably be a bad hacker since I don't know about encryption, but I thought about this. OK, I am a hacker and I want to be able to see what is being sent through the network. So if I am the hacker and I see {X, 167, 5, 89, 48, 233, 23, 79}, I will not be able to tell anything, right? But since I am a clever hacker, I get hold of the program that streams those bytes in order to try to figure it out. I will then use the program to send something easy, such as a file that contains the bytes {0, 1, 2, 3, 4, 5, 6}.
By sending these bytes several times, the hacker is going to be able to see stuff like:
{45, 1, 0, 3, 2, 5, 4, 6}
then maybe
{44, 1, 0, 3, 2, 5, 4, 6}
... etc.
From that point of view, I now understand why it might be easy to figure out.

Good encryption CAN'T depend on the secrecy of the algorithm; it must depend on the key! Encryption algorithms are well-known standards and rely on the secrecy of the key, not of the algorithm!

First off, your scheme can't be decrypted, since e.g. 47 becomes 52 and 50 also becomes 52. Second, it's insecure, since anybody with your algorithm can easily decode your ciphertext (well, at least as well as you can, given that not even you can decode all messages). Moreover, a simple frequency-based attack would work, since this is essentially a substitution cipher...

I am not familiar with them, so I have been thinking of a way to make my application more secure.
Stop right there. Being unfamiliar with solutions that already exist isn't a reason to try to invent, from scratch, a new solution that would have to be at least as secure as the ones that already exist. Your efforts should be directed towards becoming familiar with at least one of those solutions, i.e. SSL/TLS. I assure you it is infinitely more secure than anything you are likely to come up with in the short term.
And of course as you have just published your algorithm, it is already insecure.
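For example, a minimal client-side sketch using SslStream over a TcpClient; the host name and port are placeholders, the server needs a certificate, and real code should handle validation and errors:

```csharp
using System.Net.Security;
using System.Net.Sockets;

using var client = new TcpClient("example.com", 443);   // placeholder host/port
using var ssl = new SslStream(client.GetStream());
ssl.AuthenticateAsClient("example.com");                 // TLS handshake, validates the server certificate
byte[] payload = { 5, 89, 167, 233, 23, 48, 79 };
ssl.Write(payload, 0, payload.Length);                   // encrypted on the wire
```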

Related

Identifying repeating sequences of data in byte array

Given a sample of hexadecimal data, I would like to identify UNKNOWN sequences of bytes that are repeated throughout the sample (not searching for a known string or value). I am attempting to reverse-engineer a network protocol, and I am working on determining data structures within the packet. As an example of what I'm trying to do (albeit on a smaller scale):
(af:b6:ea:3d:83:02:00:00):{21:03:00:00}:[b3:49:96:23:01]
{21:03:00:00}:(af:b6:ea:3d:83:02:00:00):01:42:00:00:00:00:01:57
And
(38:64:88:6e:83:02:00:00):{26:03:00:00}:[b3:49:96:23:01]
{26:03:00:00}:(38:64:88:6e:83:02:00:00):01:42:00:00:00:00:00:01
Obviously, these are easy to spot by eye, but patterns that are hundreds of chars into the data are not. I'm not expecting a magic bullet for the solution, just a nudge in the right direction, or even better, a premade tool.
I'm currently needing this for a C# project, but I am open to any and all tools.
If you have no idea what you are looking for, you could get an idea of the layout of the data by performing a negative entropy analysis on a reasonably large sample of conversations to see the length of the records/sub-records.
If the data is structured with repeated sequences of roughly the same length and content type, you should see clusters of values with nearly the same negative entropy around the length of the records and sub-records.
For example, if you put a basic file with a lot of the same data through that, you should see values around the average record length with comparable negentropies (e.g. if you use a CSV file with an average line length of 117 bytes, you might see 115, 116, 117 & 119 with the highest negentropy), and values around the most common field lengths with the same negentropy.
You might do a byte occurrence scan to see which byte values are likely separators.
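A minimal sketch of such a byte-occurrence scan in C#; the sample array is assumed to hold the captured bytes:

```csharp
using System;
using System.Linq;

// Count how often each byte value appears and print the most frequent ones,
// which are likely candidates for separators or field markers.
static void ByteOccurrenceScan(byte[] sample)
{
    var counts = new int[256];
    foreach (byte b in sample)
        counts[b]++;

    foreach (int v in Enumerable.Range(0, 256)
                                .OrderByDescending(v => counts[v])
                                .Take(10))
        Console.WriteLine($"0x{v:X2}: {counts[v]} occurrences");
}
```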
There is a free hex editor with sources which does that for you (Hexplorer, in the Crypto/Find Pattern menu). You may have to change the default font through Options to actually see anything in the UI.

What is the most significant byte of a 160-bit hash for arithmetic operations?

Could somebody help me understand what the most significant byte of a 160-bit (SHA-1) hash is?
I have C# code which calls the cryptography library to calculate a hash from a data stream. As the result I get a 20-byte C# array. Then I calculate another hash from another data stream, and then I need to place the hashes in ascending order.
Now I'm trying to understand how to compare them correctly. Apparently I need to subtract one from the other and then check whether the result is negative, positive or zero. Technically, I have two 20-byte arrays which, looked at from the memory perspective, have the least significant byte at the beginning (lower memory address) and the most significant byte at the end (higher memory address). On the other hand, from the human-reading perspective the most significant byte is at the beginning and the least significant byte is at the end, and if I'm not mistaken this is the order used for comparing GUIDs. Of course, the two approaches give different orderings. Which way is considered the right or conventional one for comparing hashes? It is especially important in our case because we are thinking about implementing a distributed hash table which should be compatible with existing ones.
You should think of the initial hash as just bytes, not a number. If you're trying to order them for indexed lookup, use whatever ordering is simplest to implement - there's no general purpose "right" or "conventional" here, really.
If you've got some specific hash table you want to be "compatible" with (not even sure what that would mean) you should see what approach to ordering that hash table takes, assuming it's even relevant. If you've got multiple tables you need to be compatible with, you may find you need to use different ordering for different tables.
Given the comments, you're trying to work with Kademlia, which, based on this document, treats the hashes as big-endian numbers:
Kademlia follows Pastry in interpreting keys (including nodeIDs) as bigendian numbers. This means that the low order byte in the byte array representing the key is the most significant byte and so if two keys are close together then the low order bytes in the distance array will be zero.
That's just an arbitrary interpretation of the bytes - so long as everyone uses the same interpretation, it will work... but it would work just as well if everyone decided to interpret them as little-endian numbers.
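For example, a minimal sketch of comparing two 20-byte hashes under that big-endian interpretation (byte 0 treated as the most significant):

```csharp
// Compares two equal-length hashes as big-endian unsigned numbers.
static int CompareBigEndian(byte[] a, byte[] b)
{
    for (int i = 0; i < a.Length; i++)   // byte 0 is the most significant
    {
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    }
    return 0;
}
```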
You can use SequenceEqual to compare byte arrays; check the following links for more details:
How to compare two arrays of bytes
Comparing two byte arrays in .NET

Compress small string

Is there any way to compress small strings (86 chars) into something smaller?
#a#1\s\215\c\6\-0.55955,-0.766462,0.315342\s\1\x\-3421.-4006,3519.-4994,3847.1744,sbs
The only way I can see is to replace recurring characters with a unique character.
But I can't find anything about that on Google.
Thanks for any reply.
http://en.wikipedia.org/wiki/Huffman_coding
Huffman coding would probably be a pretty good start. In general the idea is to replace individual characters with the smallest bit pattern needed to replicate the original string or dataset.
You'll want to run statistical analysis on a variety of 'small strings' to find the most common characters, so that the more common characters are represented with the smallest unique bit patterns. And possibly make up an 'example' small string with every character that will need to be represented (like a-z0-9#.0-).
I took your example string of 85 bytes (not 83 since it was copied verbatim from the post, perhaps with some intended escapes not processed). I compressed it using raw deflate, i.e. no zlib or gzip headers and trailers, and it compressed to 69 bytes. This was done mostly by Huffman coding, though also with four three-byte backward string references.
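For reference, a minimal sketch of producing raw deflate output (no zlib/gzip headers) in C# with DeflateStream; the exact compressed length will vary with the implementation and settings, so it won't necessarily match the 69 bytes quoted above:

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

static byte[] RawDeflate(string s)
{
    using var output = new MemoryStream();
    using (var deflate = new DeflateStream(output, CompressionLevel.Optimal))
    {
        byte[] bytes = Encoding.ASCII.GetBytes(s);
        deflate.Write(bytes, 0, bytes.Length);   // raw deflate, no headers or trailers
    }
    return output.ToArray();                     // compressed bytes
}
```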
The best way to compress this sort of thing is to use everything you know about the data. There appears to be some structure to it and there are numbers coded in it. You could develop a representation of the expected data that is shorter. You can encode it as a stream of bits, and the first bit could indicate that what follows is straight bytes in the case that the data you got was not what was expected.
Another approach would be to take advantage of previous messages. If this message is one of a stream of messages, and they all look similar to each other, then you can make a dictionary of previous messages to use as a basis for compression, which can be reconstructed at the other end from the previous messages received. That may offer dramatically improved compression if the messages really are similar.
You should look up RUN-LENGTH ENCODING. Here is a demonstration:
rrrrrunnnnnn BECOMES 5r1u6n. What? Truncate repetitions: for x consecutive r's, write xr.
Now what if some of the characters are digits? Then instead of using x, use the character whose ASCII value is x. For example,
if you have 43 consecutive P's, write +P because '+' has ASCII code 43. If you have 49 consecutive y's, write 1y because '1' has ASCII code 49.
Now the catch, which you will find with all compression algorithms, is a string with little or no repetition; in that case the encoded output may be longer than the original. But that's true for all compression algorithms.
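A minimal sketch of the count-then-character encoding demonstrated above ("rrrrrunnnnnn" becomes "5r1u6n"), using decimal counts rather than the ASCII-value trick:

```csharp
using System.Text;

static string RunLengthEncode(string input)
{
    var sb = new StringBuilder();
    for (int i = 0; i < input.Length; )
    {
        // count how many times the current character repeats
        int run = 1;
        while (i + run < input.Length && input[i + run] == input[i])
            run++;
        sb.Append(run).Append(input[i]);   // e.g. "5r" for five r's
        i += run;
    }
    return sb.ToString();
}
```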
NOTE:
I don't encourage using Huffman coding because even if you use the Ziv-Lempel implementation, it's still a lot of work to get it right.

C# Compress a string of characters

Say 1 character is 1 byte, so if I have 10 characters, that is 10 bytes.
I have a sentence which consists of 20 characters, and I need to upload this sentence to a server, but the limit is only 10 bytes. How do I compress this sentence from 20 bytes down to 10 bytes?
Is there any way I can do this in C#?
EDIT
I have a 170-character sentence, and I need to compress it so that it comes out at around 130 characters. I am uploading this sentence to a third-party server, so I don't have any control over the server. Can this be done?
Well you can't do it in a guaranteed way, no. There are far more possible sequences of 20 bytes than there are sequences of 10 bytes - so you can't possibly compress every sequence of 20 bytes reversibly into 10 bytes.
In general compression doesn't typically work very well with very small input lengths.
If you know that all your input will actually be (say) A-Z and space (i.e. 27 characters) then that's 5 bits... so you only need 100 bits in total. That's still a bit more than the 80 bits you have available, so you still couldn't guarantee to represent all sentences. You could make "common" characters shorter than "unusual" characters though, and get many sentences to work that way.
It's hard to be more specific without knowing what you really need to achieve, given the impossibility of the original requirement.
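As an illustration of the 5-bits-per-character idea, here is a sketch assuming the input only contains space and A-Z; 20 such characters pack into 13 bytes, which is still more than the 10 allowed:

```csharp
using System;
using System.Collections.Generic;

static byte[] PackFiveBits(string text)
{
    var bits = new List<bool>();
    foreach (char c in text.ToUpperInvariant())
    {
        int code = c == ' ' ? 0 : c - 'A' + 1;   // map space/A-Z to 0..26
        if (code < 0 || code > 26)
            throw new ArgumentException($"Unsupported character '{c}'");
        for (int b = 4; b >= 0; b--)             // emit 5 bits per character
            bits.Add(((code >> b) & 1) == 1);
    }
    var packed = new byte[(bits.Count + 7) / 8]; // pack the bit list into bytes
    for (int i = 0; i < bits.Count; i++)
        if (bits[i])
            packed[i / 8] |= (byte)(1 << (7 - i % 8));
    return packed;
}
```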
What you want should be possible most of the time, but I can guarantee problems. If you wrote a method using the GZipStream class, it could take this 170-byte string you have and reduce it. Like most people have said, the compression ratio really depends on the content itself.
Just as a test:
I took a string of "0123456789" repeating 17 times (for 170 characters), gzipped it and it reduced to 21 characters.
If I take a string of 170 zeros and gzip it, it gets reduced to 12 characters.
I took 170 bytes of random code, and it got reduced down to 79 characters.
So in these cases it would compress down to fit your space requirements, but there's no way to predict when and how often it wouldn't. The compression ratio may end up close to 1:1, and there is an inherent overhead in creating the block structure, so it could actually result in a compressed length slightly larger than the original. Then again, you may have to base64-encode the whole thing to make it store properly in the DB, which would increase your overhead even more.
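For example, a minimal sketch of the GZipStream test described above; the compressed length depends entirely on the content, and for short or random input it can come out larger than the original:

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

static int GzipLength(string s)
{
    using var output = new MemoryStream();
    using (var gzip = new GZipStream(output, CompressionLevel.Optimal))
    {
        byte[] bytes = Encoding.UTF8.GetBytes(s);
        gzip.Write(bytes, 0, bytes.Length);
    }
    return output.ToArray().Length;   // compressed size in bytes
}
```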
You can't; the compression ratio depends on the content of the string itself.
And even if you can compress the sequence, you must implement the decompression on the server too. But if you have access to the server you can simply divide the sequence into several parts.
You have a serious problem here. Twenty bytes is 160 bits, i.e. 2^160 possible messages. Ten bytes is 80 bits, i.e. 2^80 possible messages. Unless you have some way to reduce your source message space to at most 2^80 possible messages, you cannot do this.
If the messages are static, pass indices into an array containing the different messages it could be sending instead of passing the messages themselves. If they're dynamic, then it's simply not possible unless you can limit yourself to a restricted subset of ASCII and store multiple characters in one byte, or the string is extremely repetitive, in which case you could consider run-length encoding.

Compare two spectrograms to find the offset where they match - algorithm

I record a daily 2-minute radio broadcast from the Internet. There's always the same starting and ending jingle. Since the broadcast's exact time may vary by plus or minus 6 minutes, I have to record around 15 minutes of radio.
I wish to identify the exact times where those jingles occur in the 15-minute recording, so I can extract the portion of audio I want.
I have already started a C# application where I decode an MP3 to PCM data and convert the PCM data to a spectrogram, based on http://www.codeproject.com/KB/audio-video/SoundCatcher.aspx
I tried to use a cross-correlation algorithm on the PCM data, but the algorithm is very slow (around 6 minutes with a step of 10 ms), and on some occasions it fails to find the jingle start time.
Any ideas for algorithms to compare two spectrograms for a match? Or a better way to find the jingle start time?
Thanks,
Update (sorry for the delay)
First, thanks for all the answers; most of them were relevant and/or interesting ideas.
I tried to implement the Shazam algorithm proposed by fonzo, but failed to detect the peaks in the spectrogram. Here are three spectrograms of the starting jingle from three different recordings. I tried AForge.NET with the blob filter (but it failed to identify peaks), blurring the image and checking for differences in height, the Laplace convolution, slope analysis, and detecting the series of vertical bars (but there were too many false positives)...
In the meantime, I tried the Hough-style algorithm proposed by Dave Aaron Smith, where I calculate the RMS of each column. Yes, each column; it's O(N*M), but M << N (note that a column is around 8k samples). So overall it's not that bad; the algorithm still takes about 3 minutes, but it has never failed.
I could go with that solution, but if possible I would prefer the Shazam approach because it's O(N) and probably much faster (and cooler, too). So if any of you have an idea for an algorithm that always detects the same points in those spectrograms (it doesn't have to be peaks), please add a comment.
New Update
Finally, I went with the algorithm explained above. I tried to implement the Shazam algorithm but failed to find proper peaks in the spectrogram; the identified points were not consistent from one sound file to another. In theory, the Shazam algorithm is the solution for that kind of problem. The Hough-style algorithm proposed by Dave Aaron Smith was more stable and effective. I split around 400 files, and only 20 of them failed to split properly. Disk usage went from 8 GB to 1 GB.
Thanks for your help.
There's a description of the algorithm used by the Shazam service (which identifies a piece of music given a short, possibly noisy sample) here: http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf
From what I understood, the first step is to isolate peaks in the spectrogram (with some tweaks to ensure uniform coverage), which gives a "constellation" of (time, frequency) pairs from the initial spectrogram. Once that is done, the sample constellation is compared to the constellation of the full track by translating a window of the sample's length from the beginning to the end and counting the number of correlated points.
The paper then describes the technical solution they found to be able to do the comparison fast even with a huge collection of tracks.
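As a rough sketch of just the matching step described above (not the full Shazam scheme): given constellation points as (time step, frequency bin) pairs for the jingle and the recording, slide the jingle over the recording and count how many points line up at each offset. Peak detection and the hashing trick from the paper are omitted, and the tuple representation is only an assumption for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static int BestOffset(IList<(int t, int f)> jingle, IList<(int t, int f)> recording)
{
    var recordingSet = new HashSet<(int, int)>(recording);
    int bestOffset = 0, bestScore = -1;
    int maxT = recording.Max(p => p.t);
    for (int offset = 0; offset <= maxT; offset++)
    {
        // count jingle points that coincide with recording points at this offset
        int score = jingle.Count(p => recordingSet.Contains((p.t + offset, p.f)));
        if (score > bestScore) { bestScore = score; bestOffset = offset; }
    }
    return bestOffset;   // time offset with the most coinciding points
}
```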
I wonder if you could use a Hough transform. You would start by cataloging each step of the opening sequence. Let's say you use 10 ms steps and the opening sequence is 50 ms long. You compute some metric on each step and get
1 10 1 17 5
Now go through your audio and analyze each 10 ms step for the same metric. Call this array have_audio:
8 10 8 7 5 1 10 1 17 6 2 10...
Now create a new empty array that's the same length as have_audio. Call it start_votes. It will contain "votes" for the start of the opening sequence. If you see a 1, you may be in the 1st or 3rd step of the opening sequence, so you have 1 vote for the opening sequence starting 1 step ago and 1 vote for it starting 3 steps ago. If you see a 10, you have 1 vote for the opening sequence starting 2 steps ago; a 17 gives a vote for 4 steps ago, and so on.
So for that example have_audio, your votes will look like:
2 0 0 1 0 4 0 0 0 0 0 1 ...
You have a lot of votes at position 6, so there's a good chance the opening sequence starts there.
You could improve performance by not bothering to analyze the entire opening sequence. If the opening sequence is 10 seconds long, you could just search for the first 5 seconds.
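A rough sketch of that voting scheme; the metric values are assumed to be quantized so that exact equality makes sense, and a real implementation would allow approximate matches:

```csharp
// opening: metric per step of the known opening sequence
// audio:   metric per step of the recorded audio
static int[] VoteForStart(int[] opening, int[] audio)
{
    var votes = new int[audio.Length];
    for (int i = 0; i < audio.Length; i++)
        for (int k = 0; k < opening.Length; k++)
            if (i - k >= 0 && audio[i] == opening[k])
                votes[i - k]++;   // audio step i could be step k of the opening
    return votes;                 // the index with the most votes is the likely start
}
```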
Here is a good Python package that does just this:
https://code.google.com/p/py-astm/
If you are looking for a specific algorithm, good search terms to use are "acoustic fingerprinting" or "perceptual hashing".
Here's another python package that could also be used:
http://rudd-o.com/new-projects/python-audioprocessing/documentation/manuals/algorithms/butterscotch-signatures
If you already know the jingle sequence, you could analyse the correlation with that sequence instead of the cross-correlation between the full 15-minute tracks.
To quickly calculate the correlation against the (short) sequence, I would suggest using a Wiener filter.
Edit: a Wiener filter is a way to locate a signal in a sequence with noise. In this application, we are considering anything that is "not jingle" as noise (question for the reader: can we still assume that the noise is white and not correlated?).
(I found the reference I was looking for! The formulas I remembered were a little off, so I'll remove them now.)
The relevant page is Wiener deconvolution. The idea is that we can define a system whose impulse response h(t) has the same waveform as the jingle, and we have to locate the point in a noisy sequence where the system has received an impulse (i.e. emitted a jingle).
Since the jingle is known, we can calculate its power spectrum H(f), and since we can assume that a single jingle appears in a recorded sequence, we can say that the unknown input x(t) has the shape of a pulse, whose power density S(f) is constant at each frequency.
Given the knowledge above, you can use the formula on that page to obtain a "jingle-pass" filter (as in, only signals shaped like the jingle can pass) whose output is highest when the jingle is played.
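For reference, the Wiener deconvolution filter described on that page takes the form

```latex
G(f) = \frac{H^{*}(f)\,S(f)}{|H(f)|^{2}\,S(f) + N(f)},
\qquad
\hat{X}(f) = G(f)\,Y(f)
```

where H(f) is the spectrum of the jingle (the impulse response), S(f) the power density of the unknown input (constant, since it is modelled as a pulse), N(f) the noise power density, and Y(f) the spectrum of the recording; the peak of the inverse transform of the estimate marks where the jingle starts.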
