Similar Patterns Recognition

I'm trying to automate a workflow with software that assists the operator.
The operator looks at a chart built from a byte sequence read from binary log files, and if he recognizes certain patterns (in the form or shape of the line; the chart is a 2D line), he has to take some action.
I'm already able to acquire the logs and find these patterns if they match exactly (I'm using a string search algorithm), but I have no idea how to detect patterns that are only partial or similar matches.
Some typical cases:
1 the pattern that I'm looking for is present but with only some bytes altered, e.g.
1a-2b-e1-1b-1a-8c instead of 1a-2b-e1-0b-1a-8c
2 the pattern that I'm looking for is present but with an offset, like
1a-2b-e1 instead of 10-2a-e0
3 a mix of 1 and 2
Does anyone know a way to do that? I'm working in vb.net but any input would help.

a few things maybe worth looking at:
let's say our pattern is
01-23-45-67-89-A1
and our possible hit in the binary log looks like this:
02-23-46-66-00-89-A1
what happens when we calculate the absolute difference?
01-00-01-01-89-18
let's say we define a threshold of 01 per byte and a notation of XX for an accepted byte and RR for a rejected byte... what would be accepted and rejected?
XX-XX-XX-XX-RR-RR
now the index where the RR is beginning is interesting ... what happens if we skip this byte in the log?
02-23-46-66-00-89-A1 becomes
02-23-46-66-89-A1
abs difference again
01-00-01-01-00-00
acceptance would now be good...
XX-XX-XX-XX-XX-XX
the other way around, a byte could be missing in the log; in that case you can try inserting the pattern byte that belongs at the first RR position into the log at that position... example:
let's say our pattern again is
01-23-45-67-89-A1
and our possible hit in the binary log looks like this:
02-23-46-88-A1-00
abs diff
01-00-01-21-18-A1
acceptance
XX-XX-XX-RR-RR-RR
now we try and insert... so our log virtually looks like this
02-23-46-67-88-A1-00
abs diff again...
01-00-01-00-01-00
acceptance
XX-XX-XX-XX-XX-XX
of course there may be more types of errors ... like a one-off that neither skipping nor inserting will fix ...
calculate the original difference ... plus the difference if you skip and the difference if you insert ... take the best result (read: the one with the fewest RR bytes)
you will need to find suitable values for the threshold and for how many skips/insertions you want to allow ... or instead of a binary acceptance you could use some metric for similarity
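The skip/insert idea above could be sketched in C# roughly as follows. This is only an illustration under assumptions: the names FuzzyMatch, CountRejected and BestRejected are made up for the example, at most one skip or one insertion is tried (at the first rejected byte), and the window is taken one byte longer than the pattern so that a skip still leaves enough bytes to compare.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class FuzzyMatch
{
    // True when the window has no byte at index i, or the byte differs
    // from the pattern byte by more than the threshold (an "RR" byte).
    static bool IsRejected(byte[] pattern, IList<byte> window, int i, int threshold)
    {
        return i >= window.Count || Math.Abs(pattern[i] - window[i]) > threshold;
    }

    // Number of RR bytes when the pattern is laid over the start of the window.
    public static int CountRejected(byte[] pattern, IList<byte> window, int threshold)
    {
        int rejected = 0;
        for (int i = 0; i < pattern.Length; i++)
            if (IsRejected(pattern, window, i, threshold)) rejected++;
        return rejected;
    }

    // Fewest RR bytes among: direct comparison, skipping the log byte at the
    // first RR, and inserting the pattern byte at the first RR.
    public static int BestRejected(byte[] pattern, byte[] log, int offset, int threshold)
    {
        List<byte> window = log.Skip(offset).Take(pattern.Length + 1).ToList();
        int best = CountRejected(pattern, window, threshold);

        int first = 0;
        while (first < pattern.Length && !IsRejected(pattern, window, first, threshold)) first++;
        if (first == pattern.Length) return best; // nothing rejected

        // Variant 1: the log contains an extra byte, so skip it.
        var skipped = new List<byte>(window);
        if (first < skipped.Count) skipped.RemoveAt(first);
        best = Math.Min(best, CountRejected(pattern, skipped, threshold));

        // Variant 2: the log is missing a byte, so insert the pattern byte.
        var inserted = new List<byte>(window);
        inserted.Insert(Math.Min(first, inserted.Count), pattern[first]);
        best = Math.Min(best, CountRejected(pattern, inserted, threshold));

        return best;
    }
}
```

To scan a whole log you would call BestRejected for every offset and report the offsets where the result stays at or below the number of RR bytes you are willing to tolerate.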

Related

What algorithm can I use to find any valid result depending on variable integer inputs

In my project I face a scenario where I have a function with numerous inputs. At a certain point I am provided with a result and I need to find one combination of inputs that generates that result.
Here is some pseudocode that illustrates the problem:
Double y = f(x_0,..., x_n)
I am provided with y and I need to find any combination of inputs that produces it.
I tried several things on paper that could generate something, but each of my parameters has a range of 6.5 x 10^9 possible values, so I would like to get an optimal execution time.
Can someone name an algorithm or a topic that will be useful for me, so I can read up on how other people solved similar problems?
I was thinking along the lines of creating a vector from the inputs and judging how well that vector fits the problem. This sounds an awful lot like an NN, but there is no training phase available.
Edit:
Thank you all for the feedback. The comments sum up the problems I have, and I will try something along the lines of hill climbing.

The general case of your problem might be impossible to solve, but for some cases there are numerical methods that can help.
For example, in 1D space, if f is continuous and you can find one input whose result is smaller than y and another whose result is larger than y, you can use the regula falsi method to numerically find the "root" (which is y in your case, by simply invoking the method on f(x) - y).
Another numerical method for finding roots is Newton-Raphson.
I admit, I am not familiar with how to apply these methods in multidimensional space, but it could be a start. I'd search the literature for these if I were you.
Note: using such a method almost always requires some knowledge of the function.
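As an illustration of the 1D case, here is a minimal regula falsi sketch in C#. The class and parameter names (RootFinding, tolerance, maxIterations) are assumptions made for the example, and it only works if f is continuous and f(lo) - y and f(hi) - y have opposite signs.

```csharp
using System;

static class RootFinding
{
    // Looks for x in [lo, hi] with f(x) close to y using regula falsi
    // (false position) applied to g(x) = f(x) - y.
    public static double RegulaFalsi(Func<double, double> f, double y,
                                     double lo, double hi,
                                     double tolerance = 1e-9, int maxIterations = 200)
    {
        double gLo = f(lo) - y;
        double gHi = f(hi) - y;
        if (gLo == 0) return lo;
        if (gHi == 0) return hi;
        if (gLo * gHi > 0)
            throw new ArgumentException("y is not bracketed by f(lo) and f(hi).");

        double x = lo;
        for (int i = 0; i < maxIterations; i++)
        {
            // Intersection of the secant through (lo, gLo) and (hi, gHi) with zero.
            x = (lo * gHi - hi * gLo) / (gHi - gLo);
            double gx = f(x) - y;
            if (Math.Abs(gx) < tolerance) break;

            // Keep the sub-interval that still brackets the root.
            if (gx * gLo < 0) { hi = x; gHi = gx; }
            else { lo = x; gLo = gx; }
        }
        return x;
    }
}
```

For example, RootFinding.RegulaFalsi(x => x * x, 2, 0, 2) searches for an x between 0 and 2 whose square is 2.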
Another possible solution is to take g(X) = |f(X) - y| and use some heuristic algorithm to find a minimal value of g. The problem with heuristic methods is that they will get you "close enough" but will seldom get you exactly to the target (unless the function is convex).
Some optimization algorithms are: genetic algorithms, hill climbing, and gradient descent (where you can find the gradient numerically).
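A very small hill-climbing sketch for integer inputs might look like this in C#. Everything here is an assumption for illustration: the class name, the step size of 1 per input, and the stopping rule. It minimizes g(X) = |f(X) - y| and can stop in a local minimum without ever reaching g = 0.

```csharp
using System;

static class HillClimbing
{
    // Repeatedly nudges each input by +1 or -1 and keeps a change only if
    // it brings f(x) closer to y. Stops when no single step helps.
    public static long[] Minimize(Func<long[], double> f, double y,
                                  long[] start, int maxSweeps = 100000)
    {
        long[] x = (long[])start.Clone();
        double best = Math.Abs(f(x) - y);

        for (int sweep = 0; sweep < maxSweeps && best > 0; sweep++)
        {
            bool improved = false;
            for (int i = 0; i < x.Length; i++)
            {
                foreach (long delta in new long[] { 1, -1 })
                {
                    x[i] += delta;
                    double g = Math.Abs(f(x) - y);
                    if (g < best) { best = g; improved = true; }
                    else x[i] -= delta; // undo a step that did not help
                }
            }
            if (!improved) break; // local minimum reached
        }
        return x;
    }
}
```

With input ranges as large as those in the question, a fixed step of 1 would be far too slow; a real attempt would probably use larger or adaptive steps and random restarts.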

Searching for partial substring within string in C#

Okay, so I'm trying to make a basic malware scanner in C#. My question is: say I have the hex signature for a particular bit of code.
For example
{
System.IO.File.Delete(@"C:\Users\Public\DeleteTest\test.txt");
}
//Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c746573742e74787422293b
Gets changed to -
{
System.IO.File.Delete(@"C:\Users\Public\DeleteTest\notatest.txt");
}
//Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c6e6f7461746573742e74787422293b
Keep in mind these bits will be within the entire hex of the program. How could I take my base signature and look for partial matches, so that, say, a 90% match gets flagged?
I would do a wildcard, but that wouldn't work for slightly more complex things where the code might be written slightly differently but the majority would be the same. So is there a way I can do a percent match for a substring? I was looking into the Levenshtein distance but I don't see how I'd apply it to this scenario.
Thanks in advance for any input

Using an edit distance would be fine. You can take two strings and calculate the edit distance, which will be an integer value denoting how many operations are needed to take one string to the other. You set your own threshold based off that number.
For example, you may statically set that if the distance is less than five edits, the change is relevant.
You could also take the length of the string you are comparing and take a percentage of that. Your example is 36 characters long, so (int)(input.Length * 0.88m) would be a valid threshold.
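A minimal sketch of such an edit distance in C#, using the standard Levenshtein dynamic programme with two rows. The class name and the Similarity helper (1.0 for identical strings, 0.0 for completely different ones) are assumptions added for illustration, not part of any library.

```csharp
using System;

static class EditDistance
{
    // Number of single-character insertions, deletions and substitutions
    // needed to turn a into b.
    public static int Levenshtein(string a, string b)
    {
        int[] prev = new int[b.Length + 1];
        int[] curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1,      // deletion
                                            curr[j - 1] + 1), // insertion
                                   prev[j - 1] + cost);       // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }

    // Distance expressed as a fraction of the longer string's length.
    public static double Similarity(string a, string b)
    {
        int longest = Math.Max(a.Length, b.Length);
        return longest == 0 ? 1.0 : 1.0 - (double)Levenshtein(a, b) / longest;
    }
}
```

A check such as EditDistance.Similarity(signature, candidate) >= 0.9 would then express the "90% match" from the question, assuming candidate is already a slice of roughly the signature's length.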

First, your program bits should match EXACTLY or else it has been modified or is corrupt. Generally, you will store an MD5 hash on the original binary and check the MD5 against new versions to see if they are 'the same enough' (MD5 can't guarantee a 100% match).
Beyond this, in order to detect malware in a random binary, you must know what sort of patterns to look for. For example, if I know a piece of malware injects code with some binary XYZ, I will look for XYZ in the bits of the executable. Patterns get much more complex than that, of course, as the malware bits can be spread out in chunks. What is more interesting is that some viruses are self-morphing. This means that each time one runs, it modifies itself, so the scanner does not know an exact pattern to find. In these cases, the scanner must know the types of derivatives that can be produced and look for all of them.
In terms of finding a % match, this operation is very time consuming unless you have constraints. By comparing 2 strings, you cannot tell which pieces were removed, added, or replaced. For instance, if I have a starting string 'ABCD', is 'AABCDD' a 100% match or less since content has been added? What about 'ABCDABCD'; here it matches twice. How about 'AXBXCXD'? What about 'CDAB'?
There are many DIFF tools in existence that can tell you what pieces of a file have been changed (which can lead to a %). Unfortunately, none of them are perfect because of the issues that I described above. You will find that you have false negatives, false positives, etc. This may be 'good enough' for you.
Before you can identify a specific algorithm that will work for you, you will have to decide what the restrictions of your search will be. Otherwise, your scan will be NP-hard, which leads to unreasonable running times (your scanner may run all day just to check one file).

I suggest you look into Levenshtein distance and Damerau-Levenshtein distance.
The former tells you how many insert/delete/substitute operations are needed to turn one string into another; the latter additionally counts a transposition of two adjacent characters as a single operation.
I use these quite a lot when writing programs where users can search for things, but they may not know the exact spelling.
There are code examples on both articles.
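To apply an edit distance to a signature hidden somewhere inside a larger file, one option is the approximate substring variant of the same dynamic programme, in which a match is allowed to start at any position of the text for free. The sketch below is an illustration under assumptions (the names are hypothetical, and it works on strings such as hex dumps rather than raw bytes); it runs in time proportional to pattern length times text length.

```csharp
using System;

static class ApproximateSearch
{
    // Smallest Levenshtein distance between the pattern and any substring of text.
    public static int BestSubstringDistance(string pattern, string text)
    {
        int m = pattern.Length;
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int i = 0; i <= m; i++) prev[i] = i; // pattern against empty text
        int best = prev[m];

        for (int j = 1; j <= text.Length; j++)
        {
            curr[0] = 0; // a match may start at any text position at no cost
            for (int i = 1; i <= m; i++)
            {
                int cost = pattern[i - 1] == text[j - 1] ? 0 : 1;
                curr[i] = Math.Min(Math.Min(prev[i] + 1, curr[i - 1] + 1),
                                   prev[i - 1] + cost);
            }
            best = Math.Min(best, curr[m]); // best match ending at position j
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return best;
    }
}
```

A file could then be flagged when the returned distance is at most, say, 10% of the signature length, which corresponds to the 90% match asked about in the question.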

Compress small string

Is there any way to compress small strings (86 chars) into something smaller?
#a#1\s\215\c\6\-0.55955,-0.766462,0.315342\s\1\x\-3421.-4006,3519.-4994,3847.1744,sbs
The only way I see is to replace the recurring characters with a unique character.
But I can't find anything about that on Google.
Thanks for any reply.

http://en.wikipedia.org/wiki/Huffman_coding
Huffman coding would probably be pretty good start. In general the idea is to replace individual characters with the smallest bit pattern needed to replicate the original string or dataset.
You'll want to run statistical analysis on a variety of 'small strings' to find the most common characters, so that the more common characters are represented with the shortest unique bit patterns. You could also make up an 'example' small string containing every character that will need to be represented (like a-z0-9#.0-).

I took your example string of 85 bytes (not 83 since it was copied verbatim from the post, perhaps with some intended escapes not processed). I compressed it using raw deflate, i.e. no zlib or gzip headers and trailers, and it compressed to 69 bytes. This was done mostly by Huffman coding, though also with four three-byte backward string references.
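In .NET, raw deflate (no zlib or gzip wrapper) is what System.IO.Compression.DeflateStream produces, so the experiment above could be repeated with a sketch like the following. The class and method names are made up for the example, and the compressed size you get depends on the runtime's deflate implementation, so it will not necessarily match the figure quoted above.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

static class RawDeflate
{
    // Compresses ASCII text to a raw deflate byte stream.
    public static byte[] Compress(string text)
    {
        byte[] input = Encoding.ASCII.GetBytes(text);
        using (var output = new MemoryStream())
        {
            // leaveOpen: true so the MemoryStream can still be read afterwards.
            using (var deflate = new DeflateStream(output, CompressionMode.Compress, true))
                deflate.Write(input, 0, input.Length);
            return output.ToArray();
        }
    }

    // Restores the original text from a raw deflate byte stream.
    public static string Decompress(byte[] data)
    {
        using (var input = new MemoryStream(data))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(deflate, Encoding.ASCII))
            return reader.ReadToEnd();
    }
}
```

Comparing RawDeflate.Compress(sample).Length with sample.Length shows whether general-purpose compression pays off for strings this short.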
The best way to compress this sort of thing is to use everything you know about the data. There appears to be some structure to it and there are numbers coded in it. You could develop a representation of the expected data that is shorter. You can encode it as a stream of bits, and the first bit could indicate that what follows is straight bytes in the case that the data you got was not what was expected.
Another approach would be to take advantage of previous messages. If this message is one of a stream of messages, and they all look similar to each other, then you can make a dictionary of previous messages to use as a basis for compression, which can be reconstructed at the other end from the previous messages received. That may offer dramatically improved compression if the messages really are similar.

You should look up RUN-LENGTH ENCODING. Here is a demonstration
rrrrrunnnnnn BECOMES 5r1u6n WHAT? truncate repetitions: for x consecutive r use xr
Now what if some of the characters are digits? Then instead of using x, use the character whose ASCII value is x. for example,
if you have 43 consecutive P, write +P because '+' has ASCII code 43. If you have 49 consecutive y, write 1y because '1' has ASCII code 49.
Now the catch, which you will find with all compression algorithms, is if you have a string with little or no repetitions. Then in that case your code may be longer than the original word. But that's true for all compression algorithms.
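The basic scheme from the demonstration (decimal count followed by the character) can be sketched in a few lines of C#. The class name is an assumption for the example, and only the encoder is shown: with decimal counts the output cannot be decoded unambiguously once the input itself contains digits, which is the problem the ASCII-count trick above is meant to address.

```csharp
using System.Text;

static class RunLength
{
    // Replaces every run of identical characters with "<count><char>".
    public static string Encode(string input)
    {
        var sb = new StringBuilder();
        int i = 0;
        while (i < input.Length)
        {
            int run = 1;
            while (i + run < input.Length && input[i + run] == input[i]) run++;
            sb.Append(run).Append(input[i]);
            i += run;
        }
        return sb.ToString();
    }
}
```

RunLength.Encode("rrrrrunnnnnn") gives "5r1u6n", while RunLength.Encode("abc") gives "1a1b1c", which is twice as long as the input and illustrates the catch described above.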
NOTE:
I don't encourage using Huffman coding because it's a lot of work to get right, and the same goes for Lempel-Ziv, which is a separate dictionary-based scheme rather than a Huffman implementation.

Is there a fast and non-fancy C# code/algorithm to compress a string of comma separated digits close to maximum info density?

In a nutshell, I programmed myself into a corner by creating a CLR aggregate that performs row id concatenation, so I say:
select SumKeys(id), name from SomeTable where name='multiple rows named this'
and I get something like:
SumKeys name
-------- ---------
1,4,495 multiple rows named this
But it dies when SumKeys gets > 8000 chars and I don't think I can do anything about it.
As a quick fix (it's only failing 1% of the time for my application) I thought I might compress the string down and I thought some of you bright people out there might know a slick way to do this.
Something like base64 made for 0-9 and a comma?

You'd be much better off if you figured out a more reasonable storage for your data (maybe a HashSet)...
But for compression try regular System.IO.Compression.GZipStream ( http://msdn.microsoft.com/en-us/library/system.io.compression.gzipstream.aspx ) and convert resulting byte array to base64 string if needed... or store as byte array.
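A minimal sketch of that suggestion, assuming the key list is plain ASCII; the class and method names are hypothetical. Keep in mind that gzip adds a fixed header and trailer and that base64 makes the output a third larger again, so this only pays off for long lists.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class KeyListCodec
{
    // gzip-compresses the key list and returns it as a base64 string.
    public static string CompressToBase64(string keys)
    {
        byte[] raw = Encoding.ASCII.GetBytes(keys);
        using (var buffer = new MemoryStream())
        {
            // leaveOpen: true so the buffer can be read after gzip is flushed.
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress, true))
                gzip.Write(raw, 0, raw.Length);
            return Convert.ToBase64String(buffer.ToArray());
        }
    }

    // Reverses CompressToBase64.
    public static string DecompressFromBase64(string encoded)
    {
        using (var buffer = new MemoryStream(Convert.FromBase64String(encoded)))
        using (var gzip = new GZipStream(buffer, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.ASCII))
            return reader.ReadToEnd();
    }
}
```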

How about a hexadecimal representation, where every digit represents a 4-bit half of a character byte (a nibble), with 0xa used as the comma? You will only get a 50% compression, but it is fast and simple.
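The nibble idea could look roughly like this in C#. The names are made up for the example, and one detail is an assumption not in the answer: for an odd number of characters the last low nibble is padded with 0xF so that the decoder knows where the data ends.

```csharp
using System;
using System.Text;

static class NibblePacker
{
    // Packs a string of digits and commas into 4 bits per character.
    public static byte[] Pack(string csv)
    {
        var bytes = new byte[(csv.Length + 1) / 2];
        for (int i = 0; i < csv.Length; i++)
        {
            char c = csv[i];
            int nibble;
            if (c == ',') nibble = 0xA;                        // 0xA stands for the comma
            else if (c >= '0' && c <= '9') nibble = c - '0';
            else throw new ArgumentException("Only digits and commas are supported.");

            if (i % 2 == 0) bytes[i / 2] = (byte)(nibble << 4); // high half
            else bytes[i / 2] |= (byte)nibble;                  // low half
        }
        if (csv.Length % 2 == 1) bytes[bytes.Length - 1] |= 0xF; // padding marker
        return bytes;
    }

    // Restores the original string from the packed bytes.
    public static string Unpack(byte[] bytes)
    {
        var sb = new StringBuilder(bytes.Length * 2);
        foreach (byte b in bytes)
        {
            foreach (int nibble in new[] { b >> 4, b & 0xF })
            {
                if (nibble == 0xF) break; // padding, end of data
                sb.Append(nibble == 0xA ? ',' : (char)('0' + nibble));
            }
        }
        return sb.ToString();
    }
}
```

The result is a byte array, so the 50% saving only holds if the packed data can be stored as binary rather than converted back into text.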

Not sure how "fancy" you'd consider it, but zip/gzip compression is highly effective for any text (sometimes to the tune of 90% reduction or better). Since you're already working with C# and CLR integration, it hopefully wouldn't be too hard to setup/deploy. I haven't tinkered with any C# libraries for compression yet, but it's easy to find them. For example: http://sharpdevelop.net/OpenSource/SharpZipLib/ or http://dotnetzip.codeplex.com/ or even http://msdn.microsoft.com/en-us/library/system.io.compression.gzipstream.aspx
Or an easier option might be to switch your field to text or varchar/nvarchar(max), if that's feasible.

You can use a Huffman tree. This is basically an algorithm to compress ASCII into binary. I was told that it is basically what WinZIP uses, but I'm not sure if that is really true or not. I did a quick search for "huffman coding c#" and there seems to be at least one decent implementation out there, though I haven't used any of them.
If your "vocabulary" is just digits and commas, a Huffman tree will get you very good compression.
http://www.enusbaum.com/blog/2009/05/22/example-huffman-compression-routine-in-c/

try (GROUP_CONCAT is available in MySQL):
SELECT name, GROUP_CONCAT(id) FROM SomeTable WHERE name = 'multiple rows named this' GROUP BY name

I came across a method that will work with SQL Server:
SELECT
STUFF((
SELECT ',' + CAST(id AS varchar(20)) FROM SomeTable a WHERE a.name = b.name FOR XML PATH('')
),1,1,'') AS SumKeys, name
FROM SomeTable b
WHERE name = 'multiple rows named this'
GROUP BY name
The WHERE clause is optional. The CAST assumes id is a numeric column; drop it if id is already a string.

How can I solve this three variable equation with C#?

My teacher asked me to create a program to solve something like:
2x plus 7y plus 2z = 76
6x plus 1y plus 4z = 26
8x plus 2y plus 18z = 1
x = ?
y = ?
z = ?
Problem is, this is literally the first two days of class and he expects us to make something like this.
Any help?

Since this is homework, I'll provide guidance, but not a complete answer...
My suggestion: Write this out on paper. How would you approach this on paper? Once you figure out the basic logic required, translating that into C# should be fairly straightforward.
You'll need to assign a variable for each portion of the equation (not just x/y/z, but also the coefficients), and just step through it in code using the same steps you do on paper.

If you know some maths, you can solve systems of equations using a matrix library (or roll your own).

I would suggest that you come up with the algorithm in pseudo-code before you touch any C#.
At least if you have defined the steps that you need to perform, the task simply becomes one of learning the syntax of C# to accomplish the steps.
Looks like you'll need a math textbook ;)

Have a go at solving this yourself on paper, but keep a note of what steps you do and try and work out what "Algorithm" you are using.
Once you've worked out your algorithm, have a go at writing some C# that does the same thing.

One more piece of advice: you'll need to store the equations in some data structure and then (repeatedly) run some steps that modify that data structure. The question is, which data structure can nicely represent this kind of data? If you focus just on the coefficients (since each row always has the same variables in it), you can write just:
2 7 2 76
6 1 4 26
8 2 18 1
Also, you can assume that all operations are + because "minus 7y" actually means "plus (-7)y". This looks like a 2D array, so when programming in C#, you can start by representing the equations as int[,]. Once you load the data into this data structure, you'll just need to write a method that does the operation you did on paper (in general).

Once you get the coefficients represented by a matrix (2 dimensional array), try googling "RREF" (Reduced Row Echelon Form). This is the matrix operation you will want to implement in your program in order to solve the system of equations. Good luck.
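For reference, a compact sketch of that reduction in C#. The names are assumptions for the example, and it uses double[,] rather than int[,] because the row operations involve division. The matrix is the augmented one, with the right-hand sides in the last column.

```csharp
using System;

static class LinearSolver
{
    // Reduces an augmented matrix [A | b] to reduced row echelon form in place.
    public static void ReduceToRref(double[,] m)
    {
        int rows = m.GetLength(0), cols = m.GetLength(1);
        int pivotRow = 0;

        for (int col = 0; col < cols - 1 && pivotRow < rows; col++)
        {
            // Pick the row with the largest entry in this column as the pivot.
            int best = pivotRow;
            for (int r = pivotRow + 1; r < rows; r++)
                if (Math.Abs(m[r, col]) > Math.Abs(m[best, col])) best = r;
            if (Math.Abs(m[best, col]) < 1e-12) continue; // no pivot in this column

            // Swap the pivot row into place and scale it so the pivot becomes 1.
            for (int c = 0; c < cols; c++)
            {
                double tmp = m[pivotRow, c];
                m[pivotRow, c] = m[best, c];
                m[best, c] = tmp;
            }
            double pivot = m[pivotRow, col];
            for (int c = 0; c < cols; c++) m[pivotRow, c] /= pivot;

            // Eliminate this column from every other row.
            for (int r = 0; r < rows; r++)
            {
                if (r == pivotRow) continue;
                double factor = m[r, col];
                for (int c = 0; c < cols; c++) m[r, c] -= factor * m[pivotRow, c];
            }
            pivotRow++;
        }
    }
}
```

After calling ReduceToRref on new double[,] { { 2, 7, 2, 76 }, { 6, 1, 4, 26 }, { 8, 2, 18, 1 } }, the last column holds x, y and z, provided the system has a unique solution.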
