Okay, so I'm trying to make a basic malware scanner in C#. My question is: say I have the hex signature for a particular bit of code.
For example
{
System.IO.File.Delete(@"C:\Users\Public\DeleteTest\test.txt");
}
//Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c746573742e74787422293b
Gets Changed to -
{
System.IO.File.Delete(@"C:\Users\Public\DeleteTest\notatest.txt");
}
//Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c6e6f7461746573742e74787422293b
Keep in mind these bits will be somewhere within the entire hex of the program. How could I go about taking my base signature and looking for partial matches that, say, have a 90% match and therefore get flagged?
I would use a wildcard, but that wouldn't work for slightly more complex cases where the code might be written slightly differently even though the majority is the same. So is there a way I can do a percentage match for a substring? I was looking into the Levenshtein distance, but I don't see how I'd apply it to this scenario.
Thanks in advance for any input
Using an edit distance would be fine. You can take two strings and calculate the edit distance, which is an integer denoting how many operations are needed to transform one string into the other. You then set your own threshold based on that number.
For example, you may statically set that if the distance is less than five edits, the change is relevant.
You could also take the length of the string you are comparing and use a percentage of it; for example, requiring a similarity of at least 0.88, which is the same as allowing a distance of no more than (int)(input.Length * 0.12m) edits.
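As a rough sketch of that idea in C# (the class and method names, and the 0.90 threshold, are only illustrative, not anything from the question):

using System;

static class SignatureMatcher
{
    // Classic dynamic-programming Levenshtein distance (insert/delete/substitute).
    public static int LevenshteinDistance(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];

        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + cost);
            }
        }
        return d[a.Length, b.Length];
    }

    // Flags a candidate if it is at least `minSimilarity` similar to the signature,
    // e.g. 0.90 for the 90% match mentioned in the question.
    public static bool IsSuspicious(string signatureHex, string candidateHex, double minSimilarity)
    {
        int distance = LevenshteinDistance(signatureHex, candidateHex);
        int maxLen = Math.Max(signatureHex.Length, candidateHex.Length);
        double similarity = 1.0 - (double)distance / maxLen;
        return similarity >= minSimilarity;
    }
}

Note that scanning a whole file this way would mean sliding a window of roughly the signature's length over the file's hex and scoring each window, which gets expensive for large files.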
First, your program bits should match EXACTLY, or else the file has been modified or is corrupt. Generally, you store an MD5 hash of the original binary and check the MD5 of new versions against it to see if they are 'the same enough' (an MD5 match can't strictly guarantee a 100% match, because collisions are possible).
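For the exact-match part, hashing the whole file is straightforward; a minimal sketch using the framework's MD5 class (the class and method names below are just for illustration):

using System;
using System.IO;
using System.Security.Cryptography;

static class FileHashing
{
    // Hash a file with MD5 so two copies can be compared for byte-for-byte equality.
    public static string Md5Of(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            byte[] hash = md5.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }
}

Comparing Md5Of(original) with Md5Of(candidate) then tells you whether the two binaries are identical.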
Beyond this, in order to detect malware in a random binary, you must know what sort of patterns to look for. For example, if I know a piece of malware injects code with some binary XYZ, I will look for XYZ in the bits of the executable. Patterns get much more complex than that, of course, as the malware bits can be spread out in chunks. What is more interesting is that some viruses are self-morphing. This means that each time it runs, it modifies itself, so the scanner does not know an exact pattern to find. In these cases, the scanner must know the types of derivatives that can be produced and look for all of them.
In terms of finding a % match, this operation is very time consuming unless you have constraints. By comparing 2 strings, you cannot tell which pieces were removed, added, or replaced. For instance, if I have a starting string 'ABCD', is 'AABCDD' a 100% match or less since content has been added? What about 'ABCDABCD'; here it matches twice. How about 'AXBXCXD'? What about 'CDAB'?
There are many DIFF tools in existence that can tell you what pieces of a file have been changed (which can lead to a %). Unfortunately, none of them are perfect because of the issues that I described above. You will find that you have false negatives, false positives, etc. This may be 'good enough' for you.
Before you can identify a specific algorithm that will work for you, you will have to decide what the restrictions of your search will be. Otherwise, your scan will be NP-hard, which leads to unreasonable running times (your scanner may run all day just to check one file).
I suggest you look into Levenshtein distance and Damerau-Levenshtein distance.
The former tells you how many insert/delete/substitute operations are needed to turn one string into another; the latter additionally counts a transposition of two adjacent characters as a single operation.
I use these quite a lot when writing programs where users can search for things, but they may not know the exact spelling.
There are code examples in both articles.
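For reference, here is a sketch of the restricted (optimal string alignment) variant of Damerau-Levenshtein in C#; the class name is made up, and the only difference from plain Levenshtein is the extra transposition check:

using System;

static class FuzzySearch
{
    // Optimal string alignment variant of Damerau-Levenshtein:
    // insert, delete, substitute, plus transposition of adjacent characters.
    public static int DamerauLevenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];

        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + cost);

                // Transposition of two adjacent characters counts as one edit.
                if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                    d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + 1);
            }
        }
        return d[a.Length, b.Length];
    }
}

For example, it returns 1 for "recieve" vs "receive" (one transposition), where plain Levenshtein returns 2.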
Related
I'm comparing strings to compute the similarity between them. Initially I went for the Levenshtein distance algorithm, but it now turns out that it is not the best algorithm for the kind of input I have. My input strings can undergo block-move operations, which result in a large Levenshtein distance between very similar strings. Here is an example of two strings that have a large edit distance but are essentially similar:
First version
Q: Pick your favorite breed:
German Shephard
Dalmation
Colly
Rottweiler
Great Dane
Second version
Q: Which of the following breed is your favorite:
Colly
Dalmation
German Shephard
Great Dane
Rottweiler
I then checked the diff utility used by Git, which IIRC uses Myers' algorithm, but it suffers from the same problem: it can't detect block moves and treats them as separate delete and insert operations.
What other algorithms are there that would give me smaller distances for strings that are qualitatively similar but might have large edit distances? Even better if an implementation in C#, VB.NET, C++ or Java is available, so that I could port it.
Note: Qualitative does not mean any kind of intelligent content analysis. It would still be objective, but it should count a block move as one operation rather than N operations, where N is the number of characters in the block.
I've Googled and read quite a bit about QR codes and the maximum data that can be stored based on the various settings, all of it in tabular format. I can't seem to find anything that gives a formula or a proper explanation of how these values are calculated.
What I would like to do is this:
Present the user with a form, allowing them to choose Format, EC & Version.
Then they can type in some data and generate a QR code.
Done deal. That part is easy.
The addition I would like to include is a "remaining character count" so that they (the user) can see how much more data they can type in, as well as what effect the properties have on the storage capacity of the QR code.
Does anyone know where I can find the formula(s)? Or do I need to purchase ISO 18004:2006?
A formula to calculate the amount of data you can put in a QR code would be quite complex to derive, not to mention that it would need some approximations for the calculation to be possible. The formula would have to calculate the number of modules dedicated to data in your QR code based on its version, and then calculate how many codewords (sets of 8 modules) will be used for error correction.
To calculate the number of modules that will be used for data, you need to know how many modules are used by the function patterns. While this is not a problem for the three finder patterns, the timing patterns or the version/format information, the alignment patterns are a problem because their number depends on the QR code's version, so you would end up needing a table at that point anyway.
For the second part, I have to say I don't know how to calculate the number of error correction codewords based on the correction capacity. For some reason, more error correction codewords are used than would seem necessary to match the stated capacity; for example, a 6-H QR code can correct up to 32.6% of the data instead of the 30% implied by the H correction level.
In any case, as you can see, a formula would be quite complex to implement. Using a table, as already suggested, is probably the best thing you can do.
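To illustrate the table-based approach, here is a C# sketch of a "remaining characters" counter. It assumes byte mode; only the version 1 and 2 capacities are filled in, and the rest would be copied from the capacity table in ISO/IEC 18004 (or any published QR capacity chart). The type and member names are made up:

using System;
using System.Collections.Generic;

enum EcLevel { L, M, Q, H }

static class QrCapacity
{
    // Byte-mode character capacities, keyed by (version, error correction level).
    // Only versions 1 and 2 are filled in here; the remaining versions would be
    // copied from the ISO/IEC 18004 capacity table.
    private static readonly Dictionary<(int Version, EcLevel Ec), int> ByteCapacity =
        new Dictionary<(int, EcLevel), int>
        {
            { (1, EcLevel.L), 17 }, { (1, EcLevel.M), 14 }, { (1, EcLevel.Q), 11 }, { (1, EcLevel.H), 7 },
            { (2, EcLevel.L), 32 }, { (2, EcLevel.M), 26 }, { (2, EcLevel.Q), 20 }, { (2, EcLevel.H), 14 },
            // ... versions 3 to 40 omitted
        };

    public static int RemainingCharacters(int version, EcLevel ec, string currentInput)
    {
        if (!ByteCapacity.TryGetValue((version, ec), out int capacity))
            throw new ArgumentException("Capacity not in table for this version/EC level.");

        return capacity - currentInput.Length;
    }
}

Numeric, alphanumeric and Kanji modes have different capacities, so a real counter would pick the table that matches the selected mode.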
I wrote the original AIM specification for QR Code back in the '90s for Denso Corporation, and was also project editor for both editions of the ISO/IEC 18004 standard. It was felt to be much easier for people producing code printing software to use a look-up table rather than calculate capacities from a formula - no easy job as there are several independent variables that have to be taken into account iteratively when parsing the text to be encoded to minimise its length in bits, in order to achieve the smallest symbol. The most crucial factor is the mix of characters in the data, the sequence and lengths of sub-strings of numeric, alphanumeric, Kanji data, with the overhead needed to signal each change of character set, then the required level of error correction. I did produce a guidance section for this which is contained in the ISO standard.
The storage is determined by the QR mode and the version/type you are using. More specifically, the calculation is based on how 'compressible' the characters are and which encoding modes the QR generator is allowed to use on the content.
More information can be found at http://en.wikipedia.org/wiki/QR_code#Storage
First time asking a question on here.
I am looking for a way to use a search algorithm, or a built-in method, to dynamically search for repeating sequences within a string or other variable.
The reason I say dynamic is that I want it to be able to search through the string and locate repeating sequences on its own. I am not going to be able to supply a specific sequence to look for.
I am unsure if this is even possible, but if it is, all help would be appreciated!
Here is a basic visual representation of what I am looking for (mind you, this is not code, just an example of a string):
This is going to be a long string that will have sequences throughout it. This may have matching characters side by side or it may not, but regardless, this is going to be a long string. If this is going to be a long string, I need it to find these sequences throughout it on its own!
As you can see from the above example, there are 2 sets of matching sequences within the single string. If there is any way to identify these programmatically, and to search through the string very quickly for these different patterns, it would help me significantly!
The matches will most likely be stored in a List / array for later use as well.
Thank you for any help you are able to provide!
Edit:
As this question was asked, case sensitivity will not be an issue.
When I mentioned there were 2 matches, I meant that 2 particular sequences each had a duplicate; one of them had 2 duplicates.
@HenkHolterman You are correct that this is going to be a compression algorithm; however, I was not sure where to start when looking for the sequences that I will be matching.
I had been doing multiple searches for something similar to this, but was coming up short of the answers I was looking for. That is why my question was posed here the way it was.
Thank you for all the responses I have gotten so far though!
Here's the basic brute force idea:
First you find all repeating sequences of size 1 (you can change the minimum size to whatever you want).
To do this, you essentially go down the line and use a regex to find all of the Ts, then all the hs, etc...
Then you find all sequences of size 2, so you'd find all the Ths, the his and the iss.
You repeat this until you have found all of the sequences.
The runtime would be
the time complexity to find a particular sequence with regex: O(n)
times the number of different sequences of a particular size: O(n)
times the number of sizes: O(n)
the total time complexity would be O(n³)
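A minimal C# sketch of this brute-force idea, using a dictionary of substring counts instead of regexes (the class name and the default minimum length are arbitrary):

using System;
using System.Collections.Generic;
using System.Linq;

static class RepeatFinder
{
    // Brute force: for every substring length, count how often each substring
    // occurs and keep the ones that repeat.
    public static Dictionary<string, int> FindRepeatedSequences(string text, int minLength = 2)
    {
        var counts = new Dictionary<string, int>();

        for (int length = minLength; length <= text.Length / 2; length++)
        {
            for (int start = 0; start + length <= text.Length; start++)
            {
                string candidate = text.Substring(start, length);
                counts[candidate] = counts.TryGetValue(candidate, out int c) ? c + 1 : 1;
            }
        }

        // Keep only the substrings that occur more than once.
        return counts.Where(kv => kv.Value > 1)
                     .ToDictionary(kv => kv.Key, kv => kv.Value);
    }
}

On long strings the dictionary grows quickly, which is exactly the cubic blow-up described above.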
Use a suffix tree to do this in O(n) time.
In my project I face a scenario where I have a function with numerous inputs. At a certain point I am provided with a result, and I need to find one combination of inputs that generates that result.
Here is some pseudocode that illustrates the problem:
Double y = f(x_0,..., x_n)
I am provided with y and I need to find any combination of inputs that produces it.
I tried several things on paper that could generate something, but each parameter has a range of 6.5 x 10^9 possible values, so I would like an efficient execution time.
Can someone name an algorithm or a topic that would be useful for me, so I can read up on how other people have solved similar problems?
I was thinking along the lines of creating a vector from the inputs and judging how well that vector fits the problem. This sounds an awful lot like a neural network, but there is no training phase available.
Edit:
Thank you all for the feedback. The comments sum up the problems I have, and I will try something along the lines of hill climbing.
The general case of your problem might be impossible to solve, but for some cases there are numerical methods that can help.
For example, in 1D space, if you can find a point where f is smaller than y and one where it is higher than y, you can use the regula falsi (false position) method to numerically find the "root"; in your case, by simply invoking the method on f(x) - y.
Another numerical method for finding roots is Newton-Raphson.
I admit I am not familiar with how to apply these methods in multi-dimensional spaces, but it could be a starting point. I'd search the literature for these if I were you.
Note: using such a method almost always requires some knowledge of the function.
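For the 1D case, a sketch of regula falsi applied to f(x) - y might look like this (it assumes you can supply an interval [lo, hi] that brackets the target; the names, tolerance and iteration cap are placeholders):

using System;

static class RootFinding
{
    // Regula falsi (false position) on g(x) = f(x) - y.
    // Assumes g(lo) and g(hi) have opposite signs, i.e. f brackets the target y.
    public static double SolveForY(Func<double, double> f, double y,
                                   double lo, double hi,
                                   double tolerance = 1e-9, int maxIterations = 1000)
    {
        double gLo = f(lo) - y;
        double gHi = f(hi) - y;
        if (gLo * gHi > 0)
            throw new ArgumentException("The interval does not bracket the target value.");

        double x = lo;
        for (int i = 0; i < maxIterations; i++)
        {
            // The secant through (lo, gLo) and (hi, gHi) crosses zero here.
            x = hi - gHi * (hi - lo) / (gHi - gLo);
            double gx = f(x) - y;

            if (Math.Abs(gx) < tolerance) break;

            if (gLo * gx < 0) { hi = x; gHi = gx; }
            else { lo = x; gLo = gx; }
        }
        return x;
    }
}

For instance, SolveForY(x => x * x * x, 8.0, 0, 10) converges to roughly 2.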
Another possible solution is to take g(X) = |f(X) - y| and use heuristic algorithms to find a minimal value of g. The problem with heuristic methods is that they will get you "close enough", but will seldom get you exactly to the target (unless the function is convex).
Some optimization algorithms are: genetic algorithms, hill climbing, and gradient descent (where you can numerically estimate the gradient).
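As a sketch of the heuristic route, here is a simple random-restart hill climber that minimises |f(X) - y|. The bounds, step sizes and restart counts are arbitrary placeholders you would tune for your problem:

using System;

static class InverseSearch
{
    private static readonly Random Rng = new Random();

    // Random-restart hill climbing on g(x) = |f(x) - y|.
    // `f` takes the full input vector; `min`/`max` bound each parameter.
    public static double[] FindInputsFor(Func<double[], double> f, double y,
                                         int dimensions, double min, double max,
                                         int restarts = 20, int stepsPerRestart = 10000)
    {
        double[] best = null;
        double bestError = double.MaxValue;

        for (int r = 0; r < restarts; r++)
        {
            // Start from a random point in the search box.
            var current = new double[dimensions];
            for (int i = 0; i < dimensions; i++)
                current[i] = min + Rng.NextDouble() * (max - min);
            double currentError = Math.Abs(f(current) - y);

            double stepSize = (max - min) / 10.0;
            for (int s = 0; s < stepsPerRestart; s++)
            {
                // Perturb one randomly chosen parameter and keep it only if it improves.
                var candidate = (double[])current.Clone();
                int i = Rng.Next(dimensions);
                candidate[i] = Math.Max(min, Math.Min(max,
                    candidate[i] + (Rng.NextDouble() - 0.5) * stepSize));

                double candidateError = Math.Abs(f(candidate) - y);
                if (candidateError < currentError)
                {
                    current = candidate;
                    currentError = candidateError;
                }
                else
                {
                    stepSize *= 0.999; // slowly narrow the search when we stop improving
                }
            }

            if (currentError < bestError) { best = current; bestError = currentError; }
        }
        return best;
    }
}

As noted above, this only gets you "close enough"; checking |f(best) - y| afterwards tells you whether the result is acceptable.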
I'm currently working on a project that requires me to match our database of Bands and venues with a number of external services.
Basically I'm looking for some direction on the best method for determining if two names are the same. For Example:
Our database venue name - "The Pig and Whistle"
service 1 - "Pig and Whistle"
service 2 - "The Pig & Whistle"
etc etc
I think the main differences are going to be things like missing "the" or using "&" instead of "and" but there could also be things like slightly different spelling and words in different orders.
What algorithms/techniques are commonly used in this situation, do I need to filter noise words or do some sort of spell check type match?
Have you seen any examples of something similar in C#?
UPDATE: In case anyone is interested in a C# example, there are plenty you can find by doing a Google Code search for Levenshtein distance.
The canonical (and probably the easiest) way to do this is to measure the Levenshtein distance between the two strings. If the distance is small relative to the size of the string, it's probably the same string. Note that if you have to compare a lot of very small strings it'll be harder to tell whether they're the same or not. It works better with longer strings.
A smarter approach might be to compare the Levenshtein distance between the two strings but to assign a distance of zero to the more obvious transformations, like "and"/"&", "Snoop Doggy Dogg"/"Snoop", etc.
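A sketch of that combination in C#: normalise away the obvious transformations first, then compare with Levenshtein. The normalisation rules and the distance threshold here are just examples:

using System;
using System.Text.RegularExpressions;

static class VenueNameMatcher
{
    // Normalise away the "obvious" transformations before measuring distance:
    // case, "&" vs "and", a leading "the", punctuation and extra whitespace.
    public static string Normalize(string name)
    {
        string s = name.ToLowerInvariant().Replace("&", " and ");
        s = Regex.Replace(s, @"[^a-z0-9 ]", " ");    // drop punctuation
        s = Regex.Replace(s, @"^\s*the\s+", "");     // drop a leading "the"
        return Regex.Replace(s, @"\s+", " ").Trim(); // collapse whitespace
    }

    public static bool ProbablySame(string a, string b, int maxDistance = 2)
    {
        return Levenshtein(Normalize(a), Normalize(b)) <= maxDistance;
    }

    // Standard dynamic-programming Levenshtein distance.
    private static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
        return d[a.Length, b.Length];
    }
}

With this, "The Pig and Whistle", "Pig and Whistle" and "The Pig & Whistle" all normalise to the same string, so their distance is 0.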
I did something like this a while ago; I used the Discogs database (which is public domain), which also tracks artist aliases.
You can either:
Use an API call (namevariations field).
Download the monthly data dumps (*_artists.xml.gz) and import them into your database. This contains the same data, but is obviously a lot faster.
One advantage of this over the Levenshtein distance solution is that you'll get far fewer false matches.
For example, Ryan Adams and Bryan Adams have a distance of 2, which is quite good (lower scores mean better matches; "Pig and Whistle" and "Pig & Whistle" have a score of 3), yet they're obviously different people.
While you could make a smarter algorithm (one which also looks at string length, for example), using the alias DB is a lot simpler and less error-prone; after implementing this, I could completely remove the solution suggested in the other answer and got better matches.
Soundex may also be useful.
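For completeness, a simplified American Soundex sketch in C# (it treats 'H' and 'W' like vowels rather than applying the full rule for consonants separated by them; the class name is made up):

using System;
using System.Text;

static class Phonetics
{
    // Simplified American Soundex: first letter plus three digits.
    public static string Soundex(string word)
    {
        if (string.IsNullOrEmpty(word)) return string.Empty;

        string upper = word.ToUpperInvariant();
        var result = new StringBuilder();
        result.Append(upper[0]);

        char previousCode = Encode(upper[0]);
        for (int i = 1; i < upper.Length && result.Length < 4; i++)
        {
            char code = Encode(upper[i]);
            if (code != '0' && code != previousCode)
                result.Append(code);
            previousCode = code;
        }

        return result.ToString().PadRight(4, '0');
    }

    private static char Encode(char c)
    {
        switch (c)
        {
            case 'B': case 'F': case 'P': case 'V': return '1';
            case 'C': case 'G': case 'J': case 'K': case 'Q': case 'S': case 'X': case 'Z': return '2';
            case 'D': case 'T': return '3';
            case 'L': return '4';
            case 'M': case 'N': return '5';
            case 'R': return '6';
            default: return '0'; // vowels, H, W, Y and anything else are dropped
        }
    }
}

Soundex("Robert") gives "R163", so names that sound alike but are spelled differently end up with the same code.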
In bioinformatics we use this kind of comparison on DNA or protein sequences all the time.
There are plenty of algorithms; you probably want to look at global alignments.
In this respect the Needleman-Wunsch algorithm is probably what you seek.
If you have particularly long recurring strings to compare you might also want to consider heuristic searches like BLAST.
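A minimal C# sketch of the Needleman-Wunsch scoring recurrence (the match/mismatch/gap values are placeholders; real sequence comparisons would use a substitution matrix such as BLOSUM and tuned gap penalties):

using System;

static class GlobalAlignment
{
    // Needleman-Wunsch global alignment score between two sequences.
    public static int Score(string a, string b, int match = 1, int mismatch = -1, int gap = -2)
    {
        var score = new int[a.Length + 1, b.Length + 1];

        // Aligning a prefix against nothing costs one gap per character.
        for (int i = 0; i <= a.Length; i++) score[i, 0] = i * gap;
        for (int j = 0; j <= b.Length; j++) score[0, j] = j * gap;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int diag = score[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
                int up = score[i - 1, j] + gap;   // gap in b
                int left = score[i, j - 1] + gap; // gap in a
                score[i, j] = Math.Max(diag, Math.Max(up, left));
            }
        }
        return score[a.Length, b.Length];
    }
}

Recovering the actual alignment (not just the score) would require a traceback over the filled matrix, which is the usual next step.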