I'm trying to read a bunch of unsigned integers from a configuration file into a class. These numbers may be specified in either base 10 (e.g. 1234) or base 16 (e.g. 0xAB31), so I'm looking for the strtoul equivalent in C# 2.0.
More specifically, I'm interested in a C# function that mimics the behaviour of this function when the argument indicating the base or radix is passed in as zero. (In C++, strtoul will then 'guess' the base from the first couple of characters in the string and convert the number accordingly.)
Currently I'm manually checking the first two characters of the string (using string.Substring()) and then calling Convert.ToUInt32(hex, 10) or Convert.ToUInt32(hex, 16) as needed.
I'm sure there has to be a better way to deal with this problem, hence this post. More elegant ideas, solutions or workarounds would be a great help.
Well, you don't need to use Substring unless it's in hex, but it sounds like you're basically doing it the right way:
return text.StartsWith("0x") ? Convert.ToUInt32(text.Substring(2), 16)
: Convert.ToUInt32(text, 10);
Obviously this will create an extra object for the Substring call, and you could write your own hex parsing code to cope with this - but unless you've actually run into performance problems with this approach, I'd keep it simple.
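As a minimal sketch of the approach above, wrapped in a reusable helper: the class and method names (NumberParsing, ParseUInt32AutoBase) are made up for illustration, and accepting an upper-case "0X" prefix and trimming whitespace are assumptions that go beyond the original snippet. Unlike strtoul with base 0, this sketch does not treat a leading "0" as octal.

```csharp
using System;

public static class NumberParsing
{
    // Hypothetical helper: "1234" is parsed as decimal, "0xAB31" or "0XAB31" as hexadecimal.
    public static uint ParseUInt32AutoBase(string text)
    {
        if (text == null) throw new ArgumentNullException("text");
        text = text.Trim();

        // Assumption: the prefix check is case-insensitive, so "0X" also selects base 16.
        if (text.StartsWith("0x", StringComparison.OrdinalIgnoreCase))
        {
            // Strip the two prefix characters and parse the rest as base 16.
            return Convert.ToUInt32(text.Substring(2), 16);
        }

        // Everything else is treated as base 10.
        return Convert.ToUInt32(text, 10);
    }
}
```

Calling NumberParsing.ParseUInt32AutoBase on each value read from the configuration file keeps the base detection in one place.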
I'm developing a PDF file viewer. A PDF file stores its characters as bytes, and a PDF file can be several megabytes in size. Using strings for this scenario is a bad idea, because the storage space of a string cannot be reused for another string. Therefore I store these PDF bytes in a char array. When reading the next big PDF file, I can reuse the char array.
Now I need to support a search functionality, so that the user can find a certain text in this huge file. When I am searching, I usually don't want to have to enter proper upper and lower case letters, I might even not remember the correct casing, meaning the search should succeed regardless of casing. When using
string.IndexOf(String, StringComparison)
one can choose InvariantCultureIgnoreCase to get both upper and lower case matches.
However, converting the megabyte char array into an equally big string is a bad idea.
Unfortunately, IndexOf for an Array is not helpful:
public static int IndexOf<T> (T[] array, T value);
This only allows searching for a single char in a char array and does not support IgnoreCase, which obviously wouldn't make sense for other arrays, such as an integer array.
So the question is:
Which method in .NET can be used to search for a string in a character array?
Please read this before marking this question as a duplicate
I am aware that there are already similar questions regarding searching. But the ones I have seen all convert the character array into a string in one way or another, which I definitely do not want.
Also note that many of those solutions don't support ignoring the casing. The solution should also handle exotic Unicode characters correctly.
And last but not least, best would be an existing method from DotNet.
I came to the conclusion that I need to implement my own IndexOf method for character arrays. However, programming that proved rather challenging, so I checked in the DotNet source code how string.IndexOf is doing it.
It's a bit confusing because one method is calling another which calls another, each doing not much. Finally, one arrives at:
public unsafe int IndexOf(ReadOnlySpan<char> source, ReadOnlySpan<char> value,
CompareOptions options = CompareOptions.None)
Lo and behold, that was exactly the functionality I was looking for, because it is very easy to convert a char[] into a ReadOnlySpan<char>. This method belongs to the CompareInfo class. To call it, one has to write something like this:
var index = CultureInfo.InvariantCulture.CompareInfo.IndexOf(bigCharArray,
searchString, CompareOptions.IgnoreCase);
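A minimal sketch of that call wrapped for a reusable buffer, assuming a runtime recent enough to expose the span-based CompareInfo.IndexOf overload quoted above. The names CharBufferSearch and IndexOfIgnoreCase and the explicit length parameter are made up for illustration; the length is there because a reused char array may be larger than the text currently stored in it.

```csharp
using System;
using System.Globalization;

public static class CharBufferSearch
{
    // Hypothetical wrapper: searches only the first `length` chars of the buffer,
    // without ever converting the buffer into a string.
    public static int IndexOfIgnoreCase(char[] buffer, int length, string value)
    {
        ReadOnlySpan<char> source = buffer.AsSpan(0, length);
        return CultureInfo.InvariantCulture.CompareInfo.IndexOf(
            source, value.AsSpan(), CompareOptions.IgnoreCase);
    }
}
```

The returned index is relative to the start of the buffer, and -1 means the text was not found.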
So this may be obvious, but I have recently inherited some legacy code, and scattered around the code are array indexes like this:
someArray(&H7D0)
I get that this "&H7D0" is the index, but how do I go about changing it to a real number as I am converting the code to C#?
The code is a mess and it's not obvious what it might be.
This is a hexadecimal number. The equivalent C# would be someArray[0x7d0] (C# indexes arrays with square brackets).
Both are equivalent to the decimal number 2000, so you could simply write the index as 2000 in both languages: someArray(2000) in VB and someArray[2000] in C#.
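If there are many such literals to convert, the arithmetic can be checked mechanically. This is a small sketch, assuming the literals have the plain "&H" plus hex digits form shown in the question; the helper name FromVbHex is made up, and VB literals with a trailing type character are not handled.

```csharp
using System;

public static class HexLiteralDemo
{
    // Hypothetical helper: strips a VB-style "&H" prefix and parses the rest as base 16.
    public static int FromVbHex(string vbLiteral)
    {
        string digits = vbLiteral.StartsWith("&H", StringComparison.OrdinalIgnoreCase)
            ? vbLiteral.Substring(2)
            : vbLiteral;
        return Convert.ToInt32(digits, 16);
    }
}
```

For the index in the question, 7 * 256 + 13 * 16 + 0 = 2000, which is the value FromVbHex("&H7D0") returns and the same value as the C# literal 0x7D0.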
Okay, so I'm trying to make a basic malware scanner in C#. My question is: say I have the hex signature for a particular bit of code.
For example
{
System.IO.File.Delete(@"C:\Users\Public\DeleteTest\test.txt");
}
//Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c746573742e74787422293b
Gets Changed to -
{
System.IO.File.Delete(@"C:\Users\Public\DeleteTest\notatest.txt");
}
//Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c6e6f7461746573742e74787422293b
Keep in mind these bits will be within the entire hex of the program. How could I go about taking my base signature and looking for partial matches, so that anything with, say, a 90% match gets flagged?
I would use a wildcard, but that wouldn't work for slightly more complex cases where the code might be written slightly differently but the majority would be the same. So is there a way I can do a percentage match for a substring? I was looking into the Levenshtein distance, but I don't see how I'd apply it to this scenario.
Thanks in advance for any input
Using an edit distance would be fine. You can take two strings and calculate the edit distance, which will be an integer value denoting how many operations are needed to take one string to the other. You set your own threshold based off that number.
For example, you may statically set that if the distance is less than five edits, the change is relevant.
You could also take the length of the string you are comparing and take a percentage of that. Your example is 36 characters long, so (int)(input.Length * 0.88m) would be a valid threshold.
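A minimal sketch of the edit-distance idea, using the standard dynamic-programming formulation of Levenshtein distance. The class and method names (EditDistance, Levenshtein, Similarity) are made up for illustration, and normalising by the longer string's length is just one possible way to turn the distance into a percentage.

```csharp
using System;

public static class EditDistance
{
    // Number of single-character insertions, deletions and substitutions
    // needed to turn a into b (two-row dynamic programming).
    public static int Levenshtein(string a, string b)
    {
        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1),
                    previous[j - 1] + cost);
            }
            int[] swap = previous; previous = current; current = swap;
        }
        return previous[b.Length];
    }

    // Assumption: similarity is 1 - distance / length of the longer string.
    public static double Similarity(string a, string b)
    {
        int longest = Math.Max(a.Length, b.Length);
        if (longest == 0) return 1.0;
        return 1.0 - (double)Levenshtein(a, b) / longest;
    }
}
```

A signature would then be flagged when Similarity(signature, candidate) is at or above the chosen threshold, for example 0.9. Finding the best-matching substring inside a whole binary needs a search over candidate windows on top of this, which the sketch does not attempt.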
First, your program bits should match EXACTLY or else it has been modified or is corrupt. Generally, you will store an MD5 hash on the original binary and check the MD5 against new versions to see if they are 'the same enough' (MD5 can't guarantee a 100% match).
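The exact-match check described above could be sketched like this, assuming the file contents are already in memory as a byte array. The names FileFingerprint, Md5Hex and Matches are made up for illustration.

```csharp
using System;
using System.Security.Cryptography;

public static class FileFingerprint
{
    // Returns the MD5 digest of the data as a lower-case hex string.
    public static string Md5Hex(byte[] data)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(data);
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }

    // Hypothetical helper: true when the data hashes to the stored reference digest.
    public static bool Matches(byte[] data, string expectedMd5Hex)
    {
        return string.Equals(Md5Hex(data), expectedMd5Hex, StringComparison.OrdinalIgnoreCase);
    }
}
```

Any single changed byte gives a different digest, so this only answers "identical or not"; it says nothing about how similar two binaries are.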
Beyond this, in order to detect malware in a random binary, you must know what sort of patterns to look for. For example, if I know a piece of malware injects code with some binary XYZ, I will look for XYZ in the bits of the executable. Patterns get much more complex than that, of course, as the malware bits can be spread out in chunks. What is more interesting is that some viruses are self-morphing. This means that each time the virus runs, it modifies itself, so the scanner does not know an exact pattern to find. In these cases, the scanner must know the types of derivatives that can be produced and look for all of them.
In terms of finding a % match, this operation is very time consuming unless you have constraints. By comparing 2 strings, you cannot tell which pieces were removed, added, or replaced. For instance, if I have a starting string 'ABCD', is 'AABCDD' a 100% match or less since content has been added? What about 'ABCDABCD'; here it matches twice. How about 'AXBXCXD'? What about 'CDAB'?
There are many DIFF tools in existence that can tell you what pieces of a file have been changed (which can lead to a %). Unfortunately, none of them are perfect because of the issues that I described above. You will find that you have false negatives, false positives, etc. This may be 'good enough' for you.
Before you can identify a specific algorithm that will work for you, you will have to decide what the restrictions of your search will be. Otherwise, your scan will be NP-hard, which leads to unreasonable running times (your scanner may run all day just to check one file).
I suggest you look into Levenshtein distance and Damerau-Levenshtein distance.
The former tells you how many single-character insert/delete/replace operations are needed to turn one string into another; the latter additionally counts swapping two adjacent characters as a single operation.
I use these quite a lot when writing programs where users can search for things, but they may not know the exact spelling.
There are code examples on both articles.
Using string.CompareTo(string) I can get around this to some extent, but it is not easy to read, and I have read that locale settings might influence the result.
Is there a way to just simply use < or > on 2 Strings in a more straightforward way?
You can overload operators but you seldom should. To me "stringA" > "stringB" wouldn't mean a damn thing, it's not helping readability IMO. That's why operator overloading guidelines advise not to overload operators if the meaning is not obvious.
EDIT: Operator Overloading Usage Guidelines
Also, in the case of String I'm afraid you can't do it, seeing as operator-overloading methods can only be declared in the class whose operands they work on.
If the syntax of CompareTo bothers you, maybe wrapping it in extension method will solve your problem?
Like that:
public static bool IsLessThan(this string str, string str2) {
return str.CompareTo(str2) < 0;
}
I still find it confusing for reader though.
The bottom line is, you can't overload operators for String. Usually you can do something like declaring a partial and stuffing your overloads there, but String is a sealed class, so not this time. I think that the extension method with reasonable name is your best bet. You can put CompareTo or some custom logic inside it.
CompareTo is the proper way in my opinion, you can use the overloads to specify culture specific parameters...
You mention in a comment that you're comparing two strings with values of the form "A100" and "B001". This works in your legacy VB 6 code with the < and > operators because of the way that VB 6 implements string comparison.
The algorithm is quite simple. It walks through the string, one character at a time, and compares the ASCII values of each character. As soon as a character from one string is found to have a lower ASCII code than the corresponding character in the other string, the comparison stops and the first string is declared to be "less than" the second. (VB 6 can be forced to perform a case-insensitive comparison based on the system's current locale by placing the Option Compare Text statement at the top of the relevant code module, but this is not the default setting.)
Simple, of course, but not entirely logical. Comparing ASCII values skips over all sorts of interesting things you might find in strings nowadays; namely non-ASCII characters. Since you appear to be dealing with strings whose contents have pre-defined limits, this may not be a problem in your particular case. But more generally, writing code like strA < strB is going to look like complete nonsense to anyone else who has to maintain your code (it seems like you're already having this experience), and I encourage you to do the "right thing" even when you're dealing with a fixed set of possible inputs.
There is nothing "straightforward" about using < or > on string values. If you need to implement this functionality, you're going to have to do it yourself. Following the algorithm that I described VB 6 as using above, you could write your own comparison function and call that in your code, instead. Walk through each character in the string, determine if it is a character or a number, and convert it to the appropriate data type. From there, you can compare the two parsed values, and either move on to the next index in the string or return an "equality" value.
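One way to sketch the character-by-character comparison described above without hand-writing the loop is string.CompareOrdinal, which compares the numeric values of the characters and ignores culture settings. Treating that as a match for VB 6's default comparison is an assumption, and the names OrdinalStringOrder and IsOrdinalLessThan are made up for illustration.

```csharp
using System;

public static class OrdinalStringOrder
{
    // Hypothetical helper: true when left sorts before right by raw character values.
    public static bool IsOrdinalLessThan(string left, string right)
    {
        return string.CompareOrdinal(left, right) < 0;
    }
}
```

With the values mentioned in the question, IsOrdinalLessThan("A100", "B001") is true because 'A' has a lower code than 'B'. The same rule makes "10" sort before "2", which is worth keeping in mind if the strings ever hold plain numbers.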
There is another problem with that, I think:
Assert.IsFalse(10 < 2);
Assert.IsTrue("10" < "2");
(The second Assert assumes you did an overload for the < operator on the string class.)
But the operator suggests otherwise!!
I agree with Dyppl: you shouldn't do it!
I have a string (named str) and I generate a hash code (named H) from it.
I want to recover the original string (str) from the received hash code (H).
The short answer is you can't.
Creating a hashcode is one way operation - there is no reverse operation. The reason for this is that there are (for all practical purposes) infinitely many strings, but only finitely many hash codes (the number of possible hashcodes is bounded by the range of an int). Each hashcode could have been generated from any one of the infinitely many strings that give that hash code and there's no way to know which.
You can try to do it through a brute-force attack or with the help of rainbow tables.
Anyway, even if you succeeded in finding something with those methods, you would only find a string having the same hash code as the original, but you're ABSOLUTELY not sure that it would be the original string, because hash codes are not unique.
Hmm, maybe "absolutely" is even a bit restrictive, because probability says that with 99.999999999999...% certainty you won't find the same string :D
Hashing is generating a short fixed size value from a usually larger input. It is in general not reversible.
Mathematically impossible. There are only 2^32 different ints, but almost infinitely many strings, so from the pigeon hole principle follows that you can't restore the string.
You can find a string that matches the HashCode pretty easily, but it probably won't be the string that was originally hashed.
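The pigeonhole argument can be made concrete with a deliberately tiny hash. The sketch below uses a made-up 8-bit ToyHash (it is not string.GetHashCode) so that a collision is guaranteed after at most 257 distinct inputs; all names here are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class PigeonholeDemo
{
    // Toy hash with only 256 possible values. NOT the real string.GetHashCode.
    public static byte ToyHash(string s)
    {
        int h = 17;
        foreach (char c in s) h = unchecked(h * 31 + c);
        return (byte)(h & 0xFF);
    }

    // Tries 257 distinct strings; two of them must share one of the 256 hash values.
    public static KeyValuePair<string, string> FindCollision()
    {
        Dictionary<byte, string> seen = new Dictionary<byte, string>();
        for (int i = 0; i <= 256; i++)
        {
            string candidate = "s" + i;
            byte hash = ToyHash(candidate);
            string earlier;
            if (seen.TryGetValue(hash, out earlier))
            {
                return new KeyValuePair<string, string>(earlier, candidate);
            }
            seen[hash] = candidate;
        }
        throw new InvalidOperationException("Unreachable: 257 inputs cannot fit into 256 buckets.");
    }
}
```

Given only the shared hash value, there is no way to tell which of the two strings produced it. The same reasoning applies to a 32-bit hash code, just with 2^32 buckets instead of 256.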
GetHashCode() is designed for use in hashtables and as such is just a performance trick. It allows quick sorting of the input value into buckets, and nothing more. Its value is implementation-defined, so another .NET version, or even another instance of the same application, might return a different value. return 0; is a valid (but not recommended) implementation of GetHashCode, and would not yield any information about the original string.
many of us would like to be able to do that :=)