Approval Tests image comparison with mask - C#

Is it possible to compare two images with a mask for areas that do not need to be compared?
I managed to get it working with a basic file comparison:
[UseReporter(typeof(BeyondCompareReporter))]
public void ThenThePageShouldMatchTheApprovedVersion()
{
    SaveScreenshot("page1");
    Approvals.VerifyFile(@"C:\page1.png");
}
But I would like to create a mask for the areas I expect to change. Is this possible with ApprovalTests, or will I need to modify the screenshot and manually apply the mask before comparing with the approved file? Or is it possible to write your own validators?

It's not possible to mask areas so that the comparer will skip them.
However, it is very easy to mask the area yourself (i.e., place a black square over the area before you call Verify).
Alternatively, you can usually mock out the variable that is changing.
Details on the comparer:
ApprovalsFileComparer is a very stupid comparer. It knows nothing about file formats and has no idea what an image is; it simply compares byte by byte. This simplicity allows it to work everywhere, but removes the ability to be smart about the content. That is usually not an issue, because the reporters are very, very smart: they can render, compare, do subtractive diffs, and the like.
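If it helps, here is a minimal sketch of the "black square" approach using System.Drawing; the path, rectangle, and output file name are placeholders for your own screenshot, not part of ApprovalTests itself, and it assumes a non-indexed pixel format:

// Sketch: black out a known-volatile region before handing the file to Approvals.
// The path and rectangle are placeholders; adjust them to your own screenshot.
using System.Drawing;
using System.Drawing.Imaging;

public static void MaskRegion(string path, Rectangle maskArea)
{
    using (var bitmap = new Bitmap(path))
    using (var graphics = Graphics.FromImage(bitmap))
    {
        graphics.FillRectangle(Brushes.Black, maskArea);    // cover the area that is expected to change
        bitmap.Save(path + ".masked.png", ImageFormat.Png); // save a masked copy for verification
    }
}

// Usage before the Verify call:
// MaskRegion(@"C:\page1.png", new Rectangle(100, 50, 200, 30));
// Approvals.VerifyFile(@"C:\page1.png.masked.png");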
Happy Testing!

Related

Powerpoint "Save As Picture" from C# Microsoft.Office.Interop.PowerPoint

My question is pretty similar to this one and I'm afraid the answer is the same... I want to save all the shapes/images on a slide as a single png (or jpeg). Programmatically, I get as far as
slide.Shapes.SelectAll();
but I don't see a way to save it as an image. Is this possible? If not, any other suggestions, hopefully with examples? (Not VBA; I need to automate the whole conversion.)
There was a reference to OpenXML in the other post, but I'm not even sure how to pull that in.
I don't know how you'd do this in C#, but I'd guess that you'd make use of the same methods as you would with VBA, where you can do:
ActiveWindow.Selection.ShapeRange.Export "c:\temp\delete-me.jpg", ppShapeFormatJPG
ppShapeFormatJPG is a PowerPoint constant, a VBA Long = 1; IIRC that'd be an Integer in C#.
The method can also take two more optional parameters, ScaleWidth and ScaleHeight, which govern the width and height of the exported image in undocumented ways. By default, with no parameters supplied, I get exports at 72 dpi. Larger numbers result in higher pixel-count exports but distorted proportions. I'm sure there's some strange logic to it, but it escapes me; all hints welcome!
There's a third optional parameter, ExportMode. In my tests, it makes no difference whether you supply it, nor which of the available values you choose.
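If it helps, here is a hedged C# sketch of the same call via Microsoft.Office.Interop.PowerPoint. Export is a hidden member of the object model, so treat the exact signature as an assumption that may vary by interop/PowerPoint version; older tooling may require explicit values for the optional scale/mode arguments.

// Hedged sketch: mirror the question's starting point (select everything on the
// slide), then export the resulting ShapeRange as a single image.
using PowerPoint = Microsoft.Office.Interop.PowerPoint;

static void ExportSelectedShapes(PowerPoint.Application app, PowerPoint.Slide slide, string outputPath)
{
    slide.Shapes.SelectAll();
    PowerPoint.ShapeRange range = app.ActiveWindow.Selection.ShapeRange;
    range.Export(outputPath, PowerPoint.PpShapeFormat.ppShapeFormatJPG);
}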

How to detect keyword stuffing?

We are working on a kind of document search engine, primarily focused on indexing user-submitted MS Word documents.
We have noticed keyword-stuffing abuse.
We have identified two main kinds of abuse:
Repeating the same term again and again
Many irrelevant terms added to the document en masse
Both forms of abuse are enabled either by adding text with the same font colour as the background colour of the document, or by setting the font size to something like 1px.
Determining whether the background colour is the same as the text colour is tricky, given the intricacies of MS Word layouts. The same goes for font size: any cut-off seems potentially arbitrary, and we may accidentally remove valid text if we set it too large.
My question is: are there any standardized pre-processing or statistical analysis techniques that could be used to reduce the impact of this kind of keyword stuffing?
Any guidance would be appreciated!
There's a surprisingly simple solution to your problem using the notion of compressibility.
If you convert your Word documents to text (you can easily do that on the fly), you can then compress them (for example, using the zlib library, which is free) and look at the compression ratios. Normal text documents usually have a compression ratio of around 2, so any significant deviation would mean that they have been "stuffed". The analysis is extremely easy: I have analyzed around 100k texts and it takes only about a minute using Python.
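For instance, a minimal C# sketch of the same idea using the BCL's DeflateStream instead of zlib; the 2.0 baseline and the 4.0 flagging threshold are illustrative, not tuned:

// Compute the ratio of original size to compressed size; heavily repeated text
// compresses far better than normal prose.
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static double CompressionRatio(string text)
{
    byte[] raw = Encoding.UTF8.GetBytes(text);
    using (var output = new MemoryStream())
    {
        using (var deflate = new DeflateStream(output, CompressionMode.Compress, leaveOpen: true))
        {
            deflate.Write(raw, 0, raw.Length);
        }
        return (double)raw.Length / output.Length;
    }
}

// Usage: flag documents whose ratio deviates strongly from the ~2.0 typical of normal prose.
// bool suspicious = CompressionRatio(documentText) > 4.0;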
Another option is to look at the statistical properties of the documents/words. In order to do that, you need to have a sample of "clean" documents and calculate the mean frequency of the distinct words as well as their standard deviations.
After you have done that, you can take a new document and compare it against the mean and the deviation. Stuffed documents will be those with a few words whose frequency deviates very strongly from the mean for that word (documents where one or two words are repeated many times), or with many words showing high deviations (documents with blocks of text repeated).
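If it helps, a rough C# sketch of that comparison; the baseline mean/standard-deviation dictionaries are assumed to come from your sample of clean documents, and the 3.0 z-score cut-off is only illustrative:

// Flag words whose count in this document sits far above the baseline statistics.
using System;
using System.Collections.Generic;
using System.Linq;

static IEnumerable<string> SuspiciousWords(
    IEnumerable<string> documentWords,
    IReadOnlyDictionary<string, double> baselineMean,
    IReadOnlyDictionary<string, double> baselineStdDev)
{
    var counts = documentWords
        .GroupBy(w => w.ToLowerInvariant())
        .ToDictionary(g => g.Key, g => (double)g.Count());

    foreach (var pair in counts)
    {
        double mean = baselineMean.TryGetValue(pair.Key, out var m) ? m : 0.0;
        double sd = baselineStdDev.TryGetValue(pair.Key, out var s) ? Math.Max(s, 1e-6) : 1.0;
        if ((pair.Value - mean) / sd > 3.0)     // far above what clean documents show
            yield return pair.Key;
    }
}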
Here are some useful links about compressibility:
http://www.ra.ethz.ch/cdstore/www2006/devel-www2006.ecs.soton.ac.uk/programme/files/pdf/3052.pdf
http://www.ispras.ru/ru/proceedings/docs/2011/21/isp_21_2011_277.pdf
You could also probably use the concept of entropy, for example Shannon Entropy Calculation http://code.activestate.com/recipes/577476-shannon-entropy-calculation/
Another possible solution would be to use part-of-speech (POS) tagging. I reckon that the average percentage of nouns is similar across "normal" documents (37 percent according to http://www.ingentaconnect.com/content/jbp/ijcl/2007/00000012/00000001/art00004?crawler=true). If the percentage were higher or lower for some POS tags, then you could possibly detect "stuffed" documents.
As Chris Sinclair commented on your question, unless you have Google-level algorithms (and even they get it wrong, hence their appeal process), it's best to flag likely keyword-stuffed documents for further human review.
If a page has 100 words and you count the occurrences of each keyword (which renders stuffing via 1px text or background colour irrelevant), you get a keyword density. There is no hard-and-fast rule for a certain percentage always being keyword stuffing; generally 3-7% is normal. Perhaps if you detect 10% or more, you flag the document as "potentially stuffed" and set it aside for human review.
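As a sketch (the keyword set and the 10% threshold are placeholders to tune):

// Count keyword occurrences and flag documents whose density crosses a threshold.
// Keywords are assumed to be stored lowercase.
using System.Collections.Generic;
using System.Linq;

static bool IsPotentiallyStuffed(IReadOnlyList<string> words, ISet<string> keywords)
{
    if (words.Count == 0) return false;
    int keywordHits = words.Count(w => keywords.Contains(w.ToLowerInvariant()));
    double density = (double)keywordHits / words.Count;   // 3-7% is typical, per the answer above
    return density >= 0.10;                                // flag for human review
}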
Furthermore consider these scenarios (taken from here):
Lists of phone numbers without substantial added value
Blocks of text listing cities and states a webpage is trying to rank for
and what the context of a keyword is.
Pretty damn difficult to do correctly.
Detect tag abuse with forecolor/backcolor detection, as you already do.
For size detection, calculate the average text size and remove the outliers.
Also set predefined limits on the text size (as you already do).
Next up is the structure of the tag "blobs".
For your first point, you can just count the words, and if one occurs too often (maybe 5x more often than the second most frequent word) you can flag it as a repeated tag; see the sketch below.
When adding tags en masse, the user often adds them all in one place, so you can check whether known "fraud tags" appear next to each other (maybe with one or two words in between).
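A sketch of the repeated-tag check mentioned above, with the 5x ratio as the same rough rule of thumb:

// Flag a document if its most frequent word occurs far more often (here, 5x)
// than the second most frequent word.
using System.Collections.Generic;
using System.Linq;

static bool HasRepeatedTag(IEnumerable<string> words)
{
    var topTwo = words
        .GroupBy(w => w.ToLowerInvariant())
        .Select(g => g.Count())
        .OrderByDescending(c => c)
        .Take(2)
        .ToList();

    return topTwo.Count == 2 && topTwo[0] >= topTwo[1] * 5;
}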
If you could identify at least some common "fraud tags" and want to get a bit more advanced, then you could do the following:
Split the document into parts with the same text size / font and analyze each part separately. For better results, group parts that use nearly the same font/size, not only those that have EXACTLY the same font/size.
Count the occurrences of each known tag, and when some limit set by you is exceeded, that part of the document is removed or the document is flagged as "bad" (as in "uses excessive tags").
No matter how advanced your detection is, as soon as people know it's there and more or less know how it works, they will find ways to circumvent it.
When that happens you should just flag the offending documents and look through them yourself. Then, if you notice that your detection algorithm produced a false positive, you improve it.
If you notice a pattern where the common stuffers always use a font size below a certain value, e.g. 1-5, which is not really readable, then you could assume that that is the "stuffed" part.
You can then go on to check whether the font colour is also the same as the background colour and remove that section.

Searching for partial substring within string in C#

Okay, so I'm trying to make a basic malware scanner in C#. My question is: say I have the hex signature for a particular bit of code.
For example
{
System.IO.File.Delete(@"C:\Users\Public\DeleteTest\test.txt");
}
//Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c746573742e74787422293b
Gets Changed to -
{
System.IO.File.Delete(@"C:\Users\Public\DeleteTest\notatest.txt");
}
//Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c6e6f7461746573742e74787422293b
Keep in mind these bits will be somewhere within the entire hex of the program. How could I go about taking my base signature and looking for partial matches that, say, have a 90% match and therefore get flagged?
I would use a wildcard, but that wouldn't work for slightly more complex things where the code might be written slightly differently but the majority would be the same. So is there a way I can do a percentage match for a substring? I was looking into the Levenshtein distance, but I don't see how I'd apply it to this scenario.
Thanks in advance for any input
Using an edit distance would be fine. You can take two strings and calculate the edit distance, which will be an integer value denoting how many operations are needed to take one string to the other. You set your own threshold based off that number.
For example, you may statically set that if the distance is less than five edits, the change is relevant.
You could also take the length of the string you are comparing and use a percentage of that. Your example is 36 characters long, so (int)(input.Length * 0.88m) would be a valid threshold.
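For reference, a minimal sketch of that combination; the Levenshtein implementation is the standard dynamic-programming one, and the 10%-of-signature-length threshold is only an example:

// Standard Levenshtein distance plus a percentage-based cut-off.
using System;

static int Levenshtein(string a, string b)
{
    var d = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
    for (int j = 0; j <= b.Length; j++) d[0, j] = j;

    for (int i = 1; i <= a.Length; i++)
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);
        }
    return d[a.Length, b.Length];
}

static bool IsCloseMatch(string signature, string candidate)
{
    int threshold = (int)(signature.Length * 0.10);   // allow roughly 10% edits
    return Levenshtein(signature, candidate) <= threshold;
}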
First, your program bits should match EXACTLY, or else the program has been modified or is corrupt. Generally, you will store an MD5 hash of the original binary and check the MD5 against new versions to see if they are "the same enough" (MD5 can't guarantee a 100% match).
Beyond this, in order to detect malware in a random binary, you must know what sort of patterns to look for. For example, if I know a piece of malware injects code with some binary XYZ, I will look for XYZ in the bits of the executable. Patterns get much more complex than that, of course, as the malware bits can be spread out in chunks. What is more interesting is that some viruses are self-morphing: each time they run, they modify themselves, meaning the scanner does not know an exact pattern to find. In these cases, the scanner must know the types of derivatives that can be produced and look for all of them.
In terms of finding a % match, this operation is very time consuming unless you have constraints. By comparing 2 strings, you cannot tell which pieces were removed, added, or replaced. For instance, if I have a starting string 'ABCD', is 'AABCDD' a 100% match or less since content has been added? What about 'ABCDABCD'; here it matches twice. How about 'AXBXCXD'? What about 'CDAB'?
There are many DIFF tools in existence that can tell you what pieces of a file have been changed (which can lead to a %). Unfortunately, none of them are perfect because of the issues that I described above. You will find that you have false negatives, false positives, etc. This may be 'good enough' for you.
Before you can identify a specific algorithm that will work for you, you will have to decide what the restrictions of your search will be. Otherwise, your scan will be NP-hard, which leads to unreasonable running times (your scanner may run all day just to check one file).
I suggest you look into Levenshtein distance and Damerau-Levenshtein distance.
The former tells you how many insert/delete/substitute operations are needed to turn one string into another; the latter additionally counts transpositions of adjacent characters.
I use these quite a lot when writing programs where users can search for things, but they may not know the exact spelling.
There are code examples on both articles.

Xna custom VertexElementFormat and VertexElementUsage?

I'm a little confused and not sure how or whether this is possible, but is it possible to create custom ones? VertexElementFormat doesn't contain the type I want to use, which is Byte3, and I have no idea how I could add that to it. :/
No, you can't. For your purposes, you can probably use Byte4 or Color.
These settings tell the GPU how to decode the data you send it and place it in shader registers. They are not extensible.
Note that you can have more than one Position or more than one Color (for examples) by setting the UsageIndex in your VertexElement.
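For illustration, a hedged XNA 4.0-style sketch of a custom vertex that packs byte data into a Byte4 (the closest available format to Byte3) and uses UsageIndex to carry a second Color channel; the struct layout and field names are assumptions for your own data:

// Custom vertex type: a Vector3 position followed by a Byte4 at offset 12,
// declared as Color with UsageIndex 1 so it does not collide with Color 0.
using Microsoft.Xna.Framework;
using Microsoft.Xna.Framework.Graphics;
using Microsoft.Xna.Framework.Graphics.PackedVector;

public struct PackedVertex : IVertexType
{
    public Vector3 Position;
    public Byte4 PackedData;   // three meaningful bytes plus one byte of padding

    public static readonly VertexDeclaration VertexDeclaration = new VertexDeclaration(
        new VertexElement(0, VertexElementFormat.Vector3, VertexElementUsage.Position, 0),
        new VertexElement(12, VertexElementFormat.Byte4, VertexElementUsage.Color, 1));

    VertexDeclaration IVertexType.VertexDeclaration { get { return VertexDeclaration; } }
}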

Are there common methods for hashing an input file to a fixed set of values?

Let's say I'm trying to generate a monster for use in a roleplaying game from an arbitrary piece of input data. Think Barcode Battler or a more-recent iPod game whose name escapes me.
It seems to me like the most straightforward way to generate a monster would be to use a hash function on the input data (say, an MP3 file) and use that hash value to pick from some predetermined set of monsters, or use pieces of the hash value to generate statistics for a custom monster.
The question is, are there obvious methods for taking an arbitrary piece of input data and hashing it to one of a fixed set of values? The primary goal of hashing algorithms is, after all, to avoid collisions. Instead, I'm suggesting that we want to guarantee them - that, given a predetermined set of 100 monsters, we want any given MP3 file to map to one of them.
This question isn't bound to a particular language, but I'm working in C#, so that would be my preference for discussion. Thanks!
Hash the file using any hash function of your choice, convert the result into an integer, and take the result modulo 100.
monsterId = hashResult % 100;
Note that if you later decide to add a new monster and change the code to % 101, nearly all hashes will suddenly map to different monsters.
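A minimal sketch of that approach; MD5 is used here only because it is readily available (any hash function would do), and the path is a placeholder:

// Hash the file, fold the digest into a non-negative integer, and take it
// modulo the number of monsters.
using System;
using System.IO;
using System.Security.Cryptography;

static int MonsterIdFromFile(string path, int monsterCount)
{
    using (var md5 = MD5.Create())
    using (var stream = File.OpenRead(path))
    {
        byte[] digest = md5.ComputeHash(stream);
        int hashResult = BitConverter.ToInt32(digest, 0) & int.MaxValue;  // clear the sign bit
        return hashResult % monsterCount;
    }
}

// Usage: int monsterId = MonsterIdFromFile(@"C:\music\song.mp3", 100);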
Okay, that's a very nice question. I would say: don't use a hash, because it won't give the player a nice way to predict patterns. From cognitive theory we know that one thing that makes games interesting is that the player can learn by trial and error. So if the player gives as input an image of a red dragon, and then another image of a red dragon with slightly different pixels, he would like the same monster to appear, right? If you use hashes, that would not be the case.
Instead, I would recommend doing much simpler things. Imagine that your raw piece of input is just a byte[]; it is itself already a list of numbers. Admittedly it's only a list of numbers from 0 to 255, so if you, for example, take an average, you get one number from 0 to 255. You could map that to a number of monsters already; if you need more, you can read pairs of bytes and compose an Int16, which lets you go up to 65536 possible monsters. :)
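A tiny sketch of the averaging idea, where the monster count is whatever your table holds:

// Average the raw bytes (stable for similar inputs) and scale onto the monster table.
static int MonsterIdFromAverage(byte[] data, int monsterCount)
{
    if (data.Length == 0) return 0;

    long sum = 0;
    foreach (byte b in data) sum += b;

    double average = (double)sum / data.Length;      // 0..255, similar inputs give similar averages
    return (int)(average / 256.0 * monsterCount);    // map onto 0..monsterCount-1
}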
You can use the MD5, SHA1, or SHA2 hash of a file as a unique fingerprint for the file. Each successive hash function gives you a larger, less collision-prone fingerprint, and each can be obtained from library functions already in the base class libraries.
In truth, you could probably hash a much smaller portion of the file, for instance the first 1-3 MB, and still get a fairly unique fingerprint, without the expense of processing a larger file (like an AVI).
Look in the System.Security.Cryptography namespace at the MD5CryptoServiceProvider class for an example of how to generate an MD5 from a byte sequence.
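A hedged sketch of the partial-fingerprint idea, hashing only the first few megabytes of the file; the 3 MB limit is illustrative:

// Hash only the first maxBytes of the file and return a hex fingerprint.
using System;
using System.IO;
using System.Security.Cryptography;

static string PartialFingerprint(string path, int maxBytes = 3 * 1024 * 1024)
{
    using (var md5 = MD5.Create())
    using (var stream = File.OpenRead(path))
    {
        int size = (int)Math.Min(stream.Length, maxBytes);
        byte[] buffer = new byte[size];
        int read = stream.Read(buffer, 0, size);
        byte[] digest = md5.ComputeHash(buffer, 0, read);
        return BitConverter.ToString(digest).Replace("-", "");   // hex string fingerprint
    }
}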
Edit: If you want a hash that collides relatively often, you can use CRC-2, -4, -6, -8, -16, or -32, which will collide fairly frequently (especially CRC-2 :)) but will always be the same for the same file. It is easy to generate.
