We are working on a kind of document search engine, primarily focused on indexing user-submitted MS Word documents.
We have noticed keyword-stuffing abuse.
We have determined two main kinds of abuse:
Repeating the same term, again and again
Many irrelevant terms added to the document en masse
These two forms of abuse are enabled either by adding text with the same font colour as the background colour of the document, or by setting the font size to something like 1px.
Determining whether the background colour is the same as the text colour is tricky, given the intricacies of MS Word layouts. The same goes for font size: any cut-off seems potentially arbitrary, and we may accidentally remove valid text if we set the cut-off too large.
My question is: are there any standardized pre-processing or statistical analysis techniques that could be used to reduce the impact of this kind of keyword stuffing?
Any guidance would be appreciated!
There's a surprisingly simple solution to your problem using the notion of compressibility.
If you convert your Word documents to text (you can easily do that on the fly), you can then compress them (for example, with the free zlib library) and look at the compression ratios. Normal text documents usually have a compression ratio of around 2, so any significant deviation suggests that they have been "stuffed". The analysis is extremely easy: I have analyzed around 100k texts and it takes only around 1 minute using Python.
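The compression check described above could be sketched roughly as follows. This is a minimal illustration, not the answerer's actual script: the function names and the `max_ratio` threshold are assumptions, and the threshold in particular should be calibrated on documents you know to be clean.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Return original size divided by zlib-compressed size (UTF-8 bytes)."""
    raw = text.encode("utf-8")
    if not raw:
        return 1.0
    return len(raw) / len(zlib.compress(raw, 9))


def looks_stuffed(text: str, max_ratio: float = 4.0) -> bool:
    # max_ratio is an illustrative value only (an assumption, not taken
    # from the answer); heavily repeated text compresses far better than
    # ordinary prose, so its ratio ends up well above normal.
    return compression_ratio(text) > max_ratio
```

Very short documents compress poorly regardless of content, so a minimum length check before applying the ratio is probably sensible.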
Another option is to look at the statistical properties of the documents/words. In order to do that, you need to have a sample of "clean" documents and calculate the mean frequency of the distinct words as well as their standard deviations.
After you have done that, you can take a new document and compare it against the mean and the deviation. Stuffed documents will be characterized as those with a few words with a very high deviation from the mean for that word (documents where one or two words are repeated several times) or many words with high deviations (documents with blocks of text repeated).
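A rough sketch of that comparison, assuming whitespace tokenisation and relative word frequencies. The helper names, the `z_limit` of 3 and the `min_std` floor are all illustrative assumptions, and words missing from the clean sample are simply skipped here.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_freqs(text):
    """Map each word to its share of all words in the text."""
    words = text.lower().split()
    total = len(words)
    return {w: c / total for w, c in Counter(words).items()} if total else {}


def build_baseline(clean_docs):
    """Per-word (mean, standard deviation) of relative frequency over clean docs."""
    per_doc = [relative_freqs(d) for d in clean_docs]
    vocab = set().union(*per_doc)
    return {
        w: (mean(f.get(w, 0.0) for f in per_doc),
            pstdev([f.get(w, 0.0) for f in per_doc]))
        for w in vocab
    }


def outlier_words(text, baseline, z_limit=3.0, min_std=1e-3):
    """Return words whose frequency is far above the clean-sample mean."""
    flagged = {}
    for w, freq in relative_freqs(text).items():
        if w not in baseline:
            continue  # assumption: unknown words are ignored in this sketch
        mu, sigma = baseline[w]
        z = (freq - mu) / max(sigma, min_std)  # floor avoids division by zero
        if z > z_limit:
            flagged[w] = z
    return flagged
```

One or two flagged words would point at single-term repetition, while many flagged words would point at repeated blocks of text.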
Here are some useful links about compressibility:
http://www.ra.ethz.ch/cdstore/www2006/devel-www2006.ecs.soton.ac.uk/programme/files/pdf/3052.pdf
http://www.ispras.ru/ru/proceedings/docs/2011/21/isp_21_2011_277.pdf
You could also probably use the concept of entropy, for example Shannon Entropy Calculation http://code.activestate.com/recipes/577476-shannon-entropy-calculation/
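As a small illustration of the entropy idea, here is a word-level Shannon entropy in bits per word. It is a sketch written for this thread rather than the linked recipe, and the choice of whitespace tokenisation is an assumption.

```python
import math
from collections import Counter


def word_entropy(text: str) -> float:
    """Shannon entropy of the word distribution, in bits per word."""
    words = text.lower().split()
    if not words:
        return 0.0
    total = len(words)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(words).values())
```

A document that repeats one term over and over has entropy close to zero, whereas ordinary prose spreads its probability over many words and scores much higher.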
Another possible solution would be to use part-of-speech (POS) tagging. I reckon that the average percentage of nouns is similar across "normal" documents (37% according to http://www.ingentaconnect.com/content/jbp/ijcl/2007/00000012/00000001/art00004?crawler=true). If the percentage were higher or lower for some POS tags, then you could possibly detect "stuffed" documents.
As Chris Sinclair commented on your question, unless you have Google-level algorithms (and even they get it wrong, and therefore have an appeal process), it's best to flag likely keyword-stuffed documents for further human review...
If a page has 100 words, and you search through the page counting the occurrences of keywords (which makes stuffing by 1px or bgcolor irrelevant), you get a keyword density count. There really is no hard and fast rule for a certain percentage 'always' being keyword stuffing; generally 3-7% is normal. Perhaps if you detect 10% or more, you flag the document as 'potentially stuffed' and set it aside for human review.
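The density check could look something like this. The 10% threshold comes from the paragraph above; the function names, the tokenising regex and the optional `ignore` set (for common words such as "the") are assumptions made for the sketch.

```python
import re
from collections import Counter


def keyword_density(text, ignore=frozenset()):
    """Map each word to its share of all counted words in the text."""
    words = [w for w in re.findall(r"[a-z0-9']+", text.lower())
             if w not in ignore]
    total = len(words)
    return {w: c / total for w, c in Counter(words).items()} if total else {}


def terms_to_review(text, threshold=0.10, ignore=frozenset()):
    """Return the words at or above the density threshold, for human review."""
    density = keyword_density(text, ignore)
    return sorted(w for w, d in density.items() if d >= threshold)
```

An empty result means nothing crossed the threshold; a non-empty one is only a reason to queue the document for review, not proof of stuffing.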
Furthermore consider these scenarios (taken from here):
Lists of phone numbers without substantial added value
Blocks of text listing cities and states a webpage is trying to rank for
and what the context of a keyword is.
Pretty damn difficult to do correctly.
Detect tag-abuse with forecolor/backcolor detection like you already do.
For size detection calculate the average text size and remove the outliers.
Also set predefined limits on the textsize (like you already do).
Next up is the structure of the tag "blobs".
For your first point you can just count the words and if one occurs too often (maybe 5x more often than the 2nd word) you can flag it as a repeated tag.
When adding tags en masse, the user often adds them all in one place, so you can check whether known "fraud tags" appear next to each other (maybe with one or two words in between).
If you could identify at least some common "fraud tags" and want to get a bit more advanced then you could do the following:
Split the document into parts with the same textsize / font and analyze each part separately. For better results group parts that use nearly the same font/size, not only those that have EXACTLY the same font/size.
Count the occurrences of each known tag; when some limit set by you is exceeded, this part of the document is removed or the document is flagged as "bad" (as in "uses excessive tags").
No matter how advanced your detection is, as soon as people know it's there and more or less know how it works, they will find ways to circumvent it.
When that happens, you should just flag the offending documents and look through them yourself. Then, if you notice that your detection algorithm produced a false positive, you improve it.
If you notice a pattern in which the common stuffers always use a font size below a certain value, e.g. 1-5, which is not really readable, then you could assume that this is the "stuffed part".
You can then go on to check whether the font colour is also the same as the background colour and remove that section.
I've Google'd and read quite a bit on QR codes and the maximum data that can be used based on the various settings, all of it being in tabular format. I can't seem to find anything giving a formula or a proper explanation of how these values are calculated.
What I would like to do is this:
Present the user with a form, allowing them to choose Format, EC & Version.
Then they can type in some data and generate a QR code.
Done deal. That part is easy.
The addition I would like to include is a "remaining character count" so that they (the user) can see how much more data they can type in, as well as what effect the properties have on the storage capacity of the QR code.
Does anyone know where I can find the formula(s)? Or do I need to purchase ISO 18004:2006?
A formula to calculate the amount of data you could put in a QR code would be quite complex to make, not to mention that it would need some approximations for the calculation to be possible. The formula would have to calculate the number of modules dedicated to the data in your QR code based on its version, and then calculate how many codewords (which are sets of 8 modules) will be used for the error correction.
To calculate the number of modules that will be used for the data, you need to know how many modules will be used for the function patterns. While this is not a problem for the three finder patterns, the timing patterns or the version/format information, there will be a problem with the alignment patterns, as their number depends on the QR code's version, meaning you would have to use a table at that point anyway.
For the second part, I have to say I don't know how to calculate the number of error-correcting codewords based on the correction capacity. For some reason, more error-correcting codewords are used than should be needed to match the error correction capacity: for example, a 6-H QR code can correct up to 32.6% of the data, instead of the 30% set by the H correction level.
In any case, as you can see a formula would be quite complex to implement. Using a table like already suggested is probably the best thing you could do.
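A table-driven "remaining character count" might be sketched like this. Only versions 1 and 2 are filled in, using the character capacities from the standard's published tables; the function names are illustrative, and the sketch assumes the whole input is encoded in a single mode and that byte mode uses UTF-8, both of which are simplifications.

```python
# (version, error correction level) -> (numeric, alphanumeric, byte) capacity.
# Only versions 1 and 2 are listed; the remaining versions would be added
# from the same capacity tables.
CAPACITY = {
    (1, "L"): (41, 25, 17), (1, "M"): (34, 20, 14),
    (1, "Q"): (27, 16, 11), (1, "H"): (17, 10, 7),
    (2, "L"): (77, 47, 32), (2, "M"): (63, 38, 26),
    (2, "Q"): (48, 29, 20), (2, "H"): (34, 20, 14),
}
MODE_INDEX = {"numeric": 0, "alphanumeric": 1, "byte": 2}
ALPHANUMERIC = set("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:")


def pick_mode(data: str) -> str:
    """Choose the most compact single mode able to hold every character."""
    if all(ch in "0123456789" for ch in data):
        return "numeric"
    if all(ch in ALPHANUMERIC for ch in data):
        return "alphanumeric"
    return "byte"


def remaining(data: str, version: int, ec_level: str) -> int:
    """Characters (or bytes, in byte mode) still available; negative if over."""
    mode = pick_mode(data)
    capacity = CAPACITY[(version, ec_level)][MODE_INDEX[mode]]
    # Assumption: byte mode counts UTF-8 bytes rather than characters.
    used = len(data.encode("utf-8")) if mode == "byte" else len(data)
    return capacity - used
```

Because the mode is re-picked on every call, typing a single lowercase letter into an otherwise alphanumeric string drops the whole input to byte mode, which is exactly the kind of effect a live counter would make visible to the user.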
I wrote the original AIM specification for QR Code back in the '90s for Denso Corporation, and was also project editor for both editions of the ISO/IEC 18004 standard. It was felt to be much easier for people producing code printing software to use a look-up table rather than calculate capacities from a formula - no easy job as there are several independent variables that have to be taken into account iteratively when parsing the text to be encoded to minimise its length in bits, in order to achieve the smallest symbol. The most crucial factor is the mix of characters in the data, the sequence and lengths of sub-strings of numeric, alphanumeric, Kanji data, with the overhead needed to signal each change of character set, then the required level of error correction. I did produce a guidance section for this which is contained in the ISO standard.
The storage capacity is determined by the QR mode and the version/type that you are using. More specifically, the calculation is based on how 'compressible' the characters are and which encoding the QR generator is allowed to use on the content present.
More information can be found at http://en.wikipedia.org/wiki/QR_code#Storage
I have a C# program that lets me use my microphone; when I speak, it carries out commands and talks back. For example, when I say "What's the weather tomorrow?" it will reply with tomorrow's weather.
The only problem is, I have to type out every phrase I want to say and have it pre-recorded. So if I want to ask for the weather, I HAVE to say it exactly as I coded it, with no variations. I am wondering if there is code to change this?
I want to be able to say "What's the weather for tomorrow", "what's tomorrow's weather" or "can you tell me tomorrow's weather" and have it tell me the next day's weather, but I don't want to have to type each phrase into the code. I have seen something about e.Result.Alternates; is that what I need to use?
This cannot be done without involving linguistic resources. Let me explain what I mean by this.
As you may have noticed, your C# program only recognizes pre-recorded phrases, and only if you say the exact same words. (As an aside, this is quite an achievement in itself, because you can hardly say a sentence twice without altering it a bit. Small changes, e.g. in sound frequency or length, might not be relevant to your colleagues, but they matter to your program.)
Therefore, you need to incorporate a kind of linguistic resource in your program. In other words, make it "understand" facts about human language. Two suggestions of increasing complexity follow. All approaches assume that your tool is capable of tokenizing an audio input stream in a sensible way, i.e. extracting words from it.
Pattern matching
To avoid hard-coding the sentences like
Tell me about the weather.
What's the weather tomorrow?
Weather report!
you can instead define a pattern that matches any of those sentences:
if a sentence contains "weather", then output a weather report
This can be further refined in manifold ways, e.g. :
if a sentence contains "weather" and "tomorrow", output tomorrow's forecast.
if a sentence contains "weather" and "Bristol", output a forecast for Bristol
This kind of knowledge must be put into your program explicitly, for instance in the form of a dictionary or lookup table.
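The lookup-table idea could be sketched as below, assuming the recognizer already hands you the sentence as text. The rule list, the intent names and the prefix matching (so that "tomorrows" still matches "tomorrow") are illustrative assumptions, and the sketch is in Python rather than C# purely to keep it short.

```python
import re

# More specific rules come first, so "weather" + "tomorrow" wins over
# plain "weather". Keywords and intent names are made up for this example.
RULES = [
    (("weather", "tomorrow"), "forecast_tomorrow"),
    (("weather", "bristol"), "forecast_bristol"),
    (("weather",), "weather_report"),
]


def match_intent(sentence):
    """Return the first intent whose keywords all occur in the sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())

    def has(keyword):
        # Prefix match, so "tomorrows" and "tomorrow's" both count.
        return any(w.startswith(keyword) for w in words)

    for keywords, intent in RULES:
        if all(has(k) for k in keywords):
            return intent
    return None
```

With these rules, "Whats the weather for tomorrow" and "can you tell me tomorrows weather" both resolve to the same intent without either phrase being listed explicitly.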
Measuring Similarity
If you plan to spend more time on this, you could implement a means for finding the similarity between input sentences. There are many approaches to this as well, but a prominent one is a bag of words, represented as a vector.
In this model, each sentence is represented as a vector, each word in it present as a dimension of the vector. For example, the sentence "I hate green apples" could be represented as
I = 1
hate = 1
green = 1
apples = 1
red = 0
you = 0
Note that the words that do not occur in this particular sentence, but in other phrases the program is likely to encounter, also represent dimensions (for example the red = 0).
The big advantage of this approach is that the similarity of vectors can be easily computed, no matter how multi-dimensional they are. There are several techniques that estimate similarity, one of them is cosine similarity (see for example http://en.wikipedia.org/wiki/Cosine_similarity).
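A minimal sketch of the bag-of-words comparison with cosine similarity, assuming plain whitespace tokenisation (the function names are illustrative). Words that are absent from a sentence simply have a count of zero, which matches the red = 0 entry above.

```python
import math
from collections import Counter


def bag_of_words(sentence):
    """Count each lower-cased word; missing words implicitly count as 0."""
    return Counter(sentence.lower().split())


def cosine_similarity(sentence_a, sentence_b):
    """Cosine of the angle between the two word-count vectors."""
    a, b = bag_of_words(sentence_a), bag_of_words(sentence_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

To use it for commands, you could compare the spoken sentence against one example phrase per command and pick the command with the highest score, provided the score clears some minimum you choose.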
On a more general note, there are many other considerations to be made of course.
For example, some words might be utterly irrelevant to the message you want to convey, as in the following sentence:
I want you to output a weather report.
Here, at least "I", "you", "to" and "a" could be done away with without damaging the basic semantics of the sentence. Such words are called stop words and are discarded early in many tools that perform speech-to-text analysis.
Also note that we started out assuming that your program reliably identifies sound input. In reality, no tool is capable of infallibly identifying speech.
Humans tend to forget that sound actually exists without cues as to where word or sentence boundaries are. This makes so-called disambiguation of input a gargantuan task that is easily underestimated - and ambiguity one of the hardest problems of computational linguistics in general.
The code won't be able to judge that on its own! You need to split the command into an array of words, such as:
Tomorrow
Weather
What
This way, you can compare each word with the terms your program already knows: for example, the command (what), the type (weather) and the time (tomorrow).
It is better to read and understand each word than to guess at the whole phrase. Google does much the same: it breaks the string down and compares the parts.
Okay, so I'm trying to make a basic malware scanner in C#. My question is: say I have the hex signature for a particular bit of code.
For example
{
System.IO.File.Delete(@"C:\Users\Public\DeleteTest\test.txt");
}
//Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c746573742e74787422293b
Gets Changed to -
{
System.IO.File.Delete(@"C:\Users\Public\DeleteTest\notatest.txt");
}
//Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c6e6f7461746573742e74787422293b
Keep in mind these bits will be within the entire hex of the program. How could I take my base signature and look for partial matches, so that anything with, say, a 90% match gets flagged?
I would use a wildcard, but that wouldn't work for slightly more complex things where the code might be written slightly differently while the majority stays the same. So is there a way I can do a percentage match for a substring? I was looking into the Levenshtein distance, but I don't see how I'd apply it to this scenario.
Thanks in advance for any input
Using an edit distance would be fine. You can take two strings and calculate the edit distance, which will be an integer value denoting how many operations are needed to take one string to the other. You set your own threshold based off that number.
For example, you may statically set that if the distance is less than five edits, the change is relevant.
You could also take the length of the string you are comparing and take a percentage of that. Your example is 36 characters long, so (int)(input.Length * 0.88m) would be a valid threshold.
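For illustration, the edit distance and a percentage derived from it could be computed as follows. This is a plain dynamic-programming Levenshtein distance; the `similarity` helper and its normalisation by the longer string are assumptions of this sketch, and Python is used only for brevity.

```python
def levenshtein(a, b):
    """Number of insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        cur = [i]
        for j, item_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                        # delete from a
                           cur[j - 1] + 1,                     # insert into a
                           prev[j - 1] + (item_a != item_b)))  # substitute
        prev = cur
    return prev[-1]


def similarity(a, b):
    """1.0 for identical inputs, falling towards 0.0 as edits accumulate."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest
```

Applied to the two signatures in the question, the distance is 4 (the inserted "nota"), which gives a similarity of roughly 94% and would clear a 90% threshold.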
First, your program bits should match EXACTLY or else it has been modified or is corrupt. Generally, you will store an MD5 hash on the original binary and check the MD5 against new versions to see if they are 'the same enough' (MD5 can't guarantee a 100% match).
Beyond this, in order to detect malware in a random binary, you must know what sort of patterns to look for. For example, if I know a piece of malware injects code with some binary XYZ, I will look for XYZ in the bits of the executable. Patterns get much more complex than that, of course, as the malware bits can be spread out in chunks. What is more interesting is that some viruses are self-morphing. This means that each time one runs, it modifies itself, so the scanner does not know an exact pattern to find. In these cases, the scanner must know the types of derivatives that can be produced and look for all of them.
In terms of finding a % match, this operation is very time consuming unless you have constraints. By comparing 2 strings, you cannot tell which pieces were removed, added, or replaced. For instance, if I have a starting string 'ABCD', is 'AABCDD' a 100% match or less since content has been added? What about 'ABCDABCD'; here it matches twice. How about 'AXBXCXD'? What about 'CDAB'?
There are many DIFF tools in existence that can tell you what pieces of a file have been changed (which can lead to a %). Unfortunately, none of them are perfect because of the issues that I described above. You will find that you have false negatives, false positives, etc. This may be 'good enough' for you.
Before you can identify a specific algorithm that will work for you, you will have to decide what the restrictions of your search will be. Otherwise, your scan will be NP-hard, which leads to unreasonable running times (your scanner may run all day just to check one file).
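One possible restriction is to keep the signature fixed and ask only for its best approximate occurrence anywhere in the file. The sketch below uses the standard approximate substring matching variant of the edit-distance table (a match may start at any position at zero cost); it is not something the answer above prescribes, the function name is made up, and the running time is proportional to signature length times file length.

```python
def best_substring_distance(pattern, text):
    """Smallest edit distance between pattern and any substring of text."""
    m = len(pattern)
    prev = list(range(m + 1))  # distances against an empty stretch of text
    best = prev[m]
    for item in text:
        cur = [0]  # an empty pattern prefix matches anywhere for free
        for i in range(1, m + 1):
            cur.append(min(prev[i] + 1,      # extra item in the text
                           cur[i - 1] + 1,   # pattern item missing in text
                           prev[i - 1] + (pattern[i - 1] != item)))
        prev = cur
        best = min(best, prev[m])
    return best


def match_percent(pattern, text):
    """Hypothetical helper: 100 means the pattern occurs exactly."""
    if not pattern:
        return 100.0
    return 100.0 * (1 - best_substring_distance(pattern, text) / len(pattern))
```

It works on `bytes` as well as `str`, so the signature and the file contents can be compared directly without converting to hex first.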
I suggest you look into Levenshtein distance and Damerau-Levenshtein distance.
The former tells you how many insert/delete/replace operations are needed to turn one string into another; the latter additionally counts swapping two adjacent characters as a single operation.
I use these quite a lot when writing programs where users can search for things, but they may not know the exact spelling.
There are code examples on both articles.
This is going to be a long post. I would like suggestions, if any, on the procedure I am following. I want the best method to print line numbers next to each CRLF-terminated line in a RichTextBox. I am using C# with .NET. I have tried using ListView, but it is inefficient when the number of lines grows. I have been successful in using Graphics in a custom control to print the line numbers, and so far I am happy with the performance.
But as the number of lines grows to 50K to 100K, scrolling is affected badly. I have overridden the WndProc method and handle all the messages so that the line-number printing is called only when required. (Overriding OnContentsResized and OnVScroll makes redundant calls to the printing method.)
Now the line number printing is fine when the number of lines is small, say up to 10K (which I am fine with, as it is rare to need to edit a file with 10,000 lines), but I want to remove the limitation.
Few Observations
The number of lines displayed in the RichTextBox is constant, +-1. So the performance difference should be due to the large text and not because I am using Graphics painting.
Painting line numbers for large text is slower when compared to small files
Now the Pseudo Code
FIRST_LINE_NUMBER = _textBox.GetFirstVisibleLineNumber();
LAST_LINE_NUMBER = _textBox.GetLastVisibleLineNumber();
for (line = FIRST_LINE_NUMBER; line <= LAST_LINE_NUMBER; line++)
{
    Y = _textBox.GetYPositionOfLineNumber(line);
    graphics_paint_line_number(line, Y);
}
I am using GetCharIndexFromPosition and looping through RichTextBox.Lines to find the line number in both of the functions that get line numbers. To get the Y position, I am using GetPositionFromCharIndex to get the Point struct.
All the above RichTextBox methods seem to be O(n), which eats up the performance. (Correct me if I am wrong.)
I have decided to use a binary tree to store the line numbers, to improve the search performance when searching for a line number by char index. I have an idea for a data structure that takes O(n) construction time, O(n lg n) worst-case update, and O(lg n) search.
Is this approach worth the effort?
Is there any other approach to solve the problem? If required I am ready to write the control from scratch, I just want it to be light-weight and fast.
Before deciding on the best way forward, we need to make sure we understand the bottleneck.
First of all, it is important to know how RichTextBox (which I assume you are using, as you mentioned it) handles large files. So I would recommend removing all the line-printing code and seeing how it performs with large text. If it is poor, there is your problem.
Second step would be to put some profiling statements or just use a profiler (one comes with the VS 2010) to find the bottleneck. It might turn out to be the method for finding the line number, or something else.
At this point, I would only suggest more investigation. If you have finished the investigation and have more info, update your question and I will get back to you accordingly.
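If profiling does point at the char-index to line-number lookup, one simple alternative to scanning the lines each time is to keep a sorted list of line start offsets and binary-search it. The sketch below is language-neutral in spirit (Python here for brevity), the names are illustrative, and it rebuilds the index from scratch, so keeping it up to date after edits is left out.

```python
from bisect import bisect_right


def line_starts(text):
    """Offsets at which each CRLF-terminated line begins."""
    starts = [0]
    pos = text.find("\r\n")
    while pos != -1:
        starts.append(pos + 2)
        pos = text.find("\r\n", pos + 2)
    return starts


def line_of_char(starts, char_index):
    """Zero-based line number containing char_index, found in O(log n)."""
    return bisect_right(starts, char_index) - 1
```

Building the list is a single O(n) pass; after an edit, only the offsets following the edited line need shifting, which is where a tree structure like the one proposed in the question would start to pay off.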
I need to hold a representation of a document in memory, and am looking for the most efficient way to do this.
Assumptions
The documents can be pretty large, up to 100MB.
More often than not the document will remain unchanged - (i.e. I don't want to do unnecessary up front processing).
Changes will typically be quite close to each other in the document (i.e. as the user types).
It should be possible to apply changes fast (without copying the whole document)
Changes will be applied in terms of offsets and new/deleted text (not as line/col).
To work in C#
Current considerations
Storing the data as a string. Easy to code, fast to set, very slow to update.
Array of lines. Moderately easy to code, slower to set (as we have to parse the string into lines), faster to update (as we can insert and remove lines easily, but finding offsets requires summing line lengths).
There must be a load of standard algorithms for this kind of thing (it's not a million miles from disk allocation and fragmentation).
Thanks for your thoughts.
I would suggest breaking the file into blocks. All blocks have the same length when you load them, but the length of each block might change if the user edits that block. This avoids moving 100 megabytes of data if the user inserts one byte at the front.
To manage the blocks, just put them - together with the offset of each block - into a list. If the user modifies a block's length, you only have to update the offsets of the blocks after that one. To find an offset, you can use binary search.
File size: 100 MiB
Block Size: 16 kiB
Blocks: 6400
Finding an offset using binary search (worst case): 13 steps
Modifying a block (worst case): copy 16384 byte data and update 6400 block offsets
Modifying a block (average case): copy 8192 byte data and update 3200 block offsets
16 kiB block size is just a random example - you can balance the costs of the operations by choosing the block size, maybe based on the file size and the probability of operations. Doing some simple math will yield the optimal block size.
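The block list could be sketched roughly like this. The class and method names are assumptions, plain strings stand in for the per-block buffers, and there is no splitting of blocks that grow too large; it only illustrates the offset list, the binary search and the "update offsets after this block" step.

```python
from bisect import bisect_right


class BlockText:
    def __init__(self, text, block_size=16 * 1024):
        self.blocks = [text[i:i + block_size]
                       for i in range(0, len(text), block_size)] or [""]
        self._reindex(0)

    def _reindex(self, first):
        """Recompute the start offsets of block `first` and all later blocks."""
        offsets = self.offsets[:first] if first else []
        pos = offsets[-1] + len(self.blocks[first - 1]) if first else 0
        for block in self.blocks[first:]:
            offsets.append(pos)
            pos += len(block)
        self.offsets = offsets
        self.length = pos

    def _locate(self, offset):
        """Binary search: (block index, position inside that block)."""
        i = bisect_right(self.offsets, offset) - 1
        return i, offset - self.offsets[i]

    def insert(self, offset, new_text):
        i, local = self._locate(offset)
        block = self.blocks[i]
        self.blocks[i] = block[:local] + new_text + block[local:]
        self._reindex(i + 1)  # only later blocks change their offsets

    def delete(self, offset, count):
        first, local = self._locate(offset)
        i = first
        while count > 0 and i < len(self.blocks):
            block = self.blocks[i]
            removed = min(count, len(block) - local)
            self.blocks[i] = block[:local] + block[local + removed:]
            count -= removed
            i, local = i + 1, 0
        self._reindex(first + 1)

    def text(self):
        return "".join(self.blocks)
```

An insert copies at most one block and then touches only the offsets that follow it, which is the cost profile given in the figures above.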
Loading will be quite fast, because you load fixed sized blocks, and saving should perform well, too, because you will have to write a few thousand blocks and not millions of single lines. You can optimize loading by loading blocks only on demand and you can optimize saving by only saving all blocks that changed (content or offset).
Finally, the implementation will not be too hard either. You could just use the StringBuilder class to represent a block. But this solution will not work well for very long lines with lengths comparable to the block size or larger, because you will have to load many blocks and display only a small part of each, with the rest being to the left or right of the window. I assume you will have to use a two-dimensional partitioning model in this case.
Good Math, Bad Math wrote an excellent article about ropes and gap buffers a while ago that details the standard methods for representing text files in a text editor, and even compares them for simplicity of implementation and performance. In a nutshell: a gap buffer - a large character array with an empty section immediately after the current position of the cursor - is your simplest and best bet.
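To make the gap buffer idea concrete, here is a minimal sketch. It is not taken from the article; the class layout, the default gap size and the growth policy are assumptions, and offsets are not validated.

```python
class GapBuffer:
    def __init__(self, text="", gap_size=64):
        self.gap_size = gap_size
        self.buf = list(text) + [None] * gap_size
        self.gap_start = len(text)      # first free slot
        self.gap_end = len(self.buf)    # first used slot after the gap

    def __len__(self):
        return len(self.buf) - (self.gap_end - self.gap_start)

    def move_gap(self, pos):
        """Slide the gap so that it starts at text offset `pos`."""
        while self.gap_start > pos:     # move gap left: shift chars right
            self.gap_start -= 1
            self.gap_end -= 1
            self.buf[self.gap_end] = self.buf[self.gap_start]
        while self.gap_start < pos:     # move gap right: shift chars left
            self.buf[self.gap_start] = self.buf[self.gap_end]
            self.gap_start += 1
            self.gap_end += 1

    def insert(self, pos, text):
        self.move_gap(pos)
        if len(text) > self.gap_end - self.gap_start:
            grow = len(text) + self.gap_size
            self.buf[self.gap_start:self.gap_start] = [None] * grow
            self.gap_end += grow
        for ch in text:                 # typing fills the gap from the left
            self.buf[self.gap_start] = ch
            self.gap_start += 1

    def delete(self, pos, count):
        """Delete `count` characters starting at `pos` by widening the gap."""
        self.move_gap(pos)
        self.gap_end = min(self.gap_end + count, len(self.buf))

    def text(self):
        return "".join(self.buf[:self.gap_start] + self.buf[self.gap_end:])
```

Consecutive edits near the same position never move the gap, which fits the stated assumption that changes are typically close to each other; the cost is paid only when the cursor jumps far away.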
You might find this paper useful --- Data Structures for Text Sequences which describes and experimentally analyses a few standard algorithms, and compares [among other things] gap buffers and piece tables.
FWIW, it concludes piece tables are slightly better overall; though net.wisdom seems to prefer gap buffers.
I would suggest taking a look at Memory Mapped Files (MMF).
Some pointers:
Memory Mapped Files .NET
http://msdn.microsoft.com/en-us/library/ms810613.aspx
I'd use a b-tree or skip list of lines, or larger blocks if you aren't going to edit much.
You don't have much extra cost to determine line ends on load, since you have to visit each character on loading anyway.
You can move lines within a node without much effort.
The total length of the text in each node is stored in the node, and changes propagated up to parent nodes.
Each line is represented by a data array, and start index, length and capacity. Line break/carriage returns aren't put in the data array. Common operations such as breaking lines only requires changes to the references into the array; editing lines requires a copy if capacity is exceeded. A similar structure might be used per line temporarily when editing that line, so you don't perform a copy on each key-press.
Off the top of my head, I would have thought an indexed linked list would be fairly efficient for this sort of thing unless you have some very long lines.
The linked list would give you an efficient way to store the data and add or remove lines as the user edits. The indexing allows you to quickly jump to a particular point in your file. This sort of idea lends itself well to undo/redo type operations too as it should be reasonably easy to sort edits into small atomic operations.
I'd agree with crisb's point though, it's probably better to get something simple working first and then see if it really is slow..
From your description it sounds a lot like your document is unformatted text only - so a StringBuilder would do fine.
If it's a formatted document, I would be inclined to use the MS Word APIs or similar and just offload your document processing to them - it will save you an awful lot of time, as document parsing can often be a pain in the a** :-)
I wouldn't get too worried about the performance yet - it sounds a lot like you haven't implemented one yet, so you also don't know what performance characteristics the rest of your app has - it may be that you can't actually afford to hold multiple documents in memory at all when you actually get round to profiling it.