How to print line numbers for a textbox in C#

This is going to be a long post. I would like suggestions on the procedure I am following. I want the best method to print line numbers next to each CRLF-terminated line in a RichTextBox. I am using C# with .NET. I have tried using a ListView, but it becomes inefficient as the number of lines grows. I have been successful in using Graphics in a custom control to paint the line numbers, and so far I am happy with the performance.
But as the number of lines grows to 50K-100K, scrolling is affected badly. I have overridden the WndProc method and handle all the messages so that the line-number painting is called only when required. (Overriding OnContentsResized and OnVScroll makes redundant calls to the painting method.)
The line-number painting is fine when the number of lines is small, say up to 10K (which I could live with, since it is rare to edit a file with 10,000 lines), but I want to remove the limitation.
A few observations:
The number of lines displayed in the RichTextBox is constant, plus or minus one. So the performance difference should be due to the large text and not because I am using Graphics painting.
Painting line numbers for large text is slower compared to small files.
Now the pseudo code:

FIRST_LINE_NUMBER = _textBox.GetFirstVisibleLineNumber();
LAST_LINE_NUMBER = _textBox.GetLastVisibleLineNumber();

for (current_line_number = FIRST_LINE_NUMBER; current_line_number <= LAST_LINE_NUMBER; current_line_number++)
{
    Y = _textBox.GetYPositionOfLineNumber(current_line_number);
    graphics_paint_line_number(current_line_number, Y);
}
In both of the functions that get the line numbers I am using GetCharIndexFromPosition and then looping through RichTextBox.Lines to find the line number. To get the Y position I am using GetPositionFromCharIndex, which returns a Point struct.
All of the above RichTextBox methods appear to be O(n) in the number of lines, which eats up the performance. (Correct me if I am wrong.)
I have decided to use a binary tree to store the line numbers, to improve the search performance when looking up a line number by character index. I have in mind a data structure with O(n) construction time, O(n log n) worst-case update, and O(log n) search.
Is this approach worth the effort?
Is there any other approach to solving the problem? If required, I am ready to write the control from scratch; I just want it to be lightweight and fast.

Before deciding on the best way forward, we need to make sure we understand the bottleneck.
First of all, it is important to know how the RichTextBox (which I assume you are using, since you mentioned it) handles large files on its own. So I would recommend removing all the line-number painting and seeing how it performs with large text. If it performs poorly, there is your problem.
The second step would be to add some profiling statements, or just use a profiler (one ships with VS 2010), to find the bottleneck. It might turn out to be the method that finds the line number, or something else.
At this point I would only suggest more investigation. Once you have finished investigating and have more information, update your question and I will get back to you accordingly.
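One note in the meantime: if the lookup helpers do turn out to be the bottleneck, TextBoxBase already provides GetLineFromCharIndex and GetFirstCharIndexFromLine, so they can be written without scanning RichTextBox.Lines at all. A rough sketch (the class and method names simply mirror your pseudo code):

using System.Drawing;
using System.Windows.Forms;

static class LineNumberHelpers
{
    public static int GetFirstVisibleLineNumber(RichTextBox box)
    {
        // Character closest to the top-left corner of the client area.
        int charIndex = box.GetCharIndexFromPosition(new Point(0, 0));
        return box.GetLineFromCharIndex(charIndex);
    }

    public static int GetLastVisibleLineNumber(RichTextBox box)
    {
        // Character closest to the bottom-left corner of the client area.
        int charIndex = box.GetCharIndexFromPosition(
            new Point(0, box.ClientSize.Height - 1));
        return box.GetLineFromCharIndex(charIndex);
    }

    public static int GetYPositionOfLineNumber(RichTextBox box, int lineNumber)
    {
        // First character of the line, then its pixel position in client coordinates.
        int charIndex = box.GetFirstCharIndexFromLine(lineNumber);
        return box.GetPositionFromCharIndex(charIndex).Y;
    }
}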

Related

Efficiently Concatenating Delimited Characters Into Words

I am working on an NLP-based application that uses a global keyboard hook to read key presses. Here is its working interface:
BIEngine.Hook.KeyboardListener KListener = new BIEngine.Hook.KeyboardListener();

private void Application_Startup(object sender, StartupEventArgs e)
{
    KListener.KeyDown += new BIEngine.Hook.RawKeyEventHandler(KListener_KeyDown);
}

void KListener_KeyDown(object sender, BIEngine.Hook.RawKeyEventArgs args)
{
    Trace.WriteLine(args.ToString());
}
Now I am getting the user's input as a set of letters delimited by spaces, carriage returns, tabs, periods, etc. So if the user types got today in his software window, I would be getting:
g
o
t
t..
So what would be the most efficient way (as this application will be running constantly in the background) to concatenate these letters into words, stripping out the spaces and other delimiters, and react to a certain set of words? For example, if the user types today, the word would be passed to the NLP library and the user would be presented with some sort of feedback.
Thanks for any suggestions, codes etc.
I strongly recommend that you use the simplest approach that does what you want, and stop worrying about performance. Premature optimization, as it's known, can cost lots of time with very little benefit.
If you never let the string get particularly long (like, ~2000 characters) then I suggest you simply append to a normal string, trimming it whenever it grows longer than, say, 100 chars. I highly doubt you will be able to observe any performance impact from this. Only if you ever run into measurable performance problems (say, you notice the program taking more than 0.1% CPU time while the user is typing) should you consider optimizing this. And I bet you'll find that it's not your string concatenation that is using the CPU, but something else altogether.
Why? Because if you try to optimize everything before it is a problem, you will never get much actual work done. Most of the time optimization is unnecessary.
Having said all this, the most efficient way to match a string character by character would be to use a finite state machine, but I feel that explaining how to go about that is outside the scope of this question.
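For concreteness, here is a minimal sketch of that simple approach (the class, the delimiter set and the trigger word are all just illustrative): buffer incoming characters, cut a word at every delimiter, and react when the finished word is in a known set.

using System;
using System.Collections.Generic;
using System.Text;

class WordAccumulator
{
    private readonly StringBuilder _buffer = new StringBuilder();
    private static readonly char[] Delimiters = { ' ', '\r', '\n', '\t', '.', ',', ';', '!', '?' };
    private readonly HashSet<string> _triggers =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "today" };

    // Call this from the KeyDown handler with the character of the key press.
    public void OnKeyChar(char c)
    {
        if (Array.IndexOf(Delimiters, c) >= 0)
        {
            Flush();
            return;
        }

        _buffer.Append(c);
        if (_buffer.Length > 100)      // safety trim so the buffer never grows unbounded
            _buffer.Clear();
    }

    private void Flush()
    {
        string word = _buffer.ToString();
        _buffer.Clear();
        if (_triggers.Contains(word))
        {
            // Hand the word to the NLP library / present feedback here.
            Console.WriteLine("Trigger word typed: " + word);
        }
    }
}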
Looking at your post, my first thought was to use Write instead of WriteLine, but I don't know what implications that might have for your actual configuration.
That would keep everything on the "same line", but to what end?
You could also insert a block of code into your app to perform the visual or logical transformations and then display or process the result.
That way you don't add extra work to your application's start-up procedure.

Reading a specific line from a huge text file (c# 4.0)

EDIT:
@Everyone: sorry, I feel silly for getting mixed up about the size of Int32. The question could be closed, but since there are several answers already, I selected the first one.
The original question is below for reference.
I am looking for a way to load a specific line from very large text files, and I was planning on using File.ReadLines and the Skip() method:
File.ReadLines(fileName).Skip(nbLines).Take(1).ToArray();
Problem is, Skip() takes an int value, and int values are limited to 2 million or so. Should be fine for most files, but what if the file contains, say 20 million lines? I tried using a long, but no overload of Skip() accepts longs.
Lines are of variable, unknown length so I can't count the bytes.
Is there an option that doesn't involve reading line by line or splitting the file in chunks? This operation must be very fast.
Integers are 32-bit numbers, and so are limited to 2 billion or so.
That said, if you have to read a random line from the file and all you know is that the file consists of lines (of unknown length), you will have to read it line by line until you reach the line you want. You can use buffering to ease up on the I/O a little (it is on by default), but you won't get better performance than that.
Unless you change the way the file is saved. If you could create an index file containing the position of each line in the main file, you could make reading a line infinitely faster.
Well, not infinitely, but a lot faster: from O(N) to almost O(1) (almost, because seeking to a random byte in a file may not be an O(1) operation, depending on how the OS does it).
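A rough sketch of that index idea (names are illustrative; it records the byte offset of every line start in one sequential pass and assumes the file does not change between indexing and reading):

using System.Collections.Generic;
using System.IO;
using System.Text;

static class LineIndex
{
    // One pass over the raw bytes, remembering where each line starts.
    public static List<long> Build(string path)
    {
        var offsets = new List<long> { 0L };
        using (var stream = File.OpenRead(path))
        {
            int b;
            long position = 0;
            while ((b = stream.ReadByte()) != -1)
            {
                position++;
                if (b == '\n')              // the next line starts right after the LF
                    offsets.Add(position);
            }
        }
        return offsets;
    }

    // Seek straight to the recorded offset and read a single line.
    public static string ReadLineAt(string path, List<long> offsets, int lineNumber)
    {
        using (var stream = File.OpenRead(path))
        {
            stream.Seek(offsets[lineNumber], SeekOrigin.Begin);
            using (var reader = new StreamReader(stream, Encoding.UTF8))
                return reader.ReadLine();
        }
    }
}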
I voted to close your question because your premises are incorrect. However, were this a real problem, there's nothing to stop you writing your own Skip extension method that takes a long instead of an int:
public static class SkipEx
{
    public static IEnumerable<T> LongSkip<T>(this IEnumerable<T> src, long numToSkip)
    {
        long counter = 0L;
        foreach (var item in src)
        {
            if (counter++ < numToSkip)
                continue;
            yield return item;
        }
    }
}
so now you can do such craziness as
File.ReadLines(filename).LongSkip(100000000000L)
without problems (and come back next year...). Tada!
Int values are limited to around 2 billion, not 2 million. So unless your file is going to have more than around 2.1 billion lines, you should be fine.
You can always use SkipWhile and TakeWhile and write your own predicates.
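For example, something along these lines (a quick sketch reusing the fileName from the question; the predicate relies on a side effect, which is why the explicit extension method above is arguably cleaner):

long target = 3000000000L;   // the line you want to jump to
long seen = 0L;
string[] result = File.ReadLines(fileName)
                      .SkipWhile(s => seen++ < target)
                      .Take(1)
                      .ToArray();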

How to detect keyword stuffing?

We are working on a kind of document search engine, primarily focused on indexing user-submitted MS Word documents.
We have noticed that there is keyword-stuffing abuse.
We have identified two main kinds of abuse:
Repeating the same term again and again
Adding many irrelevant terms to the document en masse
Both forms of abuse are enabled either by adding text with the same font colour as the background colour of the document, or by setting the font size to something like 1px.
Determining whether the background colour is the same as the text colour is tricky, given the intricacies of MS Word layouts; the same goes for font size, as any cut-off seems potentially arbitrary and we may accidentally remove valid text if we set the cut-off too large.
My question is: are there any standardized pre-processing or statistical analysis techniques that could be used to reduce the impact of this kind of keyword stuffing?
Any guidance would be appreciated!
There's a surprisingly simple solution to your problem using the notion of compressibility.
If you convert your Word documents to text (you can easily do that on the fly), you can then compress them (for example, with the free zlib library) and look at the compression ratios. Normal text documents usually have a compression ratio of around 2, so any significant deviation would mean that they have been "stuffed". The analysis is extremely easy; I analyzed around 100k texts and it took about a minute using Python.
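If you are on .NET, a rough sketch of that check using the built-in DeflateStream could look like this (the "ratio of about 2 for normal prose" figure is the heuristic above; the cut-off here is just an assumption and should be calibrated on your own corpus):

using System.IO;
using System.IO.Compression;
using System.Text;

static class StuffingHeuristics
{
    // Compression ratio = original size / compressed size.
    public static double CompressionRatio(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var output = new MemoryStream())
        {
            using (var deflate = new DeflateStream(output, CompressionMode.Compress, true))
                deflate.Write(raw, 0, raw.Length);
            return (double)raw.Length / output.Length;
        }
    }

    // Heavily repeated text compresses far better than normal prose,
    // so a ratio well above ~2 is suspicious. The default cut-off is an assumption.
    public static bool LooksStuffed(string text, double maxRatio = 4.0)
    {
        return CompressionRatio(text) > maxRatio;
    }
}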
Another option is to look at the statistical properties of the documents/words. To do that, you need a sample of "clean" documents, from which you calculate the mean frequency of the distinct words as well as their standard deviations.
Once you have done that, you can take a new document and compare it against the mean and the deviation. Stuffed documents will stand out either as documents where a few words deviate very strongly from their mean frequency (one or two words repeated many times) or as documents where many words show high deviations (blocks of text repeated).
Here are some useful links about compressibility:
http://www.ra.ethz.ch/cdstore/www2006/devel-www2006.ecs.soton.ac.uk/programme/files/pdf/3052.pdf
http://www.ispras.ru/ru/proceedings/docs/2011/21/isp_21_2011_277.pdf
You could also probably use the concept of entropy, for example the Shannon entropy calculation: http://code.activestate.com/recipes/577476-shannon-entropy-calculation/
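A character-level Shannon entropy check is only a few lines in C#, for example (a sketch of the same idea as the linked recipe; heavily repeated text shows noticeably lower entropy than normal prose):

using System;
using System.Linq;

static double ShannonEntropy(string text)
{
    // Sum of -p * log2(p) over the relative frequency p of each distinct character.
    return -text.GroupBy(c => c)
                .Select(g => (double)g.Count() / text.Length)
                .Sum(p => p * Math.Log(p, 2.0));
}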
Another possible solution would be to use Part-of-Speech (POS) tagging. I reckon that the average percentage of nouns is similar across "normal" documents (37% according to http://www.ingentaconnect.com/content/jbp/ijcl/2007/00000012/00000001/art00004?crawler=true). If the percentage were noticeably higher or lower for some POS tags, you could possibly detect "stuffed" documents.
As Chris Sinclair commented on your question, unless you have Google-level algorithms (and even they get it wrong, which is why they have an appeals process), it is best to flag likely keyword-stuffed documents for further human review.
If a page has 100 words, you can search through the page counting the occurrences of keywords (which makes stuffing via 1px fonts or background colours irrelevant) to get a keyword density. There is no hard and fast percentage that is always keyword stuffing; generally 3-7% is normal. If you detect 10% or more, you could flag the document as "potentially stuffed" and set it aside for human review.
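As an illustration, a density check along those lines might look like the following sketch (the delimiter set, the case folding and the 10% default are all just example choices):

using System;
using System.Linq;

static bool ExceedsKeywordDensity(string text, double threshold = 0.10)
{
    string[] words = text.Split(new[] { ' ', '\t', '\r', '\n', '.', ',', ';', ':' },
                                StringSplitOptions.RemoveEmptyEntries);
    if (words.Length == 0)
        return false;

    // Share of the total word count taken by the single most frequent word.
    int topCount = words.GroupBy(w => w.ToLowerInvariant())
                        .Max(g => g.Count());

    return (double)topCount / words.Length > threshold;
}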
Furthermore consider these scenarios (taken from here):
Lists of phone numbers without substantial added value
Blocks of text listing cities and states a webpage is trying to rank for
and what the context of a keyword is.
Pretty damn difficult to do correctly.
Detect tag abuse with forecolor/backcolor detection, as you already do.
For size detection, calculate the average text size and remove the outliers. Also set predefined limits on the text size (as you already do).
Next up is the structure of the tag "blobs".
For your first point, you can simply count the words; if one occurs far too often (say, 5x more often than the second most frequent word), you can flag it as a repeated tag.
When adding tags en masse, the user often adds them all in one place, so you can check whether known "fraud tags" appear next to each other (perhaps with one or two words in between).
If you can identify at least some common "fraud tags" and want to get a bit more advanced, you could do the following:
Split the document into parts with the same text size/font and analyze each part separately. For better results, group parts that use nearly the same font/size, not only those with exactly the same font/size.
Count the occurrences of each known tag, and when a limit you set is exceeded, remove that part of the document or flag the document as "bad" (as in "uses excessive tags").
No matter how advanced your detection is, as soon as people know it's there and more or less know how it works, they will find ways to circumvent it.
When that happens, you should just flag the offending documents and look through them yourself. Then, if you notice that your detection algorithm produced a false positive, you improve it.
If you notice a pattern, such as the common stuffers always using a font size below a certain threshold (say 1-5px, which is not really readable), you could assume that that range is the "stuffed part".
You can then go on to check whether the font colour is also the same as the background colour, and remove that section.

Why does textbox overflow slow down the program so significantly?

I made an application (something like Google Maps) and added a textbox field to which debugging data was written (of course I meant to remove it afterwards). The interesting thing is that after it was "full", let's say several kilobytes, the whole program slowed down so significantly that it had to be exited, because one could no longer work with it.
Could you please explain why?
Well, it is surely more than a couple of kilobytes. But yes, TextBox is pretty unsuitable as a control for displaying tracing information. Every time you add a new line, it must re-allocate its internal buffer, merging the old text with the new text. It is the exact same kind of problem as with .NET's String class, except that String has the StringBuilder class as a workaround and no equivalent exists for TextBox.
Another thing that makes TextBox very slow when you add a lot of lines is the WordWrap property. Setting it to true requires the control to do a lot of work to figure out the length of each line every time it paints itself.
So the workarounds are to leave WordWrap set to false and to prevent the amount of text from growing boundlessly by throwing half of it away whenever the length reaches a limit, or to use a different control. TextBox isn't very suitable anyway, since it doesn't make sense to edit tracing data; a ListBox works well for this.
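The "throw half of it away" workaround can be as simple as this sketch (the debugTextBox name and the length limit are just illustrative):

private const int MaxLogLength = 32000;

private void AppendLog(string message)
{
    if (debugTextBox.TextLength > MaxLogLength)
    {
        // Drop the oldest half so the buffer never grows without bound.
        debugTextBox.Text = debugTextBox.Text.Substring(MaxLogLength / 2);
    }
    debugTextBox.AppendText(message + Environment.NewLine);
}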
Instead of appending a little data at a time, eg:
debugTextBox.Text += "Some new debug info"
Perhaps this strategy might be faster:
StringBuilder debugText = new StringBuilder();
...
debugText.Append("Some new debug info");
debugTextBox.Text = debugText.ToString();
(although a StringBuilder is probably overkill for this, and may even prove slower than working directly with string concatenation on a plain string debugText)

Best approach to holding large editable documents in memory

I need to hold a representation of a document in memory, and am looking for the most efficient way to do this.
Assumptions
The documents can be pretty large, up to 100MB.
More often than not the document will remain unchanged (i.e. I don't want to do unnecessary up-front processing).
Changes will typically be quite close to each other in the document (i.e. as the user types).
It should be possible to apply changes fast (without copying the whole document).
Changes will be applied in terms of offsets and new/deleted text (not as line/col).
To work in C#.
Current considerations
Storing the data as a string: easy to code, fast to set, very slow to update.
Array of lines: moderately easy to code, slower to set (as we have to parse the string into lines), faster to update (as we can insert and remove lines easily, but finding offsets requires summing line lengths).
There must be a load of standard algorithms for this kind of thing (it's not a million miles from disk allocation and fragmentation).
Thanks for your thoughts.
I would suggest breaking the file into blocks. All blocks have the same length when you load them, but the length of each block may change if the user edits it. This avoids moving 100 megabytes of data if the user inserts one byte at the front.
To manage the blocks, just put them - together with the offset of each block - into a list. If the user changes a block's length, you only have to update the offsets of the blocks after it. To find an offset, you can use binary search.
File size: 100 MiB
Block size: 16 KiB
Blocks: 6,400
Finding an offset using binary search (worst case): 13 steps
Modifying a block (worst case): copy 16,384 bytes of data and update 6,400 block offsets
Modifying a block (average case): copy 8,192 bytes of data and update 3,200 block offsets
The 16 KiB block size is just an example; you can balance the costs of the operations by choosing the block size, perhaps based on the file size and the probability of the different operations. Some simple math will yield a reasonable block size.
Loading will be quite fast because you load fixed-size blocks, and saving should perform well too, because you write a few thousand blocks rather than millions of single lines. You can optimize loading by loading blocks only on demand, and optimize saving by writing only the blocks that changed (content or offset).
Finally, the implementation will not be too hard either. You could use the StringBuilder class to represent a block. This solution will not work well for very long lines with lengths comparable to the block size or larger, though, because you would have to load many blocks and display only a small part, with the rest lying to the left or right of the window. I assume you would have to use a two-dimensional partitioning model in that case.
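A bare-bones sketch of that block scheme (the names, the 16 KiB size and the insert-only API are just for illustration):

using System;
using System.Collections.Generic;
using System.Text;

class BlockDocument
{
    private const int BlockSize = 16 * 1024;
    private readonly List<long> _offsets = new List<long>();          // start offset of each block
    private readonly List<StringBuilder> _blocks = new List<StringBuilder>();

    public BlockDocument(string text)
    {
        // Fixed-size blocks on load; their lengths drift apart as the user edits.
        for (int i = 0; i < text.Length; i += BlockSize)
        {
            _offsets.Add(i);
            _blocks.Add(new StringBuilder(
                text.Substring(i, Math.Min(BlockSize, text.Length - i))));
        }
    }

    // Binary search for the block containing the given document offset.
    private int FindBlock(long offset)
    {
        int index = _offsets.BinarySearch(offset);
        return index >= 0 ? index : ~index - 1;
    }

    public void Insert(long offset, string s)
    {
        int block = FindBlock(offset);
        _blocks[block].Insert((int)(offset - _offsets[block]), s);

        // Only the offsets of the blocks after the edited one have to change.
        for (int i = block + 1; i < _offsets.Count; i++)
            _offsets[i] += s.Length;
    }
}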
Good Math, Bad Math wrote an excellent article about ropes and gap buffers a while ago that details the standard methods for representing text files in a text editor, and even compares them for simplicity of implementation and performance. In a nutshell: a gap buffer - a large character array with an empty section immediately after the current position of the cursor - is your simplest and best bet.
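For illustration, a bare-bones gap buffer might look like this (only insertion and cursor movement are shown; deletion and reading the text back are left out):

using System;

class GapBuffer
{
    private char[] _buffer = new char[256];
    private int _gapStart = 0;      // the cursor: text before it sits in [0, _gapStart)
    private int _gapEnd = 256;      // text after the cursor sits in [_gapEnd, _buffer.Length)

    // Typing at the cursor is O(1): the character just fills in the gap.
    public void Insert(char c)
    {
        if (_gapStart == _gapEnd)
            Grow();
        _buffer[_gapStart++] = c;
    }

    // Moving the cursor shifts characters across the gap, one per position moved.
    public void MoveCursor(int position)
    {
        while (position < _gapStart)
            _buffer[--_gapEnd] = _buffer[--_gapStart];
        while (position > _gapStart)
            _buffer[_gapStart++] = _buffer[_gapEnd++];
    }

    private void Grow()
    {
        var bigger = new char[_buffer.Length * 2];
        Array.Copy(_buffer, 0, bigger, 0, _gapStart);
        int tail = _buffer.Length - _gapEnd;
        Array.Copy(_buffer, _gapEnd, bigger, bigger.Length - tail, tail);
        _gapEnd = bigger.Length - tail;
        _buffer = bigger;
    }
}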
You might find this paper useful: Data Structures for Text Sequences, which describes and experimentally analyses a few standard algorithms and compares (among other things) gap buffers and piece tables.
FWIW, it concludes that piece tables are slightly better overall, though net wisdom seems to prefer gap buffers.
I would suggest you take a look at Memory Mapped Files (MMF).
Some pointers:
Memory Mapped Files .NET
http://msdn.microsoft.com/en-us/library/ms810613.aspx
I'd use a B-tree or skip list of lines, or of larger blocks if you aren't going to edit much.
You don't incur much extra cost determining line ends on load, since you have to visit each character while loading anyway.
You can move lines within a node without much effort.
The total length of the text in each node is stored in the node, and changes are propagated up to the parent nodes.
Each line is represented by a data array plus a start index, length, and capacity. Line breaks/carriage returns aren't put in the data array. Common operations such as breaking lines only require changing the references into the array; editing a line requires a copy if its capacity is exceeded. A similar structure might be used per line temporarily while editing that line, so you don't perform a copy on each key press.
Off the top of my head, I would have thought an indexed linked list would be fairly efficient for this sort of thing, unless you have some very long lines.
The linked list gives you an efficient way to store the data and to add or remove lines as the user edits. The indexing allows you to jump quickly to a particular point in the file. This idea also lends itself well to undo/redo operations, since it should be reasonably easy to split edits into small atomic operations.
I'd agree with crisb's point though: it's probably better to get something simple working first and then see if it really is slow.
From your description it sounds a lot like your document is unformatted text only, so a StringBuilder would do fine.
If it's a formatted document, I would be inclined to use the MS Word APIs or similar and just offload your document processing to them; that will save you an awful lot of time, as document parsing can often be a pain in the a** :-)
I wouldn't get too worried about the performance yet. It sounds like you haven't implemented anything yet, so you also don't know what performance characteristics the rest of your app has; it may turn out that you can't actually afford to hold multiple documents in memory at all once you get round to profiling it.
