Efficiently Concatenating Delimited Characters Into Words - c#

I am working on an NLP-based application that uses a global keyboard hook to read key presses. Here is its working interface:
BIEngine.Hook.KeyboardListener KListener = new BIEngine.Hook.KeyboardListener();

private void Application_Startup(object sender, StartupEventArgs e)
{
    KListener.KeyDown += new BIEngine.Hook.RawKeyEventHandler(KListener_KeyDown);
}

void KListener_KeyDown(object sender, BIEngine.Hook.RawKeyEventArgs args)
{
    Trace.WriteLine(args.ToString());
}
Now I am getting the user's input as a set of letters delimited by spaces, carriage returns, tabs, periods, etc. So if the user types got today in his software window, I would be getting
g
o
t
t..
So what would be the most efficient way (as this application will be running constantly in the background) to concatenate these letters into words, minus the spaces and other delimiters, and to react to a certain set of words? For example, if the user types today, it would be passed to the NLP library and the user would be presented with some sort of feedback.
Thanks for any suggestions, code, etc.

I strongly recommend that you use the simplest approach that does what you want, and stop worrying about performance. Premature optimization, as it's known, can cost lots of time with very little benefit.
If you never let the string get particularly long (like, ~2000 characters) then I suggest you simply append to a normal string, trimming it whenever it grows longer than, say, 100 chars. I highly doubt you will be able to observe any performance impact from this. Only if you ever run into measurable performance problems (say, you notice the program taking more than 0.1% CPU time while the user is typing) should you consider optimizing this. And I bet you'll find that it's not your string concatenation that is using the CPU, but something else altogether.
Why? Because if you try to optimize everything before it is a problem, you will never get much actual work done. Most of the time optimization is unnecessary.
Having said all this, the most efficient way to match a string character by character would be to use a finite state machine, but I feel that explaining how to go about that is outside the scope of this question.
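For completeness, here is a minimal sketch of the simple approach described above, assuming your hook can hand you one character per key press. OnCharacterTyped, HandleTriggerWord and the delimiter list are hypothetical names you would adapt to your own code:

using System;
using System.Collections.Generic;
using System.Text;

private static readonly char[] Delimiters = { ' ', '\r', '\n', '\t', '.', ',' };
private static readonly HashSet<string> TriggerWords =
    new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "today" };
private readonly StringBuilder _buffer = new StringBuilder();

// Called once per typed character; flushes a word at each delimiter.
void OnCharacterTyped(char c)
{
    if (Array.IndexOf(Delimiters, c) >= 0)
    {
        string word = _buffer.ToString();
        _buffer.Clear();
        if (word.Length > 0 && TriggerWords.Contains(word))
            HandleTriggerWord(word);             // hand the word to your NLP library
    }
    else
    {
        _buffer.Append(c);
        if (_buffer.Length > 100)                // keep the buffer bounded, per the advice above
            _buffer.Remove(0, _buffer.Length - 100);
    }
}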

Looking at your post, my first thought was to use Write instead of WriteLine, but I don't know what implications that might have for your actual configuration.
That would keep everything on the "same line", but to what end?
You could also insert a code block into your app to perform the visual or logical transformations, then display or process the result.
That way, you don't add additional workload to your app's startup procedure.

Related

Speech to text in c#

I have a c# program that lets me use my microphone; when I speak, it executes commands and talks back. For example, when I say "What's the weather tomorrow?" it replies with tomorrow's weather.
The only problem is, I have to type out every phrase I want to say and have it pre-recorded. So if I want to ask for the weather, I HAVE to say it exactly as I coded it, with no variations. I am wondering if there is code to change this?
I want to be able to say "What's the weather for tomorrow", "What's tomorrow's weather" or "Can you tell me tomorrow's weather" and have it tell me the next day's weather, but I don't want to have to type each phrase into code. I have seen something out there about e.Result.Alternates; is that what I need to use?
This cannot be done without involving linguistic resources. Let me explain what I mean by this.
As you may have noticed, your C# program only recognizes pre-recorded phrases, and only if you say the exact same words. (As an aside, this is quite an achievement in itself, because you can hardly say a sentence twice without altering it a bit. Small changes, e.g. in sound frequency or length, might not register with your colleagues, but they matter to your program.)
Therefore, you need to incorporate a kind of linguistic resource into your program. In other words, make it "understand" facts about human language. Two suggestions with increasing complexity follow. Both approaches assume that your tool is capable of tokenizing an audio input stream in a sensible way, i.e. extracting words from it.
Pattern matching
To avoid hard-coding the sentences like
Tell me about the weather.
What's the weather tomorrow?
Weather report!
you can instead define a pattern that matches any of those sentences:
if a sentence contains "weather", then output a weather report
This can be further refined in manifold ways, e.g.:
if a sentence contains "weather" and "tomorrow", output tomorrow's forecast.
if a sentence contains "weather" and "Bristol", output a forecast for Bristol
This kind of knowledge must be put into your program explicitly, for instance in the form of a dictionary or lookup table.
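A minimal sketch of such a lookup table in C# follows; the keyword sets and report strings are made up for illustration, and the tokenization is naive whitespace splitting:

using System;
using System.Collections.Generic;
using System.Linq;

// Patterns are checked in order, most specific first; every keyword in a
// pattern must occur in the sentence for it to match.
static readonly List<(string[] Keywords, string Report)> Patterns =
    new List<(string[], string)>
    {
        (new[] { "weather", "tomorrow" }, "tomorrow's forecast"),
        (new[] { "weather", "Bristol" },  "a forecast for Bristol"),
        (new[] { "weather" },             "a general weather report"),
    };

static string HandleSentence(string sentence)
{
    var tokens = new HashSet<string>(
        sentence.Split(' '), StringComparer.OrdinalIgnoreCase);

    foreach (var (keywords, report) in Patterns)
        if (keywords.All(tokens.Contains))
            return report;

    return "no match";
}

For instance, HandleSentence("whats the weather tomorrow") would return "tomorrow's forecast" because both required keywords occur in the sentence.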
Measuring Similarity
If you plan to spend more time on this, you could implement a means for finding the similarity between input sentences. There are many approaches to this as well, but a prominent one is a bag of words, represented as a vector.
In this model, each sentence is represented as a vector, each word in it present as a dimension of the vector. For example, the sentence "I hate green apples" could be represented as
I = 1
hate = 1
green = 1
apples = 1
red = 0
you = 0
Note that the words that do not occur in this particular sentence, but in other phrases the program is likely to encounter, also represent dimensions (for example the red = 0).
The big advantage of this approach is that the similarity of vectors can be easily computed, no matter how multi-dimensional they are. There are several techniques that estimate similarity, one of them is cosine similarity (see for example http://en.wikipedia.org/wiki/Cosine_similarity).
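A hedged sketch of cosine similarity over bag-of-words vectors, again with deliberately naive tokenization:

using System;
using System.Collections.Generic;
using System.Linq;

// Builds a word-count vector from a sentence.
static Dictionary<string, int> ToVector(string sentence) =>
    sentence.ToLowerInvariant()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(w => w)
            .ToDictionary(g => g.Key, g => g.Count());

// Cosine similarity: dot(a, b) / (|a| * |b|); 1.0 means identical direction.
static double CosineSimilarity(Dictionary<string, int> a, Dictionary<string, int> b)
{
    double dot  = a.Where(kv => b.ContainsKey(kv.Key))
                   .Sum(kv => (double)kv.Value * b[kv.Key]);
    double magA = Math.Sqrt(a.Values.Sum(v => (double)v * v));
    double magB = Math.Sqrt(b.Values.Sum(v => (double)v * v));
    return (magA == 0 || magB == 0) ? 0 : dot / (magA * magB);
}

Comparing ToVector("whats the weather for tomorrow") against ToVector("whats tomorrows weather") yields a value close to 1, because the sentences share most of their words.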
On a more general note, there are many other considerations to be made of course.
For example, some words might be utterly irrelevant to the message you want to convey, as in the following sentence:
I want you to output a weather report.
Here, at least "I", "you", "to" and "a" could be done away with without damaging the basic semantics of the sentence. Such words are called stop words and are discarded early in many tools that perform speech-to-text analysis.
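Discarding them is straightforward once you keep a stop-word list; a sketch, where the list is just the words from the example:

using System;
using System.Collections.Generic;
using System.Linq;

var stopWords = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    { "i", "you", "to", "a" };

string[] contentWords = "I want you to output a weather report"
    .Split(' ')
    .Where(word => !stopWords.Contains(word))
    .ToArray(); // leaves "want", "output", "weather", "report"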
Also note that we started out assuming that your program reliably identifies sound input. In reality, no tool is capable of infallibly identifying speech.
Humans tend to forget that sound actually exists without cues as to where word or sentence boundaries are. This makes so-called disambiguation of input a gargantuan task that is easily underestimated - and ambiguity one of the hardest problems of computational linguistics in general.
For that, the code won't be able to judge by itself! You need to split the command into an array of words, such as
Tomorrow
Weather
What
This way, you can compare each word against the text that is present in your program: let's say, the command (what) with the type (weather) and the time (tomorrow).
It is better to read and understand each word; then it will work the way Google does! Google uses the same approach: it breaks down the string and compares the pieces.

Why does textbox overflow slow down the program so significantly?

I made an application (something like Google Maps) and added a textbox field to which debugging data was written (of course, I meant to remove it afterwards). The interesting fact is that after it was "full", let's say several kilobytes, the whole program slowed down so significantly that it had to be exited because one could no longer work with it.
Could you please explain?
Well, it is surely more than a couple of kilobytes. But yes, TextBox is pretty unsuitable as a control for displaying tracing information. Every time you add a new line, it must re-allocate its internal buffer, merging the old text with the new text. It is the exact same kind of problem as with .NET's String class, which has the StringBuilder class as a workaround; no equivalent exists for TextBox.
Another thing that makes TextBox very slow when you add a lot of lines is the WordWrap property. Setting it to True requires it to do a lot of work figuring out the length of each line every time it paints itself.
So the workarounds are to leave WordWrap set to False and to prevent the amount of text from growing boundlessly, by throwing half of it away whenever the length reaches a limit (as sketched below). Or use a different control; TextBox isn't very suitable anyway, since it doesn't make sense to edit tracing data. A ListBox, for example.
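A minimal sketch of the trimming idea; the 32 KB cap is an arbitrary choice:

using System;
using System.Windows.Forms;

const int MaxLength = 32 * 1024;

void AppendTrace(TextBox box, string line)
{
    if (box.TextLength > MaxLength)
        box.Text = box.Text.Substring(box.TextLength / 2); // drop the older half
    box.AppendText(line + Environment.NewLine);
}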
Instead of appending a little data at a time, e.g.:
debugTextBox.Text += "Some new debug info";
Perhaps this strategy might be faster:
StringBuilder debugText = new StringBuilder();
...
debugText.Append("Some new debug info");
debugTextBox.Text = debugText.ToString();
(although StringBuilder is probably overkill for this, and may prove slower than just working directly with string concatenation against a string debugText)

Searching for partial substring within string in C#

Okay, so I'm trying to make a basic malware scanner in C#. My question is: say I have the hex signature for a particular bit of code, for example
{
    System.IO.File.Delete(@"C:\Users\Public\DeleteTest\test.txt");
}
//Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c746573742e74787422293b
Gets Changed to -
{
    System.IO.File.Delete(@"C:\Users\Public\DeleteTest\notatest.txt");
}
//Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c6e6f7461746573742e74787422293b
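(Signatures like these are just the ASCII bytes of the source rendered as hex; for reference, one way to produce them:)

using System;
using System.Text;

string code = @"System.IO.File.Delete(@""C:\Users\Public\DeleteTest\test.txt"");";
string hex = BitConverter.ToString(Encoding.ASCII.GetBytes(code))
                         .Replace("-", "")
                         .ToLowerInvariant(); // "53797374656d2e..." as above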
Keep in mind these bits will be somewhere within the entire hex dump of the program. How could I go about taking my base signature and looking for partial matches, so that, say, a 90% match gets flagged?
I would use a wildcard, but that wouldn't work for slightly more complex cases where the code might be written slightly differently but the majority is the same. So is there a way I can do a percentage match for a substring? I was looking into the Levenshtein distance, but I don't see how I'd apply it to this scenario.
Thanks in advance for any input.
Using an edit distance would be fine. You can take two strings and calculate the edit distance, which will be an integer value denoting how many operations are needed to take one string to the other. You set your own threshold based off that number.
For example, you may statically set that if the distance is less than five edits, the change is relevant.
You could also take the length of the string you are comparing and take a percentage of that. Your example is 36 characters long, so (int)(input.Length * 0.88m) would be a valid threshold.
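For reference, a minimal sketch of the standard dynamic-programming Levenshtein implementation, compared against a caller-chosen threshold:

using System;

static int Levenshtein(string a, string b)
{
    // d[i, j] = number of edits to turn the first i chars of a
    // into the first j chars of b.
    var d = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
    for (int j = 0; j <= b.Length; j++) d[0, j] = j;

    for (int i = 1; i <= a.Length; i++)
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            d[i, j] = Math.Min(Math.Min(
                d[i - 1, j] + 1,           // deletion
                d[i, j - 1] + 1),          // insertion
                d[i - 1, j - 1] + cost);   // substitution
        }
    return d[a.Length, b.Length];
}

// Flag a candidate when the distance stays under your chosen threshold:
// if (Levenshtein(signature, candidate) <= threshold) Flag(candidate);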
First, your program's bits should match EXACTLY, or else it has been modified or is corrupt. Generally, you will store an MD5 hash of the original binary and check the MD5 of new versions to see if they are 'the same enough' (MD5 can't guarantee a 100% match).
Beyond this, in order to detect malware in an arbitrary binary, you must know what sort of patterns to look for. For example, if I know a piece of malware injects code with some binary XYZ, I will look for XYZ in the bits of the executable. Patterns get much more complex than that, of course, as the malware bits can be spread out in chunks. What is more interesting is that some viruses are self-morphing: each time one runs, it modifies itself, meaning the scanner does not know an exact pattern to look for. In these cases, the scanner must know what types of derivatives can be produced and look for all of them.
In terms of finding a % match, this operation is very time consuming unless you have constraints. By comparing 2 strings, you cannot tell which pieces were removed, added, or replaced. For instance, if I have a starting string 'ABCD', is 'AABCDD' a 100% match or less since content has been added? What about 'ABCDABCD'; here it matches twice. How about 'AXBXCXD'? What about 'CDAB'?
There are many DIFF tools in existence that can tell you what pieces of a file have been changed (which can lead to a %). Unfortunately, none of them are perfect because of the issues that I described above. You will find that you have false negatives, false positives, etc. This may be 'good enough' for you.
Before you can identify a specific algorithm that will work for you, you will have to decide what the restrictions of your search will be. Otherwise, your scan will be NP-hard, which leads to unreasonable running times (your scanner may run all day just to check one file).
I suggest you look into Levenshtein distance and Damerau-Levenshtein distance.
The former tells you how many insert/delete/substitute operations are needed to turn one string into another; the latter additionally counts a transposition of two adjacent characters as a single operation.
I use these quite a lot when writing programs where users can search for things, but they may not know the exact spelling.
There are code examples on both articles.

Is it more or less efficient to perform a check before performing a Replace in C#?

This is an almost academic question but I'm curious as to its answer.
Suppose you have a loop that performs a routine replace on every row in a dataset. Let's say there are 10,000 such rows.
Is it more efficient to have something like this:
Row = Row.Replace('X', 'Y');
Or to check whether the row even contains the character that is to be replaced in the first place, like this:
if (Row.Contains('X')) Row = Row.Replace('X', 'Y');
Is there any difference in terms of efficiency? I realize that the difference might be very minor, but I'm interested in knowing if one way is better than the other, regardless of how much better it may be. Also, would your answer be different if the probability of finding the character to be replaced was 10% versus 90%?
Your check, Row.Contains('X'), is an O(n) operation, which means that it iterates over the entire string one character at a time to see if that character exists.
Row.Replace('X', 'Y') works exactly the same way, it checks every single character one character at a time.
So, if you have that check in place, you iterate over the string potentially twice. If you just replace, you iterate over the string once.
You need to measure first on a realistic dataset, then decide which is higher performance. If your typical dataset doesn't often have anything, then having the Contains() call may be faster (because although Replace also iterates through all chars in the string, there will be an extra string object created and garbage collected due to the immutability of strings), but if "X" is often present, the check becomes a waste and actually slows things down.
Also, this typically isn't the first place to look for and worry about performance problems. Things like chatty interfaces, network I/O, web services, databases, file I/O and GUI updates are going to hurt you orders of magnitude more than stuff like this.
If you were going to do stuff like this, and if Row came back from a database (as its name suggests), then getting the database to do the query might be another approach to save performance. E.g.
select MyTextColumn from MyTable where MyTextColumn like '%X%'
Then perform the replacement on all the results, because you know you only returned results where the replacement was needed.
This does introduce other concerns though - for example, in SQL Server, if the above example included an index on MyTextColumn, SQL Server won't be able to use that index because the like argument starts with a wildcard (it's not considered to be "sargable").
In summary, write for correctness, readability and maintenance first, then measure performance and make targeted improvements where they are found to be required.
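A rough Stopwatch sketch of such a measurement; a real benchmark should control for warm-up and GC, but this gives a first impression, and the 10% hit rate here models the questioner's scenario:

using System;
using System.Diagnostics;
using System.Linq;

class ReplaceTiming
{
    static void Main()
    {
        // Synthetic rows; every tenth one contains an 'X'.
        string[] rows = Enumerable.Range(0, 10000)
            .Select(i => i % 10 == 0 ? "ABCXDEF" : "ABCDEF")
            .ToArray();

        var sw = Stopwatch.StartNew();
        foreach (var row in rows)
        {
            var replaced = row.Replace('X', 'Y');
        }
        Console.WriteLine("Replace only:       {0} ticks", sw.ElapsedTicks);

        sw.Restart();
        foreach (var row in rows)
        {
            var replaced = row.Contains("X") ? row.Replace('X', 'Y') : row;
        }
        Console.WriteLine("Contains + Replace: {0} ticks", sw.ElapsedTicks);
    }
}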
The first option is faster. In order to check whether the character is present, Contains first has to search for it, and since there is no caching mechanism between the two calls, why not replace directly? Otherwise you'd be searching twice; if 'X' is present, you are basically doubling the effort.
Don't forget that strings in C# are IMMUTABLE. That means they cannot change.
For it to replace anything, it has to create a new string in memory and copy the data across, then garbage-collect the old string later on.
Using Contains() first will prevent needless creation, copying, and garbage collection of string data, and will therefore perform faster.

How to print line numbers for textbox in c#

This is going to be a long post. I would like suggestions, if any, on the procedure I am following. I want the best method to print line numbers next to each CRLF-terminated line in a RichTextBox. I am using C# with .NET. I have tried using a ListView, but it is inefficient when the number of lines grows. I have been successful in using Graphics in a custom control to paint the line numbers, and so far I am happy with the performance.
But as the number of lines grows to 50K-100K, scrolling is affected badly. I have overridden the WndProc method and am handling all the messages so that the line-number painting is called only when required. (Overriding OnContentsResized and OnVScroll makes redundant calls to the painting method.)
Now the line-number painting is fine when the number of lines is small, say up to 10K (with which I am fine, as it is rare to need to edit a file with 10,000 lines), but I want to remove the limitation.
A few observations:
The number of lines displayed in the RichTextBox is constant, plus or minus one. So the performance difference should be due to the large text, not because I am using Graphics painting.
Painting line numbers for large text is slower compared to small files.
Now the pseudocode:
FIRST_LINE_NUMBER = _textBox.GetFirstVisibleLineNumber();
LAST_LINE_NUMBER = _textBox.GetLastVisibleLineNumber();
for (loop_from_first_to_last_line_number)
{
    Y = _textBox.GetYPositionOfLineNumber(current_line_number);
    graphics_paint_line_number(current_line_number, Y);
}
I am using GetCharIndexFromPosition and looping through RichTextBox.Lines to find the line number in both of the functions that fetch line numbers. To get the Y position I am using GetPositionFromCharIndex, which returns a Point struct.
All the above RichTextBox methods seem to be O(n), which eats up the performance. (Correct me if I am wrong.)
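For what it's worth, TextBoxBase also exposes direct position-to-line mappings; a sketch of computing the visible range with them (DrawLineNumber stands in for your painting routine, and the cost of these calls on huge documents would still need measuring):

using System.Drawing;

// First visible line: the character at the top-left of the client area.
int firstChar = _textBox.GetCharIndexFromPosition(new Point(0, 0));
int firstLine = _textBox.GetLineFromCharIndex(firstChar);

// Last visible line: the character just inside the bottom-left corner.
int lastChar = _textBox.GetCharIndexFromPosition(
    new Point(0, _textBox.ClientSize.Height - 1));
int lastLine = _textBox.GetLineFromCharIndex(lastChar);

for (int line = firstLine; line <= lastLine; line++)
{
    int charIndex = _textBox.GetFirstCharIndexFromLine(line);
    Point pos = _textBox.GetPositionFromCharIndex(charIndex);
    DrawLineNumber(line + 1, pos.Y); // lines are zero-based internally
}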
I have decided to use a binary tree to store the line numbers, to improve search performance when looking up a line number by character index. I have in mind a data structure with O(n) construction time, O(n lg n) worst-case update, and O(lg n) search.
Is this approach worth the effort?
Is there any other approach to solve the problem? If required I am ready to write the control from scratch, I just want it to be light-weight and fast.
Before deciding on the best way forward, we need to make sure we understand the bottleneck.
First of all, it is important to know how RichTextBox (which I assume you are using, as you mentioned it) handles large files. So I would recommend removing all the line-painting code and seeing how the control performs with large text on its own. If it performs poorly, there is your problem.
The second step would be to add some profiling statements, or just use a profiler (one comes with VS 2010), to find the bottleneck. It might turn out to be the method for finding the line number, or something else entirely.
At this point, I would only suggest more investigation. Once you have finished investigating and have more information, update your question and I will get back to you accordingly.
