Porter stemmer algorithm in information-retrieval [closed] - c#

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I need to create a simple search engine for my application. Let's simplify it to the following: we have a lot of texts, and I need to search them and show relevant results.
I based it on this great article, extended a few things, and it works pretty well for me.
But I have a problem with stemming words to terms. For example, the words "annotation", "annotations", etc. will be stemmed to "annot", but if you try to search for part of a word, you get unexpected results:
"anno" - nothing
"annota" - nothing
etc.
Only the word "annot" gives a relevant result. So how should I improve my search to give the expected results? After all, "annot" contains "anno", and "annota" is only slightly longer than "annot". Using Contains all the time obviously isn't the solution.
In the first case I could use something like a ternary search tree; in the second case I don't know what to do.
Any ideas would be very helpful.
UPDATE
oleksii has pointed me to n-grams here, which may work for me, but I don't know how to index n-grams properly.
So the questions are:
Which data structure would be best for my needs?
How do I index my n-grams properly?

Stemming perhaps isn't very relevant here. A stemmer reduces inflected forms of a word to a common stem (for example, a plural to its singular form); it does not help with matching partial words.
Given you have a tokeniser, a stemmer and a cleaner (to remove stop words, perhaps punctuation and numbers, short words etc.), what you are looking at is full-text search. I would advise you to get an off-the-shelf solution (like Elasticsearch, Lucene, Solr), but if you fancy a DIY approach I can suggest the following naive implementation.
Step 1
Create a search-orientated tokeniser. One example would be an n-gram tokeniser. It will take your word and split it into the following sequences:
annotation
1 - [a, n, n, o, t, ...]
2 - [an, nn, no, ot, ...]
3 - [ann, nno, not, ota, ...]
4 - [anno, nnot, nota, otat, ...]
....
Step 2
Sort n-grams for more efficient look-up
Step 3
Search n-grams for exact match using binary search
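The three steps above can be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: the class name `NGramIndex`, its methods, and the minimum and maximum n-gram lengths are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: maps every n-gram of every indexed word back to that word.
public class NGramIndex
{
    // (n-gram, word) pairs; kept sorted by n-gram so binary search works.
    private readonly List<KeyValuePair<string, string>> entries =
        new List<KeyValuePair<string, string>>();
    private bool sorted = true;

    // Step 1: split a word into all n-grams with minLength <= n <= maxLength.
    public static IEnumerable<string> NGrams(string word, int minLength, int maxLength)
    {
        for (int n = minLength; n <= maxLength && n <= word.Length; n++)
            for (int start = 0; start + n <= word.Length; start++)
                yield return word.Substring(start, n);
    }

    public void Add(string word, int minLength, int maxLength)
    {
        foreach (string gram in NGrams(word, minLength, maxLength))
            entries.Add(new KeyValuePair<string, string>(gram, word));
        sorted = false;
    }

    // Step 2: sort by n-gram (ordinal comparison, so the order is culture-independent).
    private void EnsureSorted()
    {
        if (sorted) return;
        entries.Sort((x, y) => string.CompareOrdinal(x.Key, y.Key));
        sorted = true;
    }

    // Step 3: binary search for the first entry whose n-gram equals the query,
    // then collect every distinct word stored under that n-gram.
    public List<string> Search(string query)
    {
        EnsureSorted();
        int lo = 0, hi = entries.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (string.CompareOrdinal(entries[mid].Key, query) < 0) lo = mid + 1;
            else hi = mid;
        }
        var words = new List<string>();
        for (int i = lo; i < entries.Count && entries[i].Key == query; i++)
            if (!words.Contains(entries[i].Value)) words.Add(entries[i].Value);
        return words;
    }
}
```

With "annotation" indexed using n-gram lengths 3 to 6, both `Search("anno")` and `Search("annota")` find the word. A query shorter or longer than the indexed n-gram lengths finds nothing, so the length range is a trade-off between index size and the queries you can answer.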

Related

string.contains() vs string.equals() or string == performance [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I'm returning a string from an API that has a length of 45 characters. There is one word that is unique for one condition that doesn't appear in the other condition.
I'm wondering if using string.contains() is faster performance-wise than comparing the whole string with string.equals() or string == "blah blah".
I don't know the inner workings of any of these methods, but logically, it seems like contains() should be faster because it can stop traversing the string after it finds the match. Is this accurate? Incidentally, the word I want to check is the first word in the string.
I agree with D Stanley (comment). You should use String.StartsWith()
That said, I don't know the inner workings of each method either, but I can see your logic. However, the whole string is already in memory by the time "String.Contains()" runs, so even if it stops at the first match, the performance difference would be very small.
As a final point, with a string length of only 45 characters, the performance difference should be extremely minute. I was shocked when I wrote a junky method to substitute characters and found that it processes ~10 KB of text in a fraction of the blink of an eye. So unless you're doing some crazy handling elsewhere in your app, it shouldn't matter much.
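A minimal sketch of the StartsWith suggestion, assuming the unique word is "ERROR" (a made-up marker; the question doesn't say what the word is). Passing StringComparison.Ordinal is my addition: it skips culture-specific comparison rules, which is usually what you want for machine-generated strings.

```csharp
using System;

public static class ConditionCheck
{
    // "ERROR" is a hypothetical stand-in for the word that is unique to one condition.
    private const string Marker = "ERROR";

    public static bool IsSpecialCondition(string apiResponse)
    {
        // Only the first Marker.Length characters need to be compared,
        // so the rest of the 45-character string is never examined.
        return apiResponse.StartsWith(Marker, StringComparison.Ordinal);
    }
}
```

For example, `ConditionCheck.IsSpecialCondition("ERROR: device not found")` is true, while a response that merely contains the word later in the string is not matched.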

C# Check if a string is a Sentence [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
Basically I want to check if a String is a Sentence ("Hello, I am Me!") or Symbol Spam ("HH,,,{''{"), without using the number of symbols as a factor as much as possible. Right now it just detects based on a counter of symbols, but when someone says something with lots of punctuation, they get kicked.
Help?
If the number of symbols in the text is not sufficient, and you don't want to use something too fancy (or bought) could I suggest implementing one or more of these further steps (of increasing difficulty):
1. Make a count of all A-Za-z and space characters in the string and make a ratio of this to the count of symbols - so if they write a sentence then !!!!!!!!!!!!! at the end it still doesn't snag as the ratio is high enough.
If this still isn't discerning enough, add a further check if you pass item 1...
2. Count numbers of consecutive A-Za-z characters in the string - work out the average length of these 'words' - if the average is too short then it is probably spam.
These can be done with regex reasonably easily. If you want more sophistication, you will have to use something written by someone else that has much more developed statistical methods (or start reading lexicographical university papers that are beyond me!)
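The two checks above could look something like this. It is only a sketch: the class and method names are hypothetical, the thresholds are illustrative guesses that would need tuning on real messages, and the first check measures letters and spaces as a fraction of the whole string rather than as a ratio to the symbol count.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpamHeuristics
{
    // Both thresholds are untuned, made-up defaults.
    public static bool LooksLikeSentence(string text,
                                         double minLetterRatio = 0.6,
                                         double minAverageWordLength = 2.0)
    {
        if (string.IsNullOrEmpty(text)) return false;

        // Check 1: share of letters and spaces among all characters.
        int lettersAndSpaces = Regex.Matches(text, "[A-Za-z ]").Count;
        double ratio = (double)lettersAndSpaces / text.Length;
        if (ratio < minLetterRatio) return false;

        // Check 2: average length of runs of consecutive letters ("words").
        MatchCollection words = Regex.Matches(text, "[A-Za-z]+");
        if (words.Count == 0) return false;
        int totalLetters = 0;
        foreach (Match word in words) totalLetters += word.Length;
        return (double)totalLetters / words.Count >= minAverageWordLength;
    }
}
```

With these defaults, "Hello, I am Me!" passes both checks, while "HH,,,{''{" fails the first one because only 2 of its 9 characters are letters.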

How to structure program that calculates math expression [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I plan to write a program, or rather a function, that can analyze a string parameter containing a math expression. Only the 4 basic operations are allowed (addition, subtraction, multiplication and division), and the numbers are all whole numbers from -100 to 100. The result is allowed to be a float. I know registers work in a similar way, i.e. calculate the result of two numbers and store it, then calculate the result of the stored value and the next operand and store that, and so forth until there are no operands left. The number of operands will usually be 2, but I will sometimes need 3 or even more, so yes, more operands is a requirement.
I was wondering how you would structure this in C#. What tools or helper functions would you use in this scenario?
Note: I am working on a Unity 5.1.4 project and I want to use a math parser in it. Unity is .NET 2.0
Note: This seems most promising: http://mono.1490590.n4.nabble.com/Javascript-eval-function-in-c-td1490783.html
It uses a variant of eval() function.
In .NET there are no high-level helper functions to help you with this. You would have to parse and tokenize the string in your own code. There are, however, third-party libraries that do what you need, for instance Expression Compiler, Simple Math Parser, Mathos Parser, and many others. Search for "math expression parser".
If you want to make one from scratch, you could look at the code of existing ones.
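For a sense of what a from-scratch version might involve, here is a small recursive-descent sketch for the four basic operations on whole numbers. It is not taken from any of the libraries mentioned; the class name `MathExpression` and its grammar are assumptions for illustration, and it deliberately avoids newer C# syntax since the question targets an older .NET profile.

```csharp
using System;
using System.Globalization;

// Hypothetical evaluator for +, -, * and / on whole numbers.
public class MathExpression
{
    private readonly string text;
    private int pos;

    private MathExpression(string text) { this.text = text; }

    public static float Evaluate(string expression)
    {
        MathExpression parser = new MathExpression(expression);
        float result = parser.ParseSum();
        parser.SkipSpaces();
        if (parser.pos != parser.text.Length)
            throw new FormatException("Unexpected character at position " + parser.pos);
        return result;
    }

    // sum := product (('+' | '-') product)*
    private float ParseSum()
    {
        float value = ParseProduct();
        while (true)
        {
            if (TryConsume('+')) value += ParseProduct();
            else if (TryConsume('-')) value -= ParseProduct();
            else return value;
        }
    }

    // product := number (('*' | '/') number)*
    // Handling this level below ParseSum gives * and / higher precedence.
    private float ParseProduct()
    {
        float value = ParseNumber();
        while (true)
        {
            if (TryConsume('*')) value *= ParseNumber();
            else if (TryConsume('/')) value /= ParseNumber();
            else return value;
        }
    }

    // number := ['-'] digit+
    private float ParseNumber()
    {
        bool negative = TryConsume('-');
        SkipSpaces();
        int start = pos;
        while (pos < text.Length && text[pos] >= '0' && text[pos] <= '9') pos++;
        if (pos == start)
            throw new FormatException("Expected a number at position " + pos);
        float value = int.Parse(text.Substring(start, pos - start), CultureInfo.InvariantCulture);
        return negative ? -value : value;
    }

    private bool TryConsume(char symbol)
    {
        SkipSpaces();
        if (pos < text.Length && text[pos] == symbol) { pos++; return true; }
        return false;
    }

    private void SkipSpaces()
    {
        while (pos < text.Length && char.IsWhiteSpace(text[pos])) pos++;
    }
}
```

`MathExpression.Evaluate("2 + 3 * 4")` evaluates the multiplication first, and any number of operands is handled by the loops. Parentheses are not supported in this sketch; they would need one more grammar rule.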
Hans Passant mentions a simple solution, maybe just what you need. You get the result of the expression, so if you need just that, and not the actual expression tokens, then .NET has you covered.
This tool finished the job without adding external references, DLLs or whatnot: http://mono.1490590.n4.nabble.com/Javascript-eval-function-in-c-td1490783.html

How do I get two numbers between two words (C#) [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 7 years ago.
I have a string "Building1Floor2" and it's always in that format. How do I cleanly get the building number (e.g. 1) and the floor number? I'm thinking I need a regex, but I'm not entirely sure that's the best way. I could just use the index if the format stays the same, but if I have a high floor number, e.g. 100, it will break.
P.S. I'm using C#.
Use a regex like this:
Building(\d+)Floor(\d+)
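In C#, the captured numbers are read from the match's Groups collection. A minimal sketch, assuming a hypothetical helper named `LocationParser`; the `^` and `$` anchors are my addition so that strings with extra text around the pattern are rejected.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LocationParser
{
    // Group 1 captures the building number, group 2 the floor number.
    private static readonly Regex Pattern = new Regex(@"^Building(\d+)Floor(\d+)$");

    public static bool TryParse(string input, out int building, out int floor)
    {
        building = 0;
        floor = 0;
        Match match = Pattern.Match(input);
        if (!match.Success) return false;
        // int.TryParse guards against digit runs too long to fit in an int.
        return int.TryParse(match.Groups[1].Value, out building)
            && int.TryParse(match.Groups[2].Value, out floor);
    }
}
```

Because `\d+` matches one or more digits, "Building1Floor100" yields building 1 and floor 100 without any fixed-index assumptions.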
Regex would be an ok option here if "Building" and "Floor" could change. e.g.: "Floor1Room23"
You could use "[A-Za-z]+([0-9]{1,})[A-Za-z]+([0-9]{1,})"
With those groupings, $1 would now be the Building number, and $2 would be Floor.
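If "Building" and "Floor" never changed, however, then regex might be overkill. You could use a string split instead.
Find the index of "Floor" and take substrings around it.
int first = str.IndexOf("Floor");
// The building number sits between "Building" and "Floor"; the floor number is the rest.
string building = str.Substring("Building".Length, first - "Building".Length);
string floor = str.Substring(first + "Floor".Length);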

Is it a bad idea to convert byte arrays to strings then parse with regular expressions? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
Here's the scenario: I've recently been tasked with writing an RS-232 serial device communication interface for our existing application. This application has base classes in place to do the actual communication. Basically all I do is accept a byte array into my class and then process it.
Part of the issue is that the byte array delivered can be no more than 1000 bytes at a time yet there could be more data waiting to come in that belongs to that transaction. So I have no idea if what was delivered to me is complete. What I am doing is converting that 1000 byte array into a string and stuffing it into a buffer. This buffer then runs a regex to see if what was added creates a complete transaction. I know it's complete if it matches a particular signature (basically a series of control codes at the beginning and end). This buffer will only append data up to 3 times before giving up if no match is found in case of garbage data coming in and no match is ever possible. This isn't a high data volume device so I don't expect tons of data to come pouring in constantly. And the regular expression is only ever executed on, at most, 3000 characters.
So far it works pretty well, but my question is: are regular expressions terrible for this? Are there any ramifications in regards to performance for what I'm using them for? My understanding is that regular expressions are typically bad for large volumes of data, but I feel this is quite small.
are regular expressions terrible for this?
On the contrary, regular expressions are great for matching patterns in data sequences.
Are there any ramifications in regards to performance for what I'm using them for?
Regular expressions can be written in really inefficient ways, but that is usually a problem with a particular regular expression, not with regular expressions as a technique.
My understanding is that regular expressions are typically bad for large volumes of data but I feel this is quite small.
There is no universal definition of "large" and "small". Depending on the regex engine, your expression is usually translated into a state machine described by the expression. These machines are really efficient at what they do, in which case the size of the data block can be very considerable. On the other hand, one could write a regex with a lot of backtracking, causing unacceptable performance even on input strings of a hundred characters or fewer.
Nothing about what you're doing is raising any red flags.
Some things to keep in mind
Don't preoccupy yourself with performance. Just design your program first, and optimize for performance afterwards, and do so only if you have a performance problem.
Some tasks are unsuitable for regular expressions. Regular expressions can't parse XML very well, and they also can't parse patterns like X^n Y^n (some number of X's followed by exactly the same number of Y's). Without knowing specifically what you're trying to match with your regex, I can't really analyze whether it's suitable for your problem. Just be careful that you don't have any odd edge cases.
Regex being bad for large amounts of data is not something I've heard before, and after looking around online I'm still not finding much warning against it.
Normally, the simplest solution is the best one. If you can think of a simpler and more straightforward solution to your problem, then go ahead with that. If not, then don't worry too much.
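For reference, the buffering approach described in the question might look roughly like this. Everything specific here is an assumption: the question doesn't name its control codes, so the sketch uses STX (0x02) and ETX (0x03) as the start and end signature, assumes the device sends ASCII, and invents the class name `TransactionBuffer`.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public class TransactionBuffer
{
    // Assumed framing: STX, then a payload without STX/ETX, then ETX.
    // The negated character class keeps the pattern free of heavy backtracking.
    private static readonly Regex Frame = new Regex("\u0002([^\u0002\u0003]*)\u0003");

    private const int MaxChunks = 3; // give up after 3 appends, as in the question
    private readonly StringBuilder buffer = new StringBuilder();
    private int chunkCount;

    // Returns the payload of a complete transaction, or null if more data is needed.
    public string Append(byte[] chunk)
    {
        buffer.Append(Encoding.ASCII.GetString(chunk));
        chunkCount++;

        Match match = Frame.Match(buffer.ToString());
        if (match.Success)
        {
            Reset(); // note: any bytes after the frame are discarded in this sketch
            return match.Groups[1].Value;
        }
        if (chunkCount >= MaxChunks) Reset(); // treat the buffered data as garbage
        return null;
    }

    private void Reset()
    {
        buffer.Length = 0;
        chunkCount = 0;
    }
}
```

Each call runs the regex over at most three chunks of buffered text, which matches the 3000-character upper bound mentioned in the question.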
