How to parse through data efficiently - c#

I am wondering if anyone can help me out with parsing out data for key words.
say I am looking for this keyword: My Example Yo (this is one of many keywords)
I have data like this
MY EXAMPLE YO #108
my-example-yo #108
my-example #108
MY Example #108
This is just a few combinations. There could be words or number is front of these sentences, there could be in any case, maybe nothing comes after it maybe like the above example something comes after it.
A few ideas came to mind.
store all combinations that I can possible think of in my database then use contains
The downside with this is I going a huge database table with every combination of everything thing I need to find. I then will have to load the data into memory(through nhibernate and check every combination). I am trying to determine what category to use based on keyword and they can upload thousands of rows to check for.
Even if I load subsets and look through them I still picture this will be slow.
Remove all special characters and make single spaces and ignore case and try to use regex to see how much of the keyword matches up.
Not sure what to do if the keyword has special characters like dashes and such.
I know I will not get every combination out there but I want to try get as many as I can.

Have you considered Lucene.Net? I haven't used it myself, but I hear it's a great tool for full text searching. It might do well with keyword searching too. I believe that stackoverflow uses Lucene.

Related

regex that can handle horribly misspelled words

Is there a way to create a regex will insure that five out of eight characters are present in order in a given character range (like 20 chars for example)?
I am dealing with horrible OCR/scanning, and I can stand the false positives.
Is there a way to do this?
Update: I want to match for example "mshpeln" as misspelling. I do not want to do OCR. The OCR job has been done, but is has been done poorly (i.e. it originally said misspelling, but the OCR'd copy reads "mshpeln"). I do not know what the text that I will have to match against will be (i.e. I do not know that it is "mshpeln" it could be "mispel" or any number of other combinations).
I am not trying to use this as a spell checker, but merely find the end of a capture group. As an aside, I am currently having trouble getting the all.css file, so commenting is impossible temporarily.
I think you need not regex, but database with all valid words and creative usage of functions like soundex() and/or levenshtein().
You can do this: create table with all valid words (dictionary), populate it with columns like word and snd (computed as soundex(word)), create indexes for both word and snd columns.
For example, for word mispeling you would fill snd as M214. If you use SQLite, it has soundex() implemented by default.
Now, when you get new bad word, compute soundex() for it and look it up in your indexed table. For example, for word mshpeln it would be soundex('mshpeln') = M214. There you go, this way you can get back correct word.
But this would not look anything like regex - sorry.
To be honest, I think that a project like this would be better for an actual human to do, not a computer. If the project is to large for 1 or 2 people to do easily, you might want to look into something like Amazon's Mechanical Turk where you can outsource to work for pennies per solution.
This can't be done with a regex, but it can be done with a custom algorithm.
For example, to find words that are like 'misspelling' in your body of text:
1) Preprocess. Create a Set (in the mathematical sense, collection of guaranteed to be unique elements) with all of the unique letters that are in misspelling - {e, i, g, l, m, n, p, s}
2) Split the body of text into words.
3) For each word, create a Set with all of its unique letters. Then, perform the operation of set intersection on this set and the set of the word you are matching against - this will get you letters that are contained by both sets. If this set has 5 or more characters left in it, you have a possible match here.
If the OCR can add in erroneous spaces, then consider two words at a time instead of single words. And etc based on what your requirements are.
I have no solution for this problem, in fact, here's exactly the opposite.
Correcting OCR errors is not programmaticaly possible for two reasons:
You cannot quantify the error that was made by the OCR algorithm as it can goes between 0 and 100%
To apply a correction, you need to know what the maximum error could be in order to set an acceptable level.
Let nello world be the first guess of "hello world", which is quite similar. Then, with another font that is written in "painful" yellow or something, a second guess is noiio verio for the same expression. How should a computer know that this word would have been similar if it was better recognized?
Otherwise, given a predetermined error, mvp's solution seems to be the best in my opinion.
UPDATE:
After digging a little, I found a reference that may be relevant: String similarity measures

regular expression validtor

i have text box for phone number .i need to validate it.my requiremants are
Take only numeric more than 10digits
Take symbols like (,),-,
can any one help for this.i tried
^[\d{10,14} +\s +\( +\)-]+$
but not working.
You may take a look at the following article which will help you build such expression.
You haven't said what is wrong with your regex (why it's not working as expected) but I'm guessing that the issue is it matches far more than it should. I.e it will match 1 or more of all the characters in your set (rather than just between 10 and 14).
I think you're mistake is that you have put way too much in your character set. You've got the + symbol in there 3 times and it looks like your trying to use quantifiers from within the set as well, which is not allowed. Character sets are the equivalent of single character alternations. So, [abc] is the equivalent of a|b|c.
I'm assuming that you want the input to be between 10 and 14 numbers while still allowing any number (zero or more) of the following characters:
+()-,
As some others have suggested, you could just put the chars you want in a set and then specify the quantifier after it like this: ^[0-9()-,+]{10,14}$. This will almost get you there. Only problem with it is that it will allow between 10 and 14 of any of these characters, so it would successfully match this:
,,,,,++()---
Which clearly you don't want (do you?)
So, in order to better solve this problem, you'll need to be more specific about what is allowed and where in the subject it is allowed. Because i don't know exactly what you want to match, i can't take you much further.
Hopefully the information I've provided here should be good enough to get you started, and if you have more questions... well that's what we're all here for right, so ask away.
To help you out with learning, below are a few resources you might find useful (this is a small subset of what's available, so do go ahead and search for yourself):
Testing tools
Rubular (ruby)
GSkinner Regex Testser
RegexHero (dotnet)
Helpful info
Regular-Expressions.Info
Codeproject 30 Minute Tutorial

Simplifying Regex's - escaping

I want to enable my users to specify the allowed characters in a given string.
So... Regex's are great but too tough for my users.
my plan is to enable users to specify a list of allowed characters - for example
a-z|A-Z|0-9|,
i can transform this into a regex which does the matching as such:
[a-zA-Z0-9,]*
However i'm a little lost to deal with all the escaping - imagine if a user specified
a-z|A-Z|0-9| |,|||\|*|[|]|{|}|(|)
Clearly one option is to deal with every case individually but before i write such a nasty solution - is there some nifty way to do this?
Thanks
David
Forget regex, here is a much simpler solution:
bool isInputValid = inputString.All(c => allowedChars.Contains(c));
You might be right about your customers, but you could provide some introductory regex material and see how they get on - you might be surprised.
If you really need to simplify, you'll probably need to jetison the use of pipe characters too, and provide an alternative such as putting each item on a new line (in a multi line text box for instance).
To make it as simple as possible for your users, why don't you ditch the "|" and the concept of character ranges, e.g., "a-z", and get them just to type the complete list of characters they want to allow:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890 *{}()
You get the idea. I think this will be much simpler.

validating user input tags

I know this question might sound a little cheesy but this is the first time I am implementing a "tagging" feature to one of my project sites and I want to make sure I do everything right.
Right now, I am using the very same tagging system as in SO.. space seperated, dash(-) combined multiple words. so when I am validating a user-input tag field I am checking for
Empty string (cannot be empty)
Make sure the string doesnt contain particular letters (suggestions are welcommed here..)
At least one word
if there is a space (there are more than one words) split the string
for each splitted, insert into db
I am missing something here? or is this roughly ok?
Split the string at " ", iterate over the parts, make sure that they comply with your expectations. If they do, put them into the DB.
For example, you can use this regex to check the individual parts:
^[-\w]{2,25}$
This would limit allowed input to consecutive strings of alphanumerics (and "_", which is part of "\w" as well as "-" because you asked for it) 2..25 characters long. This essentially removes any code injection threat you might be facing.
EDIT: In place of the "\w", you are free to take any more closely defined range of characters, I chose it for simplicity only.
I've never implemented a tagging system, but am likely to do so soon for a project I'm working on. I'm primarily a database guy and it occurs to me that for performance reasons it may be best to relate your tagged entities with the tag keywords via a resolution table. So, for instance, with example tables such as:
TechQuestion
TechQuestionID (pk)
SubjectLine
QuestionBody
TechQuestionTag
TechQuestionID (pk)
TagID (pk)
Active (indexed)
Tag
TagID (pk)
TagText (indexed)
... you'd only add new Tag table entries when never-before-used tags were used. You'd re-associate previously provided tags via the TechQuestionTag table entry. And your query to pull TechQuestions related to a given tag would look like:
SELECT
q.TechQuestionID,
q.SubjectLine,
q.QuestionBody
FROM
Tag t INNER JOIN TechQuestionTag qt
ON t.TagID = qt.TagID AND qt.Active = 1
INNER JOIN TechQuestion q
ON qt.TechQuestionID = q.TechQuestionID
WHERE
t.TagText = #tagText
... or what have you. I don't know, perhaps this was obvious to everyone already, but I thought I'd put it out there... because I don't believe the alternative (redundant, indexed, text-tag entries) wouldn't query as efficiently.
Be sure your algorithm can handle leading/trailing/extra spaces with no trouble = )
Also worth thinking about might be a tag blacklist for inappropriate tags (profanity for example).
I hope you're doing the usual protection against injection attacks - maybe that's included under #2.
At the very least, you're going to want to escape quote characters and make embedded HTML harmless - in PHP, functions like addslashes and htmlentities can help you with that. Given that it's for a tagging system, my guess is you'll only want to allow alphanumeric characters. I'm not sure what the best way to accomplish that is, maybe using regular expressions.

Regex index in matching string where the match failed

I am wondering if it is possible to extract the index position in a given string where a Regex failed when trying to match it?
For example, if my regex was "abc" and I tried to match that with "abd" the match would fail at index 2.
Edit for clarification. The reason I need this is to allow me to simplify the parsing component of my application. The application is an Assmebly language teaching tool which allows students to write, compile, and execute assembly like programs.
Currently I have a tokenizer class which converts input strings into Tokens using regex's. This works very well. For example:
The tokenizer would produce the following tokens given the following input = "INP :x:":
Token.OPCODE, Token.WHITESPACE, Token.LABEL, Token.EOL
These tokens are then analysed to ensure they conform to a syntax for a given statement. Currently this is done using IF statements and is proving cumbersome. The upside of this approach is that I can provide detailed error messages. I.E
if(token[2] != Token.LABEL) { throw new SyntaxError("Expected label");}
I want to use a regular expression to define a syntax instead of the annoying IF statements. But in doing so I lose the ability to return detailed error reports. I therefore would at least like to inform the user of WHERE the error occurred.
I agree with Colin Younger, I don't think it is possible with the existing Regex class. However, I think it is doable if you are willing to sweat a little:
Get the Regex class source code
(e.g.
http://www.codeplex.com/NetMassDownloader
to download the .Net source).
Change the code to have a readonly
property with the failure index.
Make sure your code uses that Regex
rather than Microsoft's.
I guess such an index would only have meaning in some simple case, like in your example.
If you'll take a regex like "ab*c*z" (where by * I mean any character) and a string "abbbcbbcdd", what should be the index, you are talking about?
It will depend on the algorithm used for mathcing...
Could fail on "abbbc..." or on "abbbcbbc..."
I don't believe it's possible, but I am intrigued why you would want it.
In order to do that you would need either callbacks embedded in the regex (which AFAIK C# doesn't support) or preferably hooks into the regex engine. Even then, it's not clear what result you would want if backtracking was involved.
It is not possible to be able to tell where a regex fails. as a result you need to take a different approach. You need to compare strings. Use a regex to remove all the things that could vary and compare it with the string that you know it does not change.
I run into the same problem came up to your answer and had to work out my own solution. Here it is:
https://stackoverflow.com/a/11730035/637142
hope it helps

Categories