Simplifying Regex's - escaping - c#

I want to enable my users to specify the allowed characters in a given string.
So... Regex's are great but too tough for my users.
my plan is to enable users to specify a list of allowed characters - for example
a-z|A-Z|0-9|,
i can transform this into a regex which does the matching as such:
[a-zA-Z0-9,]*
However i'm a little lost to deal with all the escaping - imagine if a user specified
a-z|A-Z|0-9| |,|||\|*|[|]|{|}|(|)
Clearly one option is to deal with every case individually but before i write such a nasty solution - is there some nifty way to do this?
Thanks
David

Forget regex, here is a much simpler solution:
bool isInputValid = inputString.All(c => allowedChars.Contains(c));

You might be right about your customers, but you could provide some introductory regex material and see how they get on - you might be surprised.
If you really need to simplify, you'll probably need to jetison the use of pipe characters too, and provide an alternative such as putting each item on a new line (in a multi line text box for instance).

To make it as simple as possible for your users, why don't you ditch the "|" and the concept of character ranges, e.g., "a-z", and get them just to type the complete list of characters they want to allow:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890 *{}()
You get the idea. I think this will be much simpler.

Related

How to check the equality of two identical URLs if one of them has "www"?

The following URLs are identical:
http://example.com
http://www.example.com
Can I get the expected result using a method provided in .Net framework like Compare() method in Uri class? or should I handle this case manually?
Unfortunately I don't have enough rep to leave a simple comment for you to ponder about so I'll just leave an answer as food for thought.
I can see a few possible solutions which should cover most scenarios (I'm pretty certain I will probably miss quite a few but everyone makes mistakes):
Use a loop to move over both URIs and compare each part/component as you see fit with break early conditions to speed it up
Clean up the URIs as much as you can using arbitrarily defined rules (eg. remove protocol, remove www prefix, trim ending /) then use Equals() to compare
If you're feeling bold you could use something similar to lexical analysis to convert both URIs into a objects/tokens/parts and compare the subsequent result (harder to implement but probably the most accurate)
Just keep in mind that the guys in the comments are right. The URLs aren't technically the same and the logic you implement in your ultimate solution is purely defined around your definition of 'identical'.
Also, I wouldn't use the solution that simple replaces "www." with "" since someone crazy could easily put a 'www.' somewhere else in their URL and break that implementation unless you perform the replace on both URLs which is also quite risky since one could have more 'www.' instances than the other and would still be considered 'identical'
As mentioned in the comments, these are actually not identical urls and cannot be treated as such.
That aside, it may help you to check the equality after removing the www part:
string urla = #"http://example.com";
string urlb = #"http://www.example.com";
if (urlb.contains("www.")) urlb = urlb.replace("www.", "");
if (urla == urlb) {
// url matches
}

regex that can handle horribly misspelled words

Is there a way to create a regex will insure that five out of eight characters are present in order in a given character range (like 20 chars for example)?
I am dealing with horrible OCR/scanning, and I can stand the false positives.
Is there a way to do this?
Update: I want to match for example "mshpeln" as misspelling. I do not want to do OCR. The OCR job has been done, but is has been done poorly (i.e. it originally said misspelling, but the OCR'd copy reads "mshpeln"). I do not know what the text that I will have to match against will be (i.e. I do not know that it is "mshpeln" it could be "mispel" or any number of other combinations).
I am not trying to use this as a spell checker, but merely find the end of a capture group. As an aside, I am currently having trouble getting the all.css file, so commenting is impossible temporarily.
I think you need not regex, but database with all valid words and creative usage of functions like soundex() and/or levenshtein().
You can do this: create table with all valid words (dictionary), populate it with columns like word and snd (computed as soundex(word)), create indexes for both word and snd columns.
For example, for word mispeling you would fill snd as M214. If you use SQLite, it has soundex() implemented by default.
Now, when you get new bad word, compute soundex() for it and look it up in your indexed table. For example, for word mshpeln it would be soundex('mshpeln') = M214. There you go, this way you can get back correct word.
But this would not look anything like regex - sorry.
To be honest, I think that a project like this would be better for an actual human to do, not a computer. If the project is to large for 1 or 2 people to do easily, you might want to look into something like Amazon's Mechanical Turk where you can outsource to work for pennies per solution.
This can't be done with a regex, but it can be done with a custom algorithm.
For example, to find words that are like 'misspelling' in your body of text:
1) Preprocess. Create a Set (in the mathematical sense, collection of guaranteed to be unique elements) with all of the unique letters that are in misspelling - {e, i, g, l, m, n, p, s}
2) Split the body of text into words.
3) For each word, create a Set with all of its unique letters. Then, perform the operation of set intersection on this set and the set of the word you are matching against - this will get you letters that are contained by both sets. If this set has 5 or more characters left in it, you have a possible match here.
If the OCR can add in erroneous spaces, then consider two words at a time instead of single words. And etc based on what your requirements are.
I have no solution for this problem, in fact, here's exactly the opposite.
Correcting OCR errors is not programmaticaly possible for two reasons:
You cannot quantify the error that was made by the OCR algorithm as it can goes between 0 and 100%
To apply a correction, you need to know what the maximum error could be in order to set an acceptable level.
Let nello world be the first guess of "hello world", which is quite similar. Then, with another font that is written in "painful" yellow or something, a second guess is noiio verio for the same expression. How should a computer know that this word would have been similar if it was better recognized?
Otherwise, given a predetermined error, mvp's solution seems to be the best in my opinion.
UPDATE:
After digging a little, I found a reference that may be relevant: String similarity measures

How to parse through data efficiently

I am wondering if anyone can help me out with parsing out data for key words.
say I am looking for this keyword: My Example Yo (this is one of many keywords)
I have data like this
MY EXAMPLE YO #108
my-example-yo #108
my-example #108
MY Example #108
This is just a few combinations. There could be words or number is front of these sentences, there could be in any case, maybe nothing comes after it maybe like the above example something comes after it.
A few ideas came to mind.
store all combinations that I can possible think of in my database then use contains
The downside with this is I going a huge database table with every combination of everything thing I need to find. I then will have to load the data into memory(through nhibernate and check every combination). I am trying to determine what category to use based on keyword and they can upload thousands of rows to check for.
Even if I load subsets and look through them I still picture this will be slow.
Remove all special characters and make single spaces and ignore case and try to use regex to see how much of the keyword matches up.
Not sure what to do if the keyword has special characters like dashes and such.
I know I will not get every combination out there but I want to try get as many as I can.
Have you considered Lucene.Net? I haven't used it myself, but I hear it's a great tool for full text searching. It might do well with keyword searching too. I believe that stackoverflow uses Lucene.

CamelCase conversion to friendly name, i.e. Enum constants; Problems?

In my answer to this question, I mentioned that we used UpperCamelCase parsing to get a description of an enum constant not decorated with a Description attribute, but it was naive, and it didn't work in all cases. I revisited it, and this is what I came up with:
var result = Regex.Replace(camelCasedString,
#"(?<a>(?<!^)[A-Z][a-z])", #" ${a}");
result = Regex.Replace(result,
#"(?<a>[a-z])(?<b>[A-Z0-9])", #"${a} ${b}");
The first Replace looks for an uppercase letter, followed by a lowercase letter, EXCEPT where the uppercase letter is the start of the string (to avoid having to go back and trim), and adds a preceding space. It handles your basic UpperCamelCase identifiers, and leading all-upper acronyms like FDICInsured.
The second Replace looks for a lowercase letter followed by an uppercase letter or a number, and inserts a space between the two. This is to handle special but common cases of middle or trailing acronyms, or numbers in an identifier (except leading numbers, which are usually prohibited in C-style languages anyway).
Running some basic unit tests, the combination of these two correctly separated all of the following identifiers: NoDescription, HasLotsOfWords, AAANoDescription, ThisHasTheAcronymABCInTheMiddle, MyTrailingAcronymID, TheNumber3, IDo3Things, IAmAValueWithSingleLetterWords, and Basic (which didn't have any spaces added).
So, I'm posting this first to share it with others who may find it useful, and second to ask two questions:
Anyone see a case that would follow common CamelCase-ish conventions, that WOULDN'T be correctly separated into a friendly string this way? I know it won't separate adjacent acronyms (FDICFCUAInsured), recapitalize "properly" camelCased acronyms like FdicInsured, or capitalize the first letter of a lowerCamelCased identifier (but that one's easy to add - result = Regex.Replace(result, "^[a-z]", m=>m.ToString().ToUpper());). Anything else?
Can anyone see a way to make this one statement, or more elegant? I was looking to combine the Replace calls, but as they do two different things to their matches it can't be done with these two strings. They could be combined into a method chain with a RegexReplace extension method on String, but can anyone think of better?
So while I agree with Hans Passant here, I have to say that I had to try my hand at making it one regex as an armchair regex user.
(?<a>(?<!^)((?:[A-Z][a-z])|(?:(?<!^[A-Z]+)[A-Z0-9]+(?:(?=[A-Z][a-z])|$))|(?:[0-9]+)))
Is what I came up with. It seems to pass all the tests you put forward in the question.
So
var result = Regex.Replace(camelCasedString, #"(?<a>(?<!^)((?:[A-Z][a-z])|(?:(?<!^[A-Z]+)[A-Z0-9]+(?:(?=[A-Z][a-z])|$))|(?:[0-9]+)))", #" ${a}");
Does it in one pass.
not that this directly answers the question, but why not test by taking the standard C# API and converting each class into a friendly name? It'd take some manual verification, but it'd give you a good list of standard names to test.
Let's say every case you come across works with this (you're asking us for examples that won't and then giving us some, so you don't even have a question left).
This still binds UI to programmatic identifiers in a way that will make both programming and UI changes brittle.
It still assumes your program will only be used in one language. Either your potential market it so small that just indexing an array of names would be scalable enough (e.g. a one-client bespoke or in-house project), or you are assuming you will never be successful enough to need to be available to other languages or other dialects of your first-chosen language.
Does "well, it'll work as long as we're a failure" sound like a passing grade in balancing designs?
Either code it to use resources, or else code it to pass the enum name blindly or use an array of names, as that at least will be modifiable afterwards.

Regex index in matching string where the match failed

I am wondering if it is possible to extract the index position in a given string where a Regex failed when trying to match it?
For example, if my regex was "abc" and I tried to match that with "abd" the match would fail at index 2.
Edit for clarification. The reason I need this is to allow me to simplify the parsing component of my application. The application is an Assmebly language teaching tool which allows students to write, compile, and execute assembly like programs.
Currently I have a tokenizer class which converts input strings into Tokens using regex's. This works very well. For example:
The tokenizer would produce the following tokens given the following input = "INP :x:":
Token.OPCODE, Token.WHITESPACE, Token.LABEL, Token.EOL
These tokens are then analysed to ensure they conform to a syntax for a given statement. Currently this is done using IF statements and is proving cumbersome. The upside of this approach is that I can provide detailed error messages. I.E
if(token[2] != Token.LABEL) { throw new SyntaxError("Expected label");}
I want to use a regular expression to define a syntax instead of the annoying IF statements. But in doing so I lose the ability to return detailed error reports. I therefore would at least like to inform the user of WHERE the error occurred.
I agree with Colin Younger, I don't think it is possible with the existing Regex class. However, I think it is doable if you are willing to sweat a little:
Get the Regex class source code
(e.g.
http://www.codeplex.com/NetMassDownloader
to download the .Net source).
Change the code to have a readonly
property with the failure index.
Make sure your code uses that Regex
rather than Microsoft's.
I guess such an index would only have meaning in some simple case, like in your example.
If you'll take a regex like "ab*c*z" (where by * I mean any character) and a string "abbbcbbcdd", what should be the index, you are talking about?
It will depend on the algorithm used for mathcing...
Could fail on "abbbc..." or on "abbbcbbc..."
I don't believe it's possible, but I am intrigued why you would want it.
In order to do that you would need either callbacks embedded in the regex (which AFAIK C# doesn't support) or preferably hooks into the regex engine. Even then, it's not clear what result you would want if backtracking was involved.
It is not possible to be able to tell where a regex fails. as a result you need to take a different approach. You need to compare strings. Use a regex to remove all the things that could vary and compare it with the string that you know it does not change.
I run into the same problem came up to your answer and had to work out my own solution. Here it is:
https://stackoverflow.com/a/11730035/637142
hope it helps

Categories