Splitting a string based on specific characters

Splitting a string based on specific characters - c#

I will try to describe my problem as well as I can.
I am trying to write a program that will handle equations like:
F = (X∨A) ↔ (X∨B) ( (X OR A) is equivalent to (X OR B) )!
I have to solve it by 'X', or to better say, write disjunctive and/or conjunctive normal form.
So, theoretically, it goes like this:
When the truth table is written, you have to see when F is equal to 1 (tautology), and write conjunctive/disjunctive normal form.
For example (disjunctive normal form for the given table):
For A=0,B=0 and A=1,B=1, the value of X does not matter, and for A=0,B=1 AND A=1,B=0, X must be 1.
In the end,
X=A∨B.
Since I'm writing it in C#, equations are written in a TextBox.
What bothers me,is how should I separate my string so I can solve it part by part?

What about first trying "Split()" method (or other methods) of the class String in C#? In the first place, you'd better push your users to insert a blank (as the separator of Split()) between each pair of tokens (e.g. A AND B) , so that you can concentrate on the main logic of your solver.

I see. It's basically a simplifying calculator that doesn't (necessarily) have sequential input via buttons but a ready-made formula typed or pasted in a textbox.
What you need to do is therefore
define a set of permitted operators (AND, OR, NOT, XOR, etc.) in
various permitted notations (Λ, V, !, =, !=, etc., +,*,-,=, ↔ etc.)
check whether the input is syntactically correct; which means
you'll probably have to first remove all spaces from the input
string, then use regular expressions for allowed constructs, which
will probably prove to be a pain
if input is valid, check for parentheses first to determine grouped
expressions
simplify according to boolean algebra
Or simply follow this link (Java but still): Boolean expression solver/simplifier, use the tool bc2cnf or any other library that is linked there and spare yourself a lot of headache by restricting the permitted input to one permitted by the used library.
Hope this helps.

Related

How to validate a excel expression programmatically

I have been developing an application that one of it's responsability is provide to user an page that it's possible to write math expression in EXCEL WAY.
It is an application in ASP.NET MVC, and it's use the SpreadSheetGear library to EXECUTE excel expression.
As it's show below, The page has an texarea to write expression and two button on the right. The green one is for VALIDATE THE EXPRESSION and the red one is for clean textarea.
A,B,C are parameter, that application will replace for values. Notice that it is not possible to know the parameter data type. I mean, if I write a concatenate function, It is necessary that user use double quotes (") to delimitate string. For example
CONCATENATE("A","B") thus, is necessary that user KNOW functions parameters and its correlate data types.
My main issue is how to validate the expression?
SpreadSheetGear there isn't any method to perform this validation.
The method spreadsheetgear provides to perform an formula is:
string formula = "{formula from textarea}"
worksheet.EvaluateValue(formula)
and it's expect and string.
As I don't know what formula user will write, or how many parameters this formula has and what are the parameters data type, it's kind difficult to validate.
Now my question is?
How could I validate the expression?
I think in an alternative: Provide to user and page with textbox for each parameter in the expression. The user will be able to provide some data and validate the RESULT of formula. If the sintaxe of formula is wrong the application throw an exception.
It would be a good solution, but for each "PROCESS" that user will interact to, He'll write 10, 15 formulas, and maybe it would be little painful.
Anyone could provide me an Good solution for that?
Thank you!

https://sites.google.com/site/akshatsharma80/home/data-structures/validate-an-arithmetic-expression
refer this site for validation

This is a very late response but I have been working on expression evaluators in Excel with VBA for a while and I might shed some light. I have three solutions to try but all have limitations.
1) Ask the user to put a '$' after a variable name to signify a string (or some other unique character). Drawback is that it is not as simple as typing a single letter for a variable.
2) Assume all variables entered are double precision. Then change the variable to strings in all combinations until one combination works. Could be very time consuming to try all the combinations if the user enters lots of individual variables.
3) Assume all variables entered are double precision. But then have a list in your program of functions that require strings for parameters. Then you could parse the expression, lookup the functions in your list and then designate the parameters that require string input with a string signifier (like in step 1). This will not account for user defined functions.
Now to test out the function, replace all the numeric variables with '1' and all the string variables with "a", then EvaluateValue. If you get a result or an error signifying a calculation error, it is good.
BTW, in order to parse the expression, I suggest the following method. I do not know C#, only VB, so I will only talk in general terms.
1) Take your expression string and do a search and replace of all the typical operators with the same operator but with a backslash ("\") in front and behind the operator (you can use any other character that is not normally used in Excel formulas if you like). This will delineate these operators so that you can easily ignore them and split up your expression into chunks. Typically only need to delineate +,-,/,*,^,<,>,= and {comma}. So search for a "+" and replace it with a "\+\" and so on. For parenthesis, replace "(" and ")" with "(\\" and "\\)" respectively.
So your sample formula "SUM(A, SQRT(B, C)) * PI()" will look like this:
"SUM(\\A\,\ SQRT(\\B\,\ C\\)\\) \*\ PI(\\\\)"
You can also clean up the string a bit more by eliminating any spaces and by eliminating redundant backslashes by replacing every three consecutive backslashes with a single one (replace "\\" with "\").
2) In Visual Basic there is a command called 'Split' that can take a string like this and split it into a one dimensional array using a delimiter (in this case, the backslash). There must be an equivalent in C# or you can just make one. Your array will look like this: "SUM(", "", "A", ",", "SQRT(", "", "B", etc.
Now iterate through your array, starting at the first element and then skipping every other element. These elements will either be a number (a numeric test), a variable, a function (with have a "(" at the end of it), a parenthesis or blank.
Now you can do other checks as you need and replace the variables with actual values.
3) When you are done, rejoin the array back into a string, without any delimiters, and try the Evaluate function.

regex that can handle horribly misspelled words

Is there a way to create a regex will insure that five out of eight characters are present in order in a given character range (like 20 chars for example)?
I am dealing with horrible OCR/scanning, and I can stand the false positives.
Is there a way to do this?
Update: I want to match for example "mshpeln" as misspelling. I do not want to do OCR. The OCR job has been done, but is has been done poorly (i.e. it originally said misspelling, but the OCR'd copy reads "mshpeln"). I do not know what the text that I will have to match against will be (i.e. I do not know that it is "mshpeln" it could be "mispel" or any number of other combinations).
I am not trying to use this as a spell checker, but merely find the end of a capture group. As an aside, I am currently having trouble getting the all.css file, so commenting is impossible temporarily.

I think you need not regex, but database with all valid words and creative usage of functions like soundex() and/or levenshtein().
You can do this: create table with all valid words (dictionary), populate it with columns like word and snd (computed as soundex(word)), create indexes for both word and snd columns.
For example, for word mispeling you would fill snd as M214. If you use SQLite, it has soundex() implemented by default.
Now, when you get new bad word, compute soundex() for it and look it up in your indexed table. For example, for word mshpeln it would be soundex('mshpeln') = M214. There you go, this way you can get back correct word.
But this would not look anything like regex - sorry.

To be honest, I think that a project like this would be better for an actual human to do, not a computer. If the project is to large for 1 or 2 people to do easily, you might want to look into something like Amazon's Mechanical Turk where you can outsource to work for pennies per solution.

This can't be done with a regex, but it can be done with a custom algorithm.
For example, to find words that are like 'misspelling' in your body of text:
1) Preprocess. Create a Set (in the mathematical sense, collection of guaranteed to be unique elements) with all of the unique letters that are in misspelling - {e, i, g, l, m, n, p, s}
2) Split the body of text into words.
3) For each word, create a Set with all of its unique letters. Then, perform the operation of set intersection on this set and the set of the word you are matching against - this will get you letters that are contained by both sets. If this set has 5 or more characters left in it, you have a possible match here.
If the OCR can add in erroneous spaces, then consider two words at a time instead of single words. And etc based on what your requirements are.

I have no solution for this problem, in fact, here's exactly the opposite.
Correcting OCR errors is not programmaticaly possible for two reasons:
You cannot quantify the error that was made by the OCR algorithm as it can goes between 0 and 100%
To apply a correction, you need to know what the maximum error could be in order to set an acceptable level.
Let nello world be the first guess of "hello world", which is quite similar. Then, with another font that is written in "painful" yellow or something, a second guess is noiio verio for the same expression. How should a computer know that this word would have been similar if it was better recognized?
Otherwise, given a predetermined error, mvp's solution seems to be the best in my opinion.
UPDATE:
After digging a little, I found a reference that may be relevant: String similarity measures

CamelCase conversion to friendly name, i.e. Enum constants; Problems?

In my answer to this question, I mentioned that we used UpperCamelCase parsing to get a description of an enum constant not decorated with a Description attribute, but it was naive, and it didn't work in all cases. I revisited it, and this is what I came up with:
var result = Regex.Replace(camelCasedString,
#"(?<a>(?<!^)[A-Z][a-z])", #" ${a}");
result = Regex.Replace(result,
#"(?<a>[a-z])(?<b>[A-Z0-9])", #"${a} ${b}");
The first Replace looks for an uppercase letter, followed by a lowercase letter, EXCEPT where the uppercase letter is the start of the string (to avoid having to go back and trim), and adds a preceding space. It handles your basic UpperCamelCase identifiers, and leading all-upper acronyms like FDICInsured.
The second Replace looks for a lowercase letter followed by an uppercase letter or a number, and inserts a space between the two. This is to handle special but common cases of middle or trailing acronyms, or numbers in an identifier (except leading numbers, which are usually prohibited in C-style languages anyway).
Running some basic unit tests, the combination of these two correctly separated all of the following identifiers: NoDescription, HasLotsOfWords, AAANoDescription, ThisHasTheAcronymABCInTheMiddle, MyTrailingAcronymID, TheNumber3, IDo3Things, IAmAValueWithSingleLetterWords, and Basic (which didn't have any spaces added).
So, I'm posting this first to share it with others who may find it useful, and second to ask two questions:
Anyone see a case that would follow common CamelCase-ish conventions, that WOULDN'T be correctly separated into a friendly string this way? I know it won't separate adjacent acronyms (FDICFCUAInsured), recapitalize "properly" camelCased acronyms like FdicInsured, or capitalize the first letter of a lowerCamelCased identifier (but that one's easy to add - result = Regex.Replace(result, "^[a-z]", m=>m.ToString().ToUpper());). Anything else?
Can anyone see a way to make this one statement, or more elegant? I was looking to combine the Replace calls, but as they do two different things to their matches it can't be done with these two strings. They could be combined into a method chain with a RegexReplace extension method on String, but can anyone think of better?

So while I agree with Hans Passant here, I have to say that I had to try my hand at making it one regex as an armchair regex user.
(?<a>(?<!^)((?:[A-Z][a-z])|(?:(?<!^[A-Z]+)[A-Z0-9]+(?:(?=[A-Z][a-z])|$))|(?:[0-9]+)))
Is what I came up with. It seems to pass all the tests you put forward in the question.
So
var result = Regex.Replace(camelCasedString, #"(?<a>(?<!^)((?:[A-Z][a-z])|(?:(?<!^[A-Z]+)[A-Z0-9]+(?:(?=[A-Z][a-z])|$))|(?:[0-9]+)))", #" ${a}");
Does it in one pass.

not that this directly answers the question, but why not test by taking the standard C# API and converting each class into a friendly name? It'd take some manual verification, but it'd give you a good list of standard names to test.

Let's say every case you come across works with this (you're asking us for examples that won't and then giving us some, so you don't even have a question left).
This still binds UI to programmatic identifiers in a way that will make both programming and UI changes brittle.
It still assumes your program will only be used in one language. Either your potential market it so small that just indexing an array of names would be scalable enough (e.g. a one-client bespoke or in-house project), or you are assuming you will never be successful enough to need to be available to other languages or other dialects of your first-chosen language.
Does "well, it'll work as long as we're a failure" sound like a passing grade in balancing designs?
Either code it to use resources, or else code it to pass the enum name blindly or use an array of names, as that at least will be modifiable afterwards.

Regex to parse C/C++ functions declarations

I need to parse and split C and C++ functions into the main components (return type, function name/class and method, parameters, etc).
I'm working from either headers or a list where the signatures take the form:
public: void __thiscall myClass::method(int, class myOtherClass * )
I have the following regex, which works for most functions:
(?<expo>public\:|protected\:|private\:) (?<ret>(const )*(void|int|unsigned int|long|unsigned long|float|double|(class .*)|(enum .*))) (?<decl>__thiscall|__cdecl|__stdcall|__fastcall|__clrcall) (?<ns>.*)\:\:(?<class>(.*)((<.*>)*))\:\:(?<method>(.*)((<.*>)*))\((?<params>((.*(<.*>)?)(,)?)*)\)
There are a few functions that it doesn't like to parse, but appear to match the pattern. I'm not worried about matching functions that aren't members of a class at the moment (can handle that later). The expression is used in a C# program, so the <label>s are for easily retrieving the groups.
I'm wondering if there is a standard regex to parse all functions, or how to improve mine to handle the odd exceptions?

C++ is notoriously hard to parse; it is impossible to write a regex that catches all cases. For example, there can be an unlimited number of nested parentheses, which shows that even this subset of the C++ language is not regular.
But it seems that you're going for practicality, not theoretical correctness. Just keep improving your regex until it catches the cases it needs to catch, and try to make it as stringent as possible so you don't get any false matches.
Without knowing the "odd exceptions" that it doesn't catch, it's hard to say how to improve the regex.

Take a look at Boost.Spirit, it is a boost library that allows the implementation of recursive descent parsers using only C++ code and no preprocessors. You have to specify a BNF Grammar, and then pass a string for it to parse. You can even generate an Abstract-Syntax Tree (AST), which is useful to process the parsed data.
The BNF specification looks like for a list of integers or words separated might look like :
using spirit::alpha_p;
using spirit::digit_p;
using spirit::anychar_p;
using spirit::end_p;
using spirit::space_p;
// Inside the definition...
integer = +digit_p; // One or more digits.
word = +alpha_p; // One or more letters.
token = integer | word; // An integer or a word.
token_list = token >> *(+space_p >> token) // A token, followed by 0 or more tokens.
For more information refer to the documentation, the library is a bit complex at the beginning, but then it gets easier to use (and more powerful).

No. Even function prototypes can have arbitrary levels of nesting, so cannot be expressed with a single regular expression.
If you really are restricting yourself to things very close to your example (exactly 2 arguments, etc.), then could you provide an example of something that doesn't match?

Simplifying Regex's - escaping

I want to enable my users to specify the allowed characters in a given string.
So... Regex's are great but too tough for my users.
my plan is to enable users to specify a list of allowed characters - for example
a-z|A-Z|0-9|,
i can transform this into a regex which does the matching as such:
[a-zA-Z0-9,]*
However i'm a little lost to deal with all the escaping - imagine if a user specified
a-z|A-Z|0-9| |,|||\|*|[|]|{|}|(|)
Clearly one option is to deal with every case individually but before i write such a nasty solution - is there some nifty way to do this?
Thanks
David

Forget regex, here is a much simpler solution:
bool isInputValid = inputString.All(c => allowedChars.Contains(c));

You might be right about your customers, but you could provide some introductory regex material and see how they get on - you might be surprised.
If you really need to simplify, you'll probably need to jetison the use of pipe characters too, and provide an alternative such as putting each item on a new line (in a multi line text box for instance).

To make it as simple as possible for your users, why don't you ditch the "|" and the concept of character ranges, e.g., "a-z", and get them just to type the complete list of characters they want to allow:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890 *{}()
You get the idea. I think this will be much simpler.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.