What is the best character for String.Split?

What is the best character for String.Split? - c#

Disclaimer: I KNOW that in 99% of cases you shouldn't "serialize" data in a concatenated string.
What char you guys use in well-known situation:
string str = userId +"-"+ userName;
In majority of cases I have fallen back to | (pipe) but, in some cases users type even that. What about "non-typable" characters like ☼ (ALT+9999)?

That depends on too many factors to give a concrete answer.
Firstly, why are you doing this? If you feel the need to store the userId and userName by combining them in this fashion, consider alternative approaches, e.g. CSV-style quoting or similar.
Secondly, under normal circumstances only delimiters that aren't part of the strings should be used. If userId is just a number then "-" is fine... but what if the number could be negative?
Third, it depends on what you plan to do with the string. If it is simply for logging or debugger or some other form of human consumption then you can relax a bit about it, and just choose a delimiter that looks appropriate. If you plan to store data like this, use a delimiter than ensures you can extract the data properly later on, regardless of the values of userId or userName. If you can get away with it, use \0 for example. If either value comes from an untrusted source (i.e. the Internet), then make sure the delimiter can't be used as a character in either string. Generally you would limit the characters that each contains - say, digits for userId and letters, digits and SOME punctuation characters for userName.

If it's for data storage and retrieval, there is no way to guarantee that a user won't find a way to inject your delimiter into the string. The safe thing to do is pre-process the input somehow:
Let - be the special character
If a - is encountered in the input, replace it with something like -0.
Use -- as your delimiter
So userid = "alpha-dog" and userName = "papa--0bear" will be translated to
alpha-0dog--papa-0-00bear
The important thing is that your scheme needs to be perfectly undoable, and that the user shouldn't be able to break it, no matter what they enter.
Essentially this is a very primitive version of sanitization.

Related

How do I determine a delimiter in a text file

I have 2 types of input files:
1. comma delimited (i.e: lastName, firstName, Address)
2. space delimited (i.e lastName firstName Address)
The comma delimited file HAS spaces between the ',' and the next word.
How do I go about determining which file I am dealing with ?
I am using C# btw

I've done tons of work with various delimited file types and as everyone else is saying, without normalization you can't really handle the whole thing programmatically.
Generally (and it seems like it would be totally necessary for space-delim) a delimited file will have a text qualifier character (often double-quotes). A couple examples of this points:
Space Delimited:
lastName "Von Marshall" is impossible
without qualifiers.
Addresses would be altogether impossible as well.
Comma Delimited:
addresses are generally unworkable unless they are broken into separate fields or having a solid string is acceptable for your use-case.
So the space delim should be easy enough to determine since you're looking for " ". If this is the case I'd (personally) replace all " " with "," to change it to comma-delim. That way you'd only have to build a single method for handling the text, otherwise I imagine you'll need methods for spaces and commas separately.
If your comma-delim file does not have a text qualifier, you're in a really tricky spot. I haven't found any "perfect" way of addressing this without any human work, but it can be minimized. I've used Notepad++ a lot to do batch replacement with its regular expression functions.
However, you can also use C#'s regex abilities. Here's what MSDN says on that.
So, to answer your question to the best of my ability, unless you can establish a uniqueness between the 2 file types - there's no way. However, if the text has proper text qualifiers, the files have different file extensions, or if the are generated in different directories - you could use any of those qualities or a mix thereof to decide what type of file it is. I have no experience doing this as yet (though I've just started a project using it), so I can't give an exact example, but I can say for anyone to build a perfect example it'd be best if you showed example strings for each file.

As other users have said with some guaranty of having no commas in the space delimited version you cannot with 100% accuracy.
With some information, say that there will always be three fields for all records in all cases when parsed correctly you could just do both and test the results for the correct number of fields. Address is a big block here though since we do not know what that format could be. Also these rules seems odd at best when talking about address.... is
1111somestreest.houston,tx11111 or
1111 somestreet st. Houston, Tx 11111
a valid format?

You could count the number of commas per line of the file. If you have at least 2 commas per line (considering your info is last name, first name, address), you probably have a comma separated. If you have, in at least one line, less than 2 commas, you should consider it as space separated.
I, however, would skip this step and ignore the commas when evaluating the input by replacing all of them by spaces and would implement a single read/grab information procedure (considering only space separated files).

Reverse RegExp from user entered string ( C#)

Is it possible to generate regular expressions from a user entered string? Are there any C# libraries to do this?
For example a user enters a string e.g. ABCxyz123 and the C# code automatically generates [A-Z]{3}[a-z]{3}\d{3}.
This is a simple string but we could have more complicated strings like
MON-0123/AB/5678-abc 2/7
Or
1234-678/abc::1234ABC?246
I already have a string tokeniser (from a previous stackoverflow question) so I could construct a regex from the list of tokens.
But I was wondering if there is a lib or C# code out there that’ll do it.
Edit: Important, I should of also said: It's not the actual character in the string that are important but the type of character and how many.
e.g A user could enter a "pattern" string of ABCxyz123.
This would be interpreted as
3 upper case alphas followed by
3 lower case alphas followed by
3 digits
So other users (when complied) must enter strings that match that pattern [A-Z]{3}[a-z]{3}\d{3}., e.g. QAZplm789
It's the format of user entered strings that's need to be checked not the actual content if that makes sense

Jerry has a related link
creating a regular expression for a list of strings
There are a few other links off this.
I'm not trying to do anything complicated e.g NLP etc.
I could use C# expression builder and dynamic linq at a push, but that seems overkill and a code maintainable nightmare .
I'll write my own "simple" regex builder from the tokenized string.
Example Use Case:
An admin office user where I work could setup the string patterns for each field by typing a string pattern, My code converts this to a regex, I store these in a database.
E.g: Field one requires 3 digits at the start. If there are 2 digits then send to workflow 1 if 3 then send to workflow 2. I could simply check the number of chars by substr or what ever. But this would be a concrete solution.
I am trying to do this generically for multiple documents with multiple fields. Also, each field could have multiple format checkers.
I don't want to write specific C# checks for every single field in numerous documents.
I'll get on with it, should keep me amused for a couple of days.

Optimistic RegEx Matching for User Text Entry

I'm working on a text entry application that uses regular expressions to validate user input. The goal is to allow keypresses that fit a certain RegEx while rejecting invalid characters. One issue I've run into is that when a user starts inputting information they may create a string that doesn't yet match the given regex, but could cause a match in the future. These strings get erroneously rejected. Here's an example - given the following regex for inputting date information:
(0?[1-9]|10|11|12)/(0?[1-9]|[12]\\d|30|31)/\\d{2}\\d{2}
A user may begin entering "1/" which could be a valid date, but RegEx.IsMatch() will return false and my code ends up rejecting the string. Is there a way to "optimistically" test strings against a regular expression so that possible or partial matches are allowed?
Bonus: For this RegEx in particular there are some sequences which cause required characters. For example, if the user types "2/15" the only possible valid character they could enter next is "/". Is it possible to detect those scenarios so that the required characters could be automatically entered for the user to ease input?

What you can do is anchor your RegExp (i.e. adding ^ and $, as in start/end of line) and make some component optionnal for validation, but strictly defined if present.
Something looking like this:
^(0?[1-9]|10|11|12)(/((0?[1-9]|[12]\\d|30|31)(/(\\d{2}(\\d{2})?)?)?)?)?$
I do realize it looks horrible but as far as I know there is no way to tell the regexp engine to validate as long as the string satisfies the beginning of the regexp pattern.
In my opinion, the best way to achieve what you want to do is to create separate inputs for day/month/date and check their value when leaving the text field.
It also provides a better visibility and user-experience, as I believe no one likes to be prevented from typing certain characters into a text field with or without noticing them disappear as they type or having slashes inserted automatically and without notice.

Have you ever used and app or form that worked that way, simply refusing to accept any keypress it didn't like? If the answer is Yes, did it blow an electronic raspberry each time you pressed a wrong key?
If you really need to validate the input before the form is submitted, use a passive feedback mechanism like a red border around the textfield that disappears the regex matches the input. Also, make sure there's a Help button or a tooltip nearby to provide constructive feedback.
Of course, the best option would be to use a dedicated control like a date-entry widget. But whatever you do, don't do it in such a a way that it feels like you're playing guessing games with the user.

CamelCase conversion to friendly name, i.e. Enum constants; Problems?

In my answer to this question, I mentioned that we used UpperCamelCase parsing to get a description of an enum constant not decorated with a Description attribute, but it was naive, and it didn't work in all cases. I revisited it, and this is what I came up with:
var result = Regex.Replace(camelCasedString,
#"(?<a>(?<!^)[A-Z][a-z])", #" ${a}");
result = Regex.Replace(result,
#"(?<a>[a-z])(?<b>[A-Z0-9])", #"${a} ${b}");
The first Replace looks for an uppercase letter, followed by a lowercase letter, EXCEPT where the uppercase letter is the start of the string (to avoid having to go back and trim), and adds a preceding space. It handles your basic UpperCamelCase identifiers, and leading all-upper acronyms like FDICInsured.
The second Replace looks for a lowercase letter followed by an uppercase letter or a number, and inserts a space between the two. This is to handle special but common cases of middle or trailing acronyms, or numbers in an identifier (except leading numbers, which are usually prohibited in C-style languages anyway).
Running some basic unit tests, the combination of these two correctly separated all of the following identifiers: NoDescription, HasLotsOfWords, AAANoDescription, ThisHasTheAcronymABCInTheMiddle, MyTrailingAcronymID, TheNumber3, IDo3Things, IAmAValueWithSingleLetterWords, and Basic (which didn't have any spaces added).
So, I'm posting this first to share it with others who may find it useful, and second to ask two questions:
Anyone see a case that would follow common CamelCase-ish conventions, that WOULDN'T be correctly separated into a friendly string this way? I know it won't separate adjacent acronyms (FDICFCUAInsured), recapitalize "properly" camelCased acronyms like FdicInsured, or capitalize the first letter of a lowerCamelCased identifier (but that one's easy to add - result = Regex.Replace(result, "^[a-z]", m=>m.ToString().ToUpper());). Anything else?
Can anyone see a way to make this one statement, or more elegant? I was looking to combine the Replace calls, but as they do two different things to their matches it can't be done with these two strings. They could be combined into a method chain with a RegexReplace extension method on String, but can anyone think of better?

So while I agree with Hans Passant here, I have to say that I had to try my hand at making it one regex as an armchair regex user.
(?<a>(?<!^)((?:[A-Z][a-z])|(?:(?<!^[A-Z]+)[A-Z0-9]+(?:(?=[A-Z][a-z])|$))|(?:[0-9]+)))
Is what I came up with. It seems to pass all the tests you put forward in the question.
So
var result = Regex.Replace(camelCasedString, #"(?<a>(?<!^)((?:[A-Z][a-z])|(?:(?<!^[A-Z]+)[A-Z0-9]+(?:(?=[A-Z][a-z])|$))|(?:[0-9]+)))", #" ${a}");
Does it in one pass.

not that this directly answers the question, but why not test by taking the standard C# API and converting each class into a friendly name? It'd take some manual verification, but it'd give you a good list of standard names to test.

Let's say every case you come across works with this (you're asking us for examples that won't and then giving us some, so you don't even have a question left).
This still binds UI to programmatic identifiers in a way that will make both programming and UI changes brittle.
It still assumes your program will only be used in one language. Either your potential market it so small that just indexing an array of names would be scalable enough (e.g. a one-client bespoke or in-house project), or you are assuming you will never be successful enough to need to be available to other languages or other dialects of your first-chosen language.
Does "well, it'll work as long as we're a failure" sound like a passing grade in balancing designs?
Either code it to use resources, or else code it to pass the enum name blindly or use an array of names, as that at least will be modifiable afterwards.

validating user input tags

I know this question might sound a little cheesy but this is the first time I am implementing a "tagging" feature to one of my project sites and I want to make sure I do everything right.
Right now, I am using the very same tagging system as in SO.. space seperated, dash(-) combined multiple words. so when I am validating a user-input tag field I am checking for
Empty string (cannot be empty)
Make sure the string doesnt contain particular letters (suggestions are welcommed here..)
At least one word
if there is a space (there are more than one words) split the string
for each splitted, insert into db
I am missing something here? or is this roughly ok?

Split the string at " ", iterate over the parts, make sure that they comply with your expectations. If they do, put them into the DB.
For example, you can use this regex to check the individual parts:
^[-\w]{2,25}$
This would limit allowed input to consecutive strings of alphanumerics (and "_", which is part of "\w" as well as "-" because you asked for it) 2..25 characters long. This essentially removes any code injection threat you might be facing.
EDIT: In place of the "\w", you are free to take any more closely defined range of characters, I chose it for simplicity only.

I've never implemented a tagging system, but am likely to do so soon for a project I'm working on. I'm primarily a database guy and it occurs to me that for performance reasons it may be best to relate your tagged entities with the tag keywords via a resolution table. So, for instance, with example tables such as:
TechQuestion
TechQuestionID (pk)
SubjectLine
QuestionBody
TechQuestionTag
TechQuestionID (pk)
TagID (pk)
Active (indexed)
Tag
TagID (pk)
TagText (indexed)
... you'd only add new Tag table entries when never-before-used tags were used. You'd re-associate previously provided tags via the TechQuestionTag table entry. And your query to pull TechQuestions related to a given tag would look like:
SELECT
q.TechQuestionID,
q.SubjectLine,
q.QuestionBody
FROM
Tag t INNER JOIN TechQuestionTag qt
ON t.TagID = qt.TagID AND qt.Active = 1
INNER JOIN TechQuestion q
ON qt.TechQuestionID = q.TechQuestionID
WHERE
t.TagText = #tagText
... or what have you. I don't know, perhaps this was obvious to everyone already, but I thought I'd put it out there... because I don't believe the alternative (redundant, indexed, text-tag entries) wouldn't query as efficiently.

Be sure your algorithm can handle leading/trailing/extra spaces with no trouble = )
Also worth thinking about might be a tag blacklist for inappropriate tags (profanity for example).

I hope you're doing the usual protection against injection attacks - maybe that's included under #2.
At the very least, you're going to want to escape quote characters and make embedded HTML harmless - in PHP, functions like addslashes and htmlentities can help you with that. Given that it's for a tagging system, my guess is you'll only want to allow alphanumeric characters. I'm not sure what the best way to accomplish that is, maybe using regular expressions.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.