Generate all combination of letters - c#

I am trying to build a app that can scramble the input letters. I have found code samples that can rearrange:
abc into cba, acb etc.
What I am trying to do though is the above, but also being able to output shorter combinations using only the letters inputted.
So my desired app would be able to sort abc into a, bc, acb etc.
I realise this might require some sort of algorithm or but I haven't been able to find anything related on the web.
Thanks!

You need to use the concept of "combinations" in combinatorics - it combines permutations with selections of subsets:
Algorithm to return all combinations of k elements from n

If you know how to get all combinations of all available letters, then just add the space character to the list of possible characters and get all combinations, trimming anything to the left of and including the space (and ignore the empty cases).
For example, for the word IF, you have 'IF' and 'FI'. If you treat space as possible you have
' IF', ' FI', 'I F', 'F I', 'IF ', 'FI '
which, trimming everything left of and including the space, becomes
'IF', 'FI', 'F', 'I', '', ''
Ignoring the empty cases, those are your possible combinations including shorter words.

Are you willing to do this in php? If so - the
array_rand ($array, $num);
function would work great. It's first argument is the array in question:
a. make an array of A-Z
the second argument is how many 'letters' or array components to choose
b. use:
rand($min, $max);
to generate a random number between 1 and 26 (characters in the alphabet).
c. Loop this function as many times as you'd like.

Do this until string has looped through all it's indexes
string at index equal wanted number modulo number base plus character set code
wanted number equals wanted number divided by number base
continue.

Related

string.IndexOf() not recognizing modified characters

When using IndexOf to find a char which is followed by a large valued char (e.g. char 700 which is ʼ) then the IndexOf fails to recognize the char you are looking for.
e.g.
string find = "abcʼabcabc";
int index = find.IndexOf("c");
In this code, index should be 2, but it returns 6.
Is there a way to get around this?
Unicode letter 700 is a modifier apostrophe: in other words, it modifies the letter c. In the same way, if you were to use an 'e' followed by character 769 (0x301), it would not really be an 'e' anymore: the e has been modified to be e with an acute accent. To wit: é. You'll see that letter is actually two characters: copy it to notepad and hit backspace (neat, huh?).
You need to do an "Ordinal" comparison (byte-by-byte) without any linguistic comparison. That will find the 'c', and ignore the linguistic fact that it is modified by the next letter. In my 'e' example, the bytes are (65)(769), so if you go byte-by-byte looking for 65, you will find it, and that ignores the fact that (65)(769) is linguistically the same as (233): é. If you search for (233) linguistically it will find the "equivalent" (65)(769):
string find = "abéabcabc";
int index = find.IndexOf("é"); //gives you '2' even though the "find" has two characters and the the "indexof" is one
Hopefully that's not too confusing. If you're doing this in real code you should explain in comments exactly what you're doing: as in my 'e' example generally you would want to do semantic equivalence for user data, and ordinal equivalence for e.g. constants (which hopefully wouldn't be different like this, lest your successor hunt you down with an axe).
The cʼ construct is being handled as linguistically different to the simple bytes. Use the Ordinal string comparison to force a byte comparison.
string find = "abcʼabcabc";
int index = find.IndexOf("c", StringComparison.Ordinal);

Regular Expression for UK postcodes

I have a list of post codes which should be excluded from my shipping methods.
Suppose I have to exclude Scilly Isles, Isle of Man and few others.
For the above 2 areas valid post codes are IM1-IM9, IM86, IM87, IM89. And if it is IM25 or IM85 it is invalid.
I have writtent following expression. But it is returning even it is IM25 or IM 85.
var regex = new Regex("(PO3[0-9]|PO4[0-1]|GY[1-9]|JE[1-5]|IM[1-9]|TR[1-9])");
If I am passing IM85, to my expression it should return false. for IM1-IM9,, IM86, IM87, IM89 it should return true.
Same with TR post codes also. TR1-TR27 is a valid post code. If I give TR28, it should return false.
I am using '|' to seperate multiple patterns. Is that the right way of including multiple patterns in 1 expression.
What do you expect? What should be matched and what not? And please give an example of the string you want to test.
If you match your pattern against "IM25" it will match because you do allow IM[1-9] in your pattern, so you get a valid partial match. If you want to avoid that (I am not sure what you want to achieve) and want to allow really only a single digit after the first letters, use a "word boundary" \b and specify exactly what you want to allow, something like this:
(PO3[0-9]|PO4[0-1]|GY[1-9]|JE[1-5]|IM([1-9]|8[6-9])|TR([1-9]|2[0-7]))\b
See it here on Regexr
this would allow for the "IM" part also 6-9 as a second digit when there is a 8 before.
Update
It is still not clear what the context of your task is. I assume you have a list of valid Postcodes, probably it would be better, you extract the post code or only the first part of it (for that you can eventually use a regex) and check if it is in the list or not.
The actual validation is on the wikipedia site... Google has the answers ;) http://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom#Validation
(GIR 0AA)|(((A[BL]|B[ABDFHLNRSTX]?|C[ABFHMORTVW]|D[ADEGHLNTY]|E[HNX]?|F[KY]|G[LUY]?|H[ADGPRSUX]|I[GMPV]|JE|K[ATWY]|L[ADELNSU]?|M[EKL]?|N[EGNPRW]?|O[LX]|P[AEHLOR]|R[GHM]|S[AEGKLMNOPRSTY]?|T[ADFNQRSW]|UB|W[ADFNRSV]|YO|ZE)[1-9]?[0-9]|((E|N|NW|SE|SW|W)1|EC[1-4]|WC[12])[A-HJKMNPR-Y]|(SW|W)([2-9]|[1-9][0-9])|EC[1-9][0-9]) [0-9][ABD-HJLNP-UW-Z]{2})
I still think you need more clarification. As a huge Regex guy, I would like to point out that multi-digit ranges should try to be put into the code side, not the Regex side, just for your sanity. But I personally like to play with Regex in this way. Regex reads one character at a time, so it only recognizes zero through nine. Not ten, not twenty eight. If you want to allow the following:
28 through 347
Then it becomes pretty complicated.
To put it into words, you want to allow:
If Two Digits, allow 2-9 for the first digit, and:
If the first digit is a Two, then allow 8/9 for the second digit,
ElseIf the first Digit is 3-9, then allow 0-9 for the second digit
Elseif Three Digits, allow 1-3 for the first Digit, and:
If the first digit is a Three, then allow 0-4 for the second digit, and:
If the second digit is a Four, then allow 0-7 for the third digit,
ElseIf the second digit is 0-3, then allow 0-9 for the third digit.
ElseIf the first digit is 1/2, then allow 0-9 for both the Second and Third digits.
Then with that, you can write a proper Regex like so, which searches for a word boundary or non-Digit surrounding a 2-pair or 3-pair. With this type of Problem-Solving, you should be able to figure out your Regex issue. Otherwise, let us know more about EXACTLY What you want to Match and NOT Match:
(\b|\D)((2[89]|[3-9][0-9])(\b|\D)|(3(4[0-7]|[0-3][0-9])|[12][0-9][0-9])(\b|\D))
I have changed my approach.
Instead of going for a regular expression which is becoming more complex, I am saving all the excluded outward codes of UK post codes.
And if any post code contains the particular outward code, excluding the post code from the list.
Outward codes are in this format
XX-YYY
XXX-YYY
XXXX-YYY
In all above formats, X represents outward code of an UK postcode.

Counting special UTF-8 character

I'm finding a way to count special character that form by more than one character but found no solution online!
For e.g. I want to count the string "வாழைப்பழம". It actually consist of 6 tamil character but its 9 character in this case when we use the normal way to find the length. I am wondering is tamil the only kind of encoding that will cause this problem and if there is a solution to this. I'm currently trying to find a solution in C#.
Thank you in advance =)
Use StringInfo.LengthInTextElements:
var text = "வாழைப்பழம";
Console.WriteLine(text.Length); // 9
Console.WriteLine(new StringInfo(text).LengthInTextElements); // 6
The explanation for this behaviour can be found in the documentation of String.Length:
The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.
A minor nitpick: strings in .NET use UTF-16, not UTF-8
When you're talking about the length of a string, there are several different things you could mean:
Length in bytes.  This is the old C way of looking at things, usually.
Length in Unicode code points.  This gets you closer to the modern times and should be the way how string lengths are treated, except it isn't.
Length in UTF-8/UTF-16 code units.  This is the most common interpretation, deriving from 1. Certain characters take more than one code unit in those encodings which complicates things if you don't expect it.
Count of visible “characters” (graphemes). This is usually what people mean when they say characters or length of a string.
In your case your confusion stems from the difference between 4. and 3. 3. is what C# uses, 4. is what you expect. Complex scripts such as Tamil use ligatures and diacritics. Ligatures are contractions of two or more adjacent characters into a single glyph – in your case ழை is a ligature of ழ and ை – the latter of which changes the appearance of the former; வா is also such a ligature. Diacritics are ornaments around a letter, e.g. the accent in à or the dot above ப்.
The two cases I mentioned both result in a single grapheme (what you perceive as a single character), yet they both need two actual characters each. So you end up with three code points more in the string.
One thing to note: For your case the distinction between 2. and 3. is irrelevant, but generally you should keep it in mind.

Shall this Regex do what I expect from it, that is, matching against "A1:B10,C3,D4:E1000"?

I'm currently writing a library where I wish to allow the user to be able to specify spreadsheet cell(s) under four possible alternatives:
A single cell: "A1";
Multiple contiguous cells: "A1:B10"
Multiple separate cells: "A1,B6,I60,AA2"
A mix of 2 and 3: "B2:B12,C13:C18,D4,E11000"
Then, to validate whether the input respects these formats, I intended to use a regular expression to match against. I have consulted this article on Wikipedia:
Regular Expression (Wikipedia)
And I also found this related SO question:
regex matching alpha character followed by 4 alphanumerics.
Based on the information provided within the above-linked articles, I would try with this Regex:
Default Readonly Property Cells(ByVal cellsAddresses As String) As ReadOnlyDictionary(Of String, ICell)
Get
Dim validAddresses As Regex = New Regex("A-Za-z0-9:,A-Za-z0-9")
If (Not validAddresses.IsMatch(cellsAddresses)) then _
Throw New FormatException("cellsAddresses")
// Proceed with getting the cells from the Interop here...
End Get
End Property
Questions
1. Is my regular expression correct? If not, please help me understand what expression I could use.
2. What exception is more likely to be the more meaningful between a FormatException and an InvalidExpressionException? I hesitate here, since it is related to the format under which the property expect the cells to be input, aside, I'm using an (regular) expression to match against.
Thank you kindly for your help and support! =)
I would try this one:
[A-Za-z]+[0-9]+([:,][A-Za-z]+[0-9]+)*
Explanation:
Between [] is a possible group of characters for a single position
[A-Za-z] means characters (letters) from 'A' to 'Z' and from 'a' to 'z'
[0-9] means characters (digits) from 0 to 9
A "+" appended to a part of a regex means: repeat that one or more times
A "*" means: repeat the previous part zero or more times.
( ) can be used to define a group
So [A-Za-z]+[0-9]+ matches one or more letters followed by one or more digits for a single cell-address.
Then that same block is repeated zero or more times, with a ',' or ':' separating the addresses.
Assuming that the column for the spreadsheet is any 1- or 2-letter value and the row is any positive number, a more complex but tighter answer still would be:
^[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?(,[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?)*$
"[A-Z]{1,2}[1-9]\d*" is the expression for a single cell reference. If you replace "[A-Z]{1,2}[1-9]\d*" in the above with then the complex expression becomes
^<cell>(:<cell>)?(,<cell>(:<cell>*)?)*$
which more clearly shows that it is a cell or a range followed by one or more "cell or range" entries with commas in between.
The row and column indicators could be further refined to give a tighter still, yet more complex expression. I suspect that the above could be simplified with look-ahead or look-behind assertions, but I admit those are not (yet) my strong suit.
I'd go with this one, I think:
(([A-Z]+[1-9]\d*:)?[A-Z]+[1-9]\d*,)*([A-Z]+[1-9]\d*:)?[A-Z]+[1-9]\d*
This only allows capital letters as the prefix. If you want case insensitivity, use RegexOptions.IgnoreCase.
You could simplify this by replacing [A-Z]+[1-9]\d* with plain old [A-Z]\d+, but that will only allow a one-letter prefix, and it also allows stuff like A0 and B01. Up to you.
EDIT:
Having thought hard about DocMax's mention of lookarounds, and using Hans Kesting's answer as inspiration, it occurs to me that this should work:
^[A-Z]+\d+((,|(?<!:\w*):)[A-Z]+\d+)*$
Or if you want something really twisted:
^([A-Z]+\d+(,|$|(?<!:\w*):))*(?<!,|:)
As in the previous example, replace \d+ with [1-9]\d* if you want to prevent leading zeros.
The idea behind the ,|(?<!\w*:): is that if a group is delimited by a comma, you want to let it through; but if it's a colon, it's only allowed if the previous delimiter wasn't a colon. The (,|$|...) version is madness, but it allows you to do it all with only one [A-Z]+\d+ block.
However! Even though this is shorter, and I'll admit I feel a teeny bit clever about it, I pity the poor fellow who has to come along and maintain it six months from now. It's fun from a code-golf standpoint, but I think it's best for practical purposes to go with the earlier version, which is a lot easier to read.
i think your regex is incorrect, try (([A-Za-z0-9]*)[:,]?)*
Edit : to correct the bug pointed out by Baud : (([A-Za-z0-9]*)[:,]?)*([A-Za-z0-9]+)
and finally - best version : (([A-Za-z]+[0-9]+)[:,]?)*([A-Za-z]+[0-9]+)
// ah ok this wont work probably... but to answer 1. - no i dont think your regex is correct
( ) form a group
[ ] form a charclass (you can use A-Z a-d 0-9 etc or just single characters)
? means 1 or 0
* means 0 or any
id suggest reading http://www.regular-expressions.info/reference.html .
thats where i learned regexes some time ago ;)
and for building expressions i use Rad Software Regular Expression Designer
Let's build this step by step.
If you are following an Excel addressing format, to match a single-cell entry in your CSL, you would use the regular expression:
[A-Z]{1,2}[1-9]\d*
This matches the following in sequence:
Any character in A to Z once or twice
Any digit in 1 to 9
Any digit zero or more times
The digit expression will prevent inputting a cell address with leading zeros.
To build the expression that allows for a cell address pair, repeat the expression preceded by a colon as optional.
[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?
Now allow for repeating the pattern preceded by a comma zero or more times and add start and end string delimiters.
^[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?(,[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?)*$
Kind of long and obnoxious, I admit, but after trying enough variants, I can't find a way of shortening it.
Hope this is helpful.

What is the regular expression for the following strings and would the expression change if the number rolled over?

What would be the following regular expressions for the following strings?
56AAA71064D6
56AAA7105A25
Would the regular expression change if the numbers rolled over? What I mean by this is that the above numbers happen to contain hexadecimal values and I don't know how the value changes one it reaches F. Using the first one as an example: 56AAA71064D6, if this went up to
56AAA71064F6 and then the following one would become 56AAA7106406, this would create a different regular expression because where a letter was allowed, now their is a digit, so does this make the regular expression even more difficult. Suggestions?
A manufacturer is going to enter a range of serial numbers. The problems are that different manufacturers have different formats for serial numbers (some are just numbers, some are alpha numeric, some contain extra characters like dashes, some contain hexadacimal values which makes it more difficult because I don't know how the roll over to the next serial number). The roll over issue is the biggest problem because the serial numbers are entered as a range like 5A1B - 6F12 and without knowing how the roll over, it seems to me that storing them in the database is not as easy. I was going to have the option of giving the user the option to input the pattern (expression) and storing that in the databse, but if a character or characters changes from a digit to a letter or vice versa, then the regular expression is no longer valid for certain serial numbers.
Also, the above example I gave is with just one case. There are multitude of serial numbers that would contain different expressions.
There's no single regular expression which is "the" expression to match both of those strings. Instead, there are infinitely many which will do so. Here are two options at opposite ends of the spectrum:
(56AAA71064D6)|(56AAA7105A25)
.*
The first will only match those two strings. The second will match anything. Both satisfy all the criteria you've given.
Now, if you specify more criteria, then we'd be able to give a more reasonable idea of the regular expression to provide - and that will drive the answers to the other questions. (At the moment, the only answer that makes sense is "It depends on what regex you use.")
I think you could do it this way for 12 characters. This will search for a 12 character phrase where each of the characters must be a capital (A or B or C or D or E or F or 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8 or 9 or 0)
[A-F0-9]{12}
If you're wanting to include the possibility of dashes then do this.
[A-F0-9\-]{12}
Or you're wanting to include the possibility of dashes plus the 12 characters then do this. But that would pick up any 12-15 character item that fit the criteria though.
[A-F0-9\-]{12,15}
Or if it's surrounded by spaces (AAAAHHHh...SO is stripping out my spaces!!!)
[A-F0-9\-]{12}
Or if it's surrounded by tabs
\t[A-F0-9\-]{12}\t
This match a string that contains 12 hexa
[0-9A-F]{12}
Assuming these are all 12-digit hexadecimal numbers, which it looks like they are, the following regex should work:
[0-9A-Fa-f]{12}
Here I'm using a character class to say that I want any digit, OR A-F, OR a-f. As a bonus I'm allowing lowercase letters; if you don't want those just get them out of the regex.
As Jon Skeet and others have said, you really didn't provide enough information, so if you don't like this answer please understand that I was doing the best I can with what information you provided.
So, how about this:
[0-9A-F]{12}
Well it sounds like you're describing a 12 digit hexadecimal number:
^[A-F0-9]{12}$

Categories