Dynamic Regex for number range using C#

I'm looking at UK postcodes and trying to work out how I can take data from a database (the first part of a UK postcode) and dynamically create a regexp for them using c#. For example:
AB44-56
I know what I want as an output:
AB([4][4-9]|[5][0-6])+
However, I can't work out how I might be able to do this with logic. Perhaps I need to split the letters from the numbers first, but I can't do that using Split.
I have other combinations too - single range:
AB31 would be AB[3][1]+
Some with just letters:
BT would be BT+
Some with a single letter and 1 or two numbers:
G83 Would be G[8][3]
Any suggestions or guidance on how this might be coded would be very much appreciated.

From Wikipedia, UK postal codes:
This can be generalised as: (one or two letters)(number between 0 and
99)(zero or one letter)(space)(single digit)(two letters)
so
^[A-Za-z]{1,2}\d{1,2}[A-Za-z]?\s\d[A-Za-z]{2}$
might work.
EDIT: Also, if you are trying to restrict the postal codes to, say, those with the same prefix as the ones in the database, you could do this:
var source = "BTasdfweasdf"; // from the database
var input = "BT1A 1BB"; // from somewhere else
var regex = Regex.Replace(source, @"(^[A-Za-z]{0,2})(.*)", @"$1\d+[A-Za-z]?\s\d[A-Za-z]{2}$");
var match = Regex.Match(input, regex);
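For the range case in the question (AB44-56), one possible approach is to split the letters from the numbers with a regex and then list every number in the range as an alternation, rather than compressing the range into digit classes as in the question's hand-written pattern. The sketch below is written as a C# 9 top-level program; BuildPrefixPattern is a hypothetical helper name, and the accepted input shapes (one or two letters, optional number, optional "-upper bound") are assumptions taken from the examples in the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: turns a stored prefix such as "AB44-56", "AB31", "BT" or "G83"
// into an anchored pattern for the start of a postcode.
static string BuildPrefixPattern(string spec)
{
    // One or two letters, then an optional number, then an optional "-upper bound".
    var m = Regex.Match(spec, @"^([A-Za-z]{1,2})(?:([0-9]{1,2})(?:-([0-9]{1,2}))?)?$");
    if (!m.Success)
        throw new FormatException("Unrecognised prefix: " + spec);

    string letters = m.Groups[1].Value.ToUpperInvariant();
    if (!m.Groups[2].Success)
        return "^" + letters + "(?![A-Z])"; // letters only, e.g. "BT"

    int low = int.Parse(m.Groups[2].Value);
    int high = m.Groups[3].Success ? int.Parse(m.Groups[3].Value) : low;
    if (high < low)
        throw new FormatException("Range is reversed: " + spec);

    // Enumerate the range as a plain alternation: 44|45|...|56.
    string numbers = string.Join("|", Enumerable.Range(low, high - low + 1));
    return "^" + letters + "(?:" + numbers + ")(?![0-9])";
}

Console.WriteLine(BuildPrefixPattern("AB44-56"));
Console.WriteLine(Regex.IsMatch("AB45 1AA", BuildPrefixPattern("AB44-56")));
```

The trailing negative lookaheads stop "AB4" from matching "AB44 ..." and "G" from matching "GL ...". The alternation is longer than a digit-class pattern but is much easier to generate correctly.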

Related

c# string format validate

Update: The acceptable format is ADD|<number>|<number>.
I need to check whether the request that the server gets is in this format, with the numbers between <>.
After that I have to read the numbers, add them and write the result back. So if the request does not fit the format, for example ADD|<5>|<8>,
I have to refuse it and produce a specific error message (it is not a number, it is the wrong format, etc.). I checked the ADD| part, I put the parts in an array, and I can check whether the numbers are not numbers. But I cannot check whether the numbers are in <> or not, because the numbers can contain multiple digits and ADD|<7>|<13> does not have the same number of characters as ADD|<2358>|<78961156>. How can I check that the numbers are between <>?
Please help me with the following: I need to make a server-client console application, and I would like to validate requests from the clients. The acceptable format is XXX|<number>|<number>.
I can split the message like here:
string[] messageProcess = message.Split('|');
and I can check if it is a number or not:
if (!(double.TryParse(messageProcess[1], out double number1)) || !(double.TryParse(messageProcess[2], out double number2)))
but how can I check the <number> part?
Thank you for your advice.
You can use Regex for that.
If I understood you correctly, the following inputs should pass validation:
xxx|1232|32133
xxx|5345|23423
XXX|1323|45645
and following shouldn't:
YYY|1231|34423
XXX|ds12|sda43
If my assumptions are correct, this Regex should do the trick:
XXX\|\d+\|\d+
What does it do?
first it looks for three X's... (if it doesn't matter whether the X is uppercase or lowercase, substitute XXX with (?:XXX|xxx) or use the "case insensitive" regex flag - demo)
separated by a pipe (|)...
then looks for one or more digits...
separated by a pipe (|)...
finally ending with another set of one or more digits
You can see the demo here: Regex101 Demo
And since you are using C#, the Regex.IsMatch() would probably fit you best. You can read about it here, if you are unfamiliar with regular expressions and how to use them in C#.
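The pattern above does not check the angle brackets that the question asks about. A minimal sketch that does, assuming the request must be exactly ADD|<digits>|<digits> and that the numbers are non-negative integers that fit in a long (TryAdd is a hypothetical helper name, and the code is written as a C# 9 top-level program):

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: validates "ADD|<n>|<m>" and returns n + m through 'sum'.
static bool TryAdd(string message, out long sum)
{
    sum = 0;
    // The capture groups hold only the digits between '<' and '>'.
    var m = Regex.Match(message, @"^ADD\|<([0-9]+)>\|<([0-9]+)>$");
    if (!m.Success)
        return false; // wrong format: missing brackets, letters, extra fields...

    // TryParse can still fail if a number has too many digits for a long.
    if (!long.TryParse(m.Groups[1].Value, out long first) ||
        !long.TryParse(m.Groups[2].Value, out long second))
        return false;

    sum = first + second;
    return true;
}

if (TryAdd("ADD|<2358>|<78961156>", out long total))
    Console.WriteLine(total);
```

Because the length of each number is handled by [0-9]+, ADD|<7>|<13> and ADD|<2358>|<78961156> are validated by the same pattern. To produce different error messages, the format check and the number check can report their failures separately.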

Delete character out of string

I am having some problems with a quite easy task - I feel like I'm missing something very obvious here.
I have a .csv file which is semicolon separated. In this file are several numbers that contain dots like "1.300", but there are also dates like "2015.12.01". The task is to find and delete all dots, but only those that are in numbers and not in dates. The dates and numbers are completely variable and never at the same position in the file.
My question now: What is the 'best' way to handle this problem?
From a programmer's point of view: Is it a good solution to just split at every semicolon, count the dots and, if there is only one dot, delete it? This is the only way to solve the problem I could think of so far.
Example source file:
2015.12.01;
13.100;
500;
1.200;
100;
Example result:
2015.12.01;
13100;
500;
1200;
100;
If you can rely on the fact that dates have two dots and numbers just one, you can use that as a filter:
string s = "123.45";
if (s.Count(x => x == '.') == 1)
{
s = s.Replace(".", null);
}
The source file looks like a valid file generated by a program running on a machine whose locale uses . as the thousand separator (most of Europe does) and date separator (German locales only I think). Such locales also use ; as the list separator.
If the question was only how to parse such dates, numbers, the answer would be to pass the proper culture to the parse function, eg: decimal.Parse("13.500",new CultureInfo("de-at")) would return 13500. The actual issue though is that the data must be fed to another program that uses . as the decimal separator.
The safest option would be to change the locale used by the exporting program, e.g. change the thread CultureInfo if the exporter is a .NET program, the locale in an SSIS package etc., to a locale like en-gb to export with . as the decimal separator and avoid the weird date format. This assumes that the next program in the pipeline doesn't expect German-style dates alongside English-style numbers.
Another option would be to load the text, parse the fields using the proper locale then export them in the format required by the next program.
Finally, a regular expression could be used to match only the numeric fields and remove the dot. This can be a bit tricky and depends on the actual contents.
For example (\d+)\.(\d{3}) can be used to match numbers if there is only one thousand separator. This can fail if some text field contains similar values. Or ;(\d+)\.(\d{3}); could match only a full field, except the first and last fields, eg:
Regex.Replace("1.457;2016.12.30;13.000;1,50;2015.12.04;13.456",@";(\d+)\.(\d{3});",@"$1$2;")
produces :
1.457;2016.12.3013000;1,50;2015.12.04;13.456
A regular expression that would match either numbers between ; or the first/last field could be
(^|;)(\d+)\.(\d{3})(;|$)
This would produce 1457;2016.12.30;13000;1,50;2015.12.04;13456, eg:
var data="1.457;2016.12.30;13.000;1,50;2015.12.04;13.456";
var pattern=@"(^|;)(\d+)\.(\d{3})(;|$)";
var replacement=@"$1$2$3$4";
var result= Regex.Replace(data,pattern,replacement);
The advantage of a regex over splitting and replacing strings is that it's a lot faster and more memory efficient. Instead of generating temporary strings for each split, manipulation, a Regex only calculates indexes in the source. A string object is generated only when you request the final text result. This results in far fewer allocations and garbage collections.
Even in medium-sized files this can result in 10x better performance
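One caveat with the pattern (^|;)(\d+)\.(\d{3})(;|$): each match consumes the ; on both sides, so when two numeric fields are adjacent the second one has no leading ; left to match and is skipped. A possible variant, sketched here as a C# 9 top-level program, uses a lookbehind and a lookahead so the separators are checked but not consumed. RemoveThousandsDots is a hypothetical helper name, and the sketch still assumes at most one thousand separator per number.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: removes the dot from fields that look like "digits.ddd".
static string RemoveThousandsDots(string line)
{
    // (?<=^|;) and (?=;|$) test the field boundaries without consuming them,
    // so neighbouring numeric fields are all matched.
    return Regex.Replace(line, @"(?<=^|;)([0-9]+)\.([0-9]{3})(?=;|$)", "$1$2");
}

Console.WriteLine(RemoveThousandsDots("1.000;2.000;2015.12.01;3.000"));
```

Dates such as 2015.12.01 are left alone because the text after their first dot is not exactly three digits followed by a field boundary.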
I wouldn't rely on the number of dots as mistakes can be made.
You can use the double.TryParse to safely test if the string is a number
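// requires: using System.Globalization;
var data = "2015.12.01;13.100;500;1.200;100;";
var dataArray = data.Split(';');
foreach (var s in dataArray)
{
    double result;
    // Parse with the invariant culture so the outcome does not depend on the machine's locale.
    // "2015.12.01" and the trailing empty field fail to parse and are skipped.
    if (double.TryParse(s, NumberStyles.Float, CultureInfo.InvariantCulture, out result))
    {
        // implement your logic here
        Console.WriteLine(s.Replace(".", string.Empty));
    }
}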

Reverse RegExp from user entered string ( C#)

Is it possible to generate regular expressions from a user entered string? Are there any C# libraries to do this?
For example a user enters a string e.g. ABCxyz123 and the C# code automatically generates [A-Z]{3}[a-z]{3}\d{3}.
This is a simple string but we could have more complicated strings like
MON-0123/AB/5678-abc 2/7
Or
1234-678/abc::1234ABC?246
I already have a string tokeniser (from a previous stackoverflow question) so I could construct a regex from the list of tokens.
But I was wondering if there is a lib or C# code out there that’ll do it.
Edit: Important, I should also have said: it's not the actual characters in the string that are important but the type of each character and how many there are.
e.g A user could enter a "pattern" string of ABCxyz123.
This would be interpreted as
3 upper case alphas followed by
3 lower case alphas followed by
3 digits
So other users must enter strings that comply with that pattern, [A-Z]{3}[a-z]{3}\d{3}, e.g. QAZplm789
It's the format of user entered strings that needs to be checked, not the actual content, if that makes sense
Jerry has a related link
creating a regular expression for a list of strings
There are a few other links off this.
I'm not trying to do anything complicated e.g NLP etc.
I could use C# expression builder and dynamic LINQ at a push, but that seems overkill and a code maintenance nightmare.
I'll write my own "simple" regex builder from the tokenized string.
Example Use Case:
An admin office user where I work could set up the string patterns for each field by typing a string pattern. My code converts this to a regex, and I store these in a database.
E.g.: Field one requires 3 digits at the start. If there are 2 digits then send to workflow 1, if 3 then send to workflow 2. I could simply check the number of chars with Substring or whatever, but this would be a concrete solution.
I am trying to do this generically for multiple documents with multiple fields. Also, each field could have multiple format checkers.
I don't want to write specific C# checks for every single field in numerous documents.
I'll get on with it, should keep me amused for a couple of days.
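A "simple" builder of the kind described can be sketched as a run-length encoding of character types. This is only an illustration under assumptions: only ASCII letters and digits are classified, every other character is matched literally, and the names ClassOf and PatternFromSample are hypothetical. It is written as a C# 9 top-level program.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Maps one character to the regex token for its type.
static string ClassOf(char c)
{
    if (c >= 'A' && c <= 'Z') return "[A-Z]";
    if (c >= 'a' && c <= 'z') return "[a-z]";
    if (c >= '0' && c <= '9') return @"\d";
    return Regex.Escape(c.ToString()); // punctuation and spaces are matched literally
}

// Collapses each run of same-type characters into token{count}.
static string PatternFromSample(string sample)
{
    var pattern = new StringBuilder("^");
    int i = 0;
    while (i < sample.Length)
    {
        string token = ClassOf(sample[i]);
        int run = 1;
        while (i + run < sample.Length && ClassOf(sample[i + run]) == token)
            run++;
        pattern.Append(token);
        if (run > 1)
            pattern.Append('{').Append(run).Append('}');
        i += run;
    }
    return pattern.Append('$').ToString();
}

Console.WriteLine(PatternFromSample("ABCxyz123"));
```

The generated pattern is anchored with ^ and $ so that a field must match the whole sample format; drop the anchors if only a prefix check ("3 digits at the start") is wanted.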

C# read from text file and store in variables

I have a text file that reads
1 "601 Cross Street College Station TX 71234"
2 "(another address)"
3 ...
.
.
I wanted to know how to parse this text file into an integer and a string using C#. The integer would hold the S.No and the string the address without the quotes.
I need to do this because later on I have a function that takes these two values from the text file as input and spits out some data. This function has to be executed on each entry in the text file.
If a is the integer and add is the string, the output should be
a=1; add=601 Cross Street College Station TX 71234 //for the first line and so on
As one can observe the address needs to be one string.
This is not a homework question. And what I have been able to accomplish so far is to read out all the lines using
string[] lines = System.IO.File.ReadAllLines(@"C:\Users\KS\Documents\input.txt");
Any help is appreciated.
I would need to see more of your input data to determine the most reliable method.
But one approach would be to split each address into words. You can then loop through the words and find each word that contains only digits. This will be your street number. You could look after the street number and look for S, So, or South but as your example illustrates, there might be no such indicator.
Also, you haven't provided what you want to happen if more than one number is found.
As far as removing the quotes, just remove the first and last characters. I'd recommend checking that they are in fact quotes before removing them.
From your description, every entry has this format:
[space][number][space][quote][address][quote]
Here is some quick and dirty code that will parse this format into an int/string tuple:
using System;
using System.Linq;
static Tuple<int, string> ParseLine(string line)
{
var tokens = line.Split(); // Split by spaces
var number = int.Parse(tokens[1]); // The number is the 2nd token
var address = string.Join(" ", tokens.Skip(2)); // The address is every subsequent token
address = address.Substring(1, address.Length - 2); // ... minus the first and last characters
return Tuple.Create(number, address);
}
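As an alternative sketch that does not depend on whether a line starts with a space, a regex can capture the number and the quoted address in one step. It assumes each entry is a number, whitespace, then an address in double quotes; TryParseEntry is a hypothetical helper name, and lines that do not fit (such as "3 ...") are simply skipped. Written as a C# 9 top-level program.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: extracts the serial number and the unquoted address from one line.
static bool TryParseEntry(string entry, out int serial, out string street)
{
    serial = 0;
    street = string.Empty;
    // Optional leading whitespace, the number, whitespace, then the quoted address.
    var m = Regex.Match(entry, @"^\s*([0-9]+)\s+""(.*)""\s*$");
    if (!m.Success || !int.TryParse(m.Groups[1].Value, out serial))
        return false;
    street = m.Groups[2].Value;
    return true;
}

var sampleLines = new[]
{
    "1 \"601 Cross Street College Station TX 71234\"",
    " 2 \"(another address)\"",
    "3 ..."
};
foreach (var sampleLine in sampleLines)
{
    if (TryParseEntry(sampleLine, out int id, out string text))
        Console.WriteLine("a=" + id + "; add=" + text);
}
```

In the real program, sampleLines would be replaced by the array returned from File.ReadAllLines.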

Efficient and fast way to parse a string with different languages

I have a string something like (generated via Google Transliterate REST call, and transliterated into 2 languages):
" This world is beautiful and थिस वर्ल्ड इस बेऔतिफुल एंड
থিস বর্ল্ড ইস বিয়াউতিফুল আন্দ amazingly mysterious
अमज़िन्ग्ली म्य्स्तेरिऔस আমাজিন্গ্লি ম্য্স্তেরীয়ুস "
Now Google Transliterate REST call allows FIVE words at a time, so I had to loop, add it to the list and then concatenate the string. That's why we see that each CHUNK (of each language) is of 5 words. The total number of words is 7 words, so first 5 (This world is beautiful and) lies before rest 2 (amazingly mysterious) later.
How do I most efficiently parse the sentence such that I get something like:
This world is beautiful and amazingly mysterious थिस वर्ल्ड इस बेऔतिफुल एंड अमज़िन्ग्ली म्य्स्तेरिऔस থিস বর্ল্ড ইস বিয়াউতিফুল আন্দ আমাজিন্গ্লি ম্য্স্তেরীয়ুস
Since the length of the sentence and the number of languages it can be converted into can be dynamic, maybe using a list for each language could work, with the lists concatenated later?
I used an approach where I transliterated each word, one at a time, it works well, but too slow as it increases the number of calls to the API.
Can someone help me with an efficient (and dynamic) implementation of such a scenario? Thanks a bunch!
One list per language is the way to go.
If by different languages you mean different character code ranges, you can use this answer here:
Regular expression Spanish and Arabic words
Pay for google translate's API and then your length restriction goes up to 5,000 characters per request https://developers.google.com/translate/v2/faq
Also, yes, as Daniel has said - grouping the text by language will be necessary
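A rough sketch of the one-list-per-language idea, assuming the language of a word can be inferred from the Unicode block of its first character (Devanagari is U+0900 to U+097F and Bengali is U+0980 to U+09FF; everything else is lumped together as "Other"). ScriptOf and RegroupByScript are hypothetical helper names, and other target languages would need their own ranges. Written as a C# 9 top-level program.

```csharp
using System;
using System.Collections.Generic;

// Classifies a word by the Unicode block of its first character.
static string ScriptOf(string word)
{
    char c = word[0];
    if (c >= '\u0900' && c <= '\u097F') return "Devanagari";
    if (c >= '\u0980' && c <= '\u09FF') return "Bengali";
    return "Other"; // Latin text in the example
}

// Collects the words of each script in order, then joins the groups
// in the order in which each script first appeared.
static string RegroupByScript(string text)
{
    var order = new List<string>();
    var wordsByScript = new Dictionary<string, List<string>>();
    var separators = new[] { ' ', '\t', '\r', '\n' };
    foreach (var word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
    {
        string script = ScriptOf(word);
        if (!wordsByScript.TryGetValue(script, out var words))
        {
            words = new List<string>();
            wordsByScript[script] = words;
            order.Add(script);
        }
        words.Add(word);
    }

    var parts = new List<string>();
    foreach (var name in order)
        parts.Add(string.Join(" ", wordsByScript[name]));
    return string.Join(" ", parts);
}

Console.WriteLine(RegroupByScript("This world थिस वर्ल्ड থিস বর্ল্ড amazingly अमज़िन्ग्ली আমাজিন্গ্লি"));
```

This makes a single pass over the concatenated string, so the five-word batches can still be sent to the API as before and regrouped afterwards.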
I have tried to work this out; correct me if I have misinterpreted your question.
string statement = "This world is beautiful and थिस वर्ल्ड इस बेऔतिफुल एंड থিস বর্ল্ড ইস বিয়াউতিফুল আন্দ amazingly mysterious अमज़िन्ग्ली म्य्स्तेरिऔस আমাজিন্গ্লি ম্য্স্তেরীয়ুস ";
string otherLangStmt = statement;
MatchCollection matchCollection = Regex.Matches(statement, "([a-zA-Z]+)");
string result = "";
foreach (Match match in matchCollection)
{
if (match.Groups.Count > 0)
{
result += match.Groups[0].Value + " ";
otherLangStmt = otherLangStmt.Replace(match.Groups[0].Value, string.Empty);
}
}
otherLangStmt = Regex.Replace(otherLangStmt.Trim(), @"\s+", " "); // collapse the gaps left by the removed words into single spaces
Console.WriteLine(result);
Console.WriteLine(otherLangStmt);
