I have the following string
"98225-2077 Bellingham WA"
I need to use Regex to separate Zip Code, City and State.
The groups should return
(98225-2077)(Bellingham) and (WA).
The state is optional; when present, it is always at the end and consists of two uppercase characters.
I am able to filter out the following using regex
Zip Code : (^([\S]+-)?\d+(-\d+)?) - Group[1]
City: ((^([\S]+-)?\d+(-\d+)?)\s)?(\S.*) = Group[5].
Can a single regex extract all three parts and return a blank for the state when it is not there?
I would opt for just splitting the string on spaces and then using the various parts as you need. Because your city name may consist of multiple words, I iterate from the second to the next-to-last element to build the city name. This solution assumes that the zip code and the two-letter state abbreviation are always single words, and that the state is present (otherwise the last word of the city is reported as the state).
string address = "98225-2077 Bellingham WA";
string[] tokens = address.Split(' ');
string city = "";
for (int i=1; i < tokens.Length-1; i++)
{
if (i > 1)
{
city += " ";
}
city += tokens[i];
}
Console.WriteLine("zip code: {0}", tokens[0]);
Console.WriteLine("city: {0}", city);
Console.WriteLine("state: {0}", tokens[tokens.Length-1]);
Easy!
^([\d-]+)\s+(.+?)\s*([A-Z]{2})?$
https://regex101.com/r/tL4tN5/1
Explanation:
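^([\d-]+): ^ anchors the match at the very beginning of the string; [\d-]+ matches one or more digits or hyphens (the ZIP code).
\s+(.+?)\s*: captures everything between the ZIP code and the state, as few characters as possible.
([A-Z]{2})?$: {2} means exactly 2 characters from the range [A-Z]; ? means the group occurs 1 or 0 times; $ is the end of the string.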
This will also work
^(\d[\d-]+)\s+(.*?)(?:\s+([A-Z]{2}))?$
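The two patterns above differ in how the state is attached: with \s* the state group can take the last two capitals of a single-word city, while the (?:\s+...)? form requires whitespace before the state. A minimal sketch of that difference; the class name ZipPatternComparison, the helper Describe and the input "98225 NASA" are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class ZipPatternComparison
{
    // Pattern from the first regex answer: optional whitespace before the state.
    public const string Loose = @"^([\d-]+)\s+(.+?)\s*([A-Z]{2})?$";

    // Pattern from the second regex answer: the state must follow whitespace.
    public const string Strict = @"^(\d[\d-]+)\s+(.*?)(?:\s+([A-Z]{2}))?$";

    // Returns "zip|city|state"; an unmatched optional group yields an empty string.
    public static string Describe(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        if (!m.Success) return "no match";
        return m.Groups[1].Value + "|" + m.Groups[2].Value + "|" + m.Groups[3].Value;
    }
}
```

Both patterns give 98225-2077|Bellingham|WA for the string from the question. For "98225 NASA", Describe(Loose, ...) gives 98225|NA|SA, whereas Describe(Strict, ...) gives 98225|NASA| with a blank state.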
I really think you can do it without a regex. Here are two solutions:
Non-regex solution:
/// <summary>
/// Split address into ZIP, Description/Street/anything, [A-Z]{2} state
/// </summary>
/// <returns>null if no space is found</returns>
public static List<string> SplitZipAnyStateAddress(this string s)
{
if (!s.Contains(' ')) return null;
var zip = s.Substring(0, s.IndexOf(' '));
var state = s.Substring(s.LastIndexOf(' ') + 1);
var middle = s.Substring(zip.Length + 1, s.Length - state.Length - zip.Length - 2);
return state.Length == 2 && state.All(p => Char.IsUpper(p)) ?
new List<string>() { zip, middle, state } :
new List<string>() { zip, string.Format("{0} {1}", middle, state) };
}
Results:
StringRegUtils.SplitZipAnyStateAddress("98225-2077 Bellingham WA");
// => [0] 98225-2077 [1] Bellingham [2] WA
StringRegUtils.SplitZipAnyStateAddress("98225-2077 Bellin gham");
// => [0] 98225-2077 [1] Bellin gham
StringRegUtils.SplitZipAnyStateAddress("98225-2077 New Delhi CA");
// => [0] 98225-2077 [1] New Delhi [2] CA
REGEX
If not, you can use my initial regex suggestion (I think a ? got lost):
^(?<zip>\d+-\d+)\s+(?<city>.*?)(?:\s+(?<state>[A-Z]{2}))?$
See the regex demo
Details:
^ - start of string
(?<zip>\d+-\d+) - 1+ digits followed with - followed with 1+ digits
\s+ - 1+ whitespaces
(?<city>.*?) - 0+ characters other than a newline, as few as possible, up to the optional group that follows
(?:\s+(?<state>[A-Z]{2}))? - optional (1 or 0) occurrences of:
\s+ - 1+ whitespaces
(?<state>[A-Z]{2}) - exactly 2 uppercase ASCII letters
$ - end of string
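As a rough sketch of how the named groups can be read in C#, including the blank state the question asks for (the class name AddressParser and the method Parse are hypothetical; an unmatched group's Value is an empty string in .NET):

```csharp
using System;
using System.Text.RegularExpressions;

public static class AddressParser
{
    private static readonly Regex AddressRegex = new Regex(
        @"^(?<zip>\d+-\d+)\s+(?<city>.*?)(?:\s+(?<state>[A-Z]{2}))?$");

    // Returns { zip, city, state }, or null when the input does not match.
    // The state element is "" when the optional state group did not participate.
    public static string[] Parse(string input)
    {
        Match m = AddressRegex.Match(input);
        if (!m.Success) return null;
        return new[]
        {
            m.Groups["zip"].Value,
            m.Groups["city"].Value,
            m.Groups["state"].Value
        };
    }
}
```

Parse("98225-2077 New Delhi") returns a blank third element, and an input whose ZIP has no hyphen (such as "98225 Bellingham") returns null because this pattern requires digits-hyphen-digits.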
Related
I have this string "1: This 2: Is 3: A 4: Test" and would like to split it based on the numbering, like this:
"1: This"
"2: Is"
"3: A"
"4: Test"
I think this should be possible with a regular expression, but unfortunately I don't understand much about it.
My attempt, string[] result = Regex.Split(input, @"\D+");, returns only the numbers, without the colon and the content behind it.
You can use
string[] result = Regex.Split(text, @"(?!^)(?=(?<!\d)\d+:)");
See this regex demo. Note that the (?<!\d) negative lookbehind is necessary when you have bullet point with two or more digits. Details:
(?!^) - not at the start of string
(?=(?<!\d)\d+:) - the position that is immediately followed with one or more digits (not preceded with any digit) and a : char.
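A small sketch of the split described above. Because the split positions are zero-width, each piece keeps the space that preceded the next number, so this version also trims the parts; the class name NumberedSplitter and the trimming step are additions for illustration, not part of the answer:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NumberedSplitter
{
    // Splits before every "digits:" marker except one at the very start,
    // then trims the whitespace left at the end of each piece.
    public static string[] Split(string text)
    {
        return Regex.Split(text, @"(?!^)(?=(?<!\d)\d+:)")
                    .Select(part => part.Trim())
                    .ToArray();
    }
}
```

Split("1: This 2: Is 3: A 4: Test") yields the four items from the question, and Split("10: Ten 11: Eleven") yields two items, since the (?<!\d) lookbehind prevents a split between the two digits of a number.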
If you use a capture group () like this:
string[] result = Regex.Split(str, @"(\d+:)");
the captured values will be added to the array too. Then all that is left to do is to merge every first value with every second value (we skip index 0 as it is empty):
List<string> values = new();
for (int i = 1; i < result.Length; i += 2)
{
values.Add(result[i] + result[i + 1]);
}
There are probably cleaner ways to do this, but this works.
Using \D+ matches 1 or more non digits, and will therefore match : This to split on.
Instead of using split, you can also match the parts:
\b[0-9]+:.*?(?=\b[0-9]+:|$)
The pattern matches:
\b A word boundary to prevent a partial word match
[0-9]+: Match 1+ digits and :
.*? Match as few characters as possible
(?=\b[0-9]+:|$) Positive lookahead, assert either 1+ digits and : or the end of the string to the right
.NET regex demo
Example in C#:
string str = "1: This 2: Is 3: A 4: Test";
string pattern = @"\b[0-9]+:.*?(?=\b[0-9]+:|$)";
MatchCollection matchList = Regex.Matches(str, pattern);
string[] result = matchList.Cast<Match>().Select(match => match.Value).ToArray();
Array.ForEach(result, Console.WriteLine);
Output
1: This
2: Is
3: A
4: Test
Split by space then take each second item. Because if you define the word as something delimited by (white)space, '1.' or '2.' are words too, and you aren't able to distinguish them.
string[] split = content.Split(' ', StringSplitOptions.None);
string[] result = new string[split.Length / 2];
for (int i = 1; i < split.Length; i = i + 2) result[i / 2] = split[i];
I'm struggling to create a regex that finds all placeholder occurrences in a given text. Placeholders will have the following format:
[{PRE.Word1.Word2}]
Rules:
Delimited by "[{PRE." and "}]" ("PRE" upper case)
2 words (at least 1 char long each) separated by a dot. All chars valid on each word apart from newline.
word1: min 1 char, max 15 chars
word2: min 1 char, max 64 chars
word1 cannot have dots, if there are more than 2 dots inside placeholder extra ones will be part of word2. If less than 2 dots, placeholder is invalid.
Looking to get all valid placeholders regardless of what the 2 words are.
I'm not being lazy; I just spent a horrible amount of time building the rule on regexr.com, but was unable to satisfy all these rules.
Looking forward to checking your suggestions.
The closest I've got to was the below, and any attempt to expand on that breaks all valid matches.
\[\{OEP\.*\.*\}\]
Much appreciated!
Sample text where Regex should find matches:
Random text here
[{Test}] -- NO MATCH
[{PRE.TestTest3}] --NO MATCH
[{PRE.TooLong.12345678901234567890}] --NO MATCH
[{PRE.Address.Country}] --MATCH
[{PRE.Version.1.0}] --MATCH
Random text here
You can use
\[{PRE\.([^][{}.]{1,15})\.(.{1,64}?)}]
See the regex demo
Details
\[{ - a [{ string
PRE\. - PRE. text
([^][{}.]{1,15}) - Group 1: any one to fifteen chars other than [, ], {, } and .
\. - a dot
(.{1,64}?) - any one to 64 chars other than line break chars as few as possible
}] - a }] text.
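To see how the length limits and the dot rule behave, here is a minimal sketch that collects both words of every match. The class name PlaceholderFinder, the "word1|word2" return format and the sample inputs mentioned below are illustrative assumptions:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PlaceholderFinder
{
    private static readonly Regex PlaceholderRegex =
        new Regex(@"\[{PRE\.([^][{}.]{1,15})\.(.{1,64}?)}]");

    // Returns "word1|word2" for every valid placeholder found in the text.
    public static List<string> Find(string text)
    {
        var found = new List<string>();
        foreach (Match m in PlaceholderRegex.Matches(text))
        {
            found.Add(m.Groups[1].Value + "|" + m.Groups[2].Value);
        }
        return found;
    }
}
```

For "[{Test}] [{PRE.TestTest3}] [{PRE.Address.Country}] [{PRE.Version.1.0}]" this returns Address|Country and Version|1.0: the extra dot ends up in the second word. A first word of 16 characters, as in "[{PRE.ABCDEFGHIJKLMNOP.x}]", produces no match.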
If you need to get all matches in C#, you can use
var pattern = @"\[{PRE\.([^][{}.]{1,15})\.(.{1,64}?)}]";
var matches = Regex.Matches(text, pattern);
See this C# demo:
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
public class Test
{
public static void Main()
{
var text = "[{PRE.Word1.Word2}] and [{PRE.Word 3.Word..... 2 %%%}]";
var pattern = @"\[{PRE\.([^][{}.]{1,15})\.(.{1,64}?)}]";
var matches = Regex.Matches(text, pattern);
var props = new List<Property>();
foreach (Match m in matches)
props.Add(new Property(m.Groups[1].Value,m.Groups[2].Value));
foreach (var item in props)
Console.WriteLine("Word1 = " + item.Word1 + ", Word2 = " + item.Word2);
}
public class Property
{
public string Word1 { get; set; }
public string Word2 { get; set; }
public Property()
{}
public Property(string w1, string w2)
{
this.Word1 = w1;
this.Word2 = w2;
}
}
}
Output:
Word1 = Word1, Word2 = Word2
Word1 = Word 3, Word2 = Word..... 2 %%%
string input = "[{PRE.Word1.Word2}]";
// language=regex
string pattern = @"\[{ PRE \. (?'group1' .{1,15}? ) \. (?'group2' .{1,64}? ) }]";
var match = Regex.Match(input, pattern, RegexOptions.IgnorePatternWhitespace);
Console.WriteLine(match.Groups["group1"].Value);
Console.WriteLine(match.Groups["group2"].Value);
There are tons of posts regarding how to capitalize the first letter with C#, but I am specifically struggling with how to do this while ignoring prefixed non-letter characters and tags. E.g.,
<style=blah>capitalize the word, 'capitalize'</style>
How can I ignore potential <> tags (or non-letter chars before the text, like an asterisk *) and the contents within them, and THEN capitalize "capitalize"?
I tried:
public static string CapitalizeFirstCharToUpperRegex(string str)
{
// Check for empty string.
if (string.IsNullOrEmpty(str))
return string.Empty;
// Return char and concat substring.
// Start @ first char, no matter what (avoid <tags>, etc)
string pattern = @"(^.*?)([a-z])(.+)";
// Extract middle, then upper 1st char
string middleUpperFirst = Regex.Replace(str, pattern, "$2");
middleUpperFirst = CapitalizeFirstCharToUpper(str); // Works
// Inject the middle back in
string final = $"$1{middleUpperFirst}$3";
return Regex.Replace(str, pattern, final);
}
EDIT:
Input: <style=foo>first non-tagged word 1st char upper</style>
Expected output: <style=foo>First non-tagged word 1st char upper</style>
You may use
<[^<>]*>|(?<!\p{L})(\p{L})(\p{L}*)
The regex does the following:
<[^<>]*> - matches <, any 0+ chars other than < and > and then >
| - or
(?<!\p{L}) - finds a position not immediately preceded with a letter
(\p{L}) - captures into Group 1 any letter
(\p{L}*) - captures into Group 2 any 0+ letters (that is necessary if you want to lowercase the rest of the word).
Then, check if Group 1 matched, and if yes, capitalize the first group value and lowercase the second one, else, return the whole value:
var result = Regex.Replace(s, @"<[^<>]*>|(?<!\p{L})(\p{L})(\p{L}*)", m =>
m.Groups[1].Success ?
m.Groups[1].Value.ToUpper() + m.Groups[2].Value.ToLower() :
m.Value);
If you do not need to lowercase the rest of the word, remove the second group and the code related to it:
var result = Regex.Replace(s, @"<[^<>]*>|(?<!\p{L})(\p{L})", m =>
m.Groups[1].Success ?
m.Groups[1].Value.ToUpper() : m.Value);
To only replace the first occurrence using this approach, you need to set a flag and reverse it once the first match is found:
var s = "<style=foo>first non-tagged word 1st char upper</style>";
var found = false;
var result = Regex.Replace(s, @"<[^<>]*>|(?<!\p{L})(\p{L})", m => {
if (m.Groups[1].Success && !found) {
found = !found;
return m.Groups[1].Value.ToUpper();
} else {
return m.Value;
}
});
Console.WriteLine(result); // => <style=foo>First non-tagged word 1st char upper</style>
See the C# demo.
Using a lookbehind, you can match the first word that follows the closing > of a tag (without including the > itself) and then capitalize the matched text.
The regex is the following:
(?<=<.*>)\w+
With Regex.Match, it returns the first word after the > angle bracket.
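One possible way to apply this lookbehind pattern is to pass a count of 1 to the instance Regex.Replace overload, so that only the first word after a tag is changed. This is a sketch only; the class name TagAwareCapitalizer is made up, and, like the pattern itself, it does nothing for input that has no tag in front of the text (for example a leading asterisk):

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagAwareCapitalizer
{
    // Lookbehind pattern from the answer above: a word preceded by "<...>".
    private static readonly Regex FirstWordAfterTag = new Regex(@"(?<=<.*>)\w+");

    public static string CapitalizeFirstWord(string input)
    {
        // The count argument (1) limits the replacement to the first match.
        return FirstWordAfterTag.Replace(
            input,
            m => char.ToUpper(m.Value[0]) + m.Value.Substring(1),
            1);
    }
}
```

For the input from the question's edit, CapitalizeFirstWord returns <style=foo>First non-tagged word 1st char upper</style>.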
I am reading values from an NSDictionary, which I iterate like a dictionary in the following method:
string NSDictionaryConverter(string name)
{
foreach (var a in str)
{
if (a.Key.Description.Equals(name))
{
result = a.Value.ToString();
}
Console.WriteLine(str.Keys);
}
return result;
}
That way I take whatever I need.
Why do I use a dictionary? The dictionary contains all the information about an annotation on the map.
The key FormattedAddressLines contains, for example:
FormattedAddressLines = (
"ZIP City Name",
Country
);
The value I have problems with is the address, because it contains a lot of details. I need all of them displayed nicely on the screen.
Namely, I need to remove the ", ( and ) chars, and line breaks with whitespace before punctuation.
After the regex it still looks messy:
string address = NSDictionaryConverter("FormattedAddressLines");
string city = NSDictionaryConverter("City");
string zip = NSDictionaryConverter("ZIP");
string country = NSDictionaryConverter("Country");
address = Regex.Replace(address, @"([()""])+", "");
fullAddress = address + ", " + city + ", " + zip + ", " + country;
addressLabel.Text = fullAddress;
How could I make it look like this:
Full Address value, - new line
XXX, - new line
XXX, - new Line
... - new line
N value - new line
It seems you need to remove specific special characters and whitespace before punctuation.
You need to add a \s*(?:\r?\n|\r)+\s*(?=\p{P}) alternative to your regex (the part after the |):
Regex.Replace(address, @"[()""]+|\s*(?:\r?\n|\r)+\s*(?=\p{P})", "")
The \s* matches 0+ whitespaces, (?:\r?\n|\r)+ matches 1 or more line breaks and \s*(?=\p{P}) matches 0+ whitespaces that are followed with a punctuation symbol. It might be necessary to replace \p{P} with [\p{P}\p{S}] if you also want to include symbols.
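A small sketch of what this replacement does to a value shaped like the FormattedAddressLines example in the question. The class name AddressCleaner is hypothetical, and the exact whitespace in the real NSDictionary description is an assumption:

```csharp
using System;
using System.Text.RegularExpressions;

public static class AddressCleaner
{
    // Removes (, ) and " characters, plus any line break (with surrounding
    // whitespace) that is immediately followed by a punctuation character.
    public static string Clean(string address)
    {
        return Regex.Replace(address, @"[()""]+|\s*(?:\r?\n|\r)+\s*(?=\p{P})", "");
    }
}
```

With the input "(\n    \"ZIP City Name\",\n    Country\n)", Clean returns "ZIP City Name,\n    Country": the line break before Country stays, because the next non-whitespace character is a letter, not punctuation.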
Hi, I've been fooling around with this for a while and figured it was time to ask for help ...
I'm trying to return all sequences of capital letters (phrases with no numeric or special chars) longer than 5 characters from a wacky string.
so for:
02/02/12-02:45 PM(CKI)-DISC RSPNS SRVD 01/31/12-PRINTED DISCOVERY:spina.bp.doc(DGB)
01/27/12-ON CAL-FILED NOTICE OF TRIAL(JCX) 01/24/12-SENT OUR DEMANDS(Auto-Gen) 01/23/12-
02:31 PM-File pulled and given to KG for responses.(JLS) 01/20/12(PC)-rcd df jmt af
I would want to return a list of
DISC RSPNS SRVD
PRINTED DISCOVERY
FILED NOTICE OF TRIAL
SENT OUR DEMANDS
I've been fooling around with variations of the following:
[A-Z][A-Z\d]+
[A-Z][A-Z\d]+ [A-Z][A-Z\d]+
however this is a little outside my scope of knowledge with Regex.
Edit
I'm trying
string[] capWords = Regex.Split(d.caption, @"[A-Z\s]{5,}");
foreach (var u in capWords) { Console.WriteLine(u); }
Outputting:
02/02/12-02:45 PM(CKI)-
01/31/12-
:spina.bp.doc(DGB) 01/27/12-
(JCX) 01/24/12-
(Auto-Gen) 01/23/12-02:31 PM-File pulled and given to KG for responses.(JLS) 01/20/12(PC)-rcd df jmt af
Kendall's Suggestion Outputs:
02/02/12-02:45 PM(CKI)-
01/31/12-
:spina.bp.doc(DGB) 01/27/12-
(JCX) 01/24/12-
(Auto-Gen) 01/23/12-02:31 PM-File pulled and given to KG for responses.(JLS) 01/20/12(PC)-rcd df jmt af
Here you go:
[A-Z\s]{5,}
Tested and returns only the items you listed.
Explanation:
[A-Z\s] - matches only capital letters and spaces
{5,} - matches must be at least 5 characters, with no upper limit on number of characters
Code:
MatchCollection matches = Regex.Matches(d.caption, @"[A-Z\s]{5,}");
foreach (Match match in matches)
{
Console.WriteLine(match.Value);
}
Try this. I am assuming you want leading/trailing spaces stripped.
[A-Z][A-Z ]{4,}[A-Z]
Also, I don't think you want Regex.Split.
var matches = Regex.Matches(d.caption, @"[A-Z][A-Z ]{4,}[A-Z]");
foreach (Match match in matches)
{
Console.WriteLine(match.Value);
}
You could also do:
var matches = Regex.Matches(d.caption, @"[A-Z][A-Z ]{4,}[A-Z]")
.OfType<Match>()
.Select(m => m.Value);
foreach (string match in matches)
{
Console.WriteLine(match);
}
You asked for a single regex solution, but with the given criteria and examples I could not get a single regex to count the characters of a string while ignoring a certain character type (spaces). It failed on character groups like ON CAL, which should not match but passed because of the total character count.
So, to make sure that only character groups with more than 5 uppercase characters matched, I had to use two regex expressions. This was a little cumbersome, and I was able to do it faster and much more simply with string methods.
This might work with a single regex if you could state some certainties about the formatting of the source text, for example if we knew that the character groups you are looking for are always preceded by a dash and terminated by a punctuation mark that is not a dash, or by a number.
5 PM( --- FAIL (not preceded by a dash)
(CKI) --- FAIL (not preceded by a dash)
-DISC RSPNS SRVD 0 --- PASS
-PRINTED DISCOVERY: --- PASS
-ON CAL- --- FAIL (terminated by a dash)
-FILED NOTICE OF TRIAL( --- PASS
-SENT OUR DEMANDS( --- PASS
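If those formatting certainties did hold, a single regex along the following lines might be enough. This is only a sketch under the assumptions listed above (a dash before the phrase; a digit or a non-dash punctuation mark after it); the pattern, the class name CaptionPhrases and the uppercase-count filter are not from the original post:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CaptionPhrases
{
    // (?<=-)                      the phrase must directly follow a dash
    // [A-Z]+(?: [A-Z]+)*          uppercase words separated by single spaces
    // (?= ?\d|[^\sA-Za-z\d-])     then an optional space and a digit, or a
    //                             punctuation/symbol char that is not a dash
    private static readonly Regex PhraseRegex =
        new Regex(@"(?<=-)[A-Z]+(?: [A-Z]+)*(?= ?\d|[^\sA-Za-z\d-])");

    public static List<string> Extract(string caption)
    {
        return PhraseRegex.Matches(caption)
                          .Cast<Match>()
                          .Select(m => m.Value)
                          .Where(v => v.Count(char.IsUpper) > 5) // more than 5 capitals
                          .ToList();
    }
}
```

On the caption from the question this returns DISC RSPNS SRVD, PRINTED DISCOVERY, FILED NOTICE OF TRIAL and SENT OUR DEMANDS. ON CAL is skipped because it is terminated by a dash. A phrase at the very end of the input would be missed, since the lookahead needs a following character.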
Barring that, I have included the code that will get you your results in one of two ways. I prefer the second.
String source1 = "02/02/12-02:45 PM(CKI)-DISC RSPNS SRVD 01/31/12-PRINTED "
+ "DISCOVERY:spina.bp.doc(DGB) 01/27/12-ON CAL-FILED NOTICE OF TRIAL(JCX) 01/24/12-SENT "
+ "OUR DEMANDS(Auto-Gen) 01/23/12- 02:31 PM-File pulled and given to KG for responses.(JLS) 01/20/12(PC)-rcd df jmt af ";
String assembledString;
public void bumbleBeeTunaTest()
{
String strippedString = source1.Replace(" ", "");
String regString1 = "";
String regString2 = @"([A-Z]{6,})";
String matchHold1,matchHold1First,matchHold1Last,matchHold1Middle;
Int32 matchHold1Len;
Regex regExTwo = new Regex(regString2);
MatchCollection regMatch2 = regExTwo.Matches(strippedString);
foreach (Match match2 in regMatch2)
{
matchHold1 = match2.Groups[1].Value;
matchHold1Len = matchHold1.Length;
matchHold1First = matchHold1.Substring(0,1);
matchHold1Last = matchHold1.Substring(matchHold1Len - 1,1);
matchHold1Middle = matchHold1.Substring(1, matchHold1Len - 2);
Debug.Print("Stripped String Matches - " + matchHold1);
regString1 = @"(" + matchHold1First + "[" + matchHold1Middle + " ]{" + (matchHold1Len - 1) + ",}" + matchHold1Last + ")";
Regex regExOne = new Regex(regString1);
MatchCollection regMatch1 = regExOne.Matches(source1);
foreach (Match match1 in regMatch1)
{
Debug.Print("Re-Assembled Matches :" + match1.Groups[1].Value.ToString());
}
}
// Does the same thing as the above. Just a little simpler.
for (int i = 0; i < source1.Length; i++)
{
if (char.IsUpper(source1[i]) || char.IsWhiteSpace(source1[i]))
{
assembledString += source1[i];
}
else
{
if (!string.IsNullOrEmpty(assembledString))
{
if (assembledString.Count(char.IsUpper) > 5)
{
Debug.Print("Non Reg Ex Version " + assembledString);
}
assembledString = "";
}
}
}
}
The output looks like this.
Stripped String Matches - DISCRSPNSSRVD
Re-Assembled Matches :DISC RSPNS SRVD
Stripped String Matches - PRINTEDDISCOVERY
Re-Assembled Matches :PRINTED DISCOVERY
Stripped String Matches - FILEDNOTICEOFTRIAL
Re-Assembled Matches :FILED NOTICE OF TRIAL
Stripped String Matches - SENTOURDEMANDS
Re-Assembled Matches :SENT OUR DEMANDS
Non Reg Ex Version DISC RSPNS SRVD
Non Reg Ex Version PRINTED DISCOVERY
Non Reg Ex Version FILED NOTICE OF TRIAL
Non Reg Ex Version SENT OUR DEMANDS