Excluding duplicates from a string in results - c#

I am trying to amend this regex so that it does not match duplicates.
Current regex:
[\""].+?[\""]|[^ ]+
Sample string:
".doc" "test.xls", ".doc","me.pdf", "test file.doc"
Expected results:
".doc"
"test.xls"
"me.pdf"
But not
".doc"
"test.xls"
".doc"
"me.pdf"
Note:
Filenames could potentially have spaces e.g. test file.doc
items could be separated by a space or a comma or both
strings could have quotes around or NOT have quotes around e.g. .doc or ".doc".

In C#, you may use a simple regex to extract all valid matches and use .Distinct() to only keep unique values.
The regex is simple:
"(?<ext>[^"]+)"|(?<ext>[^\s,]+)
See the regex demo, you only need Group "ext" values.
Details
"(?<ext>[^"]+)" - ", (group "ext") any 1+ chars other than " and then "
| - or
(?<ext>[^\s,]+) - (group "ext") 1+ chars other than whitespace and comma
The C# code snippet:
var text = "\".doc\" \"test.xls\", \".doc\",\"me.pdf\", \"test file.doc\".doc \".doc\"";
Console.WriteLine(text); // => ".doc" "test.xls", ".doc","me.pdf", "test file.doc".doc ".doc"
var pattern = "\"(?<ext>[^\"]+)\"|(?<ext>[^\\s,]+)";
var results = Regex.Matches(text, pattern)
    .Cast<Match>()
    .Select(x => x.Groups["ext"].Value)
    .Distinct();
Console.WriteLine(string.Join("\n", results));
Output:
.doc
test.xls
me.pdf
test file.doc
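If .DOC and .doc should also count as the same item, the dedup step can take a comparer. This is a minimal sketch, assuming a hypothetical helper named UniqueItems with an ignoreCase flag (neither is part of the original answer). It uses a HashSet instead of .Distinct() so that the first-seen order is explicit:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ExtensionList
{
    // Quoted items may contain spaces; unquoted items end at whitespace or a comma.
    private const string Pattern = "\"(?<ext>[^\"]+)\"|(?<ext>[^\\s,]+)";

    // Returns the unique items in first-seen order.
    // ignoreCase is a hypothetical option, not something the question asked for.
    public static List<string> UniqueItems(string text, bool ignoreCase = false)
    {
        StringComparer comparer = ignoreCase
            ? StringComparer.OrdinalIgnoreCase
            : StringComparer.Ordinal;
        var seen = new HashSet<string>(comparer);
        var result = new List<string>();

        foreach (Match m in Regex.Matches(text, Pattern))
        {
            string value = m.Groups["ext"].Value;
            // HashSet.Add returns false when the value is already present.
            if (seen.Add(value))
            {
                result.Add(value);
            }
        }
        return result;
    }
}
```

Calling ExtensionList.UniqueItems(text, ignoreCase: true) keeps the first spelling it meets and drops later ones that differ only by case.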

Related

Regex split preserving strings and escape character

I need to split a string in C#, using a space as the delimiter and keeping quoted parts together. This part is OK.
But additionally, I want to support the escape sequence \" so that quotes can be included inside a quoted string.
Example of what I need:
One Two "Three Four" "Five \"Six\""
To:
One
Two
Three Four
Five "Six"
This is the regex I am currently using, it is working for all the cases except "Five \"Six\""
//Split on spaces unless in quotes
List<string> matches = Regex.Matches(input, @"[\""].+?[\""]|[^ ]+")
    .Cast<Match>()
    .Select(x => x.Value.Trim('"'))
    .ToList();
I'm looking for any Regex, that would do the trick.
You can use
var input = "One Two \"Three Four\" \"Five \\\"Six\\\"\"";
// Console.WriteLine(input); // => One Two "Three Four" "Five \"Six\""
List<string> matches = Regex.Matches(input, @"(?s)""(?<r>[^""\\]*(?:\\.[^""\\]*)*)""|(?<r>\S+)")
    .Cast<Match>()
    .Select(x => Regex.Replace(x.Groups["r"].Value, @"\\(.)", "$1"))
    .ToList();
foreach (var s in matches)
    Console.WriteLine(s);
See the C# demo.
The result is
One
Two
Three Four
Five "Six"
The (?s)"(?<r>[^"\\]*(?:\\.[^"\\]*)*)"|(?<r>\S+) regex matches
(?s) - a RegexOptions.Singleline equivalent to make . match newlines, too
"(?<r>[^"\\]*(?:\\.[^"\\]*)*)" - ", then Group "r" capturing any zero or more chars other than " and \ and then zero or more sequences of any escaped char and zero or more chars other than " and \, and then a " is matched
| - or
(?<r>\S+) - Group "r": one or more non-whitespace chars.
The .Select(x => Regex.Replace(x.Groups["r"].Value, @"\\(.)", "$1")) takes the Group "r" value and unescapes (deletes a \ before) all escaped chars.
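For comparison, the same splitting rules can be written as a small hand-rolled scanner instead of a regex. This is only a sketch of an alternative technique, not the method used above; the class and method names are hypothetical. It differs from the regex in one respect: a quote in the middle of an unquoted word starts a quoted run rather than being kept literally.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QuoteAwareSplitter
{
    // Splits on whitespace, keeps quoted runs together,
    // and treats \x inside quotes as a literal x.
    public static List<string> Split(string input)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;
        bool hasToken = false; // true once the current token has started

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (inQuotes)
            {
                if (c == '\\' && i + 1 < input.Length)
                {
                    current.Append(input[++i]); // drop the backslash, keep the escaped char
                }
                else if (c == '"')
                {
                    inQuotes = false; // closing quote
                }
                else
                {
                    current.Append(c);
                }
            }
            else if (c == '"')
            {
                inQuotes = true; // opening quote
                hasToken = true;
            }
            else if (char.IsWhiteSpace(c))
            {
                if (hasToken)
                {
                    tokens.Add(current.ToString());
                    current.Clear();
                    hasToken = false;
                }
            }
            else
            {
                current.Append(c);
                hasToken = true;
            }
        }

        if (hasToken)
        {
            tokens.Add(current.ToString()); // flush the last token
        }
        return tokens;
    }
}
```

For the sample input, QuoteAwareSplitter.Split(input) yields the same four items as the regex version: One, Two, Three Four, and Five "Six".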

Regex only letters except set of numbers

I'm using Replace(@"[^a-zA-Z]+", "");
to leave only letters, but there are some digit sequences that I want to keep as well, e.g. 122456 and 112466. I'm having trouble keeping digits only when they form one of these sequences:
ex input:
abc 1239 asm122456000
I want:
abcasm122456
I tried this: ([^a-zA-Z])+|(?!122456)
My answer doesn't use Replace(), but achieves a similar result:
(?:[a-zA-Z]+|\d{6})
which matches either a run of letters or a run of exactly six digits (the group is non-capturing). Note that \d{6} accepts any six digits, not only 122456 or 112466.
Regex 101 & Test Result
Join all the matching values into a single string.
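using System.Linq;
using System.Text.RegularExpressions;
Regex regex = new Regex("(?:[a-zA-Z]+|\\d{6})");
string input = "abc 1239 asm122456000";
// Cast<Match>() keeps this working on .NET Framework, where MatchCollection is non-generic.
// string.Join returns "" when there are no matches, so no Count check is needed.
string output = string.Join("", regex.Matches(input).Cast<Match>().Select(x => x.Value));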
Sample .NET Fiddle
Alternate way,
using .Split() and .All(),
string input = "abc 1239 asm122456000";
string output = string.Join("", input.Split().Where(x => !x.All(char.IsDigit)));
.NET Fiddle
It is very simple: you need to match and capture what you need to keep, and just match what you need to remove, and then utilize a backreference to the captured group value in the replacement pattern to put it back into the resulting string.
Here is the regex:
(122456|112466)|[^a-zA-Z]
See the regex demo. Details:
(122456|112466) - Capturing group with ID 1: either of the two alternatives
| - or
[^a-zA-Z] - a char other than an ASCII letter (use \P{L} if you need to match any char other than any Unicode letter).
Note the removed + quantifier as [^A-Za-z] also matches digits.
You need to use $1 in the replacement:
var result = Regex.Replace(text, @"(122456|112466)|[^a-zA-Z]", "$1");
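The same capture-and-put-back idea can be wrapped so that the sequences to keep are passed in rather than hard-coded. A minimal sketch, assuming a hypothetical helper KeepLettersAnd and assuming every sequence passed in is non-empty:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LetterFilter
{
    // Removes every non-letter except the literal sequences listed in keep.
    // KeepLettersAnd is an illustrative name, not an existing API.
    public static string KeepLettersAnd(string text, params string[] keep)
    {
        if (keep == null || keep.Length == 0)
        {
            // Nothing to preserve: plain "letters only" replacement.
            return Regex.Replace(text, "[^a-zA-Z]", "");
        }

        // Escape each sequence so it is matched literally, then join with |.
        string alternatives = string.Join("|", keep.Select(k => Regex.Escape(k)));

        // Group 1 holds a kept sequence; for any other non-letter the group
        // does not participate, so $1 expands to an empty string.
        return Regex.Replace(text, "(" + alternatives + ")|[^a-zA-Z]", "$1");
    }
}
```

With the question's input, LetterFilter.KeepLettersAnd("abc 1239 asm122456000", "122456", "112466") returns abcasm122456.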

Splitting a String with conditions

Given a string of:
"S1 =F A1 =T A2 =T F3 =F"
How can I split it so that the result is an array of 4 strings, where the individual strings look like this:
"S1=F"
"A1=T"
"A2=T"
"F3=F"
Thank you
You can try matching all Name = (T|F) conditions with a regular expression and then get rid of the white space in each match with the help of LINQ:
using System.Linq;
using System.Text.RegularExpressions;
..
string source = "S1 \t = F A1 = T A2 = T F3 = F";
string[] result = Regex
    .Matches(source, @"[A-Za-z][A-Za-z0-9]*\s*=\s*[TF]")
    .OfType<Match>()
    .Select(match => string.Concat(match.Value.Where(c => !char.IsWhiteSpace(c))))
    .ToArray();
Console.WriteLine(string.Join(Environment.NewLine, result));
Outcome:
S1=F
A1=T
A2=T
F3=F
Edit: What's going on. The first part is the regular expression matching:
... Regex
    .Matches(source, @"[A-Za-z][A-Za-z0-9]*\s*=\s*[TF]")
    .OfType<Match>() ...
We are looking for fragments that match the pattern
[A-Za-z] - Letter A..Z or a..z
[A-Za-z0-9]* - followed by zero or many letters or digits
\s* - zero or more white spaces (spaces, tabulations etc.)
= - =
\s* - zero or more white spaces (spaces, tabulations etc.)
[TF] - either T or F
The second part is match cleanup: for each match found, e.g. S1 \t = F, we want to obtain the "S1=F" string:
...
.Select(match => string.Concat(match.Value.Where(c => !char.IsWhiteSpace(c))))
.ToArray();
We use Linq here: for each character in the match we filter out all white spaces (take character c if and only if it's not a white space):
match.Value.Where(c => !char.IsWhiteSpace(c))
then combine (Concat) filtered characters (IEnumerable<char>) of each match back to string and organize these strings as an array (materialization):
.Select(match => string.Concat(...))
.ToArray();
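A variant of the same idea captures the name and the value in two groups and rebuilds each item from them, so no whitespace filtering is needed afterwards. This is a sketch under the same pattern assumptions as above; ConditionParser and Parse are hypothetical names:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ConditionParser
{
    // Group 1 is the name (letter followed by letters/digits), group 2 is T or F.
    // Whitespace around '=' is matched but never captured.
    public static string[] Parse(string source)
    {
        return Regex.Matches(source, @"([A-Za-z][A-Za-z0-9]*)\s*=\s*([TF])")
            .Cast<Match>()
            .Select(m => m.Groups[1].Value + "=" + m.Groups[2].Value)
            .ToArray();
    }
}
```

ConditionParser.Parse("S1 =F A1 =T A2 =T F3 =F") returns the four strings S1=F, A1=T, A2=T and F3=F.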

C# regex. Everything inside curly brackets {} and mod (%) characters

I'm trying to get the values between {} and %% with the same regex.
This is what I have so far. I can get the values with each pattern individually, but I would like to learn how to combine both.
var regex = new Regex(@"%(.*?)%|\{([^}]*)\}");
String s = "This is a {test} %String%. %Stack% {Overflow}";
Expected answer for the above string
test
String
Stack
Overflow
Individual regex
@"%(.*?)%" gives me String and Stack
@"\{([^}]*)\}" gives me test and Overflow
Following is my code.
var regex = new Regex(@"%(.*?)%|\{([^}]*)\}");
var matches = regex.Matches(s);
foreach (Match match in matches)
{
    Console.WriteLine(match.Groups[1].Value);
}
Similar to your regex, you can use named capturing groups:
String s = "This is a {test} %String%. %Stack% {Overflow}";
var list = Regex.Matches(s, @"\{(?<name>.+?)\}|%(?<name>.+?)%")
    .Cast<Match>()
    .Select(m => m.Groups["name"].Value)
    .ToList();
If you want to learn how conditional expressions work, here is a solution using that kind of .NET regex capability:
(?:(?<p>%)|(?<b>{))(?<v>.*?)(?(p)%|})
See the regex demo
Here is how it works:
(?:(?<p>%)|(?<b>{)) - match and capture either Group "p" with % (percentage), or Group "b" (brace) with {
(?<v>.*?) - match and capture into Group "v" (value) any character (even a newline since I will be using RegexOptions.Singleline) zero or more times, but as few as possible (lazy matching with *? quantifier)
(?(p)%|}) - a conditional expression meaning: if "p" group was matched, match %, else, match }.
C# demo:
var s = "This is a {test} %String%. %Stack% {Overflow}";
var regex = "(?:(?<p>%)|(?<b>{))(?<v>.*?)(?(p)%|})";
var matches = Regex.Matches(s, regex, RegexOptions.Singleline);
// var matches_list = Regex.Matches(s, regex, RegexOptions.Singleline)
// .Cast<Match>()
// .Select(p => p.Groups["v"].Value)
// .ToList();
// Or just a demo writeline
foreach (Match match in matches)
Console.WriteLine(match.Groups["v"].Value);
Sometimes the capture is in group 1 and sometimes it's in group 2 because you have two pairs of parentheses.
Your original code will work if you do this instead:
Console.WriteLine(match.Groups[1].Value + match.Groups[2].Value);
because one group will be the empty string and the other will be the value you're interested in.
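Instead of concatenating both groups, you can also check which group actually took part in the match through Group.Success. A minimal sketch using the pattern from the question; DelimitedValues and Extract are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DelimitedValues
{
    // Returns the text found between %...% or {...}, in order of appearance.
    public static List<string> Extract(string s)
    {
        var values = new List<string>();
        foreach (Match match in Regex.Matches(s, @"%(.*?)%|\{([^}]*)\}"))
        {
            // Only one alternative matches at a time, so exactly one of the
            // two groups has Success == true for any given match.
            Group g = match.Groups[1].Success ? match.Groups[1] : match.Groups[2];
            values.Add(g.Value);
        }
        return values;
    }
}
```

For the sample string this yields test, String, Stack and Overflow.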
@"[\{%](.*?)[\}%]"
The idea being:
{ or %
anything
} or %
I think you should use a combination of alternation and nested groups (with lazy quantifiers, so that each pair is matched separately):
((\{(.*?)\})|(%(.*?)%))

Extract string that contains only letters in C#

string input = "5991 Duncan Road";
var onlyLetters = new String(input.Where(Char.IsLetter).ToArray());
Output: DuncanRoad
But the output I expect is Duncan Road. What needs to change?
For the input like yours, you do not need a regex, just skip all non-letter symbols at the beginning with SkipWhile():
Bypasses elements in a sequence as long as a specified condition is true and then returns the remaining elements.
C# code:
var input = "5991 Duncan Road";
var onlyLetters = new String(input.SkipWhile(p => !Char.IsLetter(p)).ToArray());
Console.WriteLine(onlyLetters);
See IDEONE demo
A regex solution that removes numbers that are not part of words, together with the adjoining whitespace:
var res = Regex.Replace(str, @"\s+(?<!\p{L})\d+(?!\p{L})|(?<!\p{L})\d+(?!\p{L})\s+", string.Empty);
You can use this lookaround based regex:
repl = Regex.Replace(input, @"(?<![a-zA-Z])[^a-zA-Z]|[^a-zA-Z](?![a-zA-Z])", "");
//=> Duncan Road
(?<![a-zA-Z])[^a-zA-Z] matches a non-letter that is not preceded by another letter.
| is regex alternation
[^a-zA-Z](?![a-zA-Z]) matches a non-letter that is not followed by another letter.
RegEx Demo
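To see what the two lookarounds keep and drop, the replacement can be wrapped in a small helper and tried on a couple of inputs. This is only a sketch; LetterRuns and StripOuterNonLetters are hypothetical names, and the second input is made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LetterRuns
{
    // Removes a non-letter when it is not preceded by a letter,
    // or when it is not followed by a letter.
    // A single non-letter sitting directly between two letters survives.
    public static string StripOuterNonLetters(string input)
    {
        return Regex.Replace(
            input,
            @"(?<![a-zA-Z])[^a-zA-Z]|[^a-zA-Z](?![a-zA-Z])",
            "");
    }
}
```

LetterRuns.StripOuterNonLetters("5991 Duncan Road") gives Duncan Road, and "12 Main St. 4" becomes Main St. A single non-letter between two letters is kept, so "a1b" stays unchanged.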
You can still use LINQ filtering with Char.IsLetter || Char.IsWhiteSpace. To remove all leading and trailing whitespace chars you can call String.Trim:
string input = "5991 Duncan Road";
string res = String.Join("", input.Where(c => Char.IsLetter(c) || Char.IsWhiteSpace(c)))
.Trim();
Console.WriteLine(res); // Duncan Road
