So I have a string that I need to split on semicolons.
Email address: "one#tw;,.'o"#hotmail.com;"some;thing"#example.com
Both of the email addresses are valid
So I want to have a List<string> of the following:
"one#tw;,.'o"#hotmail.com
"some;thing"#example.com
But the way I am currently splitting the addresses is not working:
var addresses = emailAddressString.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
.Select(x => x.Trim()).ToList();
Because of the multiple ; characters I end up with invalid email addresses.
I have tried a few different approaches, including checking whether the string contains quotes and then working from the indexes of the ; characters, but it's a real pain.
Does anyone have any better suggestions?
Assuming that double-quotes are not allowed, except for the opening and closing quotes ahead of the "at" sign #, you can use this regular expression to capture e-mail addresses:
((?:[^#"]+|"[^"]*")#[^;]+)(?:;|$)
The idea is to capture either an unquoted [^#"]+ or a quoted "[^"]*" part prior to #, and then capture everything up to semicolon ; or the end anchor $.
var input = "\"one#tw;,.'o\"#hotmail.com;\"some;thing\"#example.com;hello#world";
var mm = Regex.Matches(input, "((?:[^#\"]+|\"[^\"]*\")#[^;]+)(?:;|$)");
foreach (Match m in mm) {
Console.WriteLine(m.Groups[1].Value);
}
This code prints
"one#tw;,.'o"#hotmail.com
"some;thing"#example.com
hello#world
If you would like to allow escaped double-quotes inside double-quotes, you could use a more complex expression:
((?:(?:[^#\"]|(?<=\\)\")+|\"([^\"]|(?<=\\)\")*\")#[^;]+)(?:;|$)
Everything else remains the same.
I obviously started writing my anti regex method at around the same time as juharr (Another answer). I thought that since I already have it written I would submit it.
public static IEnumerable<string> SplitEmailsByDelimiter(string input, char delimiter)
{
var startIndex = 0;
var delimiterIndex = 0;
while (delimiterIndex >= 0)
{
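delimiterIndex = input.IndexOf(delimiter, startIndex); // use the parameter, not a hard-coded ';'
string substring = input;
if (delimiterIndex >= 0) // index 0 is a valid hit (input starts with the delimiter)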
{
substring = input.Substring(0, delimiterIndex);
}
if (!substring.Contains("\"") || substring.IndexOf("\"") != substring.LastIndexOf("\""))
{
yield return substring;
input = input.Substring(delimiterIndex + 1);
startIndex = 0;
}
else
{
startIndex = delimiterIndex + 1;
}
}
}
Then the following
var input = "blah#blah.com;\"one#tw;,.'o\"#hotmail.com;\"some;thing\"#example.com;hello#world;asdasd#asd.co.uk;";
foreach (var email in SplitEmailsByDelimiter(input, ';'))
{
Console.WriteLine(email);
}
Would give this output
blah#blah.com
"one#tw;,.'o"#hotmail.com
"some;thing"#example.com
hello#world
asdasd#asd.co.uk
You can also do this without using regular expressions. The following extension method will allow you to specify a delimiter character and a character to begin and end escape sequences. Note it does not validate that all escape sequences are closed.
public static IEnumerable<string> SpecialSplit(
this string str, char delimiter, char beginEndEscape)
{
int beginIndex = 0;
int length = 0;
bool escaped = false;
foreach (char c in str)
{
if (c == beginEndEscape)
{
escaped = !escaped;
}
if (!escaped && c == delimiter)
{
yield return str.Substring(beginIndex, length);
beginIndex += length + 1;
length = 0;
continue;
}
length++;
}
yield return str.Substring(beginIndex, length);
}
Then the following
var input = "\"one#tw;,.'o\"#hotmail.com;\"some;thing\"#example.com;hello#world;\"D;D#blah;blah.com\"";
foreach (var address in input.SpecialSplit(';', '"'))
Console.WriteLine(address);
Will give this output
"one#tw;,.'o"#hotmail.com
"some;thing"#example.com
hello#world
"D;D#blah;blah.com"
Here's a version that works with an additional single escape character. It assumes that two consecutive escape characters should become one literal escape character. The escape character can escape the beginEndEscape character, so that it does not trigger the beginning or end of an escape sequence, and it can also escape the delimiter. Any other character that follows the escape character is left as is, with the escape character removed.
public static IEnumerable<string> SpecialSplit(
this string str, char delimiter, char beginEndEscape, char singleEscape)
{
StringBuilder builder = new StringBuilder();
bool escapedSequence = false;
bool previousEscapeChar = false;
foreach (char c in str)
{
if (c == singleEscape && !previousEscapeChar)
{
previousEscapeChar = true;
continue;
}
if (c == beginEndEscape && !previousEscapeChar)
{
escapedSequence = !escapedSequence;
}
if (!escapedSequence && !previousEscapeChar && c == delimiter)
{
yield return builder.ToString();
builder.Clear();
continue;
}
builder.Append(c);
previousEscapeChar = false;
}
yield return builder.ToString();
}
Finally, you should probably add null checking for the string that is passed in. Note that both versions return a sequence containing one empty string if you pass in an empty string.
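One way the null check could look is sketched below. Because the split methods are iterators (they use yield return), a check placed inside them would not run until the result is enumerated, so this sketch validates in a non-iterator wrapper first. The names SplitExtensions, SafeSpecialSplit and Iterate are made up for illustration; the loop body is the two-parameter SpecialSplit from above.

```csharp
using System;
using System.Collections.Generic;

public static class SplitExtensions
{
    // Hypothetical wrapper: validates eagerly, then hands off to the iterator.
    public static IEnumerable<string> SafeSpecialSplit(
        this string str, char delimiter, char beginEndEscape)
    {
        if (str == null)
            throw new ArgumentNullException(nameof(str));
        return Iterate(str, delimiter, beginEndEscape);
    }

    // Same logic as the two-parameter SpecialSplit shown earlier.
    private static IEnumerable<string> Iterate(
        string str, char delimiter, char beginEndEscape)
    {
        int beginIndex = 0;
        int length = 0;
        bool escaped = false;
        foreach (char c in str)
        {
            if (c == beginEndEscape)
                escaped = !escaped;
            if (!escaped && c == delimiter)
            {
                yield return str.Substring(beginIndex, length);
                beginIndex += length + 1;
                length = 0;
                continue;
            }
            length++;
        }
        yield return str.Substring(beginIndex, length);
    }
}
```

With this shape, passing null throws at the call site instead of at the first MoveNext, which is usually easier to debug.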
Related
How to split text into words?
Example text:
'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'
The words in that line are:
Oh
you
can't
help
that
said
the
Cat
we're
all
mad
here
I'm
mad
You're
mad
Split text on whitespace, then trim punctuation.
var text = "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'";
var punctuation = text.Where(Char.IsPunctuation).Distinct().ToArray();
var words = text.Split().Select(x => x.Trim(punctuation));
Agrees exactly with example.
First, remove all special characters:
var fixedInput = Regex.Replace(input, "[^a-zA-Z0-9% ._]", string.Empty);
// This regex doesn't support apostrophe so the extension method is better
Then split it:
var split = fixedInput.Split(' ');
For a simpler C# solution for removing special characters (that you can easily change), add this extension method (I added a support for an apostrophe):
public static string RemoveSpecialCharacters(this string str) {
var sb = new StringBuilder();
foreach (char c in str) {
if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '\'' || c == ' ') {
sb.Append(c);
}
}
return sb.ToString();
}
Then use it like so:
var words = input.RemoveSpecialCharacters().Split(' ');
You may be surprised to know that this extension method is very efficient (it should be considerably faster than the Regex), so I suggest you use it ;)
Update
I agree that this is an English only approach but to make it Unicode compatible all you have to do is replace:
(c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
With:
char.IsLetter(c)
char.IsLetter supports Unicode. .NET also offers char.IsSymbol and char.IsLetterOrDigit for other cases.
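Put together, a Unicode-aware variant of the extension method might look like the sketch below. The class name StringCleanupExtensions and the method name RemoveSpecialCharactersUnicode are invented here to avoid clashing with the original; it uses char.IsLetterOrDigit in place of the three ASCII range checks.

```csharp
using System.Text;

public static class StringCleanupExtensions
{
    // Hypothetical Unicode-aware variant of RemoveSpecialCharacters.
    public static string RemoveSpecialCharactersUnicode(this string str)
    {
        var sb = new StringBuilder(str.Length);
        foreach (char c in str)
        {
            // Keep letters and digits from any script, plus apostrophes and spaces.
            if (char.IsLetterOrDigit(c) || c == '\'' || c == ' ')
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

For example, "Füße, can't!".RemoveSpecialCharactersUnicode() keeps the non-ASCII letters and the apostrophe while dropping the comma and the exclamation mark.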
Just to add a variation on @Adam Fridental's answer, which is very good, you could try this Regex:
var text = "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'";
var matches = Regex.Matches(text, @"\w+[^\s]*\w+|\w");
foreach (Match match in matches) {
var word = match.Value;
}
I believe this is the shortest RegEx that will get all the words
\w+[^\s]*\w+|\w
If you don't want to use a Regex object, you could do something like...
string mystring="Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.";
List<string> words = mystring.Replace(",", "").Replace(":", "").Replace(".", "").Split(' ').ToList();
You'll still have to handle the trailing apostrophe at the end of "that,'"
This is one possible solution; it doesn't use any helper class or library method.
public static List<string> ExtractChars(string inputString) {
var result = new List<string>();
int startIndex = -1;
for (int i = 0; i < inputString.Length; i++) {
var character = inputString[i];
if ((character >= 'a' && character <= 'z') ||
(character >= 'A' && character <= 'Z')) {
if (startIndex == -1) {
startIndex = i;
}
if (i == inputString.Length - 1) {
result.Add(GetString(inputString, startIndex, i));
}
continue;
}
if (startIndex != -1) {
result.Add(GetString(inputString, startIndex, i - 1));
startIndex = -1;
}
}
return result;
}
public static string GetString(string inputString, int startIndex, int endIndex) {
string result = "";
for (int i = startIndex; i <= endIndex; i++) {
result += inputString[i];
}
return result;
}
If you want to use a for loop to check each char and keep all the punctuation from the input string, I've created this class. The method GetSplitSentence() returns a list of SentenceSplitResult. This list contains all the words as well as all the punctuation and numbers; each punctuation character or digit is stored as its own item in the list. The field SentenceSplitResult.isAWord tells you whether an item is a word or not.
public class SentenceSplitResult
{
public string word;
public bool isAWord;
}
public class StringsHelper
{
private readonly List<SentenceSplitResult> outputList = new List<SentenceSplitResult>();
private readonly string input;
public StringsHelper(string input)
{
this.input = input;
}
public List<SentenceSplitResult> GetSplitSentence()
{
StringBuilder sb = new StringBuilder();
try
{
if (String.IsNullOrEmpty(input)) {
Logger.Log(new ArgumentNullException(), "GetSplitSentence - input is null or empty");
return outputList;
}
bool isAletter = IsAValidLetter(input[0]);
// Each char is checked to see whether it is part of a word.
// If YES > store the char for later.
// If NO > save the pending word (if any), then save the punctuation.
foreach (var _char in input)
{
isAletter = IsAValidLetter(_char);
if (isAletter == true)
{
sb.Append(_char);
}
else
{
SaveWord(sb.ToString());
sb.Clear();
SaveANotWord(_char);
}
}
SaveWord(sb.ToString());
}
catch (Exception ex)
{
Logger.Log(ex);
}
return outputList;
}
private static bool IsAValidLetter(char _char)
{
if ((Char.IsPunctuation(_char) == true) || (_char == ' ') || (Char.IsNumber(_char) == true))
{
return false;
}
return true;
}
private void SaveWord(string word)
{
if (String.IsNullOrEmpty(word) == false)
{
outputList.Add(new SentenceSplitResult()
{
isAWord = true,
word = word
});
}
}
private void SaveANotWord(char _char)
{
outputList.Add(new SentenceSplitResult()
{
isAWord = false,
word = _char.ToString()
});
}
}
You could try using a regex to remove the apostrophes that aren't surrounded by letters (i.e. single quotes) and then using the Char static methods to strip all the other characters. By calling the regex first you can keep the contraction apostrophes (e.g. can't) but remove the single quotes like in 'Oh.
string myText = "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'";
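// Match a quote character that does not have a letter on both sides.
// (In a regular string literal "\b" is a backspace, so a verbatim string is used here.)
Regex reg = new Regex(@"(?<!\p{L})[""']|[""'](?!\p{L})");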
myText = reg.Replace(myText, "");
string[] listOfWords = RemoveCharacters(myText);
public string[] RemoveCharacters(string input)
{
StringBuilder sb = new StringBuilder();
foreach (char c in input)
{
if (Char.IsLetter(c) || Char.IsWhiteSpace(c) || c == '\'')
sb.Append(c);
}
return sb.ToString().Split(' ');
}
I want to replace the delimiter comma with tabs in a CSV file
Input
Output
Note that commas shouldn't be replaced for words enclosed by quotes. Also in the output, we want to omit the double quotes
I tried the following, but the code also replaces commas for words enclosed by quotes
public void Replace_comma_with_tabs(string path)
{
var file = File
.ReadLines(path)
.SkipWhile(line => string.IsNullOrWhiteSpace(line)) // To be on the safe side
.Select((line, index) => line.Replace(',', '\t')) // replace ',' with '\t'
.ToList(); // Materialization, since we write into the same file
File.WriteAllLines(path, file);
}
How can I skip commas for the words enclosed by quotes?
Here is one way of doing it. It uses the flag quotesStarted to decide whether a comma should be treated as a delimiter or as part of the text in a column. I also used StringBuilder, since that class performs well for repeated string concatenation. The code reads the lines, and for each line it iterates through the characters and checks for those with special meaning (comma, double quote, tab, comma between double quotes):
static void Main(string[] args)
{
var path = "data.txt";
var file = File.ReadLines(path).ToArray();
StringBuilder sbFile = new StringBuilder();
foreach (string line in file)
{
if (String.IsNullOrWhiteSpace(line) == false)
{
bool quotesStarted = false;
StringBuilder sbLine = new StringBuilder();
foreach (char currentChar in line)
{
if (currentChar == '"')
{
quotesStarted = !quotesStarted;
sbLine.Append(currentChar);
}
else if (currentChar == ',')
{
if (quotesStarted)
sbLine.Append(currentChar);
else
sbLine.Append("\t");
}
else if (currentChar == '\t')
throw new Exception("Tab found");
else
sbLine.Append(currentChar);
}
sbFile.AppendLine(sbLine.ToString());
}
}
File.WriteAllText("Result-" + path, sbFile.ToString());
}
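The question also asks for the double quotes to be omitted from the output, which the code above keeps. A minimal per-line sketch of that variation follows; the names CsvLineConverter and CommasToTabs are hypothetical, and it assumes quotes only ever delimit fields (doubled quotes inside a field are not handled).

```csharp
using System.Text;

public static class CsvLineConverter
{
    // Hypothetical helper: converts one CSV line to tab-separated text,
    // dropping the double quotes and keeping commas that sit inside them.
    public static string CommasToTabs(string line)
    {
        var sb = new StringBuilder(line.Length);
        bool inQuotes = false;
        foreach (char c in line)
        {
            if (c == '"')
            {
                inQuotes = !inQuotes; // toggle state, but do not emit the quote
                continue;
            }
            sb.Append(c == ',' && !inQuotes ? '\t' : c);
        }
        return sb.ToString();
    }
}
```

It could be plugged into the original LINQ pipeline as .Select(line => CsvLineConverter.CommasToTabs(line)) in place of the plain Replace call.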
There's a lot of ways to do this but here's one. This only includes the code to transform a string that has comma delimited text with quoted text. You'd use "ToTabs" instead of "Replace" inside your Select statement. You'll have to harden this to add some error checking.
This will handle escaped quotes inside of quoted fields and it transforms existing tabs to spaces, but it's not a full blown CSV parser.
static class CsvHelper
{
public static string ToTabs(this string source)
{
Func<char,char> getState = NotInQuotes;
char last = ' ';
char InQuotes(char ch)
{
if ('"' == ch && last != '"')
getState = NotInQuotes;
else if ('\t' == ch)
ch = ' ';
last = ch;
return ch;
}
char NotInQuotes(char ch)
{
last = ch;
if ('"' == ch)
getState = InQuotes;
else if (',' == ch)
return '\t';
else if ('\t' == ch)
ch = ' ';
return ch;
}
return string.Create(source.Length, getState, (buffer,_) =>
{
for (int i = 0; i < source.Length; ++i)
{
buffer[i] = getState(source[i]);
}
});
}
}
static void Main(string[] _)
{
const string Source = "a,string,with,commas,\"field,with,\"\"commas\", and, another";
var withTabs = Source.ToTabs();
Console.WriteLine(Source);
Console.WriteLine(withTabs);
}
To change commas in a string to tabs, use the Replace method with the tab escape sequence "\t".
Example:
string str = "Lucy, John, Mark, Grace";
string str2 = str.Replace(",", "\t");
I need a regex to validate a string.
string test = @"C:\Dic\<:Id:>.<:Dic:>testtest<:Location:>.Test.doc";
I made a regex to get all fields between "<:" and ":>".
Regex.Matches(fileNameConfig, @"\<(.+?)\>")
.Cast<Match>()
.Select(m => m.Groups[0].Value).ToList();
Now I need to check whether there are any opening tags without closing tags, and whether there are any nested tags.
string test = @"C:\Dic\<:<:Id:>.<:Dic:>testtest<:Location:>.Test.doc";
string test = @"<:C:\Dic\<:Id:>.<:Dic:>testtest<:Location:>.Test.doc:>";
The nesting can be tested by counting the opening and closing brackets.
At any position in the string, the number of opening brackets before this position must be greater than or equal to the number of closing brackets.
At the end of the string, the number of opening brackets must equal the number of closing brackets exactly.
public static bool IsBracketNestingValid(string input) {
if (string.IsNullOrWhiteSpace(input)) {
return true; // empty string is always nested correctly
}
const string openingBracket = "<:";
const string closingBracket = ":>";
if (input.Length < openingBracket.Length) {
// perform this check if we expect that input strings
// must contain at least one bracket (e.g. "<" should be invalid)
return false;
}
int openingCount = 0;
int closingCount = 0;
for (int pos = 0; pos < input.Length-1; pos++) {
string currentToken = string.Format("{0}{1}", input[pos], input[pos+1]);
if (currentToken == openingBracket) {
openingCount++;
// skip over this recognized token
// (so we do not count any ':' twice, e.g. "<:>" should be invalid)
pos++;
}
if (currentToken == closingBracket) {
closingCount++;
pos++; // skip over this recognized token
}
if (closingCount > openingCount) {
return false; // found closing bracket before opening bracket
}
}
return openingCount == closingCount;
}
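Note that counting alone accepts a balanced but nested input such as "<:a<:b:>:>". If nested tags must be rejected as well, as the question asks, one possible variation is to track a single open/closed flag instead of two counters. This is only a sketch; the names TagChecks and HasNoNestedTags are invented for illustration.

```csharp
public static class TagChecks
{
    // Hypothetical helper: true only if every "<:" is closed by ":>"
    // before the next "<:" starts, i.e. tags are balanced and never nested.
    public static bool HasNoNestedTags(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return true; // nothing to validate
        }
        bool open = false;
        for (int pos = 0; pos < input.Length - 1; pos++)
        {
            if (input[pos] == '<' && input[pos + 1] == ':')
            {
                if (open) return false; // a tag opened inside another tag
                open = true;
                pos++; // skip over the recognized token
            }
            else if (input[pos] == ':' && input[pos + 1] == '>')
            {
                if (!open) return false; // closing tag without an opening tag
                open = false;
                pos++; // skip over the recognized token
            }
        }
        return !open; // a tag left open at the end is invalid
    }
}
```

With the three strings from the question, the first (no nesting) passes and the two nested examples are rejected.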
I have a string like the one below, which is pipe-separated. It has double quotes around some values (e.g. "ANI").
How do I split it on the pipe delimiters that are not inside double quotes?
511186|"ANI"|"ABCD-102091474|E|EFG"||"2013-07-20 13:47:19.556"
And the split values should be like below:
511186
"ANI"
"ABCD-102091474|E|EFG"
"2013-07-20 13:47:19.556"
Any help would be appreciated!
EDIT
The answer that I accepted did not work for strings that have double quotes inside a quoted value. Any idea what the issue could be?
using System.Text.RegularExpressions;
string regexFormat = string.Format(@"(?:^|\{0})(""[^""]*""|[^\{0}]*)", '|');
string[] result = Regex.Matches("111001103|\"E\"|\"BBB\"|\"XXX\"|||10000009|153086649|\"BCTV\"|\"REV\"|||1.00000000|||||\"ABC-BT AD\"|\"\"\"ABC - BT\"\" AD\"|||\"N\"||\"N\"|||\"N\"||\"N",regexFormat)
.Cast<Match>().Select(m => m.Groups[1].Value).ToArray();
foreach(var i in result)
Console.WriteLine(i);
You can use a regular expression to match the items in the string:
string[] result = Regex.Matches(s, @"(?:^|\|)(""[^""]*""|[^|]*)")
.Cast<Match>()
.Select(m => m.Groups[1].Value)
.ToArray();
Explanation:
(?:        A non-capturing group
^|\|       Matches start of string or a pipe character
)          End of group
(          Capturing group
"[^"]*"    Zero or more non-quotes surrounded by quotes
|          Or
[^|]*      Zero or more non-pipes
)          End of group
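As the edit in the question points out, the "[^"]*" part stops at the first inner quote, so a value like """ABC - BT"" AD" is cut short. One possible adjustment, assuming embedded quotes are always doubled CSV-style, is to let the quoted alternative also accept a pair of quotes. The sketch below uses made-up names (PipeSplitter, Split) and, like the original pattern, does not handle an input whose very first field is empty.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class PipeSplitter
{
    // Inside quotes, accept either a non-quote character or a doubled quote.
    private static readonly Regex FieldPattern =
        new Regex(@"(?:^|\|)(""(?:[^""]|"""")*""|[^|]*)");

    // Hypothetical helper: returns the fields with their surrounding quotes intact.
    public static string[] Split(string s)
    {
        return FieldPattern.Matches(s)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .ToArray();
    }
}
```

For the input 1|"A|B"|"""X"" Y"||2 this yields five fields: 1, "A|B", """X"" Y", an empty string, and 2.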
Here is one way to do it:
public List<string> Parse(string str)
{
var parts = str.Split(new[] {"|"}, StringSplitOptions.None);
List<string> result = new List<string>();
for (int i = 0; i < parts.Length; i++)
{
string part = parts[i];
if (IsPartStart(part))
{
List<string> sub_parts = new List<string>();
do
{
sub_parts.Add(part);
i++;
part = parts[i];
} while (!IsPartEnd(part));
sub_parts.Add(part);
part = string.Join("|", sub_parts);
}
result.Add(part);
}
return result;
}
private bool IsPartStart(string part)
{
return (part.StartsWith("\"") && !part.EndsWith("\"")) ;
}
private bool IsPartEnd(string part)
{
return (!part.StartsWith("\"") && part.EndsWith("\""));
}
This works by splitting everything and then rejoining the parts that need joining: it looks for parts that start with " and the corresponding parts that end with ".
inputString.Split('|');
...will give you the individual parts, but will fail if any of the parts have a pipe separator in them.
If it's a CSV file, following all the usual CSV rules about character-escaping, etc. (but using a pipe symbol instead of comma), then you should look at using CsvHelper, a NuGet package designed for reading and writing CSV files. It does all the hard work, and deals with all the corner cases that you'd otherwise have to do yourself.
Here's how I'd do it. It's fairly simple and I think you'll find it's very fast as well. I haven't run any tests, but I'm pretty confident that it's faster than regular expressions.
IEnumerable<string> Parse(string s)
{
int pos = 0;
while (pos < s.Length)
{
char endChar = '|';
// Test for quoted value
if (s[pos] == '"')
{
pos++;
endChar = '"';
}
// Extract this value
int newPos = s.IndexOf(endChar, pos);
if (newPos < 0)
newPos = s.Length;
yield return s.Substring(pos, newPos - pos);
// Move to start of next value
pos = newPos + 1;
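// Only a quoted value is followed by a separate delimiter to skip;
// after an unquoted value, a '|' here marks an empty field and must be kept.
if (endChar == '"' && pos < s.Length && s[pos] == '|')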
pos++;
}
}
I was just wondering if there is a simple way of doing this, i.e. replacing each run of consecutive identical characters with a single occurrence of that character.
For example, if my string is "something likeeeee tttthhiiissss" then my final output should be "something like this".
The string can contain special characters too, including spaces.
Can you suggest a simple way of doing this?
This should do it:
var regex = new Regex("(.)\\1+");
var str = "something likeeeee!! tttthhiiissss";
Console.WriteLine(regex.Replace(str, "$1")); // something like! this
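The regex matches any character (.) and captures it in the first group; \\1+ then matches one or more repetitions of whatever that group captured. Replacing the match with $1 leaves a single copy of the character.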
string myString = "something likeeeee tttthhiiissss";
char prevChar = '\0'; // an empty char literal ('') does not compile
StringBuilder sb = new StringBuilder();
foreach (char chr in myString)
{
if (chr != prevChar) {
sb.Append(chr);
prevChar = chr;
}
}
How about:
s = new string(s
.Select((x, i) => new { x, i })
.Where(x => x.i == s.Length - 1 || s[x.i + 1] != x.x)
.Select(x => x.x)
.ToArray());
In English, we are creating a new string based on a char[] array. We construct that char[] array by applying a few LINQ operators:
Select: Capture the index i along with the current character x.
Where: Keep only the characters that differ from the subsequent character (the last character is always kept).
Select: Project the character x.x back out of the anonymous type x.
ToArray: Convert back to a char[] array so we can pass it to the constructor of string.
Console.WriteLine("Enter any string");
string str1, result="", str = Console.ReadLine();
char [] array= str.ToCharArray();
int i=0;
for (i = 0; i < str.Length;i++ )
{
if ((i != (str.Length - 1)))
{ if (array[i] == array[i + 1])
{
str1 = str.Trim(array[i]);
}
else
{
result += array[i];
}
}
else
{
result += array[i];
}
}
Console.WriteLine(result);
In this code, the program:
1. Reads the string entered by the user.
2. Converts the string to a char array using string.ToCharArray().
3. Loops over each character in the string.
4. Compares each character with the character at the next position in the array. If the two are the same, the current character is skipped (the Trim() call in that branch has no effect on the result); otherwise it is appended to result.
5. Handles the last character separately, because comparing it with the next position would throw an index-out-of-bounds error. That's why I used if ((i != (str.Length - 1))).
6. Stores the remaining characters in result in concatenated form.
import re
word = "something likeeeee tttthhiiissss"
re.sub(r"(.)\1+", r"\1",word)