I'm trying to parse and split the following sample text file using dotnet c# in order to break each single data points into separate strings.
§Id§|§Name§|§UpdateDate§|§Description§
1|§AAA/FE-45§|2000-02-02 00:00:00|§§
2|§BBB-123§|2000-02-03 00:00:00|§§
3|§CC|45§|2000-02-07 00:00:00|§The following,
is a multiline description
please check Name:
CC|45 as soon as possible§
File Properties:
CodePage: ANSI
Column Headers: Yes
Row Delimiter: {CR}{LF}
Column Delimiter: | (Vertical Bar)
Text Qualifier: §
The trouble I have is that the text type columns are qualified with a non standard symbol and the given text could be a block of multi-line text that may contain various symbols such as {CRLF}, {LF} or even | (Vertical Bar).
From what I can read around, i cannot use TextFieldParser because it only handles double quote qualifier and Bulk Insert does not support text qualifier.
I'm no c# expert at all; I wouldn't want to reinvent the wheel and ideally would like to use the best practices. But I also like to understand and "own" what I produce so I would prefer to avoid Libraries such as Filehelpers.
Thank you for your guidance!
A typical approach would be to use finite automata for this. In your case, you can try the following code:
public static List<string[]> split(string s)
{
bool ins = false;
int no = 3;
var L = new List<string>();
var Res = new List<string[]>();
var B = new StringBuilder();
foreach (var c in s)
{
switch (c)
{
case '§':
if (ins)
{
ins = false;
L.Add(B.ToString());
if (no == 0)
{
Res.Add(L.ToArray<string>());
L.Clear();
no = 3;
}
}
else
{
ins = true;
B.Clear();
}
break;
case '|':
if (!ins) { no--; }
else B.Append(c);
break;
default:
if (ins) B.Append(c);
break;
}
}
return Res;
}
}
Try this code
string pattern = #"(?<id>\d+) \| (?<name>§.+?§) \| (?<date>\d{4}-\d\d-\d\d \s \d\d:\d\d:\d\d) \| (?<desc>§.*?§)";
Regex regex = new Regex(pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);
string text = File.ReadAllText("test.txt", Encoding.GetEncoding(1251));
text = text.Split(new string[] { Environment.NewLine }, 2, StringSplitOptions.None)[1];
var matches = regex.Matches(text);
foreach (Match match in matches)
{
Console.WriteLine(match.Groups["id"].Value);
Console.WriteLine(match.Groups["name"].Value.Trim('§'));
Console.WriteLine(match.Groups["date"].Value);
Console.WriteLine(match.Groups["desc"].Value.Trim('§'));
Console.WriteLine();
}
Related
I'm having issues doing a find / replace type of action in my function, i'm extracting the < a href="link">anchor from an article and replacing it with this format: [link anchor] the link and anchor will be dynamic so i can't hard code the values, what i have so far is:
public static string GetAndFixAnchor(string articleBody, string articleWikiCheck) {
string theString = string.Empty;
switch (articleWikiCheck) {
case "id|wpTextbox1":
StringBuilder newHtml = new StringBuilder(articleBody);
Regex r = new Regex(#"\<a href=\""([^\""]+)\"">([^<]+)");
string final = string.Empty;
foreach (var match in r.Matches(theString).Cast<Match>().OrderByDescending(m => m.Index))
{
string text = match.Groups[2].Value;
string newHref = "[" + match.Groups[1].Index + " " + match.Groups[1].Index + "]";
newHtml.Remove(match.Groups[1].Index, match.Groups[1].Length);
newHtml.Insert(match.Groups[1].Index, newHref);
}
theString = newHtml.ToString();
break;
default:
theString = articleBody;
break;
}
Helpers.ReturnMessage(theString);
return theString;
}
Currently, it just returns the article as it originally is, with the traditional anchor text format: < a href="link">anchor
Can anyone see what i have done wrong?
regards
If your input is HTML, you should consider using a corresponding parser, HtmlAgilityPack being really helpful.
As for the current code, it looks too verbose. You may use a single Regex.Replace to perform the search and replace in one pass:
public static string GetAndFixAnchor(string articleBody, string articleWikiCheck) {
if (articleWikiCheck == "id|wpTextbox1")
{
return Regex.Replace(articleBody, #"<a\s+href=""([^""]+)"">([^<]+)", "[$1 $2]");
}
else
{
// Helpers.ReturnMessage(articleBody); // Uncomment if it is necessary
return articleBody;
}
}
See the regex demo.
The <a\s+href="([^"]+)">([^<]+) regex matches <a, 1 or more whitespaces, href=", then captures into Group 1 any one or more chars other than ", then matches "> and then captures into Group 2 any one or more chars other than <.
The [$1 $2] replacement replaces the matched text with [, Group 1 contents, space, Group 2 contents and a ].
Updated (Corrected regex to support whitespaces and new lines)
You can try this expression
Regex r = new Regex(#"<[\s\n]*a[\s\n]*(([^\s]+\s*[ ]*=*[ ]*[\s|\n*]*('|"").*\3)[\s\n]*)*href[ ]*=[ ]*('|"")(?<link>.*)\4[.\n]*>(?<anchor>[\s\S]*?)[\s\n]*<\/[\s\n]*a>");
It will match your anchors, even if they are splitted into multiple lines. The reason why it is so long is because it supports empty whitespaces between the tags and their values, and C# does not supports subroutines, so this part [\s\n]* has to be repeated multiple times.
You can see a working sample at dotnetfiddle
You can use it in your example like this.
public static string GetAndFixAnchor(string articleBody, string articleWikiCheck) {
if (articleWikiCheck == "id|wpTextbox1")
{
return Regex.Replace(articleBody,
#"<[\s\n]*a[\s\n]*(([^\s]+\s*[ ]*=*[ ]*[\s|\n*]*('|"").*\3)[\s\n]*)*href[ ]*=[ ]*('|"")(?<link>.*)\4[.\n]*>(?<anchor>[\s\S]*?)[\s\n]*<\/[\s\n]*a>",
"[${link} ${anchor}]");
}
else
{
return articleBody;
}
}
I'm wondering how I can identify headings with differing numerical marking styles with one or more regular expressions assuming sometimes styles overlap between documents. The goal is to extract all the subheadings and data for a specific heading in each file, but these files aren't standardized. Is regular expressions even the right approach here?
I'm working on a program that parses a .pdf file and looks for a specific section. Once it finds the section it finds all subsections of that section and their content and stores it in a dictionary<string, string>. I start by reading the entire pdf into a string, and then use this function to locate the "marking" section.
private string GetMarkingSection(string text)
{
int startIndex = 0;
int endIndex = 0;
bool startIndexFound = false;
Regex rx = new Regex(HEADINGREGEX);
foreach (Match match in rx.Matches(text))
{
if (startIndexFound)
{
endIndex = match.Index;
break;
}
if (match.ToString().ToLower().Contains("marking"))
{
startIndex = match.Index;
startIndexFound = true;
}
}
return text.Substring(startIndex, (endIndex - startIndex));
}
Once the marking section is found, I use this to find subsections.
private Dictionary<string, string> GetSubsections(string text)
{
Dictionary<string, string> subsections = new Dictionary<string, string>();
string[] unprocessedSubSecs = Regex.Split(text, SUBSECTIONREGEX);
string title = "";
string content = "";
foreach(string s in unprocessedSubSecs)
{
if(s != "") //sometimes it pulls in empty strings
{
Match m = Regex.Match(s, SUBSECTIONREGEX);
if (m.Success)
{
title = s;
}
else
{
content = s;
if (!String.IsNullOrWhiteSpace(content) && !String.IsNullOrWhiteSpace(title))
{
subsections.Add(title, content);
}
}
}
}
return subsections;
}
Getting these methods to work the way I want them to isn't an issue, the problem is getting them to work with each of the documents. I'm working on a commercial application so any API that requires a license isn't going to work for me.
These documents are anywhere from 1-16 years old, so the formatting varies quite a bit. Here is a link to some sample headings and subheadings from various documents. But to make it easy, here are the regex patterns I'm using:
Heading: (?m)^(\d+\.\d+\s[ \w,\-]+)\r?$
Subheading: (?m)^(\d\.[\d.]+ ?[ \w]+) ?\r?$
Master Key: (?m)^(\d\.?[\d.]*? ?[ \-,:\w]+) ?\r?$
Since some headings use the subheading format in other documents I am unable to use the same heading regex for each file, and the same goes for my subheading regex.
My alternative to this was that I was going to write a master key (listed in the regex link) to identify all types of headings and then locate the last instance of a numeric character in each heading (5.1.X) and then look for 5.1.X+1 to find the end of that section.
That's when I ran into another problem. Some of these files have absolutely no proper structure. Most of them go from 5.2->7.1.5 (5.2->5.3/6.0 would be expected)
I'm trying to wrap my head around a solution for something like this, but I've got nothing... I am open to ideas not involving regex as well.
Here is my updated GetMarkingSection method:
private Dictionary<string, string> GetMarkingSection(string text)
{
var headingRegex = HEADING1REGEX;
var subheadingRegex = HEADING2REGEX;
Dictionary<string, string> markingSection = new Dictionary<string, string>();
if (Regex.Matches(text, HEADING1REGEX, RegexOptions.Multiline | RegexOptions.Singleline).Count > 0)
{
foreach (Match m in Regex.Matches(text, headingRegex, RegexOptions.Multiline | RegexOptions.Singleline))
{
if (Regex.IsMatch(m.ToString(), HEADINGMASTERKEY))
{
if (m.Groups[2].Value.ToLower().Contains("marking"))
{
var subheadings = Regex.Matches(m.ToString(), subheadingRegex, RegexOptions.Multiline | RegexOptions.Singleline);
foreach (Match s in subheadings)
{
markingSection.Add(s.Groups[1].Value + " " + s.Groups[2].Value, s.Groups[3].Value);
}
return markingSection;
}
}
}
}
else
{
headingRegex = HEADING2REGEX;
subheadingRegex = HEADING3REGEX;
foreach(Match m in Regex.Matches(text, headingRegex, RegexOptions.Multiline | RegexOptions.Singleline))
{
if(Regex.IsMatch(m.ToString(), HEADINGMASTERKEY))
{
if (m.Groups[2].Value.ToLower().Contains("marking"))
{
var subheadings = Regex.Matches(m.ToString(), subheadingRegex, RegexOptions.Multiline | RegexOptions.Singleline);
foreach (Match s in subheadings)
{
markingSection.Add(s.Groups[1].Value + " " + s.Groups[2].Value, s.Groups[3].Value);
}
return markingSection;
}
}
}
}
return null;
}
Here are some example PDF files:
See if this approach works:
var heading1Regex = #"^(\d+)\s(?<title>.*?)$\n(?<content>.*?)$\n*(?=^\d+\s|\Z)";
Demo
var heading2Regex = #"^(\d+)\.(\d+)\s(?<title>.*?)$\n(?<content>.*?)$\n*(?=^\d+\.\d+\s|\Z)";
Demo
var heading3Regex = #"^(\d+)\.(\d+)\.(\d+)\s(?<title>.*?)$\n(?<content>.*?)$\n*(?=^\d+\.\d+\.\d+\s|\Z)";
Demo
For each pdf file:
var headingRegex = heading1Regex;
var subHeadingRegex = heading2Regex;
if there are any matches for headingRegex
{
for each match, find matches for subHeadingRegex
}
else
{
var headingRegex = heading2Regex;
var subHeadingRegex = heading3Regex;
//repeat same steps
}
1. Edge case 1: after 5.2, comes 7.1.3
As shown here,
get main section match using heading2Regex.
convert group1 of the match to integer
int.TryParse(match.group1, out var headingIndex);
get sub section matches for heading3Regex
for each subsection match, convert group1 to integer.
int.TryParse(match.group1, out var subHeadingIndex);
check if headingIndex is equal to subHeadingIndex. if not handle accordingly.
I am trying to future proof a program I am creating so that the pattern I need to have users put in is not hard coded. There is always a chance that the letter or number patter can change, but when it does I need everyone to remain consistent. Plus I want the managers to be to control what goes in without relying on me. Is it possible to use regex or another string tool to compare input against a list stored in a database. I want it to be easy so the patterns stored in the database would look like X###### or X######-X####### and so on.
Sure, just store the regular expression rules in a string column in a table and then load them into an IEnumerable<Regex> in your app. Then, a match is simply if ANY of those rules match. Beware that conflicting rules could be prone to greedy race (first one to be checked wins) so you'd have to be careful there. Also be aware that there are many optimizations that you could perform beyond my example, which is designed to be simple.
List<string> regexStrings = db.GetRegexStrings();
var result = new List<Regex>(regexStrings.Count);
foreach (var regexString in regexStrings)
{
result.Add(new Regex(regexString);
}
...
// The check
bool matched = result.Any(i => i.IsMatch(testInput));
You could store your patterns as-is in your database, and then translate them to regexes.
I don't know specifically what characters you'd need in your format, but let's suppose you just want to substitute a number to # and leave the rest as-is, here's some code for that:
public static Regex ConvertToRegex(string pattern)
{
var sb = new StringBuilder();
sb.Append("^");
foreach (var c in pattern)
{
switch (c)
{
case '#':
sb.Append(#"\d");
break;
default:
sb.Append(Regex.Escape(c.ToString()));
break;
}
}
sb.Append("$");
return new Regex(sb.ToString());
}
You can also use options like RegexOptions.IgnoreCase if that's what you need.
NB: For some reason, Regex.Escape escapes the # character, even though it's not special... So I just went for the character-by-character approach.
private bool TestMethod()
{
const string textPattern = "X###";
string text = textBox1.Text;
bool match = true;
if (text.Length == textPattern.Length)
{
char[] chrStr = text.ToCharArray();
char[] chrPattern = textPattern.ToCharArray();
int length = text.Length;
for (int i = 0; i < length; i++)
{
if (chrPattern[i] != '#')
{
if (chrPattern[i] != chrStr[i])
{
return false;
}
}
}
}
else
{
return false;
}
return match;
}
This is doing everything I need it to do now. Thanks for all the tips though. I will have to look into the regex more in the future.
Using MaskedTextProvider, you could do do something like this:
using System.Globalization;
using System.ComponentModel;
string pattern = "X&&&&&&-X&&&&&&&";
string text = "Xabcdef-Xasdfghi";
var culture = CultureInfo.GetCultureInfo("sv-SE");
var matcher = new MaskedTextProvider(pattern, culture);
int position;
MaskedTextResultHint hint;
if (!matcher.Set(text, out position, out hint))
{
Console.WriteLine("Error at {0}: {1}", position, hint);
}
else if (!matcher.MaskCompleted)
{
Console.WriteLine("Not enough characters");
}
else if (matcher.ToString() != text)
{
Console.WriteLine("Missing literals");
}
else
{
Console.WriteLine("OK");
}
For a description of the format, see: http://msdn.microsoft.com/en-us/library/system.windows.forms.maskedtextbox.mask
I have a text file with a certain format. First comes an identifier followed by three spaces and a colon. Then comes the value for this identifier.
ID1 :Value1
ID2 :Value2
ID3 :Value3
What I need to do is searching e.g. for ID2 : and replace Value2 with a new value NewValue2. What would be a way to do this? The files I need to parse won't get very large. The largest will be around 150 lines.
If the file isn't that big you can do a File.ReadAllLines to get a collection of all the lines and then replace the line you're looking for like this
using System.IO;
using System.Linq;
using System.Collections.Generic;
List<string> lines = new List<string>(File.ReadAllLines("file"));
int lineIndex = lines.FindIndex(line => line.StartsWith("ID2 :"));
if (lineIndex != -1)
{
lines[lineIndex] = "ID2 :NewValue2";
File.WriteAllLines("file", lines);
}
Here's a simple solution which also creates a backup of the source file automatically.
The replacements are stored in a Dictionary object. They are keyed on the line's ID, e.g. 'ID2' and the value is the string replacement required. Just use Add() to add more as required.
StreamWriter writer = null;
Dictionary<string, string> replacements = new Dictionary<string, string>();
replacements.Add("ID2", "NewValue2");
// ... further replacement entries ...
using (writer = File.CreateText("output.txt"))
{
foreach (string line in File.ReadLines("input.txt"))
{
bool replacementMade = false;
foreach (var replacement in replacements)
{
if (line.StartsWith(replacement.Key))
{
writer.WriteLine(string.Format("{0} :{1}",
replacement.Key, replacement.Value));
replacementMade = true;
break;
}
}
if (!replacementMade)
{
writer.WriteLine(line);
}
}
}
File.Replace("output.txt", "input.txt", "input.bak");
You'll just have to replace input.txt, output.txt and input.bak with the paths to your source, destination and backup files.
Ordinarily, for any text searching and replacement, I'd suggest some sort of regular expression work, but if this is all you're doing, that's really overkill.
I would just open the original file and a temporary file; read the original a line at a time, and just check each line for "ID2 :"; if you find it, write your replacement string to the temporary file, otherwise, just write what you read. When you've run out of source, close both, delete the original, and rename the temporary file to that of the original.
Something like this should work. It's very simple, not the most efficient thing, but for small files, it would be just fine:
private void setValue(string filePath, string key, string value)
{
string[] lines= File.ReadAllLines(filePath);
for(int x = 0; x < lines.Length; x++)
{
string[] fields = lines[x].Split(':');
if (fields[0].TrimEnd() == key)
{
lines[x] = fields[0] + ':' + value;
File.WriteAllLines(lines);
break;
}
}
}
You can use regex and do it in 3 lines of code
string text = File.ReadAllText("sourcefile.txt");
text = Regex.Replace(text, #"(?i)(?<=^id2\s*?:\s*?)\w*?(?=\s*?$)", "NewValue2",
RegexOptions.Multiline);
File.WriteAllText("outputfile.txt", text);
In the regex, (?i)(?<=^id2\s*?:\s*?)\w*?(?=\s*?$) means, find anything that starts with id2 with any number of spaces before and after :, and replace the following string (any alpha numeric character, excluding punctuations) all the way 'till end of the line. If you want to include punctuations, then replace \w*? with .*?
You can use regexes to achieve this.
Regex re = new Regex(#"^ID\d+ :Value(\d+)\s*$", RegexOptions.IgnoreCase | RegexOptions.Compiled);
List<string> lines = File.ReadAllLines("mytextfile");
foreach (string line in lines) {
string replaced = re.Replace(target, processMatch);
//Now do what you going to do with the value
}
string processMatch(Match m)
{
var number = m.Groups[1];
return String.Format("ID{0} :NewValue{0}", number);
}
My string is as follows:
smtp:jblack#test.com;SMTP:jb#test.com;X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G=Black;
I need back:
smtp:jblack#test.com
SMTP:jb#test.com
X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G=Black;
The problem is the semi-colons seperate the addresses and also part of the X400 address. Can anyone suggest how best to split this?
PS I should mentioned the order differs so it could be:
X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G=Black;;smtp:jblack#test.com;SMTP:jb#test.com
There can be more than 3 address, 4, 5.. 10 etc including an X500 address, however they do all start with either smtp: SMTP: X400 or X500.
EDIT: With the updated information, this answer certainly won't do the trick - but it's still potentially useful, so I'll leave it here.
Will you always have three parts, and you just want to split on the first two semi-colons?
If so, just use the overload of Split which lets you specify the number of substrings to return:
string[] bits = text.Split(new char[]{';'}, 3);
May I suggest building a regular expression
(smtp|SMTP|X400|X500):((?!smtp:|SMTP:|X400:|X500:).)*;?
or protocol-less
.*?:((?![^:;]*:).)*;?
in other words find anything that starts with one of your protocols. Match the colon. Then continue matching characters as long as you're not matching one of your protocols. Finish with a semicolon (optionally).
You can then parse through the list of matches splitting on ':' and you'll have your protocols. Additionally if you want to add protocols, just add them to the list.
Likely however you're going to want to specify the whole thing as case-insensitive and only list the protocols in their uppercase or lowercase versions.
The protocol-less version doesn't care what the names of the protocols are. It just finds them all the same, by matching everything up to, but excluding a string followed by a colon or a semi-colon.
Split by the following regex pattern
string[] items = System.Text.RegularExpressions.Split(text, ";(?=\w+:)");
EDIT: better one can accept more special chars in the protocol name.
string[] items = System.Text.RegularExpressions.Split(text, ";(?=[^;:]+:)");
http://msdn.microsoft.com/en-us/library/c1bs0eda.aspx
check there, you can specify the number of splits you want. so in your case you would do
string.split(new char[]{';'}, 3);
Not the fastest if you are doing this a lot but it will work for all cases I believe.
string input1 = "smtp:jblack#test.com;SMTP:jb#test.com;X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G=Black;";
string input2 = "X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G=Black;;smtp:jblack#test.com;SMTP:jb#test.com";
Regex splitEmailRegex = new Regex(#"(?<key>\w+?):(?<value>.*?)(\w+:|$)");
List<string> sets = new List<string>();
while (input2.Length > 0)
{
Match m1 = splitEmailRegex.Matches(input2)[0];
string s1 = m1.Groups["key"].Value + ":" + m1.Groups["value"].Value;
sets.Add(s1);
input2 = input2.Substring(s1.Length);
}
foreach (var set in sets)
{
Console.WriteLine(set);
}
Console.ReadLine();
Of course many will claim Regex: Now you have two problems. There may even be a better regex answer than this.
You could always split on the colon and have a little logic to grab the key and value.
string[] bits = text.Split(':');
List<string> values = new List<string>();
for (int i = 1; i < bits.Length; i++)
{
string value = bits[i].Contains(';') ? bits[i].Substring(0, bits[i].LastIndexOf(';') + 1) : bits[i];
string key = bits[i - 1].Contains(';') ? bits[i - 1].Substring(bits[i - 1].LastIndexOf(';') + 1) : bits[i - 1];
values.Add(String.Concat(key, ":", value));
}
Tested it with both of your samples and it works fine.
This caught my curiosity .... So this code actually does the job, but again, wants tidying :)
My final attempt - stop changing what you need ;=)
static void Main(string[] args)
{
string fneh = "X400:C=US400;A= ;P=Test;O=Exchange;S=Jack;G=Black;x400:C=US400l;A= l;P=Testl;O=Exchangel;S=Jackl;G=Blackl;smtp:jblack#test.com;X500:C=US500;A= ;P=Test;O=Exchange;S=Jack;G=Black;SMTP:jb#test.com;";
string[] parts = fneh.Split(new char[] { ';' });
List<string> addresses = new List<string>();
StringBuilder address = new StringBuilder();
foreach (string part in parts)
{
if (part.Contains(":"))
{
if (address.Length > 0)
{
addresses.Add(semiColonCorrection(address.ToString()));
}
address = new StringBuilder();
address.Append(part);
}
else
{
address.AppendFormat(";{0}", part);
}
}
addresses.Add(semiColonCorrection(address.ToString()));
foreach (string emailAddress in addresses)
{
Console.WriteLine(emailAddress);
}
Console.ReadKey();
}
private static string semiColonCorrection(string address)
{
if ((address.StartsWith("x", StringComparison.InvariantCultureIgnoreCase)) && (!address.EndsWith(";")))
{
return string.Format("{0};", address);
}
else
{
return address;
}
}
Try these regexes. You can extract what you're looking for using named groups.
X400:(?<X400>.*?)(?:smtp|SMTP|$)
smtp:(?<smtp>.*?)(?:;+|$)
SMTP:(?<SMTP>.*?)(?:;+|$)
Make sure when constructing them you specify case insensitive. They seem to work with the samples you gave
Lots of attempts. Here is mine ;)
string src = "smtp:jblack#test.com;SMTP:jb#test.com;X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G=Black;";
Regex r = new Regex(#"
(?:^|;)smtp:(?<smtp>([^;]*(?=;|$)))|
(?:^|;)x400:(?<X400>.*?)(?=;x400|;x500|;smtp|$)|
(?:^|;)x500:(?<X500>.*?)(?=;x400|;x500|;smtp|$)",
RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
foreach (Match m in r.Matches(src))
{
if (m.Groups["smtp"].Captures.Count != 0)
Console.WriteLine("smtp: {0}", m.Groups["smtp"]);
else if (m.Groups["X400"].Captures.Count != 0)
Console.WriteLine("X400: {0}", m.Groups["X400"]);
else if (m.Groups["X500"].Captures.Count != 0)
Console.WriteLine("X500: {0}", m.Groups["X500"]);
}
This finds all smtp, x400 or x500 addresses in the string in any order of appearance. It also identifies the type of address ready for further processing. The appearance of the text smtp, x400 or x500 in the addresses themselves will not upset the pattern.
This works!
string input =
"smtp:jblack#test.com;SMTP:jb#test.com;X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G=Black;";
string[] parts = input.Split(';');
List<string> output = new List<string>();
foreach(string part in parts)
{
if (part.Contains(":"))
{
output.Add(part + ";");
}
else if (part.Length > 0)
{
output[output.Count - 1] += part + ";";
}
}
foreach(string s in output)
{
Console.WriteLine(s);
}
Do the semicolon (;) split and then loop over the result, re-combining each element where there is no colon (:) with the previous element.
string input = "X400:C=US;A= ;P=Test;O=Exchange;S=Jack;G="
+"Black;;smtp:jblack#test.com;SMTP:jb#test.com";
string[] rawSplit = input.Split(';');
List<string> result = new List<string>();
//now the fun begins
string buffer = string.Empty;
foreach (string s in rawSplit)
{
if (buffer == string.Empty)
{
buffer = s;
}
else if (s.Contains(':'))
{
result.Add(buffer);
buffer = s;
}
else
{
buffer += ";" + s;
}
}
result.Add(buffer);
foreach (string s in result)
Console.WriteLine(s);
here is another possible solution.
string[] bits = text.Replace(";smtp", "|smtp").Replace(";SMTP", "|SMTP").Replace(";X400", "|X400").Split(new char[] { '|' });
bits[0],
bits[1], and
bits[2]
will then contains the three parts in the order from your original string.