Parse a multiline email to var - c#

I'm attempting to parse a multi-line email so I can get at the data which is on its own newline under the heading in the body of the email.
It looks like this:
EMAIL STARTING IN APRIL
Marketing ID Local Number
------------------- ----------------------
GR332230 0000232323
Dispatch Code Logic code
----------------- -------------------
GX3472 1
Destination ID Destination details
----------------- -------------------
3411144
It appears I am getting everything on each messagebox when I use string reader readline, though all I want is the data under each ------ as shown
This is my code:
foreach (MailItem mail in publicFolder.Items)
{
if (mail != null)
{
if (mail is MailItem)
{
MessageBox.Show(mail.Body, "MailItem body");
// Creates new StringReader instance from System.IO
using (StringReader reader = new StringReader(mail.Body))
{
string line;
while ((line = reader.ReadLine()) !=null)
//Loop over the lines in the string.
if (mail.Body.Contains("Marketing ID"))
{
// var localno = mail.Body.Substring(247,15);//not correct approach
// MessageBox.Show(localrefno);
//MessageBox.Show("found");
//var conexid = mail.Body.Replace(Environment.NewLine);
var regex = new Regex("<br/>", RegexOptions.Singleline);
MessageBox.Show(line.ToString());
}
}
//var stringBuilder = new StringBuilder();
//foreach (var s in mail.Body.Split(' '))
//{
// stringBuilder.Append(s).AppendLine();
//}
//MessageBox.Show(stringBuilder.ToString());
}
else
{
MessageBox.Show("Nothing found for MailItem");
}
}
}
You can see I had numerous attempts with it, even using substring position and using regex. Please help me get the data from each line under the ---.

It is not a very good idea to do that with Regex because it is quite easy to forget the edge cases, not easy to understand, and not easy to debug. It's quite easy to get into a situation that the Regex hangs your CPU and times out. (I cannot make any comment to other answers yet. So, please check at least my other two cases before you pick your final solution.)
In your cases, the following Regex solution works for your provided example. However, some additional limitations are there: You need to make sure there are no empty values in the non-starting or non-ending column. Or, let's say if there are more than two columns and any one of them in the middle is empty will make the names and values of that line mismatched.
Unfortunately, I cannot give you a non-Regex solution because I don't know the spec, e.g.: Will there be empty spaces? Will there be TABs? Does each field has a fixed count of characters or will they be flexible? If it is flexible and can have empty values, what kind of rules to detected which columns are empty? I assume that it is quite possible that they are defined by the column name's length and will have only space as delimiter. If that's the case, there are two ways to solve it, two-pass Regex or write your own parser. If all the fields has fixed length, it would be even more easier to do: Just using the substring to cut the lines and then trim them.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
public class Program
{
public class Record{
public string Name {get;set;}
public string Value {get;set;}
}
public static void Main()
{
var regex = new Regex(#"(?<name>((?!-)[\w]+[ ]?)*)(?>(?>[ \t]+)?(?<name>((?!-)[\w]+[ ]?)+)?)+(?:\r\n|\r|\n)(?>(?<splitters>(-+))(?>[ \t]+)?)+(?:\r\n|\r|\n)(?<value>((?!-)[\w]+[ ]?)*)(?>(?>[ \t]+)?(?<value>((?!-)[\w]+[ ]?)+)?)+", RegexOptions.Compiled);
var testingValue =
#"EMAIL STARTING IN APRIL
Marketing ID Local Number
------------------- ----------------------
GR332230 0000232323
Dispatch Code Logic code
----------------- -------------------
GX3472 1
Destination ID Destination details
----------------- -------------------
3411144";
var matches = regex.Matches(testingValue);
var rows = (
from match in matches.OfType<Match>()
let row = (
from grp in match.Groups.OfType<Group>()
select new {grp.Name, Captures = grp.Captures.OfType<Capture>().ToList()}
).ToDictionary(item=>item.Name, item=>item.Captures.OfType<Capture>().ToList())
let names = row.ContainsKey("name")? row["name"] : null
let splitters = row.ContainsKey("splitters")? row["splitters"] : null
let values = row.ContainsKey("value")? row["value"] : null
where names != null && splitters != null &&
names.Count == splitters.Count &&
(values==null || values.Count <= splitters.Count)
select new {Names = names, Values = values}
);
var records = new List<Record>();
foreach(var row in rows)
{
for(int i=0; i< row.Names.Count; i++)
{
records.Add(new Record{Name=row.Names[i].Value, Value=i < row.Values.Count ? row.Values[i].Value : ""});
}
}
foreach(var record in records)
{
Console.WriteLine(record.Name + " = " + record.Value);
}
}
}
output:
Marketing ID = GR332230
Local Number = 0000232323
Dispatch Code = GX3472
Logic code = 1
Destination ID = 3411144
Destination details =
Please note that this also works for this kind of message:
EMAIL STARTING IN APRIL
Marketing ID Local Number
------------------- ----------------------
GR332230 0000232323
Dispatch Code Logic code
----------------- -------------------
GX3472 1
Destination ID Destination details
----------------- -------------------
3411144
output:
Marketing ID = GR332230
Local Number = 0000232323
Dispatch Code = GX3472
Logic code = 1
Destination ID =
Destination details = 3411144
Or this:
EMAIL STARTING IN APRIL
Marketing ID Local Number
------------------- ----------------------
Dispatch Code Logic code
----------------- -------------------
GX3472 1
Destination ID Destination details
----------------- -------------------
3411144
output:
Marketing ID =
Local Number =
Dispatch Code = GX3472
Logic code = 1
Destination ID =
Destination details = 3411144

var dict = new Dictionary<string, string>();
try
{
var lines = email.Split(Environment.NewLine.ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
int starts = 0, end = 0, length = 0;
while (!lines[starts + 1].StartsWith("-")) starts++;
for (int i = starts + 1; i < lines.Length; i += 3)
{
var mc = Regex.Matches(lines[i], #"(?:^| )-");
foreach (Match m in mc)
{
int start = m.Value.StartsWith(" ") ? m.Index + 1 : m.Index;
end = start;
while (lines[i][end++] == '-' && end < lines[i].Length - 1) ;
length = Math.Min(end - start, lines[i - 1].Length - start);
string key = length > 0 ? lines[i - 1].Substring(start, length).Trim() : "";
end = start;
while (lines[i][end++] == '-' && end < lines[i].Length) ;
length = Math.Min(end - start, lines[i + 1].Length - start);
string value = length > 0 ? lines[i + 1].Substring(start, length).Trim() : "";
dict.Add(key, value);
}
}
}
catch (Exception ex)
{
throw new Exception("Email is not in correct format");
}
Live Demo
Using Regular Expressions:
var dict = new Dictionary<string, string>();
try
{
var lines = email.Split(Environment.NewLine.ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
int starts = 0;
while (!lines[starts + 1].StartsWith("-")) starts++;
for (int i = starts + 1; i < lines.Length; i += 3)
{
var keys = Regex.Matches(lines[i - 1], #"(?:^| )(\w+\s?)+");
var values = Regex.Matches(lines[i + 1], #"(?:^| )(\w+\s?)+");
if (keys.Count == values.Count)
for (int j = 0; j < keys.Count; j++)
dict.Add(keys[j].Value.Trim(), values[j].Value.Trim());
else // remove bug if value of first key in a line has no value
{
if (lines[i + 1].StartsWith(" "))
{
dict.Add(keys[0].Value.Trim(), "");
dict.Add(keys[1].Value.Trim(), values[0].Value.Trim());
}
else
{
dict.Add(keys[0].Value, values[0].Value.Trim());
dict.Add(keys[1].Value.Trim(), "");
}
}
}
}
catch (Exception ex)
{
throw new Exception("Email is not in correct format");
}
Live Demo

Here is my attempt. I don't know if the email format can change (rows, columns, etc).
I can't think of an easy way to separate the columns besides checking for a double space (my solution).
class Program
{
static void Main(string[] args)
{
var emailBody = GetEmail();
using (var reader = new StringReader(emailBody))
{
var lines = new List<string>();
const int startingRow = 2; // Starting line to read from (start at Marketing ID line)
const int sectionItems = 4; // Header row (ex. Marketing ID & Local Number Line) + Dash Row + Value Row + New Line
// Add all lines to a list
string line = "";
while ((line = reader.ReadLine()) != null)
{
lines.Add(line.Trim()); // Add each line to the list and remove any leading or trailing spaces
}
for (var i = startingRow; i < lines.Count; i += sectionItems)
{
var currentLine = lines[i];
var indexToBeginSeparatingColumns = currentLine.IndexOf(" "); // The first time we see double spaces, we will use as the column delimiter, not the best solution but should work
var header1 = currentLine.Substring(0, indexToBeginSeparatingColumns);
var header2 = currentLine.Substring(indexToBeginSeparatingColumns, currentLine.Length - indexToBeginSeparatingColumns).Trim();
currentLine = lines[i+2]; //Skip dash line
indexToBeginSeparatingColumns = currentLine.IndexOf(" ");
string value1 = "", value2 = "";
if (indexToBeginSeparatingColumns == -1) // Use case of there being no value in the 2nd column, could be better
{
value1 = currentLine.Trim();
}
else
{
value1 = currentLine.Substring(0, indexToBeginSeparatingColumns);
value2 = currentLine.Substring(indexToBeginSeparatingColumns, currentLine.Length - indexToBeginSeparatingColumns).Trim();
}
Console.WriteLine(string.Format("{0},{1},{2},{3}", header1, value1, header2, value2));
}
}
}
static string GetEmail()
{
return #"EMAIL STARTING IN APRIL
Marketing ID Local Number
------------------- ----------------------
GR332230 0000232323
Dispatch Code Logic code
----------------- -------------------
GX3472 1
Destination ID Destination details
----------------- -------------------
3411144";
}
}
Output looks something like this:
Marketing ID,GR332230,Local Number,0000232323
Dispatch Code,GX3472,Logic code,1
Destination ID,3411144,Destination details,

Here is an aproach asuming you don't need the headers, info comes in order and mandatory.
This won't work for data that has spaces or optional fields.
foreach (MailItem mail in publicFolder.Items)
{
MessageBox.Show(mail.Body, "MailItem body");
// Split by line, remove dash lines.
var data = Regex.Split(mail.Body, #"\r?\n|\r")
.Where(l => !l.StartsWith('-'))
.ToList();
// Remove headers
for(var i = data.Count -2; lines >= 0; i -2)
{
data.RemoveAt(i);
}
// now data contains only the info you want in the order it was presented.
// Asuming info doesn't have spaces.
var result = data.SelectMany(d => d.Split(' '));
// WARNING: Missing info will not be present.
// {"GR332230", "0000232323", "GX3472", "1", "3411144"}
}

Related

Can't properly rebuild a string with Replacement values from Dictionary

I am trying to build a file using a template. I am processing the file in a while loop line by line. The first section of the file, first 35 lines are header information. The infromation is surrounded by # signs. Take this string for example:
Field InspectionStationID 3 {"PVA TePla #WSM#", "sw#data.tool_context.TOOL_SOFTWARE_VERSION#", "#data.context.TOOL_ENTITY#"}
The expected output should be:
Field InspectionStationID 3 {"PVA TePla", "sw0.2.002", "WSM102"}
This header section uses a different mapping than the rest of the file so I wanted to parse the file line by line from top to bottom and use a different logic for each section so that I don't waste time parsing the entire file at once multiple times for different sections.
The logic uses two dictionaries populated from an xml file. Because the file has mutliple tables, I combined them in the two dictionaries like so:
var headerCdataIndexKeyVals = Dictionary<string, int>(){
{"data.tool_context.TOOL_SOFTWARE_VERSION", 1},
{"data.context.TOOL_ENTITY",0}
};
var headerCdataArrayKeyVals = new Dictionary<string, List<string>>();
var tool_contextCdataList = new list <string>{"HM654", "sw0.2.002"};
var contextCdataList = new List<string>{"WSM102"}
headerCdataArrayKeyVals.add("tool_context", tool_contextCdataList);
headerCdataArrayKeyVals.add("context", contextCdataList);
To help me map the values to their respective positions in the string in one go and without having to loop through multiple dictionaries.
I am using the following logic:
public static string FindSubsInDelimetersAndReturn(string str, char openDelimiter, char closeDelimiter, HeaderMapperData mapperData )
{
string newString = string.Empty;
// Stores the indices of
Stack <int> dels = new Stack <int>();
for (int i = 0; i < str.Length; i++)
{
var let = str[i];
// If opening delimeter
// is encountered
if (str[i] == openDelimiter && dels.Count == 0)
{
dels.Push(i);
}
// If closing delimeter
// is encountered
else if (str[i] == closeDelimiter && dels.Count > 0)
{
// Extract the position
// of opening delimeter
int pos = dels.Peek();
dels.Pop();
// Length of substring
int len = i - 1 - pos;
// Extract the substring
string headerSubstring = str.Substring(pos + 1, len);
bool hasKey = mapperData.HeaderCdataIndexKeyVals.TryGetValue(headerSubstring.ToUpper(), out int headerCdataIndex);
string[] headerSubstringSplit = headerSubstring.Split('.');
string headerCDataVal = string.Empty;
if (hasKey)
{
if (headerSubstring.Contains("CONTAINER.CONTEXT", StringComparison.OrdinalIgnoreCase))
{
headerCDataVal = mapperData.HeaderCdataArrayKeyVals[headerSubstringSplit[1].ToUpper() + '.' + headerSubstringSplit[2].ToUpper()][headerCdataIndex];
//mapperData.HeaderCdataArrayKeyVals[]
}
else
{
headerCDataVal = mapperData.HeaderCdataArrayKeyVals[headerSubstringSplit[1].ToUpper()][headerCdataIndex];
}
string strToReplace = openDelimiter + headerSubstring + closeDelimiter;
string sub = str.Remove(i + 1);
sub = sub.Replace(strToReplace, headerCDataVal);
newString += sub;
}
else if (headerSubstring == "WSM" && closeDelimiter == '#')
{
string sub = str.Remove(len + 1);
newString += sub.Replace(openDelimiter + headerSubstring + closeDelimiter, "");
}
else
{
newString += let;
}
}
}
return newString;
}
}
But my output turns out to be:
"\tFie\tField InspectionStationID 3 {\"PVA TePla#WSM#\", \"sw0.2.002\tField InspectionStationID 3 {\"PVA TePla#WSM#\", \"sw#data.tool_context.TOOL_SOFTWARE_VERSION#\", \"WSM102"
Can someone help understand why this is happening and how I can go about correcting it so I get the output:
Field InspectionStationID 3 {"PVA TePla", "sw0.2.002", "WSM102"}
Am i even trying to solve this the right way or is there a better cleaner way to do it? Btw if the key is not in the dictionary I replace it with empty string

separate string with characters as number and ) in C#

I don't have much experience in C#. i am getting string from DB like
string strType = "1) Step to start workorder 1 2)step 2 continue 3)issue of workorder4)create workorder by name" // String is not fixed any numbers of Steps can be included.
I wanted to separate out above string like
1)step to start workorder
2)step 2 continue
3)issue of workorder
4)create workorder by name (SO ON.....)
i tried following but its static if i get more step it will fail....also solution is not good
string[] stringSeparators = new string[] { "1)", "2)", "3)", "4)" };
string[] strNames = strType.Split(stringSeparators, StringSplitOptions.None );
foreach (string strName in firstNames)
Console.WriteLine(strName);
How can I separate out string based on number and ) characters. best solution for any string...
Try the below code -
var pat = #"\d+[\)]";
var str= "1) Step to start workorder 1 2)step 2 continue 3)issue of workorder40)create workorder by name";
var rgx = new Regex(pat);
var output = new List<string>();
var matches = rgx.Matches(str);
for(int i=0;i<matches.Count-1;i++)
{
output.Add(str.Substring(matches[i].Index, matches[i+1].Index- matches[i].Index));
Console.WriteLine(str.Substring(matches[i].Index, matches[i + 1].Index - matches[i].Index));
}
output.Add(str.Substring(matches[matches.Count - 1].Index));
Console.WriteLine(str.Substring(matches[matches.Count - 1].Index));
A straightforward approach is to split this string using a regular expression, and then work with the matched substrings:
string strType = "1) Step to start workorder 1 2)step 2 continue 3)issue of workorder4)create workorder by name";
var matches = Regex.Matches(strType, #"\d+\).*?(?=\d\)|$)");
foreach(Match match in matches)
Console.WriteLine(match.Value);
This will print
1) Step to start workorder 1
2)step 2 continue
3)issue of workorder
4)create workorder by name
The regular expression works as follows:
\d+\): Match "n)", where n is any decimal number
.*?: Match all characters until...
(?=\d\)|$): either the next "n)" follows, or the input string end is reached (this is called a lookahead)
If you want to cleanly replace the numbering by one with a more consistent formatting, you might use
string strType = "1) Step to start workorder 1 2)step 2 continue 3)issue of workorder4)create workorder by name";
int ctr = 0;
var matches = Regex.Matches(strType, #"\d+\)\s*(.*?)(?=\d\)|$)");
foreach(Match match in matches)
if(match.Groups.Count > 0)
Console.WriteLine($"{++ctr}) {match.Groups[1]}");
...which outputs
1) Step to start workorder 1
2) step 2 continue
3) issue of workorder
4) create workorder by name
The regular expression works similarly to first approach:
\d+\)\s*: Match "n)" and any following whitespace (to address inconsistent spacing)
(.*?): Match all characters and use this as match group #1
(?=\d\)|$): Lookahead, same as above
Note that only the match group #1 is printed, so the "n)" and the whitespace are omitted.
Assuming the schema is:
"[{number})Text] [{number})Text] [{number})Text]..."
Here is a solution:
string strType = "1) Step to start workorder 1 2)step 2 continue 3)issue of workorder 4)create workorder by name";
var result = new List<string>();
int count = strType.Count(c => c == ')');
if ( count > 0 )
{
int posCurrent = strType.IndexOf(')');
int delta = posCurrent - 1;
if ( count == 1 && posCurrent > 0)
result.Add(strType.Trim());
else
{
posCurrent = strType.IndexOf(')', posCurrent + 1);
int posFirst = 0;
int posSplit = 0;
do
{
for ( posSplit = posCurrent - 1; posSplit >= 0; posSplit--)
if ( strType[posSplit] == ' ' )
break;
if ( posSplit != -1 && posSplit > posFirst)
{
result.Add(strType.Substring(posFirst, posSplit - posFirst - 1 - 1 + delta).Trim());
posFirst = posSplit + 1;
}
posCurrent = strType.IndexOf(')', posCurrent + 1);
}
while ( posCurrent != -1 && posFirst != -1 );
result.Add(strType.Substring(posFirst).Trim());
}
}
foreach (string item in result)
Console.WriteLine(item);
Console.ReadKey();
You may use Regular Expression to achieve it. Following is code for your reference:
using System.Text.RegularExpressions;
string expr = #"\d+\)";
string[] matches = Regex.Split(strType, expr);
foreach(string m in matches){
Console.WriteLine(m);
}
My system does not have Visual Studio, so please test it in yours. It should be working with minor tweaks.

Skipping lines of text file

So I had some assistance with this yesterday but now I needed to make a change and my change doesn't seem to work. Wanted to see if someone would help me with what I'm doing wrong.
So here is an example of the file read in
Now originally this was reading the FSD number of 0.264 then reading the first number in the first column of 3.4572 and the last number of that column. This worked fine. But now I know longer need that FSD number and I no longer need the first number 3.4572. Instead I need the first number that doesn't have 0.00 in the last column, and then the last number still. So this is what I have but it isn't grabbing anything at all, if (dataWithAvgVolts.Count() > 1) is skipped.
public partial class FrmTravelTime : Form
{
const string FSD__Line_Identifier = "Drilling Data";
const string Data_Start_Point_Identifier = "AVG_VOLTS";
const string Spark_Point_Identifier = "0.00";
static readonly char[] splitter = { ' ', '\t' };
public FrmTravelTime()
{
InitializeComponent();
}
private void btnCalculate_Click(object sender, EventArgs e)
{
foreach (DataGridViewRow row in DGV_Hidden.Rows)
{
FileInfo info = new FileInfo();
{
var lines = File.ReadAllLines(row.Cells["colfilelocation"].Value.ToString());
var fsdLine = lines.FirstOrDefault(line => line.Contains(FSD__Line_Identifier));
var dataWithAvgVolts = lines.SkipWhile(line => line.Contains(Data_Start_Point_Identifier + Spark_Point_Identifier + FSD__Line_Identifier)).ToList();
if (dataWithAvgVolts.Count() > 1)
{
var data = dataWithAvgVolts[1].Split(splitter);
info.startvalue = Convert.ToDouble(data[0]);
data = dataWithAvgVolts[dataWithAvgVolts.Count - 1].Split(splitter);
info.endvalue = Convert.ToDouble(data[0]);
}
info.finalnum = info.startvalue - info.endvalue;
}
}
}
public class FileInfo
{
public double startvalue;
public double endvalue;
public double firstnum;
public double finalnum;
}
So you see here the first line that finally does not have 0.00 is
3.0164 7793 1 0 0.159 0.02
So my info.startvalue will be 3.0164
This goes down a good 100ish lines until the last line of
2.7182 8089 0 0 0.015 22.19
So my info.endvalue should be 2.7182
#Amit
var lines = File.ReadAllLines(row.Cells["colfilelocation"].Value.ToString());
var fsdLine = lines.FirstOrDefault(line => line.Contains(FSD__Line_Identifier));
info.FSD = fsdLine.Substring(fsdLine.IndexOf(FSD_Identifier) + FSD_Identifier.Length, 7);
var dataWithAvgVolts = lines.SkipWhile(line => !line.Contains(Data_Start_Point_Identifier)).ToList();
var data = lines.Where(line => (!line.Contains(Data_Start_Point_Identifier) && !line.Contains(FSD__Line_Identifier) && !line.EndsWith("0.00"))).ToList();
if (data.Count > 1)
{
var line = data[0];
var tempdata = data[0].Split(splitter);
info.startvalue = Convert.ToDouble(tempdata[0]);
var thisdata = data[data.Count - 1].Split(splitter);
info.endvalue = Convert.ToDouble(thisdata[0]);
}
Your SkipWhile condition is incorrect.
line => !line.Contains(Data_Start_Point_Identifier + Spark_Point_Identifier + FSD__Line_Identifier)
Will mean AVG_VOLTS0.00Drilling Data . You are saying, skip the text file till the line does not contain AVG_VOLTS0.00Drilling Data. You do not have this is one line anywhere. So, basically, you seem to be skipping the entire file. Put up a else condition, that will work.
This line returns all rows which do not have "Drilling Data" word, "AVG_VOLTS" word, and do not end with "0.00".
var data = lines.Where(line=>(!line.Contains(Data_Start_Point_Identifier) && !line.Contains(FSD__Line_Identifier) && !line.EndsWith("0.00"))).ToList();
Then, take the rows, and process.
If condition can be
if (data.Count > 1)
{
var line = data[0];
var tempdata = data[0].Split(splitter);
info.startvalue = Convert.ToDouble(tempdata[0]);
var thisdata = data[data.Count - 1].Split(splitter);
info.endvalue = Convert.ToDouble(thisdata[0]);
}

In C#, what is the best way to parse this WIKI markup?

I need to take data that I am reading in from a WIKI markup page and store it as a table structure. I am trying to figure out how to properly parse the below markup syntax into some table data structure in C#
Here is an example table:
|| Owner || Action || Status || Comments ||
| Bill | Fix the lobby | In Progress | This is easy |
| Joe | Fix the bathroom | In Progress | Plumbing \\
\\
Electric \\
\\
Painting \\
\\
\\ |
| Scott | Fix the roof | Complete | This is expensive |
and here is how it comes in directly:
|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|
So as you can see:
The column headers have "||" as the separator
A row columns have a separator or "|"
A row might span multiple lines (as in the second data row example above) so i would have to keep reading until I hit the same number of "|" (cols) that I have in the header row.
I tried reading in line by line and then concatenating lines that had "\" in between then but that seemed a bit hacky.
I also tried to simply read in as a full string and then just parse by "||" first and then keep reading until I hit the same number of "|" and then go to the next row. This seemed to work but it feel like there might be a more elegant way using regular expressions or something similar.
Can anyone suggest the correct way to parse this data?
I have largely replaced the previous answer, due to the fact that the format of the input after your edit is substantially different from the one posted before. This leads to a somewhat different solution.
Because there are no longer any line breaks after a row, the only way to determine for sure where a row ends, is to require that each row has the same number of columns as the table header. That is at least if you don't want to rely on some potentially fragile white space convention present in the one and only provided example string (i.e. that the row separator is the only | not preceded by a space). Your question at least does not provide this as the specification for a row delimiter.
The below "parser" provides at least the error handling validity checks that can be derived from your format specification and example string and also allows for tables that have no rows. The comments explain what it is doing in basic steps.
public class TableParser
{
const StringSplitOptions SplitOpts = StringSplitOptions.None;
const string RowColSep = "|";
static readonly string[] HeaderColSplit = { "||" };
static readonly string[] RowColSplit = { RowColSep };
static readonly string[] MLColSplit = { #"\\" };
public class TableRow
{
public List<string[]> Cells;
}
public class Table
{
public string[] Header;
public TableRow[] Rows;
}
public static Table Parse(string text)
{
// Isolate the header columns and rows remainder.
var headerSplit = text.Split(HeaderColSplit, SplitOpts);
Ensure(headerSplit.Length > 1, "At least 1 header column is required in the input");
// Need to check whether there are any rows.
var hasRows = headerSplit.Last().IndexOf(RowColSep) >= 0;
var header = headerSplit.Skip(1)
.Take(headerSplit.Length - (hasRows ? 2 : 1))
.Select(c => c.Trim())
.ToArray();
if (!hasRows) // If no rows for this table, we are done.
return new Table() { Header = header, Rows = new TableRow[0] };
// Get all row columns from the remainder.
var rowsCols = headerSplit.Last().Split(RowColSplit, SplitOpts);
// Require same amount of columns for a row as the header.
Ensure((rowsCols.Length % (header.Length + 1)) == 1,
"The number of row colums does not match the number of header columns");
var rows = new TableRow[(rowsCols.Length - 1) / (header.Length + 1)];
// Fill rows by sequentially taking # header column cells
for (int ri = 0, start = 1; ri < rows.Length; ri++, start += header.Length + 1)
{
rows[ri] = new TableRow() {
Cells = rowsCols.Skip(start).Take(header.Length)
.Select(c => c.Split(MLColSplit, SplitOpts).Select(p => p.Trim()).ToArray())
.ToList()
};
};
return new Table { Header = header, Rows = rows };
}
private static void Ensure(bool check, string errorMsg)
{
if (!check)
throw new InvalidDataException(errorMsg);
}
}
When used like this:
public static void Main(params string[] args)
{
var wikiLine = #"|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|";
var table = TableParser.Parse(wikiLine);
Console.WriteLine(string.Join(", ", table.Header));
foreach (var r in table.Rows)
Console.WriteLine(string.Join(", ", r.Cells.Select(c => string.Join(Environment.NewLine + "\t# ", c))));
}
It will produce the below output:
Where "\t# " represents a newline caused by the presence of \\ in the input.
Here's a solution which populates a DataTable. It does require a litte bit of data massaging (Trim), but the main parsing is Splits and Linq.
var str = #"|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|";
var headerStop = str.LastIndexOf("||");
var headers = str.Substring(0, headerStop).Split(new string[1] { "||" }, StringSplitOptions.None).Skip(1).ToList();
var records = str.Substring(headerStop + 4).TrimEnd(new char[2] { ' ', '|' }).Split(new string[1] { "| |" }, StringSplitOptions.None).ToList();
var tbl = new DataTable();
headers.ForEach(h => tbl.Columns.Add(h.Trim()));
records.ForEach(r => tbl.Rows.Add(r.Split('|')));
This makes some assumptions but seems to work for your sample data. I'm sure if I worked at I could combine the expressions and clean it up but you'll get the idea.
It will also allow for rows that do not have the same number of cells as the header which I think is something confluence can do.
List<List<string>> table = new List<List<string>>();
var match = Regex.Match(raw, #"(?:(?:\|\|([^|]*))*\n)?");
if (match.Success)
{
var headersWithExtra = match.Groups[1].Captures.Cast<Capture>().Select(c=>c.Value);
List<String> headerRow = headersWithExtra.Take(headersWithExtra.Count()-1).ToList();
if (headerRow.Count > 0)
{
table.Add(headerRow);
}
}
match = Regex.Match(raw + "\r\n", #"[^\n]*\n" + #"(?:\|([^|]*))*");
var cellsWithExtra = match.Groups[1].Captures.Cast<Capture>().Select(c=>c.Value);
List<string> row = new List<string>();
foreach (string cell in cellsWithExtra)
{
if (cell.Trim(' ', '\t') == "\r\n")
{
if (!table.Contains(row) && row.Count > 0)
{
table.Add(row);
}
row = new List<string>();
}
else
{
row.Add(cell);
}
}
This ended up very similar to Jon Tirjan's answer, although it cuts the LINQ to a single statement (the code to replace that last one was horrifically ugly) and is a bit more extensible. For example, it will replace the Confluence line breaks \\ with a string of your choosing, you can choose to trim or not trim whitespace from around elements, etc.
private void ParseWikiTable(string input, string newLineReplacement = " ")
{
string separatorHeader = "||";
string separatorRow = "| |";
string separatorElement = "|";
input = Regex.Replace(input, #"[ \\]{2,}", newLineReplacement);
string inputHeader = input.Substring(0, input.LastIndexOf(separatorHeader));
string inputContent = input.Substring(input.LastIndexOf(separatorHeader) + separatorHeader.Length);
string[] headerArray = SimpleSplit(inputHeader, separatorHeader);
string[][] rowArray = SimpleSplit(inputContent, separatorRow).Select(r => SimpleSplit(r, separatorElement)).ToArray();
// do something with output data
TestPrint(headerArray);
foreach (var r in rowArray) { TestPrint(r); }
}
private string[] SimpleSplit(string input, string separator, bool trimWhitespace = true)
{
input = input.Trim();
if (input.StartsWith(separator)) { input = input.Substring(separator.Length); }
if (input.EndsWith(separator)) { input = input.Substring(0, input.Length - separator.Length); }
string[] segments = input.Split(new string[] { separator }, StringSplitOptions.None);
if (trimWhitespace)
{
for (int i = 0; i < segments.Length; i++)
{
segments[i] = segments[i].Trim();
}
}
return segments;
}
private void TestPrint(string[] lst)
{
string joined = "[" + String.Join("::", lst) + "]";
Console.WriteLine(joined);
}
Console output from your direct input string:
[Owner::Action::Status::Comments]
[Bill::fix the lobby::In Progress::This is eary]
[Joe::fix the bathroom::In progress::plumbing Electric Painting]
[Scott::fix the roof::Complete::this is expensive]
A generic regex solution that populate a datatable and is a little flexible with the syntax.
var text = #"|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|";
// Get Headers
var regHeaders = new Regex(#"\|\|\s*(\w[^\|]+)", RegexOptions.Compiled);
var headers = regHeaders.Matches(text);
//Get Rows, based on number of headers columns
var regLinhas = new Regex(String.Format(#"(?:\|\s*(\w[^\|]+)){{{0}}}", headers.Count));
var rows = regLinhas.Matches(text);
var tbl = new DataTable();
foreach (Match header in headers)
{
tbl.Columns.Add(header.Groups[1].Value);
}
foreach (Match row in rows)
{
tbl.Rows.Add(row.Groups[1].Captures.OfType<Capture>().Select(col => col.Value).ToArray());
}
Here's a solution involving regular expressions. It takes a single string as input and returns a List of headers and a List> of rows/columns. It also trims white space, which may or may not be the desired behavior, so be aware of that. It even prints things nicely :)
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
namespace parseWiki
{
class Program
{
static void Main(string[] args)
{
string content = #"|| Owner || Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|";
content = content.Replace(#"\\", "");
string headerContent = content.Substring(0, content.LastIndexOf("||") + 2);
string cellContent = content.Substring(content.LastIndexOf("||") + 2);
MatchCollection headerMatches = new Regex(#"\|\|([^|]*)(?=\|\|)", RegexOptions.Singleline).Matches(headerContent);
MatchCollection cellMatches = new Regex(#"\|([^|]*)(?=\|)", RegexOptions.Singleline).Matches(cellContent);
List<string> headers = new List<string>();
foreach (Match match in headerMatches)
{
if (match.Groups.Count > 1)
{
headers.Add(match.Groups[1].Value.Trim());
}
}
List<List<string>> body = new List<List<string>>();
List<string> newRow = new List<string>();
foreach (Match match in cellMatches)
{
if (newRow.Count > 0 && newRow.Count % headers.Count == 0)
{
body.Add(newRow);
newRow = new List<string>();
}
else
{
newRow.Add(match.Groups[1].Value.Trim());
}
}
body.Add(newRow);
print(headers, body);
}
static void print(List<string> headers, List<List<string>> body)
{
var CELL_SIZE = 20;
for (int i = 0; i < headers.Count; i++)
{
Console.Write(headers[i].Truncate(CELL_SIZE).PadRight(CELL_SIZE) + " ");
}
Console.WriteLine("\n" + "".PadRight( (CELL_SIZE + 2) * headers.Count, '-'));
for (int r = 0; r < body.Count; r++)
{
List<string> row = body[r];
for (int c = 0; c < row.Count; c++)
{
Console.Write(row[c].Truncate(CELL_SIZE).PadRight(CELL_SIZE) + " ");
}
Console.WriteLine("");
}
Console.WriteLine("\n\n\n");
Console.ReadKey(false);
}
}
public static class StringExt
{
public static string Truncate(this string value, int maxLength)
{
if (string.IsNullOrEmpty(value) || value.Length <= maxLength) return value;
return value.Substring(0, maxLength - 3) + "...";
}
}
}
Read the input string one character at a time and use a state-machine to decide what should be done with each input character. This approach probably needs more code, but it will be easier to maintain and to extend than regular expressions.

C# Index of for space and next informations

Please, can you help me please. I have complete select adress from DB but this adress contains adress and house number but i need separately adress and house number.
I created two list for this distribution.
while (reader_org.Read())
{
string s = reader_org.GetString(0);
string ulice, cp, oc;
char mezera = ' ';
if (s.Contains(mezera))
{
Match m = Regex.Match(s, #"(\d+)");
string numStr = m.Groups[0].Value;
if (numStr.Length > 0)
{
s = s.Replace(numStr, "").Trim();
int number = Convert.ToInt32(numStr);
}
Match l = Regex.Match(s, #"(\d+)");
string numStr2 = l.Groups[0].Value;
if (numStr2.Length > 0)
{
s = s.Replace(numStr2, "").Trim();
int number = Convert.ToInt32(numStr2);
}
if (s.Contains('/'))
s = s.Replace('/', ' ').Trim();
MessageBox.Show("Adresa: " + s);
MessageBox.Show("CP:" + numStr);
MessageBox.Show("OC:" + numStr2);
}
else
{
Definitions.Ulice.Add(s);
}
}
You might find the street name consists of multiple words, or the number appears before the street name. Also potentially some houses might not have a number. Here's a way of dealing with all that.
//extract the first number found in the address string, wherever that number is.
Match m = Regex.Match(address, #"((\d+)/?(\d+))");
string numStr = m.Groups[0].Value;
string streetName = address.Replace(numStr, "").Trim();
//if a number was found then convert it to numeric
//also remove it from the address string, so now the address string only
//contains the street name
if (numStr.Length > 0)
{
string streetName = address.Replace(numStr, "").Trim();
if (numStr.Contains('/'))
{
int num1 = Convert.ToInt32(m.Groups[2].Value);
int num2 = Convert.ToInt32(m.Groups[3].Value);
}
else
{
int number = Convert.ToInt32(numStr);
}
}
Use .Split on your string that results. Then you can index into the result and get the parts of your string.
var parts = s.Split(' ');
// you can get parts[0] etc to access each part;
using (SqlDataReader reader_org = select_org.ExecuteReader())
{
while (reader_org.Read())
{
string s = reader_org.GetString(0); // this return me for example KarlĂ­nkova 514 but i need separately adress (karlĂ­nkova) and house number (514) with help index of or better functions. But now i dont know how can i make it.
var values = s.Split(' ');
var address = values.Count > 0 ? values[0]: null;
var number = values.Count > 1 ? int.Parse(values[1]) : 0;
//Do what ever you want with address and number here...
}
Here is a way to split it the address into House Number and Address without regex and only using the functions of the String class.
var fullAddress = "1111 Awesome Point Way NE, WA 98122";
var index = fullAddress.IndexOf(" "); //Gets the first index of space
var houseNumber = fullAddress.Remove(index);
var address = fullAddress.Remove(0, (index + 1));
Console.WriteLine(houseNumber);
Console.WriteLine(address);
Output: 1111
Output: Awesome Point Way NE, WA 98122

Categories