In C#, what is the best way to parse this WIKI markup?


I need to take data that I am reading in from a WIKI markup page and store it as a table structure. I am trying to figure out how to properly parse the below markup syntax into some table data structure in C#
Here is an example table:
|| Owner || Action || Status || Comments ||
| Bill | Fix the lobby | In Progress | This is easy |
| Joe | Fix the bathroom | In Progress | Plumbing \\
\\
Electric \\
\\
Painting \\
\\
\\ |
| Scott | Fix the roof | Complete | This is expensive |
and here is how it comes in directly:
|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|
So as you can see:
The column headers have "||" as the separator
Row columns have "|" as the separator
A row might span multiple lines (as in the second data row of the example above), so I would have to keep reading until I hit the same number of "|" (cols) that I have in the header row.
I tried reading in line by line and then concatenating lines that had "\\" in between them, but that seemed a bit hacky.
I also tried to simply read in the full string, parse by "||" first, and then keep reading until I hit the same number of "|" before going to the next row. This seemed to work, but it feels like there might be a more elegant way using regular expressions or something similar.
Can anyone suggest the correct way to parse this data?

I have largely replaced the previous answer, due to the fact that the format of the input after your edit is substantially different from the one posted before. This leads to a somewhat different solution.
Because there are no longer any line breaks after a row, the only reliable way to determine where a row ends is to require that each row has the same number of columns as the table header. That is, unless you want to rely on a potentially fragile whitespace convention present in the single example string provided (i.e. that the row separator is the only | not preceded by a space). Your question does not specify this as the row delimiter.
The "parser" below provides the error handling and validity checks that can be derived from your format specification and example string, and it also allows for tables that have no rows. The comments explain the basic steps.
public class TableParser
{
const StringSplitOptions SplitOpts = StringSplitOptions.None;
const string RowColSep = "|";
static readonly string[] HeaderColSplit = { "||" };
static readonly string[] RowColSplit = { RowColSep };
static readonly string[] MLColSplit = { @"\\" };
public class TableRow
{
public List<string[]> Cells;
}
public class Table
{
public string[] Header;
public TableRow[] Rows;
}
public static Table Parse(string text)
{
// Isolate the header columns and rows remainder.
var headerSplit = text.Split(HeaderColSplit, SplitOpts);
Ensure(headerSplit.Length > 1, "At least 1 header column is required in the input");
// Need to check whether there are any rows.
var hasRows = headerSplit.Last().IndexOf(RowColSep) >= 0;
var header = headerSplit.Skip(1)
.Take(headerSplit.Length - (hasRows ? 2 : 1))
.Select(c => c.Trim())
.ToArray();
if (!hasRows) // If no rows for this table, we are done.
return new Table() { Header = header, Rows = new TableRow[0] };
// Get all row columns from the remainder.
var rowsCols = headerSplit.Last().Split(RowColSplit, SplitOpts);
// Require same amount of columns for a row as the header.
Ensure((rowsCols.Length % (header.Length + 1)) == 1,
"The number of row columns does not match the number of header columns");
var rows = new TableRow[(rowsCols.Length - 1) / (header.Length + 1)];
// Fill rows by sequentially taking # header column cells
for (int ri = 0, start = 1; ri < rows.Length; ri++, start += header.Length + 1)
{
rows[ri] = new TableRow() {
Cells = rowsCols.Skip(start).Take(header.Length)
.Select(c => c.Split(MLColSplit, SplitOpts).Select(p => p.Trim()).ToArray())
.ToList()
};
};
return new Table { Header = header, Rows = rows };
}
private static void Ensure(bool check, string errorMsg)
{
if (!check)
throw new InvalidDataException(errorMsg);
}
}
When used like this:
public static void Main(params string[] args)
{
var wikiLine = @"|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|";
var table = TableParser.Parse(wikiLine);
Console.WriteLine(string.Join(", ", table.Header));
foreach (var r in table.Rows)
Console.WriteLine(string.Join(", ", r.Cells.Select(c => string.Join(Environment.NewLine + "\t# ", c))));
}
It will produce the below output:
Where "\t# " represents a newline caused by the presence of \\ in the input.

Here's a solution which populates a DataTable. It does require a little bit of data massaging (Trim), but the main parsing is Splits and LINQ.
var str = @"|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|";
var headerStop = str.LastIndexOf("||");
var headers = str.Substring(0, headerStop).Split(new string[1] { "||" }, StringSplitOptions.None).Skip(1).ToList();
var records = str.Substring(headerStop + 4).TrimEnd(new char[2] { ' ', '|' }).Split(new string[1] { "| |" }, StringSplitOptions.None).ToList();
var tbl = new DataTable();
headers.ForEach(h => tbl.Columns.Add(h.Trim()));
records.ForEach(r => tbl.Rows.Add(r.Split('|')));

This makes some assumptions but seems to work for your sample data. I'm sure that if I worked at it I could combine the expressions and clean it up, but you'll get the idea.
It also allows for rows that do not have the same number of cells as the header, which I think is something Confluence can do.
List<List<string>> table = new List<List<string>>();
var match = Regex.Match(raw, @"(?:(?:\|\|([^|]*))*\n)?");
if (match.Success)
{
var headersWithExtra = match.Groups[1].Captures.Cast<Capture>().Select(c=>c.Value);
List<String> headerRow = headersWithExtra.Take(headersWithExtra.Count()-1).ToList();
if (headerRow.Count > 0)
{
table.Add(headerRow);
}
}
match = Regex.Match(raw + "\r\n", @"[^\n]*\n" + @"(?:\|([^|]*))*");
var cellsWithExtra = match.Groups[1].Captures.Cast<Capture>().Select(c=>c.Value);
List<string> row = new List<string>();
foreach (string cell in cellsWithExtra)
{
if (cell.Trim(' ', '\t') == "\r\n")
{
if (!table.Contains(row) && row.Count > 0)
{
table.Add(row);
}
row = new List<string>();
}
else
{
row.Add(cell);
}
}

This ended up very similar to Jon Tirjan's answer, although it cuts the LINQ to a single statement (the code to replace that last one was horrifically ugly) and is a bit more extensible. For example, it will replace the Confluence line breaks \\ with a string of your choosing, you can choose to trim or not trim whitespace from around elements, etc.
private void ParseWikiTable(string input, string newLineReplacement = " ")
{
string separatorHeader = "||";
string separatorRow = "| |";
string separatorElement = "|";
input = Regex.Replace(input, @"[ \\]{2,}", newLineReplacement);
string inputHeader = input.Substring(0, input.LastIndexOf(separatorHeader));
string inputContent = input.Substring(input.LastIndexOf(separatorHeader) + separatorHeader.Length);
string[] headerArray = SimpleSplit(inputHeader, separatorHeader);
string[][] rowArray = SimpleSplit(inputContent, separatorRow).Select(r => SimpleSplit(r, separatorElement)).ToArray();
// do something with output data
TestPrint(headerArray);
foreach (var r in rowArray) { TestPrint(r); }
}
private string[] SimpleSplit(string input, string separator, bool trimWhitespace = true)
{
input = input.Trim();
if (input.StartsWith(separator)) { input = input.Substring(separator.Length); }
if (input.EndsWith(separator)) { input = input.Substring(0, input.Length - separator.Length); }
string[] segments = input.Split(new string[] { separator }, StringSplitOptions.None);
if (trimWhitespace)
{
for (int i = 0; i < segments.Length; i++)
{
segments[i] = segments[i].Trim();
}
}
return segments;
}
private void TestPrint(string[] lst)
{
string joined = "[" + String.Join("::", lst) + "]";
Console.WriteLine(joined);
}
Console output from your direct input string:
[Owner::Action::Status::Comments]
[Bill::fix the lobby::In Progress::This is eary]
[Joe::fix the bathroom::In progress::plumbing Electric Painting]
[Scott::fix the roof::Complete::this is expensive]

A generic regex solution that populates a DataTable and is a little flexible with the syntax.
var text = @"|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|";
// Get Headers
var regHeaders = new Regex(@"\|\|\s*(\w[^\|]+)", RegexOptions.Compiled);
var headers = regHeaders.Matches(text);
//Get Rows, based on number of headers columns
var regLinhas = new Regex(String.Format(@"(?:\|\s*(\w[^\|]+)){{{0}}}", headers.Count));
var rows = regLinhas.Matches(text);
var tbl = new DataTable();
foreach (Match header in headers)
{
tbl.Columns.Add(header.Groups[1].Value);
}
foreach (Match row in rows)
{
tbl.Rows.Add(row.Groups[1].Captures.OfType<Capture>().Select(col => col.Value).ToArray());
}

Here's a solution involving regular expressions. It takes a single string as input and returns a List<string> of headers and a List<List<string>> of rows/columns. It also trims white space, which may or may not be the desired behavior, so be aware of that. It even prints things nicely :)
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
namespace parseWiki
{
class Program
{
static void Main(string[] args)
{
string content = @"|| Owner || Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|";
content = content.Replace(@"\\", "");
string headerContent = content.Substring(0, content.LastIndexOf("||") + 2);
string cellContent = content.Substring(content.LastIndexOf("||") + 2);
MatchCollection headerMatches = new Regex(@"\|\|([^|]*)(?=\|\|)", RegexOptions.Singleline).Matches(headerContent);
MatchCollection cellMatches = new Regex(@"\|([^|]*)(?=\|)", RegexOptions.Singleline).Matches(cellContent);
List<string> headers = new List<string>();
foreach (Match match in headerMatches)
{
if (match.Groups.Count > 1)
{
headers.Add(match.Groups[1].Value.Trim());
}
}
List<List<string>> body = new List<List<string>>();
List<string> newRow = new List<string>();
foreach (Match match in cellMatches)
{
if (newRow.Count > 0 && newRow.Count % headers.Count == 0)
{
body.Add(newRow);
newRow = new List<string>();
}
else
{
newRow.Add(match.Groups[1].Value.Trim());
}
}
body.Add(newRow);
print(headers, body);
}
static void print(List<string> headers, List<List<string>> body)
{
var CELL_SIZE = 20;
for (int i = 0; i < headers.Count; i++)
{
Console.Write(headers[i].Truncate(CELL_SIZE).PadRight(CELL_SIZE) + " ");
}
Console.WriteLine("\n" + "".PadRight( (CELL_SIZE + 2) * headers.Count, '-'));
for (int r = 0; r < body.Count; r++)
{
List<string> row = body[r];
for (int c = 0; c < row.Count; c++)
{
Console.Write(row[c].Truncate(CELL_SIZE).PadRight(CELL_SIZE) + " ");
}
Console.WriteLine("");
}
Console.WriteLine("\n\n\n");
Console.ReadKey(false);
}
}
public static class StringExt
{
public static string Truncate(this string value, int maxLength)
{
if (string.IsNullOrEmpty(value) || value.Length <= maxLength) return value;
return value.Substring(0, maxLength - 3) + "...";
}
}
}

Read the input string one character at a time and use a state-machine to decide what should be done with each input character. This approach probably needs more code, but it will be easier to maintain and to extend than regular expressions.
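As a rough illustration of the state-machine idea, here is a minimal sketch. It assumes the flat single-line input from the question, that every row has exactly as many cells as the header, and that \\ marks a line break inside a cell. The class name WikiTableStateMachine, the State enum and the Clean helper are made up for this example and are not from any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class WikiTableStateMachine
{
    private enum State { Start, HeaderCell, RowCell, BetweenRows }

    // Returns the header as element 0, followed by one list per data row.
    public static List<List<string>> Parse(string text)
    {
        var table = new List<List<string>>();
        var header = new List<string>();
        var row = new List<string>();
        var cell = new StringBuilder();
        var state = State.Start;

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            bool hasNext = i + 1 < text.Length;
            bool doublePipe = c == '|' && hasNext && text[i + 1] == '|';
            bool lineBreak = c == '\\' && hasNext && text[i + 1] == '\\';

            switch (state)
            {
                case State.Start: // waiting for the "||" that opens the header
                    if (doublePipe) { state = State.HeaderCell; i++; }
                    else if (!char.IsWhiteSpace(c)) throw new FormatException("Expected '||' at position " + i);
                    break;

                case State.HeaderCell: // "||" closes a header cell, a single "|" opens the first row
                    if (doublePipe) { header.Add(Clean(cell)); i++; }
                    else if (c == '|') { cell.Clear(); row = new List<string>(); state = State.RowCell; }
                    else cell.Append(c);
                    break;

                case State.RowCell: // "|" closes a cell; a full row ends when it matches the header width
                    if (lineBreak) { cell.Append('\n'); i++; }
                    else if (c == '|')
                    {
                        row.Add(Clean(cell));
                        if (row.Count == header.Count) { table.Add(row); state = State.BetweenRows; }
                    }
                    else cell.Append(c);
                    break;

                case State.BetweenRows: // only whitespace is allowed before the next row-opening "|"
                    if (c == '|') { row = new List<string>(); state = State.RowCell; }
                    else if (!char.IsWhiteSpace(c)) throw new FormatException("Expected '|' at position " + i);
                    break;
            }
        }

        bool danglingHeader = state == State.HeaderCell && cell.ToString().Trim().Length > 0;
        if (state == State.Start || state == State.RowCell || danglingHeader)
            throw new FormatException("Input ended in the middle of a header or row");

        table.Insert(0, header);
        return table;
    }

    // Trims each line of the cell, drops empty lines, and resets the buffer.
    private static string Clean(StringBuilder cell)
    {
        var lines = cell.ToString().Split('\n').Select(l => l.Trim()).Where(l => l.Length > 0);
        cell.Clear();
        return string.Join("\n", lines);
    }
}
```

Calling WikiTableStateMachine.Parse on the input string from the question should give four lists: the header and three rows. Supporting rows with fewer cells than the header would need an explicit row terminator, which this sketch does not attempt.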

Related

C# Selenium, getting text of Invisible elements is slow

I am trying to get invisible text that is on a given page using the C# Selenium EdgeDriver. I am able to do it: I get the elements (such as those within span, p or b tags), filter them into a list based on the Displayed property, and finally call GetAttribute("textContent") to get the text. The problem is that this is slow, about 10 seconds for the page I am working with. Is there a better way to do this, or a way to make it faster?
Thanks,
public static string GetInvisibleText()
{
Stopwatch s500 = new Stopwatch();
s500.Start();
string returnable = "\r\n";
var elements = driver.FindElements(By.XPath("//b | //span | //p | //a | //h1 | //h2 | //h3 | //h4 | //h5 | //h6 | //div"));
List<string> list = new List<string>();
var displayed_elements = elements.Where(e => !e.Displayed);
foreach(var el in displayed_elements)
{
try
{
string val = el.GetAttribute("textContent");
val = val.Trim();
val = Regex.Replace(val, @"\s+", " ");
list.Add(val);
}
catch (Exception ex)
{
}
}
list = list.Distinct(StringComparer.OrdinalIgnoreCase).ToList();
foreach (string line in list)
{
returnable = returnable + line + "\r\n";
}
s500.Stop();
return returnable;
}

String split with specified string without delimiter

Updated - When searched value is in middle
string text = "Trio charged over alleged $100m money laundering syndicate at Merrylands, Guildford West";
string searchtext= "charged over";
string[] fragments = text.Split(new string[] { searchtext }, StringSplitOptions.None);
//Fragments
//if [0] is blank, the searched text is at the beginning - searchtext + [1]
//if [1] is blank, the searched text is at the end - [0] + searchtext
//If the searched text is in the middle then both items have a value - [0] + searchtext + [1]
//This loop will execute only two times because there can be at most 2 values. The issue
//comes when the searched value is in the middle (the loop should run 3 times), because I have to apply different logic to the searched value (like changing the background color of the text)
//and not change the background color for the head and tail
//How do I insert the searched value between [0] and [1]?
I have a string without a delimiter which I am trying to split based on a searched string. My requirement is to split the string into two: one part contains the string without the searchtext and the other contains the searchtext, like below -
Original String - "Bitcoin ATMs Highlight Flaws in EU Money Laundering Rules"
String 1 - Bitcoin ATMs Highlight Flaws in EU
String 2 - Money Laundering Rules
I have written the code below. It works for the above sample value, but it fails for the following input:
Failed - not returning String 1 and String 2; the string is empty
string watch = " Money Laundering Rules Bitcoin ATMs Highlight Flaws in EU";
string serachetxt = "Money Laundering Rules";
This works -
List<string> matchedstr = new List<string>();
string watch = "Bitcoin ATMs Highlight Flaws in EU Money Laundering Rules";
string serachetxt = "Money Laundering Rules";
string compa = watch.Substring(0,watch.IndexOf(serachetxt)); //It returns "Bitcoin ATMs Highlight Flaws in EU"
matchedstr.Add(compa);
matchedstr.Add(serachetxt);
foreach(var itemco in matchedstr)
{
}

You could just consider "Money Laundering Rules" to be the delimiter. Then you can write
string[] result = watch.Split(new string[] { searchtext }, StringSplitOptions.None);
Then you can add the delimiter again
string result1 = result[0];
string result2 = searchtext + result[1];

Use string.Split.
string text = "Bitcoin ATMs Highlight Flaws in EU Money Laundering Rules";
string searchtext = "Money Laundering Rules";
string[] fragments = text.Split(new string[] { searchtext }, StringSplitOptions.None);
fragments will equal:
[0] "Bitcoin ATMs Highlight Flaws in EU "
[1] ""
Everywhere there is a gap between consecutive array elements, your search string appears. e.g.:
string originaltext = string.Join(searchtext, fragments);
Extended Description of String.Split Behaviour
Here is a quick table of the behaviour of string.Split when passed a string.
| Input | Split | Result Array |
+--------+-------+--------------------+
| "ABC" | "A" | { "", "BC" } |
| "ABC" | "B" | { "A", "C" } |
| "ABC" | "C" | { "AB", "" } |
| "ABC" | "D" | { "ABC" } |
| "ABC" | "ABC" | { "", "" } |
| "ABBA" | "A" | { "", "BB", "" } |
| "ABBA" | "B" | { "A", "", "A" } |
| "AAA" | "A" | { "", "", "", "" } |
| "AAA" | "AA" | { "", "A" } |
If you look at the table above, every place where there is a comma in the array (between two consecutive elements) is a place where the split string was found.
If the string was not found, then the result array is only one element (the original string).
If the split string is found at the beginning of the input string, then an empty string is set as the first element of the result array to represent the beginning of the string. Similarly, if the split string is found at the end of the string, an empty string is set as the last element of the result array.
Also, an empty string is included between any consecutive occurrences of the search string in the input string.
In cases where there are ambiguous overlapping locations at which the string could be found in the input string: (e.g. splitting AAA on AA could be split as AA|A or A|AA - where AA is found at position 0 or position 1 in the input string) then the earlier location is used. (e.g. AA|A, resulting in { "", "A" } ).
Again, the invariant is that the original string can always be reconstructed by joining all the fragments and placing exactly one occurrence of the search text in between elements. The following will always be true:
string.Join(searchtext, fragments) == text
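To make the invariant concrete, here is a small sketch that checks it for arbitrary inputs. The class and method names (SplitInvariantDemo, SplitAll, RoundTrips) are made up for this example; only string.Split and string.Join are real framework calls.

```csharp
using System;

public static class SplitInvariantDemo
{
    // Splits text on a string separator, keeping empty fragments.
    public static string[] SplitAll(string text, string searchtext)
        => text.Split(new[] { searchtext }, StringSplitOptions.None);

    // True when re-joining the fragments with the separator restores the input.
    public static bool RoundTrips(string text, string searchtext)
        => string.Join(searchtext, SplitAll(text, searchtext)) == text;
}
```

For the rows in the table above, RoundTrips("ABBA", "B") and RoundTrips("AAA", "AA") should both return true. Note that the invariant relies on StringSplitOptions.None: with RemoveEmptyEntries the empty fragments are dropped and the original string can no longer be rebuilt.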
If you only want the first split...
You can merge all results after the first back together like this:
if (fragments.Length > 1) {
fragments = new string[] { fragments[0], string.Join(searchtext, fragments.Skip(1)) };
}
... or a more efficient way using String.IndexOf
If you just want to find the first location of the search text string then use String.IndexOf to get the position of the first occurrence of the search text in the input string.
Here's a complete function you can use
private static bool TrySplitOnce(string text, string searchtext, out string beforetext, out string aftertext)
{
int pos = text.IndexOf(searchtext);
if (pos < 0) {
// not found
beforetext = null;
aftertext = null;
return false;
} else {
// found at position `pos`
beforetext = text.Substring(0, pos); // may be ""
aftertext = text.Substring(pos + searchtext.Length); // may be ""
return true;
}
}
You can use this to produce an array, if you like.
usage:
string text = "red or white or blue";
string searchtext = "or";
if (TrySplitOnce(text, searchtext, out string before, out string after)) {
Console.WriteLine("{0}*{1}", before, after);
// output:
// red * white or blue
string[] array = new string[] { before, searchtext, after };
// array == { "red ", "or", " white or blue" };
Console.WriteLine(string.Join("|", array));
// output:
// red |or| white or blue
} else {
Console.WriteLine("Not found");
}
output:
red * white or blue
red |or| white or blue

You can write your own extension method for this:
// Splits s at sep with sep included at beginning of each part except first
// return no more than numParts parts
public static IEnumerable<string> SplitsBeforeInc(this string s, string sep, int numParts = Int32.MaxValue)
=> s.Split(new[] { sep }, numParts, StringSplitOptions.None).Select((p,i) => i > 0 ? sep+p : p);
And use it with:
foreach(var itemco in watch.SplitsBeforeInc(serachetxt, 2))
Here is the same method in a non-LINQ version:
// Splits s at sep with sep included at beginning of each part except first
// return no more than numParts parts
public static IEnumerable<string> SplitsBeforeInc(this string s, string sep, int numParts = Int32.MaxValue) {
var startPos = 0;
var searchPos = 0;
while (startPos < s.Length && --numParts > 0) {
var sepPos = s.IndexOf(sep, searchPos);
sepPos = sepPos < 0 ? s.Length : sepPos;
yield return s.Substring(startPos, sepPos - startPos);
startPos = sepPos;
searchPos = sepPos+sep.Length;
}
if (startPos < s.Length)
yield return s.Substring(startPos);
}

You can try this
string text = "Trio charged over alleged $100m money laundering syndicate at Merrylands, Guildford West";
string searchtext = "charged over";
// Regex.Escape keeps characters such as $ or . in the search text from being read as regex syntax
string searchtextPattern = "(?=" + Regex.Escape(searchtext) + ")";
string[] fragments = Regex.Split(text, searchtextPattern);
// fragments will have two elements here
// fragments[0] - "Trio "
// fragments[1] - "charged over alleged $100m money laundering syndicate at Merrylands, Guildford West"
Now you can split the fragment that contains the search text again, i.e. fragments[1] in this case.
See the code below:
var stringWithoutSearchText = fragments[1].Replace(searchtext, string.Empty);
You need to check whether each fragment contains the search text or not. You can do that in your foreach loop over fragments. Add the check below there:
foreach (var item in fragments)
{
if (item.Contains(searchtext))
{
string stringWithoutSearchText = item.Replace(searchtext, string.Empty);
}
}
Reference : https://stackoverflow.com/a/521172/8652887
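An alternative to the lookahead, in case the head, the match and the tail are all needed as separate items: Regex.Split includes the text of capturing groups in its result, so wrapping the escaped search text in parentheses keeps the matches in the array. The sketch below assumes the search text should be matched literally; SplitKeepingMatch is a made-up name for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitKeepingMatch
{
    // Splits text around every occurrence of searchtext and keeps the
    // occurrences themselves in the result, at the odd indexes.
    public static string[] Split(string text, string searchtext)
    {
        // Regex.Escape makes the search text literal; the parentheses form a
        // capturing group, whose captured text Regex.Split adds to its output.
        string pattern = "(" + Regex.Escape(searchtext) + ")";
        return Regex.Split(text, pattern);
    }
}
```

With this, a loop over the result can treat elements at odd indexes as the searched value (for example, to change their background color) and elements at even indexes as head or tail. When the match sits at the very start or end of the input, the corresponding head or tail element is an empty string.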

Parse a multiline email to var

I'm attempting to parse a multi-line email so I can get at the data which is on its own newline under the heading in the body of the email.
It looks like this:
EMAIL STARTING IN APRIL
Marketing ID Local Number
------------------- ----------------------
GR332230 0000232323
Dispatch Code Logic code
----------------- -------------------
GX3472 1
Destination ID Destination details
----------------- -------------------
3411144
It appears I am getting everything in each message box when I use StringReader.ReadLine, though all I want is the data under each ------ line, as shown.
This is my code:
foreach (MailItem mail in publicFolder.Items)
{
if (mail != null)
{
if (mail is MailItem)
{
MessageBox.Show(mail.Body, "MailItem body");
// Creates new StringReader instance from System.IO
using (StringReader reader = new StringReader(mail.Body))
{
string line;
while ((line = reader.ReadLine()) !=null)
//Loop over the lines in the string.
if (mail.Body.Contains("Marketing ID"))
{
// var localno = mail.Body.Substring(247,15);//not correct approach
// MessageBox.Show(localrefno);
//MessageBox.Show("found");
//var conexid = mail.Body.Replace(Environment.NewLine);
var regex = new Regex("<br/>", RegexOptions.Singleline);
MessageBox.Show(line.ToString());
}
}
//var stringBuilder = new StringBuilder();
//foreach (var s in mail.Body.Split(' '))
//{
// stringBuilder.Append(s).AppendLine();
//}
//MessageBox.Show(stringBuilder.ToString());
}
else
{
MessageBox.Show("Nothing found for MailItem");
}
}
}
You can see I had numerous attempts with it, even using substring position and using regex. Please help me get the data from each line under the ---.

It is not a very good idea to do this with Regex, because it is easy to forget edge cases, hard to understand, and hard to debug. It is also quite easy to end up with a Regex that pegs your CPU and times out. (I cannot comment on other answers yet, so please check at least my other two cases before you pick your final solution.)
In your case, the following Regex solution works for the example you provided. However, there is an additional limitation: you need to make sure there are no empty values in any column other than the first or the last. In other words, if there are more than two columns and one in the middle is empty, the names and values of that line will be mismatched.
Unfortunately, I cannot give you a non-Regex solution because I don't know the spec, e.g.: Will there be empty spaces? Will there be TABs? Does each field have a fixed number of characters, or are they flexible? If they are flexible and can have empty values, what rules detect which columns are empty? I assume it is quite possible that the columns are defined by the column name's length and have only spaces as the delimiter. If that's the case, there are two ways to solve it: a two-pass Regex, or writing your own parser. If all the fields have a fixed length, it would be even easier: just use Substring to cut the lines and then trim them.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
public class Program
{
public class Record{
public string Name {get;set;}
public string Value {get;set;}
}
public static void Main()
{
var regex = new Regex(@"(?<name>((?!-)[\w]+[ ]?)*)(?>(?>[ \t]+)?(?<name>((?!-)[\w]+[ ]?)+)?)+(?:\r\n|\r|\n)(?>(?<splitters>(-+))(?>[ \t]+)?)+(?:\r\n|\r|\n)(?<value>((?!-)[\w]+[ ]?)*)(?>(?>[ \t]+)?(?<value>((?!-)[\w]+[ ]?)+)?)+", RegexOptions.Compiled);
var testingValue =
@"EMAIL STARTING IN APRIL
Marketing ID Local Number
------------------- ----------------------
GR332230 0000232323
Dispatch Code Logic code
----------------- -------------------
GX3472 1
Destination ID Destination details
----------------- -------------------
3411144";
var matches = regex.Matches(testingValue);
var rows = (
from match in matches.OfType<Match>()
let row = (
from grp in match.Groups.OfType<Group>()
select new {grp.Name, Captures = grp.Captures.OfType<Capture>().ToList()}
).ToDictionary(item=>item.Name, item=>item.Captures.OfType<Capture>().ToList())
let names = row.ContainsKey("name")? row["name"] : null
let splitters = row.ContainsKey("splitters")? row["splitters"] : null
let values = row.ContainsKey("value")? row["value"] : null
where names != null && splitters != null &&
names.Count == splitters.Count &&
(values==null || values.Count <= splitters.Count)
select new {Names = names, Values = values}
);
var records = new List<Record>();
foreach(var row in rows)
{
for(int i=0; i< row.Names.Count; i++)
{
records.Add(new Record{Name=row.Names[i].Value, Value=i < row.Values.Count ? row.Values[i].Value : ""});
}
}
foreach(var record in records)
{
Console.WriteLine(record.Name + " = " + record.Value);
}
}
}
output:
Marketing ID = GR332230
Local Number = 0000232323
Dispatch Code = GX3472
Logic code = 1
Destination ID = 3411144
Destination details =
Please note that this also works for this kind of message:
EMAIL STARTING IN APRIL
Marketing ID Local Number
------------------- ----------------------
GR332230 0000232323
Dispatch Code Logic code
----------------- -------------------
GX3472 1
Destination ID Destination details
----------------- -------------------
3411144
output:
Marketing ID = GR332230
Local Number = 0000232323
Dispatch Code = GX3472
Logic code = 1
Destination ID =
Destination details = 3411144
Or this:
EMAIL STARTING IN APRIL
Marketing ID Local Number
------------------- ----------------------
Dispatch Code Logic code
----------------- -------------------
GX3472 1
Destination ID Destination details
----------------- -------------------
3411144
output:
Marketing ID =
Local Number =
Dispatch Code = GX3472
Logic code = 1
Destination ID =
Destination details = 3411144
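For the "write your own parser" route mentioned above, here is a minimal sketch. It rests on an assumption that is not confirmed by the question: that in the real email each value is aligned under its run of dashes, so the dash row can be used to locate the columns. The names DashColumnParser, IsDashRow and Slice are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DashColumnParser
{
    // Reads "header / dashes / values" triples. Each run of dashes marks
    // where a column starts; a column extends to the start of the next run.
    public static List<KeyValuePair<string, string>> Parse(string body)
    {
        var result = new List<KeyValuePair<string, string>>();
        string[] lines = Regex.Split(body, @"\r\n|\r|\n");

        for (int i = 1; i < lines.Length; i++)
        {
            if (!IsDashRow(lines[i])) continue;

            string header = lines[i - 1];
            // If the line after the candidate value line is itself a dash row,
            // the candidate is really the next header and this section has no values.
            bool hasValues = i + 1 < lines.Length
                && !(i + 2 < lines.Length && IsDashRow(lines[i + 2]));
            string values = hasValues ? lines[i + 1] : "";

            MatchCollection runs = Regex.Matches(lines[i], "-+");
            for (int c = 0; c < runs.Count; c++)
            {
                int start = runs[c].Index;
                int end = c + 1 < runs.Count ? runs[c + 1].Index : int.MaxValue;
                result.Add(new KeyValuePair<string, string>(
                    Slice(header, start, end), Slice(values, start, end)));
            }
        }
        return result;
    }

    // A dash row contains at least "--" and nothing except dashes and whitespace.
    private static bool IsDashRow(string line)
        => line.Contains("--") && line.Trim('-', ' ', '\t').Length == 0;

    // Cuts [start, end) out of the line, tolerating lines shorter than the column.
    private static string Slice(string line, int start, int end)
    {
        if (start >= line.Length) return "";
        int length = Math.Min(end, line.Length) - start;
        return line.Substring(start, length).Trim();
    }
}
```

Because the columns come from positions rather than from counting tokens, an empty first or middle column does not shift the remaining values. If the mail client collapses runs of spaces, the alignment is lost and this approach will not work.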

var dict = new Dictionary<string, string>();
try
{
var lines = email.Split(Environment.NewLine.ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
int starts = 0, end = 0, length = 0;
while (!lines[starts + 1].StartsWith("-")) starts++;
for (int i = starts + 1; i < lines.Length; i += 3)
{
var mc = Regex.Matches(lines[i], @"(?:^| )-");
foreach (Match m in mc)
{
int start = m.Value.StartsWith(" ") ? m.Index + 1 : m.Index;
end = start;
while (lines[i][end++] == '-' && end < lines[i].Length - 1) ;
length = Math.Min(end - start, lines[i - 1].Length - start);
string key = length > 0 ? lines[i - 1].Substring(start, length).Trim() : "";
end = start;
while (lines[i][end++] == '-' && end < lines[i].Length) ;
length = Math.Min(end - start, lines[i + 1].Length - start);
string value = length > 0 ? lines[i + 1].Substring(start, length).Trim() : "";
dict.Add(key, value);
}
}
}
catch (Exception ex)
{
throw new Exception("Email is not in correct format");
}
Using Regular Expressions:
var dict = new Dictionary<string, string>();
try
{
var lines = email.Split(Environment.NewLine.ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
int starts = 0;
while (!lines[starts + 1].StartsWith("-")) starts++;
for (int i = starts + 1; i < lines.Length; i += 3)
{
var keys = Regex.Matches(lines[i - 1], @"(?:^| )(\w+\s?)+");
var values = Regex.Matches(lines[i + 1], @"(?:^| )(\w+\s?)+");
if (keys.Count == values.Count)
for (int j = 0; j < keys.Count; j++)
dict.Add(keys[j].Value.Trim(), values[j].Value.Trim());
else // remove bug if value of first key in a line has no value
{
if (lines[i + 1].StartsWith(" "))
{
dict.Add(keys[0].Value.Trim(), "");
dict.Add(keys[1].Value.Trim(), values[0].Value.Trim());
}
else
{
dict.Add(keys[0].Value, values[0].Value.Trim());
dict.Add(keys[1].Value.Trim(), "");
}
}
}
}
catch (Exception ex)
{
throw new Exception("Email is not in correct format");
}

Here is my attempt. I don't know if the email format can change (rows, columns, etc).
I can't think of an easy way to separate the columns besides checking for a double space (my solution).
class Program
{
static void Main(string[] args)
{
var emailBody = GetEmail();
using (var reader = new StringReader(emailBody))
{
var lines = new List<string>();
const int startingRow = 2; // Starting line to read from (start at Marketing ID line)
const int sectionItems = 4; // Header row (ex. Marketing ID & Local Number Line) + Dash Row + Value Row + New Line
// Add all lines to a list
string line = "";
while ((line = reader.ReadLine()) != null)
{
lines.Add(line.Trim()); // Add each line to the list and remove any leading or trailing spaces
}
for (var i = startingRow; i < lines.Count; i += sectionItems)
{
var currentLine = lines[i];
var indexToBeginSeparatingColumns = currentLine.IndexOf("  "); // The first time we see double spaces, we will use as the column delimiter, not the best solution but should work
var header1 = currentLine.Substring(0, indexToBeginSeparatingColumns);
var header2 = currentLine.Substring(indexToBeginSeparatingColumns, currentLine.Length - indexToBeginSeparatingColumns).Trim();
currentLine = lines[i+2]; //Skip dash line
indexToBeginSeparatingColumns = currentLine.IndexOf("  ");
string value1 = "", value2 = "";
if (indexToBeginSeparatingColumns == -1) // Use case of there being no value in the 2nd column, could be better
{
value1 = currentLine.Trim();
}
else
{
value1 = currentLine.Substring(0, indexToBeginSeparatingColumns);
value2 = currentLine.Substring(indexToBeginSeparatingColumns, currentLine.Length - indexToBeginSeparatingColumns).Trim();
}
Console.WriteLine(string.Format("{0},{1},{2},{3}", header1, value1, header2, value2));
}
}
}
static string GetEmail()
{
return @"EMAIL STARTING IN APRIL
Marketing ID Local Number
------------------- ----------------------
GR332230 0000232323
Dispatch Code Logic code
----------------- -------------------
GX3472 1
Destination ID Destination details
----------------- -------------------
3411144";
}
}
Output looks something like this:
Marketing ID,GR332230,Local Number,0000232323
Dispatch Code,GX3472,Logic code,1
Destination ID,3411144,Destination details,

Here is an approach assuming you don't need the headers and that the info comes in order and is mandatory.
This won't work for data that has spaces or optional fields.
foreach (MailItem mail in publicFolder.Items)
{
MessageBox.Show(mail.Body, "MailItem body");
// Split by line, remove dash lines.
var data = Regex.Split(mail.Body, @"\r?\n|\r")
.Where(l => !l.StartsWith('-'))
.ToList();
// Remove headers
for (var i = data.Count - 2; i >= 0; i -= 2)
{
data.RemoveAt(i);
}
// now data contains only the info you want in the order it was presented.
// Assuming info doesn't have spaces.
var result = data.SelectMany(d => d.Split(' '));
// WARNING: Missing info will not be present.
// {"GR332230", "0000232323", "GX3472", "1", "3411144"}
}

Parsing a complex table structure

I am trying to parse the following table with C#:
+-------------+-----------------------------------------------------------------------------------+----------------+
| 1 | 2 | 3 |
+-------------+-----------------------------------------------------------------------------------+----------------+
| 000 | Собственные средства (капитал), итого, | |
| | в том числе: | 1024231079 |
+-------------+-----------------------------------------------------------------------------------+----------------+
| 100 |Источники базового капитала: | 1291298211 |
+-------------+-----------------------------------------------------------------------------------+----------------+
| 100.1 |Уставный капитал кредитной организации: | 651033884 |
+-------------+-----------------------------------------------------------------------------------+----------------+
| 100.1.1 |сформированный обыкновенными акциями | 129605413 |
+-------------+-----------------------------------------------------------------------------------+----------------+
| 100.1.2 |сформированный привилегированными акциями | 521428471 |
+-------------+-----------------------------------------------------------------------------------+----------------+
| 100.1.3 |сформированный долями | 0 |
+-------------+-----------------------------------------------------------------------------------+----------------+
| 100.2 |Эмиссионный доход: | 439401101 |
+-------------+-----------------------------------------------------------------------------------+----------------+
| 100.2.1 |кредитной организации в организационно-правовой форме акционерного общества, всего,| |
| | в том числе: | 439401101 |
+-------------+-----------------------------------------------------------------------------------+----------------+
My code is
string[] dels = { "\r\n" };
string[] strArr = someStr.Split(dels, StringSplitOptions.None);
Console.WriteLine(strArr);
foreach (String sourcestring in strArr)
{
if (sourcestring != null)
{
Console.WriteLine("Processing string: ");
Console.WriteLine(sourcestring);
//Regex regex = new Regex(@"^(\|)(.*)(\|)(.*[а-я]{3}.*)(\|)(.*\d+.*)(\|)(.*[\d+|Х].*)(\|)(.*[\d+|Х].*)(\|)(.*\d+.*)(\|)$");
//Regex regex = new Regex(@"^(\|)(\s?|\d+[\.?])(\|)(.*[а-я]{3}.*)(\|)(.*\d+.*)(\|)(.*[\d+|Х].*)(\|)(.*[\d+|Х].*)(\|)(.*\d+.*)(\|)$");
Regex regex = new Regex(@"^(\|)(\d+\.?\d+)");
MatchCollection mc = regex.Matches(sourcestring);
int mIdx = 0;
foreach (Match m in mc)
{
for (int gIdx = 0; gIdx < m.Groups.Count; gIdx++)
{
Console.WriteLine("[{0}][{1}] = {2}", mIdx, regex.GetGroupNames()[gIdx], m.Groups[gIdx].Value);
}
mIdx++;
}
Console.WriteLine("---------------------------------------------------------");
}
}
I need to extract the values of lines
4 - ' 000 ', ' Собственные средства (капитал), итого, ', ' '
5 - ' ', ' в том числе: ', ' 1024231079 '
and lines 7, 9, and so on.
The main issue now is that I don't know how to write a regular expression that finds the first-column values, which could be:
' 000 '
' '
' 100 '
' 100.1 '
' 100.1.1 '
and so on.
The second issue is the second column. I tried to parse it with (.*[а-я]{3}.*), but it failed on lines that contain symbols such as '(', ',', '.', ':'.
I'd appreciate any possible solutions.

I think regular expressions would be overkill in this case; a simple, manual parsing approach would be a lot easier:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Two approaches that might work in this case:
Parse the first line (+---+--- ...) to determine the width of each column, then extract the cells with Substring.
Split each line by |.
Below, I've outlined the basics of the second approach (no sanity checks).
If your data can contain | too, you might want to parse the data based on cell size rather than splitting by it.
// Row is defined below - simple data storage for the three columns
List<Row> rows = new List<Row>();
Row currentRow = null;
// Process each line
foreach (string line in input.Split(new string[] {"\r\n"}, StringSplitOptions.RemoveEmptyEntries))
{
    // Row separator or content?
    if (line.StartsWith("+"))
    {
        // A border line closes the row collected so far
        if (currentRow != null)
        {
            rows.Add(currentRow);
            currentRow = null;
        }
    }
    else if (line.StartsWith("|"))
    {
        string[] parts = line.Split(new char[] {'|'});
        if (currentRow == null)
            currentRow = new Row();
        // Continuation lines are appended to the same row; might need additional processing
        currentRow.Column1 += parts[1].Trim();
        currentRow.Column2 += parts[2].TrimEnd();
        currentRow.Column3 += parts[3].TrimStart();
    }
    else
    {
        // Invalid data?
    }
}
// Show result
foreach (Row row in rows)
{
    Console.WriteLine("[{0}][{1}] = {2}", row.Column1, row.Column2, row.Column3);
}
Instead of a custom class you could of course use a Tuple<string,string,string> or whatever fits your data types.
public class Row
{
    public string Column1 = "";
    public string Column2 = "";
    public string Column3 = "";
}
Example on DotNetFiddle
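
The first approach (column widths taken from the border line) is not shown above, so here is a rough sketch of how it might look. The names FixedWidthTable, FindBoundaries and SliceCells are hypothetical, and the sketch assumes that the table still has its original padding, so that every content line is as wide as the +---+ border line and each character is a single char.

```csharp
using System;
using System.Collections.Generic;

public static class FixedWidthTable
{
    // Returns the index of every '+' in a border line such as "+-----+-------+".
    public static List<int> FindBoundaries(string borderLine)
    {
        var boundaries = new List<int>();
        for (int i = 0; i < borderLine.Length; i++)
        {
            if (borderLine[i] == '+')
                boundaries.Add(i);
        }
        return boundaries;
    }

    // Cuts one content line into cells at the boundary positions,
    // so a '|' inside a cell does not break the parse.
    public static string[] SliceCells(string line, List<int> boundaries)
    {
        var cells = new string[Math.Max(0, boundaries.Count - 1)];
        for (int c = 0; c < cells.Length; c++)
        {
            int start = boundaries[c] + 1;
            if (start >= line.Length)
            {
                cells[c] = ""; // line is shorter than the border line
                continue;
            }
            int length = Math.Min(boundaries[c + 1] - start, line.Length - start);
            cells[c] = line.Substring(start, length).Trim();
        }
        return cells;
    }
}
```

The boundaries only need to be computed once, from the first border line; the row-collecting loop above can then call SliceCells instead of line.Split.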

Find string in an array based on first word

I'm trying to filter an array based on two keywords that must appear in order. So far I've got this:
string[] matchedOne = Array.FindAll(converList, s => s.Contains(split[1]));
string[] matchedTwo = Array.FindAll(matchedOne, s => s.Contains(split[2]));
if (matchedTwo.Length == 0)
{
Console.Clear();
Console.WriteLine("Sorry, Your Conversion is invalid");
Main();
}
converlist =
ounce,gram,28.0
ounce,fake,28.0 - Fake one I added for examples
gram, ounce, 3.0 - Fake one I added for examples
pound,ounce,16.0
pound,kilogram,0.454
pint,litre,0.568
inch, centimetre,2.5
mile,inch,63360.0
If the user types in "5,ounce,gram", then "matchedOne" should find "ounce,gram,28.0" and "ounce,fake,28.0", rather than those plus "pound,ounce,16.0" and "gram, ounce, 3.0" as it does now.
Then "matchedTwo" should find only "ounce,gram,28.0", rather than that plus "gram, ounce, 3.0".
-- Just to add: I can't use anything beyond "using System;".

var regex = new Regex(
string.Format("^.*{0}.*{1}.*$",
Regex.Escape(split[1]), Regex.Escape(split[2])),
RegexOptions.IgnoreCase
| RegexOptions.Multiline
| RegexOptions.CultureInvariant
| RegexOptions.IgnorePatternWhitespace
| RegexOptions.Compiled
);
Match m = regex.Match(converlist);
This basically just matches a line where split[1] comes before split[2].
Or, without regex:
string[] match = Array.FindAll(matchedOne, s => s.IndexOf(split[1])==-1?false: s.IndexOf(split[2], s.IndexOf(split[1])) != -1);
And a working conversion:
const string converlist = "ounce,gram,28.0\r\nounce,fake,28.0\r\ngram, ounce, 3.0\r\npound,ounce,16.0\r\npound,kilogram,0.454\r\npint,litre,0.568\r\ninch, centimetre,2.5\r\nmile,inch,63360.0\r\n";
var split = "5,ounce,gram".Split(new[] { ',' });
var list = converlist.Split(new[]{"\r\n"}, StringSplitOptions.RemoveEmptyEntries).ToList();
var matches = list.FindAll(s => s.IndexOf(split[1])==-1?false: s.IndexOf(split[2], s.IndexOf(split[1])) != -1);
var conversionLine = matches[0];
Console.WriteLine(conversionLine);
var conversionFactor = decimal.Parse(conversionLine.Split(new[] { ',' })[2]);
var valueToConvert = decimal.Parse(split[0].Trim());
Console.WriteLine(string.Format("{0} {2}s is {1} {3}s", valueToConvert, conversionFactor * valueToConvert, split[1], split[2]));

The above answer satisfies what you need; however, this will also work:
string[] s = { "ounce,gram", "gram,ounce", "pound,carret" };
foreach (string temp in s.Where(x => (x.IndexOf("ounce")>-1) && (x.IndexOf("ounce") < x.IndexOf("gram"))))
Debug.WriteLine(temp);
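
One caveat with substring searches such as IndexOf or Contains is that "gram" also occurs inside "kilogram". A possible alternative, sketched below with only the System namespace, compares whole comma-separated fields instead. ConversionLookup and FindConversion are illustrative names, and the sketch assumes every entry has the form from,to,factor.

```csharp
using System;

public static class ConversionLookup
{
    // Returns the first entry whose first two comma-separated fields equal
    // fromUnit and toUnit (in that order), or null when nothing matches.
    public static string FindConversion(string[] entries, string fromUnit, string toUnit)
    {
        foreach (string entry in entries)
        {
            string[] fields = entry.Split(',');
            if (fields.Length < 3)
                continue; // skip malformed lines

            // Trim so that "gram, ounce, 3.0" is treated like "gram,ounce,3.0".
            bool fromMatches = string.Equals(fields[0].Trim(), fromUnit.Trim(), StringComparison.OrdinalIgnoreCase);
            bool toMatches = string.Equals(fields[1].Trim(), toUnit.Trim(), StringComparison.OrdinalIgnoreCase);
            if (fromMatches && toMatches)
                return entry;
        }
        return null;
    }
}
```

With the list from the question, FindConversion(converList, split[1], split[2]) would return the "ounce,gram,28.0" entry for the input "5,ounce,gram", and null for a pair that is not listed.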

In C#, what is the best way to parse this WIKI markup? - c#

Read the input string one character at a time and use a state-machine to decide what should be done with each input character. This approach probably needs more code, but it will be easier to maintain and to extend than regular expressions.
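
As a rough illustration of that idea, the sketch below walks the single-line form of the markup with three states. Everything in it is an assumption rather than a specification of the wiki syntax: the names (WikiTableParser, Parse), the rule that a data row is complete once it has as many cells as the header, the treatment of \\ as a line break inside a cell, and the requirement that the input starts with a || header row.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class WikiTableParser
{
    private enum State { ExpectRowStart, InHeaderCell, InDataCell }

    // Parses "|| h1 || h2 || | a | b | | c | d |" into a header and data rows.
    public static List<string[]> Parse(string input, out string[] header)
    {
        var headerCells = new List<string>();
        var rows = new List<string[]>();
        var row = new List<string>();
        var cell = new StringBuilder();
        var state = State.ExpectRowStart;

        for (int i = 0; i < input.Length; i++)
        {
            char ch = input[i];
            bool doubled = i + 1 < input.Length && input[i + 1] == ch;

            switch (state)
            {
                case State.ExpectRowStart:
                    if (ch != '|') break;          // ignore whitespace between rows
                    if (headerCells.Count == 0 && doubled)
                    {
                        i++;                       // consume the second '|'
                        state = State.InHeaderCell;
                    }
                    else
                    {
                        state = State.InDataCell;
                    }
                    break;

                case State.InHeaderCell:
                    if (ch == '|' && doubled)
                    {
                        headerCells.Add(cell.ToString().Trim());
                        cell.Clear();
                        i++;
                    }
                    else if (ch == '|')
                    {
                        cell.Clear();              // a single '|' starts the first data row
                        state = State.InDataCell;
                    }
                    else
                    {
                        cell.Append(ch);
                    }
                    break;

                case State.InDataCell:
                    if (ch == '|')
                    {
                        row.Add(cell.ToString().Trim());
                        cell.Clear();
                        if (row.Count == headerCells.Count)
                        {
                            rows.Add(row.ToArray()); // row has as many cells as the header
                            row.Clear();
                            state = State.ExpectRowStart;
                        }
                    }
                    else if (ch == '\\' && doubled)
                    {
                        cell.Append('\n');         // "\\" is treated as a line break in the cell
                        i++;
                    }
                    else
                    {
                        cell.Append(ch);
                    }
                    break;
            }
        }

        header = headerCells.ToArray();
        return rows;
    }
}
```

Because leading and trailing whitespace is trimmed from every cell, a trailing \\ such as the one in "Bill\\" disappears, while line breaks in the middle of a cell are kept. Adding a new markup rule later means adding a branch to one state rather than reworking a regular expression.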
