Parsing a complex table structure - c#

I am trying to parse with C#
+-------------+-----------------------------------------------------------------------------------+----------------+
| 1 | 2 | 3 |
+-------------+-----------------------------------------------------------------------------------+----------------+
| 000 | Собственные средства (капитал), итого, | |
| | в том числе: | 1024231079 |
+-------------+-----------------------------------------------------------------------------------+----------------+
| 100 |Источники базового капитала: | 1291298211 |
+-------------+-----------------------------------------------------------------------------------+----------------+
| 100.1 |Уставный капитал кредитной организации: | 651033884 |
+-------------+-----------------------------------------------------------------------------------+----------------+
| 100.1.1 |сформированный обыкновенными акциями | 129605413 |
+-------------+-----------------------------------------------------------------------------------+----------------+
| 100.1.2 |сформированный привилегированными акциями | 521428471 |
+-------------+-----------------------------------------------------------------------------------+----------------+
| 100.1.3 |сформированный долями | 0 |
+-------------+-----------------------------------------------------------------------------------+----------------+
| 100.2 |Эмиссионный доход: | 439401101 |
+-------------+-----------------------------------------------------------------------------------+----------------+
| 100.2.1 |кредитной организации в организационно-правовой форме акционерного общества, всего,| |
| | в том числе: | 439401101 |
+-------------+-----------------------------------------------------------------------------------+----------------+
My code is
string[] dels = { "\r\n" };
string[] strArr = someStr.Split(dels, StringSplitOptions.None);
Console.WriteLine(strArr);
foreach (String sourcestring in strArr)
{
if (sourcestring != null)
{
Console.WriteLine("Processing string: ");
Console.WriteLine(sourcestring);
//Regex regex = new Regex(#"^(\|)(.*)(\|)(.*[а-я]{3}.*)(\|)(.*\d+.*)(\|)(.*[\d+|Х].*)(\|)(.*[\d+|Х].*)(\|)(.*\d+.*)(\|)$");
//Regex regex = new Regex(#"^(\|)(\s?|\d+[\.?])(\|)(.*[а-я]{3}.*)(\|)(.*\d+.*)(\|)(.*[\d+|Х].*)(\|)(.*[\d+|Х].*)(\|)(.*\d+.*)(\|)$");
Regex regex = new Regex(#"^(\|)(\d+\.?\d+)");
MatchCollection mc = regex.Matches(sourcestring);
int mIdx = 0;
foreach (Match m in mc)
{
for (int gIdx = 0; gIdx < m.Groups.Count; gIdx++)
{
Console.WriteLine("[{0}][{1}] = {2}", mIdx, regex.GetGroupNames()[gIdx], m.Groups[gIdx].Value);
}
mIdx++;
}
Console.WriteLine("---------------------------------------------------------");
}
}
I need to extract values of lines
4 - ' 000 ', ' Собственные средства (капитал), итого, ', ' '
5 - ' ', ' в том числе: ', ' 1024231079 '
and line 7, 9...
The main issue now it that I don't know how to make reg exp to find in the first column values, that could be:
' 000 '
' '
' 100 '
' 100.1 '
' 100.1.1 '
and etc.
The second issue is in the second column. I've tried to parse it with the (.*[а-я]{3}.*), but it failed on lines, which contain such symbols, like '(', ',', '.', ':'.
I'll appreciate all possible solutions.

I think RegEx would be overkill in this case, a simple, manual parse approach would be a lot easier:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Two approaches that might work in this case:
Parse the first line (+---+--- ...) to determine the length of each column and parse your data by separation it with Substring.
Split each column by |.
Below, I've outlines the basics for the second approach (No sanity checks).
If your data can contain | too, you might want to parse the data based on cell-size rather than splitting by it.
// Row is defined below - simple data storage for three the columns
List<Row> rows = new List<Row>();
Row currentRow = null;
// Process each line
foreach (string line in input.Split(new string[] {"\r\n"}, StringSplitOptions.RemoveEmptyEntries))
{
// Row separator or content?
if (line.StartsWith("+"))
{
if (currentRow != null)
{
rows.Add(currentRow);
currentRow = null;
}
}
else if (line.StartsWith("|"))
{
string[] parts = line.Split(new char[] {'|'});
if(currentRow == null)
currentRow = new Row();
// Might need additional processing
currentRow.Column1 += parts[1].Trim();
currentRow.Column2 += parts[2].TrimEnd();
currentRow.Column3 += parts[3].TrimStart();
}
else
{
//Invalid data?
}
}
// Show result
foreach(Row row in rows)
{
Console.WriteLine("[{0}][{1}] = {2}", row.Column1, row.Column2, row.Column3);
}
Instead of a custom class you could of course use a Tuple<string,string,string> or whatever fits your data types.
public class Row
{
public string Column1 = "";
public string Column2 = "";
public string Column3 = "";
}
Example on DotNetFiddle

Related

String split with specified string without delimeter

Updated - When searched value is in middle
string text = "Trio charged over alleged $100m money laundering syndicate at Merrylands, Guildford West";
string searchtext= "charged over";
string[] fragments = text.Split(new string[] { searchtext }, StringSplitOptions.None);
//Fragments
//if [0] is blank searched text is in the beginning - searchedtext + [1]
//if [1] is blank searched text is in the end - [0] + searched text
// If searched text is in middle then both items has value - [0] + seachedtext + [1]
//This loop will execute only two times because it can have maximum 2 values, issue will
//come when searched value is in middle (loop should run 3 times) as for the searched value i have to apply differnt logic (like change background color of the text)
// and dont change background color for head and tail
//How do i insert searched value in middle of [0] and [1] ??
I am having a string without delimeter which i am trying to split based on searched string. My requirement is split the string into two , one part contains string without the searchtext and other contains searchtext like below-
Original String - "Bitcoin ATMs Highlight Flaws in EU Money Laundering Rules"
String 1 - Bitcoin ATMs Highlight Flaws in EU
String 2 - Money Laundering Rules
I have written below code it works for the above sample value, but it failed for
Failed - Not returning String 1 and String 2, String is empty
string watch = " Money Laundering Rules Bitcoin ATMs Highlight Flaws in EU";
string serachetxt = "Money Laundering Rules";
This works -
List<string> matchedstr = new List<string>();
string watch = "Bitcoin ATMs Highlight Flaws in EU Money Laundering Rules";
string serachetxt = "Money Laundering Rules";
string compa = watch.Substring(0,watch.IndexOf(serachetxt)); //It returns "Bitcoin ATMs Highlight Flaws in EU"
matchedstr.Add(compa);
matchedstr.Add(serachetxt);
foreach(var itemco in matchedstr)
{
}
You could just consider "Money Laundering Rules" to be the delimiter. Then you can write
string[] result = watch.Split(new string[] { searchtext }, StringSplitOptions.None);
Then you can add the delimiter again
string result1 = result[0];
string result2 = searchtext + result[1];
Use string.Split.
string text = "Bitcoin ATMs Highlight Flaws in EU Money Laundering Rules";
string searchtext = "Money Laundering Rules";
string[] fragments = text.Split(new string[] { searchtext }, StringSplitOptions.None);
fragments will equal:
[0] "Bitcoin ATMs Highlight Flaws in EU "
[1] ""
Everywhere there is a gap between consecutive array elements, your search string appears. e.g.:
string originaltext = string.Join(searchtext, fragments);
Extended Description of String.Split Behaviour
Here is a quick table of the behaviour of string.Split when passed a string.
| Input | Split | Result Array |
+--------+-------+--------------------+
| "ABC" | "A" | { "", "BC" } |
| "ABC" | "B" | { "A", "C" } |
| "ABC" | "C" | { "AB", "" } |
| "ABC" | "D" | { "ABC" } |
| "ABC" | "ABC" | { "", "" } |
| "ABBA" | "A" | { "", "BB", "" } |
| "ABBA" | "B" | { "A", "", "A" } |
| "AAA" | "A" | { "", "", "", "" } |
| "AAA" | "AA" | { "", "A" } |
If you look at the table above, Every place there was a comma in the array (between two consecutive elements in the array), is a place that the split string was found.
If the string was not found, then the result array is only one element (the original string).
If the split string is found at the beginning of the input string, then an empty string is set as the first element of the result array to represent the beginning of the string. Similarly, if the split string is found at the end of the string, an empty string is set as the last element of the result array.
Also, an empty string is included between any consecutive occurrences of the search string in the input string.
In cases where there are ambiguous overlapping locations at which the string could be found in the input string: (e.g. splitting AAA on AA could be split as AA|A or A|AA - where AA is found at position 0 or position 1 in the input string) then the earlier location is used. (e.g. AA|A, resulting in { "", "A" } ).
Again, the invariant is that the original string can always be reconstructed by joining all the fragments and placing exactly one occurrence of the search text in between elements. The following will always be true:
string.Join(searchtext, fragments) == text
If you only want the first split...
You can merge all results after the first back together like this:
if (fragments.Length > 1) {
fragments = new string[] { fragments[0], string.Join(searchtext, fragments.Skip(1)) };
}
... or a more efficient way using String.IndexOf
If you just want to find the first location of the search text string then use String.IndexOf to get the position of the first occurrence of the search text in the input string.
Here's a complete function you can use
private static bool TrySplitOnce(string text, string searchtext, out string beforetext, out string aftertext)
{
int pos = text.IndexOf(searchtext);
if (pos < 0) {
// not found
beforetext = null;
aftertext = null;
return false;
} else {
// found at position `pos`
beforetext = text.Substring(0, pos); // may be ""
aftertext = text.Substring(pos + searchtext.Length); // may be ""
return true;
}
}
You can use this to produce an array, if you like.
usage:
string text = "red or white or blue";
string searchtext = "or";
if (TrySplitOnce(text, searchtext, out string before, out string after)) {
Console.WriteLine("{0}*{1}", before, after);
// output:
// red * white or blue
string[] array = new string[] { before, searchtext, after };
// array == { "red ", "or", " white or blue" };
Console.WriteLine(string.Join("|", array));
// output:
// red |or| white or blue
} else {
Console.WriteLine("Not found");
}
output:
red * white or blue
red |or| white or blue
You can write your own extension method for this:
// Splits s at sep with sep included at beginning of each part except first
// return no more than numParts parts
public static IEnumerable<string> SplitsBeforeInc(this string s, string sep, int numParts = Int32.MaxValue)
=> s.Split(new[] { sep }, numParts, StringSplitOptions.None).Select((p,i) => i > 0 ? sep+p : p);
And use it with:
foreach(var itemco in watch.SplitsBeforeInc(watch, serachetxt, 2))
Here is the same method in a non-LINQ version:
// Splits s at sep with sep included at beginning of each part except first
// return no more than numParts parts
public static IEnumerable<string> SplitsBeforeInc(this string s, string sep, int numParts = Int32.MaxValue) {
var startPos = 0;
var searchPos = 0;
while (startPos < s.Length && --numParts > 0) {
var sepPos = s.IndexOf(sep, searchPos);
sepPos = sepPos < 0 ? s.Length : sepPos;
yield return s.Substring(startPos, sepPos - startPos);
startPos = sepPos;
searchPos = sepPos+sep.Length;
}
if (startPos < s.Length)
yield return s.Substring(startPos);
}
You can try this
string text = "Trio charged over alleged $100m money laundering syndicate at Merrylands, Guildford West";
string searchtext = "charged over";
searchtextPattern = "(?=" + searchtext + ")";
string[] fragments= Regex.Split(text, searchtextPattern);
//fargments will have two elements here
// fragments[0] - "Trio"
// fragments[1] - "charged over alleged $100m money laundering syndicate at Merrylands, Guildford West"
now you can again split fragment which have search text i.e fragments[1] in this case.
see code below
var stringWithoutSearchText = fragments[1].Replace(searchtext, string.Empty);
you need to check whether each fragment contains search text or not. You can do that it your foreach loop on fragments. add below check over there
foreach (var item in fragments)
{
if (item.Contains(searchtext))
{
string stringWithoutSearchText = item.Replace(searchtext, string.Empty);
}
}
Reference : https://stackoverflow.com/a/521172/8652887

Replace repeated values in collection with its sum

I have a list of custom class ModeTime, its structure is below:
private class ModeTime
{
public DateTime Date { get; set; }
public string LineName { get; set; }
public string Mode { get; set; }
public TimeSpan Time { get; set; }
}
In this list I have some items, whose LineName and Modeare the same, and they are written in the list one by one. I need to sum Time property of such items and replace it with one item with sum of Time property without changing LineName and Mode, Date should be taken from first of replaced items. I will give an example below:
Original: Modified:
Date | LineName | Mode | Time Date | LineName | Mode | Time
01.09.2018 | Line1 | Auto | 00:30:00 01.09.2018 | Line1 | Auto | 00:30:00
01.09.2018 | Line2 | Auto | 00:10:00 01.09.2018 | Line2 | Auto | 00:15:00
01.09.2018 | Line2 | Auto | 00:05:00 01.09.2018 | Line2 | Manual | 00:02:00
01.09.2018 | Line2 | Manual | 00:02:00 01.09.2018 | Line2 | Auto | 00:08:00
01.09.2018 | Line2 | Auto | 00:08:00 01.09.2018 | Line1 | Manual | 00:25:00
01.09.2018 | Line1 | Manual | 00:25:00 01.09.2018 | Line2 | Auto | 00:24:00
01.09.2018 | Line2 | Auto | 00:05:00 02.09.2018 | Line1 | Auto | 00:05:00
02.09.2018 | Line2 | Auto | 00:12:00
02.09.2018 | Line2 | Auto | 00:07:00
02.09.2018 | Line1 | Auto | 00:05:00
I have tried to write method to do it, it partly works, but some not summarized items still remain.
private static List<ModeTime> MergeTime(List<ModeTime> modeTimes)
{
modeTimes = modeTimes.OrderBy(e => e.Date).ToList();
var mergedModeTimes = new List<ModeTime>();
for (var i = 0; i < modeTimes.Count; i++)
{
if (i - 1 != -1)
{
if (modeTimes[i].LineName == modeTimes[i - 1].LineName &&
modeTimes[i].Mode == modeTimes[i - 1].Mode)
{
mergedModeTimes.Add(new ModeTime
{
Date = modeTimes[i - 1].Date,
LineName = modeTimes[i - 1].LineName,
Mode = modeTimes[i - 1].Mode,
Time = modeTimes[i - 1].Time + modeTimes[i].Time
});
i += 2;
}
else
{
mergedModeTimes.Add(modeTimes[i]);
}
}
else
{
mergedModeTimes.Add(modeTimes[i]);
}
}
return mergedModeTimes;
}
I have also tried to wrap for with do {} while() and reduce source list modeTimes length. Unfortunately it leads to loop and memory leak (I waited till 5GB memory using).
Hope someone can help me. I searched this problem, in some familiar cases people use GroupBy. But I don't think it will work in my case, I must sum item with the same LineName and Mode, only if they are in the list one by one.
Most primitive solution would be something like this.
var items = GetItems();
var sum = TimeSpan.Zero;
for (int index = items.Count - 1; index > 0; index--)
{
var item = items[index];
var nextItem = items[index - 1];
if (item.LineName == nextItem.LineName && item.Mode == nextItem.Mode)
{
sum += item.Time;
items.RemoveAt(index);
}
else
{
item.Time += sum;
sum = TimeSpan.Zero;
}
}
items.First().Time += sum;
Edit: I missed last line, where you have to add leftovers. This only applies if first and second elements of the collection are the same. Without it, it would not assign aggregated time to first element.
You can use LINQ's GroupBy. To group only consecutive elements, this uses a trick. It stores the key values in a tuple together with a group index which is only incremented when LineName or Mode changes.
int i = 0; // Used as group index.
(int Index, string LN, string M) prev = default; // Stores previous key for later comparison.
var modified = original
.GroupBy(mt => {
var ret = (Index: prev.LN == mt.LineName && prev.M == mt.Mode ? i : ++i,
LN: mt.LineName, M: mt.Mode);
prev = (Index: i, LN: mt.LineName, M: mt.Mode);
return ret;
})
.Select(g => new ModeTime {
Date = g.Min(mt => mt.Date),
LineName = g.Key.LN,
Mode = g.Key.M,
Time = new TimeSpan(g.Sum(mt => mt.Time.Ticks))
})
.ToList();
This produces the expected 7 result rows.

how to parse two column in string in one cycle?

I have a string like this. I want to put the second row in an array(3,9,10,11...), and the third(5,8,4,3...) in an array
C8| 3| 5| 0| | 0|1|
C8| 9| 8| 0| | 0|1|
C8| 10| 4| 0| | 0|1|
C8| 11| 3| 0| | 0|1|
C8| 12| 0| 0| | 0|1|
C8| 13| 0| 0| | 0|1|
C8| 14| 0| 0| | 0|1|
This method originally parsed numbers by rows. now i have columns..
How to do this in this Parse method? I am trying for hours, i dont know what to do.
The Add method waits 2 integer. int secondNumberFinal, int thirdNumberFinal
private Parse(string lines)
{
const int secondColumn = 1;
const int thirdColum = 2;
var secondNumbers = lines[secondColumn].Split('\n'); // i have to split by new line, right?
var thirdNumbers = lines[thirdColum].Split('\n'); // i have to split by new line, right?
var res = new Collection();
for (var i = 0; i < secondNumbers.Length; i++)
{
try
{
var secondNumberFinal = Int32.Parse(secondNumbers[i]);
var thirdNumberFinal = Int32.Parse(thirdNumbers[i]);
res.Add(secondNumberFinal, thirdNumberFinal);
}
catch (Exception ex)
{
log.Error(ex);
}
}
return res;
}
thank you!
Below piece of code should do it for you. The logic is simple: Split the array with '\n' (please check if you need "\r\n" or some other line ending format) and then split with '|'. Returning the data as an IEnumerable of Tuple will provide flexibility and Lazy execution both. You can convert that into a List at the caller if you so desire using the Enumerable.ToList extension method
It uses LINQ (Select), instead of foreach loops due to its elegance in this situation
static IEnumerable<Tuple<int, int>> Parse(string lines) {
const int secondColumn = 1;
const int thirdColum = 2;
return lines.Split('\n')
.Select(line => line.Split('|'))
.Select(items => Tuple.Create(int.Parse(items[secondColumn]), int.Parse(items[thirdColum])));
}
If the original is a single string, then split once on newline to produce an array of string. Parse each of the new string by splitting on | & select the second & third values.
Partially rewriting your method for you :
private Parse(string lines)
{
const int secondColumn = 1;
const int thirdColum = 2;
string [] arrlines = lines.Split('\r');
foreach (string line in arrlines)
{
string [] numbers = line.Split('|');
var secondNumberFinal = Int32.Parse(numbers[secondNumbers]);
var thirdNumberFinal = Int32.Parse(numbers[thirdNumbers]);
// Whatever you want to do with them here
}
}

In C#, what is the best way to parse this WIKI markup?

I need to take data that I am reading in from a WIKI markup page and store it as a table structure. I am trying to figure out how to properly parse the below markup syntax into some table data structure in C#
Here is an example table:
|| Owner || Action || Status || Comments ||
| Bill | Fix the lobby | In Progress | This is easy |
| Joe | Fix the bathroom | In Progress | Plumbing \\
\\
Electric \\
\\
Painting \\
\\
\\ |
| Scott | Fix the roof | Complete | This is expensive |
and here is how it comes in directly:
|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|
So as you can see:
The column headers have "||" as the separator
A row columns have a separator or "|"
A row might span multiple lines (as in the second data row example above) so i would have to keep reading until I hit the same number of "|" (cols) that I have in the header row.
I tried reading in line by line and then concatenating lines that had "\" in between then but that seemed a bit hacky.
I also tried to simply read in as a full string and then just parse by "||" first and then keep reading until I hit the same number of "|" and then go to the next row. This seemed to work but it feel like there might be a more elegant way using regular expressions or something similar.
Can anyone suggest the correct way to parse this data?
I have largely replaced the previous answer, due to the fact that the format of the input after your edit is substantially different from the one posted before. This leads to a somewhat different solution.
Because there are no longer any line breaks after a row, the only way to determine for sure where a row ends, is to require that each row has the same number of columns as the table header. That is at least if you don't want to rely on some potentially fragile white space convention present in the one and only provided example string (i.e. that the row separator is the only | not preceded by a space). Your question at least does not provide this as the specification for a row delimiter.
The below "parser" provides at least the error handling validity checks that can be derived from your format specification and example string and also allows for tables that have no rows. The comments explain what it is doing in basic steps.
public class TableParser
{
const StringSplitOptions SplitOpts = StringSplitOptions.None;
const string RowColSep = "|";
static readonly string[] HeaderColSplit = { "||" };
static readonly string[] RowColSplit = { RowColSep };
static readonly string[] MLColSplit = { #"\\" };
public class TableRow
{
public List<string[]> Cells;
}
public class Table
{
public string[] Header;
public TableRow[] Rows;
}
public static Table Parse(string text)
{
// Isolate the header columns and rows remainder.
var headerSplit = text.Split(HeaderColSplit, SplitOpts);
Ensure(headerSplit.Length > 1, "At least 1 header column is required in the input");
// Need to check whether there are any rows.
var hasRows = headerSplit.Last().IndexOf(RowColSep) >= 0;
var header = headerSplit.Skip(1)
.Take(headerSplit.Length - (hasRows ? 2 : 1))
.Select(c => c.Trim())
.ToArray();
if (!hasRows) // If no rows for this table, we are done.
return new Table() { Header = header, Rows = new TableRow[0] };
// Get all row columns from the remainder.
var rowsCols = headerSplit.Last().Split(RowColSplit, SplitOpts);
// Require same amount of columns for a row as the header.
Ensure((rowsCols.Length % (header.Length + 1)) == 1,
"The number of row colums does not match the number of header columns");
var rows = new TableRow[(rowsCols.Length - 1) / (header.Length + 1)];
// Fill rows by sequentially taking # header column cells
for (int ri = 0, start = 1; ri < rows.Length; ri++, start += header.Length + 1)
{
rows[ri] = new TableRow() {
Cells = rowsCols.Skip(start).Take(header.Length)
.Select(c => c.Split(MLColSplit, SplitOpts).Select(p => p.Trim()).ToArray())
.ToList()
};
};
return new Table { Header = header, Rows = rows };
}
private static void Ensure(bool check, string errorMsg)
{
if (!check)
throw new InvalidDataException(errorMsg);
}
}
When used like this:
public static void Main(params string[] args)
{
var wikiLine = #"|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|";
var table = TableParser.Parse(wikiLine);
Console.WriteLine(string.Join(", ", table.Header));
foreach (var r in table.Rows)
Console.WriteLine(string.Join(", ", r.Cells.Select(c => string.Join(Environment.NewLine + "\t# ", c))));
}
It will produce the below output:
Where "\t# " represents a newline caused by the presence of \\ in the input.
Here's a solution which populates a DataTable. It does require a litte bit of data massaging (Trim), but the main parsing is Splits and Linq.
var str = #"|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|";
var headerStop = str.LastIndexOf("||");
var headers = str.Substring(0, headerStop).Split(new string[1] { "||" }, StringSplitOptions.None).Skip(1).ToList();
var records = str.Substring(headerStop + 4).TrimEnd(new char[2] { ' ', '|' }).Split(new string[1] { "| |" }, StringSplitOptions.None).ToList();
var tbl = new DataTable();
headers.ForEach(h => tbl.Columns.Add(h.Trim()));
records.ForEach(r => tbl.Rows.Add(r.Split('|')));
This makes some assumptions but seems to work for your sample data. I'm sure if I worked at I could combine the expressions and clean it up but you'll get the idea.
It will also allow for rows that do not have the same number of cells as the header which I think is something confluence can do.
List<List<string>> table = new List<List<string>>();
var match = Regex.Match(raw, #"(?:(?:\|\|([^|]*))*\n)?");
if (match.Success)
{
var headersWithExtra = match.Groups[1].Captures.Cast<Capture>().Select(c=>c.Value);
List<String> headerRow = headersWithExtra.Take(headersWithExtra.Count()-1).ToList();
if (headerRow.Count > 0)
{
table.Add(headerRow);
}
}
match = Regex.Match(raw + "\r\n", #"[^\n]*\n" + #"(?:\|([^|]*))*");
var cellsWithExtra = match.Groups[1].Captures.Cast<Capture>().Select(c=>c.Value);
List<string> row = new List<string>();
foreach (string cell in cellsWithExtra)
{
if (cell.Trim(' ', '\t') == "\r\n")
{
if (!table.Contains(row) && row.Count > 0)
{
table.Add(row);
}
row = new List<string>();
}
else
{
row.Add(cell);
}
}
This ended up very similar to Jon Tirjan's answer, although it cuts the LINQ to a single statement (the code to replace that last one was horrifically ugly) and is a bit more extensible. For example, it will replace the Confluence line breaks \\ with a string of your choosing, you can choose to trim or not trim whitespace from around elements, etc.
private void ParseWikiTable(string input, string newLineReplacement = " ")
{
string separatorHeader = "||";
string separatorRow = "| |";
string separatorElement = "|";
input = Regex.Replace(input, #"[ \\]{2,}", newLineReplacement);
string inputHeader = input.Substring(0, input.LastIndexOf(separatorHeader));
string inputContent = input.Substring(input.LastIndexOf(separatorHeader) + separatorHeader.Length);
string[] headerArray = SimpleSplit(inputHeader, separatorHeader);
string[][] rowArray = SimpleSplit(inputContent, separatorRow).Select(r => SimpleSplit(r, separatorElement)).ToArray();
// do something with output data
TestPrint(headerArray);
foreach (var r in rowArray) { TestPrint(r); }
}
private string[] SimpleSplit(string input, string separator, bool trimWhitespace = true)
{
input = input.Trim();
if (input.StartsWith(separator)) { input = input.Substring(separator.Length); }
if (input.EndsWith(separator)) { input = input.Substring(0, input.Length - separator.Length); }
string[] segments = input.Split(new string[] { separator }, StringSplitOptions.None);
if (trimWhitespace)
{
for (int i = 0; i < segments.Length; i++)
{
segments[i] = segments[i].Trim();
}
}
return segments;
}
private void TestPrint(string[] lst)
{
string joined = "[" + String.Join("::", lst) + "]";
Console.WriteLine(joined);
}
Console output from your direct input string:
[Owner::Action::Status::Comments]
[Bill::fix the lobby::In Progress::This is eary]
[Joe::fix the bathroom::In progress::plumbing Electric Painting]
[Scott::fix the roof::Complete::this is expensive]
A generic regex solution that populate a datatable and is a little flexible with the syntax.
var text = #"|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|";
// Get Headers
var regHeaders = new Regex(#"\|\|\s*(\w[^\|]+)", RegexOptions.Compiled);
var headers = regHeaders.Matches(text);
//Get Rows, based on number of headers columns
var regLinhas = new Regex(String.Format(#"(?:\|\s*(\w[^\|]+)){{{0}}}", headers.Count));
var rows = regLinhas.Matches(text);
var tbl = new DataTable();
foreach (Match header in headers)
{
tbl.Columns.Add(header.Groups[1].Value);
}
foreach (Match row in rows)
{
tbl.Rows.Add(row.Groups[1].Captures.OfType<Capture>().Select(col => col.Value).ToArray());
}
Here's a solution involving regular expressions. It takes a single string as input and returns a List of headers and a List> of rows/columns. It also trims white space, which may or may not be the desired behavior, so be aware of that. It even prints things nicely :)
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
namespace parseWiki
{
class Program
{
static void Main(string[] args)
{
string content = #"|| Owner || Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|";
content = content.Replace(#"\\", "");
string headerContent = content.Substring(0, content.LastIndexOf("||") + 2);
string cellContent = content.Substring(content.LastIndexOf("||") + 2);
MatchCollection headerMatches = new Regex(#"\|\|([^|]*)(?=\|\|)", RegexOptions.Singleline).Matches(headerContent);
MatchCollection cellMatches = new Regex(#"\|([^|]*)(?=\|)", RegexOptions.Singleline).Matches(cellContent);
List<string> headers = new List<string>();
foreach (Match match in headerMatches)
{
if (match.Groups.Count > 1)
{
headers.Add(match.Groups[1].Value.Trim());
}
}
List<List<string>> body = new List<List<string>>();
List<string> newRow = new List<string>();
foreach (Match match in cellMatches)
{
if (newRow.Count > 0 && newRow.Count % headers.Count == 0)
{
body.Add(newRow);
newRow = new List<string>();
}
else
{
newRow.Add(match.Groups[1].Value.Trim());
}
}
body.Add(newRow);
print(headers, body);
}
static void print(List<string> headers, List<List<string>> body)
{
var CELL_SIZE = 20;
for (int i = 0; i < headers.Count; i++)
{
Console.Write(headers[i].Truncate(CELL_SIZE).PadRight(CELL_SIZE) + " ");
}
Console.WriteLine("\n" + "".PadRight( (CELL_SIZE + 2) * headers.Count, '-'));
for (int r = 0; r < body.Count; r++)
{
List<string> row = body[r];
for (int c = 0; c < row.Count; c++)
{
Console.Write(row[c].Truncate(CELL_SIZE).PadRight(CELL_SIZE) + " ");
}
Console.WriteLine("");
}
Console.WriteLine("\n\n\n");
Console.ReadKey(false);
}
}
public static class StringExt
{
public static string Truncate(this string value, int maxLength)
{
if (string.IsNullOrEmpty(value) || value.Length <= maxLength) return value;
return value.Substring(0, maxLength - 3) + "...";
}
}
}
Read the input string one character at a time and use a state-machine to decide what should be done with each input character. This approach probably needs more code, but it will be easier to maintain and to extend than regular expressions.

reading a CSV into a Datatable without knowing the structure

I am trying to read a CSV into a datatable.
The CSV maybe have hundreds of columns and only up to 20 rows.
It will look something like this:
+----------+-----------------+-------------+---------+---+
| email1 | email2 | email3 | email4 | … |
+----------+-----------------+-------------+---------+---+
| ccemail1 | anotherccemail1 | 3rdccemail1 | ccemail | |
| ccemail2 | anotherccemail2 | 3rdccemail2 | | |
| ccemail3 | anotherccemail3 | | | |
| ccemail4 | anotherccemail4 | | | |
| ccemail5 | | | | |
| ccemail6 | | | | |
| ccemail7 | | | | |
| … | | | | |
+----------+-----------------+-------------+---------+---+
i am trying to use genericparser for this; however, i believe that it requires you to know the column names.
string strID, strName, strStatus;
using (GenericParser parser = new GenericParser())
{
parser.SetDataSource("MyData.txt");
parser.ColumnDelimiter = "\t".ToCharArray();
parser.FirstRowHasHeader = true;
parser.SkipStartingDataRows = 10;
parser.MaxBufferSize = 4096;
parser.MaxRows = 500;
parser.TextQualifier = '\"';
while (parser.Read())
{
strID = parser["ID"]; //as you can see this requires you to know the column names
strName = parser["Name"];
strStatus = parser["Status"];
// Your code here ...
}
}
is there a way to read this file into a datatable without know the column names?
It's so simple!
var adapter = new GenericParsing.GenericParserAdapter(filepath);
DataTable dt = adapter.GetDataTable();
This will automatically do everything for you.
I looked at the source code, and you can access the data by column index too, like this
var firstColumn = parser[0]
Replace the 0 with the column number.
The number of colums can be found using
parser.ColumnCount
I'm not familiar with that GenericParser, i would suggest to use tools like TextFieldParser, FileHelpers or this CSV-Reader.
But this simple manual approach should work also:
IEnumerable<String> lines = File.ReadAllLines(filePath);
String header = lines.First();
var headers = header.Split(new[]{','}, StringSplitOptions.RemoveEmptyEntries);
DataTable tbl = new DataTable();
for (int i = 0; i < headers.Length; i++)
{
tbl.Columns.Add(headers[i]);
}
var data = lines.Skip(1);
foreach(var line in data)
{
var fields = line.Split(new[]{','}, StringSplitOptions.RemoveEmptyEntries);
DataRow newRow = tbl.Rows.Add();
newRow.ItemArray = fields;
}
i used generic parser to do it.
On the first run through the loop i get the columns names and then reference them to add them to a list
In my case i have pivoted the data but here is a code sample if it helps someone
bool firstRow = true;
List<string> columnNames = new List<string>();
List<Tuple<string, string, string>> results = new List<Tuple<string, string, string>>();
while (parser.Read())
{
if (firstRow)
{
for (int i = 0; i < parser.ColumnCount; i++)
{
if (parser.GetColumnName(i).Contains("FY"))
{
columnNames.Add(parser.GetColumnName(i));
Console.Log("Column found: {0}", parser.GetColumnName(i));
}
}
firstRow = false;
}
foreach (var col in columnNames)
{
double actualCost = 0;
bool hasValueParsed = Double.TryParse(parser[col], out actualCost);
csvData.Add(new ProjectCost
{
ProjectItem = parser["ProjectItem"],
ActualCosts = actualCost,
ColumnName = col
});
}
}

Categories