PROBLEM
I have recently started learning more about CsvHelper and I need some advice on how to achieve my goal.
I have a CSV file containing user records (thousands to hundreds of thousands of records) and I need to parse the file and validate/process the data. What I need to do is two things:
I need a way to validate the whole row while it is being read:
the record contains a date range and I need to verify it is a valid range
if it's not, I need to write the offending line to an error file
a record can also be present multiple times with different date ranges, and I need to validate that the ranges don't overlap; if they do, write the WHOLE ORIGINAL LINE to an error file
At minimum I could get by with a way to preserve the whole original row alongside the parsed data, but a way to verify the whole row while the raw data is still available would be better.
QUESTIONS
Are there events/actions hidden somewhere that I can use to validate a row of data after it has been created but before it is added to the collection?
If not, is there a way to save the whole RAW row into the record so I can verify the row after parsing it AND, if it is not valid, do what I need with it?
CODE I HAVE
What I've created is the record class like this:
class Record
{
    // simplified and omitted fluff for brevity
    public string Login { get; set; }
    public string Domain { get; set; }
    public DateTime? Created { get; set; }
    public DateTime? Ended { get; set; }
}
and a class map:
class RecordMapping : ClassMap<Record>
{
    // simplified and omitted fluff for brevity
    public RecordMapping(ConfigurationElement config)
    {
        //..the set up of the mapping...
    }
}
and then use them like this:
public void ProcessFile(...)
{
...
using(var reader = new StreamReader(...))
using(var csvReader = new CsvReader(reader))
using(var errorWriter = new StreamWriter(...))
{
csvReader.Configuration.RegisterClassMap(new RecordMapping(config));
//...set up of csvReader configuration...
try
{
var records = csvReader.GetRecords<Record>();
}
catch (Exception ex)
{
//..in case of problems...
}
....
}
....
}
In this scenario the data might be "valid" from CsvHelper's viewpoint, because it can read the data, but invalid for more complex reasons (like an invalid date range).
In that case, this might be a simple approach:
public IEnumerable<Thing> ReadThings(TextReader textReader)
{
var result = new List<Thing>();
using (var csvReader = new CsvReader(textReader))
{
while (csvReader.Read())
{
var thing = csvReader.GetRecord<Thing>();
if (IsThingValid(thing))
result.Add(thing);
else
LogInvalidThing(thing);
}
}
return result;
}
If what you need to log is the raw text, that would be:
LogInvalidRow(csvReader.Context.RawRecord);
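For example, a rough sketch that adapts the loop above to the original question (IsRecordValid and errorWriter are placeholders here, and depending on the CsvHelper version the raw text may be exposed as csvReader.Parser.RawRecord instead of csvReader.Context.RawRecord):
while (csvReader.Read())
{
    // Capture the raw text of the row that was just read.
    string rawLine = csvReader.Context.RawRecord;
    var record = csvReader.GetRecord<Record>();

    if (IsRecordValid(record))              // e.g. Created <= Ended
        records.Add(record);
    else
        errorWriter.WriteLine(rawLine);     // write the whole original line to the error file
}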
Another option - perhaps a better one - might be to completely separate the validation from the reading. In other words, just read the records with no validation.
var records = csvReader.GetRecords<Record>();
Your reader class returns them without being responsible for determining which are valid and what to do with them.
Then another class can validate an IEnumerable<Record>, returning the valid rows and logging the invalid rows.
That way the logic for validation and logging isn't tied up with the code for reading. It will be easier to test and easier to re-use if you get a collection of Record from something other than a CSV file.
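As a rough sketch (the class and method names below are made up for illustration and are not part of CsvHelper), such a validator could check the date ranges and the overlaps on the already-parsed records:
public class RecordValidator
{
    // Yields only the valid records; invalid ones are handed to the logging callback.
    public IEnumerable<Record> GetValidRecords(IEnumerable<Record> records, Action<Record> logInvalid)
    {
        // Group by user so overlapping date ranges for the same login can be detected.
        foreach (var group in records.GroupBy(r => new { r.Login, r.Domain }))
        {
            var accepted = new List<Record>();
            foreach (var record in group.OrderBy(r => r.Created))
            {
                bool validRange = record.Created.HasValue
                                  && record.Ended.HasValue
                                  && record.Created <= record.Ended;
                bool overlaps = accepted.Any(a => record.Created < a.Ended && a.Created < record.Ended);

                if (validRange && !overlaps)
                    accepted.Add(record);
                else
                    logInvalid(record); // e.g. write the offending line to the error file
            }

            foreach (var ok in accepted)
                yield return ok;
        }
    }
}
If the whole original line has to end up in the error file, the raw text captured during reading can be stored on the Record so the validator still has it available.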
Related
I am trying to create multiple measures in Power BI through advanced scripting in Tabular Editor that would replicate Calculate(Sum(ColA), Keepfilters(ColB=i)).
I understand that I would need a for loop to iterate over the values in ColB, but I can't manage the iteration over the values in ColB.
How do I do that?
There is an ExecuteReader method available in Tabular Editor scripts, which returns an IDataReader-object that you can use to iterate the result. To add one measure for every value in a column, you would need to do something like this (inspired by this example):
using(var reader = ExecuteReader("EVALUATE VALUES(MyTable[ColB])"))
{
while(reader.Read())
{
var colBvalue = reader.GetString(0); // Assumes [ColB] contains strings, otherwise use .GetInt64, .GetDecimal, .GetDouble, etc...
var measureName = string.Format("Measure filtered by {0}", colBvalue);
var measureDax = string.Format("CALCULATE(SUM(MyTable[ColA]), KEEPFILTERS(MyTable[ColB]=\"{0}\"))", colBvalue);
Model.Tables["MyTable"].AddMeasure(measureName, measureDax);
}
}
I work for a school district and we are having to manually create user logins (AD) and GAFE accounts. We would like to automate this as much as we can. Currently, we have a CSV file that is exported daily from our SIS (student information system) with a list of all new students, and I need to read that data, apply some formulas, and output two CSVs, one for GAFE and one for AD, with the results of my formulas.
My thought is to read the CSV and save it into a tuple data type, then write a new tuple with the output I need, then save to new CSVs. I thought a tuple would work nicely, but I'm new enough to C# that I'm not sure what would work best. If you have any recommendations on other data types, I would love the input.
Here's the header:
"SchoolName","firstName","middleName","lastName","grade","studentNumber","Change","startDate","endDate","EnrStartStatus","CalcStartStatus","DateAdded"
"AHS","John","Smith","Doe","12","1779123445","New Student at School","2016-11-29 00:00:00","","","","2016-11-22 20:00:00"
So, I'm having some mental logic issues. I'm not sure how to convert the CSV to tuples without nested foreach loops (the way I'm thinking of going about it doesn't seem efficient). I figured there would be a library or something built into C# that would make it much easier... Any input would be greatly appreciated.
Thanks,
Throdne
There are several really powerful libraries to do most of the work for you. One really good one is CsvHelper, which will not only read and write the data for you, but also perform type conversions so that your numbers and dates are stored as numbers and dates.
Given sample data similar to yours:
"FirstName","MiddleName","LastName","Grade","StudentNumber","EnrollDate"
"Ziggy","V.","Aurantium","12","4001809","12/13/2016 6:18:21 PM"
"Nancy","W.","Stackhouse","11","9762164","12/15/2016 7:06:20 PM"
"Sullivan","N.","Deroche","11","7887589","12/11/2016 1:31:50 PM"
1. Devise a class for the data
public class Student
{
public int StudentNumber { get; set; }
public string FirstName { get; set; }
public string MiddleName { get; set; }
public string LastName { get; set; }
public int Grade { get; set; }
public DateTime EnrollDate { get; set; }
public Student()
{ }
}
2. Load the Data
// a form/class level collection for the data
List<Student> myStudents;
Then to load the data:
using (var sr = new StreamReader(@"C:\Temp\students.csv", false))
using (var csv = new CsvReader(sr))
{
csv.Configuration.HasHeaderRecord = true;
csv.Configuration.QuoteAllFields = true;
myStudents = csv.GetRecords<Student>().ToList();
}
That's it: 3 lines of code. There are many other Configuration options to fine tune how it works. Also:
If there are a lot of rows, you can leave off the ToList() and work with the IEnumerable result and load each row as needed
If the property names you want to use don't match the CSV header names, you can supply a Map to tell CsvHelper which fields map to which properties.
Ditto for when there are no field names.
Exporting your collection to new output CSVs is just as easy as reading them in (a short sketch follows after this list).
You would also probably need a Map (or two) to control the output order for the output CSVs.
Best of all, it converts the data types for you. No, wait, best of all is that it won't split up fields with embedded commas (as in "Ziggy","V.","Aurantium, II","12"... note the last name data) the way String.Split(',') will.
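For example, a rough sketch of the export side, using the same (older) CsvHelper API as the reading code above (newer versions also take a CultureInfo in the CsvWriter constructor; the file path is just an example):
using (var sw = new StreamWriter(@"C:\Temp\students_out.csv", false))
using (var csv = new CsvWriter(sw))
{
    // Writes a header row plus one line per Student; a ClassMap can control the column order.
    csv.WriteRecords(myStudents);
}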
I recommend using a string array instead of tuples.
You can easily convert a line of CSV values into a string array with this line of code:
line.Split( new char[] { '"', ',' }, StringSplitOptions.RemoveEmptyEntries );
This returns a string array.
Using " and , both as separator characters lets you get rid of the quotation marks in the same step.
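A rough sketch of that approach over the whole file (file path and variable names are just examples); note the caveat from the other answer: a quoted field that itself contains a comma will be split incorrectly by this approach:
var rows = File.ReadLines(@"C:\Temp\students.csv")
    .Skip(1)   // skip the header row
    .Select(line => line.Split(new char[] { '"', ',' }, StringSplitOptions.RemoveEmptyEntries))
    .ToList(); // List<string[]>, one array per student row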
I have two DataGridViews in the main form: the first one displays data from SAP and the other displays data from a Vertica DB. The FM (function module) I'm using is RFC_READ_TABLE, but there is an exception when calling this FM: if there are too many columns in the target table, the SAP connector returns a DATA_BUFFER_EXCEEDED exception. Are there any other FMs or ways to retrieve data from SAP without this exception?
I figured out a workaround, which is to split the fields into several arrays, store each part's data in a DataTable, and then merge the DataTables, but I'm afraid it will take a lot of time if the row count is too large.
Here is my code:
RfcDestination destination = RfcDestinationManager.GetDestination(cmbAsset.Text);
readTable = destination.Repository.CreateFunction("RFC_READ_TABLE");
/*
* RFC_READ_TABLE will only extract data up to 512 chars per row.
* If you load more data, you will get an DATA_BUFFER_EXCEEDED exception.
*/
readTable.SetValue("query_table", table);
readTable.SetValue("delimiter", "~");//Assigns the given string value to the element specified by the given name after converting it appropriately.
if (tbRowCount.Text.Trim() != string.Empty) readTable.SetValue("rowcount", tbRowCount.Text);
t = readTable.GetTable("DATA");
t.Clear();//Removes all rows from this table.
t = readTable.GetTable("FIELDS");
t.Clear();
if (selectedCols.Trim() != "" )
{
string[] field_names = selectedCols.Split(",".ToCharArray());
if (field_names.Length > 0)
{
t.Append(field_names.Length);
int i = 0;
foreach (string n in field_names)
{
t.CurrentIndex = i++;
t.SetValue(0, n);
}
}
}
t = readTable.GetTable("OPTIONS");
t.Clear();
t.Append(1);//Adds the specified number of rows to this table.
t.CurrentIndex = 0;
t.SetValue(0, filter);//Assigns the given string value to the element specified by the given index after converting it appropriately.
try
{
readTable.Invoke(destination);
}
catch (Exception e)
{
    // TODO: handle or log the error here (e.g. the DATA_BUFFER_EXCEEDED case)
}
First of all, you should use BBP_READ_TABLE if it is available in your system. It is better for many reasons, but that is not the point of your question. In RFC_READ_TABLE you have two import parameters, ROWCOUNT and ROWSKIPS. You have to use them.
I would recommend a ROWCOUNT between 30,000 and 60,000. You then have to execute the RFC several times, incrementing ROWSKIPS each time: first loop ROWCOUNT=30000 and ROWSKIPS=0, second loop ROWCOUNT=30000 and ROWSKIPS=30000, and so on...
Also be careful with float fields when using the old RFC_READ_TABLE; there is one in table LIPS, and this RFC has problems with them.
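A rough sketch of that paging loop with the .NET connector, built on the code from the question (the 30,000 page size is just the suggested starting point, and error handling is omitted):
int pageSize = 30000;
int offset = 0;
var allRows = new List<string>();

while (true)
{
    readTable.SetValue("rowcount", pageSize);
    readTable.SetValue("rowskips", offset);
    readTable.Invoke(destination);

    IRfcTable data = readTable.GetTable("DATA");
    for (int i = 0; i < data.RowCount; i++)
        allRows.Add(data[i].GetString("WA"));   // WA holds the delimited row text

    if (data.RowCount < pageSize)
        break;                                  // last page reached
    offset += pageSize;
}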
Use transaction BAPI, press filter, and set it to all. Under Logistics Execution you will find deliveries. The detail screen shows the function name. Test them directly to find one that suits, then call that function instead of RFC_READ_TABLE.
Example: BAPI_LIKP_GET_LIST_MSG
Another possibility is to have a custom ABAP RFC function developed to get your data (with the advantage that you can get a structured, multi-table response in one call, and the disadvantage that this is not a standard function/BAPI).
I have a C# program that looks through directories for .txt files and loads each into a DataTable.
static IEnumerable<string> ReadAsLines(string fileName)
{
using (StreamReader reader = new StreamReader(fileName))
while (!reader.EndOfStream)
yield return reader.ReadLine();
}
public DataTable GetTxtData()
{
IEnumerable<string> reader = ReadAsLines(this.File);
DataTable txtData = new DataTable();
string[] headers = reader.First().Split('\t');
foreach (string columnName in headers)
txtData.Columns.Add(columnName);
IEnumerable<string> records = reader.Skip(1);
foreach (string rec in records)
txtData.Rows.Add(rec.Split('\t'));
return txtData;
}
This works great for regular tab-delimited files. However, the catch is that not every .txt file in the folders I need to use contains tab-delimited data. Some .txt files are actually SQL queries, notes, etc. that have been saved as plain text files, and I have no way of determining that beforehand. Trying to use the above code on such files clearly won't lead to the expected result.
So my question is this: How can I tell whether a .txt file actually contains tab-delimited data before I try to read it into a DataTable using the above code?
Just searching the file for any tab character won't work because, for example, a SQL query saved as plain text might have tabs for code formatting.
Any guidance here at all would be much appreciated!
If each line contains the same number of elements, then simply read each line and verify that you get the correct number of fields in each record. If not, error out.
if (headers.Count() != CORRECTNUMBER)
{
// ERROR
}
foreach (string rec in records)
{
string[] recordData = rec.Split('\t');
if (recordData.Count() != headers.Count())
{
// ERROR
}
txtData.Rows.Add(recordData);
}
To do this you need a set of "signature" logic providers which can check a given sample of the file for "signature" content. This is similar to how virus scanners work.
Consider creating a set of classes that each implement an ISignature interface:
class TSVFile : ISignature
{
enumFileType ISignature.Evaluate(IEnumerable<byte> inputStream);
}
class SQLFile : ISignature
{
enumFileType ISignature.Evaluate(IEnumerable<byte> inputStream);
}
each one would read an appropriate number of bytes in and return the known file type, if it can be evaluated. Each file parser would need its own logic to determine how many bytes to read and on what basis to make its evaluation.
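A minimal sketch of what one such provider could look like; the enumFileType values and the tab-consistency heuristic are illustrative assumptions, not a fixed design:
using System.Collections.Generic;
using System.Linq;
using System.Text;

enum enumFileType { Unknown, TabDelimited, Sql }

interface ISignature
{
    enumFileType Evaluate(IEnumerable<byte> inputStream);
}

class TSVFile : ISignature
{
    public enumFileType Evaluate(IEnumerable<byte> inputStream)
    {
        // Sample the start of the file and look at the first few non-empty lines.
        string sample = Encoding.UTF8.GetString(inputStream.Take(4096).ToArray());
        var lines = sample.Split('\n')
                          .Select(l => l.TrimEnd('\r'))
                          .Where(l => l.Length > 0)
                          .Take(5)
                          .ToList();
        if (lines.Count < 2)
            return enumFileType.Unknown;

        // Heuristic: every sampled line has the same, non-zero number of tabs.
        var tabCounts = lines.Select(l => l.Count(c => c == '\t')).Distinct().ToList();
        return (tabCounts.Count == 1 && tabCounts[0] > 0)
            ? enumFileType.TabDelimited
            : enumFileType.Unknown;
    }
}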
I am currently reading in an HTML document using CsQuery. This document has several HTML tables and I need to read in the data while preserving the structure. At the moment, I simply have a List of List of List of strings. This is a list of tables containing a list of rows containing a list of cells containing the content as a string.
List<List<List<string>>> page_tables = document_div.Cq().Find("TABLE")
.Select(table => table.Cq().Find("TR")
.Select(tr => tr.Cq().Find("td")
.Select(td => td.InnerHTML).ToList())
.ToList())
.ToList();
Is there a better way to store this data, so I can easily access particular tables, and specific rows and cells? I'm writing several methods that deal with this page_tables object so I need to nail down its formulation first.
Is there a better way to store this data, so I can easily access particular tables, and specific rows and cells?
On most occasions, well-formed HTML fits nicely into an XML structure, so you could store it as an XML document. LINQ to XML would make querying very easy:
XDocument doc = XDocument.Parse("<html>...</html>");
var cellData = doc.Descendants("td").Select(x => x.Value);
Based on the comments I feel obliged to point out that there are a couple of other scenarios where this can fall over, such as:
When HTML-encoded content like &nbsp; is used
When valid HTML that doesn't require a closing tag, e.g. <br>, is used
(With that said, these things can be handled by some pre-processing)
To summarise, it is by no means the most robust approach; however, if you can be sure that the HTML you are parsing fits the bill, then it would be a pretty neat solution.
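As a rough sketch (htmlText stands in for the markup, and the element names are assumed to be lowercase), the same nested structure from the question could then be built with LINQ to XML:
XDocument doc = XDocument.Parse(htmlText);

List<List<List<string>>> page_tables = doc.Descendants("table")
    .Select(table => table.Descendants("tr")
        .Select(tr => tr.Elements("td")
            .Select(td => (string)td)   // the text content of each cell
            .ToList())
        .ToList())
    .ToList();

// e.g. the third cell of the second row of the first table
var cell = page_tables[0][1][2];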
You could go fully OOP and write some model classes:
// Code kept short, minimal ctors
public class Cell
{
public string Content {get;set;}
public Cell() { this.Content = string.Empty; }
}
public class Row
{
public List<Cell> Cells {get;set;}
public Row() { this.Cells = new List<Cell>(); }
}
public class Table
{
public List<Row> Rows {get;set;}
public Table() { this.Rows = new List<Row>(); }
}
And then fill them up, for example like this:
var tables = new List<Table>();
foreach(var table in document_div.Cq().Find("TABLE"))
{
var t = new Table();
foreach(var tr in table.Cq().Find("TR"))
{
var r = new Row();
foreach(var td in tr.Cq().Find("td"))
{
var c = new Cell();
c.Content = td.InnerHTML;
r.Cells.Add(c);
}
t.Rows.Add(r);
}
tables.Add(t);
}
// Assuming the HTML was correct, now you have a cleanly organized
// class structure representing the tables!
var aTable = tables.First();
var firstRow = aTable.Rows.First();
var firstCell = firstRow.Cells.First();
var firstCellContents = firstCell.Content;
...
I'd probably choose this approach because I always prefer to know exactly what my data looks like, especially if/when I'm parsing from external/unsafe/unreliable sources.
Is there a better way to store this data, so I can easily access
particular tables, and specific rows and cells?
If you want to easily access table data, then create a class that will hold the data from a table row, with nicely named properties for the corresponding columns. E.g. if you have a users table
<table>
<tr><td>1</td><td>Bob</td></tr>
<tr><td>2</td><td>Joe</td></tr>
</table>
I would create the following class to hold the row data:
public class User
{
public int Id { get; set; }
public string Name { get; set; }
}
The second step would be parsing the users from the HTML. I suggest using HtmlAgilityPack (available from NuGet) for parsing HTML:
HtmlDocument doc = new HtmlDocument();
doc.Load("index.html");
var users = from r in doc.DocumentNode.SelectNodes("//table/tr")
let cells = r.SelectNodes("td")
select new User
{
Id = Int32.Parse(cells[0].InnerText),
Name = cells[1].InnerText
};
// NOTE: you can check cells count before accessing them by index
Now you have a collection of strongly-typed user objects (you can save them to a list, an array, or a dictionary, depending on how you are going to use them). E.g.
var usersDictionary = users.ToDictionary(u => u.Id);
// Getting user by id
var user = usersDictionary[2];
// now you can read user.Name
Since you're parsing an HTML table, could you use an ADO.NET DataTable? If the content doesn't have too many row or column spans, this may be an option: you wouldn't have to roll your own, and it could easily be saved to a database, a list of entities, or whatever. Plus you get the benefit of strongly typed data. As long as the HTML tables are consistent, I would prefer an approach like this to make interoperability with the rest of the framework seamless and a ton less work.
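A rough sketch of that idea, built on the CsQuery snippet from the question (it assumes the first row of each table holds unique, non-empty column headers; the variable names are illustrative):
var dataTables = new List<DataTable>();
foreach (var table in document_div.Cq().Find("TABLE"))
{
    var rows = table.Cq().Find("TR").ToList();
    if (rows.Count == 0)
        continue;

    var dt = new DataTable();

    // The first row supplies the column names (InnerHTML may need markup stripped).
    foreach (var cell in rows[0].Cq().Find("td, th"))
        dt.Columns.Add(cell.InnerHTML.Trim());

    // The remaining rows become data rows.
    foreach (var tr in rows.Skip(1))
        dt.Rows.Add(tr.Cq().Find("td").Select(td => (object)td.InnerHTML).ToArray());

    dataTables.Add(dt);
}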