Determine if input file is usable by program - C#

I have a C# program that looks through directories for .txt files and loads each into a DataTable.
static IEnumerable<string> ReadAsLines(string fileName)
{
    using (StreamReader reader = new StreamReader(fileName))
        while (!reader.EndOfStream)
            yield return reader.ReadLine();
}

public DataTable GetTxtData()
{
    IEnumerable<string> reader = ReadAsLines(this.File);
    DataTable txtData = new DataTable();

    string[] headers = reader.First().Split('\t');
    foreach (string columnName in headers)
        txtData.Columns.Add(columnName);

    IEnumerable<string> records = reader.Skip(1);
    foreach (string rec in records)
        txtData.Rows.Add(rec.Split('\t'));

    return txtData;
}
This works great for regular tab-delimited files. However, the catch is that not every .txt file in the folders I need to use contains tab-delimited data. Some .txt files are actually SQL queries, notes, etc. that have been saved as plain text files, and I have no way of determining that beforehand. Trying to use the above code on such files clearly won't lead to the expected result.
So my question is this: How can I tell whether a .txt file actually contains tab-delimited data before I try to read it into a DataTable using the above code?
Just searching the file for any tab character won't work because, for example, a SQL query saved as plain text might have tabs for code formatting.
Any guidance here at all would be much appreciated!

If every line should contain the same number of elements, then simply read each line and verify that you get the correct number of fields in each record; if not, error out.
if (headers.Length != CORRECTNUMBER)
{
    // ERROR: the header row does not have the expected number of fields
}

foreach (string rec in records)
{
    string[] recordData = rec.Split('\t');

    // every record must have the same number of fields as the header
    if (recordData.Length != headers.Length)
    {
        // ERROR: this row does not match the header's field count
    }

    txtData.Rows.Add(recordData);
}
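If the check has to happen before committing to the DataTable load, the same idea works as a standalone pre-check: sample the first few lines and accept the file only if every sampled line splits into the same number of tab-separated fields (more than one). A minimal sketch, where the IsLikelyTabDelimited name and the sample size are assumptions made up for this example:

// Sketch: heuristic pre-check for tab-delimited content.
// Requires: using System.IO; using System.Linq;
static bool IsLikelyTabDelimited(string fileName, int linesToSample = 10)
{
    // Assumption: a real data file has a consistent field count (> 1)
    // across its first few lines; SQL scripts and notes rarely do.
    string[] sample = File.ReadLines(fileName).Take(linesToSample).ToArray();
    if (sample.Length == 0)
        return false;

    int fieldCount = sample[0].Split('\t').Length;
    if (fieldCount < 2)
        return false; // the first line contains no tabs at all

    // every sampled line must split into the same number of fields
    return sample.All(line => line.Split('\t').Length == fieldCount);
}

A SQL script indented with tabs will almost always fail the consistency test, though this is a heuristic, not a guarantee.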

To do this you need a set of "signature" logic providers, each of which checks a given sample of the file for content it recognizes. This is similar to how virus scanners work.
You would define an ISignature interface and implement it in a set of classes:
interface ISignature
{
    enumFileType Evaluate(IEnumerable<byte> inputStream);
}

class TSVFile : ISignature
{
    enumFileType ISignature.Evaluate(IEnumerable<byte> inputStream) { ... }
}

class SQLFile : ISignature
{
    enumFileType ISignature.Evaluate(IEnumerable<byte> inputStream) { ... }
}
Each one would read an appropriate number of bytes and return the known file type, if it can determine one. Each file parser needs its own logic for how many bytes to read and on what basis to make its evaluation.
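To make that concrete, here is a minimal sketch of one such implementation. The enumFileType values and the detection heuristic are assumptions made up for this example, not part of the original answer:

// Requires: using System; using System.Collections.Generic;
//           using System.Linq; using System.Text;
enum enumFileType { Unknown, TabDelimited, Sql }

class TSVFile : ISignature
{
    enumFileType ISignature.Evaluate(IEnumerable<byte> inputStream)
    {
        // Assumption: decode the sample as UTF-8 and require a
        // consistent tab-separated field count on the first two lines.
        string text = Encoding.UTF8.GetString(inputStream.ToArray());
        string[] lines = text.Split('\n');
        if (lines.Length < 2)
            return enumFileType.Unknown;

        int fields = lines[0].Split('\t').Length;
        return fields > 1 && lines[1].Split('\t').Length == fields
            ? enumFileType.TabDelimited
            : enumFileType.Unknown;
    }
}

A driver would then read, say, the first few kilobytes of each file and ask each ISignature implementation in turn, stopping at the first one that returns something other than Unknown.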

Related

Confusion in row and column of DataTable

I am following a tutorial for an inventory stock management system in C#.
The original csv file is a stock list, which contains four categories:
Item Code, Item Description, Item Count, OnOrder
In the tutorial, the code generates a DataTable object, which is then used in a DataGridView demo in the application.
Here is the code:
DataTable dataTable = new DataTable();
dataTable.Columns.Add("Item Code");
dataTable.Columns.Add("Item Description");
dataTable.Columns.Add("Current Count");
dataTable.Columns.Add("On Order");

string CSV_FilePath = "C:/Users/xxxxx/Desktop/stocklist.csv";
StreamReader streamReader = new StreamReader(CSV_FilePath);
string[] rawData = new string[File.ReadAllLines(CSV_FilePath).Length];
rawData = streamReader.ReadLine().Split(',');
while (!streamReader.EndOfStream)
{
    rawData = streamReader.ReadLine().Split(',');
    dataTable.Rows.Add(rawData[0], rawData[1], rawData[2], rawData[3]);
}
dataGridView1.DataSource = dataTable;
I am assuming that rawData = streamReader.ReadLine().Split(','); splits the file into an array object like this:
["A0001", "Horse on Wheels","5","No"]
["A0002","Elephant on Wheels","2","No"]
In the while loop, it iterates through each line (each array) and assigns each rawData[x] to the corresponding column.
Is this the right way to understand this code snippet?
Another question is, why do I need to run
rawData = streamReader.ReadLine().Split(',');
in a while loop?
Thanks in advance.
Your code should actually look like this:
DataTable dataTable = new DataTable();
dataTable.Columns.Add("Item Code");
dataTable.Columns.Add("Item Description");
dataTable.Columns.Add("Current Count");
dataTable.Columns.Add("On Order");
string CSV_FilePath = "C:/Users/xxxxx/Desktop/stocklist.csv";
using (StreamReader streamReader = new StreamReader(CSV_FilePath))
{
    // skip the header row
    streamReader.ReadLine();

    while (!streamReader.EndOfStream)
    {
        string[] rawData = streamReader.ReadLine().Split(','); // read a row and split it into cells
        dataTable.Rows.Add(rawData[0], rawData[1], rawData[2], rawData[3]); // add the cells as a new row in the DataTable
    }
}
dataGridView1.DataSource = dataTable;
Changes I've made:
We've added a using block around StreamReader to ensure that the file handle is only open for as long as we need to read the file.
We now only read the file once, not twice.
Since we only need the rawData in the scope of the while loop, I've moved it into the loop.
Explaining what's wrong:
The following line reads the entire file and then counts how many rows are in it. With this information, we initialize an array with as many positions as there are rows in the file. This means for a 500-row file, you can access positions rawData[0], rawData[1], ... rawData[499].
string[] rawData = new string[File.ReadAllLines(CSV_FilePath).Length];
With the next line of code you discard that array and instead store the cells from the first row of the file (the headers):
rawData = streamReader.ReadLine().Split(',');
This line says "read a single line from the file and split it by commas". You then assign that result to rawData, replacing its old value. The reason you need this again in the loop is that you're interested in more than just the first row of the file.
Inside the loop, you replace rawData with the cells from each successive row and add them to the DataTable:
rawData = streamReader.ReadLine().Split(',');
dataTable.Rows.Add(rawData[0], rawData[1], rawData[2], rawData[3]);
Note that File.ReadAllLines(...) reads the entire file into memory as an array of strings. You're also using StreamReader to read through the file line-by-line, meaning that you are reading the entire file twice. This is not very efficient and you should avoid this where possible. In this case, we didn't need to do that at all.
Also note that your approach to reading a CSV file is fairly naïve. Depending on the software used to create them, some CSV files have cells that span more than one line, some include quoted sections for text, and sometimes those quoted sections contain commas which would throw off your split code. Your code also doesn't deal with the possibility of a badly formatted file where a row has fewer cells than expected, or where there is a trailing empty row at the end of the file. Generally it's better to use a dedicated CSV parser such as CsvHelper rather than trying to roll your own.
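For comparison, here is a minimal sketch of the same load done with CsvHelper. The StockItem class and the header names in the Name attributes are assumptions based on the columns described above, and recent CsvHelper versions require the culture argument shown:

using System.Globalization;
using System.IO;
using System.Linq;
using CsvHelper;
using CsvHelper.Configuration.Attributes;

// hypothetical record type matching the four columns of the file
public class StockItem
{
    [Name("Item Code")] public string ItemCode { get; set; }
    [Name("Item Description")] public string ItemDescription { get; set; }
    [Name("Item Count")] public int ItemCount { get; set; }
    [Name("OnOrder")] public string OnOrder { get; set; }
}

// ...then, where the file is loaded:
using (var reader = new StreamReader(CSV_FilePath))
using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
{
    // CsvHelper reads the header row itself and copes with quoted
    // fields and embedded commas, unlike a plain Split(',')
    var items = csv.GetRecords<StockItem>().ToList();
}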

CsvHelper - validate whole row

PROBLEM
I have recently started learning more about CsvHelper and I need advice on how to achieve my goal.
I have a CSV file containing some user records (thousands to hundreds of thousands records) and I need to parse the file and validate/process the data. What I need to do is two things:
I need a way to validate whole row while it is being read
the record contains date range and I need to verify it is a valid range
If it's not I need to write the offending line to the error file
one record can also be present multiple times with different date ranges and I need to validate that the ranges don't overlap and if they do, write the WHOLE ORIGINAL LINE to an error file
What I could basically get by with is a way to preserve the whole original row alongside the parsed data, but a way to verify the whole row while the raw data is still available would be better.
QUESTIONS
Are there some events/actions hidden somewhere I can use to validate row of data after it was created but before it was added to the collection?
If not is there a way to save the whole RAW row into the record so I can verify the row after parsing it AND if it is not valid do what I need with them?
CODE I HAVE
What I've created is the record class like this:
class Record
{
    // simplified and omitted fluff for brevity
    public string Login { get; set; }
    public string Domain { get; set; }
    public DateTime? Created { get; set; }
    public DateTime? Ended { get; set; }
}
and a class map:
class RecordMapping : ClassMap<Record>
{
    // simplified and omitted fluff for brevity
    public RecordMapping(ConfigurationElement config)
    {
        //..the set up of the mapping...
    }
}
and then use them like this:
public ProcessFile(...)
{
    ...
    using (var reader = new StreamReader(...))
    using (var csvReader = new CsvReader(reader))
    using (var errorWriter = new StreamWriter(...))
    {
        csvReader.Configuration.RegisterClassMap(new RecordMapping(config));
        //...set up of csvReader configuration...
        try
        {
            var records = csvReader.GetRecords<Record>();
        }
        catch (Exception ex)
        {
            //..in case of problems...
        }
        ....
    }
    ....
}
In this scenario the data might be "valid" from CsvHelper's viewpoint, because it can read the data, but invalid for more complex reasons (like an invalid date range).
In that case, this might be a simple approach:
public IEnumerable<Thing> ReadThings(TextReader textReader)
{
    var result = new List<Thing>();
    using (var csvReader = new CsvReader(textReader))
    {
        while (csvReader.Read())
        {
            var thing = csvReader.GetRecord<Thing>();
            if (IsThingValid(thing))
                result.Add(thing);
            else
                LogInvalidThing(thing);
        }
    }
    return result;
}
If what you need to log is the raw text, that would be:
LogInvalidRow(csvReader.Context.RawRecord);
Another option - perhaps a better one - might be to completely separate the validation from the reading. In other words, just read the records with no validation.
var records = csvReader.GetRecords<Record>();
Your reader class returns them without being responsible for determining which are valid and what to do with them.
Then another class can validate an IEnumerable<Record>, returning the valid rows and logging the invalid rows.
That way the logic for validation and logging isn't tied up with the code for reading. It will be easier to test and easier to re-use if you get a collection of Record from something other than a CSV file.
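A minimal sketch of that separation, using the Record class from the question (IsValid and LogInvalidRecord are hypothetical helpers):

using System.Collections.Generic;

public class RecordValidator
{
    // yields valid records; logs the rest
    public IEnumerable<Record> GetValidRecords(IEnumerable<Record> records)
    {
        foreach (var record in records)
        {
            if (IsValid(record))
                yield return record;
            else
                LogInvalidRecord(record);
        }
    }

    private bool IsValid(Record record)
    {
        // example rule from the question: the date range must be valid
        return record.Created == null
            || record.Ended == null
            || record.Created <= record.Ended;
    }

    private void LogInvalidRecord(Record record)
    {
        // write the offending record to the error file
    }
}

The overlap check across multiple rows for the same user would live here too, for example by grouping the records by Login and Domain before yielding them.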

C# - Excel Export a List

Hi, I have this code to export a List to Excel:
private DataTable ListaDatiReportQuietanzamento(List<DatiReportQuietanzamento> datiReportQuietanzamento)
{
    DataTable dt = new DataTable("DatiReportQuietanzamento");
    dt.Columns.Add("Polizza");
    dt.Columns.Add("Posizione");
    dt.Columns.Add("Codice Frazionamento");

    var result = datiReportQuietanzamento.ToDataTable().AsEnumerable().Select(p =>
        new
        {
            n_polizza = p.Field<long>("n_polizza"),
            n_posizione = p.Field<byte>("n_posizione"),
            c_frazionamento = p.Field<string>("c_frazionamento")
        }).Distinct().ToList();

    foreach (var item in result)
    {
        dt.Rows.Add(item.n_polizza, item.n_posizione, item.c_frazionamento);
    }
    return dt;
}
This method works with lists that do not contain many items, but when the list is very large, the method takes too much time.
Is there a way to avoid the foreach and add the items to the rows directly? Maybe with a lambda expression?
Thank you.
While you have not specified how the data is ultimately to be supplied to Excel, generally it is supplied as a CSV (Comma Separated Values) file for easy import.
So this being the case you can eliminate your data table conversion entirely and create a list of strings as follows:
private List<string> ListaDatiReportQuietanzamento(List<DatiReportQuietanzamento> datiReportQuietanzamento)
{
    var result = new List<string>();
    foreach (var item in datiReportQuietanzamento)
    {
        // List<string> has no AppendLine method; Add appends one CSV line
        result.Add($"{item.n_polizza},{item.n_posizione},{item.c_frazionamento}");
    }
    return result;
}
The only simplification I have made is not to worry about escaping: string fields such as item.c_frazionamento should really be escaped, in case they contain commas or quotes.
Instead of doing this all yourself, I suggest you have a look at a NuGet package such as CsvHelper, which will help you create CSV files and takes all the hassle of escaping out of the equation. It can also deal directly with a list of objects and convert it into a CSV file for you; see specifically the first example at https://joshclose.github.io/CsvHelper/writing#writing-all-records
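A minimal sketch of that approach (recent CsvHelper API; the output path is an example):

using System.Globalization;
using System.IO;
using CsvHelper;

// CsvHelper writes a header from the property names and escapes
// commas and quotes in the field values automatically.
using (var writer = new StreamWriter("report.csv"))
using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
{
    csv.WriteRecords(datiReportQuietanzamento);
}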

how to represent a CSV File as a data structure in a C# program

I have a csv file I am going to read from disk. I do not know up front how many columns or the names of the columns.
Any thoughts on how I should represent the fields? Ideally I want to say something like:
string Val = DataStructure.GetValue(i, ColumnName);
where i is the ith Row.
Oh, just as an aside, I will be parsing using the TextFieldParser class:
http://msdn.microsoft.com/en-us/library/cakac7e6(v=vs.90).aspx
That sounds as if you would need a DataTable which has a Rows and Columns property.
So you can say:
string Val = table.Rows[i].Field<string>(ColumnName);
A DataTable is a table of in-memory data. It can be used strongly typed (as suggested with the Field method), but internally it actually stores its data as objects.
You could use this parser to convert the csv to a DataTable.
Edit: I've only just seen that you want to use the TextFieldParser. Here's a possible simple approach to convert a csv to a DataTable:
var table = new DataTable();
using (var parser = new TextFieldParser(File.OpenRead(path)))
{
    parser.Delimiters = new[] { "," };
    parser.HasFieldsEnclosedInQuotes = true;

    // load DataColumns from the first line
    string[] headers = parser.ReadFields();
    foreach (var h in headers)
        table.Columns.Add(h);

    // load all other lines as data
    string[] fields;
    while ((fields = parser.ReadFields()) != null)
    {
        table.Rows.Add().ItemArray = fields;
    }
}
If the column names are in the first row read that and store in a Dictionary<string, int> that maps the column name to the column index.
You could then store the remaining rows in a simple structure like List<string[]>.
To get a column for a row you'd do csv[rowIndex][nameToIndex[ColumnName]];
nameToIndex[ColumnName] gets the column index from the name, and csv[rowIndex] gets the row (string array) we want.
This could of course be wrapped in a class.
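A minimal sketch of such a wrapper (the class and member names are made up for illustration):

using System.Collections.Generic;

// rows stored as string arrays, plus a header-name-to-index map
public class CsvData
{
    private readonly Dictionary<string, int> nameToIndex = new Dictionary<string, int>();
    private readonly List<string[]> rows = new List<string[]>();

    public CsvData(string[] headers, IEnumerable<string[]> records)
    {
        for (int i = 0; i < headers.Length; i++)
            nameToIndex[headers[i]] = i;
        rows.AddRange(records);
    }

    // the accessor shape asked for: string val = data.GetValue(i, columnName);
    public string GetValue(int rowIndex, string columnName)
    {
        return rows[rowIndex][nameToIndex[columnName]];
    }
}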
Use a CSV parser if you want, but a text parser is something very easy to write yourself if you need customization.
For your need, I would use one (or more) Dictionary: at least one mapping the property string to the column index, and maybe the reverse one (column index to property string) if needed.
When I parse a CSV file, I usually put the result in a list while parsing and then into an array once complete, for speed reasons (List.ToArray()).

Print an array/list to excel in c#

I am able to save a single value into Excel, but I need help saving a full list/array into an Excel sheet.
Code I have so far:
var MovieNames = session.Query<Movie>()
                        .ToArray();

List<string> MovieList = new List<string>();
foreach (var movie in MovieNames)
{
    MovieList.Add(movie.MovieName);
}

// If I want to print a single value or a string, I can use the
// following to print/save to Excel. How can I do this if I want
// to print the entire list that's generated in "MovieList"?
return File(new System.Text.UTF8Encoding().GetBytes(MovieList), "text/csv", "demo.csv");
You could use FileHelpers to serialize some strongly typed object into CSV. Just promise me to never roll your own CSV parser.
If you mean you want to create a .csv file with all movie names in one column so you can open it in Excel then simply loop over it:
byte[] content;
using (var ms = new MemoryStream())
{
    using (var writer = new StreamWriter(ms))
    {
        foreach (var movieName in MovieList)
            writer.WriteLine(movieName);
    } // disposing the writer flushes it to the MemoryStream

    // MemoryStream.ToArray is still valid after the stream is closed
    content = ms.ToArray();
}
return File(content, "text/csv", "demo.csv");
Edit
You can add more columns and get fancier with your output, but then you run into the problem that you have to check for special characters which need escaping (like , and "). If you want to do more than just a simple output, I suggest you follow @Darin's suggestion and use the FileHelpers utilities. If you can't or don't want to use them, then this article has an implementation of a CSV writer.
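For reference, a minimal sketch of the escaping such a hand-rolled writer needs, following RFC 4180 conventions (EscapeCsvField is a made-up helper name):

// quote a field if it contains a comma, quote, or line break,
// doubling any embedded quotes (RFC 4180 style)
static string EscapeCsvField(string field)
{
    if (field == null)
        return string.Empty;

    bool needsQuoting = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
    return needsQuoting
        ? "\"" + field.Replace("\"", "\"\"") + "\""
        : field;
}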
