First time using the CsvReader (from the CsvHelper library) - note it requires a custom class that defines the headers found in the CSV file.
class DataRecord
{
    // Should have properties which correspond to the column names in the file
    public String Amount { get; set; }
    public String InvoiceDate { get; set; }
    // ......
}
The example given then uses the class like so:
using (var sr = new StreamReader(@"C:\Data\Invoices.csv"))
{
    var reader = new CsvReader(sr);
    // CsvReader will now read the whole file into an enumerable
    IEnumerable<DataRecord> records = reader.GetRecords<DataRecord>();
    // First 5 records in the CSV file will be printed to the Output window
    foreach (DataRecord record in records.Take(5))
    {
        Debug.Print("{0} {1}, {2}", record.Amount, record.InvoiceDate, ....);
    }
}
Two questions:
1. The app will be loading in files with differing headers, so I need to be able to update this class on the fly - is this possible, and how? (I am able to extract the headers from the CSV file.)
2. The CSV file is potentially multi-millions of rows (GB size), so is this the best / most efficient way of importing the file? The destination is a SQLite DB - the Debug line is just used as an example.
Thanks
The app will be loading in files with differing headers so I need to be able to update this class on the fly - is this possible & how?
Although it is definitely possible with reflection or third-party libraries, creating an object per row will be inefficient for such big files. Moreover, using C# for such a scenario is a bad idea (unless you have some business data transformation). I would consider something like this, or perhaps an SSIS package.
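That said, for the first question specifically: CsvHelper itself can read records dynamically, so a hand-written class per file layout isn't strictly required. A minimal sketch (assuming a recent CsvHelper version, whose constructor also takes a CultureInfo; the path is the one from the question):

using System.Collections.Generic;
using System.Globalization;
using System.IO;
using CsvHelper;

using (var sr = new StreamReader(@"C:\Data\Invoices.csv"))
using (var csv = new CsvReader(sr, CultureInfo.InvariantCulture))
{
    // Each record is an ExpandoObject keyed by the header names,
    // so no compile-time class is needed.
    foreach (IDictionary<string, object> row in csv.GetRecords<dynamic>())
    {
        // row["Amount"], row["InvoiceDate"], ... whatever headers the file has
    }
}

Note that GetRecords enumerates lazily, so rows can be inserted into SQLite in batches without materialising the whole multi-GB file in memory.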
Related
I am very new to C# (been learning for approximately 6 months in my free time).
I have produced a program which can store data in an SQLite database, and it seems to work fine for the moment. I am having trouble printing the entire table and/or specific data from the table. When I get to the point where it should print the data, it gives me the following:
(screenshot omitted: the console printed the object's type name rather than the row data)
This is the solution name and the data class (screenshot also omitted).
Here is the main program code relating to loading data from the database:
private void LoadGymList()
{
    Console.WriteLine("Load Gym List Working");
    List<GymEntry> entries = new List<GymEntry>();
    entries = SQLiteDataAccess.LoadEntry();
    Console.WriteLine(entries[0]);
}
Here is the SQLiteDataAccess code for loading entries:
using System.Collections.Generic;
using System.Data;
using System.Data.SQLite;
using System.Linq;
using Dapper;

namespace GymAppLists
{
    public class SQLiteDataAccess
    {
        public static List<GymEntry> LoadEntry()
        {
            // Dapper's Query<T> maps each row of GeneralLog onto a GymEntry
            using (IDbConnection cnn = new SQLiteConnection(LoadConnectionString()))
            {
                var output = cnn.Query<GymEntry>("select * from GeneralLog", new DynamicParameters());
                return output.ToList();
            }
        }
    }
}
The only other files are app.config and a basic class with 'Id' and 'Date' properties, which correspond to the only two columns in the database's single table.
I tried printing individual indexes of the list to see if it would print those, but it simply gave me the same output. I am stumped as to why it works this way. It is clearly accessing the database, but the output must be formatted incorrectly, or I am not using the correct method to access the specifics of the data.
If I print list.Count, for example, it gives me the correct number of rows in the DB.
I imagine this is a very simple fix, any advice would be greatly appreciated.
thank you,
JH.
You are only writing the first element of the entries list (Console.WriteLine(entries[0])), and you are printing the object itself, which falls back to the type name because ToString() is not overridden. Instead, use a loop and print the properties, i.e.:
foreach (var x in entries)
{
    Console.WriteLine($"Id:{x.Id}, Date:{x.Date}");
}
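Alternatively (a sketch, assuming GymEntry has just the Id and Date columns mentioned in the question), you can override ToString() so that printing an entry directly shows its data:

public class GymEntry
{
    public int Id { get; set; }
    public string Date { get; set; }

    // Console.WriteLine calls ToString(), which by default returns the
    // type name; overriding it makes the row data print directly.
    public override string ToString() => $"Id:{Id}, Date:{Date}";
}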
As the question says, using the FileHelpers library I am attempting to generate a CSV file alongside a report file. The report file may have different (but finite) inputs/data structures, and hence my CSV generation method is not explicitly typed. The CSV contains all of the report data as well as the report's header information. For my headers, I am using the class object properties, because they are descriptive enough for my end-use purpose.
My relevant code snippet is below:
// File location, where the .csv goes and gets stored.
string filePath = Path.Combine(destPath, fileName);
// First, write report header details based on header list
Type type = DetermineListType(headerValues);
var headerEngine = new FileHelperEngine(type);
headerEngine.HeaderText = headerEngine.GetFileHeader();
headerEngine.WriteFile(filePath, (IEnumerable<object>)headerValues);
// Next, append the report data below the report header data.
type = DetermineListType(reportData);
var reportDataEngine = new FileHelperEngine(type);
reportDataEngine.HeaderText = reportDataEngine.GetFileHeader();
reportDataEngine.AppendToFile(filePath, (IEnumerable<object>)reportData);
When this is executed, the CSV is successfully generated; however, the .AppendToFile() method does not add the reportDataEngine.HeaderText. From the documentation I do not see this functionality in .AppendToFile(), and I am wondering if anyone has a known work-around, or a suggestion for how to output the headers of two different class objects in a single CSV file using FileHelpers.
The desired output would look something like this, but in a single CSV file (this would be one contiguous CSV, obviously, not two separate tables):
Report_Name,Operator,Timestamp
Access Report,User1,14:50:12 28 Dec 2020
UserID,Login_Time,Logout_Time
User4,09:33:23,10:45:34
User2,11:32:11,11:44:11
User4,15:14:22,16:31:09
User1,18:55:32,19:10:10
I have also looked at the MultiRecordEngine in FileHelpers, and while I think it may be helpful, I cannot figure out from the examples how to actually write a multi-record CSV file in the fashion required above, if it is possible at all.
Thank you!
The best way is to merge the columns and make one big table, then make your classes match the columns you need, so you can separate them out when reading. CSV only allows the first row to define the column names, and even that is optional depending on your use case. Look at CsvHelper (https://joshclose.github.io/CsvHelper/); it has a lot of built-in features and lots of examples. Let me know if you need additional help.
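If the two record types must stay separate, one possible workaround (an untested sketch reusing the engines from the question) is to write the second header row yourself before appending, since .AppendToFile() skips HeaderText by design:

// First section: report header details with their own header row
var headerEngine = new FileHelperEngine(DetermineListType(headerValues));
headerEngine.HeaderText = headerEngine.GetFileHeader();
headerEngine.WriteFile(filePath, (IEnumerable<object>)headerValues);

// Manually append the second header row, then the report data below it
var reportDataEngine = new FileHelperEngine(DetermineListType(reportData));
File.AppendAllText(filePath, reportDataEngine.GetFileHeader() + Environment.NewLine);
reportDataEngine.AppendToFile(filePath, (IEnumerable<object>)reportData);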
I am using Parquet.Net to read parquet files, but the only option to read from the parquet file is:
//get the first group
Parquet.ParquetRowGroupReader rowGroup = myParquet.OpenRowGroupReader(0);
//gets the first column
Parquet.Data.DataColumn col1 = rowGroup.ReadColumn(myParquet.Schema.GetDataFields()[0]);
This allows me to get the first column from the first row group, but the problem is, the first row group can be something like 4 million rows, and ReadColumn will read all 4 million values.
How do I tell ReadColumn that I only want it to read, say, the first 100 rows? Reading all 4 million rows wastes memory and file-read time.
I actually got a memory error until I changed my code to resize that 4-million-value array down to my 100 after reading each column.
I don't necessarily need row-based access; I can work with columns. I just don't need a whole row group's worth of values in each column. Is this possible? If row-based access is better, how does one use it? The Parquet.Net project site doesn't give any examples and just talks about tables.
According to the source code, this capability exists in DataColumnReader, but that is an internal class and thus not directly usable.
ParquetRowGroupReader uses it inside its ReadColumn method, but exposes no such options.
What can be done in practice is copying the whole DataColumnReader class and using it directly, though this could breed future compatibility issues.
If the problem can wait for some time, I'd recommend copying the class and then opening an issue plus a pull request to the library with the enhanced class, so the copied class can eventually be removed.
If you look at the parquet-dotnet documentation, they do not recommend writing more than 5,000 records into one row group for performance reasons, though at the bottom of the page they say row groups are designed to hold 50,000 rows on average:
It's not recommended to have more than 5'000 rows in a single row
group for performance reasons
We are working with 100,000 rows per row group on my team. Overall it may depend on what you are storing, but 4,000,000 records in one row group does sound like too much.
So, to answer your question: to read only part of a column, make the row groups smaller and then read only as many row groups as you need. If you want to read only 100 records, read the first row group and take the first 100 from it; reasonably sized row groups are very fast to read.
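For example (a rough sketch using Parquet.Net 3.x-style APIs, which may differ in other versions; "data.parquet" is a placeholder path):

using System.IO;
using System.Linq;
using Parquet;
using Parquet.Data;

using (Stream stream = File.OpenRead("data.parquet"))
using (var myParquet = new ParquetReader(stream))
{
    DataField field = myParquet.Schema.GetDataFields()[0];
    using (ParquetRowGroupReader rowGroup = myParquet.OpenRowGroupReader(0))
    {
        // ReadColumn still materialises the whole row group, which is why
        // small row groups matter; then slice off just the rows you need.
        DataColumn col = rowGroup.ReadColumn(field);
        var first100 = col.Data.Cast<object>().Take(100).ToArray();
    }
}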
ParquetSharp should be able to do that. It's a wrapper around the Apache Parquet C++ library (part of the Arrow project), but it supports Windows, Linux and macOS.
using System;
using ParquetSharp;

using (var reader = new ParquetFileReader(path))
using (var group = reader.RowGroup(0))
{
    // You can use the logical reader for automatic conversion to a fitting CLR type,
    // here `double` as an example
    // (unfortunately this does not work well with complex schemas IME)
    const int batchSize = 4000;
    Span<double> buffer = new double[batchSize];
    var howManyRead = group.Column(0).LogicalReader<double>().ReadBatch(buffer);

    // or if you want raw Parquet (with Dremel data and physical type)
    var resultObject = group.Column(0).Apply(new Visitor());
}
class Visitor : IColumnReaderVisitor<object>
{
    public object OnColumnReader<TValue>(ColumnReader<TValue> columnReader)
        where TValue : unmanaged
    {
        // TValue will be the physical Parquet type
        const int batchSize = 200000;
        var buffer = new TValue[batchSize];
        var definitionLevels = new short[batchSize];
        var repetitionLevels = new short[batchSize];
        long valuesRead;
        var levelsRead = columnReader.ReadBatch(batchSize,
            definitionLevels, repetitionLevels,
            buffer, out valuesRead);
        // Return stuff you are interested in here; it becomes `resultObject` above
        return new object();
    }
}
This question already has answers here: Reading CSV files using C# (12 answers). Closed 5 years ago.
I've got ~500 CSV files without headers, and a legacy BASIC program where the data is imported and saved as variables:
OPEN "filename" FOR INPUT AS #1
INPUT #1, var1, var2, var3$, var4, etc
Most of the files have > 60 fields, and therefore I do not think the answer given here is applicable.
I've so far been unable to find a way to do this in C#.
The new program is a Windows Forms application, and I'm going to have classes for certain objects; the data in the CSVs relates to properties of those objects. I'm going to initialize each object using either a string giving the file to open, or a DataSet, if that is the correct way to go about it.
Does anyone know of a way to do this in C#, either using 3rd party libraries or not?
I recommend using CsvHelper or FileHelpers.
Example with CsvHelper
Create a class with the structure of your CSV:
public class Record
{
    public string Field1 { get; set; }
    public int Field2 { get; set; }
    public double Field3 { get; set; }
}
Read all records:
using (var sr = new StreamReader("yourFile.csv"))
{
    var csv = new CsvReader(sr);
    var records = csv.GetRecords<Record>().ToList();
}
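Since the files in the question have no header row, CsvHelper also needs to be told not to expect one, with the fields mapped by position. A sketch (assuming a recent CsvHelper version, where the configuration takes a CultureInfo and fields can be mapped with the Index attribute):

using System.Globalization;
using System.IO;
using System.Linq;
using CsvHelper;
using CsvHelper.Configuration;
using CsvHelper.Configuration.Attributes;

public class Record
{
    [Index(0)] public string Field1 { get; set; }
    [Index(1)] public int Field2 { get; set; }
    [Index(2)] public double Field3 { get; set; }
}

var config = new CsvConfiguration(CultureInfo.InvariantCulture)
{
    HasHeaderRecord = false // the legacy files have no header row
};

using (var sr = new StreamReader("yourFile.csv"))
using (var csv = new CsvReader(sr, config))
{
    var records = csv.GetRecords<Record>().ToList();
}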
We are using FileHelpers 2.0 in our project. I have my record defined and the data is being imported correctly. After getting my array of records:
var engine = new FileHelperEngine<UserRecord>();
engine.ErrorManager.ErrorMode = ErrorMode.SaveAndContinue;
UserRecord[] importedUsers = engine.ReadFile(_filepath);
After getting the records that errored due to formatting issues, I iterate through the importedUsers array and do validation to check whether the imported information is valid.
If the data is not valid, I want to be able to log the entire string of the original record from my file.
Is there a way to store the entire "RecordString" in the UserRecord class when the FileHelperEngine reads each of the records?
We do that often at work by handling the BeforeRead event and storing the raw line in a field, mOriginalString, that is marked this way:
[FieldNotInFile]
public string mOriginalString;
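For reference, a rough sketch of wiring that up (in the FileHelpers versions I've seen, the AfterReadRecord event exposes both the parsed record and the raw line; exact event and property names may differ between releases):

var engine = new FileHelperEngine<UserRecord>();
engine.AfterReadRecord += (eng, e) =>
{
    // e.RecordLine is the raw text of the line just parsed;
    // stash it on the record for later validation logging.
    e.Record.mOriginalString = e.RecordLine;
};
UserRecord[] importedUsers = engine.ReadFile(_filepath);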
You must use the latest version of the library, from here:
http://teamcity.codebetter.com/repository/download/bt65/20313:id/FileHelpers_2.9.9_ReleaseBuild.zip
Cheers