access csv file with LINQ - c#

I have a C# lab question.
This is my code to add data from the CSV file. When I compile it, I get the error: The name "rows" does not exist in the current context.
foreach (string row in rows)
{
if (string.IsNullOrEmpty(row)) continue;
string[] cols = row.Split(',');
DailyValues v = new DailyValues();
v.Open = Convert.To*(cols[0]);
v.High = Convert.To*(cols[1]);
v.Low = Convert.To*(cols[2]);
v.Close = Convert.To* (cols[3]);
v.Volume = Convert.To* (cols[4]);
v.AdjClose = Convert.To*(cols[5]);
v.Date = Convert.To*(cols[6]);
values.Add(v);
return values;
}

It looks like your CSV file has data which can't be converted into a Decimal. Run it in the debugger, and have a look at row when the exception is thrown.
If you use Decimal.TryParse(), the return value will tell you if the conversion was successful without an exception being thrown.
Edit:
As an example for TryParse:
Decimal _Open, _High;
if (!Decimal.TryParse(cols[0], out _Open))
{
Debug.Print("Error on row: {0}", row);
continue;
}
v.Open = _Open;
if (!Decimal.TryParse(cols[1], out _High))
{
Debug.Print("Error on row: {0}", row);
continue;
}
v.High = _High;
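The compile error in the question means that no variable called rows is declared where the loop runs. A minimal sketch of one way to put the pieces together follows. The shape of DailyValues (only two decimal properties shown), the use of File.ReadAllLines, and the invariant culture are assumptions, since the question shows none of them.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical shape of DailyValues; the question does not show the real class.
public class DailyValues
{
    public decimal Open { get; set; }
    public decimal High { get; set; }
}

public static class CsvLoader
{
    // Parses CSV lines into DailyValues, skipping blank or malformed rows.
    public static List<DailyValues> Parse(IEnumerable<string> rows)
    {
        var values = new List<DailyValues>();
        foreach (string row in rows)
        {
            if (string.IsNullOrWhiteSpace(row)) continue;

            string[] cols = row.Split(',');
            if (cols.Length < 2) continue;

            if (!decimal.TryParse(cols[0], NumberStyles.Number, CultureInfo.InvariantCulture, out decimal open) ||
                !decimal.TryParse(cols[1], NumberStyles.Number, CultureInfo.InvariantCulture, out decimal high))
            {
                continue; // for example a header line or a non-numeric cell
            }

            values.Add(new DailyValues { Open = open, High = high });
        }
        return values; // return once, after the loop has processed every row
    }

    public static void Main()
    {
        // In a real program the rows would be declared from the file, for example:
        // string[] rows = System.IO.File.ReadAllLines("data.csv");  (file name is an assumption)
        string[] rows = { "Open,High", "10.5,11.25", "", "bad,row", "9.75,10.00" };

        List<DailyValues> values = Parse(rows);
        Console.WriteLine(values.Count);

        // With the data in a list, LINQ queries work as usual.
        Console.WriteLine(values.Max(v => v.High).ToString(CultureInfo.InvariantCulture));
    }
}
```

Declaring rows before the loop is what removes the compile error; the TryParse checks only guard against rows that cannot be converted.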


Parquet.NET is generating huge parquet files in comparison with pyarrow

My application takes data from Azure Event Hubs (messages have a maximum size of 1 MB), transforms it into a DataTable, and then saves it as a Parquet file somewhere.
The Parquet file generated by Parquet.Net is huge: it is always over 50 MB, even with the best compression method. When I read this 50 MB file using pandas and then rewrite it to another file, it becomes less than 500 KB.
See below the comparison between Parquet.Net (RED) and Pyarrow (BLUE):
As we can see, the number of columns and rows is the same.
I checked the content and it all seems okay.
Note: there is one varchar(8000) column that holds a lot of data.
This is how I got the Parquet metadata:
import pandas as pd
import pyarrow.parquet as pq
# pd.set_option("max_colwidth", None)
# pd.set_option("max_seq_item", None)
# pd.set_option("min_rows", 2)
# pd.set_option("max_rows", None)
# pd.set_option('max_columns', None)
parquet_file_net = pq.ParquetFile("parquetnetFile.parquet")
print(parquet_file_net.metadata)
print()
parquet_file_py = pq.ParquetFile("pyarrowFile.parquet")
print(parquet_file_py.metadata)
print()
print()
print(parquet_file_net.metadata.row_group(0))
print()
print(parquet_file_py.metadata.row_group(0))
My C# code is based on the following one, with some changes:
https://github.com/dazfuller/datatable-to-parquet
So here is my C# code.
public static async Task<MemoryStream> ToParquetStream(DataTable dt)
{
var fields = GenerateSchema(dt);
var parquetStream = new MemoryStream();
using (var writer = new ParquetWriter(new Schema(fields), parquetStream))
{
writer.CompressionMethod = CompressionMethod.Gzip;
writer.CompressionLevel = 2;
var range = Enumerable.Range(0, dt.Columns.Count);
var result = await range.ForEachAsyncInParallel(async c =>
{
return await Task.Run(() =>
{
// Determine the target data type for the column
var targetType = dt.Columns[c].DataType;
if (targetType == typeof(DateTime))
{
targetType = typeof(DateTimeOffset);
}
// Generate the value type, this is to ensure it can handle null values
var valueType = targetType.IsClass ? targetType : typeof(Nullable<>).MakeGenericType(targetType);
// Create a list to hold values of the required type for the column
var valuesArray = Array.CreateInstance(valueType, dt.Rows.Count);
// Get the data to be written to the parquet stream
for (int r = 0; r < dt.Rows.Count; r++)
{
DataRow row = dt.Rows[r];
// Check if value is null, if so then add a null value
if (row[c] == null || row[c] == DBNull.Value)
{
valuesArray.SetValue(null, r);
}
else
{
// Add the value to the list, but if it's a DateTime then create it as a DateTimeOffset first
if (dt.Columns[c].DataType == typeof(DateTime))
{
valuesArray.SetValue(new DateTimeOffset((DateTime)row[c]), r);
}
else
{
valuesArray.SetValue(row[c], r);
}
}
}
return valuesArray;
});
});
using (var rgw = writer.CreateRowGroup())
{
for (int c = 0; c < dt.Columns.Count; c++)
{
rgw.WriteColumn(new Parquet.Data.DataColumn(fields[c], result[c]));
}
}
}
return parquetStream;
}
private static List<DataField> GenerateSchema(DataTable dt)
{
var fields = new List<DataField>(dt.Columns.Count);
foreach (DataColumn column in dt.Columns)
{
// Attempt to parse the type of column to a parquet data type
var success = Enum.TryParse<DataType>(column.DataType.Name, true, out var type);
// If the parse was not successful and it's source is a DateTime then use a DateTimeOffset, otherwise default to a string
if (!success && column.DataType == typeof(DateTime))
{
type = DataType.DateTimeOffset;
}
// In c# float is System.Single. That is why the parse fails
else if (!success && column.DataType == typeof(float))
{
type = DataType.Float;
}
else if (!success)
{
type = DataType.String;
}
fields.Add(new DataField(column.ColumnName, type));
}
return fields;
}
public static async Task<R[]> ForEachAsyncInParallel<T, R>(this IEnumerable<T> list, Func<T, Task<R>> func)
{
var tasks = new List<Task<R>>();
foreach (var value in list)
{
tasks.Add(func(value));
}
return await Task.WhenAll<R>(tasks);
}
So why is the file size so large?
Here are the files generated by parquet.net and pyarrow: https://easyupload.io/m/28jo48

Append string in a specific place of a text file

I have hundreds of files in a directory. Many of the text files have blank values in the Code column, and I need to iterate over all the text files and fill them in. I am able to write code that adds the code value on a new line, but I am not able to write it under the Code column. The string value is "STRINGTOENTER". I only want it entered in the first line after the header. The last line should be left alone.
Id Code File_Number Suffix Check_Number Check_Date
047 7699 01 99999 11/11/2012
1 -6.15
Below is my code snippet that adds the value on a new line. I think I need a regular expression or a tab-delimited type of solution here.
public static void AddAStringtoAllTextFiles()
{
try
{
string path = @"C:\Users\ur\Desktop\TestFiles\";
string[] fileEntries = Directory.GetFiles(path);
for (int i = 0; i < fileEntries.Length; i++)
{
File.AppendAllText(fileEntries[i], "STRINGTOENTER" + Environment.NewLine);
}
}
catch (Exception e)
{
throw e;
}
}
EDITED
Please try this, assuming the file is delimited by one or more spaces.
It works for me in VS2017. Add the following using statements at the top:
using System.Linq;
using System.Text.RegularExpressions;
public static void AddAStringtoAllTextFiles()
{
try
{
string path = @"C:\Users\ur\Desktop\TestFiles\";
var fileEntries = Directory.GetFiles(path);
int indexPosition2InsertData=1;
foreach (var entry in fileEntries)
{
var lines = File.ReadAllLines(entry);
for (var index = 1; index < lines.Length; index++) //starting from first row, leaving the header
{
var split = Regex.Split(lines[index].Trim(), @"\s{1,}"); //reading the line with space(s)
if(split.Length==5) //edited //checking if the row is not blank
{
var list = split.ToList(); //convert to list to insert
list.Insert(indexPosition2InsertData, "STRINGTOENTER"); //inserting at the index 1
lines[index] = string.Join("\t", list);
}
}
File.WriteAllLines(entry, lines);
}
}
catch (Exception e)
{
throw e;
}
}
I am getting this after running the code.
Id Code File_Number Suffix Check_Number Check_Date
047 STRINGTOENTER 7699 01 99999 11/11/2012
1 -6.15
Please let me know if this helps.
Assuming each file is correctly tab-delimited (and that's a big assumption given the question quality):
// Get the files
var fileEntries = Directory.GetFiles(path);
// iterate through each file name
foreach (var entry in fileEntries)
{
// Load the File into the lines array
var lines = File.ReadAllLines(entry);
// Iterate over each line
if(lines.Length >1)
{
// Split the lines by tab
var split = lines[1].Split('\t');
// your code should be at array index 1
split[1] = "STRINGTOENTER";
// write the whole line back
lines[1] = string.Join("\t", split);
// write the file
File.WriteAllLines(entry, lines);
}
}
Note: you should probably do this with a CSV parser; this was only for academic purposes and is totally untested.
I want to show my desired solution based on your input. Amazing how a simple piece of code can contribute to solving a larger and a complex problem. Thanks again!
public static void AddClientCodetoAllTextFiles(string update_batch_with_clientcode, string batchfilepathtobeupdated)
{
try
{
var fileEntries = Directory.GetFiles(batchfilepathtobeupdated.Trim());
foreach (var entry in fileEntries)
{
var lines = File.ReadAllLines(entry);
if (lines.Length > 1)
{
for (int i = 1; i < lines.Length - 1; i++)
{
var split = lines[i].Split('\t');
split[1] = update_batch_with_clientcode.Trim();
lines[i] = string.Join("\t", split);
}
// Write the file once, after all lines have been updated
File.WriteAllLines(entry, lines);
}
}
}
catch (Exception e)
{
throw e;
}
}

c# csv count a specified data in file or in datagridview

I have a CSV file and would like to count how many times the second column contains 111.
The CSV file has 46 columns separated by ; .
"first col" "second col" "....."
abc 111 a
abc 112 b
abc 113 c
abc 111 d
abc 112 e
abc 113 f
I would like to count the 111s.
First I filled the DataGridView from a DataTable.
dgv.DataSource = dgv_table;
string[] raw_text = File.ReadAllLines("d:\\"+lb_csv.Text);
string[] data_col = null;
int x = 0;
foreach (string text_line in raw_text)
{
// MessageBox.Show(text_line);
data_col = text_line.Split(';');
if (x == 0)
{
for (int i = 0; i <= data_col.Count() - 1; i++)
{
dgv_table.Columns.Add(data_col[i]);
}
//header
x++;
}
else
{
//data
dgv_table.Rows.Add(data_col);
}
I found lots of solutions for counting a specific value (111) in the second column,
but every time I had problems.
int xCount = dgv.Rows.Cast<DataGridViewRow>().Select(row => row.Cells["second col"].Value).Where(s => s !=null && Equals(111)).Count();
this.lb_qty.Text = xCount.ToString();
But it gives an error for row.Cells["second col"].Value:
An unhandled exception of type 'System.ArgumentException' occurred in System.Windows.Forms.dll
Additional information: Column named second col cannot be found.
Can someone help me solve this problem and get the needed result?
I would suggest you skip the DataGridView and use a counter variable in your loop, like Arkadiusz suggested.
If you still want to work with the DataTable, count the values like this:
int xCount = dgv_table.Rows.Cast<DataRow>().Count(r => r["second col"] != null && r["second col"].ToString() == "111");
I would try to read the file into a DataTable and use it as DataSource for the DataGridView.
DataTable d_Table = new DataTable();
//fill the DataTable
this.dgv_table.DataSource = d_Table;
To count the rows which contain 111 in the second column, you can select from the DataTable like this:
DataTable d_Table = new DataTable();
//fill the DataTable
DataRow[] rowCount = d_Table.Select("secondCol = '111'");
this.lb_qty.Text = rowCount.Length.ToString();
Or you can do it in a foreach-loop:
int count = 0;
foreach(DataGridViewRow dgr in this.dgv_table.Rows)
{
if(dgr.Cells["secondCol"].Value.ToString() == "111") count++;
}
this.lb_qty.Text = count.ToString();
You can use this method to read the CSV into a list of arrays (List<string[]>):
public static List<string[]> readCSV(String filename)
{
List<string[]> result = new List<string[]>();
try
{
string[] line = File.ReadAllLines(filename);
foreach (string l in line)
{
string[] value = l.Split(','); // use ';' for the file in the question
result.Add(value);
}
}
catch (Exception e)
{
Console.WriteLine("Error: '{0}'", e);
}
return result;
}
Each array represents one row, so project the column you need (index 1 for the second column) from the returned list (tmp here) and then find the frequency of each value using LINQ or a loop:
foreach (var item in tmp.Select(r => r[1]).GroupBy(c => c))
{
Console.WriteLine("{0} : {1}", item.Key, item.Count());
}
int CountValues(string input, string searchedValue, int ColumnNumber, bool skipFirstLine = false)
{
int numberOfSearchedValue= 0;
string line;
using (StreamReader reader = new StreamReader (input))
{
if(skipFirstLine)
reader.ReadLine();
while ((line = reader.ReadLine()) != null)
{
if(line.Split(';')[ColumnNumber] == searchedValue)
numberOfSearchedValue++;
}
}
return numberOfSearchedValue;
}
Edit:
StreamReader.ReadLine() reads a line and advances to the next one, so calling it once up front skips the header. When there are no more lines it returns null, which is the loop's ending condition. The rest of the code is readable, I think :)
I didn't test this, so be careful :)
It might be necessary to use Trim() or ToUpper() in some places (as is usual when searching).
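The same count can also be written as a short LINQ query over the raw lines. This is a sketch only: the helper name CountInColumn, the sample lines, and the assumption that the first line is a header are illustrative and not taken from the question's real file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ColumnCounter
{
    // Counts how many data lines hold `wanted` in the given zero-based column.
    public static int CountInColumn(IEnumerable<string> lines, int columnIndex, string wanted, char separator = ';')
    {
        return lines
            .Skip(1)                                      // skip the header line
            .Select(line => line.Split(separator))
            .Where(cols => cols.Length > columnIndex)     // ignore short or empty lines
            .Count(cols => cols[columnIndex].Trim() == wanted);
    }

    public static void Main()
    {
        // Stand-in for File.ReadLines(path); the content mirrors the question's sample.
        string[] lines =
        {
            "first col;second col;third col",
            "abc;111;a",
            "abc;112;b",
            "abc;113;c",
            "abc;111;d"
        };

        Console.WriteLine(CountInColumn(lines, 1, "111"));
    }
}
```

Counting from the file directly avoids depending on the DataGridView column name, which is what raised the ArgumentException in the question.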

Parsing data from CSV file

As in the title, I have a problem parsing data from CSV files. When I choose a file with different formatting, all I get is "Input string was not in a correct format".
My code works with files formatted like this:
16.990750 4.0
17.000250 5.0
17.009750 1.0
17.019250 6.0
But it can't handle files formatted like this one:
Series1 - X;Series1 - Y;
285.75;798
285.79;764
285.84;578
285.88;690
This is the code responsible for reading data from the file and creating a chart from it:
if (openFileDialog1.ShowDialog() == DialogResult.OK)
{
string cos = File.ReadAllText(openFileDialog1.FileName);
string[] rows = cos.Split(new string[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);
DataTable table = new DataTable();
table.Columns.Add("xValue", typeof(decimal));
table.Columns.Add("yValue", typeof(decimal));
foreach (string row in rows)
{
string[] values = row.Split(' ');
DataRow ch = table.NewRow();
ch[0] = Decimal.Parse(values[0], NumberStyles.AllowDecimalPoint, CultureInfo.InvariantCulture);
ch[1] = Decimal.Parse(values[1], NumberStyles.AllowDecimalPoint, CultureInfo.InvariantCulture);
table.Rows.Add(ch);
}
if (seria == false)
{
wykres.Series.Add("series");
wykres.Series["series"].ChartType = System.Windows.Forms.DataVisualization.Charting.SeriesChartType.Line;
wykres.Series["series"].XValueMember = "xValue";
wykres.Series["series"].YValueMembers = "yValue";
wykres.DataSource = table;
wykres.DataBind();
seria = true;
}
}
EDIT
I changed the parsing method to this one:
foreach (string row in rows)
{
var values = row.Split(';');
var ch = table.NewRow();
decimal num = 0;
if (decimal.TryParse(values[0], out num))
ch[0] = num;
if (decimal.TryParse(values[1], out num))
ch[1] = num;
table.Rows.Add(ch);
}
It works, with one exception: it reads only integers from the CSV file, not decimals (see the picture below).
View of table in locals
Why is this happening?
I suggest you don't reinvent the wheel but use a well-tested library to parse the CSV (for example, your implementation does not handle quoted values well, and it doesn't allow the separator as part of a value).
And guess what: .NET includes something that could help you: the TextFieldParser class. Don't worry about the Microsoft.VisualBasic namespace; it works in C#, too :-)
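A minimal sketch of reading a semicolon-delimited file with TextFieldParser might look like the following. The class and method names and the sample text are hypothetical; a StringReader stands in for the real file so the example is self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class SemicolonCsv
{
    // Reads every record from the input and returns its fields as strings.
    public static List<string[]> ReadRows(TextReader input)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(input))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(";");
            parser.HasFieldsEnclosedInQuotes = true; // lets a quoted value contain the separator

            while (!parser.EndOfData)
            {
                rows.Add(parser.ReadFields());
            }
        }
        return rows;
    }

    public static void Main()
    {
        // Sample data only; in the question the text would come from openFileDialog1.FileName.
        string sample = "Series1 - X;Series1 - Y;\r\n285.75;798\r\n\"a;b\";764\r\n";

        foreach (string[] fields in ReadRows(new StringReader(sample)))
        {
            Console.WriteLine(string.Join(" | ", fields));
        }
    }
}
```

The parser only splits the text into fields; converting those fields to numbers is still a separate step.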
foreach (string row in rows)
{
var values = row.Split(';');
var ch = table.NewRow();
decimal num = 0;
if (decimal.TryParse(values[0], out num))
ch[0] = num;
if (decimal.TryParse(values[1], out num))
ch[1] = num;
table.Rows.Add(ch);
}
In the second file format the separator is a semicolon (;) and the first row contains two strings rather than numbers. To convert a string to decimal safely, use decimal.TryParse() instead of decimal.Parse().
TryParse() returns a bool: true means the string was converted successfully.
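One possible reason why values such as 285.75 are rejected while 798 is accepted is the culture: decimal.TryParse(string, out decimal) uses the current culture, and many cultures use a comma as the decimal separator. The question does not state the machine's culture, so this is an assumption. A sketch that parses with the invariant culture, as the original Decimal.Parse call did (the helper name is hypothetical):

```csharp
using System;
using System.Globalization;

public static class InvariantParsing
{
    // Parses text that uses '.' as the decimal separator, regardless of the machine's culture.
    public static bool TryParseInvariant(string text, out decimal value)
    {
        return decimal.TryParse(text, NumberStyles.Float, CultureInfo.InvariantCulture, out value);
    }

    public static void Main()
    {
        string[] samples = { "285.75", "798", "Series1 - X" };
        foreach (string sample in samples)
        {
            if (TryParseInvariant(sample, out decimal number))
            {
                Console.WriteLine(number.ToString(CultureInfo.InvariantCulture));
            }
            else
            {
                Console.WriteLine("Skipped: " + sample); // header text or other non-numeric input
            }
        }
    }
}
```

Rows for which the helper returns false (such as the header line) can be skipped instead of being added to the table with empty cells.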

Casting a String as Integer in a .net DataTable

Disclaimer: This is my very first .NET C# project.
I am attempting to import a CSV into MSSQL but need to iterate through the CSV values first for sanitization purposes. Some of the columns in the CSV will be integers (they will be used for calculations later) and some are regular varchar.
My code below appears to force all values (that is, row column values) in the DataTable to be strings, which throws an exception later in my application when SQL cannot write a string as an integer.
Here is the method I am using for the CSV import, which creates a DataTable and populates it.
What I am thinking is to add another condition which checks whether the value is an integer and then casts it to an integer (this kind of thing is new to me, as PHP does not handle types so strictly), but I fear that won't work, as I am not sure whether I can mix values of various types within a DataTable.
So my question is: is there a way for me to have different values in a DataTable as different types? My code below takes the line as a whole and writes it as strings; I need the values to be assigned either as string or as integer.
/*
* getCsvData()
* This method will create a datatable from the CSV file. We'll take the CSV file as is.
* and collect the data as needed:
*
* - Remove those original 4 lines (worthless info)
* - Line 5 starts with the headers, remove any of the brackets around the values
* - Iterate through the rest of the fields and sanitize them before we add it to the datatable
*
*/
private DataTable getCsvData(string csv_file_path)
{
// Create a new csvData DataTable object:
DataTable csvData = new DataTable();
try
{
using (TextFieldParser csvReader = new TextFieldParser(csv_file_path))
{
csvReader.SetDelimiters(new string[] { "," });
csvReader.HasFieldsEnclosedInQuotes = true;
int row = 1;
while (!csvReader.EndOfData)
{
// Read the string and collect the row data
string[] rowData = csvReader.ReadFields();
if (row <= 4)
{
// We want to start on row 5 as first rows are nonsense :)
// Increment the row so that we can do our magic above
row++;
continue;
} if(row == 5)
{
// Row 5 is the headers, we need to sanitize and continue:
foreach (string column in rowData)
{
// Remove the [ ] from the values:
var col = column.Substring(1, column.Length - 2);
DataColumn datecolumn = new DataColumn(col);
datecolumn.AllowDBNull = true;
csvData.Columns.Add(datecolumn);
}
// Increment the row so that we can do our magic above
row++;
} else
{
// These are all of the actual rows, sanitize and add the rows:
//Making empty value as null
for (int i = 0; i < rowData.Length; i++)
{
// First remove the brackets:
if (rowData[i].Substring(0,1) == "[")
{
rowData[i] = rowData[i].Substring(1, rowData[i].Length - 2);
}
// Set blank to null:
if (rowData[i] == "" || rowData[i] == "-")
{
rowData[i] = null;
}
// Lastly, we need to do some calculations:
}
// Add the sanitized row to the DataTable:
csvData.Rows.Add(rowData);
}
}
}
}
catch (Exception ex)
{
throw new Exception("Could not parse the CSV file: "+ ex.Message);
}
return csvData;
}
You can parse the string to an int:
int j;
bool parsed = Int32.TryParse("-105", out j);
With TryParse you can check whether it succeeded.
Then, when you want to save it to the table again, convert it back to a string. You can simply call <variable>.ToString().
By default, data columns are initialized to a string data type.
There's an overload that allows you to specify the type, so I'd suggest you try that. Since your columns are known beforehand, you can easily handle this in your code.
private DataColumn AddColumn(string columnName, Type columnType)
{
// Remove the [ ] from the values:
var col = columnName.Substring(1, columnName.Length - 2);
DataColumn dataColumn = new DataColumn(col, columnType);
dataColumn.AllowDBNull = true;
return dataColumn;
}
if (row == 5)
{
csvData.Columns.Add(AddColumn(rowData[0], typeof(string)));
csvData.Columns.Add(AddColumn(rowData[1], typeof(int)));
csvData.Columns.Add(AddColumn(rowData[2], typeof(DateTime)));
csvData.Columns.Add(AddColumn(rowData[3], typeof(string)));
// etc
}
I'm not sure you'll even need to convert the other values before adding them to the DataTable, but if you do, many built-in types have TryParse methods, such as DateTime.TryParse and Int32.TryParse. You can call each of them in succession, and when one of the "tries" succeeds, you'll know your type.
Alternatively, since you know the column types beforehand, you can just convert each value.
csvData.Rows.Add(Convert.ToString(rowData[0]),
Convert.ToInt32(rowData[1]),
Convert.ToDateTime(rowData[2]),
Convert.ToString(rowData[3]));
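Convert.ToInt32 and Convert.ToDateTime throw on text that is not a valid value, so a TryParse-based variant may be safer for unsanitized input. The sketch below is one way to do it under stated assumptions: the ConvertField helper, the column names, and the sample row are hypothetical, and blank or unparsable cells are stored as DBNull.

```csharp
using System;
using System.Data;
using System.Globalization;

public static class TypedRows
{
    // Converts a raw CSV field to the column's type; blank or unparsable values become DBNull.
    public static object ConvertField(string raw, Type targetType)
    {
        if (string.IsNullOrEmpty(raw)) return DBNull.Value;

        if (targetType == typeof(int))
        {
            return int.TryParse(raw, NumberStyles.Integer, CultureInfo.InvariantCulture, out int i)
                ? (object)i : DBNull.Value;
        }
        if (targetType == typeof(DateTime))
        {
            return DateTime.TryParse(raw, CultureInfo.InvariantCulture, DateTimeStyles.None, out DateTime d)
                ? (object)d : DBNull.Value;
        }
        return raw; // string columns keep the text as-is
    }

    public static void Main()
    {
        var table = new DataTable();
        table.Columns.Add("Name", typeof(string));
        table.Columns.Add("Quantity", typeof(int));
        table.Columns.Add("Created", typeof(DateTime));

        string[] rowData = { "widget", "-105", "2016-02-29" }; // sample values only

        var typed = new object[rowData.Length];
        for (int c = 0; c < rowData.Length; c++)
        {
            typed[c] = ConvertField(rowData[c], table.Columns[c].DataType);
        }
        table.Rows.Add(typed);

        Console.WriteLine(table.Rows[0]["Quantity"].GetType().Name);
    }
}
```

Because each DataColumn carries its own DataType, one table can hold strings, integers, and dates side by side, which is what the question asks for.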
I would use *.TryParse(). For example, with this sample CSV:
*A sample csv file with
*some comment lines at top
-- with different comment
// comment strings.
[charField],[dateField],[intField],[decimalField]
"Sample char data 1",2016/1/2,123,123.45
"Sample char data 2",,2,1.5
"Sample char data 3",,3,
"Sample char data 4",,,
,,,
"Sample char data 6",2016/2/29 10:20,10,20.5
You might use TryParse on those datetime, int, decimal fields:
void Main()
{
var myData = ReadMyCSV(@"c:\MyPath\MyFile.csv");
// do whatever with myData
}
public IEnumerable<MyRow> ReadMyCSV(string fileName)
{
using (TextFieldParser tfp = new TextFieldParser(fileName))
{
tfp.HasFieldsEnclosedInQuotes = true;
tfp.SetDelimiters(new string[] { "," });
//tfp.CommentTokens = new string[] { "*","--","//" };
// instead of using comment tokens we are going to skip 4 lines
for (int j = 0; j < 4; j++)
{
tfp.ReadLine();
}
// header line.
tfp.ReadLine();
DateTime dt;
int i;
decimal d;
while (!tfp.EndOfData)
{
var data = tfp.ReadFields();
yield return new MyRow
{
MyCharData = data[0],
MyDateTime = DateTime.TryParse(data[1], out dt) ? dt : (DateTime?)null,
MyIntData = int.TryParse(data[2], out i) ? i : 0,
MyDecimal = decimal.TryParse(data[3], System.Globalization.NumberStyles.Any, null, out d) ? d : 0M
};
}
}
}
public class MyRow
{
public string MyCharData { get; set; }
public int MyIntData { get; set; }
public DateTime? MyDateTime { get; set; }
public decimal MyDecimal { get; set; }
}
I could further sanitize the data loaded, such as:
myData.Where( d => d.MyIntData != 0 );
Note: I didn't use a DataTable, which I could if I wanted to. For MSSQL loading, I would probably use an intermediate in-memory SQLite instance to save the sanitized data and then push to MSSQL using SqlBulkCopy class. A DataTable is of course an option (I just think it is less flexible).
