How can I convert parquet-dotnet's columns to individual models?

parquet-dotnet has an example I'm trying to work with that looks like this:
using (Stream fileStream = System.IO.File.OpenRead("c:\\test.parquet"))
{
    using (var parquetReader = new ParquetReader(fileStream))
    {
        DataField[] dataFields = parquetReader.Schema.GetDataFields();
        for (int i = 0; i < parquetReader.RowGroupCount; i++)
        {
            using (ParquetRowGroupReader groupReader = parquetReader.OpenRowGroupReader(i))
            {
                DataColumn[] columns = dataFields.Select(groupReader.ReadColumn).ToArray();
            }
        }
    }
}
The concern I have is with the columns line. If I have data that looks like this, from a table perspective:
ID | Name
1  | Test1
1  | Test2
I want to map this data from the parquet file to a model that looks exactly like that. The issue that I have is that the data comes out from columns looking like this:
columns[0].Data[0] - 1
columns[0].Data[1] - 1
columns[1].Data[0] - Test1
columns[1].Data[1] - Test2
This might be a little hard to understand, but essentially the columns variable is a collection with one entry per property, and each entry holds an array of values. That array contains every value in the table for that column. So I'm having a hard time figuring out how to match the value at each array position with the value at the same position in the other columns, so that each row stays together.
Also, I can't use the normal deserialization because the parquet file has oddly named properties such as __$something, which I can't map to a similarly named property. Any ideas?
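
One way to keep the values together is to walk the arrays by row index, since position n in every column belongs to the same table row. The following is a minimal sketch rather than the library's own approach. It assumes every column in a row group has the same length. The Record class, the ToRecords helper and the field name "__$name" are hypothetical, and plain arrays stand in for each column's Data so the example runs without parquet-dotnet. Keying the arrays by field name also sidesteps the odd property names, because the mapping to model properties is written by hand.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model matching the ID/Name table from the question.
public class Record
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class Program
{
    // columnData maps a field name to that column's value array
    // (a stand-in for columns[i].Data keyed by the column's field name).
    public static List<Record> ToRecords(IDictionary<string, Array> columnData)
    {
        Array ids = columnData["ID"];
        Array names = columnData["__$name"]; // odd names are fine as dictionary keys

        if (ids.Length != names.Length)
            throw new ArgumentException("All columns must have the same number of values.");

        var records = new List<Record>(ids.Length);
        for (int row = 0; row < ids.Length; row++)
        {
            // The same index in every column belongs to the same table row.
            records.Add(new Record
            {
                Id = Convert.ToInt32(ids.GetValue(row)),
                Name = (string)names.GetValue(row)
            });
        }
        return records;
    }

    public static void Main()
    {
        var columnData = new Dictionary<string, Array>
        {
            ["ID"] = new[] { 1, 1 },
            ["__$name"] = new[] { "Test1", "Test2" }
        };

        foreach (Record r in ToRecords(columnData))
            Console.WriteLine($"{r.Id} {r.Name}");
    }
}
```

In the reader loop from the question, the dictionary would be built once per row group from columns, using whatever name each column's field reports, and the resulting records appended to one list across all row groups.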

Related

Reading specific columns from DBF (Visual FoxPro) file in C#

I have been using DbfDataReader to read DBF files in my C# application. So far, I can read column name, column index, and iterate through the records successfully. There does not appear to be a way to read specific column data I'd like without using the column index. For example, I can get at the FIRSTNAME value with a statement like:
using DbfDataReader;
var dbfPath = "/CONTACTS.DBF";
using (var dbfTable = new DbfTable(dbfPath, EncodingProvider.UTF8))
{
    var dbfRecord = new DbfRecord(dbfTable);
    while (dbfTable.Read(dbfRecord))
    {
        Console.WriteLine(dbfRecord.Values[1].ToString()); // would prefer to use something like dbfRecord.Values["FIRSTNAME"].ToString()
        Console.WriteLine(dbfRecord.Values[2].ToString()); // would prefer to use something like dbfRecord.Values["LASTNAME"].ToString()
    }
}
Here 1 is the index of the FIRSTNAME column and 2 is the index of the LASTNAME column. Is there any way to use "FIRSTNAME" (or the column name) as the key (or accessor) for what is essentially a name/value pair? My goal is to get all of the columns I care about without having to first build this map each time. (Please forgive me if the terms I am using are not exactly right.)
Thanks so much for taking a look at this...

Use the DbfDataReader class as below:
var dbfPath = "/CONTACTS.DBF";
var options = new DbfDataReaderOptions
{
    SkipDeletedRecords = true,
    Encoding = EncodingProvider.UTF8
};
using (var dbfDataReader = new DbfDataReader.DbfDataReader(dbfPath, options))
{
    while (dbfDataReader.Read())
    {
        // The indexer accepts the column name, so no index lookup is needed.
        Console.WriteLine(dbfDataReader["FIRSTNAME"]);
        Console.WriteLine(dbfDataReader["LASTNAME"]);
    }
}

Extract text from specific columns in c#?

I have been working on extracting text from a csv file and storing the data in a string. Now I would like to extract text from only some specific columns and store that in a string. I would like the wordDocContents variable to contain just those columns and their data: bank_account, bank_name, customer_name. Currently, wordDocContents holds the entire contents of my csv file. Is there a way to filter out the specific columns and their data and store that in the variable wordDocContents? Thanks
Here is what I tried so far -
public void button1Clicked(object sender, EventArgs args)
{
    button1.Text = "You clicked me";
    var textExtractor = new TextExtractor();
    var wordDocContents = textExtractor.Extract("t.csv");
    Console.WriteLine(wordDocContents);
    Console.ReadLine();
}
The contents of wordDocContents:-
ACCOUNT_NUMBER,CUSTOMER_NAMES,VALUE_DATE,BOOKING_DATE,TRANSACTION,ACCOUNT_TYPE,BALANCE_TYPE,REFERENCE,MONEY.OUT,MONEY.IN,RUNNING.BALANCE,BRANCH,EMAIL,ACTUAL.BALANCE,AVAILABLE.BALANCE
1000000001,TEST,,2847899,KES,Account,,,10/10/2016,9/11/2016,15181800,UPPER HILL BRANCH,another#yahoo.com,5403.75,5403.75,
1000000001,,9/11/2016,9/11/2016,Opening Balance,,,,,,4643.22,,,,,
1000000001,,12/10/2016,12/10/2016,Mobile Mpesa Transfer,,,,1533,,3110.22,,,,,
1000000001,,17-10-2016,17-10-2016,ATM Withdrawal,,,6.29006E+11,1000,,2110.22,,,,,
1000000001,,17-10-2016,17-10-2016,ATM Withdrawal,,,6.29118E+11,2000,,110.22,,,,,
1000000001,,17-10-2016,17-10-2016,Mobile Mpesa Transfer,,,,2083,,-1972.78,,,,,
1000000001,,17-10-2016,17-10-2016,Transfer from Mpesa,,,,0,4000,2027.22,,,,,
1000000001,,18-10-2016,18-10-2016,Mobile Mpesa Transfer,,,,333,,1694.22,,,,,

Based on how csv files are constructed, something like this should work. (Maybe post the first 2 lines of your output?)
string[] lines = wordDocContents.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
string[] columns = lines[0].Split(',');
// Split every line into its fields (requires using System.Linq).
string[][] data = lines.Select(line => line.Split(',')).ToArray();
Now let's say customer_name is under columns[2], you can try:
List<string> customerNames = new List<string>();
for (int i = 1; i < lines.Length; i++) {
    customerNames.Add(data[i][2]);
}
Edit: I just saw the output, so this code might need some adjusting for your particular case. string.Split(',') keeps an empty string between consecutive commas, so the column positions stay aligned even when values are missing. Just change the [2] to whichever column you need.
Column indices start at [0], then [1], [2], and so on.
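
To avoid hard-coding the index, the column positions can be looked up from the header line by name. This is a minimal sketch under some assumptions: fields never contain quoted commas, the SelectColumns helper is hypothetical, and the sample text is a shortened version of the data in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Program
{
    // Returns, for each data line, only the values of the requested columns.
    public static List<string[]> SelectColumns(string csvText, params string[] wanted)
    {
        string[] lines = csvText.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        string[] header = lines[0].Split(',');

        // Resolve each requested column name to its position in the header.
        int[] indexes = wanted.Select(name => Array.IndexOf(header, name)).ToArray();
        if (indexes.Contains(-1))
            throw new ArgumentException("A requested column is missing from the header.");

        var rows = new List<string[]>();
        foreach (string line in lines.Skip(1))
        {
            string[] fields = line.Split(',');
            // Lines shorter than the header yield an empty string instead of throwing.
            rows.Add(indexes.Select(i => i < fields.Length ? fields[i] : "").ToArray());
        }
        return rows;
    }

    public static void Main()
    {
        // Shortened sample based on the question's data.
        string csv = "ACCOUNT_NUMBER,CUSTOMER_NAMES,BRANCH\n"
                   + "1000000001,TEST,UPPER HILL BRANCH\n"
                   + "1000000001,,\n";

        foreach (string[] row in SelectColumns(csv, "ACCOUNT_NUMBER", "CUSTOMER_NAMES"))
            Console.WriteLine(string.Join("|", row));
    }
}
```

The selected rows can then be joined back into a single string if wordDocContents must remain a string.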

Adding numbers from two data frames in Deedle using multi key index

I am new to Deedle. I searched everywhere looking for examples that can help me to complete the following task:
Index data frame using multiple columns (3 in the example - Date, ID and Title)
Add numeric columns in multiple data frames together (Sales column in the example)
Group and add together sales occurred on the same day
My current approach is given below. First of all - it does not work because of the missing values and I don't know how to handle them easily while adding data frames. Second - I wonder if there is a better more elegant way to do it.
// Remove unused columns
var df = dfRaw.Columns[new[] { "Date", "ID", "Title", "Sales" }];
// Index data frame using 3 columns
var dfIndexed = df.IndexRowsUsing(r => Tuple.Create(r.GetAs<DateTime>("Date"), r.GetAs<string>("ID"), r.GetAs<string>("Title")) );
// Remove indexed columns
dfIndexed.DropColumn("Date");
dfIndexed.DropColumn("ID");
dfIndexed.DropColumn("Title");
// Add data frames. Does not work as it will add only
// keys existing in both data frames
dfTotal += dfIndexed;
Table 1
Date,ID,Title,Sales,Market
2014-03-01,ID1,Title1,1,US
2014-03-01,ID1,Title1,2,CA
2014-03-03,ID2,Title2,3,CA
Table 2
Date,ID,Title,Sales,Market
2014-03-02,ID1,Title1,2,US
2014-03-03,ID2,Title2,2,CA
Expected Results
Date,ID,Title,Sales
2014-03-01,ID1,Title1,3
2014-03-02,ID1,Title1,2
2014-03-03,ID2,Title2,5

I think that your approach with using tuples makes sense.
It is a bit unfortunate that there is no easy way to specify default values when adding!
The easiest solution I can think of is to realign both series to the same set of keys and use fill operation to provide defaults. Using simple series as an example, something like this should do the trick:
var allKeys = series1.Keys.Union(series2.Keys);
var aligned1 = series1.Realign(allKeys).FillMissing(0.0);
var aligned2 = series2.Realign(allKeys).FillMissing(0.0);
var res = aligned1 + aligned2;
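
The same idea (take the union of the keys, treat a missing key as 0, then add) can be illustrated without Deedle. The sketch below uses plain dictionaries and is not Deedle's API. The AddWithDefault helper is hypothetical, dates are kept as strings for brevity, and Table 1 is assumed to be already grouped by day (its two 2014-03-01 rows summed to 3).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Program
{
    // Adds two keyed collections, treating a key missing from either side as 0.
    public static Dictionary<TKey, double> AddWithDefault<TKey>(
        IReadOnlyDictionary<TKey, double> left,
        IReadOnlyDictionary<TKey, double> right)
    {
        return left.Keys.Union(right.Keys).ToDictionary(
            key => key,
            key => (left.TryGetValue(key, out double l) ? l : 0.0)
                 + (right.TryGetValue(key, out double r) ? r : 0.0));
    }

    public static void Main()
    {
        // Keys are (Date, ID, Title); values are Sales.
        var table1 = new Dictionary<(string, string, string), double>
        {
            [("2014-03-01", "ID1", "Title1")] = 3, // 1 + 2, already grouped by day
            [("2014-03-03", "ID2", "Title2")] = 3
        };
        var table2 = new Dictionary<(string, string, string), double>
        {
            [("2014-03-02", "ID1", "Title1")] = 2,
            [("2014-03-03", "ID2", "Title2")] = 2
        };

        foreach (var pair in AddWithDefault(table1, table2).OrderBy(p => p.Key))
        {
            var (date, id, title) = pair.Key;
            Console.WriteLine($"{date},{id},{title},{pair.Value}");
        }
    }
}
```

With the sample tables from the question this yields totals of 3, 2 and 5 for the three keys, matching the expected results.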

Retrieve "row pairs" from Excel

I am trying to retrieve data from an Excel spreadsheet using C#. The data in the spreadsheet has the following characteristics:
no column names are assigned
the rows can have varying column lengths
some rows are metadata, and these rows label the content of the columns in the next row
Therefore, the objects I need to construct will always have their name in the very first column, and its parameters are contained in the next columns. It is important that the parameter names are retrieved from the row above. An example:
row1|---------|FirstName|Surname|
row2|---Person|Bob------|Bloggs-|
row3|---------|---------|-------|
row4|---------|Make-----|Model--|
row5|------Car|Toyota---|Prius--|
So unfortunately the data is heterogeneous, and the only way to determine what rows "belong together" is to check whether the first column in the row is empty. If it is, then read all data in the row, and check which parameter names apply by checking the row above.
At first I thought the straightforward approach would be to simply loop through
1) the dataset containing all sheets, then
2) the datatables (i.e. sheets) and
3) the row.
However, I found that trying to extract this data with nested loops and if statements results in horrible, unreadable and inflexible code.
Is there a way to do this in LINQ ? I had a look at this article to start by filtering the empty rows between data but didn't really get anywhere. Could someone point me in the right direction with a few code snippets please ?
Thanks in advance !
hiro

I see that you've already accepted an answer, but I think a more generic solution is possible using reflection.
Let's say you have your data as a List<string[]>, where each element in the list is an array of strings holding all the cells from the corresponding row.
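List<string[]> data = LoadData();
var results = new List<object>();
// Start with no headers so headerRow is definitely assigned before use.
string[] headerRow = new string[0];
foreach (var row in data)
{
    if (string.IsNullOrEmpty(row[0]))
    {
        // Empty first cell: this row names the properties of the next object row.
        headerRow = row.Skip(1).ToArray();
    }
    else
    {
        // row[0] must be a type name that Type.GetType can resolve
        // (namespace-qualified, and assembly-qualified outside the calling assembly).
        Type objType = Type.GetType(row[0]);
        object newItem = Activator.CreateInstance(objType);
        for (int i = 0; i < headerRow.Length; i++)
        {
            objType.GetProperty(headerRow[i]).SetValue(newItem, row[i + 1]);
        }
        results.Add(newItem);
    }
}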

how to represent a CSV File as a data structure in a C# program

I have a csv file I am going to read from disk. I do not know up front how many columns or the names of the columns.
Any thoughts on how I should represent the fields. Ideally I want to say something like,
string Val = DataStructure.GetValue(i,ColumnName).
where i is the ith Row.
Oh just as an aside I will be parsing using the TextFieldParser class
http://msdn.microsoft.com/en-us/library/cakac7e6(v=vs.90).aspx

That sounds as if you would need a DataTable which has a Rows and Columns property.
So you can say:
string Val = table.Rows[i].Field<string>(ColumnName);
A DataTable is a table of in-memory data. It can be used in a strongly typed way (as suggested with the Field method), but internally it stores its data as objects.
You could use this parser to convert the csv to a DataTable.
Edit: I've only just seen that you want to use the TextFieldParser. Here's a possible simple approach to convert a csv to a DataTable:
var table = new DataTable();
using (var parser = new TextFieldParser(File.OpenRead(path)))
{
    parser.Delimiters = new[]{","};
    parser.HasFieldsEnclosedInQuotes = true;
    // load DataColumns from first line
    String[] headers = parser.ReadFields();
    foreach(var h in headers)
        table.Columns.Add(h);
    // load all other lines as data
    String[] fields;
    while ((fields = parser.ReadFields()) != null)
    {
        table.Rows.Add().ItemArray = fields;
    }
}

If the column names are in the first row read that and store in a Dictionary<string, int> that maps the column name to the column index.
You could then store the remaining rows in a simple structure like List<string[]>.
To get a column for a row you'd do csv[rowIndex][nameToIndex[ColumnName]];
nameToIndex[ColumnName] gets the column index from the name, csv[rowIndex] gets the row (string array) we want.
This could of course be wrapped in a class.
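
A minimal sketch of such a wrapper class follows. The CsvTable name and its members are hypothetical, and parsing is left out so that rows can come from any parser.

```csharp
using System;
using System.Collections.Generic;

public class CsvTable
{
    // Maps each column name to its position in a row.
    private readonly Dictionary<string, int> nameToIndex = new Dictionary<string, int>();
    // Each row is stored as the raw field array.
    private readonly List<string[]> rows = new List<string[]>();

    public CsvTable(string[] header)
    {
        for (int i = 0; i < header.Length; i++)
            nameToIndex[header[i]] = i;
    }

    public int RowCount => rows.Count;

    public void AddRow(string[] fields) => rows.Add(fields);

    // Looks up the column index by name, then reads that field from the row.
    public string GetValue(int rowIndex, string columnName) =>
        rows[rowIndex][nameToIndex[columnName]];
}

public static class Program
{
    public static void Main()
    {
        var table = new CsvTable(new[] { "ID", "Name" });
        table.AddRow(new[] { "1", "Test1" });
        table.AddRow(new[] { "2", "Test2" });
        Console.WriteLine(table.GetValue(1, "Name")); // prints "Test2"
    }
}
```

With TextFieldParser, the first ReadFields() result would go to the constructor and each later result to AddRow.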

Use the csv parser if you want, but a text parser is very easy to write yourself if you need customization.
For your needs, I would use one (or more) Dictionary: at least one mapping property name --> column index, and maybe the reverse one, column index --> property name, if needed.
When I parse a csv file, I usually put the result in a list while parsing, and then convert it to an array once complete for speed reasons (List.ToArray()).
