I have a DataFrame (created by reading a CSV) in Spark. How do I loop across the rows of this DataFrame in C#? There are 10 rows and 3 columns in the DataFrame, and I would like to get the value of each column as I navigate through the rows one by one. Below is what I am trying:
foreach (var obj in df)
{
    Console.WriteLine("test");
}
foreach statement cannot operate on variables of type 'DataFrame' because 'DataFrame' does not contain a public instance definition for 'GetEnumerator'
The DataFrame is a reference to the actual data on the Spark cluster. If you want to see the actual data (as opposed to running some transformation and writing the output, which is the typical use case), you need to collect the data over to your application.
https://learn.microsoft.com/en-us/dotnet/api/microsoft.spark.sql.dataframe.collect?view=spark-dotnet
foreach (var obj in df.Collect())
{
    Console.WriteLine("test");
}
This will give you an enumerable of Row, which has a Values property: an object array containing the actual values.
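For example, to read each column's value as you walk the rows (a minimal sketch, assuming the Microsoft.Spark.Sql namespace; the column name "someColumn" is hypothetical):

using Microsoft.Spark.Sql;

foreach (Row row in df.Collect())
{
    // Values holds one object per column, in schema order
    object first = row.Values[0];
    object second = row.Values[1];
    object third = row.Values[2];

    // Or fetch a typed value by column name:
    // string name = row.GetAs<string>("someColumn");

    Console.WriteLine(string.Join(", ", row.Values));
}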
If you just wanted to see the contents for debugging then you can do:
df.Show();
Show takes two arguments: the first is the number of rows, and the second is the width in characters to show, in case your data is truncated and you need to see all the columns:
df.Show(100, 10000);
I am very new to C# (been learning for approximately 6 months in my free time).
I have produced a script which can store data in an SQLite database, and that seems to work fine for the minute. I am having trouble printing the entire table and/or specific data from the table. When I get to the point where it should print the data, it gives me the following:
[screenshot of the console output, which shows the object's type name instead of the data]
[screenshot: the solution structure and the data class]

**Here is the main program code relating to loading data from the database:**
private void LoadGymList()
{
    Console.WriteLine("Load Gym List Working");
    List<GymEntry> entries = new List<GymEntry>();
    entries = SQLiteDataAccess.LoadEntry();
    Console.WriteLine(entries[0]);
}
**Here is the SQLiteDataAccess code for loading entries:**
using System.Collections.Generic;
using System.Data;
using System.Data.SQLite;
using System.Linq;
using Dapper;

namespace GymAppLists
{
    public class SQLiteDataAccess
    {
        public static List<GymEntry> LoadEntry()
        {
            using (IDbConnection cnn = new SQLiteConnection(LoadConnectionString()))
            {
                var output = cnn.Query<GymEntry>("select * from GeneralLog", new DynamicParameters());
                return output.ToList();
            }
        }
    }
}
The only other files are app.config and a basic class with 'id' and 'date' properties, which are the only two columns in the database's single table.
I tried to print individual indexes of the list to see if it would print those, but it simply gave me the same output. I am stumped as to why it works this way. It is clearly accessing the database, so either the output is formatted incorrectly or I am not using the correct method to access the specifics of the data.
If I print the list's Count, for example, it gives me the correct number of rows in the database.
I imagine this is a very simple fix, any advice would be greatly appreciated.
thank you,
JH.
You are only writing the first element of the entries list (Console.WriteLine(entries[0])), and you are printing the object itself, which falls back to the default ToString() and outputs the type name. Instead, use a loop and print the properties, i.e.:
foreach (var x in entries)
{
    Console.WriteLine($"Id:{x.Id}, Date:{x.Date}");
}
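Alternatively, you can override ToString() on the entity so that printing the object directly shows its values. A minimal sketch, assuming GymEntry has Id and Date properties matching the two database columns:

public class GymEntry
{
    public int Id { get; set; }
    public string Date { get; set; }

    // Without this override, Console.WriteLine(entry) prints the type name,
    // e.g. "GymAppLists.GymEntry"
    public override string ToString() => $"Id:{Id}, Date:{Date}";
}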
I have an SSIS package that assembles a dynamic SQL statement and executes it on a different server, with the results needing to be written back to the first server.
Because the SQL is created and passed in as a variable, a Foreach Loop is used to run each instance. The results are put into an Object Variable, and this works fine. If I put my Script Task in the Foreach Loop itself, I can write the results back to the original server. However, I would really like, for performance reasons, to move the insert out of the Foreach Loop and read the result set / object variable so I can open one connection and write all the data in one go. But when I pull the object doing the reading of the results and writing to the database out of the loop, it only writes the last row of data, not all of them.
How can I get to all the rows in the result set outside of the Foreach Loop? Is there a pointer to the first row or something? I can't imagine I'm the first person to need to do this, but my search for answers has come up empty. Or maybe I'm just under-caffeinated.
Well, it can be simplified if some conditions are met. Generally, SSIS is metadata-centric, i.e. it works with a fixed set of columns and their types. So, if each SQL query you run returns the same set of columns (the same column names and data types), you can try the following approach:
In the Foreach Loop, run the SQL commands and store their results in an Object Variable.
Then create a Data Flow Task with a Script Component source, fetching its rows from the Object Variable from step 1 (as sketched below). If needed, you can add other data, such as the SQL query text. The resulting rows can then be written to a table with a regular Data Flow destination.
There are already good answers on how to use an Object Variable as a data source in a Script Component.
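A minimal sketch of such a Script Component source, assuming the Object Variable is named User::QueryResults, is exposed to the component as a read-only variable, and the default output is Output0 with hypothetical Id and Name columns (SSIS generates the Output0Buffer class for you):

using System.Data;
using System.Data.OleDb;

public override void CreateNewOutputRows()
{
    // An Object Variable filled by an Execute SQL Task holds an ADO recordset;
    // OleDbDataAdapter can unpack it into a DataTable.
    var table = new DataTable();
    var adapter = new OleDbDataAdapter();
    adapter.Fill(table, Variables.QueryResults);

    foreach (DataRow row in table.Rows)
    {
        Output0Buffer.AddRow();
        Output0Buffer.Id = (int)row["Id"];        // hypothetical column
        Output0Buffer.Name = (string)row["Name"]; // hypothetical column
    }
}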
I am using Parquet.Net to read parquet files, but the only option to read from the parquet file is:
// get the first row group
Parquet.ParquetRowGroupReader rowGroup = myParquet.OpenRowGroupReader(0);
// get the first column
Parquet.Data.DataColumn col1 = rowGroup.ReadColumn(myParquet.Schema.GetDataFields()[0]);
This allows me to get the first column from the first row group, but the problem is that the first row group can be something like 4 million rows, and ReadColumn will read all 4 million values.
How do I tell ReadColumn that I only want it to read, say, the first 100 rows? Reading all 4 million rows wastes memory and file read time.
I actually got a memory error until I changed my code to resize that 4-million-value array down to my 100 after reading each column.
I don't necessarily need row-based access; I can work with columns. I just don't need a whole row group's worth of values in each column. Is this possible? If row-based access is better, how does one use it? The Parquet.Net project site doesn't give any examples and just talks about tables.
According to the source code this capability exists in DataColumnReader but this is an internal class and thus not directly usable.
ParquetRowGroupReader uses it inside its ReadColumn method, but exposes no such options.
What can be done in practice is copying the whole DataColumnReader class and using it directly, but this could breed future compatibility issues.
If the problem can wait for some time, I'd recommend copying the class and then opening an issue + pull request to the library with the enhanced class, so the copied class can eventually be removed.
If you look at the parquet-dotnet documentation, they do not recommend writing more than 5,000 records into one row group for performance reasons, though at the bottom of the page they say row groups are designed to hold 50,000 rows on average:
It's not recommended to have more than 5'000 rows in a single row group for performance reasons
My team works with 100,000 rows per row group; overall it may depend on what you are storing, but 4,000,000 records in one row group inside a column does sound like too much.
So, to answer your question: to read only part of the column, make your row groups smaller and then read only as many row groups as you need. If you want to read just 100 records, read in the first row group and take the first 100 from it; reasonably sized row groups are very fast to read.
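A minimal sketch of that read path, assuming the Parquet.Net v3 API used in the question (the file name is hypothetical):

using System.IO;
using System.Linq;
using Parquet;
using Parquet.Data;

using (Stream fileStream = File.OpenRead("data.parquet"))
using (var myParquet = new ParquetReader(fileStream))
{
    DataField field = myParquet.Schema.GetDataFields()[0];

    // Read only the first row group; with small row groups this stays cheap.
    using (ParquetRowGroupReader rowGroup = myParquet.OpenRowGroupReader(0))
    {
        DataColumn col = rowGroup.ReadColumn(field);
        var first100 = col.Data.Cast<object>().Take(100).ToArray();
    }
}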
ParquetSharp should be able to do that. It's a wrapper around the Apache Parquet C++ library (part of the Arrow project), and it supports Windows, Linux, and macOS.
using System;
using ParquetSharp;
using (var reader = new ParquetFileReader(path))
using (var group = reader.RowGroup(0))
{
// You can use the logical reader for automatic conversion to a fitting CLR type
// here `double` as an example
// (unfortunately this does not work well with complex schemas IME)
const int batchSize = 4000;
Span<double> buffer = new double[batchSize];
var howManyRead = group.Column(0).LogicalReader<double>().ReadBatch(buffer);
// or if you want raw Parquet (with Dremel data and physical type)
var resultObject = group.Column(0).Apply(new Visitor());
}
class Visitor : IColumnReaderVisitor<object>
{
public object OnColumnReader<TValue>(ColumnReader<TValue> columnReader)
where TValue : unmanaged
{
// TValue will be the physical Parquet type
const int batchSize = 200000;
var buffer = new TValue[batchSize];
var definitionLevels = new short[batchSize];
var repetitionLevels = new short[batchSize];
long valuesRead;
var levelsRead = columnReader.ReadBatch(batchSize,
definitionLevels, repetitionLevels,
buffer, out valuesRead);
// Return stuff you are interested in here, will be `resultObject` above
return new object();
}
}
I have a DataTable object that I need to fill based on data stored in a stream of columns - i.e. the stream initially contains the schema of the DataTable, and subsequently, values that should go into it organised by column.
At present, I'm taking the rather naive approach of
Create enough empty rows to hold all data values.
Fill those rows per cell.
The result is a per-cell iteration, which is not especially quick to say the least.
That is:
// Create rows first...
// Then populate...
foreach (var col in table.Columns.Cast<DataColumn>())
{
    List<object> values = GetValuesfromStream(theStream);
    // Actual method has some DBNull checking here, but should
    // be immaterial to any solution.
    for (var i = 0; i < values.Count; i++)
        table.Rows[i][col] = values[i];
}
My guess is that the backing DataStorage items for each column aren't expanding as the rows are added, but only as values are added to each column, though I'm far from certain. Any tips for loading this kind of data?
NB that loading all lists first and then reading in by row is probably not sensible - this approach is being taken in the first place to mitigate potential out of memory exceptions that tend to result when serializing huge DataTable objects, so grabbing a clone of the entire data grid and reading it in would probably just move the problem elsewhere. There's definitely enough memory for the original table and another column of values, but there probably isn't for two copies of the DataTable.
Whilst I haven't found a way to avoid iterating cells, as per the comments above, I've found that writing to DataRow items that have already been added to the table turns out to be a bad idea, and was responsible for the vast majority of the slowdown I observed.
The final approach I used ended up looking something like this:
List<DataRow> rows = null;

// Start population...
var cols = table.Columns.Cast<DataColumn>().Where(c => string.IsNullOrEmpty(c.Expression));
foreach (var col in cols)
{
    List<object> values = GetValuesfromStream(theStream);

    // Create rows first if required.
    if (rows == null)
    {
        rows = new List<DataRow>();
        for (var i = 0; i < values.Count; i++)
            rows.Add(table.NewRow());
    }

    // Actual method has some DBNull checking here, but should
    // be immaterial to any solution.
    for (var i = 0; i < values.Count; i++)
        rows[i][col] = values[i];
}
rows.ForEach(r => table.Rows.Add(r));
This approach addresses two problems:
If you try to add an empty DataRow to a table that has null-restrictions or similar, then you'll get an error. This approach ensures all the data is there before it's added, which should address most such issues (although I haven't had need to check how it works with auto-incrementing PK columns).
Where expressions are involved, these are evaluated when row state changes for a row that has been added to a table. Consequently, where before I had re-calculation of all expressions taking place every time a value was added to a cell (expensive and pointless), now all calculation takes place just once after all base data has been added.
There may of course be other complications with writing to a table that I've not yet encountered because the tables I am making use of don't use those features of the DataTable class/model. But for simple cases, this works well.
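For illustration, this is the kind of expression column the code above deliberately skips when writing values (a minimal sketch with hypothetical column names):

using System.Data;

var table = new DataTable();
table.Columns.Add("Price", typeof(decimal));
table.Columns.Add("Qty", typeof(int));

// "Total" is computed, never written directly; once a row belongs to the
// table, every cell write can trigger its re-evaluation.
table.Columns.Add("Total", typeof(decimal)).Expression = "Price * Qty";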
I have an object structure that mimics the properties of an Excel table. So I have a table object containing properties such as a title, a header row object, and body row objects. Within the header row and each body row object, I have a cell object containing info on each cell per row. I am looking for a more efficient way to store this table structure, since in one of my uses for this object I am printing its structure to screen. Currently, I am doing an O(n^2)-complexity print of each cell for each row:
foreach (var row in Table.Rows)
{
    foreach (var cell in row.Cells)
    {
        Console.WriteLine(cell.ToString());
    }
}
Is there a more efficient way of storing this structure to avoid the n^2? I ask because this printing functionality exists inside another n^2 loop. Basically, I have a list of table titles and a list of tables, and I need to find the tables whose titles are in the title list. Then, for each of those tables, I need to print their rows and the cells in each row. Can any part of this operation be optimized by using a different data structure for storage, perhaps? I'm not sure exactly how they work, but I have heard of hashing and dictionaries?
Thanks
Since you are looking for tables with specific titles, you could use a dictionary to store the tables by title
Dictionary<string,Table> tablesByTitle = new Dictionary<string,Table>();
tablesByTitle.Add(table.Title, table);
...
table = tablesByTitle["SomeTableTitle"];
This would make finding a table an O(1) operation, and finding n tables an O(n) operation.
Printing the tables then of course depends on the number of rows and columns. There is nothing that can change that.
UPDATE:
string tablesFromGuiElement = "Employees;Companies;Addresses";
string[] selectedTables = tablesFromGuiElement.Split(';');
foreach (string title in selectedTables)
{
    Table tbl = tablesByTitle[title];
    PrintTable(tbl);
}
There isn't anything more efficient than an N^2 operation for outputting an NxN matrix of values. Worst-case, you will always be doing this.
Now, if instead of storing the values in a multidimensional collection that defines the graphical relationship of rows and columns, you put them in a one-dimensional collection and included the row/column information with each cell, then you would only need to iterate through the cells that have values. The worst case is still N^2 for a table of N rows and N columns that is fully populated (the one-dimensional collection, though linear to enumerate, will have N^2 items), but the best case, where only one cell is populated (or none are), would be constant time.
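A minimal sketch of that sparse, one-dimensional idea (the cell values and positions here are hypothetical):

using System;
using System.Collections.Generic;

// Store only populated cells, keyed by their (row, column) position.
var cells = new Dictionary<(int Row, int Col), string>();
cells[(0, 0)] = "Title";
cells[(3, 2)] = "42";

// Enumeration is linear in the number of populated cells, not rows x columns.
foreach (var kvp in cells)
{
    Console.WriteLine($"({kvp.Key.Row},{kvp.Key.Col}) = {kvp.Value}");
}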
This answer applies to the printing-the-table part; the question was extended later.
For the getting-the-table part, see the other answer.
No, there is not.
Unless perhaps your values follow some predictable distribution, in which case you could use a function of x and y and store no data at all, or maybe a seed and a function.
You could cache the print output in a string or StringBuilder if you require it multiple times.
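A minimal sketch of that caching idea, assuming the Table/Rows/Cells structure from the question:

using System;
using System.Text;

// Build the output once...
var sb = new StringBuilder();
foreach (var row in Table.Rows)
    foreach (var cell in row.Cells)
        sb.AppendLine(cell.ToString());

string cached = sb.ToString();

// ...then reuse it for every subsequent print.
Console.Write(cached);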
If there is enough data, I guess you might apply some compression algorithm, but I wouldn't say that would be simpler or more efficient.