Managed Esent - quicker way to read data? - c#

I am using the Managed Esent interface to read data from a table. I am doing this with (pseudocode):
List<ColumnInfo> columns; // three columns to be read

using (var table = new Table(session, DBID, "tablename", OpenTableGrbit.ReadOnly))
{
    while (Api.TryMoveNext(session, table))
    {
        foreach (ColumnInfo col in columns)
        {
            string data = GetFormattedColumnData(session, table, col);
        }
    }
}
I am only interested in data from three columns, and only around 4,000 rows. However, the table itself has 1,800,000 rows, so this approach is very slow: to get just the data I want, I have to read all 1,800,000 rows. Is there a quicker way?

There are many things you can do. Here are a few things off the top of my head:
Set the minimum cache size with SystemParameters.CacheSizeMin. The default cache-sizing algorithm is sometimes a bit conservative.
Also set OpenTableGrbit.Sequential when opening your table. This helps a little with prefetching.
Use Api.RetrieveColumns to retrieve the three values at once. This reduces the number of calls/pinvokes you'll make.
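Putting those suggestions together, a rough sketch might look like the following. The column names, value types, and cache size are illustrative placeholders, not taken from the question:

// Sketch only: combines the cache, sequential-scan and RetrieveColumns suggestions.
SystemParameters.CacheSizeMin = 16384; // in database pages; tune for your machine

using (var table = new Table(session, DBID, "tablename",
    OpenTableGrbit.ReadOnly | OpenTableGrbit.Sequential))
{
    var columnids = Api.GetColumnDictionary(session, table);

    // One ColumnValue object per column of interest, filled in a single call per row.
    var colA = new StringColumnValue { Columnid = columnids["ColumnA"] };
    var colB = new StringColumnValue { Columnid = columnids["ColumnB"] };
    var colC = new StringColumnValue { Columnid = columnids["ColumnC"] };

    Api.MoveBeforeFirst(session, table);
    while (Api.TryMoveNext(session, table))
    {
        Api.RetrieveColumns(session, table, colA, colB, colC);
        // colA.Value, colB.Value, colC.Value now hold the current row's data.
    }
}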
-martin

Related

How do I read only part of a column from a Parquet file using Parquet.net?

I am using Parquet.Net to read parquet files, but the only option to read from the parquet file is:
//get the first group
Parquet.ParquetRowGroupReader rowGroup = myParquet.OpenRowGroupReader(0);
//gets the first column
Parquet.Data.DataColumn col1 = rowGroup.ReadColumn(myParquet.Schema.GetDataFields()[0]);
This allows me to get the first column from the first rowGroup, but the problem is that the first rowGroup can contain something like 4 million rows, and ReadColumn will read all 4 million values.
How do I tell ReadColumn that I only want it to read, say, the first 100 rows? Reading all 4 million rows wastes memory and file read time.
I actually got a memory error until I changed my code to resize that 4-million-value array down to my 100 rows after reading each column.
I don't necessarily need row-based access; I can work with columns, I just don't need a whole rowGroup's worth of values in each column. Is this possible? If row-based access is better, how does one use it? The Parquet.Net project site doesn't give any examples and just talks about tables.
According to the source code, this capability exists in DataColumnReader, but that is an internal class and thus not directly usable.
ParquetRowGroupReader uses it inside its ReadColumn method but exposes no such options.
What can be done in practice is copying the whole DataColumnReader class and using it directly, but this could cause future compatibility issues.
If the problem can wait for some time, I'd recommend copying the class and then opening an issue plus a pull request to the library with the enhanced class, so the copied class can eventually be removed.
If you look at the parquet-dotnet documentation, it does not recommend writing more than 5,000 records into one row group for performance reasons, though at the bottom of the page it says row groups are designed to hold 50,000 rows on average:
It's not recommended to have more than 5'000 rows in a single row group for performance reasons
My team works with 100,000 rows per row group. It may depend on what you are storing, but 4,000,000 records in one row group inside a column does sound like too much.
So to answer your question: to read only part of the column, make the row groups smaller, and then read only as many row groups as you need. If you want to read only 100 records, read in the first row group and take the first 100 from it; reasonably sized row groups are very fast to read.
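For example, a rough sketch using the synchronous Parquet.Net API already shown in the question (myParquet is the open ParquetReader; the 100-row target is illustrative):

// Sketch only: walk the row groups and stop once enough values are collected.
int wanted = 100;
var collected = new List<object>();
var field = myParquet.Schema.GetDataFields()[0];

for (int g = 0; g < myParquet.RowGroupCount && collected.Count < wanted; g++)
{
    using (Parquet.ParquetRowGroupReader rowGroup = myParquet.OpenRowGroupReader(g))
    {
        Parquet.Data.DataColumn col = rowGroup.ReadColumn(field);
        foreach (object value in col.Data)   // col.Data is a System.Array
        {
            collected.Add(value);
            if (collected.Count == wanted) break;
        }
    }
}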
ParquetSharp should be able to do that. It's a wrapper around the Apache Parquet C++ library (part of the Arrow project), and it supports Windows, Linux and macOS.
using System;
using ParquetSharp;

using (var reader = new ParquetFileReader(path))
using (var group = reader.RowGroup(0))
{
    // You can use the logical reader for automatic conversion to a fitting CLR type,
    // here `double` as an example
    // (unfortunately this does not work well with complex schemas IME).
    const int batchSize = 4000;
    Span<double> buffer = new double[batchSize];
    var howManyRead = group.Column(0).LogicalReader<double>().ReadBatch(buffer);

    // Or if you want raw Parquet (with Dremel data and physical type):
    var resultObject = group.Column(0).Apply(new Visitor());
}
class Visitor : IColumnReaderVisitor<object>
{
    public object OnColumnReader<TValue>(ColumnReader<TValue> columnReader)
        where TValue : unmanaged
    {
        // TValue will be the physical Parquet type.
        const int batchSize = 200000;
        var buffer = new TValue[batchSize];
        var definitionLevels = new short[batchSize];
        var repetitionLevels = new short[batchSize];
        long valuesRead;
        var levelsRead = columnReader.ReadBatch(batchSize,
            definitionLevels, repetitionLevels,
            buffer, out valuesRead);
        // Return whatever you are interested in here; it will be `resultObject` above.
        return new object();
    }
}

What is the fastest way to populate a C# DataTable with data stored on columns?

I have a DataTable object that I need to fill based on data stored in a stream of columns - i.e. the stream initially contains the schema of the DataTable, and subsequently, values that should go into it organised by column.
At present, I'm taking the rather naive approach of
Create enough empty rows to hold all data values.
Fill those rows per cell.
The result is a per-cell iteration, which is not especially quick to say the least.
That is:
// Create rows first...

// Then populate...
foreach (var col in table.Columns.Cast<DataColumn>())
{
    List<object> values = GetValuesfromStream(theStream);

    // Actual method has some DBNull checking here, but it should
    // be immaterial to any solution.
    for (var i = 0; i < values.Count; i++)
        table.Rows[i][col] = values[i];
}
My guess is that the backing DataStorage items for each column aren't expanding as the rows are added, but rather as values are added to each column, but I'm far from certain. Any tips for loading this kind of data?
NB that loading all lists first and then reading in by row is probably not sensible - this approach is being taken in the first place to mitigate potential out of memory exceptions that tend to result when serializing huge DataTable objects, so grabbing a clone of the entire data grid and reading it in would probably just move the problem elsewhere. There's definitely enough memory for the original table and another column of values, but there probably isn't for two copies of the DataTable.
Whilst I haven't found a way to avoid iterating cells, as per the comments above, I've found that writing to DataRow items that have already been added to the table turns out to be a bad idea, and was responsible for the vast majority of the slowdown I observed.
The final approach I used ended up looking something like this:
List<DataRow> rows = null;

// Start population...
var cols = table.Columns.Cast<DataColumn>().Where(c => string.IsNullOrEmpty(c.Expression));
foreach (var col in cols)
{
    List<object> values = GetValuesfromStream(theStream);

    // Create rows first if required.
    if (rows == null)
    {
        rows = new List<DataRow>();
        for (var i = 0; i < values.Count; i++)
            rows.Add(table.NewRow());
    }

    // Actual method has some DBNull checking here, but it should
    // be immaterial to any solution.
    for (var i = 0; i < values.Count; i++)
        rows[i][col] = values[i];
}

rows.ForEach(r => table.Rows.Add(r));
This approach addresses two problems:
If you try to add an empty DataRow to a table that has null-restrictions or similar, then you'll get an error. This approach ensures all the data is there before it's added, which should address most such issues (although I haven't had need to check how it works with auto-incrementing PK columns).
Where expressions are involved, these are evaluated when row state changes for a row that has been added to a table. Consequently, where before I had re-calculation of all expressions taking place every time a value was added to a cell (expensive and pointless), now all calculation takes place just once after all base data has been added.
There may of course be other complications with writing to a table that I've not yet encountered because the tables I am making use of don't use those features of the DataTable class/model. But for simple cases, this works well.

DataTable memory huge consumption

I'm loading CSV data from files into a DataTable for processing.
The problem is that I want to process several files, and my tests with the DataTable show huge memory consumption.
I tested with a 37MB CSV file and the memory grew to 240MB, which is way too much IMHO.
I read that there is overhead in the DataTable, and I could live with about 70MB, but not 240MB, which is six times the original size.
I read here that DataTables need more memory than POCOs, but the difference is way too much.
I ran a memory profiler and looked for memory leaks and for where the memory goes. I found that the DataTable columns hold between 6MB and 19MB of strings each, and the table has about 20 columns. Are the values stored in the columns? Why is so much memory taken, and what can I do to reduce memory consumption?
With this memory consumption, DataTables seem to be unusable.
Has anybody else had such problems with DataTables, or am I doing something wrong?
PS: I tried a 70MB file and the DataTable grew to 500MB!
OK, here is a small test case:
The 37MB CSV file (21 columns) makes the memory grow to 179MB.
private static DataTable ReadCsv()
{
    DataTable table = new DataTable();
    table.BeginLoadData();
    using (var reader = new StreamReader(File.OpenRead(@"C:\Develop\Tests\csv-Data\testdaten\test.csv")))
    {
        int y = 0;
        int columnsCount = 0;
        while (!reader.EndOfStream)
        {
            var line = reader.ReadLine();
            var values = line.Split(',');
            if (y == 0)
            {
                columnsCount = values.Length;
                // create columns
                for (int x = 0; x < columnsCount; x++)
                {
                    table.Columns.Add(new DataColumn(values[x], typeof(string)));
                }
            }
            else
            {
                if (values.Length == columnsCount)
                {
                    // add the data
                    table.Rows.Add(values);
                }
            }
            y++;
        }
        table.EndLoadData();
        table.AcceptChanges();
    }
    return table;
}
DataSet and its children DataTable, DataRow, etc. make up an in-memory relational database. There is a lot of overhead involved (though it does make some things very convenient).
If memory is an issue:
Build domain objects to represent each row in your CSV file, with typed properties.
Create a custom collection (or just use IList<T>) to hold them.
Alternatively, build a light-weight class with the basic semantics of a DataTable (see the sketch below):
the ability to select a row by number
the ability to select a column within a row by row number and either column name or number
the ability to know the ordered set of column names
Bonus: the ability to select a column by name or ordinal number and receive a list of its values, one per row
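For illustration only, here is a minimal sketch of such a light-weight, column-oriented table. Everything is stored as strings and all names are made up, not part of the original answer:

using System.Collections.Generic;

// Sketch only: columns are parallel string lists, addressable by name or ordinal.
class ColumnarTable
{
    private readonly List<string> _columnNames = new List<string>();
    private readonly Dictionary<string, int> _ordinals = new Dictionary<string, int>();
    private readonly List<List<string>> _columns = new List<List<string>>();

    public IReadOnlyList<string> ColumnNames { get { return _columnNames; } }
    public int RowCount { get { return _columns.Count == 0 ? 0 : _columns[0].Count; } }

    public void AddColumn(string name)
    {
        _ordinals[name] = _columnNames.Count;
        _columnNames.Add(name);
        _columns.Add(new List<string>());
    }

    public void AddRow(string[] values) // one value per column, in column order
    {
        for (int i = 0; i < values.Length; i++)
            _columns[i].Add(values[i]);
    }

    // Select a cell by row number and column name or ordinal.
    public string Cell(int row, string column) { return _columns[_ordinals[column]][row]; }
    public string Cell(int row, int column) { return _columns[column][row]; }

    // Bonus: a whole column's values, one per row.
    public IReadOnlyList<string> Column(string name) { return _columns[_ordinals[name]]; }
}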
Are you sure you need an in-memory representation of your CSV files? Could you access them via an IDataReader like Sebastien Lorion's Fast CSV Reader?
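If that route appeals, the general shape is to stream rows through the reader instead of materializing them all. A hedged sketch against the plain System.Data.IDataReader interface (the per-value handling is left as a placeholder):

using System.Data;

static void ProcessCsv(IDataReader reader)
{
    // Stream one record at a time; only the current row is held in memory.
    while (reader.Read())
    {
        for (int i = 0; i < reader.FieldCount; i++)
        {
            string value = reader.IsDBNull(i) ? null : reader.GetString(i);
            // handle reader.GetName(i) / value here instead of storing the whole file
        }
    }
}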
DataTables are a generic solution for putting tabular data into memory while adding lots of table-related features. If the overhead is not acceptable to you, you have these options:
1) Write your own DataTable class that eliminates the overhead you don't need.
2) Use an alternate representation that still accomplishes what you need, perhaps POCO based, or maybe an XmlDocument (which may have just as much overhead, maybe more; I never really worried about it).
3) Stop trying to load everything into memory and just bring data in as needed from your external store.

Simplifying complexity for a table object structure

I have an object structure that mimics the properties of an Excel table. So I have a table object containing properties such as the title, a header row object, and body row objects. Within the header row and each body row object, I have a cell object containing info on each cell per row. I am looking for a more efficient way to store this table structure, since in one of my uses for this object I am printing its structure to screen. Currently, I am doing an O(n^2) iteration, printing each cell of each row:
foreach (var row in Table.Rows)
{
    foreach (var cell in row.Cells)
    {
        Console.WriteLine(cell.ToString());
    }
}
Is there a more efficient way of storing this structure to avoid the n^2? I ask because this printing functionality exists inside another n^2 loop: basically I have a list of table titles and a list of tables, and I need to find the tables whose titles are in the title list. Then for each of those tables, I need to print their rows and the cells in each row. Can any part of this operation be optimized by using a different data structure for storage, perhaps? I'm not sure exactly how they work, but I have heard of hashing and dictionaries.
Thanks
Since you are looking for tables with specific titles, you could use a dictionary to store the tables by title
Dictionary<string,Table> tablesByTitle = new Dictionary<string,Table>();
tablesByTitle.Add(table.Title, table);
...
table = tablesByTitle["SomeTableTitle"];
This would make finding a table an O(1) operation. Finding n tables would be an O(n) operation.
Printing the tables then of course depends on the number of rows and columns. There is nothing which can change that.
UPDATE:
string tablesFromGuiElement = "Employees;Companies;Addresses";
string[] selectedTables = tablesFromGuiElement.Split(';');

foreach (string title in selectedTables) {
    Table tbl = tablesByTitle[title];
    PrintTable(tbl);
}
There isn't anything more efficient than an N^2 operation for outputting an NxN matrix of values. Worst-case, you will always be doing this.
Now, if instead of storing the values in a multidimensional collection that defines the graphical relationship of rows and columns, you put them in a one-dimensional collection and include the row/column information with each cell, then you would only need to iterate through the cells that have values. The worst case is still N^2 for a table of N rows and N columns that is fully populated (the one-dimensional array, though linear to enumerate, will have N^2 items), but the best case, where only one cell in the table is populated (or none are), would be constant time.
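As a small illustration of that one-dimensional layout (the tuple shape and sample values are just assumptions for the sketch):

// Each populated cell carries its own row/column, so printing touches only
// the cells that actually exist.
var cells = new List<(int Row, int Col, string Text)>
{
    (0, 0, "Title"),
    (3, 2, "42"),
};

foreach (var cell in cells)
    Console.WriteLine($"[{cell.Row},{cell.Col}] {cell.Text}");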
This answer applies to the printing-the-table part; the question was later extended.
For the getting-the-table part, see the other answer.
No, there is not.
Unless perhaps your values follow some predictable distribution, in which case you could use a function of x and y and store no data at all, or maybe a seed and a function.
You could cache the print output in a string or StringBuilder if you require it multiple times.
If there is enough data, I guess you might apply some compression algorithm, but I wouldn't say that was simpler or more efficient.

The right data structure to use for an Excel clone

Let's say I'm working on an Excel clone in C#.
My grid is represented as follows:
private struct CellValue
{
    private int column;
    private int row;
    private string text;
}

private List<CellValue> cellValues = new List<CellValue>();
Each time the user adds a text, I just package it as a CellValue and add it to cellValues. Given a CellValue, I can determine its row and column in O(1) time, which is great. However, given a column and a row, I need to loop through the entire cellValues list to find which text is in that column and row, which is terribly slow. Also, given a text, I too need to loop through the entire thing. Is there any data structure where I can achieve all three tasks in O(1) time?
Updated:
Looking through some of the answers, I don't think I have found one that I like. Can I:
Not keep more than two copies of a CellValue, in order to avoid syncing them? In the C world I would have made nice use of pointers.
Rows and columns can be dynamically added (unlike Excel).
I would opt for a sparse array (a linked list of linked lists) to give maximum flexibility with minimum storage.
In this example, you have a linked list of rows with each element pointing to a linked list of cells in that row (you could reverse the cells and rows depending on your needs).
 |
 V
+-+    +---+          +---+
|1| -> |1.1| -------> |1.3| -:
+-+    +---+          +---+
 |
 V
+-+           +---+
|7| --------> |7.2| -:
+-+           +---+
 |
 =
Each row element has the row number in it and each cell element has a pointer to its row element, so that getting the row number from a cell is O(1).
Similarly, each cell element has its column number, making that O(1) as well.
There's no easy way to get O(1) for finding the cell at a given row/column, but a sparse array is as fast as it's going to get unless you pre-allocate information for every possible cell so that you can do index lookups on an array; this would be very wasteful in terms of storage.
One thing you could do is make one dimension non-sparse, such as making the columns the primary array (rather than a linked list) and limiting them to 1,000. This would make the column lookup indexed (fast), followed by a search on the sparse rows.
I don't think you can ever get O(1) for a text lookup, simply because text can be duplicated in multiple cells (unlike row/column). I still believe the sparse array will be the fastest way to search for text, unless you maintain a sorted index of all text values in another array (again, that can make it faster but at the expense of copious amounts of memory).
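A rough sketch of that linked-list-of-linked-lists layout (class and member names are illustrative, not from the answer):

using System.Collections.Generic;

class Cell
{
    public Row Row;        // back-pointer, so the row number is Row.Number: O(1)
    public int Column;     // column number stored on the cell: O(1)
    public string Text;
}

class Row
{
    public int Number;
    public LinkedList<Cell> Cells = new LinkedList<Cell>();   // only populated cells
}

class Sheet
{
    // Only rows that contain at least one cell appear here.
    public LinkedList<Row> Rows = new LinkedList<Row>();

    // Lookup is a walk down the sparse rows, then along the sparse cells.
    public Cell Find(int row, int column)
    {
        foreach (var r in Rows)
            if (r.Number == row)
                foreach (var c in r.Cells)
                    if (c.Column == column)
                        return c;
        return null;
    }
}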
I think you should use one of the indexed collections to make it work reasonably fast; the perfect one is KeyedCollection.
You need to create your own collection by extending this class. This way your object will still contain row and column (so you will not lose anything), but you will be able to search by them. You will probably have to create a class encapsulating (row, column) and make it the key, so make it immutable and override Equals and GetHashCode.
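A hedged sketch of that idea; CellKey and the CellValue shape shown here are assumptions, not the poster's actual types:

using System;
using System.Collections.ObjectModel;

// Immutable composite key with Equals/GetHashCode overridden.
struct CellKey : IEquatable<CellKey>
{
    public CellKey(int row, int column) { Row = row; Column = column; }
    public int Row { get; }
    public int Column { get; }

    public bool Equals(CellKey other) => Row == other.Row && Column == other.Column;
    public override bool Equals(object obj) => obj is CellKey && Equals((CellKey)obj);
    public override int GetHashCode() => (Row * 397) ^ Column;
}

class CellValue
{
    public int Row;
    public int Column;
    public string Text;
}

// KeyedCollection derives the key from the item itself.
class CellCollection : KeyedCollection<CellKey, CellValue>
{
    protected override CellKey GetKeyForItem(CellValue item)
        => new CellKey(item.Row, item.Column);
}

// Usage: var cells = new CellCollection();
//        cells.Add(new CellValue { Row = 3, Column = 2, Text = "42" });
//        CellValue found = cells[new CellKey(3, 2)];   // keyed lookup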
I'd create
Collection<Collection<CellValue>> rowCellValues = new Collection<Collection<CellValue>>();
and
Collection<Collection<CellValue>> columnCellValues = new Collection<Collection<CellValue>>();
The outer collection has one entry for each row or column, indexed by the row or column number, the inner collection has all the cells in that row or column. These collections should be populated as part of the process that creates new CellValue objects.
rowCellValues[newCellValue.Row].Add(newCellValue);
columnCellValues[newCellValue.Column].Add(newCellValue);
This smells of premature optimization.
That said, there are a few features of Excel that are important in choosing a good structure.
First is that Excel uses the cells in a moderately non-linear fashion. The process of resolving formulas involves traversing the spreadsheet in effectively random order. The structure will need a mechanism for cheaply looking up values by random keys, and for marking them dirty, resolved, or unresolvable due to circular reference. It will also need some way to know when there are no more unresolved cells left, so that it can stop working. Any solution that involves a linked list is probably sub-optimal for this, since it would require a linear scan to get those cells.
Another issue is that Excel displays a range of cells at one time. This may seem trivial, and to a large extent it is, but it will certainly be ideal if the app can pull all of the data needed to draw a range of cells in one shot. Part of this may be keeping track of the display height and width of the rows and columns, so that the display system can iterate over the range until the desired width and height of cells has been collected. The need to iterate in this manner may preclude the use of a hashing strategy for sparse storage of cells.
On top of that, there are some weaknesses of the representational model of spreadsheets that could be addressed much more effectively by taking a slightly different approach.
For example, column aggregates are sort of clunky. A column total is easy enough to implement in Excel, but it has a sort of magic behavior that works most of the time but not all of the time. For instance, if you add a row into the aggregated area, further calculations on that aggregate may continue to work, or not, depending on how you added it. If you copy and insert a row (and replace the values), everything works fine, but if you cut and paste the cells one row down, things don't work out so well.
Given that the data is 2-dimensional, I would have a 2D array to hold it in.
Well, you could store them in three Dictionaries: two Dictionary<int,CellValue> objects for rows and columns, and one Dictionary<string,CellValue> for text. You'd have to keep all three carefully in sync though.
I'm not sure that I wouldn't just go with a big two-dimensional array though...
If it's an exact clone, then an array-backed list of CellValue[256] arrays. Excel has 256 columns, but a growable number of rows.
If rows and columns can be added "dynamically", then you shouldn't store the row/column as a numeric attribute of the cell, but rather as a reference to a row or column object.
Example:
// A class (not a struct), so the row/column lists hold references to the same cell.
private class CellValue
{
    private List<CellValue> _column;
    private List<CellValue> _row;
    private string text;

    public List<CellValue> column
    {
        get { return _column; }
        set
        {
            if (_column != null) { _column.Remove(this); }
            _column = value;
            _column.Add(this);
        }
    }

    public List<CellValue> row
    {
        get { return _row; }
        set
        {
            if (_row != null) { _row.Remove(this); }
            _row = value;
            _row.Add(this);
        }
    }
}

private List<List<CellValue>> MyRows = new List<List<CellValue>>();
private List<List<CellValue>> MyColumns = new List<List<CellValue>>();
Each Row and Column object is implemented as a List of CellValue objects. These are unordered; the order of the cells in a particular Row does not correspond to the Column index, and vice versa.
Each sheet has a List of Rows and a list of Columns, in order of the sheet (shown above as MyRows and MyColumns).
This will allow you to rearrange and insert new rows and columns without looping through and updating any cells.
Deleting a row should loop through the cells on the row and delete them from their respective columns before deleting the row itself. And vice-versa for columns.
To find a particular Row and Column, find the appropriate Row and Column objects, then find the CellValue that they contain in common.
Example:
public CellValue GetCell(int rowIndex, int colIndex)
{
    List<CellValue> row = MyRows[rowIndex];
    List<CellValue> col = MyColumns[colIndex];
    return row.Intersect(col).First();
}
(I'm a little fuzzy on these Extension methods in .NET 3.5, but this should be in the ballpark.)
If I recall correctly, there was an article about how Visicalc did it, maybe in Byte Magazine in the early 80s. I believe it was a sparse array of some sort. But I think there were links both up-and-down and left-and-right, so that any given cell had a pointer to the cell above it (however many cells away that may be), below it, to the left of it, and to the right of it.
