Reading a file with Apache Arrow ArrowFileReader in .NET - C#

I am trying to read the content of an Arrow file, but I was not able to find the functions to get the actual data from it, nor was I able to find any useful example of reading the data (for example here).
The code example for writing and reading in C#:
// Write
var recordBatch = new Apache.Arrow.RecordBatch.Builder(memoryAllocator)
    .Append("Column A", false, col => col.Int32(array => array.AppendRange(Enumerable.Range(5, 15))))
    .Build();

using (var stream = File.OpenWrite(filePath))
using (var writer = new Apache.Arrow.Ipc.ArrowFileWriter(stream, recordBatch.Schema, true))
{
    await writer.WriteRecordBatchAsync(recordBatch);
    await writer.WriteEndAsync();
}

// Read
var reader = Apache.Arrow.Ipc.ArrowFileReader.FromFile(filePath);
var readBatch = await reader.ReadNextRecordBatchAsync();
var col = readBatch.Column(0);
By debugging the code, I can see the values in the column's Values property, but I have no way of accessing this information in code.
Am I missing anything or is there a different approach to read the data?

The Apache.Arrow package does not do any compute today. It will read in the file and give you access to the raw buffers of data. This is sufficient for a number of intermediary tasks (e.g. services that shuttle data around or aggregate data files), but if you want to do a lot of operations on the data you may want some kind of dataframe library.
One such library is Microsoft.Data.Analysis, which provides a DataFrame type that can be created from an Arrow RecordBatch. There is some explanation, with examples of the library, in this blog post.
I haven't worked with that library much but I was able to put together a short example of reading an Arrow file and printing the data:
using System;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;
using Apache.Arrow.Ipc;
using Microsoft.Data.Analysis;

namespace DataframeExperiment
{
    class Program
    {
        static async Task AsyncMain()
        {
            using (var stream = File.OpenRead("/tmp/test.arrow"))
            using (var reader = new ArrowFileReader(stream))
            {
                var recordBatch = await reader.ReadNextRecordBatchAsync();
                Console.WriteLine("Read record batch with {0} column(s)", recordBatch.ColumnCount);
                var dataframe = DataFrame.FromArrowRecordBatch(recordBatch);
                var columnX = dataframe["x"];
                foreach (var value in columnX)
                {
                    Console.WriteLine(value);
                }
            }
        }

        static void Main(string[] args)
        {
            AsyncMain().Wait();
        }
    }
}
I created the test file with a small python script:
import pyarrow as pa
import pyarrow.ipc as ipc
tab = pa.Table.from_pydict({'x': [1, 2, 3], 'y': ['x', 'y', 'z']})
with ipc.RecordBatchFileWriter('/tmp/test.arrow', schema=tab.schema) as writer:
    writer.write_table(tab)
You could presumably also create the test file using C# with Apache.Arrow's array builders.
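For example, a rough sketch of what that might look like (untested; it assumes the RecordBatch.Builder column builder exposes Int64 and String helpers analogous to the Int32 helper used in the question):

using System.IO;
using System.Threading.Tasks;
using Apache.Arrow;
using Apache.Arrow.Ipc;

static async Task WriteTestFileAsync(string filePath)
{
    // Build a batch equivalent to the pyarrow table: x = [1, 2, 3], y = ["x", "y", "z"].
    var recordBatch = new RecordBatch.Builder()
        .Append("x", false, col => col.Int64(array => array.AppendRange(new long[] { 1, 2, 3 })))
        .Append("y", false, col => col.String(array => array.AppendRange(new[] { "x", "y", "z" })))
        .Build();

    using (var stream = File.OpenWrite(filePath))
    using (var writer = new ArrowFileWriter(stream, recordBatch.Schema))
    {
        await writer.WriteRecordBatchAsync(recordBatch);
        await writer.WriteEndAsync();
    }
}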
Update (Using Apache.Arrow directly)
On the other hand, if you want to use Apache.Arrow directly and still get access to the data, you can use the typed arrays (e.g. Int32Array, Int64Array). You will first need to determine the type of your array somehow, either through prior knowledge of the schema or with as/is-style checks or pattern matching.
Here is an example using Apache.Arrow alone:
using System;
using System.IO;
using System.Threading.Tasks;
using Apache.Arrow;
using Apache.Arrow.Ipc;

namespace ArrayValuesExperiment
{
    class Program
    {
        static async Task AsyncMain()
        {
            using (var stream = File.OpenRead("/tmp/test.arrow"))
            using (var reader = new ArrowFileReader(stream))
            {
                var recordBatch = await reader.ReadNextRecordBatchAsync();
                // Here I am relying on the fact that I know column
                // 0 is an int64 array.
                var columnX = (Int64Array)recordBatch.Column(0);
                for (int i = 0; i < columnX.Values.Length; i++)
                {
                    Console.WriteLine(columnX.Values[i]);
                }
            }
        }

        static void Main(string[] args)
        {
            AsyncMain().Wait();
        }
    }
}
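One caveat with indexing Values directly: it reads the raw value buffer and does not consult the null bitmap. If the column can contain nulls, a variation along these lines is safer (a sketch, assuming your Apache.Arrow version exposes GetValue(int) on the primitive array types):

// Iterate by logical index; GetValue(i) returns long? and is null for null slots.
for (int i = 0; i < columnX.Length; i++)
{
    long? value = columnX.GetValue(i);
    Console.WriteLine(value?.ToString() ?? "null");
}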

Adding to the second approach proposed by Pace, a utility function like the one below can be used to get the values:
private static dynamic GetArrayData(IArrowArray array)
{
    return array switch
    {
        Int32Array int32Array => int32Array.Values.ToArray(),
        Int16Array int16Array => int16Array.Values.ToArray(),
        StringArray stringArray => stringArray.Values.ToArray(),
        FloatArray floatArray => floatArray.Values.ToArray(),
        Int64Array int64Array => int64Array.Values.ToArray(),
        DoubleArray doubleArray => doubleArray.Values.ToArray(),
        Time32Array time32Array => time32Array.Values.ToArray(),
        Time64Array time64Array => time64Array.Values.ToArray(),
        BooleanArray booleanArray => booleanArray.Values.ToArray(),
        Date32Array date32Array => date32Array.Values.ToArray(),
        Date64Array date64Array => date64Array.Values.ToArray(),
        Int8Array int8Array => int8Array.Values.ToArray(),
        UInt16Array uint16Array => uint16Array.Values.ToArray(),
        UInt8Array uint8Array => uint8Array.Values.ToArray(),
        UInt64Array uint64Array => uint64Array.Values.ToArray(),
        _ => throw new NotImplementedException(),
    };
}
then iterate over the recordBatch as follows:

object[,] results = new object[recordBatch.Length, recordBatch.ColumnCount];
var col = 0;
foreach (var array in recordBatch.Arrays)
{
    var row = 0;
    foreach (var data in GetArrayData(array))
    {
        results[row++, col] = data;
    }
    col++;
}
return results;
Worth noting, however, that StringArray.Values returns the raw UTF-8 bytes, so you need to convert them back to a string, for example using
System.Text.Encoding.UTF8.GetString(stringArray.Values)
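Also note that decoding the whole Values buffer this way gives you all of the string values concatenated together rather than one value per row. To recover individual values you can read them by index; a sketch, assuming your Apache.Arrow version exposes StringArray.GetString(int) and that column 1 is the string column:

var stringColumn = (StringArray)recordBatch.Column(1); // hypothetical column index
for (int i = 0; i < stringColumn.Length; i++)
{
    // GetString decodes the UTF-8 slice for row i (and returns null for null slots).
    Console.WriteLine(stringColumn.GetString(i));
}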

Related

FlatFile library, delimited layout, wrong parsing when multiple fields are empty at the end of the row

We use the FlatFile library (https://github.com/forcewake/FlatFile) in some of our applications to parse files delimited with a separator (";"), and it has worked without problems for a long time.
Yesterday we ran into a problem with files that have multiple empty fields at the end of a row.
I replicated the problem with a short console application so you can verify it in a simple way:
using FlatFile.Delimited;
using FlatFile.Delimited.Attributes;
using FlatFile.Delimited.Implementation;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace FlatFileTester
{
    class Program
    {
        static void Main(string[] args)
        {
            var layout = GetLayout();
            var factory = new DelimitedFileEngineFactory();
            using (MemoryStream ms = new MemoryStream())
            using (FileStream file = new FileStream(@"D:\shared\dotnet\FlatFileTester\test.csv", FileMode.Open, FileAccess.Read))
            {
                byte[] bytes = new byte[file.Length];
                file.Read(bytes, 0, (int)file.Length);
                ms.Write(bytes, 0, (int)file.Length);
                var flatFile = factory.GetEngine(layout);
                ms.Position = 0;
                List<TestObject> records = flatFile.Read<TestObject>(ms).ToList();
                foreach (var record in records)
                {
                    Console.WriteLine(string.Format("Id=\"{0}\" - DescriptionA=\"{1}\" - DescriptionB=\"{2}\" - DescriptionC=\"{3}\"", record.Id, record.DescriptionA, record.DescriptionB, record.DescriptionC));
                }
            }
            Console.ReadLine();
        }

        public static IDelimitedLayout<TestObject> GetLayout()
        {
            IDelimitedLayout<TestObject> layout = new DelimitedLayout<TestObject>()
                .WithDelimiter(";")
                .WithQuote("\"")
                .WithMember(x => x.Id)
                .WithMember(x => x.DescriptionA)
                .WithMember(x => x.DescriptionB)
                .WithMember(x => x.DescriptionC)
                ;
            return layout;
        }
    }

    [DelimitedFile(Delimiter = ";", Quotes = "\"")]
    public class TestObject
    {
        [DelimitedField(1)]
        public int Id { get; set; }

        [DelimitedField(2)]
        public string DescriptionA { get; set; }

        [DelimitedField(3)]
        public string DescriptionB { get; set; }

        [DelimitedField(4)]
        public string DescriptionC { get; set; }
    }
}
This is an example file:
1;desc1;desc1;desc1
2;desc2;desc2;desc2
3;desc3;;desc3
4;desc4;desc4;
5;desc5;;
So the first 4 rows are parsed as expected:
all fields have values in the first and second rows
an empty string for the third field of the third row
an empty string for the fourth field of the fourth row
In the fifth row we expect an empty string for the third and fourth fields, like this:
Id=5
DescriptionA="desc5"
DescriptionB=""
DescriptionC=""
instead we receive this:
Id=5
DescriptionA="desc5"
DescriptionB=";" // --> THE SEPARATOR!!!
DescriptionC=""
We can't tell whether this is a configuration problem, a bug in the library, or some other problem in our code...
Has anyone had similar experiences with this library, or can anyone spot a problem in the code above that is not related to the library but is causing the error?
I took a look and debugged the source code of the open source library: https://github.com/forcewake/FlatFile.
It seems there is a problem in this particular case: when there are 2 empty fields at the end of a row, the bug affects the field before the last one in the row.
I opened an issue for this library, hoping some contributor of the library can invest some time to investigate and, if it is confirmed, fix it: https://github.com/forcewake/FlatFile/issues/80
For now we decided to fix the wrong values in the list, something like:
string separator = ";";
//...
//...
//...
records.ForEach(x => {
    x.DescriptionC = x.DescriptionC.Replace(separator, "");
});
In our case, anyway, it makes no sense to have a character corresponding to the separator as the value of that field...
...even if it would be better to have the bug fixed in the library.

Reading endless XML fragments from Linux FIFO file using XmlReader and Reactive

I am trying to read endless XML fragments coming from a FIFO and convert them to an IObservable<T> by using XmlReader on Linux.
My sample code below works on .NET Core 2, but the XmlReader.ReadToFollowing method does not return false (it blocks), even after all resources have been released.
How do I fix this and get OnCompleted called?
using System;
using System.IO;
using System.Reactive.Concurrency;
using System.Reactive.Disposables;
using System.Reactive.Linq;
using System.Xml;

namespace ConsoleApp
{
    internal class Program
    {
        private static readonly XmlReaderSettings XmlReaderSettings =
            new XmlReaderSettings {ConformanceLevel = ConformanceLevel.Fragment /*Async = true, CloseInput = true*/};

        private static void Main(string[] args)
        {
            var fifoPath = "/tmp/fifo";
            using (var fifoStream = new FileStream(fifoPath, FileMode.Open))
            using (var fifoReader = new StreamReader(fifoStream))
            using (var xmlReader = XmlReader.Create(fifoReader, XmlReaderSettings))
            {
                var disposable = GetObservable(xmlReader)
                    .SubscribeOn(new EventLoopScheduler())
                    .Subscribe(Console.WriteLine, Console.WriteLine, () => Console.WriteLine("OnCompleted called."));
                Console.ReadLine();
                disposable.Dispose();
            }
            Console.ReadLine();
        }

        private static IObservable<string> GetObservable(XmlReader xmlReader)
        {
            return Observable.Create<string>(o =>
            {
                while (xmlReader.ReadToFollowing("item"))
                {
                    // Actually, parse the item element and return it.
                    o.OnNext("OnNext item.");
                }
                o.OnCompleted();
                Console.WriteLine("OnCompleted.");
                return Disposable.Empty;
            });
        }
    }
}
Repro steps
1. Make the FIFO: mkfifo /tmp/fifo
2. Run the sample code.
3. Simulate endless XML: echo "<item/><item/><item/>" > /tmp/fifo
4. Press any key. "OnCompleted" is not shown.

Display csv read data in console application in c#

I am completely new to programming and am trying to get the complete row data from a CSV file based on a column value in C#. Example data is as follows:
Mat_No;Device;Mat_Des;Dispo_lvl;Plnt;MS;IPDS;TM;Scope;Dev_Cat
1111;BLB A601;BLB A601;T2;PW01;10;;OP_ELE;LED;
2222;ALP A0001;ALP A0001;T2;PW01;10;;OP_ELE;LED;
If the user enters a Mat_No, he gets the full row data for that particular number.
I have two files, Program.cs and filling.cs.
overViewArea.cs contains the following code for CSV file reading. I don't know how to access the read values from Program.cs and display them in the console:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using System.Data;

namespace TSDB
{
    class fillData
    {
        public static fillData readCsv()
        {
            fillData getData = new fillData();
            using (var reader = new StreamReader(@"myfile.csv"))
            {
                List<string> headerList = null;
                while (!reader.EndOfStream)
                {
                    var line = reader.ReadLine();
                    if (headerList == null)
                    {
                        headerList = line.Split(';').ToList();
                    }
                    else
                    {
                        var values = line.Split(';');
                        for (int i = 0; i < headerList.Count; i++)
                        {
                            Console.Write(headerList[i] + "=" + values[i] + ";");
                        }
                        Console.WriteLine();
                    }
                }
            }
            return getData;
        }
    }
}
Program.cs has the following code:
class Program
{
    static void Main(string[] args)
    {
        fillData data = fillData.readCsv();
        Console.ReadLine();
    }
}
First, please do not reinvent the wheel: there are many CSV readers available; just use one of them. If you have to write your own routine (say, for a student project), I suggest extracting a method. Try using the File class instead of Stream/StreamReader:
// Simple: quotation has not been implemented
// Disclaimer: demo only, do not use your own CSV readers
public static IEnumerable<string[]> ReadCsvSimple(string file, char delimiter) {
    return File
        .ReadLines(file)
        .Where(line => !string.IsNullOrEmpty(line)) // skip empty lines if any
        .Select(line => line.Split(delimiter));
}
Having implemented this routine, you can use LINQ to query the data, e.g.:
If user enters a Mat_No he gets the full row data of that particular
number.
Console.WriteLine("Mat No, please?");
string Mat_No_To_Filter = Console.ReadLine();
var result = ReadCsvSimple(#"myfile.csv", ';')
.Skip(1)
.Where(record => record[0] == Mat_No_To_Filter);
foreach (var items in result)
Console.WriteLine(string.Join(";", items));

How to sort DBML objects alphabetically?

I've got a DBML file in my project with all my LinqToSql objects. Initially I imported them from the DB, and all was well. Now as my DB has been growing, I've been adding the new tables to the diagram in the O/R Designer, but they always get appended to the end of the XML. This is a bit of a pain, because when I'm defining foreign keys, it always lists the available tables in the order in which they appear in the XML.
Any ideas how to sort the XML table declarations alphabetically according to the table name?
I know this is old, but I also want to sort the tables and functions in my DBML to make it more manageable in Git. The following console application code seems to work pretty well. You can drag and drop a DBML file onto the exe, or you could set up a bat file or build event in your project(s).
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml.Linq;

namespace DbmlSorter
{
    class Program
    {
        static void Main(string[] args)
        {
            if (args.Length == 0)
                return;

            var fileName = args[0];
            try
            {
                if (!File.Exists(fileName))
                    return;
                SortElements(fileName);
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.ToString());
            }

            Console.WriteLine();
            Console.WriteLine("Press any key to exit...");
            Console.ReadKey();
        }

        private static void SortElements(string fileName)
        {
            var root = XElement.Load(fileName);
            var connections = new SortedDictionary<string, XElement>();
            var tables = new SortedDictionary<string, XElement>();
            var functions = new SortedDictionary<string, XElement>();
            var others = new SortedDictionary<string, XElement>();

            foreach (var element in root.Elements())
            {
                var key = element.ToString();
                if (key.StartsWith("<Connection"))
                    connections.Add(key, element);
                else if (key.StartsWith("<Table"))
                    tables.Add(key, element);
                else if (key.StartsWith("<Function"))
                    functions.Add(key, element);
                else
                    others.Add(key, element);
            }

            root.RemoveNodes();
            foreach (var pair in connections)
            {
                root.Add(pair.Value);
                Console.WriteLine(pair.Key);
            }
            foreach (var pair in tables)
            {
                root.Add(pair.Value);
                Console.WriteLine(pair.Key);
            }
            foreach (var pair in functions)
            {
                root.Add(pair.Value);
                Console.WriteLine(pair.Key);
            }
            foreach (var pair in others)
            {
                root.Add(pair.Value);
                Console.WriteLine(pair.Key);
            }

            root.Save(fileName);
        }
    }
}
A possible solution is to write a small application that reads in the XML, sorts it to your liking and outputs the updated version.
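For instance, a minimal sketch using LINQ to XML. The file name and the DBML namespace below are assumptions to check against your own .dbml, and this version simply re-appends the sorted Table elements at the end of the root rather than grouping every element type as the answer above does:

using System.Linq;
using System.Xml.Linq;

var doc = XDocument.Load("MyDataClasses.dbml");                      // hypothetical path
XNamespace ns = "http://schemas.microsoft.com/linqtosql/dbml/2007";  // verify against your file

// Detach the <Table> elements, then re-append them sorted by their Name attribute.
var tables = doc.Root.Elements(ns + "Table")
                     .OrderBy(t => (string)t.Attribute("Name"))
                     .ToList();
tables.ForEach(t => t.Remove());
doc.Root.Add(tables);

doc.Save("MyDataClasses.dbml");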

How to add a property to a PNG file

I have a PNG file to which I want to add the properties
Pixels per unit, X axis
Pixels per unit, Y axis
Unit specifier: meters
These properties are explained in the PNG specification: http://www.w3.org/TR/PNG-Chunks.html
I have programmatically read the properties of the .png to check whether these properties exist, so that I can set their values, but I could not see these properties in the .png file.
(Refer pixel-per-unit.JPG)
How can we add properties to the .png file?
regards
Try using the Pngcs library (you need to rename the downloaded DLL to "pngcs.dll").
I needed to add some custom text properties, but you can easily do much more.
Here is my implementation for adding custom text properties:
using Hjg.Pngcs; // https://code.google.com/p/pngcs/
using Hjg.Pngcs.Chunks;
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace MarkerGenerator.Utils
{
    class PngUtils
    {
        public string getMetadata(string file, string key)
        {
            PngReader pngr = FileHelper.CreatePngReader(file);
            //pngr.MaxTotalBytesRead = 1024 * 1024 * 1024L * 3; // 3Gb!
            //pngr.ReadSkippingAllRows();
            string data = pngr.GetMetadata().GetTxtForKey(key);
            pngr.End();
            return data;
        }

        public static void addMetadata(String origFilename, Dictionary<string, string> data)
        {
            String destFilename = "tmp.png";
            PngReader pngr = FileHelper.CreatePngReader(origFilename); // or you can use the constructor
            PngWriter pngw = FileHelper.CreatePngWriter(destFilename, pngr.ImgInfo, true); // idem
            //Console.WriteLine(pngr.ToString()); // just information
            int chunkBehav = ChunkCopyBehaviour.COPY_ALL_SAFE; // tell it to copy all 'safe' chunks
            pngw.CopyChunksFirst(pngr, chunkBehav); // copy some metadata from the reader
            foreach (string key in data.Keys)
            {
                PngChunk chunk = pngw.GetMetadata().SetText(key, data[key]);
                chunk.Priority = true;
            }

            int channels = pngr.ImgInfo.Channels;
            if (channels < 3)
                throw new Exception("This example works only with RGB/RGBA images");
            for (int row = 0; row < pngr.ImgInfo.Rows; row++)
            {
                ImageLine l1 = pngr.ReadRowInt(row); // format: RGBRGB... or RGBARGBA...
                pngw.WriteRow(l1, row);
            }
            pngw.CopyChunksLast(pngr, chunkBehav); // metadata after the image pixels? can happen
            pngw.End(); // don't forget this
            pngr.End();
            File.Delete(origFilename);
            File.Move(destFilename, origFilename);
        }

        public static void addMetadata(String origFilename, string key, string value)
        {
            Dictionary<string, string> data = new Dictionary<string, string>();
            data.Add(key, value);
            addMetadata(origFilename, data);
        }
    }
}
I think you are looking for SetPropertyItem. You can find the property IDs here.
You would use the property ID to get and then set the property item for your metadata.
EDIT
The three IDs that you need (I think) are:
0x5111 - Pixels Per Unit X
0x5112 - Pixels Per Unit Y
0x5110 - Pixel Unit
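For illustration, a rough sketch of how that could look with System.Drawing on Windows. The value 3780 pixels per meter (roughly 96 DPI) is just an example; PropertyItem has no public constructor, so the uninitialized-object workaround below is the commonly used trick, and you should verify on your target framework that GDI+ actually maps these properties to the PNG pHYs chunk when saving:

using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.Runtime.Serialization;

class PngPhysExample
{
    static PropertyItem CreateProperty(int id, short type, byte[] value)
    {
        // PropertyItem has no public constructor; create an uninitialized instance instead.
        var prop = (PropertyItem)FormatterServices.GetUninitializedObject(typeof(PropertyItem));
        prop.Id = id;
        prop.Type = type;
        prop.Len = value.Length;
        prop.Value = value;
        return prop;
    }

    static void Main()
    {
        using (var image = Image.FromFile("input.png"))
        {
            // Type 4 = unsigned 32-bit integer, type 1 = byte.
            image.SetPropertyItem(CreateProperty(0x5111, 4, BitConverter.GetBytes(3780u))); // pixels per unit, X axis
            image.SetPropertyItem(CreateProperty(0x5112, 4, BitConverter.GetBytes(3780u))); // pixels per unit, Y axis
            image.SetPropertyItem(CreateProperty(0x5110, 1, new byte[] { 1 }));             // unit specifier (1 = meter in the PNG pHYs chunk)
            image.Save("output.png", ImageFormat.Png);
        }
    }
}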
