Copy contents of the Excel file from a stream asynchronously using SqlBulkCopy - c#

I have a project where I need to copy the contents of an .xlsx file I received in a Web API controller (in the form of a Stream from MultipartReader) to a SQL Server database. I'm using SqlBulkCopy for the copy itself (I already did a similar task for .csv files), but all of the solutions I was able to find suffer from one or more of the following problems:
Require saving the file to the disk first (not possible in my case)
Don't have any way of reading the file asynchronously
Load entire file into memory first (I'm expecting to deal with fairly large files, so this is not acceptable for me)
Are commercially licensed
Are there any ways of doing this?

Jeroen is correct in that it is not possible to handle Excel files in a purely streaming manner. While it might require loading the entire .xlsx file into memory, the efficiency of the library can have an even larger impact on memory usage than the file size. I say this as the author of the most efficient Excel reader for .NET: Sylvan.Data.Excel.
In benchmarks comparing it to other libraries, you can see that not only is it significantly faster than other implementations, but it also uses only a tiny fraction of the memory that other libraries consume.
With the exception of "Load entire file into memory first", it should satisfy all of your requirements. It can process data out of a MemoryStream; it doesn't need to write to disk. It implements DbDataReader, which provides ReadAsync. The ReadAsync implementation defaults to the base DbDataReader implementation, which defers to the synchronous Read() method, but when the file is buffered in a MemoryStream this doesn't present a problem, and it allows SqlBulkCopy.WriteToServerAsync to process it asynchronously. Finally, it is MIT licensed, so you can do whatever you want with it.
using Sylvan.Data;
using Sylvan.Data.Excel;
using System.Data.Common;
using System.Data.SqlClient;
// provide a schema that maps the columns in the Excel file to the names/types in your database.
var opts = new ExcelDataReaderOptions
{
Schema = MyDataSchemaProvider.Instance
};
var filename = "mydata.xlsx";
var ms = new MemoryStream();
// asynchronously load the file into memory
// this might be loading from an Asp.NET IFormFile instead
using(var f = File.OpenRead(filename))
{
await f.CopyToAsync(ms);
ms.Seek(0, SeekOrigin.Begin);
}
// determine the workbook type from the file-extension
var workbookType = ExcelDataReader.GetWorkbookType(filename);
var edr = ExcelDataReader.Create(ms, workbookType, opts);
// "select" the columns to load. This extension method comes from the Sylvan.Data library.
var dataToLoad = edr.Select("PartNumber", "ServiceDate");
// bulk copy the data to the server.
var conn = new SqlConnection("Data Source=.;Initial Catalog=mydb;Integrated Security=true;");
conn.Open();
var bc = new SqlBulkCopy(conn);
bc.DestinationTableName = "MyData";
bc.EnableStreaming = true;
await bc.WriteToServerAsync(dataToLoad);
// Implement an ExcelSchemaProvider that maps the columns in the excel file
sealed class MyDataSchemaProvider : ExcelSchemaProvider
{
public static ExcelSchemaProvider Instance = new MyDataSchemaProvider();
static readonly DbColumn PartNumber = new MyColumn("PartNumber", typeof(int));
static readonly DbColumn ServiceDate = new MyColumn("ServiceDate", typeof(DateTime));
// etc...
static readonly Dictionary<string, DbColumn> Mapping = new Dictionary<string, DbColumn>(StringComparer.OrdinalIgnoreCase)
{
{ "partnumber", PartNumber },
{ "number", PartNumber },
{ "prt_nmbr", PartNumber },
{ "servicedate", ServiceDate },
{ "service_date", ServiceDate },
{ "svc_dt", ServiceDate },
{ "sd", ServiceDate },
};
public override DbColumn? GetColumn(string sheetName, string? name, int ordinal)
{
if (string.IsNullOrEmpty(name))
{
// There was no name in the header row, can't map to anything.
return null;
}
if (Mapping.TryGetValue(name, out DbColumn? col))
{
return col;
}
// header name is unknown. Might be better to throw in this case.
return null;
}
class MyColumn : DbColumn
{
public MyColumn(string name, Type type, bool allowNull = false)
{
this.ColumnName = name;
this.DataType = type;
this.AllowDBNull = allowNull;
}
}
public override bool HasHeaders(string sheetName)
{
return true;
}
}
The most complicated part of this is probably the "schema provider" which is used to provide header name mappings and define the column types, which are required for SqlBulkCopy to operate correctly.
I also maintain the Sylvan.Data.Csv library, which provides very similar capabilities for CSV files and is a fully asynchronous streaming CSV reader implementation. The API it provides is nearly identical to the Sylvan ExcelDataReader. It is also the fastest CSV reader for .NET.
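For the CSV case, a minimal sketch of feeding an uploaded CSV straight into SqlBulkCopy might look like the following; the connection string, table name, and csvStream variable are placeholders, and without an explicit schema the columns are read as strings and mapped by ordinal, so you may want to supply a schema or column mappings just as in the Excel example.
using Sylvan.Data.Csv;
using System.Data.SqlClient;
// csvStream: the uploaded CSV content, e.g. buffered from the request body.
using var reader = new StreamReader(csvStream);
// CsvDataReader implements DbDataReader and streams the file row by row.
using var csv = await CsvDataReader.CreateAsync(reader);
using var conn = new SqlConnection("Data Source=.;Initial Catalog=mydb;Integrated Security=true;");
await conn.OpenAsync();
var bc = new SqlBulkCopy(conn)
{
    DestinationTableName = "MyData",
    EnableStreaming = true
};
await bc.WriteToServerAsync(csv);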
If you end up trying these libraries and have any troubles, open an issue in the github repo and I can take a look.

Related

split binary file into chunks or Parts upload / download

For various reasons I am using CouchDB as a content management store to upload files as binary data. There is no GridFS support like MongoDB has for uploading large files, so I need to upload files as chunks and then retrieve them as one file.
here is my code
public string InsertDataToCouchDb(string dbName, string id, string filename, byte[] image)
{
var connection = System.Configuration.ConfigurationManager.ConnectionStrings["CouchDb"].ConnectionString;
using (var db = new MyCouchClient(connection, dbName))
{
// HERE I NEED TO UPLOAD MY IMAGE BYTE[] AS CHUNKS
var artist = new couchdb
{
_id = id,
filename = filename,
Image = image
};
var response = db.Entities.PutAsync(artist);
return response.Result.Content._id;
}
}
public byte[] FetchDataFromCouchDb(string dbName, string id)
{
var connection = System.Configuration.ConfigurationManager.ConnectionStrings["CouchDb"].ConnectionString;
using (var db = new MyCouchClient(connection, dbName))
{
//HERE I NEED TO RETRIVE MY FULL IMAGE[] FROM CHUNKS
var test = db.Documents.GetAsync(id, null);
var doc = db.Serializer.Deserialize<couchdb>(test.Result.Content);
return doc.Image;
}
}
THANK YOU
Putting image data in a CouchDB document is a terrible idea. Just don't. This is the purpose of CouchDB attachments.
The potential of bloating the database with redundant blob data via document updates alone will surely have major, negative consequences for anything other than a toy database.
Further, there seems to be a lack of understanding of how async/await works: the code in the OP invokes async methods, e.g. db.Entities.PutAsync(artist), without an await and then blocks on .Result, which at best defeats the purpose of the async API and at worst risks deadlocks. I highly recommend grokking the Microsoft document Asynchronous programming with async and await.
Now as for "chunking": If the image data is so large that it needs to be otherwise streamed, the business of passing it around via a byte array looks bad. If the images are relatively small, just use Attachment.PutAsync as it stands.
Although Attachment.PutAsync at MyCouch v7.6 does not support streams (effectively chunking) there exists the Support Streams for attachments #177 PR, which does, and it looks pretty good.
Here's a one page C# .Net Core console app that uploads a given file as an attachment to a specific document using the very efficient streaming provided by PR 177. Although the code uses PR 177, it most importantly uses Attachments for blob data. Replacing a stream with a byte array is rather straightforward.
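For instance, if you are starting from a byte[] as in the question, wrapping it in a stream is a one-liner:
// imageBytes is the byte[] from the question; MemoryStream makes it usable wherever a Stream is expected.
using (var ms = new MemoryStream(imageBytes))
{
    // pass ms to the stream-based attachment request shown below
}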
MyCouch + PR 177
In a console get MyCouch sources and then apply PR 177
$ git clone https://github.com/danielwertheim/mycouch.git
$ cd mycouch
$ git pull origin 15a1079502a1728acfbfea89a7e255d0c8725e07
(I don't know git so there's probably a far better way to get a PR)
MyCouchUploader
With VS2019
Create a new .Net Core console app project and solution named "MyCouchUploader"
Add the MyCouch project pulled with PR 177 to the solution
Add the MyCouch project as MyCouchUploader dependency
Add the Nuget package "Microsoft.AspNetCore.StaticFiles" as a MyCouchUploader dependency
Replace the content of Program.cs with the following code:
using Microsoft.AspNetCore.StaticFiles;
using MyCouch;
using MyCouch.Requests;
using MyCouch.Responses;
using System;
using System.IO;
using System.Linq;
using System.Net;
using System.Security.Cryptography;
using System.Threading.Tasks;
namespace MyCouchUploader
{
class Program
{
static async Task Main(string[] args)
{
// args: scheme, database, file path of asset to upload.
if (args.Length < 3)
{
Console.WriteLine("\nUsage: MyCouchUploader scheme dbname filepath\n");
return;
}
var opts = new
{
scheme = args[0],
dbName = args[1],
filePath = args[2]
};
Action<Response> check = (response) =>
{
if (!response.IsSuccess) throw new Exception(response.Reason);
};
try
{
// canned doc id for this app
const string docId = "SO-68998781";
const string attachmentName = "Image";
DbConnectionInfo cnxn = new DbConnectionInfo(opts.scheme, opts.dbName)
{ // timely fail if scheme is bad
Timeout = TimeSpan.FromMilliseconds(3000)
};
MyCouchClient client = new MyCouchClient(cnxn);
// ensure db is there
GetDatabaseResponse info = await client.Database.GetAsync();
check(info);
// delete doc for successive program runs
DocumentResponse doc = await client.Documents.GetAsync(docId);
if (doc.StatusCode == HttpStatusCode.OK)
{
DocumentHeaderResponse del = await client.Documents.DeleteAsync(docId, doc.Rev);
check(del);
}
// sniff file for content type
FileExtensionContentTypeProvider provider = new FileExtensionContentTypeProvider();
if (!provider.TryGetContentType(opts.filePath, out string contentType))
{
contentType = "application/octet-stream";
}
// create a hash for silly verification
using var md5 = MD5.Create();
using Stream stream = File.OpenRead(opts.filePath);
byte[] fileHash = md5.ComputeHash(stream);
stream.Position = 0;
// Use PR 177, sea-locks:stream-attachments.
DocumentHeaderResponse put = await client.Attachments.PutAsync(new PutAttachmentStreamRequest(
docId,
attachmentName,
contentType,
stream // :-D
));
check(put);
// verify
AttachmentResponse verify = await client.Attachments.GetAsync(docId, attachmentName);
check(verify);
if (fileHash.SequenceEqual(md5.ComputeHash(verify.Content)))
{
Console.WriteLine("Atttachment verified.");
}
else
{
throw new Exception(String.Format("Attachment failed verification with status code {0}", verify.StatusCode));
}
}
catch (Exception e)
{
Console.WriteLine("Fail! {0}", e.Message);
}
}
}
}
To run:
$ MyCouchUploader http://name:password@localhost:5984 dbname path-to-local-image-file
Use Fauxton to visually verify the attachment for the doc.

Convert uploaded .csv to memorystream and then to c# datatable in .Net Core

I am trying to make an application that will take in multiple file types and convert them to a c# DataTable. To do this I am first copying the file to a MemoryStream and recording the file's extension. Then based on the extension I need to read the stream in different ways.
I'm having difficulty when it comes to uploading a .csv file. I first copy it to a memory stream, but then I am not able to read from it correctly. Please help.
Example Code
public async Task<IActionResult> UploadFile(IFormFile file) {
DataTable dt = new DataTable();
var extension = Path.GetExtension(file.FileName);
if (file.Length > 0) {
using (var ms = new MemoryStream()) {
await file.CopyToAsync(ms);
dt = ConvertFileToDataTable(ms, extension);
}
}
}
public DataTable ConvertFileToDataTable(MemoryStream stream, string ext) {
switch (ext.ToLower()) {
case ".xlsx":
// Already have this working
break;
case ".csv":
// This is where I need help
}
}
With the CSV I am making the assumption that the first row contains the headers. If I could just convert the MemoryStream back into a csv string then I could handle the logic from there, I just don't know how to do that part.
The reason I need to do the conversion to a MemoryStream is because I'm working on a .Net Standard Library that wouldn't have access to IFormFile. It would take in the stream and return a data table. Basically, it handles the code in the method ConvertFileToDataTable above.
You can get the byte array as a string by using:
Encoding.UTF8.GetString(stream.ToArray());
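If you'd rather keep working with the stream itself, here is a minimal sketch of turning it into a DataTable. It assumes a simple comma-separated file with a header row and no quoted fields containing embedded commas (use a real CSV parser for those cases), and note that the stream position has to be reset after CopyToAsync.
// Requires: using System.Data; using System.IO; using System.Text;
public static DataTable CsvStreamToDataTable(MemoryStream stream)
{
    var dt = new DataTable();
    stream.Position = 0; // CopyToAsync leaves the stream positioned at the end
    using (var reader = new StreamReader(stream, Encoding.UTF8, true, 1024, leaveOpen: true))
    {
        string line = reader.ReadLine();
        if (line == null) return dt; // empty file
        foreach (var header in line.Split(','))
        {
            dt.Columns.Add(header.Trim());
        }
        while ((line = reader.ReadLine()) != null)
        {
            dt.Rows.Add(line.Split(','));
        }
    }
    return dt;
}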

Uploading Stream to Database

I have a FileForUploading class which should be uploaded to a database.
public class FileForUploading
{
public FileForUploading(string filename, Stream stream)
{
this.Filename = filename;
this.Stream = stream;
}
public string Filename { get; private set; }
public Stream Stream { get; private set; }
}
I am using the Entity Framework to convert it to a FileForUploadingEntity
which is a very simple class that however only contains the Filename property. I don't want to store the Stream in memory but rather upload it directly to the database.
What would be the best way to 'stream' the Stream directly to the database?
So far I have come up with this
private void UploadStream(string name, Stream stream)
{
var sqlQuery = @"UPDATE dbo.FilesForUpload SET Content = @content WHERE Name = @name;";
var nameParameter = new SqlParameter()
{
ParameterName = "#name",
Value = name
};
var contentParameter = new SqlParameter()
{
ParameterName = "#content",
Value = ConvertStream(stream),
SqlDbType = SqlDbType.Binary
};
// the database context used throughout the application.
this.context.Database.ExecuteSqlCommand(sqlQuery, contentParameter, nameParameter);
}
And here is my ConvertStream, which converts the Stream to a byte[]. (It is stored as a varbinary(MAX) in the database.)
private static byte[] ConvertStream(Stream stream)
{
using (var memoryStream = new MemoryStream())
{
stream.CopyTo(memoryStream);
return memoryStream.ToArray();
}
}
Is the above solution good enough? Will it perform well if the Stream is large?
I don't want to store the Stream in memory but rather upload it directly to the database.
With the solution you proposed, you still have the content of the stream in memory in your application, which you mentioned initially is something you were trying to avoid.
Your best bet is to go around EF and use the async function to upload the stream. The following example is taken from MSDN article SqlClient Streaming Support.
// Application transferring a large BLOB to SQL Server in .Net 4.5
private static async Task StreamBLOBToServer() {
using (SqlConnection conn = new SqlConnection(connectionString)) {
await conn.OpenAsync();
using (SqlCommand cmd = new SqlCommand("INSERT INTO [BinaryStreams] (bindata) VALUES (#bindata)", conn)) {
using (FileStream file = File.Open("binarydata.bin", FileMode.Open)) {
// Add a parameter which uses the FileStream we just opened
// Size is set to -1 to indicate "MAX"
cmd.Parameters.Add("#bindata", SqlDbType.Binary, -1).Value = file;
// Send the data to the server asynchronously
await cmd.ExecuteNonQueryAsync();
}
}
}
}
You could convert this sample to the following to make it work for you. Note that you should change the signature on your method to make it async so you can take advantage of not having a thread blocked during a long lasting database update.
// change your signature to async so the thread can be released during the database update/insert act
private async Task UploadStreamAsync(string name, Stream stream) {
var conn = this.context.Database.Connection; // SqlConnection from your DbContext
if(conn.State != ConnectionState.Open)
await conn.OpenAsync();
using (SqlCommand cmd = new SqlCommand("UPDATE dbo.FilesForUpload SET Content =#content WHERE Name=#name;", conn)) {
cmd.Parameters.Add(new SqlParameter(){ParameterName = "#name",Value = name});
// Size is set to -1 to indicate "MAX"
cmd.Parameters.Add("#content", SqlDbType.Binary, -1).Value = stream;
// Send the data to the server asynchronously
await cmd.ExecuteNonQueryAsync();
}
}
One more note. If you want to save large unstructured data sets (i.e. the streams you are getting uploaded), then it might be a better idea not to save them in the database. There are numerous reasons why, but foremost is that relational databases were not really designed with this in mind, it's cumbersome to work with the data, and large blobs can chew up database space very fast, making other operations (e.g. backups and restores) more difficult.
There is an alternative that still natively allows you to save a pointer in the record but have the actual unstructured data reside on disk. You can do this using SQL Server FILESTREAM. In ADO.NET you would be working with SqlFileStream. Here is a good walk-through on how to configure your SQL Server instance and database to allow SQL file streams. It also has some VB.NET examples of how to use the SqlFileStream class.
An Introduction to SQL Server FileStream
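For completeness, here is a rough sketch of what the SqlFileStream write path can look like. It assumes the Content column of dbo.FilesForUpload has been created as a FILESTREAM varbinary(max) column and that a row with the given Name already exists with a non-null Content value (e.g. initialized to 0x); the table and column names simply mirror the question and are otherwise illustrative.
// Requires: using System.Data.SqlClient; using System.Data.SqlTypes; using System.IO;
private async Task UploadToFileStreamAsync(string name, Stream stream)
{
    using (var conn = new SqlConnection(connectionString))
    {
        await conn.OpenAsync();
        using (var tx = conn.BeginTransaction())
        {
            string path;
            byte[] txContext;
            using (var cmd = new SqlCommand(
                "SELECT Content.PathName(), GET_FILESTREAM_TRANSACTION_CONTEXT() " +
                "FROM dbo.FilesForUpload WHERE Name = @name;", conn, tx))
            {
                cmd.Parameters.AddWithValue("@name", name);
                using (var reader = await cmd.ExecuteReaderAsync())
                {
                    if (!await reader.ReadAsync())
                        throw new InvalidOperationException("Row not found.");
                    path = reader.GetString(0);
                    txContext = reader.GetSqlBytes(1).Value;
                }
            }
            // SqlFileStream writes the blob directly into the FILESTREAM store,
            // so the application never buffers the whole file in memory.
            using (var sqlStream = new SqlFileStream(path, txContext, FileAccess.Write))
            {
                await stream.CopyToAsync(sqlStream);
            }
            tx.Commit();
        }
    }
}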
I did assume you were using Microsoft Sql Server as your data repository. If this assumption is not correct please update your question and also add a tag for the correct database service you are connecting to.

How can I save entire MongoDB collection to json/bson file using C#?

I have process, which first, generates lots of data which is save into mongoDB collection, then data is analyzed, and last - I want to save the whole collection to file on disk, and erase the collection.
I know I could do it easily with MongoDump.exe, but I was wondering: is there any way to do it directly from C#? I mean not running a console process, but using some functionality that is inside the Mongo C# driver.
And, if it can be done - how would I do the reverse operation in c# ? - namely: loading .bson file into collection?
Here's two methods that you can use to accomplish this:
public static async Task WriteCollectionToFile(IMongoDatabase database, string collectionName, string fileName)
{
var collection = database.GetCollection<RawBsonDocument>(collectionName);
// Make sure the file is empty before we start writing to it
File.WriteAllText(fileName, string.Empty);
using (var cursor = await collection.FindAsync(new BsonDocument()))
{
while (await cursor.MoveNextAsync())
{
var batch = cursor.Current;
foreach (var document in batch)
{
File.AppendAllLines(fileName, new[] { document.ToString() });
}
}
}
}
public static async Task LoadCollectionFromFile(IMongoDatabase database, string collectionName, string fileName)
{
using (FileStream fs = File.Open(fileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (BufferedStream bs = new BufferedStream(fs))
using (StreamReader sr = new StreamReader(bs))
{
var collection = database.GetCollection<BsonDocument>(collectionName);
string line;
while ((line = sr.ReadLine()) != null)
{
await collection.InsertOneAsync(BsonDocument.Parse(line));
}
}
}
And here's an example of how you would use them:
// Obviously you'll need to change all these values to your environment
var connectionString = "mongodb://localhost:27017";
var database = new MongoClient(connectionString).GetDatabase("database");
var fileName = @"C:\mongo_output.txt";
var collectionName = "collection name";
// This will save all of the documents in the file you specified
WriteCollectionToFile(database, collectionName, fileName).Wait();
// This will drop all of the documents in the collection
database.GetCollection<BsonDocument>(collectionName).DeleteManyAsync(new BsonDocument()).Wait();
// This will restore all the documents from the file you specified
LoadCollectionFromFile(database, collectionName, fileName).Wait();
Note that this code was written using version 2.0 of the MongoDB C# driver, which you can obtain via Nuget. Also, the file reading code in the LoadCollectionFromFile method was obtained from this answer.
You can use the C# BinaryFormatter to serialize an object graph to disk, and you can deserialize it back to an object graph later.
Serialize:
https://msdn.microsoft.com/en-us/library/c5sbs8z9%28v=VS.110%29.aspx
Deserialize:
https://msdn.microsoft.com/en-us/library/b85344hz%28v=vs.110%29.aspx
However, that is not a MongoDB or C# driver feature.
After serializing you can use the driver to drop the collection. And after deserializing you can use the driver to insert objects into a new collection.
Based on your rules, you may want to do some locking on that collection at the time you are doing the export process before you drop it.
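For reference, dropping and repopulating the collection with the 2.x driver looks roughly like this; the connection string, names, and restoredDocuments are placeholders for your own values.
var client = new MongoClient("mongodb://localhost:27017");
var database = client.GetDatabase("database");
// Drop the collection once its contents have been persisted elsewhere.
await database.DropCollectionAsync("collection name");
// Later, insert the deserialized documents into a fresh collection.
IEnumerable<BsonDocument> restoredDocuments = new List<BsonDocument>(); // produced by your deserialization step
var collection = database.GetCollection<BsonDocument>("collection name");
await collection.InsertManyAsync(restoredDocuments);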

How to save a human readable file

Currently I have an application that reads and writes several properties from one or two basic classes to a .txt file using the binary serializer.
I've opened the .txt file in Notepad and, as it's formatted for the application, it's not very readable to the human eye - not for me anyway =D
I've heard of using XML, but pretty much most of my searches seem to overcomplicate things.
The kind of data I'm trying to save is simply a collection of "Person.cs" classes, nothing more than a name and address, all private strings but with properties, and marked as Serializable.
What would be the best way to actually save my data in a way that can be easily read by a person? It would also make it easier to make small changes to the application's data directly in the file instead of having to load it, change it and save it.
Edit:
I have added the current way I am saving and loading my data; _userCollection is what it suggests, and nUser/nMember are integers.
#region I/O Operations
public bool SaveData()
{
try
{
//Open the stream using the Data.txt file
using (Stream stream = File.Open("Data.txt", FileMode.Create))
{
//Create a new formatter
BinaryFormatter bin = new BinaryFormatter();
//Copy data in collection to the file specified earlier
bin.Serialize(stream, _userCollection);
bin.Serialize(stream, nMember);
bin.Serialize(stream, nUser);
//Close stream to release any resources used
stream.Close();
}
return true;
}
catch (IOException ex)
{
throw new ArgumentException(ex.ToString());
}
}
public bool LoadData()
{
//Check if file exsists, otherwise skip
if (File.Exists("Data.txt"))
{
try
{
using (Stream stream = File.Open("Data.txt", FileMode.Open))
{
BinaryFormatter bin = new BinaryFormatter();
//Copy data back into collection fields
_userCollection = (List<User>)bin.Deserialize(stream);
nMember = (int)bin.Deserialize(stream);
nUser = (int)bin.Deserialize(stream);
stream.Close();
//Sort data to ensure it is ordered correctly after being loaded
_userCollection.Sort();
return true;
}
}
catch (IOException ex)
{
throw new ArgumentException(ex.ToString());
}
}
else
{
//Console.WriteLine present for testing purposes
Console.WriteLine("\nLoad failed, Data.txt not found");
return false;
}
}
Replace your BinaryFormatter with XmlSerializer and run almost exactly the same code.
The only change you need to make is that BinaryFormatter takes an empty constructor, while for XmlSerializer you need to declare the type in the constructor:
XmlSerializer serializer = new XmlSerializer(typeof(Person));
Using XmlSerializer is not really complicated. Have a look at this MSDN page for an example: http://msdn.microsoft.com/en-us/library/system.xml.serialization.xmlserializer.aspx
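Applied to the SaveData/LoadData methods from the question, that could look roughly like the following. XmlSerializer requires public types with parameterless constructors and public properties, and it serializes one root object per document, so the collection and counters are wrapped in a small container class here (SaveState is just an illustrative name).
using System.Xml.Serialization;
// A small container so the collection and counters serialize as one XML document.
public class SaveState
{
    public List<User> Users { get; set; }
    public int Member { get; set; }
    public int UserCount { get; set; }
}
public bool SaveData()
{
    var state = new SaveState { Users = _userCollection, Member = nMember, UserCount = nUser };
    using (Stream stream = File.Open("Data.xml", FileMode.Create))
    {
        var serializer = new XmlSerializer(typeof(SaveState));
        serializer.Serialize(stream, state);
    }
    return true;
}
public bool LoadData()
{
    if (!File.Exists("Data.xml")) return false;
    using (Stream stream = File.Open("Data.xml", FileMode.Open))
    {
        var serializer = new XmlSerializer(typeof(SaveState));
        var state = (SaveState)serializer.Deserialize(stream);
        _userCollection = state.Users;
        nMember = state.Member;
        nUser = state.UserCount;
    }
    _userCollection.Sort();
    return true;
}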
You could implement your own PersonsWriter, that takes a StreamWriter as constructor argument and has a Write method that takes an IList<Person> as input to parse out a nice text representation.
For example:
public class PersonsWriter : IDisposable
{
private StreamWriter _wr;
public PersonsWriter(StreamWriter writer)
{
this._wr = writer;
}
public void Write(IList<Person> people) {
foreach(Person dude in people)
{
_wr.Write("{0} {1}\n{2}\n{3} {4}\n\n",
dude.FirstName,
dude.LastName,
dude.StreetAddress,
dude.ZipCode,
dude.City);
}
}
public void Dispose()
{
_wr.Flush();
_wr.Dispose();
}
}
YAML is another option for human-readable markup that is also easy to parse. There are libraries available for C# as well as almost all other popular languages. Here's a sample of what YAML looks like:
invoice: 34843
date   : 2001-01-23
bill-to: &id001
    given  : Chris
    family : Dumars
    address:
        lines: |
            458 Walkman Dr.
            Suite #292
        city   : Royal Oak
        state  : MI
        postal : 48046
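If you go the YAML route, YamlDotNet is a commonly used C# library for it. A minimal sketch of round-tripping the collection (the file name is illustrative, and the builder API may vary slightly between versions):
using YamlDotNet.Serialization;
// Serialize the collection to human-readable YAML.
var serializer = new SerializerBuilder().Build();
File.WriteAllText("Data.yaml", serializer.Serialize(_userCollection));
// ...and read it back.
var deserializer = new DeserializerBuilder().Build();
var people = deserializer.Deserialize<List<User>>(File.ReadAllText("Data.yaml"));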
Frankly, as a human, I don't find XML to be all that readable. In fact, it's not really designed to be read by humans.
If you want a human readable format, then you have to build it.
Say you have a Person class that has a First Name, a Last Name and an SSN as properties. Create your file and have it write out 3 lines, with a description of the field in the first fifty characters (random number from my head) and then have the value start at character 51.
This will produce a file that looks like:
First Name-------Stephen
Last Name -------Wrighton
SSN -------------XXX-XX-XXXX
Then, reading it back in, your program would know where the data begins on each line, and what each line is for (the program would know that Line 3 is the SSN value).
But remember, to truly gain human readability, you sacrifice data portability.
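A rough sketch of that idea, using a Person type with FirstName/LastName/Ssn properties as a stand-in (the 17-character label width simply matches the sample above; the "first fifty" from the description works the same way):
// Write each property as a fixed-width label, with the value starting right after it.
const int LabelWidth = 17;
static void WritePerson(TextWriter writer, Person p)
{
    writer.WriteLine("First Name".PadRight(LabelWidth, '-') + p.FirstName);
    writer.WriteLine("Last Name ".PadRight(LabelWidth, '-') + p.LastName);
    writer.WriteLine("SSN ".PadRight(LabelWidth, '-') + p.Ssn);
}
// Read it back by skipping the label portion of each line.
static Person ReadPerson(TextReader reader)
{
    return new Person
    {
        FirstName = reader.ReadLine().Substring(LabelWidth),
        LastName = reader.ReadLine().Substring(LabelWidth),
        Ssn = reader.ReadLine().Substring(LabelWidth),
    };
}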
Try the DataContractSerializer
It serializes objects to XML and is very easy to use
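For example, a minimal sketch for a collection of Person objects (Person is assumed to be a public type; you can optionally decorate it with [DataContract]/[DataMember] attributes to control the XML):
// Requires: using System.Runtime.Serialization; using System.Xml;
var people = new List<Person>(); // your data
// Write the collection as XML.
var serializer = new DataContractSerializer(typeof(List<Person>));
using (var writer = XmlWriter.Create("Data.xml", new XmlWriterSettings { Indent = true }))
{
    serializer.WriteObject(writer, people);
}
// Read it back.
using (var reader = XmlReader.Create("Data.xml"))
{
    people = (List<Person>)serializer.ReadObject(reader);
}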
Write a CSV reader/writer if you want a good compromise between human and machine readability in a Windows environment.
Loads into Excel too.
There's a discussion about it here:
http://knab.ws/blog/index.php?/archives/3-CSV-file-parser-and-writer-in-C-Part-1.html
EDIT
That is a C# article... it just confusingly has "C" in the URL.
I really think you should go with XML (look into DataContractSerializer). It's not that complicated. You could probably even just replace BinaryFormatter with XmlSerializer and go.
If you still don't want to do that, though, you can write a delimited text file. Then you'll have to write your own reader method (although, it could almost just use the split method).
//Inside the Person class:
public override string ToString()
{
List<String> propValues = new List<String>();
// Get the type.
Type t = this.GetType();
// Cycle through the properties.
foreach (PropertyInfo p in t.GetProperties())
{
propValues.Add(String.Format("{0}:={1}", p.Name, p.GetValue(this, null)));
}
return String.Join(",", propValues.ToArray());
}
using (System.IO.TextWriter tw = new System.IO.StreamWriter("output.txt"))
{
tw.WriteLine(person.ToString());
}
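Reading the delimited lines back in can also lean on Split plus reflection; a rough sketch, assuming the Name:=Value pairs produced by the ToString above and values that contain no commas or ":=" sequences:
// Requires: using System; using System.Reflection;
// Parse one "Prop1:=val1,Prop2:=val2" line back into an object.
static T ParseLine<T>(string line) where T : new()
{
    var item = new T();
    var type = typeof(T);
    foreach (string pair in line.Split(','))
    {
        var parts = pair.Split(new[] { ":=" }, StringSplitOptions.None);
        var prop = type.GetProperty(parts[0]);
        if (prop != null && prop.CanWrite)
        {
            prop.SetValue(item, Convert.ChangeType(parts[1], prop.PropertyType), null);
        }
    }
    return item;
}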
