ParquetWriter not sending all information to blob storage - c#

public async Task UploadParquetFromObjects<T>(string fileName, T objects)
{
    // Serialize the objects to JSON, then re-read them as dynamic records
    var stringJson = JArray.FromObject(objects).ToString();
    var parsedJson = ChoJSONReader.LoadText(stringJson);

    // Open a writable stream on the target block blob and write Parquet into it
    var desBlob = blobClient.GetBlockBlobClient(fileName);
    using (var outStream = await desBlob.OpenWriteAsync(true).ConfigureAwait(false))
    using (var parser = new ChoParquetWriter(outStream))
    {
        parser.Write(parsedJson);
    }
}
I'm using this code to send some data to a file in Azure Blob Storage. At first it seemed to work fine: it created the file, put some information in it, and the file was readable. But on closer investigation, it only writes a fraction of the data I send. For example, I send a list of 15 items and it only writes 3. I tried different datasets, with different sizes and composed of different objects; the number of records written varies, but it never reaches 100%.
Am I doing something wrong?

This issue is being tracked and addressed in the GitHub issues section:
https://github.com/Cinchoo/ChoETL/issues/230
The issue was that the input JSON had inconsistent members, so missing datetime members were set to null by the JSON reader, and the Parquet writer couldn't handle such null datetime values. A fix has been applied.
Sample fiddle: https://dotnetfiddle.net/PwxNWX
Packages used:
ChoETL.JSON.Core v1.2.1.49 (beta2)
ChoETL.Parquet v1.0.1.23 (beta6)
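As an illustration only (the member name CreatedOn and the default value below are hypothetical, not taken from the report): inconsistent input such as the following leaves a null datetime in some records, and one hedged workaround before the library fix was to normalize the JSON before handing it to the writer.

// Hypothetical input: the second record has no "CreatedOn" member,
// so the JSON reader surfaces it as a null datetime value.
var stringJson = @"[
    { ""Name"": ""A"", ""CreatedOn"": ""2021-01-01T00:00:00"" },
    { ""Name"": ""B"" }
]";

// Possible workaround: give missing datetime members a sentinel value
// so every record has a consistent shape before writing Parquet.
var array = JArray.Parse(stringJson);
foreach (JObject record in array)
{
    if (record["CreatedOn"] == null)
        record["CreatedOn"] = DateTime.MinValue; // hypothetical default
}
var parsedJson = ChoJSONReader.LoadText(array.ToString());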


Azure Form Recognizer only analyzes the first file in a stream

I am testing some AI document analysis features and am currently trying to let users upload files to a web app, which in turn sends them to Azure Form Recognizer and processes the results.
However, I am not able to do so in a single request.
This is how the Files are represented:
[BindProperty] public List<IFormFile> Upload { get; set; }
I can iterate over these and get the expected results, but this makes the operation take quite long. I would like to send all of the files in one request (as shown below), but it only ever analyzes the first one. I am using Azure.AI.FormRecognizer.DocumentAnalysis, so the client and the StartAnalyzeDocument method come from there.
using (var stream = new MemoryStream())
{
    // Copy every uploaded file into the same MemoryStream
    foreach (IFormFile formFile in Upload)
    {
        formFile.CopyTo(stream);
    }
    stream.Seek(0, SeekOrigin.Begin);

    AnalyzeDocumentOperation operation = client.StartAnalyzeDocument(modelId, stream);
    operation.WaitForCompletion();
    Console.WriteLine("This many documents were analysed: " + operation.Value.Documents.Count);
    result = operation.Value;
}
"result" is what I process later on. I am quite stumped on this, as I would have expected the appended stream to just work. If anyone has a solution or could point me in the right direction, it would be much appreciated.
Form Recognizer does not yet support processing multiple documents in a single analyze operation for prebuilt-invoice and custom models. Furthermore, most file formats cannot just be appended together to concatenate the content.
One way to speed up the analysis of multiple files in a batch is to call the analyze operation in parallel. Here is a sketch.
var results = Upload
    .AsParallel()
    .Select(formFile =>
    {
        using (var stream = formFile.OpenReadStream())
        {
            // Analyze each file in its own request; PLINQ runs these in parallel
            var operation = client.StartAnalyzeDocument(modelId, stream);
            operation.WaitForCompletion();
            return operation.Value;
        }
    })
    .ToArray();

Dispose IRandomAccessStream after DataPackage.SetData or DataPackage.GetDataAsync?

Consider putting data onto a windows clipboard DataPackage using SetData and later retrieving it using GetDataAsync, like this:
IEnumerable<T> objects = ...;
var randomAccessStream = new InMemoryRandomAccessStream();
using (XmlDictionaryWriter xmlWriter = XmlDictionaryWriter.CreateTextWriter(
           randomAccessStream.AsStreamForWrite(), Encoding.Unicode))
{
    var serializer = new DataContractSerializer(typeof(T), knownTypes);
    foreach (T obj in objects)
    {
        serializer.WriteObject(xmlWriter, obj);
    }
}
dataPackage.SetData(formatId, randomAccessStream);
Then later on (e.g. in Clipboard.ContentsChanged),
randomAccessStream = await dataPackageView.GetDataAsync(formatId) as IRandomAccessStream;
xmlReader = XmlDictionaryReader.CreateTextReader(
    randomAccessStream.AsStreamForRead(), Encoding.Unicode,
    XmlDictionaryReaderQuotas.Max, (OnXmlDictionaryReaderClose?)null);
var serializer = new DataContractSerializer(typeof(T), knownTypes);
while (serializer.IsStartObject(xmlReader))
{
    object? obj = serializer.ReadObject(xmlReader);
    ...
}
xmlReader.Dispose(); // in the real code, this is in a finally clause
The question I have is: when do I dispose the randomAccessStream? I've done some searching, and all the examples I've seen using SetData and GetDataAsync do absolutely nothing about disposing the object that is put into or obtained from the data package.
Should I dispose it after the SetData, after the GetDataAsync, in DataPackage.OperationCompleted, in some combination of these, or none of them?
sjb
P.S. If I can squeeze in a second question here: when I put a reference into a DataPackage using, for example, dataPackage.Properties.Add("IEnumerable<T>", entities), does it create a security risk? Can other apps access the reference and use it?
tldr
The Clipboard is designed to pass content between applications and can only pass string content or references to files; all other content must be serialized to a string, saved to a file, or made to behave like a file, to be accessed across application domains via the clipboard.
There is support and guidance for passing custom data and formats via the clipboard; ultimately this involves deciding explicitly how to prepare the content on the provider side and how to interpret it on the consumer side. If you can use simple serialization for this, then KISS.
IEnumerable<Test> objectsIn = new Test[] { new Test { Name = "One" }, new Test { Name = "two" } };
var dataPackage = new DataPackage();
dataPackage.SetData("MyCustomFormat", Newtonsoft.Json.JsonConvert.SerializeObject(objectsIn));
Clipboard.SetContent(dataPackage);
...
var dataPackageView = Clipboard.GetContent();
string contentJson = (await dataPackageView.GetDataAsync("MyCustomFormat")) as string;
IEnumerable<Test> objectsOut = Newtonsoft.Json.JsonConvert.DeserializeObject<IEnumerable<Test>>(contentJson);
In WinRT the DataPackageView implementation does support passing streams; however, the normal rules apply for the stream in terms of lifecycle, including whether it has been disposed. This is useful for transferring large content or when the consumer might request the content in different formats.
If you do not have an advanced need for it, or you are not transmitting file or image based resources, then you do not need to use a stream to transfer your data.
DataPackageView - Remarks
During a share operation, the source app puts the data being shared in a DataPackage object and sends that object to the target app for processing. The DataPackage class includes a number of methods to support the following default formats: text, Rtf, Html, Bitmap, and StorageItems. It also has methods to support custom data formats. To use these formats, both the source app and target app must already be aware that the custom format exists.
The OP's attempt to save a stream to the Clipboard is in this case an example of saving an arbitrary or custom object to the clipboard; it is neither a string nor a pointer to a file, so the OS does not have a native way to handle this information.
Historically, putting string data or a file reference onto the clipboard effectively broadcasts this information to ALL applications on the same running OS; Windows 10 extends this by allowing your clipboard content to be synchronised across devices as well. The DataTransfer namespace implementation allows you to control the scope of this availability, but ultimately this feature is designed to let you push data outside of your current application's sandboxed domain.
So whether you choose to serialize the content yourself or you want the DataTransfer implementation to try to do it for you, the content will be serialized if it is not already in a string or file-reference format, and that serialized content, if serialization succeeds, is what will be made available to consumers.
In this way there is no memory leak or security issue where you might inadvertently provide external processes access to your current process memory or execution context, but data security is still a concern, so don't use the clipboard to pass sensitive content.
A simpler example for Arbitrary or Custom data
The OP's example puts an IEnumerable<T> collection of objects onto the clipboard and retrieves it later. The OP chose XML serialization via the DataContractSerializer; however, what was saved to the clipboard was a reference to the stream used by the serializer, not the actual content.
There is a lot of plumbing and first-principles logic going on here for little benefit. Streams are useful if you are actually going to stream the content, i.e. allow the consumer to control the stream. If you are going to write to the stream in a single synchronous pass, it is better to close off the stream altogether and pass around the buffer you filled through it, rather than trying to reuse the same stream at a later point in time.
The following solution works for Clipboard access in WinRT to pre-serialize a collection of objects and pass them to a consumer:
IEnumerable<Test> objectsIn = new Test[] { new Test { Name = "One" }, new Test { Name = "two" } };
var dataPackage = new DataPackage();
string formatId = "MyCustomFormat";
var serial = Newtonsoft.Json.JsonConvert.SerializeObject(objectsIn);
dataPackage.SetData(formatId, serial);
Clipboard.SetContent(dataPackage);
Then in perhaps an entirely different application:
string formatId = "MyCustomFormat";
var dataPackageView = Clipboard.GetContent();
object content = await dataPackageView.GetDataAsync(formatId);
string contentString = content as string;
var objectsOut = Newtonsoft.Json.JsonConvert.DeserializeObject<IEnumerable<Test>>(contentString);
foreach (var o in objectsOut)
{
    Console.WriteLine(o);
}
The definition of Test, in both the provider and the consumer application contexts:
public class Test
{
    public string Name { get; set; }
}
when do I dispose the randomAccessStream?
Only dispose the stream when you have finished using it. Once you have disposed it, the stream is no longer usable in any other context, even if you have stored or passed multiple references to it in other object instances.
If you are talking about the original stream referenced in the SetData() logic, then look at this from the other angle: if you dispose too early, the consuming code will no longer have access to the stream and will fail.
As a general rule, we should try to design the logic so that at any given point in time there is a single, clear owner for any given stream; that way it is clear who has responsibility for disposing it. This response to a slightly different scenario explains it well: https://stackoverflow.com/a/8791525/1690217. As a general pattern, only the scope that created the stream should be responsible for disposing it.
The one exception is when you need to access the stream outside of the creating method; in that scenario the parent class should hold a reference to it, implement IDisposable, and make sure it cleans up any resources that might be hanging around, as in the sketch below.
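A minimal sketch of that ownership pattern, assuming a hypothetical parent class that creates the stream, hands it out to collaborators while it is alive, and disposes it exactly once:

// Hypothetical owner: creates the stream, keeps the only long-lived reference,
// and is the single place where Dispose is called.
public sealed class ClipboardPayloadOwner : IDisposable
{
    private readonly InMemoryRandomAccessStream _stream = new InMemoryRandomAccessStream();

    // Other methods in this class may use the stream freely while the owner is alive.
    public IRandomAccessStream Stream => _stream;

    public void Dispose()
    {
        // Clean up the stream (and any other held resources) exactly once.
        _stream.Dispose();
    }
}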
The reason that you don't see this in documentation is often that the nuances around the timing for calling Dispose() are out of scope or will get lost in examples that are contrived for other purposes.
Specifically, for examples where streams are passed via any mechanism and used later, as with DataPackage, it is too hard to show all of the orchestration code covering the time between storing the stream with DataPackage.SetData(...) and later accessing it via DataPackage.GetDataAsync(...).
Also consider the most common scenario for DataPackage, where the consumer is not only in a different logical scope but most likely in an entirely different application domain; to include all the code covering when or whether to call Dispose would mean encompassing the entire code base of two different applications.

Caching posted data and fall-backs

I'm currently working on a project that has an external site posting XML data to a specified URL on our site. My initial thought was to first save the XML data to a physical file on our server as a backup. I then insert the data into the cache, and from then on all requests for the data are made to the cache instead of the physical file.
At the moment I have the following:
[HttpPost]
public void MyHandler()
{
    // filePath = path to my xml file
    // Delete the previous file
    if (File.Exists(filePath))
        File.Delete(filePath);

    // Save the posted XML to disk as a backup
    using (Stream output = File.OpenWrite(filePath))
    using (Stream input = request.InputStream)
    {
        input.CopyTo(output);
    }

    // Deserialize and save the data to the cache
    var xml = new XmlTextReader(filePath);
    var serializer = new XmlSerializer(typeof(MyClass));
    var myClass = (MyClass)serializer.Deserialize(xml);
    HttpContext.Current.Cache.Insert(myKey,
        myClass,
        null,
        myTimespan,
        Cache.NoSlidingExpiration,
        CacheItemPriority.Default,
        null);
}
The issue I have is that I always get exceptions because the file I'm saving to "is in use" when I try a second post to update the data.
A colleague suggested using a Mutex just before I left work on Friday, so I wonder if that is the correct approach here.
Basically I'm just trying to sanity-check that this is a good way of managing the data. I can see there's clearly an issue with how I'm writing the data to a file, but aside from that, does my approach make sense?
Thanks
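One hedged guess at the "file in use" error, offered as a sketch rather than a definitive answer: in the code above, the XmlTextReader that deserializes the backup file is never disposed, so its file handle can still be open when the next POST tries to delete and rewrite the file. Wrapping the reader in a using block releases the handle:

// Sketch only: dispose the reader so the file handle is released
// before the next POST tries to delete/overwrite the file.
MyClass myClass;
using (var xml = new XmlTextReader(filePath))
{
    var serializer = new XmlSerializer(typeof(MyClass));
    myClass = (MyClass)serializer.Deserialize(xml);
}
HttpContext.Current.Cache.Insert(myKey,
    myClass,
    null,
    myTimespan,
    Cache.NoSlidingExpiration,
    CacheItemPriority.Default,
    null);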

Overriding WebHostBufferPolicySelector for Non-Buffered File Upload

In an attempt to create a non-buffered file upload I have extended System.Web.Http.WebHost.WebHostBufferPolicySelector, overriding the UseBufferedInputStream() function as described in this article: http://www.strathweb.com/2012/09/dealing-with-large-files-in-asp-net-web-api/. When a file is POSTed to my controller, I can see in trace output that the overridden UseBufferedInputStream() is definitely returning FALSE as expected. However, using diagnostic tools I can see the memory growing as the file is being uploaded.
The heavy memory usage appears to be occurring in my custom MediaTypeFormatter (something like the FileMediaFormatter here: http://lonetechie.com/). It is in this formatter that I would like to incrementally write the incoming file to disk, but I also need to parse JSON and do some other operations with the Content-Type: multipart/form-data upload. Therefore I'm using the HttpContent method ReadAsMultipartAsync(), which appears to be the source of the memory growth. I have placed trace output before and after the "await", and it appears that while the task is blocking, the memory usage increases fairly rapidly.
Once I find the file content in the parts returned by ReadAsMultipartAsync(), I am using Stream.CopyTo() to write the file contents to disk. This writes to disk as expected, but unfortunately the source file is already in memory by this point.
Does anyone have any thoughts about what might be going wrong? It seems that ReadAsMultipartAsync() is buffering the whole post data; if that is true, why do we need var fileStream = await fileContent.ReadAsStreamAsync() to get the file contents? Is there another way to accomplish the splitting of the parts without reading them into memory? The code in my MediaTypeFormatter looks something like this:
// save the stream so we can seek/read again later
Stream stream = await content.ReadAsStreamAsync();
var parts = await content.ReadAsMultipartAsync(); // <- memory usage grows rapidly
if (!content.IsMimeMultipartContent())
{
    throw new HttpResponseException(HttpStatusCode.UnsupportedMediaType);
}

//
// pull data out of parts.Contents, process json, etc.
//

// find the file data in the multipart contents
var fileContent = parts.Contents.FirstOrDefault(
    x => x.Headers.ContentDisposition.DispositionType.ToLower().Trim() == "form-data" &&
         x.Headers.ContentDisposition.Name.ToLower().Trim() == "\"" + DATA_CONTENT_DISPOSITION_NAME_FILE_CONTENTS + "\"");

// write the file to disk
using (var fileStream = await fileContent.ReadAsStreamAsync())
{
    using (FileStream toDisk = File.OpenWrite("myUploadedFile.bin"))
    {
        ((Stream)fileStream).CopyTo(toDisk);
    }
}
WebHostBufferPolicySelector only specifies if the underlying request is bufferless. This is what Web API will do under the hood:
IHostBufferPolicySelector policySelector = _bufferPolicySelector.Value;
bool isInputBuffered = policySelector == null ? true : policySelector.UseBufferedInputStream(httpContextBase);
Stream inputStream = isInputBuffered
    ? requestBase.InputStream
    : httpContextBase.ApplicationInstance.Request.GetBufferlessInputStream();
So if your implementation returns false, then the request is bufferless.
However, ReadAsMultipartAsync() loads everything into a MemoryStream, because if you don't specify a provider it defaults to MultipartMemoryStreamProvider.
To get the files saved to disk automatically as each part is processed, use MultipartFormDataStreamProvider (if you deal with files and form data) or MultipartFileStreamProvider (if you deal with just files).
There is an example on asp.net or here. In those examples everything happens in controllers, but there is no reason why you couldn't use the same approach in, for example, a formatter; see the sketch below.
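A rough sketch of the provider-based approach, shown here inside an ApiController action for brevity; the App_Data root, the action shape, and the "json" field name are assumptions, not the OP's code:

// Sketch: stream each part directly to disk instead of into memory.
public async Task<HttpResponseMessage> PostFile()
{
    if (!Request.Content.IsMimeMultipartContent())
        throw new HttpResponseException(HttpStatusCode.UnsupportedMediaType);

    string root = HttpContext.Current.Server.MapPath("~/App_Data");
    var provider = new MultipartFormDataStreamProvider(root);

    // Parts are written to temp files under 'root' as they are read.
    await Request.Content.ReadAsMultipartAsync(provider);

    foreach (MultipartFileData file in provider.FileData)
    {
        // file.LocalFileName is where the uploaded content landed on disk.
        Trace.WriteLine(file.Headers.ContentDisposition.FileName + " -> " + file.LocalFileName);
    }

    // Regular form fields are available as strings.
    string json = provider.FormData["json"]; // hypothetical field name

    return Request.CreateResponse(HttpStatusCode.OK);
}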
Another option, if you really want to play with streams, is to implement a custom class inheriting from MultipartStreamProvider that fires whatever processing you want as soon as it grabs part of the stream. The usage is similar to the aforementioned providers: you'd pass it to the ReadAsMultipartAsync(provider) method, roughly as sketched below.
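A bare-bones sketch of the custom-provider route, assuming the goal is simply to decide per part where the bytes go as they arrive (the file naming below is made up, not a library convention):

// Sketch: a custom provider that sends file parts straight to disk
// and keeps everything else in memory. Not production-hardened.
public class StreamingMultipartProvider : MultipartStreamProvider
{
    private readonly string _rootPath;

    public StreamingMultipartProvider(string rootPath)
    {
        _rootPath = rootPath;
    }

    public override Stream GetStream(HttpContent parent, HttpContentHeaders headers)
    {
        // Parts that carry a file name are written to disk as they are read.
        if (headers.ContentDisposition != null &&
            !string.IsNullOrEmpty(headers.ContentDisposition.FileName))
        {
            string path = Path.Combine(_rootPath, Guid.NewGuid().ToString("N") + ".bin");
            return File.Create(path);
        }

        // Form fields stay in memory.
        return new MemoryStream();
    }
}

// Usage: await content.ReadAsMultipartAsync(new StreamingMultipartProvider(rootPath));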
Finally - if you are feeling suicidal - since the underlying request stream is bufferless theoretically you could use something like this in your controller or formatter:
Stream stream = HttpContext.Current.Request.GetBufferlessInputStream();
byte[] b = new byte[32 * 1024];
int n;
while ((n = stream.Read(b, 0, b.Length)) > 0)
{
    // do stuff with this chunk of the stream
}
But of course that's, for lack of a better word, "ghetto."

write to specific position in .json file + serialize size limit issue C#

I have a method that retrieves data from a JSON-serialized string and writes it to a .json file using:
using (TextWriter writer = new StreamWriter("~/example.json"))
{
    writer.Write("{\"Names\":" + new JavaScriptSerializer().Serialize(jsonData) + "}");
}
Data (sample):
{"People":{"Quantity":"4"}, "info" :
[{"Name":"John","Age":"22"}, {"Name":"Jack","Age":"56"}, {"Name":"John","Age":"82"},{"Name":"Jack","Age":"95"}]
}
This works perfectly; however, the content of the jsonData variable is updated frequently. Instead of always deleting and creating a new example.json when the method is invoked,
is there a way to write data only to a specific location in the file? In the above example, say to the info section, by appending another {"Name":"x","Age":"y"}?
My reasoning for this is that I ran into an issue when trying to serialize a large amount of data in C#: I got a "The length of the string exceeds the value set on the maxJsonLength property" error. I tried to increase the maximum allowed size in web.config using a few methods suggested in this forum, but they never worked. As the file gets larger, I feel I may run into the same issue again. Any other alternatives are always welcome. Thanks in advance.
I am not aware of a JSON serializer that works with chunks of JSON only. You may try using Json.NET which should work with larger data:
var data = JsonConvert.SerializeObject(new { Names = jsonData });
File.WriteAllText("example.json", data);
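If the file keeps growing, a further option along the same lines is to let Json.NET serialize straight to the file stream instead of building one large string in memory first; a minimal sketch, assuming jsonData is the same object as above:

// Sketch: stream the JSON to disk so no single huge string is built in memory.
using (var fileStream = File.Create("example.json"))
using (var streamWriter = new StreamWriter(fileStream))
using (var jsonWriter = new Newtonsoft.Json.JsonTextWriter(streamWriter))
{
    var serializer = new Newtonsoft.Json.JsonSerializer();
    serializer.Serialize(jsonWriter, new { Names = jsonData });
}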
