Azure Form Recognizer only analyzes the first file in a stream - C#

I am testing some AI document analysis features and am currently trying to let users upload files to a web app, which in turn sends them to Azure Form Recognizer and processes the results.
However, I am not able to do so in a single request.
This is how the files are represented:
[BindProperty] public List<IFormFile> Upload { get; set; }
I can iterate over these and get the expected results, but this makes the operation take quite a long time. I would like to send all of the files in one request (as shown below), but it only ever analyzes the first one. I am using Azure.AI.FormRecognizer.DocumentAnalysis, so the client and the StartAnalyzeDocument method come from there.
using (var stream = new MemoryStream())
{
    foreach (IFormFile formFile in Upload)
    {
        formFile.CopyTo(stream);
    }
    stream.Seek(0, SeekOrigin.Begin);

    AnalyzeDocumentOperation operation = client.StartAnalyzeDocument(modelId, stream);
    operation.WaitForCompletion();

    Console.WriteLine("This many documents were analysed: " + operation.Value.Documents.Count);
    result = operation.Value;
}
"result" is what I process later on. I am quite stumped on this, as I would have expected the appended stream to just work. If anyone has a solution or could point me in the right direction, it would be much appreciated.

Form Recognizer does not yet support processing multiple documents in a single analyze operation for prebuilt-invoice and custom models. Furthermore, most file formats cannot just be appended together to concatenate the content.
One way to speed up the analysis of multiple files in a batch is to call the analyze operation in parallel. Here is a sketch.
var results = Upload
    .AsParallel()
    .Select(formFile =>
    {
        using (var stream = formFile.OpenReadStream())
        {
            // One analyze call per file; PLINQ runs them on multiple threads.
            AnalyzeDocumentOperation operation = client.StartAnalyzeDocument(modelId, stream);
            operation.WaitForCompletion();
            return operation.Value;
        }
    })
    .ToArray();
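Since each analyze call is I/O-bound, the same idea can also be expressed with tasks instead of PLINQ so that no thread-pool threads are blocked while the service works. This is only a sketch, assuming the async counterparts of the same SDK methods (StartAnalyzeDocumentAsync / WaitForCompletionAsync) are available in the package version you use:
// Start one analysis per file, then await them all together.
var tasks = Upload.Select(async formFile =>
{
    using (var stream = formFile.OpenReadStream())
    {
        AnalyzeDocumentOperation operation =
            await client.StartAnalyzeDocumentAsync(modelId, stream);
        await operation.WaitForCompletionAsync();
        return operation.Value;
    }
}).ToList();

AnalyzeResult[] results = await Task.WhenAll(tasks);
Either way, keep an eye on throttling: depending on your Form Recognizer tier, firing many concurrent analyze requests can lead to 429 responses, so you may need to cap the degree of parallelism.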

Related

ParquetWriter not sending all information to blob storage

public async Task UploadParquetFromObjects<T>(string fileName, T objects)
{
    var stringJson = JArray.FromObject(objects).ToString();
    var parsedJson = ChoJSONReader.LoadText(stringJson);
    var desBlob = blobClient.GetBlockBlobClient(fileName);

    using (var outStream = await desBlob.OpenWriteAsync(true).ConfigureAwait(false))
    using (ChoParquetWriter parser = new ChoParquetWriter(outStream))
    {
        parser.Write(parsedJson);
    }
}
I'm using this code to send some data to a file on Azure Blob Storage. At first it seemed to work fine: it created the file, put some data in it, and it was readable. On closer inspection, though, it only writes a fraction of the data I send. For example, I send a list of 15 items and it only writes 3. I tried different datasets of different sizes composed of different objects; the number of records written varies, but it never reaches 100%.
Am I doing something wrong?
This issue is being tracked and addressed in the GitHub issues section:
https://github.com/Cinchoo/ChoETL/issues/230
The issue was that the input JSON has inconsistent members, so missing datetime members are set to null by the JSON reader, and the Parquet writer couldn't handle such null datetime values. A fix has been applied.
Sample fiddle: https://dotnetfiddle.net/PwxNWX
Packages used:
ChoETL.JSON.Core v1.2.1.49 (beta2)
ChoETL.Parquet v1.0.1.23 (beta6)

Attached images in Folder or Database?

I'm currently working on a .NET Core 3.1 website project and I am a little stuck on how to handle images. As I could not find a proper answer for my case, here it is.
I'm working on a reports system where the user should be allowed to create a report and attach images if necessary. My question is: should I store the images in a database or in a folder? The images will not contain "national security threats", but I guess they could be of a private nature.
Is it good practice to store them in a database?
I find the procedure to store them a bit messy:
public async Task<IActionResult> Create(IFormFile image)
{
    if (ModelState.IsValid)
    {
        byte[] p1 = null; // As I understand, it should be stored as byte[]
        using (var fs1 = image.OpenReadStream())
        using (var ms1 = new MemoryStream())
        {
            fs1.CopyTo(ms1);
            p1 = ms1.ToArray();
        }

        Image img = new Image(); // This is my Image model
        img.Img = p1;            // The .Img property is of type varbinary in the DB.

        _imagesDB.Images.Add(img); // My context
        await _imagesDB.SaveChangesAsync();
        return RedirectToAction(nameof(Index)); // If everything went well, go back to Index.
    }
    return View(report);
}
This is more or less OK (I guess), but I was not able to read it back from the database and send it to the view to display it.
Any ideas on how to read the images back from my context and, especially, how to send them from the controller to the view?
Thanks in advance.
Alvaro.
There are pros and cons of both methods of storing files. It's convenient to have your files where your data is - however it takes a toll on the database side.
Text (the file path) in the database is only a few thousand bytes max (varchar data type, not the text data type in SQL), while a file can be enormous.
Imagine you wanted to query 1,000,000 users (hypothetically) - you would also be querying 1,000,000 files. That's an enormous amount of data. Storing text (the file path) is minimal, and a query could retrieve 1,000,000 rows of text rather quickly.
This can slow down your web app by causing longer load times due to your queries. I've had this issue personally and had to actually make a lazy load workaround to speed up the app.
Also, you have to consider the backup/restore process for your database. The larger the database, the longer your backup/restore times will take - and databases only grow. I heard a story about a company that backed up their database nightly, but the backup took longer than a day because of the files in their database; the previous backup wasn't even finished when the next one started.
There are other factors to consider but those few alone are significant considerations.
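If you decide to keep the images on disk and store only their paths, the Create action could look roughly like the sketch below. This assumes an injected IWebHostEnvironment (_env), an "uploads" folder under wwwroot, and a string FilePath property on the Image entity - all of which are assumptions, not part of the original code.
public async Task<IActionResult> Create(IFormFile image)
{
    if (ModelState.IsValid && image != null)
    {
        // Give the file a unique name so uploads can't collide or overwrite each other.
        var fileName = $"{Guid.NewGuid()}{Path.GetExtension(image.FileName)}";
        var folder = Path.Combine(_env.WebRootPath, "uploads"); // _env: injected IWebHostEnvironment (assumed)
        Directory.CreateDirectory(folder);

        using (var fs = new FileStream(Path.Combine(folder, fileName), FileMode.Create))
        {
            await image.CopyToAsync(fs);
        }

        // Only the short relative path goes into the database.
        _imagesDB.Images.Add(new Image { FilePath = "/uploads/" + fileName }); // FilePath column is assumed
        await _imagesDB.SaveChangesAsync();
        return RedirectToAction(nameof(Index));
    }
    return View();
}
Keep in mind that anything under wwwroot is publicly reachable; since your images may be private, you would store them outside wwwroot and serve them through a controller action that checks authorization.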
In regards to the C# view/controller process...
Files are stored as bytes in a database (varbinary). You'll have to query the data and store them in a byte[] just like you are now and convert it to a file.
Here's a simplified snippet of one of my controllers in my .NET Core 3.1 web app.
This was only to download 1 PDF file - you will have to change it for your needs of course.
public async Task<IActionResult> Download(string docId, string docSource)
{
    // Some kind of validation...
    if (!string.IsNullOrEmpty(docId))
    {
        // These are my query parameters (I'm using Dapper)
        var p = new
        {
            docId,
            docSource // This is just a parameter for my specific query
        };

        // Query the database for the document:
        // DocumentModel doc = some kind of async query using
        // the p variable as parameters.
        // I cut this part out since your database methods may be different.
        try
        {
            // Return the file
            return File(doc.Content, "application/pdf", doc.LeafName);
        }
        catch
        {
            // You'll probably want to pass some kind of error message to your view
            return View();
        }
    }
    return View();
}
doc.Content holds the bytes and doc.LeafName is just the name of the document.
You can also pass the file back to your view by setting properties on its ViewModel/Model.
return View(new YourViewModel
{
    SomeViewModelProperty = someProp,
    Documents = documents
});
If you use a file server that's accessible to your API or web app then I believe you can retrieve the file directly from there.
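To address the original question about reading an image back and showing it in a view, here is a minimal sketch along the same lines, using the _imagesDB context and Image entity from the question (the Id key and the image/jpeg content type are assumptions):
// Streams the stored varbinary back to the browser as an image.
public async Task<IActionResult> GetImage(int id)
{
    Image img = await _imagesDB.Images.FindAsync(id);
    if (img == null)
    {
        return NotFound();
    }

    // Content type is assumed here; store it alongside the bytes if you accept several formats.
    return File(img.Img, "image/jpeg");
}
In the view you then point an img tag at that action, for example <img src="@Url.Action("GetImage", new { id = item.Id })" />.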

Write to S3 using PutObjectRequest while still generating stream

I am converting an application that currently uses the Windows file system to read and store files.
While reading each line of an input file, it modifies the data, and then writes it out to an output file:
using (var writer = new StreamWriter(@"C:\temp\out.txt", false))
using (var reader = new StreamReader(@"C:\temp\in.txt", Encoding.UTF8))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // Create modifiedLine from line data
        // ...
        writer.WriteLine(modifiedLine);
    }
}
So far, I have been able to write to S3 using a PutObjectRequest, but only with the entire file contents at once:
//Set up stream
var stream = new MemoryStream();
var writer = new StreamWriter(stream);
writer.Write(theEntireModifiedFileContents);
writer.Flush();
stream.Position = 0;
var putRequest = new PutObjectRequest()
{
BucketName = destinationBucket,
Key = destinationKey,
InputStream = stream
};
var response = await s3Client.PutObjectAsync(putRequest);
Given that these are going to be large files, I would prefer to keep the line-by-line approach rather than having to send the entire file contents at once.
Is there any way to maintain a similar behavior to the file system example above with S3?
S3 is an object store and does not support modifications in-place, appending, etc.
However, it is possible to meet your goals if certain criteria are met/understood:
1) Realize that it will take more code to do this than simply modifying your code to buffer the line output and then upload as a single object.
2) You can upload each line as part of the REST API PUT stream. This means that you will need to continuously upload data until complete. Basically you are doing a slow upload of a single S3 object while you process each line.
3) You can use the multipart API to upload each line as a single part of a multipart transfer, then combine the parts once complete. Note: the parts do not have to be the same size, but every part except the last must be at least 5 MB, and the total number of parts is limited to 10,000. Since a single line is far smaller than 5 MB, you will need to buffer anyway, so go back to method #1 or add buffering to keep the part count and part sizes within those limits.
Unless you are a really motivated developer, realize that method #1 is going to be far easier to implement and test. Methods #2 and #3 will require you to understand how S3 works at a very low level using HTTP PUT requests; a sketch of method #3 with buffering follows below.
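If you do go with method #3, a buffered multipart upload with the AWS SDK for .NET could look something like this sketch. The ModifyLine call and the method parameters are placeholders, and in production you would wrap this in a try/catch that calls AbortMultipartUploadAsync so a failed upload doesn't leave orphaned parts behind.
using Amazon.S3;
using Amazon.S3.Model;

// Buffer modified lines until ~5 MB, then send each buffer as one part of a multipart upload.
async Task UploadLineByLineAsync(IAmazonS3 s3, string bucket, string key, StreamReader reader)
{
    var init = await s3.InitiateMultipartUploadAsync(
        new InitiateMultipartUploadRequest { BucketName = bucket, Key = key });

    var partETags = new List<PartETag>();
    var buffer = new MemoryStream();
    var writer = new StreamWriter(buffer) { AutoFlush = true };
    int partNumber = 1;
    const long minPartSize = 5L * 1024 * 1024; // every part except the last must be at least 5 MB

    async Task FlushPartAsync()
    {
        buffer.Position = 0;
        var part = await s3.UploadPartAsync(new UploadPartRequest
        {
            BucketName = bucket,
            Key = key,
            UploadId = init.UploadId,
            PartNumber = partNumber,
            PartSize = buffer.Length,
            InputStream = buffer
        });
        partETags.Add(new PartETag(partNumber, part.ETag));
        partNumber++;
        buffer.SetLength(0); // reuse the buffer for the next part
    }

    string line;
    while ((line = await reader.ReadLineAsync()) != null)
    {
        writer.WriteLine(ModifyLine(line)); // placeholder for your per-line transformation
        if (buffer.Length >= minPartSize)
        {
            await FlushPartAsync();
        }
    }

    if (buffer.Length > 0)
    {
        await FlushPartAsync(); // final, possibly smaller, part
    }

    await s3.CompleteMultipartUploadAsync(new CompleteMultipartUploadRequest
    {
        BucketName = bucket,
        Key = key,
        UploadId = init.UploadId,
        PartETags = partETags
    });
}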

Out Of Memory Exception in Foreach

I am trying to create a function that retrieves all the uploaded files (which are currently saved as bytes in the database) and downloads them as a single zip file. I currently have 6000 files to download (and the number could grow).
The functionality already works (from retrieval to download) if I limit the number of files being downloaded; otherwise, I get an OutOfMemoryException in the foreach loop.
Here's some pseudocode (the files variable is a list of byte arrays and file names):
var files = getAllFilesFromDb();
foreach (var file in files)
{
    var tempFilePath = Path.Combine(path, file.filename);
    using (FileStream stream = new FileStream(tempFilePath, FileMode.Create, FileAccess.ReadWrite))
    {
        stream.Write(file.fileData, 0, file.fileData.Length);
    }
}

private readonly IEntityRepository<File> fileRepository;

IEnumerable<FileModel> getAllFilesFromDb()
{
    return fileRepository.Select(f => new FileModel() { fileData = f.byteArray, filename = f.fileName });
}
My question is, is there any other way to do this to avoid getting such errors?
To avoid this problem, don't load the contents of all the files in one go. Most likely you will need to split your database access into two calls:
Retrieve a list of all the files without their contents but with some identifier - like the PK of the table.
A method which retrieves the contents of an individual file.
Then your (pseudo)code becomes
get list of all files
for each file
    get the file contents
    write the file to disk
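A rough C# version of that, assuming the repository is queryable and the File entity has an Id column (both assumptions):
// 1) First call: only ids and names come back; the byte arrays stay in the database.
var fileInfos = fileRepository
    .Select(f => new { f.Id, f.fileName })
    .ToList();

foreach (var info in fileInfos)
{
    // 2) Second call, once per file: only one file's bytes are in memory at a time.
    byte[] bytes = fileRepository
        .Where(f => f.Id == info.Id)
        .Select(f => f.byteArray)
        .Single();

    var tempFilePath = Path.Combine(path, info.fileName);
    System.IO.File.WriteAllBytes(tempFilePath, bytes); // fully qualified to avoid clashing with the File entity
}
Each iteration's byte array becomes eligible for collection before the next file is loaded, so memory use stays roughly at the size of the largest single file (plus whatever the zip you build afterwards needs).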
Another possibility is to alter the way your query currently works so that it uses deferred execution - this means it will not actually load all the files at once, but stream them one at a time from the database - but without seeing more of your repository implementation, I cannot/will not guess the right solution for you.

Overriding WebHostBufferPolicySelector for Non-Buffered File Upload

In an attempt to create a non-buffered file upload I have extended System.Web.Http.WebHost.WebHostBufferPolicySelector, overriding function UseBufferedInputStream() as described in this article: http://www.strathweb.com/2012/09/dealing-with-large-files-in-asp-net-web-api/. When a file is POSTed to my controller, I can see in trace output that the overridden function UseBufferedInputStream() is definitely returning FALSE as expected. However, using diagnostic tools I can see the memory growing as the file is being uploaded.
The heavy memory usage appears to be occurring in my custom MediaTypeFormatter (something like the FileMediaFormatter here: http://lonetechie.com/). It is in this formatter that I would like to incrementally write the incoming file to disk, but I also need to parse the JSON and do some other operations with the Content-Type: multipart/form-data upload. Therefore I'm using the HttpContent method ReadAsMultipartAsync(), which appears to be the source of the memory growth. I have placed trace output before/after the "await", and it appears that while the task is blocking, memory usage increases fairly rapidly.
Once I find the file content in the parts returned by ReadAsMultipartAsync(), I use Stream.CopyTo() to write the file contents to disk. This writes to disk as expected, but unfortunately the source file is already in memory by that point.
Does anyone have any thoughts about what might be going wrong? It seems that ReadAsMultipartAsync() is buffering the whole post data; if that is true, why do we need var fileStream = await fileContent.ReadAsStreamAsync() to get the file contents? Is there another way to split the parts without reading them into memory? The code in my MediaTypeFormatter looks something like this:
// save the stream so we can seek/read again later
Stream stream = await content.ReadAsStreamAsync();
var parts = await content.ReadAsMultipartAsync(); // <-- memory usage grows rapidly

if (!content.IsMimeMultipartContent())
{
    throw new HttpResponseException(HttpStatusCode.UnsupportedMediaType);
}

//
// pull data out of parts.Contents, process json, etc.
//

// find the file data in the multipart contents
var fileContent = parts.Contents.FirstOrDefault(
    x => x.Headers.ContentDisposition.DispositionType.ToLower().Trim() == "form-data" &&
         x.Headers.ContentDisposition.Name.ToLower().Trim() == "\"" + DATA_CONTENT_DISPOSITION_NAME_FILE_CONTENTS + "\"");

// write the file to disk
using (var fileStream = await fileContent.ReadAsStreamAsync())
{
    using (FileStream toDisk = File.OpenWrite("myUploadedFile.bin"))
    {
        ((Stream)fileStream).CopyTo(toDisk);
    }
}
WebHostBufferPolicySelector only specifies if the underlying request is bufferless. This is what Web API will do under the hood:
IHostBufferPolicySelector policySelector = _bufferPolicySelector.Value;
bool isInputBuffered = policySelector == null ? true : policySelector.UseBufferedInputStream(httpContextBase);
Stream inputStream = isInputBuffered
    ? requestBase.InputStream
    : httpContextBase.ApplicationInstance.Request.GetBufferlessInputStream();
So if your implementation returns false, then the request is bufferless.
However, ReadAsMultipartAsync() loads everything into a MemoryStream, because if you don't specify a provider it defaults to MultipartMemoryStreamProvider.
To get the files to save automatically to disk as every part is processed use MultipartFormDataStreamProvider (if you deal with files and form data) or MultipartFileStreamProvider (if you deal with just files).
There is an example on asp.net or here. In these examples everything happens in controllers, but there is no reason why you couldn't use the same approach in, for example, a formatter.
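For reference, the provider-based approach typically looks something like the sketch below in a controller (the App_Data target folder is an assumption); in a formatter it is the same ReadAsMultipartAsync(provider) call that does the work.
public async Task<HttpResponseMessage> PostFormData()
{
    if (!Request.Content.IsMimeMultipartContent())
    {
        throw new HttpResponseException(HttpStatusCode.UnsupportedMediaType);
    }

    // Each part is streamed to a file under this folder as it is read,
    // so the upload never has to be held in memory.
    string root = HttpContext.Current.Server.MapPath("~/App_Data");
    var provider = new MultipartFormDataStreamProvider(root);

    await Request.Content.ReadAsMultipartAsync(provider);

    foreach (MultipartFileData file in provider.FileData)
    {
        // file.LocalFileName is the temporary file on disk;
        // file.Headers.ContentDisposition.FileName is the client's original file name.
    }

    // Ordinary form fields (your json, etc.) end up in provider.FormData.
    return Request.CreateResponse(HttpStatusCode.OK);
}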
Another option, if you really want to play with streams, is to implement a custom class inheriting from MultipartStreamProvider that fires whatever processing you want as soon as it grabs part of the stream. The usage would be similar to the aforementioned providers - you'd need to pass it to the ReadAsMultipartAsync(provider) method.
Finally - if you are feeling suicidal - since the underlying request stream is bufferless theoretically you could use something like this in your controller or formatter:
Stream stream = HttpContext.Current.Request.GetBufferlessInputStream();
byte[] b = new byte[32 * 1024];
int n;
while ((n = stream.Read(b, 0, b.Length)) > 0)
{
    // do stuff with this chunk of the stream
}
But of course that's, for lack of a better word, "ghetto."
