Split binary file into chunks or parts for upload / download - C#

I am using CouchDB as a content-management store and upload files as binary data. Unlike MongoDB, there is no GridFS support for uploading large files, so I need to upload files as chunks and then retrieve them as one file.
Here is my code:
public string InsertDataToCouchDb(string dbName, string id, string filename, byte[] image)
{
    var connection = System.Configuration.ConfigurationManager.ConnectionStrings["CouchDb"].ConnectionString;
    using (var db = new MyCouchClient(connection, dbName))
    {
        // HERE I NEED TO UPLOAD MY IMAGE BYTE[] AS CHUNKS
        var artist = new couchdb
        {
            _id = id,
            filename = filename,
            Image = image
        };
        var response = db.Entities.PutAsync(artist);
        return response.Result.Content._id;
    }
}
public byte[] FetchDataFromCouchDb(string dbName, string id)
{
    var connection = System.Configuration.ConfigurationManager.ConnectionStrings["CouchDb"].ConnectionString;
    using (var db = new MyCouchClient(connection, dbName))
    {
        // HERE I NEED TO RETRIEVE MY FULL IMAGE[] FROM CHUNKS
        var test = db.Documents.GetAsync(id, null);
        var doc = db.Serializer.Deserialize<couchdb>(test.Result.Content);
        return doc.Image;
    }
}
THANK YOU

Putting image data in a CouchDB document is a terrible idea. Just don't. This is the purpose of CouchDB attachments.
Document updates alone can bloat the database with redundant blob data, which will have major negative consequences for anything other than a toy database.
Further, the code in the OP shows a misunderstanding of how async/await works: it invokes async methods, e.g. db.Entities.PutAsync(artist), without await and then blocks on .Result, which risks deadlocks and buries failures in an AggregateException. I highly recommend grokking the Microsoft document Asynchronous programming with async and await.
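A minimal sketch of the OP's insert method with proper awaiting (hypothetical, and it still stores the image in the document body purely to illustrate the await pattern, not to endorse it):
public async Task<string> InsertDataToCouchDbAsync(string dbName, string id, string filename, byte[] image)
{
    var connection = System.Configuration.ConfigurationManager.ConnectionStrings["CouchDb"].ConnectionString;
    using (var db = new MyCouchClient(connection, dbName))
    {
        var artist = new couchdb { _id = id, filename = filename, Image = image };
        // Await the asynchronous call instead of blocking on .Result.
        var response = await db.Entities.PutAsync(artist);
        return response.Content._id;
    }
}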
Now as for "chunking": If the image data is so large that it needs to be otherwise streamed, the business of passing it around via a byte array looks bad. If the images are relatively small, just use Attachment.PutAsync as it stands.
Although Attachment.PutAsync in MyCouch v7.6 does not support streams (which would effectively allow chunked transfer), the Support Streams for attachments #177 PR does, and it looks pretty good.
Here's a one-page C# .NET Core console app that uploads a given file as an attachment to a specific document using the very efficient streaming provided by PR 177. Although the code uses PR 177, the important part is that it uses attachments for blob data; replacing the stream with a byte array is rather straightforward.
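For the small-image case, here's a rough sketch of putting a byte array as an attachment with stock MyCouch; the PutAttachmentRequest overload shown (doc id, rev, name, content type, byte[]) should be checked against the MyCouch version actually in use:
public async Task<string> UploadImageAsAttachmentAsync(string dbName, string docId, string docRev, string filename, byte[] image)
{
    var connection = System.Configuration.ConfigurationManager.ConnectionStrings["CouchDb"].ConnectionString;
    using (var client = new MyCouchClient(connection, dbName))
    {
        // Attach the blob to an existing document instead of embedding it in the document body.
        var request = new PutAttachmentRequest(docId, docRev, filename, "image/jpeg", image);
        var response = await client.Attachments.PutAsync(request);
        if (!response.IsSuccess) throw new Exception(response.Reason);
        return response.Rev; // the new document revision after the attachment is stored
    }
}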
MyCouch + PR 177
In a console get MyCouch sources and then apply PR 177
$ git clone https://github.com/danielwertheim/mycouch.git
$ cd mycouch
$ git pull origin 15a1079502a1728acfbfea89a7e255d0c8725e07
(I don't know git well, so there may be a better way; GitHub also exposes pull requests as refs, e.g. git fetch origin pull/177/head:pr-177 followed by git checkout pr-177.)
MyCouchUploader
With VS2019
Create a new .Net Core console app project and solution named "MyCouchUploader"
Add the MyCouch project pulled with PR 177 to the solution
Add the MyCouch project as MyCouchUploader dependency
Add the Nuget package "Microsoft.AspNetCore.StaticFiles" as a MyCouchUploader dependency
Replace the content of Program.cs with the following code:
using Microsoft.AspNetCore.StaticFiles;
using MyCouch;
using MyCouch.Requests;
using MyCouch.Responses;
using System;
using System.IO;
using System.Linq;
using System.Net;
using System.Security.Cryptography;
using System.Threading.Tasks;

namespace MyCouchUploader
{
    class Program
    {
        static async Task Main(string[] args)
        {
            // args: scheme, database, file path of asset to upload.
            if (args.Length < 3)
            {
                Console.WriteLine("\nUsage: MyCouchUploader scheme dbname filepath\n");
                return;
            }

            var opts = new
            {
                scheme = args[0],
                dbName = args[1],
                filePath = args[2]
            };

            Action<Response> check = (response) =>
            {
                if (!response.IsSuccess) throw new Exception(response.Reason);
            };

            try
            {
                // canned doc id for this app
                const string docId = "SO-68998781";
                const string attachmentName = "Image";

                DbConnectionInfo cnxn = new DbConnectionInfo(opts.scheme, opts.dbName)
                {   // timely fail if scheme is bad
                    Timeout = TimeSpan.FromMilliseconds(3000)
                };
                MyCouchClient client = new MyCouchClient(cnxn);

                // ensure db is there
                GetDatabaseResponse info = await client.Database.GetAsync();
                check(info);

                // delete doc for successive program runs
                DocumentResponse doc = await client.Documents.GetAsync(docId);
                if (doc.StatusCode == HttpStatusCode.OK)
                {
                    DocumentHeaderResponse del = await client.Documents.DeleteAsync(docId, doc.Rev);
                    check(del);
                }

                // sniff file for content type
                FileExtensionContentTypeProvider provider = new FileExtensionContentTypeProvider();
                if (!provider.TryGetContentType(opts.filePath, out string contentType))
                {
                    contentType = "application/octet-stream";
                }

                // create a hash for silly verification
                using var md5 = MD5.Create();
                using Stream stream = File.OpenRead(opts.filePath);
                byte[] fileHash = md5.ComputeHash(stream);
                stream.Position = 0;

                // Use PR 177, sea-locks:stream-attachments.
                DocumentHeaderResponse put = await client.Attachments.PutAsync(new PutAttachmentStreamRequest(
                    docId,
                    attachmentName,
                    contentType,
                    stream // :-D
                ));
                check(put);

                // verify
                AttachmentResponse verify = await client.Attachments.GetAsync(docId, attachmentName);
                check(verify);

                if (fileHash.SequenceEqual(md5.ComputeHash(verify.Content)))
                {
                    Console.WriteLine("Attachment verified.");
                }
                else
                {
                    throw new Exception(String.Format("Attachment failed verification with status code {0}", verify.StatusCode));
                }
            }
            catch (Exception e)
            {
                Console.WriteLine("Fail! {0}", e.Message);
            }
        }
    }
}
To run:
$ MyCouchUploader http://name:password@localhost:5984 dbname path-to-local-image-file
Use Fauxton to visually verify the attachment for the doc.

Related

Merge Azure Blobs (PDFs) into one Blob and download to user via C# ASP.NET

I have an ASP.NET Azure web application written in C# that involves the user uploading different pdfs into Azure Blob storage. I'd like the user to later download a combined PDF inclusive of previously-uploaded blobs in a specific order. Any idea on the best way to accomplish this?
Here are two workarounds you can try:
Use Azure Functions.
Download your PDF files from Azure Blob storage to your local computer, then merge them.
Use Azure Functions
Create an Azure Functions project and use the HTTP trigger.
Make sure you install the necessary packages before getting started with the code.
Create the function code.
Create the Azure Function in the portal.
Publish the code.
We are ready to start writing code. We need two files:
ResultClass.cs – returns the merged file(s) as a list.
Function1.cs – code that takes the file names from the URL, grabs them from the storage account, merges them into one, and returns a download URL.
ResultClass.cs
using System;
using System.Collections.Generic;

namespace FunctionApp1
{
    public class Result
    {
        public Result(IList<string> newFiles)
        {
            this.files = newFiles;
        }

        public IList<string> files { get; private set; }
    }
}
Function1.cs
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Configuration;
using Microsoft.WindowsAzure.Storage.Blob;
using Newtonsoft.Json;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

namespace FunctionApp1
{
    public class Function1
    {
        static Function1()
        {
            // This is required to avoid the "No data is available for encoding 1252" exception when saving the PdfDocument
            System.Text.Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance);
        }

        [FunctionName("Function1")]
        public async Task<Result> SplitUploadAsync(
            [HttpTrigger(AuthorizationLevel.Anonymous, "post", Route = null)] HttpRequestMessage req,
            // container where files will be stored and accessed for retrieval; in this case it's called temp-pdf
            [Blob("temp-pdf", Connection = "")] CloudBlobContainer outputContainer,
            ILogger log)
        {
            // get query parameters
            string uriq = req.RequestUri.ToString();
            string keyw = uriq.Substring(uriq.IndexOf('=') + 1);
            // get file names from the query parameters
            String fileNames = keyw.Split("mergepfd&filenam=")[1];
            // split file names
            string[] files = fileNames.Split(',');
            // process merge
            var newFiles = await this.MergeFileAsync(outputContainer, files);
            return new Result(newFiles);
        }

        private async Task<IList<string>> MergeFileAsync(CloudBlobContainer container, string[] blobfiles)
        {
            // init instance
            PdfDocument outputDocument = new PdfDocument();
            // loop through files sent in query
            foreach (string fileblob in blobfiles)
            {
                String intfile = $"" + fileblob;
                // get file
                CloudBlockBlob blob = container.GetBlockBlobReference(intfile);
                using (var memoryStream = new MemoryStream())
                {
                    await blob.DownloadToStreamAsync(memoryStream);
                    // rewind so PdfReader reads the downloaded bytes from the start
                    memoryStream.Position = 0;
                    // open document
                    var inputDocument = PdfReader.Open(memoryStream, PdfDocumentOpenMode.Import);
                    // get pages
                    int count = inputDocument.PageCount;
                    for (int idx = 0; idx < count; idx++)
                    {
                        // append
                        outputDocument.AddPage(inputDocument.Pages[idx]);
                    }
                }
            }

            var outputFiles = new List<string>();
            var tempFile = String.Empty;
            // call save function to store output in container
            tempFile = await this.SaveToBlobStorageAsync(container, outputDocument);
            outputFiles.Add(tempFile);
            // return file(s) url
            return outputFiles;
        }

        private async Task<string> SaveToBlobStorageAsync(CloudBlobContainer container, PdfDocument document)
        {
            // file name structure
            var filename = $"merge-{DateTime.Now.ToString("yyyyMMddhhmmss")}-{Guid.NewGuid().ToString().Substring(0, 4)}.pdf";
            // create an empty blob reference
            var outputBlob = container.GetBlockBlobReference(filename);
            using (var stream = new MemoryStream())
            {
                // save result of merge without closing the stream, then rewind before uploading
                document.Save(stream, false);
                stream.Position = 0;
                await outputBlob.UploadFromStreamAsync(stream);
            }
            // get sas token
            var sasBlobToken = outputBlob.GetSharedAccessSignature(new SharedAccessBlobPolicy()
            {
                SharedAccessExpiryTime = DateTime.UtcNow.AddMinutes(5),
                Permissions = SharedAccessBlobPermissions.Read
            });
            // return blob url with sas token
            return outputBlob.Uri + sasBlobToken;
        }
    }
}
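For reference, a rough sketch of calling the deployed function from C# (the host name is a placeholder, and the query shape simply mirrors what SplitUploadAsync parses: everything after the first '=' is split on "mergepfd&filenam=", leaving the comma-separated blob names):
using System;
using System.Net.Http;
using System.Threading.Tasks;

static async Task MergeViaFunctionAsync()
{
    using (var http = new HttpClient())
    {
        var url = "https://<your-function-app>.azurewebsites.net/api/Function1"
                + "?action=mergepfd&filenam=a.pdf,b.pdf";
        var response = await http.PostAsync(url, new StringContent(string.Empty));
        // Expected body: { "files": [ "<SAS URL of the merged PDF>" ] }
        Console.WriteLine(await response.Content.ReadAsStringAsync());
    }
}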
Download your pdf files from Azure Blob to your local computer, then merge them
internal static void combineNormalPdfFiles()
{
    String inputFilePath1 = @"C:\1.pdf";
    String inputFilePath2 = @"C:\2.pdf";
    String inputFilePath3 = @"C:\3.pdf";
    String outputFilePath = @"C:\Output.pdf";
    String[] inputFilePaths = new String[3] { inputFilePath1, inputFilePath2, inputFilePath3 };
    // Combine three PDF files and output.
    PDFDocument.CombineDocument(inputFilePaths, outputFilePath);
}
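The snippet above assumes the PDFs are already on disk. A minimal sketch of downloading them from Blob storage first, using the same WindowsAzure.Storage client as the function code (the connection string, container, and blob names are placeholders):
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;
using System.IO;
using System.Threading.Tasks;

internal static async Task DownloadPdfBlobsAsync()
{
    var account = CloudStorageAccount.Parse("<storage connection string>");
    var container = account.CreateCloudBlobClient().GetContainerReference("temp-pdf");
    foreach (var name in new[] { "1.pdf", "2.pdf", "3.pdf" })
    {
        CloudBlockBlob blob = container.GetBlockBlobReference(name);
        // Writes the blob to C:\<name>, overwriting any existing file.
        await blob.DownloadToFileAsync(Path.Combine(@"C:\", name), FileMode.Create);
    }
}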
REFERENCES:
Azure Function to combine PDF Blobs in Azure Storage Account (Blob container)
C# Merge PDF SDK: Merge, combine PDF files in C#.net, ASP.NET, MVC, Ajax, WinForms, WPF

Why does SHA256 hash not match after uploading and downloading file?

I'm creating a feature for an app to store a file on a webserver while maintaining data about the file on SQL Server. I generate a SHA256 hash and store it as BINARY(32) and then upload the file to a WebDav server using HTTPClient. Later when I want to view the file in the app, I do a GET request, download the file, and check the SHA256 hash with the stored hash. It doesn't match :( Why?
I've tried checking the hash on the server and the local machine and it doesn't match either. I've done a ton of research and made sure I wasn't hashing the filename (you can see the code below).
public static byte[] GetSHA256(string path) {
    using (var stream = File.OpenRead(path)) {
        using (var sha256 = SHA256.Create()) {
            return sha256.ComputeHash(stream);
        }
    }
}
To upload a file:
public async Task<bool> Upload(string path, string name) {
    var storedHash = GetSHA256(path/name);
    // Store this hash in a database, omitted for brevity
    using (var file = File.OpenRead(path)) {
        var content = new MultipartFormDataContent();
        content.Headers.ContentType.Media = "multipart/form-data";
        content.Add(new StreamContent(file));
        var result = await HttpClient.PutAsync(uri, content);
    }
}
To download:
var result = await HttpClient.GetAsync(uri);
using (var stream = await result.Content.ReadAsStreamAsync()) {
    var fileInfo = new FileInfo("TestFile");
    using (var fileStream = fileInfo.Open(FileMode.CreateNew, FileAccess.ReadWrite, FileShare.Delete)) {
        await stream.CopyToAsync(fileStream);
    }
}
var downloadedFileHash = GetSHA256("TestFile");
// check if downloadedFileHash matches the storedHash by comparing byte[] length and content with a for loop.
I expect that the hash would match. I know I'm missing a few using statements and other code but I omitted a bunch for brevity.
EDIT: The hashes for the downloaded files stay the same so the problem isn't downloading but uploading. I uploaded the same files multiple times but get back different hashes for each one. But the different hashes stay constant.
Sorry y'all, you can delete this question because I found the problem/answer but am still confused why this is occurring.
Turns out webdav was adding extra headers to my file for some reason, see: Header info being written into file when PUT-ing to a Webdav server
Strangest thing. So I encountered this post. https://blogs.msdn.microsoft.com/robert_mcmurray/2011/10/18/sending-webdav-requests-in-net-revisited/
Rewrote my code to be:
public static async Task<HttpResponseMessage> Upload(string path, string name, FileStream file) {
    var method = new HttpMethod(@"PUT");
    var message = new HttpRequestMessage(method, path/name) {
        Content = new StreamContent(file)
    };
    return await HttpClient.SendAsync(message);
}
And it works... But I'm wondering how the two methods of uploading differ.
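The difference: MultipartFormDataContent wraps the stream in multipart boundaries and part headers, and the WebDAV server stores that entire request body as the file, whereas the plain StreamContent PUT sends only the raw file bytes. A small sketch (inside an async method, with path as in the question) that makes the extra bytes visible:
using (var file = File.OpenRead(path))
{
    var multipart = new MultipartFormDataContent();
    multipart.Add(new StreamContent(file));
    // What the multipart PUT actually sends: boundary lines + part headers + the file bytes.
    byte[] wireBytes = await multipart.ReadAsByteArrayAsync();
    // wireBytes.Length > new FileInfo(path).Length, so its SHA256 differs from the raw file's hash.
}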

File API seems to always write corrupt files when used in a loop, except for the last file

I know the title is long, but it describes the problem exactly. I didn't know how else to explain it because this is totally out there.
I have a utility written in C# targeting .NET Core 2.1 that downloads and decrypts (AES encryption) files originally uploaded by our clients from our encrypted store, so they can be reprocessed through some of our services in the case that they fail. This utility is run via CLI using database IDs for the files as arguments, for example download.bat 101 102 103 would download 3 files with the corresponding IDs. I'm receiving byte data through a message queue (really not much more than a TCP socket) which describes a .TIF image.
I have good reason to believe that the byte data is never corrupted on the server: when I run the utility with only one ID parameter, such as download.bat 101, it works just fine. Furthermore, when I run it with multiple IDs, the last file downloaded by the utility is always intact, but the rest are always corrupted.
This odd behavior has persisted across two different implementations for writing the byte data to a file. Those implementations are below.
File.WriteAllBytes implementation:
private static void WriteMessageContents(FileServiceResponseEnvelope envelope, string destination, byte[] encryptionKey, byte[] macInitialVector)
{
    using (var inputStream = new MemoryStream(envelope.Payload))
    using (var outputStream = new MemoryStream(envelope.Payload.Length))
    {
        var sha512 = YellowAesEncryptor.DecryptStream(inputStream, outputStream, encryptionKey, macInitialVector, 0);
        File.WriteAllBytes(destination, outputStream.ToArray());
        _logger.LogStatement($"Finished writing [{envelope.Payload.Length} bytes] to [{destination}].", LogLevel.Debug);
    }
}
FileStream implementation:
private static void WriteMessageContents(FileServiceResponseEnvelope envelope, string destination, byte[] encryptionKey, byte[] macInitialVector)
{
    using (var inputStream = new MemoryStream(envelope.Payload))
    using (var outputStream = new MemoryStream(envelope.Payload.Length))
    {
        var sha512 = YellowAesEncryptor.DecryptStream(inputStream, outputStream, encryptionKey, macInitialVector, 0);
        using (FileStream fs = new FileStream(destination, FileMode.Create))
        {
            var bytes = outputStream.ToArray();
            fs.Write(bytes, 0, envelope.Payload.Length);
            _logger.LogStatement($"File byte content: [{string.Join(", ", bytes.Take(16))}]", LogLevel.Trace);
            fs.Flush();
        }
        _logger.LogStatement($"Finished writing [{envelope.Payload.Length} bytes] to [{destination}].", LogLevel.Debug);
    }
}
This method is called from a for loop which first receives the messages I described earlier and then feeds their payloads to the above method:
using (var requestSocket = new RequestSocket(fileServiceEndpoint))
{
    // Envelopes is constructed beforehand
    foreach (var envelope in envelopes)
    {
        var timer = Stopwatch.StartNew();
        requestSocket.SendMoreFrame(messageTypeBytes);
        requestSocket.SendMoreFrame(SerializationHelper.SerializeObjectToBuffer(envelope));
        if (!requestSocket.TrySendFrame(_timeout, signedPayloadBytes, signedPayloadBytes.Length))
        {
            var message = $"Timeout exceeded while processing [{envelope.ActionType}] request.";
            _logger.LogStatement(message, LogLevel.Error);
            throw new Exception(message);
        }
        var responseReceived = requestSocket.TryReceiveFrameBytes(_timeout, out byte[] responseBytes);
        ...
        var responseEnvelope = SerializationHelper.DeserializeObject<FileServiceResponseEnvelope>(responseBytes);
        ...
        _logger.LogStatement($"Received response with payload of [{responseEnvelope.Payload.Length} bytes].", LogLevel.Info);
        var destDir = downloadDetails.GetDestinationPath(responseEnvelope.FileId);
        if (!Directory.Exists(destDir))
            Directory.CreateDirectory(destDir);
        var dest = Path.Combine(destDir, idsToFileNames[responseEnvelope.FileId]);
        WriteMessageContents(responseEnvelope, dest, encryptionKey, macInitialVector);
    }
}
I also know that TIFs have a very specific header, which looks something like this in raw bytes:
[73, 73, 42, 0, 8, 0, 0, 0, 20, 0...
It always begins with "II" (73, 73) or "MM" (77, 77) followed by 42 (probably a Hitchhiker's reference). I analyzed the bytes written by the utility. The last file always has a header that resembles this one. The rest are always random bytes; seemingly jumbled or mis-ordered image binary data. Any insight on this would be greatly appreciated because I can't wrap my mind around what I would even need to do to diagnose this.
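For diagnosing this kind of corruption, a quick header check like the following (a hypothetical helper, not part of the original utility) tells you whether an output file even starts with a valid TIFF signature:
using System.IO;

static bool LooksLikeTiff(string path)
{
    byte[] header = new byte[4];
    using (var fs = File.OpenRead(path))
    {
        if (fs.Read(header, 0, 4) < 4) return false;
    }
    bool littleEndian = header[0] == 0x49 && header[1] == 0x49; // "II"
    bool bigEndian = header[0] == 0x4D && header[1] == 0x4D;    // "MM"
    // The 16-bit magic number 42 follows, in the byte order declared by the first two bytes.
    ushort magic = littleEndian
        ? (ushort)(header[2] | (header[3] << 8))
        : (ushort)((header[2] << 8) | header[3]);
    return (littleEndian || bigEndian) && magic == 42;
}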
UPDATE
I was able to figure out this problem with the help of elgonzo in the comments. Sometimes it isn't a direct answer that helps, but someone picking your brain until you look in the right place.
All right, as I suspected this was a dumb mistake (I had severe doubts that the File API was simply this flawed for so long). I just needed help thinking through it. There was an additional bit of code which I didn't post that was biting me, when I was retrieving the metadata for the file so that I could then request the file from our storage box.
byte[] encryptionKey = null;
byte[] macInitialVector = null;
...
using (var conn = new SqlConnection(ConnectionString))
using (var cmd = new SqlCommand(uploadedFileQuery, conn))
{
    conn.Open();
    var reader = cmd.ExecuteReader();
    while (reader.Read())
    {
        FileServiceMessageEnvelope readAllEnvelope = null;
        var originalFileName = reader["UploadedFileClientName"].ToString();
        var fileId = Convert.ToInt64(reader["UploadedFileId"].ToString());
        //var originalFileExtension = originalFileName.Substring(originalFileName.IndexOf('.'));
        //_logger.LogStatement($"Scooped extension: {originalFileExtension}", LogLevel.Trace);
        envelopes.Add(readAllEnvelope = new FileServiceMessageEnvelope
        {
            ActionType = FileServiceActionTypeEnum.ReadAll,
            FileType = FileTypeEnum.UploadedFile,
            FileName = reader["UploadedFileServerName"].ToString(),
            FileId = fileId,
            WorkerAuthorization = null,
            BinaryTimestamp = DateTime.Now.ToBinary(),
            Position = 0,
            Count = Convert.ToInt32(reader["UploadedFileSize"]),
            SignerFqdn = _messengerConfig.FullyQualifiedDomainName
        });
        readAllEnvelope.SignMessage(_messengerConfig.PrivateKeyBytes, _messengerConfig.PrivateKeyPassword);
        signedPayload = new SecureMessage { Payload = new byte[0] };
        signedPayload.SignMessage(_messengerConfig.PrivateKeyBytes, _messengerConfig.PrivateKeyPassword);
        signedPayloadBytes = SerializationHelper.SerializeObjectToBuffer(signedPayload);
        encryptionKey = (byte[])reader["UploadedFileEncryptionKey"];
        macInitialVector = (byte[])reader["UploadedFileEncryptionMacInitialVector"];
    }
    conn.Close();
}
Eagle-eyed observers might realize that I have not properly coupled the encryptionKey and macInitialVector to the correct record, since each file has a unique key and vector. This means I was using the key for one of the files to decrypt all of them which is why they were all corrupt except for one file -- they were not properly decrypted. I solved this issue by coupling them together with the ID in a simple POCO and retrieving the appropriate key and vector for each file upon decryption.
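A hypothetical sketch of the fix described above (names are illustrative, not from the original code): keep each file's key and IV together with its ID so the matching pair is used when that file is decrypted.
using System.Collections.Generic;

class FileCryptoInfo
{
    public long FileId { get; set; }
    public byte[] EncryptionKey { get; set; }
    public byte[] MacInitialVector { get; set; }
}

class CryptoLookup
{
    private readonly Dictionary<long, FileCryptoInfo> _byId = new Dictionary<long, FileCryptoInfo>();

    // Called once per SqlDataReader row while the envelopes are being built.
    public void Add(long fileId, byte[] key, byte[] iv) =>
        _byId[fileId] = new FileCryptoInfo { FileId = fileId, EncryptionKey = key, MacInitialVector = iv };

    // Called when a response envelope comes back, so each file is decrypted with its own key/IV.
    public FileCryptoInfo For(long fileId) => _byId[fileId];
}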

How can I save entire MongoDB collection to json/bson file using C#?

I have a process which first generates lots of data that is saved into a MongoDB collection, then the data is analyzed, and finally I want to save the whole collection to a file on disk and erase the collection.
I know I could do it easily with mongodump.exe, but I was wondering whether there is any way to do it directly from C# - I mean not by running a console process, but by using some functionality inside the Mongo C# driver.
And, if it can be done, how would I do the reverse operation in C#, namely loading a .bson file into a collection?
Here are two methods you can use to accomplish this:
public static async Task WriteCollectionToFile(IMongoDatabase database, string collectionName, string fileName)
{
    var collection = database.GetCollection<RawBsonDocument>(collectionName);
    // Make sure the file is empty before we start writing to it
    File.WriteAllText(fileName, string.Empty);
    using (var cursor = await collection.FindAsync(new BsonDocument()))
    {
        while (await cursor.MoveNextAsync())
        {
            var batch = cursor.Current;
            foreach (var document in batch)
            {
                File.AppendAllLines(fileName, new[] { document.ToString() });
            }
        }
    }
}

public static async Task LoadCollectionFromFile(IMongoDatabase database, string collectionName, string fileName)
{
    using (FileStream fs = File.Open(fileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    using (BufferedStream bs = new BufferedStream(fs))
    using (StreamReader sr = new StreamReader(bs))
    {
        var collection = database.GetCollection<BsonDocument>(collectionName);
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            await collection.InsertOneAsync(BsonDocument.Parse(line));
        }
    }
}
And here's an example of how you would use them:
// Obviously you'll need to change all these values to your environment
var connectionString = "mongodb://localhost:27017";
var database = new MongoClient(connectionString).GetDatabase("database");
var fileName = @"C:\mongo_output.txt";
var collectionName = "collection name";
// This will save all of the documents in the file you specified
WriteCollectionToFile(database, collectionName, fileName).Wait();
// This will delete all of the documents in the collection
database.GetCollection<BsonDocument>(collectionName).DeleteManyAsync(new BsonDocument()).Wait();
// This will restore all the documents from the file you specified
LoadCollectionFromFile(database, collectionName, fileName).Wait();
Note that this code was written using version 2.0 of the MongoDB C# driver, which you can obtain via Nuget. Also, the file reading code in the LoadCollectionFromFile method was obtained from this answer.
You can use the C# BinaryFormatter to serialize an object graph to disk, and deserialize it back into an object graph later.
Serialize:
https://msdn.microsoft.com/en-us/library/c5sbs8z9%28v=VS.110%29.aspx
Deserialize:
https://msdn.microsoft.com/en-us/library/b85344hz%28v=vs.110%29.aspx
However, that is not a MongoDB or C# driver feature.
After serializing you can use the driver to drop the collection. And after deserializing you can use the driver to insert objects into a new collection.
Based on your rules, you may want to do some locking on that collection at the time you are doing the export process before you drop it.
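For the drop step itself the driver has a direct call; a minimal sketch matching the usage example above:
// Removes the collection entirely once the export file has been written and verified.
database.DropCollectionAsync(collectionName).Wait();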

Streaming files from amazon s3 with seek possibility in C#

I need to work with huge files in Amazon S3. How can I get part of a huge file from S3? The best approach would be to get a stream with seek support.
Unfortunately, the CanSeek property of response.ResponseStream is false:
GetObjectRequest request = new GetObjectRequest();
request.BucketName = BUCKET_NAME;
request.Key = NumIdToAmazonKey(numID);
GetObjectResponse response = client.GetObject(request);
You could do the following to read a certain part of your file:
GetObjectRequest request = new GetObjectRequest
{
    BucketName = bucketName,
    Key = keyName,
    ByteRange = new ByteRange(0, 10)
};
See the documentation
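To actually consume the ranged response (inside an async method, reusing client and request from above), a short sketch:
using (GetObjectResponse response = await client.GetObjectAsync(request))
using (var ms = new MemoryStream())
{
    // Copies only the requested byte range (bytes 0-10 here) into memory.
    await response.ResponseStream.CopyToAsync(ms);
    byte[] partialBytes = ms.ToArray();
}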
I know this isn't exactly what the OP is asking for, but I needed a seekable S3 stream so I could read Parquet files without downloading them, so I gave it a shot here: https://github.com/mukunku/RandomHelpers/blob/master/SeekableS3Stream.cs
Performance wasn't as bad as I expected. You can use the TimeWastedSeeking property to see how much time is wasted by allowing Seek() on an S3 stream.
Here's an example of how to use it:
using (var client = new AmazonS3Client(credentials, Amazon.RegionEndpoint.USEast1))
{
    using (var stream = SeekableS3Stream.OpenFile(client, "myBucket", "path/to/myfile.txt", true))
    {
        // stream is seekable!
    }
}
After a frustrating afternoon with the same problem, I found the static class AmazonS3Util:
https://docs.aws.amazon.com/sdkfornet/v3/apidocs/items/S3/TS3Util.html
which has a MakeStreamSeekable method.
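A rough usage sketch, assuming MakeStreamSeekable copies the response stream into a seekable in-memory stream (so it reads the whole object and only suits objects that fit in memory); client and request are as in the snippets above:
using (GetObjectResponse response = await client.GetObjectAsync(request))
using (Stream seekable = Amazon.S3.Util.AmazonS3Util.MakeStreamSeekable(response.ResponseStream))
{
    // Unlike the raw ResponseStream, this stream allows Seek.
    seekable.Seek(1024, SeekOrigin.Begin);
    // ... read from the new position ...
}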
Way late for the OP, but I've just posted an article and code demonstration of a SeekableS3Stream that performs reasonably well in real-world use cases.
https://github.com/mlhpdx/seekable-s3-stream
Specifically, it demonstrates reading a single small file from a much larger ISO disk image, using the DiscUtils library unmodified, by implementing a random-access stream that uses Range requests to pull sections of the file as needed and keeps them in an MRU list to avoid re-downloading ranges for hot data structures in the file (e.g. ZIP central directory records).
The usage is similarly simple:
using System;
using System.IO;
using System.Threading.Tasks;
using Amazon.S3;
using DiscUtils.Iso9660;

namespace Seekable_S3_Stream
{
    class Program
    {
        const string BUCKET = "rds.nsrl.nist.gov";
        const string KEY = "RDS/current/RDS_ios.iso"; // "RDS/current/RDS_modern.iso";
        const string FILENAME = "READ_ME.TXT";

        static async Task Main(string[] args)
        {
            var s3 = new AmazonS3Client();

            using var stream = new Cppl.Utilities.AWS.SeekableS3Stream(s3, BUCKET, KEY, 1 * 1024 * 1024, 4);
            using var iso = new CDReader(stream, true);
            using var file = iso.OpenFile(FILENAME, FileMode.Open, FileAccess.Read);
            using var reader = new StreamReader(file);

            var content = await reader.ReadToEndAsync();

            await Console.Out.WriteLineAsync($"{stream.TotalRead / (float)stream.Length * 100}% read, {stream.TotalLoaded / (float)stream.Length * 100}% loaded");
        }
    }
}
