Streaming files from Amazon S3 with seek possibility in C#

I need to work with huge files in Amazon S3. How can I get part of a huge file from S3? The best option would be to get a stream that supports seeking.
Unfortunately, the CanSeek property of response.ResponseStream is false:
GetObjectRequest request = new GetObjectRequest();
request.BucketName = BUCKET_NAME;
request.Key = NumIdToAmazonKey(numID);
GetObjectResponse response = client.GetObject(request);

You could do the following to read a certain part of your file:
GetObjectRequest request = new GetObjectRequest
{
    BucketName = bucketName,
    Key = keyName,
    ByteRange = new ByteRange(0, 10)
};
See the documentation
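The stream returned for a ranged GET still isn't seekable, but for a small range you can simply copy it into memory and seek there. A minimal sketch, continuing from the request above:
using (GetObjectResponse response = client.GetObject(request))
using (var buffer = new MemoryStream())
{
    // Copy only the requested byte range (bytes 0..10 here, the range is inclusive)
    // into a seekable MemoryStream.
    response.ResponseStream.CopyTo(buffer);
    buffer.Position = 0;
    // buffer.CanSeek is now true, so it can be handed to code that needs to seek.
}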

I know this isn't exactly what the OP is asking for, but I needed a seekable S3 stream so I could read Parquet files without downloading them, so I gave it a shot here: https://github.com/mukunku/RandomHelpers/blob/master/SeekableS3Stream.cs
Performance wasn't as bad as I expected. You can use the TimeWastedSeeking property to see how much time is being wasted by allowing Seek() on an S3 stream.
Here's an example on how to use it:
using (var client = new AmazonS3Client(credentials, Amazon.RegionEndpoint.USEast1))
{
    using (var stream = SeekableS3Stream.OpenFile(client, "myBucket", "path/to/myfile.txt", true))
    {
        // stream is seekable!
    }
}

After a frustrating afternoon with the same problem, I found the static class AmazonS3Util:
https://docs.aws.amazon.com/sdkfornet/v3/apidocs/items/S3/TS3Util.html
which has a MakeStreamSeekable method.
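As far as I can tell, MakeStreamSeekable works by copying the response stream into a seekable stream in memory, so it is only practical for objects that fit in RAM. A rough sketch of its use:
using (var response = client.GetObject(request))
{
    // Copies the (non-seekable) response stream into a seekable one;
    // the whole object ends up in memory, so this suits small files only.
    Stream seekable = Amazon.S3.Util.AmazonS3Util.MakeStreamSeekable(response.ResponseStream);
    seekable.Seek(1024, SeekOrigin.Begin);
    // ...
}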

Way late for the OP, but I've just posted an article and code demonstration of a SeekableS3Stream that performs reasonably well in real-world use cases.
https://github.com/mlhpdx/seekable-s3-stream
Specifically, I demonstrate reading a single small file from a much larger ISO disk image using the DiscUtils library, unmodified. The underlying random-access stream uses Range requests to pull sections of the file as needed and keeps them in an MRU list, which prevents re-downloading ranges for hot data structures in the file (e.g. zip central directory records).
The use is similarly simple:
using System;
using System.IO;
using System.Threading.Tasks;
using Amazon.S3;
using DiscUtils.Iso9660;

namespace Seekable_S3_Stream
{
    class Program
    {
        const string BUCKET = "rds.nsrl.nist.gov";
        const string KEY = "RDS/current/RDS_ios.iso"; // "RDS/current/RDS_modern.iso";
        const string FILENAME = "READ_ME.TXT";

        static async Task Main(string[] args)
        {
            var s3 = new AmazonS3Client();
            using var stream = new Cppl.Utilities.AWS.SeekableS3Stream(s3, BUCKET, KEY, 1 * 1024 * 1024, 4);
            using var iso = new CDReader(stream, true);
            using var file = iso.OpenFile(FILENAME, FileMode.Open, FileAccess.Read);
            using var reader = new StreamReader(file);
            var content = await reader.ReadToEndAsync();
            await Console.Out.WriteLineAsync($"{stream.TotalRead / (float)stream.Length * 100}% read, {stream.TotalLoaded / (float)stream.Length * 100}% loaded");
        }
    }
}

Related

Split binary file into chunks or parts for upload / download

I am using CouchDB as a content management store, uploading files as binary data. There is no GridFS support like MongoDB has for large files, so I need to upload files as chunks and then retrieve them as one file.
Here is my code:
public string InsertDataToCouchDb(string dbName, string id, string filename, byte[] image)
{
    var connection = System.Configuration.ConfigurationManager.ConnectionStrings["CouchDb"].ConnectionString;
    using (var db = new MyCouchClient(connection, dbName))
    {
        // HERE I NEED TO UPLOAD MY IMAGE BYTE[] AS CHUNKS
        var artist = new couchdb
        {
            _id = id,
            filename = filename,
            Image = image
        };
        var response = db.Entities.PutAsync(artist);
        return response.Result.Content._id;
    }
}

public byte[] FetchDataFromCouchDb(string dbName, string id)
{
    var connection = System.Configuration.ConfigurationManager.ConnectionStrings["CouchDb"].ConnectionString;
    using (var db = new MyCouchClient(connection, dbName))
    {
        // HERE I NEED TO RETRIEVE MY FULL IMAGE[] FROM CHUNKS
        var test = db.Documents.GetAsync(id, null);
        var doc = db.Serializer.Deserialize<couchdb>(test.Result.Content);
        return doc.Image;
    }
}
THANK YOU
Putting image data in a CouchDB document is a terrible idea. Just don't. This is the purpose of CouchDB attachments.
The potential of bloating the database with redundant blob data via document updates alone will surely have major, negative consequences for anything other than a toy database.
Further, there seems to be a lack of understanding of how async/await works: the code in the OP invokes async methods, e.g. db.Entities.PutAsync(artist), without an await, so the call surely will fail every time (if the compiler even allows the code). I highly recommend grokking the Microsoft document Asynchronous programming with async and await.
Now as for "chunking": If the image data is so large that it needs to be otherwise streamed, the business of passing it around via a byte array looks bad. If the images are relatively small, just use Attachment.PutAsync as it stands.
Although Attachment.PutAsync at MyCouch v7.6 does not support streams (effectively chunking) there exists the Support Streams for attachments #177 PR, which does, and it looks pretty good.
Here's a one page C# .Net Core console app that uploads a given file as an attachment to a specific document using the very efficient streaming provided by PR 177. Although the code uses PR 177, it most importantly uses Attachments for blob data. Replacing a stream with a byte array is rather straightforward.
MyCouch + PR 177
In a console, get the MyCouch sources and then apply PR 177:
$ git clone https://github.com/danielwertheim/mycouch.git
$ cd mycouch
$ git pull origin 15a1079502a1728acfbfea89a7e255d0c8725e07
(I don't know git so there's probably a far better way to get a PR)
MyCouchUploader
With VS2019
Create a new .Net Core console app project and solution named "MyCouchUploader"
Add the MyCouch project pulled with PR 177 to the solution
Add the MyCouch project as MyCouchUploader dependency
Add the Nuget package "Microsoft.AspNetCore.StaticFiles" as a MyCouchUploader dependency
Replace the content of Program.cs with the following code:
using Microsoft.AspNetCore.StaticFiles;
using MyCouch;
using MyCouch.Requests;
using MyCouch.Responses;
using System;
using System.IO;
using System.Linq;
using System.Net;
using System.Security.Cryptography;
using System.Threading.Tasks;

namespace MyCouchUploader
{
    class Program
    {
        static async Task Main(string[] args)
        {
            // args: scheme, database, file path of asset to upload.
            if (args.Length < 3)
            {
                Console.WriteLine("\nUsage: MyCouchUploader scheme dbname filepath\n");
                return;
            }
            var opts = new
            {
                scheme = args[0],
                dbName = args[1],
                filePath = args[2]
            };
            Action<Response> check = (response) =>
            {
                if (!response.IsSuccess) throw new Exception(response.Reason);
            };
            try
            {
                // canned doc id for this app
                const string docId = "SO-68998781";
                const string attachmentName = "Image";

                DbConnectionInfo cnxn = new DbConnectionInfo(opts.scheme, opts.dbName)
                { // timely fail if scheme is bad
                    Timeout = TimeSpan.FromMilliseconds(3000)
                };
                MyCouchClient client = new MyCouchClient(cnxn);

                // ensure db is there
                GetDatabaseResponse info = await client.Database.GetAsync();
                check(info);

                // delete doc for successive program runs
                DocumentResponse doc = await client.Documents.GetAsync(docId);
                if (doc.StatusCode == HttpStatusCode.OK)
                {
                    DocumentHeaderResponse del = await client.Documents.DeleteAsync(docId, doc.Rev);
                    check(del);
                }

                // sniff file for content type
                FileExtensionContentTypeProvider provider = new FileExtensionContentTypeProvider();
                if (!provider.TryGetContentType(opts.filePath, out string contentType))
                {
                    contentType = "application/octet-stream";
                }

                // create a hash for silly verification
                using var md5 = MD5.Create();
                using Stream stream = File.OpenRead(opts.filePath);
                byte[] fileHash = md5.ComputeHash(stream);
                stream.Position = 0;

                // Use PR 177, sea-locks:stream-attachments.
                DocumentHeaderResponse put = await client.Attachments.PutAsync(new PutAttachmentStreamRequest(
                    docId,
                    attachmentName,
                    contentType,
                    stream // :-D
                ));
                check(put);

                // verify
                AttachmentResponse verify = await client.Attachments.GetAsync(docId, attachmentName);
                check(verify);
                if (fileHash.SequenceEqual(md5.ComputeHash(verify.Content)))
                {
                    Console.WriteLine("Attachment verified.");
                }
                else
                {
                    throw new Exception(String.Format("Attachment failed verification with status code {0}", verify.StatusCode));
                }
            }
            catch (Exception e)
            {
                Console.WriteLine("Fail! {0}", e.Message);
            }
        }
    }
}
To run:
$ MyCouchUploader http://name:password@localhost:5984 dbname path-to-local-image-file
Use Fauxton to visually verify the attachment for the doc.

Why does SHA256 hash not match after uploading and downloading file?

I'm creating a feature for an app to store a file on a webserver while maintaining data about the file on SQL Server. I generate a SHA256 hash and store it as BINARY(32) and then upload the file to a WebDav server using HTTPClient. Later when I want to view the file in the app, I do a GET request, download the file, and check the SHA256 hash with the stored hash. It doesn't match :( Why?
I've tried checking the hash on the server and the local machine and it doesn't match either. I've done a ton of research and made sure I wasn't hashing the filename (you can see the code below).
public static byte[] GetSHA256(string path) {
    using (var stream = File.OpenRead(path)) {
        using (var sha256 = SHA256.Create()) {
            return sha256.ComputeHash(stream);
        }
    }
}
To Upload a file:
public async Task<bool> Upload(string path, string name) {
    var storedHash = GetSHA256(Path.Combine(path, name));
    // Store this hash in a database, omitted for brevity
    using (var file = File.OpenRead(path)) {
        var content = new MultipartFormDataContent();
        content.Headers.ContentType.MediaType = "multipart/form-data";
        content.Add(new StreamContent(file));
        var result = await HttpClient.PutAsync(uri, content);
        return result.IsSuccessStatusCode;
    }
}
To download:
var result = await HttpClient.GetAsync(uri);
using (var stream = await result.Content.ReadAsStreamAsync()) {
    var fileInfo = new FileInfo("TestFile");
    using (var fileStream = fileInfo.Open(FileMode.CreateNew, FileAccess.ReadWrite, FileShare.Delete)) {
        await stream.CopyToAsync(fileStream);
    }
}
var downloadedFileHash = GetSHA256("TestFile");
// Check if downloadedFileHash matches the storedHash by comparing byte[] length and content with a for loop.
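For what it's worth, a minimal way to do that comparison (assuming both arrays are non-null) is:
bool hashesMatch = storedHash.SequenceEqual(downloadedFileHash); // System.Linq; compares length and every byte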
I expect that the hash would match. I know I'm missing a few using statements and other code but I omitted a bunch for brevity.
EDIT: The hashes of the downloaded files stay the same across downloads, so the problem isn't downloading but uploading. I uploaded the same file multiple times but got back a different hash for each upload; each of those hashes, however, stays constant.
Sorry y'all, you can delete this question because I found the problem/answer, but I'm still confused about why this is occurring.
Turns out WebDAV was adding extra headers to my file for some reason, see: Header info being written into file when PUT-ing to a Webdav server
Strangest thing. Then I encountered this post: https://blogs.msdn.microsoft.com/robert_mcmurray/2011/10/18/sending-webdav-requests-in-net-revisited/
I rewrote my code to be:
public static async Task<HttpResponseMessage> Upload(string path, string name, FileStream file) {
    var method = new HttpMethod("PUT");
    var message = new HttpRequestMessage(method, $"{path}/{name}") {
        Content = new StreamContent(file)
    };
    return await HttpClient.SendAsync(message);
}
And it works... But I wonder how the two methods of uploading differ.
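For anyone else wondering: the likely explanation is that the first version wraps the file in a multipart/form-data body, and a plain WebDAV PUT stores that whole body verbatim. Roughly, the server receives something like:
--<boundary-guid>
Content-Disposition: form-data

<raw file bytes>
--<boundary-guid>--
The rewritten version sends only the raw file bytes, so only it stores a byte-for-byte copy of the file, and only its hash matches the stored one.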

File API seems to always write corrupt files when used in a loop, except for the last file

I know the title is long, but it describes the problem exactly. I didn't know how else to explain it because this is totally out there.
I have a utility written in C# targeting .NET Core 2.1 that downloads and decrypts (AES encryption) files originally uploaded by our clients from our encrypted store, so they can be reprocessed through some of our services in the case that they fail. This utility is run via CLI using database IDs for the files as arguments, for example download.bat 101 102 103 would download 3 files with the corresponding IDs. I'm receiving byte data through a message queue (really not much more than a TCP socket) which describes a .TIF image.
I have a good reason to believe that the byte data is not ever corrupted on the server. That reason is when I run the utility with only one ID parameter, such as download.bat 101, then it works just fine. Furthermore, when I run it with multiple IDs, the last file that is downloaded by the utility is always intact, but the rest are always corrupted.
This odd behavior has persisted across two different implementations for writing the byte data to a file. Those implementations are below.
File.WriteAllBytes implementation:
private static void WriteMessageContents(FileServiceResponseEnvelope envelope, string destination, byte[] encryptionKey, byte[] macInitialVector)
{
    using (var inputStream = new MemoryStream(envelope.Payload))
    using (var outputStream = new MemoryStream(envelope.Payload.Length))
    {
        var sha512 = YellowAesEncryptor.DecryptStream(inputStream, outputStream, encryptionKey, macInitialVector, 0);
        File.WriteAllBytes(destination, outputStream.ToArray());
        _logger.LogStatement($"Finished writing [{envelope.Payload.Length} bytes] to [{destination}].", LogLevel.Debug);
    }
}
FileStream implementation:
private static void WriteMessageContents(FileServiceResponseEnvelope envelope, string destination, byte[] encryptionKey, byte[] macInitialVector)
{
    using (var inputStream = new MemoryStream(envelope.Payload))
    using (var outputStream = new MemoryStream(envelope.Payload.Length))
    {
        var sha512 = YellowAesEncryptor.DecryptStream(inputStream, outputStream, encryptionKey, macInitialVector, 0);
        using (FileStream fs = new FileStream(destination, FileMode.Create))
        {
            var bytes = outputStream.ToArray();
            fs.Write(bytes, 0, envelope.Payload.Length);
            _logger.LogStatement($"File byte content: [{string.Join(", ", bytes.Take(16))}]", LogLevel.Trace);
            fs.Flush();
        }
        _logger.LogStatement($"Finished writing [{envelope.Payload.Length} bytes] to [{destination}].", LogLevel.Debug);
    }
}
This method is called from a for loop which first receives the messages I described earlier and then feeds their payloads to the above method:
using (var requestSocket = new RequestSocket(fileServiceEndpoint))
{
    // Envelopes is constructed beforehand
    foreach (var envelope in envelopes)
    {
        var timer = Stopwatch.StartNew();
        requestSocket.SendMoreFrame(messageTypeBytes);
        requestSocket.SendMoreFrame(SerializationHelper.SerializeObjectToBuffer(envelope));
        if (!requestSocket.TrySendFrame(_timeout, signedPayloadBytes, signedPayloadBytes.Length))
        {
            var message = $"Timeout exceeded while processing [{envelope.ActionType}] request.";
            _logger.LogStatement(message, LogLevel.Error);
            throw new Exception(message);
        }
        var responseReceived = requestSocket.TryReceiveFrameBytes(_timeout, out byte[] responseBytes);
        ...
        var responseEnvelope = SerializationHelper.DeserializeObject<FileServiceResponseEnvelope>(responseBytes);
        ...
        _logger.LogStatement($"Received response with payload of [{responseEnvelope.Payload.Length} bytes].", LogLevel.Info);
        var destDir = downloadDetails.GetDestinationPath(responseEnvelope.FileId);
        if (!Directory.Exists(destDir))
            Directory.CreateDirectory(destDir);
        var dest = Path.Combine(destDir, idsToFileNames[responseEnvelope.FileId]);
        WriteMessageContents(responseEnvelope, dest, encryptionKey, macInitialVector);
    }
}
I also know that TIFs have a very specific header, which looks something like this in raw bytes:
[73, 73, 42, 0, 8, 0, 0, 0, 20, 0...
It always begins with "II" (73, 73) or "MM" (77, 77), followed by 42 (probably a Hitchhiker's reference). I analyzed the bytes written by the utility. The last file always has a header that resembles this one; the rest are always random bytes, seemingly jumbled or mis-ordered image binary data. Any insight on this would be greatly appreciated because I can't wrap my mind around what I would even need to do to diagnose this.
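As an aside, a quick way to spot the corrupt outputs is to check just those magic bytes; a hypothetical helper along those lines:
// Hypothetical helper: true if the buffer starts with a valid TIFF header,
// i.e. "II" (little-endian) followed by 42, 0 or "MM" (big-endian) followed by 0, 42.
static bool LooksLikeTiff(byte[] bytes) =>
    bytes.Length >= 4 &&
    ((bytes[0] == 73 && bytes[1] == 73 && bytes[2] == 42 && bytes[3] == 0) ||
     (bytes[0] == 77 && bytes[1] == 77 && bytes[2] == 0 && bytes[3] == 42));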
UPDATE
I was able to figure out this problem with the help of elgonzo in the comments. Sometimes it isn't a direct answer that helps, but someone picking your brain until you look in the right place.
All right, as I suspected, this was a dumb mistake (I had severe doubts that the File API was simply this flawed for so long); I just needed help thinking through it. There was an additional bit of code, which I didn't post, that was biting me: it retrieves the metadata for the file so that I can then request the file from our storage box.
byte[] encryptionKey = null;
byte[] macInitialVector = null;
...
using (var conn = new SqlConnection(ConnectionString))
using (var cmd = new SqlCommand(uploadedFileQuery, conn))
{
    conn.Open();
    var reader = cmd.ExecuteReader();
    while (reader.Read())
    {
        FileServiceMessageEnvelope readAllEnvelope = null;
        var originalFileName = reader["UploadedFileClientName"].ToString();
        var fileId = Convert.ToInt64(reader["UploadedFileId"].ToString());
        //var originalFileExtension = originalFileName.Substring(originalFileName.IndexOf('.'));
        //_logger.LogStatement($"Scooped extension: {originalFileExtension}", LogLevel.Trace);
        envelopes.Add(readAllEnvelope = new FileServiceMessageEnvelope
        {
            ActionType = FileServiceActionTypeEnum.ReadAll,
            FileType = FileTypeEnum.UploadedFile,
            FileName = reader["UploadedFileServerName"].ToString(),
            FileId = fileId,
            WorkerAuthorization = null,
            BinaryTimestamp = DateTime.Now.ToBinary(),
            Position = 0,
            Count = Convert.ToInt32(reader["UploadedFileSize"]),
            SignerFqdn = _messengerConfig.FullyQualifiedDomainName
        });
        readAllEnvelope.SignMessage(_messengerConfig.PrivateKeyBytes, _messengerConfig.PrivateKeyPassword);
        signedPayload = new SecureMessage { Payload = new byte[0] };
        signedPayload.SignMessage(_messengerConfig.PrivateKeyBytes, _messengerConfig.PrivateKeyPassword);
        signedPayloadBytes = SerializationHelper.SerializeObjectToBuffer(signedPayload);
        encryptionKey = (byte[])reader["UploadedFileEncryptionKey"];
        macInitialVector = (byte[])reader["UploadedFileEncryptionMacInitialVector"];
    }
    conn.Close();
}
Eagle-eyed observers might realize that I have not properly coupled the encryptionKey and macInitialVector to the correct record, since each file has a unique key and vector. This means I was using the key for one of the files to decrypt all of them, which is why they were all corrupt except for one file: they were not properly decrypted. I solved this issue by coupling them together with the ID in a simple POCO and retrieving the appropriate key and vector for each file upon decryption.
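A rough sketch of that fix (the type and member names here are hypothetical, not the actual code):
// Hypothetical POCO keeping each file's key and IV together with its ID,
// instead of letting the last row read overwrite a single shared pair.
class FileCryptoInfo
{
    public long FileId { get; set; }
    public byte[] EncryptionKey { get; set; }
    public byte[] MacInitialVector { get; set; }
}

// While reading the SqlDataReader:
//   cryptoByFileId[fileId] = new FileCryptoInfo { FileId = fileId, EncryptionKey = ..., MacInitialVector = ... };
// And when writing each downloaded file:
//   var crypto = cryptoByFileId[responseEnvelope.FileId];
//   WriteMessageContents(responseEnvelope, dest, crypto.EncryptionKey, crypto.MacInitialVector);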

How to stream archive to S3 by parts in lambda function?

I have stored a lot of a user's files in Amazon S3. I have to provide all of these files to the user on request.
For this purpose, I have implemented a Lambda function that collects the paths to the user's files, creates a zip archive, and stores this archive back on S3, where the user can download it.
My code looks like:
using (var s3Client = new AmazonS3Client())
{
    using (var memoryStream = new MemoryStream())
    {
        using (var zip = new ZipArchive(memoryStream, ZipArchiveMode.Create, true))
        {
            foreach (var file in m_filePathsOnS3)
            {
                var response = await s3Client.GetObjectAsync(m_sourceBucket, file);
                var name = file.Split('/').Last();
                ZipArchiveEntry entry = zip.CreateEntry(name);
                using (Stream entryStream = entry.Open())
                {
                    await response.ResponseStream.CopyToAsync(entryStream);
                }
            }
        }
        memoryStream.Position = 0;
        var putRequest = new PutObjectRequest
        {
            BucketName = m_resultBucket,
            Key = m_archivePath,
            InputStream = memoryStream
        };
        await s3Client.PutObjectAsync(putRequest);
    }
}
But a Lambda function is limited to 3008 MB of memory, so if I understand correctly, I will run into a problem when trying to build an archive larger than 3008 MB.
I looked for a way to stream and archive the files on the fly.
Currently, I see only one way: move this Lambda function to an EC2 instance as a service.
You are correct. There is also a limit of 512 MB of disk space (in /tmp), which would impact this.
Therefore, AWS Lambda is not a good use-case for creating potentially very large zip files.

How to get the stream for a Multipart file in webapi upload?

I need to upload a file using a Stream (to Azure Blob Storage), and I just cannot figure out how to get the stream from the object itself. See the code below.
I'm new to Web API and have used some examples. I'm getting the files and file data, but it's not the correct type for my upload methods. Therefore, I need to get or convert it into a normal Stream, which seems a bit hard at the moment :)
I know I need to use ReadAsStreamAsync().Result in some way, but it crashes in the foreach loop since I'm getting two provider.Contents (the first one seems right, the second one does not).
[System.Web.Http.HttpPost]
public async Task<HttpResponseMessage> Upload()
{
    if (!Request.Content.IsMimeMultipartContent())
    {
        return this.Request.CreateResponse(HttpStatusCode.UnsupportedMediaType);
    }
    var provider = GetMultipartProvider();
    var result = await Request.Content.ReadAsMultipartAsync(provider);

    // On upload, files are given a generic name like "BodyPart_26d6abe1-3ae1-416a-9429-b35f15e6e5d5"
    // so this is how you can get the original file name
    var originalFileName = GetDeserializedFileName(result.FileData.First());

    // uploadedFileInfo object will give you some additional stuff like file length,
    // creation time, directory name, a few filesystem methods etc..
    var uploadedFileInfo = new FileInfo(result.FileData.First().LocalFileName);

    // Remove this line as well as GetFormData method if you're not
    // sending any form data with your upload request
    var fileUploadObj = GetFormData<UploadDataModel>(result);

    Stream filestream = null;
    using (Stream stream = new MemoryStream())
    {
        foreach (HttpContent content in provider.Contents)
        {
            BinaryFormatter bFormatter = new BinaryFormatter();
            bFormatter.Serialize(stream, content.ReadAsStreamAsync().Result);
            stream.Position = 0;
            filestream = stream;
        }
    }

    var storage = new StorageServices();
    storage.UploadBlob(filestream, originalFileName);

    return Request.CreateResponse(HttpStatusCode.OK);
}

private MultipartFormDataStreamProvider GetMultipartProvider()
{
    var uploadFolder = "~/App_Data/Tmp/FileUploads"; // you could put this to web.config
    var root = HttpContext.Current.Server.MapPath(uploadFolder);
    Directory.CreateDirectory(root);
    return new MultipartFormDataStreamProvider(root);
}
This is identical to a dilemma I had a few months ago (capturing the upload stream before the MultipartStreamProvider took over and auto-magically saved the stream to a file). The recommendation was to inherit that class and override the methods ... but that didn't work in my case. :( (I wanted the functionality of both the MultipartFileStreamProvider and MultipartFormDataStreamProvider rolled into one MultipartStreamProvider, without the autosave part).
This might help; here's one written by one of the Web API developers, and this from the same developer.
Hi, I just wanted to post my answer so that if anybody encounters the same issue they can find a solution here:
MultipartMemoryStreamProvider stream = await this.Request.Content.ReadAsMultipartAsync();
foreach (var st in stream.Contents)
{
    var fileBytes = await st.ReadAsByteArrayAsync();
    string base64 = Convert.ToBase64String(fileBytes);
    var contentHeader = st.Headers;
    string filename = contentHeader.ContentDisposition.FileName.Replace("\"", "");
    string filetype = contentHeader.ContentType.MediaType;
}
I used MultipartMemoryStreamProvider and got all the details, like the filename and file type, from the headers of each content part.
Hope this helps someone.
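If you need an actual Stream rather than a byte[] (e.g. for a blob upload API), the same provider can give you one per part. A minimal sketch, with the blob upload call left as a placeholder:
MultipartMemoryStreamProvider provider = await Request.Content.ReadAsMultipartAsync();
foreach (var part in provider.Contents)
{
    string fileName = part.Headers.ContentDisposition.FileName.Replace("\"", "");
    using (Stream partStream = await part.ReadAsStreamAsync())
    {
        // e.g. storage.UploadBlob(partStream, fileName); // placeholder for the OP's upload method
    }
}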
