How to remove Header from a CSV FileStream - c#

I have to work within a certain set of limitations for a CSV file upload:
I will be working with 'large' CSV files (each containing a header row).
I need to remove the first (header) row from the CSV file.
The file-upload code needs a FileStream (not containing the header) as input, because I am required to do a lot of stream operations on top of this stream of headerless CSV data.
Wrapper C# code:
using (var stream = File.OpenRead("C:\~~~\~~~\~~~\SampleFile.csv"))
{
//CSV Header removal snippet - which gives me a new stream containing data without headers.
~
~
~
~
//All my stream handling code of chunking stream into 100mb and then uploading each chunk to azure storage (which is not part of this question)
}
Now, I already know that I can simply remove the header of a CSV file using libraries like CsvHelper (How to exclude header when writing data to CSV).
That way I could create a header-less copy of the file and read the new file back as a FileStream, but the problem is that I'm dealing with large files, and making a copy of a file just to remove the header is a space-consuming job.
So, for the first time, I am asking a question on Stack Overflow to find a good solution to the above problem. I hope I was able to explain it clearly.

This should work to seek the stream to just past the end of the first line (note that it assumes a single-byte encoding with no BOM and \r\n line endings):
using (var stream = File.OpenRead("~~filepath~~"))
using (var reader = new StreamReader(stream))
{
string line;
if ((line = reader.ReadLine()) != null)
{
// line.Length counts characters, and the +2 accounts for the "\r\n" newline
stream.Position = line.Length + 2;
}
//All my stream handling code of chunking stream into 100mb and then uploading each chunk to azure storage (which is not part of this question)
}
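An alternative that is a bit more tolerant of line endings is to skip the header by scanning the raw bytes for the first '\n' instead of relying on line.Length. A minimal sketch, assuming ASCII/UTF-8 content (it will not work for UTF-16); SkipFirstLine is a hypothetical helper name:

using System.IO;

// Advance a readable stream past the first text line by scanning raw bytes
// for '\n'. Handles both "\n" and "\r\n" endings, and also skips a UTF-8 BOM
// along with the header line; not suitable for UTF-16 content.
static void SkipFirstLine(Stream stream)
{
    int b;
    while ((b = stream.ReadByte()) != -1)
    {
        if (b == '\n')
        {
            break; // the stream is now positioned at the first data byte
        }
    }
}

// Usage (path shortened, as in the question):
// using (var stream = File.OpenRead(@"C:\...\SampleFile.csv"))
// {
//     SkipFirstLine(stream);
//     // chunking / upload code operates on the headerless remainder
// }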

Related

Read and write to the same csv file

I have a CSV file (e.g. Directories.csv) which contains a huge list of directories. I am looping through the directories from the CSV using a StreamReader and performing some task on each one. I am adding each completed directory to a dictionary and I am stuck at this step.
Ask: I want to record in the same CSV which directories are complete, in case the application crashes or the server reboots, so that I don't have to re-iterate over the directories that were already completed. (Or) delete the completed directories' rows from the CSV.
I checked online for suggestions; most recommend creating a temp file and then moving/replacing the original with it. Would that still hold up if the server reboots or the application crashes? Please suggest how I can take this forward.
My code:
Dictionary<string, string> directoryDictionary = new Dictionary<string, string>();
using (FileStream fileStreamDirectory = File.Open(outputdir + "\\Directories.csv", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (BufferedStream bufferStreamDirectory = new BufferedStream(fileStreamDirectory))
using (StreamReader streamReaderDirectory = new StreamReader(bufferStreamDirectory))
{
string Directoryline;
while ((Directoryline = streamReaderDirectory.ReadLine()) != null)
{
// Do the task here
directoryDictionary.Add(Directoryline, "Completed");
}
}
You can't really insert data into the middle of a text file (unless it is a fixed-width format, which is not the case for CSV).
Two options:
read into memory, update the in-memory data, and rewrite the whole table back to the file (you may need to keep the previous version in case of write failures) - see the sketch after this list
use a database that satisfies your criteria and import the CSV there to work with.
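For the first option, a minimal sketch (reusing outputdir and directoryDictionary from the question; the .tmp/.bak names are just illustrative, and this should run after the StreamReader over Directories.csv has been closed):

using System.IO;
using System.Linq;

string csvPath = Path.Combine(outputdir, "Directories.csv");
string tempPath = csvPath + ".tmp";
string backupPath = csvPath + ".bak";

// Keep only the rows that have not been completed yet.
var remaining = File.ReadLines(csvPath)
                    .Where(line => !directoryDictionary.ContainsKey(line))
                    .ToList();

// Write to a temp file first, then swap it in atomically so a crash
// mid-write never corrupts the original; the previous version is kept as a backup.
File.WriteAllLines(tempPath, remaining);
File.Replace(tempPath, csvPath, backupPath);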

Write to S3 using PutObjectRequest while still generating stream

I am converting an application that currently uses the Windows file system to read and store files.
While reading each line of an input file, it modifies the data, and then writes it out to an output file:
string line;
using (var writer = new StreamWriter(@"C:\temp\out.txt", false))
{
using (var reader = new StreamReader(@"C:\temp\in.txt", Encoding.UTF8))
{
while ((line = reader.ReadLine()) != null)
{
//Create modifiedLine from line data
...
writer.WriteLine(modifiedLine);
}
}
}
So far, I have been able to write to S3 using a PutObjectRequest, but only with the entire file contents at once:
//Set up stream
var stream = new MemoryStream();
var writer = new StreamWriter(stream);
writer.Write(theEntireModifiedFileContents);
writer.Flush();
stream.Position = 0;
var putRequest = new PutObjectRequest()
{
BucketName = destinationBucket,
Key = destinationKey,
InputStream = stream
};
var response = await s3Client.PutObjectAsync(putRequest);
Given that these are going to be large files, I would prefer to keep the line-by-line approach rather than having to send the entire file contents at once.
Is there any way to maintain a similar behavior to the file system example above with S3?
S3 is an object store and does not support modifications in-place, appending, etc.
However, it is possible to meet your goals if certain criteria are met / understood:
1) Realize that it will take more code to do this than simply modifying your code to buffer the line output and then upload it as a single object.
2) You can upload each line as part of the REST API PUT stream. This means that you will need to continuously upload data until complete. Basically you are doing a slow upload of a single S3 object while you process each line.
3) You can use the multipart API to upload lines as parts of a multipart transfer, then combine the parts once complete. Note: parts do not have to be the same size, but every part except the last must be at least 5 MB, and the total number of parts is limited to 10,000. In practice this means buffering lines into parts of at least 5 MB each, so either go back to method #1 or add buffering to keep the part count and sizes within those limits (a sketch follows below).
Unless you are a really motivated developer, realize that method #1 is going to be far easier to implement and test. Methods #2 and #3 will require you to understand how S3 works at a very low level using HTTP PUT requests.
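For reference, here is a rough sketch of method #3 using the AWS SDK for .NET. It reuses s3Client, destinationBucket, and destinationKey from the question, buffers modified lines into parts of at least 5 MB before each UploadPartAsync call, and omits error handling (a production version should call AbortMultipartUploadAsync if anything fails so the incomplete upload doesn't linger):

using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Threading.Tasks;
using Amazon.S3.Model;

const int minPartSize = 5 * 1024 * 1024; // S3 minimum size for every part except the last

var init = await s3Client.InitiateMultipartUploadAsync(new InitiateMultipartUploadRequest
{
    BucketName = destinationBucket,
    Key = destinationKey
});

var partETags = new List<PartETag>();
var sb = new StringBuilder();
int partNumber = 1;

// Uploads whatever has accumulated in sb as the next part, then resets the buffer.
async Task FlushPartAsync()
{
    using (var partStream = new MemoryStream(Encoding.UTF8.GetBytes(sb.ToString())))
    {
        var partResponse = await s3Client.UploadPartAsync(new UploadPartRequest
        {
            BucketName = destinationBucket,
            Key = destinationKey,
            UploadId = init.UploadId,
            PartNumber = partNumber,
            InputStream = partStream
        });
        partETags.Add(new PartETag(partNumber, partResponse.ETag));
    }
    partNumber++;
    sb.Clear();
}

using (var reader = new StreamReader(@"C:\temp\in.txt", Encoding.UTF8))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        var modifiedLine = line; // placeholder: create modifiedLine from line data, as in the question
        sb.AppendLine(modifiedLine);

        // UTF-8 never produces fewer bytes than chars, so this guarantees a part of >= 5 MB.
        if (sb.Length >= minPartSize)
        {
            await FlushPartAsync();
        }
    }
}

if (sb.Length > 0)
{
    await FlushPartAsync(); // only the final part may be smaller than 5 MB
}

var completeRequest = new CompleteMultipartUploadRequest
{
    BucketName = destinationBucket,
    Key = destinationKey,
    UploadId = init.UploadId
};
completeRequest.AddPartETags(partETags);
await s3Client.CompleteMultipartUploadAsync(completeRequest);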

Out Of Memory Exception in Foreach

I am trying to create a function that will retrieve all the uploaded files (which are stored as byte arrays in the database) and download them as a single zip file. I currently have 6000 files to download (and the number could grow).
The functionality already works (from retrieval to download) if I limit the number of files being downloaded; otherwise, I get an OutOfMemoryException in the foreach loop.
Here's pseudocode (the files variable is a list of byte arrays and file names):
var files = getAllFilesFromDb();
foreach (var file in files)
{
var tempFilePath = Path.Combine(path, file.filename);
using (FileStream stream = new FileStream(tempFilePath, FileMode.Create, FileAccess.ReadWrite))
{
stream.Write(file.fileData, 0, file.fileData.Length);
}
}
private readonly IEntityRepository<File> fileRepository;
IEnumerable<FileModel> getAllFilesFromDb()
{
return fileRepository.Select(f => new FileModel(){ fileData = f.byteArray, filename = f.fileName});
}
My question is, is there any other way to do this to avoid getting such errors?
To avoid this problem, you could avoid loading all the contents of all the files in one go. Most likely you will need to split your database call into two database calls.
Retrieve a list of all the files without their contents but with some identifier - like the PK of the table.
A method which retrieves the contents of an individual file.
Then your (pseudo)code becomes
get list of all files
for each file
get the file contents
write the file to disk
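A minimal sketch of that, with GetFileIdsFromDb and GetFileById as hypothetical stand-ins for the two repository calls (path, filename, and fileData come from the question's code):

// Hypothetical repository methods, assumed for illustration:
//   IEnumerable<int> GetFileIdsFromDb()  - returns only the primary keys
//   FileModel GetFileById(int id)        - loads one file's name and bytes
foreach (var id in GetFileIdsFromDb())
{
    var file = GetFileById(id);   // only one file's contents in memory at a time
    var tempFilePath = Path.Combine(path, file.filename);
    File.WriteAllBytes(tempFilePath, file.fileData);
}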
Another possibility is to alter the way your query currently works so that it uses deferred execution - this means it will not actually load all the files at once, but stream them one at a time from the database - but without seeing more code from your repository implementation, I cannot / will not guess the right solution for you.

Generate files and ZIP without memory stream

I'm looking for a way to store files in a zip file without a memory stream. My goal is to use as little system memory as possible; direct disk IO is no problem.
I iterate over a database result set where I have collected some blobs. These are byte arrays.
What I do is the following (System.IO.Compression):
using (var archive = ZipFile.Open("data.zip", ZipArchiveMode.Update))
{
foreach (var result in results)
{
string fileName = $"{result.Id}.bin";
using (var fileStream = new FileStream(fileName, FileMode.Create, FileAccess.Write))
{
// write the blob data from result.Value
fileStream.Write(result.Value, 0, result.Value.Length);
fileStream.Close();
}
archive.CreateEntryFromFile(fileName, fileName);
}
}
There are 2 problems with this implementation.
I end up with my *.bin files AND the one *.zip (I only need the zip)
I don't know why, but this uses a lot of RAM (~100 MB for 15 x 1.5 MB bin files)
Is there a way to bypass the memory usage completely?
UPDATE:
What I'm trying to achieve is to generate one ZIP file that contains single binary files generated from database blobs. This should happen inside an ASP.NET Web API controller. A user can request the data, but instead of sending all of it in the HTTP response, I generate the ZIP file at the time of the request, save it to a local file server, and send a download link back to the user.
I think your >100 MB is coming from:
the results object, which should contain at least 15 x 1.5 MB of blob data
holding the resulting data.zip open in Update mode for the whole foreach scope (ZipArchiveMode.Update keeps the archive contents in memory).
To minimize the RAM footprint of the worker process:
create empty zip-file
do {
(single BLOB query from DB)
(write blob to new or overwrite File)
(open zip file for append)
(append file to zip)
(close and dispose **both** file handles / objects )
}
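If the individual *.bin files on disk are not actually needed, a variant of this sketch can write each blob straight into a zip entry using ZipArchiveMode.Create, which streams entries through to the output file instead of holding the whole archive in memory the way Update mode does (result.Id and result.Value are from the question's code):

using System.IO;
using System.IO.Compression;

// Stream each blob directly into its own zip entry; no temporary *.bin files
// and no whole-archive buffering.
using (var zipStream = new FileStream("data.zip", FileMode.Create, FileAccess.Write))
using (var archive = new ZipArchive(zipStream, ZipArchiveMode.Create))
{
    foreach (var result in results)
    {
        var entry = archive.CreateEntry($"{result.Id}.bin", CompressionLevel.Optimal);
        using (var entryStream = entry.Open())
        {
            entryStream.Write(result.Value, 0, result.Value.Length);
        }
    }
}

Fetching the blobs one at a time from the database, as in the loop sketched above, still applies for keeping the results data itself out of memory.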

Using the SharpSVN api are there any methods available to get the number of lines contained in a file at a Revision without Exporting it?

I was just wondering if I missed anything inside the documentation that would allow me to get the number of lines contained in a file at a certain revision (or even number of lines changed from a SvnChangeItem, that would be nice too) without having to directly export the file to the filesystem and parse through it counting each line.
Any help would be appreciated. Thanks.
Nope, you're stuck with exactly the solution you named: export to a temp file, count the lines, delete the file. It's a fairly expensive operation if you're doing this file-by-file. It may be better to fetch the entire repo if you need to line-count every file, and reuse the working directory for future runs.
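A rough sketch of the export-and-count approach (urlToFile and revision are placeholders for the target URL and the SvnRevision you want; the temp path is illustrative):

using System.IO;
using System.Linq;
using SharpSvn;

using (var client = new SvnClient())
{
    string tempPath = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());

    // Export the file as it was at the wanted revision, count its lines, then clean up.
    client.Export(new SvnUriTarget(urlToFile, revision), tempPath);
    int lineCount = File.ReadLines(tempPath).Count();
    File.Delete(tempPath);
}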
The metadata (like current line count) is not stored in the repository, but you can get the file contents without messy temp files.
For brevity, I've excluded the code to iterate over revisions, etc.
using (var client = new SvnClient())
{
using (MemoryStream memoryStream = new MemoryStream())
{
client.Write(new SvnUriTarget(urlToFile), memoryStream);
memoryStream.Position = 0;
var streamReader = new StreamReader(memoryStream);
int lineCount = 0;
while (streamReader.ReadLine() != null)
{
lineCount++;
}
}
}
