Split CSV files into exact 1gb files or little less? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 9 years ago.
Every month we receive an invoice file that is always larger than 2 GB. Our print house has a 1.1 GB limit, and we currently do all of this processing by hand.
The first step in this application would be to split those huge 2 GB files into files of at most 1 GB each, without breaking any CSV entry, so that each file is readable from start to end with no data cut off.
How could I split the file to meet the above requirements?
Are there any libraries for this kind of processing of CSV files?

How about just copying the first 1 GB of data from the source into a new file, then searching backward for the last newline and truncating the new file after it? Then you know how large the first file is, and you repeat the process for a second new file, starting from that point and running up to 1 GB further. Seems straightforward to me in just about any language (you mentioned C#, which I haven't used recently, but certainly it can easily do the job).
You didn't make it clear whether you need to copy the header line (if any) to each of the resulting files. Again, should be straightforward--just do it prior to the copying of data into each of the files.
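A minimal C# sketch of this idea follows. It is a streaming variant rather than the copy-then-truncate approach described above: it reads the source one line at a time as raw bytes and starts a new part whenever the next line would push the current part over the limit. The class name `CsvSplitter`, the output naming scheme, and the parameters are hypothetical. It assumes an encoding in which a line ends with the byte 0x0A (ASCII, UTF-8, Latin-1; not UTF-16), and it does not handle quoted CSV fields that contain embedded newlines.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class CsvSplitter
{
    // Splits sourcePath into numbered part files of at most maxBytes each,
    // cutting only directly after a '\n' byte. Returns the paths written.
    public static List<string> Split(string sourcePath, string outputPrefix,
                                     long maxBytes, bool repeatHeader)
    {
        var parts = new List<string>();
        using (var input = new BufferedStream(File.OpenRead(sourcePath), 1 << 16))
        {
            // Raw bytes of the first line (including any BOM) when it is a header.
            byte[] header = repeatHeader ? ReadLine(input) : new byte[0];
            FileStream output = null;
            long written = 0;
            try
            {
                byte[] line;
                while ((line = ReadLine(input)).Length > 0)
                {
                    if (header.Length + line.Length > maxBytes)
                        throw new InvalidDataException("A single line exceeds the size limit.");

                    // Start a new part when this line would push the current one over the limit.
                    if (output == null || written + line.Length > maxBytes)
                    {
                        output?.Dispose();
                        string path = outputPrefix + "." + (parts.Count + 1).ToString("D3") + ".csv";
                        output = File.Create(path);
                        parts.Add(path);
                        output.Write(header, 0, header.Length);
                        written = header.Length;
                    }
                    output.Write(line, 0, line.Length);
                    written += line.Length;
                }
            }
            finally
            {
                output?.Dispose();
            }
        }
        return parts;
    }

    // Reads bytes up to and including the next '\n'; returns an empty array at end of file.
    private static byte[] ReadLine(Stream stream)
    {
        var buffer = new MemoryStream();
        int b;
        while ((b = stream.ReadByte()) != -1)
        {
            buffer.WriteByte((byte)b);
            if (b == '\n') break;
        }
        return buffer.ToArray();
    }
}
```

A call such as `CsvSplitter.Split("invoices.csv", "invoices", 1000L * 1000 * 1000, repeatHeader: true)` (file names are placeholders) would write `invoices.001.csv`, `invoices.002.csv`, and so on. Because the bytes are copied unchanged, the original encoding is preserved.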
You could also take the approach of just generically splitting the files using tar on Unix or some Zip-like utility on Windows, then telling your large-file-challenged partner to reconstruct the file from that format. Or maybe simply compressing the CSV file would work, and get you under the limit in practice.

There are just a few things you need to take care of:
Keep the line breaks: split the file at a newline. Algorithmically, cut at the end of the last complete line before the point where the 1 GB limit, minus the size of the header line, is reached.
Copy the header to the beginning of each new file, then append the rest.
Preserve the encoding.

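In a bash/terminal prompt, read:
man split
and then
man wc
Count the number of lines in the file with wc -l, divide the line count by X (where X = file size / 1.1 GB, rounded up), rounding up, and pass the result to split -l. You then have X files of under 1.1 GB each, provided the lines are of roughly similar length.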

Related

Optimized parse text file, to then upload to Excel [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
My project is to take large volumes of logs, written to text files, and parse some of the data into Excel format.
There is a lot of unneeded garbage data in between.
This is what one portion of it looks like:
2018-05-17 07:16:57.105>>>>>>
{"status":"success", "statusCode":"0", "statusDesc":"Message Processed Sucessfully", "messageNumber":"451", "payload":{"messageCode":"SORTRESPONSE","Id":"5L","Id":"28032","messageTimestamp":"2018-05-16 23:16:55"}}
I first need to take the timestamp that appears before the "{}",
as it differs from the messageTimestamp.
When the Excel workbook is generated,
this is how it should look:
   A                        B        C
1. Overall time stamp       status   statusCode
2. 2018-05-17 07:16:57.105  success  0
And so on...
payload has its own section of log data within its "{}",
so its section in Excel will look like this:
   F
1. payload
2. {"messageCode":"SORTRESPONSE","Id":"5L","Id":"28032","messageTimestamp":"2018-05-16 23:16:55"}
Its content can stay in one section; that's not an issue.
A friend of mine has done something similar, but it can take a few minutes to generate even one relatively small Excel document.
My question:
What is the most efficient way to parse the data I need and store it in an array or multidimensional array,
so that I can then push it into an Excel document?

I would try to split the input text on newline characters, then parse the JSON part with Newtonsoft.Json.
I would strongly advise against trying to parse the JSON yourself. The bottleneck here will be disk I/O, not in-memory processing, so make it easy to write correct code and use third-party libraries.
Once you have structured data representing the input, you can write each entry to an output file with only the fields you need.
For an Excel file, is CSV okay or do you need XLSX files? For CSV you can just write to a file directly; for XLSX I would recommend the EPPlus library.
https://www.newtonsoft.com/json
https://archive.codeplex.com/?p=epplus
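As a rough sketch of the CSV route, the code below pairs each timestamp line with the JSON line that follows it and emits one CSV row per entry. To stay dependency-free it uses `System.Text.Json` (built into .NET Core 3.0 and later) instead of the Newtonsoft.Json library recommended above; the overall shape would be similar with either. The class name `LogToCsv`, the chosen columns, and the assumption that every timestamp line ends with `>>>>>>` are illustrative guesses based on the sample in the question. Malformed JSON or a missing property would throw, which a real implementation would need to handle.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class LogToCsv
{
    // Lazily converts log lines into CSV rows: timestamp, status, statusCode, raw payload.
    public static IEnumerable<string> ToCsvRows(IEnumerable<string> lines)
    {
        yield return "timestamp,status,statusCode,payload";
        string timestamp = null;
        foreach (string raw in lines)
        {
            string line = raw.Trim();
            if (line.EndsWith(">>>>>>", StringComparison.Ordinal))
            {
                // Remember the overall timestamp for the JSON line that follows.
                timestamp = line.TrimEnd('>');
            }
            else if (timestamp != null && line.StartsWith("{", StringComparison.Ordinal))
            {
                using (JsonDocument doc = JsonDocument.Parse(line))
                {
                    JsonElement root = doc.RootElement;
                    yield return string.Join(",",
                        Quote(timestamp),
                        Quote(root.GetProperty("status").GetString()),
                        Quote(root.GetProperty("statusCode").GetString()),
                        // GetRawText keeps the payload exactly as it appeared in the log.
                        Quote(root.GetProperty("payload").GetRawText()));
                }
                timestamp = null;
            }
            // Any other line is treated as garbage and skipped.
        }
    }

    // Wraps a field in double quotes and doubles any embedded quotes.
    private static string Quote(string value)
    {
        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }
}
```

Something like `File.WriteAllLines("out.csv", LogToCsv.ToCsvRows(File.ReadLines("log.txt")))` (file names are placeholders) streams the input line by line, so the whole log never has to be held in memory.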

How to detect a corrupted file before opening it in C# [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I have some files that have been corrupted, and I want to detect which file is corrupted before opening it. I tried the FileInfo class, but it was no help.

The easiest way to solve this is to have your program keep a log of when it accesses and edits a file. By keeping track of this, you can easily identify that a save was interrupted if the program exited prematurely. To do this, have the program add a log entry every time it has finished saving a file, not before the file is saved. When the program tries to open the file, check the time the file was last edited; if that time is later than the time recorded in the log, reject the file.
Of course, this system only works on one computer. There are other ways of implementing this, such as having a log at the end of the file: if the log does not exist, you know the file is corrupt. Open your mind up to more ideas and try to think of some more ways to solve this issue. This is just one example.
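The "log at the end of the file" variant could look roughly like the sketch below, assuming you control the file format. The marker text and the class name `CompletionMarker` are made up for illustration. Note that this only detects a save that was cut short; it does not detect damage in the middle of the file.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class CompletionMarker
{
    // Hypothetical trailer; any byte sequence unlikely to end a truncated file would do.
    private static readonly byte[] Marker = Encoding.ASCII.GetBytes("\n#SAVE-COMPLETE\n");

    // Writes the content first and the marker last, so an interrupted save lacks the marker.
    public static void Save(string path, byte[] content)
    {
        using (var fs = new FileStream(path, FileMode.Create, FileAccess.Write))
        {
            fs.Write(content, 0, content.Length);
            fs.Write(Marker, 0, Marker.Length);
        }
    }

    // Returns true only if the file ends with the marker.
    public static bool IsComplete(string path)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            if (fs.Length < Marker.Length) return false;
            fs.Seek(-Marker.Length, SeekOrigin.End);
            var tail = new byte[Marker.Length];
            int read = 0;
            while (read < tail.Length)
            {
                int n = fs.Read(tail, read, tail.Length - read);
                if (n == 0) return false;
                read += n;
            }
            return tail.SequenceEqual(Marker);
        }
    }
}
```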

1. Unfortunately, there is no easy way to determine whether a file is corrupt before rendering it. Problem files usually have a correct header, so the real causes of corruption lie elsewhere. In a PDF, for example, the file contains a cross-reference table giving the exact byte offset of each object from the start of the file, so corrupted files most probably have broken offsets or a missing object.
The best way to determine that a file is corrupted is to use a library specialized for that file type, such as a PDF library. There are many such libraries for .NET, both free and commercial. You can simply try to load the file with one of them. iTextSharp would be a good choice.
2. Or, if you want, you can go through this answer:
File Upload Virus Scanning(server side)

You might need to parse the header bytes of the file and make sure they are consistent with the rest of the file's contents, e.g.,
Reading image header info without loading the entire image
This is how you can read the header of an image and get the image size without opening it. In the same way, you should look at the header of the file format you care about and validate it against that format's rules. This is not a ready-made solution, but it may give you an idea.
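As one concrete illustration of a header check, the sketch below uses PDF as the example format: a PDF normally starts with the bytes `%PDF-` and has an `%%EOF` marker near its end. The class name `PdfSanityCheck` and the 1024-byte tail window are arbitrary choices for this example. Passing this check does not prove the file is intact (the offsets inside may still be broken), and some readers tolerate files that would fail it, so treat it as a cheap first filter only.

```csharp
using System;
using System.IO;
using System.Text;

public static class PdfSanityCheck
{
    // Cheap structural check: "%PDF-" at the start and "%%EOF" within the last 1024 bytes.
    public static bool LooksLikePdf(string path)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            byte[] head = ReadAt(fs, 0, 5);
            if (Encoding.ASCII.GetString(head) != "%PDF-") return false;

            int tailLength = (int)Math.Min(1024, fs.Length);
            byte[] tail = ReadAt(fs, fs.Length - tailLength, tailLength);
            return Encoding.ASCII.GetString(tail).Contains("%%EOF");
        }
    }

    // Reads up to count bytes starting at offset; returns fewer if the file is shorter.
    private static byte[] ReadAt(Stream stream, long offset, int count)
    {
        stream.Seek(offset, SeekOrigin.Begin);
        var buffer = new byte[count];
        int total = 0;
        while (total < count)
        {
            int n = stream.Read(buffer, total, count - total);
            if (n == 0) break;
            total += n;
        }
        Array.Resize(ref buffer, total);
        return buffer;
    }
}
```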

How I can read after EOF? C# [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I'm using C# and I need to read something after EOF. Is that possible in C#? How?
Thanks.

You can't. EOF means end of file; there's nothing actually in the file after that.
You may as well ask how you can get ten gallons of oil from a four-gallon drum. Once it's empty, there's no more to be had.
Since you're talking C#, hence Windows (and based on your comment about data located behind the end-of-file pointer), it's possible that they may be referring to "DOS mode" text files, which are (or used to be, I haven't investigated recently) terminated by the CTRL-Z character.
From the earliest days of the PC revolution, where CP/M used integral numbers of disk blocks to store a file and only stored the number of disk blocks rather than the number of bytes, CTRL-Z was used to indicate end of file if the file wasn't an exact multiple of the disk block size.
If that's the case, it's probably best just to open the file as a binary file, then read up to the first CTRL-Z character (code point 26) - everything beyond that could be considered data beyond EOF if it's truly a text file of that format.
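A small sketch of that idea, assuming the whole file fits in memory: read the bytes, find the first CTRL-Z (byte value 26), and treat everything before it as the text and everything after it as the "data beyond EOF". The class and method names are hypothetical.

```csharp
using System;

public static class CtrlZText
{
    // Splits raw file bytes at the first CTRL-Z (0x1A). The marker itself is dropped.
    // If no CTRL-Z is present, all bytes count as text and Trailing is empty.
    public static (byte[] Text, byte[] Trailing) SplitAtCtrlZ(byte[] data)
    {
        int index = Array.IndexOf(data, (byte)26);
        if (index < 0) return (data, new byte[0]);

        var text = new byte[index];
        Array.Copy(data, 0, text, 0, index);

        var trailing = new byte[data.Length - index - 1];
        Array.Copy(data, index + 1, trailing, 0, trailing.Length);

        return (text, trailing);
    }
}
```

It could be called as `CtrlZText.SplitAtCtrlZ(File.ReadAllBytes(path))`, where `path` is whatever file is being inspected. Opening the file as binary this way avoids any text-mode handling of the CTRL-Z byte.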

File Paths Are Too Long - Crashing FTP Transfers [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
I am using a licensed version of CuteFTP to transfer files (thousands of them) from one server to another.
The problem I am facing now is that most of the FTP transfers fail because the file paths are too long.
On average, the length of a file path is between 200 and 250 characters.
I cannot shorten the file names manually, as there is a huge number of files.
Any ideas or suggestions to overcome this problem?

This is a limitation of Windows, more specifically of the Win32 API (MAX_PATH) rather than of the NTFS file system itself. MAX_PATH allows you to create files whose total length (path plus file name) is at most 260 characters. The easy way is to use Robocopy, which can deal with such file names; if you are bound to FTP, you will get an error when the target file name is too long. The only easy way out then is to create a zip file of the files in question and transfer the zip file. This is a good idea anyway, since transferring thousands of individual files over the wire is much slower than streaming one big file that is 2-4 times smaller than the original data.
As a bonus, you get rid of the long file names until you try to unpack them. At that point, choose a shallow root directory for unpacking so that the full paths stay short.
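If the zipping is to be scripted from .NET rather than done by hand, a minimal sketch could use the `ZipFile` class from `System.IO.Compression` (available since .NET Framework 4.5). The wrapper class and parameter names here are hypothetical. The zip file should be created outside the source directory, and on older .NET Framework versions very long paths may still raise `PathTooLongException`.

```csharp
using System.IO.Compression;

public static class ZipTransfer
{
    // Packs a whole directory tree into a single zip file for transfer.
    public static void Pack(string sourceDirectory, string zipPath)
    {
        ZipFile.CreateFromDirectory(sourceDirectory, zipPath);
    }

    // Unpacks on the receiving side; a shallow target directory keeps full paths short.
    public static void Unpack(string zipPath, string shallowTargetDirectory)
    {
        ZipFile.ExtractToDirectory(zipPath, shallowTargetDirectory);
    }
}
```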

How do I print a check? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 9 years ago.
I need to write a .NET library for printing checks. Nothing fancy: you pass in the data, out comes the printed check. What's the best way to do this?
Constraints: The format of the check.

A lot of people use report generators for this. It's a bit overkill, but Crystal Reports will certainly do the job.
Other than that, this is a basic question about formatting printed output. Is that your intention?
Check out the PrintDocument class and you can do this yourself:
http://msdn.microsoft.com/en-us/magazine/cc188767.aspx
If you're printing checks remotely (i.e., you need to provide a check on the website that the user can print out), then using PDF is the easiest and most reliable way to accomplish that, but be careful of the security implications.
-Adam
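For the do-it-yourself route, a bare-bones sketch with `PrintDocument` might look like the following. The `CheckData` type, the font, and every coordinate are placeholders invented for this example; real positions depend entirely on the preprinted check stock. It assumes Windows with `System.Drawing` available, where the printer `Graphics` object measures in hundredths of an inch by default.

```csharp
using System;
using System.Drawing;
using System.Drawing.Printing;
using System.Globalization;

// Hypothetical data holder; field names are illustrative only.
public sealed class CheckData
{
    public string Payee;
    public decimal Amount;
    public DateTime Date;
}

public static class CheckPrinter
{
    // Builds a one-page document that draws the check fields at fixed positions.
    public static PrintDocument CreateDocument(CheckData check)
    {
        var doc = new PrintDocument();
        doc.PrintPage += (sender, e) =>
        {
            using (var font = new Font("Courier New", 11))
            {
                // Placeholder coordinates in hundredths of an inch from the top-left corner.
                e.Graphics.DrawString(check.Date.ToString("yyyy-MM-dd"), font, Brushes.Black, 600, 60);
                e.Graphics.DrawString(check.Payee, font, Brushes.Black, 100, 120);
                e.Graphics.DrawString(check.Amount.ToString("N2", CultureInfo.InvariantCulture),
                                      font, Brushes.Black, 600, 120);
            }
            e.HasMorePages = false; // one check per page
        };
        return doc;
    }
}
```

Calling `CheckPrinter.CreateDocument(check).Print()` would send the page to the default printer.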

Wow... that takes me back! In the old days printers were dot matrix and cheques were a continuous feed. I suppose nowadays cheques are preprinted single sheets and are printed with lasers/inkjets. Back then we'd just write plain ASCII to the printer and send printer-specific control/escape sequences for any specific formatting needs (picking the font size, line spacing, and page sizes).
Now I would likely try generating a PDF and then submitting that file for printing. It ought to be possible to do this with a plain text file too... though that's getting pretty close to old school. The report generator suggestion by Adam is a pretty good idea too.
Generally with cheque printing it takes a lot of trial and error to get the formatting right. Printing on plain paper and holding it and a preprinted cheque up to the window is an easy way to check positioning without burning through tons of cheques.
One thing to note, though, is whether there is a requirement to track the control numbers preprinted on the cheques (aka the cheque number). Auditors sometimes require this, and it is also a reasonable guard against fraud (accounting for every preprinted cheque is not a terrible idea). To do this you need to handle reprinting, and marking individual cheques/cheque runs as "spoiled". You also need a manual process to collect and store spoiled cheques (for the auditors). On the whole it's a giant pain to get this right and can take more time than you might imagine.

Unless you're really ambitious, you order pre-printed checks and look at the check template. Fill in the blanks and there you are.

Since the format would be fairly fixed, I bet you could create a Word doc that holds the format and then programmatically insert the correct information and print it.
EDIT
Wow, pretty anti-MS, eh? You can use the full power of Word to visually set the format for the cheque, and there are libraries to modify Word docs in .NET, so I don't see why this isn't a slick solution.
