Reading very large CSV and JSON files


I am currently programming a "search engine" in C# for a game. The game provides very large (3 GB and more!) .csv and .json(l) files that I need to parse, but parsing them takes up a very large amount of RAM. What are good ways to parse them? (I need all of the data, because I am transferring it into a DB.)
example csv:
id,station_id,commodity_id,supply,buy_price,sell_price,demand,collected_at
1,1,5,0,0,315,532,1486247405
2,1,6,0,0,6795,38,1486247405
3,1,7,0,0,527,318,1486247405
Unfortunately I have no JSON example, but it is an array of objects that hold the data.

I used Microsoft.VisualBasic.FileIO.TextFieldParser and it was fast enough for a 2 GB .csv file. It reads one record at a time, so the whole file never has to sit in memory.
using (TextFieldParser sr = new TextFieldParser(datapath)
{
    Delimiters = new string[] { "," },
    HasFieldsEnclosedInQuotes = true
})
{
    // ReadFields returns null once the end of the file is reached.
    string[] values = sr.ReadFields();
    while (values != null)
    {
        // ....
        values = sr.ReadFields();
    }
}
Hope it helps.
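
For the .jsonl files mentioned in the question, a similar streaming approach might look like the sketch below. It assumes one JSON object per line and uses System.Text.Json; the names JsonLinesReader and ReadRecords are made up for illustration. A single large JSON array (rather than JSON Lines) would need a different, token-level reader.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

public static class JsonLinesReader
{
    // Lazily yields one parsed record per non-empty line, so only the
    // current line is held in memory (assumption: one JSON object per line).
    public static IEnumerable<JsonElement> ReadRecords(string path)
    {
        foreach (string line in File.ReadLines(path))
        {
            if (string.IsNullOrWhiteSpace(line))
            {
                continue;
            }

            using JsonDocument doc = JsonDocument.Parse(line);
            // Clone so the element stays valid after the document is disposed.
            yield return doc.RootElement.Clone();
        }
    }
}
```

Each record can then be written to the DB inside the foreach loop that consumes ReadRecords, ideally in batches.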

Related

How to search string in large text file?

I want to get the line containing a certain word that is unique in the file (like a profile ID) without looping over every line. If the word I am looking for is on the last line of the text file, the search will take a long time, and if I search for more than one word and extract each matching line, I think it will take even longer.
Example line format of the text file:
name,id,image,age,place,link
string word = "13215646";
string output = string.Empty;
using (var fileStream = File.OpenRead(FileName))
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8))
{
    String line;
    while ((line = streamReader.ReadLine()) != null)
    {
        string[] strList = line.Split(',');
        if (word == strList[1]) // check if word = id
        {
            output = line;
            break;
        }
    }
}

You can use this to search the file:
var output = File.ReadLines(FileName)
    .Where(line => line.Split(',')[1] == word)
    .FirstOrDefault();
But it won't solve this:
if the word I am looking for is in the last line of the text file, this will take a lot of time to get it, and if the search process is for more than one word and extract the line that contains it, I think it will take a lot of time.
There's not a practical way to avoid this for a basic file.
The only ways to avoid actually reading through the file are either to maintain an index, which requires absolute control over everything that might write into the file, or to guarantee that the file is already sorted by the columns that matter, in which case you can do something like a binary search.
But neither is likely for a random csv file. This is one of the reasons people use databases.
However, we also need to stop and check whether this is really a problem for you. I'd expect the code above to handle files up to a couple hundred MB in around 1 to 2 seconds on modern hardware, even if you need to look through the whole file.
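
If many lookups are needed and the file does not change in between, the index idea can be sketched as a dictionary built in a single pass. CsvIdIndex is a hypothetical helper, not something from the question; it assumes the id is the second column, that ids are unique, and that keeping every indexed line in memory is acceptable.

```csharp
using System.Collections.Generic;

public static class CsvIdIndex
{
    // Builds an id -> line lookup in one pass over the lines.
    // Assumption: the id is the second comma-separated field.
    public static Dictionary<string, string> Build(IEnumerable<string> lines)
    {
        var index = new Dictionary<string, string>();
        foreach (string line in lines)
        {
            string[] parts = line.Split(',');
            if (parts.Length > 1)
            {
                index[parts[1]] = line; // a later duplicate id overwrites an earlier one
            }
        }
        return index;
    }
}
```

A possible use is `var index = CsvIdIndex.Build(File.ReadLines(FileName));` followed by `index.TryGetValue(word, out var line)` for each id. The first pass still reads the whole file, but every later lookup avoids it.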

You can optimise the code. Here are a few ideas:
var ids = new[] { "13215646", "113" };
foreach (var line in File.ReadLines(FileName))
{
    var id = line.Split(',', count: 3)[1]; // Optimization 1: use `count: 3` to stop splitting after the id column
    if (ids.Contains(id)) // Optimization 2: search for multiple ids in one pass
    {
        // Do what you need with the line
    }
}

Unable to split List content with space as delimiter - C#

I have just started to learn C# for one of my projects, in which I need to read all the data from a text file and then compare the imported data with the default data structure of the file. So far I have been able to do a little bit, but I am stuck at splitting the imported data in a list with space as the delimiter, so that I can compare it with the default data, which I am planning to put in a default data list.
The structure of the file(File1) to be imported(or the user provided) is as follows:-
%emp_first_name% = xxxxxxxx %emp_middle_name% = xxxxxxxx %emp_last_name% = xxxxxxxx;
%emp_age% = nn;
%emp_dept.% = xxxxxxxx;
%emp_joining_date% = xx-xx-xxxx;
the default structure of the file(File2) is:-
%emp_first_name% = xxxxxxxx %emp_middle_name% = xxxxxxxx %emp_last_name% = xxxxxxxx;
%emp_age% = nn;
%emp_total_exp% = xx;
%emp_grade% = x;
%emp_dept.% = xxxxxxxx;
%emp_joining_date% = xx-xx-xxxx;
After reading File1 into a list, I am unable to split it using space as a delimiter. This is what I am doing to read File1 into a list:
public static void readFinL(string filename)
{
    string readAllLines = File.ReadAllText(filename);
    List<string> list = new List<string>();
    list.Add(readAllLines);
    foreach (string d in list)
    {
        var f = d.Split(',');
        Console.WriteLine(f.GetValue(0));
    }
}
What am I missing, or what am I doing incorrectly with this method to read the file into a list? I am passing the data into a list because I need to compare File1 with File2 to check which row is missing in File1. Any pointer in the right direction will be helpful.

First of all, d.Split(',') splits on the comma. Use var f = d.Split(' ') instead.
If I'm not wrong, File.ReadAllText returns a single string, so your list only has one element this way.
string[] lines = File.ReadAllLines("path to the file");
should do the job, giving you one array element per line.
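
For the stated goal of finding which rows are missing in File1, one possible sketch is to pull the %name% keys out of each file and compare the two sets, instead of splitting on spaces. TemplateKeys and its methods are hypothetical names, and the pattern assumes keys are always wrapped in percent signs as in the samples.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TemplateKeys
{
    // Matches names wrapped in percent signs, e.g. %emp_age%.
    private static readonly Regex KeyPattern = new Regex("%([^%]+)%");

    // Returns every key name found in the given lines, in order.
    public static List<string> Extract(IEnumerable<string> lines) =>
        lines.SelectMany(l => KeyPattern.Matches(l).Cast<Match>())
             .Select(m => m.Groups[1].Value)
             .ToList();

    // Returns the keys present in the default file but absent from the user file.
    public static List<string> Missing(IEnumerable<string> defaultLines, IEnumerable<string> userLines) =>
        Extract(defaultLines).Except(Extract(userLines)).ToList();
}
```

With the two sample files from the question, calling Missing with the lines of File2 and File1 (each read via File.ReadAllLines) would report emp_total_exp and emp_grade.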

How to randomly generate a word in a CSV file [duplicate]

This question already has answers here:
How do I generate a random integer in C#?
(31 answers)
Closed 3 years ago.
I have a csv file in my file explorer windows 10. This file contains a list of rows e.g.:
John, 5656, Phil, Simon,,Jude, Helen, Andy
Conor, 5656, Phil, Simon,,Jude, Helen, Andy
I am an automated tester using C#, selenium and visual studio. In the application I am testing, there is an upload button which imports the csv file.
How do I randomly change the second number automatically so the update would be 1234 on the first row, 4444 on the second row(just append randomly). I think I would need a random generator for this.
Any advice or snippets of code would be appreciated.

Do you want to modify the CSV file before it's uploaded to the program or after? Either way it would look something like this:
public static string UpdateFile(string filePath)
{
    var random = new Random();
    var modifiedLines = new List<string>();
    using (StreamReader sr = File.OpenText(filePath))
    {
        string s;
        while ((s = sr.ReadLine()) != null)
        {
            // Append a random four-digit suffix to each line.
            modifiedLines.Add(s + random.Next(1000, 10000));
        }
    }
    const string outputPath = "names.txt";
    using (StreamWriter sw = new StreamWriter(outputPath))
    {
        foreach (string s in modifiedLines)
        {
            sw.WriteLine(s);
        }
    }
    return outputPath; // path of the new file
}

Reading the file before uploading, changing the numbers on the second position in csv and writing it again to disk should work. Here is a very simple approach, to help you get started:
var fileLines = File.ReadAllLines("file.csv");
var randomGenerator = new Random();
var newFileLines = new List<string>();
foreach (var fileLine in fileLines)
{
    var lineValues = fileLine.Split(',');
    lineValues[1] = randomGenerator.Next(1000, int.MaxValue).ToString();
    var newLine = string.Join(",", lineValues);
    newFileLines.Add(newLine);
}
File.WriteAllLines("file.csv", newFileLines);

Instead of updating an existing CSV file for testing I would generate a new one from code.
There are a lot of code examples online that show how to create a CSV file in C#, for example: Writing data into CSV file in C#
For random numbers you can use the Random class: https://learn.microsoft.com/en-us/dotnet/api/system.random?view=netframework-4.7.2
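
As a rough sketch of generating the test file from code, the helper below builds one row per name with a random four-digit number in the second column. TestCsvGenerator and BuildLines are made-up names, and the remaining columns are simply copied from the sample rows in the question.

```csharp
using System;
using System.Collections.Generic;

public static class TestCsvGenerator
{
    // Builds one CSV row per name; the second field is a random number
    // from 1000 to 9999 (the upper bound of Random.Next is exclusive).
    public static List<string> BuildLines(IEnumerable<string> names, Random random)
    {
        var lines = new List<string>();
        foreach (string name in names)
        {
            lines.Add($"{name},{random.Next(1000, 10000)},Phil,Simon,,Jude,Helen,Andy");
        }
        return lines;
    }
}
```

The result can be written with File.WriteAllLines before the test clicks the upload button. Passing in the Random instance lets a test use a fixed seed when it needs repeatable data.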

Slow loading of .CSV files using EPPLUS

I have loads of .csv files that I need to convert to .xlsx after applying some formatting.
A file containing approx 20 000 rows and 7 columns takes 12 minutes to convert.
If the file contains more than 100 000 rows it runs for > 1 hour.
This is unfortunately not acceptable for me.
Code snippet:
var format = new ExcelTextFormat();
format.Delimiter = ';';
format.Encoding = new UTF7Encoding();
format.Culture = new CultureInfo(System.Threading.Thread.CurrentThread.CurrentCulture.ToString());
format.Culture.DateTimeFormat.ShortDatePattern = "dd.mm.yyyy";
using (ExcelPackage package = new ExcelPackage(new FileInfo(file.Name)))
{
    ExcelWorksheet worksheet = package.Workbook.Worksheets.Add(Path.GetFileNameWithoutExtension(file.Name));
    worksheet.Cells["A1"].LoadFromText(new FileInfo(file.FullName), format);
}
I have verified that it is the LoadFromText command that spends the time used.
Is there a way to speed things up?
I have tried without the "format" parameter, but the loadtime was the same.
What loadtimes are you experiencing?

My suggestion here is to read the file yourself and then use the library to create the output file.
The code to read the CSV could be as simple as:
List<String> lines = new List<String>();
using (StreamReader reader = new StreamReader("file.csv"))
{
    String line;
    while ((line = reader.ReadLine()) != null)
    {
        lines.Add(line);
    }
}
// Now you have all the lines of your CSV
// Create your file with EPPlus
foreach (String line in lines)
{
    var values = line.Split(';');
    foreach (String value in values)
    {
        // use the EPPlus library to fill your file
    }
}

I ran into a very similar problem with LoadFromCollection. EPPlus has to account for all situations in its methods that load data generically like that, so there is a good deal of overhead. I ended up narrowing down the bottleneck to that method and just manually converting the data from the collection to Excel cell objects in EPPlus. That probably saved several minutes in my exports.
Plenty of examples on how to read csv data:
C# Read a particular value from CSV file

What's the best way to get all the content in between two tagged lines of a file so that you can deserialize it?

I've been noticing that the following segment of code does not scale well for large files (I think that appending to the paneContent string is slow):
string paneContent = String.Empty;
bool lineFound = false;
foreach (string line in File.ReadAllLines(path))
{
    if (line.Contains(tag))
    {
        lineFound = !lineFound;
    }
    else
    {
        if (lineFound)
        {
            paneContent += line;
        }
    }
}
using (TextReader reader = new StringReader(paneContent))
{
    data = (PaneData)(serializer.Deserialize(reader));
}
What's the best way to speed this all up? I have a file that looks like this (so I wanna get all the content in between the two different tags and then deserialize all that content):
A line with some tag
A line with content I want to get into a single stream or string
A line with content I want to get into a single stream or string
A line with content I want to get into a single stream or string
A line with content I want to get into a single stream or string
A line with content I want to get into a single stream or string
A line with some tag
Note: These tags are not XML tags.

You could use a StringBuilder instead of a string; that is what StringBuilder is for. Some example code is below:
var paneContent = new StringBuilder();
bool lineFound = false;
foreach (string line in File.ReadLines(path))
{
    if (line.Contains(tag))
    {
        lineFound = !lineFound;
    }
    else
    {
        if (lineFound)
        {
            paneContent.Append(line);
        }
    }
}
using (TextReader reader = new StringReader(paneContent.ToString()))
{
    data = (PaneData)(serializer.Deserialize(reader));
}
As mentioned in this answer, a StringBuilder is preferred to a string when you are concatenating in a loop, which is the case here.

Here is an example of how to use groups with regexes and retrieve their contents afterwards.
What you want is a regex that matches your tagged lines, labels the content between them as a group, and then lets you retrieve the data of that group as in the example.
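
A minimal sketch of that regex idea, assuming the whole file fits in memory as one string and that the tag is plain text appearing somewhere on the two delimiter lines. TaggedBlock and Extract are hypothetical names, not part of the original answer.

```csharp
using System.Text.RegularExpressions;

public static class TaggedBlock
{
    // Returns the text between the first two lines that contain the tag,
    // or null when no such pair of lines exists.
    public static string Extract(string text, string tag)
    {
        string t = Regex.Escape(tag);
        // A line containing the tag, then a lazily captured body,
        // then the next line containing the tag.
        string pattern = $@"^[^\n]*{t}[^\n]*\n(?<body>.*?)^[^\n]*{t}[^\n]*$";
        Match m = Regex.Match(text, pattern, RegexOptions.Singleline | RegexOptions.Multiline);
        return m.Success ? m.Groups["body"].Value : null;
    }
}
```

The captured body keeps its line breaks, unlike the loop versions that append lines without separators. For very large files the line-by-line approach is likely the safer choice, since this one needs the full text loaded first.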

Use a StringBuilder to build your data string (paneContent). It's much faster because concatenating strings results in new memory allocations. StringBuilder pre-allocates memory (if you expect large data strings, you can customize the initial allocation).
It's a good idea to read your input file line-by-line so you can avoid loading the whole file into memory if you expect files with many lines of text.
