Parsing and analysing a few GBs of data - C#

I am looking for an approach to analysing custom log files.
My current implementation uses LINQ and C#/.NET, but it only works on log files up to about 500 MB in size.
Each line of the log file is parsed into an object that looks like this:
public class Metrics
{
    public DateTime Date { get; set; }
    public string Metrics1 { get; set; }
    public string Metrics2 { get; set; }
    :
    :
    public string Metrics9 { get; set; }
}

List<Metrics> MetricsList = new List<Metrics>();
MetricsList is populated, and various LINQ queries are run on it to provide useful analytics.
A Metrics object needs roughly 300 bytes, and there are approximately 4 million lines in a 500 MB log file, which makes MetricsList alone consume more than 1 GB of program memory.
My requirement is to parse and analyse files of up to 2 GB, which looks like it will consume around 4 GB of memory.
Are there any better approaches or alternatives using Windows, Microsoft technologies, or open-source libraries?

I have done a similar task using SQLite. Install the System.Data.SQLite NuGet package (optional: I used the Dapper NuGet package as a very efficient micro-ORM too) and you have a very good tool for running queries and generating your reports. The only thing you may not like is that you have to write SQL instead of LINQ (although there is LINQ support for SQLite too; I have not used it).
This way the memory-consumption problem goes away as well.
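For illustration, a minimal sketch of that setup, assuming the System.Data.SQLite and Dapper packages; the table name, connection string, example query and the subset of Metrics columns shown are all my own choices, not from the answer:

using System.Collections.Generic;
using System.Data.SQLite;
using Dapper;

static class MetricsDb
{
    public static void Load(IEnumerable<Metrics> parsedLines)
    {
        using (var conn = new SQLiteConnection("Data Source=metrics.db"))
        {
            conn.Open();
            conn.Execute("CREATE TABLE IF NOT EXISTS Metrics (Date DATETIME, Metrics1 TEXT, Metrics2 TEXT)");
            using (var tx = conn.BeginTransaction())
            {
                // Dapper runs the insert once per element; one transaction keeps the bulk insert fast.
                conn.Execute("INSERT INTO Metrics (Date, Metrics1, Metrics2) VALUES (@Date, @Metrics1, @Metrics2)",
                             parsedLines, tx);
                tx.Commit();
            }
        }
    }

    public static IEnumerable<Metrics> QueryByMetrics1(string value)
    {
        using (var conn = new SQLiteConnection("Data Source=metrics.db"))
        {
            // Dapper maps the result columns straight back onto the Metrics class.
            return conn.Query<Metrics>(
                "SELECT Date, Metrics1, Metrics2 FROM Metrics WHERE Metrics1 = @m",
                new { m = value });
        }
    }
}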

Usually you don't want to store files like that in memory (unless you have plenty of it, of course), but process the data as you parse the file. Most likely I'd simply install more memory and build the solution as 64-bit...
However, if that is not an option, you can always optimize memory usage a bit. .NET stores strings as char[], where a char is basically a 2-byte short. You can save a lot of memory by simply not storing the text as char[] but as byte[], using Encoding.UTF8.GetBytes.
Also, each string or byte[] carries about 24 bytes of overhead (16 for the object itself, 8 for the reference) in a 64-bit environment. That can add up if you have a lot of small strings. Instead of storing them as individual strings, you can also store a single byte[] per line and do the parsing in the getters.
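A rough sketch of that last idea (the field layout, separator assumption and names here are hypothetical, not from the answer): keep each parsed log line as one UTF-8 byte[] plus the byte offsets of its fields, and only turn a field back into a string inside a getter when a query actually touches it.

using System;
using System.Text;

public class CompactMetrics
{
    private readonly byte[] _line;         // the whole log line, UTF-8 encoded
    private readonly int[] _fieldOffsets;  // byte offset where each field starts

    public CompactMetrics(byte[] utf8Line, int[] fieldByteOffsets)
    {
        _line = utf8Line;                  // e.g. Encoding.UTF8.GetBytes(rawLine)
        _fieldOffsets = fieldByteOffsets;  // computed once while scanning the encoded line
    }

    private string Field(int i)
    {
        int start = _fieldOffsets[i];
        // assumes a one-byte separator before the next field; the last field runs to the end
        int end = i + 1 < _fieldOffsets.Length ? _fieldOffsets[i + 1] - 1 : _line.Length;
        return Encoding.UTF8.GetString(_line, start, end - start);
    }

    public DateTime Date => DateTime.Parse(Field(0));
    public string Metrics1 => Field(1);
    public string Metrics2 => Field(2);
    // ... Metrics3 through Metrics9 follow the same pattern
}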
So to conclude, my advice is: buy more memory, or process the data as you read it / as you need it.
[Update+1]
Just noticed that you use a List. The easiest way to process-as-you-go is to read the file as an IEnumerable and run LINQ over that. Don't put it in a list first. E.g.:
public IEnumerable<Metrics> ReadFile()
{
    string s;
    while ((s = myFileReader.ReadLine()) != null)
    {
        yield return Parse(s);
    }
}

int someAnalysis = ReadFile().Sum(a => a.Metrics1.Length); // or whatever you do
[Update+2]
Oh, I have another trick for you. Reading files can be a performance pain, since file I/O is relatively slow. So instead of using the IEnumerable trick above, you can also use a compressed stream to store all the data in memory, and then use that during processing instead of the file.
For the people wondering whether I'm serious about this weird solution: it is a frequently used technique when you're building search technology and databases, simply because having more in (fast) memory means having less (slow) disk I/O. Furthermore, a log file will probably compress very nicely.
So: read the file once through a DeflateStream on top of a MemoryStream, then read that back for the LINQ pass in the way discussed above (again via a DeflateStream on top of the MemoryStream).
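A hedged sketch of that trick (the class and method names are mine, not the answer's): compress the raw log into an in-memory buffer once, then decompress it on the fly each time a LINQ pass needs the lines.

using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

static class CompressedLog
{
    public static MemoryStream LoadCompressed(string path)
    {
        var buffer = new MemoryStream();
        using (var input = File.OpenRead(path))
        using (var deflate = new DeflateStream(buffer, CompressionMode.Compress, leaveOpen: true))
        {
            input.CopyTo(deflate);      // log text usually compresses very well
        }
        return buffer;                  // keep this buffer around instead of re-reading the file
    }

    public static IEnumerable<string> ReadLines(MemoryStream compressed)
    {
        compressed.Position = 0;
        using (var deflate = new DeflateStream(compressed, CompressionMode.Decompress, leaveOpen: true))
        using (var reader = new StreamReader(deflate))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                yield return line;      // feed these into Parse(...) and LINQ as above
        }
    }
}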

Related

Memory limited to about 2.5 GB for a single .NET process

I am writing a .NET application running on Windows Server 2016 that does an HTTP GET on a bunch of pieces of a large file. This dramatically speeds up the download process, since you can download them in parallel. Unfortunately, once they are downloaded, it takes a fairly long time to piece them all back together.
There are between 2,000 and 4,000 files that need to be combined. The server this will run on has PLENTY of memory, close to 800 GB. I thought it would make sense to use MemoryStreams to store the downloaded pieces until they can be sequentially written to disk, BUT I am only able to consume about 2.5 GB of memory before I get a System.OutOfMemoryException. The server has hundreds of GB available, and I can't figure out how to use them.
MemoryStreams are built around byte arrays. Arrays cannot be larger than 2GB currently.
The current implementation of System.Array uses Int32 for all its internal counters etc, so the theoretical maximum number of elements is Int32.MaxValue.
There's also a 2GB max-size-per-object limit imposed by the Microsoft CLR.
As you try to put the content in a single MemoryStream the underlying array gets too large, hence the exception.
Try to store the pieces separately, and write them directly to the FileStream (or whatever you use) when ready, without first trying to concatenate them all into 1 object.
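One hedged way to do that (type and method names are illustrative): keep each downloaded piece as its own byte[], each comfortably under the 2 GB per-object limit, and append them to the destination FileStream in order, so no single buffer ever has to hold the whole file.

using System.Collections.Generic;
using System.IO;

static class PieceWriter
{
    public static void WritePieces(IReadOnlyList<byte[]> piecesInOrder, string destinationPath)
    {
        using (var output = new FileStream(destinationPath, FileMode.Create, FileAccess.Write))
        {
            foreach (byte[] piece in piecesInOrder)
            {
                output.Write(piece, 0, piece.Length);   // stream each piece out; nothing is ever concatenated
            }
        }
    }
}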
According to the source code of the MemoryStream class, you will not be able to store more than 2 GB of data in one instance of this class.
The reason for this is that the maximum length of the stream is set to Int32.MaxValue, and the maximum index of an array is set to 0x7FFFFFC7, which is 2,147,483,591 in decimal (just under 2 GB).
Snippet MemoryStream
private const int MemStreamMaxLength = Int32.MaxValue;
Snippet array
// We impose limits on maximum array lenght in each dimension to allow efficient
// implementation of advanced range check elimination in future.
// Keep in sync with vm\gcscan.cpp and HashHelpers.MaxPrimeArrayLength.
// The constants are defined in this method: inline SIZE_T MaxArrayLength(SIZE_T componentSize) from gcscan
// We have different max sizes for arrays with elements of size 1 for backwards compatibility
internal const int MaxArrayLength = 0X7FEFFFFF;
internal const int MaxByteArrayLength = 0x7FFFFFC7;
The question More than 2GB of managed memory was discussed a long time ago on the Microsoft forums and references a blog article about BigArray, getting around the 2GB array size limit.
Update
I suggest using the following code, which should be able to allocate more than 4 GB on an x64 build but will fail below 4 GB on an x86 build:
private static void Main(string[] args)
{
    List<byte[]> data = new List<byte[]>();
    Random random = new Random();

    while (true)
    {
        try
        {
            var tmpArray = new byte[1024 * 1024];
            random.NextBytes(tmpArray);
            data.Add(tmpArray);
            Console.WriteLine($"{data.Count} MB allocated");
        }
        catch (OutOfMemoryException)
        {
            Console.WriteLine("Further allocation failed.");
            break; // without this, the loop would spin forever after the first failure
        }
    }
}
As has already been pointed out, the main problem here is the nature of MemoryStream being backed by a byte[], which has fixed upper size.
The option of using an alternative Stream implementation has been noted. Another alternative is to look into "pipelines", the new IO API. A "pipeline" is based around discontiguous memory, which means it isn't required to use a single contiguous buffer; the pipelines library will allocate multiple slabs as needed, which your code can process. I have written extensively on this topic; part 1 is here. Part 3 probably has the most code focus.
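To make the pipelines idea concrete, here is a rough sketch (not taken from those posts; the class, method and parameter names are made up): a producer feeds the downloaded pieces into a Pipe, and a consumer drains whatever is available to the destination FileStream, so the data lives in the pipe's pooled, discontiguous buffers instead of one giant array.

using System;
using System.IO;
using System.IO.Pipelines;
using System.Threading.Tasks;

static class PipeAssembler
{
    public static async Task AssembleAsync(byte[][] downloadedPieces, string destinationPath)
    {
        var pipe = new Pipe();

        // Producer: feed each downloaded piece into the pipe's pooled buffers.
        Task producer = Task.Run(async () =>
        {
            foreach (byte[] piece in downloadedPieces)
            {
                await pipe.Writer.WriteAsync(piece);
            }
            pipe.Writer.Complete();
        });

        // Consumer: drain whatever is available out to the destination file.
        using (var output = File.Create(destinationPath))
        {
            while (true)
            {
                ReadResult result = await pipe.Reader.ReadAsync();
                foreach (ReadOnlyMemory<byte> segment in result.Buffer)
                {
                    await output.WriteAsync(segment);
                }
                pipe.Reader.AdvanceTo(result.Buffer.End);   // everything read so far is consumed
                if (result.IsCompleted) break;
            }
            pipe.Reader.Complete();
        }

        await producer;
    }
}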
Just to confirm that I understand your question: you're downloading a single very large file in multiple parallel chunks and you know how big the final file is? If you don't then this does get a bit more complicated but it can still be done.
The best option is probably to use a MemoryMappedFile (MMF). What you'll do is to create the destination file via MMF. Each thread will create a view accessor to that file and write to it in parallel. At the end, close the MMF. This essentially gives you the behavior that you wanted with MemoryStreams but Windows backs the file by disk. One of the benefits to this approach is that Windows manages storing the data to disk in the background (flushing) so you don't have to, and should result in excellent performance.
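A hedged sketch of that approach (the helper name is mine, and each piece's offset and length are assumed to be known up front): create the destination file at its final size via a memory-mapped file and let each worker write its piece through its own view stream.

using System.IO;
using System.IO.MemoryMappedFiles;
using System.Threading.Tasks;

static class MmfAssembler
{
    public static void Assemble(string destinationPath, long totalSize, (long Offset, byte[] Data)[] pieces)
    {
        // Create the destination file at its final size, backed by the memory-mapped file.
        using (var mmf = MemoryMappedFile.CreateFromFile(destinationPath, FileMode.Create, null, totalSize))
        {
            Parallel.ForEach(pieces, piece =>
            {
                // Each worker writes through its own view over just its slice of the file.
                using (var view = mmf.CreateViewStream(piece.Offset, piece.Data.Length))
                {
                    view.Write(piece.Data, 0, piece.Data.Length);
                }
            });
        }   // disposing flushes the views; Windows writes dirty pages back in the background
    }
}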

Read/Write array to a file

I need guidance, someone to point me in the right direction. As the title says, I need to save information to a file: a date, a string, an integer and an array of integers. I also need to be able to access that information later, when a user wants to review it.
Optional: the file is plain text, so I can open it directly and it is understandable.
Bonus points if chosen method can be "easily" converted to working with a database in the future instead of individual files.
I'm pretty new to C# and what I've found so far is that I should turn the array into a string with separators.
So, what'd you guys suggest?
// JSON.Net
string json = JsonConvert.SerializeObject(objOrArray);
File.WriteAllText(path, json);
// (note: you can also use File.Create etc. if you don't need the string in memory)
or...
// protobuf-net
using (var file = File.Create(path))
{
    Serializer.Serialize(file, objOrArray);
}
The first is readable; the second will be smaller. Both will cope fine with "Date, string, integer and an array of integers", or an array of such objects. Protobuf-net would require adding some attributes to help it, but that's really simple.
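For reference, the protobuf-net attributes on such a class look roughly like this (the class name and member numbers here are just an illustration; the member numbers only need to stay stable between versions):

using System;
using ProtoBuf;

[ProtoContract]
public class Record
{
    [ProtoMember(1)] public DateTime Date { get; set; }
    [ProtoMember(2)] public string Name { get; set; }
    [ProtoMember(3)] public int Value { get; set; }
    [ProtoMember(4)] public int[] Numbers { get; set; }
}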
As for working with a database as columns... the array of integers is the glitch there, because most databases don't support "array of integers" as a column type. I'd say "separation of concerns" - have a separate model for DB persistence. If you are using the database purely to store documents, then: pretty much every DB will support CLOB and BLOB data, so either is usable. Many databases now have inbuilt JSON support (helper methods, etc), which might make JSON as a CLOB more tempting.
I would probably serialize this to JSON and save it somewhere; Json.NET is a very popular way to do that.
A further advantage is that you end up with a class that can later be used with an object-relational mapper.
var userInfo = new UserInfoModel();

// write the data (overwrites)
using (var stream = new StreamWriter(@"path/to/your/file.json", append: false))
{
    stream.Write(JsonConvert.SerializeObject(userInfo));
}

// read the data
using (var stream = new StreamReader(@"path/to/your/file.json"))
{
    userInfo = JsonConvert.DeserializeObject<UserInfoModel>(stream.ReadToEnd());
}

public class UserInfoModel
{
    public DateTime Date { get; set; }
    // etc.
}
For the plain-text file you're right.
Use one line for each entry:
Date
string
integer
array of integers
If you read the file in your code you can easily separate them by reading it line by line.
Make a string out of the array with a specific separator:
[1,2,3] -> "1,2,3"
When you read the line back, you can split the string on "," to get an array of strings, then parse each entry to int into an int array of the same length (see the small sketch below).
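A small sketch of that round trip using plain BCL calls (the separator choice is up to you):

using System;
using System.Linq;

int[] numbers = { 1, 2, 3 };

// write: [1,2,3] -> "1,2,3"
string line = string.Join(",", numbers);

// read: "1,2,3" -> [1,2,3]
int[] restored = line.Split(',').Select(int.Parse).ToArray();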
For how to read and write the file, have a look at Easiest way to read from and write to files.
If you really want to switch to a database at some point, try a JSON format for your file. It is easy to handle and there are some good libraries to work with.
Regards,
Henne
The way I got started with C# is via the game Space Engineers on Steam. Its mods need to save a file locally (%AppData%\Local\Temp\SpaceEngineers\ or %AppData%\Roaming\SpaceEngineers\Storage\) for various settings, and their logging is similar to what @H. Sandberg mentioned (line by line, perhaps with a separator to parse on later). The upside to this is that it's easy to retrieve, easy to append to, and easy to overwrite. You can also retrieve the file size, which, combined with file deletion and file creation, lets you prevent runaway file sizes: set an upper limit to check against and you can run it on a server with minimal impact (probably best to include a minimum-age filter, i.e. make sure X is at least Y days old before deleting it for being over Z bytes, so you don't lose debugging data: "Why was it over that limit?").
As far as the actual code behind the idea goes, I'm at roughly the same skill level as the OP, which is to say rookie, but I would advise looking at the code in the Space Engineers mods for samples, as they are almost all written in C# (plus it's not half bad for a beta game). The Programmable Blocks compile C# as well, so you can use them both to help learn C# and to reinforce and apply what you already know. Certain C# features aren't allowed there for security reasons, but using the mod API you'll have more flexibility to do things such as creating/maintaining log files, retrieving/modifying object properties, and so on. You can even print text to various in-game text monitors.
I apologise if my syntax needs some work, and I'm sorry I can't just whip up some code to solve your issue, but I do know
using System;
Console.WriteLine("Hello World");
so at least it's not a total loss. My example code likely won't compile as-is, since it's missing things like an output location, perhaps an API reference or two, and probably a few other settings. Like I said, I'm new, but that is valid C#; I know I got that part correct.
Edit: here's a better attempt:
using System;

class Test
{
    static void Main()
    {
        string a = "Hello Hal, ";
        string b = "Please open the Airlock Doors.";
        string c = "I'm sorry Dave, ";
        string d = "I'm afraid I can't do that.";

        Console.WriteLine(a + b);
        Console.WriteLine(c + d);
        Console.Read();
    }
}
This:
"Hello Hal, Please open the Airlock Doors."
"I'm sorry Dave, I'm afraid I can't do that."
Should be the result. (The quotation marks won't appear in the actual output; they're just there to improve readability.)

Removing duplicate strings from a very big text file

I have to remove duplicate strings from an extremely big text file (100 GB+).
Since in-memory duplicate removal is hopeless due to the size of the data, I have tried a Bloom filter, but it was of no use beyond something like 50 million strings.
The total number of strings is around 1 trillion+.
I want to know what the ways to solve this problem are.
My initial attempt is to divide the file into a number of sub-files, sort each file, and then merge all the files together...
If you have a better solution than this, please let me know.
Thanks.
The key concept you are looking for here is external sorting. You should be able to merge sort the whole file using the techniques described in that article and then run through it sequentially to remove duplicates.
If the article is not clear enough have a look at the referenced implementations such as this one.
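For illustration, here is a rough sketch of the external-sort-then-dedup idea (assuming .NET 6+ for Enumerable.Chunk; the chunk size, temp-file handling and error handling are all simplified, and at this scale you would want a proper k-way merge with a priority queue rather than the linear minimum scan used here):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class ExternalDedup
{
    public static void Run(string inputPath, string outputPath, int linesPerChunk = 1_000_000)
    {
        // Pass 1: sort fixed-size chunks of lines and spill each to its own temp file.
        var chunkFiles = new List<string>();
        foreach (string[] chunk in File.ReadLines(inputPath).Chunk(linesPerChunk))
        {
            Array.Sort(chunk, StringComparer.Ordinal);
            string tmp = Path.GetTempFileName();
            File.WriteAllLines(tmp, chunk);
            chunkFiles.Add(tmp);
        }

        // Pass 2: merge the sorted chunks, writing each distinct line exactly once.
        var readers = chunkFiles.Select(f => new StreamReader(f)).ToList();
        var heads = readers.Select(r => r.ReadLine()).ToList();
        using (var output = new StreamWriter(outputPath))
        {
            string previous = null;
            while (true)
            {
                int min = -1;
                for (int i = 0; i < heads.Count; i++)
                {
                    if (heads[i] != null && (min < 0 || StringComparer.Ordinal.Compare(heads[i], heads[min]) < 0))
                        min = i;
                }
                if (min < 0) break;                        // every chunk is exhausted
                if (heads[min] != previous) output.WriteLine(heads[min]);
                previous = heads[min];
                heads[min] = readers[min].ReadLine();      // advance the chunk we just consumed
            }
        }
        readers.ForEach(r => r.Dispose());
        chunkFiles.ForEach(File.Delete);
    }
}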
You can make a second file that contains records; each record is a 64-bit CRC plus the offset of the string, and the file should be indexed for fast searching.
Something like this:
ReadFromSourceAndSort()
{
    offset = 0;
    while (!EOF)
    {
        string = ReadFromFile();
        crc64 = Crc64(string);
        if (LookUpInCache(crc64))
        {
            // duplicate: skip it
        }
        else
        {
            WriteToCacheFile(crc64, offset);
            WriteToOutput(string);
        }
        // advance offset by the number of bytes just read from the source
    }
}
How do you make a good cache file? It should be sorted by CRC64 so it can be searched quickly. So you should structure the file like a binary search tree, but one that allows fast insertion of new items without moving existing ones in the file. To improve speed you need to use memory-mapped files.
Possible answer:
memory = ReserveMemory(100 MB);
mapFile = MapMemoryToFile(memory, "\\temp\\map.tmp");   // the file can be bigger; the mapping is just a window
currentWindowNumber = 0;

while (!EndOfFile)
{
    ReadFromSourceAndSort();   // but only for the first 100 MB in memory
    currentWindowNumber++;
    MoveMapping(currentWindowNumber);
}
The lookup function should not use the mapping (because each window switch saves 100 MB to the HDD and loads the next 100 MB window). It just seeks within the 100 MB trees of CRC64 values, and if the CRC64 is found, the string is already stored.

Storing Large Lookup Tables

I am developing an app that utilizes very large lookup tables to speed up mathematical computations. The largest of these tables is an int[] that has ~10 million entries. Not all of the lookup tables are int[]. For example, one is a Dictionary with ~200,000 entries. Currently, I generate each lookup table once (which takes several minutes) and serialize it to disk (with compression) using the following snippet:
int[] lut = GenerateLUT();
lut.Serialize("lut");
where Serialize is defined as follows:
public static void Serialize(this object obj, string file)
{
    using (FileStream stream = File.Open(file, FileMode.Create))
    {
        using (var gz = new GZipStream(stream, CompressionMode.Compress))
        {
            var formatter = new BinaryFormatter();
            formatter.Serialize(gz, obj);
        }
    }
}
The annoyance I am having is that when launching the application, deserialization of these lookup tables takes a very long time (upwards of 15 seconds). This kind of delay will annoy users, as the app is unusable until all the lookup tables are loaded. Currently the deserialization looks like this:
Dictionary<string, int> lut1 = (Dictionary<string, int>) Deserialize("lut1");
int[] lut2 = (int[]) Deserialize("lut2");
...
where Deserialize is defined as:
public static object Deserialize(string file)
{
    using (FileStream stream = File.Open(file, FileMode.Open))
    {
        using (var gz = new GZipStream(stream, CompressionMode.Decompress))
        {
            var formatter = new BinaryFormatter();
            return formatter.Deserialize(gz);
        }
    }
}
At first, I thought it might have been the gzip compression that was causing the slowdown, but removing it only shaved a few hundred milliseconds off the serialization/deserialization routines.
Can anyone suggest a way of speeding up the load times of these lookup tables upon the app's initial startup?
First, deserializing in a background thread will prevent the app from "hanging" while this happens. That alone may be enough to take care of your problem.
However, serialization and deserialization (especially of large dictionaries) tend to be very slow in general. Depending on the data structure, writing your own serialization code can dramatically speed this up, particularly if there are no shared references in the data structures.
That being said, depending on the usage pattern, a database might be a better approach. You could always make something more database oriented and build the lookup table lazily from the DB (i.e. a lookup first checks the LUT; if the entry isn't there, load it from the DB and save it in the table). This would make startup instantaneous (at least in terms of the LUT), and probably still keep lookups fairly snappy.
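As an illustration of the hand-rolled serialization idea for the int[] tables (the class name and the lack of compression are my own choices, not the answer's): BinaryWriter/BinaryReader plus one bulk Buffer.BlockCopy avoids BinaryFormatter's per-element overhead entirely.

using System;
using System.IO;

static class LutFile
{
    public static void Save(int[] lut, string path)
    {
        using (var writer = new BinaryWriter(File.Create(path)))
        {
            writer.Write(lut.Length);
            var bytes = new byte[lut.Length * sizeof(int)];
            Buffer.BlockCopy(lut, 0, bytes, 0, bytes.Length);   // one bulk copy, no per-element work
            writer.Write(bytes);
        }
    }

    public static int[] Load(string path)
    {
        using (var reader = new BinaryReader(File.OpenRead(path)))
        {
            int length = reader.ReadInt32();
            var bytes = reader.ReadBytes(length * sizeof(int));
            var lut = new int[length];
            Buffer.BlockCopy(bytes, 0, lut, 0, bytes.Length);
            return lut;
        }
    }
}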
I guess the obvious suggestion is to load them in the background. Once the app has started, the user has opened their project, and selected whatever operation they want, there won't be much of that 15 seconds left to wait.
Just how much data are we talking about here? In my experience, it takes about 20 seconds to read a gigabyte from disk into memory. So if you're reading upwards of half a gigabyte, you're almost certainly running into hardware limitations.
If data transfer rate isn't the problem, then the actual deserialization is taking time. If you have enough memory, you can load all of the tables into memory buffers (using File.ReadAllBytes()) and then deserialize from a memory stream. That will allow you to determine how much time reading is taking, and how much time deserialization is taking.
If deserialization is taking a lot of time, you could, if you have multiple processors, spawn multiple threads to do the deserialization in parallel. With such a system, you could potentially be deserializing one or more tables while loading the data for another. That pipelined approach could make your entire load/deserialization time almost as fast as the load alone.
Another option is to put your tables into, well, tables: real database tables. Even an engine like Access should yield pretty good performance, because you have an obvious index for every query. Now the app only has to read in data when it's actually about to use it, and even then it's going to know exactly where to look inside the file.
This might make the app's actual performance a bit lower, because you have to do a disk read for every calculation. But it would make the app's perceived performance much better, because there's never a long wait. And, like it or not, the perception is probably more important than the reality.
Why zip them?
Disk is bigger than RAM.
A straight binary read should be pretty quick.

What is the fastest way to parse text with custom delimiters and some very, very large field values in C#?

I've been trying to deal with some delimited text files that have non-standard delimiters (not comma/quote or tab delimited). The delimiters are arbitrary ASCII characters that rarely show up in the values between them. After searching around, I seem to have found no solutions in .NET that will suit my needs, and the custom libraries people have written for this seem to have flaws when it comes to gigantic input (a 4 GB file where some field values easily run to several million characters).
While this seems to be a bit extreme, it is actually a standard in the Electronic Document Discovery (EDD) industry for some review software to have field values that contain the full contents of a document. For reference, I've previously done this in python using the csv module with no problems.
Here's an example input:
Field delimiter =
quote character = þ
þFieldName1þþFieldName2þþFieldName3þþFieldName4þ
þValue1þþValue2þþValue3þþSomeVery,Very,Very,Large value(5MB or so)þ
...etc...
Edit:
So I went ahead and created a delimited-file parser from scratch. I'm kind of wary of using this solution, as it may be prone to bugs. It also doesn't feel "elegant" or correct to have to write my own parser for a task like this. I also have a feeling that I probably didn't have to write a parser from scratch for this anyway.
Use the File Helpers API. It's .NET and open source. It's extremely high performance, using compiled IL code to set fields on strongly typed objects, and it supports streaming.
It supports all sorts of file types and custom delimiters; I've used it to read files larger than 4 GB.
If for some reason that doesn't do it for you, try just reading line by line with a string.Split:
public IEnumerable<string[]> CreateEnumerable(StreamReader input)
{
    string line;
    while ((line = input.ReadLine()) != null)
    {
        yield return line.Split('þ');
    }
}
That'll give you simple string arrays representing the lines in a streaming fashion that you can even LINQ into ;) Remember, however, that the IEnumerable is lazily loaded, so don't close or alter the StreamReader until you've iterated (or caused a full load operation like ToList/ToArray or similar; given your file size, however, I assume you won't do that!).
Here's a good sample use of it:
using (StreamReader sr = new StreamReader("c:\\test.file"))
{
    var qry = from l in CreateEnumerable(sr).Skip(1)
              where l[3].Contains("something")
              select new { Field1 = l[0], Field2 = l[1] };

    foreach (var item in qry)
    {
        Console.WriteLine(item.Field1 + " , " + item.Field2);
    }
}
Console.ReadLine();
This will skip the header line, then print out the first two fields from each line of the file where the 4th field contains the string "something". It will do this without loading the entire file into memory.
Windows and high-performance I/O means: use I/O completion ports. You may have to do some extra plumbing to get it working in your case.
This is with the understanding that you want to use C#/.NET, and according to Joe Duffy
18) Don’t use Windows Asynchronous Procedure Calls (APCs) in managed
code.
I had to learn that one the hard way ;), but ruling out APC use, IOCP is the only sane option. It also supports many other types of I/O, frequently used in socket servers.
As far as parsing the actual text, check out Eric White's blog for some streamlined stream use.
I would be inclined to use a combination of memory-mapped files (MSDN points to a .NET wrapper here) and a simple incremental parse, yielding back an IEnumerable of your records / text lines (or whatever).
You mention that some fields are very, very big; if you try to read them into memory in their entirety you may be getting yourself into trouble. I would read through the file in 8K chunks (or similarly small chunks), parse the current buffer, and keep track of state.
What are you trying to do with this data that you are parsing? Are you searching for something? Are you transforming it?
I don't see a problem with you writing a custom parser. The requirements seem sufficiently different to anything already provided by the BCL, so go right ahead.
"Elegance" is obviously a subjective thing. In my opinion, if your parser's API looks and works like a standard BCL "reader"-type API, then that is quite "elegant".
As for the large data sizes, make your parser work by reading one byte at a time and use a simple state machine to work out what to do. Leave the streaming and buffering to the underlying FileStream class. You should be OK with performance and memory consumption.
Example of how you might use such a parser class:
using (var reader = new EddReader(new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, 8192)))
{
    // Read a small field
    string smallField = reader.ReadFieldAsText();
    // Read a large field
    Stream largeField = reader.ReadFieldAsStream();
}
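For what it's worth, a bare-bones sketch of the state machine inside such a reader might look like the following. It assumes the layout from the question (each field wrapped in 'þ' quotes, one record per line outside the quotes), reads one character at a time via a TextReader rather than raw bytes, and buffers each field in a StringBuilder; the multi-megabyte fields from the question would want the stream-returning variant above instead. The type and method names are hypothetical.

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class EddFieldScanner
{
    // Yields one list of field values per record.
    public static IEnumerable<List<string>> ReadRecords(TextReader reader, char quote = 'þ')
    {
        var record = new List<string>();
        var field = new StringBuilder();
        bool inField = false;
        int c;
        while ((c = reader.Read()) != -1)
        {
            char ch = (char)c;
            if (inField)
            {
                if (ch == quote) { record.Add(field.ToString()); field.Clear(); inField = false; }
                else field.Append(ch);
            }
            else if (ch == quote)
            {
                inField = true;                     // start of the next quoted field
            }
            else if (ch == '\n' && record.Count > 0)
            {
                yield return record;                // end of a record
                record = new List<string>();
            }
            // anything else outside the quotes (the delimiter, '\r') is ignored
        }
        if (record.Count > 0) yield return record;  // last record without a trailing newline
    }
}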
While this doesn't help address the large-input issue, a possible solution to the parsing issue might be a custom parser that uses the strategy pattern to supply the delimiter.
