I'm creating files that have a certain structure to them. They begin with a Header, then contain a block of DataElements. (The exact details don't matter to this question.)
I have a DataFileWriter connected to a FileStream for output. The problem is, the service that's consuming the files I'm building will reject any data file whose size is larger than the arbitrary value TOOBIG.
Given these constraints:
Every file must start with a Header
Every file must contain one or more DataElements, which must be written out completely; a file ending with an incomplete DataElement is invalid
It is perfectly valid to stop at the end of the current element and begin writing a new file with a new header, as long as each DataElement is written exactly once
The DataFileWriter doesn't and should not know that it's writing to a FileStream as opposed to some other type of stream; all it knows is that it has a Stream and in other cases that could be a completely different setup.
DataElement does not have a fixed size, but it's reasonable to assume any given element won't exceed 4 KB in size.
What's the best way to set up a system that will ensure, assuming that no massive DataElements come through, that no file exceeding a size of TOOBIG will be created? Basic architecture is given below; how would I need to modify it?
public class DataFileWriter : IDisposable
{
    private readonly Stream _output;
    private readonly IEnumerable<DataElement> _input;
    private const long TOOBIG = 4L * 1024 * 1024 * 1024; // 4 GB (too large for an int constant, so declared as long)

    public DataFileWriter(IEnumerable<DataElement> input, Stream output)
    {
        _input = input;
        _output = output;
    }

    public void Write()
    {
        WriteHeader(); // writes the header to _output
        foreach (var element in _input)
        {
            WriteElement(element); // serializes the record to _output
        }
    }

    public void Dispose()
    {
        _output.Dispose();
    }
}
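One possible direction (a sketch only, not part of the original post): hand the writer a stream factory instead of a single stream, and roll over to a fresh stream whenever the next element could push the current file past the limit. The Func<Stream> factory, the 4 KB safety margin, and the use of Stream.Position (which assumes a seekable stream; otherwise wrap the output in a byte-counting decorator) are all assumptions here.
using System;
using System.Collections.Generic;
using System.IO;

public class DataFileWriter : IDisposable
{
    private const long TOOBIG = 4L * 1024 * 1024 * 1024; // 4 GB
    private const int MaxElementSize = 4 * 1024;          // assumed per-element upper bound

    private readonly IEnumerable<DataElement> _input;
    private readonly Func<Stream> _outputFactory; // hands out a fresh stream for each file
    private Stream _output;

    public DataFileWriter(IEnumerable<DataElement> input, Func<Stream> outputFactory)
    {
        _input = input;
        _outputFactory = outputFactory;
    }

    public void Write()
    {
        StartNewFile();
        foreach (var element in _input)
        {
            // If the next element might not fit, finish this file and begin a new one.
            if (_output.Position + MaxElementSize > TOOBIG)
                StartNewFile();
            WriteElement(element); // serializes the record to _output, as before
        }
    }

    private void StartNewFile()
    {
        if (_output != null)
            _output.Dispose();
        _output = _outputFactory();
        WriteHeader(); // writes the header to _output, as before
    }

    public void Dispose()
    {
        if (_output != null)
            _output.Dispose();
    }
}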
I'm extending BinaryWriter using a MemoryStream.
public class PacketWriter : BinaryWriter
{
    public PacketWriter(Opcode op) : base(CreateStream(op))
    {
        this.Write((ushort)op);
    }

    private static MemoryStream CreateStream(Opcode op)
    {
        return new MemoryStream(PacketSizes.Get(op));
    }

    public void WriteCustomThing()
    {
        // Validate that MemoryStream has space?
        // Do all the stuff
    }
}
Ideally, I want to write using PacketWriter as long as there is space available (the space is already defined in PacketSizes). If there isn't space available, I want an exception to be thrown. It seems like MemoryStream just dynamically allocates more space if you write past its capacity, but I want a fixed capacity. Can I achieve this without checking the length on every write? The only solution I've thought of so far is to override all the Write methods of BinaryWriter and compare lengths, but that is annoying.
Just provide a buffer of the desired size to write into:
using System;
using System.IO;

class Test
{
    static void Main()
    {
        var buffer = new byte[3];
        var stream = new MemoryStream(buffer);
        stream.WriteByte(1);
        stream.WriteByte(2);
        stream.WriteByte(3);
        Console.WriteLine("Three successful writes");
        stream.WriteByte(4); // This throws
        Console.WriteLine("Four successful writes??");
    }
}
This is documented behavior for the MemoryStream(byte[]) constructor:
Initializes a new non-resizable instance of the MemoryStream class based on the specified byte array.
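Applied to the PacketWriter above, a minimal sketch (assuming PacketSizes.Get(op) returns the fixed packet capacity in bytes) would back the stream with a pre-sized buffer rather than passing the size as a capacity hint:
public class PacketWriter : BinaryWriter
{
    public PacketWriter(Opcode op)
        : base(new MemoryStream(new byte[PacketSizes.Get(op)])) // non-resizable stream
    {
        this.Write((ushort)op);
    }
}
Any write past the end of the buffer then throws (a NotSupportedException, since the stream cannot expand), without having to override every Write overload.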
I've been sifting through the posts and forums but could not find a way to achieve this.
I have an array of 10,000,000 Person objects. I'm sending these objects over the network using a Streamed WCF Net.Tcp web service.
The problem is that I want to read the first, say, 5000 Person objects of the array as the data arrives and process only those. After that I will advance the stream and read another 5000, and so on.
I haven't been able to find a way to do this because, as far as I can tell, objects in C# have no explicit size. That is, I can't just read the first 312 bytes of the stream and say "Yes, this is the first Person object. Now read the next 312 bytes to get the next person."
Ideally I would like to use ProtoBuf-Net to serialize my objects, but the .NET BinaryFormatter is fine as well.
I'm also open to sending the data in chunks, such as arrays of 5000, but I want to do so without opening a brand new TCP connection every time. If only there were a way to tell the code that reads the stream: "OK, deserialize everything I just sent you (an array of 5000), and then I will continue writing another 5000 to the stream".
Any ideas?
Thanks.
There may not be an explicit size for most objects in .NET, but you can find the size of a serialized object. First send the size (in bytes) of the serialized object, then send the serialized object itself.
// pseudo-code: length-prefix each serialized object
byte[] serializedObj = DoSerialization(person); // an array knows its own length
using (var writer = new BinaryWriter(stream, Encoding.UTF8, leaveOpen: true))
{
    writer.Write(serializedObj.Length); // 4-byte length prefix
    writer.Write(serializedObj);        // followed by the payload
}
You can also do this in bulk by changing what you send and how you send it: create a List<Person>, add N Person objects to it, serialize the List, and send it as before.
Sending the size before the data may not be strictly necessary, but it helps the reading side know how many bytes to expect.
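On the receiving side, a matching sketch (DoDeserialization is a hypothetical counterpart to the DoSerialization above) reads the length prefix first and then exactly that many bytes:
// pseudo-code: read one length-prefixed object back
using (var reader = new BinaryReader(stream, Encoding.UTF8, leaveOpen: true))
{
    int length = reader.ReadInt32();            // the 4-byte length prefix
    byte[] payload = reader.ReadBytes(length);  // exactly that many bytes
    Person person = DoDeserialization(payload); // hypothetical inverse of DoSerialization
}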
You can do this with protobuf-net simply by using an ObservableCollection<Person> in your receiving system. When the collection grows larger than 5000 objects during deserialization, remove and process the items in an ObservableCollection<T>.CollectionChanged callback, then process any remaining items in an [OnDeserialized] callback.
For instance, consider the following root object:
[ProtoContract]
public class RootObject
{
    public RootObject()
    {
        this.People = new ObservableCollection<Person>();
    }

    [ProtoMember(1)]
    public ObservableCollection<Person> People { get; private set; }

    public event EventHandler<EventArgs<StreamingContext>> OnDeserialized;

    [OnDeserialized]
    internal void OnDeserializedMethod(StreamingContext context)
    {
        var onDeserialized = OnDeserialized;
        if (onDeserialized != null)
            onDeserialized(this, new EventArgs<StreamingContext> { Value = context });
    }
}

public class EventArgs<T> : EventArgs
{
    public T Value { get; set; }
}
Say you have a method you would like to call to process each batch of 5000 Person objects as they are added to the collection, for instance:
const int ProcessIncrement = 5000;

void ProcessItems(ICollection<Person> people, bool force)
{
    if (people == null || people.Count == 0)
        return;
    if (people.Count >= ProcessIncrement || force)
    {
        // Remove and process the items, possibly on a different thread.
        Console.WriteLine(string.Format("Processing {0} people.", people.Count));
        people.Clear();
    }
}
You can pre-allocate your RootObject, add listeners with the necessary logic, and merge the contents of the serialization stream into the root:
// Allocate a new RootObject
var newRoot = new RootObject();

// Add listeners to process chunks of Person objects as they are added.
newRoot.People.CollectionChanged += (o, e) =>
{
    // Process each chunk of 5000.
    var collection = (ICollection<Person>)o;
    ProcessItems(collection, false);
};
newRoot.OnDeserialized += (o, e) =>
{
    // Forcibly process any remaining items, no matter how many.
    ProcessItems(((RootObject)o).People, true);
};

// Deserialize from the stream onto the pre-allocated newRoot.
Serializer.Merge(stream, newRoot);
As required, ProcessItems will be called every time an object is added to the collection, processing the items in increments of 5000 and then processing any remainder unconditionally.
Now, the only question is whether protobuf-net loads the entire stream into memory before deserializing the collection, or whether it does streaming deserialization. As it turns out, it does the latter, as shown by this sample fiddle, in which the stream position is gradually incremented as the items in the People collection are added, processed, and removed.
Here I added the listeners to RootObject manually before deserialization. If you were to add them in the constructor itself, you could use ProtoBuf.Serializer.Deserialize<RootObject>(Stream stream) instead of Serializer.Merge onto a pre-allocated root object, which might be easier to integrate into your current architecture.
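For example, a sketch of that variant (assuming ProcessItems is reachable from RootObject):
public RootObject()
{
    this.People = new ObservableCollection<Person>();
    // Wiring the listeners up here means the plain Deserialize overload drives them.
    this.People.CollectionChanged += (o, e) => ProcessItems((ICollection<Person>)o, false);
    this.OnDeserialized += (o, e) => ProcessItems(((RootObject)o).People, true);
}
With that in place, Serializer.Deserialize<RootObject>(stream) fires both callbacks as the stream is read.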
Incidentally, this technique should work with XmlSerializer and Json.NET as well.
Which of the following approaches is better? What I mean is: is it better to copy the stream's data locally, close the stream, and then do whatever operations are needed on the data, or to just perform the operations while the stream is open? Assume that the input from the stream is huge.
First method:
public static int calculateSum(string filePath)
{
    int sum = 0;
    var list = new List<int>();
    using (StreamReader sr = new StreamReader(filePath))
    {
        while (!sr.EndOfStream)
        {
            list.Add(int.Parse(sr.ReadLine()));
        }
    }
    foreach (int item in list)
        sum += item;
    return sum;
}
Second method:
public static int calculateSum(string filePath)
{
    int sum = 0;
    using (StreamReader sr = new StreamReader(filePath))
    {
        while (!sr.EndOfStream)
        {
            sum += int.Parse(sr.ReadLine());
        }
    }
    return sum;
}
If the file is modified often, read the data in first and then work with your copy. If it is not accessed often by anything else, you are fine reading the file one line at a time and working with each line separately.
In general, if you can do it in a single pass, then do it in a single pass. You indicate that the input is huge, so it might not all fit into memory. If that's the case, then your first option isn't even possible.
Of course, there are exceptions to every rule of thumb. But you don't indicate that there's anything special about the file or the access pattern (other processes wanting to access it, for example) that prevents you from keeping it open longer than absolutely necessary to copy the data.
I don't know if your example is a real-world scenario or if you're just using the sum thing as a placeholder for more complex processing. In any case, if you're processing a file line-by-line, you can save yourself a lot of trouble by using File.ReadLines:
int sum = 0;
foreach (var line in File.ReadLines(filePath))
{
    sum += int.Parse(line);
}
This does not read the entire file into memory at once. Rather, it uses an enumerator to present one line at a time, and only reads as much as it must to maintain a relatively small (probably four kilobyte) buffer.
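If you prefer LINQ, the same single pass can be written as a one-liner (equivalent to the loop above):
using System.IO;
using System.Linq;

int sum = File.ReadLines(filePath).Sum(int.Parse); // still streams one line at a time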
I am creating a Windows Forms application in which I select a folder that contains multiple *.txt files. Their size may vary from a few thousand lines (kilobytes) up to 50 million lines (about 1 GB). Every line of a file holds three pieces of information: the date as a long, the location id as an int, and the value as a float, all separated by semicolons (;). I need to find the minimum and maximum value across all those files, report which file each is in, and then find the most frequent value.
I already have these files verified and stored in an ArrayList. I open a thread to read the files one by one, and I read the data line by line. It works fine, but with 1 GB files I run out of memory. I tried to store the values in a dictionary, where the key would be the date and the value would be an object containing all the info loaded from the line along with the filename, but at about 6 million values I ran out of memory. So I should probably do it with multiple threads. I thought I could run two threads: one that reads the file and puts the info into some kind of container, and another that reads from it, does the calculations, and then deletes the values from the container. But I don't know which container could do such a thing. Moreover, I need to calculate the most frequent value, so the values need to be stored somewhere, which leads me back to some kind of dictionary, and I already know I will run out of memory. I don't have much experience with threads either, so I don't know what is possible. Here is my code so far:
GUI:
namespace STI {
    public partial class GUI : Form {
        private String path = null;
        public static ArrayList txtFiles;
        public static GUI _GUI1; // set in the constructor; used by the worker thread for updates

        public GUI() {
            InitializeComponent();
            _GUI1 = this;
        }

        // I run it in a thread. I thought I would run the second one here
        // that would work with the values put into some container.
        private void buttonRun_Click(object sender, EventArgs e) {
            ThreadDataProcessing processing = new ThreadDataProcessing();
            Thread t_process = new Thread(processing.runProcessing);
            t_process.Start();

            //ThreadDataCalculating calculating = new ThreadDataCalculating();
            //Thread t_calc = new Thread(calculating.runCalculation());
            //t_calc.Start();
        }
    }
}
ThreadProcessing.cs
namespace STI.thread_package {
    class ThreadDataProcessing {
        public static Dictionary<long, object> finalMap = new Dictionary<long, object>();

        public void runProcessing() {
            foreach (FileInfo file in GUI.txtFiles) {
                using (FileStream fs = File.Open(file.FullName.ToString(), FileMode.Open))
                using (BufferedStream bs = new BufferedStream(fs))
                using (StreamReader sr = new StreamReader(bs)) {
                    String line;
                    String[] splitted;
                    try {
                        while ((line = sr.ReadLine()) != null) {
                            splitted = line.Split(';');
                            if (splitted.Length == 3) {
                                long date = long.Parse(splitted[0]);
                                int location = int.Parse(splitted[1]);
                                float value = float.Parse(splitted[2], CultureInfo.InvariantCulture);
                                Entry entry = new Entry(date, location, value, file.Name);
                                if (!finalMap.ContainsKey(entry.getDate())) {
                                    finalMap.Add(entry.getDate(), entry);
                                }
                            }
                        }
                        GUI._GUI1.update("File \"" + file.Name + "\" completed\n");
                    }
                    catch (FormatException) {
                        GUI._GUI1.update("Wrong file format.");
                    }
                    catch (OutOfMemoryException) {
                        GUI._GUI1.update("Out of memory");
                    }
                }
            }
        }
    }
}
and the object in which I put the values from lines:
Entry.cs
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace STI.entities_package {
    class Entry {
        private long date;
        private int location;
        private float value;
        private String fileName;
        private int count;

        public Entry(long date, int location, float value, String fileName) {
            this.date = date;
            this.location = location;
            this.value = value;
            this.fileName = fileName;
            this.count = 1;
        }

        public long getDate() {
            return date;
        }

        public int getLocation() {
            return location;
        }

        public String getFileName() {
            return fileName;
        }
    }
}
I don't think that multithreading is going to help you here - it could help you separate the IO-bound tasks from the CPU-bound tasks, but your CPU-bound tasks are so trivial that I don't think they warrant their own thread. All multithreading is going to do is unnecessarily increase the problem complexity.
Calculating the min/max in constant memory is trivial: just maintain minFile and maxFile variables that get updated whenever the current file contains a value smaller than the running minimum or larger than the running maximum.
Finding the most frequent value requires more memory, but with only a few million distinct values you ought to have enough RAM to store a Dictionary<float, int> that maintains the frequency of each value; afterwards you iterate through the map to determine which value had the highest frequency. (If you are running out of memory, make sure your files are being closed and garbage collected, because a Dictionary<float, int> with a few million entries ought to fit in less than a gigabyte of RAM.)
If for some reason you still don't have enough RAM, you can make multiple passes over the files: on the first pass, store counts in a Dictionary<interval, int> where you've split the range between MIN_FLOAT and MAX_FLOAT into a few thousand sub-intervals; on the next pass, ignore all values that didn't fall into the interval with the highest frequency, thus shrinking the dictionary. However, the Dictionary<float, int> ought to fit into memory, so unless you start processing billions of distinct values instead of millions you probably won't need a multi-pass procedure.
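A minimal sketch of that single-pass idea (illustrative only; it assumes the semicolon-separated date;location;value line format from the question):
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

static class FileStats {
    public static void Analyze(IEnumerable<string> files) {
        float min = float.MaxValue, max = float.MinValue;
        string minFile = null, maxFile = null;
        var frequencies = new Dictionary<float, int>(); // value -> occurrence count

        foreach (string file in files) {
            foreach (string line in File.ReadLines(file)) { // one line at a time, constant memory
                string[] parts = line.Split(';');
                if (parts.Length != 3) continue;
                float value = float.Parse(parts[2], CultureInfo.InvariantCulture);

                if (value < min) { min = value; minFile = file; }
                if (value > max) { max = value; maxFile = file; }

                int count;
                frequencies.TryGetValue(value, out count);
                frequencies[value] = count + 1;
            }
        }

        // Single scan over the dictionary for the most frequent value.
        float mostFrequent = 0f;
        int bestCount = 0;
        foreach (var pair in frequencies) {
            if (pair.Value > bestCount) { bestCount = pair.Value; mostFrequent = pair.Key; }
        }

        Console.WriteLine("Min {0} in {1}, max {2} in {3}, most frequent {4} ({5} times)",
            min, minFile, max, maxFile, mostFrequent, bestCount);
    }
}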
I am using the ProtoWriter/ProtoReader classes to implement something similar to the DataTableSerializer included with the Protobuf-net source. One difference is that after the initial transfer of the table contents all future updates are serialised incrementally.
Currently I'm not disposing the ProtoWriter instance until the program ends (as I want all future updates to be serialised with the same writer). This has the effect of delaying all writing to the output stream until the internal buffer size of 1024 bytes is reached.
Should I be creating a new ProtoWriter for each incremental update? Is there another way to force the writer to write to the stream?
Sample code:
private readonly ProtoWriter _writer;

private void WriteUpdate(IEnumerable<IReactiveColumn> columns, int rowIndex)
{
    // Start the row group
    ProtoWriter.WriteFieldHeader(ProtobufOperationTypes.Update, WireType.StartGroup, _writer);
    var token = ProtoWriter.StartSubItem(rowIndex, _writer);
    var rowId = rowIndex;

    // Send the row id so that it can be matched against the local row id at the other end.
    ProtoWriter.WriteFieldHeader(ProtobufFieldIds.RowId, WireType.Variant, _writer);
    ProtoWriter.WriteInt32(rowId, _writer);

    foreach (var column in columns)
    {
        var fieldId = _columnsToFieldIds[column.ColumnId];
        WriteColumn(column, fieldId, rowId);
    }

    ProtoWriter.EndSubItem(token, _writer);
}
Interesting question. The flush method isn't exposed because internally it is not always the case that it is appropriate to flush, but I guess there's not a huge reason not to expose this and just let it no-op. On the other hand:
it is already a lightweight wrapper around a stream: you could dispose and recreate it
or you could just keep writing and make full use of the extra buffering
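If you take the dispose-and-recreate route, a rough sketch of the idea follows (column writing elided; it reuses the identifiers from the sample above, _stream is assumed to be the shared output stream, and the exact ProtoWriter constructor/Close signatures vary between protobuf-net versions, so adjust to the one you are using):
private void WriteUpdateAndFlush(int rowIndex)
{
    // Short-lived writer per incremental update; closing it pushes its
    // buffered bytes through to the underlying stream.
    var writer = new ProtoWriter(_stream, null, null);
    try
    {
        ProtoWriter.WriteFieldHeader(ProtobufOperationTypes.Update, WireType.StartGroup, writer);
        var token = ProtoWriter.StartSubItem(rowIndex, writer);
        ProtoWriter.WriteFieldHeader(ProtobufFieldIds.RowId, WireType.Variant, writer);
        ProtoWriter.WriteInt32(rowIndex, writer);
        ProtoWriter.EndSubItem(token, writer);
    }
    finally
    {
        writer.Close(); // flush to _stream
    }
}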