I've been looking to do some binary serialization to file and protobuf-net seems like a well-performing alternative. I'm a bit stuck in getting started though. Since I want to decouple the definition of the classes from the actual serialization I'm not using attributes but opting to go with .proto files, I've got the structure for the object down (I think)
message Post {
required uint64 id = 1;
required int32 userid = 2;
required string status= 3;
required datetime created = 4;
optional string source= 5;
}
(is datetime valid or should I use ticks as int64?)
but I'm stuck on how to use protogen and then serialize a IEnumerable of Post to a file and read it back. Any help would be appreciated
Another related question, is there any best practices for detecting corrupted binary files, like if the computer is shut down while serializing
Re DateTime... this isn't a standard proto type; I have added a BCL.DateTime (or similar) to my own library, which is intended to match the internal serialization that protobuf-net uses for DateTime, but I'm fairly certain I haven't (yet) updated the code-generator to detect this as a special-case. It would be fairly easy to add if you want me to try... If you want maximum portability, a "ticks" style approach might be pragmatic. Let me know...
Re serializing to a file - if should be about the same as the Getting Started example, but note that protobuf-net wants to work with data it can reconstruct; just IEnumerable<T> might cause problems - IList<T> should be fine, though (it'll default to List<T> as a concrete type when reconstructing).
Re corruption - perhaps use SerializeWithLengthPrefix - it can then detect issues even at a message boundary (where they are otherwise undetectable as an EOF). This (as the name suggests) writes the length first, so it knows whether is has enough data (via DeserializeWithLengthPrefix). Alternatively, reserve the first [n] bytes in your file for a hash / checksum. Write this blank spacer, then the data, calculate the hash / checksum and overwrite the start. Verify during deserialization. Much more work.
Related
I need guidance, someone to point me in the right direction. As the tittle says, I need to save information to a file: Date, string, integer and an array of integers. And I also need to be able to access that information later, when an user wants to review it.
Optional: File is plain text and I can directly check it and it is understandable.
Bonus points if chosen method can be "easily" converted to working with a database in the future instead of individual files.
I'm pretty new to C# and what I've found so far is that I should turn the array into a string with separators.
So, what'd you guys suggest?
// JSON.Net
string json = JsonConvert.SerializeObject(objOrArray);
File.WriteAllText(path, json);
// (note: can also use File.Create etc if don't need the string in memory)
or...
using(var file = File.Create(path)) { // protobuf-net
Serializer.Serialize(file, objOrArray);
}
The first is readable; the second will be smaller. Both will cope fine with "Date, string, integer and an array of integers", or an array of such objects. Protobuf-net would require adding some attributes to help it, but really simple.
As for working with a database as columns... the array of integers is the glitch there, because most databases don't support "array of integers" as a column type. I'd say "separation of concerns" - have a separate model for DB persistence. If you are using the database purely to store documents, then: pretty much every DB will support CLOB and BLOB data, so either is usable. Many databases now have inbuilt JSON support (helper methods, etc), which might make JSON as a CLOB more tempting.
I would probably serialize this to json and save it somewhere. Json.Net is a very popular way.
The advantage of this is also creating a class that can be later used to work with an Object-Relational Mapper.
var userInfo = new UserInfoModel();
// write the data (overwrites)
using (var stream = new StreamWriter(#"path/to/your/file.json", append: false))
{
stream.Write(JsonConvert.SerializeObject(userInfo));
}
//read the data
using (var stream = new StreamReader(#"path/to/your/file.json"))
{
userInfo = JsonConvert.DeserializeObject<UserInfoModel>(stream.ReadToEnd());
}
public class UserInfoModel
{
public DateTime Date { get; set; }
// etc.
}
for the Plaintext File you're right.
Use 1 Line for each Entry:
Date
string
Integer
Array of Integer
If you read the File in your code you can easily seperate them by reading line to line.
Make a string with a specific Seperator out of the Array:
[1,2,3] -> "1,2,3"
When you read the line you can Split the String by "," and gets a Array of Strings. Parse each Entry to int into an Array of Int with the same length.
How to read and write the File get a look at Easiest way to read from and write to files
If you really wants the switch to a database at a point, try a JSON Format for your File. It is easy to handle and there are some good Plugins to work with.
Mfg
Henne
The way I got started with C# is via the game Space Engineers from the Steam Platform, the Mods need to save a file Locally (%AppData%\Local\Temp\SpaceEngineers\ or %AppData%\Roaming\SpaceEngineers\Storage\) for various settings, and their logging is similar to what #H. Sandberg mentioned (line by line, perhaps a separator to parse with later), the upside to this is that it's easy to retrieve, easy to append, easy to overwrite, and I'm pretty sure it's even possible to retrieve File Size, which when combined with File Deletion and File Creation can prevent runaway File Sizes as this allows you to set an Upper Limit to check against, allowing you to run it on a Server with minimal impact (probably best to include a minimum Date filter {make sure X is at least Y days old before deleting it for being over Z Bytes} to prevent Debugging Data Loss {"Why was it over that limit?"})
As far as the actual Code behind the idea, I'm approximately at the same Skill Level as the OP, which is to say; Rookie, but I would advise looking at the Coding in the Space Engineers Mods for some Samples (plus it's not half bad for a Beta Game), as they are almost all written in C#. Also, the Programmable Blocks compile in C# as well, so you'll be able to use that to both assist in learning C# and reinforce and utilize what you already know (although certain C# commands aren't allowed for security reasons, utilizing the Mod API you'll have more flexibility to do things such as Creating/Maintaining Log Files, Retrieving/Modifying Object Properties, etc.), You are even capable of printing Text to various in Game Text Monitors.
I apologise if my Syntax needs some work, and I'm sorry I am not currently capable of just whipping up some Code to solve your issue, but I do know
using System;
Console.WriteLine("Hello World");
so at least it's not a total loss, but my example Code likely won't compile, since it's likely missing things like: an Output Location, perhaps an API reference or two, and probably a few other settings. Like I said, I'm New, but that is a valid C# Command, I know I got that part correct.
Edit: here's a better attempt:
using System;
class Test
{
static void Main()
{
string a = "Hello Hal, ";
string b = "Please open the Airlock Doors.";
string c = "I'm sorry Dave, "
string d = "I'm afraid I can't do that."
Console.WriteLine(a + b);
Console.WriteLine(c + d);
Console.Read();
}
}
This:
"Hello Hal, Please open the Airlock Doors."
"I'm sorry Dave, I'm afraid I can't do that."
Should be the result. (the "Quotation Marks" shouldn't appear in the readout {the last Code Block}, that's simply to improve readability)
I have a JSON string which looks like:
{"Detail": [
{"PrimaryKey":111,"Date":"2016-09-01","Version":"7","Count":2,"Name":"Windows","LastAccessTime":"2016-05-25T21:49:52.36Z"},
{"PrimaryKey":222,"Date":"2016-09-02","Version":"8","Count":2,"Name":"Windows","LastAccessTime":"2016-07-25T21:49:52.36Z"},
{"PrimaryKey":333,"Date":"2016-09-03","Version":"9","Count":3,"Name":"iOS","LastAccessTime":"2016-08-22T21:49:52.36Z"},
.....( *many values )
]}
The array Detail has lots of PrimaryKeys. Sometimes, it is about 500K PrimaryKeys. The system we use can only process JSON strings with certain length, i.e. 128KB. So I have to split this JSON string into segments (each one is 128KB or fewer chars in length).
Regex reg = new Regex(#"\{"".{0," + (128*1024).ToString() + #"}""\}");
MatchCollection mc = reg.Matches(myListString);
Currently, I use regular expression to do this. It works fine. However, it uses too much memory. Is there a better way to do this (unnecessary to be regular expression)?
*** Added more info.
The 'system' I mentioned above is Azure DocumentDB. By default, the document can only be 512KB (as now). Although we can request MS increase this, but the json file we got always much much more than 512KB. That's why we need to figure out a way to do this.
If possible, we want to keep using documentDB, but we are open to other suggestions.
*** Some info to make things clear: 1) the values in the array are different. Not duplicated. 2) Yes, I use StringBuilder whenever I can. 3) Yes, I tried IndexOf & Substring, but based on tests, the performance is not better than regular expression in this case (although it could be the way I implement it).
* **the json object is complex, but all I care is this "Detail" which is an array. We can assume the string is just like the example, only has "Detail". We need to split this json array string into size smaller than 512KB. Basically, we can think this as a simple string, not json. but, it is a json format, so maybe some libraries can do this better.
Take a look at Json.NET (available via NuGet).
It has a JsonReader class, which allows you to create a required object by reading json by token, example of json reading with JsonReader. Not that if you pass invalid json string (e.g. without "end array" character or without "end object" character) to JsonReader - it will throw an exception only when it reaches invalid item, so you can pass different substrings to it.
Also, I guess that your system has something similar to JsonReader, so you can use it.
Reading a string with StringReader should not require too much application memory and it should be faster then iterating through regular expression matches.
Here is a hacky solution assuming data contains your JSON data:
var details = data
.Split('[')[1]
.Split(']')[0]
.Split(new[] { "}," }, StringSplitOptions.None)
.Select(d => d.Trim())
.Select(d => d.EndsWith("}") ? d : d + "}");;
foreach (var detail in details)
{
// Now process "detail" with your JSON library.
}
Working example: https://dotnetfiddle.net/sBQjyi
Obviously you should only do this if you really can't use a normal JSON library. See Mikhail Neofitov's answer for library suggestions.
If you are reading the JSON data from file or network you should implement a more stream-like processing where you read one details line, deserialize it with your JSON library and yield it to the caller. When the caller requests the next detail object, read the next line, deserialize it and so on. This way you can minimize the memory footprint of your deserializer.
You might want to consider storing each detail in a separate document. It means two round trips to get both the header and all of the detail documents, but it means you are never dealing with a really large JSON document. Also, if Detail is added to incrementally, it'll be much more efficient for writes because there is no way to just add another row. You have to rewrite the entire document. Your read/write ratio will determine the break even point in overall efficiency.
Another argument for this is that the complexity of regex parsing, feeding it through your JSON parser, then reassembling it goes away. You never know if your regex parser will deal with all cases (commas inside of quotes, international characters, etc.). I've seen many folks think they have a good regex only to find odd cases in production.
If your Detail array can grow unbounded (or even with a large bound), then you should definitely make this change regardless of your JSON parser limitations or read/write ratio because eventually, you'll exceed the limit.
I have a list of objects (let's say class AccessLevel) that is being serialized using Protofbuf-Net.
The objects are not fixed-sized, is it possible to update a single object in the serialized file (based on index) without rewriting the whole file?
If the change makes it smaller or doesn't impact the size: probably, but there is nothing in the library to help you do this, as it is not a supported scenario. For same-length: just overwrite. Of course, knowing the length in advance is a trick :)
At the protocol level, when reducing size: you could pad data out by faking an unused field, or by using suboptimal varint encoding of existing fields (spare bytes with nothing except the continuation bit).
If it gets bigger: no amount of trickery is going to save you from having to rework the entire file.
These are both theoretical. A more practical answer is provably: no.
I am saving a series of protobuf-net objects in a database cell as a Byte[] of length-prefixed protobuf-net objects:
//retrieve existing protobufs from database and convert to Byte[]
object q = sql_agent_cmd.ExecuteScalar();
older-pbfs = (Byte[])q;
// serialize the new pbf to add into MemoryStream m
//now write p and the new pbf-net Byte[] into a memory stream and retrieve the sum
var s = new System.IO.MemoryStream();
s.Write(older-pbfs, 0, older-pbfs.Length);
s.Write(m.GetBuffer(), 0, m.ToArray().Length); // append new bytes at the end of old
Byte[] sum-pbfs = s.ToArray();
//sum-pbfs = old pbfs + new pbf. Insert sum-pbfs into database
This works fine. My concern is what happens if there is slight db corruption. It will no longer be possible to know which byte is the length prefix and the entire cell contents would have to be discarded. Wouldn't it be advisable to also use some kind of a end-of-pbf-object indicator (kind of like the \n or EOF indicators used in text files). This way even if a record gets corrupted, the other records would be recoverable.
If so, what is the recommended way to add end-of-record indicators at the end of each pbf.
Using protobuf-netv2 and C# on Visual Studio 2010.
Thanks
Manish
If you use a vanilla message via Serialize / Deserialize, then no: that isn't part of the specification (because the format is designed to be appendable).
If, however, you use SerializeWithLengthPrefix, it will dump the length at the start of the message; it will then know in advance how much data is expected. You deserialize with DeserializeWithLengthPrefix, and it will complain loudly if it doesn't have enough data. However! It will not complain if you have extra data, since this too is designed to be appendable.
In terms of Jon's reply, the default usage of the *WithLengthPrefix method is in terms of the data stored exactly identical to what Jon suggests; it pretends there is a wrapper object and behaves accordingly. The differences are:
no wrapper object actually exists
the "withlengthprefix" methods explicitly stop after a single occurrence, rather than merging any later data into the same object (useful for, say, sending multiple discreet objects to a single file, or down a single socket)
The difference in the two "appendable"s here is that the first means "merge into a single object", where-as the second means "I expect multiple records".
Unrelated suggestion:
s.Write(m.GetBuffer(), 0, m.ToArray().Length);
should be:
s.Write(m.GetBuffer(), 0, (int)m.Length);
(no need to create an extra buffer)
(Note: I don't know much about protobuf-net itself, but this is generally applicable to Protocol Buffer messages.)
Typically if you want to record multiple messages, it's worth just putting a "wrapper" message - make the "real" message a repeated field within that. Each "real" message will then be length prefixed by the natural wire format of Protocol Buffers.
This won't detect corruption of course - but to be honest, if the database ends up getting corrupted you've got bigger problems. You could potentially detect corruption, e.g. by keeping a hash along with each record... but you need to consider the possibility of the corruption occurring within the length prefix, or within the hash itself. Think about what you're really trying to achieve here - what scenarios you're trying to protect against, and what level of recovery you need.
I've been trying to deal with some delimited text files that have non standard delimiters (not comma/quote or tab delimited). The delimiters are random ASCII characters that don't show up often between the delimiters. After searching around, I've seem to have only found no solutions in .NET will suit my needs and the custom libraries that people have written for this seem to have some flaws when it comes to gigantic input (4GB file with some field values having very easily several million characters).
While this seems to be a bit extreme, it is actually a standard in the Electronic Document Discovery (EDD) industry for some review software to have field values that contain the full contents of a document. For reference, I've previously done this in python using the csv module with no problems.
Here's an example input:
Field delimiter =
quote character = þ
þFieldName1þþFieldName2þþFieldName3þþFieldName4þ
þValue1þþValue2þþValue3þþSomeVery,Very,Very,Large value(5MB or so)þ
...etc...
Edit:
So I went ahead and created a delimited file parser from scratch. I'm kind of weary using this solution as it may be prone to bugs. It also doesn't feel "elegant" or correct to have to write my own parser for a task like this. I also have a feeling that I probably didn't have to write a parser from scratch for this anyway.
Use the File Helpers API. It's .NET and open source. It's extremely high performance using compiled IL code to set fields on strongly typed objects, and supports streaming.
It supports all sorts of file types and custom delimiters; I've used it to read files larger than 4GB.
If for some reason that doesn't do it for you, try just reading line by line with a string.split:
public IEnumerable<string[]> CreateEnumerable(StreamReader input)
{
string line;
while ((line = input.ReadLine()) != null)
{
yield return line.Split('þ');
}
}
That'll give you simple string arrays representing the lines in a streamy fashion that you can even Linq into ;) Remember however that the IEnumerable is lazy loaded, so don't close or alter the StreamReader until you've iterated (or caused a full load operation like ToList/ToArray or such - given your filesize however, I assume you won't do that!).
Here's a good sample use of it:
using (StreamReader sr = new StreamReader("c:\\test.file"))
{
var qry = from l in CreateEnumerable(sr).Skip(1)
where l[3].Contains("something")
select new { Field1 = l[0], Field2 = l[1] };
foreach (var item in qry)
{
Console.WriteLine(item.Field1 + " , " + item.Field2);
}
}
Console.ReadLine();
This will skip the header line, then print out the first two field from the file where the 4th field contains the string "something". It will do this without loading the entire file into memory.
Windows and high performance I/O means, use IO Completion ports. You may have todo some extra plumbing to get it working in your case.
This is with the understanding that you want to use C#/.NET, and according to Joe Duffy
18) Don’t use Windows Asynchronous Procedure Calls (APCs) in managed
code.
I had to learn that one the hard way ;), but ruling out APC use, IOCP is the only sane option. It also supports many other types of I/O, frequently used in socket servers.
As far as parsing the actual text, check out Eric White's blog for some streamlined stream use.
I would be inclined to use a combination of Memory Mapped Files (msdn point to a .NET wrapper here) and a simple incremental parse, yielding back to an IEnumerable list of your record / text line (or whatever)
You mention that some fields are very very big, if you try to read them in their entirety to memory you may be getting yourself into trouble. I would read through the file in 8K (or small chunks), parse the current buffer, keep track of state.
What are you trying to do with this data that you are parsing? Are you searching for something? Are you transforming it?
I don't see a problem with you writing a custom parser. The requirements seem sufficiently different to anything already provided by the BCL, so go right ahead.
"Elegance" is obviously a subjective thing. In my opinion, if your parser's API looks and works like a standard BCL "reader"-type API, then that is quite "elegant".
As for the large data sizes, make your parser work by reading one byte at a time and use a simple state machine to work out what to do. Leave the streaming and buffering to the underlying FileStream class. You should be OK with performance and memory consumption.
Example of how you might use such a parser class:
using(var reader = new EddReader(new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, 8192)) {
// Read a small field
string smallField = reader.ReadFieldAsText();
// Read a large field
Stream largeField = reader.ReadFieldAsStream();
}
While this doesn't help address the large input issue, a possible solution to the parsing issue might include a custom parser that users the strategy pattern to supply a delimiter.