How to split the string more efficiently? - c#

I have a JSON string which looks like:
{"Detail": [
{"PrimaryKey":111,"Date":"2016-09-01","Version":"7","Count":2,"Name":"Windows","LastAccessTime":"2016-05-25T21:49:52.36Z"},
{"PrimaryKey":222,"Date":"2016-09-02","Version":"8","Count":2,"Name":"Windows","LastAccessTime":"2016-07-25T21:49:52.36Z"},
{"PrimaryKey":333,"Date":"2016-09-03","Version":"9","Count":3,"Name":"iOS","LastAccessTime":"2016-08-22T21:49:52.36Z"},
.....( *many values )
]}
The array Detail has lots of PrimaryKeys; sometimes about 500K of them. The system we use can only process JSON strings up to a certain length, e.g. 128KB, so I have to split this JSON string into segments (each one 128KB or fewer characters in length).
Regex reg = new Regex(@"\{"".{0," + (128*1024).ToString() + @"}""\}");
MatchCollection mc = reg.Matches(myListString);
Currently, I use a regular expression to do this. It works fine, but it uses too much memory. Is there a better way to do this (it doesn't have to be a regular expression)?
*** Added more info.
The 'system' I mentioned above is Azure DocumentDB. By default, a document can only be 512KB (as of now). Although we could ask Microsoft to increase this limit, the JSON files we get are always much, much larger than 512KB. That's why we need to figure out a way to do this.
If possible, we want to keep using DocumentDB, but we are open to other suggestions.
*** Some info to make things clear: 1) the values in the array are all different, not duplicated. 2) Yes, I use StringBuilder whenever I can. 3) Yes, I tried IndexOf & Substring, but based on my tests the performance was no better than the regular expression in this case (though that could be down to my implementation).
*** The JSON object is complex, but all I care about is "Detail", which is an array. We can assume the string looks just like the example and only has "Detail". We need to split this JSON array string into pieces smaller than 512KB. Basically, we can treat this as a plain string rather than JSON; but since it is JSON, maybe some library can do this better.

Take a look at Json.NET (available via NuGet).
It has a JsonReader class, which allows you to create the required object by reading the JSON token by token (see the example of reading JSON with JsonReader in its documentation). Note that if you pass an invalid JSON string (e.g. one without the "end array" or "end object" character) to JsonReader, it will throw an exception only when it reaches the invalid item, so you can pass different substrings to it.
Also, I guess that your system has something similar to JsonReader, so you can use it.
Reading a string with StringReader should not require too much application memory, and it should be faster than iterating through regular expression matches.
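For illustration, here is a minimal sketch (not tested against your data) of how JsonTextReader could be used to walk the Detail array and emit size-limited chunks. DetailSplitter and maxChars are made-up names, and the length check ignores the few characters of {"Detail":[...]} wrapper overhead:
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

static class DetailSplitter
{
    // Splits {"Detail":[ ... ]} into JSON strings of roughly maxChars or fewer characters each.
    public static IEnumerable<string> Split(TextReader input, int maxChars)
    {
        using (var reader = new JsonTextReader(input))
        {
            // Advance to the start of the "Detail" array.
            while (reader.Read() && reader.TokenType != JsonToken.StartArray) { }

            var batch = new JArray();
            int batchLength = 0;

            while (reader.Read() && reader.TokenType == JsonToken.StartObject)
            {
                // Load exactly one {...} element of the array into memory.
                JObject item = JObject.Load(reader);
                string itemJson = item.ToString(Formatting.None);

                if (batch.Count > 0 && batchLength + itemJson.Length > maxChars)
                {
                    yield return new JObject(new JProperty("Detail", batch)).ToString(Formatting.None);
                    batch = new JArray();
                    batchLength = 0;
                }

                batch.Add(item);
                batchLength += itemJson.Length + 1; // +1 for the separating comma
            }

            if (batch.Count > 0)
                yield return new JObject(new JProperty("Detail", batch)).ToString(Formatting.None);
        }
    }
}
Only one batch is held in memory at a time, which is the point of reading token by token instead of matching the whole string with a regex.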

Here is a hacky solution assuming data contains your JSON data:
var details = data
    .Split('[')[1]
    .Split(']')[0]
    .Split(new[] { "}," }, StringSplitOptions.None)
    .Select(d => d.Trim())
    .Select(d => d.EndsWith("}") ? d : d + "}");
foreach (var detail in details)
{
// Now process "detail" with your JSON library.
}
Working example: https://dotnetfiddle.net/sBQjyi
Obviously you should only do this if you really can't use a normal JSON library. See Mikhail Neofitov's answer for library suggestions.
If you are reading the JSON data from a file or the network, you should implement more stream-like processing, where you read one detail line, deserialize it with your JSON library and yield it to the caller. When the caller requests the next detail object, read the next line, deserialize it, and so on. This way you can minimize the memory footprint of your deserializer.
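A rough sketch of that streaming approach with Json.NET (the Detail class here is hypothetical, with only a couple of the fields from the question):
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

public class Detail
{
    public int PrimaryKey { get; set; }
    public string Name { get; set; }
    // ... the other properties from the JSON
}

static class DetailStream
{
    // Lazily yields one Detail at a time; only the current element is kept in memory.
    public static IEnumerable<Detail> ReadDetails(TextReader input)
    {
        var serializer = new JsonSerializer();
        using (var reader = new JsonTextReader(input))
        {
            // Skip ahead to the start of the "Detail" array.
            while (reader.Read() && reader.TokenType != JsonToken.StartArray) { }

            // Deserialize one array element at a time.
            while (reader.Read() && reader.TokenType == JsonToken.StartObject)
                yield return serializer.Deserialize<Detail>(reader);
        }
    }
}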

You might want to consider storing each detail in a separate document. It means two round trips to get both the header and all of the detail documents, but it also means you are never dealing with a really large JSON document. Also, if Detail is appended to incrementally, separate documents will be much more efficient for writes, because with one big document there is no way to just add another row: you have to rewrite the entire document. Your read/write ratio will determine the break-even point in overall efficiency.
Another argument for this is that the complexity of regex parsing, feeding it through your JSON parser, then reassembling it goes away. You never know if your regex parser will deal with all cases (commas inside of quotes, international characters, etc.). I've seen many folks think they have a good regex only to find odd cases in production.
If your Detail array can grow unbounded (or even with a large bound), then you should definitely make this change regardless of your JSON parser limitations or read/write ratio because eventually, you'll exceed the limit.

Related

Read/Write array to a file

I need guidance, someone to point me in the right direction. As the title says, I need to save information to a file: Date, string, integer and an array of integers. And I also need to be able to access that information later, when a user wants to review it.
Optional: the file is plain text so I can check it directly and it is understandable.
Bonus points if chosen method can be "easily" converted to working with a database in the future instead of individual files.
I'm pretty new to C# and what I've found so far is that I should turn the array into a string with separators.
So, what'd you guys suggest?
// JSON.Net
string json = JsonConvert.SerializeObject(objOrArray);
File.WriteAllText(path, json);
// (note: can also use File.Create etc if don't need the string in memory)
or...
// protobuf-net
using (var file = File.Create(path))
{
    Serializer.Serialize(file, objOrArray);
}
The first is readable; the second will be smaller. Both will cope fine with "Date, string, integer and an array of integers", or an array of such objects. Protobuf-net would require adding some attributes to help it, but it's really simple.
As for working with a database as columns... the array of integers is the glitch there, because most databases don't support "array of integers" as a column type. I'd say "separation of concerns" - have a separate model for DB persistence. If you are using the database purely to store documents, then: pretty much every DB will support CLOB and BLOB data, so either is usable. Many databases now have inbuilt JSON support (helper methods, etc), which might make JSON as a CLOB more tempting.
I would probably serialize this to json and save it somewhere. Json.Net is a very popular way.
An added advantage of this is that you also create a class that can later be used with an Object-Relational Mapper.
var userInfo = new UserInfoModel();

// write the data (overwrites)
using (var stream = new StreamWriter(@"path/to/your/file.json", append: false))
{
    stream.Write(JsonConvert.SerializeObject(userInfo));
}

// read the data
using (var stream = new StreamReader(@"path/to/your/file.json"))
{
    userInfo = JsonConvert.DeserializeObject<UserInfoModel>(stream.ReadToEnd());
}

public class UserInfoModel
{
    public DateTime Date { get; set; }
    // etc.
}
For the plain text file you're right.
Use one line for each entry:
Date
string
Integer
Array of Integers
If you read the file in your code you can easily separate the entries by reading line by line.
Make a string with a specific separator out of the array:
[1,2,3] -> "1,2,3"
When you read the line back you can split the string by "," and get an array of strings. Parse each entry to int into an int array of the same length.
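A minimal sketch of that round trip (the variable names are just examples):
using System;
using System.Linq;

int[] numbers = { 1, 2, 3 };
string line = string.Join(",", numbers);   // gives "1,2,3"

int[] parsed = line.Split(',')
                   .Select(int.Parse)
                   .ToArray();             // back to { 1, 2, 3 }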
For how to read and write the file, take a look at Easiest way to read from and write to files.
If you really want to switch to a database at some point, try a JSON format for your file. It is easy to handle and there are some good libraries to work with.
Kind regards,
Henne
The way I got started with C# is via the game Space Engineers on Steam. Its mods need to save a file locally (%AppData%\Local\Temp\SpaceEngineers\ or %AppData%\Roaming\SpaceEngineers\Storage\) for various settings, and their logging is similar to what H. Sandberg mentioned (line by line, perhaps with a separator to parse later). The upside to this is that it's easy to retrieve, easy to append, and easy to overwrite. You can also retrieve the file size, which, combined with file deletion and file creation, lets you set an upper limit to check against and prevent runaway file sizes, allowing you to run it on a server with minimal impact (probably best to include a minimum date filter, e.g. make sure X is at least Y days old before deleting it for being over Z bytes, to prevent losing debugging data: "Why was it over that limit?").
As for the actual code behind the idea, I'm approximately at the same skill level as the OP, which is to say rookie, but I would advise looking at the code in the Space Engineers mods for some samples (plus it's not half bad for a beta game), as they are almost all written in C#. The Programmable Blocks compile C# as well, so you can use them both to assist in learning C# and to reinforce and utilize what you already know (although certain C# features aren't allowed for security reasons; using the Mod API you'll have more flexibility to do things such as creating/maintaining log files, retrieving/modifying object properties, etc.). You can even print text to various in-game text monitors.
I apologise if my syntax needs some work, and I'm sorry I'm not currently capable of just whipping up some code to solve your issue, but I do know
using System;
Console.WriteLine("Hello World");
so at least it's not a total loss, but my example code likely won't compile, since it's likely missing things like an output location, perhaps an API reference or two, and probably a few other settings. Like I said, I'm new, but that is a valid C# statement; I know I got that part correct.
Edit: here's a better attempt:
using System;

class Test
{
    static void Main()
    {
        string a = "Hello Hal, ";
        string b = "Please open the Airlock Doors.";
        string c = "I'm sorry Dave, ";
        string d = "I'm afraid I can't do that.";
        Console.WriteLine(a + b);
        Console.WriteLine(c + d);
        Console.Read();
    }
}
This:
"Hello Hal, Please open the Airlock Doors."
"I'm sorry Dave, I'm afraid I can't do that."
Should be the result. (the "Quotation Marks" shouldn't appear in the readout {the last Code Block}, that's simply to improve readability)

Compare 2 JSONs - The property count - c#

Assume a recurring job is generating huge JSON based on some nested structure. Suppose bad data comes in one day, and our new JSON is 30% smaller in size than the last good JSON.
My goal is to count all the main properties while deserializing the JSON and compare it from past count statistics and notify the JSON producer that new JSON is bad.
Is there a simple way to do this? Using C# and NewtonSoft.Json for deserialization.
For example: suppose the JSON encapsulates an n-ary tree-like structure. Each node contains a source field as a string (which can be null). Now, in the first run 100 nodes were present in the JSON and 70% of them contained a valid source string. In the second run there were 105 nodes with 80% valid source strings (GOOD data). In the third run there were 40 nodes with 20% valid source strings. I consider this "bad" data, and I want to fail here without ingesting the newer bad data.
Each time we run recurring code, I want to compare it against the last run's count statistics, and want to fail the code if the data is bad as described in example.
What you are describing is business logic and nothing the serializer should care about.
I think the best way to do this is to add a validation step between deserialization and processing of the deserialized data.
This means that the validator iterates over the deserialized POCO classes and counts how many nodes have a valid source string. In case they are below a certain threshold then execution should halt and the JSON producer should be notified.
If you want the threshold to be dynamic (i.e. it should depend on the previous count) then you have to store the counted value in a file or database to retrieve it when the next JSON arrives.
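A minimal sketch of such a validator, assuming a hypothetical Node class with a Source string and a Children list (the thresholds in the usage comment are placeholders you would replace with your stored statistics):
using System;
using System.Collections.Generic;

// Hypothetical node shape; substitute your real deserialized classes.
public class Node
{
    public string Source { get; set; }
    public List<Node> Children { get; set; }
}

public static class JsonValidator
{
    // Walks the tree and counts the nodes that carry a non-empty Source string.
    public static int CountValidSources(Node root, out int totalNodes)
    {
        int valid = 0;
        totalNodes = 0;
        var stack = new Stack<Node>();
        stack.Push(root);

        while (stack.Count > 0)
        {
            var node = stack.Pop();
            totalNodes++;
            if (!string.IsNullOrEmpty(node.Source))
                valid++;

            if (node.Children != null)
                foreach (var child in node.Children)
                    stack.Push(child);
        }
        return valid;
    }
}

// Usage: compare against the statistics stored from the last good run.
// int valid = JsonValidator.CountValidSources(root, out int total);
// if (total < 0.7 * lastGoodTotal || valid < 0.5 * total)
//     throw new InvalidOperationException("Incoming JSON looks bad - notify the producer.");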

parse hstore column in .net

I was wondering what was the best way to retrieve and handle hstore data in .net.
If I do a basic query, it will output a string formatted this way:
"key1" => "value1", "key2" => "value2"
It looks like KeyValuePair data, which I currently parse like this:
SimpleJson.SimpleJson
    .DeserializeObject<Dictionary<string, string>>("{" + tags.Replace("=>", ":") + "}");
I could do it manually with something like :
split first with ","
then loop and split with "=>"
then extract left to key and right to value
But then, what if there is a "," inside a value, or a whole element inside a value? Should I do recursive parsing? And so on with all the usual questions about parsing a string :)
Do you see a better way than the JSON trick?
You have a few options. I can't say whether any are better than what you have now. My recommendation actually would be to start beta testing 9.3 as soon as it is in beta and look at moving your code over there. 9.3's hstore has a cast to json, so you can just do it in the db. This may be a good reason to plan to be an early adopter. In essence in 9.3, you will just be able to:
SELECT my_hstore::json FROM mytable;
A second possibility is that you could create a structure in memory in PostgreSQL (for existing supported versions in 9.1 and 9.2) to write your own cast to json. For insight on how to do it, see Andrew's blog below.
A third possibility is to use existing functions of XML and write an XML cast. This is fairly easy. With a simple function you could easily do a:
SELECT my_hstore::xml FROM mytable;
The final possibility is to do string parsing in .net. This is the area I have the least knowledge of, but I would also suggest it is the last choice in large part because with the others, you have direct access to the hstore data interfaces, and also because the functions you would be using all have relatively tight contracts and a lot of other users.
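If you do end up parsing the string in .NET, here is a hedged sketch (HstoreParser is a made-up name). It handles commas and escaped quotes inside values as well as NULL values, though not every hstore corner case:
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HstoreParser
{
    // Matches "key" => "value" or "key" => NULL, allowing \" and \\ escapes inside the quotes.
    private static readonly Regex Pair = new Regex(
        @"""((?:[^""\\]|\\.)*)""\s*=>\s*(?:""((?:[^""\\]|\\.)*)""|NULL)",
        RegexOptions.Compiled);

    public static Dictionary<string, string> Parse(string hstore)
    {
        var result = new Dictionary<string, string>();
        foreach (Match m in Pair.Matches(hstore))
        {
            string key = Unescape(m.Groups[1].Value);
            string value = m.Groups[2].Success ? Unescape(m.Groups[2].Value) : null;
            result[key] = value;
        }
        return result;
    }

    private static string Unescape(string s) =>
        s.Replace("\\\"", "\"").Replace("\\\\", "\\");
}
Because commas between pairs are never inside the quoted groups, a "," embedded in a value is not a problem, which was the main worry with the manual split approach.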

C# better way to do this?

Hi I have this code below and am looking for a prettier/faster way to do this.
Thanks!
string value = "HelloGoodByeSeeYouLater";
string[] y = new string[] { "Hello", "You" };

foreach (string x in y)
{
    value = value.Replace(x, "");
}
You could do:
y.ToList().ForEach(x => value = value.Replace(x, ""));
Although I think your variant is more readable.
Forgive me, but someone's gotta say it,
value = Regex.Replace( value, string.Join("|", y.Select(Regex.Escape)), "" );
Possibly faster, since it creates fewer strings.
EDIT: Credit to Gabe and lasseespeholt for Escape and Select.
While not any prettier, there are other ways to express the same thing.
In LINQ:
value = y.Aggregate(value, (acc, x) => acc.Replace(x, ""));
With String methods:
value = String.Join("", value.Split(y, StringSplitOptions.None));
I don't think anything is going to be faster in managed code than a simple Replace in a foreach though.
It depends on the size of the string you are searching. The foreach example is perfectly fine for small operations, but it creates a new instance of the string on each replacement because strings are immutable. It also requires searching the whole string over and over again in a linear fashion.
The basic solutions have all been proposed. The Linq examples provided are good if you are comfortable with that syntax; I also liked the suggestion of an extension method, although that is probably the slowest of the proposed solutions. I would avoid a Regex unless you have an extremely specific need.
So let's explore more elaborate solutions and assume you needed to handle a string that was thousands of characters in length and had many possible words to be replaced. If this doesn't apply to the OP's need, maybe it will help someone else.
Method #1 is geared towards large strings with few possible matches.
Method #2 is geared towards short strings with numerous matches.
Method #1
I have handled large-scale parsing in C# using char arrays and pointer math with intelligent seek operations that are optimized for the length and potential frequency of the term being searched for. It follows the methodology of:
Extremely cheap peeks, one character at a time
Only investigate potential matches
Modify output when match is found
For example, you might read through the whole source array and only add words to the output when they are NOT found. This would remove the need to keep redimensioning strings.
A simple example of this technique is looking for a closing HTML tag in a DOM parser. For example, I may read an opening STYLE tag and want to skip through (or buffer) thousands of characters until I find a closing STYLE tag.
This approach provides incredibly high performance, but it's also incredibly complicated if you don't need it (plus you need to be well-versed in memory manipulation/management or you will create all sorts of bugs and instability).
I should note that the .NET string libraries are already incredibly efficient, but you can optimize this approach for your own specific needs and achieve better performance (and I have validated this firsthand).
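To make the idea concrete, here is a simplified, safe-code sketch (no pointers or char arrays) of the peek-then-investigate flow described above. Remover is a made-up name, and the order of the words matters if one search term is a prefix of another:
using System.Text;

static class Remover
{
    // Single pass over the source: copy characters to the output,
    // skipping any region that matches one of the search words.
    public static string RemoveAll(string source, string[] words)
    {
        var output = new StringBuilder(source.Length);
        int i = 0;
        while (i < source.Length)
        {
            int matchLength = 0;
            foreach (var word in words)
            {
                // Cheap peek: only compare further if the first character matches.
                if (word.Length > 0 && source[i] == word[0] &&
                    i + word.Length <= source.Length &&
                    string.CompareOrdinal(source, i, word, 0, word.Length) == 0)
                {
                    matchLength = word.Length;
                    break;
                }
            }

            if (matchLength > 0)
            {
                i += matchLength;          // skip the matched word
            }
            else
            {
                output.Append(source[i]);  // keep this character
                i++;
            }
        }
        return output.ToString();
    }
}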
Method #2
Another alternative involves storing search terms in a Dictionary containing Lists of strings. Basically, you decide how long your search prefix needs to be, and read characters from the source string into a buffer until you meet that length. Then, you search your dictionary for all terms that match that string. If a match is found, you explore further by iterating through that List, if not, you know that you can discard the buffer and continue.
Because the Dictionary matches strings based on hash, the search is non-linear and ideal for handling a large number of possible matches.
I'm using this methodology to allow instantaneous (<1ms) searching of every airfield in the US by name, state, city, FAA code, etc. There are 13K airfields in the US, and I've created a map of about 300K permutations (again, a Dictionary with prefixes of varying lengths, each corresponding to a list of matches).
For example, Phoenix, Arizona's main airfield is called Sky Harbor with the short ID of KPHX. I store:
KP
KPH
KPHX
Ph
Pho
Phoe
Ar
Ari
Ariz
Sk
Sky
Ha
Har
Harb
There is a cost in terms of memory usage, but string interning probably reduces this somewhat and the resulting speed justifies the memory usage on data sets of this size. Searching happens as the user types and is so fast that I have actually introduced an artificial delay to smooth out the experience.
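A stripped-down sketch of that prefix-dictionary idea (PrefixIndex is a made-up name; the real implementation described above indexes many more fields and permutations, and lookups shorter than two characters are simply not indexed here):
using System;
using System.Collections.Generic;

class PrefixIndex
{
    // Each 2..4 character prefix maps to the short list of terms that start with it,
    // so a lookup is one hash probe followed by a scan of a small candidate list.
    private readonly Dictionary<string, List<string>> _map =
        new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);

    public void Add(string term)
    {
        for (int len = 2; len <= Math.Min(4, term.Length); len++)
        {
            string prefix = term.Substring(0, len);
            if (!_map.TryGetValue(prefix, out var list))
                _map[prefix] = list = new List<string>();
            list.Add(term);
        }
    }

    public IEnumerable<string> Find(string typed)
    {
        string key = typed.Length > 4 ? typed.Substring(0, 4) : typed;
        if (!_map.TryGetValue(key, out var candidates))
            yield break;

        foreach (var candidate in candidates)
            if (candidate.StartsWith(typed, StringComparison.OrdinalIgnoreCase))
                yield return candidate;
    }
}

// Usage sketch: index.Add("KPHX"); index.Add("Sky Harbor"); index.Add("Phoenix");
// then index.Find("Pho") returns the candidates starting with "Pho".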
Send me a message if you have the need to dig into these methodologies.
Extension method for elegance
(arguably "prettier" at the call level)
I'll implement an extension method that allows you to call your implementation directly on the original string, as seen below.
value = value.Remove(y);
// or
value = value.Remove("Hello", "You");
// effectively
string value = "HelloGoodByeSeeYouLater".Remove("Hello", "You");
The extension method is callable on any string value in fact, and therefore easily reusable.
Implementation of Extension method:
I'm going to wrap your own implementation (shown in your question) in an extension method for pretty or elegant points, and also employ the params keyword to provide some flexibility in passing the arguments. You can substitute somebody else's faster implementation body into this method.
static class EXTENSIONS
{
    static public string Remove(this string thisString, params string[] arrItems)
    {
        // Whatever implementation you like:
        if (thisString == null)
            return null;

        var temp = thisString;
        foreach (string x in arrItems)
            temp = temp.Replace(x, "");
        return temp;
    }
}
That's the brightest idea I can come up with right now that nobody else has touched on.

What is the fastest way to parse text with custom delimiters and some very, very large field values in C#?

I've been trying to deal with some delimited text files that have non-standard delimiters (not comma/quote or tab delimited). The delimiters are random ASCII characters that don't show up often between the delimiters. After searching around, I seem to have found no solutions in .NET that suit my needs, and the custom libraries people have written for this seem to have flaws when it comes to gigantic input (a 4GB file with some field values easily having several million characters).
While this seems to be a bit extreme, it is actually a standard in the Electronic Document Discovery (EDD) industry for some review software to have field values that contain the full contents of a document. For reference, I've previously done this in python using the csv module with no problems.
Here's an example input:
Field delimiter =
quote character = þ
þFieldName1þþFieldName2þþFieldName3þþFieldName4þ
þValue1þþValue2þþValue3þþSomeVery,Very,Very,Large value(5MB or so)þ
...etc...
Edit:
So I went ahead and created a delimited file parser from scratch. I'm kind of wary of using this solution as it may be prone to bugs. It also doesn't feel "elegant" or correct to have to write my own parser for a task like this. I also have a feeling that I probably didn't have to write a parser from scratch for this anyway.
Use the File Helpers API. It's .NET and open source. It's extremely high performance using compiled IL code to set fields on strongly typed objects, and supports streaming.
It supports all sorts of file types and custom delimiters; I've used it to read files larger than 4GB.
If for some reason that doesn't do it for you, try just reading line by line with a string.split:
public IEnumerable<string[]> CreateEnumerable(StreamReader input)
{
    string line;
    while ((line = input.ReadLine()) != null)
    {
        yield return line.Split('þ');
    }
}
That'll give you simple string arrays representing the lines in a streamy fashion that you can even Linq into ;) Remember however that the IEnumerable is lazy loaded, so don't close or alter the StreamReader until you've iterated (or caused a full load operation like ToList/ToArray or such - given your filesize however, I assume you won't do that!).
Here's a good sample use of it:
using (StreamReader sr = new StreamReader("c:\\test.file"))
{
    var qry = from l in CreateEnumerable(sr).Skip(1)
              where l[3].Contains("something")
              select new { Field1 = l[0], Field2 = l[1] };

    foreach (var item in qry)
    {
        Console.WriteLine(item.Field1 + " , " + item.Field2);
    }
}
Console.ReadLine();
This will skip the header line, then print out the first two fields from each line where the 4th field contains the string "something". It will do this without loading the entire file into memory.
Windows and high-performance I/O means: use I/O Completion Ports. You may have to do some extra plumbing to get it working in your case.
This is with the understanding that you want to use C#/.NET, and according to Joe Duffy
18) Don’t use Windows Asynchronous Procedure Calls (APCs) in managed
code.
I had to learn that one the hard way ;), but ruling out APC use, IOCP is the only sane option. It also supports many other types of I/O, frequently used in socket servers.
As far as parsing the actual text, check out Eric White's blog for some streamlined stream use.
I would be inclined to use a combination of Memory Mapped Files (MSDN points to a .NET wrapper here) and a simple incremental parse, yielding back an IEnumerable of your records / text lines (or whatever).
You mention that some fields are very, very big; if you try to read them in their entirety into memory you may be getting yourself into trouble. I would read through the file in 8K (or smaller) chunks, parse the current buffer, and keep track of state.
What are you trying to do with this data that you are parsing? Are you searching for something? Are you transforming it?
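For what it's worth, here is a hedged sketch of combining a memory-mapped view with a small rolling buffer. RecordReader is a made-up name, and it assumes records end with '\n' (swap in your real record boundary test):
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

static class RecordReader
{
    // Reads the file through a memory-mapped view in 8K chunks and yields one
    // record at a time; partial records are carried between chunks in 'pending'.
    public static IEnumerable<string> ReadRecords(string path)
    {
        long remaining = new FileInfo(path).Length;  // stop at the real file end, not the page-rounded view

        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var stream = mmf.CreateViewStream())
        {
            var decoder = Encoding.UTF8.GetDecoder();  // handles multi-byte chars split across chunks
            var bytes = new byte[8192];
            var chars = new char[Encoding.UTF8.GetMaxCharCount(bytes.Length)];
            var pending = new StringBuilder();
            int read;

            while (remaining > 0 &&
                   (read = stream.Read(bytes, 0, (int)Math.Min(bytes.Length, remaining))) > 0)
            {
                remaining -= read;
                int charCount = decoder.GetChars(bytes, 0, read, chars, 0);

                for (int i = 0; i < charCount; i++)
                {
                    if (chars[i] == '\n')              // record boundary
                    {
                        yield return pending.ToString();
                        pending.Clear();
                    }
                    else
                    {
                        pending.Append(chars[i]);
                    }
                }
            }

            if (pending.Length > 0)
                yield return pending.ToString();       // last record without a trailing '\n'
        }
    }
}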
I don't see a problem with you writing a custom parser. The requirements seem sufficiently different to anything already provided by the BCL, so go right ahead.
"Elegance" is obviously a subjective thing. In my opinion, if your parser's API looks and works like a standard BCL "reader"-type API, then that is quite "elegant".
As for the large data sizes, make your parser work by reading one byte at a time and using a simple state machine to work out what to do. Leave the streaming and buffering to the underlying FileStream class. You should be OK with performance and memory consumption.
Example of how you might use such a parser class:
using (var reader = new EddReader(new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, 8192)))
{
    // Read a small field
    string smallField = reader.ReadFieldAsText();
    // Read a large field
    Stream largeField = reader.ReadFieldAsStream();
}
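For illustration, a toy version of such a reader (not the EddReader above) might look like the following. It returns every field as a string and leaves out the Stream-returning variant for huge fields; it only needs the quote character (þ), since whatever delimiter sits between fields is simply skipped:
using System.IO;
using System.Text;

class SimpleFieldReader
{
    private readonly TextReader _reader;
    private readonly char _quote;

    public SimpleFieldReader(TextReader reader, char quote = 'þ')
    {
        _reader = reader;
        _quote = quote;
    }

    // Returns the next quoted field, or null at end of input.
    // Two states: outside a field (skip to the opening quote),
    // inside a field (buffer until the closing quote).
    public string ReadFieldAsText()
    {
        int c;

        // State 1: skip everything up to the opening quote (delimiters, newlines).
        while ((c = _reader.Read()) != -1 && c != _quote) { }
        if (c == -1)
            return null;

        // State 2: buffer until the closing quote.
        var field = new StringBuilder();
        while ((c = _reader.Read()) != -1 && c != _quote)
            field.Append((char)c);

        return field.ToString();
    }
}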
While this doesn't help address the large input issue, a possible solution to the parsing issue might include a custom parser that uses the strategy pattern to supply a delimiter.
