Slow and inconsistent writes to MongoDB via C# - c#

Problem:
We are experiencing slow writes, and the time varies greatly between runs: from just over 1 minute to almost 4 minutes. This is to insert a large number of values (704,974 to be precise).
Background:
We read the values in from a txt file, link them to a certain tag (data object), and then add the values to a value collection. Each data point is linked to a tag and a time. Each value BSON document has two keys: the _id, and a tag-date combination.
The value Bson looks as follows:
{
"_id" : ObjectId,
"Timestamp" : ISODate,
"TagID" : ObjectId,
"Value" : string or double,
"FileReferenceID" : ObjectId,
"WasValueInterpolated" : int
}
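For reference, the tag-date key is a compound index on TagID + Timestamp; a declaration with the 2.4 C# driver would look roughly like this (a sketch only; the unique option and variable names are assumptions, not taken from our actual setup):
// Hypothetical declaration of the compound "tag-date" index described above.
var keys = Builders<cValueDocument>.IndexKeys
    .Ascending(v => v.TagID)
    .Ascending(v => v.Timestamp);
valueCollection.Indexes.CreateOne(keys, new CreateIndexOptions { Unique = true });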
MongoDB is set up as a 3-member replica set, and we write with write concern enabled (we have had trouble where one replica would fall behind and refuse to catch up automatically). The MongoDB instances run on a Mesos/Linux deployment, while the C# code connects over the network from a Windows machine.
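For reference, the write concern is applied on the collection handle, roughly like this (a sketch only; whether "write concern true" maps to W1 or WMajority here is an assumption, and the collection name is a placeholder):
// Sketch: apply a write concern when obtaining the collection.
var valueCollection = database
    .GetCollection<cValueDocument>("Value")
    .WithWriteConcern(WriteConcern.WMajority); // or WriteConcern.W1 for a single acknowledgement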
Mongo version: 3.4.5
C# driver: 2.4.4
I have also increased the min and max thread pool thread counts by 50. I use threads to try to increase the write speed; it somewhat works.
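The thread pool adjustment looks roughly like this (the +50 deltas are from the description above; everything else is whatever the runtime reports):
// Sketch of raising the thread pool limits by 50.
ThreadPool.GetMinThreads(out int minWorker, out int minIo);
ThreadPool.GetMaxThreads(out int maxWorker, out int maxIo);
ThreadPool.SetMinThreads(minWorker + 50, minIo + 50);
ThreadPool.SetMaxThreads(maxWorker + 50, maxIo + 50);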
Code
The actual piece of the code that inserts the data into mongo is this:
public int InsertValues(string aDatabaseName, List<cValueDocument> aValues, out List<int> aErrorValueIndexes,
    bool aOverride = false)
{
    Parallel.ForEach(aValues, aValue => aValue.ConvertToDoubleIfCan());
    int lResult = 0;
    Stack<int> lErrorIndexesToDelete = new Stack<int>(aValues.Count - 1);
    aErrorValueIndexes = new List<int>(aValues.Count);
    aDatabaseName = aDatabaseName.ToLower();
    var lValueCollection =
        MongoDatabasesDict[aDatabaseName].GetCollection<cValueDocument>(tMongoCollectionNames.Value.ToString());
    var lTagCollection =
        MongoDatabasesDict[aDatabaseName].GetCollection<cTagDocument>(tMongoCollectionNames.Tag.ToString());
    List<cValueDocument> lValuesOrderedByTime = aValues.OrderBy(aDocument => aDocument.Timestamp).ToList();
    // Check if there are Values that belong to other Tags.
    if (lValuesOrderedByTime.Any(aValue => aValue.TagID != lValuesOrderedByTime[0].TagID))
    {
        return -4;
    }
    cTagDocument lTagDocument = lTagCollection.AsQueryable().First(aTag => aTag._id == lValuesOrderedByTime[0].TagID);
    // Find any Values that are not at least the minimum Interval between each other.
    for (int i = 1; i < lValuesOrderedByTime.Count; i++)
    {
        // Determine if the two consecutive timestamps are not at least an Interval from each other.
        if (lTagDocument.IntervalMask.CompareDateTimesWithinInterval(lValuesOrderedByTime[i - 1].Timestamp, lValuesOrderedByTime[i].Timestamp) == -1)
        {
            aErrorValueIndexes.Add(aValues.FindIndex(aElement => aElement == lValuesOrderedByTime[i]));
            lErrorIndexesToDelete.Push(i);
            lResult = -2;
        }
    }
    // Determine if erroneous values must be removed before insertion.
    if (lResult == -2)
    {
        // Remove each value with an index in lErrorIndexesToDelete.
        while (lErrorIndexesToDelete.Count > 0)
        {
            lValuesOrderedByTime.RemoveAt(lErrorIndexesToDelete.Pop());
        }
    }
    if (aOverride)
    { /* removed for testing purposes */ }
    try
    {
        try
        {
            InsertManyOptions lOptions = new InsertManyOptions();
            lOptions.IsOrdered = false;
            lValueCollection.InsertMany(lValuesOrderedByTime, lOptions);
        }
        catch (MongoBulkWriteException lException)
        {
            if (lException.WriteErrors?.First().Code == 11000)
            {
                aErrorValueIndexes = Enumerable.Range(0, aValues.Count).ToList();
                lResult = -3;
            }
            else
            {
                throw;
            }
        }
        return lResult;
    }
    catch (Exception lException)
    {
        BigDBLog.LogEvent("Error, an exception was thrown in InsertValues. ", tEventEscalation.Error,
            false,
            lException);
        return -5;
    }
}
I removed some of the value checking to shorten the function.
Basically, the function pushes aValues to MongoDB.
Debug/Tracing
Here is where it gets strange for me. I use the same file for every test and delete the values from the DB before each run.
The execution time ranges from just over 1 minute to just under 4 minutes.
I did a trace on the program using JetBrains dotTrace. It points to MongoDB.Driver.Core.Misc.StreamExtensionMethods.ReadBytes as the hot spot. If you follow the code, this was called from MongoDB.Driver.MongoCollectionBase`1.InsertMany. This is weird, as I am writing to the DB and not reading from it. The CPU usage does spike (more during the translation of the file), but stays low (1-2%).
I have checked the wiredTiger.concurrentTransactions tickets. They don't run out.
Questions
How can I improve my write speed (at worst 3,175 values/s, at best 8,392 values/s)?
Why do I have such inconsistencies? I don't really believe it can all be network related.
Why does it read bytes so much, and why does that cause a bottleneck?
What can I further do to help find the problem?

Related

No memory leaks or errors but my code slows down exponentially C#

I am perplexed by this issue. I believe I'm just missing an easy problem right in front of my face but I'm at the point where I need a second opinion to point out anything obvious that I'm missing. I minimized my code and simplified it so it only shows a small part of what it does. The full code is just many different calculations added on to what I have below.
for (int h = 2; h < 200; h++)
{
    var List1 = CalculateSomething(testValues, h);
    masterLists = await AddToRsquaredList("Calculation1", h, actualValuesList, List1, masterLists.Item1, masterLists.Item2);
    var List2 = CalculateSomething(testValues, h);
    masterLists = await AddToRsquaredList("Calculation2", h, actualValuesList, List2, masterLists.Item1, masterLists.Item2);
    var List3 = CalculateSomething(testValues, h);
    masterLists = await AddToRsquaredList("Calculation3", h, actualValuesList, List3, masterLists.Item1, masterLists.Item2);
}
public static async Task<(List<RSquaredValues3>, List<ValueClass>)> AddToRsquaredList(string valueName, int days,
    IEnumerable<double> estimatedValuesList, IEnumerable<double> actualValuesList,
    List<RSquaredValues3> rSquaredList, List<ValueClass> valueClassList)
{
    try
    {
        RSquaredValues3 rSquaredValue = new RSquaredValues3
        {
            ValueName = valueName,
            Days = days,
            RSquared = GoodnessOfFit.CoefficientOfDetermination(estimatedValuesList, actualValuesList),
            StdError = GoodnessOfFit.PopulationStandardError(estimatedValuesList, actualValuesList)
        };
        int comboSize = 15;
        double max = 0;
        var query = await rSquaredList.OrderBy(i => i.StdError - i.RSquared).DistinctBy(i => i.ValueName).Take(comboSize).ToListAsync().ConfigureAwait(false);
        if (query.Count > 0)
        {
            max = query.Last().StdError - query.Last().RSquared;
        }
        else
        {
            max = 10000000;
        }
        if ((rSquaredValue.StdError - rSquaredValue.RSquared < max || query.Count < comboSize) && rSquaredList.Contains(rSquaredValue) == false)
        {
            rSquaredList.Add(rSquaredValue);
            valueClassList.Add(new ValueClass { ValueName = rSquaredValue.ValueName, ValueList = estimatedValuesList, Days = days });
        }
    }
    catch (Exception ex)
    {
        ThrowExceptionInfo(ex);
    }
    return (rSquaredList, valueClassList);
}
There is clearly a significance to StdError - RSquared, so change RSquaredValues3 to expose that value (i.e. calculate it once, on construction, since the values do not change) rather than recalculating it in multiple places during the processing loop.
The value in this new property is the way that the list is being sorted. Rather than sorting the list over and over again, consider keeping the items in the list in that order in the first place. You can do this by ensuring that each time an item gets added, it is inserted in the right place in the list. This is called an insertion sort. (I have assumed that SortedList<TKey,TValue> is inappropriate due to duplicate 'key's.)
Similar improvements can be made to avoid the need for DistinctBy(i => i.ValueName). If you are only interested in distinct value names, then consider avoiding inserting the item if it is not providing an improvement.
Your List needs to grow during your processing - under the hood, the list doubles every time it grows, so the number of growths is O(log(n)). You can specify a suggested capacity in construction. If you specify the expected size large enough at the start, then the list will not need to do this during your processing.
The await of the ToListAsync is not adding any advantage to this code, as far as I can see.
The check for rSquaredList.Contains(rSquaredValue) == false looks like a redundant check, since this is a reference comparison of a newly instantiated item which cannot have been inserted in the list. So you can remove it to make it run faster.
With all that use of Task and await, you are not actually gaining anything at the moment, since you have a single thread handling it and are waiting for execution sequentially, so it appears to all be overhead. I am not sure if you can parallelize this workload but the main loop from 2 to 200 seems like a prime candidate for a Parallel.For() loop instead. You should also look into using a System.Collections.Concurrent.ConcurrentBag() for your master list if you implement parallelism to avoid deadlock issues.
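A minimal sketch of these suggestions combined is below. The Score property, the ConcurrentBag, and the loop body are assumptions meant to illustrate the shape of the change, not your actual code:
// Precompute StdError - RSquared once, on construction.
public class RSquaredValues3
{
    public string ValueName { get; }
    public int Days { get; }
    public double RSquared { get; }
    public double StdError { get; }
    public double Score { get; }   // the value previously recomputed in every sort

    public RSquaredValues3(string valueName, int days, double rSquared, double stdError)
    {
        ValueName = valueName;
        Days = days;
        RSquared = rSquared;
        StdError = stdError;
        Score = stdError - rSquared;
    }
}

// Parallelize the outer loop and collect results in a thread-safe bag.
var results = new System.Collections.Concurrent.ConcurrentBag<RSquaredValues3>();
Parallel.For(2, 200, h =>
{
    var list1 = CalculateSomething(testValues, h);
    results.Add(new RSquaredValues3(
        "Calculation1", h,
        GoodnessOfFit.CoefficientOfDetermination(list1, actualValuesList),
        GoodnessOfFit.PopulationStandardError(list1, actualValuesList)));
});
Once everything has been gathered, you can sort by Score once at the end instead of on every iteration.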

How to display labels like a running sequence? C#

I am trying to do a "running window" of labels. I tried to find a similar solution on Google but it got me nowhere.
EXAMPLE: 5 numbers need to be displayed at different counter values. This is attached to my timer_Start(); the counter increases every 5 seconds, which is set in my main form.
Display: 21 23 24 25 26
If I insert another value, e.g. 23, the last 5 numbers should be displayed.
Display: 23 21 23 24 25
However, with my code below, when I insert another value, all 5 of them change. If I change to if (counter == 2), it fails to update when counter == 3.
int counter = 0;
sql_cmd = sql_conn.CreateCommand();
sql_cmd.CommandText = "SELECT * FROM temp where id=12";
try
{
    sql_conn.Open();
    sql_reader = sql_cmd.ExecuteReader();
    while (sql_reader.Read()) // start retrieve
    {
        if (counter >= 1)
        {
            this.avg1.Text = sql_reader["Temp1"].ToString();
        }
    }
    sql_conn.Close();
}
catch (Exception e)
{
    MessageBox.Show(e.Message);
}
if (counter >= 2)
{
    avg2.Text = avg1.Text;
}
if (counter >= 3)
{
    avg3.Text = avg2.Text;
}
if (counter >= 4)
{
    avg4.Text = avg3.Text;
}
if (counter >= 5)
{
    avg5.Text = avg4.Text;
    counter = 0;
}
Any help is much appreciated. Thanks.
Your problem is with your series of if statements. Simple debugging would allow you to see this, so I would suggest stepping through your code before coming here next time. With that, your if statements can be refactored out into a simple method for you to use.
private void UpdateLabels(string newValue) {
    avg5.Text = avg4.Text;
    avg4.Text = avg3.Text;
    avg3.Text = avg2.Text;
    avg2.Text = avg1.Text;
    avg1.Text = newValue;
}
What is important here is that you have the correct update order. Your original if statements were not in the correct order, which is why you were having issues. If you want to see why this works, walk through both sets of code in a debugger and see how the Label.Text properties change.
Now you can call this new method after you get your new value from the database... Here we can update your timer code to be slightly better.
sql_cmd = sql_conn.CreateCommand();
sql_cmd.CommandText = "SELECT * FROM temp where id=12";
string newValue = String.Empty;
try {
    sql_conn.Open();
    sql_reader = sql_cmd.ExecuteReader();
    while (sql_reader.Read()) {
        newValue = sql_reader["Temp1"].ToString(); // store in local variable
    }
} catch (Exception e) {
    MessageBox.Show(e.Message);
} finally {
    sql_conn.Close(); // SqlConnection.Close should be in a finally block
}
UpdateLabels(newValue);
First, there is no need for a counter anymore (based on the original code you posted, it was never needed). Since Label.Text can accept blank strings, you can always copy the values, regardless of whether it's the first update or the one millionth.
Second, you can store your database value in a temporary variable. This allows you to update the labels even if there is a database error. After all the database operations are finished, you then call UpdateLabels with your new value and you are all set.
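As an alternative sketch (not part of the approach above): you could keep the last five values in a fixed-size queue and rewrite all labels from it, which scales to any number of labels. The queue field is an assumption; the label names match the question. These members would live inside the form class and require using System.Linq:
private readonly Queue<string> _recentValues = new Queue<string>();

private void UpdateLabels(string newValue)
{
    _recentValues.Enqueue(newValue);
    while (_recentValues.Count > 5)
        _recentValues.Dequeue();                        // drop the oldest value

    var labels = new[] { avg1, avg2, avg3, avg4, avg5 };
    var values = _recentValues.Reverse().ToArray();     // newest first, like avg1 in the answer
    for (int i = 0; i < labels.Length; i++)
        labels[i].Text = i < values.Length ? values[i] : string.Empty;
}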

Find the first window satisfying a condition in a Deedle Series

Given a Deedle Series with time as the row index, I need to find the time at which the signal first satisfies a condition (in this case, stays below 0.005 for 50ms).
Currently I take a 50ms moving window and create a series from the start time and maximum value of each window, then get the first one whose max is < 0.005. It works well enough but can be very inefficient.
// Assume a timestep of 1ms
int numSteps = 50;
// Create a series from the first index and max of each window
var windowMaxes = mySeries.WindowInto(
numSteps,
s => new KeyValuePair<double, double>(s.FirstKey(), s.Max()));
var zeroes = windowMaxes.Where(kvp => kvp.Value <= 0.005);
// Set to -1 if the condition was never satisfied
var timeOfZero = zeroes.KeyCount > 0 ? zeroes.FirstKey() : -1D;
The problem is that it searches the entire series (which can get very large) even if the first window meets the condition.
Is there a simple way to do this that stops when the first window is found, instead of searching the entire Series?
Well, I couldn't find a Deedle one-liner or any handy LINQ commands to do it, so I wrote the following extension method:
public static K FirstWindowWhere<K, V>(
    this Series<K, V> series,
    Func<V, bool> condition,
    int windowSize)
{
    int consecutiveTrues = 0;
    foreach (var datum in series.Observations)
    {
        if (condition(datum.Value))
        {
            consecutiveTrues++;
        }
        else
        {
            consecutiveTrues = 0;
        }
        if (consecutiveTrues == windowSize)
        {
            return datum.Key;
        }
    }
    return default(K);
}
To call with my above condition:
double zeroTime = mySeries.FirstWindowWhere(d => d <= 0.005, numSteps);
I tried a few different methods including a nice elegant one that used Series.Between instead of Series.GetObservations but it was noticeably slower. So this will do unless someone has a simpler/better solution.

Parallel execution with StackExchange.Redis?

I have 1M items stored in a List<Person>, which I'm serializing in order to insert into Redis (2.8).
I divide the work among 10 Tasks, where each takes its own section (List<T> is thread safe for read-only access; it is safe to perform multiple read operations on a List).
Simplification/example: for ITEMS=100 and THREADS=10, each Task will capture its own PAGE and deal with the relevant range. For example:
void Main()
{
    var ITEMS = 100;
    var THREADS = 10;
    var PAGE = 4;
    List<int> lst = Enumerable.Range(0, ITEMS).ToList();
    for (int i = 0; i < ITEMS / THREADS; i++)
    {
        lst[PAGE * (ITEMS / THREADS) + i].Dump();
    }
}
PAGE=0 will deal with : 0,1,2,3,4,5,6,7,8,9
PAGE=4 will deal with : 40,41,42,43,44,45,46,47,48,49
All ok.
Now back to SE.redis.
I wanted to implement this pattern and so I did : (with ITEMS=1,000,000)
My testing (checking dbsize each second): as you can see, 1M records were added via 10 threads.
Now, I don't know if it's fast, but when I change ITEMS from 1M to 10M, things get really slow and I get an exception.
The exception is thrown in the for loop.
Unhandled Exception: System.AggregateException: One or more errors occurred. --->
System.TimeoutException: Timeout performing SET urn:user>288257, inst: 1, queue: 11, qu=0, qs=11, qc=0, wr=0/0, in=0/0
   at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\ConnectionMultiplexer.cs:line 1722
   at StackExchange.Redis.RedisBase.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\RedisBase.cs:line 79
   ...
Press any key to continue . . .
Questions:
Is my way of dividing the work the RIGHT (fastest) way?
How can I make things faster (sample code would be much appreciated)?
How can I resolve this exception?
Related info:
<gcAllowVeryLargeObjects enabled="true" /> is present in App.config (otherwise I get an OutOfMemoryException); the build targets x64. I have 16 GB of RAM, an SSD drive, and an i7 CPU.
Currently, your code is using the synchronous API (StringSet), and is being loaded by 10 threads concurrently. This will present no appreciable challenge to SE.Redis - it works just fine here. I suspect that it genuinely is a timeout where the server has taken longer than you would like to process some of the data, most likely also related to the server's allocator. One option, then, is to simply increase the timeout a bit. Not a lot... try 5 seconds instead of the default 1 second. Likely, most of the operations are working very fast anyway.
With regards to speeding it up: one option here is to not wait - i.e. keep pipelining data. If you are content not to check every single message for an error state, then one simple way to do this is to add , flags: CommandFlags.FireAndForget to the end of your StringSet call. In my local testing, this sped up the 1M example by 25% (and I suspect a lot of the rest of the time is actually spent in string serialization).
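As a minimal sketch of both suggestions combined (the connection string, key format, people variable and Serialize helper here are assumptions, not taken from the original code):
// Raise the synchronous timeout and pipeline with fire-and-forget.
var options = ConfigurationOptions.Parse("localhost:6379");
options.SyncTimeout = 5000; // 5 seconds instead of the default 1 second

using (var mux = ConnectionMultiplexer.Connect(options))
{
    var db = mux.GetDatabase();
    foreach (var person in people)
    {
        // Fire-and-forget: keep writing without waiting for each reply.
        db.StringSet("urn:user>" + person.Id, Serialize(person),
            flags: CommandFlags.FireAndForget);
    }
}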
The biggest problem I had with the 10M example was simply the overhead of working with the 10M example - especially since this takes huge amounts of memory for both the redis-server and the application, which (to emulate your setup) are on the same machine. This creates competing memory pressure, with GC pauses etc in the managed code. But perhaps more importantly: it simply takes forever to start doing anything. Consequently, I refactored the code to use parallel yield return generators rather than a single list. For example:
static IEnumerable<Person> InventPeople(int seed, int count)
{
    for (int i = 0; i < count; i++)
    {
        int f = 1 + seed + i;
        var item = new Person
        {
            Id = f,
            Name = Path.GetRandomFileName().Replace(".", "").Substring(0, appRandom.Value.Next(3, 6)) + " " + Path.GetRandomFileName().Replace(".", "").Substring(0, new Random(Guid.NewGuid().GetHashCode()).Next(3, 6)),
            Age = f % 90,
            Friends = ParallelEnumerable.Range(0, 100).Select(n => appRandom.Value.Next(1, f)).ToArray()
        };
        yield return item;
    }
}
static IEnumerable<T> Batchify<T>(this IEnumerable<T> source, int count)
{
    var list = new List<T>(count);
    foreach (var item in source)
    {
        list.Add(item);
        if (list.Count == count)
        {
            foreach (var x in list) yield return x;
            list.Clear();
        }
    }
    foreach (var item in list) yield return item;
}
with:
foreach (var element in InventPeople(PER_THREAD * counter1, PER_THREAD).Batchify(1000))
Here, the purpose of Batchify is to ensure that we aren't helping the server too much by taking appreciable time between each operation - the data is invented in batches of 1000 and each batch is made available very quickly.
I was also concerned about JSON performance, so I switched to JIL:
public static string ToJSON<T>(this T obj)
{
    return Jil.JSON.Serialize<T>(obj);
}
and then, just for fun, I moved the JSON work into the batching, so that the actual processing loop becomes:
foreach (var element in InventPeople(PER_THREAD * counter1, PER_THREAD)
.Select(x => new { x.Id, Json = x.ToJSON() }).Batchify(1000))
This got the times down a bit more, so I can load 10M in 3 minutes 57 seconds, a rate of 42,194 rops. Most of this time is actually local processing inside the application. If I change it so that each thread loads the same item ITEMS / THREADS times, then this changes to 1 minute 48 seconds - a rate of 92,592 rops.
I'm not sure if I've answered anything really, but the short version might be simply "try a longer timeout; consider using fire-and-forget".

How to test C# Sql server sequential GUID generator?

There are many how-tos on creating Guids that are SQL Server index friendly, for example this tutorial. Another popular method is the one (listed below) from the NHibernate implementation. So I thought it would be fun to write a test method that actually tested the sequential requirements of such code. But I fail - I don't know what makes a good SQL Server sequence. I can't figure out how they are ordered.
For example, given the two different ways to create a sequential Guid, how do I determine which is the best (other than speed)? For example, it looks like both have the disadvantage that if the clock is set back 2 minutes (e.g. by a time server update), their sequences are suddenly broken. But would that also mean trouble for the SQL Server index?
I use this code to produce the sequential Guid:
public static Guid CombFromArticle()
{
    var randomBytes = Guid.NewGuid().ToByteArray();
    byte[] timestampBytes = BitConverter.GetBytes(DateTime.Now.Ticks / 10000L);
    if (BitConverter.IsLittleEndian)
        Array.Reverse(timestampBytes);
    var guidBytes = new byte[16];
    Buffer.BlockCopy(randomBytes, 0, guidBytes, 0, 10);
    Buffer.BlockCopy(timestampBytes, 2, guidBytes, 10, 6);
    return new Guid(guidBytes);
}
public static Guid CombFromNHibernate()
{
    var destinationArray = Guid.NewGuid().ToByteArray();
    var time = new DateTime(0x76c, 1, 1);
    var now = DateTime.Now;
    var span = new TimeSpan(now.Ticks - time.Ticks);
    var timeOfDay = now.TimeOfDay;
    var bytes = BitConverter.GetBytes(span.Days);
    var array = BitConverter.GetBytes((long)(timeOfDay.TotalMilliseconds / 3.333333));
    Array.Reverse(bytes);
    Array.Reverse(array);
    Array.Copy(bytes, bytes.Length - 2, destinationArray, destinationArray.Length - 6, 2);
    Array.Copy(array, array.Length - 4, destinationArray, destinationArray.Length - 4, 4);
    return new Guid(destinationArray);
}
The one from the article is slightly faster, but which creates the best sequence for SQL Server? I could populate 1 million records and compare the fragmentation, but I'm not even sure how to validate that properly. In any case, I'd like to understand how I could write a test case that ensures the sequences are sequences as defined by SQL Server!
I'd also like some comments on these two implementations. What makes one better than the other?
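One approach I'm considering is a sketch like the one below. It relies on System.Data.SqlTypes.SqlGuid, whose comparison follows SQL Server's uniqueidentifier ordering; assuming that ordering is what matters for index fragmentation, generated values can be checked against it. The pause between generations is there because both comb methods quantize the timestamp, so back-to-back values can share a time part and then differ only in their random bytes:
using System;
using System.Data.SqlTypes;
using System.Threading;

public static class SequentialGuidTest
{
    public static bool IsSqlServerSequential(Func<Guid> generator, int count = 1000)
    {
        SqlGuid previous = generator();
        for (int i = 1; i < count; i++)
        {
            Thread.Sleep(5); // exceed the combs' timestamp resolution between values
            SqlGuid current = generator();
            if (current.CompareTo(previous) <= 0)
                return false; // out of order by SQL Server's comparison rules
            previous = current;
        }
        return true;
    }
}
// e.g. IsSqlServerSequential(CombFromArticle) and IsSqlServerSequential(CombFromNHibernate)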
I generated sequential GUIDs for an SQL Server. I never looked at too many articles beforehand.. still, it seems sound.
The first one, I generate with a system function (to get a proper one) and the following ones, I simply increment. You have to look for overflow and the like, of course (also, a GUID has several fields).
Apart from that, nothing hard to consider. If 2 GUIDs are unique, so is a sequence of them, if.. you stay below a few million. Well, it is mathematics.. even 2 GUIDs aren't guaranteed to be unique, at least not in the long run (if humanity keeps growing). So by using this kind of sequence, you probably increase the probability of a collision from nearly 0 to nearly 0 (but slightly more). If at all.. ask a mathematician.. it is the Birthday Problem http://en.wikipedia.org/wiki/Birthday_problem , with an insane number of days.
It is in C, but that should be easily translatable to more comfortable languages. In particular, you don't need to worry about converting the wchar to a char.
GUID guid;
bool bGuidInitialized = false;

void incrGUID()
{
    for (int i = 7; i >= 0; --i)
    {
        ++guid.Data4[i];
        if (guid.Data4[i] != 0)
            return;
    }
    ++guid.Data3;
    if (guid.Data3 != 0)
        return;
    ++guid.Data2;
    if (guid.Data2 != 0)
        return;
    ++guid.Data1;
    if (guid.Data1 != 0)
        return;
}

GenerateGUID(char *chGuid)
{
    if (!bGuidInitialized)
    {
        CoCreateGuid(&guid);
        bGuidInitialized = true;
    }
    else
        incrGUID();
    WCHAR temp[42];
    StringFromGUID2(guid, temp, 42 - 1);
    wcstombs(chGuid, &(temp[1]), 42 - 1);
    chGuid[36] = 0;
    if (!onlyOnceLogGUIDAlreadyDone)
    {
        onlyOnceLogGUIDAlreadyDone = true;
        WR_cTools_LogTime(chGuid);
    }
    return ReturnCode;
}
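A rough C# translation of the same idea might look like the sketch below. Note it is only an illustration: Guid.ToByteArray() uses a mixed-endian layout, so incrementing the raw bytes is not byte-for-byte identical to incrementing Data4/Data3/Data2/Data1 above, and thread safety is not handled:
public static class SequentialGuidSource
{
    private static byte[] _current;

    public static Guid Next()
    {
        if (_current == null)
        {
            // Generate the first value properly, then increment from there.
            _current = Guid.NewGuid().ToByteArray();
            return new Guid(_current);
        }
        // Increment with carry, starting at the last byte of the array.
        for (int i = _current.Length - 1; i >= 0; i--)
        {
            if (++_current[i] != 0)
                break; // no overflow in this byte, stop carrying
        }
        return new Guid(_current);
    }
}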
