Algorithm to flatten peak usage over time? - c#

I have an environment that serves many devices spread across 3 time zones by receiving and sending data during the wee hours of the night. The distribution of these devices was determined pseudo-randomly based on an identification number and a simple calculation using a modulo operation. The result of such a calculation creates an unnecessary artificial peak which consumes more resources than I'd like during certain hours of the night.
As part of our protocol I can instruct devices when to connect to our system on subsequent nights.
I am seeking an algorithm which can generally distribute the peak into a more level line (albeit generally higher at most times), or at least a shove in the right direction, i.e. what sort of terminology I should spend my time reading about. I have available to me the identification numbers of the devices, the current time, and each device's time zone as inputs for the calculation. I can also perform some up-front analytical calculations to create pools from which to draw slots, though I feel this approach may be less elegant than I am hoping for (though a learning algorithm may not be a bad thing...).
(Ultimately and somewhat less relevant I will be implementing this algorithm using C#.)

If you want to avoid the spikes associated with using random times, look at the various hashing functions used for hashtables. Your reading might start at the wikipedia articles on the subject:
http://en.wikipedia.org/wiki/Hash_function
Basically, divide whatever you want your update window to be into the appropriate number of buckets. One option might be 3 hours * 60 minutes * 60 seconds = 10800 buckets. Then use that as your hashtable size, for the chosen hashing function. Your unique input might be device ID. Don't forget to use GMT for the chosen time. Your programming language of choice probably has a number of built in hashing functions, but the article should provide some links to get you started if you want to implement one from scratch.
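For illustration, here is a minimal C# sketch of that bucketing idea; the 01:00 GMT window start, the device ID format and the FNV-1a hash are just assumptions for the example, and any well-mixed hash function will do:
using System;

class SlotHasher
{
    const int Buckets = 3 * 60 * 60;   // 10,800 one-second buckets in a 3-hour window

    // Simple FNV-1a hash over the device ID string (any well-mixed hash works).
    static uint Fnv1a(string s)
    {
        uint hash = 2166136261;
        foreach (char c in s)
        {
            hash ^= c;
            hash *= 16777619;
        }
        return hash;
    }

    // Maps a device ID to a connection time inside the nightly window.
    // The 01:00 GMT window start is an assumption for illustration.
    static DateTime NextConnectionTime(string deviceId, DateTime tomorrowGmt)
    {
        int bucket = (int)(Fnv1a(deviceId) % Buckets);
        DateTime windowStart = tomorrowGmt.Date.AddHours(1);
        return windowStart.AddSeconds(bucket);
    }

    static void Main()
    {
        Console.WriteLine(NextConnectionTime("DEVICE-000123", DateTime.UtcNow.AddDays(1)));
    }
}
Because the hash depends only on the device ID, each device keeps the same slot from night to night while the population as a whole spreads out across the window.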
This approach is superior to the earlier answer of random access times because it has much better evenness properties, and ensures that your access patterns will be approximately flat, as compared to the random function which is likely to sometimes exhibit spikes.
Here's some more specific information on how to implement various functions:
http://www.partow.net/programming/hashfunctions/index.html

You say that you can tell devices what time to connect, so I don't see why you need anything random or modulo-based. When each device connects, pick a time tomorrow which currently doesn't have many devices assigned to it, and assign the device to that time. If the devices all take about the same amount of resources to service, then a trivial greedy algorithm will produce a completely smooth distribution: assign each device to whatever time is currently least congested (there's a sketch below). If the server handles other work besides these devices, then you'd want to start with its typical load profile and add the device load to that. I wouldn't really call this "analytical calculations", just storing a histogram of expected load against time for the next 24 hours.
Or do you have the problem that the device might not obey instructions (for example it might be offline at its assigned time, and then connect whenever it's next on)? Obviously if your users in a particular time zone all start work at the same time in the morning, then that would be a problematic strategy.
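A minimal sketch of that greedy idea, assuming 10,800 one-second slots in the nightly window and devices that cost roughly the same to service; seed the slot loads from your baseline profile if the server does other work:
using System;

class SlotScheduler
{
    // Expected load already assigned to each slot for tomorrow night.
    // Seed this from the server's baseline load profile if it has one.
    private readonly int[] _load;

    public SlotScheduler(int slotCount) => _load = new int[slotCount];

    // Greedy choice: put the connecting device into the currently least-loaded slot.
    public int AssignSlot()
    {
        int best = 0;
        for (int i = 1; i < _load.Length; i++)
            if (_load[i] < _load[best]) best = i;
        _load[best]++;
        return best;
    }

    static void Main()
    {
        var scheduler = new SlotScheduler(slotCount: 10800);   // one slot per second in a 3-hour window
        Console.WriteLine($"Assign next device to slot {scheduler.AssignSlot()}");
    }
}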

Simply take the number of devices, n, divide your time interval into n equal segments, and allocate each segment to a device, informing each device of its next connection time when it connects.
This will give you an optimally uniform distribution in all cases.
Normalize all times to GMT; why would you care about time zones or daylight saving time? Now is now no matter what time zone you're in.
Adding a random distribution can lead to clumping (a uniform random distribution is only uniform in the limit, not necessarily for any particular sample), and really should only be used if there's no feedback mechanism. Since you can control, to some extent, when the devices connect, a random component is not at all necessary and is not even remotely optimal.
If you're concerned about clock drift across devices, consider that adding randomness would not counteract that drift in any way, and would only contribute to an even less optimal allocation.
If you want to ensure a stable distribution of devices by region, then compute the ratio of devices per region and distribute the slot allocations accordingly. For instance, if you have 50/25/25 by time zone respectively, assign the first two slots to the first time zone, then the next two slots to the remaining time zones (one each), then repeat.
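A rough sketch of both ideas together: one slot per device, with time zones interleaved according to their device-count ratio. The 06:00 GMT window start, the 3-hour length and the zone counts here are made up for illustration:
using System;
using System.Collections.Generic;
using System.Linq;

class EvenScheduler
{
    static void Main()
    {
        // Hypothetical device counts per time zone (50/25/25 ratio).
        var zoneCounts = new Dictionary<string, int> { ["EST"] = 50, ["CST"] = 25, ["PST"] = 25 };
        int total = zoneCounts.Values.Sum();

        DateTime windowStart = DateTime.UtcNow.Date.AddDays(1).AddHours(6);   // assumed 06:00 GMT
        TimeSpan slot = TimeSpan.FromHours(3.0 / total);                      // 3-hour window, n equal segments

        // Interleave zones proportionally: the zone furthest behind its quota gets the next slot.
        var assigned = zoneCounts.Keys.ToDictionary(z => z, _ => 0);
        for (int i = 0; i < total; i++)
        {
            string zone = zoneCounts.Keys
                .OrderByDescending(z => (double)(zoneCounts[z] - assigned[z]) / zoneCounts[z])
                .First();
            assigned[zone]++;
            Console.WriteLine($"{zone} device #{assigned[zone]} -> {windowStart + TimeSpan.FromTicks(slot.Ticks * i):HH:mm:ss}");
        }
    }
}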

Related

Proper 50% chance random number generator implementation for player

I am playing with a demo of rolling a die in JavaScript, where I select 3 numbers out of 6 and use the following snippet to get the random result:
let randomNumber = Math.floor(Math.random() * 6 + 1); // Random number between 1 and 6
This is a global random, meaning that one player can have a much higher win rate than 50% and another a lower one, etc.
How would I implement something that would keep every player's win rate close to 50% and not higher? At the moment I am seeing huge win streaks while rolling, which would mean the same thing in real life, so how would I limit myself to never exceed, for example, a 55% win rate, with the rate slowly drifting back towards 50% if it does (and vice versa if it drops below 50%)?
Is there any API service that, given a player's ID, would keep a count of wins and losses and hold the win rate around 50%? I.e. keep the historical data for the user and decide on the next number based on the previous rolls, but do an honest random roll when it's at 50 wins / 50 losses. This would destroy win streaks, though, I guess.
So I think I would have to do something like:
if the win/loss ratio is 50%, do legit random rolls until it hits 51%, then decide against the player until it is back at 50%. I know this wouldn't be honest, but are there any other honest ways to really keep the randomization at 50% by just running the snippet above?
How long are those win streaks? When I test this:
console.log(Math.random());
I get four or five values in a row that are bigger than 0.5, then one below 0.5, then another bigger than 0.5. If this is a problem for the game logic, then you don't need random numbers but plot armor. If you use only one random number generator for all players, then of course some players can steal another player's "destiny" and those players can have very bad or very good dice rolls.
To overcome the issue of "fairness" between players, you can have a unique seed for each player's random number generator. This is possible (and fast) in C++. Calling a compiled C++ console program from Node.js is easy. The only issue would be optimizing it for millions of concurrent players. C++'s std::mt19937 is fast enough and takes a seed value (from the OS too).
Since you tagged C#, the same thing can be done within C# too: new Random(someRandomSeedHere) should give similar results. Then you can host the algorithm as a micro-service and make it accessible from Node.js (assuming the main backend part of the app is on Node.js).
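A minimal C# sketch of that per-player seeding idea; mapping the player ID to a seed via GetHashCode is just an assumption, and note this only keeps players' streams independent of each other, it does not force anyone's win rate to exactly 50%:
using System;
using System.Collections.Generic;

class PerPlayerDice
{
    // One RNG per player so streaks in one player's rolls can't "steal"
    // another player's values from a shared generator.
    static readonly Dictionary<string, Random> Rngs = new Dictionary<string, Random>();

    static int Roll(string playerId)
    {
        if (!Rngs.TryGetValue(playerId, out var rng))
        {
            // Derive a seed from the player ID; any stable hash works for illustration.
            rng = new Random(playerId.GetHashCode());
            Rngs[playerId] = rng;
        }
        return rng.Next(1, 7);   // 1..6 inclusive
    }

    static void Main()
    {
        for (int i = 0; i < 5; i++)
            Console.WriteLine(Roll("player-42"));
    }
}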
Lastly, you can have one micro-service process per player for the RND. This gives everyone their own seed for random number generation. But too many players would mean too many processes, which is very bad for performance unless dice rolls are very rare. For example, I can start 100 processes on my 8-core CPU with this:
"use strict";
var cluster = require('cluster');
if(cluster.isMaster){for(let i=0;i<100;i++) { cluster.fork(); }}
console.log(Math.random());
but it starts very slowly, and hosting an RND server per process could be even slower. So only the main process should host the RND service and communicate internally with the worker processes. The cluster module of Node.js lets worker processes communicate with the main process. Pure event-driven communication (no spin-wait whatsoever) between the worker (RND) processes and the main process should be CPU-friendly, but if all of the (millions of) players throw dice at the same time, the process/thread-switching overhead would be visible, and each process takes a certain amount of memory, so RAM capacity becomes important too. There are custom random number generators for Node.js that can take a seed, for much less CPU/RAM usage.

How to generate identifiers in a distributed system with low probability of duplicates?

I need to generate identifiers in a distributed system.
Duplicates will be detected by the system and will cause the operation that created that identifier to fail. I need to minimize the probability of failing operations by generating identifiers with low collision probability.
I'd also like to be able to describe mathematically how likely it is that a duplicate number is generated. I'm not sure what such a description would look like, preferably I'd like to know the X in something like:
When generating 1000 random numbers per second for 10 years no more than X duplicates should have been generated.
These random numbers can only have 35 significant bits. The system is written in C# and runs on top of Microsoft's .NET platform.
So this is actually two questions in one (but I guess they depend on each other):
What component/pattern should I use to generate identifiers?
How can I compute the X value?
For (1) I see the following candidates:
System.Random
System.Guid
System.Security.Cryptography.RNGCryptoServiceProvider
The fact that I need numbers to have 35 significant bits is not a problem when it comes to generating values, as it is fine to generate a larger number and then just extract 35 of those bits. However, it does affect the mathematical computation, I presume.
UPDATE
I can see now that 35-bits aren't nearly enough for my description above. I don't really need 1 number per millisecond for 10 years. That was an overstatement.
What I really need is a way to distributively generate identifiers that have 35 significant bits with as low probability of a conflict as possible. As time goes by the system will "clean up" identifiers so that it is possible for the same number to be used again without it causing a failure.
I understand that I could of course implement some kind of centralized counter. But I would like to be able to avoid that if possible. I want to minimize the number of network operations needed to maintain the identifiers.
Any suggestions are welcome!
You want to generate 1000 numbers each second for 10 years. So you will generate
1000 * 60 * 60 * 24 * 365 * 10 = 315360000000
You want to use numbers with 35 bits. There are
2**35 = 34359738368
The minimum number of duplicates that you will generate is 315360000000 - 34359738368 which equals 281000261632. That is a lower bound on X. This is self-evident. Suppose by some amazing freak that you manage to sample each and every possible value from the 2**35 available. Then every other sample you make is a duplicate.
I guess we can safely conclude that 35 bits is not enough.
As far as generating good-quality pseudo-random numbers goes, it should be fairly obvious that System.Security.Cryptography.RNGCryptoServiceProvider is the best choice of the three that you present.
If you really want uniqueness, then I suggest you do the following:
Allocate to each distributed node a unique range of IDs.
Have each node allocate IDs uniquely from that pool of values. For instance, the node starts at the first value and increments the ID by one every time it is asked to generate a new one.
This is really the best strategy if uniqueness matters. But you will likely need to dedicate more bits for your IDs.
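A minimal sketch of that range-per-node idea, assuming nodes are numbered up front and the 35-bit space is split into equal slices; the node index and count are illustrative:
using System;

class NodeIdAllocator
{
    // Each node gets a disjoint, pre-assigned slice of the 35-bit space and
    // simply counts upward inside it.
    private readonly long _rangeStart;
    private readonly long _rangeEnd;    // exclusive
    private long _next;

    public NodeIdAllocator(int nodeIndex, int nodeCount)
    {
        long space = 1L << 35;
        long sliceSize = space / nodeCount;
        _rangeStart = nodeIndex * sliceSize;
        _rangeEnd = _rangeStart + sliceSize;
        _next = _rangeStart;
    }

    public long NextId()
    {
        if (_next >= _rangeEnd)
            throw new InvalidOperationException("Node has exhausted its ID range.");
        return _next++;
    }

    static void Main()
    {
        var node = new NodeIdAllocator(nodeIndex: 2, nodeCount: 8);
        Console.WriteLine(node.NextId());   // first ID in node 2's slice
    }
}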
Since the probability of collisions steadily increases with a random allocation as you use up more addresses, the system steadily degrades in performance. There is also the looming specter of a non-zero probability of your random selection never terminating because it never chooses a non-conflicting id (PRNGs have cycle lengths for any given seed much smaller than their theoretical full range of output.) Whether this is a problem in practice of course depends on how saturated you expect your address space to be in the long run.
If the IDs don't need to be random, then you almost certainly want to rely on some form of coordination to assign IDs (such as partitioning the address space or using a coordinating manager of some sort to assign IDs) rather than creating random numbers and reconciling collisions after they happen. It will be easier to implement, probably more performant and will allow better saturation of your address space.
In response to comment:
The design for a specific mechanism of coordination depends on a lot of factors, such as how many nodes you expect to have, how flexible you need to be with regard to adding/dropping nodes, how long the IDs need to remain unique (i.e. what is your strategy for managing ID lifetime), etc. It's a complex problem that warrants a careful analysis of your expected use cases, including a look at your future scalability requirements. A simple partitioning scheme is sufficient if your number of nodes and/or number of IDs is small, but if you need to scale to larger volumes, it's a much more challenging problem, perhaps requiring more complex allocation strategies.
One possible partitioning design is that you have a centralized manager that allocates IDs in blocks. Each node can then freely increment IDs within that block and only needs to request a new block when it runs out. This can scale well if you expect your ID lifetime to correlate with age, as that generally means that whole blocks will be freed up over time. If ID lifetime is more randomly distributed, though, this could potentially lead to fragmentation and exhaustion of available blocks. So, again, it's a matter of understanding your requirements so that you can design for the scale and usage patterns your application requires.
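A sketch of that block-lease pattern; IIdBlockManager and the default block size are hypothetical placeholders for whatever coordination service you end up with:
using System;

// A central manager hands out fixed-size blocks; each node increments locally
// and only goes back over the network when its block runs out.
public interface IIdBlockManager
{
    long LeaseBlock(int blockSize);   // returns the first ID of a newly reserved block
}

public class BlockLeasingAllocator
{
    private readonly IIdBlockManager _manager;
    private readonly int _blockSize;
    private long _current;
    private long _blockEnd;           // exclusive

    public BlockLeasingAllocator(IIdBlockManager manager, int blockSize = 1024)
    {
        _manager = manager;
        _blockSize = blockSize;
        _current = _blockEnd = 0;     // forces a lease on first use
    }

    public long NextId()
    {
        if (_current >= _blockEnd)
        {
            _current = _manager.LeaseBlock(_blockSize);   // one network round trip per block
            _blockEnd = _current + _blockSize;
        }
        return _current++;
    }
}
The block size is the knob that trades network chatter against wasted IDs when a node dies holding a partly used block.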
You can't use random numbers in your case: the birthday paradox says the first collision is expected after roughly
sqrt(2 * N)
draws; in your case:
sqrt(2 * 2^35) = sqrt(2^36) = 2^18 = 262144 items before the 1st collision
So a GUID-based value is the best choice.
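If you want to actually put a number on X for a given load, the standard birthday-style formulas can be coded directly; the draw count below is a placeholder to replace with your own figure:
using System;

class CollisionEstimate
{
    static void Main()
    {
        double d = Math.Pow(2, 35);          // size of the ID space
        double n = 1_000_000;                // number of IDs expected to be "live" at once (placeholder)

        // Expected number of duplicates among n uniform random draws from d values.
        double expectedDuplicates = n - d * (1 - Math.Pow(1 - 1 / d, n));

        // Probability of at least one collision (birthday approximation).
        double pCollision = 1 - Math.Exp(-n * (n - 1) / (2 * d));

        Console.WriteLine($"Expected duplicates: {expectedDuplicates:F2}");
        Console.WriteLine($"P(at least one collision): {pCollision:P2}");
    }
}
For 35 bits, even a million live IDs gives roughly n^2 / (2d), i.e. around 15 expected duplicates, which shows how quickly the space gets crowded.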
I think for your particular problem all those random number providers will work about the same: all should generate a nearly ideal, even distribution of values.
I heard GUID generation includes the MAC address as part of the generation, so it might influence some parts more than others, but I'm not sure. Most likely it is evenly distributed as well, but you should check that before relying on it.
The main question you should answer is: do you really need random numbers, or are consecutive ones fine? Maybe consecutive addresses will work better and have better performance because of caching. So it might be good to distribute the address space among your machines, have a full guarantee of when a collision will occur, and handle it appropriately.

Comparing available processing power of two machines

Think of a load balancer which is to balance the load according to the available (remaining) processing power of its units. How would you calculate this parameter to compare?
I'm trying to implement this in C#, and so far I can query the CPU usage as a percentage, but that isn't enough on its own, since different machines might be using different processors. Perhaps if I could find the processing power of each machine and multiply it by its free CPU percentage, that would be a good estimate.
But what are the important parameters of a processor to include and how to aggregate them into one single number?
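For reference, the CPU-usage part mentioned above can be read with a performance counter; the "capacity score" below (core count times idle fraction) is just one possible aggregation, not an established metric, and per-core clock speed would still need to be folded in separately:
using System;
using System.Diagnostics;
using System.Threading;

class CapacityProbe
{
    // Very rough "available capacity" score: core count times the idle fraction.
    // Folding in per-core clock speed (e.g. from WMI's Win32_Processor) is left out here.
    static double AvailableCapacity()
    {
        using (var cpu = new PerformanceCounter("Processor", "% Processor Time", "_Total"))
        {
            cpu.NextValue();            // first sample is always 0, so prime the counter
            Thread.Sleep(1000);
            double busyPercent = cpu.NextValue();
            double idleFraction = (100.0 - busyPercent) / 100.0;
            return Environment.ProcessorCount * idleFraction;
        }
    }

    static void Main() => Console.WriteLine($"Capacity score: {AvailableCapacity():F2}");
}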

How can I compare two captures to see which one is louder?

Given two byte arrays of data captured from a microphone, how can I determine which one has more spikes in noise? I would assume there is an algorithm I can apply to the data, but I have no idea where to start.
Getting down to it, I need to be able to determine when a baby is crying vs ambient noise in the room.
If it helps, I am using the Microsoft.Xna.Framework.Audio.Microphone class to capture the sound.
You can convert each sample (normalised to the range -1.0 to 1.0) into a decibel rating by applying the formula
dB = 20 * log10(|sample value|)
To be honest, so long as you don't mind the occasional false positive, and your microphone is set up OK, you should have no problem telling the difference between a baby crying and ambient background noise, without going through the hassle of doing an FFT.
I'd recommend having a look at the source code for a noise gate, which does pretty much what you are after, with configurable attack times and thresholds.
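A small sketch of that conversion, assuming the byte array holds 16-bit little-endian PCM samples (which is what the XNA Microphone buffer delivers):
using System;

class Loudness
{
    // Converts a buffer of 16-bit PCM samples into a peak dBFS figure using
    // dB = 20 * log10(|sample|), with samples normalised to the -1..1 range.
    static double PeakDb(byte[] buffer)
    {
        double peak = 0;
        for (int i = 0; i < buffer.Length - 1; i += 2)
        {
            short sample = BitConverter.ToInt16(buffer, i);
            double normalised = Math.Abs(sample / 32768.0);
            if (normalised > peak) peak = normalised;
        }
        return peak > 0 ? 20 * Math.Log10(peak) : double.NegativeInfinity;
    }

    static void Main()
    {
        byte[] demo = { 0x00, 0x40, 0x00, 0x10 };     // two samples: 0x4000 and 0x1000
        Console.WriteLine($"{PeakDb(demo):F1} dBFS"); // roughly -6 dBFS for the louder sample
    }
}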
First use a Fast Fourier Transform to transform the signal into the frequency domain.
Then check if the signal in the typical "cry-frequencies" is significantly higher than the other amplitudes.
The preprocessor of the speex codec supports noise vs signal detection, but I don't know if you can get it to work with XNA.
Or if you really want some kind of loudness measure, calculate the sum of squares of the amplitudes from the frequencies you're interested in (for example 50-20000 Hz), and if the average of that over the last 30 seconds is significantly higher than the average over the last 10 minutes, or exceeds a certain absolute threshold, sound the alarm.
Louder at what point? The signal's average amplitude will tell you which one is louder on average, but that is kind of a dumb, brute force way to go about it. It may work for you in practice though.
Getting down to it, I need to be able to determine when a baby is crying vs ambient noise in the room.
Ok, so, I'm just throwing out ideas here; I am by no means an expert on audio processing.
If you know your input, i.e., a baby crying (relatively loud with a high pitch) versus ambient noise (relatively quiet), you should be able to analyze the signal in terms of pitch (frequency) and amplitude (loudness). Of course, if during the recording someone drops some pots and pans onto the kitchen floor, that will be tough to discern.
As a first pass I would simply traverse the signal, maintaining a standard deviation of pitch and amplitude throughout, and then set a flag when those deviations jump beyond some threshold that you will have to define. When they come back down you may be able to safely assume that you captured the baby's cry.
Again, just throwing you an idea here. You will have to see how it works in practice with actual data.
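A sketch of that idea for the amplitude side; the background-history length and the 3-sigma threshold are placeholders to tune against real recordings, and windowRms would come from a sum-of-squares over each capture buffer:
using System;
using System.Collections.Generic;
using System.Linq;

class CryDetector
{
    // Flags a window as a loud event when its RMS amplitude sits several standard
    // deviations above the running background level.
    private readonly Queue<double> _background = new Queue<double>();
    private const int BackgroundWindows = 300;   // e.g. 300 x 100 ms = 30 s of history
    private const double Sigmas = 3.0;

    public bool IsLoudEvent(double windowRms)
    {
        bool loud = false;
        if (_background.Count > 10)
        {
            double mean = _background.Average();
            double std = Math.Sqrt(_background.Average(x => (x - mean) * (x - mean)));
            loud = windowRms > mean + Sigmas * std;
        }

        if (!loud)                               // only let quiet windows update the baseline
        {
            _background.Enqueue(windowRms);
            if (_background.Count > BackgroundWindows) _background.Dequeue();
        }
        return loud;
    }
}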
I agree with @Ed Swangren; it will take a lot of playing with samples of data from a lot of sources. To me, it sounds like the trick will be to limit, or hopefully eliminate, false positives. My experience with babies is that they are much louder crying than the environment, so keep track of the average measurements (freq/amp/??) of the normal environment and then classify how well changes match the characteristics of a crying baby. Those characteristics change from kid to kid, so you'll probably want a system that 'learns'. Best of luck.
Update: you might find this library useful: http://naudio.codeplex.com/

Determining system requirements (hardware, processor & memory) for a batch based software application

I am tasked with building an application wherein the business users will be defining a number of rules for data manipulation & processing (e.g. taking one numerical value and splitting it equally amongst a number of records selected on the basis of the condition specified in the rule).
On a monthly basis, a batch application has to be run in order to process around half a million records as per the rules defined. Each record has around 100 fields. The environment is .NET, C# and SQL Server with a third-party rule engine.
Could you please suggest how to go about defining and/or ascertaining what kind of hardware will be best suited if the requirement is to process the records within a timeframe of, say, around 8 to 10 hours? How will the specs vary if the user wants to increase or decrease the timeframe depending on hardware costs?
Thanks in advance
Abby
Create the application and profile it?
Step 0. Create the application. It is impossible to tell the real-world performance of a multi-computer system like the one you're describing from "paper" specifications... You need to try it and see what holds the biggest slowdowns... This is traditionally physical IO, but not always...
Step 1. Profile with sample sets of data in an isolated environment. This is a gross metric. You're not trying to isolate what takes the time, just measuring the overall time it takes to run the rules.
What does isolated environment mean? You want to use the same sorts of network hardware between the machines, but do not allow any other traffic on that network segment. That introduces too many variables at this point.
What does profile mean? With current hardware, measure how long it takes to complete under the following circumstances. Write a program to automate the data generation.
Scenario 1. 1,000 of the simplest rules possible.
Scenario 2. 1,000 of the most complex rules you can reasonably expect users to enter.
Scenarios 3 & 4. 10,000 Simplest and most complex.
Scenarios 5 & 6. 25,000 Simplest and Most complex
Scenarios 7 & 8. 50,000 Simplest and Most complex
Scenarios 9 & 10. 100,000 Simplest and Most complex
Step 2. Analyze the data.
See if there are trends in completion time. Figure out whether they appear tied strictly to the volume of rules or whether the complexity also factors in... I assume it will.
Develop a trend line that shows how long you can expect it to take if there are 200,000 and 500,000 rules. Perform another run at 200,000. See if the trend line is correct, if not, revise your method of developing the trend line.
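A minimal sketch of fitting that trend line with ordinary least squares; the (rule count, minutes) pairs are placeholders for your own measurements, and you may want a quadratic or per-complexity fit if completion time turns out not to be linear in rule count:
using System;
using System.Linq;

class TrendLine
{
    static void Main()
    {
        // Measured (rule count, minutes to complete) pairs from the profiling runs.
        // These numbers are placeholders; substitute your own measurements.
        var runs = new (double rules, double minutes)[]
        {
            (1_000, 4), (10_000, 38), (25_000, 97), (50_000, 201), (100_000, 410)
        };

        // Ordinary least-squares fit: minutes ~= a * rules + b.
        double n = runs.Length;
        double sx = runs.Sum(r => r.rules);
        double sy = runs.Sum(r => r.minutes);
        double sxx = runs.Sum(r => r.rules * r.rules);
        double sxy = runs.Sum(r => r.rules * r.minutes);
        double a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double b = (sy - a * sx) / n;

        foreach (double target in new[] { 200_000.0, 500_000.0 })
            Console.WriteLine($"Predicted minutes at {target:N0} rules: {a * target + b:F0}");
    }
}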
Step 3. Measure the database and network activity as the system processes the 20,000 rule sets. See if there is more activity happening with more rules. If so the more you speed up the throughput to and from the SQL server the faster it will run.
If these are "relatively low," then CPU and RAM speed are likely where you'll want to beef up the requested machine's specification...
Of course if all this testing is going to cost your employer more than buying the beefiest server hardware possible, just quantify the cost of the time spent testing vs. the cost of buying the best server and being done with it and only tweaking your app and the SQL that you control to improve performance...
If this system is not the first of its kind, you can consider the following:
Re-use (after additional evaluation) hardware requirements from previous projects
Evaluate hardware requirements based on workload and hardware configuration of existing application
If that is not the case and performance requirements are very important, then the best way would be to create a prototype with, say, 10 rules implemented. Process the dataset using the prototype and extrapolate to a full rule set. Based on this information you should be able to derive initial performance and hardware requirements. Then you can fine tune these specifications taking into account planned growth in processed data volume, scalability requirements and redundancy.
