I have an array in C# which contains numbers (e.g. int, float or double); I have another array of ranges (each defined by a lower and an upper bound). My current implementation is something like this.
foreach (var v in data)
{
    foreach (var row in ranges)
    {
        if (v >= row.lower && v <= row.high)
        {
            statistics[row]++;
            break;
        }
    }
}
So the algorithm is O(mn), where m is the number of ranges and n is the number of data values.
Can this be improved? In practice n is large, and I want this to be as fast as possible.
Sort the data array, then for each interval find the first index in data that falls inside the range and the last one (both using binary search). The number of elements in the interval is then easily computed as lastIdx - firstIdx (or +1, depending on whether lastIdx is inclusive).
This is done in O(m log m + n log m), where m is the number of data values and n the number of intervals.
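A minimal C# sketch of that approach (a sketch, not the poster's code: it assumes data is a double[] and that ranges carry the same inclusive lower/high bounds as in the question; LowerBound and UpperBound are ordinary binary searches):

Array.Sort(data); // O(m log m), with m = data.Length
foreach (var row in ranges)
    statistics[row] = UpperBound(data, row.high) - LowerBound(data, row.lower);

// First index i with a[i] >= value.
static int LowerBound(double[] a, double value)
{
    int lo = 0, hi = a.Length;
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] < value) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

// First index i with a[i] > value.
static int UpperBound(double[] a, double value)
{
    int lo = 0, hi = a.Length;
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] <= value) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

Each range costs two O(log m) searches, and every element between the two indices lies inside the inclusive range, so the difference is the count.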
Bonus: If the data changes constantly, you can use an order statistics tree with the same approach (such a tree lets you easily find the index of each element, and supports modifying the data).
Bonus2: Optimality proof
Using comparison-based algorithms, this cannot be done better, since if we could, we could also solve the element distinctness problem faster.
Element Distinctness Problem:
Given an array a1, a2, ..., an, find out whether there are i, j such that i != j and ai = aj.
This problem is known to have an Omega(n log n) lower bound for comparison-based algorithms.
Reduction:
Given an instance of the element distinctness problem a1, ..., an, create data = a1, ..., an and the intervals [a1,a1], [a2,a2], ..., [an,an], and run the algorithm.
If there are more than n matches, there are duplicates; otherwise there are none.
The complexity of the above is O(n + f(n)), where n is the number of elements and f(n) is the complexity of this algorithm. Since the whole thing must be Omega(n log n), so must f(n), and we can conclude there is no more efficient algorithm.
Assuming the ranges are ordered, you always take the first range that fits, right?
This means that you could easily build a binary tree of the lower bounds. You find the highest lower bound that's lower than your number, and check whether the value fits the corresponding upper bound. If the tree is properly balanced, this can get you quite close to O(n log m). Of course, if you don't need to change the ranges frequently, a simple ordered array will do; just use the usual binary search methods.
Using a hashtable instead could get you pretty close to O(n), depending on how the ranges are structured. If data is also ordered, you could get even better results.
An alternate solution that doesn't involve sorting the data:
var dictionary = new Dictionary<int, int>();
foreach (var v in data) {
    if (dictionary.ContainsKey(v)) {
        dictionary[v]++;
    } else {
        dictionary[v] = 1;
    }
}
foreach (var row in ranges) {
    for (var i = row.lower; i <= row.high; i++) {
        // Values that never occur in data are absent from the dictionary.
        if (dictionary.TryGetValue(i, out var count)) {
            statistics[row] += count;
        }
    }
}
Get a count of the number of occurrences of each value in data, and then sum the counts between the bounds of your range.
I'm trying to find multiple ways to solve Project Euler's problem #13. I've already solved it two different ways, but what I am trying to do this time is to have my solution read from a text file that contains all of the numbers; from there it converts them and adds the right-most column of digits. I also want to solve this problem in such a way that if we add new numbers to our list, the list can contain any number of rows or columns, so its length is not predefined (non-array? I'm not sure if a jagged array would apply properly here since it can't be predefined).
So far I've got:
static void Main(string[] args)
{
    List<int> sum = new List<int>();
    string bigIntFile = @"C:\Users\Justin\Desktop\BigNumbers.txt";
    string result;
    using (StreamReader streamReader = new StreamReader(bigIntFile))
    {
        while ((result = streamReader.ReadLine()) != null)
        {
            for (int i = 0; i < result.Length; i++)
            {
                // Each character of the line becomes a single digit.
                int converted = Convert.ToInt32(result.Substring(i, 1));
                sum.Add(converted);
            }
        }
    }
}
which reads the file and converts each character of the string to a single int. I'm trying to think how I can store those ints in a collection that works like a 2D array, but the collection needs to be versatile and store any number of rows/columns. Any ideas on how to store these digits other than just a basic list? Is there maybe a way I can set up a list so it's like a 2D array whose size is not predefined? Thanks in advance!
UPDATE: Also, I don't want to use BigInteger. That'd be a little too easy: read the line, convert the string to a BigInteger, store it in a BigInteger list and then sum up all the integers from there.
There is no resizable 2D collection built into the .NET framework. I'd just go with the "jagged arrays" type of data structure, just with lists:
List<List<int>>
You can also vary this pattern by using an array for each row:
List<int[]>
If you want to read the file a little more simply, here is how (the framework method is File.ReadLines, which streams the lines lazily):
List<int[]> numbers =
    File.ReadLines(path)
        .Select(line => line.Select(ch => ch - '0').ToArray())
        .ToList();
Far less code. You can reuse a lot of built-in stuff to do basic data transformations. That gives you less code to write and to maintain. It is more extensible and it is less prone to bugs.
If you want to select a column from this structure, do it like this:
int colIndex = ...;
int[] column = numbers.Select(row => row[colIndex]).ToArray();
You can encapsulate this line into a helper method to remove noise from your main addition algorithm.
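Such a helper might look like this (the name GetColumn is illustrative, not from the original post):

// Pulls a single column of digits out of the jagged structure.
static int[] GetColumn(List<int[]> rows, int colIndex) =>
    rows.Select(row => row[colIndex]).ToArray();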
Note, that the efficiency of all those patterns is far less than a 2D array, but in your case it is good enough.
In this case you can simply use a 2D array, since you actually do know its dimensions in advance: 100 x 50.
If for some reason you want to solve a more general problem, you may indeed use a List of Lists, List<List<int>>.
Having said that, I wonder: are you actually trying to sum up all the numbers? If so, I would suggest another approach: consider which part of the 50-digit numbers actually influences the first digits of their sum. Hint: you don't need the entire number.
I want to generate a sequence of unique random numbers in the range of 00000001 to 99999999.
So the first one might be 00001010, the second 40002928 etc.
The easy way is to generate a random number and store it in the database, and every subsequent time generate another one and check in the database whether the number already exists; if so, generate a new one, check it again, etc.
But that doesn't look right; I could be regenerating a number maybe 100 times if the number of generated items gets large.
Is there a smarter way?
EDIT
As always, I forgot to say WHY I wanted this; it will probably make things clearer and maybe suggest an alternative. It is:
we want to generate an order number for a booking, so we could just use 000001, 000002, etc. But we don't want to give competitors a clue of how many orders are created (it's not a high-volume market, and we don't want them to know whether we are at order 30 after 2 months or at order 100). So we want an order number which is random (yet unique).
You can use either a Linear Congruential Generator (LCG) or a Linear Feedback Shift Register (LFSR). Google or Wikipedia for more info.
Both can, with the right parameters, operate on a 'full-cycle' (or 'full-period') basis, so that they generate each 'pseudo-random number' exactly once per period and cover every number in the range. Both are 'weak' generators, so no good for cryptography, but perhaps 'good enough' for apparent randomness. You may have to constrain the output to your 'decimal' maximum, since these generators naturally have 'binary' (power-of-two-sized) periods.
Update: I should add that it is not necessary to pre-calculate or pre-store previous values in any way; you only need to keep the previous seed value (a single int) and calculate the next number in the sequence on demand. Of course you can save a chain of pre-calculated numbers to your DB if desired, but it isn't necessary.
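As a hedged sketch of that idea: the parameters below are my own example, chosen to satisfy the Hull-Dobell full-period conditions for m = 10^8 (c coprime to m; a - 1 divisible by m's prime factors 2 and 5, and by 4), so every value in 0..99999999 appears exactly once per period:

const int M = 100000000;
const long A = 21, C = 3; // example Hull-Dobell parameters, verify before use

// Only the previous value needs to be stored between calls.
static int NextOrderNumber(int previous)
{
    long x = previous;
    do
    {
        x = (A * x + C) % M;
    } while (x == 0); // skip 0: the desired range starts at 00000001
    return (int)x;
}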
How about creating a set of all possible numbers and simply randomising the order? You could then just pick the next number from the tail.
Each number appears only once in the set, and when you want a new one it has already been generated, so the overhead is tiny at the point at which you want one. You could do this in memory or the database of your choice. You'll just need a sensible locking strategy for pulling the next available number.
You could build a table with all the possible numbers in it, and give each record a 'used' field.
Select all records that have not been 'used'
Pick a random number (r) between 1 and record count
Take record number r
Get your 'random value' from the record
Set the 'used' flag and update the db.
That should be more efficient than picking random numbers, querying the database and repeating until not found; that approach is just begging to take an eternity for the last few values.
Use Pseudo-random Number Generators.
For example, a Linear Congruential Random Number Generator
(if increment and n are coprime, the code will generate all numbers from 0 to n-1):
int seed = 1, increment = 3;
int n = 10;
int x = seed;
for (int i = 0; i < n; i++)
{
    x = (x + increment) % n;
    Console.WriteLine(x);
}
Output:
4
7
0
3
6
9
2
5
8
1
Basic Random Number Generators
Mersenne Twister
Using this algorithm might be suitable, though it's memory-consuming:
http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle
Put the numbers in an array from 1 to 99999999 and do the shuffle, as sketched below.
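A hedged C# sketch of that shuffle (be aware of the memory cost: an int[] of 99,999,999 elements is roughly 400 MB):

// Fisher-Yates shuffle over the full range 1..count.
static int[] ShuffledRange(int count, Random rng)
{
    var numbers = new int[count];
    for (int i = 0; i < count; i++)
        numbers[i] = i + 1; // fill with 1..count

    for (int i = count - 1; i > 0; i--)
    {
        int j = rng.Next(i + 1); // uniform in 0..i
        int tmp = numbers[i];    // swap numbers[i] and numbers[j]
        numbers[i] = numbers[j];
        numbers[j] = tmp;
    }
    return numbers;
}

// Usage: var sequence = ShuffledRange(99999999, new Random());
// Hand the numbers out in order; each appears exactly once.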
For the extremely limited size of your numbers, no, you cannot expect uniqueness from any type of random generation.
You are generating a 32-bit integer, whereas to reach uniqueness by chance you need a much larger number, in the region of 128 bits, which is the size GUIDs use; those are guaranteed to always be globally unique.
In case you happen to have access to a library and you want to dig into and understand the issue well, take a look at
The Art of Computer Programming, Volume 2: Seminumerical Algorithms
by Donald E. Knuth. Chapter 3 is all about random numbers.
You could just place your numbers in a set. If the size of the set after generation of your N numbers is too small, generate some more.
Do some trial runs. How many numbers do you have to generate on average? Try to find the optimal solution to the tradeoff "generate too many numbers" / "check too often for duplicates". This optimum is a number M such that, after generating M numbers, your set will likely hold N unique numbers.
Oh, and M can also be calculated: if you need one extra number (your set contains N-1), then the chance of a fresh random number already being in the set is (N-1)/R, with R being the range, so the expected number of draws for that last number is R/(R-N+1). I'm going crosseyed here, so you'll have to figure the rest out yourself (but this kinda stuff is what makes programming fun, no?).
You could put a unique constraint on the column that contains the random number, then handle any constraint violations by regenerating the number. I think this normally indexes the column as well, so the check would be faster.
You've tagged the question with C#, so I'm guessing you're using C# to generate the random number. Maybe think about getting the database to generate the random number in a stored proc, and return it.
You could try generating your numbers from a starting value and an incremental value. You start at a number (say, 12000); then, for each account created, the number goes up by the incremental value.
id = startValue + (totalNumberOfAccounts * incrementalNumber)
If incrementalNumber is a prime value, you should be able to loop around the maximum account value and not hit the same value twice. This creates the illusion of a random id, but should also produce very few conflicts. In the case of a conflict, you could add a count of conflicts, so the above code becomes the line below. We want to handle this case since, if we encounter one account value that is identical, we would bump into another conflict each time we increment again.
id = startValue + (totalNumberOfAccounts * incrementalNumber) + totalConflicts
With the following line we can get, e.g., 6 non-repeating random numbers for a range, e.g., 1 to 100.
var randomNumbers = Enumerable.Range(1, 100)
.OrderBy(n => Guid.NewGuid())
.Take(6)
.OrderBy(n => n);
I've had to do something like this before (create a "random looking" number for part of a URL). What I did was create a list of randomly generated keys. Each time a new number was needed, it randomly selected one of the keys (an index into keys.Count), XORed that key with the given sequence number, and output the XORed value (in base 62) prefixed with the key's index (in base 62).
I also checked the output to ensure it did not contain any naughty words. If it did, I simply took the next key and had a second go.
Decrypting the number is equally simple (the first digit is the index of the key to use; a simple XOR and you are done).
I like andora's answer if you are generating new numbers, and might have used it had I known about it. However, if I were to do this again I would simply use UUIDs. Most (if not every) platform has a method for generating them, and the length is just not an issue for URLs.
You could try shuffling the set of possible values then using them sequentially.
I like Lazarus's solution, but if you want to avoid effectively pre-allocating the space for every possible number, just store the used numbers in the table, and build an "unused numbers" list in memory by adding all possible numbers to a collection and then deleting every one that's present in the database. Then select one of the remaining numbers and use it, adding it to the list in the database, obviously.
But, like I say, I like Lazarus's solution; I think that's your best bet for most scenarios.
function getShuffledNumbers(count) {
    var shuffledNumbers = new Array();
    var choices = new Array();
    var selectedNumber;
    for (var i = 0; i < count; i++) {
        // choose a number between 1 and amount of numbers remaining
        choices[i] = selectedNumber = Math.ceil(Math.random() * (99999999 - i));
        // Now to figure out the number based on this selection, work backwards until
        // you figure out which choice this number WOULD have been on the first step
        for (var j = 0; j < i; j++) {
            if (choices[i - 1 - j] >= selectedNumber) {
                // This basically says "it was choice number (selectedNumber) on the last step,
                // but if it's greater than or equal to this, it must have been choice number
                // (selectedNumber + 1) on THIS step."
                selectedNumber++;
            }
        }
        shuffledNumbers[i] = selectedNumber;
    }
    return shuffledNumbers;
}
This is as fast a way as I could think of, and it only uses memory as it needs it. However, if you run it all the way through, it will use twice as much memory, because it keeps two arrays: choices and shuffledNumbers.
Running a linear congruential generator once to generate each number is apt to produce rather feeble results. Running it through a number of iterations which is relatively prime to your base (100,000,000 in this case) will improve it considerably. If before reporting each output from the generator, you run it through one or more additional permutation functions, the final output will still be a duplicate-free permutation of as many numbers as you want (up to 100,000,000) but if the proper functions are chosen the result can be cryptographically strong.
Create and store in the DB two shuffled versions (SHUFFLE_1 and SHUFFLE_2) of the interval [0..N), where N = 10,000;
whenever a new order is created, you assign its id like this:
ORDER_FAKE_INDEX = N*SHUFFLE_1[ORDER_REAL_INDEX / N] + SHUFFLE_2[ORDER_REAL_INDEX % N]
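In code, the assignment might look like this (a sketch; shuffle1 and shuffle2 stand for the stored SHUFFLE_1 and SHUFFLE_2 permutations of 0..9999):

// Two permutations of 0..9999 compose into a permutation of 0..99999999.
static int FakeIndex(int realIndex, int[] shuffle1, int[] shuffle2)
{
    const int N = 10000;
    return N * shuffle1[realIndex / N] + shuffle2[realIndex % N];
}

Since both lookups are bijections, distinct real indices always produce distinct fake indices.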
I also ran into the same kind of problem, but in C#. I finally solved it; hope it works for you too.
Suppose I need a random number between 0 and some MaxValue, and I have a Random object called random.
int n = 0;
while (n < MaxValue)
{
    int i = random.Next(n, MaxValue);
    n++;
    Console.WriteLine(i);
}
The stupid way: build a table, store all the numbers in it first, and then, every time a number is used, flag it as "used".
System.Random rnd = new System.Random();
IEnumerable<int> numbers = Enumerable.Range(1, 99999999).OrderBy(r => rnd.Next()); // 1 .. 99999999
This gives a randomly shuffled collection of ints in your range. You can then iterate through the collection in order.
The nice part about this would be that you're not actually creating the entire collection in memory.
See comments below: this is not true; OrderBy buffers the entire collection in memory as soon as you iterate to the first element.
You can generate numbers like below, if you are OK with the memory consumption.
import java.util.ArrayList;
import java.util.Collections;

public class UniqueRandomNumbers {
    public static void main(String[] args) {
        ArrayList<Integer> list = new ArrayList<Integer>();
        for (int i = 1; i < 11; i++) {
            list.add(i);
        }
        Collections.shuffle(list);
        for (int i = 0; i < list.size(); i++) {
            System.out.println(list.get(i));
        }
    }
}
I was trying to create a helper function in C# that returns the first n prime numbers. I decided to store the numbers in a dictionary in the <int, bool> format. The key is the number in question and the bool represents whether the int is prime or not. There are a ton of resources out there for calculating/generating prime numbers (SO included), so I thought I'd join the masses by crafting another trivial prime number generator.
My logic goes as follows:
public static Dictionary<int, bool> GetAllPrimes(int number)
{
    Dictionary<int, bool> numberArray = new Dictionary<int, bool>();
    int current = 2;
    while (current <= number)
    {
        // If current has not been marked as composite in a previous iteration, it is prime
        if (!numberArray.ContainsKey(current))
            numberArray.Add(current, true);

        int i = 2;
        while (current * i <= number)
        {
            if (!numberArray.ContainsKey(current * i))
                numberArray.Add(current * i, false);
            else if (numberArray[current * i]) // current*i cannot be a prime
                numberArray[current * i] = false;
            i++;
        }
        current++;
    }
    return numberArray;
}
It would be great if the wise could provide me with suggestions, optimizations, and possible refactorings. I was also wondering whether the inclusion of the Dictionary helps the run-time of this snippet.
Storing integers explicitly needs at least 32 bits per prime number, with some overhead for the container structure.
At around 2^31, the maximal value a signed 32-bit integer can take, about every 21.5th number is prime. Smaller primes are more dense; about 1 in ln(n) numbers is prime around n.
This means it is more memory efficient to use an array of bits than to store numbers explicitly. It will also be much faster to look up if a number is prime, and reasonably fast to iterate through the primes.
It seems this is called a BitArray in C# (in Java it is BitSet).
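A minimal sketch of a sieve over a BitArray (one bit per number; a set bit marks a composite):

using System.Collections;

// Sieve of Eratosthenes: bit i set => i is composite.
static BitArray SieveComposites(int limit)
{
    var isComposite = new BitArray(limit + 1);
    isComposite[0] = true;
    if (limit >= 1) isComposite[1] = true;
    for (int p = 2; (long)p * p <= limit; p++)
    {
        if (isComposite[p]) continue; // p is prime
        for (long m = (long)p * p; m <= limit; m += p)
            isComposite[(int)m] = true;
    }
    return isComposite;
}

// Usage: primality lookup is a single bit test.
// var composites = SieveComposites(1000000);
// bool isPrime = !composites[999983];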
The first thing that bothers me is: why are you storing the number itself?
Can't you just use the index itself, which will represent the number?
PS: I'm not a C# developer, so maybe it is not possible with a dictionary, but it can be done with an appropriate structure.
First, you only have to loop until the square root of the number. Make all numbers false by default and have a simple flag that you set to true at the beginning of every iteration.
Further, don't store it in a dictionary. Make it a bool array and have the index be the number you're looking for. Only index 0 won't make any sense, but that doesn't matter. You don't have to initialise either; bools are false by default. Just declare a bool[] of length number.
Then I would initialise like this:
primes[2] = true;
for (int i = 3; i < sqrtNumber; i += 2) {
    // sieve work for odd i goes here
}
So you skip all the even numbers automatically.
By the way, never declare a variable (i) in a loop, it makes it slower.
So that's about it. For more info see this page.
I'm pretty sure the Dictionary actually hurts performance, since it doesn't enable you to perform the trial divisions in an optimal order. Traditionally, you would store the known primes so that they could be iterated from smallest to largest, since smaller primes are factors of more composite numbers than larger primes. Additionally, you never need to try division with any prime larger than the square root of the candidate prime.
Many other optimizations are possible (as you yourself point out, this problem has been studied to death) but those are the ones that I can see off the top of my head.
The dictionary really doesn't make sense here; just store all primes up to a given number in a list. Then follow these steps:
Is the given number in the list?
    Yes: it's prime. Done.
    Not in list:
        Is the given number larger than the list maximum?
            No: it's not prime. Done.
            Bigger than maximum; need to fill the list up to the given number:
                Run a sieve up to the given number.
                Repeat.
1) From the perspective of the client of this function, wouldn't it be better if the return type was bool[] (from 0 to number, perhaps)? Internally, you have three states (KnownPrime, KnownComposite, Unknown), which could be represented by an enumeration. Storing an array of this enumeration internally, prepopulated with Unknown, will be faster than a dictionary.
2) If you stick with the dictionary, the part of the sieve that marks multiples of the current number as composite could be replaced with a numberArray.TryGetValue() pattern rather than separate ContainsKey checks and subsequent retrieval of the value by key, as sketched below.
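For example, the marking step from the question might become (a sketch against the question's Dictionary<int, bool>):

// One hashed lookup replaces ContainsKey plus indexer access.
if (numberArray.TryGetValue(current * i, out bool isPrime))
{
    if (isPrime) // was provisionally marked prime; demote it
        numberArray[current * i] = false;
}
else
{
    numberArray.Add(current * i, false);
}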
The trouble with returning an object that holds the primes is that unless you're careful to make it immutable, client code is free to mess up the values, in turn meaning you're not able to cache the primes you've already calculated.
How about having a method such as:
bool IsPrime(int primeTest);
in your helper class that can hide the primes it has already calculated, meaning you don't have to re-calculate them every time.
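One hedged sketch of such a helper (the class name and caching strategy are illustrative, not prescriptive):

using System.Collections.Generic;

// Caches primes found so far; callers never see the internal collection.
static class PrimeHelper
{
    private static readonly HashSet<int> Known = new HashSet<int>();
    private static int _checkedUpTo = 1;

    public static bool IsPrime(int primeTest)
    {
        // Lazily extend the cache up to the requested number.
        for (int k = _checkedUpTo + 1; k <= primeTest; k++)
            if (TrialDivision(k)) Known.Add(k);
        if (primeTest > _checkedUpTo) _checkedUpTo = primeTest;
        return Known.Contains(primeTest);
    }

    private static bool TrialDivision(int n)
    {
        if (n < 2) return false;
        for (int d = 2; (long)d * d <= n; d++)
            if (n % d == 0) return false;
        return true;
    }
}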
I have a database table with a large number of rows and one numeric column, and I want to represent this data in memory. I could just use one big integer array and this would be very fast, but the number of rows could be too large for this.
Most of the rows (more than 99%) have a value of zero. Is there an effective data structure I could use that would only allocate memory for rows with non-zero values and would be nearly as fast as an array?
Update: as an example, one thing I tried was a Hashtable, reading the original table and adding any non-zero values, keyed by the row number in the original table. I got the value with a function that returned 0 if the requested index wasn't found, or else the value in the Hashtable. This works but is slow as dirt compared to a regular array - I might not be doing it right.
Update 2: here is sample code.
private Hashtable _rowStates;

private void SetRowState(int rowIndex, int state)
{
    if (_rowStates.ContainsKey(rowIndex))
    {
        if (state == 0)
        {
            _rowStates.Remove(rowIndex);
        }
        else
        {
            _rowStates[rowIndex] = state;
        }
    }
    else
    {
        if (state != 0)
        {
            _rowStates.Add(rowIndex, state);
        }
    }
}

private int GetRowState(int rowIndex)
{
    if (_rowStates.ContainsKey(rowIndex))
    {
        return (int)_rowStates[rowIndex];
    }
    else
    {
        return 0;
    }
}
This is an example of a sparse data structure and there are multiple ways to implement such sparse arrays (or matrices) - it all depends on how you intend to use it. Two possible strategies are:
Store only non-zero values. For each element different than zero store a pair (index, value), all other values are known to be zero by default. You would also need to store the total number of elements.
Compress consecutive zero values. Store a number of (count, value) pairs. For example if you have 12 zeros in a row followed by 200 and another 22 zeros, then store (12, 0), (1, 200), (22, 0).
I would expect that the map/dictionary/hashtable of the non-zero values should be a fast and economical solution.
In Java, using the Hashtable class would introduce locking because it is supposed to be thread-safe. Perhaps something similar has slowed down your implementation.
--- update: some Google-fu suggests that the C# Hashtable does incur an overhead for thread safety. Try a Dictionary instead, as sketched below.
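For example, the question's code might port to Dictionary<int, int> like this (a sketch; same semantics, but no boxing and one hashed lookup per read):

private readonly Dictionary<int, int> _rowStates = new Dictionary<int, int>();

private void SetRowState(int rowIndex, int state)
{
    if (state == 0)
        _rowStates.Remove(rowIndex); // zero is the implicit default
    else
        _rowStates[rowIndex] = state; // adds or overwrites in one call
}

private int GetRowState(int rowIndex)
{
    return _rowStates.TryGetValue(rowIndex, out int state) ? state : 0;
}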
How exactly you want to implement it depends on what your requirements are; it's a tradeoff between memory and speed. A pure integer array is the fastest, with constant-time lookups.
Using a hash-based collection such as Hashtable or Dictionary (Hashtable seems to be slower but thread-safe, as others have pointed out) will give you very low memory usage for a sparse data structure like yours, but can be somewhat more expensive when performing lookups. You store a key-value pair for each index with a non-zero value.
You can use ContainsKey to find out whether a key exists, but it is significantly faster to use TryGetValue to make the check and fetch the data in one go. For dense data it can be worth catching exceptions for missing elements, as this incurs a cost only in the exceptional case and not on each lookup.
Edited again as I got myself confused - that'll teach me to post when I ought to be sleeping.
You're paying a boxing penalty by using Hashtable. Try switching to a Dictionary<int, int>. Also, how many rows are we talking about, and how fast do you need it to be?
Create an integer array for the non-zero values and a bit array holding indicators of whether a particular row contains a non-zero value.
You can then find the necessary element in the first array by summing up the bits in the second array from 0 up to the row index position.
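A naive C# sketch of that lookup (field names are illustrative; a real implementation would precompute per-block bit counts so the rank step is not a linear scan):

using System.Collections;

private int[] _nonZeroValues; // values of the flagged rows, in row order
private BitArray _hasValue;   // bit i set => row i holds a non-zero value

private int GetRowState(int rowIndex)
{
    if (!_hasValue[rowIndex]) return 0;
    int rank = 0; // number of non-zero rows before rowIndex
    for (int i = 0; i < rowIndex; i++)
        if (_hasValue[i]) rank++;
    return _nonZeroValues[rank];
}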
I am not sure about the efficiency of this solution, but you can try it. It depends on the scenario you will use it in, but I will describe two that I have in mind. The first, if you have just one field of integers, is to simply use a generic list of integers:
List<int> myList = new List<int>();
The second one is almost the same, but you can create a list of your own type. For example, if you have two fields, a count and a non-zero value, you can create a class with those two properties and then create a list of that class to store the information in. You could also try generic linked lists. The code for the second solution can look like this:
public class MyDbFields
{
    public MyDbFields(int count, int nonzero)
    {
        Count = count;
        NonZero = nonzero;
    }

    public int Count { get; set; }
    public int NonZero { get; set; }
}
Then you can create a list like this:
List<MyDbFields> fields_list = new List<MyDbFields>();
and then fill it with data:
fields_list.Add(new MyDbFields(100, 11));
I am not sure if this will fully solve your problem, but it is just my suggestion.
If I understand correctly, you cannot just select non-zero rows, because for each row index (aka PK value) your Data Structure will have to be able to report not only the value, but also whether or not it is there at all. So assuming 0 if you don't find it in your Data Structure might not be a good idea.
Just to make sure - exactly how many rows are we talking about here? Millions? A million integers would take up only 4MB RAM as an array. Not much really. I guess it must be at least 100'000'000 rows.
Basically I would suggest a sorted array of integer-pairs for storing non-zero values. The first element in each pair would be the PK value, and this is what the array would be sorted by. The second element would be the value. You can make a DB select that returns only these non-zero values, of course. Since the array will be sorted, you'll be able to use binary search to find your values.
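A minimal sketch of that lookup, assuming pks and values are parallel arrays sorted by PK (names are illustrative):

// Binary search over the sorted PK array; absent keys read as zero.
static int GetValue(int[] pks, int[] values, int pk)
{
    int idx = Array.BinarySearch(pks, pk);
    return idx >= 0 ? values[idx] : 0;
}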
If there are no "holes" in the PK values, then the only thing you would need besides this would be the minimum and maximum PK values so that you can determine whether a given index belongs to your data set.
If there are unused PK values between the used ones, then you need some other mechanism to determine which PK values are valid. Perhaps a bitmask or another array of valid (or invalid, whichever are fewer) PK values.
If you choose the bitmask way, there is another idea. Use two bits for every PK value. First bit will show if the PK value is valid or not. Second bit will show if it is zero or not. Store all non-zero values in another array. This however will have the drawback that you won't know which array item corresponds to which bitmask entry. You'd have to count all the way from the start to find out. This can be mitigated with some indexes. Say, for every 1000 entries in the value array you store another integer which tells you where this entry is in the bitmask.
Perhaps you are looking in the wrong area: all you are storing for each value is the row number of the database row, which suggests that perhaps you are just using this to retrieve the row?
Why not try indexing your table on the numeric column? This will provide lightning-fast access to the table rows for any given numeric value (which appears to be the ultimate objective here). If it is still too slow, you can move the index itself into memory, etc.
My point here is that your database may solve this problem more elegantly than you can.