Faster way to count number of sets an item appears in? - c#

I've got a list of bookmarks. Each bookmark has a list of keywords (stored as a HashSet). I also have a set of all possible keywords ("universe").
I want to find the keyword that appears in the most bookmarks.
I have 1356 bookmarks with a combined total of 698,539 keywords, with 187,358 unique.
If I iterate through every keyword in the universe and count the number of bookmarks it appears in, I'm doing 254,057,448 checks. This takes 35 seconds on my machine.
The algorithm is pretty simple:
var biggest = universe.MaxBy(kw => bookmarks.Count(bm => bm.Keywords.Contains(kw)));
Using Jon Skeet's MaxBy.
I'm not sure it's possible to speed this up much, but is there anything I can do? Perhaps parallelize it somehow?
dtb's solution takes under 200 ms to both build the universe and find the biggest element. So simple.
var freq = new FreqDict();
foreach (var bm in bookmarks) {
    freq.Add(bm.Keywords);
}
var biggest2 = freq.MaxBy(kvp => kvp.Value);
FreqDict is just a little class I made built on top of a Dictionary<string,int>.
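For reference, a minimal sketch of what such a class might look like (a hypothetical reconstruction; the original wasn't posted):
class FreqDict : Dictionary<string, int>
{
    public void Add(IEnumerable<string> keywords)
    {
        foreach (var kw in keywords)
        {
            TryGetValue(kw, out var count); // count is 0 when the keyword is new
            this[kw] = count + 1;
        }
    }
}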

You can get all keywords, group them, and get the biggest group. This uses more memory, but should be faster.
I tried this, and in my test it was about 80 times faster:
string biggest =
    bookmarks
        .SelectMany(m => m.Keywords)
        .GroupBy(k => k)
        .OrderByDescending(g => g.Count())
        .First()
        .Key;
Test run:
1536 bookmarks
153600 keywords
74245 unique keywords
Original:
12098 ms.
biggest = "18541"
New:
148 ms.
biggest = "18541"

I don't have your sample data, nor have I done any benchmarking, but I'll take a stab. One problem that could be improved upon is that most of the bm.Keywords.Contains(kw) checks are misses, and I think those can be avoided. The most constraining set is the keywords any one given bookmark has (i.e., it will typically be much smaller than the universe), so we should start in that direction instead of the other way around.
I'm thinking something along these lines. The memory requirement is much higher and since I haven't benchmarked anything, it could be slower, or not helpful, but I'll just delete my answer if it doesn't work out for you.
Dictionary<string, int> keywordCounts = new Dictionary<string, int>(universe.Length);
foreach (var keyword in universe)
{
    keywordCounts.Add(keyword, 0);
}
foreach (var bookmark in bookmarks)
{
    foreach (var keyword in bookmark.Keywords)
    {
        keywordCounts[keyword] += 1;
    }
}
var mostCommonKeyword = keywordCounts.MaxBy(x => x.Value).Key;

You don't need to iterate through the whole universe. The idea is to create a lookup and track the max as you go.
public Keyword GetMaxKeyword(IEnumerable<Bookmark> bookmarks)
{
    int max = 0;
    Keyword maxkw = null;
    Dictionary<Keyword, int> lookup = new Dictionary<Keyword, int>();
    foreach (var item in bookmarks)
    {
        foreach (var kw in item.Keywords)
        {
            int val = 1;
            if (lookup.ContainsKey(kw))
            {
                val = ++lookup[kw];
            }
            else
            {
                lookup.Add(kw, 1);
            }
            if (max < val)
            {
                max = val;
                maxkw = kw;
            }
        }
    }
    return maxkw;
}

50 ms in Python:
>>> import random
>>> universe = set()
>>> bookmarks = []
>>> for i in range(1356):
...     bookmark = []
...     for j in range(698539//1356):
...         key_word = random.randint(1000, 1000000000)
...         universe.add(key_word)
...         bookmark.append(key_word)
...     bookmarks.append(bookmark)
...
>>> key_word_count = {}
>>> for bookmark in bookmarks:
...     for key_word in bookmark:
...         key_word_count[key_word] = key_word_count.get(key_word, 0) + 1
...
>>> print max(key_word_count, key=key_word_count.__getitem__)
408530590
>>> print key_word_count[408530590]
3
>>>

Related

How to speed up string operations and avoid slow loops

I am writing code which builds a lot of combinations (combinations might not be the right word here; rather, sequences of characters in the order they actually appear in the string) that already exist in a string. The loop starts adding combinations to a List<string>, but unfortunately it takes a lot of time when dealing with any file over 200 bytes. I want to be able to work with hundreds of MBs here.
Let me explain what I actually want in the simplest of ways.
Let's say I have a string that is "Afnan is awesome" (-> main string); what I want is a list of strings which encompasses the different substring sequences of the main string. For example -> A,f,n,a,n, ,i,s, ,a,w,e,s,o,m,e. Now this is just the first iteration of the loop. With each iteration, my substring length increases, yielding these results for the second iteration -> Af,fn,na,n , i,is,s , a,aw,we,es,so,om,me. The third iteration would look like this: Afn,fna,nan,an ,n i, is,is ,s a, aw, awe, wes, eso, som, ome. This keeps going until my substring length reaches half the length of my main string.
My code is as follows:
string data = File.ReadAllText("MyFilePath");
//Creating my dictionary
List<string> dictionary = new List<string>();
int stringLengthIncrementer = 1;
for (int v = 0; v < (data.Length / 2); v++)
{
    for (int x = 0; x < data.Length; x++)
    {
        if ((x + stringLengthIncrementer) > data.Length) break; //So index does not go out of bounds
        if (dictionary.Contains(data.Substring(x, stringLengthIncrementer)) == false) //So no repetition takes place
        {
            dictionary.Add(data.Substring(x, stringLengthIncrementer)); //To add the substring to my List<string> -> dictionary
        }
    }
    stringLengthIncrementer++; //To increase substring length with each iteration
}
I use data.Length / 2 because I only need combinations at most half the length of the entire string. Note that I search the entire string for combinations, not half of it.
To further simplify what I am trying to do -> Suppose I have an input string =
"abcd"
the output would be =
a, b, c, d, ab, bc, cd. The rest is cut out as it is longer than half the length of my primary string -> //abc, bcd, abcd
I was hoping some regex method might help me achieve this. Anything that doesn't consist of loops? Anything that is exponentially faster than this? Some simple code with less complexity that is more efficient?
Update
When I used a HashSet instead of List<string> for my dictionary, I did not experience any change in performance, and I also got an OutOfMemoryException.
You can use LINQ to simplify the code and very easily parallelize it, but it's not going to be orders of magnitude faster, which is what you'd need to run it on files of hundreds of MBs (that's very likely impossible).
var data = File.ReadAllText("MyFilePath");
var result = Enumerable.Range(1, data.Length / 2)
    .AsParallel()
    .Select(len => new HashSet<string>(
        Enumerable.Range(0, data.Length - len + 1) //Adding the +1 here made it work perfectly
            .Select(x => data.Substring(x, len))))
    .SelectMany(t => t)
    .ToList();
General improvements that you can make in your code to improve performance (I won't consider whether there are other, more optimal solutions); a rough sketch applying some of these follows the list:
calculate data.Substring(x, stringLengthIncrementer) only once per inner iteration
as you do repeated searches, use a SortedList; it will be faster
initialize the List (or SortedList, or whatever) with a calculated number of items, like new List(calculatedCapacity)
or you can try to write an algorithm that produces combinations without checking for duplicates.
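For illustration, a rough sketch of a HashSet variant applying the first and third points (the capacity constructor needs .NET Core 2.1+, and the capacity here is only a guess):
int estimatedCapacity = Math.Min(data.Length * 64, 1000000); // rough upper-bound guess
var seen = new HashSet<string>(estimatedCapacity);
for (int len = 1; len <= data.Length / 2; len++)
{
    for (int x = 0; x + len <= data.Length; x++)
    {
        string sub = data.Substring(x, len); // computed once per candidate
        seen.Add(sub);                       // Add already skips duplicates; no Contains needed
    }
}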
You may be able to use HashSet combined with MoreLINQ's Batch feature (available on NuGet) to simplify the code a little.
public static void Main()
{
    string data = File.ReadAllText("MyFilePath");
    //string data = "Afnan is awesome";
    var dictionary = new HashSet<string>();
    for (var stringLengthIncrementer = 1; stringLengthIncrementer <= (data.Length / 2); stringLengthIncrementer++)
    {
        foreach (var skipper in Enumerable.Range(0, stringLengthIncrementer))
        {
            var batched = data.Skip(skipper).Batch(stringLengthIncrementer);
            foreach (var batch in batched)
            {
                dictionary.Add(new string(batch.ToArray()));
            }
        }
    }
    foreach (var z in dictionary) // HashSet has no ForEach; enumerate directly
    {
        Console.WriteLine(z);
    }
    Console.ReadLine();
}
For this input:
"Afnan is awesome askdjkhaksjhd askjdhaksjsdhkajd asjsdhkajshdkjahsd asksdhkajshdkjashd aksjdhkajsshd98987ad asdhkajsshd98xcx98asdjaksjsd askjdakjshcc98z98asdsad"
performance is roughly 10x faster than your current code.

Median Maintenance Algorithm - Same implementation yields different results depending on Int32 or Int64

I found something interesting while doing a HW question.
The homework question asks to code the Median Maintenance algorithm.
The formal statement is as follows:
The goal of this problem is to implement the "Median Maintenance" algorithm (covered in the Week 5 lecture on heap applications). The text file contains a list of the integers from 1 to 10000 in unsorted order; you should treat this as a stream of numbers, arriving one by one. Letting xi denote the ith number of the file, the kth median mk is defined as the median of the numbers x1,…,xk. (So, if k is odd, then mk is the ((k+1)/2)th smallest number among x1,…,xk; if k is even, then mk is the (k/2)th smallest number among x1,…,xk.)
In order to get good running time, this should obviously be implemented using heaps (see the sketch after the steps below). Anyway, I coded this using brute force (the deadline was too soon and I needed an answer right away), which is O(n²), with the following steps:
Read data in
Sort array
Find Median
Add it to the running sum
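(For reference, a minimal sketch of the heap-based approach the assignment intends, assuming .NET 6's PriorityQueue<TElement, TPriority> and an int sequence named numbers:)
var low = new PriorityQueue<int, int>();   // max-heap of the smaller half, via negated priorities
var high = new PriorityQueue<int, int>();  // min-heap of the larger half
long sumOfMedians = 0;
foreach (var x in numbers)
{
    if (low.Count == 0 || x <= low.Peek())
        low.Enqueue(x, -x);
    else
        high.Enqueue(x, x);
    // rebalance so low always holds ceil(k/2) elements
    if (low.Count > high.Count + 1)
    {
        var moved = low.Dequeue();
        high.Enqueue(moved, moved);
    }
    else if (high.Count > low.Count)
    {
        var moved = high.Dequeue();
        low.Enqueue(moved, -moved);
    }
    sumOfMedians += low.Peek(); // the kth median per the assignment's definition
}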
I ran the algorithm through several test cases (with a known answer) and got the correct results; however, when I ran the same algorithm on a larger data set I got the wrong answer. I was doing all the operations using Int64 to represent the data.
Then I tried switching to Int32 and magically got the correct answer, which makes no sense to me.
The code is below, and it is also found here (the data is in the repo). The algorithm starts to give erroneous results after the 3810 index:
private static void Main(string[] args)
{
    MedianMaintenance("Question2.txt");
}

private static void MedianMaintenance(string filename)
{
    var txtData = File.ReadLines(filename).ToArray();
    var inputData32 = new List<Int32>();
    var medians32 = new List<Int32>();
    var sums32 = new List<Int32>();
    var inputData64 = new List<Int64>();
    var medians64 = new List<Int64>();
    var sums64 = new List<Int64>();
    var sum = 0;
    var sum64 = 0f;
    var i = 0;
    foreach (var s in txtData)
    {
        //Add to sorted list
        var intToAdd = Convert.ToInt32(s);
        inputData32.Add(intToAdd);
        inputData64.Add(Convert.ToInt64(s));
        //Compute sum
        var count = inputData32.Count;
        inputData32.Sort();
        inputData64.Sort();
        var index = 0;
        if (count % 2 == 0)
        {
            //Even number of elements
            index = count / 2 - 1;
        }
        else
        {
            //Number is odd
            index = ((count + 1) / 2) - 1;
        }
        var val32 = Convert.ToInt32(inputData32[index]);
        var val64 = Convert.ToInt64(inputData64[index]);
        if (i > 3810)
        {
            var t = sum;
            var t1 = sum + val32;
        }
        medians32.Add(val32);
        medians64.Add(val64);
        //Debug.WriteLine("Median is {0}", val);
        sum += val32;
        sums32.Add(Convert.ToInt32(sum));
        sum64 += val64;
        sums64.Add(Convert.ToInt64(sum64));
        i++;
    }
    Console.WriteLine("Median Maintenance result is {0}", (sum).ToString("N"));
    Console.WriteLine("Median Maintenance result is {0}", (medians32.Sum()).ToString("N"));
    Console.WriteLine("Median Maintenance result is {0} - Int64", (sum64).ToString("N"));
    Console.WriteLine("Median Maintenance result is {0} - Int64", (medians64.Sum()).ToString("N"));
}
What's more interesting is that the running sum (in the sum64 variable) yields a different result than summing all items in the list with LINQ's Sum() function.
The results (the third one is the one that's wrong):
These are the computer details:
I'd appreciate it if someone could give me some insight into why this is happening.
Thanks,
0f initializes a 32-bit float variable; you meant 0d or 0.0 to get a 64-bit floating point.
As for linq, you'll probably get better results if you use strongly typed lists.
new List<int>()
new List<long>()
The first thing I notice is what the commenter did: var sum64 = 0f initializes sum64 as a float. As the median value of a collection of Int64s will itself be an Int64 (the specified rules don't use the mean of the two midpoint values in a collection of even cardinality), you should instead declare this variable explicitly as a long. In fact, I would go ahead and replace all usages of var in this code example; the convenience of var is being lost here by causing type-related bugs.
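A minimal demonstration of the drift (illustrative values, not the actual data from the question):
float f = 0f;   // 24-bit mantissa: only ~7 significant decimal digits
long l = 0L;
for (int i = 1; i <= 10000; i++)
{
    f += i;     // once the total passes 2^24 (~16.7M), odd addends get rounded
    l += i;
}
Console.WriteLine(f); // drifts away from the true total
Console.WriteLine(l); // exactly 50,005,000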

Get unique 3 digit zips from a list C#

I have a list of integers that represent US ZIP codes, and I want to get unique values based on the first three digits of the ZIP code. For example this is my list:
10433
30549
10456
54933
60594
30569
30659
My result should contain only:
10433
30549
54933
60594
30659
The US ZIP codes excluded from my list are 10456 and 30569, because I already have ZIPs that start with 104xx and 305xx.
I really don't know how to get this done. I guess it's not that hard, but I have no idea. I made a function that keeps the unique first three digits and appends two random digits to the end of each ZIP, but that didn't work out: I got, for example, 10423, but 10423 is not in my list, and my numbers don't follow any pattern that puts the last two digits in a predictable range.
A little Linq should work. If using a list of ints:
var zips = new[] { 10433, 30549, 10456, 54933, 60594, 30569, 30659 };
var results = zips.GroupBy(z => z / 100).Select(g => g.First());
Or if using a list of strings:
var zips = new[] { "10433", "30549", "10456", "54933", "60594", "30569", "30659" };
var results = zips.GroupBy(z => z.Remove(3)).Select(g => g.First());
Another solution would be to use a custom IEqualityComparer<T>. For ints:
class ZipComparer : IEqualityComparer<int> {
    public bool Equals(int x, int y) {
        return x / 100 == y / 100;
    }
    public int GetHashCode(int x) {
        return x / 100;
    }
}
For strings:
class ZipComparer : IEqualityComparer<string> {
    public bool Equals(string x, string y) {
        return x.Remove(3) == y.Remove(3);
    }
    public int GetHashCode(string x) {
        return x.Remove(3).GetHashCode();
    }
}
Then to use it, you can simply call Distinct:
var result = zips.Distinct(new ZipComparer());
Finally, you can also use MoreLINQ's DistinctBy extension method (also available on NuGet):
var results = zips.DistinctBy(z => z / 100);
// or
var results = zips.DistinctBy(z => z.Remove(3));
The popular answer here gives code, but doesn't really solve the problem. The problem is that you don't seem to have an algorithm. So...
How would you solve this problem on paper?
I would imagine the process would be something like this:
For each number, determine the 3 digit prefix
If you don't already have a zip with that prefix, keep the number
If you do already have a zip with that prefix, discard the number
How would you write this in code?
Well, there's a couple things you need:
A bucket to keep track of what prefixes you have found, and which values you've kept.
A loop over all of the items
A way to determine the prefix
Here's one way to write this (you can convert it to using strings instead as an exercise):
ICollection<int> GetUniqueZipcodes(int[] zips)
{
    Dictionary<int, int> bucket = new Dictionary<int, int>();
    foreach (var zip in zips)
    {
        int prefix = GetPrefix(zip);
        if (!bucket.ContainsKey(prefix))
        {
            bucket.Add(prefix, zip);
        }
    }
    return bucket.Values;
}

int GetPrefix(int zip)
{
    return zip / 100;
}
Getting concise
Now, many programmers these days would say "OMG so many lines of code this could be a one liner". And they're right, it could. Borrowing from p.s.w.g's answer, this can be condensed (in a very readable manner) to:
var results = zips.GroupBy(z => z / 100).Select(g => g.First());
So how does this work? zips.GroupBy(z => z/100) does the bucketing operation. You end up with a collection of groups that looks like this:
{
    104: { 10433, 10456 },
    305: { 30549, 30569 },
    549: { 54933 },
    605: { 60594 },
    306: { 30659 }
}
Then we use .Select(g => g.First()), which takes the first item from each group.

What's the most concise way to pick a random element by weight in c#?

Let's assume a List<Element>, where Element is:
public class Element {
    int Weight { get; set; }
}
What I want to achieve is to select an element randomly, by weight.
For example:
Element_1.Weight = 100;
Element_2.Weight = 50;
Element_3.Weight = 200;
So
the chance Element_1 got selected is 100/(100+50+200)=28.57%
the chance Element_2 got selected is 50/(100+50+200)=14.29%
the chance Element_3 got selected is 200/(100+50+200)=57.14%
I know I can create a loop, calculate total, etc...
What I want to learn is: what's the best way to do this with LINQ in ONE line (or as short as possible)? Thanks.
UPDATE
I found my answer below. The first thing I learned is: LINQ is NOT magic; it's slower than a well-designed loop.
So my question becomes: find a random element by weight (without the "as short as possible" stuff :)
If you want a generic version (useful with a (singleton) randomizer helper; consider whether you need a constant seed or not):
usage:
randomizer.GetRandomItem(items, x => x.Weight)
code:
public T GetRandomItem<T>(IEnumerable<T> itemsEnumerable, Func<T, int> weightKey)
{
    var items = itemsEnumerable.ToList();
    var totalWeight = items.Sum(x => weightKey(x));
    var randomWeightedIndex = _random.Next(totalWeight);
    var itemWeightedIndex = 0;
    foreach (var item in items)
    {
        itemWeightedIndex += weightKey(item);
        if (randomWeightedIndex < itemWeightedIndex)
            return item;
    }
    throw new ArgumentException("Collection count and weights must be greater than 0");
}
// assuming rnd is an already instantiated instance of the Random class
var max = list.Sum(y => y.Weight);
var rand = rnd.Next(max);
var res = list
    .FirstOrDefault(x => rand >= (max -= x.Weight)); // max shrinks as we scan; the first element
                                                     // whose cumulative slice contains rand wins
This is a fast solution with precomputation. The precomputation takes O(n), the search O(log(n)).
Precompute:
int[] lookup = new int[elements.Length];
lookup[0] = elements[0].Weight - 1;
for (int i = 1; i < lookup.Length; i++)
{
    lookup[i] = lookup[i - 1] + elements[i].Weight;
}
To generate:
int total = lookup[lookup.Length - 1];
int chosen = random.Next(total); // Random has Next, not GetNext
int index = Array.BinarySearch(lookup, chosen);
if (index < 0)
    index = ~index;
return elements[index];
But if the list changes between each search, you can instead use a simple O(n) linear search:
int total = elements.Sum(e => e.Weight);
int chosen = random.Next(total);
int runningSum = 0;
foreach (var element in elements)
{
    runningSum += element.Weight;
    if (chosen < runningSum)
        return element;
}
This could work:
int weightsSum = list.Sum(element => element.Weight);
int start = 1;
var partitions = list.Select(element =>
{
    var oldStart = start;
    start += element.Weight;
    return new { Element = element, End = oldStart + element.Weight - 1 };
});
var randomWeight = random.Next(weightsSum);
var randomElement = partitions
    .First(partition => partition.End > randomWeight)
    .Element;
Basically, for each element a partition is created with an end weight.
In your example, Element1 would be associated to (1-->100), Element2 to (101-->150), and so on...
Then a random weight sum is calculated and we look for the range which is associated to it.
You could also compute the sum in the method group but that would introduce another side effect...
Note that I'm not saying this is elegant or fast, but it does use LINQ (not in one line...)
.NET 6 introduced .MaxBy, making this much easier.
This could now be simplified to the following one-liner:
list.MaxBy(x => rng.Next(x.Weight));
This works best if the weights are large or floating-point numbers; otherwise there will be collisions, which can be prevented by multiplying the weight by some factor.
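For example, a sketch of that mitigation (same hypothetical rng and list as above; the factor of 1000 is arbitrary):
var winner = list.MaxBy(x => rng.Next(x.Weight * 1000)); // scaling reduces ties between small integer weights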

Faster way to do a List<T>.Contains()

I am trying to do what I think is a "de-intersect" (I'm not sure what the proper name is, but that's what Tim Sweeney of EpicGames called it in the old UnrealEd)
// foo and bar have some identical elements (given a case-insensitive match)
List<string> foo = GetFoo();
List<string> bar = GetBar();
// remove non matches
foo = foo.Where(x => bar.Contains(x, StringComparer.InvariantCultureIgnoreCase)).ToList();
bar = bar.Where(x => foo.Contains(x, StringComparer.InvariantCultureIgnoreCase)).ToList();
Then later on, I do another thing where I subtract the result from the original, to see which elements I removed. That's super-fast using .Except(), so no troubles there.
There must be a faster way to do this, because this one is pretty bad-performing with ~30,000 elements (of string) in either List. Preferably, a method to do this step and the one later on in one fell swoop would be nice. I tried using .Exists() instead of .Contains(), but it's slightly slower. I feel a bit thick, but I think it should be possible with some combination of .Except() and .Intersect() and/or .Union().
This operation can be called a symmetric difference.
You need a different data structure, like a hash table. Add the intersection of both sets to it, then difference the intersection from each set.
UPDATE:
I got a bit of time to try this in code. I used HashSet<T> with a set of 50,000 strings, from 2 to 10 characters long with the following results:
Original: 79499 ms
Hashset: 33 ms
BTW, there is a method on HashSet called SymmetricExceptWith which I thought would do the work for me, but it actually adds the different elements from both sets to the set the method is called on. Maybe this is what you want, rather than leaving the initial two sets unmodified, and the code would be more elegant.
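For example, a quick sketch (assuming the same foo and bar lists and a case-insensitive comparer):
var sym = new HashSet<string>(foo, StringComparer.InvariantCultureIgnoreCase);
sym.SymmetricExceptWith(bar); // sym now holds the elements present in exactly one of the two lists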
Here is the code:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        // foo and bar have some identical elements (given a case-insensitive match)
        var foo = getRandomStrings();
        var bar = getRandomStrings();
        var timer = new Stopwatch();
        timer.Start();
        // remove non matches
        var f = foo.Where(x => !bar.Contains(x)).ToList();
        var b = bar.Where(x => !foo.Contains(x)).ToList();
        timer.Stop();
        Debug.WriteLine(String.Format("Original: {0} ms", timer.ElapsedMilliseconds));
        timer.Reset();
        timer.Start();
        var intersect = new HashSet<String>(foo);
        intersect.IntersectWith(bar);
        var fSet = new HashSet<String>(foo);
        var bSet = new HashSet<String>(bar);
        fSet.ExceptWith(intersect);
        bSet.ExceptWith(intersect);
        timer.Stop();
        var fCheck = new HashSet<String>(f);
        var bCheck = new HashSet<String>(b);
        Debug.WriteLine(String.Format("Hashset: {0} ms", timer.ElapsedMilliseconds));
        Console.WriteLine("Sets equal? {0} {1}", fSet.SetEquals(fCheck), bSet.SetEquals(bCheck));
        Console.ReadKey();
    }

    static Random _rnd = new Random();
    private const int Count = 50000;

    private static List<string> getRandomStrings()
    {
        var strings = new List<String>(Count);
        var chars = new Char[10];
        for (var i = 0; i < Count; i++)
        {
            var len = _rnd.Next(2, 10);
            for (var j = 0; j < len; j++)
            {
                var c = (Char)_rnd.Next('a', 'z');
                chars[j] = c;
            }
            strings.Add(new String(chars, 0, len));
        }
        return strings;
    }
}
With Intersect it would be done like this:
var matches = foo.Intersect(bar, StringComparer.InvariantCultureIgnoreCase);
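The complements (the elements each list would lose) can be obtained the same way with Except and the comparer; a sketch (note that Except also de-duplicates within each list):
var fOnly = foo.Except(bar, StringComparer.InvariantCultureIgnoreCase).ToList();
var bOnly = bar.Except(foo, StringComparer.InvariantCultureIgnoreCase).ToList();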
If the elements are unique within each list you should consider using a HashSet<T>:
The HashSet(T) class provides high performance set operations. A set is a collection that contains no duplicate elements, and whose elements are in no particular order.
With a sorted list, you can use binary search.
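For example, a sketch (assumes foo is a List<string> that may be sorted in place, and candidate is whatever string you're testing):
foo.Sort(StringComparer.InvariantCultureIgnoreCase);  // O(N log N), once
bool inFoo = foo.BinarySearch(candidate, StringComparer.InvariantCultureIgnoreCase) >= 0; // O(log N) per lookup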
Contains on a list is an O(N) operation. If you had a different data structure, such as a sorted list or a Dictionary, you would dramatically reduce your time. Accessing a key in a sorted list is usually O(log N) time, and in a hash is usually O(1) time.
