Efficient storage, lookup and manipulation of a large undirected graph - C#

I have an undirected graph G with about 5000 nodes. Any pair of nodes may be connected by an edge. Length, direction and other features of the edge are irrelevant, and there can be at most one edge between two points, so the relationships between nodes are binary. Thus, there are 5000 × 4999 / 2 = 12,497,500 potential edges in total.
Each node is identified by a string name, not a number.
I would like to store such a graph (loaded as input data into my program) but I'm not sure what kind of data structure is best.
I will need to look up whether a given pair of nodes is connected or not, many, many times, so lookup performance is probably the primary concern.
Performance cost of adding and removing elements is not a concern.
I would also like to keep the syntax simple and elegant if possible, to reduce the possibility of introducing bugs and make debugging easier.
Two possibilities:
bool[numNodes, numNodes] and a Dictionary<string, int> to match each node name to an index. Pros: simple, fast lookup. Cons: can't easily remove or add nodes (would have to add/remove rows/columns), redundant (have to be careful about g[n1, n2] vs g[n2, n1]), clumsy syntax because I have to go through the Dictionary every time.
HashSet<HashSet<string>>. Pros: Intuitive, nodes directly identified by strings, easy to "add/remove nodes" because only edges are stored and nodes themselves are implicit. Cons: Possible to enter garbage input (edges that "connect" three nodes because the set has three members).
Regarding the second option, I am also unclear about a number of things:
Is it going to take much more memory than an array of bool?
Are two .NET sets equivalent to mathematical sets, in the sense that they are equal if and only if they have exactly the same members (as opposed to being distinguished by capacity, insertion order and so on), for purposes of HashSet membership? That is, will querying with outerSets.Contains(new HashSet<string> { "node1", "node2" }) actually work?
Is lookup going to take much longer than an array of bool?
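For reference, a minimal sketch of what I mean by "actually going to work". From what I can tell, this requires constructing the outer set with HashSet<string>.CreateSetComparer(), since HashSet<T> itself does not override Equals/GetHashCode and so compares by reference:

// By default, inner sets are compared by reference, so Contains would fail.
// HashSet<string>.CreateSetComparer() compares sets by their members instead.
var outerSets = new HashSet<HashSet<string>>(HashSet<string>.CreateSetComparer());
outerSets.Add(new HashSet<string> { "node1", "node2" });
bool found = outerSets.Contains(new HashSet<string> { "node2", "node1" }); // true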

I was curious about using string concatenation vs. a Tuple when generating a key representing an edge in a hashtable in order to approach O(1) lookup performance. Two possibilities here for handling the undirected edge requirement:
Normalize the key so that it is the same no matter which node is specified first in the description of the edge. In my test, I simply take the node that compares lower ordinally as the first component of the key.
Make two entries in the hashtable, one for each direction of the edge.
A critical assumption here is that the string node identifiers are not very long, so that key normalization is inexpensive relative to the lookup.
The string-concatenation and Tuple versions with key normalization seem to perform about the same: each completed about 2 million random lookups in about 3 seconds in a VirtualBox VM in release mode.
To see if the key normalization was swamping the effect of the lookup operation, a third implementation does no key normalization, but maintains symmetric entries with respect to both possible directions of an edge. This seems to be about 30-40% slower on lookups, which was slightly unexpected (to me). Perhaps the underlying hash table buckets have higher average occupancy due to having twice the number of elements, requiring longer linear searches within each hash bucket (on average)?
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;

interface IEdgeCollection
{
    bool AddEdge(string node1, string node2);
    bool ContainsEdge(string node1, string node2);
    bool RemoveEdge(string node1, string node2);
}

class EdgeSet1 : IEdgeCollection
{
    private HashSet<string> _edges = new HashSet<string>();

    private static string MakeEdgeKey(string node1, string node2)
    {
        // the separator prevents collisions such as ("ab","c") vs ("a","bc");
        // the generated node names below never contain '\n'
        return StringComparer.Ordinal.Compare(node1, node2) < 0
            ? node1 + "\n" + node2
            : node2 + "\n" + node1;
    }

    public bool AddEdge(string node1, string node2)
    {
        var key = MakeEdgeKey(node1, node2);
        return _edges.Add(key);
    }

    public bool ContainsEdge(string node1, string node2)
    {
        var key = MakeEdgeKey(node1, node2);
        return _edges.Contains(key);
    }

    public bool RemoveEdge(string node1, string node2)
    {
        var key = MakeEdgeKey(node1, node2);
        return _edges.Remove(key);
    }
}
class EdgeSet2 : IEdgeCollection
{
    private HashSet<Tuple<string, string>> _edges = new HashSet<Tuple<string, string>>();

    private static Tuple<string, string> MakeEdgeKey(string node1, string node2)
    {
        return StringComparer.Ordinal.Compare(node1, node2) < 0
            ? new Tuple<string, string>(node1, node2)
            : new Tuple<string, string>(node2, node1);
    }

    public bool AddEdge(string node1, string node2)
    {
        var key = MakeEdgeKey(node1, node2);
        return _edges.Add(key);
    }

    public bool ContainsEdge(string node1, string node2)
    {
        var key = MakeEdgeKey(node1, node2);
        return _edges.Contains(key);
    }

    public bool RemoveEdge(string node1, string node2)
    {
        var key = MakeEdgeKey(node1, node2);
        return _edges.Remove(key);
    }
}
class EdgeSet3 : IEdgeCollection
{
    private HashSet<Tuple<string, string>> _edges = new HashSet<Tuple<string, string>>();

    private static Tuple<string, string> MakeEdgeKey(string node1, string node2)
    {
        return new Tuple<string, string>(node1, node2);
    }

    public bool AddEdge(string node1, string node2)
    {
        var key1 = MakeEdgeKey(node1, node2);
        var key2 = MakeEdgeKey(node2, node1);
        // non-short-circuiting & so the second entry is always added/removed,
        // even when the first operation returns false (e.g. for a self-loop)
        return _edges.Add(key1) & _edges.Add(key2);
    }

    public bool ContainsEdge(string node1, string node2)
    {
        var key = MakeEdgeKey(node1, node2);
        return _edges.Contains(key);
    }

    public bool RemoveEdge(string node1, string node2)
    {
        var key1 = MakeEdgeKey(node1, node2);
        var key2 = MakeEdgeKey(node2, node1);
        return _edges.Remove(key1) & _edges.Remove(key2);
    }
}
class Program
{
    static void Test(string[] nodes, IEdgeCollection edges, int edgeCount)
    {
        // use edgeCount as seed to rng to ensure test reproducibility
        var rng = new Random(edgeCount);
        // store known edges in a separate data structure for validation
        var edgeList = new List<Tuple<string, string>>();
        Stopwatch stopwatch = new Stopwatch();
        // randomly generated edges
        stopwatch.Start();
        for (int i = 0; i < edgeCount; i++)
        {
            string node1 = nodes[rng.Next(nodes.Length)];
            string node2 = nodes[rng.Next(nodes.Length)];
            edges.AddEdge(node1, node2);
            edgeList.Add(new Tuple<string, string>(node1, node2));
        }
        var addElapsed = stopwatch.Elapsed;
        // non-random lookups
        int nonRandomFound = 0;
        // Restart rather than Start, so each phase is timed on its own;
        // Start on an already-running stopwatch would accumulate earlier phases
        stopwatch.Restart();
        foreach (var edge in edgeList)
        {
            if (edges.ContainsEdge(edge.Item1, edge.Item2))
                nonRandomFound++;
        }
        var nonRandomLookupElapsed = stopwatch.Elapsed;
        if (nonRandomFound != edgeList.Count)
        {
            Console.WriteLine("The edge collection {0} is not working right!", edges.GetType().FullName);
            return;
        }
        // random lookups
        int randomFound = 0;
        stopwatch.Restart();
        for (int i = 0; i < edgeCount; i++)
        {
            string node1 = nodes[rng.Next(nodes.Length)];
            string node2 = nodes[rng.Next(nodes.Length)];
            if (edges.ContainsEdge(node1, node2))
                randomFound++;
        }
        var randomLookupElapsed = stopwatch.Elapsed;
        // remove all
        stopwatch.Restart();
        foreach (var edge in edgeList)
        {
            edges.RemoveEdge(edge.Item1, edge.Item2);
        }
        var removeElapsed = stopwatch.Elapsed;
        Console.WriteLine("Test: {0} with {1} edges: {2}s addition, {3}s non-random lookup, {4}s random lookup, {5}s removal",
            edges.GetType().FullName,
            edgeCount,
            addElapsed.TotalSeconds,
            nonRandomLookupElapsed.TotalSeconds,
            randomLookupElapsed.TotalSeconds,
            removeElapsed.TotalSeconds);
    }
    static void Main(string[] args)
    {
        var rng = new Random();
        var nodes = new string[5000];
        for (int i = 0; i < nodes.Length; i++)
        {
            StringBuilder name = new StringBuilder();
            int length = rng.Next(7, 15);
            for (int j = 0; j < length; j++)
            {
                name.Append((char)rng.Next(32, 127));
            }
            nodes[i] = name.ToString();
        }
        IEdgeCollection edges1 = new EdgeSet1();
        IEdgeCollection edges2 = new EdgeSet2();
        IEdgeCollection edges3 = new EdgeSet3();
        Test(nodes, edges1, 2000000);
        Test(nodes, edges2, 2000000);
        Test(nodes, edges3, 2000000);
        Console.ReadLine();
    }
}

The C5 Collections Library has some useful stuff regarding graphs
http://www.itu.dk/research/c5/
This question, Most efficient implementation for a complete undirected graph, looks useful as well.
SortedDictionary is, under the hood, a height-balanced red-black tree, so lookups are O(log n).

Related

How to return tvalues from dictionary C#

I've got a dictionary, and I need to return the TValue based on a search from a string array.
How do I return the TValue for the TKey matching the string I'm searching for, plus the TValue for the entry directly after it (to calculate the length of the entry.... this is so I can then access the original file and import the data)?
parameters = a string array to be found.
The dictionary (dict) is set up as TKey = name, TValue = bytes from the start of the file.
input[2] is the file with all the info in it.
foreach (var p in parameters)
{
    if (dict.ContainsKey(p))
    {
        int posstart = //tvalue of the parameter found;
        int posfinish = //tvalue of next entry;
        using (FileStream fs = new FileStream(input[2], FileMode.Open, FileAccess.Read))
        {
            byte[] bytes = //posstart to posfinish
            System.Console.WriteLine(Encoding.Default.GetString(bytes));
        }
    }
    else
    {
        Console.WriteLine($"error, {p} not found");
    }
}
Any help is welcome. Thank you in advance.
The key problem here is this comment:
// tvalue of next entry
In C#, dictionaries are not ordered, so there is no "next entry". (A SortedDictionary sorts on key, not value, so that's no help to you. An OrderedDictionary might be what you want, but let's assume that you have a Dictionary in hand and solve the problem from that point.)
Let's transform your unsorted name -> offset data structure into a better data structure.
Suppose we start with this:
// name -> offset
var dict = new Dictionary<string, int>() {
    { "foo", 100 }, { "bar", 40 }, { "blah", 200 } };
We'll sort the dictionary by value and put the offsets into this sorted list:
// index -> offset
var list = new List<int>();
And then we'll make a new dictionary that maps names to indexes into this list:
// name -> index
var newDict = new Dictionary<string, int>();
Let's build the list and new dictionary from the old one:
foreach (var pair in dict.OrderBy(pair => pair.Value))
{
    newDict[pair.Key] = list.Count;
    list.Add(pair.Value);
}
We also need the offset of the last byte as the last thing in the list:
list.Add(theOffsetOfTheLastByte);
And now we can do a two-step lookup to get the offset and the next offset. First we look up the index by name, and then the offset by index:
int fooIndex = newDict["foo"]; // 1
int fooOffset = list[fooIndex]; // 100
int nextIndex = fooIndex + 1;
int nextOffset = list[nextIndex]; // 200
Make sense?
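Putting those pieces together into one runnable sketch (the 250 below is an assumed stand-in for the real offset of the last byte, i.e. the file length):

using System;
using System.Collections.Generic;
using System.Linq;

class OffsetLookupDemo
{
    static void Main()
    {
        // name -> offset, unordered
        var dict = new Dictionary<string, int>() {
            { "foo", 100 }, { "bar", 40 }, { "blah", 200 } };
        int theOffsetOfTheLastByte = 250; // assumption: in real code, the file length

        var list = new List<int>();                  // index -> offset
        var newDict = new Dictionary<string, int>(); // name -> index
        foreach (var pair in dict.OrderBy(pair => pair.Value))
        {
            newDict[pair.Key] = list.Count;
            list.Add(pair.Value);
        }
        list.Add(theOffsetOfTheLastByte);

        int fooIndex = newDict["foo"];       // 1
        int fooOffset = list[fooIndex];      // 100
        int nextOffset = list[fooIndex + 1]; // 200
        Console.WriteLine("foo occupies bytes {0} to {1}", fooOffset, nextOffset);
    }
}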

fastest starts with search algorithm

I need to implement a search algorithm which only searches from the start of the string rather than anywhere within the string.
I am new to algorithms but from what I can see it seems as though they go through the string and find any occurrence.
I have a collection of strings (over 1 million) which needs to be searched every time the user types a keystroke.
EDIT:
This will be an incremental search. I currently have it implemented with the following code, and my searches are coming back in 300-700 ms from over 1 million possible strings. The collection isn't ordered, but there is no reason it couldn't be.
private ICollection<string> SearchCities(string searchString) {
    return _cityDataSource.AsParallel().Where(x => x.ToLower().StartsWith(searchString)).ToArray();
}
I've adapted the code from this article from Visual Studio Magazine that implements a Trie.
The following program demonstrates how to use a Trie to do fast prefix searching.
In order to run this program, you will need a text file called "words.txt" with a large list of words. You can download one from Github here.
After you compile the program, copy the "words.txt" file into the same folder as the executable.
When you run the program, type a prefix (such as prefix ;)) and press return, and it will list all the words beginning with that prefix.
This should be a very fast lookup - see the Visual Studio Magazine article for more details!
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

namespace ConsoleApp1
{
    class Program
    {
        static void Main()
        {
            var trie = new Trie();
            trie.InsertRange(File.ReadLines("words.txt"));
            Console.WriteLine("Type a prefix and press return.");
            while (true)
            {
                string prefix = Console.ReadLine();
                if (string.IsNullOrEmpty(prefix))
                    continue;
                var node = trie.Prefix(prefix);
                if (node.Depth == prefix.Length)
                {
                    foreach (var suffix in suffixes(node))
                        Console.WriteLine(prefix + suffix);
                }
                else
                {
                    Console.WriteLine("Prefix not found.");
                }
                Console.WriteLine();
            }
        }

        static IEnumerable<string> suffixes(Node parent)
        {
            var sb = new StringBuilder();
            return suffixes(parent, sb).Select(suffix => suffix.TrimEnd('$'));
        }

        static IEnumerable<string> suffixes(Node parent, StringBuilder current)
        {
            if (parent.IsLeaf())
            {
                yield return current.ToString();
            }
            else
            {
                foreach (var child in parent.Children)
                {
                    current.Append(child.Value);
                    foreach (var value in suffixes(child, current))
                        yield return value;
                    --current.Length;
                }
            }
        }
    }
    public class Node
    {
        public char Value { get; set; }
        public List<Node> Children { get; set; }
        public Node Parent { get; set; }
        public int Depth { get; set; }

        public Node(char value, int depth, Node parent)
        {
            Value = value;
            Children = new List<Node>();
            Depth = depth;
            Parent = parent;
        }

        public bool IsLeaf()
        {
            return Children.Count == 0;
        }

        public Node FindChildNode(char c)
        {
            return Children.FirstOrDefault(child => child.Value == c);
        }

        public void DeleteChildNode(char c)
        {
            for (var i = 0; i < Children.Count; i++)
                if (Children[i].Value == c)
                    Children.RemoveAt(i);
        }
    }
    public class Trie
    {
        readonly Node _root;

        public Trie()
        {
            _root = new Node('^', 0, null);
        }

        public Node Prefix(string s)
        {
            var currentNode = _root;
            var result = currentNode;
            foreach (var c in s)
            {
                currentNode = currentNode.FindChildNode(c);
                if (currentNode == null)
                    break;
                result = currentNode;
            }
            return result;
        }

        public bool Search(string s)
        {
            var prefix = Prefix(s);
            return prefix.Depth == s.Length && prefix.FindChildNode('$') != null;
        }

        public void InsertRange(IEnumerable<string> items)
        {
            foreach (string item in items)
                Insert(item);
        }

        public void Insert(string s)
        {
            var commonPrefix = Prefix(s);
            var current = commonPrefix;
            for (var i = current.Depth; i < s.Length; i++)
            {
                var newNode = new Node(s[i], current.Depth + 1, current);
                current.Children.Add(newNode);
                current = newNode;
            }
            current.Children.Add(new Node('$', current.Depth + 1, current));
        }

        public void Delete(string s)
        {
            if (!Search(s))
                return;
            var node = Prefix(s).FindChildNode('$');
            // stop at the root (Parent == null) so deleting the last word doesn't throw
            while (node.IsLeaf() && node.Parent != null)
            {
                var parent = node.Parent;
                parent.DeleteChildNode(node.Value);
                node = parent;
            }
        }
    }
}
A couple of thoughts:
First, your million strings need to be ordered, so that you can "seek" to the first matching string and return strings until you no longer have a match... in order (seek via C# List<string>.BinarySearch, perhaps; see the sketch after this list). That's how you touch the fewest strings possible.
Second, you should probably not try to hit the string list until there's a pause in input of at least 500 ms (give or take).
Third, your queries into the vastness should be async and cancelable, because it's certainly going to be the case that one effort will be superseded by the next keystroke.
Finally, any subsequent query should first check that the new search string is an append of the most recent search string...so that you can begin your subsequent seek from the last seek (saving lots of time).
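A minimal sketch of that first point, assuming the collection is already sorted ordinally (and pre-lowered if you want case-insensitive matching); the method name is illustrative:

using System;
using System.Collections.Generic;

static IEnumerable<string> PrefixRange(List<string> sorted, string prefix)
{
    // BinarySearch returns the bitwise complement of the insertion point
    // when the prefix itself is not an element of the list
    int index = sorted.BinarySearch(prefix, StringComparer.Ordinal);
    if (index < 0)
        index = ~index; // first element >= prefix
    for (int i = index; i < sorted.Count && sorted[i].StartsWith(prefix, StringComparison.Ordinal); i++)
        yield return sorted[i];
}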
I suggest using LINQ.
string x = "searchterm";
List<string> y = new List<string>();
List<string> Matches = y.Where(xo => xo.StartsWith(x)).ToList();
Where x is your keystroke search text term, y is your collection of strings to search, and Matches is the matches from your collection.
I tested this with the first 1 million prime numbers, here is the code adapted from above:
Stopwatch SW = new Stopwatch();
SW.Start();
string x = "2";
List<string> y = System.IO.File.ReadAllText("primes1.txt").Split(' ').ToList();
y.RemoveAll(xo => xo == " " || xo == "" || xo == "\r\r\n");
List <string> Matches = y.Where(xo => xo.StartsWith(x)).ToList();
SW.Stop();
Console.WriteLine("matches: " + Matches.Count);
Console.WriteLine("time taken: " + SW.Elapsed.TotalSeconds);
Console.Read();
Result is:
matches: 77025
time taken: 0.4240604
Of course, this is testing against numbers, and I don't know whether LINQ converts the values beforehand, or whether numbers make any difference.

algorithm for selecting N random elements from a List<T> in C# [duplicate]

This question already has answers here:
Randomize a List<T>
(28 answers)
Closed 6 years ago.
I need a quick algorithm to select 4 random elements from a generic list. For example, I'd like to get 4 random elements from a List, run some calculations, and if the chosen elements turn out not to be valid, select the next 4 random elements from the list.
You could do it like this
public static class Extensions
{
    public static Dictionary<int, T> GetRandomElements<T>(this IList<T> list, int quantity)
    {
        var result = new Dictionary<int, T>();
        if (list == null)
            return result;
        Random rnd = new Random(DateTime.Now.Millisecond);
        // draw until 'quantity' distinct indexes have been collected; the indexer
        // (rather than Add) avoids an ArgumentException if an index is drawn twice
        while (result.Count < Math.Min(quantity, list.Count))
        {
            int idx = rnd.Next(0, list.Count);
            result[idx] = list[idx];
        }
        return result;
    }
}
Then use the extension method like this:
List<string> list = new List<string>() { "a", "b", "c", "d", "e", "f", "g", "h" };
Dictionary<int, string> randomElements = list.GetRandomElements(3);
foreach (KeyValuePair<int, string> elem in randomElements)
{
    Console.WriteLine($"index in original list: {elem.Key} value: {elem.Value}");
}
something like that:
using System;
using System.Collections.Generic;

public class Program
{
    public static void Main()
    {
        var list = new List<int>();
        list.Add(1);
        list.Add(2);
        list.Add(3);
        list.Add(4);
        list.Add(5);
        int n = 4;
        var rand = new Random();
        var randomObjects = new List<int>();
        for (int i = 0; i < n; i++)
        {
            var index = rand.Next(list.Count);
            randomObjects.Add(list[index]);
        }
    }
}
You can store indexes in some list to get non-repeated indexes:
List<T> GetRandomElements<T>(List<T> allElements, int randomCount = 4)
{
    if (allElements.Count < randomCount)
    {
        return allElements;
    }
    List<int> indexes = new List<int>();
    // use a HashSet if performance is very critical and you need a lot of indexes
    //HashSet<int> indexes = new HashSet<int>();
    List<T> elements = new List<T>();
    Random random = new Random();
    while (indexes.Count < randomCount)
    {
        int index = random.Next(allElements.Count);
        if (!indexes.Contains(index))
        {
            indexes.Add(index);
            elements.Add(allElements[index]);
        }
    }
    return elements;
}
Then you can do some calculation and call this method:
void Main(String[] args)
{
    do
    {
        List<int> elements = GetRandomElements(yourElements);
        // do some calculations
    } while (some condition); // while the result is not right
}
Suppose the length of the List is N, and suppose you will put these 4 chosen numbers into another List called out. As you loop through the original List, the probability that the element you are currently on should be chosen is:
(4 - out.Count) / (N - currentIndex)
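That formula is the classic selection-sampling idea (Knuth's Algorithm S). A minimal sketch, with illustrative names:

using System;
using System.Collections.Generic;

static class SelectionSampling
{
    // Picks n elements from list, preserving order; each n-subset is equally likely.
    public static List<T> Sample<T>(IList<T> list, int n, Random rng)
    {
        var chosen = new List<T>(n);
        for (int i = 0; i < list.Count && chosen.Count < n; i++)
        {
            // select with probability (n - already chosen) / (elements remaining)
            if (rng.NextDouble() * (list.Count - i) < n - chosen.Count)
                chosen.Add(list[i]);
        }
        return chosen;
    }
}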
function(list)
(
    loop i=0 i < 4
        index = (int) length(list) * random(0 -> 1)
        element[i] = list[index]
    return element
)
while (check == false)
(
    elements = function(list)
    Do some calculation which returns check == false/true
)
This is the pseudocode, but I think you should have come up with this yourself.
Hope it helps :)
All the answers up to now have one fundamental flaw: you are asking for an algorithm that will generate a random combination of n elements, and this combination, following some logic rules, will be valid or not. If it's not, a new combination should be produced. Obviously, this new combination should be one that has never been produced before. All the proposed algorithms fail to enforce this. If, out of 1,000,000 possible combinations, only one is valid, you might waste a whole lot of resources until that particular unique combination is produced.
So, how to solve this? Well, the answer is simple: create all possible unique solutions, and then simply produce them in a random order. Caveat: I will suppose that the input stream has no repeating elements; if it does, then some combinations will not be unique.
First of all, let's write ourselves a handy immutable stack:
using System;
using System.Collections;
using System.Collections.Generic;

class ImmutableStack<T> : IEnumerable<T>
{
    public static readonly ImmutableStack<T> Empty = new ImmutableStack<T>();
    private readonly T head;
    private readonly ImmutableStack<T> tail;

    public int Count { get; }

    private ImmutableStack()
    {
        Count = 0;
    }

    private ImmutableStack(T head, ImmutableStack<T> tail)
    {
        this.head = head;
        this.tail = tail;
        Count = tail.Count + 1;
    }

    public T Peek()
    {
        if (this == Empty)
            throw new InvalidOperationException("Can not peek an empty stack.");
        return head;
    }

    public ImmutableStack<T> Pop()
    {
        if (this == Empty)
            throw new InvalidOperationException("Can not pop an empty stack.");
        return tail;
    }

    public ImmutableStack<T> Push(T item) => new ImmutableStack<T>(item, this);

    public IEnumerator<T> GetEnumerator()
    {
        var current = this;
        while (current != Empty)
        {
            yield return current.head;
            current = current.tail;
        }
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
This will make our life easier while producing all combinations by recursion. Next, let's get the signature of our main method right:
public static IEnumerable<IEnumerable<T>> GetAllPossibleCombinationsInRandomOrder<T>(
    IEnumerable<T> data, int combinationLength)
Ok, that looks about right. Now let's implement this thing:
var allCombinations = GetAllPossibleCombinations(data, combinationLength).ToArray();
var rnd = new Random();
var producedIndexes = new HashSet<int>();
while (producedIndexes.Count < allCombinations.Length)
{
    while (true)
    {
        var index = rnd.Next(allCombinations.Length);
        if (!producedIndexes.Contains(index))
        {
            producedIndexes.Add(index);
            yield return allCombinations[index];
            break;
        }
    }
}
Ok, all we are doing here is producing random indexes, checking that we haven't produced each one yet (we use a HashSet<int> for this), and returning the combination at that index.
Simple, now we only need to take care of GetAllPossibleCombinations(data, combinationLength).
That's easy; we'll use recursion. Our bail-out condition is when the current combination reaches the specified length. Another caveat: I'm omitting argument validation throughout the whole code; things like checking for null, or that the specified length is not bigger than the input length, etc. should be taken care of.
Just for the fun, I'll be using some minor C#7 syntax here: nested functions.
public static IEnumerable<IEnumerable<T>> GetAllPossibleCombinations<T>(
    IEnumerable<T> stream, int length)
{
    return getAllCombinations(stream, ImmutableStack<T>.Empty);

    // note: no generic parameter on the local function; it would shadow the outer T
    IEnumerable<IEnumerable<T>> getAllCombinations(IEnumerable<T> currentData, ImmutableStack<T> combination)
    {
        if (combination.Count == length)
        {
            yield return combination;
            yield break; // a longer combination can never shrink back to the target length
        }
        foreach (var d in currentData)
        {
            var newCombination = combination.Push(d);
            foreach (var c in getAllCombinations(currentData.Except(new[] { d }), newCombination))
            {
                yield return c;
            }
        }
    }
}
And there we go, now we can use this:
var data = "abc";
var random = GetAllPossibleCombinationsInRandomOrder(data, 2);
foreach (var r in random)
{
    Console.WriteLine(string.Join("", r));
}
And sure enough, the output is:
bc
cb
ab
ac
ba
ca

Grouping by an unknown initial prefix

Say I have the following array of strings as an input:
foo-139875913
foo-aeuefhaiu
foo-95hw9ghes
barbazabejgoiagjaegioea
barbaz8gs98ghsgh9es8h
9a8efa098fea0
barbaza98fyae9fghaefag
bazfa90eufa0e9u
bazgeajga8ugae89u
bazguea9guae
aifeaufhiuafhe
There are 3 different prefixes used here, "foo-", "barbaz" and "baz" - however these prefixes are not known ahead of time (they could be something completely different).
How could you establish what the different common prefixes are so that they could then be grouped by? This is made a bit tricky because in the data I've provided there are two entries that start with "bazg" and one that starts with "bazf", where of course "baz" is the prefix.
What I've tried so far is sorting the entries into alphabetical order, then looping through them in order and counting how many leading characters are identical to the previous entry's. When the count changes, or when 0 characters are identical, a new group starts. The problem is that this falls over on the "bazg"/"bazf" case I mentioned earlier and separates those into two different groups (one with just one element in it).
Edit: Alright, let's throw a few more rules in:
Longer potential groups should generally be preferred over shorter ones, unless there is a closely matching group of less than X characters difference in length. (So where X is 2, baz would be preferred over bazg)
A group must have at least Y elements in it or not be a group at all
It's okay to simply throw away elements that don't match any of the 'groups' to within the rules above.
To clarify the first rule in relation to the second: if X was 0 and Y was 2, then the two 'bazg' entries would be in a group, and the 'bazf' would be thrown away because it's on its own.
Well, here's a quick hack, probably O(something_bad):
IEnumerable<Tuple<string, IEnumerable<string>>> GuessGroups(IEnumerable<string> source, int minNameLength = 0, int minGroupSize = 1)
{
    // TODO: error checking
    return InnerGuessGroups(new Stack<string>(source.OrderByDescending(x => x)), minNameLength, minGroupSize);
}

IEnumerable<Tuple<string, IEnumerable<string>>> InnerGuessGroups(Stack<string> source, int minNameLength, int minGroupSize)
{
    if (source.Any())
    {
        var tuple = ExtractTuple(GetBestGroup(source, minNameLength), source);
        if (tuple.Item2.Count() >= minGroupSize)
            yield return tuple;
        foreach (var element in GuessGroups(source, minNameLength, minGroupSize))
            yield return element;
    }
}

Tuple<string, IEnumerable<string>> ExtractTuple(string prefix, Stack<string> source)
{
    return Tuple.Create(prefix, PopWithPrefix(prefix, source).ToList().AsEnumerable());
}

IEnumerable<string> PopWithPrefix(string prefix, Stack<string> source)
{
    while (source.Any() && source.Peek().StartsWith(prefix))
        yield return source.Pop();
}

string GetBestGroup(IEnumerable<string> source, int minNameLength)
{
    var s = new Stack<string>(source);
    var counter = new DictionaryWithDefault<string, int>(0);
    while (s.Any())
    {
        var g = GetCommonPrefix(s);
        if (!string.IsNullOrEmpty(g) && g.Length >= minNameLength)
            counter[g]++;
        s.Pop();
    }
    return counter.OrderBy(c => c.Value).Last().Key;
}

string GetCommonPrefix(IEnumerable<string> coll)
{
    // the count runs to the full minimum length (+ 1 because Range's second
    // argument is a count), so identical strings can match on their whole length
    return (from len in Enumerable.Range(0, coll.Min(s => s.Length) + 1).Reverse()
            let possibleMatch = coll.First().Substring(0, len)
            where coll.All(f => f.StartsWith(possibleMatch))
            select possibleMatch).FirstOrDefault();
}
public class DictionaryWithDefault<TKey, TValue> : Dictionary<TKey, TValue>
{
    TValue _default;

    public TValue DefaultValue
    {
        get { return _default; }
        set { _default = value; }
    }

    public DictionaryWithDefault() : base() { }

    public DictionaryWithDefault(TValue defaultValue) : base()
    {
        _default = defaultValue;
    }

    public new TValue this[TKey key]
    {
        get { return base.ContainsKey(key) ? base[key] : _default; }
        set { base[key] = value; }
    }
}
Example usage:
string[] input = {
    "foo-139875913",
    "foo-aeuefhaiu",
    "foo-95hw9ghes",
    "barbazabejgoiagjaegioea",
    "barbaz8gs98ghsgh9es8h",
    "barbaza98fyae9fghaefag",
    "bazfa90eufa0e9u",
    "bazgeajga8ugae89u",
    "bazguea9guae",
    "9a8efa098fea0",
    "aifeaufhiuafhe"
};
GuessGroups(input, 3, 2).Dump(); // Dump() is LINQPad's; in a console app, print the tuples in a loop instead
Ok, well as discussed, the problem wasn't initially well defined, but here is how I'd go about it.
Create a tree T
Parse the list, for each element:
for each letter in that element
if a branch labeled with that letter exists then
Increment the counter on that branch
Descend that branch
else
Create a branch labelled with that letter
Set its counter to 1
Descend that branch
This gives you a tree where each of the leaves represents a word in your input. Each of the non-leaf nodes has a counter representing how many leaves are (eventually) attached to that node. Now you need a formula to weight the length of the prefix (the depth of the node) against the size of the prefix group. For now:
S = (a * d) + (b * q) // d = depth, q = quantity, a, b coefficients you'll tweak to get desired behaviour
So now you can iterate over each of the non-leaf node and assign them a score S. Then, to work out your groups you would
For each non-leaf node
    Assign score S
    Insertion sort the node into a list, so the head is the highest scoring node
Starting at the root of the tree, traverse the nodes
    If the node is the highest scoring node in the list
        Mark it as a prefix
        Remove all nodes from the list that are a descendant of it
        Pop itself off the front of the list
        Return up the tree
This should give you a list of prefixes. The last part feels like some clever data structures or algorithms could speed it up (the last part of removing all the children feels particularly weak, but if you input size is small, I guess speed isn't too important).
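If it helps to see the counting structure concretely, here is a rough C# sketch of the tree-building and scoring steps (the type names, the Dictionary for child links, and the default coefficients are mine, not part of the pseudocode above):

using System;
using System.Collections.Generic;
using System.Linq;

class CountNode
{
    public readonly Dictionary<char, CountNode> Children = new Dictionary<char, CountNode>();
    public int Count;  // how many input strings pass through this node
    public int Depth;
}

class PrefixTree
{
    readonly CountNode _root = new CountNode();

    public void Add(string s)
    {
        var node = _root;
        foreach (var c in s)
        {
            if (!node.Children.TryGetValue(c, out var child))
                node.Children[c] = child = new CountNode { Depth = node.Depth + 1 };
            child.Count++;
            node = child;
        }
    }

    // S = a*d + b*q over every non-leaf node, as in the formula above
    public IEnumerable<(string Prefix, int Count, double Score)> Scores(double a = 1.0, double b = 1.0)
    {
        return Walk(_root, "");
        IEnumerable<(string, int, double)> Walk(CountNode node, string prefix)
        {
            foreach (var pair in node.Children)
            {
                var p = prefix + pair.Key;
                if (pair.Value.Children.Count > 0) // non-leaf
                    yield return (p, pair.Value.Count, a * pair.Value.Depth + b * pair.Value.Count);
                foreach (var t in Walk(pair.Value, p))
                    yield return t;
            }
        }
    }
}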
I'm wondering if your requirements aren't off. It seems as if you are looking for a specific grouping size as opposed to a specific key size. Below is a program that, based on a specified group size, breaks the strings up into the largest possible groups up to, and including, that group size. So if you specify a group size of 5, it will group items on the smallest key that can make a group of size 5. In your example it would group foo- as f, since there is no need for a more complex key as an identifier.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace ConsoleApplication2
{
    class Program
    {
        /// <remarks><c>true</c> in returned dictionary key are groups over <paramref name="maxGroupSize"/></remarks>
        public static Dictionary<bool, Dictionary<string, List<string>>> Split(int maxGroupSize, int keySize, IEnumerable<string> items)
        {
            var smallItems = from item in items
                             where item.Length < keySize
                             select item;
            var largeItems = from item in items
                             where keySize <= item.Length // <=, so items exactly keySize long aren't dropped
                             select item;
            var largeItemsq = (from item in largeItems
                               let key = item.Substring(0, keySize)
                               group item by key into x
                               select new { Key = x.Key, Items = x.ToList() } into aGrouping
                               group aGrouping by aGrouping.Items.Count() > maxGroupSize into x2
                               select x2).ToDictionary(a => a.Key, a => a.ToDictionary(a_ => a_.Key, a_ => a_.Items));
            if (smallItems.Any())
            {
                var smallestLength = items.Aggregate(int.MaxValue, (acc, item) => Math.Min(acc, item.Length));
                var smallItemsq = (from item in smallItems
                                   let key = item.Substring(0, smallestLength)
                                   group item by key into x
                                   select new { Key = x.Key, Items = x.ToList() } into aGrouping
                                   group aGrouping by aGrouping.Items.Count() > maxGroupSize into x2
                                   select x2).ToDictionary(a => a.Key, a => a.ToDictionary(a_ => a_.Key, a_ => a_.Items));
                return Combine(smallItemsq, largeItemsq);
            }
            return largeItemsq;
        }

        static Dictionary<bool, Dictionary<string, List<string>>> Combine(Dictionary<bool, Dictionary<string, List<string>>> a, Dictionary<bool, Dictionary<string, List<string>>> b)
        {
            var x = new Dictionary<bool, Dictionary<string, List<string>>> {
                { true, null },
                { false, null }
            };
            foreach (var condition in new bool[] { true, false })
            {
                var hasA = a.ContainsKey(condition);
                var hasB = b.ContainsKey(condition);
                x[condition] = hasA && hasB ? a[condition].Concat(b[condition]).ToDictionary(c => c.Key, c => c.Value)
                             : hasA ? a[condition]
                             : hasB ? b[condition]
                             : new Dictionary<string, List<string>>();
            }
            return x;
        }

        public static Dictionary<string, List<string>> Group(int maxGroupSize, IEnumerable<string> items, int keySize)
        {
            var toReturn = new Dictionary<string, List<string>>();
            var both = Split(maxGroupSize, keySize, items);
            if (both.ContainsKey(false))
                foreach (var key in both[false].Keys)
                    toReturn.Add(key, both[false][key]);
            if (both.ContainsKey(true))
            {
                var keySize_ = keySize + 1;
                var xs = from needsFix in both[true]
                         select needsFix;
                foreach (var x in xs)
                {
                    var fixedGroup = Group(maxGroupSize, x.Value, keySize_);
                    toReturn = toReturn.Concat(fixedGroup).ToDictionary(a => a.Key, a => a.Value);
                }
            }
            return toReturn;
        }

        static Random rand = new Random(unchecked((int)DateTime.Now.Ticks));
        const string allowedChars = "aaabbbbccccc"; // "aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ";
        static readonly int maxAllowed = allowedChars.Length; // rand.Next's upper bound is exclusive

        static IEnumerable<string> GenerateText()
        {
            var list = new List<string>();
            for (int i = 0; i < 100; i++)
            {
                var stringLength = rand.Next(3, 25);
                var chars = new List<char>(stringLength);
                for (int j = stringLength; j > 0; j--)
                    chars.Add(allowedChars[rand.Next(0, maxAllowed)]);
                var newString = chars.Aggregate(new StringBuilder(), (acc, item) => acc.Append(item)).ToString();
                list.Add(newString);
            }
            return list;
        }

        static void Main(string[] args)
        {
            // runs 1000 times over autogenerated groups of sample text
            for (int i = 0; i < 1000; i++)
            {
                var s = GenerateText();
                Go(s);
            }
            Console.WriteLine();
            Console.WriteLine("DONE");
            Console.ReadLine();
        }

        static void Go(IEnumerable<string> items)
        {
            var dict = Group(3, items, 1);
            foreach (var key in dict.Keys)
            {
                Console.WriteLine(key);
                foreach (var item in dict[key])
                    Console.WriteLine("\t{0}", item);
            }
        }
    }
}

Create big Two-Dimensional Array

simple question:
How can I use a huge two-dimensional array in C#? What I want to do is the following:
int[] Nodes = new int[1146445];
int[,] Relations = new int[Nodes.Length, Nodes.Length];
Naturally, I get an out-of-memory error.
Is there a chance to work with data this big in memory? (4 GB RAM and a 6-core CPU)
The integers I want to save in the two-dimensional array are small, I guess from 0 to 1000.
Update: I tried to save the relations using a Dictionary<KeyValuePair<int, int>, int>. It works for some adding loops. Here is the class which should create the graph. The instance of CreateGraph gets its data from an XML stream reader.
Main (C# backgroundWorker_DoWork)
ReadXML Reader = new ReadXML(tBOpenFile.Text);
CreateGraph Creater = new CreateGraph();
int WordsCount = (int)nUDLimit.Value;
if (nUDLimit.Value == 0) WordsCount = Reader.CountWords();
// word loop
for (int Position = 0; Position < WordsCount; Position++)
{
    // reading and parsing
    Reader.ReadNextWord();
    // add to graph builder
    Creater.AddWord(Reader.CurrentWord, Reader.GetRelations(Reader.CurrentText));
}
string[] Words = Creater.GetWords();
Dictionary<KeyValuePair<int, int>, int> Relations = Creater.GetRelations();
ReadXML
class ReadXML
{
    private string Path;
    private XmlReader Reader;
    protected int Word;
    public string CurrentWord;
    public string CurrentText;

    public ReadXML(string FilePath)
    {
        Path = FilePath;
        LoadFile();
        Word = 0;
    }

    public int CountWords()
    {
        // caching
        if (Path.Contains("filename")) return 1000;
        int Words = 0;
        while (Reader.Read())
        {
            if (Reader.NodeType == XmlNodeType.Element && Reader.Name == "word")
            {
                Words++;
            }
        }
        LoadFile();
        return Words;
    }

    public void ReadNextWord()
    {
        while (Reader.Read())
        {
            if (Reader.NodeType == XmlNodeType.Element && Reader.Name == "word")
            {
                while (Reader.Read())
                {
                    if (Reader.NodeType == XmlNodeType.Element && Reader.Name == "name")
                    {
                        XElement Title = XElement.ReadFrom(Reader) as XElement;
                        CurrentWord = Title.Value;
                        break;
                    }
                }
                while (Reader.Read())
                {
                    if (Reader.NodeType == XmlNodeType.Element && Reader.Name == "rels")
                    {
                        XElement Text = XElement.ReadFrom(Reader) as XElement;
                        CurrentText = Text.Value;
                        break;
                    }
                }
                break;
            }
        }
    }

    public Dictionary<string, int> GetRelations(string Text)
    {
        Dictionary<string, int> Relations = new Dictionary<string, int>();
        string[] RelationStrings = Text.Split(';');
        foreach (string RelationString in RelationStrings)
        {
            string[] SplitString = RelationString.Split(':');
            if (SplitString.Length == 2)
            {
                string RelationName = SplitString[0];
                int RelationWeight = Convert.ToInt32(SplitString[1]);
                Relations.Add(RelationName, RelationWeight);
            }
        }
        return Relations;
    }

    private void LoadFile()
    {
        Reader = XmlReader.Create(Path);
        Reader.MoveToContent();
    }
}
CreateGraph
class CreateGraph
{
    private Dictionary<string, int> CollectedWords = new Dictionary<string, int>();
    private Dictionary<KeyValuePair<int, int>, int> CollectedRelations = new Dictionary<KeyValuePair<int, int>, int>();

    public void AddWord(string Word, Dictionary<string, int> Relations)
    {
        int SourceNode = GetIdCreate(Word);
        foreach (KeyValuePair<string, int> Relation in Relations)
        {
            int TargetNode = GetIdCreate(Relation.Key);
            CollectedRelations.Add(new KeyValuePair<int, int>(SourceNode, TargetNode), Relation.Value); // here is the error located
        }
    }

    public string[] GetWords()
    {
        string[] Words = new string[CollectedWords.Count];
        foreach (KeyValuePair<string, int> CollectedWord in CollectedWords)
        {
            Words[CollectedWord.Value] = CollectedWord.Key;
        }
        return Words;
    }

    public Dictionary<KeyValuePair<int, int>, int> GetRelations()
    {
        return CollectedRelations;
    }

    private int WordsIndex = 0;

    private int GetIdCreate(string Word)
    {
        if (!CollectedWords.ContainsKey(Word))
        {
            CollectedWords.Add(Word, WordsIndex);
            WordsIndex++;
        }
        return CollectedWords[Word];
    }
}
Now I get another error: An element with the same key already exists. (At the Add in the CreateGraph class.)
You'll have a better chance when you set Relations up as a jagged array (array of arrays):
//int[,] Relations = new int[Nodes.Length, Nodes.Length];
int[][] Relations = new int[Nodes.Length][];
for (int i = 0; i < Relations.Length; i++)
    Relations[i] = new int[Nodes.Length];
And then you still need 10k * 10k * sizeof(int) = 400 MB, which should be possible even when running in 32-bit.
Update:
With the new number, it's 1M * 1M * 4 = 4 TB; that's not going to work.
And using short in place of int will only bring it down to 2 TB.
Since you seem to need to assign weights to (sparse) connections between nodes, you should see if something like this could work:
struct WeightedRelation
{
    public readonly int node1;
    public readonly int node2;
    public readonly int weight;

    public WeightedRelation(int node1, int node2, int weight)
    {
        this.node1 = node1;
        this.node2 = node2;
        this.weight = weight;
    }
}

int[] Nodes = new int[1146445];
List<WeightedRelation> Relations = new List<WeightedRelation>();
Relations.Add(new WeightedRelation(1, 2, 10));
...
This is just the basic idea; you may need a double dictionary to do fast lookups (a sketch of that follows). But your memory size would be proportional to the number of actual (non-zero) relations.
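To make the "double dictionary" remark concrete, a sketch under assumed names (the class and its methods are illustrative, not part of the answer above):

using System.Collections.Generic;

// weight lookup: outer key = first node id, inner key = second node id
class SparseWeights
{
    private readonly Dictionary<int, Dictionary<int, int>> _weights =
        new Dictionary<int, Dictionary<int, int>>();

    public void Add(int node1, int node2, int weight)
    {
        if (!_weights.TryGetValue(node1, out var inner))
            _weights[node1] = inner = new Dictionary<int, int>();
        inner[node2] = weight;
    }

    public bool TryGetWeight(int node1, int node2, out int weight)
    {
        weight = 0;
        return _weights.TryGetValue(node1, out var inner)
            && inner.TryGetValue(node2, out weight);
    }
}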
Okay, now we know what you're really trying to do...
int[] Nodes = new int[1146445];
int[,] Relations = new int[Nodes.Length, Nodes.Length];
You're trying to allocate a single object which has 1,314,336,138,025 elements, each of size 4 bytes. That's over 5,000 GB. How exactly did you expect that to work?
Whatever you do, you're obviously going to run out of physical memory for that many elements... even if the CLR let you allocate a single object of that size.
Let's take a smaller example of 50,000 nodes, where you'd end up with ~9 GB of required space. I can't remember what the current limit is (it depends on the CLR version number and whether you're using the 32- or 64-bit CLR), but I don't think any of them will support that.
You can break your array up into "rows" as shown in Henk's answer - that will take up more memory in total, but each array will be small enough to cope with on its own in the CLR. It's not going to help you fit the whole thing into memory though - at best you'll end up swapping to oblivion.
Can you use sparse arrays instead, where you only allocate space for elements you really need to access (or some approximation of that)? Or map the data to disk? If you give us more context, we may be able to come up with a solution.
Jon and Henk have alluded to sparse arrays; this would be useful if many of your nodes are unrelated to each other. Even if all nodes are related to all others, you may not need an n by n array.
For example, perhaps nodes cannot be related to themselves. Perhaps, given nodes x and y, "x is related to y" is the same as "y is related to x". If both of those are true, then for 4 nodes, you only have 6 relations, not 16:
a <-> b
a <-> c
a <-> d
b <-> c
b <-> d
c <-> d
In this case, an n-by-n array is wasting somewhat more than half of its space. If large numbers of nodes are unrelated to each other, you're wasting that much more than half of the space.
One quick way to implement this would be as a Dictionary<KeyType, RelationType>, where the key uniquely identifies the two nodes being related. Depending on your exact needs, this could take one of several different forms. Here's an example based on the nodes and relations defined above:
Dictionary<KeyType, RelationType> x = new Dictionary<KeyType, RelationType>();
x.Add(new KeyType(a, b), new RelationType(a, b));
x.Add(new KeyType(a, c), new RelationType(a, c));
... etc.
If relations are symmetric, then KeyType should ensure that new KeyType(b, a) creates an object that is equivalent to the one created by new KeyType(a, b).
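A sketch of such a KeyType, assuming integer node ids (the normalization mirrors the MakeEdgeKey idea at the top of this page; only the name KeyType comes from the fragments above, the implementation is illustrative):

using System;

// An unordered pair of node ids: KeyType(a, b) equals KeyType(b, a).
struct KeyType : IEquatable<KeyType>
{
    public readonly int Low, High;

    public KeyType(int x, int y)
    {
        // normalize so that (b, a) produces the same key as (a, b)
        Low = Math.Min(x, y);
        High = Math.Max(x, y);
    }

    public bool Equals(KeyType other) => Low == other.Low && High == other.High;
    public override bool Equals(object obj) => obj is KeyType other && Equals(other);
    public override int GetHashCode() => unchecked(Low * 31 + High);
}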
