Create big Two-Dimensional Array - c#

simple question:
How can I use a huge two-dimensional array in C#? What I want to do is the following:
int[] Nodes = new int[1146445];
int[,] Relations = new int[Nodes.Length, Nodes.Length];
Unsurprisingly, I get an out-of-memory error.
Is there any chance of working with data this big in memory? (4 GB RAM and a 6-core CPU)
The integers I want to store in the two-dimensional array are small, roughly 0 to 1000.
Update: I tried to store the relations in a Dictionary<KeyValuePair<int, int>, int>. It works for the first few add loops. Below is the class which should create the graph. The CreateGraph instance gets its data from an XML stream reader.
Main (C# backgroundWorker_DoWork)
ReadXML Reader = new ReadXML(tBOpenFile.Text);
CreateGraph Creater = new CreateGraph();

int WordsCount = (int)nUDLimit.Value;
if (nUDLimit.Value == 0) WordsCount = Reader.CountWords();

// word loop
for (int Position = 0; Position < WordsCount; Position++)
{
    // reading and parsing
    Reader.ReadNextWord();
    // add to graph builder
    Creater.AddWord(Reader.CurrentWord, Reader.GetRelations(Reader.CurrentText));
}

string[] Words = Creater.GetWords();
Dictionary<KeyValuePair<int, int>, int> Relations = Creater.GetRelations();
ReadXML
using System;
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

class ReadXML
{
    private string Path;
    private XmlReader Reader;
    protected int Word;
    public string CurrentWord;
    public string CurrentText;

    public ReadXML(string FilePath)
    {
        Path = FilePath;
        LoadFile();
        Word = 0;
    }

    public int CountWords()
    {
        // caching
        if (Path.Contains("filename")) return 1000;

        int Words = 0;
        while (Reader.Read())
        {
            if (Reader.NodeType == XmlNodeType.Element && Reader.Name == "word")
            {
                Words++;
            }
        }
        LoadFile();
        return Words;
    }

    public void ReadNextWord()
    {
        while (Reader.Read())
        {
            if (Reader.NodeType == XmlNodeType.Element && Reader.Name == "word")
            {
                while (Reader.Read())
                {
                    if (Reader.NodeType == XmlNodeType.Element && Reader.Name == "name")
                    {
                        XElement Title = XElement.ReadFrom(Reader) as XElement;
                        CurrentWord = Title.Value;
                        break;
                    }
                }
                while (Reader.Read())
                {
                    if (Reader.NodeType == XmlNodeType.Element && Reader.Name == "rels")
                    {
                        XElement Text = XElement.ReadFrom(Reader) as XElement;
                        CurrentText = Text.Value;
                        break;
                    }
                }
                break;
            }
        }
    }

    public Dictionary<string, int> GetRelations(string Text)
    {
        Dictionary<string, int> Relations = new Dictionary<string, int>();
        string[] RelationStrings = Text.Split(';');
        foreach (string RelationString in RelationStrings)
        {
            string[] SplitString = RelationString.Split(':');
            if (SplitString.Length == 2)
            {
                string RelationName = SplitString[0];
                int RelationWeight = Convert.ToInt32(SplitString[1]);
                Relations.Add(RelationName, RelationWeight);
            }
        }
        return Relations;
    }

    private void LoadFile()
    {
        Reader = XmlReader.Create(Path);
        Reader.MoveToContent();
    }
}
CreateGraph
class CreateGraph
{
    private Dictionary<string, int> CollectedWords = new Dictionary<string, int>();
    private Dictionary<KeyValuePair<int, int>, int> CollectedRelations = new Dictionary<KeyValuePair<int, int>, int>();

    public void AddWord(string Word, Dictionary<string, int> Relations)
    {
        int SourceNode = GetIdCreate(Word);
        foreach (KeyValuePair<string, int> Relation in Relations)
        {
            int TargetNode = GetIdCreate(Relation.Key);
            CollectedRelations.Add(new KeyValuePair<int, int>(SourceNode, TargetNode), Relation.Value); // here is the error located
        }
    }

    public string[] GetWords()
    {
        string[] Words = new string[CollectedWords.Count];
        foreach (KeyValuePair<string, int> CollectedWord in CollectedWords)
        {
            Words[CollectedWord.Value] = CollectedWord.Key;
        }
        return Words;
    }

    public Dictionary<KeyValuePair<int, int>, int> GetRelations()
    {
        return CollectedRelations;
    }

    private int WordsIndex = 0;

    private int GetIdCreate(string Word)
    {
        if (!CollectedWords.ContainsKey(Word))
        {
            CollectedWords.Add(Word, WordsIndex);
            WordsIndex++;
        }
        return CollectedWords[Word];
    }
}
Now I get another error: "An element with the same key already exists." (at the Add in the CreateGraph class).
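That duplicate-key error means the same (source, target) pair occurs more than once in the input. A minimal fix, assuming repeated relations should have their weights merged (summing them is an assumption), is to replace the Add call:
// Inside CreateGraph.AddWord: merge repeated (source, target) pairs
// instead of calling Add, which throws on an existing key.
var key = new KeyValuePair<int, int>(SourceNode, TargetNode);
int existingWeight;
if (CollectedRelations.TryGetValue(key, out existingWeight))
    CollectedRelations[key] = existingWeight + Relation.Value; // summing is an assumption
else
    CollectedRelations.Add(key, Relation.Value);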

You'll have a better chance if you set Relations up as a jagged array (an array of arrays):
//int[,] Relations = new int[Nodes.Length, Nodes.Length];
int[][] Relations = new int[Nodes.Length][];
for (int i = 0; i < Relations.Length; i++)
    Relations[i] = new int[Nodes.Length];
And then you still need 10k * 10k * sizeof(int) = 400 MB,
which should be possible, even when running in 32-bit.
Update:
With the new number, it's 1M * 1M * 4 = 4 TB; that's not going to work.
And using short in place of int would only bring it down to 2 TB.
Since you seem to need to assign weights to (sparse) connections between nodes, you should see if something like this could work:
struct WeightedRelation
{
    public readonly int node1;
    public readonly int node2;
    public readonly int weight;

    public WeightedRelation(int node1, int node2, int weight)
    {
        this.node1 = node1;
        this.node2 = node2;
        this.weight = weight;
    }
}

int[] Nodes = new int[1146445];
List<WeightedRelation> Relations = new List<WeightedRelation>();
Relations.Add(new WeightedRelation(1, 2, 10));
...
This is just the basic idea; you may need a double dictionary to do fast lookups. But your memory size would be proportional to the number of actual (non-zero) relations.
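A minimal sketch of that double-dictionary idea (the type and member names are mine): the outer dictionary is keyed by source node, the inner by target node, so a lookup is two hash probes and memory stays proportional to the non-zero relations:
using System.Collections.Generic;

class SparseRelations
{
    private readonly Dictionary<int, Dictionary<int, int>> _weights =
        new Dictionary<int, Dictionary<int, int>>();

    public void Set(int node1, int node2, int weight)
    {
        Dictionary<int, int> row;
        if (!_weights.TryGetValue(node1, out row))
        {
            row = new Dictionary<int, int>();
            _weights[node1] = row;
        }
        row[node2] = weight;
    }

    // absent relations read as 0, matching the dense-array default
    public int Get(int node1, int node2)
    {
        Dictionary<int, int> row;
        int weight;
        if (_weights.TryGetValue(node1, out row) && row.TryGetValue(node2, out weight))
            return weight;
        return 0;
    }
}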

Okay, now we know what you're really trying to do...
int[] Nodes = new int[1146445];
int[,] Relations = new int[Nodes.Length, Nodes.Length];
You're trying to allocate a single object which has 1,314,336,138,025 elements, each of size 4 bytes. That's over 5,000 GB. How exactly did you expect that to work?
Whatever you do, you're obviously going to run out of physical memory for that many elements... even if the CLR let you allocate a single object of that size.
Let's take a smaller example of 50,000 nodes, where you end up with ~9 GB of required space. I can't remember what the current limit is (it depends on the CLR version and whether you're using the 32- or 64-bit CLR), but I don't think any of them will support that.
You can break your array up into "rows" as shown in Henk's answer - that will take up more memory in total, but each array will be small enough to cope with on its own in the CLR. It's not going to help you fit the whole thing into memory though - at best you'll end up swapping to oblivion.
Can you use sparse arrays instead, where you only allocate space for elements you really need to access (or some approximation of that)? Or map the data to disk? If you give us more context, we may be able to come up with a solution.

Jon and Henk have alluded to sparse arrays; this would be useful if many of your nodes are unrelated to each other. Even if all nodes are related to all others, you may not need an n by n array.
For example, perhaps nodes cannot be related to themselves. Perhaps, given nodes x and y, "x is related to y" is the same as "y is related to x". If both of those are true, then for 4 nodes, you only have 6 relations, not 16:
a <-> b
a <-> c
a <-> d
b <-> c
b <-> d
c <-> d
In this case, an n-by-n array is wasting somewhat more than half of its space. If large numbers of nodes are unrelated to each other, you're wasting even more.
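If both assumptions hold (no self-relations, symmetric relations), a dense layout can store just the lower triangle in a flat array of n(n-1)/2 entries. A sketch of the index arithmetic (my illustration; for the original 1.1M-node problem even the triangle is far too large, so this only helps at modest n):
// Sketch: lower-triangle storage for a symmetric relation with no self-loops.
// Stores n * (n - 1) / 2 ints instead of n * n. Assumes i != j.
class TriangularMatrix
{
    private readonly int[] _data;

    public TriangularMatrix(int nodeCount)
    {
        _data = new int[nodeCount * (nodeCount - 1) / 2];
    }

    private static int Index(int i, int j)
    {
        if (i < j) { int t = i; i = j; j = t; } // normalize so i > j
        return i * (i - 1) / 2 + j;             // offset of row i, plus column j
    }

    public int this[int i, int j]
    {
        get { return _data[Index(i, j)]; }
        set { _data[Index(i, j)] = value; }
    }
}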
One quick way to implement this would be as a Dictionary<KeyType, RelationType>, where the key uniquely identifies the two nodes being related. Depending on your exact needs, this could take one of several different forms. Here's an example based on the nodes and relations defined above:
Dictionary<KeyType, RelationType> x = new Dictionary<KeyType, RelationType>();
x.Add(new KeyType(a, b), new RelationType(a, b));
x.Add(new KeyType(a, c), new RelationType(a, c));
... etc.
If relations are symmetric, then KeyType should ensure that new KeyType(b, a) creates an object that is equal to the one created by new KeyType(a, b).
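For illustration, a sketch of such a key (NodePairKey is my name for it, and it assumes nodes are identified by int indexes): the constructor normalizes the pair order, so equality and hashing ignore argument order:
using System;

struct NodePairKey : IEquatable<NodePairKey>
{
    private readonly int _low;
    private readonly int _high;

    public NodePairKey(int a, int b)
    {
        // normalize so (a, b) and (b, a) produce the same key
        _low = Math.Min(a, b);
        _high = Math.Max(a, b);
    }

    public bool Equals(NodePairKey other)
    {
        return _low == other._low && _high == other._high;
    }

    public override bool Equals(object obj)
    {
        return obj is NodePairKey && Equals((NodePairKey)obj);
    }

    public override int GetHashCode()
    {
        unchecked { return _low * 397 ^ _high; }
    }
}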

Related

Can I represent a given word (String) as a number?

Suppose I have a list of words e.g.
var words = new [] {"bob", "alice", "john"};
Is there a way to represent each of those words as a number, so that the numbers could be used to sort the words?
One use-case I can think of is using Counting Sort to sort a list of words. Again, I am only interested in whether this is at all possible, not in whether it is the most efficient way to sort a list of words.
Do note this is not about hash-codes or different sorting algorithms. I am curious to find out if a string can be represented as a number.
You can use a dictionary instead of an array.
using System;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    static void Main(string[] args)
    {
        IDictionary<int, string> words = new Dictionary<int, string>();
        words.Add(0, "bob");
        words.Add(1, "alice");
        words.Add(2, "john");

        foreach (KeyValuePair<int, string> word in words.OrderBy(w => w.Key))
        {
            Console.WriteLine(word.Value);
        }
        Console.ReadLine();
    }
}
NOTE: collections are generally easier to work with than plain arrays for most developers.
I don't understand the downvotes, but hey, this is what I have come up with so far:
private int _alphabetLength = char.MaxValue - char.MinValue;

private BigInteger Convert(string data)
{
    var value = new BigInteger();
    var startPoint = data.Length - 1;
    for (int i = data.Length - 1; i >= 0; i--)
    {
        var character = data[i];
        var charNumericValue = character;
        var exponentialWeight = startPoint - i;
        // NOTE: the multiplication below happens in double before the
        // BigInteger is constructed, so it overflows for longer strings
        var weightedValue = new BigInteger(charNumericValue * Math.Pow(_alphabetLength, exponentialWeight));
        value += weightedValue;
    }
    return value;
}
Using the above to convert the following:
var words = new [] {"bob", "alice", "john" };
420901224533 // bob
-9223372036854775808 // alice
29835458486206476 // john
Despite the overflow, the output looks sorted to me. I need to improve this and test it properly, but at least it is a start.
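For what it's worth, the overflow comes from computing charNumericValue * Math.Pow(...) in double before the BigInteger is constructed. A sketch that stays in BigInteger the whole way (my adjustment, using Horner's rule; note the base must be one more than the largest char value):
using System.Numerics;

private static readonly BigInteger AlphabetSize = char.MaxValue - char.MinValue + 1;

private BigInteger ConvertExact(string data)
{
    BigInteger value = BigInteger.Zero;
    for (int i = 0; i < data.Length; i++)
    {
        // treat the string as a base-65536 number, most significant character first
        value = value * AlphabetSize + (int)data[i];
    }
    return value;
}
Like the original, this only matches lexicographic order for strings of equal length; shorter strings would need padding to compare correctly against longer ones.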

algorithm for selecting N random elements from a List<T> in C# [duplicate]

This question already has answers here:
Randomize a List<T>
(28 answers)
Closed 6 years ago.
I need a quick algorithm to select 4 random elements from a generic list. For example, I'd like to get 4 random elements from a List<T>; then, if some calculation finds those elements invalid, it should select the next 4 random elements from the list.
You could do it like this
public static class Extensions
{
    public static Dictionary<int, T> GetRandomElements<T>(this IList<T> list, int quantity)
    {
        var result = new Dictionary<int, T>();
        if (list == null)
            return result;

        Random rnd = new Random();
        // loop until we have the requested number of distinct indexes;
        // Dictionary.Add would throw if the same index came up twice
        while (result.Count < quantity && result.Count < list.Count)
        {
            int idx = rnd.Next(0, list.Count);
            if (!result.ContainsKey(idx))
                result.Add(idx, list[idx]);
        }
        return result;
    }
}
Then use the extension method like this:
List<string> list = new List<string>() { "a", "b", "c", "d", "e", "f", "g", "h" };
Dictionary<int, string> randomElements = list.GetRandomElements(3);
foreach (KeyValuePair<int, string> elem in randomElements)
{
    Console.WriteLine($"index in original list: {elem.Key} value: {elem.Value}");
}
Something like this:
using System;
using System.Collections.Generic;

public class Program
{
    public static void Main()
    {
        var list = new List<int>();
        list.Add(1);
        list.Add(2);
        list.Add(3);
        list.Add(4);
        list.Add(5);

        int n = 4;
        var rand = new Random();
        var randomObjects = new List<int>();
        for (int i = 0; i < n; i++)
        {
            var index = rand.Next(list.Count);
            randomObjects.Add(list[index]);
        }
    }
}
You can store indexes in some list to get non-repeated indexes:
List<T> GetRandomElements<T>(List<T> allElements, int randomCount = 4)
{
    if (allElements.Count < randomCount)
    {
        return allElements;
    }

    List<int> indexes = new List<int>();
    // use a HashSet if performance is very critical and you need a lot of indexes
    //HashSet<int> indexes = new HashSet<int>();
    List<T> elements = new List<T>();
    Random random = new Random();
    while (indexes.Count < randomCount)
    {
        int index = random.Next(allElements.Count);
        if (!indexes.Contains(index))
        {
            indexes.Add(index);
            elements.Add(allElements[index]);
        }
    }
    return elements;
}
Then you can do some calculation and call this method:
void Main(string[] args)
{
    do
    {
        List<int> elements = GetRandomElements(yourElements);
        // do some calculations
    } while (someCondition); // while the result is not right
}
Suppose that the length of the List is N, and that you will put these 4 numbers in another List called out. Then, as you loop through the List, the probability that the element you are currently on gets chosen is:
(4 - (out.Count)) / (N - currentIndex)
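That is the classic selection-sampling scheme (Knuth's Algorithm S). A sketch of it in C# (my illustration, assuming a single in-order pass over the list is acceptable):
using System;
using System.Collections.Generic;

static List<T> TakeRandom<T>(IList<T> source, int count, Random rng)
{
    var chosen = new List<T>(count);
    for (int i = 0; i < source.Count && chosen.Count < count; i++)
    {
        // pick the current element with probability (still needed) / (still remaining)
        if (rng.Next(source.Count - i) < count - chosen.Count)
            chosen.Add(source[i]);
    }
    return chosen;
}
Each element is selected with equal probability and no element repeats, at the cost of walking the list once.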
function (list)
(
    loop i=0 i < 4
        index = (int) length(list) * random(0 -> 1)
        element[i] = list[index]
    return element
)

while (check == false)
(
    elements = function(list)
    Do some calculation which returns check == false/true
)
This is pseudocode, but I think you should have come up with this yourself.
Hope it helps:)
All the answers up to now have one fundamental flaw: you are asking for an algorithm that will generate a random combination of n elements, and this combination, following some logic rules, will be valid or not. If it's not, a new combination should be produced. Obviously, this new combination should be one that has never been produced before. The proposed algorithms do not enforce this. If, for example, out of 1,000,000 possible combinations only one is valid, you might waste a whole lot of resources until that particular unique combination is produced.
So, how do we solve this? Well, the answer is simple: create all possible unique solutions, and then simply produce them in a random order. Caveat: I will assume that the input stream has no repeating elements; if it does, then some combinations will not be unique.
First of all, let's write ourselves a handy immutable stack:
using System;
using System.Collections;
using System.Collections.Generic;

class ImmutableStack<T> : IEnumerable<T>
{
    public static readonly ImmutableStack<T> Empty = new ImmutableStack<T>();
    private readonly T head;
    private readonly ImmutableStack<T> tail;

    public int Count { get; }

    private ImmutableStack()
    {
        Count = 0;
    }

    private ImmutableStack(T head, ImmutableStack<T> tail)
    {
        this.head = head;
        this.tail = tail;
        Count = tail.Count + 1;
    }

    public T Peek()
    {
        if (this == Empty)
            throw new InvalidOperationException("Cannot peek an empty stack.");
        return head;
    }

    public ImmutableStack<T> Pop()
    {
        if (this == Empty)
            throw new InvalidOperationException("Cannot pop an empty stack.");
        return tail;
    }

    public ImmutableStack<T> Push(T item) => new ImmutableStack<T>(item, this);

    public IEnumerator<T> GetEnumerator()
    {
        var current = this;
        while (current != Empty)
        {
            yield return current.head;
            current = current.tail;
        }
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
This will make our life easier while producing all combinations by recursion. Next, let's get the signature of our main method right:
public static IEnumerable<IEnumerable<T>> GetAllPossibleCombinationsInRandomOrder<T>(
    IEnumerable<T> data, int combinationLength)
Ok, that looks about right. Now let's implement this thing:
var allCombinations = GetAllPossibleCombinations(data, combinationLength).ToArray();
var rnd = new Random();
var producedIndexes = new HashSet<int>();
while (producedIndexes.Count < allCombinations.Length)
{
    while (true)
    {
        var index = rnd.Next(allCombinations.Length);
        if (!producedIndexes.Contains(index))
        {
            producedIndexes.Add(index);
            yield return allCombinations[index];
            break;
        }
    }
}
Ok, all we are doing here is producing random indexes, checking we haven't produced one yet (we use a HashSet<int> for this), and returning the combination at that index.
Simple, now we only need to take care of GetAllPossibleCombinations(data, combinationLength).
That's easy: we'll use recursion. Our bail-out condition is when the current combination has the specified length. Another caveat: I'm omitting argument validation throughout the whole code; things like checking for null, or that the specified length is not bigger than the input length, etc. should be taken care of.
Just for fun, I'll be using some minor C# 7 syntax here: nested functions.
public static IEnumerable<IEnumerable<T>> GetAllPossibleCombinations<T>(
    IEnumerable<T> stream, int length)
{
    return getAllCombinations(stream, ImmutableStack<T>.Empty);

    // note: the local function reuses the outer T rather than declaring
    // its own type parameter, which would shadow it
    IEnumerable<IEnumerable<T>> getAllCombinations(IEnumerable<T> currentData, ImmutableStack<T> combination)
    {
        if (combination.Count == length)
        {
            yield return combination;
            yield break; // recursing further can only make the combination longer
        }

        foreach (var d in currentData)
        {
            var newCombination = combination.Push(d);
            foreach (var c in getAllCombinations(currentData.Except(new[] { d }), newCombination))
            {
                yield return c;
            }
        }
    }
}
And there we go, now we can use this:
var data = "abc";
var random = GetAllPossibleCombinationsInRandomOrder(data, 2);
foreach (var r in random)
{
Console.WriteLine(string.Join("", r));
}
And sure enough, the output is:
bc
cb
ab
ac
ba
ca

How to compare two csv files by 2 columns?

I have 2 csv files
1.csv
spain;russia;japan
italy;russia;france
2.csv
spain;russia;japan
india;iran;pakistan
I read both files and add data to lists
var lst1= File.ReadAllLines("1.csv").ToList();
var lst2= File.ReadAllLines("2.csv").ToList();
Then I find all strings unique to one list or the other and add them to a result list:
var rezList = lst1.Except(lst2).Union(lst2.Except(lst1)).ToList();
rezList contains this data:
[0] = "italy;russia;france"
[1] = "india;iran;pakistan"
Now I want to compare, and compute the except and union, by the second and third columns of all rows.
1.csv
spain;russia;japan
italy;russia;france
2.csv
spain;russia;japan
india;iran;pakistan
I think I need to split all rows on the ';' symbol and apply all 3 operations (except, distinct and union), but I cannot work out how.
rezList must contain:
india;iran;pakistan
I added this class:
class StringLengthEqualityComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y)
    {
        ...
    }

    public int GetHashCode(string obj)
    {
        ...
    }
}

StringLengthEqualityComparer stringLengthComparer = new StringLengthEqualityComparer();
var rezList = lst1.Except(lst2, stringLengthComparer).Union(lst2.Except(lst1, stringLengthComparer), stringLengthComparer).ToList();
Your question is not very clear. For instance, is india;iran;pakistan the desired result primarily because the other candidate, italy;russia;france, is excluded for having russia at element [1]? Isn't it also included because element [2], pakistan, does not match france or japan? Even though that's unclear, I assume the desired result comes from either situation.
Then there is this: "find all unique strings from both lists", which changes the nature dramatically. So I take it that the desired result comes about because "iran" appears nowhere else in column [1] in either file, and even if it did, that row would still be unique due to "pakistan" in col [2].
Also note that a data sample of 2 rows leaves room for a fair amount of error.
Trying to do it in one step makes it very confusing. Since eliminating dupes found in 1.CSV is pretty easy, do it first:
// parse "1.CSV"
List<string[]> lst1 = File.ReadAllLines(#"C:\Temp\1.csv").
Select(line => line.Split(';')).
ToList();
// parse "2.CSV"
List<string[]> lst2 = File.ReadAllLines(#"C:\Temp\2.csv").
Select(line => line.Split(';')).
ToList();
// extracting once speeds things up in the next step
// and leaves open the possibility of iterating in a method
List<List<string>> tgts = new List<List<string>>();
tgts.Add(lst1.Select(z => z[1]).Distinct().ToList());
tgts.Add(lst1.Select(z => z[2]).Distinct().ToList());
var tmpLst = lst2.Where(x => !tgts[0].Contains(x[1]) ||
!tgts[1].Contains(x[2])).
ToList();
That results in the items which are not in 1.CSV (no matching text in Col[1] nor Col[2]). If that is really all you need, you are done.
Getting unique rows within 2.CSV is trickier because you have to actually count the number of times each Col[1] item occurs to see if it is unique; then repeat for Col[2]. This uses GroupBy:
var unique = tmpLst.
    GroupBy(g => g[1], (key, values) =>
        new GroupItem(key,
                      values.ToArray()[0],
                      values.Count())
    ).Where(q => q.Count == 1).
    GroupBy(g => g.Data[2], (key, values) => new
    {
        Item = string.Join(";", values.ToArray()[0]),
        Count = values.Count()
    }
    ).Where(q => q.Count == 1).Select(s => s.Item).
    ToList();
The GroupItem class is trivial:
class GroupItem
{
    public string Item { set; get; } // debug aide
    public string[] Data { set; get; }
    public int Count { set; get; }

    public GroupItem(string n, string[] d, int c)
    {
        Item = n;
        Data = d;
        Count = c;
    }

    public override string ToString()
    {
        return string.Join(";", Data);
    }
}
It starts with tmpLst and gets the rows with a unique element at [1]. It uses a class for storage since at this point we need the array data for further review.
The second GroupBy acts on those results, this time looking at col[2]. Finally, it selects the joined string data.
Results
Using 50,000 random items in File1 (1.3 MB) and 15,000 in File2 (390 KB), there were no naturally occurring unique items, so I manually made 8 rows unique in 2.CSV and copied 2 of them into 1.CSV. The copies in 1.CSV should eliminate 2 of the 8 unique rows in 2.CSV, making the expected result 6 unique rows:
NepalX and ItalyX were the repeats in both files and they correctly eliminated each other.
With each step it is scanning and working with less and less data, which seems to make it pretty fast for 65,000 rows / 130,000 data elements.
Your GetHashCode() method in the EqualityComparer is buggy. Fixed version:
public int GetHashCode(string obj)
{
    return obj.Split(';')[1].GetHashCode();
}
Now the result is correct:
// one result: "india;iran;pakistan"
btw. "StringLengthEqualityComparer"is not a good name ;-)
private List<string> GetUnion(List<string> lst1, List<string> lst2)
{
    List<string> lstUnion = new List<string>();
    foreach (string value in lst1)
    {
        string[] columns = value.Split(';');
        string valueColumn2 = columns[1];
        string valueColumn3 = columns[2];
        string result = lst2.FirstOrDefault(s => s.Contains(";" + valueColumn2 + ";" + valueColumn3));
        if (result != null && !lstUnion.Contains(result))
        {
            lstUnion.Add(result);
        }
    }
    return lstUnion; // the original snippet built this list but never returned it
}
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        var lst1 = File.ReadLines(@"D:\test\1.csv").Select(x => new StringWrapper(x)).ToList();
        var lst2 = File.ReadLines(@"D:\test\2.csv").Select(x => new StringWrapper(x));

        var set = new HashSet<StringWrapper>(lst1);
        set.SymmetricExceptWith(lst2);
        foreach (var x in set)
        {
            Console.WriteLine(x.Value);
        }
    }
}
struct StringWrapper : IEquatable<StringWrapper>
{
    public string Value { get; }

    // the question compares rows by their 2nd and 3rd columns
    private readonly string _comparand1;
    private readonly string _comparand2;

    public StringWrapper(string value)
    {
        Value = value;
        var split = value.Split(';');
        _comparand1 = split[1];
        _comparand2 = split[2];
    }

    public bool Equals(StringWrapper other)
    {
        return string.Equals(_comparand1, other._comparand1, StringComparison.OrdinalIgnoreCase)
            && string.Equals(_comparand2, other._comparand2, StringComparison.OrdinalIgnoreCase);
    }

    public override bool Equals(object obj)
    {
        if (ReferenceEquals(null, obj)) return false;
        return obj is StringWrapper && Equals((StringWrapper)obj);
    }

    public override int GetHashCode()
    {
        unchecked
        {
            return ((_comparand1 != null ? StringComparer.OrdinalIgnoreCase.GetHashCode(_comparand1) : 0) * 397)
                 ^ (_comparand2 != null ? StringComparer.OrdinalIgnoreCase.GetHashCode(_comparand2) : 0);
        }
    }
}

Compare a set of three strings with another

I am making a list of unique "sets of 3 strings" from some data, in such a way that if 3 strings come together they become a set, and I can only have unique sets in my list.
A,B,C
B,C,D
D,E,F and so on
And I keep adding sets to the list if they do not already exist in it, so that if I encounter these three strings together {A,B,C} I won't put them in the list again. So I have 2 questions, and the answer to the second one actually depends on the answer to the first.
1. How should I store a set of 3 strings: a List, an array, a concatenation, or anything else? (I may add it to a Dictionary to record their count as well, but that's for later.)
2. How do I compare one set of 3 strings with another, irrespective of their order, obviously depending on the structure used? I want a proper solution to this rather than doing everything naively!
I am using C# by the way.
Either an array or a list is your best bet for storing the data since, as wentimo mentioned in a comment, concatenating them means that you are losing data that you may need. To steal his example, "ab", "cd", "ef" concatenated together is the same as "abcd", "e", and "f" concatenated, but they shouldn't be treated as equivalent sets.
To compare them, I would order the list alphabetically, then compare each value in order. That takes care of the fact that the order of the values doesn't matter.
A pseudocode example might look like this:
bool Compare(List<string> a, List<string> b)
{
    a.Sort();
    b.Sort();
    if (a.Count == b.Count)
    {
        for (int i = 0; i < a.Count; i++)
        {
            if (a[i] != b[i])
            {
                return false;
            }
        }
        return true;
    }
    else
    {
        return false;
    }
}
Update
Now that you stated in a comment that performance is an important consideration, since you may have millions of these sets to compare, and that you won't have duplicate elements in a set, here is a more optimized version of my code. Note that I no longer have to sort the two lists, which saves quite a bit of time in executing this function:
bool Compare(List<string> a, List<string> b)
{
    if (a.Count == b.Count)
    {
        for (int i = 0; i < a.Count; i++)
        {
            if (!b.Contains(a[i]))
            {
                return false;
            }
        }
        return true;
    }
    else
    {
        return false;
    }
}
DrewJordan's approach of using a hashtable is still probably faster than my approach, since it just has to sort each set of three and then can do the comparison to your existing sets much faster than my approach can.
Probably the best way is to use a HashSet, if you don't need to have duplicate elements in your sets. It sounds like each set of 3 has 3 unique elements; if that is actually the case, I would combine a HashSet approach with the concatenation that you already worked out, i.e. order the elements, combine them with some separator, and then add the concatenated elements to a HashSet, which prevents duplicates from ever occurring in the first place.
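A minimal sketch of that sort-then-join idea (the separator is my choice; any character that cannot appear in the strings works):
using System;
using System.Collections.Generic;
using System.Linq;

static bool AddTriple(HashSet<string> seen, string a, string b, string c)
{
    // sort first so {A,B,C} and {C,A,B} produce the same key,
    // then join with a separator that cannot occur in the data
    string key = string.Join("\u001F",
        new[] { a, b, c }.OrderBy(s => s, StringComparer.Ordinal));
    return seen.Add(key); // false if an equivalent set was already added
}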
If your sets of three could have duplicate elements, then Kevin's approach is what you're going to have to do for each. You might get better performance from using a list of HashSets for each set of three, but with only three elements the overhead of creating a hash for each element of potentially millions of sets seems like it would perform worse than just iterating over them once.
Here is a simple string wrapper for you:
/// The wrapper for three strings
public class StringTriplet
{
    private List<string> Store;

    // accessors to the three source strings:
    public string A { get; private set; }
    public string B { get; private set; }
    public string C { get; private set; }

    // constructor (needs to fill the internal storage)
    public StringTriplet(string a, string b, string c)
    {
        this.Store = new List<string>();
        this.Store.Add(a);
        this.Store.Add(b);
        this.Store.Add(c);
        // sorting is required, because later we don't want to compare all strings with each other
        this.Store.Sort();
        this.A = a;
        this.B = b;
        this.C = c;
    }

    // additional method. you could add an IComparable declaration to the entire class, but it is not necessary for your task...
    public int CompareTo(StringTriplet obj)
    {
        if (null == obj)
            return -1;
        int cmp;
        cmp = this.Store.Count.CompareTo(obj.Store.Count);
        if (0 != cmp)
            return cmp;
        for (int i = 0; i < this.Store.Count; i++)
        {
            if (null == this.Store[i])
                return 1;
            cmp = this.Store[i].CompareTo(obj.Store[i]);
            if (0 != cmp)
                return cmp;
        }
        return 0;
    }

    // additional method. it is good practice to override both 'Equals' and 'GetHashCode'. See below..
    override public bool Equals(object obj)
    {
        if (!(obj is StringTriplet))
            return false;
        var t = obj as StringTriplet;
        return (0 == this.CompareTo(t));
    }

    // necessary method. it will be used implicitly when adding values to the HashSet
    public override int GetHashCode()
    {
        int res = 0;
        for (int i = 0; i < this.Store.Count; i++)
            res = res ^ (null == this.Store[i] ? 0 : this.Store[i].GetHashCode()) ^ i;
        return res;
    }
}
Now you could just create a hash set and add values:
var t = new HashSet<StringTriplet>();
t.Add(new StringTriplet("a", "b", "c"));
t.Add(new StringTriplet("a", "b1", "c"));
t.Add(new StringTriplet("a", "b", "c")); // dup
t.Add(new StringTriplet("a", "c", "b")); // dup
t.Add(new StringTriplet("1", "2", "3"));
t.Add(new StringTriplet("1", "2", "4"));
t.Add(new StringTriplet("3", "2", "1")); // dup (same set as "1", "2", "3")

foreach (var s in t)
{
    Console.WriteLine(s.A + " " + s.B + " " + s.C);
}
You can inherit from List<String> and override the Equals() and GetHashCode() methods:
public class StringList : List<String>
{
    public override bool Equals(object obj)
    {
        StringList other = obj as StringList;
        if (other == null) return false;
        // the count check matters: without it, {A} would equal {A, B}
        return this.Count == other.Count && this.All(x => other.Contains(x));
    }

    public override int GetHashCode()
    {
        unchecked
        {
            // summing element hashes keeps the result order-insensitive,
            // consistent with the order-insensitive Equals above
            int hash = 19;
            foreach (String s in this)
            {
                hash = hash + s.GetHashCode() * 31;
            }
            return hash;
        }
    }
}
Now you can use a HashSet<StringList> to store only unique sets.
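For example (a quick sketch):
var sets = new HashSet<StringList>();
sets.Add(new StringList { "A", "B", "C" });
bool added = sets.Add(new StringList { "C", "A", "B" }); // false: same set, different order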

Efficient storage, lookup and manipulation of a large undirected graph

I have an undirected graph G with about 5000 nodes. Any pair of nodes may be connected by an edge. Length, direction, or other features of the edge are irrelevant, and there can be at most one edge between two points, so the relationships between nodes are binary. Thus, there are 12497500 total potential edges.
Each node is identified by a string name, not a number.
I would like to store such a graph (loaded as input data into my program) but I'm not sure what kind of data structure is best.
I will need to lookup whether a given pair of nodes are connected or not, many many times, so lookup performance is probably the primary concern.
Performance cost of adding and removing elements is not a concern.
I would also like to keep the syntax simple and elegant if possible, to reduce the possibility of introducing bugs and make debugging easier.
Two possibilities:
bool[numNodes, numNodes] and a Dictionary<string, int> to match each node name to an index. Pros: simple, fast lookup. Cons: can't easily remove or add nodes (would have to add/remove rows/columns), redundant (have to be careful about g[n1, n2] vs g[n2, n1]), clumsy syntax because I have to go through the Dictionary every time.
HashSet<HashSet<string>>. Pros: Intuitive, nodes directly identified by strings, easy to "add/remove nodes" because only edges are stored and nodes themselves are implicit. Cons: Possible to enter garbage input (edges that "connect" three nodes because the set has three members).
Regarding the second option, I am also unclear about a number of things:
Is it going to take much more memory than an array of bool?
Are two .NET sets equivalent to mathematical sets in the sense that they are equal if and only if they have exactly the same members (as opposed to being distinguished by capacity, element order, and so on) for purposes of HashSet membership? (i.e., is querying with outerSets.Contains(new HashSet<string>{"node1", "node2"}) actually going to work? See the note after this list.)
Is lookup going to take much longer than an array of bool?
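A note on the second of those questions: by default, a HashSet<HashSet<string>> compares the inner sets by reference, so that Contains query would return false. Constructing the outer set with HashSet<string>.CreateSetComparer() supplies set-value equality and makes it work:
using System;
using System.Collections.Generic;

var outerSets = new HashSet<HashSet<string>>(HashSet<string>.CreateSetComparer());
outerSets.Add(new HashSet<string> { "node1", "node2" });

// True: the set comparer matches on contents, regardless of insertion order.
bool found = outerSets.Contains(new HashSet<string> { "node2", "node1" });
Console.WriteLine(found);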
I was curious about using string concatenation vs. a Tuple when generating a key representing an edge in a hashtable in order to approach O(1) lookup performance. Two possibilities here for handling the undirected edge requirement:
Normalize the key so that it is the same no matter which node is specified first in the description of the edge. In my test, I simply choose to take the node with lowest ordinal comparison value as being the first component in the key.
Make two entries in the hashtable, one for each direction of the edge.
A critical assumption here is that the string node identifiers are not very long, so that key normalization is inexpensive relative to the lookup.
The string concatenation and Tuple versions with key normalization seem to perform about the same: each completed about 2 million random lookups in about 3 seconds in a VirtualBox VM in release mode.
To see if the key normalization was swamping the effect of the lookup operation, a third implementation does no key normalization but maintains symmetric entries with respect to both possible directions of an edge. This turns out to be about 30-40% slower on lookups, which was slightly unexpected (to me). Perhaps the underlying hash table buckets have higher average occupancy due to having twice the number of elements, requiring longer linear searches within each hash bucket (on average)?
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;

interface IEdgeCollection
{
    bool AddEdge(string node1, string node2);
    bool ContainsEdge(string node1, string node2);
    bool RemoveEdge(string node1, string node2);
}
class EdgeSet1 : IEdgeCollection
{
    private HashSet<string> _edges = new HashSet<string>();

    private static string MakeEdgeKey(string node1, string node2)
    {
        // note: concatenating without a separator can produce ambiguous keys
        // ("ab" + "c" == "a" + "bc"); a delimiter that cannot occur in node
        // names would avoid that
        return StringComparer.Ordinal.Compare(node1, node2) < 0 ? node1 + node2 : node2 + node1;
    }

    public bool AddEdge(string node1, string node2)
    {
        var key = MakeEdgeKey(node1, node2);
        return _edges.Add(key);
    }

    public bool ContainsEdge(string node1, string node2)
    {
        var key = MakeEdgeKey(node1, node2);
        return _edges.Contains(key);
    }

    public bool RemoveEdge(string node1, string node2)
    {
        var key = MakeEdgeKey(node1, node2);
        return _edges.Remove(key);
    }
}
class EdgeSet2 : IEdgeCollection
{
    private HashSet<Tuple<string, string>> _edges = new HashSet<Tuple<string, string>>();

    private static Tuple<string, string> MakeEdgeKey(string node1, string node2)
    {
        return StringComparer.Ordinal.Compare(node1, node2) < 0
            ? new Tuple<string, string>(node1, node2)
            : new Tuple<string, string>(node2, node1);
    }

    public bool AddEdge(string node1, string node2)
    {
        var key = MakeEdgeKey(node1, node2);
        return _edges.Add(key);
    }

    public bool ContainsEdge(string node1, string node2)
    {
        var key = MakeEdgeKey(node1, node2);
        return _edges.Contains(key);
    }

    public bool RemoveEdge(string node1, string node2)
    {
        var key = MakeEdgeKey(node1, node2);
        return _edges.Remove(key);
    }
}
class EdgeSet3 : IEdgeCollection
{
    private HashSet<Tuple<string, string>> _edges = new HashSet<Tuple<string, string>>();

    private static Tuple<string, string> MakeEdgeKey(string node1, string node2)
    {
        return new Tuple<string, string>(node1, node2);
    }

    public bool AddEdge(string node1, string node2)
    {
        var key1 = MakeEdgeKey(node1, node2);
        var key2 = MakeEdgeKey(node2, node1);
        // evaluate both adds; a short-circuiting && would skip the second
        // whenever the first returns false, leaving the set asymmetric
        bool added1 = _edges.Add(key1);
        bool added2 = _edges.Add(key2);
        return added1 && added2;
    }

    public bool ContainsEdge(string node1, string node2)
    {
        var key = MakeEdgeKey(node1, node2);
        return _edges.Contains(key);
    }

    public bool RemoveEdge(string node1, string node2)
    {
        var key1 = MakeEdgeKey(node1, node2);
        var key2 = MakeEdgeKey(node2, node1);
        bool removed1 = _edges.Remove(key1);
        bool removed2 = _edges.Remove(key2);
        return removed1 && removed2;
    }
}
class Program
{
    static void Test(string[] nodes, IEdgeCollection edges, int edgeCount)
    {
        // use edgeCount as the rng seed to ensure test reproducibility
        var rng = new Random(edgeCount);
        // store known edges in a separate data structure for validation
        var edgeList = new List<Tuple<string, string>>();
        Stopwatch stopwatch = new Stopwatch();

        // randomly generated edges
        stopwatch.Restart();
        for (int i = 0; i < edgeCount; i++)
        {
            string node1 = nodes[rng.Next(nodes.Length)];
            string node2 = nodes[rng.Next(nodes.Length)];
            edges.AddEdge(node1, node2);
            edgeList.Add(new Tuple<string, string>(node1, node2));
        }
        var addElapsed = stopwatch.Elapsed;

        // non-random lookups
        // (Restart, not Start: calling Start on a running stopwatch does
        // nothing, so the phase timings would otherwise accumulate)
        int nonRandomFound = 0;
        stopwatch.Restart();
        foreach (var edge in edgeList)
        {
            if (edges.ContainsEdge(edge.Item1, edge.Item2))
                nonRandomFound++;
        }
        var nonRandomLookupElapsed = stopwatch.Elapsed;
        if (nonRandomFound != edgeList.Count)
        {
            Console.WriteLine("The edge collection {0} is not working right!", edges.GetType().FullName);
            return;
        }

        // random lookups
        int randomFound = 0;
        stopwatch.Restart();
        for (int i = 0; i < edgeCount; i++)
        {
            string node1 = nodes[rng.Next(nodes.Length)];
            string node2 = nodes[rng.Next(nodes.Length)];
            if (edges.ContainsEdge(node1, node2))
                randomFound++;
        }
        var randomLookupElapsed = stopwatch.Elapsed;

        // remove all
        stopwatch.Restart();
        foreach (var edge in edgeList)
        {
            edges.RemoveEdge(edge.Item1, edge.Item2);
        }
        var removeElapsed = stopwatch.Elapsed;

        Console.WriteLine("Test: {0} with {1} edges: {2}s addition, {3}s non-random lookup, {4}s random lookup, {5}s removal",
            edges.GetType().FullName,
            edgeCount,
            addElapsed.TotalSeconds,
            nonRandomLookupElapsed.TotalSeconds,
            randomLookupElapsed.TotalSeconds,
            removeElapsed.TotalSeconds);
    }

    static void Main(string[] args)
    {
        var rng = new Random();
        var nodes = new string[5000];
        for (int i = 0; i < nodes.Length; i++)
        {
            StringBuilder name = new StringBuilder();
            int length = rng.Next(7, 15);
            for (int j = 0; j < length; j++)
            {
                name.Append((char)rng.Next(32, 127));
            }
            nodes[i] = name.ToString();
        }

        IEdgeCollection edges1 = new EdgeSet1();
        IEdgeCollection edges2 = new EdgeSet2();
        IEdgeCollection edges3 = new EdgeSet3();
        Test(nodes, edges1, 2000000);
        Test(nodes, edges2, 2000000);
        Test(nodes, edges3, 2000000);
        Console.ReadLine();
    }
}
The C5 Collections Library has some useful stuff regarding graphs
http://www.itu.dk/research/c5/
This question, Most efficient implementation for a complete undirected graph, looks useful as well.
SortedDictionary is, under the hood, a height-balanced red-black tree, so lookups are O(log n).
