Speed up working with big arrays of strings in C#

This method takes the most frequent words from a string array.
It works very slowly for big arrays (around 190,000 milliseconds for 70,000 strings).
I've measured (using Stopwatch()) that its first part is the slowest one:
public static List<WordDouble> MostFrequentWords(double count, string[] words)
{
    var wordsAndNumbers = new List<WordDouble>();
    foreach (var word in words)
    {
        if (wordsAndNumbers.Exists(e => e.Word == word.ToLower()))
            wordsAndNumbers[wordsAndNumbers.FindIndex(e => e.Word == word.ToLower())].Count++;
        else
        {
            var addWord = new WordDouble();
            addWord.Word = word.ToLower();
            addWord.Count = 1;
            wordsAndNumbers.Add(addWord);
        }
    }
    /* method goes on, other parts work fast and do not need improvement */
    ...
    return something;
}

public class WordDouble
{
    public string Word;
    public double Count;
}
How can I improve the performance of this method?

Checking for an item using Exists in a list is an O(n) operation, while checking for an item in a dictionary is an O(1) operation.
This runs in a fraction of the time (actually in about 1/2200 of the time):
Dictionary<string, int> wordsAndNumbers = new Dictionary<string, int>();
foreach (string word in words) {
    if (wordsAndNumbers.ContainsKey(word.ToLower())) {
        wordsAndNumbers[word.ToLower()]++;
    } else {
        wordsAndNumbers.Add(word.ToLower(), 1);
    }
}
Here is the result of a test run with 70000 strings, for the original code, my code, and Console's code, respectively:
00:01:21.0804944
00:00:00.0360415
00:00:00.1060375
You can even speed it up a little more by doing ToLower only once in the loop:
var wordsAndNumbers = new Dictionary<string, int>();
foreach (var word in words) {
    string s = word.ToLower();
    if (wordsAndNumbers.ContainsKey(s)) {
        wordsAndNumbers[s]++;
    } else {
        wordsAndNumbers.Add(s, 1);
    }
}
Test run:
00:00:00.0235761

First of all, why do you use a double to count words?
Use a Dictionary<string, long>, and instead of converting to lower case just for the comparison, give the dictionary a case-insensitive comparer:
Dictionary<string, long> wordsAndNumbers = new Dictionary<string, long>(StringComparer.OrdinalIgnoreCase);
foreach (var word in words)
{
    if (!wordsAndNumbers.ContainsKey(word))
        wordsAndNumbers[word] = 1;
    else
        wordsAndNumbers[word]++;
}
With 70,000 words I get the following runtime: 00:00:00.0152345, which is significantly faster than the ToLower solution, which takes 00:00:00.0320127 on my machine.
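For reference, here is a minimal sketch of how the two variants can be timed against each other with Stopwatch, in the spirit of the question's own measurement (it assumes a string[] words holding the test data and the usual System.Collections.Generic and System.Diagnostics namespaces):
var sw = Stopwatch.StartNew();
var lower = new Dictionary<string, long>();
foreach (var word in words)
{
    string s = word.ToLower();
    if (lower.ContainsKey(s)) lower[s]++;
    else lower.Add(s, 1);
}
Console.WriteLine("ToLower:           " + sw.Elapsed);

sw.Restart();
var ignoreCase = new Dictionary<string, long>(StringComparer.OrdinalIgnoreCase);
foreach (var word in words)
{
    if (ignoreCase.ContainsKey(word)) ignoreCase[word]++;
    else ignoreCase[word] = 1;
}
Console.WriteLine("OrdinalIgnoreCase: " + sw.Elapsed);
The comparer-based dictionary also avoids allocating a lowered copy of every word, which is likely where the extra time in the ToLower variant goes.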

Related

Why does my List save only the last element and the last occurrence? [duplicate]

This question already has answers here:
Why does adding a new value to list<> overwrite previous values in the list<>
(2 answers)
Closed 10 months ago.
So I have this code:
string word = "abba";
Counter(word);

public static List<Occur> Counter(string word)
{
    Occur Oc = new Occur();
    var ListNubmersOfWord = new List<Occur>();
    foreach (char c in word)
    {
        Oc.Letter = c;
        Oc.Number = word.Where(x => x == c).Count();
        ListNubmersOfWord.Add(Oc);
    }
    foreach (Occur item in ListNubmersOfWord)
    {
        Console.WriteLine(string.Join(" ", $"{item.Letter} {item.Number}"));
    }
    return ListNubmersOfWord;
}
Here is the Occur class:
public class Occur
{
    public int Number { get; set; }
    public char Letter { get; set; }
}
And the problem is that for some reason the list "ListNubmersOfWord" only saves the last letter and the last occurrence count. Any ideas?
In each iteration of the loop you are setting Letter and Number on the same object, so every entry in the list references that one instance. Instantiate a new object inside the body of the foreach loop:
foreach (char c in word)
{
    Occur Oc = new Occur
    {
        Letter = c,
        Number = word.Where(x => x == c).Count()
    };
    ListNubmersOfWord.Add(Oc);
}
Not only is a list of objects inefficient for your use case, it would also produce repeated values in your example; you don't need the characters a and b in the list twice with the same occurrence count. I would suggest using a Dictionary<char, int>, which makes your method O(n), where n equals the number of characters in your string. You can also look up whether a character occurs, and get its count, in O(1):
string word = "abba";
Dictionary<char, int> charcount = Counter(word);

public static Dictionary<char, int> Counter(string word)
{
    Dictionary<char, int> charcount = new Dictionary<char, int>();
    foreach (char c in word)
    {
        if (charcount.ContainsKey(c))
            charcount[c]++;
        else
            charcount.Add(c, 1);
    }
    return charcount;
}

C# looping through a list to find character counts

I'm trying to loop through a string to find each character, its ASCII value, and the number of times the character occurs. So far, I have found each unique character and ASCII value using foreach statements, checking whether the value was already in the list and only adding it if it wasn't. However, I'm struggling with the count portion. I was thinking the logic would be "if I am already in the list, don't count me again, however, increment my frequency".
I've tried a few different things, such as trying to find the index of the character it found and adding to that specific index, but I'm lost.
string String = "hello my name is lauren";
char[] String1 = String.ToCharArray();
// int[] frequency = new int[String1.Length]; // array of frequency counters
int length = 0;
List<char> letters = new List<char>();
List<int> ascii = new List<int>();
List<int> frequency = new List<int>();

foreach (int ASCII in String1)
{
    bool exists = ascii.Contains(ASCII);
    if (exists)
    {
        // add to frequency at same index
        // ascii.Insert(1, ascii);
        // get { ASCII[index]; }
    }
    else
    {
        ascii.Add(ASCII);
        // add to frequency at new index
    }
}

foreach (char letter in String1)
{
    bool exists = letters.Contains(letter);
    if (exists)
    {
        // add to frequency at same index
    }
    else
    {
        letters.Add(letter);
        // add to frequency at new index
    }
}

length = letters.Count;
for (int j = 0; j < length; ++j)
{
    Console.WriteLine($"{letters[j].ToString(),3} {"(" + ascii[j] + ")"}\t");
}
Console.ReadLine();
I'm not sure if I understand your question, but what you are looking for may be Dictionary<TKey, TValue> instead of List<T>. Here are examples of solutions to the problems I think you are trying to solve.
Counting the frequency of character appearances:
Dictionary<int, int> frequency = new Dictionary<int, int>();
foreach (int j in String)
{
    if (frequency.ContainsKey(j))
    {
        frequency[j] += 1;
    }
    else
    {
        frequency.Add(j, 1);
    }
}
A method to link characters to their ASCII values:
Dictionary<char, int> ASCIIofCharacters = new Dictionary<char, int>();
foreach (char i in String)
{
    if (ASCIIofCharacters.ContainsKey(i))
    {
    }
    else
    {
        ASCIIofCharacters.Add(i, (int)i);
    }
}
A simple LINQ approach is to do this:
string String = "hello my name is lauren";
var results =
    String
    .GroupBy(x => x)
    .Select(x => new { character = x.Key, ascii = (int)x.Key, frequency = x.Count() })
    .ToArray();
That gives an array of anonymous objects, one per distinct character, each with the character, its ASCII value, and its frequency.
If I understood your question, you want to map each char in the provided string to the count of times it appears in the string, right?
If that is the case, there are tons of ways to do that, and you also need to choose which data structure you want to store the result in.
Assuming you want to use LINQ and store the result in a Dictionary<char, int>, you could do something like this:
static IDictionary<char, int> getAsciiAndFrequencies(string str) {
    return (
        from c in str
        group c by Convert.ToChar(c)
    ).ToDictionary(c => c.Key, c => c.Count());
}
And use it like this:
var f = getAsciiAndFrequencies("hello my name is lauren");
// result: { h: 1, e: 3, l: 3, o: 1, ... }
You are creating a histogram. But you should not use List.Contains, because it becomes inefficient as the list grows: it has to walk the list one item after another. Better to use a Dictionary, which is based on hashing, so you go directly to the item. The code may look like this:
string str = "hello my name is lauren";
var dict = new Dictionary<char, int>();
foreach (char c in str)
{
    dict.TryGetValue(c, out int count);
    dict[c] = ++count;
}
foreach (var pair in dict.OrderBy(r => r.Key))
{
    Console.WriteLine(pair.Value + "x " + pair.Key + " (" + (int)pair.Key + ")");
}
which gives
4x (32)
2x a (97)
3x e (101)
1x h (104)
1x i (105)
3x l (108)
2x m (109)
2x n (110)
1x o (111)
1x r (114)
1x s (115)
1x u (117)
1x y (121)

How to sort a dictionary in C# .net

I need to make a frequency analysis console program using C#. It has to show the 10 most frequent letters from a text file. I have managed to display the first 10 letters read by the program and the frequency of each character. I, however, don't know how to sort the dictionary. This is the code I have so far.
I must also give the user the option to do the frequency analysis in case-sensitive mode (as it is right now) and in case-insensitive mode. Help with this issue will also be appreciated. Thank you!
static void Main(string[] args)
{
    // 1.
    // Array to store frequencies.
    int[] c = new int[(int)char.MaxValue];

    // 2.
    // Read entire text file.
    // string root = Server.MapPath("~");
    // string FileName = root + "/App_Data/text.txt";
    // string s = File.ReadAllText(FileName);
    foreach (string line in File.ReadLines(@"c:\Users\user\Documents\Visual Studio 2015\Projects\ConsoleApplication1\ConsoleApplication1\App_Data\text.txt", Encoding.UTF8))
    {
        var fileStream = new FileStream(@"c:\Users\user\Documents\Visual Studio 2015\Projects\ConsoleApplication1\ConsoleApplication1\App_Data\text.txt", FileMode.Open, FileAccess.Read);
        using (var streamReader = new StreamReader(fileStream, Encoding.UTF8))
        {
            string line2;
            while ((line2 = streamReader.ReadLine()) != null)
            {
                // process the line
                // 3.
                // Iterate over each character.
                foreach (char t in line)
                {
                    // Increment table.
                    c[(int)t]++;
                }

                // 4.
                // Write all letters found.
                int counter = 0;
                for (int i = 0; i < (int)char.MaxValue; i++)
                {
                    if (c[i] > 0 && counter < 11 &&
                        char.IsLetterOrDigit((char)i))
                    {
                        ++counter;
                        Console.WriteLine("Letter: {0} Frequency: {1}",
                            (char)i,
                            c[i]);
                    }
                }
            }
        }
        Console.ReadLine();
    }
}
If all you want to do is find frequencies, you don't need any dictionaries, just LINQ. Such tasks are exactly what LINQ was designed for:
...
using System.Linq;
...

static void Main(string[] args) {
    var result = File
        .ReadLines(@"...", Encoding.UTF8)
        .SelectMany(line => line) // string into characters
        .Where(c => char.IsLetterOrDigit(c))
        .GroupBy(c => c)
        .Select(chunk => new {
            Letter = chunk.Key,
            Count = chunk.Count() })
        .OrderByDescending(item => item.Count)
        .ThenBy(item => item.Letter) // in case of a tie, sort by letter
        .Take(10)
        .Select(item => $"{item.Letter} freq. {item.Count}"); // $"..." - C# 6.0 syntax

    Console.Write(string.Join(Environment.NewLine, result));
}
I like @Dmitry Bychenko's answer because it's very terse. But if you have a very large file then that solution may not be optimal for you, because it has to read the entire file into memory to process it. In my tests, I got up to around 1 GB of memory usage for a 500 MB file. The solution below, while not quite as terse, uses constant memory (basically none) and runs as fast as or faster than the LINQ version in my tests.
Dictionary<char, int> freq = new Dictionary<char, int>();
using (StreamReader sr = new StreamReader(@"yourBigFile")) {
    string line;
    while ((line = sr.ReadLine()) != null) {
        foreach (char c in line) {
            if (!freq.ContainsKey(c)) {
                freq[c] = 0;
            }
            freq[c]++;
        }
    }
}
var result = freq.Where(c => char.IsLetterOrDigit(c.Key)).OrderByDescending(x => x.Value).Take(10);
Console.WriteLine(string.Join(Environment.NewLine, result));
It would be easier to use the actual Dictionary type in C# here, rather than an array:
Dictionary<char, int> characterCountDictionary = new Dictionary<char, int>();
You add a key if it doesn't exist already (inserting a value of 1), or you increment the value if it does exist. Then you can pull the keys of your dictionary out as a list, sort them, and iterate to find the values. For case-insensitive analysis you'd just convert upper case to lower case before inserting into the dictionary.
Here's the MSDN page for the examples for Dictionary: https://msdn.microsoft.com/en-us/library/xfhwa508(v=vs.110).aspx#Examples
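A minimal sketch of that approach, assuming the same kind of text file as in the question and an illustrative caseSensitive flag for the user's choice (sorting here is done by value with LINQ rather than by pulling the keys into a list):
bool caseSensitive = false; // let the user choose this
var counts = new Dictionary<char, int>();

foreach (string line in File.ReadLines("text.txt", Encoding.UTF8))
{
    foreach (char ch in line)
    {
        if (!char.IsLetter(ch)) continue;
        char key = caseSensitive ? ch : char.ToLower(ch);
        if (counts.ContainsKey(key))
            counts[key]++;      // increment the existing entry
        else
            counts.Add(key, 1); // or insert it with a value of 1
    }
}

// Sort by frequency (descending) and show the 10 most frequent letters.
foreach (var pair in counts.OrderByDescending(p => p.Value).Take(10))
    Console.WriteLine("Letter: {0} Frequency: {1}", pair.Key, pair.Value);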

Counting words using LinkedList

I have a class WordCount which has a string wordDic and an int count. Next, I have a List<WordCount>.
I have ANOTHER List<string> which has lots of words inside it. I am trying to use the List<WordCount> to count the occurrences of each word inside the List<string>.
Below is where I am stuck.
class WordCount
{
    string wordDic;
    int count;
}

List<WordCount> usd = new List<WordCount>();
foreach (string word in wordsList)
{
    if (usd.wordDic.Contains(new WordCount { wordDic = word, count = 0 }))
        usd.count[value] = usd.counts[value] + 1;
    else
        usd.Add(new WordCount() { wordDic = word, count = 1 });
}
I don't know how to properly implement this in code, but I am trying to search my List to see if the word in wordsList already exists; if it does, add 1 to count, and if it doesn't, insert it into usd with a count of 1.
Note: I have to use Lists to do this. I am not allowed to use anything else like hash tables...
This is the answer from before you edited the question to only use lists... by the way, what is driving that requirement?
List<string> words = new List<string> {...};

// For case-insensitive counting you can instantiate with
// new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase)
Dictionary<string, int> counts = new Dictionary<string, int>();
foreach (string word in words)
{
    if (counts.ContainsKey(word))
    {
        counts[word] += 1;
    }
    else
    {
        counts[word] = 1;
    }
}
If you can only use lists, can you use a List<KeyValuePair<string, int>> counts, which is essentially the same thing as a dictionary (although I'm not sure it would guarantee uniqueness)? The solution would be very similar; a sketch of that variant follows the two-list code below. If you can only use plain lists, the following will work:
List<string> words = new List<string> {...};
List<string> foundWord = new List<string>();
List<int> countWord = new List<int>();
foreach (string word in words)
{
    if (foundWord.Contains(word))
    {
        countWord[foundWord.IndexOf(word)] += 1;
    }
    else
    {
        foundWord.Add(word);
        countWord.Add(1);
    }
}
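For the List<KeyValuePair<string, int>> variant mentioned above, keep in mind that KeyValuePair is immutable, so incrementing a count means replacing the element; a rough sketch:
List<KeyValuePair<string, int>> counts = new List<KeyValuePair<string, int>>();
foreach (string word in words)
{
    int index = counts.FindIndex(p => p.Key == word);
    if (index >= 0)
        // KeyValuePair cannot be modified in place, so replace it with an incremented copy.
        counts[index] = new KeyValuePair<string, int>(word, counts[index].Value + 1);
    else
        counts.Add(new KeyValuePair<string, int>(word, 1));
}
Like the two-list version, this is still O(n) per lookup, so it only buys you a single collection, not dictionary-like speed.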
Using your WordCount class
List<string> words = new List<string> {...};
List<WordCount> foundWord = new List<WordCount>();
foreach (string word in words)
{
    WordCount match = foundWord.SingleOrDefault(w => w.wordDic == word);
    if (match != null)
    {
        match.count += 1;
    }
    else
    {
        foundWord.Add(new WordCount { wordDic = word, count = 1 });
    }
}
You can use LINQ to do this.
static void Main(string[] args)
{
    List<string> wordsList = new List<string>()
    {
        "Cat",
        "Dog",
        "Cat",
        "Hat"
    };

    List<WordCount> usd = wordsList.GroupBy(x => x)
        .Select(x => new WordCount() { wordDic = x.Key, count = x.Count() })
        .ToList();
}
Use LINQ. Assuming your array of words:
string[] words = { "blueberry", "chimpanzee", "abacus", "banana", "abacus", "apple", "cheese" };
You can do:
var count =
    from word in words
    group word.ToUpper() by word.ToUpper() into g
    where g.Count() > 0
    select new { g.Key, Count = g.Count() };
(or in your case, select new WordCount()... it'll depend on how you have your constructor set up)...
The result will be a sequence of key/count pairs, one per distinct word (upper-cased), with the number of times it occurs.
First, all of your class members are private, so they cannot be accessed from outside the class; let's assume they are accessible wherever you use them.
Second, your count member is an int, so the following statement will not work:
usd.count[value] = usd.counts[value] + 1;
And I think you've made a typo there between counts and count.
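Also, for the loop below to compile, the members have to be reachable from the calling code, for example like this (just a sketch; public auto-properties would work equally well):
class WordCount
{
    public string wordDic;
    public int count;
}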
To solve your problem, find the counter corresponding to your word. If it exists, increase its count value; otherwise, create a new one:
foreach (string word in wordsList) {
    WordCount counter = usd.Find(c => c.wordDic == word);
    if (counter != null) // Counter exists
        counter.count++;
    else
        usd.Add(new WordCount() { wordDic = word, count = 1 }); // Create new one
}
You should use a Dictionary, as it's faster when doing the "Contains" check.
Just replace your list with this:
Dictionary<string, WordCount> usd = new Dictionary<string, WordCount>();
foreach (string word in wordsList)
{
    if (usd.ContainsKey(word.ToLower()))
        usd[word.ToLower()].count++;
    else
        usd.Add(word.ToLower(), new WordCount() { wordDic = word, count = 1 });
}

Create big Two-Dimensional Array

Simple question:
How can I use a huge two-dimensional array in C#? What I want to do is the following:
int[] Nodes = new int[1146445];
int[,] Relations = new int[Nodes.Length, Nodes.Length];
It just figures that I got an out-of-memory error.
Is there a chance to work with such big data in memory? (4 GB RAM and a 6-core CPU)
The integers I want to save in the two-dimensional array are small, I guess from 0 to 1000.
Update: I tried to save the Relations using a Dictionary<KeyValuePair<int, int>, int>. It works for some of the adding loops. Here is the class which should create the graph. The instance of CreateGraph gets its data from an XML stream reader.
Main (C# backgroundWorker_DoWork)
ReadXML Reader = new ReadXML(tBOpenFile.Text);
CreateGraph Creater = new CreateGraph();

int WordsCount = (int)nUDLimit.Value;
if (nUDLimit.Value == 0) WordsCount = Reader.CountWords();

// word loop
for (int Position = 0; Position < WordsCount; Position++)
{
    // reading and parsing
    Reader.ReadNextWord();
    // add to graph builder
    Creater.AddWord(Reader.CurrentWord, Reader.GetRelations(Reader.CurrentText));
}

string[] Words = Creater.GetWords();
Dictionary<KeyValuePair<int, int>, int> Relations = Creater.GetRelations();
ReadXML
class ReadXML
{
    private string Path;
    private XmlReader Reader;
    protected int Word;

    public string CurrentWord;
    public string CurrentText;

    public ReadXML(string FilePath)
    {
        Path = FilePath;
        LoadFile();
        Word = 0;
    }

    public int CountWords()
    {
        // caching
        if (Path.Contains("filename") == true) return 1000;

        int Words = 0;
        while (Reader.Read())
        {
            if (Reader.NodeType == XmlNodeType.Element & Reader.Name == "word")
            {
                Words++;
            }
        }
        LoadFile();
        return Words;
    }

    public void ReadNextWord()
    {
        while (Reader.Read())
        {
            if (Reader.NodeType == XmlNodeType.Element & Reader.Name == "word")
            {
                while (Reader.Read())
                {
                    if (Reader.NodeType == XmlNodeType.Element & Reader.Name == "name")
                    {
                        XElement Title = XElement.ReadFrom(Reader) as XElement;
                        CurrentWord = Title.Value;
                        break;
                    }
                }
                while (Reader.Read())
                {
                    if (Reader.NodeType == XmlNodeType.Element & Reader.Name == "rels")
                    {
                        XElement Text = XElement.ReadFrom(Reader) as XElement;
                        CurrentText = Text.Value;
                        break;
                    }
                }
                break;
            }
        }
    }

    public Dictionary<string, int> GetRelations(string Text)
    {
        Dictionary<string, int> Relations = new Dictionary<string, int>();
        string[] RelationStrings = Text.Split(';');
        foreach (string RelationString in RelationStrings)
        {
            string[] SplitString = RelationString.Split(':');
            if (SplitString.Length == 2)
            {
                string RelationName = SplitString[0];
                int RelationWeight = Convert.ToInt32(SplitString[1]);
                Relations.Add(RelationName, RelationWeight);
            }
        }
        return Relations;
    }

    private void LoadFile()
    {
        Reader = XmlReader.Create(Path);
        Reader.MoveToContent();
    }
}
CreateGraph
class CreateGraph
{
    private Dictionary<string, int> CollectedWords = new Dictionary<string, int>();
    private Dictionary<KeyValuePair<int, int>, int> CollectedRelations = new Dictionary<KeyValuePair<int, int>, int>();

    public void AddWord(string Word, Dictionary<string, int> Relations)
    {
        int SourceNode = GetIdCreate(Word);
        foreach (KeyValuePair<string, int> Relation in Relations)
        {
            int TargetNode = GetIdCreate(Relation.Key);
            CollectedRelations.Add(new KeyValuePair<int, int>(SourceNode, TargetNode), Relation.Value); // here is the error located
        }
    }

    public string[] GetWords()
    {
        string[] Words = new string[CollectedWords.Count];
        foreach (KeyValuePair<string, int> CollectedWord in CollectedWords)
        {
            Words[CollectedWord.Value] = CollectedWord.Key;
        }
        return Words;
    }

    public Dictionary<KeyValuePair<int, int>, int> GetRelations()
    {
        return CollectedRelations;
    }

    private int WordsIndex = 0;

    private int GetIdCreate(string Word)
    {
        if (!CollectedWords.ContainsKey(Word))
        {
            CollectedWords.Add(Word, WordsIndex);
            WordsIndex++;
        }
        return CollectedWords[Word];
    }
}
Now I get another error: An element with the same key already exists. (At the Add in the CreateGraph class.)
You'll have a better chance if you set Relations up as a jagged array (an array of arrays):
// int[,] Relations = new int[Nodes.Length, Nodes.Length];
int[][] Relations = new int[Nodes.Length][];
for (int i = 0; i < Relations.Length; i++)
    Relations[i] = new int[Nodes.Length];
And then you still need 10k * 10k * sizeof(int) = 400 MB,
which should be possible, even when running in 32-bit.
Update:
With the new number, it's 1M * 1M * 4 = 4 TB; that's not going to work.
And using short to replace int will only bring it down to 2 TB.
Since you seem to need to assign weights to (sparse) connections between nodes, you should see if something like this could work:
struct WeightedRelation
{
    public readonly int node1;
    public readonly int node2;
    public readonly int weight;

    public WeightedRelation(int node1, int node2, int weight)
    {
        this.node1 = node1;
        this.node2 = node2;
        this.weight = weight;
    }
}

int[] Nodes = new int[1146445];
List<WeightedRelation> Relations = new List<WeightedRelation>();
Relations.Add(new WeightedRelation(1, 2, 10));
...
This is just the basic idea; you may need a double dictionary to do fast lookups. But your memory use would be proportional to the number of actual (non-zero) relations.
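A sketch of that double-dictionary idea, an outer dictionary keyed by the first node and an inner one keyed by the second, storing the weight (node1, node2 and weight are assumed to be ints already in scope):
// Sparse storage for weights: weights[node1][node2] = weight.
var weights = new Dictionary<int, Dictionary<int, int>>();

// Adding or updating a relation:
Dictionary<int, int> inner;
if (!weights.TryGetValue(node1, out inner))
{
    inner = new Dictionary<int, int>();
    weights[node1] = inner;
}
inner[node2] = weight;

// Looking up a relation:
int w;
if (weights.TryGetValue(node1, out inner) && inner.TryGetValue(node2, out w))
    Console.WriteLine("weight of {0} -> {1}: {2}", node1, node2, w);
Memory then grows with the number of stored relations rather than with the square of the node count.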
Okay, now we know what you're really trying to do...
int[] Nodes = new int[1146445];
int[,] Relations = new int[Nodes.Length, Nodes.Length];
You're trying to allocate a single object which has 1,314,336,138,025 elements, each of size 4 bytes. That's over 5,000 GB. How exactly did you expect that to work?
Whatever you do, you're obviously going to run out of physical memory for that many elements... even if the CLR let you allocate a single object of that size.
Let's take a smaller example of 50,000, where you end up with ~9 GB of required space. I can't remember what the current limit is (which depends on the CLR version number and whether you're using the 32-bit or 64-bit CLR) but I don't think any of them will support that.
You can break your array up into "rows" as shown in Henk's answer - that will take up more memory in total, but each array will be small enough to cope with on its own in the CLR. It's not going to help you fit the whole thing into memory though - at best you'll end up swapping to oblivion.
Can you use sparse arrays instead, where you only allocate space for elements you really need to access (or some approximation of that)? Or map the data to disk? If you give us more context, we may be able to come up with a solution.
Jon and Henk have alluded to sparse arrays; this would be useful if many of your nodes are unrelated to each other. Even if all nodes are related to all others, you may not need an n by n array.
For example, perhaps nodes cannot be related to themselves. Perhaps, given nodes x and y, "x is related to y" is the same as "y is related to x". If both of those are true, then for 4 nodes, you only have 6 relations, not 16:
a <-> b
a <-> c
a <-> d
b <-> c
b <-> d
c <-> d
In this case, an n-by-n array is wasting somewhat more than half of its space. If large numbers of nodes are unrelated to each other, you're wasting that much more than half of the space.
One quick way to implement this would be as a Dictionary<KeyType, RelationType>, where the key uniquely identifies the two nodes being related. Depending on your exact needs, this could take one of several different forms. Here's an example based on the nodes and relations defined above:
Dictionary<KeyType, RelationType> x = new Dictionary<KeyType, RelationType>();
x.Add(new KeyType(a, b), new RelationType(a, b));
x.Add(new KeyType(a, c), new RelationType(a, c));
... etc.
If relations are symmetric in that way, then KeyType should ensure that new KeyType(b, a) creates an object that is equivalent to the one created by new KeyType(a, b).
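A sketch of such a KeyType, normalizing the node order so that (a, b) and (b, a) compare equal, assuming the nodes are identified by int ids:
struct KeyType : IEquatable<KeyType>
{
    public readonly int Low;
    public readonly int High;

    public KeyType(int node1, int node2)
    {
        // Store the smaller id first so (a, b) and (b, a) produce the same key.
        Low = Math.Min(node1, node2);
        High = Math.Max(node1, node2);
    }

    public bool Equals(KeyType other) { return Low == other.Low && High == other.High; }
    public override bool Equals(object obj) { return obj is KeyType && Equals((KeyType)obj); }
    public override int GetHashCode() { return Low * 397 ^ High; }
}
Implementing IEquatable<KeyType> and GetHashCode is what lets the struct behave correctly as a dictionary key without boxing.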
