Bag of Words representation problem - C#

Basically I have a dictionary containing all the words of my vocabulary as keys, each with 0 as its value.
To process a document into a bag-of-words representation, I copy that dictionary (with the appropriate IEqualityComparer), check whether the dictionary contains each word in the document, and increment that word's value.
To get the array for the bag-of-words representation I simply use the ToArray method.
This seemed to work fine, but I was just told that a dictionary doesn't guarantee any particular key order, so the resulting arrays might list the words in different orders, making them useless.
My current idea for solving this is to copy all the keys of the word dictionary into an ArrayList, create an array of the proper size, and then use the ArrayList's IndexOf method to fill the array.
So my question is: is there a better way to solve this? Mine seems kind of crude... and won't I have issues because of the IEqualityComparer?
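For reference, here is a minimal sketch of the index-mapping idea described above, assuming a fixed vocabulary list and StringComparer.OrdinalIgnoreCase standing in for the IEqualityComparer (all names are illustrative):
var comparer = StringComparer.OrdinalIgnoreCase;          // stand-in for your IEqualityComparer
var vocabulary = new List<string> { "w1", "w2", "w3" };   // your real vocabulary here

// Build the word -> index map once; every document reuses the same ordering.
var wordIndex = new Dictionary<string, int>(comparer);
for (int i = 0; i < vocabulary.Count; i++)
    wordIndex[vocabulary[i]] = i;

// Turn one document (already split into words) into a fixed-order count vector.
string[] documentWords = { "w3", "w2", "w3" };
var vector = new int[vocabulary.Count];
foreach (var word in documentWords)
{
    if (wordIndex.TryGetValue(word, out int idx))
        vector[idx]++;                                    // words outside the vocabulary are ignored
}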

Let me see if I understand the problem. You have two documents D1 and D2 each containing a sequence of words drawn from a known vocabulary {W1, W2... Wn}. You wish to obtain two mappings indicating the number of occurrences of each word in each document. So for D1, you might have
W1 --> 0
W2 --> 1
W3 --> 4
indicating that D1 was perhaps "W3 W2 W3 W3 W3". Perhaps D2 is "W2 W1 W2", so its mapping is
W1 --> 1
W2 --> 2
W3 --> 0
You wish to take both mappings and determine the vectors [0, 1, 4] and [1, 2, 0] and then compute the angle between those vectors as a way of determining how similar or different the two documents are.
Your problem is that the dictionary does not guarantee that the key/value pairs are enumerated in any particular order.
OK, so order them.
vector1 = (from pair in map1 orderby pair.Key select pair.Value).ToArray();
vector2 = (from pair in map2 orderby pair.Key select pair.Value).ToArray();
and you're done.
Does that solve your problem, or am I misunderstanding the scenario?
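If it helps, the last step (the angle) is then just cosine similarity over the two ordered vectors; a minimal sketch, assuming vector1 and vector2 come from the queries above and neither is all zeros:
double dot = 0, mag1 = 0, mag2 = 0;
for (int i = 0; i < vector1.Length; i++)
{
    dot  += (double)vector1[i] * vector2[i];
    mag1 += (double)vector1[i] * vector1[i];
    mag2 += (double)vector2[i] * vector2[i];
}
double cosine = dot / (Math.Sqrt(mag1) * Math.Sqrt(mag2));  // cosine of the angle between the documents
double angleRadians = Math.Acos(cosine);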

If I understand correctly, you want to split a document by word frequency.
You could take the document and run a Regex over it to split out the words:
var words = Regex
    .Matches(input, @"\w+")
    .Cast<Match>()
    .Where(m => m.Success)
    .Select(m => m.Value);
To make the frequency map:
var map = words.GroupBy(w => w).Select(g => new { word = g.Key, frequency = g.Count() });
There are overloads of the GroupBy method that allow you to supply an alternative IEqualityComparer if this is important.
Reading your comments, to create a corresponding sequence of only frequencies:
map.Select(a=>a.frequency)
This sequence will be in exactly the same order as the sequence map above.
Is this any help at all?

There is also an OrderedDictionary:
"Represents a collection of key/value pairs that are accessible by the key or index."
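A minimal sketch of how that could look (OrderedDictionary lives in System.Collections.Specialized and is non-generic, so values come back as object; the words here are just placeholders):
using System.Collections.Specialized;

var counts = new OrderedDictionary();        // remembers insertion order
counts.Add("hole", 0);
counts.Add("mole", 0);

counts["hole"] = (int)counts["hole"] + 1;    // update by key
int firstValue = (int)counts[0];             // read by index, in insertion order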

Something like this might work, although it is definitely ugly, and I believe it is similar to what you were suggesting. GetWordCount() does the work.
// requires using System.Linq;
class WordCounter
{
    public Dictionary<string, int> dictionary = new Dictionary<string, int>();

    public void CountWords(string text)
    {
        if (!string.IsNullOrEmpty(text))
        {
            text = text.ToLower();
            string[] words = text.Split(' ');
            if (dictionary.ContainsKey(words[0]))
            {
                if (text.Length > words[0].Length)
                {
                    text = text.Substring(words[0].Length + 1);
                    CountWords(text);
                }
            }
            else
            {
                // count how many times the first word occurs in the remaining text
                int count = words.Count(s => s == words[0]);
                dictionary.Add(words[0], count);
                if (text.Length > words[0].Length)
                {
                    text = text.Substring(words[0].Length + 1);
                    CountWords(text);
                }
            }
        }
    }

    public int[] GetWordCount(string text)
    {
        CountWords(text);
        return dictionary.Values.ToArray();
    }
}

Would this be helpful to you:
SortedDictionary<string, int> dic = new SortedDictionary<string, int>();
for (int i = 0; i < 10; i++)
{
    if (dic.ContainsKey("Word" + i))
        dic["Word" + i]++;
    else
        dic.Add("Word" + i, 1);   // first occurrence counts as 1
}

// to get the array of words:
List<string> wordsList = new List<string>(dic.Keys);
string[] wordsArr = wordsList.ToArray();

// to get the array of values:
List<int> valuesList = new List<int>(dic.Values);
int[] valuesArr = valuesList.ToArray();

If all you're trying to do is calculate cosine similarity, you don't need to convert your data to 20,000-length arrays, especially considering the data would likely be sparse with most entries being zero.
While processing the files, store each file's counts in a Dictionary keyed on the word. Then, to calculate the dot product and magnitudes, iterate through the words in the full word list, look up each word in each file's output data, and use the found value if it exists and zero if it doesn't.
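A minimal sketch of that sparse approach, assuming each document's counts live in a Dictionary<string, int> and neither document is empty:
static double CosineSimilarity(Dictionary<string, int> d1, Dictionary<string, int> d2)
{
    double dot = 0, mag1 = 0, mag2 = 0;

    // dot product: only words present in both documents can contribute
    foreach (var pair in d1)
    {
        if (d2.TryGetValue(pair.Key, out int other))
            dot += (double)pair.Value * other;
        mag1 += (double)pair.Value * pair.Value;
    }
    foreach (var count in d2.Values)
        mag2 += (double)count * count;

    return dot / (Math.Sqrt(mag1) * Math.Sqrt(mag2));
}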

C# why does binarysearch have to be made on sorted arrays and lists?
Is there any other method that does not require me to sort the list?
Sorting kind of messes with my program: I cannot sort the list and still have it work the way I want.
A binary search works by repeatedly halving the list of candidates, comparing the target against the middle element each time. Imagine the following sorted set:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
We can also think of this as a binary tree, with 8 at the root, to make it easier to visualise.
Now, say we want to find the number 3. We can do it like so:
Is 3 smaller than 8? Yes. OK, now we're looking at everything between 1 and 7.
Is 3 smaller than 4? Yes. OK, now we're looking at everything between 1 and 3.
Is 3 smaller than 2? No. OK, now we're looking at 3.
We found it!
Now, if your list isn't sorted, how will we divide the list in half? The simple answer is: we can't. If we swap 3 and 15 in the example above, it would work like this:
Is 3 smaller than 8? Yes. OK, now we're looking at everything between 1 and 7.
Is 3 smaller than 4? Yes. OK, now we're looking at everything between 1 and 3 (except we swapped it with 15).
Is 3 smaller than 2? No. OK, now we're looking at 15.
Huh? There's no more items to check but we didn't find it. I guess it's not in the list.
The solution is to use an appropriate data type instead. For fast lookups of key/value pairs, I'll use a Dictionary. For fast checks if something already exists, I'll use a HashSet. For general storage I'll use a List or an array.
Dictionary example:
var values = new Dictionary<int, string>();
values[1] = "hello";
values[2] = "goodbye";
var value2 = values[2]; // this lookup will be fast because the dictionary partitions keys into buckets by hash code
HashSet example:
var mySet = new HashSet<int>();
mySet.Add(1);
mySet.Add(2);
if (mySet.Contains(2)) // this lookup is fast for the same reason as a dictionary.
{
// do something
}
List example:
var list = new List<int>();
list.Add(1);
list.Add(2);
if (list.Contains(2)) // this isn't fast because it has to visit each item in the list, but it works OK for small sets or places where performance isn't so important
{
}
var idx2 = list.IndexOf(2);
If you have multiple values with the same key, you could store a list in a Dictionary like this:
var values = new Dictionary<int, List<string>>();
if (!values.ContainsKey(key))
{
values[key] = new List<string>();
}
values[key].Add("value1");
values[key].Add("value2");
There is no way to use binary search on unordered collections; a sorted collection is the central assumption of binary search. The key idea is that on every step you take the middle index m between l and r. Initially they are 0 and size - 1, and after every step one of them moves next to the middle: if x > arr[m] then l becomes m + 1, otherwise r becomes m - 1. So on every step you keep half of the array you had, and of course it remains sorted. The code below is recursive; if you don't know what recursion is (it is very important in programming), it is worth reviewing first.
// C# implementation of recursive Binary Search
using System;

class GFG {

    // Returns index of x if it is present in
    // arr[l..r], else returns -1
    static int binarySearch(int[] arr, int l, int r, int x)
    {
        if (r >= l) {
            int mid = l + (r - l) / 2;

            // If the element is present at the middle itself
            if (arr[mid] == x)
                return mid;

            // If the element is smaller than the middle element,
            // it can only be present in the left subarray
            if (arr[mid] > x)
                return binarySearch(arr, l, mid - 1, x);

            // Else the element can only be present in the right subarray
            return binarySearch(arr, mid + 1, r, x);
        }

        // We reach here when the element is not present in the array
        return -1;
    }

    // Driver method to test the above
    public static void Main()
    {
        int[] arr = { 2, 3, 4, 10, 40 };
        int n = arr.Length;
        int x = 10;
        int result = binarySearch(arr, 0, n - 1, x);
        if (result == -1)
            Console.WriteLine("Element not present");
        else
            Console.WriteLine("Element found at index " + result);
    }
}
Output:
Element found at index 3
Sure there is.
var list = new List<int>();
list.Add(42);
list.Add(1);
list.Add(54);
var index = list.IndexOf(1); //TADA!!!!
EDIT: Ok, I hoped the irony was obvious. But strictly speaking, if your array is not sorted, you are pretty much stuck with the linear search, readily available by means of IndexOf() or IEnumerable.First().
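If you really need the speed of a binary search but can't reorder the original list, one option (a sketch, not tied to the question's data) is to search a sorted copy and leave the original order intact:
var list = new List<int> { 42, 1, 54 };

// linear search on the unsorted list
int linearIndex = list.IndexOf(54);

// or: binary-search a sorted copy, leaving the original untouched
var sortedCopy = new List<int>(list);
sortedCopy.Sort();
int posInCopy = sortedCopy.BinarySearch(54);   // index within the sorted copy, not the original list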

How to speed up string operations and avoid slow loops

I am writing code which builds a lot of combinations (combinations might not be the right word here: sequences of characters in the order they actually appear in the string) that already exist in a string. The loop starts adding combinations to a List<string>, but unfortunately it takes a lot of time with any file over 200 bytes. I want to be able to work with hundreds of MBs here.
Let me explain what I actually want in the simplest of ways.
Let's say I have a string that is "Afnan is awesome" (-> main string); what I would want is a list of strings which encompasses different substring sequences of the main string. For example -> A,f,n,a,n, ,i,s, ,a,w,e,s,o,m,e. Now this is just the first iteration of the loop. With each iteration, my substring length increases, yielding these results for the second iteration -> Af,fn,na,n , i,is,s , a,aw,we,es,so,om,me. The third iteration would look like this: Afn,fna,nan,an ,n i, is,is ,s a, aw, awe, wes, eso, som, ome. This keeps going until my substring length reaches half the length of my main string.
My code is as follows:
string data = File.ReadAllText("MyFilePath");

// Creating my dictionary
List<string> dictionary = new List<string>();
int stringLengthIncrementer = 1;

for (int v = 0; v < (data.Length / 2); v++)
{
    for (int x = 0; x < data.Length; x++)
    {
        if ((x + stringLengthIncrementer) > data.Length) break; // So the index does not go out of bounds

        if (dictionary.Contains(data.Substring(x, stringLengthIncrementer)) == false) // So no repetition takes place
        {
            dictionary.Add(data.Substring(x, stringLengthIncrementer)); // Add the substring to my List<string> -> dictionary
        }
    }
    stringLengthIncrementer++; // Increase the substring length with each iteration
}
I use data.Length / 2 because I only need combinations at most half the length of the entire string. Note that I search the entire string for combinations, not half of it.
To further simplify what I am trying to do -> Suppose I have an input string =
"abcd"
the output would be =
a, b, c, d, ab, bc, cd (the rest is cut out because it is longer than half the length of my primary string: abc, bcd, abcd)
I was hoping some regex method might help me achieve this. Is there anything that doesn't consist of loops, anything that is substantially faster than this, or some simpler, less complex code that is more efficient?
Update
When I used a HashSet<string> instead of a List<string> for my dictionary, I did not see any change in performance, and I also got an OutOfMemoryException.
You can use LINQ to simplify the code and very easily parallelize it, but it's not going to be the orders of magnitude faster that you would need to run this on files of hundreds of MBs (that's very likely impossible).
var data = File.ReadAllText("MyFilePath");

var result = Enumerable.Range(1, data.Length / 2)
    .AsParallel()
    .Select(len => new HashSet<string>(
        Enumerable.Range(0, data.Length - len + 1) // the +1 here made it work perfectly
            .Select(x => data.Substring(x, len))))
    .SelectMany(t => t)
    .ToList();
General improvements you can make in your code to improve the performance (I don't consider whether there are other, more optimal solutions); a sketch applying them follows this list:
calculate data.Substring(x, stringLengthIncrementer) only once per iteration;
as you do a search, use a SortedList, it will be faster;
initialize the List (or SortedList, or whatever) with a calculated number of items, like new List<string>(calculatedCapacity);
or you can try to write an algorithm that produces the combinations without checking for duplicates.
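A minimal sketch of those tweaks applied to the original loop, using a HashSet per substring length for the duplicate check (the usual System, System.IO and System.Collections.Generic usings are assumed):
string data = File.ReadAllText("MyFilePath");

// If you can estimate how many distinct substrings to expect, pass that capacity to the constructor.
var dictionary = new List<string>();

for (int len = 1; len <= data.Length / 2; len++)
{
    // a per-length set makes the duplicate check O(1) instead of scanning the whole list
    var seen = new HashSet<string>();
    for (int x = 0; x + len <= data.Length; x++)
    {
        string candidate = data.Substring(x, len);   // computed only once per position
        if (seen.Add(candidate))                     // Add returns false for duplicates
            dictionary.Add(candidate);
    }
}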
You may be able to use HashSet combined with MoreLINQ's Batch feature (available on NuGet) to simplify the code a little.
public static void Main()
{
    string data = File.ReadAllText("MyFilePath");
    //string data = "Afnan is awesome";
    var dictionary = new HashSet<string>();

    for (var stringLengthIncrementer = 1; stringLengthIncrementer <= (data.Length / 2); stringLengthIncrementer++)
    {
        foreach (var skipper in Enumerable.Range(0, stringLengthIncrementer))
        {
            var batched = data.Skip(skipper).Batch(stringLengthIncrementer);
            foreach (var batch in batched)
            {
                dictionary.Add(new string(batch.ToArray()));
            }
        }
    }

    Console.WriteLine(dictionary);
    dictionary.ForEach(z => Console.WriteLine(z));
    Console.ReadLine();
}
For this input:
"Afnan is awesome askdjkhaksjhd askjdhaksjsdhkajd asjsdhkajshdkjahsd asksdhkajshdkjashd aksjdhkajsshd98987ad asdhkajsshd98xcx98asdjaksjsd askjdakjshcc98z98asdsad"
performance is roughly 10x faster than your current code.

String Array Searching

I had an interview for a Jr. developer position a few days ago, and they asked:
"If you had an array of letters "a" and "b" how would you write a method to count how many instances of those letters are in the array?"
I said that you would have a for loop with an if else statement that would increment 1 of 2 counter variables. After that, though, they asked how I would solve that same problem, if the array could contain any letter of the alphabet. I said that I would go about it the same way, with a long IF statement, or a switch statement. In hindsight, that doesn't seem so efficient; is there an easier way to go about doing this?
You could declare an array of size 256 (the number of possible character codes), zero it, and simply increment the entry which corresponds to each character code you read.
For example, if you read an 'a', the corresponding code is ASCII 97, so you increment array[97]. You can reduce the amount of memory by subtracting 97 from the code (if you know the input is going to be letters only). You also need to decide what to do with capital letters (do you consider them different or not?); for those you subtract 65 instead.
So at the end the code would look something like this:
int[] counts = new int[26]; // one slot per letter a - z
char a = get_next_char();   // however you read the next character
if (char.IsUpper(a))
{
    counts[a - 65]++;       // 'A' is 65
}
else
{
    counts[a - 97]++;       // 'a' is 97
}
This code counts 'A' and 'a' as the same letter. If that's not what you want, you need a different translation in the ifs, but you can probably see the idea now. This saves a lot of comparing compared to your approach.
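Put together as a small, self-contained sketch (assuming the input is a char[] and that counting should be case-insensitive over plain a-z):
char[] input = { 'a', 'B', 'b', 'z', 'a' };
int[] counts = new int[26];

foreach (char c in input)
{
    char lower = char.ToLower(c);
    if (lower >= 'a' && lower <= 'z')   // ignore anything outside a-z
        counts[lower - 'a']++;
}

for (int i = 0; i < counts.Length; i++)
{
    if (counts[i] > 0)
        Console.WriteLine("{0}: {1}", (char)('a' + i), counts[i]);
}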
Depending on whether the objective is CPU efficiency, memory efficiency, or developer efficiency, you could just do:
foreach(var grp in theString.GroupBy(c => c)) {
Console.WriteLine("{0}: {1}", grp.Key, grp.Count());
}
Not awesome efficiency, but fine for virtually all non-pathological scenarios. In real scenarios, due to Unicode, I'd probably use a dictionary as a counter - Unicode is too big to pre-allocate an array.
Dictionary<char, int> counts = new Dictionary<char, int>();
foreach(char c in theString) {
int count;
if(!counts.TryGetValue(c, out count)) count = 0;
counts[c] = count + 1;
}
foreach(var pair in counts) {
Console.WriteLine("{0}: {1}", pair.Key, pair.Value);
}
You can create a Dictionary<string, int>, then iterate through the array, check whether the element exists as a key in the dictionary, and increment the value (adding it with a count of 1 the first time you see it).
Dictionary<string, int> counter = new Dictionary<string, int>();
foreach(var item in items)
{
    if(counter.ContainsKey(item))
    {
        counter[item] = counter[item] + 1;
    }
    else
    {
        counter[item] = 1; // first occurrence
    }
}
Here is a good example that may resolve your query:
http://www.dotnetperls.com/array-find
string[] array1 = { "cat", "dog", "carrot", "bird" };

// Find the first element starting with a substring.
string value1 = Array.Find(array1,
    element => element.StartsWith("car", StringComparison.Ordinal));

// Find the first element of three characters in length.
string value2 = Array.Find(array1,
    element => element.Length == 3);

// Find all elements not greater than four letters long.
string[] array2 = Array.FindAll(array1,
    element => element.Length <= 4);

Console.WriteLine(value1);
Console.WriteLine(value2);
Console.WriteLine(string.Join(",", array2));

What is the simplest way to refine a list's contents (words) based on character frequency and position? (C#)

I'm writing a console-environment Hangman game for my introductory programming class. The player chooses the word length and number of guesses they would like. 'Easy mode' is simple enough... generate a random number to use as the list's index and check that the chosen word is the right length. However, 'hard mode' requires the list to be refined as the game progresses, choosing the largest list of possibilities given the letters guessed.
I should note that we are not using the C# List class but are instead creating array-based structs:
struct ListType
{
    public type[] items;
    public int count;
}

// defined as:
ListType myList = new ListType();
myList.items = new type[maxValue];
myList.count = 0;
Anyway, here's an example of the way 'hard mode' should go:
Word List:
hole
airplane
lame
photos
cart
mole
(player chooses word length of 4)
Word List (refined):
hole
lame
cart
mole
(player guesses "l", then "e")
Word List (refined):
hole
mole
"Lame" is omitted because more words have the "...le..." pattern. The technique that makes sense to me (but isn't working the way I'd like) is storing each word's pattern to an array (ie: "mole" and "hole" = 0011, and "lame" = 1001), and counting up the duplicates to determine the larger list.
Is this the way I should be doing it? I'm new to programming and have just under a year's worth of experience, so I guess answer as such.
Thanks!!
There are a few ways of approaching this. A simple way would be to keep track of a list for all candidate words and calculate the amount of matching sequences for that word as well as log the best matching sequence. This way you can both sort on the best sequence and the amount of sequences when the best sequence alone is not a good enough measurement tool. I hope it becomes obvious how to modify this code in order to only sort on the best sequence.
First, I set up a test case like this:
// mimic the scenario given by the question
string[] wordList = new string[] { "hole", "airplane", "lame", "photos", "cart", "mole" };
int wordLength = 4;
List<char> requiredCharacters = new List<char> { 'l', 'e' };
After which I filter the wordList and calculate the best matches, which I finally group together to produce the desired result:
// filter all words that don't match the required length
var candidateWords = wordList.Where(x => x.Length == wordLength);

// define a result set holding all the words and all their matches
Dictionary<string, List<int>> refinedWordSet = new Dictionary<string, List<int>>();

foreach (string word in candidateWords)
{
    List<int> matches = new List<int>() { 0 };
    int currentMatchCount = 0;
    foreach (char character in word)
    {
        if (requiredCharacters.Contains(character))
        {
            currentMatchCount++;
        }
        else
        {
            // if there were previous matches
            if (currentMatchCount > 0)
            {
                // save the current match
                matches.Add(currentMatchCount);
                currentMatchCount = 0;
            }
        }
    }

    // if there was a match at the end
    if (currentMatchCount > 0)
    {
        // save the last match
        matches.Add(currentMatchCount);
    }

    refinedWordSet.Add(word, matches);
}

// sort by a combination of the total amount of matches as well as the highest match
var goupedRefinedWords = from entry in refinedWordSet
                         group entry.Key by new { Max = entry.Value.Max(), Total = entry.Value.Sum() } into grouped
                         select grouped;

foreach (var entry in goupedRefinedWords)
{
    Console.WriteLine("Word list with best match: {0} and total match {1}: {2}",
        entry.Key.Max,
        entry.Key.Total,
        entry.Aggregate("", (result, nextWord) => result += nextWord + ", "));
}
Console.ReadLine();
Pay attention to the comments in the code
So you look through the array for strings that match the guess pattern.
In the specific case of "le" you could simply use String.IndexOf(). If you require a more complex pattern, say "*le?" (where * and ? follow DOS-like wildcard rules), you could employ a dynamically-constructed regex pattern (easy, but performance-heavy if used in a near-real-time system), or character scanning (read each char from the string and match it to your pattern yourself), which is more difficult and harder to maintain but performs better for a small number of elements in a near-real-time system.
As this is homework, I wouldn't worry about performance profiling at all right now.
Also, that struct looks mighty goofy. There are certainly better constructs for this type of thing, like a List<string> or just a string[], both of which already keep track of how many items they hold.
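For what it's worth, the mask idea from the question can be expressed fairly directly with LINQ; here is a sketch (the word list and guesses are just the example from the question, and the largest group of identical masks wins):
string[] candidates = { "hole", "lame", "cart", "mole" };
var guessed = new HashSet<char> { 'l', 'e' };

// Build a positional mask per word, e.g. "hole" -> "0011", "lame" -> "1001".
string MaskOf(string word) =>
    new string(word.Select(c => guessed.Contains(c) ? '1' : '0').ToArray());

// Group words by mask and keep the largest group.
var refined = candidates
    .GroupBy(MaskOf)
    .OrderByDescending(g => g.Count())
    .First()
    .ToList();   // -> hole, mole (both "0011")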

Making a Dictionary's key based on a for loop position

I am going through a directory, picking up some files, and then adding them to a Dictionary.
The first time in the loop the key needs to be A, the second time B, etc. After 26 (Z) the number represents different characters, and from 33 it starts at lowercase a, up to 49 which is lowercase q.
Without having a massive if statement that says "if i == 1 then the key is 'A'" and so on, how can I keep this code tidy?
Sounds like you just need to keep an index of where you've got to, then some mapping function:
int index = 0;
foreach (...)
{
    ...
    string key = MapIndexToKey(index);
    dictionary[key] = value;
    index++;
}
...
// Keys as per comments
private static readonly List<string> Keys =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopq"
        .Select(x => x.ToString())
        .ToList();

// This doesn't really need to be a separate method at the moment, but
// it means it's flexible for future expansion.
private static string MapIndexToKey(int index)
{
    return Keys[index];
}
EDIT: I've updated the MapIndexToKey method to make it simpler. It's not clear why you want a string key if you only ever use a single character though...
Another edit: I believe you could actually just use:
string key = ((char) (index + 'A')).ToString();
instead of having the mapping function at all, given your requirements, as the characters are contiguous in Unicode order from 'A'...
Keep incrementing from 101 to 132, ignoring the gaps in the sequence, and convert the codes to characters: http://www.asciitable.com/
Use the remainder (divide by 132) to identify the second loop.
This gives you the opportunity to map letters to specific numbers, perhaps not alphabet ordered.
var letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    .Select((chr, index) => new { character = chr, index = index + 1 });

foreach (var letter in letters)
{
    int index = letter.index;
    char chr = letter.character;
    // do something
}
How about:
for (int i = 0; i < 26; ++i)
{
    dict[(char)('A' + (i % 26))] = GetValueFor(i);
}
