How to merge 2 text files - C#

I'm looking for different solutions, including ones where using .NET library methods is forbidden and ones where I can take full advantage of them.
Here is the problem: I have two text files, textFile1 and textFile2. Each of them contains sorted integer numbers (this is the most important condition), like those displayed below:
textFile1    textFile2
0            1
2            3
4            5
I need to create a 3rd text file, for example textFile3, by merging those two files; the expected result is:
textFile3
0
1
2
3
4
5
My first idea was to copy those two text files line by line into two separate arrays and then use the solution for merging two sorted arrays into a new one, provided in this question.
After that, I would copy the members of the new array into textFile3, line by line.
Do you have any suggestions? Maybe a better approach? Please write all of your ideas here; each of them will be helpful to me.

Merging two files is a fairly simple modification to merging two arrays. The idea is to replace the array index increment with a read of the next line of the file. For example, the standard merge algorithm that I show in my blog (http://blog.mischel.com/2014/10/24/merging-sorted-sequences/) is:
while (not end of List A and not end of List B)
    if (List A current item <= List B current item)
        output List A current item
        advance List A index
    else
        output List B current item
        advance List B index

// At this point, one of the lists is empty.
// Output remaining items from the other.
while (not end of List A)
    output List A current item
    advance List A index
while (not end of List B)
    output List B current item
    advance List B index
To make that merge files, you start by opening and reading the first line of each file. It gets kind of screwy, though, because you have to check for end of file. "Get the next line" is a bit odd.
int item1;
int item2;
bool eof1 = false;
bool eof2 = false;
string temp;

var file1 = File.OpenText(textFile1);
temp = file1.ReadLine();
if (temp == null)
    eof1 = true;
else
    item1 = int.Parse(temp);
// do the same thing for file2
Then we can do the standard merge:
while (!eof1 && !eof2)
{
    if (item1 <= item2)
    {
        outputFile.WriteLine(item1);
        // get next item from file1
        temp = file1.ReadLine();
        if (temp == null)
            eof1 = true;
        else
            item1 = int.Parse(temp);
    }
    else
    {
        // output item2 and get next line from file2
    }
}
// and the cleanup
while (!eof1)
{
    // output item1, and get next line from file1
}
while (!eof2)
{
    // output item2, and get next line from file2
}
The only thing different is that getting the next item is more involved than just incrementing an array index.
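For completeness, here is one way those pieces might fit together as a whole. This is a minimal sketch, assuming each input file holds one sorted integer per line; the MergeFiles and ReadNext names are mine, not from the answer above (needs using System.IO):

static void MergeFiles(string textFile1, string textFile2, string textFile3)
{
    using (var file1 = File.OpenText(textFile1))
    using (var file2 = File.OpenText(textFile2))
    using (var outputFile = File.CreateText(textFile3))
    {
        // "get the next line" as a helper: returns false at end of file
        bool ReadNext(StreamReader reader, out int item)
        {
            string temp = reader.ReadLine();
            item = temp == null ? 0 : int.Parse(temp);
            return temp != null;
        }

        bool more1 = ReadNext(file1, out int item1);
        bool more2 = ReadNext(file2, out int item2);

        // standard merge: always emit the smaller current item
        while (more1 && more2)
        {
            if (item1 <= item2)
            {
                outputFile.WriteLine(item1);
                more1 = ReadNext(file1, out item1);
            }
            else
            {
                outputFile.WriteLine(item2);
                more2 = ReadNext(file2, out item2);
            }
        }

        // one file is exhausted; drain the other
        while (more1) { outputFile.WriteLine(item1); more1 = ReadNext(file1, out item1); }
        while (more2) { outputFile.WriteLine(item2); more2 = ReadNext(file2, out item2); }
    }
}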

Merging two ordered sequences can easily be generalized and implemented as an extension method like this:
public static class Algorithms
{
    public static IEnumerable<T> MergeOrdered<T>(this IEnumerable<T> seq1,
        IEnumerable<T> seq2, IComparer<T> comparer = null)
    {
        if (comparer == null) comparer = Comparer<T>.Default;
        using (var e1 = seq1.GetEnumerator())
        using (var e2 = seq2.GetEnumerator())
        {
            bool more1 = e1.MoveNext(), more2 = e2.MoveNext();
            while (more1 && more2)
            {
                int compare = comparer.Compare(e1.Current, e2.Current);
                // note: on a tie, both sequences advance, so only one of the
                // two equal items is emitted
                yield return compare < 0 ? e1.Current : e2.Current;
                if (compare <= 0) more1 = e1.MoveNext();
                if (compare >= 0) more2 = e2.MoveNext();
            }
            for (; more1; more1 = e1.MoveNext())
                yield return e1.Current;
            for (; more2; more2 = e2.MoveNext())
                yield return e2.Current;
        }
    }
}
Then the concrete task can be accomplished simply with:
static void Merge(string inputFile1, string inputFile2, string outputFile)
{
    Func<string, IEnumerable<KeyValuePair<int, string>>> readLines = file =>
        File.ReadLines(file).Select(line =>
            new KeyValuePair<int, string>(int.Parse(line), line));

    var inputLines1 = readLines(inputFile1);
    var inputLines2 = readLines(inputFile2);
    var comparer = Comparer<KeyValuePair<int, string>>.Create(
        (a, b) => a.Key.CompareTo(b.Key));
    var outputLines = inputLines1.MergeOrdered(inputLines2, comparer)
        .Select(item => item.Value);
    File.WriteAllLines(outputFile, outputLines);
}

Both files are sorted lists, so to avoid memory consumption, open a reader on each file. Read a line from both, compare, write the smaller value, and advance only the file you took the value from. In effect, treat the current value in each file as a pointer and keep comparing and advancing from the lesser side until completion. This keeps the memory footprint small, which matters most for large files.
You can pinch an algorithm off of the web; here is one, and another that even mentions O(1) space. Ignore the fact that they talk about arrays: your files are effectively sorted arrays, so you don't need to duplicate them in memory.

Related

How to Trim the Leading and Trailing White-Spaces of a String Array in C#

I want to trim all the white-spaces and empty strings only from the start and end of an array, without converting it into a string, in C#.
This is what I've done so far to solve my problem, but I'm looking for a somewhat more efficient solution, as I don't want to be stuck with a solution that merely works.
static public string[] Trim(string[] arr)
{
    List<string> TrimmedArray = new List<string>(arr);
    foreach (string i in TrimmedArray.ToArray())
    {
        if (String.IsEmpty(i)) TrimmedArray.RemoveAt(TrimmedArray.IndexOf(i));
        else break;
    }
    foreach (string i in TrimmedArray.ToArray().Reverse())
    {
        if (String.IsEmpty(i)) TrimmedArray.RemoveAt(TrimmedArray.IndexOf(i));
        else break;
    }
    return TrimmedArray.ToArray();
}
NOTE: String.IsEmpty is a custom function which checks whether a string is null, empty, or just white-space.
Your code allocates a lot of new arrays unnecessarily. When you instantiate a list from an array, the list creates a new backing array to store the items, and every time you call ToArray() on the resulting list, you're also allocating yet another copy.
The second problem is with TrimmedArray.RemoveAt(TrimmedArray.IndexOf(i)): if the array contains copies of the same string value in the middle as well as at the end, you might end up removing strings from the middle, because IndexOf returns the first match, which is not necessarily the element you are iterating over.
My advice would be to split the problem into two distinct steps:
1. Find both boundary indices (the first and last non-empty strings in the array).
2. Copy only the relevant middle section to a new array.
To locate the boundary indices you can use Array.FindIndex() and Array.FindLastIndex():
static public string[] Trim(string[] arr)
{
    if (arr == null || arr.Length == 0)
        // no need to search through nothing
        return Array.Empty<string>();

    // define a predicate to test for non-empty strings
    Predicate<string> IsNotEmpty = s => !String.IsEmpty(s);

    var firstIndex = Array.FindIndex(arr, IsNotEmpty);
    if (firstIndex < 0)
        // nothing to return if it's all whitespace anyway
        return Array.Empty<string>();

    var lastIndex = Array.FindLastIndex(arr, IsNotEmpty);

    // calculate the size of the relevant middle section from the indices
    var newArraySize = lastIndex - firstIndex + 1;

    // create the new array and copy items to it
    var results = new string[newArraySize];
    Array.Copy(arr, firstIndex, results, 0, newArraySize);
    return results;
}
I like the answer by Mathias R. Jessen as it is efficient and clean.
Just thought I'd show how to do it using the List<> as in your original attempt:
static public string[] Trim(string[] arr)
{
    List<string> TrimmedArray = new List<string>(arr);
    while (TrimmedArray.Count > 0 && String.IsEmpty(TrimmedArray[0]))
    {
        TrimmedArray.RemoveAt(0);
    }
    while (TrimmedArray.Count > 0 && String.IsEmpty(TrimmedArray[TrimmedArray.Count - 1]))
    {
        TrimmedArray.RemoveAt(TrimmedArray.Count - 1);
    }
    return TrimmedArray.ToArray();
}
This is not as efficient as the other answer since the internal array within the List<> has to shift all its elements to the left each time an element is deleted from the front.
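If you want to keep the List<> flavor but avoid the repeated shifting from RemoveAt(0), here is a variant (my own sketch, reusing the question's custom String.IsEmpty helper) that finds the boundaries first and takes a single GetRange:

static public string[] Trim(string[] arr)
{
    var list = new List<string>(arr);
    // FindIndex/FindLastIndex locate the first and last non-empty entries
    int first = list.FindIndex(s => !String.IsEmpty(s));
    if (first < 0) return Array.Empty<string>();
    int last = list.FindLastIndex(s => !String.IsEmpty(s));
    // one copy of the middle section instead of repeated element shifts
    return list.GetRange(first, last - first + 1).ToArray();
}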

Algorithm for grouping consecutive numbers

I am trying to build an efficient algorithm that can process thousands of rows of data containing customers' zip codes. I would then want to cross-check those zip codes against a grouping of around 1000 zip codes, but I have about 100 columns of 1000 zip codes. A lot of these zip codes are consecutive numbers, but there are also a lot of random zip codes thrown in. So what I would like to do is group consecutive zip codes together, so that I can just check whether a zip code falls within a range instead of checking it against every single zip code.
Sample data -
90001
90002
90003
90004
90005
90006
90007
90008
90009
90010
90012
90022
90031
90032
90033
90034
90041
This should be grouped as follows:
{ 90001-90010, 90012, 90022, 90031-90034, 90041 }
Here's my idea for the algorithm:
public struct gRange {
    public int start, end;
    public gRange(int a, int? b) {
        start = a;
        if (b != null) end = b.Value;
        else end = a;
    }
}

void groupZips(string[] zips) {
    List<gRange> zipList = new List<gRange>();
    int currZip, prevZip, startRange, endRange;
    startRange = 0;
    bool inRange = false;
    for (int i = 1; i < zips.Length; i++) {
        currZip = Convert.ToInt32(zips[i]);
        prevZip = Convert.ToInt32(zips[i - 1]);
        if (currZip - prevZip == 1 && inRange == false) {
            inRange = true;
            startRange = prevZip;
            continue;
        }
        else if (currZip - prevZip == 1 && inRange == true) continue;
        else if (currZip - prevZip != 1 && inRange == true) {
            inRange = false;
            endRange = prevZip;
            zipList.Add(new gRange(startRange, endRange));
            continue;
        }
        else if (currZip - prevZip != 1 && inRange == false) {
            zipList.Add(new gRange(prevZip, prevZip));
        }
        // not sure how to handle the last case, when i == zips.Length - 1
    }
}
So as of now, I am unsure of how to handle the last case, but looking at this algorithm, it doesn't strike me as efficient. Is there a better/easier way to group numbers like this?
Here is an O(n) solution, even if your zip codes are not guaranteed to be in order.
If you need the output groupings to be sorted, you can't do better than O(n log n), because somewhere you'll have to sort something. But if grouping the zip codes is your only concern and sorting the groups isn't required, then I'd use an algorithm like this. It makes good use of a HashSet, a Dictionary, and a DoublyLinkedList. To my knowledge this algorithm is O(n), because I believe HashSet.Add() and HashSet.Contains() run in constant time.
Here is a working dotnetfiddle
// I'm assuming zip codes are ints... convert if desired.
// Jumbled up your sample data to show that the code would still work.
var zipcodes = new List<int>
{
    90012, 90033, 90009, 90001, 90005, 90004, 90041, 90008, 90007,
    90031, 90010, 90002, 90003, 90034, 90032, 90006, 90022,
};

// facilitate constant-time lookups of whether zip codes are in your set
var zipHashSet = new HashSet<int>();

// lookup zipcode -> linked list node, to remove items from the linked list in constant time
var nodeDictionary = new Dictionary<int, DoublyLinkedListNode<int>>();

// linked list for iterating and grouping your zip codes in linear time
var zipLinkedList = new DoublyLinkedList<int>();

// initialize our data structures from the initial list
foreach (int zipcode in zipcodes)
{
    zipLinkedList.Add(zipcode);
    zipHashSet.Add(zipcode);
    nodeDictionary[zipcode] = zipLinkedList.Last;
}

// object to store the groupings (ex: "90001-90010", "90022")
var groupings = new HashSet<string>();

// iterate through the linked list, but skip nodes if we grouped them with a zip code
// found on a previous iteration of the loop
var node = zipLinkedList.First;
while (node != null)
{
    var bottomZipCode = node.Element;
    var topZipCode = bottomZipCode;

    // find the lowest zip code in this group
    while (zipHashSet.Contains(bottomZipCode - 1))
    {
        var nodeToDel = nodeDictionary[bottomZipCode - 1];
        // delete the node from the linked list so we don't observe any node more than once
        if (nodeToDel.Previous != null)
        {
            nodeToDel.Previous.Next = nodeToDel.Next;
        }
        if (nodeToDel.Next != null)
        {
            nodeToDel.Next.Previous = nodeToDel.Previous;
        }
        // see if the previous zip code is in our group, too
        bottomZipCode--;
    }

    // get the string version of the zip code at the bottom of the range
    var bottom = bottomZipCode.ToString();

    // find the highest zip code in this group
    while (zipHashSet.Contains(topZipCode + 1))
    {
        var nodeToDel = nodeDictionary[topZipCode + 1];
        // delete the node from the linked list so we don't observe any node more than once
        if (nodeToDel.Previous != null)
        {
            nodeToDel.Previous.Next = nodeToDel.Next;
        }
        if (nodeToDel.Next != null)
        {
            nodeToDel.Next.Previous = nodeToDel.Previous;
        }
        // see if the next zip code is in our group, too
        topZipCode++;
    }

    // get the string version of the zip code at the top of the range
    var top = topZipCode.ToString();

    // add the grouping in the correct format
    if (top == bottom)
    {
        groupings.Add(bottom);
    }
    else
    {
        groupings.Add(bottom + "-" + top);
    }

    // onward!
    node = node.Next;
}

// print results
foreach (var grouping in groupings)
{
    Console.WriteLine(grouping);
}
** A small refactoring of the common linked-list node deletion logic is in order; it might look like the sketch below.
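A sketch of that refactoring, assuming the DoublyLinkedListNode<T> type from the fiddle exposes Previous and Next:

// unlink a node from the doubly linked list so it is not visited again
static void Unlink<T>(DoublyLinkedListNode<T> node)
{
    if (node.Previous != null)
        node.Previous.Next = node.Next;
    if (node.Next != null)
        node.Next.Previous = node.Previous;
}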
If Sorting is Required
An O(n log n) algorithm is much simpler, because once you sort your input list, the groups can be formed in one iteration of the list, with no additional data structures. A sketch follows.
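For example, a sketch of that sorted approach (assuming the zip codes are ints; uses System.Linq):

// sort, then form ranges in a single pass over consecutive values
static List<string> GroupZips(List<int> zipcodes)
{
    var sorted = zipcodes.Distinct().OrderBy(z => z).ToList();
    var groupings = new List<string>();
    int i = 0;
    while (i < sorted.Count)
    {
        int start = sorted[i];
        int end = start;
        // extend the range while the next value is consecutive
        while (i + 1 < sorted.Count && sorted[i + 1] == end + 1)
        {
            end = sorted[++i];
        }
        groupings.Add(start == end ? start.ToString() : start + "-" + end);
        i++;
    }
    return groupings;
}

// GroupZips on the sample data yields
// { "90001-90010", "90012", "90022", "90031-90034", "90041" }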
I believe you are overthinking this one. Just using LINQ against an IEnumerable can search 80,000+ records in less than 1/10 of a second.
I used the free CSV zip code list from here: http://federalgovernmentzipcodes.us/free-zipcode-database.csv
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace ZipCodeSearchTest
{
    struct zipCodeEntry
    {
        public string ZipCode { get; set; }
        public string City { get; set; }
    }

    class Program
    {
        static void Main(string[] args)
        {
            List<zipCodeEntry> zipCodes = new List<zipCodeEntry>();
            string dataFileName = "free-zipcode-database.csv";
            using (FileStream fs = new FileStream(dataFileName, FileMode.Open, FileAccess.Read))
            using (StreamReader sr = new StreamReader(fs))
                while (!sr.EndOfStream)
                {
                    string line = sr.ReadLine();
                    string[] lineVals = line.Split(',');
                    zipCodes.Add(new zipCodeEntry
                    {
                        ZipCode = lineVals[1].Trim(' ', '\"'),
                        City = lineVals[3].Trim(' ', '\"')
                    });
                }

            bool terminate = false;
            while (!terminate)
            {
                Console.WriteLine("Enter zip code:");
                var userEntry = Console.ReadLine();
                if (userEntry.ToLower() == "x" || userEntry.ToLower() == "q")
                    terminate = true;
                else
                {
                    DateTime dtStart = DateTime.Now;
                    foreach (var arrayVal in zipCodes.Where(z => z.ZipCode == userEntry.PadLeft(5, '0')))
                        Console.WriteLine(string.Format("ZipCode: {0}", arrayVal.ZipCode).PadRight(20, ' ')
                            + string.Format("City: {0}", arrayVal.City));
                    DateTime dtStop = DateTime.Now;
                    Console.WriteLine();
                    Console.WriteLine("Lookup time: {0}", dtStop.Subtract(dtStart).ToString());
                    Console.WriteLine("\n\n");
                }
            }
        }
    }
}
In this particular case, it is quite possible that a hash will be faster. However, the range-based solution will use a lot less memory, so it would be appropriate if your lists were very large (and I'm not convinced that there are enough possible zipcodes for any list of zipcodes to be large enough.)
Anyway, here's a simpler logic for making the range list and finding if a target is in a range:
1. Make ranges a simple list of integers (or even zip codes), and push the first element of zip as its first element.
2. For each element of zip except the last one, if that element plus one is not the same as the next element, add both that element plus one and the next element to ranges.
3. Push one more than the last element of zip onto the end of ranges.
Now, to find out whether a zip code is in ranges, do a binary search into ranges for the smallest element which is greater than the target zip code. [Note 1] If the index of that element is odd, then the target is in one of the ranges; otherwise it isn't.
Notes:
AIUI, the BinarySearch method on a C# list returns the index of the element found or the complement of the index of the first larger element. To get the result needed by the suggested algorithm, you could use something like index >= 0 ? index + 1 : ~index, but it might be simpler to just search for the zipcode one less than the target and then use the complement of the low-order bit of the result.
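Putting the notes together, here is a sketch of that scheme (my own rendering, assuming zip is a sorted, duplicate-free List<int>):

// build the boundary list: [start1, end1+1, start2, end2+1, ...]
static List<int> MakeRanges(List<int> zip)
{
    var ranges = new List<int> { zip[0] };
    for (int i = 0; i < zip.Count - 1; i++)
    {
        if (zip[i] + 1 != zip[i + 1])
        {
            ranges.Add(zip[i] + 1);   // one past the end of the current run
            ranges.Add(zip[i + 1]);   // start of the next run
        }
    }
    ranges.Add(zip[zip.Count - 1] + 1);
    return ranges;
}

// the target is in a range iff the smallest boundary greater than it has an odd index
static bool InRanges(List<int> ranges, int target)
{
    int index = ranges.BinarySearch(target);
    // BinarySearch returns the index if found, else the complement of the
    // index of the first larger element (as described in the note above)
    int firstGreater = index >= 0 ? index + 1 : ~index;
    return firstGreater % 2 == 1;
}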

Insert an underlying value into a non-existing index

I'm trying to solve a simple algorithm a specific way where it takes the current row and adds it to the top most row. I know there are plenty of ways to solve this but currently I have a text file that gets read line by line. Each line is converted to an sbyte (there's a certain reason why I am using sbyte but it's irrelevant to my post and I won't mention it here) and added to a list. From there, the line is reversed and added to another list. Here's the code I have for that first part:
List<List<sbyte>> largeNumbers = new List<List<sbyte>>();
List<string> total = new List<string>();
string bigIntFile = @"C:\Users\Justin\Documents\BigNumbers.txt";
string result;

StreamReader streamReader = new StreamReader(bigIntFile);
while ((result = streamReader.ReadLine()) != null)
{
    List<sbyte> largeNumber = new List<sbyte>();
    for (int i = 0; i < result.Length; i++)
    {
        sbyte singleConvertedDigit = Convert.ToSByte(result.Substring(i, 1));
        largeNumber.Add(singleConvertedDigit);
    }
    largeNumber.Reverse();
    largeNumbers.Add(largeNumber);
}
From there, I want to use an empty list of strings, named total, which I will use later for adding my numbers. The numbers I'll be adding to it are not all the same length, so I need to check whether an index exists at a certain location: if it does, I'll add the value I'm looking at to the number residing at that index; if not, I need to create that index and set its value to 0. In trying to do so, I keep getting an IndexOutOfRange exception (obviously, because that index doesn't exist):
foreach (var largeNumber in largeNumbers)
{
    int totalIndex = 0;
    foreach (var digit in largeNumber)
    {
        if (total.Count == 0)
        {
            total[totalIndex] = digit.ToString(); // IndexOutOfRange exception occurs here
        }
        else
        {
            total[totalIndex] = (Convert.ToSByte(total[totalIndex]) + digit).ToString();
        }
        totalIndex++;
    }
}
I'm just at a loss. Any ideas on how to check whether that index exists, and if it doesn't, create it and set its underlying value equal to 0? This is just a fun exercise for me, but I'm hitting a brick wall with this lovely index portion. I've tried to use SingleOrDefault as well as ElementAtOrDefault, but they don't seem to be working so hot for me. Thanks in advance!
If your result has only a small number of missing elements (say, well under 50% missing), consider simply adding 0 to the list until you reach the necessary index. You may use a list of nullable items (i.e. List<int?>) instead of regular values (List<int>) if you care whether an item is missing or not.
Something like this (untested) sample:
// List<long> list; int index; long value
if (index >= list.Count)
{
    list.AddRange(Enumerable.Repeat(0L, index - list.Count + 1));
}
list[index] = value;
If you have a significant number of missing elements, use a Dictionary (or SortedDictionary) with (index, value) pairs:
Dictionary<int, long> items = new Dictionary<int, long>();
if (items.ContainsKey(index))
{
    items[index] = value;
}
else
{
    items.Add(index, value);
}
// note: the indexer alone covers both branches: items[index] = value;

C# fastest intersection of 2 sets of sorted numbers

I'm calculating intersection of 2 sets of sorted numbers in a time-critical part of my application. This calculation is the biggest bottleneck of the whole application so I need to speed it up.
I've tried a bunch of simple options and am currently using this:
foreach (var index in firstSet)
{
    if (secondSet.BinarySearch(index) < 0)
        continue;
    // do stuff
}
Both firstSet and secondSet are of type List<int>.
I've also tried using LINQ:
var intersection = firstSet.Where(t => secondSet.BinarySearch(t) >= 0).ToList();
and then looping through intersection.
But as both of these sets are sorted I feel there's a better way to do it. Note that I can't remove items from sets to make them smaller. Both sets usually consist of about 50 items each.
Please help me guys as I don't have a lot of time to get this thing done. Thanks.
NOTE: I'm doing this about 5.3 million times. So every microsecond counts.
If you have two sets which are both sorted, you can implement a faster intersection than anything provided out of the box with LINQ.
Basically, keep two IEnumerator<T> cursors open, one for each set. At any point, advance whichever has the smaller value. If they match at any point, advance them both, and so on until you reach the end of either iterator.
The nice thing about this is that you only need to iterate over each set once, and you can do it in O(1) memory.
Here's a sample implementation - untested, but it does compile :) It assumes that both of the incoming sequences are duplicate-free and sorted, both according to the comparer provided (pass in Comparer<T>.Default):
(There's more text at the end of the answer!)
static IEnumerable<T> IntersectSorted<T>(this IEnumerable<T> sequence1,
                                         IEnumerable<T> sequence2,
                                         IComparer<T> comparer)
{
    using (var cursor1 = sequence1.GetEnumerator())
    using (var cursor2 = sequence2.GetEnumerator())
    {
        if (!cursor1.MoveNext() || !cursor2.MoveNext())
        {
            yield break;
        }
        var value1 = cursor1.Current;
        var value2 = cursor2.Current;

        while (true)
        {
            int comparison = comparer.Compare(value1, value2);
            if (comparison < 0)
            {
                if (!cursor1.MoveNext())
                {
                    yield break;
                }
                value1 = cursor1.Current;
            }
            else if (comparison > 0)
            {
                if (!cursor2.MoveNext())
                {
                    yield break;
                }
                value2 = cursor2.Current;
            }
            else
            {
                yield return value1;
                if (!cursor1.MoveNext() || !cursor2.MoveNext())
                {
                    yield break;
                }
                value1 = cursor1.Current;
                value2 = cursor2.Current;
            }
        }
    }
}
EDIT: As noted in comments, in some cases you may have one input which is much larger than the other, in which case you could potentially save a lot of time using a binary search for each element from the smaller set within the larger set. This requires random access to the larger set, however (it's just a prerequisite of binary search). You can even make it slightly better than a naive binary search by using the match from the previous result as a lower bound for the next search. So suppose you were looking for values 1000, 2000 and 3000 in a set with every integer from 0 to 19,999. In the first iteration, you'd need to look across the whole set - your starting lower/upper indexes would be 0 and 19,999 respectively. After you'd found a match at index 1000, however, the next step (where you're looking for 2000) can start the binary search at index 1001, just past the previous match. As you progress, the range in which you need to search gradually narrows. Whether this is worth the extra implementation cost is a different matter, however.
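A sketch of that narrowing-bound variant (my own rendering, not code from the answer), using the List<T>.BinarySearch(index, count, item, comparer) overload to restrict each search to the not-yet-examined tail of the larger list; it assumes both lists are sorted with respect to the comparer:

static IEnumerable<T> IntersectWithNarrowingSearch<T>(List<T> small, List<T> large,
                                                      IComparer<T> comparer)
{
    int lowerBound = 0;   // everything before this index is already ruled out
    foreach (var item in small)
    {
        int index = large.BinarySearch(lowerBound, large.Count - lowerBound, item, comparer);
        if (index >= 0)
        {
            yield return item;
            lowerBound = index + 1;   // the next item must be past this match
        }
        else
        {
            lowerBound = ~index;      // index of the first element larger than item
        }
    }
}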
Since both lists are sorted, you can arrive at the solution by iterating over them at most once (you may also get to skip part of one list, depending on the actual values they contain).
This solution keeps a "pointer" to the part of list we have not yet examined, and compares the first not-examined number of each list between them. If one is smaller than the other, the pointer to the list it belongs to is incremented to point to the next number. If they are equal, the number is added to the intersection result and both pointers are incremented.
var firstCount = firstSet.Count;
var secondCount = secondSet.Count;
int firstIndex = 0, secondIndex = 0;
var intersection = new List<int>();

while (firstIndex < firstCount && secondIndex < secondCount)
{
    var comp = firstSet[firstIndex].CompareTo(secondSet[secondIndex]);
    if (comp < 0)
    {
        ++firstIndex;
    }
    else if (comp > 0)
    {
        ++secondIndex;
    }
    else
    {
        intersection.Add(firstSet[firstIndex]);
        ++firstIndex;
        ++secondIndex;
    }
}
The above is a textbook C-style approach of solving this particular problem, and given the simplicity of the code I would be surprised to see a faster solution.
You're using a rather inefficient LINQ method for this sort of task; you should opt for Intersect as a starting point.
var intersection = firstSet.Intersect(secondSet);
Try this. If you measure it for performance and still find it unwieldy, cry for further help (or perhaps follow Jon Skeet's approach).
I was using Jon's approach, but needed to execute this intersect hundreds of thousands of times for a bulk operation on very large sets, and needed more performance. The case I was running into had heavily imbalanced list sizes (e.g. 5 and 80,000), and I wanted to avoid iterating the entire large list.
I found that detecting the imbalance and switching to an alternate algorithm gave me huge benefits over specific data sets:
public static IEnumerable<T> IntersectSorted<T>(this List<T> sequence1,
                                                List<T> sequence2,
                                                IComparer<T> comparer)
{
    List<T> smallList = null;
    List<T> largeList = null;

    if (sequence1.Count < Math.Log(sequence2.Count, 2))
    {
        smallList = sequence1;
        largeList = sequence2;
    }
    else if (sequence2.Count < Math.Log(sequence1.Count, 2))
    {
        smallList = sequence2;
        largeList = sequence1;
    }

    if (smallList != null)
    {
        foreach (var item in smallList)
        {
            if (largeList.BinarySearch(item, comparer) >= 0)
            {
                yield return item;
            }
        }
    }
    else
    {
        // Use Jon's method
    }
}
I am still unsure about the point at which you break even; I need to do some more testing.
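If you want to probe that break-even point empirically, a rough harness like the following is one way to do it; this is a sketch with made-up sizes and iteration counts (the MakeSorted, Time, Probe, and LinearCount names are mine; needs System.Diagnostics and System.Linq):

static List<int> MakeSorted(Random rng, int count, int max) =>
    Enumerable.Range(0, count).Select(_ => rng.Next(max))
        .Distinct().OrderBy(x => x).ToList();   // Distinct may shrink the list slightly

static long Time(Action action)
{
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < 1000; i++) action();   // repeat to get measurable times
    return sw.ElapsedMilliseconds;
}

static void Probe()
{
    var rng = new Random(42);
    foreach (int largeSize in new[] { 1000, 10000, 100000 })
    {
        var small = MakeSorted(rng, 5, 1000000);
        var large = MakeSorted(rng, largeSize, 1000000);
        // linear merge walk vs. one binary search per item of the small list
        long linear = Time(() => { LinearCount(small, large); });
        long binary = Time(() => { small.Count(item => large.BinarySearch(item) >= 0); });
        Console.WriteLine("large={0}: linear={1}ms binary={2}ms", largeSize, linear, binary);
    }
}

static int LinearCount(List<int> a, List<int> b)
{
    int i = 0, j = 0, count = 0;
    while (i < a.Count && j < b.Count)
    {
        if (a[i] < b[j]) i++;
        else if (a[i] > b[j]) j++;
        else { count++; i++; j++; }
    }
    return count;
}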
try
firstSet.Intersect(secondSet).ToList()
or
firstSet.Join(secondSet, o => o, id => id, (o, id) => o)

Bag of Words representation problem

Basically I have a dictionary containing all the words of my vocabulary as keys, all with 0 as the value.
To process a document into a bag-of-words representation, I used to copy that dictionary with the appropriate IEqualityComparer and simply check whether the dictionary contained every word in the document, incrementing that word's value.
To get the array of the bag-of-words representation, I simply used the ToArray method.
This seemed to work fine, but I was just told that the dictionary doesn't guarantee the same key order, so the resulting arrays might represent the words in different orders, making them useless.
My current idea to solve this problem is to copy all the keys of the word dictionary into an ArrayList, create an array of the proper size, and then use the IndexOf method of the array list to fill the array.
So my question is: is there any better way to solve this? Mine seems kinda crude... and won't I have issues because of the IEqualityComparer?
Let me see if I understand the problem. You have two documents D1 and D2 each containing a sequence of words drawn from a known vocabulary {W1, W2... Wn}. You wish to obtain two mappings indicating the number of occurrences of each word in each document. So for D1, you might have
W1 --> 0
W2 --> 1
W3 --> 4
indicating that D1 was perhaps "W3 W2 W3 W3 W3". Perhaps D2 is "W2 W1 W2", so its mapping is
W1 --> 1
W2 --> 2
W3 --> 0
You wish to take both mappings and determine the vectors [0, 1, 4] and [1, 2, 0] and then compute the angle between those vectors as a way of determining how similar or different the two documents are.
Your problem is that the dictionary does not guarantee that the key/value pairs are enumerated in any particular order.
OK, so order them.
vector1 = (from pair in map1 orderby pair.Key select pair.Value).ToArray();
vector2 = (from pair in map2 orderby pair.Key select pair.Value).ToArray();
and you're done.
Does that solve your problem, or am I misunderstanding the scenario?
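To make the last step concrete, here is a minimal sketch of the angle computation, assuming the two ordered vectors have equal length:

// cosine similarity of two count vectors: cos θ = (v1·v2) / (|v1||v2|)
static double CosineSimilarity(int[] v1, int[] v2)
{
    double dot = 0, mag1 = 0, mag2 = 0;
    for (int i = 0; i < v1.Length; i++)
    {
        dot += (double)v1[i] * v2[i];
        mag1 += (double)v1[i] * v1[i];
        mag2 += (double)v2[i] * v2[i];
    }
    return dot / (Math.Sqrt(mag1) * Math.Sqrt(mag2));
}

// CosineSimilarity(new[] { 0, 1, 4 }, new[] { 1, 2, 0 }) ≈ 0.217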
If I understand correctly, you want to split a document by word frequency.
You could take the document and run a Regex over it to split out the words:
var words = Regex
    .Matches(input, @"\w+")
    .Cast<Match>()
    .Where(m => m.Success)
    .Select(m => m.Value);
To make the frequency map:
var map = words.GroupBy(w => w).Select(g => new { word = g.Key, frequency = g.Count() });
There are overloads of the GroupBy method that allow you to supply an alternative IEqualityComparer if this is important.
Reading your comments, to create a corresponding sequence of only frequencies:
map.Select(a => a.frequency)
This sequence will be in exactly the same order as the sequence map above.
Is this any help at all?
There is also an OrderedDictionary, which "represents a collection of key/value pairs that are accessible by the key or index."
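For what it's worth, a minimal sketch of OrderedDictionary in this role; note it is non-generic, so values come back as object and need casting (requires using System.Collections and using System.Collections.Specialized):

var counts = new OrderedDictionary();
foreach (var word in new[] { "the", "cat", "the" })
{
    if (counts.Contains(word))
        counts[word] = (int)counts[word] + 1;   // values are object, hence the cast
    else
        counts[word] = 1;
}

// enumeration follows insertion order: the=2, cat=1
foreach (DictionaryEntry e in counts)
    Console.WriteLine("{0}={1}", e.Key, e.Value);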
Something like this might work, although it is definitely ugly, and I believe it is similar to what you were suggesting. GetWordCount() does the work.
class WordCounter
{
    public Dictionary<string, int> dictionary = new Dictionary<string, int>();

    public void CountWords(string text)
    {
        if (text != null && text != string.Empty)
        {
            text = text.ToLower();
            string[] words = text.Split(' ');
            if (dictionary.ContainsKey(words[0]))
            {
                if (text.Length > words[0].Length)
                {
                    text = text.Substring(words[0].Length + 1);
                    CountWords(text);
                }
            }
            else
            {
                int count = words.Count(
                    delegate(string s)
                    {
                        if (s == words[0]) { return true; }
                        else { return false; }
                    });
                dictionary.Add(words[0], count);
                if (text.Length > words[0].Length)
                {
                    text = text.Substring(words[0].Length + 1);
                    CountWords(text);
                }
            }
        }
    }

    public int[] GetWordCount(string text)
    {
        CountWords(text);
        return dictionary.Values.ToArray<int>();
    }
}
Would this be helpful to you:
SortedDictionary<string, int> dic = new SortedDictionary<string, int>();
for (int i = 0; i < 10; i++)
{
    if (dic.ContainsKey("Word" + i))
        dic["Word" + i]++;
    else
        dic.Add("Word" + i, 0);
}

// to get the array of words:
List<string> wordsList = new List<string>(dic.Keys);
string[] wordsArr = wordsList.ToArray();

// to get the array of values:
List<int> valuesList = new List<int>(dic.Values);
int[] valuesArr = valuesList.ToArray();
If all you're trying to do is calculate cosine similarity, you don't need to convert your data to 20,000-length arrays, especially considering the data would likely be sparse, with most entries being zero.
While processing the files, store each file's output data in a Dictionary keyed on the word. Then, to calculate the dot product and magnitudes, iterate through the words in the full word list, look each word up in each file's output data, and use the found value if it exists and zero if it doesn't.
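A sketch of that sparse approach, assuming each document's counts live in a Dictionary<string, int> (uses System.Linq for Sum):

static double SparseCosine(Dictionary<string, int> d1, Dictionary<string, int> d2)
{
    double dot = 0;
    // only words present in both documents contribute to the dot product
    foreach (var pair in d1)
        if (d2.TryGetValue(pair.Key, out int other))
            dot += (double)pair.Value * other;
    double mag1 = Math.Sqrt(d1.Values.Sum(v => (double)v * v));
    double mag2 = Math.Sqrt(d2.Values.Sum(v => (double)v * v));
    return dot / (mag1 * mag2);
}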
