RavenDB 2.0: Indexing with SpatialGenerate eats up all memory - c#

Problem: In the code below I am trying to index ~26,000 geo-fences (polygons). I have tried many combinations of (1) inserting the data and (2) indexing it, some of which are omitted from the code for brevity: importing all the data against a pre-built index; importing the data and then creating the index; and creating the index and importing the data in batches while waiting for the index to become non-stale. In every case it eats up all the memory and fails (by "fails" I mean I give up after waiting 30 minutes for indexing to complete). One interesting observation from the batched inserts against a pre-built index: memory always starts to run away at around ~1,790 entries.
Environment: I have tested this on two machines: (1) 4 GB RAM, 2.56 GHz dual-core CPU; (2) 12 GB RAM, i7 CPU. I used Visual Studio 2012 (which should not matter), targeting .NET 4.0, with RavenDB build 2230. The only external library used here (other than the RavenDB client libraries) was NetTopologySuite (from NuGet).
Side note and background: I currently do my spatial calculations using two in-memory data structures, a KD-tree and an R-tree, over static data (no insert/delete/update). A new requirement was added to my project that needs to update/insert some data twice a day, so I searched for a replacement to remove the housekeeping burden of this part and came up with the SQLite R-Tree module. It is fast enough (with some parallel programming, and it is a mostly read-only database). Strategically, RavenDB is not really an option for us, but I wanted to play with it, and the experience was disappointing. To be precise: I love RavenDB 960, and I am really angry with RavenDB 2230.
I have used the in-memory trees, and tested SQLite, with ~3,000,000 polygons (again, to be precise, "bounding boxes"), yet I failed to get RavenDB to handle ~26,000 polygons. I really hope I am doing something badly wrong.
The code:
namespace MyApp
{
static partial class Program
{
static void Main(string[] args)
{
Initialization();
InitializeIndexes();
// MODE 1:
// OneTimeImport(-1, 1024, true);
// MODE 2:
// OneTimeImport(-1, 1024, false);
// MODE 3:
// PartialImport(64); and PartialImport(32); and PartialImport(16);
}
static void PartialImport(int len)
{
bool finished = false;
int index = 0;
// read finished from file
if (finished)
{
return;
}
// read index from file, if not exists then set it to 0
index = OneTimeImport(index, len, true);
if (index == 0)
{
finished = true;
}
// write index and finished to file
return;
}
static int OneTimeImport(int start, int len, bool waitForIndexing)
{
var counter = start < 0 ? 0 : start;
using (var reader = new StreamReader(@"D:\PATH\DATA.TXT", Encoding.UTF8))
using (var session = Store.OpenSession())
{
session.Advanced.MaxNumberOfRequestsPerSession = int.MaxValue - 1;
if (len < 0) len = 16;
IEnumerable<string> lines = null;
if (start < 0) { lines = reader.Lines(); }
else { lines = reader.Lines().Skip(start).Take(len); }
foreach (var line in lines)
{
var parts = line.Split(new string[] { "|" }, StringSplitOptions.None);
var lm = new Area();
lm.Description = parts[5];
var poly = string.Empty;
if (!string.IsNullOrWhiteSpace(parts[8]))
{
try
{
var list = new List<Coordinate>();
var pairs = parts[8].Split(';');
foreach (var p in pairs)
{
var coor = p.Split(',');
var lon = coor[0].ToDouble();
var lat = coor[1].ToDouble();
list.Add(new Coordinate(lon, lat));
}
if (list.Count > 0)
{
list.Add(list[0]);
var ring = new LinearRing(list.ToArray());
poly = new Polygon(ring).ToString();
}
}
catch { } // some logging
}
lm.Polygon = poly;
session.Store(lm);
counter++;
if (counter % len == 0)
{
session.SaveChanges();
if (waitForIndexing) while (Store.DatabaseCommands.GetStatistics().StaleIndexes.Length > 0) Thread.Sleep(10);
}
Trace.WriteLine(counter);
}
session.SaveChanges();
if (waitForIndexing) while (Store.DatabaseCommands.GetStatistics().StaleIndexes.Length > 0) Thread.Sleep(500);
}
return counter;
}
static void Initialization()
{
Store = new DocumentStore { ConnectionStringName = "LoxConnectionString" };
Store.Initialize();
}
static void InitializeIndexes()
{
IndexCreation.CreateIndexes(typeof(Program).Assembly, Store);
}
public static IEnumerable<string> Lines(this StreamReader reader)
{
string line = null;
while ((line = reader.ReadLine()) != null) yield return line;
}
public static double ToDouble(this string s)
{
double d = 0;
double.TryParse(s, out d);
return d;
}
public static DocumentStore Store { get; private set; }
}
public class Area
{
public string Id { get; set; }
public string Description { get; set; }
public string Polygon { get; set; }
}
public class AreaIndex : AbstractIndexCreationTask<Area>
{
public AreaIndex()
{
Map = docs => from doc in docs
select new
{
doc.Description,
_ = SpatialGenerate("AreaPolygon", doc.Polygon)
};
}
}
}
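One thing worth ruling out (a hedged aside, not a confirmed diagnosis): the import above keeps a single session open for the entire file, and a RavenDB session tracks every entity passed to Store(), so the client side alone ends up holding all ~26,000 Area instances plus their change-tracking state. A minimal sketch of per-batch sessions, reusing the Store and Area types from the code above (ImportInBatches and SaveBatch are hypothetical helpers; the batch size of 128 is arbitrary):
static void ImportInBatches(IEnumerable<Area> areas, int batchSize = 128)
{
    var batch = new List<Area>(batchSize);
    foreach (var area in areas)
    {
        batch.Add(area);
        if (batch.Count == batchSize)
        {
            SaveBatch(batch);
            batch.Clear();
        }
    }
    if (batch.Count > 0) SaveBatch(batch); // flush the final partial batch
}
static void SaveBatch(List<Area> batch)
{
    // A fresh session per batch: disposing it releases all tracked entities,
    // so client-side memory stays bounded by the batch size.
    using (var session = Store.OpenSession())
    {
        foreach (var area in batch) session.Store(area);
        session.SaveChanges();
    }
}
Note this only bounds client-side growth; if the server process itself balloons while building the spatial index, that points at the indexing side of build 2230 rather than at the import loop.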

Related

Simulating message latency accurately

I have written a simple "latency simulator" which works, but at times messages are delayed for longer than the time specified. I need help ensuring that messages are delayed for the correct amount of time.
The main problem, I believe, is that I am using Thread.Sleep(x), whose accuracy depends on various factors but mainly on the clock interrupt rate, which gives Thread.Sleep() a resolution of roughly 15 ms. Furthermore, intensive tasks demand more CPU time and occasionally result in a delay greater than the one requested. If you are not familiar with the resolution issues of Thread.Sleep, you can read these SO posts: here, here and here.
This is my LatencySimulator:
public class LatencySimulatorResult: EventArgs
{
public int messageNumber { get; set; }
public byte[] message { get; set; }
}
public class LatencySimulator
{
private int messageNumber;
private int latency = 0;
private int processedMessageCount = 0;
public event EventHandler messageReady;
public void Delay(byte[] message, int delay)
{
latency = delay;
var result = new LatencySimulatorResult();
result.message = message;
result.messageNumber = messageNumber;
if (latency == 0)
{
if (messageReady != null)
messageReady(this, result);
}
else
{
ThreadPool.QueueUserWorkItem(ThreadPoolCallback, result);
}
Interlocked.Increment(ref messageNumber);
}
private void ThreadPoolCallback(object threadContext)
{
Thread.Sleep(latency);
var next = (LatencySimulatorResult)threadContext;
var ready = next.messageNumber == processedMessageCount + 1;
while (ready == false)
{
ready = next.messageNumber == processedMessageCount + 1;
}
if (messageReady != null)
messageReady(this, next);
Interlocked.Increment(ref processedMessageCount);
}
}
To use it, you create a new instance and bind to the event handler:
var latencySimulator = new LatencySimulator();
latencySimulator.messageReady += MessageReady;
You then call latencySimulator.Delay(someBytes, someDelay);
When a message has finished being delayed, the event is fired and you can then process the delayed message.
It is important that the order in which messages are added is maintained. I cannot have them coming out the other end of the latency simulator in some random order.
Here is a test program to use the latency simulator and to see how long messages have been delayed for:
private static LatencySimulator latencySimulator;
private static ConcurrentDictionary<int, PendingMessage> pendingMessages;
private static List<long> measurements;
static void Main(string[] args)
{
var results = TestLatencySimulator();
var anomalies = results.Result.Where(x=>x > 32).ToList();
foreach (var result in anomalies)
{
Console.WriteLine(result);
}
Console.ReadLine();
}
static async Task<List<long>> TestLatencySimulator()
{
latencySimulator = new LatencySimulator();
latencySimulator.messageReady += MessageReady;
var numberOfMeasurementsMax = 1000;
pendingMessages = new ConcurrentDictionary<int, PendingMessage>();
measurements = new List<long>();
var sendTask = Task.Factory.StartNew(() =>
{
for (var i = 0; i < numberOfMeasurementsMax; i++)
{
var message = new Message { Id = i };
pendingMessages.TryAdd(i, new PendingMessage() { Id = i });
latencySimulator.Delay(Serialize(message), 30);
Thread.Sleep(50);
}
});
//Spin some tasks up to simulate high CPU usage
Task.Factory.StartNew(() => { FindPrimeNumber(100000); });
Task.Factory.StartNew(() => { FindPrimeNumber(100000); });
Task.Factory.StartNew(() => { FindPrimeNumber(100000); });
sendTask.Wait();
return measurements;
}
static long FindPrimeNumber(int n)
{
int count = 0;
long a = 2;
while (count < n)
{
long b = 2;
int prime = 1;// to check if found a prime
while (b * b <= a)
{
if (a % b == 0)
{
prime = 0;
break;
}
b++;
}
if (prime > 0)
{
count++;
}
a++;
}
return (--a);
}
private static void MessageReady(object sender, EventArgs e)
{
LatencySimulatorResult result = (LatencySimulatorResult)e;
var message = (Message)Deserialize(result.message);
if (pendingMessages.ContainsKey(message.Id) != true) return;
pendingMessages[message.Id].stopwatch.Stop();
measurements.Add(pendingMessages[message.Id].stopwatch.ElapsedMilliseconds);
}
static object Deserialize(byte[] arrBytes)
{
using (var memStream = new MemoryStream())
{
var binForm = new BinaryFormatter();
memStream.Write(arrBytes, 0, arrBytes.Length);
memStream.Seek(0, SeekOrigin.Begin);
var obj = binForm.Deserialize(memStream);
return obj;
}
}
static byte[] Serialize<T>(T obj)
{
BinaryFormatter bf = new BinaryFormatter();
using (var ms = new MemoryStream())
{
bf.Serialize(ms, obj);
return ms.ToArray();
}
}
If you run this code, you will see that about 5% of the messages are delayed for more than the expected 30ms. In fact, some are as high as 60ms. Without any background tasks or high CPU usage, the simulator behaves as expected.
I need them all to be 30 ms (or as close as possible) - I do not want arbitrary 50-60 ms delays.
Can anyone suggest how I can refactor this code so that I can achieve the desired result, but without the use of Thread.Sleep() and with as little CPU overhead as possible?
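One possible direction (a sketch under assumptions, not a definitive implementation): push every message through a single consumer thread so FIFO order holds by construction, and time each message with a Stopwatch started at enqueue. Thread.Sleep still covers the coarse part of the wait, but the last clock-interrupt period is spun out with SpinWait so the ~15 ms granularity stops mattering; the cost is a short burst of CPU per message. OrderedDelayQueue and its members are hypothetical names, not part of the code above.
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading;
public class OrderedDelayQueue
{
    private readonly BlockingCollection<(Stopwatch Started, int DelayMs, byte[] Message)> _queue
        = new BlockingCollection<(Stopwatch Started, int DelayMs, byte[] Message)>();
    public event EventHandler<byte[]> MessageReady;
    public OrderedDelayQueue()
    {
        // A single consumer means messages leave in exactly the order they arrived.
        new Thread(Consume) { IsBackground = true }.Start();
    }
    public void Delay(byte[] message, int delayMs)
    {
        // The stopwatch starts now, so any time spent queued counts toward the delay.
        _queue.Add((Stopwatch.StartNew(), delayMs, message));
    }
    private void Consume()
    {
        foreach (var (started, delayMs, message) in _queue.GetConsumingEnumerable())
        {
            long remaining;
            // Coarse wait: sleep, but leave a full clock-interrupt period (~16 ms)
            // of margin so an oversleep cannot push us past the deadline.
            while ((remaining = delayMs - started.ElapsedMilliseconds) > 16)
                Thread.Sleep((int)(remaining - 16));
            // Fine wait: spin out the last stretch at sub-millisecond resolution.
            // This burns up to ~16 ms of CPU per message - the price of precision.
            while (started.ElapsedMilliseconds < delayMs)
                Thread.SpinWait(50);
            MessageReady?.Invoke(this, message);
        }
    }
}
Because there is only one consumer, a message that arrives late can still push later messages past their own deadlines, exactly as the ordering requirement above demands; the stopwatch-per-message design at least keeps each delay measured from enqueue time rather than from dequeue time.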

Applying Algorithm to multiple files

I'm currently working on a project where I have to choose a file and then sort it along with other files. The files are just filled with numbers, but the files are linked: the first number in one file corresponds to the first number in the second file, and so on. I currently have code that lets me read a file and display it unsorted and sorted, using bubble sort. I am not sure how to apply the same principle to multiple files at once, so that I can choose a file and sort it in line with the others using the same method I have for a single file.
Currently, the program loads and asks the user to choose between 1 and 2: option 1 shows the data unsorted and option 2 shows it sorted. I can read in the files, but the problem is sorting and displaying them in order. Basically: how do I sort multiple files that are linked together, and what steps do I need to take to do this?
An example of one file:
4
28
77
96
An example of the second file:
66.698
74.58
2.54
48.657
Any help would be appreciated.
Thanks
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
public class Program
{
public Program()
{
}
public void ReadFile(int [] numbers)
{
string path = "Data1/Day_1.txt";
StreamReader sr = new StreamReader(path);
for (int index = 0; index < numbers.Length; index++)
{
numbers[index] = Convert.ToInt32(sr.ReadLine());
}
sr.Close(); // closes file when done
}
public void SortArray(int[] numbers)
{
bool swap;
int temp;
do
{
swap = false;
for (int index = 0; index < (numbers.Length - 1); index++)
{
if (numbers[index] > numbers[index + 1])
{
temp = numbers[index];
numbers[index] = numbers[index + 1];
numbers[index + 1] = temp;
swap = true;
}
}
} while (swap == true);
}
public void DisplayArray(int[] numbers)
{
for(int index = 0; index < numbers.Length; index++)
{
Console.WriteLine("{0}", numbers[index]);
}
}
}
The main is in another file to keep work organised:
using System;
public class FileDemoTest
{
public static void Main(string[] args)
{
int[] numbers = new int[300];
Program obj = new Program();
int operation = 0;
Console.WriteLine("1 or 2 ?");
operation = Convert.ToInt32(Console.ReadLine());
// call the read from file method
obj.ReadFile(numbers);
if (operation == 1)
{
//Display unsorted values
Console.Write("Unsorted:");
obj.DisplayArray(numbers);
}
if (operation == 2)
{
//sort numbers and display
obj.SortArray(numbers);
Console.Write("Sorted: ");
obj.DisplayArray(numbers);
}
}
}
What I would do is create a class that will hold the values from file 1 and file 2. Then you can populate a list of these classes by reading values from both files. After that, you can sort the list of classes on either field, and the relationships will be maintained (since the two values are stored in a single object).
Here's an example of the class that would hold the file data:
public class FileData
{
public int File1Value { get; set; }
public decimal File2Value { get; set; }
/// <summary>
/// Provides a friendly string representation of this object
/// </summary>
/// <returns>A string showing the File1 and File2 values</returns>
public override string ToString()
{
return $"{File1Value}: {File2Value}";
}
}
Then you can create a method that reads both files and creates and returns a list of these objects:
// Note: this method needs using System.IO; and using System.Linq; at the top of the file
public static FileData[] GetFileData(string firstFilePath, string secondFilePath)
{
// These guys hold the strongly typed version of the string values in the files
int intHolder = 0;
decimal decHolder = 0;
// Get a list of ints from the first file
var fileOneValues = File
.ReadAllLines(firstFilePath)
.Where(line => int.TryParse(line, out intHolder))
.Select(v => intHolder)
.ToArray();
// Get a list of decimals from the second file
var fileTwoValues = File
.ReadAllLines(secondFilePath)
.Where(line => decimal.TryParse(line, out decHolder))
.Select(v => decHolder)
.ToArray();
// I guess the file lengths should match, but in case they don't,
// use the size of the smaller one so we have matches for all items
var numItems = Math.Min(fileOneValues.Count(), fileTwoValues.Count());
// Populate an array of new FileData objects
var fileData = new FileData[numItems];
for (var index = 0; index < numItems; index++)
{
fileData[index] = new FileData
{
File1Value = fileOneValues[index],
File2Value = fileTwoValues[index]
};
}
return fileData;
}
Now, we need to modify your sorting method to work on a FileData array instead of an int array. I also added an argument that, if set to false, will sort on the File2Data field instead of File1Data:
public static void SortArray(FileData[] fileData, bool sortOnFile1Data = true)
{
bool swap;
do
{
swap = false;
for (int index = 0; index < (fileData.Length - 1); index++)
{
bool comparison = sortOnFile1Data
? fileData[index].File1Value > fileData[index + 1].File1Value
: fileData[index].File2Value > fileData[index + 1].File2Value;
if (comparison)
{
var temp = fileData[index];
fileData[index] = fileData[index + 1];
fileData[index + 1] = temp;
swap = true;
}
}
} while (swap);
}
And finally, you can display the non-sorted and sorted lists of data. Note I added a second question to the user if they choose "Sorted" where they can decide if it should be sorted by File1 or File2:
private static void Main()
{
Console.WriteLine("1 or 2 ?");
int operation = Convert.ToInt32(Console.ReadLine());
var fileData = GetFileData(@"f:\public\temp\temp1.txt", @"f:\public\temp\temp2.txt");
if (operation == 1)
{
//Display unsorted values
Console.WriteLine("Unsorted:");
foreach (var data in fileData)
{
Console.WriteLine(data);
}
}
if (operation == 2)
{
Console.WriteLine("Sort on File1 or File2 (1 or 2)?");
operation = Convert.ToInt32(Console.ReadLine());
//sort numbers and display
SortArray(fileData, operation == 1);
Console.WriteLine("Sorted: ");
foreach (var data in fileData)
{
Console.WriteLine(data);
}
}
Console.Write("\nDone!\nPress any key to exit...");
Console.ReadKey();
}
If you wanted to use an int to decide which field to sort on, you could do something like the following:
public static void SortArray(FileData[] fileData, int sortFileNumber = 1)
{
bool swap;
do
{
swap = false;
for (int index = 0; index < (fileData.Length - 1); index++)
{
bool comparison;
// Set our comparison to the field sortFileNumber
if (sortFileNumber == 1)
{
comparison = fileData[index].File1Value > fileData[index + 1].File1Value;
}
else if (sortFileNumber == 2)
{
comparison = fileData[index].File2Value > fileData[index + 1].File2Value;
}
else // anything else falls back to File1Value (the FileData class above has no third field)
{
comparison = fileData[index].File1Value > fileData[index + 1].File1Value;
}
}
if (comparison)
{
var temp = fileData[index];
fileData[index] = fileData[index + 1];
fileData[index + 1] = temp;
swap = true;
}
}
} while (swap);
}

Importing and removing duplicates from a massive amount of text files using C# and Redis

This is a bit of a doozy and it's been a while since I worked with C#, so bear with me:
I'm running a jruby script to iterate through 900 files (5 Mb - 1500 Mb in size) to figure out how many dupes STILL exist within these (already uniq'd) files. I had little luck with awk.
My latest idea was to insert them into a local MongoDB instance like so:
db.collection('hashes').update({ :_id => hash }, { $inc: { count: 1 } }, { upsert: true })
... so that later I could just query it like db.collection.where({ count: { $gt: 1 } }) to get all the dupes.
This is working great except it's been over 24 hours and at the time of writing I'm at 72,532,927 Mongo entries and growing.
I think Ruby's .each_line is bottlenecking the IO hardcore.
So what I'm thinking now is compiling a C# program which fires up a thread PER EACH FILE and inserts the line (md5 hash) into a Redis list.
From there, I could have another compiled C# program simply pop the values off and ignore the save if the count is 1.
So the questions are:
Will using a compiled file reader and multithreading the file reads significantly improve performance?
Is using Redis even necessary? With a tremendous amount of AWS memory, could I not just use the threads to fill some sort of a list atomically and proceed from there?
Thanks in advance.
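On the second question, here is a sketch of the "no Redis" variant the question hints at (assuming one md5 hash per line and that the set of distinct hashes fits in RAM; the directory path and file pattern are placeholders). ConcurrentDictionary gives you the atomic shared structure, and File.ReadLines streams each file instead of loading it whole:
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
class DupeCounter
{
    static void Main()
    {
        var counts = new ConcurrentDictionary<string, int>();
        // One parallel body per file; AddOrUpdate is atomic per key.
        Parallel.ForEach(Directory.EnumerateFiles(@"D:\hashes", "*.txt"), path =>
        {
            foreach (var line in File.ReadLines(path)) // streams; never loads the whole file
                counts.AddOrUpdate(line, 1, (_, c) => c + 1);
        });
        var dupes = counts.Where(kvp => kvp.Value > 1).Select(kvp => kvp.Key).ToList();
        Console.WriteLine("Duplicates across all files: " + dupes.Count);
    }
}
Whether this beats Redis is mostly an IO question: with 900 files on the same disk the parallel loop is likely disk-bound, so the win over the Mongo approach comes from skipping the per-line network round-trip, not from the threads themselves.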
Updated
The main idea is to calculate a cheap dummy hash (just the sum of all chars in the string) for each line and store it in a Dictionary<ulong, List<LinePosition>> _hash2LinePositions. Multiple lines in the same stream can produce the same hash; the List in the dictionary value handles that, and when hashes collide, the actual strings are read back from the streams and compared. LinePosition stores a line's position in the stream and its length. I don't have files as huge as yours, but my tests show that it works. Here is the full code:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
public class Solution
{
struct LinePosition
{
public long Start;
public long Length;
public LinePosition(long start, long count)
{
Start = start;
Length = count;
}
public override string ToString()
{
return string.Format("Start: {0}, Length: {1}", Start, Length);
}
}
class TextFileHasher : IDisposable
{
readonly Dictionary<ulong, List<LinePosition>> _hash2LinePositions;
readonly Stream _stream;
bool _isDisposed;
public HashSet<ulong> Hashes { get; private set; }
public string Name { get; private set; }
public TextFileHasher(string name, Stream stream)
{
Name = name;
_stream = stream;
_hash2LinePositions = new Dictionary<ulong, List<LinePosition>>();
Hashes = new HashSet<ulong>();
}
public override string ToString()
{
return Name;
}
public void CalculateFileHash()
{
int readByte = -1;
ulong dummyLineHash = 0;
// Line start position in file
long startPosition = 0;
while ((readByte = _stream.ReadByte()) != -1) {
// Read until new line
if (readByte == '\r' || readByte == '\n') {
// If there was data
if (dummyLineHash != 0) {
// Add line hash and line position to the dict
AddToDictAndHash(dummyLineHash, startPosition, _stream.Position - 1 - startPosition);
// Reset line hash
dummyLineHash = 0;
}
}
else {
// Was it new line ?
if (dummyLineHash == 0)
startPosition = _stream.Position - 1;
// Calculate dummy hash
dummyLineHash += (uint)readByte;
}
}
if (dummyLineHash != 0) {
// Add line hash and line position to the dict
AddToDictAndHash(dummyLineHash, startPosition, _stream.Position - startPosition);
// Reset line hash
dummyLineHash = 0;
}
}
public List<LinePosition> GetLinePositions(ulong hash)
{
return _hash2LinePositions[hash];
}
public List<string> GetDuplicates()
{
List<string> duplicates = new List<string>();
foreach (var key in _hash2LinePositions.Keys) {
List<LinePosition> linesPos = _hash2LinePositions[key];
if (linesPos.Count > 1) {
duplicates.AddRange(FindExactDuplicates(linesPos));
}
}
return duplicates;
}
public void Dispose()
{
if (_isDisposed)
return;
_stream.Dispose();
_isDisposed = true;
}
private void AddToDictAndHash(ulong hash, long start, long count)
{
List<LinePosition> linesPosition;
if (!_hash2LinePositions.TryGetValue(hash, out linesPosition)) {
linesPosition = new List<LinePosition>() { new LinePosition(start, count) };
_hash2LinePositions.Add(hash, linesPosition);
}
else {
linesPosition.Add(new LinePosition(start, count));
}
Hashes.Add(hash);
}
public byte[] GetLineAsByteArray(LinePosition prevPos)
{
long len = prevPos.Length;
byte[] lineBytes = new byte[len];
_stream.Seek(prevPos.Start, SeekOrigin.Begin);
_stream.Read(lineBytes, 0, (int)len);
return lineBytes;
}
private List<string> FindExactDuplicates(List<LinePosition> linesPos)
{
List<string> duplicates = new List<string>();
linesPos.Sort((x, y) => x.Length.CompareTo(y.Length));
LinePosition prevPos = linesPos[0];
for (int i = 1; i < linesPos.Count; i++) {
if (prevPos.Length == linesPos[i].Length) {
var prevLineArray = GetLineAsByteArray(prevPos);
var thisLineArray = GetLineAsByteArray(linesPos[i]);
if (prevLineArray.SequenceEqual(thisLineArray)) {
var line = System.Text.Encoding.Default.GetString(prevLineArray);
duplicates.Add(line);
}
#if false
string prevLine = System.Text.Encoding.Default.GetString(prevLineArray);
string thisLine = System.Text.Encoding.Default.GetString(thisLineArray);
Console.WriteLine("PrevLine: {0}\r\nThisLine: {1}", prevLine, thisLine);
StringBuilder sb = new StringBuilder();
sb.Append(prevPos);
sb.Append(" is '");
sb.Append(prevLine);
sb.Append("'. ");
sb.AppendLine();
sb.Append(linesPos[i]);
sb.Append(" is '");
sb.Append(thisLine);
sb.AppendLine("'. ");
sb.Append("Equals => ");
sb.Append(prevLine.CompareTo(thisLine) == 0);
Console.WriteLine(sb.ToString());
#endif
}
else {
prevPos = linesPos[i];
}
}
return duplicates;
}
}
public static void Main(String[] args)
{
List<TextFileHasher> textFileHashers = new List<TextFileHasher>();
string text1 = "abc\r\ncba\r\nabc";
TextFileHasher tfh1 = new TextFileHasher("Text1", new MemoryStream(System.Text.Encoding.Default.GetBytes(text1)));
tfh1.CalculateFileHash();
textFileHashers.Add(tfh1);
string text2 = "def\r\ncba\r\nwet";
TextFileHasher tfh2 = new TextFileHasher("Text2", new MemoryStream(System.Text.Encoding.Default.GetBytes(text2)));
tfh2.CalculateFileHash();
textFileHashers.Add(tfh2);
string text3 = "def\r\nbla\r\nwat";
TextFileHasher tfh3 = new TextFileHasher("Text3", new MemoryStream(System.Text.Encoding.Default.GetBytes(text3)));
tfh3.CalculateFileHash();
textFileHashers.Add(tfh3);
List<string> totalDuplicates = new List<string>();
Dictionary<ulong, Dictionary<TextFileHasher, List<LinePosition>>> totalHashes = new Dictionary<ulong, Dictionary<TextFileHasher, List<LinePosition>>>();
textFileHashers.ForEach(tfh => {
foreach(var dummyHash in tfh.Hashes) {
Dictionary<TextFileHasher, List<LinePosition>> tfh2LinePositions = null;
if (!totalHashes.TryGetValue(dummyHash, out tfh2LinePositions))
totalHashes[dummyHash] = new Dictionary<TextFileHasher, List<LinePosition>>() { { tfh, tfh.GetLinePositions(dummyHash) } };
else {
List<LinePosition> linePositions = null;
if (!tfh2LinePositions.TryGetValue(tfh, out linePositions))
tfh2LinePositions[tfh] = tfh.GetLinePositions(dummyHash);
else
linePositions.AddRange(tfh.GetLinePositions(dummyHash));
}
}
});
HashSet<TextFileHasher> alreadyGotDuplicates = new HashSet<TextFileHasher>();
foreach(var hash in totalHashes.Keys) {
var tfh2LinePositions = totalHashes[hash];
var tfh = tfh2LinePositions.Keys.FirstOrDefault();
// Get duplicates in the TextFileHasher itself
if (tfh != null && !alreadyGotDuplicates.Contains(tfh)) {
totalDuplicates.AddRange(tfh.GetDuplicates());
alreadyGotDuplicates.Add(tfh);
}
if (tfh2LinePositions.Count <= 1) {
continue;
}
// Algo to get duplicates in more than 1 TextFileHashers
var tfhs = tfh2LinePositions.Keys.ToArray();
for (int i = 0; i < tfhs.Length; i++) {
var tfh1Positions = tfhs[i].GetLinePositions(hash);
for (int j = i + 1; j < tfhs.Length; j++) {
var tfh2Positions = tfhs[j].GetLinePositions(hash);
for (int k = 0; k < tfh1Positions.Count; k++) {
var tfh1Pos = tfh1Positions[k];
var tfh1ByteArray = tfhs[i].GetLineAsByteArray(tfh1Pos);
for (int m = 0; m < tfh2Positions.Count; m++) {
var tfh2Pos = tfh2Positions[m];
if (tfh1Pos.Length != tfh2Pos.Length)
continue;
var tfh2ByteArray = tfhs[j].GetLineAsByteArray(tfh2Pos);
if (tfh1ByteArray.SequenceEqual(tfh2ByteArray)) {
var line = System.Text.Encoding.Default.GetString(tfh1ByteArray);
totalDuplicates.Add(line);
}
}
}
}
}
}
Console.WriteLine();
if (totalDuplicates.Count > 0) {
Console.WriteLine("Total number of duplicates: {0}", totalDuplicates.Count);
Console.WriteLine("#######################");
totalDuplicates.ForEach(x => Console.WriteLine("{0}", x));
Console.WriteLine("#######################");
}
// Free resources
foreach (var tfh in textFileHashers)
tfh.Dispose();
}
}
If you have tons of RAM... you guys are overthinking it:
var fileLines = File.ReadAllLines(@"c:\file.csv").Distinct();

Dictionary or list in c#

I've got a strange C# programming problem. Data is retrieved in groups of random length, and the numbers should all be unique across groups, like:
group[1]{1,2,15};
group[2]{3,4,7,33,22,100};
group[3]{11,12,9};
// Now there is a routine that adds a number to a group.
// For the example, just imagine the active group looks like:
// group[active]=(10,5,0)
group[active].add(new_number);
// Now if 100 were to be added to the active group
// then the active group should be merged to group[2]
// (as that one already contained 100)
// And then as a result it would like
group[1]{1,2,15};
group[2]{3,4,7,33,22,100,10,5,0}; // 10 5 0 added to group[2]
group[3]{11,12,9};
// 100 wasn't added to group[2] since it was already in there.
If the number to be added is already used (not unique) in a previous group, then I should merge all numbers in the active group into that previous group, so that I don't get duplicate numbers.
So in the above example, if the number 100 were added to the active group, then all numbers in group[active] should be merged into group[2].
Then group[active] should start fresh again, without any items. And since 100 was already in group[2], it should not be added twice.
I am not entirely sure how to deal with this in a proper way. An important criterion here is that it has to work fast.
I will have a minimum of around 30 groups (the upper bound is unknown; it might be 2,000 or more), and on average a group contains five integers, but it could be much longer or hold only a single number.
I have the feeling I am reinventing the wheel here.
I wonder what this problem is called (does it go by a name, some sorting or grouping problem in mathematics?); with a name I might find some articles related to such problems.
But maybe it is indeed something new; in that case, what would be recommended? Should I use a list of lists, a dictionary of lists, or something else? Whatever the structure, checking whether a number is already present has to be fast.
I'm thinking along this path now and am not sure it's the best.
Instead of a single number, I now use a struct. This wasn't in the original question because I was afraid explaining it would make things too complex:
struct data { int ID; int additionalNumber }
Dictionary<int, List<data>> group = new Dictionary<int, List<data>>();
I can step away from using a struct here; a lookup list could connect the other data to the proper index, which brings it back closer to the original description.
On a side note, great answers have been given. So far I don't know yet which would work best in my situation.
Note on the selected answer
Several answers were given here, but I went with the pure dictionary solution. Just as a note for people in similar scenarios: I'd still recommend testing, as one of the others may work better for you; it's just that in my case this one currently worked best. The code was also quite short, which I liked, and a dictionary gives me other handy options for my future work on this.
I would go with Dictionary<int, HashSet<int>>, since you want to avoid duplicates and want a fast way to check whether a given number already exists:
Usage example:
var groups = new Dictionary<int, HashSet<int>>();
// populate the groups
groups[1] = new HashSet<int>(new[] { 1,2,15 });
groups[2] = new HashSet<int>(new[] { 3,4,7,33,22,100 });
int number = 5;
int groupId = 4;
groups[groupId] = new HashSet<int>(); // make sure the active group exists before using it
bool numberExists = groups.Values.Any(x => x.Contains(number));
// if there is already a group that contains the number
// merge it with the current group and add the new number
if (numberExists)
{
var group = groups.First(kvp => kvp.Value.Contains(number));
groups[group.Key].UnionWith(groups[groupId]);
groups[groupId] = new HashSet<int>();
}
// otherwise just add the new number
else
{
groups[groupId].Add(number);
}
From what I gather you want to iteratively assign numbers to groups satisfying these conditions:
Each number can be contained in only one of the groups
Groups are sets (numbers can occur only once in given group)
If number n exists in group g and we try to add it to group g', all numbers from g' should be transferred to g instead (avoiding repetitions in g)
Although approaches utilizing Dictionary<int, HashSet<int>> are correct, here's another one (more mathematically based).
You could simply maintain a Dictionary<int, int>, in which the key would be the number, and the corresponding value would indicate the group, to which that number belongs (this stems from condition 1.). And here's the add routine:
//let's assume dict is a reference to the dictionary
//k is a number, and g is a group
void AddNumber(int k, int g)
{
//if k already has assigned a group, we assign all numbers from g
//to k's group (which should be O(n))
if(dict.ContainsKey(k) && dict[k] != g)
{
foreach(var keyValuePair in dict.Where(kvp => kvp.Value == g).ToList())
dict[keyValuePair.Key] = dict[k];
}
//otherwise simply assign number k to group g (which should be O(1))
else
{
dict[k] = g;
}
}
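A hypothetical usage example, reproducing the scenario from the question (100 already maps to group 2, so adding it to the active group 4 pulls group 4's members into group 2):
var dict = new Dictionary<int, int>();
foreach (var n in new[] { 3, 4, 7, 33, 22, 100 }) dict[n] = 2; // group[2]
foreach (var n in new[] { 10, 5, 0 }) dict[n] = 4;             // group[active]
AddNumber(100, 4); // 100 is taken, so group 4 is merged into group 2
// dict now maps 10, 5 and 0 to group 2; group 4 is empty.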
Notice that from a mathematical point of view what you want to model is a function from a set of numbers to a set of groups.
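As for the side question about whether this problem has a name: merging groups whenever a member collides is essentially the classic disjoint-set (union-find) problem, which is a good search term. A minimal sketch with path compression (illustrative, not tuned; adding union by rank makes the operations near O(1) amortized):
public class DisjointSet
{
    private readonly Dictionary<int, int> _parent = new Dictionary<int, int>();
    // Returns the representative ("group id") of x, creating a singleton
    // group on first sight. Path compression flattens lookup chains.
    public int Find(int x)
    {
        if (!_parent.ContainsKey(x)) _parent[x] = x;
        if (_parent[x] != x) _parent[x] = Find(_parent[x]);
        return _parent[x];
    }
    // Merges the groups containing a and b.
    public void Union(int a, int b)
    {
        int ra = Find(a), rb = Find(b);
        if (ra != rb) _parent[ra] = rb;
    }
}
Adding a number to the active group then becomes Union(number, activeRepresentative), and "are these in the same group?" becomes Find(a) == Find(b).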
I have kept it as easy to follow as I can, trying not to impact the speed or deviate from the spec.
Create a class called Groups.cs and copy and paste this code into it:
using System;
using System.Collections.Generic;
namespace XXXNAMESPACEXXX
{
public static class Groups
{
public static List<List<int>> group { get; set; }
public static int active { get; set; }
public static void AddNumberToGroup(int numberToAdd, int groupToAddItTo)
{
try
{
if (group == null)
{
group = new List<List<int>>();
}
while (group.Count < groupToAddItTo)
{
group.Add(new List<int>());
}
int IndexOfListToRefresh = -1;
List<int> NumbersToMove = new List<int>();
foreach (List<int> Numbers in group)
{
if (Numbers.Contains(numberToAdd) && (group.IndexOf(Numbers) + 1) != groupToAddItTo)
{
active = group.IndexOf(Numbers) + 1;
IndexOfListToRefresh = group.IndexOf(Numbers);
foreach (int Number in Numbers)
{
NumbersToMove.Add(Number);
}
}
}
foreach (int Number in NumbersToMove)
{
if (!group[groupToAddItTo - 1].Contains(Number))
{
group[groupToAddItTo - 1].Add(Number);
}
}
if (!group[groupToAddItTo - 1].Contains(numberToAdd))
{
group[groupToAddItTo - 1].Add(numberToAdd);
}
if (IndexOfListToRefresh != -1)
{
group[IndexOfListToRefresh] = new List<int>();
}
}
catch//(Exception ex)
{
//Exception handling here
}
}
public static string GetString()
{
string MethodResult = "";
try
{
string Working = "";
bool FirstPass = true;
foreach (List<int> Numbers in group)
{
if (!FirstPass)
{
Working += "\r\n";
}
else
{
FirstPass = false;
}
Working += "group[" + (group.IndexOf(Numbers) + 1) + "]{";
bool InnerFirstPass = true;
foreach (int Number in Numbers)
{
if (!InnerFirstPass)
{
Working += ", ";
}
else
{
InnerFirstPass = false;
}
Working += Number.ToString();
}
Working += "};";
if ((active - 1) == group.IndexOf(Numbers))
{
Working += " //<active>";
}
}
MethodResult = Working;
}
catch//(Exception ex)
{
//Exception handling here
}
return MethodResult;
}
}
}
I don't know if foreach is more or less efficient than standard for loops, so I have made an alternative version that uses standard for loops:
using System;
using System.Collections.Generic;
namespace XXXNAMESPACEXXX
{
public static class Groups
{
public static List<List<int>> group { get; set; }
public static int active { get; set; }
public static void AddNumberToGroup(int numberToAdd, int groupToAddItTo)
{
try
{
if (group == null)
{
group = new List<List<int>>();
}
while (group.Count < groupToAddItTo)
{
group.Add(new List<int>());
}
int IndexOfListToRefresh = -1;
List<int> NumbersToMove = new List<int>();
for(int i = 0; i < group.Count; i++)
{
List<int> Numbers = group[i];
int IndexOfNumbers = group.IndexOf(Numbers) + 1;
if (Numbers.Contains(numberToAdd) && IndexOfNumbers != groupToAddItTo)
{
active = IndexOfNumbers;
IndexOfListToRefresh = IndexOfNumbers - 1;
for (int j = 0; j < Numbers.Count; j++)
{
int Number = Numbers[j]; // read from Numbers, not the (still empty) NumbersToMove
NumbersToMove.Add(Number);
}
}
}
for(int i = 0; i < NumbersToMove.Count; i++)
{
int Number = NumbersToMove[i];
if (!group[groupToAddItTo - 1].Contains(Number))
{
group[groupToAddItTo - 1].Add(Number);
}
}
if (!group[groupToAddItTo - 1].Contains(numberToAdd))
{
group[groupToAddItTo - 1].Add(numberToAdd);
}
if (IndexOfListToRefresh != -1)
{
group[IndexOfListToRefresh] = new List<int>();
}
}
catch//(Exception ex)
{
//Exception handling here
}
}
public static string GetString()
{
string MethodResult = "";
try
{
string Working = "";
bool FirstPass = true;
for(int i = 0; i < group.Count; i++)
{
List<int> Numbers = group[i];
if (!FirstPass)
{
Working += "\r\n";
}
else
{
FirstPass = false;
}
Working += "group[" + (group.IndexOf(Numbers) + 1) + "]{";
bool InnerFirstPass = true;
for(int j = 0; j < Numbers.Count; j++)
{
int Number = Numbers[j];
if (!InnerFirstPass)
{
Working += ", ";
}
else
{
InnerFirstPass = false;
}
Working += Number.ToString();
}
Working += "};";
if ((active - 1) == group.IndexOf(Numbers))
{
Working += " //<active>";
}
}
MethodResult = Working;
}
catch//(Exception ex)
{
//Exception handling here
}
return MethodResult;
}
}
}
Both implimentations contain the group variable and two methods, which are; AddNumberToGroup and GetString, where GetString is used to check the current status of the group variable.
Note: You'll need to replace XXXNAMESPACEXXX with the Namespace of your project. Hint: Take this from another class.
When adding an item to your List, do this:
int NumberToAdd = 10;
int GroupToAddItTo = 2;
AddNumberToGroup(NumberToAdd, GroupToAddItTo);
...or...
AddNumberToGroup(10, 2);
In the example above, I am adding the number 10 to group 2.
Test the speed with the following:
DateTime StartTime = DateTime.Now;
int NumberOfTimesToRepeatTest = 1000;
for (int i = 0; i < NumberOfTimesToRepeatTest; i++)
{
Groups.AddNumberToGroup(4, 1);
Groups.AddNumberToGroup(3, 1);
Groups.AddNumberToGroup(8, 2);
Groups.AddNumberToGroup(5, 2);
Groups.AddNumberToGroup(7, 3);
Groups.AddNumberToGroup(3, 3);
Groups.AddNumberToGroup(8, 4);
Groups.AddNumberToGroup(43, 4);
Groups.AddNumberToGroup(100, 5);
Groups.AddNumberToGroup(1, 5);
Groups.AddNumberToGroup(5, 6);
Groups.AddNumberToGroup(78, 6);
Groups.AddNumberToGroup(34, 7);
Groups.AddNumberToGroup(456, 7);
Groups.AddNumberToGroup(456, 8);
Groups.AddNumberToGroup(7, 8);
Groups.AddNumberToGroup(7, 9);
}
long MillisecondsTaken = (DateTime.Now.Ticks - StartTime.Ticks) / TimeSpan.TicksPerMillisecond;
Console.WriteLine(Groups.GetString());
Console.WriteLine("Process took: " + MillisecondsTaken);
I think this is what you need. Let me know if I misunderstood anything in the question.
As far as I can tell it's brilliant, it's fast and it's tested.
Enjoy!
...and one more thing:
For the little windows interface app, I just created a simple winforms app with three textboxes (one set to multiline) and a button.
Then, after adding the Groups class above, in the button-click event I wrote the following:
private void BtnAdd_Click(object sender, EventArgs e)
{
try
{
int Group = int.Parse(TxtGroup.Text);
int Number = int.Parse(TxtNumber.Text);
Groups.AddNumberToGroup(Number, Group);
TxtOutput.Text = Groups.GetString();
}
catch//(Exception ex)
{
//Exception handling here
}
}

Suppressing Frequencies From FFT

What I am trying to do is retrieve the frequencies from a song and suppress all the frequencies that do not appear in the human vocal range (or, in general, any given range). Here is my suppress function.
public void SupressAndWrite(Func<FrequencyUnit, bool> func)
{
this.WaveManipulated = true;
while (this.mainWave.WAVFile.NumSamplesRemaining > 0)
{
FrequencyUnit[] freqUnits = this.mainWave.NextFrequencyUnits();
Complex[] compUnits = (from item in freqUnits
select (func(item) ? new Complex(item.Frequency, 0) : Complex.Zero))
.ToArray();
FourierTransform.FFT(compUnits, FourierTransform.Direction.Backward);
short[] shorts = (from item in compUnits
select (short)item.Real).ToArray();
foreach (short item in shorts)
{
this.ManipulatedFile.AddSample16bit(item);
}
}
this.ManipulatedFile.Close();
}
Here is my class for my wave.
public sealed class ComplexWave
{
public readonly WAVFile WAVFile;
public readonly Int32 SampleSize;
private FourierTransform.Direction fourierDirection { get; set; }
private long position;
/// <param name="file"></param>
/// <param name="sampleSize in BLOCKS"></param>
public ComplexWave(WAVFile file, int sampleSize)
{
file.NullReferenceExceptionCheck();
this.WAVFile = file;
this.SampleSize = sampleSize;
if (this.SampleSize % 8 != 0)
{
if (this.SampleSize % 16 != 0)
{
throw new ArgumentException("Sample Size");
}
}
if (!MathTools.IsPowerOf2(sampleSize))
{
throw new ArgumentException("Sample Size");
}
this.fourierDirection = FourierTransform.Direction.Forward;
}
public Complex[] NextSampleFourierTransform()
{
short[] newInput = this.GetNextSample();
Complex[] data = newInput.CopyToComplex();
if (newInput.Any((x) => x != 0))
{
Debug.Write("done");
}
FourierTransform.FFT(data, this.fourierDirection);
return data;
}
public FrequencyUnit[] NextFrequencyUnits()
{
Complex[] cm = this.NextSampleFourierTransform();
FrequencyUnit[] freqUn = new FrequencyUnit[(cm.Length / 2)];
int max = (cm.Length / 2);
for (int i = 0; i < max; i++)
{
freqUn[i] = new FrequencyUnit(cm[i], this.WAVFile.SampleRateHz, i, cm.Length);
}
Array.Sort(freqUn);
return freqUn;
}
private short[] GetNextSample()
{
short[] retval = new short[this.SampleSize];
for (int i = 0; i < this.SampleSize; i++)
{
if (this.WAVFile.NumSamplesRemaining > 0)
{
retval[i] = this.WAVFile.GetNextSampleAs16Bit();
this.position++;
}
}
return retval;
}
}
Both the forward and the backward FFT work correctly. Could you please tell me what my error is?
Unfortunately, the human voice, even when singing, doesn't sit in a single 'frequency range'. It usually has one main (fundamental) frequency and a multitude of harmonics that follow it, depending on the phoneme.
Use this https://play.google.com/store/apps/details?id=radonsoft.net.spectralview&hl=en or some similar app to see what I mean, and then redefine your strategy. Also google 'karaoke effect'.
Next:
It's not obvious from your example, but you should scan the whole file in windows (google 'FFT windowing') to process it in its entirety.
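To make the windowing pointer concrete, here is a minimal sketch of the standard technique (an illustration, not code from the question's libraries): multiply each sample block by a Hann window before the forward FFT to reduce spectral leakage. Reconstructing audio cleanly afterwards additionally needs overlapping blocks (e.g. a 50% hop) and overlap-add, which this omits.
// Hann window coefficients for a block of n samples.
static double[] HannWindow(int n)
{
    var w = new double[n];
    for (int i = 0; i < n; i++)
        w[i] = 0.5 * (1.0 - Math.Cos(2.0 * Math.PI * i / (n - 1)));
    return w;
}
// Apply the window in place to a block of real-valued samples
// before handing the block to the forward FFT.
static void ApplyWindow(double[] block, double[] window)
{
    for (int i = 0; i < block.Length; i++)
        block[i] *= window[i];
}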
