Suppressing Frequencies From FFT - c#

I am trying to extract the frequencies from a song and suppress every frequency that falls outside the human vocal range (or, more generally, outside any given range). Here is my suppress function.
public void SupressAndWrite(Func<FrequencyUnit, bool> func)
{
this.WaveManipulated = true;
while (this.mainWave.WAVFile.NumSamplesRemaining > 0)
{
FrequencyUnit[] freqUnits = this.mainWave.NextFrequencyUnits();
Complex[] compUnits = (from item
in freqUnits
select (func(item)
? new Complex(item.Frequency, 0) :Complex.Zero))
.ToArray();
FourierTransform.FFT(compUnits, FourierTransform.Direction.Backward);
short[] shorts = (from item
in compUnits
select (short)item.Real).ToArray();
foreach (short item in shorts)
{
this.ManipulatedFile.AddSample16bit(item);
}
}
this.ManipulatedFile.Close();
}
Here is my class for my wave.
public sealed class ComplexWave
{
public readonly WAVFile WAVFile;
public readonly Int32 SampleSize;
private FourierTransform.Direction fourierDirection { get; set; }
private long position;
/// <param name="file"></param>
/// <param name="sampleSize in BLOCKS"></param>
public ComplexWave(WAVFile file, int sampleSize)
{
file.NullReferenceExceptionCheck();
this.WAVFile = file;
this.SampleSize = sampleSize;
if (this.SampleSize % 8 != 0)
{
if (this.SampleSize % 16 != 0)
{
throw new ArgumentException("Sample Size");
}
}
if (!MathTools.IsPowerOf2(sampleSize))
{
throw new ArgumentException("Sample Size");
}
this.fourierDirection = FourierTransform.Direction.Forward;
}
public Complex[] NextSampleFourierTransform()
{
short[] newInput = this.GetNextSample();
Complex[] data = newInput.CopyToComplex();
if (newInput.Any((x) => x != 0))
{
Debug.Write("done");
}
FourierTransform.FFT(data, this.fourierDirection);
return data;
}
public FrequencyUnit[] NextFrequencyUnits()
{
Complex[] cm = this.NextSampleFourierTransform();
FrequencyUnit[] freqUn = new FrequencyUnit[(cm.Length / 2)];
int max = (cm.Length / 2);
for (int i = 0; i < max; i++)
{
freqUn[i] = new FrequencyUnit(cm[i], this.WAVFile.SampleRateHz, i, cm.Length);
}
Array.Sort(freqUn);
return freqUn;
}
private short[] GetNextSample()
{
short[] retval = new short[this.SampleSize];
for (int i = 0; i < this.SampleSize; i++)
{
if (this.WAVFile.NumSamplesRemaining > 0)
{
retval[i] = this.WAVFile.GetNextSampleAs16Bit();
this.position++;
}
}
return retval;
}
}
Both the forward and the backward FFT work correctly. Could you please tell me what my error is?

Unfortunately, a human voice, even when singing, does not sit in a single 'frequency range'. It usually has one fundamental frequency and a multitude of harmonics that follow it, depending on the phoneme.
Use https://play.google.com/store/apps/details?id=radonsoft.net.spectralview&hl=en or a similar spectrum-analyzer app to see what I mean, and then redefine your strategy. Also google the 'karaoke' effect.
Next:
It's not obvious from your example, but you should scan the whole file in windows (google 'fft windowing') so that all of it gets processed.
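For the band-suppression step itself, here is a minimal sketch. One thing it illustrates: an inverse transform needs the complex spectral coefficients themselves, so the mask should zero unwanted bins rather than replace a bin with its frequency in Hz. The names `BandFilter`, `Dft` and `KeepBand` are hypothetical, and the naive DFT is only a stand-in for whatever FFT routine you already use. It assumes real-valued input, so the mask is mirrored onto the upper half of the spectrum.

```csharp
using System;
using System.Numerics;

public static class BandFilter
{
    // Naive O(n^2) DFT, used here only as a stand-in for a real FFT routine.
    public static Complex[] Dft(Complex[] input, bool inverse)
    {
        int n = input.Length;
        double sign = inverse ? 1.0 : -1.0;
        var output = new Complex[n];
        for (int k = 0; k < n; k++)
        {
            Complex sum = Complex.Zero;
            for (int t = 0; t < n; t++)
            {
                double angle = sign * 2.0 * Math.PI * k * t / n;
                sum += input[t] * new Complex(Math.Cos(angle), Math.Sin(angle));
            }
            // The inverse transform is scaled by 1/n.
            output[k] = inverse ? new Complex(sum.Real / n, sum.Imaginary / n) : sum;
        }
        return output;
    }

    // Zeroes every bin whose frequency lies outside [lowHz, highHz] and transforms back.
    public static double[] KeepBand(double[] samples, int sampleRateHz, double lowHz, double highHz)
    {
        int n = samples.Length;
        Complex[] spectrum = Dft(Array.ConvertAll(samples, s => new Complex(s, 0)), false);
        for (int k = 0; k < n; k++)
        {
            // Bins above n/2 mirror the lower half, so map them back to a positive frequency.
            int mirrored = k <= n / 2 ? k : n - k;
            double freq = (double)mirrored * sampleRateHz / n;
            if (freq < lowHz || freq > highHz)
                spectrum[k] = Complex.Zero;
        }
        Complex[] restored = Dft(spectrum, true);
        return Array.ConvertAll(restored, c => c.Real);
    }
}
```

Applied block by block this still produces audible artifacts at the block boundaries, which is where the overlapping windows mentioned above come in.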


How to optimize recursive function for a graph for clearing and settlement

I have to make a module in an insurance application that deals with clearing and settlement (I think this is the correct financial terminology) between insurance companies enrolled in the system. In practice, the system must pair off all the amounts that companies owe one another, so that only the unpaired (remaining) sums are paid through the bank. For now there are about 30 companies in the system.
All the reading I did about clearing and settlement pointed me towards graphs and graph theory (which I studied in high school quite a long time ago).
For a system with 4 companies the graph would look like this:
where each company is a node (N1 ... N4) and each weighted edge represents the amount that one company has to pay to the other. In my code, the nodes are ints representing the IDs of the companies.
What I did so far: I created the graph (for testing I used the Random generator for the amounts) and wrote a recursive function to calculate all possible cycles in the graph. Then I wrote another recursive function that takes all non-zero cycles, starting with the longest path with the maximum common sum to pair.
The algorithm seems valid in terms of final results, but for graphs bigger than 7-8 nodes it takes too long to complete. The problem is in the recursive function that builds the possible cycles in the graph. Here is my code:
static void Main(string[] args)
{
int nodes = 4;
try
{
nodes = Convert.ToInt32(args[0]);
}
catch { }
DateTime start = DateTime.Now;
Graph g = new Graph(nodes);
int step = 0;
double CompensatedAmount = 0;
double TotalCompensatedAmount = 0;
DateTime endGeneration = DateTime.Now;
Console.WriteLine("Graph generated in: " + (endGeneration - start).TotalSeconds + " seconds.");
Compensare.RunCompensation(false, g, step, CompensatedAmount, TotalCompensatedAmount, out CompensatedAmount, out TotalCompensatedAmount);
DateTime endCompensation = DateTime.Now;
Console.WriteLine("Graph compensated in: " + (endCompensation - endGeneration).TotalSeconds + " seconds.");
}
... and the main class:
public static class Compensare
{
public static void RunCompensation(bool exit, Graph g, int step, double prevCompensatedAmount, double prevTotalCompensatedAmount, out double CompensatedAmount, out double TotalCompensatedAmount)
{
step++;
CompensatedAmount = prevCompensatedAmount;
TotalCompensatedAmount = prevTotalCompensatedAmount;
if (!exit)
{
List<Cycle> orderedList = g.Cycles.OrderByDescending(x => x.CycleCompensatedAmount).ToList();
g.ListCycles(orderedList, "OrderedCycles" + step.ToString() + ".txt");
using (Graph clona = g.Clone())
{
int maxCycleIndex = clona.GetMaxCycleByCompensatedAmount();
double tmpCompensatedAmount = clona.Cycles[maxCycleIndex].CycleMin;
exit = tmpCompensatedAmount <= 0 ? true : false;
CompensatedAmount += tmpCompensatedAmount;
TotalCompensatedAmount += (tmpCompensatedAmount * clona.Cycles[maxCycleIndex].EdgesCount);
clona.CompensateCycle(maxCycleIndex);
clona.UpdateCycles();
Console.WriteLine(String.Format("{0} - edges: {4} - min: {3} - {1} - {2}\r\n", step, CompensatedAmount, TotalCompensatedAmount, tmpCompensatedAmount, clona.Cycles[maxCycleIndex].EdgesCount));
RunCompensation(exit, clona, step, CompensatedAmount, TotalCompensatedAmount, out CompensatedAmount, out TotalCompensatedAmount);
}
}
}
}
public class Edge
{
public int Start { get; set; }
public int End { get; set; }
public double Weight { get; set; }
public double InitialWeight {get;set;}
public Edge() { }
public Edge(int _start, int _end, double _weight)
{
this.Start = _start;
this.End = _end;
this.Weight = _weight;
this.InitialWeight = _weight;
}
}
public class Cycle
{
public List<Edge> Edges = new List<Edge>();
public double CycleWeight = 0;
public double CycleMin = 0;
public double CycleMax = 0;
public double CycleAverage = 0;
public double CycleCompensatedAmount = 0;
public int EdgesCount = 0;
public Cycle() { }
public Cycle(List<Edge> _edges)
{
this.Edges = new List<Edge>(_edges);
UpdateCycle();
}
public void UpdateCycle()
{
UpdateCycle(this);
}
public void UpdateCycle(Cycle c)
{
double sum = 0;
double min = c.Edges[0].Weight;
double max = c.Edges[0].Weight;
for(int i=0;i<c.Edges.Count;i++)
{
sum += c.Edges[i].Weight;
min = c.Edges[i].Weight < min ? c.Edges[i].Weight : min;
max = c.Edges[i].Weight > max ? c.Edges[i].Weight : max;
}
c.EdgesCount = c.Edges.Count;
c.CycleWeight = sum;
c.CycleMin = min;
c.CycleMax = max;
c.CycleAverage = sum / c.EdgesCount;
c.CycleCompensatedAmount = min * c.EdgesCount;
}
}
public class Graph : IDisposable
{
public List<int> Nodes = new List<int>();
public List<Edge> Edges = new List<Edge>();
public List<Cycle> Cycles = new List<Cycle>();
public int NodesCount { get; set; }
public Graph() { }
public Graph(int _nodes)
{
this.NodesCount = _nodes;
GenerateNodes();
GenerateEdges();
GenerateCycles();
}
private int FindNode(string _node)
{
for(int i = 0; i < this.Nodes.Count; i++)
{
if (this.Nodes[i].ToString() == _node)
return i;
}
return 0;
}
private int FindEdge(string[] _edge)
{
for(int i = 0; i < this.Edges.Count; i++)
{
if (this.Edges[i].Start.ToString() == _edge[0] && this.Edges[i].End.ToString() == _edge[1] && Convert.ToDouble(this.Edges[i].Weight) == Convert.ToDouble(_edge[2]))
return i;
}
return 0;
}
public Graph Clone()
{
Graph clona = new Graph();
clona.Nodes = new List<int>(this.Nodes);
clona.Edges = new List<Edge>(this.Edges);
clona.Cycles = new List<Cycle>(this.Cycles);
clona.NodesCount = this.NodesCount;
return clona;
}
public void CompensateCycle(int cycleIndex)
{
for(int i = 0; i < this.Cycles[cycleIndex].Edges.Count; i++)
{
this.Cycles[cycleIndex].Edges[i].Weight -= this.Cycles[cycleIndex].CycleMin;
}
}
public int GetMaxCycleByCompensatedAmount()
{
int toReturn = 0;
for (int i = 0; i < this.Cycles.Count; i++)
{
if (this.Cycles[i].CycleCompensatedAmount > this.Cycles[toReturn].CycleCompensatedAmount)
{
toReturn = i;
}
}
return toReturn;
}
public void GenerateNodes()
{
for (int i = 0; i < this.NodesCount; i++)
{
this.Nodes.Add(i + 1);
}
}
public void GenerateEdges()
{
Random r = new Random();
for(int i = 0; i < this.Nodes.Count; i++)
{
for(int j = 0; j < this.Nodes.Count; j++)
{
if(this.Nodes[i] != this.Nodes[j])
{
int _weight = r.Next(0, 500);
Edge e = new Edge(this.Nodes[i], this.Nodes[j], _weight);
this.Edges.Add(e);
}
}
}
}
public void GenerateCycles()
{
for(int i = 0; i < this.Edges.Count; i++)
{
FindCycles(new Cycle(new List<Edge>() { this.Edges[i] }));
}
this.UpdateCycles();
}
public void UpdateCycles()
{
for (int i = 0; i < this.Cycles.Count; i++)
{
this.Cycles[i].UpdateCycle();
}
}
private void FindCycles(Cycle path)
{
List<Edge> nextPossibleEdges = GetNextEdges(path.Edges[path.Edges.Count - 1].End);
for (int i = 0; i < nextPossibleEdges.Count; i++)
{
if (path.Edges.IndexOf(nextPossibleEdges[i]) < 0) // the edge shouldn't be already in the path
{
Cycle temporaryPath = new Cycle(path.Edges);
temporaryPath.Edges.Add(nextPossibleEdges[i]);
if (nextPossibleEdges[i].End == temporaryPath.Edges[0].Start) // end of path - valid cycle
{
if (!CycleExists(temporaryPath))
{
this.Cycles.Add(temporaryPath);
break;
}
}
else
{
FindCycles(temporaryPath);
}
}
}
}
private bool CycleExists(Cycle cycle)
{
bool toReturn = false;
if (this.Cycles.IndexOf(cycle) > -1) { toReturn = true; }
else
{
for (int i = 0; i < this.Cycles.Count; i++)
{
if (this.Cycles[i].Edges.Count == cycle.Edges.Count && !CompareEdges(this.Cycles[i].Edges[0], cycle.Edges[0]))
{
bool cycleExists = true;
for (int j = 0; j < cycle.Edges.Count; j++)
{
bool edgeExists = false; // if there is an edge not in the path, then the searched cycle is diferent from the current cycle and we can pas to the next iteration
for (int k = 0; k < this.Cycles[i].Edges.Count; k++)
{
if (CompareEdges(cycle.Edges[j], this.Cycles[i].Edges[k]))
{
edgeExists = true;
break;
}
}
if (!edgeExists)
{
cycleExists = false;
break;
}
}
if (cycleExists) // if we found an cycle with all edges equal to the searched cycle, then the cycle is not valid
{
toReturn = true;
break;
}
}
}
}
return toReturn;
}
private bool CompareEdges(Edge e1, Edge e2)
{
return (e1.Start == e2.Start && e1.End == e2.End && e1.Weight == e2.Weight);
}
private List<Edge> GetNextEdges(int endNode)
{
List<Edge> tmp = new List<Edge>();
for(int i = 0; i < this.Edges.Count; i++)
{
if(endNode == this.Edges[i].Start)
{
tmp.Add(this.Edges[i]);
}
}
return tmp;
}
#region IDisposable Support
private bool disposedValue = false; // To detect redundant calls
protected virtual void Dispose(bool disposing)
{
if (!disposedValue)
{
if (disposing)
{
// TODO: dispose managed state (managed objects).
this.Nodes = null;
this.Edges = null;
this.Cycles = null;
this.NodesCount = 0;
}
// TODO: free unmanaged resources (unmanaged objects) and override a finalizer below.
// TODO: set large fields to null.
disposedValue = true;
}
}
// TODO: override a finalizer only if Dispose(bool disposing) above has code to free unmanaged resources.
// ~Graph() {
// // Do not change this code. Put cleanup code in Dispose(bool disposing) above.
// Dispose(false);
// }
// This code added to correctly implement the disposable pattern.
public void Dispose()
{
// Do not change this code. Put cleanup code in Dispose(bool disposing) above.
Dispose(true);
// TODO: uncomment the following line if the finalizer is overridden above.
// GC.SuppressFinalize(this);
}
#endregion
}
I've found several articles/answers about graphs, both in Java and C# (including QuickGraph), but they mainly focus on directed graphs (without cycles).
I have also read about tail call optimization for recursion, but I don't know if or how to apply it in my case.
I know there is a lot to grasp about this subject, but maybe someone has had to deal with something similar and can either help me optimize the code (which, as I said, seems to do the job in the end) or point me in another direction to rethink the whole process.
I think you can massively simplify this.
All money is the same, so (using your example) N1 doesn't care whether it gets 350 from N2 and pays 150 to N2 and so on - N1 merely cares that overall it ends up 145 down (if I've done the arithmetic correctly). Similarly, each other N only cares about its overall position. So, summing the inflows and outflows at each node, we get:
Company   Net position
N1        -145
N2        -65
N3        +195
N4        +15
So with someone to act as a central clearing house - the bank - simply arrange for N1 and N2 to pay the clearing house 145 and 65 respectively, and for N3 and N4 to receive 195 and 15 respectively from the clearing house. And everyone's happy.
I may have missed some aspect, of course, in which case I'm sure someone will point it out...
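The netting described above can be sketched in a few lines. This is only an illustration under assumptions: the `Netting` class, the `NetPositions` method and the tuple shape are hypothetical names, and `decimal` is used in place of the question's `double` because the amounts are money. Each obligation is visited once, so the cost is linear in the number of edges and no cycle enumeration is needed.

```csharp
using System;
using System.Collections.Generic;

public static class Netting
{
    // Each obligation is (payer, receiver, amount); the result maps a company id
    // to its net position (negative = owes the clearing house, positive = is owed).
    public static Dictionary<int, decimal> NetPositions(
        IEnumerable<(int Payer, int Receiver, decimal Amount)> obligations)
    {
        var net = new Dictionary<int, decimal>();
        foreach (var (payer, receiver, amount) in obligations)
        {
            // TryGetValue leaves the out variable at 0 when the key is missing.
            net.TryGetValue(payer, out decimal payerBalance);
            net[payer] = payerBalance - amount;

            net.TryGetValue(receiver, out decimal receiverBalance);
            net[receiver] = receiverBalance + amount;
        }
        return net;
    }
}
```

For example, the obligations 1 pays 2 100, 2 pays 3 60 and 3 pays 1 30 net out to -70, +40 and +30 for companies 1, 2 and 3. The positions always sum to zero, which is a cheap sanity check.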

Importing and removing duplicates from a massive amount of text files using C# and Redis

This is a bit of a doozy and it's been a while since I worked with C#, so bear with me:
I'm running a jruby script to iterate through 900 files (5 Mb - 1500 Mb in size) to figure out how many dupes STILL exist within these (already uniq'd) files. I had little luck with awk.
My latest idea was to insert them into a local MongoDB instance like so:
db.collection('hashes').update({ :_id => hash}, { $inc: { count: 1} }, { upsert: true })
... so that later I could just query it like db.collection.where({ count: { $gt: 1 } }) to get all the dupes.
This is working great except it's been over 24 hours and at the time of writing I'm at 72,532,927 Mongo entries and growing.
I think Ruby's .each_line is bottlenecking the IO hardcore.
So what I'm thinking now is compiling a C# program which fires up a thread PER EACH FILE and inserts the line (md5 hash) into a Redis list.
From there, I could have another compiled C# program simply pop the values off and ignore the save if the count is 1.
So the questions are:
Will using a compiled file reader and multithreading the file reads significantly improve performance?
Is using Redis even necessary? With a tremendous amount of AWS memory, could I not just use the threads to fill some sort of a list atomically and proceed from there?
Thanks in advance.
Updated
The main idea is to calculate a dummy hash (just the sum of all chars in the string) for each line and store it in a Dictionary<ulong, List<LinePosition>> _hash2LinePositions. Several lines in the same stream can share a hash, which is why the dictionary value is a List. When two hashes are equal, we read the lines back from the streams and compare them. LinePosition stores the information about a line: its position in the stream and its length. I don't have files as huge as yours, but my tests show that it works. Here is the full code:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
public class Solution
{
struct LinePosition
{
public long Start;
public long Length;
public LinePosition(long start, long count)
{
Start = start;
Length = count;
}
public override string ToString()
{
return string.Format("Start: {0}, Length: {1}", Start, Length);
}
}
class TextFileHasher : IDisposable
{
readonly Dictionary<ulong, List<LinePosition>> _hash2LinePositions;
readonly Stream _stream;
bool _isDisposed;
public HashSet<ulong> Hashes { get; private set; }
public string Name { get; private set; }
public TextFileHasher(string name, Stream stream)
{
Name = name;
_stream = stream;
_hash2LinePositions = new Dictionary<ulong, List<LinePosition>>();
Hashes = new HashSet<ulong>();
}
public override string ToString()
{
return Name;
}
public void CalculateFileHash()
{
int readByte = -1;
ulong dummyLineHash = 0;
// Line start position in file
long startPosition = 0;
while ((readByte = _stream.ReadByte()) != -1) {
// Read until new line
if (readByte == '\r' || readByte == '\n') {
// If there was data
if (dummyLineHash != 0) {
// Add line hash and line position to the dict
AddToDictAndHash(dummyLineHash, startPosition, _stream.Position - 1 - startPosition);
// Reset line hash
dummyLineHash = 0;
}
}
else {
// Was it new line ?
if (dummyLineHash == 0)
startPosition = _stream.Position - 1;
// Calculate dummy hash
dummyLineHash += (uint)readByte;
}
}
if (dummyLineHash != 0) {
// Add line hash and line position to the dict
AddToDictAndHash(dummyLineHash, startPosition, _stream.Position - startPosition);
// Reset line hash
dummyLineHash = 0;
}
}
public List<LinePosition> GetLinePositions(ulong hash)
{
return _hash2LinePositions[hash];
}
public List<string> GetDuplicates()
{
List<string> duplicates = new List<string>();
foreach (var key in _hash2LinePositions.Keys) {
List<LinePosition> linesPos = _hash2LinePositions[key];
if (linesPos.Count > 1) {
duplicates.AddRange(FindExactDuplicates(linesPos));
}
}
return duplicates;
}
public void Dispose()
{
if (_isDisposed)
return;
_stream.Dispose();
_isDisposed = true;
}
private void AddToDictAndHash(ulong hash, long start, long count)
{
List<LinePosition> linesPosition;
if (!_hash2LinePositions.TryGetValue(hash, out linesPosition)) {
linesPosition = new List<LinePosition>() { new LinePosition(start, count) };
_hash2LinePositions.Add(hash, linesPosition);
}
else {
linesPosition.Add(new LinePosition(start, count));
}
Hashes.Add(hash);
}
public byte[] GetLineAsByteArray(LinePosition prevPos)
{
long len = prevPos.Length;
byte[] lineBytes = new byte[len];
_stream.Seek(prevPos.Start, SeekOrigin.Begin);
_stream.Read(lineBytes, 0, (int)len);
return lineBytes;
}
private List<string> FindExactDuplicates(List<LinePosition> linesPos)
{
List<string> duplicates = new List<string>();
linesPos.Sort((x, y) => x.Length.CompareTo(y.Length));
LinePosition prevPos = linesPos[0];
for (int i = 1; i < linesPos.Count; i++) {
if (prevPos.Length == linesPos[i].Length) {
var prevLineArray = GetLineAsByteArray(prevPos);
var thisLineArray = GetLineAsByteArray(linesPos[i]);
if (prevLineArray.SequenceEqual(thisLineArray)) {
var line = System.Text.Encoding.Default.GetString(prevLineArray);
duplicates.Add(line);
}
#if false
string prevLine = System.Text.Encoding.Default.GetString(prevLineArray);
string thisLine = System.Text.Encoding.Default.GetString(thisLineArray);
Console.WriteLine("PrevLine: {0}\r\nThisLine: {1}", prevLine, thisLine);
StringBuilder sb = new StringBuilder();
sb.Append(prevPos);
sb.Append(" is '");
sb.Append(prevLine);
sb.Append("'. ");
sb.AppendLine();
sb.Append(linesPos[i]);
sb.Append(" is '");
sb.Append(thisLine);
sb.AppendLine("'. ");
sb.Append("Equals => ");
sb.Append(prevLine.CompareTo(thisLine) == 0);
Console.WriteLine(sb.ToString());
#endif
}
else {
prevPos = linesPos[i];
}
}
return duplicates;
}
}
public static void Main(String[] args)
{
List<TextFileHasher> textFileHashers = new List<TextFileHasher>();
string text1 = "abc\r\ncba\r\nabc";
TextFileHasher tfh1 = new TextFileHasher("Text1", new MemoryStream(System.Text.Encoding.Default.GetBytes(text1)));
tfh1.CalculateFileHash();
textFileHashers.Add(tfh1);
string text2 = "def\r\ncba\r\nwet";
TextFileHasher tfh2 = new TextFileHasher("Text2", new MemoryStream(System.Text.Encoding.Default.GetBytes(text2)));
tfh2.CalculateFileHash();
textFileHashers.Add(tfh2);
string text3 = "def\r\nbla\r\nwat";
TextFileHasher tfh3 = new TextFileHasher("Text3", new MemoryStream(System.Text.Encoding.Default.GetBytes(text3)));
tfh3.CalculateFileHash();
textFileHashers.Add(tfh3);
List<string> totalDuplicates = new List<string>();
Dictionary<ulong, Dictionary<TextFileHasher, List<LinePosition>>> totalHashes = new Dictionary<ulong, Dictionary<TextFileHasher, List<LinePosition>>>();
textFileHashers.ForEach(tfh => {
foreach(var dummyHash in tfh.Hashes) {
Dictionary<TextFileHasher, List<LinePosition>> tfh2LinePositions = null;
if (!totalHashes.TryGetValue(dummyHash, out tfh2LinePositions))
totalHashes[dummyHash] = new Dictionary<TextFileHasher, List<LinePosition>>() { { tfh, tfh.GetLinePositions(dummyHash) } };
else {
List<LinePosition> linePositions = null;
if (!tfh2LinePositions.TryGetValue(tfh, out linePositions))
tfh2LinePositions[tfh] = tfh.GetLinePositions(dummyHash);
else
linePositions.AddRange(tfh.GetLinePositions(dummyHash));
}
}
});
HashSet<TextFileHasher> alreadyGotDuplicates = new HashSet<TextFileHasher>();
foreach(var hash in totalHashes.Keys) {
var tfh2LinePositions = totalHashes[hash];
var tfh = tfh2LinePositions.Keys.FirstOrDefault();
// Get duplicates in the TextFileHasher itself
if (tfh != null && !alreadyGotDuplicates.Contains(tfh)) {
totalDuplicates.AddRange(tfh.GetDuplicates());
alreadyGotDuplicates.Add(tfh);
}
if (tfh2LinePositions.Count <= 1) {
continue;
}
// Algo to get duplicates in more than 1 TextFileHashers
var tfhs = tfh2LinePositions.Keys.ToArray();
for (int i = 0; i < tfhs.Length; i++) {
var tfh1Positions = tfhs[i].GetLinePositions(hash);
for (int j = i + 1; j < tfhs.Length; j++) {
var tfh2Positions = tfhs[j].GetLinePositions(hash);
for (int k = 0; k < tfh1Positions.Count; k++) {
var tfh1Pos = tfh1Positions[k];
var tfh1ByteArray = tfhs[i].GetLineAsByteArray(tfh1Pos);
for (int m = 0; m < tfh2Positions.Count; m++) {
var tfh2Pos = tfh2Positions[m];
if (tfh1Pos.Length != tfh2Pos.Length)
continue;
var tfh2ByteArray = tfhs[j].GetLineAsByteArray(tfh2Pos);
if (tfh1ByteArray.SequenceEqual(tfh2ByteArray)) {
var line = System.Text.Encoding.Default.GetString(tfh1ByteArray);
totalDuplicates.Add(line);
}
}
}
}
}
}
Console.WriteLine();
if (totalDuplicates.Count > 0) {
Console.WriteLine("Total number of duplicates: {0}", totalDuplicates.Count);
Console.WriteLine("#######################");
totalDuplicates.ForEach(x => Console.WriteLine("{0}", x));
Console.WriteLine("#######################");
}
// Free resources
foreach (var tfh in textFileHashers)
tfh.Dispose();
}
}
If you have tons of RAM, you guys are overthinking it:
var fileLines = File.ReadAllLines(@"c:\file.csv").Distinct();
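Note that Distinct() removes the duplicates but does not report which lines they were. If the goal is to list the duplicates, a HashSet can do that in one streaming pass. The sketch below is a minimal, hedged illustration: `DuplicateFinder` and `CountDuplicates` are hypothetical names, and it assumes that the set of distinct lines fits in memory.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class DuplicateFinder
{
    // Streams lines from each reader and returns every line seen more than once,
    // together with the total number of times it occurred.
    public static Dictionary<string, int> CountDuplicates(IEnumerable<TextReader> readers)
    {
        var seen = new HashSet<string>();
        var duplicates = new Dictionary<string, int>();
        foreach (TextReader reader in readers)
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.Length == 0) continue;   // skip blank lines
                if (seen.Add(line)) continue;     // first occurrence, nothing to record
                duplicates.TryGetValue(line, out int count);
                duplicates[line] = count == 0 ? 2 : count + 1;
            }
        }
        return duplicates;
    }
}
```

For real files, each reader could come from File.OpenText(path), which returns a StreamReader, so only one line per file is buffered at a time on top of the set itself.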

Postfix increment into if, c#

Code example:
using System;
public class Test {
public static void Main() {
int a = 0;
if(a++ == 0){
Console.WriteLine(a);
}
}
}
In this code the Console will write: 1. I can write this code in another way:
public static void Main() {
int a = 0;
if(a == 0){
a++;
Console.WriteLine(a);
}
}
These two examples work exactly the same (from what I know about postfix).
The problem is with this example coming from the Microsoft tutorials:
using System;
public class Document {
// Class allowing to view the document as an array of words:
public class WordCollection {
readonly Document document;
internal WordCollection (Document d){
document = d;
}
// Helper function -- search character array "text", starting
// at character "begin", for word number "wordCount". Returns
//false if there are less than wordCount words. Sets "start" and
//length to the position and length of the word within text
private bool GetWord(char[] text, int begin, int wordCount,
out int start, out int length) {
int end = text.Length;
int count = 0;
int inWord = -1;
start = length = 0;
for (int i = begin; i <= end; ++i){
bool isLetter = i < end && Char.IsLetterOrDigit(text[i]);
if (inWord >= 0) {
if (!isLetter) {
if (count++ == wordCount) {//PROBLEM IS HERE!!!!!!!!!!!!
start = inWord;
length = i - inWord;
return true;
}
inWord = -1;
}
} else {
if (isLetter) {
inWord = i;
}
}
}
return false;
}
//Indexer to get and set words of the containing document:
public string this[int index] {
get
{
int start, length;
if(GetWord(document.TextArray, 0, index, out start,
out length)) {
return new string(document.TextArray, start, length);
} else {
throw new IndexOutOfRangeException();
}
}
set {
int start, length;
if(GetWord(document.TextArray, 0, index, out start,
out length))
{
//Replace the word at start/length with
// the string "value"
if(length == value.Length){
Array.Copy(value.ToCharArray(), 0,
document.TextArray, start, length);
}
else {
char[] newText = new char[document.TextArray.Length +
value.Length - length];
Array.Copy(document.TextArray, 0, newText, 0, start);
Array.Copy(value.ToCharArray(), 0, newText, start, value.Length);
Array.Copy(document.TextArray, start + length, newText,
start + value.Length, document.TextArray.Length - start - length);
document.TextArray = newText;
}
} else {
throw new IndexOutOfRangeException();
}
}
}
public int Count {
get {
int count = 0, start = 0, length = 0;
while (GetWord(document.TextArray, start + length,
0, out start, out length)) {
++count;
}
return count;
}
}
}
// Class allowing the document to be viewed like an array
// of character
public class CharacterCollection {
readonly Document document;
internal CharacterCollection(Document d) {
document = d;
}
//Indexer to get and set character in the containing
//document
public char this[int index] {
get {
return document.TextArray[index];
}
set {
document.TextArray[index] = value;
}
}
//get the count of character in the containing document
public int Count {
get {
return document.TextArray.Length;
}
}
}
//Because the types of the fields have indexers,
//these fields appear as "indexed properties":
public WordCollection Words;
public readonly CharacterCollection Characters;
private char[] TextArray;
public Document(string initialText) {
TextArray = initialText.ToCharArray();
Words = new WordCollection(this);
Characters = new CharacterCollection(this);
}
public string Text {
get {
return new string(TextArray);
}
}
class Test {
static void Main() {
Document d = new Document(
"peter piper picked a peck of pickled peppers. How many pickled peppers did peter piper pick?"
);
//Change word "peter" to "penelope"
for(int i = 0; i < d.Words.Count; ++i){
if (d.Words[i] == "peter") {
d.Words[i] = "penelope";
}
}
for (int i = 0; i < d.Characters.Count; ++i) {
if (d.Characters[i] == 'p') {
d.Characters[i] = 'P';
}
}
Console.WriteLine(d.Text);
}
}
}
If I change the code marked above to this:
if (count == wordCount) {//PROBLEM IS HERE
start = inWord;
length = i - inWord;
count++;
return true;
}
I get an IndexOutOfRangeException, but I don't know why.
Your initial assumption is incorrect (that the two examples work exactly the same). In the following version, count is incremented regardless of whether or not it is equal to wordCount:
if (count++ == wordCount)
{
// Code omitted
}
In this version, count is ONLY incremented when it is equal to wordCount:
if (count == wordCount)
{
// Other code omitted
count++;
}
EDIT
The reason this is causing you a failure is that, when you are searching for the second word (when wordCount is 1), the variable count will never equal wordCount (because it never gets incremented), and therefore the GetWord method returns false, which then triggers the else clause in your get method, which throws an IndexOutOfRangeException.
In your version of the code, count is only being incremented when count == wordCount; in the Microsoft version, it's being incremented whether the condition is met or not.
using System;
public class Test {
public static void Main() {
int a = 0;
if(a++ == 0){
Console.WriteLine(a);
}
}
}
Is not quite the same as:
public static void Main() {
int a = 0;
if(a == 0){
a++;
Console.WriteLine(a);
}
}
In the second case a++ is executed only if a == 0. In the first case a++ is executed every time we check the condition.
Here is your mistake:
public static void Main() {
int a = 0;
if(a == 0){
a++;
Console.WriteLine(a);
}
}
It should be like this:
public static void Main() {
int a = 0;
if(a == 0){
a++;
Console.WriteLine(a);
}
else
a++;
}
This way a always gets incremented. In your code example, count only gets incremented when count == wordCount (in which case the method returns true anyway), so you basically never increment count.
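The difference only shows up once the condition is evaluated more than once, as it is inside the loop in GetWord. A minimal sketch, where `PostfixDemo` and its two methods are hypothetical names that mimic that loop:

```csharp
using System;

public static class PostfixDemo
{
    // count++ in the condition: count advances on every check,
    // so the condition becomes true on the check number `target`.
    public static int IncrementInCondition(int target, int limit)
    {
        int count = 0;
        for (int i = 0; i < limit; i++)
        {
            if (count++ == target)
                return i;
        }
        return -1;
    }

    // count++ in the body: count only advances after a match,
    // so for any target above 0 the condition is never true.
    public static int IncrementInBody(int target, int limit)
    {
        int count = 0;
        for (int i = 0; i < limit; i++)
        {
            if (count == target)
            {
                count++;
                return i;
            }
        }
        return -1;
    }
}
```

With a target of 0 both methods return 0, which is why the one-shot examples look equivalent. With a target of 1 and a limit of 5, IncrementInCondition returns 1 while IncrementInBody returns -1, the same way the modified GetWord returns false for the second word.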

(Dynamic programming) How to maximize room utilization with a list of meetings?

I am trying this problem using dynamic programming
Problem:
Given a meeting room and a list of intervals (each representing a meeting), e.g.:
interval 1: 1.00-2.00
interval 2: 2.00-4.00
interval 3: 14.00-16.00
...
etc.
Question:
How can the meetings be scheduled to maximize room utilization, so that NO two meetings overlap?
Attempted solution
Below is my initial attempt in C# (treating it as a modified knapsack problem with constraints). However, I had difficulty getting the correct result.
bool ContainsOverlapped(List<Interval> list)
{
var sortedList = list.OrderBy(x => x.Start).ToList();
for (int i = 0; i < sortedList.Count; i++)
{
for (int j = i + 1; j < sortedList.Count; j++)
{
if (sortedList[i].IsOverlap(sortedList[j]))
return true;
}
}
return false;
}
public bool Optimize(List<Interval> intervals, int limit, List<Interval> itemSoFar){
if (intervals == null || intervals.Count == 0)
return true; //no more choice
if (Sum(itemSoFar) > limit) //over limit
return false;
var arrInterval = intervals.ToArray();
//try all choices
for (int i = 0; i < arrInterval.Length; i++){
List<Interval> remaining = new List<Interval>();
for (int j = i + 1; j < arrInterval.Length; j++) {
remaining.Add(arrInterval[j]);
}
var partialChoice = new List<Interval>();
partialChoice.AddRange(itemSoFar);
partialChoice.Add(arrInterval[i]);
//should not schedule overlap
if (ContainsOverlapped(partialChoice))
partialChoice.Remove(arrInterval[i]);
if (Optimize(remaining, limit, partialChoice))
return true;
else
partialChoice.Remove(arrInterval[i]); //undo
}
//try all solution
return false;
}
public class Interval
{
public bool IsOverlap(Interval other)
{
return (other.Start < this.Start && this.Start < other.End) || //other < this
(this.Start < other.Start && other.End < this.End) || // this covers other
(other.Start < this.Start && this.End < other.End) || // other covers this
(this.Start < other.Start && other.Start < this.End); //this < other
}
public override bool Equals(object obj){
var i = (Interval)obj;
return base.Equals(obj) && i.Start == this.Start && i.End == this.End;
}
public int Start { get; set; }
public int End { get; set; }
public Interval(int start, int end){
Start = start;
End = end;
}
public int Duration{
get{
return End - Start;
}
}
}
Edit 1
Room utilization = amount of time the room is occupied. Sorry for confusion.
Edit 2
For simplicity: the duration of each interval is an integer, and start/end times fall on whole hours (1, 2, 3, ..., 24).
I'm not sure how you are relating this to a knapsack problem. To me it seems more like a maximum-weight independent set problem (closely related to vertex cover).
First sort the intervals as per their start times and form a graph representation in the form of adjacency matrix or list.
The vertices shall be the interval numbers. There shall be an edge between two vertices if the corresponding intervals overlap with each other. Also, each vertex shall be associated with a value equal to the interval's duration.
The problem then becomes choosing the independent vertices in such a way that the total value is maximum.
This can be done through dynamic programming. The recurrence relation for each vertex shall be as follows:
V[i] = max{ V[j] | j < i and i->j is an edge,
V[k] + value[i] | k < i and there is no edge between i and k }
Base Case V[1] = value[1]
Note:
The vertices should be numbered in increasing order of their start times. Then if there are three vertices:
i < j < k, and if there is no edge between vertex i and vertex j, then there cannot be any edge between vertex i and vertex k.
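For comparison, here is a sketch of the textbook dynamic program for weighted interval scheduling. It is a swapped-in standard technique, not the recurrence written above: it sorts by end time (not start time) and, for each meeting, either skips it or takes it together with the best schedule among the meetings that finish before it starts. The names `RoomScheduler` and `MaxUtilization` are hypothetical, and meetings that merely touch (one ends exactly when the next starts) are assumed to be compatible.

```csharp
using System;
using System.Linq;

public static class RoomScheduler
{
    // Returns the largest total duration of pairwise non-overlapping meetings.
    public static int MaxUtilization((int Start, int End)[] meetings)
    {
        var sorted = meetings.OrderBy(m => m.End).ToArray();
        int n = sorted.Length;
        var best = new int[n + 1];   // best[i] = optimum using only the first i meetings
        for (int i = 1; i <= n; i++)
        {
            var current = sorted[i - 1];

            // Binary search: how many earlier meetings end no later than current starts?
            int lo = 0, hi = i - 1;
            while (lo < hi)
            {
                int mid = (lo + hi) / 2;
                if (sorted[mid].End <= current.Start) lo = mid + 1;
                else hi = mid;
            }

            int take = best[lo] + (current.End - current.Start);
            best[i] = Math.Max(best[i - 1], take);   // skip current, or take it
        }
        return best[n];
    }
}
```

With the intervals from the question (1-2, 2-4, 14-16) nothing overlaps, so the result is 5 hours. The whole thing runs in O(n log n) because of the sort and the binary search.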
A good approach is to create a class that handles this for you.
First, I create a helper class for storing intervals:
public class FromToDateTime
{
private DateTime _start;
public DateTime Start
{
get
{
return _start;
}
set
{
_start = value;
}
}
private DateTime _end;
public DateTime End
{
get
{
return _end;
}
set
{
_end = value;
}
}
public FromToDateTime(DateTime start, DateTime end)
{
Start = start;
End = end;
}
}
And here is the class Room, which holds all the intervals and has a method addInterval that returns true if the new interval does not overlap an existing one and was added, and false otherwise.
By the way, I took the overlap check from here: Algorithm to detect overlapping periods
public class Room
{
private List<FromToDateTime> _intervals;
public List<FromToDateTime> Intervals
{
get
{
return _intervals;
}
set
{
_intervals = value;
}
}
public Room()
{
Intervals = new List<FromToDateTime>();
}
public bool addInterval(FromToDateTime newInterval)
{
foreach (FromToDateTime interval in Intervals)
{
if (newInterval.Start < interval.End && interval.Start < newInterval.End)
{
return false;
}
}
Intervals.Add(newInterval);
return true;
}
}
The more general problem (with multiple meeting rooms) is indeed NP-hard and is known as the interval scheduling problem.
Optimal solution for 1-d problem with one classroom:
For the 1-d problem, choosing the (still valid) earliest deadline first solves the problem optimally.
Proof by induction. The base case is the empty case: the algorithm optimally solves a problem with zero meetings.
The induction hypothesis is that the algorithm solves the problem optimally for any problem with k < n tasks.
The step: given a problem with n meetings, choose the earliest deadline and remove all meetings made invalid by that choice. Let the chosen earliest-deadline task be T.
You get a new problem of smaller size, and by invoking the algorithm on the remainder you get the optimal solution for it (induction hypothesis).
Now note that, given that optimal solution, you can add at most one of the discarded tasks: you can add either T or another discarded task, but all of them overlap T (otherwise they wouldn't have been discarded). Thus you can add at most one of the discarded tasks, which is exactly what the suggested algorithm does.
Conclusion: For 1 meeting room, this algorithm is optimal.
QED
High-level pseudocode of the solution:
findOptimal(list<tasks>):
    res = [] //empty list
    sort(list) //according to deadline/meeting end
    while (list.IsEmpty() == false):
        res = res.append(list.first())
        end = list.first().endTime()
        //remove all overlaps with the chosen meeting
        //(this also removes the chosen meeting itself, assuming positive duration)
        while (list.IsEmpty() == false and list.first().startTime() < end):
            list.removeFirst()
    return res
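The pseudocode above can be sketched in C# roughly as follows. The `Meeting` type and `GreedyScheduler` class are hypothetical names introduced only for this illustration, and the sketch assumes that a meeting starting exactly when another ends does not overlap it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type for illustration; not part of the original answer.
public sealed class Meeting
{
    public int Start { get; }
    public int End { get; }
    public Meeting(int start, int end) { Start = start; End = end; }
}

public static class GreedyScheduler
{
    // Earliest-end-first greedy: returns a largest possible set of
    // pairwise non-overlapping meetings (maximizes the COUNT, not the time).
    public static List<Meeting> FindOptimal(IEnumerable<Meeting> meetings)
    {
        var result = new List<Meeting>();
        int lastEnd = int.MinValue;
        foreach (var m in meetings.OrderBy(x => x.End))
        {
            // Skipping a meeting here plays the role of "removeFirst" in the pseudocode.
            if (m.Start >= lastEnd)
            {
                result.Add(m);
                lastEnd = m.End;
            }
        }
        return result;
    }
}
```

Instead of removing items from the front of the list, the loop simply skips every meeting that starts before the end of the last chosen one, which selects the same meetings.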
Clarification: This answer assumes "Room Utilization" means maximize number of meetings placed in the room.
Thanks all, here is my solution based on this Princeton note on dynamic programming.
Algorithm:
1) Sort all events by end time.
2) For each event j, find p(j): the latest event (by end time) that does not overlap with it.
3) Compute the optimization values: for each event, choose the better of including and not including it.
Optimize(n) {
opt(0) = 0;
for j = 1 to n {
opt(j) = max(length(j) + opt(p(j)), opt(j-1));
}
}
The complete source-code:
namespace CommonProblems.Algorithm.DynamicProgramming {
public class Scheduler {
#region init & test
public List<Event> _events { get; set; }
public List<Event> Init() {
if (_events == null) {
_events = new List<Event>();
_events.Add(new Event(8, 11));
_events.Add(new Event(6, 10));
_events.Add(new Event(5, 9));
_events.Add(new Event(3, 8));
_events.Add(new Event(4, 7));
_events.Add(new Event(0, 6));
_events.Add(new Event(3, 5));
_events.Add(new Event(1, 4));
}
return _events;
}
public void DemoOptimize() {
this.Init();
this.DynamicOptimize(this._events);
}
#endregion
#region Dynamic Programming
public void DynamicOptimize(List<Event> events) {
events.Add(new Event(0, 0));
events = events.SortByEndTime();
int[] eventIndexes = getCompatibleEvent(events);
int[] utilization = getBestUtilization(events, eventIndexes);
List<Event> schedule = getOptimizeSchedule(events, events.Count - 1, utilization, eventIndexes);
foreach (var e in schedule) {
Console.WriteLine("Event: [{0}- {1}]", e.Start, e.End);
}
}
/*
Algo to get optimization value:
1) Sort all events by end time, give each of them an index.
2) For each event, find p[n] - the latest event (by end time) which does not overlap with it.
3) Compute the optimization values: choose the best between including/not including the event.
Optimize(n) {
opt(0) = 0;
for j = 1 to n-th {
opt(j) = max(length(j) + opt[p(j)], opt[j-1]);
}
display opt();
}
*/
int[] getBestUtilization(List<Event> sortedEvents, int[] compatibleEvents) {
int[] optimal = new int[sortedEvents.Count];
int n = optimal.Length;
optimal[0] = 0;
for (int j = 1; j < n; j++) {
var thisEvent = sortedEvents[j];
//pick between 2 choices:
optimal[j] = Math.Max(thisEvent.Duration + optimal[compatibleEvents[j]], //Include this event
optimal[j - 1]); //Not include
}
return optimal;
}
/*
Show the optimized events:
sortedEvents: events sorted by end time.
index: event index to start with.
optimal: optimal[n] = the optimized schedule at n-th event.
compatibleEvents: compatibleEvents[n] = the latest event before n-th
*/
List<Event> getOptimizeSchedule(List<Event> sortedEvents, int index, int[] optimal, int[] compatibleEvents) {
List<Event> output = new List<Event>();
if (index == 0) {
//base case: no more event
return output;
}
//it's better to choose this event
else if (sortedEvents[index].Duration + optimal[compatibleEvents[index]] >= optimal[index]) {
output.Add(sortedEvents[index]);
//recursive go back
output.AddRange(getOptimizeSchedule(sortedEvents, compatibleEvents[index], optimal, compatibleEvents));
return output;
}
//it's better NOT choose this event
else {
output.AddRange(getOptimizeSchedule(sortedEvents, index - 1, optimal, compatibleEvents));
return output;
}
}
//compatibleEvents[n] = the latest event which do not overlap with n-th.
int[] getCompatibleEvent(List<Event> sortedEvents) {
int[] compatibleEvents = new int[sortedEvents.Count];
for (int i = 0; i < sortedEvents.Count; i++) {
for (int j = 0; j <= i; j++) {
if (!sortedEvents[j].IsOverlap(sortedEvents[i])) {
compatibleEvents[i] = j;
}
}
}
return compatibleEvents;
}
#endregion
}
public class Event {
public int EventId { get; set; }
public bool IsOverlap(Event other) {
return !(this.End <= other.Start ||
this.Start >= other.End);
}
public override bool Equals(object obj) {
var i = (Event)obj;
return base.Equals(obj) && i.Start == this.Start && i.End == this.End;
}
public int Start { get; set; }
public int End { get; set; }
public Event(int start, int end) {
Start = start;
End = end;
}
public int Duration {
get {
return End - Start;
}
}
}
public static class ListExtension {
public static bool ContainsOverlapped(this List<Event> list) {
var sortedList = list.OrderBy(x => x.Start).ToList();
for (int i = 0; i < sortedList.Count; i++) {
for (int j = i + 1; j < sortedList.Count; j++) {
if (sortedList[i].IsOverlap(sortedList[j]))
return true;
}
}
return false;
}
public static List<Event> SortByEndTime(this List<Event> events) {
if (events == null) return new List<Event>();
return events.OrderBy(x => x.End).ToList();
}
}
}
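As a side note on step 2: getCompatibleEvent above scans all earlier events for each event, which is quadratic. Because the events are sorted by end time, the latest compatible event can also be found by binary search. The following is only a sketch under that assumption; `CompatibleIndex`, `Compute` and `BestUtilization` are hypothetical names, it uses value tuples instead of the `Event` class, and it returns -1 (rather than relying on a sentinel event) when nothing is compatible.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CompatibleIndex
{
    // events must be sorted by End ascending. p[j] is the index of the last
    // event i < j with events[i].End <= events[j].Start, or -1 if there is none.
    public static int[] Compute(IReadOnlyList<(int Start, int End)> events)
    {
        var p = new int[events.Count];
        for (int j = 0; j < events.Count; j++)
        {
            int lo = 0, hi = j - 1, best = -1;
            while (lo <= hi)
            {
                int mid = lo + (hi - lo) / 2;
                if (events[mid].End <= events[j].Start) { best = mid; lo = mid + 1; }
                else hi = mid - 1;
            }
            p[j] = best;
        }
        return p;
    }

    // Maximum total occupied time over non-overlapping events.
    public static int BestUtilization(IEnumerable<(int Start, int End)> input)
    {
        var events = input.OrderBy(e => e.End).ToList();
        int[] p = Compute(events);
        var opt = new int[events.Count + 1]; // opt[k] = best value using the first k events
        for (int j = 0; j < events.Count; j++)
        {
            int take = events[j].End - events[j].Start + opt[p[j] + 1]; // include event j
            opt[j + 1] = Math.Max(take, opt[j]);                        // or skip it
        }
        return opt[events.Count];
    }
}
```

With the eight test events from Init(), this sketch yields a best utilization of 10 hours (the events [0-6] and [6-10]).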

Comparing names

Is there any simple algorithm to determine the likelihood that two names represent the same person?
I'm not asking for something on the level of what a Customs department might be using. Just a simple algorithm that would tell me whether 'James T. Clark' is most likely the same name as 'J. Thomas Clark' or 'James Clerk'.
If there is an algorithm in C# that would be great, but I can translate from any language.
Sounds like you're looking for a phonetic algorithm, such as Soundex, NYSIIS, or Double Metaphone. The first is actually what several government departments use, and it is trivial to implement (with many implementations readily available). The second is a slightly more complicated and more precise version of the first. The last works with some non-English names and alphabets.
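To give an idea of how small Soundex is, here is a minimal sketch of the common American variant (first letter plus three digits, with H and W not breaking a run of equal codes). The class name `Soundex` and its `Encode` method are names chosen for this illustration, and the sketch only handles the letters A to Z; anything else is ignored.

```csharp
using System;
using System.Text;

public static class Soundex
{
    // Digit for each letter A..Z; '0' marks letters that are not coded
    // (the vowels plus H, W and Y).
    private const string Codes = "01230120022455012623010202";

    public static string Encode(string name)
    {
        var sb = new StringBuilder(4);
        char lastCode = '0';
        foreach (char raw in name ?? string.Empty)
        {
            char c = char.ToUpperInvariant(raw);
            if (c < 'A' || c > 'Z') continue; // skip spaces, dots, digits, non-ASCII letters
            char code = Codes[c - 'A'];
            if (sb.Length == 0)
            {
                sb.Append(c);          // the first letter is kept as-is
                lastCode = code;
            }
            else
            {
                if (code != '0' && code != lastCode) sb.Append(code);
                if (c != 'H' && c != 'W') lastCode = code; // H and W do not separate equal codes
            }
            if (sb.Length == 4) break;
        }
        return sb.Length == 0 ? string.Empty : sb.ToString().PadRight(4, '0');
    }
}
```

Soundex is normally applied per word, so for the names in the question one would encode the surnames separately: `Soundex.Encode("Clark")` and `Soundex.Encode("Clerk")` both give `C462`.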
Levenshtein distance is a definition of distance between two arbitrary strings. It gives you a distance of 0 between identical strings and non-zero between different strings, which might also be useful if you decide to make a custom algorithm.
Levenshtein is close, although maybe not exactly what you want.
I've faced a similar problem and tried Levenshtein distance first, but it did not work well for me. I came up with an algorithm that gives you a "similarity" value between two strings (a higher value means more similar strings, "1" for identical strings). This value is not very meaningful by itself (if not "1", it is always 0.5 or less), but it works quite well when you throw in the Hungarian algorithm to find matching pairs from two lists of strings.
Use like this:
PartialStringComparer cmp = new PartialStringComparer();
tbResult.Text = cmp.Compare(textBox1.Text, textBox2.Text).ToString();
The code behind:
public class SubstringRange {
string masterString;
public string MasterString {
get { return masterString; }
set { masterString = value; }
}
int start;
public int Start {
get { return start; }
set { start = value; }
}
int end;
public int End {
get { return end; }
set { end = value; }
}
public int Length {
get { return End - Start; }
set { End = Start + value;}
}
public bool IsValid {
get { return MasterString.Length >= End && End >= Start && Start >= 0; }
}
public string Contents {
get {
if(IsValid) {
return MasterString.Substring(Start, Length);
} else {
return "";
}
}
}
public bool OverlapsRange(SubstringRange range) {
return !(End < range.Start || Start > range.End);
}
public bool ContainsRange(SubstringRange range) {
return range.Start >= Start && range.End <= End;
}
public bool ExpandTo(string newContents) {
if(MasterString.Substring(Start).StartsWith(newContents, StringComparison.InvariantCultureIgnoreCase) && newContents.Length > Length) {
Length = newContents.Length;
return true;
} else {
return false;
}
}
}
public class SubstringRangeList: List<SubstringRange> {
string masterString;
public string MasterString {
get { return masterString; }
set { masterString = value; }
}
public SubstringRangeList(string masterString) {
this.MasterString = masterString;
}
public SubstringRange FindString(string s){
foreach(SubstringRange r in this){
if(r.Contents.Equals(s, StringComparison.InvariantCultureIgnoreCase))
return r;
}
return null;
}
public SubstringRange FindSubstring(string s){
foreach(SubstringRange r in this){
if(r.Contents.StartsWith(s, StringComparison.InvariantCultureIgnoreCase))
return r;
}
return null;
}
public bool ContainsRange(SubstringRange range) {
foreach(SubstringRange r in this) {
if(r.ContainsRange(range))
return true;
}
return false;
}
public bool AddSubstring(string substring) {
bool result = false;
foreach(SubstringRange r in this) {
if(r.ExpandTo(substring)) {
result = true;
}
}
if(FindSubstring(substring) == null) {
bool patternfound = true;
int start = 0;
while(patternfound){
patternfound = false;
start = MasterString.IndexOf(substring, start, StringComparison.InvariantCultureIgnoreCase);
patternfound = start != -1;
if(patternfound) {
SubstringRange r = new SubstringRange();
r.MasterString = this.MasterString;
r.Start = start++;
r.Length = substring.Length;
if(!ContainsRange(r)) {
this.Add(r);
result = true;
}
}
}
}
return result;
}
private static bool SubstringRangeMoreThanOneChar(SubstringRange range) {
return range.Length > 1;
}
public float Weight {
get {
if(MasterString.Length == 0 || Count == 0)
return 0;
float numerator = 0;
int denominator = 0;
foreach(SubstringRange r in this.FindAll(SubstringRangeMoreThanOneChar)) {
numerator += r.Length;
denominator++;
}
if(denominator == 0)
return 0;
return numerator / denominator / MasterString.Length;
}
}
public void RemoveOverlappingRanges() {
SubstringRangeList l = new SubstringRangeList(this.MasterString);
l.AddRange(this);//create a copy of this list
foreach(SubstringRange r in l) {
if(this.Contains(r) && this.ContainsRange(r)) {
Remove(r);//try to remove the range
if(!ContainsRange(r)) {//see if the list still contains "superset" of this range
Add(r);//if not, add it back
}
}
}
}
public void AddStringToCompare(string s) {
for(int start = 0; start < s.Length; start++) {
for(int len = 1; start + len <= s.Length; len++) {
string part = s.Substring(start, len);
if(!AddSubstring(part))
break;
}
}
RemoveOverlappingRanges();
}
}
public class PartialStringComparer {
public float Compare(string s1, string s2) {
SubstringRangeList srl1 = new SubstringRangeList(s1);
srl1.AddStringToCompare(s2);
SubstringRangeList srl2 = new SubstringRangeList(s2);
srl2.AddStringToCompare(s1);
return (srl1.Weight + srl2.Weight) / 2;
}
}
The Levenshtein distance one is much simpler (adapted from http://www.merriampark.com/ld.htm):
public class Distance {
/// <summary>
/// Compute Levenshtein distance
/// </summary>
/// <param name="s">String 1</param>
/// <param name="t">String 2</param>
/// <returns>Distance between the two strings.
/// The larger the number, the bigger the difference.
/// </returns>
public static int LD(string s, string t) {
int n = s.Length; //length of s
int m = t.Length; //length of t
int[,] d = new int[n + 1, m + 1]; // matrix
int cost; // cost
// Step 1
if(n == 0) return m;
if(m == 0) return n;
// Step 2
for(int i = 0; i <= n; i++) d[i, 0] = i;
for(int j = 0; j <= m; j++) d[0, j] = j;
// Step 3
for(int i = 1; i <= n; i++) {
//Step 4
for(int j = 1; j <= m; j++) {
// Step 5
cost = (t[j - 1] == s[i - 1] ? 0 : 1);
// Step 6
d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);
}
}
// Step 7
return d[n, m];
}
}
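A raw distance is hard to compare across names of different lengths, so one option is to turn it into a score between 0 and 1. The sketch below is one possible way to do that, not part of the adapted code above: `NameSimilarity` is a hypothetical class, it carries its own two-row Levenshtein so that it compiles on its own, and dividing by the longer length is just one common normalization choice.

```csharp
using System;

public static class NameSimilarity
{
    // Levenshtein distance using two rolling rows instead of the full matrix.
    public static int Distance(string s, string t)
    {
        var prev = new int[t.Length + 1];
        var curr = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) prev[j] = j;
        for (int i = 1; i <= s.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            var tmp = prev; prev = curr; curr = tmp; // the finished row becomes "prev"
        }
        return prev[t.Length];
    }

    // 1.0 for identical strings (ignoring case), 0.0 when nothing matches.
    public static double Similarity(string a, string b)
    {
        a = a.ToUpperInvariant();
        b = b.ToUpperInvariant();
        int longest = Math.Max(a.Length, b.Length);
        if (longest == 0) return 1.0;
        return 1.0 - (double)Distance(a, b) / longest;
    }
}
```

For example, 'James Clark' and 'James Clerk' differ by one substitution over 11 characters, so the score is 1 - 1/11, or about 0.91. Where to put the "same person" threshold is a judgement call that this sketch does not answer.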
I doubt there is, considering even the Customs Department doesn't seem to have a satisfactory answer...
If there is a solution to this problem I seriously doubt it's a part of core C#. Off the top of my head, it would require a database of first, middle and last name frequencies, as well as account for initials, as in your example. This is fairly complex logic that relies on a database of information.
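The initials part, at least, can be handled without any database. The following is only a rough heuristic sketch under strong assumptions: the last token is taken to be the surname and must match exactly, given names are compared position by position, and a single letter is accepted as an initial of any name starting with it. `InitialAwareMatcher` and its methods are hypothetical names introduced for this illustration.

```csharp
using System;

public static class InitialAwareMatcher
{
    private static string[] Tokens(string name) =>
        name.Split(new[] { ' ', '.', ',' }, StringSplitOptions.RemoveEmptyEntries);

    // True when the tokens are equal, or one is a single letter that the
    // other starts with (for example "J" and "James").
    private static bool GivenNameMatches(string a, string b)
    {
        if (string.Equals(a, b, StringComparison.OrdinalIgnoreCase)) return true;
        string shorter = a.Length <= b.Length ? a : b;
        string longer = a.Length <= b.Length ? b : a;
        return shorter.Length == 1 && longer.StartsWith(shorter, StringComparison.OrdinalIgnoreCase);
    }

    public static bool CouldBeSamePerson(string name1, string name2)
    {
        string[] t1 = Tokens(name1), t2 = Tokens(name2);
        if (t1.Length == 0 || t2.Length == 0) return false;
        // Assumption: the last token is the surname and must match (ignoring case).
        if (!string.Equals(t1[t1.Length - 1], t2[t2.Length - 1], StringComparison.OrdinalIgnoreCase))
            return false;
        // Compare given names pairwise; extra middle names on one side are ignored.
        int given = Math.Min(t1.Length, t2.Length) - 1;
        for (int i = 0; i < given; i++)
            if (!GivenNameMatches(t1[i], t2[i])) return false;
        return true;
    }
}
```

With this sketch, 'James T. Clark' and 'J. Thomas Clark' are reported as a possible match, while 'James T. Clark' and 'James Clerk' are not, because the surnames differ. Tolerating misspelled surnames would need a phonetic or edit-distance check on top of this.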
Seconding Levenshtein distance. What language do you want? I was able to find a C# implementation on CodeProject pretty easily.
In an application I worked on, the Last name field was considered reliable.
So we presented all the records with the same last name to the user.
The user could sort by the other fields to look for similar names.
This solution was good enough to greatly reduce the issue of users creating duplicate records.
Basically, it looks like the issue will require human judgement.
