For-loop has Unexpected Side-Effect - C#

So I have been writing a small byte cipher in C#, and everything was going well until I tried to do some for loops to test runtime performance. This is where things started to get really weird. Allow me to show you, instead of trying to explain it:
First off, here is the working code (for loops commented out):
using System;
using System.Diagnostics;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;
using DreamforceFramework.Framework.Cryptography;
namespace TestingApp
{
static class Program
{
static void Main(string[] args)
{
string myData = "This is a test.";
byte[] myDataEncrypted;
string myDecryptedData = null;
Stopwatch watch = new Stopwatch();
Console.WriteLine("Warming up for Encryption...");
//for (int i = 0; i < 20; i++)
//{
// // Warm up the algorithm for a proper speed benchmark.
// myDataEncrypted = DreamforceByteCipher.Encrypt(myData, "Dreamforce");
//}
watch.Start();
myDataEncrypted = DreamforceByteCipher.Encrypt(myData, "Dreamforce");
watch.Stop();
Console.WriteLine("Encryption Time: " + watch.Elapsed);
Console.WriteLine("Warming up for Decryption...");
//for (int i = 0; i < 20; i++)
//{
// // Warm up the algorithm for a proper speed benchmark.
// myDecryptedData = DreamforceByteCipher.Decrypt(myDataEncrypted, "Dreamforce");
//}
watch.Reset();
watch.Start();
myDecryptedData = DreamforceByteCipher.Decrypt(myDataEncrypted, "Dreamforce");
watch.Stop();
Console.WriteLine("Decryption Time: " + watch.Elapsed);
Console.WriteLine(myDecryptedData);
Console.Read();
}
}
}
and my ByteCipher (I simplified it heavily after the error first occurred, as an attempt to pinpoint the problem):
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using DreamforceFramework.Framework.Utilities;
namespace DreamforceFramework.Framework.Cryptography
{
/// <summary>
/// DreamforceByteCipher
/// Gordon Kyle Wallace, "Krythic"
/// Copyright (C) 2015 Gordon Kyle Wallace, "Krythic" - All Rights Reserved
/// </summary>
public static class DreamforceByteCipher
{
public static byte[] Encrypt(string data, string password)
{
byte[] bytes = Encoding.UTF8.GetBytes(data);
string passwordHash = DreamforceHashing.GenerateSHA256(password);
byte[] hashedPasswordBytes = Encoding.ASCII.GetBytes(passwordHash);
int passwordShiftIndex = 0;
bool twistPath = false;
for (int i = 0; i < bytes.Length; i++)
{
int shift = hashedPasswordBytes[passwordShiftIndex];
bytes[i] = twistPath
? (byte)(
(data[i] + (shift * i)))
: (byte)(
(data[i] - (shift * i)));
passwordShiftIndex = (passwordShiftIndex + 1) % 64;
twistPath = !twistPath;
}
return bytes;
}
/// <summary>
/// Decrypts a byte array back into a string.
/// </summary>
/// <param name="data"></param>
/// <param name="password"></param>
/// <returns></returns>
public static string Decrypt(byte[] data, string password)
{
string passwordHash = DreamforceHashing.GenerateSHA256(password);
byte[] hashedPasswordBytes = Encoding.UTF8.GetBytes(passwordHash);
int passwordShiftIndex = 0;
bool twistPath = false;
for (int i = 0; i < data.Length; i++)
{
int shift = hashedPasswordBytes[passwordShiftIndex];
data[i] = twistPath
? (byte)(
(data[i] - (shift * i)))
: (byte)(
(data[i] + (shift * i)));
passwordShiftIndex = (passwordShiftIndex + 1) % 64;
twistPath = !twistPath;
}
return Encoding.ASCII.GetString(data);
}
}
}
With the for loops commented out, the program works as expected: the very last line of the output shows that everything was decrypted successfully.
Now, this is where things get weird. If you uncomment the for loops and run the program again, the decryption no longer works. This makes absolutely no sense to me, because the variable holding the decrypted data should be overwritten each and every time. Did I encounter a bug in C#/.NET that is causing this strange behavior?
A simple solution:
http://pastebin.com/M3xa9yQK

Your Decrypt method modifies the data input array in place. Therefore, you can only call Decrypt a single time with any given input byte array before the data is no longer encrypted. Take a simple console application for example:
class Program
{
public static void Main(string[] args)
{
var arr = new byte[] { 10 };
Console.WriteLine(arr[0]); // prints 10
DoSomething(arr);
Console.WriteLine(arr[0]); // prints 11
}
private static void DoSomething(byte[] arr)
{
arr[0] = 11;
}
}
So, to answer your question, no. You haven't found a bug in .NET. You've found a very simple bug in your code.
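If you need to call Decrypt more than once on the same byte array (for example, in a warm-up loop like yours), a minimal sketch of a non-mutating variant, keeping the rest of the cipher exactly as posted, is to work on a copy of the input:
public static string Decrypt(byte[] data, string password)
{
    // copy the input so the caller's encrypted array is left untouched
    byte[] buffer = (byte[])data.Clone();
    string passwordHash = DreamforceHashing.GenerateSHA256(password);
    byte[] hashedPasswordBytes = Encoding.UTF8.GetBytes(passwordHash);
    int passwordShiftIndex = 0;
    bool twistPath = false;
    for (int i = 0; i < buffer.Length; i++)
    {
        int shift = hashedPasswordBytes[passwordShiftIndex];
        buffer[i] = twistPath
            ? (byte)(buffer[i] - (shift * i))
            : (byte)(buffer[i] + (shift * i));
        passwordShiftIndex = (passwordShiftIndex + 1) % 64;
        twistPath = !twistPath;
    }
    return Encoding.ASCII.GetString(buffer);
}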


Is there a more efficient way to generate 40 million alphanumeric unique random strings in C# [closed]

I've created a small Windows Forms application to generate random unique alphanumeric strings of length 8. The application works fine for small counts, but it gets stuck seemingly forever when I try to generate 40 million strings (as per my requirement). Please help me make it efficient so that the strings can be generated quickly.
Following is the complete code I've used for it.
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Threading.Tasks;
namespace RandomeString
{
public partial class Form1 : Form
{
private const string Letters = "abcdefghijklmnpqrstuvwxyz";
private readonly char[] alphanumeric = (Letters + Letters.ToLower() + "abcdefghijklmnpqrstuvwxyz0123456789").ToCharArray();
private static Random random = new Random();
private int _ticks;
public Form1()
{
InitializeComponent();
}
private void button1_Click(object sender, EventArgs e)
{
if (string.IsNullOrEmpty(textBox1.Text) || string.IsNullOrWhiteSpace(textBox2.Text))
{
string message = "Please provide required length and numbers of strings count.";
string title = "Input Missing";
MessageBoxButtons buttons = MessageBoxButtons.OK;
DialogResult result = MessageBox.Show(message, title, buttons, MessageBoxIcon.Warning);
}
else
{
int ValuesCount;
ValuesCount = Convert.ToInt32(textBox2.Text);
for (int i = 1; i <= ValuesCount; i++)
{
listBox1.Items.Add(RandomString(Convert.ToInt32(textBox1.Text)));
}
}
}
public static string RandomString(int length)
{
const string chars = "abcdefghijklmnpqrstuvwxyz0123456789";
return new string(Enumerable.Repeat(chars, length)
.Select(s => s[random.Next(s.Length)]).ToArray());
}
private void button2_Click(object sender, EventArgs e)
{
try
{
StringBuilder sb = new StringBuilder();
foreach (object row in listBox1.Items)
{
sb.Append(row.ToString());
sb.AppendLine();
}
sb.Remove(sb.Length - 1, 1); // Just to avoid copying last empty row
Clipboard.SetData(System.Windows.Forms.DataFormats.Text, sb.ToString());
}
catch (Exception ex)
{
MessageBox.Show(ex.Message);
}
}
private void timer1_Tick(object sender, EventArgs e)
{
_ticks++;
this.Text = _ticks.ToString();
}
}
}
One way to speed things up is to avoid LINQ. For example, take a look at these two implementations:
public static string LinqStuff(int length)
{
const string chars = "abcdefghijklmnpqrstuvwxyz0123456789";
return new string(Enumerable.Repeat(chars, length)
.Select(s => s[random.Next(s.Length)]).ToArray());
}
public static string ManualStuff(int length)
{
const string chars = "abcdefghijklmnpqrstuvwxyz0123456789";
const int clength = 35;
var buffer = new char[length];
for(var i = 0; i < length; ++i)
{
buffer[i] = chars[random.Next(clength)];
}
return new string(buffer);
}
Running it through this:
private void TestThis(long iterations)
{
Console.WriteLine($"Running {iterations} iterations...");
var sw = new Stopwatch();
sw.Start();
for (long i = 0; i < iterations; ++i)
{
LinqStuff(20);
}
sw.Stop();
Console.WriteLine($"LINQ took {sw.ElapsedMilliseconds} ms.");
sw.Reset();
sw.Start();
for (long i = 0; i < iterations; ++i)
{
ManualStuff(20);
}
sw.Stop();
Console.WriteLine($"Manual took {sw.ElapsedMilliseconds} ms.");
}
With this:
TestThis(50_000_000);
Yielded these results:
LINQ took 28272 ms.
Manual took 9449 ms.
So the LINQ version takes roughly three times as long to generate the strings.
You could probably tweak this further and squeeze out a few more seconds, for example by sending in the same char[] buffer to each call (sketched below).
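A rough sketch of that buffer-reuse idea, assuming single-threaded use (the shared buffer is not thread-safe), that length never exceeds the buffer size, and that random is the static Random field from the question:
private static readonly char[] _sharedBuffer = new char[64]; // must be at least as long as the longest string requested

public static string ManualStuffReusedBuffer(int length)
{
    const string chars = "abcdefghijklmnpqrstuvwxyz0123456789";
    for (var i = 0; i < length; ++i)
    {
        _sharedBuffer[i] = chars[random.Next(chars.Length)];
    }
    // new string(char[], int, int) copies only the first 'length' characters
    return new string(_sharedBuffer, 0, length);
}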
- Don't use LINQ.
- Pre-allocate the memory.
- Don't put the results into a UI control.
- Use as many cores and threads as you can.
- Use direct memory.
- Write the results to a file instead of using the clipboard.
This could likely be done even more quickly and efficiently (see the notes below); however, I can generate the chars in under 200 ms.
Note: Span<T> would give better results, but due to the lambdas it's easier to take the small hit from fixed and use pointers.
private const string Chars = "abcdefghijklmnpqrstuvwxyz0123456789";
private static readonly ThreadLocal<Random> _random =
new ThreadLocal<Random>(() => new Random());
public static unsafe void Do(byte[] array, int index)
{
    var r = _random.Value;
    // each of the 40 workers fills one contiguous chunk of the shared array
    int chunkSize = array.Length / 40;
    fixed (byte* pArray = array)
    {
        var pEnd = pArray + ((index + 1) * chunkSize);
        int i = 1;
        for (var p = pArray + (index * chunkSize); p < pEnd; p++, i++)
            if ((i % 10) == 9) *p = (byte)'\r';      // 9th byte of each 10-byte record
            else if ((i % 10) == 0) *p = (byte)'\n'; // 10th byte ends the line
            else *p = (byte)Chars[r.Next(35)];       // 8 random characters per line
    }
}
public static async Task Main(string[] args)
{
    // 40,000,000 strings, each 8 chars plus "\r\n"
    var array = new byte[40000000 * (8 + 2)];
    var sw = Stopwatch.StartNew();
    // 40 chunks, matching the chunk size used in Do
    Parallel.For(0, 40, (index) => Do(array, index));
    Console.WriteLine(sw.Elapsed);
    sw = Stopwatch.StartNew();
    await using (var fs = new FileStream(@"D:\asdasd.txt", FileMode.Create, FileAccess.Write, FileShare.None, 1024 * 1024, FileOptions.Asynchronous | FileOptions.SequentialScan))
        await fs.WriteAsync(array, 0, array.Length);
    Console.WriteLine(sw.Elapsed);
}
Output
00:00:00.1768141
00:00:00.4369418
Note 1: I haven't put much thought into this beyond the raw generation; obviously there are other considerations.
Note 2: This array will end up on the large object heap, so buyer beware. You would need to generate the data straight to the file to keep it off the LOH.
Note 3: I give no guarantees about the random distribution; a different random number generator would likely be better overall.
Note 4: I used 40 chunks because the math was easy; you would get slightly better results if you matched the number of chunks to your cores.
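As a rough sketch of Note 4, the fixed 40 chunks could be replaced by one chunk per core. FillChunk here is a hypothetical helper that fills array[offset .. offset + length) the same way Do does above, and the sketch assumes the array length divides evenly by the worker count:
int workers = Environment.ProcessorCount;   // one chunk per logical core
int chunkLength = array.Length / workers;   // assumes an even split
Parallel.For(0, workers, w => FillChunk(array, w * chunkLength, chunkLength));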

Character Counter using Array List

I need to design a program that reads in an ASCII text file and creates an output file that contains each unique ASCII character and the number of times it appears in the file. Each unique character in the file must be represented by a character frequency class instance. The character frequency objects must be stored in an array list. My code is below:
using System.IO;
using System;
using System.Collections;
namespace ASCII
{
class CharacterFrequency
{
char ch;
int frequency;
public char getCharacter()
{
return ch;
}
public void setCharacter(char ch)
{
this.ch = ch;
}
public int getfrequency()
{
return frequency;
}
public void setfrequency(int frequency)
{
this.frequency = frequency;
}
static void Main()
{
string OutputFileName;
string InputFileName;
Console.WriteLine("Enter the file path");
InputFileName = Console.ReadLine();
Console.WriteLine("Enter the outputfile name");
OutputFileName = Console.ReadLine();
StreamWriter streamWriter = new StreamWriter(OutputFileName);
string data = File.ReadAllText(InputFileName);
ArrayList al = new ArrayList();
//create two for loops to traverse through the arraylist and compare
for (int i = 0; i < data.Length; i++)
{
int k = 0;
int f = 0;
for (int j = 0; j < data.Length; j++)
{
if (data[i].Equals(data[j]))
{
f++;
if (i > j) { k++; }
}
}
al.Add(data[i] + "(" + (int)data[i] + ")" + f + " ");
foreach (var item in al)
{
streamWriter.WriteLine(item);
}
}
streamWriter.Close();
}
}
}
When I run the program, it does not stop running and the output file keeps getting larger until the program eventually runs out of memory and I get an error stating that. I am not seeing where the error is or why the loop won't terminate. It should just count the characters, but it seems to keep looping and counting them over and over. Any help?
Try this approach:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace yourNamespace
{
class Char_frrequency
{
Dictionary<Char, int> countMap = new Dictionary<char, int>();
public String getStringWithUniqueCharacters(String input)
{
List<Char> uniqueList = new List<Char>();
foreach (Char x in input)
{
if (countMap.ContainsKey(x))
{
countMap[x]++;
}
else
{
countMap.Add(x, 1);
}
if (!uniqueList.Contains(x))
{
uniqueList.Add(x);
}
}
Char[] uniqueArray = uniqueList.ToArray();
return new String(uniqueArray);
}
public int getFrequency(Char x)
{
return countMap[x];
}
}
}
This might not be the ideal solution, but you can use these methods.
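For reference, the runaway output in the original code most likely comes from the foreach that writes the whole ArrayList being nested inside the outer counting loop, so the ever-growing list is written to the file once per character of the input. A minimal sketch of that part of Main with the write moved after the loops (and characters that were already counted skipped, which the assignment seems to require), keeping the question's variable names:
ArrayList al = new ArrayList();
for (int i = 0; i < data.Length; i++)
{
    // skip characters we have already counted
    bool alreadyCounted = false;
    for (int j = 0; j < i; j++)
    {
        if (data[i].Equals(data[j])) { alreadyCounted = true; break; }
    }
    if (alreadyCounted) continue;

    // count every occurrence of this character
    int f = 0;
    for (int j = 0; j < data.Length; j++)
    {
        if (data[i].Equals(data[j])) f++;
    }
    al.Add(data[i] + "(" + (int)data[i] + ")" + f + " ");
}

// write the results once, after counting is finished
foreach (var item in al)
{
    streamWriter.WriteLine(item);
}
streamWriter.Close();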

Memory leakage in C# while using very big strings

While inspecting memory leakage in one of my apps, I found that the following code "behaves strangely".
public String DoTest()
{
String fileContent = "";
String fileName = "";
String[] filesNames = System.IO.Directory.GetFiles(logDir);
List<String> contents = new List<string>();
for (int i = 0; i < filesNames.Length; i++)
{
fileName = filesNames[i];
if (fileName.ToLower().Contains("aud"))
{
contents.Add(System.IO.File.ReadAllText(fileName));
}
}
fileContent = String.Join("", contents);
return fileContent;
}
Before running this piece of code, the memory used was approximately 1.4 MB. Once this method was called, it used 70 MB. After waiting some minutes, nothing changed (the original object had been released long before).
Calling
GC.Collect();
GC.WaitForFullGCComplete();
decreased the memory to 21 MB (still far more than the 1.4 MB at the beginning).
Tested with a console app (infinite loop) and a WinForms app. It happens even on a direct call (no need to create more objects).
Edit: full code (console app) to show the problem
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
namespace memory_tester
{
/// <summary>
/// Class to show loosing of memory
/// </summary>
class memory_leacker
{
// path to folder with 250 text files, total of 80MB of text
const String logDir = @"d:\http_server_test\http_server_test\bin\Debug\logs\";
/// <summary>
/// Collecting all text from files in folder logDir and returns it.
/// </summary>
/// <returns></returns>
public String DoTest()
{
String fileContent = "";
String fileName = "";
String[] filesNames = System.IO.Directory.GetFiles(logDir);
List<String> contents = new List<string>();
for (int i = 0; i < filesNames.Length; i++)
{
fileName = filesNames[i];
if (fileName.ToLower().Contains("aud"))
{
//using string builder directly into fileContent shows same results.
contents.Add(System.IO.File.ReadAllText(fileName));
}
}
fileContent = String.Join("", contents);
return fileContent;
}
/// <summary>
/// demo call to see that no memory leaks here
/// </summary>
/// <returns></returns>
public String DoTestDemo()
{
return "";
}
}
class Program
{
/// <summary>
/// Get current proc's private memory
/// </summary>
/// <returns></returns>
public static long GetUsedMemory()
{
String procName = System.AppDomain.CurrentDomain.FriendlyName;
long mem = Process.GetCurrentProcess().PrivateMemorySize64 ;
return mem;
}
static void Main(string[] args)
{
const long waitTime = 10; //was 240
memory_leacker mleaker = new memory_leacker();
for (int i=0; i< waitTime; i++)
{
Console.Write($"Memory before {GetUsedMemory()} Please wait {i}\r");
Thread.Sleep(1000);
}
Console.Write("\r\n");
mleaker.DoTestDemo();
for (int i = 0; i < waitTime; i++)
{
Console.Write($"Memory after demo call {GetUsedMemory()} Please wait {i}\r");
Thread.Sleep(1000);
}
Console.Write("\r\n");
mleaker.DoTest();
for (int i = 0; i < waitTime; i++)
{
Console.Write($"Memory after real call {GetUsedMemory()} Please wait {i}\r");
Thread.Sleep(1000);
}
Console.Write("\r\n");
mleaker = null;
for (int i = 0; i < waitTime; i++)
{
Console.Write($"Memory after release objectg {GetUsedMemory()} Please wait {i}\r");
Thread.Sleep(1000);
}
Console.Write("\r\n");
GC.Collect();
GC.WaitForFullGCComplete();
for (int i = 0; i < waitTime; i++)
{
Console.Write($"Memory after GC {GetUsedMemory()} Please wait {i}\r");
Thread.Sleep(1000);
}
Console.Write("\r\n...pause...");
Console.ReadKey();
}
}
}
I believe that if you use a StringBuilder for fileContent instead of a string, you can improve your performance and memory usage, since the text is accumulated in a single buffer and converted to a string only once at the end.
public String DoTest()
{
var fileContent = new StringBuilder();
String fileName = "";
String[] filesNames = System.IO.Directory.GetFiles(logDir);
for (int i = 0; i < filesNames.Length; i++)
{
fileName = filesNames[i];
if (fileName.ToLower().Contains("aud"))
{
fileContent.Append(System.IO.File.ReadAllText(fileName));
}
}
return fileContent.ToString();
}
A refactored version of your code is below; here I have removed the need for the list of strings named 'contents' in your original question.
public String DoTest()
{
    IEnumerable<string> filesNames = System.IO.Directory.GetFiles(logDir)
        .Where(x => x.ToLower().Contains("aud"));
    // read each matching file and concatenate the contents in one pass
    return string.Join("", filesNames.Select(System.IO.File.ReadAllText));
}
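As a side note on the numbers themselves: an ~80 MB concatenated string is allocated on the large object heap, which the CLR does not compact by default, so the process's private bytes can stay elevated even after the string becomes unreachable and is collected. If the goal is only to confirm that the memory really is reclaimable, a small sketch (not something to leave in production code) is to request LOH compaction explicitly and measure the managed heap instead of private bytes:
using System;
using System.Runtime;

// Ask the next blocking collection to also compact the large object heap,
// then report managed heap usage rather than the process's private bytes.
GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect();
GC.WaitForPendingFinalizers();
Console.WriteLine("Managed heap: " + GC.GetTotalMemory(forceFullCollection: true));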

NAudio - wave split and combine wave in realtime

I am working with a multi-input sound card and I want to achieve live mixing of multiple inputs. All the inputs are stereo, so I need to split them first, mix a selection of channels, and provide the result as a mono stream.
The goal would be something like this: mix Channel1[left] + Channel3[right] + Channel4[right] -> mono stream.
I have already implemented a process chain like this:
1) WaveIn -> create BufferedWaveProvider for each channel -> add Samples (just the ones for current channel) to each BufferedWaveProvider by using wavein.DataAvailable += { buffwavprovider[channel].AddSamples(...)...
This gives me a nice list of multiple BufferedWaveProviders. The audio-splitting part here is implemented correctly.
2) Select multiple BufferedWaveProviders and give them to MixingWaveProvider32. Then create a WaveStream (using WaveMixerStream32 and IWaveProvider).
3) A MultiChannelToMonoStream takes that WaveStream and generates a mixdown. This also works.
But the result is that the audio is choppy, like some trouble with the buffer...
Is this the correct way to handle this problem, or is there a better solution?
edit - code added:
public class AudioSplitter
{
public List<NamedBufferedWaveProvider> WaveProviders { private set; get; }
public string Name { private set; get; }
private WaveIn _wavIn;
private int bytes_per_sample = 4;
/// <summary>
/// Splits up one WaveIn into one BufferedWaveProvider for each channel
/// </summary>
/// <param name="wavein"></param>
/// <returns></returns>
public AudioSplitter(WaveIn wavein, string name)
{
if (wavein.WaveFormat.Encoding != WaveFormatEncoding.IeeeFloat)
throw new Exception("Format must be IEEE float");
WaveProviders = new List<NamedBufferedWaveProvider>(wavein.WaveFormat.Channels);
Name = name;
_wavIn = wavein;
_wavIn.StartRecording();
var outFormat = NAudio.Wave.WaveFormat.CreateIeeeFloatWaveFormat(wavein.WaveFormat.SampleRate, 1);
for (int i = 0; i < wavein.WaveFormat.Channels; i++)
{
WaveProviders.Add(new NamedBufferedWaveProvider(outFormat) { DiscardOnBufferOverflow = true, Name = Name + "_" + i });
}
bytes_per_sample = _wavIn.WaveFormat.BitsPerSample / 8;
wavein.DataAvailable += Wavein_DataAvailable;
}
/// <summary>
/// add samples for each channel to bufferedwaveprovider
/// </summary>
/// <param name="sender"></param>
/// <param name="e"></param>
private void Wavein_DataAvailable(object sender, WaveInEventArgs e)
{
int channel = 0;
byte[] buffer = e.Buffer;
for (int i = 0; i < e.BytesRecorded - bytes_per_sample; i = i + bytes_per_sample)
{
byte[] channel_buffer = new byte[bytes_per_sample];
for (int j = 0; j < bytes_per_sample; j++)
{
channel_buffer[j] = buffer[i + j];
}
WaveProviders[channel].AddSamples(channel_buffer, 0, channel_buffer.Length);
channel++;
if (channel >= _wavIn.WaveFormat.Channels)
channel = 0;
}
}
}
Using the AudioSplitter for each input gives a list of BufferedWaveProviders (mono, 32-bit float).
var mix = new MixingWaveProvider32(_waveProviders);
var wps = new WaveProviderToWaveStream(mix);
MultiChannelToMonoStream mms = new MultiChannelToMonoStream(wps);
new Thread(() =>
{
byte[] buffer = new byte[4096];
while (mms.Read(buffer, 0, buffer.Length) > 0 && isrunning)
{
using (FileStream fs = new FileStream("C:\\temp\\audio\\mono_32.wav", FileMode.Append, FileAccess.Write))
{
fs.Write(buffer, 0, buffer.Length);
}
}
}).Start();
There is some space left for optimization, but basically this gets the job done (the original handler stopped its loop at e.BytesRecorded - bytes_per_sample, so the last sample of every callback buffer was silently dropped, which is a likely cause of the choppy audio):
private void Wavein_DataAvailable(object sender, WaveInEventArgs e)
{
int channel = 0;
byte[] buffer = e.Buffer;
List<List<byte>> channelbuffers = new List<List<byte>>();
for (int c = 0; c < _wavIn.WaveFormat.Channels; c++)
{
channelbuffers.Add(new List<byte>());
}
for (int i = 0; i < e.BytesRecorded; i++)
{
var byteList = channelbuffers[channel];
byteList.Add(buffer[i]);
if (i % bytes_per_sample == bytes_per_sample - 1)
channel++;
if (channel >= _wavIn.WaveFormat.Channels)
channel = 0;
}
for (int j = 0; j < channelbuffers.Count; j++)
{
WaveProviders[j].AddSamples(channelbuffers[j].ToArray(), 0, channelbuffers[j].Count());
}
}
We need to provide a WaveProvider (WaveProviders[j]) for each channel.
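As a possible take on the "space left for optimization" mentioned above, here is a sketch that replaces the per-byte List<byte> with pre-sized arrays, assuming e.BytesRecorded is always a whole number of frames (i.e. a multiple of channels * bytes_per_sample):
private void Wavein_DataAvailable(object sender, WaveInEventArgs e)
{
    int channels = _wavIn.WaveFormat.Channels;
    int frames = e.BytesRecorded / (channels * bytes_per_sample);
    var channelBuffers = new byte[channels][];
    for (int c = 0; c < channels; c++)
        channelBuffers[c] = new byte[frames * bytes_per_sample];

    // de-interleave: copy each sample into its channel's buffer
    for (int frame = 0; frame < frames; frame++)
    {
        for (int c = 0; c < channels; c++)
        {
            Buffer.BlockCopy(
                e.Buffer, (frame * channels + c) * bytes_per_sample,
                channelBuffers[c], frame * bytes_per_sample,
                bytes_per_sample);
        }
    }

    for (int c = 0; c < channels; c++)
        WaveProviders[c].AddSamples(channelBuffers[c], 0, channelBuffers[c].Length);
}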

fastest way to compare string elements with each other

I have a list with a lot of strings (>5000) where I have to take the first element and compare it to all following elements. E.g. consider this list of strings:
{ one, two, three, four, five, six, seven, eight, nine, ten }. Now I take one and compare it with two, three, four, ...; afterwards I take two and compare it with three, four, ...
I believe the lookup is the reason this takes so long. On a normal HDD (7200 rpm) it takes about 30 seconds, on an SSD 10 seconds. I just don't know how I can speed this up even more. All strings inside the list are ordered ascending, and it is important to check them in this order. If it would speed things up considerably, I would not mind having an unordered list.
I took a look at HashSet, but I need the checking order, so that would not work even with its fast Contains method.
EDIT: As it looks like I was not clear enough, and as requested by Dusan, here is the complete code. My problem case: I have a lot of directories with similar names; I get a collection of just the directory names, compare them with each other for similarity, and write the result. Hence the comparison between HDD and SSD. But that is weird, because I am not writing instantly; instead I put the results in a field and write them at the end. Still there is a difference in speed.
Why did I not include the whole code? Because I believe my core issue here is the lookup of values from the list and the comparison between the two strings. Everything else (adding to the list, checking the blacklist HashSet, and getting the list of directory names) should already be sufficiently fast.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using System.Text.RegularExpressions;
using System.Diagnostics;
using System.Threading;
namespace Similarity
{
/// <summary>
/// Credit http://www.dotnetperls.com/levenshtein
/// Contains approximate string matching
/// </summary>
internal static class LevenshteinDistance
{
/// <summary>
/// Compute the distance between two strings.
/// </summary>
public static int Compute(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
// Step 1
if (n == 0)
{
return m;
}
if (m == 0)
{
return n;
}
// Step 2
for (int i = 0; i <= n; d[i, 0] = i++)
{
}
for (int j = 0; j <= m; d[0, j] = j++)
{
}
// Step 3
for (int i = 1; i <= n; i++)
{
//Step 4
for (int j = 1; j <= m; j++)
{
// Step 5
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
// Step 6
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
// Step 7
return d[n, m];
}
}
internal class Program
{
#region Properties
private static HashSet<string> _blackList = new HashSet<string>();
public static HashSet<string> blackList
{
get
{
return _blackList;
}
}
public static void AddBlackListEntry(string line)
{
blackList.Add(line);
}
private static List<string> _similar = new List<string>();
public static List<string> similar
{
get
{
return _similar;
}
}
public static void AddSimilarEntry(string s)
{
similar.Add(s);
}
#endregion Properties
private static void Main(string[] args)
{
Clean();
var directories = Directory.EnumerateDirectories(Directory.GetCurrentDirectory(), "*", SearchOption.TopDirectoryOnly)
.Select(x => new DirectoryInfo(x).Name).OrderBy(y => new DirectoryInfo(y).Name).ToList();
using (StreamWriter sw = new StreamWriter(#"result.txt"))
{
foreach (var item in directories)
{
Console.WriteLine(item);
sw.WriteLine(item);
}
Console.WriteLine("Amount of directories: " + directories.Count());
}
if (directories.Count != 0)
{
StartSimilarityCheck(directories);
}
else
{
Console.WriteLine("No directories");
}
WriteResult(similar);
Console.WriteLine("Finish. Press any key to exit...");
Console.ReadKey();
}
private static void StartSimilarityCheck(List<string> whiteList)
{
int counter = 0; // how many did we check yet?
var watch = Stopwatch.StartNew();
foreach (var dirName in whiteList)
{
bool insertDirName = true;
if (!IsBlackList(dirName))
{
// start the next element
for (int i = counter + 1; i <= whiteList.Count; i++)
{
// end of index reached
if (i == whiteList.Count)
{
break;
}
int similiariy = LevenshteinDistance.Compute(dirName, whiteList[i]);
// low score means high similarity
if (similiariy < 2)
{
if (insertDirName)
{
//Writer(dirName);
AddSimilarEntry(dirName);
insertDirName = false;
}
//Writer(whiteList[i]);
AddSimilarEntry(dirName);
AddBlackListEntry(whiteList[i]);
}
}
}
Console.WriteLine(counter);
//Console.WriteLine("Skip: {0}", dirName);
counter++;
}
watch.Stop();
Console.WriteLine("Time: " + watch.ElapsedMilliseconds / 1000);
}
private static void WriteResult(List<string> list)
{
using (StreamWriter sw = new StreamWriter(#"similar.txt", true, Encoding.UTF8, 65536))
{
foreach (var item in list)
{
sw.WriteLine(item);
}
}
}
private static void Clean()
{
// yeah hardcoded file names incoming. Better than global variables??
try
{
if (File.Exists(#"similar.txt"))
{
File.Delete(#"similar.txt");
}
if (File.Exists(#"result.txt"))
{
File.Delete(#"result.txt");
}
}
catch (Exception)
{
throw;
}
}
private static void Writer(string s)
{
using (StreamWriter sw = new StreamWriter(#"similar.txt", true, Encoding.UTF8, 65536))
{
sw.WriteLine(s);
}
}
private static bool IsBlackList(string name)
{
return blackList.Contains(name);
}
}
To fix the bottleneck, which is the second for loop: insert an if-condition that checks whether the similarity is >= the threshold we want; if that is the case, break out of the loop. Now it runs in 1 second. Thanks everyone.
Your inner loop uses a strange double check. This may prevent an important JIT optimization, removal of redundant range checks.
//foreach (var item myList)
for (int j = 0; j < myList.Count-1; j++)
{
string item1 = myList[j];
for (int i = j + 1; i < myList.Count; i++)
{
string item2 = myList[i];
// if (i == myList.Count)
...
}
}
The amount of downvotes is crazy but oh well... I found the reason for my performance issue / bottleneck thanks to the comments.
The second for loop inside StartSimilarityCheck() iterates over all entries, which in itself is no problem, but from a performance standpoint it is bad. The solution is to only check strings that are in the neighborhood, but how do we know which ones those are?
First, we get a list sorted in ascending order. That gives us a rough grouping of similar strings. Now we define a threshold for the Levenshtein score (a smaller score means higher similarity between two strings). If the score is higher than the threshold, the strings are not very similar, so we can break out of the inner loop. That saves time and the program finishes really fast. Note that this approach is not bulletproof, IMHO, because if the first string is 0Directory it will be near the beginning of the list, but a similar string like zDirectory will be much further down and could be missed. Correct me if I am wrong.
private static void StartSimilarityCheck(List<string> whiteList)
{
var watch = Stopwatch.StartNew();
for (int j = 0; j < whiteList.Count - 1; j++)
{
string dirName = whiteList[j];
bool insertDirName = true;
int threshold = 2;
if (!IsBlackList(dirName))
{
// start the next element
for (int i = j + 1; i < whiteList.Count; i++)
{
// end of index reached
if (i == whiteList.Count)
{
break;
}
int similiarity = LevenshteinDistance.Compute(dirName, whiteList[i]);
if (similiarity >= threshold)
{
break;
}
// low score means high similarity
if (similiarity <= threshold)
{
if (insertDirName)
{
AddSimilarEntry(dirName);
AddSimilarEntry(whiteList[i]);
AddBlackListEntry(whiteList[i]);
insertDirName = false;
}
else
{
AddBlackListEntry(whiteList[i]);
}
}
}
}
Console.WriteLine(j);
}
watch.Stop();
Console.WriteLine("Ms: " + watch.ElapsedMilliseconds);
Console.WriteLine("Similar entries: " + similar.Count);
}
