C# comparing list of IDs

I have a List<Keyword> where Keyword class is:
public string keyword;
public List<int> ids;
public int hidden;
public int live;
public bool worked;
Each Keyword has its own keyword string and a set of 20 ids; by default, live is set to 1 and hidden to 0.
I need to iterate over the whole main list and invalidate keywords that share too many ids with another keyword: comparing every pair, if the second keyword repeats 6 or more ids of the first one (the threshold used in the code below), its hidden is set to 1 and its live to 0.
The algorithm is very basic, but it takes too long when the main list has many elements.
I'm wondering whether there is any method I could use to increase the speed.
The basic algorithm I use is:
foreach (Keyword main_keyword in lista_de_keywords_live)
{
if (main_keyword.worked) {
continue;
}
foreach (Keyword keyword_to_compare in lista_de_keywords_live)
{
if (keyword_to_compare.worked || ReferenceEquals(keyword_to_compare, main_keyword)) continue;
int n_ids_same = 0; // number of ids shared with main_keyword
foreach (int id in main_keyword.ids)
{
if (keyword_to_compare.ids.IndexOf(id) >= 0)
{
if (++n_ids_same >= 6) break;
}
}
if (n_ids_same >= 6)
{
keyword_to_compare.hidden = 1;
keyword_to_compare.live = 0;
keyword_to_compare.worked = true;
}
}
}

The code below is an example of how you would use a HashSet for your problem. However, I would not recommend using it in this scenario. On the other hand, the idea of sorting the ids to make the comparison faster still applies.
Run it in a Console Project to try it out.
Notice that once I'm done adding new ids to a keyword, I sort them. This makes the comparison faster later on.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
namespace KeywordExample
{
public class Keyword
{
public List<int> ids;
public int hidden;
public int live;
public bool worked;
public Keyword()
{
ids = new List<int>();
hidden = 0;
live = 1;
worked = false;
}
public override string ToString()
{
StringBuilder s = new StringBuilder();
if (ids.Count > 0)
{
s.Append(ids[0]);
for (int i = 1; i < ids.Count; i++)
{
s.Append(',' + ids[i].ToString());
}
}
return s.ToString();
}
}
public class KeywordComparer : EqualityComparer<Keyword>
{
public override bool Equals(Keyword k1, Keyword k2)
{
int equals = 0;
int i = 0;
int j = 0;
//based on sorted ids
while (i < k1.ids.Count && j < k2.ids.Count)
{
if (k1.ids[i] < k2.ids[j])
{
i++;
}
else if (k1.ids[i] > k2.ids[j])
{
j++;
}
else
{
equals++;
i++;
j++;
}
}
return equals >= 6;
}
public override int GetHashCode(Keyword keyword)
{
return 0;//notice that using the same hash for all keywords gives you an O(n^2) time complexity though.
}
}
class Program
{
static void Main(string[] args)
{
List<Keyword> listOfKeywordsLive = new List<Keyword>();
//add some values
Random random = new Random();
int n = 10;
int sizeOfMaxId = 20;
for (int i = 0; i < n; i++)
{
var newKeyword = new Keyword();
for (int j = 0; j < 20; j++)
{
newKeyword.ids.Add(random.Next(sizeOfMaxId) + 1);
}
newKeyword.ids.Sort(); //sorting the ids
listOfKeywordsLive.Add(newKeyword);
}
//solution here
HashSet<Keyword> set = new HashSet<Keyword>(new KeywordComparer());
set.Add(listOfKeywordsLive[0]);
for (int i = 1; i < listOfKeywordsLive.Count; i++)
{
Keyword keywordToCompare = listOfKeywordsLive[i];
if (!set.Add(keywordToCompare))
{
keywordToCompare.hidden = 1;
keywordToCompare.live = 0;
keywordToCompare.worked = true;
}
}
//print all keywords to check
Console.WriteLine(set.Count + "/" + n + " inserted");
foreach (var keyword in set)
{
Console.WriteLine(keyword);
}
}
}
}

The obvious source of inefficiency is the way you calculate the intersection of two lists (of ids). That algorithm is O(n^2). This is, by the way, the problem that relational databases solve for every join, and your approach would be called a nested loop join. The main efficient strategies are hash join and merge join. For your scenario the latter approach may be better, I guess, but you can also try HashSets if you like.
The second source of inefficiency is doing everything twice. As (a join b) is equal to (b join a), you do not need two full passes over the whole List<Keyword>. Actually, you only need to compare each keyword against the non-duplicate ones found so far.
Using some code from here, you can write the algorithm like:
Parallel.ForEach(list, k => k.ids.Sort());
List<Keyword> result = new List<Keyword>();
foreach (var k in list)
{
if (result.Any(r => r.ids.IntersectSorted(k.ids, Comparer<int>.Default)
.Skip(5)
.Any()))
{
k.hidden = 1;
k.live = 0;
k.worked = true;
}
else
{
result.Add(k);
}
}
If you replace the LINQ with the plain index manipulation approach (see the link above), it would be a tiny bit faster, I guess.
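The IntersectSorted extension used above is not part of the .NET base class library; it comes from the linked code, which is not reproduced here. As a rough sketch of the index manipulation variant, assuming both lists are already sorted in ascending order and contain no repeated ids, a helper could look like this (the names SortedLists and SharesAtLeast are hypothetical):

```csharp
using System.Collections.Generic;

public static class SortedLists
{
    // Hypothetical helper: returns true as soon as 'threshold' common values
    // have been found. Both lists must already be sorted in ascending order.
    public static bool SharesAtLeast(List<int> a, List<int> b, int threshold)
    {
        if (threshold <= 0) return true;
        int i = 0, j = 0, common = 0;
        while (i < a.Count && j < b.Count)
        {
            if (a[i] < b[j])
            {
                i++;
            }
            else if (a[i] > b[j])
            {
                j++;
            }
            else
            {
                // Same id in both lists: count it and stop early if possible.
                if (++common >= threshold) return true;
                i++;
                j++;
            }
        }
        return false;
    }
}
```

With such a helper, the condition in the loop above could be written as result.Any(r => SortedLists.SharesAtLeast(r.ids, k.ids, 6)), which expresses the same "at least 6 shared ids" test as .Skip(5).Any().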


How to auto-increment number and letter to generate a string sequence wise in c#

I have to generate a string like AAA0009; after AAA0009 it should generate AAA0010 to AAA0019 and so on up to AAA9999. After AAA9999 it should give AAB0000 to AAB9999, and so on up to ZZZ9999.
I want to use a static class and static variables so that it auto-increments by itself on every hit.
I have tried a few things but got nowhere close, so please help me out. Thanks.
Thanks for being instructive. I was trying, as I already said, but anyway you already want to downvote without even knowing the details:
Code:
public class GenerateTicketNumber
{
private static int num1 = 0;
public static string ToBase36()
{
const string base36 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
var sb = new StringBuilder(9);
int value = num1; // work on a copy so the static counter is not destroyed
do
{
sb.Insert(0, base36[value % 36]);
value /= 36;
} while (value != 0);
var paddedString = "#T" + sb.ToString().PadLeft(8, '0');
num1 = num1 + 1;
return paddedString;
}
}
Above is the code. It generates a sequence, but not the way I want. Anyway, I will use it; thanks for the help.
Though there's already an accepted answer, I would like to share this one.
P.S. I do not claim that this is the best approach, but in my previous work we made something similar using Azure Table Storage which is a no sql database (FYI) and it works.
1.) Create a table to store your running ticket number.
public class TicketNumber
{
public string Type { get; set; } // Maybe you want to have different types of ticket?
public string AlphaPrefix { get; set; }
public string NumericPrefix { get; set; }
public TicketNumber()
{
this.AlphaPrefix = "AAA";
this.NumericPrefix = "0001";
}
public void Increment()
{
int num = int.Parse(this.NumericPrefix);
if (num >= 9999) // roll over only after 9999 has been issued
{
num = 1;
bool isMax = this.AlphaPrefix == "ZZZ";
if (isMax)
{
this.AlphaPrefix = "AAA"; // reset
}
else
{
int i = 2; // We are assuming that there are only 3 characters
StringBuilder sb = new StringBuilder(this.AlphaPrefix);
// Trailing 'Z's wrap around to 'A' and carry into the letter to their left
while (sb[i] == 'Z')
{
sb[i] = 'A';
i--;
}
sb[i] = (char)(sb[i] + 1);
this.AlphaPrefix = sb.ToString();
}
}
else
{
num++;
}
this.NumericPrefix = num.ToString().PadLeft(4, '0');
}
public override string ToString()
{
return this.AlphaPrefix + this.NumericPrefix;
}
}
2.) Make sure you perform row-level locking and issue an error when it fails.
Here's an oracle syntax:
SELECT * FROM TICKETNUMBER WHERE TYPE = 'TYPE' FOR UPDATE NOWAIT;
This query locks the row and returns an error if the row is currently locked by another session.
We need this to make sure that even if you have millions of users generating a ticket number, it will not mess up the sequence.
Just make sure to save the new ticket number before you perform a COMMIT.
I forgot the MSSQL version of this but I recall using WITH (ROWLOCK) or something. Just google it.
3.) Working example:
static void Main()
{
TicketNumber ticketNumber = new TicketNumber();
ticketNumber.AlphaPrefix = "ZZZ";
ticketNumber.NumericPrefix = "9999";
for (int i = 0; i < 10; i++)
{
Console.WriteLine(ticketNumber);
ticketNumber.Increment();
}
Console.Read();
}
Output:
Looking at your code that you've provided, it seems that you're backing this with a number and just want to convert that to a more user-friendly text representation.
You could try something like this:
private static string ValueToId(int value)
{
var parts = new List<string>();
int numberPart = value % 10000;
parts.Add(numberPart.ToString("0000"));
value /= 10000;
for (int i = 0; i < 3 || value > 0; ++i)
{
parts.Add(((char)(65 + (value % 26))).ToString());
value /= 26;
}
return string.Join(string.Empty, parts.AsEnumerable().Reverse().ToArray());
}
It uses the last four decimal digits of the value as they are, and then converts the remainder of the value into the characters A-Z.
So 9999 becomes AAA9999, 10000 becomes AAB0000, and 270000 becomes ABB0000.
If the number is big enough that it exceeds 3 characters, it will add more letters at the start.
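Since the question asks for a static class that advances on every hit, the conversion above could be combined with a static counter. The following is only a sketch under stated assumptions: the names TicketSequence, Format and Next are hypothetical, the sequence is capped at ZZZ9999 instead of growing extra letters, and the counter lives in memory, so it restarts from AAA0000 whenever the process restarts.

```csharp
using System;
using System.Threading;

public static class TicketSequence
{
    // Largest value that still fits the AAA0000 .. ZZZ9999 pattern.
    private const int MaxValue = 26 * 26 * 26 * 10000 - 1;

    // Starts at -1 so that the first call to Next() yields value 0.
    private static int _next = -1;

    // Converts a number into three letters followed by four digits.
    public static string Format(int value)
    {
        if (value < 0 || value > MaxValue)
            throw new ArgumentOutOfRangeException(nameof(value));

        int digits = value % 10000;   // last four decimal digits
        int letters = value / 10000;  // remainder, written in base 26

        char c3 = (char)('A' + letters % 26);
        char c2 = (char)('A' + letters / 26 % 26);
        char c1 = (char)('A' + letters / 676);

        return new string(new[] { c1, c2, c3 }) + digits.ToString("0000");
    }

    // Thread-safe within one process: Interlocked.Increment hands out
    // each counter value exactly once.
    public static string Next()
    {
        return Format(Interlocked.Increment(ref _next));
    }
}
```

If the numbers have to survive restarts or be shared between several servers, the counter would still need to be persisted, for example in a database row as described in the other answer.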
Here's an example of how you could go about implementing it
void Main()
{
string template = @"AAAA00";
var templateChars = template.ToCharArray();
for (int i = 0; i < 100000; i++)
{
templateChars = IncrementCharArray(templateChars);
Console.WriteLine(string.Join("",templateChars ));
}
}
public static char Increment(char val)
{
if(val == '9') return 'A';
if(val == 'Z') return '0';
return ++val;
}
public static char[] IncrementCharArray(char[] val)
{
if (val.All(chr => chr == 'Z'))
{
var newArray = new char[val.Length + 1];
for (int i = 0; i < newArray.Length; i++)
{
newArray[i] = '0';
}
return newArray;
}
int length = val.Length;
while (length > 0) // stop before --length can drop below index 0
{
char lastVal = val[--length];
val[length] = Increment(lastVal);
if ( val[length] != '0') break;
}
return val;
}

Kattis phonelist issue

When I run the following locally it is fast, but when I submit it to Kattis it only passes 2 of 5 test cases and I get Time Limit Exceeded.
Any suggestions?
I have tried an input file with 10000 numbers and it is still fast locally :S
using System;
namespace phonelist
{
class Program
{
static void Main(string[] args)
{
int nrOfPhoneNrs = 0;
bool consistent;
int nrOfTestCases = Convert.ToInt32(Console.ReadLine().Trim());
for (byte i = 0; i < nrOfTestCases; i++)
{
consistent = false;
nrOfPhoneNrs = Convert.ToInt32(Console.ReadLine().Trim());
string[] phList = new string[nrOfPhoneNrs];
int n = 0;
while (n < nrOfPhoneNrs)
{
phList[n] = Console.ReadLine();
n++;
}
Array.Sort(phList);
int runs = nrOfPhoneNrs - 1;
for (int p = 0; p < runs; p++)
{
if (phList[p + 1].StartsWith(phList[p]))
{
consistent= true;
break;
}
}
Console.WriteLine(consistent? "NO" : "YES");
}
}
}
}
I think that your main problem is that you're using StartsWith and Array.Sort methods.
I don't want to give you too detailed advice (so that you can still solve it by yourself) but let me just suggest considering a different data structure than an array of strings, perhaps HashSet<string>.
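For readers who want to see where the hint leads, here is a minimal sketch of the HashSet idea. One possible reason for the slowness, which is an assumption and not confirmed by the question, is that string.StartsWith(string) and Array.Sort on strings perform culture-sensitive comparisons by default; the sketch below uses ordinal comparison throughout. The names PhoneList and IsConsistent are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class PhoneList
{
    // Returns true when no number in the list is a prefix of another one.
    public static bool IsConsistent(string[] numbers)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        foreach (string number in numbers)
        {
            // A repeated number is a prefix of its twin, so the list is inconsistent.
            if (!seen.Add(number)) return false;
        }

        foreach (string number in numbers)
        {
            // Look up every proper prefix (lengths 1 .. Length - 1) in the set.
            for (int len = 1; len < number.Length; len++)
            {
                if (seen.Contains(number.Substring(0, len))) return false;
            }
        }
        return true;
    }
}
```

Phone numbers are short, so each number needs only a handful of set lookups and no sorting is required. Keeping the original sort-based approach but passing StringComparer.Ordinal to Array.Sort and StringComparison.Ordinal to StartsWith would be another option worth measuring.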

Dictionary or list in c#

I got a strange C# programming problem. There is a data retrieval in groups of random lengths of number groups. The numbers should be all unique, like:
group[1]{1,2,15};
group[2]{3,4,7,33,22,100};
group[3]{11,12,9};
// Now there is a routine that adds a number to a group.
// For the example, just imagine the active group looks like:
// group[active]=(10,5,0)
group[active].add(new_number);
// Now if 100 were to be added to the active group
// then the active group should be merged to group[2]
// (as that one already contained 100)
// And then as a result it would like
group[1]{1,2,15};
group[2]{3,4,7,33,22,100,10,5,0}; // 10 5 0 added to group[2]
group[3]{11,12,9};
// 100 wasn't added to group[2] since it was already in there.
If the number to be added is already used (not unique) in a previous group.
Then I should merge all numbers in the active group towards that previous group, so I don’t get double numbers.
So in the above example if number 100 was added to the active
group, then all numbers in the group[active] should be merged into group[2].
And then the group[active] should start clean fresh again without any items. And since 100 was already in group[2] it should not be added double.
I am not entirely sure on how to deal with this in a proper way.
As an important criteria here is that it has to work fast.
I will have around minimal 30 groups (upper-bound unknown might be 2000 or more), and their length on average contains five integer numbers, but it could be much longer or only one number.
I kind of feel that I am reinventing the wheel here.
I wonder what this problem is called (does it go by a name, some sorting, or grouping math problem)?, with a name I might find some articles related to such problems.
But maybe it’s indeed something new, then what would be recommended? Should I use list of lists or a dictionary of lists.. or something else? Somehow the checking if the number is already present should be done fast.
I'm thinking along this path now and am not sure if it’s the best.
Instead of a single number, I use a struct now. It wasn't written in the original question as I was afraid, explaining that would make it too complex.
struct data{int ID; int additionalNumber}
Dictionary <int,List<data>> group =new Dictionary<int, List<data>>();
I can step aside from using a struct in here. A lookup list could connect the other data to the proper index. So this makes it again more close to the original description.
On a side note, great answers are given.
So far I don’t know yet what would work best for me in my situation.
Note on the selected answer
Several answers were given here, but I went for the pure dictionary solution.
Just as a note for people in similar problem scenarios: I'd still recommend testing, and maybe the others work better for you. It’s just that in my case currently it worked best. The code was also quite short which I liked, and a dictionary adds also other handy options for my future coding on this.
I would go with Dictionary<int, HashSet<int>>, since you want to avoid duplicates and want a fast way to check if given number already exists:
Usage example:
var groups = new Dictionary<int, HashSet<int>>();
// populate the groups
groups[1] = new HashSet<int>(new[] { 1,2,15 });
groups[2] = new HashSet<int>(new[] { 3,4,7,33,22,100 });
int number = 5;
int groupId = 4;
groups[groupId] = new HashSet<int>(); // the active group must exist before it is used below
bool numberExists = groups.Values.Any(x => x.Contains(number));
// if there is already a group that contains the number
// merge it with the current group and add the new number
if (numberExists)
{
var group = groups.First(kvp => kvp.Value.Contains(number));
groups[group.Key].UnionWith(groups[groupId]);
groups[groupId] = new HashSet<int>();
}
// otherwise just add the new number
else
{
groups[groupId].Add(number);
}
From what I gather you want to iteratively assign numbers to groups satisfying these conditions:
Each number can be contained in only one of the groups
Groups are sets (numbers can occur only once in given group)
If number n exists in group g and we try to add it to group g', all numbers from g' should be transferred to g instead (avoiding repetitions in g)
Although approaches utilizing Dictionary<int, HashSet<int>> are correct, here's another one (more mathematically based).
You could simply maintain a Dictionary<int, int>, in which the key would be the number, and the corresponding value would indicate the group, to which that number belongs (this stems from condition 1.). And here's the add routine:
//let's assume dict is a reference to the dictionary
//k is a number, and g is a group
void AddNumber(int k, int g)
{
//if k already has assigned a group, we assign all numbers from g
//to k's group (which should be O(n))
if(dict.ContainsKey(k) && dict[k] != g)
{
foreach(var keyValuePair in dict.Where(kvp => kvp.Value == g).ToList())
dict[keyValuePair.Key] = dict[k];
}
//otherwise simply assign number k to group g (which should be O(1))
else
{
dict[k] = g;
}
}
Notice that from a mathematical point of view what you want to model is a function from a set of numbers to a set of groups.
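The merge step above scans the whole dictionary to find the members of group g. A possible refinement, sketched here as an assumption and not as part of the answer, keeps a second dictionary from group id to its members so that a merge only touches the numbers being moved. The class name NumberGroups and its members are hypothetical, and the inline out variables need C# 7 or later.

```csharp
using System.Collections.Generic;

public class NumberGroups
{
    // number -> id of the group that owns it
    private readonly Dictionary<int, int> _groupOf = new Dictionary<int, int>();
    // group id -> numbers currently in that group
    private readonly Dictionary<int, List<int>> _members = new Dictionary<int, List<int>>();

    // Adds number k to group g. If k already belongs to another group,
    // every member of g moves to that group and g becomes empty.
    // Returns the id of the group that k belongs to afterwards.
    public int Add(int k, int g)
    {
        if (_groupOf.TryGetValue(k, out int owner) && owner != g)
        {
            if (_members.TryGetValue(g, out List<int> moving))
            {
                foreach (int n in moving) _groupOf[n] = owner;
                _members[owner].AddRange(moving); // numbers are unique, so no duplicates arise
                _members.Remove(g);
            }
            return owner;
        }

        if (!_groupOf.ContainsKey(k))
        {
            _groupOf[k] = g;
            if (!_members.TryGetValue(g, out List<int> list))
            {
                list = new List<int>();
                _members[g] = list;
            }
            list.Add(k);
        }
        return g;
    }

    // Returns the numbers of group g, or an empty list if the group is empty.
    public IReadOnlyList<int> Members(int g)
    {
        return _members.TryGetValue(g, out List<int> list) ? list : new List<int>();
    }
}
```

Replaying the example from the question (group 2 holds 3, 4, 7, 33, 22, 100 and the active group holds 10, 5, 0), adding 100 to the active group moves 10, 5 and 0 into group 2 and leaves the active group empty.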
I have kept it as easy to follow as I can, trying not to impact the speed or deviate from the spec.
Create a class called Groups.cs and copy and paste this code into it:
using System;
using System.Collections.Generic;
namespace XXXNAMESPACEXXX
{
public static class Groups
{
public static List<List<int>> group { get; set; }
public static int active { get; set; }
public static void AddNumberToGroup(int numberToAdd, int groupToAddItTo)
{
try
{
if (group == null)
{
group = new List<List<int>>();
}
while (group.Count < groupToAddItTo)
{
group.Add(new List<int>());
}
int IndexOfListToRefresh = -1;
List<int> NumbersToMove = new List<int>();
foreach (List<int> Numbers in group)
{
if (Numbers.Contains(numberToAdd) && (group.IndexOf(Numbers) + 1) != groupToAddItTo)
{
active = group.IndexOf(Numbers) + 1;
IndexOfListToRefresh = group.IndexOf(Numbers);
foreach (int Number in Numbers)
{
NumbersToMove.Add(Number);
}
}
}
foreach (int Number in NumbersToMove)
{
if (!group[groupToAddItTo - 1].Contains(Number))
{
group[groupToAddItTo - 1].Add(Number);
}
}
if (!group[groupToAddItTo - 1].Contains(numberToAdd))
{
group[groupToAddItTo - 1].Add(numberToAdd);
}
if (IndexOfListToRefresh != -1)
{
group[IndexOfListToRefresh] = new List<int>();
}
}
catch//(Exception ex)
{
//Exception handling here
}
}
public static string GetString()
{
string MethodResult = "";
try
{
string Working = "";
bool FirstPass = true;
foreach (List<int> Numbers in group)
{
if (!FirstPass)
{
Working += "\r\n";
}
else
{
FirstPass = false;
}
Working += "group[" + (group.IndexOf(Numbers) + 1) + "]{";
bool InnerFirstPass = true;
foreach (int Number in Numbers)
{
if (!InnerFirstPass)
{
Working += ", ";
}
else
{
InnerFirstPass = false;
}
Working += Number.ToString();
}
Working += "};";
if ((active - 1) == group.IndexOf(Numbers))
{
Working += " //<active>";
}
}
MethodResult = Working;
}
catch//(Exception ex)
{
//Exception handling here
}
return MethodResult;
}
}
}
I don't know if foreach is more or less efficient than standard for loops, so I have made an alternative version that uses standard for loops:
using System;
using System.Collections.Generic;
namespace XXXNAMESPACEXXX
{
public static class Groups
{
public static List<List<int>> group { get; set; }
public static int active { get; set; }
public static void AddNumberToGroup(int numberToAdd, int groupToAddItTo)
{
try
{
if (group == null)
{
group = new List<List<int>>();
}
while (group.Count < groupToAddItTo)
{
group.Add(new List<int>());
}
int IndexOfListToRefresh = -1;
List<int> NumbersToMove = new List<int>();
for(int i = 0; i < group.Count; i++)
{
List<int> Numbers = group[i];
int IndexOfNumbers = group.IndexOf(Numbers) + 1;
if (Numbers.Contains(numberToAdd) && IndexOfNumbers != groupToAddItTo)
{
active = IndexOfNumbers;
IndexOfListToRefresh = IndexOfNumbers - 1;
for (int j = 0; j < Numbers.Count; j++)
{
int Number = Numbers[j];
NumbersToMove.Add(Number);
}
}
}
for(int i = 0; i < NumbersToMove.Count; i++)
{
int Number = NumbersToMove[i];
if (!group[groupToAddItTo - 1].Contains(Number))
{
group[groupToAddItTo - 1].Add(Number);
}
}
if (!group[groupToAddItTo - 1].Contains(numberToAdd))
{
group[groupToAddItTo - 1].Add(numberToAdd);
}
if (IndexOfListToRefresh != -1)
{
group[IndexOfListToRefresh] = new List<int>();
}
}
catch//(Exception ex)
{
//Exception handling here
}
}
public static string GetString()
{
string MethodResult = "";
try
{
string Working = "";
bool FirstPass = true;
for(int i = 0; i < group.Count; i++)
{
List<int> Numbers = group[i];
if (!FirstPass)
{
Working += "\r\n";
}
else
{
FirstPass = false;
}
Working += "group[" + (group.IndexOf(Numbers) + 1) + "]{";
bool InnerFirstPass = true;
for(int j = 0; j < Numbers.Count; j++)
{
int Number = Numbers[j];
if (!InnerFirstPass)
{
Working += ", ";
}
else
{
InnerFirstPass = false;
}
Working += Number.ToString();
}
Working += "};";
if ((active - 1) == group.IndexOf(Numbers))
{
Working += " //<active>";
}
}
MethodResult = Working;
}
catch//(Exception ex)
{
//Exception handling here
}
return MethodResult;
}
}
}
Both implementations contain the group variable and two methods, AddNumberToGroup and GetString; GetString is used to check the current status of the group variable.
Note: You'll need to replace XXXNAMESPACEXXX with the Namespace of your project. Hint: Take this from another class.
When adding an item to your List, do this:
int NumberToAdd = 10;
int GroupToAddItTo = 2;
AddNumberToGroup(NumberToAdd, GroupToAddItTo);
...or...
AddNumberToGroup(10, 2);
In the example above, I am adding the number 10 to group 2.
Test the speed with the following:
DateTime StartTime = DateTime.Now;
int NumberOfTimesToRepeatTest = 1000;
for (int i = 0; i < NumberOfTimesToRepeatTest; i++)
{
Groups.AddNumberToGroup(4, 1);
Groups.AddNumberToGroup(3, 1);
Groups.AddNumberToGroup(8, 2);
Groups.AddNumberToGroup(5, 2);
Groups.AddNumberToGroup(7, 3);
Groups.AddNumberToGroup(3, 3);
Groups.AddNumberToGroup(8, 4);
Groups.AddNumberToGroup(43, 4);
Groups.AddNumberToGroup(100, 5);
Groups.AddNumberToGroup(1, 5);
Groups.AddNumberToGroup(5, 6);
Groups.AddNumberToGroup(78, 6);
Groups.AddNumberToGroup(34, 7);
Groups.AddNumberToGroup(456, 7);
Groups.AddNumberToGroup(456, 8);
Groups.AddNumberToGroup(7, 8);
Groups.AddNumberToGroup(7, 9);
}
long MillisecondsTaken = (long)(DateTime.Now - StartTime).TotalMilliseconds;
Console.WriteLine(Groups.GetString());
Console.WriteLine("Process took: " + MillisecondsTaken);
I think this is what you need. Let me know if I misunderstood anything in the question.
As far as I can tell it's brilliant, it's fast and it's tested.
Enjoy!
...and one more thing:
For the little windows interface app, I just created a simple winforms app with three textboxes (one set to multiline) and a button.
Then, after adding the Groups class above, in the button-click event I wrote the following:
private void BtnAdd_Click(object sender, EventArgs e)
{
try
{
int Group = int.Parse(TxtGroup.Text);
int Number = int.Parse(TxtNumber.Text);
Groups.AddNumberToGroup(Number, Group);
TxtOutput.Text = Groups.GetString();
}
catch//(Exception ex)
{
//Exception handling here
}
}

fastest way to compare string elements with each other

I have a list with a lot of strings (>5000) where I have to take the first element and compare it to all following elements. Eg. consider this list of string:
{ one, two, three, four, five, six, seven, eight, nine, ten }. Now I take one and compare it with two, three, four, ... afterwards I take two and compare it with three, four, ...
I believe the lookup is why this takes so long. On a normal HDD (7200 rpm) it takes about 30 seconds, on an SSD 10 seconds. I just don't know how to speed this up further. All strings in the list are sorted in ascending order, and it is important to check them in this order. If it would speed things up considerably, I would not mind an unordered list.
I took a look at HashSet, but I need the checking order, so that would not work even with its fast Contains method.
EDIT: As it looks like I am not clear enough and as wanted by Dusan here is the complete code. My problem case: I have a lot of directories, with similar names and am getting a collection with all directory names only and comparing them with each other for similarity and writing that. Hence the comparison between hdd and ssd. But that is weird because I am not writing instantly, instead putting it in a field and writing in the end. Still there is a difference in speed.
Why did I not include the whole code? Because I believe my core issue here is the lookup of value from the list and the comparison between the 2 strings. Everything else should already be sufficiently fast, adding to list, looking in the blacklist (hashset) and getting a list of dir names.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using System.Text.RegularExpressions;
using System.Diagnostics;
using System.Threading;
namespace Similarity
{
/// <summary>
/// Credit http://www.dotnetperls.com/levenshtein
/// Contains approximate string matching
/// </summary>
internal static class LevenshteinDistance
{
/// <summary>
/// Compute the distance between two strings.
/// </summary>
public static int Compute(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
// Step 1
if (n == 0)
{
return m;
}
if (m == 0)
{
return n;
}
// Step 2
for (int i = 0; i <= n; d[i, 0] = i++)
{
}
for (int j = 0; j <= m; d[0, j] = j++)
{
}
// Step 3
for (int i = 1; i <= n; i++)
{
//Step 4
for (int j = 1; j <= m; j++)
{
// Step 5
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
// Step 6
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
// Step 7
return d[n, m];
}
}
internal class Program
{
#region Properties
private static HashSet<string> _blackList = new HashSet<string>();
public static HashSet<string> blackList
{
get
{
return _blackList;
}
}
public static void AddBlackListEntry(string line)
{
blackList.Add(line);
}
private static List<string> _similar = new List<string>();
public static List<string> similar
{
get
{
return _similar;
}
}
public static void AddSimilarEntry(string s)
{
similar.Add(s);
}
#endregion Properties
private static void Main(string[] args)
{
Clean();
var directories = Directory.EnumerateDirectories(Directory.GetCurrentDirectory(), "*", SearchOption.TopDirectoryOnly)
.Select(x => new DirectoryInfo(x).Name).OrderBy(y => new DirectoryInfo(y).Name).ToList();
using (StreamWriter sw = new StreamWriter(@"result.txt"))
{
foreach (var item in directories)
{
Console.WriteLine(item);
sw.WriteLine(item);
}
Console.WriteLine("Amount of directories: " + directories.Count());
}
if (directories.Count != 0)
{
StartSimilarityCheck(directories);
}
else
{
Console.WriteLine("No directories");
}
WriteResult(similar);
Console.WriteLine("Finish. Press any key to exit...");
Console.ReadKey();
}
private static void StartSimilarityCheck(List<string> whiteList)
{
int counter = 0; // how many did we check yet?
var watch = Stopwatch.StartNew();
foreach (var dirName in whiteList)
{
bool insertDirName = true;
if (!IsBlackList(dirName))
{
// start the next element
for (int i = counter + 1; i <= whiteList.Count; i++)
{
// end of index reached
if (i == whiteList.Count)
{
break;
}
int similiariy = LevenshteinDistance.Compute(dirName, whiteList[i]);
// low score means high similarity
if (similiariy < 2)
{
if (insertDirName)
{
//Writer(dirName);
AddSimilarEntry(dirName);
insertDirName = false;
}
//Writer(whiteList[i]);
AddSimilarEntry(whiteList[i]);
AddBlackListEntry(whiteList[i]);
}
}
}
Console.WriteLine(counter);
//Console.WriteLine("Skip: {0}", dirName);
counter++;
}
watch.Stop();
Console.WriteLine("Time: " + watch.ElapsedMilliseconds / 1000);
}
private static void WriteResult(List<string> list)
{
using (StreamWriter sw = new StreamWriter(@"similar.txt", true, Encoding.UTF8, 65536))
{
foreach (var item in list)
{
sw.WriteLine(item);
}
}
}
private static void Clean()
{
// yeah hardcoded file names incoming. Better than global variables??
try
{
if (File.Exists(@"similar.txt"))
{
File.Delete(@"similar.txt");
}
if (File.Exists(@"result.txt"))
{
File.Delete(@"result.txt");
}
}
}
catch (Exception)
{
throw;
}
}
private static void Writer(string s)
{
using (StreamWriter sw = new StreamWriter(@"similar.txt", true, Encoding.UTF8, 65536))
{
sw.WriteLine(s);
}
}
private static bool IsBlackList(string name)
{
return blackList.Contains(name);
}
}
}
To fix the bottleneck, which is the second for loop, insert an if condition that checks whether similiariy is >= the threshold we want and, if so, breaks out of the loop. Now it runs in 1 second. Thanks, everyone.
Your inner loop uses a strange double check. This may prevent an important JIT optimization, removal of redundant range checks.
//foreach (var item myList)
for (int j = 0; j < myList.Count-1; j++)
{
string item1 = myList[j];
for (int i = j + 1; i < myList.Count; i++)
{
string item2 = myList[i];
// if (i == myList.Count)
...
}
}
The number of downvotes is crazy, but oh well... I found the reason for my performance issue / bottleneck thanks to the comments.
The second for loop inside StartSimilarityCheck() iterates over all remaining entries, which is correct in itself but bad for performance. The solution is to check only strings that are in the neighborhood, but how do we know whether they are?
First, we get a list sorted in ascending order. That puts similar strings roughly next to each other. Now we define a threshold for the Levenshtein score (a smaller score means higher similarity between two strings). If the score reaches the threshold, the two strings are not similar enough, so we can break out of the inner loop. That saves time and the program finishes really fast. Notice that this approach is not bulletproof, IMHO: if the first string is 0Directory it will be near the beginning of the list, but a string like zDirectory will be further down and could be missed. Correct me if I am wrong.
private static void StartSimilarityCheck(List<string> whiteList)
{
var watch = Stopwatch.StartNew();
for (int j = 0; j < whiteList.Count - 1; j++)
{
string dirName = whiteList[j];
bool insertDirName = true;
int threshold = 2;
if (!IsBlackList(dirName))
{
// start the next element
for (int i = j + 1; i < whiteList.Count; i++)
{
// end of index reached
if (i == whiteList.Count)
{
break;
}
int similiarity = LevenshteinDistance.Compute(dirName, whiteList[i]);
if (similiarity >= threshold)
{
break;
}
// low score means high similarity
if (similiarity <= threshold)
{
if (insertDirName)
{
AddSimilarEntry(dirName);
AddSimilarEntry(whiteList[i]);
AddBlackListEntry(whiteList[i]);
insertDirName = false;
}
else
{
AddBlackListEntry(whiteList[i]);
}
}
}
}
Console.WriteLine(j);
}
watch.Stop();
Console.WriteLine("Ms: " + watch.ElapsedMilliseconds);
Console.WriteLine("Similar entries: " + similar.Count);
}
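The concern raised above is justified: sort order mostly reflects the leading characters, so 0Directory and zDirectory are a single edit apart yet can sit far from each other in the list, and breaking out of the inner loop early can miss such pairs. A different way to save work without missing anything relies on the fact that the Levenshtein distance between two strings is never smaller than the difference of their lengths. The following sketch, with the hypothetical names SimilarityFilter, Distance and FindSimilarPairs, skips pairs that this bound already rules out and uses a two-row version of the distance computation:

```csharp
using System;
using System.Collections.Generic;

public static class SimilarityFilter
{
    // Levenshtein distance using two rows instead of the full matrix.
    public static int Distance(string s, string t)
    {
        if (s.Length == 0) return t.Length;
        if (t.Length == 0) return s.Length;

        int[] prev = new int[t.Length + 1];
        int[] curr = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) prev[j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two buffers
        }
        return prev[t.Length];
    }

    // Returns all index pairs (i, j) with i < j whose distance is below 'threshold'.
    public static List<Tuple<int, int>> FindSimilarPairs(IList<string> names, int threshold)
    {
        var pairs = new List<Tuple<int, int>>();
        for (int i = 0; i < names.Count - 1; i++)
        {
            for (int j = i + 1; j < names.Count; j++)
            {
                // The distance is at least the length difference, so these
                // pairs cannot be similar and need no distance computation.
                if (Math.Abs(names[i].Length - names[j].Length) >= threshold) continue;

                if (Distance(names[i], names[j]) < threshold)
                {
                    pairs.Add(Tuple.Create(i, j));
                }
            }
        }
        return pairs;
    }
}
```

This still compares every pair, so it is slower than the early break, but it does not depend on the sort order; how much the length check saves depends on how varied the directory name lengths are.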

mergesort - with an insignificant change throws SystemInvalidOperationException

A very strange thing occurred in my program. Here is the simplified code.
class Program
{
static void Main(string[] args)
{
ArrayList numbers = new ArrayList();
numbers.Add(1);
numbers.Add(3);
numbers.Add(4);
numbers.Add(2);
var it = Sorts.MergeSort((ArrayList)numbers.Clone());
Sorts.PrintArray(it, "mergesort");
Console.WriteLine("DONE");
Console.ReadLine();
}
}
public static class Sorts
{
public static ArrayList BubbleSort(ArrayList numbers)
{
for (int i = 0; i < numbers.Count; i++)
{
bool sorted = true; // reset on every pass so the early exit below can trigger
for (int j = 1; j < numbers.Count; j++)
{
if ((int)numbers[j - 1] > (int)numbers[j])
{
int tmp = (int)numbers[j - 1];
numbers[j - 1] = numbers[j];
numbers[j] = tmp;
sorted = false;
}
}
if (sorted)
{
return numbers;
}
}
return numbers;
}
public static ArrayList MergeSort(ArrayList numbers, int switchLimit = 3)
{
//if I use this if - everything works
if (numbers.Count <= 1)
{
// return numbers;
}
//the moment I use this condition - it throws SystemInvalidOperationException in function Merge in the line of a "for"-loop
if (numbers.Count <=switchLimit)
{
return Sorts.BubbleSort(numbers);
}
ArrayList ret = new ArrayList();
int middle = numbers.Count / 2;
ArrayList L = numbers.GetRange(0, middle);
ArrayList R = numbers.GetRange(middle, numbers.Count - middle);
L = MergeSort(L);
R = MergeSort(R);
return Merge(L, R);
}
private static ArrayList Merge(ArrayList L, ArrayList R)
{
ArrayList ret = new ArrayList();
int l = 0;
int r = 0;
for (int i = 0; i < L.Count + R.Count; i++)
{
if (l == L.Count)
{
ret.Add(R[r++]);
}
else if (r == R.Count)
{
ret.Add(L[l++]);
}
else if ((int)L[l] < (int)R[r])
{
ret.Add(L[l++]);
}
else
{
ret.Add(R[r++]);
}
}
return ret;
}
//---------------------------------------------------------------------------------
public static void PrintArray(ArrayList arr, string txt = "", int sleep = 0)
{
Console.WriteLine("{1}({0}): ", arr.Count, txt);
for (int i = 0; i < arr.Count; i++)
{
Console.WriteLine(arr[i].ToString().PadLeft(10));
}
Console.WriteLine();
System.Threading.Thread.Sleep(sleep);
}
}
There is a problem with my Sorts.MergeSort function.
When I use it normally (see the first if-condition in the function), everything works perfectly.
But the moment I make it switch to bubblesort for smaller inputs (the second if-condition in the function), it throws a System.InvalidOperationException. I have no idea where the problem is.
Do you see it?
Thanks. :)
Remark: bubblesort itself works - so there shouldn't be a problem in that sort...
The problem is with your use of GetRange:
This method does not create copies of the elements. The new ArrayList is only a view window into the source ArrayList. However, all subsequent changes to the source ArrayList must be done through this view window ArrayList. If changes are made directly to the source ArrayList, the view window ArrayList is invalidated and any operations on it will return an InvalidOperationException.
You're creating two views onto the original ArrayList and trying to work with both of them - but when one view modifies the underlying list, the other view is effectively invalidated.
If you change the code to create copies of the sublists - or if you work directly with the original list within specified bounds - then I believe it'll work fine.
(As noted in comments, I'd also strongly recommend that you use generic collections.)
Here's a short but complete program which demonstrates the problem you're running into:
using System;
using System.Collections;
class Program
{
static void Main()
{
ArrayList list = new ArrayList();
list.Add("a");
list.Add("b");
ArrayList view1 = list.GetRange(0, 1);
ArrayList view2 = list.GetRange(1, 1);
view1[0] = "c";
Console.WriteLine(view2[0]); // Throws an exception
}
}
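The fix suggested above, working on copies and using generic collections, could be sketched as follows. This is an illustration and not the asker's code: the class name GenericSorts is hypothetical, and an insertion sort stands in for the bubble sort used on small inputs. It relies on List<T>.GetRange returning a shallow copy of the range, unlike ArrayList.GetRange, which returns a view.

```csharp
using System;
using System.Collections.Generic;

public static class GenericSorts
{
    public static List<int> MergeSort(List<int> numbers, int switchLimit = 3)
    {
        // Small inputs are sorted in place; Math.Max guards against a limit below 1.
        if (numbers.Count <= Math.Max(1, switchLimit))
        {
            InsertionSort(numbers);
            return numbers;
        }

        int middle = numbers.Count / 2;
        // Each half is an independent copy, so sorting one half
        // cannot invalidate the other.
        List<int> left = MergeSort(numbers.GetRange(0, middle), switchLimit);
        List<int> right = MergeSort(numbers.GetRange(middle, numbers.Count - middle), switchLimit);
        return Merge(left, right);
    }

    private static void InsertionSort(List<int> list)
    {
        for (int i = 1; i < list.Count; i++)
        {
            int value = list[i];
            int j = i - 1;
            while (j >= 0 && list[j] > value)
            {
                list[j + 1] = list[j];
                j--;
            }
            list[j + 1] = value;
        }
    }

    private static List<int> Merge(List<int> left, List<int> right)
    {
        var result = new List<int>(left.Count + right.Count);
        int l = 0, r = 0;
        while (l < left.Count && r < right.Count)
        {
            // "<=" keeps equal elements in their original order.
            result.Add(left[l] <= right[r] ? left[l++] : right[r++]);
        }
        while (l < left.Count) result.Add(left[l++]);
        while (r < right.Count) result.Add(right[r++]);
        return result;
    }
}
```

If the code has to stay on ArrayList, wrapping each range in a copy, for example new ArrayList(numbers.GetRange(0, middle)), should avoid the shared view in the same way.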
On this line, R = MergeSort(R);, you alter the underlying list through the view R (bubblesort swaps elements in place). This invalidates L. Sorry, I have to go, so I can't explain any further now.
