How to make this code faster? - c#

I have a function which works a very slow for my task (it must be 10-100 times faster)
Here is code
public long Support(List<string[]> sequences, string[] words)
{
var count = 0;
foreach (var sequence in sequences)
{
for (int i = 0; i < sequence.Length - words.Length + 1; i++)
{
bool foundSeq = true;
for (int j = 0; j < words.Length; j++)
{
foundSeq = foundSeq && sequence[i + j] == words[j];
}
if (foundSeq)
{
count++;
break;
}
}
}
return count;
}
public void Support(List<string[]> sequences, List<SequenceInfo> sequenceInfoCollection)
{
System.Threading.Tasks.Parallel.ForEach(sequenceInfoCollection.Where(x => x.Support==null),sequenceInfo =>
{
sequenceInfo.Support = Support(sequences, sequenceInfo.Sequence);
});
}
Where List<string[]> sequences is a array of array of words. This array usually contains 250k+ rows. Each row is about 4-7 words. string[] words is a array of words(all words contains in sequences at least one time) which we trying to count.
The problem is foundSeq = foundSeq && sequence[i + j] == words[j];. This code take most of all execution time(Enumerable.MoveNext at second place). I want to hash all words in my array. Numbers compares faster then strings, right? I think it can help me to get 30%-80% of perfomance. But i need 10x! What can i to do? If you want to know it's a part of apriory algorithm.
Support function check if the words sequence is a part any sequence in the sequences list and count how much times.

Knuth–Morris–Pratt algorithm
In computer science, the Knuth–Morris–Pratt string searching algorithm (or KMP algorithm) searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.
The algorithm was conceived in 1974 by Donald Knuth and Vaughan Pratt, and independently by James H. Morris. The three published it jointly in 1977.
From Wikipedia: https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
This is one of the improvements that you should make. With a small difference: a "word" in your code is a "characters" in the terminology of the algorithm; your "words" array is what is a word in KMP.
The idea is that when you search for "abc def ghi jkl", and have matched "abc def ghi" already, but the next word does not match, you can jump three positions.
Search: abc def ghi jkl
Text: abc def ghi klm abc def ghi jkl
i=0: abc def ghi jkl?
skip 2: XXX XXX <--- you save two iterations here, i += 2
i=2: abc?
i=3: abc? ...

The first optimisation I would make is an early-fail. Your inner loop continues through the whole sequence even if you know it's failed, and you're doing an unnecessary bit of boolean logic. Your code is this:
for (int j = 0; j < words.Length; j++)
{
foundSeq = foundSeq && sequence[i + j] == words[j];
}
Instead, just do this (or equivalent):
for (int j = 0; j < words.Length; j++)
{
if (sequence[i + j] != words[j])
{
foundSeq = false;
break;
}
}
This will save you the majority of your comparisons (you will drop out at the first word if it doesn't match, instead of continuing to compare when you know the result is false). It could even make the tenfold difference you're looking for if you expect the occurrence of the individual words in each sequence to be low (if, say, you are finding sentences in a page of English text).

Theoretically, you could concatenate each sequence and using substring matching. I don't have a compiler at hand now, so I can't test whether it will actually improve performance, but this is the general idea:
List<string> sentences = sequences.Select(seq => String.Join(" ", seq));
string toMatch = String.Join(" ", words);
return sentences.Count(sentence => sentence.Contains(toMatch));

Related

PadRight in string of arrays doesn't add chars

I created array of strings which includes strings with Length from 4 to 6. I am trying to PadRight 0's to get length for every element in array to 6.
string[] array1 =
{
"aabc", "aabaaa", "Abac", "abba", "acaaaa"
};
for (var i = 0; i <= array1.Length-1; i++)
{
if (array1[i].Length < 6)
{
for (var j = array1[i].Length; j <= 6; j++)
{
array1[i] = array1[i].PadRight(6 - array1[i].Length, '0');
}
}
Console.WriteLine(array1[i]);
}
Right now the program writes down the exact same strings I have in array without adding 0's at the end. I made a little research and found some informations about that strings are immutable, but still there are some example with changing strings inside, but I couldn't find any with PadRight or PadLeft and I fell like there must be a way to do it, but I just can't figure it out.
Any ideas on how to fix that issue?
The first argument to PadRight is the total length you want. You've specified 6 - array1[i].Length - and as all your strings start off with at least 3 characters, you're padding to at most 3 characters, so it's not doing anything.
You don't need your inner loop, and your outer loop condition is more conventionally written as <. This is one way I'd write that code:
using System;
public class Test
{
static void Main()
{
string[] array =
{
"aabc", "aabaaa", "Abac", "abba", "acaaaa"
};
for (var i = 0; i < array.Length; i++)
{
array[i] = array[i].PadRight(6, '0');
Console.WriteLine(array[i]);
}
}
}
In fact I'd probably use foreach, or even Select, but that's a different matter. I've left this using an array to be a bit closer to your original code.

Use Substring method to count specific character from a string

So I am new to programming and one of my exercises involves using a substring within a loop to count the number of iterations of a specific character with a user's input.
As far as I can tell for the exercise, and what I know in C sharp so far, using a substring in this will only help read the position of a character within the input, and will not count it. I can not make heads or tails of this, and am at a loss.
I want to know how to understand this, and what ways I am missing the point of the exercise.
I need an idea of how to set the substring to read the number of a certain character type from the end-user's input from console.
This is the original question:
There is a method called Substring that we can use with a string to look at a portion of a string.
For example, the following code will print the letter a.
string input = "abcdef";
Console.WriteLine(input.Substring(0, 1));
Assignment:
Given the following input, create a loop that uses the Substring method to count the number of times the letter ‘z’ occurs in a string input by the user.
asdfojiaqweb;ounqwrb;ounwqen;zzzn bnaozonza
Edit: So Far I have the code to count the number of times that Z is used, but I don't know how to incorporate a substring into it
int total = 0;
var letter = new HashSet<char> { 'z' };
Console.WriteLine("Please enter your letters:");
// asdfojiaqweb;ounqwrb;ounwqen;zzzn bnaozonza
string sentence = Console.ReadLine().ToLower();
for (int i = 0; i < sentence.Length; i++)
{
if (letter.Contains(sentence[i]))
{
total++;
}
}
Console.WriteLine("Total number of Z uses is: {0}", total);
// Console.WriteLine(sentence.Substring(0, 1));
If you must use Substring, then replace your loop by this
for (int i = 0; i < sentence.Length; i++)
{
if (sentence.Substring(i, 1) == "z")
{
total++;
}
}
And if you need to both count uppercase and lowercase z, then use following code
for (int i = 0; i < sentence.Length; i++)
{
if (string.Equals(sentence.Substring(i, 1), "z", StringComparison.OrdinalIgnoreCase))
{
total++;
}
}

whats wrong in this logic of finding longest common child of string

i came up with this logic to find longest common child of two strings of equal length but it runs successfuly only on simple outputs and fails others,pls guide me what i am doing wrong here.
String a, b;
int sum = 0;
int[] ar,br;
ar = new int[26];
br = new int[26];
a = Console.ReadLine();
b = Console.ReadLine();
for (int i = 0; i < a.Length; i++)
{
ar[(a[i] - 65)]++;
br[(b[i] - 65)]++;
}
for(int i =0;i<ar.Length;i++)
{
if (ar[i] <= br[i]) { sum += ar[i]; }
else sum += br[i];
}
Console.Write(sum);
Console.ReadLine();
output:
AA
BB
0 correct.
HARRRY
SALLY
2 correct
for both above input it runs but when i submit for evaluation it fails on their test cases.i cant access their testacase on which my logic fails.i wanna know where does my logic fails.
Your second loop is all wrong. It is simply finding the count of characters that occur in both the array and the count is only updated with the the no. of the common characters contained in the string containing the least no. of these common characters.
refer this link for the correct implementation.
http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Longest_common_substring#Retrieve_the_Longest_Substring
Also convert your input to uppercase characters using String.ToUpper before you use the input string.

How to get the a string that is most repeated in a list

I have a lot of lists like the following:
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[1]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[2]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[2]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[3]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[3]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[4]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[4]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[5]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[5]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[6]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[2]/div[1]/div[6]/div[1]/div[2]/ul[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[7]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[2]/div[1]/div[6]/div[1]/div[2]/ul[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[8]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[8]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[9]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[9]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[10]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[10]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[11]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[2]/div[1]/div[6]/div[1]/div[2]/ul[2]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[12]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[12]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[13]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[13]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[14]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[14]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[15]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[15]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[16]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[16]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[17]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[2]/div[1]/div[6]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[18]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[18]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[19]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[19]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[20]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[20]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[21]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[2]/div[1]/div[6]/div[1]/div[2]/ul[2]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[22]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[22]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[23]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[23]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[24]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[24]/div[2]/div[4]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[25]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[25]/div[2]/div[4]
And I need to extract the portion that is most repeated in each line, which in this case is
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li
What's the best way to do this?
I'm using C#/.net
thanks!
If I understand your question correctly, what you want is the longest common prefix of all lines. You could obtain it by doing something like that:
void Main()
{
string path = #"D:\tmp\so5670107.txt";
string[] lines = File.ReadAllLines(path);
string prefix = LongestCommonPrefix(lines);
Console.WriteLine(prefix);
}
static string LongestCommonPrefix(string a, string b)
{
int length = 0;
for (int i = 0; i < a.Length && i < b.Length; i++)
{
if (a[i] == b[i])
length++;
else
break;
}
return a.Substring(0, length);
}
static string LongestCommonPrefix(IEnumerable<string> strings)
{
return strings.Aggregate(LongestCommonPrefix);
}
The result is:
/html[1]/body[1]/div[5]/div[1]/div[2]/div[
(the expected result you give in the question seems incorrect, since there are lines that don't match it)
I chose a naive approach for the sake of simplicity, but of course there are more efficient ways of finding the longest common prefix between two strings (using a dichotomic search for instance)
You could do this with a loop. Assumption is that your list of strings is in a collection called paths:
var countByPath = new Dictionary<string, int>();
foreach (var path in paths)
{
if (!countByPath.ContainsKey(path))
{
countByPath[path] = 1;
}
else
{
countByPath[path]++;
}
}
The longest substring that is repeated in the list? Assumption is that your list of strings is in a collection called paths:
var currentChoice = "";
foreach (var path in paths)
{
for (int i = path.Length; i > 0; i--)
{
var candidate = path.Substring(0, i);
if (i > currentChoice.Length &&
paths.Count(p => p.StartsWith(candidate)) > 1)
currentChoice = candidate;
else
break;
}
}
Console.WriteLine(currentChoice);
The result is then
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[10]
since it is repeated twice
There is already an algorithm for this. I can't remember what it's called, but if you are interested in language independent implementation. It works in the following way:
Read first line
Read second line. If second line is the same as first line, than increase counter by one, otherwise keep counter at zero.
Carry on reading lines, if three lines are the same (i.e. repeat), than your counter will be 2. If next line is different to the previous three, than decrease counter by 1.
E.g.
String1 - Counter: 0
String1 - Counter: 1 (Store String1 in a variable)
String1 - Counter: 2 (Store String1 in same variable)
String2 - Counter: 1 (Still store String1 in variable)
I hope this makese sense. I did this at uni few years ago. Can't remember mathematician who came up with algorithm, but it's fairly old.

Get all possible word combinations

I have a list of n words (let's say 26). Now I want to get a list of all possible combinations, but with a maximum of k words per row (let's say 5)
So when my word list is: aaa, bbb, ..., zzz
I want to get:
aaa
bbb
...
aaabbb
aaaccc
...
aaabbbcccdddeeefff
aaabbbcccdddeeeggg
...
I want to make it variable, so that it will work with any n or k value.
There should be no word be twice and every combinations needs to be taken (even if there are very much).
How could I achieve that?
EDIT:
Thank you for your answers. It is not an assignment. Is is just that I forgot the combinations of my password and I want to be sure that I have all combinations tested. Although I have not 26 password parts, but this made it easier to explain what I want.
If there are other people with the same problem, this link could be helpfull:
Generate word combination array in c#
i wrote simple a function to do this
private string allState(int index,string[] inStr)
{
string a = inStr[index].ToString();
int l = index+1;
int k = l;
var result = string.Empty;
var t = inStr.Length;
int i = index;
while (i < t)
{
string s = a;
for (int j = l; j < k; j++)
{
s += inStr[j].ToString();
}
result += s+",";
k++;
i++;
}
index++;
if(index<inStr.Length)
result += allState(index, inStr);
return result.TrimEnd(new char[] { ',' });
}
allState(0, new string[] { "a", "b", "c"})
You could take a look at this
However, if you need to get large numbers of combinations (in the tens of millions) you should use lazy evaluation for the generation of the combinations.

Categories