Subgroup of list - C#

I have two lists of string.
Is there a simple way to find out if one list contains all the strings of the second list?
(By simple, I mean that I don't have to explicitly compare each string in one list against all the strings in the other.)

Use Enumerable.Except to find the differences between the lists. If there are no items in the result, then all items from list2 are in list1:
bool containsAll = !list2.Except(list1).Any();
Internally, Except uses a Set<T> to collect the unique items from list1 and returns only those items from list2 that are not in the set. If there is nothing to return, then every item of list2 was found in the set.
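A quick illustration of those semantics, using made-up lists:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class ExceptDemo
{
    static void Main()
    {
        var list1 = new List<string> { "a", "b", "c", "d" };
        var list2 = new List<string> { "b", "d" };

        // Except yields the items of list2 that are missing from list1
        Console.WriteLine(!list2.Except(list1).Any()); // True: list1 contains all of list2

        list2.Add("x");
        Console.WriteLine(!list2.Except(list1).Any()); // False: "x" is not in list1
    }
}
```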

Try this:
firstList.All(x => secondList.Contains(x));
Shorter version (method group):
firstList.All(secondList.Contains)
You need a using directive for LINQ:
using System.Linq;
It checks whether all items from the first list are in the second list. Contains checks whether a given item is in a list; All returns true if every item in the collection matches the predicate. The predicate here is "the item is in the second list", so the whole expression checks that all items are in the second list.
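For example, with two small hypothetical lists:

```csharp
using System;
using System.Collections.Generic;
using System.Linq; // All() lives in System.Linq

class AllContainsDemo
{
    static void Main()
    {
        var firstList = new List<string> { "red", "green" };
        var secondList = new List<string> { "red", "green", "blue" };

        // True: every element of firstList appears in secondList
        Console.WriteLine(firstList.All(secondList.Contains));

        // False: secondList contains "blue", which firstList lacks
        Console.WriteLine(secondList.All(firstList.Contains));
    }
}
```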

For larger lists, use a HashSet<T> (which makes the check effectively linear, as opposed to O(n²) when just using two lists):
var hash = new HashSet<string>(list2);
bool containsAll = list1.All(hash.Contains);

Use LINQ:
bool isSublistOf = list1.All(list2.Contains);
The All method returns true if the condition in the lambda is met for every element in the IEnumerable. Here, All is passed the Contains method of List2 as the Func<string, bool>, which returns true if the element is found in List2. The net effect is that the statement returns true if ALL of the elements in List1 are found in List2.
Performance Note
Due to the nature of the All operator, this is worst-case O(n²), but it exits at the first mismatch. Using completely random 8-byte strings, I tried each of the approaches with a rudimentary performance harness.
static void Main(string[] args)
{
    long count = 5000000;
    // Get 5,000,000 random strings (the superset)
    var strings = CreateRandomStrings(count);
    // Get 1000 random strings (the subset)
    var substrings = CreateRandomStrings(1000);

    // Perform the hashing technique
    var start = DateTime.Now;
    var hash = new HashSet<string>(strings);
    var mid = DateTime.Now;
    var any = substrings.All(hash.Contains);
    var end = DateTime.Now;
    Console.WriteLine("Hashing took " + end.Subtract(start).TotalMilliseconds + " " + mid.Subtract(start).Milliseconds + " of which was setting up the hash");

    // Do the scanning-all technique
    start = DateTime.Now;
    any = substrings.All(strings.Contains);
    end = DateTime.Now;
    Console.WriteLine("Scanning took " + end.Subtract(start).TotalMilliseconds);

    // Do the excepting technique
    start = DateTime.Now;
    any = substrings.Except(strings).Any();
    end = DateTime.Now;
    Console.WriteLine("Excepting took " + end.Subtract(start).TotalMilliseconds);
    Console.ReadKey();
}

private static string[] CreateRandomStrings(long count)
{
    var rng = new Random(DateTime.Now.Millisecond);
    string[] strings = new string[count];
    byte[] bytes = new byte[8];
    for (long i = 0; i < count; i++)
    {
        rng.NextBytes(bytes);
        strings[i] = Convert.ToBase64String(bytes);
    }
    return strings;
}
The result ranked them in the following order fairly consistently:
Scanning - ~38ms (list1.All(list2.Contains))
Hashing - ~750ms (749 of which was spent setting up the hashset)
Excepting - 1200ms
The excepting method takes much longer because it does all of its work up front. Unlike the other methods, it will not exit on a mismatch but continues to process all elements. Hashing is much faster, but it also does significant work up front in setting up the hash; it would be the fastest method if the strings were less random and matches were more likely.
Disclaimer
All performance tuning at this level is next to irrelevant; this is just a mental exercise.


How to split a list of DateTime based on the time differences found within the entries?

I have a list that contains deserialized objects, and within those objects is a DateTime field. The times go up sequentially in seconds, but there will be instances where there is a time difference of more than 4 seconds from one entry to the next, and then it continues from there.
/* List example:
26/11/2019 10:26:01
26/11/2019 10:26:02
26/11/2019 10:26:03
26/11/2019 10:26:04
26/11/2019 10:26:30 // difference detected, subdivide into new list
26/11/2019 10:26:31
26/11/2019 10:26:40 // difference detected, subdivide into new list
26/11/2019 10:26:41
26/11/2019 10:26:42
26/11/2019 10:26:43
*/
So what I want to do is iterate through this big list and then sub-divide those entries into smaller lists, but I'm unsure how to do this. My initial thought was to create a for loop and then individually compare the DateTimes to see if there's a 4-second difference, but I can already tell that would take forever.
Would like some suggestions.
Thanks
If the list is big, a for loop is the best approach in terms of performance.
If you need even more performance consider working with arrays and using Array.Copy() which is very fast in copying subarrays or the whole array.
private List<DateTime[]> SplitMe(List<DateTime> bigList)
{
    var splitArrays = new List<DateTime[]>();
    if (bigList.Count == 0) return splitArrays;
    var bigArray = bigList.ToArray();
    var previousItem = bigArray[0];
    int lastEndingIndex = -1;
    for (int index = 1; index < bigArray.Length; index++)
    {
        var currentItem = bigArray[index];
        if (currentItem - previousItem > TimeSpan.FromSeconds(4))
        {
            // gap detected: the current segment ends at the previous element
            var newArray = new DateTime[index - 1 - lastEndingIndex];
            Array.Copy(bigArray, lastEndingIndex + 1, newArray, 0, newArray.Length);
            splitArrays.Add(newArray);
            lastEndingIndex = index - 1;
        }
        previousItem = currentItem;
    }
    // don't forget the final segment
    var tail = new DateTime[bigArray.Length - 1 - lastEndingIndex];
    Array.Copy(bigArray, lastEndingIndex + 1, tail, 0, tail.Length);
    splitArrays.Add(tail);
    return splitArrays;
}
If you do not like the array approach, you can use the much slower LINQ.
bigList.Skip(lastEndingIndex + 1).Take(index - lastEndingIndex).ToList() will do the job.
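If you prefer to keep plain List<DateTime> instances throughout, the same split can be sketched as a single pass with no Array.Copy at all (GapSplitter and Split are made-up names for illustration):

```csharp
using System;
using System.Collections.Generic;

static class GapSplitter
{
    // Splits a time-ordered list into runs wherever two neighbours
    // are more than maxGap apart.
    public static List<List<DateTime>> Split(List<DateTime> source, TimeSpan maxGap)
    {
        var result = new List<List<DateTime>>();
        var current = new List<DateTime>();
        foreach (var item in source)
        {
            // Start a new sublist when the gap to the previous entry is too large
            if (current.Count > 0 && item - current[current.Count - 1] > maxGap)
            {
                result.Add(current);
                current = new List<DateTime>();
            }
            current.Add(item);
        }
        if (current.Count > 0) result.Add(current);
        return result;
    }
}
```

For the example list in the question this produces three sublists, split at 10:26:30 and 10:26:40.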

How to speed up string operations and avoid slow loops

I am writing code which generates a lot of combinations (combinations might not be the right word here; really, sequences of characters in the order they actually appear in the string). The loop adds combinations to a List<string>, but unfortunately it takes a lot of time with any file over 200 bytes. I want to be able to work with hundreds of MBs here.
Let me explain what I actually want in the simplest of ways.
Let's say I have a string that is "Afnan is awesome" (-> main string). What I want is a list of string which encompasses different substring sequences of the main string. For example -> A,f,n,a,n, ,i,s, ,a,w,e,s,o,m,e. Now this is just the first iteration of the loop. With each iteration, my substring length increases, yielding these results for the second iteration -> Af,fn,na,n , i,is,s , a,aw,we,es,so,om,me. The third iteration would look like this: Afn,fna,nan,an ,n i, is,is ,s a, aw, awe, wes, eso, som, ome. This keeps going until my substring length reaches half the length of the main string.
My code is as follows:
string data = File.ReadAllText("MyFilePath");
// Creating my dictionary
List<string> dictionary = new List<string>();
int stringLengthIncrementer = 1;
for (int v = 0; v < (data.Length / 2); v++)
{
    for (int x = 0; x < data.Length; x++)
    {
        if ((x + stringLengthIncrementer) > data.Length) break; // So the index does not go out of bounds
        if (!dictionary.Contains(data.Substring(x, stringLengthIncrementer))) // So no repetition takes place
        {
            dictionary.Add(data.Substring(x, stringLengthIncrementer)); // Add the substring to my List<string> -> dictionary
        }
    }
    stringLengthIncrementer++; // Increase the substring length with each iteration
}
I use data.Length / 2 because I only need combinations at most half the length of the entire string. Note that I search the entire string for combinations, not half of it.
To further simplify what I am trying to do -> Suppose I have an input string =
"abcd"
the output would be =
a, b, c, d, ab, bc, cd. The rest is cut off, as it is longer than half the length of my primary string -> // abc, bcd, abcd
I was hoping some regex method might help me achieve this. Anything that doesn't consist of loops? Anything that is exponentially faster than this? Some simple code with less complexity that is more efficient?
Update
When I used a HashSet<string> instead of List<string> for my dictionary, I did not see any change in performance, and I also got an OutOfMemoryException.
You can use LINQ to simplify the code and very easily parallelize it, but it's not going to be orders of magnitude faster, which is what you would need to run it on files of hundreds of MBs (that's very likely impossible).
var data = File.ReadAllText("MyFilePath");
var result = Enumerable.Range(1, data.Length / 2)
    .AsParallel()
    .Select(len => new HashSet<string>(
        Enumerable.Range(0, data.Length - len + 1) // Adding the +1 here made it work perfectly
            .Select(x => data.Substring(x, len))))
    .SelectMany(t => t)
    .ToList();
General improvements you can make in your code to improve performance (I'm not considering whether there are other, more optimal solutions):
calculate data.Substring(x, stringLengthIncrementer) only once per iteration
since you search the collection for each substring, use a SortedList; lookups will be faster than a linear scan
initialize the List (or SortedList, or whatever) with a calculated capacity, like new List<string>(calculatedCapacity)
or you can try to write an algorithm that produces the combinations without checking for duplicates.
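A sketch combining the first suggestion with duplicate detection via HashSet<string> - each substring is taken exactly once and the set handles uniqueness in roughly O(1) per lookup (illustrative only; it does not address the memory problem for very large files):

```csharp
using System;
using System.Collections.Generic;

class SubstringSketch
{
    public static HashSet<string> UniqueSubstrings(string data)
    {
        var seen = new HashSet<string>();
        for (int len = 1; len <= data.Length / 2; len++)
        {
            for (int x = 0; x + len <= data.Length; x++)
            {
                // Substring is computed exactly once per (x, len) pair;
                // HashSet.Add is a no-op when the value is already present.
                seen.Add(data.Substring(x, len));
            }
        }
        return seen;
    }

    static void Main()
    {
        var result = UniqueSubstrings("abcd");
        Console.WriteLine(result.Count); // 7: a, b, c, d, ab, bc, cd
    }
}
```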
You may be able to use HashSet combined with MoreLINQ's Batch feature (available on NuGet) to simplify the code a little.
public static void Main()
{
    string data = File.ReadAllText("MyFilePath");
    //string data = "Afnan is awesome";
    var dictionary = new HashSet<string>();
    for (var stringLengthIncrementer = 1; stringLengthIncrementer <= (data.Length / 2); stringLengthIncrementer++)
    {
        foreach (var skipper in Enumerable.Range(0, stringLengthIncrementer))
        {
            var batched = data.Skip(skipper).Batch(stringLengthIncrementer);
            foreach (var batch in batched)
            {
                dictionary.Add(new string(batch.ToArray()));
            }
        }
    }
    // HashSet<T> has no ForEach method, so print the entries with a plain loop
    foreach (var z in dictionary)
    {
        Console.WriteLine(z);
    }
    Console.ReadLine();
}
For this input:
"Afnan is awesome askdjkhaksjhd askjdhaksjsdhkajd asjsdhkajshdkjahsd asksdhkajshdkjashd aksjdhkajsshd98987ad asdhkajsshd98xcx98asdjaksjsd askjdakjshcc98z98asdsad"
performance is roughly 10x faster than your current code.

Rearranging a string of numbers from highest to lowest C#

What I'm trying to do is create a function that will rearrange a string of numbers like "1234" into "4321". I'm certain there are many much more efficient ways to do this than my method, but I just want to see what went wrong with what I did, because I'm a beginner at programming and can use the knowledge to get better.
My thought process for the code was to:
find the largest number in the inputted string
add the largest number into a list
remove the largest number from the inputted string
find the largest number again from the (now shorter) string
So I made a function that found the largest number in a string and it worked fine:
static int LargestNumber(string num)
{
    int largestnumber = 0;
    char[] numbers = num.ToCharArray();
    foreach (var number in numbers)
    {
        int prevNumber = (int)char.GetNumericValue(number);
        if (prevNumber >= largestnumber)
        {
            largestnumber = prevNumber;
        }
    }
    return largestnumber;
}
Now the rearranging function is what I am having problems with:
static List<int> Rearrange(string num)
{
    List<int> rearranged = new List<int>(); // to store rearranged numbers
    foreach (var number in num) // for every number in the number string
    {
        string prevnumber = number.ToString(); // the current number in the loop
        if (prevnumber == LargestNumber(num).ToString()) // if it is the largest number in the inputted string (num)
        {
            rearranged.Add(Convert.ToInt32(prevnumber)); // put the number into the list
            // remove the largest number from the inputted string and update it (which should now be smaller)
            StringBuilder sb = new StringBuilder(num);
            sb.Remove(num.IndexOf(number), 1);
            num = sb.ToString();
        }
    }
    return rearranged; // return the final rearranged list of numbers
}
When I run this code (fixed for concatenation):
var rearranged = Rearrange("3250");
string concat = String.Join(" ", rearranged.ToArray());
Console.WriteLine(concat);
All I get is:
5
I'm not sure what I'm missing or what I'm doing wrong - the code doesn't seem to go back after removing '5', which is the highest number, and remove the next-highest number.
Your issue is your if statement within your loop.
if (prevnumber == LargestNumber(num).ToString())
{
    rearranged.Add(Convert.ToInt32(prevnumber));
    //...
}
You only ever add to your List rearranged if the value of prevnumber is the largest value, which is false for every number but 5, so the only value that ever gets added to the list is 5.
That's the answer to why it's only returning 5, but I don't think that will make your method work properly necessarily. You're doing a very dangerous thing by changing the value of the collection you are iterating over (the characters in num) from within the loop itself. Other answers have been written for you containing a method that rearranges the numbers as you've described.
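For reference, here is one sketch that keeps the original find-the-largest-then-remove idea but loops over the shrinking string itself instead of iterating the collection being mutated (RearrangeSketch is a made-up name; this is just one way to do it):

```csharp
using System;
using System.Collections.Generic;

class RearrangeSketch
{
    public static List<int> Rearrange(string num)
    {
        var rearranged = new List<int>();
        while (num.Length > 0)
        {
            // Find the largest remaining digit and its position
            int largest = 0, largestIndex = 0;
            for (int i = 0; i < num.Length; i++)
            {
                int digit = (int)char.GetNumericValue(num[i]);
                if (digit >= largest) { largest = digit; largestIndex = i; }
            }
            rearranged.Add(largest);
            num = num.Remove(largestIndex, 1); // shrink the string, then scan again
        }
        return rearranged;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(" ", Rearrange("3250"))); // 5 3 2 0
    }
}
```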
Your Rearrange method returns a List<int>; when you try to write that to the console, the best it can do is write its type name, System.Collections.Generic.List`1[System.Int32].
Instead of trying to write the list, convert it first into a data type that can be written (string for example)
eg:
var myList = Rearrange("3250");
string concat = String.Join(" ", myList.ToArray());
Console.WriteLine(concat);
Building on Pat's comment, you could iterate through your list and write the elements to the console.
e.g.
foreach (var i in Rearrange("3250"))
{
    Console.WriteLine(i);
}
or, as a one-liner (note that ForEach is a method of List<T>, not LINQ):
Rearrange("3250").ForEach(i => Console.WriteLine(i));
--edit after seeing you're only getting '5' output
This is because your function only adds numbers to the list if they are the largest number in the string, which is why only 5 is being added and returned.
Your Rearrange method can be written easily using Array.Sort (or similarly with List<T>):
int[] Rearrange(int num)
{
    var arr = num.ToString().ToCharArray();
    Array.Sort(arr, (d1, d2) => d2 - d1);
    return Array.ConvertAll(arr, ch => ch - '0');
}
Just going by your first sentence - note that this does not validate that the input contains only digits:
static int ReversedNumber(string num)
{
    char[] numbers = num.ToCharArray();
    Array.Sort(numbers);
    Array.Reverse(numbers);
    Debug.WriteLine(String.Concat(numbers));
    return (int.Parse(String.Concat(numbers)));
}
It's because the foreach loop within your Rearrange method only loops through the original num. The algorithm doesn't continue through the new num string after you have removed the largest number.
You can find the problem by debugging: the foreach loop in Rearrange runs only 4 times when your input string is "3250".

Median Maintenance Algorithm - Same implementation yields different results depending on Int32 or Int64

I found something interesting while doing a HW question.
The homework question asks to code the Median Maintenance algorithm.
The formal statement is as follows:
The goal of this problem is to implement the "Median Maintenance" algorithm (covered in the Week 5 lecture on heap applications). The text file contains a list of the integers from 1 to 10000 in unsorted order; you should treat this as a stream of numbers, arriving one by one. Letting xi denote the ith number of the file, the kth median mk is defined as the median of the numbers x1,…,xk. (So, if k is odd, then mk is the ((k+1)/2)th smallest number among x1,…,xk; if k is even, then mk is the (k/2)th smallest number among x1,…,xk.)
In order to get a fast (O(n log n)) running time, this should obviously be implemented using heaps. Anyway, I coded it using brute force (the deadline was too soon and I needed an answer right away), which is O(n²), with the following steps:
Read data in
Sort array
Find Median
Add it to a running sum
I ran the algorithm through several test cases (with known answers) and got the correct results; however, when I ran the same algorithm on a larger data set, I got the wrong answer. I was doing all the operations using Int64 to represent the data.
Then I tried switching to Int32 and magically got the correct answer, which makes no sense to me.
The code is below, and it is also found here (the data is in the repo). The algorithm starts to give erroneous results after the 3810 index:
private static void Main(string[] args)
{
    MedianMaintenance("Question2.txt");
}

private static void MedianMaintenance(string filename)
{
    var txtData = File.ReadLines(filename).ToArray();
    var inputData32 = new List<Int32>();
    var medians32 = new List<Int32>();
    var sums32 = new List<Int32>();
    var inputData64 = new List<Int64>();
    var medians64 = new List<Int64>();
    var sums64 = new List<Int64>();
    var sum = 0;
    var sum64 = 0f;
    var i = 0;
    foreach (var s in txtData)
    {
        // Add to sorted list
        var intToAdd = Convert.ToInt32(s);
        inputData32.Add(intToAdd);
        inputData64.Add(Convert.ToInt64(s));
        // Find the median index
        var count = inputData32.Count;
        inputData32.Sort();
        inputData64.Sort();
        var index = 0;
        if (count % 2 == 0)
        {
            // Even number of elements
            index = count / 2 - 1;
        }
        else
        {
            // Number is odd
            index = ((count + 1) / 2) - 1;
        }
        var val32 = Convert.ToInt32(inputData32[index]);
        var val64 = Convert.ToInt64(inputData64[index]);
        if (i > 3810)
        {
            var t = sum;
            var t1 = sum + val32;
        }
        medians32.Add(val32);
        medians64.Add(val64);
        //Debug.WriteLine("Median is {0}", val);
        sum += val32;
        sums32.Add(Convert.ToInt32(sum));
        sum64 += val64;
        sums64.Add(Convert.ToInt64(sum64));
        i++;
    }
    Console.WriteLine("Median Maintenance result is {0}", (sum).ToString("N"));
    Console.WriteLine("Median Maintenance result is {0}", (medians32.Sum()).ToString("N"));
    Console.WriteLine("Median Maintenance result is {0} - Int64", (sum64).ToString("N"));
    Console.WriteLine("Median Maintenance result is {0} - Int64", (medians64.Sum()).ToString("N"));
}
What's more interesting is that the running sum (in the sum64 variable) yields a different result than summing all items in the list with LINQ's Sum() function.
The results (the third one is the one that's wrong) and the computer details were attached as screenshots.
I'll appreciate if someone can give me some insights on why is this happening.
Thanks,
0f initializes a 32-bit float variable; you meant 0d (or 0.0) to get a 64-bit double.
As for LINQ, you'll probably get better results if you use strongly typed lists:
new List<int>()
new List<long>()
The first thing I notice is what the commenter did: var sum64 = 0f initializes sum64 as a float. As the median value of a collection of Int64s will itself be an Int64 (the specified rules don't use the mean of the two midpoint values in a collection of even cardinality), you should instead declare this variable explicitly as a long. In fact, I would go ahead and replace all usages of var in this code example; the convenience of var is lost here because it causes type-related bugs.
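The precision loss is easy to reproduce in isolation: a float has a 24-bit significand, so once a value passes 2^24 (16,777,216), adding small integers starts to get rounded away, while a long stays exact:

```csharp
using System;

class FloatPrecisionDemo
{
    static void Main()
    {
        float f = 16777216f;   // 2^24: the limit of exact integers in a float
        f += 1f;               // 16,777,217 is not representable; rounds back down
        Console.WriteLine(f == 16777216f); // True: the increment was lost

        long l = 16777216L;
        l += 1L;
        Console.WriteLine(l);  // 16777217: long keeps every integer exactly
    }
}
```

This is why the running float sum diverges from the Int64 results once the accumulated medians get large enough.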

Faster way to do a List<T>.Contains()

I am trying to do what I think is a "de-intersect" (I'm not sure what the proper name is, but that's what Tim Sweeney of Epic Games called it in the old UnrealEd).
// foo and bar have some identical elements (given a case-insensitive match)
List<string> foo = GetFoo();
List<string> bar = GetBar();
// remove non matches
foo = foo.Where(x => bar.Contains(x, StringComparer.InvariantCultureIgnoreCase)).ToList();
bar = bar.Where(x => foo.Contains(x, StringComparer.InvariantCultureIgnoreCase)).ToList();
Then later on, I do another thing where I subtract the result from the original, to see which elements I removed. That's super-fast using .Except(), so no troubles there.
There must be a faster way to do this, because this performs pretty badly with ~30,000 elements (of string) in either List. Preferably, a method to do this step and the later one in one fell swoop would be nice. I tried using .Exists() instead of .Contains(), but it's slightly slower. I feel a bit thick, but I think it should be possible with some combination of .Except(), .Intersect(), and/or .Union().
This operation can be called a symmetric difference.
You need a different data structure, like a hash table. Add the intersection of both sets to it, then difference the intersection from each set.
UPDATE:
I got a bit of time to try this in code. I used HashSet<T> with a set of 50,000 strings, each 2 to 10 characters long, with the following results:
Original: 79499 ms
Hashset: 33 ms
BTW, there is a method on HashSet<T> called SymmetricExceptWith which I thought would do the work for me, but it actually modifies the set it is called on to hold the elements that are in exactly one of the two sets. Maybe that is what you want - rather than leaving the initial two sets unmodified - and the code would be more elegant.
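A small illustration of SymmetricExceptWith with made-up data (note that it mutates the set it is called on):

```csharp
using System;
using System.Collections.Generic;

class SymmetricDemo
{
    static void Main()
    {
        var foo = new HashSet<string>(StringComparer.InvariantCultureIgnoreCase)
            { "alpha", "beta", "gamma" };
        var bar = new[] { "BETA", "delta" };

        // Keeps only elements present in exactly one of the two collections:
        // "beta" matches "BETA" case-insensitively and is removed; "delta" is added.
        foo.SymmetricExceptWith(bar);

        Console.WriteLine(string.Join(", ", foo)); // alpha, gamma, delta (order may vary)
    }
}
```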
Here is the code:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        // foo and bar have some identical elements (given a case-insensitive match)
        var foo = getRandomStrings();
        var bar = getRandomStrings();
        var timer = new Stopwatch();
        timer.Start();
        // remove non matches
        var f = foo.Where(x => !bar.Contains(x)).ToList();
        var b = bar.Where(x => !foo.Contains(x)).ToList();
        timer.Stop();
        Debug.WriteLine(String.Format("Original: {0} ms", timer.ElapsedMilliseconds));
        timer.Reset();
        timer.Start();
        var intersect = new HashSet<String>(foo);
        intersect.IntersectWith(bar);
        var fSet = new HashSet<String>(foo);
        var bSet = new HashSet<String>(bar);
        fSet.ExceptWith(intersect);
        bSet.ExceptWith(intersect);
        timer.Stop();
        var fCheck = new HashSet<String>(f);
        var bCheck = new HashSet<String>(b);
        Debug.WriteLine(String.Format("Hashset: {0} ms", timer.ElapsedMilliseconds));
        Console.WriteLine("Sets equal? {0} {1}", fSet.SetEquals(fCheck), bSet.SetEquals(bCheck));
        Console.ReadKey();
    }

    static Random _rnd = new Random();
    private const int Count = 50000;

    private static List<string> getRandomStrings()
    {
        var strings = new List<String>(Count);
        var chars = new Char[10];
        for (var i = 0; i < Count; i++)
        {
            var len = _rnd.Next(2, 10);
            for (var j = 0; j < len; j++)
            {
                var c = (Char)_rnd.Next('a', 'z');
                chars[j] = c;
            }
            strings.Add(new String(chars, 0, len));
        }
        return strings;
    }
}
With Intersect it would be done like this:
var matches = (from f in foo select f)
    .Intersect(from b in bar select b,
               StringComparer.InvariantCultureIgnoreCase);
If the elements are unique within each list, you should consider using a HashSet<T>:
The HashSet<T> class provides high-performance set operations. A set is a collection that contains no duplicate elements, and whose elements are in no particular order.
With a sorted list, you can use binary search.
Contains on a List is an O(N) operation. With a different data structure, such as a sorted list or a Dictionary, you would dramatically reduce the time: accessing a key in a sorted list is usually O(log N), and in a hash table usually O(1).
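A rough sketch of that difference (timings are machine-dependent; the data and names are made up):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class LookupDemo
{
    static void Main()
    {
        var list = Enumerable.Range(0, 30000).Select(i => "item" + i).ToList();
        var set = new HashSet<string>(list, StringComparer.InvariantCultureIgnoreCase);
        var probes = Enumerable.Range(0, 30000).Select(i => "ITEM" + i).ToList();

        var sw = Stopwatch.StartNew();
        int hits = probes.Count(p => set.Contains(p)); // ~O(1) per lookup
        sw.Stop();
        Console.WriteLine($"HashSet: {hits} hits in {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        hits = probes.Count(p =>
            list.Contains(p, StringComparer.InvariantCultureIgnoreCase)); // O(N) per lookup
        sw.Stop();
        Console.WriteLine($"List:    {hits} hits in {sw.ElapsedMilliseconds} ms");
    }
}
```

Both passes find all 30,000 matches; the List version does a linear scan per probe, so it is typically orders of magnitude slower.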
