Improve performance of sorting files by extension - c#

Given an array of file names, the simplest way to sort it by file extension is this:
Array.Sort(fileNames,
(x, y) => Path.GetExtension(x).CompareTo(Path.GetExtension(y)));
The problem is that on a very long list (~800k entries) this takes a long time, while sorting by the whole file name is a couple of seconds faster!
In theory, there is a way to optimize it: instead of calling Path.GetExtension() and comparing the newly created extension-only strings, we can provide a Comparison that compares the existing file name strings starting from LastIndexOf('.'), without creating new strings.
Now, suppose I have found the LastIndexOf('.'). I want to reuse .NET's native StringComparer and apply it only to the part of the string after the LastIndexOf('.'), to preserve all culture considerations. I didn't find a way to do that.
Any ideas?
Edit:
Using tanascius's idea of the char.CompareTo() method, I came up with my Uber-Fast-File-Extension-Comparer. It now sorts by extension 3x faster! It is even faster than all the methods that use Path.GetExtension() in some manner. What do you think?
Edit 2:
I found that this implementation does not take culture into account, since char.CompareTo() is not culture-aware, so it is not a perfect solution.
Any ideas?
public static int CompareExtensions(string filePath1, string filePath2)
{
if (filePath1 == null && filePath2 == null)
{
return 0;
}
else if (filePath1 == null)
{
return -1;
}
else if (filePath2 == null)
{
return 1;
}
int i = filePath1.LastIndexOf('.');
int j = filePath2.LastIndexOf('.');
if (i == -1)
{
i = filePath1.Length;
}
else
{
i++;
}
if (j == -1)
{
j = filePath2.Length;
}
else
{
j++;
}
for (; i < filePath1.Length && j < filePath2.Length; i++, j++)
{
int compareResults = filePath1[i].CompareTo(filePath2[j]);
if (compareResults != 0)
{
return compareResults;
}
}
if (i >= filePath1.Length && j >= filePath2.Length)
{
return 0;
}
else if (i >= filePath1.Length)
{
return -1;
}
else
{
return 1;
}
}

Create a new array that contains each of the filenames in ext.restofpath format (or some sort of pair/tuple format that can default sort on the extension without further transformation). Sort that, then convert it back.
This is faster because instead of having to retrieve the extension many times for each element (since you're doing something like N log N compares), you only do it once (and then move it back once).
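A minimal sketch of this idea, assuming the file names live in a string[]: Array.Sort has an overload that takes a separate array of keys and reorders the items array to match, so each extension is extracted exactly once. The class and method names are illustrative only, and StringComparer.CurrentCulture is an assumption about the ordering you want.

```csharp
using System;
using System.IO;

static class ExtensionSortSketch
{
    // Sorts fileNames in place by extension, computing each extension only once.
    public static void SortByExtension(string[] fileNames)
    {
        // Build the key array: one Path.GetExtension call per file name.
        var extensions = new string[fileNames.Length];
        for (int i = 0; i < fileNames.Length; i++)
        {
            extensions[i] = Path.GetExtension(fileNames[i]);
        }

        // Array.Sort(keys, items, comparer) sorts the keys and applies
        // the same reordering to the items array.
        Array.Sort(extensions, fileNames, StringComparer.CurrentCulture);
    }
}
```

This trades one extra string per file name for not having to re-extract the extension on every comparison.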

Not the most memory efficient but the fastest according to my tests:
SortedDictionary<string, List<string>> dic = new SortedDictionary<string, List<string>>();
foreach (string fileName in fileNames)
{
string extension = Path.GetExtension(fileName);
List<string> list;
if (!dic.TryGetValue(extension, out list))
{
list = new List<string>();
dic.Add(extension, list);
}
list.Add(fileName);
}
string[] arr = dic.Values.SelectMany(v => v).ToArray();
Did a mini benchmark on 800k randomly generated 8.3 filenames:
Sorting items with Linq to Objects... 00:00:04.4592595
Sorting items with SortedDictionary... 00:00:02.4405325
Sorting items with Array.Sort... 00:00:06.6464205

You can write a comparer that compares each character of the extension. char has a CompareTo() method, too.
Basically you loop until you have no more chars left in at least one string or one CompareTo() returns a value != 0.
EDIT: In response to the edits of the OP
The performance of your comparer method can be significantly improved. See the following code. Additionally I added the line
string.Compare( filePath1[i].ToString(), filePath2[j].ToString(),
m_CultureInfo, m_CompareOptions );
to enable the use of CultureInfo and CompareOptions. However, this slows everything down by about a factor of 2 compared to a version using a plain char.CompareTo(). But, according to my own SO question, this seems to be the way to go.
public sealed class ExtensionComparer : IComparer<string>
{
private readonly CultureInfo m_CultureInfo;
private readonly CompareOptions m_CompareOptions;
public ExtensionComparer() : this( CultureInfo.CurrentUICulture, CompareOptions.None ) {}
public ExtensionComparer( CultureInfo cultureInfo, CompareOptions compareOptions )
{
m_CultureInfo = cultureInfo;
m_CompareOptions = compareOptions;
}
public int Compare( string filePath1, string filePath2 )
{
if( filePath1 == null || filePath2 == null )
{
if( filePath1 != null )
{
return 1;
}
if( filePath2 != null )
{
return -1;
}
return 0;
}
var i = filePath1.LastIndexOf( '.' ) + 1;
var j = filePath2.LastIndexOf( '.' ) + 1;
if( i == 0 || j == 0 )
{
if( i != 0 )
{
return 1;
}
return j != 0 ? -1 : 0;
}
while( true )
{
if( i == filePath1.Length || j == filePath2.Length )
{
if( i != filePath1.Length )
{
return 1;
}
return j != filePath2.Length ? -1 : 0;
}
var compareResults = string.Compare( filePath1[i].ToString(), filePath2[j].ToString(), m_CultureInfo, m_CompareOptions );
//var compareResults = filePath1[i].CompareTo( filePath2[j] );
if( compareResults != 0 )
{
return compareResults;
}
i++;
j++;
}
}
}
Usage:
fileNames1.Sort( new ExtensionComparer( CultureInfo.GetCultureInfo( "sv-SE" ),
CompareOptions.StringSort ) );
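As a possible alternative to comparing one character at a time, CompareInfo.Compare has an overload that takes an offset and length for each string, which allows a culture-aware comparison of just the extension ranges without allocating substrings. The sketch below is an assumption-laden variant, not the code from the answer above: the class name is hypothetical, a missing dot is treated as an empty extension (as in the question's code), and a dot inside a directory name is not handled specially.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical name; shown only to illustrate the offset-based overload.
public sealed class SubstringExtensionComparer : IComparer<string>
{
    private readonly CompareInfo _compareInfo;
    private readonly CompareOptions _options;

    public SubstringExtensionComparer(CultureInfo culture, CompareOptions options)
    {
        if (culture == null) throw new ArgumentNullException(nameof(culture));
        _compareInfo = culture.CompareInfo;
        _options = options;
    }

    public int Compare(string x, string y)
    {
        // Nulls sort first.
        if (x == null) return y == null ? 0 : -1;
        if (y == null) return 1;

        int i = ExtensionStart(x);
        int j = ExtensionStart(y);

        // Culture-aware comparison of the two extension ranges; no new strings are created.
        return _compareInfo.Compare(x, i, x.Length - i, y, j, y.Length - j, _options);
    }

    // Index just past the last '.', or the string length (empty extension) when there is no dot.
    private static int ExtensionStart(string path)
    {
        int dot = path.LastIndexOf('.');
        return dot < 0 ? path.Length : dot + 1;
    }
}
```

Usage would look like fileNames.Sort(new SubstringExtensionComparer(CultureInfo.CurrentCulture, CompareOptions.IgnoreCase)) for a List<string>. Whether it beats the per-character version is something to measure on your own data.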

The main problem here is that you are calling Path.GetExtension multiple times for each path. If this is doing a quicksort, then you can expect Path.GetExtension to be called anywhere from log(n) to n times for each item in the list, where n is the number of items in the list. So you are going to want to cache the calls to Path.GetExtension.
If you were using LINQ, I would suggest something like this:
fileNames.Select(n => new { name = n, ext = Path.GetExtension(n) })
    .OrderBy(t => t.ext)
    .Select(t => t.name) // project back to the file names
    .ToArray();
This ensures that Path.GetExtension is only called once for each file name.

Related

Split a list of objects into sub-lists of contiguous elements using LINQ?

I have a simple class Item:
public class Item
{
public int Start { get; set;}
public int Stop { get; set;}
}
Given a List<Item> I want to split this into multiple sublists of contiguous elements. e.g. a method
List<Item[]> GetContiguousSequences(Item[] items)
Each element of the returned list should be an array of Item such that list[i].Stop == list[i+1].Start for each element
e.g.
{[1,10], [10,11], [11,20], [25,30], [31,40], [40,45], [45,100]}
=>
{{[1,10], [10,11], [11,20]}, {[25,30]}, {[31,40],[40,45],[45,100]}}
Here is a simple (and not guaranteed bug-free) implementation that simply walks the input data looking for discontinuities:
List<Item[]> GetContiguousSequences(Item []items)
{
var ret = new List<Item[]>();
var i1 = 0;
for(var i2=1;i2<items.Length;++i2)
{
//discontinuity
if(items[i2-1].Stop != items[i2].Start)
{
var num = i2 - i1;
ret.Add(items.Skip(i1).Take(num).ToArray());
i1 = i2;
}
}
//end of array
ret.Add(items.Skip(i1).Take(items.Length-i1).ToArray());
return ret;
}
It's not the most intuitive implementation and I wonder if there is a way to have a neater LINQ-based approach. I was looking at Take and TakeWhile thinking to find the indices where discontinuities occur but couldn't see an easy way to do this.
Is there a simple way to use IEnumerable LINQ algorithms to do this in a more descriptive (not necessarily performant) way?
I set up a simple test case here: https://dotnetfiddle.net/wrIa2J
I'm really not sure this is much better than your original, but for the purpose of another solution the general process is
Use Select to project a list working out a grouping
Use GroupBy to group by the above
Use Select again to project the grouped items to an array of Item
Use ToList to project the result to a list
public static List<Item[]> GetContiguousSequences2(Item []items)
{
var currIdx = 1;
return items.Select( (item,index) => new {
item = item,
index = index == 0 || items[index-1].Stop == item.Start ? currIdx : ++currIdx
})
.GroupBy(x => x.index, x => x.item)
.Select(x => x.ToArray())
.ToList();
}
Live example: https://dotnetfiddle.net/mBfHru
Another way is to do an aggregation using Aggregate. This means maintaining a final Result list and a Curr list where you can aggregate your sequences, adding them to the Result list as you find discontinuities. This method looks a little closer to your original
public static List<Item[]> GetContiguousSequences3(Item []items)
{
var res = items.Aggregate(new {Result = new List<Item[]>(), Curr = new List<Item>()}, (agg, item) => {
if(!agg.Curr.Any() || agg.Curr.Last().Stop == item.Start) {
agg.Curr.Add(item);
} else {
agg.Result.Add(agg.Curr.ToArray());
agg.Curr.Clear();
agg.Curr.Add(item);
}
return agg;
});
res.Result.Add(res.Curr.ToArray()); // Remember to add the last group
return res.Result;
}
Live example: https://dotnetfiddle.net/HL0VyJ
You can implement ContiguousSplit as a coroutine: loop over the source and either add the item to the current range, or return the current range and start a new one.
private static IEnumerable<Item[]> ContiguousSplit(IEnumerable<Item> source) {
List<Item> current = new List<Item>();
foreach (var item in source) {
if (current.Count > 0 && current[current.Count - 1].Stop != item.Start) {
yield return current.ToArray();
current.Clear();
}
current.Add(item);
}
if (current.Count > 0)
yield return current.ToArray();
}
then if you want materialization
List<Item[]> GetContiguousSequences(Item []items) => ContiguousSplit(items).ToList();
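To check the iterator against the sample data from the question, something like the following sketch can be used. It repeats the Item class and the iterator so that it compiles on its own; the ContiguousDemo class and the Describe helper are illustrative names, not part of the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Item
{
    public int Start { get; set; }
    public int Stop { get; set; }
}

public static class ContiguousDemo
{
    // Same iterator as above, repeated so the sketch is self-contained.
    public static IEnumerable<Item[]> ContiguousSplit(IEnumerable<Item> source)
    {
        var current = new List<Item>();
        foreach (var item in source)
        {
            // A gap between the previous Stop and this Start closes the current range.
            if (current.Count > 0 && current[current.Count - 1].Stop != item.Start)
            {
                yield return current.ToArray();
                current.Clear();
            }
            current.Add(item);
        }
        if (current.Count > 0)
            yield return current.ToArray();
    }

    // Formats each contiguous range as one line, e.g. "[1,10], [10,11]".
    public static List<string> Describe(IEnumerable<Item> items)
    {
        return ContiguousSplit(items)
            .Select(g => string.Join(", ", g.Select(i => $"[{i.Start},{i.Stop}]")))
            .ToList();
    }

    public static Item[] SampleData()
    {
        return new[]
        {
            new Item { Start = 1, Stop = 10 },
            new Item { Start = 10, Stop = 11 },
            new Item { Start = 11, Stop = 20 },
            new Item { Start = 25, Stop = 30 },
            new Item { Start = 31, Stop = 40 },
            new Item { Start = 40, Stop = 45 },
            new Item { Start = 45, Stop = 100 },
        };
    }
}
```

For the sample data, Describe returns three lines: "[1,10], [10,11], [11,20]", then "[25,30]", then "[31,40], [40,45], [45,100]", which matches the grouping asked for in the question.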
Your solution is okay. I don't think that LINQ adds any simplification or clarity in this situation. Here is a fast solution that I find intuitive:
static List<Item[]> GetContiguousSequences(Item[] items)
{
var result = new List<Item[]>();
int start = 0;
while (start < items.Length) {
int end = start + 1;
while (end < items.Length && items[end].Start == items[end - 1].Stop) {
end++;
}
int len = end - start;
var a = new Item[len];
Array.Copy(items, start, a, 0, len);
result.Add(a);
start = end;
}
return result;
}

How can I find biggest number in a specific row of 2d array?

I need to find the biggest value of the specific row in the 2d array.
static void BiggestValueOfKRow(Matrix matrica, int j, out int maxI)
{
int max = matrica.TakeValue(0,j);
maxI = 0;
for (int i = 0; i < matrica.n; i++)
{
if (matrica.TakeValue(i, j) > max)
{
max = matrica.TakeValue(i, j);
maxI = i;
}
}
}
I have tried other options before, but I still cannot get it to work.
I should be able to choose the number of the row, and then find the biggest value in that row.
Assuming that Matrix::TakeValue(a,b) is column-major and Matrix::n is the absolute width of the matrix (i.e. an exclusive upper-bound, rather than an inclusive upper-bound), here is how I would do it, using MaxBy:
// Requires C# 7.0 or later for the use of value tuples:
static (Int32 columnIndex, Int32 value) GetRowMax( Matrix m, int rowIndex )
{
if( m == null ) throw new ArgumentNullException( nameof(m) );
return Enumerable
.Range( 0, m.n )
.Select( colIdx => ( columnIndex: colIdx, value: m.TakeValue( colIdx, rowIndex ) ) )
.MaxBy( t => t.value );
}
Note that MaxBy was not part of standard LINQ before .NET 6 (grrr); however, it is included in almost every decent LINQ extension library, such as Jon Skeet's MoreLINQ.
An implementation of MaxBy is provided below:
// Rather than defining `MaxBy` yourself, you can also use MoreLINQ from NuGet.
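static class LinqExtensions
{
    public static T MaxBy<T,TValue>( this IEnumerable<T> source, Func<T,TValue> selector )
        where TValue : IComparable<TValue>
    {
        if( source == null ) throw new ArgumentNullException( nameof(source) );
        if( selector == null ) throw new ArgumentNullException( nameof(selector) );
        T maxItem = default(T);
        TValue maxValue = default(TValue);
        bool hasValue = false;
        foreach( T item in source )
        {
            // Compare the selected key (not the item itself) and remember the item that produced it.
            TValue value = selector( item );
            if( !hasValue || ( value != null && value.CompareTo( maxValue ) > 0 ) )
            {
                maxItem = item;
                maxValue = value;
                hasValue = true;
            }
        }
        return maxItem; // default(T) when the sequence is empty
    }
}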
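On .NET 6 or later, Enumerable.MaxBy is available in System.Linq, so no helper is needed. The sketch below shows the same row-maximum query under that assumption. Because the Matrix class from the question is not shown, it uses a plain int[,] indexed as [row, column] instead; RowMaxSketch is an illustrative name.

```csharp
using System;
using System.Linq;

static class RowMaxSketch
{
    // Returns the column index and value of the largest element in the given row.
    // Assumes the row has at least one column; MaxBy throws on an empty sequence of value tuples.
    public static (int columnIndex, int value) GetRowMax(int[,] matrix, int rowIndex)
    {
        if (matrix == null) throw new ArgumentNullException(nameof(matrix));
        return Enumerable
            .Range(0, matrix.GetLength(1)) // GetLength(1) is the number of columns
            .Select(col => (columnIndex: col, value: matrix[rowIndex, col]))
            .MaxBy(t => t.value);
    }
}
```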

Check if string contains characters in certain order in C#

I have code that works right now, but it doesn't check whether the characters are in order; it only checks that they're present. How can I modify my code so that the characters 'gaoaf' are checked in that order in the string?
Console.WriteLine("5.feladat");
StreamWriter sw = new StreamWriter("keres.txt");
sw.WriteLine("gaoaf");
string s = "";
for (int i = 0; i < n; i++)
{
s = zadatok[i].nev+zadatok[i].cim;
if (s.Contains("g") && s.Contains("a") && s.Contains("o") && s.Contains("a") && s.Contains("f") )
{
sw.WriteLine(i);
sw.WriteLine(zadatok[i].nev + zadatok[i].cim);
}
}
sw.Close();
You can convert the letters into a pattern and use Regex:
var letters = "gaoaf";
var pattern = String.Join(".*",letters.AsEnumerable());
var hasletters = Regex.IsMatch(s, pattern, RegexOptions.IgnoreCase);
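The Join-based pattern above puts the letters into the regex verbatim, which is fine for 'gaoaf' but would misbehave if the letters ever included regex metacharacters such as '.' or '*'. A hedged variant that escapes each character with Regex.Escape first might look like this (the class and method names are illustrative):

```csharp
using System.Linq;
using System.Text.RegularExpressions;

static class OrderedLettersSketch
{
    // True if the characters of 'letters' occur in 's' in the given order,
    // ignoring case, with any other characters allowed in between.
    public static bool ContainsInOrder(string s, string letters)
    {
        // Escape each character so that e.g. '.' is matched literally.
        var pattern = string.Join(".*", letters.Select(c => Regex.Escape(c.ToString())));
        return Regex.IsMatch(s, pattern, RegexOptions.IgnoreCase);
    }
}
```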
For those that needlessly avoid .*, you can also solve this with LINQ:
var ans = letters.Aggregate(0, (p, c) => p >= 0 ? s.IndexOf(c.ToString(), p, StringComparison.InvariantCultureIgnoreCase) : p) != -1;
If it is possible to have repeated adjacent letters, you need to complicate the LINQ solution slightly:
var ans = letters.Aggregate(0, (p, c) => {
if (p >= 0) {
var newp = s.IndexOf(c.ToString(), p, StringComparison.InvariantCultureIgnoreCase);
return newp >= 0 ? newp+1 : newp;
}
else
return p;
}) != -1;
Given the (ugly) machinations required to basically terminate Aggregate early, and given the (ugly and inefficient) syntax required to use an inline anonymous expression call to get rid of the temporary newp, I created some extensions to help, an Aggregate that can terminate early:
public static TAccum AggregateWhile<TAccum, T>(this IEnumerable<T> src, TAccum seed, Func<TAccum, T, TAccum> accumFn, Predicate<TAccum> whileFn) {
using (var e = src.GetEnumerator()) {
if (!e.MoveNext())
throw new Exception("At least one element required by AggregateWhile");
var ans = accumFn(seed, e.Current);
while (whileFn(ans) && e.MoveNext())
ans = accumFn(ans, e.Current);
return ans;
}
}
Now you can solve the problem fairly easily:
var ans2 = letters.AggregateWhile(-1,
(p, c) => s.IndexOf(c.ToString(), p+1, StringComparison.InvariantCultureIgnoreCase),
p => p >= 0
) != -1;
Why not something like this?
static bool CheckInOrder(string source, string charsToCheck)
{
int index = -1;
foreach (var c in charsToCheck)
{
index = source.IndexOf(c, index + 1);
if (index == -1)
return false;
}
return true;
}
Then you can use the function like this:
bool result = CheckInOrder("this is my source string", "gaoaf");
This should work because IndexOf returns -1 if a string isn't found, and it only starts scanning AFTER the previous match.

Returning the smallest integer in an arrayList in C#

I was recently asked in an interview to create a method that performs the following checks:
Code to check if ArrayList is null
Code to loop through ArrayList objects
Code to make sure object is an integer
Code to check if it is null, and if not then to compare it against a variable containing the smallest integer from the list and if smaller then overwrite it.
Return the smallest integer in the list.
So I created the following method
static void Main(string[] args)
{
ArrayList list = new ArrayList();
list.Add(1);
list.Add(2);
list.Add(3);
list.Add(4);
list.Add(5);
Program p = new Program();
p.Min(list);
}
private int? Min(ArrayList list)
{
int value;
//Code to check if ArrayList is null
if (list.Count > 0)
{
string minValue = GetMinValue(list).ToString();
//Code to loop through ArrayList objects
for(int i = 0; i < list.Count; i++)
{
//Code to make sure object is an integer
//Code to check if it is null, and if not to compare it against a variable containing the
//smallest integer from the list and if smaller overwrite it.
if (Int32.TryParse(i.ToString(), out value) || i.ToString() != string.Empty)
{
if (Convert.ToInt32(list[i]) < Convert.ToInt32(minValue))
{
minValue = list[i];
}
}
}
}
return Convert.ToInt32(GetMinValue(list));
}
public static object GetMinValue(ArrayList arrList)
{
ArrayList sortArrayList = arrList;
sortArrayList.Sort();
return sortArrayList[0];
}
I think the above is somewhat correct; however, I am not entirely sure about point 4.
I think the following logic may help you. It is simpler than the current code and uses int.TryParse() for parsing, which is safer than Convert.ToInt32() or int.Parse() because it handles errors internally and does not throw an exception for invalid input. If the input is invalid, it assigns 0 to the out variable and returns false, which tells us the conversion failed. See the code for this:
int currentNum = 0;
int yourNum = int.MaxValue;
bool isSuccess = true;
foreach (var item in listOfInt)
{
    // Stop at the first element that is null or cannot be parsed as an integer.
    if (item == null || !int.TryParse(item.ToString(), out currentNum))
    {
        isSuccess = false;
        break;
    }
    // A valid but larger number is not an error; just keep the current minimum.
    if (currentNum < yourNum)
    {
        yourNum = currentNum;
    }
}
if (isSuccess)
    Console.WriteLine("Minimum number in the array is {0}", yourNum);
else
    Console.WriteLine("Invalid input element found");
Simplistic version:
private int? Min(ArrayList list)
{
if (list == null || list.Count == 0) return null;
return list.Cast<int>().Min();
}
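Note that Cast<int>() throws an InvalidCastException if the list contains anything other than boxed ints. A sketch that instead walks through the checks listed in the question (null list, loop, type check, null element, running minimum) could look like the following. MinSketch is an illustrative name, and skipping non-integer elements rather than failing is an assumption about what the interviewer wanted.

```csharp
using System.Collections;

static class MinSketch
{
    // Returns the smallest int in the list, or null if the list is null
    // or contains no ints. Non-int and null elements are skipped.
    public static int? Min(ArrayList list)
    {
        // Check 1: the list itself may be null.
        if (list == null) return null;

        int? min = null;
        // Check 2: loop through the elements.
        foreach (object item in list)
        {
            // Checks 3 and 4: 'is int' is false for null and for non-int objects.
            if (item is int value && (min == null || value < min.Value))
            {
                min = value;
            }
        }
        return min;
    }
}
```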

Linq - Get all items between 2 matching elements

Given a list, I want to select all items between two given elements (including the begin and end elements themselves).
My current solution is as follows:
private IEnumerable<string> GetAllBetween(IEnumerable<string> list, string begin, string end)
{
bool isBetween = false;
foreach (string item in list)
{
if (item == begin)
{
isBetween = true;
}
if (item == end)
{
yield return item;
yield break;
}
if (isBetween)
{
yield return item;
}
}
}
But surely there must be a pretty linq query that accomplishes the same thing?
You can nearly use SkipWhile and TakeWhile, but you want the last item as well - you want the functionality of TakeUntil from MoreLINQ. You can then use:
var query = source.SkipWhile(x => x != begin)
.TakeUntil(x => x == end);
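If you would rather not take a dependency on MoreLINQ, an inclusive take-until operator is small enough to write yourself. The sketch below uses the hypothetical name TakeUntilInclusive to avoid any confusion with MoreLINQ's own method; it is an illustration of the idea, not MoreLINQ's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class TakeUntilSketch
{
    // Yields items up to and including the first one that satisfies the predicate.
    public static IEnumerable<T> TakeUntilInclusive<T>(this IEnumerable<T> source, Func<T, bool> predicate)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (predicate == null) throw new ArgumentNullException(nameof(predicate));
        return Iterate(source, predicate);
    }

    private static IEnumerable<T> Iterate<T>(IEnumerable<T> source, Func<T, bool> predicate)
    {
        foreach (var item in source)
        {
            yield return item;
            if (predicate(item)) yield break;
        }
    }

    // The query from the answer above, wrapped to match the question's signature.
    public static IEnumerable<string> GetAllBetween(IEnumerable<string> list, string begin, string end)
    {
        return list.SkipWhile(x => x != begin).TakeUntilInclusive(x => x == end);
    }
}
```

One behavioral difference from the loop in the question: an end item that appears before begin is skipped here instead of terminating the sequence.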
static IEnumerable<T> GetAllBetween<T>( this List<T> list, T a, T b )
{
var aOffset = list.IndexOf( a );
var bOffset = list.IndexOf( b );
// what to do if one or all items not found?
if( -1 == aOffset || -1 == bOffset )
{
// for this example I will return an empty array
return new T[] { };
}
// what to do if a comes after b?
if( aOffset > bOffset )
{
// for this example i'll simply swap them
int temp = aOffset;
aOffset = bOffset;
bOffset = temp;
}
return list.GetRange( aOffset, bOffset - aOffset + 1 ); // + 1 so that the end item is included
}
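I think a simple Skip and Take should do it. I normally use this approach for paging results in ASP.NET.
var startIndex = list.IndexOf(begin);
var endIndex = list.IndexOf(end);
// Skip up to the begin item and take through the end item, inclusive.
// Assumes list is a List<string> and that both items are found, with begin before end.
var result = list.Skip(startIndex).Take(endIndex - startIndex + 1);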
