I have a big CSV file containing about 20 million records (about 4 GB).
My goal is to read the data from this file and sort it by specific columns (multiple columns, not just one). I'm currently using the CsvHelper library for this.
My idea was to load the data into a List so I could use LINQ to order it as expected.
csvReader.GetRecords<T>() returns a lazily evaluated IEnumerable<T>, but as soon as I add .ToList() it throws a System.OutOfMemoryException:
var records = csvReader.GetRecords<T>();
I also tried building the list manually by looping over the records and adding each item to an empty List<T>:
List<T> lst = new List<T>();
foreach (var item in records)
{
    lst.Add(item);
}
but it still throws a System.OutOfMemoryException.
Is there any way to load the data from the CSV file into a list and order it by specific columns?
My PC has 16 GB of RAM and the file is about 4 GB.
Chop it into subsets, sort them, then merge. Here is a code example of an extension I wrote:
public static class MergeSortEnumerableExtensions
{
public static IEnumerable<TElement> MergeOrderBy<TElement, TKey>(this IEnumerable<TElement> sourceEnumerable, Func<TElement, TKey> keyProvider,
IEnumerableStorage<TElement> storage, IComparer<TKey> comparer = null, int minimalChunkSize = 1024*1024)
{
if(sourceEnumerable == null)
throw new ArgumentNullException(nameof(sourceEnumerable));
if (storage == null)
throw new ArgumentNullException(nameof(storage));
if (keyProvider == null)
throw new ArgumentNullException(nameof(keyProvider));
comparer = comparer ?? Comparer<TKey>.Default;
storage.Clear();
//chunking, sorting, saving
foreach (var chunk in sourceEnumerable.ChunkInPlace(minimalChunkSize))
{
chunk.Sort((x,y)=> comparer.Compare(keyProvider(x), keyProvider(y)));
storage.Add(chunk);
}
if (storage.Count == 0)
return Enumerable.Empty<TElement>();
//making a merge tree IEnumerable out of all files, it will cost us O(totalSize/chunkSize) open handles
var queue = new Queue<IEnumerable<TElement>>(storage.Count);
while (storage.Count > 0)
{
queue.Enqueue(storage.Take());
}
while (queue.Count > 1)
{
queue.Enqueue(MergeSorted(queue.Dequeue(), queue.Dequeue(), keyProvider, comparer));
}
return queue.Dequeue();
}
private static IEnumerable<TElement> MergeSorted<TElement, TKey>(IEnumerable<TElement> a, IEnumerable<TElement> b, Func<TElement, TKey> keyProvider, IComparer<TKey> comparer)
{
using var aiter = a.GetEnumerator();
using var biter = b.GetEnumerator();
var amoved = aiter.MoveNext();
var bmoved = biter.MoveNext();
while (amoved || bmoved)
{
var cmp = amoved && bmoved ? comparer.Compare(keyProvider(aiter.Current), keyProvider(biter.Current)) : (amoved ? -1 : 1);
if (cmp <= 0)
{
yield return aiter.Current;
amoved = aiter.MoveNext();
}
else
{
yield return biter.Current;
bmoved = biter.MoveNext();
}
}
}
//chops the incoming enumerable into chunks, reusing the same list each time to lower GC usage
private static IEnumerable<List<TValue>> ChunkInPlace<TValue>(
this IEnumerable<TValue> values,
int chunkSize)
{
var list = new List<TValue>(chunkSize);
using var enumerator = values.GetEnumerator();
while (enumerator.MoveNext())
{
list.Add(enumerator.Current);
if (list.Count == chunkSize)
{
yield return list;
list.Clear();
}
}
if (list.Count > 0)
yield return list;
list.Clear();
}
}
You basically write a streamed version of persistent storage (in CSV/JSON/any other format, or straight into SQL) with Add/Take methods and plug it into the extension.
The memory tradeoff/consumption can be adjusted by selecting the minimal chunk size, depending on the size and number of entities you want to sort at a time. This way you can sort any amount of data, even data that does not reside on your disk; it could just as easily be cloud blob storage.
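For reference, here is a minimal in-memory sketch of the storage contract the extension expects. The IEnumerableStorage<TElement> interface is not shown in this answer, so its shape here is inferred from the Clear/Add/Take/Count calls above; a real implementation would persist chunks to disk or blob storage.
using System.Collections.Generic;

public interface IEnumerableStorage<T>
{
    int Count { get; }
    void Clear();
    void Add(IEnumerable<T> chunk);  // persist one sorted chunk
    IEnumerable<T> Take();           // remove one chunk and stream it back
}

public sealed class InMemoryEnumerableStorage<T> : IEnumerableStorage<T>
{
    private readonly Queue<List<T>> _chunks = new Queue<List<T>>();

    public int Count => _chunks.Count;

    public void Clear() => _chunks.Clear();

    // Copy the chunk: MergeOrderBy reuses the same list between chunks.
    public void Add(IEnumerable<T> chunk) => _chunks.Enqueue(new List<T>(chunk));

    public IEnumerable<T> Take() => _chunks.Dequeue();
}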
Usage
var sorted = infiniteInput.MergeOrderBy(x => x.Size, tmpStorage); // you can safely use it in foreach or any other memory-cheap operation like Any(), First(), Last(), Sum(), etc.
Benchmark: in-memory storage, 1 million ints
| Method | Mean | Error | StdDev | Completed Work Items | Lock Contentions | Allocated |
|---------------- |---------:|--------:|--------:|---------------------:|-----------------:|----------:|
| MergeSortOnDisk | 285.6 ms | 5.67 ms | 7.76 ms | 1.0000 | - | 12.2 MB |
Benchmark: JSON file storage, 1 million ints
| Method | Mean | Error | StdDev | Completed Work Items | Lock Contentions | Allocated |
|---------------- |--------:|---------:|---------:|---------------------:|-----------------:|----------:|
| MergeSortOnDisk | 1.545 s | 0.0301 s | 0.0282 s | 2.0000 | - | 314.26 MB |
P.S.
Link to my repo with tests/benchmarks (and storage examples):
https://github.com/eocron/Algorithm/tree/master/Algorithm/Sorted
I just loaded 20 million records of a simple Foo class with Id and Name properties. The file was 0.4 GB, but the loaded records in memory were close to 2 GB. If we assume your class is also going to need about 5 times the memory, then you would need 20 GB of memory for your 4 GB file. Even if you do manage to load all the records, you are going to need still more memory to do the sorting.
If you don't want to use the suggestion from Ňɏssa Pøngjǣrdenlarp to upload the file to a database, you might be able to use a smaller SortingClass that only has the columns you want to sort on, plus a Row column that lets you find the records again in the file. After sorting, if you only need a smaller subset of the records, use the row numbers to find the complete records in the file.
void Main()
{
List<SortingClass> sortedRecords;
using (var reader = new StreamReader(@"C:\Temp\VeryLargeFoo.csv"))
using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
{
csv.Context.RegisterClassMap<SortingClassMap>();
sortedRecords = csv.GetRecords<SortingClass>().OrderByDescending(f => f.Name).ThenBy(f => f.Id).ToList();
}
var selectedRows = sortedRecords.Take(100).Select(r => r.Row).ToArray();
var selectedFoo = new List<Foo>();
using (var reader = new StreamReader(@"C:\Temp\VeryLargeFoo.csv"))
using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
{
csv.Read();
csv.ReadHeader();
while (csv.Read())
{
if (selectedRows.Contains(csv.Parser.Row)){
selectedFoo.Add(csv.GetRecord<Foo>());
}
}
}
}
public class SortingClass
{
public int Row { get; set; }
public int Id { get; set; }
public string Name { get; set; }
}
public class SortingClassMap : ClassMap<SortingClass>
{
public SortingClassMap()
{
Map(r => r.Row).Convert(args => args.Row.Context.Parser.Row);
Map(r => r.Id);
Map(r => r.Name);
}
}
public class Foo
{
public int Id { get; set; }
public string Name { get; set; }
public string Column3 { get; set; }
public string Column4 { get; set; }
public string Column5 { get; set; }
public string Column6 { get; set; }
}
I have a SpecFlow table that looks like this:
When I Perform POST Operation for "/example/" with body
| answerValue1 | answerValue2 | countryCode | Cash |
| Yes | Yes | AD | 76-100% |
| | | AF | |
The countryCode column is the only one that can have multiple values.
What I tried to do was add the columns to a dictionary with a simple TableExtensions class:
public class TableExtensions
{
public static Dictionary<string, string> ToDictionary(Table table)
{
var dictionary = new Dictionary<string, string>();
foreach (var row in table.Rows)
{
dictionary.Add(row[0], row[1]);
}
return dictionary;
}
}
and call it from the step definition:
var dictionary = TableExtensions.ToDictionary(table);
var countryCode = dictionary["countryCode"];
Unfortunately I get the error The given key was not present in the dictionary,
since the dictionary only contains the values from the first and second columns.
Of course, if I change the indexes to row[2], row[3] it gets the right columns,
but I would like to keep the TableExtensions reusable.
I also tried incrementing the index, but it still only took the first two columns:
var i = 0;
foreach (var row in table.Rows)
{
dictionary.Add(row[i], row[i]);
i++;
}
Does anyone have a better solution?
I'm not entirely sure what you want the dictionary to ultimately contain, but since you mention that manually changing the indexes it looks for to row[2], row[3] gives the data you want, perhaps this would give you the reusability you're looking for:
public class TableExtensions
{
public static Dictionary<string, string> ToDictionary(Table table, int columnOne, int columnTwo)
{
var dictionary = new Dictionary<string, string>();
foreach (var row in table.Rows)
{
dictionary.Add(row[columnOne], row[columnTwo]);
}
return dictionary;
}
}
Usage:
var dictionary = TableExtensions.ToDictionary(table, 2, 3);
This produces a dictionary keyed by country code, with the Cash column as the value ("AD" -> "76-100%", "AF" -> ""). You could get the country codes like this:
foreach (var row in dictionary)
{
var countryCode = row.Key;
var score = row.Value ?? string.Empty;
}
Given the simplicity of the country codes, I would make them a comma-separated list and use a vertical table instead:
When I Perform POST Operation for "/example/"
| Field | Value |
| answerValue1 | ... |
| answerValue2 | ... |
| countryCodes | AD, AF |
| cash | ... |
In your step definition:
var example = table.CreateInstance<ExampleRow>();
// use example.GetCountryCodes();
And the ExampleRow class to parse the table into an object:
public class ExampleRow
{
public string AnswerValue1 { get; set; }
public string AnswerValue2 { get; set; }
private string[] countryCodes;
public string CountryCodes
{
get => string.Join(", ", countryCodes);
set => countryCodes = value.Split(", ");
}
public string[] GetCountryCodes()
{
return countryCodes;
}
}
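For completeness, a minimal sketch of the step binding this assumes (CreateInstance comes from the TechTalk.SpecFlow.Assist helpers; the endpoint handling is hypothetical):
using TechTalk.SpecFlow;
using TechTalk.SpecFlow.Assist;

[Binding]
public class ExampleSteps
{
    [When(@"I Perform POST Operation for ""(.*)""")]
    public void WhenIPerformPostOperationFor(string endpoint, Table table)
    {
        // Vertical Field/Value tables map straight onto the properties of ExampleRow.
        var example = table.CreateInstance<ExampleRow>();
        var codes = example.GetCountryCodes(); // ["AD", "AF"]
        // post 'example' to 'endpoint' with your HTTP client of choice
    }
}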
I need a program in C# which writes out:
how many Eric Clapton songs were played on the radios;
whether there is any Eric Clapton song which all 3 radios played;
how much time they broadcast Eric Clapton songs in total.
The first column contains the radio identifier (1-2-3),
the second column is the song playtime in minutes,
the third column is the song playtime in seconds,
and the last two are the performer : song.
So the file looks like this:
1 5 3 Deep Purple:Bad Attitude
2 3 36 Eric Clapton:Terraplane Blues
3 2 46 Eric Clapton:Crazy Country Hop
3 3 25 Omega:Ablakok
2 4 23 Eric Clapton:Catch Me If You Can
1 3 27 Eric Clapton:Willie And The Hand Jive
3 4 33 Omega:A szamuzott
.................
...and about 670 more lines.
So far I have this:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
namespace radiplaytime
{
public struct Adat
{
public int rad;
public int min;
public int sec;
public Adat(string a, string b, string c)
{
rad = Convert.ToInt32(a);
min = Convert.ToInt32(b);
sec = Convert.ToInt32(c);
}
}
class Program
{
static void Main(string[] args)
{
String[] lines = File.ReadAllLines(@"...\zenek.txt");
List<Adat> adatlista = (from adat in lines
//var adatlista = from adat in lines
select new Adat(adat.Split(' ')[0],
adat.Split(' ')[1],
adat.Split(' ')[2])).ToList<Adat>();
var timesum = (from adat in adatlista
group adat by adat.rad into ertekek
select new
{
rad = ertekek.Key,
hour = (ertekek.Sum(adat => adat.min) +
ertekek.Sum(adat => adat.sec) / 60) / 60,
min = (ertekek.Sum(adat => adat.min) +
ertekek.Sum(adat => adat.sec) / 60) % 60,
sec = ertekek.Sum(adat => adat.sec) % 60,
}).ToArray();
for (int i = 0; i < timesum.Length; i++)
{
Console.WriteLine("{0}. radio: {1}:{2}:{3} playtime",
timesum[i].rad,
timesum[i].hour,
timesum[i].min,
timesum[i].sec);
}
Console.ReadKey();
}
}
}
You can define a custom class to store the values of each line. You will need to use Regex to split each line and populate your custom class. Then you can use LINQ to get the information you need.
public class Plays
{
public int RadioID { get; set; }
public int PlayTimeMinutes { get; set; }
public int PlayTimeSeconds { get; set; }
public string Performer { get; set; }
public string Song { get; set; }
}
So you then read your file and populate the custom Plays:
String[] lines = File.ReadAllLines(@"songs.txt");
List<Plays> plays = new List<Plays>();
foreach (string line in lines)
{
var matches = Regex.Match(line, @"^(\d+)\s(\d+)\s(\d+)\s(.+)\:(.+)$"); //this will split your line into groups
if (matches.Success)
{
Plays play = new Plays();
play.RadioID = int.Parse(matches.Groups[1].Value);
play.PlayTimeMinutes = int.Parse(matches.Groups[2].Value);
play.PlayTimeSeconds = int.Parse(matches.Groups[3].Value);
play.Performer = matches.Groups[4].Value;
play.Song = matches.Groups[5].Value;
plays.Add(play);
}
}
Now that you have your list of songs, you can use LINQ to get what you need:
//Get Total Eric Clapton songs played - assuming distinct songs
var ericClaptonSongsPlayed = plays.Where(x => x.Performer == "Eric Clapton").GroupBy(y => y.Song).Count();
//get eric clapton songs played on all radio stations
var radioStations = plays.Select(x => x.RadioID).Distinct();
var commonEricClaptonSong = plays.Where(x => x.Performer == "Eric Clapton").GroupBy(y => y.Song).Where(z => z.Count() == radioStations.Count());
etc.
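A sketch of the third requirement (total Eric Clapton broadcast time), following the same pattern as the queries above:
//get total Eric Clapton broadcast time as a TimeSpan
var totalSeconds = plays.Where(x => x.Performer == "Eric Clapton")
                        .Sum(x => x.PlayTimeMinutes * 60 + x.PlayTimeSeconds);
var totalTime = TimeSpan.FromSeconds(totalSeconds);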
String splitting works only if the text is really simple and doesn't have to deal with fixed-length fields. It also generates a lot of temporary strings, which can cause your program to consume many times the size of the original file in RAM and harm performance due to the constant allocations and garbage collection.
Riv's answer shows how to use a Regex to parse this file. It can be improved in several ways though:
var pattern = @"^(\d+)\s(\d+)\s(\d+)\s(.+)\:(.+)$";
var regex = new Regex(pattern);
var plays = from line in File.ReadLines(filePath)
let matches=regex.Match(line)
select new Plays {
RadioID = int.Parse(matches.Groups[1].Value),
PlayTimeMinutes = int.Parse(matches.Groups[2].Value),
PlayTimeSeconds = int.Parse(matches.Groups[3].Value),
Performer = matches.Groups[4].Value,
Song = matches.Groups[5].Value
};
ReadLines returns an IEnumerable<string> instead of reading all of the lines into a buffer. This means that parsing can start immediately.
By using a single regular expression, we don't have to rebuild the regex for each line.
No list is needed. The query returns an IEnumerable to which other LINQ operations can be applied directly.
For example :
var durations = plays.GroupBy(p => p.RadioID)
                     .Select(grp => new
                     {
                         RadioID = grp.Key,
                         Hours = grp.Sum(p => p.PlayTimeMinutes + p.PlayTimeSeconds / 60) / 60,
                         Mins = grp.Sum(p => p.PlayTimeMinutes + p.PlayTimeSeconds / 60) % 60,
                         Secs = grp.Sum(p => p.PlayTimeSeconds) % 60
                     });
A further improvement could be to give names to the groups:
var pattern = @"^(?<station>\d+)\s(?<min>\d+)\s(?<sec>\d+)\s(?<performer>.+)\:(?<song>.+)$";
...
select new Plays {
RadioID = int.Parse(matches.Groups["station"].Value),
PlayTimeMinutes = int.Parse(matches.Groups["min"].Value),
...
};
You can also get rid of the Plays class and use a single, slightly more complex LINQ query :
var durations = from line in File.ReadLines(filePath)
let matches=regex.Match(line)
let play= new {
RadioID = int.Parse(matches.Groups["station"].Value),
Minutes = int.Parse(matches.Groups["min"].Value),
Seconds = int.Parse(matches.Groups["sec"].Value)
}
group play by play.RadioID into grp
select new { RadioID = grp.Key,
             Hours = grp.Sum(p => p.Minutes + p.Seconds / 60) / 60,
             Mins = grp.Sum(p => p.Minutes + p.Seconds / 60) % 60,
             Secs = grp.Sum(p => p.Seconds) % 60
           };
In this case, no strings are generated for Performer and Song. That's another benefit of regular expressions: matches and groups are just indexes into the original string, and no substring is generated until .Value is read. This would reduce the RAM used in this case by about 75%.
Once you have the results, you can iterate over them:
foreach (var duration in durations)
{
Console.WriteLine("{0}. radio: {1}:{2}:{3} playtime",
duration.RadioID,
duration.Hours,
duration.Mins,
duration.Secs);
}
I have 2 CSV files:
1.csv
spain;russia;japan
italy;russia;france
2.csv
spain;russia;japan
india;iran;pakistan
I read both files and add the data to lists:
var lst1= File.ReadAllLines("1.csv").ToList();
var lst2= File.ReadAllLines("2.csv").ToList();
Then I find all the strings unique to either list and add them to a result list:
var rezList = lst1.Except(lst2).Union(lst2.Except(lst1)).ToList();
rezList contains this data:
[0] = "italy;russia;france"
[1] = "india;iran;pakistan"
Now I want to do the compare (except and union) using the second and third columns of each row.
I think I need to split all the rows on the ';' symbol and then do the 3 operations (except, distinct and union), but I cannot figure out how.
rezList must contain:
india;iran;pakistan
I added a class:
class StringLengthEqualityComparer : IEqualityComparer<string>
{
public bool Equals(string x, string y)
{
...
}
public int GetHashCode(string obj)
{
...
}
}
StringLengthEqualityComparer stringLengthComparer = new StringLengthEqualityComparer();
var rezList = lst1.Except(lst2, stringLengthComparer).Union(lst2.Except(lst1, stringLengthComparer), stringLengthComparer).ToList();
Your question is not very clear: for instance, is india;iran;pakistan the desired result primarily because russia is at element [1]? Isn't it also included because element [2], pakistan, does not match france or japan? Even though that's unclear, I assume the desired result comes from either situation.
Then there is this: find all unique strings from both lists, which changes the nature dramatically. So I take it that the desired result is there because "iran" appears nowhere else in column [1] in either file, and even if it did, that row would still be unique due to "pakistan" in column [2].
Also note that a data sample of 2 rows leaves room for a fair amount of error.
Trying to do it in one step makes it very confusing. Since eliminating dupes found in 1.CSV is pretty easy, do it first:
// parse "1.CSV"
List<string[]> lst1 = File.ReadAllLines(@"C:\Temp\1.csv").
Select(line => line.Split(';')).
ToList();
// parse "2.CSV"
List<string[]> lst2 = File.ReadAllLines(@"C:\Temp\2.csv").
Select(line => line.Split(';')).
ToList();
// extracting once speeds things up in the next step
// and leaves open the possibility of iterating in a method
List<List<string>> tgts = new List<List<string>>();
tgts.Add(lst1.Select(z => z[1]).Distinct().ToList());
tgts.Add(lst1.Select(z => z[2]).Distinct().ToList());
var tmpLst = lst2.Where(x => !tgts[0].Contains(x[1]) ||
!tgts[1].Contains(x[2])).
ToList();
That results in the items which are not in 1.CSV (no matching text in Col[1] nor Col[2]). If that is really all you need, you are done.
Getting unique rows within 2.CSV is trickier because you have to actually count the number of times each Col[1] item occurs to see if it is unique; then repeat for Col[2]. This uses GroupBy:
var unique = tmpLst.
GroupBy(g => g[1], (key, values) =>
new GroupItem(key,
values.ToArray()[0],
values.Count())
).Where(q => q.Count == 1).
GroupBy(g => g.Data[2], (key, values) => new
{
Item = string.Join(";", values.ToArray()[0]),
Count = values.Count()
}
).Where(q => q.Count == 1).Select(s => s.Item).
ToList();
The GroupItem class is trivial:
class GroupItem
{
public string Item { set; get; } // debug aide
public string[] Data { set; get; }
public int Count { set; get; }
public GroupItem(string n, string[] d, int c)
{
Item = n;
Data = d;
Count = c;
}
public override string ToString()
{
return string.Join(";", Data);
}
}
It starts with tmpLst and gets the rows with a unique element at [1]. It uses a class for storage since at this point we need the array data for further review.
The second GroupBy acts on those results, this time looking at col[2]. Finally, it selects the joined string data.
Results
Using 50,000 random items in File1 (1.3 MB) and 15,000 in File2 (390 KB): there were no naturally occurring unique items, so I manually made 8 rows unique in 2.CSV and copied 2 of them into 1.CSV. The copies in 1.CSV should eliminate 2 of the 8 unique rows in 2.CSV, making the expected result 6 unique rows:
NepalX and ItalyX were the repeats in both files and they correctly eliminated each other.
With each step it is scanning and working with less and less data, which seems to make it pretty fast for 65,000 rows / 130,000 data elements.
Your GetHashCode() method in the EqualityComparer is buggy. Fixed version:
public int GetHashCode(string obj)
{
return obj.Split(';')[1].GetHashCode();
}
now the result is correct:
// one result: "india;iran;pakistan"
btw. "StringLengthEqualityComparer"is not a good name ;-)
private List<string> GetUnion(List<string> lst1, List<string> lst2)
{
List<string> lstUnion = new List<string>();
foreach (string value in lst1)
{
var columns = value.Split(';');
string valueColumn2 = columns[1];
string valueColumn3 = columns[2];
string result = lst2.FirstOrDefault(s => s.Contains(";" + valueColumn2 + ";" + valueColumn3));
if (result != null)
{
if (!lstUnion.Contains(result))
{
lstUnion.Add(result);
}
}
}
    return lstUnion;
}
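With the return value added, usage would look along these lines:
var lstUnion = GetUnion(lst1, lst2); // rows from lst2 whose second and third columns also appear in lst1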
class Program
{
static void Main(string[] args)
{
var lst1 = File.ReadLines(#"D:\test\1.csv").Select(x => new StringWrapper(x)).ToList();
var lst2 = File.ReadLines(#"D:\test\2.csv").Select(x => new StringWrapper(x));
var set = new HashSet<StringWrapper>(lst1);
set.SymmetricExceptWith(lst2);
foreach (var x in set)
{
Console.WriteLine(x.Value);
}
}
}
struct StringWrapper : IEquatable<StringWrapper>
{
public string Value { get; }
private readonly string _comparand0;
private readonly string _comparand14;
public StringWrapper(string value)
{
Value = value;
var split = value.Split(';');
_comparand0 = split[0];
_comparand14 = split[14];
}
public bool Equals(StringWrapper other)
{
return string.Equals(_comparand0, other._comparand0, StringComparison.OrdinalIgnoreCase)
&& string.Equals(_comparand14, other._comparand14, StringComparison.OrdinalIgnoreCase);
}
public override bool Equals(object obj)
{
if (ReferenceEquals(null, obj)) return false;
return obj is StringWrapper && Equals((StringWrapper) obj);
}
public override int GetHashCode()
{
unchecked
{
return ((_comparand0 != null ? StringComparer.OrdinalIgnoreCase.GetHashCode(_comparand0) : 0)*397)
^ (_comparand14 != null ? StringComparer.OrdinalIgnoreCase.GetHashCode(_comparand14) : 0);
}
}
}
This is my input stored in a file:
50|Carbon|Mercury|P:4;P:00;P:1
90|Oxygen|Mars|P:10;P:4;P:00
90|Serium|Jupiter|P:4;P:16;P:10
85|Hydrogen|Saturn|P:00;P:10;P:4
Now I want to take P:4 from my first row, then P:00, and so on, and count the occurrences in every other row, so the expected output will be:
P:4 3 (found in 2nd row, 3rd row, 4th row (last cell))
P:00 2 (found in 2nd row, 4th row)
P:1 0 (no occurrences)
P:10 1
P:16 0
etc.
Likewise I would like to print the occurrence count of each and every proportion.
So far I have succeeded in splitting the file row by row and storing it in my class objects like this:
public class Planets
{
//My rest fields
public string ProportionConcat { get; set; }
public List<proportion> proportion { get; set; }
}
public class proportion
{
public int Number { get; set; }
}
I have already filled my planet objects as below, so finally my list of planet data looks like this:
List<Planets> Planets = new List<Planets>();
Planets[0]:
{
Number:50
name: Carbon
object:Mercury
ProportionConcat:P:4;P:00;P:1
proportion[0]:
{
Number:4
},
proportion[1]:
{
Number:00
},
proportion[2]:
{
Number:1
}
}
Etc...
I know I can loop through, search, and count, but that would require 2 or 3 loops and the code would get a little messy, so I am looking for a cleaner way to do this.
Now, how do I search for and count every proportion across my list of planets?
Well, if you have parsed the proportions, you can create a new class for the output data:
// Class to storage result
public class Values
{
public int Count; // count of proportion entry.
public readonly HashSet<int> Rows = new HashSet<int>(); //list with rows numbers.
/// <summary> Add new proportion</summary>
/// <param name="rowNumber">Number of row, where proportion entries</param>
public void Increment(int rowNumber)
{
++Count; // increase count of proportions entries
Rows.Add(rowNumber); // add number of row, where proportion entry
}
}
And use this code to fill it. I'm not sure the plain-loop version is "messy", and I don't see a necessity to complicate the code with LINQ. What do you think about it?
var result = new Dictionary<int, Values>(); // key: proportion number; value: how often it occurs and the rows where it appears
for (var i = 0; i < Planets.Count; i++) // we use for instead of foreach for finding row number. i == row number
{
var planet = Planets[i];
foreach (var proportion in planet.proportion)
{
if (!result.ContainsKey(proportion.Number)) // if our result dictionary doesn't contain proportion
result.Add(proportion.Number, new Values()); // we add it to dictionary and initialize our result class for this proportion
result[proportion.Number].Increment(i); // increment count of entries and add row number
}
}
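To print the results in roughly the format the question expects, you could then do something like this (note: the proportion was parsed as an int, so "P:00" prints as "P:0", the row numbers are 0-based, and the counts include every row the proportion appears in):
foreach (var pair in result)
{
    Console.WriteLine("P:{0} {1} (rows: {2})",
        pair.Key,
        pair.Value.Count,
        string.Join(",", pair.Value.Rows));
}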
You can use var count = Regex.Matches(lineString, input).Count. Try this example:
var list = new List<string>
{
"50|Carbon|Mercury|P:4;P:00;P:1",
"90|Oxygen|Mars|P:10;P:4;P:00",
"90|Serium|Jupiter|P:4;P:16;P:10",
"85|Hydrogen|Saturn|P:00;P:10;P:4"
};
int totalCount;
var result = CountWords(list, "P:4", out totalCount);
Console.WriteLine("Total Found: {0}", totalCount);
foreach (var foundWords in result)
{
Console.WriteLine(foundWords);
}
public class FoundWords
{
public string LineNumber { get; set; }
public int Found { get; set; }
}
private List<FoundWords> CountWords(List<string> words, string input, out int total)
{
total = 0;
int[] index = {0};
var result = new List<FoundWords>();
foreach (var f in words.Select(word => new FoundWords {Found = Regex.Matches(word, input).Count, LineNumber = "Line Number: " + (index[0] + 1)}))
{
result.Add(f);
total += f.Found;
index[0]++;
}
return result;
}
I made a DotNetFiddle for you here: https://dotnetfiddle.net/z9QwmD
string raw =
#"50|Carbon|Mercury|P:4;P:00;P:1
90|Oxygen|Mars|P:10;P:4;P:00
90|Serium|Jupiter|P:4;P:16;P:10
85|Hydrogen|Saturn|P:00;P:10;P:4";
string[] splits = raw.Split(
new string[] { "|", ";", "\n" },
StringSplitOptions.None
);
foreach (string p in splits.Where(s => s.ToUpper().StartsWith("P:")).Distinct())
{
Console.WriteLine(
string.Format("{0} - {1}",
p,
splits.Count(s => s.ToUpper() == p.ToUpper())
)
);
}
Basically, you can use .Split to split on multiple delimiters at once; it's pretty straightforward. After that, everything is gravy :).
Obviously my code simply outputs the results to the console, but that part is fairly easy to change. Let me know if there's anything you didn't understand.
I have two huge lists of objects: a List<Forecast> with all the forecasts for different Resources, and a List<Capacity> with the capacities of these Resources.
A Forecast also contains booleans indicating whether the Resource is over or under capacity for the sum of all its forecasts.
public class Forecast
{
public int ResourceId { get; set; }
public double? ForecastJan { get; set; }
// and ForecastFeb, ForecastMarch, ForecastApr, ForecastMay, etc.
public bool IsOverForecastedJan { get; set; }
// and IsOverForecastedFeb, IsOverForecastedMarch, IsOverForecastedApr, etc.
}
public class Capacity
{
public int ResourceId { get; set; }
public double? CapacityJan { get; set; }
// and CapacityFeb, CapacityMar, CapacityApr, CapacityMay, etc.
}
I have to set the IsOverForecastedXXX properties, so for each month I have to know whether the sum of the forecasts for each resource is higher than that resource's capacity.
Here is my code:
foreach (Forecast f in forecastList)
{
if (capacityList.Where(c => c.ResourceId == f.ResourceId)
.Select(c => c.CapacityJan)
.First()
< forecastList.Where(x => x.ResourceId == f.ResourceId)
.Sum(x => x.ForecastJan)
)
{
f.IsOverForecastedJan = true;
}
//Same for each month...
}
My code works, but performance is really bad when the lists get big (thousands of elements).
Do you know how I can improve this algorithm? How can I compare the sum of forecasts for each resource with the associated capacity?
You can use First or FirstOrDefault to get the capacities for the current resource, then compare them. I would use ToLookup, which is similar to a Dictionary, to get all the forecasts for each resource.
ILookup<int, Forecast> forecastMonthGroups = forecastList
.ToLookup(fc => fc.ResourceId);
foreach (Forecast f in forecastList)
{
double? janSum = forecastMonthGroups[f.ResourceId].Sum(fc => fc.ForecastJan);
double? febSum = forecastMonthGroups[f.ResourceId].Sum(fc => fc.ForecastFeb);
var capacities = capacityList.First(c => c.ResourceId == f.ResourceId);
bool overJan = capacities.CapacityJan < janSum;
bool overFeb = capacities.CapacityFeb < febSum;
// ...
f.IsOverForecastedJan = overJan;
f.IsOverForecastedFeb = overFeb;
// ...
}
It seems that there is only one Capacity per ResourceId, so I would use a Dictionary to map ResourceId to Capacity; this improves performance even more:
ILookup<int, Forecast> forecastMonthGroups = forecastList
.ToLookup(fc => fc.ResourceId);
Dictionary<int, Capacity> capacityResources = capacityList
.ToDictionary(c => c.ResourceId);
foreach (Forecast f in forecastList)
{
double? janSum = forecastMonthGroups[f.ResourceId].Sum(fc => fc.ForecastJan);
double? febSum = forecastMonthGroups[f.ResourceId].Sum(fc => fc.ForecastFeb);
bool overJan = capacityResources[f.ResourceId].CapacityJan < janSum;
bool overFeb = capacityResources[f.ResourceId].CapacityFeb < febSum;
// ...
f.IsOverForecastedJan = overJan;
f.IsOverForecastedFeb = overFeb;
// ...
}
I would try selecting out your capacities and forecast sums before entering the loop; this way you are not iterating over each list every time you go round the loop.
Something like this:
var capacities = capacityList.GroupBy(c => c.ResourceId).ToDictionary(g => g.Key, g => g.First().CapacityJan);
var forecasts = forecastList.GroupBy(x => x.ResourceId).ToDictionary(g => g.Key, g => g.Sum(f => f.ForecastJan));
foreach (Forecast f in forecastList)
{
if (capacities[f.ResourceId] < forecasts[f.ResourceId])
{
f.IsOverForecastedJan = true;
}
}
There are lots of things you can do to speed this up. First up, make a single pass over forecastList and sum the forecast demand for each month:
var demandForecasts = new Dictionary<int, double?[]>();
foreach (var forecast in forecastList)
{
var rid = forecast.ResourceId;
if (!demandForecasts.ContainsKey(rid))
{
demandForecasts[rid] = new double?[12];
}
var demandForecast = demandForecasts[rid];
demandForecast[0] += forecast.ForecastJan;
// etc
demandForecast[11] += forecast.ForecastDec;
}
Do the same for capacities, resulting in a capacities dictionary. Then, one more loop over forecastList to set the "over forecasted" flags:
foreach (var forecast in forecastList)
{
var rid = forecast.ResourceId;
forecast.IsOverForecastedJan = capacities[rid][0] < demandForecasts[rid][0];
// ...
forecast.IsOverForecastedDec = capacities[rid][11] < demandForecasts[rid][11];
}
As is obvious from the twelve-fold code repetition implicit in this code, modelling capacities etc. as separate properties for each month is probably not the best way of doing things; using some kind of indexed collection would allow the repetition to be eliminated, as sketched below.
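A minimal sketch of that idea (my illustration, not from the original answer): store the twelve monthly values in arrays indexed by month, so the comparison collapses to one loop. Here monthlyForecasts is a hypothetical List<MonthlyForecast>, and capacities/demandForecasts are the Dictionary<int, double?[]> structures built above:
public class MonthlyForecast
{
    public int ResourceId { get; set; }
    public double?[] Forecast { get; } = new double?[12];   // index 0 = Jan
    public bool[] IsOverForecasted { get; } = new bool[12];
}

foreach (var forecast in monthlyForecasts)
{
    var rid = forecast.ResourceId;
    for (int month = 0; month < 12; month++)
    {
        // nullable comparison: false if either side is null
        forecast.IsOverForecasted[month] =
            capacities[rid][month] < demandForecasts[rid][month];
    }
}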