Get Distinct List from csv file based on first column - c#

I have the following list of strings taken from a csv file...
List<string> listOfRecords;
Each line is a string in the list...
one,bob,black
two,steve,smith
three,bill,brown
one,jill,brown
one,sue,smith
I would like to remove duplicates based on the first value on each line. Resulting in...
one,bob,black
two,steve,smith
three,bill,brown
I thought the code would look something like....
distinctlist = Select listOfRecords.split(',')[0].distinct
This is obviously wrong, but I wanted to avoid making a list of lists and doing it that way. I thought LINQ would be simpler.
All the posts I can find on here seem quite complex or do not address the specifics of my question. Any help would be greatly appreciated...

Simple with a GroupBy:
var distinctByFirstColumn = listOfRecords
.GroupBy(x => x.Split(',')[0])
.Select(x => x.First());
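As a rough illustration, here is the GroupBy approach applied to the sample data from the question, written as a small top-level program (C# 9 or later is an assumption; the variable names are only illustrative). GroupBy yields groups in the order their keys first appear, so the first line seen for each key is the one kept.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sample data copied from the question.
var listOfRecords = new List<string>
{
    "one,bob,black",
    "two,steve,smith",
    "three,bill,brown",
    "one,jill,brown",
    "one,sue,smith",
};

// Group by the text before the first comma and keep the first line of each group.
var distinctByFirstColumn = listOfRecords
    .GroupBy(line => line.Split(',')[0])
    .Select(group => group.First())
    .ToList();

foreach (var line in distinctByFirstColumn)
    Console.WriteLine(line);
```

On .NET 6 or later, listOfRecords.DistinctBy(line => line.Split(',')[0]) should give the same lines with less ceremony.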

I'd rather use HashSet<String> and a simple foreach loop instead of LINQ (which, IMHO, is overkill) here:
var distinctList = new List<String>();
HashSet<String> taken = new HashSet<String>();
foreach (var line in listOfRecords)
    // you don't want to split the whole line, only take the 1st item
    if (taken.Add(line.Substring(0, line.IndexOf(','))))
        distinctList.Add(line);
Edit: In case of a real csv file:
private static IEnumerable<String> CsvDistinctLines(String fileName) {
    HashSet<String> taken = new HashSet<String>();
    foreach (var line in File.ReadLines(fileName))
        if (taken.Add(line.Substring(0, line.IndexOf(','))))
            yield return line;
}
...
var distinctList = CsvDistinctLines(@"C:\MyFile.csv").ToList();
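One thing to watch with the HashSet approach: IndexOf(',') returns -1 for a line with no comma, and Substring(0, -1) then throws. The sketch below is one possible way to guard against that. Treating a comma-less line as a single-column record is an assumption, and DistinctByFirstColumn is a hypothetical helper name, not part of the answer above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: returns the first line seen for each distinct first-column value.
// Assumption: a line without a comma is treated as a single-column record.
static IEnumerable<string> DistinctByFirstColumn(IEnumerable<string> lines)
{
    var taken = new HashSet<string>();
    foreach (var line in lines)
    {
        int comma = line.IndexOf(',');
        string key = comma < 0 ? line : line.Substring(0, comma);
        if (taken.Add(key))   // Add returns false when the key was already present
            yield return line;
    }
}

var sample = new[] { "one,bob,black", "two,steve,smith", "one,jill,brown", "solo", "solo" };
var distinct = new List<string>(DistinctByFirstColumn(sample));
Console.WriteLine(string.Join(" | ", distinct));
```

The same helper works over File.ReadLines(fileName), since it only needs an IEnumerable<string>.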

Related

How to efficiently split a string, add it to a List, and convert to a Double?

I have some basic code that does what I want it to do, but I think it can be abbreviated / cleaned up. I am struggling a bit on how to do so, however.
The code reads as follows:
List<string> positions = new List<string>();
List<string> players = new List<string>();
foreach (string element in fractionedList)
{
positions.Add(element.Split(',')[2]);
positions.Add(element.Split(',')[3]);
positions.Add(element.Split(',')[4]);
players.Add(element.Split(',')[5]);
players.Add(element.Split(',')[6]);
players.Add(element.Split(',')[7]);
}
List<double> convertedPositions = positions.Select(x => double.Parse(x)).ToList();
List<double> convertedPlayers = players.Select(x => double.Parse(x)).ToList();
For reference, my fractionedList will look something like:
"string0,string1,string2,string3,string4,string5,string6,string7,string8,string9,string10,string11,string12",
"string0,string1,string2,string3,string4,string5,string6,string7,string8,string9,string10,string11,string12",
"string0,string1,string2,string3,string4,string5,string6,string7,string8,string9,string10,string11,string12",
"string0,string1,string2,string3,string4,string5,string6,string7,string8,string9,string10,string11,string12",
So I am trying to split each string instance of the List, get the next three elements, and then add them to a new List and then convert that List to a new List of doubles. I am wondering if there is a cleaner way to handle the Split method. Is there an equivalent to Take()? Also, can this all be done in one List creation, rather than creating a List of strings, creating a List of doubles?
The first thing I would change is to not split your string 6 times for no reason. Split it once and store the result in a variable.
With a little LINQ you could shorten your code:
List<double> positions = new List<double>();
List<double> players = new List<double>();
foreach (string element in fractionedList)
{
    // split once and reuse the result
    string[] elementSplit = element.Split(',');
    positions.AddRange(elementSplit.Skip(2).Take(3).Select(x => double.Parse(x)));
    players.AddRange(elementSplit.Skip(5).Take(3).Select(x => double.Parse(x)));
}
My code splits your element variable on ',' as you were doing (but now only once). Then, using LINQ's Skip() and Take(), it selects indices [2,3,4] and [5,6,7] and adds them to their respective lists (after parsing to double).
Keep in mind that this code will throw an exception if any input string cannot be parsed into a double. If you are certain that the input will always be good, then this code should get you there the quickest.
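If the input cannot be trusted, double.TryParse is one way to avoid the exception. The following is only a sketch under assumptions: the sample rows are made up, bad fields are collected in a hypothetical rejected list rather than handled in any particular way, and CultureInfo.InvariantCulture is chosen so that "." is the decimal separator regardless of machine settings.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical input: columns 2-4 hold the positions; the second row has a bad value.
var fractionedList = new List<string>
{
    "a,b,1.5,2.5,3.5,x",
    "a,b,4.5,oops,6.5,x",
};

var positions = new List<double>();
var rejected = new List<string>();

foreach (string element in fractionedList)
{
    foreach (string field in element.Split(',').Skip(2).Take(3))
    {
        // TryParse reports failure through its return value instead of throwing.
        if (double.TryParse(field, NumberStyles.Float, CultureInfo.InvariantCulture, out double value))
            positions.Add(value);
        else
            rejected.Add(field);
    }
}

Console.WriteLine(string.Join(", ", positions.Select(p => p.ToString(CultureInfo.InvariantCulture))));
Console.WriteLine("Rejected: " + string.Join(", ", rejected));
```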
This would perform the conversion inline, without the need to store the values in an initial string list:
List<double> convertedPositions = new List<double>();
List<double> convertedPlayers = new List<double>();
foreach (string element in fractionedList)
{
    var elements = element.Split(',');
    convertedPositions.AddRange(elements.Skip(2).Take(3).Select(x => Convert.ToDouble(x)));
    convertedPlayers.AddRange(elements.Skip(5).Take(3).Select(x => Convert.ToDouble(x)));
}

Remove lines from List<String[]> using Linq, if meeting a certain criteria

I've searched around for a solution to this question but can't find an applicable circumstance and can't get my head around it either.
I've got a List<String[]> object (a parsed CSV file) and want to remove any rows if the first value in the row is equal to my criteria.
I've tried the following (with variations) and can't seem to get it to delete the lines, it just passes over them:
rows.RemoveAll(s => s[0].ToString() != "Test");
Which I'm currently reading as, remove s if s[0] (the first value in the row) does not equal "Test".
Can someone point me in the right direction for this?
Thanks, Al.
Edit for wider context / better understanding:
The code is as follows:
private void CleanUpCSV(string path)
{
List<string[]> rows = File.ReadAllLines(path).Select(x => x.Split(',')).ToList();
rows.RemoveAll(s => s[0] != "Test");
using (StreamWriter writer = new StreamWriter(path, false))
{
foreach (var row in rows)
{
writer.WriteLine(row);
}
}
}
So the question is -> Why won't this remove the lines that do not start with "Test" and upon writing, why is it returning System.String[] as all the values?
Did you try with Where? Where is going to filter based on a predicate. You should be able to do something like this:
List<string[]> rows = new List<string[]> { new []{"Test"}, new []{ "Foo"} };
rows = rows.Where(s => s[0] == "Test").ToList();
foreach(var row in rows)
{
Console.WriteLine(string.Join(",", row));
}
output
Test
You don't need ToString() because s[0] is already a string.
You may want to handle the empty case, or s[0] could throw.
You can use s.First() instead of s[0].
You can learn more about Predicate on MSDN.
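For the empty-row case mentioned above, one possible guard is FirstOrDefault(), which returns null for an empty array where s[0] or s.First() would throw. This is a minimal sketch with made-up rows; an empty array could arise, for example, from splitting a blank line with StringSplitOptions.RemoveEmptyEntries.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sample rows, including an empty one.
var rows = new List<string[]>
{
    new[] { "Test", "1" },
    new string[0],
    new[] { "Foo", "2" },
    new[] { "Test", "3" },
};

// FirstOrDefault() yields null for the empty row, so the comparison is simply false.
var kept = rows.Where(s => s.FirstOrDefault() == "Test").ToList();

foreach (var row in kept)
    Console.WriteLine(string.Join(",", row));
```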
Edit
For your example:
private void CleanUpCSV(string path)
{
var rows = File.ReadAllLines(path).Select(x => x.Split(','));
using (StreamWriter writer = new StreamWriter(path, false))
{
foreach (var row in rows.Where(s => s[0] == "Test"))
{
writer.WriteLine(string.Join(",", row));
}
}
}
By the way, you may want to use a library to handle csv parsing. I personally use CsvHelper
The only error in your code is the following:
Since row is string[] this
writer.WriteLine(row);
won't give you the result you were expecting.
Change it like this
writer.WriteLine(String.Join(",", row));
To convert the string[] back into its original form.
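The difference between the two calls can be seen in a small sketch. It uses a StringWriter in place of the StreamWriter so that nothing touches the disk (that substitution is an assumption made for the example). Passing a string[] selects the WriteLine(object) overload, which writes the result of ToString(), i.e. the type name.

```csharp
using System;
using System.IO;

string[] row = { "Test", "1", "abc" };
var writer = new StringWriter();

writer.WriteLine(row);                    // WriteLine(object): writes row.ToString()
writer.WriteLine(string.Join(",", row));  // writes the joined values

Console.Write(writer.ToString());
```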
None of the other "optimisations" proposed in the answers here are really optimal either.
If you're really trying to remove items where the first element isn't "Test", then your code should work, though you don't need to call .ToString() on s[0] since it's already a string. If this doesn't work for you, perhaps your problem lurks elsewhere? If you give an example of your code in a wider context you could get more help
Filter it like this instead:
var filteredList = rows.Where(s => s[0] == "Test").ToArray();

Sorting files in a directory by date in C# using Directory.GetFiles()

At the moment I have my code to get some files from a Dir.
foreach (var file in
Directory.GetFiles(MainForm.DIRECTORY_PATH, "*.csv"))
{
//Process File
string[] values = File.ReadAllLines(file)
.SelectMany(lineRead => lineRead.Split(',')
.Select(s => s.Trim()))
.ToArray();
I want to be able to order these files by date before I start reading and processing them.
I looked at a suggestion on MSDN to use DirectoryInfo:
DirectoryInfo DirInfo = new DirectoryInfo(MainForm.DIRECTORY_PATH);
var filesInOrder = from f in DirInfo.EnumerateFiles()
orderby f.CreationTime
select f;
foreach (var item in filesInOrder)
{
//Process File
string[] values = File.ReadAllLines(item )
.SelectMany(lineRead => lineRead.Split(',')
.Select(s => s.Trim()))
.ToArray();
}
This doesn't work, however, as System.IO.File.ReadAllLines(item) gets red-lined with an error, as item is a string and not an actual file. :(
Does anyone know a solution to this or has had a similar issue? :)
Regards
J.
From MSDN: File.ReadAllLines(string path) takes a file path as input.
Opens a text file, reads all lines of the file, and then closes the file.
You have to pass the file path:
string[] values = File.ReadAllLines(item.FullName)
Your code:
foreach (var item in filesInOrder)
{
string[] values = File.ReadAllLines(item.FullName)
...............................
...............................
}
You can replace your whole chunk with the following code using lambda expressions:
var values = DirInfo.EnumerateFiles().OrderBy(f => f.CreationTime)
.Select(x => File.ReadAllLines(x.FullName)
.SelectMany(lineRead => lineRead.Split(',')
.Select(s => s.Trim())).ToArray()
);
Your first code snippet reads all lines in one file, whereas the second one reads from all files in the directory. So it is not very clear what you want to do.
The second code snippet cannot work, because the variable values is declared inside the loop. Its visibility scope is limited to the code block of the loop. The result will therefore never be visible outside of the loop.
var filesInOrder = from f in DirInfo.EnumerateFiles() ...;
var items = new List<string>();
foreach (FileInfo f in filesInOrder) {
using (StreamReader sr = f.OpenText()) {
while (!sr.EndOfStream) {
items.AddRange(sr.ReadLine().Split(','));
}
}
}
Here I define a List<string> before the loop that will hold all the items of all files. We need two loops: one that loops over the files (foreach) and one that reads the lines in each file and successively adds items to the list (while).
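Putting the pieces together, a self-contained sketch might look like the following. The scratch directory and file names are hypothetical. The sketch also orders by LastWriteTimeUtc rather than the CreationTime used above, purely so the timestamps can be set explicitly and the result is reproducible. Swap in CreationTime if that is the date you care about.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical scratch directory so the sketch does not depend on existing files.
string dir = Path.Combine(Path.GetTempPath(), "csv-order-" + Guid.NewGuid().ToString("N"));
Directory.CreateDirectory(dir);
File.WriteAllText(Path.Combine(dir, "newer.csv"), "c, d");
File.WriteAllText(Path.Combine(dir, "older.csv"), "a, b");

// Explicit timestamps, so the order does not depend on how quickly the files were written.
File.SetLastWriteTimeUtc(Path.Combine(dir, "older.csv"), new DateTime(2020, 1, 1, 0, 0, 0, DateTimeKind.Utc));
File.SetLastWriteTimeUtc(Path.Combine(dir, "newer.csv"), new DateTime(2021, 1, 1, 0, 0, 0, DateTimeKind.Utc));

var items = new List<string>();
var filesInOrder = new DirectoryInfo(dir)
    .EnumerateFiles("*.csv")
    .OrderBy(f => f.LastWriteTimeUtc);

foreach (FileInfo file in filesInOrder)
{
    // FileInfo.FullName is the string path that the File.* methods expect.
    items.AddRange(File.ReadLines(file.FullName)
        .SelectMany(line => line.Split(','))
        .Select(s => s.Trim()));
}

Console.WriteLine(string.Join(",", items));
Directory.Delete(dir, recursive: true);
```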

Is there a better method of calling a comparison over a list of objects in C#?

I am reading in lines from a large text file. Among these lines are occasional strings from a preset list of possibilities, and I wish to check the line currently being read for a match against any string in that list. If there is a match, I simply want to append it to a different list and continue the loop I am using to read the file.
I was just wondering if there is a more efficient way to do a line.Contains() or equivalence check against, say, the first element in the list, then the second, etc., without using a nested loop or a long if statement filled with "or"s.
Example of what I have now:
List<string> possible = new List<string> {"Cat", "Dog"};
using (StreamReader sr = new StreamReader(someFile))
{
    string aLine;
    while ((aLine = sr.ReadLine()) != null)
    {
        if (...)
        {
            foreach (string element in possible)
            {
                if (aLine.Contains(element))
                {
                    // add to some other list
                    continue;
                }
            }
            // other stuff
        }
    }
}
I don't know about more efficient run-time wise, but you can eliminate a lot of code by using LINQ:
otherList.AddRange(File.ReadAllLines(someFile)
    .Where(line => possible.Any(p => line.Contains(p))));
I guess you are looking for:
if(possible.Any(r=> line.Contains(r)))
{
}
You can separate your work to Get Data and then Analyse Data. You don't have to do it in the same loop.
After reading lines, there are many ways to filter them. The most readable and maintainable, IMO, is to use LINQ.
You can change your code to this:
// get lines
var lines = File.ReadLines("someFile");
// what I am looking for
var clues = new List<string> { "Cat", "Dog" };
// filter 1. Are there clues? This is if you only want to know
var haveCluesInLines = lines.Any(l => clues.Any(c => l.Contains(c)));
// filter 2. Get lines with clues
var linesWithClues = lines.Where(l => clues.Any(c => l.Contains(c)));
Edit:
Most likely you will have few clues and many lines. This example checks each line against the clues and, because Any() stops at the first match, skips the remaining clues for that line, saving time.
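To make the behaviour of the nested Any() concrete, here is a small sketch over an in-memory array standing in for the file's lines (the sample lines and the checks counter are assumptions added for illustration). The counter shows that Any() stops testing clues for a line as soon as one matches.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// In-memory stand-in for the lines of the file.
var lines = new[] { "The Cat sat", "nothing here", "Dog and Cat", "a bird" };
var clues = new List<string> { "Cat", "Dog" };

int checks = 0;  // counts how many Contains calls are actually made
var linesWithClues = lines
    .Where(l => clues.Any(c => { checks++; return l.Contains(c); }))
    .ToList();

Console.WriteLine(string.Join(" | ", linesWithClues));
Console.WriteLine(checks);
```

With this data, six Contains calls are made rather than the eight a full nested loop would need, because the two matching lines stop at their first matching clue.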

A filter/search query where results must contain all queryTerms

Can anyone suggest how this can be improved?
public IEnumerable<Person> FindPersons(string queryTerms)
{
if (queryTerms == null)
return new List<Person>();
var queryTermsList = queryTerms.Split(' ').ToList();
var first = queryTermsList.First();
queryTermsList.Remove(first);
var people = FindPerson(first);
foreach (var queryTerm in queryTermsList)
{
people = people.Intersect(FindPerson(queryTerm));
}
return people;
}
Basically what it does is searches for people that contain EVERY queryTerm within the queryTermList.
Because results have to contain ALL terms I used Intersect.
Because I was using intersect I had to do an initial search for the first query term outside the foreach loop so the intersect within the loop would have something to intersect with. Otherwise you’d obviously always get empty results.
This meant I then needed to remove the first query term from the list before entering the foreach loop.
Ok, so this works. It just seems there must be a more elegant way of writing this.
Any suggestions?
You can just start with the entire collection and intersect all of the terms with that:
var people = AllPeople;
foreach (var queryTerm in queryTermsList)
{
people = people.Intersect(FindPerson(queryTerm));
}
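The same idea can also be folded into a single Aggregate call, seeded with the whole collection. This is only a sketch under assumptions: plain strings stand in for Person, the allPeople list and the body of FindPerson are hypothetical, and blank input returns an empty sequence.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical data: plain strings stand in for the question's Person objects.
var allPeople = new List<string> { "Ann Smith London", "Bob Smith Paris", "Ann Jones Paris" };

// Hypothetical stand-in for FindPerson: every person whose text contains the term.
IEnumerable<string> FindPerson(string term) =>
    allPeople.Where(p => p.Contains(term, StringComparison.OrdinalIgnoreCase));

IEnumerable<string> FindPersons(string queryTerms)
{
    if (string.IsNullOrWhiteSpace(queryTerms))
        return Enumerable.Empty<string>();

    var terms = queryTerms.Split(' ', StringSplitOptions.RemoveEmptyEntries);

    // Seed with the whole collection, then narrow it once per term.
    return terms.Aggregate(
        (IEnumerable<string>)allPeople,
        (people, term) => people.Intersect(FindPerson(term)));
}

var result = FindPersons("ann paris").ToList();
Console.WriteLine(string.Join("; ", result));
```

Seeding with the full collection removes the need to special-case the first term, at the cost of one extra Intersect against everything.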
