Get files with same/similar name from array - C#

I have multiple objects in an array, each named in the format:
id_name_date_filetype.
I need to take all the objects with, let's say, the same id or the same name and insert them into a new array.
With the GetFiles method I already have all the objects in one array, and I have their names, but I don't know how to differentiate them.
I have a foreach in which I'll be going through all the objects, but I'm kind of stuck.
Any hints as to what I should do?
//Process the files
string[] filelist = Directory.GetFiles(SourceDirectory, "*.tsv*", SearchOption.TopDirectoryOnly).Select(filename => Path.GetFullPath(filename)).Distinct().ToArray();
foreach (string file in filelist)
{
string[] fileNameSplit = file.Split('_');
switch (fileNameSplit.Last().ToLower())
{
case "assets.tsv":
assets = ReadDataFromCsv<Asset>(file);
break;
case "financialaccounts.tsv":
financialAccounts = ReadDataFromCsv<FinancialAccount>(file);
break;
case "households.tsv":
households = ReadDataFromCsv<Household>(file);
break;
case "registrations.tsv":
registrations = ReadDataFromCsv<Registration>(file);
break;
case "representatives.tsv":
representatives = ReadDataFromCsv<Representative>(file);
break;
}
}
// Find all files from one firm and insert them in a list
foreach (string file in filelist)
{
}

Here is a LINQ approach, as I proposed in my comment:
First get all distinct IDs from your filelist
string[] allDistinctIDs = filelist.Select(x => Path.GetFileName(x).Split('_').First()).Distinct().ToArray();
Now you can iterate through the list of IDs and compare each value
for (int i = 0; i < allDistinctIDs.Length; i++)
{
// Path.GetFileName keeps underscores in the directory path out of the ID
string[] allSameIDStrings = filelist.Where(x => Path.GetFileName(x).Split('_').First() == allDistinctIDs[i]).ToArray();
}
Basically you split every file name by '_' and compare the first part (the ID) of the string with each item from your list of distinct IDs.
Another approach would be to use GroupBy.
// example input
string[] filelist = {
"123_Name1_xxx_Assets.tsv",
"456_Name2_xxx_Assets.tsv",
"123_Name3_xxx_HouseHolds.tsv",
"456_Name4_xxx_HouseHolds.tsv"};
IEnumerable<IGrouping<string, string>> ID_Groups = filelist.GroupBy(x=>x.Split('_').First());
This would give you a collection of all filenames grouped by the ID:
at each position in ID_Groups is a list of items with the same ID. You can filter them by fileName:
foreach (var id_group in ID_Groups)
{
assets = ReadDataFromCsv<Asset>(id_group.FirstOrDefault(x=>x.ToLower().Contains("assets.tsv")));
// and so on
households = ReadDataFromCsv<Household>(id_group.FirstOrDefault(x=>x.ToLower().Contains("households.tsv")));
}
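If you want the groups for either the id or the name, the GroupBy idea can be wrapped in a small helper. This is only a sketch: FileGrouping and GroupByPart are made-up names, and it assumes the parts of the file name are always separated by '_'. Files with too few parts are skipped instead of throwing.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class FileGrouping
{
    // Hypothetical helper: groups paths by one '_'-separated part of the file name.
    // partIndex 0 is the id, partIndex 1 is the name (id_name_date_filetype).
    public static Dictionary<string, List<string>> GroupByPart(IEnumerable<string> paths, int partIndex)
    {
        return paths
            // Split the file name only, so underscores in folder names are ignored
            .Select(p => new { FullPath = p, Parts = Path.GetFileName(p).Split('_') })
            // Skip names that do not have the requested part
            .Where(x => x.Parts.Length > partIndex)
            .GroupBy(x => x.Parts[partIndex], x => x.FullPath)
            .ToDictionary(g => g.Key, g => g.ToList());
    }
}
```

With this, FileGrouping.GroupByPart(filelist, 0) gives one list per id and FileGrouping.GroupByPart(filelist, 1) gives one list per name.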

You gotta define what "similar" means to you. Is it the initial letter of the file name? Half of it? The whole filename?
This snippet should do more or less what you want without using LINQ or anything more complex than loops.
// filelist is the string[] returned by Directory.GetFiles in the question
string IDOffileNameIWant = Path.GetFileName(filelist[0]).Split('_')[0];
List<string> arrayThatContainsSimilar = new List<string>();
foreach (string file in filelist)
{
if (Path.GetFileName(file).Split('_')[0].Contains(IDOffileNameIWant))
{
arrayThatContainsSimilar.Add(file);
}
}
It's very basic and can be refined, but you gotta give more details on the exact result you want to obtain.
Since you're still struggling, here's a working example:
List<string> files = new List<string>() {
"123_novica_file1", "123_novica_file3", "123_novica_file2", "456_myfilename_file1",
"789_myfilename_file1", "101_novica_file2", "102_novica_file3"};
List<string> filesbyID = new List<string>();
List<string> filesbyName = new List<string>();
string theIDPattern = "123";
string theFileNamePattern = "myfilename";
foreach(var file in files)
{
//splitting the filename and checking by ID
if(file.Split('_')[0].Contains(theIDPattern))
{
filesbyID.Add(file);
}
//splitting the filename and checking by name
if (file.Split('_')[1].Contains(theFileNamePattern))
{
filesbyName.Add(file);
}
}
Result:
files by id:
123_novica_file1
123_novica_file3
123_novica_file2
files by name:
456_myfilename_file1
789_myfilename_file1

Related

LINQ or Lambda for two for loops

The code I have written works fine; this question is purely for educational purposes. I want to know how others would do this better and more cleanly. I especially dislike the way I use two nested loops to get the data. There has to be a more efficient way.
I tried to do it with LINQ, but one of the collections is a list of class instances and the other is just a string[], so I couldn't figure out how to use it.
I have a document name table in my SQL database and files in a content folder.
I have two lists: ListOfFileNamesSavedInTheDB and ListOfFileNamesInTheFolder.
Basically, I get all file names saved in the database and check whether each one exists in the folder; if it does not, I delete the file name from the database.
var clientDocList = documentRepository.Documents.Where(c => c.ClientID == clientID).ToList();
if (Directory.Exists(directoryPath))
{
string[] fileList = Directory.GetFiles(directoryPath).Select(Path.GetFileName).ToArray();
foreach (var clientDoc in clientDocList)
{
bool fileNotExist = true;
foreach (var file in fileList)
{
if (clientDoc.DocFileName.Trim().ToUpper()==file.ToUpper().Trim())
{
fileNotExist = false;
break;
}
}
if (fileNotExist)
{
documentRepository.Delete(clientDoc);
}
}
}
I am not exactly sure how you want your code to work, but I believe you need something like this:
//string TextResult = "";
ClientDocList documentRepository = GetClientDocList();
var directoryPath = "";
var clientID = 1;
var clientDocList = documentRepository.Documents.Where(c => c.ClientID == clientID).ToList();
if (Directory.Exists(directoryPath) || true) // I need to pass your condition
{
string[] files = new string[] { "file1", "file5", "file6" };
List<string> fileList = files.Select(x => x.Trim().ToUpper()).ToList(); // I like working with lists, if you want an array it's ok
foreach (var clientDoc in clientDocList.Where(c => !fileList.Contains(c.DocFileName.Trim().ToUpper())))
{
//TextResult += $" {clientDoc.DocFileName} does not exists so you have to delete it from db";
documentRepository.Delete(clientDoc);
}
}
//Console.WriteLine(TextResult);
To be honest, I really don't like this line
fileList = files.Select(x => x.Trim().ToUpper()).ToList()
so I would suggest you add a helper function comparing the list of file names to the specific file name
public static bool TrimContains(List<string> names, string name)
{
return names.Any(x => x.Trim().Equals(name.Trim(), StringComparison.InvariantCultureIgnoreCase));
}
and your final code would become
List<string> fileList = new List<string>() { "file1", "file5", "file6" };
foreach (var clientDoc in clientDocList.Where(c => !TrimContains(fileList, c.DocFileName)))
{
//TextResult += $" {clientDoc.DocFileName} does not exists so you have to delete it from db";
documentRepository.Delete(clientDoc);
}
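TrimContains still walks the whole file list for every document. If that ever becomes a problem, one option is to put the file names into a HashSet with a case-insensitive comparer once and look each document up in it. A minimal sketch, assuming the names only need trimming and case-insensitive matching as in the question; OrphanFinder and FindMissing are made-up names, and the getName delegate is there because one list holds objects and the other plain strings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OrphanFinder
{
    // Hypothetical helper: returns the items whose name is not among fileNames.
    // Names are trimmed and compared without regard to case.
    public static List<T> FindMissing<T>(
        IEnumerable<T> items, Func<T, string> getName, IEnumerable<string> fileNames)
    {
        // Build the lookup once; Contains is then a hash lookup, not a scan
        var existing = new HashSet<string>(
            fileNames.Select(f => f.Trim()), StringComparer.OrdinalIgnoreCase);

        return items.Where(i => !existing.Contains(getName(i).Trim())).ToList();
    }
}
```

In the question's code this would be used as OrphanFinder.FindMissing(clientDocList, d => d.DocFileName, fileList), followed by documentRepository.Delete for each returned document.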
Instead of retrieving all documents from the database and doing the check in memory, I suggest checking which documents don't exist in the folder in one query:
if (Directory.Exists(directoryPath))
{
var fileList = Directory.GetFiles(directoryPath).Select(Path.GetFileName);
var clientDocList = documentRepository.Documents.Where(c => c.ClientID == clientID && !fileList.Contains(c.DocFileName.Trim())).ToList();
documentRepository.Documents.RemoveRange(clientDocList);
}
Note: this is just a sample to demonstrate the idea; it may have a syntax error somewhere since I don't have an IDE with me at the moment. But the idea is there.
This code is not only shorter but also more efficient, since a single query retrieves only the documents to delete from the database. I assume the number of files in the folder is not too large for EF to translate into SQL.

Extract a pattern from a group of filenames and place into listbox

I have a folder with a lot of files like this:
2016-01-02-03-abc.txt
2017-01-02-03-defjh.jpg
2018-05-04-03-hij.txt
2022-05-04-03-klmnop.jpg
I need to extract the pattern from each group of filenames.
For example, I need the pattern 01-02-03 from the first two files placed in a list. I also need the pattern 05-04-03 placed in the same list. So, my list will look like this:
01-02-03
05-04-03
Here is what I have so far. I can successfully remove the characters but getting one instance of a pattern back into a list is beyond my pay grade:
public void GetPatternsToList()
{
//Get all filenames with characters removed and place in listbox.
List<string> files = new List<string>(Directory.EnumerateFiles(folderBrowserDialog1.SelectedPath));
foreach (var file in files)
{
var removeallbeforefirstdash = file.Substring(file.IndexOf("-") + 1); // removes everything before the first dash in the filename
var finalfile = removeallbeforefirstdash.Substring(0,removeallbeforefirstdash.LastIndexOf("-")); // removes everything after the last dash in the name -- will crash if a file without a dash is in the folder (not sure how to fix this either)
string[] array = finalfile.ToArray(); // I need to do the above with each file in the list and then place it back in an array to display in a listbox
List<string> filesList = array.ToList();
listBox1.DataSource = filesList;
}
}
You could do it this way:
public void GetPatternsToList()
{
var files = Directory.GetFiles(folderBrowserDialog1.SelectedPath);
var patterns = new HashSet<string>();
foreach (var file in files)
{
// use the file name only, so dashes in the folder path are ignored
var splitFileName = Path.GetFileName(file).Split('-').Skip(1).Take(3);
var joinedFileName = string.Join("-", splitFileName);
if (!string.IsNullOrEmpty(joinedFileName))
patterns.Add(joinedFileName);
}
// DataSource needs an IList, which HashSet<string> does not implement
listBox1.DataSource = patterns.ToList();
}
I used a HashSet<string> in order to avoid adding duplicate patterns to the DataSource.
A few remarks that aren't related to your question, but your code in general:
I would pass the SelectedPath as a string to the method
I would let the method return you the HashSet
If you implement the above, please also name the method accordingly
All of the above is of course optional for you, but would improve your code quality.
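The remarks above could look roughly like this. It is a sketch, not the one right way: PatternExtractor and GetPatterns are names I made up, and I assume a usable file name has at least five dash-separated parts (year, three pattern parts, rest). Anything shorter is skipped, which also avoids the crash on files without a dash.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class PatternExtractor
{
    // Hypothetical helper: returns the distinct patterns made of the second,
    // third and fourth dash-separated parts of each file name,
    // e.g. "2016-01-02-03-abc.txt" gives "01-02-03".
    public static HashSet<string> GetPatterns(IEnumerable<string> paths)
    {
        var patterns = new HashSet<string>();
        foreach (var path in paths)
        {
            var parts = Path.GetFileNameWithoutExtension(path).Split('-');
            if (parts.Length < 5)
                continue; // assumption: not one of the dated files, so ignore it
            patterns.Add(string.Join("-", parts.Skip(1).Take(3)));
        }
        return patterns;
    }
}
```

The caller then decides what to do with the result, for example listBox1.DataSource = PatternExtractor.GetPatterns(Directory.EnumerateFiles(selectedPath)).ToList(); where selectedPath is whatever folder the user picked.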
Try this:
public void GetPatternsToList()
{
List<string> files = new List<string>(Directory.EnumerateFiles(folderBrowserDialog1.SelectedPath));
List<string> resultFiles = new List<string>();
foreach (var file in files)
{
var removeallbeforefirstdash = file.Substring(file.IndexOf("-") + 1); // removes everything before the first dash in the filename
var finalfile = removeallbeforefirstdash.Substring(0, removeallbeforefirstdash.LastIndexOf("-")); // removes everything after the last dash in the name -- will crash if a file without a dash is in the folder (not sure how to fix this either)
resultFiles.Add(finalfile);
}
listBox1.DataSource = resultFiles.Distinct().ToList();
}

How to compare filenames, then classify uniquely from folder into ListBox if repeated

I have JSON files that I'm trying to classify. The file names look like this:
inputTestingSetting_test
inputTestingSetting_test1310
inputTestingSetting_test1310_ckf
inputTestingSetting_test1310_ols
inputTestingSetting_test1310_sum
inputTestingSetting_test1311_ckf
inputTestingSetting_test1311_ols
inputTestingSetting_test1311_sum
So the output that i want in the ListBox lbJsonFileNames will be
test
test1310
test1311
currently my codes are
DirectoryInfo dInfo = new DirectoryInfo(tbJSFolder.Text);
FileInfo[] Files = dInfo.GetFiles("*.json");
List<jSonName> jsonName = new List<jSonName>();
foreach (FileInfo file in Files)
{
string filename = Path.GetFileNameWithoutExtension(file.Name);
string[] fileNameSplit = filename.Split('_');
jsonName = new List<jSonName>{
new jSonName(fileNameSplit[0],fileNameSplit[1])
};
for(int i=0;i<jsonName.Count;i++)
{
if(jsonName[i].TestNumber == fileNameSplit[1])
{
lbJsonFileNames.Items.Add(jsonName[i].TestNumber);
}
}
}
So the output in lbJsonFileNames contains what I want, but with repeats. Is it possible to show each value just once? I tried comparing jsonName[i].TestNumber to jsonName[i+1].TestNumber, but that failed because the index is out of range.
Is there a way to read the file names and compare each one with the previous file name to see if it is the same? If it is the same, ignore it and move on to the next file name; if it is different, add it to the ListBox.
I changed my code to
DirectoryInfo dInfo = new DirectoryInfo(tbJSFolder.Text);
FileInfo[] Files = dInfo.GetFiles("*.json");
List<jSonName> jsonName = new List<jSonName>();
HashSet<string> fileNames = new HashSet<string>();
foreach (FileInfo file in Files)
{
string filename = Path.GetFileNameWithoutExtension(file.Name);
string[] fileNameSplit = filename.Split('_');
fileNames.Add(fileNameSplit[1]);
}
foreach(var value in fileNames)
{
lbJsonFileNames.Items.Add(value);
}
Got what I want now, thanks all~
Your code basically says to put the following into list box:
test
test1310
test1310
test1310
test1310
test1311
test1311
test1311
Before you add an item, as in lbJsonFileNames.Items.Add(jsonName[i].TestNumber);, check for duplicates first. You could put the values into a HashSet<string>, which automatically discards duplicates, and then add the contents of the HashSet to lbJsonFileNames.
Your code does not show what the jSonName class looks like or what its constructor parameters stand for. However, getting your output from your input can be much easier:
string[] all = Directory.GetFiles(tbJSFolder.Text, "*.json")
.Select(x => Path.GetFileNameWithoutExtension(x))
.Select(x => x.Split(new char[] { '_' })[1])
.Distinct().ToArray();
lbJsonFileNames.Items.AddRange(all);

How to efficiently compare two lists with 500k objects and strings

So I have a main directory with subfolders and around 500k images. I know a lot of these images do not exist in my database, and I want to know which ones so that I can delete them.
This is the code i have so far:
var listOfAdPictureNames = ImageDB.GetAllAdPictureNames();
var listWithFilesFromImageFolder = ImageDirSearch(adPicturesPath);
var result = listWithFilesFromImageFolder.Where(p => !listOfAdPictureNames.Any(q => p.FileName == q));
var differenceList = result.ToList();
listOfAdPictureNames is of type List<string>.
Here is my model that I'm returning from ImageDirSearch:
public class CheckNotUsedAdImagesModel
{
public List<ImageDirModel> ListWithUnusedAdImages { get; set; }
}
public class ImageDirModel
{
public string FileName { get; set; }
public string Path { get; set; }
}
and here is the recursive method to get all images from my folder.
private List<ImageDirModel> ImageDirSearch(string path)
{
string adPicturesPath = ConfigurationManager.AppSettings["AdPicturesPath"];
List<ImageDirModel> files = new List<ImageDirModel>();
try
{
foreach (string f in Directory.GetFiles(path))
{
var model = new ImageDirModel();
model.Path = f.ToLower();
model.FileName = Path.GetFileName(f.ToLower());
files.Add(model);
}
foreach (string d in Directory.GetDirectories(path))
{
files.AddRange(ImageDirSearch(d));
}
}
catch (System.Exception excpt)
{
throw new Exception(excpt.Message);
}
return files;
}
The problem I have is that this row:
var result = listWithFilesFromImageFolder.Where(p => !listOfAdPictureNames.Any(q => p.FileName == q));
takes over an hour to complete. I want to know if there is a better way to check whether my images folder contains images that don't exist in my database.
Here is the method that get all the image names from my database layer:
public static List<string> GetAllAdPictureNames()
{
List<string> ListWithAllAdFileNames = new List<string>();
using (var db = new DatabaseLayer.DBEntities())
{
ListWithAllAdFileNames = db.ad_pictures.Select(b => b.filename.ToLower()).ToList();
}
if (ListWithAllAdFileNames.Count < 1)
return new List<string>();
return ListWithAllAdFileNames;
}
Perhaps Except is what you're looking for. Something like this:
var filesInFolderNotInDb = listWithFilesFromImageFolder.Select(p => p.FileName).Except(listOfAdPictureNames).ToList();
Should give you the files that exist in the folder but not in the database.
Instead of repeating a linear search of the list for every file, it is better to sort the second list, listOfAdPictureNames, once (any O(n log n) sort will do) and then check for existence with a binary search. Each lookup then costs O(log n), whereas the current approach scans the whole list for every file, which is quadratic overall.
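A rough sketch of the sort-then-binary-search idea, in case it helps. UnusedImages and FindUnused are made-up names; I assume the database holds bare file names and that case should be ignored, as in the question. The important detail is that the same comparer is used for sorting and for searching.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class UnusedImages
{
    // Hypothetical helper: returns the image paths whose file name
    // does not appear in knownNames (case-insensitive).
    public static List<string> FindUnused(IEnumerable<string> imagePaths, IEnumerable<string> knownNames)
    {
        var comparer = StringComparer.OrdinalIgnoreCase;

        // Sort the database names once
        var sorted = knownNames.ToList();
        sorted.Sort(comparer);

        // BinarySearch returns a negative number when the name is not found
        return imagePaths
            .Where(p => sorted.BinarySearch(Path.GetFileName(p), comparer) < 0)
            .ToList();
    }
}
```

The result keeps the full paths, so the unused files can be deleted directly afterwards.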
As I said in my comment, you seem to have recreated the FileInfo class. You don't need to do this, so your ImageDirSearch can become the following:
private IEnumerable<string> ImageDirSearch(string path)
{
// AllDirectories also searches the subfolders, like the recursive version did
return Directory.EnumerateFiles(path, "*.jpg", SearchOption.AllDirectories).Select(f => Path.GetFileName(f));
}
There doesn't seem to be much gained by returning the whole file info when you only need the file name. Also, this only finds JPGs, but that can be changed.
The ToLower calls are quite expensive and a bit pointless, and so is the ToList when you are planning on querying again, so you can get rid of both and return an IEnumerable again (this is in the GetAllAdPictureNames method). Keep in mind that the query must then be enumerated before the DBEntities context is disposed.
Then your comparison can use Equals and ignore case:
!listOfAdPictureNames.Any(q => p.Equals(q, StringComparison.InvariantCultureIgnoreCase));
One more thing that will probably help is removing items from the list of file names as they are found. This should make each later search of the list quicker, since there is less to iterate through.

Using foreach to store names of one array in another in C#

I have a directory with a bunch of files whose names have the format "EXT2-401-B-140422-1540-1542.mp4", in which the "140422" part indicates the date. Assume that these files have dates like 140421, 140422, 140423... (for every date there are a couple of files). I need to sort these files according to their dates, so I'd like to know how I could get these date strings (140421, 140422, etc.). I tried this:
directory = new DirectoryInfo(camera_dir);
string[] date = new string[directory.GetFiles().Length];
foreach (FileInfo file in directory.GetFiles())
{
foreach(string name in date)
{
name = file.Name.Substring(11, 6);
}
}
And the error message says that I can't assign to name. Could anybody help?
You can use LINQ to simplify this a bit.
To get filenames only, you simply need to project each FileInfo into a string:
var dates = directory
.EnumerateFiles("*.mp4")
.Select(f => f.Name)
.ToArray();
To get an ordered list, you also need to use OrderBy and specify the value to be used for ordering:
var dates = directory
.EnumerateFiles("*.mp4")
.Select(f => f.Name)
.OrderBy(f => f.Substring(11, 6)) // this will throw if string is too short
.ToArray();
You should also probably add some validation to prevent exceptions when filenames are not formatted properly. The least you can do is check if the string is long enough to have these 6 characters extracted:
var dates = directory
.EnumerateFiles("*.mp4")
.Select(f => f.Name)
.Where(f => f.Length >= 17) // check if there are enough characters
.OrderBy(f => f.Substring(11, 6))
.ToArray();
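If what you actually need is the list of dates with the files that belong to each one, something along these lines might work. It is a sketch that assumes the date code always sits at characters 11 to 16, as in the question; CameraFiles and GroupByDate are made-up names, and `out var` needs C# 7 or later.

```csharp
using System;
using System.Collections.Generic;

public static class CameraFiles
{
    // Hypothetical helper: maps each date code (e.g. "140422") to the file names
    // that carry it. SortedDictionary keeps the date codes in ascending order.
    public static SortedDictionary<string, List<string>> GroupByDate(IEnumerable<string> fileNames)
    {
        var result = new SortedDictionary<string, List<string>>(StringComparer.Ordinal);
        foreach (var name in fileNames)
        {
            if (name.Length < 17)
                continue; // too short to contain the date code

            var date = name.Substring(11, 6);
            if (!result.TryGetValue(date, out var list))
            {
                list = new List<string>();
                result[date] = list;
            }
            list.Add(name);
        }
        return result;
    }
}
```

The Keys collection of the result is then the sorted list of distinct dates (140421, 140422, ...), and each value holds the files for that date.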
try this:
var files = directory.GetFiles();
string[] dates = new string[files.Length];
for(int i = 0; i < files.Length; i++)
{
dates[i] = files[i].Name.Substring(11, 6);
}
directory = new DirectoryInfo(camera_dir);
string[] date = new string[directory.GetFiles().Length];
int i=0;
foreach (FileInfo file in directory.GetFiles())
{
date[i] = file.Name.Substring(11, 6);
i++;
}
In fact you are asking two questions.
You cannot assign to the iteration variable of a foreach. Create a new variable to store the name in.
For the second part, how to extract it, use this regex:
string s = Regex.Replace(file.Name, @"(.*?)\-(.*?)\-(.*?)\-(.*?)\-(.*)", "$4");
The ? makes each group non-greedy, so the fourth group captures the text between the third and fourth dash.
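A slightly stricter variant, if you also want to reject names that don't follow the layout: match the name instead of replacing in it, and only accept six digits in the fourth position. This is a sketch under the assumption that the date code is always exactly six digits; DateCode and TryExtract are made-up names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DateCode
{
    // Three dash-terminated parts, then six digits, then another dash
    private static readonly Regex Pattern = new Regex(@"^[^-]+-[^-]+-[^-]+-(\d{6})-");

    // Hypothetical helper: returns the six-digit date code,
    // or null when the name does not match the expected layout.
    public static string TryExtract(string fileName)
    {
        Match m = Pattern.Match(fileName);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Calling DateCode.TryExtract(file.Name) inside the loop and skipping null results avoids exceptions on files that are named differently.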
I would create a class for the file name (if they all have the same structure) and then split the parts out (apologies for reworking the entire code; what the others say about foreach not being able to modify the collection is also worth noting)...
public class MyFile
{
// EXT2-401-B-140422-1540-1542.mp4
public string part1 { get; set;}
public string part2 { get; set;}
public string part3 { get; set;}
public string date { get; set;}
public string part5 { get; set;}
public string part6 { get; set;}
public MyFile(string fileName) // a constructor has no return type
{
string[] parts = fileName.Split('-');
part1 = parts[0];
part2 = parts[1];
part3 = parts[2];
date = parts[3];
part5 = parts[4];
part6 = parts[5];
}
}
then you can loop through...
directory = new DirectoryInfo(camera_dir);
List<MyFile> myFiles = new List<MyFile>();
foreach (FileInfo file in directory.GetFiles())
{
myFiles.Add(new MyFile(file.Name));
}
Then you can sort, select, "blah" on any of the files using linq...
EDIT:
Additionally, if your previous code had compiled, you would end up with the date array full of a single date, because on every iteration over directory.GetFiles() you try to set every element of the array to the current file's date.
Name the 'parts' something useful to you.
