I don't have much experience with regexes and I wanted to rectify that. I decided to build an application that takes a directory name and scans all the files in it. The files all carry an increasing serial number but differ subtly in their names (for example: episode01.mp4, episode_02.mp4, episod03.mp4, episode04.rmvb, etc.).
The application should scan the directory, find the number in each file name and rename the file, along with the extension, to a common format (episode01.mp4, episode02.mp4, episode03.mp4, episode04.rmvb, etc.).
I have the following code:
Dictionary<string, string> renameDictionary = new Dictionary<string,string>();
DirectoryInfo dInfo = new DirectoryInfo(path);
string newFormat = "Episode{0}.{1}";
Regex regex = new Regex(@".*?(?<no>\d+).*?\.(?<ext>.*)"); // look for a number (before the .) and the extension
foreach (var file in dInfo.GetFiles())
{
string fileName = file.Name;
var match = regex.Match(fileName);
if (match != null)
{
GroupCollection gc = match.Groups;
//Console.WriteLine("Number : {0}, Extension : {2} found in {1}.", gc["no"], fileName,gc["ext"]);
renameDictionary[fileName] = string.Format(newFormat, gc["no"], gc["ext"]);
}
}
foreach (var renamePair in renameDictionary)
{
Console.WriteLine("{0} will be renamed to {1}.", renamePair.Key, renamePair.Value);
//stuff for renaming here
}
One problem in this code is that it also includes files which don't have numbers in the renameDictionary. It would also be helpful if you could point out any other gotchas that I should be careful about.
PS: I am assuming that the filenames will only contain numbers corresponding to serial (nothing like cam7_0001.jpg)
The simplest solution is probably to use Path.GetFileNameWithoutExtension to get the file name, and then the regex \d+$ to get the number at its end (or Path.GetExtension and \d+ to get the number anywhere).
You can also achieve this in a single replace:
Regex.Replace(fileName, @".*?(\d+).*(\.[^.]+)$", "Episode$1$2")
This regex is a bit better, in that it forces the extension not to contain dots.
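To make the first suggestion concrete, here is a minimal sketch, not a definitive implementation. The class name EpisodeRenamer and the method NewName are made up for illustration. It also addresses the problem from the question: Regex.Match never returns null, so the check has to be on match.Success.

```csharp
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class EpisodeRenamer
{
    // First run of digits in the name (searched after the extension is stripped).
    private static readonly Regex Number = new Regex(@"\d+");

    // Returns the new name, or null when the file name contains no number.
    public static string NewName(string fileName)
    {
        string stem = Path.GetFileNameWithoutExtension(fileName);
        Match match = Number.Match(stem);

        // Regex.Match never returns null; an unsuccessful match has Success == false.
        if (!match.Success)
            return null;

        // Path.GetExtension keeps the leading dot, e.g. ".mp4".
        return "Episode" + match.Value + Path.GetExtension(fileName);
    }
}
```

Because the extension is stripped before searching, a file such as notes.mp4 is skipped instead of picking up the 4 from its extension.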
I have a File-Info-List of more than 200 log-files from a directory.
Most of the files need to be in the list, but there are a few files that should be ignored.
Here is an example of the File-List:
A300a1_ContentLink.log
A301a20_ContentLink.log
A1_4a0_ContentLink.log
B200a101_ContentLink.log
B200a101_ContentLink_20221208_115905.log
B200a101_ContentLink_20221208_115907.log
B200a101_ContentLink_20221208_120647.log
B201a1_ContentLink.log
B202a0_ContentLink.log
Explanation of the file name:
The first characters refer to a room (e.g. room A300 or A1). A room could have any description, e.g. B200, CXS2 or only CDD. The next characters refer to a device name (e.g. device a1 or device a20). Each device name starts with a, followed by 1-3 digits. The last part of each file name is "_ContentLink".
All files with a further ending, like _20221208_115905, are duplicates of older versions that are needed in other programs, but not in my list.
My problem is that I only need the newest File of each logfile in my File-Info-List.
I initialized a FileInfo[] allFiles that contains all of the files of the directory.
Next I initialized a new FileInfo[] in which I would like to store only the newest version of each file.
My first attempt was to compare the LastWrite time
FileInfo currentFile = allFiles[0];
foreach (FileInfo file in allFiles)
{
if (file.LastWriteTime > currentFile.LastWriteTime)
{
currentFile = file;
}
}
But I only get back the latest file of the whole folder.
Now I am thinking about using regular expressions instead of .LastWriteTime, to exclude all files that have a suffix after ContentLink.
But I don't know how to do that, or how to remove the outdated files from the list of all files (or transfer only the relevant ones to a new FileInfo[]).
Thank you in advance for your ideas.
You can use a LINQ query to:
extract the name and time part from each file name
group the files by name and
select the latest (maximum) file by time
Something like :
var regex=new Regex(@"^(.*?)_ContentLink(.*?)\.log$");
var latest=allFiles.Select(f=>{
var parts=regex.Match(f.Name);
return new {
File=f,
Name=parts.Groups[1].ToString(),
Date=parts.Groups[2].ToString()
};
})
.GroupBy(f=>f.Name)
.Select(g=>g.MaxBy(f=>f.Date).File)
.ToArray();
foreach(var file in latest)
{
Console.WriteLine(file.Name);
}
This produces
A300a1_ContentLink.log
A301a20_ContentLink.log
A1_4a0_ContentLink.log
B200a101_ContentLink_20221208_120647.log
B201a1_ContentLink.log
B202a0_ContentLink.log
MaxBy was added in .NET 6. Before that you can use the equivalent method from the MoreLINQ library.
The regular expression captures the smallest possible string before _ContentLink in the first group (.*?) and the smallest possible date part in the second group.
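If neither .NET 6 nor MoreLINQ is available, the same grouping can be done with plain LINQ by sorting each group and taking the first element. The following is only a sketch under some assumptions: it works on plain file names instead of FileInfo so that it is self-contained, the names LatestLogs and Latest are made up, and the date parts are compared ordinally (an empty date part sorts before any timestamp).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class LatestLogs
{
    // Group 1: name before _ContentLink. Group 2: optional date part after it.
    private static readonly Regex Parts = new Regex(@"^(.*?)_ContentLink(.*?)\.log$");

    public static List<string> Latest(IEnumerable<string> fileNames)
    {
        return fileNames
            .Select(n => new { FileName = n, Match = Parts.Match(n) })
            .Where(x => x.Match.Success)              // ignore names that do not fit the pattern
            .GroupBy(x => x.Match.Groups[1].Value)    // one group per room/device name
            .Select(g => g
                // Stand-in for MaxBy: sort by date part, newest first, take the first.
                .OrderByDescending(x => x.Match.Groups[2].Value, StringComparer.Ordinal)
                .First().FileName)
            .ToList();
    }
}
```

If only the files without a timestamp suffix are wanted, the Where clause could additionally require that group 2 is empty and the grouping could be dropped.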
You could get a bit fancier and use different regular expressions to capture the name and time part. Combined with local functions, this results in a somewhat cleaner query:
var nameRex=new Regex(@"^(.*?)_ContentLink.*\.log$");
var timeRex=new Regex(@"^.*_ContentLink(.*?)\.log$");
string NamePart(FileInfo f)
{
return nameRex.Match(f.Name).Groups[1].ToString();
}
string TimePart(FileInfo f)
{
return timeRex.Match(f.Name).Groups[1].ToString();
}
var latest=allFiles
.GroupBy(NamePart)
.Select(g=>g.MaxBy(TimePart))
.ToArray();
I am working with files that range between 150MB and 250MB, and I need to append a form feed (\f) character to each match found in a match collection. Currently, my regular expression for each match is this:
Regex myreg = new Regex("ABC: DEF11-1111(.*?)MORE DATA(.*?)EVEN MORE DATA(.*?)\f", RegexOptions.Singleline);
and I'd like to modify each match in the file (and then overwrite the file) to become something that could be later found with a shorter regular expression:
Regex myreg = new Regex("ABC: DEF11-1111(.*?)\f\f", RegexOptions.Singleline);
Put another way, I want to simply append a form feed character (\f) to each match that is found in my file and save it.
I see a ton of examples on stack overflow for replacing text, but not so much for larger files. Typical examples of what to do would include:
Using StreamReader to store the entire file in a string, then doing a find and replace in that string.
Using MatchCollection in combination with File.ReadAllText().
Reading the file line by line and looking for matches there.
The problem with the first two is that they just eat up a ton of memory, and I worry about the program being able to handle all of that. The problem with the third option is that my regular expression spans many rows, and thus will not be found in a single line. I see other posts out there as well, but they cover replacing specific strings of text rather than working with regular expressions.
What would be a good approach for me to append a form feed character to each match found in a file, and then save that file?
Edit:
Per some suggestions, I tried playing around with StreamReader.ReadLine(). Specifically, I would read a line, see if it matched my expression, and then based on that result I would write to a file. If it matched the expression, I would write to the file. If it didn't match the expression, I would just append it to a string until it did match the expression. Like this:
Regex myreg = new Regex("ABC: DEF11-1111(.*?)MORE DATA(.*?)EVEN MORE DATA(.*?)\f", RegexOptions.Singleline);
//For storing/comparing our match.
string line, buildingmatch, match, whatremains;
buildingmatch = "";
match = "";
whatremains = "";
//For keep track of trailing bits after our match.
int matchlength = 0;
using (StreamWriter sw = new StreamWriter(destFile))
using (StreamReader sr = new StreamReader(srcFile))
{
//While we are still reading lines in the file...
while ((line = sr.ReadLine()) != null)
{
//Keep adding lines to buildingmatch until we can match the regular expression.
buildingmatch = buildingmatch + line + "\r\n";
if (myreg.IsMatch(buildingmatch))
{
match = myreg.Match(buildingmatch).Value;
matchlength = match.Length;
//Make sure we are not at the end of the file.
if (matchlength < buildingmatch.Length)
{
whatremains = buildingmatch.Substring(matchlength, buildingmatch.Length - matchlength);
}
sw.Write(match + "\f\f");
buildingmatch = whatremains;
whatremains = "";
}
}
}
The problem is that this took about 55 minutes to run a roughly 150MB file. There HAS to be a better way to do this...
If you can load the whole string data into a single string variable, there is no need to first match and then append text to matches in a loop. You can use a single Regex.Replace operation:
string text = File.ReadAllText(srcFile);
using (StreamWriter sw = new StreamWriter(destfile, false, Encoding.UTF8, 5242880))
{
sw.Write(myregex.Replace(text, "$&\f\f"));
}
Details:
string text = File.ReadAllText(srcFile); - reads the srcFile file to the text variable (match would be confusing)
myregex.Replace(text, "$&\f\f") - replaces all occurrences of myregex matches with themselves ($& is a backreference to the whole match value) while appending two \f chars right after each match.
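As a small, self-contained illustration of the $& substitution, here is a sketch on an in-memory string. The shortened pattern and the names FormFeedAppender and AppendFormFeed are assumptions made for the example, and it appends a single \f so the effect is easy to see.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class FormFeedAppender
{
    // Shortened stand-in for the pattern in the question: a record runs from
    // the header to the next form feed, across line breaks (Singleline).
    private static readonly Regex Record =
        new Regex("ABC: DEF11-1111(.*?)\f", RegexOptions.Singleline);

    public static string AppendFormFeed(string text)
    {
        // "$&" re-inserts the whole match; the literal form feed follows it.
        return Record.Replace(text, "$&\f");
    }
}
```

Text between matches is left untouched, which is the difference from collecting only the matches and writing those out.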
I was able to find a solution that works in a reasonable time; it can process my entire 150MB file in under 5 minutes.
First, as mentioned in the comments, it's a waste to compare the string to the Regex after every iteration. Rather, I started with this:
string match = File.ReadAllText(srcFile);
MatchCollection mymatches = myregex.Matches(match);
Strings can hold up to 2GB of data, so while not ideal, I figured roughly 150MB worth wouldn't hurt to be stored in a string. Then, as opposed to checking a match every x amount of lines read in from the file, I can check the file for matches all at once!
Next, I used this:
StringBuilder matchsb = new StringBuilder(134217728);
foreach (Match m in mymatches)
{
matchsb.Append(m.Value + "\f\f");
}
Since I already know (roughly) the size of my file, I can go ahead and initialize my stringbuilder. Not to mention, it's a lot more efficient to use string builder if you are doing multiple operations on a string (which I was). From there, it's just a matter of appending the form feed to each of my matches.
Finally, the part that cost the most in performance:
using (StreamWriter sw = new StreamWriter(destfile, false, Encoding.UTF8, 5242880))
{
sw.Write(matchsb.ToString());
}
The way that you initialize StreamWriter is critical. Normally, you just declare it as:
StreamWriter sw = new StreamWriter(destfile);
This is fine for most use cases, but the problem becomes apparent when you are dealing with larger files. When declared like this, you are writing to the file with a default buffer of 4KB. For a smaller file, this is fine. But for 150MB files? This will end up taking a long time. So I corrected the issue by changing the buffer to approximately 5MB.
I found this resource really helped me to understand how to write to files more efficiently: https://www.jeremyshanks.com/fastest-way-to-write-text-files-to-disk-in-c/
Hopefully this will help the next person along as well.
I have a log file and I need to find some parameters.
For example:
11:26:42 In [INF] File opened
11:27:48 In [INF] some operations
I want to find line number 2, the one with the extra space.
So I tried to find it like this:
string pattern = @"\[INF\]";
foreach (String inf in lines)
{
if (Regex.IsMatch(inf, pattern))
{
//Console.WriteLine(inf);
using (System.IO.StreamWriter file = new System.IO.StreamWriter(outputPath, true,Encoding.ASCII))
{
file.WriteLine(inf);
}
}
}
But how do I find the INF category with extra white space?
I am doing this in C#, but the language doesn't matter.
Thanks.
An easy way to find double (or more) spaces is
@"\s{2,}"
This will match the spaces only.
string pattern = @"\s\s\[INF\]\s\s";
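Putting the two ideas together, a filter could look roughly like the sketch below. It assumes that "extra space" means two or more whitespace characters directly before [INF]; the sample lines in the question do not show exactly where the extra whitespace sits, so adjust the pattern if it is elsewhere. The names LogFilter and WithExtraSpace are made up.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class LogFilter
{
    // Assumption: the extra whitespace sits directly before [INF].
    private static readonly Regex ExtraSpaceInf = new Regex(@"\s{2,}\[INF\]");

    // Returns only the lines in which [INF] is preceded by 2+ whitespace characters.
    public static List<string> WithExtraSpace(IEnumerable<string> lines)
    {
        return lines.Where(line => ExtraSpaceInf.IsMatch(line)).ToList();
    }
}
```

Collecting the matching lines first and writing them in one go also avoids reopening the output file for every line.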
I have two lists containing paths to a directory of music files and I want to determine which of these files are stored on both lists and which are only stored on one. The problem lies in that the format of the paths differ between the two lists.
Format example:
List1: file://localhost//FILE/Musik/30%20Seconds%20To%20Mars.mp3
List2: \\FILE\Musik\30 Seconds To Mars.mp3
How do I go about comparing these two file paths and matching them to the same source?
The answer depends on your notion of "same file". If you merely want to check if the file is equal, but not the very same file, you could simply generate a hash over the file's content and compare that. If the hashes are equal (please use a strong hash, like SHA-256), you can be confident that the files are also. Likewise you could of course also compare the files byte by byte.
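For the content-equality case, a hash comparison might look like the following sketch. It uses SHA256 from System.Security.Cryptography; the class and method names are made up, and both paths are assumed to be readable from the machine running the code.

```csharp
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical helper, named for illustration only.
public static class FileContentComparer
{
    // Streams the file through SHA-256 so it is never fully loaded into memory.
    public static byte[] Sha256Of(string path)
    {
        using (SHA256 sha = SHA256.Create())
        using (FileStream stream = File.OpenRead(path))
        {
            return sha.ComputeHash(stream);
        }
    }

    // True when both files hash to the same value, i.e. have equal content.
    public static bool SameContent(string pathA, string pathB)
    {
        return Sha256Of(pathA).SequenceEqual(Sha256Of(pathB));
    }
}
```

This only says that the contents are equal; it does not tell you whether the two paths point at the very same file.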
If you really want to figure that the two files are actually the same file, i.e. just addressed via different means (like file-URL or UNC path), you have a little more work to do.
First you need to find out the true file system path for each of the addresses. For example, you need to find the file system path behind the UNC path and/or file-URL (which typically is the URL itself). In the case of UNC paths that are shares on a remote computer, you might not even be able to do so.
Also, even if you have the local path figured out somehow, you also need to deal with different redirection mechanisms for local paths (on Windows junctions/reparse points/links; on UNIX symbolic or hard links). For example, you could have a share using file system link as source, while the file URL uses the true source path. So to the casual observer they still look like different files.
Having all that said, the "algorithm" would be something like this:
Figure out the source path for the URLs, UNC paths/shares, etc. you have
Figure out the local source path from those paths (considering links/junctions, subst.exe, etc.)
Normalize those paths, if necessary (i.e. a/b/../c is actually a/c)
Compare the resulting paths.
I think the best way to do it is by temporarily converting one of the paths to the other one's format. I would suggest you change the first to match the second.
string List1 = "file://localhost//FILE/Musik/30%20Seconds%20To%20Mars.mp3";
string List2 = @"\\FILE\Musik\30 Seconds To Mars.mp3";
I would recommend you use the Replace() method.
Get rid of "file://localhost":
var tempStr = List1.Replace("file://localhost", "");
Change all '%20' into spaces:
tempStr = tempStr.Replace("%20", " ");
Change all '/' into '\':
tempStr = tempStr.Replace("/", @"\");
Voilà! Two strings in matching format!
Use Python: you can easily compare the two files like this
>>> import filecmp
>>> filecmp.cmp('file1.txt', 'file1.txt')
True
>>> filecmp.cmp('file1.txt', 'file2.txt')
False
To open files given with the file:// syntax, use urllib (urllib.request in Python 3):
>>> import urllib.request
>>> file1 = urllib.request.urlopen('file://localhost/tmp/test')
For normal file paths, use the standard open.
>>> file2 = open('/pathtofile','r')
I agree completely with Christian, you should re-think structure of the lists, but the below should get you going.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace ConsoleApplication5
{
class Program
{
public static List<string> SanitiseList(List<string> list)
{
List<string> sanitisedList = new List<string>();
foreach (string filename in list)
{
String sanitisedFilename = String.Empty;
if (!String.IsNullOrEmpty(filename))
{
sanitisedFilename = filename;
// get rid of the URL encoding
sanitisedFilename = Uri.UnescapeDataString(sanitisedFilename);
// first of all change all back-slashes to forward slashes
sanitisedFilename = sanitisedFilename.Replace(@"\", "/");
// if we have two forward slashes at the beginning, assume it's localhost
if (sanitisedFilename.StartsWith("//"))
{
// remove these first double slashes and stick in localhost
sanitisedFilename = sanitisedFilename.TrimStart('/');
sanitisedFilename = "//localhost" + "/" + sanitisedFilename;
}
// remove the file: scheme
sanitisedFilename = sanitisedFilename.Replace("file://", "//");
// remove double back-slashes
sanitisedFilename = sanitisedFilename.Replace(@"\\", @"\");
// remove double forward-slashes (but not the first two)
sanitisedFilename = sanitisedFilename.Substring(0, 2) + sanitisedFilename.Substring(2).Replace("//", "/");
}
if (!String.IsNullOrEmpty(sanitisedFilename))
{
sanitisedList.Add(sanitisedFilename);
}
}
return sanitisedList;
}
static void Main(string[] args)
{
List<string> listA = new List<string>();
List<string> listB = new List<string>();
listA.Add("file://localhost//FILE/Musik/BritneySpears.mp3");
listA.Add("file://localhost//FILE/Musik/30%20Seconds%20To%20Mars.mp3");
listB.Add("file://localhost//FILE/Musik/120%20Seconds%20To%20Mars.mp3");
listB.Add(@"\\FILE\Musik\30 Seconds To Mars.mp3");
listB.Add(@"\\FILE\Musik\5 Seconds To Mars.mp3");
listA = SanitiseList(listA);
listB = SanitiseList(listB);
List<string> missingFromA = listB.Except(listA).ToList();
List<string> missingFromB = listA.Except(listB).ToList();
}
}
}
I have a C# app that uses the search functions to find all files in a directory, then shows them in a list. I need to be able to filter the files based on extension (possibly using the search function) and directory (e.g., block any in the "test" or "debug" directories from showing up).
My current code is something like:
Regex filter = new Regex(@"^docs\\(?!debug\\)(?'display'.*)\.(txt|rtf)");
String[] filelist = Directory.GetFiles("docs\\", "*", SearchOption.AllDirectories);
foreach ( String file in filelist )
{
Match m = filter.Match(file);
if ( m.Success )
{
listControl.Items.Add(m.Groups["display"]);
}
}
(that's somewhat simplified and consolidated, the actual regex is created from a string read from a file and I do more error checking in between.)
I need to be able to pick out a section (usually a relative path and filename) to be used as the display name, while ignoring any files with a particular foldername as a section of their path. For example, for these files, only ones with +s should match:
+ docs\info.txt
- docs\data.dat
- docs\debug\info.txt
+ docs\world\info.txt
+ docs\world\pictures.rtf
- docs\world\debug\symbols.rtf
My regex works for most of those, except I'm not sure how to make it fail on the last file. Any suggestions on how to make this work?
Try Directory.GetFiles. This should do what you want.
Example:
// Only get files that end in ".txt"
string[] dirs = Directory.GetFiles(#"c:\", "*.txt", SearchOption.AllDirectories);
Console.WriteLine("The number of files ending with .txt is {0}.", dirs.Length);
foreach (string dir in dirs)
{
Console.WriteLine(dir);
}
^docs\\(?:(?!\bdebug\\).)*\.(?:txt|rtf)$
will match a string that
starts with docs\,
does not contain debug\ anywhere (the \b anchor ensures that we match debug as an entire word), and
ends with .txt or .rtf.
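To check the pattern against the sample paths from the question, something like the sketch below can be used. The named group display is an addition so that the relative path can be picked out as the display name; DocFilter and DisplayName are made-up names.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class DocFilter
{
    // The pattern from above, with a named group (an addition) around the
    // part between "docs\" and the extension.
    private static readonly Regex Filter =
        new Regex(@"^docs\\(?<display>(?:(?!\bdebug\\).)*)\.(?:txt|rtf)$");

    // Returns the display name, or null when the path should be filtered out.
    public static string DisplayName(string path)
    {
        Match m = Filter.Match(path);
        return m.Success ? m.Groups["display"].Value : null;
    }
}
```

With this, docs\world\info.txt yields world\info, while docs\world\debug\symbols.rtf and docs\data.dat are rejected.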