File formatting with Regex - c#

I am trying to split a string into multiple matches, each containing 'name', 'attributes' and 'files' (files only applies to a file with the directory attribute)
The string I'm trying to format: (I'm using the Hex-edit program as a test folder)
Hex Edit\ 1pÝó/Õ\<changelog.txt\ RÖ©òó/Õ ð`s7bÆÔ%ªòó/Õ < \HxD32.exe\ %ovòó/Õ ð‚fNcÆÔ­ÿ—òó/Õ< Þ \HxD64.exe\ ¤M˜òó/Õ ð‚fNcÆÔ:Ùžòó/Õ) †e" \license.txt\ “Lªòó/Õ ðõhÿªÔ“Lªòó/Õ¯? c \readme.txt\ ·&Ÿòó/Õ ðËóyÿªÔp°©òó/Õ„? ¦
\Settings\ IRýòó/Õ\<HxD Hex Editor.ini\ ÉÌ"ô/Õ ôeìÔ)3ÖôeìÔ)3Ö¸Ž? õ \HxD Hex Editor.lang\ yýòó/Õ yýòó/Õyýòó/Õ­•? ” \>>
Currently I am using (?<name>.+?)\\(?<attributes>.{10}( .{32})*?)\\(?<files>(<(?:[^<>]*|(?<open>\<)|(?<-open>\>))+(?(open)(?!))>)*)
The way the file is formatted:
filename\attributes\files
attributes can either be .{10}\s.{32} or .{10} followed by the \.
There aren't always files, but if there are, then files is < + more files (recursive, to any depth) + >.
What I was hoping this Regex would respond with:
Name: Hex Edit
Attributes: 1pÝó/Õ
Files: <changelog.txt\ RÖ©òó/Õ ð`s7bÆÔ%ªòó/Õ < \HxD32.exe\ %ovòó/Õ ð‚fNcÆÔ­ÿ—òó/Õ< Þ \HxD64.exe\ ¤M˜òó/Õ ð‚fNcÆÔ:Ùžòó/Õ) †e" \license.txt\ “Lªòó/Õ ðõhÿªÔ“Lªòó/Õ¯? c \readme.txt\ ·&Ÿòó/Õ ðËóyÿªÔp°©òó/Õ„? ¦
\Settings\ IRýòó/Õ\<HxD Hex Editor.ini\ ÉÌ"ô/Õ ôeìÔ)3ÖôeìÔ)3Ö¸Ž? õ \HxD Hex Editor.lang\ yýòó/Õ yýòó/Õyýòó/Õ­•? ” \>>
For each match that I returned, if it had no files I would add it to a treeview otherwise I would perform the same Regex on it (until there is none left, eventually making a treeview that has all the files in it).
I have been attempting this for just over two hours now and still have not gotten any closer with my current attempt being (?<name>[^\\/:*?<>"|]+?)\\(?<attributes>.{10}( .{32})*?)\\(?<files>\<(?>\<(?<c>)|[^<>]+|\>(?<-c>))*(?(c)(?!))\>).
The Regex needs to be .net compatible.
Sorry for the poor explanation; I am unsure how to word this, and this is my first post.

Try the following:
string input = File.ReadAllText(FILENAME);
string pattern = @"^(?'name'[^\\]+)\\(?'attribute'[^\\]+)\\(?'files'.*)";
Match match = Regex.Match(input,pattern);
string name = match.Groups["name"].Value;
string attribute = match.Groups["attribute"].Value;
string files = match.Groups["files"].Value;
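A minimal sketch of how that pattern behaves, assuming a simplified input in which the attribute blocks contain no backslashes (the real binary attributes may, in which case [^\\]+ would stop early). The class name EntryParser, the Parse helper, and the sample text are hypothetical. RegexOptions.Singleline is added because '.' does not match '\n' by default, and the sample data spans more than one line:

```csharp
using System;
using System.Text.RegularExpressions;

public static class EntryParser
{
    // Same pattern as in the answer; Singleline lets '.' in the files group cross line breaks.
    private static readonly Regex EntryPattern = new Regex(
        @"^(?'name'[^\\]+)\\(?'attribute'[^\\]+)\\(?'files'.*)",
        RegexOptions.Singleline);

    public static (string Name, string Attribute, string Files) Parse(string input)
    {
        Match m = EntryPattern.Match(input);
        if (!m.Success)
            throw new FormatException("Input does not look like name\\attributes\\files.");
        return (m.Groups["name"].Value, m.Groups["attribute"].Value, m.Groups["files"].Value);
    }

    public static void Main()
    {
        // Hypothetical stand-in for the real data: ATTR replaces the binary attribute bytes.
        string input = "Hex Edit\\ATTR\\<changelog.txt\\ATTR\\\n\\Settings\\ATTR\\>";
        var entry = Parse(input);
        Console.WriteLine(entry.Name);      // Hex Edit
        Console.WriteLine(entry.Attribute); // ATTR
        Console.WriteLine(entry.Files.Length);
    }
}
```

The files group still holds the whole nested block, so it would have to be unwrapped and parsed again for each level, as the question describes.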

Related

Regex trying to get just package name from `az.accounts.2.10.4.nupkg`

I am trying to get the package name from the file name using C# and Regex. This is my attempt so far, which works, but I am wondering if there is a more elegant way.
Given for example, az.accounts.2.10.4.nupkg I want to get az.accounts
My attempt:
var filename = Path.GetFileNameWithoutExtension(nupkgPackagePath);
var nupkgPackageGetModulePath = Regex.Matches(filename, @"[^\d]+").First().Value.TrimEnd('.');
Test cases:
$ ls *.nupkg
PowerShellGet.nupkg az.iothub.2.7.4.nupkg
az.9.2.0.nupkg az.keyvault.4.9.1.nupkg
az.accounts.2.10.4.nupkg az.kusto.2.1.0.nupkg
az.advisor.2.0.0.nupkg az.logicapp.1.5.0.nupkg
az.aks.5.1.0.nupkg az.machinelearning.1.1.3.nupkg
az.analysisservices.1.1.4.nupkg az.maintenance.1.2.1.nupkg
az.apimanagement.4.0.1.nupkg az.managedserviceidentity.1.1.0.nupkg
az.appconfiguration.1.2.0.nupkg az.managedservices.3.0.0.nupkg
az.applicationinsights.2.2.0.nupkg az.marketplaceordering.2.0.0.nupkg
az.attestation.2.0.0.nupkg az.media.1.1.1.nupkg
az.automation.1.8.0.nupkg az.migrate.2.1.0.nupkg
az.batch.3.2.1.nupkg az.monitor.4.3.0.nupkg
az.billing.2.0.0.nupkg az.mysql.1.1.0.nupkg
az.cdn.2.1.0.nupkg az.network.5.2.0.nupkg
az.cloudservice.1.1.0.nupkg az.notificationhubs.1.1.1.nupkg
az.cognitiveservices.1.12.0.nupkg az.operationalinsights.3.2.0.nupkg
az.compute.5.2.0.nupkg az.policyinsights.1.5.1.nupkg
az.confidentialledger.1.0.0.nupkg az.postgresql.1.1.0.nupkg
az.containerinstance.3.1.0.nupkg az.powerbiembedded.1.2.0.nupkg
az.containerregistry.3.0.0.nupkg az.privatedns.1.0.3.nupkg
az.cosmosdb.1.9.0.nupkg az.recoveryservices.6.1.2.nupkg
az.databoxedge.1.1.0.nupkg az.rediscache.1.6.0.nupkg
az.databricks.1.4.0.nupkg az.redisenterprisecache.1.1.0.nupkg
az.datafactory.1.16.11.nupkg az.relay.1.0.3.nupkg
az.datalakeanalytics.1.0.2.nupkg az.resourcemover.1.1.0.nupkg
az.datalakestore.1.3.0.nupkg az.resources.6.5.0.nupkg
az.dataprotection.1.0.1.nupkg az.security.1.3.0.nupkg
az.datashare.1.0.1.nupkg az.securityinsights.3.0.0.nupkg
az.deploymentmanager.1.1.0.nupkg az.servicebus.2.1.0.nupkg
az.desktopvirtualization.3.1.1.nupkg az.servicefabric.3.1.0.nupkg
az.devtestlabs.1.0.2.nupkg az.signalr.1.5.0.nupkg
az.dns.1.1.2.nupkg az.sql.4.1.0.nupkg
az.eventgrid.1.5.0.nupkg az.sqlvirtualmachine.1.1.0.nupkg
az.eventhub.3.2.0.nupkg az.stackhci.1.4.0.nupkg
az.frontdoor.1.9.0.nupkg az.storage.5.2.0.nupkg
az.functions.4.0.6.nupkg az.storagesync.1.7.0.nupkg
az.hdinsight.5.0.1.nupkg az.streamanalytics.2.0.0.nupkg
az.healthcareapis.2.0.0.nupkg az.support.1.0.0.nupkg
You can try something like this:
string text = "az.streamanalytics.2.0.0.nupkg";
var result = Regex
.Match(text, @"(?<name>[a-zA-Z0-9.]+?)(\.[0-9]+)*\.nupkg$")
.Groups["name"]
.Value;
Pattern explained:
(?<name>[a-zA-Z0-9.]+?) - letters, digits, dots, as few as possible
(so that the version part is not matched)
(\.[0-9]+)* - zero or more version parts: . followed by digits
\.nupkg - .nupkg
$ - end of string
^[^.]*\.[^.]*
You can test it out at https://regex101.com/
using System.Text.RegularExpressions;
// ...
string filename = "az.accounts.2.10.4.nupkg";
string pattern = @"^[^.]*\.[^.]*";
string nupkgPackageGetModulePath = Regex.Match(filename, pattern).Value;
// nupkgPackageGetModulePath is now "az.accounts"
You've got two different input formats
<PackageName>.nupkg
<PackageName>.<Major>.<Minor>.<Patch>.nupkg
Your current attempt:
Regex.Matches(fileName, @"[^\d]+").First().Value.TrimEnd('.')
This doesn't work for an input of "PowerShellGet.nupkg" when the regex is applied to the full file name. Here is how the code works:
Starting at the beginning of the string, find the first non-digit character, and greedily include all other consecutive non-digit characters. This is the "matched text".
If the matched text ends with a period, take off that period.
This works fine if your input has a number in it, but "PowerShellGet.nupkg" doesn't, so unless the extension has already been stripped (as Path.GetFileNameWithoutExtension does in your example), the result will be the full file name, not "PowerShellGet".
This will also be a huge problem if the package name itself contains a digit. How about "runtime.opensuse.13.2-x64.runtime.native.System.Security.Cryptography.OpenSsl.4.3.3.nupkg", or (and I can't believe this is actually a package) "2.2.0.0.nupkg"?
It's not a good idea to find the first non-digit. Instead, work with the expected format of nuget packages.
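The digit problem can be reproduced with a short sketch. It mirrors the attempt from the question (extension stripped first, then the first run of non-digits taken); the class and method names are hypothetical, and Regex.Match is used in place of Regex.Matches(...).First(), since both return the first match:

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

public static class FirstNonDigitRun
{
    // Mirrors the question's approach: strip ".nupkg", take the first
    // run of non-digit characters, then drop a trailing period.
    public static string Extract(string packageFileName)
    {
        string stem = Path.GetFileNameWithoutExtension(packageFileName);
        return Regex.Match(stem, @"[^\d]+").Value.TrimEnd('.');
    }

    public static void Main()
    {
        // Works when the name itself has no digits.
        Console.WriteLine(Extract("az.accounts.2.10.4.nupkg")); // az.accounts

        // Cuts the name short at the first digit inside the package name.
        Console.WriteLine(Extract(
            "runtime.opensuse.13.2-x64.runtime.native.System.Security.Cryptography.OpenSsl.4.3.3.nupkg"));
        // runtime.opensuse
    }
}
```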
Using string.Split:
Split the input by periods. If there are two elements in the resulting array, it's the first format, so return the first element. If there are at least 5 elements in the array, it's the second format, so return everything before the last four elements (major, minor, patch and the extension). Otherwise, the format is unknown.
private static string GetPackageName(string packageFileName)
{
var segments = packageFileName.Split('.');
return segments.Length switch
{
2 => segments[0],
>= 5 => string.Join(".", segments[..^4]),
_ => throw new Exception("Unknown what you want done here")
};
}
segments[..^4] is a handy way to get all the element(s) before the major version.
https://dotnetfiddle.net/Ok6jbq
Using Regex:
Again, because you've got two different formats you've got to account for both so this gets a bit more complicated.
([\S]+?)(?:\.\d+\.\d+\.\d+)?\.nupkg
The middle section ((?:\.\d+\.\d+\.\d+)?) is a non-capture group (starts with ?:) which is optional (suffixed with ?).
Capture group 1 will have the package name.
https://regexr.com/74mgf
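A minimal sketch of that regex in use, with two assumptions that are not part of the answer: ^ and $ anchors are added so the whole file name has to match, and a non-matching name throws. The class and method names are hypothetical:

```csharp
using System;
using System.Text.RegularExpressions;

public static class NupkgName
{
    // Group 1 is the package name; the optional non-capturing group
    // swallows a trailing ".major.minor.patch" when it is present.
    private static readonly Regex Pattern =
        new Regex(@"^(\S+?)(?:\.\d+\.\d+\.\d+)?\.nupkg$");

    public static string GetPackageName(string fileName)
    {
        Match m = Pattern.Match(fileName);
        if (!m.Success)
            throw new FormatException("Not a recognised .nupkg file name: " + fileName);
        return m.Groups[1].Value;
    }

    public static void Main()
    {
        Console.WriteLine(GetPackageName("az.accounts.2.10.4.nupkg")); // az.accounts
        Console.WriteLine(GetPackageName("az.9.2.0.nupkg"));           // az
        Console.WriteLine(GetPackageName("PowerShellGet.nupkg"));      // PowerShellGet
    }
}
```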

C# Adding Whitespace around a specific character for spacing in file names

I'm building a program which processes documents based on their file path and file name.
My current solution is based on file names containing 3 strings each separated by a space, dash and another space so that a valid name would be: "STRING1 - STRING2 - STRING3.pdf".
My program reads these values by using IndexOf().
string1.Substring(fileName.IndexOf("-") - 1)
string3.Substring(fileName.LastIndexOf("-") + 2)
The problem is that this breaks when the file names don't contain whitespace around the dashes. So I opted to use Regex instead, but how would I add a condition so that it doesn't add spaces to a name which already contains them?
Example,
String fileName[1] = "Test123 - Dog - Page 1.pdf"
String fileName[2] = "Test123-Dog-Page1.pdf"
Regex.Replace(fileName[1], "-", " - ");
Regex.Replace(fileName[2], "-", " - ");
Output:
fileName[1] = Test123  -  Dog  -  Page 1.pdf
fileName[2] = Test123 - Dog - Page1.pdf
fileName[1] was originally valid, now it's invalid.
fileName[2] was originally invalid, now it's valid.
I need both to be valid via an if condition.
Ps. Apologies if anything is unclear, I'm new to posting on Stack
You don't need regex, in case pure string methods are more readable for you:
string FixFileName(string fn)
{
string fnwe = System.IO.Path.GetFileNameWithoutExtension(fn);
return string.Join(" - ", fnwe.Split('-').Select(token => token.Trim()))
+ System.IO.Path.GetExtension(fn);
}
Demo: https://dotnetfiddle.net/alv6sB
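If a regex is still preferred, one possible sketch is to let the pattern absorb any whitespace that already surrounds the dash, so both spellings end up the same. This is an alternative to the Split/Trim approach, not the answer's method; the class and method names are hypothetical, and it assumes the extension itself contains no dash:

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

public static class FileNameSpacing
{
    // Replaces each dash, together with any whitespace around it,
    // by exactly " - ". The extension is left untouched.
    public static string Normalize(string fileName)
    {
        string stem = Path.GetFileNameWithoutExtension(fileName);
        string extension = Path.GetExtension(fileName);
        return Regex.Replace(stem, @"\s*-\s*", " - ") + extension;
    }

    public static void Main()
    {
        Console.WriteLine(Normalize("Test123 - Dog - Page 1.pdf")); // Test123 - Dog - Page 1.pdf
        Console.WriteLine(Normalize("Test123-Dog-Page1.pdf"));      // Test123 - Dog - Page1.pdf
    }
}
```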

Make a new file whose name is a directory path

I'm creating a csv file with a bunch of data. This file is going to be pushed up to another location and its name is going to be used to put it in the directory it belongs in. I need to create the filename to mimic a directory, without actually using that directory.
I'm using the below, basically "outputDirectory" is where the file should live, everything after it needs to be part of the filename.
String fileName = outputDirectory + DateTime.Now.ToString("yyyy-mm-hh") + "//" + app + "//" + client +"//" + site +"//" + unit + ".csv";
using (StreamWriter sw = new StreamWriter(fileName, false))
{
foreach (AFValue AFval in AFvals)
{
string tagname = AFval.PIPoint.Name;
string timestamp = AFval.Timestamp.ToString();
string value = AFval.Value.ToString();
var newLine = string.Format("{0},{1},{2}", tagname, timestamp, value);
sw.Write(newLine);
sw.Write(Environment.NewLine);
}
}
So right now this code is throwing an exception with
'Could not find a part of the path 'C:\Users\user\Desktop\Output\2019-53-01\app\client\site\Unit.csv'.'
I need it to create a file in 'C:\Users\user\Desktop\Output\' called
'2019-53-01\app\client\site\Unit.csv'.
Any ideas?
You cannot use a slash (\ or /) in the file name.
Here is an excerpt from Naming Files, Paths, and Namespaces
Use any character in the current code page for a name, including Unicode characters and characters in the extended character set (128–255), except for the following:
The following reserved characters:
< (less than)
> (greater than)
: (colon)
" (double quote)
/ (forward slash)
\ (backslash)
| (vertical bar or pipe)
? (question mark)
* (asterisk)
Integer value zero, sometimes referred to as the ASCII NUL character.
Characters whose integer representations are in the range from 1 through 31, except for alternate data streams where these characters are allowed. For more information about file streams, see File Streams.
Any other character that the target file system does not allow.
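One way to work within that restriction is to encode the hierarchy with a separator that is allowed in file names. The following is only a sketch: the underscore separator, the class and method names, and the literal date string are assumptions, and the receiving side would have to split on the same separator:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class FlatFileName
{
    // Joins the segments with a file-name-safe separator and replaces any
    // character that cannot appear in a file name (including both slashes).
    public static string Build(char separator, params string[] segments)
    {
        var invalid = new HashSet<char>(Path.GetInvalidFileNameChars()) { '/', '\\' };
        var cleaned = segments.Select(
            s => new string(s.Select(c => invalid.Contains(c) ? separator : c).ToArray()));
        return string.Join(separator.ToString(), cleaned);
    }

    public static void Main()
    {
        string name = Build('_', "2019-01-01", "app", "client", "site", "Unit") + ".csv";
        Console.WriteLine(name); // 2019-01-01_app_client_site_Unit.csv

        // The flat name can then be combined with the real output directory.
        string fullPath = Path.Combine(@"C:\Users\user\Desktop\Output", name);
        Console.WriteLine(fullPath);
    }
}
```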

C# Regex Replacement Not Working

I'm trying to remove new lines from a text file. Opening the text file in notepad doesn't reveal the line breaks I'm trying to remove (it looks like one big wall of text), however when I open the file in sublime, I can see them.
In sublime, I can remove the pattern '\n\n' and then the pattern '\n(?!AAD)' no problem. However, when I run the following code, the resulting text file is unchanged:
public void Format(string fileloc)
{
string str = File.ReadAllText(fileloc);
File.WriteAllText(fileloc + "formatted", Regex.Replace(Regex.Replace(str, "\n\n", ""), "\n(?!AAD)", ""));
}
What am I doing wrong?
If you do not want to spend hours trying to re-adjust the code for various types of linebreaks, here is a generic solution:
string str = File.ReadAllText(fileloc);
File.WriteAllText(fileloc + "formatted",
Regex.Replace(Regex.Replace(str, "(?>\r\n|\r|\n){2}", ""), "(?>\r\n|\r|\n)(?!AAD)", "")
);
Details:
A linebreak can be matched with (?>\r\n|\r|\n): a CRLF pair, a lone CR, or a lone LF. The atomic group (?>...) stops a single CRLF from being re-read as a CR followed by an LF, which would otherwise count as two linebreaks. To match 2 consecutive linebreaks, a limiting quantifier can be appended - (?>\r\n|\r|\n){2}.
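How such a pattern treats CRLF input is easy to check in isolation. The sketch below uses a hypothetical StripBreaks helper and a made-up sample string; it relies on an atomic group so that a CRLF pair is consumed as one linebreak, whereas a plain non-capturing group such as (?:\r?\n|\r) lets the engine backtrack and treat the CR and the LF as two separate breaks:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineBreakStripper
{
    // One linebreak: CRLF, lone CR, or lone LF. The atomic group (?>...)
    // keeps the engine from splitting a CRLF pair into two breaks.
    private const string Break = @"(?>\r\n|\r|\n)";

    public static string StripBreaks(string text)
    {
        // First remove pairs of consecutive linebreaks, then every
        // remaining linebreak that is not followed by "AAD".
        string withoutBlankLines = Regex.Replace(text, Break + "{2}", "");
        return Regex.Replace(withoutBlankLines, Break + "(?!AAD)", "");
    }

    public static void Main()
    {
        string sample = "a\r\n\r\nb\r\nAADc\r\nd";
        // The break before "AAD" survives; the others are removed.
        Console.WriteLine(StripBreaks(sample) == "ab\r\nAADcd"); // True
    }
}
```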
An empirical solution. Opening your sample file in binary mode revealed that it contains 0x0D characters, which are carriage returns \r. So I came up with this (multiple lines for easier debugging):
public void Format(string fileloc)
{
var str = File.ReadAllText(fileloc);
var firstround = Regex.Replace(str, @"\r\r", "");
var secondround = Regex.Replace(firstround, @"\r(?!AAD)", "");
File.WriteAllText(fileloc + "formatted", secondround);
}
Is this possibly a windows/linux mismatch? Try replacing '\r\n' instead.

Trim all chars off file name after first "_"

I'd like to trim these purchase order file names (a few examples below) so that everything after the first "_" is omitted.
INCOLOR_fc06_NEW.pdf
Keep: INCOLOR (write this to db as the VendorID) Remove: _fc06_NEW.pdf
NORTHSTAR_sc09.xls
Keep: NORTHSTAR (write this to db as the VendorID) Remove: _sc09.xls
Our scenario: The managers are uploading these files to our Intranet web server, to make them available to download/view etc. I'm using Brettle's NeatUpload, and for each file uploaded, am writing the file's attributes into the PO table (sql 2000). The first part of the file name will be written to the DB as a VendorID.
The naming convention for these files is consistent in that the first part of the file name is always the vendor name (or Vendor ID) followed by an "_", then other unpredictable chars used to identify the type of Purchase Order, then the file extension, which is consistently either .xls, .XLS, .PDF, or .pdf.
I tried TrimEnd - but the array of chars that you have to provide ends up being long and can conflict with the part of the file name I want to keep. I have a feeling I'm not using TrimEnd properly.
What is the best way to use string.TrimEnd (or any other string manipulation in C#) that will strip off all chars after the first "_" ?
String s = "INCOLOR_fc06_NEW.pdf";
int index = s.IndexOf("_");
return index >= 0 ? s.Substring(0,index) : s;
I'll probably offend the anti-regex lobby, but here I go (ducking):
string stripped = Regex.Replace(filename, @"(?<=[^_]*)_.*",String.Empty);
This code will strip all extra characters after the first '_', unless there is no '_' in the string (then it will just return the original string).
It's one line of code. It's slower than the more elaborate IndexOf() algorithm, but when used in a non-performance-sensitive part of the code, it's a good solution.
Get your flame-throwers out...
TrimEnd only removes the characters you specify (or white space, if you specify none) from the end of the String, so it won't help you here. Read more about TrimEnd here:
http://msdn.microsoft.com/en-us/library/system.string.trimend.aspx
Bnaffas code (with a small tweak):
String fileName = "INCOLOR_fc06_NEW.pdf";
int index = fileName.IndexOf("_");
return index >= 0 ? fileName.Substring(0, index) : fileName;
If you want to do something with the other parts, you could use a Split
string fileName = "INCOLOR_fc06_NEW.pdf";
string[] parts = fileName.Split('_');
public string StripOffStuff(string sInput)
{
int iIndex = sInput.IndexOf("_");
return (iIndex > 0) ? sInput.Substring(0, iIndex) : sInput;
}
// Call it like:
string sNewString = StripOffStuff("INCOLOR_fc06_NEW.pdf");
I would go with the SubString approach but to round out the available solutions here's a LINQ approach just for fun:
string filename = "INCOLOR_fc06_NEW.pdf";
string result = new string(filename.TakeWhile(c => c != '_').ToArray());
It'll return the original string if no underscore is found.
To go with all the "alternative" solutions, here's the second one that I thought of (after substring):
string filename = "INCOLOR_fc06_NEW.pdf";
string stripped = filename.Split('_')[0];
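For reference, a small runnable check of the IndexOf approach against the file names from the question, plus one name without an underscore. The class and method names are hypothetical:

```csharp
using System;

public static class VendorId
{
    // Returns everything before the first underscore, or the whole
    // string when there is no underscore at all.
    public static string FromFileName(string fileName)
    {
        int index = fileName.IndexOf('_');
        return index >= 0 ? fileName.Substring(0, index) : fileName;
    }

    public static void Main()
    {
        Console.WriteLine(FromFileName("INCOLOR_fc06_NEW.pdf")); // INCOLOR
        Console.WriteLine(FromFileName("NORTHSTAR_sc09.xls"));   // NORTHSTAR
        Console.WriteLine(FromFileName("NoUnderscore.pdf"));     // NoUnderscore.pdf
    }
}
```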
