glob pattern matching in .NET - c#

Is there a built-in mechanism in .NET to match patterns other than Regular Expressions? I'd like to match using UNIX style (glob) wildcards (* = any number of any character).
I'd like to use this for an end-user-facing control. I fear that permitting all regex capabilities would be very confusing.

I like my code a little more semantic, so I wrote this extension method:
using System.Text.RegularExpressions;
namespace Whatever
{
public static class StringExtensions
{
/// <summary>
/// Compares the string against a given pattern.
/// </summary>
/// <param name="str">The string.</param>
/// <param name="pattern">The pattern to match, where "*" means any sequence of characters, and "?" means any single character.</param>
/// <returns><c>true</c> if the string matches the given pattern; otherwise <c>false</c>.</returns>
public static bool Like(this string str, string pattern)
{
return new Regex(
"^" + Regex.Escape(pattern).Replace(@"\*", ".*").Replace(@"\?", ".") + "$",
RegexOptions.IgnoreCase | RegexOptions.Singleline
).IsMatch(str);
}
}
}
(change the namespace and/or copy the extension method to your own string extensions class)
Using this extension, you can write statements like this:
if (File.Name.Like("*.jpg"))
{
....
}
Just sugar to make your code a little more legible :-)

Just for the sake of completeness: since 2016, .NET Core has had a NuGet package called Microsoft.Extensions.FileSystemGlobbing that supports advanced globbing paths.
Typical examples are searches over nested folder structures and files with wildcards, which are very common in web development scenarios:
wwwroot/app/**/*.module.js
wwwroot/app/**/*.js
This works somewhat like the patterns that .gitignore files use to determine which files to exclude from source control.

I found the actual code for you:
Regex.Escape( wildcardExpression ).Replace( @"\*", ".*" ).Replace( @"\?", "." );

The 2- and 3-argument variants of the listing methods like GetFiles() and EnumerateDirectories() take a search string as their second argument that supports filename globbing, with both * and ?.
class GlobTestMain
{
static void Main(string[] args)
{
string[] exes = Directory.GetFiles(Environment.CurrentDirectory, "*.exe");
foreach (string file in exes)
{
Console.WriteLine(Path.GetFileName(file));
}
}
}
would yield
GlobTest.exe
GlobTest.vshost.exe
The docs state that there are some caveats with matching extensions. They also state that 8.3 file names are matched (these may be generated automatically behind the scenes), which can result in "duplicate" matches for some patterns.
The methods that support this are GetFiles(), GetDirectories(), and GetFileSystemEntries(). The Enumerate variants also support this.
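As a rough sketch of the Enumerate variants mentioned above, the helper below lists matching file names lazily and can optionally recurse into subdirectories. The names GlobListing and FileNames are made up for illustration; only Directory.EnumerateFiles and SearchOption come from the framework.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class GlobListing
{
    // Hypothetical helper: returns the names (without directory) of files
    // under 'root' whose names match 'searchPattern' ("*" and "?" supported).
    public static List<string> FileNames(string root, string searchPattern, bool recurse)
    {
        SearchOption option = recurse
            ? SearchOption.AllDirectories
            : SearchOption.TopDirectoryOnly;

        // EnumerateFiles streams results instead of building the whole array first.
        return Directory.EnumerateFiles(root, searchPattern, option)
                        .Select(path => Path.GetFileName(path))
                        .OrderBy(name => name, StringComparer.Ordinal)
                        .ToList();
    }
}
```

Calling GlobListing.FileNames(Environment.CurrentDirectory, "*.exe", false) should list the same files as the GetFiles() example, sorted by name.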

If you want to avoid regular expressions this is a basic glob implementation:
public static class Globber
{
public static bool Glob(this string value, string pattern)
{
int pos = 0;
while (pattern.Length != pos)
{
switch (pattern[pos])
{
case '?':
break;
case '*':
for (int i = value.Length; i >= pos; i--)
{
if (Glob(value.Substring(i), pattern.Substring(pos + 1)))
{
return true;
}
}
return false;
default:
if (value.Length == pos || char.ToUpper(pattern[pos]) != char.ToUpper(value[pos]))
{
return false;
}
break;
}
pos++;
}
return value.Length == pos;
}
}
Use it like this:
Assert.IsTrue("text.txt".Glob("*.txt"));

If you use VB.NET, you can use the Like operator, which has glob-like syntax.
http://www.getdotnetcode.com/gdncstore/free/Articles/Intoduction%20to%20the%20VB%20NET%20Like%20Operator.htm

I have written a globbing library for .NETStandard, with tests and benchmarks. My goal was to produce a library for .NET, with minimal dependencies, that doesn't use Regex, and outperforms Regex.
You can find it here:
github.com/dazinator/DotNet.Glob
https://www.nuget.org/packages/DotNet.Glob/

I wrote a FileSelector class that does selection of files based on filenames. It also selects files based on time, size, and attributes. If you just want filename globbing then you express the name in forms like "*.txt" and similar. If you want the other parameters then you specify a boolean logic statement like "name = *.xls and ctime < 2009-01-01" - implying an .xls file created before January 1st 2009. You can also select based on the negative: "name != *.xls" means all files that are not xls.
Check it out.
Open source. Liberal license.
Free to use elsewhere.

Based on previous posts, I threw together a C# class:
using System;
using System.Text.RegularExpressions;
public class FileWildcard
{
Regex mRegex;
public FileWildcard(string wildcard)
{
string pattern = string.Format("^{0}$", Regex.Escape(wildcard)
.Replace(@"\*", ".*").Replace(@"\?", "."));
mRegex = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
}
public bool IsMatch(string filenameToCompare)
{
return mRegex.IsMatch(filenameToCompare);
}
}
Using it would go something like this:
FileWildcard w = new FileWildcard("*.txt");
if (w.IsMatch("Doug.Txt"))
Console.WriteLine("We have a match");
The matching is NOT the same as the System.IO.Directory.GetFiles() method, so don't use them together.

From C# you can use .NET's LikeOperator.LikeString method. That's the backing implementation for VB's LIKE operator. It supports patterns using *, ?, #, [charlist], and [!charlist].
You can use the LikeString method from C# by adding a reference to the Microsoft.VisualBasic.dll assembly, which is included with every version of the .NET Framework. Then you invoke the LikeString method just like any other static .NET method:
using Microsoft.VisualBasic;
using Microsoft.VisualBasic.CompilerServices;
...
bool isMatch = LikeOperator.LikeString("I love .NET!", "I love *", CompareMethod.Text);
// isMatch should be true.

https://www.nuget.org/packages/Glob.cs
https://github.com/mganss/Glob.cs
A GNU Glob for .NET.
You can get rid of the package reference after installing and just compile the single Glob.cs source file.
And as it's an implementation of GNU Glob, it's cross-platform, and cross-language too once you find a similar implementation for another language. Enjoy!

I don't know if the .NET framework has glob matching, but couldn't you replace the * with .*? and use regexes?
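A small sketch of why the glob should be escaped before the wildcards are swapped in: without escaping, the "." in "*.txt" stays a regex wildcard. The class and method names here are invented for the comparison.

```csharp
using System.Text.RegularExpressions;

public static class NaiveGlobDemo
{
    // Naive: only swaps "*" for ".*", so "." still means "any character".
    public static bool NaiveMatch(string input, string glob)
    {
        return Regex.IsMatch(input, "^" + glob.Replace("*", ".*") + "$");
    }

    // Escaped: Regex.Escape first, then turn the escaped wildcards back on.
    public static bool EscapedMatch(string input, string glob)
    {
        string pattern = Regex.Escape(glob).Replace(@"\*", ".*").Replace(@"\?", ".");
        return Regex.IsMatch(input, "^" + pattern + "$");
    }
}
```

With the glob "*.txt", NaiveMatch accepts "notes_txt" while EscapedMatch rejects it; both accept "notes.txt".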

Just out of curiosity I glanced at Microsoft.Extensions.FileSystemGlobbing, and it was dragging in quite large dependencies on quite a few libraries, so I decided to try writing something similar myself.
Easier said than done: I quickly noticed that it was not such a trivial function after all. For example, "*.txt" should match files only in the current directory, while "**.txt" should also harvest subfolders.
Microsoft also tests some odd pattern sequences like "./*.txt". I'm not sure who actually needs the "./" kind of prefix, since it is removed anyway during processing.
(https://github.com/aspnet/FileSystem/blob/dev/test/Microsoft.Extensions.FileSystemGlobbing.Tests/PatternMatchingTests.cs)
Anyway, I've coded my own function. There will be two copies of it: one in svn (I might fix bugs there later on), and one sample copied here for demo purposes. I recommend copying from the svn link.
SVN Link:
https://sourceforge.net/p/syncproj/code/HEAD/tree/SolutionProjectBuilder.cs#l800
(Search for matchFiles function if not jumped correctly).
And here is a local copy of the function:
/// <summary>
/// Matches files from folder _dir using glob file pattern.
/// In glob file pattern matching * reflects to any file or folder name, ** refers to any path (including sub-folders).
/// ? refers to any character.
///
/// There exists also 3-rd party library for performing similar matching - 'Microsoft.Extensions.FileSystemGlobbing'
/// but it was dragging a lot of dependencies, I've decided to survive without it.
/// </summary>
/// <returns>List of files matches your selection</returns>
static public String[] matchFiles( String _dir, String filePattern )
{
if (filePattern.IndexOfAny(new char[] { '*', '?' }) == -1) // Speed up matching: with no asterisk / wildcard, the pattern can only be a plain file path.
{
String path = Path.Combine(_dir, filePattern);
if (File.Exists(path))
return new String[] { filePattern };
return new String[] { };
}
String dir = Path.GetFullPath(_dir); // Make it absolute, just so we can extract relative path'es later on.
String[] pattParts = filePattern.Replace("/", "\\").Split('\\');
List<String> scanDirs = new List<string>();
scanDirs.Add(dir);
//
// By default glob pattern matching specifies "*" to any file / folder name,
// which corresponds to any character except folder separator - in regex that's "[^\\]*"
// glob matching also allows a double asterisk "**", which also recurses into subfolders.
// We split here each part of match pattern and match it separately.
//
for (int iPatt = 0; iPatt < pattParts.Length; iPatt++)
{
bool bIsLast = iPatt == (pattParts.Length - 1);
bool bRecurse = false;
String regex1 = Regex.Escape(pattParts[iPatt]); // Escape special regex control characters ("*" => "\*", "." => "\.")
String pattern = Regex.Replace(regex1, @"\\\*(\\\*)?", delegate (Match m)
{
if (m.ToString().Length == 4) // "**" => "\*\*" (escaped) - we need to recurse into sub-folders.
{
bRecurse = true;
return ".*";
}
else
return @"[^\\]*";
}).Replace(@"\?", ".");
if (pattParts[iPatt] == "..") // Special kind of control, just to scan upper folder.
{
for (int i = 0; i < scanDirs.Count; i++)
scanDirs[i] = scanDirs[i] + "\\..";
continue;
}
Regex re = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
int nScanItems = scanDirs.Count;
for (int i = 0; i < nScanItems; i++)
{
String[] items;
if (!bIsLast)
items = Directory.GetDirectories(scanDirs[i], "*", (bRecurse) ? SearchOption.AllDirectories : SearchOption.TopDirectoryOnly);
else
items = Directory.GetFiles(scanDirs[i], "*", (bRecurse) ? SearchOption.AllDirectories : SearchOption.TopDirectoryOnly);
foreach (String path in items)
{
String matchSubPath = path.Substring(scanDirs[i].Length + 1);
if (re.Match(matchSubPath).Success)
scanDirs.Add(path);
}
}
scanDirs.RemoveRange(0, nScanItems); // Remove items what we have just scanned.
} //for
// Make relative and return.
return scanDirs.Select( x => x.Substring(dir.Length + 1) ).ToArray();
} //matchFiles
If you find any bugs, I'll be glad to fix them.

I wrote a solution that does it. It does not depend on any library and it does not support "!" or "[]" operators. It supports the following search patterns:
C:\Logs\*.txt
C:\Logs\**\*P1?\**\asd*.pdf
/// <summary>
/// Finds files for the given glob path. It supports ** * and ? operators. It does not support !, [] or ![] operators
/// </summary>
/// <param name="path">the path</param>
/// <returns>The files that match the glob</returns>
private ICollection<FileInfo> FindFiles(string path)
{
List<FileInfo> result = new List<FileInfo>();
//The name of the file can be any but the following chars '<','>',':','/','\','|','?','*','"'
const string folderNameCharRegExp = @"[^\<\>:/\\\|\?\*" + "\"]";
const string folderNameRegExp = folderNameCharRegExp + "+";
//We obtain the file pattern
string filePattern = Path.GetFileName(path);
List<string> pathTokens = new List<string>(Path.GetDirectoryName(path).Split('\\', '/'));
//We obtain the root path from which the rest of the files will be obtained
string rootPath = null;
bool containsWildcardsInDirectories = false;
for (int i = 0; i < pathTokens.Count; i++)
{
if (!pathTokens[i].Contains("*")
&& !pathTokens[i].Contains("?"))
{
if (rootPath != null)
rootPath += "\\" + pathTokens[i];
else
rootPath = pathTokens[i];
pathTokens.RemoveAt(0);
i--;
}
else
{
containsWildcardsInDirectories = true;
break;
}
}
if (Directory.Exists(rootPath))
{
//We build the regular expression that the folders should match
string regularExpression = rootPath.Replace("\\", "\\\\").Replace(":", "\\:").Replace(" ", "\\s");
foreach (string pathToken in pathTokens)
{
if (pathToken == "**")
{
regularExpression += string.Format(CultureInfo.InvariantCulture, @"(\\{0})*", folderNameRegExp);
}
else
{
regularExpression += @"\\" + pathToken.Replace("*", folderNameCharRegExp + "*").Replace(" ", "\\s").Replace("?", folderNameCharRegExp);
}
}
Regex globRegEx = new Regex(regularExpression, RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);
string[] directories = Directory.GetDirectories(rootPath, "*", containsWildcardsInDirectories ? SearchOption.AllDirectories : SearchOption.TopDirectoryOnly);
foreach (string directory in directories)
{
if (globRegEx.Matches(directory).Count > 0)
{
DirectoryInfo directoryInfo = new DirectoryInfo(directory);
result.AddRange(directoryInfo.GetFiles(filePattern));
}
}
}
return result;
}

Unfortunately the accepted answer will not handle escaped input correctly, because string .Replace("\*", ".*") fails to distinguish between "*" and "\*" - it will happily replace "*" in both of these strings, leading to incorrect results.
Instead, a basic tokenizer can be used to convert the glob path into a regex pattern, which can then be matched against a filename using Regex.Match. This is a more robust and flexible solution.
Here is a method to do this. It handles ?, *, and **, and surrounds each of these globs with a capture group, so the values of each glob can be inspected after the Regex has been matched.
static string GlobbedPathToRegex(ReadOnlySpan<char> pattern, ReadOnlySpan<char> dirSeparatorChars)
{
StringBuilder builder = new StringBuilder();
builder.Append('^');
ReadOnlySpan<char> remainder = pattern;
while (remainder.Length > 0)
{
int specialCharIndex = remainder.IndexOfAny('*', '?');
if (specialCharIndex >= 0)
{
ReadOnlySpan<char> segment = remainder.Slice(0, specialCharIndex);
if (segment.Length > 0)
{
string escapedSegment = Regex.Escape(segment.ToString());
builder.Append(escapedSegment);
}
char currentCharacter = remainder[specialCharIndex];
char nextCharacter = specialCharIndex < remainder.Length - 1 ? remainder[specialCharIndex + 1] : '\0';
switch (currentCharacter)
{
case '*':
if (nextCharacter == '*')
{
// We have a ** glob expression
// Match any character, 0 or more times.
builder.Append("(.*)");
// Skip over **
remainder = remainder.Slice(specialCharIndex + 2);
}
else
{
// We have a * glob expression
// Match any character that isn't a dirSeparatorChar, 0 or more times.
if(dirSeparatorChars.Length > 0) {
builder.Append($"([^{Regex.Escape(dirSeparatorChars.ToString())}]*)");
}
else {
builder.Append("(.*)");
}
// Skip over *
remainder = remainder.Slice(specialCharIndex + 1);
}
break;
case '?':
builder.Append("(.)"); // Regex equivalent of ?
// Skip over ?
remainder = remainder.Slice(specialCharIndex + 1);
break;
}
}
else
{
// No more special characters, append the rest of the string
string escapedSegment = Regex.Escape(remainder.ToString());
builder.Append(escapedSegment);
remainder = ReadOnlySpan<char>.Empty;
}
}
builder.Append('$');
return builder.ToString();
}
Then, to use it:
string testGlobPathInput = "/Hello/Test/Blah/**/test*123.fil?";
string globPathRegex = GlobbedPathToRegex(testGlobPathInput, "/"); // Could use "\\/" directory separator chars on Windows
Console.WriteLine($"Globbed path: {testGlobPathInput}");
Console.WriteLine($"Regex conversion: {globPathRegex}");
string testPath = "/Hello/Test/Blah/All/Hail/The/Hypnotoad/test_somestuff_123.file";
Console.WriteLine($"Test Path: {testPath}");
var regexGlobPathMatch = Regex.Match(testPath, globPathRegex);
Console.WriteLine($"Match: {regexGlobPathMatch.Success}");
for(int i = 0; i < regexGlobPathMatch.Groups.Count; i++) {
Console.WriteLine($"Group [{i}]: {regexGlobPathMatch.Groups[i]}");
}
Output:
Globbed path: /Hello/Test/Blah/**/test*123.fil?
Regex conversion: ^/Hello/Test/Blah/(.*)/test([^/]*)123\.fil(.)$
Test Path: /Hello/Test/Blah/All/Hail/The/Hypnotoad/test_somestuff_123.file
Match: True
Group [0]: /Hello/Test/Blah/All/Hail/The/Hypnotoad/test_somestuff_123.file
Group [1]: All/Hail/The/Hypnotoad
Group [2]: _somestuff_
Group [3]: e
I have created a gist here as a canonical version of this method:
https://gist.github.com/crozone/9a10156a37c978e098e43d800c6141ad

Related

How to split a string in razor with no delimeters only size and position [duplicate]

The .NET Framework gives us the Format method:
string s = string.Format("This {0} very {1}.", "is", "funny");
// s is now: "This is very funny."
I would like an "Unformat" function, something like:
object[] parameters = string.Unformat("This {0} very {1}.", "This is very funny.");
// parameters is now: ["is", "funny"]
I know something similar exists in the ANSI-C library (printf vs scanf).
The question: is there something similar in C#?
Update: Capturing groups with regular expressions are not the solution I need. They are also one way. I'm looking for a system that can work both ways in a single format. It's OK to give up some functionality (like types and formatting info).
There's no such method, probably because of problems resolving ambiguities:
string.Unformat("This {0} very {1}.", "This is very very funny.")
// are the parameters equal to "is" and "very funny", or "is very" and "funny"?
Regular expression capturing groups are made for this problem; you may want to look into them.
Regex with grouping?
/This (.*?) very (.*?)./
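To make the ambiguity concrete, here is a small sketch that runs a greedy and a lazy version of the pattern against the same input. The class name, the anchors and the escaped final period are my additions, not part of the answers above.

```csharp
using System.Text.RegularExpressions;

public static class AmbiguityDemo
{
    // Greedy groups grab as much as they can; lazy groups as little as they can.
    public const string Greedy = @"^This (.*) very (.*)\.$";
    public const string Lazy = @"^This (.*?) very (.*?)\.$";

    // Returns the two captured values, or null when the input does not match.
    public static string[] Capture(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? new[] { m.Groups[1].Value, m.Groups[2].Value } : null;
    }
}
```

For "This is very very funny.", the greedy pattern captures "is very" and "funny", while the lazy one captures "is" and "very funny". Both are valid readings, so the caller has to pick one.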
If anyone's interested, I've just posted a scanf() replacement for .NET. If regular expressions don't quite cut it for you, my code follows the scanf() format string quite closely.
You can see and download the code I wrote at http://www.blackbeltcoder.com/Articles/strings/a-sscanf-replacement-for-net.
You could do string[] parts = input.Split(' '), and then extract by index position: parts[1] and parts[3] in your example.
Yep. These are called "regular expressions". The one that will do the thing is
This (?<M0>.+) very (?<M1>.+)\.
@mquander: Actually, PHP solves it differently:
$s = "This is very very funny.";
$fmt = "This %s very %s.";
sscanf($s, $fmt, $one, $two);
echo "<div>one: [$one], two: [$two]</div>\n";
//echo's: "one: [is], two: [very]"
But maybe your regular expression remark can help me. I just need to rewrite "This {0} very {1}." to something like: new Regex(@"^This (.*) very (.*)\.$"). This should be done programmatically, so I can use one format string on the public class interface.
BTW: I already have a parser to find the parameters: see the Named Format Redux blog entry by Phil Haack (and yes, I also want named parameters to work both ways).
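One possible way to do that rewrite programmatically is sketched below. It assumes plain numeric placeholders such as {0} and {1} with no format specifiers, and the names FormatUnformatSketch, ToRegex and Unformat are hypothetical. Lazy groups are used, so ambiguous inputs resolve to the shortest leading values.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FormatUnformatSketch
{
    // Turns "This {0} very {1}." into an anchored regex with one named
    // group (p0, p1, ...) per placeholder.
    public static Regex ToRegex(string format)
    {
        // Regex.Escape turns "{" into "\{" but leaves "}" untouched.
        string escaped = Regex.Escape(format);
        string pattern = Regex.Replace(escaped, @"\\\{(\d+)\}",
            m => "(?<p" + m.Groups[1].Value + ">.+?)");
        return new Regex("^" + pattern + "$", RegexOptions.Singleline);
    }

    // Returns the captured values in placeholder order, or null on no match.
    public static string[] Unformat(string format, string input)
    {
        Match match = ToRegex(format).Match(input);
        if (!match.Success) return null;

        var values = new List<string>();
        // Groups["pN"] reports Success == false once N runs past the last placeholder.
        for (int i = 0; match.Groups["p" + i].Success; i++)
            values.Add(match.Groups["p" + i].Value);
        return values.ToArray();
    }
}
```

The same format string can then drive both string.Format and Unformat, at the cost of giving up types and formatting info.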
I came across the same problem. I believe there is an elegant solution using regex, but I came up with a C# function to "UnFormat" that works quite well. Sorry about the lack of comments.
/// <summary>
/// Unformats a string using the original formatting string.
///
/// Tested Situations:
/// UnFormat("<nobr alt=\"1\">1<nobr>", "<nobr alt=\"{0}\">{0}<nobr>") : "1"
/// UnFormat("<b>2</b>", "<b>{0}</b>") : "2"
/// UnFormat("3<br/>", "{0}<br/>") : "3"
/// UnFormat("<br/>4", "<br/>{0}") : "4"
/// UnFormat("5", "") : "5"
/// UnFormat("<nobr>6<nobr>", "<nobr>{0}<nobr>") : "6"
/// UnFormat("<nobr>2009-10-02<nobr>", "<nobr>{0:yyyy-MM-dd}<nobr>") : "2009-10-02"
/// UnFormat("<nobr><nobr>", "<nobr>{0}<nobr>") : ""
/// UnFormat("bla", "<nobr>{0}<nobr>") : "bla"
/// </summary>
/// <param name="original"></param>
/// <param name="formatString"></param>
/// <returns>If an "unformat" is not possible the original string is returned.</returns>
private Dictionary<int,string> UnFormat(string original, string formatString)
{
Dictionary<int, string> returnList = new Dictionary<int, string>();
try{
int index = -1;
// Decomposes Format String
List<string> formatDecomposed = new List<string> (formatString.Split('{'));
for(int i = formatDecomposed.Count - 1; i >= 0; i--)
{
index = formatDecomposed[i].IndexOf('}') + 1;
if (index > 0 && (formatDecomposed[i].Length - index) > 0)
{
formatDecomposed.Insert(i + 1, formatDecomposed[i].Substring(index, formatDecomposed[i].Length - index));
formatDecomposed[i] = formatDecomposed[i].Substring(0, index);
}
else
//Finished
break;
}
// Finds and indexes format parameters
index = 0;
for (int i = 0; i < formatDecomposed.Count; i++)
{
if (formatDecomposed[i].IndexOf('}') < 0)
{
index += formatDecomposed[i].Length;
}
else
{
// Parameter Index
int parameterIndex;
if (formatDecomposed[i].IndexOf(':')< 0)
parameterIndex = Convert.ToInt16(formatDecomposed[i].Substring(0, formatDecomposed[i].IndexOf('}')));
else
parameterIndex = Convert.ToInt16(formatDecomposed[i].Substring(0, formatDecomposed[i].IndexOf(':')));
// Parameter Value
if (returnList.ContainsKey(parameterIndex) == false)
{
string parameterValue;
if (formatDecomposed.Count > i + 1)
if (original.Length > index)
parameterValue = original.Substring(index, original.IndexOf(formatDecomposed[i + 1], index) - index);
else
// Original String not valid
break;
else
parameterValue = original.Substring(index, original.Length - index);
returnList.Add(parameterIndex, parameterValue);
index += parameterValue.Length;
}
else
index += returnList[parameterIndex].Length;
}
}
// Fail Safe #1
if (returnList.Count == 0) returnList.Add(0, original);
}
catch
{
// Fail Safe #2
returnList = new Dictionary<int, string>();
returnList.Add(0, original);
}
return returnList;
}
Referring to the earlier replies, I wrote a sample; see the following:
string sampleinput = "FirstWord.22222";
Match match = Regex.Match(sampleinput, @"(\w+)\.(\d+)$", RegexOptions.IgnoreCase);
if(match.Success){
string totalmatchstring = match.Groups[0].Value; // FirstWord.22222
string firstpart = match.Groups[1].Value; // FirstWord
string secondpart = match.Groups[2].Value; // 22222
}

How to replace "CREATE VIEW", "CREATE PROCEDURE", "CREATE FUNCTION", "CREATE TRIGGER" in SQL files?

I am writing a method which processes a large number of SQL procedures written by our previous SQL developer.
I am trying to search the files for the following strings CREATE VIEW, CREATE PROCEDURE, CREATE FUNCTION, CREATE TRIGGER.
The search for these strings in the file needs to be case-insensitive and should match any number of spaces between each element, e.g. CREATE VIEW or CREATE     VIEW.
When it finds a match it needs to replace the CREATE with CREATE OR ALTER.
The script shall ignore occurrences such as CREATE TABLE.
The script shall ignore occurrences such as CREATE OR ALTER PROCEDURE.
I started by writing a procedure to process the files line by line (this is because the text to search is always contained within the line), but I got stuck...
/// <summary>
/// This method processes each individual line, executing the replacement where necessary
/// </summary>
/// <param name="line"></param>
/// <returns></returns>
private static string ProcessLine(string line)
{
// how do I perform the logic here?
return line;
}
/// <summary>
/// This method will process each individual file and create a new file with the _new suffix
/// </summary>
/// <param name="file"></param>
public static void ProcessSqlFile(FileInfo file)
{
StringBuilder sb = new StringBuilder();
var lines = File.ReadAllLines(file.FullName);
for (var i = 0; i < lines.Length; i += 1)
{
sb.Append(ProcessLine(lines[i]));
sb.Append(Environment.NewLine);
}
var outputName = Path.Combine(file.DirectoryName, file.Name +"_new");
File.WriteAllText(outputName, sb.ToString());
}
static void Main(string[] args)
{
var inputPath = new DirectoryInfo(@"...");
var files = inputPath.GetFiles("*.sql");
foreach (var fileInfo in files)
{
ProcessSqlFile(fileInfo);
}
}
You may use Regular Expressions (AKA, Regex) for this. For example, you may use the following pattern:
\bcreate\s+(view|procedure|function|trigger)\b
..and replace with:
CREATE OR ALTER $1
Regex demo.
Regex pattern details:
\b - Ensure a word boundary (avoid matching partial words).
\s+ - Match one or more whitespace characters.
(view|procedure|function|trigger) - Match any of the listed words and capture it in group 1.
\b - Ensure a word boundary.
Replacement:
CREATE OR ALTER - Literal string.
$1 - Whatever was captured in group 1.
Full C# example:
string input = "I am trying to search the files for the following strings " +
"CREATE VIEW, CREATE PROCEDURE, CREATE FUNCTION, CREATE TRIGGER";
string output = Regex.Replace(input, @"\bcreate\s+(view|procedure|function|trigger)\b",
@"CREATE OR ALTER $1", RegexOptions.IgnoreCase);
Console.WriteLine(output);
Try it online.
Disclaimer:
As GSerg and Charlieface indicated in the comments, this (and similar solutions) would match false positives in string literals. If you might have those, you'd be better off using an SQL parser as a regex pattern would be overly complicated, in this case, if we wish to cover all edge cases.
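Dropped into the question's ProcessLine skeleton, the pattern might look like the sketch below. The class name SqlCreateRewriter and the cached static Regex field are my own choices, and the string-literal caveat from the disclaimer still applies.

```csharp
using System.Text.RegularExpressions;

public static class SqlCreateRewriter
{
    // Built once and reused for every line of every file.
    private static readonly Regex CreateStatement = new Regex(
        @"\bcreate\s+(view|procedure|function|trigger)\b",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Rewrites "CREATE <object>" to "CREATE OR ALTER <object>".
    // The object keyword keeps its original casing; the whitespace run
    // between the two words collapses to a single space.
    public static string ProcessLine(string line)
    {
        return CreateStatement.Replace(line, "CREATE OR ALTER $1");
    }
}
```

Lines containing CREATE TABLE or an existing CREATE OR ALTER PROCEDURE pass through unchanged, because the word after CREATE is not one of the four listed keywords.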
The solution that seems simplest to me does not use regex, only plain text processing.
I tried to run the following method on a few sql files and it looked good to me.
private static string ProcessLine(string line)
{
if (!line.ToUpper().Contains("CREATE"))
{
return line;
}
var wordArray = line.Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
for (var i = 0; i < wordArray.Length - 1; i++)
{
if (wordArray[i].ToUpper() != "CREATE" ||
(wordArray[i + 1].ToUpper() != "VIEW" && wordArray[i + 1].ToUpper() != "PROCEDURE" && wordArray[i + 1].ToUpper() != "FUNCTION" && wordArray[i + 1].ToUpper() != "TRIGGER")) continue;
return line.Replace(wordArray[i], wordArray[i] + " OR ALTER"); // replace the token as written, so a lower-case "create" is handled too
}
return line;
}

how to integrate a code properly

My code should translate a phrase into Pig Latin. The first letter of each word must be moved to the end of the word, followed by "ay".
Example: wall = "allway"
Any ideas? This is the easiest way I could think of:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace english_to_pig_latin
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine("THIS IS A English to Pig Latin translator");
Console.WriteLine("ENTER Phrase");
string[] phrase = Console.ReadLine().Split(' ');
int words = phrase.Length;
StringBuilder sb = new StringBuilder();
for (int i = 0; i < words; i++)
{
//to add ay in the end
/*sb.Append(phrase[i].ToString());
sb.Append("ay ");
Console.WriteLine(sb);*/
}
Console.ReadLine();
}
}
}
First you need to define your Pig Latin rules; your description lacks the real ones. For instance, English "sharp" is correctly Pig-Latinized as 'arpshay', not 'harpsay' as your description implies. (I prefer 'arp-sh-ay', because hyphens make Pig Latin easier to read and make it possible to translate back into English.) I suggest you first find a set of rules for Pig Latin. Your code is a good start: it separates a phrase into (almost) words. Note, though, that it will split "Please, Joe" into "Please," and "Joe", and you probably do not want that comma sent to your word-by-word translator.
When defining your rules, I suggest you consider how to Pig-Latinize these words:
hello --> 'ellohay' (a normal word),
string --> 'ingstray' ('str' is the whole consonant string moved to the end),
apple --> 'appleway', 'appleay', or 'appleyay', (depending on your dialect of Pig-Latin),
queen --> 'eenquay' ('qu' is the consonant string here),
yellow --> 'ellowyay' (y is consonant here),
rhythm --> 'ythmrhay' (y is vowel here),
sky --> 'yskay' (y is vowel here).
Note that any word that starts with 'qu' (like 'queen') is a special case that needs to be handled too. Note also that y is probably a consonant when it begins an English word, but a vowel when it is in the middle or at the end of a word.
The hyphenated Pig Latin versions of these words would be:
ello-h-ay, ing-str-ay, ('apple-way', 'apple-ay', or 'apple-yay'), 'een-qu-ay', 'ellow-y-ay', 'ythm-rh-ay', and 'y-sk-ay'. The hyphenation allows both easier reading as well as an ability to reverse the Pig Latin back into English by a computer parser. But unfortunately, many people just cram the Pig Latin word together without showing any hyphenation separation, so reversing the translation cannot be done simply without ambiguity.
Real Pig Latin goes by the sound of a word, not its spelling, so without a very complex word-to-phoneme system a fully correct translator is too difficult. Most (good) Pig Latin translators handle the cases above and ignore other exceptions, because English spelling is a poor guide to pronunciation.
So my first suggestion is to get a set of rules. My second suggestion is to use two functions, PigLatinizePhrase() and PigLatinizeWord(), where PigLatinizePhrase() parses a phrase into words (and punctuation) and calls PigLatinizeWord() for each word, excluding any punctuation. You can loop through each character and test char.IsLetter to determine whether it's a letter. If it is, add it to a string builder and move to the next character. If it's not a letter and the string builder is not empty, send that word to your word translator, then add the non-letter to your result. That is the logic of PigLatinizePhrase(). Here is my code, which does just that:
/// <summary>
/// </summary>
/// <param name="eng">English text, paragraphs, etc.</param>
/// <param name="suffixWithNoOnset">Used to differentiate between Pig Latin dialects.
/// Known dialects may use any of: "ay", "-ay", "way", "-way", "yay", or "-yay".
/// Corresponding translations for 'egg' will yield: "eggay", "egg-ay", "eggway", "egg-way", "eggyay", "egg-yay".
/// Or for 'I': "Iay", "I-ay", "Iway", "I-way", "Iyay", "I-yay".
/// </param>
/// <returns></returns>
public static string PigLatinizePhrase(string eng, string suffixWithNoOnset = "-ay")
{
if (eng == null) { return null; } // don't break if null
var word = new StringBuilder(); // only current word, built char by char
var pig = new StringBuilder(); // pig latin text
char prevChar = '\0';
foreach (char thisChar in eng)
{
// the "'" test is so "I'll", "can't", and "Ashley's" will work right.
if (char.IsLetter(thisChar) || thisChar == '\'')
{
word.Append(thisChar);
}
else
{
if (word.Length > 0)
{
pig.Append(PigLatinizeWord(word.ToString(), suffixWithNoOnset));
word = new StringBuilder();
}
pig.Append(thisChar);
}
prevChar = thisChar;
}
if (word.Length > 0)
{
pig.Append(PigLatinizeWord(word.ToString(), suffixWithNoOnset));
}
return pig.ToString();
} // public static string PigLatinizePhrase(string eng, string suffixWithNoOnset = "-ay")
The suffixWithNoOnset variable is simply passed directly to the PigLatinizeWord() method and it determines exactly which 'dialect' of Pig Latin will be used. (See the XML comment before the method in the source code for more clarity.)
For the PigLatinizeWord() method, I found while programming it that it was convenient to split the functionality into two methods: one to parse the English word into the two parts that Pig Latin cares about, and another to do what is desired with those two parts, depending on which dialect of Pig Latin is wanted. Here's the source code for these two functions:
/// <summary>
/// </summary>
/// <param name="eng">English word before being translated to Pig Latin.</param>
/// <param name="suffixWithNoOnset">Used to differentiate between Pig Latin dialects.
/// Known dialects may use any of: "ay", "-ay", "way", "-way", "yay", or "-yay".
/// Corresponding translations for 'egg' will yield: "eggay", "egg-ay", "eggway", "egg-way", "eggyay", "egg-yay".
/// Or for 'I': "Iay", "I-ay", "Iway", "I-way", "Iyay", "I-yay".
/// </param>
/// <returns></returns>
public static string PigLatinizeWord(string eng, string suffixWithNoOnset = "-ay")
{
if (eng == null || eng.Length == 0) { return eng; } // don't break if null or empty
string[] onsetAndEnd = GetOnsetAndEndOfWord(eng);
// string h = string.Empty;
string o = onsetAndEnd[0]; // 'Onset' of first syllable that gets moved to end of word
string e = onsetAndEnd[1]; // 'End' of word, without the onset
bool hyphenate = suffixWithNoOnset.Contains('-');
// if (hyphenate) { h = "-"; }
var sb = new StringBuilder();
if (e.Length > 0) { sb.Append(e); if (hyphenate && o.Length > 0) { sb.Append('-'); } }
if (o.Length > 0) { sb.Append(o); if (hyphenate) { sb.Append('-'); } sb.Append("ay"); }
else { sb.Append(suffixWithNoOnset); }
return sb.ToString();
} // public static string PigLatinizeWord(string eng)
public static string[] GetOnsetAndEndOfWord(string word)
{
if (word == null) { return null; }
// string[] r = ",".Split(',');
string uppr = word.ToUpperInvariant();
if (uppr.StartsWith("QU")) { return new string[] { word.Substring(0,2), word.Substring(2) }; }
int x = 0; if (word.Length <= x) { return new string[] { string.Empty, string.Empty }; }
if ("AOEUI".Contains(uppr[x])) // tests first letter/character
{ return new string[] { word.Substring(0, x), word.Substring(x) }; }
while (++x < word.Length)
{
if ("AOEUIY".Contains(uppr[x])) // tests each character after first letter/character
{ return new string[] { word.Substring(0, x), word.Substring(x) }; }
}
return new string[] { string.Empty, word };
} // public static string[] GetOnsetAndEndOfWord(string word)
I have written a PigLatinize() method in JavaScript before, which was a lot of fun for me. :) I enjoyed making my C# version with more features, giving it the ability to translate to 6 various 'dialects' of Pig Latin, especially since C# is my favorite (programming) language. ;)
I think you need this transformation: phrase[i].Substring(1) + phrase[i][0] + "ay"

Unusual Regex behavior in c#

I have a Regex that is behaving rather oddly and I can't figure why. Original Regex:
Regex regex = new Regex(@"(?i)\d\.\d\dv");
This expression returns/matches an equivalent to 1.35V or 1.35v, which is what I want. However, it is not exclusive enough for my program and it returns some strings I don't need.
Modified Regex:
Regex rgx = new Regex(@"(?i)\d\.\d\dv\s");
Simply by adding '\s' to the expression, it matches/returns DDR3, which is not at all what I want. I'm guessing some sort of inversion is occurring, but I don't understand why and I can't seem to find a reference to explain it. All I wanted to do was add a space to the end of expression to filter a few more results.
Any help would be greatly appreciated.
EDIT:
Here is a functional test case with a generic version of what is going on in my code. Just open a new WPF project in Visual Studio, copy and paste, and it should reproduce the results for you.
namespace WpfApplication1
{
/// <summary>
/// Interaction logic for MainWindow.xaml
/// </summary>
public partial class MainWindow : Window
{
public MainWindow()
{
InitializeComponent();
}
Regex rgx1 = new Regex(@"(?i)\d\.\d\dv");
Regex rgx2 = new Regex(@"(?i)\d\.\d\dv\s");
string testCase = @"DDR3 Vdd | | | | | 1.35v |";
string str = null;
public void IsMatch(string input)
{
Match rgx1Match = rgx1.Match(input);
if (rgx1Match.Success)
{
GetInfo(input);
}
}
public void GetInfo(string input)
{
Match rgx1Match = rgx1.Match(input);
Match rgx2Match = rgx2.Match(input);
string[] tempArray = input.Split();
int index = 0;
if (rgx1Match.Success)
{
index = GetMatchIndex(rgx1, tempArray);
str = tempArray[index].Trim();
global::System.Windows.Forms.MessageBox.Show("First expression match: " + str);
}
if (rgx2Match.Success)
{
index = GetMatchIndex(rgx2, tempArray);
str = tempArray[index].Trim();
System.Windows.Forms.MessageBox.Show(input);
global::System.Windows.Forms.MessageBox.Show("Second expression match: " + str);
}
}
public int GetMatchIndex(Regex expression, string[] input)
{
int index = 0;
for (int i = 0; i < input.Length; i++)
{
if (index < 1)
{
Match rgxMatch = expression.Match(input[i]);
if (rgxMatch.Success)
{
index = i;
}
}
}
return index;
}
private void button1_Click(object sender, RoutedEventArgs e)
{
string line;
IsMatch(testCase);
}
}
}
The GetMatchIndex method is called a number of times in other parts of the code without incident; it is just on this one Regex that I've hit a stumbling block.
The behavior you are seeing has entirely to do with your application logic, and very little to do with the regular expression. In GetMatchIndex, you are defaulting index = 0. So what happens if none of the entries in string[] input match? You get back index = 0, which is the index of DDR3, the first element in string[] input.
You don't see that behavior with the first regular expression, because it matches 1.35v. However, when you add the space to the end, it doesn't match any of the entries in the split input, so you get back the first one by default, which happens to be DDR3. Also, if (rgx2Match.Success) doesn't really help, because you check for a match in the entire string first (which does match because there's a space there), and then search for the index after splitting, which removes the spaces!
The fix is pretty simple: When you are returning an index from an array in a programming language that uses 0-based numbering, the standard way to represent "not found" is with -1 so it doesn't get confused with the valid result of 0. So default index to -1 instead and handle a result of -1 as a special case, i.e., display an error message to the user like "No matches".
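As a minimal sketch of that fix (the class name MatchIndexSketch and the console output are only for illustration, and the input is the test string from the question), GetMatchIndex could return -1 when nothing matches and let the caller treat that as a special case:

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchIndexSketch
{
    // Returns the index of the first element that matches, or -1 when nothing matches.
    public static int GetMatchIndex(Regex expression, string[] input)
    {
        for (int i = 0; i < input.Length; i++)
        {
            if (expression.IsMatch(input[i]))
            {
                return i;
            }
        }
        return -1; // "not found" can no longer be confused with index 0
    }

    public static void Main()
    {
        string[] tokens = "DDR3 Vdd | | | | | 1.35v |".Split();
        var rgx1 = new Regex(@"(?i)\d\.\d\dv");
        var rgx2 = new Regex(@"(?i)\d\.\d\dv\s");

        int first = GetMatchIndex(rgx1, tokens);
        Console.WriteLine(first >= 0 ? "First expression match: " + tokens[first] : "No matches");

        int second = GetMatchIndex(rgx2, tokens);
        Console.WriteLine(second >= 0 ? "Second expression match: " + tokens[second] : "No matches");
    }
}
```

With the question's test string this should print "First expression match: 1.35v" followed by "No matches", instead of silently falling back to DDR3.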
Your question is incorrect:
new Regex(@"(?i)\d\.\d\dv\s").Match("DDR3").Success
is false
In fact, the results seem to work exactly as you'd like.
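To see why the expression itself is not at fault, here is a small sketch (the class and method names are made up for illustration) that runs the second expression against the whole test line and then against the whitespace-split pieces, which is what the question's code effectively does:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WholeLineVersusTokens
{
    public static readonly Regex Rgx2 = new Regex(@"(?i)\d\.\d\dv\s");

    // Matches against the unsplit line, where "1.35v" is followed by a space.
    public static bool MatchesWholeLine(string line)
    {
        return Rgx2.IsMatch(line);
    }

    // Matches against each piece after Split(); no piece contains whitespace any more.
    public static bool MatchesAnyToken(string line)
    {
        return line.Split().Any(token => Rgx2.IsMatch(token));
    }

    public static void Main()
    {
        string testCase = "DDR3 Vdd | | | | | 1.35v |";
        Console.WriteLine(MatchesWholeLine(testCase)); // True
        Console.WriteLine(MatchesAnyToken(testCase));  // False
        Console.WriteLine(Rgx2.IsMatch("DDR3"));       // False
    }
}
```

The expression never matches DDR3; the whole line matches, but none of the split pieces do, so any DDR3 result has to come from the surrounding index logic.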

Filtering file names: getting *.abc without *.abcd, or *.abcde, and so on

Directory.GetFiles(LocalFilePath, searchPattern);
MSDN Notes:
When using the asterisk wildcard character in a searchPattern, such as "*.txt", the matching behavior when the extension is exactly three characters long is different than when the extension is more or less than three characters long. A searchPattern with a file extension of exactly three characters returns files having an extension of three or more characters, where the first three characters match the file extension specified in the searchPattern. A searchPattern with a file extension of one, two, or more than three characters returns only files having extensions of exactly that length that match the file extension specified in the searchPattern. When using the question mark wildcard character, this method returns only files that match the specified file extension. For example, given two files, "file1.txt" and "file1.txtother", in a directory, a search pattern of "file?.txt" returns just the first file, while a search pattern of "file*.txt" returns both files.
The following list shows the behavior of different lengths for the searchPattern parameter:
*.abc returns files having an extension of .abc, .abcd, .abcde, .abcdef, and so on.
*.abcd returns only files having an extension of .abcd.
*.abcde returns only files having an extension of .abcde.
*.abcdef returns only files having an extension of .abcdef.
With the searchPattern parameter set to *.abc, how can I return files having an extension of .abc, not .abcd, .abcde and so on?
Maybe this function will work:
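private static bool StriktMatch(string fileExtension, string searchPattern)
{
bool isStriktMatch = false;
// Take the pattern's extension, or an empty string when the pattern has no '.'.
int dot = searchPattern.LastIndexOf('.');
string extension = dot < 0 ? String.Empty : searchPattern.Substring(dot);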
if (String.IsNullOrEmpty(extension))
{
isStriktMatch = true;
}
else if (extension.IndexOfAny(new char[] { '*', '?' }) != -1)
{
isStriktMatch = true;
}
else if (String.Compare(fileExtension, extension, true) == 0)
{
isStriktMatch = true;
}
else
{
isStriktMatch = false;
}
return isStriktMatch;
}
Test Program:
class Program
{
static void Main(string[] args)
{
string[] fileNames = Directory.GetFiles("C:\\document", "*.abc");
ArrayList al = new ArrayList();
for (int i = 0; i < fileNames.Length; i++)
{
FileInfo file = new FileInfo(fileNames[i]);
if (StriktMatch(file.Extension, "*.abc"))
{
al.Add(fileNames[i]);
}
}
fileNames = (String[])al.ToArray(typeof(String));
foreach (string s in fileNames)
{
Console.WriteLine(s);
}
Console.Read();
}
}
Does anybody have a better solution?
The answer is that you must do post filtering. GetFiles alone cannot do it. Here's an example that will post process your results. With this you can use a search pattern with GetFiles or not - it will work either way.
List<string> fileNames = new List<string>();
// populate all filenames here with a Directory.GetFiles or whatever
string srcDir = "from"; // set this
string destDir = "to"; // set this too
// this filters the names in the list to just those that end with ".doc" (Where requires using System.Linq)
foreach (var f in fileNames.Where(name => name.ToLower().EndsWith(".doc")))
{
try
{
File.Copy(Path.Combine(srcDir, f), Path.Combine(destDir, f));
}
catch { ... }
}
Not a bug, perverse but well-documented behavior. *.doc matches *.docx based on 8.3 fallback lookup.
You will have to manually post-filter the results for ending in doc.
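One way to do that post-filtering, sketched under the assumption that comparing the extension case-insensitively is what you want (the helper name WithExactExtension and the sample file names are hypothetical):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class ExactExtensionFilter
{
    // Keeps only paths whose extension equals `extension` (e.g. ".doc"), ignoring case.
    public static IEnumerable<string> WithExactExtension(IEnumerable<string> paths, string extension)
    {
        return paths.Where(p =>
            string.Equals(Path.GetExtension(p), extension, StringComparison.OrdinalIgnoreCase));
    }

    public static void Main()
    {
        // Hypothetical names standing in for a Directory.GetFiles(dir, "*.doc") result.
        var candidates = new[] { "a.doc", "b.docx", "c.DOC" };
        foreach (string path in WithExactExtension(candidates, ".doc"))
        {
            Console.WriteLine(path);
        }
    }
}
```

For the sample names this prints a.doc and c.DOC and drops b.docx. Path.GetExtension works on the string alone, so it does not touch the file system for each entry.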
Use LINQ to post-filter the results:
string strSomePath = "c:\\SomeFolder";
string strSomePattern = "*.abc";
string[] filez = Directory.GetFiles(strSomePath, strSomePattern);
var filtrd = from f in filez
where f.EndsWith( strSomePattern.TrimStart('*'), StringComparison.OrdinalIgnoreCase ) // drop the leading '*' so ".abc" is compared
select f;
foreach (string strSomeFileName in filtrd)
{
Console.WriteLine( strSomeFileName );
}
This won't help in the short term, but voting on the MS Connect post for this issue may get things changed in the future.
http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=95415
Since for "*.abc" GetFiles will return extensions of 3 or more, anything with a length of 3 after the "." is an exact match, and anything longer is not.
string[] fileList = Directory.GetFiles(path, "*.abc");
foreach (string file in fileList)
{
FileInfo fInfo = new FileInfo(file);
if (fInfo.Extension.Length == 4) // "." is counted in the length
{
// exact extension match - process the file...
}
}
Not sure of the performance of the above - while it uses simple length comparisons rather than string manipulations, new FileInfo() is called each time around the loop.
