Parsing a file or directory from semi-random text - C#

I've got a method that is going to perform some SVN commands (using SharpSVN) on a collection of files and/or directories, based on what the user has selected within a textbox on the form.
Quickly storing some highlighted text in a variable and looking at it, sample data might be like this:
Modified -- C:\\\folder\\\trunk\\\SubFolderOne\\\SubFolderTwo\\\SubThree\r\nModified -- C:\\\folder\\\trunk\\\SubFolderOne\\\SubFolderTwo\\\SubThree\\\myFile.cs
I'm trying to write a Regex to parse out anything in between a space and the \r character, but I can't figure it out.
I thought the pattern would be something like this:
@"\s\S*\\r"
But using my sample data here it yields this as a result:
C:\\\folder\\\trunk\\\SubFolderOne\\\SubFolderTwo\\\SubThree\r
Then I'm just going to throw each result (ie proper path/file) into a collection of strings which will be used elsewhere in the application.
Is there a better way to do this using the Path class, hopefully?
One thing I can think of would be to split up the data using substring any time it finds \r\n, then simply drop the "prefix" (Modified --, NotVersioned --, Normal --) from the strings.
That seems really... poor though.
If it helps, I do know that the top-most directory will always be C:\\folder\\trunk

I would recommend that you split the string on "\r\n" and then match each string. For example:
Regex re = new Regex(@"\s(\S*?)$");
foreach (var line in s.Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries))
{
    // Match against the current line, not the whole input string.
    Match m = re.Match(line);
    Console.WriteLine("{0},{1},'{2}'", m.Index, m.Length, m.Groups[1].Value);
}
That works when tested against your sample text.

You can use regex lookahead and lookbehind
String pattern = @"(?<=--\s)\S*(?=\\r|$)";
var result = Regex.Matches(input, pattern);
foreach (Match match in result)
{
    Console.WriteLine(match.Value);
}
The Path class is not meant for parsing text that is not already a valid path. Use either a regex or split and substring. Both approaches work; prefer the one you can easily read, explain and change.
var paths =
    Regex.Split(input, @"\\r\\n")
        .Select(row => row.Substring(row.LastIndexOf(' ') + 1));
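The split-and-drop-the-prefix idea from the question can be sketched as follows. This is only a minimal sketch: it assumes the text contains real CR/LF characters (not the literal characters \r\n), that every line has the form "Status -- path", and the names StatusLineParser and ExtractPaths are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class StatusLineParser
{
    // Hypothetical helper: returns the path part of each "Status -- path" line.
    public static List<string> ExtractPaths(string input)
    {
        var paths = new List<string>();
        var lines = input.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var line in lines)
        {
            // Everything after the first " -- " is treated as the path,
            // so paths that contain spaces survive intact.
            int separator = line.IndexOf(" -- ", StringComparison.Ordinal);
            if (separator < 0)
                continue; // not a status line; skip it
            paths.Add(line.Substring(separator + 4).Trim());
        }
        return paths;
    }

    public static void Main()
    {
        string input = "Modified -- C:\\folder\\trunk\\SubFolderOne\r\n" +
                       "NotVersioned -- C:\\folder\\trunk\\SubFolderOne\\myFile.cs";
        foreach (var path in ExtractPaths(input))
            Console.WriteLine(path);
    }
}
```

Cutting at the " -- " separator rather than at the last space means the status names (Modified, NotVersioned, Normal) never need to be listed explicitly.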

Related

any character between keywords with a space

I'm creating a blacklist of keywords which I want to check for in text files. However, I'm having trouble finding any regex documentation which will help me with the following issue.
I have a set of blacklisted keywords:
welcome, goodbye, join us
I want to check some text files for any matches. I'm using the following regex to match exact words and also the pluralized version.
string.Format(@"\b{0}s*\b", keyword)
However, I've run into an issue matching keywords that consist of two words with any character in between. The above regex matches 'join us', but I need to match 'join#us' or 'join_us' as well, for example.
Any help would be greatly appreciated.
I think that the "any character in between" may cause you a lot of trouble. For example, let's consider this:
We want to find "my elf"... but you probably don't want to match "myself".
Anyway, if this is OK with you, replace the space character with a dot in the keyword using string.Replace.
A dot (.) in a regex will match any character.
If you are new to regexes, check this useful cheat sheet: http://www.mikesdotnetting.com/article/46/c-regular-expressions-cheat-sheet
To solve the issue with "myself" and "my elf", use something more careful than . in the regex. For example [^a-zA-Z] which will match anything except letters from a to z and A to Z, or maybe \W, which will match non-word character, which means anything except a-zA-Z0-9_, so it is equivalent to [^a-zA-Z0-9_].
Also be careful about plural forms like city - cities and all the irregular ones.
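As a rough sketch of the "something more careful than a dot" idea, the pattern below replaces each space in a keyword with [^a-zA-Z], so exactly one non-letter character is allowed between the words. The names KeywordPattern, Build and IsBlacklisted are hypothetical, and the trailing s* is simply carried over from the question's pattern.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordPattern
{
    // Hypothetical helper: builds a pattern for a single blacklist keyword.
    public static string Build(string keyword)
    {
        // Escape each word separately, then join the words with
        // "exactly one character that is not a letter".
        string[] parts = keyword.Split(' ');
        string joined = string.Join("[^a-zA-Z]", parts.Select(p => Regex.Escape(p)));
        return @"\b" + joined + @"s*\b";
    }

    public static bool IsBlacklisted(string text, string keyword)
    {
        return Regex.IsMatch(text, Build(keyword), RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(IsBlacklisted("come join#us today", "join us")); // True
        Console.WriteLine(IsBlacklisted("come join_us today", "join us")); // True
        Console.WriteLine(IsBlacklisted("I did it myself", "my elf"));     // False
    }
}
```

Note that [^a-zA-Z] accepts the underscore in 'join_us', whereas \W would not, because the underscore counts as a word character.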
You could try something like this (I left only the {0} part of the regex):
var relevantChars = new char[] { ',', '#' }; // add here anything you like
string.Format(@"{0}", keyword.Replace(" ", "(" + string.Join("|", relevantChars) + ")"));
If you're set on using pluralization, you will have to use the PluralizationService (see this answer for more details).
And seeing that you're using string.Format, I assume you're looping over your blacklist array.
So why not do it all in a neat method?
public static string GetBlacklistRegexString(string[] blacklist)
{
    // It seems that this service only supports English natively; to check later
    var ps = PluralizationService.CreateService(CultureInfo.GetCultureInfo("en"));
    // Using a StringBuilder for ease of use and performance,
    // even though it's not easy on the eye :p
    StringBuilder sb = new StringBuilder().Append(@"\b(");
    // We're just going to make a single regex with all the words
    // and their plurals in a list, so we're looping here
    foreach (var word in blacklist)
    {
        // Using a dot wasn't careful indeed... Feel free to replace
        // "\W" with anything that does it for you. It matches any
        // non-word character (anything except letters, digits and '_')
        var regexPlural = ps.Pluralize(word).Replace(" ", @"\W");
        var regexWord = word.Replace(" ", @"\W");
        sb.Append(regexWord).Append('|').Append(regexPlural).Append('|');
    }
    sb.Remove(sb.Length - 1, 1); // removing the last '|'
    sb.Append(@")\b");
    return sb.ToString();
}
The usage is nothing surprising if you're already using regular expressions in .NET:
static void Main(string[] args)
{
    string[] blacklist = { "Goodbye", "Welcome", "join us" };
    string input = "Welcome, come join us at dummywebsite.com for fun and games, goodbye!";
    // I assume that you want it case insensitive
    Regex blacklistRegex = new Regex(GetBlacklistRegexString(blacklist), RegexOptions.IgnoreCase);
    foreach (Match match in blacklistRegex.Matches(input))
    {
        Console.WriteLine(match);
    }
    Console.ReadLine();
}
The console shows the expected output:
Welcome
join us
goodbye
Edit: still have a problem (working on it later), if "man" is in your keywords, it will match the "men" in "women"... Weirdly I don't get this behaviour on regexhero.
Edit 2: duh, of course if I don't group the words with parentheses, the word boundaries are just applied to the first and last one... Corrected.

Regex that excludes all except some words

I thought that filtering a string like:
"Hello <strong>plip</strong> plop"
to obtain
"plip plop", that is, excluding all words except 'plip' and 'plop' would be easy with this C# line:
new Regex("[^(plip)(plop)]").Replace(inputString, "")
Unfortunately, the excluding brackets [^] do not seem to accept whole words to exclude: the expression keeps every letter contained in 'plip' and 'plop' (the result is "llooplipoplop").
Is there a way to achieve this in a single regex/line, or is it necessary to loop over all matches of plip and plop, then concatenate them?
Generally speaking, it is much easier to write a regex that matches what you do want than one that matches all the stuff you don't want.
In this case you want to "exclude all words except plip and plop", but why not just include only plip and plop instead?
var input = "Hello <strong>plip</strong> plop";
var matches = Regex.Matches(input, "plip|plop");
var result = string.Join("", matches.Cast<Match>().Select(x => x.Value));
Console.Out.WriteLine(result); // prints "plipplop"
Of course since you asked for a one-liner, you could do everything without the temp variables (and good luck to the next guy reading the code!):
var result = string.Join("", Regex.Matches("Hello <strong>plip</strong> plop", "plip|plop").Cast<Match>().Select(x => x.Value));
Also, assuming your actual word list is more complicated than plip and plop, you can do something like var pattern = string.Join("|", words); to construct the pattern.
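Building the pattern from a word list might look like the sketch below. The Regex.Escape call and the \b word boundaries are additions that the answer above does not use (the boundaries assume every word starts and ends with a letter or digit), and WordFilter and KeepOnly are made-up names. Joining the matches with a space gives the "plip plop" the question asked for.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordFilter
{
    // Hypothetical helper: keeps only the listed words, in order of appearance.
    public static string KeepOnly(string input, params string[] words)
    {
        // Escape each word so regex metacharacters in it are taken literally.
        string alternatives = string.Join("|", words.Select(w => Regex.Escape(w)));
        string pattern = @"\b(?:" + alternatives + @")\b";
        var kept = Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value);
        return string.Join(" ", kept);
    }

    public static void Main()
    {
        // prints "plip plop"
        Console.WriteLine(KeepOnly("Hello <strong>plip</strong> plop", "plip", "plop"));
    }
}
```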
I hope this works:
(?<=(\bplip\b|\bplop\b|^)).*?(?=(\bplip\b|\bplop\b|$))
You should set single-line mode (RegexOptions.Singleline) for the above regex to work.

Matching a substring of any length and characters using RegEx

I would like to be able to match and then extract all substrings in the following string using regex in c#:
"2012-05-15 00:49:02 192.168.100.10 POST /Microsoft-Server-ActiveSync/default.eas User=nikced&DeviceId=ApplDNWGRKZQDTC0&DeviceType=iPhone&Cmd=Ping&Log=V121_Sst8_LdapC0_LdapL0_RpcC31_RpcL50_Hb3540_Erq1_Pk1728465481_S2_ 443 redcloud\nikced 94.234.170.42 Apple-iPhone4C1/902.179 200 0 64 3140491"
Since it's a log file, the regex should be able to handle any line of a similar type.
In this case, the preferred output to a collection should be:
2012-05-15
00:49:02
192.168.100.10
/Microsoft-Server-ActiveSync/default.eas
User=nikced&DeviceId=ApplDNWGRKZQDTC0&DeviceType=iPhone&Cmd=Ping&Log=V121_Sst8_LdapC0_LdapL0_RpcC31_RpcL50_Hb3540_Erq1_Pk1728465481_S2_
443
redcloud\nikced
94.234.170.42
Apple-iPhone4C1/902.179
200
0
64
3140491
I'd appreciate any answer using C#, .NET and Regex to extract the above substrings into a collection (MatchCollection preferred). All log lines follow the same format and pattern.
Incredibly complex regex incoming:
logFile.Split(' ');
This will give you an array that you can iterate through to retrieve all of the substrings, which are separated by a space:
string[] lines = log.Split(' ');
You don't need to use a Regex. You can simply use String.Split Method, and specify space as separator:
string [] substrings = line.Split(new Char [] {' '});
If you need to identify the kind of each part, then you should specify what you need to find, and a regex can be created for it.
Anyway, if you really want to use a Regex, do this:
Regex re = new Regex(@"(?:(?<s>[^ ]+)(?: |$))*");
This will give you all the captures in the "s" group, when you call the Match method.
As the OP pointed out in a comment, the separator can be something other than a single space, so the possible separators should be included in the (?: |$) and the [^ ] parts of the expression. For example, if both space and tab are possible separators, replace those parts with (?: |\t|$) and [^ \t]. If you need to accept more than one of those characters in a row as a separator, add a + after the () group:
(?:(?<s>[^ \t]+)(?: |\t|$)+)*
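Reading the captures of the "s" group could look roughly like this. It is a sketch only: CaptureDemo and Fields are made-up names, and it assumes the line does not start with a separator (in that case the first match is empty and nothing is captured).

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptureDemo
{
    // Hypothetical helper: returns every value captured by the named group "s".
    public static string[] Fields(string line)
    {
        var re = new Regex(@"(?:(?<s>[^ \t]+)(?: |\t|$)+)*");
        Match m = re.Match(line);

        // A group inside a repeated construct keeps one Capture per iteration.
        CaptureCollection captures = m.Groups["s"].Captures;
        var fields = new string[captures.Count];
        for (int i = 0; i < captures.Count; i++)
            fields[i] = captures[i].Value;
        return fields;
    }

    public static void Main()
    {
        foreach (string field in Fields("2012-05-15 00:49:02\t192.168.100.10  POST"))
            Console.WriteLine(field);
    }
}
```

Group.Value alone would only return the last capture, which is why the Captures collection is used here.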
The fastest and most obvious way is to use String.Split:
string[] substrings = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
But if you insist on a MatchCollection, then this will do what you want:
MatchCollection substrings = Regex.Matches(line, @"\S+");
Really, you just need to break this down into the parts.
First, the date. Will it always be in YYYY-MM-DD format? Could it be possible that it will be different based on region/culture settings?
(?<LogDate>\d\d\d\d-\d\d-\d\d)
Next, you have the time. Same thing:
(?<LogTime>\d\d:\d\d:\d\d)
Next, I'm assuming this is the web method that was actually called? Not entirely sure, since you haven't really explained how the data is laid out. However, I'm assuming it's going to be either POST or GET, so that's what we're going to do next...
(?<LogMethod>POST|GET)
Just do this for every part of the log line you're interested in, and you'll be set. For example:
(?<LogDate>\d\d\d\d-\d\d-\d\d) (?<LogTime>\d\d:\d\d:\d\d) (?<LogMethod>POST|GET)...
If you want to anchor to the start/end of the line, be sure to use ^ and $ respectively. When you get the Matches, you can get the values from each group by indexing the Groups property with the named group (such as match.Groups["LogMethod"].Value). Good luck!
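Put together, the named-group approach might look like the sketch below. The group names Ip and Url, the class LogLineParser and the shortened sample line are assumptions made for illustration; the question does not say what each field means, so extend the pattern field by field for the real format.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LogLineParser
{
    // One named group per field; only the first five fields are covered here.
    private static readonly Regex LineRegex = new Regex(
        @"^(?<LogDate>\d{4}-\d{2}-\d{2}) (?<LogTime>\d{2}:\d{2}:\d{2}) " +
        @"(?<Ip>\S+) (?<LogMethod>POST|GET) (?<Url>\S+)");

    public static Match Parse(string line)
    {
        return LineRegex.Match(line);
    }

    public static void Main()
    {
        // Shortened version of the sample line from the question.
        string line = "2012-05-15 00:49:02 192.168.100.10 POST " +
                      "/Microsoft-Server-ActiveSync/default.eas User=nikced 443";
        Match match = Parse(line);
        if (match.Success)
        {
            Console.WriteLine(match.Groups["LogDate"].Value);
            Console.WriteLine(match.Groups["LogTime"].Value);
            Console.WriteLine(match.Groups["LogMethod"].Value);
            Console.WriteLine(match.Groups["Url"].Value);
        }
    }
}
```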

Remove substring from a list of strings

I have a list of strings that contain banned words. What's an efficient way of checking if a string contains any of the banned words and removing it from the string? At the moment, I have this:
cleaned = String.Join(" ", str.Split().Where(b => !bannedWords.Contains(b,
StringComparer.OrdinalIgnoreCase)).ToArray());
This works fine for single banned words, but not for phrases (e.g. more than one word). Any instance of more than one word should also be removed. An alternative I thought of trying is to use the List's Contains method, but that only returns a bool and not an index of the matching word. If I could get an index of the matching word, I could just use String.Replace(bannedWords[i],"");
A simple String.Replace will not work as it will remove word parts. If "sex" is a banned word and you have the word "sextet", which is not banned, you should keep it as is.
Using Regex you can find whole words and phrases in a text with
string text = "A sextet is a musical composition for six instruments or voices.";
string word = "sex";
var matches = Regex.Matches(text, @"(?<=\b)" + word + @"(?=\b)");
The matches collection will be empty in this case.
You can use the Regex.Replace method
foreach (string word in bannedWords) {
    text = Regex.Replace(text, @"(?<=\b)" + word + @"(?=\b)", "");
}
Note: I used the following Regex pattern
(?<=prefix)find(?=suffix)
where 'prefix' and 'suffix' are both \b, which denotes word beginnings and ends.
If your banned words or phrases can contain special characters, it would be safer to escape them with Regex.Escape(word).
Using @zmbq's idea you could create a Regex pattern once with
string pattern =
    @"(?<=\b)(" +
    String.Join(
        "|",
        bannedWords
            .Select(w => Regex.Escape(w))
            .ToArray()) +
    @")(?=\b)";
var regex = new Regex(pattern); // Parsed once here; pass RegexOptions.Compiled to compile it to IL
and then apply it repeatedly to different texts with
string result = regex.Replace(text, "");
It doesn't work because you have conflicting definitions.
When you want to look for phrases of more than one word, you cannot split on whitespace anymore. You'll have to fall back on String.IndexOf().
If it's performance you're after, I assume you're not worried about one-time setup time, but rather about continuous performance. So I'd build one huge regular expression containing all the banned expressions and make sure it's compiled - that's as a setup.
Then I'd try to match it against the text, and replace every match with a blank or whatever you want to replace it with.
The reason for this, is that a big regular expression should compile into something comparable to the finite state automaton you would create by hand to handle this problem, so it should run quite nicely.
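A minimal sketch of that setup, assuming case-insensitive matching is wanted and that every banned entry starts and ends with a letter or digit (so the \b boundaries behave). BannedWordFilter, Build and Clean are hypothetical names; RegexOptions.Compiled is what actually asks .NET to compile the expression.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class BannedWordFilter
{
    // One-time setup: a single alternation covering every banned word or phrase.
    public static Regex Build(IEnumerable<string> bannedWords)
    {
        // Longer entries first, so a phrase wins over a shorter word it starts with.
        var alternatives = bannedWords
            .OrderByDescending(w => w.Length)
            .Select(w => Regex.Escape(w));
        string pattern = @"\b(?:" + string.Join("|", alternatives) + @")\b";
        return new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
    }

    public static string Clean(Regex banned, string text)
    {
        string removed = banned.Replace(text, "");
        // Collapse the runs of spaces left behind by removed words.
        return Regex.Replace(removed, @" {2,}", " ").Trim();
    }

    public static void Main()
    {
        Regex banned = Build(new[] { "sex", "bad phrase" });
        // prints "A sextet is a here"
        Console.WriteLine(Clean(banned, "A sextet is a bad phrase here"));
    }
}
```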
Why don't you iterate through the list of banned words and look up each of them in the string by using the method string.IndexOf.
For example, you can remove the banned words and phrases with the following piece of code:
myForbWords.ForEach(delegate(string item) {
    int occ = str.IndexOf(item);
    if (occ > -1) str = str.Remove(occ, item.Length);
});
Type of myForbWords is List<string>.

Remove all "invisible" chars from a string?

I'm writing a little class to read a list of key value pairs from a file and write to a Dictionary<string, string>. This file will have this format:
key1:value1
key2:value2
key3:value3
...
This should be pretty easy to do, but since a user is going to edit this file manually, how should I deal with whitespace, tabs, extra line breaks and stuff like that? I can probably use Replace to remove spaces and tabs, but are there any other "invisible" characters I'm missing?
Or maybe I can remove all characters that are not alphanumeric, ":" or line breaks (since line breaks are what separate one pair from another), and then remove all extra line breaks. If so, I don't know how to remove "all-except-some" characters.
Of course I can also check for errors like "key1:value1:somethingelse". But stuff like that doesn't really matter much because it's obviously the user's fault and I would just show a "Invalid format" message. I just want to deal with the basic stuff and then put all that in a try/catch block just in case anything else goes wrong.
Note: I do NOT need any whitespaces at all, even inside a key or a value.
I did this one recently when I finally got pissed off at too much undocumented garbage forming bad XML coming through in a feed. It effectively strips anything that doesn't fall between a space and the ~ in the ASCII table:
public static string StripControlChars(this string s)
{
    return Regex.Replace(s, @"[^\x20-\x7E]", "");
}
Combined with the other RegEx examples already posted it should get you where you want to go.
If you use Regex (Regular Expressions) you can filter out all of that with one function.
string newVariable = Regex.Replace(variable, @"\s", "");
That will remove all whitespace characters, including spaces, tabs, \n, and \r.
One of the "white" spaces that regularly bites us is the non-breaking space. Also, our system must be compatible with MS-Dynamics, which is much more restrictive. First, I created a function that maps the 8-bit characters to their approximate 7-bit counterparts, then I removed anything that was not in the x20 to x7f range, further limited by the Dynamics interface.
Regex.Replace(s, @"[^\x20-\x7F]", "")
should do that job.
The requirements are too fuzzy. Consider:
"When is a space a value? key?"
"When is a delimiter a value? key?"
"When is a tab a value? key?"
"Where does a value end when a delimiter is used in the context of a value? key"?
These problems will result in code filled with one-offs and a poor user experience. This is why we have language rules/grammar.
Define a simple grammar and take out most of the guesswork.
"{key}":"{value}",
Here you have a key/value pair contained within quotes and separated via a delimiter (,). All extraneous characters can be ignored. You could use XML, but this may scare off less techy users.
Note, the quotes are arbitrary. Feel free to replace with any set container that will not need much escaping (just beware the complexity).
Personally, I would wrap this up in a simple UI and serialize the data out as XML. There are times not to do this, but you have given me no reason not to.
var split = textLine.Split(':').Select(s => s.Trim()).ToArray();
The Trim() function will remove all the irrelevant whitespace. Note that this retains whitespace inside of a key or value, which you may want to consider separately.
You can use string.Trim() to remove white-space characters:
var results = lines
    .Select(line => {
        var pair = line.Split(new[] {':'}, 2);
        return new {
            Key = pair[0].Trim(),
            Value = pair[1].Trim(),
        };
    }).ToList();
However, if you want to remove all white-spaces, you can use regular expressions:
var whiteSpaceRegex = new Regex(@"\s+", RegexOptions.Compiled);
var results = lines
    .Select(line => {
        var pair = line.Split(new[] {':'}, 2);
        return new {
            Key = whiteSpaceRegex.Replace(pair[0], string.Empty),
            Value = whiteSpaceRegex.Replace(pair[1], string.Empty),
        };
    }).ToList();
If it doesn't have to be fast, you could use LINQ:
string clean = new String(tainted.Where(c => 0 <= "ABCDabcd1234:\r\n".IndexOf(c)).ToArray());
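Tying the pieces together, loading the pairs into a Dictionary<string, string> might look like this sketch. It assumes, as the question states, that no whitespace is wanted anywhere and that a line such as "key1:value1:somethingelse" should be rejected as invalid. KeyValueFile and Parse are made-up names, and the lines are passed in directly rather than read from a file.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class KeyValueFile
{
    private static readonly Regex WhiteSpace = new Regex(@"\s+");

    // Hypothetical helper: turns "key:value" lines into a dictionary.
    public static Dictionary<string, string> Parse(IEnumerable<string> lines)
    {
        var result = new Dictionary<string, string>();
        foreach (string rawLine in lines)
        {
            // Strip every whitespace character, including inside keys and values.
            string line = WhiteSpace.Replace(rawLine, string.Empty);
            if (line.Length == 0)
                continue; // blank or whitespace-only line

            string[] pair = line.Split(':');
            if (pair.Length != 2 || pair[0].Length == 0)
                throw new FormatException("Invalid format: " + rawLine);

            result[pair[0]] = pair[1]; // a repeated key overwrites the earlier value
        }
        return result;
    }

    public static void Main()
    {
        var lines = new[] { " key1 : value 1 ", "", "key2:\tvalue2" };
        foreach (var entry in Parse(lines))
            Console.WriteLine(entry.Key + " => " + entry.Value);
    }
}
```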
