C# Pig Latin with Regex Replace

First off: this is a homework problem, just to get that out there. I'm trying to build a Pig Latin translator in C#. We have to use Regex replace, and I'm having some issues. We are not allowed to use the Split method to obtain an array of words; we have to use the static Replace method of the Regex type. White space, punctuation, line breaks, etc. should be preserved, and capitalized words should remain capitalized. For those unfamiliar with the rules of Pig Latin:
If the string begins with a vowel, add "way" to the string. (vowels are a,e,i,o,u)
Examples: Pig-Latin for "orange" is "orangeway", Pig-Latin for “eating” is “eatingway”
Otherwise, find the first occurrence of a vowel, move all the characters before the vowel to the end of the word, and add "ay".
(in the middle of the word ‘y’ also counts as a vowel, but NOT at the beginning)
Examples: Pig-Latin for "story" is "orystay" since the characters "st" occur before the first vowel; Pig-Latin for "crystal" is "ystalcray", but Pig-Latin for "yellow" is "ellowyay".
If there are no vowels, add "ay". Examples: Pig-Latin for "mph" is "mphay", Pig-Latin for "RPM" is "RPMay".
I've got a ton of commented out code, so I'll remove that for reading ease.
My test sentence is "Eat monkey poo." I'm getting "Ewayaayt moaynkeayy poayoay."
I know Regex is 'greedy', but I can't figure out how to get it to stop with just the first vowel it finds. Using Textboxes as well.
namespace AssignmentPigLatin
{
public partial class MainWindow : Window
{
public MainWindow()
{
InitializeComponent();
OriginalTb.Text = "Eat monkey poo.";
}
private void translateButton_Click(object sender, RoutedEventArgs e)
{
string vowels = "[AEIOUaeiou]";
var regex = new Regex(vowels);
var translation = regex.Replace(OriginalTb.Text, TranslateToPigLatin);
PigLatinTb.Text = translation;
}
private void ClearButton_Click(object sender, RoutedEventArgs e)
{
OriginalTb.Text = "";
PigLatinTb.Text = "";
}
static string TranslateToPigLatin(Match match)
{
string word = match.ToString();
string firstLetters = word.Substring(0, match.Length);
string restLetters = word.Substring(firstLetters.Length - 1, word.Length-1);
string newWord;
if (match.Index == 0)
{
return word + "way";
}
else
{
return restLetters + firstLetters + "ay";
}
}
}
}

The question was interesting to answer. Don't forget to attribute me ;)
Add this method to your MainWindow class:
private string PigLatinTranslator(string s)
{
s = Regex.Replace(s, @"(\b[aeiou]\w+)", "$1way", RegexOptions.IgnoreCase);
List<string> words = new List<string>();
foreach (Match v in Regex.Matches(s, @"\w+"))
{
string result;
if (!v.Value.EndsWith("way"))
{
result = Regex.Replace(v.Value, @"([^aeiou]*)([aeiou])(\w+)", "$2$3$1ay", RegexOptions.IgnoreCase);
words.Add(result);
}
else { words.Add(v.Value); }
}
s = string.Join(" ", words);
words.Clear();
foreach (Match v in Regex.Matches(s, @"\w+"))
{
string result = Regex.Replace(v.Value, @"\b([^aeiou]+)\b", "$1ay", RegexOptions.IgnoreCase);
words.Add(result);
}
s = string.Join(" ", words);
return s;
}
Call it like this:
string test = "MPH Eat monkey poo."; // Added MPH, so that you can test my method works or not.
string result = PigLatinTranslator(test);
Console.WriteLine(result); // MPHay Eatway onkeymay oopay (the final period is lost when the words are re-joined)

An easier and clearer solution is to use Regex.Replace with a lambda.
static string TranslateToPigLatin(string input)
{
char[] vowels = new[] { 'A', 'E', 'I', 'O', 'U', 'a', 'e', 'i', 'o', 'u' };
char[] vowelsExtended = vowels.Concat(new[] { 'Y', 'y' }).ToArray();
string output = Regex.Replace(input, @"\w+", m =>
{
string word = m.Value;
if (vowels.Contains(word[0]))
return word + "way";
else
{
int indexOfVowel = word.IndexOfAny(vowelsExtended, 1);
if (indexOfVowel == -1)
return word + "ay";
else
return word.Substring(indexOfVowel) + word.Substring(0, indexOfVowel) + "ay";
}
});
return output;
}
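The garbled output in the question arises because the pattern `[AEIOUaeiou]` matches each vowel on its own, so the callback runs once per vowel rather than once per word. Below is a minimal sketch of a word-level alternative, assuming words are runs of ASCII letters. The class name `PigLatinSketch` and the inner first-vowel pattern are illustrative choices, not taken from the answers above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PigLatinSketch
{
    // Translates every run of ASCII letters; spaces, digits and punctuation are left untouched.
    public static string Translate(string text)
    {
        return Regex.Replace(text, "[A-Za-z]+", m =>
        {
            string word = m.Value;
            // First vowel: a, e, i, o, u anywhere, or y anywhere except at the start of the word.
            Match vowel = Regex.Match(word, "[aeiou]|(?<!^)y", RegexOptions.IgnoreCase);
            if (!vowel.Success) return word + "ay";     // no vowel at all: "mph" -> "mphay"
            if (vowel.Index == 0) return word + "way";  // starts with a vowel: "orange" -> "orangeway"
            // Move the leading consonants to the end: "story" -> "orystay"
            return word.Substring(vowel.Index) + word.Substring(0, vowel.Index) + "ay";
        });
    }
}
```

Because only the letters are matched, `PigLatinSketch.Translate("Eat monkey poo.")` keeps the spaces and the final period in place.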

Related

C# - Identify the matching character when using String.Split(CharArray)

If I use the Split() function on a string, passing in various split characters as a char[] parameter, and given that the matching split character is removed from the string, how can I identify which character it matched & split on?
string inputString = "Hello, there| world";
char[] splitChars = new char[] { ',', '|' };
foreach(string section in inputString.Split(splitChars))
{
Console.WriteLine(section); // [0] Hello [1] there [2] world (no splitChars)
}
I understand that perhaps it's not possible to retain this information with my approach. If that's the case, could you suggest an alternative approach?
The C# Regex.Split() method can return the split characters as well as the words between them: text matched by capturing groups in the pattern is included in the result.
string inputString = "Hello, there| world";
string pattern = @"(,)|([|])";
foreach (string result in Regex.Split(inputString, pattern))
{
Console.WriteLine("'{0}'", result);
}
the result is:
'Hello'
','
' there'
'|'
' world'
Use the Regex.Split() method. I have wrapped this method in the following extension method that is as easy to use as string.Split() itself:
public static string[] ExtendedSplit(this string input, char[] splitChars)
{
string pattern = string.Join("|", splitChars.Select(x => "(" + Regex.Escape(x.ToString()) + ")"));
return Regex.Split(input, pattern);
}
Usage:
string inputString = "Hello, there| world";
char[] splitChars = new char[]{',', '|'};
foreach (string result in inputString.ExtendedSplit(splitChars))
{
Console.WriteLine("'{0}'", result);
}
Output:
'Hello'
','
' there'
'|'
' world'
No, but it's rather trivial to write one yourself. Remember, framework methods aren't magic; somebody wrote them. If something doesn't exactly match your needs, write one that does!
static IEnumerable<(string Sector, char Separator)> Split(
this string s,
IEnumerable<char> separators,
bool removeEmptyEntries)
{
var buffer = new StringBuilder();
var separatorsSet = new HashSet<char>(separators);
foreach (var c in s)
{
if (separatorsSet.Contains(c))
{
if (!removeEmptyEntries || buffer.Length > 0)
yield return (buffer.ToString(), c);
buffer.Clear();
}
else
buffer.Append(c);
}
if (buffer.Length > 0)
yield return (buffer.ToString(), default(char));
}
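A usage sketch for this kind of iterator may make the idea clearer. The class `SeparatorSplitSketch`, the method `SplitWithSeparators` and the helper `Describe` are hypothetical names for illustration. This simplified variant always yields empty segments and uses '\0' to mark the final segment, which has no separator after it.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SeparatorSplitSketch
{
    // Yields each segment together with the separator that ended it
    // ('\0' for the final segment, which has no separator after it).
    public static IEnumerable<(string Segment, char Separator)> SplitWithSeparators(
        this string s, IEnumerable<char> separators)
    {
        var set = new HashSet<char>(separators);
        var buffer = new StringBuilder();
        foreach (char c in s)
        {
            if (set.Contains(c))
            {
                yield return (buffer.ToString(), c);
                buffer.Clear();
            }
            else
            {
                buffer.Append(c);
            }
        }
        if (buffer.Length > 0)
            yield return (buffer.ToString(), '\0');
    }

    // Formats each (segment, separator) pair as a readable line.
    public static List<string> Describe(string input, char[] separators)
    {
        var lines = new List<string>();
        foreach (var (segment, separator) in input.SplitWithSeparators(separators))
            lines.Add(separator == '\0' ? $"'{segment}' (end)" : $"'{segment}' then '{separator}'");
        return lines;
    }
}
```

Calling `SeparatorSplitSketch.Describe("Hello, there| world", new[] { ',', '|' })` pairs "Hello" with ',' and " there" with '|', and reports " world" as the final segment.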

C# string.split() separate string by uppercase

I've been using the Split() method to split strings, but that only works if you give string.Split() a specific character to split on. Is there any way to split a string wherever an uppercase letter occurs?
Is it possible to get separate words from an unseparated string like:
DeleteSensorFromTemplate
so that the resulting string looks like:
Delete Sensor From Template
Use Regex.Split:
string[] split = Regex.Split(str, @"(?<!^)(?=[A-Z])");
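To get from the split array to the spaced string the question asks for, the parts can be re-joined. A minimal sketch follows; the class name `CamelCaseSketch` and method name `Spaced` are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CamelCaseSketch
{
    // Splits before every uppercase letter except one at the very start,
    // then re-joins the pieces with single spaces.
    public static string Spaced(string input)
    {
        string[] parts = Regex.Split(input, @"(?<!^)(?=[A-Z])");
        return string.Join(" ", parts);
    }
}
```

The pattern is zero-width, so no characters are consumed by the split. Note that it splits before every capital, so an acronym such as "UPC" would come out as "U P C".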
Another way with regex:
public static string SplitCamelCase(string input)
{
return System.Text.RegularExpressions.Regex.Replace(input, "([A-Z])", " $1", System.Text.RegularExpressions.RegexOptions.Compiled).Trim();
}
If you do not like RegEx and you really just want to insert the missing spaces, this will do the job too:
public static string InsertSpaceBeforeUpperCase(this string str)
{
var sb = new StringBuilder();
char previousChar = char.MinValue; // Unicode '\0'
foreach (char c in str)
{
if (char.IsUpper(c))
{
// If not the first character and previous character is not a space, insert a space before uppercase
if (sb.Length != 0 && previousChar != ' ')
{
sb.Append(' ');
}
}
sb.Append(c);
previousChar = c;
}
return sb.ToString();
}
I had some fun with this one and came up with a function that splits by case, as well as groups together caps (it assumes title case for whatever follows) and digits.
Examples:
Input -> "TodayIUpdated32UPCCodes"
Output -> "Today I Updated 32 UPC Codes"
Code (please excuse the funky symbols I use)...
public static string[] SplitByCase(this string s) {
var ʀ = new List<string>();
var ᴛ = new StringBuilder();
var previous = SplitByCaseModes.None;
foreach(var ɪ in s) {
SplitByCaseModes mode_ɪ;
if(string.IsNullOrWhiteSpace(ɪ.ToString())) {
mode_ɪ = SplitByCaseModes.WhiteSpace;
} else if("0123456789".Contains(ɪ)) {
mode_ɪ = SplitByCaseModes.Digit;
} else if(ɪ == ɪ.ToString().ToUpper()[0]) {
mode_ɪ = SplitByCaseModes.UpperCase;
} else {
mode_ɪ = SplitByCaseModes.LowerCase;
}
if((previous == SplitByCaseModes.None) || (previous == mode_ɪ)) {
ᴛ.Append(ɪ);
} else if((previous == SplitByCaseModes.UpperCase) && (mode_ɪ == SplitByCaseModes.LowerCase)) {
if(ᴛ.Length > 1) {
ʀ.Add(ᴛ.ToString().Substring(0, ᴛ.Length - 1));
ᴛ.Remove(0, ᴛ.Length - 1);
}
ᴛ.Append(ɪ);
} else {
ʀ.Add(ᴛ.ToString());
ᴛ.Clear();
ᴛ.Append(ɪ);
}
previous = mode_ɪ;
}
if(ᴛ.Length != 0) ʀ.Add(ᴛ.ToString());
return ʀ.ToArray();
}
private enum SplitByCaseModes { None, WhiteSpace, Digit, UpperCase, LowerCase }
Here's another different way if you don't want to be using string builders or RegEx, which are totally acceptable answers. I just want to offer a different solution:
string Split(string input)
{
string result = "";
for (int i = 0; i < input.Length; i++)
{
if (char.IsUpper(input[i]))
{
result += ' ';
}
result += input[i];
}
return result.Trim();
}

How to remove lowercase on a textbox?

I'm trying to remove the lower case letters on a TextBox..
For example, short alpha code representing the insurance (e.g., 'BCBS' for 'Blue Cross Blue Shield'):
txtDesc.text = "Blue Cross Blue Shield";
string Code = //This must be BCBS..
Is it possible? Please help me. Thanks!
Well you could use a regular expression to remove everything that wasn't capital A-Z:
using System;
using System.Text.RegularExpressions;
class Program
{
static void Main( string[] args )
{
string input = "Blue Cross Blue Shield 12356";
Regex regex = new Regex("[^A-Z]");
string output = regex.Replace(input, "");
Console.WriteLine(output);
}
}
Note that this would also remove any non-ASCII characters. An alternative regex would be:
Regex regex = new Regex(@"[^\p{Lu}]");
... I believe that should cover upper-case letters of all cultures.
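The difference between the two patterns only shows up with letters outside A-Z. A small sketch comparing them is below; the class and method names are illustrative, and the accented letter is written as a \u escape to avoid source-encoding issues.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UpperOnlySketch
{
    // Keeps only the ASCII capitals A-Z.
    public static string AsciiUpper(string s) => Regex.Replace(s, "[^A-Z]", "");

    // Keeps every character in the Unicode category Lu (uppercase letter).
    public static string UnicodeUpper(string s) => Regex.Replace(s, @"[^\p{Lu}]", "");
}
```

For "Blue Cross Blue Shield 12356" both methods give "BCBS". For "\u00C9mile Zola" (a capital E with an acute accent followed by "mile Zola"), `AsciiUpper` keeps only "Z", while `UnicodeUpper` keeps the accented capital as well.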
string Code = new String(txtDesc.Text.Where(c => char.IsUpper(c)).ToArray());
Here is my variant:
var input = "Blue Cross Blue Shield 12356";
var sb = new StringBuilder();
foreach (var ch in input) {
if (char.IsUpper(ch)) { // only keep uppercase
sb.Append(ch);
}
}
sb.ToString(); // "BCBS"
I normally like to use regular expressions, but I don't know how to select "only uppercase" in them without [A-Z] which will break badly on characters outside the English alphabet (even other Latin characters! :-/)
Happy coding.
But see Mr. Skeet's answer for the regex way ;-)
Without Regex:
string input = "Blue Cross Blue Shield";
string output = new string(input.Where(Char.IsUpper).ToArray());
Response.Write(output);
string Code = Regex.Replace(txtDesc.Text, "[a-z\\s]", ""); // also strips whitespace so the result is "BCBS"
I'd map the value to your abbreviation in a dictionary like:
Dictionary<string, string> valueMap = new Dictionary<string, string>();
valueMap.Add("Blue Cross Blue Shield", "BCBS");
string Code = "";
if(valueMap.ContainsKey(txtDesc.Text))
Code = valueMap[txtDesc.Text];
else
// Handle
But if you still want the functionality you mention use linq:
string newString = new string(txtDesc.Text.Where(c => char.IsUpper(c)).ToArray());
You can try use the 'Replace lowercase characters with star' implementation, but change '*' to '' (blank)
So the code would look something like this:
txtDesc.Text = "Blue Cross Blue Shield";
string TargetString = txtDesc.Text;
string MainString = TargetString;
for (int i = 0; i < TargetString.Length; i++)
{
if (char.IsLower(TargetString[i]))
{
TargetString = TargetString.Replace( TargetString[ i ].ToString(), string.Empty );
i--; // a new character has moved into position i, so check it again
}
}
Console.WriteLine("The string {0} has converted to {1}", MainString, TargetString);
string caps = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
string.Join("",
"Blue Cross Blue Shield".Select(c => caps.IndexOf(c) > -1 ? c.ToString() : "")
.ToArray());
Rather than matching on all capitals, I think the specification would require matching the first character of every word. This would allow for inconsistent input but still be reliable in the long run. For this reason, I suggest using the following code. It uses an aggregate on each Match from the Regex object and appends the value to a string object called output.
string input = "Blue Cross BLUE shield 12356";
Regex regex = new Regex("\\b\\w");
string output = regex.Matches(input).Cast<Match>().Aggregate("", (current, match) => current + match.Value);
Console.WriteLine(output.ToUpper()); // outputs BCBS1
This isn't perfect but should work (and passes your BCBS test):
private static string AlphaCode(String Input)
{
List<String> capLetter = new List<String>();
foreach (Char c in Input)
{
if (char.IsLetter(c))
{
String letter = c.ToString();
if (letter == letter.ToUpper()) { capLetter.Add(letter); }
}
}
return String.Join(String.Empty, capLetter.ToArray());
}
And this version will handle strange input scenarios (this makes sure the first letter of each word is capitalized).
private static string AlphaCode(String Input)
{
String capCase = System.Globalization.CultureInfo.CurrentCulture.TextInfo.ToTitleCase(Input.ToString().ToLower());
List<String> capLetter = new List<String>();
foreach (Char c in capCase)
{
if (char.IsLetter(c))
{
String letter = c.ToString();
if (letter == letter.ToUpper()) { capLetter.Add(letter); }
}
}
return String.Join(String.Empty, capLetter.ToArray());
}

C# Regex for Movie Filename

I have been trying to use a C# Regex unsuccessfully to remove certain strings from a movie name.
Examples of the file names I'm working with are:
EuroTrip (2004) [SD]
Event Horizon (1997) [720]
Fast & Furious (2009) [1080p]
Star Trek (2009) [Unknown]
I'd like to remove anything in square brackets or parentheses (including the brackets themselves).
So far I'm using:
movieTitleToFetch = Regex.Replace(movieTitleToFetch, "([*\\(\\d{4}\\)])", "");
Which seems to remove the Year and Parenthesis ok, but I just can't figure out how to remove the Square Brackets and content without affecting other parts... I've had miscellaneous results but the closest one has been:
movieTitleToFetch = Regex.Replace(movieTitleToFetch, "([?\\[+A-Z+\\]])", "");
Which left me with:
urorip (2004)
Instead of:
EuroTrip (2004) [SD]
Any whitespace that is left at the ends are ok as I will just perform
movieTitleToFetch = movieTitleToFetch.Trim();
at the end.
Thanks in advance,
Alex
This regex pattern should work ok... maybe needs a bit of tweaking
"[\[\(].+?[\]\)]"
Regex.Replace(movieTitleToFetch, @"[\[\(].+?[\]\)]", "");
This should match anything from either "[" or "(" until the next occurrence of "]" or ")".
If that does not work, try removing the escape character for the parentheses, like so...
Regex.Replace(movieTitleToFetch, @"[\[(].+?[\])]", "");
@Craigt is pretty much spot on, but it's possibly cleaner to ensure that the brackets are matched.
([\[].*?[\]]|[\(].*?[\)])
I know I'm late to this thread, but I wrote a simple algorithm to sanitize downloaded movie filenames.
It runs these steps:
Removes everything in brackets (if it finds a year there, it tries to keep that info)
Removes a list of commonly used words (720p, bdrip, h264 and so on...)
Assumes that there can be language info in the title and removes it when it is at the end of the remaining string (before the special words)
If a year was not found in parentheses, it looks at the end of the remaining string (as for languages)
While doing this it replaces dots and spaces so the title is ready, for example, to be used as a query for a search API.
Here's the test in xUnit (I used mostly Italian titles to test it):
using Grappachu.Movideo.Core.Helpers.TitleCleaner;
using SharpTestsEx;
using Xunit;
namespace Grappachu.MoVideo.Test
{
public class TitleCleanerTest
{
[Theory]
[InlineData("Avengers.Confidential.La.Vedova.Nera.E.Punisher.2014.iTALiAN.Bluray.720p.x264 - BG.mkv",
"Avengers Confidential La Vedova Nera E Punisher", 2014)]
[InlineData("Fuck You, Prof! (2013) BDRip 720p HEVC ITA GER AC3 Multi Sub PirateMKV.mkv",
"Fuck You, Prof!", 2013)]
[InlineData("Il Libro della Giungla(2016)(BDrip1080p_H264_AC3 5.1 Ita Eng_Sub Ita Eng)by siste82.avi",
"Il Libro della Giungla", 2016)]
[InlineData("Il primo dei bugiardi (2009) [Mux by Little-Boy]", "Il primo dei bugiardi", 2009)]
[InlineData("Il.Viaggio.Di.Arlo-The.Good.Dinosaur.2015.DTS.ITA.ENG.1080p.BluRay.x264-BLUWORLD",
"il viaggio di arlo", 2015)]
[InlineData("La Mafia Uccide Solo D'estate 2013 .avi",
"La Mafia Uccide Solo D'estate", 2013)]
[InlineData("Ip.Man.3.2015.iTA.AC3.5.1.448.Chi.Aac.BluRay.m1080p.x264.Sub.[scambiofile.info].mkv",
"Ip Man 3", 2015)]
[InlineData("Inferno.2016.BluRay.1080p.AC3.ITA.AC3.ENG.Subs.x264-WGZ.mkv",
"Inferno", 2016)]
[InlineData("Ghostbusters.2016.iTALiAN.BDRiP.EXTENDED.XviD-HDi.mp4",
"Ghostbusters", 2016)]
[InlineData("Transcendence.mkv", "Transcendence", null)]
[InlineData("Being Human (Forsyth, 1994).mkv", "Being Human", 1994)]
public void Clean_should_return_title_and_year_when_possible(string filename, string title, int? year)
{
var res = MovieTitleCleaner.Clean(filename);
res.Title.ToLowerInvariant().Should().Be.EqualTo(title.ToLowerInvariant());
res.Year.Should().Be.EqualTo(year);
}
}
}
And the first version of the code:
using System;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
namespace Grappachu.Movideo.Core.Helpers.TitleCleaner
{
public class MovieTitleCleanerResult
{
public string Title { get; set; }
public int? Year { get; set; }
public string SubTitle { get; set; }
}
public class MovieTitleCleaner
{
private const string SpecialMarker = "§=§";
private static readonly string[] ReservedWords;
private static readonly string[] SpaceChars;
private static readonly string[] Languages;
static MovieTitleCleaner()
{
ReservedWords = new[]
{
SpecialMarker, "hevc", "bdrip", "Bluray", "x264", "h264", "AC3", "DTS", "480p", "720p", "1080p"
};
var cultures = CultureInfo.GetCultures(CultureTypes.AllCultures);
var l = cultures.Select(x => x.EnglishName).ToList();
l.AddRange(cultures.Select(x => x.ThreeLetterISOLanguageName));
Languages = l.Distinct().ToArray();
SpaceChars = new[] {".", "_", " "};
}
public static MovieTitleCleanerResult Clean(string filename)
{
var temp = Path.GetFileNameWithoutExtension(filename);
int? maybeYear = null;
// Remove what's inside brackets trying to keep year info.
temp = RemoveBrackets(temp, '{', '}', ref maybeYear);
temp = RemoveBrackets(temp, '[', ']', ref maybeYear);
temp = RemoveBrackets(temp, '(', ')', ref maybeYear);
// Removes special markers (codecs, formats, etc.)
var tokens = temp.Split(SpaceChars, StringSplitOptions.RemoveEmptyEntries);
var title = string.Empty;
for (var i = 0; i < tokens.Length; i++)
{
var tok = tokens[i];
if (ReservedWords.Any(x => string.Equals(x, tok, StringComparison.OrdinalIgnoreCase)))
{
if (title.Length > 0)
break;
}
else
{
title = string.Join(" ", title, tok).Trim();
}
}
temp = title;
// Remove languages infos when are found before special markers (should not remove "English" if it's inside the title)
tokens = temp.Split(SpaceChars, StringSplitOptions.RemoveEmptyEntries);
for (var i = tokens.Length - 1; i >= 0; i--)
{
var tok = tokens[i];
if (Languages.Any(x => string.Equals(x, tok, StringComparison.OrdinalIgnoreCase)))
tokens[i] = string.Empty;
else
break;
}
title = string.Join(" ", tokens).Trim();
// If year is not found inside parenthesis try to catch at the end, just after the title
if (!maybeYear.HasValue)
{
var resplit = title.Split(SpaceChars, StringSplitOptions.RemoveEmptyEntries);
var last = resplit.Last();
if (LooksLikeYear(last))
{
maybeYear = int.Parse(last);
title = title.Replace(last, string.Empty).Trim();
}
}
// TODO: review this. when there's one dash separates main title from subtitle
var res = new MovieTitleCleanerResult();
res.Year = maybeYear;
if (title.Count(x => x == '-') == 1)
{
var sp = title.Split('-');
res.Title = sp[0];
res.SubTitle = sp[1];
}
else
{
res.Title = title;
}
return res;
}
private static string RemoveBrackets(string inputString, char openChar, char closeChar, ref int? maybeYear)
{
var str = inputString;
while (str.IndexOf(openChar) > 0 && str.IndexOf(closeChar) > 0)
{
var dataGraph = str.GetBetween(openChar.ToString(), closeChar.ToString());
if (LooksLikeYear(dataGraph))
{
maybeYear = int.Parse(dataGraph);
}
else
{
var parts = dataGraph.Split(SpaceChars, StringSplitOptions.RemoveEmptyEntries);
foreach (var part in parts)
if (LooksLikeYear(part))
{
maybeYear = int.Parse(part);
break;
}
}
str = str.ReplaceBetween(openChar, closeChar, string.Format(" {0} ", SpecialMarker));
}
return str;
}
private static bool LooksLikeYear(string dataRound)
{
return Regex.IsMatch(dataRound, "^(19|20)[0-9][0-9]$"); // anchored at both ends so int.Parse cannot fail on trailing text
}
}
public static class StringUtils
{
public static string GetBetween(this string src, string a, string b,
StringComparison comparison = StringComparison.Ordinal)
{
var idxStr = src.IndexOf(a, comparison);
var idxEnd = src.IndexOf(b, comparison);
if (idxStr >= 0 && idxEnd > 0)
{
if (idxStr > idxEnd)
Swap(ref idxStr, ref idxEnd);
return src.Substring(idxStr + a.Length, idxEnd - idxStr - a.Length);
}
return src;
}
private static void Swap<T>(ref T idxStr, ref T idxEnd)
{
var temp = idxEnd;
idxEnd = idxStr;
idxStr = temp;
}
public static string ReplaceBetween(this string s, char begin, char end, string replacement = null)
{
var regex = new Regex(string.Format("\\{0}.*?\\{1}", begin, end));
return regex.Replace(s, replacement ?? string.Empty);
}
}
}
This does the trick:
@"(\[[^\]]*\])|(\([^\)]*\))"
It removes anything from "[" to the next "]" and anything from "(" to the next ")".
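As a sketch of how that pattern could be applied to the file names from the question (the class and method names here are hypothetical), with the Trim the asker planned to do anyway:

```csharp
using System;
using System.Text.RegularExpressions;

public static class MovieTitleSketch
{
    // Removes every [...] and (...) group, then trims the leftover whitespace.
    public static string StripBrackets(string title)
    {
        return Regex.Replace(title, @"(\[[^\]]*\])|(\([^\)]*\))", "").Trim();
    }
}
```

For example, `MovieTitleSketch.StripBrackets("Fast & Furious (2009) [1080p]")` leaves "Fast & Furious".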
Can you just use:
string MovieTitle="Star Trek (2009) [Unknown]";
movieTitleToFetch= MovieTitle.IndexOf('(')>MovieTitle.IndexOf('[')?
MovieTitle.Substring(0,MovieTitle.IndexOf('[')):
MovieTitle.Substring(0,MovieTitle.IndexOf('('));
Can't we use this instead:
if(movieTitleToFetch.Contains("("))
movieTitleToFetch=movieTitleToFetch.Substring(0,movieTitleToFetch.IndexOf("("));
The above code will return the bare movie titles for these strings:
EuroTrip (2004) [SD]
Event Horizon (1997) [720]
Fast & Furious (2009) [1080p]
Star Trek (2009) [Unknown]
If there is a case where you do not have the year but only the type, i.e.:
EuroTrip [SD]
Event Horizon [720]
Fast & Furious [1080p]
Star Trek [Unknown]
then use this:
if(movieTitleToFetch.Contains("("))
movieTitleToFetch=movieTitleToFetch.Substring(0,movieTitleToFetch.IndexOf("("));
else if(movieTitleToFetch.Contains("["))
movieTitleToFetch=movieTitleToFetch.Substring(0,movieTitleToFetch.IndexOf("["));
I came up with .+\s(?<year>\(\d{4}\))\s(?<format>\[\w+\]) which matches any of your examples, and contains the year and format as named capture groups to help you replace them.
This pattern translates as:
Any character, one or more repetitions
Whitespace
Literal '(' followed by 4 digits followed by literal ')' (year)
Whitespace
Literal '[' followed by alphanumeric, one or more repetitions, followed by literal ']' (format)
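The named groups described above can be read back through `Match.Groups`. The sketch below uses a variation of the pattern, not the answer's exact one: the brackets are moved outside the groups, a `title` group is added, and the whole string is anchored. The class name and the `TryParse` signature are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MovieNameParserSketch
{
    // title, then "(yyyy)", then "[format]", separated by single whitespace characters.
    private static readonly Regex Pattern = new Regex(
        @"^(?<title>.+)\s\((?<year>\d{4})\)\s\[(?<format>\w+)\]$");

    public static bool TryParse(string name, out string title, out int year, out string format)
    {
        Match m = Pattern.Match(name);
        title = m.Success ? m.Groups["title"].Value : string.Empty;
        year = m.Success ? int.Parse(m.Groups["year"].Value) : 0;
        format = m.Success ? m.Groups["format"].Value : string.Empty;
        return m.Success;
    }
}
```

A name without both the year and the format, such as "Transcendence", simply fails to match and the method returns false.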

Find substring ignoring specified characters

Do any of you know of an easy/clean way to find a substring within a string while ignoring some specified characters to find it. I think an example would explain things better:
string: "Hello, -this- is a string"
substring to find: "Hello this"
chars to ignore: "," and "-"
found the substring, result: "Hello, -this"
Using Regex is not a requirement for me, but I added the tag because it feels related.
Update:
To make the requirement clearer: I need the resulting substring with the ignored chars, not just an indication that the given substring exists.
Update 2:
Some of you are reading too much into the example, sorry. I'll give another scenario that should work:
string: "?A&3/3/C)412&"
substring to find: "A41"
chars to ignore: "&", "/", "3", "C", ")"
found the substring, result: "A&3/3/C)41"
And as a bonus (not required per se), it would be great if the solution did not assume that the substring to find is free of ignored characters; e.g., given the last example we should be able to do:
substring to find: "A3C412&"
chars to ignore: "&", "/", "3", "C", ")"
found the substring, result: "A&3/3/C)412&"
Sorry if I wasn't clear before, or still I'm not :).
Update 3:
Thanks to everyone who helped! This is the implementation I'm working with for now:
http://www.pastebin.com/pYHbb43Z
And here are some tests:
http://www.pastebin.com/qh01GSx2
I'm using some custom extension methods I'm not including, but I believe they should be self-explanatory (I will add them if you like).
I've taken a lot of your ideas for the implementation and the tests, but I'm giving the answer to @PierrOz because he was one of the first and pointed me in the right direction.
Feel free to keep giving suggestions as alternative solutions or comments on the current state of the impl. if you like.
In your example you would do:
string input = "Hello, -this-, is a string";
string ignore = "[-,]*";
Regex r = new Regex(string.Format("H{0}e{0}l{0}l{0}o{0} {0}t{0}h{0}i{0}s{0}", ignore));
Match m = r.Match(input);
return m.Success ? m.Value : string.Empty;
Dynamically you would build the part [-, ] with all the characters to ignore and you would insert this part between all the characters of your query.
Take care of '-' in the class []: put it at the beginning or at the end
So more generically, it would give something like:
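public string Test(string query, string input, char[] ignoreList)
{
    // Character class of ignorable characters; '-' goes first so it is not read as a range.
    // Regex.Escape does not escape ']', so a ']' in ignoreList would still need a manual backslash.
    string ignorePattern = "[";
    for (int i = 0; i < ignoreList.Length; i++)
    {
        if (ignoreList[i] == '-')
        {
            // Insert returns a new string, so the result has to be assigned back.
            ignorePattern = ignorePattern.Insert(1, "-");
        }
        else
        {
            ignorePattern += Regex.Escape(ignoreList[i].ToString());
        }
    }
    ignorePattern += "]*";
    // Allow any run of ignorable characters after each character of the query.
    string pattern = "";
    for (int i = 0; i < query.Length; i++)
    {
        pattern += Regex.Escape(query[i].ToString()) + ignorePattern;
    }
    Regex r = new Regex(pattern);
    Match m = r.Match(input);
    return m.Success ? m.Value : string.Empty;
}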
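A variant sketch of the same idea is shown below, under two assumptions that are not in the answer above. The ignorable characters are written as an escaped alternation instead of a character class, and the "gap" is placed only between the query characters, so the match ends at the last query character as in the question's expected results. `IgnoreCharsSketch` and `Find` are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class IgnoreCharsSketch
{
    // Returns the matched text including any ignored characters, or "" when nothing matches.
    public static string Find(string input, string query, char[] ignore)
    {
        // Zero or more ignorable characters, written as an escaped alternation
        // so that '-', ']' or '^' need no special treatment.
        string gap = ignore.Length == 0
            ? ""
            : "(?:" + string.Join("|", ignore.Select(c => Regex.Escape(c.ToString()))) + ")*";
        // The gap goes between the query characters only, never after the last one.
        string pattern = string.Join(gap, query.Select(c => Regex.Escape(c.ToString())));
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : string.Empty;
    }
}
```

Because the gap is optional at every position, the query may itself contain ignorable characters (the "bonus" case in the question); the regex engine backtracks until each query character lines up.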
Here's a non-regex string extension option:
public static class StringExtensions
{
public static bool SubstringSearch(this string s, string value, char[] ignoreChars, out string result)
{
if (String.IsNullOrEmpty(value))
throw new ArgumentException("Search value cannot be null or empty.", "value");
bool found = false;
int matches = 0;
int startIndex = -1;
int length = 0;
for (int i = 0; i < s.Length && !found; i++)
{
if (startIndex == -1)
{
if (s[i] == value[0])
{
startIndex = i;
++matches;
++length;
}
}
else
{
if (s[i] == value[matches])
{
++matches;
++length;
}
else if (ignoreChars != null && ignoreChars.Contains(s[i]))
{
++length;
}
else
{
startIndex = -1;
matches = 0;
length = 0;
}
}
found = (matches == value.Length);
}
if (found)
{
result = s.Substring(startIndex, length);
}
else
{
result = null;
}
return found;
}
}
EDIT: here's an updated solution addressing the points in your recent update. The idea is the same, except that if you have a single substring, it needs to insert the ignore pattern between each character. If the substring contains spaces, it splits on the spaces and inserts the ignore pattern between those words. If you don't need the latter functionality (which was more in line with your original question), you can remove the Split and the if check that builds that pattern.
Note that this approach is not going to be the most efficient.
string input = @"foo ?A&3/3/C)412& bar A341C2";
string substring = "A41";
string[] ignoredChars = { "&", "/", "3", "C", ")" };
// builds up the ignored pattern and ensures a dash char is placed at the end to avoid unintended ranges
string ignoredPattern = String.Concat("[",
String.Join("", ignoredChars.Where(c => c != "-")
.Select(c => Regex.Escape(c)).ToArray()),
(ignoredChars.Contains("-") ? "-" : ""),
"]*?");
string[] substrings = substring.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
string pattern = "";
if (substrings.Length > 1)
{
pattern = String.Join(ignoredPattern, substrings);
}
else
{
pattern = String.Join(ignoredPattern, substring.Select(c => c.ToString()).ToArray());
}
foreach (Match match in Regex.Matches(input, pattern))
{
Console.WriteLine("Index: {0} -- Match: {1}", match.Index, match.Value);
}
Try this solution out:
string input = "Hello, -this- is a string";
string[] searchStrings = { "Hello", "this" };
string pattern = String.Join(@"\W+", searchStrings);
foreach (Match match in Regex.Matches(input, pattern))
{
Console.WriteLine(match.Value);
}
The \W+ will match one or more non-word characters (anything other than letters, digits and underscore). If you feel like specifying them yourself, you can replace it with a character class of the characters to ignore, such as [ ,.-]+ (always place the dash character at the start or end to avoid unintended range specifications). Also, if you need case to be ignored use RegexOptions.IgnoreCase:
Regex.Matches(input, pattern, RegexOptions.IgnoreCase)
If your substring is in the form of a complete string, such as "Hello this", you can easily get it into an array form for searchStrings in this way:
string[] searchStrings = substring.Split(new[] { ' ' },
StringSplitOptions.RemoveEmptyEntries);
This code will do what you want, although I suggest you modify it to fit your needs better:
string resultString = null;
try
{
resultString = Regex.Match(subjectString, "Hello[, -]*this", RegexOptions.IgnoreCase).Value;
}
catch (ArgumentException ex)
{
// Syntax error in the regular expression
}
You could do this with a single Regex but it would be quite tedious as after every character you would need to test for zero or more ignored characters. It is probably easier to strip all the ignored characters with Regex.Replace(subject, "[-,]", ""); then test if the substring is there.
Or the single Regex way
Regex.IsMatch(subject, "H[-,]*e[-,]*l[-,]*l[-,]*o[-,]* [-,]*t[-,]*h[-,]*i[-,]*s[-,]*")
Here's a non-regex way to do it using string parsing.
private string GetSubstring()
{
string searchString = "Hello, -this- is a string";
string searchStringWithoutUnwantedChars = searchString.Replace(",", "").Replace("-", "");
string desiredString = string.Empty;
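if(searchStringWithoutUnwantedChars.Contains("Hello this"))
{
// Substring takes (start, length), so measure the length from the start of "Hello".
int start = searchString.IndexOf("Hello");
int end = searchString.IndexOf("this") + "this".Length;
desiredString = searchString.Substring(start, end - start);
}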
return desiredString;
}
You could do something like this, since almost all of these answers require rebuilding the string in some form.
myString is the string you want to look through.
//Create a List<string> that contains the ignored characters
List<string> ignoredCharacters = new List<string>();
//Add all of the characters you wish to ignore in the method you choose
//Use a function here to get a return
public bool subStringExist(List<string> ignoredCharacters, string myString, string toMatch)
{
//Copy Your string to a temp
string tempString = myString;
bool match = false;
//Replace Everything that you don't want
foreach (string item in ignoredCharacters)
{
tempString = tempString.Replace(item, "");
}
//Check if your substring exist
if (tempString.Contains(toMatch))
{
match = true;
}
return match;
}
You could always use a combination of RegEx and string searching
public class RegExpression {
public static void Example(string input, string ignore, string find)
{
string output = string.Format("Input: {1}{0}Ignore: {2}{0}Find: {3}{0}{0}", Environment.NewLine, input, ignore, find);
if (SanitizeText(input, ignore).ToString().Contains(SanitizeText(find, ignore)))
Console.WriteLine(output + "was matched");
else
Console.WriteLine(output + "was NOT matched");
Console.WriteLine();
}
public static string SanitizeText(string input, string ignore)
{
Regex reg = new Regex("[^" + ignore + "]");
StringBuilder newInput = new StringBuilder();
foreach (Match m in reg.Matches(input))
{
newInput.Append(m.Value);
}
return newInput.ToString();
}
}
Usage would be like
RegExpression.Example("Hello, -this- is a string", "-,", "Hello this"); //Should match
RegExpression.Example("Hello, -this- is a string", "-,", "Hello this2"); //Should not match
RegExpression.Example("?A&3/3/C)412&", "&/3C\\)", "A41"); // Should match
RegExpression.Example("?A&3/3/C) 412&", "&/3C\\)", "A41"); // Should not match
RegExpression.Example("?A&3/3/C)412&", "&/3C\\)", "A3C412&"); // Should match
Output
Input: Hello, -this- is a string
Ignore: -,
Find: Hello this
was matched
Input: Hello, -this- is a string
Ignore: -,
Find: Hello this2
was NOT matched
Input: ?A&3/3/C)412&
Ignore: &/3C)
Find: A41
was matched
Input: ?A&3/3/C) 412&
Ignore: &/3C)
Find: A41
was NOT matched
Input: ?A&3/3/C)412&
Ignore: &/3C)
Find: A3C412&
was matched
