Related
I have c# strings that look like this:
"かたむく;かたぶく[ok]"
"そば[側,傍];そく[側];はた"
"くすり"
"おととい[gikun];おとつい[gikun];いっさくじつ"
How can I trim these down so that the output has only text up to the first occurrence of a ";" character (not a normal semicolon) or a "[" or if neither are present then the new string would be the same as the existing.
"かたむく"
"そば"
"くすり"
"おととい"
Is that something that would be best done with Regex or should I use some indexOf type of code to do this?
You don't need a Regex, just string.IndexOfAny. Something like:
var inputs = new[]
{
"かたむく;かたぶく[ok]",
"そば[側,傍];そく[側];はた",
"くすり",
"おととい[gikun];おとつい[gikun];いっさくじつ"
};
var separators = new[] {' ', '['};
foreach (var input in inputs)
{
var separatorPosition = input.IndexOfAny(separators);
if (separatorPosition >= 0)
{
Debug.WriteLine($"Split: {input.Substring(0, separatorPosition)}");
}
else
{
Debug.WriteLine($"No Split: {input}");
}
}
I get the following output from your inputs:
Split: かたむく;かたぶく
Split: そば
No Split: くすり
Split: おととい
It doesn't quite match what you show, but I think it's correct (and what you show isn't)
Expanding on my comment, "IndexOf can be used to find the first index of the [ character, and Substring can return the string up to that point."
public static string GetSubstringToChar(string input, char delimeter = '[')
{
if (input == null || !input.Contains(delimeter)) return input;
return input.Substring(0, input.IndexOf(delimeter));
}
To make this work with multiple delimeters, we can pass in an array of delimeter characters and use IndexOfAny:
public static string GetSubstringToChar(string input, char[] delimeters)
{
if (input == null || !input.Any(delimeters.Contains)) return input;
return input.Substring(0, input.IndexOfAny(delimeters));
}
You could then call this like:
var strings = new List<string>
{
"かたむく;かたぶく[ok]",
"そば[側,傍];そく[側];はた",
"くすり",
"おととい[gikun];おとつい[gikun];いっさくじつ",
};
var delimeters = new[] { ';', '[' };
foreach (var str in strings)
{
Console.WriteLine(GetSubstringToChar(str, delimeters));
}
An extension method with a little validation will do the job.
public static string GetUntil(this string input, char[] delimiters)
{
if (input == null || input.IndexOfAny(delimiters) == -1)
return input;
else
return input.Split(delimiters)[0];
}
then call like:
var test = "かたむく;かたぶく[ok]".GetUntil(new char[] { ' ', '[' });
I have some strings containing code for emoji icons, like :grinning:, :kissing_heart:, or :bouquet:. I'd like to process them to remove the emoji codes.
For example, given:
Hello:grinning: , how are you?:kissing_heart: Are you fine?:bouquet:
I want to get this:
Hello , how are you? Are you fine?
I know I can use this code:
richTextBox2.Text = richTextBox1.Text.Replace(":kissing_heart:", "").Replace(":bouquet:", "").Replace(":grinning:", "").ToString();
However, there are 856 different emoji icons I have to remove (which, using this method, would take 856 calls to Replace()). Is there any other way to accomplish this?
You can use Regex to match the word between :anything:. Using Replace with function you can make other validation.
string pattern = #":(.*?):";
string input = "Hello:grinning: , how are you?:kissing_heart: Are you fine?:bouquet: Are you super fan, for example. :words not to replace:";
string output = Regex.Replace(input, pattern, (m) =>
{
if (m.ToString().Split(' ').Count() > 1) // more than 1 word and other validations that will help preventing parsing the user text
{
return m.ToString();
}
return String.Empty;
}); // "Hello , how are you? Are you fine? Are you super fan, for example. :words not to replace:"
If you don't want to use Replace that make use of a lambda expression, you can use \w, as #yorye-nathan mentioned, to match only words.
string pattern = #":(\w*):";
string input = "Hello:grinning: , how are you?:kissing_heart: Are you fine?:bouquet: Are you super fan, for example. :words not to replace:";
string output = Regex.Replace(input, pattern, String.Empty); // "Hello , how are you? Are you fine? Are you super fan, for example. :words not to replace:"
string Text = "Hello:grinning: , how are you?:kissing_heart: Are you fine?:bouquet:";
i would solve it that way
List<string> Emoj = new List<string>() { ":kissing_heart:", ":bouquet:", ":grinning:" };
Emoj.ForEach(x => Text = Text.Replace(x, string.Empty));
UPDATE - refering to Detail's Comment
Another approach: replace only existing Emojs
List<string> Emoj = new List<string>() { ":kissing_heart:", ":bouquet:", ":grinning:" };
var Matches = Regex.Matches(Text, #":(\w*):").Cast<Match>().Select(x => x.Value);
Emoj.Intersect(Matches).ToList().ForEach(x => Text = Text.Replace(x, string.Empty));
But i'm not sure if it's that big difference for such short chat-strings and it's more important to have code that's easy to read/maintain. OP's question was about reducing redundancy Text.Replace().Text.Replace() and not about the most efficient solution.
I would use a combination of some of the techniques already suggested. Firstly, I'd store the 800+ emoji strings in a database and then load them up at runtime. Use a HashSet to store these in memory, so that we have a O(1) lookup time (very fast). Use Regex to pull out all potential pattern matches from the input and then compare each to our hashed emoji, removing the valid ones and leaving any non-emoji patterns the user has entered themselves...
public class Program
{
//hashset for in memory representation of emoji,
//lookups are O(1), so very fast
private HashSet<string> _emoji = null;
public Program(IEnumerable<string> emojiFromDb)
{
//load emoji from datastore (db/file,etc)
//into memory at startup
_emoji = new HashSet<string>(emojiFromDb);
}
public string RemoveEmoji(string input)
{
//pattern to search for
string pattern = #":(\w*):";
string output = input;
//use regex to find all potential patterns in the input
MatchCollection matches = Regex.Matches(input, pattern);
//only do this if we actually find the
//pattern in the input string...
if (matches.Count > 0)
{
//refine this to a distinct list of unique patterns
IEnumerable<string> distinct =
matches.Cast<Match>().Select(m => m.Value).Distinct();
//then check each one against the hashset, only removing
//registered emoji. This allows non-emoji versions
//of the pattern to survive...
foreach (string match in distinct)
if (_emoji.Contains(match))
output = output.Replace(match, string.Empty);
}
return output;
}
}
public class MainClass
{
static void Main(string[] args)
{
var program = new Program(new string[] { ":grinning:", ":kissing_heart:", ":bouquet:" });
string output = program.RemoveEmoji("Hello:grinning: :imadethis:, how are you?:kissing_heart: Are you fine?:bouquet: This is:a:strange:thing :to type:, but valid :nonetheless:");
Console.WriteLine(output);
}
}
Which results in:
Hello :imadethis:, how are you? Are you fine? This is:a:strange:thing :to type:,
but valid :nonetheless:
You do not have to replace all 856 emoji's. You only have to replace those that appear in the string. So have a look at:
Finding a substring using C# with a twist
Basically you extract all tokens ie the strings between : and : and then replace those with string.Empty()
If you are concerned that the search will return strings that are not emojis such as :some other text: then you could have a hash table lookup to make sure that replacing said found token is appropriate to do.
Finally got around to write something up. I'm combining a couple previously mentioned ideas, with the fact we should only loop over the string once. Based on those requirement, this sound like the perfect job for Linq.
You should probably cache the HashSet. Other than that, this has O(n) performance and only goes over the list once. Would be interesting to benchmark, but this could very well be the most efficient solution.
The approach is pretty straight forwards.
First load all Emoij in a HashSet so we can quickly look them up.
Split the string with input.Split(':') at the :.
Decide if we keep the current element.
If the last element was a match, keep the current element.
If the last element was no match, check if the current element matches.
If it does, ignore it. (This effectively removes the substring from the output).
If it doesn't, append : back and keep it.
Rebuild our string with a StringBuilder.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace ConsoleApplication1
{
static class Program
{
static void Main(string[] args)
{
ISet<string> emojiList = new HashSet<string>(new[] { "kissing_heart", "bouquet", "grinning" });
Console.WriteLine("Hello:grinning: , ho:w: a::re you?:kissing_heart:kissing_heart: Are you fine?:bouquet:".RemoveEmoji(':', emojiList));
Console.ReadLine();
}
public static string RemoveEmoji(this string input, char delimiter, ISet<string> emojiList)
{
StringBuilder sb = new StringBuilder();
input.Split(delimiter).Aggregate(true, (prev, curr) =>
{
if (prev)
{
sb.Append(curr);
return false;
}
if (emojiList.Contains(curr))
{
return true;
}
sb.Append(delimiter);
sb.Append(curr);
return false;
});
return sb.ToString();
}
}
}
Edit: I did something cool using the Rx library, but then realized Aggregate is the IEnumerable counterpart of Scan in Rx, thus simplifying the code even more.
If efficiency is a concern and to avoid processing "false positives", consider rewriting the string using a StringBuilder while skipping the special emoji tokens:
static HashSet<string> emojis = new HashSet<string>()
{
"grinning",
"kissing_heart",
"bouquet"
};
static string RemoveEmojis(string input)
{
StringBuilder sb = new StringBuilder();
int length = input.Length;
int startIndex = 0;
int colonIndex = input.IndexOf(':');
while (colonIndex >= 0 && startIndex < length)
{
//Keep normal text
int substringLength = colonIndex - startIndex;
if (substringLength > 0)
sb.Append(input.Substring(startIndex, substringLength));
//Advance the feed and get the next colon
startIndex = colonIndex + 1;
colonIndex = input.IndexOf(':', startIndex);
if (colonIndex < 0) //No more colons, so no more emojis
{
//Don't forget that first colon we found
sb.Append(':');
//Add the rest of the text
sb.Append(input.Substring(startIndex));
break;
}
else //Possible emoji, let's check
{
string token = input.Substring(startIndex, colonIndex - startIndex);
if (emojis.Contains(token)) //It's a match, so we skip this text
{
//Advance the feed
startIndex = colonIndex + 1;
colonIndex = input.IndexOf(':', startIndex);
}
else //No match, so we keep the normal text
{
//Don't forget the colon
sb.Append(':');
//Instead of doing another substring next loop, let's just use the one we already have
sb.Append(token);
startIndex = colonIndex;
}
}
}
return sb.ToString();
}
static void Main(string[] args)
{
List<string> inputs = new List<string>()
{
"Hello:grinning: , how are you?:kissing_heart: Are you fine?:bouquet:",
"Tricky test:123:grinning:",
"Hello:grinning: :imadethis:, how are you?:kissing_heart: Are you fine?:bouquet: This is:a:strange:thing :to type:, but valid :nonetheless:"
};
foreach (string input in inputs)
{
Console.WriteLine("In <- " + input);
Console.WriteLine("Out -> " + RemoveEmojis(input));
Console.WriteLine();
}
Console.WriteLine("\r\n\r\nPress enter to exit...");
Console.ReadLine();
}
Outputs:
In <- Hello:grinning: , how are you?:kissing_heart: Are you fine?:bouquet:
Out -> Hello , how are you? Are you fine?
In <- Tricky test:123:grinning:
Out -> Tricky test:123
In <- Hello:grinning: :imadethis:, how are you?:kissing_heart: Are you fine?:bouquet: This is:a:strange:thing :to type:, but valid :nonetheless:
Out -> Hello :imadethis:, how are you? Are you fine? This is:a:strange:thing :to type:, but valid :nonetheless:
Use this code I put up below I think using this function your problem will be solved.
string s = "Hello:grinning: , how are you?:kissing_heart: Are you fine?:bouquet:";
string rmv = ""; string remove = "";
int i = 0; int k = 0;
A:
rmv = "";
for (i = k; i < s.Length; i++)
{
if (Convert.ToString(s[i]) == ":")
{
for (int j = i + 1; j < s.Length; j++)
{
if (Convert.ToString(s[j]) != ":")
{
rmv += s[j];
}
else
{
remove += rmv + ",";
i = j;
k = j + 1;
goto A;
}
}
}
}
string[] str = remove.Split(',');
for (int x = 0; x < str.Length-1; x++)
{
s = s.Replace(Convert.ToString(":" + str[x] + ":"), "");
}
Console.WriteLine(s);
Console.ReadKey();
I'd use extension method like this:
public static class Helper
{
public static string MyReplace(this string dirty, char separator)
{
string newText = "";
bool replace = false;
for (int i = 0; i < dirty.Length; i++)
{
if(dirty[i] == separator) { replace = !replace ; continue;}
if(replace ) continue;
newText += dirty[i];
}
return newText;
}
}
Usage:
richTextBox2.Text = richTextBox2.Text.MyReplace(':');
This method show be better in terms of performance compare to one with Regex
I would split the text with the ':' and then build the string excluding the found emoji names.
const char marker = ':';
var textSections = text.Split(marker);
var emojiRemovedText = string.Empty;
var notMatchedCount = 0;
textSections.ToList().ForEach(section =>
{
if (emojiNames.Contains(section))
{
notMatchedCount = 0;
}
else
{
if (notMatchedCount++ > 0)
{
emojiRemovedText += marker.ToString();
}
emojiRemovedText += section;
}
});
string sentence = "We know it contains 'camel' word.";
// Camel can be in different cases:
string s1 = "CAMEL";
string s2 = "CaMEL";
string s3 = "CAMeL";
// ...
string s4 = "Camel";
// ...
string s5 = "camel";
How to replace 'camel' in sentence with 'horse' despite of string.Replace doesn't support ignoreCase on left string?
Use a regular expression:
var regex = new Regex( "camel", RegexOptions.IgnoreCase );
var newSentence = regex.Replace( sentence, "horse" );
Of course, this will also match words containing camel, but it's not clear if you want that or not.
If you need exact matches you can use a custom MatchEvaluator.
public static class Evaluators
{
public static string Wrap( Match m, string original, string format )
{
// doesn't match the entire string, otherwise it is a match
if (m.Length != original.Length)
{
// has a preceding letter or digit (i.e., not a real match).
if (m.Index != 0 && char.IsLetterOrDigit( original[m.Index - 1] ))
{
return m.Value;
}
// has a trailing letter or digit (i.e., not a real match).
if (m.Index + m.Length != original.Length && char.IsLetterOrDigit( original[m.Index + m.Length] ))
{
return m.Value;
}
}
// it is a match, apply the format
return string.Format( format, m.Value );
}
}
Used with the previous example to wrap the match in a span as:
var regex = new Regex( highlightedWord, RegexOptions.IgnoreCase );
foreach (var sentence in sentences)
{
var evaluator = new MatchEvaluator( match => Evaluators.Wrap( match, sentence, "<span class='red'>{0}</span>" ) );
Console.WriteLine( regex.Replace( sentence, evaluator ) );
}
Add an extension method for string to do the trick:
Usage:
string yourString = "TEXTTOREPLACE";
yourString.Replace("texttoreplace", "Look, I Got Replaced!", StringComparison.OrdinalIgnoreCase);
Code:
using System;
using System.Collections.Generic;
using System.IO;
public static class Extensions
{
public static string Replace(this string source, string oldString, string newString, StringComparison comp)
{
int index = source.IndexOf(oldString, comp);
// Determine if we found a match
bool MatchFound = index >= 0;
if (MatchFound)
{
// Remove the old text
source = source.Remove(index, oldString.Length);
// Add the replacemenet text
source = source.Insert(index, newString);
}
// recurse for multiple instances of the name
if (source.IndexOf(oldString, comp) != -1)
{
source = Replace(source, oldString, newString, comp);
}
return source;
}
}
Here's an extension method taking a StringComparison, using string.IndexOf:
[Pure]
public static string Replace(this string source, string oldValue, string newValue, StringComparison comparisonType)
{
if (source.Length == 0 || oldValue.Length == 0)
return source;
var result = new System.Text.StringBuilder();
int startingPos = 0;
int nextMatch;
while ((nextMatch = source.IndexOf(oldValue, startingPos, comparisonType)) > -1)
{
result.Append(source, startingPos, nextMatch - startingPos);
result.Append(newValue);
startingPos = nextMatch + oldValue.Length;
}
result.Append(source, startingPos, source.Length - startingPos);
return result.ToString();
}
Btw, here's also a similar Contains-method also taking a StringComparison:
[Pure]
public static bool Contains(this string source, string value, StringComparison comparisonType)
{
return source.IndexOf(value, comparisonType) >= 0;
}
Some tests:
[TestFixture]
public class ExternalTests
{
private static string[] TestReplace_args =
{
"ab/B/c/ac",
"HELLO World/Hello/Goodbye/Goodbye World",
"Hello World/world/there!/Hello there!",
"hello WoRlD/world/there!/hello there!",
"///",
"ab///ab",
"/ab/cd/",
"a|b|c|d|e|f/|//abcdef",
"a|b|c|d|e|f|/|/:/a:b:c:d:e:f:",
};
[Test, TestCaseSource("TestReplace_args")]
public void TestReplace(string teststring)
{
var split = teststring.Split("/");
var source = split[0];
var oldValue = split[1];
var newValue = split[2];
var result = split[3];
Assert.That(source.Replace(oldValue, newValue, StringComparison.OrdinalIgnoreCase), Is.EqualTo(result));
}
}
Here is my extension method, which combines Tom Beech's, with the recursiveness of sntbob's, and a cleaner fix to the bug that ksun pointed out.
Code:
public static string Replace(this string source, string oldString,
string newString, StringComparison comparison)
{
int index = source.IndexOf(oldString, comparison);
while (index > -1)
{
source = source.Remove(index, oldString.Length);
source = source.Insert(index, newString);
index = source.IndexOf(oldString, index + newString.Length, comparison);
}
return source;
}
Usage:
string source = "banana";
Console.WriteLine(source.Replace("AN", "banana", StringComparison.OrdinalIgnoreCase));
Result:
bbananabananaa
And, if you still want the recursive nature to be optional:
Code:
public static string Replace(this string source, string oldString,
string newString, StringComparison comparison,
bool recursive = true)
{
int index = source.IndexOf(oldString, comparison);
while (index > -1)
{
source = source.Remove(index, oldString.Length);
source = source.Insert(index, newString);
if (!recursive)
{
return source;
}
index = source.IndexOf(oldString, index + newString.Length, comparison);
}
return source;
}
Usage:
string source = "banana";
Console.WriteLine(source.Replace("AN", "banana", StringComparison.OrdinalIgnoreCase, false));
Result:
bbananaana
Utiltize StringComparison because of its handy OrdinalIgnoreCase
string sentence = "We know it contains 'camel' word.";
string wordToFind = "camel";
string replacementWord = "horse";
int index = sentence.IndexOf(wordToFind , StringComparison.OrdinalIgnoreCase)
// Did we match the word regardless of case
bool match = index >= 0;
// perform the replace on the matched word
if(match) {
sentence = sentence.Remove(index, wordToFind.Length)
sentence = sentence.Insert(index, replacementWord)
}
Sure would be nice if the C# String class had an ignoreCase() method like Java.
You could also use String.IndexOf
http://msdn.microsoft.com/en-us/library/system.string.indexof.aspx
You may get slightly better performance doing it this way than with RegExpressions (I abhor them because they're not intuitive and easy to screw up, although this simple .Net function call abstracts the actual messy RegEx, and doesn't provide much room for error), but that's probably not a concern for you; computers are REALLY fast these days, right? :) The overload for IndexOf that takes a StringComparison object allows you to optionally ignore case, and because IndexOf returns the first occurrence from at a specified position, you'll have to code a loop to process a string having multiple occurrences.
public static string CustomReplace(string srcText, string toFind, string toReplace, bool matchCase, bool replace0nce)
{
StringComparison sc = StringComparison.OrdinalIgnoreCase;
if (matchCase)
sc = StringComparison.Ordinal;
int pos;
while ((pos = srcText.IndexOf(toFind, sc)) > -1)
{
srcText = srcText.Remove(pos, toFind.Length);
srcText = srcText.Insert(pos, toReplace);
if (replace0nce)
break;
}
return srcText;
}
It may not be as efficient as some of the other answers, but I kind of like the CustomReplace function written by sntbob.
However, there is a flaw in it. If the text replacement is recursive it will cause an infinite loop. For example, CustomReplace("I eat bananas!","an","banana",false,false) would cause an infinite loop and the string would continue growing larger.
For example, after the 4th iteration the string would be "I eat bbbbbananaanaanaanaanas!"
If you want to only replace the two instances of "an" inside "banana" then you'll have to take another approach. I modified sntbob's code to account for this case. I admit that it's much more convoluted, but it handles recursive replacements.
public static string CustomReplace(string srcText, string toFind, string toReplace, bool matchCase, bool replaceOnce)
{
StringComparison sc = StringComparison.OrdinalIgnoreCase;
if (matchCase)
sc = StringComparison.Ordinal;
int pos;
int previousProcessedLength = 0;
string alreadyProcessedTxt = "";
string remainingToProcessTxt = srcText;
while ((pos = remainingToProcessTxt.IndexOf(toFind, sc)) > -1)
{
previousProcessedLength = alreadyProcessedTxt.Length;
//Append processed text up until the end of the found string and perform replacement
alreadyProcessedTxt += remainingToProcessTxt.Substring(0, pos + toFind.Length);
alreadyProcessedTxt = alreadyProcessedTxt.Remove(previousProcessedLength + pos, toFind.Length);
alreadyProcessedTxt = alreadyProcessedTxt.Insert(previousProcessedLength + pos, toReplace);
//Remove processed text from remaining
remainingToProcessTxt = remainingToProcessTxt.Substring(pos + toFind.Length);
if (replaceOnce)
break;
}
return alreadyProcessedTxt + remainingToProcessTxt;
}
Why not just import the Microsoft.VisualBasic namespace and use the VB Strings.Replace method?
https://msdn.microsoft.com/en-us/library/microsoft.visualbasic.strings.replace(v=vs.110).aspx
eg
var newString = Strings.Replace(SourceString, FindTextValue, ReplacementTextValue, 1, -1, Constants.vbTextCompare);
vbTextCompare forces a case-insensitive replacement. Job done.
Okay, it's not 'pure' C#, but it gets you to where you want to go with much less complexity and messing around.
Here's one more alternative that uses StringComparison as an extension method. on a StringBuilder object. I've read some articles indicating that a StringBuilder might be a little more efficient with memory than using strings. You can easily alter this to work with strings if that's what you need.
/// <summary>
/// Extension method to find/replace replaces text in a StringBuilder object
/// </summary>
/// <param name="original">Source StringBuilder object</param>
/// <param name="oldString">String to search for</param>
/// <param name="newString">String to replace each occurrance of oldString</param>
/// <param name="stringComparison">String comparison to use</param>
/// <returns>Original Stringbuilder with replacements made</returns>
public static StringBuilder Replace(this StringBuilder original,
string oldString, string newString, StringComparison stringComparison)
{
//If anything is null, or oldString is blank, exit with original value
if ( newString == null || original == null || string.IsNullOrEmpty(oldString))
return original;
//Convert to a string and get starting position using
//IndexOf which allows us to use StringComparison.
int pos = original.ToString().IndexOf(oldString, 0, stringComparison);
//Loop through until we find and replace all matches
while ( pos >= 0 )
{
//Remove the old string and insert the new one.
original.Remove(pos, oldString.Length).Insert(pos, newString);
//Get the next match starting 1 character after last replacement (to avoid a possible infinite loop)
pos = original.ToString().IndexOf(oldString, pos + newString.Length + 1, stringComparison);
}
return original;
}
I have been trying to use a C# Regex unsuccessfully to remove certain strings from a movie name.
Examples of the file names I'm working with are:
EuroTrip (2004) [SD]
Event Horizon (1997) [720]
Fast & Furious (2009) [1080p]
Star Trek (2009) [Unknown]
I'd like to remove anything in square brackets or parenthesis (including the brackets themselves)
So far I'm using:
movieTitleToFetch = Regex.Replace(movieTitleToFetch, "([*\\(\\d{4}\\)])", "");
Which seems to remove the Year and Parenthesis ok, but I just can't figure out how to remove the Square Brackets and content without affecting other parts... I've had miscellaneous results but the closest one has been:
movieTitleToFetch = Regex.Replace(movieTitleToFetch, "([?\\[+A-Z+\\]])", "");
Which left me with:
urorip (2004)
Instead of:
EuroTrip (2004) [SD]
Any whitespace that is left at the ends are ok as I will just perform
movieTitleToFetch = movieTitleToFetch.Trim();
at the end.
Thanks in advance,
Alex
This regex pattern should work ok... maybe needs a bit of tweaking
"[\[\(].+?[\]\)]"
Regex.Replace(movieTitleToFetch, #"[\[\(].+?[\]\)]", "");
This should match anything from either "[" or "(" until the next occurance of "]" or ")"
If that does not work try removing the escape character for the parentheses, like so...
Regex.Replace(movieTitleToFetch, #"[\[(].+?[\])]", "");
#Craigt is pretty much spot on but it's possibly cleaner to ensure that the brackets are matched.
([\[].*?[\]]|[\(].*?[\)])
I'know i'm late on this thread but i wrote a simple algorythm to sanitize the downloaded movies filenames.
This runs these steps:
Removes everything in brackets (if find a year it tries to keep the info)
Removes a list of common used words (720p, bdrip, h264 and so on...)
Assumes that can be languages info in the title and removes them when at the end of remaining string (before special words)
if a year was not found into parenthesis looks at the end of remaining string (as for languages)
Doing this replaces dots and spaces so the title is ready, as example, to be a query for a search api.
Here's the test in XUnit (i used most of italian titles to test it)
using Grappachu.Movideo.Core.Helpers.TitleCleaner;
using SharpTestsEx;
using Xunit;
namespace Grappachu.MoVideo.Test
{
public class TitleCleanerTest
{
[Theory]
[InlineData("Avengers.Confidential.La.Vedova.Nera.E.Punisher.2014.iTALiAN.Bluray.720p.x264 - BG.mkv",
"Avengers Confidential La Vedova Nera E Punisher", 2014)]
[InlineData("Fuck You, Prof! (2013) BDRip 720p HEVC ITA GER AC3 Multi Sub PirateMKV.mkv",
"Fuck You, Prof!", 2013)]
[InlineData("Il Libro della Giungla(2016)(BDrip1080p_H264_AC3 5.1 Ita Eng_Sub Ita Eng)by siste82.avi",
"Il Libro della Giungla", 2016)]
[InlineData("Il primo dei bugiardi (2009) [Mux by Little-Boy]", "Il primo dei bugiardi", 2009)]
[InlineData("Il.Viaggio.Di.Arlo-The.Good.Dinosaur.2015.DTS.ITA.ENG.1080p.BluRay.x264-BLUWORLD",
"il viaggio di arlo", 2015)]
[InlineData("La Mafia Uccide Solo D'estate 2013 .avi",
"La Mafia Uccide Solo D'estate", 2013)]
[InlineData("Ip.Man.3.2015.iTA.AC3.5.1.448.Chi.Aac.BluRay.m1080p.x264.Sub.[scambiofile.info].mkv",
"Ip Man 3", 2015)]
[InlineData("Inferno.2016.BluRay.1080p.AC3.ITA.AC3.ENG.Subs.x264-WGZ.mkv",
"Inferno", 2016)]
[InlineData("Ghostbusters.2016.iTALiAN.BDRiP.EXTENDED.XviD-HDi.mp4",
"Ghostbusters", 2016)]
[InlineData("Transcendence.mkv", "Transcendence", null)]
[InlineData("Being Human (Forsyth, 1994).mkv", "Being Human", 1994)]
public void Clean_should_return_title_and_year_when_possible(string filename, string title, int? year)
{
var res = MovieTitleCleaner.Clean(filename);
res.Title.ToLowerInvariant().Should().Be.EqualTo(title.ToLowerInvariant());
res.Year.Should().Be.EqualTo(year);
}
}
}
and fisrt version of the code
using System;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
namespace Grappachu.Movideo.Core.Helpers.TitleCleaner
{
public class MovieTitleCleanerResult
{
public string Title { get; set; }
public int? Year { get; set; }
public string SubTitle { get; set; }
}
public class MovieTitleCleaner
{
private const string SpecialMarker = "§=§";
private static readonly string[] ReservedWords;
private static readonly string[] SpaceChars;
private static readonly string[] Languages;
static MovieTitleCleaner()
{
ReservedWords = new[]
{
SpecialMarker, "hevc", "bdrip", "Bluray", "x264", "h264", "AC3", "DTS", "480p", "720p", "1080p"
};
var cultures = CultureInfo.GetCultures(CultureTypes.AllCultures);
var l = cultures.Select(x => x.EnglishName).ToList();
l.AddRange(cultures.Select(x => x.ThreeLetterISOLanguageName));
Languages = l.Distinct().ToArray();
SpaceChars = new[] {".", "_", " "};
}
public static MovieTitleCleanerResult Clean(string filename)
{
var temp = Path.GetFileNameWithoutExtension(filename);
int? maybeYear = null;
// Remove what's inside brackets trying to keep year info.
temp = RemoveBrackets(temp, '{', '}', ref maybeYear);
temp = RemoveBrackets(temp, '[', ']', ref maybeYear);
temp = RemoveBrackets(temp, '(', ')', ref maybeYear);
// Removes special markers (codec, formats, ecc...)
var tokens = temp.Split(SpaceChars, StringSplitOptions.RemoveEmptyEntries);
var title = string.Empty;
for (var i = 0; i < tokens.Length; i++)
{
var tok = tokens[i];
if (ReservedWords.Any(x => string.Equals(x, tok, StringComparison.OrdinalIgnoreCase)))
{
if (title.Length > 0)
break;
}
else
{
title = string.Join(" ", title, tok).Trim();
}
}
temp = title;
// Remove languages infos when are found before special markers (should not remove "English" if it's inside the title)
tokens = temp.Split(SpaceChars, StringSplitOptions.RemoveEmptyEntries);
for (var i = tokens.Length - 1; i >= 0; i--)
{
var tok = tokens[i];
if (Languages.Any(x => string.Equals(x, tok, StringComparison.OrdinalIgnoreCase)))
tokens[i] = string.Empty;
else
break;
}
title = string.Join(" ", tokens).Trim();
// If year is not found inside parenthesis try to catch at the end, just after the title
if (!maybeYear.HasValue)
{
var resplit = title.Split(SpaceChars, StringSplitOptions.RemoveEmptyEntries);
var last = resplit.Last();
if (LooksLikeYear(last))
{
maybeYear = int.Parse(last);
title = title.Replace(last, string.Empty).Trim();
}
}
// TODO: review this. when there's one dash separates main title from subtitle
var res = new MovieTitleCleanerResult();
res.Year = maybeYear;
if (title.Count(x => x == '-') == 1)
{
var sp = title.Split('-');
res.Title = sp[0];
res.SubTitle = sp[1];
}
else
{
res.Title = title;
}
return res;
}
private static string RemoveBrackets(string inputString, char openChar, char closeChar, ref int? maybeYear)
{
var str = inputString;
while (str.IndexOf(openChar) > 0 && str.IndexOf(closeChar) > 0)
{
var dataGraph = str.GetBetween(openChar.ToString(), closeChar.ToString());
if (LooksLikeYear(dataGraph))
{
maybeYear = int.Parse(dataGraph);
}
else
{
var parts = dataGraph.Split(SpaceChars, StringSplitOptions.RemoveEmptyEntries);
foreach (var part in parts)
if (LooksLikeYear(part))
{
maybeYear = int.Parse(part);
break;
}
}
str = str.ReplaceBetween(openChar, closeChar, string.Format(" {0} ", SpecialMarker));
}
return str;
}
private static bool LooksLikeYear(string dataRound)
{
return Regex.IsMatch(dataRound, "^(19|20)[0-9][0-9]");
}
}
public static class StringUtils
{
public static string GetBetween(this string src, string a, string b,
StringComparison comparison = StringComparison.Ordinal)
{
var idxStr = src.IndexOf(a, comparison);
var idxEnd = src.IndexOf(b, comparison);
if (idxStr >= 0 && idxEnd > 0)
{
if (idxStr > idxEnd)
Swap(ref idxStr, ref idxEnd);
return src.Substring(idxStr + a.Length, idxEnd - idxStr - a.Length);
}
return src;
}
private static void Swap<T>(ref T idxStr, ref T idxEnd)
{
var temp = idxEnd;
idxEnd = idxStr;
idxStr = temp;
}
public static string ReplaceBetween(this string s, char begin, char end, string replacement = null)
{
var regex = new Regex(string.Format("\\{0}.*?\\{1}", begin, end));
return regex.Replace(s, replacement ?? string.Empty);
}
}
}
This does the trick:
#"(\[[^\]]*\])|(\([^\)]*\))"
It removes anything from "[" to the next "]" and anything from "(" to the next ")".
Can you just use:
string MovieTitle="Star Trek (2009) [Unknown]";
movieTitleToFetch= MovieTitle.IndexOf('(')>MovieTitle.IndexOf('[')?
MovieTitle.Substring(0,MovieTitle.IndexOf('[')):
MovieTitle.Substring(0,MovieTitle.IndexOf('('));
Cant we use this instead:-
if(movieTitleToFetch.Contains("("))
movieTitleToFetch=movieTitleToFetch.Substring(0,movieTitleToFetch.IndexOf("("));
Above code will surely return you the perfect movie titles for these strings:-
EuroTrip (2004) [SD]
Event Horizon (1997) [720]
Fast & Furious (2009) [1080p]
Star Trek (2009) [Unknown]
if there occurs a case where you will not have year but only type i.e :-
EuroTrip [SD]
Event Horizon [720]
Fast & Furious [1080p]
Star Trek [Unknown]
then use this
if(movieTitleToFetch.Contains("("))
movieTitleToFetch=movieTitleToFetch.Substring(0,movieTitleToFetch.IndexOf("("));
else if(movieTitleToFetch.Contains("["))
movieTitleToFetch=movieTitleToFetch.Substring(0,movieTitleToFetch.IndexOf("["));
I came up with .+\s(?<year>\(\d{4}\))\s(?<format>\[\w+\]) which matches any of your examples, and contains the year and format as named capture groups to help you replace them.
This pattern translates as:
Any character, one or more repitions
Whitespace
Literal '(' followed by 4 digits followed by literal ')' (year)
Whitespace
Literal '[' followed by alphanumeric, one or more repitions, followed by literal ']' (format)
Do any of you know of an easy/clean way to find a substring within a string while ignoring some specified characters to find it. I think an example would explain things better:
string: "Hello, -this- is a string"
substring to find: "Hello this"
chars to ignore: "," and "-"
found the substring, result: "Hello, -this"
Using Regex is not a requirement for me, but I added the tag because it feels related.
Update:
To make the requirement clearer: I need the resulting substring with the ignored chars, not just an indication that the given substring exists.
Update 2:
Some of you are reading too much into the example, sorry, i'll give another scenario that should work:
string: "?A&3/3/C)412&"
substring to find: "A41"
chars to ignore: "&", "/", "3", "C", ")"
found the substring, result: "A&3/3/C)41"
And as a bonus (not required per se), it will be great if it's also not safe to assume that the substring to find will not have the ignored chars on it, e.g.: given the last example we should be able to do:
substring to find: "A3C412&"
chars to ignore: "&", "/", "3", "C", ")"
found the substring, result: "A&3/3/C)412&"
Sorry if I wasn't clear before, or still I'm not :).
Update 3:
Thanks to everyone who helped!, this is the implementation I'm working with for now:
http://www.pastebin.com/pYHbb43Z
An here are some tests:
http://www.pastebin.com/qh01GSx2
I'm using some custom extension methods I'm not including but I believe they should be self-explainatory (I will add them if you like)
I've taken a lot of your ideas for the implementation and the tests but I'm giving the answer to #PierrOz because he was one of the firsts, and pointed me in the right direction.
Feel free to keep giving suggestions as alternative solutions or comments on the current state of the impl. if you like.
in your example you would do:
string input = "Hello, -this-, is a string";
string ignore = "[-,]*";
Regex r = new Regex(string.Format("H{0}e{0}l{0}l{0}o{0} {0}t{0}h{0}i{0}s{0}", ignore));
Match m = r.Match(input);
return m.Success ? m.Value : string.Empty;
Dynamically you would build the part [-, ] with all the characters to ignore and you would insert this part between all the characters of your query.
Take care of '-' in the class []: put it at the beginning or at the end
So more generically, it would give something like:
public string Test(string query, string input, char[] ignorelist)
{
string ignorePattern = "[";
for (int i=0; i<ignoreList.Length; i++)
{
if (ignoreList[i] == '-')
{
ignorePattern.Insert(1, "-");
}
else
{
ignorePattern += ignoreList[i];
}
}
ignorePattern += "]*";
for (int i = 0; i < query.Length; i++)
{
pattern += query[0] + ignorepattern;
}
Regex r = new Regex(pattern);
Match m = r.Match(input);
return m.IsSuccess ? m.Value : string.Empty;
}
Here's a non-regex string extension option:
public static class StringExtensions
{
public static bool SubstringSearch(this string s, string value, char[] ignoreChars, out string result)
{
if (String.IsNullOrEmpty(value))
throw new ArgumentException("Search value cannot be null or empty.", "value");
bool found = false;
int matches = 0;
int startIndex = -1;
int length = 0;
for (int i = 0; i < s.Length && !found; i++)
{
if (startIndex == -1)
{
if (s[i] == value[0])
{
startIndex = i;
++matches;
++length;
}
}
else
{
if (s[i] == value[matches])
{
++matches;
++length;
}
else if (ignoreChars != null && ignoreChars.Contains(s[i]))
{
++length;
}
else
{
startIndex = -1;
matches = 0;
length = 0;
}
}
found = (matches == value.Length);
}
if (found)
{
result = s.Substring(startIndex, length);
}
else
{
result = null;
}
return found;
}
}
EDIT: here's an updated solution addressing the points in your recent update. The idea is the same except if you have one substring it will need to insert the ignore pattern between each character. If the substring contains spaces it will split on the spaces and insert the ignore pattern between those words. If you don't have a need for the latter functionality (which was more in line with your original question) then you can remove the Split and if checking that provides that pattern.
Note that this approach is not going to be the most efficient.
string input = #"foo ?A&3/3/C)412& bar A341C2";
string substring = "A41";
string[] ignoredChars = { "&", "/", "3", "C", ")" };
// builds up the ignored pattern and ensures a dash char is placed at the end to avoid unintended ranges
string ignoredPattern = String.Concat("[",
String.Join("", ignoredChars.Where(c => c != "-")
.Select(c => Regex.Escape(c)).ToArray()),
(ignoredChars.Contains("-") ? "-" : ""),
"]*?");
string[] substrings = substring.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
string pattern = "";
if (substrings.Length > 1)
{
pattern = String.Join(ignoredPattern, substrings);
}
else
{
pattern = String.Join(ignoredPattern, substring.Select(c => c.ToString()).ToArray());
}
foreach (Match match in Regex.Matches(input, pattern))
{
Console.WriteLine("Index: {0} -- Match: {1}", match.Index, match.Value);
}
Try this solution out:
string input = "Hello, -this- is a string";
string[] searchStrings = { "Hello", "this" };
string pattern = String.Join(#"\W+", searchStrings);
foreach (Match match in Regex.Matches(input, pattern))
{
Console.WriteLine(match.Value);
}
The \W+ will match any non-alphanumeric character. If you feel like specifying them yourself, you can replace it with a character class of the characters to ignore, such as [ ,.-]+ (always place the dash character at the start or end to avoid unintended range specifications). Also, if you need case to be ignored use RegexOptions.IgnoreCase:
Regex.Matches(input, pattern, RegexOptions.IgnoreCase)
If your substring is in the form of a complete string, such as "Hello this", you can easily get it into an array form for searchString in this way:
string[] searchString = substring.Split(new[] { ' ' },
StringSplitOptions.RemoveEmptyEntries);
This code will do what you want, although I suggest you modify it to fit your needs better:
string resultString = null;
try
{
resultString = Regex.Match(subjectString, "Hello[, -]*this", RegexOptions.IgnoreCase).Value;
}
catch (ArgumentException ex)
{
// Syntax error in the regular expression
}
You could do this with a single Regex but it would be quite tedious as after every character you would need to test for zero or more ignored characters. It is probably easier to strip all the ignored characters with Regex.Replace(subject, "[-,]", ""); then test if the substring is there.
Or the single Regex way
Regex.IsMatch(subject, "H[-,]*e[-,]*l[-,]*l[-,]*o[-,]* [-,]*t[-,]*h[-,]*i[-,]*s[-,]*")
Here's a non-regex way to do it using string parsing.
private string GetSubstring()
{
string searchString = "Hello, -this- is a string";
string searchStringWithoutUnwantedChars = searchString.Replace(",", "").Replace("-", "");
string desiredString = string.Empty;
if(searchStringWithoutUnwantedChars.Contains("Hello this"))
desiredString = searchString.Substring(searchString.IndexOf("Hello"), searchString.IndexOf("this") + 4);
return desiredString;
}
You could do something like this, since most all of these answer require rebuilding the string in some form.
string1 is your string you want to look through
//Create a List(Of string) that contains the ignored characters'
List<string> ignoredCharacters = new List<string>();
//Add all of the characters you wish to ignore in the method you choose
//Use a function here to get a return
public bool subStringExist(List<string> ignoredCharacters, string myString, string toMatch)
{
//Copy Your string to a temp
string tempString = myString;
bool match = false;
//Replace Everything that you don't want
foreach (string item in ignoredCharacters)
{
tempString = tempString.Replace(item, "");
}
//Check if your substring exist
if (tempString.Contains(toMatch))
{
match = true;
}
return match;
}
You could always use a combination of RegEx and string searching
public class RegExpression {
public static void Example(string input, string ignore, string find)
{
string output = string.Format("Input: {1}{0}Ignore: {2}{0}Find: {3}{0}{0}", Environment.NewLine, input, ignore, find);
if (SanitizeText(input, ignore).ToString().Contains(SanitizeText(find, ignore)))
Console.WriteLine(output + "was matched");
else
Console.WriteLine(output + "was NOT matched");
Console.WriteLine();
}
public static string SanitizeText(string input, string ignore)
{
Regex reg = new Regex("[^" + ignore + "]");
StringBuilder newInput = new StringBuilder();
foreach (Match m in reg.Matches(input))
{
newInput.Append(m.Value);
}
return newInput.ToString();
}
}
Usage would be like
RegExpression.Example("Hello, -this- is a string", "-,", "Hello this"); //Should match
RegExpression.Example("Hello, -this- is a string", "-,", "Hello this2"); //Should not match
RegExpression.Example("?A&3/3/C)412&", "&/3C\\)", "A41"); // Should match
RegExpression.Example("?A&3/3/C) 412&", "&/3C\\)", "A41"); // Should not match
RegExpression.Example("?A&3/3/C)412&", "&/3C\\)", "A3C412&"); // Should match
Output
Input: Hello, -this- is a string
Ignore: -,
Find: Hello this
was matched
Input: Hello, -this- is a string
Ignore: -,
Find: Hello this2
was NOT matched
Input: ?A&3/3/C)412&
Ignore: &/3C)
Find: A41
was matched
Input: ?A&3/3/C) 412&
Ignore: &/3C)
Find: A41
was NOT matched
Input: ?A&3/3/C)412&
Ignore: &/3C)
Find: A3C412&
was matched