Best approach to word censoring - C# 4.0

For my custom-made chat screen I am using the code below to check for censored words, but I wonder whether its performance can be improved. Thank you.
if (srMessageTemp.IndexOf(" censored1 ") != -1)
return;
if (srMessageTemp.IndexOf(" censored2 ") != -1)
return;
if (srMessageTemp.IndexOf(" censored3 ") != -1)
return;
C# 4.0. The actual list is a lot longer, but I haven't put all of it here.

I would use LINQ or regular expression for this:
LINQ: How to: Query for Sentences that Contain a Specified Set of Words (LINQ)
Regular Expression: Highlight a list of words using a regular expression in c#
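As a rough sketch of the regular-expression route (not taken from either linked article): all censored words can be folded into one alternation wrapped in \b word boundaries, so a word is also caught when it is surrounded by punctuation instead of spaces. The names CensorRegexDemo and BuildCensorRegex are made up for illustration, and the \b anchors assume every censored word starts and ends with a letter, digit or underscore.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CensorRegexDemo
{
    // Hypothetical helper: builds one pattern of the form \b(?:w1|w2|...)\b.
    // Regex.Escape keeps characters such as '.' or '+' in a word literal.
    public static Regex BuildCensorRegex(string[] words)
    {
        string alternation = string.Join("|", words.Select(w => Regex.Escape(w)));
        return new Regex(@"\b(?:" + alternation + @")\b",
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.Compiled);
    }

    public static void Main()
    {
        Regex regex = BuildCensorRegex(new[] { "censored1", "censored2", "censored3" });
        Console.WriteLine(regex.IsMatch("some ,CENSORED2, text")); // True
        Console.WriteLine(regex.IsMatch("perfectly fine text"));   // False
    }
}
```

Build the Regex once and reuse it for every message; constructing it per message would probably cost more than the IndexOf chain it replaces.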

You can simplify it. Here listOfCensoredWords contains all the censored words:
if (listOfCensoredWords.Any(item => srMessageTemp.Contains(item)))
return;

If you want to make it really fast, you can use an Aho-Corasick automaton. This is how antivirus software checks for thousands of virus signatures at once. I don't know of a ready-made implementation, though, so it will require much more work from you than simple, slower methods like regular expressions.
See the theory here: http://en.wikipedia.org/wiki/Aho-Corasick
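To give an idea of the amount of work involved, here is a minimal sketch of such an automaton. It is my own illustrative version, not a library API: the class name AhoCorasick and its members are hypothetical, matching is case-sensitive substring matching (normalise case on both sides first if you need that), and it only answers "does any pattern occur?".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal Aho-Corasick matcher; all names are illustrative.
public sealed class AhoCorasick
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;    // longest proper suffix of this node's path that is also in the trie
        public bool IsMatch; // a pattern ends here, directly or via the failure chain
    }

    private readonly Node _root = new Node();

    public AhoCorasick(IEnumerable<string> patterns)
    {
        // 1. Build a trie containing every pattern.
        foreach (string pattern in patterns)
        {
            if (string.IsNullOrEmpty(pattern)) continue;
            Node node = _root;
            foreach (char c in pattern)
            {
                Node child;
                if (!node.Next.TryGetValue(c, out child))
                {
                    child = new Node();
                    node.Next[c] = child;
                }
                node = child;
            }
            node.IsMatch = true;
        }

        // 2. Breadth-first pass that fills in the failure links.
        var queue = new Queue<Node>();
        foreach (Node child in _root.Next.Values)
        {
            child.Fail = _root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            Node current = queue.Dequeue();
            foreach (KeyValuePair<char, Node> edge in current.Next)
            {
                Node fallback = current.Fail;
                while (fallback != _root && !fallback.Next.ContainsKey(edge.Key))
                    fallback = fallback.Fail;

                Node target;
                Node child = edge.Value;
                child.Fail = fallback.Next.TryGetValue(edge.Key, out target) ? target : _root;
                child.IsMatch |= child.Fail.IsMatch; // inherit matches that end at a suffix
                queue.Enqueue(child);
            }
        }
    }

    // Scans the text once, whatever the number of patterns.
    public bool ContainsAny(string text)
    {
        Node node = _root;
        foreach (char c in text)
        {
            while (node != _root && !node.Next.ContainsKey(c))
                node = node.Fail;
            Node next;
            node = node.Next.TryGetValue(c, out next) ? next : _root;
            if (node.IsMatch) return true;
        }
        return false;
    }
}

public static class AhoCorasickDemo
{
    public static void Main()
    {
        var matcher = new AhoCorasick(new[] { "censored1", "censored2", "censored3" });
        Console.WriteLine(matcher.ContainsAny("this is censored2, really")); // True
        Console.WriteLine(matcher.ContainsAny("nothing to see here"));       // False
    }
}
```

The automaton is built once; each message is then checked in a single pass over its characters, independent of how long the word list is.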

First, I hope you aren't really "tokenizing" the words as written. Just because someone doesn't put a space before a bad word doesn't make the word less bad :-) Example: ,badword,
I would use a Regex here :-) I'm not sure whether a Regex or a hand-written parser would be faster, but a Regex is at least a good starting point. As others wrote, you begin by splitting the text into words and then check each one against a HashSet<string>.
I'm adding a second version of the code, based on ArraySegment<char>. I discuss it after the code.
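using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;
class Program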
{
class ArraySegmentComparer : IEqualityComparer<ArraySegment<char>>
{
public bool Equals(ArraySegment<char> x, ArraySegment<char> y)
{
if (x.Count != y.Count)
{
return false;
}
int end = x.Offset + x.Count;
for (int i = x.Offset, j = y.Offset; i < end; i++, j++)
{
if (!x.Array[i].ToString().Equals(y.Array[j].ToString(), StringComparison.InvariantCultureIgnoreCase))
{
return false;
}
}
return true;
}
public int GetHashCode(ArraySegment<char> obj)
{
unchecked
{
int hash = 17;
int end = obj.Offset + obj.Count;
int i;
for (i = obj.Offset; i < end; i++)
{
hash *= 23;
hash += Char.ToUpperInvariant(obj.Array[i]);
}
return hash;
}
}
}
static void Main()
{
var rx = new Regex(@"\b\w+\b", RegexOptions.Compiled);
var sampleText = @"For my custom made chat screen i am using the code below for checking censored words. But i wonder can this code performance improved. Thank you.
if (srMessageTemp.IndexOf("" censored1 "") != -1)
return;
if (srMessageTemp.IndexOf("" censored2 "") != -1)
return;
if (srMessageTemp.IndexOf("" censored3 "") != -1)
return;
C# 4.0 . actually list is a lot more long but i don't put here as it goes away.
And now some accented letters àèéìòù and now some letters with unicode combinable diacritics àèéìòù";
//sampleText += sampleText;
//sampleText += sampleText;
//sampleText += sampleText;
//sampleText += sampleText;
//sampleText += sampleText;
//sampleText += sampleText;
//sampleText += sampleText;
HashSet<string> prohibitedWords = new HashSet<string>(StringComparer.InvariantCultureIgnoreCase) { "For", "custom", "combinable", "away" };
Stopwatch sw1 = Stopwatch.StartNew();
var words = rx.Matches(sampleText);
foreach (Match word in words)
{
string str = word.Value;
if (prohibitedWords.Contains(str))
{
Console.Write(str);
Console.Write(" ");
}
else
{
//Console.WriteLine(word);
}
}
sw1.Stop();
Console.WriteLine();
Console.WriteLine();
HashSet<ArraySegment<char>> prohibitedWords2 = new HashSet<ArraySegment<char>>(
prohibitedWords.Select(p => new ArraySegment<char>(p.ToCharArray())),
new ArraySegmentComparer());
var sampleText2 = sampleText.ToCharArray();
Stopwatch sw2 = Stopwatch.StartNew();
int startWord = -1;
for (int i = 0; i < sampleText2.Length; i++)
{
if (Char.IsLetter(sampleText2[i]) || Char.IsDigit(sampleText2[i]))
{
if (startWord == -1)
{
startWord = i;
}
}
else
{
if (startWord != -1)
{
int length = i - startWord;
if (length != 0)
{
var wordSegment = new ArraySegment<char>(sampleText2, startWord, length);
if (prohibitedWords2.Contains(wordSegment))
{
Console.Write(sampleText2, startWord, length);
Console.Write(" ");
}
else
{
//Console.WriteLine(sampleText2, startWord, length);
}
}
startWord = -1;
}
}
}
if (startWord != -1)
{
int length = sampleText2.Length - startWord;
if (length != 0)
{
var wordSegment = new ArraySegment<char>(sampleText2, startWord, length);
if (prohibitedWords2.Contains(wordSegment))
{
Console.Write(sampleText2, startWord, length);
Console.Write(" ");
}
else
{
//Console.WriteLine(sampleText2, startWord, length);
}
}
}
sw2.Stop();
Console.WriteLine();
Console.WriteLine();
Console.WriteLine(sw1.ElapsedTicks);
Console.WriteLine(sw2.ElapsedTicks);
}
}
Note that you could go faster by parsing "in place" in the original string. What does this mean? If you split the "document" into words and put each word in a string, you create n strings, one for each word of the document. But what if you skipped this step and operated directly on the document, keeping only the start index and the length of the current word? That avoids the allocations, but you then need a special comparer for the HashSet<>.
C# has something for this: ArraySegment<T>. Your document becomes a char[] instead of a string, and each word an ArraySegment<char>. This is clearly more complex: you can't simply use a Regex, so you have to build a parser by hand (though I think converting the \b\w+\b expression is quite easy), and writing the comparer is a little involved (hint: use a HashSet<ArraySegment<char>>, where each word to be censored is an ArraySegment<char> "pointing" to the char[] of that word with a count equal to the char[].Length, like var word = new ArraySegment<char>("tobecensored".ToCharArray());).
After a simple benchmark, I can see that an unoptimized version of the program using ArraySegment<char> is about as fast as the Regex version for shorter texts. This is probably because, for a word 4-6 chars long, copying the string around is about as slow as copying an ArraySegment<char> (an ArraySegment<char> is 12 bytes, and a word of 6 characters is 12 bytes; both carry a little extra overhead, but in the end the numbers are comparable). For longer texts (try uncommenting the //sampleText += sampleText; lines) it becomes a little faster (about 10%) in Release -> Start Without Debugging (CTRL-F5).
Note also that comparing strings character by character is wrong. You should always use the methods given to you by the string class (or by the OS). They know how to handle "strange" cases much better than you do (and in Unicode there isn't any "normal" case :-) ).

You can use LINQ for this, but it's not required if you use a list to hold your censored values. The solution below uses the built-in list functions and allows you to do your searches case-insensitively.
private static List<string> _censoredWords = new List<string>()
{
"badwordone1",
"badwordone2",
"badwordone3",
"badwordone4",
};
static void Main(string[] args)
{
string badword1 = "BadWordOne2";
bool censored = ShouldCensorWord(badword1);
}
private static bool ShouldCensorWord(string word)
{
return _censoredWords.Contains(word.ToLower());
}
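A possible variation on the same idea, as a sketch only: a HashSet<string> created with StringComparer.OrdinalIgnoreCase gives constant-time lookups and makes the ToLower() call unnecessary. The class name CensorSetDemo is made up for illustration; the words are the ones from the list above.

```csharp
using System;
using System.Collections.Generic;

public static class CensorSetDemo
{
    // The comparer makes Contains case-insensitive, so the stored words
    // can be in any casing and the input does not need ToLower().
    private static readonly HashSet<string> _censoredWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "badwordone1",
            "badwordone2",
            "badwordone3",
            "badwordone4",
        };

    public static bool ShouldCensorWord(string word)
    {
        return _censoredWords.Contains(word);
    }

    public static void Main()
    {
        Console.WriteLine(ShouldCensorWord("BadWordOne2")); // True
        Console.WriteLine(ShouldCensorWord("harmless"));    // False
    }
}
```

With only a handful of words the difference from List<string>.Contains is negligible; it starts to matter when the list grows to hundreds of entries.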

What do you think about this:
string[] censoredWords = new[] { " censored1 ", " censored2 ", " censored3 " };
if (censoredWords.Any(word => srMessageTemp.Contains(word)))
return;

Related

How to run-length encode 'EEDDDNE' to '2E3DNE'?

Explanation: The task itself is that we have 13 strings (stored in the sor[] array) like the one in the title or 'EEENKDDDDKKKNNKDK'
and we have to shorten it in a way that if there's two or more of the same letter next to eachother then we have to write it in the form of 'NumberoflettersLetter'
So by this rule, 'EEENKDDDDKKKNNKDK' would become '3ENK4D3K2NKDK'
using System;
public class Program
{
public static void Main(string[] args)
{
string[] sor = new string[] { "EEENKDDDDKKKNNKDK", "'EEDDDNE'" };
char holder;
int counter = 0;
string temporary;
int indexholder;
for (int i = 0; i < sor.Length; i++)
{
for (int q = 0; q < sor[i].Length; q++)
{
holder = sor[i][q];
indexholder = q;
counter = 0;
while (sor[i][q] == holder)
{
q++;
counter++;
}
if (counter > 1)
{
temporary = Convert.ToString(counter) + holder;
sor[i].Replace(sor[i].Substring(indexholder, q), temporary); // EX here
}
}
}
Console.ReadLine();
}
}
Sorry I didn't make the error clear, it says that :
"The value of index and length has to represent a place inside the string (System.ArgumentOutOfRangeException) - name of parameter: length"
...but I have no clue what's wrong with it, maybe it's a tiny little mistake, maybe the whole thing is messed up, so this is why I'd like someone to help me with this D:
(Ps 'indexholder' is there because i need it for another exercise)
EDIT:
'sor' is the string array that holds these strings (there are 13 of them) like the one mentioned in the title or in the example
You can use regex for this:
Regex.Replace("EEENKDDDDKKKNNKDK", @"(.)\1+", m => $"{m.Length}{m.Groups[1].Value}")
Explanation:
(.) matches any character and puts it in group #1
\1+ matches group #1 as many times as it can
Shortening the same string in place is more difficult than constructing a new one while iterating over the old one char by char. If you plan to append to a string repeatedly, it is better to use the StringBuilder class instead of concatenating onto a string directly (for performance reasons).
You can streamline your approach by using the Enumerable.Aggregate function, which does the iteration over one string for you automatically:
using System;
using System.Linq;
using System.Text;
public class Program
{
public static string RunLengthEncode(string s)
{
if (string.IsNullOrEmpty(s)) // avoid null ref ex and do simple case
return "";
// we need a "state" between the different chars of s that we store here:
char curr_c = s[0]; // our current char, we start with the 1st one
int count = 0; // our char counter, we start with 0 as it will be
// incremented as soon as it is processed by Aggregate
// ( and then incremented to 1)
var agg = s.Aggregate(new StringBuilder(), (acc, c) => // StringBuilder
// performs better for multiple string-"additions" than string itself
{
if (c == curr_c)
count++; // same char, increment
else
{
// other char
if (count > 1) // store count if > 1
acc.AppendFormat("{0}", count);
acc.Append(curr_c); // store char
curr_c = c; // set current char to new one
count = 1; // startcount now is 1
}
return acc;
});
// add last things
if (count > 1) // store count if > 1
agg.AppendFormat("{0}", count);
agg.Append(curr_c); // store char
return agg.ToString(); // return the "simple" string
}
Test with
public static void Main(string[] args)
{
Console.WriteLine(RunLengthEncode("'EEENKDDDDKKKNNKDK' "));
Console.ReadLine();
}
}
Output for "'EEENKDDDDKKKNNKDK' ":
'3ENK4D3K2NKDK'
Your approach without using the same string is more like this:
var data = "'EEENKDDDDKKKNNKDK' ";
char curr_c = '\x0'; // avoid unassigned warning
int count = 0; // counter for the curr_c occurrences in a row
string result = string.Empty; // resulting string
foreach (var c in data) // process every character of data in order
{
if (c != curr_c) // new character found
{
if (count > 1) // more than 1, add count as string and the char
result += Convert.ToString(count) + curr_c;
else if (count > 0) // avoid initial `\x0` being put into string
result += curr_c;
curr_c = c; // remember new character
count = 1; // so far we found this one
}
else
count++; // not new, increment counter
}
// add the last counted char as well
if (count > 1)
result += Convert.ToString(count) + curr_c;
else
result += curr_c;
// output
Console.WriteLine(data + " ==> " + result);
Output:
'EEENKDDDDKKKNNKDK' ==> '3ENK4D3K2NKDK'
Instead of using the indexing operator [] on your string and struggling with indexes all over, I use foreach (var c in "sometext"), which proceeds char by char through the string; much less hassle.
If you need to run-length encode an array/list of strings (your sor), simply apply the code to each one, preferably by using foreach (var s in yourStringList).
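Applying the encoder to the whole array could look roughly like this. It is a compact sketch rather than the exact code above: RunLengthDemo is a made-up class name, the encoder is a plain index loop over each run, and sor holds the two strings from the question without the surrounding quotes.

```csharp
using System;
using System.Linq;
using System.Text;

public static class RunLengthDemo
{
    public static string RunLengthEncode(string s)
    {
        var sb = new StringBuilder();
        int i = 0;
        while (i < s.Length)
        {
            int runStart = i;
            // advance i to the first character that differs from the run's character
            while (i < s.Length && s[i] == s[runStart])
                i++;
            int count = i - runStart;
            if (count > 1)
                sb.Append(count);    // runs of length 1 get no number
            sb.Append(s[runStart]);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string[] sor = { "EEENKDDDDKKKNNKDK", "EEDDDNE" };
        string[] encoded = sor.Select(s => RunLengthEncode(s)).ToArray();
        for (int k = 0; k < sor.Length; k++)
            Console.WriteLine(sor[k] + " ==> " + encoded[k]);
        // EEENKDDDDKKKNNKDK ==> 3ENK4D3K2NKDK
        // EEDDDNE ==> 2E3DNE
    }
}
```

Because strings are immutable, the results go into a new array instead of being written back with Replace, which is also why the Replace call in the question had no visible effect.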

C# How to generate a new string based on multiple ranged index

Let's say I have a string like this one, left part is a word, right part is a collection of indices (single or range) used to reference furigana (phonetics) for kanjis in my word:
string myString = "子で子にならぬ時鳥,0:こ;2:こ;7-8:ほととぎす"
The pattern in detail:
word,<startIndex>(-<endIndex>):<furigana>
What would be the best way to achieve something like this (with a space in front of the kanji to mark which part is linked to the [furigana]):
子[こ]で 子[こ]にならぬ 時鳥[ほととぎす]
Edit: (thanks for your comments guys)
Here is what I wrote so far:
static void Main(string[] args)
{
string myString = "ABCDEF,1:test;3:test2";
//Split Kanjis / Indices
string[] tokens = myString.Split(',');
//Extract furigana indices
string[] indices = tokens[1].Split(';');
//Dictionary to store furigana indices
Dictionary<string, string> furiganaIndices = new Dictionary<string, string>();
//Collect
foreach (string index in indices)
{
string[] splitIndex = index.Split(':');
furiganaIndices.Add(splitIndex[0], splitIndex[1]);
}
//Processing
string result = tokens[0] + ",";
for (int i = 0; i < tokens[0].Length; i++)
{
string currentIndex = i.ToString();
if (furiganaIndices.ContainsKey(currentIndex)) //add [furigana]
{
string currentFurigana = furiganaIndices[currentIndex].ToString();
result = result + " " + tokens[0].ElementAt(i) + string.Format("[{0}]", currentFurigana);
}
else //nothing to add
{
result = result + tokens[0].ElementAt(i);
}
}
File.AppendAllText(@"D:\test.txt", result + Environment.NewLine);
}
Result:
ABCDEF,A B[test]C D[test2]EF
I struggle to find a way to process ranged indices:
string myString = "ABCDEF,1:test;2-3:test2";
Result : ABCDEF,A B[test] CD[test2]EF
I don't have anything against manually manipulating strings per se. But given that you seem to have a regular pattern describing the inputs, it seems to me that a solution that uses regex would be more maintainable and readable. So with that in mind, here's an example program that takes that approach:
using System;
using System.Text;
using System.Text.RegularExpressions;
class Program
{
private const string _kinvalidFormatException = "Invalid format for edit specification";
private static readonly Regex
regex1 = new Regex(@"(?<word>[^,]+),(?<edit>(?:\d+)(?:-(?:\d+))?:(?:[^;]+);?)+", RegexOptions.Compiled),
regex2 = new Regex(@"(?<start>\d+)(?:-(?<end>\d+))?:(?<furigana>[^;]+);?", RegexOptions.Compiled);
static void Main(string[] args)
{
string myString = "子で子にならぬ時鳥,0:こ;2:こ;7-8:ほととぎす";
string result = EditString(myString);
}
private static string EditString(string myString)
{
Match editsMatch = regex1.Match(myString);
if (!editsMatch.Success)
{
throw new ArgumentException(_kinvalidFormatException);
}
int ichCur = 0;
string input = editsMatch.Groups["word"].Value;
StringBuilder text = new StringBuilder();
foreach (Capture capture in editsMatch.Groups["edit"].Captures)
{
Match oneEditMatch = regex2.Match(capture.Value);
if (!oneEditMatch.Success)
{
throw new ArgumentException(_kinvalidFormatException);
}
int start, end;
if (!int.TryParse(oneEditMatch.Groups["start"].Value, out start))
{
throw new ArgumentException(_kinvalidFormatException);
}
Group endGroup = oneEditMatch.Groups["end"];
if (endGroup.Success)
{
if (!int.TryParse(endGroup.Value, out end))
{
throw new ArgumentException(_kinvalidFormatException);
}
}
else
{
end = start;
}
text.Append(input.Substring(ichCur, start - ichCur));
if (text.Length > 0)
{
text.Append(' ');
}
ichCur = end + 1;
text.Append(input.Substring(start, ichCur - start));
text.Append(string.Format("[{0}]", oneEditMatch.Groups["furigana"]));
}
if (ichCur < input.Length)
{
text.Append(input.Substring(ichCur));
}
return text.ToString();
}
}
Notes:
This implementation assumes that the edit specifications will be listed in order and won't overlap. It makes no attempt to validate that part of the input; depending on where you are getting your input from you may want to add that. If it's valid for the specifications to be listed out of order, you can also extend the above to first store the edits in a list and sort the list by the start index before actually editing the string. (In similar fashion to the way the other proposed answer works; though, why they are using a dictionary instead of a simple list to store the individual edits, I have no idea…that seems arbitrarily complicated to me.)
I included basic input validation, throwing exceptions where failures occur in the pattern matching. A more user-friendly implementation would add more specific information to each exception, describing what part of the input actually was invalid.
The Regex class actually has a Replace() method, which allows for complete customization. The above could have been implemented that way, using Replace() and a MatchEvaluator to provide the replacement text, instead of just appending text to a StringBuilder. Which way to do it is mostly a matter of preference, though the MatchEvaluator might be preferred if you have a need for more flexible implementation options (i.e. if the exact format of the result can vary).
If you do choose to use the other proposed answer, I strongly recommend you use StringBuilder instead of simply concatenating onto the results variable. For short strings it won't matter much, but you should get into the habit of always using StringBuilder when you have a loop that is incrementally adding onto a string value, because for long strings the performance implications of using concatenation can be very negative.
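For the Replace() plus MatchEvaluator variant mentioned in the notes, one possible shape is sketched below. It is only one way to do it and makes several assumptions: the names FuriganaReplaceDemo and Annotate are invented, the edits are first collected into lookup tables keyed by character index, and the pattern "." then visits every character of the word so the evaluator can decide what to put around it. Validation is reduced to the missing-comma case.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FuriganaReplaceDemo
{
    private static readonly Regex EditRegex =
        new Regex(@"(?<start>\d+)(?:-(?<end>\d+))?:(?<furigana>[^;]+)");

    public static string Annotate(string spec)
    {
        int comma = spec.IndexOf(',');
        if (comma < 0)
            throw new ArgumentException("Invalid format for edit specification");
        string word = spec.Substring(0, comma);

        // Index of each range start, and furigana keyed by the index where a range ends.
        var starts = new HashSet<int>();
        var ends = new Dictionary<int, string>();
        foreach (Match edit in EditRegex.Matches(spec.Substring(comma + 1)))
        {
            int start = int.Parse(edit.Groups["start"].Value);
            int end = edit.Groups["end"].Success ? int.Parse(edit.Groups["end"].Value) : start;
            starts.Add(start);
            ends[end] = edit.Groups["furigana"].Value;
        }

        // The evaluator runs once per character; m.Index is its position in the word.
        return Regex.Replace(word, ".", m =>
        {
            string furigana;
            string prefix = (m.Index > 0 && starts.Contains(m.Index)) ? " " : "";
            string suffix = ends.TryGetValue(m.Index, out furigana) ? "[" + furigana + "]" : "";
            return prefix + m.Value + suffix;
        }, RegexOptions.Singleline);
    }

    public static void Main()
    {
        Console.WriteLine(Annotate("子で子にならぬ時鳥,0:こ;2:こ;7-8:ほととぎす"));
        // 子[こ]で 子[こ]にならぬ 時鳥[ほととぎす]
        Console.WriteLine(Annotate("ABCDEF,1:test;2-3:test2"));
        // A B[test] CD[test2]EF
    }
}
```

Since the edits are looked up by index, they no longer have to be listed in order, but overlapping ranges are still not detected. Indices count UTF-16 code units, the same as Substring does.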
This should do it (and even handle ranged indices), based on the formatting of the input string you have-
using System;
using System.Collections.Generic;
public class stringParser
{
private struct IndexElements
{
public int start;
public int end;
public string value;
}
public static void Main()
{
//input string
string myString = "子で子にならぬ時鳥,0:こ;2:こ;7-8:ほととぎす";
int wordIndexSplit = myString.IndexOf(',');
string word = myString.Substring(0,wordIndexSplit);
string indices = myString.Substring(wordIndexSplit + 1);
string[] eachIndex = indices.Split(';');
Dictionary<int,IndexElements> index = new Dictionary<int,IndexElements>();
string[] elements;
IndexElements e;
int dash;
int n = 0;
int last = -1;
string results = "";
foreach (string s in eachIndex)
{
e = new IndexElements();
elements = s.Split(':');
if (elements[0].Contains("-"))
{
dash = elements[0].IndexOf('-');
e.start = int.Parse(elements[0].Substring(0,dash));
e.end = int.Parse(elements[0].Substring(dash + 1));
}
else
{
e.start = int.Parse(elements[0]);
e.end = e.start;
}
e.value = elements[1];
index.Add(n,e);
n++;
}
//this is the part that takes the "setup" from the parts above and forms the result string
//loop through each of the "indices" parsed above
for (int i = 0; i < index.Count; i++)
{
//if this is the first iteration through the loop, and the first "index" does not start
//at position 0, add the beginning characters before its start
if (last == -1 && index[i].start > 0)
{
results += word.Substring(0,index[i].start);
}
//if this is not the first iteration through the loop, and the previous iteration did
//not stop at the position directly before the start of the current iteration, add
//the intermediary characters
else if (last != -1 && last + 1 != index[i].start)
{
results += word.Substring(last + 1,index[i].start - (last + 1));
}
//add the space before the "index" match, the actual match, and then the formatted "index"
results += " " + word.Substring(index[i].start,(index[i].end - index[i].start) + 1)
+ "[" + index[i].value + "]";
//remember the position of the ending for the next iteration
last = index[i].end;
}
//if the last "index" did not stop at the end of the input string, add the remaining characters
if (index[index.Keys.Count - 1].end + 1 < word.Length)
{
results += word.Substring(index[index.Keys.Count-1].end + 1);
}
//trimming spaces that may be left behind
results = results.Trim();
Console.WriteLine("INPUT - " + myString);
Console.WriteLine("OUTPUT - " + results);
Console.Read();
}
}
input - 子で子にならぬ時鳥,0:こ;2:こ;7-8:ほととぎす
output - 子[こ]で 子[こ]にならぬ 時鳥[ほととぎす]
Note that this should also work with characters of the English alphabet, if you wanted to use English instead:
input - iliketocodeverymuch,2:A;4-6:B;9-12:CDEFG
output - il i[A]k eto[B]co deve[CDEFG]rymuch

How to get all permutations of groups in a string?

This is not homework, although it may seem like it. I've been browsing through the UK Computing Olympiad's website and found this problem (Question 1). I was baffled by it, and I'd like to see how you would approach it. I can't think of any neat way to get everything into groups (checking whether it's a palindrome after that is simple enough, i.e. originalString == new String(groupedString.Reverse().SelectMany(c => c).ToArray()), assuming each group is a char array).
Any ideas? Thanks!
Text for those at work:
A palindrome is a word that shows the same sequence of letters when
reversed. If a word can have its letters grouped together in two or
more blocks (each containing one or more adjacent letters) then it is
a block palindrome if reversing the order of those blocks results in
the same sequence of blocks.
For example, using brackets to indicate blocks, the following are
block palindromes:
• BONBON can be grouped together as (BON)(BON);
• ONION can be grouped together as (ON)(I)(ON);
• BBACBB can be grouped together as (B)(BACB)(B) or (BB)(AC)(BB) or
(B)(B)(AC)(B)(B)
Note that (BB)(AC)(B)(B) is not valid as the reverse (B)(B)(AC)(BB)
shows the blocks in a different order.
And the question is essentially how to generate all of those groups, to then check whether they are palindromes!
"And the question is essentially how to generate all of those groups, to then check whether they are palindromes!"
I note that this is not necessarily the best strategy. Generating all the groups first and then checking to see if they are palindromes is considerably less efficient than generating only those groups which are palindromes.
But in the spirit of answering the question asked, let's solve the problem recursively. I will just generate all the groups; checking whether a set of groups is a palindrome is left as an exercise. I am also going to ignore the requirement that a set of groups contains at least two elements; that is easily checked.
The way to solve this problem elegantly is to reason recursively. As with all recursive solutions, we begin with a trivial base case:
How many groupings are there of the empty string? There is only the empty grouping; that is, the grouping with no elements in it.
Now we assume that we have a solution to a smaller problem, and ask "if we had a solution to a smaller problem, how could we use that solution to solve a larger problem?"
OK, suppose we have a larger problem. We have a string with 6 characters in it and we wish to produce all the groupings. Moreover, the groupings are symmetrical; the first group is the same size as the last group. By assumption we know how to solve the problem for any smaller string.
We solve the problem as follows. Suppose the string is ABCDEF. We peel off A and F from both ends, we solve the problem for BCDE, which remember we know how to do by assumption, and now we prepend A and append F to each of those solutions.
The solutions for BCDE are (B)(C)(D)(E), (B)(CD)(E), (BC)(DE), (BCDE). Again, we assume as our inductive hypothesis that we have the solution to the smaller problem. We then combine those with A and F to produce the solutions for ABCDEF: (A)(B)(C)(D)(E)(F), (A)(B)(CD)(E)(F), (A)(BC)(DE)(F) and (A)(BCDE)(F).
We've made good progress. Are we done? No. Next we peel off AB and EF, and recursively solve the problem for CD. I won't labour how that is done. Are we done? No. We peel off ABC and DEF and recursively solve the problem for the empty string in the middle. Are we done? No. (ABCDEF) is also a solution. Now we're done.
I hope that sketch motivates the solution, which is now straightforward. We begin with a helper function:
public static IEnumerable<T> AffixSequence<T>(T first, IEnumerable<T> body, T last)
{
yield return first;
foreach (T item in body)
yield return item;
yield return last;
}
That should be easy to understand. Now we do the real work:
public static IEnumerable<IEnumerable<string>> GenerateBlocks(string s)
{
// The base case is trivial: the blocks of the empty string
// is the empty set of blocks.
if (s.Length == 0)
{
yield return new string[0];
yield break;
}
// Generate all the sequences for the middle;
// combine them with all possible prefixes and suffixes.
for (int i = 1; s.Length >= 2 * i; ++i)
{
string prefix = s.Substring(0, i);
string suffix = s.Substring(s.Length - i, i);
string middle = s.Substring(i, s.Length - 2 * i);
foreach (var body in GenerateBlocks(middle))
yield return AffixSequence(prefix, body, suffix);
}
// Finally, the set of blocks that contains only this string
// is a solution.
yield return new[] { s };
}
Let's test it.
foreach (var blocks in GenerateBlocks("ABCDEF"))
Console.WriteLine($"({string.Join(")(", blocks)})");
The output is
(A)(B)(C)(D)(E)(F)
(A)(B)(CD)(E)(F)
(A)(BC)(DE)(F)
(A)(BCDE)(F)
(AB)(C)(D)(EF)
(AB)(CD)(EF)
(ABC)(DEF)
(ABCDEF)
So there you go.
You could now check to see whether each grouping is a palindrome, but why? The algorithm presented above can be easily modified to eliminate all non-palindromes by simply not recursing if the prefix and suffix are unequal:
if (prefix != suffix) continue;
The algorithm now enumerates only block palindromes. Let's test it:
foreach (var blocks in GenerateBlocks("BBACBB"))
Console.WriteLine($"({string.Join(")(", blocks)})");
The output is below; again, note that I am not filtering out the "entire string" block but doing so is straightforward.
(B)(B)(AC)(B)(B)
(B)(BACB)(B)
(BB)(AC)(BB)
(BBACBB)
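Putting the pieces together, with the prefix/suffix check in place and the "entire string" grouping filtered out at the top level, might look like the sketch below. The names BlockPalindromeDemo and BlockPalindromes are mine, and each grouping is materialised as a List<string> instead of going through AffixSequence, purely to keep the example short.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BlockPalindromeDemo
{
    public static IEnumerable<List<string>> BlockPalindromes(string s)
    {
        // Base case: the empty string has exactly one grouping, the empty one.
        if (s.Length == 0)
        {
            yield return new List<string>();
            yield break;
        }
        for (int i = 1; 2 * i <= s.Length; ++i)
        {
            string prefix = s.Substring(0, i);
            string suffix = s.Substring(s.Length - i, i);
            if (prefix != suffix) continue; // only matching outer blocks can form a block palindrome
            string middle = s.Substring(i, s.Length - 2 * i);
            foreach (List<string> body in BlockPalindromes(middle))
            {
                var blocks = new List<string> { prefix };
                blocks.AddRange(body);
                blocks.Add(suffix);
                yield return blocks;
            }
        }
        // The whole string as a single block; needed for the middle part during recursion.
        yield return new List<string> { s };
    }

    public static void Main()
    {
        // The problem requires two or more blocks, so drop the single-block result here only.
        foreach (var blocks in BlockPalindromes("BBACBB").Where(b => b.Count >= 2))
            Console.WriteLine("(" + string.Join(")(", blocks) + ")");
        // (B)(B)(AC)(B)(B)
        // (B)(BACB)(B)
        // (BB)(AC)(BB)
    }
}
```

The filter has to sit outside the recursive method: inside the recursion the single-block case is exactly what produces middles such as (AC) and (BACB).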
If this subject interests you, consider reading my series of articles on using this same technique to generate every possible tree topology and every possible string in a language. It starts here:
http://blogs.msdn.com/b/ericlippert/archive/2010/04/19/every-binary-tree-there-is.aspx
This should work:
public List<string> BlockPalin(string s) {
var list = new List<string>();
for (int i = 1; i <= s.Length / 2; i++) {
int backInx = s.Length - i;
if (s.Substring(0, i) == s.Substring(backInx, i)) {
var result = string.Format("({0})", s.Substring(0, i));
result += "|" + result;
var rest = s.Substring(i, backInx - i);
if (rest == string.Empty) {
list.Add(result.Replace("|", rest));
return list;
}
else if (rest.Length == 1) {
list.Add(result.Replace("|", string.Format("({0})", rest)));
return list;
}
else {
list.Add(result.Replace("|", string.Format("({0})", rest)));
var recursiveList = BlockPalin(rest);
if (recursiveList.Count > 0) {
foreach (var recursiveResult in recursiveList) {
list.Add(result.Replace("|", recursiveResult));
}
}
else {
//EDIT: Thx to @juharr this list.Add is not needed...
// list.Add(result.Replace("|",string.Format("({0})",rest)));
return list;
}
}
}
}
return list;
}
And call it like this (EDIT: Again thx to @juharr, the Distinct is not needed):
var x = BlockPalin("BONBON");//.Distinct().ToList();
var y = BlockPalin("ONION");//.Distinct().ToList();
var z = BlockPalin("BBACBB");//.Distinct().ToList();
The result:
x contains 1 element: (BON)(BON)
y contains 1 element: (ON)(I)(ON)
z contains 3 elements: (B)(BACB)(B),(B)(B)(AC)(B)(B) and (BB)(AC)(BB)
Although not as elegant as the one provided by @Eric Lippert, one might find the following iterative, string-allocation-free solution interesting:
struct Range
{
public int Start, End;
public int Length { get { return End - Start; } }
public Range(int start, int length) { Start = start; End = start + length; }
}
static IEnumerable<Range[]> GetPalindromeBlocks(string input)
{
int maxLength = input.Length / 2;
var ranges = new Range[maxLength];
int count = 0;
for (var range = new Range(0, 1); ; range.End++)
{
if (range.End <= maxLength)
{
if (!IsPalindromeBlock(input, range)) continue;
ranges[count++] = range;
range.Start = range.End;
}
else
{
if (count == 0) break;
yield return GenerateResult(input, ranges, count);
range = ranges[--count];
}
}
}
static bool IsPalindromeBlock(string input, Range range)
{
return string.Compare(input, range.Start, input, input.Length - range.End, range.Length) == 0;
}
static Range[] GenerateResult(string input, Range[] ranges, int count)
{
var last = ranges[count - 1];
int midLength = input.Length - 2 * last.End;
var result = new Range[2 * count + (midLength > 0 ? 1 : 0)];
for (int i = 0; i < count; i++)
{
var range = result[i] = ranges[i];
result[result.Length - 1 - i] = new Range(input.Length - range.End, range.Length);
}
if (midLength > 0)
result[count] = new Range(last.End, midLength);
return result;
}
Test:
foreach (var input in new [] { "BONBON", "ONION", "BBACBB" })
{
Console.WriteLine(input);
var blocks = GetPalindromeBlocks(input);
foreach (var blockList in blocks)
Console.WriteLine(string.Concat(blockList.Select(range => "(" + input.Substring(range.Start, range.Length) + ")")));
}
Removing the line if (!IsPalindromeBlock(input, range)) continue; will produce the answer to the OP question.
It's not clear if you want all possible groupings, or just a possible grouping. This is one way, off the top-of-my-head, that you might get a grouping:
public static IEnumerable<string> GetBlocks(string testString)
{
if (testString.Length == 0)
{
yield break;
}
int mid = testString.Length / 2;
int i = 0;
while (i < mid)
{
if (testString.Take(i + 1).SequenceEqual(testString.Skip(testString.Length - (i + 1))))
{
yield return new String(testString.Take(i+1).ToArray());
break;
}
i++;
}
if (i == mid)
{
yield return testString;
}
else
{
foreach (var block in GetBlocks(new String(testString.Skip(i + 1).Take(testString.Length - (i + 1) * 2).ToArray())))
{
yield return block;
}
}
}
If you give it bonbon, it'll return bon. If you give it onion it'll give you back on, i. If you give it bbacbb, it'll give you b,b,ac.
Here's my solution (didn't have VS so I did it using java):
int matches = 0;
public void findMatch(String pal) {
String st1 = "", st2 = "";
int l = pal.length() - 1;
for (int i = 0; i < (pal.length())/2 ; i ++ ) {
st1 = st1 + pal.charAt(i);
st2 = pal.charAt(l) + st2;
if (st1.equals(st2)) {
matches++;
// DO THE SAME THING FOR THE MATCH
findMatch(st1);
}
l--;
}
}
The logic is pretty simple. I build two strings, a prefix and a suffix, one character at a time and compare them at each step to find a match. The key is that you need to check the same thing for each match too.
findMatch("bonbon"); // 1
findMatch("bbacbb"); // 3
What about something like this for BONBON...
string bonBon = "BONBON";
First check character count for even or odd.
bool isEven = bonBon.Length % 2 == 0;
Now, if it is even, split the string in half.
if (isEven)
{
int halfInd = bonBon.Length / 2;
string firstHalf = bonBon.Substring(0, halfInd );
string secondHalf = bonBon.Substring(halfInd);
}
Now, if it is odd, split the string into 3 strings.
else
{
int halfInd = (bonBon.Length - 1) / 2;
string firstHalf = bonBon.Substring(0, halfInd);
string middle = bonBon.Substring(halfInd, 1); // the single middle character
string secondHalf = bonBon.Substring(halfInd + 1);
}
May not be exactly correct, but it's a start....
Still have to add checking if it is actually a palindrome...
Good luck!!

C# More intuitive way to split a string into tokens?

I have a method which takes in a string containing various characters, but I'm only concerned with underscores '_' and dollar signs '$'. I want to split the string into tokens at the underscores, as each piece between the underscores contains important information.
However, if a $ is contained in an area between underscores, then a token should be created from the last occurrence of an underscore to the end (ignoring any underscores in this last section).
Example
input: Hello_To_The$Great_World
expected tokens: Hello, To, The$Great_World
Question
I have a solution below, but I'm wondering is there a cleaner/more intuitive way of doing this than what I have below?
var aTokens = new List<string>();
var aPos = 0;
for (var aNum = 0; aNum < item.Length; aNum++)
{
if (aNum == item.Length - 1)
{
aTokens.Add(item.Substring(aPos, item.Length - aPos));
break;
}
if (item[aNum] == '$')
{
aTokens.Add(item.Substring(aPos, item.Length - aPos));
break;
}
if (item[aNum] == '_')
{
aTokens.Add(item.Substring(aPos, aNum - aPos));
aPos = aNum + 1;
}
}
You can split the string on each _ that has no $ anywhere before it.
For that you can use the following regex:
(?<!\$.*)_
Sample code:
string input = "Hello_To_The$Great_World";
string[] output = Regex.Split(input, @"(?<!\$.*)_");
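A minimal sketch of how that lookbehind split behaves on a couple of inputs. The class name RegexSplitDemo and the helper Tokenize are hypothetical names used only for illustration. The pattern relies on .NET's support for variable-length lookbehind, which many other regex engines lack.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexSplitDemo
{
    // Splits on every '_' that has no '$' anywhere before it in the input.
    public static string[] Tokenize(string input)
    {
        return Regex.Split(input, @"(?<!\$.*)_");
    }

    public static void Main()
    {
        // prints: Hello, To, The$Great_World
        Console.WriteLine(string.Join(", ", Tokenize("Hello_To_The$Great_World")));
        // prints: No, Dollar, Here
        Console.WriteLine(string.Join(", ", Tokenize("No_Dollar_Here")));
    }
}
```

Once a $ has been seen, every later underscore fails the negative lookbehind, so the rest of the string stays in one token.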
You can also do the task without regex and without loops, using two splits:
string input = "Hello_To_The$Great_World";
string[] temp = input.Split(new[] { '$' }, 2);
string[] output = temp[0].Split('_');
if (temp.Length > 1)
output[output.Length - 1] = output[output.Length - 1] + "$" + temp[1];
This method is not efficient or clean, but it gives you a general idea of how to do this:
Split your string into tokens
Find the index of the first string to contain $
Return a new array with the first n tokens and the final token is the remaining strings concatenated.
It's probably more useful to take advantage of IEnumerable or do things over a for loop instead of all this Array.Copy stuff... but you get the gist of it.
private string[] SomeMethod(string arg)
{
var strings = arg.Split(new[] { '_' });
var indexedValue = strings.Select((v, i) => new { Value = v, Index = i }).FirstOrDefault(x => x.Value.Contains("$"));
if (indexedValue != null)
{
var count = indexedValue.Index + 1;
string[] final = new string[count];
Array.Copy(strings, 0, final, 0, indexedValue.Index);
final[indexedValue.Index] = String.Join("_", strings, indexedValue.Index, strings.Length - indexedValue.Index);
return final;
}
return strings;
}
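The IEnumerable-based variant mentioned above might look roughly like this. It is only a sketch: LinqSplitDemo and Tokenize are made-up names, and it assumes the same rule as the question (everything from the first piece containing $ onwards becomes one token).

```csharp
using System;
using System.Linq;

public static class LinqSplitDemo
{
    public static string[] Tokenize(string arg)
    {
        string[] parts = arg.Split('_');

        // Pieces before the first one that contains '$' are kept as they are.
        string[] head = parts.TakeWhile(p => !p.Contains("$")).ToArray();
        if (head.Length == parts.Length)
        {
            return parts; // no '$' anywhere, so the plain split is the answer
        }

        // Everything from that piece onwards is glued back together.
        string tail = string.Join("_", parts.Skip(head.Length));
        return head.Concat(new[] { tail }).ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Tokenize("Hello_To_The$Great_World")));
    }
}
```

This trades the Array.Copy bookkeeping for TakeWhile and Skip, at the cost of walking the pieces twice.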
Here's my version (loops are so last year...)
const char dollar = '$';
const char underscore = '_';
var item = "Hello_To_The$Great_World";
var aTokens = new List<string>();
int dollarIndex = item.IndexOf(dollar);
if (dollarIndex >= 0)
{
int lastUnderscoreIndex = item.LastIndexOf(underscore, dollarIndex);
if (lastUnderscoreIndex >= 0)
{
aTokens.AddRange(item.Substring(0, lastUnderscoreIndex).Split(underscore));
aTokens.Add(item.Substring(lastUnderscoreIndex + 1));
}
else
{
aTokens.Add(item);
}
}
else
{
aTokens.AddRange(item.Split(underscore));
}
Edit:
I should have added, cleaner/more intuitive is very subjective, as you have found out by the variety of answers provided. From a maintainability point of view, it's much more important that the method you write to do the parsing is unit tested!
It's also an interesting exercise to test the performance of the various methods posted here - it quickly becomes apparent that your original version is much faster than using regular expressions! (Although in a real life situation, it's probably quite unlikely that the performance of this method will make any difference to your application!)
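As a rough illustration of the unit-testing point, here is a framework-free sketch of a few checks. TokenizerChecks, Tokenize and Check are hypothetical names; the Tokenize body is just a compact stand-in for the IndexOf/LastIndexOf version above, and in a real project these cases would live in whatever test framework you already use.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TokenizerChecks
{
    // Stand-in for the method under test (same idea as the IndexOf/LastIndexOf answer).
    public static List<string> Tokenize(string item)
    {
        var tokens = new List<string>();
        int dollar = item.IndexOf('$');
        int cut = dollar >= 0 ? item.LastIndexOf('_', dollar) : -1;

        if (dollar < 0)
        {
            tokens.AddRange(item.Split('_')); // no '$': plain split
        }
        else if (cut < 0)
        {
            tokens.Add(item); // '$' sits in the first piece: one token
        }
        else
        {
            tokens.AddRange(item.Substring(0, cut).Split('_'));
            tokens.Add(item.Substring(cut + 1));
        }
        return tokens;
    }

    // True when the tokens for 'input' match 'expected' exactly, in order.
    public static bool Check(string input, params string[] expected)
    {
        return Tokenize(input).SequenceEqual(expected);
    }

    public static void Main()
    {
        Console.WriteLine(Check("Hello_To_The$Great_World", "Hello", "To", "The$Great_World"));
        Console.WriteLine(Check("No_Dollar", "No", "Dollar"));
        Console.WriteLine(Check("A$B_C", "A$B_C"));
    }
}
```

The edge cases (no $ at all, $ in the very first piece) are the ones most likely to differ between the posted answers, so they are worth pinning down first.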

Reverse a String without using Reverse. It works, but why?

Ok, so a friend of mine asked me to help him out with a string reverse method that can be reused without using String.Reverse (it's a homework assignment for him). Now, I did, below is the code. It works. Splendidly actually. Obviously by looking at it you can see the larger the string the longer the time it takes to work. However, my question is WHY does it work? Programming is a lot of trial and error, and I was more pseudocoding than actual coding and it worked lol.
Can someone explain to me how exactly reverse = ch + reverse; is working? I don't understand what is making it go into reverse :/
class Program
{
static void Reverse(string x)
{
string text = x;
string reverse = string.Empty;
foreach (char ch in text)
{
reverse = ch + reverse;
// this shows the building of the new string.
// Console.WriteLine(reverse);
}
Console.WriteLine(reverse);
}
static void Main(string[] args)
{
string comingin;
Console.WriteLine("Write something");
comingin = Console.ReadLine();
Reverse(comingin);
// pause
Console.ReadLine();
}
}
If the string passed through is "hello", the loop will be doing this:
reverse = 'h' + string.Empty
reverse = 'e' + 'h'
reverse = 'l' + 'eh'
until it's equal to
olleh
If your string is My String, then:
Pass 1, reverse = 'M'
Pass 2, reverse = 'yM'
Pass 3, reverse = ' yM'
You're taking each char and saying "that character and tack on what I had before after it".
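To watch this happen, a small sketch that records the value of reverse after every pass. ReverseTrace and Steps are made-up names for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class ReverseTrace
{
    // Returns the value of 'reverse' after each pass of the loop.
    public static List<string> Steps(string text)
    {
        var steps = new List<string>();
        string reverse = string.Empty;
        foreach (char ch in text)
        {
            // The new character goes in FRONT of everything built so far.
            reverse = ch + reverse;
            steps.Add(reverse);
        }
        return steps;
    }

    public static void Main()
    {
        // For "hello" this prints h, eh, leh, lleh, olleh on separate lines.
        foreach (string s in Steps("hello"))
        {
            Console.WriteLine(s);
        }
    }
}
```

Because each later character is placed before the earlier ones, the first character ends up last and the last character ends up first.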
I think your question has been answered. My reply goes beyond the immediate question and more to the spirit of the exercise. I remember having this task many decades ago in college, when memory and mainframe (yikes!) processing time were at a premium. Our task was to reverse an array or string, which is an array of characters, without creating a 2nd array or string. The spirit of the exercise was to teach one to be mindful of available resources.
In .NET, a string is an immutable object, so I must use a 2nd string. I wrote up 3 more examples to demonstrate different techniques that may be faster than your method, but which shouldn't be used to replace the built-in .NET Reverse method. I'm partial to the last one.
// StringBuilder inserting at 0 index
public static string Reverse2(string inputString)
{
var result = new StringBuilder();
foreach (char ch in inputString)
{
result.Insert(0, ch);
}
return result.ToString();
}
// Process inputString backwards and append with StringBuilder
public static string Reverse3(string inputString)
{
var result = new StringBuilder();
for (int i = inputString.Length - 1; i >= 0; i--)
{
result.Append(inputString[i]);
}
return result.ToString();
}
// Convert string to array and swap pertinent items
public static string Reverse4(string inputString)
{
var chars = inputString.ToCharArray();
for (int i = 0; i < (chars.Length/2); i++)
{
var temp = chars[i];
chars[i] = chars[chars.Length - 1 - i];
chars[chars.Length - 1 - i] = temp;
}
return new string(chars);
}
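For comparison, the built-in route is presumably Array.Reverse applied to a character array, since System.String itself has no Reverse method. A minimal sketch, with BuiltInReverse as a made-up class name:

```csharp
using System;

public static class BuiltInReverse
{
    public static string Reverse(string input)
    {
        char[] chars = input.ToCharArray(); // copy, because strings are immutable
        Array.Reverse(chars);               // reverses the copy in place
        return new string(chars);
    }

    public static void Main()
    {
        Console.WriteLine(Reverse("My String")); // prints: gnirtS yM
    }
}
```

Like the hand-written versions, this reverses UTF-16 code units, so text containing surrogate pairs or combining characters will not come out right.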
Imagine that your input string is "abc". You can then see that the letters are taken one by one and added to the start of the new string:
reverse = "", ch='a' ==> reverse (ch+reverse) = "a"
reverse= "a", ch='b' ==> reverse (ch+reverse) = b+a = "ba"
reverse= "ba", ch='c' ==> reverse (ch+reverse) = c+ba = "cba"
To test the suggestion by Romoku of using StringBuilder I have produced the following code.
public static void Reverse(string x)
{
string text = x;
string reverse = string.Empty;
foreach (char ch in text)
{
reverse = ch + reverse;
}
Console.WriteLine(reverse);
}
public static void ReverseFast(string x)
{
string text = x;
StringBuilder reverse = new StringBuilder();
for (int i = text.Length - 1; i >= 0; i--)
{
reverse.Append(text[i]);
}
Console.WriteLine(reverse);
}
public static void Main(string[] args)
{
int abcx = 100; // amount of abc's
string abc = "";
for (int i = 0; i < abcx; i++)
abc += "abcdefghijklmnopqrstuvwxyz";
var x = new System.Diagnostics.Stopwatch();
x.Start();
Reverse(abc);
x.Stop();
string ReverseMethod = "Reverse Method: " + x.ElapsedMilliseconds.ToString();
x.Restart();
ReverseFast(abc);
x.Stop();
Console.Clear();
Console.WriteLine("Method | Milliseconds");
Console.WriteLine(ReverseMethod);
Console.WriteLine("ReverseFast Method: " + x.ElapsedMilliseconds.ToString());
System.Console.Read();
}
On my computer these are the speeds I get per amount of alphabet(s).
100 ABC(s)
Reverse ~5-10ms
FastReverse ~5-15ms
1000 ABC(s)
Reverse ~120ms
FastReverse ~20ms
10000 ABC(s)
Reverse ~16,852ms!!!
FastReverse ~262ms
These timings will vary greatly from computer to computer, but one thing is certain: if you are processing more than 100k characters, you are insane for not using StringBuilder! On the other hand, if you are processing fewer than 2000 characters, the overhead of the StringBuilder seems to cancel out its performance boost.
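One caveat about the measurements above: both methods call Console.WriteLine inside the timed region, so console output is part of every number. A sketch of a variant that times only the string building follows. ReverseBenchmark, ReverseConcat, ReverseBuilder and TimeMs are hypothetical names, and no timings are claimed here since they depend on the machine.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Text;

public static class ReverseBenchmark
{
    // Same algorithm as Reverse above, but returns the result instead of printing it.
    public static string ReverseConcat(string text)
    {
        string reverse = string.Empty;
        foreach (char ch in text)
        {
            reverse = ch + reverse; // allocates a new string on every pass
        }
        return reverse;
    }

    // Same algorithm as ReverseFast above, with the builder sized up front.
    public static string ReverseBuilder(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = text.Length - 1; i >= 0; i--)
        {
            sb.Append(text[i]);
        }
        return sb.ToString();
    }

    // Times a single call; console output stays outside the measurement.
    public static long TimeMs(Func<string, string> f, string input)
    {
        Stopwatch sw = Stopwatch.StartNew();
        f(input);
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        // 1000 alphabets = 26,000 characters, matching the middle row above.
        string input = string.Concat(Enumerable.Repeat("abcdefghijklmnopqrstuvwxyz", 1000));
        Console.WriteLine("Concat:        " + TimeMs(ReverseConcat, input) + " ms");
        Console.WriteLine("StringBuilder: " + TimeMs(ReverseBuilder, input) + " ms");
    }
}
```

A single run also includes JIT warm-up, so repeating each measurement a few times and discarding the first gives steadier numbers.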
