What is a quick way to force CRLF in C# / .NET? - c#

How would you normalize all new-line sequences in a string to one type?
I'm looking to make them all CRLF for the purpose of email (MIME documents). Ideally this would be wrapped in a static method, executing very quickly, and not using regular expressions (since the variances of line breaks, carriage returns, etc. are limited). Perhaps there's even a BCL method I've overlooked?
ASSUMPTION: After giving this a bit more thought, I think it's a safe assumption to say that CR's are either stand-alone or part of the CRLF sequence. That is, if you see CRLF then you know all CR's can be removed. Otherwise it's difficult to tell how many lines should come out of something like "\r\n\n\r".

input.Replace("\r\n", "\n").Replace("\r", "\n").Replace("\n", "\r\n")
This will work if the input contains only one type of line break: either CR, or LF, or CR+LF.
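Since the question asks for a static method, here is a minimal sketch that wraps the Replace chain above. The class and method names (LineEndings, ToCrLf) are made up for illustration. Tracing it by hand suggests it also copes with mixed input, provided a lone CR is meant to count as a line break of its own.

```csharp
using System;

static class LineEndings
{
    // Hypothetical helper name; wraps the three-step Replace chain.
    public static string ToCrLf(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return input;
        }
        return input
            .Replace("\r\n", "\n")  // collapse existing CRLF pairs first
            .Replace("\r", "\n")    // then treat any remaining lone CR as a break
            .Replace("\n", "\r\n"); // finally expand every LF to CRLF
    }

    static void Main()
    {
        string result = ToCrLf("a\rb\nc\r\nd");
        // Make the control characters visible before printing.
        Console.WriteLine(result.Replace("\r", "\\r").Replace("\n", "\\n"));
        // prints: a\r\nb\r\nc\r\nd
    }
}
```

Each Replace call allocates a new string, so this trades some speed for brevity.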

It depends on exactly what the requirements are. In particular, how do you want to handle "\r" on its own? Should that count as a line break or not? As an example, how should "a\n\rb" be treated? Is that one very odd line break, one "\n" break and then a rogue "\r", or two separate linebreaks? If "\r" and "\n" can both be linebreaks on their own, why should "\r\n" not be treated as two linebreaks?
Here's some code which I suspect is reasonably efficient.
using System;
using System.Text;
class LineBreaks
{
static void Main()
{
Test("a\nb");
Test("a\nb\r\nc");
Test("a\r\nb\r\nc");
Test("a\rb\nc");
Test("a\r");
Test("a\n");
Test("a\r\n");
}
static void Test(string input)
{
string normalized = NormalizeLineBreaks(input);
string debug = normalized.Replace("\r", "\\r")
.Replace("\n", "\\n");
Console.WriteLine(debug);
}
static string NormalizeLineBreaks(string input)
{
// Allow 10% as a rough guess of how much the string may grow.
// If we're wrong we'll either waste space or have extra copies -
// it will still work
StringBuilder builder = new StringBuilder((int) (input.Length * 1.1));
bool lastWasCR = false;
foreach (char c in input)
{
if (lastWasCR)
{
lastWasCR = false;
if (c == '\n')
{
continue; // Already written \r\n
}
}
switch (c)
{
case '\r':
builder.Append("\r\n");
lastWasCR = true;
break;
case '\n':
builder.Append("\r\n");
break;
default:
builder.Append(c);
break;
}
}
return builder.ToString();
}
}

Simple variant:
Regex.Replace(input, @"\r\n|\r|\n", "\r\n")
For better performance:
static Regex newline_pattern = new Regex(@"\r\n|\r|\n", RegexOptions.Compiled);
[...]
newline_pattern.Replace(input, "\r\n");

string nonNormalized = "\r\n\n\r";
string normalized = nonNormalized.Replace("\r", "\n").Replace("\n", "\r\n");

This is a quick way to do that.
It does not use an expensive regex function.
It also does not chain multiple Replace calls, each of which would loop over the data separately with its own checks, allocations, etc.
Instead, the search is done directly in a single for loop. Each time the capacity of the result array has to be increased, Array.Copy loops over the data as well. Those are all the loops.
In some cases, a larger page size might be more efficient.
public static string NormalizeNewLine(this string val)
{
if (string.IsNullOrEmpty(val))
return val;
const int page = 6;
int a = page;
int j = 0;
int len = val.Length;
char[] res = new char[len];
for (int i = 0; i < len; i++)
{
char ch = val[i];
if (ch == '\r')
{
int ni = i + 1;
if (ni < len && val[ni] == '\n')
{
res[j++] = '\r';
res[j++] = '\n';
i++;
}
else
{
if (a == page) // Ensure capacity
{
char[] nres = new char[res.Length + page];
Array.Copy(res, 0, nres, 0, res.Length);
res = nres;
a = 0;
}
res[j++] = '\r';
res[j++] = '\n';
a++;
}
}
else if (ch == '\n')
{
int ni = i + 1;
if (ni < len && val[ni] == '\r')
{
res[j++] = '\r';
res[j++] = '\n';
i++;
}
else
{
if (a == page) // Ensure capacity
{
char[] nres = new char[res.Length + page];
Array.Copy(res, 0, nres, 0, res.Length);
res = nres;
a = 0;
}
res[j++] = '\r';
res[j++] = '\n';
a++;
}
}
else
{
res[j++] = ch;
}
}
return new string(res, 0, j);
}
I know that '\n\r' is not actually used on mainstream platforms. But who would use two types of line breaks in succession to indicate two line breaks?
If you need to handle that case, you have to scan the text beforehand to find out whether \n and \r are both used on their own in the same document.
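Such a look-ahead could be sketched as a single counting pass. The names below (LineBreakInspector, Count) are hypothetical, and this sketch counts "\n\r" as one lone LF followed by one lone CR.

```csharp
using System;

static class LineBreakInspector
{
    // Counts CRLF pairs, lone CRs and lone LFs in one pass over the text.
    public static void Count(string text, out int crlf, out int loneCr, out int loneLf)
    {
        crlf = 0;
        loneCr = 0;
        loneLf = 0;
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == '\r')
            {
                if (i + 1 < text.Length && text[i + 1] == '\n')
                {
                    crlf++;
                    i++; // skip the LF that belongs to this CRLF pair
                }
                else
                {
                    loneCr++;
                }
            }
            else if (text[i] == '\n')
            {
                loneLf++;
            }
        }
    }

    static void Main()
    {
        int crlf, loneCr, loneLf;
        Count("a\r\nb\nc\rd", out crlf, out loneCr, out loneLf);
        Console.WriteLine(crlf + " " + loneCr + " " + loneLf); // prints: 1 1 1
    }
}
```

If both loneCr and loneLf come back non-zero, the document mixes the two on their own and the "\n\r" shortcut is risky.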

Environment.NewLine;
A string containing "\r\n" for non-Unix platforms, or a string containing "\n" for Unix platforms.

str.Replace("\r", "").Replace("\n", "\r\n");
Converts both types of line breaks (\n and \r\n) into CRLFs.
On .NET 6 it's 35% faster than regex (benchmarked using BenchmarkDotNet).
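A small sketch of what this two-step Replace does, under the question's assumption that every CR is part of a CRLF pair. The helper names are invented for the demo. Note that a stand-alone CR is simply dropped rather than turned into a line break.

```csharp
using System;

static class StripCrDemo
{
    // Hypothetical wrapper around the two-step Replace.
    public static string StripCrThenExpand(string str)
    {
        return str.Replace("\r", "").Replace("\n", "\r\n");
    }

    // Makes CR and LF visible for printing.
    static string Show(string s)
    {
        return s.Replace("\r", "\\r").Replace("\n", "\\n");
    }

    static void Main()
    {
        Console.WriteLine(Show(StripCrThenExpand("a\nb\r\nc"))); // prints: a\r\nb\r\nc
        Console.WriteLine(Show(StripCrThenExpand("a\rb")));      // prints: ab
    }
}
```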

Related

remove string between "|" and "," in stringbuilder in C#

I use VS2019 in Windows7.
I want to remove string between "|" and "," in a StringBuilder.
That is, I want to convert the StringBuilder from
"578.552|0,37.986|317,38.451|356,23"
to
"578.552,37.986,38.451,23"
I have tried Substring but failed, what other method I could use to achieve this?
If you have a huge StringBuilder and that's why converting it into String and applying regular expression is not the option,
you can try implementing Finite State Machine (FSM):
StringBuilder source = new StringBuilder("578.552|0,37.986|317,38.451|356,23");
int state = 0; // 0 - keep character, 1 - discard character
int index = 0;
for (int i = 0; i < source.Length; ++i) {
char c = source[i];
if (state == 0)
if (c == '|')
state = 1;
else
source[index++] = c;
else if (c == ',') {
state = 0;
source[index++] = c;
}
}
source.Length = index;
StringBuilder isn't really set up for much by way of inspection and mutation in the middle. It would be pretty easy to do once you have a string (probably via a Regex), but StringBuilder? Not so much. In reality, StringBuilder is mostly intended for forwards-only append, so the answer would be:
if you didn't want those characters, why did you add them?
Maybe just use the string version here; then:
var s = "578.552|0,37.986|317,38.451|356,23";
var t = Regex.Replace(s, @"\|.*?(?=,)", ""); // 578.552,37.986,38.451,23
The regex translation here is "pipe (\|), non-greedy anything (.*?), followed by a comma where the following comma isn't part of the match ((?=,))".
If you don't know much about Regex patterns, you can write your own custom method to filter out the data; it's always instructive and good practice:
public static String RemoveDelimitedSubstrings(
this StringBuilder s,
char startDelimitter,
char endDelimitter,
char newDelimitter)
{
var buffer = new StringBuilder(s.Length);
var ignore = false;
for (var i = 0; i < s.Length; i++)
{
var currentChar = s[i];
if (currentChar == startDelimitter && !ignore)
{
ignore = true;
}
else if (currentChar == endDelimitter && ignore)
{
ignore = false;
buffer.Append(newDelimitter);
}
else if (!ignore)
buffer.Append(currentChar);
}
return buffer.ToString();
}
And you'd obviously use it like:
var buffer = new StringBuilder("578.552|0,37.986|317,38.451|356,23");
var filteredBuffer = buffer.RemoveDelimitedSubstrings('|', ',', ',');

C# Console Word Wrap

I have a string with newline characters and I want to wrap the words. I want to keep the newline characters so that when I display the text it looks like separate paragraphs. Anyone have a good function to do this? Current function and code below.(not my own function). The WordWrap function seems to be stripping out \n characters.
static void Main(string[] args){
StreamReader streamReader = new StreamReader("E:/Adventure Story/Intro.txt");
string intro = "";
string line;
while ((line = streamReader.ReadLine()) != null)
{
intro += line;
if(line == "")
{
intro += "\n\n";
}
}
WordWrap(intro);
}
public static void WordWrap(string paragraph)
{
paragraph = new Regex(@" {2,}").Replace(paragraph.Trim(), @" ");
var left = Console.CursorLeft; var top = Console.CursorTop; var lines = new List<string>();
for (var i = 0; paragraph.Length > 0; i++)
{
lines.Add(paragraph.Substring(0, Math.Min(Console.WindowWidth, paragraph.Length)));
var length = lines[i].LastIndexOf(" ", StringComparison.Ordinal);
if (length > 0) lines[i] = lines[i].Remove(length);
paragraph = paragraph.Substring(Math.Min(lines[i].Length + 1, paragraph.Length));
Console.SetCursorPosition(left, top + i); Console.WriteLine(lines[i]);
}
}
Here is a word wrap function that works by using regular expressions to find the places that it's ok to break and places where it must break. Then it returns pieces of the original text based on the "break zones". It even allows for breaks at hyphens (and other characters) without removing the hyphens (since the regex uses a zero-width positive lookbehind assertion).
IEnumerable<string> WordWrap(string text, int width)
{
const string forcedBreakZonePattern = @"\n";
const string normalBreakZonePattern = @"\s+|(?<=[-,.;])|$";
var forcedZones = Regex.Matches(text, forcedBreakZonePattern).Cast<Match>().ToList();
var normalZones = Regex.Matches(text, normalBreakZonePattern).Cast<Match>().ToList();
int start = 0;
while (start < text.Length)
{
var zone =
forcedZones.Find(z => z.Index >= start && z.Index <= start + width) ??
normalZones.FindLast(z => z.Index >= start && z.Index <= start + width);
if (zone == null)
{
yield return text.Substring(start, width);
start += width;
}
else
{
yield return text.Substring(start, zone.Index - start);
start = zone.Index + zone.Length;
}
}
}
If you want another newline to make the text look like paragraphs, just use the Replace method of your String object.
var str =
"Line 1\n" +
"Line 2\n" +
"Line 3\n";
Console.WriteLine("Before:\n" + str);
str = str.Replace("\n", "\n\n");
Console.WriteLine("After:\n" + str);
Recently I've been working on creating some abstractions that imitate window-like features in a performance- and memory-sensitive console context.
To this end I had to implement word-wrapping functionality without any unnecessary string allocations.
The following is what I managed to simplify it into. This method:
preserves new-lines in the input string,
allows you to specify what characters it should break on (space, hyphen, etc.),
returns the start indices and lengths of the lines via Microsoft.Extensions.Primitives.StringSegment struct instances (but it's very simple to replace this struct with your own, or append directly to a StringBuilder).
public static IEnumerable<StringSegment> WordWrap(string input, int maxLineLength, char[] breakableCharacters)
{
int lastBreakIndex = 0;
while (true)
{
var nextForcedLineBreak = lastBreakIndex + maxLineLength;
// If the remainder is shorter than the allowed line-length, return the remainder. Short-circuits instantly for strings shorter than line-length.
if (nextForcedLineBreak >= input.Length)
{
yield return new StringSegment(input, lastBreakIndex, input.Length - lastBreakIndex);
yield break;
}
// If there are native new lines before the next forced break position, use the last native new line as the starting position of our next line.
int nativeNewlineIndex = input.LastIndexOf(Environment.NewLine, nextForcedLineBreak, maxLineLength);
if (nativeNewlineIndex > -1)
{
nextForcedLineBreak = nativeNewlineIndex + Environment.NewLine.Length + maxLineLength;
}
// Find the last breakable point preceding the next forced break position (and include the breakable character, which might be a hyphen).
var nextBreakIndex = input.LastIndexOfAny(breakableCharacters, nextForcedLineBreak, maxLineLength) + 1;
// If there is no breakable point, which means a word is longer than line length, force-break it.
if (nextBreakIndex == 0)
{
nextBreakIndex = nextForcedLineBreak;
}
yield return new StringSegment(input, lastBreakIndex, nextBreakIndex - lastBreakIndex);
lastBreakIndex = nextBreakIndex;
}
}

Parse comma separated string with a complication in C#

I know how to get substrings from a string which are comma separated, but here's a complication: what if a substring contains a comma?
If a substring contains a comma, new line or double quotes, the entire substring is encapsulated with double quotes.
If a substring contains a double quote the double quote is escaped with another double quote.
Worst case scenario would be if I have something like this:
first,"second, second","""third"" third","""fourth"", fourth"
In this case substrings are:
first
second, second
"third" third
"fourth", fourth
second, second is encapsulated with double quotes; I don't want those double quotes in a list/array.
"third" third is encapsulated with double quotes because it contains double quotes, and those are escaped with additional double quotes. Again, I don't want the encapsulating double quotes in a list/array and I don't want the double quotes that escape double quotes, but I do want the original double quotes which are a part of the substring.
One way using TextFieldParser:
using (var reader = new StringReader("first,\"second, second\",\"\"\"third\"\" third\",\"\"\"fourth\"\", fourth\""))
using (var parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(reader))
{
parser.Delimiters = new[] { "," };
parser.HasFieldsEnclosedInQuotes = true;
while (!parser.EndOfData)
{
foreach (var field in parser.ReadFields())
Console.WriteLine(field);
}
}
Output:
first
second, second
"third" third
"fourth", fourth
Try this
string input = "first,\"second, second\",\"\"\"third\"\" third\",\"\"\"fourth\"\", fourth\"";
string[] output = input.Split(new string[] {"\",\""}, StringSplitOptions.RemoveEmptyEntries);
I would suggest you to construct a small state machine for this problem. You would have states like:
Out - before the first field is reached
InQuoted - you were Out and " arrived; now you're in and the field is quoted
InQuotedMaybeOut - you were InQuoted and " arrived; now you wait for the next character to figure whether it is another " or something else; if else, then select the next valid state (character could be space, new line, comma, so you decide the next state); otherwise, if " arrived, you push " to the output and step back to InQuoted
In - after Out, when any character has arrived except , and ", you are automatically inside a new field which is not quoted.
This will certainly read CSV correctly. You can also make the separator configurable, so that you support TSV or semicolon-separated format.
Also keep in mind one very important case in CSV format: Quoted field may contain new line! Another special case to keep an eye on: empty field (like: ,,).
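The states described above can be sketched roughly as follows. This is an illustration, not a complete CSV reader: the names (CsvStateMachine, ParseRecord) are hypothetical, it handles a single record only, and it leniently treats any text after a closing quote as plain field content.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class CsvStateMachine
{
    private enum State { Out, In, InQuoted, InQuotedMaybeOut }

    // Splits one CSV record into fields; the separator is configurable.
    public static List<string> ParseRecord(string input, char separator = ',')
    {
        var fields = new List<string>();
        var field = new StringBuilder();
        State state = State.Out;

        foreach (char c in input)
        {
            switch (state)
            {
                case State.Out:
                    if (c == separator) { fields.Add(string.Empty); } // empty field, as in ",,"
                    else if (c == '"') { state = State.InQuoted; }
                    else { field.Append(c); state = State.In; }
                    break;

                case State.In:
                    if (c == separator)
                    {
                        fields.Add(field.ToString());
                        field.Clear();
                        state = State.Out;
                    }
                    else { field.Append(c); }
                    break;

                case State.InQuoted:
                    if (c == '"') { state = State.InQuotedMaybeOut; }
                    else { field.Append(c); } // separators and new lines are data here
                    break;

                case State.InQuotedMaybeOut:
                    if (c == '"') { field.Append('"'); state = State.InQuoted; } // escaped quote
                    else if (c == separator)
                    {
                        fields.Add(field.ToString());
                        field.Clear();
                        state = State.Out;
                    }
                    else { field.Append(c); state = State.In; } // lenient: text after the closing quote
                    break;
            }
        }

        // Whatever is buffered when the input ends is the last field (possibly empty).
        fields.Add(field.ToString());
        return fields;
    }

    static void Main()
    {
        string line = "first,\"second, second\",\"\"\"third\"\" third\",\"\"\"fourth\"\", fourth\"";
        foreach (string f in ParseRecord(line))
        {
            Console.WriteLine(f);
        }
        // prints:
        // first
        // second, second
        // "third" third
        // "fourth", fourth
    }
}
```

Passing '\t' or ';' as the separator gives the TSV or semicolon-separated variants mentioned above.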
This is not the most elegant solution but it might help you. I would loop through the characters and do an odd-even count of the quotes. For example you have a bool that is true if you have encountered an odd number of quotes and false for an even number of quotes.
Any comma encountered while this bool value is true should not be considered as a separator. If you know it is a separator you can do several things with that information. Below I replaced the delimiter with something more manageable (not very efficient though):
bool odd = false;
char replacementDelimiter = '|'; // Or some very unlikely character
// Strings are immutable in C#, so work on a mutable copy of the characters.
char[] chars = str.ToCharArray();
for (int i = 0; i < chars.Length; ++i)
{
if (chars[i] == '"')
odd = !odd;
else if (chars[i] == ',')
{
if (!odd)
chars[i] = replacementDelimiter;
}
}
string[] commaSeparatedTokens = new string(chars).Split(replacementDelimiter);
At this point you should have an array of strings that are separated on the commas that you have intended. From here on it will be simpler to handle the quotes.
I hope this can help you.
Mini parser
using System;
using System.Collections.Generic;
using System.Text;
namespace ConsoleApp
{
class Program
{
private static IEnumerable<string> Parse(string input)
{
if (string.IsNullOrWhiteSpace(input))
{
// empty string => nothing to do
yield break;
}
int count = input.Length;
StringBuilder sb = new StringBuilder();
int j;
for (int i = 0; i < count; i++)
{
char c = input[i];
if (c == ',')
{
yield return sb.ToString();
sb.Clear();
}
else if (c == '"')
{
// begin quoted string
sb.Clear();
for (j = i + 1; j < count; j++)
{
if (input[j] == '"')
{
// quote
if (j < count - 1 && input[j + 1] == '"')
{
// double quote
sb.Append('"');
j++;
}
else
{
break;
}
}
else
{
sb.Append(input[j]);
}
}
yield return sb.ToString();
// clear buffer and skip to next comma
sb.Clear();
for (i = j + 1; i < count && input[i] != ','; i++) ;
}
else
{
sb.Append(c);
}
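}
// The loop above only emits a field when it meets a comma or a closing quote,
// so emit a trailing unquoted field here.
if (sb.Length > 0)
{
yield return sb.ToString();
}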
}
[STAThread]
static void Main(string[] args)
{
foreach (string str in Parse("first,\"second, second\",\"\"\"third\"\" third\",\"\"\"fourth\"\", fourth\""))
{
Console.WriteLine(str);
}
Console.WriteLine();
Console.WriteLine("Press any key to continue...");
Console.ReadKey();
}
}
}
Result
first
second, second
"third" third
"fourth", fourth
Thank you for your answers, but before I got to see them I wrote this solution, it's not pretty but it works for me.
string line = "first,\"second, second\",\"\"\"third\"\" third\",\"\"\"fourth\"\", fourth\"";
var substringArray = new List<string>();
string substring = null;
var doubleQuotesCount = 0;
for (var i = 0; i < line.Length; i++)
{
if (line[i] == ',' && (doubleQuotesCount % 2) == 0)
{
substringArray.Add(substring);
substring = null;
doubleQuotesCount = 0;
continue;
}
else
{
if (line[i] == '"')
doubleQuotesCount++;
substring += line[i];
//If it is a last character
if (i == line.Length - 1)
{
substringArray.Add(substring);
substring = null;
doubleQuotesCount = 0;
}
}
}
for(var i = 0; i < substringArray.Count; i++)
{
if (substringArray[i] != null)
{
//remove first double quote
if (substringArray[i][0] == '"')
{
substringArray[i] = substringArray[i].Substring(1);
}
//remove last double quote
if (substringArray[i][substringArray[i].Length - 1] == '"')
{
substringArray[i] = substringArray[i].Remove(substringArray[i].Length - 1);
}
//Replace double double quotes with single double quote
substringArray[i] = substringArray[i].Replace("\"\"", "\"");
}
}

C# loop through array of chars not efficient

I have the code below, where I loop through a string and compare everything char by char. It's a very slow process, and I wonder how I can improve it.
//delete anti-xss junk ")]}'\n" (5 chars);
if (trim)
{
googlejson = googlejson.Substring(5);
}
//pass through result and turn empty elements into nulls
//echo strlen( $googlejson ) . '<br>';
bool instring = false;
bool inescape = false;
string lastchar = "";
string output = "";
for ( int x=0; x< googlejson.Length; x++ ) {
string ch = googlejson.Substring(x, 1);
//toss unnecessary whitespace
if ( !instring && ( Regex.IsMatch(ch, @"/\s/"))) {
continue;
}
//handle strings
if ( instring ) {
if (inescape) {
output += ch;
inescape = false;
} else if ( ch == "\\" ) {
output += ch;
inescape = true;
} else if ( ch == "\"") {
output += ch;
instring = false;
} else {
output += ch;
}
lastchar = ch;
continue;
}
switch ( ch ) {
case "\"":
output += ch;
instring = true;
break;
case ",":
if ( lastchar == "," || lastchar == "[" || lastchar == "{" ) {
output += "null";
}
output += ch;
break;
case "]":
case "}":
if ( lastchar == "," ) {
output += "null";
}
output += ch;
break;
default:
output += ch;
break;
}
lastchar = ch;
}
return output;
This is just amazing.
I changed the following two lines and gained a phenomenal performance increase, something like 1000%.
First change this
string ch = googlejson.Substring(x, 1);
to that
string ch = googlejson[x].ToString();
Second I replaced all += ch with String Builder
output.Append(ch);
So those 2 changes had maximum performance impact.
First, you shouldn't use Substrings, when only dealing with single characters. Use
char ch = googlejson[x];
instead.
You could also consider using a StringBuilder for your output variable. If you're working with string, you should always have in mind, that strings are immutable in .NET, so for every
output += ch;
there is a new string instance created.
Use
StringBuilder output = new StringBuilder();
and
output.Append(ch);
instead.
As per the other comments, this code's use of strings as characters and Substring() is pretty dire - in terms of performance.
Also, the use of Regex to check for whitespace is going to be very inefficient.
If you want to operate on characters, use characters (char) not strings.
The for loop is a bit inefficient, but the JIT compiler probably optimises that away. It would be slightly better to use a local variable instead of accessing Length property.
Doing a switch on strings is pretty inefficient too, when a switch on characters is darn fast.
And as MartinStettner suggested, StringBuilder append will be better for building the result. (@Tom Squires - This question is all about performance, so yes it does matter, and it isn't more complex - it may be a few more characters but that's not complexity.)
Finally, I would say that if you have performance problems (apart from this dire code), you should consider measuring it with a profiler before getting carried away with optimisation.
PS This looks like an interview question ... tut tut if this is the case, that's not what SO is for.
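Putting those suggestions together (char instead of string, a switch on characters, StringBuilder for the output, no Regex), the loop from the question might look something like this. It is a sketch rather than a drop-in replacement: the names JsonNullFiller and FillEmptyElements are invented, and it assumes the intent of the original pattern was "skip whitespace outside strings", which is what char.IsWhiteSpace does here.

```csharp
using System;
using System.Text;

static class JsonNullFiller
{
    // Drops whitespace outside strings and writes "null" for empty elements.
    public static string FillEmptyElements(string json)
    {
        var output = new StringBuilder(json.Length + 16);
        bool inString = false;
        bool inEscape = false;
        char lastChar = '\0';

        foreach (char ch in json)
        {
            if (inString)
            {
                // Inside a string literal everything is copied verbatim.
                output.Append(ch);
                if (inEscape) { inEscape = false; }
                else if (ch == '\\') { inEscape = true; }
                else if (ch == '"') { inString = false; }
                lastChar = ch;
                continue;
            }

            if (char.IsWhiteSpace(ch))
            {
                continue; // toss unnecessary whitespace
            }

            switch (ch)
            {
                case '"':
                    inString = true;
                    break;
                case ',':
                    if (lastChar == ',' || lastChar == '[' || lastChar == '{')
                    {
                        output.Append("null");
                    }
                    break;
                case ']':
                case '}':
                    if (lastChar == ',')
                    {
                        output.Append("null");
                    }
                    break;
            }
            output.Append(ch);
            lastChar = ch;
        }
        return output.ToString();
    }

    static void Main()
    {
        Console.WriteLine(FillEmptyElements("[1,,\"a b\", ,]"));
        // prints: [1,null,"a b",null,null]
    }
}
```

As the answer above says, it is still worth measuring with a profiler before and after.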
Why not use the StringReader instead of SubString
var output = new StringBuilder();
using (var reader = new StringReader(googlejson))
{
var buffer = new char[1];
while (reader.Read(buffer, 0, 1) == 1)
{
var ch = buffer[0];
//your stuff
output.Append(ch);
}
}
return output.ToString();
You could use StringReader.Read() and do all your logic on the integer code value of the character, which would be fast but a little brittle.
What about:
if ( !instring && ( Regex.IsMatch(ch, @"/\s/")))
to
if ( !instring && ch < 33)
or even better:
if ( !instring && Char.IsWhiteSpace(ch))

Best approach of word censoring - C# 4.0

For my custom-made chat screen I am using the code below to check for censored words. But I wonder whether the performance of this code can be improved. Thank you.
if (srMessageTemp.IndexOf(" censored1 ") != -1)
return;
if (srMessageTemp.IndexOf(" censored2 ") != -1)
return;
if (srMessageTemp.IndexOf(" censored3 ") != -1)
return;
C# 4.0. Actually the list is a lot longer, but I don't put it all here.
I would use LINQ or regular expression for this:
LINQ: How to: Query for Sentences that Contain a Specified Set of Words (LINQ)
Regular Expression: Highlight a list of words using a regular expression in c#
You can simplify it. Here listOfCensoredWords will contain all the censored words:
if (listOfCensoredWords.Any(item => srMessageTemp.Contains(item)))
return;
If you want to make it really fast, you can use Aho-Corasick automaton. This is how antivirus software checks thousands of viruses at once. But I don't know where you can get the implementation done, so it will require much more work from you compared to using just simple slow methods like regular expressions.
See the theory here: http://en.wikipedia.org/wiki/Aho-Corasick
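For what it's worth, a bare-bones version of such an automaton is not very long. The following is only a sketch under a few assumptions: it answers yes/no rather than reporting which word matched, it is case-sensitive, and the class and member names are made up for the example.

```csharp
using System;
using System.Collections.Generic;

sealed class AhoCorasick
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;
        public bool IsMatch; // a pattern ends here, directly or via a failure link
    }

    private readonly Node root = new Node();

    public AhoCorasick(IEnumerable<string> patterns)
    {
        // Step 1: build a trie of all patterns.
        foreach (string pattern in patterns)
        {
            if (string.IsNullOrEmpty(pattern)) continue;
            Node node = root;
            foreach (char c in pattern)
            {
                Node child;
                if (!node.Next.TryGetValue(c, out child))
                {
                    child = new Node();
                    node.Next[c] = child;
                }
                node = child;
            }
            node.IsMatch = true;
        }

        // Step 2: breadth-first pass that sets the failure links.
        var queue = new Queue<Node>();
        foreach (Node child in root.Next.Values)
        {
            child.Fail = root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            Node node = queue.Dequeue();
            foreach (KeyValuePair<char, Node> edge in node.Next)
            {
                Node fail = node.Fail;
                while (fail != root && !fail.Next.ContainsKey(edge.Key))
                {
                    fail = fail.Fail;
                }
                Node target;
                edge.Value.Fail = fail.Next.TryGetValue(edge.Key, out target) ? target : root;
                edge.Value.IsMatch |= edge.Value.Fail.IsMatch;
                queue.Enqueue(edge.Value);
            }
        }
    }

    // Scans the text once and reports whether any pattern occurs in it.
    public bool ContainsAny(string text)
    {
        Node node = root;
        foreach (char c in text)
        {
            while (node != root && !node.Next.ContainsKey(c))
            {
                node = node.Fail;
            }
            Node next;
            if (node.Next.TryGetValue(c, out next))
            {
                node = next;
            }
            if (node.IsMatch) return true;
        }
        return false;
    }
}

static class Demo
{
    static void Main()
    {
        var filter = new AhoCorasick(new[] { "censored1", "censored2", "censored3" });
        Console.WriteLine(filter.ContainsAny("this is censored2, really")); // prints: True
        Console.WriteLine(filter.ContainsAny("nothing to see here"));       // prints: False
    }
}
```

The automaton is built once and reused for every message, so the per-message cost is a single pass over the text regardless of how long the word list is.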
First, I hope you aren't really "tokenizing" the words as written. You know, just because someone doesn't put a space before a bad word, it doesn't make the word less bad :-) Example ,badword,
I'll say that I would use a Regex here :-) I'm not sure if a Regex or a hand-made parser would be faster, but at least a Regex would be a good starting point. As others wrote, you begin by splitting the text into words and then checking a HashSet<string>.
I'm adding a second version of the code, based on ArraySegment<char>. I speak later of this.
class Program
{
class ArraySegmentComparer : IEqualityComparer<ArraySegment<char>>
{
public bool Equals(ArraySegment<char> x, ArraySegment<char> y)
{
if (x.Count != y.Count)
{
return false;
}
int end = x.Offset + x.Count;
for (int i = x.Offset, j = y.Offset; i < end; i++, j++)
{
if (!x.Array[i].ToString().Equals(y.Array[j].ToString(), StringComparison.InvariantCultureIgnoreCase))
{
return false;
}
}
return true;
}
public int GetHashCode(ArraySegment<char> obj)
{
unchecked
{
int hash = 17;
int end = obj.Offset + obj.Count;
int i;
for (i = obj.Offset; i < end; i++)
{
hash *= 23;
hash += Char.ToUpperInvariant(obj.Array[i]);
}
return hash;
}
}
}
static void Main()
{
var rx = new Regex(@"\b\w+\b", RegexOptions.Compiled);
var sampleText = @"For my custom made chat screen i am using the code below for checking censored words. But i wonder can this code performance improved. Thank you.
if (srMessageTemp.IndexOf("" censored1 "") != -1)
return;
if (srMessageTemp.IndexOf("" censored2 "") != -1)
return;
if (srMessageTemp.IndexOf("" censored3 "") != -1)
return;
C# 4.0 . actually list is a lot more long but i don't put here as it goes away.
And now some accented letters àèéìòù and now some letters with unicode combinable diacritics àèéìòù";
//sampleText += sampleText;
//sampleText += sampleText;
//sampleText += sampleText;
//sampleText += sampleText;
//sampleText += sampleText;
//sampleText += sampleText;
//sampleText += sampleText;
HashSet<string> prohibitedWords = new HashSet<string>(StringComparer.InvariantCultureIgnoreCase) { "For", "custom", "combinable", "away" };
Stopwatch sw1 = Stopwatch.StartNew();
var words = rx.Matches(sampleText);
foreach (Match word in words)
{
string str = word.Value;
if (prohibitedWords.Contains(str))
{
Console.Write(str);
Console.Write(" ");
}
else
{
//Console.WriteLine(word);
}
}
sw1.Stop();
Console.WriteLine();
Console.WriteLine();
HashSet<ArraySegment<char>> prohibitedWords2 = new HashSet<ArraySegment<char>>(
prohibitedWords.Select(p => new ArraySegment<char>(p.ToCharArray())),
new ArraySegmentComparer());
var sampleText2 = sampleText.ToCharArray();
Stopwatch sw2 = Stopwatch.StartNew();
int startWord = -1;
for (int i = 0; i < sampleText2.Length; i++)
{
if (Char.IsLetter(sampleText2[i]) || Char.IsDigit(sampleText2[i]))
{
if (startWord == -1)
{
startWord = i;
}
}
else
{
if (startWord != -1)
{
int length = i - startWord;
if (length != 0)
{
var wordSegment = new ArraySegment<char>(sampleText2, startWord, length);
if (prohibitedWords2.Contains(wordSegment))
{
Console.Write(sampleText2, startWord, length);
Console.Write(" ");
}
else
{
//Console.WriteLine(sampleText2, startWord, length);
}
}
startWord = -1;
}
}
}
if (startWord != -1)
{
int length = sampleText2.Length - startWord;
if (length != 0)
{
var wordSegment = new ArraySegment<char>(sampleText2, startWord, length);
if (prohibitedWords2.Contains(wordSegment))
{
Console.Write(sampleText2, startWord, length);
Console.Write(" ");
}
else
{
//Console.WriteLine(sampleText2, startWord, length);
}
}
}
sw2.Stop();
Console.WriteLine();
Console.WriteLine();
Console.WriteLine(sw1.ElapsedTicks);
Console.WriteLine(sw2.ElapsedTicks);
}
}
I'll note that you could go faster by doing the parsing "in" the original string. What does this mean? If you subdivide the "document" into words and each word is put in a string, clearly you are creating n strings, one for each word of your document. But what if you skipped this step and operated directly on the document, simply keeping the current index and the length of the current word? Then it would be faster! Clearly then you would need to create a special comparer for the HashSet<>.
But wait! C# has something similar... It's called ArraySegment. So your document would be a char[] instead of a string and each word would be an ArraySegment<char>. Clearly this is much more complex! You can't simply use Regexes, you have to build "by hand" a parser (but I think converting the \b\w+\b expression would be quite easy). And creating a comparer for HashSet<char> would be a little complex (hint: you would use HashSet<ArraySegment<char>> and the words to be censored would be ArraySegments "pointing" to a char[] of a word and with size equal to the char[].Length, like var word = new ArraySegment<char>("tobecensored".ToCharArray());)
After some simple benchmarking, I can see that an unoptimized version of the program using ArraySegment<char> is as fast as the Regex version for shorter texts. This is probably because, if a word is 4-6 chars long, it's as "slow" to copy it around as it is to copy around an ArraySegment<char> (an ArraySegment<char> is 12 bytes, a word of 6 characters is 12 bytes. On top of both of these we have to add a little overhead... But in the end the numbers are comparable). But for longer texts (try uncommenting the //sampleText += sampleText; lines) it becomes a little faster (10%) in Release -> Start Without Debugging (CTRL-F5).
I'll note that comparing strings character by character is wrong. You should always use the methods given to you by the string class (or by the OS). They know how to handle "strange" cases much better than you (and in Unicode there isn't any "normal" case :-) )
You can use LINQ for this, but it's not required if you use a list to hold your censored values. The solution below uses the built-in list functions and allows you to do your searches case-insensitively.
private static List<string> _censoredWords = new List<string>()
{
"badwordone1",
"badwordone2",
"badwordone3",
"badwordone4",
};
static void Main(string[] args)
{
string badword1 = "BadWordOne2";
bool censored = ShouldCensorWord(badword1);
}
private static bool ShouldCensorWord(string word)
{
return _censoredWords.Contains(word.ToLower());
}
What do you think about this:
string[] censoredWords = new[] { " censored1 ", " censored2 ", " censored3 " };
if (censoredWords.Contains(srMessageTemp))
return;
