string.Contains and string.Replace in one single line of code - C#

I'm currently writing some software where I have to load a lot of column names from an external file. Usually I would do this with JSON, but for reasons of user-friendliness I can't do that right now. I need to use a text file that is readable to the users and includes a lot of comments.
So I have created my own file format to hold all of these values.
Now when I'm importing these values into my software, I essentially run through my config file line by line and check for every line whether it matches a parameter, which I then parse. But this way I end up with a big block of very repetitive code, and I was wondering if I could simplify it so that every check is done in just one line.
Here is the code I'm currently using:
if (line.Contains("[myValue]"))
{
    myParameter = line.Replace("[myValue]", string.Empty).Trim();
}
I know that using Linq you can simplify things and put them on one single line; I'm just not sure whether it would work in this case.
Thanks for your help!
Kenneth

Why not just create a method if this piece of code is repeated often:
void SetParameter(string line, string name, ref string parameter)
{
    if (line.Contains(name))
    {
        parameter = line.Replace(name, string.Empty).Trim();
    }
}
SetParameter(line, "[myValue]", ref myParameter);
If you want to avoid calling both Replace and Contains, which is probably a good idea, you could also just call Replace:
void SetParameter(string line, string name, ref string parameter)
{
    var replaced = line.Replace(name, string.Empty);
    if (line != replaced)
    {
        parameter = replaced.Trim();
    }
}

Try this way (ternary):
myParameter = line.Contains("[myValue]") ? line.Replace("[myValue]", string.Empty).Trim() : myParameter;
Actually, line.IndexOf should be faster.
From your code, it looks like you are replacing with just empty text, so why not take the entire string (consisting of many lines) and do the replacement in one shot, instead of checking one line at a time?
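A rough sketch of that whole-file replacement (configPath is just a placeholder for your config file's path; assumes System and System.IO are in scope):
string allText = File.ReadAllText(configPath);
// Strip every occurrence of the key in one call over the whole file
string cleaned = allText.Replace("[myValue]", string.Empty);
// If you still need per-line parsing afterwards, split it back into lines
string[] cleanedLines = cleaned.Split(new[] { Environment.NewLine }, StringSplitOptions.None);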

You could use Regex. This might relieve you of some of the repetitive code:
string line = "[myvalue1] some string [someotherstring] [myvalue2]";
// All your keys stored in a single place
string[] keylist = new string[] { @"\[myvalue1]", @"\[myvalue2]" };
var newString = Regex.Replace(line, string.Join("|", keylist), string.Empty);
Hope it helps.

Related

Search String Pattern in Large Text Files C#

I have been trying to search for string patterns in a large text file. I am reading it line by line and checking each line, which takes a lot of time. I did try with HashSet and ReadAllLines:
HashSet<string> strings = new HashSet<string>(File.ReadAllLines(@"D:\Doc\Tst.txt"));
Now when I try to search for the string, it doesn't match, because it is looking for a match against the entire row. I just want to check if the string appears somewhere in the row.
I had tried by using this:
using (System.IO.StreamReader file = new System.IO.StreamReader(@"D:\Doc\Tst.txt"))
{
    while ((CurrentLine = file.ReadLine()) != null)
    {
        vals = chk_log(CurrentLine, date_Format, (range.Cells[i][counter]).Value2, vals);
        if (vals == true)
            break;
    }
}
bool chk_log(string LineText, string date_to_chk, string publisher, bool tvals)
{
    if (LineText.Contains(date_to_chk))
        if (LineText.Contains(publisher))
        {
            tvals = true;
        }
        else
            tvals = false;
    else
        tvals = false;
    return tvals;
}
But this is consuming too much time. Any help on this would be good.
Reading into a HashSet doesn't make sense to me (unless there are a lot of duplicated lines) since you aren't testing for membership of the set.
Taking a really naive approach, you could just do this:
var isItThere = File.ReadAllLines(@"d:\docs\st.txt").Any(x =>
    x.Contains(date_to_chk) && x.Contains(publisher));
65K lines at (say) 1K a line isn't a lot of memory to worry about, and I personally wouldn't bother with Parallel since it sounds like it would be super fast to do anyway.
You could replace Any with First to find the first result, or Where to get an IEnumerable<string> containing all results.
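For instance (a quick sketch of those variants; date_to_chk and publisher are the same values as in the question, File.ReadLines streams the file lazily, and FirstOrDefault is used so a missing match gives null instead of an exception):
// First matching line (null if nothing matches)
var firstMatch = File.ReadLines(@"D:\Doc\Tst.txt")
    .FirstOrDefault(x => x.Contains(date_to_chk) && x.Contains(publisher));

// All matching lines as an IEnumerable<string>
var allMatches = File.ReadLines(@"D:\Doc\Tst.txt")
    .Where(x => x.Contains(date_to_chk) && x.Contains(publisher));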
You can use a compiled regular expression instead of String.Contains (compile once before looping over the lines). This typically gives better performance.
var regex = new Regex($"{date}|{publisher}", RegexOptions.Compiled);
foreach (string line in File.ReadLines(@"D:\Doc\Tst.txt"))
{
    if (regex.IsMatch(line)) break;
}
This also shows a convenient standard library function for reading a file line by line.
Or, depending on what you want to do...
var isItThere = File.ReadLines(@"D:\Doc\Tst.txt").Any(regex.IsMatch);

Search for string w/delimiter character

I created a little console program that will search text files and return all string lines that match a variable entered by a user. One issue I ran into: say I want to look up "1234", which represents a location code, but there is also a phone number containing "555-1234" in the line, so I get that one back too. I am thinking that if I include the delimiter (e.g. ",") with the variable (",1234,"), then maybe I can ensure the search is accurate. Am I on the right track, or is there a better way? This is where I am at so far:
string[] file = File.ReadAllLines(sPath);
foreach (string s in file)
{
    using (StreamWriter sw = File.AppendText(rPath))
    {
        if (sFound = Regex.IsMatch(s, string.Format(@"\b{0}\b",
            Regex.Escape(searchVariable))))
        {
            sw.WriteLine(s);
        }
    }
}
I'd say you are on the right track.
I'd suggest changing the regular expression so that it uses a negative lookbehind to match searchVariable only when it is not preceded by "-", so "1234" in "555-1234" wouldn't be matched, but ",1234" (for instance) would.
You only need Regex.Escape() if your search term might contain special regular-expression characters that should be treated literally, which from your question doesn't seem to be the case.
You could change the code to something like this (it's late so I haven't tested this!):
var lines = File.ReadAllLines(sPath);
var regex = new Regex(String.Format(@"(?<!-){0}\b", searchVariable));
if (lines.Any())
{
    using (var streamWriter = File.AppendText(rPath))
    {
        foreach (var line in lines)
        {
            if (regex.IsMatch(line))
            {
                streamWriter.WriteLine(line);
            }
        }
    }
}
A great website for testing these (often tricky!) regular expressions is Regex Hero.
Use Linq to CSV and make your life easier. Just go to Nuget and search Linq to CSV.

Is it possible to convert an array of strings into one string?

In my program, I read in a file using this statement:
string[] allLines = File.ReadAllLines(dataFile);
But I want to apply a Regex to the file as a whole (an example file is shown at the bottom) so I can eliminate certain stuff I'm not concerned with in the file. I can't use ReadAllText as I need to read it line by line for another purpose of the program (removing whitespace from each line).
Regex r = new Regex(@"CREATE TABLE [^\(]+\((.*)\) ON");
(thanks to chiccodoro for this code)
This is the Regex that I want to apply.
Is there any way to change the array back into one text file? Or any other solution to the problem?
One thing that pops into my mind is replacing the 'stuff' I'm not concerned with using string.Empty.
example file
USE [Shelleys Other Database]
CREATE TABLE db.exmpcustomers(
f_name varchar(100) NULL,
l_name varchar(100) NULL,
date_of_birth date NULL,
house_number int NULL,
street_name varchar(100) NULL
) ON [PRIMARY]
You can join a string[] into a single string like this:
string strmessage = string.Join(",", allLines);
Output: a single comma-separated string.
You can use String.Join():
string joined = String.Join(Environment.NewLine, allLines);
If you just want to write it back to the file, you can use File.WriteAllLines() and that works with an array.
String.Join will concatenate all the members of your array using any specified separator.
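For example (a small sketch; outputFile is a hypothetical target path used only for illustration):
string commaSeparated = String.Join(", ", allLines);   // any separator works
File.WriteAllLines(outputFile, allLines);              // writes one array element per line, no join needed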
It's going to be really hard to use regexen to deal with multi-line data a line at a time. So rather than muck about with that, I'm going to suggest that you first read it as one big string, do your multi-line regex business, and then you can split it into an array of strings using String.Split (split on newlines). The reason you want to do it in this order is so that any further operations on your file data will include the changes already made by the regex. If you join the strings, then do the regex, you will either have to split that string again, or lose the changes you've made to it while you operate on the original array.
Remember to use this for your regex matching, so that it will match across newlines:
Regex r = new Regex(@"CREATE TABLE [^\(]+\((.*)\) ON", RegexOptions.Singleline);
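Putting that together, a rough sketch of the read-then-match-then-split flow (using dataFile and the CREATE TABLE pattern from the question):
string allText = File.ReadAllText(dataFile);                 // the whole file as one string
Regex r = new Regex(@"CREATE TABLE [^\(]+\((.*)\) ON", RegexOptions.Singleline);
string columns = r.Match(allText).Groups[1].Value;           // the captured column definitions
// Now split the regex-processed text back into lines for the per-line whitespace work
string[] lines = columns.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);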
Just change from
string[] allLines = File.ReadAllLines(dataFile);
to
string allLines = File.ReadAllText(dataFile);
;)
Could you build up a buffer as you read in each line? I have the idea that this might be a bit more efficient than getting all the lines as a string array and then joining them (...though I haven't done a full study of the issue and would be interested to hear if there is some reason it is actually more efficient to go that way).
StringBuilder buffer = new StringBuilder();
string line = null;
using (StreamReader sr = new StreamReader(dataFile))
{
    while ((line = sr.ReadLine()) != null)
    {
        // Do whatever you need to do with the individual line...
        // ...then append the line to your buffer.
        buffer.Append(line);
    }
}
// Now, you can do whatever you need to do with the contents of
// the buffer.
string wholeText = buffer.ToString();
public string CreateStringFromArray(string[] allLines)
{
    StringBuilder builder = new StringBuilder();
    foreach (string item in allLines)
    {
        builder.Append(item);
        // Append a line break
        builder.Append(Environment.NewLine);
    }
    return builder.ToString();
}

Replace Bad words using Regex

I am trying to create a bad word filter method that I can call before every insert and update, to check the string for any bad words and replace them with "[Censored]".
I have an SQL table that has a list of bad words. I want to bring them back, add them to a List or string array, check through the string of text that has been passed in, and if any bad words are found, replace them and return a filtered string.
I am using C# for this.
Please see this "clbuttic" (or for your case cl[Censored]ic) article before doing a string replace without considering word boundaries:
http://www.codinghorror.com/blog/2008/10/obscenity-filters-bad-idea-or-incredibly-intercoursing-bad-idea.html
Update
Obviously not foolproof (see the article above - this approach is easy to get around or produces false positives...) and not optimized (the regular expressions should be cached and compiled), but the following will filter out whole words (no "clbuttics") and simple plurals of words:
const string CensoredText = "[Censored]";
const string PatternTemplate = @"\b({0})(s?)\b";
const RegexOptions Options = RegexOptions.IgnoreCase;
string[] badWords = new[] { "cranberrying", "chuffing", "ass" };
IEnumerable<Regex> badWordMatchers = badWords.
    Select(x => new Regex(string.Format(PatternTemplate, x), Options));
string input = "I've had no cranberrying sleep for chuffing chuffings days - the next door neighbour is playing classical music at full tilt!";
string output = badWordMatchers.
    Aggregate(input, (current, matcher) => matcher.Replace(current, CensoredText));
Console.WriteLine(output);
Gives the output:
I've had no [Censored] sleep for [Censored] [Censored] days - the next door neighbour is playing classical music at full tilt!
Note that "classical" does not become "cl[Censored]ical", as whole words are matched with the regular expression.
Update 2
And to demonstrate a flavour of how this (and in general basic string\pattern matching techniques) can be easily subverted, see the following string:
"I've had no cranberryıng sleep for chuffıng chuffıngs days - the next door neighbour is playing classical music at full tilt!"
I have replaced the "i"s with Turkish lower-case undotted "ı"s. Still looks pretty offensive!
Although I'm a big fan of Regex, I think it won't help you here. You should fetch your bad words into a string List or string array and use System.String.Replace on your incoming message.
Maybe better, use System.String.Split and .Join methods:
string mayContainBadWords = "... bla bla ...";
string[] badWords = new string[] { "bad", "worse", "worst" };
string[] temp = mayContainBadWords.Split(badWords, StringSplitOptions.RemoveEmptyEntries);
string cleanString = string.Join("[Censored]", temp);
In the sample, mayContainBadWords is the string you want to check, badWords is a string array you load from your bad word SQL table, and cleanString is your result.
You can use the string.Replace() method or the Regex class.
There is also a nice article about it, which can be found here.
With a little HTML-parsing skill, you can get a large list of swear words from noswear.
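A minimal sketch of the plain string.Replace approach mentioned above (badWords stands in for the list loaded from your SQL table, input for the incoming text; note that this does straight substring replacement with no word boundaries, which is exactly the "clbuttic" problem described earlier):
string filtered = input;
foreach (string badWord in badWords)
{
    // String.Replace returns a new string, so the result must be reassigned
    filtered = filtered.Replace(badWord, "[Censored]");
}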

.NET String parsing performance improvement - Possible Code Smell

The code below is designed to take a string in and remove any of a set of arbitrary words that are considered non-essential to a search phrase.
I didn't write the code, but need to incorporate it into something else. It works, and that's good, but it just feels wrong to me. However, I can't seem to get my head outside the box that this method has created to think of another approach.
Maybe I'm just making it more complicated than it needs to be, but I feel like this might be cleaner with a different technique, perhaps by using LINQ.
I would welcome any suggestions, including the suggestion that I'm overthinking it and that the existing code is perfectly clear, concise and performant.
So, here's the code:
private string RemoveNonEssentialWords(string phrase)
{
    // This array is being created manually for demo purposes. In production code it's passed in from elsewhere.
    string[] nonessentials = {"left", "right", "acute", "chronic", "excessive", "extensive",
                              "upper", "lower", "complete", "partial", "subacute", "severe",
                              "moderate", "total", "small", "large", "minor", "multiple", "early",
                              "major", "bilateral", "progressive"};
    int index = -1;
    for (int i = 0; i < nonessentials.Length; i++)
    {
        index = phrase.ToLower().IndexOf(nonessentials[i]);
        while (index >= 0)
        {
            phrase = phrase.Remove(index, nonessentials[i].Length);
            phrase = phrase.Trim().Replace("  ", " ");  // collapse the double space left by the removal
            index = phrase.IndexOf(nonessentials[i]);
        }
    }
    return phrase;
}
Thanks in advance for your help.
Cheers,
Steve
This appears to be an algorithm for removing stop words from a search phrase.
Here's one thought: If this is in fact being used for a search, do you need the resulting phrase to be a perfect representation of the original (with all original whitespace intact), but with stop words removed, or can it be "close enough" so that the results are still effectively the same?
One approach would be to tokenize the phrase (using the approach of your choice - could be a regex, I'll use a simple split) and then reassemble it with the stop words removed. Example:
public static string RemoveStopWords(string phrase, IEnumerable<string> stop)
{
    var tokens = Tokenize(phrase);
    var filteredTokens = tokens.Where(s => !stop.Contains(s));
    return string.Join(" ", filteredTokens.ToArray());
}
public static IEnumerable<string> Tokenize(string phrase)
{
    return phrase.Split(' ');
    // Or use a regex, such as:
    // return Regex.Split(phrase, @"\W+");
}
This won't give you exactly the same result, but I'll bet that it's close enough and it will definitely run a lot more efficiently. Actual search engines use an approach similar to this, since everything is indexed and searched at the word level, not the character level.
I guess your code is not doing what you want it to do anyway: "moderated" would be converted to "d", if I'm right. To get a good solution you have to specify your requirements in a bit more detail. I would probably use Replace or regular expressions.
I would use a regular expression (created inside the function) for this task. I think it would be capable of doing all the processing at once without having to make multiple passes through the string or having to create multiple intermediate strings.
private string RemoveNonEssentialWords(string phrase)
{
    return Regex.Replace(phrase,                                              // input
                         @"\b(" + String.Join("|", nonessentials) + @")\b",   // pattern
                         "",                                                  // replacement
                         RegexOptions.IgnoreCase)
                .Replace("  ", " ");
}
The \b at the beginning and end of the pattern makes sure that the match is on a boundary between alphanumeric and non-alphanumeric characters. In other words, it will not match just part of the word, like your sample code does.
Yeah, that smells.
I like little state machines for parsing; they can be self-contained inside a method using lists of delegates, looping through the characters in the input and sending each one through the state functions (which I have return the next state function based on the examined character).
For performance, I would flush whole words out to a StringBuilder after hitting a separating character and check each completed word against the list (I might use a hash set for that).
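A rough sketch of that word-flushing idea (not a full delegate-based state machine, just the buffer-and-check part; the names are illustrative, and the set is assumed to be built with StringComparer.OrdinalIgnoreCase):
static string RemoveWords(string phrase, HashSet<string> nonessentials)
{
    var output = new StringBuilder();
    var word = new StringBuilder();
    foreach (char c in phrase)
    {
        if (char.IsLetterOrDigit(c))
        {
            word.Append(c);                          // still inside a word
        }
        else
        {
            FlushWord(word, output, nonessentials);  // word finished: keep it or drop it
            output.Append(c);                        // keep the separator itself
        }
    }
    FlushWord(word, output, nonessentials);          // don't forget a trailing word
    return output.ToString();                        // may still contain doubled spaces where words were dropped
}

static void FlushWord(StringBuilder word, StringBuilder output, HashSet<string> nonessentials)
{
    if (word.Length == 0) return;
    string w = word.ToString();
    if (!nonessentials.Contains(w))                  // O(1) lookup in the hash set
        output.Append(w);
    word.Clear();
}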
I would create a hash table of the removed words, parse each word, and remove it if it's in the hash. That's only one pass through the array, and I believe creating a hash table is O(n).
How does this look?
foreach (string nonEssent in nonessentials)
{
    phrase = phrase.Replace(nonEssent, String.Empty);  // Replace returns a new string, so assign it back
}
phrase = phrase.Replace("  ", " ");
If you want to go the Regex route, you could do it like this. If you're going for speed it's worth a try and you can compare/contrast with other methods:
Start by creating a Regex from the array input. Something like:
var regexString = "\\b(" + string.Join("|", nonessentials) + ")\\b";
That will result in something like:
\b(left|right|chronic)\b
Then create a Regex object to do the find/replace:
System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(regexString, System.Text.RegularExpressions.RegexOptions.IgnoreCase);
Then you can just do a Replace like so:
string fixedPhrase = regex.Replace(phrase, "");
