How to remove escape sequences from stream - c#

Is there a quick way to find (and remove) all escape sequences from a Stream/String?

I hope the snippet below is helpful:
string inputString = @"hello world]\ ";
StringBuilder sb = new StringBuilder();
string[] parts = inputString.Split(new char[] { ' ', '\n', '\t', '\r', '\f', '\v', '\\' }, StringSplitOptions.RemoveEmptyEntries);
int size = parts.Length;
for (int i = 0; i < size; i++)
    sb.AppendFormat("{0} ", parts[i]);

The escape sequences that you are referring to are simply text-based representations of characters that are normally either unprintable (such as newlines or tabs) or conflict with other characters used in source code files (such as the backslash "\").
Although you might see these characters represented as escaped characters in the debugger, the actual characters in the stream are not "escaped"; they are the actual characters (for example, a newline character).
If you want to remove certain characters (such as newline characters), remove them the same way you would any other character (e.g. the letter "a"):
// Removes all newline characters in a string (strings are immutable, so keep the returned value)
myString = myString.Replace("\n", "");
If you are actually processing a string that contains escaped characters (such as a source code file), you can simply replace the escaped string with its unescaped equivalent:
// Replaces the two-character string "\n" (backslash followed by n) with the newline character
myString = myString.Replace("\\n", "\n");
In the above I use the escape sequence for the backslash so that I match the string "\n" instead of the newline character.
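To make the distinction above concrete, here is a minimal sketch that treats the two cases separately. The class and method names are made up for illustration. Regex.Unescape is a real framework method, but it is designed for regex escapes, so it may not suit every kind of input.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Case 1: the string holds real control characters; remove CR and LF.
    public static string StripNewlines(string s)
    {
        return s.Replace("\r", "").Replace("\n", "");
    }

    // Case 2: the string holds textual escapes such as backslash + 'n';
    // convert them into the characters they stand for.
    public static string Unescape(string s)
    {
        return Regex.Unescape(s);
    }
}
```

For example, EscapeDemo.StripNewlines("a\r\nb") yields "ab", while EscapeDemo.Unescape(@"a\nb") yields a three-character string whose middle character is a real newline.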

If you're going for fewer lines of code:
string inputString = "\ncheese\a";
char[] escapeChars = new[]{ '\n', '\a', '\r' }; // etc
string cleanedString = new string(inputString.Where(c => !escapeChars.Contains(c)).ToArray());

You can use System.Char.IsControl() to detect control characters.
To filter control characters from a string:
public string RemoveControlCharacters(string input)
{
    return input
        .Where(character => !char.IsControl(character))
        .Aggregate(new StringBuilder(), (builder, character) => builder.Append(character))
        .ToString();
}
To filter control characters from a stream you can do something similar, however you will first need a way to convert a Stream to an IEnumerable<char>.
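public IEnumerable<char> _ReadCharacters(Stream input)
{
    // Note: disposing the reader also closes the underlying stream.
    using (var reader = new StreamReader(input))
    {
        int next;
        // Read() returns -1 at the end of the stream. Reading character by character
        // (rather than with ReadLine()) keeps line terminators in the sequence,
        // so the caller decides what to filter.
        while ((next = reader.Read()) != -1)
        {
            yield return (char)next;
        }
    }
}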
Then you can use this method to filter control characters:
public string RemoveControlCharacters(Stream input)
{
    return _ReadCharacters(input)
        .Where(character => !Char.IsControl(character))
        .Aggregate(new StringBuilder(), (builder, character) => builder.Append(character))
        .ToString();
}

An escape sequence is a string of characters that usually begins with the ESC character but can contain any character. Escape sequences are used on terminals to control cursor position, graphics mode, etc.
http://en.wikipedia.org/wiki/Escape_sequence
Here is my implementation in Python. It should be easy enough to translate to C.
#!/usr/bin/python2.6/python
import sys

Estart = "\033"           # possible escape start keys
Estop = "HfABCDsuJKmhlp"  # possible esc end keys
replace = "\015"          # ^M character
replace_with = "\n"
f_in = sys.stdin
parsed = sys.stdout
seqfile = open('sequences', 'w')  # for debug
in_seq = 0
c = f_in.read(1)
while len(c) > 0 and not c == '\0':
    # copy ordinary characters until an escape start character is seen
    while len(c) > 0 and c != '\0' and not c in Estart:
        if not c in replace:
            parsed.write(c)
        else:
            parsed.write(replace_with[replace.find(c)])
        c = f_in.read(1)
    # skip the escape sequence, logging it to the debug file
    while len(c) > 0 and c != '\0' and not c in Estop:
        seqfile.write(c)
        c = f_in.read(1)
    seqfile.write(c)  # write final character
    c = f_in.read(1)
f_in.close()
parsed.close()
seqfile.close()
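For readers working in C#, a rough equivalent for the most common terminal sequences can be sketched with a regular expression. This is not a translation of the script above. It assumes the input only contains CSI-style sequences (ESC, then '[', then optional parameters, then a single letter), and the class name is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnsiStripper
{
    // ESC '[' followed by digits, ';' or '?' parameters and one final letter.
    // Assumption: only CSI sequences need removing; other escape forms are left alone.
    private static readonly Regex Csi = new Regex(@"\x1B\[[0-9;?]*[A-Za-z]");

    public static string Strip(string input)
    {
        return Csi.Replace(input, "");
    }
}
```

With this sketch, AnsiStripper.Strip("\u001b[31mred\u001b[0m text") returns "red text".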

Related

Regex for removing only specific special characters from string

I'd like to write a regex that removes special characters on the following basis:
Remove whitespace characters
Remove @, &, ', (, ), <, > and #
I have written this regex, which removes whitespace successfully:
string username = Regex.Replace(_username, @"\s+", "");
But I'd like to change it so that it also removes the characters listed above.
Can someone help me out with this?
string username = Regex.Replace(_username, @"(\s+|@|&|'|\(|\)|<|>|#)", "");
Use a character set: [charsgohere]
string removableChars = Regex.Escape(@"@&'()<>#");
string pattern = "[" + removableChars + "]";
string username = Regex.Replace(_username, pattern, "");
I suggest using LINQ instead of regular expressions:
string source = ...
string result = string.Concat(source
    .Where(c => !char.IsWhiteSpace(c) &&
                c != '(' && c != ')' ...));
In case you have many characters to skip, you can organize them into a collection:
HashSet<char> skip = new HashSet<char>() {
    '(', ')', ...
};
...
string result = string.Concat(source
    .Where(c => !char.IsWhiteSpace(c) && !skip.Contains(c)));
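A complete version of the collection-based approach might look like the sketch below. The set of characters is taken from the question (with the first one assumed to be '@'), and the class and method names are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CharFilter
{
    // Characters to drop in addition to whitespace (assumed from the question).
    private static readonly HashSet<char> Skip =
        new HashSet<char> { '@', '&', '\'', '(', ')', '<', '>', '#' };

    public static string Clean(string source)
    {
        // Keep every character that is neither whitespace nor in the skip set.
        return string.Concat(source.Where(c => !char.IsWhiteSpace(c) && !Skip.Contains(c)));
    }
}
```

For example, CharFilter.Clean("a (b) <c>") returns "abc". A HashSet keeps the lookup cost per character roughly constant, which matters only when the skip list grows large.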
You can easily use the Replace function of the Regex class:
string a = "ash&#<>fg fd";
a = Regex.Replace(a, "[@&'(\\s)<>#]", "");
In Python:
import re
string1 = "12#34#adf$c5,6,7,ok"
output = re.sub(r'[^a-zA-Z0-9]', '', string1)
The ^ at the start of the brackets negates the set, so every character that is not a letter or digit is matched and replaced with the empty string.
result = 1234adfc567ok

c# replace each instance of a character selection in a string

I've found many references to do similar to this but none seem to be exactly what I'm after, so hoping someone could help.
In simple terms, I want to take a string entered by a user (into a Winform input), and firstly strip out any blanks, then replace any of a list of 'illegal' characters with the UK currency symbol (£). The requirement is for the input to be used but the file that is generated by the process has the modified filename.
I wrote a function (based on an extension method) but it's not working quite as expected:
public static class ExtensionMethods
{
    public static string Replace(this string s, char[] separators, string newVal)
    {
        var temp = s.Split(separators, StringSplitOptions.RemoveEmptyEntries);
        return String.Join(newVal, temp);
    }
}
public static string RemoveUnwantedChars(string enteredName, char[] unwanted, string rChar)
{
    return enteredName.Replace(unwanted, rChar);
}
Which in my code, I've called twice:
char[] blank = { ' ' };
string ename = Utilities.RemoveUnwantedChars(this.txtTableName.Text, blank, string.Empty);
char[] unwanted = { '(', ')', '.', '%', '/', '&', '+' };
string fname = Utilities.RemoveUnwantedChars(ename, unwanted, "£");
If I enter a string that contains at least one space, all of the characters above and some other letters (for example, " (GH) F16.5% M X/Y&1+1"), I get the following results:
ename = "(GH)F16.5%MX/Y&1+1" - this is correct in that it has removed the blanks.
fname = "GH£F16£5£MX£Y£1£1" - this hasn't worked correctly in that it has not replaced the first character but removed it.
The rest of the characters have been correctly replaced. It only occurs when one of the 'illegal' characters is at the start of the string - if my string was "G(H) F16.5% M X/Y&1+1", I would correctly get "G£H£F16£5£MX£Y£1£1". It also replaces multiple 'illegal' characters with one '£', so "M()GX+.1" would become "M£GX£1" but should be "M££GX££1".
I think the problem is in your Replace extension. You are splitting in this line
var temp = s.Split(separators, StringSplitOptions.RemoveEmptyEntries);
You are removing empty entries causing the unexpected result. Use this instead:
var temp = s.Split(separators, StringSplitOptions.None);
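A small sketch of why StringSplitOptions.None fixes both symptoms: the empty entries produced by a leading or repeated separator are exactly what String.Join needs in order to emit a replacement for every removed character. The class and method names here are hypothetical.

```csharp
using System;

public static class SplitJoinDemo
{
    public static string ReplaceAny(string s, char[] separators, string newVal)
    {
        // None keeps empty entries, so "(GH" splits into "" and "GH"
        // and Join places newVal before "GH".
        string[] parts = s.Split(separators, StringSplitOptions.None);
        return string.Join(newVal, parts);
    }
}
```

With the unwanted characters from the question, ReplaceAny("(GH)F16.5%MX/Y&1+1", unwanted, "£") gives "£GH£F16£5£MX£Y£1£1", and ReplaceAny("M()GX+.1", unwanted, "£") gives "M££GX££1".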
The problem occurs because string.Join() only puts separators between substrings; it will never put one at the start.
One possible solution is to avoid using string.Join() and write Replace() like this instead:
public static class ExtensionMethods
{
    public static string Replace(this string s, char[] separators, string newVal)
    {
        var sb = new StringBuilder(s);
        foreach (char ch in separators)
        {
            string target = new string(ch, 1);
            sb.Replace(target, newVal);
        }
        return sb.ToString();
    }
}
When you use split method in your Replace function you get following strings:
GH, F16, 5, MX, Y, 1, 1.
When you join them with your newVal you get:
GH + newVal + F16 + newVal + ... thus omitting first replaced character.
You would probably need some special case to check if first char is "illegal" and put newVal at start of your string.

how to deal with string.split by position

I'd like to ask a question about String.Split.
For example:
char[] semicolon = new[] { ';' };
char[] bracket = new[] { '[', ']' };
string str = "AND[Firstpart;Sndpart]";
I can split str by bracket and then split by semicolon.
Finally, I get the Firstpart and Sndpart inside the brackets.
But if str = "AND[AND[Firstpart;Sndpart];sndpart]";
how can I get AND[Firstpart;Sndpart] and sndpart?
Is there a way to tell C# to split by the second semicolon?
Thanks for your help.
One way is to hide the characters inside brackets by swapping them for characters that are not used in any of your strings.
Method HideSplit: this method replaces separator characters inside brackets with fake ones. Then it performs the split and gives back the result with the original characters restored.
This method may be overkill if you want to do this many times, but you should be able to optimize it easily once you get the idea.
private static void Main()
{
    char[] semicolon = new[] { ';' };
    char[] bracket = new[] { '[', ']' };
    string str = "AND[AND[Firstpart;Sndpart];sndpart]";
    string[] splitbyBracket = HideSplit(str, bracket);
}
private static string[] HideSplit(string str, char[] separator)
{
    int counter = 0; // When counter is more than 0 it means we are inside brackets
    StringBuilder result = new StringBuilder(); // To build up string as result
    foreach (char ch in str)
    {
        if (ch == ']') counter--;
        if (counter > 0) // if we are inside brackets perform hide
        {
            if (ch == '[') result.Append('\uFFF0'); // add '\uFFF0' instead of '['
            else if (ch == ']') result.Append('\uFFF1');
            else if (ch == ';') result.Append('\uFFF2');
            else result.Append(ch);
        }
        else result.Append(ch);
        if (ch == '[') counter++;
    }
    // Perform split (characters are hidden now). RemoveEmptyEntries drops the empty
    // string that would otherwise follow a closing bracket at the end of the input.
    string[] split = result.ToString().Split(separator, StringSplitOptions.RemoveEmptyEntries);
    return split.Select(x => x
        .Replace('\uFFF0', '[')
        .Replace('\uFFF1', ']')
        .Replace('\uFFF2', ';')).ToArray(); // unhide characters and give back result.
    // don't forget: using System.Linq;
}
Some examples :
string[] a1 = HideSplit("AND[AND[Firstpart;Sndpart];sndpart]", bracket);
// Will give you this array { AND , AND[Firstpart;Sndpart];sndpart }
string[] a2 = HideSplit("AND[Firstpart;Sndpart];sndpart", semicolon);
// Will give you this array { AND[Firstpart;Sndpart] , sndpart }
string[] a3 = HideSplit("AND[Firstpart;Sndpart]", bracket);
// Will give you this array { AND , Firstpart;Sndpart }
string[] a4 = HideSplit("Firstpart;Sndpart", semicolon);
// Will give you this array { Firstpart , Sndpart }
And you can continue splitting this way.
Is there a way to tell c# to split by second semicolon?
There is no direct way to do that, but if that is precisely what you want, it's not hard to achieve:
string str = "AND[AND[Firstpart;Sndpart];sndpart]";
string[] tSplits = str.Split(new[] { ';' }, 3); // split into at most 3 parts
string[] splits = { tSplits[0] + ";" + tSplits[1], tSplits[2] };
You could achieve the same result using a combination of IndexOf() and Substring(); however, that is most likely not what you'll end up using, as it's too specific and not very helpful for various inputs.
For your case, you need something that understands context.
In real-world complex cases you'd probably use a lexer/parser, but that seems like overkill here.
Your best bet would probably be to use a loop: walk through all characters while counting opening and closing square brackets, and split when you find a semicolon and the count is 1.
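The loop described above could be sketched as follows. The class name, method name, and the depth parameter are hypothetical, and the sketch assumes the brackets in the input are balanced.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class BracketSplitter
{
    // Splits on 'separator' only where the bracket nesting level equals 'depth'.
    public static List<string> SplitAtDepth(string input, char separator, int depth)
    {
        var parts = new List<string>();
        var current = new StringBuilder();
        int level = 0;

        foreach (char ch in input)
        {
            if (ch == '[') level++;
            else if (ch == ']') level--;

            if (ch == separator && level == depth)
            {
                // Separator at the requested depth: close the current part.
                parts.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(ch);
            }
        }

        parts.Add(current.ToString()); // whatever remains after the last split
        return parts;
    }
}
```

For the string from the question, SplitAtDepth("AND[AND[Firstpart;Sndpart];sndpart]", ';', 1) returns two parts, "AND[AND[Firstpart;Sndpart]" and "sndpart]", because the first semicolon sits at level 2 and is skipped.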
You can use Regex.Split, which is a more flexible form of String.Split:
string str = "AND[AND[Firstpart;Sndpart];sndpart]";
string[] arr = Regex.Split(str, @"(.*?;.*?;)");
foreach (var s in arr)
Console.WriteLine("'{0}'", s);
// output: ''
// 'AND[AND[Firstpart;Sndpart];'
// 'sndpart]'
Regex.Split splits not by chars but by any string matching a regular expression, so it comes down to constructing a pattern that meets your requirements. Splitting by the second semicolon is in practice splitting by a string that ends in a semicolon and contains one earlier semicolon, so the pattern by which you split the input string could be, for example: (.*?;.*?;).
The returned array has three elements instead of two because the pattern matches at the very beginning of the input string, so the empty string before the match is returned as the first element. The matched text itself appears in the result because the pattern is wrapped in a capturing group.
You can read more on Regex.Split on MSDN.

Cannot remove a set of chars in a string

I have a set of characters I want to remove from a string: "/\[]:|<>+=;,?*'@
I'm trying with:
private const string CHARS_TO_REPLACE = @"""/\[]:|<>+=;,?*'@";
private string Clean(string stringToClean)
{
    return Regex.Replace(stringToClean, "[" + Regex.Escape(CHARS_TO_REPLACE) + "]", "");
}
However, the result is strictly identical to the input with something like "Foo, bar and other".
What is wrong in my code?
This looks a lot like this question, but with a blacklist instead of a whitelist of chars, so I removed the negating ^ char.
You didn't escape the closing square bracket in CHARS_TO_REPLACE
The problem is a misunderstanding of how Regex.Escape works. From MSDN:
Escapes a minimal set of characters (\, *, +, ?, |, {, [, (,), ^, $,., #, and white space) by replacing them with their escape codes.
It works as expected, but you need to think of Regex.Escape as escaping metacharacters outside of a character class. When you use a character class, the things you want to escape inside are different. For example, inside a character class - should be escaped to be literal, otherwise it could act as a range of characters (e.g., [A-Z]).
In your case, as others have mentioned, the ] was not escaped. For any character that holds a special meaning within the character class, you will need to handle them separately after calling Regex.Escape. This should do what you need:
string CHARS_TO_REPLACE = @"""/\[]:|<>+=;,?*'@";
string pattern = "[" + Regex.Escape(CHARS_TO_REPLACE).Replace("]", @"\]") + "]";
string input = "hi\" there\\ [i love regex];@";
string result = Regex.Replace(input, pattern, "");
Console.WriteLine(result);
Otherwise, you were ending up with ["/\\\[]:\|<>\+=;,\?\*'@], which doesn't have ] escaped, so it was really ["/\\\[] as a character class, then :\|<>\+=;,\?\*'@] as the rest of the pattern, which wouldn't match unless your string matched exactly those remaining characters.
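An alternative to patching the output of Regex.Escape is to escape only the characters that are special inside a character class. The sketch below takes that route; the helper names are hypothetical, and the list of characters it escapes (backslash, both square brackets, ^ and -) is an assumption about what needs protecting inside [...].

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class CharClassCleaner
{
    // Builds a pattern of the form "[...]" in which the characters that carry
    // meaning inside a character class are preceded by a backslash.
    public static string BuildCharClass(string chars)
    {
        var sb = new StringBuilder("[");
        foreach (char c in chars)
        {
            if (c == '\\' || c == '[' || c == ']' || c == '^' || c == '-')
            {
                sb.Append('\\');
            }
            sb.Append(c);
        }
        return sb.Append(']').ToString();
    }

    public static string Clean(string input, string chars)
    {
        return Regex.Replace(input, BuildCharClass(chars), "");
    }
}
```

With the character set from the question, CharClassCleaner.Clean("Foo, bar and other", "\"/\\[]:|<>+=;,?*'@") removes the comma and returns "Foo bar and other".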
As already mentioned (but the answer has suddenly disappeared), Regex.Escape does not escape ], so you need to tweak your code:
return Regex.Replace(stringToClean, "[" + Regex.Escape(CHARS_TO_REPLACE)
    .Replace("]", @"\]") + "]", " ");
There are a number of characters within CHARS_TO_REPLACE that are special to regexes and need to be escaped with a backslash \.
This should work:
"/\[]:\|<>\+=;,\?\*'@
Why not just do:
private static string Clean(string stringToClean)
{
    string[] disallowedChars = new string[] { /* YOUR CHARS HERE */ };
    for (int i = 0; i < disallowedChars.Length; i++)
    {
        stringToClean = stringToClean.Replace(disallowedChars[i], "");
    }
    return stringToClean;
}
Single-statement LINQ solution:
private const string CHARS_TO_REPLACE = @"""/\[]:|<>+=;,?*'@";
private string Clean(string stringToClean) {
    return CHARS_TO_REPLACE
        .Aggregate(stringToClean, (str, l) => str.Replace("" + l, ""));
}
For the sake of completeness, here is a variant suited to very large strings (or even streams). No regex here, simply a loop over each char with a StringBuilder for storing the result:
class Program
{
    private const string CHARS_TO_REPLACE = @"""/\[]:|<>+=;,?*'@";
    static void Main(string[] args)
    {
        var wc = new WebClient();
        var veryLargeString = wc.DownloadString("http://msdn.microsoft.com");
        using (var sr = new StringReader(veryLargeString))
        {
            var sb = new StringBuilder();
            int readVal;
            while ((readVal = sr.Read()) != -1)
            {
                var c = (char)readVal;
                if (!CHARS_TO_REPLACE.Contains(c))
                {
                    sb.Append(c);
                }
            }
            Console.WriteLine(sb.ToString());
        }
    }
}

Read text file word-by-word using LINQ

I am learning LINQ, and I want to read a text file (let's say an e-book) word by word using LINQ.
This is what I could come up with:
static void Main()
{
    string[] content = File.ReadAllLines("text.txt");
    var query = (from c in content
                 select content);
    foreach (var line in content)
    {
        Console.Write(line + "\n");
    }
}
This reads the file line by line. If I change ReadAllLines to ReadAllText, the file is read letter by letter.
Any ideas?
string[] content = File.ReadAllLines("text.txt");
var words = content.SelectMany(line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
foreach (string word in words)
{
}
You'll need to add whatever whitespace characters you need. Using StringSplitOptions to deal with consecutive whitespace is cleaner than the Where clause I originally used.
In .NET 4 you can use File.ReadLines for lazy evaluation and thus lower RAM usage when working on large files.
string str = File.ReadAllText("text.txt");
char[] separators = { '\n', ',', '.', ' ', '"', ' ' }; // add your own
var words = str.Split(separators, StringSplitOptions.RemoveEmptyEntries);
string content = File.ReadAllText("Text.txt");
var words = from word in content.Split(WhiteSpace, StringSplitOptions.RemoveEmptyEntries)
            select word;
You will need to define the array of whitespace chars with your own values, like so:
char[] WhiteSpace = { '\r', '\n', ' ', '\t' };
This code assumes that punctuation is part of the word (like a comma).
It's probably better to read all the text using ReadAllText() and then use regular expressions to get the words. Using the space character as a delimiter can cause trouble, as the resulting words will also include punctuation (commas, dots, etc.). For example:
Regex re = new Regex("[a-zA-Z0-9_-]+", RegexOptions.Compiled); // You'll need to change the RE to fit your needs
Match m = re.Match(text);
while (m.Success)
{
    string word = m.Value; // the pattern has no capture groups, so use the whole match
    // do your processing here
    m = m.NextMatch();
}
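The same regex idea can be combined with LINQ, which is what the question was after. This is only a sketch: the class name is made up, and the pattern is the one from the answer above, so apostrophes and other punctuation split words apart.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordExtractor
{
    // Letters, digits, underscore and hyphen count as word characters here.
    private static readonly Regex Word = new Regex("[a-zA-Z0-9_-]+");

    public static List<string> Words(string text)
    {
        // Matches() returns every non-overlapping match; Cast<Match>() makes
        // the collection usable with LINQ on older framework versions as well.
        return Word.Matches(text)
                   .Cast<Match>()
                   .Select(m => m.Value)
                   .ToList();
    }
}
```

For example, WordExtractor.Words("Hello, world. It's 42!") yields "Hello", "world", "It", "s" and "42".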
The following uses iterator blocks and therefore defers loading. Other solutions load the entire file into memory before you can iterate over the words.
static IEnumerable<string> GetWords(string path)
{
    foreach (var line in File.ReadLines(path))
    {
        // A null separator array means "split on any whitespace";
        // RemoveEmptyEntries drops the empty strings produced by consecutive whitespace.
        foreach (var word in line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            yield return word;
        }
    }
}
(Passing null as the separator splits on whitespace.)
Use it like this:
foreach (var word in GetWords(@"text.txt"))
{
    Console.WriteLine(word);
}
It works with the standard LINQ operators too:
GetWords(@"text.txt").Take(25);
GetWords(@"text.txt").Where(w => w.Length > 3);
Of course, error handling etc. is left out for the sake of learning.
You could write content.ToList().ForEach(p => p.Split(' ').ToList().ForEach(Console.WriteLine)) but that's not a lot of LINQ.
