How to skip `\r \n ` in string - c#

I am working on a simple converter that translates text into another language.
Suppose I have two textboxes; in the first one you enter the word Index and press the convert button.
I replace your text with فہرست, the Urdu equivalent of Index. My problem: if you type Index followed by some spaces or line breaks, the text I get from the textbox in C# looks like Index \r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n. How can I get rid of that? I always want to get just Index.
Thanks for any answers, and please feel free to comment if you have any questions.

Try using the Trim method if the new lines are only at the end or beginning:
input = input.Trim();
You can use Replace, if you want to remove new lines anywhere in the string:
// Replace line break with spaces
input = input.Replace("\r\n", " ");
// (Optionally) Combine consecutive spaces into one space (probably not the most efficient, but it should work)
while (input.Contains("  ")) { input = input.Replace("  ", " "); }
If you want to prevent newlines completely, most TextBox controls have a property like Multiline or similar that, when set to false, prevents entering more than one line.

input.Replace(Environment.NewLine, string.Empty).Replace(" ", string.Empty);
Use Replace to remove characters from the 'inside' of the string. Trim removes characters only at the beginning and end of the string.

This should suffice to remove whitespaces as defined by Char.IsWhiteSpace (blanks, newlines etc)
string wordToTranslate = textBox1.Text.Trim();
however, if your textbox contains multiple words then you should use a different approach
string[] words = textBox1.Text.Split((char[]) null, StringSplitOptions.RemoveEmptyEntries);
foreach(string wordToTranslate in words)
ExecTranslation(wordToTranslate);
Using Split with a null char[] as the separator treats every whitespace character as a valid word separator.
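The split-on-whitespace idea can also be used to normalize the whole input in one step. A minimal sketch, assuming you want the words joined by single spaces; InputCleaner and NormalizeWhitespace are hypothetical names, not part of the original answer:

```csharp
using System;

public static class InputCleaner
{
    // Hypothetical helper: splits on any whitespace (spaces, tabs, \r, \n),
    // drops the empty entries, and joins the remaining words with one space.
    public static string NormalizeWhitespace(string input)
    {
        if (string.IsNullOrEmpty(input)) return string.Empty;

        // A null separator array tells Split to use whitespace as the delimiter
        string[] words = input.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }
}
```

With this, InputCleaner.NormalizeWhitespace(textBox1.Text) would turn "Index \r\n\r\n" into "Index", and it also handles line breaks in the middle of the text.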

Add all chars you want to ignore to the string:
var cleanChars = text.Where(c => !"\n\r".Contains(c));
string cleanText = new string(cleanChars.ToArray());
That works because string implements IEnumerable<char>.

Related

Replace text which contain line breaks without dropping them

We have a text which goes like this ..
This is text
i want
to keep
but
Replace this sentence
because i dont like it.
Now I want to remove the sentence "Replace this sentence because i dont like it."
Of course, doing it like this
text = text.Replace(@"Replace this sentence because i dont like it.", "");
won't solve my problem, because the sentence is split across lines. I can't drop the line breaks and collapse everything into one line.
My output should be
This is text
i want
to keep
but
Please keep in mind that there are many variations in where the line breaks fall within the sentence I don't like.
I.e., it may look like
Replace this
sentence
because i dont like it.
or
Some text before. Replace this
sentence
because i dont like it.
You can use Regex to find any kind of whitespace. This includes regular spaces but also carriage returns and linefeeds as well as tabulators or half-spaces and so on.
string input = @"This is text
i want
to keep
but
Replace this sentence
because i dont like it.";
string dontLike = @"Replace this sentence because i dont like it.";
string pattern = Regex.Escape(dontLike).Replace(@"\ ", @"\s+");
Console.WriteLine("Pattern:");
Console.WriteLine(pattern);
string clean = Regex.Replace(input, pattern, "");
Console.WriteLine();
Console.WriteLine("Result:");
Console.WriteLine(clean);
Console.ReadKey();
Output:
Pattern:
Replace\s+this\s+sentence\s+because\s+i\s+dont\s+like\s+it\.
Result:
This is text
i want
to keep
but
Regex.Escape escapes any character that would otherwise have a special meaning in Regex. E.g., the period "." means "any single character". It also replaces the spaces " " with @"\ ". We in turn replace @"\ " in the search pattern by @"\s+". \s+ in Regex means "one or more whitespace characters".
Use regex to match "any whitespace" instead of just space in your search string. Roughly
escape search string to be safe for regex -Escape Special Character in Regex
replace spaces with "\s+" (reference)
run regex matching multiple lines - Multiline regular expression in C#
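The steps above can be sketched as a small reusable helper. This is only an illustration under the assumption that the phrase should be removed wherever its words are separated by any whitespace; PhraseRemover and RemovePhrase are hypothetical names:

```csharp
using System.Text.RegularExpressions;

public static class PhraseRemover
{
    // Hypothetical helper: removes every occurrence of 'phrase' from 'text',
    // treating each space in the phrase as "one or more whitespace characters".
    public static string RemovePhrase(string text, string phrase)
    {
        // Step 1: escape regex metacharacters (Regex.Escape also turns " " into "\ ")
        string escaped = Regex.Escape(phrase);
        // Step 2: let every escaped space match spaces, tabs, and line breaks
        string pattern = escaped.Replace(@"\ ", @"\s+");
        // Step 3: \s already matches \r and \n, so no extra regex options are needed
        return Regex.Replace(text, pattern, string.Empty);
    }
}
```

Note that the line breaks before and after the removed phrase stay in the text, so you may still want to trim the result.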
Or, use LINQ to accomplish this:
var text = "Drones " + Environment.NewLine + "are great to fly, " + Environment.NewLine + "yes, very fun!";
var textToReplace = "Drones are great".Split(" ").ToList();
textToReplace.ForEach(f => text = text.Replace(f, ""));
Output:
to fly,
yes, very fun!
Whatever method you choose, you are going to deal with extra line breaks, too many spaces and other formatting issues... Good luck!
You can use something like this, if output format of string is optional here:
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
string textToReplace = @"Replace this sentence because i dont like it.";
string text = @"This is text
i want
to keep
but
Replace this sentence
because i dont like it.";
text = Regex.Replace(text, @"\s+", " ", RegexOptions.Multiline);
text = text.Replace(textToReplace, string.Empty);
Console.WriteLine(text);
}
}
Output:
"This is text i want to keep but"

Replace Unicode character "�" with a space

I'm doing a massive upload of information from a .csv file and I need to replace this non-ASCII character "�" with a normal space, " ".
The character "�" corresponds to "\uFFFD" in C, C++, and Java, and it seems to be called REPLACEMENT CHARACTER. There are others, such as the space-like characters U+FEFF, U+205F, U+200B, U+180E, and U+202F, in the official C# documentation.
I'm trying to do the replacement this way:
public string Errors = "";
public void test(){
string textFromCsvCell = "";
string validCharacters = "^[0-9A-Za-z().:%-/ ]+$";
textFromCsvCell = "This is my text from csv file"; //All spaces aren't normal space " "
string cleaned = textFromCsvCell.Replace("\uFFFD", "\"");
if (Regex.IsMatch(cleaned, validCharacters ))
//All code for insert
else
Errors=cleaned;
//print Errors
}
The test method shows me this text:
"This is my�texto from csv file"
I tried some other solutions too:
Attempt 1: using Trim
Regex.Replace(value.Trim(), @"[^\S\r\n]+", " ");
Attempt 2: using Replace
System.Text.RegularExpressions.Regex.Replace(str, @"\s+", " ");
Attempt 3: using Trim
String.Trim(new char[]{'\uFEFF', '\u200B'});
Attempt 4: adding [\S\r\n] to validCharacters
string validCharacters = @"^[\S\r\n0-9A-Za-z().:%-/ ]+$";
Nothing works.
How can I replace it?
Sources:
Unicode Character 'REPLACEMENT CHARACTER' (U+FFFD)
Trying to replace all white space with a single space
Strip the byte order mark from string in C#
Remove extra whitespaces, but keep new lines using a regular expression in C#
EDITED
This is the original string:
"SYSTEM OF MONITORING CONTINUES OF GLUCOSE"
in 0x... notation
SYSTEM OF0xA0MONITORING CONTINUES OF GLUCOSE
Solution
Go to the Unicode code converter. Look at the conversions and do the replace.
In my case, I do a simple replace:
string value = "SYSTEM OF MONITORING CONTINUES OF GLUCOSE";
//value contains non-breaking whitespace
//value is "SYSTEM OF�MONITORING CONTINUES OF GLUCOSE"
string cleaned = "";
string pattern = @"[^\u0000-\u007F]+";
string replacement = " ";
Regex rgx = new Regex(pattern);
cleaned = rgx.Replace(value, replacement);
if (Regex.IsMatch(cleaned, "^[0-9A-Za-z().:<>%-/ ]+$")) {
//all code for insert
} else {
//Error messages
}
This expression represents all possible spaces: space, tab, page break, line break and carriage return
[ \f\n\r\t\v\u00a0\u1680\u180e\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]
References
Regular expressions (MDN)
Using String.Replace:
Use a simple String.Replace().
I've assumed that the only characters you want to remove are the ones you've mentioned in the question: � and you want to replace them by a normal space.
string text = "imp�ortant";
string cleaned = text.Replace('\u00ef', ' ')
.Replace('\u00bf', ' ')
.Replace('\u00bd', ' ');
// Returns 'imp ortant'
Or using Regex.Replace:
string cleaned = Regex.Replace(text, "[\u00ef\u00bf\u00bd]", " ");
// Returns 'imp ortant'
Try it out: Dotnet Fiddle
Define the range of ASCII characters, and replace anything that is not within that range.
We want to find only non-ASCII characters, so we match on those and replace them.
Regex.Replace("This is my te\uFFFDxt from csv file", @"[^\u0000-\u007F]+", " ");
The pattern above matches any run of characters that is not (^) in the set [ ] covering the range \u0000-\u007F (the ASCII characters; everything past \u007F is non-ASCII) and replaces it with a space.
Result
This is my te xt from csv file
You can adjust the range provided \u0000-\u007F as needed to expand the range of allowed characters to suit your needs.
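If replacing every non-ASCII character is too aggressive (it would also strip accented letters such as é), a narrower variant could target only specific characters. A minimal sketch, assuming the unwanted characters are the ones mentioned in the question plus the non-breaking space U+00A0; CsvCellCleaner and Clean are hypothetical names:

```csharp
using System.Text.RegularExpressions;

public static class CsvCellCleaner
{
    // Assumed list of characters to replace; adjust it to your data.
    // U+00A0 no-break space, U+180E, U+200B, U+202F, U+205F, U+FEFF (BOM),
    // U+FFFD replacement character.
    private static readonly Regex Unwanted =
        new Regex("[\u00A0\u180E\u200B\u202F\u205F\uFEFF\uFFFD]");

    // Replaces each unwanted character with a normal space (U+0020)
    public static string Clean(string cell)
    {
        return Unwanted.Replace(cell, " ");
    }
}
```

Keep in mind that if the file was decoded with the wrong encoding, U+FFFD means the original character is already lost; reading the file with the correct encoding is usually the better fix.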
If you just want ASCII then try the following:
var ascii = new ASCIIEncoding();
byte[] encodedBytes = ascii.GetBytes(text);
var cleaned = ascii.GetString(encodedBytes).Replace("?", " ");

Removing numbers from text using C#

I have a text file for processing, which has some numbers. I want JUST text in it, and nothing else. I managed to remove the punctuation marks, but how do I remove the numbers? I want this using C# code.
Also, I want to remove words longer than 10 characters. How do I do that using regular expressions?
You can do this with a regex:
string withNumbers = "abc123def"; // example input; use your own string with numbers
string withoutNumbers = Regex.Replace(withNumbers, "[0-9]", "");
Use this regex to match words with more than 10 characters:
\b\w{11,}\b
{11,} means "at least 11 repetitions"; leaving out the upper bound keeps the maximum length open. Note that the quantifier must not contain a space, and \b anchors the match at word boundaries.
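Both requests from the question can be combined in one pass. A minimal sketch, assuming that "longer than 10" means 11 or more word characters and that leftover gaps should be collapsed to single spaces; TextFilter and StripDigitsAndLongWords are hypothetical names:

```csharp
using System.Text.RegularExpressions;

public static class TextFilter
{
    // Hypothetical helper: removes digits, then words of 11+ characters,
    // then collapses the whitespace left behind by the removals.
    public static string StripDigitsAndLongWords(string input)
    {
        // Remove every digit first, so "abc123" becomes "abc"
        string result = Regex.Replace(input, @"\d", "");
        // Remove whole words made of 11 or more word characters
        result = Regex.Replace(result, @"\b\w{11,}\b", "");
        // Collapse runs of whitespace and trim both ends
        return Regex.Replace(result, @"\s+", " ").Trim();
    }
}
```

Because digits are removed first, a token is judged by its length after the digits are gone; swap the first two steps if you want the original length to count.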
Only letters and nothing else (because I see you also want to remove the punctuation marks)
Regex.IsMatch(input, @"^[a-zA-Z]+$");
You can also use string.Join:
string s = "asdasdad34534t3sdf43534";
s = string.Join(null, System.Text.RegularExpressions.Regex.Split(s, "[\\d]"));
The Regex.Replace method should do the trick.
// regex to match any digit
var regex = new Regex(@"\d");
// replace all matches in input with empty string
var output = regex.Replace(input, String.Empty);

Searching for a RegEx to split a text into its words

I am searching for a regular expression to split a text into its words.
I have tested
Regex.Split(text, @"\s+")
But for example, for the input
this (is a) text. and
it gives me
this
(is
a)
text
and
But I am looking for a solution that gives me only the words, without the (, ), . etc.
It should also split a text like
end.begin
in two words.
Try this:
Regex.Split(text, @"\W+")
\W is the negation of \w (which matches word characters: letters, digits, and the underscore), so \W+ matches any run of non-word characters.
You're probably better off matching the words rather than splitting.
If you use Split (with \W as Regexident suggested), then you could get an extra string at the beginning and end. For example, the input string (a b) would give you four outputs: "", "a", "b", and another "", because you're using the ( and ) as separators.
What you probably want to do is just match the words. You can do that like this:
Regex.Matches(text, "\\w+").Cast<Match>().Select(match => match.Value)
Then you'll get just the words, and no extra empty strings at the beginning and end.
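The difference between splitting and matching can be seen side by side. A minimal sketch; WordExtractor, SplitWords, and MatchWords are hypothetical names used only for this comparison:

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class WordExtractor
{
    // Splitting on runs of non-word characters: can yield empty strings
    // when the text starts or ends with a separator.
    public static string[] SplitWords(string text)
    {
        return Regex.Split(text, @"\W+");
    }

    // Matching runs of word characters: yields only the words themselves.
    public static string[] MatchWords(string text)
    {
        return Regex.Matches(text, @"\w+")
                    .Cast<Match>()
                    .Select(match => match.Value)
                    .ToArray();
    }
}
```

For the input (a b), SplitWords returns four entries (two of them empty), while MatchWords returns just a and b. The input end.begin is matched as the two words end and begin.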
You can do:
var text = "this (is a) text. and";
// to replace unwanted characters with space
text = System.Text.RegularExpressions.Regex.Replace(text, "[(),.]", " ");
// to split the text with SPACE delimiter
var splitted = text.Split(null as char[], StringSplitOptions.RemoveEmptyEntries);
foreach (var token in splitted)
{
Console.WriteLine(token);
}
See this Demo

How to remove extra returns and spaces in a string by regex?

I convert HTML code to plain text, but there are many extra returns and spaces. How can I remove them?
string new_string = Regex.Replace(orig_string, @"\s", ""); will remove all whitespace
string new_string = Regex.Replace(orig_string, @"\s+", " "); will just collapse each run of whitespace into a single space
I'm assuming that you want to
find two or more consecutive spaces and replace them with a single space, and
find two or more consecutive newlines and replace them with a single newline.
If that's correct, then you could use
resultString = Regex.Replace(subjectString, @"( |\r?\n)\1+", "$1");
This keeps the original "type" of whitespace intact and also preserves Windows line endings correctly. If you also want to "condense" multiple tabs into one, use
resultString = Regex.Replace(subjectString, @"( |\t|\r?\n)\1+", "$1");
To condense a string of newlines and spaces (any number of each) into a single newline, use
resultString = Regex.Replace(subjectString, @"(?:(?:\r?\n)+ +){2,}", "\n");
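For text converted from HTML, a two-step cleanup is often enough. A minimal sketch, assuming you want single spaces within a line and a single \n between lines (Windows line endings are normalized to \n); TextTidier and Tidy are hypothetical names:

```csharp
using System.Text.RegularExpressions;

public static class TextTidier
{
    // Hypothetical helper: collapses horizontal whitespace and blank lines.
    public static string Tidy(string text)
    {
        // Step 1: collapse runs of spaces and tabs (not line breaks) into one space
        string result = Regex.Replace(text, @"[ \t]+", " ");
        // Step 2: collapse runs of line breaks, plus any single spaces around
        // them, into exactly one newline
        result = Regex.Replace(result, @" ?(\r?\n ?)+", "\n");
        // Step 3: drop leading and trailing whitespace
        return result.Trim();
    }
}
```

Note that the replacement is the regular string "\n"; in a verbatim string, @"\n" would insert a literal backslash followed by n, because .NET replacement patterns only interpret $ substitutions.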
I tried a lot of approaches for this. The loops all worked, but this one was the clearest and most reliable.
//define what you want to remove as char
char tb = (char)9; //Tab char ascii code
char spc = (char)32; //space char ascii code
char nwln = (char)10; //New line char ascii code
//strings are immutable, so assign the result of Replace back
yourstring = yourstring.Replace(tb.ToString(), "");
yourstring = yourstring.Replace(spc.ToString(), "");
yourstring = yourstring.Replace(nwln.ToString(), "");
//by defining chars, result was better.
You can use Trim() to remove the spaces and returns at the beginning and end of the string; it does not touch whitespace in the middle. In HTML, extra spaces are not significant, so you can drop them with the Trim() method of the System.String class.
