Replace text which contains line breaks without dropping them - C#

We have a text that goes like this:
This is text
i want
to keep
but
Replace this sentence
because i dont like it.
Now I want to replace the sentence "Replace this sentence because i dont like it."
Of course, doing it like this
text = text.Replace(@"Replace this sentence because i dont like it.", "");
won't solve my problem, because in the text the sentence is split across lines. I can't drop the line breaks and collapse everything into one line.
My output should be
This is text
i want
to keep
but
Please keep in mind that there are a lot of variations in where the line breaks fall in the sentence I don't like.
E.g. it may go like
Replace this
sentence
because i dont like it.
or
Some text before. Replace this
sentence
because i dont like it.

You can use a regex to match any kind of whitespace. This includes regular spaces, but also carriage returns and line feeds as well as tabs and so on.
string input = @"This is text
i want
to keep
but
Replace this sentence
because i dont like it.";
string dontLike = @"Replace this sentence because i dont like it.";
string pattern = Regex.Escape(dontLike).Replace(@"\ ", @"\s+");
Console.WriteLine("Pattern:");
Console.WriteLine(pattern);
string clean = Regex.Replace(input, pattern, "");
Console.WriteLine();
Console.WriteLine("Result:");
Console.WriteLine(clean);
Console.ReadKey();
Output:
Pattern:
Replace\s+this\s+sentence\s+because\s+i\s+dont\s+like\s+it\.
Result:
This is text
i want
to keep
but
Regex.Escape escapes any character that would otherwise have a special meaning in a regex. E.g., the period "." means "any single character", so it is escaped to "\.". It also replaces each space " " with @"\ ". We in turn replace @"\ " in the search pattern with @"\s+". \s+ in a regex means "one or more whitespace characters".

Use a regex to match "any whitespace" instead of just a space in your search string. Roughly:
1. Escape the search string to be safe for regex (see "Escape Special Character in Regex").
2. Replace spaces with "\s+" (reference).
3. Run the regex matching across multiple lines (see "Multiline regular expression in C#").
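The steps above can be sketched roughly as follows. The class and method names (PhraseRemover, RemovePhrase) are made up for illustration and are not from the answer; the sketch assumes the phrase should match no matter how its words are separated by whitespace.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhraseRemover
{
    // Hypothetical helper: removes every occurrence of `phrase` from `text`,
    // allowing any run of whitespace (spaces, tabs, line breaks) between its words.
    public static string RemovePhrase(string text, string phrase)
    {
        // Split the phrase on whitespace and escape each word for use in a regex.
        string[] words = phrase.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        string[] escaped = Array.ConvertAll(words, w => Regex.Escape(w));

        // Join the escaped words with \s+ so line breaks between them still match.
        string pattern = string.Join(@"\s+", escaped);
        return Regex.Replace(text, pattern, string.Empty);
    }

    public static void Main()
    {
        string text = "Some text before. Replace this\r\nsentence\r\nbecause i dont like it.";
        Console.WriteLine(RemovePhrase(text, "Replace this sentence because i dont like it."));
    }
}
```

No RegexOptions value is needed here, because \s matches line-break characters regardless of the options. Whitespace before the removed phrase (the space after "before." in the example) is left in place.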

Or, use LINQ to accomplish this:
var text = "Drones " + Environment.NewLine + "are great to fly, " + Environment.NewLine + "yes, very fun!";
var textToReplace = "Drones are great".Split(' ').ToList();
textToReplace.ForEach(f => text = text.Replace(f, ""));
Output:
to fly,
yes, very fun!
Whatever method you choose, you are going to deal with extra line breaks, too many spaces and other formatting issues... Good luck!
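One possible way to tidy up the leftovers mentioned above, as a rough sketch only: collapse runs of spaces and tabs inside each line, trim each line, and drop lines that end up empty. The names (LineTidier, TidyLines) are hypothetical, and whether empty lines should really be dropped depends on the text.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LineTidier
{
    // Hypothetical helper: normalizes spacing line by line and removes empty lines.
    public static string TidyLines(string text)
    {
        var kept = text
            .Split(new[] { "\r\n", "\n" }, StringSplitOptions.None)      // split into lines
            .Select(line => Regex.Replace(line, @"[ \t]+", " ").Trim()) // collapse and trim spaces
            .Where(line => line.Length > 0);                            // drop lines left empty
        return string.Join(Environment.NewLine, kept);
    }

    public static void Main()
    {
        Console.WriteLine(TidyLines("This is text\ni want\nto keep\nbut\n\n"));
    }
}
```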

You can use something like this, if preserving the line breaks in the output is optional:
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
string textToReplace = @"Replace this sentence because i dont like it.";
string text = @"This is text
i want
to keep
but
Replace this sentence
because i dont like it.";
text = Regex.Replace(text, @"\s+", " ", RegexOptions.Multiline);
text = text.Replace(textToReplace, string.Empty);
Console.WriteLine(text);
}
}
Output:
"This is text i want to keep but"

Related

How can I use lookbehind in a C# Regex in order to remove line breaks?

I have a text file with a repetitive structure of a header and detail records, such as
StopService::
697::12::test::20::a@yahoo.com::20 Main Rd::Alcatraz::CA::1200::Please send me Information to
A@gmail.com::0::::
I want to remove the line break between the header and the detail record so that I can process them as a single record. Because the detail record can contain line breaks as well, I need to remove only the line breaks that directly follow the :: sign.
I'm not a pro when using regular expressions so I searched and tried to use this approach but it doesn't work:
string text = File.ReadAllText(path);
Regex.Replace(text, @"(?<=(:))(?!\1):\n", String.Empty);
File.WriteAllText(path, text);
I also tried this:
Regex.Replace(text, @"(?<=::)\n", String.Empty);
Any idea how I can use a regex look-behind in this case?
My output should look like this:
StopService::697::12::test::20::a@yahoo.com::20 Main Rd::Alcatraz::CA::1200::Please send me Information to
A@gmail.com::0::::
Non-regex Way
Read the file line by line. Check each line and, if it is equal to StopService::, do not add a newline (Environment.NewLine) after it.
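A minimal sketch of that line-by-line idea, assuming the header text is known in advance. The names (HeaderJoiner, JoinHeaders) are hypothetical, and a StringReader stands in for the file so the example is self-contained; with a real file, the same loop could run over File.ReadLines(path).

```csharp
using System;
using System.IO;
using System.Text;

public static class HeaderJoiner
{
    // Hypothetical helper: copies the text line by line and skips the
    // line break after every line that is exactly equal to `header`.
    public static string JoinHeaders(string text, string header)
    {
        var result = new StringBuilder();
        using (var reader = new StringReader(text))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                result.Append(line);
                // Note: every non-header line, including the last, gets a newline.
                if (line != header)
                {
                    result.Append(Environment.NewLine);
                }
            }
        }
        return result.ToString();
    }

    public static void Main()
    {
        string text = "StopService::\r\n697::12::test\r\nmore detail::0::::";
        Console.Write(JoinHeaders(text, "StopService::"));
    }
}
```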
Regex way
You can match the line break after the first :: using a (?<=^[^:]*::) look-behind:
var str = "StopService::\r\n697::12::test::20::a@yahoo.com::20 Main Rd::Alcatraz::CA::1200::Please send me Information to\r\nA@gmail.com::0::::";
var rgx = new Regex(@"(?<=^[^:]*::)[\r\n]+");
Console.WriteLine(rgx.Replace(str, string.Empty));
Output:
StopService::697::12::test::20::a@yahoo.com::20 Main Rd::Alcatraz::CA::1200::Please send me Information to
A@gmail.com::0::::
The look-behind ((?<=...)) matches:
^ - Start of string
[^:]* - 0 or more characters other than :
:: - 2 colons
The [\r\n]+ pattern makes sure we match all newline symbols, even if there is more than one.
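Because ^ matches only at the start of the string by default, the pattern above joins just the first header. For a file with many records, one possible variation (an assumption on my part, not part of the answer) is to add RegexOptions.Multiline so that ^ matches at the start of every line, and to exclude line breaks from the character class. The names HeaderRegex and JoinAllHeaders are made up for the sketch, which assumes a header line contains no colon other than its trailing ::.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeaderRegex
{
    // Hypothetical helper: removes the line break after every line that
    // consists of non-colon characters followed by "::" (a header line).
    public static string JoinAllHeaders(string text)
    {
        // With Multiline, ^ inside the look-behind anchors at each line start.
        return Regex.Replace(text, @"(?<=^[^:\r\n]*::)\r?\n", string.Empty, RegexOptions.Multiline);
    }

    public static void Main()
    {
        string text = "StopService::\r\n697::a\r\nb::0::::\r\nStopService::\r\n1::x::::";
        Console.WriteLine(JoinAllHeaders(text));
    }
}
```

Detail lines such as "b::0::::" are left alone, because the characters before their final :: include other colons and so cannot be matched back to the start of the line.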
Try this:
Regex.Replace(yourtext, @"(?<=::)(?:\r\n|\n|\r)", string.Empty);
You were on the right track with the lookbehind idea. But you need to match a carriage return, a line feed, or both...
Here's my quick attempt. It may need some tweaks, as I just dummied up two records for input.
The approach is to define a Regex that identifies the header, line break, and detail (which may include line breaks). Then, just run a replace that puts the header back together with the detail, throwing out the header/detail line break.
The RegexOptions.IgnorePatternWhitespace option is used to allow whitespace in the expression for better readability.
var text = "StopService::" + Environment.NewLine;
text += "697::12::test::20::a@yahoo.com::20 Main Rd::Alcatraz::CA::1200::Please send me Information to" + Environment.NewLine;
text += "A@gmail.com::0::::" + Environment.NewLine;
text += "StopService::" + Environment.NewLine;
text += "697::12::test::20::a@yahoo.com::20 Main Rd::Alcatraz::CA::1200::Please send me Information to" + Environment.NewLine;
text += "A@gmail.com::0::::" + Environment.NewLine;
var options = RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace;
// \r?\n so the pattern also matches when Environment.NewLine is just "\n"
var matchRegex = new Regex("(?<header>\\w+?::) \\r?\\n (?<detail>.+?::::)", options);
var replacement = "${header}${detail}";
var newText = matchRegex.Replace(text, replacement);
Produces:
StopService::697::12::test::20::a@yahoo.com::20 Main Rd::Alcatraz::CA::1200::Please send me Information to
A@gmail.com::0::::
StopService::697::12::test::20::a@yahoo.com::20 Main Rd::Alcatraz::CA::1200::Please send me Information to
A@gmail.com::0::::
Javascript:
yourtext.replace(/(\r\n|\n|\r)/gm," ");
I haven't tested the C# version. It should work something like the line below (.NET regex patterns do not use the /.../gm delimiters and flags).
C#:
Regex.Replace(yourtext, @"\r\n|\n|\r", " ");

Using Regex Replace instead of String Replace

I am not clued up on Regex as much as I should be, so this may seem like a silly question.
I am splitting a string into a string[] with .Split(' ').
The purpose is to check the words, or replace any.
The problem I'm having now is that for a word to be replaced, it has to be an exact match, but with the way I'm splitting it, there might be a ( or [ attached to the split word.
So far, to counter that, I'm using something like this:
formattedText.Replace(">", "> ").Replace("<", " <").Split(' ').
This works fine for now, but I want to incorporate more special chars, such as [;\\/:*?\"<>|&'].
Is there a quicker way than the method of my replacing, such as Regex? I have a feeling my route is far from the best answer.
EDIT
This is an (example) string
would be replaced to
This is an ( example ) string
If you want to replace whole words, you can do that with a regular expression like this.
string text = "This is an example (example) noexample";
string newText = Regex.Replace(text, @"\bexample\b", "!foo!");
newText will contain "This is an !foo! (!foo!) noexample"
The key here is that the \b is the word break metacharacter. So it will match at the beginning or end of a line, and the transitions between word characters (\w) and non-word characters (\W). The biggest difference between it and using \w or \W is that those won't match at the beginning or end of lines.
I think this is what you want.
If you want to pad these symbols -> ;\/:*?"<>|&' with spaces:
string input = "(exam;\\/:*?\"<>|&'ple)";
// "\\\\" in a regular C# string is the regex \\, which matches one literal backslash
Regex reg = new Regex("[;\\\\/:*?\"<>|&']");
string result = reg.Replace(input, delegate(Match m)
{
return " " + m.Value + " ";
});
if you want to replace all characters except a-zA-Z0-9_
string input = "(example)";
Regex reg = new Regex(@"\W");
string result = reg.Replace(input, delegate(Match m)
{
return " " + m.Value + " ";
});

How to skip `\r \n ` in string

I am working on a simple converter which converts text to another language.
Suppose I have two textboxes, and in the first box you enter the word Index and press the convert button.
I will replace your text with فہرست, the Urdu equivalent of Index. But I have a problem: if you enter the word Index followed by some spaces or line breaks, the text I get from that textbox in C# looks like this: Index \r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n. How can I get rid of this? I always want to get plain Index.
Thanks for any answers, and please feel free to comment if you have any questions.
Try using the Trim method if the new lines are only at the end or beginning:
input = input.Trim();
You can use Replace, if you want to remove new lines anywhere in the string:
// Replace line break with spaces
input = input.Replace("\r\n", " ");
// (Optionally) Combine consecutive spaces into one space (probably not the most efficient, but it should work)
while (input.Contains("  ")) { input = input.Replace("  ", " "); }
If you want to prevent newlines completely, most TextBox controls have a property like Multiline or similar that, when set to false, prevents entering more than one line.
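As an alternative to the Replace calls and the while loop, the same cleanup can be sketched with a single regex that collapses every run of whitespace (spaces, tabs, \r, \n) into one space and then trims the ends. The names WhitespaceNormalizer and Normalize are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceNormalizer
{
    // Hypothetical helper: collapses all whitespace runs to a single space
    // and removes leading and trailing whitespace.
    public static string Normalize(string input)
    {
        return Regex.Replace(input, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        // Brackets make the absence of trailing whitespace visible.
        Console.WriteLine("[" + Normalize("Index \r\n\r\n\r\n") + "]");
    }
}
```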
input = input.Replace(Environment.NewLine, string.Empty).Replace(" ", string.Empty);
Use Replace to remove characters from the 'inside' of the string. Trim removes characters only at the beginning and end of the string.
This should suffice to remove whitespaces as defined by Char.IsWhiteSpace (blanks, newlines etc)
string wordToTranslate = textBox1.Text.Trim();
however, if your textbox contains multiple words then you should use a different approach
string[] words = textBox1.Text.Split((char[]) null, StringSplitOptions.RemoveEmptyEntries);
foreach(string wordToTranslate in words)
ExecTranslation(wordToTranslate);
Using Split with a null char[] as the separator treats every whitespace character as a valid word separator.
Add all chars you want to ignore to the string:
var cleanChars = text.Where(c => !"\n\r".Contains(c));
string cleanText = new string(cleanChars.ToArray());
That works because string implements IEnumerable<char>.

Using Regex to match quoted string with embedded, non-escaped quotes

I am trying to match a string in the following pattern with a regex.
string text = "'Emma','The Last Leaf','Gulliver's travels'";
string pattern = @"'(.*?)',?";
foreach (Match match in Regex.Matches(text,pattern,RegexOptions.IgnoreCase))
{
Console.WriteLine(match + " " + match.Index);
Console.WriteLine(match.Groups[1].Captures[0]);
}
This matches "Emma" and "The Last leaf" correctly, however the third match is "Gulliver". But the desired match is "Gulliver's travels". How can I build a regex for a patterns like this?
Since , is your delimiter, you can try changing your pattern like this. It should work.
string pattern = @"'(.*?)'(?:,|$)";
The way this works is, it looks for a single quote followed by a comma or end of the line.
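Plugged into the loop from the question, the adjusted pattern might be used like this. The names TitleExtractor and ExtractTitles are hypothetical. The sketch assumes that no title contains a quote immediately followed by a comma, since that sequence would still be read as the end of an item.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TitleExtractor
{
    // Hypothetical helper: returns the text between the quotes of each item.
    public static List<string> ExtractTitles(string text)
    {
        var titles = new List<string>();
        // A closing quote counts only when a comma or the end of input follows it.
        foreach (Match match in Regex.Matches(text, @"'(.*?)'(?:,|$)"))
        {
            titles.Add(match.Groups[1].Value);
        }
        return titles;
    }

    public static void Main()
    {
        string text = "'Emma','The Last Leaf','Gulliver's travels'";
        foreach (string title in ExtractTitles(text))
        {
            Console.WriteLine(title);
        }
    }
}
```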
I think this can work '(.*?)',|'(.*)' as regular expression.
you may consider to use look behind /look ahead:
"(?<=^'|',').*?(?='$|',')"
test with grep:
kent$ echo "'Emma','The Last Leaf','Gulliver's travels'"|grep -Po "(?<=^'|',').*?(?='$|',')"
Emma
The Last Leaf
Gulliver's travels
You can't, if you have single-quote delimited strings and Gulliver's contains a single, unescaped quote there's no way to distinguish it from the end of a string. You could always just split it by commas and trim 's from either side but I'm not sure that's what you want:
string text = "'Emma','The Last Leaf','Gulliver's travels'";
foreach(string s in text.Split(new char[] {','})) {
Console.WriteLine(s.Trim('\''));
}

Regex - Find from both sides only if it has spaces

I need some help with regex. I need to find a word that is surrounded by some delimiter, for example *. But I need to match it only if it has spaces or nothing on either side. For example, if it is at the start of the text there can't be a space before it, and the same goes for the end.
Here is what I came up with:
string myString = "You will find *me*, and *me* also!";
string findString = @"(\*(.*?)\*)";
string foundText;
MatchCollection matchCollection = Regex.Matches(myString, findString);
foreach (Match match in matchCollection)
{
foundText = match.Value.Replace("*", "");
myString = myString.Replace(match.Value, "->" + foundText + "<-");
match.NextMatch();
}
Console.WriteLine(myString);
You will find ->me<-, and ->me<- also!
This works correctly; the problem is when I add a * in the middle of the text, which I don't want to match.
Example: You will find *m*e*, and *me* also!
Output: You will find ->m<-e->, and <-me* also!
How can I fix that?
Try the following pattern:
string findString = @"(?<=\s|^)\*(.*?)\*(?=\s|$)";
(?<=\s|^)X will match any X only if preceded by a space-char (\s), or the start-of-input, and
X(?=\s|$) matches any X if followed by a space-char (\s), or the end-of-input.
Note that it will not match *me* in foo *me*, bar since the second * has a , after it! If you want to match that too, you need to include the comma like this:
string findString = @"(?<=[\s,]|^)\*(.*?)\*(?=[\s,]|$)";
You'll need to expand the set [\s,] as you see fit, of course. You might want to add !, ? and . at the very least: [\s,!?.] (and no, . and ? do not need to be escaped inside a character-set!).
EDIT
A small demo:
string Txt = "foo *m*e*, bar";
string Pattern = @"(?<=[\s,]|^)\*(.*?)\*(?=[\s,]|$)";
Console.WriteLine(Regex.Replace(Txt, Pattern, ">$1<"));
which would print:
foo >m*e<, bar
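Applied to the strings from the question, the lookaround pattern with the [\s,] set could be used as sketched below, with a single Regex.Replace call in place of the manual loop. The names StarMarker and Mark are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StarMarker
{
    // Hypothetical helper: rewrites *word* as ->word<- when the stars are
    // bounded by whitespace, a comma, or the start/end of the input.
    public static string Mark(string text)
    {
        return Regex.Replace(text, @"(?<=[\s,]|^)\*(.*?)\*(?=[\s,]|$)", "->$1<-");
    }

    public static void Main()
    {
        Console.WriteLine(Mark("You will find *me*, and *me* also!"));
        // prints: You will find ->me<-, and ->me<- also!
        Console.WriteLine(Mark("You will find *m*e*, and *me* also!"));
        // prints: You will find ->m*e<-, and ->me<- also!
    }
}
```

In the second call the inner * stays part of the captured text, because it is followed by e rather than by whitespace or a comma.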
You can add "beginning of line or space" and "space or end of line" around your match:
(^|\s)\*(.*?)\*(\s|$)
You'll now need to refer to the middle capture group for the match string.
