Want to form a string with given hex code values - c#

I want to replace certain characters in an input string with other characters.
The input text has Microsoft left and right smart quotes which I would like to convert to just a single ".
I was planning on using the Replace operation, but am having trouble forming the text string to be searched for.
I would like to replace the input sequence (in hex) \xE2809C, and change that sequence to just a single ". Ditto with \xE2809D.
How do I form the string to use in the Replace operation?
I'm thinking of something like (in a loop):
tempTxt = tempTxt.Replace(charsToRemove[i], charsToSubstitute[i]);
but I'm having trouble creating the charsToRemove array.
Maybe a bigger question is whether the whole input file can be read and converted to plain ASCII using some read/write and string conversions in C#.
Thanks, Mike

Something like this?
char [] charsToRemove = {
'\u201C', // These are the Unicode code points (not the UTF representation)
'\u201D'
};
char [] charsToSubstitute = {
'"',
'"'
};
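For context, the bytes E2 80 9C and E2 80 9D from the question are the UTF-8 encodings of U+201C and U+201D, so once the text has been decoded into a C# string each smart quote is a single char. Here is a minimal sketch of that relationship and of the replacement, assuming the input really is UTF-8. The class and method names are made up for illustration.

```csharp
using System.Text;

public static class SmartQuoteSketch
{
    // Decodes raw UTF-8 bytes into a string; E2 80 9C becomes the single char U+201C.
    public static string DecodeUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Replaces left/right double smart quotes with a plain ASCII double quote.
    public static string NormalizeQuotes(string text)
    {
        return text.Replace('\u201C', '"').Replace('\u201D', '"');
    }
}
```

If the bytes show up as several odd characters instead of one smart quote, the file was probably decoded with the wrong encoding, and fixing the decoding step is likely a better cure than matching the byte sequence.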

You may want to give Regex a shot. Here's an example that will replace smart-quoted text with the single ".
string tempTxt = "I am going to “test” this. “Hope” it works";
string formattedText = Regex.Replace(tempTxt, "“|”", @"""");

I'm using a ReqPro40.dll to read data. The data is stored as text. Hope I didn't lose too much on copy/paste below. The stuff below works to the best of my knowledge. But I want to get rid of longer sequences of bad characters. E2809C should become a quote, but I'm having trouble matching it.
string tempTxt = Req.get_Tag(ReqPro40.enumTagFormat.eTagFormat_ReqNameOrReqText);
tempTxt=tempTxt.Substring(1, tempTxt.Length-1);
char[] charsToRemoveForXMLLegality = new char[]
{ '\x000a', '\x000b', '\x0002', '\x001e', // NL, VT, STX, RS
// \x escapes take hex digits, so decimal 8220 must be written as \x201C, and so on
'\x0022', '\x201C', '\x201D', // ", left double, right double quote
'\x2018', '\x2019', // left single quote, right single quote
'\x2013', '\x2014', // en-dash, em-dash
'\x00BC', '\x00B1', // 1/4 fraction, plus/minus
'\x2026', '\x00A0' // ellipsis, non-breaking space
};
string[] charsToSubstituteForXMLLegality = new string[]
{ " ", " ", "", "-",
"\"", "\"", "\"",
"\'", "\'",
"-", "-",
"1/4", "+/-",
"...", " "
};
for (int i = 0; i < charsToRemoveForXMLLegality.Length; i++)
{
tempTxt = tempTxt.Replace(charsToRemoveForXMLLegality[i].ToString(), charsToSubstituteForXMLLegality[i]);
}

Related

How can I remove the spaces that appear between the words even after splitting the string? [duplicate]

I have the following input:
string txt = " i am a string "
I want to remove the spaces from the start and the end of the string.
The result should be: "i am a string"
How can I do this in c#?
String.Trim
Removes all leading and trailing white-space characters from the current String object.
Usage:
txt = txt.Trim();
If this isn't working then it is highly likely that the "spaces" aren't spaces but some other non-printing or whitespace character, possibly tabs. In this case you need to use the String.Trim overload that takes an array of characters:
char[] charsToTrim = { ' ', '\t' };
string result = txt.Trim(charsToTrim);
Source
You can add to this list as and when you come across more space like characters that are in your input data. Storing this list of characters in your database or configuration file would also mean that you don't have to rebuild your application each time you come across a new character to check for.
NOTE
As of .NET 4 .Trim() removes any character that Char.IsWhiteSpace returns true for so it should work for most cases you come across. Given this, it's probably not a good idea to replace this call with the one that takes a list of characters you have to maintain.
It would be better to call the default .Trim() and then call the method with your list of characters.
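One caveat with chaining the two calls is that a leftover can survive when whitespace and the extra characters are interleaved at either end. The sketch below checks both conditions in a single pass instead. The list of extra characters (zero-width space and byte order mark, which as far as I know Char.IsWhiteSpace does not report as whitespace) and the names TrimSketch and TrimAll are assumptions for illustration.

```csharp
using System;

public static class TrimSketch
{
    // Hypothetical list of extra characters to strip in addition to whitespace.
    private static readonly char[] ExtraChars = { '\u200B', '\uFEFF' };

    // Trims from both ends every char that is whitespace or in ExtraChars.
    public static string TrimAll(string input)
    {
        int start = 0;
        int end = input.Length - 1;
        while (start <= end && IsTrimmable(input[start])) start++;
        while (end >= start && IsTrimmable(input[end])) end--;
        return input.Substring(start, end - start + 1);
    }

    private static bool IsTrimmable(char c)
    {
        return char.IsWhiteSpace(c) || Array.IndexOf(ExtraChars, c) >= 0;
    }
}
```

Usage would be something like txt = TrimSketch.TrimAll(txt); in place of the two chained Trim calls.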
You can use:
String.TrimStart - Removes all leading occurrences of a set of characters specified in an array from the current String object.
String.TrimEnd - Removes all trailing occurrences of a set of characters specified in an array from the current String object.
String.Trim - combination of the two functions above
Usage:
string txt = " i am a string ";
char[] charsToTrim = { ' ' };
txt = txt.Trim(charsToTrim); // txt = "i am a string"
EDIT:
txt = txt.Replace(" ", ""); // txt = "iamastring"
I really don't understand some of the hoops the other answers are jumping through.
var myString = " this is my String ";
var newstring = myString.Trim(); // results in "this is my String"
var noSpaceString = myString.Replace(" ", ""); // results in "thisismyString";
It's not rocket science.
txt = txt.Trim();
Or you can split your string into a string array, splitting by space, and then append every item of the array to an empty string.
Maybe this is not the best or fastest method, but you can try it if the other answers aren't what you want.
text.Trim() is to be used
string txt = " i am a string ";
txt = txt.Trim();
Use the Trim method.
static void Main()
{
// A.
// Example strings with multiple whitespaces.
string s1 = "He saw a cute\tdog.";
string s2 = "There\n\twas another sentence.";
// B.
// Create the Regex.
Regex r = new Regex(@"\s+");
// C.
// Strip multiple spaces.
string s3 = r.Replace(s1, @" ");
Console.WriteLine(s3);
// D.
// Strip multiple spaces.
string s4 = r.Replace(s2, @" ");
Console.WriteLine(s4);
Console.ReadLine();
}
OUTPUT:
He saw a cute dog.
There was another sentence.
You Can Use
string txt = " i am a string ";
txt = txt.TrimStart().TrimEnd();
Output is "i am a string"

C# Unable to split string by new lines \n

I'm facing an odd problem where C# is unable to split a string on new lines. I tried many combinations, such as using only Split('\n'), but every one returns the whole string unsplit in the first position of the array, so lines[0] is the same as the input string. That never happened before with other strings I had to parse.
String:
Don't remove the following keywords! These keywords are used in the
"compatible printer" condition of the print and filament profiles to
link the particular print and filament profiles to this printer
profile.\nPRINTER_VENDOR_PRUSA3D\nPRINTER_MODEL_SL1\nPRINTER_VENDOR_EPAX\nPRINTER_MODEL_X1\n\nSTART_CUSTOM_VALUES\nFLIP_XY\nLayerOffTime_0\nBottomLightOffDelay_2\nBottomLiftHeight_5\nLiftHeight_5.5\nBottomLiftSpeed_40.2\nLiftSpeed_60\nRetractSpeed_150\nBottomLightPWM_255\nLightPWM_255\nAntiAliasing_4
; Use 0 or 1 for disable AntiAliasing with "printer gamma correction"
set to 0, otherwise use multiples of 2 and "gamma correction" set to 1
for enable\nEND_CUSTOM_VALUES
Code:
var lines = previousString.Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.RemoveEmptyEntries);
Output:
An array of length 1, with lines[0] == previousString
string[] lines = theText.Split(
new[] { Environment.NewLine },
StringSplitOptions.None
);
edit:
string[] lines = theText.Split(
new[] { "\r\n", "\r", "\n" },
StringSplitOptions.None
);
working fiddle: https://dotnetfiddle.net/HNY8a6
See: this SO post
Sometimes when you see a \n on screen it really is a backslash (ASCII 92) followed by an n (ASCII 110), not an escape sequence for a newline (ASCII 10). A big hint for that here is that text boxes will usually not display newlines as escape codes but will put in actual line breaks.
To split on \n use the string "\\n", which represents a string of two characters: the two backslashes produce a single character (ASCII 92, '\') in the string, followed by a lowercase n.
Alternatively you could use @"\n". The @ sign tells C# not to process escape codes in the quoted string.
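Putting that together, here is a small sketch that splits on real line breaks and on the literal two-character sequence backslash + n in one call. It assumes the text may contain either form; the class and method names are invented for the example.

```csharp
using System;

public static class LiteralNewlineSplit
{
    // Splits on real line breaks and on the literal characters '\' + 'n'.
    public static string[] SplitLines(string text)
    {
        return text.Split(
            new[] { "\r\n", "\r", "\n", @"\n" },
            StringSplitOptions.RemoveEmptyEntries);
    }
}
```

For the printer notes in the question, LiteralNewlineSplit.SplitLines(previousString) should then yield one entry per keyword instead of a single-element array.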
I'm not quite sure why you are using the Printer methods but I hope you don't require them.
string test = "Hello \nTest \n123"; //Create Test String
string[] seperated = test.Split('\n'); //Split string by '\n'
for(int i = 0; i < seperated.Length; i++){ //Output substrings
Console.WriteLine(seperated[i]);
}
Output:
Hello
Test
123
I hope this solution works for you!
Edit: Added \r\n and \r support
If you also need to split strings by '\r' or '\r\n' then this code is the one to go with.
string test = "Hello \r\nTest \n123 \rEnd"; //Create Test String
test = test.Replace("\r\n","\n");
test = test.Replace("\r","\n");
string[] seperated = test.Split('\n'); //Split string by '\n'
for(int i = 0; i < seperated.Length; i++){ //Output substrings
Console.WriteLine(seperated[i]);
}
Output:
Hello
Test
123
End
Edit2: Hopefully Solution
So you are saying that
\nPRINTER_VENDOR_PRUSA3D\nPRINTER_MODEL_SL1\nPRINTER_VENDOR_EPAX\nPRINTER_MODEL_X1\n\nSTART_CUSTOM_VALUES\nFLIP_XY\nLayerOffTime_0\nBottomLightOffDelay_2\nBottomLiftHeight_5\nLiftHeight_5.5\nBottomLiftSpeed_40.2\nLiftSpeed_60\nRetractSpeed_150\nBottomLightPWM_255\nLightPWM_255\nAntiAliasing_4 ; Use 0 or 1 for disable AntiAliasing with "printer gamma correction" set to 0, otherwise use multiples of 2 and "gamma correction" set to 1 for enable\nEND_CUSTOM_VALUES
is the string, then the problem might be that this string contains some " characters, which will interfere with the .Split method.
If you're able to input the string manually you should replace each plain " with \"

Replace Unicode character "�" with a space

I'm doing a massive upload of information from a .csv file and I need to replace this non-ASCII character "�" with a normal space, " ".
The character "�" corresponds to "\uFFFD" in C, C++, and Java, and it seems to be called REPLACEMENT CHARACTER. There are other space-like characters, such as U+FEFF, U+205F, U+200B, U+180E, and U+202F, in the official C# documentation.
I'm trying do the replace this way:
public string Errors = "";
public void test(){
string textFromCsvCell = "";
string validCharacters = "^[0-9A-Za-z().:%-/ ]+$";
textFromCsvCell = "This is my text from csv file"; //All spaces aren't normal space " "
string cleaned = textFromCsvCell.Replace("\uFFFD", "\"");
if (Regex.IsMatch(cleaned, validCharacters ))
//All code for insert
else
Errors=cleaned;
//print Errors
}
The test method shows me this text:
"This is my�texto from csv file"
I try some solutions too:
Trying solution 1: Using Trim
Regex.Replace(value.Trim(), @"[^\S\r\n]+", " ");
Try solution 2: Using Replace
System.Text.RegularExpressions.Regex.Replace(str, @"\s+", " ");
Try solution 3: Using Trim
String.Trim(new char[]{'\uFEFF', '\u200B'});
Try solution 4: Add [\S\r\n] to validCharacters
string validCharacters = @"^[\S\r\n0-9A-Za-z().:%-/ ]+$";
Nothing works.
How can I replace it?
Sources:
Unicode Character 'REPLACEMENT CHARACTER' (U+FFFD)
Trying to replace all white space with a single space
Strip the byte order mark from string in C#
Remove extra whitespaces, but keep new lines using a regular expression in C#
EDITED
This is the original string:
"SYSTEM OF MONITORING CONTINUES OF GLUCOSE"
in 0x... notation
SYSTEM OF0xA0MONITORING CONTINUES OF GLUCOSE
Solution
Go to the Unicode code converter. Look at the conversions and do the replace.
In my case, I do a simple replace:
string value = "SYSTEM OF MONITORING CONTINUES OF GLUCOSE";
//value contains non-breaking whitespace
//value is "SYSTEM OF�MONITORING CONTINUES OF GLUCOSE"
string cleaned = "";
string pattern = @"[^\u0000-\u007F]+";
string replacement = " ";
Regex rgx = new Regex(pattern);
cleaned = rgx.Replace(value, replacement);
if (Regex.IsMatch(cleaned, "^[0-9A-Za-z().:<>%-/ ]+$")) {
//all code for insert
} else {
//Error messages
}
This expression represents all possible spaces: space, tab, page break, line break and carriage return
[ \f\n\r\t\v\u00a0\u1680\u180e\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]
References
Regular expressions (MDN)
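If you would rather target only space-like characters than every non-ASCII character, a narrower sketch could look like this. The exact character set is an assumption (non-breaking space, a few Unicode space separators, and U+FFFD from the question), and SpaceNormalizer is a made-up name; adjust both to your data.

```csharp
using System.Text.RegularExpressions;

public static class SpaceNormalizer
{
    // Space-like characters to normalize; this set is an assumption, extend as needed.
    private static readonly Regex SpaceLike =
        new Regex(@"[\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000\uFFFD]");

    // Replaces each space-like character with a plain ASCII space.
    public static string Normalize(string text)
    {
        return SpaceLike.Replace(text, " ");
    }
}
```

Unlike the [^\u0000-\u007F] pattern, this leaves accented letters and other legitimate non-ASCII text untouched.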
Using String.Replace:
Use a simple String.Replace().
I've assumed that the only characters you want to remove are the ones you've mentioned in the question: � and you want to replace them by a normal space.
string text = "imp�ortant";
string cleaned = text.Replace('\u00ef', ' ')
.Replace('\u00bf', ' ')
.Replace('\u00bd', ' ');
// Returns 'imp ortant'
Or using Regex.Replace:
string cleaned = Regex.Replace(text, "[\u00ef\u00bf\u00bd]", " ");
// Returns 'imp ortant'
Try it out: Dotnet Fiddle
Define a range of ASCII characters, and replace anything that is not within that range.
We want to find only non-ASCII characters, so we will match on those and replace them.
Regex.Replace("This is my te\uFFFDxt from csv file", @"[^\u0000-\u007F]+", " ")
The above pattern will match anything that is not (^) in the set [ ] covering the range \u0000-\u007F (the ASCII characters; everything past \u007F is non-ASCII) and replace it with a space.
Result
This is my te xt from csv file
You can adjust the range provided \u0000-\u007F as needed to expand the range of allowed characters to suit your needs.
If you just want ASCII then try the following:
var ascii = new ASCIIEncoding();
byte[] encodedBytes = ascii.GetBytes(text);
var cleaned = ascii.GetString(encodedBytes).Replace("?", " ");

Complex string split C#

I have input file like this:
input.txt
aa@aa.com bb@bb.com "Information" "Hi there"
cc@cc.com dd@dd.com "Follow up" "Interview"
I have used this method:
string[] words = item.Split(' ');
However, it splits every words with space. I also have spaces in quotes strings but I won't split those spaces.
Basically I want to parse this input from file to this output:
From = aa@aa.com
To = bb@bb.com
Subject = Information
Body = Hi there
How do I split these strings in C#?
You can simply use a Regex, as suggested in this question:
var stringValue = "aa@aa.com bb@bb.com \"Information\" \"Hi there\"";
var parts = Regex.Matches(stringValue, @"[\""].+?[\""]|[^ ]+")
.Cast<Match>()
.Select(m => m.Value)
.ToList();
//parts: aa@aa.com
//       bb@bb.com
//       "Information"
//       "Hi there"
You can also use the Replace function to remove those " characters.
The String.Split() method has an overload that allows you to specify the maximum number of substrings to return. You can get what you want like this:
Read one line at a time
Call input.Split(new[] { " " }, 3, StringSplitOptions.None) - this returns an array of strings with 3 parts. Since email addresses don't have spaces in them, the first two strings will be the from/to addresses, and the third string will be the subject and message. Assume the result of this call is stored in firstSplit[], then firstSplit[0] is the from address, firstSplit[1] is the to address, and firstSplit[2] is the subject and message combined.
Call firstSplit[2].Split(new[] { @""" """ }, 2, StringSplitOptions.None) - this searches for the string " " (quote, space, quote) in the concatenated subject+message from the previous call, which should pinpoint the separator between the end of the subject and the start of the message. This will give you the subject and message in another array. (The double quotes inside the verbatim string are doubled to escape them.) The leading quote of the subject and the trailing quote of the message are still there and need to be trimmed.
This assumes you disallow double quotes in your subject and message. If you do allow double quotes, then you need to ensure you escape them before putting them in the file in the first place.
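The steps above can be sketched as follows. This assumes every line is well formed (two addresses followed by two quoted fields separated by a single space) and does no error handling; MailLineParser and Parse are names invented for the example.

```csharp
using System;

public static class MailLineParser
{
    // Returns { from, to, subject, body } for one well-formed line of the input file.
    public static string[] Parse(string line)
    {
        // First split: at most 3 parts, so the quoted fields stay together in part 3.
        string[] firstSplit = line.Split(new[] { " " }, 3, StringSplitOptions.None);

        // Second split: on quote-space-quote, the boundary between subject and body.
        string[] rest = firstSplit[2].Split(new[] { "\" \"" }, 2, StringSplitOptions.None);

        return new[]
        {
            firstSplit[0],
            firstSplit[1],
            rest[0].TrimStart('"'), // drop the opening quote of the subject
            rest[1].TrimEnd('"')    // drop the closing quote of the body
        };
    }
}
```

Calling MailLineParser.Parse on each line returned by File.ReadAllLines("input.txt") would give the From, To, Subject and Body values in that order.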
You can do this without using regex by just using IndexOf and Substring; just put it in a loop if you have multiple emails to parse.
It's not pretty, but it would be faster than Regex if you're doing a lot of them.
string content = @"abba@aa.com dddb@bdd.com ""Information"" ""Hi there""";
string firstEmail = content.Substring(0, content.IndexOf(" ", StringComparison.Ordinal));
// Start one past the separating space so the second address has no leading space.
int secondStart = firstEmail.Length + 1;
string secondEmail = content.Substring(secondStart, content.IndexOf(" ", secondStart, StringComparison.Ordinal) - secondStart);
int firstQuote = content.IndexOf("\"", StringComparison.Ordinal);
// Everything from the first quote to the end of the line.
string subjectandMessage = content.Substring(firstQuote);
String[] words = subjectandMessage.Split(new string[] { "\" \"" }, StringSplitOptions.None);
Console.WriteLine(firstEmail);
Console.WriteLine(secondEmail);
Console.WriteLine(words[0].Remove(0,1));
Console.WriteLine(words[1].Remove(words[1].Length -1));
Output:
abba@aa.com
dddb@bdd.com
Information
Hi there
As Spencer pointed out, read the file line by line using the File.ReadAllLines() method and then apply the String.Split() method to each line, splitting on whitespace with something like this:
string[] elements = line.Split(new char[0]);
UPDATE
Not a pretty solution, but this is how I think it can work:
string[] readText = File.ReadAllLines("input.txt");
//Take value of first 3 fields by simple readText[index]; (index: 0-2)
string temp = "";
for(int i=3; i<readText.Length; i++)
{
temp += readText[i];
}
Requires reference to Microsoft.VisualBasic, but a bit more reliable than Regex:
using (var tfp = new Microsoft.VisualBasic.FileIO.TextFieldParser("input.txt")) {
for (tfp.SetDelimiters(" "); !tfp.EndOfData;) {
string[] fields = tfp.ReadFields();
Debug.Print(string.Join(",", fields)); // "aa@aa.com,bb@bb.com,Information,Hi there"
}
}

Regex to find embedded quotes in a quotes string

Original string:
11235485|56987|0|2010|05|"This is my sample
"text""|"01J400B"|""|1|"Sample "text" number two"|""sample text number
three""|""|""|
Desired string:
11235485|56987|0|2010|05|"This is my sample
""text"""|"01J400B"|""|1|"Sample ""text"" number two"|"""sample text
number three"""|""|""|
The desired string unfortunately is a requirement that is out of my control, all nested quotes MUST be qualified with quotes (I KNOW).
Try as I might I have not been able to create the desired string from the original.
A regex match/replace seems to be the way to go, I need help. Any help is appreciated.
I'd actually split the string and evaluate each piece:
public string Escape(string input)
{
string[] pieces = input.Split('|');
for (int i = 0; i < pieces.Length; i++)
{
string piece = pieces[i];
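if (piece.Length >= 2 && piece.StartsWith("\"") && piece.EndsWith("\""))
{
// Strip exactly one quote from each end; Trim('"') would also strip nested quotes at the edges.
pieces[i] = "\"" + piece.Substring(1, piece.Length - 2).Replace("\"", "\"\"") + "\"";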
}
}
return string.Join("|", pieces);
}
This is making several assumptions about the input:
Items are delimited by pipes (|)
Items are well formed and will begin and end with quotation marks
This will also break if you have |s inside of quoted strings.
You may be able to just use the normal string.Replace() method. You know that | delimits the columns, so you can replace every " with "" and then fix the column starts and ends by replacing |"" with |" and ""| with "|.
It'd look like this:
var input = YOUR_ORIGINAL_STRING;
input = input.Replace("\"", "\"\"").Replace("|\"\"", "|\"").Replace("\"\"|", "\"|");
It's not pretty, but it gets the job done.
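As a rough check of the Replace chain, here it is wrapped in a method. This is only a sketch: it assumes every quoted field is preceded and followed by a |, as in the sample data, and the name QuoteDoubler is made up.

```csharp
public static class QuoteDoubler
{
    // Doubles every quote, then restores the single opening/closing quote of each field.
    public static string Escape(string input)
    {
        return input
            .Replace("\"", "\"\"")      // " becomes "" everywhere
            .Replace("|\"\"", "|\"")    // field start: |"" back to |"
            .Replace("\"\"|", "\"|");   // field end: ""| back to "|
    }
}
```

A quoted field at the very start or end of the record, with no | next to it, would not be restored by this chain and would need separate handling.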
