Trim not working on null characters

I have a really bizarre problem with the Trim method. I'm trying to trim a string received from a database. Here's my current code:
string debug = row["PLC_ADDR1_RESULT"].ToString();
SPCFileLog.WriteToLog(String.Format("Debug: ${0}${1}",debug,Environment.NewLine));
debug = debug.Trim();
SPCFileLog.WriteToLog(String.Format("Debug2: ${0}${1}", debug, Environment.NewLine));
debug = debug.Replace(" ", "");
SPCFileLog.WriteToLog(String.Format("Debug3: ${0}${1}", debug, Environment.NewLine));
Which produces file output as following:
Debug: $ $
Debug2: $ $
Debug3: $ $
Examining the hex codes in the file revealed something interesting. The supposedly empty spaces aren't hex 20 (space); they are hex 00 (null?).
How our database contains such data is another mystery, but regardless, I need to trim those invalid (?) null characters. How can I do this?

If you just want to remove all null characters from a string, try this:
debug = debug.Replace("\0", string.Empty);
If you only want to remove them from the ends of the string:
debug = debug.Trim('\0');
There's nothing special about null characters, but they aren't considered white space.

String.Trim() just doesn't consider the NUL character (\0) to be whitespace. Ultimately, it calls Char.IsWhiteSpace to determine whitespace, which doesn't treat NUL as such.
Frankly, I think that makes sense. Typically \0 is not whitespace.
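A minimal sketch of the two approaches, assuming a value padded with NUL characters like the one in the question. The class and method names are made up for illustration:

```csharp
using System;

public static class NulTrimDemo
{
    // Removes NUL characters and common whitespace from both ends only.
    public static string TrimNulAndWhitespace(string value)
    {
        // Trim(params char[]) strips exactly the listed characters,
        // so whitespace has to be listed alongside '\0'.
        return value.Trim('\0', ' ', '\t', '\r', '\n');
    }

    // Removes every NUL character, wherever it occurs in the string.
    public static string RemoveAllNul(string value)
    {
        // String.Replace(string, string) performs an ordinal search.
        return value.Replace("\0", string.Empty);
    }
}
```

With this, NulTrimDemo.TrimNulAndWhitespace("\0 abc \0") gives "abc", while a plain "\0\0".Trim() still has length 2.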

@Will Vousden got me on the right track...
https://stackoverflow.com/a/32624301/12157575
--but instead of trying to rewrite or remove the line, I filtered out lines before hitting the StreamReader / StreamWriter that start with the control character in the linq statement:
string ctrlChar = "\0"; // "NUL" in notepad++
// linq statement: "where"
!line.StartsWith(ctrlChar)
// could also easily do "Contains" instead of "StartsWith"
for more context:
internal class Program
{
private static void Main(string[] args)
{
// dbl space writelines
Out.NewLine = "\r\n\r\n";
WriteLine("Starting Parse Mode...");
string inputFilePath = @"C:\_logs\_input";
string outputFilePath = @"C:\_logs\_output\";
string ouputFileName = @"consolidated_logs.txt";
// chars starting lines we don't want to parse
string hashtag = "#"; // logs notes
string whtSpace = " "; // white space char
string ctrlChar = "\0"; // "NUL" in notepad++
try
{
var files =
from file in Directory.EnumerateFiles(inputFilePath, "*.log", SearchOption.TopDirectoryOnly)
from line in File.ReadLines(file)
where !line.StartsWith(hashtag) &&
!line.StartsWith(whtSpace) &&
line != null &&
!string.IsNullOrWhiteSpace(line) &&
!line.StartsWith(ctrlChar) // CTRL CHAR FILTER
select new
{
File = file,
Line = line
};
using (StreamWriter writer = new StreamWriter(outputFilePath + ouputFileName, true))
{
foreach (var f in files)
{
writer.WriteLine($"{f.File},{f.Line}");
WriteLine($"{f.File},{f.Line}"); // see console
}
WriteLine($"{files.Count()} lines found.");
ReadLine(); // keep console open
}
}
catch (UnauthorizedAccessException uAEx)
{
Console.WriteLine(uAEx.Message);
}
catch (PathTooLongException pathEx)
{
Console.WriteLine(pathEx.Message);
}
}
}
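One caveat with the filter above: StartsWith(string) is culture-sensitive by default, and on .NET 5 and later (which use ICU for globalization on most platforms) a NUL character may be treated as ignorable, so line.StartsWith("\0") can return true for every line. A sketch that sidesteps this by comparing characters directly; LineFilter and ShouldKeep are hypothetical names, not part of the program above:

```csharp
using System;

public static class LineFilter
{
    // Returns true for lines that should be written to the consolidated log.
    public static bool ShouldKeep(string line)
    {
        // Reject null, empty, and all-whitespace lines first.
        if (string.IsNullOrWhiteSpace(line)) return false;

        char first = line[0];
        // Char comparisons are always ordinal, so culture rules never apply.
        return first != '#' && first != ' ' && first != '\0';
    }
}
```

The query could then use "where LineFilter.ShouldKeep(line)". Passing StringComparison.Ordinal to StartsWith would be another way to get the same effect.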

Related

Removing Escape Characters for a string

I am having a bit of a problem with escape characters in a string that I am reading from a txt file.
They cause an error later in my program, so they need to be removed, but I can't seem to filter them out:
public static List<string> loadData(string type)
{
List<string> dataList = new List<string>();
try
{
string path = Path.Combine(Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location), "Data");
string text = File.ReadAllText(path + type);
string[] dataArray = text.Split(',');
foreach (var data in dataArray)
{
string dataUnescaped = Regex.Unescape(data);
if (!string.IsNullOrEmpty(dataUnescaped) && (!dataUnescaped.Contains(@"\r") || (!dataUnescaped.Contains(@"\n"))))
{
dataList.Add(data);
}
}
return dataList;
}
catch(Exception e)
{
Console.WriteLine(e);
return dataList;
}
}
I have tried text.Replace(@"\r\n")
and an if statement, but I just can't seem to remove them from my string.
Any ideas will be appreciated.

If you put the @ sign before a string literal, you get a verbatim string, in which backslashes are not treated as escape characters.
So if you wanted a path without @ you would need to do this:
string s = "c:\\myfolder\\myfile.txt";
But if you add the @ before "\r\n", then instead of the escape sequence for a Windows newline you get the literal four-character text \r\n (backslash, r, backslash, n).
So this only removes occurrences of that literal text, not the newlines you want to remove:
text.Replace(@"\r\n")
To fix that you would need to use:
text = text.Replace(Environment.NewLine, string.Empty);
Using Environment.NewLine instead of \r and \n works because Environment knows which OS you are currently on and supplies that platform's newline sequence.
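Note that a file can carry line endings that differ from the current platform's Environment.NewLine (for example a Unix-style file read on Windows). A small sketch that removes all three common line-break forms regardless of OS; LineBreaks and RemoveLineBreaks are illustrative names only:

```csharp
using System;

public static class LineBreaks
{
    // Removes CRLF, LF, and CR line breaks from the text.
    public static string RemoveLineBreaks(string text)
    {
        // Handle "\r\n" first so the pair is removed as one unit,
        // then catch any stray "\n" or "\r" that remain.
        return text.Replace("\r\n", string.Empty)
                   .Replace("\n", string.Empty)
                   .Replace("\r", string.Empty);
    }
}
```

For comparison, @"\r\n".Length is 4 while "\r\n".Length is 2, which is why the verbatim version never matches a real line break.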

Remove control characters sequence from string EOT comma ETX

I have some xml files where some control sequences are included in the text: EOT,ETX(anotherchar)
The other char following EOT comma ETX is not always present and not always the same.
Actual example:
<FatturaElettronicaHeader xmlns="">
</F<EOT>‚<ETX>èatturaElettronicaHeader>
Where <EOT> is the 04 char and <ETX> is 03. As I have to parse the xml this is actually a big issue.
Is this some kind of encoding I never heard about?
I have tried to remove all the control characters from my string but it will leave the comma that is still unwanted.
If I use Encoding.ASCII.GetString(file); the unwanted characters will be replaced with a '?' that is easy to remove but it will still leave some unwanted characters causing parse issues:
<BIC></WBIC> something like this.
string xml = Encoding.ASCII.GetString(file);
xml = new string(xml.Where(cc => !char.IsControl(cc)).ToArray());
I hence need to remove all this kind of control character sequences to be able to parse this kind of files and I'm unsure about how to programmatically check if a character is part of a control sequence or not.

I found out that there are two wrong patterns in my files: the first is the one in the title and the second is EOT<.
In order to make it work I looked at this thread: Remove substring that starts with SOT and ends EOT, from string
and modified the code a little
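private static string RemoveInvalidCharacters(string input)
{
var searchFrom = 0;
while (true)
{
var start = input.IndexOf('\u0004', searchFrom);
if (start == -1) break;
if (start + 1 < input.Length && input[start + 1] == '<')
{
// Pattern "EOT<": drop both characters.
input = input.Remove(start, 2);
}
else if (start + 2 < input.Length && input[start + 2] == '\u0003')
{
// Pattern "EOT , ETX x": drop up to four characters without running past the end.
input = input.Remove(start, Math.Min(4, input.Length - start));
}
else
{
// Unknown pattern: skip this EOT so the loop cannot spin forever.
searchFrom = start + 1;
}
}
return input;
}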
A further cleanup with this code:
static string StripExtended(string arg)
{
StringBuilder buffer = new StringBuilder(arg.Length); //Max length
foreach (char ch in arg)
{
UInt16 num = Convert.ToUInt16(ch);//In .NET, chars are UTF-16
//The basic characters have the same code points as ASCII, and the extended characters are bigger
if ((num >= 32u) && (num <= 126u)) buffer.Append(ch);
}
return buffer.ToString();
}
And now everything looks fine to parse.

Sorry for the delay in responding,
but in my opinion the root of the problem might be an incorrect decoding of a p7m file.
I think the xml file you are trying to sanitize was originally a .xml.p7m file.
I believe the correct way to sanitize the file is to use a library such as Bouncy Castle (available for Java and .NET) and its CmsSignedData class.
CmsSignedData cmsObj = new CmsSignedData(content);
if (cmsObj.SignedContent != null)
{
using (var stream = new MemoryStream())
{
cmsObj.SignedContent.Write(stream);
content = stream.ToArray();
}
}

How to concatenate the whole text from a file into one string, avoiding empty lines between strings

How do I get the whole text from a document concatenated into one string? I'm trying to split the text by dot: string[] words = s.Split('.'); I want to take this text from a text document. But my text document contains empty lines between strings, for example:
pat said, “i’ll keep this ring.”
she displayed the silver and jade wedding ring which, in another time track,
she and joe had picked out; this
much of the alternate world she had elected to retain. he wondered what - if any - legal basis she had kept in addition. none, he hoped; wisely, however, he said nothing. better not even to ask.
result looks like this:
1. pat said ill keep this ring
2. she displayed the silver and jade wedding ring which in another time track
3. she and joe had picked out this
4. much of the alternate world she had elected to retain
5. he wondered what if any legal basis she had kept in addition
6. none he hoped wisely however he said nothing
7. better not even to ask
but desired correct output should be like this:
1. pat said ill keep this ring
2. she displayed the silver and jade wedding ring which in another time track she and joe had picked out this much of the alternate world she had elected to retain
3. he wondered what if any legal basis she had kept in addition
4. none he hoped wisely however he said nothing
5. better not even to ask
So to do this, I first need to process the text file content to get the whole text as a single string, like this:
pat said, “i’ll keep this ring.” she displayed the silver and jade wedding ring which, in another time track, she and joe had picked out; this much of the alternate world she had elected to retain. he wondered what - if any - legal basis she had kept in addition. none, he hoped; wisely, however, he said nothing. better not even to ask.
I can't do this the same way as I would with list content, for example: string concat = String.Join(" ", text.ToArray());,
and I'm not sure how to concatenate the text from a text document into a string.

I think this is what you want:
var fileLocation = @"c:\myfile.txt";
var stringFromFile = File.ReadAllText(fileLocation);
//replace Environment.NewLine with any new line character your file uses
//(a space keeps the words on adjacent lines apart)
var withoutNewLines = stringFromFile.Replace(Environment.NewLine, " ");
//modify to remove any unwanted character
var withoutUglyCharacters = Regex.Replace(withoutNewLines, "[“’”,;-]", "");
var withoutTwoSpaces = withoutUglyCharacters.Replace("  ", " ");
var result = withoutTwoSpaces.Split('.').Where(i => i != "").Select(i => i.TrimStart()).ToList();
So first you read all text from your file, then you remove all unwanted characters and then split by . and return non empty items
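The read, clean, and split steps can be sketched as one function over a string. This is a hedged variant rather than the answer's exact code: it collapses every run of whitespace with a regex instead of replacing Environment.NewLine, and SentenceSplitter is a made-up name:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SentenceSplitter
{
    public static List<string> Split(string text)
    {
        // Turn every run of whitespace (including line breaks) into one space,
        // so words on adjacent lines stay separated.
        string singleLine = Regex.Replace(text, @"\s+", " ");
        // Drop the punctuation the question does not want in the output.
        string cleaned = Regex.Replace(singleLine, "[“’”,;-]", "");
        // Removing " - " leaves a double space behind, so collapse again.
        cleaned = Regex.Replace(cleaned, " {2,}", " ");
        // Split on periods and discard empty fragments.
        return cleaned.Split('.')
                      .Select(s => s.Trim())
                      .Where(s => s.Length > 0)
                      .ToList();
    }
}
```

It would be called as SentenceSplitter.Split(File.ReadAllText(fileLocation)).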

Have you tried replacing double new-lines before splitting using a period?
static string[] GetSentences(string filePath) {
if (!File.Exists(filePath))
throw new FileNotFoundException($"Could not find file { filePath }!");
var lines = string.Join("", File.ReadLines(filePath).Where(line => !string.IsNullOrEmpty(line) && !string.IsNullOrWhiteSpace(line)));
var sentences = Regex.Split(lines, @"\.[\s]{1,}?");
return sentences;
}
I haven't tested this, but it should work.
Explanation:
if (!File.Exists(filePath))
throw new FileNotFoundException($"Could not find file { filePath }!");
Throws an exception if the file could not be found. It is advisable to surround the method call with a try/catch.
var lines = string.Join("", File.ReadLines(filePath).Where(line => !string.IsNullOrEmpty(line) && !string.IsNullOrWhiteSpace(line)));
Creates a string, and ignores any lines which are purely whitespace or empty.
var sentences = Regex.Split(lines, @"\.[\s]{1,}?");
Creates a string array, where the string is split at every period and whitespace following the period.
E.g:
The string "I came. I saw. I conquered" would become
I came
I saw
I conquered
Update:
Here's the method as a one-liner, if that's your style?
static string[] SplitSentences(string filePath) => File.Exists(filePath) ? Regex.Split(string.Join("", File.ReadLines(filePath).Where(line => !string.IsNullOrEmpty(line) && !string.IsNullOrWhiteSpace(line))), @"\.[\s]{1,}?") : null;

I would suggest iterating through all the characters and checking whether each one is in the range 'a' <= char <= 'z' or is a space (char == ' '). If it matches, add it to the string being built; otherwise check whether it is a '.' character, and if it is, end the current line and start another one:
List<string> lines = new List<string>();
string line = string.Empty;
foreach(char c in str)
{
if((char.ToLower(c) >= 'a' && char.ToLower(c) <= 'z') || c == 0x20)
line += c;
else if(c == '.')
{
lines.Add(line.Trim());
line = string.Empty;
}
}
Or if you prefer "one-liner"s :
IEnumerable<string> lines = new string(str.Select(c => (char)(((char.ToLower(c) >= 'a' && char.ToLower(c) <= 'z') || c == 0x20) ? c : c == '.' ? '\n' : '\0')).ToArray()).Split('\n').Select(s => s.Trim());

I may be wrong about this, but I would think that you may not want to alter the string if you are only splitting it. For example, part of the string contains single and double quotes (“), and removing them may not be desired. That raises a related issue: reading a text file that contains single/double quotes (as your example text does) like below:
var stringFromFile = File.ReadAllText(fileLocation);
may not display those characters properly in a text box or the console if the file was saved in an ANSI code page, because ReadAllText assumes UTF-8 by default. In that case the single/double quotes show up as replacement characters (diamonds) in a text box on a form and as question marks (?) in the console. To keep the quotes and have them display properly, pass the OS's current ANSI encoding as a second parameter to ReadAllText, like below. (ASCIIEncoding.Default is simply the inherited Encoding.Default, which is the ANSI code page on .NET Framework but UTF-8 on .NET Core and later.)
string stringFromFile = File.ReadAllText(fileLocation, ASCIIEncoding.Default);
Below is code that uses a simple Split call to split the string on periods (.). Hope this helps.
private void button1_Click(object sender, EventArgs e) {
string fileLocation = #"C:\YourPath\YourFile.txt";
string stringFromFile = File.ReadAllText(fileLocation, ASCIIEncoding.Default);
string bigString = stringFromFile.Replace(Environment.NewLine, "");
string[] result = bigString.Split('.');
int count = 1;
foreach (string s in result) {
if (s != "") {
textBox1.Text += count + ". " + s.Trim() + Environment.NewLine;
Console.WriteLine(count + ". " + s.Trim());
count++;
}
else {
// period at the end of the string
}
}
}

How to check if char that was read from file is "/n"?

Hi, I'm trying to go through a file and count the number of lines, the maximum number of spaces per line, and the length of the longest line.
How can I detect the "/n" character if I iterate char by char through the given file?
Thanks a lot.
Here is my code that I used for this:
using (StreamReader sr = new StreamReader(p_FileName))
{
char currentChar;
int current_length=0,current_MaximumSpaces=0;
p_LongestLine=0;
p_NumOfLines=0;
p_MaximumSpaces=0;
while (!sr.EndOfStream){
currentChar=Convert.ToChar(sr.Read());
current_length++;
if(Char.IsWhiteSpace(currentChar) || currentChar==null){
current_MaximumSpaces++;
}
if(currentChar == '\n'){
p_NumOfLines++;
}
if(current_length>p_LongestLine){
p_LongestLine=current_length;
}
if(current_MaximumSpaces>p_MaximumSpaces){
p_MaximumSpaces=current_MaximumSpaces;
}
current_length=0;
current_MaximumSpaces=0;
}
sr.Close();
}

if(currentChar == '\n')
count++;

You do not need to go character by character: for your purposes, going line by line is sufficient, and as an added bonus you get .NET to deal with system-dependent line breaks for you.
int maxLen = -1, maxSpaces = -1;
foreach ( var line in File.ReadLines("c:\\data\\myfile.txt")) {
maxLen = Math.Max(maxLen, line.Length);
maxSpaces = Math.Max(maxSpaces, line.Count(c => c == ' '));
}
EDIT: Your program does not work because of an error unrelated to you checking the '\n': you are zeroing out the current_length and current_MaximumSpaces after each character, instead of clearing them only when you see a newline character.
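If the character-by-character approach is kept, the fix described in the edit above amounts to resetting the per-line counters only on '\n'. A minimal sketch, assuming only literal spaces are counted and '\r' is ignored so Windows line endings do not inflate the length; FileStats and Measure are hypothetical names:

```csharp
using System;
using System.IO;

public static class FileStats
{
    // Returns the line count, the longest line length, and the most spaces on one line.
    public static (int Lines, int LongestLine, int MaxSpaces) Measure(TextReader reader)
    {
        int lines = 0, longest = 0, maxSpaces = 0;
        int length = 0, spaces = 0;   // counters for the current line
        int next;
        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;
            if (c == '\r') continue;  // part of a "\r\n" pair; not counted
            if (c == '\n')
            {
                // End of line: record the totals, then reset the per-line counters.
                lines++;
                longest = Math.Max(longest, length);
                maxSpaces = Math.Max(maxSpaces, spaces);
                length = 0;
                spaces = 0;
                continue;
            }
            length++;
            if (c == ' ') spaces++;
        }
        // Count a final line that has no trailing newline.
        if (length > 0)
        {
            lines++;
            longest = Math.Max(longest, length);
            maxSpaces = Math.Max(maxSpaces, spaces);
        }
        return (lines, longest, maxSpaces);
    }
}
```

It could be called with a StreamReader over the file, for example FileStats.Measure(new StreamReader(p_FileName)).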

Try comparing to Environment.NewLine
bool is_newline = currentChar.ToString().Equals(Environment.NewLine);
I'm guessing that your newline is actually the \r\n (non-Unix) ending. In that case Environment.NewLine is two characters long, so a single character can never equal it; you'll need to keep track of the previous and current chars and look for \r followed by \n.

Get parameters out of text file

I have a C# asp.net page that has to get username/password info from a text file.
Could someone please tell me how.
The text file looks as follows: (it is actually a lot larger, I just got a few lines)
DATASOURCEFILE=D:\folder\folder
var1= etc
var2= more
var3 = misc
var4 = stuff
USERID = user1
PASSWORD = pwd1
all I need is the UserID and password out of that file.
Thank you for your help,
Steve

This would work:
var dic = File.ReadAllLines("test.txt")
.Select(l => l.Split(new[] { '=' }))
.ToDictionary( s => s[0].Trim(), s => s[1].Trim());
dic is a dictionary, so you easily extract your values, i.e.:
string myUser = dic["USERID"];
string myPassword = dic["PASSWORD"];
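The one-liner above throws if any line lacks an '=' (for example a blank line) or if a key appears twice. A slightly more defensive sketch of the same idea; ConfigParser is an illustrative name, and treating keys as case-insensitive is an assumption, not something the question states:

```csharp
using System;
using System.Collections.Generic;

public static class ConfigParser
{
    // Builds a key/value lookup from "KEY = value" lines.
    public static Dictionary<string, string> Parse(IEnumerable<string> lines)
    {
        var settings = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (string line in lines)
        {
            // Split on the first '=' only, so values may contain '='.
            int eq = line.IndexOf('=');
            if (eq <= 0) continue;                 // no '=' at all, or nothing before it

            string key = line.Substring(0, eq).Trim();
            string value = line.Substring(eq + 1).Trim();
            if (key.Length == 0) continue;         // key was only whitespace

            settings[key] = value;                 // a repeated key overwrites the earlier value
        }
        return settings;
    }
}
```

Usage would look like ConfigParser.Parse(File.ReadLines("test.txt")), followed by settings.TryGetValue("USERID", out var user) to avoid an exception when the key is missing.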

Open the file, split on the newline, split again on the = for each item and then add it to a dictionary.
string contents = String.Empty;
using (FileStream fs = File.Open("path", FileMode.Open))
using (StreamReader reader = new StreamReader(fs))
{
contents = reader.ReadToEnd();
}
if (contents.Length > 0)
{
string[] lines = contents.Split(new char[] { '\n' });
Dictionary<string, string> mysettings = new Dictionary<string, string>();
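foreach (string line in lines)
{
string[] keyAndValue = line.Split(new char[] { '=' }, 2); // split on the first '=' only
if (keyAndValue.Length < 2) continue; // skip blank lines and lines without '='
mysettings.Add(keyAndValue[0].Trim(), keyAndValue[1].Trim());
}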
string test = mysettings["USERID"]; // example of getting userid
}

You can use Regular expressions to extract each variable. You can read one line at a time, or the entire file into one string. If the latter, you just look for a newline in the expression.
Regards,
Morten

Dictionary is not needed.
Old-fashioned parsing can do more, with less executable code, the same amount of compiled data, and less processing:
public string MyPath1;
public string MyPath2;
...
public void ReadConfig(string sConfigFile)
{
MyPath1 = MyPath2 = ""; // Clear the external values (in case the file does not set every parameter).
using (StreamReader sr = new StreamReader(sConfigFile)) // Open the file for reading (and auto-close).
{
while (!sr.EndOfStream)
{
string sLine = sr.ReadLine().Trim(); // Read the next line. Trim leading and trailing whitespace.
// Treat lines with NO "=" as comments (ignore; no syntax checking).
// Treat lines with "=" as the first character as comments too.
// Treat lines with "=" as the 2nd character or after as parameter lines.
// Side-benefit: Values containing "=" are processed correctly.
int i = sLine.IndexOf("="); // Find the first "=" in the line.
if (i <= 0) // IF the first "=" in the line is the first character (or not present),
continue; // the line is not a parameter line. Ignore it. (Iterate the while.)
string sParameter = sLine.Remove(i).TrimEnd(); // All before the "=" is the parameter name. Trim whitespace.
string sValue = sLine.Substring(i + 1).TrimStart(); // All after the "=" is the value. Trim whitespace.
// Extra characters before a parameter name are usually intended to comment it out. Here, we keep them (with or without whitespace between). That makes an unrecognized parameter name, which is ignored (acts as a comment, as intended).
// Extra characters after a value are usually intended as comments. Here, we trim them only if whitespace separates. (Parsing contiguous comments is too complex: need delimiter(s) and then a way to escape delimiters (when needed) within values.) Side-drawback: Values cannot contain " ".
i = sValue.IndexOfAny(new char[] {' ', '\t'}); // Find the first " " or tab in the value.
if (i > 0) // IF the first " " or tab is the second character or after,
sValue = sValue.Remove(i); // All before the " " or tab is the parameter. (Discard the rest.)
// IF a desired parameter is specified, collect it:
// (Could detect here if any parameter is set more than once.)
if (sParameter == "MyPathOne")
MyPath1 = sValue;
else if (sParameter == "MyPathTwo")
MyPath2 = sValue;
// (Could detect here if an invalid parameter name is specified.)
// (Could exit the loop here if every parameter has been set.)
} // end while
// (Could detect here if the config file set neither parameter or only one parameter.)
} // end using
}
