ReadLine is changing the text read? - c#

I have a text file that I need to read and modify. This file comes from another program, so I cannot change its format. I need to use it as a template and make a bunch of replacements for specific cases. One of the lines I am reading is delimited with 0xFF characters, but when I call ReadLine, the returned string has the line delimited with 0x3F characters instead. I have tried different encodings: with ASCII the delimiter comes back as 0x3F, and with UTF-8 it comes back as the three bytes 0xEF 0xBF 0xBD. The original text file seems to be in ANSI format, and the 0xFF character shows up as a "ÿ". How can I get my ReadLine (and the subsequent WriteLine) to keep this character intact?
var replacements = new Dictionary<string, string> { {"to_replace1", "replacement1"}, {"to_replace2", "replacement2"}, {"etc etc", "more replaces"} };
using (var writer = new StreamWriter(projectfileSpecific, false, Encoding.ASCII))
{
foreach (var line in File.ReadLines(projectfileTemplate, Encoding.ASCII))
{
foreach (var replacement in replacements)
{
if (line.Contains(replacement.Key))
{
var replaceLine = line;
writer.WriteLine(replaceLine.Replace(replacement.Key, replacement.Value));
}
else
{
writer.WriteLine(line);
}
}
}
}
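One way to keep the 0xFF bytes intact, sketched here on the assumption that the template uses a single-byte ANSI-style encoding: read and write with ISO-8859-1, which maps every byte 0x00-0xFF to the code point of the same value, so no byte is replaced with '?' (0x3F). The class and method names (TemplateRewriter, Rewrite) are hypothetical. The sketch also writes each line exactly once, after applying all replacements in turn.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

static class TemplateRewriter
{
    // ISO-8859-1 (Latin-1) round-trips every byte value, including 0xFF ("ÿ").
    static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    // Hypothetical helper: copies templatePath to outputPath, applying the replacements.
    public static void Rewrite(string templatePath, string outputPath,
                               IReadOnlyDictionary<string, string> replacements)
    {
        using (var writer = new StreamWriter(outputPath, false, Latin1))
        {
            foreach (var line in File.ReadLines(templatePath, Latin1))
            {
                var result = line;
                foreach (var replacement in replacements)
                {
                    // Replacements are applied one after another, in enumeration order.
                    result = result.Replace(replacement.Key, replacement.Value);
                }
                // One output line per input line, terminated with Environment.NewLine.
                writer.WriteLine(result);
            }
        }
    }
}
```

Note that WriteLine always ends each line with Environment.NewLine, so the line terminators in the output may differ from those in the template even though the 0xFF bytes survive.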

Related

Get binary representation of ASCII symbol (C#)

Sorry for asking a question like that, but I'm really stuck.
I have this method for reading data from file:
public void ReadFromFile()
{
    string fileName = @"my .txt file path";
    List<char> encoded = new List<char>();
    List<byte> converted = new List<byte>();
    // The reader is declared once, by the using statement.
    using (StreamReader sr = new StreamReader(fileName))
    {
        string line = sr.ReadToEnd();
        string[] lines = line.Split('\n');
        foreach (var v in lines[2])
        {
            encoded.Add(v); // just get data I need
        }
    }
}
Now in encoded I have the F and @ symbols.
I want to get 01000110 (the representation of F) and 01000000 (the representation of @).
I tried to convert every item in List<char> encoded into bytes and then use Convert.ToString(value, 2).
But that's not a good idea, because it fails with the error "Value was either too large or too small for an unsigned byte."
In the output file I have something like this:
s,01;w,000;e,1;t,001; // dictionary of character and its code
6 // number of zeros
F@ // encoded string
So what I want to do is to DECODE this thing into the input string (that is, 'sweet'). For this, I need to decode F@ into 0100011001000000.
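A minimal sketch of the conversion, assuming every encoded character stands for exactly one byte (0x00-0xFF). The overflow error quoted above is what Convert.ToByte(char) throws when a character's code point is above 255, which can happen if the file was decoded as UTF-8. The names BitStrings and ToBits are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text;

static class BitStrings
{
    // Hypothetical helper: turns each character into its 8-bit binary form.
    public static string ToBits(string encoded)
    {
        // ISO-8859-1 maps code points 0-255 straight to byte values.
        // Characters above U+00FF cannot be represented and become '?'.
        byte[] bytes = Encoding.GetEncoding("iso-8859-1").GetBytes(encoded);

        // Convert.ToString(b, 2) drops leading zeros, so pad each byte to 8 digits.
        return string.Concat(bytes.Select(b => Convert.ToString(b, 2).PadLeft(8, '0')));
    }
}
```

With this sketch, BitStrings.ToBits("F@") yields 0100011001000000. If the encoded data can contain arbitrary byte values, reading the file with File.ReadAllBytes instead of a StreamReader avoids the text decoding step altogether.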

c# StreamReader, differentiate between new line, and new line with carriage return

I'm dealing with a large text file (6.5GB) and I'm reading it using a StreamReader.
The file has portions that are separated using CRLF, however it sometimes uses just a LF.
Is there an easy way to determine if the line read in was just a newline '\n', or a carriage return with line feed '\r\n'. The StreamReader.ReadLine() seems to treat them both as null or empty.
You can normalize every line ending to CRLF. ReadLine already strips the terminator (\r\n, \n, or \r), so it is enough to write each line back with an explicit \r\n:
using (var reader = new StreamReader(@"yourfile"))
using (var writer = new StreamWriter(@"outfile")) // or any other TextWriter
{
    string currentLine;
    while ((currentLine = reader.ReadLine()) != null)
    {
        // The terminator is already gone from currentLine,
        // so append the one we want on output.
        writer.Write(currentLine);
        writer.Write("\r\n");
    }
}
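ReadLine itself cannot report which terminator it consumed. If the distinction matters, one option is to read character by character and return the terminator alongside each line. The following is a sketch under that assumption, not a drop-in replacement for StreamReader; the names LineEndings and ReadLinesWithTerminators are hypothetical, and the tuple syntax needs C# 7 or later.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

static class LineEndings
{
    // Yields each line together with the exact terminator that ended it:
    // "\r\n", "\n", "\r", or "" for a final line with no terminator.
    public static IEnumerable<(string Text, string Terminator)> ReadLinesWithTerminators(TextReader reader)
    {
        var sb = new StringBuilder();
        bool pendingCr = false; // true when the previous character was '\r'
        int ch;
        while ((ch = reader.Read()) != -1)
        {
            if (pendingCr)
            {
                pendingCr = false;
                if (ch == '\n')
                {
                    yield return (sb.ToString(), "\r\n");
                    sb.Clear();
                    continue;
                }
                // A lone '\r' ended the previous line; ch belongs to the next one.
                yield return (sb.ToString(), "\r");
                sb.Clear();
            }

            if (ch == '\r') pendingCr = true;
            else if (ch == '\n')
            {
                yield return (sb.ToString(), "\n");
                sb.Clear();
            }
            else sb.Append((char)ch);
        }

        if (pendingCr) yield return (sb.ToString(), "\r");
        else if (sb.Length > 0) yield return (sb.ToString(), "");
    }
}
```

Because the reader is consumed lazily, this works on large files without loading them into memory. Concatenating Text and Terminator for every item reproduces the decoded input exactly.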

Read lines in C# without losing newline characters

At the moment, I am using C#'s inbuilt StreamReader to read lines from a file.
As is well known, if the last line is blank, the stream reader does not acknowledge this as a separate line. That is, a line must contain text, and the newline character at the end is optional for the last line.
This is having the effect on some of my files that I am losing (important, for reasons I don't want to get into) whitespace at the end of the file each time my program consumes and re-writes specific files.
Is there an implementation of TextReader available either as a part of the language or as a NuGet package which provides the ReadLine functionality, but retains the new line characters (whichever they may be) as a part of the line so that I can exactly reproduce the output? I would prefer not to have to roll my own method to consume line-based input.
Edit: it should be noted that I cannot read the whole file into memory.
You can combine ReadToEnd() with Split to get the content of your file as an array, including the empty lines.
I don't recommend using ReadToEnd() if your file is big.
For example:
string[] lines;
using (StreamReader sr = new StreamReader(path))
{
var WholeFile = sr.ReadToEnd();
lines = WholeFile.Split('\n');
}
private readonly char newLineMarker = Environment.NewLine.Last();
private readonly char[] newLine = Environment.NewLine.ToCharArray();
private readonly char eof = '\uffff';
private IEnumerable<string> EnumerateLines(string path)
{
using (var sr = new StreamReader(path))
{
char c;
string line;
var sb = new StringBuilder();
while ((c = (char)sr.Read()) != eof)
{
sb.Append(c);
if (c == newLineMarker &&
(line = sb.ToString()).EndsWith(Environment.NewLine))
{
yield return line.Trim(newLine);
sb.Clear();
sb.Append(Environment.NewLine);
}
}
if (sb.Length > 0)
yield return sb.ToString().Trim(newLine);
}
}

Remove control characters sequence from string EOT comma ETX

I have some xml files where some control sequences are included in the text: EOT,ETX(anotherchar)
The other char following EOT comma ETX is not always present and not always the same.
Actual example:
<FatturaElettronicaHeader xmlns="">
</F<EOT>‚<ETX>èatturaElettronicaHeader>
Where <EOT> is the 04 char and <ETX> is 03. As I have to parse the xml this is actually a big issue.
Is this some kind of encoding I never heard about?
I have tried to remove all the control characters from my string but it will leave the comma that is still unwanted.
If I use Encoding.ASCII.GetString(file); the unwanted characters will be replaced with a '?' that is easy to remove but it will still leave some unwanted characters causing parse issues:
<BIC></WBIC> something like this.
string xml = Encoding.ASCII.GetString(file);
xml = new string(xml.Where(cc => !char.IsControl(cc)).ToArray());
I hence need to remove all this kind of control character sequences to be able to parse this kind of files and I'm unsure about how to programmatically check if a character is part of a control sequence or not.
I have found out that there are two wrong patterns in my files: the first is the one in the title, and the second is EOT followed by <.
To make it work, I looked at this thread: Remove substring that starts with SOT and ends EOT, from string
and modified the code a little:
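private static string RemoveInvalidCharacters(string input)
{
    while (true)
    {
        var start = input.IndexOf('\u0004');
        if (start == -1) break;
        if (start + 1 < input.Length && input[start + 1] == '<')
        {
            // Pattern "EOT <": drop both characters.
            input = input.Remove(start, 2);
        }
        else if (start + 2 < input.Length && input[start + 2] == '\u0003')
        {
            // Pattern "EOT , ETX x": drop up to four characters,
            // without running past the end of the string.
            input = input.Remove(start, Math.Min(4, input.Length - start));
        }
        else
        {
            // Unrecognized sequence: drop the lone EOT so the loop
            // always makes progress instead of spinning forever.
            input = input.Remove(start, 1);
        }
    }
    return input;
}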
A further cleanup with this code:
static string StripExtended(string arg)
{
StringBuilder buffer = new StringBuilder(arg.Length); //Max length
foreach (char ch in arg)
{
UInt16 num = Convert.ToUInt16(ch);//In .NET, chars are UTF-16
//The basic characters have the same code points as ASCII, and the extended characters are bigger
if ((num >= 32u) && (num <= 126u)) buffer.Append(ch);
}
return buffer.ToString();
}
And now everything looks fine to parse.
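StripExtended keeps only printable ASCII, so it also drops legitimate accented letters such as è. A gentler filter, sketched here as an alternative rather than as what the poster used, is to keep every character that XML 1.0 allows by testing each one with XmlConvert.IsXmlChar. The names XmlSanitizer and RemoveNonXmlChars are hypothetical.

```csharp
using System.Linq;
using System.Xml;

static class XmlSanitizer
{
    // Hypothetical helper: removes characters that XML 1.0 does not allow,
    // such as EOT (U+0004) and ETX (U+0003), and keeps everything else.
    // Note: surrogate halves are also rejected by IsXmlChar(char), so text
    // outside the Basic Multilingual Plane would need extra handling.
    public static string RemoveNonXmlChars(string input)
    {
        return new string(input.Where(c => XmlConvert.IsXmlChar(c)).ToArray());
    }
}
```

This only removes the control characters themselves. The stray comma-like character and the character after ETX are valid XML characters, so they still need the pattern-based removal shown above.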
Sorry for the delay in responding,
but in my opinion the root of the problem might be an incorrect decoding of a p7m file.
I think the xml file you are trying to sanitize was originally a .xml.p7m file.
I believe the correct way to sanitize the file is to use a library such as Bouncy Castle (available for Java and .NET) and its CmsSignedData class.
CmsSignedData cmsObj = new CmsSignedData(content);
if (cmsObj.SignedContent != null)
{
using (var stream = new MemoryStream())
{
cmsObj.SignedContent.Write(stream);
content = stream.ToArray();
}
}

How to read a file into a string with CR/LF preserved?

If I asked the question "how to read a file into a string", the answer would be obvious. However, here is the catch: CR/LF must be preserved.
The problem is that File.ReadAllText strips those characters. StreamReader.ReadToEnd just converted LF into CR for me, which led to a long investigation into where I had a bug in pretty obvious code ;-)
So, in short, if I have a file containing foo\n\r\nbar, I would like to get foo\n\r\nbar (i.e. exactly the same content), not foo bar, foobar, or foo\n\n\nbar. Is there some ready-to-use way in .NET?
The outcome should always be a single string containing the entire file.
Are you sure that those methods are the culprits that are stripping out your characters?
I tried to write up a quick test; StreamReader.ReadToEnd preserves all newline characters.
string str = "foo\n\r\nbar";
using (Stream ms = new MemoryStream(Encoding.ASCII.GetBytes(str)))
using (StreamReader sr = new StreamReader(ms, Encoding.UTF8))
{
string str2 = sr.ReadToEnd();
Console.WriteLine(string.Join(",", str2.Select(c => ((int)c))));
}
// Output: 102,111,111,10,13,10,98,97,114
// f o o \n \r \n b a r
An identical result is achieved when writing to and reading from a temporary file:
string str = "foo\n\r\nbar";
string temp = Path.GetTempFileName();
File.WriteAllText(temp, str);
string str2 = File.ReadAllText(temp);
Console.WriteLine(string.Join(",", str2.Select(c => ((int)c))));
It appears that your newlines are getting lost elsewhere.
This piece of code will preserve LF and CR:
string r = File.ReadAllText(@".\TestData\TR120119.TRX", Encoding.ASCII);
The outcome should be always single string, containing entire file.
It takes two hops. First one is File.ReadAllBytes() to get all the bytes in the file. Which doesn't try to translate anything, you get the raw data in the file so the weirdo line-endings are preserved as-is.
But that's bytes, you asked for a string. So second hop is to apply Encoding.GetString() to convert the bytes to a string. The one thing you have to do is pick the right Encoding class, the one that matches the encoding used by the program that wrote the file. Given that the file is pretty messed up if it contains \n\r\n sequences, and you didn't document anything else about the file, your best bet is to use Encoding.Default. Tweak as necessary.
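The two hops can be sketched as follows. The names RawTextReader and ReadExact are hypothetical, and the caller supplies whichever encoding matches the file.

```csharp
using System.IO;
using System.Text;

static class RawTextReader
{
    // Hypothetical helper: reads the whole file as one string with no
    // line-ending translation of any kind.
    public static string ReadExact(string path, Encoding encoding)
    {
        // Hop 1: raw bytes, exactly as stored on disk.
        byte[] raw = File.ReadAllBytes(path);

        // Hop 2: decode with the caller's encoding. GetString does not
        // touch \r or \n, and it does not strip a byte order mark either.
        return encoding.GetString(raw);
    }
}
```

For example, RawTextReader.ReadExact(path, Encoding.Default) follows the suggestion above. If the file starts with a byte order mark, it shows up as a leading U+FEFF character in the result and may need to be trimmed.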
You can read the contents of a file using File.ReadAllLines, which will return an array of the lines. Then use String.Join to merge the lines together using a separator.
string[] lines = File.ReadAllLines(@"C:\Users\User\file.txt");
string allLines = String.Join("\r\n", lines);
Note that this loses the actual line terminator characters. For example, if the lines end in only \n or \r, the resulting string allLines will have replaced them with \r\n line terminators.
There are of course other ways of achieving this without losing the true EOL terminator. However, ReadAllLines is handy in that it can detect many types of text encoding by itself, and it also takes up very few lines of code.
ReadAllText doesn't return carriage returns.
This method opens a file, reads each line of the file, and then adds each line as an element of a string. It then closes the file. A line is defined as a sequence of characters followed by a carriage return ('\r'), a line feed ('\n'), or a carriage return immediately followed by a line feed. The resulting string does not contain the terminating carriage return and/or line feed.
From MSDN - https://msdn.microsoft.com/en-us/library/ms143368(v=vs.110).aspx
This is similar to the accepted answer, but I wanted to be more to the point. sr.ReadToEnd() will read the content as desired:
string myFilePath = @"C:\temp\somefile.txt";
string myEvents = String.Empty;
FileStream fs = new FileStream(myFilePath, FileMode.Open);
StreamReader sr = new StreamReader(fs);
myEvents = sr.ReadToEnd();
sr.Close();
fs.Close();
You could even also do those in cascaded using statements. But I wanted to describe how the way you write to that file in the first place will determine how to read the content from the myEvents string, and might really be where the problem lies. I wrote to my file like this:
using System.Reflection;
using System.IO;
private static void RecordEvents(string someEvent)
{
string folderLoc = Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location);
if (!folderLoc.EndsWith(@"\")) folderLoc += @"\";
folderLoc = folderLoc.Replace(@"\\", @"\"); // replace double-slashes with single slashes
string myFilePath = folderLoc + "myEventFile.txt";
if (!File.Exists(myFilePath))
File.Create(myFilePath).Close(); // must .Close() since will conflict with opening FileStream, below
FileStream fs = new FileStream(myFilePath, FileMode.Append);
StreamWriter sr = new StreamWriter(fs);
sr.Write(someEvent + Environment.NewLine);
sr.Close();
fs.Close();
}
Then I could use the code farther above to get the string of the contents. Because I was going further and looking for the individual strings, I put this code after THAT code, up there:
if (myEvents != String.Empty) // we have something
{
    // (char)0x2660 is ♠ -- I could have chosen any delimiter I did not
    // expect to find in my text
    myEvents = myEvents.Replace(Environment.NewLine, ((char)0x2660).ToString());
    string[] eventArray = myEvents.Split((char)0x2660);
    foreach (string s in eventArray)
    {
        if (!String.IsNullOrEmpty(s))
        {
            // do whatever with the individual strings from your file
        }
    }
}
And this worked fine. So I know that myEvents had to have the Environment.NewLine characters preserved, because I was able to replace them with (char)0x2660 and do a .Split() on that character to divide the string into individual segments.
