Using a control character (\x1f) with string and/or StringBuilder - C#

I just want to use the ASCII Unit Separator character (decimal 31, hex 1F) instead of a tab for a delimited file. I assume the problem is encoding, but I can't find how to change it. In the following, I get the desired output on the console, and the first line in my StreamWriter file is correct, but the second line is missing the '\x1f'.
static StreamWriter sw = null;

static void Main(string[] args)
{
    try
    {
        sw = new StreamWriter(OutFilename, false, Encoding.UTF8);
    }
    catch (Exception ex)
    {
        Console.WriteLine("File open error: " + ex.Message);
        return;
    }

    // This works
    Output("From▼To"); // Has a '\x1f' in it

    // This does not work
    StringBuilder sb = new StringBuilder();
    sb.Append("From");
    sb.Append('\x1f');
    sb.Append("To");
    Output(sb.ToString());

    sw.Close();
}

static void Output(string s)
{
    Console.WriteLine(s);
    sw.WriteLine(s);
}
The output file has:
From▼To
FromTo
I want to build a string using StringBuilder except with the '\x1f' in the output.

Seems like there is a lot of confusion here. Let me see if I can clear things up somewhat.
First of all, let's agree on the following points that are easily verifiable:
'\x1f' == '\u001F'
'\x1f' == (char)31
'\x1f' != '▼' // <-- here appears to be your mistaken assumption.
'▼' == (char)9660
'▼' == '\u25BC'
So this...
// This works
Output("From▼To"); // Has a '\x1f' in it
... ironically is the exact line that does not work. There is no '\x1f' in this string. The triangle character is not '\x1f'. Not sure where you got that impression.
Which leads us to the last point: '\x1f' is not a visible character. So when you try to display it in the console, you will not see it, and that is 100% normal.
However, be assured that when you have a string with '\x1f' and write that out to a file, the character is still there. But you will never be able to "see" it, unless you read the bytes directly.
So whether or not you can use '\x1f' as a delimiter depends on whether you need the delimiter to be visible. If yes, then you need to pick another character. But if you only need it as a delimiter for when you programmatically parse the file, then using '\x1f' is appropriate.
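A minimal round trip (the file name here is just a throwaway temp file for illustration) shows all three points at once: the separator really is in the string, it survives the trip to disk, and it splits cleanly when parsed back:

```csharp
using System;
using System.IO;
using System.Text;

// Build a unit-separator-delimited record with StringBuilder.
var sb = new StringBuilder();
sb.Append("From").Append('\x1f').Append("To");
string record = sb.ToString();

// The separator is in the string, even though it prints as nothing.
Console.WriteLine(record.Length); // 7: "From" + 1 separator + "To"

// Write it out and inspect the raw bytes: 0x1F survives the round trip.
string path = Path.Combine(Path.GetTempPath(), "us-demo.txt");
File.WriteAllText(path, record, new UTF8Encoding(false)); // no BOM
byte[] raw = File.ReadAllBytes(path);
Console.WriteLine(Array.IndexOf(raw, (byte)0x1F)); // 4: right after "From"

// Parsing it back is a plain Split on the control character.
string[] fields = record.Split('\x1f');
Console.WriteLine(fields[1]); // To
```

Because 0x1F is a single byte in UTF-8, nothing special is needed on either the writing or the parsing side.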

Just in case you want to try your luck with such a trick, you could write exactly the bytes you expect in the following way:
Output(Encoding.UTF8.GetBytes(sb.ToString()));
if you have another Output method like this:
static void Output(string s)
{
    Console.WriteLine(s);
    sw.WriteLine(s);
}

static void Output(byte[] bytes)
{
    int dataLength = bytes.Length;
    List<byte> modified = new List<byte>();
    for (int i = 0; i < dataLength; i++)
    {
        // '▼' (U+25BC) is encoded in UTF-8 as the three bytes E2 96 BC
        if (bytes[i] == 0xE2 && (i < dataLength - 2) && bytes[i + 1] == 0x96 && bytes[i + 2] == 0xBC)
        {
            modified.Add(0x1F);
            i += 2;
        }
        else
        {
            modified.Add(bytes[i]);
        }
    }
    byte[] data = modified.ToArray();
    Console.WriteLine(Encoding.UTF8.GetString(bytes)); // Use this or the next line
    // Console.WriteLine(Encoding.UTF8.GetString(data));
    sw.BaseStream.Write(data, 0, data.Length);
    sw.WriteLine();
}


I want to read a text file line by line. I wanted to know if I'm doing it as efficiently as possible within the .NET C# scope of things.
This is what I'm trying so far:
string lineOfText;
var filestream = new System.IO.FileStream(textFilePath,
                                          System.IO.FileMode.Open,
                                          System.IO.FileAccess.Read,
                                          System.IO.FileShare.ReadWrite);
var file = new System.IO.StreamReader(filestream, System.Text.Encoding.UTF8, true, 128);
while ((lineOfText = file.ReadLine()) != null)
{
    // Do something with the lineOfText
}
To find the fastest way to read a file line by line you will have to do some benchmarking. I have done some small tests on my computer but you cannot expect that my results apply to your environment.
Using StreamReader.ReadLine
This is basically your method. For some reason you set the buffer size to the smallest possible value (128). Increasing this will in general increase performance. The default size is 1,024 and other good choices are 512 (the sector size in Windows) or 4,096 (the cluster size in NTFS). You will have to run a benchmark to determine an optimal buffer size. A bigger buffer is - if not faster - at least not slower than a smaller buffer.
const Int32 BufferSize = 128;
using (var fileStream = File.OpenRead(fileName))
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, BufferSize))
{
    String line;
    while ((line = streamReader.ReadLine()) != null)
    {
        // Process line
    }
}
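If you want to see the buffer-size effect on your own machine, a rough Stopwatch sketch along these lines works (the sample file and candidate sizes are placeholders; repeat the runs and discard the first warm-up pass before trusting the numbers):

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Text;

// Throwaway sample file so the sketch is self-contained.
string path = Path.Combine(Path.GetTempPath(), "bench-sample.txt");
File.WriteAllLines(path, Enumerable.Range(0, 50_000).Select(i => $"line {i}"));

long TimeRead(int bufferSize)
{
    var sw = Stopwatch.StartNew();
    int count = 0;
    using (var fs = File.OpenRead(path))
    using (var reader = new StreamReader(fs, Encoding.UTF8, true, bufferSize))
    {
        while (reader.ReadLine() != null)
            count++;
    }
    sw.Stop();
    Console.WriteLine($"buffer {bufferSize,5}: {count} lines in {sw.ElapsedMilliseconds} ms");
    return sw.ElapsedMilliseconds;
}

// Compare a few candidate sizes.
foreach (var size in new[] { 128, 512, 1024, 4096 })
    TimeRead(size);
```

The absolute numbers are meaningless outside your environment; only the relative ordering between sizes matters.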
The FileStream constructor allows you to specify FileOptions. For example, if you are reading a large file sequentially from beginning to end, you may benefit from FileOptions.SequentialScan. Again, benchmarking is the best thing you can do.
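As a sketch, passing the option looks like this (using the six-argument FileStream constructor; the file name is a throwaway temp file for the example):

```csharp
using System;
using System.IO;
using System.Text;

// Throwaway sample file for the demo.
string fileName = Path.Combine(Path.GetTempPath(), "seq-demo.txt");
File.WriteAllLines(fileName, new[] { "a", "b", "c" });

// Hint to the OS that the file will be read sequentially, front to back.
using (var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read,
                               FileShare.Read, 4096, FileOptions.SequentialScan))
using (var reader = new StreamReader(fs, Encoding.UTF8))
{
    string line;
    while ((line = reader.ReadLine()) != null)
        Console.WriteLine(line);
}
```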
Using File.ReadLines
This is very much like your own solution except that it is implemented using a StreamReader with a fixed buffer size of 1,024. On my computer this results in slightly better performance compared to your code with the buffer size of 128. However, you can get the same performance increase by using a larger buffer size. This method is implemented using an iterator block and does not consume memory for all lines.
var lines = File.ReadLines(fileName);
foreach (var line in lines)
{
    // Process line
}
Using File.ReadAllLines
This is very much like the previous method except that this method grows a list of strings used to create the returned array of lines so the memory requirements are higher. However, it returns String[] and not an IEnumerable<String> allowing you to randomly access the lines.
var lines = File.ReadAllLines(fileName);
for (var i = 0; i < lines.Length; i += 1)
{
    var line = lines[i];
    // Process line
}
Using String.Split
This method is considerably slower, at least on big files (tested on a 511 KB file), probably due to how String.Split is implemented. It also allocates an array for all the lines increasing the memory required compared to your solution.
using (var streamReader = File.OpenText(fileName))
{
    var lines = streamReader.ReadToEnd().Split("\r\n".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
    foreach (var line in lines)
    {
        // Process line
    }
}
My suggestion is to use File.ReadLines because it is clean and efficient. If you require special sharing options (for example you use FileShare.ReadWrite), you can use your own code but you should increase the buffer size.
If you're using .NET 4, simply use File.ReadLines which does it all for you. I suspect it's much the same as yours, except it may also use FileOptions.SequentialScan and a larger buffer (128 seems very small).
While File.ReadAllLines() is one of the simplest ways to read a file, it is also one of the slowest.
If you just want to read lines in a file without doing much, according to these benchmarks, the fastest way to read a file is the age-old method of:
using (StreamReader sr = File.OpenText(fileName))
{
    string s = String.Empty;
    while ((s = sr.ReadLine()) != null)
    {
        // do minimal amount of work here
    }
}
However, if you have to do a lot with each line, then this article concludes that the best way is the following (and it's faster to pre-allocate a string[] if you know how many lines you're going to read) :
AllLines = new string[MAX]; // only allocate memory here
using (StreamReader sr = File.OpenText(fileName))
{
    int x = 0;
    while (!sr.EndOfStream)
    {
        AllLines[x] = sr.ReadLine();
        x += 1;
    }
} // Finished. Close the file

// Now parallel process each line in the file
Parallel.For(0, AllLines.Length, x =>
{
    DoYourStuff(AllLines[x]); // do your work here
});
Use the following code:
foreach (string line in File.ReadAllLines(fileName))
This was a HUGE difference in reading performance.
It comes at the cost of memory consumption, but totally worth it!
If the file size is not big, then it is faster to read the entire file and split it afterwards:
var lines = sr.ReadToEnd().Split(new[] { Environment.NewLine },
                                 StringSplitOptions.RemoveEmptyEntries);
There's a good topic about this in Stack Overflow question Is 'yield return' slower than "old school" return?.
It says:
ReadAllLines loads all of the lines into memory and returns a
string[]. All well and good if the file is small. If the file is
larger than will fit in memory, you'll run out of memory.
ReadLines, on the other hand, uses yield return to return one line at
a time. With it, you can read any size file. It doesn't load the whole
file into memory.
Say you wanted to find the first line that contains the word "foo",
and then exit. Using ReadAllLines, you'd have to read the entire file
into memory, even if "foo" occurs on the first line. With ReadLines,
you only read one line. Which one would be faster?
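The early-exit scenario from the quote is a one-liner with File.ReadLines plus LINQ (the sample file and its contents are made up for the demo):

```csharp
using System;
using System.IO;
using System.Linq;

// Sample file invented for the demo; the match is on the first line.
string path = Path.Combine(Path.GetTempPath(), "foo-demo.txt");
File.WriteAllLines(path, new[] { "foo first", "second", "third" });

// File.ReadLines is lazy: enumeration stops as soon as a match is found,
// so only the matching prefix of the file is ever read from disk.
string firstFoo = File.ReadLines(path).FirstOrDefault(l => l.Contains("foo"));
Console.WriteLine(firstFoo); // foo first
```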
If you have enough memory, I've found some performance gains by reading the entire file into a memory stream, and then opening a stream reader on that to read the lines. As long as you actually plan on reading the whole file anyway, this can yield some improvements.
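A sketch of that approach, assuming the file comfortably fits in memory (file name and contents are placeholders):

```csharp
using System;
using System.IO;

// Placeholder file; this only pays off for files you will read
// completely anyway and that comfortably fit in memory.
string path = Path.Combine(Path.GetTempPath(), "mem-demo.txt");
File.WriteAllLines(path, new[] { "one", "two", "three" });

// One sequential disk read, then all line splitting happens in RAM.
int count = 0;
using (var memory = new MemoryStream(File.ReadAllBytes(path)))
using (var reader = new StreamReader(memory))
{
    while (reader.ReadLine() != null)
        count++;
}
Console.WriteLine(count); // 3
```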
You can't get any faster if you want to use an existing API to read the lines. But reading larger chunks and manually find each new line in the read buffer would probably be faster.
When you need to efficiently read and process a HUGE text file, ReadLines() and ReadAllLines() are likely to throw an OutOfMemoryException; this was my case. On the other hand, reading each line separately would take ages. The solution was to read the file in blocks, like below.
The class:
// can return empty lines sometimes
class LinePortionTextReader
{
    private const int BUFFER_SIZE = 100000000; // 100M characters

    StreamReader sr = null;
    string remainder = "";

    public LinePortionTextReader(string filePath)
    {
        if (File.Exists(filePath))
        {
            sr = new StreamReader(filePath);
            remainder = "";
        }
    }

    ~LinePortionTextReader()
    {
        if (null != sr) { sr.Close(); }
    }

    public string[] ReadBlock()
    {
        if (null == sr) { return new string[] { }; }
        char[] buffer = new char[BUFFER_SIZE];
        int charactersRead = sr.Read(buffer, 0, BUFFER_SIZE);
        if (charactersRead < 1) { return new string[] { }; }
        bool lastPart = (charactersRead < BUFFER_SIZE);
        if (lastPart)
        {
            char[] buffer2 = buffer.Take<char>(charactersRead).ToArray();
            buffer = buffer2;
        }
        string s = new string(buffer);
        string[] sresult = s.Split(new string[] { "\r\n" }, StringSplitOptions.None);
        sresult[0] = remainder + sresult[0];
        if (!lastPart)
        {
            remainder = sresult[sresult.Length - 1];
            sresult[sresult.Length - 1] = "";
        }
        return sresult;
    }

    public bool EOS
    {
        get
        {
            return (null == sr) ? true : sr.EndOfStream;
        }
    }
}
Example of use:
class Program
{
    static void Main(string[] args)
    {
        if (args.Length < 3)
        {
            Console.WriteLine("multifind.exe <where to search> <what to look for, one value per line> <where to put the result>");
            return;
        }
        if (!File.Exists(args[0]))
        {
            Console.WriteLine("source file not found");
            return;
        }
        if (!File.Exists(args[1]))
        {
            Console.WriteLine("reference file not found");
            return;
        }
        TextWriter tw = new StreamWriter(args[2], false);
        string[] refLines = File.ReadAllLines(args[1]);
        LinePortionTextReader lptr = new LinePortionTextReader(args[0]);
        int blockCounter = 0;
        while (!lptr.EOS)
        {
            string[] srcLines = lptr.ReadBlock();
            for (int i = 0; i < srcLines.Length; i += 1)
            {
                string theLine = srcLines[i];
                if (!string.IsNullOrEmpty(theLine)) // can return empty lines sometimes
                {
                    for (int j = 0; j < refLines.Length; j += 1)
                    {
                        if (theLine.Contains(refLines[j]))
                        {
                            tw.WriteLine(theLine);
                            break;
                        }
                    }
                }
            }
            blockCounter += 1;
            Console.WriteLine(String.Format("100M-character blocks processed: {0}", blockCounter));
        }
        tw.Close();
    }
}
I believe the string splitting and array handling can be significantly improved, yet the goal here was to minimize the number of disk reads.

In C#, How can I copy a file with arbitrary encoding, reading line by line, without adding or deleting a newline

I need to be able to take a text file with unknown encoding (e.g., UTF-8, UTF-16, ...) and copy it line by line, making specific changes as I go. In this example, I am changing the encoding, however there are other uses for this kind of processing.
What I can't figure out is how to determine if the last line has a newline! Some programs care about the difference between a file with these records:
Rec1<newline>
Rec2<newline>
And a file with these:
Rec1<newline>
Rec2
How can I tell the difference in my code so that I can take appropriate action?
using (StreamReader reader = new StreamReader(sourcePath))
using (StreamWriter writer = new StreamWriter(destinationPath, false, outputEncoding))
{
    bool isFirstLine = true;
    while (!reader.EndOfStream)
    {
        string line = reader.ReadLine();
        if (isFirstLine)
        {
            writer.Write(line);
            isFirstLine = false;
        }
        else
        {
            writer.Write("\r\n" + line);
        }
    }
    //if (LastLineHasNewline)
    //{
    //    writer.Write("\n");
    //}
    writer.Flush();
}
The commented-out code is what I want to be able to do, but I can't figure out how to set the condition LastLineHasNewline! Remember, I have no a priori knowledge of the input file encoding.
Remember, I have no a priori knowledge of the input file encoding.
That's the fundamental problem to solve.
If the file could be using any encoding, then there is no concept of reading "line by line" as you can't possibly tell what the line ending is.
I suggest you first address this part, and the rest will be easy. Now, without knowing the context it's hard to say whether that means you should be asking the user for the encoding, or detecting it heuristically, or something else - but I wouldn't start trying to use the data before you can fully understand it.
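One piece of the detection puzzle that the framework gives you for free is byte-order-mark sniffing: StreamReader can recognize a BOM and switch its CurrentEncoding accordingly, though files without a BOM simply fall back to whatever encoding you passed in. A small sketch (temp file created just for illustration):

```csharp
using System;
using System.IO;
using System.Text;

// Write a sample file as UTF-16 LE; Encoding.Unicode emits a BOM.
string path = Path.Combine(Path.GetTempPath(), "bom-demo.txt");
File.WriteAllText(path, "Rec1\r\nRec2", Encoding.Unicode);

// We "guess" UTF-8, but the BOM overrides the guess once the reader
// has actually looked at the stream.
string detected;
using (var reader = new StreamReader(path, Encoding.UTF8,
                                     detectEncodingFromByteOrderMarks: true))
{
    reader.Peek(); // force the reader to examine the BOM
    detected = reader.CurrentEncoding.WebName;
}
Console.WriteLine(detected); // utf-16
```

This only settles the encoding question for BOM-carrying files; anything else still needs user input or a heuristic.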
As often happens, the moment you go to ask for help, the answer comes to the surface. The commented out code becomes:
if (LastLineHasNewline(reader))
{
    writer.Write("\n");
}
And the function looks like this:
private static bool LastLineHasNewline(StreamReader reader)
{
    byte[] newlineBytes = reader.CurrentEncoding.GetBytes("\n");
    int newlineByteCount = newlineBytes.Length;
    reader.BaseStream.Seek(-newlineByteCount, SeekOrigin.End);
    byte[] inputBytes = new byte[newlineByteCount];
    reader.BaseStream.Read(inputBytes, 0, newlineByteCount);
    for (int i = 0; i < newlineByteCount; i++)
    {
        if (newlineBytes[i] != inputBytes[i])
            return false;
    }
    return true;
}

How Can I Handle This Xml Parsing Error?

Consider the following C# code:
using System.Xml.Linq;

namespace TestXmlParse
{
    class Program
    {
        static void Main(string[] args)
        {
            var testxml =
                @"<base>
                  <elem1 number='1'>
                    <elem2>yyy</elem2>
                    <elem3>xxx <yyy zzz aaa</elem3>
                  </elem1>
                </base>";
            XDocument.Parse(testxml);
        }
    }
}
I get a System.Xml.XmlException on the parse, of course, complaining about elem3. The error message is this:
System.Xml.XmlException was unhandled
Message='aaa' is an unexpected token. The expected token is '='. Line 4, position 59.
Source=System.Xml
LineNumber=4
LinePosition=59
Obviously this is not the real XML (we get the XML from a third party), and while the best answer would be for the third party to clean up their XML before sending it to us, is there any other way I might fix this XML before handing it off to the parser? I've devised a hacky way to fix it: catch the exception and use it to tell me where I need to look for characters which should be escaped. I was hoping for something a bit more elegant and comprehensive.
Any suggestions are welcome.
If this is a dupe, please point me to the other questions; I'll close this myself. I am more interested in an answer than any karma gain.
EDIT:
I guess I didn't make my question as clear as I had hoped. I know the '<' in elem3 is incorrect; I'm trying to find an elegant way to detect (and correct) any badly formed XML of that sort before I attempt the parse. As I said, I get this XML from a third party and I can't control what they give me.
I would recommend that you do not manipulate the data you receive. If it is invalid it's your client's problem.
Editing the input so that it is valid XML can cause serious problems; e.g., instead of getting an error you may end up processing wrong data (your best-effort repair may silently change the meaning of the document).
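For comparison, if you control the point where raw text gets embedded into XML, escaping it up front avoids the whole problem; SecurityElement.Escape (in System.Security) handles the XML special characters. This is an aside rather than a fix for third-party input, and the sample string is made up:

```csharp
using System;
using System.Security;

// Hypothetical raw text that is about to be placed inside an element.
string raw = "xxx <yyy zzz aaa";
string safe = SecurityElement.Escape(raw);
Console.WriteLine(safe); // xxx &lt;yyy zzz aaa
```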
[EDIT]
I still think it's not a good idea, but sometimes you have to do what you have to do.
Here is a very simple class that parses the input and escapes the invalid opening tag. You could do this with a regex (which I am not good at), and this solution is not complete; depending on your requirements (or let's say the bad XML you get) you will have to adapt it (e.g. scan for complete XML elements instead of only the '<' and '>' brackets, put CDATA around the inner text of a node, and so on).
I just wanted to illustrate how you could do it, so please don't complain if it is slow/has bugs (as I mentioned, I would not do it).
class XmlCleaner
{
    public void Clean(Stream sourceStream, Stream targetStream)
    {
        const char openingIndicator = '<';
        const char closingIndicator = '>';
        const int bufferSize = 1024;

        char[] buffer = new char[bufferSize];
        bool startTagFound = false;
        StringBuilder writeBuffer = new StringBuilder();

        using (var reader = new StreamReader(sourceStream))
        {
            var writer = new StreamWriter(targetStream);
            try
            {
                int read;
                while ((read = reader.Read(buffer, 0, bufferSize)) > 0)
                {
                    // only look at the characters actually read in this pass
                    for (int i = 0; i < read; i++)
                    {
                        char c = buffer[i];
                        if (c == openingIndicator)
                        {
                            if (startTagFound)
                            {
                                // two opening brackets without a closing one
                                // in between - escape the first one
                                writeBuffer = writeBuffer.Replace("<", "&lt;");
                                // append the new one
                                writeBuffer.Append(c);
                            }
                            else
                            {
                                startTagFound = true;
                                writeBuffer.Append(c);
                            }
                        }
                        else if (c == closingIndicator)
                        {
                            startTagFound = false;
                            // flush the buffered text
                            writeBuffer.Append(c);
                            writer.Write(writeBuffer.ToString());
                            writeBuffer.Clear();
                        }
                        else
                        {
                            writeBuffer.Append(c);
                        }
                    }
                }
            }
            finally
            {
                // unfortunately the StreamWriter's Dispose method closes the
                // underlying stream, so we just flush it
                writer.Flush();
            }
        }
    }
}
To test it:
var testxml =
@"<base>
<elem1 number='1'>
<elem2>yyy</elem2>
<elem3>xxx <yyy zzz aaa</elem3>
</elem1>
</base>";

string result;
using (var source = new MemoryStream(Encoding.ASCII.GetBytes(testxml)))
using (var target = new MemoryStream())
{
    XmlCleaner cleaner = new XmlCleaner();
    cleaner.Clean(source, target);
    target.Position = 0;
    using (var reader = new StreamReader(target))
    {
        result = reader.ReadToEnd();
    }
}

XDocument.Parse(result);
var expectedResult =
@"<base>
<elem1 number='1'>
<elem2>yyy</elem2>
<elem3>xxx &lt;yyy zzz aaa</elem3>
</elem1>
</base>";
Debug.Assert(result == expectedResult);

Why StreamReader.EndOfStream property change the BaseStream.Position value

I wrote this small program which reads every 5th character from RANDOM.txt.
In RANDOM.txt I have one line of text, ABCDEFGHIJKLMNOPRST, and I got the expected result:
Position of A is 0
Position of F is 5
Position of K is 10
Position of P is 15
Here is the code:
static void Main(string[] args)
{
    StreamReader fp;
    int n;

    fp = new StreamReader("d:\\RANDOM.txt");
    long previousBSposition = fp.BaseStream.Position;
    // At this point BaseStream.Position is 0, as expected
    n = 0;
    while (!fp.EndOfStream)
    {
        // After !fp.EndOfStream was evaluated, BaseStream.Position changed to 19,
        // so I have to reset it to the previous position :S
        fp.BaseStream.Seek(previousBSposition, SeekOrigin.Begin);
        Console.WriteLine("Position of " + Convert.ToChar(fp.Read()) + " is " + fp.BaseStream.Position);
        n = n + 5;
        fp.DiscardBufferedData();
        fp.BaseStream.Seek(n, SeekOrigin.Begin);
        previousBSposition = fp.BaseStream.Position;
    }
}
My question is: why, after the line while (!fp.EndOfStream), is BaseStream.Position changed to 19, i.e. the end of the base stream? I expected, obviously wrongly, that BaseStream.Position would stay the same when I call the EndOfStream check.
Thanks.
The only certain way to find out whether a Stream is at its end is to actually read something from it and check whether the return value is 0. (StreamReader has another way - checking its internal buffer - but you correctly don't let it do that by calling DiscardBufferedData.)
So EndOfStream has to read at least one byte from the base stream. And since reading byte by byte is inefficient, it reads more. That's the reason why the call to EndOfStream changes the position to the end (for a bigger file it wouldn't be the end of the file, just the end of the first buffered block).
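You can see this directly by checking BaseStream.Position around the EndOfStream call (a small sketch; the final 19 assumes the file is written as 19 single-byte characters with no BOM, which is what File.WriteAllText's default UTF-8 produces for ASCII text):

```csharp
using System;
using System.IO;

// 19 ASCII characters, written with the default BOM-less UTF-8,
// so the file is exactly 19 bytes.
string path = Path.Combine(Path.GetTempPath(), "eos-demo.txt");
File.WriteAllText(path, "ABCDEFGHIJKLMNOPRST");

long posBefore, posAfter;
bool eos;
using (var reader = new StreamReader(path))
{
    posBefore = reader.BaseStream.Position; // nothing read yet
    eos = reader.EndOfStream;               // forces a buffered read
    posAfter = reader.BaseStream.Position;  // the whole file fit in one buffer
}
Console.WriteLine($"{posBefore} {eos} {posAfter}"); // 0 False 19
```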
It seems you don't actually need to use StreamReader, so you should use Stream (or specifically FileStream) directly:
using (Stream fp = new FileStream(@"d:\RANDOM.txt", FileMode.Open))
{
    int n = 0;
    while (true)
    {
        long position = fp.Position; // capture before ReadByte advances it
        int read = fp.ReadByte();
        if (read == -1)
            break;
        char c = (char)read;
        Console.WriteLine("Position of {0} is {1}.", c, position);
        n += 5;
        fp.Position = n;
    }
}
(I'm not sure what setting the position beyond the end of the file does in this situation; you may need to add a check for that.)
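One simple guard is to bound the loop by the stream's Length so the position is never set past the end (a sketch of the same every-5th-byte loop; the file path is a throwaway temp file rather than the original d:\RANDOM.txt):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Same sample data as the question: one line of 19 characters.
string path = Path.Combine(Path.GetTempPath(), "every5-demo.txt");
File.WriteAllText(path, "ABCDEFGHIJKLMNOPRST");

var found = new List<char>();
using (var fp = new FileStream(path, FileMode.Open, FileAccess.Read))
{
    // Bounding n by fp.Length means Position is never set past the end.
    for (long n = 0; n < fp.Length; n += 5)
    {
        fp.Position = n;
        char c = (char)fp.ReadByte();
        Console.WriteLine("Position of {0} is {1}", c, n);
        found.Add(c);
    }
}
Console.WriteLine(new string(found.ToArray())); // AFKP
```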
The base stream's Position property refers to the position of the last read byte in the buffer, not the actual position of the StreamReader's cursor.
You are right, and I could reproduce your issue as well. Anyway, according to MSDN (Read Text from a File), the proper way to read a text file with a StreamReader is the following, not yours (it also always closes and disposes the reader by using a using block):
try
{
    // Create an instance of StreamReader to read from a file.
    // The using statement also closes the StreamReader.
    using (StreamReader sr = new StreamReader("TestFile.txt"))
    {
        String line;
        // Read and display lines from the file until the end of
        // the file is reached.
        while ((line = sr.ReadLine()) != null)
        {
            Console.WriteLine(line);
        }
    }
}
catch (Exception e)
{
    // Let the user know what went wrong.
    Console.WriteLine("The file could not be read:");
    Console.WriteLine(e.Message);
}

C# - Check if File is Text Based

How can I test whether a file that I'm opening in C# using FileStream is a "text type" file? I would like my program to open any file that is text based, for example, .txt, .html, etc.
But not open such things as .doc or .pdf or .exe, etc.
In general: there is no way to tell.
A text file stored in UTF-16 will likely look like binary if you open it with an 8-bit encoding. Equally someone could save a text file as a .doc (it is a document).
While you could open the file and look at some of the content all such heuristics will sometimes fail (eg. notepad tries to do this, by careful selection of a few characters notepad will guess wrong and display completely different content).
If you have a specific scenario, rather than being able to open and process anything, you should be able to do much better.
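One cheap first check that catches the UTF-16 case mentioned above is to sniff for a byte-order mark before falling back to content heuristics (a sketch; many perfectly valid text files carry no BOM, so a null result proves nothing either way):

```csharp
using System;
using System.IO;
using System.Text;

// Returns the name of a recognized byte-order mark, or null if none found.
static string DetectBom(string path)
{
    byte[] b = new byte[3];
    using (var fs = File.OpenRead(path))
        fs.Read(b, 0, 3);
    if (b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "utf-8";
    if (b[0] == 0xFF && b[1] == 0xFE) return "utf-16le";
    if (b[0] == 0xFE && b[1] == 0xFF) return "utf-16be";
    return null;
}

// Demo file: Encoding.Unicode (UTF-16 LE) writes the FF FE mark.
string path = Path.Combine(Path.GetTempPath(), "bom-sniff.txt");
File.WriteAllText(path, "hello", Encoding.Unicode);
Console.WriteLine(DetectBom(path)); // utf-16le
```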
I guess you could just check the first 1,000 (an arbitrary number) characters and see if there are unprintable characters, or if they are all ASCII within a certain range. If the latter, assume that it is text.
Whatever you do is going to be a guess.
As others have pointed out there is no absolute way to be sure. However, to determine if a file is binary (which can be said to be easier than determining if it is text) some implementations check for consecutive NUL characters. Git apparently just checks the first 8000 chars for a NUL and if it finds one treats the file as binary. See here for more details.
Here is a similar C# solution I wrote that looks for a given number of required consecutive NUL. If IsBinary returns false then it is very likely your file is text based.
public bool IsBinary(string filePath, int requiredConsecutiveNul = 1)
{
    const int charsToCheck = 8000;
    const char nulChar = '\0';

    int nulCount = 0;

    using (var streamReader = new StreamReader(filePath))
    {
        for (var i = 0; i < charsToCheck; i++)
        {
            if (streamReader.EndOfStream)
                return false;

            if ((char)streamReader.Read() == nulChar)
            {
                nulCount++;
                if (nulCount >= requiredConsecutiveNul)
                    return true;
            }
            else
            {
                nulCount = 0;
            }
        }
    }

    return false;
}
To get the real type of a file, you must check its header, which won't change even if the extension is modified. You can get a header list here, and use something like this in your code:
using (var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read))
{
    using (var reader = new BinaryReader(stream))
    {
        // Read the first bytes of the file.
        // In this example I want to check if the file is a BMP, whose
        // header starts with the two bytes 0x42 0x4D ("BM" - decimal
        // 66 and 77, hence the "6677" below).
        string code = reader.ReadByte().ToString() + reader.ReadByte().ToString();
        if (code.Equals("6677"))
        {
            // it's a BMP file
        }
    }
}
I have a solution below which works for me. It is a general check that classifies any file as binary or text.
/// <summary>
/// This method checks whether the selected file is a binary file or not.
/// </summary>
public bool CheckForBinary()
{
    Stream objStream = new FileStream("your file path", FileMode.Open, FileAccess.Read);
    bool bFlag = true;

    // Iterate through the stream and check the ASCII value of each byte.
    for (int nPosition = 0; nPosition < objStream.Length; nPosition++)
    {
        int a = objStream.ReadByte();
        if (!(a >= 0 && a <= 127))
        {
            break; // Binary file
        }
        else if (objStream.Position == objStream.Length)
        {
            bFlag = false; // Text file
        }
    }
    objStream.Dispose();
    return bFlag;
}
public bool IsTextFile(string FilePath)
{
    using (StreamReader reader = new StreamReader(FilePath))
    {
        int Character;
        while ((Character = reader.Read()) != -1)
        {
            if ((Character > 0 && Character < 8) || (Character > 13 && Character < 26))
            {
                return false;
            }
        }
    }
    return true;
}
