Sometimes we need to read lines from a stream while treating only one specific character sequence as the newline (CRLF, but not a lone CR or LF).
StreamReader.ReadLine, as documented, treats CRLF, CR, and LF all as newline sequences. That may be unacceptable if a line can contain a single CR ("\r") or a single LF ("\n") as meaningful business data.
What we need is a way to read the stream line by line, with lines delimited by a specific character sequence.
Here is a method that reads a line from a stream and returns it as a string:
public static string ReadLineWithFixedNewlineDelimeter(StreamReader reader, string delim)
{
if (reader.EndOfStream)
return null;
if (string.IsNullOrEmpty(delim))
{
return reader.ReadToEnd();
}
var sb = new StringBuilder();
var delimCandidatePosition = 0;
while (!reader.EndOfStream && delimCandidatePosition < delim.Length)
{
var c = (char)reader.Read();
if (c == delim[delimCandidatePosition])
{
delimCandidatePosition ++;
}
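else
{
// Restart the match; the current char may itself begin the delimiter (e.g. "\r\r\n").
// This simple restart is enough for delimiters such as "\r\n" whose first char does not repeat.
delimCandidatePosition = c == delim[0] ? 1 : 0;
}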
sb.Append(c);
}
return sb.ToString(0, sb.Length - (delimCandidatePosition == delim.Length ? delim.Length : 0));
}
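To make the difference described above concrete, here is a minimal sketch that splits the same in-memory text twice: once with the built-in ReadLine, and once on "\r\n" only. The class and method names (NewlineDemo, SplitWithReadLine, SplitOnCrLfOnly) are hypothetical and exist only for this illustration. It works on a string that is already in memory, so it is not a substitute for the streaming method above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class NewlineDemo
{
    // Splits with the built-in ReadLine, which accepts "\r\n", "\r" and "\n".
    public static List<string> SplitWithReadLine(string text)
    {
        var lines = new List<string>();
        using (var reader = new StringReader(text))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                lines.Add(line);
        }
        return lines;
    }

    // Splits on "\r\n" only, so a lone "\r" or "\n" stays inside the line.
    public static List<string> SplitOnCrLfOnly(string text)
    {
        return new List<string>(text.Split(new[] { "\r\n" }, StringSplitOptions.None));
    }
}
```

For the input "a\nb\r\nc", SplitWithReadLine returns three lines ("a", "b", "c"), while SplitOnCrLfOnly returns two ("a\nb" and "c"), keeping the lone LF as data.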
I want to replace the delimiter comma with tabs in a CSV file
Input
Output
Note that commas inside quoted fields shouldn't be replaced. Also, the double quotes should be omitted from the output.
I tried the following, but this code also replaces the commas inside quoted fields:
public void Replace_comma_with_tabs(string path)
{
var file = File
.ReadLines(path)
.SkipWhile(line => string.IsNullOrWhiteSpace(line)) // To be on the safe side
.Select((line, index) => line.Replace(',', '\t')) // replace ',' with '\t'
.ToList(); // Materialization, since we write into the same file
File.WriteAllLines(path, file);
}
How can I skip commas for the words enclosed by quotes?
Here is one way of doing it. It uses a flag, quotesStarted, to decide whether a comma should be treated as a delimiter or as part of the column text. I also used StringBuilder, since it performs well for repeated string concatenation. The code reads the lines and, for each line, iterates through its characters, checking for those with special meaning (double quote, comma outside quotes, comma inside quotes, tab). Note that, as written, it keeps the double quotes in the output and throws if the input already contains a tab:
static void Main(string[] args)
{
var path = "data.txt";
var file = File.ReadLines(path).ToArray();
StringBuilder sbFile = new StringBuilder();
foreach (string line in file)
{
if (String.IsNullOrWhiteSpace(line) == false)
{
bool quotesStarted = false;
StringBuilder sbLine = new StringBuilder();
foreach (char currentChar in line)
{
if (currentChar == '"')
{
quotesStarted = !quotesStarted;
sbLine.Append(currentChar);
}
else if (currentChar == ',')
{
if (quotesStarted)
sbLine.Append(currentChar);
else
sbLine.Append("\t");
}
else if (currentChar == '\t')
throw new Exception("Tab found");
else
sbLine.Append(currentChar);
}
sbFile.AppendLine(sbLine.ToString());
}
}
File.WriteAllText("Result-" + path, sbFile.ToString());
}
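The question also asks for the double quotes to be omitted from the output, which the loop above does not do. A minimal sketch of a per-line conversion that drops them might look like the following. The names CsvTabs and ToTabSeparated are hypothetical, and the sketch assumes that a doubled quote ("") inside a quoted field stands for one literal quote and that fields never span multiple lines.

```csharp
using System.Text;

public static class CsvTabs
{
    // Converts one comma-separated line to a tab-separated line.
    // Commas inside quoted fields are kept; the enclosing quotes are dropped.
    public static string ToTabSeparated(string line)
    {
        var sb = new StringBuilder(line.Length);
        bool inQuotes = false;
        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (c == '"')
            {
                // Assumption: a doubled quote inside a quoted field is one literal quote.
                if (inQuotes && i + 1 < line.Length && line[i + 1] == '"')
                {
                    sb.Append('"');
                    i++; // skip the second quote of the pair
                }
                else
                {
                    inQuotes = !inQuotes; // opening or closing quote: not copied to the output
                }
            }
            else if (c == ',' && !inQuotes)
            {
                sb.Append('\t'); // field separator outside quotes
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

It could be used in place of line.Replace(',', '\t') in the Select call from the question, for example .Select(line => CsvTabs.ToTabSeparated(line)).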
There are many ways to do this, but here's one. It only includes the code to transform a string of comma-delimited text that may contain quoted fields. You'd call "ToTabs" instead of "Replace" inside your Select statement. You'll have to harden this and add some error checking.
This handles escaped quotes inside quoted fields and converts existing tabs to spaces, but it's not a full-blown CSV parser.
static class CsvHelper
{
public static string ToTabs(this string source)
{
Func<char,char> getState = NotInQuotes;
char last = ' ';
char InQuotes(char ch)
{
if ('"' == ch && last != '"')
getState = NotInQuotes;
else if ('\t' == ch)
ch = ' ';
last = ch;
return ch;
}
char NotInQuotes(char ch)
{
last = ch;
if ('"' == ch)
getState = InQuotes;
else if (',' == ch)
return '\t';
else if ('\t' == ch)
ch = ' ';
return ch;
}
return string.Create(source.Length, getState, (buffer,_) =>
{
for (int i = 0; i < source.Length; ++i)
{
buffer[i] = getState(source[i]);
}
});
}
}
static void Main(string[] _)
{
const string Source = "a,string,with,commas,\"field,with,\"\"commas\", and, another";
var withTabs = Source.ToTabs();
Console.WriteLine(Source);
Console.WriteLine(withTabs);
}
To change commas in a string to tabs, use the Replace method.
Example:
string str = "Lucy, John, Mark, Grace";
string str2 = str.Replace(",", "\t"); // "\t" is the tab character
So I have a string that I need to split by semicolons:
Email address: "one#tw;,.'o"#hotmail.com;"some;thing"#example.com
Both of the email addresses are valid
So I want to have a List<string> of the following:
"one#tw;,.'o"#hotmail.com
"some;thing"#example.com
But the way I am currently splitting the addresses is not working:
var addresses = emailAddressString.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
.Select(x => x.Trim()).ToList();
Because of the multiple ; characters I end up with invalid email addresses.
I have tried a few different ways, even going down working out if the string contains quotes and then finding the index of the ; characters and working it out that way, but it's a real pain.
Does anyone have any better suggestions?
Assuming that double-quotes are not allowed, except for the opening and closing quotes ahead of the "at" sign #, you can use this regular expression to capture e-mail addresses:
((?:[^#"]+|"[^"]*")#[^;]+)(?:;|$)
The idea is to capture either an unquoted [^#"]+ or a quoted "[^"]*" part prior to #, and then capture everything up to semicolon ; or the end anchor $.
var input = "\"one#tw;,.'o\"#hotmail.com;\"some;thing\"#example.com;hello#world";
var mm = Regex.Matches(input, "((?:[^#\"]+|\"[^\"]*\")#[^;]+)(?:;|$)");
foreach (Match m in mm) {
Console.WriteLine(m.Groups[1].Value);
}
This code prints
"one#tw;,.'o"#hotmail.com
"some;thing"#example.com
hello#world
If you would like to allow escaped double-quotes inside double-quotes, you could use a more complex expression:
((?:(?:[^#\"]|(?<=\\)\")+|\"([^\"]|(?<=\\)\")*\")#[^;]+)(?:;|$)
Everything else remains the same.
I evidently started writing my anti-regex method at around the same time as juharr (another answer). Since I had already written it, I thought I would submit it.
public static IEnumerable<string> SplitEmailsByDelimiter(string input, char delimiter)
{
var startIndex = 0;
var delimiterIndex = 0;
while (delimiterIndex >= 0)
{
delimiterIndex = input.IndexOf(delimiter, startIndex); // use the parameter rather than a hard-coded ';'
string substring = input;
if (delimiterIndex > 0)
{
substring = input.Substring(0, delimiterIndex);
}
if (!substring.Contains("\"") || substring.IndexOf("\"") != substring.LastIndexOf("\""))
{
yield return substring;
input = input.Substring(delimiterIndex + 1);
startIndex = 0;
}
else
{
startIndex = delimiterIndex + 1;
}
}
}
Then the following
var input = "blah#blah.com;\"one#tw;,.'o\"#hotmail.com;\"some;thing\"#example.com;hello#world;asdasd#asd.co.uk;";
foreach (var email in SplitEmailsByDelimiter(input, ';'))
{
Console.WriteLine(email);
}
Would give this output
blah#blah.com
"one#tw;,.'o"#hotmail.com
"some;thing"#example.com
hello#world
asdasd#asd.co.uk
You can also do this without using regular expressions. The following extension method will allow you to specify a delimiter character and a character to begin and end escape sequences. Note it does not validate that all escape sequences are closed.
public static IEnumerable<string> SpecialSplit(
this string str, char delimiter, char beginEndEscape)
{
int beginIndex = 0;
int length = 0;
bool escaped = false;
foreach (char c in str)
{
if (c == beginEndEscape)
{
escaped = !escaped;
}
if (!escaped && c == delimiter)
{
yield return str.Substring(beginIndex, length);
beginIndex += length + 1;
length = 0;
continue;
}
length++;
}
yield return str.Substring(beginIndex, length);
}
Then the following
var input = "\"one#tw;,.'o\"#hotmail.com;\"some;thing\"#example.com;hello#world;\"D;D#blah;blah.com\"";
foreach (var address in input.SpecialSplit(';', '"'))
Console.WriteLine(address);
Will give this output
"one#tw;,.'o"#hotmail.com
"some;thing"#example.com
hello#world
"D;D#blah;blah.com"
Here's a version that also supports an additional single escape character. It assumes that two consecutive escape characters should become a single escape character, and that the escape character can escape both the beginEndEscape character (so it does not start or end an escape sequence) and the delimiter. Anything else that follows the escape character is left as is, with the escape character removed.
public static IEnumerable<string> SpecialSplit(
this string str, char delimiter, char beginEndEscape, char singleEscape)
{
StringBuilder builder = new StringBuilder();
bool escapedSequence = false;
bool previousEscapeChar = false;
foreach (char c in str)
{
if (c == singleEscape && !previousEscapeChar)
{
previousEscapeChar = true;
continue;
}
if (c == beginEndEscape && !previousEscapeChar)
{
escapedSequence = !escapedSequence;
}
if (!escapedSequence && !previousEscapeChar && c == delimiter)
{
yield return builder.ToString();
builder.Clear();
continue;
}
builder.Append(c);
previousEscapeChar = false;
}
yield return builder.ToString();
}
Finally, you should probably add null checking for the string that is passed in. Note that both versions return a sequence containing one empty string if you pass in an empty string.
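One detail about the suggested null check: in an iterator method (one that uses yield return), nothing in the body runs until the result is enumerated, so a check placed there would throw late. A common pattern is to validate in a non-iterator wrapper and delegate to a private iterator. The following is a minimal sketch of that pattern for the first version. The names SplitExtensions, SpecialSplitChecked, and Iterate are hypothetical, and the splitting loop is a condensed rewrite rather than the exact code above.

```csharp
using System;
using System.Collections.Generic;

public static class SplitExtensions
{
    // Non-iterator wrapper: the null check runs as soon as the method is called.
    public static IEnumerable<string> SpecialSplitChecked(
        this string str, char delimiter, char beginEndEscape)
    {
        if (str == null)
            throw new ArgumentNullException(nameof(str));
        return Iterate(str, delimiter, beginEndEscape);
    }

    // Iterator: runs lazily, only while the result is being enumerated.
    private static IEnumerable<string> Iterate(string str, char delimiter, char beginEndEscape)
    {
        int beginIndex = 0;
        bool escaped = false;
        for (int i = 0; i < str.Length; i++)
        {
            char c = str[i];
            if (c == beginEndEscape)
            {
                escaped = !escaped; // entering or leaving an escaped section
            }
            else if (!escaped && c == delimiter)
            {
                yield return str.Substring(beginIndex, i - beginIndex);
                beginIndex = i + 1;
            }
        }
        // Whatever follows the last delimiter (an empty string for empty input).
        yield return str.Substring(beginIndex);
    }
}
```

With this split, passing null throws ArgumentNullException at the call site instead of at the first MoveNext.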
How does StreamReader read all chars, including 0x0D 0x0A chars?
I have an old .txt file I am trying to convert. Many lines (but not all) end with "0x0D 0x0D 0x0A".
This code reads all of the lines.
StreamReader srFile = new StreamReader(gstPathFileName);
while (!srFile.EndOfStream) {
string stFileContents = srFile.ReadLine();
...
}
This results in extra "" strings between each .txt line. As there are some blank lines between the paragraphs, removing all "" strings removes those blank lines.
Is there a way to have StreamReader read all of the chars including the "0x0D 0x0D 0x0A"?
Edited two hours later ... the file is huge, 1.6MB.
A very simple reimplementation of ReadLine. I wrote a version that returns an IEnumerable<string> because it's easier, and made it an extension method, hence the static class. The code is heavily commented, so it should be easy to read.
public static class StreamEx
{
public static string[] ReadAllLines(this TextReader tr, string separator)
{
return tr.ReadLines(separator).ToArray();
}
// StreamReader is based on TextReader
public static IEnumerable<string> ReadLines(this TextReader tr, string separator)
{
// Handling of empty file: old remains null
string old = null;
// Read buffer
var buffer = new char[128];
while (true)
{
// If we already read something
if (old != null)
{
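// Look for the separator (ordinal comparison, so "\r\n" is matched char by char regardless of culture)
int ix = old.IndexOf(separator, StringComparison.Ordinal);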
// If found
if (ix != -1)
{
// Return the piece of line before the separator
yield return old.Remove(ix);
// Then remove the piece of line before the separator plus the separator
old = old.Substring(ix + separator.Length);
// And continue
continue;
}
}
// old doesn't contain any separator, let's read some more chars
int read = tr.ReadBlock(buffer, 0, buffer.Length);
// If there is no more chars to read, break the cycle
if (read == 0)
{
break;
}
// Add the just read chars to the old chars
// note that null + "somestring" == "somestring"
old += new string(buffer, 0, read);
// A new "round" of the while cycle will search for the separator
}
// Now we have to handle chars after the last separator
// If we read something
if (old != null)
{
// Return all the remaining characters
yield return old;
}
}
}
Note that, as written, it won't directly handle your problem :-) But it lets you select the separator you want to use. So you use "\r\n" and then you trim the excess '\r'.
Use it like this:
using (var sr = new StreamReader("somefile"))
{
// Little LINQ to strip excess \r and to make an array
// (note that by making an array you'll put all the file
// in memory)
string[] lines = sr.ReadLines("\r\n").Select(x => x.TrimEnd('\r')).ToArray();
}
or
using (var sr = new StreamReader("somefile"))
{
// Little LINQ to strip excess \r
// (note that the file will be read line by line, so only
// a line at a time is in memory (plus some remaining characters
// of the next line in the old buffer)
IEnumerable<string> lines = sr.ReadLines("\r\n").Select(x => x.TrimEnd('\r'));
foreach (string line in lines)
{
// Do something
}
}
You could always use a BinaryReader and manually read in lines a byte at a time. Keep hold of the bytes, then when you come across 0x0d 0x0d 0x0a, make a new string of the bytes for the current line.
Note:
I'm assuming that your encoding is Encoding.UTF8 but your case might be different. Accessing bytes directly, I don't know off-hand how to interpret the encoding.
If your file has extra information, e.g. a byte order mark, that will be returned too.
Here it is:
public static IEnumerable<string> ReadLinesFromStream(string fileName)
{
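using ( var fileStream = File.OpenRead(fileName) ) // open the file that was passed in, read-only
using ( BinaryReader binaryReader = new BinaryReader(fileStream) )
{
var bytes = new List<byte>();
// Compare stream positions instead of calling PeekChar, which decodes characters and can throw on a partial UTF-8 sequence
while ( fileStream.Position < fileStream.Length )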
{
bytes.Add(binaryReader.ReadByte());
bool newLine = bytes.Count > 2
&& bytes[bytes.Count - 3] == 0x0d
&& bytes[bytes.Count - 2] == 0x0d
&& bytes[bytes.Count - 1] == 0x0a;
if ( newLine )
{
yield return Encoding.UTF8.GetString(bytes.Take(bytes.Count - 3).ToArray());
bytes.Clear();
}
}
if ( bytes.Count > 0 )
yield return Encoding.UTF8.GetString(bytes.ToArray());
}
}
A very easy solution (not optimized for memory consumption) could be:
var allLines = File.ReadAllText(gstPathFileName)
.Split('\n');
Then, if you need to remove trailing carriage return characters, do:
for(var i = 0; i < allLines.Length; ++i)
allLines[i] = allLines[i].TrimEnd('\r');
You can put the relevant processing into that for loop if you want. Or, if you do not want to keep the array, use this instead of the for loop:
foreach(var line in allLines.Select(x => x.TrimEnd('\r')))
{
// use 'line' here ...
}
This code works well ... reads every char.
char[] acBuf = null;
int iReadLength = 100;
while (srFile.Peek() >= 0) {
acBuf = new char[iReadLength];
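// Read returns how many chars were actually read; the last chunk may be shorter than the buffer
int iCharsRead = srFile.Read(acBuf, 0, iReadLength);
string s = new string(acBuf, 0, iCharsRead);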
}
I am reading a file line by line with a StreamReader (sr.ReadLine()). My code treats both \r\n and \n as line endings.
StreamReader sr = new System.IO.StreamReader(sPath, enc);
while (!sr.EndOfStream)
{
// reading 1 line of datafile
string sLine = sr.ReadLine();
...
How can I tell the code (instead of using the universal sr.ReadLine()) to treat only a full \r\n as a new line, and not a lone \n?
It is not possible to do this using StreamReader.ReadLine.
As per MSDN:
A line is defined as a sequence of characters followed by a line feed
("\n"), a carriage return ("\r"), or a carriage return immediately
followed by a line feed ("\r\n"). The string that is returned does not
contain the terminating carriage return or line feed. The returned
value is null if the end of the input stream is reached.
So you have to read this stream character by character and return a line only once you've captured \r\n.
EDIT
Here is a code sample:
private static IEnumerable<string> ReadLines(StreamReader stream)
{
StringBuilder sb = new StringBuilder();
int symbol;
// Loop on the result of Read() itself, so the end-of-stream value -1 is never appended as a char
while ((symbol = stream.Read()) != -1)
{
if (symbol == 13 && stream.Peek() == 10)
{
stream.Read();
string line = sb.ToString();
sb.Clear();
yield return line;
}
else
sb.Append((char)symbol);
}
yield return sb.ToString();
}
You can use it like
foreach (string line in ReadLines(stream))
{
//do something
}
You cannot do it with ReadLine, but you can do this instead:
stream.ReadToEnd().Split(new[] {"\r\n"}, StringSplitOptions.None)
For simplification, let's work over a byte array:
static int NumberOfNewLines(byte[] data)
{
int count = 0;
for (int i = 0; i < data.Length - 1; i++)
{
if (data[i] == '\r' && data[i + 1] == '\n')
count++;
}
return count;
}
If you care about efficiency, optimize away, but this should work.
You can get the bytes of a file by using System.IO.File.ReadAllBytes(string path).
What is the easiest way to read a file character by character in C#?
Currently, I am reading line by line by calling System.io.file.ReadLine(). I see that there is a Read() function, but it doesn't return a character...
I would also like to know how to detect the end of a line using such an approach. The input file in question is a CSV file.
Open a TextReader (e.g. by File.OpenText - note that File is a static class, so you can't create an instance of it) and repeatedly call Read. That returns int rather than char so it can also indicate end of file:
int readResult = reader.Read();
if (readResult != -1)
{
char nextChar = (char) readResult;
// ...
}
Or to loop:
int readResult;
while ((readResult = reader.Read()) != -1)
{
char nextChar = (char) readResult;
// ...
}
Or for more funky goodness:
public static IEnumerable<char> ReadCharacters(string filename)
{
using (var reader = File.OpenText(filename))
{
int readResult;
while ((readResult = reader.Read()) != -1)
{
yield return (char) readResult;
}
}
}
...
foreach (char c in ReadCharacters("foo.txt"))
{
...
}
Note that by default, File.OpenText uses UTF-8 encoding. If that isn't what you want, specify an encoding explicitly, for example with a StreamReader constructor overload that takes an Encoding.
EDIT: To find the end of a line, you'd check whether the character is \n... you'd potentially want to handle \r specially too, if this is a Windows text file.
But if you want each line, why not just call ReadLine? You can always iterate over the characters in the line afterwards...
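To illustrate the end-of-line check described in the EDIT, here is a minimal sketch that reads character by character and yields a line whenever it meets \n, \r, or \r\n. The names CharLineReader and ReadLinesByChar are hypothetical. The sketch treats any of the three sequences as a line ending, which is an assumption about the input file; it also does not understand CSV quoting, so a line break inside a quoted field would still end the line.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class CharLineReader
{
    // Reads one character at a time and yields each completed line.
    public static IEnumerable<string> ReadLinesByChar(TextReader reader)
    {
        var sb = new StringBuilder();
        int readResult;
        while ((readResult = reader.Read()) != -1)
        {
            char c = (char)readResult;
            if (c == '\r' || c == '\n')
            {
                // Treat "\r\n" as a single line ending by consuming the '\n'.
                if (c == '\r' && reader.Peek() == '\n')
                    reader.Read();
                yield return sb.ToString();
                sb.Clear();
            }
            else
            {
                sb.Append(c);
            }
        }
        // Text after the last line ending, if any.
        if (sb.Length > 0)
            yield return sb.ToString();
    }
}
```

It can be called with any TextReader, for example the one returned by File.OpenText, and each yielded line can then be inspected character by character.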
Here is a snippet from MSDN:
using (StreamReader sr = new StreamReader(path))
{
char[] c = null;
while (sr.Peek() >= 0)
{
c = new char[1];
sr.Read(c, 0, c.Length);
// do something with c[0]
}
}