Related
How to get whole text from document contacted into the string. I'm trying to split text by dot: string[] words = s.Split('.'); I want take this text from text document. But if my text document contains empty lines between strings, for example:
pat said, “i’ll keep this ring.”
she displayed the silver and jade wedding ring which, in another time track,
she and joe had picked out; this
much of the alternate world she had elected to retain. he wondered what - if any - legal basis she had kept in addition. none, he hoped; wisely, however, he said nothing. better not even to ask.
result looks like this:
1. pat said ill keep this ring
2. she displayed the silver and jade wedding ring which in another time track
3. she and joe had picked out this
4. much of the alternate world she had elected to retain
5. he wondered what if any legal basis she had kept in addition
6. none he hoped wisely however he said nothing
7. better not even to ask
but desired correct output should be like this:
1. pat said ill keep this ring
2. she displayed the silver and jade wedding ring which in another time track she and joe had picked out this much of the alternate world she had elected to retain
3. he wondered what if any legal basis she had kept in addition
4. none he hoped wisely however he said nothing
5. better not even to ask
So to do this first I need to process text file content to get whole text as single string, like this:
pat said, “i’ll keep this ring.” she displayed the silver and jade wedding ring which, in another time track, she and joe had picked out; this much of the alternate world she had elected to retain. he wondered what - if any - legal basis she had kept in addition. none, he hoped; wisely, however, he said nothing. better not even to ask.
I can't to do this same way as it would be with list content for example: string concat = String.Join(" ", text.ToArray());,
I'm not sure how to contact text into string from text document
I think this is what you want:
var fileLocation = #"c:\\myfile.txt";
var stringFromFile = File.ReadAllText(fileLocation);
//replace Environment.NewLine with any new line character your file uses
var withoutNewLines = stringFromFile.Replace(Environment.NewLine, "");
//modify to remove any unwanted character
var withoutUglyCharacters = Regex.Replace(withoutNewLines, "[“’”,;-]", "");
var withoutTwoSpaces = withoutUglyCharacters.Replace(" ", " ");
var result = withoutTwoSpaces.Split('.').Where(i => i != "").Select(i => i.TrimStart()).ToList();
So first you read all text from your file, then you remove all unwanted characters and then split by . and return non empty items
Have you tried replacing double new-lines before splitting using a period?
static string[] GetSentences(string filePath) {
if (!File.Exists(filePath))
throw new FileNotFoundException($"Could not find file { filePath }!");
var lines = string.Join("", File.ReadLines(filePath).Where(line => !string.IsNullOrEmpty(line) && !string.IsNullOrWhiteSpace(line)));
var sentences = Regex.Split(lines, #"\.[\s]{1,}?");
return sentences;
}
I haven't tested this, but it should work.
Explanation:
if (!File.Exists(filePath))
throw new FileNotFoundException($"Could not find file { filePath }!");
Throws an exception if the file could not be found. It is advisory you surround the method call with a try/catch.
var lines = string.Join("", File.ReadLines(filePath).Where(line => !string.IsNullOrEmpty(line) && !string.IsNullOrWhiteSpace(line)));
Creates a string, and ignores any lines which are purely whitespace or empty.
var sentences = Regex.Split(lines, #".[\s]{1,}?");
Creates a string array, where the string is split at every period and whitespace following the period.
E.g:
The string "I came. I saw. I conquered" would become
I came
I saw
I conquered
Update:
Here's the method as a one-liner, if that's your style?
static string[] SplitSentences(string filePath) => File.Exists(filePath) ? Regex.Split(string.Join("", File.ReadLines(filePath).Where(line => !string.IsNullOrEmpty(line) && !string.IsNullOrWhiteSpace(line))), #"") : null;
I would suggest you to iterate through all characters and just check if they are in range of 'a' >= char <= 'z' or if char == ' '. If it matches the condition then add it to the newly created string else check if it is '.' character and if it is then end your line and add another one :
List<string> lines = new List<string>();
string line = string.Empty;
foreach(char c in str)
{
if((char.ToLower(c) >= 'a' && char.ToLower(c) <= 'z') || c == 0x20)
line += c;
else if(c == '.')
{
lines.Add(line.Trim());
line = string.Empty;
}
}
Working online example
Or if you prefer "one-liner"s :
IEnumerable<string> lines = new string(str.Select(c => (char)(((char.ToLower(c) >= 'a' && char.ToLower(c) <= 'z') || c == 0x20) ? c : c == '.' ? '\n' : '\0')).ToArray()).Split('\n').Select(s => s.Trim());
I may be wrong about this. I would think that you may not want to alter the string if you are splitting it. Example, there are double/single quote(s) (“) in part of the string. Removing them may not be desired which brings up the possibly of a question, reading a text file that contains single/double quotes (as your example data text shows) like below:
var stringFromFile = File.ReadAllText(fileLocation);
will not display those characters properly in a text box or the console because the default encoding using the ReadAllText method is UTF8. Example the single/double quotes will display (replacement characters) as diamonds in a text box on a form and will be displayed as a question mark (?) when displayed to the console. To keep the single/double quotes and have them display properly you can get the encoding for the OS’s current ANSI encoding by adding a parameter to the ReadAllText method like below:
string stringFromFile = File.ReadAllText(fileLocation, ASCIIEncoding.Default);
Below is code using a simple split method to .split the string on periods (.) Hope this helps.
private void button1_Click(object sender, EventArgs e) {
string fileLocation = #"C:\YourPath\YourFile.txt";
string stringFromFile = File.ReadAllText(fileLocation, ASCIIEncoding.Default);
string bigString = stringFromFile.Replace(Environment.NewLine, "");
string[] result = bigString.Split('.');
int count = 1;
foreach (string s in result) {
if (s != "") {
textBox1.Text += count + ". " + s.Trim() + Environment.NewLine;
Console.WriteLine(count + ". " + s.Trim());
count++;
}
else {
// period at the end of the string
}
}
}
I have a text file that has several hundred configuration values. The general format of the configuration data is "Label:Value". Using C# .net, I would like to read these configurations, and use the Values in other portions of the code. My first thought is that I would use a string search to look for the Labels then parse out the values following the labels and add them to a dictionary, but this seems rather tedious considering the number of labels/values that I would have to search for. I am interested to hear some thoughts on a possible architecture to perform this task. I have included a small section of a sample text file that contains some of the labels and values (below). A couple of notes: The Values are not always numeric (as seen in the AUX Serial Number); For whatever reason, the text files were formatted using spaces (\s) rather than tabs (\t). Thanks in advance for any time you spend thinking about this.
Sample Text:
AUX Serial Number: 445P000023 AUX Hardware Rev: 1
Barometric Pressure Slope: -1.452153E-02
Barometric Pressure Intercept: 9.524336E+02
This is a nice little brain tickler. I think this code might be able to point you in the right direction. Keep in mind, this fills a Dictionary<string, string>, so there are no conversions of values into ints or the like. Also, please excuse the mess (and the poor naming conventions). It was a quick write-up based on my train of thought.
Dictionary<string, string> allTheThings = new Dictionary<string, string>();
public void ReadIt()
{
// Open the file into a streamreader
using (System.IO.StreamReader sr = new System.IO.StreamReader("text_path_here.txt"))
{
while (!sr.EndOfStream) // Keep reading until we get to the end
{
string splitMe = sr.ReadLine();
string[] bananaSplits = splitMe.Split(new char[] { ':' }); //Split at the colons
if (bananaSplits.Length < 2) // If we get less than 2 results, discard them
continue;
else if (bananaSplits.Length == 2) // Easy part. If there are 2 results, add them to the dictionary
allTheThings.Add(bananaSplits[0].Trim(), bananaSplits[1].Trim());
else if (bananaSplits.Length > 2)
SplitItGood(splitMe, allTheThings); // Hard part. If there are more than 2 results, use the method below.
}
}
}
public void SplitItGood(string stringInput, Dictionary<string, string> dictInput)
{
StringBuilder sb = new StringBuilder();
List<string> fish = new List<string>(); // This list will hold the keys and values as we find them
bool hasFirstValue = false;
foreach (char c in stringInput) // Iterate through each character in the input
{
if (c != ':') // Keep building the string until we reach a colon
sb.Append(c);
else if (c == ':' && !hasFirstValue)
{
fish.Add(sb.ToString().Trim());
sb.Clear();
hasFirstValue = true;
}
else if (c == ':' && hasFirstValue)
{
// Below, the StringBuilder currently has something like this:
// " 235235 Some Text Here"
// We trim the leading whitespace, then split at the first sign of a double space
string[] bananaSplit = sb.ToString()
.Trim()
.Split(new string[] { " " },
StringSplitOptions.RemoveEmptyEntries);
// Add both results to the list
fish.Add(bananaSplit[0].Trim());
fish.Add(bananaSplit[1].Trim());
sb.Clear();
}
}
fish.Add(sb.ToString().Trim()); // Add the last result to the list
for (int i = 0; i < fish.Count; i += 2)
{
// This for loop assumes that the amount of keys and values added together
// is an even number. If it comes out odd, then one of the lines on the input
// text file wasn't parsed correctly or wasn't generated correctly.
dictInput.Add(fish[i], fish[i + 1]);
}
}
So the only general approach that I can think of, given the format that you're limited to, is to first find the first colon on the line and take everything before it as the label. Skip all whilespace characters until you get to the first non-whitespace character. Take all non-whitespace characters as the value of the label. If there is a colon after the end of that value take everything after the end of the previous value to the colon as the next value and repeat. You'll also probably need to trim whitespace around the labels.
You might be able to capture that meaning with a regex, but it wouldn't likely be a pretty one if you could; I'd avoid it for something this complex unless you're entire development team is very proficient with them.
I would try something like this:
While string contains triple space, replace it with double space.
Replace all ": " and ": " (: with double space) with ":".
Replace all " " (double space) with '\n' (new line).
If line don't contain ':' than skip the line. Else, use string.Split(':'). This way you receive arrays of 2 strings (key and value). Some of them may contain empty characters at the beginning or at the end.
Use string.Trim() to get rid of those empty characters.
Add received key and value to Dictionary.
I am not sure if it solves all your cases but it's a general clue how I would try to do it.
If it works you could think about performance (use StringBuilder instead of string wherever it is possible etc.).
This is probably the dirtiest function I´ve ever written, but it works.
StreamReader reader = new StreamReader("c:/yourFile.txt");
Dictionary<string, string> yourDic = new Dictionary<string, string>();
StreamReader reader = new StreamReader("c:/yourFile.txt");
Dictionary<string, string> yourDic = new Dictionary<string, string>();
while (reader.Peek() >= 0)
{
string line = reader.ReadLine();
string[] data = line.Split(':');
if (line != String.Empty)
{
for (int i = 0; i < data.Length - 1; i++)
{
if (i != 0)
{
bool isPair;
if (i % 2 == 0)
{
isPair = true;
}
else
{
isPair = false;
}
if (isPair)
{
string keyOdd = data[i].Trim();
try { keyOdd = keyOdd.Substring(keyOdd.IndexOf(' ')).TrimStart(); }
catch { }
string valueOdd = data[i + 1].TrimStart();
try { valueOdd = valueOdd.Remove(valueOdd.IndexOf(' ')); } catch{}
yourDic.Add(keyOdd, valueOdd);
}
else
{
string keyPair = data[i].TrimStart();
keyPair = keyPair.Substring(keyPair.IndexOf(' ')).Trim();
string valuePair = data[i + 1].TrimStart();
try { valuePair = valuePair.Remove(valuePair.IndexOf(' ')); } catch { }
yourDic.Add(keyPair, valuePair);
}
}
else
{
string key = data[i].Trim();
string value = data[i + 1].TrimStart();
try { value = value.Remove(value.IndexOf(' ')); } catch{}
yourDic.Add(key, value);
}
}
}
}
How does it works?, well splitting the line you can know what you can get in every position of the array, so I just play with the even and odd values.
You will understand me when you debug this function :D. It fills the Dictionary that you need.
I have another idea. Does values contain spaces? If not you could do like this:
Ignore white spaces until you read some other char (first char of key).
Read string until ':' occures.
Trim key that you get.
Ignore white spaces until you read some other char (first char of value).
Read until you get empty char.
Trim value that you get.
If it is the end than stop. Else, go back to step 1.
Good luck.
Maybe something like this would work, be careful with the ':' character
StreamReader reader = new StreamReader("c:/yourFile.txt");
Dictionary<string, string> yourDic = new Dictionary<string, string>();
while (reader.Peek() >= 0)
{
string line = reader.ReadLine();
yourDic.Add(line.Split(':')[0], line.Split(':')[1]);
}
Anyway, I recommend to organize that file in some way that you´ll always know in what format it comes.
I have html encoded strings in a database, but many of the character entities are not just the standard & and <. Entities like “ and —. Unfortunately we need to feed this data into a flash based rss reader and flash doesn't read these entities, but they do read the unicode equivalent (ex “).
Using .Net 4.0, is there any utility method that will convert the html encoded string to use unicode encoded character entities?
Here is a better example of what I need. The db has html strings like: <p>John & Sarah went to see $ldquo;Scream 4$rdquo;.</p> and what I need to output in the rss/xml document with in the <description> tag is: <p>John & Sarah went to see “Scream 4”.</p>
I'm using an XmlTextWriter to create the xml document from the database records similar to this example code http://www.dotnettutorials.com/tutorials/advanced/rss-feed-asp-net-csharp.aspx
So I need to replace all of the character entities within the html string from the db with their unicode equivilant because the flash based rss reader doesn't recognize any entities beyond the most common like &.
My first thought is, can your RSS reader accept the actual characters? If so, you can use HtmlDecode and feed it directly in.
If you do need to convert it to the numeric representations, you could parse out each entity, HtmlDecode it, and then cast it to an int to get the base-10 unicode value. Then re-insert it into the string.
EDIT:
Here's some code to demonstrate what I mean (it is untested, but gets the idea across):
string input = "Something with — or other character entities.";
StringBuilder output = new StringBuilder(input.Length);
for (int i = 0; i < input.Length; i++)
{
if (input[i] == '&')
{
int startOfEntity = i; // just for easier reading
int endOfEntity = input.IndexOf(';', startOfEntity);
string entity = input.Substring(startOfEntity, endOfEntity - startOfEntity);
int unicodeNumber = (int)(HttpUtility.HtmlDecode(entity)[0]);
output.Append("&#" + unicodeNumber + ";");
i = endOfEntity; // continue parsing after the end of the entity
}
else
output.Append(input[i]);
}
I may have an off-by-one error somewhere in there, but it should be close.
would HttpUtility.HtmlDecode work for you?
I realize it doesn't convert to unicode equivalent entities, but instead converts it to unicode. Is there a specific reason you want the unicode equivalent entities?
updated edit
string test = "<p>John & Sarah went to see “Scream 4”.</p>";
string decode = HttpUtility.HtmlDecode(test);
string encode = HttpUtility.HtmlEncode(decode);
StringBuilder builder = new StringBuilder();
foreach (char c in encode)
{
if ((int)c > 127)
{
builder.Append("&#");
builder.Append((int)c);
builder.Append(";");
}
else
{
builder.Append(c);
}
}
string result = builder.ToString();
you can download a local copy of the appropriate HTML and/or XHTML DTDs from the W3C. Then set up an XmlResolver and use it to expand any entities found in the document.
You could use a regular expression to find/expand the entities, but that won't know anything about context (e.g., anything in a CDATA section shouldn't be expanded).
this might help you put input path in textbox
try
{
FileInfo n = new FileInfo(textBox1.Text);
string initContent = File.ReadAllText(textBox1.Text);
int contentLength = initContent.Length;
Match m;
while ((m = Regex.Match(initContent, "[^a-zA-Z0-9<>/\\s(&#\\d+;)-]")).Value != String.Empty)
initContent = initContent.Remove(m.Index, 1).Insert(m.Index, string.Format("&#{0};", (int)m.Value[0]));
File.WriteAllText("outputpath", initContent);
}
catch (System.Exception excep)
{
MessageBox.Show(excep.Message);
}
}
I have an Excel spreadsheet being converted into a CSV file in C#, but am having a problem dealing with line breaks. For instance:
"John","23","555-5555"
"Peter","24","555-5
555"
"Mary,"21","555-5555"
When I read the CSV file, if the record does not starts with a double quote (") then a line break is there by mistake and I have to remove it. I have some CSV reader classes from the internet but I am concerned that they will fail on the line breaks.
How should I handle these line breaks?
Thanks everybody very much for your help.
Here's is what I've done so far. My records have fixed format and all start with
JTW;...;....;...;
JTW;...;...;....
JTW;....;...;..
..;...;... (wrong record, line break inserted)
JTW;...;...
So I checked for the ; in the [3] position of each line. If true, I write; if false, I'll append on the last (removing the line-break)
I'm having problems now because I'm saving the file as a txt.
By the way, I am converting the Excel spreadsheet to csv by saving as csv in Excel. But I'm not sure if the client is doing that.
So the file as a TXT is perfect. I've checked the records and totals. But now I have to convert it back to csv, and I would really like to do it in the program. Does anybody know how?
Here is my code:
namespace EditorCSV
{
class Program
{
static void Main(string[] args)
{
ReadFromFile("c:\\source.csv");
}
static void ReadFromFile(string filename)
{
StreamReader SR;
StreamWriter SW;
SW = File.CreateText("c:\\target.csv");
string S;
char C='a';
int i=0;
SR=File.OpenText(filename);
S=SR.ReadLine();
SW.Write(S);
S = SR.ReadLine();
while(S!=null)
{
try { C = S[3]; }
catch (IndexOutOfRangeException exception){
bool t = false;
while (t == false)
{
t = true;
S = SR.ReadLine();
try { C = S[3]; }
catch (IndexOutOfRangeException ex) { S = SR.ReadLine(); t = false; }
}
}
if( C.Equals(';'))
{
SW.Write("\r\n" + S);
i = i + 1;
}
else
{
SW.Write(S);
}
S=SR.ReadLine();
}
SR.Close();
SW.Close();
Console.WriteLine("Records Processed: " + i.ToString() + " .");
Console.WriteLine("File Created SucacessFully");
Console.ReadKey();
}
}
}
CSV has predefined ways of handling that. This site provides an easy to read explanation of the standard way to handle all the caveats of CSV.
Nevertheless, there is really no reason to not use a solid, open source library for reading and writing CSV files to avoid making non-standard mistakes. LINQtoCSV is my favorite library for this. It supports reading and writing in a clean and simple way.
Alternatively, this SO question on CSV libraries will give you the list of the most popular choices.
Rather than check if the current line is missing the (") as the first character, check instead to see if the last character is a ("). If it is not, you know you have a line break, and you can read the next line and merge it together.
I am assuming your example data was accurate - fields were wrapped in quotes. If quotes might not delimit a text field (or new-lines are somehow found in non-text data), then all bets are off!
There is a built-in method for reading CSV files in .NET (requires Microsoft.VisualBasic assembly reference added):
public static IEnumerable<string[]> ReadSV(TextReader reader, params string[] separators)
{
var parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(reader);
parser.SetDelimiters(separators);
while (!parser.EndOfData)
yield return parser.ReadFields();
}
If you're dealing with really large files this CSV reader claims to be the fastest one you'll find: http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader
I've used this piece of code recently to parse rows from a CSV file (this is a simplified version):
private void Parse(TextReader reader)
{
var row = new List<string>();
var isStringBlock = false;
var sb = new StringBuilder();
long charIndex = 0;
int currentLineCount = 0;
while (reader.Peek() != -1)
{
charIndex++;
char c = (char)reader.Read();
if (c == '"')
isStringBlock = !isStringBlock;
if (c == separator && !isStringBlock) //end of word
{
row.Add(sb.ToString().Trim()); //add word
sb.Length = 0;
}
else if (c == '\n' && !isStringBlock) //end of line
{
row.Add(sb.ToString().Trim()); //add last word in line
sb.Length = 0;
//DO SOMETHING WITH row HERE!
currentLineCount++;
row = new List<string>();
}
else
{
if (c != '"' && c != '\r') sb.Append(c == '\n' ? ' ' : c);
}
}
row.Add(sb.ToString().Trim()); //add last word
//DO SOMETHING WITH LAST row HERE!
}
Try CsvHelper (a library I maintain). It ignores empty rows. I believe there is a flag you can set in FastCsvReader to have it handle empty rows also.
Heed the advice from the experts and Don't roll your own CSV parser.
Your first thought is, "How do I handle new line breaks?"
Your next thought is, "I need to handle commas inside of quotes."
Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."
It's a road to madness. Don't write your own. Find a library with an extensive unit test coverage that hits all the hard parts and has gone through hell for you. For .NET, use the free CsvHelper library.
Maybe you could count for (") during the ReadLine(). If they are odd, that will raise the flag. You could either ignore those lines, or get the next two and eliminate the first "\n" occurrence of the merge lines.
What I usually do is read the text in character by character opposed to line by line, due to this very problem.
As you're reading each character, you should be able to figure out where each cell starts and stops, but also the difference between a linebreak in a row and in a cell: If I remember correctly, for Excel generated files anyway, rows start with \r\n, and newlines in cells are only \r.
There is an example parser is c# that seems to handle your case correctly. Then you can read your data in and purge the line breaks out of it post-read.
Part 2 is the parser, and there is a Part 1 that covers the writer portion.
Read the line.
Split into columns(fields).
If you have enough columns expected for each line, then process.
If not, read the next line, and capture the remaining columns until you get what you need.
Repeat.
A somewhat simple regular expression could be used on each line. When it matches, you process each field from the match. When it doesn't find a match, you skip that line.
The regular expression could look something like this.
Match match = Regex.Match(line, #"^(?:,?(?<q>['"](?<field>.*?\k'q')|(?<field>[^,]*))+$");
if (match.Success)
{
foreach (var capture in match.Groups["field"].Captures)
{
string fieldValue = capture.Value;
// Use the value.
}
}
Have a look at FileHelpers Library
It supports reading\writing CSV with line breaks as well as reading\writing to excel
The LINQy solution:
string csvText = File.ReadAllText("C:\\Test.txt");
var query = csvText
.Replace(Environment.NewLine, string.Empty)
.Replace("\"\"", "\",\"").Split(',')
.Select((i, n) => new { i, n }).GroupBy(a => a.n / 3);
You might also check out my CSV parser SoftCircuits.CsvParser on NuGet. It will not only parse a CSV file but--if wanted--can also automatically map column values to your class properties. And it runs nearly four times faster than CsvHelper.
For a line break to exist in a CSV, there must be an open double quote that's not closed.
Assuming that all CSVs cells must open and close a double quote, just check if there's an odd number of quotation marks
my_string.Count(c => c == '"') % 2 == 1
and if that's the case, continue reading until you have the even number.
I have a big string (let's call it a CSV file, though it isn't actually one, it'll just be easier for now) that I have to parse in C# code.
The first step of the parsing process splits the file into individual lines by just using a StreamReader object and calling ReadLine until it's through the file. However, any given line might contain a quoted (in single quotes) literal with embedded newlines. I need to find those newlines and convert them temporarily into some other kind of token or escape sequence until I've split the file into an array of lines..then I can change them back.
Example input data:
1,2,10,99,'Some text without a newline', true, false, 90
2,1,11,98,'This text has an embedded newline
and continues here', true, true, 90
I could write all of the C# code needed to do this by using string.IndexOf to find the quoted sections and look within them for newlines, but I'm thinking a Regex might be a better choice (i.e. now I have two problems)
Since this isn't a true CSV file, does it have any sort of schema?
From your example, it looks like you have:
int, int, int, int, string , bool, bool, int
With that making up your record / object.
Assuming that your data is well formed (I don't know enough about your source to know how valid this assumption is); you could:
Read your line.
Use a state machine to parse your data.
If your line ends, and you're parsing a string, read the next line..and keep parsing.
I'd avoid using a regex if possible.
State-machines for doing such a job are made easy using C# 2.0 iterators. Here's hopefully the last CSV parser I'll ever write. The whole file is treated as a enumerable bunch of enumerable strings, i.e. rows/columns. IEnumerable is great because it can then be processed by LINQ operators.
public class CsvParser
{
public char FieldDelimiter { get; set; }
public CsvParser()
: this(',')
{
}
public CsvParser(char fieldDelimiter)
{
FieldDelimiter = fieldDelimiter;
}
public IEnumerable<IEnumerable<string>> Parse(string text)
{
return Parse(new StringReader(text));
}
public IEnumerable<IEnumerable<string>> Parse(TextReader reader)
{
while (reader.Peek() != -1)
yield return parseLine(reader);
}
IEnumerable<string> parseLine(TextReader reader)
{
bool insideQuotes = false;
StringBuilder item = new StringBuilder();
while (reader.Peek() != -1)
{
char ch = (char)reader.Read();
char? nextCh = reader.Peek() > -1 ? (char)reader.Peek() : (char?)null;
if (!insideQuotes && ch == FieldDelimiter)
{
yield return item.ToString();
item.Length = 0;
}
else if (!insideQuotes && ch == '\r' && nextCh == '\n') //CRLF
{
reader.Read(); // skip LF
break;
}
else if (!insideQuotes && ch == '\n') //LF for *nix-style line endings
break;
else if (ch == '"' && nextCh == '"') // escaped quotes ""
{
item.Append('"');
reader.Read(); // skip next "
}
else if (ch == '"')
insideQuotes = !insideQuotes;
else
item.Append(ch);
}
// last one
yield return item.ToString();
}
}
Note that the file is read character by character with the code deciding when newlines are to be treated as row delimiters or part of a quoted string.
What if you got the whole file into a variable then split that based on non-quoted newlines?
EDIT: Sorry, I've misinterpreted your post. If you're looking for a regex, then here is one:
content = Regex.Replace(content, "'([^']*)\n([^']*)'", "'\1TOKEN\2'");
There might be edge cases and that two problems but I think it should be ok most of the time. What the Regex does is that it first finds any pair of single quotes that has \n between it and replace that \n with TOKEN preserving any text in-between.
But still, I'd go state machine like what #bryansh explained below.