Deleting line breaks in a text file with a condition - c#

I have a text file. Some of the lines in it end with lf and some end with crlf. I only need to delete lfs and leave all crlfs.
Basically, my file looks like this
Mary had a lf
dog.crlf
She liked her lf
dog very much. crlf
I need it to be
Mary had a dog.crlf
She liked her dog very much.crlf
Now, I tried just deleting all lfs unconditionally, but then I couldn't figure out how to write it back into the text file. If I use File.WriteAllLines and put a string array into it, it automatically creates line breaks all over again. If I use File.WriteAllText, it just forms one single line.
So the question is - how do I take a text file like the first and make it look like the second? Thank you very much for your time.
BTW, I checked similar questions, but still have trouble figuring it out.

Use regex with a negative look-behind and only replace the \n not preceded by a \r:
var result = Regex.Replace(sampleFileContent, @"(?<!\r)\n", String.Empty);
The (?<! ... ) is a negative look-behind. It means that we only want to replace instances of \n when there isn't a \r directly behind it.
Disclaimer: How viable this option is depends on the size of your file(s). It is a good solution if you're not concerned with overhead or you're doing some quick fixes, but I'd look into a more robust parser if the files are going to be huge.
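To tie this back to the "how do I write it back" part of the question, here is a minimal sketch of the full round trip. The class and method names (LineBreakCleaner, RemoveBareLineFeeds, CleanFile) are made up for illustration, and it assumes the file is small enough to hold in memory. File.WriteAllText writes the string as-is, so the CRLF pairs that survive the replacement remain the only line breaks in the output.

```csharp
using System.IO;
using System.Text.RegularExpressions;

public static class LineBreakCleaner
{
    // Removes every "\n" that is not directly preceded by "\r"; CRLF pairs stay intact.
    public static string RemoveBareLineFeeds(string content)
    {
        return Regex.Replace(content, @"(?<!\r)\n", string.Empty);
    }

    // Reads the whole file, cleans it, and writes the result to a second file.
    // Both paths are supplied by the caller (hypothetical helper, not a library API).
    public static void CleanFile(string inputPath, string outputPath)
    {
        string content = File.ReadAllText(inputPath);
        File.WriteAllText(outputPath, RemoveBareLineFeeds(content));
    }
}
```

For example, LineBreakCleaner.CleanFile("test.txt", "test_replaced.txt") would leave the original file untouched and write the cleaned text next to it.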

This is an alternative to Brad Christie's answer, which doesn't use Regex.
String result = sampleFileContent.Replace("\r\n", "**newline**")
.Replace("\n","")
.Replace("**newline**","\r\n");
Here's a demo. Seems faster than the regex solution according to this site, but uses a bit more memory.

Just tested it:
string file = File.ReadAllText("test.txt");
file = file.Replace("\r", "");
File.WriteAllText("test_replaced.txt", file);

Related

C# Regex filter problems

I posted a question earlier about the same kind of Regex problem. It has given me headaches; I have looked up loads of documentation on how to use regex, but I still could not put my finger on it. I wouldn't want to waste another 6 hours trying to filter (I think) simple expressions.
So basically what I want to do is filter all file types with HTML extensions (the '*' stars come from a WinForms TabControl and signify that the file has been modified). I also need the match to ignore case:
.html, .htm, .shtml, .shtm, .xhtml
.html*, .htm*, .shtml*, .shtm*, .xhtml*
Also filtering some CSS files:
.css
.css*
And some SQL Files:
.sql, .ddl, .dml
.sql*, .ddl*, .dml*
My previous question got an answer to filtering Python files:
.py, .py3, .pyi, .pyx, .pyw
Expression would be: \.py[3ixw]?\*?$
But when I tried to learn from the expression above, I would always end up being able to open only .xhtml files; the rest were not valid.
For the HTML expression, I currently have this: \.html|.html|.shtml|.shtm|.xhtml\*?$ with RegexOptions.IgnoreCase. But the output will only allow .xhtml, case sensitive or insensitive. .html files, .htm and the rest did not match. I would really appreciate an explanation of each of the expressions you provide (so I don't have to ask the same question ever again).
Thank you.
For such cases you may start with a simple regex that can be simplified step by step down to a good regex expression:
In C# this would basically, with IgnoreCase, be
Regex myRegex = new Regex("PATTERN", RegexOptions.IgnoreCase);
Now the pattern: The easiest one is simply concatenating all valid results with OR, escaping where needed:
\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html*|\.htm*|\.shtml*|\.shtm*|\.xhtml*
With .html* you mean .html + anything, which is written as .* (any character, zero or more times) in regex.
\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html.*|\.htm.*|\.shtml.*|\.shtm.*|\.xhtml.*
Then, you may take all repeating patterns and group them together. All file endings start with a dot, and ending.* already matches the plain ending as well, so the alternatives without .* are redundant:
\.(html|htm|shtml|shtm|xhtml).*
Then, I see htm pretty often, so I try to extract that. Taking all possible characters before and after htm together (? means 0 or 1 appearance):
\.(s|x)?(htm)l?.*
And I always check whether it's still working in regexstorm for .NET.
That way, you may also get regular expressions for the other two groups and concatenate them all together in the end.
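A hedged variant, in case the asterisk is meant literally (the question describes it as the "modified" marker in the tab title) rather than as "anything": the sketch below escapes the asterisk and anchors the pattern at the end of the string. The class and method names are hypothetical. Note that it also accepts .xhtm, which is not in the original list.

```csharp
using System.Text.RegularExpressions;

public static class ExtensionFilters
{
    // \.        literal dot
    // (s|x)?    optional "s" or "x" prefix
    // htm       required core
    // l?        optional trailing "l"
    // \*?       optional literal asterisk (the "modified" marker)
    // $         end of the input
    private static readonly Regex HtmlFile =
        new Regex(@"\.(s|x)?html?\*?$", RegexOptions.IgnoreCase);

    // Returns true when the tab title ends in one of the HTML extensions.
    public static bool IsHtmlFile(string tabTitle)
    {
        return HtmlFile.IsMatch(tabTitle);
    }
}
```

The same shape should carry over to the other groups, for instance \.css\*?$ and \.(sql|ddl|dml)\*?$.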

Field and text delimiters within cells in csv files

This is likely a very basic question that I could not, despite trying, find a satisfying answer to. Feel free to skip to the question at the end if you aren't interested in the background.
The task:
I wish to create an easy localisation solution for my Unity projects. After some initial research I concluded it would be best to use a .csv file read by a StreamReader, so that translators would only ever have to interact with the csv table, where information is neatly organized.
The main problem:
Due to the nature of the text, I need to account for line breaks and special characters in the actual fields. As such I could not use the normal ReadLine() method.
I worked around this by using Read() and checking whether a line break is within a text delimiter bracket. But as I check for the text delimiter, I am afraid it might run into an unescaped delimiter that is part of the normal in-cell text (since the normal text delimiter is quotation marks).
So I switched the delimiter to §. But now every time I open the file I have to re-enter § as a text delimiter in OpenOfficeCalc, probably due to encoding differences. Which is annoying but not the end of the world.
My question:
How does OpenOffice (or similar software) usually tell in-cell commas/quotation marks apart from the ones used as delimiters? If I knew that, I could probably incorporate a similar approach in my reading of the file.
I've tried to look at the files with Notepad++, revealing a difference in line breaks (\r instead of \r\n), and obviously it's within a text delimiter bracket, but when it comes to how it separates its delimiters from ones just entered in the text/field, I am drawing a blank.
Translation file in OpenOffice Calc:
Translation file in NotePad++, showing all characters:
I'd appreciate any insight or links on the topic.
From https://en.wikipedia.org/wiki/Comma-separated_values:
The CSV file format is not fully standardized. The basic idea of separating fields with a comma is clear, but that idea gets complicated when the field data may also contain commas or even embedded line breaks.
LibreOffice Calc has a reasonable way to handle these things.
Use LF for line breaks and CR at the end of each record. It seems your code already handles this.
Use quotes to delimit strings when needed. If the string contains one or more quotes, then duplicate the quote to make it literal.
From the example in your question, it looks like you told Calc not to use any quotes as string delimiters. Why did you do this? When I tried it, LibreOffice (or Apache OpenOffice) showed the fields in different columns after opening the file saved that way.
The following example CSV file has fields that contain commas, quotes and line breaks.
When viewed in Calc:
    | A       | B
----+---------+---
  1 | 1,",2", | 3
----+---------+---
  2 | a       | c
    | b       |
Calc correctly reads and saves the file as shown below. Settings when saving are Field delimiter , and String delimiter " which are the defaults.
"1,"",2"",",3[CR]
"a
b",c[CR]
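The quoting rules above can be sketched as a small character-by-character reader. This is only an illustration under stated assumptions: the class name QuotedCsv is made up, the field delimiter is a comma, the string delimiter is a double quote, and an unquoted \r, \n or \r\n ends a record. Inside quotes, commas and line breaks are kept as ordinary text, and a doubled quote becomes one literal quote.

```csharp
using System.Collections.Generic;
using System.Text;

public static class QuotedCsv
{
    // Splits CSV text into records and fields. Inside a quoted field, commas and
    // line breaks are ordinary characters and "" stands for one literal quote.
    public static List<List<string>> Parse(string text)
    {
        var records = new List<List<string>>();
        var record = new List<string>();
        var field = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < text.Length && text[i + 1] == '"')
                {
                    field.Append('"'); // doubled quote -> one literal quote
                    i++;
                }
                else if (c == '"')
                {
                    inQuotes = false; // closing quote
                }
                else
                {
                    field.Append(c); // commas and line breaks are plain text here
                }
            }
            else if (c == '"')
            {
                inQuotes = true; // opening quote
            }
            else if (c == ',')
            {
                record.Add(field.ToString());
                field.Clear();
            }
            else if (c == '\r' || c == '\n')
            {
                // Treat "\r\n" as a single record terminator.
                if (c == '\r' && i + 1 < text.Length && text[i + 1] == '\n') i++;
                record.Add(field.ToString());
                field.Clear();
                records.Add(record);
                record = new List<string>();
            }
            else
            {
                field.Append(c);
            }
        }

        // Flush a final record that has no trailing line terminator.
        if (field.Length > 0 || record.Count > 0)
        {
            record.Add(field.ToString());
            records.Add(record);
        }
        return records;
    }
}
```

Because the reader tracks whether it is inside quotes, a quotation mark or comma typed into a cell can never be confused with a delimiter, as long as the file was written with the doubling rule.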

Can't replace single whitespace with string.Replace()

I have run into a problem I do not understand. I am reading data from a file and have run into a situation where string.Replace(" ", "<whatever>") on an entry from the file will not replace the occurrence of a single whitespace. I cannot help but feel there is something very basic that I have missed, since the same kind of string declared as a literal works fine.
A typical line from the file (each entry is separated by a tab):
"2016-feb-08 09:54:00" "2016-feb-08 17:28:00" "Short" "227" "5 170,00" "+3,90%" "0,00"
The data from the file is read with File.ReadAllLines(), and each line is split into an array using Split(new[] {"\t" }, StringSplitOptions.None);.
I then want to clean up the fifth entry for further processing, and this is when I run into the problem:
entries[4].Replace(" ", string.Empty).Replace("\"", string.Empty); gives "5 170,00"
Regex.Replace(entries[4], @"\s+", string.Empty).Replace("\"", string.Empty); gives "5170,00", which is the result I am looking for.
Running the first Replace() on a literal with a single space works fine, so I am curious whether the whitespace inside the strings from the file is different somehow. And while the Regex solution works, I really want to know what my "issue" is.
You can use code like the one below to check the hex values of the characters. A normal space is 0x20, which is the value that shows up between the five and the one in the text you posted.
string input = "2016-feb-08 09:54:00 2016-feb-08 17:28:00 Short 227 5 170,00 +3,90% 0,00";
byte[] output = Encoding.UTF8.GetBytes(input);
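One plausible cause, and this is an assumption since the original file is not available, is that the file uses a non-breaking space (U+00A0) as the thousands separator. string.Replace(" ", ...) only matches U+0020, whereas the regex class \s in .NET also matches other Unicode space separators. The sketch below, with made-up names, lists the code point of every character so the difference becomes visible.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WhitespaceProbe
{
    // Lists each character with its code point, e.g. "U+0035 '5'".
    public static List<string> Describe(string value)
    {
        var result = new List<string>();
        foreach (char c in value)
        {
            result.Add(string.Format("U+{0:X4} '{1}'", (int)c, c));
        }
        return result;
    }

    // Removes everything the regex class \s matches, which covers more than U+0020.
    public static string StripAllWhitespace(string value)
    {
        return Regex.Replace(value, @"\s+", string.Empty);
    }
}
```

Calling WhitespaceProbe.Describe(entries[4]) on the value read from the file, rather than on text pasted into a post, should show whether the character between the 5 and the 1 is really U+0020.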

Using RegEx to read through a CSV file

I have a CSV file, with the following type of data:
0,'VT,C',0,
0,'C,VT',0,
0,'VT,H',0,
and I desire the following output
0
VT,C
0
0
C,VT
0
0
VT,H
0
Therefore splitting the string on the comma however ignoring the comma within quote marks. At the moment I'm using the following RegEx:
"(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)"
however this gives me the result of:
0
VT
C
0
0
C
VT
0
0
VT
H
0
This shows the RegEx is not reading the quote mark properly. Can anyone suggest some alterations that might help?
Usually when it comes to CSV parsing, people use specific libraries well suited for the programming language they are using to code their application.
Anyway if you are going to use a regular expression to make a really loose(!) parsing you may try using something like this:
'(?<value>[^']*?)'
It will match anything in between single quotes, and assuming the csv file is well formed, it will not miss a field. Of course it doesn't accept embedded quotes but it easily gets the job done. That's what I use when I need to get the job done really quickly. Please don't consider it a complete solution to your problem...it just works in special conditions when the requirements are what you described and the input is well formed.
[EDIT]
I was re-reading your question and noticed that you also want to include non-quoted fields. In that case my expression will not work at all. If you think hard about your problem, you'll find it is quite difficult to solve without ambiguity: you need fixed rules, and if you allow both quoted and unquoted fields, the parser will have a hard time telling separator commas from commas inside quoted values.
Another expression to model such a solution may be:
('[^']+'|[^,]+),?
It will match both quoted and unquoted fields, though I'm not sure whether it needs to assume the CSV adheres to strict conditions. As far as I can tell, that will be much safer than a split strategy: you just need to collect all matches and append each matched value + \r\n to your target string.
This regex is based on the fact that you have 1 digit before and after your 'value'
Regex.Replace(input, @"(?:(?<=\d),|,(?=\d))", "\n");
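The "collect all matches" idea can be sketched as follows. The class name and the named groups (quoted, plain) are hypothetical, and the pattern is a slight variation of the one above that captures the quoted content without the quotes. It assumes well-formed input with no quote characters embedded inside a quoted field, and it silently skips empty fields.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SingleQuotedFields
{
    // Either a quoted field (content captured without the quotes)
    // or a run of characters that contains no comma.
    private static readonly Regex Field =
        new Regex(@"'(?<quoted>[^']*)'|(?<plain>[^,]+)");

    // Returns the fields of one line in order of appearance.
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        foreach (Match m in Field.Matches(line))
        {
            fields.Add(m.Groups["quoted"].Success
                ? m.Groups["quoted"].Value
                : m.Groups["plain"].Value);
        }
        return fields;
    }
}
```

Joining the result with string.Join("\r\n", fields) would then give one value per line, as in the desired output.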
You can test it out on RegexStorm
foreach(var m in Regex.Matches(s,"(('.*?')|[0-9])"))
I have managed to get the following method to read the file as required:
public List<string> SplitCSV(string input, List<string> line)
{
    Regex csvSplit = new Regex("(([^,^\'])*(\'.*\')*([^,^\'])*)(,|$)", RegexOptions.Compiled);
    foreach (Match match in csvSplit.Matches(input))
    {
        line.Add(match.Value.TrimStart(','));
    }
    return line;
}
Thanks for everyone's help, though.

How can I delete all the comments in my code?

I'm new to C#, and want to develop a program with which I could delete the comments after // in my code. Is there any simple code recommended for this purpose?
It has been suggested that you just search for "//" and trim.
Because you have limited yourself to single-line comments, this seems like a relatively simple exercise. However, it has some tricky cases you need to think about if you intend the output of the program to be a valid C# application with behavior identical to the input program.
Here are some examples where just searching for "//" and trimming won't work.
Comment in Literal:
string foo = "this is // not a comment";
Comment in Comment
/* you should not trim // this one */
Comment in Comment Part Deux
// This is a comment // so don't just remove this!
Multi-line Comment Adjacency
/* you should not *//* trim this these */
There are certainly other edge cases but these are some low-hanging fruit to think about.
First point, this seems like a bad idea. Comments are useful.
Taking it as an exercise,
Edit: This is a simple solution that will fail on all the cases @Bubbafat mentions (and probably some more). It would still work OK on most source files.
1. read the text one line at a time.
2. find the last occurrence of //, if any, using String.LastIndexOf()
3. remove the text after (including) the '//' when found
4. write the line to the output
ad 1: You can open a TextReader using System.IO.File.OpenText(), or use File.ReadLines() if you can use .NET Framework 4.
Also open an output file using System.IO.File.CreateText()
ad 3: int pos = line.LastIndexOf("//"); if (pos >= 0) { line = line.Substring(0, pos); }
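Put together, the steps above might look like the sketch below. The class and method names are invented for illustration, and it deliberately keeps the naive behavior described in this answer, so it will still mangle the edge cases listed earlier (a // inside a string literal, inside a /* */ comment, or a second // within a comment).

```csharp
using System.IO;

public static class CommentStripper
{
    // Cuts the line at the last "//"; lines without "//" are returned unchanged.
    public static string StripLineComment(string line)
    {
        int pos = line.LastIndexOf("//", System.StringComparison.Ordinal);
        return pos >= 0 ? line.Substring(0, pos) : line;
    }

    // Reads the input lazily, line by line, and writes the stripped lines out.
    public static void StripFile(string inputPath, string outputPath)
    {
        using (StreamWriter writer = File.CreateText(outputPath))
        {
            foreach (string line in File.ReadLines(inputPath))
            {
                writer.WriteLine(StripLineComment(line));
            }
        }
    }
}
```

Writing to a separate output file rather than overwriting the source keeps the original around, which seems prudent given how many inputs this simple rule gets wrong.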
