The typical way to edit a text or source file from code is to read the file using File.ReadAllLines or File.ReadAllText, make your changes, and then write it out using WriteAllLines or WriteAllText.
However, if you were to open the text file (say some source code file) in Visual Studio or Notepad++, scroll down a few lines, make a change, and save, a lot more is handled.
It seems that what is handled, at least on Windows, is a complicated set of rules and heuristics that takes into account, at a minimum:
The inferred encoding of the text file.
The line-endings
Whether the last line is an "incomplete line" (as described in the diffutils manual), namely a line with no line-ending character(s).
I'll discuss these partially, just to illustrate the complexity. My question is: is there a full set of heuristics, an already-established algorithm that can be used, or an existing component that encapsulates all of this?
Inferred Encoding
Most common for source / text files:
UTF-16 with BOM
UTF-8 with BOM
UTF-8 without BOM
When there's no BOM, the encoding is inferred using some heuristics.
It could be ASCII, Windows-1252 (Encoding.GetEncoding(1252)), or BOM-less UTF-8.
It depends on what the rest of the data looks like: whether there are any bytes above 0x7F, and whether those bytes form valid UTF-8 sequences.
When you save, you need to keep the same encoding.
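As far as I can tell there is no single documented algorithm for this, but a minimal sketch of the kind of sniffing involved might look like the following. The class and method names are mine, the heuristics are deliberately simplistic, and it only distinguishes the cases listed above:

using System.IO;
using System.Text;

static class EncodingSniffer
{
    // Hypothetical helper: infer the encoding of a file so it can be saved back the same way.
    // On .NET Core / .NET 5+, Encoding.GetEncoding(1252) additionally requires registering
    // CodePagesEncodingProvider from the System.Text.Encoding.CodePages package.
    public static Encoding Detect(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);

        // BOM checks first.
        if (bytes.Length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
            return Encoding.Unicode;              // UTF-16 LE with BOM
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
            return Encoding.BigEndianUnicode;     // UTF-16 BE with BOM
        if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
            return Encoding.UTF8;                 // UTF-8 with BOM

        // No BOM: if every byte above 0x7F is part of a valid UTF-8 sequence,
        // treat the file as BOM-less UTF-8; otherwise fall back to Windows-1252.
        try
        {
            new UTF8Encoding(false, throwOnInvalidBytes: true).GetString(bytes);
            return new UTF8Encoding(false);
        }
        catch (DecoderFallbackException)
        {
            return Encoding.GetEncoding(1252);
        }
    }
}

Reading and later saving the file with whatever Detect returns keeps the BOM situation intact, since the returned Encoding objects emit a preamble only when the original file had one.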
Line endings
You have to keep the same line-endings. So if the file uses CR/LF, then keep it at CR/LF.
But when it's just LF, then keep that.
But it can get more complicated than that, as a given text file may contain both, and one would need to maintain that as well.
For example, a source file that's CR/LF overall may have a section inside it that is LF-ended only.
This can happen when someone pastes text from another tool into a multi-line string literal, such as C#'s @"" verbatim strings.
Visual Studio handles this correctly.
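As a small illustration, here is a sketch (the method name is mine) that reports which endings a decoded file actually contains, so a save routine at least knows when it is dealing with the mixed case:

// Counts CR/LF pairs, lone LFs and lone CRs in already-decoded text.
static (int CrLf, int Lf, int Cr) CountLineEndings(string text)
{
    int crlf = 0, lf = 0, cr = 0;
    for (int i = 0; i < text.Length; i++)
    {
        if (text[i] == '\r')
        {
            if (i + 1 < text.Length && text[i + 1] == '\n') { crlf++; i++; }
            else cr++;
        }
        else if (text[i] == '\n')
        {
            lf++;
        }
    }
    return (crlf, lf, cr);
}

A file where more than one of these counts is non-zero is exactly the mixed case described above, and any "normalize on save" behaviour would silently rewrite it.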
Incomplete lines
If the last line is incomplete, that has to be maintained as well. That is, if the last line doesn't end with end-of-line character(s), the edited file must not gain one.
Possible approach
I think one way to get around all of these problems from the start is to treat the file as binary instead of text. This means the normal text-file processing in .NET cannot be used. A new set of APIs will be needed to handle editing such files.
I can imagine a component that requires you to open the file as a memory stream and pass that to the component. The component can then read the stream and provide a line-oriented view to clients, so that client code can iterate over the lines for processing. Each element in the iteration would be an object of a type that looks something like this:
class LineElement
{
int originalLineNumber;
string[] lines;
string[] lineEndings;
}
As an example for a normal text file on Windows:
originalLineNumber will be 1
lines will be a single-element array containing the first line of the file, without its line ending
lineEndings[0] will be "\x0D\x0A"
the lines field can be modified: it can be replaced with an empty array to delete the line,
or it can be replaced with a multi-element array to insert lines (replacing the existing line).
The lineEndings array is handled similarly.
In many cases, lines aren't removed or inserted, in which case the application code never has to deal with line endings at all; it simply operates on the lines[] array and ignores the lineEndings[] array.
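To make the idea concrete, here is a rough sketch of the splitting and re-joining such a component would need. It keeps one entry per original line (rather than grouping runs into arrays as in LineElement above), and all the names are hypothetical:

using System.Collections.Generic;
using System.Text;

// One record per original line, keeping its exact ending ("" for an incomplete
// last line) so the file can be written back byte-for-byte after re-encoding.
class Line
{
    public int OriginalLineNumber;
    public string Text;      // the line without its ending
    public string Ending;    // "\r\n", "\n", "\r", or "" for an incomplete last line
}

static class LinePreservingText
{
    public static List<Line> Split(string content)
    {
        var result = new List<Line>();
        int start = 0, lineNo = 1;
        for (int i = 0; i < content.Length; i++)
        {
            string ending = null;
            if (content[i] == '\r')
                ending = (i + 1 < content.Length && content[i + 1] == '\n') ? "\r\n" : "\r";
            else if (content[i] == '\n')
                ending = "\n";

            if (ending != null)
            {
                result.Add(new Line { OriginalLineNumber = lineNo++, Text = content.Substring(start, i - start), Ending = ending });
                i += ending.Length - 1;
                start = i + 1;
            }
        }
        if (start < content.Length)   // incomplete last line
            result.Add(new Line { OriginalLineNumber = lineNo, Text = content.Substring(start), Ending = "" });
        return result;
    }

    public static string Join(IEnumerable<Line> lines)
    {
        var sb = new StringBuilder();
        foreach (var line in lines)
            sb.Append(line.Text).Append(line.Ending);
        return sb.ToString();
    }
}

Round-tripping an unmodified file through Split and Join (with the original encoding) should reproduce the original bytes exactly, which makes a convenient sanity test for the whole approach.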
I'm open to other suggestions.
Related
This is likely a very basic question that I could not, despite trying, find a satisfying answer to. Feel free to skip to the question at the end if you aren't interested in the background.
The task:
I wish to create an easy localisation solution for my Unity projects. After some initial research I concluded it would be best to use a .csv file read by a StreamReader, so that translators would only ever have to interact with the CSV table, where information is neatly organized.
The main problem:
Due to the nature of the text, I need to account for line breaks and special characters in the actual fields. As such I could not use the normal ReadLine() method.
I worked around this by using Read() and checking whether a line break falls within a text-delimiter bracket. But since I check for the text delimiter, I am afraid it might run into an unescaped delimiter that is part of the normal in-cell text (the normal text delimiter being quotation marks).
So I switched the delimiter to §. But now every time I open the file I have to re-enter § as the text delimiter in OpenOffice Calc, probably due to encoding differences. This is annoying, but not the end of the world.
My question:
How does OpenOffice (or similar software) usually tell in-cell commas/quotation marks apart from the ones used as delimiters? If I knew that, I could probably incorporate a similar approach in my reading of the file.
I've tried looking at the files with Notepad++, revealing a difference in line breaks (\r instead of \r\n), and obviously it's within a text-delimiter bracket, but when it comes to how it separates its delimiters from ones simply entered in the text/field, I am drawing a blank.
Translation file in OpenOffice Calc:
Translation file in NotePad++, showing all characters:
I'd appreciate any insight or links on the topic.
From https://en.wikipedia.org/wiki/Comma-separated_values:
The CSV file format is not fully standardized. The basic idea of separating fields with a comma is clear, but that idea gets complicated when the field data may also contain commas or even embedded line breaks.
LibreOffice Calc has a reasonable way to handle these things.
Use LF for line breaks and CR at the end of each record. It seems your code already handles this.
Use quotes to delimit strings when needed. If the string contains one or more quotes, then duplicate the quote to make it literal.
From the example in your question, it looks like you told Calc not to use any quotes as string delimiters. Why did you do this? When I tried it, LibreOffice (or Apache OpenOffice) showed the fields in different columns after opening the file saved that way.
The following example CSV file has fields that contain commas, quotes and line breaks.
When viewed in Calc:
       A           B
    ---------    -----
1 |  1,",2",       3
    ---------    -----
2 |  a             c
  |  b
Calc correctly reads and saves the file as shown below. The settings when saving are Field delimiter = , and String delimiter = ", which are the defaults.
"1,"",2"",",3[CR]
"a
b",c[CR]
I am trying to print a string from a binary file to the screen using XAML labels, but when I display the file content I get a single beautiful "corrupted" character instead of the entire file content.
I think the problem is in reading the file. I can already change the label content using the most basic technique, and it worked pretty well until today...
label.Text = mystring;
The thing is: I have data in my binary files that isn't text (some random data I don't care about) located at the start of the file. My theory is that my program starts reading, hits a non-ASCII character, and stops reading...
I read using the File class, which may be the wrong thing...
label.Text = File.ReadAllText(my_file);
So I'm stuck now. I don't know exactly what I'm supposed to do...
Hope you can help me :D
I can't tell much without looking at the text, but it seems you need to specify the encoding.
Something like this:
string myText = File.ReadAllText(path, Encoding.Default);
You need to know how your binary file is structured, and you need to know the encoding of the strings. A normal text file usually has a marker at the beginning, two or so bytes (the BOM), that identifies its encoding if it is Unicode. This way the system can tell whether it's UTF-8, UTF-16, ...
If you try to read a binary file, this information is not there; instead, the reading process will most likely run into unexpected binary data. So you cannot read a binary file as text. If your file is structured so that binary data comes first and only text follows, just skip the first part and start reading at the beginning of the second part (a sketch of that follows the list below). But I don't think it is that easy:
if it really is binary data, chances are that the file structure is much more complicated and you need to do more work to read it.
if only the first two bytes are binary data, then maybe it's a text file after all and you can read it without problems; you may only need to pass the right encoding to the reading function
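If it does turn out to be "binary header first, text afterwards", a hedged sketch of the skip-and-read idea could look like this; the header length and the UTF-8 assumption are placeholders you would have to confirm for your actual file format:

using System.IO;
using System.Text;

// Assumption: the first headerLength bytes are binary, everything after is UTF-8 text.
static string ReadTextAfterHeader(string path, int headerLength)
{
    using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
    {
        stream.Seek(headerLength, SeekOrigin.Begin);
        using (var reader = new StreamReader(stream, Encoding.UTF8))
            return reader.ReadToEnd();
    }
}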
I've searched everywhere for this answer so hopefully it's not a duplicate. I decided I'm just finally going to ask it here.
I have a file named Program1.exe. When I drag that file into Notepad or Notepad++ I get all kinds of random symbols and then some readable text. However, when I try to read this file in C#, I either get inaccurate results or just a big MZ. I've tried all the encodings supported in C#. How can the Notepad programs read a file like this when I simply can't? I try to convert bytes to a string and it doesn't work. I try to read it directly line by line and it doesn't work. I've even tried binary and it doesn't work.
Thanks for the help! :)
Reading a binary file as text is a peculiar thing to do, but it is possible. Any of the 8-bit encodings will do it just fine. For example, the code below opens and reads an executable and outputs it to the console.
const string fname = #"C:\mystuff\program.exe";
using (var sw = new StreamReader(fname, Encoding.GetEncoding("windows-1252")))
{
var s = sw.ReadToEnd();
s = s.Replace('\x0', ' '); // replace NUL bytes with spaces
Console.WriteLine(s);
}
The result is very similar to what you'll see in Notepad or Notepad++. The "funny symbols" will differ based on how your console is configured, but you get the idea.
By the way, if you examine the string in the debugger, you're going to see something quite different. Those funny symbols are encoded as C# character escapes. For example, nul bytes (value 0) will display as \0 in the debugger, as NUL in Notepad++, and as spaces on the console or in Notepad. Newlines show up as \r in the debugger, etc.
As I said, reading a binary file as text is pretty peculiar. Unless you're just looking to see if there's human-readable data in the file, I can't imagine why you'd want to do this.
Update
I suspect the reason that all you see in the Windows Forms TextBox is "MZ" is that the Windows textbox control (which is what the TextBox ultimately uses) uses the NUL character as a string terminator, so it won't display anything after the first NUL. And the first thing after the "MZ" is a NUL (shows as \0 in the debugger). You'll have to replace the 0's in the string with spaces. I edited the code example above to show how you'd do that.
The exe is a binary file and if you try to read it as a text file you'll get the effect that you are describing. Try using something like a FileStream instead that does not care about the structure of the file but treats it just as a series of bytes.
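A minimal sketch of that byte-oriented approach (using File.ReadAllBytes for brevity rather than a long-lived FileStream) is to read the raw bytes and pull out only the runs of printable characters, roughly what the Unix strings tool does:

using System.Collections.Generic;
using System.IO;
using System.Text;

// Extract runs of printable ASCII from a binary file, ignoring everything else.
static IEnumerable<string> PrintableRuns(string path, int minLength = 4)
{
    byte[] bytes = File.ReadAllBytes(path);
    var current = new StringBuilder();
    foreach (byte b in bytes)
    {
        if (b >= 0x20 && b < 0x7F)
        {
            current.Append((char)b);
        }
        else
        {
            if (current.Length >= minLength)
                yield return current.ToString();
            current.Clear();
        }
    }
    if (current.Length >= minLength)
        yield return current.ToString();
}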
Having used SQL Server Bulk insert of CSV file with inconsistent quotes (the CsvToOtherDelimiter option) as my basis, I discovered a few weirdnesses with the RemoveCSVQuotes part [it chopped the last char from quoted strings that contained a comma!]. So... I rewrote that bit (maybe a mistake?)
One wrinkle is that the client has asked 'what about data like this?'
""17.5179C,""
I assume if I wanted to keep using the CsvToOtherDelimiter solution, I'd have to amend the RegExp...but it's WAY beyond me... what's the best approach?
To clarify: we are using C# to pre-process the file into a pipe-delimited format prior to running a bulk insert using a format file. Speed is pretty vital.
The accepted answer from your link starts with:
You are going to need to preprocess the file, period.
Why not transform your CSV to XML? Then you would be able to verify your data against an XSD before storing it in a database.
To convert a CSV string into a list of elements, you could write a program that keeps track of state (in quotes or out of quotes) as it processes the string one character at a time, and emits the elements it finds. The rules for quoting in CSV are weird, so you'll want to make sure you have plenty of test data.
The state machine could go like this:
1. scan until quote (go to 2) or comma (go to 3)
2. if the next character is a quote, add only one of the two quotes to the field and return to 1. Otherwise, go to 4 (or report an error if the quote isn't the first character in the field).
3. emit the field, go to 1
4. scan until quote (go to 5)
5. if the next character is a quote, add only one of the two quotes to the field and return to 4. Otherwise, emit the field, scan for a comma, and go to 1.
This should correctly scan stuff like:
hello, world, 123, 456
"hello world", 123, 456
"He said ""Hello, world!""", "and I said hi"
""17.5179C,"" (correctly reports an error, since there should be a
separator between the first quoted string "" and the second field
17.5179C).
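As a rough C# sketch (the names are mine, and this is not production code), that state machine can be written as one loop. It tolerates spaces after separators and insists that a closing quote is followed immediately by a separator, which is what makes ""17.5179C,"" an error:

using System;
using System.Collections.Generic;
using System.Text;

// Splits a single CSV record into fields. Doubled quotes inside a quoted
// field become one literal quote; a quote anywhere else is an error.
static List<string> ParseCsvLine(string line)
{
    var fields = new List<string>();
    var field = new StringBuilder();
    int i = 0;
    while (true)
    {
        field.Clear();
        while (i < line.Length && line[i] == ' ') i++;   // tolerate spaces after a separator

        if (i < line.Length && line[i] == '"')
        {
            i++;                                          // opening quote
            while (true)
            {
                if (i >= line.Length)
                    throw new FormatException("Unterminated quoted field.");
                if (line[i] == '"')
                {
                    if (i + 1 < line.Length && line[i + 1] == '"') { field.Append('"'); i += 2; }
                    else { i++; break; }                  // closing quote
                }
                else
                {
                    field.Append(line[i++]);
                }
            }
            // Only a separator (or the end of the record) may follow a quoted field.
            if (i < line.Length && line[i] != ',')
                throw new FormatException($"Expected separator at position {i}.");
        }
        else
        {
            // Unquoted field: read up to the next comma; a stray quote is an error.
            while (i < line.Length && line[i] != ',')
            {
                if (line[i] == '"')
                    throw new FormatException($"Unexpected quote at position {i}.");
                field.Append(line[i++]);
            }
        }

        fields.Add(field.ToString());
        if (i >= line.Length) break;
        i++;                                              // skip the comma
    }
    return fields;
}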
Another way would be to find some existing library that does it well. Surely, CSV is common enough that such a thing must exist?
edit:
You mention that speed is vital, so I wanted to point out that (so long as the quoted strings aren't allowed to include line returns...) each line may be processed independently in parallel.
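For example, a sketch using PLINQ, assuming no quoted field spans physical lines and reusing the ParseCsvLine sketch above (the file name is a placeholder):

using System.IO;
using System.Linq;

// Parse each physical line independently; AsOrdered keeps the original record order.
var rows = File.ReadLines("input.csv")
               .AsParallel()
               .AsOrdered()
               .Select(line => ParseCsvLine(line))
               .ToList();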
I ended up using the CSV parser that I didn't know we already had (it comes as part of our code generation tool), and noting that ""17.5179C,"" is not valid and will cause errors.
I have a problem with using Directory.Exists() on a string that contains an accented character.
This is the directory path: D:\ést_test\scenery. It is coming in as a simple string in a file that I am parsing:
[Area.121]
Title=ést_test
local=D:\AITests\ést_test
Layer=121
Active=FALSE
Required=FALSE
My code is taking the local value and adding \scenery to it. I need to test that this exists (which it does) and am simply using:
if (!Directory.Exists(area.Path))
{
// some handling code
area.AreaIsValid = false;
}
This returns false. It seems that the string handling that I am doing is replacing the accented character. The text visualizer in VS2012 is showing this (directoryManager is just a wrapper around System.IO.Directory):
And the warning message as displayed is showing this:
So it seems that the accented character is not being recognized. Searching for this issue does turn up results, but mostly about removing or replacing the accented character. I am currently using 'normal' string handling. I tried using FileInfo, but the path seems to get mangled anyway.
So my first question is how do I get the path stored into a string so that it will pass the Directory.Exists test?
This raises a wider question about non-Latin characters in path names. I have users all over the world, so I can see Arabic, Russian, Chinese, and so on in paths. How can I handle all of these?
The problem is almost certainly that you're loading the file with the wrong encoding. The fact that it's a filename is irrelevant - the screenshots show that you've lost the relevant data before you call Directory.Exists.
You should make sure you know the file's encoding (e.g. UTF-8, Cp1252, etc.) and then pass that as an argument to however you're loading the file (e.g. File.ReadAllText). If this isn't enough information to get you going, you'll need to tell us more about the file (to work out what encoding it's in) and more about your code (how you're reading it).
Once you've managed to load the correct data, I'd hope that the file-system side of things just handles itself automatically.
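For example, if the file turns out to be UTF-8 (an assumption; it might just as well be Windows-1252, i.e. Encoding.GetEncoding(1252)), something along these lines should keep the é intact. The file name area.ini is a placeholder for whatever file you are actually parsing:

using System;
using System.IO;
using System.Text;

// Read the file with an explicit encoding so "ést_test" survives the round trip.
string[] lines = File.ReadAllLines(@"D:\AITests\area.ini", Encoding.UTF8);

foreach (string line in lines)
{
    if (line.StartsWith("local=", StringComparison.OrdinalIgnoreCase))
    {
        string sceneryPath = Path.Combine(line.Substring("local=".Length), "scenery");
        Console.WriteLine($"{sceneryPath} exists: {Directory.Exists(sceneryPath)}");
    }
}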