I am working on a C# program which processes a bunch of text files. These files have been created by a system, so I can't change the source, but within the files the character ¿ appears multiple times, which is causing my code to fall over.
What does ¿ mean and how can I handle it?
¿ means the file contains a character that was converted from another encoding and has no equivalent in the character table of the encoding you are reading with. You can handle it by reading the file with the correct encoding.
Documentation
Unicode-encoded files often start with a "header" called a byte order mark (BOM). It tells programs reading the file that it is Unicode and signifies which type of Unicode it is. http://msdn.microsoft.com/en-us/library/windows/desktop/dd374101(v=vs.85).aspx
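As a rough illustration (not part of the documentation above), sniffing the BOM yourself takes only a few lines; the file path is a placeholder and the snippet assumes using System.IO and System.Text:

byte[] head = new byte[4];
using (var fs = File.OpenRead(@"C:\data\input.txt"))
{
    fs.Read(head, 0, head.Length);   // a short file fills fewer bytes; the remaining zeros won't match any BOM
}

Encoding detected;
if (head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
    detected = Encoding.UTF8;                 // UTF-8 BOM
else if (head[0] == 0xFF && head[1] == 0xFE)
    detected = Encoding.Unicode;              // UTF-16 little-endian BOM (UTF-32 LE would also match; ignored here)
else if (head[0] == 0xFE && head[1] == 0xFF)
    detected = Encoding.BigEndianUnicode;     // UTF-16 big-endian BOM
else
    detected = null;                          // no BOM: the encoding has to be guessed or known in advance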
To elaborate on my comment, first you should find out what encoding was used when these were created, then use that encoding when reading them in. Check out:
BinaryReader(Stream, Encoding)
http://msdn.microsoft.com/en-us/library/system.io.binaryreader.aspx
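For a plain text file, an explicit encoding can also be passed to File.ReadAllText or StreamReader rather than BinaryReader. A sketch, with the path and the Windows-1252 fallback as assumptions (on .NET Core/.NET 5+, code-page encodings additionally require registering CodePagesEncodingProvider):

// Read the whole file with a known encoding.
string text = File.ReadAllText(@"C:\data\input.txt", Encoding.GetEncoding(1252));

// Or let StreamReader honour a BOM if one is present and fall back to the given encoding otherwise.
using (var reader = new StreamReader(@"C:\data\input.txt",
                                     Encoding.GetEncoding(1252),
                                     detectEncodingFromByteOrderMarks: true))
{
    string contents = reader.ReadToEnd();
}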
Related
The typical way to edit a text or source file from code is to read the file using File.ReadAllLines or File.ReadAllText, make your changes, and then write it out using WriteAllLines or WriteAllText.
However, if you were to open the text file (say some source code file) in Visual Studio or Notepad++, scroll down a few lines, make a change, and save, a lot more is handled.
It seems that what is handled, at least on Windows, is a complicated set of rules and heuristics that takes into account, at a minimum:
The inferred encoding of the text file.
The line-endings
Whether the last line is an "incomplete line" (as described in the diffutils manual), namely a line with no line-ending character(s)
I'll discuss these partially just to illustrate the complexity. My question is: is there a full set of heuristics, an already-established algorithm that can be used, or an existing component that encapsulates all of this?
Inferred Encoding
Most common for source / text files:
UTF-16 with BOM
UTF-8 with BOM
UTF-8 without BOM
When there's no BOM, the encoding is inferred using some heuristics.
It could be ASCII, Windows-1252 (Encoding.GetEncoding(1252)), or BOM-less UTF-8.
It depends on what the rest of the data looks like: whether there are bytes above 0x7F, and whether those bytes form valid UTF-8 sequences.
When you save, you need to keep the same encoding.
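One common heuristic, shown here as a minimal sketch rather than a complete detector: when there is no BOM, attempt a strict UTF-8 decode and fall back to Windows-1252 if it fails.

using System.Text;

static Encoding GuessEncoding(byte[] bytes)
{
    // Assumes BOM detection has already been done; this only handles BOM-less data.
    try
    {
        // Strict UTF-8: throws DecoderFallbackException on any invalid byte sequence.
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true)
            .GetString(bytes);
        return new UTF8Encoding(false);        // valid UTF-8, keep it BOM-less
    }
    catch (DecoderFallbackException)
    {
        // Not valid UTF-8; assume a single-byte code page such as Windows-1252.
        return Encoding.GetEncoding(1252);
    }
}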
Line endings
You have to keep the same line-endings. So if the file uses CR/LF, then keep it at CR/LF.
But when it's just LF, then keep that.
But it can get more complicated than that, since a given text file may contain both, and that would need to be maintained as well.
For example, a source file that's CR/LF may contain a section inside it that's LF-ended only.
This can happen when someone pastes text from another tool into a literal multi-line string, such as C#'s @"" verbatim strings.
Visual Studio handles this correctly.
Incomplete lines
If the last line is incomplete, that has to be maintained as well. That means, if the last line doesn't end with end-of-line character(s), the saved file shouldn't end with them either.
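A sketch of one way to carry the encoding, mixed line endings and an incomplete last line through an edit: split the text with a regex that captures the terminators, then write everything back with the same encoding. Here path and detectedEncoding are assumed to be known already (uses System.IO, System.Text and System.Text.RegularExpressions).

// Split into alternating line / terminator entries; an incomplete last line simply has no terminator after it.
string text = File.ReadAllText(path, detectedEncoding);
string[] parts = Regex.Split(text, "(\r\n|\n|\r)");

var sb = new StringBuilder();
for (int i = 0; i < parts.Length; i++)
{
    // Even indexes are the lines (edit those as needed); odd indexes are the original terminators.
    sb.Append(parts[i]);
}

// Write back with the same encoding; for BOM-less UTF-8 pass new UTF8Encoding(false) so no BOM is added.
File.WriteAllText(path, sb.ToString(), detectedEncoding);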
Possible approach
I think one way to get around all of these problems from the start is to treat the file as binary instead of text. This means the normal text-file processing in .NET cannot be used. A new set of APIs will be needed to handle editing such files.
I can imagine a component that requires you to open the file as a memory stream and pass that to the component. The component then can read the stream and provide a line-oriented view to clients, such that client code can iterate over the lines for processing. Each element through the iteration will be an object of a type that looks something like this:
class LineElement
{
    int originalLineNumber;   // 1-based line number in the file as it was read
    string[] lines;           // the line text, without any end-of-line characters
    string[] lineEndings;     // the exact terminator for each line, e.g. "\x0D\x0A" or "\x0A"
}
As an example for a normal text file on Windows:
originalLineNumber will be 1
lines will be a single-element array containing the first line of the file, without its line-ending
lineEndings[0] will be "\x0D\x0A"
the lines field can be modified. It can be replaced with an empty array to delete the line,
or with a multi-element array to insert lines (replacing the existing line)
The lineEndings array is handled similarly.
In many cases, new lines aren't removed or inserted, in which case the application code never has to deal with line-endings at all. They simply operate on the lines[] array, ignoring the lineEndings[] array.
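To make that concrete, here is a purely hypothetical usage sketch; LineDocument and its Load, Lines and Save members do not exist anywhere, they are just what the proposed component's API might look like.

byte[] original = File.ReadAllBytes(path);                   // path is a placeholder
var doc = LineDocument.Load(new MemoryStream(original));     // would infer encoding, BOM and line endings

foreach (LineElement element in doc.Lines())
{
    // Example edit: delete blank lines. Untouched elements keep their endings verbatim.
    if (element.lines.Length == 1 && element.lines[0].Length == 0)
        element.lines = new string[0];
}

using (var output = File.Create(path))
    doc.Save(output);                                         // would write back byte-for-byte except for the edits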
I'm open to other suggestions.
I am trying to do some kind of sentence processing in Turkish, and I am using a text file as the database. But I can't read the Turkish characters from the text file, and because of that I can't process the data correctly.
string[] Tempdatabase = File.ReadAllLines(@"C:\Users\dialogs.txt");
textBox1.Text = Tempdatabase[5];
Output: (screenshot of textBox1 showing the Turkish characters displayed incorrectly)
It's probably an encoding issue. Try using one of the Turkish code page identifiers.
var Tempdatabase =
    File.ReadAllLines(@"C:\Users\dialogs.txt", Encoding.GetEncoding("iso-8859-9"));
You can fiddle around using Encoding as much as you like. This might eventually yield the expected result, but bear in mind that this may not work with other files.
Usually, C# handles strings as Unicode (UTF-16), and File.ReadAllLines reads files as UTF-8 by default. So unless you really need something else, you should try this instead:
Open your text file in Notepad (or any other program) and save it as a UTF-8 file. Then, you should get the expected results without any modifications in your code. This is because C# reads the file using the encoding you saved it with. This is default behavior, which should be preferred.
When you save your text file as UTF-8, then C# will interpret it as such.
This also applies to .html files inside Visual Studio, if you notice that they are displayed incorrectly (parsed as ASCII).
The file contains the text in a specific Turkish character set, not Unicode. If you don't specify any other behaviour, .NET will assume UTF-8 when reading text from a text file. You have two possible solutions:
Either change the text file to use Unicode (for example utf8) using an external text editor.
Or specify the character set to read with, for example:
string[] Tempdatabase = File.ReadAllLines(@"C:\Users\dialogs.txt", Encoding.Default);
This will use the local character set of the Windows system.
string[] Tempdatabase = File.ReadAllLines(@"C:\Users\dialogs.txt", Encoding.GetEncoding("Windows-1254"));
This will use the Turkish character set defined by Microsoft.
I have a simple txt file. I read its content, perform some operations on it (for example encode and decode), and save the result to a file. When I compare the two files in Beyond Compare, I see that the content is the same, but the sizes of the files are different. Why? And how can I resolve this?
There can be many reasons for that, for example a different encoding, or maybe one file uses \r\n line endings and the other only \n.
Use Hex Compare in BeyondCompare to find out exactly.
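A small sketch that reproduces the effect: the same visible text written once as UTF-8 with a BOM and CR/LF, and once as BOM-less UTF-8 with LF only, gives two different file sizes (the paths are placeholders).

string text = "line one\r\nline two";

File.WriteAllText(@"C:\temp\with-bom.txt", text, new UTF8Encoding(true));                      // BOM adds 3 bytes
File.WriteAllText(@"C:\temp\no-bom.txt", text.Replace("\r\n", "\n"), new UTF8Encoding(false)); // no BOM, LF only

Console.WriteLine(new FileInfo(@"C:\temp\with-bom.txt").Length);   // 21
Console.WriteLine(new FileInfo(@"C:\temp\no-bom.txt").Length);     // 17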
I have a problem with using Directory.Exists() on a string that contains an accented character.
This is the directory path: D:\ést_test\scenery. It is coming in as a simple string in a file that I am parsing:
[Area.121]
Title=ést_test
local=D:\AITests\ést_test
Layer=121
Active=FALSE
Required=FALSE
My code is taking the local value and adding \scenery to it. I need to test that this exists (which it does) and am simply using:
if (!Directory.Exists(area.Path))
{
// some handling code
area.AreaIsValid = false;
}
This returns false. It seems that the string handling that I am doing is replacing the accented character. The text visualizer in VS2012 (directoryManager is just a wrapper around System.IO.Directory) shows the path with the accented character lost, and the warning message as displayed shows the same mangled path.
So it seems that the accented character is not being recognized. Searching for this issue turns up results, but mostly about removing or replacing the accented character. I am currently using 'normal' string handling. I tried using FileInfo but the path seems to get mangled anyway.
So my first question is how do I get the path stored into a string so that it will pass the Directory.Exists test?
This raises a wider question of non-Latin characters in path names. I have users all over the world, so I can see Arabic, Russian, Chinese and so on in paths. How can I handle all of these?
The problem is almost certainly that you're loading the file with the wrong encoding. The fact that it's a filename is irrelevant - the screenshots show that you've lost the relevant data before you call Directory.Exists.
You should make sure you know the file encoding (e.g. UTF-8, Cp1252 etc) and then pass that in as an argument into however you're loading the file (e.g. File.ReadAllText). If this isn't enough information to get you going, you'll need to tell us more about the file (to work out what encoding it's in) and more about your code (how you're reading it).
Once you've managed to load the correct data, I'd hope that the file aspect just handles itself automatically.
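A sketch along those lines, assuming the file turns out to be UTF-8 (swap in the real encoding once you know it); iniPath and the simplified parsing are placeholders.

foreach (string line in File.ReadAllLines(iniPath, Encoding.UTF8))
{
    if (line.StartsWith("local=", StringComparison.OrdinalIgnoreCase))
    {
        string sceneryPath = Path.Combine(line.Substring("local=".Length), "scenery");
        bool exists = Directory.Exists(sceneryPath);   // the accented character survives intact
    }
}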
Given a Stream as input, how do I safely create an XPathNavigator against an XML data source?
The XML data source:
May possibly contain invalid hexadecimal characters that need to be removed.
May contain characters that do not match the declared encoding of the document.
As an example, some XML data sources in the cloud will have a declared encoding of utf-8, but the actual encoding is windows-1252 or ISO 8859-1, which can cause an invalid character exception to be thrown when creating an XmlReader against the Stream.
From the StreamReader.CurrentEncoding property documentation: "The current character encoding used by the current reader. The value can be different after the first call to any Read method of StreamReader, since encoding autodetection is not done until the first call to a Read method." This seems to indicate that CurrentEncoding can be checked after the first read, but are we stuck storing this encoding when we need to write out the XML data to a Stream?
I am hoping to find a best practice for safely creating an XPathNavigator/IXPathNavigable instance against an XML data source that will gracefully handle encoding and invalid-character issues (in C# preferably).
I had a similar issue when some XML fragments were imported into a CRM system using the wrong encoding (there was no encoding stored along with the XML fragments).
In a loop, I created a wrapper stream using the current encoding from a list. The encoding was constructed using the DecoderExceptionFallback and EncoderExceptionFallback options (as mentioned by @Doug). If a DecoderFallbackException was thrown during processing, the original stream was reset and the next-most-likely encoding was used.
Our encoding list was something like UTF-8, Windows-1252, GB-2312 and US-ASCII. If you fell off the end of the list then the stream was really bad and was rejected/ignored/etc.
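A sketch of that loop, assuming the candidate list above; the method name is made up, and on .NET Core the windows-1252 and GB2312 code pages additionally need Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).

using System.IO;
using System.Text;

static string DecodeWithProbing(byte[] data)
{
    string[] candidates = { "utf-8", "windows-1252", "GB2312", "us-ascii" };
    foreach (string name in candidates)
    {
        // Exception fallbacks make the decoder throw instead of silently substituting characters.
        Encoding probe = Encoding.GetEncoding(name,
                                              new EncoderExceptionFallback(),
                                              new DecoderExceptionFallback());
        try
        {
            return probe.GetString(data);      // success: these bytes are valid in this encoding
        }
        catch (DecoderFallbackException)
        {
            // Not valid in this encoding; try the next candidate.
        }
    }
    throw new InvalidDataException("None of the candidate encodings could decode the data.");
}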
EDIT:
I whipped up a quick sample and basic test files (source here). The code doesn't have any heuristics to choose between code pages that both match the same set of bytes, so a Windows-1252 file may be detected as GB2312, and vice-versa, depending on file content, and encoding preference ordering.
It's possible to use the DecoderFallback class (and a few related classes) to deal with bad characters, either by skipping them or by doing something else (restarting with a new encoding?).
When using an XmlTextReader or something similar, the reader itself will figure out the encoding declared in the XML file.
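For the invalid-character half of the question, one option (my assumption, not something stated in the answer above) is to build the navigator from text you have already decoded, so the encoding declared in the XML prolog no longer matters, and to relax the reader's character checking (uses System.IO, System.Xml and System.Xml.XPath):

var settings = new XmlReaderSettings { CheckCharacters = false };   // don't abort on stray invalid characters

using (var stringReader = new StringReader(decodedXml))             // decodedXml: output of the probing step above
using (var xmlReader = XmlReader.Create(stringReader, settings))
{
    IXPathNavigable document = new XPathDocument(xmlReader);
    XPathNavigator navigator = document.CreateNavigator();
}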