How to Handle Accented Characters in a Directory Name - c#

I have a problem with using Directory.Exists() on a string that contains an accented character.
This is the directory path: D:\ést_test\scenery. It is coming in as a simple string in a file that I am parsing:
[Area.121]
Title=ést_test
local=D:\AITests\ést_test
Layer=121
Active=FALSE
Required=FALSE
My code is taking the local value and adding \scenery to it. I need to test that this exists (which it does) and am simply using:
if (!Directory.Exists(area.Path))
{
    // some handling code
    area.AreaIsValid = false;
}
This returns false. It seems that the string handling that I am doing is replacing the accented character. The text visualizer in VS2012 is showing this (directoryManager is just a wrapper around System.IO.Directory):
And the warning message as displayed is showing this:
So it seems that the accented character is not being recognized. Searching for this issue does turn up results, but they are mostly about removing or replacing the accented character. I am currently using 'normal' string handling. I tried using FileInfo but the path seems to get mangled anyway.
So my first question is how do I get the path stored into a string so that it will pass the Directory.Exists test?
This raises a wider question of non-Latin characters in path names. I have users all over the world, so I can expect to see Arabic, Russian, Chinese and so on in paths. How can I handle all of these?

The problem is almost certainly that you're loading the file with the wrong encoding. The fact that it's a filename is irrelevant - the screenshots show that you've lost the relevant data before you call Directory.Exists.
You should make sure you know the file encoding (e.g. UTF-8, Cp1252 etc) and then pass that in as an argument into however you're loading the file (e.g. File.ReadAllText). If this isn't enough information to get you going, you'll need to tell us more about the file (to work out what encoding it's in) and more about your code (how you're reading it).
Once you've managed to load the correct data, I'd hope that the file aspect just handles itself automatically.
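As a minimal sketch of that advice: the helper below reads the file with an explicitly chosen encoding and returns the value of the local= line. The class name AreaFile, the method ReadLocal and the choice of ISO-8859-1 are assumptions for illustration only; the real encoding of the file has to be established first.

```csharp
using System;
using System.IO;
using System.Text;

static class AreaFile
{
    // Hypothetical helper: returns the value of the first "local=" line, or null if none is found.
    // The caller supplies the encoding the file was actually saved in.
    public static string ReadLocal(string file, Encoding encoding)
    {
        foreach (string line in File.ReadAllLines(file, encoding))
        {
            if (line.StartsWith("local=", StringComparison.OrdinalIgnoreCase))
                return line.Substring("local=".Length).Trim();
        }
        return null;
    }
}

// Possible usage (the file name and encoding are assumptions):
//   string local = AreaFile.ReadLocal(@"areas.cfg", Encoding.GetEncoding("iso-8859-1"));
//   bool exists = local != null && Directory.Exists(Path.Combine(local, "scenery"));
```

If the file is read with a different encoding than it was written in, the accented character is typically replaced before Directory.Exists ever sees it, which matches the symptom described in the question.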

Related

Why does System.IO.DirectoryInfo allow constructing a path ending in form feed?

While writing some property-based tests in F#, I have discovered a very strange behavior of the constructor of System.IO.DirectoryInfo. Based on the documentation, I would expect an ArgumentException to be thrown any time I try to construct a DirectoryInfo with an invalid path character; and I would expect Path.GetInvalidPathChars() to be a reliable source for these. Its documentation says, "The array returned from this method is not guaranteed to contain the complete set of characters that are invalid in file and directory names"; but it says nothing about the method returning characters that are not invalid! Form feed ('\012') is one of these invalid characters, as you might expect; I would expect any kind of control character to be invalid in a path name. And indeed, if I try to construct a DirectoryInfo with one in the middle or the beginning, it doesn't work. However, it does work if I put it at the end:
The absolute path produced does not have the form feed in it though, confirming that this is indeed an invalid path character. I confirmed that the directory I created in that REPL session was there, with no control characters in the name. However, to make it even odder, when I convert the DirectoryInfo back to a string, the garbage character is maintained:
The same behavior is exhibited for horizontal tab '\009', line feed '\010', vertical tab '\011', and carriage return '\013', but all other control characters are rejected. One would think they would at least be trimmed if they were going to be accepted.
What's going on here? Is this a bug in DirectoryInfo, or is there some rational/consistent explanation of this behavior in .NET? (And what other such potholes might I have to look for!)

Can not read turkish characters from text file to string array

I am trying to do some sentence processing in Turkish, and I am using a text file as the database. But I cannot read Turkish characters from the text file, so I cannot process the data correctly.
string[] Tempdatabase = File.ReadAllLines(@"C:\Users\dialogs.txt");
textBox1.Text = Tempdatabase[5];
Output:
It's probably an encoding issue. Try using one of the Turkish code page identifiers.
var Tempdatabase =
    File.ReadAllLines(@"C:\Users\dialogs.txt", Encoding.GetEncoding("iso-8859-9"));
You can fiddle around using Encoding as much as you like. This might eventually yield the expected result, but bear in mind that this may not work with other files.
Usually, .NET processes strings and files using Unicode by default. So unless you really need something else, you should try this instead:
Open your text file in Notepad (or any other editor) and save it as a UTF-8 file. You should then get the expected results without any modifications to your code. This works because File.ReadAllLines, when called without an encoding argument, checks for a byte order mark and otherwise assumes UTF-8, so a file saved as UTF-8 is interpreted as such. This is the default behavior and should be preferred.
This also applies to .html files inside Visual Studio, if you notice that they are displayed incorrectly (parsed with ASCII)
The file contains the text in a specific Turkish character set, not Unicode. If you don't specify any other behaviour, .NET will assume Unicode text (UTF-8) when reading text from a text file. You have two possible solutions:
Either change the text file to use Unicode (for example UTF-8) using an external text editor.
Or specify the character set to read with, for example:
string[] Tempdatabase = File.ReadAllLines(@"C:\Users\dialogs.txt", Encoding.Default);
This will use the local character set of the Windows system (on .NET Framework).
string[] Tempdatabase = File.ReadAllLines(@"C:\Users\dialogs.txt", Encoding.GetEncoding("Windows-1254"));
This will use the Turkish character set defined by Microsoft.

C# Reading files and encoding issue

I've searched everywhere for this answer so hopefully it's not a duplicate. I decided I'm just finally going to ask it here.
I have a file named Program1.exe. When I drag that file into Notepad or Notepad++ I get all kinds of random symbols and then some readable text. However, when I try to read this file in C#, I either get inaccurate results, or just a big MZ. I've tried all supported encodings in C#. How can Notepad programs read a file like this when I simply can't? I tried converting the bytes to a string and it doesn't work. I tried reading it directly line by line and it doesn't work. I've even tried reading it as binary and it doesn't work.
Thanks for the help! :)
Reading a binary file as text is a peculiar thing to do, but it is possible. Any of the 8-bit encodings will do it just fine. For example, the code below opens and reads an executable and outputs it to the console.
const string fname = @"C:\mystuff\program.exe";
using (var reader = new StreamReader(fname, Encoding.GetEncoding("windows-1252")))
{
    var s = reader.ReadToEnd();
    s = s.Replace('\x0', ' '); // replace NUL characters with spaces
    Console.WriteLine(s);
}
The result is very similar to what you'll see in Notepad or Notepad++. The "funny symbols" will differ based on how your console is configured, but you get the idea.
By the way, if you examine the string in the debugger, you're going to see something quite different. Those funny symbols are encoded as C# character escapes. For example, nul bytes (value 0) will display as \0 in the debugger, as NUL in Notepad++, and as spaces on the console or in Notepad. Newlines show up as \r in the debugger, etc.
As I said, reading a binary file as text is pretty peculiar. Unless you're just looking to see if there's human-readable data in the file, I can't imagine why you'd want to do this.
Update
I suspect the reason that all you see in the Windows Forms TextBox is "MZ" is that the Windows textbox control (which is what the TextBox ultimately uses) treats the NUL character as a string terminator, so it won't display anything after the first NUL. And the first thing after the "MZ" is a NUL (shown as '\0' in the debugger). You'll have to replace the NULs in the string with spaces. I edited the code example above to show how you'd do that.
The exe is a binary file and if you try to read it as a text file you'll get the effect that you are describing. Try using something like a FileStream instead that does not care about the structure of the file but treats it just as a series of bytes.
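A minimal sketch of the byte-oriented approach, assuming all that is needed is to inspect the raw bytes rather than display them as text. ExeBytes and ReadPrefix are hypothetical names introduced here for illustration.

```csharp
using System;
using System.IO;

static class ExeBytes
{
    // Hypothetical helper: reads up to 'count' bytes from the start of the file.
    // No text decoding takes place, so NUL bytes and other binary values survive untouched.
    public static byte[] ReadPrefix(string path, int count)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            var buffer = new byte[count];
            int total = 0;
            int read;
            // Read may return fewer bytes than requested, so loop until done or end of file.
            while (total < count && (read = fs.Read(buffer, total, count - total)) > 0)
            {
                total += read;
            }
            Array.Resize(ref buffer, total);
            return buffer;
        }
    }
}

// Possible usage: print the first bytes as hex.
//   Console.WriteLine(BitConverter.ToString(ExeBytes.ReadPrefix(@"C:\mystuff\program.exe", 16)));
```

For an executable, the first two bytes are 0x4D 0x5A, which is the "MZ" that shows up when the file is decoded as text.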

c# equivalent to stripcslashes function?

I am working with a project that includes getting MMS from a mms-gateway and storing the image on disk.
This includes using a received base64encoded string and storing it as a zip to a web server. This zip is then opened, and the image is retrieved.
We have managed to store it as a zip file, but it is corrupted and cannot be opened.
The documentation from the gateway is pretty sparse, and we have only a PHP example to rely on. I think we have figured out how to "translate" most of it, except for the PHP function stripcslashes(inputvalue). Can anyone shed any light on how to do the same thing in C#?
We are thankful for any help!
stripcslashes() looks for "\x" type elements within longer strings (where 'x' could be any character, or perhaps, more than one). If the 'x' is not recognised as meaningful, it just removes the '\' but if it does recognise it as a valid C-style escape sequence (i.e. "\n" is newline; "\t" is tab, etc.), as I understand it, the recognised character is inserted instead: \t will be replaced by a tab character (0x09, I think) in your string.
I'm not aware of any simple way to get the .NET Framework to do the same thing without building a similar function yourself. This obviously isn't very hard, but you need to know which escape sequences to process.
If you happen to know (or find out by inspecting your base64 text) that the only thing in your input that will need processing is a particular one or two sequences (say, tab characters), it becomes very easy and the following snippet shows use of String.Replace():
string input = @"Some\thing"; // '@' makes this a verbatim string, so '\t' is stored without processing
Console.WriteLine(input);
string output = input.Replace(@"\t", "\t");
Console.WriteLine(output);
Of course, if you really do simply want to remove all the slashes:
string output = input.Replace(@"\", "");
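For the general case, a hand-written function along the lines described above might look like the following. This is a sketch only: CSlashes.Strip is a hypothetical name, and the set of escapes it handles (single-letter C escapes, \x with up to two hex digits, octal with up to three digits, anything else just loses its backslash) is my reading of how stripcslashes behaves, not a verified port.

```csharp
using System;
using System.Text;

static class CSlashes
{
    // Hypothetical helper approximating PHP's stripcslashes().
    public static string Strip(string input)
    {
        var sb = new StringBuilder(input.Length);
        int i = 0;
        while (i < input.Length)
        {
            char c = input[i++];
            if (c != '\\' || i == input.Length)
            {
                sb.Append(c); // ordinary character, or a trailing backslash kept as-is
                continue;
            }
            char e = input[i++];
            switch (e)
            {
                case 'n': sb.Append('\n'); break;
                case 't': sb.Append('\t'); break;
                case 'r': sb.Append('\r'); break;
                case 'a': sb.Append('\a'); break;
                case 'v': sb.Append('\v'); break;
                case 'b': sb.Append('\b'); break;
                case 'f': sb.Append('\f'); break;
                case 'x':
                {
                    // Up to two hex digits after \x; with none, keep the 'x' itself.
                    int start = i;
                    while (i < input.Length && i - start < 2 && Uri.IsHexDigit(input[i])) i++;
                    if (i > start)
                        sb.Append((char)Convert.ToInt32(input.Substring(start, i - start), 16));
                    else
                        sb.Append('x');
                    break;
                }
                default:
                    if (e >= '0' && e <= '7')
                    {
                        // Up to three octal digits, truncated to one byte.
                        int value = e - '0';
                        int digits = 1;
                        while (i < input.Length && digits < 3 && input[i] >= '0' && input[i] <= '7')
                        {
                            value = value * 8 + (input[i++] - '0');
                            digits++;
                        }
                        sb.Append((char)(value & 0xFF));
                    }
                    else
                    {
                        sb.Append(e); // unrecognised escape: drop only the backslash
                    }
                    break;
            }
        }
        return sb.ToString();
    }
}
```

Each decoded escape ends up as a char in the range 0 to 255, so if the result is meant to be binary data it would still need to be turned into bytes, for example with Encoding.GetEncoding("iso-8859-1").GetBytes(result); whether that matches what the gateway expects is something to check against its PHP example.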

Verifying that an uploaded file contains only plain text

I have an ASP.NET MVC application that allows the user to upload a file that should only contain plain text.
I am looking for a simple approach to validate that the file does indeed contain only text.
For my purposes I am happy to define text as any of the characters that I can see printed on my GB QWERTY keyboard.
Business rules mean that my uploaded file won't contain any accented characters, so it doesn't matter if the code accepts or rejects these.
Approaches so far that have not worked:
Checking the content-type; no good as this is dependent on the file extension
Checking char.IsControl for each character; no good as the file can contain pipe (|) characters which are considered to be control characters
I'd rather avoid using a lengthy Regex pattern to get this to work.
It sounds like you want ASCII characters 32-126 plus a few odds and ends like 9 (horizontal tab), carriage return and linefeed, etc.
"I'd rather avoid using a lengthy Regex pattern to get this to work."
As long as that doesn't mean 'no regular expressions at all', you can use the accepted answer from this stack overflow question (I've added the horizontal tab character to the original):
^[\x09\x0a\x0d\x20-\x7e]*$
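If a regular expression is unwanted altogether, the same character ranges can be checked directly. The sketch below assumes the uploaded content has already been read into a string; PlainText.IsPlainText is a hypothetical name, and the allowed set (ASCII 32-126 plus tab, carriage return and line feed) is taken from the description above.

```csharp
using System;
using System.Linq;

static class PlainText
{
    // Hypothetical helper: true when every character is printable ASCII (32-126),
    // a horizontal tab, a carriage return or a line feed.
    public static bool IsPlainText(string text)
    {
        return text.All(c => (c >= 0x20 && c <= 0x7E) || c == '\t' || c == '\r' || c == '\n');
    }
}

// Possible usage:
//   bool ok = PlainText.IsPlainText(File.ReadAllText(uploadedPath));
```

Note that an empty string passes this check, so an empty upload would need its own rule if it should be rejected.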
