Verifying that an uploaded file contains only plain text - C#

I have an ASP.NET MVC application that allows the user to upload a file that should only contain plain text.
I am looking for a simple approach to validate that the file does indeed contain only text.
For my purposes I am happy to define text as any of the characters that I can see printed on my GB QWERTY keyboard.
Business rules mean that my uploaded file won't contain any accented characters, so it doesn't matter if the code accepts or rejects these.
Approaches so far that have not worked:
Checking the content-type; no good as this is dependent on the file extension
Checking char.IsControl for each character; no good as the file can contain pipe (|) characters which are considered to be control characters
I'd rather avoid using a lengthy Regex pattern to get this to work.

It sounds like you want ASCII characters 32-126 plus a few odds and ends like 9 (horizontal tab), carriage return & linefeed, etc.
I'd rather avoid using a lengthy Regex pattern to get this to work.
As long as that doesn't mean 'no regular expressions at all', you can use the accepted answer from this Stack Overflow question (I've added the horizontal tab character to the original):
^([\x09\x0d\x0a\x20-\x7e]*)$
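If you do end up avoiding regular expressions entirely, the same check can be done with a plain range test per character. A minimal sketch, assuming the upload is read as UTF-8 text (the method name and the stream-based signature are just illustrative):

using System.IO;
using System.Text;

public static class PlainTextValidator
{
    public static bool IsPlainText(Stream uploadedFile)
    {
        using (var reader = new StreamReader(uploadedFile, Encoding.UTF8))
        {
            int next;
            while ((next = reader.Read()) != -1)
            {
                char c = (char)next;
                // printable ASCII 32-126 (which includes '|') plus tab, CR and LF
                bool allowed = (c >= 0x20 && c <= 0x7E) || c == '\t' || c == '\r' || c == '\n';
                if (!allowed)
                    return false;
            }
        }
        return true;
    }
}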

Related

C# Regex filter problems

A while ago I posted something asking about the same type of question regarding Regex. It has given me headaches; I have looked up loads of documentation on how to use regex, but I still could not put my finger on it. I don't want to waste another 6 hours looking to filter simple (I think) expressions.
So basically what I want to do is filter all file types ending in HTML extensions (the '*' stars are from a WinForms TabControl, signifying that the file has been modified). I also need them matched with IgnoreCase:
.html, .htm, .shtml, .shtm, .xhtml
.html*, .htm*, .shtml*, .shtm*, .xhtml*
Also filtering some CSS files:
.css
.css*
And some SQL Files:
.sql, .ddl, .dml
.sql*, .ddl*, .dml*
My previous question got an answer to filtering Python files:
.py, .py3, .pyi, .pyx, .pyw
Expression would be: \.py[3ixw]?\*?$
But when I tried to learn from the expression above, I would always end up with only .xhtml opening; the rest were not valid.
For the HTML expression, I currently have this: \.html|.html|.shtml|.shtm|.xhtml\*?$ with RegexOptions.IgnoreCase. But the output will only allow .xhtml, case sensitive or insensitive; .html files, .htm and the rest did not match. I would really appreciate an explanation of each of the expressions you provide (so I don't have to ask the same question ever again).
Thank you.
For such cases you may start with a simple regex and simplify it step by step into a good one:
In C#, with IgnoreCase, this would basically be
Regex myRegex = new Regex("PATTERN", RegexOptions.IgnoreCase);
Now the pattern: The easiest one is simply concatenating all valid results with OR, escaping where needed:
\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html*|\.htm*|\.shtml*|\.shtm*|\.xhtml*
With .html* you mean .html + anything, which is written as .* (any character, zero or more times) in regex.
\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html.*|\.htm.*|\.shtml.*|\.shtm.*|\.xhtml.*
Then, you may take all repeating patterns and group them together. All file endings start with a dot and may have an optional tail, and ending.* always covers ending on its own:
\.(html|htm|shtml|shtm|xhtml).*
Then, I see htm pretty often, so I try to extract that. Taking all possible characters before and after htm together (? means 0 or 1 appearance):
\.(s|x)?(htm)l?.*
And I always check whether it's still working in Regexstorm for .NET.
That way, you can also derive regular expressions for the other two groups and concatenate them all together in the end.
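As a rough sketch of where those steps could end up, here is how a combined pattern might be used from C#; the grouping of the CSS and SQL endings is my own assumption, derived with the same steps as above:

using System;
using System.Text.RegularExpressions;

// HTML-like endings, CSS and SQL endings, each allowing the trailing
// "*" (or anything else) that the tab control appends.
Regex fileFilter = new Regex(@"\.((s|x)?html?|css|sql|ddl|dml).*$", RegexOptions.IgnoreCase);

Console.WriteLine(fileFilter.IsMatch("Index.SHTML*")); // True
Console.WriteLine(fileFilter.IsMatch("query.sql"));    // True
Console.WriteLine(fileFilter.IsMatch("script.py"));    // False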

Field and text delimiters within cells in csv files

This is likely a very basic question that I could not, despite trying, find a satisfying answer to. Feel free to skip to the question at the end if you aren't interested in the background.
The task:
I wish to create an easy localisation solution for my Unity projects. After some initial research I concluded it would be best to use a .csv file read by a StreamReader, so that translators would only ever have to interact with the CSV table, where information is neatly organized.
The main problem:
Due to the nature of the text, I need to account for line breaks and special characters in the actual fields. As such I could not use the normal ReadLine() method.
I worked around this by using Read() and checking whether a line break falls between a pair of text delimiters. But as I check for the text delimiter, I am afraid it might run into an un-escaped delimiter that is part of the normal in-cell text (since the normal text delimiter is the quotation mark).
So I switched the delimiter to §. But now every time I open the file I have to re-enter § as the text delimiter in OpenOffice Calc, probably due to encoding differences. Which is annoying, but not the end of the world.
My question:
How does OpenOffice (or similar software) usually tell in-cell commas/quotation marks apart from the ones used as delimiters? If I knew that, I could probably incorporate a similar approach in my reading of the file.
I've tried to look at the files with Notepad++, revealing a difference in line breaks (\r instead of \r\n) and obviously it's within a text delimiter bracket, but when it comes to how it separates its delimiters from ones just entered in the text/field, I am drawing a blank.
Translation file in OpenOffice Calc:
Translation file in NotePad++, showing all characters:
I'd appreciate any insight or links on the topic.
From https://en.wikipedia.org/wiki/Comma-separated_values:
The CSV file format is not fully standardized. The basic idea of separating fields with a comma is clear, but that idea gets complicated when the field data may also contain commas or even embedded line breaks.
LibreOffice Calc has a reasonable way to handle these things.
Use LF for line breaks and CR at the end of each record. It seems your code already handles this.
Use quotes to delimit strings when needed. If the string contains one or more quotes, then duplicate the quote to make it literal.
From the example in your question, it looks like you told Calc not to use any quotes as string delimiters. Why did you do this? When I tried it, LibreOffice (or Apache OpenOffice) showed the fields in different columns after opening the file saved that way.
The following example CSV file has fields that contain commas, quotes and line breaks.
When viewed in Calc:
       A           B
     ----------   ---
 1 | 1,",2",      3
     ----------   ---
 2 | a            c
   | b
Calc correctly reads and saves the file as shown below. Settings when saving are Field delimiter , and String delimiter " which are the defaults.
"1,"",2"",",3[CR]
"a
b",c[CR]

How do I determine a delimiter in a text file

I have 2 types of input files:
1. comma delimited (e.g. lastName, firstName, Address)
2. space delimited (e.g. lastName firstName Address)
The comma delimited file HAS spaces between the ',' and the next word.
How do I go about determining which file I am dealing with ?
I am using C# btw
I've done tons of work with various delimited file types and as everyone else is saying, without normalization you can't really handle the whole thing programmatically.
Generally (and it seems like it would be totally necessary for space-delimited files) a delimited file will have a text qualifier character (often double quotes). A couple of examples of this point:
Space Delimited:
lastName "Von Marshall" is impossible
without qualifiers.
Addresses would be altogether impossible as well.
Comma Delimited:
addresses are generally unworkable unless they are broken into separate fields or having a solid string is acceptable for your use-case.
So the space delim should be easy enough to determine since you're looking for " ". If this is the case I'd (personally) replace all " " with "," to change it to comma-delim. That way you'd only have to build a single method for handling the text, otherwise I imagine you'll need methods for spaces and commas separately.
If your comma-delim file does not have a text qualifier, you're in a really tricky spot. I haven't found any "perfect" way of addressing this without any human work, but it can be minimized. I've used Notepad++ a lot to do batch replacement with its regular expression functions.
However, you can also use C#'s regex abilities. Here's what MSDN says on that.
So, to answer your question to the best of my ability: unless you can establish some uniqueness between the 2 file types, there's no way. However, if the text has proper text qualifiers, the files have different file extensions, or they are generated in different directories, you could use any of those qualities or a mix thereof to decide what type of file it is. I have no experience doing this as yet (though I've just started a project using it), so I can't give an exact example, but I can say that for anyone to build a perfect example, it'd be best if you showed example strings for each file.
As other users have said, without some guarantee of there being no commas in the space-delimited version, you cannot do this with 100% accuracy.
With some extra information, say that every record will always have exactly three fields when parsed correctly, you could just parse both ways and test the results for the correct number of fields. Address is a big problem here though, since we do not know what that format could be. Also, these rules seem odd at best when talking about addresses... is
1111somestreet.houston,tx11111 or
1111 somestreet st. Houston, Tx 11111
a valid format?
You could count the number of commas per line of the file. If you have at least 2 commas per line (considering your info is last name, first name, address), you probably have a comma separated. If you have, in at least one line, less than 2 commas, you should consider it as space separated.
I, however, would skip this step and ignore the commas when evaluating the input by replacing all of them by spaces and would implement a single read/grab information procedure (considering only space separated files).
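A small sketch of the comma-counting idea (the two-comma threshold simply follows the lastName, firstName, Address layout from the question, and the method name is illustrative):

using System.IO;
using System.Linq;

static char DetectDelimiter(string path)
{
    // If every non-empty line contains at least two commas, assume comma-delimited;
    // otherwise fall back to space-delimited.
    var lines = File.ReadLines(path).Where(l => l.Trim().Length > 0);
    bool allHaveTwoCommas = lines.All(l => l.Count(c => c == ',') >= 2);
    return allHaveTwoCommas ? ',' : ' ';
}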

Regular expression for valid filename

I have already gone through some questions on Stack Overflow regarding this, but nothing helped much in my case.
I want to restrict the user to provide a filename that should contain only alphanumeric characters, -, _, . and space.
I'm not good with regular expressions, and so far I have come up with ^[a-zA-Z0-9.-_]$. Can somebody help me?
This is the correct expression:
string regex = @"^[\w\-. ]+$";
\w is equivalent to [0-9a-zA-Z_].
To validate a file name I would suggest using the function provided by C# rather than a regex:
if (filename.IndexOfAny(System.IO.Path.GetInvalidFileNameChars()) != -1)
{
    // the name contains at least one character that is not valid in a file name
}
While what the OP asks is close to what the currently accepted answer uses (^[\w\-. ]+$), there might be others seeing this question who have even more specific constraints.
First off, running on a non-US/GB machine, \w will allow a wide range of unwanted characters from foreign languages, according to the limitations of the OP.
Secondly, if the file extension is included in the name, this allows all sorts of weird looking, though valid, filenames like file .txt or file...txt.
Thirdly, if you're simply uploading the files to your file system, you might want a blacklist of files and/or extensions like these:
web.config, hosts, .gitignore, httpd.conf, .htaccess
However, that is considerably out of scope for this question; it would require all sorts of info about the setup for good guidance on security issues. I thought I should raise the matter none the less.
So for a solution where the user can input the full file name, I would go with something like this:
^[a-zA-Z0-9](?:[a-zA-Z0-9 ._-]*[a-zA-Z0-9])?\.[a-zA-Z0-9_-]+$
It ensures that only the English alphabet is used, no leading or trailing spaces, and enforces a file extension that is at least 1 character long and contains no whitespace.
I've tested this on Regex101, but for future reference, this was my "test-suite":
## THE BELOW SHOULD MATCH
web.config
httpd.conf
test.txt
1.1
my long file name.txt
## THE BELOW SHOULD NOT MATCH - THOUGH VALID
æøå.txt
hosts
.gitignore
.htaccess
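If you want to run the same cases from C# rather than Regex101, a small harness along these lines should reproduce the results (the name list simply mirrors the cases above):

using System;
using System.Text.RegularExpressions;

var pattern = new Regex(@"^[a-zA-Z0-9](?:[a-zA-Z0-9 ._-]*[a-zA-Z0-9])?\.[a-zA-Z0-9_-]+$");

string[] names = { "web.config", "httpd.conf", "test.txt", "1.1", "my long file name.txt",
                   "æøå.txt", "hosts", ".gitignore", ".htaccess" };

foreach (var name in names)
    Console.WriteLine($"{name}: {pattern.IsMatch(name)}");
// The first five print True, the last four print False.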
In case someone else needs to validate filenames (including Windows reserved words and such), here's a full expression:
\A(?!(?:COM[0-9]|CON|LPT[0-9]|NUL|PRN|AUX|com[0-9]|con|lpt[0-9]|nul|prn|aux)|[\s\.])[^\\\/:*"?<>|]{1,254}\z
Extended expression (don't allow filenames starting with 2 dots, don't allow filenames ending in dots or whitespace):
\A(?!(?:COM[0-9]|CON|LPT[0-9]|NUL|PRN|AUX|com[0-9]|con|lpt[0-9]|nul|prn|aux)|\s|[\.]{2,})[^\\\/:*"?<>|]{1,254}(?<![\s\.])\z
Edit:
For the interested, here's a link to Windows file naming conventions:
https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247(v=vs.85).aspx
Use this regular expression: ^[a-zA-Z0-9._ -]+$
This is a minor change to Engineer's answer.
string regex = @"^[\w\- ]+[\w\-. ]*$";
This will block ".txt", which isn't valid.
Trouble is, it also blocks "..txt", which is valid.
For full character set (Unicode) use
^[\p{L}0-9_\-.~]+$
or perhaps
^[\p{L}\p{N}_\-.~]+$
would be more accurate if we are talking about Unicode.
I added a '~' simply because I have some files using that character.
I've just created this. It prevents consecutive dots and a dot at the beginning or end. It doesn't allow two dots anywhere at all, though.
^([a-zA-Z0-9_]+)\.(?!\.)([a-zA-Z0-9]{1,5})(?<!\.)$
When used in HTML5 via pattern:
<form action="" method="POST">
<fieldset>
<legend>Export Configuration</legend>
<label for="file-name">File Name</label>
<input type="text" required pattern="^[\w\-. ]+$" id="file-name" name="file_name"/>
</fieldset>
<button type="submit">Export Settings</button>
</form>
This will validate against all valid file names. You can remove required to prevent the native HTML5 validation.
I may be saying something stupid here, but it seems to me that these answers aren't correct. Firstly, are we talking Linux or Windows here (or another OS)?
Secondly, in Windows it is (I believe) perfectly legitimate to include a "$" in a filename, not to mention Unicode in general. It certainly seems possible.
I tried to get a definitive source on this... and ended up at the Wikipedia Filename page: in particular the section "Reserved characters and words" seems relevant, and these are, clearly, a list of things which you are NOT allowed to put in.
I'm in the Java world. And I naturally assumed that Apache Commons would have something like validateFilename, maybe in FilenameUtils... but it appears not (if it had done, this would still be potentially useful to C# programmers, as the code is usually pretty easy to understand, and could therefore be translated). I did do an experiment, though, using the method normalize: to my disappointment it allowed perfectly invalid characters (?, etc.) to "pass".
The part of the Wikip Filename page referenced above shows that this question depends on the OS you're using... but it should be possible to concoct some simple regex for Linux and Windows at least.
Then I found a Java way (at least):
Path path = java.nio.file.FileSystems.getDefault().getPath( "bobb??::mouse.blip" );
output:
java.nio.file.InvalidPathException: Illegal char at index 4:
bobb??::mouse.blip
... presumably different FileSystem objects will have different validation rules
Copied from @Engineer for future reference, as the dot was not escaped (as it should be) in the most voted answer.
This is the correct expression:
string regex = @"^[\w\-\. ]+$";

Correct Hebrew character sequence in C# and searchable PDFs

I'm testing an SDK that extracts text from a searchable PDF. One of the SDK's dependencies was recently updated, and it's causing an existing test on Hebrew text to fail. I don't know Hebrew nor enough about how the involved technologies represent right-to-left languages.
The NUnit test asserts that the extracted text matches the C# string "מנבוצץז ".
string hebrewText = reader.ReadToEnd();
Assert.AreEqual("מנבוצץז ", hebrewText);
The rasterized PDF has what I believe are the same characters, but in the opposite order.
The unit test fails with this message:
Expected: "מנבוצץז "
But was: " זץצובנמ"
Although the actual result more closely matches what I see in the rasterized PDF, I'm not completely sure the original test is wrong.
Are Hebrew characters in a C# string supposed to be read right-to-left like printed Hebrew text?
Does any part of the .NET stack tamper with the direction of Hebrew strings?
What about NUnit?
Are Hebrew characters embedded in a searchable PDF normally supposed to go in the same direction as the rasterized text?
Anything else I should know before deciding whether to "fix" this unit test?
There are various ways to encode RTL languages. The most common way (and Window's default) is to use logical ordering, which means the first letter is encoded as the first character in a string (or file). So whether visually the first letter appears on the left or right side of the screen doesn't affect the order in which they are stored.
Now, as for the text appearing in Visual Studio, it depends on the version. As far as I remember, prior to Visual Studio 2010 the code editor displayed Hebrew backwards, and it was apparent when you tried to select Hebrew text: the selection reversed in an odd way (which was visually confusing). It appears this issue no longer exists in Visual Studio 2010 (at least with SP1, which I just tested).
Let's take a Hebrew word for which the direction is clearer to non-Hebrew speakers than the string specified in your test:
יון
The word happens to be the Hebrew word for an ion, and on your screen, it should appear as three letters where the tallest letter is on the left and the shortest is on the right. In a .NET string, the expression "יון".Substring(0, 1) will produce the short letter, since it's the first letter in the string. The string can also be written as "\u05D9\u05D5\u05DF" where the leftmost Unicode character \u05D9 represents the short letter displayed on the right, which clearly demonstrates the order in which the letters are stored.
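A quick way to see this from C# (the escaped string below is the same word, just written with explicit Unicode escapes):

using System;

string ion = "\u05D9\u05D5\u05DF";        // stored in logical order
Console.WriteLine(ion[0] == '\u05D9');    // True: the first stored character is the short letter
Console.WriteLine(ion.Substring(0, 1));   // prints the letter that appears visually on the right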
Since the string in your test case is nonsensical, I can't tell you whether it was a wrong test all along or if it is a correct test that should pass. If the image you uploaded has been rendered correctly, then it appears the actual result of your test is correct and the expected value is incorrect, and so you should fix the test.
I believe that all strings in C# will be stored internally as LTR; RTL strings will have a non-printable character (or something) denoting that they are indeed RTL.
More than likely. RTL GUIs and rendered text for example need certain properties (specifically RightToLeft and RightToLeftLayout) to be set in order to display correctly.
NUnit shouldn't. Nor should it care. IMHO a reversed string != the original string.
I couldn't comment. I'd assume that they should be whatever the test is expecting though, assuming it was passing at first.
Don't do half measures with RTL; it really doesn't tolerate them. Either have full RTL support, or nothing. It can be pretty nasty; I wish you the best of luck!
