I'm trying to read a structure of a text file in a certain way. The text file is kind of a user-friendly configuration file.
Current structure of file (structure can be changed if necessary):
info1=exampleinfo
info2=exampleinfo2
info3="example","example2","example3"
info4="example","example2","example3"
There is no real difficulty in getting the first two lines, but the latter two are more difficult. I need to put each of them into a separate string array that I can use. I could split the string, but the problem is that in the info4 array the values can contain commas (this is all user input).
How to go about solving this?
The reason you're having trouble writing a parser is that you're not starting with a good definition of the file format. Instead of asking how you should parse it if there are commas, you should be deciding how to properly encode values with commas. Then parsing is simple.
If this file is written by non-technical users who can't be trusted with a complex format (like JSON), consider a format like:
info1=exampleinfo
info2=exampleinfo2
info3=example
example2
example3
info4=example
example2
example3
That is, don't mess around with quotes and commas. Users understand line breaks and spaces pretty well.
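A minimal sketch of how such a file could be read, assuming that a line containing '=' starts a new key and that every other non-empty line is one more value for the most recent key. The class and method names are hypothetical, and a value that itself contains '=' would need an extra rule:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for the line-per-value format; not from the original answer.
public static class LineConfigParser
{
    // Assumption: "key=value" starts a new key; any other non-empty line
    // is an additional value for the key that was seen last.
    public static Dictionary<string, List<string>> Parse(IEnumerable<string> lines)
    {
        var result = new Dictionary<string, List<string>>();
        List<string> current = null;

        foreach (var raw in lines)
        {
            var line = raw.Trim();
            if (line.Length == 0) continue; // skip blank lines

            int eq = line.IndexOf('=');
            if (eq > 0)
            {
                // New key: the text after '=' (if any) is its first value.
                current = new List<string>();
                result[line.Substring(0, eq).Trim()] = current;
                var first = line.Substring(eq + 1).Trim();
                if (first.Length > 0) current.Add(first);
            }
            else if (current != null)
            {
                // Continuation line: commas need no special handling here.
                current.Add(line);
            }
        }
        return result;
    }
}
```

The lines could come from File.ReadLines(path); Parse(lines)["info4"] would then hold the three values, commas included.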
I'm 100% in favor of #DavidHeffernan's solution: JSON would be great. And #ScottMermelstein's solution of a program that builds the output is probably your best bet if possible, since it keeps the user from making a mistake even if they wanted to.
However, if you need them to build the text file by hand, and you're working with users who can't be trusted to put together valid JSON (it is a picky format), maybe try separating values with a delimiter that the users won't use.
For example, pipes are always good, since practically nobody uses them:
info1=exampleinfo
info2=exampleinfo2
info3=example|example2|example3
info4=example|exam,ple2|example3
All you'd need is a rule that says their data cannot contain pipes. More than likely, the users would be ok with that.
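As a rough sketch of the pipe-delimited variant, assuming one key=value pair per line and no pipes in the data (PipeConfigParser and ParseLine are made-up names for illustration):

```csharp
using System;

// Hypothetical helper for the pipe-delimited format.
public static class PipeConfigParser
{
    // Splits "key=a|b|c" into the key and its values.
    public static (string Key, string[] Values) ParseLine(string line)
    {
        // Split on the first '=' only, so a value may itself contain '='.
        var parts = line.Split(new[] { '=' }, 2);
        if (parts.Length != 2)
            throw new FormatException("Expected key=value: " + line);

        // Commas are ordinary characters here; only '|' separates values.
        return (parts[0].Trim(), parts[1].Split('|'));
    }
}
```

For the info4 line above this yields the values example, exam,ple2 and example3.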
Why I don't want to use Resx files:
I am looking for an alternative for resx files to offer multilanguage support for my project, due to the following reasons:
I don't like having to specify a "messageId" when writing messages. It is more effort and it interrupts my flow: I can't see what the log message would actually say, and I need to open another tab to edit the message.
Sometimes I use code inline because I don't want to create new variables for trivial steps (e.g. Log.Info("Iterated {i+1} times");). Using variables or doing simple calculations inline sometimes makes the code clearer than adding extra lines.
What I could imagine instead:
An external application which crawls a compiled exe for all strings, giving you the opportunity to ignore or add strings which should be translated. It could then also create an XML or JSON file for each language. It would replace all strings with a hash/ID so that a lookup for strings in all languages is still possible.
Am I the only one who is not happy with the commonly used Resx / centralized string DB solution? Am I missing reasons why this wouldn't be a good idea?
One reason for relying on established approaches instead of implementing your own format is translation. It really depends on how your resources are translated: if it is done by volunteers with a technical background who don't mind working in a plain text editor, then you are free to come up with your own resource format. If on the other hand you send out your resources to professional translators who are not very technical and who prefer to work in a translation environment with integrated terminology management, translation memory, spelling and quality checks etc. it is quite likely that this environment will not be able to handle your homemade resource format.
Since I already mentioned professional translation environments: some of these tools rely on IDs to figure out which strings are old and which are new. If you use the approach where the text is the ID, every fixed typo in your source language creates a new string that needs to be translated, and paid for. With a stable ID, by contrast, the translator sees that the source text for a string has changed, can look at the change, notice that a typo has been fixed, decide that the translation is still OK and sign the string off, without extra translation cost.
By the way, if you want good localizations for strings like Log.Info("Iterated {i+1} times"); you have to find some way of dealing with plural forms correctly. Some languages have different grammatical rules for different numbers (see the Unicode Language Plural Rules for an overview). Just because something is easy to do in code does not mean that it is easy to localize, I'm afraid.
To sum this up: if you want to create your own resource format, talk with your translators. Ask them which formats they can handle. Think about translation-related limitations that come with your format, for example whether there are any characters that the translators should not use because they break your strings. Apostrophes and quotes are prime candidates here because they are often used as string delimiters in resource files, or < and & if you decide to go the XML way. Think about a conversion to XLIFF and back: most translation environments can handle XLIFF.
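To make the plural-form point concrete, here is a simplified sketch of plural category selection for English and Russian, restricted to non-negative integers. The class and method names are invented for illustration, and the rules are a condensed reading of the Unicode plural rules, not a complete implementation:

```csharp
using System;

// Hypothetical illustration of why "Iterated {n} times" is hard to localize.
public static class PluralSketch
{
    // English: one form for exactly 1, another for everything else.
    public static string EnglishCategory(int n)
    {
        return n == 1 ? "one" : "other";
    }

    // Russian, non-negative integers only: three forms are needed.
    public static string RussianCategory(int n)
    {
        int mod10 = n % 10;
        int mod100 = n % 100;

        // 1, 21, 31, ... but not 11
        if (mod10 == 1 && mod100 != 11) return "one";

        // 2-4, 22-24, ... but not 12-14
        if (mod10 >= 2 && mod10 <= 4 && (mod100 < 12 || mod100 > 14)) return "few";

        // 0, 5-20, 25-30, ...
        return "many";
    }
}
```

A translator therefore needs one string per category and language, which a single inline format string cannot express.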
I've spent quite a bit of time trying to figure out the best way to handle this. I'm HTML encoding rich text from untrusted user input prior to storing it in the database.
I've bounced back and forth between multiple discussions, and it seems the safest method is to:
HTML encode absolutely everything, and only decode based on a white/safe list prior to sending it back to the client.
However, I'm also seeing strong suggestions for using http://htmlagilitypack.codeplex.com/
This compares user input against your safe/white list.
I've read:
C# HtmlDecode Specific tags only
https://eksith.wordpress.com/2011/06/14/whitelist-santize-htmlagilitypack/
And, really, about 10 other posts, and I have become frustrated because I still can't figure out the best way to handle this.
I've tried using regular expressions with Regex.Replace:
For Each tag In AcceptableTags.Split(CChar("|")).ToList()
    pattern = "<" + "\s*/?\s*" + tag + ".*?" + ">"
    Regex = New Regex(pattern)
    input = Regex.Replace(input, pattern)
Next
This doesn't seem to work well at all.
Is there someone out there who has a tried and true method with an example implementation they wouldn't mind sharing? I'll take C# or VB.NET.
Depends on your data. Whitelist on the initial validation is fine if, for example, you're trying to avoid HTML in a phone number. On the other hand, if you can't be specific about what's in and what's out then just leave it "raw".
It's highly unlikely that storing encoded data in a database is the correct thing to do.
Any system of even marginal complexity will have non-HTML clients it will have to serve data to. When you do have an HTML client, you need to escape the output appropriate to HTML. Same for XML. Similarly, if you decide today you like JSON better, you'll encode to that. CSV? No problem - put quotes around your values (and escape any quotes) in case they have commas. Use parameters when doing SQL. Get the idea?
TL;DR:
Whitelist input if you can
Saving specifically encoded data is probably wrong
Always, always, always escape appropriate to your output
Never try and do your own escaping - always use a trusted library. You will never do a good enough job.
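A minimal sketch of "store raw, escape on output" for the plain-text case, using the framework's own System.Net.WebUtility.HtmlEncode. The class and method names are hypothetical. Because this encodes every tag, a rich-text whitelist would still need a dedicated sanitizer library on top of it:

```csharp
using System;
using System.Net;

// Hypothetical output helper; the database is assumed to hold the raw text.
public static class OutputEscaping
{
    // Encode at the moment of writing HTML, not before storing.
    public static string ToHtmlParagraph(string rawFromDatabase)
    {
        // WebUtility.HtmlEncode escapes <, >, &, and quote characters.
        return "<p>" + WebUtility.HtmlEncode(rawFromDatabase) + "</p>";
    }
}
```

The same raw value can then be escaped differently for a JSON, CSV or SQL consumer without first having to undo an HTML encoding.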
I'm part of a small "message board" type project being built in a C# Web Form. I need to parse the user-entered text for objectionable words. This is my first C# project and I'm not sure how to split the words in the textbox.
It's been requested that I make an XML config file to contain the words to be screened for. Ideally, I would like to do a fark.com style replace. I've never made an XML config file and I really just need a place to start. All the config file information I've found has not been particularly applicable to this scenario.
Edit:
I ended up using a .txt file and splitting it on whitespace, then parsing the textbox on whitespace and comparing words. The project leader wanted a config file, but I pitched him on the simple solution and we went for it. Thanks for the replies.
An XML file won't scale well, especially if accessed concurrently. You'd be better off using a database engine for such a task.
Making an XML config file just to filter a bunch of words probably isn't the best way to go there, considering it's most likely just going to be a giant list of strings...
If it's not, have a look at the XmlDocument class and the System.Xml namespace. I assume you're aware of the format for XML documents but, if not, here is a simple example. The format is pretty much open to whatever XML tags you want, but the XmlDocument class I linked you to does have some fairly annoying catches that you'll come across while implementing it.
In terms of splitting the user text, it's fairly easy to hide "bad" words in another string so I'm not sure String.Split() is even what you want either. You will probably want to Regex it.
With that said, I came across this blog post a while ago that offers a simple profanity filter for .NET using Regex. Perhaps it will suit your needs.
Depends on how large this "bad words list" will be, and whether you expect it to change.
If it's pretty static, I would load the list from your XML file into some kind of in-memory collection. Then for each line of text you receive, parse the line into words, and then check each word for its existence in the collection.
If it's going to change frequently, and you need to pick up on those changes quickly, then you want more random access... that means a database. Hitting an XML file repeatedly would be a performance drag.
Either way, split the string and react to each hit.
The string can be split up using something like:
myLineOfText.Split(new String[] { " " }, StringSplitOptions.RemoveEmptyEntries);
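Putting the pieces together, a rough sketch of the in-memory approach might look like this. WordFilter and Censor are made-up names, the word set is assumed to have been loaded from the config file already, and words with attached punctuation are not caught by this simple split:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical filter built on the in-memory collection idea.
public static class WordFilter
{
    // Replaces every word found in badWords. Case sensitivity depends on the
    // comparer the set was created with, e.g. StringComparer.OrdinalIgnoreCase.
    public static string Censor(string text, ISet<string> badWords, string replacement = "***")
    {
        var words = text.Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);

        // Runs of spaces collapse to one space because empty entries are dropped.
        return string.Join(" ", words.Select(w => badWords.Contains(w) ? replacement : w));
    }
}
```

A HashSet<string> gives constant-time lookups on average, so the cost per message grows with the number of words in the message rather than with the size of the list.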
I am being completely hypothetical at this point, but since I am new to C#, I wanted to ask the opinion of others to see what the better ways of approaching this might be. At this point, I have a program that is looking for tags and comparing them to a master list of tags. However, at the moment, each tag is read and registered as a 24-character string. The strings are fine for the program, but I would like the output to reference a database that translates each of these strings, so that when the final program outputs the tags that have been found and the ones that are missing, the tags have appropriate names along with them, and not just a complicated string of characters.
Since I am new, I would just like to see if anyone can give me ideas on how to handle this and possibly point me in the right direction to get started.
Thanks.
My asp.net web app is currently being developed and I want to handle any language input by the user. This input will then be displayed to other users on the site.
So far I have done the following:
Put this in the head - meta http-equiv="Content-Type" content="text/html; charset=utf-8"
Saved inputs in NVARCHAR fields
Do I need to do anything else? Do I need any other meta tags (content-language, etc)?
Also think of a way to localize your UI, either via resources or with an appropriate support in your database. If the users are expected to generate non-English content, they will definitely appreciate seeing UI in their native language.
You should remember not to make assumptions that are not valid in general.
A fairly common assumption that is wrong is that (str.ToUpper().ToLower() == str) for any string str. A more subtle assumption is that the concept of "upper" and "lower" case even makes sense for any given language.
Another frequent problematic assumption is that a single char in the input is always an actual character from user's perspective. This is wrong - even setting things such as surrogate pairs aside, there are also combining characters. You either have to normalize your strings (and even that isn't 100% foolproof), or just avoid dealing with individual chars.
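As a small illustration of the combining-character point, System.Globalization.StringInfo counts text elements instead of individual chars. The wrapper class below is hypothetical, and exactly how text elements are delimited varies somewhat between .NET versions:

```csharp
using System;
using System.Globalization;

// Hypothetical wrapper around StringInfo for illustration.
public static class TextElements
{
    // Counts user-perceived characters rather than UTF-16 code units.
    public static int Count(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }
}
```

For the string "e\u0301" (an e followed by a combining acute accent), Length is 2 while TextElements.Count returns 1.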
If you want to deal with more than just plain text input displayed verbatim - i.e. full, proper localization - you'll also have to handle number, date, currency etc formats correctly; and, for example, do not assume that decimal separator is a dot.
My best general advice would be to just go and read the blog of Michael Kaplan, Microsoft's resident guru on localization and related issues. Look for categories (tags) such as "Collation/Casing", "Encoding/Codepages" and "Int'l Programming". There's a lot of stuff there, and most of it is either directly relevant to your question, or interesting, or both. If, after reading a couple of his blog posts, you start thinking that maybe hiring a localization expert just to point out potential non-obvious problems in that area is a good idea, then you're probably right :)
Whenever you use String.Format, pass the client's culture explicitly. FxCop can help you find the places where this is missing.
Move string constants out of the .cs code.
Place images (that can contain culture specific text) into skin files or resources.
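For the first point, a minimal sketch of passing the culture explicitly. CultureFormatting and FormatAmount are illustrative names, and how the client's CultureInfo is obtained (for example from the request's language settings) is left as an assumption:

```csharp
using System;
using System.Globalization;

// Hypothetical helper showing an explicit culture argument.
public static class CultureFormatting
{
    // Formats a number with two decimals using the given culture's
    // decimal and group separators instead of the server's defaults.
    public static string FormatAmount(decimal amount, CultureInfo culture)
    {
        return string.Format(culture, "{0:N2}", amount);
    }
}
```

With CultureInfo.InvariantCulture, 1234.5m is formatted as 1,234.50; a culture such as de-DE would typically swap the two separators.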
Browsers determine the charset in the following order:
Content-Type http header (value example: "text/html; charset=utf-8")
XML declaration
meta attribute
You should check that the web server does not send conflicting content type information in headers.
Make sure you save the files in UTF-8.