I have a large JSON string, and I need to remove any number of leading and trailing spaces from its property values (in C#), e.g.
"Some Property Name": " Some Value "
needs to change to:-
"Some Property Name": "Some Value"
I have the option to do this via a regex replace on the json string before it is converted into a newtonsoft json object, or loop through the json object's properties after it has been converted.
Does anybody have any thoughts on the best way to do this?
Your second option is the safest one.
Any time that you have to modify a structured text of some kind (XML, HTML, JSON, C#, etc.) the safest option is to parse, modify, and re-format. Otherwise, you run the risk of changing things that you did not plan to change.
In your particular scenario a regex solution may unintentionally strip leading spaces from quoted strings inside a string, for example
"Some Property Name": " Say \" Hello, world!\" two times "
Corner cases like this often go unnoticed when developing a regex-based solution. On the other hand, parser-based solutions do not treat these situations as "corner cases," because all the complexity of understanding the format is shifted into the parser.
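A minimal sketch of the parse, modify, re-format approach, assuming Json.NET (Newtonsoft.Json) is referenced. The class and method names (`JsonTrimmer`, `TrimStringValues`) are made up for illustration, and the sketch trims only string values, never property names:

```csharp
using System.Linq;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class JsonTrimmer
{
    // Parses the JSON, trims every string *value*, and serializes it again.
    // Property names and non-string values are left untouched.
    public static string TrimStringValues(string json)
    {
        JToken root = JToken.Parse(json);

        // A bare top-level value has no descendants, so handle it directly.
        if (root is JValue single)
        {
            TrimIfString(single);
            return root.ToString(Formatting.None);
        }

        // Collect the values first, then update each one in place.
        var values = ((JContainer)root).Descendants().OfType<JValue>().ToList();
        foreach (JValue value in values)
        {
            TrimIfString(value);
        }
        return root.ToString(Formatting.None);
    }

    private static void TrimIfString(JValue value)
    {
        if (value.Type == JTokenType.String)
        {
            value.Value = ((string)value.Value).Trim();
        }
    }
}
```

Because the parser has already decoded the escapes, the inner `\" Hello, world!\"` in the example above keeps its space; only the outer ends of the value are trimmed.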
I'm getting some JSON from an outside source that can't be changed, and apparently they don't understand the rules about escaping characters correctly in JSON string values. So they have a string value that might have tabs in it, for example, that should have been escaped, and other invalid escape sequences like \$. I'm trying to parse this with JSON.Net but it keeps falling over on these sequences.
For example, the source might look something like this:
{
"someRegularProp": 10,
"aNormalString": "foo bar etc",
"anInvalidString": "foo <tab \$100"
}
and it's parsed with
var obj = JObject.Parse(json);
So I can fix this specific case with something like:
json = json.Replace("\t", "").Replace("\\$", "$"); // note: in this case I'm fine with just stripping the tabs out
But is there a general way to fix these problems and remove invalid escape sequences before parsing? I don't know what other invalid sequences they might put in there.
I don't see a general way. Obviously they are using a buggy library, or no library at all, to generate this output. Unless you can find out more about how it is produced, all you can do is test as much of their output as possible to find all the problems.
Perhaps write a script that collects as much output as possible and validates all of it; then you can be at least a bit more sure.
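Such a validation script could look roughly like this. It is only a sketch: it assumes Json.NET is referenced and that you have already collected the sample documents as strings, and `JsonSampleChecker` and `FindInvalidSamples` are names invented for this example:

```csharp
using System.Collections.Generic;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class JsonSampleChecker
{
    // Returns one message per sample that JSON.Net refuses to parse.
    public static List<string> FindInvalidSamples(IEnumerable<string> samples)
    {
        var problems = new List<string>();
        int index = 0;
        foreach (string sample in samples)
        {
            try
            {
                JToken.Parse(sample);
            }
            catch (JsonReaderException ex)
            {
                // LineNumber and LinePosition point at where the reader gave up.
                problems.Add($"sample {index}: line {ex.LineNumber}, position {ex.LinePosition}: {ex.Message}");
            }
            index++;
        }
        return problems;
    }
}
```

The collected messages give you a list of the distinct kinds of breakage to handle, which is about as close to "general" as this gets.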
I have 2 types of input files:
1. comma delimited (e.g. lastName, firstName, Address)
2. space delimited (e.g. lastName firstName Address)
The comma delimited file HAS spaces between the ',' and the next word.
How do I go about determining which file I am dealing with ?
I am using C# btw
I've done tons of work with various delimited file types and as everyone else is saying, without normalization you can't really handle the whole thing programmatically.
Generally (and it seems like it would be totally necessary for space-delim) a delimited file will have a text qualifier character (often double quotes). A couple of examples of this point:
Space Delimited:
lastName "Von Marshall" is impossible
without qualifiers.
Addresses would be altogether impossible as well.
Comma Delimited:
addresses are generally unworkable unless they are broken into separate fields or having a solid string is acceptable for your use-case.
So the space delim should be easy enough to determine since you're looking for " ". If this is the case I'd (personally) replace all " " with "," to change it to comma-delim. That way you'd only have to build a single method for handling the text, otherwise I imagine you'll need methods for spaces and commas separately.
If your comma-delim file does not have a text qualifier, you're in a really tricky spot. I haven't found any "perfect" way of addressing this without any human work, but it can be minimized. I've used Notepad++ a lot to do batch replacement with its regular expression functions.
However, you can also use C#'s regex abilities. Here's what MSDN says on that.
So, to answer your question to the best of my ability: unless you can establish something unique about each of the 2 file types, there's no way. However, if the text has proper text qualifiers, the files have different file extensions, or they are generated in different directories, you could use any of those qualities or a mix thereof to decide what type of file it is. I have no experience doing this as yet (though I've just started a project using it), so I can't give an exact example, but for anyone to build a perfect example it would be best if you showed example strings for each file.
As other users have said, without some guarantee that the space-delimited version contains no commas, you cannot do this with 100% accuracy.
With some extra information, say that every record always has exactly three fields when parsed correctly, you could parse each file both ways and test the results for the correct number of fields. Address is a big obstacle here, though, since we do not know what that format could be. These rules also seem odd at best when talking about addresses. Is
1111somestreest.houston,tx11111 or
1111 somestreet st. Houston, Tx 11111
a valid format?
You could count the number of commas per line of the file. If you have at least 2 commas on every line (considering your info is last name, first name, address), you probably have a comma-separated file. If at least one line has fewer than 2 commas, you should treat the file as space separated.
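The comma-counting heuristic can be sketched as follows. The three-field assumption comes from the question; `DelimiterGuesser` and `GuessDelimiter` are hypothetical names, and the result is only a guess, since an address may itself contain commas:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class DelimiterGuesser
{
    // Assumes a fixed number of fields per record (lastName, firstName, address),
    // so every line of a comma-delimited file has at least expectedFields - 1 commas.
    public static char GuessDelimiter(IEnumerable<string> lines, int expectedFields = 3)
    {
        int minCommas = expectedFields - 1;
        bool allLinesHaveEnoughCommas = lines
            .Where(line => !string.IsNullOrWhiteSpace(line))   // ignore blank lines
            .All(line => line.Count(c => c == ',') >= minCommas);
        return allLinesHaveEnoughCommas ? ',' : ' ';
    }
}
```

You would typically feed it `File.ReadLines(path)`. Note that an empty file trivially satisfies the check and is reported as comma delimited.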
I, however, would skip this step and ignore the commas when evaluating the input by replacing all of them by spaces and would implement a single read/grab information procedure (considering only space separated files).
I get JSON from a server. The JSON is very big, so I'll show just a little piece of it:
{
"id": "9429531978965160",
"name": "Morning in "Paris"", // json.net cannot deserialize this line, because the quotes inside the value are not escaped.
"alias": "ThisAlias"
}
The problem is the server side that generates invalid JSON.
You could try writing a regex that fixes this (searches for any quotes in between the third and last). Just note that there might be many other issues with the JSON, like newlines that are not escaped etc.
It's not just that the output you are receiving is non-standard json, it's broken in such a way that it's not a well-defined language and doesn't parse unambiguously even in the simple cases. How should you parse {"a": "A", "b": "B"}? One way is as legal json. Another valid parse is a single property a with the value "A\", \"b\": \"B".
As others have said, the best resolution is to fix the server so that it no longer outputs invalid garbage. If that's not an option, you'll have to write your own parser. A normal parser would declare a syntax error at the 'P' in "Paris". Your parser could back up to the last quote token and try to treat it as if it were escaped. The next syntax error is at the second of the consecutive quotes, and again it could back up and treat the quote token as if it were escaped. If there are any other ways in which the input deviates from legal JSON, you'll need to handle those as well.
If you're not familiar with parsers, this will take a while. And when you're done you'll have a parser that recognizes a poorly-specified and almost totally useless language, which is to say that it will largely be a waste of time. Do what you can to fix it on the server side.
Hi there, I am looking for best practices or ideas for cleaning tags, or at least grabbing the data from within custom tags in a text.
I am sure I can code some sort of "parser" that will go through every line manually, but isn't there some smarter way today?
Data thoughts:
{Phone:555-123456789}
here we have "phone" being the key and the number as the data. Looks a lot like JSON format but its easier to write for a human.
or
{link: article123456 ; title: Read about article 123456 here }
Could be normal (X)HTML too:
<a href="article123456.html" > Read about article 123456 here </a>
Humans aren't always careful to "trim" their input, and neither are old websites made with lazy WYSIWYG editors, so I first need to figure out which pairs belong together and then, after finding the "data within", trim the results.
The problem in the "title" part above is that there are no " " surrounding the title text, so the code could either add them automatically or show the error to the human.
Any thoughts on how to grab this data the best way? There seem to be several ways that might work, but what's your best approach to this problem?
I would first write a "tokenizer" for the syntax of the data I was parsing. A tokenizer is a (relatively) simple process that breaks a string down into a series of fragments, or tokens. For example, in your first two cases your basic tokens would consist of: "{", "}", ":", ";", and everything else would be interpreted as a data token. This can be done with a loop, a recursive function, or a number of other ways. Tokenizing your second example would produce an array (or some other sort of list) with the following values:
"{", "link", ":", " article123456 ", ";", " title", ":", " Read about article 123456 here ", "}"
The next step would be to "sanitize" your data, though in these cases all that really means is removing unwanted whitespace. Iterate through the token array that was produced, and alter each token so that there is no beginning or ending whitespace. This step could be combined with tokenization, but I think it's much cleaner and clearer to do it separately. Your tokens would then look like this:
"{", "link", ":", "article123456", ";", "title", ":", "Read about article 123456 here", "}"
And finally, the actual "interpretation." You'll need to convert your token array into whatever sort of actual data structure you intend to be the final product of the parsing process. For this you'll definitely want a recursive function. If the function is called on a data token, followed by a colon token, followed by a data token, it will interpret them as a key-value pair and produce a data structure accordingly. If it is called on a series of tokens with semicolon tokens, it will split the tokens up at each semicolon and call itself on each of the resulting groups. And if it is called on tokens contained within curly-brace tokens, it will call itself on the contained tokens before doing anything else. Note that this is not necessarily the order in which you'll want to check for these various cases; in particular, if you intend to nest curly braces (or any other sort of grouping tokens, such as square brackets, angle brackets, or parentheses), you'll need to make sure to interpret those tokens in the correct nested order.
The result of these processes will be a fully parsed data structure of whatever type you'd like. Keep in mind that this process assumes that your data is all implicitly stored as the string type; if you'd like "3" and 3 to be interpreted differently, then things get a bit more complicated. This method I've outlined is not at all the only way to do it, but it's how I'd approach the problem.
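Here is a rough C# sketch of the three steps (tokenize, sanitize, interpret). It is deliberately simplified: it handles a single flat `{key: value; key: value}` group with a plain loop instead of the recursive function described above, so nested braces are not supported, and a value containing ":" or ";" (a URL, for instance) would be split incorrectly. `TagParser` and its method names are invented for this example:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class TagParser
{
    private const string Delimiters = "{}:;";

    // Step 1: split the text into delimiter tokens and data tokens.
    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        var data = new StringBuilder();
        foreach (char c in text)
        {
            if (Delimiters.IndexOf(c) >= 0)
            {
                // Flush any pending data token before emitting the delimiter.
                if (data.Length > 0) { tokens.Add(data.ToString()); data.Clear(); }
                tokens.Add(c.ToString());
            }
            else
            {
                data.Append(c);
            }
        }
        if (data.Length > 0) tokens.Add(data.ToString());
        return tokens;
    }

    // Step 2: trim every token and drop the ones that were only whitespace.
    public static List<string> Sanitize(IEnumerable<string> tokens) =>
        tokens.Select(t => t.Trim()).Where(t => t.Length > 0).ToList();

    // Step 3: interpret one flat "{ key : value ; key : value }" group.
    public static Dictionary<string, string> Interpret(IReadOnlyList<string> tokens)
    {
        int last = tokens.Count - 1;
        if (last < 1 || tokens[0] != "{" || tokens[last] != "}")
            throw new FormatException("Expected the tokens to be wrapped in { }.");

        var pairs = new Dictionary<string, string>();
        int i = 1;
        while (i < last)
        {
            if (tokens[i] == ";") { i++; continue; } // separator between pairs

            // Expect: data token, colon, data token.
            if (i + 2 >= last || IsDelimiter(tokens[i]) || tokens[i + 1] != ":" || IsDelimiter(tokens[i + 2]))
                throw new FormatException($"Expected 'key : value' at token {i}.");

            pairs[tokens[i]] = tokens[i + 2];
            i += 3;
        }
        return pairs;
    }

    private static bool IsDelimiter(string token) =>
        token.Length == 1 && Delimiters.IndexOf(token[0]) >= 0;
}
```

Calling `Interpret(Sanitize(Tokenize(text)))` on the `{link: ... ; title: ... }` example yields a dictionary with the keys `link` and `title`. A malformed group throws a `FormatException`, which is one place to "show the error to the human."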
How would I accomplish displaying a line as the one below in a console window by writing it into a variable during design time then just calling Console.WriteLine(sDescription) to display it?
Options:
-t Description of -t argument.
-b Description of -b argument.
If I understand your question right, what you need is the @ sign in front of your string. This makes it a verbatim string literal, so the compiler takes your string literally (including newlines etc.).
In your case I would write the following:
String sDescription =
@"Options:
-t Description of -t argument.";
So much for your question (I hope), but I would suggest just using several WriteLines.
The performance loss is next to nothing and it just is more adaptable.
You could work with a format string so you would go for this:
string formatString = "{0,-10} {1}"; // {0,-10}: left-align the first argument in a 10-character field
Console.WriteLine("Options:");
Console.WriteLine(formatString, "-t", "Description of -t argument.");
Console.WriteLine(formatString, "-b", "Description of -b argument.");
The format string makes sure your lines are formatted nicely without you inserting spaces manually, and if you ever want to change the format you only need to do it in one place.
Console.Write("Options:\n\tSomething\t\tElse");
produces
Options:
Something Else
\n for next line, \t for tab, for more professional layouts try the field-width setting with format specifiers.
http://msdn.microsoft.com/en-us/library/txafckwd.aspx
If this is a /? screen, I tend to throw the text into a .txt file that I embed via a resx file. Then I just edit the txt file. This then gets exposed as a string property on the generated resx class.
If needed, I embed standard string.Format symbols into my txt for replacement.
Personally I'd normally just write three Console.WriteLine calls. I know that gives extra fluff, but it lines the text up appropriately and it guarantees that it'll use the right line terminator for whatever platform I'm running on. An alternative would be to use a verbatim string literal, but that will "fix" the line terminator at compile-time.
I know C# is mostly used on windows machines, but please, please, please try to write your code as platform neutral. Not all platforms have the same end of line character. To properly retrieve the end of line character for the currently executing platform you should use:
System.Environment.NewLine
Maybe I'm just anal because I am a former java programmer who ran apps on many platforms, but you never know what the platform of the future is.
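As a small illustration, the help text can be stored in a variable while still using the platform's line terminator. This is only a sketch; `HelpText` and `Build` are names made up for the example:

```csharp
using System;

public static class HelpText
{
    // Joins the lines with the line terminator of the platform the code runs on.
    public static string Build()
    {
        string[] lines =
        {
            "Options:",
            "  -t  Description of -t argument.",
            "  -b  Description of -b argument."
        };
        return string.Join(Environment.NewLine, lines);
    }
}
```

`Console.WriteLine(HelpText.Build())` then prints the three lines with whatever terminator the current platform uses.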
The "best" answer depends on where the information you're displaying comes from.
If you want to hard code it, using an "@" string is very effective, though you'll find that getting it to display right plays merry hell with your code formatting.
For a more substantial piece of text (more than a couple of lines), embedding a text resource is good.
But, if you need to construct the string on the fly, say by looping over the commandline parameters supported by your application, then you should investigate both StringBuilder and Format Strings.
StringBuilder has methods like AppendFormat() that accept format strings, making it easy to build up lines of formatted text.
Format Strings make it easy to combine multiple items together. Note that Format strings may be used to format things to a specific width.
To quote the MSDN page linked above:
Format Item Syntax
Each format item takes the following form and consists of the following components:
{index[,alignment][:formatString]}
The matching braces ("{" and "}") are required.
Index Component
The mandatory index component, also called a parameter specifier, is a number starting from 0 that identifies a corresponding item in the list of objects ...
Alignment Component
The optional alignment component is a signed integer indicating the preferred formatted field width. If the value of alignment is less than the length of the formatted string, alignment is ignored and the length of the formatted string is used as the field width. The formatted data in the field is right-aligned if alignment is positive and left-aligned if alignment is negative. If padding is necessary, white space is used. The comma is required if alignment is specified.
Format String Component
The optional formatString component is a format string that is appropriate for the type of object being formatted
...
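To tie the StringBuilder and alignment points together, here is a minimal sketch that builds the usage text on the fly. The option list is a made-up input, `UsageBuilder` is a hypothetical name, and the field width of 6 is an arbitrary choice:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class UsageBuilder
{
    // Builds one aligned line per option: the switch is left-aligned in a
    // 6-character field (negative alignment), followed by its description.
    public static string Build(IEnumerable<KeyValuePair<string, string>> options)
    {
        var sb = new StringBuilder();
        sb.AppendLine("Options:");
        foreach (var option in options)
        {
            sb.AppendFormat("  {0,-6}{1}", option.Key, option.Value);
            sb.AppendLine(); // AppendLine uses Environment.NewLine
        }
        return sb.ToString();
    }
}
```

A switch longer than the field width is not truncated; as the quoted text says, the alignment is simply ignored in that case, so the columns shift for that line.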