This question already has an answer here:
What is the correct way to use JSON.NET to parse stream of JSON objects?
(1 answer)
Closed 4 years ago.
I've got to process files that are full of JSON objects. These have simply been concatenated together with no separator, which makes the whole file invalid JSON. What is the best way to split this up again? I need to ensure that I don't end up splitting inside encoded strings, and it needs to be fairly fast as the files can be quite big.
Example file:
{"property":"Data which may include}{"}{"property":"A second object"}
I've done a lot of parsing like this. There's so much JSON code out there that it's rarely necessary to write your own for JSON. But if you really need to parse this yourself in C#, I see no way to approach this other than by manually parsing it character by character.
Special attention needs to be given to curly braces and colons. And, when parsing tokens you'll need to determine if it's quoted. If it's quoted, then you go until the closing quote (ignore any escaped quotes). If it's not quoted, then you go until you hit a non-symbol character.
You might find this task a little easier using my Text Parsing Helper Class to handle some of the lower-level string handling of your parser.
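The character-by-character approach described above can be sketched as follows. This is only a minimal illustration, not the helper class mentioned in the answer: the names ConcatenatedJson and Split are made up for this example, and it assumes that every top-level value is an object and that the input is otherwise well-formed JSON. It reads from a TextReader so a large file does not have to be loaded in one piece.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class ConcatenatedJson
{
    // Yields one string per top-level JSON object found in the reader.
    // Assumption: every top-level value is an object ({...}) and the input is well formed.
    public static IEnumerable<string> Split(TextReader reader)
    {
        var current = new StringBuilder();
        int depth = 0;          // nesting level of braces outside of strings
        bool inString = false;  // true while between an opening and a closing quote
        bool escaped = false;   // true when the previous character was a backslash inside a string

        int next;
        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;

            if (depth == 0 && c != '{')
                continue; // skip whitespace or stray characters between objects

            current.Append(c);

            if (inString)
            {
                if (escaped) escaped = false;        // escaped character, taken literally
                else if (c == '\\') escaped = true;  // the next character is escaped
                else if (c == '"') inString = false; // closing quote
                continue;
            }

            if (c == '"') inString = true;
            else if (c == '{') depth++;
            else if (c == '}' && --depth == 0)
            {
                // Back at the top level: one complete object has been read.
                yield return current.ToString();
                current.Clear();
            }
        }
    }
}
```

Called with a StringReader over the example file from the question, this yields two strings, and the braces inside the first string value are not treated as object boundaries. For a real file, File.OpenText(path) could supply the reader.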
This question already has answers here:
Parsing CSV files in C#, with header
(19 answers)
Closed 7 years ago.
I have a semicolon-separated string that contains values of every type. String and date values are in quotation marks.
Now I have an evil string where an inner string contains a semicolon that I need to remove (replace with nothing).
eg:
"Value1";0;"Value2";4711;"Evil; Value";"2015-09-03"
in C#:
string value = "\"Value1\";0;\"Value2\";4711;\"Evil; Value\";\"2015-09-03\"";
So how can I remove all semicolons that are inside quotation marks? Can anybody help?
Regex is awful at handling delimited strings. It can do it, but it's not often as good of a choice as it first appears. This is one of several reasons why.
Instead, you should use a dedicated delimited string parser. There are (at least) three built into the .NET Framework. The TextFieldParser type (in the Microsoft.VisualBasic.FileIO namespace) is one of those, and it will handle this correctly.
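As a rough sketch of the TextFieldParser suggestion: the class name SemicolonLine and the method ParseAndClean are invented for this example, and on .NET Framework it assumes the project references the Microsoft.VisualBasic assembly. It parses one line with quoted-field support switched on, then removes any semicolon left inside a field.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class SemicolonLine
{
    // Parses one semicolon-delimited line and strips semicolons found inside a field.
    public static List<string> ParseAndClean(string line)
    {
        var fields = new List<string>();
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(";");
            parser.HasFieldsEnclosedInQuotes = true; // a ';' inside quotes is not a delimiter

            while (!parser.EndOfData)
            {
                foreach (string field in parser.ReadFields())
                {
                    // The parser has already removed the surrounding quotes.
                    fields.Add(field.Replace(";", ""));
                }
            }
        }
        return fields;
    }
}
```

For the string from the question this gives six fields: Value1, 0, Value2, 4711, Evil Value and 2015-09-03. Note that the surrounding quotes are gone afterwards, so they would have to be added back if the line is written out again.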
You should try this, i.e. match only those semicolons which are not preceded by a double quote ("):
(?<=[^"]);
Here is demo
This question already has answers here:
Dealing with forbidden characters in XML using C# .NET
(6 answers)
Closed 8 years ago.
We have an app that collects data electronically and by user input. The data is eventually turned into XML. We have had problems with invalid XML characters in the inbound data when we turn it into XML, either by serializing objects or by using a .NET transform. The process will throw an exception like the one below.
Exception: System.Xml.XmlException: '', hexadecimal value 0x10, is an invalid character. Line 5, position 74.
I don't know any other way to fix this other than scrubbing all the data, either at input time or at the time the XML is created. The thought of running every string input or string property in an object through a cleaning function doesn't sound appealing. Is that the way this would need to be resolved?
Looking for confirmation or alternatives.
Thanks,
Kevin
There really isn't an elegant solution for this, but this response has some examples of whitelist cleansers.
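A whitelist cleanser of the kind mentioned could look roughly like this. It is a sketch, not the code from the linked response: the names XmlText and StripInvalidXmlChars are made up, and the allowed ranges are taken from the Char production of the XML 1.0 specification (tab, line feed, carriage return, U+0020 to U+D7FF, U+E000 to U+FFFD, and valid surrogate pairs).

```csharp
using System.Text;

public static class XmlText
{
    // Returns the input with every character that XML 1.0 does not allow removed.
    public static string StripInvalidXmlChars(string text)
    {
        if (string.IsNullOrEmpty(text)) return text;

        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsSurrogatePair(text, i))
            {
                // A valid high/low pair encodes U+10000..U+10FFFF, which XML allows.
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
            else if (c == '\t' || c == '\n' || c == '\r' ||
                     (c >= '\u0020' && c <= '\uD7FF') ||
                     (c >= '\uE000' && c <= '\uFFFD'))
            {
                sb.Append(c);
            }
            // Anything else (control characters such as 0x10, lone surrogates) is dropped.
        }
        return sb.ToString();
    }
}
```

Such a function still has to be called on each string before it is serialized, so it does not remove the need for scrubbing; it only keeps the scrubbing in one place.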
This question already has answers here:
Determine a string's encoding in C#
(10 answers)
Closed 9 years ago.
I have a string that was read as UTF-8 (not from a file, so I can't check for a BOM).
The problem is that sometimes the original text was written in another encoding but was converted as if it were UTF-8, so the string is not readable, sort of gibberish.
Is it possible to detect that this string is not actual UTF-8?
Thanks!
No. They're just bytes. You could try to guess, if you wanted, by trying different conversions and seeing whether there are valid dictionary words, etc., but in a theoretical sense it's impossible without knowing something about the data itself, i.e. knowing that it never uses certain characters, or always uses certain characters, or that it contains mostly words found in a given dictionary, etc. It might look like gibberish to a person, but the computer has no way of quantifying "gibberish".
Closed 10 years ago.
Possible Duplicate:
Parsing CSV files in C#
I have a C# application that parses a pipe delimited file. It uses the Regex.Split method:
Regex.Split(line, @"(?<!(?<!\\)*\\)\|")
However, recently a data file came across with a pipe included in one of the data fields. The data field in question used quoted identifiers, so when you open it in Excel it opens correctly.
For example I have a file that looks like:
Field1|Field2|"Field 3 has a | inside the quotes"|Field4
When I use the above regex it parses to:
Field1
Field2
Field 3 has a
inside the quotes
Field4
when I would like
Field1
Field2
Field 3 has a | inside the quotes
Field4
I've done a fair amount of research and can't seem to get the Regex.Split to split the file on pipes but respect the quoted identifiers. Any help is greatly appreciated!
Here is a quick expression I've thrown together that seems to do the trick:
"([^"]+)"|([^\|]+)
Though your expression seems to be doing something with \'s as well, so you might need to add to this expression any other needs you have. I've ignored them in my answer because they were not explained in the question and therefore I cannot provide a solution without knowing why they are there - they may in fact not need to be there at all.
Also, my expression ignores empty fields (i.e. 1||2|3 would come out as 1, 2 and 3 only) and I don't know whether this is what you need. If it isn't, let me know and I can change the expression to something that would cater for that too.
Hope this helps anyway.
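To show how the expression above might be applied, here is a small sketch that uses Regex.Matches rather than Regex.Split. The names PipeLine and SplitFields are invented for this example. Group 1 holds a quoted field without its quotes and group 2 holds an unquoted field; as noted above, empty fields are skipped.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PipeLine
{
    // The pattern from the answer: "([^"]+)"|([^\|]+)
    private static readonly Regex FieldPattern = new Regex("\"([^\"]+)\"|([^\\|]+)");

    // Returns the fields of a pipe-delimited line, keeping pipes inside quoted fields.
    public static List<string> SplitFields(string line)
    {
        var fields = new List<string>();
        foreach (Match m in FieldPattern.Matches(line))
        {
            // Prefer the quoted capture; fall back to the unquoted one.
            fields.Add(m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value);
        }
        return fields;
    }
}
```

For the sample line from the question this returns four fields, with "Field 3 has a | inside the quotes" kept as a single field.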
Closed 10 years ago.
Possible Duplicate:
What is the best way to parse html in C#?
I'm trying to write some code which uses a HttpWebRequest with GET method (or any suggested faster function), find a keyword on the page and then display what comes after it in various textviews.
The homepage it looks up will always be the same and will always find the same lines but with different data.
I've read about something called HtmlAgilityPack a lot but I cannot figure out if I can use it for this, nor how to.
Are there any faster functions to use to just get and find data within the source?
Can I use HtmlAgilityPack, if so how (example please)?
Is there any easier way this can be done?
cheersnox
Yes, you can use HtmlAgilityPack if you want to extract text from tags.
HtmlAgilityPack is an HTML parser that builds a read/write DOM from “real world” HTML files. It supports XPATH or XSLT and is tolerant with "real world" malformed HTML
In short, it uses XPath queries, which really helps in extracting data quickly.
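Since the question asks for an example, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the names PageScraper and ExtractText, and the XPath used in the note below, are hypothetical and would need to match the actual page.

```csharp
using HtmlAgilityPack;

public static class PageScraper
{
    // Returns the trimmed text of the first node matching the XPath, or null if nothing matches.
    public static string ExtractText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of malformed, real-world HTML

        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText.Trim();
    }
}
```

A call such as PageScraper.ExtractText(html, "//span[@id='price']") would return the text inside that span. The HTML string can come from the existing HttpWebRequest code, or the page can be loaded directly with new HtmlWeb().Load(url), which returns an HtmlDocument.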