C# is valid UTF-8 [duplicate] - c#

This question already has answers here:
Determine a string's encoding in C#
(10 answers)
Closed 9 years ago.
I have a string that was read as UTF-8 (not from a file, so I can't check for a BOM).
The problem is that sometimes the original text was produced in another encoding but was converted to UTF-8, so the string is unreadable gibberish.
Is it possible to detect that this string is not actual UTF-8?
Thanks!

No. They're just bytes. You could try to guess, if you wanted, by trying different conversions and seeing whether there are valid dictionary words, etc., but in a theoretical sense it's impossible without knowing something about the data itself, i.e. knowing that it never uses certain characters, or always uses certain characters, or that it contains mostly words found in a given dictionary, etc. It might look like gibberish to a person, but the computer has no way of quantifying "gibberish".
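As a rough illustration of the "try to guess" approach, the sketch below checks two weak signals: whether the raw bytes are well-formed UTF-8 (assuming they are still available, which may not be the case here), and whether an already-decoded string contains U+FFFD replacement characters. The class and method names are hypothetical. Passing both checks does not prove the text was originally UTF-8; it only rules out some obvious failures.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the .NET library.
public static class Utf8Heuristics
{
    // Strict decoder: throws on invalid bytes instead of silently
    // substituting U+FFFD the way Encoding.UTF8 does.
    private static readonly UTF8Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // True if the bytes form a well-formed UTF-8 sequence.
    public static bool IsWellFormedUtf8(byte[] bytes)
    {
        try
        {
            StrictUtf8.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // True if a lenient decode has already replaced invalid bytes with U+FFFD.
    public static bool ContainsReplacementChar(string text)
    {
        return text.IndexOf('\uFFFD') >= 0;
    }
}
```

For example, the single byte 0xE9 ("é" in Latin-1) is not valid UTF-8 on its own, so `Utf8Heuristics.IsWellFormedUtf8(new byte[] { 0xE9 })` returns false. Text that was mis-decoded and then re-encoded as UTF-8 is well-formed, though, and will pass this check.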

Related

How to split concatenated JSON files using C# [duplicate]

This question already has an answer here:
What is the correct way to use JSON.NET to parse stream of JSON objects?
(1 answer)
Closed 4 years ago.
I've got to process files that are full of JSON objects. These have simply been concatenated together with no separator thus making the whole file invalid JSON. What is the best way to split this up again? I need to ensure that I don't end up splitting in encoded strings and it needs to be fairly fast as the file can be quite big.
Example file:
{"property":"Data which may include}{"}{"property":"A second object"}
I've done a lot of parsing like this. There's so much JSON code out there that hand-written parsing is rarely necessary with JSON. But if you really need to parse this yourself in C#, I see no way to approach it other than manually parsing it character by character.
Special attention needs to be given to curly braces and colons. When parsing tokens, you'll also need to determine whether each one is quoted. If it's quoted, read until the closing quote (skipping any escaped quotes). If it's not quoted, read until you hit a non-symbol character.
You might find this task a little easier using my Text Parsing Helper Class to handle some of the lower-level string handling of your parser.
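A minimal sketch of the character-by-character approach described above. It assumes every top-level value is an object or array and that the input fits in memory as a string; the class and method names are hypothetical. It only tracks nesting depth and whether the scanner is inside a quoted string, so it splits the input without validating it.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: splits concatenated top-level JSON objects/arrays.
public static class ConcatenatedJson
{
    public static IEnumerable<string> SplitObjects(string input)
    {
        int depth = 0;          // current nesting level of {} and []
        int start = -1;         // index where the current top-level value began
        bool inString = false;  // true while inside a quoted string
        bool escaped = false;   // true right after a backslash inside a string

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];

            if (inString)
            {
                // Braces inside strings must not affect the depth.
                if (escaped) escaped = false;
                else if (c == '\\') escaped = true;
                else if (c == '"') inString = false;
                continue;
            }

            switch (c)
            {
                case '"':
                    inString = true;
                    break;
                case '{':
                case '[':
                    if (depth == 0) start = i;
                    depth++;
                    break;
                case '}':
                case ']':
                    // Ignore stray closers; emit when a top-level value ends.
                    if (depth > 0 && --depth == 0)
                    {
                        yield return input.Substring(start, i - start + 1);
                    }
                    break;
            }
        }
    }
}
```

Applied to the example file from the question, this yields two strings: the first object (including the `}{` inside its string value) and the second object. For very large files the same state machine could be driven from a `TextReader` instead of a string.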

How to get every possible combination of 8 characters? [duplicate]

This question already has answers here:
How to get all the unique n-long combinations of a set of duplicatable elements?
(5 answers)
Closed 3 years ago.
I am trying to save every combination from AAAAAAAA to ZZZZZZZZ to a text file. So far, after many errors, I have got almost nowhere. I could post my code if needed, but it doesn't work or get near the wanted outcome.
So I was wondering how to do this in C#. My method at the moment is beyond repair, so I will have to start all over again in order to fix this.
As the output I would like something along the lines of
AAAAAAAA, AAAAAAAB, AAAAAAAC ... ZZZZZZZX, ZZZZZZZY, ZZZZZZZZ
Thanks in advance for any help.
This is a basic combinatorics question:
You want to write a string of 8 characters.
Each character can be one of the letters A-Z (26 options), so there are 26^8 combinations: 26*26*26*...*26.
That is 208,827,064,576 combinations.
Each combination takes 10 bytes in a single-byte encoding (8 for the string, then \r\n), for a total of 2,088,270,645,760 bytes, or about 1944.85 GiB.
Are you sure you want to write it to a file?
First, this will take roughly 2 terabytes. That's a huge text file, probably impractical.
Second, the simple way to do this is to have 8 nested loops, each running from A to Z, build the string inside the innermost loop, and append it to the data store each time.
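A sketch of one way to generate the strings lazily. Instead of the 8 nested loops suggested above, it uses an equivalent "odometer" counter over a character buffer, so the length can be a parameter. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: enumerates all strings of a given length over A-Z,
// in the order AAAA..., AAAB..., up to ZZZZ...
public static class LetterStrings
{
    public static IEnumerable<string> Generate(int length)
    {
        if (length < 0) throw new ArgumentOutOfRangeException(nameof(length));

        var buffer = new char[length];
        for (int i = 0; i < length; i++) buffer[i] = 'A';

        while (true)
        {
            yield return new string(buffer);

            // Increment like an odometer: roll trailing 'Z's over to 'A',
            // then bump the first position that is not 'Z'.
            int pos = length - 1;
            while (pos >= 0 && buffer[pos] == 'Z')
            {
                buffer[pos] = 'A';
                pos--;
            }
            if (pos < 0) yield break; // every position rolled over: done
            buffer[pos]++;
        }
    }
}
```

Because the sequence is produced lazily, nothing is held in memory beyond the current string. A caller could write `foreach (var s in LetterStrings.Generate(8)) writer.WriteLine(s);` with a `StreamWriter`, although the resulting file would still have the enormous size estimated above.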

How to deal with strings that contain invalid XML characters [duplicate]

This question already has answers here:
Dealing with forbidden characters in XML using C# .NET
(6 answers)
Closed 8 years ago.
We have an app that collects data electronically and by user input. The data is eventually turned into XML. We have had problems with invalid XML characters in the inbound data when we turn it into XML, either by serializing objects or by using a .NET transform. The process will throw an exception like the one below.
Exception: System.Xml.XmlException: '', hexadecimal value 0x10, is an invalid character. Line 5, position 74.
I don't know any way to fix this other than scrubbing all the data, either at input time or at the time the XML is created. The thought of running every string input or string property in an object through a cleaning function doesn't sound appealing. Is that the way this would need to be resolved?
Looking for confirmation or alternatives.
Thanks,
Kevin
There really isn't an elegant solution for this, but this response has some examples of whitelist cleansers.
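As an illustration of what a whitelist cleanser can look like, here is a minimal sketch that keeps only characters allowed by the XML 1.0 Char production (tab, line feed, carriage return, U+0020 to U+D7FF, U+E000 to U+FFFD, and valid surrogate pairs for U+10000 and above) and drops everything else. The class and method names are hypothetical, and silently dropping characters is an assumption; some applications may prefer to replace or log them.

```csharp
using System.Text;

// Hypothetical helper: removes characters that XML 1.0 does not allow.
public static class XmlSanitizer
{
    public static string StripInvalidXmlChars(string text)
    {
        if (string.IsNullOrEmpty(text)) return text;

        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];

            // A well-formed surrogate pair encodes U+10000..U+10FFFF, which is allowed.
            if (char.IsHighSurrogate(c) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
            {
                sb.Append(c).Append(text[i + 1]);
                i++; // skip the low surrogate we just consumed
            }
            else if (c == '\t' || c == '\n' || c == '\r' ||
                     (c >= '\u0020' && c <= '\uD7FF') ||
                     (c >= '\uE000' && c <= '\uFFFD'))
            {
                sb.Append(c);
            }
            // Anything else (control characters such as 0x10, unpaired
            // surrogates, U+FFFE, U+FFFF) is dropped.
        }
        return sb.ToString();
    }
}
```

With this, `XmlSanitizer.StripInvalidXmlChars("ab\u0010cd")` returns "abcd", which removes the 0x10 character reported in the exception above.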

Should I store telephone numbers as strings or integers? [duplicate]

This question already has answers here:
What's the right way to represent phone numbers?
(9 answers)
Closed 9 years ago.
I'm trying to decide between storing a phone number as a string or an int. Any ideas?
For any situation like this, ask yourself: will I have to calculate anything with that value? If that doesn't make any sense, you should use a string. In this case, there's no logical scenario where you'd use the telephone number as a number, so use a string.
I recommend using a string, since that gives you more flexibility when it comes to formatting and non-numeric characters such as extensions.
I would suggest using String: aside from anything else, you otherwise won't be able to store leading zeroes. You definitely shouldn't use int (too small), float, or double (too much risk of data loss); long or BigInteger could be appropriate (aside from the leading zeroes problem), but frankly I'd go with String. That way you can also store whatever dashes or spaces the user has entered to make it easier to remember the number, if you want to.
Reference: What's the right way to represent phone numbers?
I highly recommend you use a string for this.
If you are going to validate phone number input then you can use the regex lib's matcher and pattern to make sure a phone number was entered in the correct format.
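A small C# sketch of the two points made in these answers: a numeric type loses leading zeroes, and a regular expression can check the format of a string. The pattern is a deliberately loose, hypothetical one (an optional leading +, then 7 to 20 digits, spaces, dashes, or parentheses, starting and ending with a digit); real validation rules depend on the countries and formats you need to support.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating string storage of phone numbers.
public static class PhoneNumbers
{
    // Assumed loose format, not an official standard.
    private static readonly Regex LoosePattern =
        new Regex(@"^\+?[0-9][0-9 ()\-]{5,18}[0-9]\z");

    public static bool LooksLikePhoneNumber(string input)
    {
        return input != null && LoosePattern.IsMatch(input);
    }

    // Shows what happens when a number is stored as a long and read back:
    // any leading zeroes are lost.
    public static string RoundTripThroughLong(string digits)
    {
        long value = long.Parse(digits, CultureInfo.InvariantCulture);
        return value.ToString(CultureInfo.InvariantCulture);
    }
}
```

Here `PhoneNumbers.RoundTripThroughLong("0123456789")` returns "123456789", which is no longer the number the user entered, while a string keeps the value and its formatting intact.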

How to convert a char to its full Unicode name? [duplicate]

This question already has answers here:
Finding out Unicode character name in .Net
(7 answers)
Closed 9 years ago.
I need functions to convert between a character (e.g. 'α') and its full Unicode name (e.g. "GREEK SMALL LETTER ALPHA") in both directions.
The solution I came up with is to perform a lookup in the official Unicode Standard data available online at http://www.unicode.org/Public/6.2.0/ucd/UnicodeData.txt (or, rather, in a cached local copy, possibly converted to a suitable collection beforehand to improve lookup performance).
Is there a simpler way to do these conversions?
I would prefer a solution in C#, but solutions in other languages that can be adapted to C# / .NET are also welcome. Thanks!
If you do not want to keep the Unicode name table in memory, prepare a file of fixed-width records, so that the offset of a name is the Unicode value multiplied by the maximum name length. With a maximum length of 4 bytes, it won't be more than a few megabytes. If you want a more compact implementation, put an index at the start of the file, indexed by Unicode value, that holds the file offset of each name, and store the names after it. You have to prepare such a file yourself, but that is not difficult.
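For comparison, the in-memory lookup described in the question can be sketched as follows. It assumes the input lines follow the UnicodeData.txt layout, in which the first two semicolon-separated fields are the hexadecimal code point and the character name. The class name and its members are hypothetical. Placeholder entries whose name starts with "<" (such as "<control>" and range markers) are skipped, so characters covered only by ranges are not resolved.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper: two-way lookup between characters and Unicode names.
public sealed class UnicodeNameTable
{
    private readonly Dictionary<int, string> _nameByCodePoint =
        new Dictionary<int, string>();
    private readonly Dictionary<string, int> _codePointByName =
        new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

    public UnicodeNameTable(IEnumerable<string> unicodeDataLines)
    {
        foreach (string line in unicodeDataLines)
        {
            string[] fields = line.Split(';');
            if (fields.Length < 2) continue;

            string name = fields[1];
            if (name.StartsWith("<", StringComparison.Ordinal)) continue;

            int codePoint;
            if (!int.TryParse(fields[0], NumberStyles.HexNumber,
                              CultureInfo.InvariantCulture, out codePoint)) continue;

            _nameByCodePoint[codePoint] = name;
            _codePointByName[name] = codePoint;
        }
    }

    // Returns the name of the first code point in the string, or null if unknown.
    public string GetName(string character)
    {
        int codePoint = char.ConvertToUtf32(character, 0);
        string name;
        return _nameByCodePoint.TryGetValue(codePoint, out name) ? name : null;
    }

    // Returns the character for a name (case-insensitive), or null if unknown.
    public string GetCharacter(string name)
    {
        int codePoint;
        return _codePointByName.TryGetValue(name, out codePoint)
            ? char.ConvertFromUtf32(codePoint)
            : null;
    }
}
```

A table could be built with `new UnicodeNameTable(File.ReadLines(path))`, where `path` points at a local copy of UnicodeData.txt. Strings are used instead of `char` so that characters outside the Basic Multilingual Plane, which need two UTF-16 code units, can be handled as well.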
