deserializing data best practices - c#

I have been given the task of deserializing some data. The data has all been munged into a string which is in the following format:
InternalNameA8ValueDisplay NameA¬InternalNameB8ValueDisplay NameB¬ etc etc.
(ie, it has an internal name, '8', the value, the display name, followed by '¬' **). for example, you'd have FirstName8JoeFirst Name¬
I have no control over how this data is serialized, its legacy stuff.
I've thought of doing a bunch of splits on the string, or breaking it up into a char array and splitting down the text that way. But this just seems horrible. This way there is too much that could go wrong (e.g, if the value of a phone number (for example), could begin with '8'.
What I want to know is what peoples' approaches to this would be? Is there anything more clever i can do to break the data down
note: '¬' isn't actually the character, it looks more like an arrow pointing left. but I'm away from my machine at the moment. Doh!
Thanks.

Instead of using splits, I would recommend using a simple state machine. Walk over each characters until you hit a delimiter, then you know you're on the next field. That takes care of issues like an "8" in a phone number.
NOTE - untested code ahead.
var fieldValues = new string[3];
var currentField = 0;
var line = "InternalNameA8ValueDisplay NameA¬InternalNameB8ValueDisplay NameB¬";
foreach (var c in line)
{
if (c == '8' && currentField == 0)
{
currentField++; continue;
}
if (c == '¬')
{
currentField++; continue;
}
fieldValues[currentField] += c;
}
Dealing with wonky formats - always a good time!
Good luck,
Erick

Related

C# - How do I read text between an unkown number of commas?

I hope you're all doing well. I'll try to keep myself short so let me give you an example right away.
Let's say that I have two lists. In the first list there are different words on different lines. In the second list there are synonyms for each and every word.
List 1:
hello
exhausted
happy
List 2:
hi, hey, howdy
beat, foreworn
glad
The console application outputs a random line number's string from the first list and then it let's me enter a synonym. After this, the console app checks if the synonym that I stated is within the same line number but in list 2.
I would like to count how many letters there are before, in between and after the commas.
Can anyone help? I'm far from a professional when it comes to C# so if anyone thinks I should rewrite the lists in any different way to make it easier I would be happy to do so.
Thanks!
Edit: The reason I want to do this is because if the console outputs "hello" for example, all it takes for the player is to enter "h" and it counts as correct. This is because there is an H in "hi" (see list 2). If I knew how many letters there are between the commas then the console would be able to give the player a wrong if the input's characters is more or less than the number of characters in list 2.
Split your string..
using System.Collections.Generic;
using System.Linq;
// ....
bool CheckSynonym(List<string> list1,List<string> list2, string provided, string synonymprovided)
{
if (list1.IndexOf(provided)<0) return false;
if (list1.IndexOf(provided)>=list2.Count) return false;
string s = list2[list1.IndexOf(provided)];
List<string> lstwords = s.Split( new char[] { ',' } ).ToList();
if (lstwords.IndexOf(" "+synonymprovided) >= 0) return true;
if (lstwords.IndexOf(synonymprovided) >= 0) return true;
return false;
}
This could work fine.. but I'd like to make a remark about the way you store your data. You stated "After this, the console app checks if the synonym that I stated is within the same line number but in list 2."
When you are going to expand your synonym dictionary, this is not handy, because you will have to check line numbers. I would use a single list, to include the words in the first list, like so
List:
hello, hi, hey, howdy
exhausted, beat, foreworn
happy, glad
Code can be straightforward, like
using System.Collections.Generic;
using System.Linq;
// ....
bool CheckSynonym(List<string> list, string provided, string synonymprovided)
{
foreach (string s in list)
{
List<string> lstwords = s.Split( new char[] { ',' } ).ToList();
if (lstwords.IndexOf(provided ) == 0)
if (lstwords.IndexOf(synonymprovided) > 0) return true;
else if (lstwords.IndexOf(" " + synonymprovided) >= 0) return true;
}
return false;
}
Note: this second solution allows for a single list.. but it will be slower !
One possible method:
string x1 = "hi, hey, howdy";
x1.Split(',').ToList().ForEach(x => Console.WriteLine(x.Trim().ToCharArray().Count()));
Prints
2,
3,
5

Fastest way to split a huge text into smaller chunks

I have used the below code to split the string, but it takes a lot of time.
using (StreamReader srSegmentData = new StreamReader(fileNamePath))
{
string strSegmentData = "";
string line = srSegmentData.ReadToEnd();
int startPos = 0;
ArrayList alSegments = new ArrayList();
while (startPos < line.Length && (line.Length - startPos) >= segmentSize)
{
strSegmentData = strSegmentData + line.Substring(startPos, segmentSize) + Environment.NewLine;
alSegments.Add(line.Substring(startPos, segmentSize) + Environment.NewLine);
startPos = startPos + segmentSize;
}
}
Please suggest me an alternative way to split the string into smaller chunks of fixed size
First of all you should define what you mean with chunk size. If you mean chunks with a fixed number of code units then your actual algorithm may be slow but it works. If it's not what you intend and you actually mean chunks with a fixed number of characters then it's broken. I discussed a similar issue in this Code Review post: Split a string into chunks of the same length then I will repeat here only relevant parts.
You're partitioning over Char but String is UTF-16 encoded then you may produce broken strings in, at least, three cases:
One character is encoded with more than one code unit. Unicode code point for that character is encoded as two UTF-16 code units, each code unit may end up in two different slices (and both strings will be invalid).
One character is composed by more than one code point. You're dealing with a character made by two separate Unicode code points (for example Han character 𠀑).
One character has combining characters or modifiers. This is more common than you may think: for example Unicode combining character like U+0300 COMBINING GRAVE ACCENT used to build à and Unicode modifiers such as U+02BC MODIFIER LETTER APOSTROPHE.
Definition of character for a programming language and for a human being are pretty different, for example in Slovak dž is a single character however it's made by 2/3 Unicode code points which are in this case also 2/3 UTF-16 code units then "dž".Length > 1. More about this and other cultural issues on How can I perform a Unicode aware character by character comparison?.
Ligatures exist. Assuming one ligature is one code point (and also assuming it's encoded as one code unit) then you will treat it as a single glyph however it represents two characters. What to do in this case? In general definition of character may be pretty vague because it has a different meaning according to discipline where this word is used. You can't (probably) handle everything correctly but you should set some constraints and document code behavior.
One proposed (and untested) implementation may be this:
public static IEnumerable<string> Split(this string value, int desiredLength)
{
var characters = StringInfo.GetTextElementEnumerator(value);
while (characters.MoveNext())
yield return String.Concat(Take(characters, desiredLength));
}
private static IEnumerable<string> Take(TextElementEnumerator enumerator, int count)
{
for (int i = 0; i < count; ++i)
{
yield return (string)enumerator.Current;
if (!enumerator.MoveNext())
yield break;
}
}
It's not optimized for speed (as you can see I tried to keep code short and clear using enumerations) but, for big files, it still perform better than your implementation (see next paragraph for the reason).
About your code note that:
You're building a huge ArrayList (?!) to hold result. Also note that in this way you resize ArrayList multiple times (even if, given input size and chunk size then its final size is known).
strSegmentData is rebuilt multiple times, if you need to accumulate characters you must use StringBuilder otherwise each operation will allocate a new string and copying old value (it's slow and it also adds pressure to Garbage Collector).
There are faster implementations (see linked Code Review post, especially Heslacher's implementation for a much faster version) and if you do not need to handle Unicode correctly (you're sure you manage only US ASCII characters) then there is also a pretty readable implementation from Jon Skeet (note that, after profiling your code, you may still improve its performance for big files pre-allocating right size output list). I do not repeat their code here then please refer to linked posts.
In your specific you do not need to read entire huge file in memory, you can read/parse n characters at time (don't worry too much about disk access, I/O is buffered). It will slightly degrade performance but it will greatly improve memory usage. Alternatively you can read line by line (managing to handle cross-line chunks).
Below is my analysis of your question and code (read the comments)
using (StreamReader srSegmentData = new StreamReader(fileNamePath))
{
string strSegmentData = "";
string line = srSegmentData.ReadToEnd(); // Why are you reading this till the end if it is such a long string?
int startPos = 0;
ArrayList alSegments = new ArrayList(); // Better choice would be to use List<string>
while (startPos < line.Length && (line.Length - startPos) >= segmentSize)
{
strSegmentData = strSegmentData + line.Substring(startPos, segmentSize) + Environment.NewLine; // Seem like you are inserting linebreaks at specified interval in your original string. Is that what you want?
alSegments.Add(line.Substring(startPos, segmentSize) + Environment.NewLine); // Why are you recalculating the Substring? Why are you appending the newline if the aim is to just "split"
startPos = startPos + segmentSize;
}
}
Making all kind of assumption, below is the code I would recommend for splitting long string. It is just a clean way of doing what you are doing in the sample. You can optimize this, but not sure how fast you are looking for.
static void Main(string[] args) {
string fileNamePath = "ConsoleApplication1.pdb";
var segmentSize = 32;
var op = ReadSplit(fileNamePath, segmentSize);
var joinedSTring = string.Join(Environment.NewLine, op);
}
static List<string> ReadSplit(string filePath, int segmentSize) {
var splitOutput = new List<string>();
using (var file = new StreamReader(filePath, Encoding.UTF8, true, 8 * 1024 )) {
char []buffer = new char[segmentSize];
while (!file.EndOfStream) {
int n = file.ReadBlock(buffer, 0, segmentSize);
splitOutput.Add(new string(buffer, 0, n));
}
}
return splitOutput;
}
I haven't done any performance tests on my version, but my guess is that it is faster than your version.
Also, I am not sure how you plan to consume the output, but a good optimization when doing I/O is to use async calls. And a good optimization (at the cost of readability and complexity) when handling large string is to stick with char[]
Note that
You might have to deal with Character encoding issues while reading the file
If you already have the long string in memory and file reading was just include in the demo, then you should use the StringReader class instead of the StreamReader class

Parsing string and generating all possible combinations in multiple strings

In my SQL Server database I have strings stored representing the correct solution to a question. In this string a certain format can be used to represent multiple correct solutions. The format:
possible-text [posibility-1/posibility-2] possible-text
This states either posibility-1 or posibility-2 is correct. There is no limit on how many possibilities there are (e.g. [ pos-1 / pos-2 / pos-3 / ... ] is possible).
However, a possibility can be null, e.g.:
I am [un/]certain.
This means the answer could be "I am certain" or "I am uncertain".
The format can also be nested in a sentence, e.g.:
I am [[un/]certain/[un/]sure].
The format can also occur multiple times in one sentence, e.g.:
[I am/I'm] [[un/]certain/[/un]sure].
What I want is to generate all the possible combinations. E.g. the above expression should return:
I am uncertain.
I am certain.
I am sure.
I am unsure.
I'm uncertain.
I'm certain.
I'm sure.
I'm unsure.
There is no limit on the nesting, nor the amount of possibilities. If there is only one possible solution then it will have not be in the above format. I'm not sure on how to do this.
I have to write this in C#. I think a possible solution could be to write a regex expression that can capture the [ / ] format and return me the possible solutions in a list (for every []-pair) and then generating the possible solutions by going over them in a stack-style way (some sort of recursion and backtracking style), but I'm not to a working solution yet.
I'm at a loss on to how exactly start on this. If somebody could give me some pointers on how to tackle this problem I'd appreciate it. When I find something I'll add it here.
Note: I noticed there are a lot of similar questions, however the solutions all seem to be specific to the particular problem and I think not applicable to my problem. If however I'm wrong, and you remember a previously answered question that can solve this, could you then tell me? Thanks in advance.
Update: Just to clarify if it was unclear. Every line in code is possible input. So this whole line is input:
[I am/I'm] [[un/]certain/[/un]sure].
This should work. I didn't bother optimizing it or doing error checking (in case the input string is malformed).
class Program
{
static IEnumerable<string> Parts(string input, out int i)
{
var list = new List<string>();
int level = 1, start = 1;
i = 1;
for (; i < input.Length && level > 0; i++)
{
if (input[i] == '[')
level++;
else if (input[i] == ']')
level--;
if (input[i] == '/' && level == 1 || input[i] == ']' && level == 0)
{
if (start == i)
list.Add(string.Empty);
else
list.Add(input.Substring(start, i - start));
start = i + 1;
}
}
return list;
}
static IEnumerable<string> Combinations(string input, string current = "")
{
if (input == string.Empty)
{
if (current.Contains('['))
return Combinations(current, string.Empty);
return new List<string> { current };
}
else if (input[0] == '[')
{
int end;
var parts = Parts(input, out end);
return parts.SelectMany(x => Combinations(input.Substring(end, input.Length - end), current + x)).ToList();
}
else
return Combinations(input.Substring(1, input.Length - 1), current + input[0]);
}
static void Main(string[] args)
{
string s = "[I am/I'm] [[un/]certain/[/un]sure].";
var list = Combinations(s);
}
}
You should create a parser that read character by character and builds up a logical tree of the sentence. When you have the tree it is easy to generate all possible combinations. There are several lexical parsers available that you could use, for example ANTLR: http://programming-pages.com/2012/06/28/antlr-with-c-a-simple-grammar/

Constantly Incrementing String

So, what I'm trying to do this something like this: (example)
a,b,c,d.. etc. aa,ab,ac.. etc. ba,bb,bc, etc.
So, this can essentially be explained as generally increasing and just printing all possible variations, starting at a. So far, I've been able to do it with one letter, starting out like this:
for (int i = 97; i <= 122; i++)
{
item = (char)i
}
But, I'm unable to eventually add the second letter, third letter, and so forth. Is anyone able to provide input? Thanks.
Since there hasn't been a solution so far that would literally "increment a string", here is one that does:
static string Increment(string s) {
if (s.All(c => c == 'z')) {
return new string('a', s.Length + 1);
}
var res = s.ToCharArray();
var pos = res.Length - 1;
do {
if (res[pos] != 'z') {
res[pos]++;
break;
}
res[pos--] = 'a';
} while (true);
return new string(res);
}
The idea is simple: pretend that letters are your digits, and do an increment the way they teach in an elementary school. Start from the rightmost "digit", and increment it. If you hit a nine (which is 'z' in our system), move on to the prior digit; otherwise, you are done incrementing.
The obvious special case is when the "number" is composed entirely of nines. This is when your "counter" needs to roll to the next size up, and add a "digit". This special condition is checked at the beginning of the method: if the string is composed of N letters 'z', a string of N+1 letter 'a's is returned.
Here is a link to a quick demonstration of this code on ideone.
Each iteration of Your for loop is completely
overwriting what is in "item" - the for loop is just assigning one character "i" at a time
If item is a String, Use something like this:
item = "";
for (int i = 97; i <= 122; i++)
{
item += (char)i;
}
something to the affect of
public string IncrementString(string value)
{
if (string.IsNullOrEmpty(value)) return "a";
var chars = value.ToArray();
var last = chars.Last();
if(char.ToByte() == 122)
return value + "a";
return value.SubString(0, value.Length) + (char)(char.ToByte()+1);
}
you'll probably need to convert the char to a byte. That can be encapsulated in an extension method like static int ToByte(this char);
StringBuilder is a better choice when building large amounts of strings. so you may want to consider using that instead of string concatenation.
Another way to look at this is that you want to count in base 26. The computer is very good at counting and since it always has to convert from base 2 (binary), which is the way it stores values, to base 10 (decimal--the number system you and I generally think in), converting to different number bases is also very easy.
There's a general base converter here https://stackoverflow.com/a/3265796/351385 which converts an array of bytes to an arbitrary base. Once you have a good understanding of number bases and can understand that code, it's a simple matter to create a base 26 counter that counts in binary, but converts to base 26 for display.

String Builder vs Lists

I am reading in multiple files in with millions of lines and I am creating a list of all line numbers that have a specific issue. For example if a specific field is left blank or contains an invalid value.
So my question is what would be the most efficient date type to keep track of a list of numbers that could be upwards of a million number of rows. Would using String Builder, Lists, or something else be more efficient?
My end goal is to out put a message like "Specific field is blank on 1-32, 40, 45, 47, 49-51, etc. So in the case of a String Builder, I would check the previous value and if it is is only 1 more I would change it from 1 to 1-2 and if it was more than one would separate it by a comma. With the List, I would just add each number to the list and then combine then once the file has been completely read. However in this case I could have multiple list containing millions of numbers.
Here is the current code I am using to combine a list of numbers using String Builder:
string currentLine = sbCurrentLineNumbers.ToString();
string currentLineSub;
StringBuilder subCurrentLine = new StringBuilder();
StringBuilder subCurrentLineSub = new StringBuilder();
int indexLastSpace = currentLine.LastIndexOf(' ');
int indexLastDash = currentLine.LastIndexOf('-');
int currentStringInt = 0;
if (sbCurrentLineNumbers.Length == 0)
{
sbCurrentLineNumbers.Append(lineCount);
}
else if (indexLastSpace == -1 && indexLastDash == -1)
{
currentStringInt = Convert.ToInt32(currentLine);
if (currentStringInt == lineCount - 1)
sbCurrentLineNumbers.Append("-" + lineCount);
else
{
sbCurrentLineNumbers.Append(", " + lineCount);
commaCounter++;
}
}
else if (indexLastSpace > indexLastDash)
{
currentLineSub = currentLine.Substring(indexLastSpace);
currentStringInt = Convert.ToInt32(currentLineSub);
if (currentStringInt == lineCount - 1)
sbCurrentLineNumbers.Append("-" + lineCount);
else
{
sbCurrentLineNumbers.Append(", " + lineCount);
commaCounter++;
}
}
else if (indexLastSpace < indexLastDash)
{
currentLineSub = currentLine.Substring(indexLastDash + 1);
currentStringInt = Convert.ToInt32(currentLineSub);
string charOld = currentLineSub;
string charNew = lineCount.ToString();
if (currentStringInt == lineCount - 1)
sbCurrentLineNumbers.Replace(charOld, charNew);
else
{
sbCurrentLineNumbers.Append(", " + lineCount);
commaCounter++;
}
}
My end goal is to out put a message like "Specific field is blank on 1-32, 40, 45, 47, 49-51
If that's the end goal, no point in going through an intermediary representation such as a List<int> - just go with a StringBuilder. You will save on memory and CPU that way.
StringBuilder serves your purpose so stick with that, if you ever need the line numbers you can easily change the code then.
Depends on how you can / want to break the code up.
Given you are reading it in line order, not sure you need a list at all.
Your current desired output implies that you can't output anything until the file is completely scanned. The size of the file suggests a one pass`analysis phase would be a good idea as well, given you are going to use buffered input as opposed to reading the entire thing into memory.
I'd be tempted with an enum to describe the issue e.g Field??? is blank and then use that as the key a dictionary of string builders.
As a first thought anyway
Is your output supposed to be human readable? If so, you'll hit the limit of what is reasonable to read, long before you have any performance/memory issues from your data structure. Use whatever is easiest for you to work with.
If the output is supposed to be machine readable, then that output might suggest an appropriate data structure.
As others have pointed out, I would probably use StringBuilder. The List may have to resize many times; the new implementation of StringBuilder does not have to resize.

Categories