I'm about to build a solution where I receive a semicolon-separated list every night. The list has around 14,000 rows, and I need to go through it and pick out some of the values.
The document I receive is made up of around 50 semicolon-separated values for every "case". It is structured like this:
"";"2010-10-17";"";"";"";"Period-Last24h";"Problem is that the customer cant find....";
and so on, with 43 more semicolon-separated values. Every "case" ends with the value "Total 515";
What I need to do is go through all these "cases" and extract some of their values. The "cases" are always built up in the same order, and I know that it's always the 3rd, 15th and 45th semicolon-separated values that I need.
How can I do this in the easiest way?
I think you should decompose this problem into smaller problems. Here are the steps I'd take:
1. Each record (each "case") represents a single object. C# is an object-oriented language, so stop thinking in terms of .csv records and start thinking in terms of objects. Break up the input into individual records.
2. Given a single semicolon-separated record, the values represent the properties of your object. Give them meaningful names.
3. Parse a semicolon-separated record into an object. When you're done, you'll have a collection of objects that you can deal with.
4. Use C#'s collections and LINQ to filter your list based on those cases that you need to withdraw. When you're done, you'll have a collection of objects with the desired cases removed.
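A minimal sketch of the record-to-object step, under some loud assumptions: the type name CaseRecord and its property names are placeholders (the question does not say what the 3rd, 15th and 45th values mean), no value contains a ';', and the surrounding double quotes can simply be trimmed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type: the property names are placeholders, because the
// question does not say what the 3rd, 15th and 45th values contain.
public sealed class CaseRecord
{
    public string Value3 { get; set; }
    public string Value15 { get; set; }
    public string Value45 { get; set; }
}

public static class CaseParser
{
    // Naive parse: assumes no value contains ';' and that a record
    // has at least 45 values.
    public static CaseRecord Parse(string record)
    {
        string[] parts = record.Split(';');
        if (parts.Length < 45)
            throw new FormatException("Expected at least 45 values, got " + parts.Length);

        return new CaseRecord
        {
            Value3 = parts[2].Trim('"'),
            Value15 = parts[14].Trim('"'),
            Value45 = parts[44].Trim('"')
        };
    }

    // Turns a sequence of records into a list of objects, skipping blank records.
    public static List<CaseRecord> ParseAll(IEnumerable<string> records)
    {
        return records
            .Where(r => !string.IsNullOrWhiteSpace(r))
            .Select(Parse)
            .ToList();
    }
}
```

Once the records are objects, any further filtering is an ordinary LINQ query over the resulting list.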
Don't worry about the "easiest" way. You need one way that works. Whatever you do, get something working and worry about optimizing it to make it easiest, fastest, smallest, etc. later on.
Assuming the "rows" are lines and that you read line by line, your main tool should be string.Split:
foreach (string line in ... )
{
    // Split the line into its semicolon-separated parts.
    string[] parts = line.Split(';');
    string part3 = parts[2];    // 3rd value (arrays are zero-based)
    string part15 = parts[14];  // 15th value
    // etc
}
Note that this is a simple approach that will fail if the content of any column can contain ';'.
You could use String.Split twice.
The first time, use "Total 515"; as the split string, with the overload that takes a string array and a StringSplitOptions value. This will give you an array of cases.
The second time, use ';' as the split character, with the overload that takes a char array, on each of the cases. This will give you a data array for each case. As the data is consistent, you can extract the 3rd, 15th and 45th elements of this array.
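The two splits could look roughly like this. It is only a sketch: the class and method names are made up for illustration, and it assumes every case ends with the literal text "Total 515"; (quotes included) and that no value contains a ';'.

```csharp
using System;
using System.Collections.Generic;

public static class CaseSplitter
{
    // Returns the 3rd, 15th and 45th values of every case in the input.
    // Assumes each case ends with the literal text "Total 515"; and that
    // no value contains ';'.
    public static List<string[]> Extract(string input)
    {
        var result = new List<string[]>();

        // First split: one array element per case.
        string[] cases = input.Split(
            new[] { "\"Total 515\";" },
            StringSplitOptions.RemoveEmptyEntries);

        foreach (string c in cases)
        {
            // Second split: one array element per value in the case.
            string[] values = c.Trim().Split(';');
            if (values.Length < 45)
                continue; // skip fragments such as trailing whitespace

            result.Add(new[] { values[2], values[14], values[44] });
        }

        return result;
    }
}
```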
I'd search for an existing csv library. The escaping rules are probably not that easily mapped to regex.
If I were writing a library myself, I'd first parse each line into a list or an array of strings. Then, in a second step (probably outside of the csv library itself), I'd convert the string list to a strongly typed object.
A simple but slow approach would be reading single characters from the input (with the StringReader class, for example). Write a ReadItem method that reads a quote, continues to read until the next quote, and then looks at the next character. If it is a newline or a semicolon, one item has been read. If it is another quote, add a single quote to the item being read. Otherwise, throw an exception. Then use this method to split up the input data into a series of items, each line stored e.g. in a string[number of items in a row], and the lines stored in a List<>. Then you can use this class to read the CSV data inside another class that decodes the data read into objects that you can get your data out of.
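Here is one way the ReadItem idea might be sketched. Only ReadItem is named in the description above; the class name, ReadAll, and the endOfLine parameter are invented for illustration. The sketch also assumes every item is quoted, and it tolerates a trailing ';' before the line break, as in the sample data.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class QuotedReader
{
    // Reads one quoted item. endOfLine is set to true when the item was
    // followed by a line break or the end of the input instead of ';'.
    public static string ReadItem(TextReader reader, out bool endOfLine)
    {
        if (reader.Read() != '"')
            throw new FormatException("Expected an opening quote.");

        var item = new StringBuilder();
        while (true)
        {
            int c = reader.Read();
            if (c == -1) throw new FormatException("Unterminated item.");
            if (c != '"') { item.Append((char)c); continue; }

            // We hit a quote: the next character decides what it means.
            int next = reader.Read();
            if (next == '"') { item.Append('"'); continue; }   // doubled quote
            if (next == ';') { endOfLine = false; return item.ToString(); }
            if (next == '\n' || next == '\r' || next == -1)
            {
                endOfLine = true;
                return item.ToString();
            }
            throw new FormatException("Unexpected character after closing quote.");
        }
    }

    // Splits the whole input into lines, each line being an array of items.
    public static List<string[]> ReadAll(string input)
    {
        var lines = new List<string[]>();
        var current = new List<string>();
        using (var reader = new StringReader(input))
        {
            while (reader.Peek() != -1)
            {
                int p = reader.Peek();
                if (p == '\r' || p == '\n')
                {
                    // Line break after a trailing ';' (or a blank line).
                    reader.Read();
                    if (current.Count > 0) { lines.Add(current.ToArray()); current.Clear(); }
                    continue;
                }

                bool endOfLine;
                current.Add(ReadItem(reader, out endOfLine));
                if (endOfLine) { lines.Add(current.ToArray()); current.Clear(); }
            }
        }
        if (current.Count > 0) lines.Add(current.ToArray());
        return lines;
    }
}
```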
Related
I have a string as shown below
string names = "<?startname; Max?><?startname; Alex?><?startname; Rudy?>";
Is there any way I can split this string and add Max, Alex and Rudy into a separate list?
Sure, split on two strings (all that consistently comes before, and all that consistently comes after) and specify that you want Split to remove the empties:
var r = names.Split(new[]{ "<?startname; ", "?>" }, StringSplitOptions.RemoveEmptyEntries);
If you take out the RemoveEmptyEntries, you get a clearer idea of how the splitting works: without it, your names are interspersed with array entries that are empty strings, because Split finds a delimiter (the <?...) immediately following another (the ?>) with an empty string between them.
The documentation for this overload of String.Split goes into much more detail. This variant of Split has been available since .NET Framework 2.0.
You also said "add to a separate list". I didn't see any code for that, so I guess you will either be happy to treat r as "a separate list" (an array actually, but probably adequately equivalent, and easy to convert with LINQ's ToList() if not), or, if you have another list of names (that really is a List<string>), you can call thatList.AddRange(r).
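Put together, the split plus the "separate list" part might look like this. The class and method names are just for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class NameSplitter
{
    public static List<string> ExtractNames(string names)
    {
        // Split on everything that comes before and after each name,
        // dropping the empty strings found between adjacent delimiters.
        string[] parts = names.Split(
            new[] { "<?startname; ", "?>" },
            StringSplitOptions.RemoveEmptyEntries);

        // Copy the array into a real List<string>.
        var list = new List<string>();
        list.AddRange(parts);
        return list;
    }
}
```

For the string in the question, this yields a list containing Max, Alex and Rudy.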
Another idea is to use a regular expression.
The following regex should work:
(?<=; )(.*?)(?=\s*\?>)
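A short sketch of applying that pattern with Regex.Matches. The wrapper class and method are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NameRegex
{
    public static List<string> ExtractNames(string names)
    {
        var result = new List<string>();

        // The lookbehind requires "; " before the name, the lookahead
        // requires "?>" (optionally preceded by whitespace) after it.
        foreach (Match m in Regex.Matches(names, @"(?<=; )(.*?)(?=\s*\?>)"))
        {
            result.Add(m.Value);
        }

        return result;
    }
}
```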
I want my bot to accept two different parameters in one message. That is, I want to extract the part of the input before a "," (or some other symbol) and then extract the part after it separately, so that I get two different strings from one input: the first half in one string and the rest in the other. I am not planning on updating to 1.0, so tell me if this isn't possible in 0.9.6.
Your question isn't very clear but I think I know what you are looking for. This is a general answer for C# as I don't know how the Discord interface differs. You seem to be taking input in the form of a string, for example: play *songname*,*channelname*. To split this string into two inputs you want to use String.Split(',')
An example would be this:
string stringTakenFromDiscord = "play *songname*,*channelname*";
String[] input = stringTakenFromDiscord.Split(',');
//input[0] will be equal to what comes before the comma
//if you were to print it, it would be "play *songname*"
//input[1] will be what comes after the comma
//if you were to print it, it would be "*channelname*"
Now you can do anything you want with either of the values in the input array and feed them through your code to parse them. Note that the character you split on won't appear in either of the output strings. You get exactly two parts only when the input contains a single instance of your chosen character; every additional instance produces another array element. You can change the character to whatever you want.
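If the second parameter might itself contain the separator, one option is the Split overload that takes a maximum count. This is a sketch with a made-up helper name, not anything Discord-specific.

```csharp
using System;

public static class CommandSplitter
{
    // Splits at the first comma only, so the result has at most two
    // elements: everything before the first comma, and everything after it.
    public static string[] SplitInTwo(string input)
    {
        return input.Split(new[] { ',' }, 2);
    }
}
```

When the input has no comma at all, the array has a single element, so check its Length before reading the second part.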
It occurs to me that it might be easier to just take the input on two separate lines instead.
Just looking to see what the best way to approach the following situation would be.
I am trying to make a small job that reads in a txt file which has a thousand or so lines.
Each line is about 40 characters long (mostly numbers, some letter identifiers).
I have used
DataTable txtCache = new DataTable();
txtCache.Columns.Add(new DataColumn("Column1"));
string[] lines = System.IO.File.ReadAllLines(FILEcheck.Properties.Settings.Default.filePath);
foreach (string line in lines)
{
    txtCache.Rows.Add(line);
}
However, what I really want to do is a bit confusing and hard to explain, so I'll do my best. An example line is below:
5498494000584454684840}eD44448774V6468465 Z
In the beginning of that long string is a "84", and then a "58" a little bit later. I need to do a comparison on these two numbers. They could be anything, but only a few combinations are acceptable in the file. They will always be in the same spot and same amount of characters (so it will always be 2 numbers and always in the 4-5 location). So I want to have 3 columns. I want the full string in 1 column, and then the 2 individual smaller numbers in columns of themselves. I can then compare them later on, and if there is an issue, I can return the full string which caused the issue.
Is this possible? I am just not sure how to parse out a substring based on character location and then loading it into a datatable.
Any advice would be appreciated. Thank you,
You could create the columns for each of items you are looking to store (whole string, first number, second number), and then add a row for each of the lines in the input file. You could just use the substring method to parse out the two digit numbers and store them. To do your analysis, you could parse the numbers out from the strings, or whatever else you need to do.
lines[0].Substring(3,2) will give you "84" in your above example. If you want the int, you could use Int32.Parse(lines[0].Substring(3,2))
Substring reference: http://msdn.microsoft.com/en-us/library/aka44szs%28v=vs.110%29.aspx
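A rough sketch of the three-column idea. The column names are invented, and the offset of the second number (index 10) is only inferred from the sample line, where "58" sits at that position; adjust both offsets to the real layout.

```csharp
using System;
using System.Data;

public static class LineLoader
{
    public static DataTable Load(string[] lines)
    {
        var table = new DataTable();
        table.Columns.Add("FullLine", typeof(string));
        table.Columns.Add("First", typeof(int));
        table.Columns.Add("Second", typeof(int));

        foreach (string line in lines)
        {
            // Skip lines too short to contain both numbers.
            if (line.Length < 12)
                continue;

            // Offsets are zero-based: index 3 is the 4th character.
            int first = int.Parse(line.Substring(3, 2));
            int second = int.Parse(line.Substring(10, 2)); // assumed offset

            table.Rows.Add(line, first, second);
        }

        return table;
    }
}
```

Keeping the full line in its own column means a failed comparison can report the original text directly from the row.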
Having used SQL Server Bulk insert of CSV file with inconsistent quotes (CsvToOtherDelimiter option) as my basis, I discovered a few weirdnesses with the RemoveCSVQuotes part [it chopped the last char from quoted strings that contained a comma!]. So.. rewrote that bit (maybe a mistake?)
One wrinkle is that the client has asked 'what about data like this?'
""17.5179C,""
I assume if I wanted to keep using the CsvToOtherDelimiter solution, I'd have to amend the RegExp...but it's WAY beyond me... what's the best approach?
To clarify: we are using C# to pre-process the file into a pipe-delimited format prior to running a bulk insert using a format file. Speed is pretty vital.
The accepted answer from your link starts with:
You are going to need to preprocess the file, period.
Why not transform your csv to xml? Then you would be able to verify your data against an xsd before storing into a database.
To convert a CSV string into a list of elements, you could write a program that keeps track of state (in quotes or out of quotes) as it processes the string one character at a time, and emits the elements it finds. The rules for quoting in CSV are weird, so you'll want to make sure you have plenty of test data.
The state machine could go like this:
1. scan until quote (go to 2) or comma (go to 3)
2. if the next character is a quote, add only one of the two quotes to the field and return to 1. Otherwise, go to 4 (or report an error if the quote isn't the first character in the field).
3. emit the field, go to 1
4. scan until quote (go to 5)
5. if the next character is a quote, add only one of the two quotes to the field and return to 4. Otherwise, emit the field, scan for a comma, and go to 1.
This should correctly scan stuff like:
hello, world, 123, 456
"hello world", 123, 456
"He said ""Hello, world!""", "and I said hi"
""17.5179C,"" (correctly reports an error, since there should be a separator between the first quoted string "" and the second field 17.5179C).
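A simplified sketch of such a scanner for a single line. It is not a literal transcription of the five states: it assumes a quote that opens a field always starts a quoted field, that spaces around fields can be dropped, and that a quote in the middle of an unquoted field is an error. The class name is made up.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvLineParser
{
    public static List<string> Parse(string line)
    {
        var fields = new List<string>();
        var field = new StringBuilder();
        int i = 0;

        while (true)
        {
            field.Clear();
            while (i < line.Length && line[i] == ' ') i++;   // skip leading spaces

            if (i < line.Length && line[i] == '"')
            {
                i++; // consume the opening quote
                while (true)
                {
                    if (i >= line.Length)
                        throw new FormatException("Unterminated quoted field.");
                    if (line[i] != '"') { field.Append(line[i++]); continue; }

                    // A doubled quote is a literal quote; a single one closes the field.
                    if (i + 1 < line.Length && line[i + 1] == '"') { field.Append('"'); i += 2; }
                    else { i++; break; }
                }

                while (i < line.Length && line[i] == ' ') i++;
                if (i < line.Length && line[i] != ',')
                    throw new FormatException("Expected a separator at position " + i + ".");
                fields.Add(field.ToString());
            }
            else
            {
                while (i < line.Length && line[i] != ',')
                {
                    if (line[i] == '"')
                        throw new FormatException("Quote inside an unquoted field at position " + i + ".");
                    field.Append(line[i++]);
                }
                fields.Add(field.ToString().TrimEnd());
            }

            if (i >= line.Length) return fields;
            i++; // consume the comma and start the next field
        }
    }
}
```

With these assumptions, the first three sample lines parse into their fields, and ""17.5179C,"" throws a FormatException because 17.5179C follows the empty quoted string without a separator.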
Another way would be to find some existing library that does it well. Surely, CSV is common enough that such a thing must exist?
edit:
You mention that speed is vital, so I wanted to point out that (so long as the quoted strings aren't allowed to include line breaks...) each line may be processed independently, in parallel.
I ended up using a csv parser that I didn't know we already had (it comes as part of our code generation tool), and noting that ""17.5179C,"" is not valid and will cause errors.
I have 5 strings, let's call them
EarthString
FireString
WindString
WaterString
HeartString
All of them can have varying length, any of them can be empty, or can be very long (but never null).
These 5 strings are very good friends, and every weekend they are concatenated to form a result string using this C# statement:
ResultString = EarthString + FireString + WindString + WaterString + HeartString;
Depending on the values of these strings, sometimes (only sometimes), ResultString will contain "Captain Planet" as a substring.
My question is, how do I manipulate each of the 5 strings before they are concatenated, so that when they are combined, "Captain Planet" will never appear as a substring in the resultant string?
The only way I can think of right now is to examine each character in each string, in sequential order, but that seems very tedious. Since each of the 5 good friends can be of any length, examining the characters individually will also require some kind of concatenation before we can determine whether any character needs to be dropped.
Edit: The resultant string is a filtered version of the 5 strings concatenated together. All the other content remains the same, except that the "Captain Planet" string is dropped. Yes, I'm looking for a solution which allows the 5 strings to be manipulated before concatenation. (This is actually a simplification of a bigger programming problem I'm encountering.) Thanks guys.
If you want to do it pre-concat you could
Assign the start and end of each string a numeric value based on the portion of "CaptainPlanet" they contain. For example, if Air = "net the big captain", it would get 3 for a start value and 7 for an end value. To determine whether you can concatenate two values safely, check that the end value of the left string plus the start value of the right string does not equal the total length of "CaptainPlanet". If you had very large strings, this would allow you to inspect just the first x and last x characters of each string to compute the start/end values.
This solution doesn't account for short strings, e.g. air = "Cap", earth = "tain" and fire = "Planet". You would need a special case for tokens that are shorter than the length of "CaptainPlanet".
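A sketch of how the start/end values could be computed for a join of two strings. The names are hypothetical, the comparison is case-sensitive (so the end value of 7 needs "Captain" with a capital C), and it relies on the target having no part that is both a prefix and a suffix of itself, which holds for "CaptainPlanet". It does not handle the short-string case described above.

```csharp
using System;

public static class BoundaryCheck
{
    // Length of the longest proper prefix of target that s ends with.
    public static int EndValue(string s, string target)
    {
        for (int len = Math.Min(s.Length, target.Length - 1); len > 0; len--)
        {
            if (s.EndsWith(target.Substring(0, len), StringComparison.Ordinal))
                return len;
        }
        return 0;
    }

    // Length of the longest proper suffix of target that s starts with.
    public static int StartValue(string s, string target)
    {
        for (int len = Math.Min(s.Length, target.Length - 1); len > 0; len--)
        {
            if (s.StartsWith(target.Substring(target.Length - len), StringComparison.Ordinal))
                return len;
        }
        return 0;
    }

    // True when joining left and right would complete the target across
    // the boundary between the two strings.
    public static bool JoinCreatesTarget(string left, string right, string target)
    {
        return EndValue(left, target) + StartValue(right, target) == target.Length;
    }
}
```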
Is there a particular reason you can't just do this?
ResultString = ResultString.Replace("CaptainPlanet", "x");
If it doesn't matter how many chars will be dropped, you can remove, e.g., all 'C' characters in all strings.
The original answer cleared all of the strings, but as pointed out by J.Steen, there was already a formulation of the expected output. So there we go.
1. Run elementString.Replace("Captain Planet", "") on every substring.
2. Now you have to identify all the prefixes / suffixes of "Captain Planet" on each of the substrings, and keep that information so that it can be processed before concatenation. That is, e.g. if the substring ends with "Capt", then you should record that "substring contains at the end a prefix of the 4 first letters of 'Captain Planet'". You also have to consider the cases of complete substrings (e.g. one of the strings is "ptain Pla"). The problem also becomes more complex if any of the e.g. prefixes can be recursive or repeated (e.g. "CaptainCap" contains 2 kinds of valid prefixes for "CaptainCaptain", and "apt" can be found at two locations in the resulting string).
3. You process that information before concatenation so that the result string has the same thing as ResultString.Replace("Captain Planet", ""). Congratulations, you have made your program much more complex than necessary!
But in short, you cannot get both the result that you want (all of the substrings intact except for the combined result output) and do the processing wholly before the concatenation step.
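For comparison, a small sketch of the simple route: concatenate first, then filter. The helper name is invented. It also shows why running Replace on each piece alone is not enough when the phrase straddles a boundary.

```csharp
using System;

public static class PlanetFilter
{
    // Concatenates the parts and then removes the phrase from the result.
    public static string Filter(params string[] parts)
    {
        return string.Concat(parts).Replace("Captain Planet", "");
    }

    // Removes the phrase from each part separately and then concatenates.
    // This misses occurrences that are split across two or more parts.
    public static string FilterEachPart(params string[] parts)
    {
        var cleaned = new string[parts.Length];
        for (int i = 0; i < parts.Length; i++)
        {
            cleaned[i] = parts[i].Replace("Captain Planet", "");
        }
        return string.Concat(cleaned);
    }
}
```

With the parts "Capt", "ain Pl" and "anet!", Filter returns "!" while FilterEachPart still returns "Captain Planet!". Note that a single Replace pass can itself leave a new occurrence behind (removing the phrase from "CaptainCaptain Planet Planet" leaves "Captain Planet"), so repeat it if that matters.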