Decoding data read from a .bin file into fields - c#

I was originally going to read the file into an array, pin it to a struct, convert and display. I have since been trying to find another solution (I have removed the original details here to cause less confusion). I have a .bin file, and I can correctly identify how many records are in it by using a simple sum and FileInfo.
I've looked at:
http://www.sixscape.com/joomla/sixscape/images/pdf/C%20Sharp%20Training%20-%20Part%204%20-%20Contact%20Manager%20with%20Random%20Access%20File.pdf
http://www.vbi.org/Items/article.asp?id=16
Files imported have the same structure and look similar to the following. (Note: the screenshots showed the number of rows in each file; I then calculated the rows for the DataGridView table using a sum, and from there I produced Adding rows to second column ONLY - populating data using a for loop:)
long Count = 1;
FileInfo Fi = new FileInfo(import.FileName);
long sum = (Fi.Length / 1024) - Count;
for (int i = 0; i < sum; i++)
{
    DataGridView1.Rows.Add(null, Count++);
    ReadWeldRecs(sum); // Added after the question was published; called in ReadWeldRecs
}
The first shows a total of 21 rows and the second 9:
I have a method called DecodeDate and another called ReadWeldRecs; DecodeDate is called from ReadWeldRecs, which is triggered via a button click event. I know what date should be displayed, but when it comes to viewing the result from DecodeDate, it is wrong. I want to be able to read the date inside the file. import.FileName is the filename (@Kevin) of the file that has been opened using OpenFileDialog, and the date is stored at position 5 in the file.
My first go:
The Date is displayed as: 22/08/2123
But should be: 21/10/2008
I've wondered whether it's an issue with the location, but I'm sure it's position 5.
Update: Turns out I was looking in the wrong location... Duh.
private DateTime DecodeDate(int Start)
{
    int Year = Strings.Asc(Strings.Mid(import.FileName, Start, 1)) + 2000;
    int Month = Strings.Asc(Strings.Mid(import.FileName, Start + 1, 1));
    int Day = Strings.Asc(Strings.Mid(import.FileName, Start + 2, 1));
    return DateAndTime.DateSerial(Year, Month, Day);
}
Original:
This is the original VB code, which worked fine in the outdated program (I looked at this mostly to reconstruct the DecodeDate method in C#):
Public Function DecodeDate(Start As Integer) As Date
    YYear = Asc(Mid$(ImportRecord, Start, 1)) + 2000
    MMonth = Asc(Mid$(ImportRecord, Start + 1, 1))
    DDay = Asc(Mid$(ImportRecord, Start + 2, 1))
    DecodeDate = DateSerial(YYear, MMonth, DDay)
End Function
ImportRecord is defined as the following: (global string)
Open ImportFileName For Random As #1 Len = Len(ImportRecord)
' ...
Get #1, Index + 1, ImportRecord
' ...
.Date = DecodeDate(5)
Current:
private void ReadWeldRecs(long RecordNumber)
{
    byte[] Rec = new byte[1024];
    using (FileStream Fs = new FileStream(import.FileName, FileMode.Open, FileAccess.Read))
    using (BinaryReader Br = new BinaryReader(Fs))
    {
        int Rec_Len;
        RecordNumber = 0; // Start with record 0
        while (true)
        {
            Fs.Seek(RecordNumber * 1024, SeekOrigin.Begin); // Position file to record
            Rec_Len = Br.Read(Rec, 0, 1024); // Read the record here
            if (Rec_Len == 0) // If reached end of file, end loop
            {
                break;
            }
            if (Rec[0] != 0) // If there is a record, let's display it
            {
                Label_Date1.Text = DecodeDate(Rec, 28).ToShortDateString();
            }
            RecordNumber++; // Move on to the next record
        }
        // No explicit Close() calls needed; the using blocks dispose both streams.
    }
}
Plus @Kevin's updated solution :)
However, while this has resolved a major issue, I still have another one where I am trying to follow the guidelines and template of @Kevin's solution for my other method, DecodeString.
In VB:
Public Function DecodeString(Start As Integer, Length As Integer) As String
    Dim Count As Integer
    Dummy = Mid(ImportRecord, Start, Length)
    For Count = 1 To Len(Dummy)
        If (Mid$(Dummy, Count, 1) = Chr$(0)) Or (Mid$(Dummy, Count, 1) = Chr$(255)) Then
            Mid$(Dummy, Count, 1) = Chr$(32)
        End If
    Next
    DecodeString = Trim(Dummy)
End Function
Again, note that I'm looking at using that solution as a template for this.

A couple of things...
Is it C# or VB you are looking for? It's tagged as C#, but Strings and DateAndTime are VB.
Are you just parsing the date out of a string that is the filename?
Is the date value really represented as the ASCII value of the characters? Really? That seems extremely odd... This means that to get year = 8, month = 10, day = 20 your filename would be something like abcde[BackspaceCharacter][LinefeedCharacter][Device control 4]. I definitely don't think you have control characters in there unless of course it was binary data stuffed into a string in the first place.
I'm going to assume the answers to these questions are... C#, Yes, and No. It would be easy to convert from C# -> VB if that's an incorrect assumption. I don't think I've used anything specific to C# here. Note I've spread this out onto several lines for readability on the forum.
private DateTime DecodeDate(int Start)
{
    // If your filename is 'ABC-01234 2008-10-21#012345.bin' then Start should = 10
    DateTime date;
    var datePart = filename.Substring(Start, 10);
    var culture = CultureInfo.InvariantCulture;
    var style = DateTimeStyles.None;
    // Note: since you are telling it the specific format of your date string,
    // the culture info is not relevant. Also note that "yyyy-MM-dd" below can be
    // changed to fit your parsing needs: yyyy = 4-digit year, MM = 2-digit month
    // and dd = 2-digit day.
    if (!DateTime.TryParseExact(datePart, "yyyy-MM-dd", culture, style, out date))
    {
        // TryParseExact returns false if it couldn't parse the date.
        // Since it failed to properly parse the date... do something
        // here like throw an exception.
        throw new Exception("Unable to parse the date!");
    }
    return date;
}
After looking up DateAndTime, there is another way for you to get the numbers you are seeing without there being control characters in the filename... but in that case I still don't think you want to use DateAndTime.DateSerial, since it only obfuscates the error (if you accidentally give it 25 for the month, the date it returns has January for the month and adds 2 years to the year portion).
If this doesn't solve it... give the format of your filename and it will be easier to figure out what exactly the problem is.
EDIT: Based on your updates... looks like you are trying to parse the binary data...
An easier way to do this is with the .NET framework methods on BinaryReader, like BinaryReader.ReadByte, .ReadDouble, etc.
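For instance, a minimal sketch of reading the three date bytes straight off the stream (the ReadDateAt helper and its offset argument are my own illustration, not your existing code):
using System;
using System.IO;

// Hypothetical helper: seek to the date field and read it byte by byte.
static DateTime ReadDateAt(BinaryReader reader, long offset)
{
    reader.BaseStream.Seek(offset, SeekOrigin.Begin);
    int year = reader.ReadByte() + 2000; // each field is a single byte
    int month = reader.ReadByte();
    int day = reader.ReadByte();
    return new DateTime(year, month, day);
}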
For your code though you'll need to pass in the byte array so you can pull the values out of it. Style wise you want to limit or eliminate usage of global variables...
private DateTime DecodeDate(byte[] bytes, int start)
{
    // Note that Length is 1-based and the indexer is 0-based, so:
    if (bytes.Length < start + 3)
        throw new Exception("Byte array wasn't long enough");
    int year = Convert.ToInt32(bytes[start]) + 2000;
    int month = Convert.ToInt32(bytes[start + 1]);
    int day = Convert.ToInt32(bytes[start + 2]);
    return new DateTime(year, month, day);
}
And then your call to it looks like this...
Label_Date1.Text = DecodeDate(Rec, 5).ToShortDateString();
I wanted to mention a couple of style pointers... These are as much personal preference as anything but I find them helpful.
Don't use globals, or severely limit your usage of them - import is used in your method but is defined somewhere else - this makes it harder to see what your code is doing and for you to troubleshoot it.
Name local variables starting in lowerCase in order to make it easier to see what is a method and what is a variable.
Be VERY careful with while(true) - it's better if you can rewrite it to make it clearer when you exit the loop.
ONE MORE THING! You said your date was at position 5. In VB arrays generally start at 1 so the 5th element is array index 5... in C# arrays start with 0 so the 5th element would be index 4. I can't know which you should be checking from your code since I don't know the structure of the data so just keep this in mind.
EDIT 2:
For DecodeString it's both simple and hard... the code is very simple... however... strings are MUCH more complex than you would think. To convert from string to binary or vice versa you have to choose an encoding. Encodings are an algorithm for converting characters into binary data. Their names will be familiar to you like ASCII, UTF8, Unicode, etc. Part of the problem is that some programming languages obfuscate encodings from you and much of the time programmers can be blissfully ignorant of them. In your case it seems like your binary file is written in ASCII but it's impossible for me to know that... so my code below works if it's in ASCII... if you run into cases where your string isn't decoded properly you may have to change that.
private static string DecodeString(byte[] bytes, int start, int length)
{
    return Encoding.ASCII.GetString(bytes, start, length);
}
Bonus reading - Joel Spolsky's post on Encodings

Related

Unity formatting multiple numbers

So I'm a complete newb to Unity and C#, and I'm trying to make my first mobile incremental game. I know how to format a variable from (e.g.) 1000 >>> 1k; however, I have several variables that can go up to decillion+, so I imagine having to check every variable's value separately up to decillion+ would be quite inefficient. Being a newb, I'm not sure how to go about it; maybe a for loop or something?
EDIT: I'm checking if x is greater than a certain value. For example if it's greater than 1,000, display 1k. If it's greater than 1,000,000, display 1m...etc etc
This is my current code for checking if x is greater than 1000; however, I don't think copy-pasting this for other values would be very efficient:
if (totalCash > 1000)
{
    totalCashk = totalCash / 1000;
    totalCashTxt.text = "$" + totalCashk.ToString("F1") + "k";
}
So, I agree that copying code is not efficient. That's why people invented functions!
How about simply wrapping your formatting in a function, e.g. one named prettyCurrency?
So you can simply write:
totalCashTxt.text = prettyCurrency(totalCash);
Also, instead of writing a ton of ifs, you can use a base-10 logarithm to determine the number of digits. An example in pure C# is below:
using System;

class Program
{
    // Very simple example; will throw for numbers of 10^12 and above
    static readonly string[] suffixes = { "", "k", "M", "G" };

    static string prettyCurrency(long cash, string prefix = "$")
    {
        int k;
        if (cash == 0)
            k = 0; // log10 of 0 is not valid
        else
            k = (int)(Math.Log10(cash) / 3); // get the number of digits and divide by 3
        var divisor = Math.Pow(10, k * 3); // the scale we actually print in
        var text = prefix + (cash / divisor).ToString("F1") + suffixes[k];
        return text;
    }

    static void Main()
    {
        Console.WriteLine(prettyCurrency(0));
        Console.WriteLine(prettyCurrency(333));
        Console.WriteLine(prettyCurrency(3145));
        Console.WriteLine(prettyCurrency(314512455));
        Console.WriteLine(prettyCurrency(31451242545));
    }
}
OUTPUT:
$0.0
$333.0
$3.1k
$314.5M
$31.5G
Also, you might think about introducing a new type, which implements this function as its ToString() overload.
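A minimal sketch of that idea (the Currency struct is hypothetical; it just reuses the formatting above):
struct Currency
{
    private readonly long amount;
    public Currency(long amount) { this.amount = amount; }

    private static readonly string[] Suffixes = { "", "k", "M", "G" };

    // ToString() applies the suffix formatting, so callers can simply
    // concatenate or interpolate a Currency value.
    public override string ToString()
    {
        int k = amount == 0 ? 0 : (int)(Math.Log10(amount) / 3);
        double divisor = Math.Pow(10, k * 3);
        return "$" + (amount / divisor).ToString("F1") + Suffixes[k];
    }
}

// usage: totalCashTxt.text = new Currency(totalCash).ToString();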
EDIT:
I forgot about 0 in the input; now it is fixed. And indeed, as @Draco18s said in his comment, neither int nor long will handle really big numbers, so you can either use an external library like BigInteger or switch to double, which loses precision as the numbers get bigger and bigger (e.g. 1000000000000000.0 + 1 might be equal to 1000000000000000.0). If you choose the latter, you should change my function to handle numbers in the range (0.0, 1.0), for which log10 is negative.
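One hedged way to cover that double case is to clamp the suffix index, so values in (0.0, 1.0) fall back to the bare number:
// Variant for double input: Math.Max keeps k at 0 for inputs whose log10
// is negative; Math.Min guards the top end of the suffix table.
static string prettyCurrency(double cash, string prefix = "$")
{
    string[] suffixes = { "", "k", "M", "G" };
    int k = cash <= 0 ? 0 : Math.Max(0, (int)(Math.Log10(cash) / 3));
    k = Math.Min(k, suffixes.Length - 1);
    double divisor = Math.Pow(10, k * 3);
    return prefix + (cash / divisor).ToString("F1") + suffixes[k];
}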

Using c# to read from a text file

I am reading from a text file using the code below.
if (!allLines.Contains(":70"))
{
    var firstIndex = allLines.IndexOf(":20");
    var secondIndex = allLines.IndexOf(":23B");
    var thirdIndex = allLines.IndexOf(":59");
    var fourthIndex = allLines.IndexOf(":71A");
    var fifthIndex = allLines.IndexOf(":72");
    var sixthIndex = allLines.IndexOf("-}");
    var firstValue = allLines.Substring(firstIndex + 4, secondIndex - firstIndex - 5).TrimEnd();
    var secondValue = allLines.Substring(thirdIndex + 4, fourthIndex - thirdIndex - 5).TrimEnd();
    var thirdValue = allLines.Substring(fifthIndex + 4, sixthIndex - fifthIndex - 5).TrimEnd();
    var len1 = firstValue.Length;
    var len2 = secondValue.Length;
    var len3 = thirdValue.Length;
    inflow103.REFERENCE = firstValue.TrimEnd();
    pointer = 1;
    inflow103.BENEFICIARY_CUSTOMER = secondValue;
    inflow103.RECEIVER_INFORMATION = thirdValue;
}
else if (allLines.Contains(":70"))
{
    var firstIndex = allLines.IndexOf(":20");
    var secondIndex = allLines.IndexOf(":23B");
    var thirdIndex = allLines.IndexOf(":59");
    var fourthIndex = allLines.IndexOf(":70");
    var fifthIndex = allLines.IndexOf(":71");
    var sixthIndex = allLines.IndexOf(":72");
    var seventhIndex = allLines.IndexOf("-}");
    var firstValue = allLines.Substring(firstIndex + 4, secondIndex - firstIndex - 5).TrimEnd();
    var secondValue = allLines.Substring(thirdIndex + 5, fourthIndex - thirdIndex - 5).TrimEnd();
    var thirdValue = allLines.Substring(sixthIndex + 4, seventhIndex - sixthIndex - 5).TrimEnd();
    var len1 = firstValue.Length;
    var len2 = secondValue.Length;
    var len3 = thirdValue.Length;
    inflow103.REFERENCE = firstValue.TrimEnd();
    pointer = 1;
    inflow103.BENEFICIARY_CUSTOMER = secondValue;
    inflow103.RECEIVER_INFORMATION = thirdValue;
}
Below is the format of the text file I am reading.
{1:F21DBLNNGLAAXXX4695300820}{4:{177:1405260906}{451:0}}{1:F01DBLNNGLAAXXX4695300820}{2:O1030859140526SBICNGLXAXXX74790400761405260900N}{3:{103:NGR}{108:AB8144573}{115:3323774}}{4:
:20:SBICNG958839-2
:23B:CRED
:23E:SDVA
:32A:140526NGN168000000,
:50K:IHS PLC
:53A:/3000025296
SBICNGLXXXX
:57A:/3000024426
DBLNNGLA
:59:/0040186345
SONORA CAPITAL AND INVSTMENT LTD
:71A:OUR
:72:/CODTYPTR/001
-}{5:{MAC:00000000}{PAC:00000000}{CHK:42D0D867739F}}{S:{SPD:}{SAC:}{FAC:}{COP:P}}
The above format represents one transaction in a single text file, but while testing with live files, I came across a situation where a file can have more than one transaction. An example is shown below.
{1:F21DBLNNGLAAXXX4694300150}{4:{177:1405231923}{451:0}}{1:F01DBLNNGLAAXXX4694300150}{2:O1031656140523FCMBNGLAAXXX17087957771405231916N}{3:{103:NGR}{115:3322817}}{4:
:20:TRONGN3RDB16
:23B:CRED
:23E:SDVA
:26T:001
:32A:140523NGN1634150,00
:50K:/2206117013
SUNLEK INVESTMENT LTD
:53A:/3000024763
FCMBNGLA
:57A:/3000024426
DBLNNGLA
:59:/0022617678
GOLDEN DC INT'L LTD
:71A:OUR
:72:/CODTYPTR/001
//BNF/TRSF
-}{5:{MAC:00000000}{PAC:00000000}{CHK:C21000C4ECBA}{DLM:}}{S:{SPD:}{SAC:}{FAC:}{COP:P}}${1:F21DBLNNGLAAXXX4694300151}{4:{177:1405231923}{451:0}}{1:F01DBLNNGLAAXXX4694300151}{2:O1031656140523FCMBNGLAAXXX17087957781405231916N}{3:{103:NGR}{115:3322818}}{4:
:20:TRONGN3RDB17
:23B:CRED
:23E:SDVA
:26T:001
:32A:140523NGN450000,00
:50K:/2206117013
SUNLEK INVESTMENT LTD
:53A:/3000024763
FCMBNGLA
:57A:/3000024426
DBLNNGLA
:59:/0032501697
SUNSTEEL INDUSTRIES LTD
:71A:OUR
:72:/CODTYPTR/001
//BNF/TRSF
-}{5:{MAC:00000000}{PAC:00000000}{CHK:01C3B7B3CA53}{DLM:}}{S:{SPD:}{SAC:}{FAC:}{COP:P}}
My challenge is that in my code, while reading allLines, each field is identified by a certain index. In a situation where I need to pick up the second transaction from the file, the same indices exist as before, so how can I manage this situation?
This is a simple problem obscured by excess code. All you are doing is extracting 3 values from a chunk of text where the precise layout can vary from one chunk to another.
There are 3 things I think you need to do.
Refactor the code. Instead of two hefty if blocks inline, you need functions that extract the required text.
Use regular expressions. A single regular expression can extract the values you need in one line instead of several.
Separate the code from the data. The logic of these two blocks is identical, only the data changes. So write one function and pass in the regular expression(s) needed to extract the data items you need.
Unfortunately this calls for a significant lift in the abstraction level of the code, which may be beyond what you're ready for. However, if you can do this and (say) you have function Extract() with regular expressions as arguments, you can apply that function once, twice or as often as needed to handle variations in your basic transaction.
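A rough sketch of what such an Extract() function could look like (the regex approach is mine; treat the tags as illustrative, not a full MT103 grammar):
using System.Text.RegularExpressions;

// Hypothetical helper: grab the text between a start tag (e.g. ":20:")
// and the tag that follows it (e.g. ":23B:"), trimming trailing whitespace.
static string Extract(string allLines, string startTag, string endTag)
{
    var pattern = Regex.Escape(startTag) + "(.*?)" + Regex.Escape(endTag);
    Match m = Regex.Match(allLines, pattern, RegexOptions.Singleline);
    return m.Success ? m.Groups[1].Value.TrimEnd() : null;
}

// usage: inflow103.REFERENCE = Extract(allLines, ":20:", ":23B:");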
You may perhaps use the code below to achieve multiple record manipulation using your existing code
// assuming fileText is all the text read from the text file
string[] fileData = fileText.Split('$');
foreach (string allLines in fileData)
{
    // your code goes here
}
Maybe indexing works, but given the particular structure of the format, I highly doubt that is a good solution. But if it works for you, then that's great. You can simply split on $ and then pass each substring into a method. This assures that the index for each substring starts at the beginning of the entry.
However, if you run into a situation where indices are no longer static, then before you even start to write a parser for any format, you need to first understand the format. If you don't have any documentation and are basically reverse engineering it, that's what you need to do. Maybe someone else has specifications. Maybe the source of this data has it somewhere. But I will proceed under the assumption that none of this information is available and you have been given a task with absolutely no support and are expected to reverse-engineer it.
Any format that is meant to be parsed and written by a computer will 9 out of 10 times be well-formed. I'd say 9.9 out of 10 for that matter, since there are cases where people make things unnecessarily complex for the sake of "security".
When I look at your sample data, I see "chunks" of data enclosed within curly braces, as well as nested chunks.
For example, you have things like
{tag1:value1} // simple chunk
{tag2:{tag3: value3}{tag4:value4}} // nested chunk
Multiple transactions are delimited by a $ apparently. You may be able to split on $ signs in this case, but again you need to be sure that the $ is a special character and doesn't appear in tags or values themselves.
Do not be fixated on what a "chunk" is or why I use the term. All you need to know is that there are "tags" and each tag comes with a particular "value".
The value can be anything: a primitive such as a string or number, or it can be another chunk. This suggests that you need to first figure out what type of value each tag accepts. For example, the 1 tag takes a string. The 4 tag takes multiple chunks, possibly representing different companies. There are chunks like DLM that have an empty value.
From these two samples, I would assume that you need to consume each chunk, check the tag, and then parse the value. Since there are nested chunks, you likely need to store them in a particular way to handle them correctly.
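To make that concrete, here is a minimal hedged sketch of walking a well-formed chunk string and yielding (tag, value) pairs; nested chunks stay intact inside the value so you can feed them back in recursively:
using System.Collections.Generic;

// Assumes well-formed input: every '{' has a matching '}' and a ':' after the tag.
static IEnumerable<(string Tag, string Value)> ParseChunks(string s)
{
    int i = 0;
    while (i < s.Length)
    {
        if (s[i] != '{') { i++; continue; } // skip delimiters like $ between chunks
        int colon = s.IndexOf(':', i);
        string tag = s.Substring(i + 1, colon - i - 1);
        int depth = 1, j = colon + 1;
        while (depth > 0) // scan to the matching closing brace
        {
            if (s[j] == '{') depth++;
            else if (s[j] == '}') depth--;
            j++;
        }
        yield return (tag, s.Substring(colon + 1, j - colon - 2));
        i = j;
    }
}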

Most efficient way to process a large csv in .NET

Forgive my noobiness but I just need some guidance and I can't find another question that answers this. I have a fairly large csv file (~300k rows) and I need to determine for a given input, whether any line in the csv begins with that input. I have sorted the csv alphabetically, but I don't know:
1) how to process the rows in the csv- should I read it in as a list/collection, or use OLEDB, or an embedded database or something else?
2) how to find something efficiently from an alphabetical list (using the fact that it's sorted to speed things up, rather than searching the whole list)
You don't give enough specifics to give you a concrete answer but...
IF the CSV file changes often then use OLEDB and just change the SQL query based on your input.
string sql = @"SELECT * FROM [" + fileName + "] WHERE Column1 LIKE 'blah%'";
using (OleDbConnection connection = new OleDbConnection(
    @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + fileDirectoryPath +
    ";Extended Properties=\"Text;HDR=" + hasHeaderRow + "\""))
IF the CSV file doesn't change often and you run a lot of "queries" against it, load it once into memory and quickly search it each time.
IF you want your search to be an exact match on a column use a Dictionary where the key is the column you want to match on and the value is the row data.
Dictionary<long, string> Rows = new Dictionary<long, string>();
...
if (Rows.ContainsKey(search)) ...
IF you want your search to be a partial match like StartsWith then have 1 array containing your searchable data (ie: first column) and another list or array containing your row data. Then use C#'s built in binary search http://msdn.microsoft.com/en-us/library/2cy9f6wb.aspx
string[] SortedSearchables; // populate with the searchable column (ie: first column), sorted
List<string> SortedRows = new List<string>();
...
string result = null;
int foundIdx = Array.BinarySearch<string>(SortedSearchables, searchTerm);
if (foundIdx < 0)
{
    foundIdx = ~foundIdx;
    if (foundIdx < SortedRows.Count && SortedSearchables[foundIdx].StartsWith(searchTerm))
    {
        result = SortedRows[foundIdx];
    }
}
else
{
    result = SortedRows[foundIdx];
}
NOTE code was written inside the browser window and may contain syntax errors as it wasn't tested.
If you can cache the data in memory, and you only need to search the list on one primary key column, I would recommend storing the data in memory as a Dictionary object. The Dictionary class stores the data as key/value pairs in a hash table. You could use the primary key column as the key in the dictionary, and then use the rest of the columns as the value in the dictionary. Looking up items by key in a hash table is typically very fast.
For instance, you could load the data into a dictionary, like this:
Dictionary<string, string[]> data = new Dictionary<string, string[]>();
// TextFieldParser lives in the Microsoft.VisualBasic.FileIO namespace
using (TextFieldParser parser = new TextFieldParser(@"C:\test.csv"))
{
    parser.TextFieldType = FieldType.Delimited;
    parser.SetDelimiters(",");
    while (!parser.EndOfData)
    {
        try
        {
            string[] fields = parser.ReadFields();
            data[fields[0]] = fields;
        }
        catch (MalformedLineException ex)
        {
            // ...
        }
    }
}
And then you could get the data for any item, like this:
string[] fields = data["key I'm looking for"];
If you're only doing it once per program run, this seems pretty fast. (Updated to use StreamReader instead of FileStream based on comments below)
static string FindRecordBinary(string search, string fileName)
{
    using (StreamReader fs = new StreamReader(fileName))
    {
        long min = 0; // TODO: What about header row?
        long max = fs.BaseStream.Length;
        while (min <= max)
        {
            long mid = (min + max) / 2;
            fs.BaseStream.Position = mid;
            fs.DiscardBufferedData();
            if (mid != 0) fs.ReadLine(); // skip the (probably partial) line we landed in
            string line = fs.ReadLine();
            if (line == null) { min = mid + 1; continue; }
            int compareResult;
            if (line.Length > search.Length)
                compareResult = String.Compare(line, 0, search, 0, search.Length, false);
            else
                compareResult = String.Compare(line, search);
            if (0 == compareResult) return line;
            else if (compareResult > 0) max = mid - 1;
            else min = mid + 1;
        }
    }
    return null;
}
This runs in 0.007 seconds for a 600,000 record test file that's 50 megs. In comparison a file-scan averages over half a second depending where the record is located. (a 100 fold difference)
Obviously if you do it more than once, caching is going to speed things up. One simple way to do partial caching would be to keep the StreamReader open and re-use it, just reset min and max each time through. This would save you storing 50 megs in memory all the time.
EDIT: Added knaki02's suggested fix.
Given the CSV is sorted - if you can load the entire thing into memory (If the only processing you need to do is a .StartsWith() on each line) - you can use a Binary search to have exceptionally fast searching.
Maybe something like this (NOT TESTED!):
var csv = File.ReadAllLines(@"c:\file.csv").ToList();
var exists = csv.BinarySearch("StringToFind", new StartsWithComparer());
...
public class StartsWithComparer : IComparer<string>
{
    public int Compare(string x, string y)
    {
        if (x.StartsWith(y))
            return 0;
        else
            return x.CompareTo(y);
    }
}
I wrote this quickly for work, could be improved on...
Define the column numbers:
private enum CsvCols
{
    PupilReference = 0,
    PupilName = 1,
    PupilSurname = 2,
    PupilHouse = 3,
    PupilYear = 4,
}
Define the Model
public class ImportModel
{
    public string PupilReference { get; set; }
    public string PupilName { get; set; }
    public string PupilSurname { get; set; }
    public string PupilHouse { get; set; }
    public string PupilYear { get; set; }
}
Import and populate a list of models:
var rows = File.ReadLines(csvfilePath).Select(p => p.Split(',')).Skip(1).ToArray();
var pupils = rows.Select(x => new ImportModel
{
    PupilReference = x[(int)CsvCols.PupilReference],
    PupilName = x[(int)CsvCols.PupilName],
    PupilSurname = x[(int)CsvCols.PupilSurname],
    PupilHouse = x[(int)CsvCols.PupilHouse],
    PupilYear = x[(int)CsvCols.PupilYear],
}).ToList();
This returns a list of strongly typed objects.
If your file is in memory (for example because you did sorting) and you keep it as an array of strings (lines) then you can use a simple bisection search method. You can start with the code on this question on CodeReview, just change the comparer to work with string instead of int and to check only the beginning of each line.
If you have to re-read the file each time because it may be changed or it's saved/sorted by another program then the most simple algorithm is the best one:
using (var stream = File.OpenText(path))
{
    string line;
    while ((line = stream.ReadLine()) != null)
    {
        // Replace this with your comparison / CSV splitting
        if (line.StartsWith("..."))
        {
            // The file contains a line with the required input
        }
    }
}
Of course you may read the entire file into memory (to use LINQ or List<T>.BinarySearch()) each time, but this is far from optimal (you'll read everything even if you may need to examine just a few lines), and the file itself could even be too large.
If you really need something more and you do not have your file in memory because of sorting (but you should profile your actual performance compared to your requirements) you have to implement a better search algorithm, for example the Boyer-Moore algorithm.
The OP stated they really just need to search based on each line.
The question is then whether to hold the lines in memory or not.
If each line is 1 KB, that's roughly 300 MB of memory.
If each line is 1 MB, that's roughly 300 GB of memory.
StreamReader.ReadLine will have a low memory profile.
Since the file is sorted, you can stop looking once a line compares greater than the search term.
If you hold it in memory, then a simple
List<String>
with LINQ will work (see the sketch below).
LINQ is not smart enough to take advantage of the sort, but against 300K lines it would still be pretty fast.
BinarySearch will take advantage of the sort.
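A hedged sketch of the LINQ option, using the sort order to stop early (path and prefix are illustrative names):
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Assumes the file is sorted in ordinal order and fits in memory.
List<string> lines = File.ReadLines(path).ToList();

// TakeWhile stops the scan as soon as a line sorts past the prefix region.
bool exists = lines
    .TakeWhile(l => string.CompareOrdinal(l, prefix) < 0 || l.StartsWith(prefix))
    .Any(l => l.StartsWith(prefix));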
Try the free CSV Reader. No need to reinvent the wheel over and over again ;)
1) If you do not need to store the results, just iterate through the CSV - handle each line and forget it. If you need to process all lines again and again, store them in a List or Dictionary (with a good key, of course).
2) Try generic extension methods like these:
var list = new List<string>() { "a", "b", "c" };
string oneA = list.FirstOrDefault(entry => !string.IsNullOrEmpty(entry) && entry.ToLowerInvariant().StartsWith("a"));
IEnumerable<string> allAs = list.Where(entry => !string.IsNullOrEmpty(entry) && entry.ToLowerInvariant().StartsWith("a"));
Here is my VB.net Code. It is for a Quote Qualified CSV, so for a regular CSV, change Let n = P.Split(New Char() {""","""}) to Let n = P.Split(New Char() {","})
Dim path As String = "C:\linqpad\Patient.txt"
Dim pat = System.IO.File.ReadAllLines(path)
Dim Patz = From P In pat _
           Let n = P.Split(New Char() {""","""}) _
           Order By n(5) _
           Select New With {
               .Doc = n(1), _
               .Loc = n(3), _
               .Chart = n(5), _
               .PatientID = n(31), _
               .Title = n(13), _
               .FirstName = n(9), _
               .MiddleName = n(11), _
               .LastName = n(7), _
               .StatusID = n(41) _
           }
Patz.Dump
Normally I would recommend finding a dedicated CSV parser (like this or this). However, I noticed this line in your question:
I need to determine for a given input, whether any line in the csv begins with that input.
That tells me that computer time spent parsing CSV data before this is determined is time wasted. You just need code to simply match text against text, and you can do that via a string comparison as easily as anything else.
Additionally, you mention that the data is sorted. This should allow you speed things up tremendously... but you need to be aware that to take advantage of this you will need to write your own code to make seek calls on low-level file streams. This will be by far your best performing result, but it will also by far require the most initial work and maintenance.
I recommend an engineering based approach, where you set a performance goal, build something relatively simple, and measure the results against that goal. In particular, start with the 2nd link I posted above. The CSV reader there will only load one record into memory at a time, so it should perform reasonably well, and it's easy to get started with. Build something that uses that reader, and measure the results. If they meet your goal, then stop there.
If they don't meet your goal, adapt the code from the link so that as you read each line you first do a string comparison (before bothering to parse the csv data), and only do the work to parse csv for the lines that match. This should perform better, but only do the work if the first option does not meet your goals. When this is ready, measure the performance again.
Finally, if you still don't meet the performance goal, we're into the territory of writing low-level code to do a binary search on your file stream using seek calls. This is likely the best you'll be able to do, performance-wise, but it will be very messy and bug-prone code to write, and so you only want to go here if you absolutely do not meet your goals from earlier steps.
Remember, performance is a feature, and just like any other feature you need to evaluate how you build for that feature relative to real design goals. "As fast as possible" is not a reasonable design goal. Something like "respond to a user search within .25 seconds" is a real design goal, and if the simpler but slower code still meets that goal, you need to stop there.

String Builder vs Lists

I am reading in multiple files with millions of lines, and I am creating a list of all line numbers that have a specific issue - for example, if a specific field is left blank or contains an invalid value.
So my question is: what would be the most efficient data type to keep track of a list of numbers that could run upwards of a million rows? Would a StringBuilder, a List, or something else be more efficient?
My end goal is to output a message like "Specific field is blank on 1-32, 40, 45, 47, 49-51", etc. So in the case of a StringBuilder, I would check the previous value, and if it is only 1 more, I would change it from 1 to 1-2; if it was more than one more, I would separate it with a comma. With the List, I would just add each number to the list and then combine them once the file has been completely read. However, in this case I could have multiple lists containing millions of numbers.
Here is the current code I am using to combine a list of numbers using StringBuilder:
string currentLine = sbCurrentLineNumbers.ToString();
string currentLineSub;
StringBuilder subCurrentLine = new StringBuilder();
StringBuilder subCurrentLineSub = new StringBuilder();
int indexLastSpace = currentLine.LastIndexOf(' ');
int indexLastDash = currentLine.LastIndexOf('-');
int currentStringInt = 0;

if (sbCurrentLineNumbers.Length == 0)
{
    sbCurrentLineNumbers.Append(lineCount);
}
else if (indexLastSpace == -1 && indexLastDash == -1)
{
    currentStringInt = Convert.ToInt32(currentLine);
    if (currentStringInt == lineCount - 1)
        sbCurrentLineNumbers.Append("-" + lineCount);
    else
    {
        sbCurrentLineNumbers.Append(", " + lineCount);
        commaCounter++;
    }
}
else if (indexLastSpace > indexLastDash)
{
    currentLineSub = currentLine.Substring(indexLastSpace);
    currentStringInt = Convert.ToInt32(currentLineSub);
    if (currentStringInt == lineCount - 1)
        sbCurrentLineNumbers.Append("-" + lineCount);
    else
    {
        sbCurrentLineNumbers.Append(", " + lineCount);
        commaCounter++;
    }
}
else if (indexLastSpace < indexLastDash)
{
    currentLineSub = currentLine.Substring(indexLastDash + 1);
    currentStringInt = Convert.ToInt32(currentLineSub);
    string charOld = currentLineSub;
    string charNew = lineCount.ToString();
    if (currentStringInt == lineCount - 1)
        sbCurrentLineNumbers.Replace(charOld, charNew);
    else
    {
        sbCurrentLineNumbers.Append(", " + lineCount);
        commaCounter++;
    }
}
My end goal is to output a message like "Specific field is blank on 1-32, 40, 45, 47, 49-51".
If that's the end goal, no point in going through an intermediary representation such as a List<int> - just go with a StringBuilder. You will save on memory and CPU that way.
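A minimal sketch of that approach - tracking the current run directly and appending only when it breaks, instead of re-parsing the StringBuilder's tail each time (the RangeBuilder name is mine):
using System.Text;

// Collects line numbers and renders consecutive runs as "a-b".
class RangeBuilder
{
    private readonly StringBuilder sb = new StringBuilder();
    private int runStart = -1, runEnd = -1;

    public void Add(int line)
    {
        if (runStart == -1) { runStart = runEnd = line; return; }
        if (line == runEnd + 1) { runEnd = line; return; } // run continues
        Flush();                                           // run broke: emit it
        runStart = runEnd = line;
    }

    private void Flush()
    {
        if (sb.Length > 0) sb.Append(", ");
        sb.Append(runStart == runEnd ? runStart.ToString() : runStart + "-" + runEnd);
    }

    public override string ToString()
    {
        if (runStart != -1) { Flush(); runStart = -1; } // emit the final run
        return sb.ToString();
    }
}

// usage: adding 1..32, 40, 45 then calling ToString() gives "1-32, 40, 45"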
StringBuilder serves your purpose so stick with that, if you ever need the line numbers you can easily change the code then.
Depends on how you can / want to break the code up.
Given you are reading it in line order, I'm not sure you need a list at all.
Your current desired output implies that you can't output anything until the file is completely scanned. The size of the file suggests a one-pass analysis phase would be a good idea as well, given you are going to use buffered input as opposed to reading the entire thing into memory.
I'd be tempted to use an enum to describe the issue (e.g. "Field ??? is blank") and then use that as the key to a dictionary of StringBuilders.
Just a first thought, anyway.
Is your output supposed to be human readable? If so, you'll hit the limit of what is reasonable to read, long before you have any performance/memory issues from your data structure. Use whatever is easiest for you to work with.
If the output is supposed to be machine readable, then that output might suggest an appropriate data structure.
As others have pointed out, I would probably use StringBuilder. The List may have to resize many times; the newer implementation of StringBuilder does not have to resize.

Fastest way to check a List<T> for a date

I have a list of dates that a machine has worked on, but it doesn't include the dates the machine was down. I need to create a list of days worked and not worked. I am not sure of the best way to do this. I have started by incrementing through all the days of the range and checking whether each date is in the list, iterating through the entire list each time. I am looking for a more efficient means of finding the dates.
class machineday
{
    DateTime WorkingDay;
}

class machinedaycollection : List<machineday>
{
    public List<machineday> GetAllByCat(string cat)
    {
        _CategoryCode = cat;
        List<machineday> li = this.FindAll(delegate(machineday dummy) { return true; });
        li.Sort(sortDate);
        return li;
    }

    int sortDate(machineday event1, machineday event2)
    {
        int returnValue = -1;
        if (event2.WorkingDay < event1.WorkingDay)
        {
            returnValue = 0;
        }
        else if (event2.WorkingDay == event1.WorkingDay)
        {
            // descending
            returnValue = event1.WorkingDay.CompareTo(event2.WorkingDay);
        }
        return returnValue;
    }
}
Sort the dates, then iterate the resulting list while incrementing a counter in parallel. Whenever the counter does not match the current list element, you've found a date missing from the list.
List<DateTime> days = ...;
days.Sort();
DateTime dt = days[0].Date;
for (int i = 0; i < days.Count; dt = dt.AddDays(1))
{
    if (dt == days[i].Date)
    {
        Console.WriteLine("Worked: {0}", dt);
        i++;
    }
    else
    {
        Console.WriteLine("Not Worked: {0}", dt);
    }
}
(This assumes there are no duplicate days in the list.)
Build a list of valid dates and subtract your machine day collection from it using LINQ's Enumerable.Except extension method. Something like this:
IEnumerable<DateTime> dates = get_candidate_dates();
var holidays = dates.Except(machinedays.Select(m => m.WorkingDay));
The get_candidate_dates() method could even be an iterator that generates all dates within a range on the fly, rather than a pre-stored list of all dates.
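A hedged sketch of such an iterator (the range arguments are my assumption):
using System;
using System.Collections.Generic;

// Lazily generates every date in [start, end] instead of pre-storing them.
static IEnumerable<DateTime> get_candidate_dates(DateTime start, DateTime end)
{
    for (DateTime d = start.Date; d <= end.Date; d = d.AddDays(1))
        yield return d;
}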
Enumerable's methods are reasonably smart and will usually do a decent job on the performance side of things, but if you want the fastest possible algorithm, it will depend on how you plan to consume the result.
Sorry dudes, but I don't much like your solutions.
I think you should create a hash table of your dates. You can build it by iterating over the working days only once.
Then you iterate over the full range of days, and for each one you ask the hash table whether the date is there or not, using
myHashTable.ContainsKey(day); // this is efficient
Simple, elegant and fast.
I think your solution takes exponential time; this one is linear or logarithmic (which is actually a good thing).
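A minimal sketch of that idea using the BCL's HashSet<DateTime> (machineDays and the range bounds are assumptions):
using System;
using System.Collections.Generic;

// Build the set once (O(n)); each Contains lookup is then O(1).
var worked = new HashSet<DateTime>();
foreach (var md in machineDays)
    worked.Add(md.WorkingDay.Date);

for (DateTime d = rangeStart; d <= rangeEnd; d = d.AddDays(1))
    Console.WriteLine((worked.Contains(d) ? "Worked: {0}" : "Not worked: {0}"), d);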
Assuming the list is sorted and the machine was "working" most of the time, you may be able to avoid iterating through all the dates by checking the dates in chunks and skipping over chunks that contain no gaps. Something like this (you'll need to clean it up):
int chunksize = 60; // adjust depending on data
machineday currentDay = myMachinedaycollection[0];
for (int i = 0; i < myMachinedaycollection.Count; i += chunksize)
{
    if (currentDay.WorkingDay.AddDays(chunksize) != myMachinedaycollection[i + chunksize].WorkingDay)
    {
        // write code to iterate through the current chunk and get all the non-working days
    }
    currentDay = myMachinedaycollection[i + chunksize];
}
I doubt you want a list of days working and not working.
Your question title suggests that you want to know whether the system was up on a particular date. It also seems reasonable to calculate % uptime. Neither of these requires building a list of all time instants in the interval.
Sort the service times. For the first question, do a BinarySearch for the date you care about and check whether the preceding entry was the system being taken offline for maintenance or put back into service. For % uptime, take the (down for maintenance, service restored) pairs, use subtraction to find the duration of each maintenance window, and add these up. Then use subtraction to find the length of the total interval.
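A rough sketch of that % uptime arithmetic, assuming the maintenance log has already been paired up into (down, restored) intervals:
using System;
using System.Collections.Generic;

// Sum each maintenance window, then compare against the whole interval.
static double PercentUptime(List<(DateTime DownAt, DateTime UpAt)> outages,
                            DateTime start, DateTime end)
{
    TimeSpan down = TimeSpan.Zero;
    foreach (var o in outages)
        down += o.UpAt - o.DownAt;
    TimeSpan total = end - start;
    return 100.0 * (total - down).Ticks / total.Ticks;
}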
If your question didn't actually mean you were keeping track of maintenance intervals (or equivalently usage intervals) then you can ignore this answer.
