Using C# to read from a text file

I am reading from a text file using the code below.
if (!allLines.Contains(":70"))
{
    var firstIndex = allLines.IndexOf(":20");
    var secondIndex = allLines.IndexOf(":23B");
    var thirdIndex = allLines.IndexOf(":59");
    var fourthIndex = allLines.IndexOf(":71A");
    var fifthIndex = allLines.IndexOf(":72");
    var sixthIndex = allLines.IndexOf("-}");

    var firstValue = allLines.Substring(firstIndex + 4, secondIndex - firstIndex - 5).TrimEnd();
    var secondValue = allLines.Substring(thirdIndex + 4, fourthIndex - thirdIndex - 5).TrimEnd();
    var thirdValue = allLines.Substring(fifthIndex + 4, sixthIndex - fifthIndex - 5).TrimEnd();

    var len1 = firstValue.Length;
    var len2 = secondValue.Length;
    var len3 = thirdValue.Length;

    inflow103.REFERENCE = firstValue.TrimEnd();
    pointer = 1;
    inflow103.BENEFICIARY_CUSTOMER = secondValue;
    inflow103.RECEIVER_INFORMATION = thirdValue;
}
else if (allLines.Contains(":70"))
{
    var firstIndex = allLines.IndexOf(":20");
    var secondIndex = allLines.IndexOf(":23B");
    var thirdIndex = allLines.IndexOf(":59");
    var fourthIndex = allLines.IndexOf(":70");
    var fifthIndex = allLines.IndexOf(":71");
    var sixthIndex = allLines.IndexOf(":72");
    var seventhIndex = allLines.IndexOf("-}");

    var firstValue = allLines.Substring(firstIndex + 4, secondIndex - firstIndex - 5).TrimEnd();
    var secondValue = allLines.Substring(thirdIndex + 5, fourthIndex - thirdIndex - 5).TrimEnd();
    var thirdValue = allLines.Substring(sixthIndex + 4, seventhIndex - sixthIndex - 5).TrimEnd();

    var len1 = firstValue.Length;
    var len2 = secondValue.Length;
    var len3 = thirdValue.Length;

    inflow103.REFERENCE = firstValue.TrimEnd();
    pointer = 1;
    inflow103.BENEFICIARY_CUSTOMER = secondValue;
    inflow103.RECEIVER_INFORMATION = thirdValue;
}
Below is the format of the text file I am reading.
{1:F21DBLNNGLAAXXX4695300820}{4:{177:1405260906}{451:0}}{1:F01DBLNNGLAAXXX4695300820}{2:O1030859140526SBICNGLXAXXX74790400761405260900N}{3:{103:NGR}{108:AB8144573}{115:3323774}}{4:
:20:SBICNG958839-2
:23B:CRED
:23E:SDVA
:32A:140526NGN168000000,
:50K:IHS PLC
:53A:/3000025296
SBICNGLXXXX
:57A:/3000024426
DBLNNGLA
:59:/0040186345
SONORA CAPITAL AND INVSTMENT LTD
:71A:OUR
:72:/CODTYPTR/001
-}{5:{MAC:00000000}{PAC:00000000}{CHK:42D0D867739F}}{S:{SPD:}{SAC:}{FAC:}{COP:P}}
The above file format represents one transaction in a single text file, but while testing with live files I came across a situation where a file can have more than one transaction. An example is shown below.
{1:F21DBLNNGLAAXXX4694300150}{4:{177:1405231923}{451:0}}{1:F01DBLNNGLAAXXX4694300150}{2:O1031656140523FCMBNGLAAXXX17087957771405231916N}{3:{103:NGR}{115:3322817}}{4:
:20:TRONGN3RDB16
:23B:CRED
:23E:SDVA
:26T:001
:32A:140523NGN1634150,00
:50K:/2206117013
SUNLEK INVESTMENT LTD
:53A:/3000024763
FCMBNGLA
:57A:/3000024426
DBLNNGLA
:59:/0022617678
GOLDEN DC INT'L LTD
:71A:OUR
:72:/CODTYPTR/001
//BNF/TRSF
-}{5:{MAC:00000000}{PAC:00000000}{CHK:C21000C4ECBA}{DLM:}}{S:{SPD:}{SAC:}{FAC:}{COP:P}}${1:F21DBLNNGLAAXXX4694300151}{4:{177:1405231923}{451:0}}{1:F01DBLNNGLAAXXX4694300151}{2:O1031656140523FCMBNGLAAXXX17087957781405231916N}{3:{103:NGR}{115:3322818}}{4:
:20:TRONGN3RDB17
:23B:CRED
:23E:SDVA
:26T:001
:32A:140523NGN450000,00
:50K:/2206117013
SUNLEK INVESTMENT LTD
:53A:/3000024763
FCMBNGLA
:57A:/3000024426
DBLNNGLA
:59:/0032501697
SUNSTEEL INDUSTRIES LTD
:71A:OUR
:72:/CODTYPTR/001
//BNF/TRSF
-}{5:{MAC:00000000}{PAC:00000000}{CHK:01C3B7B3CA53}{DLM:}}{S:{SPD:}{SAC:}{FAC:}{COP:P}}
My challenge is that my code locates each field by its index within allLines. When I need to pick up the second transaction from the file, the same tags (and therefore the same kind of indexes) occur again further along in the text. How can I handle this situation?

This is a simple problem obscured by excess code. All you are doing is extracting 3 values from a chunk of text where the precise layout can vary from one chunk to another.
There are 3 things I think you need to do.
Refactor the code. Instead of two hefty if blocks inline, you need functions that extract the required text.
Use regular expressions. A single regular expression can extract the values you need in one line instead of several.
Separate the code from the data. The logic of these two blocks is identical, only the data changes. So write one function and pass in the regular expression(s) needed to extract the data items you need.
Unfortunately this calls for a significant lift in the abstraction level of the code, which may be beyond what you're ready for. However, if you can do this and (say) you have function Extract() with regular expressions as arguments, you can apply that function once, twice or as often as needed to handle variations in your basic transaction.
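For example, a sketch of such an Extract() helper (the patterns shown are illustrative guesses based on the sample above, not a full MT103 grammar):

using System.Text.RegularExpressions;

// Sketch only: returns the first capture group of a pattern, or null if the
// tag is missing from this transaction.
static string Extract(string transaction, string pattern)
{
    var match = Regex.Match(transaction, pattern,
                            RegexOptions.Multiline | RegexOptions.Singleline);
    return match.Success ? match.Groups[1].Value.Trim() : null;
}

// Possible usage (field patterns are assumptions; adjust them to your data):
// var reference    = Extract(transaction, @"^:20:(.+?)\r?$");
// var beneficiary  = Extract(transaction, @"^:59:(.+?)(?=^:\d{2}[A-Z]?:|^-\})");
// var receiverInfo = Extract(transaction, @"^:72:(.+?)(?=^:\d{2}[A-Z]?:|^-\})");

Once the extraction is a function of (text, pattern), applying it per transaction is trivial regardless of how many transactions a file contains.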

You may perhaps use the code below to handle multiple records while keeping your existing parsing code:
// assuming fileText is all the text read from the text file
string[] fileData = fileText.Split('$');
foreach (string allLines in fileData)
{
    // your code goes here
}

Maybe indexing works, but given the particular structure of the format, I highly doubt it is a good solution. But if it works for you, that's great. You can simply split on $ and then pass each substring into a method. This ensures that the indexes for each substring start at the beginning of the entry.
However, if you run into a situation where indices are no longer static, then before you even start to write a parser for any format, you need to first understand the format. If you don't have any documentation and are basically reverse engineering it, that's what you need to do. Maybe someone else has specifications. Maybe the source of this data has it somewhere. But I will proceed under the assumption that none of this information is available and you have been given a task with absolutely no support and are expected to reverse-engineer it.
Any format that is meant to be parsed and written by a computer will, 9 times out of 10, be well-formed. I'd say 9.9 out of 10, for that matter, since there are cases where people make things unnecessarily complex for the sake of "security".
When I look at your sample data, I see "chunks" of data enclosed within curly braces, as well as nested chunks.
For example, you have things like
{tag1:value1} // simple chunk
{tag2:{tag3: value3}{tag4:value4}} // nested chunk
Multiple transactions are delimited by a $ apparently. You may be able to split on $ signs in this case, but again you need to be sure that the $ is a special character and doesn't appear in tags or values themselves.
Do not be fixated on what a "chunk" is or why I use the term. All you need to know is that there are "tags" and each tag comes with a particular "value".
The value can be anything: a primitive such as a string or number, or it can be another chunk. This suggests that you need to first figure out what type of value each tag accepts. For example, the 1 tag takes a string. The 4 tag takes multiple chunks, possibly representing different companies. There are chunks like DLM that have an empty value.
From these two samples, I would assume that you need to consume each chunk, check the tag, and then parse the value. Since there are nested chunks, you likely need to store them in a particular way to handle the nesting correctly.
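To make that concrete, here is a rough sketch of such a parser (my own illustration, assuming well-formed input; Chunk and ParseChunks are made-up names, and this is nowhere near a complete SWIFT parser):

using System.Collections.Generic;

class Chunk
{
    public string Tag;
    public string Value;                         // raw text of the value
    public List<Chunk> Children = new List<Chunk>();
}

static List<Chunk> ParseChunks(string text)
{
    var result = new List<Chunk>();
    int i = 0;
    while (i < text.Length)
    {
        if (text[i] != '{') { i++; continue; }   // skip anything between chunks
        int colon = text.IndexOf(':', i);
        var chunk = new Chunk { Tag = text.Substring(i + 1, colon - i - 1) };

        // find the matching closing brace, honouring nesting
        int depth = 1, j = colon + 1;
        while (depth > 0) { if (text[j] == '{') depth++; else if (text[j] == '}') depth--; j++; }

        chunk.Value = text.Substring(colon + 1, j - colon - 2);
        if (chunk.Value.StartsWith("{"))
            chunk.Children = ParseChunks(chunk.Value);   // nested chunks
        result.Add(chunk);
        i = j;
    }
    return result;
}

With something like this, "give me the value of tag 4 in the second transaction" becomes a lookup rather than a fixed-index substring.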

Related

Fastest way to split a huge text into smaller chunks

I have used the below code to split the string, but it takes a lot of time.
using (StreamReader srSegmentData = new StreamReader(fileNamePath))
{
    string strSegmentData = "";
    string line = srSegmentData.ReadToEnd();
    int startPos = 0;
    ArrayList alSegments = new ArrayList();
    while (startPos < line.Length && (line.Length - startPos) >= segmentSize)
    {
        strSegmentData = strSegmentData + line.Substring(startPos, segmentSize) + Environment.NewLine;
        alSegments.Add(line.Substring(startPos, segmentSize) + Environment.NewLine);
        startPos = startPos + segmentSize;
    }
}
Please suggest an alternative way to split the string into smaller chunks of a fixed size.
First of all, you should define what you mean by chunk size. If you mean chunks with a fixed number of code units, then your actual algorithm may be slow but it works. If that's not what you intend and you actually mean chunks with a fixed number of characters, then it's broken. I discussed a similar issue in this Code Review post: Split a string into chunks of the same length, so I will repeat only the relevant parts here.
You're partitioning over Char, but String is UTF-16 encoded, so you may produce broken strings in at least three cases:
One character is encoded with more than one code unit. The Unicode code point for that character is encoded as two UTF-16 code units (a surrogate pair), and each code unit may end up in a different slice (and both resulting strings will be invalid).
One character is composed of more than one code point. You're dealing with a character made of two separate Unicode code points (for example the Han character 𠀑).
One character has combining characters or modifiers. This is more common than you may think: for example, a Unicode combining character like U+0300 COMBINING GRAVE ACCENT used to build à, or a Unicode modifier such as U+02BC MODIFIER LETTER APOSTROPHE.
The definition of a character for a programming language and for a human being are pretty different; for example, in Slovak dž is a single character, however it's made of 2/3 Unicode code points which are in this case also 2/3 UTF-16 code units, so "dž".Length > 1 (see the short illustration after this list). More about this and other cultural issues in How can I perform a Unicode aware character by character comparison?.
Ligatures exist. Assuming one ligature is one code point (and also assuming it's encoded as one code unit), you will treat it as a single glyph even though it represents two characters. What to do in this case? In general the definition of character can be pretty vague because it has a different meaning depending on the discipline where the word is used. You can't (probably) handle everything correctly, but you should set some constraints and document the code's behavior.
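To see the difference quickly, here is a tiny illustration of my own (using the combining grave accent mentioned above):

using System;
using System.Globalization;

string s = "a\u0300";                                        // "à" built from 'a' + U+0300 COMBINING GRAVE ACCENT
Console.WriteLine(s.Length);                                 // 2 (UTF-16 code units)
Console.WriteLine(new StringInfo(s).LengthInTextElements);   // 1 (one user-perceived character)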
One proposed (and untested) implementation may be this:
public static IEnumerable<string> Split(this string value, int desiredLength)
{
    var characters = StringInfo.GetTextElementEnumerator(value);
    while (characters.MoveNext())
        yield return String.Concat(Take(characters, desiredLength));
}

private static IEnumerable<string> Take(TextElementEnumerator enumerator, int count)
{
    for (int i = 0; i < count; ++i)
    {
        yield return (string)enumerator.Current;

        // Advance only between elements of the same chunk; the outer loop in
        // Split() performs the MoveNext() that starts the next chunk, so
        // advancing here after the last element would skip a text element.
        if (i < count - 1 && !enumerator.MoveNext())
            yield break;
    }
}
It's not optimized for speed (as you can see, I tried to keep the code short and clear using enumerations) but, for big files, it still performs better than your implementation (see the next paragraph for the reason).
About your code, note that:
You're building a huge ArrayList (?!) to hold the result. Also note that in this way you resize the ArrayList multiple times (even though, given the input size and chunk size, its final size is known).
strSegmentData is rebuilt multiple times; if you need to accumulate characters you should use StringBuilder, otherwise each operation will allocate a new string and copy the old value (it's slow and it also adds pressure to the Garbage Collector).
There are faster implementations (see the linked Code Review post, especially Heslacher's implementation, for a much faster version) and, if you do not need to handle Unicode correctly (you're sure you manage only US-ASCII characters), there is also a pretty readable implementation from Jon Skeet (note that, after profiling your code, you may still improve its performance for big files by pre-allocating an output list of the right size). I won't repeat their code here, so please refer to the linked posts.
In your specific case you do not need to read the entire huge file into memory; you can read/parse n characters at a time (don't worry too much about disk access, I/O is buffered). It will slightly degrade performance but it will greatly improve memory usage. Alternatively you can read line by line (managing cross-line chunks yourself).
Below is my analysis of your question and code (read the comments)
using (StreamReader srSegmentData = new StreamReader(fileNamePath))
{
    string strSegmentData = "";
    string line = srSegmentData.ReadToEnd(); // Why are you reading this till the end if it is such a long string?
    int startPos = 0;
    ArrayList alSegments = new ArrayList(); // A better choice would be List<string>
    while (startPos < line.Length && (line.Length - startPos) >= segmentSize)
    {
        strSegmentData = strSegmentData + line.Substring(startPos, segmentSize) + Environment.NewLine; // Seems like you are inserting line breaks at a fixed interval in your original string. Is that what you want?
        alSegments.Add(line.Substring(startPos, segmentSize) + Environment.NewLine); // Why are you recalculating the Substring? And why append the newline if the aim is just to "split"?
        startPos = startPos + segmentSize;
    }
}
Making all kinds of assumptions, below is the code I would recommend for splitting a long string. It is just a clean way of doing what you are doing in the sample. You can optimize it further, but I'm not sure how fast you need it to be.
static void Main(string[] args) {
    string fileNamePath = "ConsoleApplication1.pdb";
    var segmentSize = 32;
    var op = ReadSplit(fileNamePath, segmentSize);
    var joinedString = string.Join(Environment.NewLine, op);
}

static List<string> ReadSplit(string filePath, int segmentSize) {
    var splitOutput = new List<string>();
    using (var file = new StreamReader(filePath, Encoding.UTF8, true, 8 * 1024)) {
        char[] buffer = new char[segmentSize];
        while (!file.EndOfStream) {
            int n = file.ReadBlock(buffer, 0, segmentSize);
            splitOutput.Add(new string(buffer, 0, n));
        }
    }
    return splitOutput;
}
I haven't done any performance tests on my version, but my guess is that it is faster than your version.
Also, I am not sure how you plan to consume the output, but a good optimization when doing I/O is to use async calls. And a good optimization (at the cost of readability and complexity) when handling large strings is to stick with char[].
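For instance, an async variant of the ReadSplit method above might look like this (a sketch only; it assumes .NET 4.5+ for ReadBlockAsync):

using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Threading.Tasks;

static async Task<List<string>> ReadSplitAsync(string filePath, int segmentSize)
{
    var splitOutput = new List<string>();
    using (var file = new StreamReader(filePath, Encoding.UTF8, true, 8 * 1024))
    {
        char[] buffer = new char[segmentSize];
        while (!file.EndOfStream)
        {
            // ReadBlockAsync keeps the calling thread free while the buffered I/O completes
            int n = await file.ReadBlockAsync(buffer, 0, segmentSize);
            splitOutput.Add(new string(buffer, 0, n));
        }
    }
    return splitOutput;
}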
Note that
You might have to deal with Character encoding issues while reading the file
If you already have the long string in memory and file reading was just included in the demo, then you should use the StringReader class instead of the StreamReader class.
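For example, a minimal sketch reusing segmentSize and splitOutput from the snippet above (longString is assumed to already hold the text in memory):

using (var reader = new StringReader(longString))
{
    char[] buffer = new char[segmentSize];
    int n;
    while ((n = reader.ReadBlock(buffer, 0, segmentSize)) > 0)
    {
        splitOutput.Add(new string(buffer, 0, n));
    }
}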

Decoding data read from a .bin file into fields

I was originally going to read the file into an array, pin it to a struct, convert and display it. I have been trying to find another solution (I have removed the original details here to cause less confusion). I have a .bin file and I can correctly identify how many records are in the file by using a simple sum and FileInfo.
I've looked at:
http://www.sixscape.com/joomla/sixscape/images/pdf/C%20Sharp%20Training%20-%20Part%204%20-%20Contact%20Manager%20with%20Random%20Access%20File.pdf
http://www.vbi.org/Items/article.asp?id=16
Files imported have the same structure and look similar to the following (note: the screenshots show the number of rows in each file; I then calculated the rows for the DataGridView table using a sum, from which I produced the following, adding rows to the second column only and populating the data using a for loop):
long Count = 1;
FileInfo Fi = new FileInfo(import.FileName);
long sum = (Fi.Length / 1024) - Count;
for (int i = 0; i < sum; i++)
{
    DataGridView1.Rows.Add(null, Count++);
    ReadWeldRecs(sum); // Added after question was published & called in ReadWeldRecs
}
The first file shows a total of 21 rows and the second 9.
I have a method called DecodeDate and another called ReadWeldRecs; DecodeDate is called from ReadWeldRecs, which is in turn triggered by a button click event. I know what date should be displayed, but the result from DecodeDate is wrong. I want to be able to read the date inside the file. import.FileName is the filename (@Kevin) that has been opened using OpenFileDialog, and the date is stored at position 5 in the file.
My first go:
The Date is displayed as: 22/08/2123
But should be: 21/10/2008
I thought maybe it was an issue with the location, but I'm sure it's position 5.
Update: Turns out I was looking in the wrong location... Duh.
private DateTime DecodeDate(int Start)
{
    int Year = Strings.Asc(Strings.Mid(import.FileName, Start, 1)) + 2000;
    int Month = Strings.Asc(Strings.Mid(import.FileName, Start + 1, 1));
    int Day = Strings.Asc(Strings.Mid(import.FileName, Start + 2, 1));
    return DateAndTime.DateSerial(Year, Month, Day);
}
Original:
This is the original VB code, which worked fine in the outdated program (I looked at this mostly to reconstruct the DecodeDate method in C#):
Public Function DecodeDate(Start As Integer) As Date
    YYear = Asc(Mid$(ImportRecord, Start, 1)) + 2000
    MMonth = Asc(Mid$(ImportRecord, Start + 1, 1))
    DDay = Asc(Mid$(ImportRecord, Start + 2, 1))
    DecodeDate = DateSerial(YYear, MMonth, DDay)
End Function
ImportRecord is defined as the following: (global string)
Open ImportFileName For Random As #1 Len = Len(ImportRecord)
' ...
Get #1, Index + 1, ImportRecord
' ...
.Date = DecodeDate(5)
Current:
private void ReadWeldRecs(long RecordNumber)
{
    byte[] Rec = new byte[1024];
    using (FileStream Fs = new FileStream(import.FileName, FileMode.Open, FileAccess.Read))
    using (BinaryReader Br = new BinaryReader(Fs))
    {
        int Rec_Len;
        RecordNumber = 0; // Start with Record 0
        while (true)
        {
            Fs.Seek(RecordNumber * 1024, SeekOrigin.Begin); // Position file to record
            Rec_Len = Br.Read(Rec, 0, 1024); // Read the record here
            if (Rec_Len == 0) // If reached end of file, end loop
            {
                break;
            }
            if (Rec[0] != 0) // If there is a record, let's display it
            {
                Label_Date1.Text = DecodeDate(Rec, 28).ToShortDateString();
            }
            RecordNumber++; // Move on to the next record
        }
        Br.Close();
        Fs.Close();
    }
}
Plus #Kevin's updated solution :)
However, while this has resolved a major issue, I still have another one: I am trying to follow the guidelines and template of @Kevin's solution for my other method, DecodeString.
In VB:
Public Function DecodeString(Start As Integer, Length As Integer) As String
    Dim Count As Integer
    Dummy = Mid(ImportRecord, Start, Length)
    For Count = 1 To Len(Dummy)
        If (Mid$(Dummy, Count, 1) = Chr$(0)) Or (Mid$(Dummy, Count, 1) = Chr$(255)) Then
            Mid$(Dummy, Count, 1) = Chr$(32)
        End If
    Next
    DecodeString = Trim(Dummy)
End Function
Again, note I'm looking at using the solution as a template for this
A couple of things...
Is it C# or VB you are looking for? It's tagged as C#, but Strings and DateAndTime are VB.
Are you just parsing the date out of a string that is the filename?
Is the date value really represented as the ASCII value of the characters? Really? That seems extremely odd... It means that to get year = 8, month = 10, day = 20 your filename would have to contain something like abcde[BackspaceCharacter][LinefeedCharacter][Device control 4]. I definitely don't think you have control characters in there, unless of course binary data was stuffed into a string in the first place.
I'm going to assume the answers to these questions are... C#, Yes, and No. It would be easy to convert from C# -> VB if that's an incorrect assumption. I don't think I've used anything specific to C# here. Note I've spread this out onto several lines for readability on the forum.
private DateTime DecodeDate(int Start)
{
    // If your filename is 'ABC-01234 2008-10-21#012345.bin' then Start should = 10
    DateTime date;
    var datePart = filename.Substring(Start, 10);
    var culture = CultureInfo.InvariantCulture;
    var style = DateTimeStyles.None;

    // note since you are telling it the specific format of your date string
    // the culture info is not relevant. Also note that "yyyy-MM-dd" below can be
    // changed to fit your parsing needs: yyyy = 4 digit year, MM = 2 digit month
    // and dd = 2 digit day.
    if (!DateTime.TryParseExact(datePart, "yyyy-MM-dd", culture, style, out date))
    {
        // TryParseExact returns false if it couldn't parse the date.
        // Since it failed to properly parse the date... do something
        // here like throw an exception.
        throw new Exception("Unable to parse the date!");
    }
    return date;
}
After looking up DateAndTime, there is another way for you to get the numbers you are seeing without there being control characters in the filename... but in that case I still don't think you want to use DateAndTime.DateSerial, since it only obfuscates the error (if you accidentally give it 25 for the month, the date it returns has January for the month and adds 2 years to the year portion).
If this doesn't solve it... give the format of your filename and it will be easier to figure out what exactly the problem is.
EDIT: Based on your updates... it looks like you are trying to parse the binary data...
An easier way to do this is with the .NET framework methods on BinaryReader, like BinaryReader.ReadByte, ReadDouble, etc. (see the sketch after the code below).
For your code, though, you'll need to pass in the byte array so you can pull the values out of it. Style-wise, you want to limit or eliminate the usage of global variables...
private DateTime DecodeDate(byte[] bytes, int start)
{
    // Note that Length is 1 based and the indexer is 0 based, so
    if (bytes.Length < start + 3)
        throw new Exception("Byte array wasn't long enough");

    int year = Convert.ToInt32(bytes[start]) + 2000;
    int month = Convert.ToInt32(bytes[start + 1]);
    int day = Convert.ToInt32(bytes[start + 2]);
    return new DateTime(year, month, day);
}
And then your call to it looks like this...
Label_Date1.Text = DecodeDate(Rec, 5).ToShortDateString();
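If you instead go the BinaryReader route mentioned above, a sketch could look like this (assuming the same layout: year, month and day stored as three consecutive bytes, with the year offset from 2000; the method and parameter names are mine, not part of your program):

private static DateTime ReadDate(BinaryReader reader, long recordOffset, int fieldOffset)
{
    // jump straight to the date field instead of copying the whole record into a byte array first
    reader.BaseStream.Seek(recordOffset + fieldOffset, SeekOrigin.Begin);
    int year = reader.ReadByte() + 2000;
    int month = reader.ReadByte();
    int day = reader.ReadByte();
    return new DateTime(year, month, day);
}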
I wanted to mention a couple of style pointers... These are as much personal preference as anything but I find them helpful.
Don't use globals, or severely limit usage of them - import is used in your method but is defined somewhere else - this makes it harder to see what your code is doing and harder for you to troubleshoot it.
Name local variables starting in lowerCase in order to make it easier to see what is a method and what is a variable.
Be VERY careful with while(true) - it's better if you can re-write it to make it clearer when you exit the loop.
ONE MORE THING! You said your date was at position 5. In VB arrays generally start at 1, so the 5th element is array index 5... in C# arrays start at 0, so the 5th element would be index 4. I can't tell which you should be checking from your code since I don't know the structure of the data, so just keep this in mind.
EDIT 2:
For DecodeString it's both simple and hard... the code is very simple... however... strings are MUCH more complex than you would think. To convert from string to binary or vice versa you have to choose an encoding. An encoding is an algorithm for converting characters into binary data. Their names will be familiar to you: ASCII, UTF-8, Unicode, etc. Part of the problem is that some programming languages hide encodings from you, and much of the time programmers can be blissfully ignorant of them. In your case it seems like your binary file is written in ASCII, but it's impossible for me to know that... so my code below works if it's in ASCII... if you run into cases where your string isn't decoded properly, you may have to change that.
private static string DecodeString(byte[] bytes, int start, int length)
{
    return Encoding.ASCII.GetString(bytes, start, length);
}
Bonus reading - Joel Spolsky's post on Encodings

Most efficient way to process a large csv in .NET

Forgive my noobiness, but I just need some guidance and I can't find another question that answers this. I have a fairly large csv file (~300k rows) and I need to determine, for a given input, whether any line in the csv begins with that input. I have sorted the csv alphabetically, but I don't know:
1) how to process the rows in the csv- should I read it in as a list/collection, or use OLEDB, or an embedded database or something else?
2) how to find something efficiently from an alphabetical list (using the fact that it's sorted to speed things up, rather than searching the whole list)
You don't give enough specifics for a concrete answer, but...
IF the CSV file changes often then use OLEDB and just change the SQL query based on your input.
string sql = #"SELECT * FROM [" + fileName + "] WHERE Column1 LIKE 'blah%'";
using(OleDbConnection connection = new OleDbConnection(
#"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + fileDirectoryPath +
";Extended Properties=\"Text;HDR=" + hasHeaderRow + "\""))
IF the CSV file doesn't change often and you run a lot of "queries" against it, load it once into memory and quickly search it each time.
IF you want your search to be an exact match on a column use a Dictionary where the key is the column you want to match on and the value is the row data.
Dictionary<long, string> Rows = new Dictionary<long, string>();
...
if (Rows.ContainsKey(search)) ...
IF you want your search to be a partial match like StartsWith then have 1 array containing your searchable data (ie: first column) and another list or array containing your row data. Then use C#'s built in binary search http://msdn.microsoft.com/en-us/library/2cy9f6wb.aspx
string[] SortedSearchables;          // populate (and sort) this from your first column
List<string> SortedRows = new List<string>();
...
string result = null;
int foundIdx = Array.BinarySearch<string>(SortedSearchables, searchTerm);
if (foundIdx < 0) {
    foundIdx = ~foundIdx;
    if (foundIdx < SortedRows.Count && SortedSearchables[foundIdx].StartsWith(searchTerm)) {
        result = SortedRows[foundIdx];
    }
} else {
    result = SortedRows[foundIdx];
}
NOTE code was written inside the browser window and may contain syntax errors as it wasn't tested.
If you can cache the data in memory, and you only need to search the list on one primary key column, I would recommend storing the data in memory as a Dictionary object. The Dictionary class stores the data as key/value pairs in a hash table. You could use the primary key column as the key in the dictionary, and then use the rest of the columns as the value in the dictionary. Looking up items by key in a hash table is typically very fast.
For instance, you could load the data into a dictionary, like this:
Dictionary<string, string[]> data = new Dictionary<string, string[]>();
using (TextFieldParser parser = new TextFieldParser(@"C:\test.csv"))
{
    parser.TextFieldType = FieldType.Delimited;
    parser.SetDelimiters(",");
    while (!parser.EndOfData)
    {
        try
        {
            string[] fields = parser.ReadFields();
            data[fields[0]] = fields;
        }
        catch (MalformedLineException ex)
        {
            // ...
        }
    }
}
And then you could get the data for any item, like this:
string[] fields = data["key I'm looking for"];
If you're only doing it once per program run, this seems pretty fast. (Updated to use StreamReader instead of FileStream based on comments below)
static string FindRecordBinary(string search, string fileName)
{
    using (StreamReader fs = new StreamReader(fileName))
    {
        long min = 0; // TODO: What about header row?
        long max = fs.BaseStream.Length;
        while (min <= max)
        {
            long mid = (min + max) / 2;
            fs.BaseStream.Position = mid;
            fs.DiscardBufferedData();
            if (mid != 0) fs.ReadLine();
            string line = fs.ReadLine();
            if (line == null) { min = mid + 1; continue; }

            int compareResult;
            if (line.Length > search.Length)
                compareResult = String.Compare(
                    line, 0, search, 0, search.Length, false);
            else
                compareResult = String.Compare(line, search);

            if (0 == compareResult) return line;
            else if (compareResult > 0) max = mid - 1;
            else min = mid + 1;
        }
    }
    return null;
}
This runs in 0.007 seconds for a 600,000-record test file that's 50 megs. In comparison, a file scan averages over half a second depending on where the record is located (a 100-fold difference).
Obviously if you do it more than once, caching will speed things up. One simple way to do partial caching would be to keep the StreamReader open and re-use it, just resetting min and max each time through. This would save you from storing 50 megs in memory all the time.
EDIT: Added knaki02's suggested fix.
Given the CSV is sorted - if you can load the entire thing into memory (if the only processing you need to do is a .StartsWith() on each line) - you can use a binary search to get exceptionally fast searching.
Maybe something like this (NOT TESTED!):
var csv = File.ReadAllLines(@"c:\file.csv").ToList();
var exists = csv.BinarySearch("StringToFind", new StartsWithComparer());

...

public class StartsWithComparer : IComparer<string>
{
    public int Compare(string x, string y)
    {
        if (x.StartsWith(y))
            return 0;
        else
            return x.CompareTo(y);
    }
}
I wrote this quickly for work; it could be improved on...
Define the column numbers:
private enum CsvCols
{
    PupilReference = 0,
    PupilName = 1,
    PupilSurname = 2,
    PupilHouse = 3,
    PupilYear = 4,
}
Define the Model
public class ImportModel
{
    public string PupilReference { get; set; }
    public string PupilName { get; set; }
    public string PupilSurname { get; set; }
    public string PupilHouse { get; set; }
    public string PupilYear { get; set; }
}
Import and populate a list of models:
var rows = File.ReadLines(csvfilePath).Select(p => p.Split(',')).Skip(1).ToArray();

var pupils = rows.Select(x => new ImportModel
{
    PupilReference = x[(int) CsvCols.PupilReference],
    PupilName = x[(int) CsvCols.PupilName],
    PupilSurname = x[(int) CsvCols.PupilSurname],
    PupilHouse = x[(int) CsvCols.PupilHouse],
    PupilYear = x[(int) CsvCols.PupilYear],
}).ToList();
Returns you a list of strongly typed objects
If your file is in memory (for example because you did the sorting) and you keep it as an array of strings (lines), then you can use a simple bisection search. You can start with the code from this question on Code Review; just change the comparer to work with string instead of int and to check only the beginning of each line.
If you have to re-read the file each time, because it may be changed or it's saved/sorted by another program, then the simplest algorithm is the best one:
using (var stream = File.OpenText(path))
{
    string line;
    while ((line = stream.ReadLine()) != null)
    {
        // Replace this with your comparison / CSV splitting
        if (line.StartsWith("..."))
        {
            // The file contains the line with the required input
            break;
        }
    }
}
Of course you may read the entire file into memory (to use LINQ or List<T>.BinarySearch()) each time, but this is far from optimal (you'll read everything even if you may need to examine just a few lines) and the file itself could even be too large.
If you really need something more, and you do not have your file in memory because of the sorting (but you should profile your actual performance against your requirements), you have to implement a better search algorithm, for example the Boyer-Moore algorithm.
The OP stated they really just need to search based on the line.
The question is then whether to hold the lines in memory or not.
If a line is 1 KB, then that's roughly 300 MB of memory.
If a line is 1 MB, then that's roughly 300 GB of memory.
StreamReader.ReadLine will have a low memory profile.
Since the file is sorted, you can stop looking once the current line is greater than the search term.
If you hold it in memory, then a simple
List<String>
with LINQ will work.
LINQ is not smart enough to take advantage of the sort, but against 300K lines it would still be pretty fast.
BinarySearch will take advantage of the sort.
Try the free CSV Reader. No need to reinvent the wheel over and over again ;)
1) If you do not need to store the results, just iterate through the CSV - handle each line and forget it. If you need to process all lines again and again, store them in a List or Dictionary (with a good key, of course).
2) Try the generic extension methods like this
var list = new List<string>() { "a", "b", "c" };

string oneA = list.FirstOrDefault(entry =>
    !string.IsNullOrEmpty(entry) && entry.ToLowerInvariant().StartsWith("a"));

IEnumerable<string> allAs = list.Where(entry =>
    !string.IsNullOrEmpty(entry) && entry.ToLowerInvariant().StartsWith("a"));
Here is my VB.NET code. It is for a quote-qualified CSV, so for a regular CSV, change Let n = P.Split(New Char() {""","""}) to Let n = P.Split(New Char() {","}).
Dim path As String = "C:\linqpad\Patient.txt"
Dim pat = System.IO.File.ReadAllLines(path)
Dim Patz = From P In pat _
           Let n = P.Split(New Char() {""","""}) _
           Order By n(5) _
           Select New With {
               .Doc = n(1), _
               .Loc = n(3), _
               .Chart = n(5), _
               .PatientID = n(31), _
               .Title = n(13), _
               .FirstName = n(9), _
               .MiddleName = n(11), _
               .LastName = n(7), _
               .StatusID = n(41) _
           }
Patz.Dump
Normally I would recommend finding a dedicated CSV parser (like this or this). However, I noticed this line in your question:
I need to determine for a given input, whether any line in the csv begins with that input.
That tells me that computer time spent parsing CSV data before this is determined is time wasted. You just need code to simply match text for text, and you can do that via a string comparison as easily as anything else.
Additionally, you mention that the data is sorted. This should allow you to speed things up tremendously... but you need to be aware that, to take advantage of it, you will need to write your own code to make seek calls on low-level file streams. This will be by far your best-performing option, but it will also by far require the most initial work and maintenance.
I recommend an engineering based approach, where you set a performance goal, build something relatively simple, and measure the results against that goal. In particular, start with the 2nd link I posted above. The CSV reader there will only load one record into memory at a time, so it should perform reasonably well, and it's easy to get started with. Build something that uses that reader, and measure the results. If they meet your goal, then stop there.
If they don't meet your goal, adapt the code from the link so that as you read each line you first do a string comparison (before bothering to parse the csv data), and only do the work to parse csv for the lines that match. This should perform better, but only do the work if the first option does not meet your goals. When this is ready, measure the performance again.
Finally, if you still don't meet the performance goal, we're into the territory of writing low-level code to do a binary search on your file stream using seek calls. This is likely the best you'll be able to do, performance-wise, but it will be very messy and bug-prone code to write, and so you only want to go here if you absolutely do not meet your goals from earlier steps.
Remember, performance is a feature, and just like any other feature you need to evaluate how you build for that feature relative to real design goals. "As fast as possible" is not a reasonable design goal. Something like "respond to a user search within .25 seconds" is a real design goal, and if the simpler but slower code still meets that goal, you need to stop there.

String Builder vs Lists

I am reading in multiple files with millions of lines, and I am creating a list of all line numbers that have a specific issue - for example, if a specific field is left blank or contains an invalid value.
So my question is: what would be the most efficient data type for keeping track of a list of numbers that could run to upwards of a million rows? Would using StringBuilder, List, or something else be more efficient?
My end goal is to output a message like "Specific field is blank on 1-32, 40, 45, 47, 49-51", etc. In the case of a StringBuilder, I would check the previous value and, if it is only 1 more, change it from 1 to 1-2; if it is more than one more, I would separate it with a comma. With the List, I would just add each number to the list and then combine them once the file has been completely read. However, in that case I could have multiple lists containing millions of numbers.
Here is the current code I am using to combine a list of numbers using StringBuilder:
string currentLine = sbCurrentLineNumbers.ToString();
string currentLineSub;
StringBuilder subCurrentLine = new StringBuilder();
StringBuilder subCurrentLineSub = new StringBuilder();
int indexLastSpace = currentLine.LastIndexOf(' ');
int indexLastDash = currentLine.LastIndexOf('-');
int currentStringInt = 0;

if (sbCurrentLineNumbers.Length == 0)
{
    sbCurrentLineNumbers.Append(lineCount);
}
else if (indexLastSpace == -1 && indexLastDash == -1)
{
    currentStringInt = Convert.ToInt32(currentLine);
    if (currentStringInt == lineCount - 1)
        sbCurrentLineNumbers.Append("-" + lineCount);
    else
    {
        sbCurrentLineNumbers.Append(", " + lineCount);
        commaCounter++;
    }
}
else if (indexLastSpace > indexLastDash)
{
    currentLineSub = currentLine.Substring(indexLastSpace);
    currentStringInt = Convert.ToInt32(currentLineSub);
    if (currentStringInt == lineCount - 1)
        sbCurrentLineNumbers.Append("-" + lineCount);
    else
    {
        sbCurrentLineNumbers.Append(", " + lineCount);
        commaCounter++;
    }
}
else if (indexLastSpace < indexLastDash)
{
    currentLineSub = currentLine.Substring(indexLastDash + 1);
    currentStringInt = Convert.ToInt32(currentLineSub);
    string charOld = currentLineSub;
    string charNew = lineCount.ToString();
    if (currentStringInt == lineCount - 1)
        sbCurrentLineNumbers.Replace(charOld, charNew);
    else
    {
        sbCurrentLineNumbers.Append(", " + lineCount);
        commaCounter++;
    }
}
My end goal is to out put a message like "Specific field is blank on 1-32, 40, 45, 47, 49-51
If that's the end goal, there's no point in going through an intermediary representation such as a List<int> - just go with a StringBuilder. You will save on memory and CPU that way.
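For what it's worth, here is a sketch of that idea that keeps only the StringBuilder plus two int fields, so you never have to re-parse the text you have already built (the class name and API are mine, not your existing code; it assumes line numbers arrive in ascending order):

using System.Text;

class LineRangeBuilder
{
    private readonly StringBuilder sb = new StringBuilder();
    private int rangeStart = -1;
    private int previous = -1;

    // Call once per offending line number, in ascending order.
    public void Add(int lineNumber)
    {
        if (rangeStart == -1)                       // first number seen
        {
            rangeStart = previous = lineNumber;
            sb.Append(lineNumber);
        }
        else if (lineNumber == previous + 1)        // still in the current run
        {
            previous = lineNumber;
        }
        else                                        // run ended: close it, start a new one
        {
            CloseRun();
            sb.Append(", ").Append(lineNumber);
            rangeStart = previous = lineNumber;
        }
    }

    // Call once after the whole file has been read.
    public override string ToString()
    {
        CloseRun();
        return sb.ToString();
    }

    private void CloseRun()
    {
        if (previous > rangeStart)
        {
            sb.Append('-').Append(previous);        // e.g. turn "1" into "1-32"
            rangeStart = previous;                  // so a second call is a no-op
        }
    }
}

Usage would be along the lines of builder.Add(lineCount) inside the read loop, and "Specific field is blank on " + builder at the end.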
StringBuilder serves your purpose, so stick with that; if you ever need the line numbers themselves, you can easily change the code then.
It depends on how you can / want to break the code up.
Given you are reading it in line order, I'm not sure you need a list at all.
Your current desired output implies that you can't output anything until the file is completely scanned. The size of the file suggests a one-pass analysis phase would be a good idea as well, given you are going to use buffered input as opposed to reading the entire thing into memory.
I'd be tempted to use an enum to describe the issue, e.g. "Field??? is blank", and then use that as the key of a dictionary of StringBuilders (see the sketch below).
As a first thought, anyway.
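Something along these lines, for example (a rough sketch; the enum members and names are placeholders, not a worked solution):

using System.Collections.Generic;
using System.Text;

enum IssueType { FieldBlank, InvalidValue }          // placeholder issue kinds

class IssueCollector
{
    private readonly Dictionary<IssueType, StringBuilder> issues =
        new Dictionary<IssueType, StringBuilder>();

    public void Record(IssueType issue, int lineNumber)
    {
        StringBuilder sb;
        if (!issues.TryGetValue(issue, out sb))
        {
            sb = new StringBuilder();
            issues[issue] = sb;
        }
        if (sb.Length > 0) sb.Append(", ");
        sb.Append(lineNumber);                       // the range collapsing from the question can slot in here
    }

    public string Report(IssueType issue)
    {
        return issues.ContainsKey(issue)
            ? issue + " on lines " + issues[issue]
            : null;
    }
}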
Is your output supposed to be human readable? If so, you'll hit the limit of what is reasonable to read, long before you have any performance/memory issues from your data structure. Use whatever is easiest for you to work with.
If the output is supposed to be machine readable, then that output might suggest an appropriate data structure.
As others have pointed out, I would probably use StringBuilder. The List may have to resize many times; the newer implementation of StringBuilder does not have to resize.
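If you do know a rough element count up front, both types also let you pre-allocate and avoid most of the resizing (the capacity values below are just examples, not recommendations):

var lineNumbers = new List<int>(1000000);        // initial capacity; grows only if exceeded
var sb = new StringBuilder(4 * 1024 * 1024);     // initial capacity in characters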

read text from file and apply some operations on it

I have a problem with how to read text from a file and perform operations on it. For example,
I have this text file that includes:
//name-//sex---------//birth //m1//m2//m3
fofo, male, 1986, 67, 68, 69
momo, male, 1986, 99, 98, 100
Habs, female, 1988, 99, 100, 87
toto, male, 1989, 67, 68, 69
lolo, female, 1990, 89, 80, 87
soso, female, 1988, 99, 100, 83
Now, I know how to read line by line till I reach null.
But this time I want to perform an average function to get the average of the first column of numbers, m1,
and then get the average of m1 for females only and for males only,
and some other operations that I can do, no problem.
I need help; I don't know how to do it.
What I have in mind is to read each line of the text file and put it in a string, then split the string (str.Split(',');), but how do I get the m1 value from each string?
I'm really confused: should I use a regex to get the integers? Should I use a 2D array? I'm totally lost, any ideas?
Please, if you can illustrate any ideas with a code sample, that would be great and a kindness on your part.
And after I am done I will post it for you to check.
{ as a girl I think I made the wrong decision to join the IT community :-( }
Try something like this.
var qry = from line in File.ReadAllLines(@"C:\Temp\Text.txt")
          let vals = line.Split(new char[] { ',' })
          select new
          {
              Name = vals[0].Trim(),
              Sex = vals[1].Trim(),
              Birth = vals[2].Trim(),
              m1 = Int32.Parse(vals[3]),
              m2 = Int32.Parse(vals[4]),
              m3 = Int32.Parse(vals[5])
          };

double avg = qry.Average(a => a.m1);
double GirlsAvg = qry.Where(a => a.Sex == "female").Average(a => a.m1);
double BoysAvg = qry.Where(a => a.Sex == "male").Average(a => a.m1);
I wrote a blog post a while back detailing the act of reading a CSV file and parsing its columns:
http://www.madprops.org/blog/back-to-basics-reading-a-csv-file/
I took the approach you mention (splitting the string), then use DateTime.TryParseExact() and related methods to convert the individual values to the types I need.
Hope the post helps!
Is there a reason for not creating a data structure that stores the fields of the file (a string, a boolean for m/f, an integer and 3 integers), which you could put into a List that stores the values, and then loop over to compute the various sums, averages, or whatever other aggregate functions you'd like?
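For example, a sketch of that approach (the class, property and file names here are just illustrative; the snippet assumes using System.Collections.Generic;, System.IO; and System.Linq;):

class Student
{
    public string Name;
    public bool IsMale;
    public int Birth;
    public int M1, M2, M3;
}

static List<Student> LoadStudents(string path)
{
    var students = new List<Student>();
    foreach (var line in File.ReadAllLines(path))
    {
        var parts = line.Split(',');
        students.Add(new Student
        {
            Name   = parts[0].Trim(),
            IsMale = parts[1].Trim() == "male",
            Birth  = int.Parse(parts[2].Trim()),
            M1     = int.Parse(parts[3].Trim()),
            M2     = int.Parse(parts[4].Trim()),
            M3     = int.Parse(parts[5].Trim())
        });
    }
    return students;
}

// later:
// var students = LoadStudents("marks.txt");
// double m1Average       = students.Average(s => s.M1);
// double femaleM1Average = students.Where(s => !s.IsMale).Average(s => s.M1);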
(note: this might seem an over-complicated solution, but I'm assuming that the source data is large (lots of rows), so loading it into a List<T> might not be feasible)
The file reading would be done quite well with an iterator block... if the data is large, you only want to handle one row at a time, not a 2D array.
This actually looks like a good fit for MiscUtil's PushLINQ approach, which can perform multiple aggregates at the same time on a stream of data, without buffering...
An example is below...
why is this useful?
Because it allows you to write multiple queries on a data source using standard LINQ syntax, but only read it once.
Example
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using MiscUtil.Linq;
using MiscUtil.Linq.Extensions;

static class Program
{
    static void Main()
    {
        // prepare a query that is capable of parsing
        // the input file into the expected format
        string path = "foo.txt";
        var qry = from line in ReadLines(path)
                  let arr = line.Split(',')
                  select new
                  {
                      Name = arr[0].Trim(),
                      Male = arr[1].Trim() == "male",
                      Birth = int.Parse(arr[2].Trim()),
                      M1 = int.Parse(arr[3].Trim())
                      // etc
                  };

        // get a "data producer" to start the query process
        var producer = CreateProducer(qry);

        // prepare the overall average
        var avg = producer.Average(row => row.M1);

        // prepare the gender averages
        var avgMale = producer.Where(row => row.Male)
                              .Average(row => row.M1);
        var avgFemale = producer.Where(row => !row.Male)
                                .Average(row => row.M1);

        // run the query; until now *nothing has happened* - we haven't
        // even opened the file
        producer.ProduceAndEnd(qry);

        // show the results
        Console.WriteLine(avg.Value);
        Console.WriteLine(avgMale.Value);
        Console.WriteLine(avgFemale.Value);
    }

    // helper method to get a DataProducer<T> from an IEnumerable<T>, for
    // use with the anonymous type
    static DataProducer<T> CreateProducer<T>(IEnumerable<T> data)
    {
        return new DataProducer<T>();
    }

    // this is just a lazy line-by-line file reader (iterator block)
    static IEnumerable<string> ReadLines(string path)
    {
        using (var reader = File.OpenText(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }
}
I recommend using the FileHelpers library. Check out example here: Quick start
You could calculate the average in a foreach-loop like the one on the page.
Suzana, I apologize in advance; I don't mean to offend you. You already said "as a girl, I think I made the wrong decision to join IT...", and I have heard the same thing from my sisters all the time when I tried to help them with their career selection. But if you have conceptual difficulty following the above answers without just copying and pasting the code, I think you just validated part of your statement.
Having said that, there are more in IT than just writing code. In other words, coding might not just be for you, but there are other areas in IT you might excel, including becoming a manager one day. I have had many managers who are not capable of doing the above in any language, but they do a good job of managing people, projects and resources.
Believe me, it's only getting harder from here on. This is a very basic task in programming. But if you realize this soon enough, you could talk to your managers asking for non-coding challenges in the company. QA might also be an alternative. Again, I only wish to help and am sorry if you become offended. Good luck.
Re your follow-up "what if"; you would simply loop:
// rows is the jagged array of split lines (one string[] per row)
int totalCounter = 0, totalSum = 0; // etc
foreach (string[] row in rows)
{
    int m1 = int.Parse(row[3]);
    totalCounter++;
    totalSum += m1;
    switch (row[1].Trim())   // column 1 holds the sex ("male"/"female")
    {
        case "male":
            maleCount++;
            maleSum += m1;
            break;
        case "female":
            femaleCount++;
            femaleSum += m1;
            break;
    }
}
etc. However, while this works, you can do the same thing a lot more conveniently/expressively in C# 3.0 with LINQ, which is what a lot of the existing replies are trying to show... the fact is, Tim J's post already does all of this:
ReadAllLines: gets the array of rows per line
Split: gets the array of data per row
"select new {...}": parses the data into something convenient
3 "avg" lines show how to take an average over filtered data
The only change I'd make is that I'd add chuck a ToArray() in there somewhere so we only read the file once...
