receiving unreadable text while trying to bencode a "find_node" query - c#

I'm using BencodeNET to send a find_node query and receive an answer via a bootstrap node.
It seems like the request works well and I do get a response, both in Wireshark and in C#. The problem is that when decoding it (using bencoding, of course) I get an unreadable, unusable dictionary.
code:
IPAddress[] ipAddresses = Dns.GetHostAddresses("router.bittorrent.com");
var endpoint = new IPEndPoint(ipAddresses[0], 6881);
// Create a new bencoded dictionary for the query
BDictionary query = new BDictionary();
query["t"] = new BString("aa"); // transaction ID
query["y"] = new BString("q"); // query type
query["q"] = new BString("find_node"); // query name
// Add additional parameters to the query
BDictionary argsDict = new BDictionary();
argsDict["id"] = new BString(id); // target node ID
Console.WriteLine(BMethods.BytesToId(Encoding.ASCII.GetBytes(argsDict["id"].ToString())) + " <- id check"); // debug output
argsDict["target"] = new BString(id); // target ID
query["a"] = argsDict;
Console.WriteLine(argsDict.Get<BString>("id")+"here");
var parser = new BencodeParser(Encoding.GetEncoding("ISO-8859-1"));
// Encode the query as a byte array
var queryBytes = query.EncodeAsBytes();
// Create a new socket
UdpClient udpServer = new UdpClient(11000);
udpServer.Send(queryBytes, queryBytes.Length, endpoint); // send the query
var data = udpServer.Receive(ref endpoint); // wait for the response on port 11000
Console.WriteLine(Encoding.GetEncoding("ISO-8859-1").GetString(data));
BDictionary response = parser.Parse<BDictionary>(data);
foreach (KeyValuePair<BString, IBObject> i in response)
{
if(i.Value is BDictionary)
{
BDictionary b = (BDictionary)i.Value;
foreach (KeyValuePair<BString, IBObject> j in b)
{
Console.WriteLine(j.Key + ": " + BMethods.BytesToId(BMethods.ConvertHexToByte((((BString)j.Value).ToString()))));
}
continue;
}
Console.WriteLine(i.Key + ": " + ((i.Value).ToString()));
}
id is a random 20-byte array, if you're wondering.
the response output usually looks something like this:
ip: ?Z?w*o
id: 2oNisQÿJì)Iº«òûaF|Ag
nodes: I??"2/IªLO!d?å?Y3?/lYA??cwIov0ahrth_EqU½?G??▌Ä?÷0ï%(I¼A?ÿ?¡9I?òµ??°O0?~x,B?ªO];'?UDz½??cUP¿}ïWLx?;K=&o¼âC}ÑWAº3{àO\AOé81Ñ
B-F¥)ú21?c_ìvyêo?AMe6lcp\g?èöUÉ°_6OrEu[?)??']km?èU¡(á
t: aa
y: r
with id and nodes obviously being completely unreadable and unusable.
For anyone wondering what a proper response should look like, it's in BEP 5.
Now, with id I could get around this by just taking the bytes and converting them to hex. The problem is the nodes key, which holds a lot more information than just a SHA-1 id (namely an IP and port).
Edit: I forgot to add that sometimes the response dictionary doesn't contain a "nodes" key at all, despite Wireshark showing it should.

I don't have the code with me right now but I'll add it later. Essentially the "nodes" key, as written in BEP 5, is composed of 26-byte entries: 20 bytes of node ID, 4 bytes of IP address, and 2 bytes of port. Using this information you can take this mess of a dump, turn it into a byte array, and decompose it into a list of nodes with the fields above.
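For illustration, here is a minimal sketch of decoding that compact format, assuming you already have the raw bytes of the "nodes" value (how you get them out of BencodeNET depends on your version, so ParseCompactNodes and nodesBytes are placeholders):
// Decode BEP 5 compact node info: 26 bytes per node =
// 20-byte node ID + 4-byte IPv4 address + 2-byte big-endian port.
// Requires System, System.Collections.Generic and System.Net.
static List<(byte[] Id, IPEndPoint EndPoint)> ParseCompactNodes(byte[] nodesBytes)
{
    var nodes = new List<(byte[] Id, IPEndPoint EndPoint)>();
    for (int offset = 0; offset + 26 <= nodesBytes.Length; offset += 26)
    {
        byte[] id = new byte[20];
        Array.Copy(nodesBytes, offset, id, 0, 20);

        var ip = new IPAddress(new[]
        {
            nodesBytes[offset + 20], nodesBytes[offset + 21],
            nodesBytes[offset + 22], nodesBytes[offset + 23]
        });

        // Port is stored in network byte order (big-endian).
        int port = (nodesBytes[offset + 24] << 8) | nodesBytes[offset + 25];

        nodes.Add((id, new IPEndPoint(ip, port)));
    }
    return nodes;
}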

This is normal, because bencoding is foremost a binary format which just sometimes happens to be human-readable. Bencode "strings" really are byte[]; BEP 52 clarifies this. The examples in BEP 5 use values that just happen to be in the ASCII range to improve readability.
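As a hedged aside: the question already decodes with ISO-8859-1 (Latin-1), which maps every byte 0-255 to the character with the same code point, so a string decoded that way can be converted back to the original bytes losslessly. A minimal sketch of that round trip (rawNodesBytes is a placeholder for the value's bytes):
var latin1 = Encoding.GetEncoding("ISO-8859-1");
string nodesAsText = latin1.GetString(rawNodesBytes); // what an ISO-8859-1 parse effectively hands you
byte[] roundTripped = latin1.GetBytes(nodesAsText);   // byte-for-byte identical to rawNodesBytes
// roundTripped can then be fed to something like ParseCompactNodes above.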

Related

Passing array into query string

I need to send an array as a query parameter, and I do it like this:
StringBuilder Ids = new StringBuilder();
for (int i = 0; i < array.Count; i++)
{
Ids.Append(String.Format("&id[{0}]={1}", i, items[i].ID));
}
ifrDocuments.Attributes.Add("src", "Download.aspx?arrayCount=" + array.Count + Ids);
After this I have the string:
Download.aspx?arrayCount=8&id[0]=106066&id[1]=106065&id[2]=106007&id[3]=105284&id[4]=105283&id[5]=105235&id[6]=105070&id[7]=103671
It can contain 100 elements, and in this case I'm getting an error.
Maybe I can do it in another way, not by sending it inside the query string?
There is a limit on URL length at multiple levels (browsers, proxy servers, etc.). You can change maxQueryString (*1), but I would not recommend it if you expect real users to use your system.
It looks like Download.aspx is your page. Put all those ids in temporary storage (a cache or database) and pass the key to this new entity in the request.
*1: https://blog.elmah.io/fix-max-url-and-query-string-length-with-web-config-and-iis/
QueryString is not the way to pass an array, because of the limits.
If you have control over the endpoint, you should consider sending your array in a POST body.
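A minimal sketch of that idea, assuming the endpoint accepts JSON and you use HttpClient with Newtonsoft.Json (the URL, names, and async context are placeholders):
// Inside an async method: send the ids in the request body instead of the URL.
var ids = items.Select(x => x.ID).ToList();        // `items` as in the question
string json = JsonConvert.SerializeObject(ids);    // e.g. [106066,106065,...]
using (var client = new HttpClient())
{
    var content = new StringContent(json, Encoding.UTF8, "application/json");
    HttpResponseMessage response = await client.PostAsync("https://example.com/api/download", content);
    response.EnsureSuccessStatusCode();
}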
Regards

How to optimize more than 10 addresses using google-maps api v3

So, this will be a pretty straightforward question.
I have a function in a helper class to fetch an optimized (or not) route using the Maps API.
public static JObject CalcRoute(string origin, string destination, string[] waypoints)
{
var requestUrl = string.Format("https://maps.googleapis.com/maps/api/directions/json?origin={0}&destination={1}&waypoints={2}", origin, destination, string.Join("|", waypoints));
using (var client = new WebClient())
{
var result = client.DownloadString(requestUrl);
var data = JsonConvert.DeserializeObject<JObject>(result);
//ensure directions response contains a valid result
string status = (string) data["status"];
if ((string)data["status"] != "OK")
throw new Exception("Invalid request");
return data;
}
}
And in my controller I am calling it like so:
var data = GoogleGeocodeExtension.CalcRoute("startLat,endLat", "endLat,endLong", new[]
{
"optimize:true",
"lat,lang",
"lat,lang",
//6 more lat and lang pairs
//I want to optimise more than 10 limit
});
//showing data in debug window like so
//so you can test it faster
foreach (var route in data["routes"])
{
Debug.WriteLine("----------------------------------------------------");
Debug.WriteLine(route["waypoint_order"]);
Debug.WriteLine("----------------------------------------------------");
foreach (var leg in route["legs"])
{
Debug.WriteLine("===================================================================================================");
Debug.WriteLine("Start: {0} End: {1} \n Distance: {2} Duration: {3}", leg["start_address"], leg["end_address"], leg["distance"], leg["duration"]);
Debug.WriteLine("Start lat lang: {0}", leg["start_location"]);
Debug.WriteLine("End lat lang: {0}", leg["end_location"]);
Debug.WriteLine("===================================================================================================");
}
}
So I can send 10 coordinates (lat & lng pairs): 2 as the start and end positions, and the other 8 as waypoints. But how do I send 20? Or 30?
I have read many, many questions here on SO and other sites that mostly answer questions about showing or calculating an already optimized list of coordinates.
I know I can divide my list into multiple lists of fewer than 10 coordinates, but that way I wouldn't get a properly optimized route, since it wouldn't take all coordinates into consideration and only parts of it would be properly optimized.
Unfortunately I can't afford to pay for a premium license from Google (if I am not mistaken, it's 10k freaking bucks :S).
EDIT: Apparently, when using the web directions service on the server side, you can go up to a 23-waypoint limit. What you need, though, is to add your key to the API call like so:
var requestUrl = string.Format("https://maps.googleapis.com/maps/api/directions/json?key={3}&origin={0}&destination={1}&waypoints={2}", origin, destination, string.Join("|", waypoints), "YourApiKey");
The web directions service (which is what you are using) supports up to 23 waypoints in each server-side request for users of the standard API (you don't need a premium license).
You do need a key, but keys are now (as of June 22, 2016) required for all of Google's mapping services.

How to get a list of UIDs in reverse order with MailKit?

I would like to get the latest 100 UIDs from the inbox using MailKit. I am accessing a Gmail mailbox, which does not appear to support the SORT extension, so I am unable to use OrderBy.
Here is my code. The problem is that it appears to retrieve the oldest 100 emails rather than the latest ones (which is how I would expect it to work). Is there a way to do this?
Option A - looks promising but only gets the 100 oldest email UIDs, and I want the 100 newest:
imap.Inbox.Open(FolderAccess.ReadOnly);
var orderBy = new [] { OrderBy.ReverseArrival };
var items = imap.Inbox.Fetch(0, limit, MessageSummaryItems.UniqueId);
Option B - gets all UIDs by date order (but does not work on Gmail anyway):
imap.Inbox.Open(FolderAccess.ReadOnly);
var orderBy = new [] { OrderBy.ReverseArrival };
SearchQuery query = SearchQuery.All;
var items = imap.Inbox.Search(query, orderBy);
The IMAP server does not support the SORT extension.
The reason for wanting this is to quickly scan the mailbox in order to improve responsiveness for the user.
You were pretty close in Option A, you just used the wrong values for the first 2 arguments.
This is what you want:
imap.Inbox.Open (FolderAccess.ReadOnly);
if (imap.Inbox.Count > 0) {
// fetch the UIDs of the newest 100 messages
int index = Math.Max (imap.Inbox.Count - 100, 0);
var items = imap.Inbox.Fetch (index, -1, MessageSummaryItems.UniqueId);
...
}
The way that IMailFolder.Fetch (int, int, MessageSummaryItems) works is that the first int argument is the first message index and the second argument is the last message index in the range (-1 is a special case that just means "the last message in the folder").
Once we open the folder, we can use the IMailFolder.Count property to get the total number of messages in the folder, and we can use that to count backwards from the end to get our starting index. We want the last 100, so we can do folder.Count - 100. We use Math.Max() to make sure we don't get a negative value if the folder contains fewer than 100 messages.
Hope that helps.
If you want to download each Message individually, you can do something simple like this.
// newest first; Math.Max guards against folders with fewer than 100 messages
for (int i = inbox.Count - 1; i >= Math.Max(inbox.Count - 100, 0); i--)
{
var message = inbox.GetMessage(i);
Console.WriteLine($"Subject: {message.Subject}");
}
If you would like to receive it all in one request, try this.
// guard against folders with fewer than 100 messages
var lastHundredMessages = Enumerable.Range(Math.Max(inbox.Count - 100, 0), Math.Min(inbox.Count, 100)).ToList();
var messages = inbox.Fetch(lastHundredMessages, MailKit.MessageSummaryItems.UniqueId);
foreach (var message in messages)
{
// Do something here with this
}

Using c# to read from a text file

I am reading from a text file using the code below.
if (!allLines.Contains(":70"))
{
var firstIndex = allLines.IndexOf(":20");
var secondIndex = allLines.IndexOf(":23B");
var thirdIndex = allLines.IndexOf(":59");
var fourthIndex = allLines.IndexOf(":71A");
var fifthIndex = allLines.IndexOf(":72");
var sixthIndex = allLines.IndexOf("-}");
var firstValue = allLines.Substring(firstIndex + 4, secondIndex - firstIndex - 5).TrimEnd();
var secondValue = allLines.Substring(thirdIndex + 4, fourthIndex - thirdIndex - 5).TrimEnd();
var thirdValue = allLines.Substring(fifthIndex + 4, sixthIndex - fifthIndex - 5).TrimEnd();
var len1 = firstValue.Length;
var len2 = secondValue.Length;
var len3 = thirdValue.Length;
inflow103.REFERENCE = firstValue.TrimEnd();
pointer = 1;
inflow103.BENEFICIARY_CUSTOMER = secondValue;
inflow103.RECEIVER_INFORMATION = thirdValue;
}
else if (allLines.Contains(":70"))
{
var firstIndex = allLines.IndexOf(":20");
var secondIndex = allLines.IndexOf(":23B");
var thirdIndex = allLines.IndexOf(":59");
var fourthIndex = allLines.IndexOf(":70");
var fifthIndex = allLines.IndexOf(":71");
var sixthIndex = allLines.IndexOf(":72");
var seventhIndex = allLines.IndexOf("-}");
var firstValue = allLines.Substring(firstIndex + 4, secondIndex - firstIndex - 5).TrimEnd();
var secondValue = allLines.Substring(thirdIndex + 5, fourthIndex - thirdIndex - 5).TrimEnd();
var thirdValue = allLines.Substring(sixthIndex + 4, seventhIndex - sixthIndex - 5).TrimEnd();
var len1 = firstValue.Length;
var len2 = secondValue.Length;
var len3 = thirdValue.Length;
inflow103.REFERENCE = firstValue.TrimEnd();
pointer = 1;
inflow103.BENEFICIARY_CUSTOMER = secondValue;
inflow103.RECEIVER_INFORMATION = thirdValue;
}
Below is the format of the text file I am reading.
{1:F21DBLNNGLAAXXX4695300820}{4:{177:1405260906}{451:0}}{1:F01DBLNNGLAAXXX4695300820}{2:O1030859140526SBICNGLXAXXX74790400761405260900N}{3:{103:NGR}{108:AB8144573}{115:3323774}}{4:
:20:SBICNG958839-2
:23B:CRED
:23E:SDVA
:32A:140526NGN168000000,
:50K:IHS PLC
:53A:/3000025296
SBICNGLXXXX
:57A:/3000024426
DBLNNGLA
:59:/0040186345
SONORA CAPITAL AND INVSTMENT LTD
:71A:OUR
:72:/CODTYPTR/001
-}{5:{MAC:00000000}{PAC:00000000}{CHK:42D0D867739F}}{S:{SPD:}{SAC:}{FAC:}{COP:P}}
The above format represents one transaction in a single text file, but while testing with live files, I came across a situation where a file can have more than one transaction. An example is shown below.
{1:F21DBLNNGLAAXXX4694300150}{4:{177:1405231923}{451:0}}{1:F01DBLNNGLAAXXX4694300150}{2:O1031656140523FCMBNGLAAXXX17087957771405231916N}{3:{103:NGR}{115:3322817}}{4:
:20:TRONGN3RDB16
:23B:CRED
:23E:SDVA
:26T:001
:32A:140523NGN1634150,00
:50K:/2206117013
SUNLEK INVESTMENT LTD
:53A:/3000024763
FCMBNGLA
:57A:/3000024426
DBLNNGLA
:59:/0022617678
GOLDEN DC INT'L LTD
:71A:OUR
:72:/CODTYPTR/001
//BNF/TRSF
-}{5:{MAC:00000000}{PAC:00000000}{CHK:C21000C4ECBA}{DLM:}}{S:{SPD:}{SAC:}{FAC:}{COP:P}}${1:F21DBLNNGLAAXXX4694300151}{4:{177:1405231923}{451:0}}{1:F01DBLNNGLAAXXX4694300151}{2:O1031656140523FCMBNGLAAXXX17087957781405231916N}{3:{103:NGR}{115:3322818}}{4:
:20:TRONGN3RDB17
:23B:CRED
:23E:SDVA
:26T:001
:32A:140523NGN450000,00
:50K:/2206117013
SUNLEK INVESTMENT LTD
:53A:/3000024763
FCMBNGLA
:57A:/3000024426
DBLNNGLA
:59:/0032501697
SUNSTEEL INDUSTRIES LTD
:71A:OUR
:72:/CODTYPTR/001
//BNF/TRSF
-}{5:{MAC:00000000}{PAC:00000000}{CHK:01C3B7B3CA53}{DLM:}}{S:{SPD:}{SAC:}{FAC:}{COP:P}}
My challenge is that in my code, while reading allLines, each field is located by a fixed index. In a situation where I need to pick up the second transaction from the file, and the same tags exist again as before, how can I manage this?
This is a simple problem obscured by excess code. All you are doing is extracting 3 values from a chunk of text where the precise layout can vary from one chunk to another.
There are 3 things I think you need to do.
Refactor the code. Instead of two hefty inline if blocks, you need functions that extract the required text.
Use regular expressions. A single regular expression can extract the values you need in one line instead of several.
Separate the code from the data. The logic of these two blocks is identical, only the data changes. So write one function and pass in the regular expression(s) needed to extract the data items you need.
Unfortunately this calls for a significant lift in the abstraction level of the code, which may be beyond what you're ready for. However, if you can do this and (say) you have a function Extract() with regular expressions as arguments, you can apply that function once, twice or as often as needed to handle variations in your basic transaction.
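As a hedged sketch of what such an Extract() helper might look like (the tag boundaries are taken from the sample data; adjust them for your :70 / no-:70 cases):
// Requires System.Text.RegularExpressions.
// Returns the text between `tag` and `nextTag` within one transaction block, or null.
static string Extract(string block, string tag, string nextTag)
{
    // (?s) lets '.' match newlines, since fields like :59: span several lines.
    Match m = Regex.Match(block,
        "(?s)" + Regex.Escape(tag) + "(.*?)" + Regex.Escape(nextTag));
    return m.Success ? m.Groups[1].Value.Trim() : null;
}

// Example usage per transaction block:
// inflow103.REFERENCE            = Extract(block, ":20:", ":23B:");
// inflow103.BENEFICIARY_CUSTOMER = Extract(block, ":59:", block.Contains(":70:") ? ":70:" : ":71A:");
// inflow103.RECEIVER_INFORMATION = Extract(block, ":72:", "-}");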
You can perhaps use the code below to handle multiple records with your existing code:
//assuming fileText is all the text read from the text file
string[] fileData = fileText.Split('$');
foreach(string allLines in fileData)
{
//your code goes here
}
Maybe indexing works, but given the particular structure of the format, I highly doubt it is a good solution. But if it works for you, then that's great. You can simply split on $ and then pass each substring into a method. This ensures that the index for each substring starts at the beginning of the entry.
However, if you run into a situation where indices are no longer static, then before you even start to write a parser for any format, you need to first understand the format. If you don't have any documentation and are basically reverse engineering it, that's what you need to do. Maybe someone else has specifications. Maybe the source of this data has it somewhere. But I will proceed under the assumption that none of this information is available and you have been given a task with absolutely no support and are expected to reverse-engineer it.
Any format that is meant to be parsed and written by a computer will, 9 times out of 10, be well-formed. I'd say 9.9 out of 10, for that matter, since there are cases where people make things unnecessarily complex for the sake of "security".
When I look at your sample data, I see "chunks" of data enclosed within curly braces, as well as nested chunks.
For example, you have things like
{tag1:value1} // simple chunk
{tag2:{tag3: value3}{tag4:value4}} // nested chunk
Multiple transactions are delimited by a $ apparently. You may be able to split on $ signs in this case, but again you need to be sure that the $ is a special character and doesn't appear in tags or values themselves.
Do not be fixated on what a "chunk" is or why I use the term. All you need to know is that there are "tags" and each tag comes with a particular "value".
The value can be anything: a primitive such as a string or number, or another chunk. This suggests that you first need to figure out what type of value each tag accepts. For example, the 1 tag takes a string. The 4 tag takes multiple chunks, possibly representing different companies. There are chunks like DLM that have an empty value.
From these two samples, I would assume that you need to consume each chunk, check the tag, and then parse the value. Since there are nested chunks, you likely need to store them in a way that handles the nesting correctly.
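A rough sketch of that consume-a-chunk idea, purely to illustrate brace matching and tag/value splitting (real handling of block 4's line-based fields would sit on top of this):
// Walks top-level {tag:value} chunks; nested chunks are left inside Value
// and can be fed back into ParseChunks to handle the nesting.
static List<(string Tag, string Value)> ParseChunks(string text)
{
    var chunks = new List<(string Tag, string Value)>();
    int i = 0;
    while (i < text.Length)
    {
        if (text[i] != '{') { i++; continue; }   // skip separators such as '$' between transactions

        int depth = 0, start = i;
        while (i < text.Length)
        {
            if (text[i] == '{') depth++;
            else if (text[i] == '}' && --depth == 0) break;
            i++;
        }

        string body = text.Substring(start + 1, i - start - 1); // contents between the outer braces
        int colon = body.IndexOf(':');
        if (colon >= 0)
            chunks.Add((body.Substring(0, colon), body.Substring(colon + 1)));

        i++; // step past the closing brace
    }
    return chunks;
}
// e.g. ParseChunks(chunk.Value) on the value of tag "3" yields {103:NGR}, {108:...}, {115:...}.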

Most efficient way to process a large csv in .NET

Forgive my noobiness, but I just need some guidance and I can't find another question that answers this. I have a fairly large CSV file (~300k rows) and I need to determine, for a given input, whether any line in the CSV begins with that input. I have sorted the CSV alphabetically, but I don't know:
1) how to process the rows in the CSV: should I read it in as a list/collection, use OLEDB, an embedded database, or something else?
2) how to find something efficiently in an alphabetical list (using the fact that it's sorted to speed things up, rather than searching the whole list)
You don't give enough specifics for a concrete answer, but...
IF the CSV file changes often then use OLEDB and just change the SQL query based on your input.
string sql = @"SELECT * FROM [" + fileName + "] WHERE Column1 LIKE 'blah%'";
using(OleDbConnection connection = new OleDbConnection(
#"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + fileDirectoryPath +
";Extended Properties=\"Text;HDR=" + hasHeaderRow + "\""))
IF the CSV file doesn't change often and you run a lot of "queries" against it, load it once into memory and quickly search it each time.
IF you want your search to be an exact match on a column use a Dictionary where the key is the column you want to match on and the value is the row data.
Dictionary<long, string> Rows = new Dictionary<long, string>();
...
if(Rows.ContainsKey(search)) ...
IF you want your search to be a partial match like StartsWith, then have one array containing your searchable data (i.e. the first column) and another list or array containing your row data. Then use C#'s built-in binary search: http://msdn.microsoft.com/en-us/library/2cy9f6wb.aspx
string[] SortedSearchables = new string[0];
List<string> SortedRows = new List<string>();
...
string result = null;
int foundIdx = Array.BinarySearch<string>(SortedSearchables, searchTerm);
if(foundIdx < 0) {
foundIdx = ~foundIdx;
if(foundIdx < SortedRows.Count && SortedSearchables[foundIdx].StartsWith(searchTerm)) {
result = SortedRows[foundIdx];
}
} else {
result = SortedRows[foundIdx];
}
NOTE code was written inside the browser window and may contain syntax errors as it wasn't tested.
If you can cache the data in memory, and you only need to search the list on one primary key column, I would recommend storing the data in memory as a Dictionary object. The Dictionary class stores the data as key/value pairs in a hash table. You could use the primary key column as the key in the dictionary, and then use the rest of the columns as the value in the dictionary. Looking up items by key in a hash table is typically very fast.
For instance, you could load the data into a dictionary, like this:
Dictionary<string, string[]> data = new Dictionary<string, string[]>();
using (TextFieldParser parser = new TextFieldParser(@"C:\test.csv"))
{
parser.TextFieldType = FieldType.Delimited;
parser.SetDelimiters(",");
while (!parser.EndOfData)
{
try
{
string[] fields = parser.ReadFields();
data[fields[0]] = fields;
}
catch (MalformedLineException ex)
{
// ...
}
}
}
And then you could get the data for any item, like this:
string[] fields = data["key I'm looking for"];
If you're only doing it once per program run, this seems pretty fast. (Updated to use StreamReader instead of FileStream based on comments below)
static string FindRecordBinary(string search, string fileName)
{
using (StreamReader fs = new StreamReader(fileName))
{
long min = 0; // TODO: What about header row?
long max = fs.BaseStream.Length;
while (min <= max)
{
long mid = (min + max) / 2;
fs.BaseStream.Position = mid;
fs.DiscardBufferedData();
if (mid != 0) fs.ReadLine();
string line = fs.ReadLine();
if (line == null) { max = mid - 1; continue; } // past the last full line, so search the lower half
int compareResult;
if (line.Length > search.Length)
compareResult = String.Compare(
line, 0, search, 0, search.Length, false );
else
compareResult = String.Compare(line, search);
if (0 == compareResult) return line;
else if (compareResult > 0) max = mid-1;
else min = mid+1;
}
}
return null;
}
This runs in 0.007 seconds on a 600,000-record test file that's 50 MB. In comparison, a file scan averages over half a second, depending on where the record is located (roughly a 100-fold difference).
Obviously if you do it more than once, caching is going to speed things up. One simple way to do partial caching would be to keep the StreamReader open and re-use it, just reset min and max each time through. This would save you storing 50 megs in memory all the time.
EDIT: Added knaki02's suggested fix.
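A hedged sketch of that partial-caching idea (class and member names are made up): keep one StreamReader open and reset min/max on every call.
// Re-uses a single open StreamReader between searches; same binary search as above,
// including the null-line case treated as "look in the lower half".
sealed class SortedCsvSearcher : IDisposable
{
    private readonly StreamReader _reader;

    public SortedCsvSearcher(string fileName)
    {
        _reader = new StreamReader(fileName);
    }

    public string Find(string search)
    {
        long min = 0;
        long max = _reader.BaseStream.Length;   // bounds reset on every call
        while (min <= max)
        {
            long mid = (min + max) / 2;
            _reader.BaseStream.Position = mid;
            _reader.DiscardBufferedData();
            if (mid != 0) _reader.ReadLine();   // skip the partial line we landed in
            string line = _reader.ReadLine();
            if (line == null) { max = mid - 1; continue; }

            int compareResult = line.Length > search.Length
                ? string.Compare(line, 0, search, 0, search.Length, false)
                : string.Compare(line, search);

            if (compareResult == 0) return line;
            if (compareResult > 0) max = mid - 1; else min = mid + 1;
        }
        return null;
    }

    public void Dispose() => _reader.Dispose();
}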
Given the CSV is sorted - if you can load the entire thing into memory (If the only processing you need to do is a .StartsWith() on each line) - you can use a Binary search to have exceptionally fast searching.
Maybe something like this (NOT TESTED!):
var csv = File.ReadAllLines(@"c:\file.csv").ToList();
var exists = csv.BinarySearch("StringToFind", new StartsWithComparer());
...
public class StartsWithComparer: IComparer<string>
{
public int Compare(string x, string y)
{
if(x.StartsWith(y))
return 0;
else
return x.CompareTo(y);
}
}
I wrote this quickly for work, could be improved on...
Define the column numbers:
private enum CsvCols
{
PupilReference = 0,
PupilName = 1,
PupilSurname = 2,
PupilHouse = 3,
PupilYear = 4,
}
Define the Model
public class ImportModel
{
public string PupilReference { get; set; }
public string PupilName { get; set; }
public string PupilSurname { get; set; }
public string PupilHouse { get; set; }
public string PupilYear { get; set; }
}
Import and populate a list of models:
var rows = File.ReadLines(csvfilePath).Select(p => p.Split(',')).Skip(1).ToArray();
var pupils = rows.Select(x => new ImportModel
{
PupilReference = x[(int) CsvCols.PupilReference],
PupilName = x[(int) CsvCols.PupilName],
PupilSurname = x[(int) CsvCols.PupilSurname],
PupilHouse = x[(int) CsvCols.PupilHouse],
PupilYear = x[(int) CsvCols.PupilYear],
}).ToList();
Returns you a list of strongly typed objects
If your file is in memory (for example because you did sorting) and you keep it as an array of strings (lines) then you can use a simple bisection search method. You can start with the code on this question on CodeReview, just change the comparer to work with string instead of int and to check only the beginning of each line.
If you have to re-read the file each time because it may be changed or it's saved/sorted by another program then the most simple algorithm is the best one:
using (var stream = File.OpenText(path))
{
    string line;
    while ((line = stream.ReadLine()) != null)
    {
        // Replace this with your comparison / CSV splitting
        if (line.StartsWith("..."))
        {
            // The file contains the line with the required input
            break;
        }
    }
}
Of course you may read the entire file into memory (to use LINQ or List<T>.BinarySearch()) each time, but this is far from optimal (you'll read everything even if you may need to examine just a few lines) and the file itself could even be too large.
If you really need something more and you do not have your file in memory because of sorting (but you should profile your actual performance compared to your requirements), you have to implement a better search algorithm, for example the Boyer-Moore algorithm.
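For reference, a compact sketch of the simpler Boyer-Moore-Horspool variant (the full Boyer-Moore the answer names adds a second, good-suffix shift table); this searches within a string already read into memory:
// Boyer-Moore-Horspool substring search: returns the index of `pattern`
// in `text`, or -1. Skips ahead using the last character of the current window.
// Requires System.Collections.Generic.
static int HorspoolIndexOf(string text, string pattern)
{
    if (pattern.Length == 0) return 0;

    var shift = new Dictionary<char, int>();
    for (int i = 0; i < pattern.Length - 1; i++)
        shift[pattern[i]] = pattern.Length - 1 - i;

    int pos = 0;
    while (pos <= text.Length - pattern.Length)
    {
        int j = pattern.Length - 1;
        while (j >= 0 && text[pos + j] == pattern[j]) j--;
        if (j < 0) return pos;                          // full match

        char last = text[pos + pattern.Length - 1];
        pos += shift.TryGetValue(last, out int s) ? s : pattern.Length;
    }
    return -1;
}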
The OP stated they really just need to search based on the line.
The question is then whether to hold the lines in memory or not.
If a line is 1 KB, that's about 300 MB of memory; if a line is 1 MB, that's about 300 GB of memory.
StreamReader.ReadLine will have a low memory profile, and since the file is sorted you can stop looking once the current line compares greater than the search term.
If you hold it in memory, then a simple List<String> with LINQ will work.
LINQ is not smart enough to take advantage of the sort, but against 300K lines it would still be pretty fast.
BinarySearch will take advantage of the sort.
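A minimal sketch of that stream-and-stop-early idea (assuming a plain ordinal sort on the lines):
// Streams the sorted file line by line and stops as soon as the lines
// sort past the prefix, so on average only part of the file is read.
static bool AnyLineStartsWith(string path, string prefix)
{
    foreach (string line in File.ReadLines(path))
    {
        if (line.StartsWith(prefix, StringComparison.Ordinal))
            return true;
        if (string.Compare(line, prefix, StringComparison.Ordinal) > 0)
            return false;   // sorted: nothing later can start with the prefix
    }
    return false;
}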
Try the free CSV Reader. No need to reinvent the wheel over and over again ;)
1) If you do not need to store the results, just iterate through the CSV - handle each line and forget it. If you need to process all lines again and again, store them in a List or Dictionary (with a good key, of course)
2) Try the generic extension methods like this
var list = new List<string>() { "a", "b", "c" };
string oneA = list.FirstOrDefault(entry => !string.IsNullOrEmpty(entry) && entry.ToLowerInvariant().StartsWith("a"));
IEnumerable<string> allAs = list.Where(entry => !string.IsNullOrEmpty(entry) && entry.ToLowerInvariant().StartsWith("a"));
Here is my VB.NET code. It is for a quote-qualified CSV, so for a regular CSV, change Let n = P.Split(New Char() {""","""}) to Let n = P.Split(New Char() {","})
Dim path as String = "C:\linqpad\Patient.txt"
Dim pat = System.IO.File.ReadAllLines(path)
Dim Patz = From P in pat _
Let n = P.Split(New Char() {""","""}) _
Order by n(5) _
Select New With {
.Doc =n(1), _
.Loc = n(3), _
.Chart = n(5), _
.PatientID= n(31), _
.Title = n(13), _
.FirstName = n(9), _
.MiddleName = n(11), _
.LastName = n(7),
.StatusID = n(41) _
}
Patz.Dump()
Normally I would recommend finding a dedicated CSV parser (like this or this). However, I noticed this line in your question:
I need to determine for a given input, whether any line in the csv begins with that input.
That tells me that computer time spent parsing CSV data before this is determined is time wasted. You just need code to simply match text for text, and you can do that via a string comparison as easily as anything else.
Additionally, you mention that the data is sorted. This should allow you to speed things up tremendously... but you need to be aware that to take advantage of this you will need to write your own code to make seek calls on low-level file streams. This will be by far your best performing result, but it will also by far require the most initial work and maintenance.
I recommend an engineering-based approach, where you set a performance goal, build something relatively simple, and measure the results against that goal. In particular, start with the 2nd link I posted above. The CSV reader there will only load one record into memory at a time, so it should perform reasonably well, and it's easy to get started with. Build something that uses that reader, and measure the results. If they meet your goal, then stop there.
If they don't meet your goal, adapt the code from the link so that as you read each line you first do a string comparison (before bothering to parse the csv data), and only do the work to parse csv for the lines that match. This should perform better, but only do the work if the first option does not meet your goals. When this is ready, measure the performance again.
Finally, if you still don't meet the performance goal, we're into the territory of writing low-level code to do a binary search on your file stream using seek calls. This is likely the best you'll be able to do, performance-wise, but it will be very messy and bug-prone code to write, and so you only want to go here if you absolutely do not meet your goals from earlier steps.
Remember, performance is a feature, and just like any other feature you need to evaluate how you build for that feature relative to real design goals. "As fast as possible" is not a reasonable design goal. Something like "respond to a user search within .25 seconds" is a real design goal, and if the simpler but slower code still meets that goal, you need to stop there.
