Real-time stock quotes, StreamReader performance optimization - C#

I am working on a program that extracts real-time quotes for 900+ stocks from a website. I use HttpWebRequest to send an HTTP request to the site, get the response stream, and open a StreamReader over it with the following code:
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Stream stream = response.GetResponseStream();
StreamReader reader = new StreamReader(stream);
The received HTML is large (5,000+ lines), so parsing it and extracting the price takes a long time. For 900 files, parsing and extracting takes about 6 minutes, which my boss isn't happy with; he wants the whole process done in TWO minutes. I've identified that parsing and extracting is where most of the time goes, and I've tried to optimize the code to make it faster. The following is what I have after some optimization:
// skip lines at the top
for (int i = 0; i < 1500; ++i)
    reader.ReadLine();

// read the line that contains the price
string theLine = reader.ReadLine();
// ... extract the price from the line
It now takes about 4 minutes to process all the files, which is still a significant gap from what my boss expects. So I am wondering: is there another way I can further speed up the parsing and extracting and have everything done within 2 minutes?

I was doing HTML screen scraping for a while with stock quotes, but I found that Yahoo offers a great, simple web service that is much better than loading and scraping websites.
http://www.gummy-stuff.org/Yahoo-data.htm
With this service you can request up to 100 stock quotes in a single request, and it returns a CSV-formatted response with one line per symbol. You can set which columns are returned in the query string of the request. I built a small program that queried the service once a day for every stock in the market to get prices. It worked well for me and was far faster than hitting websites for the data.
An example query string would be
http://finance.yahoo.com/d/quotes.csv?s=GE&f=nkqwxyr1l9t5p4
Which returns text of
"GENERAL ELEC CO",32.98,"Jun 26","21.30 - 32.98","NYSE",2.66,"Jul 25",28.55,"Jul 3","-0.21%"

for (int i = 0; i < 1500; ++i)
    reader.ReadLine();
This in particular is not good. ReadLine reads the whole line and stores it somewhere, but nobody uses it; that's extra work for the GC. Read byte by byte and watch for the CR/LF pair (0x0D 0x0A) yourself.
Then you don't need a StreamReader at all: it is fat overhead, read from the stream directly.
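A minimal sketch of that idea, skipping lines by scanning raw bytes (the 1500-line skip count is carried over from the question; extracting the price from the returned line is still up to you):

using System;
using System.IO;
using System.Text;

class RawLineSkip
{
    // Skips the first skipLines lines by counting '\n' bytes in the raw stream,
    // then returns the next line as a string, without ever creating a StreamReader.
    static string ReadLineAfterSkipping(Stream stream, int skipLines)
    {
        var buffer = new byte[8192];
        var lineBytes = new MemoryStream();
        int newlines = 0;
        int read;

        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                if (newlines < skipLines)
                {
                    if (buffer[i] == (byte)'\n')
                        newlines++;                 // still skipping
                }
                else if (buffer[i] == (byte)'\n')
                {
                    // end of the line we want
                    return Encoding.ASCII.GetString(lineBytes.ToArray()).TrimEnd('\r');
                }
                else
                {
                    lineBytes.WriteByte(buffer[i]); // accumulating the wanted line
                }
            }
        }
        return Encoding.ASCII.GetString(lineBytes.ToArray()).TrimEnd('\r');
    }
}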

Hard to see how this is possible; StreamReader is blindingly fast compared to HttpWebRequest. Some basic assumptions: say you are downloading 900 files with 5000 lines of 100 chars each in 6 minutes. That means you need to download 900 x 5000 x 100 = 450 megabytes. In 6 minutes, that requires a bandwidth of 450E6 / 6 / 60 * 8 = 10 Mbps.
What do you have? 10 Mbps is about typical for high-speed Internet service, although you need a server that can sustain this. To get it down to 2 minutes, you'll need to upgrade your service to 30 Mbps. Your boss can fix that.
About the speed improvement you saw: watch out for the cache.

If you really need real-time data fast, you should subscribe to the data feeds rather than scrape them off a site.
Alternatively, isn't there some token you can search for to find the field/data pair(s) you need? (See the sketch below.)
Four minutes sounds ridiculously long for reading in 900 files.
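As a rough illustration of the token idea (the "lastPrice=" marker is entirely made up; use whatever reliably precedes the price in the actual page source):

using System;
using System.IO;
using System.Net;

class TokenSearch
{
    static string ExtractPrice(HttpWebResponse response)
    {
        string html;
        using (var reader = new StreamReader(response.GetResponseStream()))
            html = reader.ReadToEnd();

        const string token = "lastPrice=";           // hypothetical marker
        int start = html.IndexOf(token);
        if (start < 0) return null;

        start += token.Length;
        int end = html.IndexOfAny(new[] { '<', '&', ' ' }, start);
        return end < 0 ? html.Substring(start) : html.Substring(start, end - start);
    }
}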

Related

Most efficient way to make a large number of small POSTs to web service

I need to send approximately 10,000 small JSON strings from a C# client application to a web service, which will then insert them into a database.
My question: is it better to send a large number of small requests, or is there some way to send the whole thing in one or a few large chunks?
I'm trying to avoid something like this:
List<Thing> things_to_update = get_things_to_update(); // List now contains 10,000 records

foreach (Thing th in things_to_update)
{
    // POSTs using HttpWebRequest or similar
    post_to_web_service(th);
}
If you control the server and can change its code, it's definitely better to send them batched.
Just encode your objects as JSON and put them in an array. So instead of sending data like
data={ id: 5, name: "John" }
make it an array
data=[{ id: 5, name: "John" }, { id: 6, name: "Jane" }, ...]
then parse it in your controller action on the server side and insert the records. You should create a new action that handles more than one item at a time, for the sake of cleaner and more maintainable code. A rough client-side sketch is below.
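This is only a sketch of the client side; the batch endpoint URL and the Thing shape are assumptions, and JavaScriptSerializer is used purely as an example serializer.

using System.Collections.Generic;
using System.Net;
using System.Text;
using System.Web.Script.Serialization;   // JavaScriptSerializer

class BatchPoster
{
    // Posts one JSON array containing every Thing, instead of one request per Thing.
    static void PostBatch(List<Thing> things)
    {
        string json = new JavaScriptSerializer().Serialize(things);   // [{...},{...},...]
        byte[] body = Encoding.UTF8.GetBytes(json);

        // Hypothetical batch endpoint that accepts an array of Things.
        var request = (HttpWebRequest)WebRequest.Create("http://example.com/things/batch");
        request.Method = "POST";
        request.ContentType = "application/json";
        request.ContentLength = body.Length;

        using (var stream = request.GetRequestStream())
            stream.Write(body, 0, body.Length);

        using (var response = (HttpWebResponse)request.GetResponse())
        {
            // check response.StatusCode here if the service reports per-batch results
        }
    }
}

class Thing { public int id { get; set; } public string name { get; set; } }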
I'd suggest splitting it into smaller batches of 1,000 or 2,000 rather than sending all 10,000 at once. That's easily done with LINQ:

int batchSize = 1000;
for (int i = 0; i < things_to_update.Count; i += batchSize)
{
    List<Thing> batch = things_to_update.Skip(i).Take(batchSize).ToList();
    post_to_web_service(batch);
}
Post the whole List<Thing>.
I believe a string in C# is roughly 20 bytes plus 2x its length.
This may not be exactly correct, but just to get a rough approximation:
Assuming your "small" (?) JSON strings are about 100 chars long, that gives 10,000 string objects of approximately 220 bytes each, for a total of roughly 2,200,000 bytes, which is about 2 megabytes.
Not too much, in other words. I would definitely prefer that to 10,000 connections to your web service.
If for some reason 2 MB is too much for your web service to accept in one go, you could always split it into a few smaller packages of, say, 1 MB or 200 KB each. I've no doubt you'll still get far better performance that way than by sending 10,000 strings one at a time.
So yes, place your strings in one (or a handful of) array(s), and send them in a batch.

Bloomberglp.Blpapi.RequestQueueOverflowException: Queue Size: 128

I can't figure out what the problem is with the Bloomberg API.
Every time I try to download historical finance data, that is, create a data request for 5,000 instruments over 3 days, once in euros and once in local currency, I get this queue exception.
What's really confusing is that the program still works for the first request, which contains the euro prices for the instruments, but not for the second.
Thanks for the help.
Well, since you are not getting a slow-consumer warning, I'd bet you are simply requesting too much data in one request.
Try splitting your request into several chunks (see the sketch below).
The queue size for each request is 1024; if the size is > 1024, it will throw the queue-overflow exception.
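A rough sketch of the chunking idea: SendHistoricalRequest stands in for whatever code currently builds and sends a single historical data request, and the chunk size of 500 is an arbitrary figure comfortably below the queue limit.

using System;
using System.Collections.Generic;
using System.Linq;

class ChunkedRequests
{
    // Splits the 5,000 instruments into several smaller requests so that
    // no single request overflows the API's internal queue.
    static void RequestInChunks(List<string> securities, Action<List<string>> sendHistoricalRequest)
    {
        const int chunkSize = 500;   // arbitrary; tune it to stay under the limit
        for (int i = 0; i < securities.Count; i += chunkSize)
        {
            List<string> chunk = securities.Skip(i).Take(chunkSize).ToList();
            sendHistoricalRequest(chunk);   // your existing request-building code, per chunk
        }
    }
}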

How to download the data from the server discontinuously?

I need to download a large amount of data from the server. Because the data is so big, I am not able to download it all at once. Do you have any ideas? Thank you very much.
If the server supports it, you can use HTTP byte ranges to request specific parts of the file.
This page describes HTTP byte range requests:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35.1
The following code creates a request which will ask to skip the first 100 bytes, but return the rest of the file:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(@"http://example.com/somelargefile");
request.AddRange(100);   // sends "Range: bytes=100-"; the Range header can't be added via Headers.Add
The only logical way I can think of to do it is to pre-arrange the data into chunks for download, with an index. The index increments with the number of chunks received, so when the server sends down the file it knows it can skip (chunkCount * chunkSize) bytes of the stream and send down the next chunkSize bytes.
Of course, this means a rather large number of requests, so YMMV. A sketch of the client side is below.
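A minimal sketch of downloading in fixed-size pieces with HTTP range requests, building on the AddRange call above; the 1 MB chunk size is arbitrary, and it assumes the server answers HEAD requests and honours Range headers.

using System;
using System.IO;
using System.Net;

class ChunkedDownload
{
    // Asks the server for the total size, then fetches the file in 1 MB range requests.
    static void DownloadInChunks(string url, string destination)
    {
        const long chunkSize = 1024 * 1024;
        long length = GetContentLength(url);

        using (var output = File.Create(destination))
        {
            for (long offset = 0; offset < length; offset += chunkSize)
            {
                long last = Math.Min(offset + chunkSize, length) - 1;

                var request = (HttpWebRequest)WebRequest.Create(url);
                request.AddRange(offset, last);   // long overload needs .NET 4; int overload on older frameworks

                using (var response = (HttpWebResponse)request.GetResponse())
                using (var stream = response.GetResponseStream())
                {
                    var buffer = new byte[8192];
                    int read;
                    while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
                        output.Write(buffer, 0, read);
                }
            }
        }
    }

    static long GetContentLength(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "HEAD";
        using (var response = (HttpWebResponse)request.GetResponse())
            return response.ContentLength;
    }
}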
There is a Background Transfer Service code sample on MSDN that might help. I've never used it, but the sample might give you a place to start from.

Limiting number of calls using Yahoo YQL in C#

I'm a bit new to C# and I'm running into a problem with YQL limiting the number of calls to 10,000 an hour. I keep getting a temporary ban every time I try to run my app. I've read that Yahoo has a limit of 10,000 calls per hour, but I'm a little confused about what exactly constitutes a "call." The code I'm using to get the XML is below:
public static string getXml(string sSymbol)
{
    XDocument doc = XDocument.Load("http://www.google.com/ig/api?stock=" + sSymbol);
    string xmlraw = doc.ToString();
    string xml = xmlraw.Replace("'", "");
    return xml;
}
Here sSymbol is a value returned from my SQL database. I have roughly 2,000 stocks in the database. I have also read some people saying the limit is 1,000 calls per hour, so I may have misunderstood what I was reading.
The question, I guess, is two-fold: what constitutes a call?
And how can I avoid the rate limit if I want to download each of the 2,000 quotes every hour? Is it as simple as asking for 200 quotes per Load and calling Load 10 times?
In this case a call is a request. If you make single-stock requests you need 2,000 calls. Fortunately, you can request more than one stock in a single call, just as with Yahoo:
http://www.google.com/ig/api?stock=MSFT&stock=IBM
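A rough sketch of batching the symbols so that roughly 2,000 stocks become about 20 requests rather than 2,000; the batch size of 100 is arbitrary, the URL shape simply extends the example above, and whether the service accepts that many symbols per request is something to verify.

using System.Collections.Generic;
using System.Linq;
using System.Net;

class BatchedQuoteFetcher
{
    // Returns one raw XML response per batch of symbols.
    static List<string> FetchAll(List<string> symbols)
    {
        const int batchSize = 100;   // keeps the total request count well under the hourly cap
        var responses = new List<string>();

        using (var client = new WebClient())
        {
            for (int i = 0; i < symbols.Count; i += batchSize)
            {
                var batch = symbols.Skip(i).Take(batchSize);
                string url = "http://www.google.com/ig/api?stock=" + string.Join("&stock=", batch);
                responses.Add(client.DownloadString(url));
            }
        }
        return responses;
    }
}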

Efficient way to read a specific line number of a file. (BONUS: Python Manual Misprint)

I have a 100 GB text file, which is a BCP dump from a database. When I try to import it with BULK INSERT, I get a cryptic error on line number 219506324. Before solving this issue I would like to see this line, but alas my favorite method of
import linecache
print linecache.getline(filename, linenumber)
is throwing a MemoryError. Interestingly the manual says that "This function will never throw an exception." On this large file it throws one as I try to read line number 1, and I have about 6GB free RAM...
I would like to know what is the most elegant method to get to that unreachable line. Available tools are Python 2, Python 3 and C# 4 (Visual Studio 2010). Yes, I understand that I can always do something like
var line = 0;
using (var stream = new StreamReader(File.OpenRead(@"s:\source\transactions.dat")))
{
    while (++line < 219506324) stream.ReadLine(); // waste some cycles
    Console.WriteLine(stream.ReadLine());
}
Which would work, but I doubt it's the most elegant way.
EDIT: I'm waiting to close this thread, because the hard drive containing the file is being used right now by another process. I'm going to test both suggested methods and report timings. Thank you all for your suggestions and comments.
The results are in. I implemented Gabe's and Alex's methods to see which one was faster. If I'm doing anything wrong, do tell. I'm going for the 10-millionth line in my 100 GB file, first using the method Gabe suggested and then the method Alex suggested, which I loosely translated into C#. The only thing I added myself is first reading a 300 MB file into memory just to clear the HDD cache.
const string file = @"x:\....dat";      // 100 GB file
const string otherFile = @"x:\....dat"; // 300 MB file
const int linenumber = 10000000;

ClearHDDCache(otherFile);
GabeMethod(file, linenumber); // Gabe's method
ClearHDDCache(otherFile);
AlexMethod(file, linenumber); // Alex's method

// Results
// Gabe's method: 8290 (ms)
// Alex's method: 13455 (ms)
The implementation of Gabe's method is as follows:
var gabe = new Stopwatch();
gabe.Start();
var data = File.ReadLines(file).ElementAt(linenumber - 1);
gabe.Stop();
Console.WriteLine("Gabe's method: {0} (ms)", gabe.ElapsedMilliseconds);
While Alex's method is slightly trickier:
var alex = new Stopwatch();
alex.Start();
const int buffersize = 100 * 1024; // bytes
var buffer = new byte[buffersize];
var counter = 0;
using (var filestream = File.OpenRead(file))
{
    while (true) // Cutting corners here...
    {
        filestream.Read(buffer, 0, buffersize);
        // At this point we could probably launch an async read into the next chunk...
        var linesread = buffer.Count(b => b == 10); // 10 is the ASCII line feed.
        if (counter + linesread >= linenumber) break;
        counter += linesread;
    }
}
// The downside of this method is that we have to assume that the line fits into the buffer, or do something clever...er
var data = new ASCIIEncoding().GetString(buffer).Split('\n').ElementAt(linenumber - counter - 1);
alex.Stop();
Console.WriteLine("Alex's method: {0} (ms)", alex.ElapsedMilliseconds);
So unless Alex cares to comment I'll mark Gabe's solution as accepted.
Here's my elegant version in C#:
Console.Write(File.ReadLines(@"s:\source\transactions.dat").ElementAt(219506323));
or more general:
Console.Write(File.ReadLines(filename).ElementAt(linenumber - 1));
Of course, you may want to show some context before and after the given line:
Console.Write(string.Join("\n",
File.ReadLines(filename).Skip(linenumber - 5).Take(10)));
or more fluently:
File
    .ReadLines(filename)
    .Skip(linenumber - 5)
    .Take(10)
    .ToObservable()
    .Subscribe(Console.WriteLine);
BTW, the linecache module does not do anything clever with large files. It just reads the whole thing in, keeping it all in memory. The only exceptions it catches are I/O-related (can't access file, file not found, etc.). Here's the important part of the code:
fp = open(fullname, 'rU')
lines = fp.readlines()
fp.close()
In other words, it's trying to fit the whole 100GB file into 6GB of RAM! What the manual should say is maybe "This function will never throw an exception if it can't access the file."
Well, memory can run out at any time, asynchronously and unpredictably -- that's why the "never an exception" promise doesn't really apply there (just like, say, in Java, where every method must specify which exceptions it can raise, some exceptions are exempted from this rule, since just about any method, unpredictably, can raise them at any time due to resource scarcity or other system-wide issues).
linecache tries to read the whole file. Your only simple alternative (hopefully you're not in a hurry) is to read one line at a time from the start...:
def readoneline(filepath, linenum):
    if linenum < 0: return ''
    with open(filepath) as f:
        for i, line in enumerate(f):
            if i == linenum: return line
    return ''
Here, linenum is 0-based (if you don't like that, and your Python is 2.6 or better, pass a starting value of 1 to enumerate), and the return value is the empty string for invalid line numbers.
Somewhat faster (and a lot more complicated) is to read, say, 100 MB at a time (in binary mode) into a buffer; count the number of line-ends in the buffer (just a .count('\n') call on the buffer string object); once the running total of line ends exceeds the linenum you're looking for, find the Nth line-end currently in the buffer (where N is the difference between linenum, here 1-based, and the previous running total of line ends), read a bit more if the N+1st line-end is not also in the buffer (as that's the point where your line ends), extract the relevant substring. Not just a couple of lines net of the with and returns for anomalous cases...;-).
Edit: since the OP comments doubting that reading by-buffer instead of by-line can make a performance difference, I dug up an old piece of code where I was measuring the two approaches for a somewhat-related task -- counting the number of lines with the buffer approach, a loop on lines, or reading the whole file into memory at one gulp (with readlines). The target file is kjv.txt, the standard English text of the King James Version of the Bible, one line per verse, ASCII:
$ wc kjv.txt
114150 821108 4834378 kjv.txt
Platform is a Macbook Pro laptop, OSX 10.5.8, Intel Core 2 Duo at 2.4 GHz, Python 2.6.5.
The module for the test, readkjv.py:
def byline(fn='kjv.txt'):
    with open(fn) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1

def byall(fn='kjv.txt'):
    with open(fn) as f:
        return len(f.readlines())

def bybuf(fn='kjv.txt', BS=100*1024):
    with open(fn, 'rb') as f:
        tot = 0
        while True:
            blk = f.read(BS)
            if not blk: return tot
            tot += blk.count('\n')

if __name__ == '__main__':
    print bybuf()
    print byline()
    print byall()
The prints are just to confirm correctness of course (and do;-).
The measurement, of course after a few dry runs to ensure everybody's benefitting equally from the OS's, disk controller's, and filesystem's read-ahead functionality (if any):
$ py26 -mtimeit -s'import readkjv' 'readkjv.byall()'
10 loops, best of 3: 40.3 msec per loop
$ py26 -mtimeit -s'import readkjv' 'readkjv.byline()'
10 loops, best of 3: 39 msec per loop
$ py26 -mtimeit -s'import readkjv' 'readkjv.bybuf()'
10 loops, best of 3: 25.5 msec per loop
The numbers are quite repeatable. As you see, even on such a tiny file (less than 5 MB!), by-line approaches are slower than buffer-based ones -- just too much wasted effort!
To check scalability, I next used a 4-times-larger file, as follows:
$ cat kjv.txt kjv.txt kjv.txt kjv.txt >k4.txt
$ wc k4.txt
456600 3284432 19337512 k4.txt
$ py26 -mtimeit -s'import readkjv' 'readkjv.bybuf()'
10 loops, best of 3: 25.4 msec per loop
$ py26 -mtimeit -s'import readkjv' 'readkjv.bybuf("k4.txt")'
10 loops, best of 3: 102 msec per loop
and, as predicted, the by-buffer approach scales just about exactly linearly. Extrapolating (always a risky endeavour, of course;-), a bit less than 200 MB per second seems to be the predictable performance -- call it 6 seconds per GB, or maybe 10 minutes for 100 GB.
Of course what this small program does is just line counting, but (once there is enough I/O to amortize the constant overheads;-) a program to read a specific line should have similar performance (even though it needs more processing once it's found "the" buffer of interest, it's a roughly constant amount of processing for a buffer of a given size -- presumably repeated halving of the buffer to identify a small-enough part of it, then a little bit of effort linear in the size of the multiply-halved "buffer remainder").
Elegant? Not really... but, for speed, pretty hard to beat!-)
You can try this sed one-liner: sed '42q;d' file fetches line number 42 of file. It's not Python or C#, but I assume you have sed on your Mac.
Not an elegant but a faster solution would be to use multiple threads (or tasks in .NET 4.0) to read and process multiple chunks of the file at the same time; a rough sketch follows.
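A rough sketch of that idea applied to the counting part of the job: split the file into byte ranges and count newline bytes in each range on its own task, each task opening its own read-only stream (finding the target line afterwards still requires combining the per-chunk counts in order). Whether this helps at all depends on whether the disk, rather than the CPU, is the bottleneck.

using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class ParallelLineCount
{
    // Counts '\n' bytes in [start, start + length) using a private stream per task.
    static long CountNewlines(string path, long start, long length)
    {
        var buffer = new byte[100 * 1024];
        long count = 0, remaining = length;
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            fs.Seek(start, SeekOrigin.Begin);
            while (remaining > 0)
            {
                int read = fs.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
                if (read == 0) break;
                for (int i = 0; i < read; i++)
                    if (buffer[i] == (byte)'\n') count++;
                remaining -= read;
            }
        }
        return count;
    }

    // Splits the file into `degree` ranges and counts them concurrently.
    static long CountAll(string path, int degree)
    {
        long fileLength = new FileInfo(path).Length;
        long chunk = fileLength / degree;

        var tasks = Enumerable.Range(0, degree)
            .Select(n => Task.Factory.StartNew(() =>
                CountNewlines(path,
                              n * chunk,
                              n == degree - 1 ? fileLength - n * chunk : chunk)))
            .ToArray();

        Task.WaitAll(tasks);
        return tasks.Sum(t => t.Result);
    }
}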
If you expect to need this operation often on the same file, it would make sense to build an index.
You build the index by going through the whole file once and recording the positions of the line beginnings, for example in a sqlite database. Then, when you need a specific line, you query the index for its position, seek to that position, and read the line. A minimal sketch is below.
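A minimal sketch of the indexing idea, using a flat file of 8-byte offsets instead of sqlite (the separate index path is an assumption; line numbers here are 0-based):

using System.IO;
using System.Text;

class LineIndex
{
    // One pass over the data file, writing the byte offset of each line start.
    static void BuildIndex(string dataPath, string indexPath)
    {
        using (var data = File.OpenRead(dataPath))
        using (var index = new BinaryWriter(File.Create(indexPath)))
        {
            var buffer = new byte[100 * 1024];
            long position = 0;
            index.Write(0L);                           // line 0 starts at offset 0
            int read;
            while ((read = data.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < read; i++)
                    if (buffer[i] == (byte)'\n')
                        index.Write(position + i + 1); // next line starts right after the '\n'
                position += read;
            }
        }
    }

    // Looks up the offset of a 0-based line number, seeks there, and reads the line.
    static string ReadLineAt(string dataPath, string indexPath, long lineNumber)
    {
        long offset;
        using (var index = new BinaryReader(File.OpenRead(indexPath)))
        {
            index.BaseStream.Seek(lineNumber * sizeof(long), SeekOrigin.Begin);
            offset = index.ReadInt64();
        }
        using (var data = File.OpenRead(dataPath))
        {
            data.Seek(offset, SeekOrigin.Begin);
            using (var reader = new StreamReader(data, Encoding.ASCII))
                return reader.ReadLine();
        }
    }
}

Building the index costs one full sequential pass (comparable to the by-buffer timing above), but every later lookup is just one small seek in the index plus one seek in the data file.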
