We are currently using ZedGraph to draw a line chart of some data. The input data comes from a file of arbitrary size; therefore, we do not know the maximum number of data points in advance. However, by opening the file and reading the header, we can find out how many data points are in the file.
The file format is essentially [time (double), value (double)]. However, the entries are not uniform along the time axis. There may not be any points between, say, t = 0 sec and t = 10 sec, but there might be 100K entries between t = 10 sec and t = 11 sec, and so on.
As an example, our test dataset file is ~2.6 GB and it has 324M points. We'd like to show the entire graph to the user and let her navigate through the chart. However, loading 324M points into ZedGraph is not only impossible (we're on a 32-bit machine), but also not useful, since there is no point in having so many points on the screen.
Using the FilteredPointList feature of ZedGraph also appears to be out of the question, since that requires loading the entire dataset first and then filtering it.
So, unless we're missing something, it appears that our only solution is to somehow decimate the data. However, as we keep working on it, we're running into a lot of problems:
1- How do we decimate data that is not arriving uniformly in time?
2- Since the entire dataset can't be loaded into memory, any algorithm needs to work on the disk and so needs to be designed carefully.
3- How do we handle zooming in and out, especially when the data is not uniform on the x-axis?
If the data were uniform, upon the initial load of the graph we could Seek() by a predefined number of entries in the file, take every Nth sample, and feed it to ZedGraph. However, since the data is not uniform, we have to be more intelligent in choosing the samples to display, and we can't come up with any intelligent algorithm that would not have to read the entire file.
I apologize since the question does not have razor-sharp specificity, but I hope I could explain the nature and scope of our problem.
We're on Windows 32-bit, .NET 4.0.
I've needed this before, and it's not easy to do. I ended up writing my own graph component because of this requirement. It turned out better in the end because I put in all the features we needed.
Basically, you need to get the range of data (min and max possible/needed index values), subdivide it into segments (let's say 100 segments), and then determine a value for each segment by some algorithm (average value, median value, etc.). Then you plot based on those summarized 100 elements. This is much faster than trying to plot millions of points :-).
So what I am saying is similar to what you are saying. You mention that you do not want to plot every Xth element because there might be a long stretch of time (index values on the x-axis) between elements. What I am saying is that for each subdivision of the data you determine the best value, and take that as the data point. My method is index-value based, so in your example of no data between the 0-sec and 10-sec index values I would still put data points there; they would just have the same value as each other.
The point is to summarize the data before you plot it. Think through your summarization algorithm carefully; there are lots of ways to do it, so choose the one that works for your application.
You might get away with not writing your own graph component and just write the data summarization algorithm.
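To make that concrete, here is a minimal C# sketch of the segment idea, assuming the points for the visible range are already available as (time, value) pairs and using the average as the per-segment summary; the class and method names are made up for illustration.

using System;
using System.Collections.Generic;

// Minimal sketch: split the visible x-range into a fixed number of segments
// and represent each segment by a single summary value (here: the average).
static class SegmentSummarizer
{
    public static List<KeyValuePair<double, double>> Summarize(
        IEnumerable<KeyValuePair<double, double>> points,   // (time, value), sorted by time
        double xMin, double xMax, int segmentCount)
    {
        var sums = new double[segmentCount];
        var counts = new int[segmentCount];
        double width = (xMax - xMin) / segmentCount;

        foreach (var p in points)
        {
            if (p.Key < xMin || p.Key > xMax) continue;
            int i = Math.Min(segmentCount - 1, (int)((p.Key - xMin) / width));
            sums[i] += p.Value;
            counts[i]++;
        }

        var result = new List<KeyValuePair<double, double>>(segmentCount);
        double last = 0.0;
        for (int i = 0; i < segmentCount; i++)
        {
            // Empty segments (gaps in time) just repeat the previous value,
            // as described above.
            if (counts[i] > 0) last = sums[i] / counts[i];
            result.Add(new KeyValuePair<double, double>(xMin + (i + 0.5) * width, last));
        }
        return result;
    }
}

Swap the average for a median, min/max, or whatever summary best suits your signal.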
I would approach this in two steps:
Pre-processing the data
Displaying the data
Step 1
The file should be preprocessed into a binary fixed format file.
Adding an index to the format, each record would be int, double, double.
See this article for speed comparisons:
http://www.codeproject.com/KB/files/fastbinaryfileinput.aspx
You can then either break the file up into time intervals, say one per hour or day, which gives you an easy way to access different time intervals. You could also just keep one big file and have an index file which tells you where to find specific times, e.g.
1,1/27/2011 8:30:00
13456,1/27/2011 9:30:00
By using one of these methods you will be able to quickly find any block of data, either by time (via the index or the file name) or by number of entries (thanks to the fixed byte format).
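As a rough illustration of the fixed-record idea in C#: each int/double/double record is 20 bytes, so record N lives at byte offset N * 20 and can be reached with a single Seek. The class and method names below are placeholders.

using System;
using System.IO;

// Sketch of the fixed-size record idea: int (index), double (time),
// double (value) = 4 + 8 + 8 = 20 bytes per record.
static class FixedRecordFile
{
    const int RecordSize = sizeof(int) + sizeof(double) + sizeof(double); // 20 bytes

    public static void Append(string path, int index, double time, double value)
    {
        using (var w = new BinaryWriter(File.Open(path, FileMode.Append, FileAccess.Write)))
        {
            w.Write(index);
            w.Write(time);
            w.Write(value);
        }
    }

    public static void ReadRecord(string path, long recordNumber,
                                  out int index, out double time, out double value)
    {
        using (var r = new BinaryReader(File.OpenRead(path)))
        {
            // Jump straight to the requested record; no need to scan the file.
            r.BaseStream.Seek(recordNumber * RecordSize, SeekOrigin.Begin);
            index = r.ReadInt32();
            time = r.ReadDouble();
            value = r.ReadDouble();
        }
    }
}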
Step 2
Ways to show data
1. Just display each record by index.
2. Normalize data and create aggregate data bars with open, high, low, close values (see the sketch below).
a. By Time
b. By record count
c. By difference between values
For more possible ways to aggregate non-uniform data sets, you may want to look at the different methods used to aggregate trade data in the financial markets. Of course, for speed in real-time rendering, you would want to create files with this data already aggregated.
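For the by-record-count variant of option 2, a hedged sketch of what the aggregation might look like in C# follows; the OhlcBar struct and method names are invented for illustration.

using System;
using System.Collections.Generic;

// Sketch: collapse every `recordsPerBar` consecutive samples into one
// open/high/low/close bar, the same way trade ticks are rolled up into candles.
struct OhlcBar
{
    public double Time, Open, High, Low, Close;
}

static class OhlcAggregator
{
    public static List<OhlcBar> ByRecordCount(IEnumerable<KeyValuePair<double, double>> samples,
                                              int recordsPerBar)
    {
        var bars = new List<OhlcBar>();
        OhlcBar current = default(OhlcBar);
        int n = 0;

        foreach (var s in samples)   // s.Key = time, s.Value = value
        {
            if (n == 0)
            {
                current = new OhlcBar { Time = s.Key, Open = s.Value,
                                        High = s.Value, Low = s.Value, Close = s.Value };
            }
            else
            {
                current.High = Math.Max(current.High, s.Value);
                current.Low = Math.Min(current.Low, s.Value);
                current.Close = s.Value;
            }

            if (++n == recordsPerBar)
            {
                bars.Add(current);
                n = 0;
            }
        }
        if (n > 0) bars.Add(current);   // partial final bar
        return bars;
    }
}

The by-time and by-difference variants only change the condition that closes a bar.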
1- How do we decimate data that is not arriving uniformly in time?
(Note - I'm assuming your loader datafile is in text format.)
On a similar project, I had to read datafiles that were more than 5GB in size. The only way I could parse it out was by reading it into an RDBMS table. We chose MySQL because it makes importing text files into datatables drop-dead simple. (An interesting aside -- I was on a 32-bit Windows machine and couldn't open the text file for viewing, but MySQL read it no problem.) The other perk was MySQL is screaming, screaming fast.
Once the data was in the database, we could easily sort it and boil large amounts of data down with single summary queries (using built-in SQL aggregate functions like SUM). MySQL could even write its query results back out to a text file for use as loader data.
Long story short, consuming that much data mandates the use of a tool that can summarize the data. MySQL fits the bill (pun intended...it's free).
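For what it's worth, here is a rough sketch of the kind of summary query you could run from C#, assuming the MySql.Data connector is referenced; the table and column names (samples, t, v), the connection string, and the 1-second bucketing are all placeholders, not the actual schema.

using System;
using MySql.Data.MySqlClient;   // assumes the MySql.Data connector is referenced

class SummaryQuery
{
    static void Main()
    {
        // Bucket the raw samples into 1-second bins and let MySQL do the summarizing.
        const string sql =
            "SELECT FLOOR(t) AS bucket, MIN(v), MAX(v), AVG(v) " +
            "FROM samples GROUP BY FLOOR(t) ORDER BY bucket";

        using (var conn = new MySqlConnection("Server=localhost;Database=telemetry;Uid=user;Pwd=pass;"))
        using (var cmd = new MySqlCommand(sql, conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine("{0}: min={1} max={2} avg={3}",
                        reader[0], reader[1], reader[2], reader[3]);
                }
            }
        }
    }
}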
A relatively easy alternative I've found is to do the following:
Iterate through the data in small point groupings (say 3 to 5 points at a time - the larger the group, the faster the algorithm will work but the less accurate the aggregation will be).
Compute the min & max of the small group.
Remove all points that are not the min or max from that group (i.e. you only keep 2 points from each group and omit the rest).
Keep looping through the data (repeating this process) from start to end, removing points, until the aggregated data set has few enough points that it can be charted without choking the PC.
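A minimal C# sketch of those steps (the names are invented for illustration):

using System;
using System.Collections.Generic;

// Walk the data in small groups and keep only each group's minimum and maximum
// point, preserving peaks and valleys while discarding the rest.
static class MinMaxDecimator
{
    public static List<KeyValuePair<double, double>> Decimate(
        IList<KeyValuePair<double, double>> points, int groupSize)
    {
        var kept = new List<KeyValuePair<double, double>>();

        for (int start = 0; start < points.Count; start += groupSize)
        {
            int end = Math.Min(start + groupSize, points.Count);
            int minIdx = start, maxIdx = start;

            for (int i = start + 1; i < end; i++)
            {
                if (points[i].Value < points[minIdx].Value) minIdx = i;
                if (points[i].Value > points[maxIdx].Value) maxIdx = i;
            }

            // Keep min and max in their original time order so the line
            // still runs left to right.
            if (minIdx < maxIdx) { kept.Add(points[minIdx]); kept.Add(points[maxIdx]); }
            else if (maxIdx < minIdx) { kept.Add(points[maxIdx]); kept.Add(points[minIdx]); }
            else kept.Add(points[minIdx]);   // group of a single point
        }
        return kept;
    }
}

Call Decimate repeatedly (or with a larger group size) until the result is small enough to chart.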
I've used this algorithm in the past to take datasets of ~10 million points down to the order of ~5K points without any obvious visible distortion to the graph.
The idea here is that, while throwing out points, you're preserving the peaks and valleys so the "signal" viewed in the final graph isn't "averaged down" (normally, if averaging, you'll see the peaks and the valleys become less prominent).
The other advantage is that you're always seeing "real" datapoints on the final graph (it's missing a bunch of points, but the points that are there were actually in the original dataset so, if you mouse over something, you can show the actual x & y values because they're real, not averaged).
Lastly, this also helps with the problem of not having consistent x-axis spacing (again, you'll have real points instead of averaged x-axis positions).
I'm not sure how well this approach would work w/ 100s of millions of datapoints like you have, but it might be worth a try.
I am really stumped by this problem, and as a result I have stopped working on it for a while. I work with really large amounts of data. I get approximately 200 GB of .txt data every week. The data can range up to 500 million lines. A lot of these are duplicates; I would guess only 20 GB is unique. I have had several custom programs made, including a hash-based duplicate remover and an external duplicate remover, but none seem to work. The latest one used a temp database but took several days to remove the data.
The problem with all the programs is that they crash after a certain point, and after spending a large amount of money on them I thought I would come online and see if anyone can help. I understand this has been answered here before, and I have spent the last 3 hours reading about 50 threads, but none seem to have the same problem as me, i.e. huge datasets.
Can anyone recommend anything for me? It needs to be super accurate and fast, and preferably not memory-based, as I only have 32 GB of RAM to work with.
The standard way to remove duplicates is to sort the file and then do a sequential pass to remove duplicates. Sorting 500 million lines isn't trivial, but it's certainly doable. A few years ago I had a daily process that would sort 50 to 100 gigabytes on a 16 GB machine.
By the way, you might be able to do this with an off-the-shelf program. Certainly the GNU sort utility can sort a file larger than memory. I've never tried it on a 500 GB file, but you might give it a shot. You can download it along with the rest of the GNU Core Utilities. That utility has a --unique option, so you should be able to just sort --unique input-file > output-file. It uses a technique similar to the one I describe below. I'd suggest trying it on a 100 megabyte file first, then slowly working up to larger files.
With GNU sort and the technique I describe below, it will perform a lot better if the input and temporary directories are on separate physical disks. Put the output either on a third physical disk, or on the same physical disk as the input. You want to reduce I/O contention as much as possible.
There might also be a commercial (i.e. paid) program that will do the sorting. Developing a program that will sort a huge text file efficiently is a non-trivial task. If you can buy something for a few hundred dollars, you're probably money ahead if your time is worth anything.
If you can't use a ready made program, then . . .
If your text is in multiple smaller files, the problem is easier to solve. You start by sorting each file, removing duplicates from those files, and writing the sorted temporary files that have the duplicates removed. Then run a simple n-way merge to merge the files into a single output file that has the duplicates removed.
If you have a single file, you start by reading as many lines as you can into memory, sorting those, removing duplicates, and writing a temporary file. You keep doing that for the entire large file. When you're done, you have some number of sorted temporary files that you can then merge.
In pseudocode, it looks something like this:
fileNumber = 0
while not end-of-input
    load as many lines as you can into a list
    sort the list
    filename = "file" + fileNumber
    write sorted list to filename, optionally removing duplicates
    fileNumber = fileNumber + 1
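For concreteness, a rough C# version of that pseudocode might look like this; the chunk size and temp-file naming are arbitrary choices you would tune.

using System;
using System.Collections.Generic;
using System.IO;

// Read the big file in chunks that fit in memory, sort each chunk, drop
// duplicates within the chunk, and write it to a numbered temporary file.
class ChunkSorter
{
    public static List<string> SplitAndSort(string inputPath, int linesPerChunk)
    {
        var tempFiles = new List<string>();
        var chunk = new List<string>(linesPerChunk);
        int fileNumber = 0;

        using (var reader = new StreamReader(inputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                chunk.Add(line);
                if (chunk.Count == linesPerChunk)
                {
                    tempFiles.Add(WriteSortedChunk(chunk, fileNumber++));
                    chunk.Clear();
                }
            }
        }
        if (chunk.Count > 0)
            tempFiles.Add(WriteSortedChunk(chunk, fileNumber));
        return tempFiles;
    }

    static string WriteSortedChunk(List<string> chunk, int fileNumber)
    {
        chunk.Sort(StringComparer.Ordinal);
        string path = "chunk" + fileNumber + ".tmp";
        using (var writer = new StreamWriter(path))
        {
            string previous = null;
            foreach (var line in chunk)
            {
                if (line != previous) writer.WriteLine(line);   // skip duplicates within the chunk
                previous = line;
            }
        }
        return path;
    }
}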
You don't really have to remove the duplicates from the temporary files, but if your unique data is really only 10% of the total, you'll save a huge amount of time by not outputting duplicates to the temporary files.
Once all of your temporary files are written, you need to merge them. From your description, I figure each chunk that you read from the file will contain somewhere around 20 million lines. So you'll have maybe 25 temporary files to work with.
You now need to do a k-way merge. That's done by creating a priority queue. You open each file, read the first line from each file and put it into the queue along with a reference to the file that it came from. Then, you take the smallest item from the queue and write it to the output file. To remove duplicates, you keep track of the previous line that you output, and you don't output the new line if it's identical to the previous one.
Once you've output the line, you read the next line from the file that the one you just output came from, and add that line to the priority queue. You continue this way until you've emptied all of the files.
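A sketch of that k-way merge in C# might look like the following. Note that PriorityQueue<TElement, TPriority> is the .NET 6+ type; on older frameworks a SortedSet with a custom comparer can play the same role.

using System;
using System.Collections.Generic;
using System.IO;

// Merge the sorted temporary files, writing each line once.
class KWayMerge
{
    public static void Merge(IList<string> sortedFiles, string outputPath)
    {
        var readers = new StreamReader[sortedFiles.Count];
        var queue = new PriorityQueue<int, string>(StringComparer.Ordinal);

        try
        {
            // Prime the queue with the first line of every file.
            for (int i = 0; i < sortedFiles.Count; i++)
            {
                readers[i] = new StreamReader(sortedFiles[i]);
                string first = readers[i].ReadLine();
                if (first != null) queue.Enqueue(i, first);
            }

            using (var writer = new StreamWriter(outputPath))
            {
                string previous = null;
                int fileIndex;
                string line;
                while (queue.TryDequeue(out fileIndex, out line))
                {
                    if (line != previous) writer.WriteLine(line);   // drop duplicates
                    previous = line;

                    // Refill from the file the smallest line came from.
                    string next = readers[fileIndex].ReadLine();
                    if (next != null) queue.Enqueue(fileIndex, next);
                }
            }
        }
        finally
        {
            foreach (var r in readers) if (r != null) r.Dispose();
        }
    }
}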
I published a series of articles some time back about sorting a very large text file. It uses the technique I described above. The only thing it doesn't do is remove duplicates, but that's a simple modification to the methods that output the temporary files and the final output method. Even without optimizations, the program performs quite well. It won't set any speed records, but it should be able to sort and remove duplicates from 500 million lines in less than 12 hours. Probably much less, considering that the second pass is only working with a small percentage of the total data (because you removed duplicates from the temporary files).
One thing you can do to speed up the program is to operate on smaller chunks and sort one chunk in a background thread while you're loading the next chunk into memory. You end up having to deal with more temporary files, but that's really not a problem. The heap operations are slightly slower, but that extra time is more than recaptured by overlapping the input and output with the sorting. You end up getting the I/O essentially for free. At typical hard drive speeds, loading 500 gigabytes will take somewhere in the neighborhood of two and a half to three hours.
Take a look at the article series. It's many different, mostly small, articles that take you through the entire process that I describe, and it presents working code. I'm happy to answer any questions you might have about it.
I am no specialist in such algorithms, but if it is textual data (or numbers, it doesn't matter), you can try to read your big file and write it out into several files keyed by the first two or three characters: all lines starting with "aaa" go to aaa.txt, all lines starting with "aab" go to aab.txt, etc. You'll get lots of files within which the data are in an equivalence relation: any duplicate of a line is in the same file as the line itself. Now just parse each file in memory and you're done.
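If you want to try it, a rough C# sketch of that splitting step might look like this; the two-character prefix and the file naming are only illustrative, and real data would need the prefix sanitized before using it in a file name.

using System;
using System.Collections.Generic;
using System.IO;

// Route each line to a bucket file named after its first two characters,
// then deduplicate each (much smaller) bucket in memory with a HashSet.
class PrefixBuckets
{
    public static void Split(string inputPath, string bucketDir)
    {
        Directory.CreateDirectory(bucketDir);
        var writers = new Dictionary<string, StreamWriter>();

        using (var reader = new StreamReader(inputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string key = line.Length >= 2 ? line.Substring(0, 2) : line;
                // Only safe for plain alphanumeric prefixes; sanitize otherwise.
                string path = Path.Combine(bucketDir, "bucket_" + key + ".txt");

                StreamWriter w;
                if (!writers.TryGetValue(key, out w))
                {
                    w = new StreamWriter(path);
                    writers[key] = w;
                }
                w.WriteLine(line);
            }
        }
        foreach (var w in writers.Values) w.Dispose();
    }

    public static void DeduplicateBucket(string bucketPath, string outputPath)
    {
        var seen = new HashSet<string>(File.ReadLines(bucketPath));
        File.WriteAllLines(outputPath, seen);
    }
}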
Again, I'm not sure that it will work, but I'd try this approach first...
We are all aware of the popular trend of MMO games, where players face each other live. However, during gameplay there is a tremendous flow of SQL inserts and queries, as given below:
There are on average/at minimum 100 tournaments online per 12 minutes, or 500 players/hour
In the Game Progress table, we store each player move
For a 12-round tournament of 4 players there can be 48 records
plus around the same number for spells or special items
a total of 96 per tournament, or 48000 record inserts per hour (500 players/hour)
In response to my previous question ( Improve MMO game performance ), I changed the schema and we are no longer writing directly to the database.
Instead, we accumulate all values in a DataTable. Whenever the DataTable has more than 100K rows (which can sometimes happen even within the hour), the process writes it out to a text file in CSV format. Another background application frequently scans the folder for CSV files, reads any available CSV file, and stores the information in the server database.
Questions
Can we access the DataTable in the game application directly from another application (which would read the DataTable and clear the records it has read), so that instead of writing to and reading from disk, we read and write directly from memory?
Is there anything quicker than a DataTable that can hold large amounts of data and still be fast for sorting and update operations? We have to frequently scan for user IDs and update game status (at almost every insert). It could be a cache utility, a fast scan/search algorithm, or even a collection model. Right now, we use a foreach loop to go through all records in the DataTable and update a row if the user is present; if not, we create a new row. I tried using SortedList and custom classes, but it not only doubles the effort, memory usage also increases tremendously, slowing down overall game performance.
thanks
arvind
Well, let's answer:
You can share objects between applications using Remoting, but it's much slower and makes the code less readable. However, there is another solution that keeps you working in memory: you can use MemoryMappedFiles, so all the work actually uses memory and not the disk: http://msdn.microsoft.com/en-us/library/dd997372.aspx
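As a rough sketch of what that might look like: the map name, the simple count-plus-fixed-record layout, and the absence of cross-process locking are all simplifications for illustration; a real implementation would guard the map with a named Mutex.

using System;
using System.IO.MemoryMappedFiles;

// Minimal sketch of sharing values between two processes through a named
// memory-mapped file (System.IO.MemoryMappedFiles, .NET 4.0+).
class SharedBuffer : IDisposable
{
    const string MapName = "GameProgressMap";   // placeholder name
    const long Capacity = 1024 * 1024;          // 1 MB

    readonly MemoryMappedFile _mmf;
    readonly MemoryMappedViewAccessor _view;

    // Writer side: create (or open) the named map and keep it alive for the
    // lifetime of the game application; a named map vanishes once the last
    // handle to it is closed.
    public SharedBuffer()
    {
        _mmf = MemoryMappedFile.CreateOrOpen(MapName, Capacity);
        _view = _mmf.CreateViewAccessor();
    }

    public void WriteMove(int userId, double score)
    {
        int count = _view.ReadInt32(0);        // record count stored at offset 0
        long offset = 4 + (long)count * 12;    // 4-byte int + 8-byte double per record
        _view.Write(offset, userId);
        _view.Write(offset + 4, score);
        _view.Write(0, count + 1);
    }

    // Reader side (the other process): open the same map by name.
    public static void ReadAndClear()
    {
        using (var mmf = MemoryMappedFile.OpenExisting(MapName))
        using (var view = mmf.CreateViewAccessor())
        {
            int count = view.ReadInt32(0);
            for (int i = 0; i < count; i++)
            {
                long offset = 4 + (long)i * 12;
                Console.WriteLine("{0}: {1}", view.ReadInt32(offset), view.ReadDouble(offset + 4));
            }
            view.Write(0, 0);   // mark the records as consumed
        }
    }

    public void Dispose()
    {
        _view.Dispose();
        _mmf.Dispose();
    }
}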
You can use a NoSQL DB of some kind (there are many out there: Redis, MongoDB, RavenDB), all of them based on key-value access, and you should test their performance. Even better, some of these databases are persistent and can be used with multiple servers.
Hope this helps.
Using memcache would increase your performance
Has anyone had any luck getting good performance out of the built-in C# chart running in real time?
I have made a chart with 3 chart areas and 9 series, which are all the FastLine type. I add points to the series in real time and shift the graphs over once 7 seconds of data has been graphed. All this works fine; however, the rate at which my graphs update is horribly slow. Sometimes it can take almost a second for the data being fed in to be shown in the graph (and many times I wonder whether it is accurately updating my graph with my data, since it is so slow and the data changes can be so fast).
I have tried using mychart.Series.SuspendUpdates(), Series.ResumeUpdates(), and Series.Invalidate(), as I saw in different postings, with no noticeable results.
If anyone could share some insight about ways to optimize, I would be truly grateful. (Cutting the number of data points is not a valid optimization.)
Thanks in advance
OCV
If external libraries are an option, ZedGraph worked great for me when displaying data at 10 ms intervals (up to 8 series).
If you really must use the built-in C# chart, I think you might prevent blocking by separating the data handling and the drawing into separate threads.
I'm working on a project in which we need to summarize a substantial amount of data in the form of a heat map. This data will be kept in a database for as long as possible. At some point, we will need to store a summary in a matrix (possibly?) before we can draw the blocks of the heat map to the screen. We are creating a Windows Forms application in C#.
Let's assume the heat map is going to summarize a log file for an online mapping program such as google maps. It will assign a color to a particular address or region based on the number of times a request was made to that region/address. It can summarize the data at differing levels of detail. That is, each block on the heat map can summarize data for a particular address (max detail, therefore billions/millions of blocks) or it can summarize for requests to a street, city, or country (minimum detail -- few blocks as they each represent a country). Imagine that millions of requests were made for addresses. We have considered summarizing this with a database. The problem is that we need to draw so many blocks to the screen (up to billions, but usually much less). Let's assume this data is summarized in a database table that stores the number of hits to the larger regions. Can we draw the blocks to the window without constructing an object for each region or even bringing in all of the information from the db table? That's my primary concern, because if we did construct a matrix, it could be around 10 GB for a demanding request.
I'm curious to know how many blocks we can draw to the screen and what the best approach to this may be (i.e. direct3d, XNA). From above, you can see the range will vary substantially and we expect the potential for billions of squares that need to be drawn. We will have a vertical scroll bar to scroll down quickly to see other blocks.
Overall, I'm wondering how we might accomplish this with C#. Creating the matrix for the demanding request could require around 10 GB. Is there a way to draw to the screen that will not require a substantial amount of memory (i.e. without creating an object for each block)? If we could have the results of a SQL query translated directly into rendered blocks on the screen, that would be ideal (i.e. not constructing objects, etc.). All we need are squares; their only property is color, and we might need to maintain a number for each block.
Note:
We are pretty sure about how we will draw the heat map (how zooming, scrolling, etc. should appear to the user). To clarify, I'm more concerned about how we will implement our idea. Is there a library or some method that allows us to draw this many blocks without constructing a billion objects and using gigabytes of memory? Each block is essentially a group of pixels (20x20) that are the same color. I don't believe this should necessitate constructing 1 billion objects.
Thanks!
If this is really for a graphic heat map, then I agree with the comments that an image that's at least 780 laptop screens wide is impractical. If you have this information in a SQL(?) database somewhere, then you can do a fancy query that partitions your results into buckets of a certain width. The database should be able to aggregate these records into 1680 (pixels wide) buckets efficiently.
Furthermore, if your buckets are of a fixed width (yielding a fixed width heat-map image) you could pre-generate the bucket numbers for the "addresses" in your database. Indexed properly, grouping by this would be very fast.
If you DO need to see a 1:1 image, you might consider only rendering a section of the image that you're scrolled to. This would significantly reduce the amount of memory necessary to store the current view. Assuming you don't need to actually view all 780 screens worth of data at 100% (especially if you couple this with the "big picture view" strategy above) then you'll save on processing too.
The aggregate function for the "big picture view" might be MAX, SUM, AVG. If these functions aren't appropriate, please explain more about the particular features you'd be looking for in the heat-map.
As far as the drawing itself goes, you don't need "objects" for each box; you just need to draw the pixels on a graphics object.
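For example, a minimal GDI+ sketch that paints one 20x20 rectangle per bucket straight onto a Bitmap, with no per-block objects; the getBucketCount delegate stands in for however you fetch a bucket's hit count (e.g. a query covering only the visible range), and the color ramp is arbitrary.

using System;
using System.Drawing;

// Paint the visible portion of the heat map straight onto a Bitmap,
// one 20x20 rectangle per bucket, without creating any per-block objects.
static class HeatMapRenderer
{
    const int BlockSize = 20;

    public static Bitmap Render(int columns, int rows, int maxCount,
                                Func<int, int, int> getBucketCount)
    {
        var bmp = new Bitmap(columns * BlockSize, rows * BlockSize);
        using (var g = Graphics.FromImage(bmp))
        using (var brush = new SolidBrush(Color.Black))
        {
            for (int row = 0; row < rows; row++)
            {
                for (int col = 0; col < columns; col++)
                {
                    int count = getBucketCount(col, row);
                    // Simple blue-to-red ramp based on the hit count.
                    int intensity = maxCount == 0 ? 0 : Math.Min(255, count * 255 / maxCount);
                    brush.Color = Color.FromArgb(intensity, 0, 255 - intensity);
                    g.FillRectangle(brush, col * BlockSize, row * BlockSize, BlockSize, BlockSize);
                }
            }
        }
        return bmp;
    }
}

Combined with the "render only the scrolled-to section" suggestion above, columns and rows stay small enough that memory is never an issue.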
I think the technique you are looking for is called "virtualization". I don't mean hardware virtualization, but the UI technique where you create a concrete visual object only for the items that are visible. Many grids and lists use this technique to show hundreds of thousands of items at normal speed and memory consumption. You can also reuse those visual objects while swapping the underlying data objects.
I would also question the necessity of displaying billions of details. You should make it similar to zooming, or aggregate the data to show only a few items and then let the user choose a specific part or piece of the data. But I guess you have that thought out.
I receive input from the microphone and apply the FFT to it.
After that, I set frequencies higher than 1 kHz to zero (the high-pass filter).
I want to know how I can record the input from the microphone after I have passed it through the FFT and applied the filter.
I'm working with C#; what do I need to do? :P
After your FFT and filter, you need to do an inverse FFT to get the data back to the time domain. Then you want to add that set of samples to your .WAV file.
As far as producing the file itself goes, the format is widely documented (Googling for ".WAV format" should turn up more results than you have any use for), and pretty simple. It's basically a simple header (called a "chunk") that says it's a .WAV file (or actually a "RIFF" file). Then there's an "fmt " chunk that tells about the format of the samples (bits per sample, samples per second, number of channels, etc.) Then there's a "data" chunk that contains the samples themselves.
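If it helps, here is a minimal sketch of writing 16-bit PCM samples out with those three chunks; it assumes the filtered data has already been converted back to shorts, and the class name is just a placeholder.

using System;
using System.IO;

// Minimal sketch of writing 16-bit PCM samples to a .WAV file, laying down
// the RIFF header, "fmt " chunk, and "data" chunk described above.
static class WavWriter
{
    public static void Write(string path, short[] samples, int sampleRate, short channels)
    {
        short bitsPerSample = 16;
        short blockAlign = (short)(channels * bitsPerSample / 8);
        int byteRate = sampleRate * blockAlign;
        int dataSize = samples.Length * 2;

        using (var w = new BinaryWriter(File.Create(path)))
        {
            // RIFF header
            w.Write(new[] { 'R', 'I', 'F', 'F' });
            w.Write(36 + dataSize);                   // remaining file size
            w.Write(new[] { 'W', 'A', 'V', 'E' });

            // "fmt " chunk: PCM format description
            w.Write(new[] { 'f', 'm', 't', ' ' });
            w.Write(16);                              // fmt chunk size
            w.Write((short)1);                        // audio format 1 = PCM
            w.Write(channels);
            w.Write(sampleRate);
            w.Write(byteRate);
            w.Write(blockAlign);
            w.Write(bitsPerSample);

            // "data" chunk: the samples themselves
            w.Write(new[] { 'd', 'a', 't', 'a' });
            w.Write(dataSize);
            foreach (short s in samples) w.Write(s);
        }
    }
}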
Since it sounds like you're going to be doing this in real time, my advice would be to forget about doing your FFT, filter, and iFFT. An FIR filter will give essentially the same results, but generally a lot faster. The basic idea of the FIR filter is that instead of converting your data to the frequency domain, filtering it, then converting back to the time domain, you convert your filter coefficients to the time domain and apply them (fairly) directly to your input data. This is where DSPs earn their keep: nearly all of them have multiply-accumulate instructions, which can implement most of an FIR filter in one instruction. Even without that, however, getting an FIR filter to run in real time on a modern processor doesn't take any real trick unless you're doing really fast sampling. In any case, it's a lot easier than getting an FFT/filter/iFFT to operate at the same speed.
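To make that suggestion concrete, here is a minimal sketch of a direct-form FIR filter in C#. The coefficients (the filter's impulse response) are assumed to come from a separate design step, for example a windowed-sinc design for a 1 kHz cutoff at your sample rate; they are not computed here.

using System;

// Direct-form FIR filter: each output sample is a weighted sum of the most
// recent input samples, held in a small circular buffer.
class FirFilter
{
    readonly double[] _coefficients;
    readonly double[] _history;   // circular buffer of recent input samples
    int _position;

    public FirFilter(double[] coefficients)
    {
        _coefficients = coefficients;
        _history = new double[coefficients.Length];
    }

    // Process one sample at a time, which suits real-time microphone input.
    public double ProcessSample(double input)
    {
        _history[_position] = input;

        double output = 0.0;
        int index = _position;
        for (int i = 0; i < _coefficients.Length; i++)
        {
            output += _coefficients[i] * _history[index];
            index = (index == 0) ? _history.Length - 1 : index - 1;
        }

        _position = (_position + 1) % _history.Length;
        return output;
    }
}

You would feed each incoming microphone sample through ProcessSample and write the results to the .WAV file as described above.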