I have relatively fast streams of data coming into my program (4 streams at approx. 25 Hz). I have to store every input, persist it, and upload it later on. The objects themselves are relatively simple, made up of only strings and doubles.
I'm thinking I'll use a database for storage. I've thought of files before, but I think a DB is better. Either way, if you have a suggestion for this, feel free to share it, but it's outside the scope of this question.
Now my problem is this: how do I achieve this task properly?
The fast flow of objects will be stored in a collection (I don't know which yet), one per stream, and every once in a while (probably every 500 objects per stream or so), I'll save them to the database.
I'm afraid this will be prone to race conditions, since I'll be writing to the collection while removing objects from it.
Also, I don't think I need the collection to be ordered, because each object in each data stream carries a timestamp. So it does not really matter if I happen to persist data in the "wrong" order, as long as it's saved in the database, removed from the collection, and the flow is not interrupted.
Basically, this could be classic FIFO behaviour, but if it's easier not to do it that way, I should be fine anyway. Either way, I'm not sure how to achieve it in terms of logic. I've had my fair share of head-scratching and I'd rather go in prepared.
I don't specifically need copy-paste code, I'm looking for an actual answer with, if possible, an explanation.
What kind of collection do you recommend ?
Do I need some kind of asynchronous collection logic?
Do I need some kind of thread/lock logic?
Is there a collection that can be modified at both ends at the same time?
I have no particular guideline in mind, I'm really open to suggestions.
EDIT: Also, it's worth mentioning that I'm using C#, in case someone wants to link to relevant documentation.
Thank you all very much for your time, as always, it is greatly appreciated :)
You are looking for a queue (FIFO). In particular, ConcurrentQueue<T> - it will handle the locking for you.
Alternatively, with such a low volume of data, a basic List<T> with a lock around reading and writing may be enough.
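A minimal sketch of how that could look, assuming a placeholder StreamSample type for your simple string/double objects and a stubbed-out SaveToDatabase method (your readers enqueue, a background thread drains the queue in batches of ~500 and persists them):

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Placeholder for the simple string/double objects described in the question.
class StreamSample
{
    public DateTime Timestamp { get; set; }
    public string Source { get; set; }
    public double Value { get; set; }
}

class StreamBuffer
{
    private readonly ConcurrentQueue<StreamSample> _queue = new ConcurrentQueue<StreamSample>();
    private volatile bool _running = true;

    // Producer side: called from each stream reader, ~25 times per second.
    public void Add(StreamSample sample)
    {
        _queue.Enqueue(sample);
    }

    // Consumer side: run on a background thread; drains the queue in batches
    // of up to 500 and hands each batch to the database.
    public void PersistLoop()
    {
        var batch = new List<StreamSample>(500);
        while (_running || !_queue.IsEmpty)
        {
            StreamSample sample;
            while (batch.Count < 500 && _queue.TryDequeue(out sample))
                batch.Add(sample);

            if (batch.Count > 0)
            {
                SaveToDatabase(batch);   // your persistence code goes here
                batch.Clear();
            }
            else
            {
                Thread.Sleep(100);       // queue is empty; idle briefly
            }
        }
    }

    public void Stop() { _running = false; }

    private void SaveToDatabase(List<StreamSample> batch) { /* stub */ }
}

You would start PersistLoop once on a dedicated thread (or long-running Task) at startup and call Stop on shutdown; the loop then drains whatever is left before exiting.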
In my application, I want to return a collection of objects from a WCF service (hosted as a Windows service) to populate a DataGrid in a WPF application. The number of objects in the collection ranges from one to several hundred, depending on the method called.
I'm curious as to what the 'best' way is to deal with returning large collections from a service.
These are the options I've seen suggested:
Increasing the max message size and returning all the objects in one go. This seems like a bad idea because there could possibly come a time when I need to return more than 2GB of data.
Paginating the records and calling the method repeatedly until all objects have been retrieved. I've seen this one suggested for ASP.NET projects but I don't know how well it would work for desktop apps.
Using a stream. To be honest, I don't understand how this works since it appears to be meant for transferring large single objects rather than many smaller ones.
Doing something with the yield keyword, but this went over my head and I couldn't follow it. :-/
What would be the best way to approach this task, and why?
Increasing the max message size and returning all the objects in one go. This seems like a bad idea because there could possibly come a time when I need to return more than 2GB of data.
Definitely not a good choice, unless you're sure that your data will never exceed the new limit you set. Otherwise you will just be pushing the problem back, and it will come up again in a few months. 2 GB is already a lot, by the way (think how long your users will wait).
Paginating the records and calling the method repeatedly until all objects have been retrieved. I've seen this one suggested for ASP.NET projects but I don't know how well it would work for desktop apps.
This is the most common and obvious approach: use pagination and only query for a defined number of elements for each page. I don't quite understand your concern about "desktop apps", though; the relevant concept here is client/server.
Your client (the desktop app) needs to query your server for the content of each page (if you use pagination) and display it. If your client were a web page, the concept would be exactly the same.
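As a rough sketch of that paging shape (the contract, DTO and method names here are illustrative, not an existing API), the service exposes a page-based operation and the client loops until it receives a short page:

using System.Collections.Generic;
using System.Runtime.Serialization;
using System.ServiceModel;

[DataContract]
public class RecordDto            // placeholder for whatever you actually return
{
    [DataMember] public int Id { get; set; }
    [DataMember] public string Name { get; set; }
}

[ServiceContract]
public interface IRecordService
{
    // One page of results per call; a short (or empty) page tells the client it's done.
    [OperationContract]
    List<RecordDto> GetRecords(int pageIndex, int pageSize);
}

public static class RecordClient
{
    // Client side: keep requesting pages until a short page comes back.
    public static List<RecordDto> FetchAll(IRecordService service, int pageSize)
    {
        var all = new List<RecordDto>();
        for (int page = 0; ; page++)
        {
            List<RecordDto> chunk = service.GetRecords(page, pageSize);
            all.AddRange(chunk);
            if (chunk.Count < pageSize)
                break;
        }
        return all;
    }
}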
Using a stream. To be honest, I don't understand how this works since it appears to be meant for transferring large single objects rather than many smaller ones.
I guess you read things like "manage your own stream". In a few words, you could consider any stream as a flow of bytes and interpret it however you wish on the client side. I would certainly not recommend that unless you have a really specific transfer issue (and having a high number of objects to transfer is certainly not specific enough). Having a few very big objects to transfer may be specific enough, but even then I would challenge the design before going this way.
Doing something with the yield keyword, but this went over my head and I couldn't follow it. :-/
Sorry, I don't follow you here, but yield is only syntactic sugar, so I don't think it's relevant to solving your problem. Still, have a look to understand the concept:
What is the yield keyword used for in C#?
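For reference only, a tiny illustration of what yield does (it just lets the compiler build the iterator state machine for you; by itself it doesn't change what WCF puts on the wire):

using System.Collections.Generic;

public static class YieldDemo
{
    // The compiler turns this into an iterator; each value is produced
    // only when the caller asks for the next one.
    public static IEnumerable<int> FirstN(int n)
    {
        for (int i = 0; i < n; i++)
            yield return i;
    }
}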
I am trying to mimic some desktop software. The users are accustomed to never saving. The software saves as they change values.
I'm using blur and change events in jquery to trigger updates.
Clearly, this is going to use a lot of unnecessary bandwidth, but it does meet the requirements.
I have no problem doing this, but I want to ask if there is a clear, definitive reason not to do this?
Is there a clearly preferable alternative? Saving every few seconds, for instance.
Edit: I should note that the updates are segregated, so not all of the data is sent and received in each update. It may be up to 4 or 5 tables and 200 or so fields at once, but more typically it's a couple of tables and 10 or so fields.
Your exact requirements seem a little vague, but as far as I understand, you are doing the correct thing.
You can refine things a bit if you wish:
Serialization. Sending data serialized as XML is not the same as sending it as JSON. To save bandwidth, JSON serialization is recommended.
Encoding. To correctly analyze bandwidth usage, think about what kind of data you're sending to the backend. Does it make sense to send it plain? Could you take advantage of a compression algorithm? Would that extra work have a noticeable performance impact on your solution?
Scheduling. This really depends on your requirements, but does it really make sense to sync on every change? Can you take the risk of syncing at intervals and possibly losing some changes? This decision could have a huge impact on the total bandwidth usage of your application.
Local storage. This also depends on your requirements, but maybe you could take advantage of HTML5 local storage, depending on your decision regarding point 3. Just an idea.
I'm working on a C# library project that will process transactions between SQL and QuickBooks Enterprise, keeping both data stores in sync. This is great and all, but the initial sync is going to be a fairly large set of transactions. Once the initial sync is complete, transactions will sync as needed for the remainder of the life of the product.
At this point, I'm fairly familiar with the SDK using QBFC, as well as all of the various resources and sample code available via the OSR, the ZOMBIE project by Paul Keister (thanks, Paul!) and others. All of these resources have been a huge help. But one thing I haven't come across yet is whether there is a limit, or a substantial (or deadly) performance cost, associated with sending large amounts of data via a single Message Set Request. As I understand it, the database on QuickBooks' end is just a SQL database as well, but I don't want to make any assumptions.
Again, I just need to hit this hard once, so I don't want to engineer a separate solution to do the import. This also affords me an opportunity to test a copy of live data against my library, logs and all.
For what it's worth, this is my first ever post on Stack, so feel free to educate me on posting here if I've steered off course in any way. Thanks.
For what it's worth, I found that in a network environment (as opposed to everything happening on 1 box) it's better to have a larger MsgSetRequest as opposed to a smaller one. Of course everything has its limits, and maybe I just never hit it. I don't remember exactly how big the request set was, but it was big. The performance improvement was easily 10 to 1 or better.
If I were you, I'd build some kind of iteration into my design from the beginning (to iterate through your SQL data set), roughly as sketched below. Start with a big number that will do it all at once, and if that breaks, just scale it back until you find something that works.
I know this answer doesn't have the detail you're looking for, but hopefully it will help.
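A rough sketch of that "start big, scale back" idea (BuildMsgSetRequest and SendRequest are placeholders for your own QBFC request-building and session code, not SDK names):

using System;
using System.Collections.Generic;

public static class InitialSync
{
    // rows: the SQL records to push to QuickBooks.
    // Start with a large batch size and halve it whenever a request fails.
    public static void Run(IList<object> rows, int batchSize)
    {
        int offset = 0;
        while (offset < rows.Count)
        {
            int count = Math.Min(batchSize, rows.Count - offset);
            var batch = new List<object>(count);
            for (int i = 0; i < count; i++)
                batch.Add(rows[offset + i]);

            try
            {
                var request = BuildMsgSetRequest(batch);  // placeholder: your QBFC request-building code
                SendRequest(request);                     // placeholder: your session/DoRequests call
                offset += count;                          // batch succeeded; move on
            }
            catch (Exception)
            {
                if (batchSize > 1)
                    batchSize /= 2;                       // scale back and retry the same slice
                else
                    throw;                                // can't shrink any further; give up
            }
        }
    }

    private static object BuildMsgSetRequest(List<object> batch) { return new object(); } // stub
    private static void SendRequest(object request) { }                                   // stub
}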
I have a case here on which I would like some opinions from the experts :)
Situation:
I have a data structure with Int32 and Double values, with a total size of 108 bytes.
I have to process a large series of this data structure. It's something like this (conceptual; I will use a for loop instead):
double result = 0;
foreach (Item item in series)
{
    result += /* some calculation based on item */;
}
I expect the size of the series to be about 10 MB.
To be useful, the whole series must be processed. It's all or nothing.
The series data will never change.
My requirements:
Memory consumption is not an issue. I think that nowadays, if the user doesn't have a few dozen MB free on his machine, he probably has a deeper problem.
Speed is a concern. I want the iteration to be as fast as possible.
No unmanaged code, or interop, or even unsafe.
What I would like to know
Implement the item data structure as a value or reference type? From what I know, value types are cheaper, but I imagine that on each iteration a copy will be made for each item if I use a value type. Is this copy faster than a heap access?
Any real problem if I implement the accessors as auto-implemented properties? I believe this will increase the footprint, but also that the getter will be inlined anyway. Can I safely assume this?
I'm seriously considering creating a very large static readonly array of the series directly in code (it's rather easy to do this with the data source). This would give me a 10 MB assembly. Any reason why I should avoid this?
Hope someone can give me a good opinion on this.
Thanks
Implement the item data structure as a value or reference type? From what I know, value types are cheaper, but I imagine that on each iteration a copy will be made for each item if I use a value type. Is this copy faster than a heap access?
Code it both ways and profile it aggressively on real-world input. Then you'll know exactly which one is faster.
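For example, a rough way to time both variants on your real data (ItemStruct and ItemClass are stand-ins for your 108-byte type, and the calculation is a placeholder):

using System;
using System.Diagnostics;

public struct ItemStruct { public double Value; }   // stand-ins for your real 108-byte type
public class  ItemClass  { public double Value; }

public static class Profiling
{
    public static void Compare(ItemStruct[] structs, ItemClass[] classes)
    {
        Stopwatch sw = Stopwatch.StartNew();
        double a = 0;
        foreach (ItemStruct item in structs)
            a += item.Value;                         // replace with your real calculation
        sw.Stop();
        Console.WriteLine("struct: {0} ms (result {1})", sw.ElapsedMilliseconds, a);

        sw.Restart();
        double b = 0;
        foreach (ItemClass item in classes)
            b += item.Value;                         // same calculation on the class version
        sw.Stop();
        Console.WriteLine("class:  {0} ms (result {1})", sw.ElapsedMilliseconds, b);
    }
}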
Any real problem if I implement the accessors as auto-implemented properties?
Real problem? No.
I believe this will increase the footprint, but also that the getter will be inlined anyway. Can I safely assume this?
You can only safely assume things guaranteed by the spec. It's not guaranteed by the spec.
I'm seriously considering creating a very large static readonly array of the series directly in code (it's rather easy to do this with the data source). This would give me a 10 MB assembly. Any reason why I should avoid this?
I think you're probably worrying about this too much.
I'm sorry if my answer seems dismissive. You're asking random people on the Internet to speculate about which of two things is faster. We can guess, and we might be right, but you could just code it both ways in the blink of an eye and know exactly which is faster. So just do it.
However, I always code for correctness, readability and maintainability first. I establish reasonable performance requirements up front, and I see if my implementation meets them. If it does, I move on. If I need more performance from my application, I profile it to find the bottlenecks, and then I start worrying.
You're asking about a trivial computation that takes ~10,000,000 / 108 ~= 100,000 iterations. Is this even a bottleneck in your application? Seriously, you are overthinking this. Just code it and move on.
That's 100,000 iterations, which in CPU time is sod all. Stop overthinking it and just write the code. You're making a mountain out of a molehill.
Speed is relative. How do you load your data, and how much other data is inside your process? Loading the data will be the slowest part of your app if you do not need complex parsing logic to create your structs.
I think you're asking this question because you have a 108-byte struct that you perform calculations on, and you wonder why your app is slow. Please note that structs are passed by value, which means that if you pass the struct to one or more methods during your calculations, or fetch it from a List, you will create a copy of the struct every time. This is indeed costly.
Change your struct to a class and expose only getters, to make sure you have a read-only object. That should fix your perf issues.
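A minimal sketch of that shape (the member names are made up): the values are set once in the constructor and only exposed through getters, so the object is effectively read-only and only a reference is copied when it is passed around:

public sealed class Item
{
    private readonly int _id;
    private readonly double _value;
    private readonly double _weight;

    public Item(int id, double value, double weight)
    {
        _id = id;
        _value = value;
        _weight = weight;
    }

    // Getters only: the values are set in the constructor and never change.
    public int Id { get { return _id; } }
    public double Value { get { return _value; } }
    public double Weight { get { return _weight; } }
}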
A good practice is to separate data from code, so regarding your "big array embedded in the code" question, I say don't do that.
Use LINQ for calculations on the entire series; the speed is good (see the sketch below).
Use a Node class for each point if you want more functionality.
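For the LINQ suggestion, something along these lines (the Item members and the formula are placeholders for your own type and calculation):

using System.Collections.Generic;
using System.Linq;

public static class SeriesMath
{
    // Replace the lambda with your real per-item formula.
    public static double Process(IEnumerable<Item> series)
    {
        return series.Sum(item => item.Value * item.Weight);
    }
}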
I used to work with such large series of data. They were points that you plot on a graph. Originally they were taken every ms or less, and the datasets were huge. Users wanted to apply different formulas to these series and have the results displayed. It looks to me like your problem might be similar.
To improve speed, we stored different zoom levels of the points in a DB: every ms, then aggregated for every minute, every hour, every day, etc. (whatever users needed). When users zoomed in or out, we would load the new values from the DB instead of performing the calculations right then. We would also cache the values so users didn't have to go to the DB all the time.
Also, if the users wanted to apply some formulas to the series (as in your case), there was less data to process.
I'm an experienced programmer in a legacy (yet object-oriented) development tool and am making the switch to C#/.NET. I'm writing a small single-user app using SQL Server CE 3.5. I've read the conceptual DataSet and related docs, and my code works.
Now I want to make sure that I'm doing it "right", get some feedback from experienced .Net/SQL Server coders, the kind you don't get from reading the doc.
I've noticed that I have code like this in a few places:
var myTableDataTable = new MyDataSet.MyTableDataTable();
myTableTableAdapter.Fill(myTableDataTable);
... // other code
In a single-user app, would you typically just do this once when the app starts: instantiate a DataTable object for each table and then store a reference to it, so you only ever use that single object, which is already filled with data? This way you would only ever read the data from the DB once instead of potentially multiple times. Or is the overhead of this so small that it just doesn't matter (and it could even be counterproductive with large tables)?
For CE, it's probably a non-issue. If you were pushing this app to thousands of users and they were all hitting a centralized DB, you might want to spend some time on optimization. In a single-user instance DB like CE, unless you've got data that says you need to optimize, I wouldn't spend any time worrying about it. Premature optimization, etc.
The way to decide comes down to two main things:
1. Is the data going to be accessed constantly?
2. Is there a lot of data?
If you are constantly using the data in the tables, then load them on first use.
If you only occasionally use the data, fill the table when you need it and then discard it.
For example, if you have 10 gui screens and only use myTableDataTable on 1 of them, read it in only on that screen.
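A rough sketch of the "load on first use" idea, reusing the designer-generated type names from the question (MyDataSet.MyTableDataTable and a MyTableTableAdapter; adjust to your actual names). Lazy<T> simply defers the Fill until the first access:

using System;

public class TableCache
{
    private readonly Lazy<MyDataSet.MyTableDataTable> _myTable;

    public TableCache(MyTableTableAdapter adapter)
    {
        // The Fill only runs the first time MyTable is read.
        _myTable = new Lazy<MyDataSet.MyTableDataTable>(() =>
        {
            var table = new MyDataSet.MyTableDataTable();
            adapter.Fill(table);
            return table;
        });
    }

    public MyDataSet.MyTableDataTable MyTable
    {
        get { return _myTable.Value; }
    }
}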
The choice really doesn't depend on C# itself. It comes down to a balance between:
How often do you use the data in your code?
Does the data ever change (and do you care if it does)?
What's the relative (time) cost of getting the data again, compared to everything else your code does?
How much value do you put on performance, versus developer effort/time (for this particular application)?
As a general rule: for production applications, where the data doesn't change often, I would probably create the DataTable once and then hold onto the reference as you mention. I would also consider putting the data in a typed collection/list/dictionary, instead of the generic DataTable class, if nothing else because it's easier to let the compiler catch my typing mistakes.
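As a rough sketch of that typed-collection idea (the Customer shape and the column names are made up):

using System.Collections.Generic;
using System.Data;

public class Customer
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class CustomerMapper
{
    // Copy the rows once into a typed list so the compiler can catch typing mistakes later.
    public static List<Customer> FromTable(DataTable table)
    {
        var customers = new List<Customer>();
        foreach (DataRow row in table.Rows)
        {
            customers.Add(new Customer
            {
                Id = (int)row["Id"],          // hypothetical column names
                Name = (string)row["Name"]
            });
        }
        return customers;
    }
}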
For a simple utility you run for yourself that "starts, does its thing and ends", it's probably not worth the effort.
You are asking about Windows CE. In that particular case, I would most likely do the query only once and hold onto the results. Mobile OSs have extra constraints on battery and space that desktop software doesn't have. Basically, a mobile OS makes bullet #4 much more important.
Every time you add another retrieval call to SQL, you call external libraries more often, which means you are probably running longer, allocating and releasing memory more often (which adds fragmentation), and possibly causing the database to be re-read from flash memory. It's most likely a lot better to hold onto the data once you have it, assuming that you can (see bullet #2).
It's easier to figure out the answer to this question when you think about datasets as being a "session" of data. You fill the datasets; you work with them; and then you put the data back or discard it when you're done. So you need to ask questions like this:
How current does the data need to be? Do you always need the very latest, or will the database not change that frequently?
What are you using the data for? If you're just using it for reports, then you can easily fill a dataset, run your report, then throw the dataset away, and next time just make a new one. That'll give you more current data anyway.
Just how much data are we talking about? You've said you're working with a relatively small dataset, so there's not a major memory impact if you load it all in memory and hold it there forever.
Since you say it's a single-user app without a lot of data, I think you're safe loading everything in at the beginning, using it in your datasets, and then updating on close.
The main thing you need to be concerned with in this scenario is: What if the app exits abnormally, due to a crash, power outage, etc.? Will the user lose all his work? But as it happens, datasets are extremely easy to serialize, so you can fairly easily implement a "save every so often" procedure to serialize the dataset contents to disk so the user won't lose a lot of work.
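A minimal sketch of such a periodic save (WriteXml and ReadXml are the standard DataSet serialization methods; the interval, file path and class shape are arbitrary):

using System;
using System.Data;
using System.IO;
using System.Timers;

public class AutoSaver : IDisposable
{
    private readonly DataSet _dataSet;
    private readonly string _path;
    private readonly Timer _timer;

    public AutoSaver(DataSet dataSet, string path, TimeSpan interval)
    {
        _dataSet = dataSet;
        _path = path;
        _timer = new Timer(interval.TotalMilliseconds);
        _timer.Elapsed += (s, e) => Save();
        _timer.Start();
    }

    // Serialize data plus schema so it can be reloaded after a crash.
    public void Save()
    {
        _dataSet.WriteXml(_path, XmlWriteMode.WriteSchema);
    }

    public void Restore()
    {
        if (File.Exists(_path))
            _dataSet.ReadXml(_path, XmlReadMode.ReadSchema);
    }

    public void Dispose() { _timer.Dispose(); }
}

Calling Restore at startup would then bring back whatever was last saved if the app exited abnormally.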