So I finally got my ListView content to serialize and write to a file so I can restore my app's state across sessions. Now I'm wondering whether there is a way to serialize and save my data incrementally. Currently, I call this method from the SaveState method of my main page:
private async void writeToFile()
{
    var f = await Windows.Storage.ApplicationData.Current.LocalFolder
        .CreateFileAsync("data.txt", CreationCollisionOption.ReplaceExisting);
    using (var st = await f.OpenStreamForWriteAsync())
    {
        var s = new DataContractSerializer(typeof(ObservableCollection<Item>),
            new Type[] { typeof(Item) });
        s.WriteObject(st, convoStrings);
    }
}
What I think would be more ideal is to write data out to storage as it is generated, so I don't have to serialize my entire list inside the small suspend time frame. But I don't know whether it is possible to serialize my collection incrementally, and if it is, how I would do it.
Note that my data doesn't change after it is generated, so I don't have to worry about anything other than appending new data to the end of my currently serialized list.
It depends on your definition of when to save the data to disk. Maybe you want to save the new collection state whenever an item is added or removed? Or whenever the content of an item changes?
The main problem with saving everything to disk just in time is that it may be painfully slow. If you're using an async programming model it isn't a direct problem, since your app won't hang while the write is in progress.
I think it may be a better idea to save the collection, say, every minute AND when the user closes the application. This will only work if you're dealing with a limited amount of data, since you only have about 3 seconds to perform all the IO work.
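For illustration, a rough sketch of that timer-plus-suspend idea might look like the following. This is only a sketch: DispatcherTimer is the XAML timer available to Store apps, and WriteToFileAsync is an assumed Task-returning version of the writeToFile method from the question, so it can be awaited before the suspension deferral completes.
using System;
using System.Threading.Tasks;
using Windows.ApplicationModel;
using Windows.UI.Xaml;

public sealed partial class MainPage
{
    private DispatcherTimer _saveTimer;

    private void StartPeriodicSave()
    {
        // save once a minute while the app is running
        _saveTimer = new DispatcherTimer { Interval = TimeSpan.FromMinutes(1) };
        _saveTimer.Tick += async (s, e) => await WriteToFileAsync();
        _saveTimer.Start();
    }

    // wired to Application.Suspending (typically in App.xaml.cs)
    private async void OnSuspending(object sender, SuspendingEventArgs e)
    {
        var deferral = e.SuspendingOperation.GetDeferral();
        await WriteToFileAsync();   // final save inside the short suspend window
        deferral.Complete();
    }

    // assumed Task-returning variant of the question's writeToFile()
    private Task WriteToFileAsync()
    {
        // ... DataContractSerializer write from the question goes here ...
        return Task.FromResult(0);
    }
}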
As you can see, there is no perfect solution. It really depends on your requirements and the size of the data. Without further information, that's all I can tell you for sure.
Related
My case is like this:
I am building an application that can read data from some source (files or database) and write that data to another source (files or database).
So, basically I have objects:
InputHandler -> Queue -> OutputHandler
Looking at a situation where input is some files, InputHandler would:
1. Use FilesReader to read data from all the files (FilesReader encapsulates the logic of reading files and it returns a collection of objects)
2. Add the objects to queue.
(and then it repeats infinitely since InputHandler has a while loop that looks for new files all the time).
The problem appears when files are really big - FilesReader, which reads all files and parses them, is not the best idea here. It would be much better if I could somehow read a portion of the file, parse it, and put it in a queue - and repeat it until the end of each file.
It is doable using Streams, however, I don't want my FilesReader to know anything about the queue - it feels to me that it breaks OOP rule of separation of concerns.
Could you suggest me a solution for this issue?
//UPDATE
Here's some code that shows (in simplified way) what InputHandler does:
public class InputHandler {
    public async Task Start() {
        while (true) {
            var newData = await _filesReader.GetData();
            _queue.Enqueue(newData);
        }
    }
}
This is how the code looks right now. So, if I have 1000 files, each containing lots and lots of data, _filesReader will try to read all of that data and return it - and memory would quickly be exhausted.
Now, if _filesReader was to use streams and return data partially, the memory usage would be kept low.
One solution would be to put the _queue object inside _filesReader - it could read data from the stream and push it directly to the queue. I don't like it, though - too much responsibility for _filesReader.
Another solution (as proposed by jhilgeman) - filesReader could raise events with the data in them.
Is there some other solution?
I'm not entirely sure I understand why using an IO stream of some kind would change the way you would add objects to the queue.
However, what I would personally do is set up a static custom event in your FilesReader class, like OnObjectRead. Use a stream to read through files and as you read a record, raise the event and pass that object/record to it.
Then have an event subscriber that takes the record and pushes it into the Queue. It would be up to your app architecture to determine the best place to put that subscriber.
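A rough sketch of that idea (the event name OnObjectRead is from the suggestion above, but the DataPoint type and the line-by-line parsing are only illustrative, not taken from the original code):
using System;
using System.IO;
using System.Threading.Tasks;

public class DataPoint { /* one parsed record */ }

public class FilesReader
{
    // raised once per record as it is read, instead of returning a whole collection
    public static event Action<DataPoint> OnObjectRead;

    public async Task ReadFileAsync(string path)
    {
        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = await reader.ReadLineAsync()) != null)
            {
                var record = ParseLine(line);   // parse a single record
                OnObjectRead?.Invoke(record);   // hand it to whoever subscribed
            }
        }
    }

    private DataPoint ParseLine(string line) => new DataPoint();
}

// Subscriber side (for example in InputHandler):
// FilesReader.OnObjectRead += record => _queue.Enqueue(record);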
On a side note, you mentioned your InputHandler has a while loop that looks for new files all the time. I'd strongly recommend not using a while loop for this if you're only checking the filesystem. This is exactly what FileSystemWatcher is for - it gives you an efficient way to be notified immediately about changes in the filesystem without having to loop. Otherwise you're constantly polling the filesystem and eating up disk I/O.
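A minimal FileSystemWatcher sketch (the folder path and the *.txt filter are placeholders):
using System;
using System.IO;

public static class InputFolderWatcher
{
    public static FileSystemWatcher Watch(string folder)
    {
        var watcher = new FileSystemWatcher(folder, "*.txt");
        watcher.Created += (sender, e) =>
        {
            // e.FullPath is the file that just appeared; hand it to the reader here
            Console.WriteLine("New file: " + e.FullPath);
        };
        watcher.EnableRaisingEvents = true;   // start receiving notifications instead of polling
        return watcher;                       // keep a reference so it isn't collected
    }
}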
This is how the code looks right now. So, if I have 1000 files, each containing lots and lots of data, _filesReader will try to read all of that data and return it - and memory would quickly be exhausted.
Regarding the problem of unlimited memory consumption, a simple solution is to replace the _queue with a BlockingCollection. This class has bounding capabilities out of the box.
public class InputHandler
{
    private readonly BlockingCollection<string> _buffer
        = new BlockingCollection<string>(boundedCapacity: 10);

    public async Task Start()
    {
        while (true)
        {
            var newData = await _filesReader.GetData();
            _buffer.Add(newData); // will block until _buffer
                                  // has fewer than 10 items
        }
    }
}
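For completeness, the consuming side might look roughly like this (OutputHandler and Process are placeholders; the key piece is GetConsumingEnumerable, which blocks while the buffer is empty):
using System.Collections.Concurrent;
using System.Threading.Tasks;

public class OutputHandler
{
    private readonly BlockingCollection<string> _buffer;

    public OutputHandler(BlockingCollection<string> buffer)
    {
        _buffer = buffer;   // the same instance the InputHandler adds to
    }

    public Task Start() => Task.Run(() =>
    {
        // blocks when the buffer is empty; ends when CompleteAdding() is called
        foreach (var item in _buffer.GetConsumingEnumerable())
        {
            Process(item);
        }
    });

    private void Process(string item) { /* write to files or a database */ }
}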
I think I came up with an idea. My main goal is to have a FilesReader that does not rely on any specific way of transferring the data out of it. All it should do is read data and return it, without caring about queues or whatever else I might use. That's the job of InputHandler - it knows about the queue and it uses FilesReader to get data to put in that queue.
I changed FilesReader interface a bit. Now it has a method like this:
Task ReadData(IFileInfo file, Action<IEnumerable<IDataPoint>> resultHandler, CancellationToken cancellationToken)
Now, InputHandler invokes the method like this:
await _filesReader.ReadData(file, data => _queue.Enqueue(data), cancellationToken);
I think it is a good solution in terms of separation of concerns.
FilesReader can read data in chunks and whenever a new chunk is parsed, it just invokes the delegate - and continues working on the rest of the file.
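A possible shape for FilesReader.ReadData under that interface (IFileInfo, IDataPoint, the FullName property and the Parse call are assumptions about the poster's types, and the chunk size is arbitrary):
// requires System.Collections.Generic, System.IO, System.Threading and System.Threading.Tasks
public async Task ReadData(IFileInfo file,
                           Action<IEnumerable<IDataPoint>> resultHandler,
                           CancellationToken cancellationToken)
{
    const int chunkSize = 1000;
    var chunk = new List<IDataPoint>(chunkSize);

    using (var reader = new StreamReader(file.FullName))    // assumes IFileInfo exposes a path
    {
        string line;
        while ((line = await reader.ReadLineAsync()) != null)
        {
            cancellationToken.ThrowIfCancellationRequested();
            chunk.Add(Parse(line));                          // assumed single-line parser

            if (chunk.Count == chunkSize)
            {
                resultHandler(chunk);                        // hand the parsed chunk to the caller
                chunk = new List<IDataPoint>(chunkSize);
            }
        }
    }

    if (chunk.Count > 0)
        resultHandler(chunk);                                // flush the final partial chunk
}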
What do you think about such a solution?
I have a private List<Experience> experiences; that tracks generic experiences and experience-specific information. I am using JSON serialization and deserialization to save and load my list. When the application starts, the List populates itself from the saved file automatically, and whenever a new experience is added, the list is saved back to the file.
A concern that is popping into my head, which I would like to get ahead of, is that nothing stops a consumer from at some point doing something like experiences = new List<Experience>(); and then adding new experiences to it. Saving this would result in a loss of all previous data, since right now the file is overwritten on each save. In an ideal world this wouldn't happen, but I would like to figure out how to structure my code to guard against it. Essentially I want to disallow removing items from the List, or setting it to a new list, after it has already been populated from the load.
I have toyed with the idea of just appending the newest addition to the file, but I also want to cover the case where properties of an existing item in the List change, and given that the list will never produce all that large a file, I thought overwriting would be the simplest approach since the cost isn't a concern.
Any help in figuring out the best approach is greatly appreciated.
Edit: I looked into the repository pattern (https://www.infoworld.com/article/3107186/application-development/how-to-implement-the-repository-design-pattern-in-c.html) and it seems like a potential approach.
I'm making an assumption that your user in this case is a code-level consumer of your API and that they'll be using the results inside the same memory stack, which is making you concerned about reference mutation.
In this situation, I'd return a copy of the list rather than the list itself on read-operations, and on writes allow only add and remove as maccettura recommends in the comments. You could keep the references to the items in the list intact if you want the consumer to be able to mutate them, but I'd think carefully about whether that's appropriate for your use case and consider instead requiring the consumer to call an update function (which could be the same as your add function a-la HTTP PUT).
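A sketch of the read-copy / add-only idea (the class and method names are made up; Experience is the poster's type):
// requires System.Collections.Generic and System.Linq
public class ExperienceLog
{
    private readonly List<Experience> _experiences = new List<Experience>();

    // read operations get a copy, so callers can't clear or replace the real list
    public IReadOnlyList<Experience> GetAll()
    {
        return _experiences.ToList();
    }

    public void Add(Experience experience)
    {
        _experiences.Add(experience);
        Save();   // persist after every addition, as in the original design
    }

    private void Save() { /* JSON-serialize _experiences to the file */ }
}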
Sometimes, when you want to make it clear that your collection should not be modified, exposing it as an IEnumerable instead of a List may be enough; but if you are writing a serious API, something like the repository pattern seems to be a good solution.
var myFileList = new List<MyCustomFileInfo>();
var dir = new DirectoryInfo(@"C:\SomeDir");

foreach (var file in dir.GetFiles())
{
    myFileList.Add(new MyCustomFileInfo()
    {
        Filename = file.Name,
        ModifiedOn = file.LastWriteTime,
        SizeInBytes = (int)file.Length
    });
}
dir.GetFiles executes very fast, but when I start to access properties, it seems individual calls are made to the file system (which is slow).
How do I rewrite this so I can get all filenames, last write times and file sizes in a more efficient manner?
NB:
The code is cut down just to illustrate the point. The real-world use case I have is more complicated (a sync thing), but this is the heart of the performance problem.
I wonder if using dir.GetFileSystemInfos() would be faster.
EDIT:
I looked through the relevant code with dotPeek, so maybe not - but see the second edit below!
Either way, if you are deploying on Windows, you should be able to use the FindFirstFile family of Win32 native functions, which is what .Length etc. do under the covers (though, as you rightly assumed, they call FindFirstFile for the file's full path and read the data from that, etc.).
EDIT 2:
I looked through the code again, and it does look like GetFileSystemInfos should populate the FileInfos and DirectoryInfos with data from the underlying system calls. (You should be able to verify this by looking at the private _dataInitialised field on the *Info objects - if it's zero, it's initialized; if it's -1, it's not.)
This is an absolutely normal approach. If it is slow, then buy an SSD or set up a RAID (preferably an SSD RAID ^^).
If the problem caused by the slowness is an unresponsive UI, then handle it in a very simple manner: populate the list with file names only (this is fast) and then run a background thread that fetches the additional data for each item in the list. You could even use a virtual list so that only the currently displayed data is fetched.
Another option is to cache recent results, so if you go back to a previous directory the results are available instantly without reloading everything.
Iterating over all files in a folder is an inherently slow operation, no matter what the underlying storage is.
You can improve performance in some cases by using EnumerateFiles instead of GetFiles; it returns an IEnumerable instead of an array, so you can avoid iterating over all the files if you use Take(), Skip(), First() and other operators that can return before enumerating everything. You can also use Enumerable.Select to project the results onto your own class, although this will still result in system calls for each file processed.
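For example (MyCustomFileInfo is the class from the question; Take(10) is just to show the early termination):
// requires System.IO and System.Linq
var firstTen = new DirectoryInfo(@"C:\SomeDir")
    .EnumerateFiles()                 // lazily streams FileInfo objects
    .Take(10)                         // stops enumerating after 10 files
    .Select(f => new MyCustomFileInfo
    {
        Filename = f.Name,
        ModifiedOn = f.LastWriteTime,
        SizeInBytes = (int)f.Length   // may still hit the file system per file, as noted above
    })
    .ToList();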
Unfortunately, this won't work in a sync scenario (context really matters here) if you want to compare the current file system state against a previous snapshot. In this case it's far better to use a FileSystemWatcher to wait for actual changes to the folder you want to sync before processing.
Once you detect changes in the folder, you can wait a bit for them to stop before processing the entire folder, or keep a record of all the change events and process only those files that have actually changed. FileSystemWatcher can miss events if a lot of operations are going on (eg. if you copy a repository of some thousand files) but typically you will get a notification that a file has changed in some way.
Things are rather easier if you are sure you are using NTFS. In this case you can enable journaling and get a list of all the files that have changed since your last check. You could also use the Volume Shadow Copy Service to read even opened files, use Transactional NTFS to modify files in a safe manner, etc. These features require native calls, but the AlphaFS project provides a library over them.
Internally, AlphaFS uses the extended file-search functions like FindFirstFileEx. In Windows 7+, this function can use a larger buffer to improve performance.
Another bonus is that journaling and NTFS Object IDs allow you to detect renames and file moves (which are essentially the same thing) and avoid an unnecessary file synchronization.
In any case you should remove the function call from the loop continue test condition.
var myFileList = new List<MyCustomFileInfo>();
var dir = new DirectoryInfo(@"C:\SomeDir");
var files = dir.GetFiles(); // called 1 time only

foreach (var file in files)
{
    myFileList.Add(new MyCustomFileInfo()
    {
        Filename = file.Name,
        ModifiedOn = file.LastWriteTime,
        SizeInBytes = (int)file.Length
    });
}
We currently have a production application that runs as a windows service. Many times this application will end up in a loop that can take several hours to complete. We are using Entity Framework for .net 4.0 for our data access.
I'm looking for confirmation that if we load new data into the system, after this loop is initialized, it will not result in items being added to the loop itself. When the loop is initialized we are looking for data "as of" that moment. Although I'm relatively certain that this will work exactly like using ADO and doing a loop on the data (the loop only cycles through data that was present at the time of initialization), I am looking for confirmation for co-workers.
Thanks in advance for your help.
// Update: here's some sample code in C# - the question is the same: will the enumeration change if new items are added to the table that EF is querying?
IEnumerable<myobject> myobjects = (from o in db.theobjects where o.id == myID select o);
foreach (myobject obj in myobjects)
{
    // perform action on obj here
}
It depends on your precise implementation.
Once a query has been executed against the database, the results of the query will not change (assuming you aren't using lazy loading). To ensure this you can dispose of the context after retrieving the query results - this effectively "cuts the cord" between the retrieved data and the database.
Lazy loading can result in a mix of "initial" and "new" data; however once the data has been retrieved it will become a fixed snapshot and not susceptible to updates.
You mention this is a long running process; which implies that there may be a very large amount of data involved. If you aren't able to fully retrieve all data to be processed (due to memory limitations, or other bottlenecks) then you likely can't ensure that you are working against the original data. The results are not fixed until a query is executed, and any updates prior to query execution will appear in results.
I think your best bet is to change the logic of your application so that when the "loop" logic is deciding whether to do another iteration or exit, you take the opportunity to load any newly added items into the list. See the pseudo code below:
var repo = new Repository();

while (repo.HasMoreItemsToProcess())
{
    var entity = repo.GetNextItem();
    // process entity here
}
Let me know if this makes sense.
The easiest way to assure that this happens - if the data itself isn't too big - is to convert the data you retrieve from the database to a List<>, e.g., something like this (pulled at random from my current project):
var sessionIds = room.Sessions.Select(s => s.SessionId).ToList();
And then iterate through the list, not through the IEnumerable<> that would otherwise be returned. Converting it to a list triggers the enumeration, and then throws all the results into memory.
If there's too much data to fit into memory, and you need to stick with an IEnumerable<>, then the answer to your question depends on various database and connection settings.
I'd take a snapshot of the IDs to be processed - quickly and as a transaction - then work that list in the fashion you're doing today.
In addition to not changing the sample mid-stream, this also gives you the ability to extend your solution to track status on each item as it's processed. For a long-running process, this can be very helpful for progress reporting, restart/retry capabilities, etc.
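A sketch of that approach (ShouldBeProcessed is a hypothetical "needs processing" filter; db and theobjects come from the question's sample code):
// requires System.Linq
var idsToProcess = db.theobjects
    .Where(o => o.ShouldBeProcessed)                  // hypothetical filter for items to process
    .Select(o => o.id)
    .ToList();                                        // snapshot is materialized here, in one query

foreach (var id in idsToProcess)
{
    var obj = db.theobjects.First(o => o.id == id);   // re-fetch each item as it's processed
    // perform action on obj here
    // optionally record per-item status for progress reporting and retry
}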
Short Version
I have a list of ints that I need to figure out how to persist through application shutdown. Not forever but, you get the idea, I can't have the list disappear before it is dealt with. The method for dealing with it will remove entries from the list.
What are my options? XML?
Background
We have a WinForm app that uses local SQL Express DBs that participate in merge replication with a central server. This will be difficult to explain, but we also have (kind of) an I-Series 400 server that a small portion of data gets written to as well. For various reasons the I-Series is not available through replication, and as such all "writes" to it need to be done while it is available.
My first thought to solve this was to simply have a List object that stores the PKs that need to be updated. Then, after a successful sync, I would have a method that checks that list and calls UpdateISeries() once for each PK in it. I am pretty sure this would work, except in a case where the app is shut down inappropriately, power is lost, etc. So, does anyone have better ideas on how to solve this? An XML file maybe, though I have never done that. I worry about actually creating a table in SQL Express because of replication... maybe unfounded, but...
For reference, UpdateISeries(int PersonID) is an existing method in a DLL that is used internally. Rewriting it, as a potential solution to this issue, really isn't viable at this time.
Sounds like you need to serialize and deserialize some objects.
See these .NET topics to find out more.
From the linked page:
Serialization is the process of converting the state of an object into a form that can be persisted or transported. The complement of serialization is deserialization, which converts a stream into an object. Together, these processes allow data to be easily stored and transferred.
If it is not important for the on-disk format to be human readable, and you want it to be as small as possible, look at binary serialization.
Using the serialization mechanism is probably the way to go. Here is an example using the BinaryFormatter.
public void Serialize(List<int> list, string filePath)
{
    using (Stream stream = File.OpenWrite(filePath))
    {
        var formatter = new BinaryFormatter();
        formatter.Serialize(stream, list);
    }
}

public List<int> Deserialize(string filePath)
{
    using (Stream stream = File.OpenRead(filePath))
    {
        var formatter = new BinaryFormatter();
        return (List<int>)formatter.Deserialize(stream);
    }
}
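Usage would then look something like this (the path and the IDs are made up):
var pending = new List<int> { 101, 205, 307 };                      // PersonIDs awaiting UpdateISeries
Serialize(pending, @"C:\MyApp\pending-updates.bin");                // write whenever the list changes
List<int> restored = Deserialize(@"C:\MyApp\pending-updates.bin");  // read back at startup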
If you already have and interact with a SQL database, use that; you'll get simpler code with fewer dependencies. Replication can be configured to ignore additional tables (even if you have to place them in another schema). This way, you can avoid a number of potential data corruption problems.
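A sketch of that approach (the table name, schema and connection string are invented; the table would simply be left out of the merge publication):
// CREATE TABLE dbo.PendingISeriesUpdates (PersonID int NOT NULL PRIMARY KEY);
// requires System.Data.SqlClient
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    "INSERT INTO dbo.PendingISeriesUpdates (PersonID) VALUES (@id)", conn))
{
    cmd.Parameters.AddWithValue("@id", personId);
    conn.Open();
    cmd.ExecuteNonQuery();
}
// After a successful sync: SELECT the pending rows, call UpdateISeries(PersonID)
// for each, and DELETE the rows that succeeded.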