I'm working on a little experimental utility to use within our company that indexes notes stored in our custom CRM software for full-text searching. These notes are stored in a Btrieve database (a file called NOTES.DAT). It's possible to connect to the database and retrieve the notes for indexing by using Pervasive's ADO.NET provider. However, the indexer currently loops through each note and re-indexes it every 5 minutes. This seems grossly inefficient.
Unfortunately, there's no way for our CRM software to signal to the indexing service that a note has been changed, because it's possible for the database to exist on a remote machine (and the developers aren't going to write a procedure to communicate with my service over a network, since it's just a hobby project for now).
Rather than give up, I'd like to take this opportunity to learn a little more about raw Btrieve databases. So, here's my plan...
The NOTES.DAT file has to be shared, since our CRM software uses the Btrieve API rather than the ODBC driver (which means client installations have to be able to see the file itself on the network). I would like to monitor this file (using something like FileSystemWatcher?) and then determine the bytes that were changed. Using that information, I'll try to calculate the record at that position and get its primary key. Then the indexer will update only that record using Pervasive's ADO.NET provider.
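For reference, here is a minimal sketch of the watcher piece I have in mind (the UNC path is just a placeholder); note that FileSystemWatcher only tells me that the file changed, not which bytes changed:

```csharp
using System;
using System.IO;

class NotesWatcher
{
    static void Main()
    {
        // Path to the shared Btrieve file; adjust to the real share.
        var watcher = new FileSystemWatcher(@"\\crmserver\crm", "NOTES.DAT")
        {
            NotifyFilter = NotifyFilters.LastWrite | NotifyFilters.Size,
            EnableRaisingEvents = true
        };

        // FileSystemWatcher only reports *that* the file changed,
        // not *which* bytes changed, so this is just the trigger.
        watcher.Changed += (sender, e) =>
            Console.WriteLine($"{e.FullPath} changed at {DateTime.Now:O}");

        Console.WriteLine("Watching... press Enter to quit.");
        Console.ReadLine();
    }
}
```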
The problem (besides the fact that I don't quite know the structure of Btrieve files yet or if determining the primary key from the raw data is possible) is that I don't know how to determine the start and end range of bytes that were changed in NOTES.DAT.
I could diff two versions, but that would mean storing a copy of NOTES.DAT somewhere (and it can be quite large, hence the reason for a full-text indexing service).
What's the most efficient way to do this?
Thanks!
EDIT: It's possible for more than one note to be added, edited, or deleted in one transaction, so if possible, the method needs to be able to determine multiple separate byte ranges.
If your NOTES.DAT file is stored on an NTFS partition, then you should be able to perform one of the following:
use the USN journal to identify changes to your file (preferred)
use the Volume Shadow Copy Service (VSS) to track changes to your file by taking periodic snapshots (very fast), and then either:
diffing versions N and N-1 (probably not as slow as reindexing, but still slow), or
delving deeper and attempting to diff the $MFT to determine which blocks changed at which offsets for the file(s) of interest (much more complex, but also much faster - yet still not as fast, reliable, and simple as using the USN journal)
Using the USN journal should be your preferred method. You can use the FSUTIL utility to create and truncate the USN journal.
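For example, your indexing service could shell out to FSUTIL (a rough sketch; the volume letter and journal sizes are placeholders, the service needs administrator rights, and on older Windows versions the readjournal subcommand may not be available, in which case you would read the journal through DeviceIoControl with FSCTL_READ_USN_JOURNAL instead):

```csharp
using System;
using System.Diagnostics;

class UsnHelper
{
    // Runs an fsutil command and returns its console output.
    // Requires administrator rights.
    static string RunFsutil(string arguments)
    {
        var psi = new ProcessStartInfo("fsutil", arguments)
        {
            RedirectStandardOutput = true,
            UseShellExecute = false
        };
        using (var process = Process.Start(psi))
        {
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return output;
        }
    }

    static void Main()
    {
        // Make sure a change journal exists on the volume holding NOTES.DAT.
        Console.WriteLine(RunFsutil("usn createjournal m=33554432 a=8388608 C:"));

        // Inspect the journal state (journal ID, first/next USN, etc.).
        Console.WriteLine(RunFsutil("usn queryjournal C:"));

        // Dump recent journal records; filter the output for NOTES.DAT
        // to see when (and with which reason flags) it was written.
        Console.WriteLine(RunFsutil("usn readjournal C:"));
    }
}
```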
The company I work for is running a C# project that crawls data from around 100 websites, saves it to the DB, and runs some procedures and calculations on that data.
Each of those 100 websites has around 10,000 events, and each event is saved to the DB.
After that, the saved data is aggregated and used to generate one big XML file per event, so each of those 10,000 saved events is now represented as an XML file in the DB.
The design looks like this:
1) Crawl 100 websites to collect the data and save it to the DB.
2) Collect the data that was saved to the DB and generate an XML file for each event.
3) Save the XML files to the DB.
The main issue for this post is retrieving the saved XML files.
Each XML file is about 1 MB, and given that there are around 10,000 events, I am not sure SQL Server 2008 R2 is the right option.
I tried Redis, and saving works very well (and fast!), but querying to get those XMLs is very slow (even locally, so network traffic isn't the issue).
I was wondering what your thoughts are. Please take into consideration that it is a real-time system, so caching is not an option here.
Any idea will be welcomed.
Thanks.
Instead of using a DB, you could try a cloud-based storage system (Azure blobs or Amazon S3); it seems like a perfect fit. See this post: azure blob storage effectiveness (same situation, except you have XML files instead of images). You can use a DB for storing the metadata, i.e. the source and event type of the XML and its path in the cloud, but not the data itself.
You may also zip the files. I don't know the exact method offhand, but it can certainly be handled on the client side. Static data is often sent to the client in zipped format by default.
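One option on the .NET side would be GZipStream, for example (a sketch; how you store and ship the compressed bytes is up to you):

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

static class XmlCompression
{
    // Compress an XML string to a gzip byte array before storing it.
    public static byte[] Compress(string xml)
    {
        byte[] raw = Encoding.UTF8.GetBytes(xml);
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            return output.ToArray();
        }
    }

    // Decompress it again on the client side.
    public static string Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```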
Your question is missing some details, such as how long your data needs to remain in the database and so on…
I’d avoid storing XML in the database if you already have the raw data. Why not have an application that queries the database and generates XML reports on demand? This will save you a lot of space.
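As a rough sketch of that idea (the EventData table, its columns, and the connection string are assumptions about your schema, not something from your post):

```csharp
using System.Data.SqlClient;
using System.Xml.Linq;

static class EventReportBuilder
{
    // Builds the XML for one event straight from the raw rows,
    // instead of storing a pre-generated 1 MB blob per event.
    public static XDocument BuildEventXml(string connectionString, int eventId)
    {
        var root = new XElement("Event", new XAttribute("Id", eventId));

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "SELECT Name, Value FROM EventData WHERE EventId = @id", connection))
        {
            command.Parameters.AddWithValue("@id", eventId);
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    root.Add(new XElement("Item",
                        new XAttribute("Name", reader.GetString(0)),
                        reader.GetString(1)));
                }
            }
        }
        return new XDocument(root);
    }
}
```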
10 GB of data per day is something SQL Server 2008 R2 can handle with the right hardware and good structural optimization. You’ll need to investigate whether Standard edition will be enough or whether you’ll have to use Enterprise or Datacenter licensing.
In any case, the answer is yes: SQL Server is capable of handling this amount of data, but I’d check other solutions as well to see whether it’s possible to reduce the costs in any way.
Your basic architecture doesn't seem to be at fault; it's the way you're using Redis. If you design your key => value structure right, there is no way retrieval from Redis should be slow.
For example, let's say I have to store 1 million objects in Redis, keyed by an ID that is nothing but a GUID. The save will be really quick, but when it comes to retrieval, the question is: do I know the key? If I know the key, the lookup will be fast; but if I don't, or I am trying to retrieve my data not by key but by some value inside my objects, then of course it will be slow.
The point is: when it comes to retrieval, you should work against the key and nothing else, so design your key as a value you can pre-compute. That way, when I need to get some data from Redis/memcached, I can build the key and do a single hit to get the data.
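For example, with the StackExchange.Redis client (assuming that is what you are using; the key format and IDs below are made up), a key derived from values the caller already knows makes the read a single round trip:

```csharp
using StackExchange.Redis;

static class EventXmlStore
{
    static readonly ConnectionMultiplexer Redis =
        ConnectionMultiplexer.Connect("localhost");

    // The key is pre-computable from data the caller already has
    // (website id + event id), so no scan or secondary lookup is needed.
    static string Key(int websiteId, int eventId)
    {
        return string.Format("event:{0}:{1}", websiteId, eventId);
    }

    public static void Save(int websiteId, int eventId, string xml)
    {
        Redis.GetDatabase().StringSet(Key(websiteId, eventId), xml);
    }

    public static string Load(int websiteId, int eventId)
    {
        return Redis.GetDatabase().StringGet(Key(websiteId, eventId));
    }
}
```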
If you can provide more details, we'll be able to help you better.
I am working on the re-engineering/upgrade of a tool. The database communication is in C++ (unmanaged ADO) and connects to SQL Server 2005.
I have a few queries regarding archiving and backup/restore techniques.
1) Generally, archiving is different from backup/restore. Can someone provide a link that explains the difference? At present the solution uses the bcp tool for archival, and I see a lot of dependency on table names in the code. What do I have to consider in choosing the design (given that I have to run the backup/archival on a button click, with a database size of 100 MB at most)?
2) Will moving the entire database communication to .NET be of any help, considering the many ORM tools available? All the business logic and UI is already in C#.
3) What is the best method to verify the archived data?
PS: The question might be too high-level, but I did not find any proper link to help me understand this. It would be really helpful if someone could answer. I can provide more details!
Thanks in advance!
At 100 MB, I would say you should probably not spend too much time on archiving, and just use traditional backup strategies. The size of your database is so small that archiving would be quite an elaborate operation with very little gain, as the archiving process would typically only be relevant in the case of huge databases.
Generally speaking, a backup in database terms is a way to provide recoverability in case of a disaster (accidental data deletion, server crash, etc). Archiving mostly means you partition your data.
A possible goal with archiving is to keep specific data available for querying, but without the ability to alter it. When dealing with high-volume databases, this is an excellent way to increase performance, as read-only data can be indexed much more densely than "hot" data. It also allows you to move the read-only data to an isolated RAID partition that is optimized for READ operations and does not have to deal with the typical mixed RDBMS I/O. Also, removing the non-active data from the regular database means the size of the data contained in your tables will decrease, which should boost performance of the overall system.
Archiving is typically done for legal reasons. The data in question might not be important for the business anymore, but the IRS or banking rules require it to be available for a certain amount of time.
Using SQL Server, you can archive your data using partitioning strategies. This normally involves figuring out the criteria on which you will split the data. An example of this could be a date (i.e. data older than 3 years is moved to the archive part of the database). In the case of huge systems, it might also make sense to split data based on geographical criteria (e.g. the Americas on one server, Europe on another).
To answer your questions:
1) See the explanation written above
2) It really depends on what the goal of the upgrade is. Moving it to .NET will make the code managed, but how important is that for the business?
3) If you do decide to partition, verifying it works could include issuing a query on the original database for data that contains both values before and after the threshold you will be using for partitioning, then splitting the data, and re-issuing the query afterwards to verify it still returns the same record-set. If you configure the system to use an automatic sliding window, you could also keep an eye on the system to ensure that data will automatically be moved to the archive partition.
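As a crude sketch of the verification in 3), you could compare a row count and an aggregate checksum over the archive predicate before and after the move (the table, columns, and predicate are placeholders, and CHECKSUM_AGG/BINARY_CHECKSUM are SQL Server specific):

```csharp
using System;
using System.Data.SqlClient;

static class ArchiveVerifier
{
    // Returns a (row count, aggregate checksum) pair for all rows matching
    // the archive predicate; run it before and after partitioning/archiving
    // and compare the two results.
    public static Tuple<int, int> Fingerprint(string connectionString)
    {
        const string sql =
            @"SELECT COUNT(*) AS RowCnt,
                     ISNULL(CHECKSUM_AGG(BINARY_CHECKSUM(*)), 0) AS Fingerprint
              FROM dbo.Orders
              WHERE OrderDate < DATEADD(year, -3, GETDATE())";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                reader.Read();
                return Tuple.Create(reader.GetInt32(0), reader.GetInt32(1));
            }
        }
    }
}
```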
Again, if the 100MB is not a typo, I would think your database is too small to really benefit from archiving. If your goal is to speed things up, put the system on a server that is able to load the whole database into RAM, or use SSD drives.
If you need to establish a data archive for legal or administrative reasons, give horizontal table partitioning a look. It's a pretty straight-forward process that is mostly handled by SQL Server automatically.
Hope this helps you out!
I have been working for a couple of years on an application that I update using a back-end database. The whole key is that everything is cached on the client, so that it never requires a network connection to operate, but when it does have a connection it will always pick up the latest updates. Every application update is shipped with the latest version of the database, and I want it to download only the minimum amount of data when the database has been updated.
I currently use a table with a timestamp to check for updates. It looks something like this.
ID - Name - Description- Severity - LastUpdated
0 - test.exe - KnownVirus - Critical - 2009-09-11 13:38
1 - test2.exe - Firewall - None - 2009-09-12 14:38
This approach was fine for what I previously needed, but I am looking to expand more of the application's functionality to use this type of dynamic approach. All the data is currently stored as XML, but I do not want to store complete XML files in the database; I only want to transmit the changed data.
So how would you go about enabling a fairly simple approach to storing dynamic content (text/XML/JSON/XAML) in a database and having the client download only new updates? I was thinking of having logic that can handle XML inserted directly:
ID - Data - Revision
15 - XXX - 15
XXX would be something like <Content><File>Test.dll</File><Description>New DLL to load.</Description></Content> and would be inserted into the cache, but this would obviously be complicated, as I would need to load the revisions in sequence.
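To make that concrete, here is roughly what I imagine the client-side apply loop looking like (just a sketch; the table and column names and the ApplyToCache step are placeholders):

```csharp
using System.Data.SqlClient;
using System.Xml.Linq;

class UpdateDownloader
{
    // Pulls every revision newer than what the client already has,
    // in order, and applies each XML payload to the local cache.
    public void DownloadAndApply(string connectionString, int localRevision)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "SELECT Revision, Data FROM ContentUpdates " +
            "WHERE Revision > @rev ORDER BY Revision", connection))
        {
            command.Parameters.AddWithValue("@rev", localRevision);
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    int revision = reader.GetInt32(0);
                    XElement content = XElement.Parse(reader.GetString(1));

                    ApplyToCache(content);    // placeholder: merge into the local cache
                    localRevision = revision; // only advance once the apply succeeded
                }
            }
        }
    }

    void ApplyToCache(XElement content)
    {
        // Placeholder for the cache update logic.
    }
}
```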
Another approach that has been mentioned is to base it on something similar to source control: storing the version in the root of the file and calculating the delta to figure out the minimal amount of data that needs to be sent to the client.
Does anyone have suggestions on how to approach this with no risk of data corruption? I would also like to expand this with features that allow me to revert possibly bad revisions and replace them with new, working ones.
It really depends on the tools you are using and the architecture you already have. Is there already a server with some logic and a data access layer?
Dynamic approaches might get complicated, slow and limit the number of solutions. Why do you need a dynamic structure? Would it be feasible to just add data by using a name-value pair approach in a relational database? Static and uniform data structures are much easier to handle.
Before going into detail, you should consider the different scenarios.
Items can be added
Items can be changed
Items can be removed (I assume)
Adding is not a big problem. The client needs to remember the last revision number it got from the server, and you write a query that gets everything since then.
Changing is basically the same. You do need to take care with identification of the items: you need an unchangeable surrogate key, which seems to be the ID you already have. (GUIDs may be useful here.)
Removing is tricky. You need to either flag items as deleted instead of actually removing them, or keep a list of removed IDs together with the revision number at which they were removed.
Storing the data in the client: Consider using a relational database like SQLite in the client. (It doesn't need installation, it is just storing in a file. Firefox for instance stores quite a lot in SQLite databases.) When using the same in the server, you can probably reuse some code. It is also transaction based, which helps to keep it consistent (rollback in case of error during synchronization).
XML - if you really need it - can be stored just as a string in the database.
When using an abstraction layer or ORM that supports SQLite (eg. NHibernate), you may also reuse some code even when there is another database used by the server. Note that the learning curve for such an ORM might be rather steep. If you don't know anything like this, it could be too much.
You don't need to force reuse of code in the client and server.
Synchronization itself shouldn't be very complicated. You have a revision number in the client and a last revision on the server. The client gets all new, changed, and deleted items since then and applies them to the local store. Update the local revision number. Commit. Done.
I would never update only a part of a revision, because then you can't really know what changed since the last synchronization. Because you do differential updates, it is essential to have a well defined state of the client.
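A rough sketch of that loop, assuming a SQLite cache on the client via System.Data.SQLite and made-up table and column names (the server-side query that produces the change list is omitted):

```csharp
using System.Collections.Generic;
using System.Data.SQLite;

// One changed (or deleted) item as delivered by the server.
class ChangedItem
{
    public string Id;        // unchangeable surrogate key
    public string Data;      // payload (e.g. the XML), null for deletions
    public long Revision;
    public bool IsDeleted;   // tombstone flag
}

static class Synchronizer
{
    // Applies a batch of server changes to the local SQLite cache in a
    // single transaction, so a failed sync leaves the client untouched.
    public static void ApplyChanges(string cacheFile,
                                    IEnumerable<ChangedItem> changes,
                                    long newRevision)
    {
        using (var connection = new SQLiteConnection("Data Source=" + cacheFile))
        {
            connection.Open();
            using (var transaction = connection.BeginTransaction())
            {
                foreach (var item in changes)
                {
                    string sql = item.IsDeleted
                        ? "DELETE FROM Items WHERE Id = @id"
                        : "INSERT OR REPLACE INTO Items (Id, Data, Revision) VALUES (@id, @data, @rev)";

                    using (var command = new SQLiteCommand(sql, connection, transaction))
                    {
                        command.Parameters.AddWithValue("@id", item.Id);
                        if (!item.IsDeleted)
                        {
                            command.Parameters.AddWithValue("@data", item.Data);
                            command.Parameters.AddWithValue("@rev", item.Revision);
                        }
                        command.ExecuteNonQuery();
                    }
                }

                // Remember how far we got, inside the same transaction.
                using (var command = new SQLiteCommand(
                    "UPDATE SyncState SET LastRevision = @rev", connection, transaction))
                {
                    command.Parameters.AddWithValue("@rev", newRevision);
                    command.ExecuteNonQuery();
                }

                transaction.Commit();
            }
        }
    }
}
```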
I would go with a solution using Sync Framework.
Quote from Microsoft:
Microsoft Sync Framework is a comprehensive synchronization platform enabling collaboration and offline access for applications, services and devices. Developers can build synchronization ecosystems that integrate any application, any data from any store, using any protocol over any network. Sync Framework features technologies and tools that enable roaming, sharing, and taking data offline.
A key aspect of Sync Framework is the ability to create custom providers. Providers enable any data sources to participate in the Sync Framework synchronization process, allowing peer-to-peer synchronization to occur.
I have just built an application pretty much exactly as you described. I built it on top of the Microsoft Sync Framework that DjSol mentioned.
I use a C# front-end application with a SQL CE database, and SQL Server 2005 at the other end.
The following articles were extremely useful for me:
Tutorial: Synchronizing SQL Server and SQL Server Compact
Walkthrough: Creating a Sync service
Step by step N-tier configuration of Sync services for ADO.NET 2.0
How to Sync schema changed database using sync framework?
You don't say what your back-end database is, but if it's SQL Server you can use SQL CE (SQL Server Compact Edition) as the client DB and then use RDA or merge replication to update the client DB as desired. This will handle all your requirements for sure; there is no need to reinvent the wheel for such a common requirement.
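If you go the merge replication route, the client-side sync boils down to a few lines with the SqlCeReplication class (a sketch; every URL, name, and connection string below is a placeholder, and the publication has to be configured on the server first):

```csharp
using System.Data.SqlServerCe;

static class ClientSync
{
    // Pulls the latest changes from the merge publication into the
    // local SQL Server Compact database.
    public static void Synchronize()
    {
        var replication = new SqlCeReplication
        {
            InternetUrl = "https://server/sync/sqlcesa35.dll", // placeholder agent URL
            Publisher = "SERVERNAME",
            PublisherDatabase = "BackendDb",
            Publication = "ClientCachePublication",
            Subscriber = "ClientDevice01",
            SubscriberConnectionString = "Data Source=cache.sdf"
        };

        // AddSubscription is only needed the very first time,
        // to create the local .sdf from the publication:
        // replication.AddSubscription(AddOption.CreateDatabase);

        replication.Synchronize();
    }
}
```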
I want to develop an open-source library for fast, efficient file storage (one large data file plus an index file), like NFileStorage. Why do I want to do this?
A. In my line of work, something like that was needed.
B. Our DBA said it's not efficient to store files in the DB.
C. It's good practice for me.
I am looking for a good article on file indexes.
Can you recommend one?
What is your general opinion of the idea?
It may not be efficient to store files inside a database; however, databases like SQL Server have the FILESTREAM feature, which actually stores the data on the local file system instead of placing it in the database file itself.
In my opinion this is a bad idea for a project.
You are going to run into exactly the same problems that databases have when storing all of the uploaded files inside one single file... which is why some of them have moved away from this for binary/large objects and instead support alternative methods.
Some of the problems you will have to deal with include:
1) Allocating additional disk space for your backing file to store newly uploaded documents.
2) Permanently removing "files" from your storage and resizing/compressing the backing file.
3) Multi-user access/locks.
4) Failure recovery, such as when you encounter a bad block on the drive and it hoses your backing file.
5) Transactional support.
Items 1 and 2 increase the amount of time it takes to write a "file" to your data store. Items 3, 4, and 5 are already supported by network file systems, so you're just reinventing the wheel.
In short, you're going to have to write either your own file system or your own DBMS, neither of which I would consider "good practice" for 99% of real-world applications. It might be worthwhile if your goal is to work for Seagate... but even then they'd probably look at you funny.
If you are truly interested in the most efficient method of file storage, it is quite simply to purchase a SAN array and push your files to it while keeping a pointer to the file/location in your database. Easy to back up, fast to store files, much cheaper than spending developer time trying to figure out how to write your own file system and certainly 100% supported and understandable by future devs.
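To illustrate the pointer-in-the-database idea (a sketch; the share path, table name, and connection string are placeholders):

```csharp
using System;
using System.Data.SqlClient;
using System.IO;

static class FileStore
{
    // Copies the file onto the SAN-backed share and records only
    // its location (plus a little metadata) in the database.
    public static void Store(string connectionString, string localPath)
    {
        string fileName = Guid.NewGuid() + Path.GetExtension(localPath);
        string storedPath = Path.Combine(@"\\san-share\files", fileName); // placeholder share

        File.Copy(localPath, storedPath);

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "INSERT INTO StoredFiles (OriginalName, StoredPath, CreatedUtc) " +
            "VALUES (@name, @path, @created)", connection))
        {
            command.Parameters.AddWithValue("@name", Path.GetFileName(localPath));
            command.Parameters.AddWithValue("@path", storedPath);
            command.Parameters.AddWithValue("@created", DateTime.UtcNow);
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}
```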
This kind of product already exists. You should read about MongoDB (http://www.mongodb.org/display/DOCS/Home).
Using SQL statements (Sybase), how can I create or restore a full database given a compressed backup file (myDatabase.cmp.bak)?
The reason I'm saying create or restore is that the DB already exists, but I can drop it before creating it if that's easier.
Do I need to worry about device files? For example, if the backup file used 3 database files, each of a certain size, do I need to create the empty device files first, or will that be taken care of during the restore?
I'm doing this from a C# app
Cheers
Damien
First, a word of warning. You are attempting DBA tasks without reasonable understanding, and without a decent connection to the server.
1) Forget your C# app and use isql, DBSQL, or SQL Advantage (it comes on the CD). You will find it much easier.
2) Read up on the commands you expect to use. You need a handle on the task, not just the command syntax.
--
No, you do not have to change the Device files at all.
Yes, you do need to know the compression and the database allocations of the source (dumped) database. Usually, when we transfer dump files, we know to send the database create statement and the compression ratio along with them. The number of stripes is also required, but that is easy to ascertain.
If your target db is reasonably similar to the source db (as in, it was once synchronised), you do not need to DROP/CREATE the DB, just add the new allocations. But if it isn't, you will need to. The target db must be created with the exact same CREATE/ALTER DB sequence as the source db; otherwise, you will end up with mixed data/log segments, which prevents log dumps and renders the db unrecoverable (from the log). That sequence can be gleaned from the dump file, but the compression has to be known. Hence it is much easier for the target DBA if the source DBA ships the CREATE/ALTER DB and dump commands.