We need to store ~50k files every year. Each file is 0.1-5 MB, which translates to roughly 5 GB to 250 GB per year. File types include jpg, avi, pdf, docx, etc.
We used to just store files as BLOBs in SQL Server, but I guess that's not the best idea in this scenario because the database will be huge in 2 years.
What would be the best way to store that data?
I see a lot of different options there and cannot figure out where to start:
Azure Storage, Azure SQL, etc. There are also some hybrid options in newer versions of SQL Server.
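For reference, the yearly volume estimate in the question can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal units (1 GB = 1000 MB) and that every file sits at one end of the stated size range, so the real figure will land somewhere in between.

```python
# Rough yearly storage estimate from the figures in the question.
FILES_PER_YEAR = 50_000
MIN_FILE_MB = 0.1
MAX_FILE_MB = 5.0
MB_PER_GB = 1000  # assumption: decimal units, matching the question's round numbers


def yearly_range_gb(files=FILES_PER_YEAR, lo_mb=MIN_FILE_MB, hi_mb=MAX_FILE_MB):
    """Return the (low, high) yearly storage growth in GB."""
    return files * lo_mb / MB_PER_GB, files * hi_mb / MB_PER_GB


low_gb, high_gb = yearly_range_gb()
print(f"per year: {low_gb:.0f} GB to {high_gb:.0f} GB")
print(f"after 2 years: {2 * low_gb:.0f} GB to {2 * high_gb:.0f} GB")
```

So the upper bound after two years is about half a terabyte, which is the number to compare against database size limits.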
I use the following approach for multiple systems.
Azure Storage for files. You can create multiple containers and access levels if storing proprietary information.
Azure CDN for serving static content from Azure Storage
Azure DB as a database engine
In Azure DB I store the path to the file, and my applications do some additional processing to build the final URL the file is served from. This is due to the CDN lacking SSL support on custom domains.
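A minimal sketch of that idea, not my production code: the database keeps only the container and the relative path, and the application prepends a base URL at read time. The storage account name `examplestorage`, the `files` table and the `build_file_url` helper are made up for illustration, and an in-memory SQLite database stands in for Azure DB.

```python
import sqlite3
from urllib.parse import quote, urljoin

# Hypothetical base URL. Pointing at the storage endpoint (rather than a
# custom CDN domain) is one way to keep the final link on HTTPS.
BASE_URL = "https://examplestorage.blob.core.windows.net/"


def build_file_url(container, blob_path, base_url=BASE_URL):
    """Combine the base URL, container and stored relative path into a full URL."""
    # quote() percent-encodes spaces and similar characters but keeps "/".
    return urljoin(base_url, quote(f"{container}/{blob_path}"))


# In-memory SQLite stands in for the real database here.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE files (id INTEGER PRIMARY KEY, container TEXT, blob_path TEXT)"
)
db.execute(
    "INSERT INTO files (container, blob_path) VALUES (?, ?)",
    ("invoices", "2015/03/scan 01.pdf"),
)

container, blob_path = db.execute(
    "SELECT container, blob_path FROM files WHERE id = 1"
).fetchone()
url = build_file_url(container, blob_path)
print(url)
```

Because only the relative path is stored, switching the base URL later (for example to a CDN endpoint) does not require touching the rows.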
If you need examples or more info, just let me know. I'm sitting at an airport so providing a slightly less detailed answer.
As @Martin mentioned in his answer, Azure Storage is viable, specifically because:
Access is independent of any VM, service, database, etc.
Storage is durable - triple-replicated within a region, and optionally geo-replicated
Individual blobs may be up to 200GB, with storage scaling to 500TB per storage account
Azure also provides File Service, which is essentially an SMB share sitting on top of blobs.
While there are database services in Azure (SQL Database Service and DocumentDB), you'll find that these are not really optimized for large binary data storage; they're more optimized for metadata. You can certainly store binary data in each of them, but you'll need to worry about storage limits.
You may also spin up your own database solution via Virtual Machines, along with attached disks (again, backed by durable blob storage). Virtual Machines support up to 32 1TB disks attached to a given VM, whether normal blobs or "premium" SSD-based blobs (each premium disk supporting up to 5000 IOPS). Which database solution you choose is completely up to you; there is no "best" solution.
Is it possible to read a file located on local machine path C:\data with an azure function trigger by http request ?
You can expose a local file system to Azure using the on-premises data gateway.
However, this is supported from Logic Apps but, as far as I know, not from Functions. You could still use a Logic App as a bridge to your Function using the Azure Function Connector.
You are of course free to use your own personal computer however you like, but be aware that the on-premises data gateway exposes machines on your own network directly to the internet, which in the context of a business is often considered a significant security hazard. Definitely do not do this in a business context without clearing it with IT security persons first.
I would say no. The resource that you want to read data from needs to be accessible from the web. Put the files in the cloud so that the function can access them.
I would like to save data produced by a map/reduce job in HDInsight in a format I can easily report on, ideally a table structure (Azure Table storage). Having done some research, it looks like the HDInsight service can only work with Azure Storage Vault (ASV), for both reading and writing. Is that correct?
I would prefer to implement the HDInsight mapper/reducer in C#.
I don't know much about Hive or Pig, and wonder whether there is functionality that would let me persist the reducer's results in external data storage (Azure Tables) other than ASV?
Currently the default storage backing HDInsight is ASV. You can also store data on the 'local' HDFS filesystem on your HDInsight cluster. However, this means keeping the cluster running permanently, and limits you to the storage on your compute nodes. This can get very expensive.
One solution might be to use Sqoop to export the results into something like SQL Server (or SQL Azure), depending on their size and what you plan to do with them.
Alternatively, I am currently working on a connector between Hive and Azure Tables, which currently allows you to read from Azure Tables into Hive (by way of an external table) but will shortly be getting write support as well.
I have an old ASP.NET 2.0 site that I really have no interest in rebuilding or updating at this point. I'd like to move it to Windows Azure but I'm not all that familiar with Azure so I'm wondering if it's easily portable.
The biggest potential roadblock is the fact that users can upload multiple photos. Upon upload, I create several copies of the image in pre-defined dimensions and store them on the local file system using Server.MapPath("{location}") to indicate where it should be stored.
Can I have a site hosted on Azure (using their Free or Shared tier) and continue to use this method of uploading and storing files or do I have to switch to blob storage? There are only about 400MB of images.
Basically I'm looking for a low/no-cost way to easily host this site on Azure that doesn't involve changes to the code (or at the very least, extremely minimal changes such as changing Server.MapPath to some relative location) so that I can move my other, more current ones to Azure as well. My situation is such that if I can't move this site, it doesn't make sense to move the others because I'll have to keep paying for the server for this one anyways (they're all hosted on the same server for now).
Azure drives are not guaranteed to persist between reboots of the virtual machine, so you will probably need to use blob storage. However, you can mount a blob as an NTFS volume and store the files there, which would make the transition quite simple in the code.
I'm creating a C# Metro/Modern UI app, and I need a way to handle some user data (mostly just small strings, but a fair amount of them), and specifically I'd like the data to 'roam' with a user's Microsoft Account. I know that you can handle this with roamingSettings, but it seems like that's supposed to be used more for like storing user IDs and other one-time settings, whereas I would be using it to store all of my app's data, and there seems to be a limit to the amount of space I get with that. I was thinking about using SkyDrive to host a "MyApp Data" folder, but I can't seem to figure out how to upload a simple text file to it :(
It seems like the best way to handle it would be to set up an account on Azure or EC2 and then make a simple PHP API so I could access the SQL database from my app, but I'd rather not have to pay for hosting.
I've seen other questions about Metro app storage on StackExchange and Microsoft's own forums, but most of those are in reference to local storage and using SQL servers to handle the storage.
So should I just use roamingSettings and keep an eye on the quota, should I try to use cloud hosting, or is there a better solution I just haven't thought of yet?
Thanks!
A few things about roaming settings:
- They are intended for exactly that: settings. They are not a data replication scheme, hence the quota.
- They are not immediate. You can create a setting named "highpriority" that will replicate in less than a minute, but other settings can take several minutes to replicate. If you need data to be available immediately, roaming settings are not an option.
- If you exceed the quota, all your data will stop replicating, which is a bad thing. :)
- Data will not replicate between different versions of your app, even if the settings are the same.
- If you do not use the app for a period of time (default is 30 days), the roaming data will be deleted from the cloud.
- I am pretty sure roaming data can also be turned off via group policy in enterprise settings.
You can leverage SkyDrive. Make sure you download the Live SDK. Overview of using SkyDrive is here... http://msdn.microsoft.com/en-us/library/live/hh826521.aspx It is, fundamentally, just a collection of REST APIs. See the SkyDrive photo sample for an app that uploads files to SkyDrive http://code.msdn.microsoft.com/windowsapps/Live-SDK-Windows-Developer-8ad35141
I would go for a cloud-based solution. An MS employee told me that roaming data is "best effort": there is no way to check whether it actually works. Sometimes it works, sometimes it just doesn't.
Personally I'd try the SkyDrive option.
I am trying to create a document manager for my winforms application. It is not web-based.
I would like to be able to allow users to "attach" documents to various entities (personnel, companies, work orders, tasks, batch parts etc) in my application.
After lots of research I have made the decision to use the file system to store the files instead of a blob in SQL. I will set up a folder to store all the files, but I will store the document information (filepath, uploaded by, changed by, revision etc) in parent-child relationship with the entity in an sql database.
I only want users to be able to work with the documents through the application, to prevent the files and database records getting out of sync. I somehow need to protect the document folder from normal users but at the same time allow the application to work with it. My original thought was to give the application the only username and password with access to the folder and use impersonation to log in to the folder and work with the files. From feedback in a recent thread I started, I now believe this was not a good idea, and working with impersonation has been a headache.
I also thought about using a web service, but some of our clients just run the application on their laptops with no Windows server. Most are using Windows Server or Citrix/Windows Server.
What would be the best way to set this up so that only the application handles the documents?
I know you said you read about blobs, but are you aware of the FILESTREAM option in SQL Server 2008 and onwards? Basically, rather than saving blobs into your database, which isn't always a good idea, you can instead save them to the NTFS file system using transactional NTFS. This sounds to me like exactly what you are trying to achieve.
All the file access security would be handled through SQL server (as it would be the only thing needing access to the folder) and you don't need to write your own logic for adding and removing files from the file system. To remove a file from the file system you just delete the related record in the sql server table and it handles removing it from the file system.
See:
http://technet.microsoft.com/en-us/library/bb933993.aspx
Option 1 (Easy): Security through Obscurity
Give everyone read (and write as appropriate) access to your document directories. Save your document 'path' as the full URI (\\servername\dir1\dir2\dir3\file.ext) so that your users can access the files, but they're not immediately available if someone goes wandering through their mapped drives.
Option 2 (Harder): Serve the File from SQL Server
You can use either a CLR function or SQLDMO to read the file from disk, present it as a varbinary field and reconstruct it at the client side. Upside is that your users will see a copy, not the real thing; makes viewing safer, editing and saving harder.
Enjoy! ;-)
I'd go with these options, in no particular order.
Create a folder on the server that's not accessible to users. Have a web service running on the server (either using IIS, or standalone WCF app) that has a method to upload & download files. Your web service should manage the directory where the files are being stored. The SQL database should have all the necessary metadata to find the documents. In this manner, only your app can get access to these files. Thus the users could only see the docs via the app.
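To make the shape of that service concrete, here is a rough sketch of the storage layer such a web service could sit on top of. It is not a finished design: the `DocumentStore` class, the `documents` table and its columns are invented for illustration, SQLite stands in for the real SQL database, and the web service endpoints and folder permissions are left out entirely.

```python
import shutil
import sqlite3
import tempfile
import uuid
from pathlib import Path


class DocumentStore:
    """Keeps files in a managed folder and their metadata in a database."""

    def __init__(self, root, db):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.db = db
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS documents ("
            "id INTEGER PRIMARY KEY, entity_type TEXT, entity_id INTEGER, "
            "original_name TEXT, stored_name TEXT, uploaded_by TEXT)"
        )

    def upload(self, entity_type, entity_id, source_path, uploaded_by):
        """Copy a file into the managed folder and record its metadata."""
        source_path = Path(source_path)
        # A generated name avoids collisions and hides the original file name.
        stored_name = uuid.uuid4().hex + source_path.suffix
        shutil.copyfile(source_path, self.root / stored_name)
        cursor = self.db.execute(
            "INSERT INTO documents (entity_type, entity_id, original_name, "
            "stored_name, uploaded_by) VALUES (?, ?, ?, ?, ?)",
            (entity_type, entity_id, source_path.name, stored_name, uploaded_by),
        )
        self.db.commit()
        return cursor.lastrowid

    def download(self, document_id):
        """Return the file contents for a document id, looked up via metadata."""
        row = self.db.execute(
            "SELECT stored_name FROM documents WHERE id = ?", (document_id,)
        ).fetchone()
        if row is None:
            raise KeyError(document_id)
        return (self.root / row[0]).read_bytes()


# Usage with a temporary folder and an in-memory database.
work_dir = Path(tempfile.mkdtemp())
sample = work_dir / "contract.txt"
sample.write_text("signed contract")

store = DocumentStore(work_dir / "managed", sqlite3.connect(":memory:"))
doc_id = store.upload("company", 42, sample, "alice")
print(store.download(doc_id))
```

Since clients only ever pass a document id, they never learn the folder path, and the file and its database record are always created through the same code path.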
I can see that you chose to store the documents on the file system. I wrote a similar system (e.g. attachments to customers/orders/sales people/etc...) except that I am storing the files in SQL Server. It actually works pretty well. I initially worried that so much data would slow down the database, but that turned out not to be the case. It's working great. The only advice I can give if you take this route is to create a separate database for all your attachments. Why? Because if you want to get a copy of the database for your local testing, you do not want to be copying a 300GB database that's made up of 1GB of actual data and 299GB of attachments.
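The separate-database advice can be illustrated in miniature. The sketch below uses SQLite instead of SQL Server (an assumption made purely so the example is self-contained), with two database files: one for the core data and one for the attachments. The table and file names are invented.

```python
import sqlite3
import tempfile
from pathlib import Path

work_dir = Path(tempfile.mkdtemp())
core_path = work_dir / "core.db"                # the "actual data"
attachments_path = work_dir / "attachments.db"  # the bulky binary content

conn = sqlite3.connect(core_path)
# Attach the second database file so both can be queried over one connection.
conn.execute("ATTACH DATABASE ? AS att", (str(attachments_path),))
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE att.files ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER, name TEXT, content BLOB)"
)
conn.execute("INSERT INTO customers (name) VALUES ('Contoso')")
conn.execute(
    "INSERT INTO att.files (customer_id, name, content) VALUES (1, 'manual.pdf', ?)",
    (b"\0" * 1_000_000,),  # 1 MB of dummy bytes standing in for a real file
)
conn.commit()

# Queries can still join across the two databases.
row = conn.execute(
    "SELECT c.name, f.name, length(f.content) "
    "FROM customers c JOIN att.files f ON f.customer_id = c.id"
).fetchone()
print(row)
conn.close()

# Only the small core file needs to be copied for local testing.
print(core_path.stat().st_size, attachments_path.stat().st_size)
```

The core file stays a few kilobytes while the attachments file carries the megabyte of binary data, which is the same split as the 1GB versus 299GB case above.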
You mentioned that some of your users will be carrying laptops. In that case, they might not be connected to the LAN. If that is the case, I'd consider storing the files (and maybe metadata itself) in the cloud (EC2, Azure, Rackspace, etc...).