I would like to save the data produced by a mapper/reducer in HDInsight in a format I can easily report on, ideally a table structure (Azure Table storage). From the research I've done, it looks like the HDInsight service can only work with Azure Storage Vault (ASV) for both reading and writing. Is that correct?
I would prefer to implement the HDInsight mapper/reducer in C#.
I don't know much about Hive or Pig, and I wonder whether there is any functionality that would allow the reducer's results to be persisted to external storage (Azure tables) other than ASV.
Currently the default storage backing HDInsight is ASV. You can also store data on the 'local' HDFS filesystem of your HDInsight cluster, but that means keeping the cluster running permanently and limits you to the storage on your compute nodes, which can get very expensive.
One solution might be to sqoop the results out into something like SQL Server (or SQL Azure), depending on their size and what you plan to do with them.
Alternatively, I am working on a connector between Hive and Azure Tables. It currently allows you to read from Azure Tables into Hive (by way of an external table) and will shortly gain write support as well.
Is it possible to read a file located on a local machine path C:\data with an Azure Function triggered by an HTTP request?
You can expose a local file system to Azure using the on-premises data gateway.
However, this is supported from Logic Apps but, as far as I know, not from Functions. You could still use a Logic App as a bridge to your Function via the Azure Functions connector.
You are of course free to use your own personal computer however you like, but be aware that the on-premises data gateway exposes machines on your own network directly to the internet, which in a business context is often considered a significant security hazard. Definitely do not do this in a business setting without clearing it with IT security first.
I would say no. The resource that you want to read data from needs to be accessible from the web. Put the files in the cloud so that the function can access them.
I have an Azure-hosted ASP.NET (Web Forms) website, using Azure SQL for the database.
I need to set up an automatic nightly transfer of some of the data to a specific FTP site. The data will be in CSV format, so it's just a basic query, a CSV file created, and the file sent via FTP.
My first inclination would be to create a specific web page which does the query, creates the file, and sends it out (all in code), and then schedule this using an Azure Scheduler job collection. But I'm wondering whether there is another "best practice" method for doing this, such as Azure Data Factory, connectors, etc.?
Just wanted to get some input on what road to go down. Any help would be appreciated.
Thank you.
Firstly, as you mentioned, you can run a job/task on a schedule (Azure Scheduler, Azure WebJobs, or Azure Functions can help you achieve this) that requests that specific web page to transfer data from the Azure SQL database to the FTP server.
Secondly, Azure Logic Apps lets you use the SQL Database connector and the FTP connector to access/manage a SQL database and an FTP server, so you could try that. This SO thread discusses transferring data from a SQL database to an FTP server using Azure Logic Apps; you can refer to it.
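For the scheduled-job route, the moving parts are small enough to sketch. The following is a minimal outline in Python rather than the asker's ASP.NET stack, just to show the shape of the job; the query step (which would use a SQL Server driver such as pyodbc against the Azure SQL database) is left out, and the host, credentials, and file names are placeholders:

```python
import csv
import io
from ftplib import FTP

def rows_to_csv(headers, rows):
    """Render query results (e.g. rows fetched from Azure SQL) as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(headers)
    writer.writerows(rows)
    return buf.getvalue()

def upload_csv(host, user, password, remote_name, csv_text):
    """Push the generated CSV to the FTP site (all arguments are placeholders)."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        ftp.storbinary(f"STOR {remote_name}", io.BytesIO(csv_text.encode("utf-8")))
```

A scheduler (Azure Scheduler, a WebJob, or a timer-triggered Function) would then run these two steps nightly.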
We need to store ~50k files every year. Each file is 0.1-5 MB, which translates to a range of 5 GB - 250 GB per year. The files are jpg, avi, pdf, docx, etc.
We used to just store the file BLOBs in SQL Server, but I guess that's not the best idea in this scenario because the database will be huge within two years.
What would be the best way to store that data?
I see a lot of different options and cannot figure out where to start:
Azure Storage, Azure SQL, etc. There are also some hybrid options in newer versions of SQL Server.
I use the following approach for multiple systems.
Azure Storage for files. You can create multiple containers and access levels if storing proprietary information.
Azure CDN for serving static content from Azure Storage
Azure DB as a database engine
In Azure DB I store the path to the file, with some additional processing in my applications to build up the final URL to serve the file from. This is due to the CDN lacking SSL support on custom domains.
If you need examples or more info, just let me know. I'm sitting at an airport so providing a slightly less detailed answer.
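The "store the path in the database, build the final URL in the application" step can be sketched as a small helper. This is an illustrative sketch, not the answerer's actual code; the endpoint and container names are made up:

```python
def build_file_url(endpoint, container, blob_path):
    """Join a relative path stored in the database onto a CDN or storage endpoint.

    The application picks the endpoint at request time, e.g. falling back to
    the storage-account endpoint when HTTPS is required and the CDN custom
    domain cannot serve SSL.
    """
    return "/".join(
        [endpoint.rstrip("/"), container.strip("/"), blob_path.lstrip("/")]
    )
```

Keeping only the relative path in the database is what makes this switch possible: the same row can be served through the CDN or straight from storage.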
As @Martin mentioned in his answer, Azure Storage is a viable option, specifically because:
Access is independent of any VM, service, database, etc.
Storage is durable - triple-replicated within a region, and optionally geo-replicated
Individual blobs may be up to 200GB, with storage scaling to 500TB
Azure also provides File Service, which is essentially an SMB share sitting on top of blobs.
While there are database services in Azure (SQL Database Service and DocumentDB), you'll find that these are not really optimized for large binary data storage; they're more optimized for metadata. You can certainly store binary data in each of them, but you'll need to worry about storage limits.
You may also spin up your own database solution on Virtual Machines, along with attached disks (again backed by durable blob storage). Virtual Machines support up to 32 attached 1 TB disks per VM, whether normal blobs or "premium" SSD-based blobs (each premium disk supporting up to 5,000 IOPS). Which database solution you choose is completely up to you; there is no single "best" solution.
I am using the MongoDB worker role project to run MongoDB on Azure. I have two separate cloud services; in one of them everything works fine, but in the other, the MongoDB worker role is stuck in a Busy ("Waiting for role to start... Calling OnRoleStart") state.
I connected to one of the MongoDB worker roles and accessed the MongoDB log file and found the following error:
[rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
There are threads on how to fix this in a normal setup, but not on Windows Azure. I did not configure anything for the MongoDB worker role (apart from the Azure storage connection strings), and it works in the other service, so I don't know why it isn't working for this one. Any ideas?
Some time ago I was trying to host RavenDB on Azure as a worker role and had lots of issues with it as well.
Today, I believe it's better to run the database the "suggested" way for the target platform, which is as a Windows service according to the "Install MongoDB on Windows" guide. That way you won't have to deal with Azure-specific issues. To achieve this you can:
Use the Azure CmdLets along with CsPack.exe to create the package for MongoDB.
Use a solution similar to the RavenDB master-slave reads on Azure that I posted on GitHub.
Sign up for Virtual Machines (beta) on Azure, kick off a machine, and install MongoDB manually there.
But I guess the most important question when hosting DB is: where do you plan to store the actual DB?
Azure's CloudDrive, which is a VHD stored in cloud storage, has the worst IO performance of the three. Not sufficient for normal DB usage, I'd say.
Ephemeral storage, a cloud service's local disk space, has excellent IO, but you lose all data once the VM is deleted. This means you usually want to make continuous, or at least regular, backups to cloud storage, perhaps through CloudDrive.
An Azure VM attached disk has better IO than CloudDrive, but still not as good as ephemeral storage.
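The "regular backups off the ephemeral disk" idea amounts to archiving the data directory and shipping the archive to cloud storage. Here is a minimal local sketch of the archive step only, with illustrative names; the upload to blob storage is omitted, and in practice you would run this against mongodump output or a paused instance, not live data files:

```python
import os
import shutil
from datetime import datetime, timezone

def archive_data_dir(data_dir, backup_dir):
    """Zip a database data directory into a timestamped archive.

    The resulting .zip would then be uploaded to cloud (blob) storage by a
    separate step. Archiving live MongoDB data files is not crash-safe:
    use mongodump output, or pause writes first.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    base = os.path.join(backup_dir, f"db-backup-{stamp}")
    return shutil.make_archive(base, "zip", root_dir=data_dir)
```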
As for the actual troubleshooting of your problem: I'd suggest wrapping OnRoleStart in a try-catch, writing the exception to the log, enabling RDP on the box, and then connecting and looking into the actual issue in place. Another alternative is IntelliTrace, but you need VS Ultimate for that. Also, don't forget that Azure requires the use of Local Resources if your app needs to write to the disk.
I am trying to create a document manager for my winforms application. It is not web-based.
I would like to be able to allow users to "attach" documents to various entities (personnel, companies, work orders, tasks, batch parts etc) in my application.
After lots of research I have decided to use the file system to store the files instead of a blob in SQL. I will set up a folder to store all the files, and I will store the document information (file path, uploaded by, changed by, revision, etc.) in a parent-child relationship with the entity in a SQL database.
I only want users to be able to work with the documents through the application, to prevent the files and database records from getting out of sync. I somehow need to protect the document folder from normal users while still allowing the application to work with it. My original thought was to set the application up with the only username and password that have access to the folder, and use impersonation to log in and work with the files. From feedback in a recent thread I started, I now believe this was not a good idea, and working with impersonation has been a headache.
I also thought about using a web service, but some of our clients just run the application on their laptops with no Windows server; most are using Windows Server or Citrix/Windows Server.
What would be the best way to set this up so that only the application handles the documents?
I know you said you read about blobs, but are you aware of the FILESTREAM option in SQL Server 2008 onwards? Basically, rather than saving blobs into your database (which isn't always a good idea), you can save them to the NTFS file system using transactional NTFS. This sounds like exactly what you are trying to achieve.
All the file-access security would be handled through SQL Server (as it would be the only thing needing access to the folder), and you don't need to write your own logic for adding and removing files from the file system. To remove a file from the file system, you just delete the related record in the SQL Server table and it handles removing the file for you.
See:
http://technet.microsoft.com/en-us/library/bb933993.aspx
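As a rough sketch of what that looks like, a FILESTREAM-enabled table needs a ROWGUIDCOL uniqueidentifier column alongside the FILESTREAM varbinary column. The table and column names below are illustrative, not from the original post, and the instance and database must already have FILESTREAM enabled with a FILESTREAM filegroup:

```sql
-- Illustrative schema; assumes FILESTREAM is enabled on the instance
-- and the database has a FILESTREAM filegroup.
CREATE TABLE dbo.Documents
(
    DocumentId UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
    FileName   NVARCHAR(260)    NOT NULL,
    UploadedBy NVARCHAR(128)    NULL,
    Content    VARBINARY(MAX)   FILESTREAM NULL
);

-- Deleting the row also removes the underlying file from the NTFS store:
DELETE FROM dbo.Documents WHERE DocumentId = @Id;
```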
Option 1 (Easy): Security through Obscurity
Give everyone read (and write, as appropriate) access to your document directories. Save your document 'path' as the full URI (\\servername\dir1\dir2\dir3\file.ext) so that your users can access the files, but they're not immediately obvious to someone wandering through their mapped drives.
Option 2 (Harder): Serve the File from SQL Server
You can use either a CLR function or SQLDMO to read the file from disk, present it as a varbinary field, and reconstruct it on the client side. The upside is that your users will see a copy, not the real thing; that makes viewing safer, and editing and saving harder.
Enjoy! ;-)
I'd go with these options, in no particular order.
Create a folder on the server that's not accessible to users. Have a web service running on the server (either in IIS or as a standalone WCF app) with methods to upload and download files. The web service should manage the directory where the files are stored, and the SQL database should hold all the metadata necessary to find the documents. In this manner only your app can get at these files, so users can only see the docs via the app.
I can see that you chose to store the documents on the file system. I wrote a similar system (attachments to customers/orders/salespeople/etc.), except that I store them in SQL Server, and it actually works pretty well. I initially worried that so much data would slow down the database, but that turned out not to be the case. The only advice I can give if you take this route is to create a separate database for all your attachments. Why? Because if you want a copy of the RDBMS for your local testing, you do not want to be copying a 300 GB database made up of 1 GB of actual data and 299 GB of attachments.
You mentioned that some of your users will be carrying laptops, in which case they might not be connected to the LAN. If so, I'd consider storing the files (and maybe the metadata itself) in the cloud (EC2, Azure, Rackspace, etc.).