I am currently working on a project where we have data stored in Azure Data Lake, and the Data Lake is hooked up to Azure Databricks.
The requirement is that Azure Databricks be connected to a C# application so that queries can be run and the results retrieved entirely from the C# application. The way we are currently tackling the problem is that we have created a workspace on Databricks with a number of queries that need to be executed. We created a job that is linked to the mentioned workspace. From the C# application we are calling a number of the APIs listed in this documentation to trigger an instance of the job and wait for it to be executed. However, I have not been able to extract the result from any of the APIs listed in the documentation.
My question is this: are we taking the correct approach, or is there something we are not seeing? If this is the way to go, what has been your experience extracting the result of a successfully run job on Azure Databricks from a C# application?
Microsoft has a nice architecture reference solution that might help you get some more insights too.
I'm not sure the REST API is the best way to get your job output from Azure Databricks.
First of all, the REST API has a rate limit per Databricks instance. At 30 requests per second it's not that bad, but whether that is sufficient depends strongly on the scale of your application and any other uses of the Databricks instance. It should be enough for creating a job, but if you poll the job status until completion it might not be.
There is also limited capacity for data transfer via the REST API.
For example: as per the docs, the output API only returns the first 5 MB of a run's output. If you want larger results you'll have to store them somewhere else before retrieving them from your C# application.
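That said, if the 5 MB limit is fine for your case, pulling a run's output over the REST API from C# can look roughly like the sketch below. This is only a minimal illustration: the workspace URL, personal access token and run id are placeholders, and I'm assuming the Jobs API runs/get-output endpoint here.

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

class DatabricksOutputSample
{
    static async Task Main()
    {
        // Placeholders: replace with your workspace URL, a personal access token and a real run id.
        var workspaceUrl = "https://adb-1234567890123456.7.azuredatabricks.net";
        var token = "<personal-access-token>";
        long runId = 12345;

        using var client = new HttpClient { BaseAddress = new Uri(workspaceUrl) };
        client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", token);

        // runs/get-output returns at most the first 5 MB of the run output.
        var response = await client.GetAsync($"/api/2.1/jobs/runs/get-output?run_id={runId}");
        response.EnsureSuccessStatusCode();

        var json = await response.Content.ReadAsStringAsync();
        Console.WriteLine(json); // parse the "notebook_output" / "error" fields as needed
    }
}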
Alternative retrieval method
In short: use Azure PaaS to your advantage with Blob Storage and Event Grid.
This is in no way an exhaustive solution and I'm sure someone can come up with a better one, but it has worked for me in similar use cases.
What you can do is write the result of your job runs to some form of cloud storage connected to Databricks and then get the result from that storage location later. There is a step in this tutorial that shows the basic concept of storing the results of a job with SQL Data Warehouse, but you can use any storage you like, for example Blob storage.
Let's say you store the result in Blob storage. Each time a new job output is written to a blob, you can raise an event. You can subscribe to these events via Azure Event Grid and consume them in your application. There is a .NET SDK that will let you do this. The event will contain a blob URI that you can use to get the data into your application.
From the docs, a BlobCreated event will look something like this:
[{
  "topic": "/subscriptions/{subscription-id}/resourceGroups/Storage/providers/Microsoft.Storage/storageAccounts/my-storage-account",
  "subject": "/blobServices/default/containers/test-container/blobs/new-file.txt",
  "eventType": "Microsoft.Storage.BlobCreated",
  "eventTime": "2017-06-26T18:41:00.9584103Z",
  "id": "831e1650-001e-001b-66ab-eeb76e069631",
  "data": {
    "api": "PutBlockList",
    "clientRequestId": "6d79dbfb-0e37-4fc4-981f-442c9ca65760",
    "requestId": "831e1650-001e-001b-66ab-eeb76e000000",
    "eTag": "\"0x8D4BCC2E4835CD0\"",
    "contentType": "text/plain",
    "contentLength": 524288,
    "blobType": "BlockBlob",
    "url": "https://my-storage-account.blob.core.windows.net/testcontainer/new-file.txt",
    "sequencer": "00000000000004420000000000028963",
    "storageDiagnostics": {
      "batchId": "b68529f3-68cd-4744-baa4-3c0498ec19f0"
    }
  },
  "dataVersion": "",
  "metadataVersion": "1"
}]
It will be important to name your blobs with the required information, such as job ID and run ID. You can also create custom events, which will increase the complexity of the solution but will allow you to add more details to your events.
Once you have the BlobCreated event data in your app, you can use the storage SDK to get the blob data for use in your application. Depending on your application logic, you'll also have to manage the job and run IDs in the application, otherwise you run the risk of having job output in your storage that is no longer attached to a process in your app.
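For completeness, a minimal sketch of that consuming side could look like the following. I'm assuming the Azure.Messaging.EventGrid and Azure.Storage.Blobs packages plus DefaultAzureCredential from Azure.Identity; how the event payload actually reaches your code (webhook, Event Grid trigger, queue) depends on your subscription's endpoint.

using System;
using System.Text.Json;
using System.Threading.Tasks;
using Azure.Identity;
using Azure.Messaging.EventGrid;
using Azure.Storage.Blobs;

class BlobCreatedHandler
{
    // requestBody is the JSON payload delivered to your Event Grid subscription endpoint.
    public static async Task HandleAsync(string requestBody)
    {
        foreach (EventGridEvent evt in EventGridEvent.ParseMany(BinaryData.FromString(requestBody)))
        {
            if (evt.EventType != "Microsoft.Storage.BlobCreated")
                continue;

            // Pull the blob URI out of the event payload (see the sample event above).
            using JsonDocument data = JsonDocument.Parse(evt.Data.ToString());
            var blobUri = new Uri(data.RootElement.GetProperty("url").GetString());

            // Download the job output; DefaultAzureCredential is just one possible auth option.
            var blobClient = new BlobClient(blobUri, new DefaultAzureCredential());
            var content = await blobClient.DownloadContentAsync();
            string jobOutput = content.Value.Content.ToString();

            // TODO: correlate jobOutput with the job/run id encoded in the blob name (evt.Subject).
        }
    }
}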
Your use case is to use Databricks as a compute engine (similar to MySQL) and get the output into a C# application. The best way to do that is to create tables in Databricks and run those queries over an ODBC connection:
https://learn.microsoft.com/en-us/azure/databricks/integrations/bi/jdbc-odbc-bi
This way you have more control over the SQL query output.
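For what it's worth, a rough sketch of the ODBC route from C# using System.Data.Odbc is shown below. The driver name, host, HTTP path, token and table name are all placeholders; the real values come from your cluster's or SQL endpoint's JDBC/ODBC connection details.

using System;
using System.Data.Odbc;

class DatabricksOdbcSample
{
    static void Main()
    {
        // Placeholder connection details; copy the real ones from the cluster's JDBC/ODBC tab.
        var connectionString =
            "Driver=Simba Spark ODBC Driver;" +
            "Host=adb-1234567890123456.7.azuredatabricks.net;Port=443;" +
            "HTTPPath=sql/protocolv1/o/1234567890123456/0123-456789-abcde123;" +
            "SSL=1;ThriftTransport=2;AuthMech=3;UID=token;PWD=<personal-access-token>";

        using var connection = new OdbcConnection(connectionString);
        connection.Open();

        using var command = new OdbcCommand("SELECT * FROM my_database.my_table LIMIT 10", connection);
        using OdbcDataReader reader = command.ExecuteReader();
        while (reader.Read())
        {
            Console.WriteLine(reader[0]);
        }
    }
}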
Related
I've been tasked with creating a function app that will work with data in a Data Lake that uses Parquet Synapse Delta format files. I've done this with a Synapse notebook in the past, but getting a Spark session up and running in a function app is posing some challenges.

No matter how I try to use the builder, it keeps trying to connect to localhost for debugging, but I need it to accept a different location. I've tried passing a .Master("") location directly on the SparkSession call and I've tried building a SparkConf and setting the master location there. In both cases it tries to connect to localhost instead and, of course, immediately fails. In the case of SparkConf, it fails on the SparkConf config = new SparkConf(bool) line, and in the case of the SparkSession, it fails on the line where I try to create the session.

I've looked at all of the examples I can find and they all deal with setting up a local debuggable instance, which is not what I want or need. It needs to be self-contained to work as a function app and remotely connect to the Spark pool I've provided it.
We have multiple durable functions running in the same app service with the same storage space and are unsure how to properly configure them to use separate TaskHubs.
Originally they were all using the same task hub and we encountered a problem that the orchestration never completed (see https://github.com/Azure/azure-functions-durable-extension/issues/1748#issuecomment-805945562 for details).
According to a comment https://github.com/Azure/azure-functions-durable-extension/issues/904#issuecomment-529052862 we should be using separate task hubs, so I used the TaskHub parameter of the DurableClient binding to specify a different one for each durable function (three in total).
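For illustration, the binding looks roughly like the sketch below; the function and hub names are placeholders for my own, and the hub name can also come from hubName in host.json.

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class StartOrchestrationA
{
    // Each of the three durable functions gets its own task hub via the binding attribute.
    [FunctionName("StartOrchestrationA")]
    public static async Task<string> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestMessage req,
        [DurableClient(TaskHub = "TaskHubA")] IDurableOrchestrationClient client)
    {
        // The orchestrator/activity functions for this workflow need to run against the same task hub.
        return await client.StartNewAsync("OrchestratorA", Guid.NewGuid().ToString());
    }
}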
But now the orchestration sits in the instances table with status = 'Pending' forever, and the Durable Functions diagnostics in the portal give me the error 'The scale controller is not monitoring the correct task hub(s)!', but there's no useful information about how to fix it.
Does anyone know how to do this properly?
I am building a proof of concept application using Azure Service Fabric and would like to initialize a few 'demo' user actors in my cluster when it starts up. I've found a few brief articles that talk about loading data from a DataPackage, which shows how to load the data itself, but nothing about how to create actors from this data.
Can this be done with DataPackages or is there a better way to accomplish this?
Data packages are just opaque directories containing whatever files you want for each deployment. Service Fabric doesn't load or process the data itself; you have to do all the heavy lifting, as only your code knows what the data means. For example, if you had a data package named "SvcData", its files would be deployed during deployment. If you had a file StaticDataMaster.json in that directory, you'd be able to access it when your service ran (either in your actor, or somewhere else). For example:
// Get the data package.
var dataPkg = ServiceInitializationParameters.CodePackageActivationContext.
    GetDataPackageObject("SvcData");

// Fabric doesn't load the data, it just manages it for you; the data is opaque to Fabric.
var customDataFilePath = dataPkg.Path + @"\StaticDataMaster.json";

// TODO: read customDataFilePath, etc.
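To connect this back to the question: once you've read the file, creating the demo actors is plain code, e.g. deserializing the JSON and calling into your actor service. A rough sketch, assuming a hypothetical IDemoUserActor interface, record shape and service URI (adjust all of these to your own project):

using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Actors;
using Microsoft.ServiceFabric.Actors.Client;
using Newtonsoft.Json;

// Hypothetical shape of one record in StaticDataMaster.json.
public class DemoUser
{
    public string Id { get; set; }
    public string Name { get; set; }
}

// Hypothetical actor interface; replace with your own.
public interface IDemoUserActor : IActor
{
    Task InitializeAsync(string name);
}

public static class DemoDataSeeder
{
    public static async Task SeedAsync(string customDataFilePath)
    {
        var users = JsonConvert.DeserializeObject<List<DemoUser>>(File.ReadAllText(customDataFilePath));
        foreach (var user in users)
        {
            // Creating the proxy for a new ActorId activates the actor on first call.
            var actor = ActorProxy.Create<IDemoUserActor>(
                new ActorId(user.Id),
                new Uri("fabric:/MyApp/DemoUserActorService"));
            await actor.InitializeAsync(user.Name);
        }
    }
}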
I am an experienced windows C# developer, but new to the world of Azure, and so trying to figure out a "best practice" as I implement one or more Azure Cloud Services.
I have a number of (external, and outside of my control) sources that can all save files to a folder (or possibly a set of folders). In the current state of my system under Windows, I have a FileSystemWatcher set up to monitor a folder and raise an event when a file appears there.
In the world of Azure, what is the equivalent way to do this? Or is there?
I am aware I could create a timer (or sleep) to pass some time (say 30 seconds), and poll the folder, but I'm just not sure that's the "best" way in a cloud environment.
It is important to note that I have no control over the inputs - in other words the files are saved by an external device over which I have no control; so I can't, for example, push a message onto a queue when the file is saved, and respond to that message...
Although, in the end, that's the goal... So I intend to have a "Watcher" service which will (via events or polling) detect the presence of one or more files, and push a message onto the appropriate queue for the next step in my workflow to respond to.
It should be noted that I am using VS2015, and the latest Azure SDK stuff, so I'm not limited by anything legacy.
What I have so far is basically this (a snippet of a larger code base):
storageAccount = CloudStorageAccount.Parse(CloudConfigurationManager.GetSetting("StorageConnectionString"));

// Create a CloudFileClient object for credentialed access to File storage.
fileClient = storageAccount.CreateCloudFileClient();

// Obtain the file share name from the config file
string sharenameString = CloudConfigurationManager.GetSetting("NLRB.Scanning.FileSharename");

// Get a reference to the file share.
share = fileClient.GetShareReference(sharenameString);

// Ensure that the share exists.
if (share.Exists())
{
    Trace.WriteLine("Share exists.");

    // Get a reference to the root directory for the share.
    rootDir = share.GetRootDirectoryReference();

    // Here is where I want to start watching the folder represented by rootDir...
}
Thanks in advance.
If you're using an attached disk (or local scratch disk), the behavior would be like on any other Windows machine, so you'd just set up a file watcher accordingly with FileSystemWatcher and deal with callbacks as you normally would.
There's Azure File Service, which is SMB as-a-service and would support any actions you'd be able to do on a regular SMB volume on your local network.
There's Azure Blob storage. Blobs cannot be watched directly; you'd have to poll for changes to, say, a blob container.
You could create a loop that periodically polls the root directory using the CloudFileDirectory.ListFilesAndDirectories method:
https://msdn.microsoft.com/en-us/library/dn723299.aspx
You could also write a small recursive method to call this on subdirectories.
To detect differences you can build up an in-memory hash map of all files and directories. If you want something like a persistent distributed cache, you can use e.g. Redis to keep this list of files/directories. Every time you poll, if a file or directory is not in your list, you have detected a new file/directory under the root.
You could also separate the responsibility of detection from the business logic: one worker role keeps polling the directory and writes the new files to a queue, and on the consuming end another worker role or web service does the processing with that information. A rough sketch of the polling side follows below.
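Here is a minimal sketch of that polling loop, using the same Microsoft.WindowsAzure.Storage.File types as the snippet in the question. Error handling, recursion into subdirectories and the queue hand-off are left out, and the in-memory set could be swapped for Redis as mentioned above.

using System;
using System.Collections.Generic;
using System.Threading;
using Microsoft.WindowsAzure.Storage.File;

public class ShareWatcher
{
    private readonly HashSet<string> _knownFiles = new HashSet<string>();

    public void Poll(CloudFileDirectory rootDir, TimeSpan interval)
    {
        while (true)
        {
            foreach (IListFileItem item in rootDir.ListFilesAndDirectories())
            {
                var file = item as CloudFile;
                if (file != null && _knownFiles.Add(file.Uri.ToString()))
                {
                    // New file detected: push a message onto your queue here.
                    Console.WriteLine("New file: " + file.Uri);
                }
            }
            Thread.Sleep(interval);
        }
    }
}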
Azure Blob Storage pushes events through Azure Event Grid. Blob storage has two event types, Microsoft.Storage.BlobCreated and Microsoft.Storage.BlobDeleted. So instead of long polling you can simply react to the created event.
See this link for more information:
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blob-event-overview
I had a very similar requirement. I used the Box application, which has a webhook feature for events occurring on files or folders, such as add, move, delete, etc.
There are also some newer alternatives with Azure Automation.
I'm pretty new to Azure too, and I'm actually investigating a file-watcher type of thing myself. I'm considering something involving Azure Functions, because of this, which looks like a way of triggering some code when a blob is created or updated. There's a way of specifying a pattern too: https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-storage-blob
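Something along these lines seems to be the pattern; I haven't run this exact snippet, and the container name and path pattern below are just placeholders.

using System.IO;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class BlobWatcher
{
    // Fires whenever a blob is created or updated under the "incoming" container.
    [FunctionName("BlobWatcher")]
    public static void Run(
        [BlobTrigger("incoming/{name}")] Stream blob,
        string name,
        ILogger log)
    {
        log.LogInformation($"Blob detected: {name}, size: {blob.Length} bytes");
    }
}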
I've been working to try and convert Microsoft's EWS Streaming Notification Example to a service
( MS source http://www.microsoft.com/en-us/download/details.aspx?id=27154).
I tested it as a console app. I then used a generic service template and got it to the point it would compile, install, and start. It stops after about 10 seconds with the ubiquitous "the service on local computer started and then stopped."
So I went back in, upgraded to C# 2013 Express, and used NLog to put a bunch of log trace commands in so I could see where it was when it exited.
The last place I can find it is in the example code, in the SynchronizeChanges function:
public static void SynchronizeChanges(FolderId folderId)
{
    logger.Trace("Entering SynchronizeChanges");
    bool moreChangesAvailable;
    do
    {
        logger.Trace("Synchronizing changes...");
        //Console.WriteLine("Synchronizing changes...");

        // Get all changes since the last call. The synchronization cookie is stored in the
        // _SynchronizationState field.
        // Only the ids are requested. Additional properties should be fetched via GetItem calls.
        logger.Trace("Getting changes into var changes.");
        var changes = _ExchangeService.SyncFolderItems(folderId, PropertySet.IdOnly, null, 512,
            SyncFolderItemsScope.NormalItems,
            _SynchronizationState);

        // Update the synchronization cookie
        logger.Trace("Updating _SynchronizationState");
The log file shows the trace message "Getting changes into var changes." but not the "Updating _SynchronizationState" message,
so it never gets past var changes = _ExchangeService.SyncFolderItems.
I cannot for the life of me figure out why it's just exiting. There are many examples of EWS streaming notifications; I have 3 that compile and run just fine, but as far as I can tell nobody has posted an example of it done as a service.
If you don't see the "Updating..." message it's likely the sync threw an exception. Wrap it in a try/catch.
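Roughly like this, keeping your existing trace calls; the state and flag assignments at the end are just the usual continuation of that loop and may differ from your actual code.

logger.Trace("Getting changes into var changes.");
try
{
    var changes = _ExchangeService.SyncFolderItems(folderId, PropertySet.IdOnly, null, 512,
        SyncFolderItemsScope.NormalItems,
        _SynchronizationState);

    // Update the synchronization cookie
    logger.Trace("Updating _SynchronizationState");
    _SynchronizationState = changes.SyncState;

    moreChangesAvailable = changes.MoreChangesAvailable;
}
catch (Exception ex)
{
    // Surface the real failure in the log instead of letting the service die silently.
    logger.Error(ex, "SyncFolderItems failed");
    throw;
}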
OK, so now that I see the error, this looks like your garden-variety permissions problem. When you ran this as a console app, you likely presented the default credentials to Exchange, which were for your login ID. For a Windows service, if you're running the service with one of the built-in accounts (e.g. Local System), your default credentials will not have access to Exchange.
To rectify, either (1) run the service under the same account you used for the console app, or (2) add explicit credentials to the ExchangeService object.
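For option (2), that's just a matter of setting credentials on the ExchangeService instance before connecting. A minimal sketch (the address, password and Exchange version are placeholders):

using Microsoft.Exchange.WebServices.Data;

// Explicit credentials instead of the service account's default Windows credentials.
var service = new ExchangeService(ExchangeVersion.Exchange2010_SP2);
service.Credentials = new WebCredentials("user@yourdomain.com", "password");
service.AutodiscoverUrl("user@yourdomain.com", url => url.StartsWith("https://"));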