I am building a proof-of-concept application using Azure Service Fabric and would like to initialize a few 'demo' user actors in my cluster when it starts up. I've found a few brief articles about loading data from a DataPackage, which show how to load the data itself, but nothing about how to create actors from that data.
Can this be done with DataPackages or is there a better way to accomplish this?
Data packages are just opaque directories containing whatever files you want for each deployment. Service Fabric doesn't load or process the data itself; you have to do all the heavy lifting, since only your code knows what the data means. For example, if you had a data package named "SvcData", Service Fabric would deploy the files in that package during deployment. If that package contained a file StaticDataMaster.json, you'd be able to access it when your service runs (either in your actor, or somewhere else). For example:
// Get the data package.
var dataPkg = ServiceInitializationParameters.CodePackageActivationContext
    .GetDataPackageObject("SvcData");

// Fabric doesn't load the data; it just manages it for you. The data is opaque to Fabric.
var customDataFilePath = dataPkg.Path + @"\StaticDataMaster.json";

// TODO: read customDataFilePath, etc.
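If the goal is to seed 'demo' actors from that file, one approach is to deserialize the JSON and call into each actor through an ActorProxy once the service is running. The sketch below is only illustrative: the IUserActor interface, its InitializeAsync method, the DemoUser class and the fabric:/MyApp application name are assumptions, not part of the original answer.

// Minimal sketch: create 'demo' user actors from the JSON in the data package.
// IUserActor, InitializeAsync, DemoUser and the application URI are hypothetical.
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Actors;
using Microsoft.ServiceFabric.Actors.Client;
using Newtonsoft.Json;

public interface IUserActor : IActor
{
    Task InitializeAsync(string name);
}

public class DemoUser
{
    public string Id { get; set; }
    public string Name { get; set; }
}

public static class DemoDataSeeder
{
    public static async Task SeedAsync(string customDataFilePath)
    {
        // Read and deserialize the static data that Service Fabric deployed with the data package.
        var json = File.ReadAllText(customDataFilePath);
        var users = JsonConvert.DeserializeObject<List<DemoUser>>(json);

        foreach (var user in users)
        {
            // Resolve (or create on first use) the actor with a deterministic ActorId.
            var proxy = ActorProxy.Create<IUserActor>(new ActorId(user.Id), "fabric:/MyApp");

            // A hypothetical method on the actor interface that stores the demo state.
            await proxy.InitializeAsync(user.Name);
        }
    }
}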
I've been tasked with creating a function app that will work with data in a Data Lake that uses Parquet files in Synapse Delta format. I've done this with a Synapse notebook in the past, but getting a Spark session up and running in a function app is posing some challenges.
No matter how I use the builder, it keeps trying to connect to localhost for debugging, but I need it to accept a different location. I've tried passing a .Master("") location directly on the SparkSession call, and I've tried building a SparkConf and setting the master location there. In both cases it tries to connect to localhost and, of course, immediately fails: with SparkConf it fails on the SparkConf config = new SparkConf(bool) line, and with SparkSession it fails on the line where I try to create the session.
All of the examples I can find deal with setting up a locally debuggable instance, which is not what I want or need. It needs to be self-contained to work as a function app and connect remotely to the Spark pool I've provided for it.
I am currently working on a project where we have data stored on Azure Data Lake. The Data Lake is hooked up to Azure Databricks.
The requirement is that Azure Databricks be connected to a C# application so that queries can be run and the results retrieved entirely from the C# application. The way we are currently tackling the problem is that we have created a workspace on Databricks with a number of queries that need to be executed, and a job linked to that workspace. From the C# application we call a number of the APIs listed in this documentation to trigger a run of the job and wait for it to be executed. However, I have not been able to extract the result from any of the APIs listed in the documentation.
My question is this: are we taking the correct approach, or is there something we are not seeing? If this is the way to go, what has been your experience in extracting the result of a successfully run job on Azure Databricks from a C# application?
Microsoft has a nice architecture reference solution that might help you get some more insights too.
I'm not sure using the REST API is the best way to get your job output from Azure Databricks.
First of all, the REST API has a rate limit per Databricks instance. At 30 requests per second it's not that bad, but whether it's sufficient depends strongly on the scale of your application and any other uses of the Databricks instance. It should be enough for creating a job, but if you want to poll the job status for completion it might not be.
There is also limited capacity for data transfer via the REST API.
For example: as per the docs, the output API only returns the first 5 MB of a run's output. If you want larger results you'll have to store them somewhere else before fetching them from your C# application.
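For reference, fetching a run's output over the REST API from C# looks roughly like the sketch below; the workspace URL, personal access token and run ID are placeholders, and the response is subject to the 5 MB truncation mentioned above.

// Rough sketch: fetch a run's output via the Jobs REST API (truncated at 5 MB).
// The workspace URL, personal access token and run ID are placeholders.
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public static class DatabricksRunOutput
{
    public static async Task<string> GetRunOutputAsync(long runId)
    {
        using (var client = new HttpClient())
        {
            client.BaseAddress = new Uri("https://<your-workspace>.azuredatabricks.net");
            client.DefaultRequestHeaders.Authorization =
                new AuthenticationHeaderValue("Bearer", "<personal-access-token>");

            // jobs/runs/get-output returns the run result as JSON.
            var response = await client.GetAsync($"/api/2.0/jobs/runs/get-output?run_id={runId}");
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}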
Alternative retrieval method
In short: use Azure PaaS to your advantage with Blob Storage and Event Grid.
This is in no way an exhaustive solution, and I'm sure someone can come up with a better one, but it has worked for me in similar use cases.
What you can do is write the result of your job runs to some form of cloud storage connected to Databricks, and then fetch the result from that storage location later. There is a step in this tutorial that shows the basic concept of storing the results of a job in SQL Data Warehouse, but you can use any storage you like, for example Blob Storage.
Let's say you store the result in Blob Storage. Each time a new job output is written to a blob, you can raise an event. You can subscribe to these events via Azure Event Grid and consume them in your application; there is a .NET SDK that will let you do this. The event will contain a blob URI that you can use to get the data into your application.
From the docs, a BlobCreated event will look something like this:
[{
  "topic": "/subscriptions/{subscription-id}/resourceGroups/Storage/providers/Microsoft.Storage/storageAccounts/my-storage-account",
  "subject": "/blobServices/default/containers/test-container/blobs/new-file.txt",
  "eventType": "Microsoft.Storage.BlobCreated",
  "eventTime": "2017-06-26T18:41:00.9584103Z",
  "id": "831e1650-001e-001b-66ab-eeb76e069631",
  "data": {
    "api": "PutBlockList",
    "clientRequestId": "6d79dbfb-0e37-4fc4-981f-442c9ca65760",
    "requestId": "831e1650-001e-001b-66ab-eeb76e000000",
    "eTag": "\"0x8D4BCC2E4835CD0\"",
    "contentType": "text/plain",
    "contentLength": 524288,
    "blobType": "BlockBlob",
    "url": "https://my-storage-account.blob.core.windows.net/testcontainer/new-file.txt",
    "sequencer": "00000000000004420000000000028963",
    "storageDiagnostics": {
      "batchId": "b68529f3-68cd-4744-baa4-3c0498ec19f0"
    }
  },
  "dataVersion": "",
  "metadataVersion": "1"
}]
It will be important to name your blobs with the required information, such as the job ID and run ID. You can also create custom events, which will increase the complexity of the solution but allow you to add more detail to each event.
Once you have the BlobCreated event data in your app, you can use the storage SDK to get the blob data for use in your application. Depending on your application logic, you'll also have to manage the job IDs and run IDs in the application, otherwise you risk having job output in your storage that is no longer attached to any process in your app.
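To make that last step concrete, a minimal sketch of pulling the blob once you have the url field from the event payload might look like this; it assumes the Azure.Storage.Blobs and Azure.Identity packages, and ProcessBlobCreatedAsync is a hypothetical entry point wired to your Event Grid subscription.

// Minimal sketch: once a BlobCreated event arrives, pull the job output from the blob.
// ProcessBlobCreatedAsync and the use of DefaultAzureCredential are assumptions;
// any credential or connection string that fits your setup will do.
using System;
using System.Threading.Tasks;
using Azure.Identity;
using Azure.Storage.Blobs;

public static class JobOutputReader
{
    public static async Task<string> ProcessBlobCreatedAsync(string blobUrl)
    {
        // The "url" field from the BlobCreated event payload points straight at the blob.
        var blobClient = new BlobClient(new Uri(blobUrl), new DefaultAzureCredential());

        // Download the job output written by the Databricks run.
        var result = await blobClient.DownloadContentAsync();
        return result.Value.Content.ToString();
    }
}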
Your use case is to use Databricks as a compute engine (something similar to MySQL) and get the output into a C# application. So the best way is to create tables in Databricks and run those queries via an ODBC connection:
https://learn.microsoft.com/en-us/azure/databricks/integrations/bi/jdbc-odbc-bi
This way you have more control over the SQL query output.
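To illustrate, running such a query over ODBC from C# could look roughly like the sketch below; the driver name, host, HTTP path and token are placeholders that you would take from your cluster's JDBC/ODBC configuration page.

// Rough sketch: query a Databricks table over ODBC from C#.
// Driver name, host, HTTP path and token are placeholders from your cluster configuration.
using System;
using System.Data.Odbc;

public static class DatabricksOdbcExample
{
    public static void Run()
    {
        var connectionString =
            "Driver={Simba Spark ODBC Driver};" +
            "Host=<workspace-host>.azuredatabricks.net;Port=443;SSL=1;ThriftTransport=2;" +
            "HTTPPath=<http-path-from-cluster-config>;" +
            "AuthMech=3;UID=token;PWD=<personal-access-token>";

        using (var connection = new OdbcConnection(connectionString))
        using (var command = new OdbcCommand("SELECT * FROM my_results_table LIMIT 10", connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Consume the query output row by row in the C# application.
                    Console.WriteLine(reader[0]);
                }
            }
        }
    }
}

Because the output comes back as a normal result set, you can filter or page it in SQL instead of working around a job-output size limit.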
I have a web app for different customer DBs which runs on several application servers (AS). The customers are each assigned to an instance on an AS.
For several of the customers, certain data is saved in additional SQLite DBs on the application servers themselves; when this kind of data is added, the web app tests whether the corresponding SQLite DB already exists on this AS and, if not, creates it using the following code:
dbFileName = "C:\\" + dbFileName;
SQLiteConnection.CreateFile(dbFileName);
using (System.Data.SQLite.SQLiteConnection con = new System.Data.SQLite.SQLiteConnection("data source=" + dbFileName))
{
using (System.Data.SQLite.SQLiteCommand com = new System.Data.SQLite.SQLiteCommand(con))
{
con.Open();
The problem is that if I assign the customer to an instance on another AS, the SQLite DB has to be created again, since the new AS can't access the one on the other AS.
My idea was to create the SQLite DBs on some Azure storage, where I could access them from every AS, but so far I'm not able to access them via a SQLiteConnection.
I know my specific SAS (Shared Access Signature) and connection strings like the ones specified on https://www.connectionstrings.com/windows-azure/, but I'm not sure which part of them I should use for the SQLiteConnection.
Is it even possible?
The only examples of connections to Azure storage that I have found so far use HTTP requests (like How to access Azure blob using SAS in C#), which doesn't help me. Can anybody show me a way to use this for my problem?
Please tell me if you need more information, I'm kind of bad at explaining things and not taking into account that many things aren't common knowledge...
You cannot use Azure Blob storage as a normal file system; the data source in a SQLite connection string must be a file path or :memory:.
If you want to use Azure Blob storage, as far as I know you can only mount it as a file system with blobfuse on Linux, and even that is not 100% compatible with a normal file system.
Another choice is Azure Files, which supports the SMB protocol. You can mount it as a network drive and use that.
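If you go the Azure Files route, the only change to the code in the question is the path in the connection string. Below is a rough sketch, assuming the share is already mounted on the application server (for example with net use and the storage account key); the UNC path, share name and table are placeholders.

// Sketch: point the existing SQLite code at a path on a mounted Azure Files share.
// The UNC path and file name are placeholders; the share must already be mounted on the AS.
using System.Data.SQLite;

public static class SharedSqliteDb
{
    public static void EnsureDatabase(string dbFileName)
    {
        // UNC path into the Azure Files share instead of a local C:\ path.
        string dbFilePath = @"\\<account>.file.core.windows.net\<share>\" + dbFileName;

        if (!System.IO.File.Exists(dbFilePath))
        {
            SQLiteConnection.CreateFile(dbFilePath);
        }

        using (var con = new SQLiteConnection("data source=" + dbFilePath))
        using (var com = new SQLiteCommand("CREATE TABLE IF NOT EXISTS CustomerData (Id INTEGER PRIMARY KEY, Payload TEXT)", con))
        {
            con.Open();
            com.ExecuteNonQuery();
        }
    }
}

Note that SQLite over SMB relies on network file locking, which can be fragile with concurrent writers, so this fits low-concurrency scenarios best.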
We have approximately 100 microservices running. Each microservice has an entire set of configuration files such as applicationmanifest.xml, settings.xml, node1.xml, etc.
This is getting to be a configuration nightmare.
After exploring this, someone has suggested:
You can keep configs inside stateful service, then change parameters through your API.
The problem I see with this is that there is now a single point of failure: the service that provides the configuration values.
Is there a centralized solution to maintaining so much configuration data for every microservice?
While a central configuration service seems like the way to go, if you do it you introduce a few problems that you must get right every time. When you have a central configuration service, it MUST be updated with the correct configuration before you start your code upgrade, and you must of course keep previous configurations around in case your deployment rolls back. Here's the configuration slide that I presented when I was on the Service Fabric team.
Service Fabric ships with the ability to version configuration; you should use that, but not in the manner that Service Fabric recommends. For my projects, I use Microsoft.Extensions.Configuration. Capture the configuration events:
context.CodePackageActivationContext.ConfigurationPackageAddedEvent += CodePackageActivationContext_ConfigurationPackageAddedEvent;
context.CodePackageActivationContext.ConfigurationPackageModifiedEvent += CodePackageActivationContext_ConfigurationPackageModifiedEvent;
context.CodePackageActivationContext.ConfigurationPackageRemovedEvent += Context_ConfigurationPackageRemovedEvent;
Each of these event handlers can reload the configuration like this:
protected IConfigurationRoot LoadConfiguration()
{
    ConfigurationBuilder builder = new ConfigurationBuilder();

    // Get the name of the environment this service is running within.
    EnvironmentName = Environment.GetEnvironmentVariable(EnvironmentVariableName);
    if (string.IsNullOrWhiteSpace(EnvironmentName))
    {
        var err = $"Environment is not defined using '{EnvironmentVariableName}'.";
        _logger.Fatal(err);
        throw new ArgumentException(err);
    }

    // Enumerate the configuration packages. Look for the service type name, service name or settings.
    IList<string> names = Context?.CodePackageActivationContext?.GetConfigurationPackageNames();
    if (null != names)
    {
        foreach (string name in names)
        {
            if (name.Equals(GenericStatelessService.ConfigPackageName, StringComparison.InvariantCultureIgnoreCase))
            {
                var newPackage = Context.CodePackageActivationContext.GetConfigurationPackageObject(name);

                // Set the base path to the configuration directory, then add the JSON files for the service name and the service type name.
                builder.SetBasePath(newPackage.Path)
                    .AddJsonFile($"{ServiceInstanceName}-{EnvironmentName}.json", true, true)
                    .AddJsonFile($"{Context.ServiceTypeName}-{EnvironmentName}.json", true, true);

                // Load the settings into memory.
                builder.AddInMemoryCollection(LoadSettings(newPackage));
            }
        }
    }

    // Swap in a new configuration.
    return builder.Build();
}
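For completeness, a handler for one of those events can simply rebuild and swap the configuration root. This is just a sketch; the _configuration field and the handler name are assumptions, and PackageModifiedEventArgs&lt;ConfigurationPackage&gt; comes from System.Fabric.

private IConfigurationRoot _configuration; // assumed field holding the current configuration root

private void CodePackageActivationContext_ConfigurationPackageModifiedEvent(
    object sender, PackageModifiedEventArgs<ConfigurationPackage> e)
{
    // Rebuild from the modified package and swap in the new configuration root.
    _configuration = LoadConfiguration();
}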
You can now interact with the configuration through the standard .NET configuration APIs. The last thing to cover is the format of the configuration files. In the PackageRoot\Config directory, you simply include your configuration files; I happen to name them after the service plus the data center.
Internally the files look like this, with a JSON property for each of the service's classes:
{
  "Logging": {
    "SeqUri": "http://localhost:5341",
    "MaxFileSizeMB": "100",
    "DaysToKeep": "1",
    "FlushInterval": "00:01:00",
    "SeqDefaultLogLevel": "Verbose",
    "FileDefaultLogLevel": "Verbose"
  },
  "ApplicationOperations": {
    "your values here": "<values>"
  }
}
If you've stuck with me this long: the big advantage of this is that the configuration is deployed at the same time as the code, and if the code rolls back, so does the configuration, leaving you in a known state.
NOTE: Your question seems to blur two things together: whether a single configuration service is reliable, and whether to use static vs. dynamic configuration.
For the debate on static vs dynamic configuration, see my answer to the OP's other question.
A config service sounds reasonable, particularly when you consider that Service Fabric is designed to be reliable, even for stateful services.
MSDN:
Service Fabric enables you to build and manage scalable and reliable applications composed of microservices that run at high density on a shared pool of machines, which is referred to as a cluster
Develop highly reliable stateless and stateful microservices. Tell me more...
Stateful services store state in a reliable distributed dictionary, enclosed in a transaction that guarantees the data is stored if the transaction was successful.
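For example, a configuration-holding stateful service might persist values roughly like this (a sketch using the reliable collections API from Microsoft.ServiceFabric.Data.Collections, run inside a StatefulService method; the dictionary name, key and value are illustrative):

// Sketch: storing a configuration value in a reliable dictionary inside a stateful service.
var configStore = await this.StateManager
    .GetOrAddAsync<IReliableDictionary<string, string>>("configStore");

using (var tx = this.StateManager.CreateTransaction())
{
    await configStore.AddOrUpdateAsync(tx, "Logging:SeqUri", "http://localhost:5341",
        (key, oldValue) => "http://localhost:5341");

    // The data is only persisted (and replicated) if the transaction commits successfully.
    await tx.CommitAsync();
}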
OP:
The problem I see with this, is that there is now a single point of a failure: the service that provides the configuration values.
Not necessarily. It's not really the service that is the single point of failure but a "fault domain" as defined by Service Fabric and your chosen Azure data centre deployment options.
MSDN:
A Fault Domain is any area of coordinated failure. A single machine is a Fault Domain (since it can fail on its own for various reasons, from power supply failures to drive failures to bad NIC firmware). Machines connected to the same Ethernet switch are in the same Fault Domain, as are machines sharing a single source of power or in a single location. Since it's natural for hardware faults to overlap, Fault Domains are inherently hierarchical and are represented as URIs in Service Fabric.
It is important that Fault Domains are set up correctly since Service Fabric uses this information to safely place services. Service Fabric doesn't want to place services such that the loss of a Fault Domain (caused by the failure of some component) causes a service to go down. In the Azure environment Service Fabric uses the Fault Domain information provided by the environment to correctly configure the nodes in the cluster on your behalf. For Service Fabric Standalone, Fault Domains are defined at the time that the cluster is set up.
So you would probably want to have at least two configuration services running on two separate fault domains.
More
Describing a service fabric cluster
I am an experienced Windows C# developer, but new to the world of Azure, and so trying to figure out a "best practice" as I implement one or more Azure Cloud Services.
I have a number of (external, and outside of my control) sources that can all save files to a folder (or possibly a set of folders). In the current state of my system under Windows, I have a FileSystemWatcher set up to monitor a folder and raise an event when a file appears there.
In the world of Azure, what is the equivalent way to do this? Or is there?
I am aware I could create a timer (or sleep) to pass some time (say 30 seconds), and poll the folder, but I'm just not sure that's the "best" way in a cloud environment.
It is important to note that I have no control over the inputs - in other words the files are saved by an external device over which I have no control; so I can't, for example, push a message onto a queue when the file is saved, and respond to that message...
Although, in the end, that's the goal... So I intend to have a "Watcher" service which will (via events or polling) detect the presence of one or more files, and push a message onto the appropriate queue for the next step in my workflow to respond to.
It should be noted that I am using VS2015, and the latest Azure SDK stuff, so I'm not limited by anything legacy.
What I have so far is basically this (a snippet of a larger code base):
storageAccount = CloudStorageAccount.Parse(CloudConfigurationManager.GetSetting("StorageConnectionString"));

// Create a CloudFileClient object for credentialed access to File storage.
fileClient = storageAccount.CreateCloudFileClient();

// Obtain the file share name from the config file.
string sharenameString = CloudConfigurationManager.GetSetting("NLRB.Scanning.FileSharename");

// Get a reference to the file share.
share = fileClient.GetShareReference(sharenameString);

// Ensure that the share exists.
if (share.Exists())
{
    Trace.WriteLine("Share exists.");

    // Get a reference to the root directory for the share.
    rootDir = share.GetRootDirectoryReference();

    // Here is where I want to start watching the folder represented by rootDir...
}
Thanks in advance.
If you're using an attached disk (or local scratch disk), the behavior would be like on any other Windows machine, so you'd just set up a file watcher accordingly with FileSystemWatcher and deal with callbacks as you normally would.
There's Azure File Service, which is SMB as-a-service and would support any actions you'd be able to do on a regular SMB volume on your local network.
There's Azure blob storage. Blobs cannot be watched; you'd have to poll for changes to, say, a blob container.
You could create a loop that polls the root directory periodically using the CloudFileDirectory.ListFilesAndDirectories method:
https://msdn.microsoft.com/en-us/library/dn723299.aspx
You could also write a small recursive method to call this in subdirectories.
To detect differences you can build up an in-memory hash map of all files and directories. If you want something like a persistent distributed cache, you can use e.g. Redis to keep this list of files/directories. Every time you poll, if a file or directory is not in your list then you have detected a new file/directory under the root.
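As a rough illustration of the polling approach, reusing the CloudFileDirectory (rootDir) from the question's snippet; the 30-second interval and the Trace call are placeholders for your real hand-off, e.g. a queue message.

// Sketch: poll the share's root directory and report files not seen before.
// Assumes rootDir from the snippet above plus System.Collections.Generic,
// System.Threading, System.Diagnostics and Microsoft.WindowsAzure.Storage.File.
var seenFiles = new HashSet<string>();

while (true)
{
    foreach (var item in rootDir.ListFilesAndDirectories())
    {
        var file = item as CloudFile;
        if (file != null && seenFiles.Add(file.Name))
        {
            // New file detected - push it to the next step in the workflow.
            Trace.WriteLine("New file detected: " + file.Name);
        }
    }

    Thread.Sleep(TimeSpan.FromSeconds(30));
}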
You could separate the responsibility of detection from the business logic: e.g. one worker role keeps polling the directory and writes the new files to a queue, and the consumer is another worker role or web service that does the processing with that information.
Azure Blob Storage pushes events through Azure Event Grid. Blob storage has two event types, Microsoft.Storage.BlobCreated and Microsoft.Storage.BlobDeleted. So instead of long polling you can simply react to the created event.
See this link for more information:
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blob-event-overview
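As a hedged illustration, an Event Grid triggered function reacting to those events might look like the sketch below; the function name is illustrative, and depending on the extension version the EventGridEvent type comes from Azure.Messaging.EventGrid or Microsoft.Azure.EventGrid.Models.

// Sketch of an Event Grid triggered function reacting to BlobCreated events.
// Requires the Microsoft.Azure.WebJobs.Extensions.EventGrid package.
using Azure.Messaging.EventGrid;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.EventGrid;
using Microsoft.Extensions.Logging;

public static class BlobCreatedHandler
{
    [FunctionName("BlobCreatedHandler")]
    public static void Run([EventGridTrigger] EventGridEvent eventGridEvent, ILogger log)
    {
        // The subject contains the container and blob name; the data payload holds the blob URL.
        log.LogInformation($"{eventGridEvent.EventType}: {eventGridEvent.Subject}");
    }
}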
I had a very similar requirement. I used the Box application, which has a webhook feature for events occurring on files or folders, such as add, move, delete, etc.
There are also some newer alternatives with Azure Automation.
I'm pretty new to Azure too, and I'm actually investigating a file-watcher type of thing. I'm considering something involving Azure Functions because of this, which looks like a way of triggering some code when a blob is created or updated. There's a way of specifying a pattern too: https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-storage-blob
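For what it's worth, a blob-triggered function along the lines of that documentation might look like the sketch below; the container name "incoming-files" and the function name are placeholders.

// Sketch of an Azure Function that fires when a blob lands in a container.
// The container name "incoming-files" and the function name are placeholders.
using System.IO;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class BlobWatcherFunction
{
    [FunctionName("BlobWatcher")]
    public static void Run(
        [BlobTrigger("incoming-files/{name}")] Stream blob,
        string name,
        ILogger log)
    {
        // React to the newly created/updated blob, e.g. enqueue a message for the next step.
        log.LogInformation($"Blob detected: {name}, size: {blob.Length} bytes");
    }
}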