Model Developer's Guide: Python Models

Introduction

This guide walks through the process of developing a Python-based model for the Analysis Services environment.

As a worked example, it develops a model that computes the weighted mean of a number of data streams.

Prior Reading

Although not essential, there are some other documents that are beneficial to read before embarking on this tutorial:

The Analysis Services API Tutorial gives an overview of the application programming interface (API) for the Analysis Services environment.
The Model Developer's Guide introduces the general concepts that are useful for all developers of models for the Analysis Services environment (i.e. not just those developing models in Python).

Development Environment

There are a number of Python libraries that are required to develop models for the Analysis Services environment. The as_models package, available here, is the core library which is used to interface your model with the Analysis Services system.

In addition to the core as_models package, there are four other optional libraries which may be needed to interact with other components of the Senaps environment:

The senaps_sensor package, available here, is a Python client library for the core Senaps API. It is required for handling "stream" and "multistream" input and output ports, or for making general requests to the Senaps API.
The as_client package, available here, is a Python client library for the Analysis Services API. It is required for interacting with the Senaps Analysis Service API.
The tds_client package, available here, is a Python client library for the Thredds Data Server (TDS). It is required for handling "grid" input ports, or for making requests to TDS servers.
The tds_upload package, available here, is a Python client library for uploading data to the Senaps environment's TDS server. It is required for handling "grid" output ports.

These are all distributed with a setup.py script, and can be installed with one of the following commands, depending on your platform:

sudo python3 setup.py install # Linux
python3 setup.py install # Windows

All the required dependencies should be automatically installed. There are, however, some optional dependencies that should be installed if using extra functionality:

Pandas is required if using the Sensor Data client's PandasObservationParser class.
netCDF4 is required if using the TDS client's NetCDF Subset Service.

The Python Code

The following Python code defines a "multivariate mean" model:

from senaps_sensor.parsers import PandasObservationParser
from senaps_sensor.models import Observation, UnivariateResult

import json, pandas as pd

def multivariate_mean(context):
    input_stream_ids = [p.stream_id for p in context.ports['inputs']]
    output_stream_id = context.ports['output'].stream_id

    weights_doc = getattr(context.ports.get('weights'), 'value', None) or '{}'
    weights = json.loads(weights_doc)

    # Obtain observation data from `inputs` streams.
    context.update(message='Loading data...')
    streamids = ','.join(input_stream_ids)
    parser = PandasObservationParser()
    limit = 10000
    data = start = None
    while True:
        segment = context.sensor_client.get_observations(streamid=streamids, media='csv', limit=limit, start=start, si=False, parser=parser)
        data = pd.concat((data, segment))

        if len(segment) < limit:
            break

        start = segment.index[-1].strftime('%Y-%m-%dT%H:%M:%S.%fZ')

    # Apply weights to the data.
    context.update(message='Weighting data...')
    for stream_id, weight in weights.items():
        data[stream_id] *= weight
    sum_of_weights = sum(weights.get(stream_id, 1.0) for stream_id in input_stream_ids)

    # Compute mean and store in `output` stream.
    mean = data.sum(axis=1) / sum_of_weights
    output = Observation()
    output.results = [UnivariateResult(t=t.strftime('%Y-%m-%dT%H:%M:%S.%fZ'),v=v) for t,v in zip(mean.index, mean)]
    context.sensor_client.create_observations(output, streamid=output_stream_id)

The model functions as follows:

Lines 1 and 2 import a few classes from the Python client library (senaps_sensor) which are used to interact with the Sensor Data API's observations endpoint.
Line 4 imports the json package, used for parsing configuration inputs for the model, and the pandas package, used for manipulating time-series data.
Line 6 declares a function (multivariate_mean) that is the implementation of the multivariate_mean model.
Lines 7 through 11 read the input and output stream IDs and the weight values from the model's ports:
- In line 7, the IDs of the input streams are read from the model's inputs port, a multi-stream port (see "The Manifest" below).
- In line 8, the ID of the output stream is read from the model's output port, a stream port (see "The Manifest" below).
- In line 10, a JSON document is read from the model's weights port, a document port (see "The Manifest" below). If the port is missing or contains an empty string, an empty JSON object is used by default.
- In line 11, the JSON document read on the preceding line is parsed.
Line 14 uses the context's update() method to provide a status update (see "The Context Object" below).
Lines 15 through 26 obtain a Pandas DataFrame object representing the observations in the input streams:
- To interact with the Sensor Data API, the pre-configured API client contained in the context's sensor_client property is used (see "The Context Object" below).
- The observation data is obtained through repeated requests to the Sensor Data observations API, each downloading a separate temporal segment of the complete dataset, as follows:
  - The streamids variable is a comma-separated list of the streams to download observations for.
  - The parser variable contains an instance of PandasObservationParser, used to convert the returned observation data to a Pandas DataFrame object.
  - The limit variable specifies that each segment shall contain at most 10,000 observations.
  - The data variable is used to accumulate the observation data.
  - The start variable tracks the starting timestamp of the current temporal window. The initial value of None causes the algorithm to begin at the start of the observation data.
  - The algorithm proceeds as follows:
    - A segment of data is downloaded from the Sensor Data API by using the get_observations method of the Sensor Data client object.
    - The newly acquired data is concatenated to whatever existing data is contained in the data variable.
    - If the newly acquired data is less than 10,000 observations in length, this implies that the end of the data has been reached, and the algorithm terminates.
    - Otherwise, the start variable is updated such that the next segment will begin after the timestamp of the last observation in the current segment.
- The returned DataFrame is indexed by the observation timestamp, and has one column per input stream (in the same order as given to the get_observations method's streamid parameter).
Lines 30 and 31 apply the weights to each column of the DataFrame:
- Line 30 iterates over each stream ID/weight pair.
- Line 31 selects the column corresponding to the current stream ID, then multiplies it by the weight.
Line 32 computes the sum of the weights by iterating over each of the stream IDs for the input streams, retrieving the corresponding weight (defaulting to 1.0 if no weight is given), and computing their sum.
Line 35 computes the mean by summing each row of the DataFrame, then dividing by the sum of the weights.
Lines 36 through 38 store the mean in the output data stream:
- Lines 36 and 37 create an Observation object of the Sensor Data client library, and set its results property to a list of UnivariateResult objects, each representing a single observation in the output stream.
- Line 38 uses the pre-configured Sensor Data API client contained in the context's sensor_client property (see "The Context Object" below) to upload those observations into the output stream.
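
The download loop and the weighting arithmetic described above can be tried without any Senaps dependency. The following is a minimal sketch only: fetch_all, weighted_mean and fake_fetch_page are hypothetical names introduced for illustration, the data is made up, and the stand-in treats start as exclusive, which is an assumption about how the real API behaves.

```python
def fetch_all(fetch_page, limit):
    """Accumulate pages until one comes back shorter than `limit`."""
    rows, start = [], None
    while True:
        segment = fetch_page(start=start, limit=limit)
        rows.extend(segment)
        if len(segment) < limit:
            break  # a short page means the end of the data was reached
        # Resume from the last timestamp seen (assumed exclusive here).
        start = segment[-1][0]
    return rows


def weighted_mean(rows, stream_ids, weights):
    """Weighted mean per row; streams without a weight default to 1.0."""
    sum_of_weights = sum(weights.get(s, 1.0) for s in stream_ids)
    result = []
    for timestamp, values in rows:
        total = sum(values[s] * weights.get(s, 1.0) for s in stream_ids)
        result.append((timestamp, total / sum_of_weights))
    return result


# Hypothetical in-memory data: (timestamp, {stream_id: value}) tuples.
DATA = [(t, {'a': float(t), 'b': 10.0}) for t in range(5)]


def fake_fetch_page(start, limit):
    """Stand-in for the observations request; not the real client."""
    later = [row for row in DATA if start is None or row[0] > start]
    return later[:limit]


rows = fetch_all(fake_fetch_page, limit=2)
means = weighted_mean(rows, ['a', 'b'], {'a': 3.0})
```

With a weight of 3.0 on stream a and the default 1.0 on stream b, each row's weighted sum is divided by 4.0, mirroring the division by sum_of_weights in the model.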

Model Execution

In order to execute the model, the Analysis Services system needs to be able to locate the multivariate_mean function and execute it, passing in the context object.

The mechanism by which this is achieved is quite simple: when the system receives a request to execute the model, it loads the Python file as a module, then looks for a callable (function or object with a __call__ method) that has the same name as the model (in this case, multivariate_mean). It then calls the callable, passing in a "context" object (see the next section for details) tailored to the particular execution request.
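
The lookup described above can be illustrated with plain Python. This is a sketch under stated assumptions, not the Analysis Services loader itself: find_model_callable is a hypothetical helper, and the module is built in memory rather than loaded from the entrypoint file.

```python
import types


def find_model_callable(module, model_id):
    """Return the callable in `module` whose name equals `model_id`."""
    candidate = getattr(module, model_id, None)
    if not callable(candidate):
        raise LookupError(
            'no callable named %r in module %r' % (model_id, module.__name__))
    return candidate


# Stand-in for the loaded entrypoint file (hypothetical, built in memory).
entrypoint = types.ModuleType('model')


def multivariate_mean(context):
    return 'executed with %s' % context


entrypoint.multivariate_mean = multivariate_mean

# Locate the implementation by model ID, then call it with a context.
implementation = find_model_callable(entrypoint, 'multivariate_mean')
result = implementation('fake-context')
```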

For this reason, the model implementation function must have the exact same name as the ID of the model, as declared in the model's manifest (see "The Manifest" below). One implication of this restriction is that Python model IDs must also be valid Python identifiers - that is, they must only contain alphanumeric characters or the underscore (_) character, must not start with a numeric character, and must not be a Python reserved word.
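
Whether a proposed model ID can double as a function name can be checked with the standard library. The helper name below is hypothetical; note that str.isidentifier() also accepts non-ASCII letters, so it is slightly more permissive than the rule as worded above.

```python
import keyword


def is_valid_function_name(model_id):
    """True if `model_id` is an identifier and not a reserved word."""
    return model_id.isidentifier() and not keyword.iskeyword(model_id)


checks = {
    'multivariate_mean': is_valid_function_name('multivariate_mean'),
    'multivariate.mean': is_valid_function_name('multivariate.mean'),
    '2nd_model': is_valid_function_name('2nd_model'),
    'class': is_valid_function_name('class'),
}
```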

This limitation can be circumvented, if required, using the @model decorator available in the as_models.models package. For example, if it were desirable to instead use multivariate.mean as the model ID (which is an invalid Python identifier due to the . character), this could be achieved as follows:

from senaps_sensor.parsers import PandasObservationParser
from senaps_sensor.models import Observation, UnivariateResult

import json, pandas as pd

from as_models.models import model

@model('multivariate.mean')
def multivariate_mean(context):
    ...  # etc...
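
To give a feel for how a decorator can decouple the model ID from the function name, here is a minimal sketch. It is not the as_models implementation: the model function and the _REGISTRY dictionary below are local stand-ins defined purely for illustration.

```python
# Hypothetical registry mapping model IDs to implementation functions.
_REGISTRY = {}


def model(model_id):
    """Stand-in decorator: register `func` under an arbitrary model ID."""
    def decorator(func):
        _REGISTRY[model_id] = func
        return func  # the function itself is left unchanged
    return decorator


@model('multivariate.mean')
def multivariate_mean(context):
    return 'mean computed'


# A loader could now look the model up by ID rather than by function name.
implementation = _REGISTRY['multivariate.mean']
```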

The Context Object

When a model implementation function is called in order to execute the model, the sole parameter passed to it is a "context" object. The purpose of this object is to serve as a means of passing input parameters and helper objects to the model, and for the model to pass outputs and status messages back to the Analysis Services environment.

The context object has the following properties:

Property	Description
`model_id`	The ID of the model that is being executed. This may be used (for example) for logging purposes, or to allow multiple models to be implemented within a single implementation function.
`ports`	A mapping of input/output port names to `Port` objects. For example, if your model declares an input port named "input", then the code `context.ports['input']` will return a `Port` object representing that port. The context object also supports property-style access to ports - for example, the same port could have been accessed using the syntax `context.input`. This approach only works for ports whose names are valid Python identifiers - all other ports must use the dictionary lookup style syntax. There are four subclasses of the `Port` class, each corresponding to one of the four port types: An instance of the `StreamPort` class is used to represent "stream" ports. This class exposes a single property `stream_id` which contains the ID of the stream assigned to the port. An instance of the `MultistreamPort` class is used to represent "multistream" ports. This class exposes a single property `stream_ids` which contains a list of the IDs of the streams assigned to the port. An instance of the `DocumentPort` class is used to represent "document" ports. This class exposes a single property `value` which contains the document assigned to the port, as a string. If the port is an output port, then a new value for the document may be set by assigning to the same property. An instance of the `GridPort` class is used to represent grid ports. This class exposes properties named `catalog_url` and `dataset_path` corresponding to the Thredds Data Server's catalog URL and the relative path of the grid dataset respectively. It also exposes a `dataset` property containing a `tds_client.Dataset` object corresponding to the specified Thredds dataset.
`sensor_client`	A pre-configured instance of the Sensor Data API client. The client is automatically configured to talk to the specific Sensor Data instance corresponding to the Analysis Services environment the model is running within. As such, the model author can simply start making requests of the API without having to perform any further configuration. Note that this does not in any way prevent the model author from instantiating their own instance of the Sensor Data client configured in a manner that suits their specific needs. The `senaps_sensor` package must be installed locally in order to use this client.
`analysis_client`	A pre-configured instance of the Analysis Services API client. The client is automatically configured to talk to the same Analysis Services API as is being used to run the model. As with the Sensor Data client, there is no restriction against the model author creating their own instances of the Analysis Services API client if they have a need to do so. The `as_client` package must be installed locally in order to use this client.
`thredds_client`	A pre-configured instance of the TDS Client. The client is automatically configured to talk to the TDS server that is hosted as part of the same environment the model is running within. Again, there is no restriction against the model author creating their own instance of the TDS client if they have a need to do so. The `tds_client` package must be installed locally in order to use this client.
`thredds_upload_client`	A pre-configured instance of the TDS Upload Client. The client is automatically configured to talk to the TDS server that is hosted as part of the same environment the model is running within. As with the other clients, there is no restriction against the model author creating their own instance of the TDS Upload client if they have a need to do so. The `tds_upload` package must be installed locally in order to use this client.

In addition to these properties, the context object provides a method for controlling and reporting on the model's execution:

Method	Description
`update(progress, message)`	May be called to inform the Analysis Services environment on the current progress of the model's execution. This method takes two parameters, both optional, and both accepted as either positional arguments or keyword arguments: `progress` - a floating-point number between `0.0` and `1.0` indicating the overall progress of the model's execution (`0.0` indicating no progress has been made, `1.0` indicating that the model has completed execution). The caller may alternatively specify `None` to signify an indeterminate degree of progress. `message` - a human-readable message string. If either parameter is omitted, the system continues to report whatever value was previously provided, or will report `None` if no value was previously provided.
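
The way omitted parameters retain their previous values can be illustrated with a small stand-in. RecordingContext is hypothetical and only mimics the documented behaviour of update(); the real context object is supplied by the Analysis Services environment.

```python
_UNSET = object()  # sentinel to tell "omitted" apart from an explicit None


class RecordingContext:
    """Hypothetical stand-in mimicking the documented update() semantics."""

    def __init__(self):
        self.progress = None
        self.message = None

    def update(self, progress=_UNSET, message=_UNSET):
        if progress is not _UNSET:
            self.progress = progress
        if message is not _UNSET:
            self.message = message


def process(context, items):
    context.update(progress=0.0, message='Starting...')
    for done, _item in enumerate(items, start=1):
        # Only the progress changes; the last message stays in place.
        context.update(progress=done / len(items))
    context.update(message='Finished')


ctx = RecordingContext()
process(ctx, ['a', 'b', 'c', 'd'])
```

After the final call only the message changes, so the stand-in still reports the progress value set on the last loop iteration.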

Exception Handling

Any exceptions raised within the model execution function are caught by the system, and will cause the model execution to be terminated and a "failed" status to be relayed to the user. The stack trace of the exception is also relayed as part of the error report to the user.
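
The behaviour described above resembles the following pattern. This is a sketch of the general technique, not the system's actual runner: run_model, broken_model and the "complete" status name are assumptions made for illustration (only the "failed" status comes from this guide).

```python
import traceback


def run_model(func, context):
    """Call the model and turn any exception into a failure report."""
    try:
        func(context)
    except Exception:
        # Capture the stack trace so that it can be relayed to the user.
        return {'status': 'failed', 'trace': traceback.format_exc()}
    return {'status': 'complete', 'trace': None}


def broken_model(context):
    raise ValueError('weights document is not valid JSON')


report = run_model(broken_model, context=None)
```

In practice this means a model can simply raise an exception with a descriptive message when it encounters invalid input, rather than trying to report the failure itself.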

The Manifest

The following JSON document is the model's manifest:

{
	"baseImage": "1132000a-8b05-4229-8cdc-a7b3bd8fa511",
	"organisationId": "your_org_id",
	"groupIds": ["sandbox"],
	"entrypoint": "model.py",
	"dependencies": [],
	"models": [{
		"id": "multivariate_mean",
		"name": "Multivariate Mean",
		"version": "0.0.1",
		"description": "Computes mean of multiple aligned data streams.",
		"method": "Downloads observation data into a Pandas DataFrame, optionally weights each column, computes sum of each row, then divides by the sum of the weights.",
		"ports": [
		{
			"portName": "inputs",
			"required": true,
			"type": "multistream",
			"description": "The streams to be averaged",
			"direction": "input"
		},
		{
			"portName": "weights",
			"required": false,
			"type": "document",
			"description": "The weighting of each stream, given as a JSON object of stream_id -> weight pairs. If no weight specified for a stream, weight defaults to 1.0.",
			"direction": "input"
		},
		{
			"portName": "output",
			"required": true,
			"type": "stream",
			"description": "The stream to place the averaged data into.",
			"direction": "output"
		}
		]
	}
	]
}

For a detailed discussion of the format of a manifest, please refer to "The Manifest File" in the main Model Developer's Guide.

In this specific case, the manifest declares the following:

The model image will be based on the "Python 3" image (with base image ID 1132000a-8b05-4229-8cdc-a7b3bd8fa511). See "Python Base Images" below for more information.
The model will be "owned" by the organisation with the ID your_org_id (a placeholder to be replaced with your own organisation's ID).
The model will be "owned" by the sandbox group within that organisation.
The main model code is in the file "model.py".
The model has no additional third-party dependencies (all required dependencies in this case come pre-installed on the selected base image).
There is a single model implemented:
- The model has the ID multivariate_mean, corresponding to the multivariate_mean function in the Python code.
- The model's human-friendly name is "Multivariate Mean".
- The model's version is 0.0.1.
- A description of the model, and a summary of the approach it uses (i.e. the "method") are also provided.
- The model has three ports:
  - The "inputs" port is a multi-stream input port that takes a list of the streams to be averaged.
  - The "weights" port is an optional document input port that takes a JSON object containing stream IDs as keys and weights as values.
  - The "output" port is a stream output port that takes the ID of the stream to place the computed average into.
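
As a quick sanity check during development, the port declarations can be read back with the standard json module. The sketch below uses a trimmed copy of the manifest above (descriptions and other fields removed); ports_by_direction is a hypothetical helper, not part of any Senaps library.

```python
import json

# Trimmed copy of the manifest shown above.
MANIFEST_JSON = """
{
    "entrypoint": "model.py",
    "models": [{
        "id": "multivariate_mean",
        "ports": [
            {"portName": "inputs", "required": true,
             "type": "multistream", "direction": "input"},
            {"portName": "weights", "required": false,
             "type": "document", "direction": "input"},
            {"portName": "output", "required": true,
             "type": "stream", "direction": "output"}
        ]
    }]
}
"""


def ports_by_direction(manifest, model_id):
    """Group a model's port names by direction, marking optional ports."""
    model = next(m for m in manifest['models'] if m['id'] == model_id)
    summary = {'input': [], 'output': []}
    for port in model['ports']:
        label = port['portName'] + ('' if port['required'] else ' (optional)')
        summary[port['direction']].append(label)
    return summary


summary = ports_by_direction(json.loads(MANIFEST_JSON), 'multivariate_mean')
```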

Python Base Images

There are a number of base images for Python models currently available in the Analysis Services system. To find the available Python base images, perform a GET request on the /base-images endpoint and look for images with a runtimetype of PYTHON.

The basic image for Python development is the "Python 3" base image (image ID 1132000a-8b05-4229-8cdc-a7b3bd8fa511) that contains only a Python 3 interpreter, the dependencies listed in "Development Environment", and the Numpy and Pandas Python libraries.

As of writing, there are also a number of extended Python images that add further packages to this basic image:

The "Python StatsModels" image (image ID 15e283e0-7406-47b9-92ce-9ffab9123db1) is the same as the basic image, but with StatsModels also preinstalled.
The "Python Keras" image (image ID 15e283e0-7406-47b9-92ce-9ffab9123db1) is the same as the basic image, but with Keras also preinstalled.

Where possible, it is best to prefer using the extended images, instead of using the basic image and declaring a dependency in the manifest. For example, rather than using the Python base image 1132000a-8b05-4229-8cdc-a7b3bd8fa511 and declaring a dependency on StatsModels in your manifest, instead use the Python StatsModels image 15e283e0-7406-47b9-92ce-9ffab9123db1. This improves model load time by reducing image size.

Python 2 equivalents of each of these base images do exist but, because Python 2.7 reached end of life on 1 January 2020, their use is strongly discouraged. They should only be used if your model has dependencies that cannot be used with Python 3.

We cannot guarantee to maintain availability of these base images for any length of time.

Installing the Model

The generic process for installing a new model is discussed in detail in the "Installing Models" section of the Model Developer's Guide.

An alternative approach is to use the Analysis Services command-line client that comes packaged with the as_client Python library. As an example, to install a model whose files (including the manifest file) are present on the local machine under the path /path/to/model/files, something like the following command might be used:

python3 -m as_client https://senaps.eratos.com/api/analysis/models install_model /path/to/model/files --apikey "your_api_key"

Offline Testing

During the development of a new model, it can be quite useful to test the model locally before uploading it.

To support this, the as_models.testing package includes a Context class that can be used to generate a fake context object for testing purposes, which can be manually passed to the model function in order to test it.

This Context class exposes all the same properties and methods as the "real" context object. It also exposes the following methods, which can be used to configure the context before passing it to the model function:

Method	Description
`set_model_id(model_id)`	Sets the context's `model_id` property. Returns the context object itself, allowing method chaining to occur.
`configure_port(name, type, direction, stream_id, stream_ids, value)`	Used to configure one of the model's ports. The method's parameters are as follows: `name` - the name of the port to configure. Always required. `type` - the port type, one of `as_models.ports.STREAM_PORT`, `as_models.ports.STREAM_COLLECTION_PORT` or `as_models.ports.DOCUMENT_PORT`. Always required. `direction` - the port direction, one of `as_models.ports.INPUT_PORT` or `as_models.ports.OUTPUT_PORT`. Always required. `stream_id` - the stream ID (a string) to associate with the port. Only permitted if `type` is `as_models.ports.STREAM_PORT`. `stream_ids` - the stream IDs (a list of strings) to associate with the port. Only permitted if `type` is `as_models.ports.STREAM_COLLECTION_PORT`. `value` - the document value (a string) to associate with the port. Only permitted if `type` is `as_models.ports.DOCUMENT_PORT`.
`configure_sensor_client(url, scheme, host, api_root, port, api_key)`	Used to configure the context's `sensor_client` property. The `senaps_sensor` package must be installed locally in order to use this client. The method's parameters are as follows: `url` - the Sensor Data API base URL. Optional if the `scheme` and `host` parameters are supplied. `scheme` - the URL scheme to use (e.g. `"http"`). If omitted, this is inferred from the `url` parameter. If both are given, this overrides the scheme provided by the `url` parameter. `host` - the Sensor Data API's hostname (e.g. `senaps.eratos.com`). If omitted, this is inferred from the `url` parameter. If both are given, this overrides the hostname provided by the `url` parameter. `api_root` - the URL path at which the Sensor Data API resides (e.g. `/api/sensor/v2`). If omitted, this is inferred from the `url` parameter. If both are given, this overrides the path provided by the `url` parameter. `port` - the port number the Sensor Data API can be contacted on (e.g. port 80 for HTTP). If omitted, this is inferred from the `url` parameter. If both are given, this overrides the port provided by the `url` parameter. `api_key` - a Sensor Data API key to use for key-based authentication. Typically, either the `url` parameter is supplied alone, or the `scheme`, `host`, `api_root` and/or `port` parameters are supplied instead.
`configure_analysis_client(url, scheme, host, api_root, port, api_key)`	Used to configure the context's `analysis_client` property. The `as_client` package must be installed locally in order to use this client. The parameters are the same as for the `configure_sensor_client()` method described above.
`configure_thredds_client(url, scheme, host, api_root, port, api_key)`	Used to configure the context's `thredds_client` property. The `tds_client` package must be installed locally in order to use this client. The parameters are the same as for the `configure_sensor_client()` method described above.
`configure_thredds_upload_client(url, scheme, host, api_root, port, api_key)`	Used to configure the context's `thredds_upload_client` property. The `tds_upload` package must be installed locally in order to use this client. The parameters are the same as for the `configure_sensor_client()` method described above.
`configure_clients(url, scheme, host, port, api_key, sensor_path, analysis_path, thredds_path, thredds_upload_path)`	A shorthand method that can be used to configure all four client properties simultaneously. The parameters are the same as for the `configure_sensor_client()` method described above, except that instead of a single `api_root` parameter, there are four parameters (`sensor_path`, `analysis_path`, `thredds_path` and `thredds_upload_path`) that correspond to the API root of the Sensor Data, Analysis Service, TDS and TDS upload clients respectively. These parameters are all optional - if any one is omitted, the corresponding client is not initialised. This method initialises the clients to point to the same host (using the `url`, `scheme`, `host` and/or `port` parameters), and to use the same authentication (using the `api_key` parameter). As such, it is assumed that the four services are all hosted on the same server, and are distinguished only through their respective URL paths.
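
The relationship between the url parameter and the individual scheme, host, api_root and port parameters can be illustrated with the standard library's URL parser. This is only a sketch of the inference and override rules as described in the table above; split_api_url is a hypothetical helper and is not how as_models itself parses URLs.

```python
from urllib.parse import urlsplit


def split_api_url(url, scheme=None, host=None, api_root=None, port=None):
    """Infer each component from `url` unless an explicit override is given."""
    parts = urlsplit(url)
    return {
        'scheme': scheme or parts.scheme,
        'host': host or parts.hostname,
        'api_root': api_root or parts.path,
        'port': port or parts.port,  # None when the URL names no port
    }


defaults = split_api_url('https://senaps.eratos.com/api/sensor/v2')
overridden = split_api_url('https://senaps.eratos.com/api/sensor/v2',
                           scheme='http', port=8080)
```

In the first call every component comes from the URL; in the second, the explicit scheme and port take precedence while the host and path are still inferred.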

Example

The following example demonstrates using these testing facilities in a Python unittest test class:

import unittest

from as_models.ports import STREAM_COLLECTION_PORT, STREAM_PORT, INPUT_PORT, OUTPUT_PORT, DOCUMENT_PORT
from as_models.testing import Context

from model import multivariate_mean

SENAPS_URL = 'https://senaps.eratos.com/api/sensor/v2'
MY_API_KEY = '...'


class Tests(unittest.TestCase):
    """ Integration tests to run the model code against live Senaps APIs.
        Before running this test:
        1. Update MY_API_KEY with your own
        2. Check the SENAPS_URL is correct
        3. Ensure the 'test_input_1' and 'test_output_1' streams exist in Senaps and you have permission to read/write streams
    """

    def setUp(self):
        self.context = Context()
        self.context.configure_sensor_client(url=SENAPS_URL, api_key=MY_API_KEY)

    def test_multivariate_mean(self):
        self.context.configure_port("inputs", STREAM_COLLECTION_PORT, INPUT_PORT, stream_ids=["test_input_1"])
        self.context.configure_port("weights", DOCUMENT_PORT, INPUT_PORT, value=None)
        self.context.configure_port("output", STREAM_PORT, OUTPUT_PORT, stream_id="test_output_1")

        multivariate_mean(self.context)

if __name__ == '__main__':
    unittest.main()