Model Developer's Guide: Python Models
Introduction
This guide walks through the process of developing a Python-based model for the Analysis Services environment.
As a worked example, it develops a model that computes the weighted mean of a number of data streams.
Prior Reading
Although not essential, there are some other documents that are beneficial to read before embarking on this tutorial:
- The Analysis Services API Tutorial gives an overview of the application programming interface (API) for the Analysis Services environment.
- The Model Developer's Guide introduces the general concepts that are useful for all developers of models for the Analysis Services environment (i.e. not just those developing models in Python).
Development Environment
There are a number of Python libraries that are required to develop models for the Analysis Services environment. The `as_models` package, available here, is the core library used to interface your model with the Analysis Services system.
In addition to the core `as_models` package, there are four other optional libraries which may be needed to interact with other components of the Senaps environment:
- The `senaps_sensor` package, available here, is a Python client library for the core Senaps API. It is required for handling "stream" and "multistream" input and output ports, or for making general requests to the Senaps API.
- The `as_client` package, available here, is a Python client library for the Analysis Services API. It is required for interacting with the Senaps Analysis Service API.
- The `tds_client` package, available here, is a Python client library for the Thredds Data Server (TDS). It is required for handling "grid" input ports, or for making requests to TDS servers.
- The `tds_upload` package, available here, is a Python client library for uploading data to the Senaps environment's TDS server. It is required for handling "grid" output ports.
These are all distributed with a `setup.py` script, and can be installed with the following command:
sudo python3 setup.py install # Linux
python3 setup.py install # Windows
All the required dependencies should be automatically installed. There are, however, some optional dependencies that should be installed if using extra functionality:
- Pandas is required if using the Sensor Data client's `PandasObservationParser` class.
- netCDF4 is required if using the TDS client's NetCDF Subset Service.
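Before starting development, it can help to confirm which of these packages are importable in the current environment. The following is a minimal sketch using only the standard library; the helper name `find_installed` is illustrative and is not part of any Senaps package.

```python
import importlib.util

# Top-level package names as listed in this guide; pandas and netCDF4 are the
# optional extras mentioned above.
PACKAGES = ["as_models", "senaps_sensor", "as_client", "tds_client",
            "tds_upload", "pandas", "netCDF4"]


def find_installed(names):
    """Return a dict mapping each top-level package name to True if importable."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, present in find_installed(PACKAGES).items():
        print(f"{name:15s} {'installed' if present else 'MISSING'}")
```

Only packages reported as missing then need to be installed with the command shown above.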
The Python Code
The following Python code defines a "multivariate mean" model:
from senaps_sensor.parsers import PandasObservationParser
from senaps_sensor.models import Observation, UnivariateResult

import json, pandas as pd

def multivariate_mean(context):
    input_stream_ids = [p.stream_id for p in context.ports['inputs']]
    output_stream_id = context.ports['output'].stream_id

    weights_doc = getattr(context.ports.get('weights'), 'value', None) or '{}'
    weights = json.loads(weights_doc)

    # Obtain observation data from `inputs` streams.
    context.update(message='Loading data...')
    streamids = ','.join(input_stream_ids)
    parser = PandasObservationParser()
    limit = 10000
    data = start = None
    while True:
        segment = context.sensor_client.get_observations(streamid=streamids, media='csv', limit=limit, start=start, si=False, parser=parser)
        data = pd.concat((data, segment))

        if len(segment) < limit:
            break

        start = segment.index[-1].strftime('%Y-%m-%dT%H:%M:%S.%fZ')

    # Apply weights to the data.
    context.update(message='Weighting data...')
    for stream_id, weight in weights.items():
        data[stream_id] *= weight
    sum_of_weights = sum(weights.get(stream_id, 1.0) for stream_id in input_stream_ids)

    # Compute mean and store in `output` stream.
    mean = data.sum(axis=1) / sum_of_weights
    output = Observation()
    output.results = [UnivariateResult(t=t.strftime('%Y-%m-%dT%H:%M:%S.%fZ'), v=v) for t, v in zip(mean.index, mean)]
    context.sensor_client.create_observations(output, streamid=output_stream_id)
The model functions as follows:
- Lines 1 and 2 import a few classes from the Python client library (`senaps_sensor`) which are used to interact with the Sensor Data API's observations endpoint.
- Line 4 imports the `json` package, used for parsing configuration inputs for the model, and the `pandas` package, used for manipulating time-series data.
- Line 6 declares a function (`multivariate_mean`) that is the implementation of the `multivariate_mean` model.
- Lines 7 through 11 read the input and output stream IDs and the weight values from the model's ports:
  - In line 7, the IDs of the input streams are read from the model's `inputs` port, a multi-stream port (see "The Manifest" below).
  - In line 8, the ID of the output stream is read from the model's `output` port, a stream port (see "The Manifest" below).
  - In line 10, a JSON document is read from the model's `weights` port, a document port (see "The Manifest" below). If the port is missing or contains an empty string, an empty JSON object is used by default.
  - In line 11, the JSON document read on the preceding line is parsed.
- Line 14 uses the context's `update()` method to provide a status update (see "The Context Object" below).
- Lines 15 through 26 obtain a Pandas DataFrame object representing the observations in the input streams:
  - To interact with the Sensor Data API, the pre-configured API client contained in the context's `sensor_client` property is used (see "The Context Object" below).
  - The observation data is obtained through repeated requests to the Sensor Data observations API, each downloading a separate temporal segment of the complete dataset, as follows:
    - The `streamids` variable is a comma-separated list of the streams to download observations for.
    - The `parser` variable contains an instance of `PandasObservationParser`, used to convert the returned observation data to a Pandas `DataFrame` object.
    - The `limit` variable specifies that each segment shall contain at most 10,000 observations.
    - The `data` variable is used to accumulate the observation data.
    - The `start` variable tracks the starting timestamp of the current temporal window. The initial value of `None` causes the algorithm to begin at the start of the observation data.
    - The algorithm proceeds as follows:
      - A segment of data is downloaded from the Sensor Data API by using the `get_observations` method of the Sensor Data client object.
      - The newly acquired data is concatenated to whatever existing data is contained in the `data` variable.
      - If the newly acquired data is less than 10,000 observations in length, this implies that the end of the data has been reached, and the algorithm terminates.
      - Otherwise, the `start` variable is updated such that the next segment will begin after the timestamp of the last observation in the current segment.
  - The returned `DataFrame` is indexed by the observation timestamp, and has one column per input stream (in the same order as given to the `get_observations` method's `streamid` parameter).
- Lines 30 and 31 apply the weights to each column of the DataFrame:
  - Line 30 iterates over each stream ID/weight pair.
  - Line 31 selects the column corresponding to the current stream ID, then multiplies it by the weight.
- Line 32 computes the sum of the weights by iterating over each of the stream IDs for the input streams, retrieving the corresponding weight (defaulting to `1.0` if no weight is given), and computing their sum.
- Line 35 computes the mean by summing each row of the DataFrame, then dividing by the sum of the weights.
- Lines 36 through 38 store the mean in the output data stream:
  - Lines 36 and 37 create an `Observation` object of the Sensor Data client library, and set its `results` property to a list of `UnivariateResult` objects, each representing a single observation in the output stream.
  - Line 38 uses the pre-configured Sensor Data API client contained in the context's `sensor_client` property (see "The Context Object" below) to upload those observations into the output stream.
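The segment loop and the weighting arithmetic described above can be tried without network access or Pandas. The sketch below is illustrative only: `fetch_page` is a hypothetical stand-in for `get_observations` that is assumed to return observations strictly after `start`, and each row is a plain dictionary rather than a DataFrame row.

```python
def fetch_all(fetch_page, limit):
    """Accumulate pages until a page shorter than `limit` is returned."""
    data, start = [], None
    while True:
        segment = fetch_page(start, limit)
        data.extend(segment)
        if len(segment) < limit:
            return data
        start = segment[-1][0]  # timestamp of the last observation received


def weighted_mean(row, weights, stream_ids):
    """Weighted mean of one row; streams without a weight default to 1.0."""
    total = sum(row[s] * weights.get(s, 1.0) for s in stream_ids)
    return total / sum(weights.get(s, 1.0) for s in stream_ids)


# Hypothetical in-memory observations: (timestamp, {stream_id: value}).
OBSERVATIONS = [(t, {"a": float(t), "b": 10.0}) for t in range(1, 6)]


def fake_fetch_page(start, limit):
    """Stand-in for the API call: observations strictly after `start`."""
    later = [obs for obs in OBSERVATIONS if start is None or obs[0] > start]
    return later[:limit]


rows = fetch_all(fake_fetch_page, limit=2)
means = [weighted_mean(values, {"a": 3.0}, ["a", "b"]) for _, values in rows]
```

With a weight of 3.0 on stream `a` and the default of 1.0 on stream `b`, the first row gives (1.0 × 3.0 + 10.0 × 1.0) / (3.0 + 1.0) = 3.25.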
Model Execution
In order to execute the model, the Analysis Services system needs to be able to locate the `multivariate_mean` function and execute it, passing in the context object.
The mechanism by which this is achieved is quite simple: when the system receives a request to execute the model, it loads the Python file as a module, then looks for a callable (function or object with a `__call__` method) that has the same name as the model (in this case, `multivariate_mean`). It then calls the callable, passing in a "context" object (see the next section for details) tailored to the particular execution request.
For this reason, the model implementation function must have the exact same name as the ID of the model, as declared in the model's manifest (see "The Manifest" below). One implication of this restriction is that Python model IDs must also be valid Python identifiers - that is, they must only contain alphanumeric characters or the underscore (`_`) character, must not start with a numeric character, and must not be a Python reserved word.
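Whether a candidate model ID satisfies these rules can be checked with the standard library. This is a minimal sketch; the helper name `is_valid_model_id` is illustrative and is not part of `as_models`.

```python
import keyword


def is_valid_model_id(model_id):
    """Return True if `model_id` can be used directly as a Python function name."""
    return model_id.isidentifier() and not keyword.iskeyword(model_id)
```

Note that `str.isidentifier()` also accepts non-ASCII letters, so it is slightly more permissive than the "alphanumeric or underscore" wording above.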
This limitation can be circumvented, if required, using the `@model` decorator available in the `as_models.models` module. For example, to use `multivariate.mean` as the model ID (which is an invalid Python identifier due to the `.` character), the function could be decorated as follows:
from senaps_sensor.parsers import PandasObservationParser
from senaps_sensor.models import Observation, UnivariateResult

import json, pandas as pd

from as_models.models import model

# Register the function under a model ID that is not a valid Python identifier.
@model('multivariate.mean')
def multivariate_mean(context):
    ...  # etc. (function body unchanged)
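To make the lookup rule concrete, the following sketch imitates it with the standard library. It is not the actual `as_models` implementation: `find_model`, the `REGISTRY` dictionary, and the local `model` decorator defined here are hypothetical stand-ins that show how a name-based lookup with a decorator override could work.

```python
import types

REGISTRY = {}  # hypothetical: explicit model ID -> callable


def model(model_id):
    """Hypothetical decorator that registers a callable under an explicit ID."""
    def decorator(func):
        REGISTRY[model_id] = func
        return func
    return decorator


def find_model(module, model_id):
    """Prefer an explicitly registered callable, else look up by attribute name."""
    candidate = REGISTRY.get(model_id) or getattr(module, model_id, None)
    if not callable(candidate):
        raise LookupError(f"no callable found for model {model_id!r}")
    return candidate


# Build a throwaway module in memory instead of loading a file from disk.
demo = types.ModuleType("demo")
exec(
    "def multivariate_mean(context):\n"
    "    return 'called with ' + context\n",
    demo.__dict__,
)


@model("multivariate.mean")
def dotted(context):
    return "dotted " + context
```

Here `find_model(demo, "multivariate_mean")` resolves by function name, while `find_model(demo, "multivariate.mean")` resolves through the registry populated by the decorator.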
The Context Object
When a model implementation function is called in order to execute the model, the sole parameter passed to it is a "context" object. The purpose of this object is to serve as a means of passing input parameters and helper objects to the model, and for the model to pass outputs and status messages back to the Analysis Services environment.
The context object has the following properties:
| Property | Description |
|---|---|
| | The ID of the model that is being executed. |
| `ports` | A mapping of input/output port names to port objects. |
| `sensor_client` | A pre-configured instance of the Sensor Data API client. |
| | A pre-configured instance of the Analysis Services API client. |
| | A pre-configured instance of the TDS Client. |
| | A pre-configured instance of the TDS Upload Client. |
In addition to these properties, the context object provides a method for controlling and reporting on the model's execution:
| Method | Description |
|---|---|
| `update()` | May be called to inform the Analysis Services environment of the current progress of the model's execution. |
Exception Handling
Any exceptions raised within the model execution function are caught by the system, and will cause the model execution to be terminated and a "failed" status to be relayed to the user. The stack trace of the exception is also relayed as part of the error report to the user.
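The behaviour described above can be pictured with a small wrapper. This is a sketch only, not the platform's code: `run_model`, the `"complete"` status name, and the shape of the returned report are assumptions made for illustration.

```python
import traceback


def run_model(model_func, context):
    """Run a model function and turn any exception into a failure report."""
    try:
        model_func(context)
    except Exception:
        # Capture the full stack trace so that it can be relayed to the user.
        return {"status": "failed", "trace": traceback.format_exc()}
    return {"status": "complete", "trace": None}  # assumed status name


def broken_model(context):
    """A made-up model that always fails."""
    raise ValueError("weights document is not valid JSON")


report = run_model(broken_model, context=None)
```

Because the stack trace is passed on to the user, raising an exception with a descriptive message is usually enough for a model to signal a failure.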
The Manifest
The following JSON document is the model's manifest:
{
    "baseImage": "1132000a-8b05-4229-8cdc-a7b3bd8fa511",
    "organisationId": "your_org_id",
    "groupIds": ["sandbox"],
    "entrypoint": "model.py",
    "dependencies": [],
    "models": [{
        "id": "multivariate_mean",
        "name": "Multivariate Mean",
        "version": "0.0.1",
        "description": "Computes mean of multiple aligned data streams.",
        "method": "Downloads observation data into a Pandas DataFrame, optionally weights each column, computes sum of each row, then divides by the sum of the weights.",
        "ports": [
            {
                "portName": "inputs",
                "required": true,
                "type": "multistream",
                "description": "The streams to be averaged",
                "direction": "input"
            },
            {
                "portName": "weights",
                "required": false,
                "type": "document",
                "description": "The weighting of each stream, given as a JSON object of stream_id -> weight pairs. If no weight specified for a stream, weight defaults to 1.0.",
                "direction": "input"
            },
            {
                "portName": "output",
                "required": true,
                "type": "stream",
                "description": "The stream to place the averaged data into.",
                "direction": "output"
            }
        ]
    }]
}
For a detailed discussion of the format of a manifest, please refer to "The Manifest File" in the main Model Developer's Guide.
In this specific case, the manifest declares the following:
- The model image will be based on the "Python 3" image (with base image ID `1132000a-8b05-4229-8cdc-a7b3bd8fa511`). See "Python Base Images" below for more information.
- The model will be "owned" by the organisation whose ID is given in `organisationId` (`your_org_id` is a placeholder for your own organisation ID).
- The model will be "owned" by the `sandbox` group within that organisation.
- The main model code is in the file "model.py".
- The model has no additional third-party dependencies (all required dependencies in this case come pre-installed on the selected base image).
- There is a single model implemented:
  - The model has the ID `multivariate_mean`, corresponding to the `multivariate_mean` function in the Python code.
  - The model's human-friendly name is "Multivariate Mean".
  - The model's version is 0.0.1.
  - A description of the model, and a summary of the approach it uses (i.e. the "method") are also provided.
  - The model has three ports:
    - The "inputs" port is a multi-stream input port that takes a list of the streams to be averaged.
    - The "weights" port is an optional document input port that takes a JSON object containing stream IDs as keys and weights as values.
    - The "output" port is a stream output port that takes the ID of the stream to place the computed average into.
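A quick sanity check of the port declarations before upload can catch simple mistakes. The sketch below assumes only the port fields shown in the example manifest; `check_ports` is a hypothetical helper, and the rules it applies are illustrative rather than the platform's own validation.

```python
import json

PORT_KEYS = {"portName", "required", "type", "description", "direction"}
DIRECTIONS = {"input", "output"}  # the only values used in the example manifest


def check_ports(manifest_text):
    """Return a list of human-readable problems found in the port declarations."""
    manifest = json.loads(manifest_text)
    problems = []
    for model in manifest.get("models", []):
        for port in model.get("ports", []):
            label = f"{model.get('id')}.{port.get('portName', '?')}"
            missing = PORT_KEYS - port.keys()
            if missing:
                problems.append(f"{label}: missing {sorted(missing)}")
            if port.get("direction") not in DIRECTIONS:
                problems.append(f"{label}: bad direction {port.get('direction')!r}")
    return problems


# A deliberately flawed manifest fragment used only for this demonstration.
EXAMPLE = """
{"models": [{"id": "multivariate_mean", "ports": [
    {"portName": "inputs", "required": true, "type": "multistream",
     "description": "The streams to be averaged", "direction": "input"},
    {"portName": "weights", "required": false, "type": "document",
     "direction": "in"}
]}]}
"""
problems = check_ports(EXAMPLE)
```

For the flawed fragment, the helper reports the missing `description` and the unrecognised `direction` on the `weights` port, and nothing for the `inputs` port.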
Python Base Images
There are a number of base images for Python models currently available in the Analysis Services system. To find the available Python base images, perform a `GET` request on the `/base-images` endpoint and look for images with a `runtimetype` of `PYTHON`.
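Once the response from `/base-images` has been parsed, selecting the Python images is a simple filter. The sketch assumes the images are available as a list of dictionaries with `id`, `name`, and `runtimetype` keys; only `runtimetype` is named in this guide, so the other keys, the response shape, and the sample data are assumptions.

```python
def python_images(images):
    """Keep only the base images whose runtime type is PYTHON."""
    return [image for image in images if image.get("runtimetype") == "PYTHON"]


# Sample data for illustration only; the second entry is made up.
SAMPLE = [
    {"id": "1132000a-8b05-4229-8cdc-a7b3bd8fa511", "name": "Python 3",
     "runtimetype": "PYTHON"},
    {"id": "hypothetical-r-image", "name": "Example R image",
     "runtimetype": "R"},
]
names = [image["name"] for image in python_images(SAMPLE)]
```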
The basic image for Python development is the "Python 3" base image (image ID `1132000a-8b05-4229-8cdc-a7b3bd8fa511`) that contains only a Python 3 interpreter, the dependencies listed in "Development Environment", and the Numpy and Pandas Python libraries.
As of writing, there are also a number of extended Python images that add further packages to this basic image:
- The "Python StatsModels" image (image ID `15e283e0-7406-47b9-92ce-9ffab9123db1`) is the same as the basic image, but with StatsModels also preinstalled.
- The "Python Keras" image (image ID `15e283e0-7406-47b9-92ce-9ffab9123db1`) is the same as the basic image, but with Keras also preinstalled.
Where possible, prefer an extended image over using the basic image and declaring a dependency in the manifest. For example, rather than using the Python base image `1132000a-8b05-4229-8cdc-a7b3bd8fa511` and declaring a dependency on StatsModels in your manifest, instead use the Python StatsModels image `15e283e0-7406-47b9-92ce-9ffab9123db1`. This improves model load time by reducing image size.
There do exist Python 2 equivalents of each of these base images but, due to the imminent discontinuation of Python 2.7, their use is strongly discouraged. They should only be used if your model has dependencies that cannot be used with Python 3.
We cannot guarantee to maintain availability of these base images for any length of time.
Installing the Model
The generic process for installing a new model is discussed in detail in the "Installing Models" section of the Model Developer's Guide.
An alternative approach is to use the Analysis Services command-line client that comes packaged with the `as_client` Python library. As an example, to install a model whose files (including the manifest file) are present on the local machine under the path `/path/to/model/files`, something like the following command might be used:
python3 -m as_client https://senaps.eratos.com/api/analysis/models install_model /path/to/model/files --apikey "your_api_key"
Offline Testing
During the development of a new model, it can be quite useful to test the model locally before uploading it.
To support this, the `as_models.testing` package includes a `Context` class that can be used to generate a fake context object for testing purposes, which can be manually passed to the model function in order to test it.
This `Context` class exposes all the same properties and methods as the "real" context object. It also exposes the following methods, which can be used to configure the context before passing it to the model function:
| Method | Description |
|---|---|
| | Sets the context's model ID. |
| `configure_port` | Used to configure one of the model's ports. |
| `configure_sensor_client` | Used to configure the context's Sensor Data API client. |
| | Used to configure the context's Analysis Services API client. |
| | Used to configure the context's TDS client. |
| | Used to configure the context's TDS Upload client. |
| | A shorthand method that can be used to configure all four client properties simultaneously. |
Example
The following example demonstrates using these testing facilities in a Python `unittest` test class:
import getpass
import unittest

from as_models.ports import STREAM_COLLECTION_PORT, STREAM_PORT, INPUT_PORT, OUTPUT_PORT, DOCUMENT_PORT
from as_models.testing import Context

from model import multivariate_mean

SENAPS_URL = 'https://senaps.eratos.com/api/sensor/v2'
MY_API_KEY = '...'


class Tests(unittest.TestCase):
    """ Integration tests to run the model code against live Senaps APIs.

    Before running this test:
    1. Update MY_API_KEY with your own
    2. Check the SENAPS_URL is correct
    3. Ensure the 'test_input_1' and 'test_output_1' streams exist in Senaps and you have permission to read/write streams
    """

    def setUp(self):
        # Build a fake context that talks to the live Sensor Data API.
        self.context = Context()
        self.context.configure_sensor_client(url=SENAPS_URL, api_key=MY_API_KEY)

    def test_multivariate_mean(self):
        self.context.configure_port("inputs", STREAM_COLLECTION_PORT, INPUT_PORT, stream_ids=["test_input_1"])
        self.context.configure_port("weights", DOCUMENT_PORT, INPUT_PORT, value=None)
        # A stream port takes a single stream ID (a string, not a list).
        self.context.configure_port("output", STREAM_PORT, OUTPUT_PORT, stream_id="test_output_1")
        multivariate_mean(self.context)


if __name__ == '__main__':
    unittest.main()
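Where a live Senaps instance is not available, a stand-in context built with the standard library's `unittest.mock` can exercise model logic that only reads ports and reports status. This is a different technique from the `Context` class above, and `toy_model` is a made-up model used purely for illustration.

```python
import unittest
from unittest import mock


def toy_model(context):
    """A made-up model that reports progress and reads one optional port."""
    context.update(message='Loading data...')
    return context.ports['weights'].value or '{}'


class ToyModelTests(unittest.TestCase):
    def test_reports_progress_and_defaults_weights(self):
        # MagicMock records every call made on it, including update().
        context = mock.MagicMock()
        context.ports = {'weights': mock.Mock(value=None)}
        self.assertEqual(toy_model(context), '{}')
        context.update.assert_called_once_with(message='Loading data...')


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToyModelTests)
result = unittest.TextTestRunner().run(suite)
```

A mock accepts any call, so tests of this kind check the model's own logic only; they do not confirm that requests to the Senaps APIs are well formed.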