Tools and Project Structure
In the following sections we will go over the steps for implementing an MLOps proof-of-concept pipeline using IBM Watson tools and services. A template repository with a complete MLOps cycle (versioning data, generating reports on pull requests, and deploying the model on releases with DVC and CML, using GitHub Actions and IBM Watson), as well as instructions to run the project, can be found here.
We won't get into how to create predictive models or preprocess data, since our main objective is to discuss MLOps and create a development cycle using those concepts.
The main tools discussed in the guide are shown in the following table. As the guide is intended to be modular, a team can swap any tool for another depending on the project's needs or the team's preferences.
|Tool |Scope |
|--- |--- |
|IBM Watson ML |Deploying model as API |
|IBM Watson OpenScale |Monitoring model in production |
|DVC |Data and model versioning |
| |Setting up IBM infrastructure with a script |
| |Python script testing |
| |Running tests on local commit |
|Cookiecutter |Creating folder structure and files |
The image above shows the project's folder structure. We'll talk about each specific part in further detail throughout the guide.
results contains files that are stored and versioned by DVC.
notebooks contains Jupyter Notebooks used for exploratory analysis, model development, or data manipulation.
src contains scripts for training and evaluating the model, as well as tests and scripts for pipelines and APIs.
This folder structure is going to be implemented in a blank project in Introduction/Starting a New Project with Cookiecutter.
The requirements file lists all of a project's dependencies and the specific version of each one, including the dependencies needed by the dependencies. It can also be used to recreate the project's virtual environment. This is extremely important to avoid conflicts between Python libraries and to ensure the experiments can be reproduced on different machines.
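To make the reproducibility point concrete, here is a minimal sketch of checking an environment against pinned requirements. It assumes simple `name==version` lines only; the helper names `parse_pins` and `find_mismatches` and the sample package pins are hypothetical and are not taken from the template repository.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Return {package: version} for every simple 'name==version' line."""
    pins = {}
    for raw_line in requirements_text.splitlines():
        # Drop trailing comments and environment markers (assumed simple format).
        line = raw_line.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins, installed_version=metadata.version):
    """Return {package: (pinned, installed)} for packages that differ.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = installed_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    # Hypothetical pins, used only for illustration.
    sample = "pandas==1.3.0\nnumpy==1.21.0  # pinned for reproducibility\n"
    print(find_mismatches(parse_pins(sample)))
```

An empty result means the current environment matches every pin. In practice, installing from the requirements file into a fresh virtual environment achieves the same guarantee more directly.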
To keep track of the model information we have a metadata.yaml file. This helps with CI/CD and pipeline automation, such as updating or deploying the model without the need for user input.
Using Jupyter Notebooks vs. Python Scripts
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. It is widely used in Data Science and Machine Learning for its versatility in developing and documenting projects. However, using notebooks may cause some problems for our development cycle:
Versioning: A notebook file stores code together with outputs and metadata, so its underlying source is much more complex than a plain script and we can't easily visualize the difference between versions using git. There are some tools that can help with that, however.
Reproducibility: A great feature of notebooks is being able to run cells in a non-sequential order, but this is a big problem if we want to reproduce the code, since it's hard to know which cells were executed and in what order. This is especially bad if we want to automate a pipeline.
Standardized In/Out: By using scripts we can create pipelines with standardized inputs and outputs. Whatever the model, what the pipeline receives and returns is in the same format, so the same pipeline can be reused across models.
Access to Functions: In the model.py script, we define the train and evaluate functions, where the model is declared and trained and the metrics for the evaluation are defined. These functions can be called by other scripts such as evaluate.py, so we can create pipelines to train the model on a remote instance or evaluate an already trained model file in a consistent way.
    def train(data, params):
        ...
        return pipeline, logs

    def evaluate(data, pipeline, OUTPUT_PATH):
        ...
        return results
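The signatures above can be filled in with a toy example to show how a calling script might rely on the standardized inputs and outputs. This is only a sketch: the function bodies, the mean-predictor "pipeline", the `target` field, and the `metrics.json` file name are placeholders invented for illustration, not the project's actual model code.

```python
import json
import tempfile
from pathlib import Path
from statistics import mean


def train(data, params):
    """Fit a placeholder model and return it together with training logs."""
    targets = [row["target"] for row in data]
    # Placeholder "pipeline": always predicts the mean of the training targets.
    pipeline = {"prediction": mean(targets)}
    logs = {"n_rows": len(data), "params": params}
    return pipeline, logs


def evaluate(data, pipeline, OUTPUT_PATH):
    """Score the pipeline, write the metrics to OUTPUT_PATH, and return them."""
    errors = [abs(row["target"] - pipeline["prediction"]) for row in data]
    results = {"mae": mean(errors), "n_rows": len(data)}
    Path(OUTPUT_PATH).write_text(json.dumps(results, indent=2))
    return results


if __name__ == "__main__":
    # A calling script only depends on the function signatures,
    # not on what kind of model sits behind them.
    data = [{"target": 1.0}, {"target": 2.0}, {"target": 3.0}]
    pipeline, logs = train(data, params={})
    with tempfile.TemporaryDirectory() as tmp:
        results = evaluate(data, pipeline, Path(tmp) / "metrics.json")
    print(logs, results)
```

Because the caller only sees `(pipeline, logs)` and `results`, swapping the placeholder for a real model should not require changes to the scripts that drive training or evaluation.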
In our project we chose to use scripts instead of Jupyter Notebooks for the reasons cited above. Notebooks can still be used to experiment with models or processes, with scripts serving as the more 'definitive' form.
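If notebooks are kept for experimentation, the versioning problem mentioned earlier can be reduced by clearing outputs before committing. The sketch below assumes the standard notebook JSON layout (a top-level `cells` list whose code cells carry `outputs` and `execution_count`); the `strip_outputs` helper and the sample notebook are hypothetical, and dedicated tools such as nbstripout handle this more thoroughly.

```python
import copy
import json


def strip_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs cleared."""
    stripped = copy.deepcopy(notebook)
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            # Outputs and execution counters change on every run,
            # which is what makes notebook diffs noisy.
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


if __name__ == "__main__":
    # Tiny in-memory notebook used only for illustration.
    notebook = {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": 7,
                "metadata": {},
                "outputs": [{"output_type": "stream", "name": "stdout", "text": ["3\n"]}],
                "source": ["print(1 + 2)"],
            },
            {"cell_type": "markdown", "metadata": {}, "source": ["# Notes"]},
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }
    print(json.dumps(strip_outputs(notebook), indent=1))
```

After stripping, two versions of the same notebook differ only where the source cells differ, which is much closer to what a git diff of a script would show.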