
Tools and Project Structure

In the following sections we will go over the steps to implement an MLOps Proof-of-Concept pipeline using IBM Watson tools and services. A template repository implementing the complete MLOps cycle (versioning data, generating reports on pull requests, and deploying the model on releases with DVC and CML, using Github Actions and IBM Watson), along with instructions to run the project, can be found here.

Note

We won't get into how to create predictive models or preprocess data, since our main objective is to discuss MLOps and build a development cycle around those concepts.


Project Tools

The main tools discussed in this guide are shown in the following table. Since the guide is intended to be modular, a team can swap tools for others depending on the project's needs or preferences.

Tool | Function | Developer
--- | --- | ---
IBM Watson ML | Deploying the model as an API | IBM
IBM Watson OpenScale | Monitoring the model in production | IBM
DVC | Data and model versioning | Iterative
CML | Pipeline automation | Iterative
Terraform | Setting up IBM infrastructure with scripts | HashiCorp
Github | Code versioning | Github
Github Actions | CI/CD automation | Github
Pytest | Python script testing | Pytest-dev
Pre-commit | Running tests on local commits | Pre-commit
Cookiecutter | Creating folder structure and files | Cookiecutter

Folder Structure

[Figure: project folder structure]

The image above shows the project's folder structure; we'll talk about each specific part in further detail throughout the guide.

  • data, models and results contain the files that are stored and versioned by DVC.

  • notebooks contains the Jupyter notebooks used for exploratory analysis, model development, and data manipulation.

  • src contains the scripts for training and evaluating the model, as well as tests and the scripts for pipelines and APIs.
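
As a simplified sketch of that structure (only the directories described above are shown; the template repository contains additional files):

    ├── data/         # DVC-tracked data files
    ├── models/       # DVC-tracked model files
    ├── results/      # DVC-tracked result files
    ├── notebooks/    # exploratory analysis, model development, data manipulation
    └── src/          # training and evaluation scripts, tests, pipeline and API scripts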

This folder structure is going to be implemented in a blank project in Introduction/Starting a New Project with Cookiecutter.

Requirements

The requirements file is a list of all of a project's dependencies and the specific version of each dependency, including transitive dependencies (the dependencies needed by the dependencies). It can also be used to recreate a virtual environment. This is extremely important to avoid conflicts between Python libraries and to ensure the experiments can be reproduced on different machines.
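
For illustration only (the package names and version pins below are placeholders, not the project's actual requirements), a requirements file looks like this:

    # requirements.txt -- exact versions are pinned so every machine installs the same stack
    scikit-learn==0.23.2
    pandas==1.2.4
    dvc==2.0.18
    pytest==6.2.2

Running pip install -r requirements.txt inside a fresh virtual environment (for example, one created with python -m venv .venv) then reproduces the same set of packages on any machine.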

Metadata File

To keep track of the model information we have a metadata.yaml file. This helps with CI/CD and pipeline automation, for example updating or deploying the model without the need for user input.

author: guipleite
datetime_creted: 29/03/2021_13:46:23:802394723
model_type: scikit-learn_0.23
project_name: Rain_aus
project_version: v0.3
deployment_uid: e02e481d-4a56-470f-baa9-ae84a583c0a8
model_uid: f29e4cfc-3aab-438a-b703-fabc265f43a3
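
As a minimal sketch of how automation scripts can use this file (assuming PyYAML is installed; the script itself is illustrative and not part of the template):

    import yaml

    # Read the tracked metadata so pipeline steps can act on the model
    # without any user input.
    with open("metadata.yaml") as f:
        metadata = yaml.safe_load(f)

    # For example, a deployment step can look up which deployment to update.
    print("Project:", metadata["project_name"], metadata["project_version"])
    print("Deployment UID:", metadata["deployment_uid"])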

Using Jupyter Notebooks vs. Python Scripts

The Jupyter Notebook is an open-source web application that allows you to create and share documents containing live code, equations, visualizations, and narrative text. It is widely used in Data Science and Machine Learning for its versatility in developing and documenting projects. However, using notebooks can cause some problems for our development cycle:

  1. Versioning: Since a notebook's source code is much more complex (JSON that mixes code, outputs and metadata), we can't easily visualize the difference between versions using git, although there are some tools that can help with that.

  2. Reproducibility: A great feature of notebooks is being able to run cells in a non-sequential order, but this is a big problem if we want to reproduce the results, since it's hard to know which cells were executed and in what order. This is especially bad if we want to automate the pipeline.

  3. Standardized In/Out: By using scripts we can create pipelines with standardized inputs and outputs; no matter the model, what it receives and returns will be in the same format, so the same pipeline can be reused across models.

  4. Access to Functions: In the model.py script we define the train and evaluate functions, where the model is declared and trained and the evaluation metrics are computed. These functions can be called by other scripts such as train.py and evaluate.py, so we can create pipelines that train the model on a remote instance or evaluate an already trained model file in a consistent way, as sketched below.

    # model.py: model definition, training and evaluation logic
    def train(data, params):
        # build and fit the model, returning the fitted pipeline and training logs
        ...
        return pipeline, logs

    def evaluate(data, pipeline, OUTPUT_PATH):
        # compute the evaluation metrics and write them to OUTPUT_PATH
        ...
        return results
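
For illustration, a hypothetical train.py could wire these functions to the DVC-tracked files (the file paths, the params.yaml name and the use of pandas and joblib here are assumptions, not the template's exact code):

    import joblib
    import pandas as pd
    import yaml

    from model import train

    # Load the DVC-tracked training data and the hyperparameters.
    data = pd.read_csv("data/train.csv")
    with open("params.yaml") as f:
        params = yaml.safe_load(f)

    # Train and persist the fitted pipeline so evaluate.py or a deployment
    # step can pick it up later.
    pipeline, logs = train(data, params)
    joblib.dump(pipeline, "models/model.joblib")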
    

In our project we chose to use scripts instead of Jupyter Notebooks for the reasons cited above; however, notebooks can still be used as a form of experimentation with models or processes, with the scripts serving as the more 'definitive' form.