Automate your Machine Learning development pipeline with PyCaret

Data science is not easy, we all know that. Even programming takes plenty of cycles before you feel fully onboarded. Don’t get me wrong, I love being a developer to some extent, but it is hard. You can read and watch a ton of videos about how easy it is to get into programming, but as with everything in life, if you are not passionate, you may find some roadblocks along the way.

I get it, you may be thinking, “Nice way to start a post! I’m out, dude.” But let me tell you that even though becoming a data scientist is a challenge, as we become more data-centric, data-aware, and data-dependent, you need to sort these issues out to become a specialist. That’s part of the journey.

As you probably know, data cleansing and data preparation consume, on average, between 75% and 90% of the time spent on data analysis. In addition, if you are developing a machine learning model, you must spend time reviewing different algorithms and comparing their accuracy, precision, recall, and error rate, among other key metrics.

Now, what if you had a library that could automate that process and give you the best alternatives for training your model? That is what PyCaret is all about.

I consider myself a data science enthusiast. I enjoy developing machine learning and deep learning projects, finding patterns in data, and creating breathtaking visualizations. But to be honest, there are some repetitive tasks that I’d love to automate.
PyCaret, according to their developers, “is an open-source, low-code machine learning library in Python that automates machine learning workflows.” Cool, isn’t it?

Ok, I’ve got your attention, right? What does this mean for end-users starting out in data science? Now you can grab data, use low-code tools, and almost completely automate machine learning development projects without worrying about writing hundreds of lines of code. You may be thinking, “This is cheating, right?” Actually, when working on real-life projects, you need to optimize the time you spend on repetitive tasks. Even though mastering the foundations of machine learning models is a must if you are into data science, you should leverage any tool available out there that reduces your “time-to-market” and allows you to focus on the steps that add value to your project.

PyCaret is considered a Low-Code solution. PyCaret encourages non-programmer users to break the entry barrier they face when developing their first Machine Learning models by reducing the number of lines of code, libraries, and tasks required to build projects.

Let’s take a look at how PyCaret can take your development process to the next level. For the purpose of this example, I’ll be using Google Colab, but you can use the notebook or development setup of your choice.

Setting up your project

First things first, open your notebook.

<Hello, my friend Colab!>

Now, let’s install PyCaret.

!pip install pycaret
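
By the way, PyCaret also ships a full install that pulls in the optional dependencies (extra tuners, interpretability tools, and so on). A sketch, in case you want everything in one go:

!pip install pycaret[full]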

If you are using Google Colab, run the following code at the top of your notebook after installing PyCaret to display interactive visuals. This step is not required in other setups.

from pycaret.utils import enable_colab
enable_colab()

The notebook engine will install PyCaret and all its dependencies. That’s it, your environment is set up👌

What is binary classification?

Before we move forward, I’ll assume you have some experience with basic classification algorithms; nevertheless, let’s refresh your memory.

Binary classification is a supervised machine learning technique used to predict categorical class labels, which are discrete and unordered, such as Pass/Fail, Positive/Negative, Yes/No, or Default/Not-Default (a quick pandas sketch of such a target follows the list below).

A few real-world use cases for classification are listed below:

  • Running a targeted marketing campaign based on customer behavior, e.g. whether a customer is likely to respond.
  • Medical testing to determine if a patient has a certain disease or not — the classification property is the presence of the disease.
  • A “pass or fail” test method or quality control in factories, i.e. deciding if a specification has or has not been met — a go/no-go classification.
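
To make the idea concrete, here is a minimal sketch (with hypothetical data, not tied to PyCaret) of what a binary target looks like in pandas before a model ever sees it:

import pandas as pd

# Hypothetical quality-control results: each unit either meets the spec or not
df = pd.DataFrame({
    'unit_id': [101, 102, 103, 104],
    'result':  ['Pass', 'Fail', 'Pass', 'Pass'],
})

# A binary target is two discrete, unordered labels, usually encoded as 1/0
df['target'] = (df['result'] == 'Pass').astype(int)
print(df)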

Dataset

Following the PyCaret documentation, for this tutorial we will use the Default of Credit Card Clients Dataset. This dataset contains information on default payments, demographic factors, credit data, payment history, and billing statements of credit card clients in Taiwan from April 2005 to September 2005. There are 24,000 samples and 25 features.

Short descriptions of each column are as follows:

  • ID: ID of each client
  • LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
  • SEX: Gender (1=male, 2=female)
  • EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
  • MARRIAGE: Marital status (1=married, 2=single, 3=others)
  • AGE: Age in years
  • PAY_0 to PAY_6: Repayment status by n months ago (PAY_0 = last month … PAY_6 = 6 months ago) (Labels: -1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
  • BILL_AMT1 to BILL_AMT6: Amount of bill statement by n months ago (BILL_AMT1 = last month … BILL_AMT6 = 6 months ago)
  • PAY_AMT1 to PAY_AMT6: Amount of payment by n months ago (PAY_AMT1 = last month … PAY_AMT6 = 6 months ago)
  • default: Default payment (1=yes, 0=no) Target Column

Now that we understand our source, let’s download the data.

from pycaret.datasets import get_data
dataset = get_data('credit')

In order to demonstrate the predict_model() function on unseen data, a sample of 1,200 records has been withheld from the original dataset to be used for predictions. Keep in mind that this is not a test set; this unseen data is a subset that we will leverage at the end of the project to run our predictions. Another way to think about it is that these records were not available at the time the machine learning model was trained.

data = dataset.sample(frac=0.95, random_state=786)
data_unseen = dataset.drop(data.index)
data.reset_index(inplace=True, drop=True)
data_unseen.reset_index(inplace=True, drop=True)
print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))

Setting up Environment in PyCaret

The setup() function initializes the environment in pycaret and creates the transformation pipeline to prepare the data for modeling and deployment. setup() must be called before executing any other function in pycaret.

It takes two mandatory parameters: a pandas dataframe (I may create a post on pandas later on) and the name of the target column.

When setup() is executed, PyCaret's inference algorithm will automatically infer the data types for all features based on certain properties. If all of the data types are correctly identified, enter can be pressed to continue, or quit can be typed to end the experiment. Ensuring that the data types are correct is of fundamental importance in PyCaret, as it automatically performs a few pre-processing tasks that are imperative to any machine learning experiment. These tasks are performed differently for each data type, which means it is very important for them to be correctly configured. (We will see after the information grid below how to override the inferred types if needed.)

from pycaret.classification import *
exp_clf101 = setup(data = data, target = 'default', session_id=123)

Now, scroll to the bottom of the output, and in the highlighted box press enter to continue or type quit to end the experiment.

Once the setup has been successfully executed, it prints an information grid that contains several important pieces of information.

  • session_id: A pseudo-random number distributed as a seed in all functions for later reproducibility. If no session_id is passed, a random number is automatically generated that is distributed to all functions. In this experiment, the session_id is set as 123 for later reproducibility.
  • Target Type: Binary or Multiclass. The Target type is automatically detected and shown. There is no difference in how the experiment is performed for Binary or Multiclass problems. All functionalities are identical.
  • Label Encoded: When the Target variable is of type string (i.e. ‘Yes’ or ‘No’) instead of 1 or 0, it automatically encodes the label into 1 and 0 and displays the mapping (0: No, 1: Yes) for reference. In this experiment, no label encoding is required since the target variable is of type numeric.
  • Original Data: Displays the original shape of the dataset. For this experiment, (22800, 24) means 22,800 samples and 24 features, including the target column.
  • Missing Values: When there are missing values in the original data this will show as True. For this experiment, there are no missing values in the dataset.
  • Numeric Features: The number of features inferred as numeric. In this dataset, 14 out of 24 features are inferred as numeric.
  • Categorical Features: The number of features inferred as categorical. In this dataset, 9 out of 24 features are inferred as categorical.
  • Transformed Train Set: Displays the shape of the transformed training set.
  • Transformed Test Set: Displays the shape of the transformed test/hold-out set.
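
One more note on setup(): if PyCaret’s inference gets a column wrong, you don’t have to accept it. In PyCaret 2.x, setup() accepts explicit type overrides; a minimal sketch, assuming we want to force a couple of the coded columns to be treated as categorical and skip the confirmation prompt:

# Hypothetical overrides: declare column types instead of relying on inference
exp_custom = setup(data = data, target = 'default', session_id = 123,
                   categorical_features = ['EDUCATION', 'MARRIAGE'],
                   numeric_features = ['AGE'],
                   silent = True)  # skips the interactive data-type confirmation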

Comparing different models

Now that the setup is completed, let’s compare the different models to pick the ones that best fit the purpose of the analysis.

best_model = compare_models()

With two simple words of code (not even a full line), we have trained and evaluated over 15 models using cross-validation. 🦾

The scoring grid printed above highlights the highest performing metric for comparison purposes only. The grid by default is sorted using ‘Accuracy’ (highest to lowest).

By default, compare_models returns the best-performing model based on the default sort order, but it can also return a list of the top N models via the n_select parameter.
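
For instance, a quick sketch, assuming you want the top three models ranked by AUC instead of Accuracy:

# Return the three best models, sorted by AUC instead of the default Accuracy
top3 = compare_models(n_select = 3, sort = 'AUC')
print(top3)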

Let's print the best model.

print(best_model)

For this use case, we got the following results:

LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,solver='svd', store_covariance=False, tol=0.0001)

Linear Discriminant Analysis is our best candidate. Now, let’s create our model.

Creating and Tuning your model

Now that you have a candidate for your project, let's create a model and tune it accordingly.

First, we need to identify our model’s id.

models()

Let’s create our model, shall we?

lda = create_model('lda')

That’s it, you have your model. But <Yeap, always a but> as in any Machine Learning project, you need to tune your model. The good news is that it only takes one line to get this running.

tuned_lda = tune_model(lda)

This function automatically tunes the hyperparameters of the model using Random Grid Search over a pre-defined search space. The output prints a scoring grid that shows Accuracy, AUC, Recall, Precision, F1, Kappa, and MCC by fold for the best model.
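
tune_model() also exposes a few useful knobs. A sketch, assuming you want a larger search budget and to optimize for AUC rather than the default Accuracy (for the rest of this post we will stick with the default tuned_lda):

# Hypothetical variant: 50 random-search iterations, optimizing AUC
tuned_lda_auc = tune_model(lda, n_iter = 50, optimize = 'AUC')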

We can print the tuned model’s results.

print(tuned_lda)

And we got the following output from the tuning process.

LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage='auto',solver='lsqr', store_covariance=False, tol=0.0001)

This was the original model without tuning. You can spot the differences.

LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,solver='svd', store_covariance=False, tol=0.0001)

Plotting the model

It would be nice to see the results in a plot, right? Oh, wait! We can do that with a single line of code!

Let’s print the Confusion Matrix. But what is a confusion matrix? It is a table with the 4 different combinations of predicted and actual values. We use it for measuring Recall, Precision, Specificity, and Accuracy, and it underpins AUC-ROC curves.

plot_model(tuned_lda, plot = 'confusion_matrix')
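
As a refresher, the headline metrics all derive from the four cells of that matrix. A minimal sketch with illustrative numbers (not this model’s actual output):

# Illustrative counts, not taken from our experiment
tp, fp, tn, fn = 80, 20, 90, 10

accuracy  = (tp + tn) / (tp + tn + fp + fn)  # overall hit rate
precision = tp / (tp + fp)                   # how trustworthy a positive call is
recall    = tp / (tp + fn)                   # how many actual positives we caught
print(accuracy, precision, recall)           # 0.85 0.8 0.888...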

Now that we have a sense of the True Positive, True Negative, False Positive, and False Negative, let’s plot the ROC Curves.

plot_model(tuned_lda, plot = 'auc')

If this is the first time you’ve seen this plot: the ROC curve lets you visualize how well your model separates True Positive cases from false alarms. Depending on your use case, you will want the curve to sit as far above the 0.5 diagonal as possible, although a near-perfect curve can be a sign of overfitting. We will discuss that in later posts. For now, keep the idea that as long as we are above the 0.5 line, we are doing better than chance at identifying positive cases.

PyCaret offers 15 different plots to use as part of your analysis. You can refer to the plot_model() documentation for further information on the plots.
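
As a side note, if you would rather browse all of them interactively instead of calling plot_model() once per plot, PyCaret also provides evaluate_model(), which renders a clickable plot selector in the notebook:

evaluate_model(tuned_lda)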

Another interesting plot is the precision-recall curve.

plot_model(tuned_lda, plot = 'pr')

You can even visualize the feature importance ranking.

plot_model(tuned_lda, plot='feature')

Predict on test / hold-out Sample

It is recommended to perform one final check by predicting on the test/hold-out set and reviewing the metrics.

Now, using our final trained model stored in the tuned_lda variable, we will predict against the test/hold-out sample that setup() put aside during the data preparation step, and evaluate the metrics to see if they are materially different from the cross-validation results.

predict_model(tuned_lda)

The accuracy on the test set is 0.8197 compared to 0.8292 achieved on the tuned_lda results. This is not a significant difference; a substantial gap between them would point to an overfitted model, an undesirable behavior for our Machine Learning project.

Wrapping up your model

At this stage, after setting up and tuning your model, the last step is to finalize it. Finalizing re-trains the pipeline on the entire dataset, including the test/hold-out sample.

To achieve this, guess what? Only one line of code is required. Cool, right?

final_lda = finalize_model(tuned_lda)

We can now use our model to predict data against our data_unseen dataset.

unseen_predictions = predict_model(final_lda, data=data_unseen)
unseen_predictions.head(10)

There are many columns in the output of this command, but let’s focus on the last three: default, Label, and Score. Label is the predicted value and Score is the probability of that prediction, shown alongside the actual default value. At a glance, our model is performing well.
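
If you want to go beyond a glance and quantify it, PyCaret 2.x ships check_metric in pycaret.utils; a sketch comparing the true target against the predicted Label column:

from pycaret.utils import check_metric

# Accuracy of the predictions on the unseen data
check_metric(unseen_predictions['default'], unseen_predictions['Label'], metric = 'Accuracy')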

Saving your model

Before we depart, we need to save our model to either continue with tests or deploy it into production.

save_model(final_lda,'Final LDA Model 11Feb2022')
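
When you need the model back, in another session or in production, load_model() restores the entire pipeline from disk, ready to predict:

# Reload the saved pipeline and run it against new data
saved_final_lda = load_model('Final LDA Model 11Feb2022')
new_predictions = predict_model(saved_final_lda, data = data_unseen)
new_predictions.head()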

And that’s it. We have a Machine Learning model created and ready to be used in production with a few steps.

PyCaret can speed up your development process while you focus on other tasks such as tuning your model. As I stated at the beginning, Data Science is not easy, but as we evolve into a data-centric ecosystem, we depend on specialists and tools to help companies make timely decisions based on information.

Hope you have found this post informative. Feel free to share it; we want to reach as many people as we can, because knowledge must be shared, right?

For further reference, you can visit the PyCaret documentation.

If you reached this point, thank you!

<AL34N!X>