APS Failure Prediction With Machine Learning

AMLAN GOPAL DHALASAMANTA
9 min read · May 14, 2021

Business Problem

APS (Air Pressure System) is an essential system in heavy vehicles that primarily manages braking. In heavy trucks, we commonly see mechanical issues in various components that cause component failures. Regular maintenance reduces the chance of component failure, but it costs money. The APS is also connected to other systems, so a failure in the APS affects those connected systems as well. By predicting the failure type (whether the failed component is part of the APS system or not), we can save cost and time.

A failed component, then, may or may not be part of the APS system. It is helpful if we can identify a failed APS component and isolate it, so that the full APS system doesn't need heavy maintenance and the other systems connected to the APS aren't affected. With the daily usage data collected from a truck during its operation up to the time of failure, we can categorize, after inspection at a service center, whether the failure occurred in the APS system or not.

So we have data describing different components under different conditions, along with a binary value for the failure type (APS-related or not). If we can use this data to predict whether a failure is due to the APS, the manufacturer benefits: the failed component can be isolated from the truck's APS system.


ML Problem Statement

The above problem can be framed as a binary classification machine learning problem: given the sensor information, the model needs to predict whether the failed component is part of the APS system or not. Here the positive class means the failure occurred due to a component of the APS system, and the negative class means the APS system is not responsible for the failure.

Constraints:

1. Low latency: the model should return predictions quickly, so that action can be taken before further failures in the APS system drive up maintenance costs.

2. Cost of misclassification: a wrong prediction increases maintenance cost and hurts the owner through future APS failures. The penalty should therefore be high for false negatives; in this problem, a false negative costs far more than a false positive.

Dataset Overview

We can download the data from https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks

In total there are 171 columns, of which 170 are collected features and one contains the target value ('pos', 'neg'). The feature names are not disclosed for proprietary reasons, and all features are numeric. We have a train and a test dataset: the train set has 60,000 data points, of which 1,000 are positive and 59,000 are negative, so the data is highly imbalanced, and we need to account for this while building models. The test set has 16,000 data points. So we have enough data to train and test.

Evaluation Metric

Model performance will be evaluated using the confusion matrix, false positives, and false negatives, with false negatives penalized more heavily than false positives. Heavy-truck maintenance is costly. If the model predicts that the failed component is from the APS system but it isn't, the manufacturer works on that component anyway; this doesn't hurt much and counts as an extra/additional maintenance operation. But if the model predicts that the failed component isn't from the APS system when it actually is, the manufacturer misses the maintenance operation and fails to isolate that component. That leads to more failures in that component and in the APS system, which costs much more than the previous case.

So we can set a penalty of 10 for cost 1 (a false positive), the cost of an unnecessary maintenance check, and a penalty of 500 for cost 2 (a false negative), the cost of missing a faulty truck that may break down.
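
As a concrete reference, here is a minimal sketch of this cost metric in Python (the helper name 'aps_cost' is my own):

    from sklearn.metrics import confusion_matrix

    def aps_cost(y_true, y_pred):
        """Total cost = 10 per false positive (unnecessary check)
        + 500 per false negative (missed faulty truck)."""
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
        return 10 * fp + 500 * fn

This is the quantity we will try to minimize when comparing models.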

Approach

We can go with the below approach to solve this problem.

  1. Collect data and do data preprocessing.
  2. Perform EDA
  3. Train different models with proper metrics and compare the models.
  4. Choose the best model which will give the lowest cost.

Load Data

We have train and test CSV files that include some metadata at the top, so we extract only the observation data from those files. Refer to the snippet below.
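
A minimal sketch, assuming the standard UCI file names and that the metadata block spans the first 20 lines (adjust 'skiprows' to match your copy of the files):

    import pandas as pd

    # The raw CSV files begin with a block of license/metadata lines before
    # the actual header row; skiprows should be set so that the header row
    # ('class', 'aa_000', ...) is the first line read.
    train_df = pd.read_csv('aps_failure_training_set.csv', skiprows=20)
    test_df = pd.read_csv('aps_failure_test_set.csv', skiprows=20)

    print(train_df.shape)  # expected: (60000, 171)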

Printing the first few rows of the loaded dataframe shows what the train data looks like.

EDA And Data Preprocessing

First, we need to check the class imbalance. The number of positive class labels is very small compared to the negative class labels. As the class labels are 'pos' and 'neg', we can convert them to 1 and 0 respectively for later use.
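
A short sketch, assuming the target column is named 'class' as in the UCI files:

    # Inspect the class imbalance.
    print(train_df['class'].value_counts())
    # neg    59000
    # pos     1000

    # Map the string labels to integers for modeling.
    train_df['class'] = train_df['class'].map({'pos': 1, 'neg': 0})
    test_df['class'] = test_df['class'].map({'pos': 1, 'neg': 0})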

Before modeling, we need to deal with NaN values and any outlier data points.

In this dataset, NaN values are represented by the string 'na'. Since all the features are numeric, wherever we find 'na' or any other non-numeric value, we can convert it to NaN.

We can then check the distribution of NaN values per column and, based on that analysis, decide how to deal with them.
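
A sketch of both steps, converting non-numeric values to NaN and then summarizing the NaN percentage per column:

    X = train_df.drop(columns=['class'])

    # Coerce every column to numeric; the string 'na' (and any other
    # non-numeric value) becomes NaN.
    X = X.apply(pd.to_numeric, errors='coerce')

    # Percentage of NaN values per column, sorted worst-first.
    nan_pct = X.isna().mean().sort_values(ascending=False) * 100
    print(nan_pct.head(10))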

We can see there are 7 columns with more than 70% NaN values. We can use imputation methods to deal with the NaNs; alternatively, columns with a very high proportion of NaN values can simply be dropped.

Now we can use SimpleImputer with the median strategy to impute the NaN values, followed by StandardScaler to standardize the data. As we have 170 features, we can also check whether any feature holds a constant value throughout the whole dataset and remove it, since it adds no information for determining the target class label.

'cd_000' turns out to be the dropped feature. At prediction time, we can use the index of that column to remove the corresponding value from the input vector of a single data point.
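
A sketch of this preprocessing, using scikit-learn's SimpleImputer and StandardScaler and a generic constant-column check (which, on this data, should flag 'cd_000'):

    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler

    # Median imputation of NaN values.
    imputer = SimpleImputer(strategy='median')
    X_imp = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)

    # Drop any feature that is constant across the whole dataset.
    constant_cols = [c for c in X_imp.columns if X_imp[c].nunique() == 1]
    X_imp = X_imp.drop(columns=constant_cols)

    # Standardize the remaining features.
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_imp)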

To see how the data behaves with respect to the class labels, we perform EDA such as box plots and t-SNE, and check whether the classes are separable. We can't use all the columns in a univariate analysis like box plots, so we select a few features. Instead of picking features at random, we check each feature's correlation with the target value and select the top features with high target correlation; those selected features should also not be correlated with each other.

First, we select the top 20–30 features with the highest correlation to the target class label. Then, for each of these features, we find the other features it is correlated with and take the overlap with the selected set. From each such group of mutually correlated features, we keep only the one with the highest target correlation and discard the rest.
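
A sketch of this selection procedure; the candidate count (25) and the pairwise correlation threshold (0.8) are assumptions, not values from the original:

    # Work on a copy that includes the target.
    df = X_imp.copy()
    df['class'] = train_df['class'].values

    # Absolute correlation of each feature with the target, strongest first.
    target_corr = df.corr()['class'].drop('class').abs()
    candidates = list(target_corr.sort_values(ascending=False).head(25).index)

    # Greedily keep a candidate only if it is not highly correlated
    # (|r| > 0.8) with a feature we have already kept.
    pairwise = df[candidates].corr().abs()
    selected = []
    for feat in candidates:  # candidates are ordered by target correlation
        if all(pairwise.loc[feat, kept] <= 0.8 for kept in selected):
            selected.append(feat)
    print(selected)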

Now, before plotting the data, we need to get rid of outliers. We can compute percentile values and check whether there are extreme points; if there are, we can remove them or replace them with the previous percentile value.
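
For instance, clipping each selected feature at assumed percentile cut-offs:

    import numpy as np

    # Clip at the 1st and 99th percentiles (assumed cut-offs), so extreme
    # points are replaced by the boundary value instead of being removed.
    for col in selected:
        lo, hi = np.percentile(df[col], [1, 99])
        df[col] = df[col].clip(lower=lo, upper=hi)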

Now we can perform EDA on the processed data, running a univariate analysis (box plot) for each selected column.

We can also use all the selected features and perform t-SNE.
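
Since the plot isn't reproduced here, a sketch of the t-SNE step (the subsample size and perplexity are assumptions; subsampling keeps the runtime manageable on 60,000 points):

    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    sample = df.sample(n=5000, random_state=42)

    # 2-D embedding of the selected features.
    tsne = TSNE(n_components=2, perplexity=30, random_state=42)
    points = tsne.fit_transform(sample[selected].values)

    plt.scatter(points[:, 0], points[:, 1], c=sample['class'],
                s=4, cmap='coolwarm')
    plt.title('t-SNE of selected features, colored by class')
    plt.show()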

For the three columns plotted above, both classes are easily separable. Now we can use this data to build models.

Building ML models and Model comparison

We can use classification models like logistic regression, SVC (linear and kernel), RandomForestClassifier, and XGBClassifier. We can also use stacking models with different numbers of randomly selected base models; for the stacking models, we train on sampled rows and sampled columns. For each model, we do hyperparameter tuning and model evaluation.
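
As one example, tuning logistic regression with the cost metric as the selection criterion (the parameter grid and the 'class_weight' choice are assumptions):

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import make_scorer

    # Lower cost is better, so greater_is_better=False lets GridSearchCV
    # minimize the APS cost defined earlier.
    cost_scorer = make_scorer(aps_cost, greater_is_better=False)

    # class_weight='balanced' is one way to account for the class imbalance.
    grid = GridSearchCV(
        LogisticRegression(class_weight='balanced', max_iter=1000),
        param_grid={'C': [0.01, 0.1, 1, 10]},
        scoring=cost_scorer,
        cv=3,
    )
    grid.fit(X_scaled, train_df['class'])
    print(grid.best_params_, -grid.best_score_)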

Here we are more focused on the cost function, so instead of considering log-loss, F1-score, etc., we select the best model as the one with the lowest cost.

We can see that log-loss and the number of misclassified points are very low for the models other than logistic regression and SVM, which implies those models perform better on those metrics. But for this problem we need to minimize the cost, and in the model comparison logistic regression gives the lowest cost. So this is the final model. The logistic regression model, the imputer, and the StandardScaler object can now be saved for real-time predictions.
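
For example, with joblib (the file names are assumptions):

    import joblib

    # Persist the fitted preprocessing objects and the final model so the
    # web service can reload them for real-time predictions.
    joblib.dump(imputer, 'imputer.pkl')
    joblib.dump(scaler, 'scaler.pkl')
    joblib.dump(grid.best_estimator_, 'model.pkl')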

Usecase

We can create a web application that uses the model to serve predictions for requested input values. The service class has two methods: the first gives a prediction for a single data point, and the second reports model performance for a given set of inputs and class labels.

As it's a web application, we need a controller which will receive HTTP requests and call the service accordingly.

In the service class, the first method takes a list of values as input. After preprocessing with imputation and scaling, it uses the saved model to produce a prediction of 'pos' or 'neg'.

The second method takes a batch of data points with class labels and calls the first method on each of them to get predictions. Using the actual and predicted class labels, it evaluates the model, reporting cost, log-loss, F1-score, number of misclassified points, etc. These are all calculated in another method, which is also responsible for plotting the confusion matrix.

Flask can be used to build a web application that wires these services/functions together.
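
A minimal sketch of such an app; the route name, payload format, and file names are assumptions, and 'CD_000_IDX' is a placeholder for the saved index of the dropped column:

    from flask import Flask, request, jsonify
    import joblib
    import numpy as np

    app = Flask(__name__)

    # Artifacts saved after training.
    imputer = joblib.load('imputer.pkl')
    scaler = joblib.load('scaler.pkl')
    model = joblib.load('model.pkl')
    CD_000_IDX = 0  # placeholder: use the 'cd_000' column index saved at training time

    def predict_one(values):
        """Service method 1: predict 'pos'/'neg' for a single data point
        given as a list of 170 raw feature values (NaNs allowed)."""
        x = np.array(values, dtype=float).reshape(1, -1)
        x = imputer.transform(x)
        x = np.delete(x, CD_000_IDX, axis=1)  # drop the constant feature
        x = scaler.transform(x)
        return 'pos' if model.predict(x)[0] == 1 else 'neg'

    @app.route('/predict', methods=['POST'])
    def predict():
        # Assumed JSON payload: {"values": [v1, ..., v170]}
        return jsonify({'prediction': predict_one(request.json['values'])})

    if __name__ == '__main__':
        app.run()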

For example, a client can POST a data point (a list of 170 values) to the predict endpoint and receive the predicted class label.

Our model can predict both classes effectively, but it can be further improved in the ways described below.

Future Work

The existing model can be improved in the following ways, starting with more efficient handling of NaN values:

  1. Drop the features that are almost entirely NaN, i.e., those with more than 80% NaN values.
  2. For features with 15% to 80% NaN values, use model-based imputation instead of simple imputation.
  3. Use simple imputation only for the features with less than 15% NaN values.
  4. We have 170 features, and some of them may not be important, so a forward feature selection method can be used to get rid of unnecessary features.
  5. As we have seen, random forest and XGBoost worked very well in terms of log-loss and F1-score, so we can also use the feature importances from those models to drop features that don't add much value.
  6. More granular control over the hyperparameters is another option to improve model performance.
  7. Instead of the F1-score, the AUC score can also be used as a metric to evaluate the models.

References

https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks

https://www.appliedaicourse.com/

Links

GitHub link: https://github.com/Amlan-Gopal/APS-Failure-case-study.git

LinkedIn link: https://www.linkedin.com/in/amlan-gopal-dhalasamanta-451570128/
