What is a Machine Learning Pipeline?
A Machine Learning Pipeline is a way to automate the workflow it takes to produce a machine learning model. It consists of sequential steps chained together to perform tasks such as data ingestion, data cleaning, feature engineering, model training, and deployment.
Why Use A Pipeline?
A Machine Learning workflow consists of a series of steps:
- Data ingestion
- Data cleaning
- Data preprocessing / feature engineering
- Modeling
- Deployment
A Pipeline breaks this workflow up by splitting each section of the project into individual components that can be executed by the user. Having a pipeline therefore keeps the transformation process ordered, compact, easy to understand, and reproducible.
Pipeline Creation
A Pipeline is created using the sklearn.pipeline.Pipeline object.
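As a minimal sketch, a pipeline chains transformers with a final estimator; the feature matrix X and target y here are assumed for illustration:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Each step is a (name, estimator) tuple; every step except the
# last must be a transformer, and the last may be a model
pipe = Pipeline(steps=[
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

pipe.fit(X, y)                # X (numeric features) and y (target) are assumed
predictions = pipe.predict(X)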
Feature Transformations
In most datasets, numerical and categorical data live in the same DataFrame. Machine Learning Algorithms accept numerical values rather than strings, so categorical or string values need to be converted before being passed to the Algorithm. To do this, we transform these features using preprocessing methods such as OneHotEncoder() (LabelEncoder() is also common, though it is intended for target labels rather than input features), or we create new features in the dataset.
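For instance, here is a minimal sketch of one-hot encoding a categorical column; the toy data and the gender column name are assumed for illustration:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({'gender': ['male', 'female', 'female']})  # toy data, assumed

ohe = OneHotEncoder()
encoded = ohe.fit_transform(toy[['gender']])  # sparse matrix of 0/1 indicator columns
print(ohe.categories_)                        # [array(['female', 'male'], dtype=object)]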
Transforming DataFrames with Different Transformations
Depending on the data, we may not want to implement the same transformation on all of the columns.
For example, consider a dataset with Income and Height columns. Income data tends to be skewed, so the median is chosen to fill missing values, while height is roughly normally distributed, so the mean is chosen. Another dataset could contain both numerical and categorical data, which would need to be processed separately.
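Outside a pipeline, one might handle this manually; a sketch, with df and the column names assumed:

from sklearn.impute import SimpleImputer

# Skewed column: impute missing values with the median
income_imputer = SimpleImputer(strategy='median')
df[['income']] = income_imputer.fit_transform(df[['income']])

# Roughly normal column: impute with the mean (the default strategy)
height_imputer = SimpleImputer(strategy='mean')
df[['height']] = height_imputer.fit_transform(df[['height']])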
Typical preprocessing steps apply to the entire DataFrame rather than to a specific column, as in the example below. Another downside is that we cannot build feature engineering steps into the pipeline shown below.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Every step here is applied to every column of df
pipeline = Pipeline(steps=[
    ('imp', SimpleImputer()),
    ('ohe', OneHotEncoder())
])
transformed_data = pipeline.fit_transform(df)
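Note that calling fit_transform here applies SimpleImputer() and then OneHotEncoder() to every column of df. On a DataFrame that mixes numbers and strings, SimpleImputer's default mean strategy will raise an error on the string columns, which is exactly why column-specific transformations are needed.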
There are a few ways to perform different transformations on individual columns.
Method 1
- Select the first set of columns in the DataFrame and apply the first transformation
- Select the second set of columns in the DataFrame and apply the second transformation
- Combine the results side by side (column-wise) using sklearn.pipeline.FeatureUnion
Example
1) Create a class to select the features of interest in the DataFrame
from sklearn.base import BaseEstimator, TransformerMixin

# Create a class to select the features of the DataFrame
class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_names):
        self._feature_names = feature_names

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Return only the requested columns
        return X[self._feature_names]
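As a quick sanity check (df and the column names are assumed), the selector simply returns the requested columns:

FeatureSelector(['income', 'height']).fit_transform(df)  # DataFrame with only these two columns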
2) Create individual transformers to process each selected set of columns
# Create a class to perform numerical transformations
class NumericalTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Perform your transformations or feature engineering here
        return X

# Create a class to perform categorical transformations
class CategoricalTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Perform transformations or feature engineering here
        return X
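To make the stubs concrete, here is a hedged sketch of what a transform body might contain, assuming (for illustration) a skewed income column that benefits from a log transform:

import numpy as np

class NumericalTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.copy()
        # Hypothetical feature engineering: log-transform the skewed income column
        X['income'] = np.log1p(X['income'])
        return X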
3) Put everything together into a pipeline. If we desire, we can chain existing transformers such as SimpleImputer() and OneHotEncoder() into the pipeline.
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

# Lists of the numerical and categorical columns in the DataFrame
numerical_cols = ['income', 'height']
categorical_cols = ['gender']

# Create a pipeline to process numerical features.
# We can chain built-in transformers such as SimpleImputer in the pipeline.
numerical_pipeline = Pipeline(steps=[
    ('num_selector', FeatureSelector(numerical_cols)),
    ('num_transformer', NumericalTransformer()),
    ('imputer', SimpleImputer()),
    ('std_scaler', StandardScaler())
])

# Create a pipeline to process categorical features
categorical_pipeline = Pipeline(steps=[
    ('cat_selector', FeatureSelector(categorical_cols)),
    ('cat_transformer', CategoricalTransformer()),
    ('ohe', OneHotEncoder())
])

# Combine the two pipelines side by side
full_pipeline = FeatureUnion(transformer_list=[
    ('categorical_pipeline', categorical_pipeline),
    ('numerical_pipeline', numerical_pipeline)
])

# Call .fit_transform on the DataFrame
transformed_data = full_pipeline.fit_transform(df)
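The combined pipeline can also feed directly into a model. Here is a sketch, with the estimator choice (LogisticRegression) and the target y assumed for illustration:

from sklearn.linear_model import LogisticRegression

model_pipeline = Pipeline(steps=[
    ('preprocessing', full_pipeline),
    ('model', LogisticRegression())
])
model_pipeline.fit(df, y)                 # y assumed to hold the target labels
predictions = model_pipeline.predict(df)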
Method 2
- Declare a pipeline using sklearn.compose.ColumnTransformer
- Add the transformations to the pipeline and select the columns for each transformation
Example
1) Create a function that returns a DataFrame. Since the returned value becomes the input of the next transformation step, the function returns the DataFrame for the next transformer to process. (For illustration, we assume below that the selected column contains date strings from which a year can be parsed.)
import pandas as pd

def extract_year(dataframe):
    # Assumption for illustration: each selected column holds parseable
    # date strings; extract the year from each as the new feature
    return dataframe.apply(lambda col: pd.to_datetime(col).dt.year)
2) Create a FunctionTransformer object

from sklearn.preprocessing import FunctionTransformer

get_year = FunctionTransformer(extract_year, validate=False)
3) Create ColumnTransformer with the new FunctionTransformer in the Pipeline
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer([
    ('get_year', get_year, ['year']),
    ('imp', SimpleImputer(), ['income', 'height']),
    ('ohe', OneHotEncoder(), ['gender'])
])
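Calling fit_transform applies each transformation to its listed columns and concatenates the results side by side:

# Columns not listed in the ColumnTransformer are dropped by default
# (remainder='drop'); pass remainder='passthrough' to keep them
transformed_data = ct.fit_transform(df)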
Note: If the object (string) variables are not all encoded during the transformation, the result will be an array of strings, which will raise an error when passed to your Machine Learning Algorithm.
Conclusion
There are different ways of creating pipelines, and two of them are shown here. Pipeline creation is a crucial step in making sure the transformation process is ordered, compact, easy to understand, and reproducible.