Test your machine learning pipeline
2022-04-23 08:36:00 | Li Guodong
When it comes to data products, there is a common misconception that they cannot be covered by automated testing. Although some parts of the pipeline cannot be verified by traditional testing methods because of their experimental and stochastic nature, most of the pipeline can. On top of that, the less predictable algorithms can be put through a dedicated validation process.
Let's take a look at traditional testing methods and how we can apply them to our data/ML pipelines.
The testing pyramid
The standard, simplified testing pyramid represents the types of tests you write for an application. We start with a lot of unit tests, which test a single function in isolation from other functions. Then we write integration tests, which check that the components we tested in isolation work together as expected. Finally, we write UI or acceptance tests, which check that the application works as expected from the user's perspective.
In terms of data products, the pyramid is not much different; we have more or less the same levels.
Note: the product will still have UI or acceptance tests, but this article focuses on the tests most relevant to the data pipeline.
With the help of some science-fiction authors, let's take a closer look at what each of these means in the context of machine learning.
Unit testing
Most of the code in a data pipeline consists of data-cleaning steps. Each function used for data cleaning has a clear goal. For example, suppose one of the features we choose for our model is the change in a value between the previous day and the current day. Our code might look like this:
def add_difference(asimov_dataset):
    asimov_dataset['total_naughty_robots_previous_day'] = (
        asimov_dataset['total_naughty_robots'].shift(1))
    asimov_dataset['change_in_naughty_robots'] = (
        abs(asimov_dataset['total_naughty_robots_previous_day'] -
            asimov_dataset['total_naughty_robots']))
    return asimov_dataset[['total_naughty_robots', 'change_in_naughty_robots',
                           'robot_takeover_type']]
Here, we know the output we expect for a given input, so we can test it with the following code:
import pandas as pd
from pandas.testing import assert_frame_equal
import numpy as np

def test_change():
    asimov_dataset_input = pd.DataFrame({
        'total_naughty_robots': [1, 4, 5, 3],
        'robot_takeover_type': ['A', 'B', np.nan, 'A']
    })
    expected = pd.DataFrame({
        'total_naughty_robots': [1, 4, 5, 3],
        'change_in_naughty_robots': [np.nan, 3, 1, 2],
        'robot_takeover_type': ['A', 'B', np.nan, 'A']
    })
    result = add_difference(asimov_dataset_input)
    assert_frame_equal(expected, result)
For each independent function, you write a unit test to make sure that each part of the data-transformation process has the expected effect on the data. For each function you should also consider the different scenarios (is there an if statement? then all conditions should be tested). These tests then run on every commit as part of your continuous integration (CI) pipeline.
Besides checking that the code does what is intended, unit tests also help us when debugging a problem. By adding a test that reproduces a newly discovered bug, we can ensure that the bug is actually fixed when we think we have fixed it, and that it does not reappear later.
Finally, these tests not only check that the code does what is intended, they also help us document the expectations we had when we created the functionality.
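For instance, here is a minimal sketch of an extra edge-case test for add_difference (the single-row scenario below is hypothetical, invented for illustration): a lone row has no previous day, so the change should come out as NaN rather than raise an error.
import pandas as pd
import numpy as np
from pandas.testing import assert_frame_equal

def test_change_single_row():
    # Edge case: with only one row there is no previous day,
    # so the change should be NaN instead of an error.
    asimov_dataset_input = pd.DataFrame({
        'total_naughty_robots': [5],
        'robot_takeover_type': ['A']
    })
    expected = pd.DataFrame({
        'total_naughty_robots': [5],
        'change_in_naughty_robots': [np.nan],
        'robot_takeover_type': ['A']
    })
    result = add_difference(asimov_dataset_input)
    assert_frame_equal(expected, result)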
Integration testing
These tests are designed to determine whether modules that were developed separately also work as expected when combined. In terms of a data pipeline, they can check that:
- The data cleaning process produces a dataset that is appropriate for the model
- The model training step can handle the data provided to it and outputs a result (ensuring that the code can be refactored in the future)
So if we take the unit-tested function above and add the following two functions:
def remove_nan_size(asimov_dataset):
    return asimov_dataset.dropna(subset=['robot_takeover_type'])

def clean_data(asimov_dataset):
    asimov_dataset_with_difference = add_difference(asimov_dataset)
    asimov_dataset_without_na = remove_nan_size(asimov_dataset_with_difference)
    return asimov_dataset_without_na
Then we can use the following code to test that the functions combined in clean_data produce the expected result:
def test_cleanup():
    asimov_dataset_input = pd.DataFrame({
        'total_naughty_robots': [1, 4, 5, 3],
        'robot_takeover_type': ['A', 'B', np.nan, 'A']
    })
    expected = pd.DataFrame({
        'total_naughty_robots': [1, 4, 3],
        'change_in_naughty_robots': [np.nan, 3, 2],
        'robot_takeover_type': ['A', 'B', 'A']
    }).reset_index(drop=True)
    result = clean_data(asimov_dataset_input).reset_index(drop=True)
    assert_frame_equal(expected, result)
Now suppose the next thing we do is feed the above data into a logistic regression model.
from sklearn.linear_model import LogisticRegression

def get_regression_training_score(asimov_dataset, seed=9787):
    clean_set = clean_data(asimov_dataset).dropna()
    input_features = clean_set[['total_naughty_robots',
                                'change_in_naughty_robots']]
    labels = clean_set['robot_takeover_type']
    model = LogisticRegression(random_state=seed).fit(input_features, labels)
    return model.score(input_features, labels) * 100
Although we don't know the expected score, we can make sure that we always end up with the same value. Testing this integration is useful to ensure that:
- The model can consume the data (there is a label for every input, the data types match the type of the chosen model, and so on)
- We are able to refactor our code in the future without breaking the end-to-end functionality
We can ensure that the results are always the same by providing the same seed to the random generator. All major libraries allow you to set the seed (TensorFlow is a bit special, as it requires you to set the seed via numpy, so keep this in mind). The test could look as follows:
from numpy.testing import assert_equal

def test_regression_score():
    asimov_dataset_input = pd.DataFrame({
        'total_naughty_robots': [1, 4, 5, 3, 6, 5],
        'robot_takeover_type': ['A', 'B', np.nan, 'A', 'D', 'D']
    })
    result = get_regression_training_score(asimov_dataset_input, seed=1234)
    expected = 50.0
    assert_equal(result, expected)
There won't be as many of these tests as unit tests, but they are still part of your CI pipeline. You use them to check the end-to-end functionality of a component, so they cover the more major scenarios.
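As a minimal sketch of how this can fit into CI (assuming pytest; the custom "integration" marker below is a hypothetical choice, not something from the original article), the slower integration tests can be marked so that they can be selected separately from the fast unit tests:
import pytest

# 'integration' is a custom marker; register it in pytest.ini (or
# pyproject.toml) under "markers" to avoid unknown-marker warnings.
@pytest.mark.integration
def test_full_pipeline():
    # Re-use the integration test defined above; the marker lets CI
    # select it:
    #   pytest -m integration          -> only integration tests
    #   pytest -m "not integration"    -> fast unit tests only
    test_regression_score()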
Machine learning validation
Now that we have tested our code, we also need to test whether the ML component solves the problem we are trying to solve. When we talk about product development, the raw results of an ML model (however accurate by statistical measures) are almost never the final output you need. These results are usually combined with other business rules before being consumed by a user or another application. Therefore, we need to validate that the model solves the user's problem, not just that the accuracy/f1-score/other statistical measure is high enough.
How does this help us?
- It ensures that the model actually helps the product solve the problem at hand.
- For example, a model that classifies snake bites as lethal or not is not a good model if it is wrong 20% of the time, leaving patients without the treatment they need.
- It ensures that the values produced by the model make sense in terms of the industry.
- For example, a model that predicts changes in price with 70% accuracy is not a good model if the final price shown to the user is so low/high that it makes no sense in that industry/market.
- It provides an extra layer of documentation of the decisions made, helping engineers who join the team later in the process.
- It provides visibility of the ML components of the product in a common language that clients, product managers and engineers all understand in the same way.
This validation should run periodically (through the CI pipeline or a cron job), and its results should be made visible to the organisation. This ensures that the organisation can see the progress of the data science components and that problems caused by changed or stale data are caught early.
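To make this concrete, here is a minimal sketch of such a business-rule validation (the model interface, the price bounds and the helper name are all hypothetical, invented for illustration):
def validate_price_predictions(model, validation_features,
                               min_price=1.0, max_price=500.0):
    # Business-rule check: every predicted price must fall inside the
    # range that makes sense for this market (bounds are hypothetical).
    predictions = model.predict(validation_features)
    out_of_range = [p for p in predictions
                    if not min_price <= p <= max_price]
    # Fail the validation run if any prediction violates the rule, so
    # the problem surfaces long before a user sees a nonsensical price.
    assert not out_of_range, (
        f"{len(out_of_range)} predictions fall outside "
        f"[{min_price}, {max_price}]")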
Summary
ML components can be tested in many ways, which brings us the following advantages:
- A data-driven approach to ensuring that the code does what is expected
- The confidence that we can refactor and clean up the code without breaking the product's functionality
- Documentation of functionality, decisions and previous bugs
- Visibility of the progress and state of the ML components of the product
So don't be scared: if you have the skills to write the code, you have the skills to write the tests, and to gain all of the advantages above.
Original article: Testing your machine learning (ML) pipelines