Test your machine learning pipeline
2022-04-23 08:36:00 | Li Guodong
When it comes to data products, there is a common misconception that they cannot be covered by automated testing. Although some parts of the pipeline cannot be verified by traditional testing methods because of their experimental and stochastic nature, most of the pipeline can. On top of that, the less predictable algorithms can be put through a dedicated validation process.
Let's take a look at traditional testing methods and how we can apply them to our data/ML pipelines.
The testing pyramid
The standard, simplified testing pyramid represents the types of tests you write for an application. We start with a lot of unit tests, which test a single function in isolation from other functions. Then we write integration tests, which check that the components we tested in isolation work together as expected. Finally, we write UI or acceptance tests, which check that the application works as expected from the user's perspective.
In terms of data products, the pyramid is not much different; we have more or less the same levels.
Note: the product will still have UI or acceptance tests, but this article focuses on the tests most relevant to the data pipeline.
With the help of some science-fiction authors, let's take a closer look at what each of these means in the context of machine learning.
Unit testing
Most of the code in a data pipeline consists of data-cleaning steps. Each function used for data cleaning has a clear goal. For example, suppose one of the features we choose for our model is the change in a value between the previous day and the current day. Our code might look like this:
def add_difference(asimov_dataset):
    asimov_dataset['total_naughty_robots_previous_day'] = (
        asimov_dataset['total_naughty_robots'].shift(1))
    asimov_dataset['change_in_naughty_robots'] = (
        abs(asimov_dataset['total_naughty_robots_previous_day'] -
            asimov_dataset['total_naughty_robots']))
    return asimov_dataset[['total_naughty_robots', 'change_in_naughty_robots',
                           'robot_takeover_type']]
Here, we know the output we expect for a given input, so we can test it with the following code:
import pandas as pd
from pandas.testing import assert_frame_equal
import numpy as np

def test_change():
    asimov_dataset_input = pd.DataFrame({
        'total_naughty_robots': [1, 4, 5, 3],
        'robot_takeover_type': ['A', 'B', np.nan, 'A']
    })
    expected = pd.DataFrame({
        'total_naughty_robots': [1, 4, 5, 3],
        'change_in_naughty_robots': [np.nan, 3, 1, 2],
        'robot_takeover_type': ['A', 'B', np.nan, 'A']
    })
    result = add_difference(asimov_dataset_input)
    assert_frame_equal(expected, result)
For each independent function, you write a unit test to make sure that each part of the data-transformation process has the expected effect on the data. For each function you should also consider the different scenarios (is there an if statement? then all conditions should be tested). These tests then run on every commit as part of your continuous integration (CI) pipeline.
Besides checking that the code does what is intended, unit tests also help us when debugging a problem. By adding a test that reproduces a newly discovered bug, we can ensure that the bug is actually fixed when we think we have fixed it, and that it does not reappear later.
Finally, these tests not only check that the code does what is intended, they also help us document the expectations we had when we created the functionality.
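For instance, here is a minimal sketch of an extra edge-case test for add_difference (the single-row scenario below is hypothetical, invented for illustration): a lone row has no previous day, so the change should come out as NaN rather than raise an error.
import pandas as pd
import numpy as np
from pandas.testing import assert_frame_equal

def test_change_single_row():
    # Edge case: with only one row there is no previous day,
    # so the change should be NaN instead of an error.
    asimov_dataset_input = pd.DataFrame({
        'total_naughty_robots': [5],
        'robot_takeover_type': ['A']
    })
    expected = pd.DataFrame({
        'total_naughty_robots': [5],
        'change_in_naughty_robots': [np.nan],
        'robot_takeover_type': ['A']
    })
    result = add_difference(asimov_dataset_input)
    assert_frame_equal(expected, result)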
Integration testing
These tests are designed to determine whether modules that were developed separately also work as expected when combined. In terms of a data pipeline, they can check that:
- The data cleaning process produces a dataset that is appropriate for the model
- The model training step can handle the data provided to it and outputs a result (ensuring that the code can be refactored in the future)
So if we take the unit-tested function above and add the following two functions:
def remove_nan_size(asimov_dataset):
    return asimov_dataset.dropna(subset=['robot_takeover_type'])

def clean_data(asimov_dataset):
    asimov_dataset_with_difference = add_difference(asimov_dataset)
    asimov_dataset_without_na = remove_nan_size(asimov_dataset_with_difference)
    return asimov_dataset_without_na
Then we can use the following code to test that the functions combined in clean_data produce the expected result:
def test_cleanup():
    asimov_dataset_input = pd.DataFrame({
        'total_naughty_robots': [1, 4, 5, 3],
        'robot_takeover_type': ['A', 'B', np.nan, 'A']
    })
    expected = pd.DataFrame({
        'total_naughty_robots': [1, 4, 3],
        'change_in_naughty_robots': [np.nan, 3, 2],
        'robot_takeover_type': ['A', 'B', 'A']
    }).reset_index(drop=True)
    result = clean_data(asimov_dataset_input).reset_index(drop=True)
    assert_frame_equal(expected, result)
Now suppose the next thing we do is feed the above data into a logistic regression model.
from sklearn.linear_model import LogisticRegression

def get_regression_training_score(asimov_dataset, seed=9787):
    clean_set = clean_data(asimov_dataset).dropna()
    input_features = clean_set[['total_naughty_robots',
                                'change_in_naughty_robots']]
    labels = clean_set['robot_takeover_type']
    model = LogisticRegression(random_state=seed).fit(input_features, labels)
    return model.score(input_features, labels) * 100
Although we don't know the expected score, we can make sure that we always end up with the same value. Testing this integration is useful to ensure that:
- The model can consume the data (there is a label for every input, the data types match the type of the chosen model, and so on)
- We are able to refactor our code in the future without breaking the end-to-end functionality
We can ensure that the results are always the same by providing the same seed to the random generator. All major libraries allow you to set the seed (TensorFlow is a bit special, as it requires you to set the seed via numpy, so keep this in mind). The test could look as follows:
from numpy.testing import assert_equal

def test_regression_score():
    asimov_dataset_input = pd.DataFrame({
        'total_naughty_robots': [1, 4, 5, 3, 6, 5],
        'robot_takeover_type': ['A', 'B', np.nan, 'A', 'D', 'D']
    })
    result = get_regression_training_score(asimov_dataset_input, seed=1234)
    expected = 50.0
    assert_equal(result, expected)
There won't be as many of these tests as unit tests, but they are still part of your CI pipeline. You use them to check the end-to-end functionality of a component, so they cover the more major scenarios.
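As a minimal sketch of how this can fit into CI (assuming pytest; the custom "integration" marker below is a hypothetical choice, not something from the original article), the slower integration tests can be marked so that they can be selected separately from the fast unit tests:
import pytest

# 'integration' is a custom marker; register it in pytest.ini (or
# pyproject.toml) under "markers" to avoid unknown-marker warnings.
@pytest.mark.integration
def test_full_pipeline():
    # Re-use the integration test defined above; the marker lets CI
    # select it:
    #   pytest -m integration          -> only integration tests
    #   pytest -m "not integration"    -> fast unit tests only
    test_regression_score()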
Machine learning validation
Now that we have tested our code, we also need to test whether the ML component solves the problem we are trying to solve. When we talk about product development, the raw results of an ML model (however accurate by statistical measures) are almost never the final output you need. These results are usually combined with other business rules before being consumed by a user or another application. Therefore, we need to validate that the model solves the user's problem, not just that the accuracy/f1-score/other statistical measure is high enough.
How does this help us?
- It ensures that the model actually helps the product solve the problem at hand.
- For example, a model that classifies snake bites as lethal or not is not a good model if it is wrong 20% of the time, leaving patients without the treatment they need.
- It ensures that the values produced by the model make sense in terms of the industry.
- For example, a model that predicts changes in price with 70% accuracy is not a good model if the final price shown to the user is so low/high that it makes no sense in that industry/market.
- It provides an extra layer of documentation of the decisions made, helping engineers who join the team later in the process.
- It provides visibility of the ML components of the product in a common language that clients, product managers and engineers all understand in the same way.
This validation should run periodically (through the CI pipeline or a cron job), and its results should be made visible to the organisation. This ensures that the organisation can see the progress of the data science components and that problems caused by changed or stale data are caught early.
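To make this concrete, here is a minimal sketch of such a business-rule validation (the model interface, the price bounds and the helper name are all hypothetical, invented for illustration):
def validate_price_predictions(model, validation_features,
                               min_price=1.0, max_price=500.0):
    # Business-rule check: every predicted price must fall inside the
    # range that makes sense for this market (bounds are hypothetical).
    predictions = model.predict(validation_features)
    out_of_range = [p for p in predictions
                    if not min_price <= p <= max_price]
    # Fail the validation run if any prediction violates the rule, so
    # the problem surfaces long before a user sees a nonsensical price.
    assert not out_of_range, (
        f"{len(out_of_range)} predictions fall outside "
        f"[{min_price}, {max_price}]")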
Summary
ML components can be tested in many ways, which brings us the following advantages:
- A data-driven approach to ensuring that the code does what is expected
- The confidence that we can refactor and clean up the code without breaking the product's functionality
- Documentation of functionality, decisions and previous bugs
- Visibility of the progress and state of the ML components of the product
So don't be scared: if you have the skills to write the code, you have the skills to write the tests, and to gain all of the advantages above.
Original article: Testing your machine learning (ML) pipelines