Compact CUDA tutorial - CUDA driver API
2022-04-23 19:42:00 【Adenialzz】
Driver API overview

The levels of the CUDA API

The CUDA API comes in multiple levels (illustrated in the figure referenced in: CUDA Environment Details).

- The CUDA Driver API is the low-level API through which CUDA talks to the GPU driver. In the early days, CUDA communicated with the GPU directly through the Driver API: early .cu code is dominated by functions such as cuCtxCreate(), and the familiar nvidia-smi command is also built on the Driver API.
- The Driver API was later found to be verbose and full of complex details, so the Runtime API evolved on top of it. Common calls such as cudaMalloc() are all Runtime API.
CUDA Driver
Environment notes

The CUDA Driver is released together with the graphics-card driver and should be considered separately from the cudatoolkit. The CUDA Driver corresponds to two files: cuda.h and libcuda.so. Note that cuda.h is installed as part of the cudatoolkit, but libcuda.so is installed with the system's graphics driver (it does not come with the cudatoolkit). So if you copy or move libcuda.so directly, make sure the driver version matches the file.
How to understand the CUDA Driver

This concise course covers the low-level Driver API in order to support later study of the Runtime API and error debugging: the Driver API is the key to understanding the CUDA Runtime context. The main knowledge points of the CUDA Driver covered here are:

- The context management mechanism
- The development conventions of the CUDA API family (the error-checking style)
- The memory model
Contexts and memory categories

There are two kinds of context:

- Manually managed context: cuCtxCreate, managed by hand with push/pop on a stack
- Automatically managed context: cuDevicePrimaryCtxRetain, managed automatically; the Runtime API is built on this

There are two broad categories of memory:

- CPU memory, called host memory
  - Pageable memory
  - Page-locked memory
- GPU memory (video memory), called device memory
  - Global memory
  - Shared memory
  - … and others
All of the above is introduced in the sections below.
cuInit: driver initialization

cuInit initializes the Driver API. It only needs to be executed once, globally; if it is not called, every subsequent API call returns an error.

- There is no corresponding cuDestroy, and nothing needs to be released manually: everything is released automatically when the program exits.
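As a minimal sketch (this assumes a machine with an NVIDIA driver installed, and the program linked against the driver library with -lcuda; it cannot run without that hardware), driver initialization looks like this:

```cpp
#include <cuda.h>   // Driver API header, installed with the cudatoolkit
#include <cstdio>

int main() {
    // Must be called once, globally, before any other Driver API call.
    CUresult code = cuInit(0);   // the flags argument must currently be 0
    if (code != CUDA_SUCCESS) {
        printf("cuInit failed, code = %d\n", (int)code);
        return -1;
    }

    // After cuInit, other Driver API calls work, e.g. querying the driver version.
    int driver_version = 0;
    cuDriverGetVersion(&driver_version);
    printf("driver version: %d\n", driver_version);
    return 0;
}
```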
Return-value checking

Version 1

Checking the return value of CUDA functions correctly and cleanly helps the organization of the program, makes the code more readable, and makes mistakes easier to find. We know that cuInit returns a CUresult; the return value tells the programmer whether the call succeeded or failed, and why it failed.
The logic of the official version of the check is as follows:

// A parameterized macro that checks whether a CUDA Driver call succeeded,
// and on failure reports the source file, line number, and error message.
// Wrapping the body in do...while(0) keeps the macro safe as a single statement.
#define checkDriver(op)                                               \
    do {                                                              \
        auto code = (op);                                             \
        if (code != CUresult::CUDA_SUCCESS) {                         \
            const char* err_name = nullptr;                           \
            const char* err_message = nullptr;                        \
            cuGetErrorName(code, &err_name);                          \
            cuGetErrorString(code, &err_message);                     \
            printf("%s:%d  %s failed. \n  code = %s, message = %s\n", \
                   __FILE__, __LINE__, #op, err_name, err_message);   \
            return -1;                                                \
        }                                                             \
    } while (0)
This macro wraps any other API call, checks its return value, and prints the error code and message on failure, which is convenient for debugging. For example:

checkDriver(cuDeviceGetName(device_name, sizeof(device_name), device));

If there is an error such as the driver not being initialized, a clear message is printed. This is also Nvidia's official version, but it has some problems: the readability of the code is poor, and it returns an int error code directly from the enclosing function, among other issues. Version 2 is recommended.
Version 2

// Clearly, this kind of encapsulation is more convenient to use
// Macro definition: #define <name>(<parameter list>) <body>
#define checkDriver(op) __check_cuda_driver((op), #op, __FILE__, __LINE__)
bool __check_cuda_driver(CUresult code, const char* op, const char* file, int line){
if(code != CUresult::CUDA_SUCCESS){
const char* err_name = nullptr;
const char* err_message = nullptr;
cuGetErrorName(code, &err_name);
cuGetErrorString(code, &err_message);
printf("%s:%d %s failed. \n code = %s, message = %s\n", file, line, op, err_name, err_message);
return false;
}
return true;
}
Clearly, version 2 is much better than version 1 in its return value, code readability, and encapsulation. It is used the same way:

checkDriver(cuDeviceGetName(device_name, sizeof(device_name), device));

// Or add a check and exit on error
if (!checkDriver(cuDeviceGetName(device_name, sizeof(device_name), device))) {
    return -1;
}
CUcontext
Manual context management
- A context associates all operations performed on a GPU.
- One context is associated with one graphics card, but one card can be associated with multiple contexts.
- Each thread keeps a stack of contexts; the context on top of the stack is the current one. push/pop functions operate on this context stack, and all API calls operate on the current context.

Imagine if every operation required passing a device argument to decide which device executes it: that would be tedious. The context is the mechanism that manages which device the current API call runs on, and the stack structure saves the previous context/device, making it convenient to control multiple devices.
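The manual style can be sketched as follows (a sketch only: it assumes cuInit(0) has already succeeded, requires an NVIDIA driver to run, and omits checkDriver wrapping for brevity):

```cpp
#include <cuda.h>

CUdevice device = 0;
CUcontext ctxA = nullptr, ctxB = nullptr;
cuDeviceGet(&device, 0);          // device id 0
cuCtxCreate(&ctxA, 0, device);    // creating also pushes ctxA onto this thread's stack
cuCtxCreate(&ctxB, 0, device);    // ctxB is now on top: the "current" context

CUcontext current = nullptr;
cuCtxGetCurrent(&current);        // current == ctxB

cuCtxPopCurrent(&current);        // pop ctxB; ctxA becomes current again
cuCtxDestroy(ctxB);
cuCtxDestroy(ctxA);               // manual management: we must destroy what we created
```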
Automatic context management
- In the common high-frequency pattern, a thread accesses one fixed device; it is rare for the same thread to access several different devices repeatedly, so usually only one context is needed and multiple contexts are seldom used.
- In other words, in most cases the CreateContext / PushCurrent / PopCurrent style of context management is more trouble than it is worth.
- Hence cuDevicePrimaryCtxRetain, which associates a primary context with a device. Allocation, setup, release, and the stack no longer need manual management; it is a form of automatic context management. Given a device id, it hands back a context and sets it up. One device corresponds to one primary context; across different threads, as long as the device id is the same, the primary context is the same, and the context is thread-safe.
- The CUDA Runtime API, introduced later, uses cuDevicePrimaryCtxRetain automatically.
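The automatic style can be sketched like this (again a sketch: it assumes cuInit(0) already succeeded and needs an NVIDIA driver to run):

```cpp
#include <cuda.h>

CUdevice device = 0;
cuDeviceGet(&device, 0);

CUcontext primary = nullptr;
cuDevicePrimaryCtxRetain(&primary, device);  // get (or create) the device's primary context
cuCtxSetCurrent(primary);                    // make it current; no push/pop bookkeeping

// ... Driver or Runtime API work here; any thread that retains the same
// device id gets this same primary context ...

cuDevicePrimaryCtxRelease(device);           // drop the reference when done
```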
Driver API memory management

- Host memory is the computer's own RAM. It can be allocated and freed with the CUDA Driver API, or with C/C++'s malloc/free and new/delete.
- Device memory is the memory on the graphics card, i.e. video memory; dedicated Driver API functions allocate and free it.
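A short sketch of both kinds of allocation and a host-to-device copy (assumes cuInit(0) and a current context are already set up, and requires an NVIDIA driver; error checking omitted for brevity):

```cpp
#include <cuda.h>

size_t bytes = 100 * sizeof(float);

float* host_ptr = nullptr;
cuMemAllocHost((void**)&host_ptr, bytes);   // page-locked host memory
// (pageable host memory would simply be malloc or new)

CUdeviceptr device_ptr = 0;
cuMemAlloc(&device_ptr, bytes);             // global memory on the GPU

host_ptr[0] = 3.14f;
cuMemcpyHtoD(device_ptr, host_ptr, bytes);  // host -> device copy

cuMemFree(device_ptr);                      // Driver API allocations need Driver API frees
cuMemFreeHost(host_ptr);
```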
Copyright notice
This article was created by [Adenialzz]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/04/202204231923163837.html