当前位置:网站首页>hyperscan --- 1
hyperscan --- 1
2022-04-23 02:03:00 【The water cup likes drinking water】
Introduce
Hyperscan Is a regular expression engine , Designed to provide high performance 、 The ability to match multiple expressions at the same time and the flexibility of scanning operations .hyperscan The working steps include compile mode and scan mode .
hyperscan Basic information
Official website :Home Page - Hyperscan.io
User's Manual : Preface — Hyperscan 5.4.0 documentation
compile
The schema string is compiled to generate a database of immutable schemas , The scan interface can then be used to scan the target data buffer of a given mode , Return any matching results from the data buffer .Hyperscan A streaming mode is also provided , In this pattern, matching across multiple blocks in the flow is detected
The binary mechanism that compiles a set of regular expressions is called pattern database. We can go through the following 3 individual API, To perform compilation operations :
- hs_compile(): Compile a regular expression into a pattern database
- hs_compile_multi(): Compile a set of regular expressions into a pattern database.
- hs_compile_ext_multi(): Basic and 2 equally , But some more parameters can be set , Basic is used to help us limit some inspection accessories , So that the matching process ends faster , We'll follow up with .
When compiling regular expressions , We need to decide , What kind of method do we use for subsequent matching ?hyperscan Allow us to use the following three ways :
- Stream Pattern : Stream pattern matching , That is, the data to be matched is not a continuous piece , such as tcp stream, The data that our regularization may match is cross block .
- Block Pattern : The data to be matched is clear , In this block .
- Vector Pattern : stay Stream and Block The pattern between , Data in a specified series Block in .
Which mode to use , We need to weigh , Simply from the performance comparison ,BLock The model must be the best , Because than Stream Patterns are , Less tracking of internal state across blocks . but Stream It can solve the problem of too much subcontracting Block What happened here .
Supported regularities
hyperscan Support for regularity is better than PCRE Less , But for most of us, it's enough , We can take a simple look at :
- Match of all string characters and character escape
- Character class (charactor class) . (dot), [abc], and [^abc], as well as \s, \d, \w, \v, And its negative form (\S, \D, \W, \V, and \H).
- Quantifier form ?,*,+,{n}, {m,n}, {n,}
- Multiple choice mode foo|bar
- anchor ^, $, \A, \Z and \z.
- Options i,m,s,x etc.
About the semantics of matching Semantics
The grammar is and PCRE same , But semantics Semantics Is that standard regular expressions are different . There are several different places
- Multimode matching :hyperscan The match is Match more than one at a time Regular expressions , This sum PCRE Inside expression 1| expression 2| expression 3 In this way, matching from left to right is different
- disorder : Because it is multimode matching , So multiple expressions are matched together , There is no obvious order , Although it is basically based on who gets to the matching boundary first and who ends first
- By default, only matching tails are returned offset: By default , On the way back , Just know the matching end Of offset, If you want to know begin offset. Set when compilation is required flag(HS_FLAG_SOM_LEFTMOST), But it affects performance
- Full matching : For instance from fooxyzbarbar Inside matching /foo.*bar/, Default PCRE Using greedy matching , Only match fooxyzbarbar, but hyperscan Will match fooxyzbar and fooxyzbarbar Two .
stay stream In mode , image PCRE Semantically, it is unlikely to support the longest match . Let's take the example above , Regular is /foo.*bar/, If the data is divided into the following 3 individual block Come on :
block 1 | block 2 | block 3
fooxyzbar | baz |qbar
Stream The pattern matches up to the first block It's over , because block 2 It doesn't match , Considering efficiency ,hyperscan Will not wait indefinitely, whether there is a bar 了 . otherwise , If the first 500 individual block There is a bar, What should I do ?
SOM
SOM yes Start of Match. Indicates where to start matching , We've also introduced , By default hyperscan Record only end of Match, Don't record Start of Match. If you have to record SOM, We can
Set at compile time HS_FLAG_SOM_LEFTMOST This flag. But after setting, there are the following disadvantages :
- Reduce hyperscan Supported regular styles . It may appear when compiling 『Pattern too large』 Such a mistake
- increase Stream The state of the pattern , It's easy to understand , There are more records
- Performance issues
- And others flag Are not compatible . such as HS_FLAG_SINGLEMATCH and HS_FLAG_PREFILTER.
Besides , Consider performance , We are using SOM When , You can set a threshold , namely start offset and end offset Difference between , It's too long to end , Just start over , To reduce excessive internal state tracking
Extension parameters
We are hs_compile_ext_multi The extension parameters mentioned in ,
- flags: flags Mark
- min_offset: Used to identify the smallest match offset
- max_offset: Used to identify the maximum match offset
- min_length: How long is the minimum match
- edit_distance: Levenshtein The distance parameter , For fuzzy matching .
For example, regular expressions /foo.*bar/, If specified min_offset yes 10,max_offset yes 15 Words ,foobar and foo0123456789bar It won't match , and foo0123bar and foo0123456bar It will match .
If edit_distance yes 2 Words ,foobar, fooba, fobr, fo_baz, foooobar and /foobar/ Will match
Scan matching
hyperscan For the top 3 Patterns , There are also different scan function , It's all about hs_scan start .
Once a regular match is made , Will call back the following function
typedef (* match_event_handler)(unsigned int id, unsigned long long from, unsigned long long to, unsigned int flags, void *context)
id: Regular expression identification number specified at compile time
from: The start position of the match is offset
to: The end position of the match is offset
flags: Reserved identification
context: User defined parameters
Return value : The return of this function is not 0 Stop matching if , Otherwise, continue to match , Until you find a satisfactory .
Stream Pattern
stay stream The following functions will be called in the mode
- hs_open_stream()
- hs_scan_stream()
- hs_close_stream()
After matching regular , Callbacks callback,callback If the function returns non 0, Then terminate the matching process . Although in callback It stopped , But actually stream The state machine still stays in one state .
If you continue to call hs_scan_stream, Will return immediately HS_SCAN_TERMINATED. Finally, the caller needs to call a hs_close_stream, To release resources .
Again , because Stream The internal status of the cross block scan needs to be recorded , So there will be some performance loss .
hyperscan It's also a match from the front to the back , When you meet $ \b Wait for the character , Not at the moment stream A callback occurs , Only after receiving the next stream perhaps stream The callback can only be determined when it is closed
Stream Management of
In addition to the above mentioned stream Operation function of , There are the following API But the use of
- hs_reset_stream(): resets a stream to its initial state; this is equivalent to calling hs_close_stream() but will not free the memory used for stream state.
- hs_copy_stream(): constructs a (newly allocated) duplicate of a stream.
- hs_reset_and_copy_stream(): constructs a duplicate of a stream into another, resetting the destination stream first. This call avoids the allocation done by hs_copy_stream().
Besides stream It also supports how to stream The content is compressed to do scan operation
- hs_compress_stream()
- hs_expand_stream()
- hs_reset_and_expand_stream()
Block Pattern
Block The model is very simple , call hs_scan() that will do .
Vector Pattern
Use API hs_scan_vector() from block Regular matching of arrays , From the point of view of the caller , from block list in scan and a) Put these block as stream Several writes or b) Put these block list adopt memcpy Spell it into a big block, The matching effect is the same . But we still need to decide which mode to use according to the actual situation
Scratch Space
As a regular scan , Need some extra memory , To save runtime internal data . If the application on the stack is relatively large , Run time temporary requests affect performance , So we need to apply for these spaces in advance , Our name is
Scratch Space .
We use hs_alloc_scratch() To apply for Scratch Space , Appoint pattern database that will do , We don't need to know the specific space size . If we have more than one database Words , Although we will be right
Every database call hs_alloc_scratch, But in fact, only one copy of the most appropriate size is generated Scratch Space .
- If it's recursion scan, For example, do another callback when calling back scan, It's time to 2 individual Scratch Space
- Without recursion ,Scratch The space should be per-thread Of
- If it's one to write and read more , We recommend using hs_clone_scratch() To replace many times hs_alloc_scratch()
Reference resources
版权声明
本文为[The water cup likes drinking water]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/04/202204220842105016.html
边栏推荐
- How to change the size of SVG pictures without width in openlayer
- Why is one plus one equal to two
- 不断下沉的咖啡业,是虚假的繁荣还是破局的前夜?
- 2022.4.22-----leetcode.396
- What problems will you encounter when dialing VPS?
- 89 logistic回归用户画像用户响应度预测
- Echo "new password" |passwd -- stdin user name
- 动态代理ip的测试步骤有哪些?
- Network jitter tool clumsy
- How to set computer IP?
猜你喜欢
![NPM yarn startup error [resolved]](/img/8a/8351c33e30da8e08ac0dcb7c7c868e.png)
NPM yarn startup error [resolved]

RuntimeError: The size of tensor a (4) must match the size of tensor b (3) at non-singleton dimensio

Chinese scientists reveal a new mechanism for breaking through the bottleneck of rice yield

MySQL basic record

006_redis_jedis快速入门

Basic knowledge of software testing, you can meet the interviewer after reading it

【汇编语言】从最底层的角度理解“堆栈”

教程】如何用GCC“零汇编”白嫖MDK

Challenges often faced by client project management

拨号服务器是什么,有什么用处?
随机推荐
Longest common subsequence (record path version)
C语言实现Base64编码/解码
009_Redis_RedisTemplate入门
How does Axure set the content of the text box to the current date when the page is loaded
在使用代理IP前需要了解哪些分类?
代理IP可用率是不是等同于代理IP的效率?
Error in face detection and signature of Tencent cloud interface
Performance introduction of the first new version of cdr2022
什么是api接口?
2022.4.10-----leetcode. eight hundred and four
【汇编语言】从最底层的角度理解“堆栈”
Esp32 message queue using FreeRTOS
关于局域网浅谈
2022.4.20-----leetcode.388
89 régression logistique prédiction de la réponse de l'utilisateur à l'image de l'utilisateur
我国科学家揭示突破水稻产量瓶颈新机制
PHP & laravel & master several ways of generating token by API and some precautions (PIT)
BGP服务器在什么业务场景会被用到?
一加一为什么等于二
2018 China Collegiate Programming Contest - Guilin Site J. stone game