当前位置:网站首页>How to use DBA_ hist_ active_ sess_ History analysis database history performance problems

How to use DBA_ hist_ active_ sess_ History analysis database history performance problems

2022-04-23 06:58:00 Sebastien23

background

In many cases , When there is a performance problem with the database , We don't have a chance to gather enough diagnostic information , such as system state dump perhaps hang analyze, Even when the problem happens DBA Not at all . This makes it very difficult for us to diagnose the problem . So in this case , Can we collect some information afterwards to analyze the cause of the problem ? stay Oracle 10G Or later , The answer is yes . In this article, we will introduce a method through dba_hist_active_sess_history A method of analyzing problems based on data .

Apply to

Oracle 10G Or later , This article applies to any platform .

details

stay Oracle 10G in , We introduced AWR and ASH Sampling mechanism , There is a view gv$active_session_history Meeting Per second The clock will all nodes of the database Active Session Take a sample , and dba_hist_active_sess_history Will be gv$active_session_history The data in Every time 10 second Sample once and persist . Based on this feature , We can analyze dba_hist_active_sess_history Of Session Sampling , To locate the exact time frame in which the problem occurred , And the of each sampling point can be observed top event and top holder. Here is an example to illustrate in detail .

1. Dump During the problem ASH data

In order not to affect the production system , We can approximate the period of the problem ASH data export Come out and analyze on the tester .
be based on dba_hist_active_sess_history Create a new table t_ash, Then pass it through exp/imp Import to the tester . Execute... On the database where the problem occurred exp:

SQL> conn user/passwd
SQL> create table t_ash as select * from dba_hist_active_sess_history where SAMPLE_TIME between 
TO_TIMESTAMP ('<time_begin>', 'YYYY-MM-DD HH24:MI:SS') and 
TO_TIMESTAMP ('<time_end>', 'YYYY-MM-DD HH24:MI:SS');
$ exp user/passwd file=t_ash.dmp tables=(t_ash) log=t_ash.exp.log

Then import it into the tester :

$ imp user/passwd file=t_ash.dmp log=t_ash.imp.log

2. Verify exported ASH Time range

The proposal USES Oracle SQL Developer To prevent the output result from being broken into lines, which is not easy to observe .

set line 200 pages 1000
col sample_time for a25
col event for a40
alter session set nls_timestamp_format='yyyy-mm-dd hh24:mi:ss.ff';

select
 t.dbid, t.instance_number, min(sample_time), max(sample_time), count(*) session_count
  from t_ash t
 group by t.dbid, t.instance_number
 order by dbid, instance_number;

INSTANCE_NUMBER    MIN(SAMPLE_TIME)    MAX(SAMPLE_TIME)    SESSION_COUNT
1    2015-03-26 21:00:04.278    2015-03-26 22:59:48.387    2171
2    2015-03-26 21:02:12.047    2015-03-26 22:59:42.584    36

From the above output, we can see that the database has a total of 2 Nodes , The sampling time is... In total 2 Hours , node 1 Sampling ratio of nodes 2 A lot more , The problem may occur at the node 1 On .

3. Confirm the exact time range of the problem

Refer to the following script :

select
 dbid, instance_number, sample_id, sample_time, count(*) session_count
  from t_ash t
 group by dbid, instance_number, sample_id, sample_time
 order by dbid, instance_number, sample_time;

INSTANCE_NUMBER    SAMPLE_ID    SAMPLE_TIME    SESSION_COUNT
1    36402900    2015-03-26 22:02:50.985    4
1    36402910    2015-03-26 22:03:01.095    1
1    36402920    2015-03-26 22:03:11.195    1
1    36402930    2015-03-26 22:03:21.966    21
1    36402940    2015-03-26 22:03:32.116    102
1    36402950    2015-03-26 22:03:42.226    181
1    36402960    2015-03-26 22:03:52.326    200
1    36402970    2015-03-26 22:04:02.446    227
1    36402980    2015-03-26 22:04:12.566    242
1    36402990    2015-03-26 22:04:22.666    259
1    36403000    2015-03-26 22:04:32.846    289
1    36403010    2015-03-26 22:04:42.966    147
1    36403020    2015-03-26 22:04:53.076    2
1    36403030    2015-03-26 22:05:03.186    4
1    36403040    2015-03-26 22:05:13.296    1
1    36403050    2015-03-26 22:05:23.398    1

Observe the of each sampling point of the above output active session The number of , A sudden increase in number often means that a problem has occurred . From the above output, it can be determined that the exact time of the problem is 2015-03-26 22:03:21 ~ 22:04:42, The problem lasted about 1.5 minute .
Be careful : Observe whether the above output is out of gear , For example, there is no sampling at some time .

4. Determine the of each sampling point top n event

What we specify here is top 2 event And the sampling time is marked out to observe the situation of all sampling points . If there's a lot of data , You can also open sample_time To observe the situation in a certain period of time . Pay attention to the last column session_count It refers to the waiting time at the sampling point event Of session Number .

select t.dbid,
       t.sample_id,
       t.sample_time,
       t.instance_number,
       t.event,
       t.session_state,
       t.c session_count
  from (select t.*,
               rank() over(partition by dbid, instance_number, sample_time order by c desc) r
          from (select
                 t.*,
                 count(*) over(partition by dbid, instance_number, sample_time, event) c,
                 row_number() over(partition by dbid, instance_number, sample_time, event order by 1) r1
                  from t_ash t
                /*where sample_time >
                    to_timestamp('2013-11-17 13:59:00',
                                 'yyyy-mm-dd hh24:mi:ss')
                and sample_time <
                    to_timestamp('2013-11-17 14:10:00',
                                 'yyyy-mm-dd hh24:mi:ss')*/
                ) t
         where r1 = 1) t
 where r < 3
 order by dbid, instance_number, sample_time, r;

SAMPLE_ID    SAMPLE_TIME    INSTANCE_NUMBER    EVENT    SESSION_STATE    SESSION_COUNT
36402900    22:02:50.985    1        ON CPU    3
36402900    22:02:50.985    1    db file sequential read    WAITING    1
36402910    22:03:01.095    1        ON CPU    1
36402920    22:03:11.195    1    db file parallel read    WAITING    1
36402930    22:03:21.966    1    cursor: pin S wait on X    WAITING    11
36402930    22:03:21.966    1    latch: shared pool    WAITING    4
36402940    22:03:32.116    1    cursor: pin S wait on X    WAITING    83
36402940    22:03:32.116    1    SGA: allocation forcing component growth    WAITING    16
36402950    22:03:42.226    1    cursor: pin S wait on X    WAITING    161
36402950    22:03:42.226    1    SGA: allocation forcing component growth    WAITING    17
36402960    22:03:52.326    1    cursor: pin S wait on X    WAITING    177
36402960    22:03:52.326    1    SGA: allocation forcing component growth    WAITING    20
36402970    22:04:02.446    1    cursor: pin S wait on X    WAITING    204
36402970    22:04:02.446    1    SGA: allocation forcing component growth    WAITING    20
36402980    22:04:12.566    1    cursor: pin S wait on X    WAITING    219
36402980    22:04:12.566    1    SGA: allocation forcing component growth    WAITING    20
36402990    22:04:22.666    1    cursor: pin S wait on X    WAITING    236
36402990    22:04:22.666    1    SGA: allocation forcing component growth    WAITING    20
36403000    22:04:32.846    1    cursor: pin S wait on X    WAITING    265
36403000    22:04:32.846    1    SGA: allocation forcing component growth    WAITING    20
36403010    22:04:42.966    1    enq: US - contention    WAITING    69
36403010    22:04:42.966    1    latch: row cache objects    WAITING    56
36403020    22:04:53.076    1    db file scattered read    WAITING    1
36403020    22:04:53.076    1    db file sequential read    WAITING    1

From the above output, we can find that the most serious waiting during the problem is cursor: pin S wait on X, Wait for the rush hour event Of session The number has reached 265 individual , Next is SGA: allocation forcing component growth, Fastigium session by 20 individual .

Be careful :

  1. Confirm again whether the above output is out of gear , Is there some time when there is no sampling .
  2. Pay attention to those session_state by ON CPU Output , Compare ON CPU The number of processes is the same as your OS Physics CPU The number of , If it approaches or exceeds physical CPU Number , Then you need to check OS At that time CPU Resource status , such as OSWatcher/NMON Tools such as , high CPU Run Queue May cause the problem , Of course, it may also be the result of the problem , Need to combine OSWatcher and ASH Time order to verify .

5. Observe the waiting chain of each sampling point

The principle is through dba_hist_active_sess_history.blocking_session Records of the holder To pass connect by Cascade query , Find out the final holder. stay RAC Environment , For each node ASH The sampling time is not consistent in many cases , So you can use this SQL The second paragraph of the note sample_time Make a little modification to make the difference between different nodes 1 The sampling time of seconds can be compared ( Note that it is best to also partition by Medium sample_time Make corresponding changes ). In this output isleaf=1 All is final holder, and iscycle=1 Represents deadlock ( That is, in the same sampling point a etc. b,b etc. c, and c And so on a, If this continues to happen , Then it is particularly noteworthy ). Using the following query, we can observe the blocking chain .

select
 level                     lv,
 connect_by_isleaf         isleaf,
 connect_by_iscycle        iscycle,
 t.dbid,
 t.sample_id,
 t.sample_time,
 t.instance_number,
 t.session_id,
 t.sql_id,
 t.session_type,
 t.event,
 t.session_state,
 t.blocking_inst_id,
 t.blocking_session,
 t.blocking_session_status
  from t_ash t
/*where sample_time >
    to_timestamp('2013-11-17 13:55:00',
                 'yyyy-mm-dd hh24:mi:ss')
and sample_time <
    to_timestamp('2013-11-17 14:10:00',
                 'yyyy-mm-dd hh24:mi:ss')*/
 start with blocking_session is not null
connect by nocycle
 prior dbid = dbid
       and prior sample_time = sample_time
          /*and ((prior sample_time) - sample_time between interval '-1'
          second and interval '1' second)*/
       and prior blocking_inst_id = instance_number
       and prior blocking_session = session_id
       and prior blocking_session_serial# = session_serial#
 order siblings by dbid, sample_time;

LV    ISLEAF    ISCYCLE    SAMPLE_TIME    INSTANCE_NUMBER    SESSION_ID    SQL_ID    EVENT    SESSION_STATE    BLOCKING_INST_ID    BLOCKING_SESSION    BLOCKING_SESSION_STATUS
1    0    0    22:04:32.846    1    1259    3ajt2htrmb83y    cursor:    WAITING    1    537    VALID
2    1    0    22:04:32.846    1    537    3ajt2htrmb83y    SGA:    WAITING            UNKNOWN

Note that in order to make the output easy to read , We will wait for event Abbreviated , The same below . As can be seen from the output above , At the same sampling point (22:04:32.846), node 1 session 1259 Waiting for the cursor: pin S wait on X, It is called node 1 session 537 Blocking , And node 1 session 537 Waiting again SGA: allocation forcing component growth, also ASH No samples were collected holder, So here cursor: pin S wait on X It's just a superficial phenomenon , Here's the problem SGA: allocation forcing component growth.

6. Based on the first 5 Step principle to find out the final value of each sampling point top holder

Such as the following SQL Each sampling point is listed top 2 Of blocker session, The final blocking probability is calculated session Count ( Reference resources blocking_session_count).

select t.lv,
       t.iscycle,
       t.dbid,
       t.sample_id,
       t.sample_time,
       t.instance_number,
       t.session_id,
       t.sql_id,
       t.session_type,
       t.event,
       t.seq#,
       t.session_state,
       t.blocking_inst_id,
       t.blocking_session,
       t.blocking_session_status,
       t.c blocking_session_count
  from (select t.*,
               row_number() over(partition by dbid, instance_number, sample_time order by c desc) r
          from (select t.*,
                       count(*) over(partition by dbid, instance_number, sample_time, session_id) c,
                       row_number() over(partition by dbid, instance_number, sample_time, session_id order by 1) r1
                  from (select
                         level              lv,
                         connect_by_isleaf  isleaf,
                         connect_by_iscycle iscycle,
                         t.*
                          from t_ash t
                        /*where sample_time >
                            to_timestamp('2013-11-17 13:55:00',
                                         'yyyy-mm-dd hh24:mi:ss')
                        and sample_time <
                            to_timestamp('2013-11-17 14:10:00',
                                         'yyyy-mm-dd hh24:mi:ss')*/
                         start with blocking_session is not null
                        connect by nocycle
                         prior dbid = dbid
                               and prior sample_time = sample_time
                                  /*and ((prior sample_time) - sample_time between interval '-1'
                                  second and interval '1' second)*/
                               and prior blocking_inst_id = instance_number
                               and prior blocking_session = session_id
                               and prior
                                    blocking_session_serial# = session_serial#) t
                 where t.isleaf = 1) t
         where r1 = 1) t
 where r < 3
 order by dbid, sample_time, r;

SAMPLE_TIME    INSTANCE_NUMBER    SESSION_ID    SQL_ID    EVENT    SEQ#    SESSION_STATE    BLOCKING_SESSION_STATUS    BLOCKING_SESSION_COUNT
22:03:32.116    1    1136    1p4vyw2jan43d    SGA:    1140    WAITING    UNKNOWN    82
22:03:32.116    1    413    9g51p4bt1n7kz    SGA:    7646    WAITING    UNKNOWN    2
22:03:42.226    1    1136    1p4vyw2jan43d    SGA:    1645    WAITING    UNKNOWN    154
22:03:42.226    1    537    3ajt2htrmb83y    SGA:    48412    WAITING    UNKNOWN    4
22:03:52.326    1    1136    1p4vyw2jan43d    SGA:    2150    WAITING    UNKNOWN    165
22:03:52.326    1    537    3ajt2htrmb83y    SGA:    48917    WAITING    UNKNOWN    8
22:04:02.446    1    1136    1p4vyw2jan43d    SGA:    2656    WAITING    UNKNOWN    184
22:04:02.446    1    537    3ajt2htrmb83y    SGA:    49423    WAITING    UNKNOWN    10
22:04:12.566    1    1136    1p4vyw2jan43d    SGA:    3162    WAITING    UNKNOWN    187
22:04:12.566    1    2472        SGA:    1421    WAITING    UNKNOWN    15
22:04:22.666    1    1136    1p4vyw2jan43d    SGA:    3667    WAITING    UNKNOWN    193
22:04:22.666    1    2472        SGA:    1926    WAITING    UNKNOWN    25
22:04:32.846    1    1136    1p4vyw2jan43d    SGA:    4176    WAITING    UNKNOWN    196
22:04:32.846    1    2472        SGA:    2434    WAITING    UNKNOWN    48

Note the above output , Like the first line , On behalf of the 22:03:32.116, node 1 Of session 1136 It finally blocked 82 individual session. Look down time , Visible nodes 1 Of session 1136 It was the most serious problem during the problem holder, It's blocked at every sampling point 100 Multiple session, And it keeps waiting SGA: allocation forcing component growth, Pay attention to its seq# You will find that event Of seq# Changing , Indicates that the session Not completely hang live , Because the time happens at night 22:00 about , This is obviously due to the automatic collection of Statistics job Lead to shared memory resize cause , So we can combine Scheduler/MMAN Of trace as well as dba_hist_memory_resize_ops The output of further determines the problem .

Be careful :

  1. blocking_session_count Refers to a holder Finally blocked session Count , such as a ← \leftarrow b ← \leftarrow c (a By b Blocking ,b Has been c Blocking ), Only calculate c blocked 1 individual session, Because in the middle b Repetition may occur in different blocking chains .
  2. If the final holder Has not been ash sampling ( Usually because of holder At leisure ), such as a ← \leftarrow c also b ← \leftarrow c (a By c Blocking , also b Also by c Blocking ), however c No sampling , Then the above script cannot c Statistics to the end holder in , This may lead to some omissions .
  3. Attention comparison blocking_session_count The number of is the same as 3 The total number of each sampling point queried in this step session_count Count , If the of each sampling point blocking_session_count The number is far less than the total session_count Count , That means most session There is no record of holder, Therefore, the results of this query do not represent .
  4. stay Oracle 10g in ,ASH did not blocking_inst_id Column , In all the above scripts , You just need to remove the column , therefore 10g Of ASH Generally, it can only be used to diagnose the problem of single node .

Other things about ASH Application

In addition to ASH Data to find holder outside , We can also use it to get a lot of information ( Based on the database version ):

  • Such as through PGA_ALLOCATED Column to calculate the maximum of each sampling point PGA, total PGA To analyze ora-4030/Memory Swap Related issues ;
  • adopt TEMP_SPACE_ALLOCATED Column to analyze temporary tablespace usage ;
  • adopt IN_PARSE/IN_HARD_PARSE/IN_SQL_EXECUTION Column to analyze SQL be in parse Or the execution stage ;
  • adopt CURRENT_OBJ#/CURRENT_FILE#/CURRENT_BLOCK# To make sure I/O Related to the object waiting to happen .

References
[1] https://blogs.oracle.com/database4cn/dbahistactivesesshistory

版权声明
本文为[Sebastien23]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/04/202204230557416737.html