Scrapy: change the times in the end-of-run statistics to the current system time
2022-04-23 07:47:00 【Brother Bing】
1. Problem background
At the end of each run, Scrapy prints a block of statistics that includes timing data. However, those timestamps are in UTC (time zone 0), not the local system time we are used to, and the crawler's total running time is reported as a plain number of seconds, which is not how we normally read durations. So I dug into the Scrapy source, found the relevant code, and rewrote it. It works well enough, so feel free to take it and use it.
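As a quick illustration (my own example, not from the original post): the `finish_time` in the dump below is a naive datetime that is actually UTC, while the log line that prints it is stamped in local time; converting it by hand shows the 8-hour offset on a UTC+8 machine.

```python
from datetime import datetime, timezone

# 'finish_time' value as it appears in the stats dump below (naive, actually UTC)
utc_finish = datetime(2021, 5, 10, 2, 44, 10, 418573)

# Attach the UTC zone and convert to the local zone of the machine running this
local_finish = utc_finish.replace(tzinfo=timezone.utc).astimezone()
print(local_finish)  # e.g. 2021-05-10 10:44:10.418573+08:00 on a UTC+8 machine
```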
2. Problem analysis
From the log output, the class that records the crawler's running time is `scrapy.extensions.corestats.CoreStats`.
- The relevant log output looks like this:

```text
# Extension configuration
2021-05-10 10:43:50 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',   # the signal/stats collector that records the crawler's running time
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']

# Statistics
2021-05-10 10:44:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 2,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 1,
 'downloader/request_bytes': 1348,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 4,
 'downloader/response_bytes': 10256,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 18.806005,                                  # total running time, in seconds
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 5, 10, 2, 44, 10, 418573),   # crawler finish time (UTC)
 'httpcompression/response_bytes': 51138,
 'httpcompression/response_count': 1,
 'log_count/INFO': 10,
 'response_received_count': 1,
 'scheduler/dequeued': 4,
 'scheduler/dequeued/memory': 4,
 'scheduler/enqueued': 4,
 'scheduler/enqueued/memory': 4,
 'start_time': datetime.datetime(2021, 5, 10, 2, 43, 51, 612568)}    # crawler start time (UTC)
2021-05-10 10:44:10 [scrapy.core.engine] INFO: Spider closed (finished)
```
- The relevant part of the source code (shown in the original post as a screenshot) confirms where the UTC values come from.
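The screenshot is not reproduced here; the following is a sketch of the timing-related part of `CoreStats` as it appears in Scrapy 2.x-era source (abbreviated, and details may differ slightly between versions). The key point is that both timestamps come from `datetime.utcnow()`:

```python
# Sketch of scrapy/extensions/corestats.py (Scrapy 2.x era, abbreviated)
from datetime import datetime


class CoreStats:

    def __init__(self, stats):
        self.stats = stats
        self.start_time = None

    def spider_opened(self, spider):
        # UTC, which is why 'start_time' in the dump is 8 hours behind local time here
        self.start_time = datetime.utcnow()
        self.stats.set_value('start_time', self.start_time, spider=spider)

    def spider_closed(self, spider, reason):
        finish_time = datetime.utcnow()
        elapsed_time = finish_time - self.start_time
        self.stats.set_value('elapsed_time_seconds', elapsed_time.total_seconds(), spider=spider)
        self.stats.set_value('finish_time', finish_time, spider=spider)
        self.stats.set_value('finish_reason', reason, spider=spider)

    # ... item_scraped / item_dropped / response_received counters omitted ...
```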
3. Solution
- Rewrite the `CoreStats` class:

```python
# -*- coding: utf-8 -*-
# Rewritten signal collector
import time

from scrapy.extensions.corestats import CoreStats


class MyCoreStats(CoreStats):

    def spider_opened(self, spider):
        """Called when the crawler starts running."""
        self.start_time = time.time()
        # Format the start time as a local-time string
        start_time_str = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(self.start_time))
        self.stats.set_value('crawler start time', start_time_str, spider=spider)

    def spider_closed(self, spider, reason):
        """Called when the crawler finishes running."""
        # Crawler finish time
        finish_time = time.time()
        # Format the finish time as a local-time string
        finish_time_str = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(finish_time))
        # Total running time of the crawler
        elapsed_time = finish_time - self.start_time
        m, s = divmod(elapsed_time, 60)
        h, m = divmod(m, 60)
        self.stats.set_value('crawler finish time', finish_time_str, spider=spider)
        self.stats.set_value('crawler total running time', '%d h:%02d min:%02d s' % (h, m, s), spider=spider)
        self.stats.set_value('crawler finish reason', reason, spider=spider)
```
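Note that `MyCoreStats` overrides `spider_opened` and `spider_closed` completely, so the standard `start_time`, `finish_time` and `elapsed_time_seconds` keys will no longer appear in the dump. If you want to keep them alongside the human-readable values, one possible variant (my own sketch, not from the original post, with an illustrative class name) is to delegate to the parent class first:

```python
# Optional variant: keep Scrapy's built-in UTC stats *and* add local-time values
import time

from scrapy.extensions.corestats import CoreStats


class MyCoreStatsKeepBoth(CoreStats):

    def spider_opened(self, spider):
        super().spider_opened(spider)  # still records the standard UTC 'start_time'
        self.stats.set_value(
            'crawler start time',
            time.strftime('%Y-%m-%d %H:%M:%S', time.localtime()),
            spider=spider,
        )

    def spider_closed(self, spider, reason):
        # Records 'finish_time', 'elapsed_time_seconds' and 'finish_reason' as usual
        super().spider_closed(spider, reason)
        elapsed = self.stats.get_value('elapsed_time_seconds', 0, spider=spider)
        m, s = divmod(int(elapsed), 60)
        h, m = divmod(m, 60)
        self.stats.set_value('crawler finish time',
                             time.strftime('%Y-%m-%d %H:%M:%S', time.localtime()),
                             spider=spider)
        self.stats.set_value('crawler total running time',
                             '%d h:%02d min:%02d s' % (h, m, s),
                             spider=spider)
```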
- Enable it in the project settings:

```python
# settings.py
EXTENSIONS = {
    'scrapy.extensions.corestats.CoreStats': None,               # disable the default stats collector
    'your_project_name.extensions.corestats.MyCoreStats': 500,   # enable the custom signal collector
}
```
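For reference, the dotted path above assumes a layout roughly like the following inside the project package (the names are illustrative; adjust the path to wherever you actually put `MyCoreStats`):

```text
your_project_name/
├── __init__.py
├── settings.py
├── spiders/
└── extensions/
    ├── __init__.py
    └── corestats.py      # defines MyCoreStats
```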
4. Result
```text
2021-05-10 11:11:03 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 5,
 'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 3,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,
 'downloader/request_bytes': 1976,
 'downloader/request_count': 6,
 'downloader/request_method_count/GET': 6,
 'downloader/response_bytes': 10266,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'httpcompression/response_bytes': 51139,
 'httpcompression/response_count': 1,
 'log_count/INFO': 10,
 'response_received_count': 1,
 'scheduler/dequeued': 6,
 'scheduler/dequeued/memory': 6,
 'scheduler/enqueued': 6,
 'scheduler/enqueued/memory': 6,
 'crawler finish reason': 'finished',
 'crawler start time': '2021-05-10 11:10:39',
 'crawler finish time': '2021-05-10 11:11:03',
 'crawler total running time': '0 h:00 min:24 s'}
```
Copyright notice: this article was written by [Brother Bing]. Please include a link to the original when reposting: https://yzsam.com/2022/04/202204230625585327.html