PremiershipPlayerAnalysis

Using Python to scrape some basic player information from www.premierleague.com and then using Pandas to analyse that data. Note: my understanding is that the squad data on this site can change at any time, so your results may differ.

Possible improvement: calculate age to a finer degree than whole years (see the sketch after the age calculation below).

This was developed in a Jupyter Notebook, and this walkthrough will assume you are doing the same.
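
The snippets that follow use pandas, NumPy and Matplotlib but never import them explicitly, so a setup cell along these lines is assumed (the scraping itself, which produces playersList, is not shown here):

    # Imports assumed throughout the walkthrough
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt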

Once you have run the scraping:

    original = pd.DataFrame(playersList)       # Convert the scraped data into a Pandas DataFrame
    original.to_csv('premiershipplayers.csv')  # Keep a backup of the data to save time later if required
    df2 = original.copy()                      # Working copy of the DataFrame (just in case)

    df2.info()


   
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 578 entries, 0 to 577
    Data columns (total 11 columns):
     #   Column       Non-Null Count  Dtype
    ---  ------       --------------  -----
     0   club         578 non-null    object
     1   name         578 non-null    object
     2   shirtNo      572 non-null    object
     3   nationality  562 non-null    object
     4   dob          562 non-null    object
     5   height       500 non-null    object
     6   weight       474 non-null    object
     7   appearances  578 non-null    object
     8   goals        578 non-null    object
     9   wins         578 non-null    object
     10  losses       578 non-null    object
    dtypes: object(11)
    memory usage: 49.8+ KB

   

A total of 578 players:

6 without shirt number

16 without nationality listed

16 without dob listed

78 without height listed

104 without weight listed
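
These figures come from the info() output above; they can also be listed directly (a minimal sketch, assuming the missing entries were stored as None/NaN when scraped, which the non-null counts suggest):

    # Missing values per column (578 rows minus the non-null counts shown by info())
    df2.isna().sum()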

Cleanup Data

  1. Remove spaces and newlines from the dob, appearances, goals, wins and losses columns

  2. Change the type of dob to date

  3. Change the type of appearances, goals, wins and losses to int

  4. Strip the trailing 'cm' from height and change its type to numeric

  5. Create an age column from dob

     df2['dob'] = df2['dob'].str.replace('\n','').str.strip(' ')
     df2['appearances'] = df2['appearances'].str.replace('\n','').str.strip(' ')
     df2['goals'] = df2['goals'].str.replace('\n','').str.strip(' ')
     df2['wins'] = df2['wins'].str.replace('\n','').str.strip(' ')
     df2['losses'] = df2['losses'].str.replace('\n','').str.strip(' ')
    
     # Change the type of dob, appearances, goals, wins, losses (and strip the 'cm' suffix from height)
     import numpy as np   # np.nan is used in the age function below
     from datetime import date
    
     df2['dob'] = pd.to_datetime(df2['dob'],format='%d/%m/%Y').dt.date
     df2["appearances"] = pd.to_numeric(df2["appearances"])
     df2["goals"] = pd.to_numeric(df2["goals"])
     df2["wins"] = pd.to_numeric(df2["wins"])
     df2["losses"] = pd.to_numeric(df2["losses"])
     df2['height'] = df2['height'].str[:-2]
     df2["height"] = pd.to_numeric(df2["height"])
     
     
     # Create age column
    
     today = date.today()
    
     def age(born):
         if born:
             return today.year - born.year - ((today.month, 
                                           today.day) < (born.month, 
                                                         born.day))
         else:
             return np.nan
    
     df2['age'] = df2['dob'].apply(age)
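
As flagged in the overview, age could be calculated to a finer degree than whole years. A minimal sketch of one way to do that, reusing the cleaned dob column and the today value defined above (age_fractional and age_years are hypothetical names, not part of the original notebook):

    # Hypothetical refinement: age in fractional years rather than whole years
    def age_fractional(born):
        if pd.isna(born):
            return np.nan
        return (today - born).days / 365.25   # average year length

    df2['age_years'] = df2['dob'].apply(age_fractional)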
    

10 Oldest Players

    df2.sort_values('age',ascending=False).head(10)

image

10 Youngest Players

    df2.sort_values('age',ascending=True).head(10)

image

Squad Sizes

    df2.groupby(['club'])['club'].count().sort_values(ascending=False)

image

Team's Average Player Age

    plt.ylim([20, 30])
    df2.groupby(['club'])['age'].mean().sort_values(ascending=False).plot.bar()

image

Burnley appear not only to have one of the highest average player ages but also the lowest number of registered players.
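
A quick way to check that observation in a single table is to combine the two groupbys used above (a minimal sketch):

    # Squad size and mean age side by side, sorted by mean age
    summary = df2.groupby('club').agg(squad_size=('name', 'count'), mean_age=('age', 'mean'))
    summary.sort_values('mean_age', ascending=False)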

Top 10 Premiership Appearances

    df2.sort_values('appearances',ascending=False).head(10)

image

Collective Premiership Appearances per Club

    df2.groupby(['club'])['appearances'].sum().sort_values(ascending=False)

image

    df2.groupby(['club'])['appearances'].sum().sort_values(ascending=False).plot.bar()

image

10 Tallest Players

    df2.sort_values('height',ascending=False).head(10)

image

10 Shortest Players

    df2.sort_values('height',ascending=True).head(10)

image

Nationality Totals of Players

    pd.set_option('display.max_rows', 100)
    df2.groupby(['nationality'])['club'].count().sort_values(ascending=False)

Nationality Totals per Club

    pd.set_option('display.max_rows', 500)
    df2.groupby(['club','nationality'])['nationality'].count()
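
The multi-index output above can be hard to scan; as an alternative presentation, a crosstab gives one row per club with a column per nationality (a minimal sketch):

    # Club x nationality counts as a table (0 where a club has no players of that nationality)
    pd.crosstab(df2['club'], df2['nationality'])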