bigdata_analyse 大数据分析项目

Last update: Dec 30, 2022

Related tags

Data Analysis bigdata_analyse

Overview

bigdata_analyse

大数据分析项目

wish

采用不同的技术栈，通过对不同行业的数据集进行分析，期望达到以下目标：

了解不同领域的业务分析指标
深化数据处理、数据分析、数据可视化能力
增加大数据批处理、流处理的实践经验
增加数据挖掘的实践经验

tip

项目主要使用的编程语言是 python、sql、hql
.ipynb 可以用 jupyter notebook 打开，如何安装, 可以参考 jupyter notebook

jupyter notebook 是一种网页交互形式的 python 编辑器，直接通过 pip 安装，也支持 markdown，很适合用来做数据分析可视化以及写文章、写示例代码等。

list

主题	处理方式	技术栈	数据集下载
1 亿条淘宝用户行为数据分析	离线处理	清洗 hive + 分析 hive + 可视化 echarts	阿里云或者百度网盘提取码：5ipq
1000 万条淘宝用户行为数据实时分析	实时处理	数据源 kafka + 实时分析 flink + 可视化（es + kibana）	百度网盘提取码：m4mc
300 万条《野蛮时代》的玩家数据分析	离线处理	清洗 pandas + 分析 mysql + 可视化 pyecharts	百度网盘提取码：paq4
130 万条深圳通刷卡数据分析	离线处理	清洗 pandas + 分析 impala + 可视化 dbeaver	百度网盘提取码：t561
10 万条厦门招聘数据分析	离线处理	清洗 pandas + 分析 hive + 可视化 ( hue + pyecharts ) + 预测 sklearn	百度网盘提取码：9wx0
7000 条租房数据分析	离线处理	清洗 pandas + 分析 sqlite + 可视化 matplotlib	百度网盘提取码：9en3
6000 条倒闭企业数据分析	离线处理	清洗 pandas + 分析 pandas + 可视化 (jupyter notebook + pyecharts)	百度网盘提取码：xvgm

refer

https://tianchi.aliyun.com/dataset/

https://opendata.sz.gov.cn/data/api/toApiDetails/29200_00403601

https://www.kesci.com/home/dataset

Owner

Way

Way

GitHub Repository

Exploring the Top ML and DL GitHub Repositories

This repository contains my work related to my project where I scraped data on the most popular machine learning and deep learning GitHub repositories in order to further visualize and analyze it.

17 Aug 21, 2022

Pypeln is a simple yet powerful Python library for creating concurrent data pipelines.

Pypeln Pypeln (pronounced as "pypeline") is a simple yet powerful Python library for creating concurrent data pipelines. Main Features Simple: Pypeln

1.4k Dec 31, 2022

A real data analysis and modeling project - restaurant inspections

A real data analysis and modeling project - restaurant inspections Jafar Pourbemany 9/27/2021 This project represents data analysis and modeling of re

2 Aug 21, 2022

Python Implementation of Scalable In-Memory Updatable Bitmap Indexing

PyUpBit CS490 Large Scale Data Analytics — Implementation of Updatable Compressed Bitmap Indexing Paper Table of Contents About The Project Usage Cont

1 Jun 28, 2022

Pipeline and Dataset helpers for complex algorithm evaluation.

tpcp - Tiny Pipelines for Complex Problems A generic way to build object-oriented datasets and algorithm pipelines and tools to evaluate them pip inst

3 Dec 07, 2022

A lightweight, hub-and-spoke dashboard for multi-account Data Science projects

A lightweight, hub-and-spoke dashboard for cross-account Data Science Projects Introduction Modern Data Science environments often involve many indepe

3 Oct 30, 2021

Using Data Science with Machine Learning techniques (ETL pipeline and ML pipeline) to classify received messages after disasters.

Using Data Science with Machine Learning techniques (ETL pipeline and ML pipeline) to classify received messages after disasters.

1 Feb 11, 2022

Hangar is version control for tensor data. Commit, branch, merge, revert, and collaborate in the data-defined software era.

Overview docs tests package Hangar is version control for tensor data. Commit, branch, merge, revert, and collaborate in the data-defined software era

193 Nov 29, 2022

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex has similar Python APIs to Apache Spark or Dask, but rather

791 Jan 04, 2023

MeSH2Matrix - A set of Python codes for the generation of biomedical ontologies from the MeSH keywords of the PubMed scholarly publications

A set of Python codes for the generation of biomedical ontologies from the MeSH keywords of the PubMed scholarly publications

6 Nov 30, 2022

Python script for transferring data between three drives in two separate stages

Waterlock Waterlock is a Python script meant for incrementally transferring data between three folder locations in two separate stages. It performs ha

13 Nov 10, 2021

Statistical package in Python based on Pandas

Pingouin is an open-source statistical package written in Python 3 and based mostly on Pandas and NumPy. Some of its main features are listed below. F

1.2k Dec 31, 2022

Code for the DH project "Dhimmis & Muslims – Analysing Multireligious Spaces in the Medieval Muslim World"

Damast This repository contains code developed for the digital humanities project "Dhimmis & Muslims – Analysing Multireligious Spaces in the Medieval

2 Jul 01, 2022

Methylation/modified base calling separated from basecalling.

Remora Methylation/modified base calling separated from basecalling. Remora primarily provides an API to call modified bases for basecaller programs s

72 Jan 05, 2023

Python library for creating data pipelines with chain functional programming

PyFunctional Features PyFunctional makes creating data pipelines easy by using chained functional operators. Here are a few examples of what it can do

2.1k Jan 05, 2023

Probabilistic reasoning and statistical analysis in TensorFlow

TensorFlow Probability TensorFlow Probability is a library for probabilistic reasoning and statistical analysis in TensorFlow. As part of the TensorFl

3.8k Jan 05, 2023

A data analysis using python and pandas to showcase trends in school performance.

A data analysis using python and pandas to showcase trends in school performance. A data analysis to showcase trends in school performance using Panda

0 Sep 07, 2021

This creates a ohlc timeseries from downloaded CSV files from NSE India website and makes a SQLite database for your research.

NSE-timeseries-form-CSV-file-creator-and-SQL-appender- This creates a ohlc timeseries from downloaded CSV files from National Stock Exchange India (NS

1 Oct 02, 2022

Mining the Stack Overflow Developer Survey

Mining the Stack Overflow Developer Survey A prototype data mining application to compare the accuracy of decision tree and random forest regression m

1 Nov 16, 2021

follow-analyzer helps GitHub users analyze their following and followers relationship

follow-analyzer follow-analyzer helps GitHub users analyze their following and followers relationship by providing a report in html format which conta

2 May 02, 2022