2021 2학기 데이터크롤링 기말프로젝트

Last update: Aug 16, 2022

Related tags

Text Data & NLP data_crawling

Overview

공지

주제

웹 크롤링을 이용한 취업 공고 스케줄러

스케줄

주제 정하기
코딩하기
핵심 코드 설명 + 피피티 구조 구상 // 12/4 토
피피티 + 스크립트(대본) 제작 + 녹화 // ~ 12/10 ~ 12/11 금~토
영상 편집 // ~12/11 토

웹크롤러

사람인_평균연봉 1000개

주제 선정 배경

마지막 학기를 보내며 취업 전선에 뛰어들려 하니 여러 가지 생각해야 할 게 많았다. 학교라는 좁은 사회를 벗어나 더 큰 물에 뛰어들려 보니 겁부터 났다. 수영 전 준비운동을 하듯 내가 취업하기 위해 먼저 채용 정보를 수집해야 겠다고 생각했다.
IT 내에서도 트렌드와 어떤 분야에서 사람을 많이 구하는지 알고 싶었다. 그를 위해 스택 오버플로우에서 User-Agent 를 확인 후 채용 공고 크롤링을 수행했다.
우리나라 내에서 각자의 분야에 종사하는 사람들이 평균 연봉으로 얼마를 받는지 알고 싶어서 여러 취업 사이트 중 하나인 '사람인'에서 User-Agent 를 확인 후 평균 연봉 정보를 크롤링했다. 최근 1000개만 수행해보았다. (10000개 해도 될 듯하다.)

데이터 수집 방법

사람인, 스택오버플로우에서의 채용 공고를 긁어오기로 했다.
따로 만든 크롤러 파일(연봉정보, 채용공고)에서 CSV 로 데이터를 추출한다.

크롤링 작업 중 핵심 코드 설명

연봉 정보 파일은 주석 달기 완료

분석 방법

주제어(키워드) 빈도 분석
주제어(키워드) 중요도 분석
텍스트 마이닝
참고한 링크

결론

어떠한 분야에서의 국내 평균 연봉은 이렇다!
요새는 세계적으로 IT 내 이쪽 분야가 트렌드다! 사람을 많이 뽑는다!

참고자료

사람인 사이트
스택 오버플로우 사이트

과제 수행에서 어려웠던 점

User-Agent 에서 크롤링을 허락해주는 사이트 중 URL 에 페이지의 숫자가 나타나는 사이트를 찾기 어려웠다.
직무 별

PPT 구성

[1] - 주제
[2] - 주제 선정 배경
[3] - 데이터 수집 방법
[4] - 크롤링 작업 중 핵심 소스 코드 설명
[5] - 분석방법/모델
[6] - 결론
[7] - 참고자료
[8] - 과제 수행에서 어려웠던 점

PPT 상세 구성

스택 오버 플로우
- 직종별 구인수 (Front/Back) (NCS IT 직무 8개)
- 나라별 구인 직종
사람인
- 1000개의 임의의 기업에 따른 최고 연봉 (5) 과 최저 연봉 (5)
  - 최고 같은 경우 은행이나 다른 업종
  - 최저 같은 경우 서비스 업종
- 기업형태에 따른 연봉 구간 (중소/중견/대)
- 산업(업종)에 따른 연봉 구간
- 코스닥/코스피에 따른 연봉 구간 차이?
현재 취업하려고 하는 사람들에게 어떤 직무가 자신에게 나을지 판단 -> 결론
- 직무별 수요에 따라서 결과 표시 (스택)
- 연봉을 중요시 여긴다면 결과 표시 (사람인)

분석 결과

스택 오버 플로우
- 직종별 구인수 (Front/Back) (NCS IT 직무 8개)
  - 분석 결과 여따 써줘요
  - 대략 밑에 작성하라는 의미
  - Front / Back
  - 직무 8개 별로
- 나라별 구인 직종
- 사람인
  - 1000개의 임의의 기업에 따른 최고 연봉 (5) 과 최저 연봉 (5)
    - 최고 같은 경우 은행이나 다른 업종
    - 최저 같은 경우 서비스 업종
  - 기업형태에 따른 연봉 구간 (중소/중견/대)
  - 산업(업종)에 따른 연봉 구간
  - 코스닥/코스피에 따른 연봉 구간 차이?

Owner

Choi Eun Jeong

Frontend Developer with React & React Native

Choi Eun Jeong

GitHub Repository

A repo for materials relating to the tutorial of CS-332 NLP

CS-332-NLP A repo for materials relating to the tutorial of CS-332 NLP Contents Tutorial 1: Introduction Corpus Regular expression Tokenization Tutori

9 Feb 15, 2022

KakaoBrain KoGPT (Korean Generative Pre-trained Transformer)

KoGPT KoGPT (Korean Generative Pre-trained Transformer) https://github.com/kakaobrain/kogpt https://huggingface.co/kakaobrain/kogpt Model Descriptions

797 Dec 26, 2022

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.

ParlAI (pronounced “par-lay”) is a python framework for sharing, training and testing dialogue models, from open-domain chitchat, to task-oriented dia

9.7k Jan 09, 2023

Machine translation models released by the Gourmet project

Gourmet Models Overview The Gourmet project has released several machine translation models to translate low-resource languages. This repository conta

5 Dec 08, 2021

BMInf (Big Model Inference) is a low-resource inference package for large-scale pretrained language models (PLMs).

BMInf (Big Model Inference) is a low-resource inference package for large-scale pretrained language models (PLMs).

377 Jan 02, 2023

PyTorch implementation of NATSpeech: A Non-Autoregressive Text-to-Speech Framework

A Non-Autoregressive Text-to-Speech (NAR-TTS) framework, including official PyTorch implementation of PortaSpeech (NeurIPS 2021) and DiffSpeech (AAAI 2022)

760 Jan 03, 2023

Dé op-de-vlucht Pieton vertaler. Wereldwijd gebruikt door meer dan 1.000+ succesvolle bedrijven!

Dé op-de-vlucht Pieton vertaler. Wereldwijd gebruikt door meer dan 1.000+ succesvolle bedrijven!

1 Dec 17, 2021

Code for the paper "Language Models are Unsupervised Multitask Learners"

Status: Archive (code is provided as-is, no updates expected) gpt-2 Code and models from the paper "Language Models are Unsupervised Multitask Learner

16.1k Jan 08, 2023

Deduplication is the task to combine different representations of the same real world entity.

Deduplication is the task to combine different representations of the same real world entity. This package implements deduplication using active learning. Active learning allows for rapid training wi

63 Nov 17, 2022

BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia.

BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural languag

1.1k Jan 03, 2023

All the code I wrote for Overwatch-related projects that I still own the rights to.

overwatch_shit.zip This is (eventually) going to contain all the software I wrote during my five-year imprisonment stay playing Overwatch. I'll be add

2 Dec 31, 2021

Text vectorization tool to outperform TFIDF for classification tasks

WHAT: Supervised text vectorization tool Textvec is a text vectorization tool, with the aim to implement all the "classic" text vectorization NLP meth

186 Dec 29, 2022

Converts python code into c++ by using OpenAI CODEX.

🦾 codex_py2cpp 🤖 OpenAI Codex Python to C++ Code Generator Your Python Code is too slow? 🐌 You want to speed it up but forgot how to code in C++? ⌨

423 Jan 01, 2023

Python bindings to the dutch NLP tool Frog (pos tagger, lemmatiser, NER tagger, morphological analysis, shallow parser, dependency parser)

Frog for Python This is a Python binding to the Natural Language Processing suite Frog. Frog is intended for Dutch and performs part-of-speech tagging

46 Dec 14, 2022

Large-scale pretraining for dialogue

A State-of-the-Art Large-scale Pretrained Response Generation Model (DialoGPT) This repository contains the source code and trained model for a large-

1.8k Jan 07, 2023

[EMNLP 2021] Mirror-BERT: Converting Pretrained Language Models to universal text encoders without labels.

[EMNLP 2021] Mirror-BERT: Converting Pretrained Language Models to universal text encoders without labels.

61 Dec 10, 2022

A Python script which randomly chooses and prints a file from a directory.

___ ____ ____ _ __ ___ / _ \ | _ \ | _ \ ___ _ __ | '__| / _ \ | |_| || | | || | | | / _ \| '__| | | | __/ | _ || |_| || |_| || __

0 Aug 06, 2021

Simple telegram bot to convert files into direct download link.you can use telegram as a file server 🪁

TGCLOUD 🪁 Simple telegram bot to convert files into direct download link.you can use telegram as a file server 🪁 Features Easy to Deploy Heroku Supp

6 Oct 18, 2022

Arabic-Phonetic-Output - You can input the phonetic version of any Arabic text here. This software will show you output in Arabic (with vowels)

Arabic-Phonetic-Output You can input the phonetic version of any Arabic text her

1 Dec 30, 2021

RIDE automatically creates the package and boilerplate OOP Python node scripts as per your needs

RIDE: ROS IDE RIDE automatically creates the package and boilerplate OOP Python code for nodes as per your needs (RIDE is not an IDE, but even ROS isn

20 Jul 14, 2022