1/18 웹스크래핑 - 복습, scrapy

Himoonhee 2021. 1. 18. 21:46

2021. 1. 18. 21:46

728x90

강사님 깃헙주소

https://github.com/joneconsulting/cloud-service

joneconsulting/cloud-service

Contribute to joneconsulting/cloud-service development by creating an account on GitHub.

github.com

gartner top technology trends 2021

https://www.aitimes.kr/news/articleView.html?idxno=18089

코로나 이후 2021년, 가트너 '9가지 전략기술' 발표... 인공지능, 분산클라우드, 인터넷오브액션 등

2021년 어느 날, 산업 현장의 직원들은 COVID-19 대유행 때 문을 닫은 후 다시 직장으로 돌아왔을 때, 그들은 몇 가지 차이점을 발견했다. 센서나 RFID 태그는 직원들이 정기적으로 손을 씻는지 여부

www.aitimes.kr

programming language ranking 2021

stackoverflow programming language ranking 2021

https://www.wearedevelopers.com/magazine/top-programming-languages-to-learn

Top in-demand programming languages to learn in 2021

Long-established languages such as Java, C, or JavaScript are usually the right choice for beginners. But are they still future-proof? These are the players of tomorrow that you should learn in 2021.

www.wearedevelopers.com

https://stackoverflow.blog/2020/05/27/2020-stack-overflow-developer-survey-results

The 2020 Developer Survey results are here! - Stack Overflow Blog

We're excited to share the results of our 10th annual developer survey! 65,000 developers shared their thoughts on the state of software today.

stackoverflow.blog

https://withhsunny.tistory.com/57

MAC 터미널 현재 위치에서 vscode 열기

You can also run VS Code from the terminal by typing 'code' after adding it to the path: 1. Launch VS Code. 2. Open the Command Palette (Ctrl+Shift+P) and type 'shell command' to find the Shell C..

withhsunny.tistory.com

맥 터미널에서 vs code 실행시키기

**학습한 내용

파이썬 프로그래밍(24h)
웹 스크래핑, 시각화, 데이터 분석(24h)

Request, BeautifulSoap, Seaborn, Pandas
Selenium
Scrapy

MariaDB 연동 (RDBMS)

django + bootstrap(32h)
git
postman

Pandas_exercise

SF Salaries Exercise¶

import pandas as pd

Read Salaries.csv as a dataframe called sal.

sal = pd.read_csv('Salaries.csv')

Check the head of the DataFrame.

sal.head(2)

Use the .info() method to find out fow many entries there are.

sal.info()

What is the average BasePay?

sal['BasePay'].mean()

What is the highest amount of OvertimePay in the dataset?

sal['OvertimePay'].max()

What is the job title of JOSEPH DRISCOLL? Note: Use all caps, otherwise you may get an answer that doesn't match up (there is also a lowercase Joseph Driscoll).

sal[sal['EmployeeName'] == 'JOSEPH DRISCOLL']['JobTitle']

How much does JOSEPH DRISCOLL make (including benefits)?

sal[sal['EmployeeName'] == 'JOSEPH DRISCOLL']['TotalPayBenefits']

What is the name of highest paid person (including benefits)?

sal[sal['TotalPayBenefits'] == sal['TotalPayBenefits'].max()]['EmployeeName']
sal.iloc[sal['TotalPayBenefits'].idxmax()]

What is hte name of lowest paid person (including benefits)? Do you notice something strange about how much he or she is paid

sal[sal['TotalPayBenefits'] == sal['TotalPayBenefits'].min()]

What was the average (mean) BasePay of all employees per year? (2011-2014) ?

sal.groupby('Year').mean()['BasePay']

How many unique job titles are there?

sal['JobTitle'].nunique()

What are the top 5 most common jobs?

sal['JobTitle'].value_counts().head(5)

How many Job Titles were represented by only one person in 2013? (e.g. Job Titles with only one occurence in 2013) ?

sum(sal[sal['Year'] == 2013]['JobTitle'].value_counts() ==1)

How many people have the word Chief in thier job title? (This is pretty tricky) (using lambda expression)

def chief_string(title: str) -> bool:
    if 'chief' in title.lower().split():
        return True
    else :
        return False

sum(sal['JobTitle'].apply(lambda x : chief_string(x)))

Bonus: Is there a correlation between length of the Job Title string and Salary?

sal['title_len'] = sal['JobTitle'].apply(len)
sal[['TotalPayBenefits', 'title_len']].corr()

2. Ecommerce Purchases Exercise

** Import pandas and read in the Ecommerce Purchases csv file and set it to a DataFrame called ecom. **

import pandas as pd

ecom = pd.read_csv('Ecommerce Purchases.csv')

Check the head of the DataFrame.

ecom.head(2)

** How many rows and columns are there? **

ecom.info()

** What is the average Purchase Price? **

ecom['Purchase Price'].mean()

** What were the highest and lowest purchase prices? **

ecom['Purchase Price'].max()
ecom['Purchase Price'].min()

** How many people have English 'en' as their Language of choice on the website? **

ecom[ecom['Language'] == 'en'].count()

** How many people have the job title of "Lawyer" ? **

ecom[ecom['Job'] == 'Lawyer'].info()

** How many people made the purchase during the AM and how many people made the purchase during PM ? ** *(Hint: Check out value_counts() ) *

ecom['AM or PM'].value_counts()

** What are the 5 most common Job Titles? **

ecom['Job'].value_counts().head(5)

** Someone made a purchase that came from Lot: "90 WT" , what was the Purchase Price for this transaction? **

ecom[ecom['Lot'] == '90 WT']['Purchase Price']

** What is the email of the person with the following Credit Card Number: 4926535242672853 **

ecom[ecom['Credit Card'] == 4926535242672853]['Email']

* How many people have American Express as their Credit Card Provider *and made a purchase above $95 ?**

ecom[(ecom['CC Provider'] == 'American Express') & (ecom['Purchase Price']>95)].count()

** Hard: How many people have a credit card that expires in 2025? **

sum(ecom['CC Exp Date'].apply(lambda x : x[3:]) == '25')

** Hard: What are the top 5 most popular email providers/hosts (e.g. gmail.com, yahoo.com, etc...) **

ecom['Email'].apply(lambda x : x.split('@')[1]).value_counts().head(5)

Great Job!¶

3. convenient_stat

%matplotlib inline
import pandas as pd
import matplotlib as plt

##1. csv 파일 불러오기 (convenient_store.csv)

con = pd.read_csv('convenient_store.csv')
con.head(5)

##2. 전체 컬럼 정보, null 값 유무 확인

con.info()

##3. 개수, 평균, 편차, 최소, 최대값 확인

con.describe()

##4. 지역에 대한 통계, 개수, 유니크한 정보, 제일 빈도가 높은 지역

# con['area']
con.area.describe()

##5. 시간 당 급여가 6500원 이상인 지역의 편의점 정보 출력 (상위 10개만)

con[con['hourly_wage']>=6500].head(10)

##6. 시간 당 급여가 높은 순서로 정렬 (sort_value() 함수 사용, 상위 10개만 출력)

con.sort_values(by='hourly_wage', ascending=0).head(10)

##7. 영등포구에서 시간 당 급여가 6000원 이상인 편의점 검색

con[((con['area'].apply(lambda x : x[:4])) =='영등포구') & (con['hourly_wage']>=6000)]

##8. CU 편이점만 출력 (상위 10개만)

con[con.company.str.contains('CU')].head(10)

##9. 지역 컬럼(location)을 추가한 다음, in Seoul 이라는 값 저장, 상위 5개 출력

con['location'] = 'in Seoul'
con.head(5)

##10. 6000원 이상 컬럼 추가(more_than_6000) -> True, False 값 저장 (상위 20개 출력)

con['more_than_6000'] = con.hourly_wage>6000
con.head(20)

##11. more_than_6000 컬럼에서 True인 데이터들의 평균, 개수, 편차 등의 정보 출력

con[con.more_than_6000==True].describe()

##12. more_than_6000 이름의 함수를 생성하고, 6000원이상인 경우 A group, 아니면 B group을 반환하는 함수 생성

def more_than_6000(x: int)-> int:
    if x >= 6000:
        return 'A group'
    else:
        return 'B group'

##13. more_than_6000_f 컬럼 생성하고 more_than_6000 함수의 결과를 저장

con['more_than_6000_f'] = con.hourly_wage.map(lambda x: more_than_6000(x))

##14. 지금까지의 결과 상위 10개를 출력

con.head(10)

##15-1. more_than_6000가 True인 데이터의 지역과 시간당 급여를 가진 새로운 데이터프레임 생성(data2)

##15-2. data2 데이터를 시간당 급여 순으로 정렬 (높은순)

data2 = con[con['more_than_6000'] == True][['area1','hourly_wage']]
data2.sort_values(by='hourly_wage', ascending=0)

##16. data2를 darta2.csv 파일로 저장

data2.to_csv('data2.csv', index=False)

##17. 시간당 급여를 histogram 으로 표시 (matplotlib hist() 사용)

con.hourly_wage.hist(bins=10)
plt.pyplot.show()

##18. 시간당 급여를 box 차트로 표시

con.boxplot(column='hourly_wage', return_type='dict')

##19. 시간당 급여를 box 차트로 표시(이름순으로)

con.boxplot(column='hourly_wage', by='name')

##19. 시간당 급여를 box 차트로 표시(지역순으로)

con.boxplot(column='hourly_wage', by='area1')

##20. 한글 표시되게 matplotlib 지정

#강사님버전

import matplotlib.font_manager as fm
font_list = [(f.name, f.fname) for f in fm.fontManager.ttflist if 'Gothic' in f.name]
print(font_list)
font_name = fm.FontProperties(fname='/System/Library/Fonts/Supplemental/AppleGothic.ttf').get_name()
print(font_name)
plt.rc('font', family=font_name)

[('Noto Sans Gothic', '/System/Library/Fonts/Supplemental/NotoSansGothic-Regular.ttf'), ('Apple SD Gothic Neo', '/System/Library/Fonts/AppleSDGothicNeo.ttc'), ('Hiragino Maru Gothic Pro', '/System/Library/Fonts/ヒラギノ丸ゴ ProN W4.ttc'), ('AppleGothic', '/System/Library/Fonts/Supplemental/AppleGothic.ttf')]

AppleGothic

#참고버전

from matplotlib import rc

rc('font', family='AppleGothic')
plt.rcParams['axes.unicode_minus'] = False

##22-1. 지역구별 box 차트(플롯)

##22-2. 폰트 사이즈 6

con.boxplot(column='hourly_wage', by='area')
plt.pyplot.xticks(fontsize=6)

##23-1. 지역구별 box 차트(플롯), 지역구가 세로로 표시

##23-2. 폰트 사이즈 6

con.boxplot(column='hourly_wage', by='area1', vert=False)
plt.pyplot.xticks(fontsize=6)

**Selenium

웹 애플리케이션 테스트 프레임워크
웸 사이트에서 버튼 클릭과 같이 이벤트 처리 가능
JavaScript 실행 가능
웹 브라우저 실행을 대신하기 위한 Web Driver 설치 -> Selenium이 사용하기 위한 웹 브라우저

- http://chromedriver.chromium.org/downloads

Downloads - ChromeDriver - WebDriver for Chrome

WebDriver for Chrome

chromedriver.chromium.org

$pip install selenium
input 태그에 name이나 id같은 선택자가 있으면 selenium에서 테스트가능

**Scrapy

수많은 웹 페이지로부터 정보를 수집 -> 빅데이터로 활용
Scrapying을 위한 라이브러리

$pip install scrapy

Scrapy Shell

$scrapy shell

(pip install --upgrade setuptools

pip install pypiwin32

pip install twisted)

pip install scrapy

scrapy shell

fetch('https://news.naver.com/main/list.nhn?mode=LSD&mid=sec&sid1=001')

view(response)

print(response.text)

728x90

'CLOUD > Python' 카테고리의 다른 글

1/19 웹크롤링 - selenium, scrapy, 여러페이지 크롤링 (2)	2021.01.19
1/11 파이썬 6차시 - 웹스크래핑(MariaDB연동), 시각화 (0)	2021.01.11
1/6 파이썬 5차시 - 웹스크래핑(csv, xml, json타입 처리), 데이터 분석 (0)	2021.01.06
1/5 파이썬 4차시 - 깃파이참연동, 모듈과 패키지, 예외처리, 파일다루기, 웹스크래핑 (0)	2021.01.05
1/4 파이썬 3차시 - 함수가이드라인, pythonic code, 람다, 객체와클래스, 모듈과 패키지 (0)	2021.01.04

몽구와 함께하는 개발자의 길

1/18 웹스크래핑 - 복습, scrapy

SF Salaries Exercise¶

Great Job!¶

'CLOUD > Python' 카테고리의 다른 글

+ Recent posts

티스토리툴바