[Python] Web Crawling Concepts and Web Data Collection Methods

2024. 1. 11. 02:08

Python - Web Crawling Concepts and Data Collection Methods

What is web crawling?
Collecting data from static web pages (requests vs urllib.request)
Collecting data from dynamic web pages (selenium)
Fetching data through an API
Notes
1. When a site cannot be reached

1. What is web crawling?

Web crawling is the task of extracting and collecting specific data from web pages.

Big data analysis, which has drawn a lot of attention lately, depends on securing data, and web crawling is a common way to collect it.

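1) Collection methods

Static collection: collects data from static web pages (fast, since only the HTML is downloaded)
- Fetch library: requests or urllib
- Parsing library: beautifulsoup
Dynamic collection: collects data from dynamic or static web pages (slow, since a real browser is driven)
- Fetch library: selenium
- Parsing library: selenium or beautifulsoup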
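
One rough way to choose between the two approaches is to check whether the text you want is already present in the HTML the server sends, before any JavaScript runs. The sketch below is a heuristic under stated assumptions, not something from the original post: the function name `appears_in_raw_html` and both sample pages are hypothetical, and it uses only the standard library.

```python
import html
import re


def normalize_spaces(text: str) -> str:
    """Collapse runs of whitespace so comparisons ignore layout."""
    return " ".join(text.split())


def appears_in_raw_html(raw_html: str, expected_text: str) -> bool:
    """Return True if expected_text is visible in the server-sent HTML."""
    # Remove <script>/<style> bodies so text that only exists inside
    # JavaScript (and is rendered later by the browser) is not counted.
    without_scripts = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", raw_html)
    # Drop the remaining tags and decode entities such as &amp;.
    text_only = html.unescape(re.sub(r"(?s)<[^>]+>", " ", without_scripts))
    return normalize_spaces(expected_text) in normalize_spaces(text_only)


# Hypothetical pages: the first ships its content in the HTML,
# the second fills an empty <div> from JavaScript.
static_page = "<html><body><h2>Price: 1,000</h2></body></html>"
dynamic_page = (
    "<html><body><div id='app'></div>"
    "<script>render('Price: 1,000')</script></body></html>"
)

print(appears_in_raw_html(static_page, "Price: 1,000"))   # True
print(appears_in_raw_html(dynamic_page, "Price: 1,000"))  # False
```

If the check returns False for text you can see in the browser, the page is probably rendered dynamically and selenium is the likelier fit.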

2) BeautifulSoup and HTML parser types

BeautifulSoup is a library that parses HTML documents, which are written as nested tags. The actual parsing is done by one of the following parsers.

html.parser: Python's built-in HTML parser; no installation needed.
- ex) BeautifulSoup(contents, 'html.parser')
lxml: the HTML parser from the lxml library (which also handles XML); fast. lxml must be installed, though Anaconda already includes it.
- ex) BeautifulSoup(contents, 'lxml')
html5lib: parses HTML5 the same way a web browser does; slow, and html5lib must be installed.
- Install: pip install html5lib
- ex) BeautifulSoup(contents, 'html5lib')
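
To make "parsing" a little more concrete, here is a minimal sketch that uses Python's standard `html.parser` module directly, the same module that sits behind BeautifulSoup's 'html.parser' option. The class name `FirstH2Extractor`, the helper `first_h2_text`, and the sample markup are all made up for illustration; in practice BeautifulSoup's `select_one("h2")` does the equivalent in one call.

```python
from html.parser import HTMLParser


class FirstH2Extractor(HTMLParser):
    """Collect the text of the first <h2> element in a document."""

    def __init__(self):
        super().__init__()
        self._inside_h2 = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Start recording only for the first <h2> we meet.
        if tag == "h2" and not self._done:
            self._inside_h2 = True

    def handle_endtag(self, tag):
        # Stop recording when that first <h2> closes.
        if tag == "h2" and self._inside_h2:
            self._inside_h2 = False
            self._done = True

    def handle_data(self, data):
        # Text inside nested tags (e.g. <em>) is still part of the heading.
        if self._inside_h2:
            self._chunks.append(data)

    @property
    def text(self):
        return "".join(self._chunks).strip()


def first_h2_text(markup: str) -> str:
    parser = FirstH2Extractor()
    parser.feed(markup)
    parser.close()
    return parser.text


sample = (
    "<html><body><h1>Blog</h1>"
    "<h2>CSS <em>selectors</em></h2><h2>Other</h2></body></html>"
)
print(first_h2_text(sample))  # CSS selectors
```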

2. Collecting data from static web pages (requests vs urllib.request)

1) Using the urllib.request library

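from bs4 import BeautifulSoup
from urllib.request import urlopen

# Download the page; urlopen returns a file-like response object
contents = urlopen("https://kadosholy.tistory.com/172")

# Parse the HTML and print the text of the first <h2> element
bs = BeautifulSoup(contents, "html.parser")
title = bs.select_one("h2")
print(title.text)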

[Output]
[CSS] 선택자 종류 및 사용방법 (태그, id, class, 속성, 자식, 하위, 형제, 가상클래스, 연결)
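
The urllib.request example above assumes the download always succeeds. As a hedged sketch of a slightly more defensive version, the helper below (the name `fetch_html` and the 10-second default timeout are my own assumptions) adds a timeout, turns network errors into a readable message, and decodes the body using the charset the server reports.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its body as text."""
    try:
        with urlopen(url, timeout=timeout) as response:
            raw = response.read()
            # Prefer the charset from the Content-Type header; fall back to UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
    except HTTPError as err:
        # HTTPError is a subclass of URLError, so it has to be caught first.
        raise RuntimeError(f"server returned HTTP {err.code} for {url}") from err
    except URLError as err:
        raise RuntimeError(f"could not reach {url}: {err.reason}") from err
    return raw.decode(charset, errors="replace")
```

The returned string can be passed to BeautifulSoup the same way as `contents` above, for example `BeautifulSoup(fetch_html("https://kadosholy.tistory.com/172"), "html.parser")`.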

2) Using the requests library

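from bs4 import BeautifulSoup
import requests

# Download the page; response.content holds the body as raw bytes
response = requests.get("https://kadosholy.tistory.com/172")
contents = response.content

# Parse the HTML and print the text of the first <h2> element
bs = BeautifulSoup(contents, "html.parser")
title = bs.select_one("h2")
print(title.text)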

[Output]
[CSS] 선택자 종류 및 사용방법 (태그, id, class, 속성, 자식, 하위, 형제, 가상클래스, 연결)

3. Collecting data from dynamic web pages (selenium)

1) selenium example

from selenium import webdriver
from selenium.webdriver.common.by import By

# Launch Chrome and open the Daum home page
driver = webdriver.Chrome()
driver.get("https://www.daum.net")
# Type the keyword into the search box, then click the search button
driver.find_element(By.CSS_SELECTOR, '#q').send_keys('파이썬')
driver.find_element(By.CSS_SELECTOR, '.inner_search > .ico_pctop.btn_search').click()

# Print the text of every matching link on the results page
items = driver.find_elements(By.CSS_SELECTOR, 'a.keyword')
for item in items:
    print(item.text)

4. Fetching data through an API

1) Naver API example

import json
import requests

url = "https://openapi.naver.com/v1/datalab/search"
# Replace "myid" and "mykey" with the credentials issued for your application
headers = {"X-Naver-Client-Id":"myid", "X-Naver-Client-Secret":"mykey", "Content-Type":"application/json"}
params = {"startDate":"2023-01-01", "endDate":"2023-12-31", ... }

# Send the parameters as a JSON string in the request body
response = requests.post(url, headers=headers, data=json.dumps(params))
print(response.json())
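
To see what a JSON POST like the one above puts on the wire, the sketch below builds an equivalent request with the standard library's `urllib.request.Request` without sending it. The helper name `build_json_post` is hypothetical, "myid" and "mykey" are the placeholders from the example, and the payload holds only the two fields visible in the original; the fields elided there are left out here as well, so this body is not a complete API request.

```python
import json
from urllib.request import Request


def build_json_post(url: str, headers: dict, payload: dict) -> Request:
    """Build (but do not send) a POST request whose body is JSON."""
    body = json.dumps(payload).encode("utf-8")  # dict -> JSON text -> bytes
    return Request(url, data=body, headers=headers, method="POST")


url = "https://openapi.naver.com/v1/datalab/search"
headers = {
    "X-Naver-Client-Id": "myid",          # placeholder credential
    "X-Naver-Client-Secret": "mykey",     # placeholder credential
    "Content-Type": "application/json",
}
# Only the fields shown in the original example; the rest were elided there.
payload = {"startDate": "2023-01-01", "endDate": "2023-12-31"}

request = build_json_post(url, headers, payload)
print(request.get_method())          # POST
print(request.data.decode("utf-8"))  # {"startDate": "2023-01-01", "endDate": "2023-12-31"}
```

Passing `request` to `urllib.request.urlopen` would send it; inspecting it first is a convenient way to debug a rejected call.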

2) Kakao API example

import requests

url = "https://dapi.kakao.com/v2/local/search/address.json"
headers = {"Authorization": "KakaoAK " + "your_rest_api_key"}  # the REST API key issued to you
params = { "query": "강남구" }  # address to search for (Gangnam-gu)

response = requests.get(url, headers=headers, params=params)
print(response.json())
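
With a GET request, the `params` dictionary ends up in the URL's query string, and non-ASCII values such as Korean text are percent-encoded as UTF-8. The sketch below reproduces that step with the standard library so the final URL can be inspected; the helper name `build_query_url` is an assumption for illustration, and it is meant as a rough picture of what happens rather than a description of requests' internals.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_url(base_url: str, params: dict) -> str:
    """Append params to base_url as a percent-encoded query string."""
    return f"{base_url}?{urlencode(params)}"


full_url = build_query_url(
    "https://dapi.kakao.com/v2/local/search/address.json",
    {"query": "강남구"},
)
print(full_url)
# https://dapi.kakao.com/v2/local/search/address.json?query=%EA%B0%95%EB%82%A8%EA%B5%AC

# Decoding the query string gives back the original value.
print(parse_qs(urlsplit(full_url).query))  # {'query': ['강남구']}
```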

5. Notes

1) When a site cannot be reached

Try adding 'user-agent' or 'Referer' to the request headers. (You can find your browser's 'user-agent' value in its developer tools.)

import requests

url = "URL of the site to access"
headers = {
    'user-agent' : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    'Referer' : "URL of the site to access"
}
response = requests.get(url, headers = headers)
print(response)  # e.g. <Response [200]> on success
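
The same idea applies to urllib.request, which identifies itself as "Python-urllib/3.x" by default and can be rejected for that reason. Below is a minimal sketch of attaching browser-like headers there. The helper name `build_browser_like_request` and the `https://example.com/` URL are assumptions for illustration, and the user-agent string is the one from the example above.

```python
from typing import Optional
from urllib.request import Request

BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_browser_like_request(url: str, referer: Optional[str] = None) -> Request:
    """Build a GET request that carries browser-like headers."""
    headers = {"User-Agent": BROWSER_USER_AGENT}
    if referer:
        headers["Referer"] = referer
    return Request(url, headers=headers)


# example.com is a stand-in for the site you are trying to reach.
request = build_browser_like_request(
    "https://example.com/", referer="https://example.com/"
)
# Request stores header names via str.capitalize(), hence "User-agent".
print(request.get_header("User-agent"))
print(request.get_header("Referer"))
```

The resulting object can be passed to `urlopen(request)` in place of a plain URL string.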

License: Attribution, NonCommercial, NoDerivatives

KADOSHoly