03. BeautifulSoup vs Scrapy

Excelsior-JH 2017. 5. 2. 15:39

BeautifulSoup and Scrapy are both Python packages used for web crawling.

1. BeautifulSoup vs Scrapy

1) BeautifulSoup

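- Provides an easy way to extract the information you want from an HTML document.

- Automatically converts incoming documents to Unicode and outputs them as UTF-8.

- Uses a parser such as lxml or html5lib.

- The basic usage is covered at https://www.crummy.com/software/BeautifulSoup/bs4/doc/.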
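
As a rough illustration of the points above, the sketch below parses a small HTML string and pulls out a heading and a list of links. The HTML snippet and variable names are made up for this example, and it uses Python's built-in "html.parser" so that it runs without lxml or html5lib installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for illustration.
html = """
<html><body>
  <h1 class="title">Sample news</h1>
  <ul>
    <li><a href="/news/1">First article</a></li>
    <li><a href="/news/2">Second article</a></li>
  </ul>
</body></html>
"""

# "html.parser" ships with Python; pass "lxml" or "html5lib" instead
# if those packages are installed.
soup = BeautifulSoup(html, "html.parser")

# Text of the first <h1 class="title"> element.
title = soup.find("h1", class_="title").get_text(strip=True)

# (text, href) pairs for every link inside a list item.
links = [(a.get_text(strip=True), a["href"]) for a in soup.select("li a")]

print(title)
print(links)
```

find() returns the first matching tag, while select() accepts a CSS selector and returns every match.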

2) Scrapy

- A web scraping framework

- Supports several kinds of selectors (CSS, XPath)

- Item pipelines

- Logging

- Email

- The tutorial and usage guide are available at https://docs.scrapy.org/en/latest/intro/tutorial.html.

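2. Creating a Scrapy project

- A project can be created with the commands below. (This example creates a project named tutorial.)

# Enter the virtual environment
source activate crawler
# Create the Scrapy project
scrapy startproject [project name]

- Inside the generated project directory you will find the files described below.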

1) items.py

- Defines the data to be crawled as a class.

- For example, to collect three fields (title, link, author), declare them in items.py.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

2) pipelines.py

- Used to process the data after it has been crawled.

- For example, post-processing such as duplicate checks, filtering, and inserting into a database.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class TutorialPipeline(object):
    def process_item(self, item, spider):
        return item
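
One possible way to combine a duplicate check with database insertion is sketched below using Python's built-in sqlite3 module. The class name SQLitePipeline, the articles table, and its columns are all assumptions made for this example; the post itself does not prescribe a storage backend.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical pipeline: stores items in SQLite and skips duplicate links."""

    def __init__(self, db_path=":memory:"):
        # ":memory:" keeps the sketch self-contained; use a file path in practice.
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once by Scrapy when the spider starts.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS articles ("
            "link TEXT PRIMARY KEY, title TEXT, author TEXT)"
        )

    def close_spider(self, spider):
        # Called once by Scrapy when the spider finishes.
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # INSERT OR IGNORE leaves the table unchanged when the link already exists.
        self.conn.execute(
            "INSERT OR IGNORE INTO articles (link, title, author) VALUES (?, ?, ?)",
            (item["link"], item.get("title"), item.get("author")),
        )
        return item
```

Here the duplicate is silently ignored at the database level. To discard an item from the pipeline chain entirely, Scrapy's convention is to raise scrapy.exceptions.DropItem inside process_item. As the template comment notes, a pipeline only runs once it is listed in the ITEM_PIPELINES setting.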

3) settings.py

- The file that defines the project's settings and how its modules are wired together.
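
A fragment of what settings.py might contain for the tutorial project is sketched below. The setting names are standard Scrapy settings, but the values (the delay in particular) are illustrative assumptions rather than the author's configuration.

```python
# settings.py (fragment)
BOT_NAME = "tutorial"

# Where Scrapy looks for spider classes.
SPIDER_MODULES = ["tutorial.spiders"]
NEWSPIDER_MODULE = "tutorial.spiders"

# Respect robots.txt and wait between requests to the same site.
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1  # seconds; an illustrative value

# Enable pipelines; lower numbers run first (conventional range 0-1000).
ITEM_PIPELINES = {
    "tutorial.pipelines.TutorialPipeline": 300,
}
```

The dotted path in ITEM_PIPELINES is what connects settings.py to the class defined in pipelines.py.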

4) spiders folder

- The crawling logic itself is written as spider classes inside the spiders folder.
