04. Crawling News with Scrapy

Excelsior-JH 2017. 5. 7. 22:40

Building on the previous posts, this post sets up the environment for web crawling, then uses Scrapy to crawl news articles and save them as JSON, CSV, or into MongoDB.

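1. robots.txt (Robots Exclusion Standard)

Before crawling, first find out whether the target site permits crawling at all.

The Robots Exclusion Standard is how a site states this, and the rules are published in its 'robots.txt' file.

To read it, append '/robots.txt' to the site's address.

The Robots Exclusion Standard is a convention for keeping robots away from a website; the access restrictions are normally described in robots.txt.

The convention was first created in June 1994. When this post was written there was no RFC for it (it was later published as RFC 9309 in 2022).

The convention is only a recommendation: it relies on the robot reading robots.txt and stopping on its own. So even when access is disallowed there, other people can still reach those files.

Source: https://ko.wikipedia.org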

1) Naver News (http://news.naver.com/robots.txt)

Naver News disallows every path for general robots, as shown below, so it cannot be crawled.

User-agent: Yeti
Allow: /main/imagemontage
Disallow: /
User-agent: *
Disallow: /

2) Daum News (http://media.daum.net/robots.txt)

Daum News allows robots everywhere except the newsview paths, so it can be crawled.

User-agent: *
Allow: /
Disallow: /*/newsview

I therefore decided to crawl JoongAng Ilbo politics articles from Daum News.
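The same check can be scripted. The sketch below feeds the two robots.txt listings above to Python's standard `urllib.robotparser`; the helper name `is_allowed` and the Naver example URL are my own choices for illustration. Note that the standard-library parser matches plain path prefixes, so the wildcard rule `/*/newsview` is not interpreted as a pattern here.

```python
from urllib.robotparser import RobotFileParser

# robots.txt bodies copied from the two listings above
NAVER_RULES = """\
User-agent: Yeti
Allow: /main/imagemontage
Disallow: /
User-agent: *
Disallow: /
"""

DAUM_RULES = """\
User-agent: *
Allow: /
Disallow: /*/newsview
"""


def is_allowed(rules_text, url, user_agent="*"):
    """Return True if the robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Example URL on Naver News (chosen for illustration)
    print(is_allowed(NAVER_RULES, "http://news.naver.com/main/home.nhn"))
    # List-page URL in the form used later in this post
    print(is_allowed(DAUM_RULES, "http://media.daum.net/cp/8?page=1"))
```

For a live site, `RobotFileParser.set_url()` followed by `read()` would download the file instead of parsing a pasted copy.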

2. Creating the Scrapy Project

Create the Scrapy project for crawling news as follows.

# Enter the web-crawling virtual environment
cjh@CJHui-MacBook-Pro:~$ source activate crawler

# Create the Scrapy project
(crawler) cjh@CJHui-MacBook-Pro:~$ scrapy startproject newscrawling
New Scrapy project 'newscrawling', using template directory '/Users/cjh/anaconda/envs/crawler/lib/python3.5/site-packages/scrapy/templates/project', created in:
    /Users/cjh/newscrawling
You can start your first spider with:
    cd newscrawling
    scrapy genspider example example.com

After the project is created, its directory contains a 'spiders' folder and several generated files.

The roles of the 'spiders' folder, 'items.py', 'pipelines.py', and 'settings.py' are explained in an earlier post.

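3. Understanding the Structure of the Target Page

Before writing any code, it is important to understand the structure of the target page, [Daum News > JoongAng Ilbo > Politics] (crawl date: 2017.05.04).

The page is a list in which each entry consists of [title - article text (truncated with an ellipsis)].

① The article title (title) to crawl

② The article body (article) is truncated with an ellipsis on the list page, so the full text cannot be taken from there. To get the full text, I took the following approach.

i) First crawl the article title (title) and the article link (url), and save them to a CSV file.

ii) Load the article links (url) from the CSV file and crawl the full article body (article) from each link.

This approach needs two spider classes: one that crawls the links and one that crawls the article bodies. Both fill the single item class defined in items.py.

③ The list is paginated, so the code has to handle page numbers in order to visit each page.

④ Most web browsers offer an item such as 'Inspect' (in Chrome) in the right-click menu, which shows the HTML tags of the selected element.

⑤ Scrapy can use XPath to select the information to crawl. For example, the XPath of one article title (title) is '//*[@id="mArticle"]/div[2]/ul/li[15]/div/strong/a'. From this we can see that all article titles share the path //*[@id="mArticle"]/div/ul/li/div/strong/a.

(XPath starts to make sense after some trial and error against the tags of the site you want to crawl.)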
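To see why dropping the indices turns a single-element XPath into one that matches every title, here is a small sketch. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, as a stand-in for Scrapy's `response.xpath`; the markup in `SAMPLE_PAGE` is a made-up, heavily simplified imitation of the list page, not the real Daum HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the list page markup
SAMPLE_PAGE = """
<html><body>
  <div id="mArticle">
    <div class="box_head">header</div>
    <div class="box_list">
      <ul>
        <li><div><strong class="tit_thumb"><a href="http://example.com/a">Title A</a></strong></div></li>
        <li><div><strong class="tit_thumb"><a href="http://example.com/b">Title B</a></strong></div></li>
      </ul>
    </div>
  </div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE_PAGE)

# Path copied from one element: the indices pin it to a single <li>
one = root.findall(".//*[@id='mArticle']/div[2]/ul/li[2]/div/strong/a")

# Same path without the indices: matches the title link of every <li>
every = root.findall(".//*[@id='mArticle']/div/ul/li/div/strong/a")

one_titles = [a.text for a in one]
all_titles = [a.text for a in every]
all_links = [a.get("href") for a in every]

if __name__ == "__main__":
    print(one_titles)
    print(all_titles)
    print(all_links)
```

In a spider the equivalent expression is passed to `response.xpath(...)`, which accepts full XPath 1.0 and tolerates real-world HTML that is not well-formed XML.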

4. Writing the Files

1) items.py: the file that defines the data to crawl

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class NewscrawlingItem(scrapy.Item):
    # define the fields for your item here like:
    source = scrapy.Field()    # news outlet
    category = scrapy.Field()  # category
    title = scrapy.Field()     # article title
    url = scrapy.Field()       # article link
    date = scrapy.Field()      # date
    article = scrapy.Field()   # article body

2) newsSpider.py in the spiders folder: the file that holds the crawling logic

① NewsUrlSpider class: fetches the article titles and links. It crawls {source (news outlet), category, title (article title), url (article link), date}.

② NewsSpider class: fetches the article bodies. It crawls {source (news outlet), category, title (article title), date, article (article body)}.

# -*- coding: utf-8 -*-

import csv

import scrapy

from newscrawling.items import NewscrawlingItem


class NewsUrlSpider(scrapy.Spider):
    name = "newsUrlCrawler"
    # Wait about 5 seconds between requests. DOWNLOAD_DELAY does this
    # without blocking Scrapy's event loop the way time.sleep() would.
    custom_settings = {'DOWNLOAD_DELAY': 5}

    def start_requests(self):
        press = [8, 190, 200]  # 8: JoongAng, 190: Dong-A, 200: Chosun
        pageNum = 2  # range(1, pageNum) stops before pageNum, so only page 1 is crawled
        date = [20170501]
        # date = [20170501, 20170502, 20170503, 20170504, 20170505, 20170506, 20170507, 20170508]

        for cp in press:
            for day in date:
                for i in range(1, pageNum):
                    yield scrapy.Request(
                        "http://media.daum.net/cp/{0}?page={1}&regDate={2}&cateId=1002".format(cp, i, day),
                        self.parse_news)

    def parse_news(self, response):
        # One <div> per article entry in the list
        for sel in response.xpath('//*[@id="mArticle"]/div[2]/ul/li/div'):
            item = NewscrawlingItem()

            # extract_first() returns None instead of raising when nothing matches
            item['source'] = sel.xpath('strong/span[@class="info_news"]/text()').extract_first()
            item['category'] = '정치'  # "politics"
            item['title'] = sel.xpath('strong[@class="tit_thumb"]/a/text()').extract_first()
            item['url'] = sel.xpath('strong[@class="tit_thumb"]/a/@href').extract_first()
            item['date'] = sel.xpath('strong[@class="tit_thumb"]/span/span[@class="info_time"]/text()').extract_first()

            print('*' * 100)
            print(item['title'])

            yield item


class NewsSpider(scrapy.Spider):
    name = "newsCrawler"
    custom_settings = {'DOWNLOAD_DELAY': 5}

    def start_requests(self):
        # Read the article links saved by NewsUrlSpider
        with open('newsUrlCrawl.csv', encoding='utf-8', newline='') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                yield scrapy.Request(row['url'], self.parse_news)

    def parse_news(self, response):
        item = NewscrawlingItem()

        item['source'] = response.xpath('//*[@id="cSub"]/div[1]/em/a/img/@alt').extract_first()
        item['category'] = '정치'  # "politics"
        item['title'] = response.xpath('//*[@id="cSub"]/div[1]/h3/text()').extract_first()
        # Keep only the leading YYYYMMDD part of the registration date
        item['date'] = response.xpath('/html/head/meta[contains(@property, "og:regDate")]/@content').extract_first(default='')[:8]
        # The body is spread over <div> and <p> blocks; collect both as a list of text fragments
        item['article'] = response.xpath('//*[@id="harmonyContainer"]/section/div[contains(@dmcf-ptype, "general")]/text()').extract() \
                          + response.xpath('//*[@id="harmonyContainer"]/section/p[contains(@dmcf-ptype, "general")]/text()').extract()

        print('*' * 100)
        print(item['title'])
        print(item['date'])

        yield item
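The three nested loops in `start_requests` simply enumerate every (press, date, page) combination of the list-page URL. A minimal sketch of that enumeration on its own, without Scrapy: the helper name `build_list_urls` is hypothetical, and unlike `range(1, pageNum)` in the spider, its `last_page` argument is inclusive.

```python
from itertools import product

# URL pattern taken from the spider above
LIST_URL = "http://media.daum.net/cp/{0}?page={1}&regDate={2}&cateId=1002"


def build_list_urls(press_ids, dates, last_page):
    """Return one list-page URL per (press, date, page) combination.

    last_page is inclusive: last_page=2 yields pages 1 and 2.
    """
    pages = range(1, last_page + 1)
    return [
        LIST_URL.format(cp, page, day)
        for cp, day, page in product(press_ids, dates, pages)
    ]


if __name__ == "__main__":
    for url in build_list_urls([8, 190, 200], [20170501], 2):
        print(url)
```

Printing the URLs before a run is a cheap way to confirm how many requests the spider is about to make.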

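3) pipelines.py: the file that processes the data and saves it to a DB

In this post I wrote code that saves the crawled data in three ways: JSON, CSV, and MongoDB.

In later use, pick whichever of the three you need. This post uses the class that saves to MongoDB for the article bodies, and the CSV class for the intermediate list of links.

For installing MongoDB, see the post Mac-OS에-MongoDB-설치-및-Robomongo-설치 (installing MongoDB and Robomongo on Mac OS).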

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exporters import JsonItemExporter, CsvItemExporter
from scrapy.exceptions import DropItem

import pymongo


# Saves items to a JSON file
class JsonPipeline(object):
    def __init__(self):
        self.file = open("newsCrawl.json", 'wb')
        self.exporter = JsonItemExporter(self.file, encoding='utf-8', ensure_ascii=False)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item


# Saves items to a CSV file
class CsvPipeline(object):
    def __init__(self):
        self.file = open("newsUrlCrawl.csv", 'wb')
        self.exporter = CsvItemExporter(self.file, encoding='utf-8')
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item


# Saves items to MongoDB
class MongoDBPipeline(object):

    def __init__(self, server, port, db, collection):
        connection = pymongo.MongoClient(server, port)
        self.collection = connection[db][collection]

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection values defined in settings.py
        settings = crawler.settings
        return cls(
            settings.get('MONGODB_SERVER'),
            settings.getint('MONGODB_PORT'),
            settings.get('MONGODB_DB'),
            settings.get('MONGODB_COLLECTION')
        )

    def process_item(self, item, spider):
        # Drop the item if any of its fields holds an empty value
        for field, value in item.items():
            if not value:
                raise DropItem("Missing {0}!".format(field))

        self.collection.insert_one(dict(item))
        spider.logger.debug("News added to MongoDB database!")

        return item
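One detail of the validation step is worth trying in isolation: iterating over a dict-like item yields its field names, not its values, so an emptiness check has to look at the values explicitly. A minimal sketch with a plain dict standing in for the Scrapy item; the function name `empty_fields` and the sample data are made up for illustration.

```python
def empty_fields(item):
    """Return the names of fields whose value is empty (None, '', [], ...)."""
    return [field for field, value in item.items() if not value]


# Made-up item with two empty fields
sample = {"source": "Example Daily", "title": "", "article": []}

# Plain iteration only ever sees the field names, which are never empty
keys_seen = [data for data in sample]

if __name__ == "__main__":
    print(keys_seen)
    print(empty_fields(sample))
```

A pipeline could raise `DropItem` whenever such a helper returns a non-empty list.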

4) settings.py: the file that defines the basic settings. It also selects which of the classes defined in pipelines.py is used.

# -*- coding: utf-8 -*-

# Scrapy settings for newscrawling project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'newscrawling'

SPIDER_MODULES = ['newscrawling.spiders']
NEWSPIDER_MODULE = 'newscrawling.spiders'
LOG_LEVEL='ERROR'
#
# CsvPipeline setting, used when crawling the URLs
# ITEM_PIPELINES = {'newscrawling.pipelines.CsvPipeline': 300, }

# MongoDBPipeline setting, used when crawling the article bodies
ITEM_PIPELINES = {'newscrawling.pipelines.MongoDBPipeline': 300,}

MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "news_crawl"
MONGODB_COLLECTION = "news"

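5. Running Scrapy

1) Crawling the article titles (title) and links (url) and saving them to a CSV file

With the source code finished, run Scrapy to do the crawl.

First, crawl the article titles (title) and article links (url) and save them to a CSV file. For this step, change settings.py as follows.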

BOT_NAME = 'newscrawling'

SPIDER_MODULES = ['newscrawling.spiders']
NEWSPIDER_MODULE = 'newscrawling.spiders'
LOG_LEVEL='ERROR'
#
# CsvPipeline setting, used when crawling the URLs
ITEM_PIPELINES = {'newscrawling.pipelines.CsvPipeline': 300, }

# MongoDBPipeline setting, used when crawling the article bodies
#ITEM_PIPELINES = {'newscrawling.pipelines.MongoDBPipeline': 300,}

#MONGODB_SERVER = "localhost"
#MONGODB_PORT = 27017
#MONGODB_DB = "news_crawl"
#MONGODB_COLLECTION = "news"

Then, in the terminal, use the commands below to run the spider by the name of the NewsUrlSpider class, 'newsUrlCrawler' (see newsSpider.py).

While the command runs, the spider prints each crawled title, and a newsUrlCrawl.csv file is created.

cjh@CJHui-MacBook-Pro:~$ source activate crawler
(crawler) cjh@CJHui-MacBook-Pro:~$ cd PycharmProjects/
(crawler) cjh@CJHui-MacBook-Pro:~/PycharmProjects$ cd newscrawling/
(crawler) cjh@CJHui-MacBook-Pro:~/PycharmProjects/newscrawling$ scrapy crawl newsUrlCrawler
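The second spider depends on the url column of this CSV file, so it can help to check that column before moving on. A small sketch of the read step using `csv.DictReader`: the rows in `SAMPLE_CSV` are invented placeholders rather than real crawl output, and `read_urls` is a hypothetical helper. Because `DictReader` looks columns up by header name, the column order in the file does not matter.

```python
import csv
import io

# Invented rows in the shape the URL crawl is expected to produce
SAMPLE_CSV = (
    "source,category,title,url,date\n"
    "Example Daily,politics,First headline,http://example.com/v/1,2017.05.01\n"
    "Example Daily,politics,Second headline,http://example.com/v/2,2017.05.01\n"
)


def read_urls(csvfile):
    """Return the non-empty values of the 'url' column."""
    reader = csv.DictReader(csvfile)
    return [row["url"] for row in reader if row.get("url")]


if __name__ == "__main__":
    # For the real file: open('newsUrlCrawl.csv', encoding='utf-8', newline='')
    for url in read_urls(io.StringIO(SAMPLE_CSV)):
        print(url)
```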

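2) Crawling the article bodies (article) and saving them to MongoDB

Read the urls from the newsUrlCrawl.csv file saved in step 1), crawl the article bodies, and save them to MongoDB. MongoDB must already be running before this step.

Change settings.py as follows.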

BOT_NAME = 'newscrawling'

SPIDER_MODULES = ['newscrawling.spiders']
NEWSPIDER_MODULE = 'newscrawling.spiders'
LOG_LEVEL='ERROR'
#
# CsvPipeline setting, used when crawling the URLs
# ITEM_PIPELINES = {'newscrawling.pipelines.CsvPipeline': 300, }

# MongoDBPipeline setting, used when crawling the article bodies
ITEM_PIPELINES = {'newscrawling.pipelines.MongoDBPipeline': 300,}

MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "news_crawl"
MONGODB_COLLECTION = "news"
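Editing settings.py between the two runs works, but it is easy to forget. Scrapy spiders also accept a `custom_settings` class attribute that overrides the project settings for that spider only, so each spider could carry its own pipeline choice. The sketch below only builds the two dictionaries; the names `URL_SPIDER_SETTINGS` and `ARTICLE_SPIDER_SETTINGS` are hypothetical, while the pipeline paths and MongoDB values are the ones from settings.py.

```python
# Settings for the URL crawl: export items to CSV
URL_SPIDER_SETTINGS = {
    "ITEM_PIPELINES": {"newscrawling.pipelines.CsvPipeline": 300},
}

# Settings for the article crawl: store items in MongoDB
ARTICLE_SPIDER_SETTINGS = {
    "ITEM_PIPELINES": {"newscrawling.pipelines.MongoDBPipeline": 300},
    "MONGODB_SERVER": "localhost",
    "MONGODB_PORT": 27017,
    "MONGODB_DB": "news_crawl",
    "MONGODB_COLLECTION": "news",
}

# In newsSpider.py each dictionary would be attached as a class attribute:
#
#   class NewsUrlSpider(scrapy.Spider):
#       name = "newsUrlCrawler"
#       custom_settings = URL_SPIDER_SETTINGS

if __name__ == "__main__":
    print(sorted(URL_SPIDER_SETTINGS))
    print(sorted(ARTICLE_SPIDER_SETTINGS))
```

With this arrangement the ITEM_PIPELINES line in settings.py no longer has to be toggled by hand before each `scrapy crawl` command.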

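Then, in the terminal, use the commands below to run the spider by the name of the NewsSpider class, 'newsCrawler' (see newsSpider.py).
While the command runs, the spider prints the title and date of each crawled article. Afterwards, Robomongo can be used to confirm that the crawled article bodies were saved in MongoDB.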

cjh@CJHui-MacBook-Pro:~$ source activate crawler
(crawler) cjh@CJHui-MacBook-Pro:~$ cd PycharmProjects/
(crawler) cjh@CJHui-MacBook-Pro:~/PycharmProjects$ cd newscrawling/
(crawler) cjh@CJHui-MacBook-Pro:~/PycharmProjects/newscrawling$ scrapy crawl newsCrawler
