05. Crawling the Content Behind Links (URLs) with Scrapy Callbacks

EXCELSIOR

Excelsior-JH 2017. 5. 19. 00:49

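This post covers the same material as the previous post, [04. Crawling News with Scrapy], but this time uses Scrapy callbacks to crawl the news articles behind the URLs that were collected.

First, let's revisit step 3 of the previous post.

As the part marked with a red box there showed, that approach first crawled the link (URL) of each news article and then ran a separate crawl over those links, which was very tedious.

Every test run required entering the Scrapy command twice, and pipelines.py had to be switched back and forth between the CsvPipeline class for one run and the MongoDBPipeline class for the other.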

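Scrapy callbacks solve this simply. A callback is the method Scrapy calls with the response once a request has been downloaded, and it can yield further requests as well as items, so both crawling steps fit into a single spider run. Callbacks are described in the Scrapy documentation.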
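The callback flow can be pictured with a toy model. The sketch below is not Scrapy itself: `Request`, `FAKE_PAGES`, and `crawl` are hypothetical stand-ins written only for illustration, and no network access takes place. It only shows the pattern the spider in this post relies on, where a list-page callback yields new requests and an article-page callback yields the final items.

```python
from collections import deque, namedtuple

# Hypothetical stand-in for scrapy.Request: a URL plus the function to call
# once that URL has been "downloaded".
Request = namedtuple("Request", ["url", "callback"])

# Fake site used instead of real HTTP: one list page linking to two articles.
FAKE_PAGES = {
    "list": ["article/1", "article/2"],
    "article/1": "First headline",
    "article/2": "Second headline",
}


def parse_url(url, body):
    """List-page callback: schedule one follow-up request per link."""
    for link in body:
        yield Request(link, parse_news)


def parse_news(url, body):
    """Article-page callback: emit the final item."""
    yield {"_id": url.split("/")[-1], "title": body}


def crawl(start_requests):
    """Toy engine: fetch each queued request and pass the result to its callback."""
    queue = deque(start_requests)
    items = []
    while queue:
        request = queue.popleft()
        body = FAKE_PAGES[request.url]
        for result in request.callback(request.url, body):
            if isinstance(result, Request):
                queue.append(result)  # a new request: crawl it later
            else:
                items.append(result)  # an item: collect it
    return items


items = crawl([Request("list", parse_url)])
print(items)
```

A single call to `crawl` walks from the list page to every article, which is why the two separate Scrapy commands are no longer needed once the spider chains its callbacks in the same way.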

The source code below is newsSpider.py from [04. Crawling News with Scrapy], modified to use callbacks.

Comparing the two versions shows that the code is now much cleaner.

# -*- coding: utf-8 -*-

import datetime

import scrapy

from newscrawling.items import NewscrawlingItem


class NewsUrlSpider(scrapy.Spider):
    name = "newsCrawler"

    def start_requests(self):
        press = [8, 190, 200]  # 8: JoongAng, 190: Dong-A, 200: Chosun
        pageNum = 11  # exclusive upper bound, so pages 1 to 10 are requested

        for cp in press:
            for term in range(0, 23):
                # Use datetime to cover a fixed period (20170417 ~ 20170509)
                date = (datetime.date(2017, 4, 17) + datetime.timedelta(days=term)).strftime('%Y%m%d')
                for i in range(1, pageNum):
                    yield scrapy.Request(
                        "http://media.daum.net/cp/{0}?page={1}&regDate={2}&cateId=1002".format(cp, i, date),
                        callback=self.parse_url)

    def parse_url(self, response):
        # List page: schedule one request per article link, handled by parse_news
        for sel in response.xpath('//*[@id="mArticle"]/div[2]/ul/li/div'):
            url = sel.xpath('strong[@class="tit_thumb"]/a/@href').get()
            if url is None:
                continue  # skip entries without an article link

            self.logger.info('article url: %s', url)
            yield scrapy.Request(url, callback=self.parse_news)

    def parse_news(self, response):
        # Article page: fill the item from the page that parse_url linked to
        item = NewscrawlingItem()

        item['source'] = response.xpath('//*[@id="cSub"]/div[1]/em/a/img/@alt').get()
        item['category'] = '정치'  # politics
        item['title'] = response.xpath('//*[@id="cSub"]/div[1]/h3/text()').get()
        # Keep only the first 8 characters of og:regDate (the date part)
        item['date'] = response.xpath('/html/head/meta[contains(@property, "og:regDate")]/@content').get(default='')[:8]
        item['article'] = response.xpath('//*[@id="harmonyContainer"]/section/div[contains(@dmcf-ptype, "general")]/text()').getall() \
                          + response.xpath('//*[@id="harmonyContainer"]/section/p[contains(@dmcf-ptype, "general")]/text()').getall()

        # The last path segment of the article URL serves as the item ID
        item['_id'] = response.url.split("/")[-1]
        self.logger.info('%s (%s)', item['title'], item['date'])

        yield item
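To see how many list pages start_requests schedules before any article is fetched, the same loops can be run without Scrapy. This is a minimal sketch using only the standard library; `build_list_urls` and `URL_TEMPLATE` are hypothetical names introduced here, while the press IDs, start date, day count, and page range are the values from the spider above.

```python
import datetime

# Same URL pattern as in the spider's start_requests
URL_TEMPLATE = "http://media.daum.net/cp/{0}?page={1}&regDate={2}&cateId=1002"


def build_list_urls(press, start, days, pages):
    """Return every list-page URL for the given press IDs, date range, and pages."""
    urls = []
    for cp in press:
        for offset in range(days):
            date = (start + datetime.timedelta(days=offset)).strftime("%Y%m%d")
            for page in range(1, pages + 1):
                urls.append(URL_TEMPLATE.format(cp, page, date))
    return urls


urls = build_list_urls([8, 190, 200], datetime.date(2017, 4, 17), days=23, pages=10)
print(len(urls))  # 3 press IDs x 23 days x 10 pages
print(urls[0])
print(urls[-1])
```

Each of these list pages is handed to parse_url, which in turn yields one request per article, so the total number of requests in a run is considerably larger than the number of list pages.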
 