[globaltimes] Scraping the globaltimes English News Site with Scrapy

globaltimes

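Target Site and Analysis

Visit http://www.globaltimes.cn and you can see that the target site is divided into several major news sections, each with its own subsections, including video and photo sections. Here we only crawl the news sections.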

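News detail pages have URLs of the form `http://www.globaltimes.cn/content/*.shtml`, so putting a number in place of `*` takes you straight to a specific news detail page. This greatly reduces the amount of regular-expression writing needed.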
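
Because every detail page follows the same pattern, the URL for a numeric id, and the id for a URL, can be derived mechanically. A minimal sketch of that idea; the helper names `build_url` and `extract_news_id` are made up for this illustration and are not part of Scrapy or of the site.

```python
import re

BASE_URL = "http://www.globaltimes.cn/content/"


def build_url(news_id):
    """Return the detail-page URL for a numeric news id."""
    return "{}{}.shtml".format(BASE_URL, news_id)


def extract_news_id(url):
    """Return the numeric id from a detail-page URL, or None if the URL does not match."""
    match = re.search(r"/content/(\d+)\.shtml$", url)
    return int(match.group(1)) if match else None


print(build_url(1136000))
print(extract_news_id(build_url(1136000)))
```

Anchoring the pattern on `/content/` and `.shtml` keeps the id extraction from picking up digits elsewhere in the URL.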

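The main pieces of information that can be collected from a news detail page are the news section, news type, title, source, author, publication time, body text, and share count. That settles the fields of the Scrapy Item.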

Initializing the Crawler Project

scrapy startproject globaltimes

The directory after creation is shown below:

[Figure: directory tree of the newly created project]

Run the following command to try starting the crawler:

scrapy crawl globaltimes

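It returns the following error, which shows that Scrapy did not create a spider file for us when it created the project:

KeyError: 'Spider not found: globaltimes'

Since creating the project does not create a spider file, we have to create one from the command line.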

Creating the Spider

scrapy genspider global 'globaltimes.cn'
# arguments: the spider name, then the domain to crawl

The generated file looks like this:

# -*- coding: utf-8 -*-
import scrapy
class GlobalSpider(scrapy.Spider):
    name = 'global'
    allowed_domains = ['globaltimes.cn']
    start_urls = ['http://globaltimes.cn/']
    def parse(self, response):
        pass

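Running the project at this point shows that it now starts normally.

Note: on Python 2.x, add Chinese encoding support to __init__.py (this is not needed on Python 3, where sys.setdefaultencoding no longer exists):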

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

Requesting Pages

# -*- coding: utf-8 -*-
import scrapy
class GlobalSpider(scrapy.Spider):
    name = 'global'
    allowed_domains = ['globaltimes.cn']
    start_urls = ['http://globaltimes.cn/']
    def parse(self, response):
        base_url = "http://www.globaltimes.cn/content/"
        # Loop over the ids, queue a request for each URL, and use parse_page as the callback
        for page in range(1, 1136094):
            url = base_url + str(page) + '.shtml'
            print(url)
            yield scrapy.Request(url, dont_filter=True, callback=self.parse_page)

    def parse_page(self, response):
        # Dump the raw page body to confirm that the request works
        print(response.body)

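Running the project prints the content of individual pages to the console. The site also has no anti-scraping measures in place, which saves the work of setting up an IP proxy pool and a user-agent pool.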

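Defining the Data Fields

A closer look at the page shows that the author, source, and publication time all sit in one block of text, so they need special handling when the data fields are defined. In addition, the share count is generated by JavaScript, so that field has to be dropped.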
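
To make the "special handling" concrete, here is a small sketch of splitting such a combined text block into its three parts with lookbehind regular expressions, in the same spirit as the pipeline helpers later in this article. The function name `split_info` and the sample byline are assumptions made up for this example; the real text on the site may be laid out differently.

```python
import re


def split_info(info):
    """Split the combined byline text into author, source and published time."""
    author = re.search(r"(?<=By).+?(?=Source:)", info)
    source = re.search(r"(?<=Source:).+?(?=Published)", info)
    published = re.search(r"(?<=Published:).+", info)
    return {
        "author": author.group(0).strip() if author else "",
        "source": source.group(0).strip() if source else "",
        # Dates on the page use slashes; switch to dashes for the database
        "published_time": published.group(0).strip().replace("/", "-") if published else "",
    }


# Hypothetical byline text, invented for this example
sample = "By Li Lei Source:Global Times Published: 2019/1/17 21:03:40"
print(split_info(sample))
```

Each part falls back to an empty string when its marker is missing, so a byline without an author still yields a usable source and time.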

Define the data fields in items.py:

import scrapy
class GlobaltimesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()
    title = scrapy.Field()
    module = scrapy.Field()   # news section
    type = scrapy.Field()     # news type
    info = scrapy.Field()     # combined author / source / publication-time text
    content = scrapy.Field()

Analyzing the Page XPath and Extracting Data

Chrome's Inspect Element feature makes it quick to locate an XPath.

Get the data fields via XPath:

# -*- coding: utf-8 -*-
import scrapy

from globaltimes.items import GlobaltimesItem


class GlobalSpider(scrapy.Spider):
    name = 'global'
    allowed_domains = ['globaltimes.cn']
    start_urls = ['http://globaltimes.cn/']

    def parse(self, response):
        base_url = "http://www.globaltimes.cn/content/"
        # Loop over the ids, queue a request for each URL, and use parse_page as the callback
        for page in range(1136000, 1136094):
            url = base_url + str(page) + '.shtml'
            print(url)
            yield scrapy.Request(url, dont_filter=True, callback=self.parse_page)

    def parse_page(self, response):
        # One item per detail page
        item = GlobaltimesItem()
        item['url'] = response.url
        item['title'] = response.xpath('//*[@id="left"]/p[2]/h3/text()').extract()
        item['info'] = response.xpath('//*[@id="left"]/p[3]/p[1]/text()').extract()
        item['module'] = response.xpath('//*[@id="left"]/p[1]/a/text()').extract()
        item['type'] = response.xpath('//*[@id="left"]/p[5]/p[1]/a/text()').extract()
        # The body keeps its HTML markup, so no /text() here
        item['content'] = response.xpath('//*[@id="left"]/p[4]').extract()
        yield item

Note: append /text() to the XPath you obtained in order to get the text content inside the HTML element.
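
Keep in mind that `extract()` returns a list of strings, one per matched node, so each field above holds a list rather than a plain string. A small sketch of flattening such a list into one clean string; `join_text` is a hypothetical helper name, and the sample list only stands in for what a query ending in `/text()` might return.

```python
def join_text(fragments):
    """Join the text fragments returned by extract() and trim surrounding whitespace."""
    return "".join(fragments).strip()


# Made-up sample of what a title query could return
fragments = ["\r\n  Sample headline  "]
print(join_text(fragments))

# An XPath that matches nothing yields an empty list, which becomes an empty string
print(repr(join_text([])))
```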

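Processing Data with the Item Pipeline

After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed in order.

Each item pipeline component (sometimes called just an "Item Pipeline") is a Python class that implements a simple method. It receives an item and performs an action on it, and it also decides whether the item should continue through the pipeline or be dropped and no longer processed.

Typical uses of item pipelines are:

Cleaning HTML data

Validating scraped data (checking that the items contain certain fields)

Checking for duplicates (and dropping them)

Storing the scraped items in a database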
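
As an illustration of the validation and duplicate-check uses, here is a minimal sketch of a pipeline component. It is not the pipeline used later in this article: the class name `DuplicateUrlPipeline` is invented, and `DropItem` is defined locally as a stand-in for `scrapy.exceptions.DropItem` so that the example runs without Scrapy installed.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem (assumption for this sketch)."""


class DuplicateUrlPipeline:
    """Drop items without a url, and items whose url was already seen in this run."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item.get("url")
        if not url:
            raise DropItem("missing url")
        if url in self.seen_urls:
            raise DropItem("duplicate item: %s" % url)
        self.seen_urls.add(url)
        # Returning the item passes it on to the next pipeline component
        return item


pipeline = DuplicateUrlPipeline()
page = {"url": "http://www.globaltimes.cn/content/1136000.shtml"}
print(pipeline.process_item(dict(page), spider=None)["url"])
try:
    pipeline.process_item(dict(page), spider=None)
except DropItem as exc:
    print("dropped:", exc)
```

In a real project the class would import `DropItem` from `scrapy.exceptions` and be registered in `ITEM_PIPELINES` with a lower number than the storage pipeline, so that it runs first.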

Enable the pipeline in `settings.py`

ITEM_PIPELINES = {
    'globaltimes.pipelines.GlobaltimesPipeline': 300,
}

Write the processing rules in `pipelines.py` and save to MySQL

1. Create the database

CREATE DATABASE IF NOT EXISTS english_news DEFAULT CHARSET utf8;

2. Create the table

create table qy_globaltimes(
    id int not null auto_increment,
    news_id int(11) not null unique,
    module varchar(255) null,
    type varchar(255) null,
    url varchar(255) null,
    author varchar(255) null,
    source varchar(255) null,
    title varchar(255) null,
    content longtext null,
    published_time datetime null,
    create_time datetime null,
    update_time datetime null,
    primary key(id)
) engine=InnoDB default charset = utf8;

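3. Define the MySQL connection parameters

Add the following settings to settings.py:

MYSQL_HOST = '127.0.0.1'
MYSQL_DBNAME = 'english_news'  # database name, change as needed
MYSQL_USER = 'root'  # database user, change as needed
MYSQL_PASSWD = '147258369'  # database password, change as needed
MYSQL_PORT = 3306  # database port

4. Process the data and save it to MySQL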

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import re
import time

import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi
# Replaces the deprecated "from scrapy.conf import settings"
from scrapy.utils.project import get_project_settings

settings = get_project_settings()


class GlobaltimesPipeline(object):
    def __init__(self):
        dbargs = dict(
            host=settings['MYSQL_HOST'],
            port=settings['MYSQL_PORT'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWD'],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        # adbapi runs the blocking MySQLdb calls in a thread pool
        self.dbpool = adbapi.ConnectionPool('MySQLdb', **dbargs)
        print('MySQL connection pool created')

    def process_item(self, item, spider):
        # extract() returned lists, so join each field into a single string
        news = {}
        news['news_id'] = self.get_id("".join(item['url']))
        news['module'] = "".join(item['module']).strip()
        news['type'] = "".join(item['type']).strip()
        news['url'] = "".join(item['url'])
        news['author'] = self.get_author("".join(item['info']))
        news['source'] = self.get_source("".join(item['info']))
        news['title'] = "".join(item['title']).strip()
        news['content'] = "".join(item['content'])
        news['published_time'] = self.get_published_time("".join(item['info']))
        # Run the SQL asynchronously
        res = self.dbpool.runInteraction(self.insert_item, news)
        # Callback on error
        res.addErrback(self.on_error, spider)
        return item

    def on_error(self, failure, spider):
        spider.logger.error(failure)

    def insert_item(self, conn, item):
        # conn is the transaction object that runInteraction passes in
        create_time = update_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
        sql = "insert into english_news.qy_globaltimes "
        sql += "(news_id,module,type,url,author,source,title,content,published_time,create_time,update_time)"
        sql += " values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
        # For an existing news_id, only refresh update_time
        sql += " on duplicate key update update_time=%s"
        params = (item['news_id'], item['module'], item['type'], item['url'],
                  item['author'], item['source'], item['title'],
                  item['content'], item['published_time'], create_time, update_time,
                  update_time)
        conn.execute(sql, params)
        print('Inserted into database: ' + item['url'])

    def get_id(self, url):
        # The first run of digits in the URL is the news id
        return re.findall(r"(\w*[0-9]+)\w*", url)[0]

    def get_author(self, info):
        author_match = re.search(r"(?<=By).+?(?=Source:)", info)
        if author_match:
            return author_match.group(0).strip()
        return ''

    def get_source(self, info):
        source_match = re.search(r"(?<=Source:).+?(?=Published)", info)
        if source_match:
            return source_match.group(0).strip()
        return ''

    def get_published_time(self, info):
        published_match = re.search(r"(?<=Published:).+?(.*)", info)
        if published_match:
            return published_match.group(0).strip().replace('/', '-')
        # None is stored as NULL; an empty string is not a valid datetime value
        return None
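
The `on duplicate key update` clause is what makes re-crawling safe: the first time a `news_id` is seen the row is inserted, and on later runs only `update_time` is refreshed. A minimal sketch of that behaviour, using the standard-library `sqlite3` module in place of MySQL so that it runs without a server. This is a swapped-in database, not the setup used above: SQLite spells the clause `ON CONFLICT ... DO UPDATE` (SQLite 3.24 or newer), and the table here is a cut-down, hypothetical version of `qy_globaltimes`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE news ("
    "news_id INTEGER PRIMARY KEY, title TEXT, create_time TEXT, update_time TEXT)"
)


def upsert(conn, news_id, title, now):
    """Insert a row, or only refresh update_time when news_id already exists."""
    conn.execute(
        "INSERT INTO news (news_id, title, create_time, update_time) "
        "VALUES (?, ?, ?, ?) "
        "ON CONFLICT(news_id) DO UPDATE SET update_time = excluded.update_time",
        (news_id, title, now, now),
    )


upsert(conn, 1136000, "first crawl", "2019-01-17 21:00:00")
upsert(conn, 1136000, "second crawl", "2019-01-18 09:00:00")
rows = conn.execute(
    "SELECT news_id, title, create_time, update_time FROM news"
).fetchall()
print(rows)
```

After the second call there is still a single row: the title and `create_time` from the first crawl are kept, and only `update_time` has moved forward.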

Running the Crawler

scrapy crawl global

Once it runs successfully, the articles from the globaltimes English news site are gradually stored in MySQL.
