[Policy Site] A Python Crawler for the National Policy Website

Policy site

National policy website: http://www.gov.cn/zhengce/index.htm

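The site looks rather distinctive, but once you open the page source it turns out to be nothing unusual. I chose the keyword 养老 (elderly care), which returns 26 policy links; the URLs are extracted by node name, attribute, and text. The p value in the search URL determines the page number, so a for loop can collect the URLs from the 6 pages of results. Define a function readhtml(path), where path is the link to the search page.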
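As a side sketch of how the p value drives pagination, the search URL can also be assembled from a parameter dictionary instead of string concatenation. The names build_search_url, SEARCH_BASE, and per_page are made up for illustration, and the sketch assumes the server treats the omitted empty parameters the same as empty ones, which I have not verified.

```python
from urllib.parse import urlencode

SEARCH_BASE = 'http://sousuo.gov.cn/s.htm'  # search endpoint taken from the URL in this post


def build_search_url(page, title='养老', per_page=10):
    """Return the search URL for results page index `page` (hypothetical helper)."""
    params = {
        'q': '',
        'n': per_page,    # results per page
        'p': page,        # page index, the only value the loop changes
        't': 'paper',
        'advance': 'true',
        'title': title,   # urlencode percent-encodes this as UTF-8
        'timetype': 'timeqb',
        'sortType': 1,
        # Assumption: the remaining empty parameters of the original URL can be left out.
    }
    return SEARCH_BASE + '?' + urlencode(params)


# One URL per results page, matching range(6) in the loop
page_urls = [build_search_url(n) for n in range(6)]
for u in page_urls:
    print(u)
```

Building the query this way keeps the keyword readable in the source and leaves the percent-encoding to the standard library.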

import urllib.request
from bs4 import BeautifulSoup

urls = []  # policy links collected from all result pages
for n in range(6):
    # Search URL to fetch; it must start with http. The p parameter selects the page.
    url = r'http://sousuo.gov.cn/s.htm?q=&n=10&p=' + str(n) + '&t=paper&advance=true&title=%E5%85%BB%E8%80%81&content=&puborg=&pcodeJiguan=&pcodeYear=&pcodeNum=&childtype=&subchildtype=&filetype=&timetype=timeqb&mintime=&maxtime=&sort=&sortType=1&nocorrect='
    res = urllib.request.urlopen(url)  # urlopen() requests the page and returns a response object
    html = res.read().decode('utf-8')  # read() returns the raw bytes; decode them into HTML text
    soup = BeautifulSoup(html, 'lxml')
    result = soup.find_all('p', class_='result')
    # print(result)
    # Build a second BeautifulSoup object from the matches and parse it further
    download_soup = BeautifulSoup(str(result), 'lxml')
    url_all = download_soup.find_all('a')
    for a_url in url_all:
        a_url = a_url.get('href')
        urls.append(a_url)
        print(a_url)
        txt('hello', a_url)  # txt() is my own helper; its definition is not shown in this post

print('finish')
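The txt helper called in the loop is not shown anywhere in the post. A minimal sketch, assuming it simply appends each link as one line to a text file named after its first argument, might look like this; the folder parameter and the .txt suffix are my own additions for illustration.

```python
import os


def txt(name, text, folder='.'):
    """Append `text` as one line to <folder>/<name>.txt and return the file path.

    Hypothetical stand-in for the helper used in the loop; the real one may differ.
    """
    path = os.path.join(folder, name + '.txt')
    # Mode 'a' creates the file on first use and appends on later calls
    with open(path, 'a', encoding='utf-8') as f:
        f.write(text + '\n')
    return path
```

With this definition, txt('hello', a_url) would collect every scraped link in hello.txt in the current directory, one link per line.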

The next step is to scrape the content of each policy. First, parse the page behind each link: soup = BeautifulSoup(html, 'lxml')

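All policy titles and body text are stored in p tags, so searching by the p node is enough. I spent a month trying the existing extraction approaches one by one, and p.get_text() below was the only one that worked.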

def get_text(soup):
    # Read the plain text of every <p> element
    texts = []
    for p in soup.select('p'):
        t = p.get_text()
        # print(t)  # print the text
        texts.append(t)
    return texts
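The p-tag approach can be tried offline on a small HTML string before pointing it at real policy pages. This is only a sketch: extract_paragraphs and sample are made-up names, and it uses Python's built-in html.parser instead of lxml so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup


def extract_paragraphs(html):
    """Return the stripped text of every non-empty <p> element in `html`."""
    soup = BeautifulSoup(html, 'html.parser')
    paragraphs = []
    for p in soup.select('p'):
        t = p.get_text().strip()  # text of the tag and its children, outer whitespace removed
        if t:                     # skip paragraphs that contain only whitespace
            paragraphs.append(t)
    return paragraphs


# Stand-in for a downloaded policy page
sample = '<div><p>Title of the policy</p><p> </p><p>First <b>paragraph</b>.</p></div>'
print('\n'.join(extract_paragraphs(sample)))
```

Stripping each paragraph and dropping the empty ones keeps the saved text free of the blank lines that layout-only p tags would otherwise produce.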
