Web Scraping (2): Parsing a Real Web Page (猫途鹰 / TripAdvisor)

Posted: 2019-11-03 12:12:10 | Source: IT技术 (IT Technology) | Author: seo实验室小编
 


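A scraper works in two stages: it fetches page data from a server, then extracts and analyzes that data with Python parsing libraries. The process is described below, together with notes and answers for the problems encountered along the way.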


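1. How the server and the local client exchange data

  • The client sends a request to the server, and the server answers the client with a response.
  • GET and POST are the two most commonly used HTTP request methods.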

[Sequence diagram: the client sends a request to the server; the server returns a response to the client.]
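The request/response exchange can be tried without touching a real site. The sketch below is not from the original post: it starts a toy server on 127.0.0.1 using only the Python standard library, then sends it one GET and one POST. The names EchoHandler and demo, the paths /list and /login, and the payload user=demo are all made up for this illustration.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Toy server (hypothetical): echoes back what the client sent."""

    def _reply(self, text):
        body = text.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # For GET, the parameters travel in the URL itself.
        self._reply("GET " + self.path)

    def do_POST(self):
        # For POST, the data travels in the request body.
        length = int(self.headers.get("Content-Length", 0))
        payload = self.rfile.read(length).decode("utf-8")
        self._reply("POST " + payload)

    def log_message(self, *args):
        pass  # keep the console quiet


def demo():
    # Port 0 lets the operating system pick a free port.
    server = HTTPServer(("127.0.0.1", 0), EchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:{}".format(server.server_port)
    # An opener with no proxies, so the request goes straight to localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(base + "/list?page=2") as resp:
            get_result = (resp.status, resp.read().decode("utf-8"))
        # Passing data= makes urllib send a POST instead of a GET.
        with opener.open(base + "/login", data=b"user=demo") as resp:
            post_result = (resp.status, resp.read().decode("utf-8"))
    finally:
        server.shutdown()
        server.server_close()
    return get_result, post_result
```

Calling demo() returns the status code and body of both responses. In each case the client speaks first and the server only answers, which is the same pattern requests.get follows against a real site.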


2. How to parse a real web page

from bs4 import BeautifulSoup
import requests
import time

# One URL per result page: the offset after "oa" advances by 30 (0, 30, ..., 150).
urls = ['https://www.tripadvisor.cn/Attractions-g187147-Activities-c47-oa{}-Paris_Ile_de_France.html#FILTERED_LIST'.format(i) for i in range(0, 180, 30)]

headers = {
    # Identifies the client as a regular desktop browser.
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
    # The Cookie value is long and tied to one login session: copy the Cookie
    # request header from your own browser session and paste it here as a single-line string.
    'Cookie': 'PASTE_YOUR_OWN_COOKIE_HEADER_HERE',
}

def get_data(url):
    # Send a GET request; wb_data is the Response object.
    wb_data = requests.get(url, headers=headers)
    # Pause between requests so the site is not hit too quickly.
    time.sleep(4)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    # Each select() call returns a list of matching tags.
    titles = soup.select('p.listing_title > a')
    stars = soup.select('p.wrap > p.rs.rating > span[alt]')
    views = soup.select('span.more > a')
    # zip() pairs the three lists item by item.
    for title, star, view in zip(titles, stars, views):
        data = {
            'title': title.get_text(),
            'star': star.get('alt'),
            'view': view.get_text(),
        }
        print(data)

for single_url in urls:
    get_data(single_url)
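  1. Import the libraries.
  2. headers tells the server who is making the request: User-Agent identifies the client as a browser, and Cookie carries the session of a logged-in account, so it works much like having signed in with a username and password.
  3. You cannot scrape without a target, so start with a site: urls = 'xxxxx'. To scrape several pages, compare the URLs of consecutive pages to see what changes, then build a list of page URLs with urls = ['xxx{}xx'.format(str(i)) for i in range(0, x, x)].
  4. Send the site a request and let it answer with a response: wb_data = requests.get(url, headers=headers). wb_data is the response object.
  5. Parse wb_data with soup = BeautifulSoup(wb_data.text, 'lxml'). BeautifulSoup does not parse the response object itself; it needs the page markup as text, which is why wb_data.text is passed.
  6. Locate the information to extract through the tags and attributes in the page. Each selection returns a list.
  7. To keep the items in these lists matched one to one, use zip(xx, xxx, xx) when assigning values to the keys of the data dictionary.
  8. To avoid requesting pages too quickly and triggering the site's anti-scraping measures, use time.sleep(sec) to space out the requests.
  9. The target here is 猫途鹰 (TripAdvisor), and the content above can be scraped. Scraping the images is not solved yet; I will look into it in later study.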
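Steps 3 and 7 can be checked offline with plain Python. The following is a minimal sketch, not the author's code: the example.com URL template, the page_urls helper, and the three short lists (stand-ins for what the selectors might return) are all made up for illustration. It shows how the range step produces the page offsets and how zip behaves when one list is shorter than the others.

```python
from itertools import zip_longest

# Hypothetical URL template; {} is where the page offset goes.
TEMPLATE = "https://example.com/list-oa{}-page.html"


def page_urls(template, total, step):
    """Return one URL per page; the offset grows by `step` each time."""
    return [template.format(offset) for offset in range(0, total, step)]


urls = page_urls(TEMPLATE, 180, 30)  # offsets 0, 30, 60, 90, 120, 150

# Made-up values standing in for the lists the selectors return.
titles = ["Louvre", "Eiffel Tower", "Orsay"]
stars = ["4.5", "4.5"]  # one listing has no rating
views = ["100 reviews", "200 reviews", "50 reviews"]

# zip() stops at the shortest list, so the third listing is dropped silently.
rows = [
    {"title": t, "star": s, "view": v}
    for t, s, v in zip(titles, stars, views)
]

# zip_longest() keeps every row and pads the missing value with None.
padded = [
    {"title": t, "star": s, "view": v}
    for t, s, v in zip_longest(titles, stars, views, fillvalue=None)
]
```

Here rows has two entries while padded has three. Note that zip_longest only pads at the end: if the listing without a rating sat in the middle of the page, the later ratings would still shift onto the wrong titles. Comparing the lengths of the lists before zipping is a cheap way to notice that case.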

Article last published: 2018-06-15 21:12:40
