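TripAdvisor (猫途鹰)
A crawler works by fetching page data from a server and then extracting and analysing that data with Python parsing libraries. The full process is shown below, followed by notes on the problems encountered along the way.
1. How the server and the local client exchange data
2. How to parse a real web page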
from bs4 import BeautifulSoup
import requests
import time

# One URL per results page: the oa{} offset advances by 30 entries per page.
urls = ['https://www.tripadvisor.cn/Attractions-g187147-Activities-c47-oa{}-Paris_Ile_de_France.html#FILTERED_LIST'.format(str(i)) for i in range(0, 180, 30)]

# Request headers (User-Agent and Cookie) as sent by the browser.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
    'Cookie': 'ServerPool=X; TART=%1%enc%3AJrLr2lvxwNlLbH9Cmhye81h4fhzdErdMYRa5jIgQ%2BzMdQzRRGJHP%2BEwOn0Pk%2B7RjAypFF0poxcI%3D; TAunique=%1%enc%3A2cU7ADHy9Eo%2BkIWbO8dIhRvcFs06Zy%2F3vKdc%2Bd3i34gVAETMq8nxvA%3D%3D; TASSK=enc%3AAO0kkqxQ6UQrxO%2Fhkulabq0%2FgYgi6LuHCDMDfxtJkh4LERyb5A9E2%2FKatL80BtAkileXZDy3kvSOK7CHrCLzCQ23W40ydDWAbiH2fJ1WXXdRpNYcX%2FqFl3XA4gaaqM6ZeA%3D%3D; VRMCID=%1%V1*id.16631*llp.%2F-a_ttcampaign%5C.MTYpc-a_ttgroup%5C.title-m16631*e.1529666639404; _ga=GA1.2.130210118.1529061841; _gid=GA1.2.969501517.1529061841; _smt_uid=5b23a1d2.4e54e361; __gads=ID=826c32b0d192b76d:T=1529061847:S=ALNI_MaQj-S3SBC0F86Wrv6BWEmJRhlB0A; CommercePopunder=SuppressAll*1529061865893; ki_r=; TAAuth3=3%3Adf9baacbcf8f189f276b1a5c29e15b62%3AABSHViFhFqb1vgGz0nQ1zKy3RFlL3VHov1qFBzyJY1diYONpPht1Vnv2LCsUNojv60oiLMYJzj8gWWMB1Gkji%2FNpJw%2FwPFAZ7lkigK3UdltaJehxgMM1MGd7i%2BbXmId%2Fs7HB5w%2F1ezojK0b7n9MQXUdQliAXeStS1SzWK%2BRMop3nNuU3H6o3oOHl9Rt4ltQKUw%3D%3D; MobileLastViewedList=%1%%2FAttractions-g187147-Activities-c47-Paris_Ile_de_France.html; interstitialCounter=-1; TATravelInfo=V2*AY.2018*AM.6*AD.24*DY.2018*DM.6*DD.25*A.2*MG.-1*HP.2*FL.3*DSM.1529067170928; '
              'CM=%1%premiumMobSess%2C%2C-1%7Ct4b-pc%2C%2C-1%7CSPHRSess%2C%2C-1%7crestAds%2FRPers%2C%2C-1%7CRCPers%2C%2C-1%7CWshadeSeen%2C%2C-1%7CTheForkMCCPers%2C%2C-1%7CHomeASess%2C4%2C-1%7CPremiumSURPers%2C%2C-1%7CPremiumMCSess%2C%2C-1%7CRestPartSess%2C%2C-1%7CRestPremRSess%2C%2C-1%7CCpmPopunder_1%2C1%2C1529148251%7CCCSess%2C%2C-1%7CCpmPopunder_2%2C1%2C-1%7CPremRetPers%2C%2C-1%7CViatorMCPers%2C%2C-1%7Csesssticker%2C%2C-1%7C%24%2C%2C-1%7Ct4b-sc%2C%2C-1%7CRestAdsPers%2C%2C-1%7CMC_IB_UPSELL_IB_LOGOS2%2C%2C-1%7Cb2bmcpers%2C%2C-1%7CMC_IB_UPSELL_IB_LOGOS%2C%2C-1%7CPremMCBtmSess%2C%2C-1%7CPremiumSURSess%2C%2C-1%7CLaFourchette+banners%2C%2C-1%7Csess_rev%2C%2C-1%7Csessamex%2C%2C-1%7CPremiumRRSess%2C%2C-1%7CSPMCSess%2C%2C-1%7CTheForkORSess%2C%2C-1%7CTheForkRRSess%2C%2C-1%7Cpers_rev%2C%2C-1%7Cmds%2C%2C-1%7CRBAPers%2C%2C-1%7CRestAds%2FRSess%2C%2C-1%7CHomeAPers%2C%2C-1%7CPremiumMobPers%2C%2C-1%7CSPHRPers%2C%2C-1%7CRCSess%2C%2C-1%7CLaFourchette+MC+Banners%2C%2C-1%7CRestAdsCCSess%2C%2C-1%7CRestPartPers%2C%2C-1%7CRestPremRPers%2C%2C-1%7Csh%2C%2C-1%7CLastPopunderId%2C137-1859-null%2C-1%7Cpssamex%2C%2C-1%7CTheForkMCCSess%2C%2C-1%7CCCPers%2C%2C-1%7Cb2bmcsess%2C%2C-1%7CSPMCPers%2C%2C-1%7CPremRetSess%2C%2C-1%7CViatorMCSess%2C%2C-1%7CPremiumMCPers%2C%2C-1%7CPremiumRRPers%2C%2C-1%7CRestAdsCCPers%2C%2C-1%7CTheForkORPers%2C%2C-1%7CPremMCBtmPers%2C%2C-1%7CTheForkRRPers%2C%2C-1%7CRestAdsSess%2C%2C-1%7CRBASess%2C%2C-1%7CSPORPers%2C%2C-1%7Cperssticker%2C%2C-1%7CCPNC%2C%2C-1%7C; TAReturnTo=%1%%2FAttractions-g187147-Activities-Paris_Ile_de_France.html; royBATty=TNI1625!AHvfFP6GU%2Blwk4iVZ0AzyrpCCufht6MXowsnGvilj0IjbceNq1euKmzBt2GMOqFWaUSHiMCOUhrHs%2Fiu0fHYMWBajyJ97jRyEttR9yaX840tAKQUND6vW0o3JIcYXgjdkO3J4lFTseSHKDIZem%2FBrHlR1JF9frXGbBh3kQvWi8Xk%2C1; ki_t=1529061844713%3B1529061844713%3B1529067216226%3B1%3B28; '
              'TAsession=%1%V2ID.21FD898339223DC08F21308BF888E17F*SQ.134*MC.16631*LR.https%3A%2F%2Fsp0%5C.baidu%5C.com%2F9q9JcDHa2gU2pMbgoY3K%2Fadrc%5C.php%3Ftpl%3Dtpl_11534_17355_13016%26l%3D1504452536%26wd%3D%25E7%258C%25AB%25E9%2580%2594%25E9%25B9%25B0%26issp%3D1%26f%3D8%26ie%3Dutf-8%26rqlang%3Dcn%26tn%3Dbaiduhome_pg%26inputT%3D3211*LP.%2F-a_ttcampaign%5C.MTYpc-a_ttgroup%5C.title-m16631*PR.427%7C*LS.DemandLoadAjax*GR.75*TCPAR.41*TBR.76*EXEX.67*ABTR.62*PHTB.85*FS.25*cpu.80*HS.recommended*ES.popularity*AS.popularity*DS.5*SAS.popularity*FPS.oldFirst*TS.7067C40EA7A60B512E55A582616B88D6*FA.1*DF.0*MS.-1*RMS.-1*FLO.187147*TRA.true*LD.187147; TAUD=LA-1529057833220-1*RDD-1-2018_06_15*HDD-4144951-2018_06_24.2018_06_25*LD-9463016-2018.6.24.2018.6.25*LG-9463017-2.0.F.'
}
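def get_data(url, data=None):
    # Fetch one results page, then pause so requests are not sent too quickly.
    wb_data = requests.get(url, headers=headers)
    time.sleep(4)
    # Parse the response body (as text) with the lxml parser.
    soup = BeautifulSoup(wb_data.text, 'lxml')
    # Each select() call returns a list of matching tags.
    titles = soup.select('p.listing_title > a')
    stars = soup.select('p.wrap > p.rs.rating > span[alt]')
    views = soup.select('span.more > a')
    # zip() pairs the i-th title, star and view into one record.
    for title, star, view in zip(titles, stars, views):
        data = {
            'title': title.get_text(),
            'star': star.get('alt'),
            'view': view.get_text(),
        }
        print(data)


for single_url in urls:
    get_data(single_url)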
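As a quick offline check of the pagination logic, the sketch below rebuilds the same URL list and prints it. It assumes, as the script does, that each results page holds 30 entries and that the `oa{}` segment is the entry offset; the constant name `BASE` is introduced here only for illustration, and no request is sent.

```python
# Rebuild the paginated URL list from the script without sending any request.
BASE = ('https://www.tripadvisor.cn/Attractions-g187147-Activities-c47-oa{}'
        '-Paris_Ile_de_France.html#FILTERED_LIST')

# range(0, 180, 30) yields the offsets 0, 30, ..., 150: six pages in total.
urls = [BASE.format(offset) for offset in range(0, 180, 30)]

for page_url in urls:
    print(page_url)
```

Printing the list before crawling makes it easy to paste one or two of the URLs into a browser and confirm that the offset really selects the page you expect.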
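- Import the libraries.
- The headers tell the server who is making the request: User-Agent identifies the client as a browser, and Cookie carries the session, which is much like having logged in with an account and password.
- You cannot scrape data without a site, so first pick the target:
  urls = 'xxxxx'
  To scrape several pages, compare the URLs of consecutive pages to see what changes, then build a list of page URLs with urls = ['xxx{}xx'.format(str(i)) for i in range(0, x, x)].
- Give the site a wink (the request) and let it answer (the response):
  wb_data = requests.get(url, headers=headers)
  Here wb_data is the response object.
- Parse wb_data:
  soup = BeautifulSoup(wb_data.text, 'lxml')
  BeautifulSoup cannot parse the response object directly; it needs the page source as text, hence wb_data.text.
- Locate the information you need through the tags and attributes in the page. Each selection is returned as a list.
- To keep the entries of those lists matched one to one, iterate over zip(xx, xxx, xx) and assign the values to the keys of the data dictionary.
- To avoid requesting pages too quickly and triggering the site's anti-scraping measures, use time.sleep(sec) to space out the requests.
- For TripAdvisor this is enough to scrape the text content, but scraping the images remains unsolved; I will study that further later.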
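One caveat about the zip step: `zip` stops at the shortest input, so if one selector matches fewer elements than the others, rows are dropped or paired with the wrong neighbours without any error. Below is a minimal sketch of a length check, using made-up lists in place of the real `soup.select` results. The helper name `zip_checked` and the sample values are hypothetical and not part of the original script.

```python
def zip_checked(**columns):
    """Zip equally long lists into dicts; raise if the lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('selector results differ in length: {}'.format(lengths))
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]


# Made-up stand-ins for title.get_text(), star.get('alt') and view.get_text().
titles = ['Attraction A', 'Attraction B']
stars = ['4.5', '4.0']
views = ['1,200 reviews', '800 reviews']

rows = zip_checked(title=titles, star=stars, view=views)
print(rows)

# With one list shorter, plain zip would silently return a single row;
# the checked version raises instead.
try:
    zip_checked(title=titles, star=stars[:1], view=views)
except ValueError as err:
    print('mismatch detected:', err)
```

A check like this turns a silent misalignment into a visible error, which is usually the first sign that one of the CSS selectors no longer matches the page.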
Last published: 2018-06-15 21:12:40