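Python Crawler in Practice, Part 4: Scraping Baidu News Articles - 发稿网

Follow our WeChat official account 【哈希大数据】

Earlier posts covered installing scrapy, a getting-started tutorial, and installing and configuring MongoDB. This post shows how to scrape Baidu News with scrapy and store the scraped data in a MongoDB database.

Crawl target

Given a keyword, scrape every news result that Baidu News returns for it: the news title, the news link, and the news source.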
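As a rough illustration of "given a keyword", the search URL can be built with the standard library alone. This is a minimal sketch: the endpoint and parameter names are the ones used by the spider in this post, while the function name build_search_url and the reading of pn as a result offset and rn as a page size are assumptions. Whether the endpoint still answers in this form is not verified here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint taken from the spider in this post.
BASE_URL = "http://news.baidu.com/ns?"


def build_search_url(keyword, offset=0, per_page=20):
    """Return a Baidu News search URL for `keyword`.

    Assumption: `pn` is the result offset and `rn` is the page size.
    """
    query = {
        "word": keyword,
        "pn": offset,
        "cl": "2",
        "ct": "1",
        "tn": "news",
        "rn": str(per_page),
        "ie": "utf-8",
    }
    # urlencode percent-encodes the non-ASCII keyword as UTF-8.
    return BASE_URL + urlencode(query)


if __name__ == "__main__":
    print(build_search_url("比特币"))
```

Keeping the URL construction in one function makes it easy to swap the keyword without touching the parsing code.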

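Technical approach

The crawl is built on the scrapy framework. The parse function extracts the fields we want, the items and pipelines settings store the data in MongoDB, and a middleware sets the User-Agent header so the site is less likely to throttle the crawler, which keeps the crawl speed fairly high.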
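The post does not show the middleware itself, so here is a minimal sketch of one way to do it: a downloader middleware that picks a random User-Agent for each request. The class name RandomUserAgentMiddleware and the sample strings in USER_AGENTS are hypothetical; process_request(request, spider) is the standard Scrapy downloader-middleware hook, and returning None lets Scrapy continue handling the request.

```python
import random

# Hypothetical sample values; replace them with the browser strings you need.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/604.4.7 "
    "(KHTML, like Gecko) Version/11.0.2 Safari/604.4.7",
    "Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def __init__(self, user_agents=USER_AGENTS):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        # Overwrite the header before the request is downloaded.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # None tells Scrapy to keep processing this request
```

To take effect, a class like this would have to be listed under DOWNLOADER_MIDDLEWARES in settings.py with its full import path.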

Target site analysis

Part 3 of this series showed how to use the Chrome browser to get the XPath or CSS path of a tag. This time the news data is parsed with CSS selectors.

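Program structure

Step 1: Open a command-line window and change to the directory where you want to create the project.

Step 2: On the command line, enter: scrapy startproject xxxx (the project name) to create a new project.

Step 3: Change into the project directory and enter: scrapy genspider <spider name> <site to crawl> to create a new spider.

Step 4: Edit items.py to define the item.

Step 5: Write the crawling code.

Step 6: Edit pipelines and middlewares.

Step 7: Start the spider: open a command line, change to the project directory, and enter: scrapy crawl <spider name>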

The code in items.py is as follows:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy import Item, Field


class mmwzItem(Item):
    # define the fields for your item here like:
    # name = Field()
    article_title = Field()
    article_url = Field()
    article_catchroad = Field()
    article_source = Field()

The spider in this example is named xinwen. The code of the spider file xinwen.py is:

# -*- coding: utf-8 -*-
import re
from urllib.parse import urlencode

from scrapy import Spider, Request

from scrapy_mmwz.items import mmwzItem


class XinwenSpider(Spider):
    name = 'xinwen'
    keyword = '比特币'
    page = 0

    # Query parameters
    data = {
        'word': keyword,
        'pn': page,
        'cl': '2',
        'ct': '1',
        'tn': 'news',
        'rn': '20',
        'ie': 'utf-8',
    }
    # Build the query-string part of the URL
    params = urlencode(data)
    base = 'http://news.baidu.com/ns?'
    url = base + params
    allowed_domains = ['news.baidu.com']
    start_urls = [url]

    def parse(self, response):
        if response.status == 200:
            news_lists = response.css('#wrapper_wrapper #content_left div.result')
            page_number = response.css('p#page strong span.pc::text').extract_first()
            print('page_number:', page_number)
            # news_lists is a SelectorList; loop over it to handle one result at a time
            for news in news_lists:
                # The source name is the text before the first \xa0 in p.c-author
                author = news.css('p.c-author::text').extract_first() or ''
                source = re.search(r'(.*?)\xa0', author, re.S)
                lists = {
                    'article_url': news.css('.c-title a::attr(href)').extract_first(),
                    'article_title': news.css('.c-title a::text').extract_first(),
                    'article_catchroad': 'baidu',
                    'article_source': source.group(1) if source else None,
                }
                # Create a fresh item per result so yielded items do not share state
                item = mmwzItem()
                for field in item.fields:
                    if field in lists:
                        item[field] = lists[field]
                # yield works like return, but hands back one item at a time
                yield item

            next_text = response.css('p#page a:last-child::text').extract_first()
            print(next_text)
            if next_text == '下一页>':
                # Build the link to the next page
                next_page = 'http://news.baidu.com' + response.css('p#page a:last-child::attr(href)').extract_first()
                # Request the next page and parse it with this same method
                yield Request(next_page, callback=self.parse)
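The source-name extraction is the most fragile part of the spider: it relies on the p.c-author text containing the source name followed by a non-breaking space (\xa0), and re.search returns None when that is not the case. The helper below is a small sketch of the same regular expression with that case handled. The name extract_source and the sample string are hypothetical, made up to match the pattern the spider expects.

```python
import re

# Everything before the first non-breaking space is treated as the source name.
SOURCE_RE = re.compile(r"(.*?)\xa0", re.S)


def extract_source(author_text):
    """Return the text before the first non-breaking space, or None."""
    if not author_text:
        return None
    match = SOURCE_RE.search(author_text)
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    # Hypothetical sample in the "source, two nbsp, timestamp" shape.
    sample = "新浪财经\xa0\xa02018年01月10日 10:00"
    print(extract_source(sample))
    print(extract_source(None))
```

Returning None instead of raising lets the spider still yield the title and link when the source cannot be parsed.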

The pipelines settings save the scraped data to the MongoDB database. The full code is:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo


class MongoPipeline(object):
    collection_name = 'xinwen'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        # Connect once when the spider starts
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    # def process_item(self, item, spider):
    #     self.db['user'].update({'url_token': item['url_token']}, {'$set': item}, True)
    #     return item

    def process_item(self, item, spider):
        # insert_one replaces Collection.insert, which was removed in PyMongo 4
        self.db[self.collection_name].insert_one(dict(item))
        return item

Two things to note here: first, install the pymongo package; second, add the following to settings.py:

MONGO_URI = 'localhost'
MONGO_DATABASE = 'xinwen'
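The pipeline also has to be registered in ITEM_PIPELINES before Scrapy will call it, as the header comment in pipelines.py points out. A minimal sketch of that entry follows, assuming the project package is scrapy_mmwz (as in the spider's import) and the class sits in the default pipelines.py; the value 300 is an arbitrary priority in the customary 0 to 1000 range.

```python
# settings.py (fragment)

# Enable the MongoDB pipeline. Lower numbers run earlier when several
# pipelines are configured; 300 is an arbitrary choice here.
ITEM_PIPELINES = {
    "scrapy_mmwz.pipelines.MongoPipeline": 300,
}
```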

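One more setting in settings.py deserves special attention:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

Many websites publish a robots.txt that disallows crawling their pages. If this parameter is set to True, scrapy skips every URL those rules disallow, so many sites cannot be crawled.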
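To see what "disallowed by robots.txt" means in practice, the rule check can be reproduced with the standard library. This is only an illustration: Scrapy uses its own robots.txt middleware rather than this module, and the rules in SAMPLE_ROBOTS_TXT and the function name is_allowed are hypothetical, not the real rules of any site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /ns
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://news.baidu.com/ns?word=x", "http://news.baidu.com/"):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, url))
```

With ROBOTSTXT_OBEY = True, requests whose URLs fail a check like this are dropped before they are downloaded.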

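For the complete project code, see my GitHub: https://github.com/kxylxx/scrapy_testproject

To make the scraped data easier to browse, I recommend Robo 3T, a GUI client for MongoDB. For a detailed installation and usage guide, see https://www.cnblogs.com/dacongge/p/7346037.html , with thanks to 大葱哥 for sharing it.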

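Results and summary

The scraped data is shown in the figure below:

Summary:

I hope this exercise gives you a deeper understanding of the scrapy framework, and that you can use this example to design your own scrapy-based crawler and get familiar with the whole crawling workflow. I also encourage you to type the code out yourself, finding and fixing problems as you go.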
