Why Use MongoDB for Python Crawlers - CTO智库

Why choose MongoDB as the storage backend for a Python crawler? Here are some reasons:

Support for semi-structured data

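A Python crawler pulls its data from the internet, where structure varies widely from page to page and no single schema applies. MongoDB stores semi-structured data, so raw scraped records can go straight into it without defining a schema up front, which greatly simplifies crawler development.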
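
As a rough illustration of the point above, the sketch below stores records with different shapes in one collection without declaring a schema first. The helper name `store_raw`, the `fetched_at` field, and the sample records are hypothetical; `collection` is assumed to be a PyMongo `Collection`.

```python
from datetime import datetime, timezone


def store_raw(collection, records):
    """Insert scraped records as-is, adding only a fetch timestamp.

    `collection` is assumed to be a pymongo Collection. The records may
    have completely different keys, since MongoDB enforces no schema
    by default.
    """
    now = datetime.now(timezone.utc)
    docs = [{**record, "fetched_at": now} for record in records]
    if docs:  # insert_many raises on an empty list
        collection.insert_many(docs)
    return len(docs)


# Hypothetical records scraped from two differently structured pages.
sample_records = [
    {"url": "https://example.com/a", "title": "Post A", "tags": ["python"]},
    {"url": "https://example.com/b", "price": 19.9, "specs": {"color": "red"}},
]

# Usage, assuming a MongoDB server on localhost:27017:
#   from pymongo import MongoClient
#   store_raw(MongoClient()["crawler"]["pages"], sample_records)
```

Both records land in the same collection even though they share only the `url` key.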

Strong scalability

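Crawler datasets are often very large, and MongoDB is designed to cope with growing data volumes. It also supports distributed deployment through replica sets and sharding, which suits crawlers that themselves run on distributed systems.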

Efficient storage and lookup

MongoDB can query data quickly, so scraped information is easy to look up. It also supports indexes, which speed up queries on the indexed fields.
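
To make the indexing point concrete, here is a minimal sketch that creates two indexes and runs a lookup. The function names and the `url` and `fetched_at` fields are hypothetical; `collection` is assumed to be a PyMongo `Collection`, whose `create_index` and `find_one` methods are used.

```python
def ensure_indexes(collection):
    """Create indexes for the fields a crawler is likely to query on.

    `collection` is assumed to be a pymongo Collection. The field names
    `url` and `fetched_at` are hypothetical.
    """
    # A unique ascending index on the URL speeds up lookups by URL
    # and rejects a second document with the same URL.
    url_index = collection.create_index([("url", 1)], unique=True)
    # A descending index helps "most recently fetched first" queries.
    time_index = collection.create_index([("fetched_at", -1)])
    return [url_index, time_index]


def find_by_url(collection, url):
    """Look up one page by URL; MongoDB can serve this from the url index."""
    return collection.find_one({"url": url})


# Usage, assuming a MongoDB server on localhost:27017:
#   from pymongo import MongoClient
#   pages = MongoClient()["crawler"]["pages"]
#   ensure_indexes(pages)
#   find_by_url(pages, "https://example.com/a")
```

Calling `create_index` again for an index that already exists with the same options is harmless, so `ensure_indexes` can run at every crawler start.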

Below are two examples of storing Python crawler data in MongoDB:

Example 1: Crawling Baidu Tieba with PyMongo

This example shows how to crawl Baidu Tieba with Python 3.x and store the data in MongoDB using PyMongo. It runs in a Python 3.x environment with Requests, PyMongo, and BeautifulSoup4 installed. The steps are:

Install the Requests, PyMongo, and BeautifulSoup4 libraries.

pip install requests pymongo beautifulsoup4

Create a Python file named baidu_tieba.py and copy the following code into it.

```python
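import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

# Connect to MongoDB on localhost:27017 (the MongoClient default).
client = MongoClient()
db = client['baidu_tieba']
collection = db['posts']


def get_html(url):
    """Return the page body, or None if the response status is not 200."""
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        return response.text
    return None


def first_text(node, selector):
    """Return the stripped text of the first match, or None if absent."""
    found = node.select_one(selector)
    return found.text.strip() if found else None


def parse_html(html):
    """Yield one dict per thread found on the list page."""
    # These CSS selectors depend on Tieba's page markup and may need updating.
    soup = BeautifulSoup(html, 'html.parser')
    for li in soup.select('#thread_list li.j_thread_list'):
        yield {
            'title': first_text(li, '.j_th_tit'),
            'author': first_text(li, '.frs-author-name'),
            'reply': first_text(li, '.threadlist_rep_num'),
            'view': first_text(li, '.threadlist_text'),
        }


def save_to_mongodb(posts):
    for post in posts:
        collection.insert_one(post)


if __name__ == '__main__':
    # kw is the URL-encoded forum name; pn is the paging offset, stepped by 50.
    url_format = 'https://tieba.baidu.com/f?kw=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&ie=utf-8&pn={}'
    for i in range(1):  # increase the range to crawl more pages
        url = url_format.format(i * 50)
        html = get_html(url)
        if html is None:
            continue  # skip pages that failed to download
        save_to_mongodb(parse_html(html))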
```

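The program first obtains handles to a MongoDB database named baidu_tieba and a collection named posts; MongoDB creates both on the first insert. It then fetches a Baidu Tieba thread list page, parses it, and saves the results to MongoDB. The parsed fields are the post title, author, reply count, and view count.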
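
Running such a script twice inserts every post twice, because `insert_one` always adds a new document. One possible way around that is an upsert, sketched below. The helper name `upsert_post` is hypothetical, and using the title plus author as the identity of a post is an assumption of this sketch, not something the example above defines; `collection` is assumed to be a PyMongo `Collection`.

```python
def upsert_post(collection, post):
    """Insert a post, or update it if the same title/author pair exists.

    `collection` is assumed to be a pymongo Collection. Treating
    title + author as the identity of a post is an assumption of this
    sketch; a scraped thread ID would be a more robust key.
    """
    key = {"title": post["title"], "author": post["author"]}
    # upsert=True inserts a new document when nothing matches `key`;
    # otherwise $set overwrites the listed fields on the match.
    collection.update_one(key, {"$set": post}, upsert=True)
    return key


# Usage, assuming a MongoDB server on localhost:27017:
#   from pymongo import MongoClient
#   posts = MongoClient()["baidu_tieba"]["posts"]
#   upsert_post(posts, {"title": "t", "author": "a", "reply": "3", "view": "..."})
```

With this in place, a second run refreshes the reply count of known posts instead of duplicating them.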

Example 2: Crawling a site that uses AJAX requests with Scrapy

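This example shows how to use Scrapy with Python 3.x to crawl a site that loads its content through AJAX requests and store the data in MongoDB. It runs in a Python 3.x environment with Scrapy, PyMongo, and scrapy-splash installed; scrapy-splash also needs a running Splash instance to render the pages. The steps are: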

Install the Scrapy, PyMongo, and scrapy-splash libraries.

pip install scrapy pymongo scrapy-splash

Create a Scrapy spider file named quote_spider.py and copy the following code into it.

```python
import scrapy
from pymongo import MongoClient
from scrapy_splash import SplashRequest

# Requires a running Splash instance and the scrapy-splash settings
# (SPLASH_URL and its middlewares) in the project's settings.py.


class QuoteSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/js/']

    # One client for the spider; connects to localhost:27017 by default.
    client = MongoClient()
    db = client['quotes']
    collection = db['quotes']

    # Lua script executed by Splash: load the page, wait for its JavaScript,
    # try to scroll the "load-more" element into view, return the rendered HTML.
    script = """
        function main(splash, args)
            assert(splash:go(args.url))
            splash:wait(1)
            splash:runjs('document.getElementById("load-more").scrollIntoView();')
            splash:wait(1)
            return splash:html()
        end
    """

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, endpoint='execute', args={
                'lua_source': self.script
            })

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
            # Insert a copy: insert_one adds an _id field to the dict it receives.
            self.collection.insert_one(dict(item))
            yield item

        # Follow the pagination link, rendering the next page through Splash too.
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield SplashRequest(response.urljoin(next_page), callback=self.parse, endpoint='execute', args={
                'lua_source': self.script
            })

    def closed(self, reason):
        # Called by Scrapy when the spider finishes; release the connection.
        self.client.close()
```

When crawling quotes.toscrape.com, the spider relies on Splash to execute the page's JavaScript and obtain the actual data. After crawling, it saves each quote to MongoDB as a dictionary.
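
Scrapy also offers item pipelines as a place for the MongoDB write, which keeps storage logic out of the spider. The sketch below follows that pattern. The class name `MongoPipeline` and the `MONGO_URI` and `MONGO_DATABASE` setting names are assumptions of this sketch (they are custom settings, not built into Scrapy), as are the default values. If the spider already inserts items itself, enabling this pipeline as well would store each item twice.

```python
class MongoPipeline:
    """Item pipeline sketch that writes every scraped item to MongoDB."""

    collection_name = "quotes"  # hypothetical collection name

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI and MONGO_DATABASE are custom settings assumed to be
        # defined in settings.py; the fallbacks are illustrative defaults.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "quotes"),
        )

    def open_spider(self, spider):
        # Imported here so that pymongo is only needed once a crawl starts.
        from pymongo import MongoClient

        self.client = MongoClient(self.mongo_uri)
        self.collection = self.client[self.mongo_db][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Insert a copy so the _id added by insert_one does not leak
        # into the item passed on to later pipelines or exporters.
        self.collection.insert_one(dict(item))
        return item
```

To enable it, the project's settings.py would list the class in `ITEM_PIPELINES`, for example `{"myproject.pipelines.MongoPipeline": 300}`, where `myproject.pipelines` is a placeholder for wherever the class actually lives.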