爬虫 - 景天博客

IntelliJ 没有 bs4 的模块的问题：

将光标移动到此名称，然后按Alt-Enter。应该有一个安装模块的选项。由于IntelliJ已经知道虚拟环境，因此它会将模块安装在正确的位置，以便您的项目可用。

mac上有多个Python版本时指定版本安装库

要给Python3安装requests

sudo python3 -m pip install requests

sudo python3 -m pip install lxml

Beautiful Soup使用

参考

功能

from bs4 import BeautifulSoup
import requests

url = 'http://bj.58.com/pingbandiannao/24604629984324x.shtml'

wb_data = requests.get(url)
soup = BeautifulSoup(wb_data.text, 'lxml')

四大对象种类

Tag：标签及其内容
- soup.title、soup.head、soup.a
- 标签名：soup.head.name
- 标签属性：soup.p.attrs
  - 获取属性值：soup.p[‘class’]、soup.p.get(‘class’)
NavigableString：标签内部的文字
- soup.p.string
BeautifulSoup：可以当做一个特殊的 Tag
Comment：可以当做一个特殊的 BeautifulSoup，输出的内容仍然不包括注释符号
- if type(soup.a.string)==bs4.element.Comment: 判断是否为 Comment 类型

遍历文档树

直接子节点
- .contents：可以将tag的子节点以列表的方式输出
- .children：是一个 list 生成器对象，可以通过遍历取得对象
所有子孙节点
- .descendants：可以对所有tag的子孙节点进行递归循环，和 children类似，我们也需要遍历获取其中的内容。
节点内容
- .string
多个内容
- .strings
- .stripped_strings：可以去除多余空白内容
父节点
- .parent
- .parents：全部父节点
兄弟节点
- .next_sibling、 .previous_sibling
- .next_siblings .previous_siblings：全部兄弟节点
前后节点：它并不是针对于兄弟节点，而是在所有节点，不分层次
- .next_element .previous_element
- .next_elements .previous_elements：所有前后节点

搜索文档树

find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

find_all() 方法搜索当前tag的所有tag子节点,并判断是否符合过滤器的条件
- name: name 参数可以查找所有名字为 name 的tag,字符串对象会被自动忽略掉
``` // 传字符串 soup.find_all(‘b’) // 传正则表达式,通过正则表达式的 match() 来匹配内容 for tag in soup.find_all(re.compile(“^b”)): print(tag.name) // 传列表,将与列表中任一元素匹配的内容返回 soup.find_all([“a”, “b”]) // True,可以匹配任何值 for tag in soup.find_all(True): print(tag.name) // 传方法 def has_class_but_no_id(tag): return tag.has_attr(‘class’) and not tag.has_attr(‘id’)
```
  soup.find_all(has_class_but_no_id)
 		```
```
- attrs:
> 注意：如果一个指定名字的参数不是搜索内置的参数名,搜索时会把该参数当作指定名字tag的属性来搜索,如果包含一个名字为 id 的参数,Beautiful Soup会搜索每个tag的”id”属性 ``` soup.find_all(id=’link2’) soup.find_all(href=re.compile(“elsie”)) soup.find_all(href=re.compile(“elsie”), id=’link1’) // 在这里我们想用 class 过滤，不过 class 是 python 的关键词，这怎么办？加个下划线就可以 soup.find_all(“a”, class_=”sister”) # [Elsie,
Lacie,

Tillie]
```
  // 有些tag属性在搜索不能使用,比如HTML5中的 data-* 属性, 但是可以通过 find_all() 方法的 attrs 参数定义一个字典参数来搜索包含特殊属性的tag
  data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
  data_soup.find_all(attrs={"data-foo": "value"})
 		```
```
- recursive:调用tag的 find_all() 方法时,Beautiful Soup会检索当前tag的所有子孙节点,如果只想搜索tag的直接子节点,可以使用参数 recursive=False
- text: 通过 text 参数可以搜搜文档中的字符串内容.与 name 参数的可选值一样, text 参数接受字符串 , 正则表达式 , 列表, True
``` soup.find_all(text=”Elsie”)
```
  # [u'Elsie']
 
  soup.find_all(text=["Tillie", "Elsie", "Lacie"])
  # [u'Elsie', u'Lacie', u'Tillie']
 
  soup.find_all(text=re.compile("Dormouse"))
  [u"The Dormouse's story", u"The Dormouse's story"]
 		```
```
- limit:可以使用 limit 参数限制返回结果的数量
soup.find_all("a", limit=2)
- kwargs:
find( name , attrs , recursive , text , **kwargs )
find_parents() find_parent()
find_next_siblings() find_next_sibling()
find_previous_siblings() find_previous_sibling()
find_all_next() find_next()
find_all_previous() 和 find_previous()

两种路径描述

// Copy selector: ".wrap"代表CSS样式
body > div.wrap > div > div:nth-child(3) > div.box_con > ul > li:nth-child(1) > a > img

// Copy XPath
/html/body/div[2]/div/div[3]/div[2]/ul/li[1]/a/img

练习

from bs4 import BeautifulSoup
import requests

url = 'http://bj.58.com/pingbandiannao/24604629984324x.shtml'

wb_data = requests.get(url)
soup = BeautifulSoup(wb_data.text, 'lxml')

images = soup.select('body > div.wrap > div > div:nth-child(3) > div.box_con > ul > li:nth-child(1) > a > img')
// 会选择除li标签中所有的img
images = soup.select('body > div.wrap > div > div:nth-child(3) > div.box_con > ul > li > a > img')

info = []

for image, dec, name in zip(images, decs, names):
    print(image, dec, name)
    data = {
        'image' : image.get('src'),
        'dec' : dec.get_text(),
        'name' : list(name.stripped_strings)
    }
    info.append(data)

print(info)

路径不需要全路径： div.box_con > ul > li > a > img也可以，只要能筛选出路径路径可以是其他形式：img[width="160"],选择图片宽度是160

mongodb 数据库

import pymongo

client = pymongo.MongoClient('localhost', 27017)
xiaozhu = client['xiaozhu']
bnb_info = xiaozhu['bnb_info']

bnb_info.insert_one({"name": "zhangsan", "age": 18})

mysql 数据库

import pymysql

conn = pymysql.connect(host="localhost", port=3306, user="root", password="wang0302", db="word")
cursor = conn.cursor()

cursor.execute('insert into sentence values (%s)', qus)

conn.commit()
conn.close()

让数据说话

提出正确的问题
- 解释现象
- 验证假设
- 探索方向
通过数据论证，寻找答案
- 对比
- 细分
- 溯源
解读数据，回答问题
- 数据的正确性
- 样本问题
- 因果关联
- 忽略前提

爬虫

IntelliJ 没有 bs4 的模块的问题：

mac上有多个Python版本时指定版本安装库

Beautiful Soup使用

功能

四大对象种类

遍历文档树

搜索文档树

Lacie,

Tillie]

两种路径描述

练习

mongodb 数据库

mysql 数据库

让数据说话

FEATURED TAGS

FRIENDS