python爬虫三种数据解析

最近在学习爬虫，对爬虫中的数据解析规则作一下记录。

re解析

元字符

.    匹配除换⾏符以外的任意字符
\w   匹配字⺟或数字或下划线
\s   匹配任意的空⽩符
\d   匹配数字
\n   匹配⼀个换⾏符
\t   匹配⼀个制表符

^    匹配字符串的开始
$    匹配字符串的结尾

\W   匹配⾮字⺟或数字或下划线
\D   匹配⾮数字
\S   匹配⾮空⽩符
a|b  匹配字符a或字符b
()   匹配括号内的表达式，也表示⼀个组
[…]  匹配字符组中的字符
[^…] 匹配除了字符组中字符的所有字符

量词

*     重复零次或更多次
+     重复⼀次或更多次
?     重复零次或⼀次
{n}   重复n次
{n,}  重复n次或更多次
{n,m} 重复n到m次

贪婪匹配和惰性匹配

.*   贪婪匹配

.*?  惰性匹配 (让*尽可能少出现的结果)

.*?  表示尽可能少的匹配, .*表示尽可能多的匹配

bs4解析

Beautiful Soup以代码举例

import requests
from bs4 import BeautifulSoup
import time

# url
url = "https://www.youmeitu.com/weimeitupian/"
page = "https://www.youmeitu.com"
# 请求
resp = requests.get(url)
resp.encoding = 'utf-8'
#用BeautifulSoup处理
main_page = BeautifulSoup(resp.text, "html.parser")
alist = main_page.find("div", class_="TypeList").find_all("a", class_="TypeBigPics")

for a in alist:
    # 添加链接头，网页为相对路径
    href = page + a.get('href')
    # 请求子页面
    child_resp = requests.get(href)
    child_resp.encoding = 'utf-8'
    child_text = child_resp.text
    child_page = BeautifulSoup(child_text, "html.parser")
    #找到<img>标签下的src用于下载图片
    div = child_page.find("div", class_="ImageBody").find("img")
    img = div.get("src")
    src = page + img
    img_resp = requests.get(src)
    # img_resp.content
    img_name = src.split("/")[-1]
    #写入文件
    with open("img/" + img_name, mode="wb") as f:
        f.write(img_resp.content)
        print("over!!! " + img_name)
        time.sleep(1)

print("all_over!")

resp.close()
f.close()

常见检索方法有 x.find(), x.find_all()
其中find_all()返回的是迭代器, find只返回第一个值
例如上文的

alist = main_page.find("div", class_="TypeList").find_all("a", class_="TypeBigPics")

xpath解析

xpath 类似于目录索引，较为常用

from lxml import etree
import requests

url = "https://chongqing.zbj.com/search/f/?kw=saas"
resp = requests.get(url)
resp.encoding = "utf-8"
html = etree.HTML(resp.text)

divs = html.xpath("/html/body/div[6]/div/div/div[2]/div[5]/div[1]/div")
for div in divs:
    tile = div.xpath("./div/div/a[2]/div[2]/div[2]/p/text()")
    title = "saas".join(tile)
    price = div.xpath("./div/div/a[2]/div[2]/div[1]/span[1]/text()")[0].strip("¥")
    company = div.xpath("./div/div/a[1]/div[1]/p/text()")[1].strip()
    location = div.xpath("./div/div/a[1]/div[1]/div/span/text()")
    print(title, price, company, location)

resp.close()