[Python] Web scraping in Python with Beautiful Soup + Pandas + PyQt5 + Selenium
Beautiful Soup, pandas, PyQt5, and Selenium are a very handy set of Python modules for web scraping.
Beautiful Soup extracts the key content we need from parsed HTML source.
Pandas is similar to Beautiful Soup, except it focuses on pulling tables out of the source.
PyQt5's role here is to act as a browser: it executes the JavaScript in the page so that we can scrape what the JS actually returns.
I've created a test page under my Flask env, and we'll use these modules to run some simple scraping tests against it:
http://flask.showerlee.com/scrapingtest/
Environment
OS: Windows 7 x64
Python: Python3.6.2
Git Bash: Git-2.15.1.2-64-bit
I. Environment setup:
1. Install and launch Git Bash
2. Install Python and verify the version
# python -V
Python 3.6.2
3. Install the scraping modules
# python -m pip install beautifulsoup4 lxml pandas html5lib pyqt5 selenium
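To confirm everything imports before moving on, a one-liner is enough. One caveat for newer setups: PyQt5 releases from 5.12 onward ship the QtWebEngine bindings (used in section IV) as a separate PyQtWebEngine package, which would need its own pip install.
# python -c "import bs4, pandas, lxml, html5lib, selenium, PyQt5; print('modules ok')"
modules ok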
II. Beautiful Soup demo
# vi ~/scrap1.py
Tip: the io and sys imports at the top switch the default standard output encoding to utf-8. This guarantees that whatever encoding the scraped source page uses, we won't hit a UnicodeEncodeError when printing what BS has parsed.
import io
import sys
import bs4 as bs
import urllib.request
# Switch the default stdout encoding to utf-8
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')
# Fetch the page and decode it as utf-8
sauce = urllib.request.urlopen(
    'http://flask.showerlee.com/scrapingtest/').read().decode('utf-8')
# Parse the page source with BS, using lxml to normalize the markup
soup = bs.BeautifulSoup(sauce, 'lxml')
# Print the full page source
print(soup)
# Source of the <title> tag
print(soup.title)
# Name of the <title> tag
print(soup.title.name)
# Text of the <title> tag
print(soup.title.string)
print(soup.title.text)
# Source of the first <p> tag
print(soup.p)
# Source of every <p> tag
print(soup.find_all('p'))
# Text of every <p> tag
for paragraph in soup.find_all('p'):
    print(paragraph.text)
# All text on the page
print(soup.get_text())
# Text of every <a> tag
for url in soup.find_all('a'):
    print(url.text)
# href of every <a> tag
for url in soup.find_all('a'):
    print(url.get('href'))
# Source of the <nav> tag
nav = soup.nav
print(nav)
# URLs inside the <nav> tag
for url in nav.find_all('a'):
    print(url.get('href'))
# Paragraph text inside <body>
body = soup.body
for paragraph in body.find_all('p'):
    print(paragraph.text)
# Text inside <div class="body">
for div in soup.find_all('div', class_='body'):
    print(div.text)
# Source of the <table> tag
table = soup.table
# table = soup.find('table')
print(table)
# Content of each table row
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
# python scrap1.py
......
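As an aside, bs4 also understands CSS selectors via select() / select_one(), which can stand in for many of the find()/find_all() calls above. A minimal sketch against the same soup object (the div.body and table selectors assume the page structure used above):
# CSS-selector equivalents of some of the calls above
for p in soup.select('div.body p'):  # every <p> inside <div class="body">
    print(p.text)
print(soup.select_one('table tr'))   # the first table row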
III. Pandas demo
# vi ~/scrap2.py
import pandas as pd
dfs = pd.read_html(
    'http://flask.showerlee.com/scrapingtest/', header=0)
for df in dfs:
    print(df)
# python scrap2.py
  Program Name  Internet Points    Kittens?
0       Python        932914021  Definitely
1       Pascal              532    Unlikely
2         Lisp             1522   Uncertain
3           D#               12    Possibly
4        Cobol                3         No.
5      Fortran            52124        Yes.
6      Haskell               24        lol.
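read_html() returns one DataFrame per <table> found on the page, so the usual pandas toolkit applies from here. A small sketch (the column name is taken from the output above; programs.csv is just an illustrative filename):
df = dfs[0]
print(df['Internet Points'].max())      # largest value in that column
df.to_csv('programs.csv', index=False)  # save the scraped table to CSV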
IV. PyQt5 demo
First, without executing any JS, we use Beautiful Soup directly to grab the content of the <p class="jstest"> tag.
# vi ~/scrap3.py
import io
import sys
import bs4 as bs
import urllib.request
# Switch the default stdout encoding to utf-8
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')
# Fetch the page and decode it as utf-8
sauce = urllib.request.urlopen(
    'http://flask.showerlee.com/scrapingtest/').read().decode('utf-8')
# Parse the page source with BS, using lxml to normalize the markup
soup = bs.BeautifulSoup(sauce, 'lxml')
js_test = soup.find('p', class_='jstest')
print(js_test.text)
# python scrap3.py
No js loaded
As you can see, what we actually scraped is the tag content from before the JS ran: urllib only fetches the raw HTML and never executes the script.
Now we use PyQt5 to scrape the same <p class="jstest"> content, this time after the JS has executed.
# vi ~/scrap4.py
import bs4 as bs
import sys
from PyQt5.QtWebEngineWidgets import QWebEnginePage
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
class Page(QWebEnginePage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ''
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        self.app.exec_()  # block here until Callable() quits the event loop

    def _on_load_finished(self):
        # toHtml() is asynchronous: it returns immediately and delivers
        # the rendered HTML to the callback
        self.toHtml(self.Callable)
        # print('Load finished')

    def Callable(self, html_str):
        self.html = html_str
        self.app.quit()

def main():
    page = Page('http://flask.showerlee.com/scrapingtest/')
    soup = bs.BeautifulSoup(page.html, 'html.parser')
    js_test = soup.find('p', class_='jstest')
    print(js_test.text)

if __name__ == '__main__':
    main()
# python scrap4.py
js loaded successfully
The JS was executed successfully.
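Since page.html now holds the fully rendered markup, it can be handed to the other tools as well. A sketch that reuses the Page class from scrap4.py to feed the JS-rendered HTML to pandas (read_html accepts a raw HTML string; run this in a fresh process, since Page creates its own QApplication):
import pandas as pd
page = Page('http://flask.showerlee.com/scrapingtest/')
dfs = pd.read_html(page.html, header=0)  # parse tables out of the rendered HTML
print(dfs[0])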
V. Selenium demo
First we need to download chromedriver from the official site and drop it into a driver directory next to the script.
Pick the driver release that matches your installed Chrome version: my Chrome is 62.0, so I went with driver 2.35. (Note that the script must not be named selenium.py, or it would shadow the selenium package on import.)
# vi scrap5.py
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
import time, sys, io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')
def scrape():
    chromedriver = r".\driver\chromedriver.exe"
    URL = "http://flask.showerlee.com/scrapingtest/"
    # selenium 3-era API: the first positional argument is the driver path
    driver = webdriver.Chrome(chromedriver)
    # Push the browser window off-screen so it doesn't pop up
    driver.set_window_position(-10000, 0)
    try:
        driver.get(URL)
        # Crude wait for the page's JS to finish
        time.sleep(10)
        result = driver.execute_script("return document.body.innerHTML")
        soup = BeautifulSoup(result, "lxml")
        print(soup)
    except TimeoutException as e:
        print(e)
    finally:
        driver.quit()  # also shuts down the chromedriver process

scrape()
# python scrap5.py
...
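The fixed time.sleep(10) is the weak point of this script: it always burns the full ten seconds, yet still fails if the page is slower than that. Selenium's explicit waits poll for a condition instead. A sketch that could replace the sleep, assuming the page sets the "js loaded" text we saw in section IV:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 10 seconds, until the JS has rewritten the jstest paragraph
WebDriverWait(driver, 10).until(
    EC.text_to_be_present_in_element((By.CLASS_NAME, 'jstest'), 'js loaded'))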
Further reading:
https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
http://pandas.pydata.org/pandas-docs/stable/
http://pyqt.sourceforge.net/Docs/PyQt5/
https://sites.google.com/a/chromium.org/chromedriver/downloads
https://chromedriver.storage.googleapis.com/index.html