[Python] Parallel Tasks with Multiprocessing
When writing Python, if we need to run several tasks at the same time, we can use the built-in multiprocessing module to execute a function in parallel across multiple processes.
1. Run tasks in random order
# vi ~/multipro1.py
import multiprocessing

def spawn(num):
    print('Spawned! {}.'.format(num))

if __name__ == "__main__":
    for i in range(50):
        p = multiprocessing.Process(target=spawn, args=(i,))
        p.start()
2. Run tasks in order, one by one
# vi ~/multipro2.py
import multiprocessing

def spawn(num):
    print('Spawned! {}.'.format(num))

if __name__ == "__main__":
    for i in range(50):
        p = multiprocessing.Process(target=spawn, args=(i,))
        p.start()
        # Wait for each process to finish before starting the next one
        p.join()
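Note that calling join() inside the loop makes the processes run strictly one at a time. If you want them to run concurrently but still wait for all of them before the script exits, a common pattern (a sketch, not part of the original script) is to start everything first and join afterwards:
import multiprocessing

def spawn(num):
    print('Spawned! {}.'.format(num))

if __name__ == "__main__":
    processes = [multiprocessing.Process(target=spawn, args=(i,)) for i in range(50)]
    for p in processes:
        p.start()   # start all processes without blocking
    for p in processes:
        p.join()    # then wait for every process to finish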
Tip: the role of if __name__ == "__main__": here:
1. When the script is executed directly, the built-in variable __name__ is set to "__main__", so the if condition is true and the Python interpreter goes on to run the code below it.
2. When another script imports this script, __name__ is set to the module name "multipro1" rather than "__main__", so the code under the if block is guaranteed not to run on import.
So when writing everyday scripts, the structure below is recommended: the script still works normally when run directly, while other Python scripts can import and call test1(), test2(), test3() from test.py without executing the code under the if condition.
# vi test.py
def test1():
    ....
def test2():
    ....
def test3():
    ....
if __name__ == "__main__":
    ....
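For example (a minimal sketch; the file name caller.py and the calls below are only illustrative), another script can import test.py and use its functions without triggering the if block:
# vi caller.py
import test      # __name__ inside test.py is "test", so its if block is skipped

test.test1()     # the imported functions can still be called explicitly
test.test2()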
3. Handle return values with a process pool
# vi multipro3.py
from multiprocessing import Pool

def job(num):
    return num * 2

if __name__ == '__main__':
    p = Pool(processes=20)
    data = p.map(job, range(5))
    data2 = p.map(job, [5, 2])
    p.close()
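Pool.map blocks until every worker returns and keeps the results in input order, so data is [0, 2, 4, 6, 8] and data2 is [10, 4]. Pool can also be used as a context manager, which closes the pool automatically; a minimal sketch reusing job() from above (the print calls are only for illustration):
if __name__ == '__main__':
    with Pool(processes=20) as p:
        print(p.map(job, range(5)))   # [0, 2, 4, 6, 8]
        print(p.map(job, [5, 2]))     # [10, 4]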
4. Crawl random website URLs with a process pool
# vi ~/multipro4.py
from multiprocessing import Pool
import bs4 as bs
import random
import requests
import string
# Return a random four-letter .com domain URL
def random_starting_url():
    starting = ''.join(random.SystemRandom().choice(
        string.ascii_lowercase) for _ in range(4))
    url = ''.join(['http://', starting, '.com'])
    return url
# Correct URL if it is relative
def handle_local_links(url, link):
    if link.startswith('/'):
        return ''.join([url, link])
    else:
        return link
# Get URLs from the "a" tags inside the 'body' tag
def get_links(url):
    try:
        # Request the URL
        resp = requests.get(url)
        # Parse the HTML
        soup = bs.BeautifulSoup(resp.text, 'lxml')
        # Get the 'body' tag of the HTML
        body = soup.body
        # Get URLs from the "a" tags inside the 'body' tag
        links = [link.get('href') for link in body.find_all('a')]
        # Correct URLs that are relative
        links = [handle_local_links(url, link) for link in links]
        # Encode the URLs to ascii
        links = [str(link.encode('ascii')) for link in links]
        return links
    except TypeError as e:
        print(e)
        print('Got a TypeError, probably got a None that we tried to iterate over')
        return []
    except IndexError as e:
        print(e)
        print('No valid link found, returning an empty list')
        return []
    except AttributeError as e:
        print(e)
        print('Likely got None for links, so we are throwing this')
        return []
    except Exception as e:
        print(str(e))
        return []
def main():
    # Number of worker processes
    process = 5
    # Number of sites to scrape
    site = 3
    p = Pool(processes=process)
    # Build a list of random URLs
    parse_us = [random_starting_url() for _ in range(site)]
    # Fetch the URLs in parallel and collect the link lists
    data = p.map(get_links, parse_us)
    # Flatten the lists of links into a single list of URLs
    data = [url for url_list in data for url in url_list]
    p.close()
    # Write the result to a txt file
    with open('urls.txt', 'w') as f:
        f.write(str(data))
if __name__ == '__main__':
    main()
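Since most random four-letter domains do not resolve, requests.get can wait a long time on each dead host before the generic exception handler kicks in. A common tweak (just a suggestion, not part of the original script) is to pass a timeout inside get_links so a dead site fails fast:
        # e.g. give up after 5 seconds instead of waiting indefinitely (illustrative tweak)
        resp = requests.get(url, timeout=5)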