"Playing with Python": Building a Crawler for 100,000 Blog Posts

Preface

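This walkthrough crawls posts from cnblogs (博客园) as its example and is intended for learning and reference only. Some sites are so plastered with ads that they are simply not worth a crawler's effort.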

Crawling

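* Fetch the post with BeautifulSoup

* Convert the HTML to Markdown with html2text

* Save the Markdown to a local file

* Download the images referenced in the Markdown and replace their URLs

* Write the post to the database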
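
The steps above can be sketched as a single driver function. This is only an illustration under assumptions: `crawl_post` and its five callable parameters are hypothetical names, not part of the original source, and each one stands in for one step of the list.

```python
from typing import Callable, Dict


def crawl_post(
    url: str,
    fetch: Callable[[str], Dict[str, str]],
    to_markdown: Callable[[str], str],
    save: Callable[[str, str], str],
    localize_images: Callable[[str], None],
    store: Callable[[str, str, str], None],
) -> str:
    """Run the five steps for one post and return the saved file name.

    Every callable is a hypothetical stand-in for one step of the crawl.
    """
    info = fetch(url)                   # 1. fetch the title and HTML body
    md = to_markdown(info["content"])   # 2. convert HTML to Markdown
    md_file = save(md, info["title"])   # 3. write the .md file
    localize_images(md_file)            # 4. download images, rewrite URLs
    store(info["title"], md, url)       # 5. insert a row into the database
    return md_file
```

Passing the steps in as callables keeps the order of operations visible in one place and lets each step be replaced or tested on its own.
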

Tools

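Third-party libraries used: BeautifulSoup, html2text, and PooledDB (from the DBUtils package). The code also relies on requests for HTTP calls.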

Code

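Fetching the post:

    import requests
    from bs4 import BeautifulSoup

    # Get the title and body of a post.
    # `headers` is assumed to be a request-header dict defined elsewhere in the script.
    def getHtml(blog):
        res = requests.get(blog, headers=headers)
        soup = BeautifulSoup(res.text, 'html.parser')
        # Get the post title
        title = soup.find('h1', class_='postTitle').text
        # Strip surrounding whitespace
        title = title.strip()
        # Get the post body
        content = soup.find('div', class_='blogpost-body')
        # Drop the outer DIV, keeping only its inner HTML
        content = content.decode_contents(formatter="html")
        info = {"title": title, "content": content}
        return info
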
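
`soup.find` returns `None` when a page does not contain the expected element, so a post with a different layout would raise `AttributeError` in the function above. Here is a minimal sketch of a more defensive variant, assuming the same `postTitle` and `blogpost-body` class names. `parse_post` is a hypothetical name, and it takes the HTML text directly so that parsing can be tried without a network request.

```python
from typing import Dict, Optional


def parse_post(html: str) -> Optional[Dict[str, str]]:
    """Extract the title and inner body HTML, or return None if either is missing."""
    # Imported lazily so bs4 is only required when a page is actually parsed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("h1", class_="postTitle")
    body_tag = soup.find("div", class_="blogpost-body")
    if title_tag is None or body_tag is None:
        # The page does not match the expected layout; let the caller skip it.
        return None
    return {
        "title": title_tag.get_text().strip(),
        "content": body_tag.decode_contents(formatter="html"),
    }
```

A caller could then log and skip any URL for which `parse_post` returns `None` instead of aborting the whole crawl.
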
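HTML to Markdown:

    # Uses the open-source third-party library html2text.
    # `text_maker` is assumed to be an html2text.HTML2Text() instance created elsewhere.
    md = text_maker.handle(info['content'])
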
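
The article does not show how `text_maker` is created. One plausible setup, assuming it is a plain `html2text.HTML2Text` instance, might look like this. The option values and the wrapper name `html_to_markdown` are illustrative choices rather than the author's configuration.

```python
def html_to_markdown(html: str) -> str:
    """Convert an HTML fragment to Markdown with html2text."""
    # Imported lazily so html2text is only needed when a conversion runs.
    import html2text

    text_maker = html2text.HTML2Text()
    text_maker.body_width = 0         # do not hard-wrap long lines
    text_maker.ignore_links = False   # keep hyperlinks as Markdown links
    text_maker.ignore_images = False  # keep images so their URLs can be rewritten later
    return text_maker.handle(html)
```

Keeping images in the output matters here, because the later image-download step works on the image references found in the Markdown file.
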
Saving to a local file:

    import codecs
    import sys

    def createFile(md, title):
        print('System default encoding: {}'.format(sys.getdefaultencoding()))
        save_file = str(title) + ".md"
        # print(save_file)
        print('About to write file: {}'.format(save_file))
        # r+ opens a file for reading and writing, with the pointer at the start.
        # w+ opens a file for reading and writing, overwriting it if it exists and creating it otherwise.
        # a+ opens a file for reading and writing in append mode, creating it if it does not exist.
        f = codecs.open(save_file, 'w+', 'utf-8')
        f.write(md)
        f.close()
        print('Finished writing file: {}'.format(f.name))
        return save_file

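
Because the post title is used directly as the file name, a title containing characters such as `/` or `:` can make the write fail or land in an unexpected path. A minimal sketch of a sanitizing helper follows. `safe_filename` and its `max_length` default are hypothetical and not part of the original code.

```python
import re


def safe_filename(title: str, max_length: int = 100) -> str:
    """Turn a post title into a name that is safe to use as a file name."""
    # Replace path separators, characters Windows forbids, and line breaks or tabs.
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', "_", title)
    # Collapse whitespace runs and trim leading/trailing dots and spaces.
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    # Fall back to a placeholder if nothing usable is left.
    return cleaned[:max_length] or "untitled"
```

It could be applied as `createFile(md, safe_filename(title))`.
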
Downloading images locally and replacing their URLs:

    import os
    import re
    import time
    from itertools import chain

    import requests

    # `img_patten` (the image regex) and `headers` are assumed to be defined elsewhere.
    def replace_md_url(md_file):
        """Download the images referenced in the given Markdown file and replace their URLs."""
        if os.path.splitext(md_file)[1] != '.md':
            print('{} is not a Markdown file; skipping.'.format(md_file))
            return
        cnt_replace = 0
        # Store images in a directory named after the current year and month
        dir_ts = time.strftime('%Y%m', time.localtime())
        if not os.path.exists(dir_ts):
            os.makedirs(dir_ts)
        # Open the file as UTF-8
        with open(md_file, 'r', encoding='utf-8') as f:
            post = f.read()
        matches = re.compile(img_patten).findall(post)
        if matches and len(matches) > 0:
            for match in list(chain(*matches)):
                if match and len(match) > 0:
                    array = match.split('/')
                    file_name = array[len(array) - 1]
                    file_name = dir_ts + "/" + file_name
                    img = requests.get(match, headers=headers)
                    # Write in 'wb' mode so a re-run overwrites the image instead of appending to it
                    with open(file_name, 'wb') as img_file:
                        img_file.write(img.content)
                    new_url = "https://blog.52itstyle.vip/{}".format(file_name)
                    # Update the URL in the Markdown text
                    post = post.replace(match, new_url)
                    cnt_replace = cnt_replace + 1
        # If anything was replaced, overwrite the current Markdown file
        if post and cnt_replace > 0:
            url = "https://blog.52itstyle.vip"
            with open(md_file, 'w', encoding='utf-8') as f:
                f.write(post)
            print('{0}: {1} URL(s) replaced with {2}/{3}'.format(os.path.basename(md_file), cnt_replace, url, dir_ts))
        elif cnt_replace == 0:
            print('No URLs to replace in {}'.format(os.path.basename(md_file)))

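
The article does not show the `img_patten` regex. Since the code flattens the result with `chain(*matches)`, `findall` must be returning tuples, which suggests a pattern with more than one capture group. The sketch below is an assumption along those lines, not the author's pattern. `IMG_PATTERN` and `find_image_urls` are hypothetical names.

```python
import re
from itertools import chain

# Hypothetical pattern: group 1 captures Markdown image URLs,
# group 2 captures the src attribute of inline HTML <img> tags.
IMG_PATTERN = r'!\[[^\]]*\]\(([^)\s]+)[^)]*\)|<img[^>]*?src=["\']([^"\']+)["\']'


def find_image_urls(markdown_text: str) -> list:
    """Return every image URL referenced in the text, in order of appearance."""
    matches = re.findall(IMG_PATTERN, markdown_text)
    # Each match is a (markdown_url, html_url) tuple with one empty member,
    # so flatten the tuples and drop the empty strings.
    return [url for url in chain(*matches) if url]
```

With two alternatives in one pattern, each match fills only one group, which is why the original loop also skips empty strings before downloading.
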
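Writing to the database:

    # Write a post to the database.
    # `mysql` is assumed to be a helper object, defined elsewhere, that wraps the PooledDB connection pool.
    def write_db(title, content, url):
        sql = "INSERT INTO blog (title, content, url) VALUES (%(title)s, %(content)s, %(url)s);"
        param = {"title": title, "content": content, "url": url}
        mysql.insert(sql, param)
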
Summary

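Open blogging communities have made things much more convenient in the internet age, but their content can also disappear at any time, so it is best to keep your own local backup. You can also pick the bloggers you like and crawl their posts for your own collection.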

Source code: https://gitee.com/52itstyle/Python

Demo: https://blog.52itstyle.top

List: https://blog.52itstyle.top/index

Details: https://blog.52itstyle.top/49.shtml
