from bs4 import BeautifulSoup
import re
import random
import requests
from urllib.request import urlopen
base_url = "https://baike.baidu.com"
his = ["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB"]
for i in range(3):
    # the /item/ path holds percent-encoded Chinese characters, so decode the page as UTF-8
    url = base_url + his[-1]
    html = urlopen(url).read().decode('utf-8')
    # html = requests.get(url).text   # alternative: fetch the page with requests
    soup = BeautifulSoup(html, features='lxml')
    print(i, soup.find('h1').get_text(), ' url: ', his[-1])

    # find valid sub-page urls: links that open in a new tab and whose
    # /item/ path consists only of percent-encoded characters
    sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("/item/(%.{2})+$")})
    # .   : matches any character (except \n)
    # {n} : repeats the preceding token n times
    if len(sub_urls) != 0:
        # follow one of the valid sub links at random
        his.append(random.sample(sub_urls, 1)[0]['href'])
    else:
        # no valid sub link found, so step back to the previous page
        his.pop()
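
# A minimal sketch (separate from the crawler above) of what the href filter
# accepts; the example paths are hypothetical, not taken from baike.baidu.com.
item_pattern = re.compile("/item/(%.{2})+$")
print(bool(item_pattern.search("/item/%E7%88%AC%E8%99%AB")))  # True: path is only percent-encoded bytes
print(bool(item_pattern.search("/item/Python/407313")))       # False: plain-text / numeric paths are rejected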