[Python crawler]Google Search in Python

  • 2019-12-25

Goal: use the Google search engine to search for a string and scrape the titles of the results.

 

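I have been working on string matching lately. Algorithmic approaches alone did not work well, so I tried running the strings through a Google search first and comparing the results, and it worked surprisingly well. The implementation is shared below.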

Approach 1

First, install the google search package:

pip install google_search


from googlesearch.googlesearch import GoogleSearch
response = GoogleSearch().search("LOL LMS")
for result in response.results:
    print("Title: " + result.title)
    print("Content: " + result.getText())

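Output: the page title and body text of each result (the body text is too long to list here).

Pros: simple and convenient.

Cons: cannot be customized, and searches are subject to limits.

Approach 2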

# -*- coding: utf-8 -*-
import requests
import time
import random
from bs4 import BeautifulSoup

def google_scrape(Search_list):
    title_list=[]
    #url='http://www.baidu.com/s?rsv_idx=1&wd='LPL&usm=2&ie=utf-8&sl_lang=en&rsv_srlang=en&rsv_rq=en&rqlang=cn
    url='https://www.google.com.tw/search?q='
    # Adjacent string literals are joined, so long entries can span lines
    # without embedding stray whitespace in the User-Agent value.
    user_agents = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130406 Firefox/23.0',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533+ '
        '(KHTML, like Gecko) Element Browser 5.0',
        'IBM WebExplorer /v0.94',
        'Galaxy/1.0 [en] (Mac OS X 10.5.6; U; en)',
        'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
        'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14',
        'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) '
        'Version/6.0 Mobile/10A5355d Safari/8536.25',
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/28.0.1468.0 Safari/537.36',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; TheWorld)',
    ]
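    # requests picks a proxy by URL scheme: with only an "http" entry, the
    # HTTPS search URL above is fetched directly. Add an "https" entry to
    # route it through the proxy as well.
    proxies = {
      "http": "http://113.200.214.164:9999"}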
    index = random.randint(0, 9)
    user_agent = user_agents[index]
    headers = {'User-Agent': user_agent}
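    for x in Search_list:
        time.sleep(1)  # pause between queries
        # Percent-encode the query so spaces and symbols survive in the URL
        res = requests.get(url=url + requests.utils.quote(x), headers=headers, proxies=proxies)
        soup = BeautifulSoup(res.text, "html.parser")
        # Each result block is a <div class="g">; the class name may need
        # updating if Google changes its markup.
        search_text = soup.find_all("div", class_="g")
        # Skip result blocks that contain no link
        titles = [result.find("a").text for result in search_text if result.find("a")]
        title_list.extend(titles)
        print(titles)
    return title_list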

google_scrape(['LOL LMS'])

Output: the page titles.

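The code is clearly much longer than Approach 1, but it is less likely to get your IP blocked by Google:

1. user_agents: disguise the request as coming from a browser

2. proxies: switch proxies

3. time.sleep(1): wait between requests

All three of these reduce the risk of being blocked.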
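
The three points can be pushed a little further by choosing a new User-Agent and proxy for every request and by randomizing the delay. The following is only a minimal sketch: the helper names `build_request_options` and `polite_sleep`, the second proxy address, and the jitter values are assumptions made for illustration, not part of the original code.

```python
import random
import time

# Illustrative pools; replace them with values you actually control.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130406 Firefox/23.0",
    "Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14",
]
PROXIES = [
    "http://113.200.214.164:9999",    # address used in Approach 2
    "http://proxy.example.com:8080",  # hypothetical placeholder
]


def build_request_options(rng=random):
    """Pick a fresh User-Agent and proxy for a single request."""
    proxy = rng.choice(PROXIES)
    return {
        "headers": {"User-Agent": rng.choice(USER_AGENTS)},
        # requests selects the proxy by URL scheme, so set both keys.
        "proxies": {"http": proxy, "https": proxy},
    }


def polite_sleep(base=1.0, jitter=1.0, rng=random, sleep=time.sleep):
    """Sleep for `base` seconds plus a random extra of up to `jitter`."""
    delay = base + rng.uniform(0, jitter)
    sleep(delay)
    return delay
```

Inside the loop of Approach 2 this could be used as `polite_sleep()` followed by `requests.get(url + x, **build_request_options())`, so that consecutive queries do not share the same fingerprint or timing.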

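Pros: less likely to be blocked, and you can scrape exactly the content you want.

Cons: slower.

I hope this is a useful reference. I recommend Approach 2; if you only need a small amount of data, Approach 1 may be enough.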
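
The post does not show how the scraped titles are compared afterwards. One possible way, sketched here purely as an assumption, is to treat the titles returned for each string as a bag of words and measure their Jaccard overlap. The function names `title_tokens` and `title_similarity` and the sample titles are made up for illustration.

```python
def title_tokens(titles):
    """Collect the lowercase words that appear in a list of result titles."""
    words = set()
    for title in titles:
        words.update(title.lower().split())
    return words


def title_similarity(titles_a, titles_b):
    """Jaccard similarity between the word sets of two title lists."""
    a = title_tokens(titles_a)
    b = title_tokens(titles_b)
    if not a and not b:
        return 0.0  # nothing to compare
    return len(a & b) / len(a | b)


# Made-up titles standing in for the results of two searches.
titles_a = ["League of Legends Master Series"]
titles_b = ["League of Legends Pro League"]
score = title_similarity(titles_a, titles_b)
```

A score near 1 would suggest the two strings lead to similar search results, while a score near 0 would suggest they are unrelated. Whitespace splitting is a simplification and would need a proper tokenizer for Chinese titles.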