Scraping Google Finance (Beautifulsoup)

Python BeautifulSoup - Scraping Google Finance historical data

You can pass an offset, i.e. `start=...`, in the URL to get 30 rows at a time, which is exactly what the page's own pagination logic does:

from bs4 import BeautifulSoup
import requests

url = "https://www.google.com/finance/historical?cid=13564339&startdate=Jan+01%2C+2010&" \
      "enddate=Aug+18%2C+2016&num=30&ei=ilC1V6HlPIasuASP9Y7gAQ&start={}"

with requests.session() as s:
    start = 0
    req = s.get(url.format(start))
    soup = BeautifulSoup(req.content, "lxml")
    table = soup.select_one("table.gf-table.historical_price")
    all_rows = table.find_all("tr")
    while True:
        start += 30
        soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
        table = soup.select_one("table.gf-table.historical_price")
        # No table on the page means we have run past the last offset.
        if not table:
            break
        all_rows.extend(table.find_all("tr"))

You can also get the total row count from the script tag and use that with range:

import re

with requests.session() as s:
    req = s.get(url.format(0))
    soup = BeautifulSoup(req.content, "lxml")
    table = soup.select_one("table.gf-table.historical_price")
    scr = soup.find("script", text=re.compile('google.finance.applyPagination'))
    # The third argument to applyPagination is the total number of rows.
    total = int(scr.text.split(",", 3)[2])
    all_rows = table.find_all("tr")

    for start in range(30, total + 1, 30):
        soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
        table = soup.select_one("table.gf-table.historical_price")
        all_rows.extend(table.find_all("tr"))
    print(len(all_rows))
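For reference, this is how the `split(",", 3)` trick pulls the total out of the script text. A minimal sketch with made-up script content (the real page wraps a similar `applyPagination(...)` call in a `<script>` tag; the exact arguments here are an assumption for illustration):

```python
# Hypothetical example of the pagination script text embedded in the page.
script_text = "google.finance.applyPagination(0, 30, 1643, 'historical')"

# Split on the first three commas and take the third field,
# which is the total row count.
total = int(script_text.split(",", 3)[2])
print(total)  # 1643
```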

`num=30` is the number of rows per page; to make fewer requests you can set it to 200, which seems to be the maximum, and work your step/offset from that.

url = "https://www.google.com/finance/historical?cid=13564339&startdate=Jan+01%2C+2010&" \
      "enddate=Aug+18%2C+2016&num=200&ei=ilC1V6HlPIasuASP9Y7gAQ&start={}"

with requests.session() as s:
    req = s.get(url.format(0))
    soup = BeautifulSoup(req.content, "lxml")
    table = soup.select_one("table.gf-table.historical_price")
    scr = soup.find("script", text=re.compile('google.finance.applyPagination'))
    total = int(scr.text.split(",", 3)[2])
    all_rows = table.find_all("tr")
    for start in range(200, total + 1, 200):
        soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
        print(url.format(start))
        table = soup.select_one("table.gf-table.historical_price")
        all_rows.extend(table.find_all("tr"))

If you run the code, you will see that we get 1643 rows:

In [7]: with requests.session() as s:
   ...:     req = s.get(url.format(0))
   ...:     soup = BeautifulSoup(req.content, "lxml")
   ...:     table = soup.select_one("table.gf-table.historical_price")
   ...:     scr = soup.find("script", text=re.compile('google.finance.applyPagination'))
   ...:     total = int(scr.text.split(",", 3)[2])
   ...:     all_rows = table.find_all("tr")
   ...:     for start in range(200, total+1, 200):
   ...:         soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
   ...:         table = soup.select_one("table.gf-table.historical_price")
   ...:         all_rows.extend(table.find_all("tr"))
   ...:     print(len(all_rows))
   ...:

1643


Web scraping from Google Finance: returned data list always empty

It is entirely up to the server what content it serves you, so the best you can do is to make sure that your request looks like the request sent by the browser as much as possible. In your case, this might mean:

opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.90 Safari/537.36')]

If I am not mistaken, this gives you what you want. You can try to remove irrelevant parts by trial-and-error if you want.
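For context, this is how the header line above slots into a complete `urllib` flow. A minimal sketch (the fetch itself is commented out, since whether Google serves you the full page still depends on the server):

```python
from urllib.request import build_opener

# Replace the opener's default Python User-Agent with a browser-like one,
# as in the answer above.
opener = build_opener()
opener.addheaders = [("User-Agent",
                      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/75.0.3770.90 Safari/537.36")]

# html = opener.open("https://www.google.com/finance").read()
```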

Scraping data from google finance using BeautifulSoup in python

The problem was with html.parser; using lxml instead worked. I also swapped urllib for requests.
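To illustrate the switch: only the second argument to the `BeautifulSoup` constructor changes, so trying lxml costs one word. A minimal sketch on toy markup (the price value here is made up, not taken from the live page):

```python
from bs4 import BeautifulSoup

html = "<div id='price'><span>842.17</span></div>"

# "lxml" replaces "html.parser" as the parser name; lxml itself must be
# installed separately (pip install lxml).
price = BeautifulSoup(html, "lxml").select_one("#price span").text
print(price)  # 842.17

# And with requests replacing urllib, fetching a page is one line:
# html = requests.get(url).text
```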

Web Scraping Google Finance

This will help:

>>> pe = soup.find('td',{'data-snapfield':'pe_ratio'})
>>> pe
<td class="key" data-snapfield="pe_ratio">P/E
</td>
>>> print(pe.td.next_sibling.get_text())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'next_sibling'
>>>
>>>
>>>
>>> pe
<td class="key" data-snapfield="pe_ratio">P/E
</td>
>>> pe.td
>>> pe.next_sibling
u'\n'
>>> pe.next_sibling.next_sibling
<td class="val">29.69
</td>
>>> pe.next_sibling.next_sibling.get_text()
u'29.69\n'
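The sibling-walking shown in the session can be wrapped in a small helper using `find_next_sibling("td")`, which skips the whitespace-only text nodes that tripped up `.next_sibling`. A sketch against a toy fragment mimicking the snapshot table (the value is made up; `pe_ratio` is the field shown above):

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr><td class="key" data-snapfield="pe_ratio">P/E</td>
<td class="val">29.69</td></tr>
</table>
"""

def snapfield(soup, field):
    # Locate the key cell, then jump straight to the next <td> element,
    # ignoring the intervening newline text node.
    key = soup.find("td", {"data-snapfield": field})
    return key.find_next_sibling("td").get_text(strip=True)

soup = BeautifulSoup(html, "lxml")
print(snapfield(soup, "pe_ratio"))  # 29.69
```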

Python scraping google finance

Your XPath seems to be incorrect.

Try to replace

priceO = parser.xpath('//*[@id="fac-ut"]/div[1]/div[4]/div[1]/span[1]/text()')

with the line below:

priceO = parser.xpath('//div[@id="price-panel"]//span')[0].text_content().strip()

output:

172.50
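The relative XPath above is more robust than a long absolute one because it anchors on the `id` rather than on the exact nesting. A self-contained sketch against toy markup (the structure and price are assumptions, not the real page):

```python
from lxml import html

# Toy fragment mimicking the old price-panel layout.
page = html.fromstring(
    '<div id="price-panel"><div><span> 172.50 </span></div></div>')

# Anchor on the id, then take the first <span> at any depth below it.
price = page.xpath('//div[@id="price-panel"]//span')[0].text_content().strip()
print(price)  # 172.50
```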

Using Beautiful soup to get the stock prices

This doesn't work because you will run into the following problems:

  1. You will be hitting Google's bot detection, which means when you do requests.get you won't get back the Google results, instead you'll get a response from the bot detection asking you to tick a box to confirm you are human.
  2. The class you are searching for doesn't exist.
  3. You are using the default html.parser which is going to be useless as Google does not put the price data in the raw HTML code. Instead you want to use something more advanced like the lxml parser.

Based on what you are trying to do, you could try to trick Google's bot detection by making your request seem more legitimate, for example add in the user agent that a Chrome browser would normally send. Additionally, to get the price it seems like you want the pclqee class in a span element.

Try this instead:

First install the lxml parser:
pip3 install lxml

Then use the snippet below:

from bs4 import BeautifulSoup
import requests

def getprice():
    url = "https://www.google.com/search?q=bitcoin+price"
    # Send a browser-like user agent so the request looks less like a bot.
    HTML = requests.get(
        url,
        headers={
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36"
        },
    )
    soup = BeautifulSoup(HTML.text, "lxml")
    text = soup.find("span", attrs={"class": "pclqee"}).text
    return text

if __name__ == "__main__":
    bitcoin = getprice()
    print(bitcoin)

Although the modified snippet above will work, I wouldn't advise relying on it. Google will still occasionally detect your request as a bot, so this code would be unreliable.

If you want stock data, I suggest you scrape an API directly or use APIs that do that for you already, e.g. have a look at https://www.alphavantage.co/
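As a sketch of the API route: Alpha Vantage's documented `TIME_SERIES_DAILY` endpoint returns JSON, so there is no HTML to parse at all. This only builds the request URL; you would need your own free API key in place of `demo`, and the symbol here is just an example:

```python
import urllib.parse

# Query parameters for Alpha Vantage's daily time-series endpoint.
params = {"function": "TIME_SERIES_DAILY", "symbol": "IBM", "apikey": "demo"}
url = "https://www.alphavantage.co/query?" + urllib.parse.urlencode(params)
print(url)

# data = requests.get(url).json()  # a dict of dated OHLCV entries
```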
