Python BeautifulSoup - Scraping Google Finance historical data
You can pass an offset (start=...) in the URL to get 30 rows at a time, which is exactly what the site's pagination logic does:
from bs4 import BeautifulSoup
import requests

url = "https://www.google.com/finance/historical?cid=13564339&startdate=Jan+01%2C+2010&" \
      "enddate=Aug+18%2C+2016&num=30&ei=ilC1V6HlPIasuASP9Y7gAQ&start={}"

with requests.session() as s:
    start = 0
    req = s.get(url.format(start))
    soup = BeautifulSoup(req.content, "lxml")
    table = soup.select_one("table.gf-table.historical_price")
    all_rows = table.find_all("tr")
    # Keep requesting the next offset until a page has no price table.
    while True:
        start += 30
        soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
        table = soup.select_one("table.gf-table.historical_price")
        if not table:
            break
        all_rows.extend(table.find_all("tr"))
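The stop condition in the loop above can be exercised without touching the network. The following is a minimal sketch, not part of the original answer: `collect_rows` and `fake_fetch` are hypothetical names, and the fetch callable stands in for the request-and-parse step, returning an empty list where the real code finds no table.

```python
from typing import Callable, List


def collect_rows(fetch_page: Callable[[int], List[str]], page_size: int = 30) -> List[str]:
    """Collect rows page by page until a page comes back empty."""
    all_rows: List[str] = []
    start = 0
    while True:
        rows = fetch_page(start)
        if not rows:  # plays the role of `if not table: break`
            break
        all_rows.extend(rows)
        start += page_size
    return all_rows


# Hypothetical stand-in for the HTTP request: 70 fake rows served 30 at a time.
DATA = [f"row {i}" for i in range(70)]


def fake_fetch(start: int) -> List[str]:
    return DATA[start:start + 30]


rows = collect_rows(fake_fetch)
print(len(rows))  # prints 70
```

Passing the fetch step in as a callable keeps the paging logic separate from the scraping details, so the same loop works if the selector or URL changes.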
You can also get the total rows using the script tag and use that with range:
import re

with requests.session() as s:
    req = s.get(url.format(0))
    soup = BeautifulSoup(req.content, "lxml")
    table = soup.select_one("table.gf-table.historical_price")
    # The pagination script call carries the total row count as its third argument.
    scr = soup.find("script", text=re.compile('google.finance.applyPagination'))
    total = int(scr.text.split(",", 3)[2])
    all_rows = table.find_all("tr")
    for start in range(30, total+1, 30):
        soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
        table = soup.select_one("table.gf-table.historical_price")
        all_rows.extend(table.find_all("tr"))
    print(len(all_rows))
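To make the `split(",", 3)[2]` step concrete, here is a small sketch run on a made-up script body. The sample values and everything except the position of the total (third comma-separated field, as the code above assumes) are invented for illustration; the original answer does not show the script's contents.

```python
# Hypothetical script text; only the third comma-separated field matters here.
script_text = "google.finance.applyPagination(\n0,\n30,\n95,\n'/finance/historical?...');"

# Split on the first three commas and take the third field, as the answer does.
total = int(script_text.split(",", 3)[2])  # int() ignores surrounding whitespace

# Offsets for the remaining pages, 30 rows at a time (offset 0 is already fetched).
offsets = list(range(30, total + 1, 30))
print(total, offsets)  # prints 95 [30, 60, 90]
```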
The num=30 parameter is the number of rows per page. To make fewer requests you can set it to 200, which seems to be the maximum, and work out your step/offset from that.
import re

url = "https://www.google.com/finance/historical?cid=13564339&startdate=Jan+01%2C+2010&" \
      "enddate=Aug+18%2C+2016&num=200&ei=ilC1V6HlPIasuASP9Y7gAQ&start={}"

with requests.session() as s:
    req = s.get(url.format(0))
    soup = BeautifulSoup(req.content, "lxml")
    table = soup.select_one("table.gf-table.historical_price")
    scr = soup.find("script", text=re.compile('google.finance.applyPagination'))
    total = int(scr.text.split(",", 3)[2])
    all_rows = table.find_all("tr")
    # Step through the remaining pages 200 rows at a time.
    for start in range(200, total+1, 200):
        soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
        print(url.format(start))
        table = soup.select_one("table.gf-table.historical_price")
        all_rows.extend(table.find_all("tr"))
If we run the code, you will see we get 1643 rows:
In [7]: with requests.session() as s:
   ...:     req = s.get(url.format(0))
   ...:     soup = BeautifulSoup(req.content, "lxml")
   ...:     table = soup.select_one("table.gf-table.historical_price")
   ...:     scr = soup.find("script", text=re.compile('google.finance.applyPagination'))
   ...:     total = int(scr.text.split(",", 3)[2])
   ...:     all_rows = table.find_all("tr")
   ...:     for start in range(200, total+1, 200):
   ...:         soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
   ...:         table = soup.select_one("table.gf-table.historical_price")
   ...:         all_rows.extend(table.find_all("tr"))
   ...:     print(len(all_rows))
   ...:
1643
Web scraping from Google Finance: returned data list always empty
It is entirely up to the server what content it serves you, so the best you can do is to make sure that your request looks like the request sent by the browser as much as possible. In your case, this might mean:
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.90 Safari/537.36')]
If I am not mistaken, this gives you what you want. You can try to remove irrelevant parts by trial-and-error if you want.
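For context, `opener` in the line above is presumably a `urllib.request` opener; the question's code is not shown here, so that is an assumption. A minimal sketch of building one with a browser-like User-Agent, without sending any request:

```python
import urllib.request

# Browser-like User-Agent string taken from the answer above.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/75.0.3770.90 Safari/537.36"
)

opener = urllib.request.build_opener()
# addheaders is a list of (name, value) tuples sent with every request made
# through this opener; assigning to it replaces the default User-Agent.
opener.addheaders = [("User-Agent", USER_AGENT)]

# To fetch a page, one would then call: opener.open(some_url).read()
print(opener.addheaders)
```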
Scraping data from google finance using BeautifulSoup in python
The problem was with html.parser. Switching to lxml fixed it. I also replaced urllib with requests.
Web Scraping Google Finance
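pe is already the td element, so pe.td is None. The value sits in the next td, after a whitespace text node, so you need to step over two siblings: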
>>> pe = soup.find('td',{'data-snapfield':'pe_ratio'})
>>> pe
<td class="key" data-snapfield="pe_ratio">P/E
</td>
>>> print(pe.td.next_sibling.get_text())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'next_sibling'
>>>
>>>
>>>
>>> pe
<td class="key" data-snapfield="pe_ratio">P/E
</td>
>>> pe.td
>>> pe.next_sibling
u'\n'
>>> pe.next_sibling.next_sibling
<td class="val">29.69
</td>
>>> pe.next_sibling.next_sibling.get_text()
u'29.69\n'
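The double `next_sibling` is needed because the first sibling is the whitespace between the two cells. A sketch of an alternative using `find_next_sibling`, which skips text nodes; the HTML fragment is a reduced stand-in for the page (assumed structure), and `html.parser` is used only so the example needs no extra install.

```python
from bs4 import BeautifulSoup

# Reduced stand-in for the relevant part of the page (assumed structure).
html = """
<table><tr>
<td class="key" data-snapfield="pe_ratio">P/E
</td>
<td class="val">29.69
</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
pe = soup.find("td", {"data-snapfield": "pe_ratio"})

# find_next_sibling("td") jumps straight to the next <td>, ignoring the
# newline text node that sits between the two cells.
value_cell = pe.find_next_sibling("td")
pe_ratio = value_cell.get_text(strip=True)
print(pe_ratio)  # prints 29.69
```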
Python scraping google finance
Your XPath seems to be incorrect.
Try replacing
priceO = parser.xpath('//*[@id="fac-ut"]/div[1]/div[4]/div[1]/span[1]/text()')
with the line below:
price0 = parser.xpath('//div[@id="price-panel"]//span')[0].text_content().strip()
output:
172.50
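Indexing with `[0]` raises `IndexError` when the XPath matches nothing, which is a common symptom when the page layout changes. A hedged sketch with a guard; the markup is a made-up stand-in for the price panel and `first_span_text` is a hypothetical helper, not something from the question.

```python
from lxml import html

# Made-up markup standing in for the real page.
page = """
<html><body>
  <div id="price-panel"><div><span><span>172.50</span></span></div></div>
</body></html>
"""


def first_span_text(tree, panel_id):
    """Return the text of the first <span> under the panel, or None."""
    # $pid is an XPath variable, filled in from the keyword argument.
    matches = tree.xpath('//div[@id=$pid]//span', pid=panel_id)
    return matches[0].text_content().strip() if matches else None


parser = html.fromstring(page)
price = first_span_text(parser, "price-panel")
missing = first_span_text(parser, "no-such-panel")
print(price, missing)
```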
Using Beautiful soup to get the stock prices
This doesn't work because you will run into the following problems:
- You will be hitting Google's bot detection, which means that when you call requests.get you won't get back the Google results; instead you'll get a response from the bot detection asking you to tick a box to confirm you are human.
- The class you are searching for doesn't exist.
- You are using the default html.parser, which is going to be useless as Google does not put the price data in the raw HTML code. Instead you want to use something more advanced like the lxml parser.
Based on what you are trying to do, you could try to trick Google's bot detection by making your request seem more legitimate, for example by adding the user agent that a Chrome browser would normally send. Additionally, to get the price it seems like you want the pclqee class in a span element.
Try this instead:
First install the lxml parser:
pip3 install lxml
Then use the snippet below:
from bs4 import BeautifulSoup
import requests


def getprice():
    url = "https://www.google.com/search?q=bitcoin+price"
    # Send a browser-like user agent so the request looks less like a bot.
    HTML = requests.get(
        url,
        headers={
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36"
        },
    )
    soup = BeautifulSoup(HTML.text, "lxml")
    text = soup.find("span", attrs={"class": "pclqee"}).text
    return text


if __name__ == "__main__":
    bitcoin = getprice()
    print(bitcoin)
Although the modified snippet above will work, I wouldn't advise using it. Google will still occasionally detect your request as a bot, so this code would be unreliable.
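One concrete failure mode: when the expected span is missing, for instance because a bot-check page was served, `soup.find(...)` returns `None` and `.text` raises `AttributeError`. A minimal sketch of a guard, run here on made-up HTML instead of a live request; `extract_price` is a hypothetical helper and both sample pages are invented.

```python
from typing import Optional

from bs4 import BeautifulSoup


def extract_price(page_html: str, css_class: str = "pclqee") -> Optional[str]:
    """Return the price text, or None if the expected span is absent."""
    soup = BeautifulSoup(page_html, "lxml")
    span = soup.find("span", attrs={"class": css_class})
    return span.text if span is not None else None


# Made-up pages: one with the price span, one resembling a bot check.
ok_page = '<html><body><span class="pclqee">41,250.10</span></body></html>'
blocked_page = "<html><body><form>Confirm you are human</form></body></html>"

print(extract_price(ok_page))
print(extract_price(blocked_page))
```

Returning `None` lets the caller decide whether to retry, back off, or fall back to another data source.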
If you want stock data, I suggest you query an API directly or use a service that already does this for you, e.g. have a look at https://www.alphavantage.co/