How to scrape data off morningstar
Seems like the data can be pulled form API. Only thing is the values it returns is relative to the start date entered in the payload. It'll set the out put of the start date to 0
, then the numbers after are relative to that date.
import requests
import pandas as pd
from datetime import datetime
from dateutil import relativedelta
userInput = input('Choose:\n\t1. 3 Month\n\t2. 6 Month\n\t3. 1 Year\n\t4. 3 Year\n\t5. 5 Year\n\t6. 10 Year\n\n -->: ')
userDict = {'1':3,'2':6,'3':12,'4':36,'5':60,'6':120}
n = datetime.now()
n = n - relativedelta.relativedelta(days=1)
n = n - relativedelta.relativedelta(months=userDict[userInput])
dateStr = n.strftime('%Y-%m-%d')
url = 'https://tools.morningstar.co.uk/api/rest.svc/timeseries_cumulativereturn/t92wz0sj7c'
data = []
idDict = {
'Schroder Managed Balanced Instl Acc':'F0GBR050AQ]2]0]FOGBR$$ALL',
'GBP Moderately Adventurous Allocation':'EUCA000916]8]0]CAALL$$ALL',
'Mixed Investment 40-85% Shares':'LC00000012]8]0]CAALL$$ALL',
'':'F00000ZOR1]7]0]IXALL$$ALL',}
for k, v in idDict.items():
payload = {
'encyId': 'GBP',
'idtype': 'Morningstar',
'frequency': 'daily',
'startDate': dateStr,
'performanceType': '',
'outputType': 'COMPACTJSON',
'id': v,
'decPlaces': '8',
'applyTrackRecordExtension': 'false'}
temp_data = requests.get(url, params=payload).json()
df = pd.DataFrame(temp_data)
df['timestamp'] = pd.to_datetime(df[0], unit='ms')
df['date'] = df['timestamp'].dt.date
df = df[['date',1]]
df.columns = ['date', k]
data.append(df)
final_df = pd.concat(
(iDF.set_index('date') for iDF in data),
axis=1, join='inner'
).reset_index()
final_df.plot(x="date", y=list(idDict.keys()), kind="line")
Output:
print (final_df.head(5).to_string())
date Schroder Managed Balanced Instl Acc GBP Moderately Adventurous Allocation Mixed Investment 40-85% Shares
0 2019-12-22 0.000000 0.000000 0.000000 0.000000
1 2019-12-23 0.357143 0.406784 0.431372 0.694508
2 2019-12-24 0.714286 0.616217 0.632422 0.667586
3 2019-12-25 0.714286 0.616217 0.632422 0.655917
4 2019-12-26 0.714286 0.612474 0.629152 0.664124
....
To get those Ids, it took a little investigating of the requests. Searching through those, I was able to find the corresponding id values and with a little bit of trial and error to work out what values meant what.
Those "alternate" ids used. And where those line graphs get the data from (inthose 4 request, look at the Preview pane, and you'll see the data in there.
Here's the final output/graph:
Cannot scrape dataid from Morningstar - How can I access the Network inspection tool from Python?
The data id may not be that important. I varied the code F00000412E that is associated with AADR whilst keeping the data id constant.
I got a list of all those codes from here:
https://www.firstrade.com/scripts/free_etfs/io.php
Then add the code of choice into your url e.g.
[
"AIA",
"iShares Asia 50 ETF",
"FOUSA06MPQ"
]
Use FOUSA06MPQ
https://mschart.morningstar.com/chartweb/defaultChart?type=getcc&secids=FOUSA06MPQ;FE&dataid=8225&startdate=2017-01-01&enddate=2018-12-30
You can verify the values by adding the other fund as a benchmark to your chart e.g. XNAS:AIA
28th december has value of 55.32. Compare this with JSON retrieved:
I repeated this with
[
"ALD",
"WisdomTree Asia Local Debt ETF",
"F00000M8TW"
]
https://mschart.morningstar.com/chartweb/defaultChart?type=getcc&secids=F00000M8TW;FE&dataid=8225&startdate=2017-01-01&enddate=2018-12-30
Webscraping with VBA morningstar financial
You can just do it with XHR and RegEx instead of cumbersome IE:
Sub Test()
Dim sContent
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "http://investors.morningstar.com/ownership/shareholders-overview.html?t=TWTR®ion=usa&culture=en-US", False
.Send
sContent = .ResponseText
End With
With CreateObject("VBScript.RegExp")
.Pattern = ",""currInsiderVal"":(.*?),"
Range("A30").Value = .Execute(sContent).Item(0).SubMatches(0)
End With
End Sub
Here is the description how the code works:
First of all MSXML2.XMLHTTP
ActiveX instance is created. GET request opened with target URL in synchronous mode (execution interrupts until response received).
Then VBScript.RegExp
is created. By default .IgnoreCase
, .Global
and .MultiLine
properties are False
. The pattern is ,"currInsiderVal":(.*?),
, where (.*?)
is a capturing group, .
means any character, .*
- zero or more characters, .*?
- as few as possible characters (lazy matching). Other characters in pattern to be found as is. .Execute
method returns a collection of matches, there is only one match object in it since .Global
is False
. This match object has a collection of submatches, there is only one submatch in it since the pattern contains the only capturing group.
There are some helpful MSDN articles on regex:
Microsoft Beefs Up VBScript with Regular Expressions
Introduction to Regular Expressions
Here is the description how I created the code:
First I found an element containing the target value on the webpage DOM using browser:
The corresponding node is:
<td align="right" id="currrentInsiderVal">143.51</td>
Then I made XHR and found this node in the response HTML, but it didn't contain the value (you can find response in the browser developer tools on network tab after you refresh the page):
<td align="right" id="currrentInsiderVal">
</td>
Such behavior is typical for DHTML. Dynamic HTML content is generated by scripts after the webpage loaded, either after retrieving a data from web via XHR or just processing already loaded withing webpage data. Then I just searched for the value 143.51
in the response, the snippet ,"currInsiderVal":143.51,
located within JS function:
fundsArr = {"fundTotalHistVal":132.61,"mutualFunds":[[1,89,"#a71620"],[2,145,"#a71620"],[3,152,"#a71620"],[4,198,"#a71620"],[5,155,"#a71620"],[6,146,"#a71620"],[7,146,"#a71620"],[8,132,"#a71620"]],"insiderHisMaxVal":3.535,"institutions":[[1,273,"#283862"],[2,318,"#283862"],[3,351,"#283862"],[4,369,"#283862"],[5,311,"#283862"],[6,298,"#283862"],[7,274,"#283862"],[8,263,"#283862"]],"currFundData":[2,2202,"#a6001d"],"currInstData":[1,4370,"#283864"],"instHistMaxVal":369,"insiders":[[5,0.042,"#ff6c21"],[6,0.057,"#ff6c21"],[7,0.057,"#ff6c21"],[8,3.535,"#ff6c21"],[5,0],[6,0],[7,0],[8,0]],"currMax":4370,"histLineQuars":[[1,"Q2"],[2,"Q3"],[3,"Q4"],[4,"Q1<br>2015"],[5,"Q2"],[6,"Q3"],[7,"Q4"],[8,"Q1<br>2016"]],"fundHisMaxVal":198,"currInsiderData":[3,143,"#ff6900"],"currFundVal":2202.85,"quarters":[[1,"Q2"],[2,""],[3,""],[4,"Q1<br>2015"],[5,""],[6,""],[7,""],[8,"Q1<br>2016"]],"insiderTotalHistVal":3.54,"currInstVal":4370.46,"currInsiderVal":143.51,"use10YearData":"false","instTotalHistVal":263.74,"maxValue":369};
So the regex pattern created based on that it should find the snippet ,"currInsiderVal":<some text>,
where <some text>
is our target value.
Scraping financial data with R and rvest
read.csv("http://financials.morningstar.com/ajax/ReportProcess4CSV.html?&t=XNAS:MSFT®ion=usa&culture=en-US&cur=&reportType=is&period=12&dataType=A&order=asc&columnYear=5&curYearPart=1st5year&rounding=3&view=raw&r=865827&denominatorView=raw&number=3", skip=1)
Fiscal.year.ends.in.June..USD.in.millions.except.per.share.data. X2011.06 X2012.06 X2013.06 X2014.06 X2015.06 TTM
1 Revenue 69943.00 73723.00 77849.00 86833.00 93580.00 90758.00
2 Cost of revenue 15577.00 17530.00 20249.00 26934.00 33038.00 31972.00
3 Gross profit 54366.00 56193.00 57600.00 59899.00 60542.00 58786.00
4 Operating expenses NA NA NA NA NA NA
5 Research and development 9043.00 9811.00 10411.00 11381.00 12046.00 11943.00
6 Sales, General and administrative 18162.00 18426.00 20425.00 20632.00 20324.00 19862.00
7 Restructuring, merger and acquisition NA NA NA 127.00 NA NA
8 Other operating expenses NA 6193.00 NA NA 10011.00 8871.00
9 Total operating expenses 27205.00 34430.00 30836.00 32140.00 42381.00 40676.00
10 Operating income 27161.00 21763.00 26764.00 27759.00 18161.00 18110.00
11 Interest Expense 295.00 380.00 429.00 597.00 781.00 869.00
12 Other income (expense) 1205.00 884.00 717.00 658.00 1127.00 883.00
13 Income before taxes 28071.00 22267.00 27052.00 27820.00 18507.00 18124.00
14 Provision for income taxes 4921.00 5289.00 5189.00 5746.00 6314.00 5851.00
15 Net income from continuing operations 23150.00 16978.00 21863.00 22074.00 12193.00 12273.00
16 Net income 23150.00 16978.00 21863.00 22074.00 12193.00 12273.00
17 Net income available to common shareholders 23150.00 16978.00 21863.00 22074.00 12193.00 12273.00
18 Earnings per share NA NA NA NA NA NA
19 Basic 2.73 2.02 2.61 2.66 1.49 1.51
20 Diluted 2.69 2.00 2.58 2.63 1.48 1.50
21 Weighted average shares outstanding NA NA NA NA NA NA
22 Basic 8490.00 8396.00 8375.00 8299.00 8177.00 8114.00
23 Diluted 8593.00 8506.00 8470.00 8399.00 8254.00 8183.00
24 EBITDA 31132.00 25614.00 31236.00 33629.00 25245.00 24983.00
It's super-helpful to make browser Developer Tools "Network" tab your BFF.
(that URL came from inspecting what the "Export" button does).
Related Topics
Regex to Match Digits and At Most One Space Between Them
Multiprocessing: How to Use Pool.Map on a Function Defined in a Class
In Dictionary, Converting the Value from String to Integer
Why Does Tkinter Image Not Show Up If Created in a Function
Exception Has Occurred: Filenotfounderror [Errno 2] No Such File or Directory: 'Data.Json'
How to Set the Default Python Path for Anaconda on Linux
How to Extract All Upper from a String - Python
Stuck With Loops in Python - Only Returning First Value
Convert Timedelta to Floating-Point
Get Span Inside a Class Using Webdriver and Selenium
Setting Matplotlib Colorbar Range
Permissionerror: [Errno 13] Permission Denied
Sqlalchemy: How to Filter Date Field
Defining and Calling a Function Within a Python Class
Get Rid of Columns With Null Value in Json Output
Python Super :Typeerror: _Init_() Takes 2 Positional Arguments But 3 Were Given
Python: String Iteration Replace a Space With a Hyphen (Or Other Character)