Scraping a website's Tableau dashboard using R
The issue is that it needs to deal with aliasIndices
in a different way. The flow is the following :
select the worksheet under
data$secondaryInfo$presModelMap$vizData$presModelHolder$genPresModelMapPresModel$presModelMap
Under presModelHolder$genVizDataPresModel$paneColumnsData
:
- get all the columns under
vizDataColumns
, note thefieldCaption
as column name - note the
paneIndices
andcolumnIndices
for all columns - for each columns, under
paneColumnsList
, assign the row to the specified columnIndices and the value undervizPaneColumns
which givesvalueIndices
andaliasIndices
which are indices to the actual data in the dictionnary
In the dictionnary (under data$secondaryInfo$presModelMap$dataDictionary$presModelHolder$genDataDictionaryPresModel$dataSegments
), you have the full data. You just need to match the indices you have previously extracted. In order to match those indices, you need to have the dataType
of the column to know which object to select in the dictionnary then :
- for
valueIndices
just assign indices in the list - for
aliasIndices
, if the value is > 0 assign the indice, if it's a negative number you match the indices in the list withdataType
"cstring"
I'm not sure if this algo will work for all tableau data but it's working very well for a few I've tested :
library(rvest)
library(rjson)
library(httr)
library(stringr)
#replace the hostname and the path if necessary
host_url <- "https://results.mo.gov"
path <- "/t/COVID19/views/Demographics/Public-Demographics"
body <- read_html(modify_url(host_url,
path = path,
query = list(":embed" = "y",":showVizHome" = "no")
))
data <- body %>%
html_nodes("textarea#tsConfigContainer") %>%
html_text()
json <- fromJSON(data)
url <- modify_url(host_url, path = paste(json$vizql_root, "/bootstrapSession/sessions/", json$sessionid, sep =""))
resp <- POST(url, body = list(sheet_id = json$sheetId), encode = "form")
data <- content(resp, "text")
extract <- str_match(data, "\\d+;(\\{.*\\})\\d+;(\\{.*\\})")
info <- fromJSON(extract[1,1])
data <- fromJSON(extract[1,3])
worksheets = names(data$secondaryInfo$presModelMap$vizData$presModelHolder$genPresModelMapPresModel$presModelMap)
for(i in 1:length(worksheets)){
print(paste("[",i,"] ",worksheets[i], sep=""))
}
selected <- readline(prompt="select worksheet by index: ");
worksheet <- worksheets[as.integer(selected)]
print(paste("you selected :", worksheet, sep=" "))
columnsData <- data$secondaryInfo$presModelMap$vizData$presModelHolder$genPresModelMapPresModel$presModelMap[[worksheet]]$presModelHolder$genVizDataPresModel$paneColumnsData
i <- 1
result <- list();
for(t in columnsData$vizDataColumns){
if (is.null(t[["fieldCaption"]]) == FALSE) {
paneIndex <- t$paneIndices
columnIndex <- t$columnIndices
if (length(t$paneIndices) > 1){
paneIndex <- t$paneIndices[1]
}
if (length(t$columnIndices) > 1){
columnIndex <- t$columnIndices[1]
}
result[[i]] <- list(
fieldCaption = t[["fieldCaption"]],
valueIndices = columnsData$paneColumnsList[[paneIndex + 1]]$vizPaneColumns[[columnIndex + 1]]$valueIndices,
aliasIndices = columnsData$paneColumnsList[[paneIndex + 1]]$vizPaneColumns[[columnIndex + 1]]$aliasIndices,
dataType = t[["dataType"]],
stringsAsFactors = FALSE
)
i <- i + 1
}
}
dataFull = data$secondaryInfo$presModelMap$dataDictionary$presModelHolder$genDataDictionaryPresModel$dataSegments[["0"]]$dataColumns
cstring <- list();
for(t in dataFull) {
if(t$dataType == "cstring"){
cstring <- t
break
}
}
data_index <- 1
name_index <- 1
frameData <- list()
frameNames <- c()
for(t in dataFull) {
for(index in result) {
if (t$dataType == index["dataType"]){
if (length(index$valueIndices) > 0) {
j <- 1
vector <- character(length(index$valueIndices))
for (it in index$valueIndices){
vector[j] <- t$dataValues[it+1]
j <- j + 1
}
frameData[[data_index]] <- vector
frameNames[[name_index]] <- paste(index$fieldCaption, "value", sep="-")
data_index <- data_index + 1
name_index <- name_index + 1
}
if (length(index$aliasIndices) > 0) {
j <- 1
vector <- character(length(index$aliasIndices))
for (it in index$aliasIndices){
if (it >= 0){
vector[j] <- t$dataValues[it+1]
} else {
vector[j] <- cstring$dataValues[abs(it)]
}
j <- j + 1
}
frameData[[data_index]] <- vector
frameNames[[name_index]] <- paste(index$fieldCaption, "alias", sep="-")
data_index <- data_index + 1
name_index <- name_index + 1
}
}
}
}
df <- NULL
lengthList <- c()
for(i in 1:length(frameNames)){
lengthList[i] <- length(frameData[[i]])
}
max <- max(lengthList)
for(i in 1:length(frameNames)){
if (length(frameData[[i]]) < max){
len <- length(frameData[[i]])
frameData[[i]][(len+1):max]<-""
}
df[frameNames[[i]]] <- frameData[i]
}
options(width = 1200)
df <- as.data.frame(df, stringsAsFactors = FALSE)
print(df)
I've made a repository with both R and Python scripts here
How to scrape a public tableau dashboard?
Edit
I've made a tableau scraper library to extract the data from Tableau worksheets
You can get the data from worksheets in a pandas dataframe directly. Also, the parametered values are supported.
The following example get the data from worksheet Simples Demanda Máxima Ano
, then switch to daily mode, shows the worksheet Simples Demanda Máxima Semana Dia
data and then set start date to 01/01/2017
:
from tableauscraper import TableauScraper as TS
url = "https://tableau.ons.org.br/t/ONS_Publico/views/DemandaMxima/HistricoDemandaMxima"
ts = TS()
ts.loads(url)
wb = ts.getWorkbook()
# show dataframe with yearly data
ws = wb.getWorksheet("Simples Demanda Máxima Ano")
print(ws.data)
# switch to daily
wb = wb.setParameter("Escala de Tempo DM Simp 4", "Dia")
# show dataframe with daily data
ws = wb.getWorksheet("Simples Demanda Máxima Semana Dia")
print(ws.data)
# switch to daily
wb = wb.setParameter(
"Início Primeiro Período DM Simp 4", "01/01/2017")
# show dataframe with daily data from 01/01/2017
ws = wb.getWorksheet("Simples Demanda Máxima Semana Dia")
print(ws.data)
Try this on repl.it
Original post
This answer is similar to this one but the initial URL page and tableau base URL differ. The process/algo remains the same essentially but I will details the steps :
The graphic is generated in JS from the result of an API :
POST https://tableau.ons.org.br/ROOT_PATH/bootstrapSession/sessions/SESSION_ID
The SESSION_ID
parameter is located (among other things) in tsConfigContainer
textarea in the URL used to build the iframe.
Starting from https://tableau.ons.org.br/t/ONS_Publico/views/DemandaMxima/HistricoDemandaMxima?:embed=y&:showAppBanner=false&:showShareOptions=true&:display_count=no&:showVizHome=no :
- there is a textarea with id
tsConfigContainer
with a bunch of json values - extract the
session_id
and root path (vizql_root
) - make a POST on
https://tableau.ons.org.br/ROOT_PATH/bootstrapSession/sessions/SESSION_ID
with thesheetId
as form data - extract the json from the result (result is not json)
Code :
import requests
from bs4 import BeautifulSoup
import json
import re
url = "https://tableau.ons.org.br/t/ONS_Publico/views/DemandaMxima/HistricoDemandaMxima"
r = requests.get(
url,
params= {
":embed":"y",
":showAppBanner":"false",
":showShareOptions":"true",
":display_count":"no",
"showVizHome": "no"
}
)
soup = BeautifulSoup(r.text, "html.parser")
tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)
dataUrl = f'https://tableau.ons.org.br{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'
r = requests.post(dataUrl, data= {
"sheet_id": tableauData["sheetId"],
})
dataReg = re.search('\d+;({.*})\d+;({.*})', r.text, re.MULTILINE)
info = json.loads(dataReg.group(1))
data = json.loads(dataReg.group(2))
print(data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"])
Find two consecutive rows
Assuming the rows have sequential IDs, something like this may be what you're looking for:
select top 1 *
from
Bills b1
inner join Bills b2 on b1.id = b2.id - 1
where
b1.IsEstimate = 1 and b2.IsEstimate = 1
order by
b1.BillDate desc
Related Topics
How to Substitute Symbols in a Language Object
Error Using T.Test() in R - Not Enough 'Y' Observations
How Is Ggplot2 Plus Operator Defined
How to Know a Dimension of Matrix or Vector in R
Read Column Names as Date Format
Add a Constant Value to All Rows in a Dataframe
How to Set Bin Width with Geom_Bar Stat="Identity" in a Time Series Plot
Cant Create File Name with Time Stamp
Sum Columns by Group (Row Names) in a Matrix
R Shiny: Plot with Dynamical Size
How to Obtain All Combinations of the Columns of a Data Frame Taken by 2
Is There Something Like a Pmax Index
Using Lubridate and Ggplot2 Effectively for Date Axis
Change Values in Row Based on a Column Value R
Shiny Leaflet Easyprint Plugin