Reading information from a password protected site
If it is indeed a http basic access authentication, the documentation on connections
provides some help:
URLs
Note that https:// connections are
only supported if --internet2 or
setInternet2(TRUE) was used (to make
use of Internet Explorer internals),
and then only if the certificate is
considered to be valid. With that
option only, the http://user:pass@site
notation for sites requiring
authentication is also accepted.
So your URL string should look like this:
http://username:password@domain.name:port/awstats.pl?month=02&year=2011&config=domain.name&lang=en&framename=mainright&output=alldomains
This might be Windows-only though.
Hope this helps!
read a password-protected page
Mojo::UserAgent (see cookbook) has a built-in cookie jar and can do SSL if you have IO::Socket::SSL installed. It has a DOM parser which can easily use CSS3 selectors to traverse the returned result. And if that wasn't good enough, the whole thing can be used non-blocking (if that's something you need).
Mojo::UserAgent and the other tools listed above are parts of the Mojolicious suite of tools. It's a Perl library, and I would certainly recommend Perl for this task since it is a more general purpose language than PHP is.
Here is a very simplistic example to get the text from all the links that are inside a div with a class myclass
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new;
$ua->post( 'http://mysite.com/login' => form => { ... } );
my @link_text =
$ua->get( 'http://mysite.com/protected/page' )
->res
->dom('div.myclass a')
->text
->each;
In fact, running this shell command may be enough to get you started (depending on permissions)
curl -L cpanmin.us | perl - -n Mojolicious IO::Socket::SSL
R - RCurl scrape data from a password-protected site
Updated 3/5/16 to work with package Relenium
#### FRONT MATTER ####
library(devtools)
library(RSelenium)
library(XML)
library(plyr)
######################
## This block will open the Firefox browser, which is linked to R
RSelenium::checkForServer()
remDr <- remoteDriver()
startServer()
remDr$open()
url="yoururl"
remDr$navigate(url)
This first section loads the required packages, sets the login URL, and then opens it in a Firefox instance. I type in my username & password, and then I'm in and can start scraping.
infoTable <- readHTMLTable(firefox$getPageSource(), header = TRUE)
infoTable
Table1 <- infoTable[[1]]
Apps <- Table1[,1] # Application Numbers
For this example, the first page contained two tables. The first is the one I'm interested and has a table of application numbers and names. I pull out the first column (application numbers).
Links2 <- paste("https://yourURL?ApplicantID=", Apps2, sep="")
The data I want are stored in invidiual applications, so this bit created the links that I want to loop through.
### Grabs contact info table from each page
LL <- lapply(1:length(Links2),
function(i) {
url=sprintf(Links2[i])
firefox$get(url)
firefox$getPageSource()
infoTable <- readHTMLTable(firefox$getPageSource(), header = TRUE)
if("First Name" %in% colnames(infoTable[[2]]) == TRUE) infoTable2 <- cbind(infoTable[[1]][1,], infoTable[[2]][1,])
else infoTable2 <- cbind(infoTable[[1]][1,], infoTable[[3]][1,])
print(infoTable2)
}
)
results <- do.call(rbind.fill, LL)
results
write.csv(results, "C:/pathway/results2.csv")
This final section loops through the link for each application, then grabs the table with their contact information (which is either table 2 OR table 3, so R has to check first). Thanks again to Chinmay Patil for the tip on relenium!
Getting data from password protected website via requests
For me it has nothing to do with python
or HTTP requests.
The site whiteboard.cactusglobal
is not an API, it is a website.
It is not meant to able you to access its page only programatically. It expects a real user that interacts with it thanks to its browser.
So for me, the tool that you need is Selenium. Or any User Testing Automation tool, really. That kind of tools will let you emulate a browser that goes to the website, is redirected to the log page, and enter the authentification information in the relevant fields, all within Python.
As your use-case is basic, if you understand the basic tutorials, in particular how to fill forms, you will easily find your way :)
How to query data from a password protected https website
Most likely the server sends a cookie once login is performed. You need to submit the same values as the login form. (this can be done using UploadValues()) However, you need to save the resulting cookies in a CookieContainer.
When I did this, I did it using HttpWebRequest, however per http://couldbedone.blogspot.com/2007/08/webclient-handling-cookies.html you can subclass WebClient and override the GetWebRequest() method to make it support cookies.
Oh, also, I found it useful to use Fiddler while manually accessing the web site to see what actually gets sent back and forth to the web site, so I knew what I was trying to reproduce.
edit, elaboration requested: I can only elaborate how to do it using HttpWebRequest, I have not done it using WebClient. Below is the code snippet I used for login.
private CookieContainer _jar = new CookieContainer();
private string _password;
private string _userid;
private string _url;
private string _userAgent;
...
string responseData;
HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(_url);
webRequest.CookieContainer = _jar;
webRequest.Method = "POST";
webRequest.ContentType = "application/x-www-form-urlencoded";
webRequest.UserAgent = _userAgent;
string requestBody = String.Format(
"client_id={0}&password={1}", _userid, _password);
try
{
using (StreamWriter requestWriter = new StreamWriter(webRequest.GetRequestStream()))
{
requestWriter.Write(requestBody);
requestWriter.Close();
using (HttpWebResponse res = (HttpWebResponse)webRequest.GetResponse())
{
using (StreamReader responseReader = new StreamReader(res.GetResponseStream()))
{
responseData = responseReader.ReadToEnd();
responseReader.Close();
if (res.StatusCode != HttpStatusCode.OK)
throw new WebException("Logon failed", null, WebExceptionStatus.Success, res);
}
}
}
Scrape password-protected website in R
I don't have an account to test with, but maybe this will work:
library(httr)
library(XML)
handle <- handle("http://subscribers.footballguys.com")
path <- "amember/login.php"
# fields found in the login form.
login <- list(
amember_login = "username"
,amember_pass = "password"
,amember_redirect_url =
"http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2"
)
response <- POST(handle = handle, path = path, body = login)
Now, the response object might hold what you need (or maybe you can directly query the page of interest after the login request; I am not sure the redirect will work, but it is a field in the web form), and handle
might be re-used for subsequent requests. Can't test it; but this works for me in many situations.
You can output the table using XML
> readHTMLTable(content(response))[[1]][1:5,]
Rank Name Tm/Bye Age Exp Cmp Att Cm% PYd Y/Att PTD Int Rsh Yd TD FantPt
1 1 Peyton Manning DEN/4 38 17 415 620 66.9 4929 7.95 43 12 24 7 0 407.15
2 2 Drew Brees NO/6 35 14 404 615 65.7 4859 7.90 37 16 22 44 1 385.35
3 3 Aaron Rodgers GB/9 31 10 364 560 65.0 4446 7.94 33 13 52 224 3 381.70
4 4 Andrew Luck IND/10 25 3 366 610 60.0 4423 7.25 27 13 62 338 2 361.95
5 5 Matthew Stafford DET/9 26 6 377 643 58.6 4668 7.26 32 19 34 102 1 358.60
Scraping password protected forum in r
Thanks to Simon I found the answer here: Using rvest or httr to log in to non-standard forms on a webpage
library(rvest)
url <-"http://forum.axishistory.com/memberlist.php"
pgsession <-html_session(url)
pgform <-html_form(pgsession)[[2]]
filled_form <- set_values(pgform,
"username" = "username",
"password" = "password")
submit_form(pgsession,filled_form)
memberlist <- jump_to(pgsession, "http://forum.axishistory.com/memberlist.php")
page <- html(memberlist)
usernames <- html_nodes(x = page, css = "#memberlist .username")
data_usernames <- html_text(usernames, trim = TRUE)
Related Topics
Convert Hours:Minutes:Seconds to Minutes
Ternary Plot and Filled Contour
Find Consecutive Values in Vector in R
Add Row in Each Group Using Dplyr and Add_Row()
R: Web Scraping Yahoo.Finance After 2019 Change
Align Two Data.Frames Next to Each Other with Knitr
How Does R Handle Unicode/Utf-8
Connect to Redshift via Ssl Using R
Reshape Data Long to Wide - Understanding Reshape Parameters
Data Table - Select Value of Column by Name from Another Column
Coloring Boxplot Outlier Points in Ggplot2
Change the Number of Breaks Using Facet_Grid in Ggplot2
Dplyr Summarize with Subtotals
R Programming: Cache the Inverse of a Matrix
Format Ttest Output by R for Tex