Reading Information from a Password Protected Site

If the site does use HTTP basic access authentication, R's documentation on connections provides some help:

URLs

Note that https:// connections are only supported if --internet2 or setInternet2(TRUE) was used (to make use of Internet Explorer internals), and then only if the certificate is considered to be valid. With that option only, the http://user:pass@site notation for sites requiring authentication is also accepted.

So your URL string should look like this:

http://username:password@domain.name:port/awstats.pl?month=02&year=2011&config=domain.name&lang=en&framename=mainright&output=alldomains

This probably works on Windows only, though, since the option relies on Internet Explorer internals.
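
Where the user:pass@site URL form is not available, the same credentials can usually be sent as an explicit Authorization header instead. The following is a minimal Python sketch, assuming the site really does use HTTP basic authentication; `basic_auth_request` is a hypothetical helper name, and the URL and credentials are placeholders.

```python
import base64
import urllib.request


def basic_auth_request(url, username, password):
    """Build a request that carries basic-auth credentials in a header."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder URL and credentials; nothing is sent until urlopen() is called.
request = basic_auth_request(
    "http://example.com/awstats.pl?month=02&year=2011", "username", "password"
)
# with urllib.request.urlopen(request) as response:
#     html = response.read().decode("utf-8", errors="replace")
```

Building the header by hand keeps the password out of the URL string, which may otherwise end up in logs.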

Hope this helps!

read a password-protected page

Mojo::UserAgent (see cookbook) has a built-in cookie jar and can do SSL if you have IO::Socket::SSL installed. It has a DOM parser which can easily use CSS3 selectors to traverse the returned result. And if that wasn't good enough, the whole thing can be used non-blocking (if that's something you need).

Mojo::UserAgent and the other tools listed above are parts of the Mojolicious suite of tools. It's a Perl library, and I would certainly recommend Perl for this task since it is a more general purpose language than PHP is.

Here is a very simple example that gets the text from all the links inside a div with the class myclass:

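use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new;

# Log in first; the user agent's cookie jar keeps the session cookie
$ua->post( 'http://mysite.com/login' => form => { ... } );

# Fetch the protected page and collect the text of each matching link
my @link_text =
  $ua->get( 'http://mysite.com/protected/page' )
     ->res
     ->dom
     ->find('div.myclass a')
     ->map('text')
     ->each;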

In fact, running this shell command may be enough to get you started (depending on permissions):

curl -L cpanmin.us | perl - -n  Mojolicious IO::Socket::SSL
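
For comparison, the link-extraction step can be sketched in Python with only the standard library. This is a rough stand-in for the `div.myclass a` selector, not a CSS engine; the class name `LinkTextParser` and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect the text of <a> elements nested inside <div class="myclass">."""

    def __init__(self, css_class="myclass"):
        super().__init__()
        self.css_class = css_class
        self.div_depth = 0    # > 0 while inside a matching div (nested divs counted)
        self.in_link = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.div_depth or self.css_class in classes:
                self.div_depth += 1
        elif tag == "a" and self.div_depth:
            self.in_link = True
            self.links.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.links[-1] += data


def link_texts(html, css_class="myclass"):
    parser = LinkTextParser(css_class)
    parser.feed(html)
    return parser.links


# Made-up markup standing in for the protected page.
sample = '<div class="myclass"><a href="/a">First</a></div><a href="/b">Other</a>'
texts = link_texts(sample)
```

Fetching the protected page itself would still need a session that keeps cookies across the login request, as Mojo::UserAgent does.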

R - RCurl scrape data from a password-protected site

Updated 3/5/16 to work with the RSelenium package

#### FRONT MATTER ####

library(devtools)
library(RSelenium)
library(XML)
library(plyr)

######################

## This block will open the Firefox browser, which is linked to R
RSelenium::checkForServer()
remDr <- remoteDriver() 
startServer()
remDr$open()
url="yoururl"
remDr$navigate(url)

This first section loads the required packages, sets the login URL, and then opens it in a Firefox instance. I type in my username & password, and then I'm in and can start scraping.

infoTable <- readHTMLTable(remDr$getPageSource()[[1]], header = TRUE)
infoTable
Table1 <- infoTable[[1]]
Apps <- Table1[,1] # Application Numbers

For this example, the first page contained two tables. The first is the one I'm interested in; it lists application numbers and names. I pull out the first column (application numbers).

Links2 <- paste("https://yourURL?ApplicantID=", Apps, sep="")

The data I want are stored in individual applications, so this bit creates the links that I want to loop through.

### Grabs contact info table from each page

LL <- lapply(seq_along(Links2),
function(i) {
  # Load the application page in the browser session that is already logged in
  remDr$navigate(Links2[i])
  infoTable <- readHTMLTable(remDr$getPageSource()[[1]], header = TRUE)

  # The contact table is table 2 on some pages and table 3 on others
  if ("First Name" %in% colnames(infoTable[[2]])) {
    infoTable2 <- cbind(infoTable[[1]][1,], infoTable[[2]][1,])
  } else {
    infoTable2 <- cbind(infoTable[[1]][1,], infoTable[[3]][1,])
  }

  # print() shows the row and also returns it to lapply()
  print(infoTable2)
}
)

results <- do.call(rbind.fill, LL)
results
write.csv(results, "C:/pathway/results2.csv")

This final section loops through the link for each application, then grabs the table with their contact information (which is either table 2 OR table 3, so R has to check first). Thanks again to Chinmay Patil for the tip on relenium!
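
The link-building and table-selection logic can be sketched separately from the browser automation. The Python below is only an illustration under assumptions: each table is represented as a list of row dictionaries keyed by column name, and the function names and sample values are hypothetical.

```python
def build_links(base_url, application_ids):
    """Build one URL per application number."""
    return [f"{base_url}?ApplicantID={app_id}" for app_id in application_ids]


def pick_contact_table(tables, marker="First Name"):
    """Return the second table if it has the marker column, else the third.

    Each table is a list of row dicts keyed by column name.
    """
    second = tables[1]
    if second and marker in second[0]:
        return second
    return tables[2]


def first_row_summary(tables):
    """Combine the first row of table 1 with the first row of the contact table."""
    summary = dict(tables[0][0])
    summary.update(pick_contact_table(tables)[0])
    return summary


# Made-up application numbers; the base URL is the placeholder used above.
links = build_links("https://yourURL", ["1001", "1002"])
```

Keeping the "table 2 or table 3" decision in its own function makes it easy to check against a saved page before looping over every application.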

Getting data from password protected website via requests

To me, this has nothing to do with Python or HTTP requests.
The site whiteboard.cactusglobal is not an API; it is a website.
It is not meant to be accessed programmatically. It expects a real user who interacts with it through a browser.

So to me, the tool you need is Selenium, or really any UI test automation tool. That kind of tool lets you drive a browser that goes to the website, is redirected to the login page, and enters the authentication information in the relevant fields, all from within Python.

As your use case is basic, once you understand the basic tutorials, in particular how to fill in forms, you will easily find your way :)
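
As a rough idea of what that looks like, here is a small Python sketch of the form-filling step. It assumes a Selenium 4 style driver object and a login form whose inputs are named `username` and `password`; those field names, the function name `log_in`, and the URL in the comment are guesses for illustration, so inspect the real page first.

```python
def log_in(driver, login_url, username, password,
           user_field="username", pass_field="password"):
    """Fill in and submit a login form, then return the resulting page source.

    `driver` is expected to behave like a Selenium WebDriver. The string
    "name" is the value of selenium's By.NAME locator strategy.
    """
    driver.get(login_url)
    driver.find_element("name", user_field).send_keys(username)
    password_box = driver.find_element("name", pass_field)
    password_box.send_keys(password)
    password_box.submit()   # submits the form that contains this element
    return driver.page_source


# Typical use (requires the selenium package and a browser driver):
#   from selenium import webdriver
#   driver = webdriver.Firefox()
#   html = log_in(driver, "https://example.com/login", "me", "secret")
#   driver.quit()
```

Because the function only relies on the driver's public methods, it can be tried out against a stub object before pointing it at a real browser.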

How to query data from a password protected https website

Most likely the server sends a cookie once login is performed. You need to submit the same values as the login form (this can be done using UploadValues()), and you need to save the resulting cookies in a CookieContainer.

When I did this, I did it using HttpWebRequest, however per http://couldbedone.blogspot.com/2007/08/webclient-handling-cookies.html you can subclass WebClient and override the GetWebRequest() method to make it support cookies.

Oh, also, I found it useful to use Fiddler while manually accessing the web site to see what actually gets sent back and forth to the web site, so I knew what I was trying to reproduce.

edit, elaboration requested: I can only elaborate how to do it using HttpWebRequest, I have not done it using WebClient. Below is the code snippet I used for login.

    private CookieContainer _jar = new CookieContainer();
    private string _password;
    private string _userid;
    private string _url;
    private string _userAgent;

...

        string responseData;

        HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(_url);

        webRequest.CookieContainer = _jar;
        webRequest.Method = "POST";
        webRequest.ContentType = "application/x-www-form-urlencoded";
        webRequest.UserAgent = _userAgent;

        string requestBody = String.Format(
            "client_id={0}&password={1}", _userid, _password);

        try
        {
            using (StreamWriter requestWriter = new StreamWriter(webRequest.GetRequestStream()))
            {

                requestWriter.Write(requestBody);
                requestWriter.Close();

                using (HttpWebResponse res = (HttpWebResponse)webRequest.GetResponse())
                {
                    using (StreamReader responseReader = new StreamReader(res.GetResponseStream()))
                    {

                        responseData = responseReader.ReadToEnd();
                        responseReader.Close();

                        if (res.StatusCode != HttpStatusCode.OK)
                            throw new WebException("Logon failed", null, WebExceptionStatus.Success, res);
                    }

                }
            }
        }
        catch (WebException)
        {
            // Let the caller decide how to handle a failed login.
            throw;
        }
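
The same flow, posting the login form and keeping the cookies for later requests, can be sketched in Python with the standard library. The field names `client_id` and `password` come from the snippet above; the helper names, the default user agent string, and the URLs in the comments are assumptions.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_cookie_opener():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(url, user_id, password, user_agent="example-client/0.1"):
    """Build a form-encoded POST request for the login page."""
    body = urllib.parse.urlencode({"client_id": user_id, "password": password})
    return urllib.request.Request(
        url,
        data=body.encode("ascii"),
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "User-Agent": user_agent,
        },
        method="POST",
    )


opener, jar = make_cookie_opener()
# login = build_login_request("https://example.com/login", "me", "secret")
# opener.open(login)                                   # cookies land in `jar`
# page = opener.open("https://example.com/data").read()  # cookies are sent back
```

`urlencode` percent-encodes special characters in the credentials, which a plain string format of the request body would not do.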

Scrape password-protected website in R

I don't have an account to test with, but maybe this will work:

library(httr)
library(XML)

handle <- handle("http://subscribers.footballguys.com") 
path   <- "amember/login.php"

# fields found in the login form.
login <- list(
  amember_login = "username"
 ,amember_pass  = "password"
 ,amember_redirect_url = 
   "http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2"
)

response <- POST(handle = handle, path = path, body = login)

The response object might now hold what you need. Alternatively, you may be able to query the page of interest directly after the login request; I am not sure the redirect will work, but it is a field in the web form. The handle can be reused for subsequent requests. I can't test this, but the approach works for me in many situations.

You can extract the table with the XML package:

> readHTMLTable(content(response))[[1]][1:5,]
  Rank             Name Tm/Bye Age Exp Cmp Att  Cm%  PYd Y/Att PTD Int Rsh  Yd TD FantPt
1    1   Peyton Manning  DEN/4  38  17 415 620 66.9 4929  7.95  43  12  24   7  0 407.15
2    2       Drew Brees   NO/6  35  14 404 615 65.7 4859  7.90  37  16  22  44  1 385.35
3    3    Aaron Rodgers   GB/9  31  10 364 560 65.0 4446  7.94  33  13  52 224  3 381.70
4    4      Andrew Luck IND/10  25   3 366 610 60.0 4423  7.25  27  13  62 338  2 361.95
5    5 Matthew Stafford  DET/9  26   6 377 643 58.6 4668  7.26  32  19  34 102  1 358.60
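
To show what the table-extraction step involves, here is a small Python sketch that pulls the rows of the first table out of an HTML string using only the standard library. It assumes simple markup without nested tables; `FirstTableParser` and `read_first_table` are hypothetical names, and the sample HTML is a cut-down stand-in for the real page.

```python
from html.parser import HTMLParser


class FirstTableParser(HTMLParser):
    """Collect the cell text of the first <table> as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._tables_seen = 0
        self._in_first = False
        self._cell = None   # text pieces of the cell being read, else None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._tables_seen += 1
            self._in_first = self._tables_seen == 1
        elif self._in_first and tag == "tr":
            self.rows.append([])
        elif self._in_first and tag in ("td", "th") and self.rows:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "table":
            self._in_first = False
        elif tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def read_first_table(html):
    parser = FirstTableParser()
    parser.feed(html)
    return parser.rows


sample = (
    "<table><tr><th>Rank</th><th>Name</th></tr>"
    "<tr><td>1</td><td>Peyton Manning</td></tr></table>"
)
rows = read_first_table(sample)
```

The first row holds the header cells, so `rows[1:]` are the data rows.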

Scraping password protected forum in r

Thanks to Simon I found the answer here: Using rvest or httr to log in to non-standard forms on a webpage

library(rvest)
url       <- "http://forum.axishistory.com/memberlist.php"
pgsession <- html_session(url)

# The login form is the second form on the page
pgform    <- html_form(pgsession)[[2]]

filled_form <- set_values(pgform,
                      "username" = "username",
                      "password" = "password")

submit_form(pgsession, filled_form)
memberlist <- jump_to(pgsession, "http://forum.axishistory.com/memberlist.php")

# Parse the member list page fetched within the logged-in session
page <- read_html(memberlist)

usernames <- html_nodes(x = page, css = "#memberlist .username")

data_usernames <- html_text(usernames, trim = TRUE)
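
The form-discovery step, picking the second form on the page and filling in its fields, can be illustrated in Python as follows. This is a simplified sketch that only looks at `<input>` elements; `find_forms` and `fill_form` are hypothetical helpers, and the sample page is made up.

```python
from html.parser import HTMLParser


class FormParser(HTMLParser):
    """Collect each <form> with its action, method and named <input> fields."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": {},
            }
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None and attrs.get("name"):
            self._current["fields"][attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def find_forms(html):
    parser = FormParser()
    parser.feed(html)
    return parser.forms


def fill_form(form, **values):
    """Return the form's fields with the given values filled in."""
    unknown = set(values) - set(form["fields"])
    if unknown:
        raise KeyError(f"no such field(s): {sorted(unknown)}")
    filled = dict(form["fields"])
    filled.update(values)
    return filled


page = (
    '<form action="/search"><input name="q"></form>'
    '<form action="/login" method="POST"><input name="username">'
    '<input name="password" type="password">'
    '<input type="hidden" name="sid" value="abc"></form>'
)
login_form = find_forms(page)[1]   # second form, like html_form(pgsession)[[2]]
payload = fill_form(login_form, username="username", password="password")
```

Hidden fields such as the made-up `sid` above are carried along unchanged, which matters for sites whose login forms are not standard.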
