How to Connect via Https Using Jsoup

How to connect via HTTPS using Jsoup?

If you want to do it the right way, or you only need to deal with one site, then grab the SSL certificate of the website in question and import it into a Java key store. This produces a JKS file, which you then set as the SSL trust store before using Jsoup (or java.net.URLConnection).

You can grab the certificate from your web browser's certificate store. Let's assume that you're using Firefox.

  1. Go to the website in question using Firefox, which is in your case https://web2.uconn.edu/driver/old/timepoints.php?stopid=10
  2. At the left of the address bar you'll see "uconn.edu" in blue (this indicates a valid SSL certificate)
  3. Click on it for details and then click on the More information button.
  4. In the security dialogue which appears, click the View Certificate button.
  5. In the certificate panel which appears, go to the Details tab.
  6. Click the deepest item of the certificate hierarchy, which is in this case "web2.uconn.edu" and finally click the Export button.

You now have a web2.uconn.edu.crt file.

Next, open the command prompt and import it into a Java key store using the keytool command (it's part of the JRE):

keytool -import -v -file /path/to/web2.uconn.edu.crt -keystore /path/to/web2.uconn.edu.jks -storepass drowssap

The -file option must point to the location of the .crt file which you just downloaded. The -keystore option must point to the location of the generated .jks file (which you in turn want to set as the SSL trust store). The -storepass option is required; you can enter whatever password you want as long as it's at least 6 characters.
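If you would rather do the import from code than from the command line, roughly the same result can be sketched with the java.security.KeyStore API. This is a minimal sketch and not part of the original answer: the class TrustStoreWriter and its method names are made up for illustration, and the alias "mykey" simply mirrors keytool's default alias.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.KeyStore;
import java.security.cert.Certificate;
import java.security.cert.CertificateFactory;
import java.util.Collections;
import java.util.Map;

// Hypothetical helper; the names are illustrative only.
final class TrustStoreWriter {

    // Builds an in-memory JKS key store holding the given certificates as trusted entries.
    static KeyStore buildTrustStore(Map<String, Certificate> certsByAlias) throws Exception {
        KeyStore store = KeyStore.getInstance("JKS");
        store.load(null, null); // initialise an empty store
        for (Map.Entry<String, Certificate> entry : certsByAlias.entrySet()) {
            store.setCertificateEntry(entry.getKey(), entry.getValue());
        }
        return store;
    }

    // Reads one X.509 certificate (PEM or DER encoded) from a file such as web2.uconn.edu.crt.
    static Certificate readCertificate(Path crtFile) throws Exception {
        try (InputStream in = Files.newInputStream(crtFile)) {
            return CertificateFactory.getInstance("X.509").generateCertificate(in);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: TrustStoreWriter <file.crt> <file.jks> <storepass>");
            return;
        }
        Certificate cert = readCertificate(Paths.get(args[0]));
        KeyStore store = buildTrustStore(Collections.singletonMap("mykey", cert));
        // Write the store to disk, protected by the given password.
        try (OutputStream out = Files.newOutputStream(Paths.get(args[1]))) {
            store.store(out, args[2].toCharArray());
        }
    }
}
```

Unlike keytool, this sketch does not ask you to confirm that you trust the certificate, so check the file yourself before importing it.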

You now have a web2.uconn.edu.jks file. You can finally set it as the SSL trust store before connecting, as follows:

System.setProperty("javax.net.ssl.trustStore", "/path/to/web2.uconn.edu.jks");
Document document = Jsoup.connect("https://web2.uconn.edu/driver/old/timepoints.php?stopid=10").get();
// ...
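Note that the system property replaces the trust store for the whole JVM. If you only want the custom store to apply to selected connections, one option is to build an SSLSocketFactory from the JKS file yourself. The following is a sketch under that assumption, using only JDK classes; TrustStoreSockets and fromTrustStore are hypothetical names, not something from the original answer.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.KeyStore;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManagerFactory;

// Hypothetical helper; the names are illustrative only.
final class TrustStoreSockets {

    // Creates a socket factory that trusts only the certificates in the given JKS file.
    static SSLSocketFactory fromTrustStore(Path jksFile, char[] password) throws Exception {
        KeyStore trustStore = KeyStore.getInstance("JKS");
        try (InputStream in = Files.newInputStream(jksFile)) {
            trustStore.load(in, password);
        }

        // Derive trust managers from the loaded store instead of the JVM default.
        TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(trustStore);

        SSLContext context = SSLContext.getInstance("TLS");
        context.init(null, tmf.getTrustManagers(), null);
        return context.getSocketFactory();
    }
}
```

Recent Jsoup versions accept such a factory per connection via Connection.sslSocketFactory(...), e.g. Jsoup.connect(url).sslSocketFactory(factory).get(), which leaves the rest of the JVM on the default trust store.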

As a completely different alternative, particularly when you need to deal with multiple sites (e.g. you're writing a web crawler), you can also instruct Jsoup (basically, java.net.URLConnection) to blindly trust all SSL certificates. See also section "Dealing with untrusted or misconfigured HTTPS sites" at the very bottom of this answer: Using java.net.URLConnection to fire and handle HTTP requests

Jsoup connection to https(keystore)

You could do one of two things. Both are covered by existing answers, so I am just referring to them here.

Generally allow all certificates

Read this answer: https://stackoverflow.com/a/2793153/3977134

And the corresponding code is:

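// Imports needed by this snippet:
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.HostnameVerifier;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSession;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

TrustManager[] trustAllCertificates = new TrustManager[] {
    new X509TrustManager() {
        @Override
        public X509Certificate[] getAcceptedIssuers() {
            return new X509Certificate[0]; // Not relevant; the interface contract asks for a non-null array.
        }
        @Override
        public void checkClientTrusted(X509Certificate[] certs, String authType) {
            // Do nothing. Just allow them all.
        }
        @Override
        public void checkServerTrusted(X509Certificate[] certs, String authType) {
            // Do nothing. Just allow them all.
        }
    }
};

HostnameVerifier trustAllHostnames = new HostnameVerifier() {
    @Override
    public boolean verify(String hostname, SSLSession session) {
        return true; // Just allow them all.
    }
};

// Run this once (e.g. in a static initializer) before the first connection.
try {
    System.setProperty("jsse.enableSNIExtension", "false");
    SSLContext sc = SSLContext.getInstance("SSL");
    sc.init(null, trustAllCertificates, new SecureRandom());
    // These defaults apply to every HttpsURLConnection in the JVM.
    HttpsURLConnection.setDefaultSSLSocketFactory(sc.getSocketFactory());
    HttpsURLConnection.setDefaultHostnameVerifier(trustAllHostnames);
}
catch (GeneralSecurityException e) {
    throw new ExceptionInInitializerError(e);
}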

Add the certificate to the store of your JRE

This method requires you to download the CRT file, e.g. from your browser. After that, import it into your JRE's trust store using the keytool command, which is part of the JRE.

A complete answer is here: https://stackoverflow.com/a/7745706/3977134

How to log in to an HTTPS website with Jsoup?

The sign-in is handled by AJAX. I'm using Chrome, so this is what I did.
Try to log in via the form from a browser. Press F12 and then open the Console tab.
You will see something like this: XHR finished loading: POST "https://www.tickld.com/ajax/login.php". Normally, you send the POST request to the URL given in the action attribute of the form tag.
In this case, no such URL exists, because the submission is handled by JavaScript.

Try this and see if it works.

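// loginForm is not defined in this snippet; presumably it is the Connection.Response
// of an earlier request to the sign-in page (an assumption, it is not shown here).
Document document = Jsoup.connect("https://www.tickld.com/ajax/login.php")
    .data("l_username", "myUsername")
    .data("l_password", "myPassword")
    .cookies(loginForm.cookies())
    .post();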

If it doesn't, then you might need to use a headless browser (which can handle JavaScript execution), for example via Selenium WebDriver.

Update

Connection.Response login = Jsoup.connect("https://www.tickld.com/signin")
    .data("l_username", "myUsername")
    .data("l_password", "myPassword")
    .method(Connection.Method.POST)
    .execute();

Document document = Jsoup.connect("http://www.tickld.com/user/chosimbaaaa")
    .cookies(login.cookies())
    .get();
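To see why the second request is treated as logged in: login.cookies() is a plain Map<String, String>, and passing it to cookies(...) makes Jsoup send those pairs back in a Cookie request header. The sketch below only illustrates that header format with JDK classes. It is not Jsoup's implementation, and the class name and cookie names are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical illustration of the Cookie request header format.
final class CookieHeaderSketch {

    // Joins name=value pairs with "; ", as they appear in a Cookie request header.
    static String toCookieHeader(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> cookie : cookies.entrySet()) {
            joiner.add(cookie.getKey() + "=" + cookie.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Made-up session cookies standing in for whatever the login response returns.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("PHPSESSID", "abc123");
        cookies.put("remember", "1");
        System.out.println("Cookie: " + toCookieHeader(cookies));
        // prints "Cookie: PHPSESSID=abc123; remember=1"
    }
}
```

If the site ties the session to a cookie that is only set by JavaScript, it will not be in login.cookies(), which is one reason a plain Jsoup login can still fail.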

Using jsoup to connect to untrusted certificate

If you trust the site, you can ignore HTTP error responses (such as 404 or 500 status codes) by passing true to ignoreHttpErrors:

Document doc = Jsoup.connect("your_url").ignoreHttpErrors(true).get();

To skip TLS certificate validation, call validateTLSCertificates(false):

Document doc = Jsoup.connect("your_url").validateTLSCertificates(false).get();

EDIT

This answer is outdated, as Jsoup deprecated the validateTLSCertificates method and removed it in version 1.12.1 (https://jsoup.org/news/release-1.12.1).

If you trust the questionable site and want to skip TLS validation, look at this answer: how-to-resolve-jsoup-error-unable-to-find-valid-certification-path
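For newer Jsoup versions, the usual replacement is to hand Jsoup a custom SSLSocketFactory for that one connection. The following is a minimal sketch of such a trust-everything factory, assuming you really do accept the risk; TrustAllSockets and trustAllFactory are hypothetical names, and only JDK classes are used.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical helper; the names are illustrative only.
final class TrustAllSockets {

    // Returns a socket factory that accepts ANY server certificate chain (insecure).
    static SSLSocketFactory trustAllFactory() throws GeneralSecurityException {
        TrustManager[] trustAll = {
            new X509TrustManager() {
                @Override
                public void checkClientTrusted(X509Certificate[] chain, String authType) {
                    // Intentionally empty: no validation at all.
                }
                @Override
                public void checkServerTrusted(X509Certificate[] chain, String authType) {
                    // Intentionally empty: no validation at all.
                }
                @Override
                public X509Certificate[] getAcceptedIssuers() {
                    return new X509Certificate[0];
                }
            }
        };
        SSLContext context = SSLContext.getInstance("TLS");
        context.init(null, trustAll, new SecureRandom());
        return context.getSocketFactory();
    }
}
```

You would then pass it along the lines of Jsoup.connect("your_url").sslSocketFactory(TrustAllSockets.trustAllFactory()).get(). This only relaxes the certificate chain check for that connection; hostname verification is a separate check and is not disabled by this sketch.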

Jsoup HTTPS connecting

Try the following (just put it before Jsoup.connect("https://login.emu.dk")):

Authenticator.setDefault(new Authenticator() {
    @Override
    protected PasswordAuthentication getPasswordAuthentication() {
        return new PasswordAuthentication(username, password.toCharArray());
    }
});
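The Authenticator is only consulted when the server answers with an authentication challenge. If the site happens to use HTTP Basic authentication (an assumption; the original answer does not say which scheme login.emu.dk uses), a different technique is to send the credentials preemptively in an Authorization header. A small sketch of building that header value, with a made-up class name:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper; assumes the server uses HTTP Basic authentication.
final class BasicAuthHeader {

    // Builds the value of an "Authorization" header for HTTP Basic authentication.
    static String basicAuth(String username, String password) {
        String credentials = username + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        System.out.println(basicAuth("user", "pass")); // prints "Basic dXNlcjpwYXNz"
    }
}
```

With Jsoup this could be attached as Jsoup.connect(url).header("Authorization", BasicAuthHeader.basicAuth(username, password)). Only do this over HTTPS, since Base64 is an encoding and not encryption.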

Scraping A Webpage With JSOUP and Given An SSL Error. Is This A Site Specific Issue? (JSOUP Works On Other Websites)

Although, as Firefox shows, the cert used by this server does validate using the intermediate CA Sectigo RSA Domain Validation Secure Server CA and the root CA USERTrust RSA Certification Authority, the server sends only the leaf cert and omits the intermediate 'chain' cert that the standards require it to send.

You can see this by looking at the SSLLabs test report; notice the orange warning in the summary and the note near the bottom of the cert details box. Alternatively, if you have (or get) OpenSSL, run openssl s_client -connect www.strack.de:443 -showcerts (many servers today require SNI, and for OpenSSL below 1.1.1 to send SNI you need to add -servername $host, but this server does not need it), or, since you have Java, run keytool -printcert -sslserver www.strack.de.

Omitting the required chain cert(s) is a common mistake by server admins who don't bother reading documentation, because if they only test with a browser or two they don't notice the mistake -- browsers can frequently work around the missing chain cert(s), but most other software, including Java, either cannot or does not by default. It is unlikely to be intended as a deliberate anti-scraping measure since it is easily bypassed, see next, but it does suggest the server admin doesn't have an actual goal or interest to support or assist scraping.

Instead of ignoring all cert problems as suggested by Krystian you can fix this by obtaining the chain cert -- e.g. by exporting from Firefox or by fetching the caIssuer link in the cert http://crt.sectigo.com/SectigoRSADomainValidationSecureServerCA.crt (shown in the SSLLabs report, or the keytool -printcert decode, or if you run the openssl s_client output into openssl x509 -noout -text) -- and adding it to your truststore (by default the file $JREDIR/lib/security/cacerts or jssecacerts unless you change it with a sysprop or code).
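If you would rather not modify the JRE's cacerts file, one way to do this in code is to build an in-memory trust store that contains the JRE's default CAs plus the extra chain cert. The following is a rough sketch using only JDK classes; MergedTrustStore and its method names are hypothetical, and it assumes you have already loaded the downloaded cert as an X509Certificate (e.g. via CertificateFactory.getInstance("X.509")).

```java
import java.security.KeyStore;
import java.security.cert.X509Certificate;
import java.util.List;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;

// Hypothetical helper; the names are illustrative only.
final class MergedTrustStore {

    // Returns the CA certificates trusted by the JRE's default trust store.
    static X509Certificate[] defaultIssuers() throws Exception {
        TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init((KeyStore) null); // null selects the default trust store
        for (TrustManager tm : tmf.getTrustManagers()) {
            if (tm instanceof X509TrustManager) {
                return ((X509TrustManager) tm).getAcceptedIssuers();
            }
        }
        return new X509Certificate[0];
    }

    // Builds an in-memory store with the default CAs plus the given extra certs.
    static KeyStore withExtraCerts(List<X509Certificate> extraCerts) throws Exception {
        KeyStore merged = KeyStore.getInstance("JKS");
        merged.load(null, null); // start from an empty store
        int i = 0;
        for (X509Certificate ca : defaultIssuers()) {
            merged.setCertificateEntry("default-" + i++, ca);
        }
        int j = 0;
        for (X509Certificate extra : extraCerts) {
            merged.setCertificateEntry("extra-" + j++, extra);
        }
        return merged;
    }
}
```

The resulting KeyStore can be passed to TrustManagerFactory.init(KeyStore) to build an SSLContext, so other HTTPS sites keep validating against the normal roots while this site validates through the added intermediate.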

(added) Re Firefox, you already found that the UI has changed slightly in the years since the question you linked: you now click on the padlock, then the right-arrow, More Information, View Certificate. To export a specific cert, click the tab for "Sectigo RSA ...", scroll about halfway down, then click "PEM(cert)" and save it somewhere appropriate.

You could also report this problem to the site admin or owner(s). Whether they will care about you or non-browser access in general I have no idea.


