Should I Call ugi.checkTGTAndReloginFromKeytab() Before Every Action on Hadoop?

Should I call ugi.checkTGTAndReloginFromKeytab() before every action on hadoop?

Hadoop committer here! This is an excellent question.

Unfortunately, it's difficult to give a definitive answer to this without a deep dive into the particular usage patterns of the application. Instead, I can offer general guidelines and describe when Hadoop would handle ticket renewal or re-login from a keytab automatically for you, and when it wouldn't.

The primary use case for Kerberos authentication in the Hadoop ecosystem is Hadoop's RPC framework, which uses SASL for authentication. Most of the daemon processes in the Hadoop ecosystem handle this by doing a single one-time call to UserGroupInformation#loginUserFromKeytab at process startup. Examples of this include the HDFS DataNode, which must authenticate its RPC calls to the NameNode, and the YARN NodeManager, which must authenticate its calls to the ResourceManager. How is it that daemons like the DataNode can do a one-time login at process startup and then keep on running for months, long past typical ticket expiration times?
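
For illustration, here is a minimal sketch of that daemon-style pattern; the principal and keytab path are hypothetical placeholders, not values from any real deployment:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class DaemonStartup {
  public static void main(String[] args) throws Exception {
    // Tell the Hadoop security layer to use Kerberos.
    Configuration conf = new Configuration();
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);

    // One-time login at process startup; principal and keytab path
    // are placeholders.
    UserGroupInformation.loginUserFromKeytab(
        "dn/host.example.com@EXAMPLE.COM",
        "/etc/security/keytabs/dn.service.keytab");

    // ... long-running daemon work follows ...
  }
}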

Since this is such a common use case, Hadoop implements an automatic re-login mechanism directly inside the RPC client layer. The code for this is visible in the RPC Client#handleSaslConnectionFailure method:

// try re-login
if (UserGroupInformation.isLoginKeytabBased()) {
  UserGroupInformation.getLoginUser().reloginFromKeytab();
} else if (UserGroupInformation.isLoginTicketBased()) {
  UserGroupInformation.getLoginUser().reloginFromTicketCache();
}

You can think of this as "lazy evaluation" of re-login. It only re-executes login in response to an authentication failure on an attempted RPC connection.

Knowing this, we can give a partial answer. If your application's usage pattern is to login from a keytab and then perform typical Hadoop RPC calls, then you likely do not need to roll your own re-login code. The RPC client layer will do it for you. "Typical Hadoop RPC" means the vast majority of Java APIs for interacting with Hadoop, including the HDFS FileSystem API, the YarnClient and MapReduce Job submissions.
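
As a sketch, a pattern like the following (principal, keytab path, and directory are placeholders) is covered by the RPC layer's automatic re-login, because FileSystem calls ride on Hadoop RPC:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class RpcBasedClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);
    UserGroupInformation.loginUserFromKeytab(
        "app@EXAMPLE.COM", "/etc/security/keytabs/app.keytab");

    // FileSystem operations go through Hadoop RPC, so the client will
    // re-login from the keytab if a connection hits an auth failure.
    try (FileSystem fs = FileSystem.get(conf)) {
      for (FileStatus status : fs.listStatus(new Path("/user/app"))) {
        System.out.println(status.getPath());
      }
    }
  }
}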

However, some application usage patterns do not involve Hadoop RPC at all. An example of this would be applications that interact solely with Hadoop's REST APIs, such as WebHDFS or the YARN REST APIs. In that case, the authentication model uses Kerberos via SPNEGO as described in the Hadoop HTTP Authentication documentation.

Knowing this, we can add more to our answer. If your application's usage pattern does not utilize Hadoop RPC at all, and instead sticks solely to the REST APIs, then you must roll your own re-login logic. This is exactly why WebHdfsFileSystem calls UserGroupInformation#checkTGTAndReloginFromKeytab, just like you noticed. WebHdfsFileSystem chooses to make the call right before every operation. This is a fine strategy, because UserGroupInformation#checkTGTAndReloginFromKeytab only renews the ticket if it's "close" to expiration. Otherwise, the call is a no-op.
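
If you're calling the REST APIs directly rather than through WebHdfsFileSystem, a hedged sketch of the same strategy might look like this; the helper and URL are hypothetical, and the SPNEGO negotiation itself is omitted:

import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.hadoop.security.UserGroupInformation;

public class RestBasedClient {
  // Hypothetical helper mirroring the WebHdfsFileSystem strategy:
  // re-check the TGT immediately before each REST operation.
  static HttpURLConnection openConnection(String url) throws Exception {
    // No-op unless the ticket is close to expiration.
    UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab();
    // The actual SPNEGO handshake is handled elsewhere (e.g. by the
    // hadoop-auth AuthenticatedURL class); it is omitted here.
    return (HttpURLConnection) new URL(url).openConnection();
  }
}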

As a final use case, let's consider an interactive process, not logging in from a keytab, but rather requiring the user to run kinit externally before launching the application. In the vast majority of cases, these are going to be short-running applications, such as Hadoop CLI commands. However, in some cases these can be longer-running processes. To support longer-running processes, Hadoop starts a background thread to renew the Kerberos ticket "close" to expiration. This logic is visible in UserGroupInformation#spawnAutoRenewalThreadForUserCreds. There is an important distinction here though compared to the automatic re-login logic provided in the RPC layer. In this case, Hadoop only has the capability to renew the ticket and extend its lifetime. Tickets have a maximum renewable lifetime, as dictated by the Kerberos infrastructure. After that, the ticket won't be usable anymore. Re-login in this case is practically impossible, because it would imply re-prompting the user for a password, and they likely walked away from the terminal. This means that if the process keeps running beyond expiration of the ticket, it won't be able to authenticate anymore.

Again, we can use this information to inform our overall answer. If you rely on a user to login interactively via kinit before launching the application, and if you're confident the application won't run longer than the Kerberos ticket's maximum renewable lifetime, then you can rely on Hadoop internals to cover periodic renewal for you.
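
In code, that interactive case requires no explicit login call at all. A minimal sketch, assuming the user already ran kinit before launching:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class InteractiveClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);

    // No keytab login: getLoginUser() picks up the ticket obtained by
    // the external kinit, and Hadoop's background renewal thread keeps
    // renewing it until the maximum renewable lifetime runs out.
    UserGroupInformation ugi = UserGroupInformation.getLoginUser();
    System.out.println("Running as: " + ugi.getUserName());
  }
}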

If you're using keytab-based login, and you're just not sure if your application's usage pattern can rely on the Hadoop RPC layer's automatic re-login, then the conservative approach is to roll your own. @SamsonScharfrichter gave an excellent answer here about rolling your own.

HBase Kerberos connection renewal strategy

Finally, I should add a note about API stability. The Apache Hadoop Compatibility guidelines discuss the Hadoop development community's commitment to backwards-compatibility in full detail. The interface of UserGroupInformation is annotated LimitedPrivate and Evolving. Technically, this means the API of UserGroupInformation is not considered public, and it could evolve in backwards-incompatible ways. As a practical matter, there is a lot of code already depending on the interface of UserGroupInformation, so it's simply not feasible for us to make a breaking change. Certainly within the current 2.x release line, I would not have any fear about method signatures changing out from under you and breaking your code.

Now that we have all of this background information, let's revisit your concrete questions.

Can I rely on the various Hadoop clients to call checkTGTAndReloginFromKeytab whenever it's needed?

You can rely on this if your application's usage pattern is to call the Hadoop clients, which in turn utilize Hadoop's RPC framework. You cannot rely on this if your application's usage pattern only calls the Hadoop REST APIs.

Should I ever call checkTGTAndReloginFromKeytab myself in my code?

You'll likely need to do this if your application's usage pattern is solely to call the Hadoop REST APIs instead of Hadoop RPC calls. You would not get the benefit of the automatic re-login implemented inside Hadoop's RPC client.

If so should I do that before every single call to ugi.doAs(...) or rather setup a timer and call it periodically (how often)?

It's fine to call UserGroupInformation#checkTGTAndReloginFromKeytab right before every action that needs to be authenticated. If the ticket is not close to expiration, then the method will be a no-op. If you're suspicious that your Kerberos infrastructure is sluggish, and you don't want client operations to pay the latency cost of re-login, then that would be a reason to do it in a separate background thread. Just be sure to stay a little bit ahead of the ticket's actual expiration time. You might borrow the logic inside UserGroupInformation for determining if a ticket is "close" to expiration. In practice, I've never personally seen the latency of re-login be problematic.
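
For the per-call variant, a hedged sketch (the wrapper is hypothetical) is as simple as:

import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;

public class PerActionRelogin {
  // Hypothetical wrapper: re-check the TGT right before each privileged
  // action. Cheap, because the re-login only happens when the ticket is
  // close to expiration; otherwise the first call is a no-op.
  static <T> T runAuthenticated(UserGroupInformation ugi,
      PrivilegedExceptionAction<T> action) throws Exception {
    ugi.checkTGTAndReloginFromKeytab();
    return ugi.doAs(action);
  }
}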

Auto renewal of Kerberos ticket not working from Java

Unfortunately, there is a known issue with automatic renewal not working correctly when using the UserGroupInformation#loginUserFromKeytabAndReturnUGI method. I am not aware of any known code fix within Apache Hadoop at this time.

Your solution to add a call to UserGroupInformation#checkTGTAndReloginFromKeytab is a viable workaround. I recommend that you stick with that for now and keep an eye on Apache Hadoop release notes to see if there is a fix committed in the future.
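
Concretely, that workaround might look like the following sketch (principal and keytab path are placeholders); the key point is calling checkTGTAndReloginFromKeytab on the returned UGI instance, since automatic renewal does not cover it:

import org.apache.hadoop.security.UserGroupInformation;

public class ReturnUgiWorkaround {
  public static void main(String[] args) throws Exception {
    UserGroupInformation ugi =
        UserGroupInformation.loginUserFromKeytabAndReturnUGI(
            "app@EXAMPLE.COM", "/etc/security/keytabs/app.keytab");

    // Workaround: automatic renewal does not cover this UGI, so
    // re-check its TGT explicitly before each authenticated operation.
    ugi.checkTGTAndReloginFromKeytab();
    // ... perform the work inside ugi.doAs(...) ...
  }
}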

How does Kerberos handle multiple TGT request in the same node for the same service and for the same client?

Yes, you can update your program to use this keytab rather than relying on a TGT to already exist in the cache. This is done by using the UserGroupInformation class from the Hadoop Security package.

import java.security.PrivilegedExceptionAction

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation
import org.apache.hadoop.security.UserGroupInformation.AuthenticationMethod

val configuration = new Configuration
configuration.addResource("/etc/hadoop/conf/hdfs-site.xml")
UserGroupInformation.setConfiguration(configuration)

UserGroupInformation.getCurrentUser.setAuthenticationMethod(AuthenticationMethod.KERBEROS)
UserGroupInformation.loginUserFromKeytabAndReturnUGI(
    "hadoop.kerberos.principal", "path of hadoop.kerberos.keytab file")
  .doAs(new PrivilegedExceptionAction[Unit]() {
    override def run(): Unit = {
      // logic
    }
  })

Above we specify the name of our service principal and the path to the keytab file we generated. As long as that keytab is valid, our program will use the desired service principal for all actions, regardless of whether or not the user running the program has already authenticated and received a TGT.

HBase with Kerberos - keep a HTable instance open more than 10 hours

I happened to be doing some research in this forum earlier. The problem statement here, where Kerberos authentication dies after 10 hours, is nearly identical to that of this thread:

Renewing a connection to Apache Phoenix (using Kerberos) fails after exactly 10 hours

I actually just edited that thread earlier today and placed the "10 hours" into the Subject line. That thread contains some great advice on what to do here. I'm going to go ahead and borrow the good wisdom provided by Samson Scharfrichter who stated in it: "The standard solution is to spawn a background thread invoking checkTGTAndReloginFromKeytab() periodically -- see Should I call ugi.checkTGTAndReloginFromKeytab() before every action on hadoop? for a very elaborate explanation by a HortonWorks guru (a colleague of the guy who wrote that GitBook about Hadoop & Kerberos)"

I hope this points you in the right direction.

HBase Kerberos connection renewal strategy

A Kerberos TGT has a lifetime (e.g. 12h) and a renewable lifetime (e.g. 7 days). As long as the ticket is still valid and is still renewable, you can request a "free" renewal -- no password required -- and the lifetime counter is reset (e.g. 12h to go, again).

The Hadoop authentication library spawns a specific Java thread for automatic renewal of the current TGT. It's kind of ugly, using a kinit -R command line instead of a JAAS library call, but it works - see HADOOP-6656

So, if you get Slider to create a renewable ticket on startup, and if you can bribe your SysAdmin to raise the default (cf. client conf) and the max (cf. KDC conf) renewable lifetime to, say, 30 days, then your app could run for 30 days straight with the initial TGT. A nice improvement.
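
For reference, the relevant knobs might look like this; the values are illustrative, not recommendations:

# Client side, /etc/krb5.conf: ask for renewable tickets by default
[libdefaults]
    ticket_lifetime = 12h
    renew_lifetime  = 30d

# KDC side: raise the principal's maximum renewable lifetime, e.g.
#   kadmin: modprinc -maxrenewlife 30days app@EXAMPLE.COM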

~~~~~~~~~~

If you really crave eternity... sorry, but you will actually have some programming to do. That means a dedicated thread/process in charge of automagically re-creating the TGT.

  • The Java Way: on startup, before you connect to HBase/HDFS/whatever,
    explicitly create a UGI with loginUserFromKeytab(), then run
    checkTGTAndReloginFromKeytab() from time to time (see the sketch
    after this list)
  • The Shell Way: start a shell that (a) creates a TGT with kinit, (b)
    spawns a sub-process that periodically fires kinit again, and (c)
    launches your Java app, then kills the sub-process when/if your app
    ever terminates
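
A minimal sketch of the Java Way; the principal, keytab path, and the one-hour period are placeholder choices, not recommendations:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.security.UserGroupInformation;

public class KeytabReloginThread {
  public static void main(String[] args) throws Exception {
    // Placeholders for the real principal and keytab path.
    UserGroupInformation.loginUserFromKeytab(
        "app@EXAMPLE.COM", "/etc/security/keytabs/app.keytab");

    // Re-check the TGT well ahead of a typical ticket lifetime; the
    // call is a no-op unless the ticket is close to expiration.
    ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(() -> {
      try {
        UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab();
      } catch (Exception e) {
        e.printStackTrace(); // log it; retry happens on the next tick
      }
    }, 1, 1, TimeUnit.HOURS);

    // ... connect to HBase/HDFS and do the real work here ...
  }
}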

Caveat: if some other thread happens to open, or re-open, a connection while the TGT is being re-created, that connection may fail because the cache was empty at the exact time it was accessed ("race condition"). The next attempt will be successful, but expect a few rogue warnings in your logs.

~~~~~~~~~~

Final advice: you can use a private ticket cache for your app (i.e. you can run multiple apps on the same node with the same Linux account but different Kerberos principals) by setting KRB5CCNAME environment variable, as long as it's a "FILE:" cache.
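
For example (the cache path, keytab, and principal are hypothetical):

# Give this app its own private FILE: ticket cache
export KRB5CCNAME=FILE:/tmp/krb5cc_app1
kinit -kt /etc/security/keytabs/app1.keytab app1@EXAMPLE.COM
java -jar app1.jar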


