certification team query: allowing key rotation during tests

Issue #1471 new
Joseph Heenan created an issue

The certification team have recently received a request from an OP that is failing, I believe, pretty much every OpenID Connect Core Basic profile test.

The request is that all tests support key rotation whilst the test is running, and potentially multiple new keys during a test. (Currently the tests assume that all the keys needed for the test are present in the server jwks_uri at the start of the test.)

The reason the OP is requesting this is, as I understand it, that they dynamically create a new key in the server jwks_uri each time an id_token is issued.

From memory, I recall that previously we’ve explicitly allowed RPs to rate-limit fetching of the jwks_uri; I forget the exact details but I believe there are clients that (even when presented with an unknown kid) won’t refetch jwks_uri if they’ve fetched it in the last (say) 60 seconds - and that position seems incompatible with the requested change.

Any guidance from the working group would be welcome.

Comments (11)

  1. Filip Skokan

    I believe the suite’s behaviour reflects what a regular client during normal operation would encounter. The suite, like a regular client, should be ready for a key rollover but definitely not as excessive as you described.

  2. Joseph Heenan reporter

    Thanks @panva.

    The suite, like a regular client, should be ready for a key rollover but definitely not as excessive as you described.

    Can you expand on this a little please? Mostly I’m interested in whether “The suite assumes that any key the OP intends to use in the next few minutes is published in its jwks_uri at the start of the test” is an okay assumption, i.e. the suite is hardwired to assume that there is a delay between publishing a key and assuming that key will be accepted by the other party.

    I think that (at least for regular key rotations) OPs in practice tend to publish them well in advance - e.g. in the case of Okta they publish the new key weeks in advance of starting to use the key: https://developer.okta.com/docs/concepts/key-rotation/

  3. Filip Skokan

    at least for regular key rotations

    That is correct. But during a regular client’s operation, the client has no need to re-fetch the jwks_uri until the time it cannot resolve a usable key. The client cannot distinguish between a planned key rotation and an “emergency” one. As such it should not depend on keys being always advertised upfront, otherwise it would have to resort to scheduled re-fetching of jwks_uri, which is nowhere prescribed.

    What I’m trying to say is that I sympathise with the certifying party on behalf of which this query is presented. I can see the reasoning behind questioning the suite’s behaviour. But if they rotate each time an id_token is issued, that is not interoperable and is not meeting the Core Basic certification requirements.

  4. Joseph Heenan reporter

    Thanks Filip. Your point about planned key rotation vs emergency is a good one.

    For clarity the conformance suite copes fine with a ‘planned’ key rotation (one where the key is published “sometime” in advance of use of the key starting); the suite doesn’t care if the key used to sign the id_token changes during the test so long as the key was present in the jwks_uri at the start of the test.

    An emergency key rotation during a test would probably cause the test to fail - I think you’d have to be fantastically unlucky for an emergency key rotation to occur during a certification test, but if it did happen the test would pass fine on a rerun.

  5. Joseph Heenan reporter

    We discussed on today’s connect WG call, with the main people participating in the conversation being myself, Yannik @ comuny.de (who originally raised the issue with the certification team) and John Bradley with a few comments from Tom Jones.

    John mentioned that the Cache-Control: header is really part of this mechanism (and the certification suite doesn’t really consider that currently). The spec only explicitly mentions Cache-Control in the context of rotating encryption keys (and not signing keys, as we’re talking about in this issue). Arguably, though, anyone worried about a denial of service attack would still enforce a minimum time between fetches even if it had received a Cache-Control: no-store header on the jwks_uri response.

    The same text when interpreted for the OP fetching the RP’s keys (OP-Rotation-RP-Sig) explicitly allows an OP to rate limit fetches of the jwks_uri to 60 seconds. (John did question whether this test was wrong to allow this.)

    We didn’t really reach any conclusion, other than that there is a real potential for an interoperability problem that might meet the criteria for a spec errata. A further conversation (involving Nat / Mike / Brian / Filip) is likely needed and John suggested IIW might be an opportunity for that. (It’d probably be good to add some extra notes from the meeting, as the discussion went on for ~50 minutes and I definitely haven’t captured all of it in this comment. I’m not sure anyone was taking minutes live, but the call was recorded.)
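
The Cache-Control point above can be made concrete with a small sketch. It is not taken from the conformance suite or from any particular client library; the function name `jwks_refetch_interval`, the 60-second floor and the 300-second default are all assumptions chosen for illustration. It shows one way a verifier might honour the header on the jwks_uri response while still enforcing a minimum time between fetches, even when the response says no-store or no-cache.

```python
import re
from typing import Optional


def jwks_refetch_interval(
    cache_control: Optional[str],
    min_interval: float = 60.0,   # assumed floor, echoing the 60 seconds mentioned in the thread
    default_ttl: float = 300.0,   # assumed lifetime when the header gives no guidance
) -> float:
    """Return how many seconds to wait before fetching jwks_uri again."""
    ttl = default_ttl
    if cache_control:
        directives = [d.strip().lower() for d in cache_control.split(",")]
        if "no-store" in directives or "no-cache" in directives:
            # The OP asks for no caching at all; the floor below still applies.
            ttl = 0.0
        else:
            for directive in directives:
                match = re.fullmatch(r"max-age=(\d+)", directive)
                if match:
                    ttl = float(match.group(1))
                    break
    # Never refetch more often than the floor, whatever the header says.
    return max(ttl, min_interval)
```

With these assumed defaults, a response carrying `Cache-Control: no-store` still yields a 60-second wait, which is the behaviour that conflicts with an OP publishing a new key for each id_token.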

  6. Yannik Goldgräbe

    While the Cache-Control: no-cache HTTP header is one aspect of this discussion, I would like to strongly emphasize the following key rotation mechanism which lies at the core of our issue with the current implementation of the OP Certification Suite:

    10.1.1. Rotation of Asymmetric Signing Keys
    (…) The signer can begin using a new key at its discretion and signals the change to the verifier using the kid value. The verifier knows to go back to the jwks_uri location to re-retrieve the keys when it sees an unfamiliar kid value.

    Meaning, that during the verification of the id_token, if a previously unknown kid (unknown = it can not be found in the previously cached jwks, independently from the no-cache HTTP header) is present in the header of the token, the Certification Suite (in the role of the verifier/RP) should re-retrieve the keys from the jwks_uri. It seems like this mechanism is currently NOT respected by the Certification Suite and its implementation would already resolve this issue for us.

    As you previously stated, Joseph:

    (…) the suite doesn’t care if the key used to sign the id_token changes during the test so long as the key was present in the jwks_uri at the start of the test.

    The behaviour of the Certification Suite should not be based upon the assumption that all keys are present at the start of the test and thereby set a fixed persistence for X amount of seconds. Instead, it should react according to the above mentioned extract of the OpenID Connect Core specification.

  7. Filip Skokan

    @Yannik Goldgräbe the certification suite verifies that certified client software is ready for such a scenario. However, the client software also needs to protect itself in case a bad actor starts sending forged tokens with random kid values with the intention of consuming server resources. As such, most client software will refetch the OP’s jwks_uri when an unknown kid is presented, but will then go on cool-down to prevent that behaviour from being abused to, e.g., exhaust client resources, block app server worker threads, or cause cascading events leading to a DoS.

    An OP that expects clients to reload its jwks_uri that often is not interoperable with existing certified software.
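
The refetch-then-cool-down behaviour described in this comment can be sketched roughly as follows. This is an illustrative sketch only, not the code of the conformance suite or of any certified client; the class name `JwksCache`, the injected `fetch_jwks` callable and the 60-second default are assumptions made for the example.

```python
import time
from typing import Callable, Dict, Optional


class JwksCache:
    """Caches keys by kid; refetches on an unknown kid, at most once per cool-down."""

    def __init__(
        self,
        fetch_jwks: Callable[[], dict],            # hypothetical: returns the parsed JWKS document
        cooldown_seconds: float = 60.0,            # assumed cool-down period
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_jwks = fetch_jwks
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._keys: Dict[str, dict] = {}
        self._last_fetch: Optional[float] = None

    def _refresh(self) -> None:
        jwks = self._fetch_jwks()
        self._keys = {k["kid"]: k for k in jwks.get("keys", []) if "kid" in k}
        self._last_fetch = self._clock()

    def get_key(self, kid: str) -> Optional[dict]:
        """Return the key for kid, or None if it cannot be resolved right now."""
        if kid in self._keys:
            return self._keys[kid]
        now = self._clock()
        if self._last_fetch is not None and now - self._last_fetch < self._cooldown:
            # Unknown kid, but we fetched recently: refuse to refetch so that
            # forged tokens with random kids cannot trigger unlimited requests.
            return None
        self._refresh()
        return self._keys.get(kid)
```

A client built this way handles an occasional rollover signalled by a new kid, but a token signed with a key published a few seconds after the previous fetch fails verification until the cool-down has elapsed.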

  8. Yannik Goldgräbe

    Of course, the aspect of an exponential backoff on the client side was also touched upon by John Bradley during yesterday’s call. I do not see an issue with that.

    One could even argue that a fixed test like this should NOT be required on the RP side, since it is subjectively up to the RP if and how often he wants to request and operate upon resources received from the OP. In my opinion, the scenario of a malicious OP seems like it requires individual protection mechanisms, which should be (and, from my current understanding, already are) left open to the RP.
    If the RP receives too many invalid id_tokens, which are not verifiable due to the absence of the key associated with the kid, he can selectively choose to not execute any further requests against this OP. It lies in the responsibility of an honest OP to keep the contents of his jwks_uri up-to-date, while an RP can always choose to react accordingly, if he classifies an OP as malicious based on individual tolerance limits.

    Just to note: our interest in pursuing this issue lies in the compliance of the OP Certification Suite. We are currently not aiming for an RP certification. I would recommend to discuss the problem of existing software that has already been certified without this aspect in mind elsewhere.

  9. Filip Skokan

    I would recommend to discuss the problem of existing software that has already been certified without this aspect in mind elsewhere.

    I disagree, since it is very much related. It is the assumption and a goal of ours that a certified RP is able to interact, interoperate with a certified OP on a given profile level.

    An OP which, as was described above (but please correct it if untrue), rotates its keys upon every single authentication request is not interoperable due to the very reasons described.

  10. Yannik Goldgräbe

    In this case, I can only reference back to the specification that a rotation (at least from our interpretation) using kids is a recommended and feasible option. The issue lies within the current state of the OP Certification Suite not respecting this, consequently preventing OPs that embrace this mechanism from passing the certification. There are several design, security and practical reasons to use and rely on this mechanism, whose listing is beyond the scope of this discussion.

    Our OP does not rotate all of its keys upon every single authentication request. We’re using multiple signature keys that likely could rotate and are providing new kids for those who do. Some keys might still be present in the jwks from handling previous authorize requests, while others might not -- requiring the specified key agility. Thus, an RP can not assume a fixed key set and should reference back to the jwks_uri upon every verification, being primarily instructed by the unfamiliar kid and secondarily by the no-cache header of the jwks document. Additionally, even the JWS RFC already anticipated this functionality, stating: “This parameter allows originators to explicitly signal a change of key to recipients.”

    From my point of view, a change to the RP certification should not be required and also existing states would not be affected. An RP reacting (independently from any specification) to a malicious OP should be implied and basic behaviour of any web service. Think of other common cases related to tampering with the availability of the RP keys, such as the jwks_uri not being reachable at all...
    Since I’m not familiar with the RP Certification Suite, how far are you currently assessing the RP’s behaviour regarding availability related cases?

    At this point, I feel like I cannot make my case more clear and am hoping for your positive evaluation and overarching support, be it the team behind the OP Certification Suite, OpenID WG or the RP certification team. Interoperability and compliance with the established core specification should be in everyone’s interest.
