certification team query: allowing key rotation during tests
The certification team have recently received a request from an OP that is failing I believe pretty much every OpenID Connect Core Basic profile test.
The request is that all tests support key rotation whilst the test is running, and potentially multiple new keys during a test. (Currently the tests assume that all the keys needed for the test are present in the server jwks_uri at the start of the test.)
The reason the OP is requesting this is, as I understand it, that they dynamically create a new key in the server jwks_uri each time an id_token is issued.
From memory, I recall that previously we’ve explicitly allowed RPs to rate-limit fetching of the jwks_uri; I forget the exact details but I believe there are clients that (even when presented with an unknown kid) won’t refetch jwks_uri if they’ve fetched it in the last (say) 60 seconds - and that position seems incompatible with the requested change.
Any guidance from the working group would be welcome.
Comments (11)
-
-
reporter I remembered one of the locations of previous discussions: https://bitbucket.org/openid/connect/issues/1161/key-rotation-should-require-a-delay
-
reporter Thanks @panva.
The suite, like a regular client, should be ready for a key rollover but definitely not as excessive as you described.
Can you expand on this a little please? Mostly I’m interested in whether “The suite assumes that any key the OP intends to use in the next few minutes is published in its jwks_uri at the start of the test” is an okay assumption, i.e. the suite is hardwired to assume that there is a delay between publishing a key and assuming that key will be accepted by the other party.
I think that (at least for regular key rotations) OPs in practice tend to publish them well in advance - e.g. in the case of Okta they publish the new key weeks in advance of starting to use the key: https://developer.okta.com/docs/concepts/key-rotation/
-
at least for regular key rotations
That is correct. But during a regular client’s operation, the client has no need to re-fetch the jwks_uri until the time it cannot resolve a usable key. The client cannot distinguish between a planned key rotation and an “emergency” one. As such it should not depend on keys always being advertised upfront, otherwise it would have to resort to scheduled re-fetching of jwks_uri, which is nowhere prescribed.
What I’m trying to say is that I sympathise with the certifying party on behalf of which this query is presented. I can see the reasoning behind questioning the suite’s behaviour. But if they rotate each time an id_token is issued, that is not interoperable and is not meeting the Core Basic certification requirements.
-
reporter Thanks Filip. Your point about planned key rotation vs emergency is a good one.
For clarity the conformance suite copes fine with a ‘planned’ key rotation (one where the key is published “sometime” in advance of use of the key starting); the suite doesn’t care if the key used to sign the id_token changes during the test so long as the key was present in the jwks_uri at the start of the test.
An emergency key rotation during a test would probably cause the test to fail - I think you’d have to be fantastically unlucky for an emergency key rotation to occur during a certification test, but if it did happen the test would pass fine on a rerun.
-
reporter We discussed on today’s connect WG call, with the main people participating in the conversation being myself, Yannik @ comuny.de (who originally raised the issue with the certification team) and John Bradley with a few comments from Tom Jones.
John mentioned that the Cache-Control header is really part of this mechanism (and the certification suite doesn’t really consider that currently), though the spec only explicitly mentions Cache-Control in the context of rotating encryption keys (and not signing keys as we’re talking about in this issue). Arguably, anyone that was worried about a denial of service attack would still enforce a minimum time between fetches in the event it had received a Cache-Control: no-store header on the jwks_uri response.
The same text when interpreted for the OP fetching the RP’s keys (OP-Rotation-RP-Sig) explicitly allows an OP to rate limit fetches of the jwks_uri to 60 seconds. (John did question whether this test was wrong to allow this.)
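To make the Cache-Control point concrete, here is a minimal sketch of how a verifier might combine the header with a minimum refetch interval. The function name, the 60-second floor and the decision to treat no-store / no-cache as "no caching requested" are illustrative assumptions, not something the spec or the suite prescribes.

```python
import re

# Assumed floor (seconds) between jwks_uri fetches; the 60 s value mirrors the
# figure mentioned for OP-Rotation-RP-Sig and is only an illustration.
MIN_REFETCH_INTERVAL = 60.0


def jwks_cache_lifetime(cache_control, floor=MIN_REFETCH_INTERVAL):
    """Return how many seconds to keep a fetched JWKS before refetching.

    cache_control is the raw Cache-Control header value (or None).
    The result is never below `floor`, even for no-store / no-cache.
    """
    directives = [d.strip().lower() for d in (cache_control or "").split(",")]
    max_age = 0.0
    for directive in directives:
        match = re.fullmatch(r"max-age=(\d+)", directive)
        if match:
            max_age = float(match.group(1))
    # no-store / no-cache mean the server asked for no caching at all.
    if "no-store" in directives or "no-cache" in directives:
        max_age = 0.0
    # A verifier worried about denial of service still enforces its own floor.
    return max(max_age, floor)
```

With these assumptions, a response carrying Cache-Control: no-store is still held for the floor interval, while a longer max-age is honoured as sent.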
We didn’t really reach any conclusion, other than that there is a real potential for an interoperability problem that might meet the criteria for a spec errata. A further conversation (involving Nat / Mike / Brian / Filip) is likely needed and John suggested IIW might be an opportunity for that. (It’d probably be good to add some extra notes from the meeting, as the discussion went on for ~50 minutes and I definitely haven’t captured all of it in this comment. I’m not sure anyone was taking minutes live, but the call was recorded.)
-
While the Cache-Control: no-cache HTTP header is one aspect of this discussion, I would like to strongly emphasize the following key rotation mechanism which lies at the core of our issue with the current implementation of the OP Certification Suite:
10.1.1. Rotation of Asymmetric Signing Keys
(…) The signer can begin using a new key at its discretion and signals the change to the verifier using the kid value. The verifier knows to go back to the jwks_uri location to re-retrieve the keys when it sees an unfamiliar kid value.
Meaning that during the verification of the id_token, if a previously unknown kid (unknown = it cannot be found in the previously cached jwks, independently from the no-cache HTTP header) is present in the header of the token, the Certification Suite (in the role of the verifier/RP) should re-retrieve the keys from the jwks_uri. It seems like this mechanism is currently NOT respected by the Certification Suite and its implementation would already resolve this issue for us.
As you previously stated, Joseph:
(…) the suite doesn’t care if the key used to sign the id_token changes during the test so long as the key was present in the jwks_uri at the start of the test.
The behaviour of the Certification Suite should not be based upon the assumption that all keys are present at the start of the test and thereby set a fixed persistence for X amount of seconds. Instead, it should react according to the above mentioned extract of the OpenID Connect Core specification.
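As a rough illustration of the verifier behaviour described in section 10.1.1, the sketch below re-retrieves the key set when it sees an unfamiliar kid. KeyResolver and fetch_jwks are hypothetical names; fetch_jwks stands in for an HTTP GET of the jwks_uri, and this is not how the Certification Suite is actually implemented.

```python
from typing import Callable, Dict, List, Optional

Jwk = Dict[str, str]


class KeyResolver:
    """Caches a JWKS and goes back to the source on an unfamiliar kid."""

    def __init__(self, fetch_jwks: Callable[[], List[Jwk]]):
        # fetch_jwks is a stand-in for fetching and parsing the jwks_uri.
        self._fetch_jwks = fetch_jwks
        self._keys: Dict[str, Jwk] = {}
        self.fetch_count = 0  # exposed only to make the behaviour observable

    def _reload(self) -> None:
        self.fetch_count += 1
        self._keys = {k["kid"]: k for k in self._fetch_jwks() if "kid" in k}

    def resolve(self, kid: str) -> Optional[Jwk]:
        # Unfamiliar kid: re-retrieve the keys once, then look again.
        if kid not in self._keys:
            self._reload()
        return self._keys.get(kid)
```

Note that this sketch refetches on every unknown kid, so a token with a random kid always triggers a request to the OP.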
-
@Yannik Goldgräbe the certification suite verifies that certified client software is ready for such a scenario. However, the client software also needs to protect itself in case a bad actor starts sending forged tokens with random kid values with the intention of consuming server resources. As such, most client software will refetch the OP jwks_uri when an unknown kid is presented, but will then go on cool down to prevent such behaviour from being abused to e.g. exhaust client resources, block app server worker threads, or cause cascading events leading to a DoS.
An OP that expects clients to reload its jwks_uri that often is not interoperable with existing certified software.
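The cool down can be sketched as a small gate that a client consults before refetching the jwks_uri. RefetchGate, try_acquire and the 60-second default are hypothetical and chosen for illustration only; real client libraries differ in the details.

```python
import time
from typing import Callable, Optional


class RefetchGate:
    """Allows at most one jwks_uri refetch per cooldown window."""

    def __init__(self, cooldown: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._cooldown = cooldown
        self._clock = clock  # injectable so the gate can be tested without sleeping
        self._last: Optional[float] = None

    def try_acquire(self) -> bool:
        """Return True if a refetch may happen now, and record it if so."""
        now = self._clock()
        if self._last is not None and now - self._last < self._cooldown:
            # Still cooling down: treat the unknown kid as unverifiable for now.
            return False
        self._last = now
        return True
```

A client using such a gate would reject a token with an unknown kid during the cool down rather than refetch, which is why an OP that introduces a new key with every id_token would fail against it.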
-
Of course, the aspect of an exponential backoff on the client side was also touched upon by John Bradley during yesterday’s call. I do not see an issue with that.
One could even argue that a fixed test like this should NOT be required on RP side, since it is subjectively up to the RP if and how often he wants to request and operate upon resources received from the OP. In my opinion, the scenario of a malicious OP seems like it requires individual protection mechanisms, which should be (and, from my current understanding, already are) left open to the RP.
If the RP receives too many invalid id_tokens, which are not verifiable due to the absence of the key associated with the kid, it can selectively choose to not execute any further requests against this OP. It lies in the responsibility of an honest OP to keep the contents of its jwks_uri up-to-date, while an RP can always choose to react accordingly if it classifies an OP as malicious based on individual tolerance limits.
Just to note: our interest in pursuing this issue lies in the compliance of the OP Certification Suite. We are currently not aiming for an RP certification. I would recommend to discuss the problem of existing software that has already been certified without this aspect in mind elsewhere.
-
I would recommend to discuss the problem of existing software that has already been certified without this aspect in mind elsewhere.
I disagree, since it is very much related. It is the assumption and a goal of ours that a certified RP is able to interact, interoperate with a certified OP on a given profile level.
An OP which, as was described above (but please correct it if untrue), rotates its keys upon every single authentication request is not interoperable due to the very reasons described.
-
In this case, I can only reference back to the specification that a rotation (at least from our interpretation) using kids is a recommended and feasible option. The issue lies within the current state of the OP Certification Suite not respecting this, consequently halting OPs, which are embracing this mechanism, from passing the certification. There are several design, security and practical reasons to use and rely on this mechanism, whose listing is beyond the scope of this discussion.
Our OP does not rotate all of its keys upon every single authentication request. We’re using multiple signature keys that likely could rotate and are providing new kids for those that do. Some keys might still be present in the jwks from handling previous authorize requests, while others might not -- requiring the specified key agility. Thus, an RP cannot assume a fixed key set and should reference back to the jwks_uri upon every verification, being primarily instructed by the unfamiliar kid and secondarily by the no-cache header of the jwks document. Additionally, even the JWS RFC already anticipated this functionality, stating: “This parameter allows originators to explicitly signal a change of key to recipients.”
From my point of view, a change to the RP certification should not be required and also existing states would not be affected. An RP reacting (independently from any specification) to a malicious OP should be implied and basic behaviour of any web service. Think of other common cases related to tampering with the availability of the RP keys, such as the jwks_uri not being reachable at all...
Since I’m not familiar with the RP Certification Suite, how far are you currently assessing the RP’s behaviour regarding availability related cases?
At this point, I feel like I cannot make my case more clear and am hoping for your positive evaluation and overarching support, be it the team behind the OP Certification Suite, OpenID WG or the RP certification team. Interoperability and compliance with the established core specification should be in everyone’s interest.
-
I believe the suite’s behaviour reflects what a regular client during normal operation would encounter. The suite, like a regular client, should be ready for a key rollover but definitely not as excessive as you described.