Discovery 2.2.3 - "example.com:8080" has a scheme

Issue #625 resolved
lepidum created an issue

According to RFC 3986, "example.com:8080" is parsed as follows: scheme = example.com, path = 8080.

I think that in this case "https" should not be prepended, and the input should be rejected because its scheme is unknown. (That way, someone in the future can extend this spec to URIs like "acmetelecom.net:123...".)

Comments (8)

  1. lepidum reporter

    I cannot figure out why prefixing "https://" is necessary. I guess you mean the following process, but this might lose extensibility to various schemes.

    if (startsWithAuthority(input)) return parseUrl("https://" + input) else return parseUrl(input)

  2. Nat Sakimura
    • edited description
    • changed component to Discovery

    According to RFC 3986, URI is defined as follows.

          URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
    
          hier-part   = "//" authority path-abempty
                      / path-absolute
                      / path-rootless
                      / path-empty
    

    Therefore, if example.com:8080 were parsed as a URI, then example.com will not be a scheme but either an authority section or a path.

    Since, by 2.1.1, the identifier in this case is treated as a URL, the first segment is actually treated as the authority section.

    That said, the normalization rule here probably needs to be tightened.

    Note the following defs from the RFC.

    authority   = [ userinfo "@" ] host [ ":" port ]
    userinfo    = *( unreserved / pct-encoded / sub-delims / ":" )
    host        = IP-literal / IPv4address / reg-name
    IP-literal = "[" ( IPv6address / IPvFuture  ) "]"
    IPvFuture  = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
    

    Clearly, userinfo may include ":". IP-literal may also include ":". Thus, naively treating the segment after the first ":" as a port would result in errors. We need to further constrain the host in the authority segment to be reg-name.

    BTW, looking at the above, perhaps we do not need to treat email-style identifiers specially. userinfo@host is still a valid authority section in a URI. On the other hand, the RFC 5322 addr-spec http://tools.ietf.org/html/rfc5322#section-3.4.1 is not suitable for our "email-looking" identifier, as it may include CRLF etc.

    So, I think we should go only with the authority section defined in RFC 3986 and allow only reg-name in host.
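A minimal sketch of that restriction, assuming the authority rule quoted above with host narrowed to reg-name. The names `AUTHORITY` and `parse_authority` are hypothetical and not part of the spec; the character classes are the unreserved, sub-delims and pct-encoded rules from RFC 3986.

```python
import re

# Character classes from RFC 3986 (unreserved, sub-delims, pct-encoded).
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
PCT = r"%[0-9A-Fa-f]{2}"

# authority = [ userinfo "@" ] reg-name [ ":" port ]
# The host is deliberately limited to reg-name, so IP-literals are rejected.
AUTHORITY = re.compile(
    rf"(?:(?P<userinfo>(?:[{UNRESERVED}{SUB_DELIMS}:]|{PCT})*)@)?"
    rf"(?P<host>(?:[{UNRESERVED}{SUB_DELIMS}]|{PCT})*)"
    rf"(?::(?P<port>[0-9]*))?"
)


def parse_authority(text):
    """Return userinfo/host/port as a dict, or None if text does not match."""
    match = AUTHORITY.fullmatch(text)
    return match.groupdict() if match else None


print(parse_authority("example.com:8080"))
print(parse_authority("joe:secret@example.com:8080"))
print(parse_authority("[::1]:8080"))
```

Because the userinfo part is matched up to the "@" before the host is considered, a ":" inside userinfo is not mistaken for the port delimiter, which is the failure mode of splitting at the first ":".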

  3. lepidum reporter

    Therefore, if example.com:8080 were parsed as a URI, then example.com will not be a scheme but either an authority section or a path.

    If so, what is the scheme of "example.com:8080"? According to RFC 3986, a URI consists of a scheme and a hier-part. The hier-part can be a path-rootless, and a path-rootless contains one or more segments separated by "/". Thus, parsing "example.com:8080" yields a URI with scheme "example.com" and path-rootless "8080", and URI parsers in several programming languages, such as Java, C# and Ruby, actually produce this result.
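The split described above can be reproduced with the component-splitting regular expression given in RFC 3986, Appendix B. This is only a sketch: the helper name `split_reference` is hypothetical, and the regular expression separates components without validating them.

```python
import re

# Regular expression from RFC 3986, Appendix B. It only splits a reference
# into components; it does not check each component against the grammar.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_reference(text):
    """Split text into the five generic components (None when absent)."""
    match = URI_REFERENCE.match(text)  # every part is optional, so this always matches
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }


print(split_reference("example.com:8080"))
print(split_reference("https://example.com:8080"))
```

For "example.com:8080" the scheme component is "example.com" and the path is "8080", with no authority. Only after "https://" is prepended does "example.com:8080" land in the authority component.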

  4. Nat Sakimura

    Oh, are you just talking about the example in 2.2.3, and not the normalization rule stated in 2.1.3?

    Then you are right. The first segment of a relative reference cannot contain a ":". So example.com:8080/joe cannot be a relative reference, and the example is wrong. Is that what you are getting at, rather than the normalization rule that prepends https: to the authority section?

    At the same time, I do understand that people want to support a user input string such as example.com:8080/joe. This is not a URI, and it is not even a relative reference. It is something else if we are to normalize it to https://example.com:8080/joe. So, clause 2.1 needs to be fixed. It is closely related to issue #621.

  5. Nat Sakimura
    • changed status to open

    Just saying that user input is a relative reference did not solve it. For a relative reference to contain an authority section, it needs to be prefixed by "//".
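The distinction can be illustrated with a small classifier. This is a sketch under the RFC 3986 scheme rule (ALPHA followed by ALPHA / DIGIT / "+" / "-" / "."); the function name `classify` and its return labels are hypothetical and chosen only for illustration.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), followed by ":"
SCHEME_PREFIX = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*:")


def classify(reference):
    """Roughly classify a string by how an RFC 3986 parser would start reading it."""
    if SCHEME_PREFIX.match(reference):
        return "uri"  # everything before the first ":" reads as a scheme
    if reference.startswith("//"):
        return "network-path reference"  # only this relative form carries an authority
    return "other relative reference"  # path, query or fragment only


for text in ("example.com:8080/joe", "//example.com:8080/joe", "example.com/joe"):
    print(text, "->", classify(text))
```

Under these assumptions, "example.com:8080/joe" is read as a URI with scheme "example.com", while "//example.com:8080/joe" is the only one of the three inputs that has an authority section.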
