Calculation of the start of the zipfile (location of shebang) is incorrect

Issue #5 resolved
Paul Moore created an issue

The code that calculates the position of the start of the zipfile in order to identify the shebang is incorrect. Specifically, it fails to work with zipfiles containing a shebang that have been created by the stdlib zipapp module.

The issue is that the "Offset of the central directory" field in the end of central directory record, is the offset relative to the start of the zipfile. It is legitimate for zipfiles to contain data before the actual zip data (such as the shebang, in the case of zipapp), and the offset takes this into account. The code ignores this possibility and assumes that the offset is relative to the first "local file header" (i.e., the end of the shebang). More accurately, the code assumes that the zipfile was created with no prepended data, and then data was prepended by simple copy, without updating the offset (as this is what tools normally do). The stdlib zipapp module does, however, set the offset taking the shebang into account.

This has not been an issue previously, because zipfiles can (and many do) also contain data prepended that is not reflected in the offset (for example, the launcher itself, typically). Zip utilities cope with this - Info-Zip reports it as "warning [xxx.zip]: 727 extra bytes at beginning or within zipfile", for example.

The correct means of locating the shebang needs to allow for both prepended data reflected in the offset, and prepended data not included in the offset. Doing this in full generality may require a scan for one of the signatures embedded in the zipfile data (I haven't looked at the source of any of the zip utilities to see how they handle this).

I think that the simplest solution here would be to use the existing code to locate the "presumed start" of the zipfile, and then try two approaches:

  1. Check if the bytes immediately following the "presumed start" are #! - if so, the central directory offset takes account of the prepended shebang, and so we have found it.
  2. Use the current approach, assuming that the shebang immediately precedes the "presumed start" - this handles the case where the shebang was prepended without modifying the offset.

In theory, it's possible that some other scenario could have happened, with some prepended data included in the offset and other data not. But it's unlikely that these situations will occur in practice.

I'll try to put together a PR to fix this issue, as it's directly affecting me (I can't use this launcher with zipapps).

Comments (8)

  1. Paul Moore reporter

    (Just as a note to justify my obsessing about this relatively obscure detail ;-))

    It's worth noting the -A option on the zip command (documentation here) which is an explicit way of adjusting the offsets to take account of prepended data. The fact that most people don't use anything like zip -A indicates (to me, at least) that it's a relatively obscure detail - but the fact that it exists implies that it's valid for the offsets to include the preamble.

  2. Paul Moore reporter

    I've attached a patch file for this. Sorry, but I can't work out how to create a PR in bitbucket :-(

  3. Paul Moore reporter

    I wondered if that was it, but I get "Access denied" :-(

    Also, I'm very rusty with Mercurial these days... But if you can get the access issue sorted, I'll see what I can do. Do I need to create my own fork of the repository, like I do in Github?

  4. Vinay Sajip repo owner

    I have merged your fix in c6b0f63 and the Downloads section has the updated launchers. Please try out and report back! I can then update distlib with these launchers.

  5. Log in to comment