Additional feedback from ZB

Issue #5 resolved
Michael Diamond (repo owner) created an issue
  • Open a file descriptor to /dev/null and redirect output there, instead of separately writing to /dev/null repeatedly (see the sketch after this list). See Wooledge and SO
  • Use placeholders in a "real" function, rather than eval-ing the whole thing (example)
  • Write the exit code into the out/err file, since it's a fixed-width value. Alternatively, store it in a metadata field (mtime/atime/ctime for example, or an xattr)
  • Use printf to get the current time, rather than calling out to date (printf's %(…)T format requires Bash 4.2+, so it might not be available in old Bash)
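
A minimal sketch of the first and last items above (the descriptor number and command/variable names are illustrative, not the library's actual code):

    # Open one long-lived descriptor to /dev/null rather than re-opening
    # the path on every redirect.
    exec 8>/dev/null
    some_noisy_command >&8 2>&8

    # Bash 4.2+ printf can format the current time without forking date.
    printf -v now '%(%s)T' -1   # seconds since epoch, like $(date +%s)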

Comments (5)

  1. Zac Bentley

    Some additional ideas:

    • The hashing function is a bit complex/slow. You don't need to call programs that do I/O or cryptographically secure operations to get a low-collision hash/cache location for a string. You could probably implement a simple hash algorithm in Bash and get some speedup (see the sketch after this list; concurrent invocations would be tricky with a non-great distribution, though).
    • Speaking of which, isn't there a somewhat large risk of using this concurrently and having collisions? I'm put in mind of "main": what if two scripts use this cacher and use it to cache a function called "main"? There might be cheap ways to coordinate (high effort though), but unless concurrent access is a requirement, I'd suggest salting the hash with something that uniquely identifies the invoking process: PID plus interpreter-start time should do it (not just PID, because wraparound, but now we're into tinfoil-hat territory unless you're running in a PID-constrained container).
    • Speaking of concurrency: if you have a program that uses this for a lot of invocations of the same function, those "find" calls are going to get really expensive as they traverse all the files in the directory. Is there any way to reduce the rate of those to once in a while? Or coordinate with other procs that might be doing the same thing?
    • Cleanup on shutdown: how does it work? "It doesn't whatsoever; clean up /tmp yourself" is a fine answer; if so, please document that. If not, at least a best-effort exit hook and/or an opt-in cleanup function might be in order.
    • You might be able to omit the "newerthan" check entirely (and all the "which stat() am I using?" probes) if you call find twice: once synchronously, on just the files you know you care about for a given invocation, and once asynchronously to clean up random other stuff. If the synchronous call deletes anything, you know the cache was stale (find can, IIRC, be configured to return an exit code for "did you do anything?"; if not, just check lines of output).
    • Can you background the writes into the cache files after you synchronously invoke the function, or does that compromise your guarantees?
    • You have several "|| true" statements. Are you promising that this'll work in -e mode? If so, update the docs. If not, lose the operators.
    • If your code uses bashisms and is only tested/intended for use on Bash interpreters, the file extension should be ".bash". I know I'm in the minority here. Flame on.
    • Dunder-naming in _read_input is probably unnecessary.
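
    As a sketch of the pure-Bash hashing idea from the first bullet (djb2 here; the function name is illustrative, and the distribution is decent but nowhere near cryptographic):

    hash_string() {
      local str=$1 hash=5381 i ord
      for (( i = 0; i < ${#str}; i++ )); do
        printf -v ord '%d' "'${str:i:1}"            # character -> numeric code
        hash=$(( (hash * 33 + ord) & 0xffffffff ))  # djb2 step, kept to 32 bits
      done
      printf '%08x\n' "$hash"
    }

    hash_string 'bc::some_cached_function'          # prints an 8-hex-digit bucket name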
  2. Michael Diamond reporter
    • Yeah, I just need something that can map arbitrary strings reasonably well to valid file names. I'm not sure I want to go to the trouble of implementing my own hashing algorithm just yet, though.
    • Good point. The primary use-case is for shell prompts, where cross-process caching would generally be desirable, but there's definitely some concern for colliding namespaces that I haven't explored. At a minimum I need to ensure that cached data isn't shared (or visible) across users. Now issue #6.
    • You're referring to the async cleanup find calls? Yeah, I need to do something better there.
    • I don't feel too strongly about adding a clean-up-on-exit hook; since the cached data could be valid across shells, it may be beneficial to persist it. Also, since the data's regularly cleaned up on the fly, the leftover artifacts should be fairly minimal.
    • I'd have to benchmark what you're suggesting, but I'm trying to move away from find (per your earlier suggestion :P ). At a minimum, one big problem with find is that it cds away and back under the covers, which can fail if the pwd no longer exists, and that behavior doesn't appear to be configurable. Swapping to the custom stat-ing avoids that problem, which was one of the most noticeable issues with the old code.
    • That's definitely possible. They'd need to happen synchronously relative to each other, but at least in some cases the caller doesn't need to wait for them to complete, you're right.
    • Well, I'm testing the code with bats, which runs in a set -e environment. I don't care to provide that guarantee as part of the public API, though. In general I dislike set -e.
    • Hah, good luck with that one :D
    • How so? If I do away with the __ and someone calls bc::_read_input contents, the data will be written to the local contents rather than the caller's variable (see the sketch below). I could make the variables non-local, but then they're polluting the namespace. I'd rather __ everything than either of those options, but if you had something else in mind I'm curious.
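
    To illustrate the shadowing being discussed in the last bullet (hypothetical names, not the actual bc::_read_input):

    read_into() {
      local contents                       # same name the caller asked for
      IFS= read -r -d '' contents || true  # slurp stdin (read returns non-zero at EOF)
      printf -v "$1" '%s' "$contents"      # writes to the *local*, not the caller's variable
    }
    read_into contents <<< 'some data'
    echo "${contents:-<empty>}"            # prints <empty>; the result never escaped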
  3. Zac Bentley

    Regarding cleanup-on-exit: that's fine. Maybe add a sentence in the docs indicating that some stuff may be left behind, and/or that you can call a random cached function before you turn out the lights to clean it all up?

    I agree re: set -e. Some people care about it, though, so if you promise compatibility, why not document that?

    Regarding the replacement of "newerthan" with a potential call to "find" on specific files rather than directories: that was a suggestion motivated by clarity/code reduction, not performance. It's probably very slightly slower, since "find" is likely a slightly slower-to-start program than the stat utilities. Splitting hairs tho.

    The dunder locals thing is pretty pointless, though. In this case you control all the names it uses (out, err, exit), so there's no need to "defeat" possible colliding callers. Since that function is namespaced to bash-cache, it's not likely to be used by other random code, yes?

    Even in the general case, "cross your fingers and guess nobody will pass something unlikely, Python-convention-style" is not a great approach in Bash (because of looser conventions and more prevalent use of global scope). If it were a big function, some silly metaprogramming (substitution, or parsing "set" output) could be used to avoid collisions, but since we're talking about 3-4 lines of code here, you can just do:

    read_input() {
      # If the caller's target name collides with one of our own variable
      # names, use underscore-prefixed locals instead; otherwise use the
      # plain names. Either way the caller's variable is never shadowed.
      if [[ "${1:?Must provide a variable to read into}" == "contents" || "$1" == "line" ]]; then
        local _contents _line
        while read -r _line; do
          _contents="${_contents}${_line}"$'\n'
        done
        # Append the final unterminated line (read fails at EOF but still fills it).
        printf -v "$1" '%s%s' "$_contents" "$_line"
      else
        local contents line
        while read -r line; do
          contents="${contents}${line}"$'\n'
        done
        printf -v "$1" '%s%s' "$contents" "$line"
      fi
    }
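
    For what it's worth, piping into the sketch above behaves as expected even when the caller picks one of the internal names (assuming the function is sourced as written):

    printf 'a\nb' | { read_input contents; declare -p contents; }
    # declare -- contents="a
    # b"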
    
  4. Michael Diamond reporter

    Fair enough regarding __, since you're right it's not intended to be used by anyone outside the library. That said, the reason I did it in the first place was it bit me :) I think it's safe to just fail if the varname is a local variable.

  5. Michael Diamond reporter
    • Cleanup semantics documented, 6702c8f
    • I'm going to leave the cache writing synchronous for now; the addition of the bc::warm::* functions makes this less critical, and I'm not certain what the new semantics would look like.

    Filed issues #8, #9, and #10 to track the remaining items I'd like to address. Calling this issue done.
