The Big Signature Refactoring

This was an architectural redesign that SK had been working on, off and on, for a long long long long LONG time. It was finally released to the world in the 0.97.0d20070918 checkpoint release.

The basic idea was to change the signature information that's stored in .sconsign files to:

  • allow switching back and forth between content signatures and timestamps in different runs (that used to cause unpredictable behavior)
  • provide flexibility to do things like use content signatures for all of a target's source files except for, say, the one huge file that takes too long to checksum, which can be configured independently to use timestamps
  • allow SCons to work on just a subset of the dependency graph and still build consistently (i.e., be less rigid about having to read the whole DAG, which used to be necessary to make sure the build signatures were consistent between runs)
  • provide for fast development builds of just one subdirectory in a big tree, possibly by supporting an SConstruct file in each directory (with SConstruct files including subsidiary SConstruct files, instead of one SConstruct file and the subsidiaries all being SConscript), or possibly by allowing direct execution of SConscript files
  • support the ability to read implicit dependencies from .d files generated by (for example) the gcc -MD option
  • avoid rebuilds if explicit dependencies are re-ordered (although re-ordering source files or implicit dependencies does matter and will still cause a rebuild)

One key effect of this change was that build signatures (re-MD5 summing the dependencies' signatures) went away entirely, mainly because build signatures make it impossible or difficult to do some of the things above. In particular:

  • a build signature can't be used if you want to mix-and-match content signatures and timestamps in the same target file decision

  • build signatures can't be used for just a subset of the dependency graph without doing things like just "believing" the build signature stored in a .sconsign file, which allows inconsistent builds in various corner cases

It's unclear what the overall performance impact has been. Update: after optimizing this code at VMware, it looks like this is a win -- not order-of-magnitude faster, but maybe 10%-20%-30% faster. It's possible this is just optimized for VMware's configuration, but the code should be easier to optimize for other bottlenecks that pop up.
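The contrast between one combined build signature and per-dependency information can be sketched in a few lines of Python. This is not SCons code: `build_signature`, `changed_deps` and the dictionary layout are names invented for illustration, and concatenating the signatures before hashing is an assumption about how "re-MD5 summing" might look.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def build_signature(dep_sigs):
    """Old style: one MD5 over all of the dependencies' signatures.

    Assumption: the signatures are simply concatenated in order.
    """
    return md5_hex("".join(dep_sigs).encode())

def changed_deps(stored, current, deciders):
    """New style: compare each dependency on its own terms.

    stored and current map a dependency name to a dict with the keys
    'csig' and 'timestamp'; deciders maps a name to the key that this
    particular dependency is compared on.
    """
    return [name for name, info in current.items()
            if name not in stored
            or stored[name][deciders[name]] != info[deciders[name]]]

# foo.c is compared by content, the huge file only by timestamp.
deciders = {"foo.c": "csig", "huge.dat": "timestamp"}
stored = {"foo.c": {"csig": md5_hex(b"int main(){}"), "timestamp": 100},
          "huge.dat": {"csig": None, "timestamp": 200}}
# foo.c was touched but its content is the same; huge.dat is untouched.
current = {"foo.c": {"csig": md5_hex(b"int main(){}"), "timestamp": 150},
           "huge.dat": {"csig": None, "timestamp": 200}}
print(changed_deps(stored, current, deciders))
```

With a single combined signature there is no place to say "this one by timestamp, that one by content", and the result also depends on the order in which the dependency signatures are fed in.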

There was a branch in the Subversion repository, branches/sigrefactor, where SK had been working on this. It lay fallow for quite a while, but SK's contract employer during 2007 (VMware) revived the work for a number of reasons. The code was finished and promoted into branches/core and then released to the world as checkpoint release 0.97.0d20070918. The promotion contained updates to all the necessary tests to validate the new behavior, but did not update the documentation concerning signatures and up-to-date decisions.

Greg Noel, 2007-02-14:

There is a recurring scenario in determining the exact dependencies where an external change (e.g., a file is moved from one location to another, or the include path is changed) can cause a build to be out-of-date even though the signatures of all the upstream objects still match. SCons currently solves this problem by always rescanning the dependencies to confirm their location.

It occurs to me that this is really a signature problem and should be solved within the signature system. Since the Big Signature Refactoring will be a traumatic upgrade (viz., everything will be rebuilt), it would be a good time to incorporate this small change, since it will cause all signatures to be different.

In brief, the proposal is to include the <ins>name</ins> of the object in the signature. Since signatures of upstream objects aren't requested until after the upstream objects have been evaluated, their canonical name is known, so the signatures would be stable.

Here's an example: It doesn't matter if the file comes from /repository/foo.h, /source/foo.h, or /build/foo.h; the file is considered the same. The canonical name of the file is 'foo.h' and that is the name folded into the signature. However, if the file is moved to include/foo.h and the file is picked up at a different point in the include path, it is considered a different file. In that case, the canonical name is include/foo.h and that is the name folded into the signature, and a rebuild would be triggered of any dependent targets.

It would mean that caching implicit dependencies would be safe and there would be no need to rescan on a routine basis. The Taskmaster would only need to determine that the signatures of the source(s) and implicit dependencies are unchanged. If the Big Signature Refactoring also optimizes signature calculation by treating files whose timestamps are unchanged as OK, then the ideal case would only cost a single stat() call to identify that a file was unchanged.
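A minimal Python sketch of this proposal follows. It assumes that the canonical name is the path relative to whichever repository, source or build root the file was found under, and that the name is hashed ahead of the contents. Both helper functions are hypothetical and not part of SCons.

```python
import hashlib

def canonical_name(path, roots):
    """Strip whichever root the file was found under (assumed rule)."""
    for root in roots:
        prefix = root.rstrip("/") + "/"
        if path.startswith(prefix):
            return path[len(prefix):]
    return path

def named_signature(name, contents):
    """Fold the canonical name into the content signature."""
    h = hashlib.md5()
    h.update(name.encode("utf-8"))
    h.update(b"\0")  # separator so that name and contents cannot run together
    h.update(contents)
    return h.hexdigest()

ROOTS = ["/repository", "/source", "/build"]
body = b"#define FOO 1\n"
sig_repo = named_signature(canonical_name("/repository/foo.h", ROOTS), body)
sig_build = named_signature(canonical_name("/build/foo.h", ROOTS), body)
sig_moved = named_signature(canonical_name("/source/include/foo.h", ROOTS), body)
print(sig_repo == sig_build, sig_repo == sig_moved)
```

The same contents under the same canonical name give the same signature regardless of the root, while moving the file to include/foo.h changes the signature even though the bytes are identical.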

Steven Knight, 2007-08-17, comments by Greg Noel, further comment by Steven Knight, yet more comments by Greg Noel:

SK gave a semi-formal talk at VMware about the Big Signature Refactoring a few months ago. The PowerPoint slides are saved here. I'm not completely sure how comprehensible it is without me being able to explain verbally the differences in the .sconsign file formats (for example).

There were a number of good issues raised at the meeting. I've captured them here, along with my commentary in italics, mostly based on things I communicated back to the VMware attendees earlier but with some updates based on work since then:

  • Don't store <ins>the signatures of</ins> source files in the .sconsign file (e.g. *.[ch] files). There are two places in the code where source files had the NodeInfo class attached and updated. Commenting out the update and re-running the tests was successful, so it looks like this won't be hard to eliminate entirely. Preserving these in the .sconsign file may end up as a configurable option, as a way to cache the results of MD5 checksum calculations of large files.

JGN: Is there ever a reason to store the full source (as opposed to hash of the source) in the .sconsign file? Having to read the source from both the .sconsign file and the actual file can only increase the run time.

SK: Overly-terse note-taking on my part. I underlined the clarification above. For a program foo built from a single foo.c file which #includes a single foo.h file, the old .sconsign format was:

    foo: 8f72e133e001cb380a13bcb6a16fb16f None 1176861920 6762
            foo.o: e61afae6ccfe99a63b0b4c15f18422f6
    foo.o: e61afae6ccfe99a63b0b4c15f18422f6 None 1176861920 1488
            foo.c: b489a8c34c318fc60c8dac54fd58b791
            foo.h: c864c870c5c6f984fca5b0ebd7361a7d

But the new format is:

    foo: 5701724287c3d3847516781876f56d87 1176862330 6762
            foo.o: cc74a5b5cd4b174a59b58495cd2ef1f9 1176862330 1488
            c4245ece9e7108d276b3c8eb7662d921 [$LINK -o $TARGET ...]
    foo.c: b489a8c34c318fc60c8dac54fd58b791 1176861903 55
    foo.h: c864c870c5c6f984fca5b0ebd7361a7d 1176861911 19
    foo.o: cc74a5b5cd4b174a59b58495cd2ef1f9 1176862330 1488
            foo.c: b489a8c34c318fc60c8dac54fd58b791 1176861903 55
            foo.h: c864c870c5c6f984fca5b0ebd7361a7d 1176861911 19
            d055c09cba5c626f5e38f2f17c29c6fa [$CC -o $TARGET -c...]

Note the separate entries for foo.c and foo.h, even though they're not built targets. The observation was made that these aren't strictly necessary, since you only ever really compare against the information stored in the dependency list for foo.o (for example). Additional work since I wrote the above note revealed a reason why preserving these entries was necessary, although I can't remember what it was right now (nor find any comments explaining it).
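The redundancy observation can be checked mechanically. The Python sketch below hard-codes the values from the new-format listing above. `FileInfo`, `Entry` and `redundant_leaves` are illustrative names, not the actual SCons classes or .sconsign reader.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class FileInfo:
    csig: str
    timestamp: int
    size: int

@dataclass
class Entry:
    info: FileInfo
    deps: dict = field(default_factory=dict)  # dependency name -> FileInfo
    action_sig: str = ""

FOO_C = FileInfo("b489a8c34c318fc60c8dac54fd58b791", 1176861903, 55)
FOO_H = FileInfo("c864c870c5c6f984fca5b0ebd7361a7d", 1176861911, 19)
FOO_O = FileInfo("cc74a5b5cd4b174a59b58495cd2ef1f9", 1176862330, 1488)
FOO = FileInfo("5701724287c3d3847516781876f56d87", 1176862330, 6762)

ENTRIES = {
    "foo": Entry(FOO, {"foo.o": FOO_O}, "c4245ece9e7108d276b3c8eb7662d921"),
    "foo.c": Entry(FOO_C),
    "foo.h": Entry(FOO_H),
    "foo.o": Entry(FOO_O, {"foo.c": FOO_C, "foo.h": FOO_H},
                   "d055c09cba5c626f5e38f2f17c29c6fa"),
}

def redundant_leaves(entries):
    """Names of entries without dependencies whose info is duplicated,
    unchanged, in the dependency list of every entry that uses them."""
    leaves = []
    for name, entry in entries.items():
        if entry.deps:
            continue  # a built target, not a leaf
        copies = [e.deps[name] for e in entries.values() if name in e.deps]
        if copies and all(copy == entry.info for copy in copies):
            leaves.append(name)
    return sorted(leaves)

print(redundant_leaves(ENTRIES))
```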

JGN: Ah, I see, the actual question is whether to store a signature for a DAG leaf; thanks for the clarification. I don't think so. Even in the case where the Decider doesn't use all the information, the extra information is saved in the dependency, so it's available somewhere. If you can remember why you thought so, we can reconsider, but for now, I see no reason to do it.

  • If foo.o is an intermediate (target) file, it gets stored in the .sconsign file. If you really change it to a source file, when and how can you really prune its information from the .sconsign file, as opposed to keeping it around forever because it might be a target file again in the future? Probably the simplest way would be to just document that these go away when you remove a .sconsign file by hand. If we wanted to have a way to remove them automatically, the entry in the .sconsign file will have a timestamp for these files. One possibility would be to remove these if they're ridiculously out-of-date: for example, a file whose .sconsign timestamp is more than a year old and which is treated as a source file in this build would have its .sconsign info removed. The threshold would be configurable, of course.

JGN: I'm not sure I understand this scenario, but it's perfectly possible to be dealing with files more than a year old. If what you're recording is the last time you <ins>used</ins> this file, that would be fine, but if you're just using the file's timestamp, one run that didn't use the file would discard the information, then the next run that did would cause a rebuild, even if the file was still OK.

SK: In the example above, suppose foo.o is built in a separate directory or something, and you run the build a second time without reading the SConscript file that builds foo.o. You still link the foo executable, but you want to treat foo.o as a source file this time, even though it has dependency information from the previous full-DAG build still in the .sconsign file. So in the case of building just a subset of a full-DAG build, you want to preserve this dependency information, even though you're not using it.

Now suppose you actually CHANGE the build configuration so that foo.o really IS a source file--you check it in to your source code system and completely eliminate foo.c from your build. If SCons normally preserves "vestigial" dependency information in a .sconsign file because it might be a subsetted build, how can it ever decide to remove the information when it's really not necessary?

JGN: Yes, that's how I understood it: it's how long you keep the dependency information for foo.o as well as the signature information (if any) for foo.c. My concern stands: if you only use the file's timestamp, you could discard information prematurely. You need to record the last time you <ins>used</ins> the dependencies of foo.o (checked to see if it needed to be rebuilt) and the signature of foo.c (used it as a source); if these dates are more than mumble days ago, toss them. Since you always rewrite the .sconsign file, this shouldn't add significant overhead. (A year seems too long; it's certainly a conservative choice.)
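The difference between pruning on the file's own timestamp and pruning on a last-used time can be sketched as follows. This is plain Python rather than SCons code; the `last_used` field and both helper functions are hypothetical.

```python
SECONDS_PER_DAY = 86400

def mark_used(entries, name, now):
    """Record that this entry was consulted during the current run."""
    entries[name]["last_used"] = now

def prune_unused(entries, now, max_idle_days):
    """Keep only the entries consulted within the last max_idle_days."""
    cutoff = now - max_idle_days * SECONDS_PER_DAY
    return {name: entry for name, entry in entries.items()
            if entry["last_used"] >= cutoff}

now = 1000 * SECONDS_PER_DAY
entries = {
    # A very old file that is still consulted regularly: must survive.
    "foo.o": {"timestamp": 0, "last_used": now - SECONDS_PER_DAY},
    # A recent file whose information has not been consulted for 400 days.
    "stale.o": {"timestamp": now, "last_used": now - 400 * SECONDS_PER_DAY},
}
print(sorted(prune_unused(entries, now, max_idle_days=365)))
```

Pruning on `timestamp` instead would discard foo.o, which is exactly the premature loss described above.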

  • If we use .d files (and build signatures?) to store compiler-generated dependency info, the first build in a freshly-checked-out tree will never pull a file from the CacheDir(), because it will be missing all of the .d file dependencies and won't be able to compute the correct signature. I don't have a good answer for this one. It may be that we just have to document that CacheDir() and .d files are not compatible. That's lousy, but I don't know how you could make this work.

  • If it's configured to look at the timestamp and then look at signatures if the timestamp is different, should it update the target (or source?) timestamp if the signatures are the same, to avoid the check in the future? The current scheme is to just record whatever's available as a by-product of the decision (or was already available from a preceding run). So if the Decider function for a given Node looks at both the timestamp and signature, both will get stored. If it looks at only one, then only one gets stored.

JGN: So if the Decider only looks at the timestamp in one run, the hash would be discarded? That's not right, unless there's a way for the Decider to say that it wants the hash to be kept.

SK: No, in general it does preserve information from preceding runs, so long as the file is up to date.
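One way to picture "record whatever is available as a by-product, and keep what was already there" is the Python sketch below. It is an illustration under stated assumptions, not the SCons Decider implementation: `decide` and the `timestamp`/`csig` keys are invented names, and refreshing the stored timestamp when the contents turn out to be unchanged is one possible answer to the question in the bullet.

```python
import hashlib
import os

def file_md5(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def decide(path, stored):
    """Return (changed, info) for a timestamp-then-content check.

    info starts as a copy of the stored information, so anything that
    this run does not recompute is carried over from preceding runs.
    """
    mtime = os.stat(path).st_mtime_ns
    info = dict(stored)
    if stored.get("timestamp") == mtime:
        return False, info  # cheap path: no checksum needed
    info["timestamp"] = mtime  # refreshed so the next run takes the cheap path
    info["csig"] = file_md5(path)
    return info["csig"] != stored.get("csig"), info
```

A caller would save `info` back to the signature store after every run; a touched but otherwise unmodified file then costs one checksum once, and a single stat() call on later runs.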

  • If we UnTar a file, "unpredictable" timestamp values would be put on all of the "target" files extracted from the tar file. Should we use a build signature to propagate rebuilds up the DAG? Not sure. This needs more thought/investigation.

  • Object files with timestamps in the header would cause rebuilds after a comment in a source file has changed to propagate "unnecessarily" up the DAG. This is simply something that would have to be configured by the user as appropriate for the toolchain. Tools that stamp the contents of a target file in this way would need to be configured to not use the file contents. I could also see adding a hook into the logic that fetches file contents that would allow the user to supply a function to ignore the timestamped header when returning the contents for MD5-summing.

JGN: This hook would have to be dependent on both the source and target. Comments that would be discarded for the MD5 hash of a file to be passed to the C compiler are the real input to the Doxygen documentation generator, which would have a filter that discarded the C text.

SK: Actually, I'd say it's a function of the source and the Builder, not the target per se. Implicitly that means it's dependent on the <ins>type</ins> of target involved.

JGN: Er, yes, I think we're saying the same thing. I did assume the filter would be attached to (or by) the Builder.
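A content filter chosen per Builder might look like the following Python sketch. Nothing here is an existing SCons hook: `content_signature`, the `FILTERS` table and the builder names are hypothetical. The comment stripper is also deliberately naive, since it does not understand string literals.

```python
import hashlib
import re

_C_COMMENT = re.compile(rb"/\*.*?\*/|//[^\n]*", re.DOTALL)

def strip_c_comments(data):
    """Naive comment removal; ignores comment markers inside strings."""
    return _C_COMMENT.sub(b"", data)

def content_signature(data, content_filter=None):
    """MD5 of the contents after an optional, Builder-supplied filter."""
    if content_filter is not None:
        data = content_filter(data)
    return hashlib.md5(data).hexdigest()

# Hypothetical per-Builder filters: the compiler does not care about
# comments, the documentation generator does.
FILTERS = {"compile": strip_c_comments, "docs": None}

before = b"int x; /* old comment */\n"
after = b"int x; /* new comment */\n"
for builder, content_filter in FILTERS.items():
    same = (content_signature(before, content_filter)
            == content_signature(after, content_filter))
    print(builder, "unchanged" if same else "changed")
```

The same source file therefore has different effective signatures depending on which Builder consumes it, which is the point both commenters converge on.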

  • SCons is missing dependencies on the actual tool binaries themselves. If you change compiler versions, it should know that everything needs to be recompiled. This is actually a general issue, not specific to the signature refactoring. Adding a dependency to the invoked binary itself isn't hard (the Perl-based Cons classic did it) and should have been done a while ago. It's not a complete solution, though, because a tool like gcc has a whole bunch of subsidiary executables and libraries it uses, so just depending on the gcc wrapper itself isn't sufficient in the general case.

JGN: I think it's more important to create dependencies to local executables to make sure they're built before use than it is to worry about whether the GCC wrapper represents the total executable. And I'd also make the dependency just a timestamp unless the object was also built by the SConstruct, in which case I'd use whatever the default was. I think this functionality needs to be present as a part of the 1.0 release.

SK: Agreed. Right now I'm thinking of rolling it right in after merging branches/sigrefactor, maybe even for the next checkpoint.

JGN: That would be great. It would be really good to get it in before the next (external) release; we should try to combine changes that affect the schema into a single release.
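As a rough illustration of a timestamp-only dependency on the invoked binary, the Python sketch below resolves a command name with the standard library's `shutil.which` and records what a later run could compare against. `tool_dependency` and the returned dictionary are assumptions made for this example, not SCons behavior, and as noted above this covers only the wrapper executable, not any subsidiary programs it runs.

```python
import os
import shutil

def tool_dependency(command, search_path=None):
    """Return stat-based info for the executable that `command` resolves to.

    search_path is a PATH-style string; None means the PATH environment
    variable. Returns None when the command cannot be found.
    """
    exe = shutil.which(command, path=search_path)
    if exe is None:
        return None
    st = os.stat(exe)
    return {"path": exe, "timestamp": st.st_mtime_ns, "size": st.st_size}

print(tool_dependency("gcc"))
```

If the recorded path, timestamp or size differs on the next run, everything built with that tool would be considered out of date.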

  • How do we handle dependencies on Alias Nodes? Should they be "expanded" into the list of underlying files when stored in the .sconsign file, or should their collected signatures be stored as-is? Storing "expanded" lists of underlying file signatures avoids unnecessary rebuilds if you change the Aliases in your SConscript configuration without touching the underlying files. But storing all those lists of files over and over might make the .sconsign files pretty big. Really, you probably want to allow the user to configure this. I seem to have a fix to the code that allows storing Alias info as-is, but I haven't run it yet against the VMware code base. There's also an independent patch (to allow Aliases to be used as Builder sources) that might handle the "expansion" of signature information to the underlying files, but I haven't tried mixing that in yet.
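The "expanded" option can be sketched in Python as below. The alias table is a plain dictionary, `expand` and `stored_deps` are hypothetical helpers rather than SCons functions, and there is no protection against alias cycles.

```python
def expand(node, aliases):
    """Flatten an alias (possibly nested) into its underlying file names."""
    if node not in aliases:
        return [node]
    files = []
    for member in aliases[node]:
        files.extend(expand(member, aliases))
    return files

def stored_deps(sources, aliases, file_sigs):
    """What would be stored for a target: one signature per underlying file."""
    return {name: file_sigs[name]
            for source in sources
            for name in expand(source, aliases)}

file_sigs = {"a.h": "sig-a", "b.h": "sig-b"}
flat = {"headers": ["a.h", "b.h"]}
nested = {"headers": ["public"], "public": ["a.h", "b.h"]}
print(stored_deps(["headers"], flat, file_sigs)
      == stored_deps(["headers"], nested, file_sigs))
```

Regrouping the Aliases leaves the stored information untouched, so nothing is rebuilt. The cost is that every target depending on `headers` stores the full file list again.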

  • Value Nodes? How are they handled? Dependencies on Value Nodes look like they're handled correctly. I can express a dependency and the right signatures get calculated and rebuilds happen or not based on whether the change occurs. I need to make sure the sconsign script that dumps .sconsign file info prints these correctly.

  • Should there be a way to delete .sconsign values? Probably. I haven't thought through what this would look like. At first glance, it probably makes more sense to have it be an option to the sconsign script, not an option to scons itself.

JGN: Yes for an option to sconsign; it should be the number of days since the value (either dependency or signature) was used, as per the comment above.

  • There should be a way to populate .sconsign values from an up-to-date, fully-built tree. Yes. Actually, populating the values would already happen automatically -- scons always dumps its current build info in there. What this really means is that we want it to not rebuild the targets it already finds on disk, which shouldn't be too hard to code up.

  • Should we really use CPPPATH directories as direct inputs to a build? That is, shouldn't we rebuild if a new .h appears in any CPPPATH dir, because the dir contents changed...? I think this is a really great idea. It would make the CPPPATH directories first-order inputs to the builds they affect, with their "contents" (the list of entries in the directory) being what we track for changes. After all, the preprocessor does open up and search the directory for the files it includes...

JGN: Rescanning every time a directory changes is overkill. The signature for a file found by searching a directory list should include (a) the directories scanned where the file was <ins>not</ins> found, (b) the directory where the file was found, (c) the name of the file, and (d) the calculated values (timestamp, hash, whatever). This is generic; a file not located by a search simply has an empty list for (a). More than this is not needed: as long as the search for each source finds the same file again, it doesn't matter if the directories have changed in an unrelated fashion. (This is the essence of my block of comments above.)
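The four-part record described here can be sketched in Python as follows. `find_in_path` and its field names are invented for illustration, and a timestamp stands in for "the calculated values".

```python
import os

def find_in_path(name, dirs):
    """Search dirs in order and describe how the file was located.

    Returns None if the file is not found in any directory.
    """
    missed = []
    for directory in dirs:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return {
                "missed": tuple(missed),  # (a) directories where it was not found
                "found": directory,       # (b) directory where it was found
                "name": name,             # (c) the file name
                "timestamp": os.stat(candidate).st_mtime_ns,  # (d) values
            }
        missed.append(directory)
    return None
```

Comparing this record between runs detects a same-named file appearing earlier in the search path (the "found" directory changes) while ignoring unrelated files being added to any of the directories.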

SK: This really merits a whole separate discussion. The basic idea is to <ins>avoid</ins> unnecessary rescans by recording a dependency on each CPPPATH directory, where the "content signature" would be the list of files in that directory. Then, if a directory in your CPPPATH hasn't changed--because it has the same list of named files--you know you don't need to rescan for the pathological corner case of a same-named file in multiple directories. But again, we need a more thorough, separate discussion of this...
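A directory "content signature" of this kind could be as simple as the Python sketch below. `directory_signature` is a hypothetical helper, and hashing the sorted list of entry names is an assumption about what "the list of files in that directory" would mean in practice.

```python
import hashlib
import os

def directory_signature(path):
    """MD5 over the sorted names of the entries in a directory.

    Only the names matter: editing a file in place leaves the signature
    unchanged, while adding, removing or renaming an entry changes it.
    """
    names = sorted(os.listdir(path))
    return hashlib.md5(b"\n".join(os.fsencode(n) for n in names)).hexdigest()
```

If every CPPPATH directory has the same signature as in the previous run, no header can have appeared or disappeared along the search path, so the rescan could be skipped.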

JGN: I can believe that you'd want to gobble down the contents of a directory the first time you encountered it in a search as an optimization to make the 'is this file in that directory' test fast, but that's not the same thing as attaching the directory contents as a dependency. All that the signature needs to record is whether or not the file was found in a directory; the fact that the directory contents may have changed in some unrelated way is not material to whether a rebuild should occur. In fact, think of the case where a directory's contents are changed by subsequent actions within the same run (another header copied into a local include directory, for example). This shouldn't trigger a rebuild in the next run (unless, of course, the new header had the searched-for name, in which case, the SConscript probably doesn't have the right dependencies specified, but that's not SCons' problem).

Yes, it probably merits a separate discussion. Just untangling the requirements is tricky, and describing all the corner cases isn't easy. I don't think it's sufficient to have a conservative implementation that will rebuild too often; what's needed is a design that will rebuild exactly often enough.