Simple directory/file filtering for 'tmsu files'

Issue #65 resolved

I recently came across a very simple, straightforward query case that I felt should be expressible by a normal 'tmsu files' query.

Namely, to retrieve a list of all files in the current directory that are NOT tagged uploaded.

  • tmsu files -f not uploaded doesn't work (not restricted to current dir, doesn't catch files that have no tags at all)
  • tmsu files -f not uploaded |grep -v / doesn't work (doesn't catch files that have no tags at all)
  • tmsu tags * | grep -vw 'uploaded' | sed 's/: $//g' works. Obviously it's a tad convoluted.

IMO, the appropriate syntax to express this is something like

tmsu files ./ not uploaded

The relevant argument being a directory name or glob pattern (eg './' to specify all files in the current directory, recursively; './*' to specify all files in the current directory only -- no recursion; './*.jpg' to specify all jpg files, etc.)

This is based on the 'no slashes in tags' rule: anything containing a slash can't be a tag, so it must be a directory/glob pattern.
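A minimal sketch of that rule in Python, for illustration only. The function names `classify_argument` and `split_query` are hypothetical and are not part of TMSU; the keyword set is an assumption about which tokens the query language reserves.

```python
def classify_argument(arg: str) -> str:
    """Return 'path' for directory/glob arguments, 'operator' for query keywords, else 'tag'."""
    # Tags cannot contain a slash, so a slash marks a directory or glob pattern.
    if "/" in arg:
        return "path"
    # Assumed set of reserved query tokens.
    if arg in {"and", "or", "not", "(", ")"}:
        return "operator"
    return "tag"


def split_query(args):
    """Separate path patterns from the remaining query tokens, preserving order."""
    paths = [a for a in args if classify_argument(a) == "path"]
    query = [a for a in args if classify_argument(a) != "path"]
    return paths, query
```

Under this rule a bare name such as `etc` or `*.jpg` is always read as a tag, so a relative path has to be written with a leading `./` to be recognised.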

The only corner case I notice here is with leading 'or'/'and', that is tmsu files ./ or uploaded would then be a sensible expression.

There's some argument that if this is added, then placing directory/glob-patterns anywhere in the argument list makes sense.

Meaning tmsu files uploaded and ./ or tmsu files 'uploaded not ( ./ or ../foo )' would make sense, etc. I'm not sure what to think of that prospect. It doesn't seem like a technical problem, but I don't know whether it's a good idea.

(Yes, I'm still using tmsu. Am impressed by the speed improvements lately.)

Comments (18)

  1. Paul Ruane repo owner

    In the email I sent you I said:

    Thanks for pointing out the bug with file queries combined with the --files flag: I'll look into this as soon as I can and push out 0.3.1 to address it.

    I've had a chance to look into this now and it's working as I expected it to. After rereading your email I see that I had misread. I see that it's not working to your expectations rather than not working as designed.

    tmsu files -f not uploaded doesn't work (not restricted to current dir, doesn't catch files that have no tags at all)

    So there are basically two points here:

    1. You want to be able to restrict a query to the current directory.
    2. You want to be able to include untagged files.

    The former is a new requirement I had not anticipated: the ability to return only files for a given path or pattern. As I said in my email, I think this is a good idea so I'll have a think about how best to implement this.

    The latter is somewhat at odds with how the tool currently works. The 'files' command is a pure database query so for it to include untagged files from the current directory would be a bit unexpected and somewhat difficult to implement. I'll have a think about what you have said but I'm not sure I'd like 'tmsu files not sometag' to automatically pick up files from the current directory: it's true they're not tagged 'sometag' but the fact is they're not tagged at all and so unmanaged.

    Other options you may not have explored are:

    1. Using the 'status' command to identify the untagged files. I do this myself. Maybe there should be an option on the status command to list only untagged files. Anyhow, this is how I currently do it:

      $ tmsu status . | grep "^U"

    2. Recursively tag the current directory with a dummy tag. This will add the files to the database so that subsequent queries will list them:

      $ tmsu tag --recursive . dummy
      $ tmsu files not uploaded

    3. This one doesn't work but I thought I'd list it anyway as it really should! You tag the current directory with a dummy tag (non-recursive) and then recursively query the files:

      $ tmsu tag . dummy

      $ tmsu files --recursive not uploaded

  2. kitlau reporter

    I'm not sure I'd like 'tmsu files not sometag' to automatically pick up files from the current directory

    Oops, just to clarify, neither would I; it should only pick up files from the current dir if explicitly told to do so.

    I want tmsu files not sometag to pick up files that are tagged but aren't tagged sometag, whereas tmsu files ./ not sometag should pick up files that are in the set derived from recursive descent of the current directory, and are not tagged sometag.

    In case further clarification is needed, I'm thinking of the filelist as a temporary table. In other words, tmsu files ./ not sometag should derive the set of files for ./ , and then remove any files that are tagged 'sometag' from that set.
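The "temporary table" reading can be sketched as plain set subtraction. This is an illustration in Python, not TMSU's implementation: `files_under`, `path_query_not_tagged` and the `tags_by_file` mapping (a stand-in for the tag database) are all hypothetical.

```python
import os


def files_under(root):
    """Collect every non-directory entry below root by recursive descent."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.normpath(os.path.join(dirpath, name)))
    return found


def path_query_not_tagged(root, tags_by_file, tag):
    """Files under root that do not carry `tag`; untagged files stay in the set."""
    # Step 1: derive the candidate set from the filesystem.
    candidates = files_under(root)
    # Step 2: remove every file that carries the tag.
    tagged = {os.path.normpath(p) for p, tags in tags_by_file.items() if tag in tags}
    return sorted(candidates - tagged)
```

Because the candidate set comes from the filesystem rather than the database, files with no tags at all survive the subtraction, which is the behaviour asked for here.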

  3. Paul Ruane repo owner

    Changes in af35f6a.

    This partially implements what you suggested and not quite how you suggested it either, so sorry for that to begin with.

    Firstly I decided in the end to add an option rather than automatically detect whether the first argument is a tag or path. The reason for this is that I felt the behavior wasn't very intuitive when using relative paths:

    / $ tmsu files etc cheese

    Is 'etc' a tag or a relative directory here? OK, so we can remove the ambiguity by prefixing it with ./ but I imagine this would trip users up every time they tried to do this and the error, that the tag does not exist, wouldn't be very helpful either. I reserve the right to change my mind on this decision though :)

    The second shortcoming is that the command will not list untagged files from the specified path. I'm not comfortable with automatically mixing in untagged files on what is currently a pure database query. I will still add this in some form (I'm still thinking about it) but maybe with an additional option. It may also be quite complicated, certainly more complicated than detecting a 'not' query: a query like 'cheese or (not mushroom)' would also yield untagged files, so each branch of every 'or' expression would have to be searched in depth for 'not' expressions.
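One way to decide whether a query would yield untagged files, sketched in Python under stated assumptions. It does not search the tree for 'not' as described above; it swaps in a different technique and simply evaluates the query against an empty tag set. The tuple-based expression format and both function names are hypothetical, not TMSU's parser.

```python
def evaluate(expr, tags):
    """Evaluate a query tree against the set of tags on one file.

    Nodes are tuples: ("tag", name), ("not", e), ("and", l, r), ("or", l, r).
    """
    kind = expr[0]
    if kind == "tag":
        return expr[1] in tags
    if kind == "not":
        return not evaluate(expr[1], tags)
    if kind == "and":
        return evaluate(expr[1], tags) and evaluate(expr[2], tags)
    if kind == "or":
        return evaluate(expr[1], tags) or evaluate(expr[2], tags)
    raise ValueError(f"unknown node: {kind!r}")


def matches_untagged_files(expr):
    """An untagged file has no tags, so it matches iff the query holds for the empty set."""
    return evaluate(expr, frozenset())
```

For example, `("or", ("tag", "cheese"), ("not", ("tag", "mushroom")))` matches untagged files, while `("and", ("tag", "cheese"), ("not", ("tag", "mushroom")))` does not, even though both contain a 'not'.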

    Anyhow, I hope what I've checked in is at least somewhat useful. I'll keep this issue open and continue work on it.

  4. Paul Ruane repo owner

    I've added --untagged to the files command which, when combined with --path, will include untagged files.

    It started off very complicated until it dawned on me that I could add the files to the database and simply not commit the transaction. This way the database does the work (as normal) and the complexity added to TMSU is minimal.

    Changes in 944621c.
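The uncommitted-transaction idea can be sketched with Python's `sqlite3` module. This is an illustration only: the three-table schema, the sample rows and the function names are assumptions made for the example and do not reproduce TMSU's actual schema or code.

```python
import sqlite3

# Hypothetical schema, assumed for this example only.
SCHEMA = """
CREATE TABLE file (id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL);
CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE file_tag (file_id INTEGER NOT NULL, tag_id INTEGER NOT NULL);
"""


def make_demo_db():
    """Build an in-memory database with two tagged files."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO file (id, path) VALUES (?, ?)",
                     [(1, "a.txt"), (2, "b.txt")])
    conn.executemany("INSERT INTO tag (id, name) VALUES (?, ?)",
                     [(1, "uploaded"), (2, "holiday")])
    conn.executemany("INSERT INTO file_tag (file_id, tag_id) VALUES (?, ?)",
                     [(1, 1), (2, 2)])
    conn.commit()
    return conn


def files_not_tagged(conn, tag, paths_on_disk):
    """Query files lacking `tag`, with on-disk paths added only temporarily."""
    try:
        # The insert opens a transaction; files already known are skipped.
        conn.executemany("INSERT OR IGNORE INTO file (path) VALUES (?)",
                         [(p,) for p in paths_on_disk])
        rows = conn.execute(
            """
            SELECT path FROM file
            WHERE id NOT IN (
                SELECT file_tag.file_id FROM file_tag
                JOIN tag ON tag.id = file_tag.tag_id
                WHERE tag.name = ?
            )
            ORDER BY path
            """,
            (tag,),
        ).fetchall()
        return [row[0] for row in rows]
    finally:
        # Never commit: the temporary rows vanish with the rollback.
        conn.rollback()
```

The query sees the extra rows because it runs inside the same open transaction, and the rollback leaves the stored data exactly as it was.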

  5. kitlau reporter

    Messing around with TMSU has been an education in practical SQL.

    Sorry I haven't followed up on any of your comments, real life happened.

    Trying this out now:

    tmsu files -u -p ./ not uploaded

    It works as expected, so yes, this is definitely useful now.

    Since you are talking about temporarily adding the files to the DB, I am guessing that most of the time (about 3min for my test case) is spent on fingerprinting (34 files / 63mb, none of which are tagged). But benchmarking with time sha1sum * and time sha256sum * throws this into doubt -- (warm start) sha256sum takes ~500ms and sha1sum takes ~250ms, so maybe I'm just running into a general problem of database scale.

    From a warm start, tmsu files -u -p ./ not uploaded returns in a mere 1.7 seconds, so I'll tentatively say the initial slowness is mainly due to being IO-bound on the file_tag table lookup.

  6. Paul Ruane repo owner

    There is no fingerprinting with the temporarily added files: fingerprints are only used for detecting moved files by 'tmsu repair' and duplicate files with 'tmsu dupes', so they are of no use during the call to 'tmsu files'.

    You said it still took three minutes for just 34 files? That does seem slow. You then said it took 1.7 seconds on a subsequent call? That's a massive difference. What exactly did you mean by 'warm start' in this context?

  7. kitlau reporter

    Good to know.

    By warm start I am using it in the typical benchmarking sense, so if I run tmsu files -u -p ./ not uploaded for the first time, that is definitely a cold start, and if I follow up by repeating that command, the repetition has a warm start (==the best precaching of any on-disk data it needs that it's likely to ever get; all or most of the relevant data is likely to already be in the disk cache).

  8. Paul Ruane repo owner

    Oh the disk cache, of course. I'll look into the performance. Three minutes seems a long time for what 'find' could probably do in under 1 second. I assume it's the adding to Sqlite rather than filesystem enumeration but you never know.

  9. Paul Ruane repo owner

    FYI, I'm planning on removing the --untagged facility from the files command and putting the functionality into a separate 'untagged' subcommand. The reason for this is discussed in issue #79 but the crux is that the following behaviours are inconsistent and confusing:

    1. 'tmsu files --path some/path --untagged sheep' adds untagged files to the mix rather than restricts to untagged files.
    2. 'tmsu files --path some/path --untagged' with no query does not list untagged files, though one would expect it to.
    3. 'tmsu status --untagged' lists just the untagged files (restricts) but each entry is prefixed with 'U' (for consistency with other 'status' output).

    I think the best approach is to remove the --untagged option from 'files' and 'status' so that 'files' operates purely on the database and 'status' compares the filesystem to the database. A new 'untagged' subcommand can then be used to list untagged files as needed.

    The result of this would be that a negative query that includes untagged files would change from:

    $ tmsu files --path some/path --untagged not hairy

    to:

    $ tmsu files --path some/path not hairy && tmsu untagged some/path

    A status query for untagged files would change from:

    $ tmsu status --untagged some/path

    to:

    $ tmsu untagged some/path

    I'll have a look at the implications this has for performance, but it may actually result in slightly faster behaviour despite the additional process call, because TMSU would no longer need to add the files temporarily to the database.

    Let me know if you have any ideas or concerns.

  10. kitlau reporter

    A subcommand makes a lot of sense in view of the inconsistencies you bring up. I have a number of scripts that I'll have to adjust, of course, but mostly this should make them faster and less complicated.

    One question I have is where files that are fingerprinted but don't currently have any tags will fall here. Will they be listed by tmsu files or tmsu untagged , and is there a reasonably easy to remember way to explain why?

    EDIT: Completely OT, but I think you generated an awkward/ambiguous sentence in the 0.5.0 release notes, here's my suggestion for a revised version: * Broken symbolic links can now be tagged, with a warning printed to stderr.

    I guess once you implement the 'untagged' subcommand, I can modify the summary to 'Simple directory/file filtering when querying' and close this?

  11. Paul Ruane repo owner

    where files that are fingerprinted but don't currently have any tags will fall here. Will they be listed by tmsu files or tmsu untagged , and is there a reasonably easy to remember way to explain why?

    TMSU, since quite an early version, automatically removes untagged files from the database. This means you should never find untagged files in the database unless the database has been manually edited or there is a bug in the program code.

    So the answer is they should be picked up by 'untagged' and not 'files'.

    And yes, once I've checked in 'untagged' (which I already have implemented) you can give it a whirl and close this if you're happy.

    Thanks Paul

  12. kitlau reporter

    tmsu untagged is so fast and simple I can hardly complain. One thing that wasn't obvious to me initially was its directory descent policy. I eventually figured out it is equivalent to find -L (ie, dereferences symbolic linked directories and descends into them)

    Not sure what I was thinking about changing the summary, the summary still makes sense IMO since this issue covers both tmsu untagged and tmsu files --path now. So, just closing.
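A sketch of that descent policy in Python, for illustration only. `untagged_files` and its `tagged_paths` argument (a stand-in for the paths stored in the database) are hypothetical; only `os.walk` and its `followlinks` parameter are real library API.

```python
import os


def untagged_files(root, tagged_paths, follow_links=True):
    """List files below root that are absent from tagged_paths.

    follow_links=True descends into symlinked directories, like `find -L`.
    """
    known = {os.path.normpath(p) for p in tagged_paths}
    result = []
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=follow_links):
        for name in filenames:
            path = os.path.normpath(os.path.join(dirpath, name))
            if path not in known:
                result.append(path)
    return sorted(result)
```

With `follow_links=False` a symlinked directory is not entered, so its contents never show up as untagged. Following links has a cost: `os.walk` does not detect cycles, so a link pointing back at a parent directory can make the walk recurse indefinitely.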
