Inside zip containers, skipped names / skipped paths don't seem to work

manuelbuser created an issue

Hello

First of all, recoll is fabulous. I am using large zip containers that hold archived data in a tree structure. It seems that all files inside the zip are indexed, even if there is a "skipped name" or "skipped path" rule that would match a name / path inside the zip. Am I doing something wrong? It would be great to have skipping rules also work inside containers.

Recoll 1.19.3 + xapian 1.2.8

Comments (4)

  1. medoc

    Hi,

    You are right that skipped paths and names are not respected inside zip files (or other archives); they only work for real file system files.

    I agree that it would be nice to have the possibility to use the file selection configuration inside archives. This is not a simple issue, because the code which walks the file system and the one which walks zip archives are totally separate (not even the same language...), and filters currently can't access the configuration.

    Identifiers inside compound documents are not necessarily file-like paths (e.g.: email folder files have message numbers), so there was no real reason initially to extend the path selection mechanism to filters.

    I am putting this on the todo, but it will take some time.

    Meanwhile, if you can write a little Python, it would probably be quite simple to modify the zip filter for skipping some paths or names (which you could read from some kind of configuration file, or just hard-code inside the modified filter).
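A minimal sketch of the kind of check such a modified filter could apply, assuming hard-coded patterns. The names `SKIPPED_PATTERNS`, `is_skipped` and `members_to_index` are hypothetical and are not part of the actual rclzip filter, which has its own structure for talking to the indexer:

```python
import fnmatch
import io
import zipfile

# Hypothetical hard-coded patterns; they could also be read from a file.
SKIPPED_PATTERNS = ["*.txt", "somedir/*/*.html"]


def is_skipped(member_name, patterns=SKIPPED_PATTERNS):
    """Return True if the archive member name matches any skip pattern."""
    # fnmatchcase is used so that matching is case-sensitive on all platforms.
    return any(fnmatch.fnmatchcase(member_name, pat) for pat in patterns)


def members_to_index(zip_file, patterns=SKIPPED_PATTERNS):
    """Return the names of the archive members that would still be indexed."""
    with zipfile.ZipFile(zip_file) as zf:
        return [name for name in zf.namelist() if not is_skipped(name, patterns)]
```

The same test would have to be applied wherever the filter enumerates archive members, so that skipped entries are never handed to the indexer.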

    You can then tell recoll to use your own filter by having the following inside ~/.recoll/mimeconf:

    [index]
    application/zip = execm /path/to/my/rclzip;charset=default
    
  2. medoc

    Fixed by the new rclconfig.py module and a modification of the rclzip code. To use it before the next release:

    • Fetch python/recoll/recoll/rclconfig.py and filters/rclzip from the source tree
    • Copy both to /usr/share/recoll/filters, make rclzip executable

    Set a variable named zipSkippedNames inside recoll.conf:

    • This is a space-separated list of patterns which will be passed to Python fnmatch. The / character is not special: wildcards such as * match it like any other character.
    • You can't use embedded spaces in patterns (no double-quote quoting for now)
    • This can be redefined for file system directories using the usual section indicators

    Example:

    zipSkippedNames = *.txt
    [/path/to/the/dir]
    zipSkippedNames = somedir/*/*.html
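To illustrate how such patterns behave, here is a small sketch that splits a zipSkippedNames value on whitespace and tests member names with Python's fnmatch module. The helper name `skipped` is hypothetical, and the real rclzip code may apply the patterns differently; the point is only that `*` also matches `/`:

```python
import fnmatch


def skipped(member_name, zip_skipped_names):
    """Check a member name against a space-separated pattern list."""
    patterns = zip_skipped_names.split()
    # fnmatchcase keeps the comparison case-sensitive on every platform.
    return any(fnmatch.fnmatchcase(member_name, p) for p in patterns)


# "*" matches across "/", so "*.txt" catches files at any depth.
print(skipped("a/b/c.txt", "*.txt"))
# Each "*" may span several path components.
print(skipped("somedir/x/y/z.html", "somedir/*/*.html"))
# A name outside somedir matches neither pattern.
print(skipped("otherdir/x/z.html", "*.txt somedir/*/*.html"))
```

Because `/` is not special, a pattern such as `somedir/*/*.html` also skips HTML files nested more than two levels below somedir.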
    