Issue #111 wontfix

python api - no way to retrieve a particular document?

Anonymous created an issue

I've made a simple web UI based on BottlePy and I can retrieve a list of search results just fine (well, almost, see #110), but I would then like to link each result to the full text of that particular result. query.execute(url) returns no results, and ipath is null in the docs I retrieve via text search. Is there any way of retrieving a particular document via the Python API?

Comments (9)

  1. medoc repo owner

    There are 2 possible issues here, depending on what you are really doing and the recoll version:

    The old way to retrieve a URL was to use:

      doc = query.fetchone()
      url = getattr(doc, "url").encode('utf-8')
    

    As far as I know, it mostly works (always with ascii), but it would sometimes fail when the original file path was not encoded according to the locale.

    So in 1.18, a method was added: doc.getbinurl(), which returns the url as a binary string, meaning the part after file:/// is supposed to be bit for bit what came out of the readdir(), and can be used as parameter to any system call.
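
    As a minimal sketch of how such a binary URL could be turned into a path for system calls: the helper below is hypothetical (binurl_to_path is not part of the recoll module), and it assumes that getbinurl() returns a bytes or bytearray value starting with file:// and that the path part needs no further decoding, as described above.

```python
def binurl_to_path(binurl):
    """Strip the file:// scheme from a binary URL and return the raw path bytes.

    Hypothetical helper, not part of the recoll module. It assumes the bytes
    after file:// are the path exactly as the filesystem returned it.
    """
    raw = bytes(binurl)  # accepts bytes or bytearray
    prefix = b"file://"
    if not raw.startswith(prefix):
        raise ValueError("not a file:// URL: %r" % (raw,))
    return raw[len(prefix):]
```

    The result can be passed as a bytes path to calls such as open(path, "rb") or os.stat(path), which avoids any locale-dependent decoding.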

    Maybe I'd need to know what version you are using and to see your code to be sure of what goes wrong here.

  2. koniu

    Here's a sample code illustrating what I'm trying to do:

    import recoll
    
    # init
    db = recoll.connect()
    query = db.query()
    
    # search
    nres = query.execute('attack')
    print nres
    
    # get one document
    doc = query.fetchone()
    url = getattr(doc, 'url')
    print url
    
    # try get that document
    nres = query.execute(url)
    print nres
    

    Here's the output:

    30
    file:///mnt/data/archive/foo/bar.doc
    0
    

    It makes no difference whether encode('utf-8') is used. I might add that I also tried executesd() with sd.addclause(type='and', field='url', qstring=url), but to no avail. My suspicion (I didn't read the source code) is that you simply can't search by url? Or maybe there's another way?

    I suppose the best solution for what I'm trying to do would be if the Doc class included docid from xapiandb and Query had an extra method executeid to retrieve a doc with that id only.

  3. koniu

    For the record, I tried with recoll 1.18 and still getting no results. I can't see how I could use the url from getbinurl():

    $ python test.py 
    147
    file:///home/koniu/data/archive/foo/bar.doc
    Traceback (most recent call last):
      File "test.py", line 20, in <module>
        nres = query.execute(url)
    TypeError: decoding bytearray is not supported
    
  4. medoc repo owner

    No, you can't search by URL. When you have the URL, there is nothing left to search for, and you are right that query.execute(url) will do nothing useful.

    I think that I don't quite understand what you want to do here. Once you have a query result with a URL, you are done with recoll. If you want to fetch the document, you use the URL with whatever access method is suitable for you. The things I mentioned about the nuances of URL encoding just deal with the suitability of using it as a parameter to some other access method, like a system call.

    So I'm a bit lost, actually. Is your question in fact about getting a document preview like you can do with the GUI? What do you mean by "retrieve a doc"?

  5. koniu

    Sorry for not making myself clear - it is indeed about getting a plain-text preview from the index rather than retrieving the file from its location via url. On that note, I further noticed that Doc.text is only available when indexing. Is there (or could there be) a way of getting a preview via Python?

  6. medoc repo owner

    When querying, the only things which are available in the doc objects are:

    • Those that were stored in the Xapian data record at indexing time (ie author, size, or other fields declared as stored in the "fields" file).
    • Those which we can rebuild from the indexed terms: snippets

    The document text is not stored in the index, so it is not retrieved at query time.
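
    To illustrate what this leaves available to a web UI, here is a rough sketch that collects stored fields from a result object. The helper name stored_fields is made up for this example, and the behaviour of a real recoll Doc for a field that was not stored (empty value versus AttributeError) is an assumption, so both cases are treated as "absent".

```python
def stored_fields(doc, names):
    """Return a dict of the requested fields that are present and non-empty.

    Hypothetical helper: 'doc' is expected to expose stored fields as
    attributes, the way getattr(doc, 'url') is used earlier in this thread.
    """
    found = {}
    for name in names:
        value = getattr(doc, name, None)  # tolerate fields that do not exist
        if value:  # skip missing or empty fields
            found[name] = value
    return found
```

    With a real result this might be called as stored_fields(doc, ["url", "author", "size"]); there is no equivalent call that would return the full document text.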

    The "preview" function in the GUI actually re-extracts the document text in mostly the same way as was done during indexing.

    While there would probably be no great difficulty in defining a Python interface to the text extractor, the problem I fear is that we would be pulling in the whole filter framework, with fork-execs, asynchronous pipe communication etc., and I'm not too sure how all this is going to behave when the Python module is used as a plugin, for example inside the Unity desktop.

    Another, maybe simpler, possibility would be to define a command line interface to the preview text extraction, which you could just execute with appropriate parameters to retrieve the converted document.

    Actually the recoll command currently has an option to do something very similar to extract a document from a compound one, for opening from the Unity Lens. You could run the rclxx filter on this and this would give you what you want. You can have a look at the Lens to see how the recoll command is invoked.

    For the filter part, you'd be limited to the simple ones (the ones which output the text and exit), but this would cover a good part of what you need, probably.
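
    For the simple filters, the workaround could look roughly like the sketch below, written for Python 3. The function extract_text and its parameters are hypothetical; the filter command line is supplied by the caller, since which filter applies to which file type, and what arguments it takes, is not specified here.

```python
import subprocess


def extract_text(filter_argv, path, timeout=60):
    """Run a one-shot filter on 'path' and return what it wrote to stdout.

    Hypothetical helper: 'filter_argv' is the filter command as a list of
    arguments (chosen by the caller); the file path is appended as the last
    argument. Only suitable for filters that print their output and exit.
    """
    result = subprocess.run(
        list(filter_argv) + [path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,  # do not let a stuck filter block the web UI
        check=True,       # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.decode("utf-8", errors="replace")
```

    Running the filter in a separate process keeps the fork-exec and pipe handling out of the Python module itself, which is the concern raised above.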

    Some things are probably still unclear, don't hesitate to ask more questions, this is all a bit experimental.

  7. medoc repo owner

    Closing this, as a real solution would be too complicated and would probably be a source of problems, and a relatively reasonable workaround exists. I think that an extension of the Python API to help implement the workaround (e.g. giving access to the filters config) would actually be a more reasonable approach.
