FTS for openbiblio, using Apache Solr
=====================================

Overview:

This provides a simple search interface for openbiblio, using a network-addressable Apache Solr instance to provide full-text search (FTS) over the content.

The indexer currently relies on the Entry model (in /model/entry.py) to provide an acceptable dictionary of terms to be fed to a Solr instance.

Configuration:

In the paster main .ini, you need to set the param 'solr.server' to point to the solr instance. For example, 'http://localhost:8983/solr' or 'http://solr.okfn.org/solr/bibliographica.org'. If the instance requires authentication, set the 'solr.http_user' and 'solr.http_pass' parameters too. (Solr is often put behind a password-protected proxy, due to its lack of native authentication for updating the index.)
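
As a rough illustration of the settings above, the following Python 3 sketch reads them from ini text and collects the optional credentials. The [app:main] section name, the sample credentials and the solr_settings helper are assumptions made for this example; they are not taken from openbiblio itself.

```python
import configparser

# Sample ini text. The section name and the credentials are assumptions.
SAMPLE_INI = """
[app:main]
solr.server = http://localhost:8983/solr
solr.http_user = indexer
solr.http_pass = secret
"""


def solr_settings(ini_text, section="app:main"):
    """Return (server_url, auth_kwargs) read from paster-style ini text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    conf = parser[section]
    server = conf["solr.server"]
    auth = {}
    # Only pass credentials along when both are configured.
    if conf.get("solr.http_user") and conf.get("solr.http_pass"):
        auth = {
            "http_user": conf["solr.http_user"],
            "http_pass": conf["solr.http_pass"],
        }
    return server, auth


server, auth = solr_settings(SAMPLE_INI)
```

The returned dict uses the same keyword names as the SolrConnection call shown later in this document, so it could be passed on as SolrConnection(server, **auth).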

Basic usage:

The search controller: solr_search.py   (linked in config/routing.py to /search)

    Provides HTML and JSON responses (via content negotiation) and interprets a limited but easily expandable subset of Solr params (see ALLOWED_TERMS in the controller).

    The JSON response is the raw Solr response, as this is readily usable from JavaScript.
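
Because the JSON is the raw Solr response, a client can read it using Solr's standard layout: a "response" object holding "numFound" and a "docs" list. The sketch below parses a trimmed, made-up sample. The documents and the summarise helper are illustrative only, and which fields come back depends on the query.

```python
import json

# A trimmed, made-up example in Solr's standard JSON response shape.
SAMPLE = """
{
  "responseHeader": {"status": 0, "QTime": 3},
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {"uri": "http://bibliographica.org/entry/BB1000",
       "title": ["An example title"]},
      {"uri": "http://bibliographica.org/entry/BB1001"}
    ]
  }
}
"""


def summarise(raw):
    """Return (total_hits, [(uri, title), ...]) from a raw Solr JSON reply."""
    response = json.loads(raw)["response"]
    rows = []
    for doc in response["docs"]:
        title = doc.get("title", "(untitled)")
        # Multi-valued Solr fields arrive as lists; take the first value.
        if isinstance(title, list):
            title = title[0] if title else "(untitled)"
        rows.append((doc["uri"], title))
    return response["numFound"], rows


total, rows = summarise(SAMPLE)
```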

    HTML response is styled in the same manner as the previous (xapian-based?) search service, with each row formatted by the genshi function "solr_search_row" in templates/paginated_common.html. Unless specified otherwise, the search controller will get all the fields it can for the search terms, meaning that the list of results in c.solr.results contains dicts with much more information than is currently exposed. The potentially available fields are as follows:

    "uri"          # URI for the item - eg http://bibliographica.org/entry/BB1000
    "title"        # Title of the item
    "type"         # URI type(s) of the item (eg http://.... bibo#Document)
    "description"  
    "issued"       # Corresponds to the date issued, if given.
    "extent"
    "language"     # ISO formatted, 3 lettered - eg 'eng'
    "hasEditionStatement"

    "replaces"        # Free-text entry for the work that this item supersedes
    "isReplacedBy"    # The reverse of the above

    "contributor"           # Author, collaborator, co-author, etc
                            # Formatted as "John Smith b1920 <http://bibliographica.org/entity/E1000>"
                            # Use lib/helpers.py:extracturi method to add formatting.
                            # Give it a list of these sorts of strings, and it will return 
                            # a list of tuples back, in the form ("John Smith b1920", "http...")
                            # or ("John Smith", "") if no <>-enclosed URI is found.
    "contributor_filtered"  # URIs removed
    "contributor_uris"      # Just the entity URIs alone

    "editor"                # editor and publisher are formatted as contributor
    "publisher"
    "publisher_uris"        # list of publisher entity URIs

    "placeofpublication"    # Place of publication - as defined in ISBD. Possible and likely to
                            # have multiple locations here

    "keyword"               # Keyword (ie not ascribed to a taxonomy)
    "ddc"                   # Dewey number (formatted as contributor, if accompanied by a URI scheme)
    "ddc_inscheme"          # Just the dewey scheme URIs
    "lcsh"                  # eg "Music <http://id.loc.gov/...>"
    "lcsh_inscheme"         # lcsh URIs

    "subjects"              # Catch-all, with all the above subjects queryable in one field.

    "bnb_id"                # Identifiers, if found in the item
    "bl_id"
    "isbn"
    "issn"
    "eissn"
    "nlmid"                 # NLM-specific id, used in PubMed
    "seeAlso"               # URIs pertinent to this item
    
    "series_title"          # If part of a series: (again, formatted like contributor)
    "series_uris"

    "container_title"       # If it has some other container, like a Journal, or similar
    "container_type"

    "text"                  # Catch-all and default search field.
                            # Covers: title, contributor, description, publisher, and subjects

    "f_title"               # Fields indexed to be suitable for faceting
    "f_contributor"         # Contents as above
    "f_subjects"
    "f_publisher"
    "f_placeofpublication"  # See http://wiki.apache.org/solr/SimpleFacetParameters for info

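The "label <URI>" format used by contributor, publisher and similar fields can be split apart with a small amount of parsing. The sketch below is a hypothetical stand-in that mimics the behaviour described for lib/helpers.py:extracturi in the field list. It is not that function's actual code, and the name split_label_and_uri is invented here.

```python
import re

# Matches an optional label followed by a trailing <...> section.
_URI_RE = re.compile(r"^(?P<label>.*?)\s*<(?P<uri>[^<>]*)>\s*$")


def split_label_and_uri(values):
    """Turn 'label <uri>' strings into (label, uri) tuples.

    Hypothetical stand-in for lib/helpers.py:extracturi. Strings with
    no <>-enclosed URI come back as (label, "").
    """
    pairs = []
    for value in values:
        match = _URI_RE.match(value)
        if match:
            pairs.append((match.group("label"), match.group("uri")))
        else:
            pairs.append((value.strip(), ""))
    return pairs


pairs = split_label_and_uri([
    "John Smith b1920 <http://bibliographica.org/entity/E1000>",
    "John Smith",
])
```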

The query text is passed to the Solr instance verbatim, so complex queries can be entered in the textbox using normal Solr/Lucene syntax. See http://wiki.apache.org/solr/SolrQuerySyntax for generic documentation. The basics of the more advanced search syntax are as follows:


  field:query  -- search only within a given field,
  
  eg 'contributor:"Dickens, Charles"'
  
  Note that query text within quotes is searched for exactly as written. The above search will
  not hit an author value of "Charles Dickens", for example, which is why it is not a good
  way to search generically.


  Booleans, AND and OR -- if left out, multiple queries will be OR'd

  eg 'contributor:Dickens contributor:Charles' == 'contributor:Dickens OR contributor:Charles'
  
  The above will match contributors who are called 'Charles' *OR* 'Dickens' (non-exclusively), which is unlikely to be what is desired. 'Charles Smith' and 'Eliza Dickens' would both be valid hits in this search.

  'contributor:Dickens AND contributor:Charles' would be closer to what is intended.
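
When queries are built in code rather than typed by hand, the explicit AND form can be generated from a list of words. This is a minimal sketch with an invented helper name (all_of). It assumes each word is a single plain token with no spaces or Lucene special characters.

```python
def all_of(field, words):
    """Build 'field:w1 AND field:w2 ...' so that every word must match.

    Assumes each word is one plain token: no spaces or special characters.
    """
    return " AND ".join("%s:%s" % (field, word) for word in words)


query = all_of("contributor", ["Dickens", "Charles"])
```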


  URI matching -- many fields include the URI and these can be used to be specific about the match

  eg 'contributor:"http://bibliographica.org/entity/E200000"'

  Given an entity URI, you can therefore see which items it published, contributed to, etc. just by searching for the URI in the relevant field.
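
A URI contains characters such as ':' and '/' that the query parser treats specially, which is why the example wraps it in double quotes. The sketch below builds such a quoted query. The helper name phrase_query is invented for this example, and it escapes only backslashes and double quotes, the two characters that need escaping inside a quoted phrase.

```python
def phrase_query(field, value):
    """Build field:"value", escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '%s:"%s"' % (field, escaped)


query = phrase_query("contributor", "http://bibliographica.org/entity/E200000")
```

The same helper would work against the URI-only fields, for example publisher_uris.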


Basic Solr Updating:

    The 'solrpy' library (>=0.9.4, as this version includes basic auth support) is used to talk to the Solr instance; see that project for library-specific documentation.

    Fundamentally, to update the index, you need an Entry (model/entry.py) instance mapped to the item you wish to (re)index and a valid SolrConnection instance.

    from solr import SolrConnection, SolrException
    # Entry is the model class defined in model/entry.py; import it from
    # wherever the openbiblio package lives in your environment.
    s = SolrConnection("http://host/solr", http_user="", http_pass="")
    e = Entry.get_by_uri("Entry Graph URI")

    Then it's straightforward (catching two typical errors):

    from socket import error as SocketError
    try:
        s.add(e.to_solr_dict())
        # To commit updates (inadvisable after every small change of a bulk update):
        # s.commit()
    except SocketError:
        print("Solr isn't responding or isn't there")
    except SolrException:
        print("Something is wrong with the update that was sent. Make sure "
              "the Solr instance has the correct schema in place and is "
              "working, and that the Entry has something in it.")
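
Since committing after every small change is inadvisable, a bulk update can send documents in batches and commit once at the end. The following sketch assumes the connection object offers add_many() and commit(), as solrpy's SolrConnection is understood to. The helper name and the default batch size are invented for this example.

```python
def index_in_batches(conn, docs, batch_size=1000):
    """Send docs to Solr in batches, then commit once at the end.

    conn is assumed to provide add_many(list_of_dicts) and commit().
    Returns the number of documents sent.
    """
    sent = 0
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            conn.add_many(batch)
            sent += len(batch)
            batch = []
    if batch:
        # Send whatever is left over after the last full batch.
        conn.add_many(batch)
        sent += len(batch)
    conn.commit()
    return sent
```

With a list of Entry instances called entries, this might be invoked as index_in_batches(s, (e.to_solr_dict() for e in entries)).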

Bulk Solr updating from nquads:

    There is a paster command that takes the nquads Bibliographica.org dataset, parses it into mapped Entry objects and then performs the update described above.

    Usage: paster indexnquads [options] config.ini NQuadFile
    Create Solr index from an NQuad input

    Options:
      -h, --help            show this help message and exit
      -c CONFIG_FILE, --config=CONFIG_FILE
                            Configuration File
      -b BATCHSIZE, --batchsize=BATCHSIZE
                            Number of solr 'docs' to combine into a single update
                            request document
      -j TOJSON, --json=TOJSON
                            Do not update solr - entry's solr dicts will be
                            json.dumped to files for later solr updating

The --json option is particularly useful for production systems: the time-consuming part of this process is the parsing and mapping to Entry objects, so you can offload that work to any computer and then upload the solrupdate*.json files it creates directly to the production system for rapid indexing.

NOTE! The output starts with solrupdate0.json and counts up. It does NOT check for the existence of previous Solr update files, and they will be overwritten!
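
Because existing files are overwritten silently, it may be worth checking for leftovers before starting an export. The sketch below is a hypothetical guard, not part of the paster command. It assumes the files are written to the directory you pass in, and its function names are invented for this example.

```python
import glob
import os


def existing_exports(directory="."):
    """Return any solrupdate*.json files already present, sorted by name."""
    pattern = os.path.join(directory, "solrupdate*.json")
    return sorted(glob.glob(pattern))


def check_safe_to_export(directory="."):
    """Raise RuntimeError if an earlier export would be overwritten."""
    found = existing_exports(directory)
    if found:
        raise RuntimeError(
            "Refusing to export: %d existing file(s), e.g. %s"
            % (len(found), found[0])
        )
```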

[I used a batchsize of 10000 when using the json export method FYI]

Bulk Solr updating from the aforementioned solrupdate*.json files:

    Usage: paster indexjson [options] config.ini solrupdate*
    Create Solr index from a JSON serialised list of dicts

    Options:
      -h, --help            show this help message and exit
      -c CONFIG_FILE, --config=CONFIG_FILE
                            Configuration File
      -C COMMIT, --commit=COMMIT
                            COMMIT the solr index after sending all the updates
      -o OPTIMISE, --optimise=OPTIMISE
                            Optimise the solr index after sending all the updates
                            and committing (forces a commit)

eg

    "paster indexjson development.ini --commit=True solrupdate*"