Lucene

Preface

  • Indices are configured centrally via Spring in the files LuceneIndexConfig.xml, LucenePostFields.xml, LucenePublicationFields.xml and LuceneBookmarkFields.xml. BibSonomy Post objects are automatically converted into Lucene Document objects accordingly (see org.bibsonomy.lucene.util.LuceneResourceConverter).

  • The update mechanism has been reworked such that no posts get lost and tag updates make their way into the index:

    • in the index, the last tas_id and the last change_date of the log_bibtex or log_bookmark table, respectively, are stored.
    • all posts with tas_id > last_tas_id are to be added (and deleted in advance, due to potential tas updates)
    • all posts from the log_* tables with change_date >= last_change_date - epsilon are to be deleted from the index
  • Spam-flagging/unflagging is now thread-safe
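The update rules above can be sketched roughly as follows. This is a minimal sketch: IndexUpdateSketch, LogEntry, and the epsilon value are made-up names and numbers for illustration, not the actual BibSonomy classes.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the incremental update rules described above.
// All class and field names are illustrative, not the real BibSonomy API.
public class IndexUpdateSketch {

    static class LogEntry {
        final long tasId;
        final long changeDate; // epoch millis of the log table's change_date
        LogEntry(long tasId, long changeDate) { this.tasId = tasId; this.changeDate = changeDate; }
    }

    static final long EPSILON_MILLIS = 60_000; // safety margin, assumed value

    // posts with tas_id > last_tas_id are (re-)added
    static List<LogEntry> toAdd(List<LogEntry> posts, long lastTasId) {
        List<LogEntry> result = new ArrayList<>();
        for (LogEntry p : posts) {
            if (p.tasId > lastTasId) result.add(p);
        }
        return result;
    }

    // logged posts with change_date >= last_change_date - epsilon are deleted first
    static List<LogEntry> toDelete(List<LogEntry> logged, long lastChangeDate) {
        List<LogEntry> result = new ArrayList<>();
        for (LogEntry p : logged) {
            if (p.changeDate >= lastChangeDate - EPSILON_MILLIS) result.add(p);
        }
        return result;
    }
}
```

Deleting before re-adding is what makes tas updates safe: a changed post is first removed by the change_date rule and then re-added by the tas_id rule.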

Updating the Indexes

Every index has a Manager, which is responsible for updating the respective Lucene index.

Remarks about the individual modules

bibsonomy-database

in org.bibsonomy.database.manager.PostDatabaseManager:

#!java

public List<Post<R>> getPostsByResourceSearch(final String userName, final String requestedUserName, final String requestedGroupName, final Collection<String> allowedGroups, final String searchTerms, final String titleSearchTerms, final String authorSearchTerms, final Collection<String> tagIndex, final String year, final String firstYear, final String lastYear, final int limit, final int offset) {
        if (present(this.resourceSearch)) {
            return this.resourceSearch.getPosts(userName, requestedUserName, requestedGroupName, allowedGroups, searchTerms, titleSearchTerms, authorSearchTerms, tagIndex, year, firstYear, lastYear, limit, offset);
        }

        log.error("no resource searcher is set");   
        return new LinkedList<Post<R>>();
    }

see org.bibsonomy.database.managers.chain.resource.get.GetResourcesByResourceSearch and its implementations for the particular Resource classes.

bibsonomy-webapp

scheduler - cron beans

In webapp/WEB-INF/bibsonomy2-servlet-lucene.xml, cron beans are defined which regularly update the individual Lucene indices. The indices are updated daily.
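A daily-update cron bean of this kind might look like the following sketch. The bean ids, the job reference, and the cron expression are illustrative assumptions, not the actual contents of bibsonomy2-servlet-lucene.xml.

```xml
<!-- Illustrative sketch of a daily Lucene update trigger (Spring/Quartz style).
     Bean ids, the job reference, and the cron expression are assumptions. -->
<bean id="luceneUpdateTrigger" class="org.springframework.scheduling.quartz.CronTriggerBean">
  <property name="jobDetail" ref="luceneUpdateJob"/>
  <!-- run the index update once per day at 03:00 -->
  <property name="cronExpression" value="0 0 3 * * ?"/>
</bean>
```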

Configuration

For the bibsonomy_lucene database connection, see context.xml.

Overridable values from project.properties

#!properties
# update the lucene index?
lucene.enableUpdater = true
# base path to all indices
lucene.index.basePath = <path>
# possible values: 
#    - decimal number
#    - LIMITED (sets lucene's default value)
#    - UNLIMITED
lucene.index.maxFieldLength = 5000
# should lucene search for tags?
lucene.tagCloud.enabled = true
# the limit of the tag cloud when searching the index
lucene.tagCloud.limit = 1000
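The maxFieldLength values above could be mapped to numbers along these lines. This is a sketch: the class and method names are assumptions, but 10000 and Integer.MAX_VALUE are the historical defaults of Lucene's IndexWriter.MaxFieldLength LIMITED/UNLIMITED constants.

```java
// Sketch: map the lucene.index.maxFieldLength property to a numeric limit.
// Class and method names are illustrative, not the actual BibSonomy code.
public class MaxFieldLengthParser {

    // Lucene's classic IndexWriter.MaxFieldLength defaults
    static final int LIMITED = 10000;
    static final int UNLIMITED = Integer.MAX_VALUE;

    static int parse(String value) {
        if ("LIMITED".equals(value)) {
            return LIMITED;   // Lucene's default value
        }
        if ("UNLIMITED".equals(value)) {
            return UNLIMITED; // no limit on indexed terms per field
        }
        return Integer.parseInt(value); // plain decimal number, e.g. 5000
    }
}
```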

Generating the Lucene Index from the Database

There are two ways to do this:

via the admin web interface

Set the path correctly:

Sensibly override the standard configuration of lucene.index.basePath (see Configuration).

Then simply push the respective button under /admin/lucene as admin and wait.

manually

(much more work)

create properties

copy lucene-test.properties from the bibsonomy-lucene module and adjust it (basePath and the database connection to be used).

build what's required

In the internal-tools repository, there is org.bibsonomy.lucene.util.LuceneIndexGenerator. Update the project dependencies to the latest BibSonomy version. With mvn install, an -executable.jar will be built.

create index

Via the command line, a new index is created as follows:

#!bash
java -cp .:target/bibsonomy-tools-internal-${project.version}-executable.jar org.bibsonomy.lucene.util.LuceneIndexGenerator <# THREADS> <RESOURCENAME>*

Here, . is the current directory, which should contain a lucene.properties file with the settings. An example file resides in the bibsonomy-internal-tools repository: src/config/luncene_commandline/lucene.properties

The * means that several resource types can be given.

Valid commandline arguments:

number of threads

  • the number of threads to be used for building the index; of course, this should not be larger than the number of available processors (for example, importing 2.6 million publications on 8 processors takes about 30 minutes)
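Clamping the requested thread count to the number of available processors, as recommended above, could look like this (a minimal sketch with made-up names):

```java
// Sketch: clamp a requested thread count to the available processors.
public class ThreadCountSketch {
    static int clamp(int requested) {
        int cpus = Runtime.getRuntime().availableProcessors();
        // never fewer than 1, never more than the number of processors
        return Math.max(1, Math.min(requested, cpus));
    }
}
```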

Resourcename

  • publication

  • bookmark

  • goldstandardbookmark

  • goldstandardpublication

ToDo

outdated?

  • 62 FIXMEs in .java and .xml
  • 4) remove deprecated queries
  • 6) implement multiple deletion by multi-term search
  • 8) deal with group-tas!

Index

  • a custom Analyzer has been implemented - it has to be tested for usability, and stop words may need to be configured

    • URL tokenizer for searching URL fragments
  • fields for the fulltext search are configurable

Search results

outdated?

  • In the search results it is possible to delete one's own posts. This causes trouble, because the index is not updated in real time (see the error report from the function tests).
  • Solution: delete the post via Ajax from the list and output a hint ('The search index will only be updated within the next ... minutes').

  • the order of results can be switched between date and relevance
  • calculate the tag cloud via Lucene
  • "and n other people" is not up-to-date for a non-optimized index (maybe due to term frequencies which are not recalculated)

Function tests

see also (old) Test Protocol: Testmatrix

Index Update

  • Target: All changes in the database should make it to the Lucene-Index
  • Background: The Lucene index is maintained independently of the posting process: changes are loaded regularly
  • Potential errors:

    • Post is deleted - remains in the Index

      • Problem: Among the found posts are one's own posts. If one of these is deleted, it remains in the search result (the deletion is therefore not recognized immediately). If the delete button is pushed again, an error message appears: Internal Server Error: java.lang.IllegalStateException: The resource(s) with ID(s) [ae31361437753e69073a8f7454d6a894] do(es) not exist and could hence not be deleted. After that, the post is gone.
      • Folke's explanation: This problem is caused by the underlying principle: the post is not deleted before the next update of the index (currently on gromit every minute, on production every 10 minutes).
      • Solution: We could leave it as is, or we deactivate the delete button in the result list. '''Ajax'''
    • Post is added but does not occur in the index

      • Problem: the post is added to the index, but is detected as differing from the other posts (after copying)
      • Folke's explanation: In the search result, all posts are shown for which the search criteria match, which means that, if a resource has been tagged by various users, the resource is shown several times.
      • Solution: We could keep it as is or try something like ''GROUP BY hash''. The latter would cause some problems (e.g. which of the posts for some interhash should be shown?)
    • Post is modified, but changes are not taken over (via input fields)

      • Problem: Tried several times, no problems found as long as I looked solely at my own posts. With the general search (not in groups/friends/etc.) the day's changes were only visible when I looked at the details via Edit->Details. In the result view they were still shown incorrectly.
      • Question: Maybe the update interval was not waited for?
    • Posts are assigned other tags, but the changes are not in the index (via quick edit in the resource list)
      • the changes were immediately updated
      • Question: Maybe the update interval was not waited for?
    • Post is written twice to the index
      • Problem: copy on a post, writing new tags => the post appears twice in the index
      • Question: More inquiries are needed - maybe two posts (of different users) of the same resource were shown?
  • Target: All posts, which match the search criteria, are shown, where authorizations are respected properly (such as private posts)
    • OR-disjunction should be possible
      • worked
    • no Lucene-Query-Hacks ('lucene-injection')
  • Background: for a given search term, a query is built and processed on the index
  • Potential Errors:
    • results are missing
    • too many results are shown - this can particularly be caused by the Analyzer in use: in addition to the normalisation of umlauts, stemming is performed
      • Problem: When searching for '''Ciro OR Stumme''', I found BibTeX entries in which Gerd was the editor. Bug or feature? ;)
      • Explanation: In the fulltext search it is a feature, for the author search it is a bug
      • Problem: For the same search, the following publication was returned, which was created by user Stumme:
      • Explanation: In the fulltext search it is a feature, for the author search it is a bug
#!bibtex
@article{lin1973,
  author = {Shen Lin and Brian W. Kernighan and ZZZ AAA},
  editor = {Phillippus Bous and XXX YYY},
  interHash = {64b8a082bb90639c74ed669d6b4f0776},
  intraHash = {c2ea8f5fbe747a1fe688ea91fb53e232},
  journal = {Operations Research},
  pages = {498--516},
  title = {An Effective Heuristic Algorithm for the Travelling-Salesman Problem},
  volume = {21},
  year = {1973}
}

Spam

  • Target: Spam-Posts should not be shown in the search result
  • Background: The Lucene index is built independently of the posting process. When a spammer is flagged, all his posts are deleted from the index with the next index update. When a spammer is unflagged, all his posts are re-added with the next index update.
  • Potential Problems:
    • In a search result, results from spammers can be shown
    • In a search result, posts of a recently unflagged spammer may be missing
      • After a crash, asynchronously flagged spammers get lost and remain in the index
  • Known problem (theoretical):
    • A spammer is unflagged and changes a post (only tags) before the next index-update. The changes will not be in the index.

Performance Tests

author-search

  • author list from the /authors page
  • for every author:
#!bash

#!/bin/bash
AUTHORFILE=authorNames.txt
DATESTRING=`date +%Y%m%d-%H.%M.%S`
TIMINGFILE=$DATESTRING-profileOut.txt

echo "" > $TIMINGFILE
for AUTHOR in `cat $AUTHORFILE`; do
    TIMING=`curl -s -o /dev/null -w "%{http_code} %{time_total} " http://www.biblicious.org/author/$AUTHOR`;
    echo "$AUTHOR   $TIMING" >> $TIMINGFILE;
done;
  • on gromit:
    #!bash
    tail -f bibsonomy2-debug.log | grep "DB author tag cloud query time" > /tmp/db_author_query_time.txt
    
    or, respectively,

#!bash
tail -f bibsonomy2-debug.log | grep "Lucene author tag cloud query time" > /tmp/lucene_author_query_time.txt
  • first query removed (because it was an outlier for the DB query)
  • evaluation:
#!bash
cat /tmp/lucene_author_query_time.txt | awk -F' ' 'BEGIN{a=0; b=0; print "Start\n"}{a+=$12; b++; print $12" >"a}END{print "\n" a"/"b"="a/b"\nENDE"}'
or, respectively,

#!bash
cat /tmp/db_author_query_time.txt | awk -F' ' 'BEGIN{a=0; b=0; print "Start\n"}{a+=$12; b++; print $12" >"a}END{print "\n" a"/"b"="a/b"\nENDE"}'

Result:

  • Lucene: 15.2174ms, 14.7391ms
  • DB: 103.832ms, 98.8571ms

Fulltext-Search

Testing real BibSonomy queries with Biblicious/Database and Biblicious/Lucene

  • parsing queries from logfile
#!bash
 grep 'GET /search' /var/log/dilbert/bibsonomy_access.log.0 | cut -d' ' -f12 | sort -u
  • querying biblicious
#!bash
 for i in `grep 'GET /search' /var/log/dilbert/bibsonomy_access.log.0 | cut -d' ' -f12 | sort -u`; do curl -s -o /dev/null -w "%{http_code} %{time_total} " http://www.biblicious.org"$i"; echo $i; done

 for i in `grep 'GET /search' /var/log/dilbert/bibsonomy_access.log.0 | cut -d' ' -f12 | sort -u`; do curl -s -o /dev/null -w "%{http_code} %{time_total} " http://www.biblicious.org"$i"; echo $i; done > biblicious_database.txt

The logfile then looks like this:

#!logfile

200 0,031 /search/fractures+user%3Ainterlinks
200 0,081 /search/Frank+Kaufmann+ilmenau?bookmark.entriesPerPage=5&bibtex.entriesPerPage=5&lang=de
200 0,013 /search/freshlaptop
200 0,039 /search/freund+workflow
200 0,225 /search/friendfeed

Then, calculate the mean of the query processing time over all successful queries:

#!bash
 grep '^200' biblicious_lucene.txt   | sed -e 's/\,/\./g'  | awk -F' ' 'BEGIN{a=0; b=0; print "Start\n"}{a+=$2; b++; print $2" >"a}END{print "\n" a"/"b"="a/b"\nENDE"}'
 grep '^200' biblicious_database.txt | sed -e 's/\,/\./g'  | awk -F' ' 'BEGIN{a=0; b=0; print "Start\n"}{a+=$2; b++; print $2" >"a}END{print "\n" a"/"b"="a/b"\nENDE"}'

Constructing query

#!bash
 for i in `grep 'GET /search' /var/log/dilbert/bibsonomy_access.log.0 | cut -d' ' -f12 | sort -u`; do curl -s -o /dev/null -w "%{http_code} %{time_total} " http://www.biblicious.org"$i"; echo $i; done | grep '^200' | sed -e 's/\,/\./g'  | awk -F' ' 'BEGIN{a=0; b=0; print "Start\n"}{a+=$2; b++; print $2" >"a}END{print "\n" a"/"b"="a/b"\nENDE"}'

  • tomcat = database - 1. measure
  • tomcat = lucene - 1. measure
  • tomcat = lucene - 2. measure
  • tomcat = database - 2. measure
  • tomcat = database - 3. measure
  • tomcat = database - 4. measure
  • tomcat = database - 5. measure
  • tomcat = lucene - 2. measure repeated
  • tomcat = lucene - 3. measure
  • tomcat = lucene - 4. measure
  • tomcat = lucene - 5. measure
  1. measure (2009-06-24 - bugs - /var/log/dilbert/bibsonomy_access.log.0) Lucene: 13.132/411=0.0319513 Database: 185.572/411=0.451513

  2. measure (2009-06-24 - bugs - /var/log/dilbert/bibsonomy_access.log.1.gz) Lucene: 20.573/716=0.0287332 / repeated measure: 67.826/715=0.0948615 Database: 311.424/716=0.43495

  3. measure (2009-06-24 - bugs - /var/log/dilbert/bibsonomy_access.log.2.gz) Lucene: 6.186/255=0.0242588 Database: 29.333/255=0.115031

  4. measure (2009-06-24 - bugs - /var/log/dilbert/bibsonomy_access.log.4.gz) Lucene: 22.438/182=0.123286 Database: 249.188/182=1.36916

  5. measure (2009-06-24 - bugs - /var/log/dilbert/bibsonomy_access.log.5.gz) Lucene: 44.906/549=0.081796 Database: 222.595/549=0.405455

  6. measure (2009-06-24 - bugs - /var/log/dilbert/bibsonomy_access.log.6.gz) Lucene: 35.917/511=0.0702877 Database: 116.436/511=0.227859

Table

| Dataset | Biblicious | Bibsonomy odie Lucene | Bibsonomy odie Lucene (cache) | Bibsonomy gandalf mySQL | Bibsonomy gandalf mySQL (cache) |
|---|---|---|---|---|---|
| 2009-06-25: /var/log/dilbert/bibsonomy_access.log.0 | 127.484/917=0.139023 | 159.036/863=0.184283 (1) | 32.994/864=0.0381875 | 124.381/917=0.135639 | 118.754/917=0.129503 |
| 2009-06-25: /var/log/dilbert/bibsonomy_access.log.1.gz | 10.289/411=0.0250341 | 12.875/389=0.0330977 | 11.967/389=0.0307635 | 48.19/411=0.117251 | 44.867/411=0.109165 |
| 2009-06-25: /var/log/dilbert/bibsonomy_access.log.2.gz | 55.893/716=0.0780628 | 65.384/659=0.099217 | 18.767/659=0.028478 | 120.443/716=0.168216 | 112.66/716=0.157346 |
| 2009-06-25: /var/log/dilbert/bibsonomy_access.log.3.gz | 6.133/255=0.024051 | 8.147/234=0.0348162 | 5.876/234=0.0251111 | 39.518/255=0.154973 | 41.836/255=0.164063 |
| 2009-06-26: /var/log/dilbert/bibsonomy_access.log.0 | 69.631/2030=0.034301 | 74.585/2030=0.0367414 | 55.61/2030=0.0273941 | 298.689/2030=0.147137 | 284.617/2030=0.140205 |
| 2009-06-26: /var/log/dilbert/bibsonomy_access.log.1.gz | 59.977/917=0.0654057 | 30.286/916=0.0330633 | 22.16/916=0.0241921 | 108.571/917=0.118398 | 110.066/917=0.120028 |
(1) The first 30 requests took 60 seconds in total; after that it went faster
