Lucene

Preface

  • Indices are configured centrally via Spring in the files LuceneIndexConfig.xml, LucenePostFields.xml, LucenePublicationFields.xml and LuceneBookmarkFields.xml. BibSonomy Post objects are automatically converted into Lucene Document objects accordingly (see org.bibsonomy.lucene.util.LuceneResourceConverter).

  • The update mechanism has been reworked such that no posts get lost and tag updates make their way into the index:

    • in the index, the last tas_id and the last change_date of the log_bibtex or log_bookmark table, respectively, are stored.
    • all posts with tas_id > last_tas_id are to be added (and deleted in advance, due to potential tas updates)
    • all posts from the log_* tables with change_date >= last_change_date - epsilon are to be deleted from the index
  • Spam-flagging/unflagging is now thread-safe
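The update rules above can be sketched roughly as follows. This is a minimal sketch: IndexUpdateSketch, LogEntry, and the epsilon value are made-up names and numbers for illustration, not the actual BibSonomy classes.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the incremental update rules described above.
// All class and field names are illustrative, not the real BibSonomy API.
public class IndexUpdateSketch {

    static class LogEntry {
        final long tasId;
        final long changeDate; // epoch millis of the log table's change_date
        LogEntry(long tasId, long changeDate) { this.tasId = tasId; this.changeDate = changeDate; }
    }

    static final long EPSILON_MILLIS = 60_000; // safety margin, assumed value

    // posts with tas_id > last_tas_id are (re-)added
    static List<LogEntry> toAdd(List<LogEntry> posts, long lastTasId) {
        List<LogEntry> result = new ArrayList<>();
        for (LogEntry p : posts) {
            if (p.tasId > lastTasId) result.add(p);
        }
        return result;
    }

    // logged posts with change_date >= last_change_date - epsilon are deleted first
    static List<LogEntry> toDelete(List<LogEntry> logged, long lastChangeDate) {
        List<LogEntry> result = new ArrayList<>();
        for (LogEntry p : logged) {
            if (p.changeDate >= lastChangeDate - EPSILON_MILLIS) result.add(p);
        }
        return result;
    }
}
```

Deleting before re-adding is what makes tas updates safe: a changed post is first removed by the change_date rule and then re-added by the tas_id rule.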

Updating the Indexes

Every index has a Manager, which is responsible for updating the respective Lucene index.

Remarks about the individual modules

bibsonomy-database

in org.bibsonomy.database.manager.PostDatabaseManager:

#!java

public List<Post<R>> getPostsByResourceSearch(final String userName, final String requestedUserName, final String requestedGroupName, final Collection<String> allowedGroups, final String searchTerms, final String titleSearchTerms, final String authorSearchTerms, final Collection<String> tagIndex, final String year, final String firstYear, final String lastYear, final int limit, final int offset) {
        if (present(this.resourceSearch)) {
            return this.resourceSearch.getPosts(userName, requestedUserName, requestedGroupName, allowedGroups, searchTerms, titleSearchTerms, authorSearchTerms, tagIndex, year, firstYear, lastYear, limit, offset);
        }

        log.error("no resource searcher is set");   
        return new LinkedList<Post<R>>();
    }

see org.bibsonomy.database.managers.chain.resource.get.GetResourcesByResourceSearch and its implementations for the particular Resource classes.

bibsonomy-webapp

scheduler - cron beans

In webapp/WEB-INF/bibsonomy2-servlet-lucene.xml, cron beans are defined which regularly update the individual Lucene indices. The indices are updated daily.
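A daily-update cron bean of this kind might look like the following sketch. The bean ids, the job reference, and the cron expression are illustrative assumptions, not the actual contents of bibsonomy2-servlet-lucene.xml.

```xml
<!-- Illustrative sketch of a daily Lucene update trigger (Spring/Quartz style).
     Bean ids, the job reference, and the cron expression are assumptions. -->
<bean id="luceneUpdateTrigger" class="org.springframework.scheduling.quartz.CronTriggerBean">
  <property name="jobDetail" ref="luceneUpdateJob"/>
  <!-- run the index update once per day at 03:00 -->
  <property name="cronExpression" value="0 0 3 * * ?"/>
</bean>
```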

Configuration

For the bibsonomy_lucene database connection, see context.xml.

Overridable values from project.properties

#!properties
# update the lucene index?
lucene.enableUpdater = true
# base path to all indices
lucene.index.basePath = <path>
# possible values: 
#    - decimal number
#    - LIMITED (sets lucene's default value)
#    - UNLIMITED
lucene.index.maxFieldLength = 5000
# should lucene search for tags?
lucene.tagCloud.enabled = true
# the limit of the tag cloud when searching the index
lucene.tagCloud.limit = 1000
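The maxFieldLength values above could be mapped to numbers along these lines. This is a sketch: the class and method names are assumptions, but 10000 and Integer.MAX_VALUE are the historical defaults of Lucene's IndexWriter.MaxFieldLength LIMITED/UNLIMITED constants.

```java
// Sketch: map the lucene.index.maxFieldLength property to a numeric limit.
// Class and method names are illustrative, not the actual BibSonomy code.
public class MaxFieldLengthParser {

    // Lucene's classic IndexWriter.MaxFieldLength defaults
    static final int LIMITED = 10000;
    static final int UNLIMITED = Integer.MAX_VALUE;

    static int parse(String value) {
        if ("LIMITED".equals(value)) {
            return LIMITED;   // Lucene's default value
        }
        if ("UNLIMITED".equals(value)) {
            return UNLIMITED; // no limit on indexed terms per field
        }
        return Integer.parseInt(value); // plain decimal number, e.g. 5000
    }
}
```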

Generating the Lucene Index from the Database

There are two ways to do this:

via the admin web interface

Set the path correctly:

Sensibly override the standard configuration of lucene.index.basePath (see Configuration).

Then simply push the respective button under /admin/lucene as admin and wait.

manually

(much more work)

create properties

copy lucene-test.properties from the bibsonomy-lucene module and adjust it (basePath and the database connection to be used).

build what's required

In the internal-tools repository, there is org.bibsonomy.lucene.util.LuceneIndexGenerator. Update the project dependencies to the latest BibSonomy version. With mvn install, an -executable.jar will be built.

create index

Via the command line, a new index is created as follows:

#!bash
java -cp .:target/bibsonomy-tools-internal-${project.version}-executable.jar org.bibsonomy.lucene.util.LuceneIndexGenerator <# THREADS> <RESOURCENAME>*

Here, . is the current directory, which should contain a lucene.properties file with the settings. An example file resides in the bibsonomy-internal-tools repository: src/config/luncene_commandline/lucene.properties

The * means that several resource types can be given.

Valid commandline arguments:

number of threads

  • the number of threads to be used for building the index; of course, this should not be larger than the number of available processors (for example, importing 2.6 million publications on 8 processors takes about 30 minutes)
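Clamping the requested thread count to the number of available processors, as recommended above, could look like this (a minimal sketch with made-up names):

```java
// Sketch: clamp a requested thread count to the available processors.
public class ThreadCountSketch {
    static int clamp(int requested) {
        int cpus = Runtime.getRuntime().availableProcessors();
        // never fewer than 1, never more than the number of processors
        return Math.max(1, Math.min(requested, cpus));
    }
}
```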

Resourcename

  • publication

  • bookmark

  • goldstandardbookmark

  • goldstandardpublication

ToDo

outdated?

  • 62 FIXMEs in .java and .xml
  • 4) remove deprecated queries
  • 6) implement multiple deletion by multi-term search
  • 8) deal with group-tas!

Index

  • a custom Analyzer has been implemented - it has to be tested for usability, and stop words may need to be configured

    • URL tokenizer for searching URL fragments
  • fields for the fulltext search are configurable

Search results

outdated?

  • In the search results it is possible to delete one's own posts. This causes trouble, because the index is not updated in real time (see the error report from the function tests).
  • Solution: delete the post via Ajax from the list and output a hint ('The search index will only be updated within the next ... minutes').

  • the order of results can be switched between date and relevance
  • calculate the tag cloud via Lucene
  • "and n other people" is not up-to-date for a non-optimized index (maybe due to term frequencies which are not recalculated)

Function tests

see also (old) Test Protocol: Testmatrix

Index Update

  • Target: All changes in the database should make it to the Lucene-Index
  • Background: The Lucene index is maintained independently of the posting process: changes are loaded regularly
  • Potential errors:

    • Post is deleted - remains in the Index

      • Problem: Among the found posts are one's own posts. If one of these is deleted, it remains in the search result (the deletion is therefore not recognized immediately). If the delete button is pushed again, an error message appears: Internal Server Error: java.lang.IllegalStateException: The resource(s) with ID(s) [ae31361437753e69073a8f7454d6a894] do(es) not exist and could hence not be deleted. After that, the post is gone.
      • Folke's explanation: This problem is caused by the underlying principle: the post is not deleted before the next update of the index (currently on gromit every minute, on production every 10 minutes).
      • Solution: We could leave it as is, or we deactivate the delete button in the result list. '''Ajax'''
    • Post is added but does not occur in the index

      • Problem: the post is added to the index, but is detected as differing from the other posts (after copying)
      • Folke's explanation: In the search result, all posts are shown for which the search criteria match, which means that, if a resource has been tagged by various users, the resource is shown several times.
      • Solution: We could keep it as is or try something like ''GROUP BY hash''. The latter would cause some problems (e.g. which of the posts for some interhash should be shown?)
    • Post is modified, but changes are not taken over (via input fields)

      • Problem: Tried several times, no problems found as long as I looked solely at my own posts. With the general search (not in groups/friends/etc.) the day's changes were only visible when I looked at the details via Edit->Details. In the result view they were still shown incorrectly.
      • Question: Maybe the update interval was not waited for?
    • Posts are assigned other tags, but the changes are not in the index (via quick edit in the resource list)
      • the changes were immediately updated
      • Question: Maybe the update interval was not waited for?
    • Post is written twice to the index
      • Problem: copy on a post, writing new tags => the post appears twice in the index
      • Question: More inquiries are needed - maybe two posts (of different users) of the same resource were shown?
  • Target: All posts, which match the search criteria, are shown, where authorizations are respected properly (such as private posts)
    • OR-disjunction should be possible
      • worked
    • no Lucene-Query-Hacks ('lucene-injection')
  • Background: for a given search term, a query is built and processed on the index
  • Potential Errors:
    • results are missing
    • too many results are shown - this can particularly be caused by the Analyzer in use: in addition to the normalisation of umlauts, stemming is performed
      • Problem: When searching for '''Ciro OR Stumme''', I found BibTeX entries in which Gerd was the editor. Bug or feature? ;)
      • Explanation: In the fulltext search it is a feature, for the author search it is a bug
      • Problem: For the same search, the following publication was returned, which was created by user Stumme:
      • Explanation: In the fulltext search it is a feature, for the author search it is a bug
#!bibtex
@article{lin1973,
  author = {Shen Lin and Brian W. Kernighan and ZZZ AAA},
  editor = {Phillippus Bous and XXX YYY},
  interHash = {64b8a082bb90639c74ed669d6b4f0776},
  intraHash = {c2ea8f5fbe747a1fe688ea91fb53e232},
  journal = {Operations Research},
  pages = {498--516},
  title = {An Effective Heuristic Algorithm for the Travelling-Salesman Problem},
  volume = {21},
  year = {1973}
}

Spam

  • Target: Spam-Posts should not be shown in the search result
  • Background: The Lucene index is built independently of the posting process. When a spammer is flagged, all his posts are deleted from the index with the next index update. When a spammer is unflagged, all his posts are re-added with the next index update.
  • Potential Problems:
    • In a search result, results from spammers can be shown
    • In a search result, posts of a recently unflagged spammer may be missing
      • After a crash, asynchronously flagged spammers get lost and remain in the index
  • Known problem (theoretical):
    • A spammer is unflagged and changes a post (only tags) before the next index-update. The changes will not be in the index.

Performance Tests

author-search

  • author list from the /authors page
  • for every author:
#!bash

#!/bin/bash
AUTHORFILE=authorNames.txt
DATESTRING=`date +%Y%m%d-%H.%M.%S`
TIMINGFILE=$DATESTRING-profileOut.txt

echo "" > $TIMINGFILE
for AUTHOR in `cat $AUTHORFILE`; do
    TIMING=`curl -s -o /dev/null -w "%{http_code} %{time_total} " http://www.biblicious.org/author/$AUTHOR`;
    echo "$AUTHOR   $TIMING" >> $TIMINGFILE;
done;
  • on gromit:
    #!bash
    tail -f bibsonomy2-debug.log | grep "DB author tag cloud query time" > /tmp/db_author_query_time.txt
    
    or, respectively,

#!bash
tail -f bibsonomy2-debug.log | grep "Lucene author tag cloud query time" > /tmp/lucene_author_query_time.txt
  • first query removed (because it was an outlier for the DB query)
  • evaluation:
#!bash
cat /tmp/lucene_author_query_time.txt | awk -F' ' 'BEGIN{a=0; b=0; print "Start\n"}{a+=$12; b++; print $12" >"a}END{print "\n" a"/"b"="a/b"\nENDE"}'
or, respectively,

#!bash
cat /tmp/db_author_query_time.txt | awk -F' ' 'BEGIN{a=0; b=0; print "Start\n"}{a+=$12; b++; print $12" >"a}END{print "\n" a"/"b"="a/b"\nENDE"}'

Result:

  • Lucene: 15.2174ms, 14.7391ms
  • DB: 103.832ms, 98.8571ms

Fulltext-Search

Testing real BibSonomy queries with Biblicious/Database and Biblicious/Lucene

  • parsing queries from logfile
#!bash
 grep 'GET /search' /var/log/dilbert/bibsonomy_access.log.0 | cut -d' ' -f12 | sort -u
  • querying biblicious
#!bash
 for i in `grep 'GET /search' /var/log/dilbert/bibsonomy_access.log.0 | cut -d' ' -f12 | sort -u`; do curl -s -o /dev/null -w "%{http_code} %{time_total} " http://www.biblicious.org"$i"; echo $i; done

 for i in `grep 'GET /search' /var/log/dilbert/bibsonomy_access.log.0 | cut -d' ' -f12 | sort -u`; do curl -s -o /dev/null -w "%{http_code} %{time_total} " http://www.biblicious.org"$i"; echo $i; done > biblicious_database.txt

The logfile then looks like this:

#!logfile

200 0,031 /search/fractures+user%3Ainterlinks
200 0,081 /search/Frank+Kaufmann+ilmenau?bookmark.entriesPerPage=5&bibtex.entriesPerPage=5&lang=de
200 0,013 /search/freshlaptop
200 0,039 /search/freund+workflow
200 0,225 /search/friendfeed

Then, calculate the mean of the query processing time over all successful queries:

#!bash
 grep '^200' biblicious_lucene.txt   | sed -e 's/\,/\./g'  | awk -F' ' 'BEGIN{a=0; b=0; print "Start\n"}{a+=$2; b++; print $2" >"a}END{print "\n" a"/"b"="a/b"\nENDE"}'
 grep '^200' biblicious_database.txt | sed -e 's/\,/\./g'  | awk -F' ' 'BEGIN{a=0; b=0; print "Start\n"}{a+=$2; b++; print $2" >"a}END{print "\n" a"/"b"="a/b"\nENDE"}'

Constructing query

#!bash
 for i in `grep 'GET /search' /var/log/dilbert/bibsonomy_access.log.0 | cut -d' ' -f12 | sort -u`; do curl -s -o /dev/null -w "%{http_code} %{time_total} " http://www.biblicious.org"$i"; echo $i; done | grep '^200' | sed -e 's/\,/\./g'  | awk -F' ' 'BEGIN{a=0; b=0; print "Start\n"}{a+=$2; b++; print $2" >"a}END{print "\n" a"/"b"="a/b"\nENDE"}'

  • tomcat = database - 1. measure
  • tomcat = lucene - 1. measure
  • tomcat = lucene - 2. measure
  • tomcat = database - 2. measure
  • tomcat = database - 3. measure
  • tomcat = database - 4. measure
  • tomcat = database - 5. measure
  • tomcat = lucene - 2. measure repeated
  • tomcat = lucene - 3. measure
  • tomcat = lucene - 4. measure
  • tomcat = lucene - 5. measure
  1. measure (2009-06-24 - bugs - /var/log/dilbert/bibsonomy_access.log.0) Lucene: 13.132/411=0.0319513 Database: 185.572/411=0.451513

  2. measure (2009-06-24 - bugs - /var/log/dilbert/bibsonomy_access.log.1.gz) Lucene: 20.573/716=0.0287332 / repeated measure: 67.826/715=0.0948615 Database: 311.424/716=0.43495

  3. measure (2009-06-24 - bugs - /var/log/dilbert/bibsonomy_access.log.2.gz) Lucene: 6.186/255=0.0242588 Database: 29.333/255=0.115031

  4. measure (2009-06-24 - bugs - /var/log/dilbert/bibsonomy_access.log.4.gz) Lucene: 22.438/182=0.123286 Database: 249.188/182=1.36916

  5. measure (2009-06-24 - bugs - /var/log/dilbert/bibsonomy_access.log.5.gz) Lucene: 44.906/549=0.081796 Database: 222.595/549=0.405455

  6. measure (2009-06-24 - bugs - /var/log/dilbert/bibsonomy_access.log.6.gz) Lucene: 35.917/511=0.0702877 Database: 116.436/511=0.227859

Table

| Dataset | Biblicious | Bibsonomy odie Lucene | Bibsonomy odie Lucene (cache) | Bibsonomy gandalf mySQL | Bibsonomy gandalf mySQL (cache) |
|---|---|---|---|---|---|
| 2009-06-25: /var/log/dilbert/bibsonomy_access.log.0 | 127.484/917=0.139023 | 159.036/863=0.184283 (1) | 32.994/864=0.0381875 | 124.381/917=0.135639 | 118.754/917=0.129503 |
| 2009-06-25: /var/log/dilbert/bibsonomy_access.log.1.gz | 10.289/411=0.0250341 | 12.875/389=0.0330977 | 11.967/389=0.0307635 | 48.19/411=0.117251 | 44.867/411=0.109165 |
| 2009-06-25: /var/log/dilbert/bibsonomy_access.log.2.gz | 55.893/716=0.0780628 | 65.384/659=0.099217 | 18.767/659=0.028478 | 120.443/716=0.168216 | 112.66/716=0.157346 |
| 2009-06-25: /var/log/dilbert/bibsonomy_access.log.3.gz | 6.133/255=0.024051 | 8.147/234=0.0348162 | 5.876/234=0.0251111 | 39.518/255=0.154973 | 41.836/255=0.164063 |
| 2009-06-26: /var/log/dilbert/bibsonomy_access.log.0 | 69.631/2030=0.034301 | 74.585/2030=0.0367414 | 55.61/2030=0.0273941 | 298.689/2030=0.147137 | 284.617/2030=0.140205 |
| 2009-06-26: /var/log/dilbert/bibsonomy_access.log.1.gz | 59.977/917=0.0654057 | 30.286/916=0.0330633 | 22.16/916=0.0241921 | 108.571/917=0.118398 | 110.066/917=0.120028 |
(1) The first 30 requests took 60 seconds in total; after that it went faster
