Preface
- Indices are configured centrally via Spring in the files LuceneIndexConfig.xml, LucenePostFields.xml, LucenePublicationFields.xml, and LuceneBookmarkFields.xml. BibSonomy Post objects are automatically converted into the corresponding Lucene Document objects (see org.bibsonomy.lucene.util.LuceneResourceConverter).
- The update mechanism has been reworked so that no posts are lost and tag updates reach the index:
- in the index, the last tas_id and the last change_date of the log_bibtex or log_bookmark table, respectively, are stored.
- all posts with tas_id > last_tas_id are added (and deleted beforehand, due to potential tas updates)
- all posts from the log_* table with change_date >= last_change_date - epsilon are deleted from the index
- Spam flagging/unflagging is now thread-safe
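The update steps above can be sketched in plain Java. All names here (TasRow, LogRow, the in-memory index list) are hypothetical stand-ins for the real tables and the Lucene index, and the epsilon value is an assumed safety margin; the actual logic lives in the bibsonomy-lucene index managers:

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the index update cycle described above (all names hypothetical). */
public class IndexUpdateSketch {

    /** A row of the tas table: (tas_id, content id). */
    record TasRow(long tasId, String contentId) {}

    /** A row of the log_bibtex / log_bookmark table: (content id, change_date). */
    record LogRow(String contentId, long changeDate) {}

    long lastTasId;       // highest tas_id already in the index
    long lastChangeDate;  // latest change_date already seen
    static final long EPSILON = 60_000; // assumed safety margin in ms

    final List<String> index = new ArrayList<>(); // stands in for the Lucene index

    void update(List<TasRow> tas, List<LogRow> log) {
        // 1) delete everything logged since (lastChangeDate - epsilon)
        for (LogRow r : log) {
            if (r.changeDate() >= lastChangeDate - EPSILON) {
                index.remove(r.contentId());
                lastChangeDate = Math.max(lastChangeDate, r.changeDate());
            }
        }
        // 2) (re-)add all posts with tas_id > lastTasId; delete first,
        //    because a tas update may have changed an already-indexed post
        for (TasRow r : tas) {
            if (r.tasId() > lastTasId) {
                index.remove(r.contentId());
                index.add(r.contentId());
            }
            lastTasId = Math.max(lastTasId, r.tasId());
        }
    }
}
```

Deleting before re-adding in step 2 is what makes tag updates safe: a changed post is replaced instead of duplicated.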
Updating the Indexes
Every index has a Manager, which is responsible for updating the respective Lucene index.
Remarks about the individual modules
bibsonomy-database
in org.bibsonomy.database.managers.PostDatabaseManager:
#!java
public List<Post<R>> getPostsByResourceSearch(final String userName, final String requestedUserName,
        final String requestedGroupName, final Collection<String> allowedGroups, final String searchTerms,
        final String titleSearchTerms, final String authorSearchTerms, final Collection<String> tagIndex,
        final String year, final String firstYear, final String lastYear, final int limit, final int offset) {
    if (present(this.resourceSearch)) {
        return this.resourceSearch.getPosts(userName, requestedUserName, requestedGroupName, allowedGroups,
                searchTerms, titleSearchTerms, authorSearchTerms, tagIndex, year, firstYear, lastYear,
                limit, offset);
    }
    log.error("no resource searcher is set");
    return new LinkedList<Post<R>>();
}
see org.bibsonomy.database.managers.chain.resource.get.GetResourcesByResourceSearch and implementations for particular Resource classes.
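In isolation, the guard-with-fallback pattern from the method above looks like this. All names here are hypothetical stand-ins; the real present() check and logger come from BibSonomy's utility classes:

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in illustrating the null-safe delegation in PostDatabaseManager. */
public class SearchDelegate {
    interface ResourceSearch { List<String> getPosts(String terms); }

    private final ResourceSearch resourceSearch; // may be null if no searcher is configured

    SearchDelegate(ResourceSearch rs) { this.resourceSearch = rs; }

    /** Delegates when a searcher is present, otherwise logs and returns an empty list. */
    List<String> getPostsByResourceSearch(String terms) {
        if (resourceSearch != null) {
            return resourceSearch.getPosts(terms);
        }
        System.err.println("no resource searcher is set");
        return Collections.emptyList();
    }
}
```

Returning an empty list instead of null keeps callers (the chain elements) free of null checks even when Lucene is disabled.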
bibsonomy-webapp
scheduler - cron beans
In webapp/WEB-INF/bibsonomy2-servlet-lucene.xml, cron beans are defined which regularly update the individual Lucene indices. The indices are updated daily.
Configuration
bibsonomy_lucene database connection: see context.xml
Overridable values from project.properties
#!properties
# update the lucene index?
lucene.enableUpdater = true
# base path to all indices
lucene.index.basePath = <path>
# possible values:
# - decimal number
# - LIMITED (sets lucene's default value)
# - UNLIMITED
lucene.index.maxFieldLength = 5000
# should lucene search for tags?
lucene.tagCloud.enabled = true
# the limit of the tag cloud when searching the index
lucene.tagCloud.limit = 1000
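The three value forms of lucene.index.maxFieldLength could be resolved as in the following sketch; the constant used for LIMITED (10,000, Lucene's historical default) is an assumption and should be checked against the Lucene version in use:

```java
/** Sketch: resolving the lucene.index.maxFieldLength property to an int. */
public class MaxFieldLength {
    // Lucene's historical default for LIMITED (assumed; verify for your version)
    static final int LIMITED_DEFAULT = 10_000;

    static int parse(String value) {
        String v = value.trim().toUpperCase();
        if (v.equals("UNLIMITED")) return Integer.MAX_VALUE; // index fields without truncation
        if (v.equals("LIMITED"))   return LIMITED_DEFAULT;   // lucene's default value
        return Integer.parseInt(v);                          // explicit decimal number
    }
}
```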
Generating the Lucene Index from the Database
There are two ways to do this:
via the admin web interface
Set the path correctly: sensibly override the standard configuration of lucene.index.basePath (see Configuration).
Then simply push the respective button under /admin/lucene as admin and wait.
manually
(much more work)
create properties
copy lucene-test.properties from the bibsonomy-lucene module and adapt it (basePath and the database connection to be used).
build what's required
In the internal-tools repository, there is org.bibsonomy.lucene.util.LuceneIndexGenerator. Update the project dependencies to the latest BibSonomy version. Running mvn install builds an -executable.jar.
create index
Via the command line, a new index is created as follows:
#!bash
java -cp .:target/bibsonomy-tools-internal-${project.version}-executable.jar org.bibsonomy.lucene.util.LuceneIndexGenerator <# THREADS> <RESOURCENAME>*
Here, . is the current directory, which should contain a lucene.properties file with the settings. An example file resides in the bibsonomy-internal-tools repository: src/config/luncene_commandline/lucene.properties. The * means that several resource types can be given.
Valid commandline arguments:
number of threads
- the number of threads used for building the index; of course, this should not be larger than the number of available processors (for example, importing 2.6 million publications on 8 processors takes about 30 minutes)
Resourcename
- publication
- bookmark
- goldstandardbookmark
- goldstandardpublication
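A small helper sketch for the thread-count advice above (hypothetical; the real generator simply takes the number as an argument):

```java
/** Sketch: capping the requested index-builder thread count at the CPU count. */
public class ThreadCount {
    static int effectiveThreads(int requested) {
        int cpus = Runtime.getRuntime().availableProcessors();
        // never more threads than processors, and at least one
        return Math.max(1, Math.min(requested, cpus));
    }
}
```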
ToDo
outdated?
- 62 FIXMEs in .java and .xml files
#!text
4) remove deprecated queries
6) implement multiple deletion by multi-term-search
8) deal with group-tas!
Index
- own Analyzer implemented; it still has to be tested for usability, and stop words potentially need to be configured
- URL tokenizer for searching URL fragments
- fields for fulltext search are configurable
Search results
outdated?
- In the search results it is possible to delete one's own posts. This causes trouble, because the index is not updated in real time (see the error report from the function tests).
- Solution: delete the post from the list via Ajax and output a hint ('The search index will only be updated within the next ... minutes').
- Order of results can be switched between date and relevance
- calculate the tag cloud via Lucene
- "and n other people" is not up to date for a non-optimized index (maybe due to term frequencies which are not recalculated)
Function tests
see also the (old) test protocol: Testmatrix
Index Update
- Target: all changes in the database should make it to the Lucene index
- Background: the Lucene index is maintained independently of the posting process: changes are loaded regularly
- Potential errors:
- Post is deleted but remains in the index
- Problem: there are own posts among the posts which are found. If one of these is deleted, the post remains in the search result (the deletion is therefore not recognized immediately). If the delete button is pushed again, an error message appears: Internal Server Error: java.lang.IllegalStateException: The resource(s) with ID(s) [ae31361437753e69073a8f7454d6a894] do(es) not exist and could hence not be deleted. After that, the post is gone.
- Folke's explanation: this problem is caused by the underlying principle: the post is not deleted from the index before the next update (currently every minute on gromit, every 10 minutes in production).
- Solution: we could leave it as is, or we deactivate the delete button in the result list. '''Ajax'''
- Post is added but does not occur in the index
- Problem: the post is added to the index, but is detected as differing from the other posts (after copy)
- Folke's explanation: in the search result, all posts are shown for which the search criteria match, which means that if a resource has been tagged by various users, the resource is shown several times.
- Solution: we could keep it as is or try something like ''GROUP BY hash''. The latter would cause some problems (e.g. which of the posts for a given interhash should be shown).
- Post is modified, but the changes are not taken over (via input fields)
- Problem: tried several times, no problems found as long as I looked solely at my own posts. With the general search (not in groups/friends/etc.) the day's changes were only visible when I looked at the details via Edit->Details. The result view still showed the old version.
- Question: maybe the update interval was not waited for?
- Posts are associated with other tags, but the changes are not in the index (via quick-edit in the resource list)
- the changes were updated immediately
- Question: maybe the update interval was not waited for?
- Post written twice to the index
- Problem: copy of a post, writing new tags => post appears twice in the index
- Question: more inquiries are needed; maybe two posts (of different users) of the same resource were shown?
Search
- Target: all posts which match the search criteria are shown, and authorizations (such as for private posts) are respected properly
- OR disjunction should be possible
- worked
- no Lucene query hacks ('lucene injection')
- Background: for a given search term, a query is built and processed on the index
- Potential Errors:
- results are missing
- too many results are shown; this can particularly be caused by the Analyzer in use. Besides the normalisation of umlauts, stemming is performed.
- Problem: when searching for '''Ciro OR Stumme''', I found BibTeX entries in which Gerd was the editor. Bug or feature? ;)
- Explanation: in the fulltext search it is a feature, for the author search it is a bug
- Problem: for the same search, the following publication was returned, which was created by user Stumme:
- Explanation: in the fulltext search it is a feature, for the author search it is a bug
#!bibtex
@article{lin1973,
  author = {Shen Lin and Brian W. Kernighan and ZZZ AAA},
  editor = {Phillippus Bous and XXX YYY},
  interHash = {64b8a082bb90639c74ed669d6b4f0776},
  intraHash = {c2ea8f5fbe747a1fe688ea91fb53e232},
  journal = {Operations Research},
  pages = {498--516},
  title = {An Effective Heuristic Algorithm for the Travelling-Salesman Problem},
  volume = {21},
  year = {1973}
}
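The umlaut normalisation mentioned under "Potential Errors" above can be sketched as a plain string mapping; the real analyzer's exact rules (including stemming) may differ, and this is not the actual BibSonomy analyzer:

```java
/** Sketch: maps German umlauts to their ASCII transcriptions, as an
 *  indexing analyzer might do before stemming. */
public class UmlautNormalizer {
    static String normalize(String term) {
        return term.toLowerCase()
                .replace("ä", "ae")
                .replace("ö", "oe")
                .replace("ü", "ue")
                .replace("ß", "ss");
    }
}
```

With such a mapping, a query for "Mueller" also finds documents containing "Müller", because both sides are normalized to the same token.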
Author-Search
Fulltext-Search
Spam
- Target: spam posts should not be shown in the search result
- Background: the Lucene index is built independently of the posting process. When a spammer is flagged, all his posts are deleted from the index with the next index update. When a spammer is unflagged, all his posts are re-added with the next index update.
- Potential Problems:
- In a search result, results from spammers can be shown
- In a search result, posts of a recently unflagged spammer may be missing
- After a crash, asynchronously flagged spammers get lost and stay in the index
- Known problem (theoretical):
- A spammer is unflagged and changes a post (only tags) before the next index-update. The changes will not be in the index.
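The thread-safe flagging mentioned in the preface can be sketched as follows: flag/unflag requests are collected concurrently and applied in one batch at the next index update. Because the pending sets live only in memory, a crash loses them, which is exactly the known problem described above. All names are hypothetical:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of thread-safe spam flagging (hypothetical names). */
public class SpamFlagQueue {
    /** Users whose posts must be deleted from / re-added to the index. */
    record Batch(Set<String> toDelete, Set<String> toReadd) {}

    private final Set<String> flagged = ConcurrentHashMap.newKeySet();
    private final Set<String> unflagged = ConcurrentHashMap.newKeySet();

    void flagSpammer(String user)   { unflagged.remove(user); flagged.add(user); }
    void unflagSpammer(String user) { flagged.remove(user); unflagged.add(user); }

    /** Called by the single updater thread at each index update. */
    synchronized Batch drain() {
        Batch b = new Batch(Set.copyOf(flagged), Set.copyOf(unflagged));
        flagged.clear();
        unflagged.clear();
        return b;
    }
}
```

Removing a user from the opposite set on each call ensures that a quick flag-then-unflag sequence results in a single consistent action at the next update.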
Performance Tests
author-search
- author list from the /authors page
- for every author:
#!bash
#!/bin/bash
AUTHORFILE=authorNames.txt
DATESTRING=`date +%Y%m%d-%H.%M.%S`
TIMINGFILE=$DATESTRING-profileOut.txt
echo "" > $TIMINGFILE
for AUTHOR in `cat $AUTHORFILE`; do
  # note: the original script always queried /author/stumme; using $AUTHOR here
  TIMING=`curl -s -o /dev/null -w "%{http_code} %{time_total} " http://www.biblicious.org/author/$AUTHOR`
  echo "$AUTHOR $TIMING" >> $TIMINGFILE
done
- on gromit:
#!bash
tail -f bibsonomy2-debug.log | grep "DB author tag cloud query time" > /tmp/db_author_query_time.txt
or, respectively,
#!bash
tail -f bibsonomy2-debug.log | grep "Lucene author tag cloud query time" > /tmp/lucene_author_query_time.txt
#!bash cat /tmp/lucene_author_query_time.txt | awk -F' ' 'BEGIN{a=0; b=0; print "Start\n"}{a+=$12; b++; print $12" >"a}END{print "\n" a"/"b"="a/b"\nENDE"}'
#!bash cat /tmp/db_author_query_time.txt | awk -F' ' 'BEGIN{a=0; b=0; print "Start\n"}{a+=$12; b++; print $12" >"a}END{print "\n" a"/"b"="a/b"\nENDE"}'
Result:
- Lucene: 15.2174 ms, 14.7391 ms
- DB: 103.832 ms, 98.8571 ms
Fulltext-Search
Testing Real BibSonomy Queries with Biblicious/Database and Biblicious/Lucene
- parsing queries from logfile
#!bash grep 'GET /search' /var/log/dilbert/bibsonomy_access.log.0 | cut -d' ' -f12 | sort -u
- querying biblicious
#!bash
for i in `grep 'GET /search' /var/log/dilbert/bibsonomy_access.log.0 | cut -d' ' -f12 | sort -u`; do
  curl -s -o /dev/null -w "%{http_code} %{time_total} " http://www.biblicious.org"$i"; echo $i;
done

for i in `grep 'GET /search' /var/log/dilbert/bibsonomy_access.log.0 | cut -d' ' -f12 | sort -u`; do
  curl -s -o /dev/null -w "%{http_code} %{time_total} " http://www.biblicious.org"$i"; echo $i;
done > biblicious_database.txt
The logfile then looks like this:
#!text
200 0,031 /search/fractures+user%3Ainterlinks
200 0,081 /search/Frank+Kaufmann+ilmenau?bookmark.entriesPerPage=5&bibtex.entriesPerPage=5&lang=de
200 0,013 /search/freshlaptop
200 0,039 /search/freund+workflow
200 0,225 /search/friendfeed
Then, calculate the mean query processing time over all successful queries:
#!bash
grep '^200' biblicious_lucene.txt | sed -e 's/\,/\./g' | awk -F' ' 'BEGIN{a=0; b=0; print "Start\n"}{a+=$2; b++; print $2" >"a}END{print "\n" a"/"b"="a/b"\nENDE"}'
grep '^200' biblicious_database.txt | sed -e 's/\,/\./g' | awk -F' ' 'BEGIN{a=0; b=0; print "Start\n"}{a+=$2; b++; print $2" >"a}END{print "\n" a"/"b"="a/b"\nENDE"}'
Constructing the whole pipeline as one query:
#!bash for i in `grep 'GET /search' /var/log/dilbert/bibsonomy_access.log.0 | cut -d' ' -f12 | sort -u`; do curl -s -o /dev/null -w "%{http_code} %{time_total} " http://www.biblicious.org"$i"; echo $i; done | grep '^200' | sed -e 's/\,/\./g' | awk -F' ' 'BEGIN{a=0; b=0; print "Start\n"}{a+=$2; b++; print $2" >"a}END{print "\n" a"/"b"="a/b"\nENDE"}'
- tomcat = database - 1st measurement
- tomcat = lucene - 1st measurement
- tomcat = lucene - 2nd measurement
- tomcat = database - 2nd measurement
- tomcat = database - 3rd measurement
- tomcat = database - 4th measurement
- tomcat = database - 5th measurement
- tomcat = lucene - 2nd measurement repeated
- tomcat = lucene - 3rd measurement
- tomcat = lucene - 4th measurement
- tomcat = lucene - 5th measurement
- measurement (2009-06-24 - bugs - /var/log/dilbert/bibsonomy_access.log.0): Lucene: 13.132/411 = 0.0319513, Database: 185.572/411 = 0.451513
- measurement (2009-06-24 - bugs - /var/log/dilbert/bibsonomy_access.log.1.gz): Lucene: 20.573/716 = 0.0287332 (repeated measurement: 67.826/715 = 0.0948615), Database: 311.424/716 = 0.43495
- measurement (2009-06-24 - bugs - /var/log/dilbert/bibsonomy_access.log.2.gz): Lucene: 6.186/255 = 0.0242588, Database: 29.333/255 = 0.115031
- measurement (2009-06-24 - bugs - /var/log/dilbert/bibsonomy_access.log.4.gz): Lucene: 22.438/182 = 0.123286, Database: 249.188/182 = 1.36916
- measurement (2009-06-24 - bugs - /var/log/dilbert/bibsonomy_access.log.5.gz): Lucene: 44.906/549 = 0.081796, Database: 222.595/549 = 0.405455
- measurement (2009-06-24 - bugs - /var/log/dilbert/bibsonomy_access.log.6.gz): Lucene: 35.917/511 = 0.0702877, Database: 116.436/511 = 0.227859
Table
Dataset | Biblicious | BibSonomy odie Lucene | BibSonomy odie Lucene (cache) | BibSonomy gandalf mySQL | BibSonomy gandalf mySQL (cache) |
---|---|---|---|---|---|
2009-06-25: /var/log/dilbert/bibsonomy_access.log.0 | 127.484/917=0.139023 | 159.036/863=0.184283 (1) | 32.994/864=0.0381875 | 124.381/917=0.135639 | 118.754/917=0.129503 |
2009-06-25: /var/log/dilbert/bibsonomy_access.log.1.gz | 10.289/411=0.0250341 | 12.875/389=0.0330977 | 11.967/389=0.0307635 | 48.19/411=0.117251 | 44.867/411=0.109165 |
2009-06-25: /var/log/dilbert/bibsonomy_access.log.2.gz | 55.893/716=0.0780628 | 65.384/659=0.099217 | 18.767/659=0.028478 | 120.443/716=0.168216 | 112.66/716=0.157346 |
2009-06-25: /var/log/dilbert/bibsonomy_access.log.3.gz | 6.133/255=0.024051 | 8.147/234=0.0348162 | 5.876/234=0.0251111 | 39.518/255=0.154973 | 41.836/255=0.164063 |
2009-06-26: /var/log/dilbert/bibsonomy_access.log.0 | 69.631/2030=0.034301 | 74.585/2030=0.0367414 | 55.61/2030=0.0273941 | 298.689/2030=0.147137 | 284.617/2030=0.140205 |
2009-06-26: /var/log/dilbert/bibsonomy_access.log.1.gz | 59.977/917=0.0654057 | 30.286/916=0.0330633 | 22.16/916=0.0241921 | 108.571/917=0.118398 | 110.066/917=0.120028 |
(1) The first 30 requests together took 60 seconds; after that it got faster