Devpi performance issues

Issue #280 on hold
Stephan Erb created an issue

We are currently investigating performance issues in devpi. I will use this ticket as a scratch area for ideas and findings.

Problem Description

Even at a moderate level of 10 concurrent requests, performance drops significantly.

boom  http://127.0.0.1:3141/user/index/+simple/package  -c 10 -n 1000
Server Software: waitress
Running GET http://127.0.0.1:3141/user/index/+simple/package
Running 1000 times per 10 workers.
[================================================================>.] 99% Done

-------- Results --------
Successful calls        1000
Total time              108.2573 s
Average                 1.0795 s
Fastest                 0.7146 s
Slowest                 1.2584 s
Amplitude               0.5438 s
RPS                     9
BSI                     Hahahaha

-------- Status codes --------
Code 200                1000 times.

-------- Legend --------
RPS: Request Per Second
BSI: Boom Speed Index

To give some perspective: Our current data dir is 45GB and our sqlite db about 3GB. The listed package has about 255 release files:

A (single, non-concurrent) request of a simple page:

2015-10-30 14:20:48,094 INFO  [req2] GET /user/index/+simple/package
         204212 function calls (153819 primitive calls) in 0.111 seconds

   Ordered by: cumulative time
   List reduced from 219 to 40 due to restriction <40>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.110    0.110 tweens.py:18(excview_tween)
        1    0.000    0.000    0.110    0.110 router.py:66(handle_request)
        1    0.000    0.000    0.110    0.110 views.py:384(viewresult_to_response)
        1    0.000    0.000    0.110    0.110 views.py:471(_class_requestonly_view)
        1    0.003    0.003    0.110    0.110 views.py:366(simple_list_project)
        1    0.001    0.001    0.061    0.061 model.py:328(get_releaselinks)
       29    0.000    0.000    0.047    0.002 model.py:366(op_sro_check_pypi_whitelist)
      364    0.001    0.000    0.047    0.000 keyfs.py:597(get)
       28    0.001    0.000    0.043    0.002 model.py:564(get_releaselinks_perstage)
33085/915    0.021    0.000    0.038    0.000 keyfs.py:771(copy_if_mutable)
      209    0.000    0.000    0.037    0.000 model.py:271(get_linkstore_perstage)
      209    0.000    0.000    0.037    0.000 model.py:692(__init__)
      209    0.000    0.000    0.037    0.000 model.py:561(get_versiondata_perstage)
      364    0.001    0.000    0.033    0.000 keyfs.py:692(get)
26408/9986    0.007    0.000    0.032    0.000 keyfs.py:776(<genexpr>)
      299    0.001    0.000    0.023    0.000 keyfs.py:457(get_value_at)
        1    0.000    0.000    0.020    0.020 _xmlgen.py:67(unicode)
   1285/1    0.001    0.000    0.020    0.020 _xmlgen.py:113(visit)
    515/1    0.003    0.000    0.020    0.020 _xmlgen.py:143(Tag)
        2    0.000    0.000    0.020    0.010 _xmlgen.py:137(list)
       89    0.000    0.000    0.020    0.000 model.py:110(getstage)
       70    0.000    0.000    0.019    0.000 model.py:386(_sro)
      515    0.004    0.000    0.014    0.000 _xmlgen.py:169(attributes)
       89    0.000    0.000    0.014    0.000 model.py:245(getstage)
        1    0.000    0.000    0.013    0.013 metadata.py:119(sorted_sameproject_links)
        1    0.000    0.000    0.013    0.013 {sorted}
        2    0.000    0.000    0.013    0.006 model.py:345(get_projectname)
       12    0.000    0.000    0.013    0.001 model.py:362(op_sro)
       89    0.000    0.000    0.012    0.000 model.py:202(get)
      256    0.000    0.000    0.012    0.000 metadata.py:120(<genexpr>)
      255    0.001    0.000    0.011    0.000 metadata.py:101(__init__)
      299    0.001    0.000    0.010    0.000 keyfs.py:143(db_read_typedkey)
    50939    0.009    0.000    0.009    0.000 {isinstance}
      299    0.007    0.000    0.007    0.000 {method 'execute' of 'sqlite3.Cursor' objects}
     9525    0.004    0.000    0.007    0.000 _xmlgen.py:220(repr_attribute)
      178    0.000    0.000    0.006    0.000 model.py:88(get_user)
      255    0.002    0.000    0.006    0.000 pkg_resources.py:1955(parse_version)
      178    0.000    0.000    0.005    0.000 keyfs.py:617(exists)
        1    0.000    0.000    0.005    0.005 model.py:351(has_pypi_base)
      178    0.000    0.000    0.005    0.000 keyfs.py:706(exists)

Proposition

CPython is (due to the GIL) effectively single-threaded for CPU-bound work. We can therefore try to optimize single-thread performance, and the concurrent behavior should improve accordingly.

Findings

A short dump of issues / potential improvements that I am aware of right now.

  • copy_if_mutable: there are two places with unnecessary data copies: one in TypedKey.get (the transaction has already made a copy), and one when copying sets. Sets can only contain immutable (hashable) data, so a deep copy is unnecessary there.
  • the expensive sorted_sameproject_links is unnecessary when serving simple pages to pip. We should probably implement a 'fast path' for requests coming from pip/setuptools/easy_install.
  • The HTML rendering in simple_list_project is incredibly slow due to the way html.html(...).unicode() works. We should probably use the same approach as in simple_list_all to generate the HTML.
  • in simple_list_project the project name is computed twice (once via self.context.projectname and once via stage.get_projectname(name)). We should compute it only once. Instead of the if, we can catch the exception thrown by self.context.projectname when the resolution does not work.
  • the keyfs has a cache for the changelog entries but not for the KV table. Each request therefore (unnecessarily) hits the database.
  • keyfs.exists_typed_state should re-use the already opened db connection of the transaction instead of opening another one.
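To illustrate the first point, here is a minimal sketch of what a cheaper copy function could look like. This is not the actual keyfs implementation; names and structure are simplified. The key observation is that sets need no recursion:

```python
def copy_if_mutable(val):
    """Recursively copy mutable containers so that transaction-local
    modifications cannot leak into shared/cached state."""
    if isinstance(val, dict):
        return {k: copy_if_mutable(v) for k, v in val.items()}
    if isinstance(val, list):
        return [copy_if_mutable(x) for x in val]
    if isinstance(val, set):
        # set elements must be hashable and are therefore immutable;
        # a shallow copy is sufficient, no recursion needed
        return set(val)
    return val
```

A shallow `set(val)` copy avoids walking every element, which matters given the tens of thousands of copy_if_mutable calls visible in the profile above.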

Partial Validation

I did a quick check where I just dropped the additional copy_if_mutable call and got rid of sorted_sameproject_links. Performance already improved by about 20%, so I think we are on the right track.

boom  http://127.0.0.1:3141/user/index/+simple/package  -c 10 -n 1000
Server Software: waitress
Running GET http://127.0.0.1:3141/user/index/+simple/package
Running 1000 times per 10 workers.
[================================================================>.] 99% Done

-------- Results --------
Successful calls        1000
Total time              84.9964 s
Average                 0.8477 s
Fastest                 0.6314 s
Slowest                 0.9756 s
Amplitude               0.3442 s
RPS                     11
BSI                     Hahahaha

-------- Status codes --------
Code 200                1000 times.

-------- Legend --------
RPS: Request Per Second
BSI: Boom Speed Index

Comments (16)

  1. Holger Krekel repo owner

    Thanks @StephanErb for the analysis. Could you tell me how you got the profiling results? I'd like to repeat them.

    As to optimizing, I'd first like to look at higher-level optimizations that cut away whole computation trees, e.g. caching rendered pages at the simple/NAME view function level. This could easily give a 50x improvement. Even for 500 pages with 50K of cached data each, we would use only 25 MB of RAM. The difficulty, however, is how to invalidate the cache intelligently.
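    One way to approach the invalidation problem could be to key cache entries on the keyfs serial at render time, since devpi already tracks a monotonically increasing serial per transaction. The following is only an illustrative sketch; the class and method names are hypothetical:

```python
class SimplePageCache:
    """Hypothetical cache for rendered simple/NAME pages, keyed by the
    keyfs serial the page was rendered at.  A different serial means
    the database may have changed, so the entry is treated as stale."""

    def __init__(self):
        self._pages = {}  # project name -> (serial, rendered html)

    def get(self, project, current_serial):
        entry = self._pages.get(project)
        if entry is not None and entry[0] == current_serial:
            return entry[1]
        return None  # cache miss or stale entry

    def put(self, project, serial, html):
        self._pages[project] = (serial, html)
```

    Invalidating on any serial change is coarse (a write to any project evicts every page); a refinement would track the last serial that actually touched each project.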

    I am also fine with going for small-scale improvements like the 20% one you mentioned, as long as they don't introduce more complexity.

  2. Stephan Erb reporter

    Holger, I applied a small patch to tween_keyfs_transaction in order to profile an individual request:

    diff -r 8d0f91a79d01 server/devpi_server/views.py
    --- a/server/devpi_server/views.py  Thu Oct 29 11:53:05 2015 +0100
    +++ b/server/devpi_server/views.py  Mon Nov 02 10:45:52 2015 +0100
    @@ -178,6 +178,10 @@
         keyfs = registry["xom"].keyfs
         is_replica = registry["xom"].is_replica()
         def request_tx_handler(request):
    +       import cProfile, pstats, StringIO
    +       pr = cProfile.Profile()
    +       pr.enable()
    +       
             write  = is_mutating_http_method(request.method) and not is_replica
             with keyfs.transaction(write=write) as tx:
                 threadlog.debug("in-transaction %s", tx.at_serial)
    @@ -185,6 +189,14 @@
             serial = tx.commit_serial if tx.commit_serial is not None \
                                       else tx.at_serial
             set_header_devpi_serial(response.headers, serial)
    +       
    +       pr.disable()
    +       s = StringIO.StringIO()
    +       sortby = 'cumulative'
    +       ps = pstats.Stats(pr, stream=s).strip_dirs().sort_stats(sortby)
    +       ps.print_stats(40)
    +       print s.getvalue()
    +       
             return response
         return request_tx_handler
    

    For the profiling of the single request, I then issued:

    curl http://127.0.0.1:3141/user/index/+simple/package
    

    I did the latter twice in order to warm up all internal caches before doing the actual measurement.
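    For reference, the same cProfile/pstats workflow can be reproduced standalone (on Python 3, io.StringIO replaces the StringIO module used in the patch above; the profiled function here is just a stand-in for the real request handler):

```python
import cProfile
import io
import pstats

def handle_request():
    # stand-in for the actual devpi request handler
    return sum(i * i for i in range(10000))

# profile a single call
pr = cProfile.Profile()
pr.enable()
handle_request()
pr.disable()

# format the stats sorted by cumulative time, top 40 entries
stream = io.StringIO()
stats = pstats.Stats(pr, stream=stream).strip_dirs().sort_stats("cumulative")
stats.print_stats(40)
print(stream.getvalue())
```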

    Your idea to cache `simple/NAME` pages sounds nice.

  3. Stephan Erb reporter

    Performance with both branches applied:

    boom  http://127.0.0.1:3141/user/index/+simple/package/  -c 10 -n 1000 --header "User-Agent:pip/x.x"
    Server Software: waitress
    Running GET http://127.0.0.1:3141/user/index/+simple/package/
        User-Agent: pip
    Running 1000 times per 10 workers.
    [================================================================>.] 99% Done
    
    -------- Results --------
    Successful calls        1000
    Total time              54.7839 s
    Average                 0.5465 s
    Fastest                 0.3413 s
    Slowest                 0.6214 s
    Amplitude               0.2801 s
    RPS                     18
    BSI                     Hahahaha
    
    -------- Status codes --------
    Code 200                1000 times.
    
    -------- Legend --------
    RPS: Request Per Second
    BSI: Boom Speed Index
    

    That is about a factor of two faster than before.

    Here is the profiling of the code with both pull requests applied:

    82904 function calls (55145 primitive calls) in 0.048 seconds
    
       Ordered by: cumulative time
       List reduced from 168 to 40 due to restriction <40>
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.000    0.000    0.047    0.047 tweens.py:18(excview_tween)
            1    0.000    0.000    0.047    0.047 router.py:66(handle_request)
            1    0.000    0.000    0.047    0.047 views.py:384(viewresult_to_response)
            1    0.000    0.000    0.047    0.047 views.py:471(_class_requestonly_view)
            1    0.000    0.000    0.047    0.047 views.py:365(simple_list_project)
            1    0.000    0.000    0.041    0.041 model.py:328(get_releaselinks)
           29    0.000    0.000    0.041    0.001 model.py:364(op_sro_check_pypi_whitelist)
           28    0.001    0.000    0.037    0.001 model.py:562(get_releaselinks_perstage)
          287    0.001    0.000    0.033    0.000 keyfs.py:597(get)
          287    0.001    0.000    0.032    0.000 keyfs.py:692(get)
          209    0.000    0.000    0.031    0.000 model.py:271(get_linkstore_perstage)
          209    0.000    0.000    0.031    0.000 model.py:690(__init__)
          209    0.000    0.000    0.031    0.000 model.py:559(get_versiondata_perstage)
    19207/503    0.013    0.000    0.023    0.000 keyfs.py:771(copy_if_mutable)
          270    0.001    0.000    0.021    0.000 keyfs.py:457(get_value_at)
    15332/6280    0.004    0.000    0.020    0.000 keyfs.py:776(<genexpr>)
           45    0.000    0.000    0.008    0.000 model.py:110(getstage)
           35    0.000    0.000    0.008    0.000 model.py:384(_sro)
          270    0.001    0.000    0.007    0.000 keyfs.py:143(db_read_typedkey)
          6/3    0.000    0.000    0.006    0.002 decorator.py:34(__get__)
           45    0.000    0.000    0.006    0.000 model.py:245(getstage)
        29230    0.006    0.000    0.006    0.000 {isinstance}
          270    0.005    0.000    0.005    0.000 {method 'execute' of 'sqlite3.Cursor' objects}
            1    0.000    0.000    0.005    0.005 view_auth.py:126(projectname)
            1    0.000    0.000    0.005    0.005 model.py:343(get_projectname)
            6    0.000    0.000    0.005    0.001 model.py:360(op_sro)
           45    0.000    0.000    0.004    0.000 model.py:202(get)
          209    0.001    0.000    0.004    0.000 model.py:751(get_links)
          209    0.000    0.000    0.003    0.000 model.py:499(key_projversion)
           90    0.000    0.000    0.003    0.000 model.py:88(get_user)
          345    0.001    0.000    0.002    0.000 model.py:635(__init__)
          417    0.001    0.000    0.002    0.000 keyfs.py:549(__call__)
           90    0.000    0.000    0.002    0.000 keyfs.py:617(exists)
           90    0.000    0.000    0.002    0.000 keyfs.py:706(exists)
          277    0.000    0.000    0.002    0.000 validation.py:18(normalize_name)
           14    0.000    0.000    0.002    0.000 keyfs.py:676(exists_typed_state)
           28    0.000    0.000    0.001    0.000 model.py:556(list_versions_perstage)
           45    0.000    0.000    0.001    0.000 model.py:411(__init__)
          277    0.000    0.000    0.001    0.000 validation.py:11(safe_name)
         2244    0.001    0.000    0.001    0.000 {method 'items' of 'dict' objects}
    
  4. Stephan Erb reporter

    Until devpi has first-class support for simple-page caching, we have deployed a work-around in our nginx. The inconsistency window of 60 seconds is unfortunate but much better than the installation timeouts we are seeing all over the place :-)

    In case anyone is interested, here is the configuration snippet:

        location ~ /\+simple/ {
            expires              60s;
            proxy_pass           http://localhost:3141;
            proxy_set_header     X-outside-url $scheme://$host:$server_port;
            proxy_set_header     X-Real-IP $remote_addr;
            proxy_http_version   1.1;
            proxy_cache          devpi_replica;
            proxy_cache_valid    60s;
            proxy_cache_lock     on;
        }
    
    proxy_cache_path  /var/lib/nginx/devpi-replica-cache  levels=1:2    keys_zone=devpi_replica:10m    max_size=1g;
    
  5. Stephan Erb reporter

    There have been some recent additions on master. Here is a test with the most recent master (7a6c529). Compared to where we started (see the beginning of the ticket), this is quite an improvement:

    boom  http://127.0.0.1:3141/platform/user/index/+simple/package/  -c 10 -n 1000 --header "User-Agent:pip/x.x"
    Server Software: waitress
    Running GET http://127.0.0.1:3141/user/index/+simple/package/
        User-Agent: pip/1.1
    Running 1000 times per 10 workers.
    [================================================================>.] 99% Done
    
    -------- Results --------
    Successful calls        1000
    Total time              46.9955 s
    Average                 0.4690 s
    Fastest                 0.3116 s
    Slowest                 0.6663 s
    Amplitude               0.3547 s
    RPS                     21
    BSI                     Hahahaha
    
    -------- Status codes --------
    Code 200                1000 times.
    
    -------- Legend --------
    RPS: Request Per Second
    
  6. Stephan Erb reporter

    So in total, we have done the following things to speed things up on our end. Maybe this is also helpful to others.
