pilerpurge.py run very slow on CentOS7 and mariadb5.5.6

Issue #1096 closed
Top_ created an issue

Hi

I found the problem pilerpurge.py python script run very slow. (5600 emails, take time about 1 hour)
I try to search the old piler Issues but no found solution

So I try to print the status of each step on pilerpurge.py and found the slow responses.
we found slow on step query v_attachment table on purge_attachments_by_attachment_id function and purge_attachments_by_piler_id function as below.

+-----------------------------------------------------------------+
| def purge_attachments_by_attachment_id(opts={}):
|    format = ", ".join(['%s'] * len(opts['referenced_attachments']))
|
|    cursor = opts['db'].cursor()
|
|    cursor.execute("SELECT i, piler_id, attachment_id, refcount FROM " +
|                  "v_attachment WHERE i IN (%s)" %
|                   (format), opts['referenced_attachments'])
+-----------------------------------------------------------------+

+----------------------------------------------------------------+
|def purge_attachments_by_piler_id(ids=[], opts={}):
|    format = ", ".join(['%s'] * len(ids))
|
|    cursor = opts['db'].cursor()
|
|    cursor.execute("SELECT i, piler_id, attachment_id, refcount FROM " +
|                   "v_attachment WHERE piler_id IN (%s)" % (format), ids)
+------------------------------------------------------------------+

So I try to query v_attachment table on DB software (MariaDB Version 5.5.60)

=======================================================

MariaDB [piler]> select * from v_attachment where i = '21';
+----+--------------------------------------+---------------+------+----------+
| i  | piler_id                             | attachment_id | ptr  | refcount |
+----+--------------------------------------+---------------+------+----------+
| 21 | 400000005adea55a225aedf400d8777bf7fc |             2 |    0 |      361 |
+----+--------------------------------------+---------------+------+----------+
1 row in set (5 min 39.96 sec)

MariaDB [piler]> select * from v_attachment where piler_id = '400000005adea55a225aedf400d8777bf7fc';       
+----+--------------------------------------+---------------+------+----------+
| i  | piler_id                             | attachment_id | ptr  | refcount |
+----+--------------------------------------+---------------+------+----------+
| 20 | 400000005adea55a225aedf400d8777bf7fc |             1 |    0 |       31 |
| 21 | 400000005adea55a225aedf400d8777bf7fc |             2 |    0 |      361 |
| 22 | 400000005adea55a225aedf400d8777bf7fc |             3 |    0 |      360 |
| 23 | 400000005adea55a225aedf400d8777bf7fc |             4 |    0 |       31 |
| 24 | 400000005adea55a225aedf400d8777bf7fc |             5 |    0 |       31 |
| 25 | 400000005adea55a225aedf400d8777bf7fc |             6 |    0 |       31 |
| 26 | 400000005adea55a225aedf400d8777bf7fc |             7 |    0 |       49 |
+----+--------------------------------------+---------------+------+----------+
7 rows in set (4 min 57.83 sec)

=======================================================

the query time is about 4-6 min per time. (when I run pilerpurge.py, It query v_attachment 2 times and runs about 10 min to deleted email 1000 record)

As I check. the v_attachment is the VIEW table, I guess this view table has fully generate new data every time to call it. when the data is a big volume, it will take time to create data.

So I try to direct query by SQL code by bringing the code from v_attachment table. Please refer query code below.

===============================================

MariaDB [piler]> select id as i, piler_id, attachment_id, (select count(0) from attachment where (ptr = i )) AS refcount from attachment WHERE id IN ('21');
+----+--------------------------------------+---------------+----------+
| i  | piler_id                             | attachment_id | refcount |
+----+--------------------------------------+---------------+----------+
| 21 | 400000005adea55a225aedf400d8777bf7fc |             2 |      361 |
+----+--------------------------------------+---------------+----------+
1 row in set (0.01 sec)

MariaDB [piler]> SELECT i, piler_id, attachment_id, refcount FROM (select id as i, piler_id, attachment_id, ptr, (select count(0) from attachment where (ptr = i )) AS refcount from attachment) AS v_attachment WHERE piler_id IN ('400000005adea55a225aedf400d8777bf7fc');
+----+--------------------------------------+---------------+----------+
| i  | piler_id                             | attachment_id | refcount |
+----+--------------------------------------+---------------+----------+
| 20 | 400000005adea55a225aedf400d8777bf7fc |             1 |       31 |
| 21 | 400000005adea55a225aedf400d8777bf7fc |             2 |      361 |
| 22 | 400000005adea55a225aedf400d8777bf7fc |             3 |      360 |
| 23 | 400000005adea55a225aedf400d8777bf7fc |             4 |       31 |
| 24 | 400000005adea55a225aedf400d8777bf7fc |             5 |       31 |
| 25 | 400000005adea55a225aedf400d8777bf7fc |             6 |       31 |
| 26 | 400000005adea55a225aedf400d8777bf7fc |             7 |       49 |
+----+--------------------------------------+---------------+----------+
7 rows in set (0.01 sec)

===============================================
The query time is very short.

So I try to modify command on piler as below and run pilerpurge.py by dry-run option. It is very fast more than a query from v_attachment and I recheck some file name from the result on DB, it in target data that I set the retained data to remove email.

But I did not sure that can I direct query instead view table (v_attachment)? and my modified code is may have some hidden problem?
Can you support checking the code below? or can you suggest another solution for us?

+-----------------------------------------------------------------+
|def purge_attachments_by_attachment_id(opts={}):
|    format = ", ".join(['%s'] * len(opts['referenced_attachments']))
|
|    cursor = opts['db'].cursor()
|    print "query database attached file"
|    cursor.execute("select id as i, piler_id, attachment_id, (select count(0) from attachment where (ptr = i )) AS refcount from attachment " + 
|                   " WHERE id IN (%s)" % (format), opts['referenced_attachments'])
+-----------------------------------------------------------------+

+-----------------------------------------------------------------+
|def purge_attachments_by_piler_id(ids=[], opts={}):
|    format = ", ".join(['%s'] * len(ids))
|
|    cursor = opts['db'].cursor()
|        cursor.execute("SELECT i, piler_id, attachment_id, refcount FROM " +
|                       "(select id as i, piler_id, attachment_id, ptr, (select count(0) from attachment where (ptr = i )) AS refcount from attachment) AS v_attachment " +
|                       " WHERE piler_id IN (%s)" % (format), ids)
+-----------------------------------------------------------------+

Comments (6)

  1. Janos SUTO repo owner

    How many rows do you have in the attachment table? Also show me the following queries:

    explain select * from v_attachment where i = '21';
    explain select * from v_attachment where piler_id = '400000005adea55a225aedf400d8777bf7fc';
    

  2. Top_ reporter

    As the number row on attachment table

    MariaDB [piler]> select count(1) from attachment;
    +----------+
    | count(1) |
    +----------+
    | 26966961 |
    +----------+
    1 row in set (7.15 sec)

    ==============================

    and number row on v_attachment table

    MariaDB [piler]> select count(1) from v_attachment;
    +----------+
    | count(1) |
    +----------+
    | 26966967 |
    +----------+
    1 row in set (3 min 42.39 sec)
    ===============================

    and result of explain select * from v_attachment where i = '21';

    MariaDB [piler]> explain select * from v_attachment where i = '21';
    +------+--------------------+------------+------+-----------------+-----------------+---------+------+----------+-------------+
    | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
    +------+--------------------+------------+------+-----------------+-----------------+---------+------+----------+-------------+
    | 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 26952690 | Using where |
    | 2 | DERIVED | attachment | ALL | NULL | NULL | NULL | NULL | 26952690 | |
    | 3 | DEPENDENT SUBQUERY | attachment | ref | attachment_idx3 | attachment_idx3 | 9 | func | 4 |
    Using index |
    +------+--------------------+------------+------+-----------------+-----------------+---------+------+----------+-------------+
    3 rows in set (0.01 sec)
    ================================================
    and result of explain select * from v_attachment where piler_id = '400000005adea55a225aedf400d8777bf7fc';
    ================================================
    MariaDB [piler]> explain select * from v_attachment where piler_id = '400000005adea55a225aedf400d8777bf7fc';
    +------+--------------------+------------+------+-----------------+-----------------+---------+------+----------+-------------+
    | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
    +------+--------------------+------------+------+-----------------+-----------------+---------+------+----------+-------------+
    | 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 26952699 | Using where |
    | 2 | DERIVED | attachment | ALL | NULL | NULL | NULL | NULL | 26952699 | |
    | 3 | DEPENDENT SUBQUERY | attachment | ref | attachment_idx3 | attachment_idx3 | 9 | func | 4 | Using index |
    +------+--------------------+------------+------+-----------------+-----------------+---------+------+----------+-------------+
    3 rows in set (0.00 sec)

    =================================================

  3. Janos SUTO repo owner

    I have a different output for these:

    MariaDB [piler]> explain select * from v_attachment where i = '21';
    +------+-------------+------------+-------+-----------------+-----------------+---------+-------+------+-------------+
    | id   | select_type | table      | type  | possible_keys   | key             | key_len | ref   | rows | Extra       |
    +------+-------------+------------+-------+-----------------+-----------------+---------+-------+------+-------------+
    |    1 | PRIMARY     | attachment | const | PRIMARY         | PRIMARY         | 8       | const |    1 |             |
    |    3 | SUBQUERY    | attachment | ref   | attachment_idx3 | attachment_idx3 | 9       | func  |    1 | Using index |
    +------+-------------+------------+-------+-----------------+-----------------+---------+-------+------+-------------+
    2 rows in set (0.002 sec)
    

    and

    MariaDB [piler]> explain select * from v_attachment where piler_id='400000005f33bf5e2d909aac00cecdf9cfcb';
    +------+--------------------+------------+------+-----------------+-----------------+---------+-------+------+-----------------------+
    | id   | select_type        | table      | type | possible_keys   | key             | key_len | ref   | rows | Extra                 |
    +------+--------------------+------------+------+-----------------+-----------------+---------+-------+------+-----------------------+
    |    1 | PRIMARY            | attachment | ref  | attachment_idx  | attachment_idx  | 144     | const |    1 | Using index condition |
    |    3 | DEPENDENT SUBQUERY | attachment | ref  | attachment_idx3 | attachment_idx3 | 9       | func  |    1 | Using index           |
    +------+--------------------+------------+------+-----------------+-----------------+---------+-------+------+-----------------------+
    

    Also check out the following query and see if you miss any index defined below:

    MariaDB [piler]> show index from attachment;
    +------------+------------+-----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
    | Table      | Non_unique | Key_name        | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
    +------------+------------+-----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
    | attachment |          0 | PRIMARY         |            1 | id          | A         |         602 |     NULL | NULL   |      | BTREE      |         |               |
    | attachment |          1 | attachment_idx  |            1 | piler_id    | A         |         602 |     NULL | NULL   |      | BTREE      |         |               |
    | attachment |          1 | attachment_idx2 |            1 | sig         | A         |         602 |     NULL | NULL   |      | BTREE      |         |               |
    | attachment |          1 | attachment_idx2 |            2 | size        | A         |         602 |     NULL | NULL   | YES  | BTREE      |         |               |
    | attachment |          1 | attachment_idx2 |            3 | ptr         | A         |         602 |     NULL | NULL   | YES  | BTREE      |         |               |
    | attachment |          1 | attachment_idx3 |            1 | ptr         | A         |         602 |     NULL | NULL   | YES  | BTREE      |         |               |
    +------------+------------+-----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
    

  4. Top_ reporter

    Could you please share command you have used to create view “v_attachment”. We want to compare with ours.

  5. Log in to comment