Reduce database size #152

Open
donald opened this issue Dec 31, 2023 · 4 comments

Comments

@donald
Contributor

donald commented Dec 31, 2023

@thomas in #150

But honestly, as long as the actual database design yields something that
only runs with 30+ GB of RAM, this discussion is fairly pointless. Maybe
try to look at the elephant in the room?

There are three tables in mxq: one big one and two trivial ones. Maybe think
about moving the finished jobs into a kind of bag-table and keeping the
active ones accessible? So I propose adding another table, plus some logic.
This will admittedly not yield 'realtime' performance, but it would make the
whole thing more responsive and less memory-hungry **).

Thomas

**) IMHO SQL databases are intended to be run under constrained/limited
conditions (well, there is of course the Hans/Günther way ... *grin*)

@donald
Contributor Author

donald commented Dec 31, 2023

Just FTR: SHOW TABLE STATUS shows 44860082 rows for mxq_job with an average row length of 660 bytes. That would be about 27 GB of raw data. mxq_job.ibd is 35 GB, so that and the 30+ GB of RAM are all in the same order of magnitude. Maybe MySQL takes all the memory it can get for caching and, because it is not limited, ends up caching the whole thing?

I've estimated the fixed row size (just by datatype, ignoring all overhead) at 188 bytes. I've averaged the string sizes of the last 1000 entries of mxq_job, which sum up to 341 characters. That would give 529 bytes, which is close enough to the 660 to be plausible.

job_id                   BIGINT        8
job_status               SMALLINT      2
job_priority             SMALLINT      2
job_cancelled            TINYINT       1
group_id                 BIGINT        8
job_workdir              VARCHAR(4096) 2 + len     66
job_argc                 INT           4
job_argv                 VARCHAR       2 + len    145
job_stdout               VARCHAR       2 + len     28
job_stderr               VARCHAR       2 + len     37
job_umask                INT           4
host_submit              VARCHAR(64)   1 + len     26
server_id                VARCHAR(64)   1 + len      4
host_id                  varchar(128)  1 + len      5
daemon_id                int           4
host_hostname            varchar(64)   1 + len     24
host_pid                 int           4
host_slots               int           4
host_cpu_set             varchar(4096) 2 + len      6
date_submit              timestamp     4
date_start               timestamp     4
date_end                 timestamp     4
stats_max_sumrss         bigint        8
stats_status             int           4
stats_utime_sec          bigint        8
stats_utime_usec         int           4
stats_stime_sec          bigint        8
stats_stime_usec         int           4
stats_real_sec           bigint        8
stats_real_usec          int           4
stats_maxrss             bigint        8
stats_minflt             bigint        8
stats_majflt             bigint        8
stats_nswap              bigint        8
stats_inblock            bigint        8
stats_oublock            bigint        8
stats_nvcsw              bigint        8
stats_nivcsw             bigint        8
                                     -------------
                                     188        341
                                     --> 529
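
Roughly, the string average can be reproduced with a query along these lines (just a sketch, not necessarily exactly how I computed it; rows with a NULL in any of these columns fall out of the average):

-- average byte length of the string columns over the last 1000 jobs
SELECT AVG(LENGTH(job_workdir) + LENGTH(job_argv) + LENGTH(job_stdout)
         + LENGTH(job_stderr) + LENGTH(host_submit) + LENGTH(server_id)
         + LENGTH(host_id) + LENGTH(host_hostname) + LENGTH(host_cpu_set))
       AS avg_string_bytes
  FROM (SELECT * FROM mxq_job ORDER BY job_id DESC LIMIT 1000) AS last_jobs;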

I know it doesn't help much, just wanted to drop the data somewhere :-)

We could reduce the row size. INT instead of BIGINT should be enough for job_id or row_id, and I guess nobody really cares about stats_stime_usec or stats_majflt. I've already recovered a few GB by removing unused columns. But as long as the string data accounts for most of the bytes, this only helps so much.
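
For illustration, such a reduction might look roughly like this (a sketch only: which columns to narrow or drop would need review, and AUTO_INCREMENT handling plus existing references to job_id are ignored here):

-- sketch: narrow job_id and drop two rarely-needed stats columns
-- (do not run as-is; check references and AUTO_INCREMENT first)
ALTER TABLE mxq_job
  MODIFY job_id INT UNSIGNED NOT NULL,
  DROP COLUMN stats_stime_usec,
  DROP COLUMN stats_majflt;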

How exactly could we archive away old jobs? If we just put mxq_job rows into another table (or use partitioning), the data would still be there and would probably be cached whenever it is accessed. If we move the rows into another table or delete them, the counts in mxq_group would no longer match the actual per-job counts. Add another counter (group_jobs_archived)? If we also remove the rows from mxq_group to avoid inconsistencies, the user would lose statistics and the behaviour would change a bit (mxqsub with parameters that were already used creates a new initial group). Do our users care at all, or could we just remove groups and their jobs whose last change is more than a year old without anyone really noticing?
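
A sketch of the counter variant, where group_jobs_archived, mxq_job_archive, and the group_id are made-up names/values just for illustration:

-- one-time schema additions
ALTER TABLE mxq_group ADD COLUMN group_jobs_archived INT UNSIGNED NOT NULL DEFAULT 0;
CREATE TABLE mxq_job_archive LIKE mxq_job;

-- archive the jobs of one old group (ideally inside a transaction)
INSERT INTO mxq_job_archive SELECT * FROM mxq_job WHERE group_id = 12345;
UPDATE mxq_group
   SET group_jobs_archived = group_jobs_archived +
       (SELECT COUNT(*) FROM mxq_job WHERE group_id = 12345)
 WHERE group_id = 12345;
DELETE FROM mxq_job WHERE group_id = 12345;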

@thomas
Contributor

thomas commented Jan 3, 2024

I do not think that reducing the row sizes here and there will help with the issues in the long run. What also becomes immediately apparent when converting the Aria tables to MyISAM is the huge index size of around 3 GB for mxq_job.

My proposal would be to split mxq_job into, say, mxq_job_active and mxq_job_finished, where setting the date_end field would trigger the transfer from the active to the finished table. The automatic assignment of group ids is not affected by this; they are generated from mxq_group.
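
A minimal sketch of that transfer step (assuming the two tables proposed below already exist; in practice this would live in the server code, and @finished_job_id is just a placeholder):

-- move a job whose date_end has just been set
START TRANSACTION;
INSERT INTO mxq_job_finished
  SELECT * FROM mxq_job_active WHERE job_id = @finished_job_id;
DELETE FROM mxq_job_active WHERE job_id = @finished_job_id;
COMMIT;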

From what I can see, it would only affect certain summaries/statistics, and there the accustomed functionality could be achieved with union queries.

To make it a bit more concrete:

-- split the tables
create table mxq_job_active like mxq_job;
insert into mxq_job_active select * from mxq_job where year(date_end) < 1000;

create table mxq_job_finished like mxq_job;
insert into mxq_job_finished select * from mxq_job where year(date_end) > 1000;
(using year() is less confusing than dealing with datetime strings...)

This would make mxq_job_active really small (~17000 records in my older sample set), whereas the bulk table contains around 48 million records. Some indexes could also be dropped from mxq_job_finished: no auto increment, of course, and the questionable index on job_status (since this field only takes a few distinct values, a tree-based index is really, really bad here; it should be hash-based, which would avoid rebalancing the tree on every update and also make the inevitable linear searches faster).
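
For example, something along these lines could strip them (a sketch; the index name on job_status is an assumption, check SHOW INDEX FROM mxq_job_finished first):

-- drop what the bulk table does not need
ALTER TABLE mxq_job_finished
  MODIFY job_id BIGINT NOT NULL,  -- same type as before, just without AUTO_INCREMENT
  DROP INDEX job_status;          -- assumed index name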

-- example query for the job listing from the web interface

-- OLD

  SELECT job_id, job_status, date_submit, date_start, date_end, host_hostname, host_pid,
      timestampdiff(SECOND,date_submit,date_start), timestampdiff(SECOND,date_submit,now()),
      timestampdiff(SECOND,date_start,date_end), timestampdiff(SECOND,date_start,now()),
      UNIX_TIMESTAMP(date_submit) as ds FROM mxq_job,mxq_group
    WHERE mxq_job.group_id=mxq_group.group_id AND mxq_job.group_id='489103'
    ORDER BY job_id DESC;


-- NEW, looks a bit copy'n'pastish...

  (SELECT job_id, job_status, date_submit, date_start, date_end, host_hostname, host_pid,
      timestampdiff(SECOND,date_submit,date_start), timestampdiff(SECOND,date_submit,now()),
      timestampdiff(SECOND,date_start,date_end), timestampdiff(SECOND,date_start,now()),
      UNIX_TIMESTAMP(date_submit) as ds FROM mxq_job_active, mxq_group
    WHERE mxq_job_active.group_id=mxq_group.group_id AND mxq_job_active.group_id='489103')
  union
  (SELECT job_id, job_status, date_submit, date_start, date_end, host_hostname, host_pid,
      timestampdiff(SECOND,date_submit,date_start), timestampdiff(SECOND,date_submit,now()),
      timestampdiff(SECOND,date_start,date_end), timestampdiff(SECOND,date_start,now()),
      UNIX_TIMESTAMP(date_submit) as ds FROM mxq_job_finished, mxq_group
    WHERE mxq_job_finished.group_id=mxq_group.group_id AND mxq_job_finished.group_id='489103')
    ORDER BY job_id DESC;

Table partitioning could also be a way to achieve this. It would make the queries look a bit more elegant, but it would make the database/queries less portable and would still keep unneeded index data on the bulk. So I would prefer the simpler way using unions.
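
Just to show what the partitioning alternative would entail (a sketch; it assumes the primary key is on job_id, partition names and the zero-date handling are assumptions, and it also illustrates why I would prefer the union approach):

-- MySQL/MariaDB require the partitioning column in every unique key,
-- so the primary key would have to be extended first
ALTER TABLE mxq_job DROP PRIMARY KEY, ADD PRIMARY KEY (job_id, date_end);
-- active jobs (zero date_end) end up in p_active, everything else in p_finished
ALTER TABLE mxq_job
  PARTITION BY RANGE (UNIX_TIMESTAMP(date_end)) (
    PARTITION p_active   VALUES LESS THAN (1),
    PARTITION p_finished VALUES LESS THAN MAXVALUE
  );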

@donald
Contributor Author

donald commented Jan 4, 2024

FTR: We discovered that the production mysql-server instance is started with an explicit --innodb_buffer_pool_size=32G, which is the likely cause of the memory usage. A test server with a copy of the live data runs the same database with 3 GB.
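
For reference, the effective value can be checked on a running server (the variable is reported in bytes):

SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
-- or converted to GiB:
SELECT @@innodb_buffer_pool_size / POW(1024, 3) AS buffer_pool_gib;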

@donald
Contributor Author

donald commented Jan 5, 2024

As an experiment, I've restarted the database without the tuning parameters. Let's see if we run into performance or other problems, which I don't expect. VmSize is now 1.9 GB.

/project/mxq/home/bin/mariux-mysqlctl:

### MYSQL_CMD+=" --max_allowed_packet=256M"
### # https://dev.mysql.com/doc/refman/5.6/en/innodb-multiple-buffer-pools.html
### MYSQL_CMD+=" --innodb_buffer_pool_size=32G"
### MYSQL_CMD+=" --innodb_buffer_pool_instances=16"
### # https://www.percona.com/blog/2014/01/28/10-mysql-performance-tuning-settings-after-installation/
### MYSQL_CMD+=" --innodb_log_file_size=4G"
### # https://www.critiqueslibres.com/blog/?p=1603
### MYSQL_CMD+=" --innodb_flush_log_at_trx_commit=0"
### # https://dev.mysql.com/doc/refman/5.5/en/optimizing-innodb-diskio.html
### MYSQL_CMD+=" --innodb_flush_method=O_DIRECT"
### MYSQL_CMD+=" "
