MMFiles Compaction options

The ArangoDB MMFiles storage engine will run a compaction over data files.

ArangoDB writes documents to the WAL file. Once they have been sealed in the WAL file, the collector may copy them into a per-collection journal file.

Once journal files fill up, they’re sealed to become data files.

=> one collection may have documents in the WAL logs, its journal file, and an arbitrary number of data files.

If a collection is loaded, each of these files is opened (and thus uses a file handle) and is mmap’ed. Since file handles and memory-mapped files are scarce resources, the number of files should be kept low.

Once you update or remove documents from data files (or already did so while the file was still the journal file), these documents are marked as ‘dead’ with a deletion marker.

Over time the number of dead documents may rise. Dead documents should not keep occupying the resources mentioned above, and the disk space they use should be given back to the system. Thus several journal files can be combined into one, omitting the dead documents.

Combining several of these data files into one is called compaction. The compaction process reads the live documents from the original data files and writes them into a new data file.

Once that is done, the memory mappings of the old data files are released, and the files are erased.
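The merge step described above can be sketched in a few lines. This is only a toy model, not ArangoDB's implementation: the `Marker` class, the `compact` function and the in-memory lists standing in for data files are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Marker:
    """Hypothetical stand-in for one document stored in a data file."""
    key: str
    payload: str
    dead: bool = False


def compact(data_files: List[List[Marker]]) -> List[Marker]:
    """Merge several data files into one, keeping only the live documents."""
    merged = []
    for data_file in data_files:      # files are passed oldest first
        for marker in data_file:
            if not marker.dead:       # dead documents are omitted
                merged.append(marker)
    return merged


old_files = [
    [Marker("a", "v1", dead=True), Marker("b", "v1")],
    [Marker("a", "v2"), Marker("c", "v1", dead=True)],
]
new_file = compact(old_files)
print([m.key for m in new_file])  # ['b', 'a']
```

In the real engine the old files would be unmapped and erased after this step; the sketch only shows which documents survive.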

Since the compaction locks the collection and also uses I/O resources, it is carefully configurable under which conditions the system should perform how many of these compaction jobs.

ArangoDB spawns one compactor thread per database. The settings below vary in scope.

Activity control

The activity control parameters alter the scan and execution frequency of the compaction.

Sleep interval between two compaction runs (in seconds): --compaction.db-sleep-time

The number of seconds the compactor thread will wait between two attempts to search for compactable data files of collections in one database. If the compactor has actually executed work, a subsequent lookup is done. Scope: database.

Minimum sleep time between two compaction runs (in seconds): --compaction.min-interval

When an actual compaction was executed for one collection, we wait for this time before we execute the compaction on this collection again. This gives user load that may have piled up in the meantime a chance to be worked off. Scope: collection.
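How the two activity settings interact can be sketched as follows. The function names and the sample timestamps are hypothetical. Returning a wait of zero after a run that did work is one reading of "a subsequent lookup is done" and is an assumption, not confirmed behavior.

```python
from typing import Dict, List, Optional


def due_collections(now: float,
                    last_compacted: Dict[str, Optional[float]],
                    min_interval: float) -> List[str]:
    """Collections whose last compaction lies at least min_interval seconds back."""
    due = []
    for name, last in last_compacted.items():
        # None means the collection has not been compacted yet.
        if last is None or now - last >= min_interval:
            due.append(name)
    return due


def next_wait(did_work: bool, db_sleep_time: float) -> float:
    """Assumption: look again right away after real work, otherwise sleep."""
    return 0.0 if did_work else db_sleep_time


# Hypothetical timestamps in seconds.
last_compacted = {"users": 95.0, "orders": 40.0, "logs": None}
print(due_collections(now=100.0, last_compacted=last_compacted, min_interval=10.0))
# ['orders', 'logs']
print(next_wait(did_work=False, db_sleep_time=1.0))  # 1.0
```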

Source data files

These parameters control which data files are taken into account for a compaction run. You can specify several criteria, each of which may be sufficient on its own.

The scan over the data files belonging to one collection is executed from the oldest data file to the newest; if files qualify for a compaction, they may be merged with newer files (containing younger documents).

Scope: Collection level, some are influenced by collection settings.

Minimal file size threshold that original data files have to be below for a compaction: --compaction.min-small-data-file-size

A data file whose total size is below this threshold will always be taken into account for the compaction.

Minimum count of unused documents in a data file: --compaction.dead-documents-threshold

Data files will often contain dead documents. This parameter specifies how many of them are acceptable before the data file qualifies for compaction.

How many bytes of the source data file are allowed to be unused at most: --compaction.dead-size-threshold

The dead data size varies with the size of your documents. If you have many big documents, this threshold may be hit before the document count threshold.

How many percent of the source data file should be unused at least: --compaction.dead-size-percent-threshold

Since the size of the documents may vary, this threshold works on the percentage of the file size that dead documents occupy. Thus, if you have many huge dead documents, this threshold kicks in earlier.

To give an example with numbers: if the data file contains 800 kbytes of live and 400 kbytes of dead documents, the share of the dead documents is:

400 / (400 + 800) = 33 %.

If this value is higher than the specified threshold, the data file will be compacted.
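The four source-file criteria can be combined into one check, sketched below with the numbers from the example. The class and function names are hypothetical, and so are all threshold values. The exact comparison operators and the unit the server expects for the percent threshold are assumptions; here it is expressed in percent to match the calculation above.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file statistics."""
    alive_bytes: int
    dead_bytes: int
    dead_count: int


@dataclass
class Thresholds:
    """Field names mirror the option names; the values used are made up."""
    min_small_data_file_size: int
    dead_documents_threshold: int
    dead_size_threshold: int
    dead_size_percent_threshold: float  # in percent (assumption)


def dead_share_percent(stats: FileStats) -> float:
    """Share of the file occupied by dead documents, in percent."""
    total = stats.alive_bytes + stats.dead_bytes
    return 100.0 * stats.dead_bytes / total if total else 0.0


def qualifies(stats: FileStats, t: Thresholds) -> bool:
    """Any single criterion is sufficient on its own."""
    total = stats.alive_bytes + stats.dead_bytes
    return (total < t.min_small_data_file_size
            or stats.dead_count > t.dead_documents_threshold
            or stats.dead_bytes > t.dead_size_threshold
            or dead_share_percent(stats) > t.dead_size_percent_threshold)


# 800 kbytes alive, 400 kbytes dead, as in the example above.
stats = FileStats(alive_bytes=800 * 1024, dead_bytes=400 * 1024, dead_count=10)
limits = Thresholds(min_small_data_file_size=1024 * 1024,
                    dead_documents_threshold=16384,
                    dead_size_threshold=512 * 1024,
                    dead_size_percent_threshold=10.0)
print(round(dead_share_percent(stats)))  # 33
print(qualifies(stats, limits))          # True
```

With these made-up limits only the percentage criterion fires, which shows that one criterion alone is enough to select the file.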

Compacted target files

Once data files of a collection qualify for a compaction run, these parameters control how many data files are merged into one (a single source data file may also be compacted into one smaller target data file).

Scope: Collection level, some are influenced by collection settings.

Maximum number of files to merge into one file: --compaction.dest-max-files

How many data files (at most) we may merge into one resulting data file during one compaction run.

How large the resulting file may be in comparison to the collection’s database.maximal-journal-size setting: --compaction.dest-max-file-size-factor

In ArangoDB you can configure a default journal file size globally and override it on a per-collection level. This value controls the size of the resulting data files relative to the configured journal file size of the collection in question.

A factor of 3 means that the maximum file size of the compacted file is 3 times the maximum journal file size of the collection.

How large the compaction result file may become: --compaction.dest-max-result-file-size

In addition to the factor above, an absolute maximum file size in bytes may be specified. This will overrule all previous parameters.
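How the three target settings might bound a merge can be sketched as follows. The function names and all sizes are hypothetical. Treating the absolute maximum as a cap on top of the factor-based limit is one reading of "overrule", and the greedy oldest-first selection is an assumption rather than the engine's actual algorithm.

```python
from typing import List

MIB = 1024 * 1024


def max_target_size(journal_size: int, factor: int,
                    max_result_file_size: int) -> int:
    """Factor-based limit, capped by the absolute maximum (assumption)."""
    return min(journal_size * factor, max_result_file_size)


def plan_merge(live_sizes: List[int], dest_max_files: int,
               size_limit: int) -> List[int]:
    """Pick source files, oldest first, until a limit would be exceeded."""
    chosen: List[int] = []
    total = 0
    for index, size in enumerate(live_sizes):
        if len(chosen) >= dest_max_files:
            break
        # The first file is always taken; further ones must fit the limit.
        if chosen and total + size > size_limit:
            break
        chosen.append(index)
        total += size
    return chosen


# Hypothetical values: 32 MiB journal size, factor 3, 128 MiB absolute cap.
limit = max_target_size(32 * MIB, 3, 128 * MIB)
print(limit // MIB)  # 96

# Live bytes per qualifying data file, oldest first.
live_sizes = [40 * MIB, 30 * MIB, 30 * MIB, 10 * MIB]
print(plan_merge(live_sizes, dest_max_files=3, size_limit=limit))  # [0, 1]
```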