File index

When you define file-index.${index_type}.columns, Paimon creates a corresponding index file for each data file. If the index file is small enough, it is stored directly in the manifest; otherwise it is stored in the directory of the data file. Each data file corresponds to one index file, which has its own file format and can contain different types of indexes for multiple columns.

Index File

File index file format. All column names and index offsets are stored in the header.

   _____________________________________    _____________________
  |     magic    | version |head length |
  |-------------------------------------|
  |            column number            |
  |-------------------------------------|
  |   column 1        | index number    |
  |-------------------------------------|
  |  index name 1 |start pos | length   |
  |-------------------------------------|
  |  index name 2 |start pos | length   |
  |-------------------------------------|
  |  index name 3 |start pos | length   |
  |-------------------------------------|            HEAD
  |   column 2        | index number    |
  |-------------------------------------|
  |  index name 1 |start pos | length   |
  |-------------------------------------|
  |  index name 2 |start pos | length   |
  |-------------------------------------|
  |  index name 3 |start pos | length   |
  |-------------------------------------|
  |                 ...                 |
  |-------------------------------------|
  |                 ...                 |
  |-------------------------------------|
  |  redundant length | redundant bytes |
  |-------------------------------------|    ---------------------
  |                BODY                 |
  |                BODY                 |
  |                BODY                 |             BODY
  |                BODY                 |
  |_____________________________________|    _____________________

  magic:            8 bytes long, value is 1493475289347502L, BIG_ENDIAN
  version:          4 bytes int, BIG_ENDIAN
  head length:      4 bytes int, BIG_ENDIAN
  column number:    4 bytes int, BIG_ENDIAN
  column x name:    2 bytes short length (BIG_ENDIAN), followed by the name in Java modified UTF-8
  index number:     4 bytes int (how many index items follow for this column), BIG_ENDIAN
  index name x:     2 bytes short length (BIG_ENDIAN), followed by the name in Java modified UTF-8
  start pos:        4 bytes int, BIG_ENDIAN
  length:           4 bytes int, BIG_ENDIAN
  redundant length: 4 bytes int (reserved for compatibility with later versions; zero in this version)
  redundant bytes:  variable-length bytes (reserved for compatibility with later versions; empty in this version)
  BODY:             column index bytes + column index bytes + column index bytes + ...
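The HEAD section can be read with a handful of sequential big-endian reads. The Python sketch below parses only the header and is an illustration, not Paimon's own reader. It assumes that head length counts every byte before BODY, that start pos is measured from the beginning of the file, and that names contain only characters for which Java modified UTF-8 and standard UTF-8 agree. The column name order_id, the version value 1, and all helper names are hypothetical.

```python
import io
import struct

MAGIC = 1493475289347502  # magic value from the format description


def read_utf(buf):
    # Java writeUTF layout: 2-byte big-endian length, then the encoded bytes.
    # Assumption: no NUL or supplementary characters, so plain UTF-8 decoding works.
    (size,) = struct.unpack(">H", buf.read(2))
    return buf.read(size).decode("utf-8")


def parse_head(data):
    """Return (version, head_length, {column: {index_name: (start_pos, length)}})."""
    buf = io.BytesIO(data)
    magic, version, head_length = struct.unpack(">qii", buf.read(16))
    if magic != MAGIC:
        raise ValueError("unexpected magic number")
    (column_number,) = struct.unpack(">i", buf.read(4))
    columns = {}
    for _ in range(column_number):
        column = read_utf(buf)
        (index_number,) = struct.unpack(">i", buf.read(4))
        indexes = {}
        for _ in range(index_number):
            name = read_utf(buf)
            indexes[name] = struct.unpack(">ii", buf.read(8))  # (start pos, length)
        columns[column] = indexes
    (redundant_length,) = struct.unpack(">i", buf.read(4))
    buf.read(redundant_length)  # reserved for later versions, empty in this one
    return version, head_length, columns


def write_utf(text):
    encoded = text.encode("utf-8")
    return struct.pack(">H", len(encoded)) + encoded


def build_sample(body=b"\x01\x02\x03"):
    """Build a tiny file with one hypothetical column and one index entry."""
    columns = struct.pack(">i", 1) + write_utf("order_id")
    columns += struct.pack(">i", 1) + write_utf("bloom-filter")
    # Assumption: head length covers everything before BODY,
    # and start pos is counted from the beginning of the file.
    head_length = 16 + len(columns) + 8 + 4
    head = struct.pack(">qii", MAGIC, 1, head_length) + columns
    head += struct.pack(">ii", head_length, len(body)) + struct.pack(">i", 0)
    return head + body


sample = build_sample()
version, head_length, columns = parse_head(sample)
start, length = columns["order_id"]["bloom-filter"]
print(columns, sample[start:start + length])
```

Because every index entry carries its own start pos and length, a reader that needs a single index for a single column can parse the header once and then read only that slice of the BODY.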

Column Index Bytes: BloomFilter

Define 'file-index.bloom-filter.columns'.

The content of the bloom filter index is simple:

  • numHashFunctions: 4 bytes int, BIG_ENDIAN
  • bloom filter bytes

This index uses a 64-bit (long) hash. Only the number of hash functions (one integer) and the bit set bytes are stored. Byte-based types (such as varchar and binary) are hashed with xxHash; numeric types are hashed with a dedicated numeric hash function.
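To make the stored layout concrete, here is a minimal Python sketch of a bloom filter serialized as numHashFunctions followed by the bit set bytes. Several parts are assumptions rather than Paimon's actual behavior: BLAKE2b truncated to 64 bits stands in for xxHash (which is not in the Python standard library), the bit positions are derived by double hashing over the two 32-bit halves of the hash, and the bit order inside each byte is chosen arbitrarily. All function names are hypothetical.

```python
import hashlib
import struct


def hash64(value: bytes) -> int:
    # Stand-in for xxHash: any well-mixed 64-bit hash is enough for this sketch.
    return int.from_bytes(hashlib.blake2b(value, digest_size=8).digest(), "big")


def positions(h64, num_hash_functions, num_bits):
    # Double hashing: derive k bit positions from the two 32-bit halves.
    h1, h2 = h64 & 0xFFFFFFFF, h64 >> 32
    return [(h1 + i * h2) % num_bits for i in range(1, num_hash_functions + 1)]


def build(values, num_hash_functions, num_bytes):
    bits = bytearray(num_bytes)
    for value in values:
        for p in positions(hash64(value), num_hash_functions, num_bytes * 8):
            bits[p // 8] |= 1 << (p % 8)
    # Serialized form: numHashFunctions (4 bytes, big-endian) + bit set bytes.
    return struct.pack(">i", num_hash_functions) + bytes(bits)


def might_contain(index_bytes, value):
    (k,) = struct.unpack(">i", index_bytes[:4])
    bits = index_bytes[4:]
    return all(
        (bits[p // 8] >> (p % 8)) & 1
        for p in positions(hash64(value), k, len(bits) * 8)
    )


index_bytes = build([b"apple", b"banana"], num_hash_functions=3, num_bytes=64)
print(might_contain(index_bytes, b"apple"))
```

A bloom filter never reports a stored value as absent, so a False result lets a reader skip the data file for an equality predicate, while a True result only means the file may contain the value.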

Column Index Bytes: Bitmap

Define 'file-index.bitmap.columns'.

Bitmap file index format (V1):

  +-------------------------------------------------+-----------------
  | version (1 byte)                                |
  +-------------------------------------------------+
  | row count (4 bytes int)                         |
  +-------------------------------------------------+
  | non-null value bitmap number (4 bytes int)      |
  +-------------------------------------------------+
  | has null value (1 byte)                         |
  +-------------------------------------------------+
  | null value offset (4 bytes if has null value)   |       HEAD
  +-------------------------------------------------+
  | value 1 | offset 1                              |
  +-------------------------------------------------+
  | value 2 | offset 2                              |
  +-------------------------------------------------+
  | value 3 | offset 3                              |
  +-------------------------------------------------+
  | ...                                             |
  +-------------------------------------------------+-----------------
  | serialized bitmap 1                             |
  +-------------------------------------------------+
  | serialized bitmap 2                             |
  +-------------------------------------------------+       BODY
  | serialized bitmap 3                             |
  +-------------------------------------------------+
  | ...                                             |
  +-------------------------------------------------+-----------------

  value x: variable-length bytes for any data type (used as the bitmap identifier)
  offset:  4 bytes int (when it is negative, the bitmap contains only one value,
           and its position is the inverse of the negative value)

All integers are BIG_ENDIAN.
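The fixed-size fields at the start of the bitmap HEAD, and the negative-offset convention, can be sketched in Python as follows. This is an illustration under stated assumptions, not Paimon's reader: "inverse of the negative value" is read here as -1 - offset (so that row position 0 can be stored as -1), the base that non-negative offsets are measured from is left uninterpreted, and the value x entries are not parsed because their serialization depends on the data type. The function names are hypothetical.

```python
import struct


def parse_fixed_head(data: bytes):
    """Parse version, row count, bitmap number, and the optional null offset."""
    version, row_count, bitmap_number, has_null = struct.unpack(">Bii?", data[:10])
    pos = 10
    null_offset = None
    if has_null:
        # The null value offset is only present when "has null value" is set.
        (null_offset,) = struct.unpack(">i", data[pos:pos + 4])
        pos += 4
    fields = {
        "version": version,
        "row_count": row_count,
        "bitmap_number": bitmap_number,
        "null_offset": null_offset,
    }
    return fields, pos  # pos is where the value/offset entries begin


def locate(offset: int):
    """Interpret one offset field from the HEAD."""
    if offset < 0:
        # Assumption: the single row position is stored as -1 - position.
        return ("single", -1 - offset)
    return ("bitmap", offset)  # offset of a serialized bitmap


raw = struct.pack(">Bii?i", 1, 100, 3, True, -6)
fields, pos = parse_fixed_head(raw)
print(fields, pos, locate(fields["null_offset"]))
```

Encoding a single position directly in the offset field avoids storing a serialized bitmap for values that occur in only one row.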