Parquet Format

Format: Serialization Schema
Format: Deserialization Schema

The Apache Parquet format allows reading and writing Parquet data.

Dependencies

To use the Parquet format, the following dependencies are required both for projects using a build automation tool (such as Maven or SBT) and for the SQL Client with SQL JAR bundles.

Maven dependency:

    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-parquet</artifactId>
      <version>1.16.0</version>
    </dependency>

SQL Client: download the corresponding SQL JAR bundle.

How to create a table with Parquet format

The following example creates a table using the Filesystem connector and the Parquet format.

    CREATE TABLE user_behavior (
      user_id BIGINT,
      item_id BIGINT,
      category_id BIGINT,
      behavior STRING,
      ts TIMESTAMP(3),
      dt STRING
    ) PARTITIONED BY (dt) WITH (
      'connector' = 'filesystem',
      'path' = '/tmp/user_behavior',
      'format' = 'parquet'
    )
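
Once the table is defined, it can be written to and queried like any other table. The statements below are a minimal sketch, assuming batch execution mode (in streaming mode the filesystem sink typically finalizes files only on checkpoints). The row values and the partition value '2022-11-01' are made up for illustration.

```sql
-- Write one illustrative row into the dt = '2022-11-01' partition.
INSERT INTO user_behavior
VALUES (1, 1001, 42, 'pv', TIMESTAMP '2022-11-01 10:15:00.000', '2022-11-01');

-- Read the data back, restricting the scan to a single partition.
SELECT user_id, item_id, behavior, ts
FROM user_behavior
WHERE dt = '2022-11-01';
```

Filtering on the partition column dt lets the filesystem connector skip directories that belong to other partitions.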

Format Options

| Option | Required | Default | Type | Description |
|---|---|---|---|---|
| format | required | (none) | String | Specifies which format to use; here it should be 'parquet'. |
| parquet.utc-timezone | optional | false | Boolean | Whether to use the UTC time zone or the local time zone for the conversion between epoch time and LocalDateTime. Hive 0.x/1.x/2.x use the local time zone, whereas Hive 3.x uses UTC. |

The Parquet format also supports configuration options from ParquetOutputFormat. For example, you can set parquet.compression=GZIP to enable gzip compression.
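
As a sketch of how these options might be combined, the table definition below sets both parquet.utc-timezone and parquet.compression in the WITH clause, next to the format option. The table name user_behavior_gzip and its path are hypothetical, chosen only for this example.

```sql
CREATE TABLE user_behavior_gzip (
  user_id BIGINT,
  behavior STRING,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'filesystem',
  'path' = '/tmp/user_behavior_gzip',  -- hypothetical path
  'format' = 'parquet',
  'parquet.utc-timezone' = 'true',     -- use UTC for epoch time <-> LocalDateTime conversion
  'parquet.compression' = 'GZIP'       -- ParquetOutputFormat setting: gzip-compress written files
);
```

Setting parquet.utc-timezone to true matches the Hive 3.x behavior described above; leaving it at the default of false matches Hive 0.x/1.x/2.x.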

Data Type Mapping

Currently, the Parquet format type mapping is compatible with Apache Hive, but differs from Apache Spark:

  • Timestamp: the timestamp type is mapped to INT96 regardless of its precision.
  • Decimal: the decimal type is mapped to a fixed-length byte array whose length depends on the precision.

The following table lists the mapping from Flink types to Parquet types.

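| Flink Data Type | Parquet type | Parquet logical type |
|---|---|---|
| CHAR / VARCHAR / STRING | BINARY | UTF8 |
| BOOLEAN | BOOLEAN | |
| BINARY / VARBINARY | BINARY | |
| DECIMAL | FIXED_LEN_BYTE_ARRAY | DECIMAL |
| TINYINT | INT32 | INT_8 |
| SMALLINT | INT32 | INT_16 |
| INT | INT32 | |
| BIGINT | INT64 | |
| FLOAT | FLOAT | |
| DOUBLE | DOUBLE | |
| DATE | INT32 | DATE |
| TIME | INT32 | TIME_MILLIS |
| TIMESTAMP | INT96 | |
| ARRAY | - | LIST |
| MAP | - | MAP |
| ROW | - | STRUCT |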
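
To make the mapping concrete, the following sketch declares a table that covers several of the Flink types listed above. The comments give the Parquet type (and logical type, where one applies) that each column would be expected to map to according to the table. The table name, column names, and path are hypothetical.

```sql
CREATE TABLE type_mapping_demo (
  user_name  STRING,                            -- BINARY (UTF8)
  is_active  BOOLEAN,                           -- BOOLEAN
  price      DECIMAL(10, 2),                    -- FIXED_LEN_BYTE_ARRAY (DECIMAL)
  qty        SMALLINT,                          -- INT32 (INT_16)
  total      BIGINT,                            -- INT64
  birthday   DATE,                              -- INT32 (DATE)
  created_at TIMESTAMP(3),                      -- INT96, regardless of precision
  tags       ARRAY<STRING>,                     -- LIST
  attrs      MAP<STRING, INT>,                  -- MAP
  address    ROW<city STRING, postcode STRING>  -- STRUCT
) WITH (
  'connector' = 'filesystem',
  'path' = '/tmp/type_mapping_demo',            -- hypothetical path
  'format' = 'parquet'
);
```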