Production Readiness Checklist
The production readiness checklist provides an overview of configuration options that should be carefully considered before bringing an Apache Flink job into production. While the Flink community has attempted to provide sensible defaults for each configuration, it is important to review this list and ensure the options chosen are sufficient for your needs.
Set An Explicit Max Parallelism
The max parallelism, set on a per-job and per-operator granularity, determines the maximum parallelism to which a stateful operator can scale. There is currently no way to change the maximum parallelism of an operator after a job has started without discarding that operators state. The reason maximum parallelism exists, versus allowing stateful operators to be infinitely scalable, is that it has some impact on your application’s performance and state size. Flink has to maintain specific metadata for its ability to rescale state which grows linearly with max parallelism. In general, you should choose max parallelism that is high enough to fit your future needs in scalability, while keeping it low enough to maintain reasonable performance.
Maximum parallelism must fulfill the following conditions:
0 < parallelism <= max parallelism <= 2^15
You can explicitly set maximum parallelism by using setMaxParallelism(int maxparallelism)
. If no max parallelism is set Flink will decide using a function of the operators parallelism when the job is first started:
128
: for all parallelism <= 128.MIN(nextPowerOfTwo(parallelism + (parallelism / 2)), 2^15)
: for all parallelism > 128.
Set UUIDs For All Operators
As mentioned in the documentation for savepoints, users should set uids for each operator in their DataStream
. Uids are necessary for Flink’s mapping of operator states to operators which, in turn, is essential for savepoints. By default, operator uids are generated by traversing the JobGraph and hashing specific operator properties. While this is comfortable from a user perspective, it is also very fragile, as changes to the JobGraph (e.g., exchanging an operator) results in new UUIDs. To establish a stable mapping, we need stable operator uids provided by the user through setUid(String uid)
.
Choose The Right State Backend
See the description of state backends for choosing the right one for your use case.
Choose The Right Checkpoint Interval
Checkpointing is Flink’s primary fault-tolerance mechanism, wherein a snapshot of your job’s state persisted periodically to some durable location. In the case of failure, Flink will restart from the most recent checkpoint and resume processing. A jobs checkpoint interval configures how often Flink will take these snapshots. While there is no single correct answer on the perfect checkpoint interval, the community can guide what factors to consider when configuring this parameter.
What is the SLA of your service: Checkpoint interval is best understood as an expression of the jobs service level agreement (SLA). In the worst-case scenario, where a job fails one second before the next checkpoint, how much data can you tolerate reprocessing? A checkpoint interval of 5 minutes implies that Flink will never reprocess more than 5 minutes worth of data after a failure.
How often must your service deliver results: Exactly once sinks, such as Kafka or the FileSink, only make results visible on checkpoint completion. Shorter checkpoint intervals make results available more quickly but may also put additional pressure on these systems. It is important to work with stakeholders to find a delivery time that meet product requirements without putting undue load on your sinks.
How much load can your Task Managers sustain: All of Flinks’ built-in state backends support asynchronous checkpointing, meaning the snapshot process will not pause data processing. However, it still does require CPU cycles and network bandwidth from your machines. Incremental checkpointing can be a powerful tool to reduce the cost of any given checkpoint.
And most importantly, test and measure your job. Every Flink application is unique, and the best way to find the appropriate checkpoint interval is to see how yours behaves in practice.
Configure JobManager High Availability
The JobManager serves as a central coordinator for each Flink deployment, being responsible for both scheduling and resource management of the cluster. It is a single point of failure within the cluster, and if it crashes, no new jobs can be submitted, and running applications will fail.
Configuring High Availability, in conjunction with Apache Zookeeper or Flinks Kubernetes based service, allows for a swift recovery and is highly recommended for production setups.