
Apache Kafka on Cloud – AWS

Apache Kafka is a streaming platform engineered for processing and managing events, and it can be tailored closely to an organisation's needs. This makes Kafka a strong choice as an integration platform for decoupling applications and their communication. Many modern enterprises are adopting off-premise managed infrastructure, such as Kafka on the AWS cloud: it is easier to provision, manage and monitor, and it can be scaled up or down as needs fluctuate. This puts an enormous variety of infrastructure at one's disposal.


Deployments in the cloud

There are various vendors available for deploying Kafka on the cloud. Those who prefer a fully mature and time-tested cloud vendor can go with AWS. One of the most attractive features of the cloud is its makeshift capability: one can economically build dummies, prototypes and the like to gain a better understanding before committing.

For starters, there are two ways to proceed. Let us talk about some of the best practices of using Apache Kafka on the AWS cloud.


Self-managed Kafka on AWS

With a self-managed deployment you are responsible for the servers and any software installed on them: design, installation, troubleshooting, maintenance and all. The steps for setting up a self-managed Kafka cluster on AWS are:

  1. Deciding upon a design (EC2 instances, storage (EBS), network, etc.).
  2. Deciding upon the parameters and performance considerations (optimising for throughput, latency, durability and availability).
  3. Provisioning infrastructure on AWS, accommodating availability zones, DR domains, etc.
  4. Installing and configuring in a makeshift/sandbox environment, running miniature loads, and defining SLAs and POT/POCs.
  5. Incorporating the findings into the design and provisioning the infrastructure on AWS.
  6. Downloading the Confluent binaries from confluent.io.
  7. Installing the tarball package on all EC2 instances.
  8. Starting the services and benchmarking performance.
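For the configuration step, a minimal `server.properties` sketch for one broker might look like the following (the broker ID, hostnames and paths are placeholders to be adapted to your design):

```properties
# Unique ID for this broker within the cluster
broker.id=0
# Listener the broker binds to (placeholder hostname)
listeners=PLAINTEXT://broker-0.internal:9092
# Where Kafka stores its log segments (e.g. the attached EBS volume)
log.dirs=/var/lib/kafka/data
# ZooKeeper ensemble (placeholder hostnames)
zookeeper.connect=zk-0.internal:2181,zk-1.internal:2181,zk-2.internal:2181
# Sensible replication defaults for auto-created topics
default.replication.factor=3
min.insync.replicas=2
```

Each broker gets the same file with a unique `broker.id` and its own listener address.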
[Figure: Self-managed Kafka on AWS. Source: confluent.io]




Fully managed Kafka on AWS

With the fully managed offering from Confluent Cloud, an entire Kafka cluster can be spawned easily and hassle-free. Some advantages of Confluent Cloud include:

  • Support for both hybrid and multi-cloud.
  • Infinite storage scaling capability.
  • 120+ connectors available from Confluent Hub, with easy integration into existing enterprise applications.
  • Comfortable scaling of critical apps.
  • Enterprise-grade security and compliance protocols.


Watch the demo video from Confluent Cloud.

One can also opt for Amazon Managed Streaming for Apache Kafka (Amazon MSK). Similar to Confluent Cloud, a fully functional Kafka cluster can be spawned in a few minutes using the Amazon MSK console.

For a detailed review of Amazon MSK, check out this review. Visit the link below for a quick start.

Amazon MSK


Design considerations when deploying to AWS cloud:


Proper planning and foreseeing requirements hugely reduce later analysis and damage-control efforts, and a proper Proof of Concept / Proof of Technology ensures safe future operation and prevents outages. A minimum of a 3-node broker cluster and a 3-node ZooKeeper ensemble is recommended for a minimal Kafka cluster design.

The rule of thumb: a topic replicated across N brokers can sustain N-1 broker failures. So the more brokers (and replicas), the merrier.

For a system handling production load, a 7-node broker cluster and a 5-node ZooKeeper ensemble are recommended.
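The fault-tolerance arithmetic behind these numbers can be sketched in a few lines of Python (a simplified model; the function names are illustrative, not part of any Kafka API):

```python
def broker_failures_tolerated(replication_factor: int) -> int:
    """A topic with replication factor N can lose N - 1 brokers
    before its last replica disappears."""
    return replication_factor - 1

def zk_failures_tolerated(ensemble_size: int) -> int:
    """ZooKeeper needs a strict majority to stay available, so an
    ensemble of N nodes tolerates floor((N - 1) / 2) failures."""
    return (ensemble_size - 1) // 2

# Minimal cluster from the text: 3 brokers, 3 ZooKeeper nodes
print(broker_failures_tolerated(3), zk_failures_tolerated(3))  # 2 1
# Production sizing from the text: 7 brokers, 5 ZooKeeper nodes
print(broker_failures_tolerated(7), zk_failures_tolerated(5))  # 6 2
```

Note why the ZooKeeper ensemble size is odd: a 4-node ensemble tolerates no more failures than a 3-node one, since both lose quorum after the second failure.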

Design considerations when deploying to AWS cloud: Capacity

[Figure: Kafka on AWS best practices. Source: confluent.io]



The speed and latency of a Kafka cluster come from the server type and the attached network. Buying reserved instances helps replace failed instances quickly.

d2.xlarge is recommended if you are using instance local storage, or r4.xlarge if you are attaching EBS.



Communication within the Kafka cluster constitutes the core of Apache Kafka's performance: all replication and inter-broker heartbeats happen over the wire. A reliable, strong network (e.g. 1 GbE or 10 GbE) is therefore recommended.
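To see why the network matters so much, a back-of-the-envelope estimate of per-broker outbound traffic helps (a simplified model I am assuming here: every byte produced is sent to the replication-factor-minus-one followers, and read once by each consumer group):

```python
def broker_egress_mb_per_s(produce_mb_per_s: float,
                           replication_factor: int,
                           consumer_groups: int) -> float:
    """Rough outbound traffic per broker: replication fan-out to the
    followers, plus one copy per consumer group reading the data."""
    return produce_mb_per_s * ((replication_factor - 1) + consumer_groups)

# 100 MB/s produced, replication factor 3, 2 consumer groups
print(broker_egress_mb_per_s(100.0, 3, 2))  # 400.0
```

Even a modest 100 MB/s produce rate turns into roughly 400 MB/s of egress, which already saturates a few gigabits of network capacity.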



The choice of storage depends on factors such as budget, performance and ease of recovery. Appropriate trade-offs between these factors yield an efficient storage solution.

Instance storage is cheaper and fast, but recovering data from a failed instance is bothersome and sometimes impossible. EBS, on the other hand, is more expensive but highly performant, and an EBS volume from a failed instance can be re-attached to a new instance to restore the node's functioning more quickly. st1 EBS volumes are throughput-optimised and have been found to perform better for Kafka's sequential workload.

In general, isolating storage on dedicated disks is recommended; this ensures instance failures do not corrupt the attached storage.
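Whichever storage type is chosen, it helps to size it from the write rate and retention period. The following is a rough sketch under my own simplifying assumptions (uniform partition spread across brokers, a flat 30% headroom for index files and imbalance; real clusters need more careful modelling):

```python
def storage_per_broker_gb(write_mb_per_s: float, retention_hours: float,
                          replication_factor: int, num_brokers: int,
                          headroom: float = 1.3) -> float:
    """Total retained data = write rate * retention * replication factor,
    spread evenly across brokers, with some headroom on top."""
    total_mb = write_mb_per_s * retention_hours * 3600 * replication_factor
    return total_mb * headroom / num_brokers / 1024

# 50 MB/s of writes, 72 h retention, RF 3, on the 7-broker cluster above
gb = storage_per_broker_gb(50, 72, 3, 7)
print(f"{gb:.0f} GB per broker")
```

At these figures each broker needs on the order of 7 TB of log storage, which directly drives the EBS-versus-instance-storage decision.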



The storage and instance computing resources together dictate overall system performance. Opting for a higher level of availability has performance implications, adding latency, and adopting different security standards does too. SSL, for instance, adds processing overhead on the instances and therefore increases latency.



To ensure the cluster withstands instance failures, a multi-node cluster of at least 5 nodes is recommended, together with appropriate monitoring infrastructure and alerting solutions. Kafka inherently attempts to keep the cluster functioning healthily: it automatically elects a new leader from the existing in-sync replicas (ISRs).

It is important to have a minimum replication factor of 3. A higher replication factor greatly enhances fault tolerance, but it also increases network overhead and storage requirements.
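The replication factor interacts with the broker setting `min.insync.replicas` to determine how many broker failures writes can survive. A small sketch (assuming producers use `acks=all`; the function name is illustrative):

```python
def write_failures_tolerated(replication_factor: int, min_isr: int) -> int:
    """With acks=all, a partition keeps accepting writes as long as at
    least min.insync.replicas replicas are alive, so it tolerates
    replication_factor - min_isr broker failures without losing writes."""
    return replication_factor - min_isr

# The recommended RF=3 with min.insync.replicas=2
print(write_failures_tolerated(3, 2))  # 1
```

This is why RF=3 with `min.insync.replicas=2` is a common baseline: one broker can fail without either losing acknowledged writes or blocking producers.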



Making a cluster disaster-proof, so that it survives multiple hardware failures including power failures, network outages, router burnouts and the like, is what constitutes the high availability of a cluster. The degree of availability depends on the criticality of the workload.

It is recommended to spread your instances across multiple availability zones to ensure continuity of operations. To make the system disaster-safe, one needs a disaster recovery (DR) cluster in a different domain; this requires precise calculations accommodating data consistency and cost considerations. Conceptually, the DR cluster sits outside the primary availability domain, and data is continuously replicated from the primary cluster to the DR cluster using standard tools (MirrorMaker or Confluent Replicator).
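As a sketch, a MirrorMaker 2 configuration replicating everything from a primary cluster to a DR cluster might look like this (cluster aliases and bootstrap addresses are placeholders):

```properties
# Cluster aliases
clusters = primary, dr
primary.bootstrap.servers = primary-broker-0:9092
dr.bootstrap.servers = dr-broker-0:9092

# Replicate all topics from primary into dr
primary->dr.enabled = true
primary->dr.topics = .*

# Replication factor for the mirrored topics on the DR side
replication.factor = 3
```

Running `connect-mirror-maker.sh` with such a file keeps the DR cluster continuously in sync, ready for failover.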



Apache Kafka comes with built-in, configurable security protocols; implementing any (or all) of them is driven by the enterprise use case. Each security feature takes a toll on Kafka's streaming performance, so this needs to be analysed thoroughly and the appropriate trade-off understood before implementation. The best way to quantify the cost is to benchmark the cluster with each security layer enabled.
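For example, enabling TLS for client and inter-broker traffic touches a handful of `server.properties` entries (keystore paths and passwords below are placeholders), after which the cluster should be re-benchmarked to measure the added latency:

```properties
listeners=SSL://broker-0.internal:9093
security.inter.broker.protocol=SSL
ssl.keystore.location=/etc/kafka/secrets/broker-0.keystore.jks
ssl.keystore.password=changeit
ssl.key.password=changeit
ssl.truststore.location=/etc/kafka/secrets/truststore.jks
ssl.truststore.password=changeit
# Require clients to present certificates as well (mutual TLS)
ssl.client.auth=required
```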



It is recommended to have a Confluent subscription for enterprise support on any mission-critical Kafka implementation. This safeguards enterprises from potential roadblocks and known/unknown bugs, and helps with quick remediation and restoration of affected components.


Some additional references

Installing on RHEL/CentOS



Recommended install on production systems



Design recommendations while deploying on AWS cloud


