(not a guide for this question; only for how this question is different from all others)
Delivery semantics
- Exactly once, at most once or at least once? Why most systems do at least once?
- Is it a queue or a topic? In a queue, once a message is sent to a consumer, it is deleted or at least not sent to others. In a topic, different consumer groups can receive all messages, and deletion is subject to a retention policy.
Failure modes: when is a message actually “consumed”?
- What if the message was received but the consumer crashed right after?
- What if sending the message to the consumer times out? Was it actually “processed” or not? Maybe the network glitched but the consumer did “ack” it.
- If we retry on failure, is it ok to process other messages to maintain throughput or is it more important to maintain strict ordering?
- SQS makes a message “invisible” while it is being processed by a consumer, but this invisibility times out if the message is not acked (or failed) before a deadline and then it can be retried by a different consumer.
- Common approaches: consumers “ack” messages, and then producer can consider them “done”, and consumer keeps state of consumed messages by some id, so consumption can be idempotent.
Concurrency pattern for scaling consumers
- If strict ordering is required for consumption, then there’s no way to scale reads horizontally, but typically absolute ordering is not required.
- More commonly, there are many queues and partitions within each queue, and only ordering within partition needs to support ordering.
- In order to keep track of which backend instance contains which queue contents, discuss consistent hashing.
Where to store messages?
- High performance requires storing messages in memory, but memory is volatile so replication is needed (high reliability).
- Durability may be a requirement, cost or scale may mean messages don’t fit in memory or are too expensive, so disk may be required.
- The above suggests LSM trees which is exactly what Kafka uses.