Sunday, March 18, 2018

Why there will be no Kafka EventStore in prooph

tl;dr

When even Greg Young, the author of the original EventStore implementation (http://geteventstore.com/), says that it's a bad idea to implement a Kafka EventStore, then it's a bad idea. The prooph team will not provide a Kafka EventStore implementation.

Before we begin, let's see what requirements we need from an event-store:

- Concurrency checks
  When an event with the same version is appended twice to the event store, only the first attempt is allowed to succeed. This is very important: imagine you have multiple processes inserting events into an existing stream. Say we have an existing aggregate with only one event (version 1). Now two processes each insert two events, so we have: event 1, event 2, event 2', event 3, event 3'. Next, another process inserts an event 4. If the consumer of the stream has to decide whether or not an event belongs to the stream, that is now a hard decision, because all of the following event streams are possible: event 1, event 2, event 3, event 4 or event 1, event 2', event 3, event 4 or event 1, event 2, event 3', event 4 or event 1, event 2', event 3', event 4. Additionally, you can't rely on timing (say event 2 was inserted slightly before event 2'), because in a different data center the order could be the other way around. The problem gets worse the further the event stream grows. And this is not a blockchain-like problem, where simply the longest chain wins, because sometimes there will be no more events appended to an existing stream at all.
  I have to add that this concurrency check and version constraint might not be needed for all use-cases; in some applications it might be okay to just record whatever happened, and versions / order don't matter at all (or not that much). But for a general-purpose event-store implementation (where you don't want to put dozens of warnings everywhere), dropping it will only bring problems and a lot of bug reports.
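As an illustration, here is a toy in-memory store with the expected-version check (a minimal Python sketch; the names are illustrative and this is not prooph's actual API):

```python
class ConcurrencyException(Exception):
    pass

class InMemoryEventStore:
    """Toy event store illustrating the optimistic concurrency check."""

    def __init__(self):
        # stream name -> list of (version, event) tuples
        self._streams = {}

    def append(self, stream, expected_version, event):
        events = self._streams.setdefault(stream, [])
        current_version = len(events)
        if expected_version != current_version:
            # A concurrent writer already appended this version: reject.
            raise ConcurrencyException(
                f"expected version {expected_version}, "
                f"stream is at {current_version}"
            )
        events.append((current_version + 1, event))

store = InMemoryEventStore()
store.append("User-1", 0, "UserRegistered")
store.append("User-1", 1, "UserRenamed")
try:
    # A second writer tries to append at the same expected version.
    store.append("User-1", 1, "UserRenamed")
except ConcurrencyException:
    print("rejected")
```

This is exactly the guarantee Kafka cannot give you out of the box: two producers can both "append event 2" and both succeed.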

- One stream per aggregate
  In the original event-store implementation of Greg Young (https://github.com/EventStore/EventStore), there is by default one stream per aggregate. That means that not all events related to aggregate type "User" are stored in a single stream; instead we have one stream for each aggregate, e.g. "User-<user_id_1>", "User-<user_id_2>", "User-<user_id_3>", ...
  This option is also available for prooph-event-store. Limiting the usage to disallow this strategy would be possible, but is not really wanted.
  To quote Greg Young: "You need stream per aggregate not type. You can do single aggregate instance for all instances but it's yucky"
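The naming convention can be sketched like this (illustrative only; the "<Type>-<id>" pattern simply follows the examples above and is a common convention, not a fixed prooph API):

```python
def stream_name(aggregate_type, aggregate_id):
    """Build the stream name for the one-stream-per-aggregate strategy."""
    return f"{aggregate_type}-{aggregate_id}"

# Each aggregate instance gets its own stream:
print(stream_name("User", "user_id_1"))  # → User-user_id_1
print(stream_name("User", "user_id_2"))  # → User-user_id_2
```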

- Store forever
  While obvious at first glance, the event store should persist events forever; they must not be removed, garbage collected or deleted on server shutdown.

- Querying event offset
  Another thing that seems obvious at first: given you loaded an aggregate from a snapshot, you already have it at a specific version (say event version 10). Then you only need to load events starting from version 11. This is especially important once you have thousands of events in an aggregate (imagine you would need to load all 5000 events again, instead of only the last 3, even when you have a snapshot). It is even more important when one stream per aggregate is not possible.
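To make the snapshot-plus-offset idea concrete, here is a minimal Python sketch (all names are illustrative; `load_events_from` stands in for an event-store query with a version lower bound, not a real prooph function):

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    version: int
    state: dict

def load_aggregate(snapshot, load_events_from):
    """Rebuild an aggregate from a snapshot plus only the newer events."""
    state = dict(snapshot.state)
    version = snapshot.version
    # Only events after the snapshot version are fetched and replayed.
    for event in load_events_from(version + 1):
        state.setdefault("applied", []).append(event)
        version += 1
    return state, version

# A stream with 13 events; the snapshot already covers the first 10.
all_events = ["e%d" % i for i in range(1, 14)]
snap = Snapshot(version=10, state={})
state, version = load_aggregate(snap, lambda v: all_events[v - 1:])
# Only events 11..13 were replayed, not all 13.
```

Without the ability to query from an offset, the snapshot buys you nothing: you read the whole stream anyway.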
 

Now let's look at what Kafka has to offer:

- Concurrency checks

  Well, Kafka doesn't have that. It would not be such a problem with something like the actor model; to quote Greg Young again:
  >> I also really like the idea of in memory models (especially when built
  >> up as actors in say erlang or akka). One of the main benefits here is
  >> that the actor infrastructure can assure you a single instance in the
  >> cluster and gets rid of things like the need for optimistic concurrency.
  >>
  >> Greg
  So this one is a really big issue.

- One stream per aggregate
  If I have thousands of aggregates and each has its own topic, Kafka (and ZooKeeper specifically) will explode. These are real problems with Kafka: ZooKeeper can't handle 10M partitions right now.

- Store forever
  On Kafka, events expire! Yes, really, they expire! Fortunately, with newer versions of Kafka you can configure it to not expire messages at all (day saved!).

- Querying event offset
  Here we go: that's not possible with Kafka! Combine that with the "One stream per aggregate" problem and we have a full nightmare. It's simply not reasonable to read millions or even billions of events just to replay the 5 events you're interested in into your aggregate root.

Ways around some limitations:

- Use some actor-model-ish implementation (like Akka) - this would solve the "Concurrency checks" issue as well as the
  "Querying event offset" issue, because you can use things like "idempotent producer semantics" (to track the producer id and position in the event stream) and in-memory checks of aggregates to work around the concurrency check requirement. But prooph consists of PHP components, and even if we implemented some Akka-like infrastructure, it would be a rare use-case that someone wanted to use this implementation.
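To illustrate why the actor-model workaround removes the need for an optimistic concurrency check, here is a minimal single-writer-per-aggregate sketch in Python (a toy mailbox, not Akka and not a prooph API): because one worker owns the aggregate and handles commands strictly one at a time, version conflicts simply cannot occur.

```python
import queue
import threading

class AggregateActor:
    """Toy actor: one mailbox, one worker thread per aggregate."""

    def __init__(self):
        self.version = 0
        self.events = []
        self._mailbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            command = self._mailbox.get()
            # Commands are handled strictly one at a time, in arrival
            # order, so no two writers can race on the version.
            self.version += 1
            self.events.append((self.version, command))
            self._mailbox.task_done()

    def tell(self, command):
        self._mailbox.put(command)

actor = AggregateActor()
for command in ["Register", "Rename", "Deactivate"]:
    actor.tell(command)
actor._mailbox.join()  # wait until the mailbox is drained
assert [v for v, _ in actor.events] == [1, 2, 3]
```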
 
 
Another interesting quote from Greg Young about a Kafka EventStore:

>> Most systems have some amount of data that is "live" and some much
>> larger amount of data that is essentially "historical" a perfect example
>> of this might be mortgage applications in a bank. There is some tiny %
>> that are currently in process in the system and a vast number that are
>> "done".
>>
>> If you wanted to just put everything into one stream you would need to
>> hydrate all of them and keep them in memory (even the ones you really
>> don't care about any more).
>>
>> This can turn into a very expensive operation/decision.
>>
>> Cheers,
>>
>> Greg

and also this one:

>> Most of the systems discussed are not really event sourced they just
>> raise events and are distributing them. They do not keep their events as
>> their source of truth (they just throw them away). They don't do things
>> like replaying (which is kind of the benchmark)
>>
>> Not everyone who sends a message is "event sourcing"
>>
>> Greg

Final thoughts:

Kafka is great for stream processing. With enqueue (github.com/php-enqueue/enqueue-dev) and prooph's enqueue-producer (github.com/prooph/psb-enqueue-producer/) you can already send messages to Kafka for processing. So send your messages to Kafka, if you need or want to.
In my opinion a Kafka EventStore implementation would be very limited and not useful for most of the PHP applications being built. Therefore I think there will never be a Kafka EventStore implementation (not in prooph, nor in any other programming language - correct me if I'm wrong and you know of an open source Kafka EventStore implementation somewhere!).
When even Greg Young thinks a Kafka EventStore is a bad idea, I'm at least not the only one out there.

References:

Most of the Greg Young quotes about a Kafka EventStore are taken from here: https://groups.google.com/forum/#!topic/dddcqrs/rm02iCfffUY
I recommend this thread to everyone who wants to dig deeper into the problems with Kafka as an EventStore.

Comments:

  1. Hi there Sascha

    The Kafka broker is really a log so it’s best not to think of it as a database. It provides a powerful event storage layer. Atop that you add a view (i.e. the Query side of CQRS). People do this in two ways: either via the Kafka Streams API (which is part of Kafka) which provides an internal ‘state store’ where you can build views that allow you to query aggregates directly within your app. Alternatively via a database (which it can be connected to via one of the connectors).

    The approach is a little different to the way people use Event Store, so I can see where the confusion would come from, but it has some nice advantages:
    - it encourages you to process events directly which takes you down a more event-driven route.
    - It separates the log of events (the source of truth) from the view(s) which makes it work well in multi tenant systems like microservices. This is useful because the source of truth remains (and is authoritative), but the view(s) typically change frequently.
    - This also means microservices can share a single source of truth, but each microservice has views that are lightweight and targeted to the job that microservice needs to do, as well as being wholly owned and operated by that service.
    - You can leverage all the tools that come with a stream processing engine, right on your log of events.

    So I see Event Store as an event sourcing database, and it definitely has its place. Kafka is more like an event store that facilitates a *set* of event driven applications through the event sourcing / command sourcing, EDA and CQRS patterns, as well as providing a lot of scalability and resiliency primitives. That’s why it tends to be used a lot with microservices.

    There is an example of a small application here: https://www.confluent.io/blog/building-a-microservices-ecosystem-with-kafka-streams-and-ksql/

    There is also a bit more detail in this free book: https://www.confluent.io/designing-event-driven-systems

    Oh, and you can store data in Kafka for as long as you want (set retention.ms=-1). In fact topics with hundreds of TBs are not uncommon.

    Ben

    Replies
    1. Hi Ben, thanks for your input. At the open source project prooph (http://getprooph.org/) we got a lot of requests for an event store implemented with Kafka. So I looked into it and this blog post is the result of my research.

      I am not saying that Kafka is a bad tool, or that it cannot be used in event sourced systems at all. All I am saying is that Kafka is not suitable as an event store. By event store I mean a database that is able to store events (like http://eventstore.org/ or, as in prooph, MySQL / Postgres). This works really well, and Kafka can be a nice addition for event-processing, view generation, etc. But as you said yourself, the Kafka broker is a log. I would not use the words "event store" when talking about Kafka, because "event store" already has a specific meaning. People get confused quickly if you say things like "Kafka is an event store that..." - Kafka is not an event store. As I laid out in this blog post, it's not suitable as an event store at all. But Kafka is still a great tool that can give awesome results when used in combination, no doubts about that.

    2. An RDBMS is a good choice for implementing an event store, but it falls short when rebuilding read models and projections. For example, if we have 1M streams (aka tables in an RDBMS with the per-aggregate-id, per-aggregate-type strategy), how can I project from multiple tables?

    3. Hello Ben,

      I'm currently investigating on using Kafka as an Event Store, and I have a few questions regarding your previous comment (sorry for digging up the thread).

      - it encourages you to process events directly which takes you down a more event-driven route.

      => That is not directly related to using Kafka, but is always true when going down the Event Sourcing path, isn't it?

      - It separates the log of events (the source of truth) from the view(s) which makes it work well in multi tenant systems like microservices. This is useful because the source of truth remains (and is authoritative), but the view(s) typically change frequently.
      - This also means microservices can share a single source of truth, but each microservice has views that are lightweight and targeted to the job that microservice needs to do, as well as being wholly owned and operated by that service.

      => That will also be achieved when using CQRS/Event Sourcing with any event store infrastructure, won't it?

      - You can leverage all the tools that come with a stream processing engine, right on your log of events.

      Are you talking about kafka-streams, KStream and KTable for instance?

      More generally, my main concern about using Kafka as an event store is the difficulty/impossibility of retrieving a stream of events for a given aggregate type and id. A workaround for this is to have a topic per aggregate type and consumers that index events by aggregate id. Then, when a process needs to fetch a stream of events to rebuild an aggregate (during an update operation, for instance), it can rely on this separate index. Which means that the write processes rely on a view (these indexes are projections/views of the source of truth), and not on the source of truth itself, which makes me very uncomfortable (as these indexes are eventually consistent by nature)!

      My second concern, as Sascha has stated, is the inability to handle concurrency checks when publishing events (aka optimistic locking) with Kafka. That means that domain invariants can be silently broken, and that does not sound good at all... However, adding such a feature to Kafka is under discussion here: https://issues.apache.org/jira/browse/KAFKA-2260.

      Please note that these concerns could arise because I have a biased idea of what, or how, should an event sourced system work (mainly by experiencing on Prooph/Event Store), so feel free to correct me if I'm raising incorrect issues (or non issues)!

      Thanks!

      Gildas

    4. Regarding concurrency issue, I've just had a breakthrough while browsing this example project: https://github.com/amitayh/event-sourcing-kafka-streams.

      Basically, by publishing commands on a topic and using the aggregate id as the message key, the owner of the project was able to ensure that no two commands would be simultaneously handled for a given aggregate id.

      Thus completely removing the need for a concurrency check when appending the new events.

      I believe the magic happens here: https://github.com/amitayh/event-sourcing-kafka-streams/blob/master/commandhandler/src/main/scala/org/amitayh/invoices/commandhandler/CommandToResultTransformer.scala#L27-L39.

      WDYT?

      Gildas
