Simplifying Step Functions and Stepwise: Lessons Learned and a New Approach

At Motiva, we use AWS Step Functions to manage all of our workflows. To make this simpler, we developed Stepwise, an open-source Clojure library that helps us coordinate business processes, data pipelines, and even our machine learning workflows. With Step Functions, we can handle complex event-driven processes and monitor our operations. Although we have been running Step Functions and Stepwise in production for a couple of years, we have identified areas where the developer experience could improve. This post shares some of the bottlenecks we discovered and proposes a new version of a Step Functions library that is Clojure all the way.

What do we mean by "Clojure all the way"? We have been designing a new interface for defining Step Functions state machines, which we illustrate here with a pizza-making workflow. To create this state machine on AWS Step Functions, we would write the following code:

(def pizza-making-state-machine
  (sfn/->> request
           (sfn/parallel (make-dough)
                         (make-sauce)
                         (sfn/map {:iterate-over :ingredients} prepare-ingredients))
           (put-ingredients-on-dough)
           (bake)
           (sfn/wait 2 :minutes)
           (sfn/choice (comp not is-pizza-acceptable?)
                       ;; branch off to this fn if condition is true
                       (sfn/fail))
           (serve)))

(sfn/ensure-state-machine client :pauls-pizza-making-machine pizza-making-state-machine)

The definition above looks like normal Clojure code, but it actually describes a durable workflow, because AWS Step Functions keeps the state of your workflow at all times. So, if your server, processes, or any of your workers go down during a workflow execution, the execution will continue where it left off once your system comes back online.

[Figure: Step Functions diagram for pizza-making-state-machine]
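
Behind a diagram like this sits an Amazon States Language (ASL) definition, which a threading macro such as sfn/->> would have to generate. The following is only a minimal sketch of that idea, not the library's implementation: it chains a linear list of step names into an ASL-shaped map. The function name linear-states and the ARN prefix (with its placeholder account ID) are hypothetical; the StartAt, States, Type, Resource, Next, and End keys are standard ASL fields.

```clojure
(defn linear-states
  "Chain step names into an ASL-shaped map: each Task points to the
  following step via Next, and the final Task is marked with End."
  [arn-prefix step-names]
  (let [names (mapv name step-names)
        ;; Pair every step with its successor; the last step has none.
        nexts (concat (rest names) [nil])]
    {"StartAt" (first names)
     "States"  (into {}
                     (map (fn [step next-step]
                            [step (merge {"Type"     "Task"
                                          "Resource" (str arn-prefix step)}
                                         (if next-step
                                           {"Next" next-step}
                                           {"End" true}))])
                          names
                          nexts))}))

;; Hypothetical usage: only the linear tail of the pizza workflow.
(def pizza-sketch
  (linear-states "arn:aws:states:us-east-1:123456789012:activity:"
                 [:put-ingredients-on-dough :bake :serve]))
```

Parallel, map, wait, and choice states would need their own translation rules; the sketch covers only the sequential case to show how ordinary Clojure data can stand in for hand-written ASL.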

We can define the workers for each step of the workflow in Clojure as follows:

(defn make-dough
  ;; SFN error handling configuration as metadata
  {:retry [{:error-equals     [:sfn.timeout]
            :interval-seconds 60
            :max-attempts     3}]}
  [coll]
  coll)

(defn put-ingredients-on-dough [coll] coll)

(defn bake [coll] coll)

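(defn make-sauce [coll] coll)

;; Placeholder workers for the remaining steps referenced in the state machine
(defn prepare-ingredients [ingredient] ingredient)

(defn serve [coll] coll)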

(sfn/def-choice-predicate-fn is-pizza-acceptable? [m]
  true)
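
Because defn accepts an attribute map, the error-handling configuration on make-dough ends up as ordinary metadata on the var, where a library can read it back. The sketch below shows one way that metadata might be turned into an ASL Retry block. It is an illustration under assumptions, not the planned API: retry->asl and error-names are hypothetical names, and mapping :sfn.timeout to the ASL error name States.Timeout is our guess at the intended meaning.

```clojure
(defn make-dough
  ;; SFN error handling configuration as metadata
  {:retry [{:error-equals     [:sfn.timeout]
            :interval-seconds 60
            :max-attempts     3}]}
  [coll]
  coll)

;; Hypothetical mapping from the keywords used in this post to ASL error names.
(def error-names {:sfn.timeout "States.Timeout"})

(defn retry->asl
  "Read :retry from a worker var's metadata and rename its keys to the
  ErrorEquals / IntervalSeconds / MaxAttempts fields used by ASL."
  [worker-var]
  (mapv (fn [{:keys [error-equals interval-seconds max-attempts]}]
          {"ErrorEquals"     (mapv #(get error-names % (name %)) error-equals)
           "IntervalSeconds" interval-seconds
           "MaxAttempts"     max-attempts})
        (:retry (meta worker-var))))

;; Pass the var (#'make-dough), not the function value, so metadata is visible.
(def make-dough-retry (retry->asl #'make-dough))
```

Keeping the configuration next to the function means the retry policy travels with the worker, wherever it ends up running.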

Worker functions such as make-dough and bake define the steps in the pizza-making workflow. We can choose where to run them at runtime using the following code:

(sfn/run-here client
              {make-dough               {:concurrency 2}
               put-ingredients-on-dough {:concurrency 1}
               bake                     {:concurrency 1}})


(sfn/run-on-lambda client
                   {make-sauce {:timeout         40
                                :memory-size     512
                                :max-concurrency 50}})

Since the workers are defined as Clojure functions, we can choose to run them in containers or serverless functions at runtime.
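
To make the :concurrency option more concrete, here is a rough sketch of what running a worker locally with bounded parallelism could look like. It is not how sfn/run-here is implemented (that library is still being designed); run-locally is a hypothetical helper that simply applies a plain worker function to a batch of inputs on a fixed-size thread pool, without talking to Step Functions at all.

```clojure
(import '(java.util.concurrent ExecutorService Executors Future))

(defn run-locally
  "Call worker-fn on every input using a fixed pool of `concurrency`
  threads. Results are returned in the same order as the inputs."
  [worker-fn concurrency inputs]
  (let [^ExecutorService pool (Executors/newFixedThreadPool concurrency)
        ;; Clojure functions implement Callable, so they can be submitted directly.
        ^java.util.Collection tasks (mapv (fn [input] (fn [] (worker-fn input)))
                                          inputs)]
    (try
      ;; invokeAll blocks until every task has finished.
      (mapv (fn [^Future f] (.get f)) (.invokeAll pool tasks))
      (finally
        (.shutdown pool)))))

(defn bake [coll] coll)

;; Hypothetical usage: at most two pizzas are baked at the same time.
(def baked (run-locally bake 2 [{:pizza 1} {:pizza 2} {:pizza 3}]))
```

The point of the sketch is that the worker itself knows nothing about threads, containers, or Lambda; placement and concurrency are decided entirely by the code that calls it.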

Suppose make-dough turns out to be an infrequent but bursty process that is better suited to serverless, while make-sauce runs long enough to hit the Lambda timeout. We can swap the two like so:

(sfn/run-here client
              {make-sauce               {:concurrency 5}
               put-ingredients-on-dough {:concurrency 1}
               bake                     {:concurrency 1}})


(sfn/run-on-lambda client
                   {make-dough {:timeout         30
                                :memory-size     512
                                :max-concurrency 100}})

At Motiva, we have state machines to manage email delivery, data integration, and machine learning decisions, among others. However, we only have four developers on our team, and we want to deliver quality products faster without sacrificing operational excellence. As demand grows and customers ask for more, we're seeing that how quickly we can orchestrate new business workflows as state machines is crucial to our competitiveness.

To achieve this goal, we plan to simplify our development process by using only one tool, Clojure, instead of juggling several, such as Amazon States Language and Terraform. By doing so, we can focus on delivering value to our customers.

So, what do you think? We're still in the design phase of this new library. If this is something that's of interest to you, please get in touch with me.