Moving from Lambda to SageMaker for inference

A few years ago, we moved our machine learning inference endpoints to serverless in order to reduce operational upkeep. At that time, AWS Lambda seemed like an obvious choice, as many of our Clojure services were already running on Lambda, and we had a deployment process in place. We used our continuous integration platform to package our applications and upload them to S3, and then deployed the resources with Terraform.

However, we faced a new challenge as our machine learning models grew beyond the 250 MB Lambda artifact limit. To resolve this, we needed to either update our deployment pipeline to package our models as container images on Lambda (which has a more generous 10 GB limit) or find a new solution.

Enter SageMaker Serverless Inference. It offers all the benefits of Lambda with additional conveniences tailored for ML model deployments. One drawback is cost: according to our calculations, 1000 requests of 1 second each on a 2 GB memory configuration cost 3.3 cents on Lambda, but 4 cents on SageMaker Serverless Inference. That is a 21% increase, but for a small company like us, the absolute amount is negligible compared to the developer time saved.
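The 21% figure follows directly from the two quoted costs. A minimal sketch of the arithmetic, using only the numbers above (the variable names are ours, and no pricing beyond what is quoted is assumed):

```python
# Cost of 1000 one-second requests at 2 GB of memory, in cents, as quoted above.
lambda_cents = 3.3
sagemaker_cents = 4.0
requests = 1000

# Relative increase when moving from Lambda to SageMaker Serverless Inference.
increase_pct = (sagemaker_cents - lambda_cents) / lambda_cents * 100

# Absolute difference for a single request, in cents.
extra_cents_per_request = (sagemaker_cents - lambda_cents) / requests

print(f"relative increase: {increase_pct:.0f}%")
print(f"extra cost per request: {extra_cents_per_request:.4f} cents")
```

The relative number looks large, but the per-request difference is a small fraction of a cent, which is why we consider it negligible at our scale.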

In this blog post, I'll guide you through the steps of creating and using a SageMaker Serverless Inference endpoint, which we found to be a seamless experience. We're continuously impressed with the ecosystem developing around the MLOps community.

Creating a SageMaker endpoint

We decided to try SageMaker Serverless Inference and re-deployed one of our smallest machine learning models on SageMaker using the instructions from this Hugging Face notebook.

  • create an IAM role with the AmazonSageMakerFullAccess policy on your AWS account (ref). Let's call it my_sagemaker_execution_role in this example.

  • create an S3 bucket for SageMaker to upload data, models, and logs:

import sagemaker

sess = sagemaker.Session()
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
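    sagemaker_session_bucket = sess.default_bucket()

  • resolve the execution IAM role, falling back to a lookup by name when no execution role can be inferred, and bind the session to the bucket:

import boto3

try:
    role = sagemaker.get_execution_role()
except ValueError:
    # no execution role could be inferred (e.g. on a local machine), so look the role up by name
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="my_sagemaker_execution_role")["Role"]["Arn"]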

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
  • create a HuggingFaceModel instance, passing the container's environment variables as env. HF_API_TOKEN is optional in general, but our model is private on Hugging Face, so the container needs an API token in its environment to pull the model:

from sagemaker.huggingface.model import HuggingFaceModel

env = {
    "HF_MODEL_ID": "MoritzLaurer/multilingual-MiniLMv2-L6-mnli-xnli",
    "HF_TASK": "zero-shot-classification",
    "HF_API_TOKEN": "<your Hugging Face API token>",  # placeholder; only needed for private models
}

huggingface_model = HuggingFaceModel(
    env=env,                      # configuration for loading model from Hub
    role=role,                    # iam role with permissions to create an Endpoint
    transformers_version="4.26",  # transformers version used
    pytorch_version="1.13",       # pytorch version used
    py_version="py39",            # python version used
)
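  • deploy!

from sagemaker.serverless import ServerlessInferenceConfig

serverless_config = ServerlessInferenceConfig(
    # memory_size_in_mb and max_concurrency are set here; left out, the SDK defaults apply
)

classifier = huggingface_model.deploy(
    serverless_inference_config=serverless_config, endpoint_name="my_demo_sagemaker_endpoint"
)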

We really liked the fact that this can all be done in the same Python environment as our model development. This saves us from context switching between Python, Docker, and Terraform. However, we are unsure about using the SageMaker Hugging Face Inference Toolkit, which hides some of the nice features of SageMaker endpoints, such as support for production variants to perform A/B testing on models.

Requesting an inference

Wait for the SageMaker endpoint to be up and running; this takes a few minutes. If you are still in the session that followed the upstream instructions linked above, you can make a request immediately through the object returned from HuggingFaceModel.deploy(). In most cases, however, requests are made from a different process at a later time, where that object is no longer available. Fortunately, an endpoint can be invoked by name alone.
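If the waiting has to happen in a separate script, it can be sketched as a small polling loop. This is an illustration rather than part of our pipeline: wait_until_in_service and its parameters are hypothetical names, and the status strings are the EndpointStatus values returned by SageMaker's DescribeEndpoint call.

```python
import time
from typing import Callable


def wait_until_in_service(
    get_status: Callable[[], str],
    timeout_s: float = 600.0,
    poll_s: float = 15.0,
) -> None:
    """Poll get_status() until it returns "InService".

    Raises RuntimeError if the status is "Failed", and TimeoutError if the
    endpoint is still not in service after timeout_s seconds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == "InService":
            return
        if status == "Failed":
            raise RuntimeError("endpoint creation failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"endpoint still {status!r} after {timeout_s}s")
        time.sleep(poll_s)


# Usage against a real endpoint (requires boto3 and AWS credentials):
#
#   import boto3
#   sm = boto3.client("sagemaker")
#   wait_until_in_service(
#       lambda: sm.describe_endpoint(
#           EndpointName="my_demo_sagemaker_endpoint"
#       )["EndpointStatus"]
#   )
```

boto3 also ships a ready-made waiter for this, boto3.client("sagemaker").get_waiter("endpoint_in_service"), which is probably the better choice when the defaults suit you.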

We have a couple of options: call the HTTP endpoint directly or use SageMakerRuntime. We opted for SageMakerRuntime because we're already in Python and do not want to write the AWS authentication header for an HTTP request ourselves.

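import json

import boto3

client = boto3.client("sagemaker-runtime")

# request body for a zero-shot classifier
body = {
    "inputs": "A new model offers an explanation for how the Galilean satellites formed around the solar system’s largest world.",
    "parameters": {
        "candidate_labels": ["space", "microbiology", "robots", "archeology"],
    },
}

response = client.invoke_endpoint(
    EndpointName="my_demo_sagemaker_endpoint",  # replace with your endpoint name
    ContentType="application/json",             # the body is sent as JSON
    Body=json.dumps(body),
)

# the response body is a stream; read and decode it to get the prediction
print(json.loads(response["Body"].read()))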

This example request body schema is for a zero-shot classifier. Request bodies for other transformers.pipeline tasks are shown in this Hugging Face notebook.
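On the response side, a zero-shot classifier returns the input sequence together with parallel lists of labels and scores. A minimal sketch of picking the best label, assuming the endpoint returns the shape produced by the transformers zero-shot-classification pipeline (top_label is a hypothetical helper, and the sample payload is made up for illustration):

```python
import json
from typing import Tuple


def top_label(response_body: bytes) -> Tuple[str, float]:
    """Return the highest-scoring label and its score from a zero-shot response."""
    result = json.loads(response_body)
    # Pair each label with its score and pick the maximum explicitly,
    # rather than relying on the order in which labels are returned.
    label, score = max(
        zip(result["labels"], result["scores"]), key=lambda pair: pair[1]
    )
    return label, score


# Made-up payload for illustration; the scores are not real model output.
sample = json.dumps(
    {
        "sequence": "A new model offers an explanation ...",
        "labels": ["space", "archeology", "robots", "microbiology"],
        "scores": [0.9, 0.05, 0.03, 0.02],
    }
).encode("utf-8")

print(top_label(sample))
```

With a real endpoint, the argument would be response["Body"].read() from the invoke_endpoint call.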

Lastly, note that a SageMaker Serverless endpoint has a limit of 200 concurrent invocations per endpoint, whereas Lambda can handle tens of thousands.
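When a single client fans out many requests, one way to stay under that limit is to cap the number of calls in flight on the client side. The sketch below uses a thread pool for this; invoke_all and fake_invoke are hypothetical names, and fake_invoke stands in for a function that wraps client.invoke_endpoint().

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def invoke_all(
    invoke: Callable[[str], dict],
    texts: Iterable[str],
    max_in_flight: int = 200,
) -> List[dict]:
    """Call invoke(text) for every text, with at most max_in_flight calls at once."""
    # A pool with max_workers threads never runs more than that many calls
    # concurrently, and map() returns results in input order.
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(invoke, texts))


# Stand-in for a function that wraps client.invoke_endpoint(); hypothetical.
def fake_invoke(text: str) -> dict:
    return {"sequence": text}


results = invoke_all(fake_invoke, ["a", "b", "c"], max_in_flight=2)
print(len(results))
```

This only bounds the concurrency of one process. If several clients share the endpoint, the cap has to be divided between them or enforced elsewhere.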