Triton Inference Server on AIR-T

This tutorial walks you through setting up and running the Triton Inference Server on your AIR-T and provides a minimal example that loads a model and requests a prediction.

Triton Inference Server is an open source inference serving software that streamlines AI inference, i.e., running an AI application for production. Triton enables teams to deploy any AI model from multiple deep learning and machine learning frameworks, including TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Further instructions for setup and optimization can be found here:

Nvidia Triton Documentation

This tutorial is for AirStack 2.0+. If you are running an older version of AirStack, please see the previous version of this tutorial here.

Triton Setup

  • For JetPack 6.0+, NVIDIA now releases Docker containers that include the Triton installation. You can find the latest container release here and pull the version tagged "-igpu":
    docker pull nvcr.io/nvidia/tritonserver:XX.YY-py3-igpu
    
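For example, to pull the release used later in this tutorial (substitute whichever tag is current when you read this):

docker pull nvcr.io/nvidia/tritonserver:24.12-py3-igpu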

Set Up a Model Repository

  • Choose a folder on your AIR-T to hold your Triton inference models. Inside this folder you will need to follow this format:

     <model-repository-path>/
       <model-name>/
         [config.pbtxt]
         [<output-labels-file> ...]
         <version>/
           <model-definition-file>
         <version>/
           <model-definition-file>
         ...
       ...
    

We will download the model from the Inference on AIR-T tutorial and use it to request a prediction:

  1. Create the model repository directory

    mkdir triton_models
    cd triton_models
    
  2. Create the model and version directories

    mkdir avg_power_net
    mkdir avg_power_net/0
    
  3. Download the model's ONNX file into the version folder

    cd avg_power_net/0
    wget https://github.com/deepwavedigital/airstack-examples/raw/master/inference/pytorch/avg_pow_net.onnx -O model.onnx
    
  4. Create the config.pbtxt file shown below. This file should live at the top level of the model directory (avg_power_net/) and be named config.pbtxt. The resulting layout is shown after this list.

    name: "avg_power_net"
    platform: "onnxruntime_onnx"
    max_batch_size: 1
    input [
    {
    name: "input_buffer"
    data_type: TYPE_FP32
    dims: [4096]
    }]
    output [{
    name: "output_buffer"
    data_type: TYPE_FP32
    dims: [1]
    }]
    
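After these steps your model repository should look like this (the repository root is the triton_models folder):

     triton_models/
       avg_power_net/
         config.pbtxt
         0/
           model.onnx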

Note: More information on model repository setup can be found in the model repository documentation.

Run the Triton Server

Run the Docker container using the example command below, mounting the triton_models folder you created above into the container (adjust the host path and image tag to match your setup):

docker run --runtime=nvidia --rm --network=host  -v /home/deepwave/triton_models:/triton_models nvcr.io/nvidia/tritonserver:24.12-py3-igpu tritonserver --model-repository=/triton_models
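
Once the container reports that the server has started, you can optionally confirm it is ready to accept requests with the standard health endpoint (this assumes the default HTTP port of 8000):

curl -v http://localhost:8000/v2/health/ready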

Your Triton Inference Server is now available at the addresses listed in the server startup log. You should be able to check the average power net model's metadata using:

curl http://0.0.0.0:8000/v2/models/avg_power_net

and receive back:

{
    "name": "avg_power_net",
    "versions": ["0"],
    "platform": "onnxruntime_onnx",
    "inputs": [
        {
            "name": "input_buffer",
            "datatype": "FP32",
            "shape": [-1, 4096]
        }
    ],
    "outputs": [
        {
            "name": "output_buffer",
            "datatype": "FP32",
            "shape": [-1, 1]
        }
    ]
}

To request an inference result using Python, you can follow the code below. Note that the request shape is [1, 4096] even though config.pbtxt lists dims: [4096]: because max_batch_size is set, Triton prepends a batch dimension to the configured dims, which is also why the reported shape above is [-1, 4096].

import requests
import numpy as np

# Setup POST Request
url = "http://localhost:8000/v2/models/avg_power_net/infer"
headers = {'content-type': 'application/json', 'Accept-Charset': 'UTF-8'}
session = requests.Session()

# Create input data array
input_data = np.random.uniform(-1, 1, 4096).astype(np.float32)
data = {
    "inputs": [{
        "name": "input_buffer",
        "shape": [1, 4096],
        "datatype": "FP32",
        "data": input_data.tolist()
    }],
    "outputs": [{
        "name": "output_buffer"
    }]
}

# Post request
r = session.post(url, json=data, headers=headers)

# Read Results
print(r.content)
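
The response body is JSON in the format described by the Triton protocol documentation linked below, so the prediction can be pulled out of the "outputs" entry. A minimal continuation of the script above (verify the field names against your server's actual reply):

# Parse the JSON response and convert the prediction to a NumPy array
result = r.json()
output = np.array(result["outputs"][0]["data"], dtype=np.float32)
print(output)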

For more information on the HTTP/REST and gRPC protocols, check out: Triton Protocols
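
As an alternative to hand-building HTTP requests, NVIDIA also publishes a tritonclient Python package (installable with pip install tritonclient[http]). The sketch below shows the same request using that client; it assumes the package is installed on your AIR-T and is not required for this tutorial:

import numpy as np
import tritonclient.http as httpclient

# Connect to the local Triton server over HTTP (default port 8000)
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the input tensor: batch of 1, 4096 float32 samples
input_data = np.random.uniform(-1, 1, (1, 4096)).astype(np.float32)
infer_input = httpclient.InferInput("input_buffer", [1, 4096], "FP32")
infer_input.set_data_from_numpy(input_data)

# Run inference and request only the "output_buffer" tensor
response = client.infer(
    model_name="avg_power_net",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("output_buffer")],
)
print(response.as_numpy("output_buffer"))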