I’ve been using Julia more and more in my non-work projects. Years ago I tried out Julia but found the syntax and the concepts pretty foreign. Over time, exposure to more computer science concepts has made it seem increasingly approachable.
Julia certainly feels much less high-level than Python and R. An obvious reason is that, though Julia is a dynamically-typed language, it does not use an interpreter in the same way that Python and R do. Julia does not go as far as statically-typed, low-level languages such as C++, where you write the code, save the file, and then compile it to an executable containing the machine code that ultimately does the work. Julia instead leverages a Just-In-Time (JIT) compiler that generates and caches the machine code as you go along. Julia also allows you to define types rather than simply provide type hints (like in Python). This, along with the LLVM-based JIT compiler and language features such as multiple dispatch, is the reason that many in the scientific computing and data science space have made it their language of choice when they need to maximize performance (a toy sketch of these features follows below). For more details about how Julia works and why it is so performant, I highly recommend this book.
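As a quick illustration (made-up names, not code from this project), here is a hedged sketch of what declared types and multiple dispatch look like in practice:
# A user-defined type with concretely typed fields.
struct Game
    home_runs::Int
    away_runs::Int
end
# Two methods of the same function; Julia selects one at call time
# based on the argument types (multiple dispatch), and the JIT
# compiles a specialized machine-code version of each method.
winner(g::Game) = g.home_runs > g.away_runs ? :home : :away
winner(home_runs::Int, away_runs::Int) = winner(Game(home_runs, away_runs))
winner(Game(5, 3))  # :home
winner(2, 7)        # :away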
Aside from needing to understand some computer science concepts, another barrier was that I was more comfortable writing my Bayesian statistical models in Stan using either cmdstanr or cmdstanpy. These interfaces allowed me to take advantage of the power of C++, along with Stan’s No-U-Turn Sampler (NUTS) for efficient and accurate MCMC inference, while letting me keep using languages I was comfortable with for everything outside of fitting the model (e.g., data cleaning, visualizing the results). As my curiosity around Julia grew, I was still determined to stick with Stan because of my comfort with it, and I didn’t feel that there was as large a community (and therefore support) for Stan (and the NUTS sampler) in Julia. So, I stayed away.
In the past few months, I have been working on some academic projects in my free time and saw that I could use the NUTS algorithm via Turing.jl. I haven’t checked how long this algorithm has been available in Turing.jl, and honestly I don’t even remember looking for Turing.jl when I was flirting with the language in the past. I had just been searching for “Stan in Julia” and had not seen as much as I was accustomed to in R and Python circles. It was definitely my fault for not having looked into it more deeply. But I don’t think I am alone in not seeing Julia as competitive with R or even Python for Bayesian modeling.
A recent side project of mine has been to use Bradley-Terry models to power rank MLB teams. MLB teams are ranked by how many wins and losses they have: the more wins relative to losses a team has, the higher it sits in the standings. The Bradley-Terry model instead estimates the latent ability of each team. It does this by considering each team and its opponents; to understand how difficult those opponents are, you look at the opponents’ opponents and how they have fared against them. Two teams with the same number of wins and losses may be equal in the standings, but if one of them has beaten more teams with better records than the other, then the former team would have a higher latent ability than the latter.
When I first played around with these models, I did it in Python with cmdstanpy. Recently, though, I thought I’d try it out using Turing.jl. One reason was curiosity about how it would perform against cmdstanpy. The other was that I’ve been toying with the idea of a larger project where I write a web app that does some Bayesian modeling with a Julia backend, and I figured I should also try fitting the models with Turing.jl rather than use a Python backend or straight-up cmdstan.
So, I rewrote the models that I had in cmdstanpy and implemented them in Turing.jl. I posted a picture of my terminal on Bluesky, and Elliott Morris was amazed, as I had been, that Turing.jl had NUTS available as the MCMC sampler. He then asked for a blog post comparing the two. I am but a humble servant.
First, I am going to show the contents of the scripts used to load the data from my local database, define the model, fit the model, and extract the ranks of the teams. This will help give a sense of the differences in syntax.
Let’s load the dependencies.
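On the Julia side, the dependencies look roughly like this (a sketch based on the benchmark script further down; the R side loads DBI, dplyr, duckdb, and cmdstanr):
# Packages used for the Julia version of this post
# (DataFrames and DuckDB for the data, Turing and MCMCChains
# for fitting the model and wrangling the posterior draws).
using DataFrames
using DuckDB
using Turing
using MCMCChains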
Then I am going to connect to my local database (from the project that this was inspired from) and pull in the data.
# Get the data.
# - Connect to the DB.
con <- dbConnect(duckdb(), "~/Desktop/mlb_pred/data/twenty_five.db")
# - Get the dataframe.
df <- dbGetQuery(
con
, "
with a as (
select
scores.game_id
, schedule.home_team
, scores.home_runs
, schedule.away_team
, scores.away_runs
from scores
left join schedule
on scores.game_id=schedule.game_id
where schedule.season_id like '2025'
)
select
game_id
, teams.team_abbr
, dense_rank() over(order by home_team) as home_team
, dense_rank() over(order by away_team) as away_team
, (case
when home_runs > away_runs then 1
else 0
end) as home_win
from a
left join teams
on a.home_team=teams.team_id
where teams.season_id like '2025';
"
)
# - Close the connection.
dbDisconnect(con, shutdown=TRUE)
# Connect to the DB.
con = DBInterface.connect(DuckDB.DB, "~/Desktop/mlb_pred/data/twenty_five.db");
# Pull in the data as a dataframe.
df = DataFrame(
DBInterface.execute(
con
, """
with a as (
select
scores.game_id
, schedule.home_team
, scores.home_runs
, schedule.away_team
, scores.away_runs
from scores
left join schedule
on scores.game_id=schedule.game_id
where schedule.season_id like '2025'
)
select
game_id
, teams.team_abbr
, dense_rank() over(order by home_team) as home_team
, dense_rank() over(order by away_team) as away_team
, (case
when home_runs > away_runs then 1
else 0
end) as home_win
from a
left join teams
on a.home_team=teams.team_id
where teams.season_id like '2025'
"""
)
)
Now, to define the simple Bradley-Terry model.
\[ \begin{aligned} y &\sim \text{Bernoulli}(\theta) \\ \theta &= \frac{1}{1 + e^{-(\log(\alpha_\text{home}) - \log(\alpha_\text{away}))}} \\ \alpha_j &\sim \mathcal{HN}(0, 1) \end{aligned} \]
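To make the link concrete, here is a tiny Julia sketch of that win probability for two made-up ability values (hypothetical numbers, not fitted estimates):
# Hypothetical abilities drawn from the half-normal prior.
α_home, α_away = 1.4, 0.9
# Probability that the home team wins, per the formula above.
θ = 1 / (1 + exp(-(log(α_home) - log(α_away))))  # ≈ 0.61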
As can be seen below, the code to fit the model is a lot less verbose with Turing.jl. However, the generated quantities block in Stan allows me to compute the ranks directly rather than having a separate function wrangle the posterior samples.
# The Stan model code.
stan <- "
data {
int<lower=1> N; // Number of games.
int<lower=1> J; // Number of teams.
array[N, 2] int T; // Matrix of team ids.
array[N] int<lower=0, upper=1> y; // Did the home team win?
}
transformed data {
// Create a vector indicating each team.
array[N] int home = to_array_1d(T[,1]);
array[N] int away = to_array_1d(T[,2]);
}
parameters {
vector<lower=0>[J] alpha; // The ability for each team.
}
model {
// Half-normal prior on the ability for each team
// (alpha is constrained to be positive, so normal(0, 1) acts as HN(0, 1)).
alpha ~ normal(0, 1);
// Likelihood: the probability that the home team wins is a logistic
// function of the difference in log-abilities,
// log(alpha[home]) - log(alpha[away]).
y ~ bernoulli_logit(log(alpha[home]) - log(alpha[away]));
}
generated quantities {
// Compute the ranking of each team based on who won.
array[J] int rank; // Ranking of the teams.
{
// Get the ranking of each team in descending order
// of the alpha.
array[J] int rank_index = sort_indices_desc(alpha);
// Assign ranks: the team at rank_index[i]
// receives rank i.
for (i in 1:J) {
rank[rank_index[i]] = i;
}
}
}
"
# Define the model.
@model function turing(x, y, d)
# Prior for the ability parameter
# for each team (length of d).
# HalfNormal.
α ~ filldist(truncated(Normal(0.,1.), 0., Inf), d)
# Likelihood.
for i in 1:length(y)
θ = log(α[x[i,1]]) - log(α[x[i,2]])
y[i] ~ BernoulliLogit(θ)
end
end
# Define the function to rank the teams.
"""
rank
Ranking the ability scores, α, for each object in each posterior draw.
Args:
x (Chains): The result of sampling the turing.jl model.
d (Int8): The number of objects.
Returns:
"""
function rank(x, d)
# Get the dims of the posterior.
iters = size(x,1)
chains = size(x,3)
# Create a matrix of samples for the α parameter.
samples = MCMCChains.group(x, :α).value
# Initialize an array of rankings.
rank_arr = Array{Integer, 3}(undef, iters, length(d), chains)
# Rank the α for each sample.
# This should produce an array with the ranking
# for each team -- in order.
for c in 1:chains
for i in 1:iters
# Get the current sample for iteration i and chain c.
current_sample = samples[i, :, c]
# Rank the options by sorting the α values
# in descending order (higher α means higher rank).
ranked_indices = sortperm(current_sample, rev=true)
# Assign ranks to each option based on sorted order
for rank_idx in 1:length(d)
rank_arr[i, ranked_indices[rank_idx], c] = rank_idx
end
end
end
# Initialize a DataFrame.
df = DataFrame()
# Place the rankings into a DataFrame.
for c in 1:chains
for (key, value) in d
temp_df = DataFrame(
iter = repeat(iters:-1:1, outer=1)
, Rank = rank_arr[:, value, c]
, Team = key
, chain = c
)
# Append the temporary DataFrame
# to the main DataFrame
append!(df, temp_df)
end
end
# Return the result.
return df
end
Now, to fit the models.
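The fitting code mirrors the fit_model functions in the benchmark scripts below. On the Julia side, a minimal sketch looks like this (the cmdstanr side compiles the Stan program and calls mod$sample() analogously):
# Build the Turing.jl model from the game matrix, the outcomes,
# and the number of teams.
mod = turing(
Matrix(select(df, [:home_team, :away_team]))
, df.home_win
, length(ids)
)
# Sample with NUTS across four threads (4,000 draws per chain).
fit = Turing.sample(mod, NUTS(), MCMCThreads(), 4_000, 4)
# Turn the posterior draws of α into per-draw team rankings.
turing_ranks = rank(fit, ids)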
Next, I’ll look at how quickly Julia
and R
can do all of this.
I’ll start by providing a disclaimer: this is not a super formal benchmarking analysis; it is meant to give us a rough sense of the differences in performance between cmdstanr and Turing.jl. With that said, I use the microbenchmark package in R and BenchmarkTools in Julia. Each selected piece of code is executed 100 times and the time is recorded; I then compare the average amount of time the code took to execute.
I am interested in seeing how different pieces of the code compare between R and Julia. While I am primarily interested in the differences between cmdstanr and Turing.jl, I am also interested in how long it takes to retrieve the data and to set things up. Rather than throw all of it into one chunk and benchmark every step together, I break the steps of interest into functions and benchmark each one. Note that these benchmarks include neither the compile times of the Julia code nor the compile times of the Stan code; in each case, I compiled before benchmarking.
I ran these on an M2 MacBook Pro with 16 GB of RAM. I ran the Julia code using the REPL and the R code in the interactive console.
Here is what each script looks like.
# Script for cmdstanr benchmarks.
library(DBI)
library(dplyr)
library(duckdb)
library(cmdstanr)
library(microbenchmark)
# Define the number of times to evaluate the functions.
evals <- 100
# Get the data.
get_data <- function(){
# Connect to the DB.
con <- dbConnect(duckdb(), "~/Desktop/mlb_pred/data/twenty_five.db")
# Get the dataframe.
df <- dbGetQuery(
con
, "
with a as (
select
scores.game_id
, schedule.home_team
, scores.home_runs
, schedule.away_team
, scores.away_runs
from scores
left join schedule
on scores.game_id=schedule.game_id
where schedule.season_id like '2025'
)
select
game_id
, teams.team_abbr
, dense_rank() over(order by home_team) as home_team
, dense_rank() over(order by away_team) as away_team
, (case
when home_runs > away_runs then 1
else 0
end) as home_win
from a
left join teams
on a.home_team=teams.team_id
where teams.season_id like '2025';
"
)
# Close the connection.
dbDisconnect(con, shutdown=TRUE)
return(df)
}
# Benchmark the data retrieval.
microbenchmark(get_data(), times=evals)
# Get the dataframe object.
df <- get_data()
# Get a vector of key-value pairs to map team abbreviations to team ids.
get_ids <- function(df){
ids <- vector()
for (i in seq_len(nrow(df))) {
key <- df[i, "team_abbr"]
value <- df[i, "home_team"]
ids[as.character(key)] <- as.integer(value)
}
return(ids)
}
# Benchmark the creation of the vector.
microbenchmark(get_ids(df), times=evals)
# Get the ids vector.
ids <- get_ids(df)
# The model.
fit_model <- function(df, ids){
# Fit the stan model
mod <- cmdstan_model("./posts/julia_nuts/bt.stan")
dat <- list(
N = nrow(df)
, J = length(ids)
, T = as.matrix(df[, c("home_team", "away_team")])
, y = df$home_win
)
fit <- mod$sample(
data = dat
, seed = 123
, chains = 4
, iter_warmup = 1000
, iter_sampling = 3000
, show_messages = FALSE
)
return(fit)
}
# Benchmark it.
microbenchmark(fit_model(df, ids), times=evals)
# Script for Turing.jl benchmarks
using BenchmarkTools;
using DataFrames;
using DuckDB;
using Turing;
# Define the number of times to evaluate the functions.
const times = 100;
# Get the data.
function get_data()
# Connect to the DB.
con = DBInterface.connect(DuckDB.DB, "~/Desktop/mlb_pred/data/twenty_five.db")
# Pull in the data as a dataframe.
df = DataFrame(
DBInterface.execute(
con
, """
with a as (
select
scores.game_id
, schedule.home_team
, scores.home_runs
, schedule.away_team
, scores.away_runs
from scores
left join schedule
on scores.game_id=schedule.game_id
where schedule.season_id like '2025'
)
select
game_id
, teams.team_abbr
, dense_rank() over(order by home_team) as home_team
, dense_rank() over(order by away_team) as away_team
, (case
when home_runs > away_runs then 1
else 0
end) as home_win
from a
left join teams
on a.home_team=teams.team_id
where teams.season_id like '2025'
"""
)
)
# Close the connection.
DBInterface.close(con)
return df
end;
# Benchmark the data retrieval.
get_data();
run(@benchmarkable get_data() samples=times evals=1)
# Get the dataframe object.
df = get_data();
# Get a dictionary mapping team abbreviations to team ids.
function get_ids(df)
ids = Dict(Pair.(df.team_abbr, df.home_team))
end
# Benchmark the creation of the dictionary.
get_ids(df);
run(@benchmarkable get_ids(df) samples=times evals=1)
# Get the ids dictionary.
ids = get_ids(df);
# The model.
# Define the functions.
@model function turing(x, y, d)
# Prior for the ability parameter
# for each team (length of d).
# HalfNormal.
α ~ filldist(truncated(Normal(0., 1.), 0., Inf), d)
# Likelihood.
for i in 1:length(y)
θ = log(α[x[i,1]]) - log(α[x[i,2]])
y[i] ~ BernoulliLogit(θ)
end
end
# Define the function to rank the teams.
"""
rank
Ranking the ability scores, α, for each object in each posterior draw.
Args:
x (Chains): The result of sampling the turing.jl model.
d (Int8): The number of objects.
Returns:
"""
function rank(x, d)
# Get the dims of the posterior.
iters = size(x,1)
chains = size(x,3)
# Create a matrix of samples for the α parameter.
samples = MCMCChains.group(x, :α).value
# Initialize an array of rankings.
rank_arr = Array{Integer, 3}(undef, iters, length(d), chains)
# Rank the α for each sample.
# This should produce an array with the ranking
# for each team -- in order.
for c in 1:chains
for i in 1:iters
# Get the current sample for iteration i and chain c.
current_sample = samples[i, :, c]
# Rank the options by sorting the α values
# in descending order (higher α means higher rank).
ranked_indices = sortperm(current_sample, rev=true)
# Assign ranks to each option based on sorted order
for rank_idx in 1:length(d)
rank_arr[i, ranked_indices[rank_idx], c] = rank_idx
end
end
end
# Initialize a DataFrame.
df = DataFrame()
# Place the rankings into a DataFrame.
for c in 1:chains
for (key, value) in d
temp_df = DataFrame(
iter = repeat(iters:-1:1, outer=1)
, Rank = rank_arr[:, value, c]
, Team = key
, chain = c
)
# Append the temporary DataFrame
# to the main DataFrame
append!(df, temp_df)
end
end
# Return the result.
return df
end
function fit_model(df, ids)
# Fit the model.
mod = turing(
Matrix(select(df, [:home_team, :away_team]))
, df.home_win
, length(ids)
);
fit = Turing.sample(mod, NUTS(), MCMCThreads(), 4_000, 4);
turing_ranks = rank(fit, ids);
return turing_ranks
end
# Benchmark it.
fit_model(df, ids);
run(@benchmarkable fit_model(df, ids) samples=times evals=1)
The benchmarks for loading the data:
BenchmarkTools.Trial: 100 samples with 1 evaluation per sample.
Range (min … max): 5.200 ms … 6.431 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 5.344 ms ┊ GC (median): 0.00%
Time (mean ± σ): 5.362 ms ± 140.335 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▂ ▅ ▂ ▄
▆▁▁▅█▃▆▅▆▇▆██▆████▁▇▇█▁▁▅▅▅▁▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃▁▁▁▁▁▁▁▁▃ ▃
5.2 ms Histogram: frequency by time 5.78 ms <
Memory estimate: 115.76 KiB, allocs estimate: 2806.
Given the many existing examples comparing R and Julia, it is perhaps not surprising that Julia was quite fast relative to R at loading the data from the same database, using DuckDB and the same SQL code.
The benchmarks for creating the dictionary mapping the team abbreviations and team ids:
BenchmarkTools.Trial: 100 samples with 1 evaluation per sample.
Range (min … max): 25.959 μs … 64.583 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 27.500 μs ┊ GC (median): 0.00%
Time (mean ± σ): 28.060 μs ± 4.041 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▁ █▄▁▁▁
▆██▄▆█████▆▃▄▁▄▁▁▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▃
26 μs Histogram: frequency by time 40.7 μs <
Memory estimate: 55.27 KiB, allocs estimate: 11.
Again, Julia is very fast relative to R. Note that the Julia benchmarks above are reported in microseconds, while the R benchmarks are in milliseconds.
The benchmarks for fitting the models:
BenchmarkTools.Trial: 100 samples with 1 evaluation per sample.
Range (min … max): 14.073 s … 114.021 s ┊ GC (min … max): 7.59% … 4.14%
Time (median): 54.673 s ┊ GC (median): 4.63%
Time (mean ± σ): 53.880 s ± 25.798 s ┊ GC (mean ± σ): 4.93% ± 1.98%
▅ ▂▂█▅ ▅ ▂ ▂ ▅ ▂▅ ▅ █ ▂
█▅▅████▅██▁▁▅███▅▅█▁▅▁▅▁▁███▁█▅▅▅▁█▁██▅█▁██▅▅█▅▅█▁█▁████▅█ ▅
14.1 s Histogram: frequency by time 92.9 s <
Memory estimate: 10.93 GiB, allocs estimate: 39417142.
Here is what is quite interesting: cmdstanr is VERY fast relative to Turing.jl. I am a huge fan of Stan. Stan.jl is a wonderful interface to Stan in Julia; however, it appears that much of the work on it has fallen to a small group of people who, very understandably, have needed to scale back, archiving some of its repositories and slowing down their commits. The latest commit to Stan.jl was 8 months ago as of writing this blog post. Hopefully more of us Stan and Julia fans can pitch in!
Before I close out, another word of caution is important here. In addition to the disclaimers earlier, note that the relative performance of cmdstanr and Turing.jl shown here is limited to this simple Bradley-Terry model, which is a fairly niche scenario. The generalizability of these results could be quite poor, so take this all with a grain of salt. That said, others who have compared Turing.jl and Stan have also documented the power of Stan. (Thank you, Stephen!)
But hopefully this is fun and interesting! It’s been fun and interesting for me to work on!