I’ve been using Julia more and more in my non-work projects. Years ago I tried out Julia but found the syntax and the concepts pretty foreign. Over time, exposure to more computer science concepts has made it seem increasingly approachable.
Julia certainly feels much less high-level than Python and R. An obvious reason is that, though Julia is a dynamically-typed language, it does not use an interpreter in the same way that Python and R do. Julia does not go as far as statically-typed, low-level languages such as C++, where you write the code, save the file, and then compile it to an executable containing the machine code that ultimately does the work. Julia instead leverages a Just-In-Time (JIT) compiler that generates and caches the machine code as you go along. Julia also allows you to define types rather than simply provide type hints (like in Python). This, along with the LLVM-based JIT compiler and language features such as multiple dispatch, is the reason that many in the scientific computing and data science space have made it their language of choice when they need to maximize performance (a toy sketch of these features follows below). For more details about how Julia works and why it is so performant, I highly recommend this book.
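As a quick illustration (made-up names, not code from this project), here is a hedged sketch of what declared types and multiple dispatch look like in practice:
# A user-defined type with concretely typed fields.
struct Game
    home_runs::Int
    away_runs::Int
end
# Two methods of the same function; Julia selects one at call time
# based on the argument types (multiple dispatch), and the JIT
# compiles a specialized machine-code version of each method.
winner(g::Game) = g.home_runs > g.away_runs ? :home : :away
winner(home_runs::Int, away_runs::Int) = winner(Game(home_runs, away_runs))
winner(Game(5, 3))  # :home
winner(2, 7)        # :away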
Aside from needing to understand some computer science concepts, another barrier was that I was more comfortable writing my Bayesian statistical models in Stan using either cmdstanr or cmdstanpy. These interfaces allowed me to take advantage of the power of C++, along with Stan’s No-U-Turn Sampler (NUTS) for efficient and accurate MCMC inference, while letting me keep using languages I was comfortable with for everything outside of fitting the model (e.g., data cleaning, visualizing the results). As my curiosity around Julia grew, I was still determined to stick with Stan because of my comfort with it, and I didn’t feel that there was as large a community (and therefore support) for Stan (and the NUTS sampler) in Julia. So, I stayed away.
In the past few months, I have been working on some academic projects in my free time and saw that I could use the NUTS algorithm via Turing.jl. I haven’t checked how long this algorithm has been available in Turing.jl, and honestly I don’t even remember looking for Turing.jl when I was flirting with the language in the past. I had just been searching for “Stan in Julia” and had not seen as much as I was accustomed to in R and Python circles. It was definitely my fault for not having looked into it more deeply. But I don’t think I am alone in not seeing Julia as competitive with R or even Python for Bayesian modeling.
A recent side project of mine has been to use Bradley-Terry models to power rank MLB teams. MLB teams are ranked by how many wins and losses they have: the more wins relative to losses a team has, the higher it sits in the standings. The Bradley-Terry model instead estimates the latent ability of each team. It does this by considering each team and its opponents; to understand how difficult those opponents are, you look at the opponents’ opponents and how they have fared against them. Two teams with the same number of wins and losses may be equal in the standings, but if one of them has beaten more teams with better records than the other, then the former team would have a higher latent ability than the latter.
When I first played around with these models, I did it in Python with cmdstanpy. Recently, though, I thought I’d try it out using Turing.jl. One reason was curiosity about how it would perform against cmdstanpy. The other was that I’ve been toying with the idea of a larger project where I write a web app that does some Bayesian modeling with a Julia backend, and I figured I should also try fitting the models with Turing.jl rather than use a Python backend or straight-up cmdstan.
So, I rewrote the models that I had in cmdstanpy and implemented them in Turing.jl. I posted a picture of my terminal on Bluesky, and Elliott Morris was amazed, as I had been, that Turing.jl had NUTS available as the MCMC sampler. He then asked for a blog post comparing the two. I am but a humble servant.
First, I am going to show the contents of the scripts used to load the data from my local database, define the model, fit the model, and extract the ranks of the teams. This will help give a sense of the differences in syntax.
Let’s load the dependencies.
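On the Julia side, the dependencies look roughly like this (a sketch based on the benchmark script further down; the R side loads DBI, dplyr, duckdb, and cmdstanr):
# Packages used for the Julia version of this post
# (DataFrames and DuckDB for the data, Turing and MCMCChains
# for fitting the model and wrangling the posterior draws).
using DataFrames
using DuckDB
using Turing
using MCMCChains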
Then I am going to connect to my local database (from the project that this was inspired from) and pull in the data.
# Get the data.
# - Connect to the DB.
con <- dbConnect(duckdb(), "~/Desktop/mlb_pred/data/twenty_five.db")
# - Get the dataframe.
df <- dbGetQuery(
con
, "
with a as (
select
scores.game_id
, schedule.home_team
, scores.home_runs
, schedule.away_team
, scores.away_runs
from scores
left join schedule
on scores.game_id=schedule.game_id
where schedule.season_id like '2025'
)
select
game_id
, teams.team_abbr
, dense_rank() over(order by home_team) as home_team
, dense_rank() over(order by away_team) as away_team
, (case
when home_runs > away_runs then 1
else 0
end) as home_win
from a
left join teams
on a.home_team=teams.team_id
where teams.season_id like '2025';
"
)
# - Close the connection.
dbDisconnect(con, shutdown=TRUE)
# Connect to the DB.
con = DBInterface.connect(DuckDB.DB, "~/Desktop/mlb_pred/data/twenty_five.db");
# Pull in the data as a dataframe.
df = DataFrame(
DBInterface.execute(
con
, """
with a as (
select
scores.game_id
, schedule.home_team
, scores.home_runs
, schedule.away_team
, scores.away_runs
from scores
left join schedule
on scores.game_id=schedule.game_id
where schedule.season_id like '2025'
)
select
game_id
, teams.team_abbr
, dense_rank() over(order by home_team) as home_team
, dense_rank() over(order by away_team) as away_team
, (case
when home_runs > away_runs then 1
else 0
end) as home_win
from a
left join teams
on a.home_team=teams.team_id
where teams.season_id like '2025'
"""
)
)
Now, to define the simple Bradley-Terry model.
\[ \begin{aligned} y &\sim \text{Bernoulli}(\theta) \\ \theta &= \frac{1}{1 + e^{-(\log(\alpha_\text{home}) - \log(\alpha_\text{away}))}} \\ \alpha_j &\sim \mathcal{HN}(0, 1) \end{aligned} \]
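To make the link concrete, here is a tiny Julia sketch of that win probability for two made-up ability values (hypothetical numbers, not fitted estimates):
# Hypothetical abilities drawn from the half-normal prior.
α_home, α_away = 1.4, 0.9
# Probability that the home team wins, per the formula above.
θ = 1 / (1 + exp(-(log(α_home) - log(α_away))))  # ≈ 0.61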
As can be seen below, the code to fit the model is a lot less verbose with Turing.jl. However, the generated quantities block in Stan allows me to compute the ranks directly rather than having a separate function wrangle the posterior samples.
# The Stan model code.
stan <- "
data {
int<lower=1> N; // Number of games.
int<lower=1> J; // Number of teams.
array[N, 2] int T; // Matrix of team ids.
array[N] int<lower=0, upper=1> y; // Did the home team win?
}
transformed data {
// Create a vector indicating each team.
array[N] int home = to_array_1d(T[,1]);
array[N] int away = to_array_1d(T[,2]);
}
parameters {
vector<lower=0>[J] alpha; // The ability for each team.
}
model {
// Half-normal prior on the ability for each team
// (alpha is constrained to be positive, so normal(0, 1) acts as HN(0, 1)).
alpha ~ normal(0, 1);
// Likelihood: the probability that the home team wins is a logistic
// function of the difference in log-abilities,
// log(alpha[home]) - log(alpha[away]).
y ~ bernoulli_logit(log(alpha[home]) - log(alpha[away]));
}
generated quantities {
// Compute the ranking of each team based on who won.
array[J] int rank; // Ranking of the teams.
{
// Get the ranking of each team in descending order
// of the alpha.
array[J] int rank_index = sort_indices_desc(alpha);
// Assign ranks: the team at rank_index[i]
// receives rank i.
for (i in 1:J) {
rank[rank_index[i]] = i;
}
}
}
"
# Define the model.
@model function turing(x, y, d)
# Prior for the ability parameter
# for each team (length of d).
# HalfNormal.
α ~ filldist(truncated(Normal(0.,1.), 0., Inf), d)
# Likelihood.
for i in 1:length(y)
θ = log(α[x[i,1]]) - log(α[x[i,2]])
y[i] ~ BernoulliLogit(θ)
end
end
# Define the function to rank the teams.
"""
rank
Ranking the ability scores, α, for each object in each posterior draw.
Args:
x (Chains): The result of sampling the turing.jl model.
d (Int8): The number of objects.
Returns:
"""
function rank(x, d)
# Get the dims of the posterior.
iters = size(x,1)
chains = size(x,3)
# Create a matrix of samples for the α parameter.
samples = MCMCChains.group(x, :α).value
# Initialize an array of rankings.
rank_arr = Array{Integer, 3}(undef, iters, length(d), chains)
# Rank the α for each sample.
# This should produce an array with the ranking
# for each team -- in order.
for c in 1:chains
for i in 1:iters
# Get the current sample for iteration i and chain c.
current_sample = samples[i, :, c]
# Rank the options by sorting the α values
# in descending order (higher α means higher rank).
ranked_indices = sortperm(current_sample, rev=true)
# Assign ranks to each option based on sorted order
for rank_idx in 1:length(d)
rank_arr[i, ranked_indices[rank_idx], c] = rank_idx
end
end
end
# Initialize a DataFrame.
df = DataFrame()
# Place the rankings into a DataFrame.
for c in 1:chains
for (key, value) in d
temp_df = DataFrame(
iter = repeat(iters:-1:1, outer=1)
, Rank = rank_arr[:, value, c]
, Team = key
, chain = c
)
# Append the temporary DataFrame
# to the main DataFrame
append!(df, temp_df)
end
end
# Return the result.
return df
end
Now, to fit the models.
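The fitting code mirrors the fit_model functions in the benchmark scripts below. On the Julia side, a minimal sketch looks like this (the cmdstanr side compiles the Stan program and calls mod$sample() analogously):
# Build the Turing.jl model from the game matrix, the outcomes,
# and the number of teams.
mod = turing(
Matrix(select(df, [:home_team, :away_team]))
, df.home_win
, length(ids)
)
# Sample with NUTS across four threads (4,000 draws per chain).
fit = Turing.sample(mod, NUTS(), MCMCThreads(), 4_000, 4)
# Turn the posterior draws of α into per-draw team rankings.
turing_ranks = rank(fit, ids)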
Next, I’ll look at how quickly Julia
and R
can do all of this.
I’ll start by providing a disclaimer: this is not a super formal benchmarking analysis; it is meant to give us a rough sense of the differences in performance between cmdstanr and Turing.jl. With that said, I use the microbenchmark package in R and BenchmarkTools in Julia. Each selected piece of code is executed 100 times and the time is recorded; I then compare the average amount of time the code took to execute.
I am interested in seeing how different pieces of the code compare between R and Julia. While I am primarily interested in the differences between cmdstanr and Turing.jl, I am also interested in how long it takes to retrieve the data and to set things up. Rather than throw all of it into one chunk and benchmark every step together, I break the steps of interest into functions and benchmark each one. Note that these benchmarks include neither the compile times of the Julia code nor the compile times of the Stan code; in each case, I compiled before benchmarking.
I ran these on an M2 MacBook Pro with 16 GB of RAM. I ran the Julia code using the REPL and the R code in the interactive console.
Here is what each script looks like.
# Script for cmdstanr benchmarks.
library(DBI)
library(dplyr)
library(duckdb)
library(cmdstanr)
library(microbenchmark)
# Define the number of times to evaluate the functions.
evals <- 100
# Get the data.
get_data <- function(){
# Connect to the DB.
con <- dbConnect(duckdb(), "~/Desktop/mlb_pred/data/twenty_five.db")
# Get the dataframe.
df <- dbGetQuery(
con
, "
with a as (
select
scores.game_id
, schedule.home_team
, scores.home_runs
, schedule.away_team
, scores.away_runs
from scores
left join schedule
on scores.game_id=schedule.game_id
where schedule.season_id like '2025'
)
select
game_id
, teams.team_abbr
, dense_rank() over(order by home_team) as home_team
, dense_rank() over(order by away_team) as away_team
, (case
when home_runs > away_runs then 1
else 0
end) as home_win
from a
left join teams
on a.home_team=teams.team_id
where teams.season_id like '2025';
"
)
# Close the connection.
dbDisconnect(con, shutdown=TRUE)
return(df)
}
# Benchmark the data retrieval.
microbenchmark(get_data(), times=evals)
# Get the dataframe object.
df <- get_data()
# Get a vector of key-value pairs to map team abbreviations to team ids.
get_ids <- function(df){
ids <- vector()
for (i in seq_len(nrow(df))) {
key <- df[i, "team_abbr"]
value <- df[i, "home_team"]
ids[as.character(key)] <- as.integer(value)
}
return(ids)
}
# Benchmark the creation of the vector.
microbenchmark(get_ids(df), times=evals)
# Get the ids vector.
ids <- get_ids(df)
# The model.
fit_model <- function(df, ids){
# Fit the stan model
mod <- cmdstan_model("./posts/julia_nuts/bt.stan")
dat <- list(
N = nrow(df)
, J = length(ids)
, T = as.matrix(df[, c("home_team", "away_team")])
, y = df$home_win
)
fit <- mod$sample(
data = dat
, seed = 123
, chains = 4
, iter_warmup = 1000
, iter_sampling = 3000
, show_messages = FALSE
)
return(fit)
}
# Benchmark it.
microbenchmark(fit_model(df, ids), times=evals)
# Script for Turing.jl benchmarks
using BenchmarkTools;
using DataFrames;
using DuckDB;
using Turing;
# Define the number of times to evaluate the functions.
const times = 100;
# Get the data.
function get_data()
# Connect to the DB.
con = DBInterface.connect(DuckDB.DB, "~/Desktop/mlb_pred/data/twenty_five.db")
# Pull in the data as a dataframe.
df = DataFrame(
DBInterface.execute(
con
, """
with a as (
select
scores.game_id
, schedule.home_team
, scores.home_runs
, schedule.away_team
, scores.away_runs
from scores
left join schedule
on scores.game_id=schedule.game_id
where schedule.season_id like '2025'
)
select
game_id
, teams.team_abbr
, dense_rank() over(order by home_team) as home_team
, dense_rank() over(order by away_team) as away_team
, (case
when home_runs > away_runs then 1
else 0
end) as home_win
from a
left join teams
on a.home_team=teams.team_id
where teams.season_id like '2025'
"""
)
)
# Close the connection.
DBInterface.close(con)
return df
end;
# Benchmark the data retrieval.
get_data();
run(@benchmarkable get_data() samples=times evals=1)
# Get the dataframe object.
df = get_data();
# Get a dictionary mapping team abbreviations to team ids.
function get_ids(df)
ids = Dict(Pair.(df.team_abbr, df.home_team))
end
# Benchmark the creation of the dictionary.
get_ids(df);
run(@benchmarkable get_ids(df) samples=times evals=1)
# Get the ids dictionary.
ids = get_ids(df);
# The model.
# Define the functions.
@model function turing(x, y, d)
# Prior for the ability parameter
# for each team (length of d).
# HalfNormal.
α ~ filldist(truncated(Normal(0., 1.), 0., Inf), d)
# Likelihood.
for i in 1:length(y)
θ = log(α[x[i,1]]) - log(α[x[i,2]])
y[i] ~ BernoulliLogit(θ)
end
end
# Define the function to rank the teams.
"""
rank
Ranking the ability scores, α, for each object in each posterior draw.
Args:
x (Chains): The result of sampling the turing.jl model.
d (Int8): The number of objects.
Returns:
"""
function rank(x, d)
# Get the dims of the posterior.
iters = size(x,1)
chains = size(x,3)
# Create a matrix of samples for the α parameter.
samples = MCMCChains.group(x, :α).value
# Initialize an array of rankings.
rank_arr = Array{Integer, 3}(undef, iters, length(d), chains)
# Rank the α for each sample.
# This should produce an array with the ranking
# for each team -- in order.
for c in 1:chains
for i in 1:iters
# Get the current sample for iteration i and chain c.
current_sample = samples[i, :, c]
# Rank the options by sorting the α values
# in descending order (higher α means higher rank).
ranked_indices = sortperm(current_sample, rev=true)
# Assign ranks to each option based on sorted order
for rank_idx in 1:length(d)
rank_arr[i, ranked_indices[rank_idx], c] = rank_idx
end
end
end
# Initialize a DataFrame.
df = DataFrame()
# Place the rankings into a DataFrame.
for c in 1:chains
for (key, value) in d
temp_df = DataFrame(
iter = repeat(iters:-1:1, outer=1)
, Rank = rank_arr[:, value, c]
, Team = key
, chain = c
)
# Append the temporary DataFrame
# to the main DataFrame
append!(df, temp_df)
end
end
# Return the result.
return df
end
function fit_model(df, ids)
# Fit the model.
mod = turing(
Matrix(select(df, [:home_team, :away_team]))
, df.home_win
, length(ids)
);
fit = Turing.sample(mod, NUTS(), MCMCThreads(), 4_000, 4);
turing_ranks = rank(fit, ids);
return turing_ranks
end
# Benchmark it.
fit_model(df, ids);
run(@benchmarkable fit_model(df, ids) samples=times evals=1)
The benchmarks for loading the data:
BenchmarkTools.Trial: 100 samples with 1 evaluation per sample.
Range (min … max): 5.200 ms … 6.431 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 5.344 ms ┊ GC (median): 0.00%
Time (mean ± σ): 5.362 ms ± 140.335 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▂ ▅ ▂ ▄
▆▁▁▅█▃▆▅▆▇▆██▆████▁▇▇█▁▁▅▅▅▁▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃▁▁▁▁▁▁▁▁▃ ▃
5.2 ms Histogram: frequency by time 5.78 ms <
Memory estimate: 115.76 KiB, allocs estimate: 2806.
Given the many existing examples comparing R and Julia, it is perhaps not surprising that Julia was quite fast relative to R at loading the data from the same database, using DuckDB and the same SQL code.
The benchmarks for creating the dictionary mapping the team abbreviations and team ids:
BenchmarkTools.Trial: 100 samples with 1 evaluation per sample.
Range (min … max): 25.959 μs … 64.583 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 27.500 μs ┊ GC (median): 0.00%
Time (mean ± σ): 28.060 μs ± 4.041 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▁ █▄▁▁▁
▆██▄▆█████▆▃▄▁▄▁▁▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▃
26 μs Histogram: frequency by time 40.7 μs <
Memory estimate: 55.27 KiB, allocs estimate: 11.
Again, Julia is very fast relative to R. Note that the Julia benchmarks above are reported in microseconds, while the R benchmarks are in milliseconds.
The benchmarks for fitting the models:
BenchmarkTools.Trial: 100 samples with 1 evaluation per sample.
Range (min … max): 14.073 s … 114.021 s ┊ GC (min … max): 7.59% … 4.14%
Time (median): 54.673 s ┊ GC (median): 4.63%
Time (mean ± σ): 53.880 s ± 25.798 s ┊ GC (mean ± σ): 4.93% ± 1.98%
▅ ▂▂█▅ ▅ ▂ ▂ ▅ ▂▅ ▅ █ ▂
█▅▅████▅██▁▁▅███▅▅█▁▅▁▅▁▁███▁█▅▅▅▁█▁██▅█▁██▅▅█▅▅█▁█▁████▅█ ▅
14.1 s Histogram: frequency by time 92.9 s <
Memory estimate: 10.93 GiB, allocs estimate: 39417142.
Here is what is quite interesting: cmdstanr is VERY fast relative to Turing.jl. I am a huge fan of Stan. Stan.jl is a wonderful interface to Stan in Julia; however, it appears that much of the work on it has fallen to a small group of people who, very understandably, have needed to scale back, archiving some of its repositories and slowing down their commits. The latest commit to Stan.jl was 8 months ago as of writing this blog post. Hopefully more of us Stan and Julia fans can pitch in!
Before I close out, another word of caution is important here. In addition to the disclaimers earlier, note that the relative performance of cmdstanr and Turing.jl shown here is limited to this simple Bradley-Terry model, which is a fairly niche scenario. The generalizability of these results could be quite poor, so take this all with a grain of salt. That said, others who have compared Turing.jl and Stan have also documented the power of Stan. (Thank you, Stephen!)
But hopefully this is fun and interesting! It’s been fun and interesting for me to work on!