Previously: Fun With Pyro

In a typical clinical trial, the difference between the treatment group and the control group is measured with p-values. If the p-value for the “primary endpoint” is less than .05, the effect is said to be “statistically significant”; if not, then no effect has been found. There are lots of issues with this, but one of them is conflating not finding any evidence with finding negative evidence (evidence against the theory being tested). A small, poorly-controlled, or badly measured study can’t tell us anything either way, but a large, well-done study can provide strong negative evidence – “no evidence for X” is not the same thing as “evidence for no X”. With Pyro, this can be measured in a clean. quantitative way.

To illustrate, we can run simulations using the last post’s model of the Pikachu virus, only this time, the simulated treatment and control data will all be randomly drawn from the same distribution. Since there is no difference in the data, on average, the probability we assign to “no effect” should increase from the baseline 80% after a trial is run, and it should increase more with larger trials that provide stronger negative evidence. However, this is only an average. Sometimes, by luck, the data drawn will make it look like there is an effect, and p(no effect) will actually go down. To smooth this out, we can simulate a large number of trials and average them, but running lots and lots of simulations would take a long time. Luckily, PyTorch’s batch_shape feature and the efficiency of modern SIMD architectures makes it (relatively) easy to run them all in parallel:

def mixture_n(which_world, dists):
  Generalizes MaskedMixture to n > 2 components.
  assert dists
  if len(dists) == 1: return dists[0]
  assert which_world.max() < len(dists)

  return reduce(lambda a, b: dist.MaskedMixture(
    which_world == b[0], a[1] if isinstance(a, tuple) else a, b[1]),

def model(patients):
  patients = patients.clone()
  scaling = common.make_tensor(100.0)
  patients /= scaling
  n_trials = patients.size()[2]

  p_effect_sizes = common.make_tensor(PRIORS['p_effect_sizes'], n_trials)
  mode_effect_size = common.make_tensor(PRIORS['mode_effect_size'], n_trials)
  effect_k = common.make_tensor(PRIORS['effect_k'], n_trials)
  small_effect_k = common.make_tensor(PRIORS['small_effect_k'], n_trials)
  scale = common.make_tensor(PRIORS['scale']) / scaling

  distributions = make_distributions(mode_effect_size, effect_k, small_effect_k)
  with pyro.plate("world"):
    which_world = pyro.sample("which_world", dist.Categorical(p_effect_sizes))

    effect_size_raw = pyro.sample(
      "effect_size", common.mixture_n(which_world, distributions))
  effect_size = effect_size_raw * scale

  mode0 = common.make_tensor(0.5, n_trials)
  k0 = common.make_tensor(8.1269, n_trials)

  # beta reparameterized
  original_dist = common.beta(mode0, k0)
  new_dist = common.beta(mode0 - effect_size, k0)

  with pyro.plate("trial"):
    with pyro.plate("trial2"):
      pyro.sample("severity_trial", new_dist, obs=patients[1])

  with pyro.plate("test1"):
    with pyro.plate("test2"):
      pyro.sample("severity_test", original_dist, obs=patients[0])

We can then graph out the strength of negative evidence, averaged over a thousand simulated trials, for a range of different possible sample sizes:

(Note that the prior is a bit under 83% here, as this graph includes all probability density with effect size < 1. This is an average of probabilities rather than an average of log-odds, which might have been more representative.)

This is also a good scalability test; my desktop got up to one million data points before the simulation even started to slow down. At that point, I switched to CUDA, which got up to ten million data points before fully stressing the GPU.