Can someone informed compare/contrast this with other tools at the intersection of probabilistic programming and deep learning? What are the relative strengths and weaknesses vs. Edward or Pyro?

Turing.jl is in an interesting spot because it is essentially a DSL-free probabilistic programming language. It technically has a DSL of sorts in the `@model` macro, but anything AD-compatible can be used inside that macro, and since Julia's AD tools work on code written in plain Julia, you can throw code from other Julia packages into Turing and expect it to work with Hamiltonian Monte Carlo and all of that. So things like DifferentialEquations.jl ODEs/SDEs/DAEs/DDEs/etc. work quite well, along with other "weird things for a probabilistic programming language to support" like nonlinear solving (via NLsolve.jl) or optimization (via Optim.jl; by that I mean doing Bayesian inference where a value is defined as the result of an optimization). And if you are using derivative-free inference methods, like particle sampling or variants of Metropolis-Hastings, you can throw in pretty much any existing Julia code as a nonlinear function and do inference around it.
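
To make that concrete, here's a minimal sketch in the spirit of the Turing/DiffEq tutorials (the model, priors, and `data` are illustrative, not from any particular benchmark): a DifferentialEquations.jl ODE solve dropped straight into a Turing `@model`, with HMC differentiating through the solver.

```julia
using Turing, DifferentialEquations

# Standard Lotka-Volterra predator-prey dynamics.
function lotka_volterra!(du, u, p, t)
    a, b, c, d = p
    du[1] =  a * u[1] - b * u[1] * u[2]
    du[2] = -c * u[2] + d * u[1] * u[2]
end

prob = ODEProblem(lotka_volterra!, [1.0, 1.0], (0.0, 10.0), [1.5, 1.0, 3.0, 1.0])

@model function fit_lv(data, prob)
    σ ~ InverseGamma(2, 3)                  # observation noise
    a ~ truncated(Normal(1.5, 0.5), 0, 3)
    b ~ truncated(Normal(1.0, 0.5), 0, 2)
    c ~ truncated(Normal(3.0, 0.5), 0, 4)
    d ~ truncated(Normal(1.0, 0.5), 0, 2)

    # Re-solve the ODE at the sampled parameters; AD flows through solve().
    predicted = solve(prob, Tsit5(); p = [a, b, c, d], saveat = 0.1)
    for i in 1:length(predicted)
        data[:, i] ~ MvNormal(predicted[i], σ)  # data: 2×N noisy observations
    end
end

chain = sample(fit_lv(data, prob), NUTS(0.65), 1000)
```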

So while it's in some sense similar to PyMC3 or Stan, there's a huge difference in the effective functionality you get from supporting a language-wide infrastructure versus the more traditional method of adding and documenting features one by one. PyMC3 ran a Google Summer of Code project to get some ODE support (https://docs.pymc.io/notebooks/ODE_API_introduction.html), and Stan has 2 built-in methods you're allowed to use (https://mc-stan.org/docs/2_19/stan-users-guide/ode-solver-ch...), while with Julia you get all of DifferentialEquations.jl just because it exists (https://docs.sciml.ai/latest/). This means Turing.jl doesn't have to document most of its features: they exist anyway, through composability.

That's quite different from a "top down" approach to library support. It also explains why Turing has been able to develop so fast: its developer community isn't just "the people who work on Turing", it's pretty much the whole Julia ecosystem. Its distributions are defined by Distributions.jl (https://github.com/JuliaStats/Distributions.jl), its parallelism comes from Julia's base parallelism plus everything around it like CuArrays.jl and KernelAbstractions.jl (https://github.com/JuliaGPU/KernelAbstractions.jl), derivatives come from four AD libraries, ODEs from DifferentialEquations.jl, and the list keeps going.
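
For example, because the distributions are just Distributions.jl, anything that package exports works as a prior or likelihood with no Turing-specific wrapper. A hedged sketch (the model itself is made up):

```julia
using Turing  # re-exports the Distributions.jl distributions

@model function counts_model(y)
    λ ~ Gamma(2, 2)                           # prior straight from Distributions.jl
    r ~ truncated(Normal(10, 5), 0, Inf)
    for i in eachindex(y)
        y[i] ~ NegativeBinomial(r, r / (r + λ))  # discrete likelihood, also Distributions.jl
    end
end
```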

So bringing it back to deep learning, Turing currently has 4 modes for automatic differentiation (https://turing.ml/dev/docs/using-turing/autodiff), and thus supports any library that's compatible with those. It turns out that Flux.jl is compatible with them, so therefore Turing.jl can do Bayesian deep learning. In that sense it's like Edward or Pyro, but supporting "anything that AD's with Julia AD packages" (which soon will allow multi-AD overloads via ChainRules.jl) instead of "anything on TensorFlow graphs" or "anything compatible with PyTorch".
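
A sketch of what that looks like, in the spirit of the Bayesian neural network tutorial in Turing's docs (`xs`/`ys` here are hypothetical training data): a small Flux network is flattened into a parameter vector with `Flux.destructure`, given a Gaussian prior, and rebuilt inside the model.

```julia
using Turing, Flux

nn = Chain(Dense(2, 3, tanh), Dense(3, 1, σ))  # tiny binary classifier
θ₀, rebuild = Flux.destructure(nn)             # flat parameter vector + reconstructor

@model function bayes_nn(xs, ys)
    θ ~ filldist(Normal(0, 1), length(θ₀))     # isotropic Gaussian prior on all weights
    net = rebuild(θ)
    p = net(xs)                                # xs is 2×N, p is 1×N of probabilities
    for i in eachindex(ys)
        ys[i] ~ Bernoulli(p[i])
    end
end

chain = sample(bayes_nn(xs, ys), NUTS(0.65), 500)
```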

As for performance and robustness, I mentioned in a SciML ecosystem release today that our benchmarks pretty clearly show Turing.jl as more robust than Stan while achieving roughly a 3x-5x speedup in ODE parameter estimation (https://sciml.ai/2020/05/09/ModelDiscovery.html). However, that result leans on Turing.jl's composability with packages, which gives it top-notch ODE support (I want to work with the Stan developers so we can use our differential equation library with their samplers, to better isolate the differences and hopefully improve both PPLs, but for now we have what we have). If you isolate it down to just "Turing.jl itself", it has wins and losses against Stan (https://github.com/TuringLang/Turing.jl/wiki). That said, there are benchmarks indicating that the ReverseDiff AD backend gives about two orders of magnitude better performance in many situations (https://github.com/TuringLang/Turing.jl/issues/1140; note that ThArrays is benchmarking PyTorch AD there), which would probably tip the scales in Turing's favor. As for benchmarking against Pyro or Edward, it would probably just come down to benchmarking the AD implementations.
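
For reference, switching the AD backend is a one-liner (API as of Turing around v0.13; check the autodiff docs linked above for the current form):

```julia
using Turing
Turing.setadbackend(:reversediff)  # alternatives: :forwarddiff, :tracker, :zygote
```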