I'm only a beginner in Julia and not an AD expert, but I went through the exercise of porting this to Python and found it very enlightening.
But while Julia targets Python's niche of fast and concise (without compromising speed or power), it does not target the slower-but-more-correct niche (though there is a culture of testing, which is quite important for math-oriented problems, since the type system will not catch the most subtle and troublesome ones). There is space for one language for exploratory/research work that can be quickly deployed in a fast iterative cycle, and another for the next Spark/Flink or for critical production areas that need the extra effort (like self-driving cars), which could be Rust (or Scala, or Haskell, or Swift, or staying with C++/Fortran).
That's also an approach at which Julia excels because of multiple dispatch, which you can see explained in [1].
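For a rough picture of what that means, here is a minimal multiple-dispatch sketch (the Dual type and method below are illustrative, not taken from [1]): the method that runs is chosen from the types of all arguments, so generic code written for plain numbers composes with a new type it has never seen.

    # Toy dual number for forward-mode AD; purely illustrative.
    struct Dual
        val::Float64
        der::Float64
    end

    # Extend multiplication for the new type; existing generic code keeps calling `*`.
    Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.val * b.der + a.der * b.val)

    square(x) = x * x                 # written with no knowledge of Dual

    println(square(3.0))              # dispatches to Float64 * Float64 -> 9.0
    println(square(Dual(3.0, 1.0)))   # dispatches to Dual * Dual -> Dual(9.0, 6.0)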
In that case you effectively have two separate languages: the language used to generate the graph, and the graph itself. This approach instead applies the transformation directly to the Julia IR to generate the gradient code, as if you had written it by hand in Julia, side by side with code that is completely unaware of the transformation (which is what makes it possible to differentiate libraries built before this approach even existed). So the end product is something similar to a TensorFlow graph (it has all control flow already embedded and can be pre-optimized by a compiler), but it is even easier to write than TensorFlow eager (which is also the intent of Swift for TensorFlow).
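For a concrete flavor, here is a minimal sketch assuming the Zygote.jl package (its gradient function is the entry point used below): ordinary Julia code with a loop and a branch, written with no knowledge of AD, has its derivative generated by transforming the Julia IR, so there is no separate graph-building language.

    using Zygote

    # Plain Julia function with control flow; nothing AD-specific in it.
    function f(x)
        total = zero(x)
        for i in 1:3
            total += iseven(i) ? x^2 : sin(x)
        end
        return total
    end

    # Zygote produces the derivative as if the gradient code had been written by hand.
    df = gradient(f, 1.5)[1]     # f(x) = 2sin(x) + x^2, so df ≈ 2cos(1.5) + 2*1.5
    println(df)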
The short answer is that CAS systems work with a "programming language" that doesn't have these features and is therefore a bit too limited for the kinds of models we're interested in.