What does Hacker News think of serving?
A flexible, high-performance serving system for machine learning models
This project gained popularity due to the high demand for running large models with 1B+ parameters, like `llama`. Python dominates the interface and training ecosystem, but prior to llama.cpp, non-ML professionals showed little interest in a fast C++ inference library. While existing solutions like tensorflow-serving [1], also written in C++, were sufficiently fast and had GPU support, llama.cpp took the initiative to optimize for CPU and trim unnecessary code, essentially code-golfing and sacrificing some algorithmic correctness (notably through aggressive weight quantization) for performance, a trade-off that isn't favored by "ML research".
NOTE: In my opinion, a true pioneer was DarkNet, which implemented the YOLO model series and significantly outperformed its contemporaries [2]. Basically the same trick as llama.cpp.
[1] https://github.com/tensorflow/serving
[2] https://github.com/pjreddie/darknet
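To make that correctness-for-speed trade concrete, here is a minimal sketch in plain numpy of the kind of blockwise weight quantization llama.cpp leans on. This is illustrative only, not llama.cpp's actual code: weights are stored at ~1 byte each with one scale per block, and dequantization introduces a small but nonzero error.

```python
# Minimal sketch of blockwise int8 weight quantization (illustrative,
# not llama.cpp's real code): trade a little accuracy for memory/speed.
import numpy as np

def quantize_q8(w: np.ndarray, block: int = 32):
    """Quantize float32 weights to int8, one scale per block of `block` values."""
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.round(w / np.where(scale == 0, 1.0, scale)).astype(np.int8)
    return q, scale

def dequantize_q8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_q8(w)
w_hat = dequantize_q8(q, s)
print("max abs error:", np.abs(w - w_hat).max())  # small, but nonzero
print("storage: 4 bytes/weight -> ~1 byte/weight")
```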
Meanwhile, here is a list of open source ML deployment packages:
https://github.com/oracle/graphpipe
https://github.com/eliorc/denzel
https://github.com/tensorflow/serving
https://github.com/ucbrise/clipper
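All of these expose models behind an RPC or HTTP endpoint in roughly the same way. As one concrete example, here is a minimal client sketch against TensorFlow Serving's documented REST predict API; the model name `my_model`, the host/port, and the input shape are assumptions for illustration.

```python
# Minimal client sketch for TensorFlow Serving's REST predict API.
# Assumes a server was started with something like:
#   tensorflow_model_server --rest_api_port=8501 \
#       --model_name=my_model --model_base_path=/models/my_model
import json
import requests  # third-party: pip install requests

def predict(instances):
    # The /v1/models/<name>:predict route and the {"instances": [...]}
    # request body are part of TF Serving's documented REST API.
    url = "http://localhost:8501/v1/models/my_model:predict"
    resp = requests.post(url, data=json.dumps({"instances": instances}))
    resp.raise_for_status()
    return resp.json()["predictions"]

if __name__ == "__main__":
    print(predict([[1.0, 2.0, 5.0]]))  # shape must match the served model
```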
TensorFlow Serving: https://github.com/tensorflow/serving
ReCeption (actually they call it Inception v3. Not sure where I got the ReCeption name - though I'm sure I read it somewhere?): https://www.tensorflow.org/versions/r0.7/tutorials/image_rec...
Using an SVM on neural-network-extracted features: http://blog.christianperone.com/2015/08/convolutional-neural...
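The idea in that post is basic transfer learning: treat a pretrained CNN as a fixed feature extractor and fit a linear SVM on top. A minimal scikit-learn sketch, with random vectors standing in for the CNN features so it runs standalone:

```python
# Sketch of "SVM on CNN features" transfer learning. Random vectors
# stand in for CNN-extracted features here; in practice you'd take the
# activations of a late layer of a pretrained network.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, dim = 200, 4096                  # e.g. 4096-d fc7 features, typical of that era
X = rng.normal(size=(n, dim)).astype(np.float32)
y = rng.integers(0, 2, size=n)      # stand-in binary labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LinearSVC().fit(X_tr, y_tr)   # a linear SVM usually suffices on deep features
print("accuracy:", clf.score(X_te, y_te))  # ~chance here, since features are random
```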
If you want a quick-and-dirty version, here's some Python to create a web service that calls a Caffe-based image recognizer: https://gist.github.com/nlothian/c3519adb81b3452c1938
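In the same spirit as that gist, here is a minimal Flask sketch wrapping a Caffe classifier. The route, file paths, and preprocessing constants are assumptions for illustration; `caffe.Classifier` and `caffe.io.load_image` are Caffe's classic Python API.

```python
# Sketch of a tiny Flask service wrapping a Caffe classifier.
# Paths and the /classify route are assumptions, not the gist's exact code.
import caffe
from flask import Flask, request, jsonify

MODEL_DEF = "deploy.prototxt"       # assumed path to your network definition
PRETRAINED = "model.caffemodel"     # assumed path to your trained weights

app = Flask(__name__)
net = caffe.Classifier(MODEL_DEF, PRETRAINED,
                       raw_scale=255,
                       channel_swap=(2, 1, 0),   # RGB -> BGR, typical for Caffe
                       image_dims=(256, 256))

@app.route("/classify", methods=["POST"])
def classify():
    # Expect an uploaded image file under the "image" form field.
    f = request.files["image"]
    f.save("/tmp/upload.jpg")
    img = caffe.io.load_image("/tmp/upload.jpg")
    probs = net.predict([img])[0]    # vector of class probabilities
    return jsonify(top_class=int(probs.argmax()),
                   confidence=float(probs.max()))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Then something like `curl -F image=@cat.jpg http://localhost:5000/classify` returns the top class index and its probability.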