That's an informative article.

Last time I took a look at TensorFlow Lite, they had a vision where you would export your model into a .tflite file (a FlatBuffer-encoded execution graph with weights) and then use it on mobile for inference like this, in pseudo-code:

  model = tflite.interpreter().load_model("my_model.tflite")
  model.set_input_data(my_input_buffer)
  model.execute()
  model.get_result(my_output_buffer)
Which is nice, since you can easily update the model by simply distributing a new "my_model.tflite" file. The TFLite interpreter library would use whatever capability (SIMD instructions, DSP cores, etc.) is available on the device to accelerate inference, so the application developer doesn't have to worry about writing different code for different platforms, or even have to understand how the prediction works under the hood.
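
For reference, here is roughly what that flow looks like with the actual TFLite Python Interpreter API (the Java/C++ bindings used on device follow the same pattern); the zero-filled input is just a placeholder:

  import numpy as np
  import tensorflow as tf

  # Load the FlatBuffer model and allocate its tensor buffers.
  interpreter = tf.lite.Interpreter(model_path="my_model.tflite")
  interpreter.allocate_tensors()
  input_details = interpreter.get_input_details()
  output_details = interpreter.get_output_details()

  # Feed an input of the expected shape/dtype, run inference, read the result.
  my_input_buffer = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
  interpreter.set_tensor(input_details[0]["index"], my_input_buffer)
  interpreter.invoke()
  my_output_buffer = interpreter.get_tensor(output_details[0]["index"])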

Is QNNPACK a library directly competing with TFLite? Are the model file formats the same between TFLite and this? Does it support Google's TPUs for inference, and/or more generally specialized cores like Qualcomm's DSPs?

From the look of the first paragraphs, it would seem that QNNPACK is a library similar to Intel's MKL/MKL-DNN: you get "compiled" functions/kernels that accelerate a particular (compute-intensive) task.
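
As a rough sketch of what one of those kernels computes (this is just a reference illustration in plain NumPy, not QNNPACK's actual API; the speedup in such libraries comes from doing the same arithmetic with hand-tuned SIMD/assembly), an 8-bit quantized fully-connected layer under the usual affine scheme real = scale * (q - zero_point) would be:

  import numpy as np

  def quantized_fully_connected(x_q, w_q, bias_q,
                                x_scale, x_zp, w_scale, w_zp,
                                out_scale, out_zp):
      # x_q: uint8 inputs (batch, in_features); w_q: uint8 weights (out_features, in_features);
      # bias_q: int32 bias quantized with scale x_scale * w_scale and zero point 0.
      # Accumulate in int32 so the products of uint8 values cannot overflow.
      acc = (x_q.astype(np.int32) - x_zp) @ (w_q.astype(np.int32) - w_zp).T
      acc += bias_q
      # Requantize the int32 accumulator back to uint8 outputs.
      out = np.round(acc * (x_scale * w_scale / out_scale)) + out_zp
      return np.clip(out, 0, 255).astype(np.uint8)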

With regard to TensorFlow Lite, this means that Google could possibly build TFLite with QNNPACK and (maybe) get better performance out of the resulting binary on the set of mobile platforms that QNNPACK supports.

Edit: by the end of the article, they describe how they built TensorFlow Lite with QNNPACK and got substantial speedups across a range of different phones.

I didn't understand it that way: they didn't build TensorFlow Lite with a QNNPACK "backend". They compared both versions on the same benchmarks, but they didn't "merge" the solutions.

So, theoretically, QNNPACK could be used to implement a TensorFlow Lite interpreter. However, it seems the most interesting implementations will use hardware-specific acceleration, such as Nvidia's Tensor Cores (via TensorRT) or Google's TPUs, whereas QNNPACK seems to target only CPU SIMD optimizations.

That's still a good amount of work to identify the optimizable building blocks, or to validate other approaches such as TFLite, but the mobile processor vendors (Qualcomm, ARM, Intel) already provide implementations of the Android NN API that maximize the use of their hardware.

That's why I'm not sure how QNNPACK integrates with the entire ecosystem.

Edit: as I see it, to consume a model in an application, the chain looks like this: developer <-> TFLite interpreter API <-> Android NN API (if the target is Android) <-> vendor-provided accelerated implementation (black-box binary blob; that's where most of the acceleration is supposed to happen)

Edit2: Now that I think about it, it doesn't make sense to benchmark "TensorFlow Lite" as such. TensorFlow Lite is only an API and a file-format spec, not a specific implementation, from what I understand.

Replying to your other comments (about how QNNPACK fits into the ecosystem and about the vendor implementations of the Android NN API):

I'm not entirely sure what they're aiming for there. Usually when you see talk about "kernels" it's more about how particular filters/convolutions/low-level operations are optimized, and it is implied that the kernels run on the GPU (most of the time). They do talk a lot about microarchitectural details, cache sizes and ARM NEON operations, so it seems to be implemented entirely on the CPU, but I don't really grasp how it ties in with the vendor-specific implementations that you mention.

It could be that these are new algorithms/implementations that play to the strengths of the system as a whole (not just the CPU or its microarchitecture) and try to "go easy" on the memory bandwidth, for example, to get better performance out of (maybe?) equivalent code.

This reminds me a bit of the numexpr[0] project, which accelerates NumPy computations in Python by arranging the work to be more cache-friendly (evaluating expressions in blocks instead of creating large temporary arrays).
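
A tiny example of that chunked-evaluation idea (toy arrays; numexpr picks the variables up from the calling scope and evaluates the compiled expression block by block instead of materializing full-size temporaries):

  import numpy as np
  import numexpr as ne

  a, b, c = (np.random.rand(10_000_000) for _ in range(3))

  # Plain NumPy materializes full-size temporaries for a*b and then a*b + c.
  out_np = a * b + c

  # numexpr evaluates the same expression in cache-sized chunks, cutting the
  # memory traffic spent on temporary arrays.
  out_ne = ne.evaluate("a * b + c")

  assert np.allclose(out_np, out_ne)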

[0] https://github.com/pydata/numexpr