K-L Divergence is something that keeps coming up in my research, but I still don't understand what it is.

Could someone give me a simple explanation as to what it is?

And also, what practical use cases does it have?

Intuitively, it measures the difference between two probability distributions. It's not symmetric, so it isn't quite a distance, but in my opinion that's still good intuition.
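For reference, the usual definition for discrete distributions is $D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$; it's always non-negative, and it's zero exactly when $p = q$.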

As motivation, say you're an internet provider supplying service to a business. You naturally want to save money, so perhaps you compress packets before they go over the wire. Suppose the business also compresses its data, but it has made a mistake and does so inefficiently.

Let's say the business has, incorrectly, determined the probability distribution of its data to be $q(x)$. That is, it assigns probability $q(x)$ to seeing symbol $x$. Let's say you've determined the "true" distribution to be $p(x)$. The expected number of bits they transmit per packet/symbol is then the cross-entropy $-\sum_x p(x) \log_2 q(x)$. Meaning, they'll compress their stream under the assumption that the distribution is $q(x)$, but the actual probability of seeing a packet $x$ is $p(x)$, which is why the term $p(x) \log_2 q(x)$ shows up.

The number of bits you're transmitting is just the entropy, $-\sum_x p(x) \log_2 p(x)$. Now we ask: how many bits per packet does your method save over the business's? This is $-\sum_x p(x) \log_2 \frac{q(x)}{p(x)} = \sum_x p(x) \log_2 \frac{p(x)}{q(x)}$, which is exactly the Kullback-Leibler divergence $D_{\mathrm{KL}}(p \,\|\, q)$.
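Here's a minimal numeric sketch of that accounting; the 4-symbol alphabet and the particular numbers are made up for illustration:

```python
import numpy as np

# A hypothetical 4-symbol alphabet. p is the "true" distribution,
# q is the business's mistaken estimate.
p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])

entropy       = -np.sum(p * np.log2(p))  # your bits/symbol: 1.75
cross_entropy = -np.sum(p * np.log2(q))  # their bits/symbol: 2.0

# The waste, in bits per symbol, is exactly D_KL(p || q).
kl = np.sum(p * np.log2(p / q))          # 0.25
assert np.isclose(kl, cross_entropy - entropy)
```

So coding against the wrong distribution costs 2.0 bits per symbol instead of the optimal 1.75, and the 0.25-bit gap is the K-L divergence.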

In other words, given a "guess" at a distribution and the "true" distribution, how far off is the guess? That's the Kullback-Leibler divergence, and why it shows up (I believe) in machine learning and fitness functions.

As a more concrete example, I recently ran across a paper [0] about using WFC (WaveFunctionCollapse) [1] to assess how well it, and other algorithms, do at generating "Super Mario Bros."-like levels: take a 2x2 or 3x3 window, build a library of tiles from an example image, use that to generate a random level, then use the K-L divergence between the generated level's tile distribution and the distribution observed in the example to score how well the generative algorithm has done. A rough sketch of that measurement follows below.
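To be clear, this is my own illustration of the idea, not the paper's code: the toy levels, the `tile_distribution` helper, and the epsilon smoothing are all assumptions. The smoothing is needed because $D_{\mathrm{KL}}$ blows up whenever $q$ assigns zero probability to a tile that $p$ contains.

```python
import numpy as np
from collections import Counter

def tile_distribution(level, size=2):
    """Relative frequency of each size x size tile in a grid of characters."""
    counts = Counter()
    rows, cols = len(level), len(level[0])
    for r in range(rows - size + 1):
        for c in range(cols - size + 1):
            tile = tuple(level[r + dr][c + dc]
                         for dr in range(size) for dc in range(size))
            counts[tile] += 1
    total = sum(counts.values())
    return {tile: n / total for tile, n in counts.items()}

def kl_divergence(p, q, eps=1e-9):
    """D_KL(p || q) in bits; eps stands in for tiles q never produced."""
    return sum(px * np.log2(px / q.get(tile, eps)) for tile, px in p.items())

# Hypothetical toy levels: '-' is air, 'X' is ground, '?' is a block.
example   = ["----?----",
             "---------",
             "XXXXXXXXX"]
generated = ["--?---?--",
             "---------",
             "XXXXXXXXX"]

p = tile_distribution(example)
q = tile_distribution(generated)
print(kl_divergence(p, q))  # 0.0 would mean the tile statistics match exactly
```

A lower score means the generator's local tile statistics look more like the example's; which direction you compute the divergence in depends on whether you want to penalize missing patterns or invented ones.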

[0] https://arxiv.org/pdf/1905.05077.pdf

[1] https://github.com/mxgmn/WaveFunctionCollapse