For what it's worth, this definitely isn't the theoretical limit for compressing a chess board position, which I would probably resort to calculating mathematically. In particular, (simplifying to boards with no pieces captured) this scheme uses 24 bytes for representing piece positions, but the constraint that no two pieces are in the same square means you should actually need only log2(64!/32!) = 22.3 bytes for that case, with 13 bits of savings. The scheme proposed here reclaims some of that overhead by granting special meaning when a piece's location overlaps with the king, but still is capable of representing other illegal positions where non-king pieces occupy the same square.

Unfortunately, decoding a mathematically optimal encoding quickly devolves to making a list of all possible board configurations and indexing into it, so I'd definitely believe that the proposal here is close to the smallest practically useful representation.

Couldn't you make it practical by assuming all possible positions (64) for the first piece, then 63 for the second, and so on, and encode one piece at a time? (Arithmetic coding piece by piece)

Yes, although

1. You'd definitely want to deduplicate positions of identical pieces for even more savings 2. That only handles the case where no pieces have been captured, and the full arithmetic coding would probably need separate "sections" of the integer range for different cases, and the number of sections is also quite high. 3. There's extra nuanced things you might want to handle in the coding, like that pawns can't be on their own back row. That is significantly harder.

It looks to me like https://github.com/tromp/ChessPositionRanking has resolved these sorts of issues, but I haven't dug into exactly how.