Regarding the opcode dispatch: setting up the RTS in this way is quite expensive, and (if you've got the room) you could be better off assembling a little thunk somewhere in memory: 4C 00 >SET (JMP >SET*256). You'd do this on startup.

A JMP to this thunk costs 3 cycles, and the JMP in the thunk costs 3 cycles, so that buys you nothing compared to the RTS. And the STx to set up the low byte takes up 3 cycles (zero page) or 4 cycles (elsewhere), which is the same or worse than the PHA. But because the high byte is always set up, you save the 5 cycles spent setting that up.
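A rough sketch of the thunk approach, in case it helps. THUNK and the setup code are hypothetical names, not part of the original listing; I'm assuming Y already holds opcode*2 as in the original dispatch, and that the table entries are changed to hold the true low byte of each routine rather than address-1 (RTS adds 1 to the pulled address, JMP does not).

```
; One-time setup: build "JMP $xx00" in RAM, where xx = >SET.
; THUNK is a hypothetical 3-byte area, ideally in zero page.
        LDA  #$4C        ; opcode for JMP absolute
        STA  THUNK
        LDA  #>SET       ; page shared by all the opcode routines
        STA  THUNK+2     ; high byte, written once and never touched again

; Per-opcode dispatch, replacing the LDA/PHA ... PHA/RTS sequence.
        LDA  OPTBL-2,Y   ; 4 cycles: low byte of the routine
        STA  THUNK+1     ; 3 cycles if THUNK is in zero page, else 4
        JMP  THUNK       ; 3 cycles, plus 3 more for the JMP inside the thunk
```

By my count that is 4+3+3+3 = 13 cycles, against 2+3+4+3+6 = 18 for LDA #/PHA/LDA abs,Y/PHA/RTS, which is where the 5-cycle saving comes from.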

(If you're running from RAM, you don't even need the thunk.)

(Also: the opcode dispatch's EOR trick is space-efficient, but takes an extra cycle - and one fewer byte, I won't deny - compared to doing a TAY after fetching the byte, then a TYA:AND $F0 later. That sequence takes 6 cycles, whereas the LSR:EOR (R15L),Y sequence takes 7 or 8.)

What do you think about using 6C, the indirect jump? Instead of a little thunk, 16 bits of data can be set aside in a fixed location. We mutate only the low-order address byte and then do an indirect jump through it.

   LDA  OPTBL-2,Y
   STA  OPADDR
   JMP  (OPADDR)

The contents of OPADDR+1 are initialized once on entry into the interpreter. Or perhaps statically.
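For what it's worth, the one-time setup might look something like this. OPADDR is the name from the snippet above; its placement in zero page is my assumption, and as with any JMP-based variant the table would need to hold true low bytes rather than address-1.

```
; One-time setup on entry to the interpreter.
; Assumes OPADDR is a 2-byte pointer in zero page.
        LDA  #>SET       ; page shared by all the opcode routines
        STA  OPADDR+1    ; high byte, written once

; Per-opcode dispatch, Y = opcode*2 as in the original.
        LDA  OPTBL-2,Y   ; 4 cycles: low byte of the routine
        STA  OPADDR      ; 3 cycles (zero page)
        JMP  (OPADDR)    ; 5 cycles on an NMOS 6502
```

That comes to 12 cycles by my count, one better than jumping through a thunk. One caveat: on the NMOS 6502 the pointer must not sit at $xxFF, because the indirect JMP then fetches the high byte from $xx00 instead of the next page.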

Another thing would be self-modifying code (if we can forgo ROM-ming this, which Woz couldn't): the interpreter mutates the operand of an absolute JMP instruction to set up the address. That instruction then simply follows; there is no need to branch to it. Same as your thunk, but placed inline.
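Roughly like this, assuming the interpreter itself runs from RAM. DISP is a hypothetical label, and the table is again assumed to hold true low bytes rather than address-1.

```
; Per-opcode dispatch with the JMP patched in place.
; Y = opcode*2 as in the original.
        LDA  OPTBL-2,Y   ; 4 cycles: low byte of the routine
        STA  DISP+1      ; 4 cycles (absolute): overwrite the JMP's low operand byte
DISP    JMP  SET         ; 3 cycles; assembles as 4C <SET >SET, so the
                         ; high byte is already right and never changes
```

That would be 11 cycles by my count, or 10 if the dispatch code lives in zero page so the STA is a zero-page store.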

Ah, the first machine language program I wrote was on the 6502 and used self-modifying code to march through the graphics buffer. Indexed addressing modes were the next chapter in the Rodnay Zaks book.

This is what the PLASMA VM does (https://github.com/dschmenk/PLASMA). All opcodes are even numbers so that the dispatch addresses can be stored as two-byte addresses.