Regarding the opcode dispatch: setting up the RTS in this way is quite expensive, and (if you've got the room) you could be better off assembling a little thunk somewhere in memory: the three bytes 4C 00 >SET (that is, JMP >SET*256). You'd do this on startup.
A JMP to this thunk costs 3 cycles, and the JMP in the thunk costs 3 cycles, so that buys you nothing compared to the 6-cycle RTS. And the STx to set up the low byte takes 3 cycles (zero page) or 4 cycles (elsewhere), which is the same as or worse than the 3-cycle PHA. But because the high byte is always set up, you save the 5 cycles (LDA #, then PHA) spent setting that up.
(If you're running from RAM, you don't even need the thunk.)
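A rough sketch of the thunk idea, for what it's worth. THUNK is a hypothetical three-byte RAM location, and OPTBL and SET are the labels from the SWEET16 listing. One thing to watch: RTS adds 1 to the address it pulls, so the existing table holds address-minus-one values, whereas a JMP needs the real low bytes. This assumes the table has been adjusted accordingly.

```asm
; One-time setup on entry (THUNK is a hypothetical 3-byte RAM location)
        LDA #$4C        ; opcode byte for JMP absolute
        STA THUNK
        LDA #>SET       ; high byte shared by all the opcode handlers
        STA THUNK+2     ; written once, never touched again

; Per-instruction dispatch, in place of PHA / RTS
        LDA OPTBL-2,Y   ; low byte of handler (real address, not address-1)
        STA THUNK+1     ; 3 cycles if THUNK is in zero page, otherwise 4
        JMP THUNK       ; 3 cycles here, plus 3 for the JMP inside the thunk
```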
(Also: the opcode dispatch's EOR trick is space-efficient, but takes an extra cycle - and one byte fewer, I won't deny - compared to doing a TAY after fetching the byte, then a TYA:AND #$F0 later. That sequence takes 6 cycles, whereas the LSR:EOR (R15L),Y sequence takes 7 or 8.)
What do you think about using 6C, the indirect jump? Instead of a little thunk, 16 bits of data can be set aside in a fixed location. We mutate only the low-order byte of the address and then do an indirect jump through it.
LDA OPTBL-2,Y
STA OPADDR
JMP (OPADDR)
OPADDR+1 is initialized once on entry into the interpreter. Or perhaps statically.
Another thing would be self-modifying code (if we can forgo ROM-ming this, which Woz couldn't): the interpreter mutates the operand of an absolute JMP instruction to set up the address. That instruction then simply follows; there is no need to branch to it. Same as your thunk, but placed inline.
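Something like this is what I have in mind for the self-modifying version. It is only a sketch: DISPJ is a made-up label, the code is assumed to run from RAM, and the table is assumed to hold real low bytes rather than the address-minus-one values that RTS needs.

```asm
; Inline self-modifying dispatch (DISPJ is a hypothetical label; RAM only)
        LDA OPTBL-2,Y   ; low byte of handler (real address, not address-1)
        STA DISPJ+1     ; patch the low byte of the JMP operand below
DISPJ   JMP SET         ; low byte overwritten on every dispatch;
                        ; high byte stays >SET as assembled
```

By the standard 6502 timings that is 4 cycles for the absolute STA plus 3 for the JMP, so 7 in all, against 3 + 5 = 8 for a zero-page STA followed by JMP (OPADDR).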
Ah, the first machine language program I wrote was on the 6502 and used self-modifying code to march through the graphics buffer. Indexed addressing modes were the next chapter in the Rodnay Zaks book.