1.1. How should I decide which integer type to use?

Programmer: Hmmm, I think these three should be shorts and these 10 can fit in chars.

Modern compiler: Fuck that, you're all 32-bits. I ain't got time for unaligned memory access...

Modern compilers generally prefer register-sized data.

Ah thanks, that's what I was thinking. But somehow I got caught up in unaligned cacheline accesses, which are much larger than 32 bits. It's early where I am. :)

I mean there are 4 different concepts at play here:

Unaligned memory access: types generally have an alignment which is the same as their size, and structs have the alignment of their largest member, because many architectures get quite cross with unaligned access, e.g. trying to load a 32b integer from memory into a register when its address is not aligned to 32b. This is why compilers pad structures by default to ensure everything is aligned properly, and why (in C/C++ anyway) you want to take care of your struct layout to avoid unnecessary padding. That doesn't prevent "these three should be shorts and these 10 can fit in chars" at all, but if you intersperse them the structure will increase in size due to padding.
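To make the padding point concrete, here is a minimal sketch in C11. The struct and function names are made up for illustration, and the exact sizes depend on the ABI: on a typical target where uint32_t needs 4-byte alignment I'd expect 20 bytes for the first layout and 12 for the second, but treat those numbers as an assumption.

```c
#include <assert.h>    /* static_assert (C11) */
#include <stdalign.h>  /* alignof */
#include <stddef.h>    /* offsetof, size_t */
#include <stdint.h>

/* Small and wide members interleaved: each 1-byte member is typically
   followed by padding so the next 4-byte member lands on its alignment. */
struct interleaved {
    uint8_t  a;
    uint32_t x;
    uint8_t  b;
    uint32_t y;
    uint8_t  c;
};

/* Same members, widest first: the 1-byte members pack together at the end. */
struct sorted {
    uint32_t x;
    uint32_t y;
    uint8_t  a;
    uint8_t  b;
    uint8_t  c;
};

/* Bytes actually occupied by the members, ignoring padding. */
#define MEMBER_BYTES (2 * sizeof(uint32_t) + 3 * sizeof(uint8_t))

/* The compiler always places x on a suitably aligned offset. */
static_assert(offsetof(struct interleaved, x) % alignof(uint32_t) == 0,
              "members are placed at aligned offsets");

/* Padding the compiler inserted into each layout. */
static size_t padding_interleaved(void) { return sizeof(struct interleaved) - MEMBER_BYTES; }
static size_t padding_sorted(void)      { return sizeof(struct sorted) - MEMBER_BYTES; }
```

Printing padding_interleaved() and padding_sorted() on your own target is the quickest way to see what your ABI actually does.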

Then there's C being specified in terms of minimum capabilities, so while "char" and "short" are at least 8 and at least 16 bits… they can also be larger (POSIX does specify that char must be exactly 8 bits, but non-POSIX platforms don't require that). That matches everything being 32b. It's somewhat common for DSPs to have a char of 16 or 32 bits, and it really has nothing to do with the compiler (except insofar as the compiler does what the architecture description specifies).
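If you'd rather check than assume, <limits.h> tells you what the implementation actually picked. A small sketch (the helper names bits_in and char_is_octet are mine, not anything standard):

```c
#include <assert.h>  /* static_assert (C11) */
#include <limits.h>  /* CHAR_BIT */
#include <stddef.h>  /* size_t */

/* The standard only promises a lower bound on the width of char. */
static_assert(CHAR_BIT >= 8, "C guarantees char has at least 8 bits");

/* Storage width in bits of an object of the given size on this implementation.
   sizeof counts in chars, so the result scales with CHAR_BIT. */
static size_t bits_in(size_t size_in_chars) { return size_in_chars * CHAR_BIT; }

/* True on POSIX systems; may be false on e.g. some DSP toolchains. */
static int char_is_octet(void) { return CHAR_BIT == 8; }
```

For example, bits_in(sizeof(short)) is at least 16 everywhere, but nothing stops it from being 32 on a platform where everything is a 32b word.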

Then there's the preferred data size, which is probably more a function of the architecture's ALUs than of the native registers. E.g. if an architecture only works on 32b data, then computations on 8-bit data will require loading 8 bits, zeroing the upper 24, performing the computation in 32b, copying whichever bit is concerned to the overflow flag, then zeroing that. Whereas performing the computation on a 32b-native type would require loading 32b, performing the computation, done. (That was never a concern on x86/64, but IIRC older revisions of ARM could only natively work in 32b, and not all possible arithmetic operations got expanded to 8/16b at the same time.)
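The C language has its own version of this "compute wide, then narrow" dance: operands narrower than int are promoted to int before arithmetic, and converting the result back truncates it. That is a language rule rather than a statement about any particular CPU, so take the sketch below as an analogy. The function names are hypothetical; uint_fast8_t is the standard <stdint.h> type for "at least 8 bits, whatever width the implementation considers fast".

```c
#include <assert.h>
#include <stdint.h>

/* a and b are promoted to int, added at int width, then the cast
   truncates the result back to 8 bits (wraps modulo 256). */
static uint8_t add_u8(uint8_t a, uint8_t b) { return (uint8_t)(a + b); }

/* Same addition at the natural width: no narrowing step afterwards. */
static unsigned add_wide(unsigned a, unsigned b) { return a + b; }

/* Let the implementation choose: at least 8 bits, possibly wider. */
typedef uint_fast8_t small_counter;

static_assert(sizeof(small_counter) >= sizeof(uint8_t),
              "the fast type is never narrower than the exact-width one");
```

So add_u8(200, 100) gives 44 while add_wide(200, 100) gives 300. Whether small_counter ends up as 8 or 32 bits is entirely up to the platform's headers.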

And finally there's cache line alignment. There are actually two opposite issues with cacheline alignment: you usually want your structure to not span cachelines (as that gets more expensive), but sometimes you want your structures to always start at the start of a cacheline (and possibly take up the entire cacheline) to avoid false sharing: if two unrelated structures are on the same cache line and they're used from different cores, the cores will need a lot more synchronisation than if they were on different cache lines.
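A rough sketch of the false-sharing fix using C11 alignas. The 64-byte line size is an assumption (common on x86-64, not universal), and the struct names are invented for the example:

```c
#include <assert.h>    /* static_assert (C11) */
#include <stdalign.h>  /* alignas, alignof */
#include <stddef.h>    /* offsetof */

#define CACHE_LINE 64  /* assumed line size; check your target */

/* Two counters side by side: both very likely sit in the same cache line,
   so writes from two cores keep bouncing that line between them. */
struct packed_counters {
    unsigned long a;  /* written by one thread */
    unsigned long b;  /* written by another thread */
};

/* One counter aligned to a cache line; sizeof is rounded up to the
   alignment, so it also occupies the whole line. */
struct padded_counter {
    alignas(CACHE_LINE) unsigned long value;
};

/* Each counter now starts on its own cache line. */
struct separated_counters {
    struct padded_counter a;
    struct padded_counter b;
};

static_assert(sizeof(struct padded_counter) == CACHE_LINE,
              "a padded counter fills exactly one assumed cache line");
```

The trade-off is obvious from the sizes: separated_counters burns two full lines to hold two words, which is only worth it when the counters really are hammered from different cores.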

Cache alignment also gets stupid complicated when you add in what kind of malloc you use, cf. Google's tcmalloc: https://github.com/google/tcmalloc
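For heap memory, plain malloc only promises alignment suitable for any standard type, not cache-line alignment. A minimal sketch with the standard C11 aligned_alloc (nothing tcmalloc-specific here; the helper names and the 64-byte line size are assumptions, and aligned_alloc is not available on every toolchain):

```c
#include <assert.h>
#include <stdint.h>  /* uintptr_t */
#include <stdlib.h>  /* aligned_alloc, free */

#define CACHE_LINE 64  /* assumed line size; check your target */

/* Allocate n_lines whole cache lines, starting on a line boundary.
   The size is kept a multiple of the alignment, as C11 requires.
   Returns NULL on failure; release with free(). */
static void *alloc_cache_lines(size_t n_lines) {
    return aligned_alloc(CACHE_LINE, n_lines * CACHE_LINE);
}

/* True if p sits on an (assumed) cache line boundary. */
static int is_cache_aligned(const void *p) {
    return ((uintptr_t)p % CACHE_LINE) == 0;
}
```

How a given allocator actually lays out and recycles such blocks is allocator-specific, which is where the complication mentioned above comes in.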