Please help!! There is something fundamental I don't know how to achieve. Say you have a word and you want to compress it into the smallest possible hash/UUID while staying collision free. Say we want to compress the word "eating". I propose the following scheme: "eating" becomes 1abc. But it doesn't literally mean 1abc, let me explain: the 1 is a metadata character that denotes a specific alphabet, such as Egyptian hieroglyphs. In 1abc, abc is the actual hash, and a denotes the first letter of the hieroglyph alphabet. So, first question: how do I generate minimal hashes that leverage UTF-8 symbols to reduce length? The catch is that non-basic UTF-8 symbols weigh 3-4 bytes each, which is not great, and perfectly avoidable!! What a shame this concept is not widespread.

My goal is a string encoding that allows 255 distinct "ASCII tables", dynamically switched via the first character, which encodes the table the rest of the string refers to. That way every character in my encoding stays one byte. The only limit is that not every hieroglyph can be encoded, only the first 255, but I don't care! I don't want to encode Egyptian; I want to generate maximally small hashes/UUIDs for normal text by going far beyond the standard alphabet's symbols. How can I achieve this?? It must have already been done!
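Here is a minimal sketch of the encoding I have in mind (the table registry, its ids and its contents below are hypothetical, purely for illustration):

```python
# Hypothetical registry: table id -> list of up to 255 Unicode code points.
# Only two toy tables are registered for the demo.
TABLES = {
    0x01: [chr(cp) for cp in range(0x13000, 0x13000 + 255)],  # Egyptian Hieroglyphs block
    0x02: [chr(cp) for cp in range(0x0041, 0x0041 + 255)],    # Latin and beyond, for contrast
}

def encode(table_id: int, symbols: list[str]) -> bytes:
    """Pack a symbol sequence as 1 selector byte + 1 byte per symbol."""
    table = TABLES[table_id]
    return bytes([table_id] + [table.index(s) for s in symbols])

def decode(blob: bytes) -> tuple[int, list[str]]:
    """Reverse of encode(): the first byte picks the table, the rest are indices."""
    table = TABLES[blob[0]]
    return blob[0], [table[i] for i in blob[1:]]

payload = encode(0x01, ["\U00013000", "\U00013001", "\U00013002"])
assert len(payload) == 4          # 1 metadata byte + 3 symbol bytes
assert decode(payload)[1] == ["\U00013000", "\U00013001", "\U00013002"]
```

Each encoded string costs exactly one selector byte plus one byte per character, regardless of which table is selected.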

(Note: ideally you could specify more than one character for the language selection, which would massively increase the encoding's compression, but just one would be a good start!) Please help!

TL;DR: ASCII-style char encoding can only represent 255 one-byte characters. My proposed scheme enables N * 255 * 255 one-byte characters, at a constant cost of 1 extra byte per string. This seems revolutionary.
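A quick back-of-the-envelope for that claim (assuming 255 usable values per byte and a selector prefix of 1 or 2 bytes):

```python
for selector_bytes in (1, 2):
    tables  = 255 ** selector_bytes         # distinct "ASCII tables" addressable
    symbols = tables * 255                  # distinct one-byte characters reachable
    print(selector_bytes, tables, symbols)  # 1 byte -> 65025 symbols; 2 bytes -> 16581375
```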

Sounds similar to: https://github.com/Cyan4973/FiniteStateEntropy

https://arxiv.org/abs/1311.2540

> The modern data compression is mainly based on two approaches to entropy coding: Huffman (HC) and arithmetic/range coding (AC). The former is much faster, but approximates probabilities with powers of 2, usually leading to relatively low compression rates. The latter uses nearly exact probabilities - easily approaching theoretical compression rate limit (Shannon entropy), but at cost of much larger computational cost. Asymmetric numeral systems (ANS) is a new approach to accurate entropy coding, which allows to end this trade-off between speed and rate: the recent implementation [1] provides about 50% faster decoding than HC for 256 size alphabet, with compression rate similar to provided by AC. This advantage is due to being simpler than AC: using single natural number as the state, instead of two to represent a range. Beside simplifying renormalization, it allows to put the entire behavior for given probability distribution into a relatively small table: defining entropy coding automaton. The memory cost of such table for 256 size alphabet is a few kilobytes. There is a large freedom while choosing a specific table - using pseudorandom number generator initialized with cryptographic key for this purpose allows to simultaneously encrypt the data. This article also introduces and discusses many other variants of this new entropy coding approach, which can provide direct alternatives for standard AC, for large alphabet range coding, or for approximated quasi arithmetic coding.
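For intuition about how ANS gets there, here is a minimal sketch of the rANS variant (not FSE's table-driven tANS, and not that repo's API): it keeps the whole coder state in a single Python big integer and skips renormalization, so it's slow, but it shows the "single natural number as the state" idea from the abstract.

```python
from collections import Counter

def build_model(data: bytes):
    """Per-symbol frequency f[s] and cumulative frequency c[s]; total M == len(data)."""
    freq = Counter(data)
    cum, running = {}, 0
    for sym in sorted(freq):
        cum[sym] = running
        running += freq[sym]
    return freq, cum, running

def rans_encode(data: bytes):
    """Encode symbols in reverse order into one big-integer state x."""
    freq, cum, M = build_model(data)
    x = 0
    for sym in reversed(data):
        f, c = freq[sym], cum[sym]
        x = (x // f) * M + c + (x % f)   # the single-natural-number state update
    return x, freq, cum, M, len(data)

def rans_decode(x, freq, cum, M, n):
    """Decode n symbols in forward order by inverting the encoder update."""
    slot_to_sym = {}
    for sym, c in cum.items():           # which symbol owns each slot in [0, M)
        for slot in range(c, c + freq[sym]):
            slot_to_sym[slot] = sym
    out = bytearray()
    for _ in range(n):
        slot = x % M
        sym = slot_to_sym[slot]
        x = freq[sym] * (x // M) + slot - cum[sym]   # inverse of the encoder update
        out.append(sym)
    return bytes(out)

msg = b"eating eating eating"
x, freq, cum, M, n = rans_encode(msg)
assert rans_decode(x, freq, cum, M, n) == msg
print(x.bit_length(), "bits of state for", n, "input bytes")
```

The final state needs roughly the Shannon entropy of the empirical symbol model, in bits, which is the compression rate the abstract is talking about.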

Check out his other papers and the GitHub project (it looked super interesting and similar).