This is awesome! Wonder how it compares with mine (github.com/billyeh/termchat). Would love to share notes with the maker, since audio was something I did not even consider while I was making it. The image processing part is fun, though!
I helped create it at a recent hackathon. The ASCII is rendered by calculating the intensity of each pixel, mapping that intensity to a character based on how much of the cell that character fills ('@' would be bright and '.' would be dark), and then approximating the color of the pixel to fit within the 256 available colors.
I've noticed your implementation uses WebSockets, which are built on top of TCP. We used UDP and encapsulated everything in a pretty simple-to-use (and easily the most well-documented part of the project) p2plib.c. The fear was that video and audio could get backed up if we were to use TCP, so audio and video were sent via UDP packets and rendered every time a UDP packet was received.
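For anyone who wants to play with the idea, here is a minimal Python sketch of the mapping described above. It is not the project's actual code: the character ramp, the Rec. 601 luma weights, and the helper names (`to_char`, `to_xterm256`, `render_pixel`) are all assumptions, and the color step uses a simple uniform quantization into the 6x6x6 xterm color cube rather than an exact nearest-color search.

```python
# Ramp ordered from sparse (dark) to dense (bright) glyphs.
# The exact ramp is an assumption, not taken from the project.
RAMP = " .:-=+*#%@"

def luminance(r, g, b):
    """Rec. 601 luma for 8-bit RGB, roughly in the range 0..255."""
    return 0.299 * r + 0.587 * g + 0.114 * b

def to_char(r, g, b):
    """Pick a glyph whose visual density tracks the pixel's intensity."""
    idx = int(luminance(r, g, b) / 256 * len(RAMP))
    return RAMP[idx]

def to_xterm256(r, g, b):
    """Approximate an RGB color with an index in the 6x6x6 xterm cube (16..231)."""
    def q(c):
        # Uniform quantization to 0..5; ignores the cube's uneven level spacing.
        return int(c / 255 * 5 + 0.5)
    return 16 + 36 * q(r) + 6 * q(g) + q(b)

def render_pixel(r, g, b):
    """Return the glyph wrapped in a 256-color foreground escape sequence."""
    return f"\x1b[38;5;{to_xterm256(r, g, b)}m{to_char(r, g, b)}\x1b[0m"
```

Printing `render_pixel(r, g, b)` for each downsampled pixel, with a newline after each row, gives a colored ASCII frame.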
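A rough Python sketch of the "render whatever datagram arrives" approach, using the standard `socket` module. This does not mirror p2plib.c's API (which is not shown here); the function names, the loopback address, and the one-second timeout are assumptions for illustration. Each frame is assumed to fit in a single datagram.

```python
import socket

def make_receiver(host="127.0.0.1", port=0):
    """Bind a UDP socket; port 0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    sock.settimeout(1.0)  # do not block forever if a packet is lost
    return sock

def send_frame(sock, addr, frame_bytes):
    """Fire and forget: no retransmission, no ordering guarantee."""
    sock.sendto(frame_bytes, addr)

def receive_frame(sock, max_size=65507):
    """Return the next datagram's payload, or None if nothing arrived in time."""
    try:
        data, _sender = sock.recvfrom(max_size)
        return data
    except socket.timeout:
        return None
```

A lost or late packet simply means a skipped frame, so the stream never backs up behind a retransmission the way it can over TCP.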
You can go a bit further to get significantly higher color depths using the extended ASCII (code page 437) shade characters 176, 177, 178 (░, ▒, ▓). By using the foreground and background ANSI colors and blending them with these three characters (25%, 50%, 75%) you can achieve some quite realistic images. If you're more interested in resolution, using half block characters like 220 and 223 (▄ and ▀) can allow you to double your vertical resolution at the cost of not being able to use the color trick.
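Both tricks can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names are made up, the blend is a plain linear mix of foreground and background weighted by the shade's nominal coverage, and the search is brute force over every foreground/background pair in whatever palette is passed in.

```python
from itertools import product

# Nominal foreground coverage of each shade character.
SHADES = {"░": 0.25, "▒": 0.50, "▓": 0.75}

def blend(fg, bg, coverage):
    """Linear mix of two RGB tuples, weighted by foreground coverage."""
    return tuple(round(f * coverage + b * (1 - coverage)) for f, b in zip(fg, bg))

def best_cell(target, palette):
    """Find the (char, fg, bg) combination whose blend is closest to target."""
    best = None
    for fg, bg in product(palette, repeat=2):
        for ch, cov in SHADES.items():
            mixed = blend(fg, bg, cov)
            err = sum((m - t) ** 2 for m, t in zip(mixed, target))
            if best is None or err < best[0]:
                best = (err, ch, fg, bg)
    return best[1:]

def half_block(top_index, bottom_index):
    """One text cell showing two stacked pixels, given 256-color indices.

    The upper half block takes the foreground color (top pixel) and the
    background color shows through as the bottom pixel.
    """
    return f"\x1b[38;5;{top_index};48;5;{bottom_index}m▀\x1b[0m"
```

Since `half_block` spends both the foreground and the background color on two separate pixels, there is nothing left to blend, which is the trade-off mentioned above.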
https://github.com/dhotson/txtcam
Another thing I'd like to try to get a bit more resolution is to perhaps use the braille character set in combination with 256 color mode: https://github.com/asciimoo/drawille
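As a sketch of why braille helps: each Unicode braille character (starting at U+2800) encodes a 2x4 grid of dots, so one text cell can carry eight on/off pixels. The Python below is a minimal illustration, not drawille's API; the function name and the input layout (four rows of two booleans) are assumptions.

```python
# Bit assigned to each dot position, indexed as DOT_BITS[row][col].
# Follows the Unicode braille pattern layout (dots 1-3 and 7 on the left,
# dots 4-6 and 8 on the right).
DOT_BITS = (
    (0x01, 0x08),
    (0x02, 0x10),
    (0x04, 0x20),
    (0x40, 0x80),
)

def braille_cell(dots):
    """Convert a 4x2 grid of booleans into a single braille character."""
    bits = 0
    for row in range(4):
        for col in range(2):
            if dots[row][col]:
                bits |= DOT_BITS[row][col]
    return chr(0x2800 + bits)
```

Wrapping the returned character in a 256-color foreground escape gives one color per 2x4 block of dots.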
Also, some terminals are starting to support 24-bit color (iTerm2 nightlies), which could pretty drastically improve the possibilities of terminal-based video: https://github.com/frytaz/txtcam/tree/color :)
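For reference, terminals that support 24-bit color generally accept the `38;2;R;G;B` (foreground) and `48;2;R;G;B` (background) SGR sequences. A tiny Python sketch, with a made-up helper name, assuming the terminal understands these codes:

```python
def truecolor(text, fg, bg=None):
    """Wrap text in 24-bit foreground (and optional background) color escapes."""
    r, g, b = fg
    codes = f"38;2;{r};{g};{b}"
    if bg is not None:
        br, bgr, bb = bg
        codes += f";48;2;{br};{bgr};{bb}"
    return f"\x1b[{codes}m{text}\x1b[0m"
```

On a terminal without 24-bit support the sequence may be ignored or misread, so falling back to 256-color mode is a reasonable default.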