I run a file sharing / content delivery platform called pixeldrain: https://pixeldrain.com
The system serves 4 PB of data to 60 million visitors per month. I have served 30 PB and 700 million file views since I started tracking usage sometime in 2018.
I'll go from front to back:
- Most of the frontend is plain HTML, CSS and JS. I have started transitioning some pages to Svelte. I like this framework for its speed and simplicity
- Cloudflare Analytics to get basic info like which pages are popular and where my users are from
- The structure of the website (page wrap, menu, footer, etc) is managed with Go's template system
- Constellix for Geo-DNS. This automatically sends users to the server closest to them by doing Geo-IP lookups on the nameservers
- The user-facing servers are dedicated 10 Gbps Leaseweb servers, stuffed to the brim with SSDs in RAID6 for caching. Each of these servers costs €1200 per month. The storage servers are from Hetzner's SX line.
- The OS is Ubuntu 20.04 server edition. I use Ubuntu over Debian because it ships with TCP BBR
- The API is written in plain Go. The only HTTP libraries I use are httprouter for routing and Gorilla Websockets (there's a minimal routing sketch after this list)
- The storage system is custom built to spread files over multiple servers. I call it pixelstore, it's not open source (yet)
- The database is ScyllaDB. I landed on this one after going through multiple other systems with severe bottlenecks. I started with MySQL which was limited to a single location, so other locations had high latency. Then I tried CockroachDB, but it kept hanging under the load no matter how much hardware I threw at it. ScyllaDB is very fast and relatively reliable.
- UptimeRobot for monitoring
- Mailgun for account e-mail verifications
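To give a feel for how thin the API layer is, here is a minimal routing sketch with httprouter. The endpoint path and handler are made up for illustration; they are not pixeldrain's actual API.

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/julienschmidt/httprouter"
)

// getFile is a hypothetical handler; in a real server it would stream the
// file from the cache or storage backend.
func getFile(w http.ResponseWriter, r *http.Request, ps httprouter.Params) {
	fmt.Fprintf(w, "would stream file %s here\n", ps.ByName("id"))
}

func main() {
	router := httprouter.New()
	router.GET("/api/file/:id", getFile) // hypothetical route for illustration
	http.ListenAndServe(":8080", router)
}
```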
Feel free to ask me more questions :-)
You mention that you have user-facing servers as well as storage servers. So do the user-facing servers act as reverse proxies for the storage servers, or do you simply serve a redirect?
I'd expect that the file access patterns are power-law distributions, i.e. recently uploaded files are requested more often than older files. If that's the case, can you use this property for sharding by having hot and cold storage servers?
How do you handle users uploading forbidden content? I see from another comment that you ban the usual types of illegal content. But in practice, do you manually review every mail you get on your abuse contact and take appropriate action? What's the most common type of complaint?
From a business perspective: How did you grow your site? I imagine the competition must be rough since file hosting is such a "simple" service.
> So do the user-facing servers act as reverse proxies for the storage servers, or do you simply serve a redirect?
Yes, my user-facing servers are proxying the files to the users.
> I'd expect that the file access patterns are power-law distributions
Exactly. On each server I keep an in-memory sorted slice which tracks how often each file is requested. The most popular files are cached on the SSDs and files which drop out of the cache are moved to HDD storage servers in Helsinki. This way I can serve 10 Gbps of data to the users while only putting a 1 Gbps load on the storage nodes.
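As a rough illustration of that mechanism (not the actual pixeldrain code), this is what a request-count based hot set could look like in Go; the type names and cache capacity are made up:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// hotSet counts how often each file is requested and keeps only the most
// popular ones in the SSD cache.
type hotSet struct {
	mu     sync.Mutex
	counts map[string]int // file ID -> request count
	limit  int            // how many files fit in the SSD cache
}

// hit records one request for a file.
func (h *hotSet) hit(id string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.counts[id]++
}

// coldFiles returns the IDs that fall outside the top `limit` most requested
// files; in the real system these would be moved to the HDD storage servers.
func (h *hotSet) coldFiles() []string {
	h.mu.Lock()
	defer h.mu.Unlock()
	ids := make([]string, 0, len(h.counts))
	for id := range h.counts {
		ids = append(ids, id)
	}
	// Sort by popularity, most requested first.
	sort.Slice(ids, func(i, j int) bool { return h.counts[ids[i]] > h.counts[ids[j]] })
	if len(ids) <= h.limit {
		return nil
	}
	return ids[h.limit:]
}

func main() {
	h := &hotSet{counts: map[string]int{}, limit: 2}
	for _, id := range []string{"a", "a", "a", "b", "b", "c"} {
		h.hit(id)
	}
	fmt.Println(h.coldFiles()) // "c" would be evicted to cold storage
}
```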
> How do you handle users uploading forbidden content?
I get a lot of abuse mails, mainly copyright violations. I have a mail server hooked into pixeldrain which scans mails from common copyright offices and automatically blocks the reported files. For other types of abuse I have a report button on the download page. Users report content which breaks the rules, and once the number of reports hits a certain threshold the file is blocked.
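The report threshold boils down to something like this sketch (the threshold value and names are made up, not pixeldrain's actual numbers):

```go
package main

import "sync"

const reportThreshold = 5 // hypothetical number of reports before a file is blocked

type moderation struct {
	mu      sync.Mutex
	reports map[string]int  // file ID -> number of user reports
	blocked map[string]bool // file ID -> no longer served
}

// report registers one user report and blocks the file once the threshold is hit.
func (m *moderation) report(fileID string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.reports[fileID]++
	if m.reports[fileID] >= reportThreshold {
		m.blocked[fileID] = true
	}
}

func main() {
	m := &moderation{reports: map[string]int{}, blocked: map[string]bool{}}
	for i := 0; i < reportThreshold; i++ {
		m.report("abcd1234") // hypothetical file ID
	}
}
```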
> From a business perspective: How did you grow your site?
Most file sharing sites are really terrible. Like almost unusable. It's pretty easy to beat the competition in UX. I just started using pixeldrain myself on reddit and other forums. In the beginning people complained in the comments about the ergonomics of the site and I listened carefully. Eventually when the kinks were worked out other people started using it too.
Thanks for satisfying my curiosity! Also, congrats on your success!
> Yes, my user-facing servers are proxying the files to the users.
I've never operated a service as large as yours, so take my question with a grain of salt: I'm wondering whether it would make sense to split off the actual file front-end servers from the user-facing servers (going for a redirect approach instead of proxying), since the requirements for serving the UI (low latency, low bandwidth) are so different from the file serving requirements (high bandwidth, but latency is not an issue). In theory, the traffic load from the files could negatively impact the UI latency leading to perceived sluggishness of the website. But perhaps that's not an issue in practice?
Since you mentioned elsewhere that you wanted to move to content delivery: What kind of content delivery do you have in mind? At the moment I can only think of either classic CDNs (but that's a few orders of magnitude larger) or ads (but that's an entirely different area).
Proxying the files has a number of benefits.
- The first is that I can have all my API endpoints under one domain. This simplifies downloading as you don't need to make a separate request to figure out where the file is stored.
- The storage servers that Hetzner sells only have 1 Gbps bandwidth. That runs out very quickly when a file goes viral. The 10 Gbps caching servers do a lot of heavy lifting here, which also makes sure the disks in the storage nodes last longer.
- I can also decide to switch to a different storage system on my storage nodes whenever I want. I have been considering deploying Reed-Solomon encoding for a while (see the sketch after this list). That would make it impossible to link directly to a single storage server, as a single file would be distributed across multiple servers.
- Sending out this much data uses a lot of RAM for TCP send buffers. Installing this RAM on a single content delivery node is cheaper than installing it on every storage server.
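For a feel of what Reed-Solomon encoding involves, here is a minimal sketch using the klauspost/reedsolomon Go library; the 10+3 shard layout is an arbitrary example, not a planned pixeldrain configuration:

```go
package main

import (
	"fmt"

	"github.com/klauspost/reedsolomon"
)

func main() {
	enc, err := reedsolomon.New(10, 3) // 10 data shards + 3 parity shards (example values)
	if err != nil {
		panic(err)
	}
	data := make([]byte, 1<<20) // a 1 MiB example file
	shards, err := enc.Split(data)
	if err != nil {
		panic(err)
	}
	if err := enc.Encode(shards); err != nil { // fill in the parity shards
		panic(err)
	}
	// Each shard could live on a different storage server, which is why a
	// direct link to one server would no longer return a complete file.
	// Any 10 of the 13 shards are enough to reconstruct it:
	shards[0], shards[4], shards[12] = nil, nil, nil // lose up to 3 shards
	if err := enc.Reconstruct(shards); err != nil {
		panic(err)
	}
	fmt.Println("file reconstructed from the remaining shards")
}
```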
To prevent the bandwidth load from affecting the UI speed I have a rate limiter on the download API which slows down when the uplink reaches 95% capacity. This way there is always some bandwidth left for the HTML and database communications.
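A simplified sketch of that kind of throttling (not pixeldrain's actual rate limiter; the chunk size, back-off and throughput source are assumptions):

```go
package main

import (
	"io"
	"strings"
	"time"
)

const (
	uplinkCapacity = 10e9 / 8 // 10 Gbps expressed in bytes per second
	threshold      = 0.95     // start throttling at 95% utilisation
)

// currentThroughput would report the server's outgoing bytes per second,
// e.g. from interface counters. It is a stub here.
var currentThroughput = func() float64 { return 0 }

// throttledCopy streams a file to the client in chunks and pauses briefly
// whenever the uplink is above the threshold, leaving headroom for HTML
// and database traffic.
func throttledCopy(dst io.Writer, src io.Reader) error {
	buf := make([]byte, 256<<10) // 256 KiB chunks
	for {
		n, err := src.Read(buf)
		if n > 0 {
			if _, werr := dst.Write(buf[:n]); werr != nil {
				return werr
			}
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		if currentThroughput() > threshold*uplinkCapacity {
			time.Sleep(10 * time.Millisecond) // back off while the link is saturated
		}
	}
}

func main() {
	// Demo: push some in-memory data through the throttled path.
	_ = throttledCopy(io.Discard, strings.NewReader("example payload"))
}
```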
With regards to content delivery: I want to use pixeldrain to serve static files. Nothing like the fancy site-wrapping tech that cloudflare uses. The idea is that users can have a file tree on pixeldrain somewhat like dropbox. They can copy the direct download link to that file and use it to embed videos, audio and pictures in their own websites. Because this is a lot simpler than other CDN services I can offer it at a very competitive price.
Check out SeaweedFS's Reed-Solomon implementation: https://github.com/chrislusf/seaweedfs/ Small files can still be served from a single server.
It's also efficient for small files, which an image store requires.