What does HackerNews think of snowplow-javascript-tracker?
Snowplow event tracker for client-side and server-side JavaScript. Add analytics to your websites, web apps and servers.
Nice post! It's always fun reading about people being creative and challenging the analytics status quo (aka GA). Besides the joy of doing it yourself, you've accomplished a couple other things worth mentioning:
1. You'll never be sampled. GA samples historical data pretty heavily, and you have to pay for 360 to retain unsampled event data (to the tune of $160k+ per year).
2. You have full access to all generated data.
I'd highly recommend using Snowplow's JavaScript tracker (https://github.com/snowplow/snowplow-javascript-tracker) in a very similar manner to what you've outlined here. You'll get a ton of extra functionality out of the box, which would add yet another level of insight. With Snowplow, you get the following for free:
1. Sessionization, consistent with Google Analytics' definition: effectively a 30-minute window of activity.
2. User identification - the tracker drops a persistent cookie (just like GA), so you can see returning visitors.
3. Tools for splitting requests
4. A variety of event types, out of the box: https://github.com/snowplow/snowplow/wiki/2-Specific-event-t...
5. Ability to respect Do Not Track
6. Time on page, browser width/height, etc
7. Ability to make your event tracking 100% first-party
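Getting most of that list is just configuration. Here's a minimal sketch of initializing the tracker with the tag-style snippet API — the collector hostname, app ID, and cookie domain are placeholders, and `window.snowplow` is stubbed as a plain command queue so the sketch is self-contained:

```javascript
// Hedged sketch: in a real page the Snowplow loader snippet defines
// window.snowplow as a command queue; we stub it here so this runs standalone.
const commandQueue = [];
const snowplow = (...args) => commandQueue.push(args);

// Placeholder collector/app/cookie values -- not from the comment above.
snowplow('newTracker', 'sp', 'collector.example.com', {
  appId: 'my-blog',
  platform: 'web',
  cookieDomain: '.example.com',  // persistent first-party cookie => returning visitors
  respectDoNotTrack: true        // honor the browser's Do Not Track setting
});
snowplow('trackPageView');       // queue a basic page view event
```

Pointing `collector.example.com` at a domain you own is also what makes the tracking 100% first-party.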
(Disclaimer: I don't work for them, but I've seen the system work very well a number of times.)
I'm running a similar setup on my blog, and it costs well under $1 per month: https://bostata.com/client-side-instrumentation-for-under-on.... I'm doing the exact same thing with CloudFront log forwarding and have several Lambdas that process the files in S3. From there, I visualize traffic stats with AWS Athena (but retain a ton of flexibility, since they are all structured log files).
Nice article! I did something very similar to this for my blog but used Snowplow's JavaScript tracker (https://github.com/snowplow/snowplow-javascript-tracker), a CloudFront distribution with S3 log forwarding, a couple of Lambda functions (with S3 "put" triggers), S3 as the post-processed storage layer, and AWS Athena as the query layer. The system costs under $1 per month, is very scalable, and is producing amazingly good, structured data with mid-level latency. I've written about it here:
https://bostata.com/post/client-side-instrumentation-for-und...
By using the Snowplow JavaScript tracker, you get a ton of functionality out of the box when it comes to respecting "do not track", structured event formatting, additional browser contexts, etc. If you want to see how the blog site is functionally instrumented, filter network requests by "stm" (sent time) and you'll see what's being collected.
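For reference, each tracker request is just a query string of protocol fields, which is why filtering on "stm" works. A simplified, hand-made example — field names follow the Snowplow tracker protocol (e = event type, aid = app ID, stm = sent timestamp), but the values are invented for illustration:

```javascript
// Invented example payload in the shape of a Snowplow tracker request.
const payload = 'e=pv&aid=blog&p=web&stm=1540000000000&url=https%3A%2F%2Fexample.com%2F';
const params = new URLSearchParams(payload);

console.log(params.get('e'));   // 'pv' -> a page view event
console.log(params.get('stm')); // the sent-time field you can filter on
```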
I've found (after setting up similar systems for 15+ companies of varying scale) that where a system like this breaks down is when you want to warehouse event data and tie it to other critical business metrics (Stripe, Salesforce, database tables that underpin the application, etc). It also starts to break down when you need low-latency data access. At that point it makes more and more sense to run data into a stream (Kinesis/Kafka/etc) and have "low latency" (a couple hundred ms or less) and "high latency" (minutes/hours/etc) points of centralization.
Using multi-AZ, replicated stream-based infrastructure (like Snowplow's Scala stuff) has been completely transformational at numerous companies I've set it up for. A single source of truth for both low-latency and med/high-latency client-side event data is absolutely massive. Secondly, being able to tie many sources of data together (via warehousing into Redshift or Snowflake) is eye-opening every single time. I've recently been running ~300k+ requests/minute through Snowplow's stream-based infrastructure and it's rock-solid.
Again, nice post! It's awesome to see people doing similar things. :)
Snowplow is a bit more generalized than Piwik, and it shows: Piwik has a more robust feature set specifically for website analytics. But Snowplow is a lot more useful if you have to merge a bunch of data sources together to get a picture of what's happening.
However, this event ID is not enough to identify and then dedupe all types of duplicate events. This blog post provides more information:
https://snowplowanalytics.com/blog/2015/08/19/dealing-with-d...
Big thanks to pragmacoders for putting this tutorial together! It's awesome seeing what you are doing with the Snowplow platform :-)