
Parcours Analytics: A Self-Hosted Web Analytics Platform
A production-style analytics app built with Django, FastAPI, PostgreSQL, Docker, and AWS
Project Summary
Parcours Analytics is a self-hosted web analytics platform I designed and built to track visitor behavior across websites without depending on a third-party analytics SaaS. It collects browser events through a lightweight JavaScript tracker, validates and buffers incoming traffic through a FastAPI ingestion service, processes events asynchronously with a Python worker, stores raw and derived analytics in PostgreSQL, and presents the results through an authenticated Django dashboard.
The system is intentionally split into distinct services: nginx acts as the public reverse proxy, Django owns users and web property management, one FastAPI service handles metrics ingestion, another FastAPI service serves dashboard query APIs, and a background worker performs enrichment, sessionization, and rollups. This architecture separates user-facing dashboard traffic from write-heavy analytics ingestion and slower background processing.
From an SRE perspective, Parcours is built around production-style concerns: containerized services, isolated database roles, separate application and metrics databases, persistent volumes, AWS-hosted deployment, secrets loaded from AWS Systems Manager Parameter Store, and a file-backed event buffer between ingestion and processing. The project demonstrates full-stack application design, backend data pipeline design, and practical operational thinking for hosting a multi-service web application in AWS.
Why I Built It
I built Parcours Analytics because Google Analytics felt far more complicated than what I actually wanted from a website analytics tool. For many small sites, blogs, and content projects, the important questions are straightforward: how many people visited, where did they come from, what pages did they read, how long did they stay, and what content performed well. Google Analytics can answer those questions, but it often feels like using an enterprise marketing platform when all you need is clear traffic and engagement data.
I also looked at simpler analytics products like Clicky and Fathom, which influenced the direction of the project. I liked the idea of a focused analytics tool with a smaller, more understandable feature set: fast dashboards, useful visitor/session data, referrers, countries, devices, and content performance without requiring a lot of configuration.
The other reason I built it was technical. I wanted a project that was more substantial than a typical CRUD app and closer to a real production system. Web analytics has interesting engineering problems: collecting high-volume browser events, validating tracked properties, buffering ingestion, enriching raw events, assigning sessions, aggregating data, and presenting it back through a dashboard. It gave me a practical way to design a multi-service application and operate it in AWS using patterns I care about as a site reliability engineer.
System Architecture
Parcours Analytics is built as a small multi-service application. I split the system into separate components because the parts of an analytics system have different workload profiles: collecting events should be fast and lightweight, processing events can happen asynchronously, and dashboard queries should be isolated from the ingestion path.
At a high level, the system looks like this:
Tracked Website
-> Parcours JavaScript Tracker
-> nginx
-> Metrics Ingestion API
-> File-backed Event Buffer
-> Background Worker
-> Metrics PostgreSQL Database
-> Dashboard API
-> Django Dashboard
The public entry point is nginx. It acts as the reverse proxy for the application and routes requests to the correct internal service. Normal dashboard traffic goes to Django, tracking events go to the metrics ingestion API, dashboard data requests go to the dashboard API, and generated browser tracking scripts are served as static JavaScript files.
Django is responsible for the user-facing application. It handles user accounts, authentication, web property management, and the dashboard pages. When a user adds a new website, Django creates a unique property ID and generates a custom tracking script for that property. That script is then served by nginx and installed on the tracked website.
The metrics ingestion API is a FastAPI service that receives browser events from tracked websites. It validates that the submitted property ID exists, captures request metadata such as IP address and user agent, sanitizes the incoming payload, and writes the event to a file-backed buffer. I intentionally kept this service lightweight so that the public tracking endpoint does as little work as possible.
A Python worker runs separately in the background. It reads buffered event files, enriches the data, writes raw events into the metrics database, assigns events to sessions, and builds derived records used by the dashboard. This includes things like visitor sessions, scroll-depth data, referrer information, device/browser details, country lookup, and content metadata.
The dashboard API is a second FastAPI service focused only on reporting queries. Django renders the authenticated dashboard pages, but the charts and tables call this API for analytics data. The dashboard API validates the user’s Django session, checks that the user owns the requested web property, and then queries the metrics database for visitor counts, referrers, devices, countries, landing pages, visitor journeys, and content performance.
The data layer is split into two PostgreSQL databases. The Django database stores application data such as users, sessions, and web properties. The metrics database stores analytics data such as raw visitor events, processed sessions, scroll events, referrer information, and aggregate reporting tables. This separation keeps application state and analytics workloads from interfering with each other and makes the service boundaries clearer.
In production, the services run as Docker containers on AWS, with nginx in front, persistent volumes for PostgreSQL data, and secrets loaded from AWS Systems Manager Parameter Store. This gave me a deployment model that was close to how I would approach a real internal service: isolated containers, explicit service boundaries, durable storage, and externalized configuration.
Data Collection Flow
The data collection flow starts when a user adds a website inside the Parcours dashboard. Django creates a new WebProperty record for that site and assigns it a unique property_id. That property_id becomes the key that ties the tracked website, incoming browser events, stored metrics, and dashboard queries together.
When the property is created, Django also generates a custom JavaScript tracking file for it. The script is based on a template, with the property ID and metrics endpoint injected into the final file. nginx serves these generated scripts from a static route, so a tracked site can load a script like:
<script defer src="https://app.useparcours.com/livelongandprosper/<property_id>.js"></script>
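As a rough illustration of the script generation step, the sketch below fills a template with a property ID and metrics endpoint and writes the result as <property_id>.js. The template body, the render_tracker name, and the allowed property ID characters are assumptions made for the example; the real template is not reproduced here.

```python
import re
from pathlib import Path
from string import Template

# Hypothetical tracker template; the real template's contents are not shown here.
TRACKER_TEMPLATE = Template(
    "(function () {\n"
    "  var PROPERTY_ID = '$property_id';\n"
    "  var ENDPOINT = '$metrics_endpoint';\n"
    "  navigator.sendBeacon(ENDPOINT, JSON.stringify({\n"
    "    property_id: PROPERTY_ID, event_type: 'page_view', page_url: location.href\n"
    "  }));\n"
    "})();\n"
)

# Assumption: property IDs only contain URL-safe characters.
PROPERTY_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def render_tracker(property_id: str, metrics_endpoint: str, out_dir: Path) -> Path:
    """Write <property_id>.js into out_dir and return its path."""
    if not PROPERTY_ID_PATTERN.fullmatch(property_id):
        # Refuse anything that could escape the static directory or the JS string.
        raise ValueError("unexpected characters in property_id")
    out_dir.mkdir(parents=True, exist_ok=True)
    script_path = out_dir / f"{property_id}.js"
    script_path.write_text(
        TRACKER_TEMPLATE.substitute(
            property_id=property_id, metrics_endpoint=metrics_endpoint
        ),
        encoding="utf-8",
    )
    return script_path
```

Validating the ID before it is used as a file name keeps a malformed value from writing outside the directory that nginx serves.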
Once installed on a website, the tracker sends events back to Parcours using the browser’s sendBeacon API. It records basic page and engagement events, including:
- page_view when the page loads
- page_exit when the visitor leaves
- ping after the visitor has stayed on the page for a period of time
- scroll depth as the visitor moves through the page
Each event includes information such as the property ID, visitor ID, page URL, page title, referrer, browser language, locale, duration on page, scroll depth, and timestamp. The visitor ID is generated in the browser from a lightweight fingerprint using values like user agent, screen size, language, timezone, hardware concurrency, and device memory.
For WordPress sites, the Parcours plugin adds extra page metadata before loading the tracking script. This includes the page type, author, categories, tags, and WordPress page flags such as whether the page is a home page, single post, page, or archive. That allows the analytics system to report not only on URLs, but also on content structure and topic performance.
Incoming events are sent to the /metrics endpoint. nginx forwards those requests to the FastAPI metrics ingestion service and passes along useful request metadata, including the original IP address and user agent. The ingestion API validates the submitted property_id against the Django database before accepting the event, so random or invalid property IDs are rejected.
After validation, the ingestion service normalizes and sanitizes the payload. It cleans the referrer URL, bounds scroll depth to a valid percentage, captures the visitor IP address, and records the event timestamp in UTC. Instead of writing directly to PostgreSQL, it writes each accepted event as a JSON file into a shared visitor_data directory.
That file-backed buffer is an intentional design choice. It keeps the public tracking endpoint fast and reduces the amount of synchronous work required during event collection. The ingestion API’s job is simply to validate, clean, and durably stage the event. The heavier work, such as user-agent parsing, GeoIP lookup, bot detection, database inserts, sessionization, and rollups, is handled later by the background worker.
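The sanitize-and-stage step described above can be sketched roughly as follows. This is a minimal illustration rather than the service's actual code: the function names, the exact set of staged fields, and the write-to-temp-then-rename step are assumptions, and property validation is assumed to have happened before stage_event is called.

```python
import json
import math
import uuid
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def clean_referrer(raw: Optional[str]) -> Optional[str]:
    """Keep scheme, host, and path; drop the query string and fragment."""
    if not raw:
        return None
    try:
        parts = urlsplit(raw.strip())
    except ValueError:
        return None
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None  # invalid referrers are rejected rather than stored
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, "", ""))


def clamp_scroll_depth(value) -> Optional[float]:
    """Bound scroll depth to 0..100; return None when it is not a number."""
    try:
        depth = float(value)
    except (TypeError, ValueError):
        return None
    if math.isnan(depth):
        return None
    return max(0.0, min(100.0, depth))


def stage_event(payload: dict, ip: str, user_agent: str, buffer_dir: Path) -> Path:
    """Write one accepted event as a JSON file and return the file path."""
    event = {
        "property_id": payload["property_id"],
        "visitor_id": payload.get("visitor_id"),
        "event_type": payload.get("event_type", "page_view"),
        "page_url": payload.get("page_url"),
        "referrer": clean_referrer(payload.get("referrer")),
        "scroll_depth": clamp_scroll_depth(payload.get("scroll_depth")),
        "ip": ip,
        "user_agent": user_agent,
        "received_at": datetime.now(timezone.utc).isoformat(),
    }
    buffer_dir.mkdir(parents=True, exist_ok=True)
    final_path = buffer_dir / f"{uuid.uuid4().hex}.json"
    tmp_path = final_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(event), encoding="utf-8")
    # Assumption: rename into place so a reader never sees a half-written file.
    tmp_path.replace(final_path)
    return final_path
```

Nothing in this path touches the metrics database, which is what keeps the request short.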
Event Processing Pipeline
After events are accepted by the metrics ingestion API, they are not written directly into the analytics database. Instead, each event is staged as a JSON file in a shared visitor_data directory. A separate Python worker is responsible for turning those staged files into structured analytics data.
The worker runs periodically and processes event files in modification-time order. For each JSON file, it validates that the required fields are present, parses the payload, and extracts optional fields such as scroll depth, page type, author, categories, tags, and WordPress page flags. If the file is malformed or missing required data, the worker leaves it in place so it can be inspected instead of silently dropping the event.
Once the event is parsed, the worker enriches it before inserting it into PostgreSQL. This enrichment includes:
- parsing the user agent into browser and operating system fields
- normalizing the referrer into a referrer domain
- detecting likely bot traffic
- looking up the visitor’s country from their IP address using GeoIP
- preserving page metadata such as title, page type, categories, tags, and author
- carrying through scroll-depth and duration data
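Two of the enrichment steps listed above, referrer-domain normalization and bot detection, can be sketched with the standard library alone. The marker list and function names are hypothetical, and the keyword heuristic is only a stand-in for whatever detection the worker actually uses; user-agent parsing and GeoIP lookup are left out because they depend on external libraries and data files.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical marker list for illustration; real bot detection is usually broader.
BOT_MARKERS = ("bot", "crawler", "spider", "headless", "curl", "python-requests")


def referrer_domain(referrer_url: Optional[str]) -> Optional[str]:
    """Reduce a referrer URL to a bare domain such as 'google.com'."""
    if not referrer_url:
        return None
    try:
        host = urlsplit(referrer_url).hostname  # lower-cased, without the port
    except ValueError:
        return None
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host


def is_probable_bot(user_agent: Optional[str]) -> bool:
    """Flag user agents that contain a known automation marker."""
    if not user_agent:
        return True  # assumption: a missing user agent is treated as non-human
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def enrich(event: dict) -> dict:
    """Return a copy of the event with the derived fields added."""
    enriched = dict(event)
    enriched["referrer_domain"] = referrer_domain(event.get("referrer"))
    enriched["is_bot"] = is_probable_bot(event.get("user_agent"))
    return enriched
```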
The enriched event is inserted into the visitor_events_raw table in the metrics database. This table acts as the canonical raw event store for the analytics system. It contains both the original tracking information and the derived fields needed for reporting, filtering, and later aggregation.
After a file is successfully written to the database, the worker moves it into an archive_data directory. This gives the system a simple audit trail and a practical recovery path: processed event files can be reviewed or replayed if needed. If a database write fails, the file remains in visitor_data and can be retried on the next worker run.
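The file lifecycle described above (process in modification-time order, leave bad or failed files in place, archive on success) can be sketched like this. The required-field list and the insert_event callback are assumptions for the example; in the real worker the callback would be the PostgreSQL insert.

```python
import json
import shutil
from pathlib import Path
from typing import Callable

# Assumed minimum set of fields; the real worker's required list may differ.
REQUIRED_FIELDS = ("property_id", "visitor_id", "event_type", "page_url")


def process_buffer(
    buffer_dir: Path, archive_dir: Path, insert_event: Callable[[dict], None]
) -> dict:
    """Process staged JSON files oldest-first and report what happened."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    counts = {"archived": 0, "skipped": 0, "failed": 0}
    files = sorted(buffer_dir.glob("*.json"), key=lambda p: p.stat().st_mtime)
    for path in files:
        try:
            event = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            counts["skipped"] += 1  # malformed file stays in place for inspection
            continue
        if not isinstance(event, dict) or any(not event.get(f) for f in REQUIRED_FIELDS):
            counts["skipped"] += 1  # missing required data, also left in place
            continue
        try:
            insert_event(event)
        except Exception:
            counts["failed"] += 1  # write failed; the file is retried next run
            continue
        shutil.move(str(path), str(archive_dir / path.name))
        counts["archived"] += 1
    return counts
```

Because a file is only moved after the insert returns, a crash between the two steps can replay an event but cannot lose one, so a real insert would likely need to tolerate duplicates.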
Once raw events are inserted, the worker performs additional processing passes. One major step is session assignment. Since browser events arrive independently, the worker groups events into sessions using the visitor ID and a 30-minute inactivity window. Events from the same visitor are stitched into the same session unless the time gap between events exceeds 30 minutes, in which case a new session ID is created.
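A minimal in-memory sketch of the 30-minute rule looks like the following. The field names (visit_time as a datetime, visitor_id, property_id) and the choice to key sessions on property as well as visitor are assumptions; the real worker operates on rows in the metrics database rather than Python lists.

```python
import uuid
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)


def assign_sessions(events: list[dict]) -> list[dict]:
    """Return events ordered by visitor and time, each with a session_id."""
    ordered = sorted(
        events, key=lambda e: (e["property_id"], e["visitor_id"], e["visit_time"])
    )
    result = []
    prev_key = None
    prev_time = None
    session_id = None
    for event in ordered:
        key = (event["property_id"], event["visitor_id"])
        # Start a new session for a new visitor or when the gap exceeds 30 minutes.
        if key != prev_key or event["visit_time"] - prev_time > SESSION_GAP:
            session_id = uuid.uuid4().hex
        result.append({**event, "session_id": session_id})
        prev_key, prev_time = key, event["visit_time"]
    return result
```

The gap is measured from the previous event, not from the start of the session, so a visitor who keeps interacting stays in one session however long the visit lasts.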
The worker then builds higher-level records from those sessionized events. It populates a visitor_sessions table with session start and end time, duration, total events, pageview count, bounce status, entry page, exit page, referrer domain, country, browser, operating system, locale, and entry page type. This gives the dashboard a clean session-level view without having to recompute sessions from raw events on every request.
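To show how a session row can be derived from its events, here is a small sketch that computes a subset of those fields. The field names and the bounce definition (at most one pageview in the session) are assumptions for illustration, not a statement of how the worker defines them.

```python
from datetime import datetime, timedelta


def summarize_session(events: list[dict]) -> dict:
    """Build one session-level record from all events sharing a session_id."""
    if not events:
        raise ValueError("a session needs at least one event")
    ordered = sorted(events, key=lambda e: e["visit_time"])
    pageviews = [e for e in ordered if e["event_type"] == "page_view"]
    first, last = ordered[0], ordered[-1]
    return {
        "session_id": first["session_id"],
        "session_start": first["visit_time"],
        "session_end": last["visit_time"],
        "duration_seconds": int(
            (last["visit_time"] - first["visit_time"]).total_seconds()
        ),
        "total_events": len(ordered),
        "pageview_count": len(pageviews),
        # Assumed definition: a bounce is a session with at most one pageview.
        "is_bounce": len(pageviews) <= 1,
        "entry_page": pageviews[0]["page_url"] if pageviews else first["page_url"],
        "exit_page": pageviews[-1]["page_url"] if pageviews else last["page_url"],
    }
```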
The pipeline also extracts specialized reporting data. For example, scroll-depth events are copied into a visitor_scroll_events table so the dashboard can report how far visitors read on each page. The worker also syncs web property timezones from the Django database into the metrics database, which allows reporting queries to calculate day and hour boundaries in the site owner’s local timezone.
The overall goal of the pipeline is to keep collection simple and fast while moving expensive work into the background. The ingestion API only validates and stages events. The worker handles enrichment, database writes, sessionization, and derived reporting tables. This makes the system easier to operate because failures in processing do not immediately break event collection, and each stage has a clear responsibility.
Dashboard And Query Layer
The dashboard side of Parcours is split between Django and a separate FastAPI service. Django owns the authenticated web application: user login, signup, password reset, account preferences, web property management, and the HTML dashboard pages. The dashboard API owns the analytics queries that power the charts and tables.
This split keeps the responsibilities clear. Django is good at user-facing application concerns like sessions, templates, forms, and authentication. The FastAPI dashboard service is focused on reading from the metrics database and returning structured JSON for the dashboard UI.
When a user logs in, Django loads the web properties they own and stores that list in the session. The user can then open an overview dashboard, add a new property, edit a property’s settings, or drill into a detailed dashboard for a specific property_id.
The detailed dashboard page is rendered by Django, but the analytics data itself is loaded through API calls. Browser-side JavaScript calls the dashboard API through nginx under the /dashboard/... route. nginx forwards those requests to the FastAPI dashboard service.
Before returning analytics data, the dashboard API performs two important checks. First, it validates the Django session cookie by looking up the session in the Django database. Second, it verifies that the requested property_id belongs to the authenticated user. This prevents a user from querying analytics for a property they do not own, even if they know or guess another property ID.
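The two checks can be sketched as a single function. In this illustration plain dictionaries stand in for the Django session and web property lookups, and SessionRecord, AccessDenied, and authorize are hypothetical names; the real service reads these values from the Django database.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class SessionRecord:
    user_id: int
    expires_at: datetime


class AccessDenied(Exception):
    """Raised when a request should not see the requested property."""


def authorize(
    session_key: Optional[str],
    property_id: str,
    sessions: dict[str, SessionRecord],  # stand-in for the Django session lookup
    owners: dict[str, int],  # stand-in for property_id -> owning user_id
    now: Optional[datetime] = None,
) -> int:
    """Return the authenticated user_id, or raise AccessDenied."""
    now = now or datetime.now(timezone.utc)
    record = sessions.get(session_key or "")
    if record is None or record.expires_at <= now:
        raise AccessDenied("invalid or expired session")
    # Same error whether the property is missing or owned by someone else,
    # so a guessed property ID reveals nothing about its existence.
    if owners.get(property_id) != record.user_id:
        raise AccessDenied("property not available")
    return record.user_id
```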
The dashboard API then queries the metrics database for the requested report. It exposes endpoints for visitor counts, browser breakdowns, device breakdowns, locale and country data, traffic sources, visitor lists, visitor timelines, top referrers, landing pages, content lists, most viewed pages, average time on page, scroll depth, and topic performance by category or tag.
Most of these queries are built around the processed analytics tables created by the worker, especially visitor_events_raw, visitor_sessions, visitor_scroll_events, and hourly/page aggregate tables. This means the dashboard does not have to reconstruct every metric from raw browser events on each page load. Some reports can use session-level or pre-aggregated data, while more detailed reports can still drill into the raw event table when needed.
The dashboard API also handles common reporting concerns such as date ranges, comparison periods, filters, time zones, referrer grouping, browser/device grouping, and percentage-change calculations. This keeps the Django templates focused on presentation while the API service owns the reporting logic.
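Comparison periods and percentage change come down to a little date arithmetic. The sketch below assumes inclusive date ranges and compares against the equally long range immediately before; the function names and the one-decimal rounding are choices made for the example.

```python
from datetime import date, timedelta
from typing import Optional


def previous_period(start: date, end: date) -> tuple[date, date]:
    """Return the same-length inclusive range that ends the day before start."""
    length_days = (end - start).days + 1
    prev_end = start - timedelta(days=1)
    prev_start = prev_end - timedelta(days=length_days - 1)
    return prev_start, prev_end


def percent_change(current: float, previous: float) -> Optional[float]:
    """Percentage change from previous to current, or None if undefined."""
    if previous == 0:
        return None  # no baseline, so the UI can show a dash instead of a number
    return round((current - previous) / previous * 100.0, 1)
```

For a range of March 8 to March 14, the comparison period is March 1 to March 7, and going from 120 to 150 visitors is a change of 25.0 percent.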
Overall, the dashboard/query layer is designed so that user management, page rendering, and analytics querying are related but separate concerns. Django provides the authenticated product experience, while FastAPI provides a dedicated analytics read API backed by the metrics database.
Database Design
Parcours uses two PostgreSQL databases: one for application data and one for analytics data. I made that split because the two workloads are very different. The application database stores relatively small, transactional data such as users, sessions, and web properties. The metrics database stores higher-volume event data and derived analytics tables.
The Django database is the source of truth for user-facing application state. It stores user accounts, authentication sessions, password reset state, and web property records. The most important application-level table is the web property table, which maps a user to a tracked website and its generated property_id. That property_id is the identifier used throughout the rest of the system.
The metrics database stores the analytics pipeline output. Its central table is visitor_events_raw, which acts as the canonical event store. Each row represents a browser event after it has been accepted by the ingestion API and enriched by the worker. It includes fields such as property ID, visit time, visitor ID, page URL, page title, referrer URL, referrer domain, event type, user agent, language, locale, duration, scroll depth, bot flag, country, page type, author, categories, tags, browser, operating system, and session ID.
On top of the raw event table, the worker builds derived tables that are easier for the dashboard to query. The most important one is visitor_sessions, which stores one row per session. It includes session start and end time, duration, total events, pageview count, bounce status, entry page, exit page, referrer domain, country, browser, operating system, locale, and entry page type. This avoids having to recalculate session boundaries from raw events every time the dashboard loads.
There are also specialized tables for specific reporting needs. For example, visitor_scroll_events stores scroll-depth data by session and page, which supports engagement reports about how far visitors read. web_property_timezones mirrors property timezones from the Django database into the metrics database, allowing analytics queries to calculate local dates and hours correctly. Other tables support hourly aggregation, page-level visit aggregation, visitor device details, locale data, referrer data, and visitor journey reporting.
The service permissions are intentionally separated. Django writes to the Django database. The metrics ingestion API only needs enough access to validate that a submitted property ID exists. The worker writes to the metrics database and reads web property metadata from the Django database. The dashboard API reads from the metrics database and validates user sessions and property ownership against the Django database. This keeps each service’s database access aligned with its responsibility.
This design gives Parcours both a raw event history and query-friendly reporting tables. The raw table is useful for debugging, replaying, and building new reports later. The derived tables make the dashboard faster and simpler because common concepts like sessions, scroll events, and hourly totals are already materialized.
AWS And Deployment
Parcours is currently deployed on a single AWS EC2 instance. I intentionally kept the hosting model simple for this version: one VM runs the full application stack with Docker Compose. I considered using ECS, and may move in that direction later, but for this stage a single EC2 instance gave me enough control to build and operate the system without adding orchestration complexity too early.
Each major component runs as its own Docker container: nginx, Django, the metrics ingestion API, the dashboard API, the background worker, the Django PostgreSQL database, and the metrics PostgreSQL database. Docker Compose defines the service relationships, networking, port mappings, environment variables, and persistent volumes. This gives the project a clean local-to-production workflow while still keeping the services isolated from each other.
nginx is the public entry point on the EC2 instance. It listens on HTTP/HTTPS ports and routes traffic to the appropriate internal container. Dashboard requests go to Django, /metrics requests go to the ingestion API, /dashboard/... requests go to the dashboard API, and generated tracking scripts are served from a static directory. This keeps the public interface simple even though the app is made up of several backend services.
Secrets are stored outside the repo in AWS Systems Manager Parameter Store. On startup, the deployment script retrieves values such as the Django secret key, database passwords, dashboard API credentials, worker credentials, metrics API credentials, and Mailgun API key. Those values are exported into the environment before Docker Compose starts the containers. This avoids baking secrets into images or committing them to source control.
The two PostgreSQL containers use mounted volumes for persistence. The Django database stores application state such as users, sessions, and web properties. The metrics database stores raw analytics events and processed reporting tables. Keeping those volumes outside the container lifecycle allows containers to be rebuilt or restarted without losing data.
The deployment is deliberately straightforward: build the Docker images, fetch runtime secrets from AWS, and start the stack with Docker Compose. That made it easier to focus on service boundaries, networking, data persistence, and operational behavior before introducing a larger orchestration layer. A future version could move the services into ECS, replace the database containers with RDS, and define the infrastructure with Terraform or CloudFormation.
From an SRE perspective, this deployment still exercises many real operational concerns: containerized services, reverse proxy routing, persistent storage, environment-specific configuration, externalized secrets, service-specific database credentials, and the tradeoffs of running a multi-service application on a single host.
Reliability-Oriented Design Choices
One of my main goals with Parcours was to keep the system highly decoupled. I did not want the dashboard, ingestion path, and background processing pipeline to all fail together. Analytics collection is the most important path in the system, so I designed it to keep accepting events even if other parts of the application are unhealthy.
The clearest example is the separation between the Django frontend and the metrics ingestion API. Django handles users, authentication, web property management, and dashboard pages. The ingestion API is a separate FastAPI service behind nginx. If the Django application were unavailable, existing tracking scripts could keep sending events to /metrics, and the backend could keep collecting analytics data, provided the Django database that the ingestion API uses for property validation is still reachable.
I also separated event collection from event processing. The ingestion API does not try to enrich events, assign sessions, calculate aggregates, or run expensive database queries during the browser request. Its job is intentionally narrow: validate the property ID, sanitize the payload, capture request metadata, and write the event to a JSON file. That keeps the public tracking endpoint lightweight and reduces the chance that slow processing work will affect collection.
The file-backed event buffer is an important reliability choice. If the worker process fails, events can still accumulate as JSON files in visitor_data. If a file cannot be processed because it is malformed or a database write fails, the worker leaves it in place instead of deleting it. Successfully processed files are moved to archive_data, which gives me a simple audit trail and a practical recovery path for debugging or replaying events.
The worker is responsible for slower and more failure-prone tasks: user-agent parsing, GeoIP lookup, bot detection, session assignment, scroll-depth extraction, and rollups. By moving this work out of the ingestion path, the system can absorb temporary processing failures without immediately losing incoming traffic data.
I made a similar separation on the dashboard side. Django renders the authenticated application and handles “Django things”: login, sessions, forms, templates, account settings, and web property management. The dashboard’s analytics data comes from a separate FastAPI query service. This keeps Django focused and helps the frontend remain fast, because the heavier reporting logic lives in a dedicated read API backed by the metrics database.
The same service boundaries would also make the system easier to scale horizontally later. Today the stack runs on a single EC2 instance with Docker Compose, but the architecture maps cleanly to ECS. The ingestion API could be scaled out by running multiple tasks behind a load balancer, and the worker layer could be scaled separately by running additional worker tasks as event volume grows. Because ingestion, processing, dashboard rendering, and dashboard queries are already separate services, scaling one part of the system would not require scaling everything together.
The database design also supports this decoupling. Application data and analytics data live in separate PostgreSQL databases. The Django database stores users, sessions, and properties. The metrics database stores raw events, sessions, and reporting tables. Each service has only the database access it needs, which reduces coupling and limits the impact of failures or mistakes in one part of the system.
Overall, the reliability strategy is based on clear boundaries and graceful degradation. The dashboard can fail without necessarily stopping collection. The worker can fail without immediately losing events. Expensive processing can lag behind ingestion without blocking it, and the user-facing Django app is kept separate from the high-volume analytics pipeline. For a small system currently running on a single EC2 instance, those choices give Parcours a better failure model today and a clear path toward ECS-based horizontal scaling later.
WordPress Integration
I also built a small WordPress plugin for Parcours so that a site owner can install tracking without manually editing theme files. The plugin adds a settings page where the user enters their Parcours property ID, then injects the correct tracking script into public pages.
The plugin also adds a small JSON metadata block alongside the script. That metadata includes WordPress-specific context such as page type, author, categories, tags, and whether the current page is a post, page, home page, or archive. The browser tracker includes this data with each event, which lets Parcours report on content performance by category, tag, author, and page type instead of only reporting raw URLs.
Security And Privacy Considerations
Parcours is designed so that analytics data is tied to authenticated users and their own web properties. The Django app owns user accounts, login sessions, password reset, and property management. When the dashboard API receives a request for analytics data, it validates the Django session cookie and checks that the requested property_id belongs to the authenticated user before returning results.
The ingestion path also performs validation before accepting events. Incoming browser events must include a valid property_id, and the metrics ingestion API checks that property against the Django database. This prevents the system from blindly accepting events for arbitrary or nonexistent properties.
Secrets are kept outside the repository and loaded at runtime from AWS Systems Manager Parameter Store. This includes the Django secret key, database passwords, service-specific database credentials, and the Mailgun API key. The application is configured through environment variables rather than committed secrets.
Database access is separated by service role. Django uses the application database for user and property data. The metrics API only needs enough access to validate web properties. The worker writes to the metrics database and reads property metadata. The dashboard API reads metrics data and validates users against the Django database. This reduces the blast radius of any single service credential.
The ingestion API also sanitizes incoming data before staging it for processing. Referrer URLs are normalized, query strings and fragments are stripped, invalid referrers are rejected, and scroll depth is bounded to a valid percentage range. The worker also detects bot traffic and stores that classification separately so reporting can filter or reason about non-human traffic.
From a privacy perspective, the system is intentionally focused on first-party analytics. It does not depend on Google Analytics or another third-party analytics SaaS to collect visitor behavior. That said, it still handles sensitive data such as IP addresses, user agents, referrers, and visitor identifiers, so future improvements would include stronger retention controls, IP anonymization, rate limiting, stricter CORS rules, and clearer user-facing privacy settings.
Technical Challenges
One of the main challenges was designing the ingestion path so it could stay fast and reliable. Browser tracking requests should not be slowed down by database writes, user-agent parsing, GeoIP lookup, session calculations, or reporting rollups. I solved this by keeping the ingestion API narrow: validate the property, sanitize the event, and write it to a JSON buffer. Everything heavier happens later in the worker.
The hardest technical problem was handling time correctly. Parcours receives events from visitors anywhere in the world, but the dashboard needs to make sense from the property owner’s point of view. If a site owner is in Vancouver, they expect “today,” “yesterday,” and hourly charts to be based on Vancouver time, not UTC and not the visitor’s local timezone.
To make that work, inbound events are stored with UTC timestamps, and each web property has its own configured timezone. The worker syncs those property timezones into the metrics database so reporting queries can translate UTC event times into the property’s local time when building date ranges and charts.
This became especially important for dashboard filters like today, yesterday, last 7 days, and last 30 days. Those ranges rarely line up with UTC day boundaries. An event that belongs to “today” for a site owner in North America may already carry tomorrow’s date in UTC, while for an owner east of UTC an early-morning event still carries yesterday’s UTC date. The dashboard queries had to account for that instead of relying on simple UTC date comparisons.
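The boundary arithmetic can be illustrated in a few lines of Python. This is only a sketch of the idea, with a hypothetical function name; the actual reporting queries may well do the equivalent conversion in SQL using the synced property timezones.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def local_day_bounds_utc(
    now_utc: datetime, tz: tzinfo, days_ago: int = 0
) -> tuple[datetime, datetime]:
    """Return the [start, end) UTC range of a local calendar day for a property.

    now_utc must be timezone-aware; days_ago=0 is "today", 1 is "yesterday".
    """
    local_date = now_utc.astimezone(tz).date() - timedelta(days=days_ago)
    start_local = datetime.combine(local_date, time.min, tzinfo=tz)
    end_local = datetime.combine(local_date + timedelta(days=1), time.min, tzinfo=tz)
    return start_local.astimezone(timezone.utc), end_local.astimezone(timezone.utc)


# Example with a fixed UTC-8 offset (Vancouver in winter).
pst = timezone(timedelta(hours=-8))
now = datetime(2024, 1, 15, 3, 0, tzinfo=timezone.utc)  # still Jan 14 locally
start, end = local_day_bounds_utc(now, pst)
print(start.isoformat(), end.isoformat())
```

Passing zoneinfo.ZoneInfo("America/Vancouver") instead of a fixed offset lets the same function follow daylight saving changes, because the start and end of the day are each resolved to their own UTC offset.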
The end result is that Parcours stores events in a consistent global format, but reports them in the timezone that matters to the property owner. That was one of the trickier parts of making the analytics feel correct rather than just technically stored.
What I Learned
This project reinforced how important clear service boundaries are in a production-style application. Keeping ingestion, processing, dashboard rendering, and dashboard queries separate made the system easier to reason about and gave each component a specific failure mode.
I also learned that analytics systems are much more about data modeling than charts. Raw events are useful, but dashboards need sessions, time ranges, rollups, referrer groups, scroll-depth records, and timezone-aware reporting to feel accurate and useful.
The biggest practical lesson was around time. Storing events in UTC is the right foundation, but user-facing reports still need to be calculated in the property owner’s timezone. Getting that right affected the worker, the database design, and the dashboard query layer.
Future Improvements
The main improvement I would make next is moving the containerized services from a single EC2 instance to ECS. The application is already split into separate containers, so ECS would be a natural next step. It would let me scale the metrics ingestion API, dashboard API, Django app, and worker processes independently instead of scaling the entire host as one unit.
The second major improvement would be replacing the file-backed ingestion buffer with Redis. Today, the metrics API writes incoming events to JSON files and the worker picks them up from disk. That works well for a simple single-host deployment, but Redis would make the ingestion pipeline cleaner and easier to scale across multiple containers or hosts. The ingestion API could push events into Redis, and one or more workers could consume from that queue.
Together, those changes would make the architecture better suited for horizontal scaling: multiple ingestion containers accepting traffic, multiple worker containers processing events, and a shared queue between them instead of a shared filesystem.
