Share:

How Cloudflare Built Town Lake, Its Data Platform, and the Skipper AI Agent

Cloudflare processes more than one billion events per second across a network that already spans over 330 cities in more than 120 countries. Behind every HTTP request, every Workers invocation, and every R2 read, there is an absurd amount of data being generated all the time. But for many years, this ocean of information was hard to access in a simple, reliable way.

The data was scattered across dozens of production databases, ClickHouse clusters, Kafka streams, buckets on external clouds, BigQuery datasets, and an endless queue of one-off pipelines. Seemingly simple questions, like finding how many domains that signed up today are already in the Top 100 by traffic, required knowing exactly:

  • which system to query,
  • which credentials to use,
  • which query language to write,
  • and whether the data was complete, sampled, or delayed by several days.

In practice, that made it much harder to extract high-quality insights.

To tackle this problem head-on, Cloudflare created two internal tools that today form the backbone of the company’s data strategy:

  • Town Lake, the unified data analytics platform;
  • Skipper, an artificial intelligence agent that runs on top of it.

Town Lake offers a single SQL interface for practically everything Cloudflare knows about its traffic and products. Skipper lets anyone with the right permissions ask questions in natural language and get auditable answers in seconds.

From Data Chaos to a Unified View

Anyone who has worked at a hyper-growth company knows what data sprawl looks like. For Cloudflare, a few symptoms were obvious:

  • Too many systems, disconnected from each other. A product engineer trying to investigate a customer issue had to access Postgres for account metadata, ClickHouse for analytics events, BigQuery for usage aggregations, R2 for raw logs, and Kafka topics for real-time signals. Each system with its own login, its own syntax, and its own retention policy.
  • Sampled data in scenarios where that simply cannot happen. For dashboards, sampling traffic makes sense. But for billing, security, or sensitive investigations, it does not. Cloudflare’s analytics pipeline has to deal with hundreds of millions of events per second, but billing requires an exact count, not an estimate.
  • Reliance on external vendors for critical internal data. Part of the reporting layer came from services running on other clouds, which meant financial cost and also a strong dependency on third-party infrastructure to operate strategic data.
  • Difficulty finding the source of truth. Even with the right credentials, you had to know exactly which table to query, in which cluster, with which join, using which internal identifier. A lot of this lived as tribal knowledge, concentrated in a few people.

On top of that, the internal culture treated data infrastructure as backstage plumbing instead of seeing it as a critical part of Cloudflare’s own platform.

The goal became very clear: build a single place where anyone with the right access profile could answer questions like:

  • Show me the top 100 customers by revenue in the last quarter.
  • List all Bot Management scoring events with a score higher than 0.9 in the last 48 hours from a specific ASN.
  • Find the top 100 billing-related support tickets from customers who have already spent more than 100 dollars.

This place needed to deliver fresh, precise, unsampled data when the topic was sensitive (such as billing or security) and, at the same time, fast and possibly sampled data when the need was exploration or dashboards.

Another basic requirement: security and governance built in from day one. That means:

  • automatic PII detection,
  • sensitive tables locked down by default,
  • auditable access,
  • time-limited permissions.

And all of this had to be built on Cloudflare’s own platform: R2 for storage, Workers for compute, Cloudflare Access for authentication, Workflows for orchestration. If they were going to invest heavily in data, it made sense to build on the same products the company offers to its customers.

Finally, there was a more ambitious goal: an interface that did not require SQL knowledge. The idea was that any authorized person should be able to see the data flows running through Cloudflare’s network, not just analysts and data engineers.

That last wish eventually became Skipper.

Town Lake: Cloudflare’s Lakehouse

At the core of the data architecture, Cloudflare adopted the data lakehouse pattern: a query engine that reads from object storage, with a metadata layer that makes that structure behave like a database. This platform was named Town Lake, inspired by an urban lake in Austin, Texas.

The main components are:

Query Engine with Apache Trino

Trino is the execution brain. With it, a single SQL query can join, at runtime, tables from Postgres, ClickHouse, and Iceberg data stored in R2, without having to copy or materialize everything in an intermediate system.

Example: a question like who are the top 100 paying customers by Workers requests this week becomes an execution plan that:

  • applies filters and aggregations in ClickHouse,
  • joins with account dimensions in Postgres,
  • combines everything with billing rollups in Iceberg tables on R2,
  • and returns the ranking in a single pass.

R2 Data Catalog with Apache Iceberg

R2 Data Catalog is the managed service that hosts cold and warm data using Apache Iceberg. Iceberg brings several key features:

Receive the best innovation content in your email.

All the news, tips, trends, and resources you're looking for, delivered to your inbox.

By subscribing to the newsletter, you agree to receive communications from Método Viral. We are committed to always protecting and respecting your privacy.

  • schema evolution without breaking consumers,
  • time travel to query a table’s state at a past point in time,
  • partition evolution,
  • data compaction as it ages.

This enables a strategy where:

  • per-minute usage for the last week is stored at high resolution,
  • then rolled up to hourly aggregates,
  • and eventually consolidated into daily summaries.

Storage cost tracks how recent the data is, but the data remains fully queryable. Parquet files on R2 are far cheaper than keeping all of that in a hot OLAP database.

DataHub as the Metadata Catalog

DataHub is the central catalog. It stores information about:

  • all tables and columns,
  • data types,
  • owners and maintainers,
  • lineage between sources and derived tables,
  • glossary entries and human annotations.

When someone wants to know what exists in a table like townlake.dim.accounts, DataHub responds with a description, column documentation, the responsible team, and even which tables feed this dimension and which depend on it.

Lifeguard: Access Control and Policies

Lifeguard is the access-control service. It:

  • stores rules in D1,
  • pulls user and group information in real time from the internal identity system,
  • generates a combined JSON policy that Trino reads over HTTP.

The same Lifeguard powers Skipper and the access gateway, ensuring that blocking happens as early as possible, before a query even runs.

Skimmer: Continuous PII Detection

Skimmer acts as a constant scanner to find PII. It runs continuously, sampling rows from all columns in all tables and using Workers AI to classify whether each column contains personal or sensitive information.

The process has two steps:

  1. A fast per-column classifier does the initial triage.
  2. If something suspicious appears, a second, more agentic and contextual step looks at the entire table and can even run queries through Trino to validate.

The findings are stored in DataHub and also feed Lifeguard’s allowlist, allowing human reviewers to make fine-grained approvals.

Transformer: ELT on Top of Workflows

Transformer is the ELT engine built on Cloudflare Workflows. Users define a SQL transformation DAG using YAML with:

  • target table,
  • materialization mode,
  • dependencies,
  • schedule.

Transformer compiles this graph and executes everything via Trino. State is managed with Durable Objects, definitions live in R2, and execution history is stored in D1.

Ingestion: Bridge Between Operational Systems and the Lake

The ingestion layer connects operational databases to Town Lake. An orchestrator runs on Kubernetes, reads pipeline configurations, and launches short-lived jobs that:

  • extract data from Postgres or ClickHouse,
  • transform it into Parquet,
  • and load it into Iceberg tables on R2.

Each pipeline can run in full-replace or incremental-append mode, depending on the nature of the source.

Governance: Closed by Default, with Automation and Light Review

A clear risk of centralizing data is creating a huge exposure surface. The traditional approach is to open everything and restrict by exception. Cloudflare did the opposite with Town Lake: everything starts closed.

In practice, it works like this:

  • when a new database is connected to Trino or a new table is created,
  • Skimmer scans and classifies the columns,
  • registers the object in a central allowlist as pending,
  • no one can access it until a reviewer approves the table and specific columns.

This flow would be painful if it were manual. But two factors make it viable:

  1. Heavy automation. The PII classifier is good at catching both obvious PII (email, IP, name, phone) and less-obvious sensitive data (tokens with certain prefixes, opaque IDs that allow tracking users). The reviewer sees what was detected and approves, adjusts, or denies. In general, the review takes just a few seconds.
  2. Self-service flow. If someone tries to query a table they do not have access to, the message is not just a dry permission denied. The response indicates that the table needs review and provides a path to request it. Skipper itself even suggests which RBAC group makes sense to request access to.

Another important detail: schema discovery is separate from data access. People can see that a table exists, but unreviewed columns are hidden from commands like DESCRIBE, SHOW COLUMNS, and even SELECT *. This prevents breaking old dashboards when new columns have not been reviewed yet.

On top of that, access to PII is opt-in per session. By default, Trino masks sensitive columns before any result appears on screen. If someone needs to see raw data, for example in a fraud investigation, they flip a session flag, go through a permission check, and every query is logged in detail.

Skipper: The AI Agent for Conversational Data Access

Having just a query engine does not solve everything. SQL is still a barrier for a lot of people, and knowing which table to use in a sea of tens of thousands is not trivial. The answer to that was Skipper, a conversational artificial intelligence agent designed to walk from a natural language question to a validated answer using Cloudflare’s real data, code, and institutional knowledge.

Skipper runs on top of Town Lake and also on top of Cloudflare’s developer platform: Workers, Workers AI, Durable Objects, D1, R2, Workflows, and KV.

In the interface, it is a chat. A classic example of a question:

Show the 10 customers with the highest R2 storage cost in the last 30 days and the change compared to the previous 30 days.

Skipper then:

  • finds the right tables using DataHub,
  • pulls schema and lineage information,
  • writes the most appropriate SQL query,
  • submits that query to Trino,
  • waits for the result and generates a table or chart.

If the user follows up with:

Now break it down by region and ignore Cloudflare internal accounts.

Skipper keeps the context, adjusts the query, reruns it, and returns the new view. If it sees obvious issues, like a join that returned zero rows or a filter that removed more than it should, it will investigate, fix, and try again in a closed reasoning loop.

Beyond ad hoc exploration, Skipper can package visualizations into shareable dashboards that can be embedded into other internal tools. It also provides features to build transformation graphs through Transformer and check permissions via Lifeguard.

How Skipper Avoids Hallucinations with Context Layers

If you just feed a language model a SQL prompt and a list of table names, it will often:

  • invent joins that do not exist,
  • use the wrong columns,
  • and return very confident numbers that are completely wrong.

In the first internal tests, Cloudflare saw this up close. The solution came from a multi-layer context architecture that the agent accesses while generating answers.

Layer 1: Schema and Usage Metadata

DataHub knows:

  • all columns and their types,
  • primary and foreign keys,
  • and which tables are usually joined with which, based on historical queries.

Tools like search_datasets and get_entity_details expose this to Skipper.

Layer 2: Human Annotations

Teams document the tables they maintain. When the team responsible for dim.accounts notes that it has one row per account and that each account belongs to a single customer via customer_id, that description goes into DataHub and becomes part of the agent’s context. Tags like curated mark validated tables that Skipper should prioritize, avoiding scratch spaces.

Layer 3: Knowledge Derived from Code

Much of the real meaning does not live in the catalog, but in the SQL that produces each table. The Transformer pipeline publishes a .meta.json file for each node on every successful run, feeding DataHub.

So when Skipper looks at a table like fct.billings_allocated, it does not just see the schema, but also that:

  • it is a pre-joined fact table,
  • built from dim.accounts, dim.customers, and seed.product_classification,
  • and that the alloc_amount column is calculated as billed_amount / 12 for annual plans and billed_amount for monthly plans.

This level of nuance is what separates a correct answer from a well-formatted guess.

Layer 4: Curated Data Models

Cloudflare maintains a concise set of data model pages, short documents that explain how to think about billing, customers, accounts, and zones. They include guidance like:

  • prefer tables tagged as curated,
  • avoid scratch tables or those tagged as internal,
  • search using model terms (like billing product revenue) instead of loose language.

Skipper accesses these documents as additional resources when the question’s topic lines up with one of these domains.

Tools we use daily

Layer 5: Real-Time Introspection

When the rest is not enough, Skipper can perform live inspections in Trino:

  • DESCRIBE table,
  • SELECT DISTINCT column LIMIT 20,
  • SELECT COUNT(*).

This is used carefully because it has a cost, but it works as a safety net that prevents many silent errors.

Execution as Code: Less Chat, More Flow

In most tool-using agents, the common pattern is to define dozens of tools in the prompt, let the model call them one by one, and repeat context on every turn. That is verbose and expensive. For Skipper’s MCP server, Cloudflare adopted Code Mode.

Instead of exposing 30 tools, the API basically exposes two: search and execute. The model writes a JavaScript snippet that programmatically calls the entire Skipper tool suite:

const datasets = await skipper.search_datasets({ query: "billing product revenue" })
const queryId = await skipper.start_query({ sql: "SELECT ..." })
const results = await skipper.fetch_results({ queryId, mode: "inject" })
return skipper.create_chart({ chartType: "bar", data: results.rows })

This code runs in an isolated dynamic Worker. With that, the model describes multi-step flows in a single turn, in a language it is already good at. It is faster, cheaper, and still leaves the flow auditable as code.

Another core point: everything Skipper does runs as the calling user. If the person does not have access to a table, the agent does not either. If a saved query is shared, permission checks happen at view time, not when it is saved.

Real Use Cases Inside Cloudflare

Once it was consolidated, Town Lake and Skipper started being used across many internal domains.

Billing and Billable Usage

This was the use case that drove the project. The Billable Usage Dashboard, the panel that shows pay-as-you-go customers how much they owe, is powered by a measurement pipeline whose source of truth is Iceberg tables on R2 queried via Trino. The dashboard API reads the same compact rows (date, account_id, metric_name, usage) that the billing system uses, guaranteeing that the number the customer sees matches the number on the invoice.

Billing-related queries represent more than half of all queries served by Town Lake. Those old 200–300-line SQL blocks used to compute per-customer revenue rollups have now been condensed into much smaller, far more readable queries.

Business Intelligence

Questions like top 100 customers by revenue or how many domains that signed up today are already in the top 100 started taking about three seconds in Skipper. The same goes for a whole range of business questions that used to become Jira tickets and wait days until someone with free time put together the right query.

Security and Threat Analysis

The Bot Management team uses Town Lake to query machine learning scoring events with scores greater than 0.9 in the last 48 hours, filtered by ASN and geography. Threat researchers have built their own query kits on top of the platform, and Trust & Safety teams tap into those signals to help fight abuse and malicious behavior.

Customer Support

Tasks that used to be multi-day projects, such as finding the top 100 billing tickets from customers who have already spent more than 100 dollars, turned into quick Skipper queries, freeing the team to focus on analysis instead of building ad hoc pipelines.

Some Lessons Learned

A few key lessons stood out along the way:

  • Less detailed instruction, better results. Overly prescriptive prompts telling the model step by step which tool to use hurt answer quality. The model did better with high-level guidance, choosing the flow on its own.
  • Overlapping tools confuse the model. Having too many similar tools made Skipper call the wrong one frequently. Simplifying and giving each tool a single clear purpose helped a lot.
  • Code carries the meaning. The biggest accuracy gains came when the team started ingesting the SQL that produces the tables, not just the schema. Business rule details show up there, not in column descriptions.
  • Memory matters. There is a long tail of corrections like you must always filter this way or always ignore these tables. Without a memory layer, the agent has to relearn that in every conversation.
  • The boring parts hurt the most. Trino and Iceberg are not exactly new. The hard, energy-consuming parts are the boring ones: row-level access control, closed-by-default allowlists, query auditing, time-bounded credentials, idempotent pipelines, controlled schema evolution. That is what makes the platform truly safe.

What’s Next for the Data Platform and Skipper

Cloudflare is continuing to expand the agent’s surface area. Skipper already runs as an MCP server integrated with compatible IDEs, and the next step is to deepen its presence in internal chats and ticketing systems. The idea is that asking the data becomes the first reflex for anyone who is:

  • investigating an incident,
  • designing a new project,
  • or just validating a quick hypothesis.

In parallel, the company is heavily investing in the Transformer pipeline. The vision is that any team can build a curated dataset with a few SQL files and a .meta.json, deploy it as a Workflow, get automatic scheduling and monitoring, and see everything show up, already integrated, in DataHub and Skipper.

On the analytics engine side, R2 SQL, Cloudflare’s own serverless, distributed engine, is gaining capabilities quickly. As it matures, many parts of the Town Lake flow are expected to migrate to this new component, keeping the same lakehouse concept but on a base that is even more integrated with the rest of the platform.

In the end, the bet is simple: the next big product will probably come from someone looking at the data and seeing something no one else noticed yet. Town Lake, the data platform that powers it, and the Skipper artificial intelligence agent exist precisely to make sure that person can get there.

Picture of Rafael

Rafael

Operations

I transform internal processes into delivery machines — ensuring that every Viral Method client receives premium service and real results.

Fill out the form and our team will contact you within 24 hours.

Related publications

Alteryx: AI and Automation in 380 Million Workflows

Automation at Scale: How Alteryx Processes 380M Workflows/Year, Combining AI, Data Governance & Enterprise Productivity

High-Performance Computing, Artificial Intelligence, Automation, and Digital Marketing in Digital Transformation

AI drives automation, personalized marketing and high-performance computing to transform operations, experiences and decision-making.

Artificial Intelligence and Automation Drive Investments

Investment in AI and automation becomes a priority in the American Midwest: companies move from pilots to practical implementations to

Receba o melhor conteúdo de inovação em seu e-mail

Todas as notícias, dicas, tendências e recursos que você procura entregues na sua caixa de entrada.

Ao assinar a newsletter, você concorda em receber comunicações da Método Viral. A gente se compromete a sempre proteger e respeitar sua privacidade.

Rafael

Online

Atendimento

Calculadora Preço de Sites

Descubra quanto custa o site ideal para seu negócio

Páginas do Site

Quantas páginas você precisa?

4

Arraste para selecionar de 1 a 20 páginas

📄

⚡ Em apenas 2 minutos, descubra automaticamente quanto custa um site em 2026 sob medida para o seu negócio

👥 Mais de 0+ empresas já calcularam seu orçamento

Fale com um consultor

Preencha o formulário e nossa equipe entrará em contato.