Tutorial: Build a Dataset Delivery System with Cloudflare Workers and Signed URLs
Build a secure, resume-capable dataset delivery system with Cloudflare Workers, signed URLs, and edge streaming to cut bandwidth and delivery headaches.
Stop losing time and bandwidth delivering datasets
If you're on an ML team, you know the pain: massive labeled datasets that are expensive to host, slow to download, and risky to share. Teams waste engineer hours re-uploading, debugging broken downloads, and chasing partial transfers. In 2026, that problem is solvable at the edge: build a delivery system that combines Cloudflare Workers, short-lived signed URLs, and resume-capable downloads so data scientists get reliable access — and platform engineers keep costs and risk under control.
Why this matters now (2026 trends)
Two industry signals accelerated this pattern in late 2025 and early 2026. First, exploding demand for AI training data means datasets are shared far more often across organizations and partners. Second, edge platforms — Cloudflare in particular — moved aggressively into AI and data services (including acquisitions and partnerships in the AI data marketplace space), making edge-backed dataset delivery a practical, secure option for teams.
The upshot: teams can now serve large labeled datasets from edge workers, enforce access control with signed URLs, and support resume via HTTP Range requests — all while keeping bandwidth predictable and reducing origin egress charges.
High-level architecture
Here’s the pattern we’ll implement step-by-step. Keep this diagram in mind as we go.
- Object storage (R2 / S3 / GCS) holds the dataset files.
- Backend auth service issues short-lived signed tokens after authentication and authorization checks.
- Cloudflare Worker acts as the gate: validates signatures, enforces rate limits, and proxies or streams the file while honoring Range headers for resume support.
- Client uses the signed URL to download with resume, letting standard HTTP Range semantics handle partial transfers.
- CI/CD pipeline and code review ensure deployments are auditable and tested.
Design decisions and trade-offs
- Signed URL vs. presigned S3 URL: Presigned S3/R2 URLs let clients fetch directly, offloading bandwidth from workers. But if you need access checks, telemetry, IP binding, or token revocation, fronting with a Worker gives you control. For teams thinking about monetization or audit trails, see work on paid-data marketplaces for security and billing trade-offs.
- Resume strategy: Use standard HTTP Range requests — universal and well-supported. For truly interrupted large uploads/downloads, consider application-level chunking or protocols like tus, but for downloads of static dataset files, Range is sufficient.
- Caching and cost: Cache at the edge where possible. Cloudflare's edge cache can serve repeated downloads without origin hits and reduce egress charges.
Prerequisites
- Cloudflare account with Workers enabled and a domain configured.
- Object storage (Cloudflare R2, AWS S3, or GCS) containing the dataset.
- Backend service (Node.js example below) that can issue signed tokens and has credentials to verify user authorization.
- Git repository, GitHub Actions (or similar) for CI/CD, and a code-review workflow (protected branches, PR templates).
Step 1 — Prepare your dataset and metadata
Organize datasets in object storage with predictable keys. Add a metadata manifest per dataset (JSON) that includes file checksums (SHA256), sizes, labels, and a version field. This helps validate integrity on the client and allows partial updates (deltas).
Example manifest (dataset-manifest.json)
{
  "dataset": "imagenet-mini",
  "version": "2026-01-01",
  "files": [
    {"key": "imagenet-mini/part-0001.tar.gz", "size": 1500000000, "sha256": "..."},
    {"key": "imagenet-mini/part-0002.tar.gz", "size": 1200000000, "sha256": "..."}
  ]
}
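If you publish datasets through a pipeline, a small script can build this manifest for you. The sketch below is illustrative rather than canonical: it assumes the parts for a dataset sit in a local directory (here ./imagenet-mini) before upload, and it streams each file through SHA-256 so multi-gigabyte parts never load into memory.
Manifest generation script (sketch)
// generate-manifest.js — sketch: build a manifest like the one above from local files.
// Assumes the dataset parts live in ./imagenet-mini before upload; adjust paths to taste.
const fs = require('fs');
const path = require('path');
const crypto = require('crypto');

function sha256File(filePath) {
  // Stream the file through a SHA-256 hash so large parts don't load into memory.
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('sha256');
    fs.createReadStream(filePath)
      .on('data', (chunk) => hash.update(chunk))
      .on('end', () => resolve(hash.digest('hex')))
      .on('error', reject);
  });
}

async function buildManifest(dir, datasetName, version) {
  const files = [];
  for (const name of fs.readdirSync(dir)) {
    const full = path.join(dir, name);
    files.push({
      key: `${datasetName}/${name}`,
      size: fs.statSync(full).size,
      sha256: await sha256File(full),
    });
  }
  return { dataset: datasetName, version, files };
}

buildManifest('./imagenet-mini', 'imagenet-mini', '2026-01-01')
  .then((m) => fs.writeFileSync('dataset-manifest.json', JSON.stringify(m, null, 2)));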
Step 2 — Build a safe signed-token scheme
Use a server-side signer that creates compact tokens the Worker can validate with a secret. Include these fields in the signature: dataset key, expiry timestamp, user id (optional), and an optional IP or client fingerprint.
Node.js signer (example)
const crypto = require('crypto');

function signToken({ key, userId, expiresAt }, secret) {
  // expiresAt is a Unix timestamp in seconds; it travels inside the signed payload as `exp`.
  const payload = JSON.stringify({ key, userId, exp: expiresAt });
  const hmac = crypto.createHmac('sha256', secret).update(payload).digest('base64url');
  return `${Buffer.from(payload).toString('base64url')}.${hmac}`;
}

function verifyToken(token, secret) {
  const [payloadB64, mac] = token.split('.');
  if (!payloadB64 || !mac) throw new Error('Malformed token');
  const payload = Buffer.from(payloadB64, 'base64url').toString();
  const expected = crypto.createHmac('sha256', secret).update(payload).digest('base64url');
  const macBuf = Buffer.from(mac);
  const expectedBuf = Buffer.from(expected);
  // timingSafeEqual throws on unequal lengths, so reject those up front.
  if (macBuf.length !== expectedBuf.length || !crypto.timingSafeEqual(macBuf, expectedBuf)) {
    throw new Error('Invalid signature');
  }
  const parsed = JSON.parse(payload);
  if (!parsed.exp || parsed.exp * 1000 < Date.now()) throw new Error('Token expired');
  return parsed;
}
Key points: keep tokens short-lived (30–300 seconds depending on use), store the signing secret in a vault (Cloudflare Workers secret or your backend's secret manager), and include per-user or per-request metadata when necessary.
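For completeness, here is what the issuing side might look like. This is a minimal sketch, assuming an Express app, a ./signer module exporting the signToken function above, and an isAuthorizedFor helper standing in for whatever entitlement check your backend already performs; none of these names are prescribed by the tutorial's stack.
Token-issuing endpoint (sketch)
// sign-endpoint.js — sketch of a backend route that issues short-lived download tokens.
// Assumes Express and the signToken() helper above (here imported from ./signer).
// isAuthorizedFor() stands in for whatever access check your service already performs.
const express = require('express');
const { signToken } = require('./signer');

const app = express();

async function isAuthorizedFor(userId, key) {
  // Replace with a real lookup (entitlements table, IAM call, etc.).
  return Boolean(userId) && key.startsWith('imagenet-mini/');
}

app.get('/datasets/token', async (req, res) => {
  const { key } = req.query;
  const userId = req.get('x-user-id'); // stand-in for your real auth middleware
  if (!key || !userId) return res.status(400).json({ error: 'missing key or user' });
  if (!(await isAuthorizedFor(userId, key))) return res.status(403).json({ error: 'forbidden' });

  // Short-lived: 120 seconds sits comfortably inside the 30–300s window discussed above.
  const expiresAt = Math.floor(Date.now() / 1000) + 120;
  const token = signToken({ key, userId, expiresAt }, process.env.SIGNING_SECRET);
  res.json({
    url: `https://datasets.example.com/download?key=${encodeURIComponent(key)}&token=${token}`,
    expiresAt,
  });
});

app.listen(3000);
A client would call this endpoint with its normal credentials, receive the signed URL, and pass it straight to curl or the downloader in Step 4.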
Step 3 — Worker: authenticate, authorize, and stream with Range support
The Worker will perform these tasks:
- Validate the signed token.
- Authorize the user for the requested dataset key (extra step if token is generic).
- Forward the client's Range header to the origin storage fetch or use efficient R2 streaming.
- Set proper response headers to support resume: Accept-Ranges, 206 Partial Content, and Content-Range.
Cloudflare Worker (JavaScript) — proxy streaming approach
export default {
  async fetch(request, env) {
    const url = new URL(request.url);
    const token = url.searchParams.get('token');
    const key = url.searchParams.get('key');
    if (!token || !key) return new Response('missing token/key', { status: 400 });

    // Validate token (assumes an HMAC verification utility compatible with Workers; a sketch follows this block)
    let payload;
    try {
      payload = await verifyToken(token, env.SIGNING_SECRET);
    } catch (err) {
      return new Response('invalid token', { status: 403 });
    }
    if (payload.key !== key) return new Response('not authorized for key', { status: 403 });
    if (!payload.exp || payload.exp * 1000 < Date.now()) return new Response('token expired', { status: 403 });

    // Prepare forward headers (preserve Range if present so clients can resume)
    const forwardHeaders = new Headers();
    const range = request.headers.get('range');
    if (range) forwardHeaders.set('Range', range);

    // Internal origin URL (private endpoint or signed internal path)
    const originUrl = `${env.ORIGIN_BASE_URL}/${key}`;
    // Add internal auth header for the origin fetch (rotate and store secrets securely)
    forwardHeaders.set('Authorization', `Bearer ${env.ORIGIN_SERVICE_TOKEN}`);
    const originResp = await fetch(originUrl, { headers: forwardHeaders, cf: { cacheTtl: 3600 } });

    // Forward the origin status (200 or 206) and body, but sanitize headers first
    const headers = new Headers(originResp.headers);
    headers.set('Accept-Ranges', 'bytes');
    // Security: set cache-control, restrict exposed headers
    headers.set('Cache-Control', 'public, max-age=3600');
    headers.delete('x-internal-metadata');
    return new Response(originResp.body, { status: originResp.status, headers });
  }
};
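The Worker above leans on a verifyToken helper that must run in the Workers runtime, where Node's crypto module is not available by default. Here is a minimal sketch using the Web Crypto API, matching the token format produced by the Node signer in Step 2.
Workers-side token verification (sketch)
// Workers-compatible HMAC verification (sketch). Mirrors the Node signer in Step 2:
// token = base64url(payload) + '.' + base64url(HMAC-SHA256(payload, secret))
async function verifyToken(token, secret) {
  const [payloadB64, mac] = token.split('.');
  if (!payloadB64 || !mac) throw new Error('Malformed token');

  const enc = new TextEncoder();
  const key = await crypto.subtle.importKey(
    'raw', enc.encode(secret), { name: 'HMAC', hash: 'SHA-256' }, false, ['verify']
  );

  // Decode base64url (no padding) into raw bytes.
  const b64urlToBytes = (s) =>
    Uint8Array.from(atob(s.replace(/-/g, '+').replace(/_/g, '/')), (c) => c.charCodeAt(0));

  const payloadBytes = b64urlToBytes(payloadB64);
  const valid = await crypto.subtle.verify('HMAC', key, b64urlToBytes(mac), payloadBytes);
  if (!valid) throw new Error('Invalid signature');

  return JSON.parse(new TextDecoder().decode(payloadBytes));
}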
Notes:
- If you use Cloudflare R2 and bind the bucket to Workers, you can fetch directly from the R2 binding (a sketch follows these notes); or you can proxy to a private origin S3 endpoint. See notes on edge caching and bindings.
- Always keep the origin auth token secret and rotated. Use Cloudflare Workers secrets and a secret manager in CI/CD.
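For the R2-binding variant mentioned in the first note, a rough sketch follows. It assumes a bucket bound as env.DATASETS in wrangler.toml and handles the common single-range forms (bytes=start- and bytes=start-end); multi-range requests are out of scope here.
R2 binding with Range support (sketch)
// Sketch: serve a ranged read directly from an R2 binding (env.DATASETS is an assumed
// binding name). Handles the common "bytes=start-" / "bytes=start-end" forms only.
async function serveFromR2(env, key, rangeHeader) {
  let range;
  const m = rangeHeader && /^bytes=(\d+)-(\d*)$/.exec(rangeHeader);
  if (m) {
    const offset = Number(m[1]);
    range = m[2] ? { offset, length: Number(m[2]) - offset + 1 } : { offset };
  }

  const object = await env.DATASETS.get(key, range ? { range } : undefined);
  if (!object) return new Response('not found', { status: 404 });

  const headers = new Headers();
  object.writeHttpMetadata(headers); // copies content-type etc. from R2 object metadata
  headers.set('Accept-Ranges', 'bytes');

  if (range) {
    const end = range.length ? range.offset + range.length - 1 : object.size - 1;
    headers.set('Content-Range', `bytes ${range.offset}-${end}/${object.size}`);
    return new Response(object.body, { status: 206, headers });
  }
  return new Response(object.body, { status: 200, headers });
}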
Step 4 — Client: resume-capable downloader
Use HTTP Range to resume. Many tools (curl, wget, Python requests with Range header) already support this. For programmatic downloads, implement a small resume loop that retries partial downloads and validates the checksum per piece.
Example curl usage
# initial download; rerun the same command to resume ("-C -" picks up where the local file left off)
curl -C - -o part-0001.tar.gz "https://datasets.example.com/download?key=imagenet-mini/part-0001.tar.gz&token=..."
Example resumable Node.js downloader
const fs = require('fs');
const fetch = require('node-fetch');

async function downloadWithResume(url, outPath) {
  // Resume from however many bytes we already have on disk.
  let start = 0;
  if (fs.existsSync(outPath)) start = fs.statSync(outPath).size;

  const headers = start > 0 ? { Range: `bytes=${start}-` } : {};
  const res = await fetch(url, { headers });
  if (![200, 206].includes(res.status)) throw new Error(`download failed: ${res.status}`);

  // 206 means the server honored the Range, so append; a plain 200 means we received
  // the whole file again and must overwrite to avoid duplicating bytes.
  const dest = fs.createWriteStream(outPath, { flags: res.status === 206 ? 'a' : 'w' });
  await new Promise((resolve, reject) => {
    res.body.pipe(dest);
    res.body.on('error', reject);
    dest.on('finish', resolve);
  });
}
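Once a resumed download finishes, check it against the manifest from Step 1 before handing it to training jobs. A small sketch, reusing the same streaming SHA-256 approach as the manifest script:
Checksum validation after download (sketch)
// verify-checksum.js — sketch: compare a downloaded part against its manifest entry.
const fs = require('fs');
const crypto = require('crypto');

function sha256File(filePath) {
  // Same streaming SHA-256 helper used when the manifest was generated.
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('sha256');
    fs.createReadStream(filePath)
      .on('data', (chunk) => hash.update(chunk))
      .on('end', () => resolve(hash.digest('hex')))
      .on('error', reject);
  });
}

async function verifyAgainstManifest(manifestPath, key, filePath) {
  const manifest = JSON.parse(fs.readFileSync(manifestPath, 'utf8'));
  const entry = manifest.files.find((f) => f.key === key);
  if (!entry) throw new Error(`no manifest entry for ${key}`);
  const actual = await sha256File(filePath);
  if (actual !== entry.sha256) throw new Error(`checksum mismatch for ${key}`);
  return true;
}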
Step 5 — CI/CD and code review workflow
Implement a reliable deployment pipeline so dataset delivery changes are reviewed and tested before reaching production.
- Use branches and pull requests for all Worker and backend changes. Require at least one approving review and passing CI checks.
- CI steps: lint, unit tests, integration tests against a local emulator (Miniflare for Workers), and an end-to-end smoke test that requests a signed token and performs a small-range download (a sketch of that smoke test follows the workflow outline below).
- Deployment: use GitHub Actions to run tests and then invoke wrangler deploy, with the Cloudflare API token stored in GitHub Actions secrets.
Sample GitHub Actions (outline)
name: CI
on: [pull_request, push]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Node
uses: actions/setup-node@v4
with: node-version: '20'
- run: npm ci && npm test
deploy:
needs: test
if: github.ref == 'refs/heads/main'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: npm ci && npm run build
- name: Publish to Cloudflare
env:
CF_API_TOKEN: ${{ secrets.CF_API_TOKEN }}
run: npx wrangler publish
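The end-to-end smoke test referenced in the CI steps can be a short script. The sketch below assumes the token endpoint and x-user-id header from the signer sketch in Step 2, plus a small smoke-test object in a non-production bucket; adjust names to whatever your backend actually exposes.
Smoke test script (sketch)
// smoke-test.js — sketch: issue a token, then confirm the Worker honors Range requests.
// SIGNER_URL, the test key, and the x-user-id header all come from the earlier sketches.
const assert = require('assert');

async function smokeTest() {
  // 1. Ask the backend signer for a short-lived URL to a small test object.
  const signerRes = await fetch(
    `${process.env.SIGNER_URL}/datasets/token?key=imagenet-mini/smoke-test.bin`,
    { headers: { 'x-user-id': 'ci-smoke-test' } }
  );
  assert.strictEqual(signerRes.status, 200, 'signer should issue a token');
  const { url } = await signerRes.json();

  // 2. Request the first KiB via the Worker and expect a partial response.
  const dl = await fetch(url, { headers: { Range: 'bytes=0-1023' } });
  assert.strictEqual(dl.status, 206, 'Worker should return 206 Partial Content');
  const body = new Uint8Array(await dl.arrayBuffer());
  assert.strictEqual(body.length, 1024, 'should receive exactly the requested range');
}

smokeTest().then(() => console.log('smoke test passed'));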
Observability, throttling, and abuse protection
Protect your dataset and control costs.
- Telemetry: log token issuance, Worker requests, origin response codes, and bytes served per token. Use these metrics to detect abuse and unusual bandwidth spikes; tie this into a cost impact analysis to understand dollar exposure.
- Rate limiting: enforce per-user and per-token limits in the Worker (or use Cloudflare rate-limiting rules) to stop bursty downloads from overrunning budget.
- Revocation: keep signed tokens short-lived. If you need revocation, maintain a small denylist with an in-memory store (Durable Object or KV) checked before serving larger files.
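A denylist check adds only a few lines to the Worker. The sketch below assumes a KV namespace bound as REVOKED_TOKENS and a jti (token id) field added to the signed payload; neither exists in the Step 2 scheme until you add them.
Revocation check (sketch)
// Sketch: optional revocation check before serving a large file.
// Assumes a KV binding named REVOKED_TOKENS and a `jti` (token id) field in the payload.
async function isRevoked(env, payload) {
  if (!payload.jti) return false;        // tokens without an id are exempt
  const hit = await env.REVOKED_TOKENS.get(payload.jti);
  return hit !== null;                   // any value stored under the jti means "revoked"
}

// Inside the Worker's fetch handler, after verifyToken():
// if (await isRevoked(env, payload)) return new Response('token revoked', { status: 403 });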
Security best practices
- Always sign tokens with rotated secrets and store them in a secrets manager. Use Cloudflare Workers secrets or a dedicated vault for backend services.
- Minimize the scope of origin credentials: create a service user with read-only access to specific dataset prefixes.
- Prefer short-lived tokens, and bind tokens to user IDs or IP addresses when possible (see the sketch after this list).
- Sanitize headers and remove internal metadata before returning responses to clients.
- Consider adding checksum validation on the client side to ensure file integrity after resume. For compliance and training-data readiness, follow guidance on offering content as compliant training data.
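For the IP-binding recommendation above, enforcement in the Worker is a few lines against Cloudflare's CF-Connecting-IP header. This sketch assumes the signer stored the requester's IP in the payload as an ip field when the token was issued.
IP binding check (sketch)
// Sketch: enforce optional IP binding (assumes the signer put the requester's IP
// into the payload as `ip` when issuing the token).
function checkIpBinding(request, payload) {
  const clientIp = request.headers.get('CF-Connecting-IP'); // set by Cloudflare at the edge
  if (payload.ip && payload.ip !== clientIp) {
    return new Response('token not valid from this address', { status: 403 });
  }
  return null; // no objection; continue serving the request
}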
Bandwidth and cost optimizations
Large datasets cost money. Here are practical levers to reduce cost and improve performance.
- Edge caching: Use Cloudflare's cache rules to serve repeated downloads from edge caches rather than origin. Set appropriate Cache-Control with immutable content versions. See an edge signals writeup for strategies that combine caching and personalization.
- Sharding & chunking: Split big archives into part-N files so clients only download what they need; this also enables parallel downloads and better resume granularity.
- Compression: Pre-compress files with zstd or gzip if your clients can handle them. For ML datasets, consider tarring datasets and compressing per-file rather than entire corpus when feasible.
- Delta updates: For frequently updated datasets, publish diffs rather than full re-uploads.
- Monitoring & alerts: Set budget alerts on bytes served per dataset to catch accidental oversharing early.
Testing & validation
Set up a test matrix in CI that includes:
- Unit tests for token signer/verifier (timing-safe comparisons).
- Local Worker integration tests via Miniflare simulating Range requests and verifying 206 responses (a sketch follows this list).
- End-to-end smoke tests that request a real small file from a non-production bucket and validate checksum and range resume.
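Here is a sketch of the Miniflare integration test mentioned above, using Miniflare 3's programmatic API. The script path and binding values are assumptions that should match your Worker; the happy-path 206 case needs a small local origin, as noted in the comments.
Miniflare integration test (sketch)
// worker.auth.test.mjs — sketch of a Miniflare 3 integration test for the Worker's auth path.
// scriptPath and binding values are assumptions; point them at your actual Worker.
import { Miniflare } from 'miniflare';
import assert from 'node:assert';

const mf = new Miniflare({
  modules: true,
  scriptPath: 'src/worker.js',
  bindings: {
    SIGNING_SECRET: 'test-secret',
    ORIGIN_BASE_URL: 'http://127.0.0.1:8788', // a small local origin you run for Range tests
    ORIGIN_SERVICE_TOKEN: 'test-token',
  },
});

// Missing parameters should be rejected before any origin fetch happens.
const missing = await mf.dispatchFetch('http://localhost/download');
assert.strictEqual(missing.status, 400);

// A tampered token must fail HMAC verification.
const forged = await mf.dispatchFetch(
  'http://localhost/download?key=imagenet-mini/part-0001.tar.gz&token=bogus.bogus'
);
assert.strictEqual(forged.status, 403);

// With a valid token and the local origin serving a test file, a ranged request should
// come back as 206 with Accept-Ranges: bytes — add that case once your test fixture origin exists.

await mf.dispose();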
Real-world checklist before production
- Manifest and checksums available for all dataset versions.
- Signed tokens expire rapidly (under 5 minutes); clients should request a fresh token and resume via Range if a long transfer outlives its token.
- Edge cache rules configured and validated for immutable dataset keys.
- Rate-limiting policy and monitoring alerts in place.
- CI/CD pipeline with branch protections and integration tests green.
Advanced strategies and future-proofing (2026+)
Looking forward, consider these advanced strategies aligned with 2026 trends:
- Data marketplaces & attribution: with recent moves in the industry toward paying data creators and provenance tracking, add metadata hooks for audit trails and usage attribution (useful for licensing).
- Edge-integrated compute: shift some preprocessing to the edge (e.g., per-request compression, format conversion) for clients with limited local tooling.
- Tokenized billing: integrate dataset consumption metrics into billing systems so teams can monetize or allocate costs per project.
"Edge-first dataset delivery reduces origin load and tightens control — especially when you combine signed tokens, Range-aware streaming, and a strong CI/CD workflow."
Actionable takeaways
- Use short-lived signed tokens issued by a backend service and verified at the Worker to retain control and observability.
- Support HTTP Range in your Worker proxy so standard clients can resume downloads without custom protocols.
- Cache immutable datasets at the edge and shard large archives to reduce bandwidth and improve reliability.
- Automate tests and deploys with CI that runs Miniflare tests and smoke downloads to avoid regressions.
Next steps — starter checklist to implement today
- Create dataset manifests and upload to object storage.
- Implement signer service and generate a few test tokens.
- Deploy the Worker that validates tokens and forwards Range headers.
- Add a client test that performs a resumed download and verifies checksum.
- Set up GitHub Actions and branch protections for your repo.
Call to action
Ready to stop wrestling with unreliable dataset distribution? Start with a branch in your repo: add the signer tests, a Worker route for /download, and a CI job to run a resume-scenario smoke test. If you want, paste your Worker code and CI config into our community review thread for feedback — we’ll review it and suggest improvements based on real-world traffic and cost patterns.
Related Reading
- Architecting a Paid-Data Marketplace: Security, Billing, and Model Audit Trails
- Edge Signals & Personalization: An Advanced Analytics Playbook for Product Growth in 2026
- Hybrid Photo Workflows in 2026: Portable Labs, Edge Caching, and Creator-First Cloud Storage
- Developer Guide: Offering Your Content as Compliant Training Data
- Power-Conscious AI: Architecting Workloads to Minimize Grid Impact and Cost
- No-Code Governance: Policies for Power Users Building Micro-Apps
- Safe Spaces and Changing Rooms: A Capital City Guide to Gender-Inclusive Facilities
- Building Location-Aware Micro Apps: Best Practices and API Choices
- How Sovereign Clouds Change Your Encryption and Key Management Strategy