Hackathon Theme: Build the Best AI-Powered Vertical Video Recommender
Join a community hackathon to build AI recommenders for short episodic vertical video—datasets, evaluation, latency, diversity, leaderboard.
Hook: Close the gap between academic recommenders and real-world short, episodic vertical video
You're a developer or data scientist who knows recommender systems — but the models you trained on desktop video or e-commerce data don't map cleanly to mobile-first, short, episodic vertical video. Teams struggle with noisy micro-interactions, strict on-device latency budgets on mobile, and evaluation metrics that ignore episode continuity and content diversity. This hackathon blueprint fixes that: a community competition to build the best AI-powered vertical video recommender, complete with dataset packs, robust evaluation, latency targets, and diversity-aware prize categories.
Why this matters in 2026
In late 2025 and early 2026 the industry doubled down on short, serialized vertical video. Investors backed companies positioning themselves as mobile-first streaming platforms (e.g., Holywater's Jan 2026 funding) and AI-native video startups scaled creator tooling and distribution (e.g., Higgsfield's rapid growth). These moves reflect a broader shift: attention is mobile, storytelling is episodic, and creators need discovery systems that respect replay and sequence.
"Holywater positions itself as the mobile-first Netflix for short episodic vertical video." (Forbes, Jan 16, 2026)
That shift changes the recommender problem: you must optimize for watch-through-rate across episodes, suggest the next episode with contextual continuity, and enforce strict on-device latency while preserving diversity so small creators get discovery. A hackathon that centers these constraints will produce practical, job-ready solutions contributors can showcase.
High-level hackathon design (inverted pyramid)
Start with the essentials so participants can build a working pipeline fast, then layer complexity.
- Seed datasets — realistic interaction logs, episode metadata, thumbnails, transcripts.
- Baseline pipeline — candidate generation + reranking + online simulation code.
- Evaluation toolkit — offline metrics (accuracy + diversity), latency benchmarks, contest leaderboard.
- Prize categories — accuracy, diversity, latency, multi-episode continuity, and creator fairness.
- Community features — mentorship, workshops, public leaderboard, reproducible submissions.
Dataset packs: what to provide (and why)
Participants need datasets that reflect vertical video constraints: short sessions, frequent sequential episodes, sparse explicit feedback, and mobile signal noise. Provide three dataset packs so teams can iterate from simple to production-like.
1) Starter pack (synthetic, immediate onboarding)
- ~10k users, ~5k videos, ~50k interactions. Interaction types: view_start, view_complete, swipe_away, like, share.
- Episode chains: sequences of 2–12 videos representing serialized microdramas.
- Metadata: category, creator_id, duration_seconds, tags, vertical_aspect=true flag.
- Static thumbnail embeddings (128-d), synthetic subtitle tokens.
- Usage: fast prototyping, reproducible baselines.
2) Realistic pack (session-level logs, multi-signal)
- ~100k users, ~50k videos, ~2M interactions collected over 30 days.
- Signals: dwell_time, scroll_speed, watch_percentage, app_foreground, network_type (4G/5G/Wi-Fi).
- Creator metadata and cross-episode IDs to model continuity. Thumbnails + ASR transcripts for short captions.
- Supply both raw logs and precomputed candidate pools (for fairness and to reduce compute barriers).
3) Challenge pack (hard tasks: cold-start, long-tail)
- New creators and new episodes introduced at test time (cold-start). Hidden test interactions for leaderboard scoring.
- Skewed popularity distribution to test long-tail discovery.
- Edge-case metadata: explicit content flags, regional licenses, and temporal release windows (timed episodes).
Provide clear schema (CSV/Parquet), sample ingestion notebooks, and a small Docker runtime to execute evaluation locally.
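For orientation, a minimal ingestion sketch might look like the following; the file paths, event-type values, and column names are placeholders that mirror the packs described above, not a fixed contract.
```python
# Minimal ingestion sketch for the Starter pack. Paths and column names
# (interactions.parquet, items.parquet, event_type, ...) are illustrative --
# adapt them to whatever schema ships with the dataset packs.
import pandas as pd

interactions = pd.read_parquet("data/starter/interactions.parquet")
items = pd.read_parquet("data/starter/items.parquet")

# Keep only the event types the baseline cares about.
events = interactions[interactions["event_type"].isin(
    ["view_start", "view_complete", "swipe_away", "like", "share"]
)]

# Join episode metadata so downstream code can reason about series continuity.
sessions = events.merge(
    items[["video_id", "series_id", "episode_index", "creator_id", "duration_seconds"]],
    on="video_id", how="left",
).sort_values(["user_id", "timestamp"])

print(sessions.head())
```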
Baseline pipeline: a practical starter solution
Ship a baseline with runnable code so participants can focus on innovation, not plumbing. Keep it modular: candidate generation -> feature enrichment -> reranker.
Candidate generation (recall)
- Use collaborative filtering + content-based fusion: nearest-neighbor search on lightweight embeddings via FAISS approximate search (a recall sketch follows this list).
- Create embeddings by averaging a frame-level visual encoder + text (ASR/caption) embeddings. Use efficient encoders (MobileViT, DistilBERT) for speed.
- Precompute top-500 candidates per user session to bound downstream compute.
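A minimal recall sketch along these lines, assuming 128-d item embeddings are already computed; the random vectors and the exact IndexFlatIP index are stand-ins, and a production-sized catalog would switch to an approximate IVF/PQ index as in the sample solution later on.
```python
# Recall sketch: nearest-neighbour search over item embeddings with FAISS.
# `item_vecs` (n_items x 128, float32) is a stand-in for real embeddings.
import numpy as np
import faiss

d = 128
rng = np.random.default_rng(0)
item_vecs = rng.standard_normal((5000, d)).astype("float32")  # placeholder embeddings
faiss.normalize_L2(item_vecs)                                  # cosine via inner product

index = faiss.IndexFlatIP(d)    # exact index; fine at Starter-pack scale
index.add(item_vecs)

# Query = mean of the session's recently watched item vectors (simple heuristic).
session_history = item_vecs[[10, 11, 12]]
query = session_history.mean(axis=0, keepdims=True)
faiss.normalize_L2(query)

scores, candidate_ids = index.search(query, 500)  # top-500 candidates per session
print(candidate_ids[0][:10], scores[0][:10])
```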
Feature enrichment
- Session features: time_since_last_view, session_position, device_network.
- Item features: episode_index, series_id, creator_reputation, release_time_delta.
- Cross features: session_episode_match, creator_follow_state, watch_history_overlap (assembled in the sketch after this list).
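A toy sketch of how these features could be assembled per (session, candidate) pair; the dataclasses and field names are hypothetical helpers that mirror the lists above, not a prescribed schema.
```python
# Illustrative feature assembly for one (session, candidate) pair.
# The dataclasses below are hypothetical helpers, not part of the starter repo.
from dataclasses import dataclass

@dataclass
class SessionContext:
    last_series_id: str
    last_episode_index: int
    seconds_since_last_view: float
    session_position: int
    device_network: str          # e.g. "wifi", "5g"

@dataclass
class Candidate:
    series_id: str
    episode_index: int
    creator_id: str
    creator_reputation: float
    seconds_since_release: float

def build_features(ctx: SessionContext, cand: Candidate, followed_creators: set) -> dict:
    return {
        # Session features
        "time_since_last_view": ctx.seconds_since_last_view,
        "session_position": ctx.session_position,
        "device_is_wifi": float(ctx.device_network == "wifi"),
        # Item features
        "episode_index": cand.episode_index,
        "creator_reputation": cand.creator_reputation,
        "release_time_delta": cand.seconds_since_release,
        # Cross features
        "session_episode_match": float(
            cand.series_id == ctx.last_series_id
            and cand.episode_index == ctx.last_episode_index + 1
        ),
        "creator_follow_state": float(cand.creator_id in followed_creators),
    }
```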
Reranker
Train a lightweight model that reorders top candidates for final ranking. The baseline uses an MLP that predicts watch_probability and watch_through_rate, with a post-processing step to enforce minimal diversity (e.g., no >2 consecutive items from same creator).
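A sketch of such a reranker in PyTorch, assuming a fixed-size numeric feature vector per candidate; the feature dimension, layer sizes, two prediction heads, and blending weights are illustrative choices, not a tuned configuration.
```python
# Lightweight two-head reranker sketch (PyTorch). Sizes and weights below
# are starting points, not the official baseline configuration.
import torch
import torch.nn as nn

class Reranker(nn.Module):
    def __init__(self, n_features: int = 16, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.watch_prob = nn.Linear(hidden, 1)      # P(user starts watching)
        self.watch_through = nn.Linear(hidden, 1)   # expected fraction watched

    def forward(self, x: torch.Tensor):
        h = self.trunk(x)
        return torch.sigmoid(self.watch_prob(h)), torch.sigmoid(self.watch_through(h))

model = Reranker()
feats = torch.randn(500, 16)              # features for one session's top-500 candidates
p_watch, wtr = model(feats)
score = 0.7 * p_watch + 0.3 * wtr         # blend of the two heads; weights are a free choice
top10 = torch.argsort(score.squeeze(1), descending=True)[:10]
```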
Example pipeline steps (fast start)
- Ingest dataset pack into a local Parquet store.
- Precompute item embeddings and a FAISS index.
- For each test session, recall top-500 candidates via FAISS, compute features, and score with reranker.
- Output top-10 per session to leaderboard scorer.
Evaluation: combine accuracy, episodic continuity, and diversity
Offline accuracy metrics are necessary but insufficient. This hackathon enforces a composite evaluation protocol that values engagement, multi-episode coherence, diversity, and latency/resource cost.
Primary offline metrics
- NDCG@10 — rank-aware accuracy for short sessions.
- MRR — useful for single-click next-episode prediction.
- Watch-Through Rate (WTR) — the predicted vs. actual fraction of each video watched; prioritizes episode completion (a scorer sketch follows this list).
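A compact scorer sketch for NDCG@10 and MRR with binary relevance (WTR additionally needs logged watch fractions, so it is left out here); the recs/truth dictionary formats are assumptions, not the official submission schema.
```python
# Offline accuracy scorer sketch: NDCG@10 and MRR with binary relevance.
# `recs` maps session_id -> ranked list of item ids; `truth` maps session_id
# -> set of held-out items the user actually engaged with.
import math

def ndcg_at_k(ranked, relevant, k=10):
    dcg = sum(1.0 / math.log2(i + 2) for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0

def mrr(ranked, relevant):
    for i, item in enumerate(ranked):
        if item in relevant:
            return 1.0 / (i + 1)
    return 0.0

def evaluate(recs: dict, truth: dict, k=10):
    sessions = [s for s in recs if truth.get(s)]
    return {
        "ndcg@10": sum(ndcg_at_k(recs[s], truth[s], k) for s in sessions) / len(sessions),
        "mrr": sum(mrr(recs[s], truth[s]) for s in sessions) / len(sessions),
    }
```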
Diversity & discovery metrics
- Intra-List Diversity (ILD) — average distance between recommended items in embedding space.
- Catalog Coverage — percentage of unique items surfaced across all top-K lists.
- Novelty / Serendipity — downweight overserved popular items (e.g., popularity-based re-ranking penalty).
- Creator Fairness — Gini coefficient or top-k share per creator to measure concentration (computed in the sketch after this list).
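A sketch of ILD, catalog coverage, and a Gini-based creator-concentration measure, assuming unit-normalized item embeddings and a creator lookup are available.
```python
# Diversity metric sketch: intra-list diversity, catalog coverage, creator Gini.
# Assumes `emb[item]` is a unit-normalised vector and `creator_of[item]` is known.
import numpy as np

def intra_list_diversity(items, emb):
    # Average pairwise (1 - cosine similarity) within one top-K list; assumes K > 1.
    vecs = np.stack([emb[i] for i in items])
    sim = vecs @ vecs.T
    n = len(items)
    avg_offdiag_sim = (sim.sum() - n) / (n * (n - 1))
    return 1.0 - avg_offdiag_sim

def catalog_coverage(all_rec_lists, catalog_size):
    surfaced = {i for recs in all_rec_lists for i in recs}
    return len(surfaced) / catalog_size

def creator_gini(all_rec_lists, creator_of):
    # 0 = exposure spread evenly across creators, 1 = fully concentrated.
    counts = {}
    for recs in all_rec_lists:
        for i in recs:
            counts[creator_of[i]] = counts.get(creator_of[i], 0) + 1
    x = np.sort(np.array(list(counts.values()), dtype=float))
    n = len(x)
    return (2 * np.sum(np.arange(1, n + 1) * x) / (n * x.sum())) - (n + 1) / n
```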
Multi-episode continuity metric
Define Sequence Consistency Score (SCS): reward recommendations that maintain narrative continuity when the user is mid-episode sequence. Compute the fraction of times the true next episode appears in top-K when session includes previous episodes from the same series.
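One possible implementation of SCS as defined above, assuming each session record carries the id of the true next episode when the user is mid-series (the field names are illustrative).
```python
# Sequence Consistency Score sketch. A session counts toward SCS only if the
# user was mid-series; the score is the hit rate of the true next episode in top-K.
def sequence_consistency_score(sessions, recs, k=10):
    hits, eligible = 0, 0
    for s in sessions:
        next_ep = s.get("true_next_episode_id")  # episode following the one just watched
        if next_ep is None:
            continue                             # not mid-series -> not eligible
        eligible += 1
        if next_ep in recs[s["session_id"]][:k]:
            hits += 1
    return hits / eligible if eligible else 0.0
```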
Latency & resource evaluation
- Measure end-to-end inference latency under constrained CPU and memory limits representative of mobile edge services.
- Report p50, p95, p99 latencies and cold-start times.
- Include model size, peak memory, and an approximate per-request energy/CO2 estimate for optional sustainability scoring (a latency-harness sketch follows this list).
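A minimal latency-harness sketch; score_session stands in for whatever inference entry point a team ships, and the CPU/memory caps are assumed to be enforced outside this script (e.g., via container limits).
```python
# Latency harness sketch: time repeated calls to the team's inference entry
# point and report p50/p95/p99 in milliseconds. `score_session` is a
# placeholder callable; cold-start is measured separately from warm latency.
import time
import numpy as np

def measure_latency(score_session, requests, warmup=20):
    for req in requests[:warmup]:
        score_session(req)                      # warm caches; excluded from stats
    samples = []
    for req in requests[warmup:]:
        t0 = time.perf_counter()
        score_session(req)
        samples.append((time.perf_counter() - t0) * 1000.0)
    return {
        "p50_ms": float(np.percentile(samples, 50)),
        "p95_ms": float(np.percentile(samples, 95)),
        "p99_ms": float(np.percentile(samples, 99)),
    }
```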
Leaderboard scoring aggregation
Use a weighted scoring function so teams must balance objectives. Example weights (customizable):
- 0.40 * NDCG@10
- 0.20 * WTR
- 0.15 * SCS
- 0.15 * Diversity Score (normalized ILD + coverage)
- 0.10 * Latency bonus/penalty (negative if p99 > target)
Publish both the composite score and per-metric breakdown so judges can award special categories.
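The aggregation itself is only a few lines. The sketch below uses the example weights above, assumes each component metric is already normalized to [0, 1], and treats 100 ms as a hypothetical p99 target.
```python
# Composite leaderboard score sketch using the example weights above.
# `m` holds normalised component metrics; 100 ms is an assumed p99 target.
def composite_score(m, p99_ms, p99_target_ms=100.0):
    # Positive when p99 beats the target, negative when it misses it.
    latency_term = max(-1.0, min(1.0, (p99_target_ms - p99_ms) / p99_target_ms))
    return (
        0.40 * m["ndcg@10"]
        + 0.20 * m["wtr"]
        + 0.15 * m["scs"]
        + 0.15 * m["diversity"]      # normalised ILD + coverage
        + 0.10 * latency_term
    )
```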
Prize categories and judging rubrics
Prize diversity encourages varied, real-world solutions. Suggested categories:
- Best Overall Recommender — highest composite leaderboard score.
- Latency Champion — strict p99 and p50 under resource caps; prize for edge-friendly design.
- Diversity & Discovery — best improvement on ILD and catalog coverage vs. baseline.
- Episode Continuity — highest SCS for serialized content.
- Creator Fairness — solutions that demonstrably reduce concentration and help small creators.
- Most Reproducible Entry — clear code, Dockerized runtime, and CI demonstrating correctness.
- Community Pick — voted by participants and mentors (encourages documentation & storytelling).
Practical tips for participants — what top teams do
From experience mentoring hackathons and running leaderboards, top teams optimize both model architecture and performance under deployment constraints.
- Start with a strong recall step. A good recall set that preserves episode continuity drastically improves reranker outcomes. Use session-aware heuristics (last-watch-series boost) plus vector search.
- Use hybrid embeddings. Fuse visual frame embeddings + text (ASR/caption) + metadata into compact vectors (128–256 dims). Quantize them for FAISS to reduce memory and speed up search.
- Train on session slices. Treat sequences as short sessions (5–15 events) instead of independent impressions. Sequence-aware features catch binge behavior.
- Constrain reranker size. Use a small, well-regularized model for reranking. Huge Transformer rerankers often fail the latency contest unless distilled.
- Post-process for diversity. Apply constrained optimization such as Maximal Marginal Relevance (MMR) or greedy diversity constraints to final lists to hit diversity targets without sacrificing too much accuracy (see the MMR sketch after this list).
- Measure real latency. Containerize your inference stack and simulate mobile-like CPU/memory limits. Always report p99 and cold-start numbers.
- Document reproducibility. Provide a Dockerfile, seed data, and a reproducible scoring script. Public GitHub + CI helps judges and community learning.
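For the diversity post-processing tip, a standard greedy MMR re-rank looks roughly like this; lam trades relevance against redundancy, and the embeddings are assumed unit-normalized.
```python
# Greedy Maximal Marginal Relevance (MMR) sketch for the final top-K list.
# `scored` is [(item_id, relevance)] from the reranker; `emb[item]` is a
# unit-normalised vector; lam=1.0 reduces to pure relevance ranking.
import numpy as np

def mmr_rerank(scored, emb, k=10, lam=0.7):
    candidates = dict(scored)
    selected = []
    while candidates and len(selected) < k:
        best_item, best_val = None, -np.inf
        for item, rel in candidates.items():
            # Redundancy = max similarity to anything already selected.
            redundancy = max(
                (float(emb[item] @ emb[s]) for s in selected), default=0.0
            )
            val = lam * rel - (1.0 - lam) * redundancy
            if val > best_val:
                best_item, best_val = item, val
        selected.append(best_item)
        del candidates[best_item]
    return selected
```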
Sample solution outline: hybrid recall + distilled reranker
Here's a concise, actionable pipeline you can implement during a weekend hackathon.
- Compute per-item embeddings: average the MobileViT embeddings of 4 uniformly sampled frames and concatenate with a DistilBERT caption embedding; reduce to 128 dimensions via PCA.
- Build an IVF+PQ FAISS index with 256 centroids and 8-byte PQ to recall top-500 quickly.
- Train a LightGBM model on logged features to predict watch_probability. Use groupwise cross-validation by user sessions to avoid leakage.
- Distill the LightGBM scores into a small MLP deployed as the reranker to meet latency targets, or export it to ONNX for fast serving.
- Run a post-processing step: ensure no more than 2 items from the same creator in the top-10 and insert one exploration slot every 5 results for novelty (sketched below).
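A sketch of that post-processing step, assuming a ranked candidate list, a creator lookup, and a pool of exploration candidates (how that pool is built, e.g. from long-tail recall items, is left open).
```python
# Post-processing sketch: cap items per creator and reserve every 5th slot
# for exploration. Dedup between ranked items and exploration picks is
# omitted for brevity; `exploration_pool` is a hypothetical long-tail pool.
import random

def finalize_top_k(ranked, creator_of, exploration_pool, k=10,
                   max_per_creator=2, explore_every=5):
    out, per_creator = [], {}
    it = iter(ranked)
    while len(out) < k:
        if (len(out) + 1) % explore_every == 0 and exploration_pool:
            pick = random.choice(exploration_pool)          # exploration slot
        else:
            pick = next(it, None)
            if pick is None:
                break                                       # ranked list exhausted
            if per_creator.get(creator_of.get(pick), 0) >= max_per_creator:
                continue                                    # creator cap reached, skip
        out.append(pick)
        c = creator_of.get(pick)
        per_creator[c] = per_creator.get(c, 0) + 1
    return out
```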
Leaderboard mechanics and anti-cheat
Design your leaderboard to reward genuine modeling, not leakage or overfitting to test data.
- Use two-phase scoring: a public validation split (smaller holdout) and a private test split (hidden interactions) for final ranking.
- Limit submission frequency to prevent brute-force tuning to the public split.
- Require a reproducible Docker image and seed to reproduce results before awarding prizes.
- Reward open-source entries: extra points for clear README, reproducibility, and community docs.
Community operations: workshops, mentorship, and judging
The competition is a learning vehicle as much as a contest. Run weekly workshops and pair participants with mentors from industry who have built recommender infrastructure at scale. Suggested schedule:
- Week 0: Kickoff — datasets, baseline, rules.
- Week 1: Retrieval & embeddings workshop (FAISS, quantization, hybrid features).
- Week 2: Reranking & diversity (MLR, MMR, constrained optimization).
- Week 3: Latency & deployment (model distillation, ONNX, Docker edge simulation).
- Final week: Submissions, judging, and demo day.
2026 trends to leverage and watch
Stay current. Recent developments (late 2025–early 2026) impact what's feasible in the contest:
- Mobile-first vertical streaming growth, highlighted by new funding and platform builds, increases demand for episodic recommendation systems (Forbes, Jan 2026).
- Advances in efficient video encoders and on-device inference make sub-100ms rerankers realistic for top-ranked systems.
- Creator-centric discovery and fairness have become public policy and product priorities; judge for creator fairness explicitly in 2026 contests.
- Growth in AI video tooling (e.g., startups raising large rounds for creator tooling) democratizes content creation, which means larger, more long-tailed catalogs to recommend from.
Sample evaluation script checklist (what judges will run)
- Validate output format: top-K per session in required JSON/CSV schema.
- Run offline scorer to compute NDCG@10, MRR, WTR, ILD, coverage, SCS.
- Spin up a Dockerized evaluation harness to measure p50/p95/p99 inference latency under the specified vCPU/RAM caps.
- Re-run sample submissions to verify reproducibility and check for data leakage.
- Compute composite score and per-metric leaderboards.
How to get started (action plan for teams)
- Register on the hackathon portal and clone the provided starter repo.
- Load the Starter dataset pack and run the baseline to get a working submission (under 2 hours).
- Improve recall by adding session-aware boosts and hybrid embeddings.
- Iterate on reranker while measuring both NDCG and your latency profile.
- Submit to the public leaderboard, join mentor office hours, and prepare a 5-min demo for judges.
Final notes — measurement matters
In 2026, recommender systems are judged not only by accuracy but by their real-world impact: latency on constrained mobile devices, fair discovery for creators, and the ability to preserve episodic narratives. A well-designed hackathon that enforces these constraints trains contributors for production realities and surfaces innovations that can be adopted by startups scaling vertical video platforms.
Call to action
Ready to build the next-generation vertical video recommender? Join our community hackathon: download dataset packs, run the baseline, and submit your first result to the leaderboard. Whether you’re optimizing p99 latency, inventing a novel diversity-aware reranker, or improving episode continuity, your work will be judged by industry mentors and showcased to hiring teams.
Sign up, grab the starter repo, and claim your lane — latency, diversity, or creator fairness. See you on the leaderboard.