docs: add design document for local video key-frame extractor

Ben
2026-05-14 00:21:32 -07:00
parent 2ea3fdee30
commit 2e704349f7
+929
@@ -0,0 +1,929 @@
Recommended approach
Use:
<input type="file">
→ URL.createObjectURL(file)
→ hidden <video>
→ sample frames periodically
→ draw sampled frame to canvas
→ compare against previous accepted frame
→ keep only significant-change frames
→ user clicks Copy on selected image
→ navigator.clipboard.write()
URL.createObjectURL(file) gives the browser a local blob URL for the selected file, and you should release it with URL.revokeObjectURL() when done.
Do not analyze every frame at full resolution
For change detection, use a small analysis canvas:
video frame → draw to 160×90 or 320×180 canvas → compare pixels
Then only render the full-resolution frame when you decide it is worth keeping.
Example strategy:
const ANALYSIS_W = 160;
const ANALYSIS_H = 90;
const SAMPLE_INTERVAL = 0.5; // seconds
const CHANGE_THRESHOLD = 0.18; // tune this
That means a 10-minute video sampled every 0.5 seconds is only:
10 min × 60 / 0.5 = 1200 comparisons
At 160×90, that is cheap.
Simple frame-difference algorithm
Use a downscaled canvas and compare luminance differences:
function frameDifference(a, b) {
  let changed = 0;
  const pixels = a.data.length / 4;
  for (let i = 0; i < a.data.length; i += 4) {
    // Rec. 601 luma weights applied to the R, G, B channels; alpha is ignored.
    const lumA = 0.299 * a.data[i] + 0.587 * a.data[i + 1] + 0.114 * a.data[i + 2];
    const lumB = 0.299 * b.data[i] + 0.587 * b.data[i + 1] + 0.114 * b.data[i + 2];
    if (Math.abs(lumA - lumB) > 32) changed++;
  }
  // Fraction of pixels whose brightness changed noticeably (0..1).
  return changed / pixels;
}
Then:
if (diff > CHANGE_THRESHOLD) {
  // save this timestamp as a candidate key frame
}
Better heuristic
Instead of comparing every sampled frame to the immediately previous sample, compare it to the last accepted key frame.
That avoids saving several frames during the same transition.
Frame A accepted
Frame B slightly different → reject
Frame C slightly different → reject
Frame D substantially different from A → accept
Also add a minimum time gap:
const MIN_SECONDS_BETWEEN_ACCEPTED = 2.0;
Clipboard copying
Copying images to the clipboard is possible with navigator.clipboard.write() and ClipboardItem. Browsers commonly support PNG image data, but clipboard access requires a secure context such as HTTPS or localhost, and browser/user activation rules apply.
Example:
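async function copyCanvasToClipboard(canvas) {
  const blob = await new Promise(resolve =>
    canvas.toBlob(resolve, "image/png")
  );
  // toBlob passes null when the canvas cannot be encoded.
  if (!blob) throw new Error("Canvas export failed");
  await navigator.clipboard.write([
    new ClipboardItem({
      "image/png": blob
    })
  ]);
}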
Important: this should happen from a user gesture, like clicking a Copy button.
Memory profile
This is light if implemented correctly:
1 local video blob URL
1 hidden video element
1 small analysis canvas
1 full-res preview canvas/image
4–10 saved output blobs
Avoid:
Storing hundreds/thousands of full-res canvases
Storing every sampled frame
Using FileReader.readAsDataURL(video)
Exporting a huge ZIP unless needed
For your target, keeping 4–10 PNG/WebP blobs in memory is fine.
Practical UX flow
I would build it like this:
User selects video.
App scans with adjustable settings:
sample interval
change threshold
minimum seconds between captures
max captures
App shows candidate thumbnails.
User deletes bad picks if needed.
Each thumbnail has:
Copy PNG
Download
timestamp
Optional: button to rescan with stricter/looser threshold.
Caveats
Plain <video> seeking may not be perfectly frame-accurate.
Sampling every 0.25–1.0 s is much more practical than decoding every frame.
Browser codec support matters. MP4/H.264 is safest.
Safari/Firefox clipboard image support can be more limited than Chromium-based browsers.
For exact frame extraction, use WebCodecs or ffmpeg.wasm, but for your “significant scene change” use case, <video> + canvas is probably enough.
My recommended V1 stack
React/Vite
HTMLVideoElement
Canvas API
Blob/Object URLs
navigator.clipboard.write()
Optional: localForage or IndexedDB if you want session persistence
No backend required. No upload required.
What would be the best framework or library to handle these tasks? Come up with a detailed implementation / design document.
I'll treat this as a local-only browser app design, optimized for your 100–1000 MB input range and 4–10 final images. I'll ground the library/API choices against current browser support rather than assuming older constraints.
Recommendation
Use React + Vite + TypeScript for the app shell, but do the actual video work with native browser APIs first:
Vite + React + TypeScript
HTMLVideoElement
Canvas / OffscreenCanvas
Web Worker
Blob URLs
Clipboard API
For your use case, I would not start with ffmpeg.wasm. It is powerful, but it is overkill for scene/key-frame extraction and can become memory-heavy with 100–1000 MB files. ffmpeg.wasm is a browser/WebAssembly port of FFmpeg, but many real-world reports and project issues revolve around large-file memory limits and browser stability.
For a stronger V2, consider Mediabunny + WebCodecs. Mediabunny is a TypeScript media toolkit designed for browser-side reading/writing/conversion, and WebCodecs exposes native VideoFrame, EncodedVideoChunk, VideoDecoder, and VideoEncoder primitives.
Design Document: Local Video Key-Frame Extractor
1. Goal
Build a browser-only app that allows a user to:
Select a local video file.
Analyze it without uploading to a server.
Detect frames where the image content has significantly changed.
Present a small set of candidate key images.
Let the user copy selected images to the clipboard.
Target input:
Video size: 100–1000 MB
Expected output: 4–10 still images
Processing model: local-only
Backend: none required
2. Recommended architecture
User selects video
Create local Blob URL
Hidden <video> element loads local file
Sampling engine seeks through video
Downscaled analysis canvas compares frames
Candidate detector stores timestamps
Full-resolution renderer extracts selected frames
Gallery shows thumbnails
User copies PNG to clipboard
Core browser APIs:
Task | API
Local file selection | File API
Local video loading | URL.createObjectURL()
Video decoding/seeking | HTMLVideoElement
Pixel comparison | Canvas API
Off-main-thread analysis | Worker + OffscreenCanvas where available
Image export | canvas.toBlob()
Clipboard copy | navigator.clipboard.write() + ClipboardItem
URL.createObjectURL() is appropriate here because it creates a local blob URL for a File/Blob, avoiding the need to load the entire file into JavaScript memory. The Clipboard API supports writing non-text data through ClipboardItem, but it requires a secure context such as HTTPS or localhost.
3. Framework choice
Best V1 stack
Vite
React
TypeScript
Zustand or simple React state
Canvas API
Web Worker
Why:
Vite: fast local dev, easy static deployment.
React: good for gallery, controls, progress UI, and stateful review flow.
TypeScript: useful because media processing code quickly accumulates edge cases.
No backend: app can deploy as static files.
No heavy media dependency initially: browser decoding is enough for threshold-based change detection.
Suggested install:
npm create vite@latest video-keyframe-extractor -- --template react-ts
cd video-keyframe-extractor
npm install
npm run dev
Optional dependencies:
npm install zustand
I would avoid extra UI frameworks unless you want polished controls quickly.
4. Library/API decision matrix
Option | Use for V1? | Notes
Native <video> + canvas | Yes | Best starting point. Simple, no server, no large WASM payload.
Web Worker | Yes | Keeps UI responsive during scans.
OffscreenCanvas | Yes, optional | Good for worker-side pixel analysis where supported.
WebCodecs | V2 | Better low-level control, but more complex. WebCodecs is browser-native and provides direct access to decoded frames.
Mediabunny | V2 / advanced | Good candidate for more robust container/media handling. It advertises efficient browser-side reading and only loading what is needed.
ffmpeg.wasm | Avoid for V1 | Heavy and memory-risky for 100–1000 MB videos. Better for fallback/export/transcoding, not first-pass scene detection.
OpenCV.js | Probably avoid | Useful for advanced CV, but heavy for simple scene-difference detection.
5. App modules
5.1 File intake module
Responsibilities:
Accept local video file.
Validate file type and size.
Create blob URL.
Load video metadata.
Display duration, resolution, codec/browser compatibility status if possible.
Example state:
type LoadedVideo = {
  file: File;
  objectUrl: string;
  duration: number;
  width: number;
  height: number;
};
Important cleanup:
URL.revokeObjectURL(objectUrl);
Never use this for the video:
new FileReader().readAsDataURL(file);
That would load and base64-expand the whole video in memory.
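The "validate file type and size" responsibility above could be covered by a small pure check that runs before the blob URL is created. This is only a sketch under assumptions: the names validateVideoFile and FileLike, the result shape, and the use of 1000 MB as a warning limit (taken from the target input range) are hypothetical and not part of the original design.

```typescript
// Hypothetical soft limit, based on the 100-1000 MB target range.
const MAX_RECOMMENDED_BYTES = 1000 * 1024 * 1024;

// Minimal structural subset of the browser File object, so the check
// can also be exercised outside the browser.
type FileLike = { name: string; type: string; size: number };

type ValidationResult = {
  ok: boolean;
  error: string | null;
  warnings: string[];
};

function validateVideoFile(file: FileLike): ValidationResult {
  const warnings: string[] = [];

  if (file.type === "") {
    // Some browsers report an empty MIME type for less common containers.
    warnings.push("Unknown file type; the browser may not be able to decode it.");
  } else if (file.type.indexOf("video/") !== 0) {
    return {
      ok: false,
      error: '"' + file.name + '" is not a video file (type: ' + file.type + ").",
      warnings,
    };
  }

  if (file.size > MAX_RECOMMENDED_BYTES) {
    warnings.push("File is larger than the recommended 1000 MB; scanning may be slow.");
  }

  return { ok: true, error: null, warnings };
}
```

A real intake step would still need to wait for the video element's metadata to load, since a matching MIME type does not guarantee that the browser can decode the codec.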
5.2 Scan settings module
Expose these controls:
type ScanSettings = {
  sampleIntervalSeconds: number;      // default: 0.5
  analysisWidth: number;              // default: 160
  analysisHeight: number;             // default: 90
  pixelDeltaThreshold: number;        // default: 32
  changedPixelRatioThreshold: number; // default: 0.18
  minSecondsBetweenCaptures: number;  // default: 2.0
  maxCandidates: number;              // default: 20
  finalTargetCount: number;           // default: 8
};
Suggested defaults:
const defaultSettings: ScanSettings = {
  sampleIntervalSeconds: 0.5,
  analysisWidth: 160,
  analysisHeight: 90,
  pixelDeltaThreshold: 32,
  changedPixelRatioThreshold: 0.18,
  minSecondsBetweenCaptures: 2.0,
  maxCandidates: 20,
  finalTargetCount: 8,
};
5.3 Frame sampling module
V1 should seek using a hidden video element:
async function seekVideo(video: HTMLVideoElement, time: number): Promise<void> {
  return new Promise((resolve, reject) => {
    const onSeeked = () => {
      cleanup();
      resolve();
    };
    const onError = () => {
      cleanup();
      reject(video.error);
    };
    // Remove both listeners so the one that did not fire does not linger.
    const cleanup = () => {
      video.removeEventListener("seeked", onSeeked);
      video.removeEventListener("error", onError);
    };
    video.addEventListener("seeked", onSeeked, { once: true });
    video.addEventListener("error", onError, { once: true });
    video.currentTime = Math.min(time, video.duration);
  });
}
Then draw into a small analysis canvas:
function captureAnalysisFrame(
  video: HTMLVideoElement,
  canvas: HTMLCanvasElement,
  width: number,
  height: number
): ImageData {
  canvas.width = width;
  canvas.height = height;
  // willReadFrequently hints that getImageData will be called often.
  const ctx = canvas.getContext("2d", { willReadFrequently: true });
  if (!ctx) throw new Error("Could not get canvas context");
  ctx.drawImage(video, 0, 0, width, height);
  return ctx.getImageData(0, 0, width, height);
}
6. Change detection algorithm
6.1 Simple and effective V1
Compare current sampled frame against the last accepted key frame, not merely the immediately previous sample.
This prevents over-detecting long transitions.
function frameDifferenceRatio(
  a: ImageData,
  b: ImageData,
  pixelDeltaThreshold: number
): number {
  const dataA = a.data;
  const dataB = b.data;
  let changed = 0;
  const pixels = dataA.length / 4;
  for (let i = 0; i < dataA.length; i += 4) {
    const lumA =
      0.299 * dataA[i] +
      0.587 * dataA[i + 1] +
      0.114 * dataA[i + 2];
    const lumB =
      0.299 * dataB[i] +
      0.587 * dataB[i + 1] +
      0.114 * dataB[i + 2];
    if (Math.abs(lumA - lumB) > pixelDeltaThreshold) {
      changed++;
    }
  }
  return changed / pixels;
}
Acceptance logic:
function shouldAcceptCandidate(params: {
  diffRatio: number;
  currentTime: number;
  lastAcceptedTime: number | null;
  changedPixelRatioThreshold: number;
  minSecondsBetweenCaptures: number;
}) {
  const {
    diffRatio,
    currentTime,
    lastAcceptedTime,
    changedPixelRatioThreshold,
    minSecondsBetweenCaptures,
  } = params;
  // Reject frames that did not change enough.
  if (diffRatio < changedPixelRatioThreshold) return false;
  // Reject frames that follow the last accepted frame too closely.
  if (
    lastAcceptedTime !== null &&
    currentTime - lastAcceptedTime < minSecondsBetweenCaptures
  ) {
    return false;
  }
  return true;
}
6.2 Better V1.5: histogram difference
Raw pixel comparison can be sensitive to motion, noise, flashes, compression artifacts, and camera pans.
A better approach is to compare low-resolution color/luma histograms.
Recommended hybrid score:
score = 0.7 × luminance histogram difference
+ 0.3 × pixel difference ratio
This tends to better detect scene/content changes instead of tiny motion.
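The hybrid score above could be sketched as follows. The document does not specify how the histogram difference is computed, so this example assumes a 32-bin luminance histogram compared with a normalized L1 distance (0 for identical distributions, 1 for fully disjoint ones). The function names and the bin count are hypothetical; only the 0.7/0.3 weighting comes from the text.

```typescript
// Assumed bin count; any small power of two would work similarly.
const LUMA_BINS = 32;

// Builds a normalized luminance histogram from RGBA pixel data
// (the layout used by ImageData.data).
function lumaHistogram(rgba: Uint8ClampedArray, bins: number = LUMA_BINS): Float64Array {
  const hist = new Float64Array(bins);
  const pixels = rgba.length / 4;
  if (pixels === 0) return hist;
  for (let i = 0; i < rgba.length; i += 4) {
    const luma = 0.299 * rgba[i] + 0.587 * rgba[i + 1] + 0.114 * rgba[i + 2];
    const bin = Math.min(bins - 1, Math.floor((luma / 256) * bins));
    hist[bin] += 1;
  }
  for (let b = 0; b < bins; b++) hist[b] /= pixels;
  return hist;
}

// Half the L1 distance between two normalized histograms, in [0, 1].
function histogramDifference(a: Float64Array, b: Float64Array): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
  return sum / 2;
}

// Weighted combination from the design text: 0.7 histogram + 0.3 pixel ratio.
function hybridScore(histDiff: number, pixelDiffRatio: number): number {
  return 0.7 * histDiff + 0.3 * pixelDiffRatio;
}
```

Because a histogram discards pixel positions, an object moving across an otherwise static scene barely changes it, which is why the histogram term is weighted more heavily than the raw pixel ratio.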
6.3 Optional V2: perceptual hash
For more stable detection:
Generate a small grayscale frame, e.g. 32×32.
Compute average hash or difference hash.
Compare Hamming distance.
Accept when distance exceeds threshold.
This is good when videos have subtle compression differences but meaningful visual changes.
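The steps above can be illustrated with an average hash. This is a minimal sketch, assuming the frame has already been downscaled and converted to one luminance byte per pixel (e.g. 32×32 = 1024 values). The names averageHash and hammingDistance are hypothetical, and the hash is stored as one 0/1 byte per pixel for readability rather than packed into bits.

```typescript
// Average hash: each output element is 1 if the pixel is brighter than
// the frame's mean luminance, otherwise 0.
function averageHash(gray: ArrayLike<number>): Uint8Array {
  let sum = 0;
  for (let i = 0; i < gray.length; i++) sum += gray[i];
  const mean = gray.length > 0 ? sum / gray.length : 0;
  const bits = new Uint8Array(gray.length);
  for (let i = 0; i < gray.length; i++) bits[i] = gray[i] > mean ? 1 : 0;
  return bits;
}

// Number of positions where the two hashes disagree.
function hammingDistance(a: Uint8Array, b: Uint8Array): number {
  if (a.length !== b.length) throw new Error("Hash lengths differ");
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) distance++;
  }
  return distance;
}
```

Since each bit is relative to the frame's own mean, a uniform brightness or contrast shift tends to leave the hash unchanged, while a change in layout flips many bits. The acceptance threshold on the distance would need tuning, just like the pixel-ratio threshold.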
7. Scan workflow
Pseudo-code:
async function scanVideo(
  video: HTMLVideoElement,
  settings: ScanSettings,
  onProgress: (progress: number) => void
): Promise<CandidateFrame[]> {
  const analysisCanvas = document.createElement("canvas");
  const candidates: CandidateFrame[] = [];
  let lastAcceptedFrame: ImageData | null = null;
  let lastAcceptedTime: number | null = null;
  for (
    let time = 0;
    time < video.duration;
    time += settings.sampleIntervalSeconds
  ) {
    await seekVideo(video, time);
    const imageData = captureAnalysisFrame(
      video,
      analysisCanvas,
      settings.analysisWidth,
      settings.analysisHeight
    );
    onProgress(time / video.duration);
    // The first sampled frame is always kept as the comparison baseline.
    if (!lastAcceptedFrame) {
      candidates.push({
        time,
        score: 1,
        reason: "initial-frame",
      });
      lastAcceptedFrame = imageData;
      lastAcceptedTime = time;
      continue;
    }
    // Compare against the last accepted key frame, not the previous sample.
    const diffRatio = frameDifferenceRatio(
      lastAcceptedFrame,
      imageData,
      settings.pixelDeltaThreshold
    );
    if (
      shouldAcceptCandidate({
        diffRatio,
        currentTime: time,
        lastAcceptedTime,
        changedPixelRatioThreshold: settings.changedPixelRatioThreshold,
        minSecondsBetweenCaptures: settings.minSecondsBetweenCaptures,
      })
    ) {
      candidates.push({
        time,
        score: diffRatio,
        reason: "visual-change",
      });
      lastAcceptedFrame = imageData;
      lastAcceptedTime = time;
    }
    if (candidates.length >= settings.maxCandidates) {
      // Stop early once the cap is reached. Alternatively, keep scanning
      // and let rankAndTrimCandidates choose the strongest candidates.
      break;
    }
  }
  onProgress(1);
  return rankAndTrimCandidates(candidates, settings.finalTargetCount);
}
Candidate type:
type CandidateFrame = {
  time: number;
  score: number;
  reason: "initial-frame" | "visual-change" | "manual";
  thumbnailUrl?: string;
  fullBlob?: Blob;
};
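The scan loop advances by repeatedly adding the sample interval, which can accumulate floating-point rounding error for intervals such as 0.1 that have no exact binary representation. One alternative, shown here purely as a sketch, is to precompute the timestamps from an integer index. The helper name sampleTimes is hypothetical and not part of the original design.

```typescript
// Returns the timestamps to sample: 0, interval, 2*interval, ...
// up to but not including the duration.
function sampleTimes(durationSeconds: number, intervalSeconds: number): number[] {
  // Guard against NaN/Infinity durations (e.g. metadata not loaded yet)
  // and non-positive intervals, which would otherwise loop forever.
  if (!isFinite(durationSeconds) || durationSeconds <= 0) return [];
  if (!isFinite(intervalSeconds) || intervalSeconds <= 0) return [];

  const times: number[] = [];
  // Multiplying by the index keeps each timestamp independent of the
  // rounding error in the previous ones.
  for (let i = 0; i * intervalSeconds < durationSeconds; i++) {
    times.push(i * intervalSeconds);
  }
  return times;
}
```

With this helper the scan could iterate over `sampleTimes(video.duration, settings.sampleIntervalSeconds)` and report progress as the index divided by the array length.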
8. Candidate ranking
Since the desired final count is only 4–10, do not simply take every threshold crossing.
Use ranking:
Always include first frame if useful.
Sort detected changes by score.
Enforce time diversity.
Keep the top N.
Example:
function rankAndTrimCandidates(
  candidates: CandidateFrame[],
  targetCount: number
): CandidateFrame[] {
  return [...candidates]
    .sort((a, b) => b.score - a.score) // strongest changes first
    .slice(0, targetCount)
    .sort((a, b) => a.time - b.time);  // back to chronological order
}
Better version:
Divide video into N temporal buckets
Pick the strongest candidate from each bucket
Then fill remaining slots by highest score
This avoids ending up with all images from one active segment.
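The bucketed version could look roughly like the sketch below. It assumes one bucket per target image and uses a reduced candidate shape; the names ScoredFrame and rankWithTemporalBuckets are hypothetical.

```typescript
// Reduced candidate shape for this sketch: only time and score matter.
type ScoredFrame = { time: number; score: number };

function rankWithTemporalBuckets(
  candidates: ScoredFrame[],
  durationSeconds: number,
  targetCount: number
): ScoredFrame[] {
  if (targetCount <= 0 || !(durationSeconds > 0)) return [];

  // Step 1: strongest candidate per temporal bucket (one bucket per slot).
  const bestPerBucket: (ScoredFrame | undefined)[] = [];
  for (const c of candidates) {
    const raw = Math.floor((c.time / durationSeconds) * targetCount);
    const bucket = Math.min(targetCount - 1, Math.max(0, raw));
    const best = bestPerBucket[bucket];
    if (best === undefined || c.score > best.score) bestPerBucket[bucket] = c;
  }
  const picked: ScoredFrame[] = [];
  for (let i = 0; i < targetCount; i++) {
    const best = bestPerBucket[i];
    if (best !== undefined) picked.push(best);
  }

  // Step 2: fill any remaining slots with the highest-scoring leftovers.
  const rest = candidates
    .filter(c => picked.indexOf(c) === -1)
    .sort((a, b) => b.score - a.score);
  for (const c of rest) {
    if (picked.length >= targetCount) break;
    picked.push(c);
  }

  // Return in chronological order for display.
  return picked.sort((a, b) => a.time - b.time);
}
```

For example, with three strong candidates clustered in the first few seconds and two weaker ones later, a target of 3 keeps one from the cluster plus the two later frames, where a pure score sort would keep only the cluster.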
9. Full-resolution extraction
Only render full-resolution images for selected candidates.
async function extractFullResolutionFrame(
  video: HTMLVideoElement,
  time: number,
  type: "image/png" | "image/jpeg" | "image/webp" = "image/png",
  quality?: number
): Promise<Blob> {
  await seekVideo(video, time);
  // Size the canvas to the video's intrinsic resolution.
  const canvas = document.createElement("canvas");
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const ctx = canvas.getContext("2d");
  if (!ctx) throw new Error("Could not get canvas context");
  ctx.drawImage(video, 0, 0);
  return new Promise((resolve, reject) => {
    canvas.toBlob(
      blob => {
        if (!blob) reject(new Error("Canvas export failed"));
        else resolve(blob);
      },
      type,
      quality
    );
  });
}
For clipboard, prefer PNG:
async function copyImageBlobToClipboard(blob: Blob) {
  await navigator.clipboard.write([
    new ClipboardItem({
      [blob.type]: blob,
    }),
  ]);
}
Use:
const blob = await extractFullResolutionFrame(video, candidate.time, "image/png");
await copyImageBlobToClipboard(blob);
Clipboard image writing should be initiated by a user action, such as clicking a Copy button, and the app should be served from HTTPS or localhost because the Clipboard API is gated behind secure contexts.
10. UI design
Main layout
┌──────────────────────────────────────────────┐
│ Video Key Frame Extractor                    │
├──────────────────────────────────────────────┤
│ [Select video]  filename.mp4                 │
│ Duration: 04:32 | Size: 612 MB | 1920×1080   │
├──────────────────────────────────────────────┤
│ Scan Settings                                │
│ Sample interval:  [0.5s]                     │
│ Change threshold: [18%]                      │
│ Min gap:          [2.0s]                     │
│ Target images:    [8]                        │
│ [Scan Video]                                 │
├──────────────────────────────────────────────┤
│ Progress bar                                 │
├──────────────────────────────────────────────┤
│ Candidate Gallery                            │
│ [thumb] 00:04.5 score .24 [Copy] [Download]  │
│ [thumb] 00:18.0 score .41 [Copy] [Download]  │
│ [thumb] 01:22.5 score .37 [Copy] [Download]  │
└──────────────────────────────────────────────┘
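The gallery rows in the mock-up show timestamps as mm:ss with one decimal (for example 00:04.5). A small formatter for that label might look like this; the name formatTimestamp is hypothetical, and treating non-finite or negative input as zero is an assumption of this sketch.

```typescript
// Formats a time in seconds as "mm:ss.t", e.g. 82.5 -> "01:22.5".
function formatTimestamp(seconds: number): string {
  // Work in whole tenths of a second so rounding carries correctly
  // (59.96 becomes 01:00.0 rather than 00:60.0).
  const safe = isFinite(seconds) && seconds > 0 ? seconds : 0;
  const totalTenths = Math.round(safe * 10);

  const minutes = Math.floor(totalTenths / 600);
  const wholeSeconds = Math.floor((totalTenths % 600) / 10);
  const tenths = totalTenths % 10;

  const pad2 = (n: number): string => (n < 10 ? "0" + n : String(n));
  return pad2(minutes) + ":" + pad2(wholeSeconds) + "." + tenths;
}
```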
Controls
Minimum controls:
Select video
Scan
Stop scan
Reset
Threshold slider
Sample interval dropdown
Target count
Copy image
Download image
Advanced controls:
“Prefer fewer images”
“Prefer more images”
“Ignore tiny motion”
“Include first frame”
“Manually add current preview frame”
11. State model
Use Zustand or plain React reducer.
type AppState = {
  video: LoadedVideo | null;
  scanStatus: "idle" | "loading" | "scanning" | "done" | "error";
  settings: ScanSettings;
  candidates: CandidateFrame[];
  selectedCandidateId: string | null;
  progress: number;
  error: string | null;
};
Actions:
type AppActions = {
  loadVideo(file: File): Promise<void>;
  scanVideo(): Promise<void>;
  cancelScan(): void;
  updateSettings(settings: Partial<ScanSettings>): void;
  copyCandidate(candidateId: string): Promise<void>;
  downloadCandidate(candidateId: string): Promise<void>;
  removeCandidate(candidateId: string): void;
};
12. Worker design
For V1, the main thread can own the hidden video element because HTMLVideoElement is DOM-bound. But pixel comparison can be pushed into a worker.
Main thread:
seek video
draw frame to analysis canvas
get ImageData
send ImageData buffer to worker
worker returns diff score
Worker message:
type AnalyzeFrameMessage = {
  type: "analyze-frame";
  current: ImageData;
  previous: ImageData;
  pixelDeltaThreshold: number;
};
Worker response:
type AnalyzeFrameResult = {
  type: "frame-score";
  diffRatio: number;
};
For your target size, this may not be necessary at first. But it is worth structuring the code so the algorithm can move into a worker later.
13. Performance expectations
For a 10-minute video:
Sample interval: 0.5s
Comparisons: ~1200
Analysis resolution: 160×90
Pixels per comparison: 14,400
Total pixel comparisons: ~17.3 million
That is very reasonable in browser JavaScript.
For a 30-minute video:
Sample interval: 0.5s
Comparisons: ~3600
Analysis resolution: 160×90
Total pixel comparisons: ~51.8 million
Still reasonable. The bottleneck will usually be video seeking/decoding, not the pixel math.
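The figures above follow directly from the scan settings, so the app could show an estimate before scanning starts. The sketch below reproduces the arithmetic; the function name estimateScanCost and its return shape are hypothetical.

```typescript
type ScanCostEstimate = {
  comparisons: number;           // number of sampled frames
  pixelsPerComparison: number;   // analysis width × height
  totalPixelComparisons: number; // product of the two
};

function estimateScanCost(
  durationSeconds: number,
  sampleIntervalSeconds: number,
  analysisWidth: number,
  analysisHeight: number
): ScanCostEstimate {
  const comparisons = Math.ceil(durationSeconds / sampleIntervalSeconds);
  const pixelsPerComparison = analysisWidth * analysisHeight;
  return {
    comparisons,
    pixelsPerComparison,
    totalPixelComparisons: comparisons * pixelsPerComparison,
  };
}
```

This only covers the pixel math. It does not model seek and decode time, which the text above identifies as the usual bottleneck.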
14. File-size considerations
Your 100–1000 MB range is practical because:
The video is referenced via local file blob URL.
You only decode sampled frames.
You only store 4–10 output blobs.
Analysis frames are tiny.
Memory use should look roughly like:
Video decoder buffers: browser-managed
Analysis frame A: ~160 × 90 × 4 = 57.6 KB
Analysis frame B: ~57.6 KB
Full-res output image: only when accepted/copied
Candidate blobs: 4–10 images
At 1920×1080, one raw RGBA canvas frame is about:
1920 × 1080 × 4 = ~8.3 MB
At 3840×2160:
3840 × 2160 × 4 = ~33.2 MB
Still fine if you only create one full-res canvas at a time.
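The raw-frame arithmetic above can also drive a guard for very large videos: if a full-resolution canvas would exceed some memory budget, extract at a reduced size instead. This is a sketch under assumptions; the byte budget is an arbitrary illustrative value, and the names rawRgbaBytes and fitWithinByteBudget are hypothetical.

```typescript
// Bytes needed for one uncompressed RGBA frame (4 bytes per pixel).
function rawRgbaBytes(width: number, height: number): number {
  return width * height * 4;
}

// Returns dimensions that keep the aspect ratio while fitting the raw
// frame within maxBytes. Frames already within budget are unchanged.
function fitWithinByteBudget(
  width: number,
  height: number,
  maxBytes: number
): { width: number; height: number } {
  const bytes = rawRgbaBytes(width, height);
  if (bytes <= maxBytes) return { width, height };

  // Area scales with the square of the linear scale factor.
  const scale = Math.sqrt(maxBytes / bytes);
  return {
    width: Math.max(1, Math.floor(width * scale)),
    height: Math.max(1, Math.floor(height * scale)),
  };
}
```

The returned size could be used for the canvas dimensions and the `drawImage` target rectangle in the full-resolution extraction step.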
15. Browser compatibility
Best target
Start with Chromium-based browsers:
Chrome
Edge
Brave
Arc
Chromium
Reasons:
Strong media support.
Good Clipboard API support.
Good WebCodecs path for later.
Firefox should work for the V1 <video> + canvas path. WebCodecs VideoDecoder is listed by MDN as supported in Firefox 130+, while Chromium browsers have supported it since version 94.
Safari may work for simple video/canvas extraction, but clipboard image writing and codec behavior are areas to test carefully.
16. Security and privacy model
The app should explicitly state:
Your video is processed locally in your browser.
The file is not uploaded.
No server receives the video or extracted images.
Implementation requirements:
Do not send file contents to analytics.
Do not log object URLs remotely.
Avoid third-party scripts if privacy is part of the value prop.
Serve over HTTPS.
Make the app installable as a PWA if desired.
17. Error handling
Handle:
Case | Response
Unsupported video codec | Show “This browser cannot decode this video.”
Huge resolution | Offer lower-quality extraction or warn about memory.
Clipboard blocked | Offer download fallback.
Scan too slow | Let user increase sample interval.
Too many candidates | Raise threshold or bucket results.
No candidates found | Lower threshold or include first/middle/end frames.
Clipboard fallback:
try {
  await copyImageBlobToClipboard(blob);
} catch {
  downloadBlob(blob, "frame.png");
}
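For the "No candidates found" case in the table, the first/middle/end fallback could be generated as below. This is a sketch only: the name fallbackTimestamps is hypothetical, and the small end margin rests on the assumption that seeking slightly before the end is more likely to yield a drawable frame than seeking to the exact duration.

```typescript
// Returns up to three timestamps (start, middle, near end) to use when
// change detection finds nothing.
function fallbackTimestamps(durationSeconds: number, endMarginSeconds = 0.1): number[] {
  if (!isFinite(durationSeconds) || durationSeconds <= 0) return [0];

  const end = Math.max(0, durationSeconds - endMarginSeconds);
  const times = [0, durationSeconds / 2, end];

  // Very short clips can produce duplicate or out-of-order values.
  return times
    .filter((t, i) => times.indexOf(t) === i)
    .sort((a, b) => a - b);
}
```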
18. Suggested file structure
src/
  app/
    App.tsx
    store.ts
  components/
    FilePicker.tsx
    VideoMetadataPanel.tsx
    ScanSettingsPanel.tsx
    ProgressPanel.tsx
    CandidateGallery.tsx
    CandidateCard.tsx
  media/
    loadVideo.ts
    seekVideo.ts
    captureFrame.ts
    extractFullFrame.ts
    clipboard.ts
  analysis/
    frameDifference.ts
    histogramDifference.ts
    candidateRanking.ts
    scanVideo.ts
  workers/
    frameAnalysis.worker.ts
  types/
    media.ts
    scan.ts
19. MVP implementation plan
Phase 1: Core prototype
Build:
File picker
Hidden video loader
Metadata display
Scan button
Basic frame-difference detection
Candidate list with timestamps
Thumbnail generation
No worker yet.
Phase 2: Copy/export
Add:
Full-res frame extraction
Copy PNG to clipboard
Download PNG fallback
Remove candidate
Manual add candidate
Phase 3: Better detection
Add:
Histogram comparison
Temporal bucketing
Threshold presets
“Strict / Balanced / Sensitive” modes
Example presets:
const presets = {
  strict: {
    sampleIntervalSeconds: 1.0,
    changedPixelRatioThreshold: 0.28,
    minSecondsBetweenCaptures: 4,
  },
  balanced: {
    sampleIntervalSeconds: 0.5,
    changedPixelRatioThreshold: 0.18,
    minSecondsBetweenCaptures: 2,
  },
  sensitive: {
    sampleIntervalSeconds: 0.25,
    changedPixelRatioThreshold: 0.1,
    minSecondsBetweenCaptures: 1,
  },
};
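Each preset overrides only three of the scan settings, so applying one amounts to a shallow merge over the current settings. A minimal sketch, assuming the ScanSettings shape from section 5.2; the PresetName type and applyPreset helper are hypothetical.

```typescript
type ScanSettings = {
  sampleIntervalSeconds: number;
  analysisWidth: number;
  analysisHeight: number;
  pixelDeltaThreshold: number;
  changedPixelRatioThreshold: number;
  minSecondsBetweenCaptures: number;
  maxCandidates: number;
  finalTargetCount: number;
};

type PresetName = "strict" | "balanced" | "sensitive";

// Same values as the presets listed above, typed as partial overrides.
const scanPresets: Record<PresetName, Partial<ScanSettings>> = {
  strict: { sampleIntervalSeconds: 1.0, changedPixelRatioThreshold: 0.28, minSecondsBetweenCaptures: 4 },
  balanced: { sampleIntervalSeconds: 0.5, changedPixelRatioThreshold: 0.18, minSecondsBetweenCaptures: 2 },
  sensitive: { sampleIntervalSeconds: 0.25, changedPixelRatioThreshold: 0.1, minSecondsBetweenCaptures: 1 },
};

// Returns a new settings object; the input is left untouched so it can
// be used safely with React state or a Zustand store.
function applyPreset(current: ScanSettings, name: PresetName): ScanSettings {
  return { ...current, ...scanPresets[name] };
}
```

Settings the preset does not mention, such as the analysis resolution and target count, carry over unchanged from the current state.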
Phase 4: Worker optimization
Move frame comparison into a worker.
Phase 5: Advanced media backend
Evaluate:
Mediabunny + WebCodecs
This is the path if you want better control over frame extraction, codecs, and container-level handling. Mediabunny is actively positioned as a modern browser media toolkit, and WebCodecs is the underlying native browser primitive for direct encoded/decoded media handling.
20. Final recommendation
For your app, build V1 with:
Vite + React + TypeScript
HTMLVideoElement
Canvas
Blob URLs
Clipboard API
Optional Zustand
Do not start with:
ffmpeg.wasm
OpenCV.js
Server-side processing
Best V1 strategy:
Sample every 0.5s
Analyze at 160×90
Compare against last accepted key frame
Use threshold + minimum time gap
Rank candidates down to 4–10
Extract full-res only when displaying/copying