We Automated Our Product Walkthrough Video. The Whole Thing.
Stop manually recording product demos. Learn how to use Playwright, ElevenLabs, and FFmpeg to automate walkthrough videos with AI narration and perfect timing.
We needed a walkthrough video for Article Image Studio. You know the kind — someone clicks through the app while a voice explains what's happening. Three minutes, clear narration, every feature covered.
The traditional approach: screen record yourself clicking through the app, re-record when you stumble, add a voiceover in post, re-record again when the UI changes, spend more time on the video than you spent building the feature. We've all been there.
So we automated the entire thing. The narration is AI-generated. The browser clicks itself. The video records automatically. One command produces a finished walkthrough with consistent audio, perfect timing, and no retakes. When the UI changes, we run it again.

The Script Is the Source of Truth
Everything starts with a narration file. It's a JavaScript array of segments, each with an ID, the text to be spoken, and a type that tells the automation system what to do during that segment:
const SEGMENTS = [
  {
    id: "intro",
    text: "Welcome to Article Image Studio. This tool lets you turn any article into a set of beautifully styled AI-generated images, ready for your blog.",
    type: "action",
  },
  {
    id: "tour-s1-upload",
    text: "Here is the article upload area. You can paste your article text directly, or upload a text, markdown, or PDF file.",
    type: "tour",
  },
  // ... 20 more segments
];

Two types: tour segments highlight a specific UI element using the app's built-in help tour overlay. action segments are when the automation actually does something — pasting an article, clicking a button, waiting for generation to finish.
This file is the single source of truth for the entire video. Change the text, regenerate the audio, re-run the walkthrough. The video updates itself.

Step 1: Generate the Narration Audio
Each segment gets its own MP3 file through ElevenLabs, using the same stitching approach we use for our blog podcast audio. The generator walks through the segments in order, passing previous request IDs forward so the voice maintains consistent pacing and tone across all segments:
for (const segment of segments) {
  const { buffer, requestId } = await generateSegmentAudio(
    segment.text,
    voiceId,
    requestIds
  );
  fs.writeFileSync(`audio/${segment.id}.mp3`, buffer);
  if (requestId) requestIds.push(requestId);
}

It also measures the duration of each generated MP3 and writes a durations.json manifest. This is critical — the walkthrough automation needs to know exactly how long each narration clip is so it can time the browser actions to match the audio:
{
  "intro": 8420,
  "tour-s1-header": 6180,
  "tour-s1-upload": 5930,
  "action-paste-article": 3210,
  ...
}

Run it once and the audio is done. Run it again with --force to regenerate after editing the narration text. Run it with --id intro to regenerate just one segment you tweaked.
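The generateSegmentAudio helper isn't shown above, so here is a minimal sketch of what it might look like against the public ElevenLabs text-to-speech REST API. The endpoint, the previous_request_ids stitching field, and the request-id response header are our reading of the ElevenLabs docs; verify them before relying on this:

async function generateSegmentAudio(text, voiceId, requestIds) {
  const res = await fetch(`https://api.elevenlabs.io/v1/text-to-speech/${voiceId}`, {
    method: "POST",
    headers: {
      "xi-api-key": process.env.ELEVENLABS_API_KEY,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      text,
      model_id: "eleven_multilingual_v2",
      // Request stitching: pass recent request IDs so pacing and tone
      // stay consistent across segments (the API caps this at three).
      previous_request_ids: requestIds.slice(-3),
    }),
  });
  if (!res.ok) throw new Error(`TTS request failed: ${res.status}`);
  return {
    buffer: Buffer.from(await res.arrayBuffer()),
    requestId: res.headers.get("request-id"),
  };
}

For the durations manifest, ffprobe (bundled with ffmpeg) is enough to measure each clip; ffprobe -v error -show_entries format=duration -of csv=p=0 audio/intro.mp3 prints the length in seconds.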

Step 2: Record the Browser
This is where it gets interesting. A Playwright script launches a browser, navigates to the running app, and performs every action in the walkthrough — timed to match the narration audio.
The app has a built-in help tour component that highlights UI elements with an overlay. The walkthrough script controls it programmatically through a window.__helpTour API that the tour component exposes:
const tour = {
  open: (page) => page.evaluate(() => window.__helpTour?.open()),
  close: (page) => page.evaluate(() => window.__helpTour?.close()),
  next: (page) => page.evaluate(() => window.__helpTour?.nextStep()),
  goTo: (page, idx) => page.evaluate((i) => window.__helpTour?.goToStep(i), idx),
};

For each tour segment, the script opens the tour, advances to the right step, takes a screenshot, and waits for the duration of that segment's narration audio. For each action segment, it performs real interactions — filling text areas, clicking buttons, waiting for API responses:
// Paste the sample article
const textarea = await page.waitForSelector("textarea");
await textarea.fill(SAMPLE_ARTICLE);

// Click generate and wait for the AI to finish
const genBtn = await page.$('[data-help="generate-btn"] button');
await genBtn.click();
await page.waitForSelector('[data-help="content-brief"]', { timeout: 120000 });

The timing is driven by the durations manifest. After each action, the script sleeps for exactly as long as the corresponding narration segment lasts:
async function waitForSegment(durations, segmentId) {
  const ms = durations[segmentId] || 3000;
  await sleep(ms);
}

This means the browser actions are synchronized with the narration even though the audio isn't playing during recording. The video and audio tracks are the same length because they're both derived from the same timing data.
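Putting the pieces together, the core of run-walkthrough.js is a loop that dispatches on segment type. A condensed sketch, where tourStepFor and runAction are hypothetical stand-ins for the real step mapping and action handlers:

const durations = JSON.parse(fs.readFileSync("audio/durations.json", "utf8"));

for (const segment of SEGMENTS) {
  if (segment.type === "tour") {
    await tour.open(page); // make sure the overlay is visible
    await tour.goTo(page, tourStepFor(segment.id));
  } else {
    await tour.close(page);
    await runAction(page, segment.id); // fill, click, wait for the app
  }
  await waitForSegment(durations, segment.id); // hold for the narration length
}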
Playwright records the entire session as a WebM video file. Every tour highlight, every click, every loading spinner — all captured automatically.
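Video capture is built into Playwright's browser context, so nothing extra is needed. A sketch of the setup, with the directory and resolution as placeholder values:

const { chromium } = require("playwright");

const browser = await chromium.launch({ headless: true });
const context = await browser.newContext({
  viewport: { width: 1280, height: 720 },
  recordVideo: { dir: "recordings/", size: { width: 1280, height: 720 } },
});
const page = await context.newPage();

// ... drive the walkthrough ...

const video = page.video();
await context.close(); // closing the context finalizes the .webm file
console.log("Recorded:", await video.path());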

Step 3: Merge Video and Audio
The final step combines the silent Playwright recording with the narration audio. A merge script creates a single audio track from all the segments with configurable silence gaps between them, then muxes it with the video using ffmpeg:
// Generate silence gaps between segments
execSync(
  `ffmpeg -y -f lavfi -i anullsrc=r=44100:cl=mono -t ${
    gap / 1000
  } -q:a 9 "${silencePath}"`
);

// Build a concat list: segment, silence, segment, silence...
for (let i = 0; i < SEGMENTS.length; i++) {
  lines.push(`file '${AUDIO_DIR}/${SEGMENTS[i].id}.mp3'`);
  if (i < SEGMENTS.length - 1) {
    lines.push(`file '${silencePath}'`);
  }
}

// Concatenate all audio
execSync(
  `ffmpeg -y -f concat -safe 0 -i "${concatListPath}" -c copy "${combinedAudioPath}"`
);

// Merge video + audio into final MP4
execSync(
  `ffmpeg -y -i "${videoPath}" -i "${combinedAudioPath}" -c:v libx264 -c:a aac -b:a 192k -shortest "${finalPath}"`
);

The -shortest flag is the safety net — if the video and audio are slightly different lengths (they shouldn't be, but real-world timing has variance), ffmpeg trims to the shorter track rather than padding with silence or frozen frames.
The output includes a timing summary so you can verify everything lines up:
Segment timing:
[00:00:00] intro (8.4s)
[00:00:09] tour-s1-header (6.2s)
[00:00:16] tour-s1-upload (5.9s)
...
Total: 3m 12s
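That summary is plain arithmetic over durations.json: a running sum of clip lengths plus silence gaps. A sketch, with the gap value and the fmt helper as our own stand-ins:

const durations = JSON.parse(fs.readFileSync("audio/durations.json", "utf8"));
const GAP_MS = 800; // assumed; must match the gap merge-video.js uses

let offset = 0;
for (const segment of SEGMENTS) {
  const ms = durations[segment.id] || 3000;
  console.log(`[${fmt(offset)}] ${segment.id} (${(ms / 1000).toFixed(1)}s)`);
  offset += ms + GAP_MS;
}
console.log(`Total: ${fmt(offset - GAP_MS)}`);

function fmt(ms) {
  const total = Math.floor(ms / 1000);
  const h = String(Math.floor(total / 3600)).padStart(2, "0");
  const m = String(Math.floor((total % 3600) / 60)).padStart(2, "0");
  const s = String(total % 60).padStart(2, "0");
  return `${h}:${m}:${s}`;
}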

Why This Matters
The obvious benefit is that you don't have to re-record the video when the UI changes. Update the app, run the pipeline, upload the new video. But the less obvious benefit is consistency. Every walkthrough sounds the same, looks the same, covers the same features in the same order. There's no "oh, I forgot to show the color palette picker" or "I stumbled over that sentence, let me start over."
It also means the walkthrough is version-controlled. The narration script is in your repo. The automation script is in your repo. The audio generation and video merge are reproducible commands. You can diff your walkthrough the same way you diff your code.
For a small shop like ours, this is the difference between having a walkthrough video and not having one. The traditional approach — record, edit, re-record, add voiceover, re-record again when something changes — takes half a day and produces something you're reluctant to update. This approach takes a few minutes of compute time and produces something you can regenerate on every release.

The Full Content Stack
If you've been following this series, the complete pipeline from idea to fully-produced content now looks like this:
An AI interview extracts your expertise into raw material. Editing shapes it into a finished article. Article Image Studio generates styled illustrations. A TTS script produces a podcast episode. And now, the walkthrough automation produces a product video.
Five outputs from one conversation. The thinking was the hard part. Everything else is rendering.

The Code
The walkthrough toolkit is four files:
narration.js defines the segments — the single source of truth for what gets said and when.
generate-audio.js turns those segments into individual MP3 files with duration metadata.
run-walkthrough.js launches Playwright, drives the browser through every step timed to the narration durations, and records the screen.
merge-video.js combines the silent recording with the audio segments into a finished MP4.
The commands:
# 1. Generate narration audio for all segments
node generate-audio.js
# 2. Start your dev server, then record the walkthrough
node run-walkthrough.js
# 3. Merge video + audio into final MP4
node merge-video.js

The walkthrough script supports --headed to watch the browser do its thing, --dry-run to print the timeline without launching a browser, --screenshots-only to grab stills without video, and --skip-gen to skip the AI generation steps if you already have a project with generated content.
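None of these flags need a CLI library; a minimal sketch of how they might be read:

const args = process.argv.slice(2);
const flags = {
  headed: args.includes("--headed"),
  dryRun: args.includes("--dry-run"),
  screenshotsOnly: args.includes("--screenshots-only"),
  skipGen: args.includes("--skip-gen"),
};

const browser = await chromium.launch({ headless: !flags.headed });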
The whole toolkit assumes your app has a help tour component with a window.__helpTour API. If your app doesn't have that, the tour segments won't work — but the action segments (clicking, typing, waiting) work with any web app Playwright can drive.
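If you want to add that hook to your own app, the exposure side is small. A sketch assuming a React tour component, with open and step as the component's own state and the overlay markup left as a placeholder:

import { useEffect, useState } from "react";

function HelpTour({ steps }) {
  const [open, setOpen] = useState(false);
  const [step, setStep] = useState(0);

  useEffect(() => {
    // Expose an imperative API for automation (and debugging) to drive
    window.__helpTour = {
      open: () => setOpen(true),
      close: () => setOpen(false),
      nextStep: () => setStep((s) => Math.min(s + 1, steps.length - 1)),
      goToStep: (i) => setStep(i),
    };
    return () => {
      delete window.__helpTour;
    };
  }, [steps.length]);

  if (!open) return null;
  // Render the highlight overlay for steps[step] here.
  return <div className="tour-overlay">{steps[step].text}</div>;
}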
If you're building developer tools, SaaS products, or anything with a UI that changes faster than you can record videos — this is the approach. Write your narration once, automate the browser, and let the pipeline produce a new video every time you ship.

Related Articles
Every Article You Write Is Already a Podcast. You Just Haven't Pressed the Button Yet.
Your Article Is Done. Now It Needs Pictures.
Stop Writing Articles. Start Having Conversations.