How to Handle WhatsApp Media Messages in Webhooks (Images, Docs, Voice Notes)

WhatsApp media webhook pipeline — media ID to download URL to converted file, with voice note OGG to MP3 conversion and Whisper transcription

In this guide: Media type reference table · The 5-minute URL expiry · Full download pipeline · All media payload schemas · Image & document handling · Voice note OGG→MP3 conversion · Whisper transcription · S3/GCS storage patterns · Complete handler with all types · SocialHook normalized format

Media types: complete reference table

The Cloud API delivers five media message types, each with different formats, size limits, and handling requirements. Know the constraints before you build the handler.

	type value	Formats accepted	Size limit	Special notes
🖼️	image	JPEG, PNG	5 MB	Optional `caption`. GIF not supported inline — send as document.
🎵	audio	AAC, AMR, MP3, OGG/Opus	16 MB	`voice: true` if recorded in-app (OGG/Opus). Regular audio files may vary. Always convert OGG before transcription.
🎬	video	MP4, 3GPP	16 MB	Optional `caption`. H.264 video + AAC audio recommended for broadest compatibility.
📄	document	Any MIME type Meta accepts	100 MB	Includes `filename` — store it. PDF, DOCX, XLSX, images as documents, etc.
💿	sticker	WebP	Static 100 KB / Animated 500 KB	`animated: true` if animated. All current major browsers render WebP natively.
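
The limits in the table can be turned into a cheap pre-download check, since the media endpoint returns `file_size` before you fetch any bytes. The sketch below is illustrative only: `MAX_BYTES` and `check_media` are hypothetical names, a megabyte is assumed to be 1024 × 1024 bytes, and stickers use the larger of the two sticker limits as a coarse ceiling.

```python
from typing import Optional

MB = 1024 * 1024  # assumption: binary megabytes

# Upper size bounds per message type, taken from the reference table.
# Stickers use the larger sticker limit (500 KB) as a single ceiling.
MAX_BYTES = {
    "image": 5 * MB,
    "audio": 16 * MB,
    "video": 16 * MB,
    "document": 100 * MB,
    "sticker": 500 * 1024,
}


def check_media(media_type: str, file_size: int) -> Optional[str]:
    """Return a reason string if the media should be skipped, else None."""
    limit = MAX_BYTES.get(media_type)
    if limit is None:
        return f"unknown media type: {media_type}"
    if file_size > limit:
        return f"{media_type} is {file_size} bytes, over the {limit}-byte limit"
    return None
```

Returning a reason string rather than raising keeps the check easy to log and lets the caller decide whether an oversized file is an error or just something to ignore.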

The 5-minute URL expiry — the detail that breaks most implementations

This is the most commonly missed detail in WhatsApp media handling, and it causes the most production bugs. When you call the Cloud API media endpoint to resolve a media ID, you receive a temporary download URL. That URL is valid for approximately 5 minutes.

The two patterns developers use — and why one breaks:

❌ Store the URL, download later — you receive the webhook, call the media endpoint, store the temporary URL in your database for async processing. By the time your worker picks it up, the URL has expired. You get a 403. This pattern fails.
✓ Download immediately, store the file — you receive the webhook, immediately resolve the media ID to a URL, immediately download the bytes, store the file on your own storage (S3, GCS, disk), save only the storage path in your database. Async processing works on the stored file. This pattern is correct.

Alternative: store the media ID, resolve on demand. If you don't need the file immediately, you can store just the media ID and resolve it fresh when you need the file. Media IDs are valid for 30 days — much longer than the temporary download URL. This is useful when you're not sure if you'll ever need the binary (e.g. stickers you might ignore). Just know the resolve call adds latency when you finally need it.
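
If you go the store-the-ID route, it helps to keep both clocks explicit in your own records. The following is a minimal sketch, assuming the 30-day and roughly 5-minute lifetimes described above; `MediaRef`, its fields, and the 30-second safety margin are hypothetical choices, not part of any SDK.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

ID_TTL = timedelta(days=30)            # media ID lifetime stated above
URL_TTL = timedelta(minutes=5)         # approximate download-URL lifetime
SAFETY_MARGIN = timedelta(seconds=30)  # assumption: treat URLs as stale a bit early


@dataclass
class MediaRef:
    media_id: str
    received_at: datetime
    url: Optional[str] = None
    url_resolved_at: Optional[datetime] = None

    def id_is_valid(self, now: datetime) -> bool:
        """True while the media ID can still be resolved."""
        return now - self.received_at < ID_TTL

    def url_is_fresh(self, now: datetime) -> bool:
        """True only if a cached URL exists and is safely inside its lifetime."""
        if self.url is None or self.url_resolved_at is None:
            return False
        return now - self.url_resolved_at < URL_TTL - SAFETY_MARGIN

    def needs_resolve(self, now: datetime) -> bool:
        """True when the caller should call the media endpoint again."""
        return self.id_is_valid(now) and not self.url_is_fresh(now)


received = datetime(2026, 5, 13, 12, 0, tzinfo=timezone.utc)
ref = MediaRef("12345678901234", received)
ref.url, ref.url_resolved_at = "https://example.invalid/tmp", received
print(ref.url_is_fresh(received + timedelta(minutes=2)))   # True
print(ref.url_is_fresh(received + timedelta(minutes=10)))  # False
```

A worker would check `needs_resolve(now)` before every download attempt instead of trusting whatever URL happens to be stored.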

The download pipeline: ID → URL → bytes → storage

1. Receive webhook, extract media ID

msg.image.id / msg.audio.id / msg.document.id: NOT a URL, just a reference

2. Return HTTP 200 immediately (do this first)

Acknowledge to Meta before any downloads. Push media processing to an async queue.

3. Resolve media ID → temporary download URL

GET graph.facebook.com/v21.0/{media_id}: requires Authorization: Bearer {token}, returns url (expires ~5 min) + mime_type + file_size

4. Download the file bytes within 5 minutes

GET {download_url}: ALSO requires Authorization: Bearer {token}, returns raw binary bytes

5. Store file to S3 / GCS / local disk

Organize by: media/{client_id}/{media_type}/{date}/{media_id}.{ext}

6. Process (type-specific)

Image: extract metadata, generate thumbnail. Audio/voice: convert OGG→MP3, transcribe. Document: extract text for search. Video: generate thumbnail frame.
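
The acknowledge-first step is the one that is easiest to get wrong, so here is a rough sketch of it in isolation. It assumes an in-process `queue.Queue` purely for illustration; `media_jobs`, `handle_webhook`, and `drain` are hypothetical names, and a production setup would more likely use a durable queue so jobs survive a restart.

```python
import queue

MEDIA_TYPES = {"image", "audio", "video", "document", "sticker"}

# Assumption: an in-memory queue stands in for a real job queue.
media_jobs: "queue.Queue[dict]" = queue.Queue()


def handle_webhook(messages: list) -> int:
    """Enqueue media messages and return the HTTP status to send back."""
    for msg in messages:
        if msg.get("type") in MEDIA_TYPES:
            media_jobs.put(msg)  # only the message (with its media ID) is queued
    return 200  # acknowledge before any resolve or download happens


def drain(process) -> int:
    """Run `process` on every queued job and return how many were handled."""
    handled = 0
    while True:
        try:
            msg = media_jobs.get_nowait()
        except queue.Empty:
            return handled
        try:
            process(msg)
        finally:
            media_jobs.task_done()
        handled += 1
```

Because the queue holds the media ID rather than a resolved URL, the 5-minute clock only starts when the worker calls the media endpoint, not when the webhook arrives.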

Payload schemas for every media type

Here is the exact webhook value object structure for each media type. The nesting from the top-level envelope (entry[0].changes[0].value.messages[0]) is already extracted when using SocialHook's normalized format.

JSON — all media type schemas
// note: media IDs are numeric strings; the "wamid." prefix belongs to message IDs

// image message
{ "type": "image", "image": { "id": "12345678901234", "mime_type": "image/jpeg", "sha256": "abc...", "caption": "optional" }}

// audio message (in-app voice note)
{ "type": "audio", "audio": { "id": "12345678901234", "mime_type": "audio/ogg; codecs=opus", "sha256": "def...", "voice": true }}

// audio message (audio file, not voice note)
{ "type": "audio", "audio": { "id": "12345678901234", "mime_type": "audio/mpeg", "sha256": "ghi..." }}
// note: voice field is absent or false for uploaded audio files

// video message
{ "type": "video", "video": { "id": "12345678901234", "mime_type": "video/mp4", "sha256": "jkl...", "caption": "optional" }}

// document message
{ "type": "document", "document": { "id": "12345678901234", "mime_type": "application/pdf", "sha256": "mno...", "filename": "invoice-2026.pdf", "caption": "optional" }}

// sticker message
{ "type": "sticker", "sticker": { "id": "12345678901234", "mime_type": "image/webp", "sha256": "pqr...", "animated": false }}
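
If you consume Meta's raw envelope rather than a normalized event, the schemas above sit under entry → changes → value → messages. The sketch below flattens that nesting into one dict per media message. `iter_media` is a hypothetical helper, and the `from` field is read from each message on the assumption that the payload carries the sender there.

```python
from typing import Iterator

MEDIA_TYPES = ("image", "audio", "video", "document", "sticker")


def iter_media(payload: dict) -> Iterator[dict]:
    """Yield one flat dict per media message found in a webhook payload."""
    for entry in payload.get("entry", []):
        for change in entry.get("changes", []):
            for msg in change.get("value", {}).get("messages", []):
                media_type = msg.get("type")
                if media_type not in MEDIA_TYPES:
                    continue  # text, location, reaction, ... are not media
                media = msg.get(media_type, {})
                yield {
                    "type": media_type,
                    "from": msg.get("from"),
                    "media_id": media.get("id"),
                    "mime_type": media.get("mime_type"),
                    "filename": media.get("filename"),  # documents only
                    "caption": media.get("caption"),
                    "voice": bool(media.get("voice", False)),
                }
```

Using `.get()` throughout means status-only webhooks (which carry no `messages` array) simply yield nothing instead of raising.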

Image and document handling

Images and documents share the same two-step download pattern. The key difference: documents include a filename field that you should preserve in your storage key — it's what the customer named the file and what you'll want to show in your UI.

Node.js
downloadMedia.js
const GRAPH = 'https://graph.facebook.com/v21.0';
const TOKEN = process.env.WA_TOKEN;

// Step 1: Resolve media ID → temporary download URL (~5 min expiry)
async function resolveMediaUrl(mediaId) {
  const res = await fetch(`${GRAPH}/${mediaId}`, {
    headers: { 'Authorization': `Bearer ${TOKEN}` },
  });
  if (!res.ok) throw new Error(`Media resolve failed: ${res.status}`);
  const { url, mime_type, file_size } = await res.json();
  return { url, mime_type, file_size };
}

// Step 2: Download the file bytes from the temporary URL
async function downloadMedia(downloadUrl) {
  const res = await fetch(downloadUrl, {
    headers: { 'Authorization': `Bearer ${TOKEN}` }, // required!
  });
  if (!res.ok) throw new Error(`Media download failed: ${res.status}`);
  return Buffer.from(await res.arrayBuffer());
}

// Complete handler: resolve → download → store
async function handleMediaMessage(msg, from) {
  const type     = msg.type; // 'image' | 'document' | 'video' | 'sticker'
  const media    = msg[type];
  const mediaId  = media.id;
  const filename = media.filename ?? mediaId; // documents have filename
  const caption  = media.caption ?? null;

  // Resolve then download immediately — URL expires in ~5 min
  const { url, mime_type } = await resolveMediaUrl(mediaId);
  const bytes            = await downloadMedia(url);

  // Store the file — see storage patterns section
  const storagePath = await storeFile(bytes, type, mediaId, mime_type, filename);

  return {
    mediaId,
    storagePath,
    mimeType: mime_type,
    filename,
    caption,
    from,
    type,
  };
}

Python
download_media.py
import os, requests

GRAPH = "https://graph.facebook.com/v21.0"
TOKEN = os.environ["WA_TOKEN"]
HEADERS = { "Authorization": f"Bearer {TOKEN}" }

def resolve_media_url(media_id: str) -> dict:
    """Step 1: Resolve media ID → temporary download URL."""
    res = requests.get(f"{GRAPH}/{media_id}", headers=HEADERS)
    res.raise_for_status()
    return res.json()  # { url, mime_type, file_size, id }

def download_media(download_url: str) -> bytes:
    """Step 2: Download file bytes — URL expires in ~5 minutes."""
    res = requests.get(download_url, headers=HEADERS)  # auth required here too!
    res.raise_for_status()
    return res.content

def handle_media_message(msg: dict, sender: str) -> dict:
    media_type = msg["type"]
    media      = msg[media_type]
    media_id   = media["id"]
    filename   = media.get("filename", media_id)
    caption    = media.get("caption")

    # Resolve then download immediately — URL valid ~5 min only
    media_info   = resolve_media_url(media_id)
    file_bytes   = download_media(media_info["url"])
    storage_path = store_file(file_bytes, media_type, media_id, media_info["mime_type"], filename)

    return {
        "media_id":     media_id,
        "storage_path": storage_path,
        "mime_type":    media_info["mime_type"],
        "filename":     filename,
        "caption":      caption,
        "sender":       sender,
        "type":         media_type,
    }
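
Even with immediate downloads, a URL can occasionally expire between the resolve and the fetch (slow queue, large file, retry after a crash). One way to handle that is to resolve again from the media ID and retry, sketched below. The resolve and download steps are passed in as callables so the logic stays independent of any HTTP library; `MediaUrlExpired` and `download_with_reresolve` are hypothetical names, and you would need to wrap your own download function so it raises `MediaUrlExpired` when the server rejects the URL.

```python
from typing import Callable, Optional


class MediaUrlExpired(Exception):
    """Raised by the download callable when the temporary URL is rejected."""


def download_with_reresolve(
    media_id: str,
    resolve: Callable[[str], dict],
    download: Callable[[str], bytes],
    max_attempts: int = 2,
) -> bytes:
    """Resolve a fresh URL before each attempt; retry only on expiry."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        info = resolve(media_id)  # fresh temporary URL every attempt
        try:
            return download(info["url"])
        except MediaUrlExpired as exc:
            last_error = exc  # URL went stale; loop resolves a new one
    raise RuntimeError(f"could not download media {media_id}") from last_error
```

Only expiry triggers a retry here. Other failures (network errors, a revoked token) propagate immediately so they are not masked by repeated resolve calls.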

Voice notes: OGG/Opus to MP3 conversion

Voice notes recorded in WhatsApp are encoded in OGG/Opus format. You can identify them by the voice: true flag in the audio payload and the mime_type: "audio/ogg; codecs=opus" value. Downstream support for this format is uneven: OpenAI's Whisper endpoint has not always accepted OGG uploads, and browser playback (Safari in particular) has historically been unreliable.

The solution: convert to MP3 using ffmpeg. ffmpeg is the de facto standard audio conversion tool, available on every major OS and installable in most cloud environments.

Shell
install ffmpeg
# Ubuntu / Debian
apt-get install -y ffmpeg

# macOS
brew install ffmpeg

# Docker — add to Dockerfile
RUN apt-get update && apt-get install -y ffmpeg && rm -rf /var/lib/apt/lists/*

# Node.js wrapper (optional; builds the ffmpeg command for you, but still needs the ffmpeg binary installed)
npm install fluent-ffmpeg

Node.js + fluent-ffmpeg
convertVoiceNote.js
const ffmpeg = require('fluent-ffmpeg');
const crypto = require('crypto');
const path   = require('path');
const os     = require('os');
const fs     = require('fs/promises');

async function convertOggToMp3(oggBuffer) {
  // Unique base name so concurrent conversions never share temp files
  const base      = path.join(os.tmpdir(), `voice-${crypto.randomUUID()}`);
  const inputPath = `${base}.ogg`;
  const outPath   = `${base}.mp3`;

  try {
    // Write OGG buffer to temp file
    await fs.writeFile(inputPath, oggBuffer);

    // Convert OGG/Opus → MP3
    await new Promise((resolve, reject) => {
      ffmpeg(inputPath)
        .audioCodec('libmp3lame')
        .audioBitrate('128k')     // good quality at reasonable size
        .audioFrequency(44100)    // 44.1 kHz, widely compatible for playback
        .on('end', resolve)
        .on('error', reject)
        .save(outPath);
    });

    // Read MP3 bytes
    return await fs.readFile(outPath);
  } finally {
    // Clean up temp files even when the conversion fails
    await Promise.all([
      fs.rm(inputPath, { force: true }),
      fs.rm(outPath,   { force: true }),
    ]);
  }
}

// Voice note handler: download → detect → convert → store
async function handleAudioMessage(msg, from) {
  const { id: mediaId, voice, mime_type } = msg.audio;
  const isVoiceNote = voice === true;

  const { url } = await resolveMediaUrl(mediaId);
  const rawBytes = await downloadMedia(url);

  let finalBytes = rawBytes;
  let finalMime  = mime_type;

  if (isVoiceNote || mime_type?.includes('ogg')) { // mime_type may be missing
    // Convert OGG/Opus → MP3 for compatibility + Whisper transcription
    finalBytes = await convertOggToMp3(rawBytes);
    finalMime  = 'audio/mpeg';
  }

  const storagePath = await storeFile(finalBytes, 'audio', mediaId, finalMime);

  return { mediaId, storagePath, isVoiceNote, mime_type: finalMime };
}

Python + subprocess
convert_voice_note.py
import subprocess, tempfile, os
from pathlib import Path

def convert_ogg_to_mp3(ogg_bytes: bytes) -> bytes:
    """Convert OGG/Opus voice note bytes → MP3 bytes via ffmpeg."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        input_path  = Path(tmp_dir) / "input.ogg"
        output_path = Path(tmp_dir) / "output.mp3"

        # Write OGG bytes to temp file
        input_path.write_bytes(ogg_bytes)

        # Run ffmpeg conversion
        result = subprocess.run([
            "ffmpeg", "-y",           # overwrite without asking
            "-i", str(input_path),
            "-codec:a", "libmp3lame", # MP3 encoder
            "-qscale:a", "2",          # ~190kbps quality
            "-ar", "44100",             # 44.1 kHz, widely compatible for playback
            str(output_path)
        ], capture_output=True)

        if result.returncode != 0:
            raise RuntimeError(f"ffmpeg failed: {result.stderr.decode()}")

        return output_path.read_bytes()

def handle_audio_message(msg: dict, sender: str) -> dict:
    audio      = msg["audio"]
    media_id   = audio["id"]
    is_voice   = audio.get("voice", False)
    mime_type  = audio.get("mime_type", "")

    media_info = resolve_media_url(media_id)
    raw_bytes  = download_media(media_info["url"])

    if is_voice or "ogg" in mime_type:
        final_bytes = convert_ogg_to_mp3(raw_bytes)
        final_mime  = "audio/mpeg"
    else:
        final_bytes = raw_bytes
        final_mime  = mime_type

    storage_path = store_file(final_bytes, "audio", media_id, final_mime)
    return { "media_id": media_id, "path": storage_path, "is_voice": is_voice }
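
The handlers above decide whether to convert based on the `voice` flag and the MIME type. If you would rather not trust metadata alone, you can also sniff the downloaded bytes: Ogg pages start with the capture pattern `OggS`, and an Opus stream's first packet starts with `OpusHead`. The helper names below are hypothetical, and the 64-byte window is an assumption that comfortably covers the first page header.

```python
def looks_like_ogg_opus(data: bytes) -> bool:
    """True if the bytes start with an Ogg page carrying an Opus header."""
    # "OggS" opens every Ogg page; "OpusHead" opens the Opus identification
    # packet, which sits right after the first page header.
    return data[:4] == b"OggS" and b"OpusHead" in data[:64]


def needs_conversion(mime_type: str, is_voice: bool, data: bytes) -> bool:
    """Combine the payload metadata with a check of the actual bytes."""
    return is_voice or "ogg" in mime_type or looks_like_ogg_opus(data)
```

This catches the case where an OGG/Opus file arrives labelled with a generic or incorrect MIME type, at the cost of a few byte comparisons.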

Transcribing voice notes with OpenAI Whisper

Once you have the MP3, send it to OpenAI's Whisper API. Whisper supports 57 languages, handles background noise well, and costs approximately $0.006 per minute of audio — a 30-second voice note costs less than a cent to transcribe. The transcript then becomes queryable, AI-processable text.
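
To make the pricing concrete, the arithmetic can be written out as a small helper. This assumes the approximate $0.006 per minute rate quoted above and simple pro-rata billing by duration; `transcription_cost_usd` is a hypothetical name, and actual billing may round differently.

```python
PRICE_PER_MINUTE_USD = 0.006  # approximate rate quoted in this guide


def transcription_cost_usd(duration_seconds: float) -> float:
    """Estimated cost of transcribing one clip, billed pro rata by duration."""
    return duration_seconds / 60 * PRICE_PER_MINUTE_USD
```

For a 30-second voice note this gives 30 / 60 × 0.006 = $0.003, so 10,000 such notes a month would come to roughly $30.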

Node.js + OpenAI SDK
transcribeVoiceNote.js
const OpenAI = require('openai');

const oai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function transcribeAudio(mp3Buffer, language = 'en') {
  // Whisper expects a File-like object, so wrap the buffer
  // (the global File class requires Node.js 20+)
  const file = new File([mp3Buffer], 'voice_note.mp3', { type: 'audio/mpeg' });

  const response = await oai.audio.transcriptions.create({
    model:    'whisper-1',
    file,
    language, // optional — auto-detect if omitted
    response_format: 'verbose_json', // includes confidence + segments
  });

  return {
    text:     response.text,           // full transcript
    language: response.language,       // detected language
    duration: response.duration,       // audio duration in seconds
    segments: response.segments,       // segment-level timing (verbose_json only)
  };
}

// Full voice note pipeline: download → convert → transcribe → store
async function processVoiceNote(msg, from) {
  const { mediaId, storagePath, isVoiceNote } = await handleAudioMessage(msg, from);

  if (!isVoiceNote) return { mediaId, storagePath, transcript: null };

  // Read the stored MP3 for transcription
  const mp3Buffer  = await readFromStorage(storagePath);
  const transcript = await transcribeAudio(mp3Buffer);

  // Store the transcript alongside the audio
  await saveTranscript(mediaId, from, transcript);

  console.log(`[${from}] Voice note: "${transcript.text}"`);

  return { mediaId, storagePath, transcript };
}

Python + OpenAI SDK
transcribe_voice_note.py
import io, os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def transcribe_audio(mp3_bytes: bytes, language: str = "en") -> dict:
    """Transcribe MP3 audio bytes using OpenAI Whisper."""
    audio_file = io.BytesIO(mp3_bytes)
    audio_file.name = "voice_note.mp3"  # SDK reads .name for MIME detection

    response = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language=language,           # omit for auto-detect
        response_format="verbose_json"  # includes segments + confidence
    )
    return {
        "text":     response.text,
        "language": response.language,
        "duration": response.duration,
    }

def process_voice_note(msg: dict, sender: str) -> dict:
    result = handle_audio_message(msg, sender)
    if not result["is_voice"]:
        return {**result, "transcript": None}

    mp3_bytes  = read_from_storage(result["path"])
    transcript = transcribe_audio(mp3_bytes)
    save_transcript(result["media_id"], sender, transcript)
    return {**result, "transcript": transcript}

What to do with the transcript: Store it alongside the audio file. Pass it to your AI agent as the user's message (instead of "audio message received"). Index it in your search database. Log it in your CRM as the conversation text. Customers send voice notes because typing is slower — treating their voice notes as searchable text dramatically improves your AI agent's ability to understand and respond correctly.

Storage patterns: organize your media files

How you organize files in S3 or GCS matters for performance, billing, and debugging. Here is the storage key pattern that scales cleanly across multiple clients and message types:

Node.js + AWS S3
storeFile.js
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');
const mime2ext = require('mime-types');

const s3     = new S3Client({ region: process.env.AWS_REGION });
const BUCKET = process.env.S3_BUCKET;

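// mime-types can return uncommon extensions (e.g. audio/mpeg → mpga), so pin the common ones
const EXT_OVERRIDES = { 'audio/mpeg': 'mp3', 'audio/ogg': 'ogg' };

async function storeFile(bytes, type, mediaId, mimeType, filename) {
  const baseType = mimeType.split(';')[0].trim(); // drop parameters such as "; codecs=opus"
  const ext  = EXT_OVERRIDES[baseType] || mime2ext.extension(mimeType) || 'bin';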
  const date = new Date().toISOString().slice(0, 10); // YYYY-MM-DD

  // Organized key: media/{type}/{date}/{mediaId}.{ext}
  // For documents, preserve original filename in metadata
  const key = `media/${type}/${date}/${mediaId}.${ext}`;

  await s3.send(new PutObjectCommand({
    Bucket:      BUCKET,
    Key:         key,
    Body:        bytes,
    ContentType: mimeType,
    Metadata: {
      // S3 user metadata must be ASCII, so URL-encode customer filenames
      'original-filename': encodeURIComponent(filename || mediaId),
      'media-id':          mediaId,
    },
  }));

  return `s3://${BUCKET}/${key}`; // store this path in your database
}

// S3 key examples (media IDs are numeric strings):
// media/image/2026-05-13/12345678901234.jpeg
// media/audio/2026-05-13/12345678901234.mp3  ← converted from OGG
// media/document/2026-05-13/12345678901234.pdf
// media/video/2026-05-13/12345678901234.mp4
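
The Python handlers earlier in this guide call `store_file`, which is not defined there. Below is one possible shape for it, sketched against local disk so it runs without cloud credentials. It follows the same `media/{type}/{date}/{media_id}.{ext}` key layout as the Node.js version. The `root` parameter, the `EXTENSIONS` table, and `storage_key` are assumptions for illustration. Swapping the write for an S3 or GCS upload would keep the same key.

```python
import mimetypes
from datetime import date
from pathlib import Path
from typing import Optional

# Pin extensions for the types this guide produces; others fall back to mimetypes.
EXTENSIONS = {"audio/mpeg": "mp3", "audio/ogg": "ogg", "image/jpeg": "jpeg"}


def storage_key(media_type: str, media_id: str, mime_type: str,
                day: Optional[date] = None) -> str:
    """Build media/{type}/{YYYY-MM-DD}/{media_id}.{ext}."""
    base = mime_type.split(";")[0].strip().lower()  # drop "; codecs=opus"
    ext = EXTENSIONS.get(base)
    if ext is None:
        guessed = mimetypes.guess_extension(base)
        ext = guessed.lstrip(".") if guessed else "bin"
    day = day or date.today()
    return f"media/{media_type}/{day.isoformat()}/{media_id}.{ext}"


def store_file(file_bytes: bytes, media_type: str, media_id: str, mime_type: str,
               filename: Optional[str] = None, root: Path = Path(".")) -> str:
    """Write the bytes under `root` and return the path to save in your database.

    `filename` is accepted to match the call sites above but is not persisted
    here; a real implementation might store it as object metadata.
    """
    path = root / storage_key(media_type, media_id, mime_type)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(file_bytes)
    return str(path)
```

Keeping the key builder separate from the write makes the layout easy to unit-test and to reuse when you later move from disk to object storage.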

Complete media webhook handler (all types)

One function that receives a normalized media event and routes to the correct handler by type:

Node.js — universal media dispatcher
mediaDispatcher.js
async function dispatchMedia(event) {
  const { from, message } = event;
  const { type } = message;

  // Text messages — not media, handle separately
  if (type === 'text') return handleText(from, message.text.body);

  // Non-media message types
  if (['location', 'contacts', 'reaction', 'interactive'].includes(type)) {
    return handleNonMedia(from, message);
  }

  // Media types — all require download
  switch (type) {

    case 'image':
    case 'video':
    case 'sticker':
    case 'document': {
      const result = await handleMediaMessage(message, from);
      await saveMediaRecord(from, result);
      break;
    }

    case 'audio': {
      // Voice notes get transcribed; regular audio files just stored
      const result = await processVoiceNote(message, from);
      await saveMediaRecord(from, result);

      // If voice note was transcribed, treat transcript as text input
      if (result.transcript?.text) {
        console.log(`Voice note from ${from}: "${result.transcript.text}"`);
        await handleText(from, result.transcript.text, { isVoiceNote: true });
      }
      break;
    }

    default:
      console.warn(`Unhandled media type: ${type}`);
  }
}
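
For Python services, the same routing can be sketched with a handler table instead of a switch. This is only an outline under stated assumptions: `make_dispatcher` is a hypothetical name, handlers are injected as callables taking `(message, sender)`, and the event shape (`from`, `message.type`) follows the normalized format used by the Node.js dispatcher.

```python
from typing import Callable, Dict, Optional

MediaHandler = Callable[[dict, str], dict]
TextHandler = Callable[[str, str], None]


def make_dispatcher(media_handlers: Dict[str, MediaHandler], on_text: TextHandler):
    """Build a dispatch(event) function from a type → handler table."""

    def dispatch(event: dict) -> Optional[dict]:
        sender = event["from"]
        message = event["message"]
        msg_type = message["type"]

        if msg_type == "text":
            on_text(sender, message["text"]["body"])
            return None

        handler = media_handlers.get(msg_type)
        if handler is None:
            return None  # location, contacts, reaction, ... handled elsewhere

        result = handler(message, sender)

        # A transcribed voice note is fed back in as ordinary text input
        transcript = result.get("transcript")
        if transcript and transcript.get("text"):
            on_text(sender, transcript["text"])
        return result

    return dispatch
```

You would typically register `handle_media_message` for image, video, sticker, and document, and `process_voice_note` for audio. Injecting the handlers also makes the routing testable without network access.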

SocialHook: pre-extracted media IDs in normalized format

When you use SocialHook, the media payload arrives already extracted from Meta's nested Cloud API envelope. Instead of navigating entry[0].changes[0].value.messages[0], you receive a flat event:

SocialHook normalized media event
// Image message
{
  "platform": "whatsapp",
  "event":    "message.received",
  "from":     "+1 555 000 1234",
  "message": {
    "type": "image",
    "id":   "wamid.HBgL...",
    "image": {
      "id":        "12345678901234",  // ← use this to resolve URL
      "mime_type": "image/jpeg",
      "caption":   "Here's the damage to my package"
    }
  },
  "signature_verified": true
}

// Voice note
{
  "message": {
    "type": "audio",
    "audio": {
      "id":        "98765432109876",
      "mime_type": "audio/ogg; codecs=opus",
      "voice":     true  // ← this tells you to run OGG→MP3 conversion
    }
  }
}

Your dispatchMedia(event) function above receives this format directly — no additional parsing needed. The media ID is at event.message.image.id (or .audio.id, .document.id, etc.), ready to feed into resolveMediaUrl().
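
As a small illustration of how little parsing the normalized format needs, the lookup can be written generically so it works for any media type. `media_ref_from_event` is a hypothetical helper; the field names are the ones shown in the example events above.

```python
def media_ref_from_event(event: dict) -> dict:
    """Pull the media ID and flags out of a normalized media event.

    Raises KeyError for events without a media object (e.g. text messages).
    """
    message = event["message"]
    media_type = message["type"]
    media = message[media_type]  # the nested object is keyed by the type name
    return {
        "type": media_type,
        "media_id": media["id"],
        "mime_type": media.get("mime_type"),
        "is_voice": media_type == "audio" and media.get("voice") is True,
    }
```

The returned `media_id` is what you pass to the resolve step, and `is_voice` tells you whether to run the OGG→MP3 conversion.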

FAQ

Common questions

How do I download media from a WhatsApp Cloud API webhook?

Two steps: (1) Call GET graph.facebook.com/v21.0/{media_id} with your access token to get a temporary download URL. (2) Fetch that URL (also with your access token in the Authorization header) to get the file bytes. Download immediately — the URL expires in approximately 5 minutes. Store the file bytes to S3/GCS/disk, not the temporary URL.

Why does my WhatsApp media download URL return 403?

Two causes: (1) URL expired — WhatsApp media download URLs are valid for approximately 5 minutes. If you stored the URL and fetched it later, it expired. Re-resolve from the media ID. (2) Missing Authorization header — the download URL itself also requires Authorization: Bearer {ACCESS_TOKEN} in the request. WhatsApp media URLs are not public CDN URLs.

What format are WhatsApp voice notes in and how do I convert them?

WhatsApp voice notes are OGG/Opus format (identified by voice: true in the payload and mime_type: "audio/ogg; codecs=opus"). Convert to MP3 with ffmpeg: ffmpeg -i input.ogg -codec:a libmp3lame -qscale:a 2 output.mp3. OGG/Opus support in OpenAI Whisper and in browsers (Safari in particular) has been inconsistent, so converting before transcription or playback is the safe default. Full Node.js and Python code is in the voice notes section above.

How do I transcribe WhatsApp voice notes with Whisper?

Download the OGG → convert to MP3 with ffmpeg → send to OpenAI Audio Transcriptions API with model: "whisper-1". Whisper returns the transcript text. Cost is ~$0.006 per minute of audio. A 30-second voice note costs less than a third of a cent. The transcript can then be treated as a regular text message by your AI agent or CRM.

How long are WhatsApp media IDs valid?

Media IDs are valid for 30 days. The temporary download URL you get from resolving the ID is valid for approximately 5 minutes. This means you can safely store just the media ID in your database and resolve it fresh when you need the file (within 30 days). After 30 days, the media is deleted from Meta's servers and the ID is invalid.

What are the file size limits for WhatsApp media?

By type: Image (JPEG, PNG): 5 MB. Audio: 16 MB. Video: 16 MB. Document: 100 MB. Sticker: 100 KB static, 500 KB animated. Messages with media exceeding these limits are rejected before your webhook fires, so the sender sees a failure and your handler never receives the event.

Get the media IDs

Your pipeline starts with a clean webhook event.

SocialHook delivers every WhatsApp media event — image, voice note, document — as pre-extracted JSON to your handler. You get the media ID ready to resolve, the MIME type, and the voice flag. No Cloud API parsing. Just pipe it straight to your download function.

