WhatsApp media webhook pipeline — media ID to download URL to converted file, with voice note OGG to MP3 conversion and Whisper transcription
In this guide: Media type reference table · The 5-minute URL expiry · Full download pipeline · All media payload schemas · Image & document handling · Voice note OGG→MP3 conversion · Whisper transcription · S3/GCS storage patterns · Complete handler with all types · SocialHook normalized format

Media types: complete reference table

The Cloud API delivers media in five message types (image, audio, video, document, sticker), each with its own formats, size limits, and handling requirements. Know the constraints before you build the handler.

| type value | Formats accepted | Size limit | Special notes |
|---|---|---|---|
| image | JPEG, PNG | 5 MB | Optional caption. GIF is not supported inline; send it as a document. |
| audio | AAC, AMR, MP3, OGG/Opus | 16 MB | voice: true if recorded in-app (OGG/Opus). Regular audio files may vary. Always convert OGG before transcription. |
| video | MP4, 3GPP | 16 MB | Optional caption. H.264 video + AAC audio recommended for broadest compatibility. |
| document | Any MIME type Meta accepts | 100 MB | Includes filename, which you should store. PDF, DOCX, XLSX, images as documents, etc. |
| sticker | WebP | Static 100 KB / Animated 500 KB | animated: true if animated. Current browsers render WebP natively. |
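
The limits in the table can be checked against the file_size value returned by the resolve call before you spend bandwidth on a download. The following is a minimal sketch, not an official helper: the names SIZE_LIMITS, size_limit_for, and within_limit are hypothetical, 1 MB is assumed to mean 1024 × 1024 bytes, and the sticker values follow Meta's published limits (100 KB static, 500 KB animated).

```python
from typing import Optional

KB = 1024
MB = 1024 * KB  # assumption: binary megabytes

# Size limits in bytes, keyed by the webhook "type" value.
SIZE_LIMITS = {
    "image": 5 * MB,
    "audio": 16 * MB,
    "video": 16 * MB,
    "document": 100 * MB,
}
STICKER_LIMITS = {"static": 100 * KB, "animated": 500 * KB}


def size_limit_for(media_type: str, animated: bool = False) -> Optional[int]:
    """Return the byte limit for a media type, or None if the type is unknown."""
    if media_type == "sticker":
        return STICKER_LIMITS["animated" if animated else "static"]
    return SIZE_LIMITS.get(media_type)


def within_limit(media_type: str, file_size: int, animated: bool = False) -> bool:
    """True when file_size (bytes, as reported by the resolve call) fits the limit."""
    limit = size_limit_for(media_type, animated)
    return limit is not None and file_size <= limit
```

A handler could call within_limit(msg["type"], media_info["file_size"]) right after resolving the media ID and skip or flag anything unexpected.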

The 5-minute URL expiry — the detail that breaks most implementations

This is the most commonly missed detail in WhatsApp media handling, and it causes the most production bugs. When you call the Cloud API media endpoint to resolve a media ID, you receive a temporary download URL. That URL is valid for approximately 5 minutes.

The two patterns developers use — and why one breaks:

  • ❌ Store the URL, download later — you receive the webhook, call the media endpoint, store the temporary URL in your database for async processing. By the time your worker picks it up, the URL has expired. You get a 403. This pattern fails.
  • ✓ Download immediately, store the file — you receive the webhook, immediately resolve the media ID to a URL, immediately download the bytes, store the file on your own storage (S3, GCS, disk), save only the storage path in your database. Async processing works on the stored file. This pattern is correct.
Alternative: store the media ID, resolve on demand. If you don't need the file immediately, you can store just the media ID and resolve it fresh when you need the file. Media IDs are valid for 30 days — much longer than the temporary download URL. This is useful when you're not sure if you'll ever need the binary (e.g. stickers you might ignore). Just know the resolve call adds latency when you finally need it.
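
The "store the media ID, resolve on demand" alternative can be sketched as a small wrapper that treats the media ID as the durable reference and never hands out a URL older than the expiry window. This is an illustration under assumptions: OnDemandResolver and the one-minute safety margin are hypothetical, and resolve stands in for whatever function calls the Cloud API media endpoint in your code.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict

URL_TTL_SECONDS = 5 * 60      # approximate lifetime of a resolved URL
SAFETY_MARGIN_SECONDS = 60    # assumption: re-resolve a minute early


@dataclass
class ResolvedUrl:
    url: str
    resolved_at: float  # clock value when the URL was obtained

    def is_fresh(self, now: float) -> bool:
        return now - self.resolved_at < URL_TTL_SECONDS - SAFETY_MARGIN_SECONDS


class OnDemandResolver:
    """Keeps media IDs as the durable reference and re-resolves stale URLs."""

    def __init__(self, resolve: Callable[[str], str],
                 clock: Callable[[], float] = time.time) -> None:
        self._resolve = resolve
        self._clock = clock
        self._cache: Dict[str, ResolvedUrl] = {}

    def url_for(self, media_id: str) -> str:
        now = self._clock()
        cached = self._cache.get(media_id)
        if cached is None or not cached.is_fresh(now):
            # Missing or too old: ask the API again instead of trusting the URL.
            cached = ResolvedUrl(url=self._resolve(media_id), resolved_at=now)
            self._cache[media_id] = cached
        return cached.url
```

The cache lives in memory only, which is deliberate: the URL is never written to the database, so a worker that starts late always resolves fresh.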

The download pipeline: ID → URL → bytes → storage

1. Receive webhook and extract the media ID
   msg.image.id / msg.audio.id / msg.document.id: NOT a URL, just a reference
2. Return HTTP 200 immediately (do this first)
   Acknowledge to Meta before any downloads. Push media processing to an async queue.
3. Resolve media ID → temporary download URL
   GET graph.facebook.com/v21.0/{media_id}, which requires Authorization: Bearer {token} and returns url (expires ~5min) + mime_type + file_size
4. Download the file bytes within 5 minutes
   GET {download_url}, which ALSO requires Authorization: Bearer {token} and returns raw binary bytes
5. Store file to S3 / GCS / local disk
   Organize by: media/{client_id}/{media_type}/{date}/{media_id}.{ext}
6. Process (type-specific)
   Image: extract metadata, generate thumbnail. Audio/voice: convert OGG→MP3, transcribe. Document: extract text for search. Video: generate thumbnail frame.
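
Steps 1 and 2 of the pipeline (extract the media ID, acknowledge first, defer the rest) could look roughly like the sketch below. It uses only the standard library so it stays framework-neutral; in production the in-process queue would likely be replaced by a real job queue. The function names handle_webhook, extract_media_jobs, and start_worker are hypothetical, and process is whatever function performs steps 3 to 6 in your code. Only media IDs are queued, never URLs, so a delayed worker does not hit the expiry.

```python
import queue
import threading
from typing import Any, Callable, Dict, List

MEDIA_TYPES = {"image", "audio", "video", "document", "sticker"}


def extract_media_jobs(payload: Dict[str, Any]) -> List[Dict[str, str]]:
    """Step 1: walk the Cloud API envelope and collect media references."""
    jobs = []
    for entry in payload.get("entry", []):
        for change in entry.get("changes", []):
            for msg in change.get("value", {}).get("messages", []):
                media_type = msg.get("type")
                if media_type in MEDIA_TYPES and media_type in msg:
                    jobs.append({
                        "from": msg.get("from", ""),
                        "type": media_type,
                        "media_id": msg[media_type]["id"],
                    })
    return jobs


def handle_webhook(payload: Dict[str, Any], jobs: "queue.Queue") -> int:
    """Steps 1-2: enqueue references only, then acknowledge with 200."""
    for job in extract_media_jobs(payload):
        jobs.put(job)
    return 200


def start_worker(jobs: "queue.Queue",
                 process: Callable[[Dict[str, str]], Any]) -> threading.Thread:
    """Steps 3-6 run here, off the request path."""
    def loop() -> None:
        while True:
            job = jobs.get()
            try:
                if job is None:  # sentinel: stop the worker
                    return
                process(job)
            except Exception as exc:  # keep the worker alive on a bad job
                print(f"media job failed: {exc}")
            finally:
                jobs.task_done()

    worker = threading.Thread(target=loop, daemon=True)
    worker.start()
    return worker
```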

Payload schemas for every media type

Here is the exact webhook value object structure for each media type. The nesting from the top-level envelope (entry[0].changes[0].value.messages[0]) is already extracted when using SocialHook's normalized format.
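
As an illustration of those shapes, the sketch below lists one sample message object per media type and a helper that pulls out the fields the rest of this guide relies on. The IDs, hashes, and phone numbers are placeholders rather than real data, the field lists are a best-effort summary of the Cloud API message object and may not be exhaustive, and media_summary is a hypothetical helper name.

```python
from typing import Any, Dict, Optional

# Illustrative message objects (placeholder values only).
SAMPLE_MESSAGES: Dict[str, Dict[str, Any]] = {
    "image": {
        "from": "15550001111", "id": "wamid.EXAMPLE1",
        "timestamp": "1700000000", "type": "image",
        "image": {"id": "1111", "mime_type": "image/jpeg",
                  "sha256": "<hash>", "caption": "Receipt"},
    },
    "audio": {
        "from": "15550001111", "id": "wamid.EXAMPLE2",
        "timestamp": "1700000001", "type": "audio",
        "audio": {"id": "2222", "mime_type": "audio/ogg; codecs=opus",
                  "sha256": "<hash>", "voice": True},
    },
    "video": {
        "from": "15550001111", "id": "wamid.EXAMPLE3",
        "timestamp": "1700000002", "type": "video",
        "video": {"id": "3333", "mime_type": "video/mp4", "sha256": "<hash>"},
    },
    "document": {
        "from": "15550001111", "id": "wamid.EXAMPLE4",
        "timestamp": "1700000003", "type": "document",
        "document": {"id": "4444", "mime_type": "application/pdf",
                     "sha256": "<hash>", "filename": "invoice.pdf"},
    },
    "sticker": {
        "from": "15550001111", "id": "wamid.EXAMPLE5",
        "timestamp": "1700000004", "type": "sticker",
        "sticker": {"id": "5555", "mime_type": "image/webp",
                    "sha256": "<hash>", "animated": False},
    },
}


def media_summary(msg: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the media fields of a message, or None for non-media messages."""
    media_type = msg.get("type")
    media = msg.get(media_type) if media_type else None
    if not isinstance(media, dict) or "id" not in media:
        return None
    return {
        "type": media_type,
        "media_id": media["id"],
        "mime_type": media.get("mime_type"),
        "caption": media.get("caption"),
        "filename": media.get("filename"),
        "voice": media.get("voice", False),
        "animated": media.get("animated", False),
    }
```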

Image and document handling

Images and documents share the same two-step download pattern. The key difference: documents include a filename field that you should preserve in your storage key — it's what the customer named the file and what you'll want to show in your UI.

Node.js
downloadMedia.js
const GRAPH = 'https://graph.facebook.com/v21.0';
const TOKEN = process.env.WA_TOKEN;

// Step 1: Resolve media ID → temporary download URL (~5 min expiry)
async function resolveMediaUrl(mediaId) {
  const res = await fetch(`${GRAPH}/${mediaId}`, {
    headers: { 'Authorization': `Bearer ${TOKEN}` },
  });
  if (!res.ok) throw new Error(`Media resolve failed: ${res.status}`);
  const { url, mime_type, file_size } = await res.json();
  return { url, mime_type, file_size };
}

// Step 2: Download the file bytes from the temporary URL
async function downloadMedia(downloadUrl) {
  const res = await fetch(downloadUrl, {
    headers: { 'Authorization': `Bearer ${TOKEN}` }, // required!
  });
  if (!res.ok) throw new Error(`Media download failed: ${res.status}`);
  return Buffer.from(await res.arrayBuffer());
}

// Complete handler: resolve → download → store
async function handleMediaMessage(msg, from) {
  const type = msg.type; // 'image' | 'document' | 'video' | 'sticker'
  const media = msg[type];
  const mediaId = media.id;
  const filename = media.filename ?? mediaId; // documents have filename
  const caption = media.caption ?? null;

  // Resolve then download immediately: URL expires in ~5 min
  const { url, mime_type } = await resolveMediaUrl(mediaId);
  const bytes = await downloadMedia(url);

  // Store the file (see storage patterns section)
  const storagePath = await storeFile(bytes, type, mediaId, mime_type, filename);

  return {
    mediaId,
    storagePath,
    mimeType: mime_type,
    filename,
    caption,
    from,
    type,
  };
}
Python
download_media.py
import os
import requests

GRAPH = "https://graph.facebook.com/v21.0"
TOKEN = os.environ["WA_TOKEN"]
HEADERS = {"Authorization": f"Bearer {TOKEN}"}
TIMEOUT = 30  # seconds; without a timeout a stalled request blocks the worker


def resolve_media_url(media_id: str) -> dict:
    """Step 1: Resolve media ID → temporary download URL."""
    res = requests.get(f"{GRAPH}/{media_id}", headers=HEADERS, timeout=TIMEOUT)
    res.raise_for_status()
    return res.json()  # { url, mime_type, file_size, id }


def download_media(download_url: str) -> bytes:
    """Step 2: Download file bytes. The URL expires in ~5 minutes."""
    res = requests.get(download_url, headers=HEADERS, timeout=TIMEOUT)  # auth required here too!
    res.raise_for_status()
    return res.content


def handle_media_message(msg: dict, sender: str) -> dict:
    media_type = msg["type"]
    media = msg[media_type]
    media_id = media["id"]
    filename = media.get("filename", media_id)
    caption = media.get("caption")

    # Resolve then download immediately: URL valid ~5 min only
    media_info = resolve_media_url(media_id)
    file_bytes = download_media(media_info["url"])
    storage_path = store_file(file_bytes, media_type, media_id,
                              media_info["mime_type"], filename)

    return {
        "media_id": media_id,
        "storage_path": storage_path,
        "mime_type": media_info["mime_type"],
        "filename": filename,
        "caption": caption,
        "sender": sender,
        "type": media_type,
    }

Voice notes: OGG/Opus to MP3 conversion

Voice notes recorded in WhatsApp are encoded in OGG/Opus format. You can identify them by the voice: true flag in the audio payload and the mime_type: "audio/ogg; codecs=opus" value. Not every transcription service or browser handles this format reliably (Safari has historically lacked support), so this guide normalizes voice notes to MP3 before transcription or playback.
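
The detection rule described above (the voice flag, or an OGG MIME type with parameters) can be made a little stricter than a substring match. A minimal sketch, assuming the audio object carries the fields named in this section; parse_mime and needs_conversion are hypothetical helper names.

```python
from typing import Any, Dict, Tuple


def parse_mime(mime_type: str) -> Tuple[str, Dict[str, str]]:
    """Split 'audio/ogg; codecs=opus' into ('audio/ogg', {'codecs': 'opus'})."""
    base, *params = [part.strip() for part in mime_type.split(";")]
    parsed: Dict[str, str] = {}
    for param in params:
        if "=" in param:
            key, value = param.split("=", 1)
            parsed[key.strip().lower()] = value.strip().strip('"').lower()
    return base.lower(), parsed


def needs_conversion(audio: Dict[str, Any]) -> bool:
    """True for in-app voice notes and any other audio in an OGG container."""
    if audio.get("voice") is True:
        return True
    base, _params = parse_mime(audio.get("mime_type", ""))
    return base == "audio/ogg"
```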

The solution: convert to MP3 using ffmpeg. ffmpeg is the universal audio conversion tool, available on every major OS and all cloud environments.

Shell
install ffmpeg
# Ubuntu / Debian
apt-get install -y ffmpeg

# macOS
brew install ffmpeg

# Docker: add this line to your Dockerfile
RUN apt-get update && apt-get install -y ffmpeg && rm -rf /var/lib/apt/lists/*

# Node.js wrapper (optional, avoids hand-written shell exec)
npm install fluent-ffmpeg
Node.js + fluent-ffmpeg
convertVoiceNote.js
const ffmpeg = require('fluent-ffmpeg');
const crypto = require('crypto');
const path = require('path');
const os = require('os');
const fs = require('fs/promises');

async function convertOggToMp3(oggBuffer) {
  // Unique base name so concurrent conversions never share temp files
  const base = path.join(os.tmpdir(), `voice-${crypto.randomUUID()}`);
  const inputPath = `${base}.ogg`;
  const outPath = `${base}.mp3`;

  try {
    // Write OGG buffer to temp file
    await fs.writeFile(inputPath, oggBuffer);

    // Convert OGG/Opus → MP3
    await new Promise((resolve, reject) => {
      ffmpeg(inputPath)
        .audioCodec('libmp3lame')
        .audioBitrate('128k')   // good quality at reasonable size
        .audioFrequency(44100)  // common MP3 sample rate; Whisper resamples internally
        .on('end', resolve)     // attach listeners before starting the run
        .on('error', reject)
        .save(outPath);
    });

    // Read MP3 bytes
    return await fs.readFile(outPath);
  } finally {
    // Clean up temp files, even when the conversion fails
    await Promise.all([
      fs.rm(inputPath, { force: true }),
      fs.rm(outPath, { force: true }),
    ]);
  }
}

// Voice note handler: download → detect → convert → store
async function handleAudioMessage(msg, from) {
  const { id: mediaId, voice, mime_type } = msg.audio;
  const isVoiceNote = voice === true;

  const { url } = await resolveMediaUrl(mediaId);
  const rawBytes = await downloadMedia(url);

  let finalBytes = rawBytes;
  let finalMime = mime_type;

  // mime_type may be absent, so guard the lookup
  if (isVoiceNote || mime_type?.includes('ogg')) {
    // Convert OGG/Opus → MP3 for compatibility + Whisper transcription
    finalBytes = await convertOggToMp3(rawBytes);
    finalMime = 'audio/mpeg';
  }

  const storagePath = await storeFile(finalBytes, 'audio', mediaId, finalMime);
  return { mediaId, storagePath, isVoiceNote, mime_type: finalMime };
}
Python + subprocess
convert_voice_note.py
import subprocess
import tempfile
from pathlib import Path


def convert_ogg_to_mp3(ogg_bytes: bytes) -> bytes:
    """Convert OGG/Opus voice note bytes → MP3 bytes via ffmpeg."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        input_path = Path(tmp_dir) / "input.ogg"
        output_path = Path(tmp_dir) / "output.mp3"

        # Write OGG bytes to temp file
        input_path.write_bytes(ogg_bytes)

        # Run ffmpeg conversion
        result = subprocess.run([
            "ffmpeg", "-y",               # overwrite without asking
            "-i", str(input_path),
            "-codec:a", "libmp3lame",     # MP3 encoder
            "-qscale:a", "2",             # VBR, ~190kbps quality
            "-ar", "44100",               # common MP3 sample rate; Whisper resamples internally
            str(output_path),
        ], capture_output=True)

        if result.returncode != 0:
            raise RuntimeError(f"ffmpeg failed: {result.stderr.decode()}")

        return output_path.read_bytes()


def handle_audio_message(msg: dict, sender: str) -> dict:
    audio = msg["audio"]
    media_id = audio["id"]
    is_voice = audio.get("voice", False)
    mime_type = audio.get("mime_type", "")

    media_info = resolve_media_url(media_id)
    raw_bytes = download_media(media_info["url"])

    if is_voice or "ogg" in mime_type:
        final_bytes = convert_ogg_to_mp3(raw_bytes)
        final_mime = "audio/mpeg"
    else:
        final_bytes = raw_bytes
        final_mime = mime_type

    storage_path = store_file(final_bytes, "audio", media_id, final_mime)
    return {"media_id": media_id, "path": storage_path, "is_voice": is_voice}

Transcribing voice notes with OpenAI Whisper

Once you have the MP3, send it to OpenAI's Whisper API. Whisper supports 57 languages, handles background noise well, and costs approximately $0.006 per minute of audio — a 30-second voice note costs less than a cent to transcribe. The transcript then becomes queryable, AI-processable text.
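
To make the cost arithmetic concrete, here is a small worked sketch using the per-minute figure quoted above. The rate is copied from this guide and may have changed since, so treat the constant as an assumption; the function names are hypothetical.

```python
PRICE_PER_MINUTE_USD = 0.006  # figure quoted in this guide; verify current pricing


def transcription_cost_usd(duration_seconds: float) -> float:
    """Cost of transcribing one clip of the given length."""
    return duration_seconds / 60 * PRICE_PER_MINUTE_USD


def monthly_cost_usd(notes_per_day: int, avg_seconds: float, days: int = 30) -> float:
    """Rough monthly spend for a steady stream of voice notes."""
    return notes_per_day * days * transcription_cost_usd(avg_seconds)


if __name__ == "__main__":
    print(f"30-second note: ${transcription_cost_usd(30):.4f}")
    print(f"1,000 notes/day at 30 s: ${monthly_cost_usd(1000, 30):.2f} per month")
```

At that rate a single 30-second note works out to $0.003, and 1,000 such notes per day to about $90 per month.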

Node.js + OpenAI SDK
transcribeVoiceNote.js
const OpenAI = require('openai');

const oai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// language is optional: pass an ISO-639-1 code to force it, omit to auto-detect
async function transcribeAudio(mp3Buffer, language) {
  // Whisper expects a File-like object, so wrap the buffer.
  // The global File class needs Node 20+; on older versions the SDK's toFile helper is an alternative.
  const file = new File([mp3Buffer], 'voice_note.mp3', { type: 'audio/mpeg' });

  const response = await oai.audio.transcriptions.create({
    model: 'whisper-1',
    file,
    ...(language ? { language } : {}),
    response_format: 'verbose_json', // adds language, duration, and segments
  });

  return {
    text: response.text,          // full transcript
    language: response.language,  // detected language
    duration: response.duration,  // audio duration in seconds
    segments: response.segments,  // segment-level timing (verbose_json only)
  };
}

// Full voice note pipeline: download → convert → transcribe → store
// readFromStorage and saveTranscript are your own storage/database helpers.
async function processVoiceNote(msg, from) {
  const { mediaId, storagePath, isVoiceNote } = await handleAudioMessage(msg, from);

  if (!isVoiceNote) return { mediaId, storagePath, transcript: null };

  // Read the stored MP3 for transcription
  const mp3Buffer = await readFromStorage(storagePath);
  const transcript = await transcribeAudio(mp3Buffer);

  // Store the transcript alongside the audio
  await saveTranscript(mediaId, from, transcript);

  console.log(`[${from}] Voice note: "${transcript.text}"`);
  return { mediaId, storagePath, transcript };
}
Python + OpenAI SDK
transcribe_voice_note.py
import io
import os
from typing import Optional

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


def transcribe_audio(mp3_bytes: bytes, language: Optional[str] = None) -> dict:
    """Transcribe MP3 audio bytes using OpenAI Whisper."""
    audio_file = io.BytesIO(mp3_bytes)
    audio_file.name = "voice_note.mp3"  # SDK reads .name for MIME detection

    params = {
        "model": "whisper-1",
        "file": audio_file,
        "response_format": "verbose_json",  # adds language, duration, and segments
    }
    if language:
        params["language"] = language  # omit for auto-detect

    response = client.audio.transcriptions.create(**params)
    return {
        "text": response.text,
        "language": response.language,
        "duration": response.duration,
    }


# read_from_storage and save_transcript are your own storage/database helpers.
def process_voice_note(msg: dict, sender: str) -> dict:
    result = handle_audio_message(msg, sender)
    if not result["is_voice"]:
        return {**result, "transcript": None}

    mp3_bytes = read_from_storage(result["path"])
    transcript = transcribe_audio(mp3_bytes)
    save_transcript(result["media_id"], sender, transcript)
    return {**result, "transcript": transcript}
What to do with the transcript: Store it alongside the audio file. Pass it to your AI agent as the user's message (instead of "audio message received"). Index it in your search database. Log it in your CRM as the conversation text. Customers send voice notes because typing is slower — treating their voice notes as searchable text dramatically improves your AI agent's ability to understand and respond correctly.

Storage patterns: organize your media files

How you organize files in S3 or GCS matters for performance, billing, and debugging. Here is the storage key pattern that scales cleanly across multiple clients and message types:

Node.js + AWS S3
storeFile.js
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');
const mime2ext = require('mime-types');

const s3 = new S3Client({ region: process.env.AWS_REGION });
const BUCKET = process.env.S3_BUCKET;

// mime-types maps audio/mpeg to 'mpga', so pin the extension we actually want
const EXT_OVERRIDES = { 'audio/mpeg': 'mp3' };

async function storeFile(bytes, type, mediaId, mimeType, filename) {
  const ext = EXT_OVERRIDES[mimeType] || mime2ext.extension(mimeType) || 'bin';
  const date = new Date().toISOString().slice(0, 10); // YYYY-MM-DD (UTC)

  // Organized key: media/{type}/{date}/{mediaId}.{ext}
  // For documents, preserve original filename in metadata
  const key = `media/${type}/${date}/${mediaId}.${ext}`;

  await s3.send(new PutObjectCommand({
    Bucket: BUCKET,
    Key: key,
    Body: bytes,
    ContentType: mimeType,
    Metadata: {
      // URL-encode so non-ASCII filenames stay valid as header values
      'original-filename': encodeURIComponent(filename || mediaId),
      'media-id': mediaId,
    },
  }));

  return `s3://${BUCKET}/${key}`; // store this path in your database
}

// S3 key examples:
// media/image/2026-05-13/wamid.img123.jpeg
// media/audio/2026-05-13/wamid.aud456.mp3   ← converted from OGG
// media/document/2026-05-13/wamid.doc789.pdf
// media/video/2026-05-13/wamid.vid012.mp4
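
The Python handlers earlier in this guide call store_file() without defining it. One possible counterpart is sketched below, using the same media/{type}/{date}/{id}.{ext} key layout but writing to local disk so it runs with the standard library alone. The media_store directory, the EXTENSIONS map, and build_storage_key are assumptions for illustration; for S3 or GCS you would keep the key builder and swap the write for your client's upload call.

```python
import mimetypes
from datetime import date
from pathlib import Path
from typing import Optional

# Explicit extensions for common WhatsApp MIME types; mimetypes is the fallback.
EXTENSIONS = {
    "image/jpeg": "jpg", "image/png": "png", "image/webp": "webp",
    "audio/mpeg": "mp3", "audio/ogg": "ogg", "video/mp4": "mp4",
    "application/pdf": "pdf",
}
MEDIA_ROOT = Path("media_store")  # hypothetical local directory


def build_storage_key(media_type: str, media_id: str, mime_type: str,
                      day: Optional[date] = None) -> str:
    """Build media/{type}/{YYYY-MM-DD}/{media_id}.{ext}."""
    base_mime = mime_type.split(";")[0].strip().lower()  # drop "; codecs=opus"
    ext = EXTENSIONS.get(base_mime)
    if ext is None:
        guessed = mimetypes.guess_extension(base_mime)
        ext = guessed.lstrip(".") if guessed else "bin"
    day = day or date.today()
    return f"media/{media_type}/{day.isoformat()}/{media_id}.{ext}"


def store_file(file_bytes: bytes, media_type: str, media_id: str,
               mime_type: str, filename: Optional[str] = None,
               root: Path = MEDIA_ROOT) -> str:
    """Write the bytes under root and return the stored path.

    filename is accepted to match the handler calls; keep it in your
    database record rather than in the key, since customer-chosen names
    can collide or contain unsafe characters.
    """
    path = root / build_storage_key(media_type, media_id, mime_type)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(file_bytes)
    return str(path)
```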

Complete media webhook handler (all types)

One function that receives a normalized media event and routes to the correct handler by type:

Node.js — universal media dispatcher
mediaDispatcher.js
async function dispatchMedia(event) {
  const { from, message } = event;
  const { type } = message;

  // Text messages are not media, handle separately
  if (type === 'text') return handleText(from, message.text.body);

  // Non-media message types
  if (['location', 'contacts', 'reaction', 'interactive'].includes(type)) {
    return handleNonMedia(from, message);
  }

  // Media types: all require download
  switch (type) {
    case 'image':
    case 'video':
    case 'sticker':
    case 'document': {
      const result = await handleMediaMessage(message, from);
      await saveMediaRecord(from, result);
      break;
    }
    case 'audio': {
      // Voice notes get transcribed; regular audio files just stored
      const result = await processVoiceNote(message, from);
      await saveMediaRecord(from, result);

      // If voice note was transcribed, treat transcript as text input
      if (result.transcript?.text) {
        console.log(`Voice note from ${from}: "${result.transcript.text}"`);
        await handleText(from, result.transcript.text, { isVoiceNote: true });
      }
      break;
    }
    default:
      console.warn(`Unhandled media type: ${type}`);
  }
}
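
For Python services, the same routing could be sketched as follows. It assumes the event shape the Node.js dispatcher reads (a from field plus a message object with a type), and it takes the handlers as arguments so the routing can be exercised without network access. make_dispatcher is a hypothetical name; in practice you would pass in handle_media_message, process_voice_note, and your own text handler.

```python
from typing import Any, Callable, Dict, Optional

MEDIA_TYPES = ("image", "video", "sticker", "document")
Handler = Callable[[Dict[str, Any], str], Any]


def make_dispatcher(handle_media: Handler,
                    handle_audio: Handler,
                    handle_text: Callable[[str, str], Any]) -> Callable[[Dict[str, Any]], Any]:
    """Build a dispatch(event) function that routes by message type."""

    def dispatch(event: Dict[str, Any]) -> Optional[Any]:
        sender = event["from"]
        message = event["message"]
        msg_type = message.get("type")

        if msg_type == "text":
            return handle_text(sender, message["text"]["body"])

        if msg_type in MEDIA_TYPES:
            return handle_media(message, sender)

        if msg_type == "audio":
            result = handle_audio(message, sender)
            # A transcribed voice note is fed back in as ordinary text input.
            transcript = (result or {}).get("transcript")
            if transcript and transcript.get("text"):
                handle_text(sender, transcript["text"])
            return result

        return None  # location, contacts, reaction, etc. are handled elsewhere

    return dispatch
```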

SocialHook: pre-extracted media IDs in normalized format

When you use SocialHook, the media payload arrives already extracted from Meta's nested Cloud API envelope. Instead of navigating entry[0].changes[0].value.messages[0], you receive a flat event with the sender in event.from and the message object in event.message.

Your dispatchMedia(event) function above receives this format directly — no additional parsing needed. The media ID is at event.message.image.id (or .audio.id, .document.id, etc.), ready to feed into resolveMediaUrl().

Common questions

How do I download media from a WhatsApp Cloud API webhook?
Two steps: (1) Call GET graph.facebook.com/v21.0/{media_id} with your access token to get a temporary download URL. (2) Fetch that URL (also with your access token in the Authorization header) to get the file bytes. Download immediately — the URL expires in approximately 5 minutes. Store the file bytes to S3/GCS/disk, not the temporary URL.
Why does my WhatsApp media download URL return 403?
Two causes: (1) URL expired — WhatsApp media download URLs are valid for approximately 5 minutes. If you stored the URL and fetched it later, it expired. Re-resolve from the media ID. (2) Missing Authorization header — the download URL itself also requires Authorization: Bearer {ACCESS_TOKEN} in the request. WhatsApp media URLs are not public CDN URLs.
What format are WhatsApp voice notes in and how do I convert them?
WhatsApp voice notes are OGG/Opus format (identified by voice: true in the payload and mime_type: "audio/ogg; codecs=opus"). Convert to MP3 with ffmpeg: ffmpeg -i input.ogg -codec:a libmp3lame -qscale:a 2 output.mp3. Not every transcription service or browser handles OGG/Opus reliably (Safari has historically lacked support), so convert before transcription or playback. Full Node.js and Python code is in the voice notes section above.
How do I transcribe WhatsApp voice notes with Whisper?
Download the OGG → convert to MP3 with ffmpeg → send to OpenAI Audio Transcriptions API with model: "whisper-1". Whisper returns the transcript text. Cost is ~$0.006 per minute of audio. A 30-second voice note costs less than a third of a cent. The transcript can then be treated as a regular text message by your AI agent or CRM.
How long are WhatsApp media IDs valid?
Media IDs are valid for 30 days. The temporary download URL you get from resolving the ID is valid for approximately 5 minutes. This means you can safely store just the media ID in your database and resolve it fresh when you need the file (within 30 days). After 30 days, the media is deleted from Meta's servers and the ID is invalid.
What are the file size limits for WhatsApp media?
By type: Image (JPEG, PNG): 5 MB. Audio: 16 MB. Video: 16 MB. Document: 100 MB. Sticker: 100 KB static, 500 KB animated. Messages with media exceeding these limits are rejected before your webhook fires, so the sender sees a failure, not you.
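
The 403 recovery described in the answers above (re-resolve from the media ID rather than reusing a stale URL) could be sketched like this. The resolve and download callables are injected so the logic stays independent of any HTTP client; DownloadForbidden and download_with_reresolve are hypothetical names, and your download function would need to raise that exception when it sees a 403. Note that re-resolving only helps with expiry: a missing Authorization header will fail on every attempt.

```python
from typing import Callable, Optional


class DownloadForbidden(Exception):
    """Raised by the download callable when the server answers 403."""


def download_with_reresolve(media_id: str,
                            resolve: Callable[[str], str],
                            download: Callable[[str], bytes],
                            max_attempts: int = 2) -> bytes:
    """Resolve a fresh URL for each attempt; never reuse one after a 403."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        url = resolve(media_id)       # fresh temporary URL every time
        try:
            return download(url)
        except DownloadForbidden as exc:
            last_error = exc          # likely expired; loop resolves again
    raise RuntimeError(
        f"media {media_id} still forbidden after {max_attempts} attempts"
    ) from last_error
```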

Your pipeline starts with a clean webhook event.

SocialHook delivers every WhatsApp media event — image, voice note, document — as pre-extracted JSON to your handler. You get the media ID ready to resolve, the MIME type, and the voice flag. No Cloud API parsing. Just pipe it straight to your download function.
