Discord.js Guide

# Understanding Voice

TIP

Voice is a common source of confusion, and it is easy to write voice code without understanding what it does. This topic is optional, but it aims to help by explaining some of the Discord API's jargon and features.

# Glossary

- PCM - Think of this as raw audio; it is not encoded in anything special and is used by your computer at a lower level.
- Opus - This is a lossy audio format; it's an encoding applied to PCM that makes audio playable over Discord. An Opus encoder generates Opus packets, which can play over Discord.
- Ogg and WebM - These are audio containers. Think of this as ZIP vs. RAR: both can hold the same audio, but they package it in different ways. In this context, these containers store many Opus audio packets, which can be extracted and played.
- FFmpeg - For cases where we don't have Opus audio, e.g., in MP3 files, we need a way of converting them to Opus. FFmpeg is the tool we use; it converts hundreds of media formats to Opus for us so they can play over Discord.

# The voice connection lifecycle

1. A packet is sent to the main gateway telling Discord we wish to join a voice channel. Discord responds with a voice gateway (WebSocket) to connect to, and the bot authenticates itself. We also select an encryption mode available on that gateway. We now start heartbeating.
2. We discover our external IP address and port over UDP and report them to Discord so that audio can be sent and received. A voice connection therefore has two main components: a WebSocket connection, where we can change speaking status, and a UDP socket, where audio is sent and received.
3. We send our audio packets over the UDP socket at regular intervals, and the voice server forwards them to other users connected to the channel.
4. The server may change its voice region; in this case, discord.js will disconnect these sockets and reconnect to the new voice servers.
5. Once we're done with the voice connection, we tell the main gateway that we're no longer in a voice channel. We can now destroy the UDP socket and voice gateway connection without disrupting any regular gateway activity.
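
The first and last steps of this lifecycle both come down to a single payload on the main gateway. The sketch below shows roughly what those payloads look like. Opcode 4 is Discord's Voice State Update opcode; the helper names and the placeholder IDs are hypothetical, and discord.js builds and sends these payloads for you.

```javascript
// Opcode 4 on the main gateway is "Voice State Update".
const VOICE_STATE_UPDATE = 4;

// Hypothetical helper (not part of discord.js): builds the payload that asks
// Discord to move the bot into the given voice channel.
function buildVoiceStateUpdate(guildId, channelId) {
  return {
    op: VOICE_STATE_UPDATE,
    d: {
      guild_id: guildId,
      channel_id: channelId,
      self_mute: false,
      self_deaf: false,
    },
  };
}

// Hypothetical helper: a null channel_id tells Discord that the bot is no
// longer in a voice channel in that guild.
function buildLeavePayload(guildId) {
  return buildVoiceStateUpdate(guildId, null);
}

// 'GUILD_ID' and 'CHANNEL_ID' are placeholders, not real IDs.
console.log(JSON.stringify(buildVoiceStateUpdate('GUILD_ID', 'CHANNEL_ID')));
console.log(JSON.stringify(buildLeavePayload('GUILD_ID')));
```

Because joining and leaving are ordinary main-gateway messages, tearing down the voice sockets afterwards has no effect on the rest of the bot's gateway traffic.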

# Playing audio

To play audio, we need to send Opus audio packets to Discord at a fixed interval; discord.js uses 20 ms. This is the StreamDispatcher's job: it is a writable stream that has Opus audio packets written to it. The dispatcher handles the packets' timing, applies some metadata (e.g., the packet's sequence number and timestamp), and then encrypts each packet before sending it via the UDP socket.
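
To make the "metadata" concrete, here is a minimal sketch of the 12-byte RTP-style header that precedes each voice packet, along with how the counters advance from one 20 ms frame to the next. It assumes 48 kHz audio, which gives 960 samples per frame. The function names are hypothetical, and encryption is left out entirely; this is an illustration, not the dispatcher's actual code.

```javascript
const SAMPLE_RATE = 48000; // assumption: voice audio is sampled at 48 kHz
const FRAME_DURATION_MS = 20;
const SAMPLES_PER_FRAME = (SAMPLE_RATE * FRAME_DURATION_MS) / 1000; // 960

// Hypothetical helper: builds the 12-byte header for one packet.
function createHeader(sequence, timestamp, ssrc) {
  const header = Buffer.alloc(12);
  header[0] = 0x80; // version and flags
  header[1] = 0x78; // payload type
  header.writeUInt16BE(sequence, 2); // 16-bit sequence number
  header.writeUInt32BE(timestamp, 4); // 32-bit timestamp, counted in samples
  header.writeUInt32BE(ssrc, 8); // identifies the audio source
  return header;
}

// Hypothetical helper: advances both counters by one frame, wrapping around
// when they overflow their 16-bit and 32-bit fields.
function nextCounters({ sequence, timestamp }) {
  return {
    sequence: (sequence + 1) % 2 ** 16,
    timestamp: (timestamp + SAMPLES_PER_FRAME) % 2 ** 32,
  };
}

console.log(createHeader(1, 960, 5).toString('hex'));
```

In a real connection the Opus payload is encrypted with the mode selected on the voice gateway and appended after this header before the packet goes out over UDP.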

Obtaining the Opus audio is mainly offloaded to a separate module, prism-media. This module converts media to Opus audio, either through "demuxing" Ogg/WebM media files (extracting Opus audio directly) or using FFmpeg to transcode most media formats to Opus audio so we can use it.

prism-media can also change the volume of a stream in real time without much cost to performance.
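
Volume changes of this kind are cheap because, on PCM, they amount to multiplying each sample by a constant. The following is a rough sketch of that general technique, assuming signed 16-bit little-endian PCM. The `applyVolume` function is hypothetical and is not prism-media's implementation.

```javascript
// Hypothetical helper: scales every 16-bit little-endian sample in a PCM
// buffer by `volume` and clamps the result to the valid sample range.
function applyVolume(pcm, volume) {
  const out = Buffer.alloc(pcm.length);
  for (let i = 0; i + 1 < pcm.length; i += 2) {
    const scaled = Math.round(pcm.readInt16LE(i) * volume);
    out.writeInt16LE(Math.max(-32768, Math.min(32767, scaled)), i);
  }
  return out;
}

// Three example samples: 1000, -1000 and 30000.
const input = Buffer.alloc(6);
input.writeInt16LE(1000, 0);
input.writeInt16LE(-1000, 2);
input.writeInt16LE(30000, 4);

const halved = applyVolume(input, 0.5);
console.log(halved.readInt16LE(0), halved.readInt16LE(2), halved.readInt16LE(4));
```

Note that this only works on PCM. Audio that is already Opus would have to be decoded first, which is why adjusting volume can add extra stages to the pipeline.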

In conclusion:

1. You call VoiceConnection#play(...).
2. discord.js converts whatever you've provided into Opus audio in the most efficient way it can find. The worst-case route is having FFmpeg convert a file to PCM and then passing that PCM through an Opus encoder.
3. A StreamDispatcher is created and fed data from the Opus audio stream.
4. The StreamDispatcher takes an Opus audio packet every 20 ms, applies metadata, and encrypts it. The packet is then sent over UDP.
5. This continues until the stream ends, at which point the StreamDispatcher and all intermediary processes (e.g., FFmpeg transcoder, Volume Transformer, Opus Encoder) are destroyed.
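
One way to picture the "most efficient route" decision is as a function from the kind of input to the list of stages it must pass through. The sketch below is purely illustrative: `planStages` and its input labels are hypothetical and are not discord.js options or internals.

```javascript
// Hypothetical helper: lists the stages an input passes through before it
// reaches the dispatcher. The `kind` labels are illustrative only.
function planStages(kind, { volume = false } = {}) {
  switch (kind) {
    case 'opus-packets':
      // Already Opus packets: nothing to convert.
      return ['StreamDispatcher'];
    case 'ogg-or-webm':
      // Containers that hold Opus only need demuxing.
      return ['demuxer', 'StreamDispatcher'];
    default: {
      // Worst case: transcode to PCM, optionally adjust volume, then encode.
      const stages = ['FFmpeg (to PCM)'];
      if (volume) stages.push('volume transformer');
      stages.push('Opus encoder', 'StreamDispatcher');
      return stages;
    }
  }
}

console.log(planStages('ogg-or-webm').join(' -> '));
console.log(planStages('mp3', { volume: true }).join(' -> '));
```

The shorter the list, the less work is done per packet, which is why inputs that already contain Opus are the cheapest to play.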

# Receiving audio

WARNING

Discord does not officially support bots receiving audio. However, discord.js does its best to implement this anyway!

Just as we send audio to the UDP socket, we also receive audio through the UDP socket. Processing this audio is the reverse process of sending audio:

1. The packet is decrypted.
2. Metadata is removed, leaving an Opus packet.
3. If requested, the packet is decoded into PCM audio.
4. The processed packet is pushed to a user-friendly stream given to you.
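
As a rough illustration of the metadata step, the sketch below splits a received packet into its fixed 12-byte header fields and the remaining payload. The `parseVoicePacket` function is hypothetical. Decryption and Opus decoding are deliberately left out because they need external libraries, and real packets may carry additional header data that this sketch ignores.

```javascript
// Hypothetical helper: reads the fixed header fields from a received voice
// packet and returns the remaining bytes as the payload.
function parseVoicePacket(packet) {
  if (packet.length < 12) {
    throw new RangeError('Packet is shorter than the 12-byte header');
  }
  return {
    sequence: packet.readUInt16BE(2), // packet order
    timestamp: packet.readUInt32BE(4), // position in samples
    ssrc: packet.readUInt32BE(8), // identifies which source sent the audio
    payload: packet.subarray(12), // audio data that follows the header
  };
}

// A made-up packet: a header followed by two placeholder payload bytes.
const example = Buffer.from([
  0x80, 0x78, 0x00, 0x02, 0x00, 0x00, 0x07, 0x80, 0x00, 0x00, 0x00, 0x2a, 0xde, 0xad,
]);
const parsed = parseVoicePacket(example);
console.log(parsed.sequence, parsed.timestamp, parsed.ssrc, parsed.payload.length);
```

The sequence number and timestamp let the receiver order packets and detect gaps before the audio is handed to you as a stream.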

# Key takeaways

- You should be familiar with the definitions in the glossary above.
- A voice connection does not disrupt the bot's main gateway; separate connections to a voice gateway and UDP socket are created.
- The StreamDispatcher is essentially a writable stream that takes Opus audio.
- Receiving audio is officially unsupported; however, we do our best to maintain it.
- Playing audio has varying complexity depending on the file/audio you want to play. The simplest (and most efficient) files are Ogg/WebM files, as they already contain Opus audio. Other files must be passed through FFmpeg and an Opus encoder before they can play.
