Odunolaoluwa Shadrack Jenrola

Parakeet-streaming: Realtime STT streaming inference engine

Parakeet-streaming is a streaming inference engine that provides real-time speech-to-text over WebSockets. This makes low-latency ASR serving easy to spin up and run, in much the same way tools like vLLM and SGLang made standard LLM inference feel straightforward to use.

Check out the demo video here.

Today, Parakeet-streaming ships with two NVIDIA ASR models that share the same core architecture and mainly differ in size. Both are RNN-T systems with a FastConformer encoder and an LSTM predictor. The smaller model is around 120M parameters, and the larger model is around 600M.

This project is still under active development. Performance isn't great yet; there's too much Python overhead in the hot path. I still wanted to get it out in the open early and keep iterating in public, fixing performance issues and addressing anything the community points out along the way.

FastConformer and streaming constraints

NVIDIA has been iterating on Conformer-style encoders for a while, and FastConformer is their efficiency-focused redesign. One of the biggest architectural changes is more aggressive early downsampling. Compared to the original Conformer’s 4x reduction, FastConformer uses 8x downsampling at the start of the encoder. This is done with a convolutional subsampling stack that both expands the feature dimension of log-mel inputs to the model’s hidden size and reduces the time resolution so the following attention and feed-forward blocks run on shorter sequences.
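
As a rough sketch (not the exact NeMo pre-encoder), you can picture the 8x subsampling as three stride-2 convolutions that also project the mel features up to the model width:

import torch
from torch import nn

# Minimal sketch of 8x convolutional subsampling, assuming three stride-2 conv
# layers; the real NeMo pre-encoder differs in detail.
class Subsample8x(nn.Module):
    def __init__(self, n_mels: int = 80, d_model: int = 512):
        super().__init__()
        self.stack = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, time) -> (batch, time // 8, d_model)
        return self.stack(mel).transpose(1, 2)

mel = torch.randn(1, 80, 160)          # ~1.6 s of 10 ms mel frames
out = Subsample8x()(mel)
print(out.shape)                       # torch.Size([1, 20, 512]): 8x fewer frames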

In its original form, the FastConformer encoder is not streaming-first. It is typically trained and evaluated in a non-streaming setup, with attention that can use future context. Even if you later restrict attention to a window or apply chunking for efficiency on long recordings as described in the original paper, the model is still bidirectional within whatever context it is allowed to see. This is similar to chunk-wise attention systems in AssemblyAI’s Universal 1 where attention is restricted to tokens in particular chunks. You cap context for compute reasons, but the encoder itself is not causal.

For true streaming, that bidirectional behavior becomes a problem. A non-causal convolution kernel centred at time t will also read frames after t, which means future audio can leak into the current representation. That breaks the streaming requirement unless any future context is deliberate and tightly bounded. I wrote a deep dive about causal convolutions and streaming here. Please check that out too.
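
For a quick picture of what causal means here (an illustrative snippet, not the Parakeet code), a convolution can be made causal by padding only on the left, so the output at time t never reads frames after t:

import torch
from torch import nn
import torch.nn.functional as F

# Illustrative causal depthwise 1D convolution: pad (kernel_size - 1) frames on the
# left only, so position t never sees frames after t.
class CausalConv1d(nn.Module):
    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.left_pad = kernel_size - 1
        self.conv = nn.Conv1d(channels, channels, kernel_size, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))   # left padding only -> no future leakage
        return self.conv(x)

x = torch.randn(1, 8, 16)
y = CausalConv1d(channels=8, kernel_size=5)(x)
print(y.shape)   # (1, 8, 16): same length, but each frame only saw the past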

That said, prior work has shown that a small, fixed lookahead, where the model is allowed to peek slightly into the future, can noticeably improve streaming accuracy.

The newer Nemotron Speech and Parakeet streaming models take a different route. Instead of training a separate streaming-only network, they modify FastConformer so streaming inference stays consistent with how the model is trained, even though training itself is not framed as a special ā€œstreamingā€ procedure. Beyond making the convolution layers causal, the main idea is to structure inference around the same context rules and reuse cached states so the streaming computation matches the offline one.

The encoder is trained with explicit limits on left and right context, and the convolution layers are made causal. At inference time, the model runs step-by-step using activation caching. You still train efficiently in parallel, but during streaming you reuse intermediate activations from earlier steps so you do not keep recomputing overlapping context.

For attention, it uses chunk-aware lookahead where frames within the same chunk can attend to each other, and each chunk can also attend to a fixed number of previous chunks.
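
A minimal sketch of what such a chunk-aware mask can look like, with illustrative chunk_size and left_chunks parameters (not the exact NeMo masking code):

import torch

def chunk_aware_mask(T: int, chunk_size: int, left_chunks: int) -> torch.Tensor:
    # True entries are attendable positions: each frame can see every frame in its
    # own chunk (a small lookahead within the chunk) plus up to `left_chunks`
    # chunks of history.
    q_chunk = torch.arange(T).unsqueeze(1) // chunk_size   # chunk id of each query
    k_chunk = torch.arange(T).unsqueeze(0) // chunk_size   # chunk id of each key
    return (k_chunk <= q_chunk) & (k_chunk >= q_chunk - left_chunks)

mask = chunk_aware_mask(T=8, chunk_size=2, left_chunks=1)
print(mask.int())
# Frame 4 (chunk 2) can attend to chunks 1 and 2, i.e. frames 2..5, but not 6..7.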

The result of all of this is a model that is still efficient, yet better matched to streaming constraints, with streaming inference matching the performance of offline inference. One other change that shows up in the model stack is the use of LayerNorm instead of BatchNorm. BatchNorm stores global statistics from training and reuses them at inference time, which becomes unstable for our use case since the dynamics during streaming inference differ a lot from training. LayerNorm normalizes per sample across the feature dimension, which makes it more stable for streaming. There's also the added benefit of not having to worry about extra stored tensors, however little.
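
A tiny example of the difference: BatchNorm carries running statistics from training, while LayerNorm normalizes each frame on the fly, so it behaves the same no matter how the stream arrives.

import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(2, 10, 64)             # (batch, time, features)

# BatchNorm1d keeps running_mean / running_var from training and reuses them in
# eval mode, so its output depends on statistics gathered under other conditions.
bn = nn.BatchNorm1d(64).eval()
y_bn = bn(x.transpose(1, 2)).transpose(1, 2)

# LayerNorm normalizes each (batch, time) position over the feature dim on the fly;
# nothing is carried over from training, which is friendlier to streaming.
ln = nn.LayerNorm(64)
y_ln = ln(x)

print(y_bn.shape, y_ln.shape)          # both (2, 10, 64)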

Caching in the encoder

Streaming requires that any layer needing context maintain a cache. In Parakeet-streaming, we use the ModelCache. It stores attention key/value caches per layer and convolution caches per layer. Each streaming step updates the caches in place instead of allocating new tensors each time. That keeps memory use stable across long sessions and avoids the annoying memory reallocation that comes with repeated concatenation.
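
A stripped-down sketch of what such a per-stream cache can look like (field names and shapes are illustrative, not the actual ModelCache): preallocate once, then write new frames into the existing tensors each step.

import torch

# Illustrative per-stream cache, roughly in the spirit of the ModelCache described
# above; the real field names and shapes in Parakeet-streaming differ.
class LayerCaches:
    def __init__(self, num_layers, num_heads, head_dim, left_context, conv_kernel, d_model):
        # Bounded attention K/V history per layer, preallocated once.
        self.keys = torch.zeros(num_layers, num_heads, left_context, head_dim)
        self.values = torch.zeros(num_layers, num_heads, left_context, head_dim)
        # Last (kernel - 1) frames per layer for the causal depthwise conv.
        self.conv = torch.zeros(num_layers, d_model, conv_kernel - 1)

    def update_attn(self, layer, new_k, new_v):
        # Shift the window and write the newest frames into the existing storage,
        # instead of concatenating and reallocating every step.
        n = new_k.shape[1]
        self.keys[layer] = torch.roll(self.keys[layer], -n, dims=1)
        self.keys[layer, :, -n:] = new_k
        self.values[layer] = torch.roll(self.values[layer], -n, dims=1)
        self.values[layer, :, -n:] = new_v

cache = LayerCaches(num_layers=17, num_heads=8, head_dim=64, left_context=70,
                    conv_kernel=9, d_model=512)
cache.update_attn(0, torch.randn(8, 4, 64), torch.randn(8, 4, 64))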

Pointwise convolutions have a kernel size of 1 and only mix channels at a single time step, so they do not need any cache. The depthwise convolutions are causal and use the convolution cache to stitch in the recent past. Attention caches keep a bounded left context and a controlled right context. This is what makes the encoder streaming-friendly without throwing away the benefits of the original FastConformer design.
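
Here is a rough sketch (again, not the actual module) of how the depthwise convolution can use its cache to stitch in the recent past: prepend the cached tail, convolve with no padding, then keep the last (kernel - 1) frames as the new cache.

import torch
from torch import nn

# Illustrative streaming step for a causal depthwise conv. The cache holds the last
# (kernel_size - 1) frames from the previous chunk; shapes are (batch, channels, time).
def depthwise_conv_step(conv: nn.Conv1d, chunk: torch.Tensor, cache: torch.Tensor):
    x = torch.cat([cache, chunk], dim=-1)       # stitch cached past onto the new chunk
    out = conv(x)                               # no-padding conv -> causal output
    new_cache = x[..., -(conv.kernel_size[0] - 1):]
    return out, new_cache

kernel = 5
conv = nn.Conv1d(8, 8, kernel, groups=8)        # depthwise: groups == channels
cache = torch.zeros(1, 8, kernel - 1)           # empty history at stream start

full = torch.randn(1, 8, 12)
outs = []
# Processing the stream chunk by chunk matches running the causal conv offline.
for t in range(0, 12, 4):
    out, cache = depthwise_conv_step(conv, full[:, :, t:t + 4], cache)
    outs.append(out)
streamed = torch.cat(outs, dim=-1)
print(streamed.shape)                           # (1, 8, 12)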

I also tried not to make large changes to the original NeMo implementation, at least for now; there's only some refactoring for readability. The biggest change is in the attention block, where the Q, K, and V projections are now computed with a single linear layer and then chunked into query, key and value. This reduces the number of kernel launches.

import torch
from torch import nn

B, T, embed_dim = 4, 8, 64
x = torch.randn(B, T, embed_dim)

# Original NeMo impl: three separate projections (three kernel launches)
q_proj = nn.Linear(embed_dim, embed_dim)
k_proj = nn.Linear(embed_dim, embed_dim)
v_proj = nn.Linear(embed_dim, embed_dim)

query = q_proj(x)
key = k_proj(x)
value = v_proj(x)

# Rewrite: one fused projection, then split into Q, K and V
qkv = nn.Linear(embed_dim, embed_dim * 3)
query, key, value = qkv(x).chunk(3, dim=-1)

To load the pretrained weights, we now need to concatenate the weights of linear_q, linear_k and linear_v along the output dimension (dim 0) for every layer and load that into the updated model.

num_layers = 17
qkv_dict = dict()

for layer_idx in range(num_layers):
    # Keys of the original per-projection weights in the pretrained state dict
    linear_q = f"encoder.layers.{layer_idx}.self_attn.linear_q.weight"
    linear_k = f"encoder.layers.{layer_idx}.self_attn.linear_k.weight"
    linear_v = f"encoder.layers.{layer_idx}.self_attn.linear_v.weight"

    # Stack Q, K, V along the output dimension to match nn.Linear(embed_dim, embed_dim * 3)
    comb_weight = torch.cat(
        [
            layer_state_dict[linear_q],
            layer_state_dict[linear_k],
            layer_state_dict[linear_v],
        ],
        dim=0,
    )
    qkv_dict[f"encoder.layers.{layer_idx}.self_attn.qkv.weight"] = comb_weight

Inference engine overview

The inference engine’s design borrows ideas from NanoVLLM, a minimal, educational version of vLLM focused on the core scheduling and execution loop.

The pipeline is divided into distinct stages. We run each stage continuously in separate threads and connect them with lightweight queues. Each client connection is assigned a stream id, and that is used to create a Sequence.

Sequence

A Sequence is the unit of work for a single stream. It owns:

  1. The ring buffer for incoming audio samples
  2. A raw queue for audio chunks ready for feature extraction (array → mel spec)
  3. An encoded queue for encoder outputs ready for RNNT decoding
  4. The predictor state and the last predictor output used by the RNNT loop for this sequence
  5. The accumulated token ids for the transcript

The Sequence also tracks whether a final chunk has been requested and how many encoder chunks are currently in flight in some stage of the pipeline. This lets the engine know when a stream has been fully drained so it can finalise the stream and release resources for incoming streams.
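
A condensed sketch of the shape of a Sequence under those assumptions (simplified fields and types, not the exact class):

from dataclasses import dataclass, field
from queue import Queue
import numpy as np

# Simplified sketch of the per-stream bookkeeping described above; the real Sequence
# in Parakeet-streaming has more fields and different types.
@dataclass
class Sequence:
    stream_id: str
    # Ring buffer for incoming samples (assumed 30 s at 16 kHz for illustration)
    ring_buffer: np.ndarray = field(default_factory=lambda: np.zeros(16000 * 30, dtype=np.float32))
    raw_queue: Queue = field(default_factory=Queue)       # chunks awaiting feature extraction
    encoded_queue: Queue = field(default_factory=Queue)   # encoder outputs awaiting RNNT decoding
    predictor_state: object = None                        # LSTM predictor hidden state
    last_predictor_output: object = None
    token_ids: list = field(default_factory=list)         # accumulated transcript tokens
    final_requested: bool = False
    chunks_in_flight: int = 0                              # encoder chunks still in some stage

    def is_drained(self) -> bool:
        return (self.final_requested and self.chunks_in_flight == 0
                and self.raw_queue.empty() and self.encoded_queue.empty())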

Scheduler

The Scheduler is responsible for admission control and state pooling. It maintains an awaiting queue and an active queue of sequences. If the number of active streams is below max_num_streams, new sequences are admitted immediately; otherwise, they wait. We do this so that compute and memory growth stays predictable. One thing missing right now is a flexible way to decide how many new streams can be admitted given the current unallocated GPU memory. We just hard-limit for now.

The Scheduler also manages a pool of StreamingState objects for the encoder. Each state slot contains the per-layer attention and convolution caches. When a new stream is created, it acquires a slot from the pool. When the stream finishes, the slot is reset and returned to the pool. This avoids re-allocating caches for every connection and keeps memory usage stable even at high concurrency.
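
A rough sketch of the admission and pooling logic, assuming the behaviour described above (names and details are illustrative, not the real class):

from collections import deque

# Illustrative admission control + state pooling.
class Scheduler:
    def __init__(self, max_num_streams, make_state):
        self.max_num_streams = max_num_streams
        self.awaiting = deque()          # sequences waiting for a free slot
        self.active = {}                 # stream_id -> (sequence, StreamingState slot)
        # Preallocate one state slot (attention + conv caches) per possible stream.
        self.state_pool = deque(make_state() for _ in range(max_num_streams))

    def admit(self, seq):
        if len(self.active) < self.max_num_streams:
            self.active[seq.stream_id] = (seq, self.state_pool.popleft())
        else:
            self.awaiting.append(seq)    # hard limit: wait until a stream finishes

    def release(self, stream_id):
        _, slot = self.active.pop(stream_id)
        # The real code resets the slot's caches in place before reuse.
        self.state_pool.append(slot)
        if self.awaiting:
            self.admit(self.awaiting.popleft())

sched = Scheduler(max_num_streams=2, make_state=dict)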

ModelRunner

ModelRunner drives the actual forward passes. It spins up three threads:

  1. Pre-encode thread: This consumes raw audio chunks, runs feature extraction, and applies the pre-encoder (subsampling stack). For every Sequence seq, the resulting frames are appended to the enc_buffer.
  2. Encode thread: This pulls fixed-size chunks from the enc_buffer, packs the corresponding streaming states into a batch, runs an encoder forward pass and unpacks the states back into the pool. The encoder output chunks are queued in encoded_queue.
  3. Decode thread: We now pull encoded chunks, run the RNNT decoder and joiner in a greedy decoding loop, and emit new tokens. For every Sequence seq, the new token ids are appended to seq.token_ids and the new tokens are placed in a thread-safe results queue.

All three stages run continuously. The queues decouple the stages so that a slow step does not stall the rest of the pipeline, and each stage stays fed most of the time. This particular design might change, but for now it is a simple way to get low latency, since we avoid a single monolithic loop that has to do all the work.
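
Schematically, the wiring looks roughly like this (heavily simplified; the real ModelRunner also batches across sequences and swaps streaming states in and out of the pool):

import threading, queue

# Simplified picture of the three-stage pipeline: each stage is a daemon thread that
# blocks on its input queue and pushes its output to the next queue.
raw_q, enc_q, dec_q, results_q = (queue.Queue() for _ in range(4))

def pre_encode_loop():
    while True:
        stream_id, audio_chunk = raw_q.get()
        enc_q.put((stream_id, f"mel({audio_chunk})"))      # placeholder for features + pre-encoder

def encode_loop():
    while True:
        stream_id, features = enc_q.get()
        dec_q.put((stream_id, f"enc({features})"))         # placeholder for batched encoder pass

def decode_loop():
    while True:
        stream_id, encoded = dec_q.get()
        results_q.put((stream_id, f"rnnt({encoded})"))     # placeholder for greedy RNNT decoding

for loop in (pre_encode_loop, encode_loop, decode_loop):
    threading.Thread(target=loop, daemon=True).start()

raw_q.put(("stream-0", "chunk-0"))
print(results_q.get())                                      # ('stream-0', 'rnnt(enc(mel(chunk-0)))')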

ASREngine

ASREngine is the public-facing engine interface used by the server. It owns the ModelRunner and Scheduler, maps stream ids to Sequence objects, and exposes push_samples and collect_stream_results. The collect_stream_results call drains the ModelRunner results queue, converts token ids to text with the tokenizer, and returns StreamResult objects containing the text, the newly emitted token ids, and a final flag.

Finalisation is handled carefully. If a stream is marked final and all queues are empty, the engine releases the sequence and emits a final result even when no new tokens were produced at the very end. That ensures clients always receive a final message.
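
In terms of how a caller might drive it, the flow is roughly the following. The method names come from the description above; everything else (signatures, attribute names, chunk size) is an assumption for illustration.

import numpy as np

# Hypothetical driver loop against an ASREngine instance; exact signatures may differ.
def run_stream(engine, stream_id: str, audio: np.ndarray, chunk: int = 1600):
    transcript = []
    for start in range(0, len(audio), chunk):
        engine.push_samples(stream_id, audio[start:start + chunk])   # feed ~100 ms of 16 kHz audio
        for result in engine.collect_stream_results():
            transcript.append(result.text)                           # text + token ids + final flag
            if result.final:
                return "".join(transcript)
    return "".join(transcript)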

Server

We use Trio for TCP and WebSockets serving. I first looked into Trio for async work sometime last year after seeing it mentioned in an Anthropic job posting. Each client connection creates a new stream in ASREngine, and the server replies with a hello message that includes the stream ID and the required sample rate (16 kHz). From there, the server runs two loops per connection:

  1. Reader loop: reads client messages (audio, ping or close). Audio can be raw float arrays or base64 PCM16/f32. Close or final markers trigger stream finalisation, and the freed slot is returned to the pool.
  2. Writer loop: pulls StreamResults from the engine and sends them back to the client. If there are no new results, the loop waits on a condition variable via ASREngine.wait_for_update. When a client disconnects, the server asks the engine to finalise if the sequence still exists and waits for results to drain. If the client drops abruptly, like a Ctrl+C, the server cleans up after a short timeout to prevent resource leaks.

We keep the server thin this way. All model logic remains in the engine. This also makes it easy to plug in new transports without changing the inference pipeline.

On the transport side, the server currently supports both raw TCP and WebSockets, and both use the same message protocol. TCP is the simpler, lower-overhead option and is usually faster.

The TCP endpoint uses newline-delimited JSON: clients send one JSON object per line and read responses line by line. This is convenient for backend services that can open a plain socket and keep things minimal.
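
For example, a minimal TCP client along these lines might look like the following. The port and the message field names ("audio", "final", "text") are assumptions based on the description above, not the wire format verbatim.

import json, socket, base64
import numpy as np

# Hypothetical newline-delimited JSON client; check the actual protocol before relying on it.
sock = socket.create_connection(("localhost", 8765))
reader = sock.makefile("r")

hello = json.loads(reader.readline())            # server greets with stream id + sample rate
print(hello)

samples = np.zeros(1600, dtype=np.float32)       # 100 ms of silence at 16 kHz
msg = {"audio": base64.b64encode(samples.tobytes()).decode(), "final": True}
sock.sendall((json.dumps(msg) + "\n").encode())

for line in reader:                              # read results line by line
    result = json.loads(line)
    print(result)
    if result.get("final"):
        break
sock.close()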

The WebSocket endpoint wraps the same JSON messages in WebSocket frames. The payload is still JSON, but each message is delivered as its own frame. This is a better fit for browser clients or any environment where WebSockets are the standard choice for real-time communication.
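
A rough Python equivalent over WebSockets using the websockets library, one JSON message per frame (same caveats as above about assumed fields):

import asyncio, json, base64
import numpy as np
import websockets

# Hypothetical WebSocket client mirroring the TCP example.
async def main():
    async with websockets.connect("ws://localhost:8765") as ws:
        print(json.loads(await ws.recv()))                   # hello message
        samples = np.zeros(1600, dtype=np.float32)
        await ws.send(json.dumps({"audio": base64.b64encode(samples.tobytes()).decode(),
                                  "final": True}))
        while True:
            result = json.loads(await ws.recv())
            print(result)
            if result.get("final"):
                break

asyncio.run(main())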

The result is a low-latency streaming ASR inference engine that's simple to use and integrate into real-time applications. It's not stable yet, so expect a few bugs.