Build Voice Chat with an LLM on Rails

Introduction

In this guide, we’ll walk through the code for building a simple voice chat application with speech-to-text, text-to-speech, and a large language model, using OpenAI’s APIs, GPT-4, and the Sublayer gem.

You can browse the code for this guide on GitHub: Rails Voice Chat with LLM

A detailed video and step-by-step instructions are coming soon; in the meantime, we’ve called out the important parts of the repo below to help you get started with it.

Code Walkthrough

Sublayer Generators

What makes this all work is the use of the Sublayer gem. In particular, the combination of custom Actions and Generators that allow us to easily convert the audio to text, generate a response, and convert that response back to audio.

For this project, we created two Actions: lib/sublayer/actions/speech_to_text_action.rb

require "tempfile"

module Sublayer
  module Actions
    class SpeechToTextAction < Base
      def initialize(audio_data)
        @audio_data = audio_data
      end

      def call
        tempfile = Tempfile.new(['audio', '.webm'], encoding: 'ascii-8bit')
        tempfile.write(@audio_data.read)
        tempfile.rewind

        text = HTTParty.post(
          "https://api.openai.com/v1/audio/transcriptions",
          headers: {
            "Authorization" => "Bearer \#{ENV["OPENAI_API_KEY"]}",
            "Content-Type" => "multipart/form-data",
          },
          body: {
            file: tempfile,
            model: "whisper-1"
          })

        tempfile.close
        tempfile.unlink

        text["text"]
      end
    end
  end
end

lib/sublayer/actions/text_to_speech_action.rb

module Sublayer
  module Actions
    class TextToSpeechAction < Base
      def initialize(text)
        @text = text
      end

      def call
        speech = HTTParty.post(
          "https://api.openai.com/v1/audio/speech",
          headers: {
            "Authorization" => "Bearer \#{ENV["OPENAI_API_KEY"]}",
            "Content-Type" => "application/json",
          },
          body: {
            "model": "tts-1",
            "input": @text,
            "voice": "nova",
            "response_format": "wav"
          }.to_json
        )

        speech
      end
    end
  end
end

and one Generator: lib/sublayer/generators/conversational_response_generator.rb

module Sublayer
  module Generators
    class ConversationalResponseGenerator < Base
      llm_output_adapter type: :single_string,
        name: "response_text",
        description: "The response to the latest request from the user"

      def initialize(conversation_context:, latest_request:)
        @conversation_context = conversation_context
        @latest_request = latest_request
      end

      def generate
        super
      end

      def prompt
        <<-PROMPT
          #{@conversation_context}
          #{@latest_request}
        PROMPT
      end
    end
  end
end

Data Model

There are two primary models, `Conversation` and `Message`. A `Conversation` has many `messages`. A `Message` has `content` and a `role`, and belongs to a `conversation`.

Schema
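
If you’re recreating the app from scratch, here is a minimal sketch of the two models and the tables behind them. The associations follow the description above; the column types, timestamps, and Rails migration version are assumptions rather than copies from the repo.

# app/models/conversation.rb
class Conversation < ApplicationRecord
  has_many :messages, dependent: :destroy
end

# app/models/message.rb
class Message < ApplicationRecord
  belongs_to :conversation
end

# db/migrate/xxxxxxxxxxxxxx_create_conversations_and_messages.rb
class CreateConversationsAndMessages < ActiveRecord::Migration[7.1]
  def change
    create_table :conversations do |t|
      t.timestamps
    end

    create_table :messages do |t|
      t.references :conversation, null: false, foreign_key: true
      t.string :role     # "user" or "assistant"
      t.text :content
      t.timestamps
    end
  end
end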

View

Since this is a simple demo, all the action happens in app/views/layouts/application.html.erb, where we have a button that records audio while it is held down and uploads the recording when it is released.

<body>
    <div data-controller="audio-upload" data-audio-upload-conversation-id-value="<%= @conversation.id %>">
        <button data-action="mousedown->audio-upload#startRecording mouseup->audio-upload#stopRecording touchstart->audio-upload#startRecording touchend->audio-upload#stopRecording">
            Press and Hold to Record
        </button>
        <div data-audio-upload-target="status"></div>
        <!-- clear button -->
        <button data-action="click->audio-upload#clearSession">Clear</button
    </div>
</body>
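
The layout references @conversation, so it has to be set before the page renders. One way to do that (an assumption for illustration, not necessarily how the repo handles it) is a before_action in ApplicationController that finds or creates a conversation for the current session:

class ApplicationController < ActionController::Base
  before_action :set_conversation

  private

  # Reuse the conversation stored in the session, or start a new one
  def set_conversation
    @conversation = Conversation.find_by(id: session[:conversation_id]) || Conversation.create!
    session[:conversation_id] = @conversation.id
  end
end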

Stimulus Controller

In the app/javascript/controllers/audio_upload_controller.js file we have the Stimulus controller that handles recording the audio, uploading it, and responding to the user’s interactions. The important parts are `connect()`, where we set up the media recorder, and `uploadAudio()`, where we upload the audio when the user releases the button and play the audio returned from the backend.

connect() {
    // Buffer for the recorded audio chunks
    this.audioChunks = []

    navigator.mediaDevices.getUserMedia({ audio: true })
      .then(stream => {
        this.mediaRecorder = new MediaRecorder(stream)
        this.mediaRecorder.ondataavailable = event => {
          this.audioChunks.push(event.data)
        }
        this.mediaRecorder.onstop = () => {
          const audioBlob = new Blob(this.audioChunks, { type: 'audio/wav' })
          this.uploadAudio(audioBlob)
          this.audioChunks = []
        }
      })
      .catch(error => console.error("Audio recording error:", error))
  }

  uploadAudio(audioBlob) {
    const formData = new FormData()
    formData.append('audio_data', audioBlob)
    formData.append('conversation_id', this.conversationIdValue)

    fetch('/conversation_messages', {
      method: 'POST',
      body: formData,
      headers: {
        'X-CSRF-Token': document.querySelector('meta[name="csrf-token"]').getAttribute('content'),
        'Accept': 'application/json',
      },
    })
    .then(response => response.blob())
    .then(audioBlob => {
      const audioUrl = URL.createObjectURL(audioBlob)
      const audio = new Audio(audioUrl)
      audio.play()
      this.statusTarget.textContent = "Playing response..."
    })
    .catch(error => {
      console.error('Error:', error)
      this.statusTarget.textContent = "Upload failed."
    })
  }

ConversationMessagesController

In /app/controllers/conversation_messages_controller.rb we have the controller action that does the bulk of the work. It takes in the audio data from the frontend and loads any previous conversation context.

Then it uses the Sublayer SpeechToTextAction to convert the audio to text using OpenAI’s speech-to-text API.

After that, we pass the user’s new text and the previous conversational context to the Sublayer::Generators::ConversationalResponseGenerator to generate GPT-4’s next response in the conversation chain.

Finally, we use the Sublayer TextToSpeechAction to convert the text response to audio and send it back to the frontend.

def create
    conversation = Conversation.find(params[:conversation_id])

    # Convert conversational context to an easy to use format
    conversational_context = conversation.messages.map { |message| {role: message.role, content: message.content} }

    # Convert audio data to text
    text = Sublayer::Actions::SpeechToTextAction.new(params[:audio_data]).call

    # Generate conversational response
    output_text = Sublayer::Generators::ConversationalResponseGenerator.new(
        conversation_context: conversational_context, latest_request: text
    ).generate

    # Convert text to audio data
    speech = Sublayer::Actions::TextToSpeechAction.new(output_text).call

    # Store conversation context for next message
    conversation.messages << Message.new(conversation: conversation, role: "user", content: text)
    conversation.messages << Message.new(conversation: conversation, role: "assistant", content: output_text)

    send_data speech, type: "audio/wav", disposition: "inline"
  end
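
The Stimulus controller POSTs the recorded audio to /conversation_messages, so the only routing needed is a route pointing at this create action. A minimal sketch of the relevant line in config/routes.rb (the actual routes file in the repo may contain more):

Rails.application.routes.draw do
  # Receives the recorded audio from the audio-upload Stimulus controller
  resources :conversation_messages, only: [:create]
end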