xr-ai-accelerator

IXrAiSpeechToText

The IXrAiSpeechToText interface defines the contract for AI models that convert spoken audio into text. Implementations accept raw audio data and asynchronously return the transcription.

Interface Declaration

public interface IXrAiSpeechToText

Methods

Execute

Converts audio data to text asynchronously.

public Task<XrAiResult<string>> Execute(byte[] audioData, Dictionary<string, string> options = null)

Parameters:

- audioData: Raw audio bytes to transcribe, in the WAV format produced by XrAiSpeechToTextHelper
- options: Optional provider-specific settings (for example, model or language); may be null

Returns:

A Task that resolves to an XrAiResult<string>. On success, Data contains the transcribed text; on failure, ErrorMessage describes what went wrong.
Usage Example

using System.Collections.Generic;
using UnityEngine;

public class SpeechToTextExample : MonoBehaviour
{
    private IXrAiSpeechToText speechToText;

    void Start()
    {
        // Load the model
        speechToText = XrAiFactory.LoadSpeechToText("OpenAI", new Dictionary<string, string>
        {
            { "apiKey", "your-openai-api-key" }
        });

        // Record up to 5 seconds of audio using the helper component
        XrAiSpeechToTextHelper recorder = GetComponent<XrAiSpeechToTextHelper>();
        recorder.StartRecording(Microphone.devices[0], OnRecordingComplete, 5);
    }

    // Process the recorded audio once the helper finishes
    private async void OnRecordingComplete(byte[] audioData)
    {
        var result = await speechToText.Execute(audioData, new Dictionary<string, string>
        {
            { "model", "whisper-1" },
            { "language", "en" }
        });

        if (result.IsSuccess)
        {
            string transcribedText = result.Data;
            Debug.Log($"Transcribed: {transcribedText}");

            // Use the transcribed text
            ProcessTranscription(transcribedText);
        }
        else
        {
            Debug.LogError($"Speech recognition failed: {result.ErrorMessage}");
        }
    }

    // Application-specific handling of the transcription (stub so the example compiles)
    private void ProcessTranscription(string text)
    {
    }
}

Audio Recording Helper

Use the XrAiSpeechToTextHelper component to simplify audio recording:

// Start recording from default microphone
speechToTextHelper.StartRecording(
    device: Microphone.devices[0],
    onRecordingComplete: ProcessAudioData,
    recordingMax: 10 // seconds
);

// Stop recording manually
speechToTextHelper.StopRecording();
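
The onRecordingComplete callback receives the finished recording as WAV bytes. A minimal handler, assuming a speechToText field initialized as in the example above (ProcessAudioData is just the name used in the snippet):

// Receives the finished recording as WAV bytes and forwards it to the model
private async void ProcessAudioData(byte[] audioData)
{
    var result = await speechToText.Execute(audioData);
    Debug.Log(result.IsSuccess ? result.Data : result.ErrorMessage);
}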

Model-Specific Options

Different providers support different configuration options:

OpenAI (Whisper)

- model: Which Whisper model to run, e.g. "whisper-1" (as in the example above)
- language: ISO-639-1 code of the spoken language, e.g. "en"; supplying it when known improves accuracy

Audio Format Requirements

Most providers expect audio in a specific container and encoding; 16-bit PCM WAV is the most widely accepted. The XrAiSpeechToTextHelper automatically converts the recorded Unity AudioClip to the appropriate WAV format, so you normally do not need to handle this yourself.
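
For reference, the conversion looks roughly like the sketch below. This is an illustrative reimplementation, not the helper's actual code, and it assumes the standard 44-byte RIFF header with 16-bit PCM samples (requires System.IO and UnityEngine):

// Illustrative AudioClip-to-WAV conversion (not the helper's actual code)
private static byte[] AudioClipToWav(AudioClip clip)
{
    float[] samples = new float[clip.samples * clip.channels];
    clip.GetData(samples, 0);

    using (var stream = new MemoryStream())
    using (var writer = new BinaryWriter(stream))
    {
        int dataSize = samples.Length * 2; // 16-bit = 2 bytes per sample

        writer.Write(System.Text.Encoding.ASCII.GetBytes("RIFF"));
        writer.Write(36 + dataSize);
        writer.Write(System.Text.Encoding.ASCII.GetBytes("WAVE"));
        writer.Write(System.Text.Encoding.ASCII.GetBytes("fmt "));
        writer.Write(16);                                   // fmt chunk size
        writer.Write((short)1);                             // PCM
        writer.Write((short)clip.channels);
        writer.Write(clip.frequency);
        writer.Write(clip.frequency * clip.channels * 2);   // byte rate
        writer.Write((short)(clip.channels * 2));           // block align
        writer.Write((short)16);                            // bits per sample
        writer.Write(System.Text.Encoding.ASCII.GetBytes("data"));
        writer.Write(dataSize);

        foreach (float sample in samples)
        {
            writer.Write((short)(Mathf.Clamp(sample, -1f, 1f) * short.MaxValue));
        }
        return stream.ToArray();
    }
}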

Real-time Usage

For real-time speech recognition applications:

using System.Collections.Generic;
using UnityEngine;

public class RealtimeSpeechRecognition : MonoBehaviour
{
    private XrAiSpeechToTextHelper recorder;
    private IXrAiSpeechToText speechToText;
    private bool isListening = false;

    void Start()
    {
        recorder = GetComponent<XrAiSpeechToTextHelper>();
        speechToText = XrAiFactory.LoadSpeechToText("OpenAI", new Dictionary<string, string>
        {
            { "apiKey", "your-api-key" }
        });
    }

    public void StartListening()
    {
        if (!isListening)
        {
            isListening = true;
            recorder.StartRecording(Microphone.devices[0], OnSpeechRecorded, 3);
        }
    }
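
    // Hypothetical companion method (not in the original snippet): clears the flag
    // so OnSpeechRecorded stops re-arming the recorder, and halts any recording
    // currently in progress.
    public void StopListening()
    {
        isListening = false;
        recorder.StopRecording();
    }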

    private async void OnSpeechRecorded(byte[] audioData)
    {
        if (isListening)
        {
            var result = await speechToText.Execute(audioData);
            
            if (result.IsSuccess && !string.IsNullOrEmpty(result.Data))
            {
                ProcessCommand(result.Data);
            }
            
            // Continue listening
            recorder.StartRecording(Microphone.devices[0], OnSpeechRecorded, 3);
        }
    }
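
    // Application-specific handling of the recognized text (stub so the
    // example compiles; replace with your own command dispatch)
    private void ProcessCommand(string text)
    {
        Debug.Log($"Command: {text}");
    }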
}
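
Note that this pattern re-arms the recorder from inside the callback, so recognition runs in fixed 3-second windows rather than as a true stream; speech arriving while a request is in flight falls into the next window. Call StopListening (sketched above) to break the loop.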

Implementation Notes

Error Handling

Check IsSuccess before consuming a result, and surface the error when transcription fails:

if (!result.IsSuccess)
{
    Debug.LogError($"Speech transcription failed: {result.ErrorMessage}");
    // Handle specific error cases:
    // - Invalid API key
    // - Unsupported audio format
    // - Audio too short/long
    // - Network issues
    // - Service rate limits
}
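
Transient failures such as rate limits or network hiccups can often be resolved by retrying. A minimal sketch of a retry wrapper; this is illustrative, not part of the library, and assumes System and System.Threading.Tasks are available:

// Illustrative retry wrapper (not part of the library API): retries Execute
// with exponential backoff (1s, 2s, 4s, ...) until it succeeds or gives up.
private async Task<XrAiResult<string>> ExecuteWithRetry(
    IXrAiSpeechToText speechToText, byte[] audioData, int maxAttempts = 3)
{
    XrAiResult<string> result = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++)
    {
        result = await speechToText.Execute(audioData);
        if (result.IsSuccess)
        {
            return result;
        }
        await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
    }
    return result; // last failure
}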

Best Practices