The `IXrAiSpeechToText` interface defines the contract for AI models that convert spoken audio into text. Implementations accept audio data and return the transcribed text.
```csharp
public interface IXrAiSpeechToText
```
Converts audio data to text asynchronously.
```csharp
public Task<XrAiResult<string>> Execute(byte[] audioData, Dictionary<string, string> options = null)
```
Parameters:

- `audioData` (`byte[]`): The audio data as a byte array (typically WAV format)
- `options` (`Dictionary<string, string>`, optional): Model-specific options and parameters

Returns:

- `Task<XrAiResult<string>>`: A task that resolves to a result containing the transcribed text

Example:

```csharp
using System.Collections.Generic;
using UnityEngine;

public class SpeechToTextExample : MonoBehaviour
{
    private IXrAiSpeechToText speechToText;
    private XrAiSpeechToTextHelper recorder;

    void Start()
    {
        // Load the model
        speechToText = XrAiFactory.LoadSpeechToText("OpenAI", new Dictionary<string, string>
        {
            { "apiKey", "your-openai-api-key" }
        });

        // Record audio using helper
        recorder = GetComponent<XrAiSpeechToTextHelper>();
        recorder.StartRecording(Microphone.devices[0], OnRecordingComplete, 5);
    }

    // Process recorded audio
    private async void OnRecordingComplete(byte[] audioData)
    {
        var result = await speechToText.Execute(audioData, new Dictionary<string, string>
        {
            { "model", "whisper-1" },
            { "language", "en" }
        });

        if (result.IsSuccess)
        {
            string transcribedText = result.Data;
            Debug.Log($"Transcribed: {transcribedText}");

            // Use the transcribed text
            ProcessTranscription(transcribedText);
        }
        else
        {
            Debug.LogError($"Speech recognition failed: {result.ErrorMessage}");
        }
    }
}
```
Use the `XrAiSpeechToTextHelper` component to simplify audio recording:
```csharp
// Start recording from the first available microphone
speechToTextHelper.StartRecording(
    device: Microphone.devices[0],
    onRecordingComplete: ProcessAudioData,
    recordingMax: 10 // seconds
);

// Stop recording manually
speechToTextHelper.StopRecording();
```
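Note that `Microphone.devices[0]` assumes at least one input device exists. A small defensive check using Unity's standard `Microphone.devices` array (reusing the helper and callback names from the example above) avoids an out-of-range access on hardware without a microphone:

```csharp
// Guard: Microphone.devices is empty when no input device is present
if (Microphone.devices.Length == 0)
{
    Debug.LogWarning("No microphone detected; cannot start recording.");
}
else
{
    speechToTextHelper.StartRecording(Microphone.devices[0], ProcessAudioData, 10);
}
```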
Different providers support different configuration options:
- `model`: Model to use (e.g., "whisper-1")
- `language`: Source language code (e.g., "en", "es", "fr")
- `prompt`: Optional text to guide the model's style
- `response_format`: Response format (e.g., "json", "text", "srt", "vtt")
- `temperature`: Sampling temperature (0 to 1)
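As an example, a call that combines several of these options might look like the following sketch (the specific values shown are illustrative):

```csharp
var result = await speechToText.Execute(audioData, new Dictionary<string, string>
{
    { "model", "whisper-1" },
    { "language", "en" },
    { "prompt", "Short XR voice commands." }, // nudges style and vocabulary
    { "response_format", "text" },
    { "temperature", "0" }                    // 0 = most deterministic output
});
```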
Most providers expect audio in specific formats. The `XrAiSpeechToTextHelper` automatically converts Unity `AudioClip` data to the appropriate WAV format.
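If you capture audio yourself instead of going through the helper, you will need an equivalent conversion. Below is a minimal sketch of a standard 16-bit PCM WAV encoder for an `AudioClip`; the `AudioWavEncoder` class and method name are illustrative, and the helper's internal implementation may differ:

```csharp
using System.IO;
using System.Text;
using UnityEngine;

public static class AudioWavEncoder
{
    // Illustrative: encode an AudioClip as a 16-bit PCM WAV byte array
    public static byte[] ClipToWav(AudioClip clip)
    {
        float[] samples = new float[clip.samples * clip.channels];
        clip.GetData(samples, 0);

        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream))
        {
            int dataSize = samples.Length * 2;                 // 16-bit = 2 bytes per sample
            int byteRate = clip.frequency * clip.channels * 2;

            writer.Write(Encoding.ASCII.GetBytes("RIFF"));     // RIFF header
            writer.Write(36 + dataSize);
            writer.Write(Encoding.ASCII.GetBytes("WAVE"));

            writer.Write(Encoding.ASCII.GetBytes("fmt "));     // format chunk
            writer.Write(16);                                  // chunk size
            writer.Write((short)1);                            // 1 = PCM
            writer.Write((short)clip.channels);
            writer.Write(clip.frequency);
            writer.Write(byteRate);
            writer.Write((short)(clip.channels * 2));          // block align
            writer.Write((short)16);                           // bits per sample

            writer.Write(Encoding.ASCII.GetBytes("data"));     // data chunk
            writer.Write(dataSize);
            foreach (float sample in samples)
            {
                writer.Write((short)(Mathf.Clamp(sample, -1f, 1f) * short.MaxValue));
            }

            return stream.ToArray();
        }
    }
}
```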
For real-time speech recognition applications:
```csharp
using System.Collections.Generic;
using UnityEngine;

public class RealtimeSpeechRecognition : MonoBehaviour
{
    private XrAiSpeechToTextHelper recorder;
    private IXrAiSpeechToText speechToText;
    private bool isListening = false;

    void Start()
    {
        recorder = GetComponent<XrAiSpeechToTextHelper>();
        speechToText = XrAiFactory.LoadSpeechToText("OpenAI", new Dictionary<string, string>
        {
            { "apiKey", "your-api-key" }
        });
    }

    public void StartListening()
    {
        if (!isListening)
        {
            isListening = true;
            recorder.StartRecording(Microphone.devices[0], OnSpeechRecorded, 3);
        }
    }

    private async void OnSpeechRecorded(byte[] audioData)
    {
        if (isListening)
        {
            var result = await speechToText.Execute(audioData);
            if (result.IsSuccess && !string.IsNullOrEmpty(result.Data))
            {
                ProcessCommand(result.Data);
            }

            // Continue listening
            recorder.StartRecording(Microphone.devices[0], OnSpeechRecorded, 3);
        }
    }
}
```
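Since the callback re-arms the recorder after every clip, you likely also want a way to stop the loop. An illustrative `StopListening` method for the same class simply clears the flag so `OnSpeechRecorded` stops restarting the recorder:

```csharp
public void StopListening()
{
    // Clearing the flag prevents OnSpeechRecorded from re-arming the recorder
    isListening = false;
    recorder.StopRecording();
}
```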
All transcription calls return a `Task<XrAiResult<string>>` for consistent error handling:

```csharp
if (!result.IsSuccess)
{
    Debug.LogError($"Speech transcription failed: {result.ErrorMessage}");

    // Handle specific error cases:
    // - Invalid API key
    // - Unsupported audio format
    // - Audio too short/long
    // - Network issues
    // - Service rate limits
}
```
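Transient failures such as network issues or rate limits can often be handled with a simple retry. A sketch, assuming `using System.Threading.Tasks;` and the `speechToText` field from the earlier examples; `XrAiResult` is not shown exposing an error code here, so this retries on any failure:

```csharp
// Illustrative retry wrapper with a fixed delay between attempts
private async Task<XrAiResult<string>> ExecuteWithRetry(
    byte[] audioData, Dictionary<string, string> options = null, int maxAttempts = 3)
{
    XrAiResult<string> result = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++)
    {
        result = await speechToText.Execute(audioData, options);
        if (result.IsSuccess)
        {
            return result;
        }

        Debug.LogWarning($"Attempt {attempt} failed: {result.ErrorMessage}");
        await Task.Delay(1000); // fixed backoff; tune to your provider's rate limits
    }
    return result; // last failed result for the caller to inspect
}
```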