Red Mage Creative
Tortillas in the Cloud: ChatGPT's Advanced Voice Mode

September 29, 2024

Source: Freepik, and countless other unattributed artists and creatives whose work was used in the training data.

ChatGPT has recently come out with their Advanced Voice Mode for Teams and Plus users. For those with the money to shell out for these plans, rejoice!

People have already been putting out tutorials, test-drives, and their thoughts on the new features online, which is the main inspiration for this post. User theplayfulporter on TikTok uploaded a test with a friend in which they asked ChatGPT to explain the cloud to them while talking as if it were a Mexican immigrant. At the time of writing this post, the video has 160k+ views and 20k+ likes.

[Embedded TikTok from @theplayfulporter: Mexican ChatGPT, “A tortilla factory in the cloud”]

On one hand, this video shows a major advancement in the voice capabilities of ChatGPT. Although there are nine (!) pre-programmed output voices, the model has the ability to adjust the "accent" to the user's desire. Not sure why someone would care about the accent being given, but that's just me.

On the other hand, despite the model's progress, the video shows a clear bias in the training data towards certain stereotypes, namely in the kinds of accents being produced. I think it is safe to say that not every Mexican immigrant sounds like this, or is even a man or has a male voice, for that matter. Would it do the same if I asked for an answer as a Jamaican (speaking Patois) or an African immigrant? Spoiler alert: Yup.

Today's short post in the AI Bias series highlights the pitfalls associated with ChatGPT handling different accents and languages, and how we can do better for the communities these situations affect.

Marked for External Review

It is worth noting that Advanced Voice Mode has been delayed in the EU, the UK, Switzerland, Iceland, Norway, and Liechtenstein. This could be due to the model's ability to detect emotion in the user's voice. Under the AI Act recently passed by the EU, the "use of AI systems to infer emotions of a natural person" is prohibited in workplaces and educational institutions, with narrow exceptions.

This obviously opens up a larger discussion about the regulation itself within the AI community, and I'm not here to weigh in on that portion (today, at least). I just thought it would be relevant to note this in the context of the product as a whole.

Why not us? Why not tortillas?

Last time I checked, Hispanic and Latino people are much more than just tortillas and tacos. There are a myriad of other ways to describe concepts like the cloud and AWS vs. Azure. I suppose tacos and tortillas are something most people have experienced, but why choose that of all things? Would it be received the same way if I asked ChatGPT to speak as a Chinese immigrant, and it spoke only in dim sum and dumpling examples? Or maybe an Indian immigrant ChatGPT might ask you to imagine Chicken Tikka Masala in the cloud.

The accent is also somewhat of an issue to me. You can be a Mexican immigrant and speak English with a much more "socially acceptable" accent. The accent ChatGPT supplies here suggests that potentially stereotypical training data is being fed into the model.

Also, why can users request a specific accent? I understand it may make users feel more comfortable, but it seems like a major pitfall in situations like this.

Can it be done better?

The Jamaican Patois example I mentioned above highlights a more well-received iteration of ChatGPT's advanced accent mode. From what I could tell from the comments, many users who identified as Jamaican were actually impressed by the accuracy of the results. One user wrote that they were genuinely surprised at how well ChatGPT both writes and speaks in Patois. I would assume this is one case where the training data was well built to accommodate the language.

The comments on the Mexican immigrant example are much less positive. Most of them amount to some sort of joke about the excessive use of tacos and tortillas in the metaphors. "This is like a South Park bit" is one of the notable examples of feedback from those watching, if you can call the comments on a TikTok feedback.

Overall, I think the key difference between the two examples is that Patois is a language, whereas in the Mexican immigrant example, the model is speaking English but with an accent. The conversation would be much different, in my eyes, if the model were speaking Spanish and asked to imitate a Mexican accent. There's still some issue, as there are something like 14 different dialects of Spanish in Mexico, but it's a step in the right direction.

Conclusion

Much like a tortilla factory in the clouds, the goal of enabling ChatGPT to communicate in various languages and accents is coming into view, but it is still more distant than we realize. There may be a need for guardrails to differentiate between what is acceptable and what is not in terms of "language translation." That's a nuanced conversation that requires input from diverse perspectives to pull off successfully.

Here's Cha Cha, who is no doubt also imagining a tortilla factory in the clouds.
