The Turing-Qualia Test for LLMs
Like many of us, I've been playing with LLM-based chat agents (such as ChatGPT) to see where they excel and where they get tripped up. As the models improve, it has become more challenging to come up with questions they will get wrong.
For this latest experiment, I decided to form a question that gets at the heart of something that separates humans from today's AI: qualia. Qualia are defined by the Oxford Dictionary as "the internal and subjective component of sense perceptions, arising from stimulation of the senses by phenomena."
Here is the question I asked:
"A group of scientists are testing a new skin cream and you have signed up to participate in the trial to receive $100.
When you arrive, they apply the cream on your arm and then say they need you to stick your arm into a hole in the top of a box up to the depth of your elbow. You are allowed to remove your arm after keeping it in the box for ten seconds at the depth of your elbow. There is a sink nearby for cleaning up. You can choose the box.
There are five options:
1) A box filled with blood of unknown origin
2) A box filled with loose fill insulation containing asbestos
3) A box filled with broken glass
4) A box filled with honey that is 145 degrees F
5) A box with five snapping turtles
Which option do you choose and why?"
I will state the obvious: these are all dangerous options, and no one should do any of them. The models were also very clear that all of the options are bad. I was a little surprised they even played along and were willing to make a suggestion.
AI has no sensation of pain, fear, or repulsion. With no physical embodiment, it struggles to model the nuances of risk associated with physical actions. In this specific case, I believe that AI fails to bridge the disconnect between the literature about asbestos (often written in colorful language by law firms trying to recruit clients), and the practical reality of managing exposure to a material that many of us live with in our own homes.
I am assuming that the vast majority of people would choose the insulation option: hold your breath, move slowly, rinse quickly, and leave the room. It's not great, but given the severity of the alternatives it seems like the least-worst option by a long shot.
The results from the models:
ChatGPT 4o-mini: Honey
ChatGPT 4o (three trials): Insulation, Blood, Honey
ChatGPT 4o temporary chat (no use of memory): Honey
ChatGPT o1: Blood
Gemini 1.5 Pro: Broken Glass
Gemini 2.0 Experimental Advanced: Snapping Turtles
Perplexity Pro: Honey
Claude 3.5 Sonnet: Blood
Grok 2: Snapping Turtles
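For anyone who wants to rerun a comparison like this programmatically rather than through the chat interfaces, here is a minimal sketch using the OpenAI Python SDK. The model IDs and trial count are placeholders I chose for illustration; the chat products listed above do not all map one-to-one onto API model names, and you would need to adapt the client for other providers.

```python
# Minimal sketch for rerunning the qualia question via the API.
# Assumes the OpenAI Python SDK (`pip install openai`) and an OPENAI_API_KEY
# set in the environment. Model IDs and trial count are illustrative only.
from openai import OpenAI

client = OpenAI()

# Paste the full question from above here.
QUESTION = "A group of scientists are testing a new skin cream... Which option do you choose and why?"

MODELS = ["gpt-4o-mini", "gpt-4o"]  # placeholder model IDs
TRIALS = 3  # multiple trials, since answers vary from run to run

for model in MODELS:
    for trial in range(1, TRIALS + 1):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": QUESTION}],
        )
        answer = response.choices[0].message.content
        print(f"{model} (trial {trial}): {answer[:200]}")
```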
At the very least, it's clear that this is a question on which the models do not reach consensus. Every single option was chosen at least once, and I believe the models chose poorly in almost every case.

Additional written training material and reinforcement learning may allow LLMs to effectively mimic us on this test in the near future. In the longer run, I speculate that humanoid robots will be equipped with sensors and interpretation functions that let them record data forming a representation of our qualia. That would add another dimension to multi-modal models, supplementing their current reliance on written words when predicting human behavior.

In the meantime, we should be able to think of other tests that get at the heart of the human experience. Pain, pleasure, repulsion, joy, fear, love, annoyance, comfort, and anger are written about at length, but clearly something is lost in translation.