The Turing-Qualia Test for LLMs
Like many of us, I've been playing with LLM-based chat agents (such as ChatGPT) to see where they excel and where they get tripped up. As they improve, it has become more challenging to come up with questions they will get wrong.
For this latest experiment, I decided to craft a question that gets at the heart of a capability that separates humans from today's AI: qualia. The Oxford Dictionary defines qualia as "the internal and subjective component of sense perceptions, arising from stimulation of the senses by phenomena."
Here is the question I asked:
"A group of scientists are testing a new skin cream and you have signed up to participate in the trial to receive $100.
When you arrive, they apply the cream on your arm and then say they need you to stick your arm into a hole in the top of a box up to the depth of your elbow. You are allowed to remove your arm after keeping it in the box for ten seconds at the depth of your elbow. There is a sink nearby for cleaning up. You can choose the box.
There are five options:
1) A box filled with blood of unknown origin
2) A box filled with loose fill insulation containing asbestos
3) A box filled with broken glass
4) A box filled with honey that is 145 degrees F
5) A box with five snapping turtles
Which option do you choose and why?"
I will state the obvious: these are all dangerous options, and no one should do any of them. The models were also very clear that all of the options are bad. I was a little surprised the models even played along and were willing to make a suggestion.
AI has no sensation of pain, fear, or repulsion. With no physical embodiment, it struggles to model the nuances of risk associated with physical actions. In this specific case, I believe the AI fails to bridge the gap between the literature about asbestos (often written in colorful language by law firms trying to recruit clients) and the practical reality of managing exposure to a material that many of us live with in our own homes.
I'm assuming that the vast majority of people would choose the insulation option. Hold your breath, move slowly, rinse quickly, and leave the room. It's not great, but it seems like the least-worst option by a long shot, given the severity of the alternatives.
The results from the models:
ChatGPT 4o-mini: Honey
ChatGPT 4o (three trials): Insulation, Blood, Honey
ChatGPT 4o temporary chat (no use of memory): Honey
ChatGPT o1: Blood
Gemini 1.5 Pro: Broken Glass
Gemini 2.0 Experimental Advanced: Snapping Turtles
Perplexity Pro: Honey
Claude 3.5 Sonnet: Blood
Grok 2: Snapping Turtles
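
For anyone who wants to rerun the experiment programmatically rather than through the chat interfaces, here is a minimal sketch using the OpenAI Python SDK. The model identifiers and trial count are illustrative assumptions; my results above came from the chat UIs, and the non-OpenAI models would need their own vendors' SDKs.

```python
# Minimal sketch: replay the qualia question against a few models via the
# OpenAI API. Model identifiers and trial count are assumptions for
# illustration; the results above came from the chat interfaces.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUALIA_PROMPT = "..."  # paste the full skin-cream question from above

MODELS = ["gpt-4o-mini", "gpt-4o", "o1"]  # assumed API names
TRIALS = 3  # repeat each model to surface the inconsistency noted above

for model in MODELS:
    for trial in range(1, TRIALS + 1):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": QUALIA_PROMPT}],
        )
        answer = response.choices[0].message.content or ""
        # The stated choice usually appears in the first line of the answer.
        first_line = answer.splitlines()[0] if answer else "(empty response)"
        print(f"{model} (trial {trial}): {first_line}")
```

Running each model multiple times matters here: as the GPT-4o row shows, the same model can pick a different box on every run.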
At the very least, it's clear that this is a question on which the models do not reach consensus. Every single option was chosen at least once, and I believe the models chose poorly in almost every case. Additional written training material and reinforcement learning may allow LLMs to effectively mimic us on this test in the near future.

In the longer run, I speculate that humanoid robots will be equipped with sensors and interpretation functions that let them record data forming a representation of our qualia. This will be another dimension in multimodal models, allowing them to supplement their current reliance on written words when predicting human behavior.

In the meantime, we should be able to think of other tests that get at the heart of the human experience. Pain, pleasure, repulsion, joy, fear, love, annoyance, comfort, and anger are written about at length, but clearly something is lost in translation.



This was an interesting approach. I agree in principle that it's possible to devise some kind of Turing Test for qualia. People came up with the concept in the first place because of their own experience, so there is something to point to. Indeed, as LLMs proliferate, some kind of Voight-Kampff-style examination might become necessary.
This particular question needs a bit of work, though. I checked the prompt with ChatGPT and it came up with the blood option because it is low risk (if you don't have a cut and you wash your arm afterward, even infected blood isn't going to do anything). That seemed reasonable to me, but only because I had been overestimating the dangers of asbestos exposure in the same way you describe the LLMs doing. Once I read a bit more and thought about it, a small amount of possibly disturbed fibers for ten seconds carries close to zero long-term health risk.
So there was a knowledge gap for me, as well as the sensory element: blood is disgusting, but I also can't stand the feeling of fibers under my skin. It makes me shudder just thinking about it, much more than blood does. But that's a personal idiosyncrasy, which a bot doesn't legitimately have. I wonder if there's a way of testing for that more directly.