#Looking for accuracy? Don’t depend on LLMs
I participated in a hackathon around 2020, sponsored by FastDoctor and Qiita. The theme was simple: build medical applications integrated with LLM-like AI. At that time, AI was just starting to enter our daily engineering conversations. Everyone was excited. Naturally, many participants attempted to build AI-powered medical checkup or diagnosis tools.
Almost all of them were rejected.
The reason was painfully clear, and it was repeatedly emphasized by the doctors during judging: “You cannot provide false positives or false negatives to patients.” That was the moment I clearly understood something fundamental: letting a probabilistic language model make decisions in sensitive domains is fundamentally wrong. LLMs are statistical machines trained on massive amounts of internet text. They are incredibly good at generating plausible answers, but plausibility is not correctness.
#Five Years Later: We Live With LLMs
Five years later, there is not a single day you don’t hear about LLMs and AI, and we have to accept that society has shifted to the point where we take these statistically-plausible-output machines for granted. We know LLMs sometimes make mistakes and carry biases, but we live with them. In my opinion, humans also make mistakes, even on critical jobs, so we shouldn’t get too frustrated when an LLM gets something wrong. Still, watching people waste a flight ticket because ChatGPT told them a visa was not necessary is beyond hilarious.
#Why Coding Agents Actually Work
So why are coding agents like Cursor or Claude so widely adopted among engineers? Because coding has correct answers. Much like chess or Go, once the goal state is defined, machines are extremely good at searching for valid solutions. Even more importantly, software development has built-in safety nets:
- Unit tests
- Integration tests
- Staging environments
- Production gates
If an AI agent introduces a bug, it usually won’t reach production. This makes coding an unusually good match for LLM-based agents. As the Hugging Face dataset page for SWE-bench describes:
> SWE-bench is a dataset that tests systems’ ability to solve GitHub issues automatically. The dataset collects 2,294 Issue-Pull Request pairs from 12 popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution.
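To make the “safety net” idea concrete, here is a minimal sketch of how a test suite acts as ground truth for an agent-written patch. The function and the test below are hypothetical illustrations, not taken from SWE-bench itself.

```python
# Hypothetical example of the gate an agent-written patch must pass.
# Run with `pytest`; a plausible-looking but wrong patch fails before merge.

def apply_discount(price: float, percent: float) -> float:
    """Return the price after applying a percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price * (1 - percent / 100)


def test_apply_discount():
    # These assertions encode the intended behavior. A patch that plausibly
    # but wrongly computes `price - percent` gets caught here, not in production.
    assert apply_discount(200.0, 25) == 150.0
    assert apply_discount(100.0, 0) == 100.0
```

The point is not this particular function; it is that the correctness criterion exists outside the model, so a wrong answer is cheap to detect.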
#Why Driving Is a Different Story
Driving is often cited as a domain where mistakes are unacceptable. Yet AI is rapidly replacing human drivers. The key difference is this: autonomous driving is not powered by LLMs alone. Full Self-Driving (FSD) in autonomous vehicles is a combination of multiple systems working together, and calling it “just AI” misses the point entirely. The stack relies on:
- LiDAR and/or camera-based spatial recognition
- Physics-based models
- Massive amounts of labeled training data
- Continuous validation and simulation
So the reason FSD can be considered safer isn’t because it never makes mistakes—it’s because it has a lot more built-in infrastructure to detect and correct those mistakes before they cause harm. It’s a system designed with multiple layers of real-world validation, whereas GPT is just a single layer of statistical reasoning without those extra safety nets.
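As a rough sketch of that layered design (not any vendor’s actual stack), the statistical layer only proposes, and independent layers can veto before an action is taken. Every name below is a hypothetical placeholder for a real subsystem.

```python
# Sketch of "defense in depth": a model proposes, independent layers veto.
from dataclasses import dataclass

@dataclass
class Proposal:
    action: str        # e.g. "change_lane_left"
    confidence: float  # the model's own confidence, 0..1

def physics_ok(action: str) -> bool:
    # Placeholder for a kinematics/feasibility check.
    return action in {"keep_lane", "brake", "change_lane_left"}

def perception_ok(action: str) -> bool:
    # Placeholder for a LiDAR/camera occupancy check.
    lane_is_clear = False  # would come from sensors in a real system
    return action != "change_lane_left" or lane_is_clear

def decide(p: Proposal) -> str:
    # The proposal only becomes an action if every independent layer agrees.
    if p.confidence < 0.9 or not physics_ok(p.action) or not perception_ok(p.action):
        return "fallback: keep lane and slow down"
    return p.action

print(decide(Proposal("change_lane_left", 0.95)))  # vetoed by the perception check
```

Notice that the model being confident is not enough; a single dissenting layer is enough to fall back to a safe default.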
#Hypothetical AI
If we add layers of validation, like real-time monitoring and multiple checks similar to what autonomous driving systems have, could we eventually trust a medical AI to a similar degree?
In theory, yes. If we had a medical AI system that incorporated not just a language model, but also real-time imaging, lab results, second opinions from other AI models, and continuous feedback from human professionals, then yes, we could theoretically get to a level where it’s much safer and more reliable. It’s a bit like how autonomous driving isn’t just one AI making all the decisions in isolation—it’s a combination of many validated inputs.
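A toy sketch of the “second opinions plus human feedback” part might look like the following. Everything here (the model names, the threshold, the triage function) is an assumption for illustration, not a real clinical system.

```python
# Hypothetical triage policy: independent models must agree, and anything
# uncertain or disputed is routed to a human clinician instead of auto-answered.

def triage(opinions: dict[str, tuple[str, float]]) -> str:
    """opinions maps a model name to (diagnosis, confidence)."""
    diagnoses = {d for d, _ in opinions.values()}
    min_conf = min(c for _, c in opinions.values())
    if len(diagnoses) > 1 or min_conf < 0.95:
        return "escalate to human clinician"
    return f"provisional: {diagnoses.pop()} (pending human sign-off)"

# The language model, the imaging model, and the lab-result model disagree,
# so the case is escalated rather than decided by any single model.
print(triage({
    "llm_summary":   ("benign", 0.97),
    "imaging_model": ("needs biopsy", 0.88),
    "lab_model":     ("benign", 0.93),
}))
```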
But we’re not quite there yet. The complexity of the human body and the consequences of a medical error make it a lot harder to implement that kind of layered safety net right now. So it’s definitely possible in the future, but it will take a lot more infrastructure and validation before we can confidently say a medical AI is as safe as a human doctor.
#Thoughts
At this point, I believe we should avoid using AI as the final decision-maker in domains where human lives are involved (medical), where mistakes are unacceptable, and where ground-truth data itself is difficult to define (legal). In such domains, LLMs are far more suitable as an intermediate layer or as a copilot—to generate hypotheses, summarize information, surface edge cases, or assist human judgment—rather than to produce definitive answers.
Accuracy is not a property of a single model; it is a property of a system.
By layering validation mechanisms—rules, algorithms, cross-checks, uncertainty handling, human review, and fail-safe escalation—we can significantly improve reliability. This is not a theoretical idea; it is exactly how safety-critical systems like autonomous driving are being built today.
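One way to see why layering helps: under the (strong) assumption that each layer catches errors independently, the residual error shrinks multiplicatively. The numbers below are made up purely to show the shape of the argument.

```python
# Illustrative arithmetic only: assumes independent layers, which real systems
# only approximate. Each layer catches 90% of the errors that reach it.
model_error_rate = 0.05      # 5% of raw model outputs are wrong (made-up number)
catch_rate_per_layer = 0.90  # rules, cross-checks, human review, ... (made-up)

residual = model_error_rate
for layer in ["rule checks", "cross-model check", "human review"]:
    residual *= (1 - catch_rate_per_layer)
    print(f"after {layer}: {residual:.6f}")
# 0.05 -> 0.005 -> 0.0005 -> 0.00005: three imperfect layers beat one "better" model.
```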
In the long run, it may become possible to construct medical AI systems that combine language models with real-time imaging, laboratory data, multimodal sensors, second opinions from independent models, and continuous human oversight. Such systems could eventually reach a level of safety that is acceptable for real-world use. However, we are not there yet. Until validation becomes cheap, explainable, and accountable, LLMs should remain assistive tools, not authorities—especially in medicine and law.