Nov. 9 — Researchers tested Google’s latest video-generation AI model, Veo-3, on real surgical footage and found that, while the model can produce highly realistic visuals, it lacks any substantive understanding of medical procedures.
In the study, the researchers provided the model with a single surgical image as input and asked Veo-3 to predict the next eight seconds of the procedure. To evaluate performance systematically, the international research team built a dedicated benchmark called SurgVeo, covering 50 real laparoscopic and neurosurgical video clips. Four experienced surgeons independently rated each AI-generated video on four dimensions, each scored out of 5: visual realism, plausibility of instrument use, realism of tissue response, and overall surgical logic.
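To picture the evaluation concretely, the sketch below shows one plausible way such panel ratings could be aggregated into per-dimension averages. The data structure, field names, and function are illustrative assumptions for this article, not code or a schema from the SurgVeo study.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# Hypothetical record of one surgeon's score for one generated clip at one
# prediction horizon; the field names are illustrative, not SurgVeo's schema.
@dataclass
class Rating:
    clip_id: str
    rater: str
    horizon_s: int    # prediction horizon in seconds (1 to 8)
    dimension: str    # "visual", "instrument", "tissue", or "logic"
    score: int        # 1-5 rating

def mean_scores(ratings: list[Rating]) -> dict[tuple[str, int], float]:
    """Average the surgeons' scores for each (dimension, horizon) pair."""
    buckets: dict[tuple[str, int], list[int]] = defaultdict(list)
    for r in ratings:
        buckets[(r.dimension, r.horizon_s)].append(r.score)
    return {key: mean(values) for key, values in buckets.items()}

# With the laparoscopic ratings loaded, a figure like the reported 1-second
# visual-plausibility average of 3.72 would be read off as, e.g.,
# mean_scores(ratings)[("visual", 1)].
```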

Veo-3’s generated videos were highly convincing at first glance—some surgeons even described the image quality as “shockingly clear.” Closer analysis, however, showed that the content’s logic quickly fell apart. In the laparoscopic tests, the visual-plausibility score at the 1-second mark was still 3.72 out of 5, but the scores tied to medical accuracy dropped sharply: instrument use scored only 1.78, tissue response 1.64, and the most critical dimension, surgical logic, was lowest at 1.61. In short, the model can render convincingly realistic imagery, but it cannot reproduce the procedural workflows and causal relationships of a real operating room.
In the neurosurgical scenarios, which demand even greater precision, Veo-3 performed worse still. From the first second it struggled to capture the exacting maneuvers neurosurgery requires: the instrument-use score fell to 2.77 (versus 3.36 for the laparoscopic cases), and the surgical-logic score dropped as low as 1.13 by the 8-second mark.
The team further categorized the error types and found that over 93% of errors stemmed from failures of medical logic—for example, inventing non-existent surgical instruments, fabricating tissue responses that violate physiology, or performing clinically meaningless actions—while only a small fraction related to image quality (6.2% in the laparoscopic cases, 2.8% in the neurosurgical cases).
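Purely as an illustration of how an error taxonomy of this kind turns into the percentages above, a tally might look like the following; the category labels are invented for the sketch and are not the study’s actual annotation scheme.

```python
from collections import Counter

# Hypothetical per-error annotations; label names are invented for
# illustration and do not reflect SurgVeo's real annotation scheme.
MEDICAL_LOGIC_LABELS = {
    "nonexistent_instrument",
    "implausible_tissue_response",
    "clinically_meaningless_action",
}
VISUAL_QUALITY_LABELS = {"blur", "artifact", "temporal_flicker"}

def error_breakdown(labels: list[str]) -> dict[str, float]:
    """Percentage of errors attributable to medical logic vs. image quality."""
    counts = Counter(
        "medical_logic" if label in MEDICAL_LOGIC_LABELS else "visual_quality"
        for label in labels
    )
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}
```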
The researchers also tried giving the model additional contextual cues (such as the type of surgery and the specific procedural stage), but this did not yield significant or consistent improvements. The team concluded that the core problem is not missing information but the model’s fundamental lack of medical knowledge and reasoning ability.

The SurgVeo study makes clear that current video-generation AIs remain far from genuine medical understanding. Although such systems might one day assist with physician training, preoperative planning, or intraoperative guidance, current models are nowhere near safe or reliable enough for those applications—they can generate images that “look” real but lack the knowledge foundation needed to support correct clinical decisions.
The research team plans to open-source the SurgVeo benchmark dataset on GitHub to encourage the field to improve models’ medical understanding.
The study also warns of serious risks in using AI-generated video for medical training. Unlike NVIDIA’s use of AI-generated video to train general-purpose robots, in medicine such “hallucinations” can have grave consequences: if systems like Veo-3 generate videos that look plausible but violate medical standards, they could mislead surgical robots or medical trainees into learning incorrect techniques.
The results also indicate that viewing current video models as “world models” is premature. Present systems can imitate surface motion and shape changes but cannot reliably grasp anatomy, biomechanics, or the causal logic of surgical procedures. Their outputs may be superficially convincing yet fail to capture the true physiological mechanisms and operative reasoning behind a surgery.