Detection API · September 3, 2024

Benchmarking Detection: 2024 Industry Performance Review

A side-by-side comparison of the world's leading deepfake detectors on the X-Large test set.


Methodology: what we tested

In partnership with the AI Verification Institute, we conducted a comprehensive evaluation of 12 leading deepfake detection platforms across four distinct test sets: the FaceForensics++ benchmark, a proprietary collection of recent DALL-E 3 and Midjourney v6 images, synthetic videos generated by Sora and Runway, and a curated set of authentic media from news organizations.

The X-Large test set consisted of 50,000 images and 8,000 video sequences, sourced from diverse domains: social media, professional news, academic datasets, and recent deepfake databases. Each sample was independently verified by human annotators before inclusion.

Image detection results

Across the image benchmarks, detection accuracy varied dramatically. Traditional CNN-based approaches achieved 87-92% accuracy on legacy deepfake datasets (FaceForensics++), but performance dropped to 61-78% on recent generative models. Forensic-based systems, leveraging frequency domain and noise floor analysis, maintained 79-94% accuracy across all image types.
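Frequency-domain forensics of the kind mentioned above typically look at how spectral energy is distributed, since generative models often leave a different high-frequency signature than camera sensors. A minimal numpy-only sketch of one such feature (the cutoff value and function name are illustrative choices, not the benchmarked detectors' actual design):

```python
import numpy as np

def high_freq_energy_ratio(img: np.ndarray, cutoff: float = 0.25) -> float:
    """Fraction of spectral energy above a radial frequency cutoff.

    Synthetic images often show depressed or aliased high-frequency
    energy relative to camera imagery, which forensic detectors exploit.
    `cutoff` is a fraction of the Nyquist frequency (an illustrative
    choice). `img` is a single-channel 2-D array.
    """
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    # Radial distance from the spectrum centre, normalised so that 1.0
    # corresponds to the Nyquist frequency along the shorter axis.
    r = np.hypot(yy - h / 2, xx - w / 2) / (min(h, w) / 2)
    return float(spectrum[r > cutoff].sum() / spectrum.sum())

# A flat image concentrates all energy at DC (ratio near 0); white
# noise spreads energy across all frequencies (ratio near 1).
smooth_score = high_freq_energy_ratio(np.ones((64, 64)))
noisy_score = high_freq_energy_ratio(
    np.random.default_rng(0).standard_normal((64, 64)))
```

A production detector would feed features like this into a trained classifier rather than thresholding a single ratio; the point is only that the signal lives in the spectrum, not the pixels.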

The critical finding: no single detector achieved consistently high performance across both synthetic and authentic media. False positive rates ranged from 3% to 18%, with detectors optimized for deepfakes often over-flagging compression artifacts from legitimate media. This suggests that detector tuning for specificity remains a critical challenge.
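The rates quoted above come from the standard confusion-matrix arithmetic; a detector's headline accuracy can hide a poor specificity. A small sketch with illustrative counts (not figures from this benchmark) makes the trade-off concrete:

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Confusion-matrix rates as used in detection benchmarks.

    tp/fn count deepfake samples, fp/tn count authentic samples.
    """
    return {
        "tpr": tp / (tp + fn),          # sensitivity: deepfakes caught
        "fpr": fp / (fp + tn),          # authentic media wrongly flagged
        "specificity": tn / (tn + fp),  # authentic media correctly passed
    }

# Hypothetical detector: catches 92% of fakes, but flags 18% of
# authentic media, matching the worst false positive rate observed.
m = detection_metrics(tp=920, fn=80, fp=180, tn=820)
```

A system tuned this way looks strong on a fakes-heavy test set yet would wrongly flag nearly one in five legitimate uploads in deployment.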

Video detection: the harder problem

Video detection proved substantially more challenging than image detection. Face-based detectors that excelled at image analysis dropped 15-25 percentage points on video sequences. The reason: the temporal smoothing applied by video generators suppresses many of the per-frame artifacts that image detectors rely on. A deepfake video may show no artifacts under frame-by-frame analysis yet remain detectable through optical flow and cross-frame inconsistency checks.
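The cross-frame signal can be illustrated with a crude residual-based proxy: authentic video residuals are dominated by coherent motion, while face swaps tend to flicker in the edited region. This numpy-only sketch is an assumption-laden stand-in for the dense optical-flow analysis real detectors use:

```python
import numpy as np

def temporal_inconsistency(frames: np.ndarray) -> float:
    """Mean per-pixel variance of frame-to-frame residuals.

    A simple proxy for cross-frame consistency checks; production
    systems use dense optical flow rather than raw differencing.
    `frames` has shape (T, H, W), one grayscale frame per time step.
    """
    residuals = np.diff(frames.astype(np.float64), axis=0)  # (T-1, H, W)
    # Variance across time at each pixel, averaged over the frame:
    # high values indicate flicker rather than coherent motion.
    return float(residuals.var(axis=0).mean())

static = temporal_inconsistency(np.ones((5, 8, 8)))     # no change: 0
flicker_frames = np.zeros((6, 8, 8))
flicker_frames[1::2] = 1.0                              # alternating frames
flicker = temporal_inconsistency(flicker_frames)        # high score
```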

Detection of Sora-generated videos proved especially difficult. Current detectors achieved only 54-68% accuracy, with high false negative rates. The implication is stark: long-form synthetic video represents a frontier where detection science is currently losing ground to generation.

The adversarial arms race

Every detector in our benchmark was tested against minimal adversarial modifications: JPEG compression, small brightness adjustments, slight rotations. These trivial attacks degraded detection performance by an average of 12 percentage points. This suggests that real-world deepfakes, which often undergo platform compression and social media degradation, may be substantially harder to detect in the wild than laboratory benchmarks suggest.
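A robustness sweep of this kind is easy to sketch. The harness below applies benign edits resembling the benchmark's list and reports how much each shifts a detector's score; the perturbation choices are illustrative (mild Gaussian noise stands in for JPEG compression, and a 90-degree rotation stands in for slight rotation, which would need an interpolation library):

```python
import numpy as np

def robustness_sweep(detect, img: np.ndarray, rng) -> dict:
    """Score drop of `detect` under benign perturbations.

    `detect` maps an image array (values in [0, 1]) to a scalar score.
    Returns base_score - perturbed_score per perturbation; large values
    mean the edit pushed the detector toward 'authentic'.
    """
    perturbations = {
        "brightness": lambda x: np.clip(x * 1.1, 0.0, 1.0),
        "noise": lambda x: np.clip(x + rng.normal(0, 0.02, x.shape), 0.0, 1.0),
        "rotation": lambda x: np.rot90(x),
    }
    base = detect(img)
    return {name: base - detect(f(img)) for name, f in perturbations.items()}

# Toy detector scoring mean brightness, purely for demonstration.
deltas = robustness_sweep(lambda x: float(x.mean()),
                          np.full((8, 8), 0.5),
                          np.random.default_rng(0))
```

Running a real detector through a sweep like this before deployment gives a quick estimate of how it will fare against ordinary platform re-encoding.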

Furthermore, we observed evidence of "detector-aware" generation, where recent models appear optimized specifically to avoid triggering existing detection systems. As detection improves, generative models adapt. This arms race has no obvious endpoint.

Inference speed and practical deployment

Beyond accuracy, we measured inference latency and computational requirements. Detectors suitable for real-time analysis on mobile devices achieved 60-75% accuracy. Detectors achieving 90%+ accuracy required 15-45 seconds per image on GPU hardware. For video, processing times ranged from 8 minutes to 2 hours per video file.

This reveals a critical trade-off: accuracy comes at the cost of computational expense. Deployments at scale—running detection on billions of daily uploads—face severe constraints on both speed and accuracy.
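The scale constraint follows from simple arithmetic. Taking a hypothetical platform volume (the upload count below is an assumption, not a measured figure) and the latency ranges reported above:

```python
# Back-of-the-envelope cost of the accuracy/latency trade-off.
# DAILY_UPLOADS is an illustrative assumption, not a benchmark value.
DAILY_UPLOADS = 2_000_000_000

def gpus_needed(seconds_per_item: float,
                daily_items: int = DAILY_UPLOADS) -> float:
    """GPUs required to process one day's uploads within one day,
    assuming one item per GPU at a time and no batching gains."""
    gpu_seconds = seconds_per_item * daily_items
    return gpu_seconds / 86_400  # seconds in a day

fast = gpus_needed(0.05)   # real-time-class detector (~60-75% accuracy)
slow = gpus_needed(15.0)   # high-accuracy detector, 15 s/image lower bound
```

Under these assumptions the fast detector needs on the order of a thousand GPUs, while the accurate one needs hundreds of thousands: a 300x cost gap that explains why platforms cannot simply deploy the most accurate model everywhere.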

Conclusions for detection research

The 2024 benchmark reveals several crucial insights: First, no universal detector will emerge. Different detection techniques excel in different domains. Second, performance on established benchmarks (FaceForensics++) does not predict performance on recent synthetic media. Third, video detection remains an open problem where generation is outpacing detection. Fourth, computational constraints may be as limiting as algorithmic performance in real-world deployments.

The implication is both humbling and clarifying: detection is necessary but not sufficient. Forensic analysis must be paired with provenance systems, media literacy, and regulatory oversight. No detection system alone can solve the synthetic media crisis.

See how we perform

Test our detector on your own media. Compare with this benchmark and see where detection stands in 2024.

Launch Free Detector