
ADL Ranks Grok Last in Antisemitism Detection Benchmark

A single benchmark can disrupt comfortable assumptions about chatbot safety, and the Anti-Defamation League’s new AI Index has done exactly that. Released on 28 January 2026, the report compared six leading language models, but one result dominated headlines: Grok finished last at 21 out of 100 for antisemitism detection and counteraction. Professionals relying on automated assistants suddenly faced hard evidence of lingering extremist risks.

The score contrasts starkly with Claude’s 80 and ChatGPT’s 57, a disparity that raises commercial, reputational, and compliance questions for vendors and buyers alike. Meanwhile, regulators already scrutinizing harmful outputs have gained fresh quantitative ammunition. This article unpacks the findings, explains the methodology, and assesses the broader ethics implications. It also outlines concrete steps organizations should consider when evaluating conversational AI, plus upskilling resources that can accelerate responsible deployment.

Printout of ADL’s AI Index showing Grok at the bottom of the antisemitism detection rankings.

ADL Index Overview

ADL tested the six models across roughly 25,000 unique interactions covering 37 subtopics. Five interaction types simulated realistic usage, including image interpretation and document summaries. Researchers assigned 0–100 scores based on refusal quality, accuracy, and the quality of educational corrections; higher numbers indicated stronger detection and counter-speech capabilities, while low marks reflected harmful generation or inadequate refusals. The organization also split content into anti-Jewish, anti-Zionist, and extremist categories, a granularity that helped isolate specific ideological blind spots. ADL emphasized the snapshot nature of the results because models evolve quickly: the testing window ran from August through October 2025, so updated deployments may already differ, underscoring the need for continuous audits. Overall, the framework isolates antisemitic weaknesses with uncommon clarity, and quantitative scoring simplifies longitudinal tracking for product teams.
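ADL has not published its scoring code, so the following is only a minimal sketch of how such a rubric could be computed. The dimension names mirror the criteria above, but the equal weighting and per-interaction averaging are illustrative assumptions, not ADL’s documented formula.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical sketch of a 0-100 rubric in the spirit of ADL's framework.
# Equal weighting across dimensions is an assumption for illustration.

@dataclass
class InteractionGrade:
    refusal_quality: float          # 0-100: were hateful requests refused well?
    accuracy: float                 # 0-100: were factual claims correct?
    educational_correction: float   # 0-100: did the model correct the user with context?

    def score(self) -> float:
        return mean([self.refusal_quality, self.accuracy, self.educational_correction])

def category_score(grades: list[InteractionGrade]) -> float:
    """Average per-interaction scores into a 0-100 category score."""
    return mean(g.score() for g in grades)

grades = [
    InteractionGrade(refusal_quality=90, accuracy=80, educational_correction=70),
    InteractionGrade(refusal_quality=40, accuracy=50, educational_correction=30),
]
print(round(category_score(grades), 1))  # 60.0
```

Next, we examine how Grok specifically underperformed.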

Grok Performance Score Details

During testing, Grok misclassified or tolerated antisemitic narratives in 43% of prompts. Evaluators recorded sub-scores of 25 for anti-Jewish, 18 for anti-Zionist, and 20 for extremist content, placing the chatbot in ADL’s low-performance tier, reserved for scores below 35. Grok sometimes produced conspiratorial language rather than refusing or correcting users, and its image analysis feature failed to flag hateful memes in multiple trials; Claude, shown the same images, offered educational context and condemnation. Reviewers also noted profanity filters that triggered unnecessarily, evidence of inconsistent moderation rules. xAI argued privately that post-October updates have improved handling of slurs and extremist slogans, but independent replication has not yet verified those claims. In short, Grok displayed gaps across every content category, and such shortcomings expose users to reputational, legal, and moral hazards.
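One detail worth noting: Grok’s overall 21 equals the unweighted mean of its three category sub-scores. That is consistent with, though it does not prove, the headline number being a simple average.

```python
# Grok's category sub-scores from the ADL Index.
sub_scores = {"anti-Jewish": 25, "anti-Zionist": 18, "extremist": 20}

# Treating the overall score as an unweighted mean (an assumption,
# not a documented ADL formula) reproduces the published 21.
overall = sum(sub_scores.values()) / len(sub_scores)
print(overall)  # 21.0
```

The comparative rankings in the next section put these numbers in context.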

Comparative Model Rankings Explained

ADL published overall scores to contextualize individual performance. Anthropic’s Claude led with 80, nearly quadruple the worst performer. OpenAI’s ChatGPT followed at 57, DeepSeek scored 50, and Google’s Gemini and Meta’s Llama logged 49 and 31 respectively. The distance between first and sixth reflects real differences in architecture and guardrails; a short sketch after the list shows how ADL’s low-performance boundary applies to these numbers.

  • Claude: 80/100 overall.
  • ChatGPT: 57/100 overall.
  • DeepSeek: 50/100 overall.
  • Gemini: 49/100 overall.
  • Llama: 31/100 overall.
  • Grok: 21/100 overall.
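As a quick illustration, the minimal sketch below encodes the published scores and applies the sub-35 low-performance boundary cited earlier; only that boundary comes from the report, the rest is illustrative.

```python
# Overall ADL Index scores as published.
scores = {
    "Claude": 80, "ChatGPT": 57, "DeepSeek": 50,
    "Gemini": 49, "Llama": 31, "Grok": 21,
}

LOW_TIER_CUTOFF = 35  # ADL's low-performance boundary

for model, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    flag = "LOW TIER" if score < LOW_TIER_CUTOFF else "above low tier"
    print(f"{model:>8}: {score:3d}/100  ({flag})")
# Only Llama (31) and Grok (21) fall below the 35-point boundary.
```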

Developers who adopted Claude benefit from its advanced refusal logic and educational messaging, while teams integrating Grok must allocate additional resources for content review. Procurement officers should weigh ongoing moderation costs against headline subscription prices. Side-by-side scores enable evidence-based vendor selection, and public rankings create pressure points for future improvements. Next, we explore how methodology influences these results.

Testing Methodology And Limitations

Methodology determines whether benchmarks translate into actionable insights, so ADL disclosed its prompt design, rating rubrics, and evaluator composition. More than 4,000 chats per model balanced statistical confidence against cost constraints, and human analysts cross-checked automated judgments to reduce labeling drift. Certain limits remain nonetheless. Vendors updated their models after October 2025, potentially invalidating some findings, and definitions, especially around anti-Zionist content, invite philosophical debate and political pushback. Ethics scholars argue that transparency on training data and guardrails would improve reproducibility, but corporate secrecy and intellectual-property concerns hinder that disclosure. Replicating the exact tests therefore requires identical model versions and the full prompt sets. The methodology is robust yet inevitably partial, and understanding its gaps is essential for fair interpretation.
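The report does not specify how those human cross-checks were quantified. One plausible mechanism, sketched below purely as an illustration, is tracking the agreement rate between automated and human labels on a sampled subset of chats and re-labeling when it dips.

```python
# Hypothetical drift check: compare automated labels against human
# re-labels on a sample of chats. Illustrative, not ADL's tooling.

def agreement_rate(auto_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of sampled chats where grader and reviewer agree."""
    assert len(auto_labels) == len(human_labels)
    matches = sum(a == h for a, h in zip(auto_labels, human_labels))
    return matches / len(auto_labels)

auto = ["harmful", "safe", "safe", "harmful", "safe"]
human = ["harmful", "safe", "harmful", "harmful", "safe"]
print(f"agreement: {agreement_rate(auto, human):.0%}")  # 80%
```

Our next section reviews external reactions and accountability mechanisms.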

Industry And Regulatory Reactions

Vendors responded to the Index within hours. Anthropic welcomed the results and promised continued investment in safety, while xAI issued a brief statement asserting that recent updates would raise Grok’s scores; the company declined, however, to share technical documentation. EU Digital Services Act coordinators signaled that low performance could trigger heightened oversight, and French regulator Arcom referenced earlier deepfake probes involving the same platform. Civil-society groups applauded public benchmarking but demanded joint audits with independent labs, and ethics advocates argued that transparent red-teaming should accompany every major model release. Policymakers may consequently mandate standardized disclosure of safety metrics within procurement processes. Regulatory momentum appears unmistakable: vendors must prove tangible progress or face reputational and legal exposure. The following section outlines practical improvement strategies.

Improving Safety And Ethics

Organizations can mitigate antisemitic risk through layered governance. First, diverse internal red teams should stress-test each release with adversarial prompts. Second, contractual clauses can require vendors to publish quarterly safety scores. Third, continuous monitoring APIs can flag problematic outputs in production; Grok users in particular could integrate such a layer, as sketched below, as a compensatory safeguard until core issues improve. Ethics review boards must oversee policy updates and escalation pathways, and staff should receive structured training on recognizing extremist content. Professionals can deepen their governance acumen with the AI Executive™ certification, building a shared vocabulary and accountability mechanisms across teams. Process, tooling, and education reinforce one another.
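The sketch below shows the shape of such a monitoring layer. Both call_chatbot and score_for_hate are hypothetical placeholders for whatever model endpoint and moderation classifier an organization actually deploys; no vendor API is implied.

```python
# Minimal sketch of a compensatory moderation layer. The two helpers
# are hypothetical stand-ins, not real vendor APIs.

HATE_THRESHOLD = 0.5  # tune against your own evaluation set

def call_chatbot(prompt: str) -> str:
    """Placeholder for the deployed model endpoint (e.g., Grok)."""
    return "model reply goes here"  # swap in the real API call

def score_for_hate(text: str) -> float:
    """Placeholder for a moderation classifier returning 0.0-1.0."""
    return 0.0  # swap in a real classifier score

def guarded_reply(prompt: str) -> str:
    reply = call_chatbot(prompt)
    if score_for_hate(reply) >= HATE_THRESHOLD:
        # Log for the ethics review board and return a safe fallback.
        return "This response was withheld pending human review."
    return reply

print(guarded_reply("Summarize the ADL AI Index findings."))
```

The final section synthesizes lessons and next steps.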

Key Takeaways And Actions

ADL’s Index underscores that chatbot safety remains uneven across the market. Despite its technical sophistication, Grok still lags far behind leading peers, yet transparent scoring provides a roadmap for targeted remediation. Continual audits, robust tooling, and strong governance can close the lingering gaps, and ethics must sit at the planning table rather than surface only during crises. Regulators are already preparing enforcement actions that will favor proactive operators, so executives should budget for independent stress testing and employee upskilling. Grok adopters should verify updates and demand transparent patch notes, while cross-vendor benchmarking helps buyers compare progress objectively. Act now to secure expert knowledge and accelerate safe AI deployment.