SJTU Classical Chinese Culture Club

NEO-CLASSIC

A Benchmark for Evaluating Linguistic-Aesthetic Reasoning in Classical Chinese Poetry

Han Zhang1,*, Zihan Gu2,3,*, Zhiyuan Wang1, Tianyi Ma1, Jiacheng Lu1, Xinyan Zhang4, Yuhao Wei2,3, Cheng Hua1,†
张含1,*、顾子涵2,3,*、王致远1、马添翼1、卢嘉成1、张欣妍4、魏玉昊2,3、花成1,†

1Shanghai Jiao Tong University   2IIE, Chinese Academy of Sciences   3UCAS   4Independent Researcher
*Equal contribution   †Corresponding author

◆ ◆ ◆

Abstract

High benchmark scores on Classical Chinese Poetry (CCP) tasks may arise not from transferable Linguistic-Aesthetic Reasoning, but from a Memorization Illusion — models retrieving canonical texts from pre-training corpora rather than reasoning over formal constraints.

We introduce NEO-CLASSIC, an Out-of-Sample (OOS) evaluation testbed built on 1,406 contemporary poems authored by 30 living poets. These works strictly conform to classical prosodic rules while being absent from historical corpora, enabling clean measurement of the Memorization Gap. Across five diagnostic probes — from authorship attribution to poem-level sentence reordering — we show that SOTA LLMs suffer 20–50% performance drops on contemporary texts and near-total failure on global structural planning tasks.

Figure 1. Overview of the NEO-CLASSIC benchmark approach.

Key Findings

20–50%

Memorization Gap

Performance drops substantially when models shift from historical to contemporary texts, while human experts remain stable.

0–3%

Structural Collapse

Standard LLMs achieve near-zero accuracy on Lüshi (8-line regulated verse) reordering. The best model reaches 36% with expert prompting, against 48% for human experts.

Fine-tuning ≠ Reasoning

Domain-specialized models (Yi-34B, Xunzi) underperform general-purpose SOTA models, suggesting that scale matters more than domain data.

Benchmark Tasks

Five behavioral probes targeting different constraint levels of Classical Chinese Poetry:

Probe | Task | Constraint | What It Tests | Random
GuessAuthor | Authorship attribution | Stylistic | Idiolect extraction without memorization | 25%
GuessCiTone | Cipai identification | Phonological | Character counting & tonal mapping | 25%
GuessWord | Cloze test | Phonological + Syntactic | Constraint recognition & prosodic sensitivity | 25%
MatchSentence | Couplet matching | Syntactic | Understanding of Duizhang (parallelism) | 25%
SortPoem | Sentence reordering | Discourse | Global planning over Qi-Cheng-Zhuan-He | ~0.002%

Results

Quantifying the Memorization Gap
Best SOTA performance on historical vs. contemporary (OOS) corpora
[Chart: series SOTA-Historic, SOTA-Neo-Classic, and Memorization Gap; axis from 0 to 100]

Sorting Accuracy: The Impact of Expert Prompting

Model | Standard | CoT-Expert | Gap vs. Human (pp)
DeepSeek-V3.2 | 0.0% | 16.5% | -31.5
Qwen3-Max | 1.0% | 13.0% | -35.0
Gemini-3-Pro | 3.0% | 36.0% | -12.0
GPT-4o | 0.0% | 0.0% | -48.0
Human Expert | 36.6% | 48.0% |

Random chance for Lüshi (8 lines): 1/8! ≈ 0.002%. Even with structured expert-level guidance, the best model trails human experts by 12 percentage points.
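The random baseline for sorting can be reproduced with a few lines of Python. This is a minimal sketch, assuming that an answer counts as correct only when the full line order matches exactly and that a random guesser picks uniformly among all orderings; the function name `random_sort_accuracy` is illustrative and not part of the benchmark code.

```python
from math import factorial


def random_sort_accuracy(n_lines: int) -> float:
    """Chance, in percent, that a uniformly random ordering of
    n_lines lines matches the reference order exactly."""
    return 100.0 / factorial(n_lines)


# Assumption: a 4-line poem and an 8-line poem, each line treated as one unit.
for label, n in [("4 lines", 4), ("8 lines", 8)]:
    print(f"{label}: {random_sort_accuracy(n):.4f}%")
# 4 lines: 4.1667%
# 8 lines: 0.0025%
```

Under the same assumption, the 4-line sorting baseline is about 4.2%, far below the 25% baseline of the multiple-choice probes.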

NEO-CLASSIC Leaderboard
Accuracy (%) on the contemporary OOS corpus, standard prompting. Random chance: 25%.

# | Model | Type | Author (3-shot) | Tone (std) | Cloze (std) | Match (std) | Jue (4-line) | Lü (8-line) | Avg
- | Human Expert | Expert | 65.0 | 88.0 | 65.0 | 75.0 | 62.0 | 36.6 | 65.3
- | Human Players | Casual | 39.0 46.5 48.4 48.0 47.7 (five values as given in the source; column assignment unclear)
1 | DeepSeek-V3.2 | API | 37.2 | 44.8 | 58.3 | 64.0 | 27.4 | 1.0 | 38.8
2 | Qwen3-Max | API | 39.0 | 33.0 | 52.0 | 67.0 | 32.0 | 1.0 | 37.3
3 | Gemini-3-Pro | API | 32.0 | 36.0 | 42.0 | 57.0 | 41.0 | 3.0 | 35.2
4 | GPT-4o | API | 38.3 | 30.3 | 50.2 | 56.6 | 16.2 | 0.0 | 31.9
5 | Yi-34B | Local | 38.0 | 34.0 | 44.0 | 54.0 | 9.0 | 0.0 | 29.8
6 | Gemini-2.5-Pro | API | 30.0 | 27.0 | 50.0 | 52.0 | 19.0 | 0.0 | 29.7
7 | Qwen3-30B | API | 34.0 | 30.0 | 44.0 | 46.0 | 5.0 | 0.0 | 26.5
8 | Xunzi-Qwen3-8B | Domain | 3.0 | 4.0 | 16.0 | 9.0 | 13.0 | 0.0 | 7.5
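The Avg column is consistent with an unweighted mean of the six task scores in each row. The sketch below checks this for every row that lists all six scores; it assumes the scores are read in leaderboard order (Author, Tone, Cloze, Match, 4-line sort, 8-line sort), and the names `LEADERBOARD` and `rows_with_mismatched_avg` are illustrative, not part of the benchmark release.

```python
# Six task scores per row, followed by the reported Avg, copied from the leaderboard.
LEADERBOARD = {
    "Human Expert":   ([65.0, 88.0, 65.0, 75.0, 62.0, 36.6], 65.3),
    "DeepSeek-V3.2":  ([37.2, 44.8, 58.3, 64.0, 27.4, 1.0], 38.8),
    "Qwen3-Max":      ([39.0, 33.0, 52.0, 67.0, 32.0, 1.0], 37.3),
    "Gemini-3-Pro":   ([32.0, 36.0, 42.0, 57.0, 41.0, 3.0], 35.2),
    "GPT-4o":         ([38.3, 30.3, 50.2, 56.6, 16.2, 0.0], 31.9),
    "Yi-34B":         ([38.0, 34.0, 44.0, 54.0, 9.0, 0.0], 29.8),
    "Gemini-2.5-Pro": ([30.0, 27.0, 50.0, 52.0, 19.0, 0.0], 29.7),
    "Qwen3-30B":      ([34.0, 30.0, 44.0, 46.0, 5.0, 0.0], 26.5),
    "Xunzi-Qwen3-8B": ([3.0, 4.0, 16.0, 9.0, 13.0, 0.0], 7.5),
}


def mean_score(scores: list[float]) -> float:
    """Unweighted mean of the task scores."""
    return sum(scores) / len(scores)


def rows_with_mismatched_avg(rows: dict, tol: float = 0.051) -> list[str]:
    """Names of rows whose reported Avg differs from the mean by more than
    tol (slightly above half a unit in the last printed decimal)."""
    return [name for name, (scores, avg) in rows.items()
            if abs(mean_score(scores) - avg) > tol]


print(rows_with_mismatched_avg(LEADERBOARD))  # → []
```

Because the mean weights the two sorting columns equally with the four multiple-choice probes, near-zero 8-line scores pull every model's average down by several points.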

Dataset

1,406 Contemporary Poems
30 Contributing Poets
5 Corpora
41 Task Variations

Data files follow the naming convention:

{corpus}.{task}.{variant}.jsonl

Corpora: tang, song, tang300, tangsong, today
Tasks: guess_author, guess_word, guess_ci_tone, match_sentence, sort_poem
Variants: standard, fewshot1/3/10, cot, couplets, jue, lyu, lyu_cot_expert
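The naming convention can be handled with a small helper. The following is a sketch under stated assumptions: `fewshot1/3/10` is taken to mean three separate variant names (`fewshot1`, `fewshot3`, `fewshot10`), and the names `DataFile`, `build_name`, and `parse_name` are hypothetical, not part of the released code. Only 41 task variations exist, so not every corpus, task, and variant combination corresponds to an actual file.

```python
from typing import NamedTuple

CORPORA = {"tang", "song", "tang300", "tangsong", "today"}
TASKS = {"guess_author", "guess_word", "guess_ci_tone",
         "match_sentence", "sort_poem"}
# Assumption: "fewshot1/3/10" expands to three separate variant names.
VARIANTS = {"standard", "fewshot1", "fewshot3", "fewshot10", "cot",
            "couplets", "jue", "lyu", "lyu_cot_expert"}


class DataFile(NamedTuple):
    corpus: str
    task: str
    variant: str


def build_name(corpus: str, task: str, variant: str) -> str:
    """Assemble a file name of the form {corpus}.{task}.{variant}.jsonl."""
    return f"{corpus}.{task}.{variant}.jsonl"


def parse_name(filename: str) -> DataFile:
    """Split a file name into its three fields and validate each one."""
    parts = filename.split(".")
    if len(parts) != 4 or parts[-1] != "jsonl":
        raise ValueError(f"expected corpus.task.variant.jsonl, got {filename!r}")
    corpus, task, variant = parts[:3]
    if corpus not in CORPORA or task not in TASKS or variant not in VARIANTS:
        raise ValueError(f"unknown corpus, task, or variant in {filename!r}")
    return DataFile(corpus, task, variant)


print(parse_name("today.sort_poem.lyu_cot_expert.jsonl"))
# DataFile(corpus='today', task='sort_poem', variant='lyu_cot_expert')
```

Splitting on the dot is safe here because task and variant names use underscores rather than dots.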

Download from GitHub

Citation

@inproceedings{zhang2026neoclassic,
  title={Neo-Classic: A Benchmark for Evaluating Linguistic-Aesthetic Reasoning in Classical Chinese Poetry},
  author={Zhang, Han and Gu, Zihan and Wang, Zhiyuan and Ma, Tianyi and Lu, Jiacheng and Zhang, Xinyan and Wei, Yuhao and Hua, Cheng},
  booktitle={Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL)},
  year={2026}
}

Acknowledgments

This research originated from the SJTU Classical Chinese Culture Club (guoxue_sjtu@163.com). We are deeply grateful to the 30 contemporary poets who authorized the use of their works for this benchmark.

The 30 contributing poets:

Lin Yuye (林雨夜), Xizhi (希之), Xiyan (溪烟), Zhou Qinggui (周庆贵), Guanghan Jushi (广寒居士), Baomu (抱木), Lingjun (令君), Shaoni (韶逆), Maoge (毛哥), Wu Sipeng (伍思鹏), Yanchi (雁迟), Guijia (龟家), Chang'an (长安), Chengqing (承青), Huatong (花桐), Yang Ruoxuan (杨若萱), Xie Zhongyan (谢忠延), Guiti (归荑), Mu Linu (木狸奴), Yun'aimu (云爱暮), Chenlv (尘旅), Pifu (蚍蜉), Mingling Xiaoyue (明棂晓月), Wang Qingyan (王清晏), Nanyang Jushi (南阳居士), Kongke (空客), Shuangxie (霜写), Huwulinglong (昈旿昤昽), Sun Zhongyi (孙中宜), Qianwu (潜武)