NEO-CLASSIC | Evaluating Linguistic-Aesthetic Reasoning in Classical Chinese Poetry

Abstract摘要

High benchmark scores on Classical Chinese Poetry (CCP) tasks may arise not from transferable Linguistic-Aesthetic Reasoning, but from a Memorization Illusion — models retrieving canonical texts from pre-training corpora rather than reasoning over formal constraints.

We introduce NEO-CLASSIC, an Out-of-Sample (OOS) evaluation testbed built on 1,406 contemporary poems authored by 30 living poets. These works strictly conform to classical prosodic rules while being absent from historical corpora, enabling clean measurement of the Memorization Gap. Across five diagnostic probes — from authorship attribution to poem-level sentence reordering — we show that SOTA LLMs suffer 20–50% performance drops on contemporary texts and near-total failure on global structural planning tasks.

大语言模型在古典诗词任务上的高分可能并非来自可迁移的语言美学推理能力，而是源于记忆幻觉——模型从预训练语料中直接检索经典文本，而非基于格律约束进行推理。

我们提出 NEO-CLASSIC，一个基于样本外（OOS）评测的测试平台，包含 30 位当代诗人创作的 1,406 首作品。这些作品严格遵守古典格律规范，同时不存在于历史语料库中，从而实现对记忆差距的精准度量。通过五类诊断性探针任务——从作者归属到诗句排序——我们发现，SOTA 大语言模型在当代文本上的表现下降 20–50%，且在全局结构规划任务上几乎完全失败。

NEO-CLASSIC approach overview — Figure 1. Overview of the NEO-CLASSIC benchmark approach.图 1. NEO-CLASSIC 基准方法概览

Key Findings核心发现

20–50%

Memorization Gap记忆差距

Performance drops substantially when models shift from historical to contemporary texts, while human experts remain stable.当模型从历史文本转向当代文本时，性能大幅下降，而人类专家表现保持稳定。

0–3%

Structural Collapse结构性崩溃

Standard LLMs achieve near-zero accuracy on Lü (8-line) reordering. Best model reaches 36% with expert prompting vs. 48% human.标准大模型在律诗（8句）排序任务上准确率接近零。最佳模型在专家提示下仅达 36%，而人类专家为 48%。

≠

Fine-tuning ≠ Reasoning微调 ≠ 推理

Domain-specialized models (Yi-34B, Xunzi) underperform general-purpose SOTA, suggesting scale matters more than domain data.领域微调模型（Yi-34B、荀子）表现不如通用 SOTA 模型，表明模型规模比领域数据更重要。

Benchmark Tasks基准任务

Five behavioral probes targeting different constraint levels of Classical Chinese Poetry:五类行为探针，分别针对古典诗词不同层级的形式约束：

Probe探针	Task任务	Constraint约束层级	What It Tests测试内容	Random随机基线
GuessAuthor	Authorship attribution作者归属	Stylistic风格	Idiolect extraction without memorization在无记忆辅助下识别个人语体风格	25%
GuessCiTone	Cipai identification词牌识别	Phonological音韵	Character counting & tonal mapping字数统计与平仄映射	25%
GuessWord	Cloze test填空测试	Phon.音韵 + Syn.句法	Constraint recognition & prosodic sensitivity约束识别与声律敏感度	25%
MatchSentence	Couplet matching联句匹配	Syntactic句法	Understanding of Duizhang (parallelism)对仗理解能力	25%
SortPoem	Sentence reordering诗句排序	Discourse篇章	Global planning over Qi-Cheng-Zhuan-He基于起承转合的全局规划	~0.002%

Results实验结果

Quantifying the Memorization Gap量化记忆差距

Best SOTA performance on historical vs. contemporary (OOS) corporaSOTA 模型在历史语料 vs. 当代（OOS）语料上的最佳表现

SOTA-HistoricSOTA-历史

SOTA-Neo-ClassicSOTA-当代

Memorization Gap记忆差距

100806040200

Lü Sorting Accuracy: The Impact of Expert Prompting律诗排序准确率：专家提示的影响

Model模型	Standard标准	CoT-Expert思维链-专家	Gap vs. Human与人类差距
DeepSeek-V3.2	0.0%	16.5%	-31.5
Qwen3-Max	1.0%	13.0%	-35.0
Gemini-3-Pro	3.0%	36.0%	-12.0
GPT-4o	0.0%	0.0%	-48.0
Human Expert人类专家	36.6%	48.0%	—

Random chance for Lü (8 lines): 1/8! ≈ 0.002%. Even with structured expert-level guidance, the best model trails human experts by 12 percentage points. 律诗（8句）随机正确率：1/8! ≈ 0.002%。即使在结构化专家级提示下，最佳模型仍落后人类专家 12 个百分点。

NEO-CLASSIC LeaderboardNEO-CLASSIC 排行榜

Accuracy (%) on Contemporary OOS Corpus — Standard Prompting — Random Chance: 25%当代 OOS 语料准确率 (%) — 标准提示 — 随机基线：25%

#	Model模型	Author 3-shot3样例	Tone std标准	Cloze std标准	Match std标准	Jue 4-line绝句	Lü 8-line律诗	Avg均分
	Human Expert人类专家 Expert专家	65.0	88.0	65.0	75.0	62.0	36.6	65.3
	Human Players普通玩家 Casual业余	39.0	46.5	48.4	48.0	47.7	—	—
1	DeepSeek-V3.2 API	37.2	44.8	58.3	64.0	27.4	1.0	38.8
2	Qwen3-Max API	39.0	33.0	52.0	67.0	32.0	1.0	37.3
3	Gemini-3-Pro API	32.0	36.0	42.0	57.0	41.0	3.0	35.2
4	GPT-4o API	38.3	30.3	50.2	56.6	16.2	0.0	31.9
5	Yi-34B Local	38.0	34.0	44.0	54.0	9.0	0.0	29.8
6	Gemini-2.5-Pro API	30.0	27.0	50.0	52.0	19.0	0.0	29.7
7	Qwen3-30B API	34.0	30.0	44.0	46.0	5.0	0.0	26.5
8	Xunzi-Qwen3-8B Domain	3.0	4.0	16.0	9.0	13.0	0.0	7.5

Dataset数据集

1,406

Contemporary Poems当代诗词

Contributing Poets授权诗人

Corpora语料库

Task Variations任务变体

Data files follow the naming convention:数据文件命名规则：

{corpus}.{task}.{variant}.jsonl

Corpora: tang, song, tang300, tangsong, today Tasks: guess_author, guess_word, guess_ci_tone, match_sentence, sort_poem
Variants: standard, fewshot1/3/10, cot, couplets, jue, lyu, lyu_cot_expert

Download from GitHub从 GitHub 下载

Citation引用

@inproceedings{zhang2026neoclassic,
  title={Neo-Classic: A Benchmark for Evaluating Linguistic-Aesthetic Reasoning in Classical Chinese Poetry},
  author={Zhang, Han and Gu, Zihan and Wang, Zhiyuan and Ma, Tianyi and Lu, Jiacheng and Zhang, Xinyan and Wei, Yuhao and Hua, Cheng},
  booktitle={Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL)},
  year={2026}
}

Acknowledgments致谢

This research originated from the SJTU Classical Chinese Culture Club (guoxue_sjtu@163.com). We are deeply grateful to the 30 contemporary poets who authorized the use of their works for this benchmark.本研究源自上海交通大学国学社（guoxue_sjtu@163.com）。我们衷心感谢 30 位当代诗人授权使用其作品构建本基准测试。

View all 30 contributing poets查看全部 30 位授权诗人

Lin Yuye, Xizhi, Xiyan, Zhou Qinggui, Guanghan Jushi, Baomu, Lingjun, Shaoni, Maoge, Wu Sipeng, Yanchi, Guijia, Chang'an, Chengqing, Huatong, Yang Ruoxuan, Xie Zhongyan, Guiti, Mu Linu, Yun'aimu, Chenlv, Pifu, Mingling Xiaoyue, Wang Qingyan, Nanyang Jushi, Kongke, Shuangxie, Huwulinglong, Sun Zhongyi, Qianwu 林雨夜、希之、溪烟、周庆贵、广寒居士、抱木、令君、韶逆、毛哥、伍思鹏、雁迟、龟家、长安、承青、花桐、杨若萱、谢忠延、归荑、木狸奴、云爱暮、尘旅、蚍蜉、明棂晓月、王清晏、南阳居士、空客、霜写、昈旿昤昽、孙中宜、潜武