Microsoft's SkillOpt Boosts GPT-5.5 Performance via Markdown Optimization

Microsoft and researchers from three Chinese universities have unveiled SkillOpt, a groundbreaking method that treats instructional Markdown files as trainable parameters. By optimizing these "skill" documents, the researchers achieved a massive 23-point performance jump for GPT-5.5 on procedural tasks.

Treating Text as Trainable Weights

In the current AI landscape, "skills"—modular instructions that guide agents through specific procedures, tool-use rules, and output formats—are becoming industry standards. While companies like Anthropic use these to enhance Claude, these documents are traditionally written by humans or generated in a single pass by an LLM. Neither method functions as a true optimizer.

SkillOpt changes this paradigm by treating a Markdown file as an external, trainable state for a frozen target model. Instead of updating the model's weights, a second "optimizer" language model analyzes execution logs to identify recurring errors and successes. This optimizer proposes surgical edits—adding, deleting, or replacing specific passages—within a Markdown document. Crucially, these changes are only accepted if they yield measurable improvements on a held-out validation set.

Deep Learning Concepts Applied to Prose

The brilliance of SkillOpt lies in how it maps traditional deep learning mechanics onto text-level optimization. The researchers implemented several sophisticated control mechanisms to ensure stability:

This separation of concerns means the heavy lifting happens during training. At inference time, the target model remains lightweight, simply receiving a compact Markdown file of 300 to 2,000 tokens as context.

Benchmark Dominance and Cross-Model Transferability

The empirical results are significant. Testing across six benchmarks—including search, math, spreadsheets, and embodied action—SkillOpt consistently outperformed handwritten skills and specialized methods like TextGrad and EvoSkill. On GPT-5.5 in direct chat, the method yielded an average performance increase of approximately 23 points.

One of the most impactful findings is the method's transferability. A skill optimized for a large model like GPT-5.5 can be applied to much smaller models, such as Qwen3.5-4B, effectively providing them with procedural knowledge they lack in their native weights. Furthermore, skills are environment-agnostic; a spreadsheet skill trained in a Codex loop works seamlessly in Claude Code without retraining.

For example, in spreadsheet tasks, the optimized skill learns to check worksheet structures first and write evaluated values directly rather than relying on formulas. In embodied AI tasks like ALFWorld, the skill learns to maintain a log of visited locations to ensure objectives are met in the correct order.

Key Takeaways