operating-production-services

Name: operating-production-services
Rating: 0.9 (1 review)
Author: mjunaidca

SRE patterns for production service reliability: SLOs, error budgets, postmortems, and incident response. Use when defining reliability targets, writing postmortems, implementing SLO alerting, or establishing on-call practices. NOT for initial service development (use scaffolding skills instead).

Also in: api monitoring

Third-Party Agent Skill: Review the code before installing. Agent skills execute in your AI assistant's environment and can access your files.

Installation for Agentic Skill

Claude Code (CLI) Fast

skilz install mjunaidca/mjs-agent-skills/operating-production-services

OpenCode (CLI) Fast

skilz install mjunaidca/mjs-agent-skills/operating-production-services --agent opencode

OpenAI Codex (CLI) Native

skilz install mjunaidca/mjs-agent-skills/operating-production-services --agent codex

Gemini CLI (Project) Project

skilz install mjunaidca/mjs-agent-skills/operating-production-services --agent gemini

First time? Install Skilz: pip install skilz

Works with 22+ AI coding agents

Cursor, Aider, Copilot, Windsurf, Qwen, Kimi, and more...

For Claude Desktop Easy

Download the agent skill ZIP, extract it, copy the extracted folder to ~/.claude/skills/, then restart Claude Desktop.

Manual Installation

1. Clone the repository:

git clone https://github.com/mjunaidca/mjs-agent-skills

2. Copy the agent skill directory:

cp -r mjs-agent-skills/.claude/skills/operating-production-services ~/.claude/skills/

Agentic Skill Details

Owner: mjunaidca (GitHub)
Repository: mjs-agent-skills
Type: Technical
Meta-Domain: cloud infrastructure
Primary Domain: kubernetes
Market Score: 17.3

Agent Skill Grade

Score: 97/100

Score Breakdown

Spec Compliance: 12/15
PDA Architecture: 27/30
Ease of Use: 24/25
Writing Style: 10/10
Utility: 19/20
Modifiers: +5

Areas to Improve

File is 189 lines but lacks table of contents despite exceeding 100-line threshold

Recommendations

Add trigger phrases to description for discoverability
Add table of contents for files over 100 lines

Graded: 1/24/2026

Developer Feedback

Found your operating-production-services skill while browsing the registry—the way you've structured the progressive disclosure for such a dense topic (97/100 for a reason) makes me curious how you'd handle even more edge cases around observability and incident response.

The TL;DR

You're at 97/100, solidly in A-grade territory. This is based on Anthropic's skill best practices rubric. Your strongest area is Writing Style (10/10)—the skill reads like documentation written by someone who actually runs production systems, not a marketing pamphlet. Weakest spot is Spec Compliance (12/15), mostly because you're leaving discoverability points on the table with trigger phrases.

What's Working Well

Blameless postmortem framework - The 5 Whys template and postmortem meeting checklist give Claude concrete structure for handling incidents. That's the kind of thing teams actually need.
Token economy is chef's kiss - slo-alerting.md delegates heavy technical details while SKILL.md stays lean. You're not dumping a 200-line reference file on someone; you're layering it thoughtfully.
Practical burn rate guidance - The multi-window alerting patterns with specific Prometheus queries and Grafana dashboard structure mean Claude can actually implement this, not just read philosophy.
Clear scope boundaries - Your description explicitly calls out SLO alerting and postmortems while noting what you don't cover (deployment strategies, team structure). That's rare and helpful.
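The multi-window burn rate idea mentioned above can be sketched in a few lines. This is a minimal illustration, not the contents of slo-alerting.md: the window pairs and thresholds (14.4, 6, 1) are the values commonly quoted for a 30-day SLO period, and the names `BurnRateRule`, `burn_rate`, and `firing_severity` are hypothetical. Burn rate here means the observed error ratio divided by the error budget (1 minus the SLO target); an alert fires only when both the long and the short window exceed the threshold, so a spike that has already ended does not page.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str    # window that proves the burn is sustained
    short_window: str   # window that proves the burn is still happening
    threshold: float    # burn rate that must be exceeded in both windows
    severity: str


# Assumed thresholds for a 30-day SLO period; tune them for your own service.
RULES = (
    BurnRateRule("1h", "5m", 14.4, "page"),
    BurnRateRule("6h", "30m", 6.0, "page"),
    BurnRateRule("3d", "6h", 1.0, "ticket"),
)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than 'exactly on budget' errors occur."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def firing_severity(
    error_ratios: Mapping[str, float], slo_target: float
) -> Optional[str]:
    """Return the severity of the first rule whose two windows both exceed
    its threshold, or None when no rule fires."""
    for rule in RULES:
        long_burn = burn_rate(error_ratios[rule.long_window], slo_target)
        short_burn = burn_rate(error_ratios[rule.short_window], slo_target)
        if long_burn > rule.threshold and short_burn > rule.threshold:
            return rule.severity
    return None
```

In a real deployment the per-window error ratios would come from recording rules in the metrics backend; here they are passed in as a plain mapping keyed by window name, so the decision logic can be tested on its own.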

The Big One

slo-alerting.md (189 lines) is missing a table of contents. This hurts your navigation score because at 100+ lines, readers need an anchor point. Right now someone has to scroll through Prometheus rules, Grafana templates, and example YAMLs without knowing what's coming.

Add this at the top:

## Contents
- [Prometheus Recording Rules](#prometheus-recording-rules)
- [Multi-Window Burn Rate Alerts](#multi-window-burn-rate-alerts)
- [Burn Rate Reference](#burn-rate-reference)
- [Grafana Dashboard](#grafana-dashboard)
- [SLO Definition Template](#slo-definition-template)
- [Common Mistakes](#common-mistakes)

Impact: +1 point to PDA (gets you to 28/30).
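A contents list like the one above can also be generated instead of maintained by hand. The following is a rough sketch that assumes GitHub-style anchors (lowercase, punctuation dropped, spaces turned into hyphens) and second-level headings; `slugify` and `build_toc` are hypothetical helper names, not part of the skill or of any registry tooling.

```python
import re

FENCE = "`" * 3  # marker that opens or closes a fenced code block


def slugify(heading: str) -> str:
    """Approximate a GitHub-style anchor for a heading."""
    slug = heading.strip().lower()
    slug = re.sub(r"[^\w\s-]", "", slug)  # drop punctuation
    return re.sub(r"\s+", "-", slug)      # spaces become hyphens


def build_toc(markdown: str, level: int = 2) -> str:
    """Build a '## Contents' list from headings of the given level."""
    prefix = "#" * level + " "
    lines = ["## Contents"]
    in_fence = False
    for line in markdown.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence  # ignore '#' comments inside code blocks
            continue
        if in_fence or not line.startswith(prefix):
            continue
        title = line[len(prefix):].strip()
        if title.lower() == "contents":
            continue  # do not list an existing contents section
        lines.append(f"- [{title}](#{slugify(title)})")
    return "\n".join(lines)
```

Running `build_toc` over the text of a reference file and pasting the result at the top keeps the list in step with the headings as the file grows.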

Other Things Worth Fixing

Expand trigger phrases in your frontmatter description - You're only hitting 1-2 right now. Add "error budget", "incident response", "reliability metrics" to catch more discovery queries. (-3 points on Spec Compliance; this could recover that easily).
Add one more example template - You've got postmortem templates and SLO YAML. A quick Alertmanager config snippet showing how to route burn rate alerts would give Claude another angle to work from.
Reference section could name-check the file - slo-alerting.md is good, but the pointer to it is generic. SKILL.md could use a line like "See references/slo-alerting.md for Prometheus query patterns and Grafana dashboard templates" to make the connection explicit.

Quick Wins

Add TOC to slo-alerting.md → +1 point
Expand trigger phrases (error budget, incident response, reliability) → +2-3 points
One more config example (Alertmanager routing) → +0-1 point

These three things could realistically push you to 99-100.

Check out your skill here: SkillzWave.ai | SpillWave. We have an agentic skill installer that installs skills on 14+ coding agent platforms. Check out this guide on how to improve your agentic skills.
