
Real-World Case Studies


Real Bugs We Found

These are actual bugs discovered during production validation of 13 customer extensions (81+ alerts deployed):

Extension          Bug                        Impact                         Root Cause
─────────────────  ─────────────────────────  ─────────────────────────────  ──────────────────────────────────────────────────
APIC v0.0.6        3 Python typos             3 metrics always zero          identInt16 → identInst16, wrong API field names
Nexus v0.0.1       SNMPv2c auth broken        50 SNMP metrics missing        v2 fell through to SNMPv3 handler
ACI v0.0.2         Cross-table OID mixing     Wrong data matched to rows     ipAddrTable OID in ifTable subgroup (DED018)
MSSQL v2.10.6      ROUND() missing precision  SQL query error                ROUND(..., ) → ROUND(..., 2)
Catalyst v1.0.0    Uptime threshold 10x off   False alerts every reboot      sysUpTime in centiseconds: 360000 ≠ 3600000
MikroTik v0.0.3    8 wrong metric key refs    Screens show empty tiles       Metric keys in screens didn't match extension.yaml
Checkpoint v0.0.1  8 metric gaps              Missing monitoring coverage    OIDs exist in MIB but not in extension.yaml
F5 BIG-IP v2.16    7 CLI/API gaps             Metrics not in SNMP extension  Some F5 metrics only available via iControl REST
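
The Catalyst entry comes down to a unit mismatch: SNMP sysUpTime is reported in TimeTicks, which are hundredths of a second, so one extra or missing zero shifts a threshold by a factor of ten. A minimal sketch of keeping the conversion in one place follows; the helper names are hypothetical and not part of any Dynatrace API.

```python
TICKS_PER_SECOND = 100  # SNMP TimeTicks are hundredths of a second


def seconds_to_timeticks(seconds: int) -> int:
    """Convert a duration in seconds to sysUpTime TimeTicks."""
    return seconds * TICKS_PER_SECOND


def timeticks_to_seconds(ticks: int) -> float:
    """Convert sysUpTime TimeTicks back to seconds."""
    return ticks / TICKS_PER_SECOND


# One hour expressed in TimeTicks
print(seconds_to_timeticks(3600))               # 360000

# The value with one extra zero is ten hours, not one
print(timeticks_to_seconds(3_600_000) / 3600)   # 10.0
```

Deriving the threshold from a value in seconds, rather than typing the tick count by hand, makes this kind of off-by-ten error easier to spot in review.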

Validation Scorecard

Extension        Metrics  Validated  Alerts  Status
───────────────  ───────  ─────────  ──────  ──────────
APIC             8/8      VALID      12      3 bugs FIXED
ASR              15/15    code only  19      4 metrics NO DATA
ACI              9/9      VALID      7       Fault isolation fix
Catalyst         25/25    VALID      12      Reference implementation
FortiSwitch      19/19    VALID      19      3 deploy methods documented
F5 BIG-IP        16/23    partial    10      7 gaps (CLI/API only)
MSSQL            3/3      VALID      2       4 metrics missing from screens
MikroTik         4/7      partial    TBD     8 wrong metric key refs FIXED
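
The wrong metric key references fixed for MikroTik are the kind of mismatch a simple set comparison can catch before deployment. The sketch below assumes the metric keys defined in extension.yaml and the keys referenced by screens have already been collected into two lists; the function name and the keys are hypothetical.

```python
def find_unknown_metric_keys(defined_keys, referenced_keys):
    """Return keys referenced by screens that the extension never defines."""
    return sorted(set(referenced_keys) - set(defined_keys))


# Hypothetical keys, for illustration only
defined = ["custom:device.cpu.usage", "custom:device.memory.used"]
referenced = ["custom:device.cpu.usage", "custom:device.mem.used"]

print(find_unknown_metric_keys(defined, referenced))
# ['custom:device.mem.used']
```

An empty result means every tile points at a metric the extension actually emits.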

Production Patterns

Fault Isolation

โš ๏ธ ACI v0.0.5 fix: A device's interface table hung for 180 seconds, blocking ALL metrics. Fix: move interfaces to a separate group: so CPU/memory keeps polling independently.

# Bad: everything in one group
snmp:
  - group: Device Default
    subgroups:
      - subgroup: CPU          # blocked when interfaces hang
      - subgroup: Interfaces   # hangs for 180s

# Good: separate groups
snmp:
  - group: Device Default
    subgroups:
      - subgroup: CPU          # keeps polling
  - group: Interfaces          # hangs independently
    subgroups:
      - subgroup: Interfaces

Feature Set Granularity

Customers monitoring 1000+ interfaces need to toggle feature sets. Don't put everything in one group.

Duplicate Alert Cleanup

Before deploying new alerts, check for pre-existing ones. ASR had accumulated 28 alerts, including duplicates from previous deployments; cleanup brought the count down to 19.
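
One way to spot duplicates before a deployment is to group the existing alert definitions by the fields that make two alerts equivalent. The sketch below assumes the alerts have already been exported as dictionaries with hypothetical id, name, and metric_key fields; it does not call any Dynatrace API.

```python
from collections import defaultdict


def find_duplicate_alerts(alerts):
    """Group alerts by (normalized name, metric key); return groups with more than one entry."""
    groups = defaultdict(list)
    for alert in alerts:
        # Normalize the name so "High CPU" and "high cpu " count as the same alert
        key = (alert["name"].strip().lower(), alert["metric_key"])
        groups[key].append(alert["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical exported alerts, for illustration only
existing = [
    {"id": "a1", "name": "High CPU", "metric_key": "custom:device.cpu.usage"},
    {"id": "a2", "name": "high cpu ", "metric_key": "custom:device.cpu.usage"},
    {"id": "a3", "name": "Low memory", "metric_key": "custom:device.memory.free"},
]

print(find_duplicate_alerts(existing))
# {('high cpu', 'custom:device.cpu.usage'): ['a1', 'a2']}
```

Which fields define "the same alert" is a judgment call; threshold or scope could be added to the key if two alerts on one metric are intentional.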

Lessons Learned

Lesson                             Why
─────────────────────────────────  ───────────────────────────────────────────────────
Always use 64-bit counters         32-bit wraps every 3.4s on 10Gbps links
Test with real devices             Simulators don't reproduce timeout/hang bugs
Check AG logs for silent failures  Extensions fail silently; no UI error shown
func: metrics don't work in DQL    Only in screens/dashboards (Metrics API v2)
Sprint needs custom: prefix        Non-Dynatrace extensions rejected without it
CA cert in BOTH locations          Credential Vault (server) + AG filesystem (runtime)
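
The 3.4 s figure in the first lesson follows from the counter width: a 32-bit octet counter holds 2^32 bytes, and a 10 Gbps link at full line rate moves 1.25 billion bytes per second. A small sketch of the arithmetic, with an illustrative function name:

```python
def counter_wrap_seconds(counter_bits: int, link_bps: float) -> float:
    """Seconds until an octet counter of the given width wraps at full line rate."""
    max_octets = 2 ** counter_bits
    octets_per_second = link_bps / 8  # bits per second -> bytes per second
    return max_octets / octets_per_second


# 32-bit counter on a saturated 10 Gbps link
print(round(counter_wrap_seconds(32, 10e9), 2))   # 3.44

# 64-bit counter on the same link, expressed in years
seconds_per_year = 365 * 24 * 3600
print(round(counter_wrap_seconds(64, 10e9) / seconds_per_year))   # 468
```

Any polling interval longer than the wrap time can miss one or more wraps, which is why the 32-bit counters produce wrong rates on fast links.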

🎉 Course Complete!

You now know how to build, validate, and deploy Dynatrace Extensions 2.0. Check out the Apps course to build custom Dynatrace applications.