
ClickHouse Observability Integration

Overview

NoETL now includes ClickHouse as an observability backend for storing and analyzing logs, metrics, and traces. This integration follows the OpenTelemetry standard and provides high-performance analytics for operational data.

Architecture

Components

  1. ClickHouse Operator: Manages ClickHouse cluster lifecycle using Kubernetes CRDs
  2. ClickHouse Cluster: Single-node cluster optimized for local development
  3. MCP Server: Model Context Protocol server for AI agent integration
  4. Observability Schema: OpenTelemetry-compatible tables and materialized views

Data Flow

NoETL Events → OpenTelemetry Collector → ClickHouse → Grafana Dashboards

MCP Server → AI Agents

Quick Start

Deploy ClickHouse Stack

# Deploy all components
noetl run automation/infrastructure/clickhouse.yaml --set action=deploy

# Verify deployment
noetl run automation/infrastructure/clickhouse.yaml --set action=status
noetl run automation/infrastructure/clickhouse.yaml --set action=test

Connect to ClickHouse

# CLI access
noetl run automation/infrastructure/clickhouse.yaml --set action=connect

# Execute query
noetl run automation/infrastructure/clickhouse.yaml --set action=query --set query="SELECT COUNT(*) FROM observability.logs"

# Port forward for local access
noetl run automation/infrastructure/clickhouse.yaml --set action=port-forward # HTTP:8123, Native:9000
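With the port-forward active, queries can also be sent over ClickHouse's HTTP interface on port 8123. The sketch below only builds the request URL (host, port, and the default user are assumptions from the local setup); the commented lines show how the request would be sent:

```python
import urllib.parse
import urllib.request

def build_query_url(sql: str, host: str = "localhost", port: int = 8123) -> str:
    """Build a ClickHouse HTTP-interface URL carrying the SQL in the query param."""
    params = urllib.parse.urlencode({"query": sql})
    return f"http://{host}:{port}/?{params}"

url = build_query_url("SELECT COUNT(*) FROM observability.logs")

# With `action=port-forward` running, the query could be executed like this:
# with urllib.request.urlopen(url) as resp:
#     print(resp.read().decode())
```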

Schema

Tables

observability.logs

OpenTelemetry logs with trace correlation:

  • Partition: By date
  • TTL: 30 days
  • Indexes: TraceId (bloom filter), Severity, Service
  • Columns: Timestamp, TraceId, SpanId, SeverityText, ServiceName, Body, Attributes

observability.metrics

Time-series metrics:

  • Partition: By date
  • TTL: 90 days
  • Indexes: MetricName, Service
  • Columns: Timestamp, MetricName, MetricType, ServiceName, Value, Attributes

observability.traces

Distributed traces:

  • Partition: By date
  • TTL: 30 days
  • Indexes: TraceId (bloom filter), SpanName, Service, Duration
  • Columns: Timestamp, TraceId, SpanId, ParentSpanId, SpanName, Duration, StatusCode, Events, Links

observability.noetl_events

NoETL-specific execution events:

  • Partition: By date
  • TTL: 90 days
  • Indexes: EventId, ExecutionId, EventType, Status
  • Columns: Timestamp, EventId, ExecutionId, EventType, Status, StepName, Duration, ErrorMessage

Materialized Views

error_rate_by_service

Hourly error counts per service:

SELECT
    toStartOfHour(Timestamp) AS Timestamp,
    ServiceName,
    ErrorCount,
    TotalCount,
    (ErrorCount / TotalCount) * 100 AS ErrorRate
FROM observability.error_rate_by_service
WHERE Timestamp >= now() - INTERVAL 1 DAY
ORDER BY Timestamp DESC, ErrorRate DESC
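The ErrorRate expression in this view is a plain percentage; a minimal Python equivalent (guarding the division by zero that the SQL would hit on an empty hour):

```python
def error_rate(error_count: int, total_count: int) -> float:
    """Mirror of the view's (ErrorCount / TotalCount) * 100 expression."""
    if total_count == 0:
        return 0.0
    return (error_count / total_count) * 100

error_rate(5, 200)  # 2.5 (percent)
```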

avg_duration_by_span

Span performance statistics:

SELECT
    ServiceName,
    SpanName,
    avgMerge(AvgDuration) AS AvgDuration,
    maxMerge(MaxDuration) AS MaxDuration,
    minMerge(MinDuration) AS MinDuration
FROM observability.avg_duration_by_span
WHERE Timestamp >= now() - INTERVAL 1 HOUR
GROUP BY ServiceName, SpanName
ORDER BY AvgDuration DESC
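The -Merge combinators (avgMerge, maxMerge, minMerge) finalize partial aggregation states stored by the materialized view. Conceptually, an avg state is a (sum, count) pair; the sketch below illustrates the merge step in Python (the real states are opaque binary values inside ClickHouse — this is only the idea):

```python
def merge_avg(states):
    """Combine partial (sum, count) states into one average,
    the way avgMerge finalizes AggregateFunction(avg) states."""
    total = sum(s for s, _ in states)
    count = sum(c for _, c in states)
    return total / count if count else 0.0

# Two partial states: (sum=300, count=3) and (sum=100, count=1)
merge_avg([(300, 3), (100, 1)])  # 100.0
```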

noetl_execution_stats

Execution metrics:

SELECT
    EventType,
    Status,
    EventCount,
    AvgDuration,
    MaxDuration
FROM observability.noetl_execution_stats
WHERE Timestamp >= now() - INTERVAL 1 DAY
ORDER BY EventCount DESC

Common Queries

Recent Errors

SELECT
    Timestamp,
    ServiceName,
    SeverityText,
    Body,
    TraceId
FROM observability.logs
WHERE SeverityNumber >= 17 -- ERROR level
AND Timestamp >= now() - INTERVAL 1 HOUR
ORDER BY Timestamp DESC
LIMIT 100
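The `>= 17` filter follows the OpenTelemetry logs data model, where SeverityNumber ranges map to level names (1-4 TRACE, 5-8 DEBUG, 9-12 INFO, 13-16 WARN, 17-20 ERROR, 21-24 FATAL). A small mapping function makes the cutoff explicit:

```python
def severity_text(n: int) -> str:
    """Map an OpenTelemetry SeverityNumber to its level name."""
    ranges = [(1, "TRACE"), (5, "DEBUG"), (9, "INFO"),
              (13, "WARN"), (17, "ERROR"), (21, "FATAL")]
    name = "UNSPECIFIED"
    for lower, label in ranges:
        if n >= lower:
            name = label
    return name

severity_text(17)  # "ERROR" -- hence the >= 17 filter above
```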

Slow Traces

SELECT
    Timestamp,
    ServiceName,
    SpanName,
    Duration / 1000000 AS DurationMs,
    TraceId
FROM observability.traces
WHERE Duration > 1000000000 -- > 1 second
AND Timestamp >= now() - INTERVAL 1 HOUR
ORDER BY Duration DESC
LIMIT 50
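The Duration column holds nanoseconds, which is why the query divides by 1,000,000 for milliseconds and compares against 1,000,000,000 for one second. The same arithmetic in Python:

```python
NS_PER_MS = 1_000_000
NS_PER_S = 1_000_000_000

def to_ms(duration_ns: int) -> float:
    """Mirror of Duration / 1000000 AS DurationMs."""
    return duration_ns / NS_PER_MS

def is_slow(duration_ns: int, threshold_s: float = 1.0) -> bool:
    """Mirror of the Duration > 1000000000 filter (spans longer than 1 s)."""
    return duration_ns > threshold_s * NS_PER_S

to_ms(2_500_000_000)  # 2500.0 ms, so is_slow(2_500_000_000) is True
```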

NoETL Execution Failures

SELECT
    Timestamp,
    ExecutionId,
    EventType,
    StepName,
    ErrorMessage
FROM observability.noetl_events
WHERE Status = 'FAILED'
AND Timestamp >= now() - INTERVAL 1 DAY
ORDER BY Timestamp DESC

Maintenance

Optimize Tables

noetl run automation/infrastructure/clickhouse.yaml --set action=optimize

Clean Data (Keep Schema)

noetl run automation/infrastructure/clickhouse.yaml --set action=clean-data

Check Health

noetl run automation/infrastructure/clickhouse.yaml --set action=health

View Logs

noetl run automation/infrastructure/clickhouse.yaml --set action=logs            # ClickHouse server
noetl run automation/infrastructure/clickhouse.yaml --set action=logs-operator   # Operator
noetl run automation/infrastructure/clickhouse.yaml --set action=logs-mcp        # MCP server

MCP Server Integration

The ClickHouse MCP (Model Context Protocol) server provides AI agents with direct access to observability data:

Access MCP Server

# Port forward
noetl run automation/infrastructure/clickhouse.yaml --set action=port-forward-mcp

# Test endpoint
curl http://localhost:8124/health

Configuration

MCP server configuration in ci/manifests/clickhouse/mcp-server.yaml:

{
  "clickhouse": {
    "host": "clickhouse.clickhouse.svc.cluster.local",
    "port": 8123,
    "username": "default",
    "password": "",
    "database": "default"
  },
  "server": {
    "host": "0.0.0.0",
    "port": 8124
  }
}
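A client can load this configuration with the standard json module; the sketch below embeds the same settings inline and derives a connection string from them (the DSN format is illustrative only, not a real MCP setting):

```python
import json

# Same settings as the manifest above, embedded for a self-contained example.
raw = """
{
  "clickhouse": {"host": "clickhouse.clickhouse.svc.cluster.local",
                 "port": 8123, "username": "default",
                 "password": "", "database": "default"},
  "server": {"host": "0.0.0.0", "port": 8124}
}
"""
cfg = json.loads(raw)
ch = cfg["clickhouse"]
# Illustrative connection string built from the parsed fields.
dsn = f"http://{ch['username']}@{ch['host']}:{ch['port']}/{ch['database']}"
```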

Performance Tuning

Compression

Tables use ZSTD compression with Delta encoding for timestamps. Typical compression ratio: 10:1 for logs.

Indexing

  • Bloom filters on high-cardinality fields (TraceId, EventId)
  • Set indexes on low-cardinality fields (Service, EventType)
  • MinMax indexes on numeric ranges (Duration)

Partitioning

All tables are partitioned by date, which makes TTL enforcement cheap (expired partitions are dropped whole) and lets queries prune partitions outside the requested time range.

TTL Policies

  • Logs: 30 days
  • Metrics: 90 days
  • Traces: 30 days
  • NoETL Events: 90 days

Adjust in manifests as needed for your retention requirements.
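The retention rule is simple: a row is eligible for TTL deletion once it is older than its table's retention window. A small Python sketch of that check, using the retention values listed above:

```python
from datetime import datetime, timedelta, timezone

# Retention windows from the TTL Policies list above.
TTL_DAYS = {"logs": 30, "metrics": 90, "traces": 30, "noetl_events": 90}

def is_expired(table: str, row_ts: datetime, now: datetime) -> bool:
    """True once the row is older than the table's retention window."""
    return now - row_ts > timedelta(days=TTL_DAYS[table])

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
# A 61-day-old row: expired for logs (30 d) but not for metrics (90 d).
is_expired("logs", datetime(2025, 4, 1, tzinfo=timezone.utc), now)  # True
```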

Troubleshooting

Operator Not Starting

noetl run automation/infrastructure/clickhouse.yaml --set action=logs-operator
kubectl describe deployment clickhouse-operator -n clickhouse

Cluster Not Ready

kubectl get chi -n clickhouse
kubectl describe chi noetl-clickhouse -n clickhouse
noetl run automation/infrastructure/clickhouse.yaml --set action=logs

Schema Not Initialized

noetl run automation/infrastructure/clickhouse.yaml --set action=connect
# Then run: SHOW DATABASES;
# If the observability database is missing, re-apply the schema:
noetl run automation/infrastructure/clickhouse.yaml --set action=deploy-schema

Connection Issues

# Check service endpoints
kubectl get svc -n clickhouse

# Test from within cluster
kubectl run -it --rm clickhouse-test --image=clickhouse/clickhouse-client:latest --restart=Never -- \
  clickhouse-client --host clickhouse.clickhouse.svc.cluster.local --query="SELECT 1"

References