ClickHouse Observability Integration
Overview
NoETL now includes ClickHouse as an observability backend for storing and analyzing logs, metrics, and traces. This integration follows the OpenTelemetry standard and provides high-performance analytics for operational data.
Architecture
Components
- ClickHouse Operator: Manages ClickHouse cluster lifecycle using Kubernetes CRDs
- ClickHouse Cluster: Single-node cluster optimized for local development
- MCP Server: Model Context Protocol server for AI agent integration
- Observability Schema: OpenTelemetry-compatible tables and materialized views
Data Flow
NoETL Events → OpenTelemetry Collector → ClickHouse → Grafana Dashboards
↓
MCP Server → AI Agents
Quick Start
Deploy ClickHouse Stack
# Deploy all components
task clickhouse:deploy
# Verify deployment
task clickhouse:status
task clickhouse:test
Connect to ClickHouse
# CLI access
task clickhouse:connect
# Execute query
task clickhouse:query -- "SELECT COUNT(*) FROM observability.logs"
# Port forward for local access
task clickhouse:port-forward # HTTP:8123, Native:9000
Schema
Tables
observability.logs
OpenTelemetry logs with trace correlation:
- Partition: By date
- TTL: 30 days
- Indexes: TraceId (bloom filter), Severity, Service
- Columns: Timestamp, TraceId, SpanId, SeverityText, ServiceName, Body, Attributes
observability.metrics
Time-series metrics:
- Partition: By date
- TTL: 90 days
- Indexes: MetricName, Service
- Columns: Timestamp, MetricName, MetricType, ServiceName, Value, Attributes
observability.traces
Distributed traces:
- Partition: By date
- TTL: 30 days
- Indexes: TraceId (bloom filter), SpanName, Service, Duration
- Columns: Timestamp, TraceId, SpanId, ParentSpanId, SpanName, Duration, StatusCode, Events, Links
observability.noetl_events
NoETL-specific execution events:
- Partition: By date
- TTL: 90 days
- Indexes: EventId, ExecutionId, EventType, Status
- Columns: Timestamp, EventId, ExecutionId, EventType, Status, StepName, Duration, ErrorMessage
Materialized Views
error_rate_by_service
Hourly error counts per service:
SELECT
toStartOfHour(Timestamp) AS Timestamp,
ServiceName,
ErrorCount,
TotalCount,
(ErrorCount / TotalCount) * 100 AS ErrorRate
FROM observability.error_rate_by_service
WHERE Timestamp >= now() - INTERVAL 1 DAY
ORDER BY Timestamp DESC, ErrorRate DESC
avg_duration_by_span
Span performance statistics:
SELECT
ServiceName,
SpanName,
avgMerge(AvgDuration) AS AvgDuration,
maxMerge(MaxDuration) AS MaxDuration,
minMerge(MinDuration) AS MinDuration
FROM observability.avg_duration_by_span
WHERE Timestamp >= now() - INTERVAL 1 HOUR
GROUP BY ServiceName, SpanName
ORDER BY AvgDuration DESC
noetl_execution_stats
Execution metrics:
SELECT
EventType,
Status,
EventCount,
AvgDuration,
MaxDuration
FROM observability.noetl_execution_stats
WHERE Timestamp >= now() - INTERVAL 1 DAY
ORDER BY EventCount DESC
Common Queries
Recent Errors
SELECT
Timestamp,
ServiceName,
SeverityText,
Body,
TraceId
FROM observability.logs
WHERE SeverityNumber >= 17 -- ERROR level
AND Timestamp >= now() - INTERVAL 1 HOUR
ORDER BY Timestamp DESC
LIMIT 100
Slow Traces
SELECT
Timestamp,
ServiceName,
SpanName,
Duration / 1000000 AS DurationMs,
TraceId
FROM observability.traces
WHERE Duration > 1000000000 -- > 1 second
AND Timestamp >= now() - INTERVAL 1 HOUR
ORDER BY Duration DESC
LIMIT 50
NoETL Execution Failures
SELECT
Timestamp,
ExecutionId,
EventType,
StepName,
ErrorMessage
FROM observability.noetl_events
WHERE Status = 'FAILED'
AND Timestamp >= now() - INTERVAL 1 DAY
ORDER BY Timestamp DESC
Maintenance
Optimize Tables
task clickhouse:optimize
Clean Data (Keep Schema)
task clickhouse:clean-data
Check Health
task clickhouse:health
View Logs
task clickhouse:logs # ClickHouse server
task clickhouse:logs-operator # Operator
task clickhouse:logs-mcp # MCP server
MCP Server Integration
The ClickHouse MCP (Model Context Protocol) server provides AI agents with direct access to observability data:
Access MCP Server
# Port forward
task clickhouse:port-forward-mcp
# Test endpoint
curl http://localhost:8124/health
Configuration
MCP server configuration in ci/manifests/clickhouse/mcp-server.yaml:
{
"clickhouse": {
"host": "clickhouse.clickhouse.svc.cluster.local",
"port": 8123,
"username": "default",
"password": "",
"database": "default"
},
"server": {
"host": "0.0.0.0",
"port": 8124
}
}
Performance Tuning
Compression
Tables use ZSTD compression with Delta encoding for timestamps. Typical compression ratio: 10:1 for logs.
Indexing
- Bloom filters on high-cardinality fields (TraceId, EventId)
- Set indexes on low-cardinality fields (Service, EventType)
- MinMax indexes on numeric ranges (Duration)
Partitioning
All tables partitioned by date for efficient TTL management and query pruning.
TTL Policies
- Logs: 30 days
- Metrics: 90 days
- Traces: 30 days
- NoETL Events: 90 days
Adjust in manifests as needed for your retention requirements.
Troubleshooting
Operator Not Starting
task clickhouse:logs-operator
kubectl describe deployment clickhouse-operator -n clickhouse
Cluster Not Ready
kubectl get chi -n clickhouse
kubectl describe chi noetl-clickhouse -n clickhouse
task clickhouse:logs
Schema Not Initialized
task clickhouse:connect
# Then run: SHOW DATABASES;
# If observability missing, re-apply schema:
task clickhouse:deploy-schema
Connection Issues
# Check service endpoints
kubectl get svc -n clickhouse
# Test from within cluster
kubectl run -it --rm clickhouse-test --image=clickhouse/clickhouse-client:latest --restart=Never -- \
clickhouse-client --host clickhouse.clickhouse.svc.cluster.local --query="SELECT 1"