Troubleshooting
This guide covers common failure scenarios when running the Memcached Operator and provides step-by-step diagnosis and resolution procedures.
1. Memcached CR Stuck in Progressing State
Symptom
The Memcached custom resource shows Progressing=True and Available=False for an extended period. The readyReplicas count does not reach the desired replicas.
```shell
kubectl get memcached <name> -n <namespace>
# Ready column remains lower than Replicas column
```

Diagnosis
Check the CR status conditions:
```shell
kubectl get memcached <name> -n <namespace> -o jsonpath='{.status.conditions}' | jq .
```

Look for the Progressing condition message, which reports the rollout state (e.g., Rollout in progress: 0/3 replicas updated).
Check the owned Deployment:
```shell
kubectl get deployment <name> -n <namespace>
kubectl describe deployment <name> -n <namespace>
```

Check Pod status:

```shell
kubectl get pods -n <namespace> -l app.kubernetes.io/name=memcached,app.kubernetes.io/instance=<name>
```

Check events on the CR and Deployment:
```shell
kubectl describe memcached <name> -n <namespace>
kubectl get events -n <namespace> --field-selector involvedObject.name=<name> --sort-by='.lastTimestamp'
```

Check operator logs:

```shell
kubectl logs -n memcached-operator-system deployment/memcached-operator-controller-manager -c manager
```

Common Causes and Fixes
Insufficient cluster resources
Pods stay in Pending state because CPU or memory requests cannot be satisfied.
```shell
kubectl describe pod <pod-name> -n <namespace>
# Look for "Insufficient cpu" or "Insufficient memory" in Events
```

Fix: Reduce resource requests in the CR, add cluster capacity, or scale down other workloads.

```yaml
spec:
  resources:
    requests:
      cpu: 100m       # Lower CPU request
      memory: 128Mi   # Lower memory request
```

Image pull errors
Pods show ImagePullBackOff or ErrImagePull.
```shell
kubectl describe pod <pod-name> -n <namespace>
# Look for "Failed to pull image" in Events
```

Fix: Verify the image name and tag, ensure the image registry is accessible, and check pull secrets if using a private registry.
Node selector or toleration mismatch
No nodes match the scheduling constraints.
Fix: Verify that nodes have the required labels and taints. Adjust the pod template or node configuration accordingly.
Hard anti-affinity with too few nodes
When antiAffinityPreset: hard is set and the cluster has fewer nodes than the requested replicas, some pods cannot be scheduled because requiredDuringSchedulingIgnoredDuringExecution prevents two Memcached pods from running on the same node.
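The arithmetic can be sketched directly: with hard anti-affinity each schedulable node can host at most one Memcached pod, so any replicas beyond the node count stay Pending. A minimal sketch (node and replica counts are hypothetical):

```shell
# Hypothetical cluster: hard anti-affinity caps pods at one per node
replicas=5
schedulable_nodes=3

# At most one pod lands on each node; the rest cannot be scheduled
running=$(( replicas < schedulable_nodes ? replicas : schedulable_nodes ))
pending=$(( replicas - running ))

echo "running=$running pending=$pending"   # running=3 pending=2
```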
```shell
kubectl get pods -n <namespace> -l app.kubernetes.io/instance=<name> -o wide
# Pending pods indicate scheduling failure
```

Fix: Either switch to soft anti-affinity or add more nodes to the cluster.

```yaml
spec:
  highAvailability:
    antiAffinityPreset: soft
```

2. Pods CrashLooping
Symptom
Pods are in CrashLoopBackOff status.
```shell
kubectl get pods -n <namespace> -l app.kubernetes.io/instance=<name>
# STATUS shows CrashLoopBackOff
```

Diagnosis
Check logs for the memcached container:
```shell
kubectl logs <pod-name> -n <namespace> -c memcached --previous
```

Check logs for the exporter sidecar (if monitoring is enabled):

```shell
kubectl logs <pod-name> -n <namespace> -c exporter --previous
```

Check if the pod was OOMKilled:

```shell
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
```

Common Causes and Fixes
maxMemoryMB exceeds container memory limit (OOMKilled)
Memcached allocates the amount of memory specified by maxMemoryMB for item storage. If the container memory limit does not leave enough room for this plus operational overhead (connections, threads, internal structures), the kernel OOM-kills the process.
The validating webhook rejects configurations where resources.limits.memory < maxMemoryMB + 32Mi, but if the webhook is bypassed or the limit is only slightly above the threshold, runtime OOM is still possible.
Fix: Ensure the container memory limit is sufficiently above maxMemoryMB. A safe guideline is to set the limit to at least maxMemoryMB + 64Mi.
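As a quick pre-flight sanity check, the headroom rule can be verified with shell arithmetic before applying the CR (the values below are illustrative, and MB/Mi are treated interchangeably, as the guideline above does):

```shell
# Illustrative values taken from a CR
max_memory_mb=512     # spec.memcached.maxMemoryMB
limit_mi=640          # spec.resources.limits.memory, expressed in Mi
headroom_mi=64        # recommended minimum headroom

if [ "$limit_mi" -ge $(( max_memory_mb + headroom_mi )) ]; then
  echo "OK: ${limit_mi}Mi leaves $(( limit_mi - max_memory_mb ))Mi of headroom"
else
  echo "RISK: limit should be at least $(( max_memory_mb + headroom_mi ))Mi"
fi
```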
```yaml
spec:
  memcached:
    maxMemoryMB: 512
  resources:
    limits:
      memory: 640Mi   # 512Mi + 128Mi headroom
```

Invalid extraArgs
The extraArgs field passes arguments directly to the memcached process. Unrecognized or conflicting flags cause the process to exit immediately.
```shell
kubectl logs <pod-name> -n <namespace> -c memcached --previous
# Look for "unknown option" or "illegal argument" messages
```

Fix: Remove or correct the invalid arguments in spec.memcached.extraArgs. Refer to the memcached documentation for valid flags.
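For reference, a corrected configuration might look like the sketch below. This assumes extraArgs is a list of strings (check the CRD schema for the exact shape); -o modern is a standard memcached option that enables the modern defaults bundle.

```yaml
spec:
  memcached:
    extraArgs:       # assumed string-list shape; verify against the CRD
      - "-o"
      - "modern"     # a valid memcached flag
```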
Missing SASL Secret
When SASL is enabled (security.sasl.enabled: true), the operator mounts the Secret referenced by credentialsSecretRef at /etc/memcached/sasl/. If the Secret does not exist or lacks the password-file key, the pod fails to start because the volume mount fails.
```shell
kubectl get secret <secret-name> -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
# Look for "MountVolume.SetUp failed" events
```

Fix: Create the required Secret before applying the CR.

```shell
kubectl create secret generic <secret-name> -n <namespace> \
  --from-file=password-file=/path/to/password-file
```

Missing TLS Secret
When TLS is enabled (security.tls.enabled: true), the operator mounts the Secret referenced by certificateSecretRef at /etc/memcached/tls/. If the Secret does not exist or is missing the required keys (tls.crt, tls.key), the pod fails to start.
```shell
kubectl get secret <secret-name> -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
# Look for "MountVolume.SetUp failed" events
```

Fix: Create the TLS Secret with the required keys.

```shell
kubectl create secret tls <secret-name> -n <namespace> \
  --cert=/path/to/tls.crt \
  --key=/path/to/tls.key
```

If enableClientCert: true is set, the Secret must also contain a ca.crt key:

```shell
kubectl create secret generic <secret-name> -n <namespace> \
  --from-file=tls.crt=/path/to/tls.crt \
  --from-file=tls.key=/path/to/tls.key \
  --from-file=ca.crt=/path/to/ca.crt
```

3. ServiceMonitor Not Created
Symptom
monitoring.enabled is set to true and monitoring.serviceMonitor is configured, but no ServiceMonitor resource exists in the namespace.
```shell
kubectl get servicemonitor -n <namespace>
# No ServiceMonitor for the Memcached instance
```

Diagnosis
Verify the CR spec includes the serviceMonitor section:
The operator only creates a ServiceMonitor when monitoring.enabled: true AND monitoring.serviceMonitor is present (not nil) in the spec.
```shell
kubectl get memcached <name> -n <namespace> -o jsonpath='{.spec.monitoring}' | jq .
```

Check whether the ServiceMonitor CRD is installed:

```shell
kubectl get crd servicemonitors.monitoring.coreos.com
```

Check operator logs for errors:

```shell
kubectl logs -n memcached-operator-system deployment/memcached-operator-controller-manager -c manager | grep -i servicemonitor
```

Cause
The Prometheus Operator CRDs are not installed in the cluster. The operator controller watches ServiceMonitor resources and will fail to reconcile them if the CRD does not exist.
Fix
Install the Prometheus Operator CRDs. If you use the kube-prometheus-stack Helm chart, the CRDs are included automatically. Otherwise, install them manually:
```shell
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml
```

After the CRD is installed, the operator will create the ServiceMonitor on the next reconciliation cycle.
4. Webhook Admission Errors
Symptom
Creating or updating a Memcached CR is rejected with a validation error.
```
Error from server (Invalid): error when creating "memcached.yaml":
admission webhook "vmemcached-v1beta1.kb.io" denied the request: ...
```

Diagnosis
Read the error message returned by kubectl. The validating webhook provides specific field-level error messages that identify the exact issue.
```shell
kubectl apply -f memcached.yaml
# The error message identifies which field failed validation and why
```

Common Causes and Fixes
maxMemoryMB exceeds container memory limit
The webhook validates that resources.limits.memory >= maxMemoryMB (in bytes) + 32Mi (operational overhead).
```
spec.resources.limits.memory: Invalid value: "128Mi": memory limit must be at least 96Mi (maxMemoryMB=64Mi + 32Mi overhead)
```

Fix: Increase resources.limits.memory or decrease maxMemoryMB.
minAvailable >= replicas
The webhook validates that PDB minAvailable (when set as an integer) must be strictly less than replicas.
```
spec.highAvailability.podDisruptionBudget.minAvailable: Invalid value: 3: minAvailable (3) must be less than replicas (3)
```

Fix: Set minAvailable to a value less than replicas.
```yaml
spec:
  replicas: 3
  highAvailability:
    podDisruptionBudget:
      enabled: true
      minAvailable: 2   # Must be < replicas (3)
```

minAvailable and maxUnavailable both set
The webhook enforces that minAvailable and maxUnavailable are mutually exclusive.
Fix: Specify only one of minAvailable or maxUnavailable.
Missing Secret references when security features are enabled
When SASL is enabled, credentialsSecretRef.name must be set. When TLS is enabled, certificateSecretRef.name must be set.
```
spec.security.sasl.credentialsSecretRef.name: Required value: credentialsSecretRef.name is required when SASL is enabled
```

Fix: Provide the required Secret reference name.
terminationGracePeriodSeconds <= preStopDelaySeconds
When graceful shutdown is enabled, terminationGracePeriodSeconds must exceed preStopDelaySeconds to ensure the preStop hook completes before the kubelet sends SIGKILL.
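The constraint exists because the grace period is shared: the preStop delay runs first, and only the remainder of terminationGracePeriodSeconds is left for memcached to exit after SIGTERM. A quick sketch with illustrative values:

```shell
# Illustrative values
pre_stop_delay=10        # preStopDelaySeconds
termination_grace=30     # terminationGracePeriodSeconds

# Time left for memcached to handle SIGTERM before the kubelet sends SIGKILL
sigterm_budget=$(( termination_grace - pre_stop_delay ))
echo "SIGTERM budget: ${sigterm_budget}s"   # SIGTERM budget: 20s
```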
```
spec.highAvailability.gracefulShutdown.terminationGracePeriodSeconds: Invalid value: 10: terminationGracePeriodSeconds (10) must exceed preStopDelaySeconds (10)
```

Fix: Increase terminationGracePeriodSeconds or decrease preStopDelaySeconds.
```yaml
spec:
  highAvailability:
    gracefulShutdown:
      enabled: true
      preStopDelaySeconds: 10
      terminationGracePeriodSeconds: 30   # Must be > preStopDelaySeconds
```

Replicas out of range
The CRD schema enforces replicas to be between 0 and 64 (inclusive).
Fix: Set replicas to a value within the allowed range.
5. Operator Not Reconciling
Symptom
Changes to a Memcached CR (e.g., scaling replicas) are not reflected in the managed Deployment, Service, or other resources.
Diagnosis
Check the operator pod status:
```shell
kubectl get pods -n memcached-operator-system
```

Check operator logs:

```shell
kubectl logs -n memcached-operator-system deployment/memcached-operator-controller-manager -c manager --tail=100
```

Check the operator health endpoints:

```shell
kubectl port-forward -n memcached-operator-system deployment/memcached-operator-controller-manager 8081:8081
curl http://localhost:8081/healthz
curl http://localhost:8081/readyz
```

Check RBAC permissions:

```shell
kubectl auth can-i get deployments --as=system:serviceaccount:memcached-operator-system:memcached-operator-controller-manager -n <namespace>
```

Common Causes and Fixes
Operator pod not running
The operator Deployment has zero ready replicas.
```shell
kubectl get deployment memcached-operator-controller-manager -n memcached-operator-system
kubectl describe deployment memcached-operator-controller-manager -n memcached-operator-system
```

Fix: Investigate why the operator pod is not running (image pull issues, resource constraints, a crash loop). Check pod events and logs for the specific error.
RBAC permissions missing
The operator's ServiceAccount lacks the required ClusterRole/Role bindings to manage the target resources (Deployments, Services, PodDisruptionBudgets, NetworkPolicies, ServiceMonitors, Secrets).
```shell
kubectl logs -n memcached-operator-system deployment/memcached-operator-controller-manager -c manager | grep -i forbidden
```

Fix: Verify that the ClusterRole and ClusterRoleBinding for the operator are correctly applied. Re-apply the operator manifests if necessary.

```shell
kubectl apply -k config/default
```

Leader election issues
In multi-replica operator deployments, only the leader instance performs reconciliation. If leader election is stuck (e.g., a stale lease), no reconciliation occurs.
```shell
kubectl get lease -n memcached-operator-system
kubectl logs -n memcached-operator-system deployment/memcached-operator-controller-manager -c manager | grep -i "leader"
```

Fix: Delete the stale Lease object to allow a new leader election.

```shell
kubectl delete lease <lease-name> -n memcached-operator-system
```

CRDs not installed
The Memcached CRD is not installed in the cluster.
```shell
kubectl get crd memcacheds.memcached.c5c3.io
```

Fix: Install the CRD.

```shell
kubectl apply -k config/crd
```

6. Metrics Not Available
Symptom
Prometheus cannot scrape memcached metrics. The exporter target is down or missing from Prometheus targets.
Diagnosis
Check if the exporter sidecar is running:
```shell
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].name}'
# Should include "exporter" if monitoring is enabled
```

Test the metrics endpoint directly:

```shell
kubectl port-forward <pod-name> -n <namespace> 9150:9150
curl http://localhost:9150/metrics
```

Check the ServiceMonitor exists and has correct labels:

```shell
kubectl get servicemonitor <name> -n <namespace> -o yaml
```

Check Prometheus targets:
In the Prometheus UI (Status > Targets), look for the memcached target. If it is missing, check that the ServiceMonitor labels match the Prometheus serviceMonitorSelector.
Check the Service exposes the metrics port:
```shell
kubectl get svc <name> -n <namespace> -o jsonpath='{.spec.ports}' | jq .
# Should include port 9150 named "metrics"
```

Common Causes and Fixes
Monitoring not enabled
The CR does not have monitoring.enabled: true.
Fix: Enable monitoring in the CR.
```yaml
spec:
  monitoring:
    enabled: true
    serviceMonitor:
      additionalLabels:
        release: prometheus   # Must match Prometheus serviceMonitorSelector
```

ServiceMonitor label mismatch
Prometheus selects ServiceMonitors by label. If the ServiceMonitor's labels do not match the Prometheus serviceMonitorSelector, the target is ignored.
```shell
# Check what labels Prometheus expects
kubectl get prometheus -A -o jsonpath='{.items[*].spec.serviceMonitorSelector}' | jq .
```

Fix: Add the required labels via monitoring.serviceMonitor.additionalLabels.
```yaml
spec:
  monitoring:
    enabled: true
    serviceMonitor:
      additionalLabels:
        release: prometheus   # Common label for kube-prometheus-stack
```

NetworkPolicy blocking Prometheus scrapes
If a NetworkPolicy is enabled, the ingress rules must allow traffic to port 9150 from Prometheus pods.
```shell
kubectl get networkpolicy <name> -n <namespace> -o yaml
```

The operator automatically includes port 9150 in the NetworkPolicy when monitoring is enabled. However, if allowedSources is configured, ensure that the Prometheus pods match the allowed peer selectors.
Fix: Add Prometheus pods to the allowedSources list.
```yaml
spec:
  security:
    networkPolicy:
      enabled: true
      allowedSources:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus
```

Exporter sidecar crashing
The exporter container may be crashing independently.
```shell
kubectl logs <pod-name> -n <namespace> -c exporter
```

Fix: Check that the exporter image is valid and that resource limits are sufficient. Increase exporter resources if the container is OOMKilled.
```yaml
spec:
  monitoring:
    enabled: true
    exporterResources:
      requests:
        cpu: 50m
        memory: 32Mi
      limits:
        cpu: 100m
        memory: 64Mi
```

7. PDB Blocking Node Drain
Symptom
kubectl drain <node> hangs indefinitely. The drain operation cannot evict Memcached pods because the PodDisruptionBudget prevents it.
```shell
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# Hangs with: "Cannot evict pod as it would violate the pod's disruption budget"
```

Diagnosis
Check the PDB status:
```shell
kubectl get pdb <name> -n <namespace>
kubectl describe pdb <name> -n <namespace>
```

Key fields to examine:

- ALLOWED DISRUPTIONS: If this is 0, no pods can be evicted.
- MIN AVAILABLE: The minimum number of pods that must remain running.
- CURRENT: The current number of healthy pods.

```shell
kubectl get pdb <name> -n <namespace> -o jsonpath='{.status}' | jq .
```

Cause
minAvailable is set too high relative to the current replicas count. For example, with replicas: 3 and minAvailable: 2, only 1 pod can be disrupted at a time. If two nodes need to be drained simultaneously, the second drain will block.
The validating webhook prevents minAvailable >= replicas, but even valid configurations (e.g., minAvailable: 2 with replicas: 3) can cause drain issues when multiple nodes are involved.
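The eviction math follows directly from the PDB status fields: allowed disruptions = currently healthy pods minus minAvailable. A sketch with the replicas: 3, minAvailable: 2 example:

```shell
# Illustrative values: replicas=3, minAvailable=2, all pods healthy
current_healthy=3
min_available=2

echo "allowed disruptions: $(( current_healthy - min_available ))"   # 1

# While the first drained node's pod is rescheduling, healthy drops to 2,
# so a simultaneous second drain is blocked:
current_healthy=2
echo "allowed disruptions: $(( current_healthy - min_available ))"   # 0
```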
Fix
Option A: Lower minAvailable
```yaml
spec:
  highAvailability:
    podDisruptionBudget:
      enabled: true
      minAvailable: 1
```

Option B: Switch to maxUnavailable
Using maxUnavailable is often more practical for drain operations because it directly controls how many pods can be down simultaneously. Note that minAvailable and maxUnavailable are mutually exclusive.
```yaml
spec:
  highAvailability:
    podDisruptionBudget:
      enabled: true
      maxUnavailable: 1
```

Option C: Temporarily disable PDB for maintenance
Set podDisruptionBudget.enabled: false during maintenance windows, then re-enable it after drains are complete.
8. NetworkPolicy Blocking Client Traffic
Symptom
Application pods cannot connect to Memcached. Connections to port 11211 (or 11212 for TLS) time out or are refused.
```shell
# From a client pod:
nc -zv <memcached-service>.<namespace>.svc.cluster.local 11211
# Connection timed out
```

Diagnosis
Check if a NetworkPolicy exists:
```shell
kubectl get networkpolicy <name> -n <namespace>
kubectl describe networkpolicy <name> -n <namespace>
```

Examine the ingress rules:

```shell
kubectl get networkpolicy <name> -n <namespace> -o yaml
```

Look at the spec.ingress[].from field. If allowedSources is configured, only pods matching those selectors can reach the Memcached pods.
Verify client pod labels:
```shell
kubectl get pod <client-pod> -n <client-namespace> --show-labels
```

Compare the client pod labels against the allowedSources peer selectors in the NetworkPolicy.
Cause
The allowedSources list in spec.security.networkPolicy does not include the client application pods. When allowedSources is non-empty, the operator creates ingress rules that restrict traffic to only the listed peers.
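Conceptually, a podSelector matches a client pod only when every matchLabels entry is present on the pod; one missing label means the peer does not match. A small sketch of that subset check (the labels are hypothetical):

```shell
# Hypothetical client pod labels and the NetworkPolicy matchLabels to satisfy
pod_labels="app=my-application,tier=backend"
required_labels="app=my-application"

matches=true
old_ifs=$IFS; IFS=','
for kv in $required_labels; do
  case ",$pod_labels," in
    *",$kv,"*) ;;            # required label present on the pod
    *) matches=false ;;      # a missing label means the selector does not match
  esac
done
IFS=$old_ifs

echo "selector matches: $matches"   # selector matches: true
```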
Fix
Update the allowedSources to include the client pods. The operator supports both podSelector (for same-namespace peers) and namespaceSelector (for cross-namespace access).
Allow specific pods by label:
```yaml
spec:
  security:
    networkPolicy:
      enabled: true
      allowedSources:
        - podSelector:
            matchLabels:
              app: my-application
```

Allow all pods in a specific namespace:
```yaml
spec:
  security:
    networkPolicy:
      enabled: true
      allowedSources:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: my-app-namespace
```

Allow all traffic (remove restrictions):
Set allowedSources to an empty list or omit it entirely. When allowedSources is empty, the NetworkPolicy allows ingress from all sources on the Memcached ports.
```yaml
spec:
  security:
    networkPolicy:
      enabled: true
      # allowedSources omitted = all sources allowed on Memcached ports
```

General Debugging Tips
Checking Operator Logs
The operator logs contain detailed information about reconciliation activity:
```shell
kubectl logs -n memcached-operator-system deployment/memcached-operator-controller-manager -c manager -f
```

Filter for a specific Memcached instance:

```shell
kubectl logs -n memcached-operator-system deployment/memcached-operator-controller-manager -c manager | grep '"name":"<instance-name>"'
```

Inspecting Managed Resources
All resources created by the operator carry standard labels. Use them to find all resources for a given instance:
```shell
kubectl get all,pdb,networkpolicy,servicemonitor -n <namespace> \
  -l app.kubernetes.io/name=memcached,app.kubernetes.io/instance=<name>
```

Checking Owner References
Every managed resource has an ownerReference pointing to the parent Memcached CR. This ensures garbage collection when the CR is deleted:
```shell
kubectl get deployment <name> -n <namespace> -o jsonpath='{.metadata.ownerReferences}' | jq .
```

Verifying Webhook Configuration
If webhook admission fails unexpectedly, verify the webhook configuration:
```shell
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations
```

Check that the webhook CA bundle is valid and that the cert-manager Certificate is in a Ready state:
```shell
kubectl get certificate -n memcached-operator-system
```