Job Summary
Key Skills & Requirements
- Strong hands-on experience in Grafana administration, including dashboard development, alert configuration, notification policies, RBAC, user management, and data source integration.
- Expertise in Grafana plugin installation, configuration, troubleshooting, upgrades, and performance optimization across enterprise-scale monitoring environments.
- Experience designing and maintaining observability solutions using Grafana, Grafana Alloy, and OpenTelemetry.
- Hands-on experience with Grafana Alloy configuration, telemetry collection pipelines, log/metric forwarding, relabeling, filtering, and performance tuning.
- Strong knowledge of BindPlane administration, including collector deployment, gateway configuration, telemetry routing, load balancing, high availability, and troubleshooting.
- Experience configuring and optimizing telemetry ingestion pipelines from on-premises and cloud-based infrastructure into centralized observability platforms.
- Good understanding of Google Cloud Platform (GCP) services, with hands-on experience in GKE cluster administration, workload deployment, pod management, scaling, and troubleshooting.
- Experience using Google Cloud Monitoring tools such as Metrics Explorer, Logs Explorer, dashboards, and alerting policies, along with familiarity with observability best practices.
- Strong Kubernetes administration skills, including Deployments, Services, ingress controllers, DaemonSets, StatefulSets, namespaces, resource management, and cluster troubleshooting.
- Experience managing and monitoring Azure Kubernetes Service (AKS) environments and implementing observability solutions for containerized workloads.
- Knowledge of Azure cloud services, networking concepts, identity management, and infrastructure monitoring.
- Hands-on experience with Ansible for infrastructure automation, configuration management, deployment automation, and operational tasks.
- Strong scripting and automation skills using Python and shell scripting (Bash) for monitoring, API integrations, and operational efficiency improvements.
- Experience integrating monitoring platforms with ServiceNow, REST APIs, webhook-based alerting, SQL, and third-party enterprise applications.
- Strong understanding of Linux system administration, troubleshooting, process management, networking fundamentals, and performance analysis.
- Ability to perform root cause analysis, capacity planning, performance optimization, and reliability improvements for large-scale monitoring platforms.
- Experience supporting enterprise observability environments with thousands of monitored servers, applications, and cloud-native workloads.
- Excellent analytical, troubleshooting, documentation, and stakeholder communication skills.
Cloud & Container Technologies
- Google Cloud Platform (GCP)/Google Kubernetes Engine (GKE)
- Kubernetes Administration
- Azure Cloud/Azure Kubernetes Service (AKS)
Monitoring & Observability
- Grafana
- Grafana Alloy
- OpenTelemetry
- BindPlane
- Cloud Monitoring
- Log Management Solutions
- Prometheus
Automation & Development
- Python
- Shell Scripting (Bash)
- Ansible
- REST APIs
- Git/GitHub