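grafana-agent ships with a built-in process-exporter integration that collects metrics about Linux processes by parsing the files under /proc. (Note: enabling this exporter on a non-Linux system has no effect.)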

If grafana-agent runs in a container, adjust the container's start command so that the host's /proc directory is mapped to the corresponding location inside the container:

    docker run \
      -v "/proc:/proc:ro" \
      -v /tmp/agent:/etc/agent \
      -v /path/to/config.yaml:/etc/agent-config/agent.yaml \
      grafana/agent:v0.23.0 \
      --config.file=/etc/agent-config/agent.yaml

Note: replace /path/to/config.yaml with the path to your own configuration file.
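
For readers who start containers with Docker Compose rather than `docker run`, the same mounts can be sketched as a Compose service. This is only an illustration: the service name `agent` is a placeholder, and the paths are copied unchanged from the command above.

```yaml
# Hypothetical docker-compose.yaml mirroring the docker run command above.
services:
  agent:                       # placeholder service name
    image: grafana/agent:v0.23.0
    volumes:
      - /proc:/proc:ro         # host procfs, read-only
      - /tmp/agent:/etc/agent
      - /path/to/config.yaml:/etc/agent-config/agent.yaml
    command:
      - --config.file=/etc/agent-config/agent.yaml
```

As with the command above, /path/to/config.yaml stands for your own configuration file.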

If grafana-agent runs in Kubernetes, the manifest needs the same adjustment: map the host's /proc directory to the corresponding location inside the container.

    apiVersion: v1
    kind: Pod
    metadata:
      name: grafana-agent
    spec:
      containers:
      - image: grafana/agent:v0.23.0
        name: agent
        args:
        - --config.file=/etc/agent-config/agent.yaml
        volumeMounts:
        - name: procfs
          mountPath: /proc
          readOnly: true
      volumes:
      - name: procfs
        hostPath:
          path: /proc

Configuring and enabling process_exporter

The following configuration enables process_exporter and tracks every process on the system.

    process_exporter:
      enabled: true
      process_names:
        - name: "{{.Comm}}"
          cmdline:
            - '.+'
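
Tracking every process can produce a large number of series. A narrower setup might look like the sketch below, which assumes the block sits under the agent's top-level `integrations` key and uses `nginx` and `sshd` purely as placeholder process names.

```yaml
# Sketch: track only selected commands instead of every process.
integrations:
  process_exporter:
    enabled: true
    process_names:
      # Processes whose comm matches one of these strings are tracked,
      # each grouped under its own command name; all others are ignored.
      - name: "{{.Comm}}"
        comm:
          - nginx   # placeholder
          - sshd    # placeholder
```

Because `comm` is truncated at 15 characters, longer executable names need to be listed in their truncated form.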

Collected metrics

    # Number of context switches
    # Counter
    namedprocess_namegroup_context_switches_total

    # CPU user/system time in seconds
    # Counter
    namedprocess_namegroup_cpu_seconds_total

    # Major page faults
    # Counter
    namedprocess_namegroup_major_page_faults_total

    # Minor page faults
    # Counter
    namedprocess_namegroup_minor_page_faults_total

    # Number of bytes of memory in use
    # Gauge
    namedprocess_namegroup_memory_bytes

    # Number of processes in this group (processes sharing the same name)
    # Gauge
    namedprocess_namegroup_num_procs

    # Number of processes in states Running, Sleeping, Waiting, Zombie, or Other
    # Gauge
    namedprocess_namegroup_states

    # Number of threads
    # Gauge
    namedprocess_namegroup_num_threads

    # Start time, in seconds since 1970/01/01, of the oldest process in the group
    # Gauge
    namedprocess_namegroup_oldest_start_time_seconds

    # Number of open file descriptors for this group
    # Gauge
    namedprocess_namegroup_open_filedesc

    # The worst (closest to 1) ratio between open fds and max fds among all
    # processes in this group
    # Gauge
    namedprocess_namegroup_worst_fd_ratio

    # Number of bytes read by this group
    # Counter
    namedprocess_namegroup_read_bytes_total

    # Number of bytes written by this group
    # Counter
    namedprocess_namegroup_write_bytes_total

    # Number of threads in this group waiting on each kernel wchan
    # Gauge
    namedprocess_namegroup_threads_wchan
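
To illustrate how the counters and gauges above are typically consumed, here is a sketch of a Prometheus-style rule file that could be loaded by whatever Prometheus-compatible backend receives the agent's metrics. The group, record, and alert names and the thresholds are hypothetical, and the `groupname` and `state` labels are assumed to match those emitted by upstream process-exporter.

```yaml
# Hypothetical rule file; all names and thresholds are illustrative only.
groups:
  - name: process_exporter_examples
    rules:
      # Counters are usually viewed as a rate: CPU seconds consumed per
      # second by each process group, averaged over 5 minutes.
      - record: groupname:namedprocess_cpu_seconds:rate5m
        expr: sum by (groupname) (rate(namedprocess_namegroup_cpu_seconds_total[5m]))

      # Gauges can be compared directly: fire when some process in a group
      # has used more than 80% of its file descriptor limit.
      - alert: ProcessGroupFdsNearLimit
        expr: namedprocess_namegroup_worst_fd_ratio > 0.8
        for: 10m

      # Fire when a group has had zombie processes for 10 minutes.
      - alert: ProcessGroupHasZombies
        expr: namedprocess_namegroup_states{state="Zombie"} > 0
        for: 10m
```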

Detailed process_exporter configuration options

    # Enables the process_exporter integration, allowing the Agent to automatically
    # collect system metrics from the host UNIX system.
    [enabled: <boolean> | default = false]

    # Sets an explicit value for the instance label when the integration is
    # self-scraped. Overrides inferred values.
    #
    # The default value for this integration is inferred from the agent hostname
    # and HTTP listen port, delimited by a colon.
    [instance: <string>]

    # Automatically collect metrics from this integration. If disabled,
    # the process_exporter integration will be run but not scraped and thus not
    # remote-written. Metrics for the integration will be exposed at
    # /integrations/process_exporter/metrics and can be scraped by an external
    # process.
    [scrape_integration: <boolean> | default = <integrations_config.scrape_integrations>]

    # How often should the metrics be collected? Defaults to
    # prometheus.global.scrape_interval.
    [scrape_interval: <duration> | default = <global_config.scrape_interval>]

    # The timeout before considering the scrape a failure. Defaults to
    # prometheus.global.scrape_timeout.
    [scrape_timeout: <duration> | default = <global_config.scrape_timeout>]

    # Allows for relabeling labels on the target.
    relabel_configs:
      [ - <relabel_config> ... ]

    # Relabel metrics coming from the integration, allowing to drop series
    # from the integration that you don't care about.
    metric_relabel_configs:
      [ - <relabel_config> ... ]

    # How frequent to truncate the WAL for this integration.
    [wal_truncate_frequency: <duration> | default = "60m"]

    # procfs mountpoint.
    [procfs_path: <string> | default = "/proc"]

    # If a proc is tracked, track with it any children that aren't a part of their
    # own group.
    [track_children: <boolean> | default = true]

    # Report on per-threadname metrics as well.
    [track_threads: <boolean> | default = true]

    # Gather metrics from smaps file, which contains proportional resident memory
    # size.
    [gather_smaps: <boolean> | default = true]

    # Recheck process names on each scrape.
    [recheck_on_scrape: <boolean> | default = false]

    # A collection of matching rules to use for deciding which processes to
    # monitor. Each config can match multiple processes to be tracked as a single
    # process "group."
    process_names:
      [ - <process_matcher_config> ]
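
As an illustration of how several of these options combine, the sketch below turns off per-thread and smaps collection, sets an explicit scrape interval, and drops one metric family before it is written. The chosen values are arbitrary examples, and the placement under a top-level `integrations` key is an assumption about the surrounding agent configuration.

```yaml
# Sketch only: option values are examples, not recommendations.
integrations:
  process_exporter:
    enabled: true
    scrape_interval: 30s        # example value
    track_threads: false        # no per-threadname metrics
    gather_smaps: false         # do not read the smaps file
    metric_relabel_configs:
      # Drop the per-wchan series if they are not needed.
      - source_labels: [__name__]
        regex: namedprocess_namegroup_threads_wchan
        action: drop
    process_names:
      - name: "{{.Comm}}"
        cmdline:
          - '.+'
```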

process_matcher_config

    # The name to use for identifying the process group name in the metric. By
    # default, it uses the base path of the executable.
    #
    # The following template variables are available:
    #
    # - {{.Comm}}:      Basename of the original executable from /proc/<pid>/stat
    # - {{.ExeBase}}:   Basename of the executable from argv[0]
    # - {{.ExeFull}}:   Fully qualified path of the executable
    # - {{.Username}}:  Username of the effective user
    # - {{.Matches}}:   Map containing all regex capture groups resulting from
    #                   matching a process with the cmdline rule group.
    # - {{.PID}}:       PID of the process. Note that the PID is copied from the
    #                   first executable found.
    # - {{.StartTime}}: The start time of the process. This is useful when combined
    #                   with PID as PIDs get reused over time.
    [name: <string> | default = "{{.ExeBase}}"]

    # A list of strings that match the base executable name for a process, truncated
    # at 15 characters. It is derived from reading the second field of
    # /proc/<pid>/stat minus the parens.
    #
    # If any of the strings match, the process will be tracked.
    comm:
      [ - <string> ]

    # A list of strings that match argv[0] for a process. If there are no slashes,
    # only the basename of argv[0] needs to match. Otherwise the name must be an
    # exact match. For example, "postgres" may match any postgres binary but
    # "/usr/local/bin/postgres" can only match a postgres at that path exactly.
    #
    # If any of the strings match, the process will be tracked.
    exe:
      [ - <string> ]

    # A list of regular expressions applied to the argv of the process. Each
    # regex here must match the corresponding argv for the process to be tracked.
    # The first element that is matched is argv[1].
    #
    # Regex Captures are added to the .Matches map for use in the name.
    cmdline:
      [ - <string> ]
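
The matcher fields can be combined, and named regex captures from `cmdline` can be reused in `name` through `.Matches`. The following sketch assumes a host running Java services started with `-jar`; the capture name `Jar` and the group name prefix are hypothetical. It also assumes the behaviour of upstream process-exporter, where all selectors inside one matcher must match and a process joins the first matcher that accepts it.

```yaml
# Sketch: group Java processes by jar file, everything else by command name.
integrations:
  process_exporter:
    enabled: true
    process_names:
      # Both selectors must match: comm is "java" and the command line
      # contains "-jar <file>". The captured file name becomes part of
      # the group name via {{.Matches.Jar}}.
      - name: "java:{{.Matches.Jar}}"
        comm:
          - java
        cmdline:
          - '-jar\s+(?P<Jar>\S+)'
      # Fallback: any remaining process is grouped by its command name.
      - name: "{{.Comm}}"
        cmdline:
          - '.+'
```

Placing the catch-all matcher last keeps it from claiming processes that a more specific matcher should own.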