Document ID: 0010
Topic: Monitoring, System Performance
Created: 2007-05-20
Last Updated: 2009-03-25
Author: Stefan Parvu
References: K9toolkit
OS: Solaris 10+
Monitoring the IT infrastructure is key to ensuring business continuity and preparing for future growth. SDR is a simple toolkit containing a number of data collectors, used to record and report data from your Solaris servers. SDR is designed mainly around the Solaris operating system because of its kernel statistics interface, but it can easily be extended to other operating systems.
The Solaris operating environment already has many utilities to debug and observe the entire system or individual processes. Third-party software can also be installed to monitor the system or the applications: BMC Patrol/Predict, TeamQuest, Tivoli, Sitescope, Nagios, etc. Some of these packages focus on event management, others on performance analysis and capacity planning. SDR concentrates on performance analysis and capacity planning, even if there is still a lot of work to be done. Hand in hand with PDQ, a simple and powerful analytic model, SDR can be used to measure your infrastructure capacity.
Mainly we are interested in observing and recording:
SDR can help where the budget is limited and the time to deploy the solution is an important factor for your site. You don't want to spend a lot of time and money setting up an expensive RDBMS for your reports, but rather want a simple and reliable solution that reduces maintenance to almost zero. SDR uses RRDTool as the kernel for storing and reporting the data.
Design, Recorders, Installation: system data recorder design, list of system recorders, installation and setup guide
Design, Installation: reporting side installation notes
FAQ: most common questions about SDR
SDR Live: live demo of an SDR installation (username: demo, password: sdr)
SDR Bugzilla: SDR bug tracking
The System Data Recorder is organized in two main parts: the collection part, which handles recording the data on each system, and the reporting side, where we permanently store the data and generate simple reports and graphs. For some configurations we can use only the recording part, without the reporting side at all.
The data recorder consists of many simple utilities, developed in Korn shell and Perl, which extract telemetry from the Solaris kernel statistics (kstat) module. Some recorders also gather their data directly from processes, using OS or third-party utilities. There are 5 main recorders, which should be installed and deployed on every system, plus optional recorders needed only in certain cases: CMT, JVM.
If your system deploys some sort of virtualization, the recorders operate from the global level. If the virtualization type includes domains or Xen technology, the recorders are deployed in all of these systems.
Recorded data:
Each recorder is operated by SMF, the Solaris Service Management Facility, which keeps the recorders running by restarting them automatically if one fails or exits unexpectedly. Dependency checking is also easily implemented with SMF: for instance, a recorder should not start if the local filesystems are not mounted or the network interfaces are not present.
Each recorder writes its data to a file called the raw output file. Every midnight we rotate this file with the logadm utility and compress it, so the stored data stays small and easy to transport to the reporting system. The majority of collectors record raw data directly in RRD format, which is easy to import into the Round-Robin Database, the final place where the data is stored for 1 year or for whatever period your site requires.
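Because most recorders already emit `timestamp:value:value...` records, a raw line can be handed to `rrdtool update` almost unchanged. The sketch below only builds the argument list and does not run rrdtool. The file name `sysrec.rrd` is a hypothetical example, and it assumes the RRD was created with data sources in the same order as the recorder's columns.

```python
def rrdtool_update_args(rrd_file, raw_line):
    """Build the argv for `rrdtool update` from one raw recorder line.

    raw_line is assumed to look like "timestamp:v1:v2:...", which is the
    record form `rrdtool update` accepts after the file name.
    """
    record = raw_line.strip().rstrip(":")  # tolerate a trailing colon
    timestamp = record.split(":", 1)[0]
    if not timestamp.isdigit():
        raise ValueError("record does not start with an epoch timestamp: %r" % raw_line)
    return ["rrdtool", "update", rrd_file, record]


# Hypothetical RRD file name; the record is the first sysrec sample shown on this page.
args = rrdtool_update_args(
    "sysrec.rrd", "1225038537:10.56:70.79:9.87:0.13:0.02:0.02:0.19:0.00\n"
)
print(" ".join(args))
```

The resulting list could be passed to a process launcher on the reporting host once the daily raw files have been transferred and uncompressed.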
The recording part consists of the following collectors or recorders:
| Item | Description | Based On |
| sysrec | system utilisation and saturation | Perl5, Kstat |
| cpurec | per-cpu detailed statistics | Perl5, Kstat |
| nicrec | network interface statistics | Perl5, Kstat |
| netrec | network protocol statistics | Perl5, netstat |
| zonerec | zone statistics | Ksh, prstat |
| corerec | CMT T1, T2 processor statistics | Ksh, Perl5, cpustat |
| jvmrec | garbage collection statistics | Ksh, jstat |
sysrec is a utility from the K9Toolkit, written by Brendan Gregg. The toolkit is a collection of free Perl scripts used to troubleshoot and observe Solaris systems. Check the Appendix for more details. The recorder has been modified to output its data in RRD format.
sysrec records system utilisation and saturation, and is used as a starting point for observing the system's health.
The output from sysrec is displayed below:
| timestamp | CPU Util % | Mem Util % | Disk Util % | Net Util % | CPU Sat % | Mem Sat % | Disk Sat % | Net Sat % |
| 1225038537: | 10.56: | 70.79: | 9.87: | 0.13: | 0.02: | 0.02: | 0.19: | 0.00 |
| 1225038539: | 3.92: | 70.79: | 0.00: | 0.00: | 0.00: | 0.00: | 0.00: | 0.00 |
| 1225038538: | 2.94: | 70.79: | 0.00: | 0.00: | 0.00: | 0.00: | 0.00: | 0.00 |
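As a rough illustration of how such a record can be consumed, the sketch below splits one sysrec line into named fields and reports which resource is the most utilised. It assumes the raw line is plain colon-separated text in the column order of the table above. The field names are made up for this example and are not part of SDR.

```python
# Hypothetical field names, in the column order shown in the table above.
SYSREC_FIELDS = [
    "cpu_util", "mem_util", "disk_util", "net_util",
    "cpu_sat", "mem_sat", "disk_sat", "net_sat",
]


def parse_sysrec(line):
    """Split one sysrec record into (timestamp, {field: value})."""
    parts = line.strip().split(":")
    timestamp = int(parts[0])
    values = [float(p) for p in parts[1:]]
    if len(values) != len(SYSREC_FIELDS):
        raise ValueError("expected %d values, got %d" % (len(SYSREC_FIELDS), len(values)))
    return timestamp, dict(zip(SYSREC_FIELDS, values))


def busiest_resource(sample):
    """Return the utilisation field with the highest value."""
    util_fields = [name for name in SYSREC_FIELDS if name.endswith("_util")]
    return max(util_fields, key=lambda name: sample[name])


timestamp, sample = parse_sysrec("1225038537:10.56:70.79:9.87:0.13:0.02:0.02:0.19:0.00")
print(timestamp, busiest_resource(sample), sample["disk_sat"])
```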
cpurec is a utility that collects per-CPU data from kstat. The recorder outputs its data in RRD format.
cpurec is used mainly to observe CPU activity and analyse how the CPUs are used in the system. It is useful for capacity planning. Recording points:
The output from cpurec is displayed below:
| timestamp | Cpuid | Xcalls | Intr | iThr | Csw | Icsw | Migr | Smtx | Syscalls | User % | Sys % | Idle % |
| 1225039500: | 1: | 97: | 568: | 49: | 1195: | 69: | 190: | 40: | 4183: | 5.95: | 4.37: | 89.68 |
| 1225039500: | 0: | 98: | 936: | 513: | 1169: | 47: | 190: | 41: | 4097: | 6.17: | 4.62: | 89.21 |
| 1225039504: | 1: | 0: | 90: | 3: | 219: | 1: | 28: | 2: | 289: | 2.97: | 0.00: | 97.03 |
| 1225039504: | 0: | 0: | 482: | 378: | 158: | 3: | 23: | 3: | 579: | 0.00: | 13.86: | 86.14 |
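Since cpurec writes one record per CPU per interval, a system-wide view needs the rows with the same timestamp to be combined. The sketch below averages the User, Sys and Idle percentages across CPUs. It assumes colon-separated records with the 13 columns shown above, and the function name is illustrative only.

```python
from collections import defaultdict


def average_cpu_percentages(lines):
    """Average User/Sys/Idle % over all CPUs, grouped by timestamp."""
    # Per timestamp: [sum_user, sum_sys, sum_idle, number_of_cpus]
    totals = defaultdict(lambda: [0.0, 0.0, 0.0, 0])
    for line in lines:
        parts = line.strip().split(":")
        if len(parts) != 13:
            continue  # not a cpurec record
        acc = totals[int(parts[0])]
        for i, value in enumerate(parts[10:13]):
            acc[i] += float(value)
        acc[3] += 1
    return {ts: tuple(v / acc[3] for v in acc[:3]) for ts, acc in totals.items()}


sample_lines = [
    "1225039500:1:97:568:49:1195:69:190:40:4183:5.95:4.37:89.68",
    "1225039500:0:98:936:513:1169:47:190:41:4097:6.17:4.62:89.21",
]
averages = average_cpu_percentages(sample_lines)
for ts in sorted(averages):
    print(ts, "user=%.2f sys=%.2f idle=%.2f" % averages[ts])
```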
nicrec is a utility from the K9toolkit, written by Brendan Gregg, that prints network traffic in KB/s read and written. The recorder outputs its data in RRD format.
nicrec is used to observe the KB/s transferred by every network card, including packet counts and average sizes. Recording points:
The output from nicrec is displayed below:
| timestamp | interface | read KB/s | write KB/s | rPackets/s | wPackets/s | read average | write average | Util % | Sat % |
| 1225354256: | e1000g0: | 72.29: | 3.07: | 72.09: | 41.56: | 1026.89: | 75.58: | 0.06: | 0.00: |
| 1225354256: | mac: | 72.29: | 3.07: | 72.09: | 41.56: | 1026.89: | 75.58: | 0.06: | 0.00: |
netrec is a utility that reports TCP, UDP and IP statistics from a running local or global zone. If the system deploys one or more zones and all zones share the same TCP/IP stack, you can simply use the -s flag to report the numbers just once. The recorder outputs its data in RRD format.
netrec is used to observe TCP, UDP and IP counters. Recording points:
The output from netrec is displayed below:
# netrec
global:1225354459:241756657:0:242661104:0:167385:47575:54354:4477:17:20962546:11345630:4070136829:978714:913964897:0:0:0:435380:161:10
lobby:1225354459:241756657:0:242661104:0:167385:47575:54354:4477:2:20962750:11345632:4070137181:978714:913964897:0:0:0:435380:161:10
wowza-test:1225354459:241756657:0:242661104:0:167385:47575:54354:4477:0:20962750:11345634:4070137533:978714:913964897:0:0:0:435380:161:10
dss:1225354460:241756657:0:242661104:0:167385:47575:54354:4477:0:20962750:11345636:4070137885:978714:913964897:0:0:0:435380:161:10
dss-test:1225354460:241756657:0:242661104:0:167385:47575:54354:4477:0:20962750:11345638:4070138237:978714:913964897:0:0:0:435380:161:10
# netrec -s
global:1225354465:241756657:0:242661105:0:167386:47575:54354:4477:16:20962798:11345713:4070158680:978715:913964897:0:0:0:435380:161:10
lobby:1225354465:2
wowza-test:1225354466:0
dss:1225354466:0
dss-test:1225354466:0
zonerec is a simple script that calls the prstat utility to report zone utilisation in human-readable format. This data needs to be parsed and prepared in RRD format. Future versions will include a new recorder which outputs its data in RRD format.
zonerec is used to observe CPU and memory utilisation, as reported by prstat.
The output from zonerec is displayed below:
# zonerec 60
2008-10-30:10:32:12 - 1225355532
ZONEID NPROC SWAP RSS MEMORY TIME CPU ZONE
0 110 1025M 1050M 26% 17:22:56 11% global
3 28 140M 199M 4.9% 3:24:32 0.1% wowza-test
1 32 47M 46M 1.1% 0:00:45 0.0% lobby
4 23 38M 37M 0.9% 0:00:38 0.0% dss
Total: 224 processes, 1064 lwps, load averages: 0.30, 0.30, 0.29
Zone: global
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
7323 sparvu 236M 157M cpu0 14 0 4:30:39 7.0% firefox-bin/12
640 sparvu 217M 169M sleep 59 0 4:46:25 1.8% Xorg/1
698 sparvu 64M 62M sleep 59 0 6:45:11 1.7% prstat/1
7343 sparvu 445M 307M sleep 57 4 0:33:13 0.6% java/76
689 sparvu 4008K 2024K sleep 49 0 0:12:42 0.1% cpubar.x86/1
7280 sparvu 258M 159M sleep 57 4 0:03:21 0.0% java/36
17860 root 6848K 3704K cpu1 50 0 0:00:00 0.0% prstat/1
17694 sparvu 5596K 1940K sleep 59 0 0:00:00 0.0% sshd/1
605 root 3532K 1280K sleep 59 0 0:00:00 0.0% sshd/1
17696 sparvu 1208K 1012K sleep 59 0 0:00:00 0.0% ksh/1
139 root 2248K 1332K sleep 59 0 0:00:00 0.0% syseventd/14
149 root 6108K 3500K sleep 59 0 0:00:00 0.0% devfsadm/9
NPROC USERNAME SWAP RSS MEMORY TIME CPU
56 sparvu 936M 969M 24% 17:19:14 11%
41 root 299M 359M 8.8% 0:02:42 0.1%
4 postgres 15M 16M 0.4% 0:00:59 0.0%
1 lp 1016K 3012K 0.1% 0:00:00 0.0%
1 smmsp 3712K 6040K 0.1% 0:00:00 0.0%
6 daemon 38M 29M 0.7% 0:00:01 0.0%
Total: 109 processes, 390 lwps, load averages: 0.30, 0.30, 0.29
Zone: lobby
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
17654 sparvu 5716K 1956K sleep 59 0 0:00:00 0.0% sshd/1
17686 sparvu 4852K 3100K sleep 59 0 0:00:00 0.0% ssh/1
1123 root 3532K 1340K sleep 59 0 0:00:00 0.0% sshd/1
1120 root 2088K 1204K sleep 59 0 0:00:00 0.0% ttymon/1
1164 root 9180K 5760K sleep 59 0 0:00:03 0.0% snmpd/1
1149 root 1740K 660K sleep 59 0 0:00:00 0.0% smcboot/1
1039 root 4064K 3016K sleep 59 0 0:00:04 0.0% inetd/4
1047 root 1108K 640K sleep 59 0 0:00:00 0.0% utmpd/1
774 root 0K 0K sleep 60 - 0:00:00 0.0% zsched/1
986 daemon 6048K 3484K sleep 59 0 0:00:01 0.0% nfsmapid/3
981 daemon 2440K 1064K sleep 59 0 0:00:00 0.0% rpcbind/1
796 root 11M 9648K sleep 59 0 0:00:07 0.0% svc.startd/14
NPROC USERNAME SWAP RSS MEMORY TIME CPU
3 sparvu 936M 969M 24% 0:00:00 0.0%
1 smmsp 3712K 6040K 0.1% 0:00:00 0.0%
6 daemon 38M 29M 0.7% 0:00:01 0.0%
22 root 299M 359M 8.8% 0:00:44 0.0%
Total: 32 processes, 119 lwps, load averages: 0.30, 0.30, 0.29
Zone: wowza-test
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
1379 root 167M 114M sleep 49 0 3:23:47 0.1% java/225
1350 root 1128K 820K sleep 59 0 0:00:00 0.0% sh/1
1743 daemon 2104K 1328K sleep 59 0 0:00:00 0.0% lockd/2
1673 root 2128K 1168K sleep 59 0 0:00:00 0.0% ttymon/1
1872 root 1740K 672K sleep 59 0 0:00:00 0.0% smcboot/1
1870 root 1740K 932K sleep 59 0 0:00:00 0.0% smcboot/1
1388 root 5016K 2668K sleep 59 0 0:00:11 0.0% nscd/28
1382 daemon 3912K 2028K sleep 59 0 0:00:00 0.0% kcfd/3
1256 root 10M 8956K sleep 59 0 0:00:11 0.0% svc.configd/20
1249 root 2160K 1196K sleep 59 0 0:00:00 0.0% init/1
947 root 0K 0K sleep 60 - 0:00:00 0.0% zsched/1
1254 root 10M 9100K sleep 59 0 0:00:07 0.0% svc.startd/12
1378 root 1100K 784K sleep 49 0 0:00:00 0.0% startup.sh/1
NPROC USERNAME SWAP RSS MEMORY TIME CPU
22 root 299M 359M 8.8% 3:24:32 0.1%
6 daemon 38M 29M 0.7% 0:00:00 0.0%
Total: 28 processes, 330 lwps, load averages: 0.30, 0.30, 0.29
Zone: dss
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
2050 root 2088K 1208K sleep 59 0 0:00:00 0.0% ttymon/1
1752 daemon 3912K 2020K sleep 59 0 0:00:00 0.0% kcfd/3
2000 daemon 2104K 1328K sleep 59 0 0:00:00 0.0% lockd/2
2090 root 1740K 936K sleep 59 0 0:00:00 0.0% smcboot/1
1750 root 5544K 2640K sleep 59 0 0:00:12 0.0% nscd/27
1975 daemon 2440K 1012K sleep 59 0 0:00:00 0.0% rpcbind/1
2018 root 2104K 1160K sleep 59 0 0:00:00 0.0% ttymon/1
1983 daemon 2456K 1596K sleep 59 0 0:00:00 0.0% statd/1
2042 root 1108K 644K sleep 59 0 0:00:00 0.0% utmpd/1
2091 root 1740K 672K sleep 59 0 0:00:00 0.0% smcboot/1
2028 root 3536K 1592K sleep 59 0 0:00:00 0.0% syslogd/13
1985 daemon 2124K 1320K sleep 59 0 0:00:00 0.0% nfs4cbd/2
1294 root 11M 8756K sleep 59 0 0:00:07 0.0% svc.startd/13
1972 root 2424K 1004K sleep 59 0 0:00:00 0.0% cron/1
NPROC USERNAME SWAP RSS MEMORY TIME CPU
6 daemon 38M 29M 0.7% 0:00:01 0.0%
17 root 299M 359M 8.8% 0:00:37 0.0%
Total: 23 processes, 97 lwps, load averages: 0.30, 0.30, 0.29
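The parsing step mentioned for zonerec could look roughly like the sketch below. It reads the timestamp line and the per-zone summary table from the output above and emits one colon-separated record per zone. The output layout `zone:timestamp:memory%:cpu%` is an assumption made for this example and is not a format defined by SDR.

```python
def zone_summary_to_records(text):
    """Turn the per-zone summary of zonerec output into colon-separated records."""
    records = []
    timestamp = None
    in_summary = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Timestamp line, e.g. "2008-10-30:10:32:12 - 1225355532"
        if len(fields) == 3 and fields[1] == "-" and fields[2].isdigit():
            timestamp = int(fields[2])
        elif fields[0] == "ZONEID":
            in_summary = True       # header of the per-zone table
        elif fields[0] == "Total:":
            in_summary = False      # end of the per-zone table
        elif in_summary and len(fields) == 8 and timestamp is not None:
            memory = fields[4].rstrip("%")
            cpu = fields[6].rstrip("%")
            zone = fields[7]
            records.append("%s:%d:%s:%s" % (zone, timestamp, memory, cpu))
    return records


sample_output = """2008-10-30:10:32:12 - 1225355532
ZONEID NPROC SWAP RSS MEMORY TIME CPU ZONE
0 110 1025M 1050M 26% 17:22:56 11% global
3 28 140M 199M 4.9% 3:24:32 0.1% wowza-test
Total: 224 processes, 1064 lwps, load averages: 0.30, 0.30, 0.29
Zone: global
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
7323 sparvu 236M 157M cpu0 14 0 4:30:39 7.0% firefox-bin/12
"""
for record in zone_summary_to_records(sample_output):
    print(record)
```

The per-process and per-user tables are ignored here, because only the zone summary maps cleanly onto one record per zone.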
corerec is a utility that uses corestat, from Cooltools. The output is in human-readable format and requires parsing and proper formatting for RRD.
corerec is used to observe core utilisation on a T1 or T2 processor. Since T1 and T2 use different registers to keep track of usage, the corestat utility is different in each case: corestat.t1 is used for T1 processors and corestat.t2 for T2. Recording points:
The output from corerec is displayed below:
For a T2 processor:
Core,Int-pipe %Usr %Sys %Usr+Sys
0,0 0.05 0.37 0.42
0,1 0.06 0.07 0.13
1,0 0.04 0.08 0.12
1,1 0.01 0.05 0.06
2,0 0.09 0.11 0.20
2,1 0.01 0.06 0.07
3,0 0.02 0.15 0.17
3,1 0.01 0.05 0.05
4,0 0.01 0.13 0.14
4,1 0.01 0.04 0.05
5,0 0.07 0.10 0.16
5,1 0.01 0.45 0.46
6,0 0.02 0.10 0.12
6,1 0.01 0.06 0.07
7,0 0.03 0.12 0.14
7,1 0.01 0.05 0.06
------------- ----- ----- ------
Avg 0.03 0.12 0.15
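A minimal sketch of the parsing that corerec output needs before it can go into RRD: it reads the per-pipeline rows of the T2 listing above, skips the header, separator and Avg lines, and picks the busiest integer pipeline. The function names are illustrative, and only a few rows of the listing are used as sample input.

```python
def parse_corestat_t2(text):
    """Return {(core, pipe): (usr, sys, usr_plus_sys)} from corestat T2 output."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4 or "," not in fields[0]:
            continue  # separator, Avg or empty line
        core, _, pipe = fields[0].partition(",")
        if not (core.isdigit() and pipe.isdigit()):
            continue  # header line "Core,Int-pipe ..."
        rows[(int(core), int(pipe))] = tuple(float(v) for v in fields[1:])
    return rows


def busiest_pipeline(rows):
    """Return the (core, pipe) key with the highest %Usr+Sys."""
    return max(rows, key=lambda key: rows[key][2])


sample_output = """Core,Int-pipe %Usr %Sys %Usr+Sys
0,0 0.05 0.37 0.42
0,1 0.06 0.07 0.13
5,0 0.07 0.10 0.16
5,1 0.01 0.45 0.46
------------- ----- ----- ------
Avg 0.03 0.12 0.15
"""
rows = parse_corestat_t2(sample_output)
print(busiest_pipeline(rows), rows[busiest_pipeline(rows)])
```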
It is important to note that utilisation for a T1 or T2 processor cannot be read from vmstat or mpstat alone. You have to use corerec to gather the correct utilisation figures. See below Ravindra Talashikar's notes about mpstat and vmstat on T1 processors!
jvmrec is a utility based on jstat, part of the JDK, that extracts garbage collection statistics from a running Java virtual machine. The recorder loops over all running zones on the system, finds each Java process and extracts its GC numbers. The recorder outputs its data in RRD format.
jvmrec records the GC statistics useful to understand how your JVMs are running. Recording points:
The output from jvmrec is displayed below:
| zone.pid | timestamp | S0% | S1% | Eden% | Old% | Perm% | No.mGC | Time.mGC | No.MGC | Time.MGC | Total GC |
| global.23699: | 1225360607: | 0.00: | 29.51: | 10.03: | 10.07: | 60.44: | 9: | 0.100: | 1: | 0.058: | 0.158 |
| global.23699: | 1225360668: | 0.00: | 29.51: | 12.76: | 10.07: | 60.44: | 9: | 0.100: | 1: | 0.058: | 0.158 |
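One way to use consecutive jvmrec samples is to estimate the share of wall-clock time a JVM spent in garbage collection between them. The sketch below assumes colon-separated records in the column order above, and that the last column is cumulative GC time in seconds, as jstat reports it. The second pair of records uses invented values purely to show a non-zero result.

```python
def gc_time_fraction(prev_line, curr_line):
    """Fraction of elapsed time spent in GC between two jvmrec samples."""
    def key_fields(line):
        parts = line.strip().split(":")
        # zone.pid, epoch timestamp, cumulative total GC time (last column)
        return parts[0], int(parts[1]), float(parts[-1])

    prev_id, prev_ts, prev_gc = key_fields(prev_line)
    curr_id, curr_ts, curr_gc = key_fields(curr_line)
    if prev_id != curr_id:
        raise ValueError("samples belong to different JVMs")
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples are not in time order")
    return (curr_gc - prev_gc) / elapsed


# The two samples shown above: no GC activity in between.
first = "global.23699:1225360607:0.00:29.51:10.03:10.07:60.44:9:0.100:1:0.058:0.158"
second = "global.23699:1225360668:0.00:29.51:12.76:10.07:60.44:9:0.100:1:0.058:0.158"
print(gc_time_fraction(first, second))

# Invented follow-up sample: 0.61 s more GC time over 61 s of wall-clock time.
third = "global.23699:1225360729:0.00:31.00:5.00:12.00:60.44:15:0.710:1:0.058:0.768"
print(round(gc_time_fraction(second, third), 4))
```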
SDR is a simple collection of scripts, easy to install and set up on Solaris 10 systems. The recorders operate under the Solaris Service Management Facility (SMF), a nice interface for running services on Solaris 10. Each recorder is monitored by SMF and restarted when needed. For releases older than Solaris 10, you need to set up rc scripts yourself.
| Version | Description | Release Notes |
| current: 0.63 | Recording, Reporting | ChangeLog 0.63 |
| future: 0.70 | Recording | ChangeLog 0.70 |
SDR uses SAR, the system activity reporter. SAR is started using SMF, so these are the main steps to get it started:
# svcs -a | grep rec
#

# svccfg validate sysrec.xml
# svccfg validate cpurec.xml
# svccfg validate nicrec.xml
# svccfg validate netrec.xml
# svccfg validate zonerec.xml
# svccfg import sysrec.xml
# svccfg import cpurec.xml
# svccfg import nicrec.xml
# svccfg import netrec.xml
# svccfg import zonerec.xml

# svcadm enable sysrec
# svcadm enable cpurec
# svcadm enable nicrec
# svcadm enable netrec
# svcadm enable zonerec
On syslog each recorder will report its status:
Oct 31 16:36:07 nereid root: [ID 702911 daemon.notice] Starting system recorder: sysrec
Oct 31 16:37:43 nereid root: [ID 702911 daemon.notice] Starting per-cpu recorder: cpurec
Oct 31 16:37:47 nereid root: [ID 702911 daemon.notice] Starting nic recorder: nicrec
Oct 31 16:37:52 nereid root: [ID 702911 daemon.notice] Starting net recorder: netrec
Oct 31 16:37:56 nereid root: [ID 702911 daemon.notice] Starting zone recorder: zonerec
# svcs -a | grep rec
online 16:36:07 svc:/application/sysrec:default
online 16:37:43 svc:/application/cpurec:default
online 16:37:47 svc:/application/nicrec:default
online 16:37:52 svc:/application/netrec:default
online 16:37:56 svc:/application/zonerec:default
# ptree
[...]
3958 /usr/bin/perl -w /opt/sdr/bin/sysrec 60
3972 /bin/perl -w /opt/sdr/bin/cpurec 60
3980 /usr/bin/perl -w /opt/sdr/bin/nicrec 60
3989 /bin/perl -w /opt/sdr/bin/netrec 60
4000 /bin/ksh -p /opt/sdr/bin/zonerec 60
4022 sleep 60
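The check that all five main recorders are online can also be scripted. The sketch below parses text in the three-column layout printed by svcs above (state, start time, FMRI) and lists the recorders that are not online. It works on captured output only and does not call svcs itself. The sample text mixes lines from the transcripts on this page for illustration.

```python
EXPECTED_RECORDERS = ("sysrec", "cpurec", "nicrec", "netrec", "zonerec")


def recorder_states(svcs_output):
    """Map recorder name to SMF state from `svcs` output lines."""
    states = {}
    for line in svcs_output.splitlines():
        fields = line.split()
        if len(fields) != 3 or not fields[2].startswith("svc:/application/"):
            continue
        # "svc:/application/sysrec:default" -> "sysrec"
        name = fields[2].split("/")[-1].split(":")[0]
        states[name] = fields[0]
    return states


def recorders_not_online(svcs_output):
    """Return the expected recorders that are missing or not in the online state."""
    states = recorder_states(svcs_output)
    return [name for name in EXPECTED_RECORDERS if states.get(name) != "online"]


sample_output = """online 16:36:07 svc:/application/sysrec:default
online 16:37:43 svc:/application/cpurec:default
online 16:37:47 svc:/application/nicrec:default
disabled 16:44:11 svc:/application/zonerec:default
"""
print(recorders_not_online(sample_output))
```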
# pwd
/opt/sdr/log/raw
# ls -lrt
total 25
-rw-r--r-- 1 root root 342 Oct 31 16:39 cpurec.raw
-rw-r--r-- 1 root root 522 Oct 31 16:39 nicrec.raw
-rw-r--r-- 1 root root 324 Oct 31 16:39 netrec.raw
-rw-r--r-- 1 root root 7552 Oct 31 16:39 zonerec.raw
-rw-r--r-- 1 root root 263 Oct 31 16:40 sysrec.raw
For each raw file, add a logadm entry that rotates the file at midnight and compresses it. To do this, make sure you are superuser and either modify the /etc/logadm.conf file or use the logadm utility to add the entries.
# SDR Monitoring
/opt/sdr/log/raw/sysrec.raw -c -p 1d -z 0
/opt/sdr/log/raw/cpurec.raw -c -p 1d -z 0
/opt/sdr/log/raw/nicrec.raw -c -p 1d -z 0
/opt/sdr/log/raw/netrec.raw -c -p 1d -z 0
/opt/sdr/log/raw/zonerec.raw -c -p 1d -z 0
At the end, check the consistency of logadm.conf by running:
# logadm -V
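When optional recorders are deployed as well, the matching logadm.conf lines can be generated rather than typed by hand. The sketch below reproduces the entry layout shown above: `-c` copies and truncates the log, `-p 1d` sets a one-day rotation period, and `-z 0` compresses every rotated file. The assumption that an optional recorder writes to `<name>.raw` in the same directory is not confirmed by this page.

```python
def logadm_entries(recorders, raw_dir="/opt/sdr/log/raw"):
    """Return logadm.conf lines that rotate each recorder's raw file daily."""
    return ["%s/%s.raw -c -p 1d -z 0" % (raw_dir, name) for name in recorders]


# jvmrec.raw is an assumed file name for the optional JVM recorder.
for entry in logadm_entries(["sysrec", "cpurec", "jvmrec"]):
    print(entry)
```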
# crontab -e
Add logadm here so that rotation runs at 00:05 every day instead of 3 AM, and move the compressed raw data into daily directories.
#
05 00 * * * /usr/sbin/logadm
10 00 * * * /opt/sdr/bin/raw2day

# svcadm disable sysrec
# svcadm disable cpurec
# svcadm disable nicrec
# svcadm disable netrec
# svcadm disable zonerec
# svcs -a | grep rec
disabled 16:43:55 svc:/application/sysrec:default
disabled 16:44:02 svc:/application/cpurec:default
disabled 16:44:05 svc:/application/nicrec:default
disabled 16:44:08 svc:/application/netrec:default
disabled 16:44:11 svc:/application/zonerec:default

# svccfg delete application/sysrec
# svccfg delete application/cpurec
# svccfg delete application/nicrec
# svccfg delete application/netrec
# svccfg delete application/zonerec
At this point SMF no longer knows about SDR:
# svcs -a | grep rec
#
SDR talks about utilisation and saturation rather than run queue length or other general operating system metrics. You can have all the other metrics as well, but the main idea of SDR is to express CPU, memory, disk and network I/O in terms of utilisation and saturation. In addition, SDR adds PDQ, a queueing model solver, which can be used to solve various problems!
The SDR recording part requires Solaris because of its tight integration with the KSTAT interface. A couple of recorders would need to be ported to Linux or FreeBSD, if required. Feel free to contribute code for your preferred operating system.
The SDR recording part includes several recorders, each designed to collect data from a particular area of your systems: CPU, zones, applications, etc. Instead of having one or two general recorders, I tried to design 5 main recorders which can be easily maintained and ported, plus others specialized for other purposes. Simplicity was the main criterion!
Simplicity was one of the main reasons. The KSTAT interface in Solaris can be accessed from a Perl or C program. Brendan Gregg, the author of sysperfstat, inspired me to keep using the same approach, KSTAT scripts. When I was not able to obtain the information from KSTAT, I used a simple Ksh script calling basic OS utilities. This last part needs improvement; zonerec and jvmrec are examples. The main goal is to use as few utilities as possible and gather all data from OS interfaces.
To gather data from the various Solaris zones, the KSTAT interface should be used. Currently there is an open effort to improve this. Meanwhile, prstat can be used to obtain data for each zone. Extended Process Accounting can also be used to obtain information about each process running on the physical machine. However, at the moment I'm looking into new ways to improve this.
Make sure you use SDR 0.70, which includes updates for sysrec and ZFS.
This document is Copyright (c) 2009 Stefan Parvu
Document License:
PDL