Following a scheduled reboot, one data node began exhibiting unusual behavior. Routine MapReduce jobs run against HBase tables, which regularly completed in five or six minutes, began taking 30 or 40 minutes to finish. These jobs were consistently found to be waiting on map and reduce tasks assigned to the troubled data node (e.g., the slow map tasks all had the same Input Split). The situation came to a head during a distributed copy, when the copy was severely prolonged by the lagging node.
Datanodes:
Two 12-core processors
Six enterprise SATA disks
24GB of RAM
Two bonded gigabit NICs
Network:
10 Gigabit top-of-rack switches
20 Gigabit bonded interconnects between racks.
We hypothesized that we were experiencing a familiar point of pain: a "hot spot" region in an HBase table, where uneven key-space distribution can funnel a huge number of requests to a single HBase region, bombarding the RegionServer process and causing slow response times. Examination of the HBase Master status page showed that the number of HBase requests to the troubled node was almost zero. Further, examination of the HBase logs showed that there were no region splits, compactions, or other region transitions in progress. This effectively ruled out a "hot spot" as the root cause of the observed slowness.
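For reference, the same information can be pulled from the command line; the following is a rough sketch (the exact shell output format and the RegionServer log location vary by HBase version and installation, so both are assumptions here):

$ echo "status 'detailed'" | hbase shell        # per-RegionServer request rates and per-region request counts
$ grep -iE "split|compact" /var/log/hbase/*regionserver*.log | tail    # recent splits/compactions; log path is an assumption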
Our next hypothesis was that one of the MapReduce tasks was requesting data from HBase that was not local to the datanode, thus forcing HDFS to request data blocks from other servers over the network. Examination of the datanode logs showed that there were very few blocks being requested over the network, indicating that the HBase region was correctly assigned, and that the majority of the necessary data was located on the node. This ruled out the possibility of non-local data causing a slowdown.
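Two quick ways to sanity-check HDFS locality from the shell, sketched here with hypothetical log and table paths:

$ grep clienttrace /var/log/hadoop/*datanode*.log | grep HDFS_READ | tail
      # DataNode clienttrace lines record each block read; remote client addresses
      # on HDFS_READ operations indicate blocks being served over the network
$ hadoop fsck /hbase/mytable -files -blocks -locations | less
      # lists which datanodes hold each block of the (hypothetical) table's files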
After concluding that Hadoop and HBase were not likely to be the culprits, we moved on to troubleshooting the datanode's hardware. Java, by design, will periodically scan its entire memory space to do garbage collection. If system memory is heavily overcommitted, the Linux kernel may enter a vicious cycle, using up all of its resources swapping Java heap back and forth from disk to RAM as Java tries to run garbage collection. Further, a failing hard disk will often retry reads and/or writes many times before giving up and returning an error. This can manifest as high iowait, as running processes wait for reads and writes to complete. Finally, a disk nearing the upper edge of its performance envelope will begin to cause iowait as it informs the kernel that it cannot accept any more data, and the kernel queues incoming data into the dirty write pool in memory. However, using vmstat(1) and free(1), we could see that no swap was being used, and the amount of disk IO was only a few kilobytes per second.
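Those checks are quick to reproduce; for example (column names are those printed by procps and may differ slightly between distributions):

$ free -m          # the Swap "used" column should be at or near zero
$ vmstat 1 5       # si/so (swap in/out) should be 0; bi/bo show disk blocks in/out;
                   # a persistently high "wa" column would indicate processes blocked on iowait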
Next, we checked to see whether the system was performing slowly simply due to very high computational load. top(1) showed that the system load was higher than normal, but vmstat(1) and mpstat(1) showed that the amount of processor being used for actual computation was low.
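A sketch of that comparison using the same tools (the flags shown are common defaults, not necessarily the exact invocations used at the time):

$ top -b -n 1 | head -5     # load average relative to the number of cores
$ mpstat -P ALL 1 3         # low per-CPU %usr and %sys alongside high load points away from CPU-bound work
$ vmstat 1 5                # the us/sy columns show time actually spent computing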
Since neither the disks nor the processors were being utilized heavily, we moved on to the performance of the network interfaces. The datanode had two gigabit ethernet adapters, bonded to form an active-standby interface. ifconfig(8) showed some unusual anomalies, namely interface errors, overruns, and framing errors. While not unheard of, these kinds of errors are exceedingly rare on modern hardware which is operating as it should:
$ /sbin/ifconfig bond0
bond0  Link encap:Ethernet  HWaddr 00:00:00:00:00:00
       inet addr:10.x.x.x  Bcast:10.x.x.255  Mask:255.255.255.0
       UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
       RX packets:2990700159 errors:12 dropped:0 overruns:1 frame:6  <--- Look Here! Errors!
       TX packets:3443518196 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:0
       RX bytes:2416328868676 (2.4 TB)  TX bytes:3464991094001 (3.4 TB)
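These counters are cumulative since boot, so it is worth confirming that they are still climbing rather than left over from an old incident; one way (a sketch) is to watch them for a minute or two:

$ watch -d -n 5 '/sbin/ifconfig bond0 | grep -E "errors|overruns"'
      # -d highlights fields that changed between samples; steadily increasing
      # counts indicate an ongoing link problem rather than a one-time event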
These errors immediately led us to suspect that one or more of the ethernet interfaces might have negotiated the wrong line speed. This was confirmed both by running an ICMP ping from an external host and observing round-trip times in excess of 700ms, and by running ethtool(8) on the members of the bond interface and discovering that the active interface was operating at 100Mb/s, full duplex.
$ sudo ethtool eth0
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: Yes
        Link partner advertised link modes:  Not reported
        Link partner advertised pause frame use: No
        Link partner advertised auto-negotiation: No
        Speed: 100Mb/s                         <--- Look Here! Should say 1000Mb/s!
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        MDI-X: Unknown
        Supports Wake-on: umbg
        Wake-on: g
        Current message level: 0x00000003 (3)
        Link detected: yes
In normal operation, the ICMP ping round-trip time should be around 20ms, and the interface speed and duplex should read "1000Mb/s" and "Full", respectively.
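To identify which bond member is currently active, and to confirm that the standby has negotiated the correct speed before failing over to it, the bonding driver's status file can be consulted; a sketch, assuming the second slave is eth1:

$ grep "Currently Active Slave" /proc/net/bonding/bond0
$ sudo ethtool eth1 | grep -E "Speed|Duplex"    # the standby should report 1000Mb/s, Full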
After determining that the active ethernet adapter was at the incorrect speed, we used the ifenslave(8) command to make the standby interface the active interface, which yielded an immediate improvement in MapReduce performance and a tenfold improvement in network throughput.
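The failover itself is a single command; a sketch assuming eth0 is the degraded active slave and eth1 is the healthy standby:

$ sudo ifenslave -c bond0 eth1       # -c (--change-active) promotes eth1 to the active slave
$ ping -c 5 some-external-host       # hostname is a placeholder; round-trip times should drop back to normal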
On the next trip to the datacenter, we determined that the line speed issue was ultimately caused by a bad network cable, which was replaced.
Investigation results of a self-described "we're not sure what's wrong, but it seems slow" problem. http://gbif.blogspot.com/2012/03/hbase-performance-evaluation-continued.html
Investigation results of general cluster performance from 2010. Although this research is on an older version of the codebase, this writeup is still very useful in terms of approach. http://hstack.org/hbase-performance-testing/
Case study of configuring max.transfer.threads (previously known as xcievers) and diagnosing errors from misconfigurations. http://www.larsgeorge.com/2012/03/hadoop-hbase-and-xceivers.html
See also Section 2.1.1.5, "dfs.datanode.max.transfer.threads".