Chapter 15. Troubleshooting and Debugging Apache HBase


Table of Contents

15.1. General Guidelines

15.2. Logs

15.2.1. Log Locations
15.2.2. Log Levels
15.2.3. JVM Garbage Collection Logs

15.3. Resources

15.3.1. search-hadoop.com
15.3.2. Mailing Lists
15.3.3. IRC
15.3.4. JIRA

15.4. Tools

15.4.1. Builtin Tools
15.4.2. External Tools

15.5. Client

15.5.1. ScannerTimeoutException or UnknownScannerException
15.5.2. LeaseException when calling Scanner.next
15.5.3. Shell or client application throws lots of scary exceptions during normal operation
15.5.4. Long Client Pauses With Compression
15.5.5. Secure Client Connect ([Caused by GSSException: No valid credentials provided...])
15.5.6. ZooKeeper Client Connection Errors
15.5.7. Client running out of memory though heap size seems to be stable (but the off-heap/direct heap keeps growing)
15.5.8. Client Slowdown When Calling Admin Methods (flush, compact, etc.)
15.5.9. Secure Client Cannot Connect ([Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)])

15.6. MapReduce

15.6.1. You Think You're On The Cluster, But You're Actually Local
15.6.2. Launching a job, you get java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString or class com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass com.google.protobuf.LiteralByteString

15.7. NameNode

15.7.1. HDFS Utilization of Tables and Regions
15.7.2. Browsing HDFS for HBase Objects

15.8. Network

15.8.1. Network Spikes
15.8.2. Loopback IP
15.8.3. Network Interfaces

15.9. RegionServer

15.9.1. Startup Errors
15.9.2. Runtime Errors
15.9.3. Snapshot Errors Due to Reverse DNS
15.9.4. Shutdown Errors

15.10. Master

15.10.1. Startup Errors
15.10.2. Shutdown Errors

15.11. ZooKeeper

15.11.1. Startup Errors
15.11.2. ZooKeeper, The Cluster Canary

15.12. Amazon EC2

15.12.1. ZooKeeper does not seem to work on Amazon EC2
15.12.2. Instability on Amazon EC2
15.12.3. Remote Java Connection into EC2 Cluster Not Working

15.13. HBase and Hadoop version issues

15.13.1. NoClassDefFoundError when trying to run 0.90.x on hadoop-0.20.205.x (or hadoop-1.0.x)
15.13.2. ...cannot communicate with client version...

15.14. IPC Configuration Conflicts with Hadoop

15.15. HBase and HDFS

15.16. Running unit or integration tests

15.16.1. Runtime exceptions from MiniDFSCluster when running tests

15.17. Case Studies

15.18. Cryptographic Features

15.18.1. sun.security.pkcs11.wrapper.PKCS11Exception: CKR_ARGUMENTS_BAD

15.19. Operating System Specific Issues

15.19.1. Page Allocation Failure

15.20. JDK Issues

15.20.1. NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet

15.1. General Guidelines

Always start with the master log (TODO: Which lines?). Normally it just prints the same lines over and over again. If it does not, there is an issue. Searching Google or search-hadoop.com for the exceptions you are seeing should return some hits.

An error rarely comes alone in Apache HBase. Usually, when something goes wrong, hundreds of exceptions and stack traces follow from all over the place. The best way to approach this type of problem is to walk the log back up to where it all began. For example, RegionServers print some metrics when aborting, so grepping for Dump should get you close to the start of the problem.
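The "walk the log up" approach can be sketched in a few lines of Python. This is only an illustration: the function name `find_first_dump`, its parameters, and the default of five context lines are hypothetical. The only detail taken from the text is the search term Dump.

```python
from collections import deque
from typing import Iterable, List, Optional, Tuple


def find_first_dump(
    lines: Iterable[str], marker: str = "Dump", context: int = 5
) -> Optional[Tuple[int, List[str]]]:
    """Locate the first line containing `marker`.

    Returns (line_number, preceding_lines), where line_number is 1-based and
    preceding_lines holds up to `context` lines that came just before it.
    Returns None if the marker never appears.
    """
    # A bounded deque keeps only the most recent `context` lines in memory,
    # so large log files can be streamed without loading them fully.
    recent = deque(maxlen=context)
    for number, line in enumerate(lines, start=1):
        if marker in line:
            return number, list(recent)
        recent.append(line.rstrip("\n"))
    return None
```

To use it, pass an open file object for a RegionServer log (the path depends on your installation) and read upward from the reported line number. The lines just before the first match are usually a better starting point than the flood of exceptions that follows.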

RegionServer suicides are “normal”, as this is what they do when something goes wrong. For example, if ulimit and max transfer threads (the two most important initial settings, see Limits on Number of Files and Processes (ulimit) and Section 2.1.1.5, “dfs.datanode.max.transfer.threads”) aren’t changed, at some point DataNodes will be unable to create new threads, which from the HBase point of view looks as if HDFS were gone. Think about what would happen if your MySQL database were suddenly unable to access files on your local file system; it is the same with HBase and HDFS. Another very common reason to see RegionServers committing seppuku is a prolonged garbage collection pause that lasts longer than the default ZooKeeper session timeout. For more information on GC pauses, see the 3 part blog post by Todd Lipcon and Section 14.3.1.1, “Long GC pauses” above.
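As a quick sanity check on the ulimit setting mentioned above, the following sketch reads the open-file limit of the current process with Python's standard `resource` module (Unix only). The threshold `RECOMMENDED_NOFILE = 10240` is an assumption made for illustration, so take the value your deployment should use from the ulimit section referenced above. The script reports the limit of whichever user runs it, so it only reflects HBase if it is run as the user that starts the HBase and HDFS daemons.

```python
import resource

# Assumption for illustration only: adjust to the value your deployment needs.
RECOMMENDED_NOFILE = 10240


def open_file_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def limit_is_sufficient(soft, minimum=RECOMMENDED_NOFILE):
    """Return True if the soft limit is unlimited or at least `minimum`."""
    if soft == resource.RLIM_INFINITY:
        return True
    return soft >= minimum


if __name__ == "__main__":
    soft, hard = open_file_limits()
    status = "ok" if limit_is_sufficient(soft) else "too low"
    print(f"open files: soft={soft} hard={hard} ({status})")
```

This does not check dfs.datanode.max.transfer.threads, which is an HDFS configuration property and has to be verified in the DataNode configuration.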
