19 Connecting to Microsoft Azure Data Lake

Learn how to connect to Microsoft Azure Data Lake to process big data jobs with Oracle GoldenGate for Big Data.

Use these steps to connect to Microsoft Azure Data Lake from Oracle GoldenGate for Big Data.

  1. Download Hadoop 2.9.1 from http://hadoop.apache.org/releases.html.
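
    For example, older releases such as 2.9.1 remain available from the Apache release archive (the releases page lists the current download locations; this URL is one possible source):

    wget https://archive.apache.org/dist/hadoop/common/hadoop-2.9.1/hadoop-2.9.1.tar.gz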

  2. Extract the file in a temporary directory. For example, /ggwork/hadoop/hadoop-2.9.1.
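
    For example, using tar (this assumes the archive from step 1 is in the current directory):

    mkdir -p /ggwork/hadoop
    tar -xzf hadoop-2.9.1.tar.gz -C /ggwork/hadoop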

  3. Edit the /ggwork/hadoop/hadoop-2.9.1/etc/hadoop/hadoop-env.sh file.

  4. Add entries for the JAVA_HOME and HADOOP_CLASSPATH environment variables:

    export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
    export HADOOP_CLASSPATH=/ggwork/hadoop/hadoop-2.9.1/share/hadoop/tools/lib/*:$HADOOP_CLASSPATH
    

    These entries point Hadoop at Java 8 and add share/hadoop/tools/lib to the Hadoop classpath. That directory is not on the classpath by default, and it contains the required Azure libraries.
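
    You can confirm that the Azure Data Lake client libraries are present before going further. The jar names below are what the 2.9.1 distribution ships; other releases may differ:

    ls /ggwork/hadoop/hadoop-2.9.1/share/hadoop/tools/lib | grep -i azure
    # Expect entries such as azure-data-lake-store-sdk-*.jar and
    # hadoop-azure-datalake-2.9.1.jar.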

  5. Edit the /ggwork/hadoop/hadoop-2.9.1/etc/hadoop/core-site.xml file and add:

    <configuration>
      <property>
        <name>fs.adl.oauth2.access.token.provider.type</name>
        <value>ClientCredential</value>
      </property>
      <property>
        <name>fs.adl.oauth2.refresh.url</name>
        <value>Insert the Azure https URL here to obtain the access token</value>
      </property>
      <property>
        <name>fs.adl.oauth2.client.id</name>
        <value>Insert the client id here</value>
      </property>
      <property>
        <name>fs.adl.oauth2.credential</name>
        <value>Insert the password here</value>
      </property>
      <property>
        <name>fs.defaultFS</name>
        <value>adl://Insert the account name here.azuredatalakestore.net</value>
      </property>
    </configuration>
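
    For illustration, a filled-in core-site.xml might look like the following. The tenant id, client id, credential, and account name are hypothetical; the refresh URL follows the standard Azure AD OAuth 2.0 token endpoint format, https://login.microsoftonline.com/<tenant id>/oauth2/token:

    <configuration>
      <property>
        <name>fs.adl.oauth2.access.token.provider.type</name>
        <value>ClientCredential</value>
      </property>
      <property>
        <name>fs.adl.oauth2.refresh.url</name>
        <value>https://login.microsoftonline.com/00000000-0000-0000-0000-000000000000/oauth2/token</value>
      </property>
      <property>
        <name>fs.adl.oauth2.client.id</name>
        <value>11111111-2222-3333-4444-555555555555</value>
      </property>
      <property>
        <name>fs.adl.oauth2.credential</name>
        <value>myServicePrincipalSecret</value>
      </property>
      <property>
        <name>fs.defaultFS</name>
        <value>adl://mydatalakestore.azuredatalakestore.net</value>
      </property>
    </configuration>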
    
  6. Open your firewall so that you can reach both the Azure URL that issues the access token and the Azure Data Lake URL, or disconnect from your network or VPN. Per the Apache Hadoop documentation, access to Azure Data Lake does not currently support a proxy server.
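
    Before involving Hadoop, you can verify that the token endpoint is reachable and that the service principal credentials are accepted. The tenant id, client id, and secret below are hypothetical placeholders; https://datalake.azure.net/ is the resource identifier for Azure Data Lake:

    curl -s -X POST \
      -d 'grant_type=client_credentials' \
      -d 'client_id=11111111-2222-3333-4444-555555555555' \
      -d 'client_secret=myServicePrincipalSecret' \
      -d 'resource=https://datalake.azure.net/' \
      https://login.microsoftonline.com/00000000-0000-0000-0000-000000000000/oauth2/token
    # A JSON response containing an access_token field indicates success.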

  7. Use the Hadoop shell commands to test connectivity to Azure Data Lake. For example, from the 2.9.1 Hadoop installation directory, execute this command to list the root directory of the Data Lake store (because fs.defaultFS names the adl URI, the store is the default file system):

    ./bin/hadoop fs -ls /
    
  8. Verify connectivity to Azure Data Lake: the command should return a listing of the root directory of the Data Lake store rather than an authentication or connection error.
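
    You can also address the store explicitly by URI rather than relying on fs.defaultFS (hypothetical account name):

    ./bin/hadoop fs -ls adl://mydatalakestore.azuredatalakestore.net/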

  9. Configure either the HDFS Handler, or the File Writer Handler in conjunction with the HDFS Event Handler, to push data to Azure Data Lake; see Using the File Writer Handler. Oracle recommends that you use the File Writer Handler with the HDFS Event Handler; a sketch of those properties follows the gg.classpath example below.

    Example of setting gg.classpath:

    gg.classpath=/ggwork/hadoop/hadoop-2.9.1/share/hadoop/common/*:/ggwork/hadoop/hadoop-2.9.1/share/hadoop/common/lib/*:/ggwork/hadoop/hadoop-2.9.1/share/hadoop/hdfs/*:/ggwork/hadoop/hadoop-2.9.1/share/hadoop/hdfs/lib/*:/ggwork/hadoop/hadoop-2.9.1/etc/hadoop:/ggwork/hadoop/hadoop-2.9.1/share/hadoop/tools/lib/*
    
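    For illustration, a minimal File Writer Handler configuration that chains to the HDFS Event Handler might look like the following in the handler properties file. The handler and event handler names (filewriter, hdfs) and all values are illustrative sketches; see the File Writer Handler and HDFS Event Handler documentation for the complete property set:

    gg.handlerlist=filewriter
    gg.handler.filewriter.type=filewriter
    gg.handler.filewriter.mode=op
    gg.handler.filewriter.pathMappingTemplate=./dirout
    gg.handler.filewriter.stateFileDirectory=./dirsta
    gg.handler.filewriter.eventHandler=hdfs
    gg.eventhandler.hdfs.type=hdfs
    gg.eventhandler.hdfs.pathMappingTemplate=/ogg/${fullyQualifiedTableName}
    gg.eventhandler.hdfs.finalizeAction=none

    With fs.defaultFS set to the adl URI in core-site.xml, files that the HDFS Event Handler uploads through the HDFS client land in Azure Data Lake.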

See https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html.