User Guide for Hadoop#
Overview#
This guide introduces how to install and configure the ADPS agent, and how to properly use ADPS to back up and restore Hadoop.
The backup and restore features supported by ADPS include:
Backup sources
Single or multiple directories and files on Hadoop Distributed File System (HDFS)
Backup types
Full backup, incremental backup, and synthetic backup
Backup targets
Standard storage pool, deduplication storage pool, local storage pool, file synthetic pool, tape library pool, object storage service pool, and LAN-free pool
Backup schedules
Immediate, one-time, hourly, daily, weekly, and monthly.
Data processing
Data compression, data encryption, multiple channels, reconnection, speed limit, and replication
Restore types
Point-in-time restore, instant recovery, and recovery testing
Restore targets
Original host, different host, different system (restore files between Hadoop and Linux), and different architecture (restore files from Hadoop to an operating system or object storage)
Restore options
Incremental restore, restore location (original path or customized path), and process invalid paths
Planning and preparation#
Before you install the agent, check the following prerequisites:
You have already installed and configured other backup components, including the backup server and the storage server.
You have created a user with the operator and administrator roles on the ADPS console. Log in to the console with this user to back up and restore the resource.
Note
The administrator role can install and configure agents, activate licenses, and authorize users. The operator role can create backup/restore jobs and conduct copy data management (CDM).
Install and configure the agent#
To back up and restore Hadoop, first install the ADPS agent on a host that can communicate with Hadoop.
Verify the compatibility#
Before you install the agent, ensure that the Hadoop environment is on the Aurreum Data Protection Suite’s compatibility lists.
ADPS supports the backup and restore of Hadoop of multiple versions, including:
Hadoop 2.2.x/2.6.x/2.7.x/2.8.x/2.9.x/3.0.x/3.1.x/3.2.x
CDH 6.0/6.1/6.2/6.3
Install the agent#
The ADPS agent can be installed on Linux and supports online and offline installation. We recommend online installation.
Online installation: ADPS provides curl and wget commands for installation.

Offline installation: See Offline installation in Aurreum Data Protection Suite Agent Installation Guide.
Before you install the Hadoop backup agent, install the Hadoop runtime environment on the backup host.
Unzip the Hadoop runtime environment offline package to a directory (directory C in this example):

$ sudo tar -axf hadoop-2.10.0.tar.xz -C <dir>

The hadoop-2.10.0 directory appears in directory C.

Install OpenJDK:
$ sudo tar -axf Ubuntu20.04-OpenJDK11-AMD64.tar.gz

The openjdk11 directory appears in the directory. Install only the openjdk-11-jre-headless environment:

$ sudo dpkg -i openjdk11/*.deb
Note

Fill in the JRE directory and version. The default directory is /usr/lib/jvm/jre.
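If you are not sure which JRE directory and version to fill in, you can check them on the backup host first. This is a minimal sketch that assumes the default OpenJDK 11 layout described above:

# List the installed JVM directories (the default JRE directory is /usr/lib/jvm/jre)
$ ls -l /usr/lib/jvm/

# Confirm the Java runtime version that the agent will use
$ java -version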
To install the agent online, do the following:
Log in to the ADPS console.
From the menu, click Resource > Resource. The Resource page appears.
From the toolbar, click the Install agent icon. The Install agent window appears.
In the Install agent window, do the following:
(1) From the Select system list, select Linux.
(2) From the Component list, select Hadoop. The curl and wget commands appear in the window.

(3) If you want to delete the downloaded package automatically after the installation, select the Delete installation package check box.

(4) If you enable Ignore SSL errors, the installation ignores certificate errors. If you disable the option, the installation prompts you to enter Y/N to continue or stop the process when an error occurs.
Click the Copy icon to copy the curl or wget command.

Log in to the Linux host as user root. Paste the command in the terminal and press Enter to start the installation. Example:
root@ubuntu:~# curl "http://192.168.17.90:50305/d2/update/script?modules=hadoop&location=http%3A%2F%2F192.168.17.90%3A50305&access_key=929a401135dc8f06efbc29c3ea86e3f9&rm=yes&tool=curl" | sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9205    0  9205    0     0  1797k      0 --:--:-- --:--:-- --:--:-- 2247k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 52.3M  100 52.3M    0     0   297M      0 --:--:-- --:--:-- --:--:--  297M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 5730k  100 5730k    0     0  81.1M      0 --:--:-- --:--:-- --:--:-- 81.1M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 1576k  100 1576k    0     0  7690k      0 --:--:-- --:--:-- --:--:-- 7690k
Wait for the installation to complete.
Add and activate a Hadoop cluster#
Add a Hadoop cluster#
From the menu, click Resource > Resource. The Resource page appears.
From the toolbar, click the “+” icon and select Hadoop cluster. The Add Hadoop cluster window appears.
ADPS supports two authentication methods for Hadoop clusters: Simple and Kerberos. If you have configured your Hadoop cluster with the Kerberos authentication service, select Kerberos when adding the cluster. If your Hadoop cluster does not have the Kerberos service, you can use the default Simple authentication. You can click the “+” icon in the bottom left corner of the window to add multiple NameNodes.
In the Add Hadoop cluster window, do the following:
(1) Create the Hadoop resource.
In the Name field, enter a name for the cluster.
From the Backup host list, select a host that is used to list the backup content and that serves as the default host for backups and restores.
In the Host field, enter the IP address or the name of the host where the NameNode resides. If you created the Principal with the hostname when configuring Kerberos authentication, enter the hostname (rather than the IP) in this field, and add the IP address and the corresponding hostname to the hosts file on the selected backup host.

Enable SSL only when the Hadoop cluster has configured and enabled the HTTPS service. Otherwise, deselect the option.

In the REST API port field, enter the port number. The default number for HTTP is 50070, and for HTTPS is 50470. Modify the port number according to your cluster configuration. You can check whether a port number is configured for the dfs.namenode.http(s)-address parameter in the hdfs-site.xml file (see the example commands after this list). If no port number is configured, use the default number in this field.

In the RPC API port field, enter the port number. The default number is 8020. Modify the port number according to your cluster configuration. You can check whether a port number is configured for the fs.defaultFS parameter in the core-site.xml file. If no port number is configured, use the default number in this field.

In the User field, enter an HDFS user. If HDFS is configured with Kerberos authentication, enter the user of the Principal authenticated in the keytab file. Example: for a Principal user test@HADOOP.COM, enter test in this field.
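If you are unsure which port numbers or hostname mappings to use, you can check them directly on the cluster and backup hosts. This is a minimal sketch; the configuration directory /etc/hadoop/conf and all addresses are examples that may differ in your deployment:

# Check the REST API (HTTP/HTTPS) address and port on a NameNode host
$ grep -A 1 'dfs.namenode.http' /etc/hadoop/conf/hdfs-site.xml

# Check the RPC API address and port
$ grep -A 1 'fs.defaultFS' /etc/hadoop/conf/core-site.xml

# If Kerberos Principals were created with hostnames, map the NameNode IP to its hostname
# on the backup host (placeholder values)
$ echo "192.168.17.100  namenode01.example.com" | sudo tee -a /etc/hosts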
(2) Simple authentication
From the Authentication method list, select Simple.
In the core-site.xml file field, upload the core-site.xml file of the cluster. This option is not required for Simple authentication.

In the hdfs-site.xml file field, upload the hdfs-site.xml file of the cluster. This option is not required for Simple authentication.
(3) Kerberos authentication
From the Authentication method list, select Kerberos.
In the Realm name field, enter the Realm name that you configured for Kerberos.
In the Realm KDC server field, enter the IP or hostname of the Realm KDC server. The default port number is 88. Add the number if your port is different from the default number.
In the Realm Admin server field, enter the IP or hostname of the Realm Admin server. The default port number is 88. Add the number if your port is different from the default number.
In the RPC API Principal field, enter the name of the RPC API Principal. You can check the name in the keytab file. Example: klist -k -t test.keytab.

In the REST API Principal field, enter the name of the REST API Principal. You can check the name in the keytab file. Example: klist -k -t test.keytab.

In the UDP preference limit field, set the maximum size of a UDP transport packet. When a data packet exceeds this value, TCP is used for the transport. The default value is 1, with which TCP is always used. Adjust this value according to the corresponding parameter in the /etc/krb5.conf file of the KDC service (see the commands after this list).

For the krb5.keytab file field, get the keytab file and copy it to a safe location on the host that has access to the ADPS console.

In the core-site.xml file field, upload the core-site.xml file of the cluster. This option is required for Kerberos authentication.

In the hdfs-site.xml file field, upload the hdfs-site.xml file of the cluster. This option is required for Kerberos authentication.
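The following commands can help you collect the values above; the keytab file name is an example and the krb5.conf location may differ in your environment:

# List the Principals stored in the keytab to find the RPC API and REST API Principal names
$ klist -k -t test.keytab

# Check the UDP preference limit configured for Kerberos (usually under [libdefaults])
$ grep udp_preference_limit /etc/krb5.conf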
Activate the Hadoop license#
To activate the Hadoop license and authorize users, do the following:
After you add the Hadoop cluster, a window to activate the Hadoop license appears. Click Activate.
After activation, click Authorize beside the Hadoop resource. The Authorize window appears.
In the Authorize window, do the following:
(1) From the User group list, select user groups to authorize access to the resource.
(2) Set the resource as Protected or not. The Protected resource cannot be the restore target or the data replication target unless the administrator removes the Protected label.
Note
If you are prompted with “No enough licenses”, contact the administrator to add licenses.
If the parameters of the added cluster have changed, including the IP, port, and authentication method, you can click Configure to modify the configuration.
Backup#
Backup types#
ADPS provides two common backup types (full backup and incremental backup) and one advanced backup type (synthetic backup) for Hadoop.
Full backup
Backs up the selected Hadoop directories and files, copying all of them at a certain point in time.
Incremental backup
Backs up only the files that have changed since the last backup (including the full backup and the incremental backup) with a full backup as the baseline.
Synthetic backup
The first synthetic backup is a full backup and the subsequent ones are incremental. When the synthesis condition is reached, the latest full backup and the subsequent incremental backups are synthesized into a new full backup copy. Synthetic backups can improve restore performance. You can directly mount the full backup copy to the target host through an instant recovery job without physical copies or additional storage space.
Backup policies#
ADPS provides six backup schedule types: immediate, one-time, hourly, daily, weekly, and monthly.
Immediate: ADPS will immediately start the job after it is created.
One-time: ADPS will perform the job at the specified time once only.
Hourly: ADPS will perform the job periodically at the specified hour/minute intervals within the time range according to the setting.
Daily: ADPS will perform the job periodically at the specified time and day intervals.
Weekly: ADPS will perform the job periodically at the specified time and week intervals.
Monthly: ADPS will perform the job periodically at the specified dates and times.
You can set an appropriate backup policy based on your situation and requirements. Usually, we recommend the following common backup policy:
Perform a full backup once a week when the application traffic is relatively small (Example: on the weekend) to ensure that you have a recoverable point in time every week.
Perform an incremental backup every day when the application traffic is relatively small (Example: at 2 a.m.) to ensure that you have a recoverable point in time every day, which can save storage space and backup time.
To use the advanced synthetic backup, we recommend the following backup policy:
Perform a synthetic backup every day to ensure that you have a recoverable point in time every day.
Before you begin#
Before you back up and restore Hadoop, check whether any storage pools have been created and authorized.
From the menu, click Storage > Storage pool. The Storage pool page appears.
Check whether the display area has any storage pools. If not, create a storage pool and authorize it for the current user. For details, see Add a storage pool in Aurreum Data Protection Suite Administrator’s Guide.
Note
To use synthetic backups, ensure that your environment fulfills the following requirements:
You have advanced licenses: Hadoop Synthetic Backup and Hadoop Copy Data Management.
You have created a file synthetic pool for the current user.
Create a backup job#
To create a backup job, do the following:
From the menu, click Backup. The backup job wizard appears.
At the Hosts and resources step, select the Hadoop host and resource. The wizard goes to the next step automatically.
At the Backup source step, do the following:
Note
For incremental backups, the Backup source step only requires you to select a full backup as the baseline. You do not need to select the directories and files again.
(1) From the Backup type list, select a backup type.
(2) In the Backup source section, click + to expand folders and select the files or folders that you want to back up.
(3) If you want to filter the selected files and folders in the Backup source, click Filter below the Backup source field. The Filter window appears.
The Exclusion option is disabled by default. If you want to exclude some directories or files from the backup job, enter the directories and files in this field.
You can select the Inclusion checkbox and enter directories and files that you want to keep in the backup even though their parent directories are listed in the Exclusion field.
Note
For example, there are directories /data and /test. /test has hundreds of files. Some are .txt, some are .dat, and so on. The whole directory /data and all the .txt files under the directory /test need to be backed up.

First select /test and /data in the Backup source. Then open the Filter window.

Enter /test in the Exclusion field.

Select the Inclusion checkbox and enter *.txt in the field.

The backup result will be /data with all the data and /test with only .txt files.
Here is an example of using the wildcard * in Filter. Assume that the backup source includes the following directories and files:

root@ubuntu:/# tree /backup/
/backup/
└── test
    ├── group_1
    │   └── sub_group
    │       ├── file1.dat
    │       └── file1.txt
    ├── group_2
    │   └── sub_group
    │       ├── file2.dat
    │       └── file2.txt
    └── no_group

| Exclusion | Inclusion | Result |
|---|---|---|
| /backup/* | /backup/test/group_*/* | Directories group_1 and group_2 and all their subdirectories are backed up |
| /backup/* | *.txt | .txt files, their directories, and the directory no_group are backed up |
| *.txt | | All the directories and files except .txt files are backed up |

At the Backup host step, select a backup host and resource. The wizard goes to the next step automatically.
At the Backup target step, select a storage pool. Click Next.
Note
Incremental backups do not have the Backup target step because their backup target is the same as the selected full backup at the Backup source step.
At the Backup schedule step, set the job schedule. For details, see Backup policies. Click Next.
Select Immediate. ADPS performs the job immediately after it is created.
Select One time and set the start time for the job.
Select Hourly. Set the start time, end time, and time interval for job execution. The unit can be hour(s) or minute(s).
Select Daily. Set the start time and enter the time interval for job execution. The unit is day(s).
Select Weekly. Set the start time, enter the time interval, and select the specific dates in a week for job execution. The unit is week.
Select Monthly. Set the start time and months for job execution. You can select the natural dates in one month or select the specific dates in one week.
At the Backup options step, set the common and advanced options according to your needs. For details, see Backup options. Click Next.
At the Finish step, set the job name and confirm the job information. Click Submit.
After the submission, you will be redirected to the Job page automatically. On this page, you can start, modify, and delete the job.
Backup options#
ADPS provides the following backup options for Hadoop:
Common options
| Option | Description | Limitations |
|---|---|---|
| Compression | Fast is enabled by default. | |
| Channels | It can improve backup efficiency. The default value is 1 and the value ranges from 1 to 255. | Only available for full backup and synthetic backup jobs. |
| Snapshot | Enable this option to perform Hadoop snapshot backups (see the example after this table). It is disabled by default. | Only available for full backup and synthetic backup jobs. The setting of incremental backups is the same as that of the baseline full backup. |
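For reference, HDFS only allows snapshots on directories that have been made snapshottable. The following standard HDFS commands illustrate how that works outside ADPS; /data is a placeholder path, and whether you need to run these manually depends on your cluster setup:

# Allow snapshots on a directory (run as an HDFS superuser)
$ hdfs dfsadmin -allowSnapshot /data

# List directories on which snapshots are allowed
$ hdfs lsSnapshottableDir

# Existing snapshots appear under the hidden .snapshot directory
$ hdfs dfs -ls /data/.snapshot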
Advanced options
| Option | Description | Limitations |
|---|---|---|
| Reconnection time | The value ranges from 1 to 60 minutes. The job continues after an abnormal network reset occurs within the set time. | |
| Resumption buffer size | Specifies the resumption buffer size. The default value is 10 MiB. The bigger the resumption buffer size is, the more physical storage is consumed. However, a bigger resumption buffer size can prevent data loss when the throughput of the business system is high. | |
| Speed limit | Limits the data transfer speed or disk read/write speed for different time periods. The unit can be KiB/s, MiB/s, or GiB/s. | |
| Precondition | Checked before the job starts. The job execution will be aborted and the job state will be idle when the precondition is invalid. | |
| Pre-/Post-script | The pre-script is executed after the job starts and before the resource is backed up. The post-script is executed after the resource is backed up. | |
Restore#
Restore types#
For different needs, ADPS provides several restore types for Hadoop, including:
Point-in-time restore
Restores the Hadoop folders or files to a specified point in time. The restore target can be the original host or a different host, and the original path or a customized path.
Instant recovery
Achieves fast recovery by mounting the Hadoop backup sets from the storage server with the following advantages: fast recovery speed, little resource consumption, reduced disk space, and improved availability of backup sets.
Recovery testing
Restores the latest backup sets to another path on the original host or to a different host on an hourly, daily, weekly, or monthly schedule.
Before you begin#
To restore Hadoop to a different host, install the agent on that host or register the Hadoop resource, activate the licenses, and authorize user access to the resource.
Create a point-in-time restore job#
To create a point-in-time restore job, do the following:
From the menu, click Restore. The restore job wizard appears.
At the Hosts and resources step, select the Hadoop host and resource. The wizard goes to the next step automatically.
At the Backup sets step, do the following:
(1) From the Storage pool list, select a storage pool with backup sets, including the source and target pools with pool replication relation. Default refers to the one where the selected backup sets reside.
(2) From the Restore type list, select Point-in-time restore.
(3) In the Restore source section, select a point in time for the restore job.
(4) In the File section, select files for the restore job. All the files in the backup set will be selected by default. You can select or deselect the files.
Note
Files in backup sets from local storage pools and LAN-free pools cannot be listed in the File field.
(5) Click Next.
At the Restore target step, select a host and resource as the target. The wizard goes to the next step automatically.
Note
If you select a Hadoop or object storage resource as the restore target, the wizard will go to the Backup host step. After you select a backup host, the wizard goes to the next step automatically.
At the Restore schedule step, set the job schedule. Click Next.
Select Immediate. ADPS will perform the job immediately after its creation.
Select One time and set the start time for the job.
At the Restore options step, set the options according to your needs. See Restore options. Click Next.
At the Finish step, set the job name and confirm the job information. Click Submit.
After the submission, you will be redirected to the Job page. You can start, modify, and delete the job.
Create an instant recovery job#
Note
The adps-nfsd package should be installed on the storage server for Hadoop instant recovery (see the check below).

Hadoop instant recovery only supports backup sets from standard storage pools (with neither Multi-storage nor Data storage encryption enabled) and file synthetic pools.
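You can verify that the package is present on the storage server before creating the job; the query tool depends on the distribution and the package name shown here follows the note above:

# Debian/Ubuntu
$ dpkg -l | grep adps-nfsd

# RHEL/CentOS
$ rpm -qa | grep adps-nfsd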
To create an instant recovery job, do the following:
From the menu, click Restore. The restore job wizard appears.
At the Hosts and resources step, select the Hadoop host and resource. The wizard goes to the next step automatically.
At the Backup sets step, do the following:
(1) From the Storage pool list, select a storage pool with backup sets, including the source and target pools with pool replication relation. Default refers to the one where the selected backup sets reside.
(2) From the Restore type list, select Instant recovery.
(3) In the Restore source section, select a point in time for the restore job.
(4) Click Next.
At the Export step, do the following:
(1) In the Export field, set the mount point for the export. The path must begin with / and can be 2-30 characters with letters or numbers.
(2) In the Access control list field, add the agents that can mount and access the backup set. It supports IP addresses and network segments. * means all agents can access the backup set.
(3) From the Conversion path coding list, select None, GBK, GB18030, or BIG5. None is selected by default, in which UTF8 is used for instant recovery.
(4) Expand the Advanced options. From the Bridge list, select whether to use a network bridge or not. None is selected by default. You can use the network bridge to export backup sets to avoid conflicts with the NFS service of the operating system.
Note
To use a network bridge, enter the IP address, subnet mask, and default gateway. The IP address must be a valid address that is not used in this network segment.
Install bridge-utils on the storage server for bridge settings, with which ADPS can recognize the network bridge after it is started. Add the following content to the configuration file /etc/network/interfaces:

auto br0
iface br0 inet static
    address 192.168.88.10
    netmask 255.255.255.0
    gateway 192.168.88.1
    bridge_ports bond0
    bridge_stp off
    bridge_fd 9
    bridge_hello 2
    bridge_maxage 12
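After the configuration is applied (for example, by restarting networking), you can confirm that the bridge is up. This sketch uses the bridge and interface names from the example configuration above:

# Bring up the bridge defined in /etc/network/interfaces
$ sudo ifup br0

# Show the bridge and its enslaved ports (requires bridge-utils)
$ brctl show

# Confirm the bridge address
$ ip addr show br0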
At the Finish step, confirm the job information and click Submit.
After the submission, you will be redirected to the help page. Mount the files manually according to the procedure. On the CDM page, a copy with a mounted state is added below the selected point in time. For more details, see View a copy.
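The help page lists the exact mount command for your job. As an illustration only, a typical NFS mount of an exported backup set looks like the following; the storage server address, export path, and mount point are placeholders:

# Create a local mount point on the target host
$ sudo mkdir -p /mnt/adps_restore

# Mount the export published by the storage server (placeholder values)
$ sudo mount -t nfs 192.168.17.90:/hadoop_export /mnt/adps_restore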
Create a recovery testing job#
To create a recovery testing job, do the following:
From the menu, click Restore. The restore job wizard appears.
At the Hosts and resources step, select the Hadoop host and resource. The wizard goes to the next step automatically.
At the Backup sets step, do the following:
(1) From the Storage pool list, select a storage pool with backup sets, including the source and target pools with pool replication relation. Default refers to the one where the selected backup sets reside.
(2) From the Restore type list, select Recovery testing.
(3) In the Restore source section, select a point in time for the restore job.
(4) In the File section, select files for the restore job. All the files in the backup set will be selected by default. You can select or deselect the files.
Note
Files in backup sets from local storage pools and LAN-free pools cannot be listed in the File section.
(5) Click Next.
At the Restore target step, select a host and resource as the target. The wizard goes to the next step automatically.
Note
If you select a Hadoop or object storage resource as the restore target, the wizard will go to the Backup host step. After you select a backup host, the wizard goes to the next step automatically.
At the Restore schedule step, set the job schedule. Click Next.
Select Hourly. Set the start time, end time, and time interval to specify the time range for job execution. The unit can be hour(s) or minute(s).
Select Daily. Set the start time and enter the time interval for job execution. The unit is day(s).
Select Weekly. Set the start time, enter the time interval, and select the specific dates in a week for job execution. The unit is week.
Select Monthly. Set the start time and months for job execution. You can select the natural dates in one month or select the specific dates in one week.
At the Restore options step, set the options according to your needs. See Restore options. Click Next.
At the Finish step, set the job name and confirm the job information. Click Submit.
After the submission, you will be redirected to the Job page. You can start, modify, and delete the job.
Restore options#
ADPS provides the following restore options for Hadoop:
Common options:
| Option | Description | Limitations |
|---|---|---|
| Channels | It can improve restore efficiency. The default value is 1. The value cannot exceed that of the backup set. | |
| Restore location | You can set the restore location to the original location or a specified location. To specify a location, enter the path manually or click Browse to select the target folder. | |
| Incremental restore | This option becomes available only when you select an incremental backup set for the restore. It is disabled by default. If you enable this feature, the job only restores the incremental data at the selected point in time. | Only available for point-in-time restore jobs. |
Advanced options:
| Option | Description | Limitations |
|---|---|---|
| Reconnection time | The value ranges from 1 to 60 minutes. The job continues after an abnormal network reset occurs within the set time. | |
| Resumption buffer size | Specifies the resumption buffer size. The default value is 10 MiB. The bigger the resumption buffer size is, the more physical storage is consumed. However, a bigger resumption buffer size can prevent data loss when the throughput of the business system is high. | |
| Speed limit | Limits the data transfer speed or disk read/write speed for different time periods. The unit can be KiB/s, MiB/s, or GiB/s. | |
| Precondition | Checked before the job starts. The job execution will be aborted and the job state will be idle when the precondition is invalid. | |
| Pre-/Post-script | The pre-script is executed after the job starts and before the resource is restored. The post-script is executed after the resource is restored. | |
| Process invalid paths | - Do not check and convert paths | |
Copy data management#
On the CDM page, you can manage the copies generated by instant recovery and synthetic backup jobs, including viewing, cloning, unmounting, and deleting copies.
View a copy#
To view copies, do the following:
From the menu, click CDM. The CDM page appears.
From the toolbar, select the resource and the period when copies are created. The display area shows copies of this resource in this period. Copies are named by the creation time.
Click the copy name. The details of this copy appear on the right side of the page. Different icons represent different copy types.
Full copy: A data copy created by a synthetic backup.
Mounted copy: A data copy created by an instant recovery.
Clone a copy#
You can click the Clone copy icon to create an instant recovery job for the synthetic copy, which produces a new mounted copy.
To clone a copy, do the following:
From the menu, click CDM. The CDM page appears.
From the toolbar, select the resource and the period when copies are created. The display area shows the copies in this period.
On the display area, click a full copy under the resource. The Clone copy icon appears on the right of the copy.
Click the Clone copy button. You will be redirected to the Backup sets step. See Create an instant recovery job to configure the job.
After the instant recovery, a copy with a mounted state is added below the selected point in time on the CDM page.
Unmount a copy#
You can click the Unmount icon to unmount the mounted copies. This operation will make the mounted directory on the restore target inaccessible.
To unmount a copy, do the following:
From the menu, click CDM. The CDM page appears.
From the toolbar, select the resource and the period when copies are created. The display area shows the copies in this period.
Expand the full backup copy and select a mounted copy. The Unmount icon appears on the right of the copy.
Click the Unmount icon. A confirmation window appears.
Confirm the warning and enter the verification code. Click OK.
After the unmounting, the mounted copy record no longer appears under the full copy.
Limitations#
| Feature | Limitations |
|---|---|
| Point-in-time restore | Hadoop backup sets cannot be restored to Windows and vice versa. |
| Instant recovery | Only Linux storage servers support instant recovery. |
| Recovery testing | It does not support recovering Hadoop backup sets to Windows. |
Glossary#
| Term | Description |
|---|---|
| fast compression | A compression method that uses fast compression algorithms to compress data during the backup job. |
| cross-system restore | A restore method that restores files from Hadoop to Linux and vice versa. |
| different-architecture restore | A restore method that restores files from Hadoop to an operating system or object storage. |