Bamboo Elastic Instance fails to mount EBS snapshot volume if an Instance Store volume is present

    • Type: Bug
    • Resolution: Fixed
    • Priority: High
    • Fix Version/s: 9.3.0, 9.1.2, 9.2.3
    • Affects Version/s: 9.1.1
    • Component/s: Elastic Bamboo
    • None
    • 1
    • Severity 2 - Major

      Issue Summary

      This is reproducible in Data Center:

      Bamboo Elastic Instance fails to mount EBS snapshot volume if an Instance Store volume is present on the EC2 instance.

      From AWS:

      An AWS instance store provides temporary block-level storage for your instance. This storage is located on disks that are physically attached to the host computer. Instance store is ideal for temporary storage of information that changes frequently, such as buffers, caches, scratch data, and other temporary content, or for data that is replicated across a fleet of instances, such as a load-balanced pool of web servers.

      The Bamboo Elastic Image's mountEbsDevice.sh script uses the lsblk command to list the volumes and find the one to mount as the instance's EBS volume. It lists the disk devices, skips the disk that holds the root partition, and takes the first remaining device. The order in which Linux reports nvme* or sd* devices is not predictable, so there is no guarantee that the disks are always listed in the same order: lsblk can return one sequence of device names when an instance starts and a different sequence the next time that same instance (or a different one) starts.

      From the lsblk manual page:

      The default output as well as default output from options like --topology and --fs is subject to change, so whenever possible you should avoid using default outputs in your scripts. Always explicitly define expected columns by --output columns in an environment where a stable output is required.
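
As a rough illustration of the man page's advice, the sketch below works on only the NAME and SERIAL columns and labels each disk using the serial-number convention described in the Workaround section (EBS serials start with vol, Instance Store serials start with AWS). The function name classify_nvme_disks is hypothetical and is not part of the Bamboo scripts.

```shell
#!/bin/sh
# Label each disk from "NAME SERIAL" pairs read on stdin.
# The input is expected in the format printed by: lsblk -dn -o NAME,SERIAL
#   -d  list whole disks only (no partitions)
#   -n  suppress the header line
#   -o  print exactly the requested columns
classify_nvme_disks() {
  awk '
    NF == 0     { next }                              # ignore blank lines
    $2 ~ /^vol/ { print $1, "ebs"; next }             # EBS serial: vol...
    $2 ~ /^AWS/ { print $1, "instance-store"; next }  # Instance Store serial: AWS...
                { print $1, "unknown" }               # missing or unrecognised serial
  '
}
```

On an instance this could be fed with `lsblk -dn -o NAME,SERIAL | classify_nvme_disks`. The root disk is itself an EBS volume, so it is also labelled ebs and still has to be excluded by the caller.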

      The mountEbsDevice.sh script scans for the first non-root NVMe device in the list and tries to mount it. If the Instance Store volume appears before the EBS volume, the bug occurs.

      In the example below, nvme2n1 is the Instance Store volume and nvme1n1 is the EBS volume. The bug occurs here because nvme2n1 (Instance Store) is listed before nvme1n1 (EBS):

      nvme2n1     259:0    0 220.7G  0 disk 
      nvme1n1     259:1    0    15G  0 disk 
      nvme0n1     259:2    0    20G  0 disk 
      └─nvme0n1p1 259:3    0    20G  0 part /
      

      After rebooting the same instance, the bug does not manifest because nvme1n1 (EBS) is listed before nvme2n1 (Instance Store):

      nvme0n1     259:0    0    20G  0 disk 
      └─nvme0n1p1 259:1    0    20G  0 part /
      nvme1n1     259:2    0    15G  0 disk 
      nvme2n1     259:3    0 220.7G  0 disk 
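
The effect of the two orderings can be reproduced without an EC2 instance. The sketch below feeds both listings, rewritten in `lsblk -l` style, through the same selection pipeline that the stock script uses. The function name pick_first_non_root and the two sample variables are made up for this illustration.

```shell
#!/bin/sh
# Same selection pipeline as the stock script, but reading `lsblk -l`-style
# text from stdin instead of calling lsblk.
# $1 is the disk that holds the root partition.
pick_first_non_root() {
  grep -v "^$1" | grep nvm | head -1 | awk '{print $1}'
}

# Enumeration order in which the bug occurs (Instance Store listed first).
bad_order='nvme2n1   259:0 0 220.7G 0 disk
nvme1n1   259:1 0 15G 0 disk
nvme0n1   259:2 0 20G 0 disk
nvme0n1p1 259:3 0 20G 0 part /'

# Enumeration order after the reboot (EBS listed first).
good_order='nvme0n1   259:0 0 20G 0 disk
nvme0n1p1 259:1 0 20G 0 part /
nvme1n1   259:2 0 15G 0 disk
nvme2n1   259:3 0 220.7G 0 disk'

printf '%s\n' "$bad_order"  | pick_first_non_root nvme0n1   # prints nvme2n1 (Instance Store)
printf '%s\n' "$good_order" | pick_first_non_root nvme0n1   # prints nvme1n1 (EBS)
```

The selection depends only on line order, which is why the same instance can work on one boot and fail on the next.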

      Steps to Reproduce

      1. Configure an Elastic Image with:
      2. Start the Elastic Instance
      3. The issue occurs if the Instance Store volume is listed before the EBS volume inside the EC2 instance
      4. If the mount succeeds, keep starting new instances until lsblk lists the Instance Store device before the EBS device

      Expected Results

      • Bamboo should be able to mount the EBS volume regardless of the presence of other non-EBS volumes in any order

      Actual Results

      Bamboo will try to mount the Instance Store volume as if it were the EBS volume and will fail.

      In another case, the Instance Store volume already contains a valid filesystem (ext3, ext4, xfs, etc.). Bamboo then succeeds in mounting that volume as if it were its EBS location, but fails later at the application level when it tries to use that location as its EBS storage, because the volume does not contain the required content. There is a risk of data loss in this scenario.

      /tmp/SetupEbsSnapshot.log
      ...
      Instance type: c6id.xlarge
      NVMe mode instance type override: false
      Detected an instance that exposes EBSes through NVMe devices, running the NVMe code path...
      NAME          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
      nvme0n1       259:0    0     8G  0 disk 
      ├─nvme0n1p1   259:1    0     8G  0 part /
      └─nvme0n1p128 259:2    0     1M  0 part 
      nvme2n1       259:3    0 220.7G  0 disk 
      nvme1n1       259:4    0    20G  0 disk 
      Detected an EBS volume attached at /dev/nvme2n1.
      Detecting whether the volume has been attached as /dev/nvme2n1 or /dev/xvd1.
      Mounting /dev/nvme2n1 at /mnt/bamboo-ebs
      mount: /mnt/bamboo-ebs: wrong fs type, bad option, bad superblock on /dev/nvme2n1, missing codepage or helper program, or other error.
      Initial call failed, second attempt...
      mount: /mnt/bamboo-ebs: wrong fs type, bad option, bad superblock on /dev/nvme2n1, missing codepage or helper program, or other error.
      Second attempt failed, third and last attempt...
      mount: /mnt/bamboo-ebs: wrong fs type, bad option, bad superblock on /dev/nvme2n1, missing codepage or helper program, or other error.
      3 attempts failed: (mount -v  /dev/nvme2n1 /mnt/bamboo-ebs)
      mount: /mnt/bamboo-ebs: wrong fs type, bad option, bad superblock on /dev/nvme2n1, missing codepage or helper program, or other error.
      Failed to mount volume. Exiting.
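
To check whether an instance hit this bug, it may help to pull the selected device out of the log and compare its serial number with the convention described in the Workaround section. The sketch below is one possible way to do that. The function name selected_device_from_log is hypothetical, and the pattern assumes the log line is worded exactly as in the excerpt above.

```shell
#!/bin/sh
# Print the device that the setup script selected, taken from its log line
# "Detected an EBS volume attached at /dev/<name>."
# Reads the files given as arguments, or stdin when called without arguments.
selected_device_from_log() {
  sed -n 's|^ *Detected an EBS volume attached at \(/dev/[A-Za-z0-9]*\)\..*$|\1|p' "$@" | head -1
}

# Possible use on an affected instance:
#   dev=$(selected_device_from_log /tmp/SetupEbsSnapshot.log)
#   lsblk -dn -o NAME,SERIAL "$dev"
# A serial that does not start with "vol" suggests that an Instance Store
# volume was picked instead of the EBS volume.
```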
      

      Workaround

      Create a custom Elastic Agent Image and patch the mountEbsDevice.sh script as shown below, so that it reads each device's serial number and selects the first valid EBS volume.

      The broken code (to be replaced):

          nvmeDeviceName=$(lsblk -l | grep -v ^${diskWithRootPartition} | grep nvm | head -1 | awk '{print $1}')
      

      Replace the line above with the content below:

          nvmeDevices=$(lsblk -l | grep -v "^${diskWithRootPartition}" | grep nvm | awk '{ print $1 }')
          # Loop over nvmeDevices until the first EBS volume after the root volume is found.
          # This avoids using secondary NVMe volumes attached as Instance Store volumes.
          # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html#instance-store-volumes
          for _nvme in ${nvmeDevices} ; do
            # EBS volumes have a volXXXXXXXXXX serial number format; Instance Store volumes have AWSXXXXXXXXXXX
            if lsblk -rn -o SERIAL "/dev/${_nvme}" | grep -q '^vol' ; then
              nvmeDeviceName=${_nvme}
              break
            fi
            echo "Found an Instance Store volume on /dev/${_nvme}. Skipping..."
          done
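
The patched loop leaves nvmeDeviceName unset when no device has an EBS-style serial. A minimal sketch of the same selection as a function that reports this case through its exit status is shown below. The names get_serial and first_ebs_device are hypothetical and do not exist in mountEbsDevice.sh. The serial lookup is kept in its own function so that it can be replaced by a stub when testing off-instance.

```shell
#!/bin/sh
# Hypothetical helper: print the serial number of /dev/<name>.
get_serial() {
  lsblk -rn -o SERIAL "/dev/$1"
}

# Print the first device in the argument list whose serial starts with "vol".
# Returns 1 if none of the devices looks like an EBS volume.
first_ebs_device() {
  for _nvme in "$@"; do
    case "$(get_serial "$_nvme")" in
      vol*) printf '%s\n' "$_nvme"; return 0 ;;
    esac
  done
  return 1
}

# Possible use inside a patched script (word splitting of nvmeDevices is intended):
#   nvmeDeviceName=$(first_ebs_device ${nvmeDevices}) || {
#     echo "No EBS volume found among: ${nvmeDevices}"
#     exit 1
#   }
```

Failing early in this way avoids passing an empty device name to mount, at the cost of a slightly larger patch than the one above.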

              Assignee:
              Alexey Chystoprudov
              Reporter:
              Eduardo Alvarenga (Inactive)
              Votes:
              1
              Watchers:
              5
