Enhancing data transfer with S3 Parquet files and replication

      User Problem

      The Customer is transitioning from an API-based data extraction model to a more secure and efficient S3 Parquet-based replication system for data storage and disaster recovery.

      The following challenges were identified:

      • The customer's requested replication model is non-standard for Atlassian and requires internal review, testing, and automation updates.
      • AWS replication delays for large data volumes risk disrupting sequential file processing and data accuracy.
      • S3 replication requires versioning on both the source and target buckets, which increases storage costs and complexity.
      • The Parquet schema must remain compatible with the Customer's incremental updates.
      • Security and setup configurations require significant customization and approval.

      Suggested Solution

      • Optimize Replication for Large Data Volumes
        • Investigate premium AWS replication options, such as S3 Replication Time Control (S3 RTC), or alternative solutions to reduce replication delays for larger data volumes.
        • Implement monitoring tools to track replication progress and identify bottlenecks early, ensuring timely data processing.
      • Ensure Data Integrity with Versioning and Backup Strategies
        • Confirm that both parties enable versioning for the source and target S3 buckets, with clear policies to manage versioning costs effectively.
        • Develop a strategy for sequential file processing to avoid data drift, including error-handling mechanisms for delayed or incomplete file transfers.
      • Validate Data Compatibility with Parquet Schema
        • Ensure all required tables and columns are present and usable in the new Parquet-based replication model.
        • Map the Customer's existing hourly API-based incremental updates to the Parquet schema, ensuring compatibility with the new system.
      • Strengthen Security and Streamline Configuration
        • Customize the replication setup templates provided by the Customer to align with Atlassian's environment and security requirements.
        • Conduct a security review with Atlassian's security team to identify and address potential vulnerabilities in the replication process.
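
      The monitoring and sequential-processing points above could be sketched roughly as follows. This is an illustration under stated assumptions, not an agreed design: the key naming convention (export_<sequence>.parquet) and all function names are hypothetical, and the client passed to unreplicated_keys is assumed to be a boto3 S3 client, whose head_object response carries a ReplicationStatus field for objects in a source bucket with replication configured.

```python
import re
from typing import Dict, Iterable, List, Optional

# Hypothetical naming convention: export_<zero-padded sequence>.parquet
KEY_PATTERN = re.compile(r"export_(\d+)\.parquet$")


def sequence_number(key: str) -> Optional[int]:
    """Return the sequence number embedded in a key, or None if it does not match."""
    match = KEY_PATTERN.search(key)
    return int(match.group(1)) if match else None


def contiguous_batch(arrived_keys: Iterable[str], last_processed: int) -> List[str]:
    """Return the keys that can be processed now, in order.

    Processing stops at the first missing sequence number, so a delayed
    file holds back everything after it instead of being skipped.
    """
    by_seq: Dict[int, str] = {}
    for key in arrived_keys:
        seq = sequence_number(key)
        if seq is not None and seq > last_processed:
            by_seq[seq] = key

    batch: List[str] = []
    expected = last_processed + 1
    while expected in by_seq:
        batch.append(by_seq[expected])
        expected += 1
    return batch


def unreplicated_keys(s3_client, bucket: str, keys: Iterable[str]) -> List[str]:
    """Return source-bucket keys whose replication has not completed.

    Assumes s3_client behaves like a boto3 S3 client: head_object returns a
    dict that may include "ReplicationStatus" (for example PENDING,
    COMPLETED, or FAILED on source objects).
    """
    pending = []
    for key in keys:
        response = s3_client.head_object(Bucket=bucket, Key=key)
        if response.get("ReplicationStatus") != "COMPLETED":
            pending.append(key)
    return pending
```

      With files 1, 2, 3, and 5 present and nothing processed yet, contiguous_batch returns files 1 to 3 and waits for file 4. Holding back later files trades latency for ordering, which matches the concern about data drift; unreplicated_keys gives an operator a list of objects to investigate when that wait grows long.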

      Current Workarounds

      • Continue API-Based Incremental Updates: Use the existing hourly API process until the S3 replication model is fully implemented.
      • Schedule Transfers Strategically: Perform exports during off-peak hours to minimize AWS replication delays.
      • Small-Scale Parquet Testing: Validate the Parquet schema with limited data before scaling up.
      • Manual Monitoring: Assign oversight for critical data transfers to address delays or issues.
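      • Temporary Non-Versioned Buckets: Use non-versioned buckets for initial testing that does not involve replication (for example, Parquet export and schema checks) to reduce costs. Replication itself cannot be tested this way, because S3 replication requires versioning on both buckets.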
      • Iterative Template Refinement: Adjust provided templates to fit Atlassian’s standards and expedite security approval.
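
      For the small-scale Parquet testing workaround, a schema check on a sample file might look like the sketch below. The table, column names, and type strings are hypothetical placeholders, not the Customer's actual schema. The comparison itself uses only the standard library; the comment shows one way the actual mapping could be built, assuming pyarrow is available.

```python
from typing import Dict, List, NamedTuple


class SchemaReport(NamedTuple):
    missing: List[str]          # expected columns absent from the file
    type_mismatches: List[str]  # columns present with a different type
    unexpected: List[str]       # columns in the file that were not expected


def compare_schema(expected: Dict[str, str], actual: Dict[str, str]) -> SchemaReport:
    """Compare two column-name -> type-name mappings."""
    missing = sorted(name for name in expected if name not in actual)
    type_mismatches = sorted(
        f"{name}: expected {expected[name]}, found {actual[name]}"
        for name in expected
        if name in actual and expected[name] != actual[name]
    )
    unexpected = sorted(name for name in actual if name not in expected)
    return SchemaReport(missing, type_mismatches, unexpected)


# Hypothetical expected columns for one table.
EXPECTED = {
    "issue_id": "int64",
    "status": "string",
    "updated_at": "timestamp[us]",
}

# Assumption: with pyarrow installed, the actual mapping could come from
#   schema = pyarrow.parquet.read_schema("sample.parquet")
#   actual = {field.name: str(field.type) for field in schema}
# A hand-written mapping stands in for it here.
actual = {
    "issue_id": "int64",
    "status": "string",
    "updated_at": "string",
    "extra_flag": "bool",
}

report = compare_schema(EXPECTED, actual)
for problem in report.type_mismatches:
    print("type mismatch:", problem)
```

      Reporting missing, mismatched, and unexpected columns separately makes it easier to tell a breaking change (a missing column the hourly updates depend on) from a harmless one (an extra column).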

            Assignee:
            Melissa Hartsock
            Reporter:
            Rodrigo San Vicente
            Votes:
            1
            Watchers:
            8
