AWS Data Deep Dive Day

Data has become the lifeblood of modern businesses, fueling growth, driving innovation, and shaping decision-making processes. Organizations are constantly seeking ways to leverage advanced analytics and cloud computing solutions to tap into the full potential of data-driven strategies. AWS plays an important role in this domain, providing powerful tools and services for extracting insights from data.

Recently, I had the privilege of attending the AWS Data Deep Dive Day (D4) at PwC Manchester, an immersive event that brought together data practitioners, industry experts, and technology enthusiasts. Hosted by PwC, a renowned professional services network, this event promised to take us on a deep dive into the world of AWS's data-centric services, exploring data analytics, machine learning, and artificial intelligence.

👏
A big thank you to Ermiya for bringing it all together!

As a member of the Clock team, a prominent organization specializing in building robust solutions for our clients, I was particularly drawn to this event due to some upcoming projects that involve working with very large datasets. The challenges of handling and optimizing the performance of datasets at that scale intrigued me, especially in the areas of partitioning and data segregation. I saw the event as an opportunity to gain valuable insights and best practices for tackling these projects effectively.

At the event, I joined a diverse group of professionals eager to gain hands-on experience, learn from real-world case studies, and connect with industry leaders driving the data revolution. With an impressive lineup of speakers, interactive workshops, and networking opportunities, the event offered a unique platform to uncover the transformative power of data in today's digital landscape.

I thought I would share my firsthand experience of the event at PwC Manchester, highlighting the key takeaways, session highlights, and the transformative impact of AWS's cutting-edge data services 🍻.

What was covered?

This event featured a diverse range of sessions, each delving into specific topics related to AWS's data-centric services. Here's an overview of the key areas that were covered during the event:

Using Graph Databases to provide real-world context to your Security Control findings

This session explored the use of graph databases to enhance security control findings. Attendees gained insights into leveraging AWS services to analyse and visualize security data in a graph format, enabling them to uncover hidden relationships and contextualize security events.

Advanced Monitoring for Amazon RDS

Focusing on Amazon RDS, this session provided attendees with advanced monitoring techniques to optimize the performance and availability of their database instances. Experts shared best practices and demonstrated how to leverage AWS tools and services for proactive monitoring and troubleshooting.

Unlocking the full potential of DynamoDB

DynamoDB, a fully managed NoSQL database service by AWS, took centre stage in this session. Attendees learned how to maximize the potential of DynamoDB by understanding its key features, implementing efficient data models, and leveraging advanced querying capabilities for high-performance applications.

Adapting to the Data Revolution

This session offered insights into Amazon Redshift, AWS's fully managed data warehousing service. Participants explored how to adapt to the changing data landscape and unleash the full potential of Redshift for scalable analytics, optimizing query performance, and integrating with other AWS services.

Sentric Music - Our Data Modernisation Journey

Sentric Music shared their real-world experience of data modernisation. Attendees learned about their journey in leveraging AWS's data services to transform their data architecture, enabling more efficient data processing, storage, and analysis.

Adding your database to your CI/CD Pipeline using Blue/Green Deployments on RDS?

This session focused on incorporating databases into the Continuous Integration and Continuous Deployment (CI/CD) pipeline using Blue/Green deployments on Amazon RDS. Experts discussed best practices and demonstrated how to seamlessly integrate database changes into the software delivery lifecycle.

How Steamhaus unlocked the value of Sperry’s Rail Data using AWS Data services

Steamhaus presented a case study on how they unlocked the value of Sperry's Rail Data using AWS data services. Attendees gained insights into the challenges faced, the solutions implemented, and the transformative impact of leveraging AWS data services for analyzing large volumes of rail data.

Cost Optimization of RDS

This session provided practical strategies for cost optimization in Amazon RDS. Attendees learned about different cost-saving techniques, such as right-sizing database instances, leveraging reserved instances, and implementing efficient backup and restore strategies.

Using Graph Databases to provide real-world context to your Security Control findings

Kevin Phillips - AWS Specialist Solution Architect, Neptune

As an AWS Specialist Solution Architect, Kevin Phillips possesses deep expertise in designing and implementing solutions using AWS services, with a particular focus on graph database technologies. His background and hands-on experience in working with Neptune have equipped him with unique insights into its capabilities and best practices for leveraging the power of graph databases.

Throughout the session, Kevin's passion for the subject matter and his ability to communicate complex concepts in a clear and concise manner shone through. He effortlessly demystified the world of graph databases, graph-backed applications, and their significance in the security landscape.

His expertise and real-world examples provided attendees with a solid understanding of how graph databases, such as Neptune, can be leveraged to gain valuable insights into security control findings. His insights into CloudTrail, CloudWatch, and Lambda functions showcased practical ways to build a security graph and perform forensic analysis within AWS environments.

During the session, Kevin highlighted several notable features and capabilities of Neptune that left a lasting impression on attendees:

  1. Massive Scalability: Kevin emphasized that Neptune is designed to handle large-scale graph datasets. He mentioned that Neptune supports up to 400 billion graph objects, enabling organizations to store and analyse vast amounts of interconnected data efficiently.
  2. Autoscaling Capabilities: Kevin highlighted Neptune's autoscaling capabilities, which allow it to adapt to varying workloads. Neptune can automatically scale both vertically and horizontally, dynamically adjusting resources to meet the demands of graph queries and ensuring optimal performance.
  3. Graph Explorer: Kevin introduced the concept of the Graph Explorer, a powerful tool provided by Neptune for visualizing and exploring relationships within graph data. The Graph Explorer allows users to interactively navigate and query the graph, gaining valuable insights into the interconnectedness of data points.

These notable features and capabilities of Neptune showcased its ability to handle large-scale graph datasets, adapt to changing workloads, and provide intuitive tools for exploring and understanding the relationships within the data. Kevin's insights on these aspects demonstrated the power and potential of Neptune in enabling organizations to derive meaningful insights and make informed decisions based on their graph data.

Example architecture to create a Security Graph

During the session, Kevin shed light on the utilization of a security graph as a powerful tool for uncovering insights and reinforcing security measures within AWS environments. By leveraging the relationships between different entities and resources, the security graph enables organizations to visualize complex security landscapes, identify potential risks, and respond effectively to security incidents.
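To make the ingestion side of that architecture a little more concrete, here is a minimal sketch, assuming CloudTrail events reach a Lambda function via EventBridge and are written into Neptune with the gremlinpython client. The endpoint, vertex labels, and event fields below are my own assumptions for illustration, not code from the session.

```python
# Hypothetical Lambda handler: turn a CloudTrail record into graph vertices
# (principal, resource) connected by an edge named after the API action.
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __

NEPTUNE_ENDPOINT = "wss://my-neptune-cluster:8182/gremlin"  # assumed endpoint


def handler(event, context):
    conn = DriverRemoteConnection(NEPTUNE_ENDPOINT, "g")
    g = traversal().withRemote(conn)
    try:
        record = event["detail"]  # CloudTrail event delivered via EventBridge (assumed shape)
        principal = record["userIdentity"]["arn"]
        action = record["eventName"]
        resource = record.get("requestParameters", {}).get("bucketName", "unknown")

        # Upsert the principal vertex if it doesn't exist yet.
        g.V().has("principal", "arn", principal).fold().coalesce(
            __.unfold(),
            __.addV("principal").property("arn", principal),
        ).next()
        # Upsert the resource vertex.
        g.V().has("resource", "name", resource).fold().coalesce(
            __.unfold(),
            __.addV("resource").property("name", resource),
        ).next()
        # Link them with the API action as the edge label.
        g.V().has("principal", "arn", principal).addE(action).to(
            __.V().has("resource", "name", resource)
        ).next()
    finally:
        conn.close()
```

Querying the accumulated graph then becomes a matter of traversals such as "which principals touched this resource in the last hour", which is exactly the kind of contextual question a flat log store struggles with.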

Cloud Security Posture Management

The security graph plays a crucial role in evaluating and managing the cloud security posture of an environment. It allows organizations to assess AWS resources and monitor the overall security status. By visualizing the relationships between different resources, the security graph facilitates the identification of potential security gaps and misconfigurations. This, in turn, enables organizations to enhance their security controls and align them with industry best practices.

Data Flow/Exfiltration

With the help of the security graph, organizations can effectively monitor data movements within their environment. By tracking the relationships between resources, the graph can detect anomalies and patterns indicative of data flow or exfiltration attempts. Whether it involves provisioning resources to steal data from secure areas or unauthorized data transfers, the security graph provides valuable insights into potential data breaches. This allows organizations to swiftly detect and respond to such incidents, bolstering their data security measures.

Identity Access Management

The security graph offers a comprehensive view of the relationships between roles, policies, and permissions within an Identity Access Management (IAM) system. It enables organizations to visualize the policies attached to assigned roles, ensuring proper access controls are in place. By analyzing the graph, organizations can identify overly permissive access, insecure roles, or potential misconfigurations. Additionally, the graph aids in understanding nested roles, providing a clear understanding of the access privileges associated with each assigned role. This facilitates the implementation of robust IAM policies and the principle of least privilege.

Supply Chain Security

Assessing the security of the supply chain is crucial for safeguarding AWS environments. The security graph assists in this area by analyzing the dependencies used within the environment. By visualizing and tracking dependencies, such as EC2 images or application code dependencies, organizations can identify potential vulnerabilities and security risks associated with third-party components. This enables proactive measures to address vulnerabilities, such as staying updated with the latest security patches and avoiding dependencies with known security issues.

Digital Forensics

In the event of a security incident, the security graph serves as a valuable asset for digital forensics. It helps reconstruct the sequence of events, trace the origin of the issue, and identify potential sources of compromise. By analysing the relationships and patterns within the graph, organizations can determine the scope of the incident, identify any suspicious activities or access patterns, and pinpoint users with malicious intent. This aids in conducting thorough investigations, mitigating the impact of security breaches, and implementing appropriate remediation measures.

The security graph, with its ability to visualize complex relationships and contextual insights, empowers organizations to strengthen their security posture, detect anomalies, and respond effectively to security incidents. By leveraging the power of the security graph, organizations can gain valuable insights and make informed decisions to protect their AWS environments.

With Kevin's guidance, attendees were able to grasp the immense potential of Neptune and its role in enhancing security analysis, risk management, and proactive identification of vulnerabilities. His session undoubtedly left a lasting impression and equipped participants with actionable knowledge to apply to their own projects.

Advanced Monitoring for Amazon RDS

Tony Mullen, a Senior Database Solution Architect at AWS, began his session by providing attendees with an in-depth understanding of the monitoring capabilities available for Amazon RDS. He not only shared insights but also demonstrated how to navigate the AWS Management Console to access and utilize RDS monitoring features effectively.

He started by explaining the two levels of monitoring available: basic monitoring and enhanced monitoring. Tony emphasized that enhanced monitoring allows for more frequent monitoring intervals, enabling near real-time insights into the performance of the database. He showcased the process of configuring and customizing monitoring intervals to meet specific monitoring requirements.

During his demos, he showcased the AWS Management Console's intuitive interface, guiding attendees through the steps to access and analyse the monitoring data. He highlighted the ability to cherry-pick specific metrics to track with enhanced monitoring, focusing on metrics such as disk space, CPU utilization, and I/O throughput. Attendees gained a clear understanding of how to select and monitor the metrics most relevant to their specific use cases.
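Although Tony drove everything through the console, the same settings are scriptable. Below is a hedged boto3 sketch, with the instance identifier and monitoring role ARN as assumptions, that enables Enhanced Monitoring at a 60-second interval and then pulls a standard CloudWatch metric for the same instance.

```python
from datetime import datetime, timedelta

import boto3

rds = boto3.client("rds")
cloudwatch = boto3.client("cloudwatch")

# Enhanced Monitoring needs an IAM role that RDS can assume to publish OS metrics.
rds.modify_db_instance(
    DBInstanceIdentifier="my-database",  # assumed instance name
    MonitoringInterval=60,               # seconds; 0 disables Enhanced Monitoring
    MonitoringRoleArn="arn:aws:iam::123456789012:role/rds-monitoring-role",  # assumed role
    ApplyImmediately=True,
)

# Basic monitoring metrics live in CloudWatch under the AWS/RDS namespace.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-database"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2))
```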

Tony also demonstrated how the AWS Management Console provides visibility into the processes running on the Amazon RDS instances. Attendees learned how to navigate the console to view and analyse the running processes, enabling them to gain insights into resource utilization, performance bottlenecks, and potential issues affecting the database's overall performance.

Moreover, Tony showcased the power of Performance Insights through practical examples. He demonstrated how Performance Insights allows database administrators and developers to drill down into detailed query-level performance metrics. By visually identifying and analyzing queries causing performance degradation, attendees learned how to optimize query execution, improve resource utilization, and enhance overall database performance.
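Performance Insights has its own API as well. As a small sketch (the resource identifier is an assumption), the boto3 `pi` client can pull database load broken down by the top SQL statements, which is essentially what the console graph shows:

```python
from datetime import datetime, timedelta

import boto3

pi = boto3.client("pi")

response = pi.get_resource_metrics(
    ServiceType="RDS",
    Identifier="db-ABCDEFGHIJKL123",  # the instance's DbiResourceId (assumed)
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    PeriodInSeconds=60,
    MetricQueries=[
        {
            "Metric": "db.load.avg",                     # average active sessions
            "GroupBy": {"Group": "db.sql", "Limit": 5},  # break the load down by top SQL
        }
    ],
)
for metric in response["MetricList"]:
    print(metric["Key"].get("Dimensions", {}), metric["DataPoints"][-1:])
```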

In addition to Performance Insights, Tony highlighted the benefits of using DevOps Guru for RDS. Attendees discovered how DevOps Guru utilizes machine learning algorithms to analyse data and provide proactive recommendations for optimizing and maintaining data solutions. Tony showcased how DevOps Guru's intuitive dashboard enables organizations to identify and resolve issues promptly, ensuring the health and performance of their RDS deployments.

🙊
FYI, DevOps Guru does not suggest SQL to improve performance.

Tony Mullen's engaging demos not only provided attendees with practical demonstrations of accessing and utilizing the AWS Management Console for RDS monitoring but also empowered them to navigate and leverage the monitoring features effectively. The demos offered valuable hands-on experience, allowing attendees to gain confidence in monitoring their Amazon RDS instances and optimizing their data solutions for optimal performance and availability.

Next up, Matt Houghton, a Data Architect at CDL Software and an AWS Community Builder, took the stage to share insights into CDL's monitoring approach. CDL is an insurtech company that powers insurance comparison websites, managing a vast infrastructure with numerous external dependencies. Matt emphasized the critical need for comprehensive monitoring to ensure the availability and stability of CDL's solutions.

In his presentation, Matt revealed an eye-opening slide showcasing the extensive array of services and integrations within CDL's solutions. The sheer complexity of their infrastructure highlights the importance of robust monitoring practices when it comes to ensuring smooth operations and proactively addressing issues.

To achieve effective availability monitoring, Matt discussed several tools that he leverages at CDL. One of them is Rundeck, a workflow automation tool that enables the generation and execution of programmatically scheduled workflows. Matt demonstrated how Rundeck plays a crucial role in CDL's monitoring strategy by automating routine tasks and enabling the proactive execution of critical processes.

Another tool in CDL's arsenal is Heartbeat, which Matt described as essential for monitoring the status of each infrastructure component and third-party integration. Heartbeat enables CDL to proactively check the health and availability of various resources, ensuring that potential issues are identified and addressed promptly.

Furthermore, Matt emphasized the significance of OpenSearch for log analysis. By leveraging OpenSearch, CDL gains enhanced diagnostic capabilities, allowing them to dive deep into logs for issue resolution and optimization planning. The ability to analyse logs efficiently contributes to a faster response time in identifying and resolving problems, ensuring smooth operations and customer satisfaction.

Matt acknowledged that each deployed environment at CDL might have differences to avoid generating unwanted notifications or logs for non-production workloads. This customization helps streamline monitoring efforts and prevents unnecessary alerts, ensuring that the right people are notified of critical issues.

Moreover, Matt introduced the concept of run books, generated using Rundeck, to augment support tickets. By utilizing run books, CDL reduces cognitive overhead when team members are on call, enabling them to quickly access the necessary information and take appropriate actions. This approach not only facilitates efficient incident resolution but also promotes knowledge-sharing and collaboration among team members.

Matt wrapped up his presentation by highlighting the power of monitoring each component of the solution, regardless of the technology used. CDL's approach leverages both AWS-native and external tools to create a secure and stable service. By employing tools like Rundeck, Heartbeat, and OpenSearch, CDL ensures high availability, proactively addresses issues, and maintains a smooth customer experience.

Matt Houghton's insightful presentation showcased CDL's commitment to monitoring excellence and the importance of leveraging the right tools to manage complex infrastructures successfully. Attendees gained valuable insights into CDL's monitoring approach, equipping them with the knowledge to enhance their own monitoring strategies and ensure the availability and stability of their solutions.

Unlocking the full potential of DynamoDB

In an engaging presentation, we had an expert in DynamoDB take us on a journey through the history and capabilities of this powerful NoSQL database service. He began by sharing the story of DynamoDB, originally used within Amazon before it became commercially available on AWS. It was initially developed to address scaling challenges that Amazon encountered in 2004, and since then, DynamoDB has evolved into a Tier 0 service that supports numerous AWS services and external solutions.

One interesting aspect that he highlighted was the auto-scaling model of DynamoDB. Users have the option to manually provision Read Capacity Units (RCUs) and Write Capacity Units (WCUs) if they have a clear understanding of their expected workloads. However, there is also a "Pay-on-demand" model where these units scale automatically. It's important to note that even with auto-scaling, throttling can still occur due to the nature of storing and retrieving data. A notable point about the Pay-on-demand model is that when DynamoDB scales up its capacity units, they do not automatically scale back down. This aspect can surprise users who may experience higher costs after a spike in traffic subsides.
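For reference, the two capacity models map directly onto the `BillingMode` setting when creating a table. A minimal sketch with assumed table names:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Provisioned mode: you declare read/write capacity up front.
dynamodb.create_table(
    TableName="orders-provisioned",
    KeySchema=[{"AttributeName": "pk", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "pk", "AttributeType": "S"}],
    BillingMode="PROVISIONED",
    ProvisionedThroughput={"ReadCapacityUnits": 100, "WriteCapacityUnits": 50},
)

# On-demand mode: capacity adapts to traffic and you pay per request.
dynamodb.create_table(
    TableName="orders-on-demand",
    KeySchema=[{"AttributeName": "pk", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "pk", "AttributeType": "S"}],
    BillingMode="PAY_PER_REQUEST",
)
```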

Moving on to indexes and partitions, he emphasized the importance of high cardinality when designing partitions. High cardinality allows data to be distributed across multiple nodes, leading to improved performance and scalability. When querying DynamoDB, targeted queries that leverage a single partition should be used for optimal performance.
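A targeted query simply pins the partition key in the key condition so DynamoDB only touches one partition rather than scanning the table. A short sketch with assumed table and attribute names:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("orders")  # assumed table

# The partition key is fixed; the sort key condition narrows results without a scan.
response = table.query(
    KeyConditionExpression=Key("customer_id").eq("CUST#1234")
    & Key("order_date").begins_with("2023-06")
)
print(response["Items"])
```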

To handle anticipated high traffic, he mentioned the concept of "pre-warming" DynamoDB. By initiating scaling before the actual demand arises, delays associated with auto-scaling can be mitigated.

The speaker also provided insights on handling low-cardinality keys by sharding partition keys. This approach can be combined with DynamoDB Streams to aggregate results into a single shard, facilitating operations such as obtaining a "total" result.
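To illustrate the write-sharding idea, here is a hedged sketch where a hot, low-cardinality key (a country code, say) is suffixed with a random shard number on write, and reads fan out across the shards before merging. Table, key, and shard count are assumptions:

```python
import random

import boto3
from boto3.dynamodb.conditions import Key

SHARD_COUNT = 10
table = boto3.resource("dynamodb").Table("royalty-events")  # assumed table


def put_event(country: str, item: dict) -> None:
    """Spread writes for one hot key value across SHARD_COUNT partitions."""
    shard = random.randint(0, SHARD_COUNT - 1)
    table.put_item(Item={**item, "pk": f"{country}#{shard}"})


def query_all_shards(country: str) -> list:
    """Read side: one query per shard, results merged in the application."""
    items = []
    for shard in range(SHARD_COUNT):
        response = table.query(KeyConditionExpression=Key("pk").eq(f"{country}#{shard}"))
        items.extend(response["Items"])
    return items
```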

Additionally, he also discussed techniques for dealing with Global Secondary Index (GSI) back-pressure, which can result in throttling and impact overall performance. When a GSI experiences high write or read activity, it can potentially lead to performance degradation and increased latency.

To tackle GSI back-pressure, several techniques can be employed:

  1. Provision Additional Capacity: If your GSI is facing high write or read activity, you can consider provisioning additional Read and Write Capacity Units (RCUs and WCUs) specifically for the affected GSI. This additional capacity allocation can help alleviate the back pressure and improve the GSI's performance.
  2. Throttle Requests: Implementing throttling mechanisms within your application can help control the rate of requests to the GSI. By limiting the rate at which requests are sent to the GSI, you can prevent overwhelming the system and manage the back pressure more effectively.
  3. Monitor and Optimize Workloads: Regularly monitor the performance and usage patterns of your GSIs. Identify any queries or operations that may be causing high load or inefficiencies. Optimize your queries, indexes, and data model to ensure optimal performance and minimize the likelihood of back-pressure situations.
  4. Consider Partitioning and Sharding: If your workload requires even higher throughput than what a single GSI can handle, you can explore partitioning or sharding techniques. Partitioning involves splitting the data across multiple GSIs, distributing the load and enabling parallel processing. Sharding involves dividing the GSI into smaller logical units, allowing for better distribution of data and workload.

By implementing these techniques, you can effectively manage GSI back-pressure, optimize performance, and ensure the smooth operation of your DynamoDB tables and associated indexes.
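As an illustration of the first technique, here is a hedged boto3 sketch (table and index names are assumptions) that raises the provisioned throughput of a single GSI without touching the base table's capacity:

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.update_table(
    TableName="orders",
    GlobalSecondaryIndexUpdates=[
        {
            "Update": {
                "IndexName": "status-index",
                "ProvisionedThroughput": {
                    "ReadCapacityUnits": 200,   # extra headroom for the hot index
                    "WriteCapacityUnits": 100,
                },
            }
        }
    ],
)
```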

During his talk, he touched on the concept of a Sparse Index within DynamoDB and its implications for data organization and retrieval. A Sparse Index refers to an index that includes only a subset of the items in a table. Unlike traditional indexes that cover all items in the table, a Sparse Index allows for selective indexing based on specific criteria.

By using a Sparse Index, you can choose to include only certain items in the index that meet specific conditions or have particular attributes. This approach can be beneficial in scenarios where you want to optimize the performance of specific queries by narrowing down the scope of indexed items.

Sparse Indexes in DynamoDB provide flexibility in choosing which items to include in the index, enabling more targeted and efficient queries. This selective indexing can help improve query performance and reduce storage requirements when you have specific filtering requirements for your queries.
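The mechanism is simply that an item only appears in a GSI if it carries the index's key attribute. A minimal sketch, with all names assumed, of a "flagged items" sparse index:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("orders")  # assumed GSI "flagged-index" keyed on "flagged_at"

# Indexed: this item has the GSI key attribute.
table.put_item(Item={"pk": "ORDER#1", "status": "failed", "flagged_at": "2023-06-01"})
# Not indexed: the GSI key attribute is simply absent.
table.put_item(Item={"pk": "ORDER#2", "status": "ok"})

# Querying the sparse index returns only the flagged items, not the whole table.
flagged = table.query(
    IndexName="flagged-index",
    KeyConditionExpression=Key("flagged_at").eq("2023-06-01"),
)
print(flagged["Items"])
```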

Considering performance expectations, he highlighted that DynamoDB aims for response latencies of less than 10ms for single item "put" or "get" requests. However, it's essential to consider the overall solution architecture, as this latency does not account for the surrounding components. He also noted that there is no agreed Service Level Agreement (SLA) specifically for latency.

Wrapping up the presentation, he shared some best practices when working with the DynamoDB SDK. These included enabling TCP keep-alive, utilizing a single instance of the DynamoDB client, setting reasonable connection timeouts, and creating the client outside the Lambda handler to optimize performance through caching. Several tips focused on connection reuse, as creating new connections can incur overhead, especially when using Key Management Service (KMS) integration.
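In Python/boto3 terms, those SDK tips look roughly like the sketch below: keep-alive and timeouts set through botocore's `Config` (the `tcp_keepalive` option needs a reasonably recent botocore), and a single client created outside the Lambda handler so warm invocations reuse the connection. The table name is an assumption.

```python
import boto3
from botocore.config import Config

# Created once at module load, not per invocation.
dynamodb = boto3.resource(
    "dynamodb",
    config=Config(
        tcp_keepalive=True,   # keep connections alive between calls
        connect_timeout=2,    # fail fast instead of hanging
        read_timeout=2,
        retries={"max_attempts": 3, "mode": "standard"},
    ),
)
table = dynamodb.Table("orders")  # assumed table name


def handler(event, context):
    # Warm containers reuse the client, its connection pool and any cached
    # KMS material, avoiding per-request connection overhead.
    return table.get_item(Key={"pk": event["pk"]}).get("Item")
```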

His comprehensive insights into DynamoDB's features, scaling models, indexing strategies, and performance considerations provided attendees with valuable knowledge and best practices to harness the full potential of this powerful NoSQL database service.

Adapting to the Data Revolution: Unveiling the Future of Data Warehousing with Amazon Redshift

In his presentation, Srikant Das, an AWS Acceleration Lab Solutions Architect, took us on a journey into the future of data warehousing with Amazon Redshift, exploring its features and capabilities. He began by emphasizing the exponential growth of data storage, revealing that we now store in a single year what would have taken 20 years in the past—an astounding fact that showcases the data revolution we are experiencing. Srikant also highlighted the diverse range of data being stored and the increased focus on performance when analysing this vast amount of information. Additionally, he addressed the challenge of managing the cost associated with analytics due to the sheer volume of data being stored.

Srikant introduced Amazon Redshift, a powerful data warehousing solution known for its self-tuning capabilities. Redshift continuously optimizes its performance by automatically managing distribution styles, sort keys, and column encoding schemes. This self-tuning system ensures that queries run efficiently and at high speed, delivering excellent performance for analytics workloads. Moreover, Redshift is designed to comply with various compliance and regulatory requirements, making it a trusted choice for large enterprises with strict data handling processes. It also offers a more predictable cost structure, allowing organizations to better manage their analytics expenses.

During his talk, Srikant showcased diagrams illustrating how data can be ingested into Redshift from multiple sources. He highlighted the use of Redshift Spectrum, a feature that allows you to directly query data stored in Amazon S3. Redshift Spectrum enables users to extend their analytics capabilities to vast amounts of data without the need to load it into Redshift. By leveraging the power of Redshift Spectrum, users can run complex queries that span both Redshift and S3 data, providing a unified view of data across different storage layers.
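As a rough illustration of what a Spectrum query looks like in practice, here is a hedged sketch using the Redshift Data API; the cluster, database, schema, and table names are all assumptions. The external `spectrum` schema maps to files in S3, so the statement joins S3-resident data with a local Redshift table in one query:

```python
import boto3

redshift_data = boto3.client("redshift-data")

response = redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",   # assumed cluster
    Database="analytics",
    DbUser="analyst",
    Sql="""
        SELECT c.event_date, COUNT(*) AS plays
        FROM spectrum.clickstream c               -- external table backed by S3
        JOIN public.tracks t ON t.id = c.track_id -- local Redshift table
        GROUP BY c.event_date
        ORDER BY c.event_date;
    """,
)
print("statement id:", response["Id"])  # poll describe_statement / get_statement_result for output
```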

Srikant also discussed the concept of federated queries, which enables users to perform real-time queries across multiple data sources. With federated queries, you can seamlessly integrate Redshift with other data sources, such as Amazon RDS databases, and access and analyse data from these sources in real-time. This capability eliminates the need for complex data movement or synchronization, allowing for on-the-fly analysis of data residing in various systems.

Moving forward, Srikant provided an in-depth understanding of Redshift's architecture and underlying mechanisms. Redshift utilizes a clustered architecture that consists of one or more compute nodes, each equipped with its CPU, memory, and disk storage. This architecture enables parallel processing and allows Redshift to execute queries rapidly. By distributing the workload across multiple compute nodes, Redshift achieves high-performance analytics processing.

Within each compute node, the available resources are further divided into smaller units known as node slices. Node slices represent individual processing units within the compute nodes and handle a portion of the data during query execution. The number of node slices per compute node varies depending on the specific Redshift node type. This partitioning of resources into node slices allows for efficient parallelization of queries, leading to improved performance and faster data processing.

Srikant also introduced Redshift Serverless, a pay-on-demand offering that provides the flexibility to pay only for the actual usage of the Redshift service. Redshift Serverless eliminates the need for managing and scaling Redshift clusters manually. Instead, it automatically scales compute capacity based on workload demand, ensuring optimal performance while reducing costs. This serverless model is particularly beneficial for workloads with unpredictable or fluctuating usage patterns, as it dynamically adjusts resources to match the workload, allowing organizations to scale effortlessly without upfront capacity planning.

Wrapping up his talk, Srikant emphasized Redshift's robust security measures and resilience, making it a reliable choice for data warehousing needs. With its ability to seamlessly handle large volumes of data and its self-tuning capabilities, Redshift empowers organizations to unlock valuable insights and adapt to the data revolution effectively.


Customer Session: Sentric Music - Our Data Modernisation Journey

Next up, we had a talk with Andy Valentine, a technical architect, and Ryan Kelly, a principal software engineer at Sentric Music. They shared insights into the challenges and successes encountered while working with vast amounts of data, particularly in the context of music publishing at scale. Sentric Music, the platform used by artists to ensure they are paid when their music is played worldwide, required a robust and performant solution to handle significant data volumes and generate accurate royalty statements.

Sentric Music was described as a platform for music publishing at scale. Artists rely on Sentric Music to register with various societies and maximize visibility to receive appropriate compensation when their music is played. Given the extensive setup required and the generation of large royalty statements for well-known artists, Sentric Music faced the challenge of architecting a high-availability solution capable of handling substantial data while providing ease of use and performance.

Problem Statement and Benefits of Data Solution Changes

The primary problem Sentric Music aimed to address was the creation of a high-availability solution that would be user-friendly for their relatively small development team. They also sought a way to gain insights into the data being processed and ensure optimal performance. By implementing changes to their existing data solution, Sentric Music aimed to resolve connection headroom issues and address scaling challenges during traffic spikes, particularly during royalty statement generation periods.

Adoption of CQRS (Command Query Responsibility Segregation)

Sentric Music opted to implement CQRS, which allowed them to separate read and write concerns within their system. This approach facilitated better scalability by enabling the system to handle intense moments of write and read activity independently. The decoupling of read and write operations provided flexibility in scaling the system based on specific workload patterns.
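The pattern itself is simple to picture. The tiny sketch below (class and store names are my own, not Sentric's code) shows the core idea: commands mutate state through one path, queries read through another, so each side can be scaled and tuned independently.

```python
class RegisterWork:
    """A command: the intent to change state."""

    def __init__(self, work_id: str, title: str):
        self.work_id, self.title = work_id, title


class CommandHandler:
    def __init__(self, write_store):
        self.write_store = write_store  # e.g. the primary writer endpoint

    def handle(self, cmd: RegisterWork) -> None:
        self.write_store.insert_work(cmd.work_id, cmd.title)


class QueryHandler:
    def __init__(self, read_store):
        self.read_store = read_store    # e.g. a read replica or projection

    def works_by_title(self, title: str):
        return self.read_store.find_works(title=title)
```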

Challenges and Success of Database Upgrades

Sentric Music shared their experience with upgrading their database, highlighting their initial usage of Aurora 1 (MySQL 5.6) for an extended period. However, when AWS announced the deprecation (EOL) of Aurora 1 in March 2023, Sentric Music faced the fear associated with database upgrades—knowing that the database is a critical component of any solution. Despite the challenges, they successfully migrated to Aurora 2 by making a few minor adjustments. These adjustments included quoting queries to account for added keywords, handling changes in unions, and addressing SSL-related issues.

Continuing the Upgrade Path

Sentric Music did not stop at Aurora 2. They pursued further upgrades and migrated to Aurora 3 to avoid a similar EOL situation in the near future. This proactive approach ensured that they remained on supported and optimized versions of the database system.

Analytics Problems

Sentric Music encountered several challenges related to analytics, including streamlining ad-hoc requests, managing the cost of analytics, and integrating data from various sources. Ad-hoc requests for analytics insights can be time-consuming and resource-intensive, requiring a streamlined approach to handle them efficiently. Additionally, managing the cost of analytics became crucial as the volume of data increased. Sentric Music needed to optimize their analytics solution to ensure cost-effective operations. Moreover, integrating data from multiple sources, such as music streaming platforms and royalty collection societies, posed a significant data integration challenge.

Experiment with Lake Formation and Glue

As part of their data modernization journey, Sentric Music experimented with AWS Lake Formation and AWS Glue to address their data management and security needs. Lake Formation allowed them to build a secure data lake architecture, while AWS Glue facilitated data extraction, transformation, and loading processes. They also explored different security models offered by Lake Formation to ensure the privacy and integrity of their data.

C# Batch Processing

Sentric Music utilized C# batch processing for handling their large volumes of data. The C# batch processing framework enabled them to efficiently process and transform vast amounts of data in a scalable and reliable manner. This approach proved effective in managing their data processing needs.

Glue Crawler and Data Storage

Sentric Music employed AWS Glue Crawler to automate the discovery and cataloguing of data stored in Parquet files. By leveraging Parquet files, which have a smaller storage footprint, they optimized their data storage efficiency. Furthermore, they implemented efficient data partitioning strategies to enhance query performance. Although the data was stored in a semi-relational format, the use of Parquet files and optimized partitions facilitated efficient data retrieval and analysis.
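For anyone unfamiliar with Glue crawlers, the setup is small. A hedged boto3 sketch (bucket, role, and database names are assumptions) that points a crawler at partitioned Parquet output so the files become queryable tables:

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="royalty-parquet-crawler",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",  # assumed role
    DatabaseName="royalties",
    Targets={"S3Targets": [{"Path": "s3://example-royalty-data/parquet/"}]},
    # Hive-style prefixes such as .../year=2023/month=06/ are registered as table partitions.
)
glue.start_crawler(Name="royalty-parquet-crawler")
```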

The transition from Lake Formation to Batch Processing

Sentric Music decided to discontinue their usage of Lake Formation due to its unsuitability for their specific business needs. The way they were utilizing Lake Formation impacted their production environment, leading them to explore alternative solutions. They transitioned from using Step Functions to utilizing Batch for Parquet processing, which proved to be a more cost-effective and simpler solution for their data processing requirements.

Cost Considerations

During their presentation, Sentric Music discussed the costs associated with their data processing solution. They highlighted that the RDS export incurred the most significant cost, amounting to $190. In contrast, most other components of their infrastructure, including data processing and storage, had relatively lower costs, averaging around $10. This cost distribution was noteworthy considering the substantial amount of data being processed, indicating the efficiency and cost-effectiveness of their optimized solution.

Future Plans

Looking ahead, Sentric Music shared their future plans, which include exploring the use of x2g instances in AWS. These instances leverage Graviton processors and are expected to unlock additional benefits for their workload. Additionally, they expressed interest in evaluating Aurora Serverless V2 to assess potential advantages in terms of cost and performance.

Overall, Sentric Music's data modernization journey highlights their commitment to leveraging AWS services and technologies to address challenges related to availability, performance, and scalability. Their successful database upgrades and future plans reflect their proactive approach to optimizing their data infrastructure.

Adding your database to your CI/CD Pipeline using Blue/Green Deployments on RDS?

Kate Gawron, a Database Solutions Architect at AWS, introduced the concept of using Blue/Green Deployment strategy to deploy changes to a database. This approach allows for seamless switching between the current production environment (Blue) and the future environment (Green). Blue/Green Deployments are particularly beneficial for performing schema updates and, more importantly, database upgrades. Traditionally, schema updates were performed in-place, but this new approach provides a safer and more controlled deployment process.

Schema Updates and Planning

Historically, schema updates were performed directly on the production database, which can introduce risks and potential downtime. Alternatively, organizations could have a separate staging environment for testing changes before migrating them to the production database. However, this approach requires careful planning and orchestration to ensure a smooth transition. Blue/Green Deployments offer an alternative that streamlines the process and reduces risks associated with direct updates or separate staging environments.

Demo and Deployment Considerations

During the presentation, Kate Gawron provided a demonstration of how to execute Blue/Green Deployments using the AWS Management Console. However, an issue occurred during the demo, leading to a deployment failure. It was then mentioned that updating the binary log format to ROW was necessary to resolve the issue and successfully complete the deployment. This emphasizes the importance of considering and configuring appropriate settings for successful database deployments.
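The fix Kate mentioned is a parameter group change. A hedged sketch (the parameter group name is an assumption) of setting `binlog_format` to ROW before creating the Blue/Green deployment:

```python
import boto3

rds = boto3.client("rds")

rds.modify_db_parameter_group(
    DBParameterGroupName="my-mysql-params",  # assumed parameter group
    Parameters=[
        {
            "ParameterName": "binlog_format",
            "ParameterValue": "ROW",
            # Use "immediate" if the engine treats this as a dynamic parameter;
            # otherwise the change applies at the next reboot.
            "ApplyMethod": "pending-reboot",
        }
    ],
)
```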

Timeout and Deployment Speed

When performing Blue/Green Deployments, it is crucial to set an appropriate timeout for the deployment process. If the database fails to spin up within the defined timeout period, the deployment will constantly roll back. The timeout duration can be configured according to the specific requirements of the deployment. It is worth noting that Blue/Green Deployments typically take around a minute to complete, providing a fast and efficient deployment process for database changes.
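Putting the pieces together, a hedged boto3 sketch of the flow (identifiers, ARN, and engine version are assumptions): create the Green environment from the current Blue instance, then switch over with an explicit timeout so a slow switchover rolls back rather than hanging.

```python
import boto3

rds = boto3.client("rds")

deployment = rds.create_blue_green_deployment(
    BlueGreenDeploymentName="orders-db-upgrade",
    Source="arn:aws:rds:eu-west-2:123456789012:db:orders-db",  # current Blue instance (assumed)
    TargetEngineVersion="8.0.33",                              # version for the Green copy (assumed)
)
bg_id = deployment["BlueGreenDeployment"]["BlueGreenDeploymentIdentifier"]

# Once the Green environment is in sync and validated, switch production traffic.
rds.switchover_blue_green_deployment(
    BlueGreenDeploymentIdentifier=bg_id,
    SwitchoverTimeout=300,  # seconds; the switchover aborts if it can't finish in time
)
```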

Partner Session: How Steamhaus unlocked the value of Sperry’s Rail Data using AWS Data services

Tom Misiukanis, an Engineer, Architect, and Principal Consultant at Steamhaus, discussed the importance of making railways safer and introduced the data collection process. Data is collected in the t1k format from machines that travel along the railway using various methods, including the Flying Banana.

Data Extraction Challenges

Tom explained the challenges they faced with the previous data extraction process. It involved manually monitoring data on the train and then connecting a laptop to extract the data. This manual process was time-consuming and prone to errors. The extracted data had to be uploaded to a target location for further analysis.

Sperry's Data-Driven Approach

Sperry recognized the value of being a data-driven and tech-focused business. In 2016, they formed an innovation team to tackle the challenges and improve their data operations.

Initial AWS Solution and Scaling Issues

Sperry initially implemented a simple AWS solution for data processing. However, as the amount of data grew, the solution started to face scalability challenges.

Steamhaus Solution with Step Functions

To overcome the scaling issues, Steamhaus implemented a solution using AWS Step Functions. Step Functions are considered best practice for processing data and offer a scalable and reliable workflow orchestration.
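To give a flavour of what that orchestration can look like, here is a hedged sketch, with the role ARN, Lambda ARN, and names as assumptions, of a state machine whose Map state fans each incoming data file out to its own processing task:

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

definition = {
    "StartAt": "ProcessFiles",
    "States": {
        "ProcessFiles": {
            "Type": "Map",
            "ItemsPath": "$.files",   # one iteration per extracted file
            "MaxConcurrency": 10,
            "Iterator": {
                "StartAt": "ProcessFile",
                "States": {
                    "ProcessFile": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:eu-west-2:123456789012:function:process-t1k",
                        "End": True,
                    }
                },
            },
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="rail-data-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/stepfunctions-role",
)
```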

Automated Data Ingestion

Steamhaus addressed the problem of manual data uploads and potential data corruption during the extraction process. They collaborated with another partner to develop a program that could automatically push the data from the extraction laptop to the target location, ensuring a more reliable and error-free data ingestion process.

Frontend Solution for Analytics Visualization

Steamhaus also developed a front-end solution to visualize the results of the analytics. This frontend interface allowed users to view valuable information generated by Elmer, their analytics engine. It also provided visual outputs highlighting potential issues with the railway track, which is crucial for cost-effective maintenance.

Future Goals

Steamhaus outlined their future goals, which include leveraging machine learning (AWS SageMaker) to predict potential issues before the data is even captured. With a large amount of historical data, they aim to use ML to enhance safety and maintenance practices. Additionally, they plan to work on integrating more IoT devices for capturing more detailed and comprehensive data.

By integrating AWS data services, machine learning, and IoT, Steamhaus aims to further unlock the value of Sperry's Rail Data and improve railway safety.

Cost Optimization of RDS

AWS Trusted Advisor for Cost Optimization

AWS Trusted Advisor is a valuable tool that can assist in optimizing costs for RDS. It provides recommendations and identifies areas where cost optimizations can be implemented. By following the suggestions provided by Trusted Advisor, you can ensure that your RDS resources are utilized efficiently and cost-effectively.

Choosing Appropriate Instance Size

Selecting the right instance size for your RDS database is crucial for cost optimization. It's important to analyze your workload requirements and choose an instance size that meets your performance needs without overprovisioning. By selecting an instance that aligns with your workload demands, you can avoid unnecessary costs associated with underutilized resources.

Leveraging Reserved Instances

Reserved Instances (RIs) are a cost-saving option provided by AWS. By purchasing RIs, you can commit to a specific instance configuration for a certain period, typically one or three years, in exchange for significant discounts on the hourly usage rate. Utilizing RIs can result in substantial cost savings, especially for databases with steady and predictable workloads.

Metrics for Resource Optimization

Monitoring and analyzing RDS metrics are essential for ensuring cost optimization. By regularly reviewing CPU utilization, memory usage, and other relevant metrics, you can identify any potential performance bottlenecks or areas of inefficiency. This allows you to right-size your RDS resources and avoid unnecessary costs associated with underutilized or overprovisioned instances.

Identifying Orphaned Snapshots

Orphaned snapshots refer to the snapshots that are not automatically removed when a database is deleted. These leftover snapshots can accumulate over time, occupying storage and incurring additional costs. It's important to regularly review and identify any orphaned snapshots, and delete them to optimize costs and free up storage resources.
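A simple way to spot candidates is to compare manual snapshots against the instances that still exist. A hedged boto3 sketch that only reports them, with deletion left commented out:

```python
import boto3

rds = boto3.client("rds")

live_instances = {
    db["DBInstanceIdentifier"]
    for page in rds.get_paginator("describe_db_instances").paginate()
    for db in page["DBInstances"]
}

for page in rds.get_paginator("describe_db_snapshots").paginate(SnapshotType="manual"):
    for snapshot in page["DBSnapshots"]:
        if snapshot["DBInstanceIdentifier"] not in live_instances:
            # Likely orphaned: the source instance no longer exists.
            print("orphaned:", snapshot["DBSnapshotIdentifier"], snapshot["SnapshotCreateTime"])
            # rds.delete_db_snapshot(DBSnapshotIdentifier=snapshot["DBSnapshotIdentifier"])
```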

Configuring Backup Periods

Configuring appropriate backup periods is crucial for cost optimization. Backup requirements may vary for different environments and depend on factors such as Recovery Time Objective (RTO) and Recovery Point Objective (RPO). By defining backup periods based on your specific needs and adjusting them as necessary, you can avoid excessive backup costs while meeting your data protection and recovery goals.

By implementing these cost optimization practices, including utilizing Trusted Advisor, choosing appropriate instance sizes, leveraging Reserved Instances, monitoring metrics, managing snapshots, and configuring backup periods, you can optimize the costs associated with your RDS databases and ensure efficient resource utilization.

Wrapping up 🎁

The Data Deep Dive Day provided a comprehensive exploration of various data management and optimization topics. We delved into the realm of data warehousing with Amazon Redshift, uncovering its self-tuning capabilities, compliance features, and seamless integration with other AWS services. The exponential growth of data and the need for performance and cost predictability were highlighted as key drivers for adopting Redshift.

We then turned our attention to Sentric Music's data modernization journey, which involved creating a high-availability solution capable of handling vast amounts of data. Their adoption of CQRS and successful migration to AWS Aurora 2 demonstrated the importance of scalability and keeping up with database upgrades. The future plans of exploring x2g instances and Aurora Serverless V2 showcased their commitment to continuous improvement.

The presentation on Blue/Green Deployments on RDS emphasized the value of this deployment strategy, particularly for performing schema updates and database upgrades. The demo by Kate Gawron highlighted the need for proper configuration, including updating the binary log format, and setting an appropriate timeout for deployments to avoid constant rollbacks.

The talk by Tom Misiukanis of Steamhaus showcased the importance of leveraging AWS data services to make railways safer. Their solution, built on Step Functions, addressed data ingestion challenges and provided a frontend for visualizing analytics results. The future goals of incorporating machine learning and IoT devices demonstrated their commitment to innovation.

Cost optimization of RDS was another critical aspect covered in the Data Deep Dive Day. The utilization of AWS Trusted Advisor for identifying cost optimization opportunities, selecting appropriate instance sizes, leveraging reserved instances, monitoring metrics, managing snapshots, and configuring backup periods were all highlighted as essential practices.

Overall, the Data Deep Dive Day provided a wealth of knowledge and insights into various aspects of data management, optimization, and innovation. From powerful data warehousing solutions to database deployments, data ingestion, cost optimization, and future advancements like machine learning, the event offered a comprehensive view of how organizations can harness the power of data to drive their success.