AWS Big Data Blog
Improve Apache Spark write performance on Apache Parquet formats with the EMRFS S3-optimized committer
The EMRFS S3-optimized committer is a new output committer available for use with Apache Spark jobs as of Amazon EMR 5.19.0. This committer improves performance when writing Apache Parquet files to Amazon S3 using the EMR File System (EMRFS). In this post, we run a performance benchmark to compare this new optimized committer with existing committer […]
Read MoreSpark enhancements for elasticity and resiliency on Amazon EMR
This blog post provides an overview of the issues with how open-source Spark handles node loss and the improvements in Amazon EMR to address the issues.
Read MoreVisualize over 200 years of global climate data using Amazon Athena and Amazon QuickSight
Climate Change continues to have a profound effect on our quality of life. As a result, the investigation into sustainability is growing. Researchers in both the public and private sector are planning for the future by studying recorded climate history and using climate forecast models. To help explain these concepts, this post introduces the Global […]
Read MoreCreate real-time clickstream sessions and run analytics with Amazon Kinesis Data Analytics, AWS Glue, and Amazon Athena
Clickstream events are small pieces of data that are generated continuously with high speed and volume. Often, clickstream events are generated by user actions, and it is useful to analyze them. For example, you can detect user behavior in a website or application by analyzing the sequence of clicks a user makes, the amount of […]
Read MoreMetadata classification, lineage, and discovery using Apache Atlas on Amazon EMR
With the ever-evolving and growing role of data in today’s world, data governance is an essential aspect of effective data management. Many organizations use a data lake as a single repository to store data that is in various formats and belongs to a business entity of the organization. The use of metadata, cataloging, and data […]
Read MoreOur data lake story: How Woot.com built a serverless data lake on AWS
In this post, we talk about designing a cloud-native data warehouse as a replacement for our legacy data warehouse built on a relational database. At the beginning of the design process, the simplest solution appeared to be a straightforward lift-and-shift migration from one relational database to another. However, we decided to step back and focus […]
Read MoreAnalyze and visualize nested JSON data with Amazon Athena and Amazon QuickSight
Although structured data remains the backbone for many data platforms, increasingly unstructured or semistructured data is used to enrich existing information or to create new insights. Amazon Athena enables you to analyze a wide variety of data. This includes tabular data in comma-separated value (CSV) or Apache Parquet files, data extracted from log files using regular expressions, […]
Read MoreBannerconnect uses Amazon Redshift to help clients improve digital marketing results
Bannerconnect uses programmatic marketing solutions that empower advertisers to win attention and customers by getting their ads seen by the right person at the right time and place. Data-driven insights help large advertisers, trade desks, and agencies boost brand awareness and maximize the results of their digital marketing. Timely analysis of log data is critical […]
Read MoreRunning Amazon Payments analytics on Amazon Redshift with 750TB of data
The Amazon Payments Data Engineering team is responsible for data ingestion, transformation, and the computation and storage of data. It makes these services available for more than 300 business customers across the globe. These customers include product managers, marketing managers, program managers, data scientists, business analysts and software development engineers. They use the data for […]
Read MoreAnalyze your Amazon CloudFront access logs at scale
This blog post focuses on two measures to restructure Amazon CloudFront access logs for optimization: partitioning and conversion to columnar formats. For more details on performance tuning read the blog post about the top 10 performance tuning tips for Amazon Athena.
Read More








