AWS SAP Notes 07 - Data Analytics

Written by Nguyễn Huy Hoàng on 10/10/2021


  • Kinesis is a scalable streaming service, designed to ingest large volumes of data
  • Producers send data into a Kinesis stream
  • Streams can scale from low to near infinite data rates
  • It is a public service and it is highly available by design
  • Persistence: by default, streams store a 24-hour moving window of data
  • Kinesis includes the storage needed to ingest and retain data for that 24-hour window
  • Multiple consumers can access the data from that moving window

Kinesis Data Streams

  • Kinesis Data Streams use shards to stream data. Initially there is one shard; additional shards can be added over time to increase performance
  • Each shard provides its own capacity: 1 MB/s for ingestion and 2 MB/s for consumption
  • Shards directly affect the price of the Kinesis stream, we have to pay for each shard
  • Pricing is also affected by the length of the storage window. By default it is 24 hours; it can be increased to 7 days (and up to 365 days with long-term retention) at additional cost
  • Data is stored in Kinesis Data Records (up to 1 MB each); these records are distributed across shards
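
The per-shard figures above can be turned into a rough sizing estimate. The following is a minimal sketch, not an official sizing formula: it only uses the 1 MB/s ingestion and 2 MB/s consumption limits from these notes, and the function name `shards_needed` is made up for illustration.

```python
import math

# Per-shard capacities taken from the notes above.
INGEST_MB_PER_SHARD = 1.0
CONSUME_MB_PER_SHARD = 2.0


def shards_needed(ingest_mb_per_s: float, consume_mb_per_s: float) -> int:
    """Return the smallest shard count that covers both write and read throughput."""
    for_ingest = math.ceil(ingest_mb_per_s / INGEST_MB_PER_SHARD)
    for_consume = math.ceil(consume_mb_per_s / CONSUME_MB_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, for_ingest, for_consume)


if __name__ == "__main__":
    # 3.5 MB/s written, 4 MB/s read in total by all consumers.
    print(shards_needed(3.5, 4.0))
```

Since we pay per shard, an estimate like this is also a first approximation of the cost of the stream.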

SQS vs Kinesis Data Streams

  • The key question: is the workload about ingesting data (Kinesis) or about decoupling and worker pools (SQS)?
  • SQS usually has 1 production group, 1 consumption group
  • SQS is designed for decoupling and asynchronous communication
  • SQS has no persistence window: once a message has been processed and deleted it is gone and cannot be re-read by another consumer
  • Kinesis is designed for huge scale ingestion, with multiple consumers reading at different rates
  • Kinesis is recommended for ingestion, analytics, monitoring, click streams
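
The difference in consumption models can be sketched in a few lines of plain Python. This is only an in-memory illustration, not the SQS or Kinesis API: the class names `WorkQueue` and `Stream` are hypothetical, and expiry of the 24-hour window is not modeled.

```python
from collections import deque


class WorkQueue:
    """SQS-like: each message is handed to one worker and then removed."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Returns None when the queue is empty.
        return self._messages.popleft() if self._messages else None


class Stream:
    """Kinesis-like: records stay in the window; each consumer tracks its own position."""

    def __init__(self):
        self._records = []
        self._positions = {}

    def put(self, record):
        self._records.append(record)

    def read(self, consumer, limit=10):
        start = self._positions.get(consumer, 0)
        batch = self._records[start:start + limit]
        self._positions[consumer] = start + len(batch)
        return batch


if __name__ == "__main__":
    stream = Stream()
    for n in (1, 2, 3):
        stream.put(n)
    print(stream.read("analytics"))         # a fast consumer reads everything
    print(stream.read("archive", limit=2))  # a slower consumer reads at its own rate
```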

Kinesis Data Firehose


  • Used to provide data ingestion for other AWS services such as S3
  • Fully managed service used to load data for data lakes, data stores and analytics services
  • Data Firehose scales automatically, it is serverless and resilient
  • It is not a real-time product: it is near real-time, with a delivery latency of ~60 seconds
  • Supports transformation of data on the fly using Lambda. This transformation can add latency
  • Firehose is a pay as you go service, we pay per volume of data
  • Firehose supported destinations:
    • HTTP endpoints
    • Splunk
    • Redshift
    • Elasticsearch
    • S3
  • Firehose can accept data directly from producers or from Kinesis Data Streams
  • Firehose receives the data in real-time, but the ingestion is buffered
  • Firehose delivers a buffer when it reaches its size limit or its time limit, whichever comes first. With a 1 MB / 60 second buffer, low-volume data is delivered every 60 seconds, while under higher load data is delivered every time a 1 MB chunk fills up
  • Data is sent directly from Firehose to the destination, the exception being Redshift, where data is first stored in an intermediary S3 bucket
  • Firehose use cases:
    • Persistence for data coming into Kinesis Data Streams
    • Storing data in a different format (ETL)
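
The size-or-time buffering described above can be sketched as follows. This is an illustrative model, not Firehose's implementation: the class `FirehoseBuffer` and its parameters are hypothetical, timestamps are passed in explicitly so the example is deterministic, and the time limit is only checked when a record arrives (the real service flushes on its own timer).

```python
class FirehoseBuffer:
    """Collect records and release them as a batch on a size or time threshold."""

    def __init__(self, max_bytes=1_000_000, max_seconds=60):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self._records = []
        self._size = 0
        self._opened_at = None  # arrival time of the first record in the current batch

    def add(self, record: bytes, now: float):
        """Buffer a record; return the batch if a threshold was reached, else None."""
        if self._opened_at is None:
            self._opened_at = now
        self._records.append(record)
        self._size += len(record)
        too_big = self._size >= self.max_bytes
        too_old = now - self._opened_at >= self.max_seconds
        return self._flush() if too_big or too_old else None

    def _flush(self):
        batch = self._records
        self._records, self._size, self._opened_at = [], 0, None
        return batch


if __name__ == "__main__":
    buf = FirehoseBuffer(max_bytes=10, max_seconds=60)
    print(buf.add(b"12345", now=0))   # still buffering
    print(buf.add(b"67890", now=1))   # size threshold reached, batch delivered
```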

Kinesis Data Analytics

  • It is a real-time data processing product using SQL
  • The product ingests data from Kinesis Data Streams or Firehose
  • After the data is processed, it can be sent directly to destinations such as:
    • Firehose (data becoming near-real time)
    • Kinesis Data Streams
    • AWS Lambda
  • Kinesis Data Analytics use cases:
    • Anything using stream data which needs real-time SQL processing
    • Time-series analytics: election data, e-sports
    • Real-time dashboards: leader boards for games
    • Real-time metrics
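
To make the leaderboard use case concrete, here is a rough sketch of the kind of windowed aggregation a streaming SQL query would express. It is plain Python over a finished list of events rather than Kinesis Data Analytics SQL, and the event shape `(timestamp_seconds, player, points)` and the function name are assumptions made for the example.

```python
from collections import defaultdict


def tumbling_window_scores(events, window_seconds=60):
    """Sum points per player inside fixed, non-overlapping time windows.

    events: iterable of (timestamp_seconds, player, points).
    Returns {window_start: {player: total_points}}.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for ts, player, points in events:
        # Every event belongs to exactly one window.
        window_start = (ts // window_seconds) * window_seconds
        windows[window_start][player] += points
    return {start: dict(players) for start, players in windows.items()}


if __name__ == "__main__":
    events = [(5, "ann", 10), (42, "bob", 7), (59, "ann", 5), (61, "bob", 3)]
    print(tumbling_window_scores(events))
```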

EMR - Elastic Map Reduce

MapReduce 101


  • MapReduce is a framework designed to process huge amounts of data in a parallel, distributed way
  • Data Analysis Architecture: huge scale, parallel processing
  • MapReduce has two main phases: map and reduce
  • It also has two optional phases: combine and partition
  • At high level the process of map reduce is the following:
    • Data is separated into splits
    • Each split can be assigned to a mapper
    • The mappers perform the operation in parallel, at scale
    • The data is recombined after the operation is completed
  • HDFS (Hadoop Distributed File System):
    • Traditionally stored across multiple data nodes
    • Highly fault-tolerant - data is replicated between nodes
    • NameNode: provides the namespace for the file system and controls access to HDFS
    • Block: a segment of data in HDFS, generally 64 MB (128 MB by default in newer Hadoop versions)
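
The phases listed above can be sketched with the classic word count. This is a single-process illustration in plain Python, not Hadoop code: the function names are made up, each input string stands in for one split, and the partitioner uses a simple byte-sum hash only so the routing is deterministic.

```python
from collections import Counter, defaultdict


def map_phase(split):
    """Mapper: emit a (word, 1) pair for every word in one split."""
    return [(word.lower(), 1) for word in split.split()]


def combine_phase(pairs):
    """Optional combiner: pre-aggregate one mapper's output locally."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())


def partition_phase(pairs, reducers):
    """Optional partitioner: route every key to exactly one reducer."""
    buckets = defaultdict(list)
    for word, n in pairs:
        buckets[sum(word.encode()) % reducers].append((word, n))
    return buckets


def reduce_phase(pairs):
    """Reducer: add up the counts for each word it was given."""
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return dict(totals)


def word_count(splits, reducers=2):
    per_reducer = defaultdict(list)
    for split in splits:  # on a real cluster each split goes to its own mapper
        combined = combine_phase(map_phase(split))
        for reducer_id, pairs in partition_phase(combined, reducers).items():
            per_reducer[reducer_id].extend(pairs)
    result = {}
    for pairs in per_reducer.values():
        result.update(reduce_phase(pairs))  # reducers own disjoint sets of words
    return result


if __name__ == "__main__":
    print(word_count(["the cat", "the dog the end"]))
```

Because the partitioner always sends a given word to the same reducer, the per-reducer results can simply be merged at the end.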

Amazon EMR Architecture

  • EMR is a managed implementation of Apache Hadoop, which is a framework for handling big data workloads
  • EMR includes other elements such as Spark, HBase, Presto, Flink, Hive, Pig
  • EMR can be operated long term, or we can provision ad-hoc (transient) clusters for short term workloads
  • EMR runs in one AZ only within a VPC using EC2 for compute
  • It can use spot instances, instance fleets, reserved and on-demand instances as well
  • EMR is used for big data processing, manipulation, analytics, indexing, transformation, etc.
  • EMR architecture:
    • Historically we could have only one master node, nowadays we can have 3 master nodes
    • Core nodes: run and track tasks; we don't want to terminate these nodes
    • Core nodes also manage the HDFS storage for the cluster. The lifetime of HDFS is linked to the lifetime of the core nodes/cluster
    • Task nodes: used to only run tasks. If they are terminated, the HDFS storage is not affected. Ideally we use spot instances for task nodes
    • EMRFS: file system backed by S3, can persist beyond the lifetime of the cluster. Offers lower performance than HDFS, which is based on local volumes

Amazon Redshift

  • It is a petabyte-scale data warehouse
  • It is designed for reporting and analytics
  • It is an OLAP (column based) database, not OLTP (row/transaction)
    • OLTP (Online Transaction Processing): captures, stores and processes data from transactions in real-time
    • OLAP (Online Analytical Processing): designed for complex queries that analyze aggregated historical data from OLTP systems
  • Advanced features of Redshift:
    • Redshift Spectrum: allows querying data in S3 without loading it into the Redshift platform
    • Federated Query: directly query data stored in remote data sources
  • Redshift integrates with QuickSight for visualization
  • It provides a SQL-like interface with JDBC/ODBC connections
  • Redshift is a provisioned product, it is not serverless. It does come with provisioning time
  • It uses a cluster architecture. A cluster is a private network, and it cannot be accessed directly
  • Redshift runs in one AZ, not HA by design
  • All clusters have a leader node with which we can interact in order to do querying, planning and aggregation
  • Compute nodes: perform queries on data. A compute node is partitioned into slices. Each slice is allocated a portion of memory and disk space, where it processes a portion of the workload. Slices work in parallel; a node can have 2, 4, 16 or 32 slices, depending on its resource capacity
  • Redshift is a VPC service: it uses VPC security, IAM permissions, KMS encryption at rest and CloudWatch monitoring
  • Redshift Enhanced VPC Routing:
    • Not enabled by default, in which case traffic to services such as S3 uses public routes; it can be enabled per cluster
    • When enabled, traffic is routed based on the VPC networking configuration
    • Traffic can then be controlled by security groups, it can use custom DNS, and it can use VPC gateways

Redshift Components

  • Cluster: a set of nodes, which consists of a leader node and one or more compute nodes
    • Redshift creates one database when we provision a cluster. This is the database we use to load data and run queries on our data
    • We can scale the cluster in or out by adding or removing nodes. Additionally, we can scale the cluster up or down by specifying a different node type
    • Redshift assigns a 30-minute maintenance window at random from an 8-hour block of time per region, occurring on a random day of the week. During these maintenance windows, the cluster is not available for normal operations
    • Redshift supports both the EC2-VPC and EC2-Classic platforms to launch a cluster. We create a cluster subnet group if we are provisioning our cluster in our VPC, which allows us to specify a set of subnets in our VPC
  • Redshift Nodes:
    • The leader node receives queries from client applications, parses the queries, and develops query execution plans. It then coordinates the parallel execution of these plans with the compute nodes and aggregates the intermediate results from these nodes. Finally, it returns the results back to the client applications
    • Compute nodes execute the query execution plans and transmit data among themselves to serve these queries. The intermediate results are sent to the leader node for aggregation before being sent back to the client applications
    • Node Type:
      • Dense storage (DS) node types – for large data workloads, using hard disk drive (HDD) storage
      • Dense compute (DC) node types – optimized for performance-intensive workloads, using SSD storage
  • Parameter Groups: a group of parameters that apply to all of the databases that we create in the cluster. The default parameter group has preset values for each of its parameters, and it cannot be modified
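
The division of work between the leader node and the slices on the compute nodes can be illustrated with a small scatter-gather sketch. This is a conceptual model only, not how Redshift is implemented: the function names are hypothetical, rows are spread round-robin as a stand-in for a distribution style, and the slices run one after another here instead of in parallel.

```python
def distribute(rows, slices):
    """Spread rows round-robin across slices."""
    buckets = [[] for _ in range(slices)]
    for i, row in enumerate(rows):
        buckets[i % slices].append(row)
    return buckets


def slice_partial(rows):
    """Work done on one slice: a partial sum and a partial count."""
    return sum(rows), len(rows)


def leader_average(rows, slices=4):
    """Leader node: fan the work out to the slices, then merge the intermediate results."""
    partials = [slice_partial(bucket) for bucket in distribute(rows, slices)]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


if __name__ == "__main__":
    print(leader_average([10, 20, 30, 40, 50], slices=2))
```

Each slice returns only a small intermediate result, which is why the leader can aggregate an answer without ever holding the full data set.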

Redshift Resilience and Recovery

  • Redshift can use S3 for backups in the form of snapshots
  • There are 2 types of backups:
    • Automated backups: taken every 8 hours or after every 5 GB of data changes per node, whichever comes first, with a default retention of 1 day (max 35). Snapshots are incremental
    • Manual snapshots: taken when triggered manually; they have no retention period by default and are kept until deleted
  • Restoring from a snapshot creates a brand new cluster; we can choose a working AZ for it to be provisioned into
  • We can copy snapshots to another region where a new cluster can be provisioned
  • Copied snapshots can also have retention periods
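
The retention rules above can be expressed as a small filter. This is only a sketch of the rule as described in these notes, not a call to the Redshift API: the tuple layout `(snapshot_id, created_at, is_manual)` and the function name are assumptions for the example.

```python
from datetime import datetime, timedelta


def expired_snapshots(snapshots, now, retention_days=1):
    """Return the IDs of automated snapshots older than the retention period.

    snapshots: list of (snapshot_id, created_at, is_manual).
    Manual snapshots are never expired here, matching the notes above.
    """
    cutoff = now - timedelta(days=retention_days)
    return [
        snapshot_id
        for snapshot_id, created_at, is_manual in snapshots
        if not is_manual and created_at < cutoff
    ]


if __name__ == "__main__":
    now = datetime(2021, 10, 10, 12, 0)
    snapshots = [
        ("auto-old", datetime(2021, 10, 8, 12, 0), False),
        ("auto-new", datetime(2021, 10, 10, 6, 0), False),
        ("manual-old", datetime(2021, 1, 1), True),
    ]
    print(expired_snapshots(snapshots, now))
```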

Amazon Redshift Workload Management (WLM)

  • Enables users to flexibly manage priorities within workloads so that short, fast-running queries won’t get stuck in queues behind long-running queries
  • Amazon Redshift WLM creates query queues at runtime according to service classes, which define the configuration parameters for various types of queues, including internal system queues and user-accessible queues
  • From a user perspective, a user-accessible service class and a queue are functionally equivalent

AWS Batch

  • Managed compute service commonly used for large scale data analytics and processing
  • It is a managed batch processing product
  • Batch Processing: jobs that can run without end-user interaction, or can be scheduled to run as resources permit
  • AWS Batch lets us focus on defining jobs; it handles the underlying compute and orchestration
  • AWS Batch core components:
    • Job: a script, executable or Docker container submitted to Batch. Jobs are executed using containers in AWS. The job defines the work. Jobs can depend on other jobs
    • Job Definition: metadata for a job, including IAM permissions, resource configurations, mount points, etc.
    • Job Queue: jobs are submitted to a queue, where they wait for compute environment capacity. Queues can have a priority
    • Compute Environment: the compute resources which do the actual work. Can be managed by AWS or by ourselves. We can configure the instance type, vCPU amount, spot price, or provide details on a compute environment we manage (ECS)
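
Since jobs can depend on other jobs, a scheduler has to work out an order in which every job runs after its dependencies. The sketch below shows that idea with Python's standard `graphlib` module (Python 3.9+); it is not the AWS Batch API, and the job names and the `run_order` helper are made up for illustration.

```python
from graphlib import TopologicalSorter


def run_order(jobs):
    """Return an execution order in which every job comes after its dependencies.

    jobs: {job_name: set of job names it depends on}.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(jobs).static_order())


if __name__ == "__main__":
    jobs = {
        "report": {"transform"},
        "transform": {"extract"},
        "extract": set(),
    }
    print(run_order(jobs))
```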

AWS Batch vs Lambda

  • Lambda has a 15-minute execution limit; for longer workloads we should use Batch
  • Lambda has limited disk space in the environment, we can fix this by using EFS, but this would require the function to be run inside of a VPC
  • Lambda is fully serverless with limited runtime selection
  • Batch is not serverless, it uses Docker with any runtime
  • Batch does not have a time limit for execution

Managed vs Unmanaged AWS Batch

  • Managed:
    • AWS manages capacity based on the workloads
    • We define the instance types, size and if we want to use on-demand or spot instances
    • We can determine our own max spot price
    • We need to create VPC gateways for access to the resources
  • Unmanaged:
    • We manage everything
    • Generally used if we have a full compute environment ready to go

AWS QuickSight

  • It is a business analytics and intelligence (BA/BI) service
  • It is used for visualization and ad-hoc analysis
  • Cost-effective, on-demand service
  • It is able to discover and integrate with AWS data sources and supports a wide range of external data sources
  • Supported data sources:
    • Athena, Aurora, Redshift, Redshift Spectrum
    • S3, AWS IoT
    • Jira, GitHub, Twitter, Salesforce
    • Microsoft SQL Server, MySQL, PostgreSQL
    • Apache Spark, Snowflake, Presto, Teradata