An organization wants to simplify the management of cross-account data permissions by using a centralized service to define and enforce fine-grained access control at the database, table, and column levels. Which service is best suited for this?
AWS Identity and Access Management (IAM)
Amazon S3 Bucket Policies
Amazon Redshift Spectrum
AWS Lake Formation
A data engineer needs to join data from an Amazon S3-based data lake with a dimensional table stored in an Amazon Redshift cluster without moving the S3 data into Redshift permanently. Which feature should they use?
Amazon Redshift Federated Query
Amazon Athena
Amazon Redshift Spectrum
AWS Glue DataBrew
75 questions
Q.
An organization wants to simplify the management of cross-account data permissions by using a centralized service to define and enforce fine-grained access control at the database, table, and column levels. Which service is best suited for this?
1
30 sec
Q.
A data engineer needs to join data from an Amazon S3-based data lake with a dimensional table stored in an Amazon Redshift cluster without moving the S3 data into Redshift permanently. Which feature should they use?
2
30 sec
Q.
Which AWS Glue component is used to automatically discover data formats and retrieve schemas to populate the Data Catalog?
3
30 sec
Q.
A streaming application requires real-time data ingestion and processing with a retention period of 24 hours. The engineering team wants to use a managed service that can scale shards to handle throughput. Which service should they choose?
4
30 sec
Q.
A data engineer is designing a pipeline where an S3 event triggers a process to validate the schema of an uploaded CSV file. The process is lightweight and runs in less than 30 seconds. What is the most cost-effective compute option?
5
30 sec
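As a rough illustration of the kind of lightweight schema check this scenario describes, the sketch below compares the header row of a CSV document against an expected column list. The function name `validate_csv_header` and the column names in the usage note are hypothetical and not part of the quiz; how the check is hosted or triggered is deliberately left out.

```python
import csv
import io


def validate_csv_header(csv_text, expected_columns):
    """Return (is_valid, problems) for the header row of a CSV document."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return False, ["file is empty"]

    # Normalize whitespace around column names before comparing.
    actual = [name.strip() for name in header]
    problems = []

    missing = [c for c in expected_columns if c not in actual]
    unexpected = [c for c in actual if c not in expected_columns]
    if missing:
        problems.append("missing columns: " + ", ".join(missing))
    if unexpected:
        problems.append("unexpected columns: " + ", ".join(unexpected))

    # Same set of names but a different sequence (or duplicates) is also a mismatch.
    if not problems and actual != list(expected_columns):
        problems.append("columns do not match the expected order or count")

    return (not problems), problems
```

For example, `validate_csv_header("id,extra\n1,x\n", ["id", "name"])` reports one missing and one unexpected column. Only the header is read, so the cost of the check does not grow with file size.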
Q.
You need to monitor AWS Glue job failures and receive an email notification whenever a job fails. Which combination of services provides this functionality?
6
30 sec
Q.
To optimize performance for an Amazon Redshift cluster, a data engineer wants to ensure that data is distributed across nodes based on a common join key. Which distribution style should be used?
7
30 sec
Q.
A company requires all data at rest in Amazon S3 to be encrypted using keys that are rotated annually and managed by a dedicated security team within AWS. Which encryption method meets this with the least operational overhead?
8
30 sec
Q.
A complex ETL pipeline involves multiple dependencies where Glue Jobs must only run after specific S3 objects are created and a Lambda function succeeds. What is the most robust way to orchestrate this?
9
30 sec
Q.
In an architectural review, you are asked to reduce the cost of a long-running Amazon EMR cluster used for daily batch processing. The workload is fault-tolerant. Which instance configuration is best?
10
30 sec
Q.
A data engineer needs to optimize the performance of an Amazon Athena query that scans a large dataset in Amazon S3. The dataset is currently stored in a single flat CSV file. Which combination of strategies will result in the greatest reduction in data scanned and improved query speed?
11
30 sec
Q.
A data engineer is designing a data lake on Amazon S3 and needs to ensure that any PII data is automatically identified and classified as it is uploaded. The organization also needs a dashboard to visualize the risk levels of their data across several S3 buckets. Which AWS service is specifically designed for this purpose?
12
30 sec
Q.
An Amazon Redshift database contains a large table with historical sales data. A data engineer notices that queries filtering by 'transaction_date' are becoming slower as the table grows to include millions of rows. Which optimization strategy should be applied to the table to improve the performance of these specific queries?
13
30 sec
Q.
A data engineering team monitors an Amazon Redshift cluster and notices that a specific query is consuming a disproportionate amount of memory, slowing down other critical reporting tasks. They want to automatically abort any query that runs for longer than 60 seconds to ensure consistent performance. Which feature should they use?
14
30 sec
Q.
A data engineer is designing a highly available and fault-tolerant ingestion pipeline that collects clickstream data from a website and writes it to an Amazon S3 data lake in near real-time. The engineer must ensure that the data is compressed into Parquet format before it reaches S3 to reduce query costs in Amazon Athena. Which solution implements this with the least operational overhead?
15
30 sec
Q.
A data engineer is designing a disaster recovery strategy for an Amazon Redshift cluster. The business requirement specifies that the data must be available in a secondary AWS Region if the primary Region experiences an outage. The Recovery Point Objective (RPO) is less than 24 hours. Which approach provides this capability with the minimum operational effort?
16
30 sec
Q.
A fintech company needs to process millions of stock market transactions per second with sub-second latency and store the raw data in an Amazon S3-based data lake for historical analysis. Which architecture provides the most scalable and cost-effective solution?
17
30 sec
Q.
An e-commerce company wants to implement a weekly ETL process that transforms nested JSON logs from S3 into Parquet format for Amazon Redshift Spectrum. They need to minimize costs and manage job dependencies. Which approach is best?
18
30 sec
Q.
A healthcare provider needs to perform complex analytical queries on 500 TB of historical patient records. The data is currently in Amazon S3. The solution must provide high performance for complex joins while keeping storage costs low. What is the most architecturally sound approach?
19
30 sec
Q.
An IoT company receives sensor readings from 100,000 devices via Amazon Kinesis Data Streams. They need to perform a sliding window calculation to find the average temperature every 5 minutes. Which service should they use for this streaming transformation?
20
30 sec
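To make the calculation in this scenario concrete, here is a minimal, service-agnostic sketch of a time-based sliding window average. The class name `SlidingWindowAverage` is hypothetical, and the sketch assumes readings arrive with non-decreasing timestamps measured in seconds; it says nothing about which managed service would run such logic.

```python
from collections import deque


class SlidingWindowAverage:
    """Average of readings whose timestamps fall within the last `window_seconds`."""

    def __init__(self, window_seconds=300):
        if window_seconds <= 0:
            raise ValueError("window_seconds must be positive")
        self.window_seconds = window_seconds
        self._readings = deque()  # (timestamp, value) pairs, oldest first
        self._total = 0.0

    def add(self, timestamp, value):
        """Record a reading and return the average over the current window."""
        self._readings.append((timestamp, value))
        self._total += value

        # Evict readings that fall outside the window ending at the newest timestamp.
        cutoff = timestamp - self.window_seconds
        while self._readings[0][0] <= cutoff:
            _, old_value = self._readings.popleft()
            self._total -= old_value

        return self._total / len(self._readings)
```

With a 300-second window, adding readings of 20.0 at t=0, 30.0 at t=100, and 40.0 at t=300 yields averages of 20.0, 25.0, and 35.0, because the first reading has left the window by the third call.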
Q.
A SaaS provider needs to move data from an Amazon Aurora database to an Amazon S3 data lake. The data contains PII that must be masked before storage. Which solution is most efficient for a recurring daily schedule?
21
30 sec
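For readers unfamiliar with what masking means in practice, the sketch below replaces two common PII patterns with placeholders before a record would be written out. The function name `mask_pii` and the two regular expressions are illustrative assumptions; real pipelines typically rely on more thorough, managed detection than simple patterns like these.

```python
import re

# Simplified patterns for illustration only; they will not catch every variant.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_pii(text):
    """Return `text` with email addresses and US SSN-shaped numbers masked."""
    masked = EMAIL_PATTERN.sub("[EMAIL]", text)
    masked = SSN_PATTERN.sub("***-**-****", masked)
    return masked
```

Calling `mask_pii("Contact jane@example.com, SSN 123-45-6789")` returns `"Contact [EMAIL], SSN ***-**-****"`.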
Q.
A data engineer needs to join two tables in Amazon Redshift: 'Sales' (10 billion rows) and 'Products' (1,000 rows). Which distribution style for the 'Products' table will result in the best query performance?
22
30 sec
Q.
A company wants to store data in Amazon S3 for long-term archiving. The data is rarely accessed but must be available within minutes if requested. Which S3 storage class is most cost-effective for this specific requirement?
23
30 sec
Q.
To optimize the cost of an Amazon Redshift cluster used only for business hours (9 AM to 5 PM), which feature should the data engineer implement?
24
30 sec
Q.
A media company wants to crawl their S3 data lake to populate a Data Catalog. The data is stored in folders partitioned by 'year', 'month', and 'day'. How does this partitioning benefit Amazon Athena queries?
25
30 sec
Q.
A developer needs to orchestrate a complex pipeline involving AWS Glue, Amazon EMR, and AWS Lambda with sophisticated error handling and conditional branching. Which service is most suitable?
26
30 sec
Q.
A global IoT company needs to ingest data from 50,000 sensors. The requirement is to ensure the data is ordered by the 'DeviceID' for downstream processing in a custom consumer application, and it must support a retention period of 7 days for potential re-processing. Which Kinesis configuration is most appropriate?
27
30 sec
Q.
A SaaS provider uses an Amazon Redshift cluster to store subscription and usage data for its clients. Several times a day, large-scale reporting queries cause performance degradation for the real-time dashboards used by support staff. The company wants a cost-effective solution to ensure that reporting queries do not impact the dashboard performance without manually resizing the cluster. Which feature should they implement?
28
30 sec
Q.
A healthcare analytics company manages an Amazon S3 data lake containing sensitive patient records. They utilize an AWS Glue ETL job to transform these records into a schema suitable for analysis. Due to strict compliance requirements, the company must ensure that any sensitive data (PII) is automatically identified and masked during the ETL process before the data is written to the destination. Which approach is the most efficient and native way to achieve this using AWS Glue?
29
30 sec
Q.
A global IoT company collects telemetry data from 1 million sensors every minute and stores it in an Amazon S3 data lake. A data engineer needs to join this high-volume sensor data with a small 50 MB 'Device Metadata' CSV file stored in S3 to filter for specific regions. The join must be performed using an AWS Glue ETL Spark job. Which optimization technique will most significantly improve the join performance and reduce data shuffling across the cluster?
30
30 sec
Q.
A data engineer needs to ingest a continuous stream of sensor data and transform it in near real-time before loading it into Amazon S3 for long-term storage. Which combination of services provides the most efficient managed solution with minimal custom code?
31
30 sec
Q.
A company wants to store data that is accessed infrequently but must be available immediately (within milliseconds) when requested. Which Amazon S3 storage class is the most cost-effective for this scenario?
32
30 sec
Q.
Which AWS Glue component is primarily responsible for scanning data sources, identifying data formats, and populating the AWS Glue Data Catalog with table definitions?
33
30 sec
Q.
A data engineer is designing a pipeline where an AWS Glue Job needs to read data from an Amazon S3 bucket. What is the most secure way to grant the Glue Job the necessary permissions?
34
30 sec
Q.
When comparing Amazon RDS and Amazon Redshift for a data engineering project, which statement best describes the primary use case for Redshift?
35
30 sec
Q.
A data pipeline requires a streaming service that allows multiple consumers to read the same data stream independently and supports data retention for up to 365 days. Which service should be used?
36
30 sec
Q.
In AWS Glue, what is the primary function of a 'Script' within a Glue Job?
37
30 sec
Q.
Which S3 storage class provides the lowest cost for long-term archival of data that only needs to be retrieved once or twice a year, with a retrieval time of 12 hours?
38
30 sec
Q.
What happens if an AWS Glue Crawler finds a change in the schema of the source data compared to the existing table in the Data Catalog?
39
30 sec
Q.
A company needs to implement a data lake on AWS. They want to ensure that access to specific S3 folders used by different data engineering teams is strictly controlled based on the principles of least privilege. Which AWS service is primarily used to define these access permissions?
40
30 sec
Q.
An analytics company needs to provide SQL-based querying capabilities on data stored in an Amazon S3 data lake without the overhead of managing infrastructure or loading the data into a database. Which service is best suited for this serverless ad-hoc analysis?
41
30 sec
Q.
A data engineer is designing a disaster recovery plan for a data warehouse. They need a cost-effective storage solution for database backups that are used less than once a year, but the data must be stored across multiple Availability Zones for high durability. Which S3 storage class meets these requirements?
42
30 sec
Q.
A data engineer needs to ingest real-time data into an S3 bucket and prefers a managed service that can automatically scale and requires no manual management of shards or consumer applications. Which service is the best fit?
43
30 sec
Q.
A data engineer is building a pipeline where an AWS Glue Job must access a database in a private subnet of an Amazon VPC. Which AWS Glue component must be configured to allow the job to communicate with the database?
44
30 sec
Q.
45
30 sec
Q.
A data engineer needs to move data from an on-premises Oracle database to an Amazon S3 bucket on a weekly basis using an AWS Glue Job. Because the database is behind a corporate firewall, the Glue Job must securely access the on-premises network. Which IAM configuration is essential for the Glue Job's execution role to achieve this?
46
30 sec
Q.
A data engineer needs to select an Amazon Redshift feature to handle unpredictable, short bursts of high-volume queries from a data visualization tool without impacting the performance of standard extract, transform, and load (ETL) operations. Which feature should be used?
47
30 sec
Q.
A data engineer is designing a data lake and needs to decide between using an Amazon S3 bucket versus an Amazon RDS instance for raw data storage. Which factor most strongly favors choosing S3 over RDS for storing large volumes of unstructured satellite imagery data?
48
30 sec
Q.
A data engineer is configuring an AWS Glue Job to process sensitive customer data. To ensure that the job can only be triggered by specific automated events and that the data processed is encrypted at rest using a customer-managed key, which combination of AWS features should be implemented?
49
30 sec
Q.
Which AWS service provides a centralized console for managing fine-grained access control, such as column-level security, for data stored in an S3-based data lake?
50
30 sec
Q.
Which IAM policy element is used to grant a user permission to perform a specific action only if the request is encrypted via TLS?
51
30 sec
Q.
To track which user deleted a specific Amazon S3 bucket, which AWS service should a data engineer consult for API call history?
52
30 sec
Q.
What is the primary purpose of a KMS Key Policy?
53
30 sec
Q.
Which encryption method requires the client to encrypt data before sending it to Amazon S3?
54
30 sec
Q.
How can a data engineer implement data masking for sensitive PII data in Amazon Redshift without changing the physical data?
55
30 sec
Q.
What is the benefit of using an IAM Role instead of a permanent IAM User Access Key for an application running on EC2?
56
30 sec
Q.
In AWS Lake Formation, what does 'LF-TBAC' stand for regarding access control?
57
30 sec
Q.
When using SSE-KMS for S3 encryption, which log records the usage of the key to decrypt an object?
58
30 sec
Q.
Which type of KMS key is managed by AWS services and cannot be deleted by the customer?
59
30 sec
Q.
Which specific Lake Formation permission must be granted to a user to allow them to view only a subset of columns within a Glue Data Catalog table?
60
30 sec
Q.
An AWS Data Engineer needs to ensure that all data moving between an on-premises data center and Amazon S3 is protected from interception. Which approach specifically addresses security for 'data in transit'?
61
30 sec
Q.
A data engineer needs to prevent an IAM user from viewing sensitive 'SSN' columns in an Amazon Athena query while allowing them to see all other data. What is the most efficient way to enforce this using AWS Lake Formation?
62
30 sec
Q.
Which specific AWS KMS key type allows a data engineer to use the same key material across multiple AWS Regions to simplify the decryption of replicated data sets?
63
30 sec
Q.
An AWS Data Engineer needs to ensure that a specified IAM role can only access an Amazon S3 bucket if the request originates from a private corporate network. Which policy element should they use to enforce this security governance?
64
30 sec
Q.
An AWS Data Engineer needs to ensure that an Amazon Redshift cluster only accepts encrypted connections from client applications. Which parameter must be set to 'true' in the cluster's parameter group to enforce this security governance?
65
30 sec
Q.
An AWS Data Engineer needs to find which specific IAM user changed an AWS Lake Formation permission on a Glue Data Catalog table to investigate a security incident. Which service provides this information?
66
30 sec
Q.
An AWS Data Engineer needs to ensure that a specific group of researchers can access only the row data where the 'department' column is set to 'Research' in an S3-backed Glue table. Which AWS Lake Formation feature is most appropriate for this?
67
30 sec
Q.
An AWS Data Engineer needs to ensure that all objects uploaded to an Amazon S3 bucket are automatically encrypted using a specific AWS KMS Customer Managed Key (CMK) by default. Which bucket-level setting should be configured?
68
30 sec
Q.
A data engineer needs to ingest high-frequency streaming data from thousands of IoT devices and store it in Amazon S3 for long-term archival. The data must be transformed into Apache Parquet format before being stored to optimize storage costs. Which combination of services provides the most cost-effective and least complex solution?
69
30 sec
Q.
A data engineer is designing a pipeline to process sensitive healthcare data. The data is stored in Amazon S3, and the engineer needs to ensure that the data is encrypted at rest using encryption keys that can be rotated annually and have access strictly controlled via IAM policies. Which encryption method should the engineer choose to meet these requirements with the least operational effort?
70
30 sec
Q.
A data engineer is building a data lake on Amazon S3 and needs to ensure that multiple AWS services, including Amazon Athena and Amazon Redshift Spectrum, can access the metadata for the datasets. The solution must provide a centralized metadata repository and support automated schema discovery. Which service should be used?
71
30 sec
Q.
A data engineer is tasked with migrating a 10 TB production database from an on-premises PostgreSQL instance to Amazon Redshift. The migration must minimize downtime, and the engineer needs to perform a one-time migration followed by ongoing data replication to keep the data synchronized until the cutover. Which service is best suited for this task?
72
30 sec
Q.
A data engineer needs to join two large datasets in Amazon Athena. One dataset is a 500 GB table partitioned by date, and the other is a 10 GB lookup table. Both tables are stored in Apache Parquet format. To minimize query execution time and avoid 'Out of Memory' errors during the join, which optimization technique should be applied?
73
30 sec
Q.
A data engineer is using Amazon Redshift for a data warehousing solution. One of the tables is frequently joined with other tables on a specific column, and is also frequently filtered by the same column in the WHERE clause. The table contains millions of rows. Which distribution style would be most effective for this table to improve join performance and reduce data movement?
74
30 sec
Q.
A data engineer is designing a data quality validation step in an ETL pipeline. The solution must automatically identify anomalies in the data, such as missing values or unexpected data types, and provide a score for data quality without requiring custom code. Which AWS service or feature should be integrated into the AWS Glue job to provide this functionality?