Amazon AWS Certified Machine Learning — Specialty Exam Dumps and Practice Test Questions Set7 Q91-105
Question 91
A data scientist is building a multi-class image classification model to categorize products into 50 different categories. The training dataset contains 100,000 images with highly imbalanced distribution where some categories have only 200 images while others have 5,000 images. After training a CNN model, overall accuracy is 85% but the model performs poorly on underrepresented categories with recall below 30%. What is the MOST effective approach to improve performance on minority classes?
A) Increase the number of training epochs to allow the model more time to learn minority classes
B) Apply class-weighted loss function and use data augmentation more aggressively on minority classes
C) Remove minority classes from the dataset and focus only on well-represented categories
D) Use a larger model architecture with more parameters to increase capacity
Correct Answer: B
Explanation:
The most effective approach to improve performance on minority classes is applying a class-weighted loss function combined with more aggressive data augmentation on minority classes. This dual strategy addresses class imbalance from complementary angles: class weights force the model to pay more attention to minority classes during training by penalizing their misclassification more heavily, while targeted data augmentation increases the effective size and diversity of minority class training data. Together, these techniques help the model learn robust features for underrepresented categories rather than ignoring them in favor of majority classes.
Class-weighted loss functions modify the standard cross-entropy loss by assigning different weights to different classes based on their frequency in the training set. Instead of treating all classification errors equally, weighted loss penalizes mistakes on minority classes more heavily than mistakes on majority classes. The weights are typically set inversely proportional to class frequencies, so a class with 200 examples might receive a weight 25 times higher than a class with 5,000 examples. During backpropagation, gradients from minority class examples are amplified by these weights, forcing the model to update parameters more significantly when it misclassifies minority examples. This rebalancing ensures that optimizing the loss function requires performing well on all classes, not just abundant ones.
For the product categorization scenario with 50 categories and severe imbalance, you would compute class weights as total number of samples divided by (number of classes times number of samples for each class), or use predefined weights emphasizing minority classes based on business importance. Most deep learning frameworks like PyTorch and TensorFlow provide built-in support for class weights in their loss functions. You specify weights when defining the loss criterion, and the framework automatically applies them during training.
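For illustration, here is a minimal sketch of inverse-frequency class weighting with PyTorch's CrossEntropyLoss; the class counts, batch size, and tensor shapes below are hypothetical stand-ins for the real dataset.

```python
import torch
import torch.nn as nn

# Hypothetical per-class image counts for the 50 categories (ranging from 200 to 5,000).
class_counts = torch.randint(low=200, high=5001, size=(50,)).float()

# Inverse-frequency weighting: w_c = N / (K * n_c), so a 200-image class
# receives roughly 25x the weight of a 5,000-image class.
num_samples = class_counts.sum()
num_classes = class_counts.numel()
class_weights = num_samples / (num_classes * class_counts)

# CrossEntropyLoss applies the per-class weights when averaging the loss,
# amplifying gradients from misclassified minority-class examples.
criterion = nn.CrossEntropyLoss(weight=class_weights)

# Dummy forward pass to show usage.
logits = torch.randn(8, num_classes)           # batch of 8 predictions
labels = torch.randint(0, num_classes, (8,))   # ground-truth class indices
loss = criterion(logits, labels)
```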
Aggressive data augmentation on minority classes increases their effective representation by creating diverse variations of limited examples. For product images, augmentation techniques include random rotations, random cropping and zooming, color jittering adjusting brightness/contrast/saturation, horizontal flipping, random erasing removing random patches, and perspective transformations simulating different camera angles. For minority classes with only 200 examples, applying multiple augmentation techniques can effectively create thousands of varied training examples, giving the model much more diverse data to learn from. You can implement class-specific augmentation by detecting which class a sample belongs to during data loading and applying stronger augmentation to minority classes.
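As a rough sketch of class-specific augmentation (the minority_classes set, transform parameters, and dataset structure below are assumptions, not prescriptions), a custom dataset can route minority-class samples through a heavier torchvision pipeline:

```python
from torch.utils.data import Dataset
from torchvision import transforms

# Lighter pipeline for well-represented classes.
base_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Heavier pipeline for underrepresented classes.
strong_tf = transforms.Compose([
    transforms.RandomRotation(15),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25),   # operates on tensors, so it follows ToTensor
])

class ClassAwareDataset(Dataset):
    """Applies stronger augmentation to samples from minority classes."""
    def __init__(self, samples, minority_classes):
        self.samples = samples                  # list of (PIL image, label) pairs
        self.minority = set(minority_classes)   # labels needing heavy augmentation

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        image, label = self.samples[idx]
        tf = strong_tf if label in self.minority else base_tf
        return tf(image), label
```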
A is incorrect because simply increasing training epochs without addressing class imbalance would likely worsen the problem; with more epochs, the model would see majority class examples many more times than minority class examples, reinforcing bias toward predicting majority classes; the model might even overfit to majority classes while continuing to underperform on minority classes.
C is incorrect because removing minority classes abandons the business requirement to classify products into all 50 categories; this approach solves the technical problem by eliminating difficult classes but fails to meet actual use case requirements; if certain product categories need classification, removing them means the system cannot handle those products in production.
D is incorrect because using a larger model architecture with more parameters does not address the fundamental issue of class imbalance; a larger model might have more capacity to memorize training data, but with severe imbalance, it would primarily use that capacity to better fit majority classes rather than improving minority class performance; model capacity is not the bottleneck.
Question 92
A machine learning team is training a recommendation model using collaborative filtering on user-item interaction data. The dataset contains 50 million users, 1 million items, and 5 billion interactions. Training takes over 48 hours on a single large instance. Which approach would MOST effectively reduce training time while maintaining model quality?
A) Use distributed training across multiple instances with data parallelism
B) Reduce the dataset size by sampling 10% of users randomly
C) Switch to a simpler model like popularity-based recommendations
D) Increase the learning rate to converge faster
Correct Answer: A
Explanation:
The approach that would most effectively reduce training time while maintaining model quality is using distributed training across multiple instances with data parallelism. Distributed training leverages multiple compute instances working in parallel to process different portions of training data simultaneously, dramatically reducing wall-clock training time for large-scale datasets like this collaborative filtering scenario with 5 billion interactions. Data parallelism is particularly well-suited for recommendation systems where the dataset is massive but the model architecture is computationally manageable, allowing efficient distribution of data processing across workers while synchronizing model updates.
Data parallelism works by replicating the model across multiple instances (workers), with each worker maintaining a complete copy of model parameters. The training dataset is partitioned across workers, so each worker processes a different subset of data in parallel. During each training iteration, every worker performs forward and backward passes on its local data batch, computes gradients based on its subset, and then workers synchronize by aggregating gradients across all instances, typically averaging them to compute the global gradient. The aggregated gradient is used to update model parameters, and updated parameters are synchronized across all workers so every worker starts the next iteration with the same model state. This process allows processing multiple batches simultaneously across workers, achieving near-linear speedup with number of workers under ideal conditions.
For the collaborative filtering scenario with 5 billion interactions, distributing training across 8 or 16 instances could reduce the 48-hour training time to 6-8 hours or 3-4 hours respectively, assuming efficient scaling. SageMaker provides built-in support for distributed training with frameworks like TensorFlow and PyTorch. You configure distributed training by specifying instance type and instance count in the SageMaker estimator. For collaborative filtering models based on matrix factorization or neural collaborative filtering, data can be efficiently partitioned because user-item interactions can be processed independently in mini-batches. The model learns user embeddings and item embeddings, and gradients from different interaction batches can be aggregated to update these embeddings globally.
Implementation considerations include choosing appropriate instance types with sufficient CPU and memory (ml.c5.9xlarge or ml.m5.12xlarge for CPU-intensive collaborative filtering, or GPU instances like ml.p3.8xlarge if using deep neural collaborative filtering), configuring the number of workers based on dataset size and budget (8-16 workers often provides a good balance between speedup and cost), and selecting a distributed training strategy such as Horovod, which performs efficient gradient aggregation with ring-allreduce communication, or SageMaker's distributed data parallel library, which is optimized for AWS infrastructure.
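A minimal sketch of launching such a job with the SageMaker Python SDK's PyTorch estimator and the SageMaker distributed data parallel library follows; the script name, role ARN, S3 path, framework version, and hyperparameters are placeholders, and the instance type must be one that supports the library (for example ml.p4d.24xlarge or ml.p3.16xlarge).

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train_recsys.py",             # hypothetical training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    framework_version="2.0",
    py_version="py310",
    instance_count=8,                          # 8 workers, each training on a data shard
    instance_type="ml.p4d.24xlarge",
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    hyperparameters={"epochs": 10, "batch_size": 4096},
)

# Each worker reads its shard of the interaction data; gradients are all-reduced per step.
estimator.fit({"train": "s3://my-bucket/interactions/"})
```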
B is incorrect because reducing the dataset to 10% of users by random sampling would sacrifice 90% of training data, severely degrading model quality and recommendation accuracy; collaborative filtering relies on learning patterns from the full interaction matrix across all users and items; discarding 45 million users and billions of their interactions would eliminate most collaborative signal that makes recommendations accurate; while training would be faster on less data, the resulting model would perform much worse in production.
C is incorrect because switching to a simpler popularity-based recommendation system would dramatically reduce recommendation quality and personalization; popularity recommendations simply suggest the most popular items to all users without any personalization based on individual user preferences or behavior; this approach ignores the vast majority of the 5 billion interactions and provides no collaborative filtering; while extremely fast to compute, popularity-based recommendations fail to provide the personalized, relevant suggestions that collaborative filtering delivers.
D is incorrect because simply increasing the learning rate to converge faster is likely to destabilize training and degrade model quality rather than achieve faster convergence to a good solution; while the learning rate is an important hyperparameter, setting it too high can cause training to diverge, oscillate around the optimum without converging, or converge to a suboptimal solution with worse performance; the learning rate must be carefully tuned, and arbitrarily increasing it does not guarantee faster training.
Question 93
A company is deploying a machine learning model for credit risk assessment that must comply with fair lending regulations. The model uses applicant features including age, income, employment history, and credit history. Regulators require that the model does not discriminate based on protected attributes. Which approach BEST ensures fairness and regulatory compliance?
A) Remove protected attributes like age from the training data and assume fairness is guaranteed
B) Train the model with all features, then use fairness metrics to detect bias and apply post-processing bias mitigation techniques
C) Use only objective financial features and ignore all demographic information
D) Train separate models for different demographic groups to ensure equal performance
Correct Answer: B
Explanation:
The approach that best ensures fairness and regulatory compliance is to train the model with all relevant features, then use fairness metrics to detect bias and apply post-processing bias mitigation techniques when necessary. This comprehensive methodology acknowledges that fairness in machine learning is complex and cannot be achieved simply by removing protected attributes, as discrimination can still occur through proxy variables that correlate with protected attributes. The proper approach involves measuring fairness explicitly using established metrics, identifying where disparities exist, and applying targeted interventions to achieve fairness while maintaining model accuracy for legitimate credit risk assessment.
Training the model with all relevant predictive features, including protected attributes when appropriate and legal, allows you to measure and control for potential disparate impact directly. In credit risk assessment, factors like age, employment history, income, and credit history are legitimately predictive of credit risk, and regulations like the Equal Credit Opportunity Act (ECOA) do not prohibit using age or other protected characteristics when they are demonstrably related to creditworthiness. However, the model must not discriminate unfairly against protected groups. By including protected attributes in the development process, you can explicitly measure whether the model exhibits disparate treatment (intentionally treating groups differently) or disparate impact (neutral policies that disproportionately harm protected groups), and take corrective action.
Fairness metrics provide quantitative measures of potential bias across different protected groups. Key fairness metrics for credit risk include demographic parity (whether approval rates are similar across groups), equalized odds (requires that true positive rates and false positive rates are similar across groups, meaning qualified applicants are approved at similar rates and unqualified applicants are rejected at similar rates regardless of group membership), equal opportunity (requires similar true positive rates across groups, meaning qualified applicants from all groups have equal chances of approval), and predictive parity (among those predicted to be creditworthy, actual default rates are similar across groups). Different fairness definitions may conflict, so you must choose metrics aligned with regulatory requirements and business values. For credit risk, equalized odds or equal opportunity are often most appropriate.
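As a small illustration (the arrays and the binary group indicator below are hypothetical), demographic parity and equal opportunity gaps can be computed directly from predictions, labels, and group membership:

```python
import numpy as np

def fairness_gaps(y_true, y_pred, group):
    """Compare approval rate and true positive rate between two groups (0/1 indicator)."""
    rates = {}
    for g in (0, 1):
        mask = group == g
        approval_rate = y_pred[mask].mean()             # basis for demographic parity
        tpr = y_pred[mask & (y_true == 1)].mean()       # basis for equal opportunity
        rates[g] = (approval_rate, tpr)
    return {
        "demographic_parity_gap": abs(rates[0][0] - rates[1][0]),
        "equal_opportunity_gap": abs(rates[0][1] - rates[1][1]),
    }

# Hypothetical evaluation data: 1 = approved / actually creditworthy.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(fairness_gaps(y_true, y_pred, group))
```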
Post-processing bias mitigation techniques adjust model predictions or decision thresholds to achieve fairness while maintaining overall predictive accuracy. These techniques operate on trained model outputs rather than modifying the training process, making them flexible and auditable. Common post-processing approaches include threshold adjustment (using different classification thresholds for different groups to equalize the desired fairness metric), calibration (ensuring predicted probabilities are calibrated within each group so that a 60% predicted default probability corresponds to a 60% actual default rate for every group), and reject option classification (allowing the model to abstain from decisions for cases near the decision boundary where bias might be most pronounced). SageMaker Clarify provides built-in tools for detecting bias using various fairness metrics.
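A minimal sketch of threshold-based post-processing follows; the scores, group labels, and per-group cutoffs are illustrative values that would in practice be chosen on a validation set (and reviewed for legal permissibility) to close the measured fairness gap:

```python
import numpy as np

def apply_group_thresholds(scores, group, thresholds):
    """Turn model scores into approve/deny decisions using a per-group cutoff."""
    decisions = np.zeros_like(scores, dtype=int)
    for g, cutoff in thresholds.items():
        mask = group == g
        decisions[mask] = (scores[mask] >= cutoff).astype(int)
    return decisions

# Hypothetical repayment-probability scores and group-specific thresholds.
scores = np.array([0.62, 0.48, 0.71, 0.55, 0.59, 0.44])
group  = np.array([0, 0, 0, 1, 1, 1])
decisions = apply_group_thresholds(scores, group, thresholds={0: 0.60, 1: 0.55})
```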
A is incorrect because simply removing protected attributes like age from training data does not guarantee fairness and may actually make bias harder to detect and correct; this naive approach suffers from the problem of proxy variables where other features like zip code, employment history, or seemingly neutral factors correlate with protected attributes and allow the model to make discriminatory decisions indirectly; without including protected attributes, you cannot measure whether the model exhibits disparate impact across groups.
C is incorrect because the notion of purely objective financial features is misleading as many financial features reflect historical discrimination and systemic inequalities; credit scores, for example, can embed historical bias from discriminatory lending practices; focusing only on financial features without measuring fairness outcomes can perpetuate rather than prevent discrimination; additionally, this approach prevents measuring disparate impact because you have no way to evaluate whether outcomes differ across protected groups.
D is incorrect because training separate models for different demographic groups explicitly treats groups differently based on protected characteristics, which constitutes disparate treatment and violates fair lending regulations; this approach is fundamentally discriminatory as it applies different decision-making processes based on protected attributes; regulations require that the same criteria and processes apply to all applicants regardless of protected group membership.
Question 94
A data scientist is building a neural network for regression to predict house prices. After training, the model achieves very low training loss but high validation loss. Training loss continues to decrease while validation loss increases after epoch 10. What is the PRIMARY issue and MOST appropriate solution?
A) The model is underfitting; increase model complexity by adding more layers
B) The model is overfitting; implement early stopping based on validation loss and add regularization
C) The learning rate is too low; increase it to speed up convergence
D) The data is insufficient; collect more training samples
Correct Answer: B
Explanation:
The primary issue is overfitting, where the model learns to memorize training data including noise and random fluctuations rather than learning generalizable patterns that transfer to new data. The most appropriate solution is to implement early stopping based on validation loss and add regularization techniques like L2 regularization or dropout. The classic symptom of overfitting is exactly what is described: decreasing training loss indicating the model is becoming increasingly accurate on training data, while validation loss increases indicating the model is becoming worse at predicting unseen validation data. This divergence between training and validation performance reveals that the model is learning training-specific patterns that do not generalize.
The divergence starting around epoch 10 shows the point where the model transitions from learning useful general patterns to overfitting training-specific noise. In early epochs, both training and validation loss decrease together as the model learns fundamental relationships like larger houses typically costing more or better locations commanding higher prices. After epoch 10, the model begins learning spurious correlations specific to the training set, such as memorizing that a particular combination of features appearing in training data has a specific price, even if that combination was coincidental or influenced by random factors. These memorized patterns do not help predict prices for new houses in validation set, causing validation performance to degrade even as training performance improves.
Early stopping is a simple but highly effective regularization technique that prevents overfitting by monitoring validation loss during training and stopping when validation performance stops improving. Implementation involves tracking validation loss after each epoch, maintaining the best validation loss seen so far and the corresponding model weights, incrementing a patience counter when validation loss fails to decrease by a minimum delta, stopping training when the patience counter reaches a threshold (such as 5-10 epochs without improvement), and then restoring the model weights from the epoch with the best validation performance. For the house price prediction scenario, early stopping would automatically terminate training around epoch 10-15 when validation loss begins increasing, preventing the model from continuing to overfit in subsequent epochs.
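The logic above can be sketched as a simple training loop; train_one_epoch, evaluate, and the data loaders are hypothetical helpers, and the model is assumed to be a PyTorch module.

```python
import copy

patience, min_delta = 7, 1e-4
best_val_loss = float("inf")
best_weights = None
epochs_without_improvement = 0

for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)   # hypothetical helper
    val_loss = evaluate(model, val_loader)            # hypothetical helper

    if val_loss < best_val_loss - min_delta:
        best_val_loss = val_loss
        best_weights = copy.deepcopy(model.state_dict())   # remember the best epoch
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Early stopping at epoch {epoch}")
            break

model.load_state_dict(best_weights)   # restore weights from the best validation epoch
```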
Additional regularization techniques complement early stopping by constraining the model’s capacity to memorize training data. L2 regularization (weight decay) adds a penalty term to the loss function proportional to the squared magnitude of the model weights, discouraging the model from using large weight values that often correspond to overfitting. During training, L2 regularization shrinks weights toward zero, effectively reducing model complexity and forcing the model to rely only on the most important features with the strongest signals. Dropout randomly deactivates a percentage of neurons during each training step, preventing the network from becoming overly reliant on specific neurons or developing co-adapted features that work only in specific combinations seen in the training data. For the regression neural network, you might apply L2 regularization with a penalty coefficient of around 0.001 and dropout with a rate of 0.2-0.5 on the fully connected layers.
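In PyTorch, both techniques amount to a couple of lines; the layer sizes, dropout rate, and weight-decay coefficient below are illustrative choices for a small regression network, not tuned values.

```python
import torch
import torch.nn as nn

# Small regression MLP with dropout on the fully connected layers (20 input features assumed).
model = nn.Sequential(
    nn.Linear(20, 128), nn.ReLU(), nn.Dropout(p=0.3),
    nn.Linear(128, 64), nn.ReLU(), nn.Dropout(p=0.3),
    nn.Linear(64, 1),
)

# weight_decay adds the L2 penalty to the parameter update.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-3)
loss_fn = nn.MSELoss()
```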
A is incorrect because the model is not underfitting; underfitting would manifest as both high training loss and high validation loss, indicating the model lacks capacity to learn patterns in data; in this scenario, training loss is very low showing the model has sufficient capacity to fit training data; the problem is that the model has too much capacity or flexibility relative to amount and diversity of training data, allowing it to memorize rather than generalize.
C is incorrect because the learning rate is not the primary issue; a too-low learning rate would cause slow convergence where both training and validation loss decrease very slowly over many epochs, but would not cause the specific pattern of diverging training and validation loss; the model is converging effectively on training set (evidenced by decreasing training loss), so learning rate is adequate for optimization.
D is incorrect because while more training data can help reduce overfitting by providing more diverse examples that make memorization less effective, it is not the most immediate or practical solution; collecting additional data is often expensive, time-consuming, or impossible, whereas regularization techniques and early stopping can be implemented immediately with existing dataset; additionally, the described symptoms indicate the model has sufficient data to learn but is not properly regularized.
Question 95
A financial services company is building a fraud detection system that processes credit card transactions in real-time. The system must make predictions within 100 milliseconds and handle 20,000 transactions per second during peak hours. The fraud patterns change frequently, requiring model updates every 3 days. Which deployment architecture provides the BEST combination of low latency, high throughput, and rapid model updates?
A) SageMaker Asynchronous Inference with queue-based processing
B) SageMaker Real-time Inference with multi-model endpoints to deploy multiple model versions
C) SageMaker Real-time Inference with auto-scaling and canary deployment using production variants
D) AWS Lambda functions processing transactions in batches every 5 minutes
Correct Answer: C
Explanation:
The deployment architecture that provides the best combination of low latency, high throughput, and rapid model updates is SageMaker Real-time Inference with auto-scaling configured for handling variable load and canary deployment using production variants for safe, rapid model updates. This architecture delivers the consistent sub-100 millisecond latency required for real-time fraud detection, scales automatically to handle 20,000 transactions per second during peak periods, and supports zero-downtime model updates every 3 days through SageMaker’s production variants feature which enables canary and blue-green deployment strategies.
SageMaker Real-time Inference endpoints are specifically designed for synchronous predictions with strict latency requirements. The endpoint keeps the model loaded in memory on dedicated instances, ensuring consistently low inference latency without the unpredictability of cold starts or model loading delays. For typical fraud-detection models such as gradient boosted trees, which are fast to execute (often completing inference in single-digit milliseconds), total latency including network overhead and request handling easily stays under the 100 millisecond requirement. With 20,000 transactions per second during peak hours, you would configure multiple instances behind the endpoint to distribute the load. For example, if each instance can handle 1,000 predictions per second for your model, you would need approximately 20-25 instances during peak times to maintain the required throughput with some headroom for spikes.
Auto-scaling is essential for handling the variable transaction volumes characteristic of credit card processing, where traffic patterns fluctuate between overnight low-traffic periods and peak shopping hours. SageMaker supports application auto-scaling for real-time endpoints based on invocation metrics. You configure target tracking scaling policies that automatically adjust the instance count to maintain a target metric value. For this fraud detection system, you might configure a target of 800 invocations per instance per second (80% of the 1,000-per-second capacity) to maintain performance headroom, set the minimum instance count to 5 for baseline capacity during off-peak periods, and set the maximum instance count to 30 to handle peak loads with buffer capacity. Auto-scaling monitors actual invocation rates in real time and adds or removes instances automatically.
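A sketch of that configuration with boto3 and Application Auto Scaling is shown below; the endpoint and variant names are placeholders, and note that the predefined SageMakerVariantInvocationsPerInstance metric counts invocations per instance per minute, so a target of roughly 800 requests per second per instance translates to 48,000.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/fraud-detector/variant/AllTraffic"   # placeholder names

# Register the variant's instance count as a scalable target (5 to 30 instances).
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=5,
    MaxCapacity=30,
)

# Target-tracking policy on invocations per instance (measured per minute).
autoscaling.put_scaling_policy(
    PolicyName="fraud-endpoint-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 48000.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```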
Canary deployment using production variants enables safe, rapid model updates to keep pace with evolving fraud patterns. Production variants allow you to deploy multiple model versions behind a single endpoint and control traffic distribution through variant weights. For the 3-day update cycle, you would deploy the new model version as a second production variant alongside the current model, initially routing only 10-20% of traffic to the new model (canary variant) while 80-90% continues to the stable model (baseline variant). You then monitor the canary variant’s performance metrics, including prediction latency, error rates, fraud detection accuracy, and false positive rates. If the canary performs well, you gradually increase its traffic weight to 50%, then 100%, over a period of hours. If issues are detected, you immediately reduce canary traffic to 0% or remove the variant entirely, instantly rolling back to the stable model without downtime.
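Shifting traffic between variants is a single API call; in this hedged sketch the endpoint and variant names are placeholders for an endpoint already configured with a stable and a canary production variant.

```python
import boto3

sm = boto3.client("sagemaker")

def set_canary_weight(endpoint_name, canary_weight):
    """Route the given fraction of traffic to the canary variant, the rest to the stable one."""
    sm.update_endpoint_weights_and_capacities(
        EndpointName=endpoint_name,
        DesiredWeightsAndCapacities=[
            {"VariantName": "stable-model", "DesiredWeight": 1.0 - canary_weight},
            {"VariantName": "canary-model", "DesiredWeight": canary_weight},
        ],
    )

set_canary_weight("fraud-detector", 0.10)   # start the canary at 10% of traffic
# ...monitor metrics, then ramp to 0.5 and 1.0, or set back to 0.0 to roll back instantly.
```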
A is incorrect because SageMaker Asynchronous Inference is designed for workloads that can tolerate latencies from seconds to minutes and do not require immediate responses; async inference queues requests for processing and is appropriate for batch-like workloads or long-running predictions, not for real-time fraud detection where transactions must be approved or declined within milliseconds; using async inference would introduce unacceptable delays in transaction processing.
B is incorrect because multi-model endpoints are designed for hosting many different models on shared instances to improve resource utilization when you have numerous models with intermittent traffic; MME loads models on-demand based on which model is requested, which introduces model loading latency; for real-time fraud detection requiring sub-100 millisecond response times with high throughput, model loading overhead makes multi-model endpoints unsuitable; additionally, MME is not designed for canary deployments or A/B testing between model versions.
D is incorrect because Lambda functions processing transactions in batches every 5 minutes introduces 5-minute delays between transaction occurrence and fraud detection, which is completely unsuitable for real-time fraud prevention; fraudulent transactions would complete and funds would be transferred before detection occurs; batch processing with 5-minute intervals violates the fundamental requirement for real-time detection within 100 milliseconds.
Question 96
A data scientist is training a convolutional neural network for medical image classification using 100,000 chest X-ray images. The model shows training accuracy of 92% but validation accuracy of only 68%. The data scientist has already implemented dropout and L2 regularization. What additional technique would MOST effectively improve validation performance?
A) Reduce the number of convolutional layers to decrease model complexity
B) Apply extensive data augmentation including rotations, flips, zooms, and brightness adjustments
C) Increase the batch size to stabilize training
D) Remove dropout and L2 regularization to allow the model more flexibility
Correct Answer: B
Explanation:
The additional technique that would most effectively improve validation performance is applying extensive data augmentation including rotations, flips, zooms, and brightness adjustments. The large gap between training accuracy (92%) and validation accuracy (68%) indicates significant overfitting despite existing regularization through dropout and L2 penalties. Data augmentation is particularly powerful for computer vision tasks because it artificially increases the diversity and effective size of the training dataset by creating realistic variations of existing images, forcing the model to learn features that are robust to these variations rather than memorizing specific training examples.
Data augmentation for medical chest X-ray images can include various transformations that simulate real-world variations in how X-rays are captured and positioned. Random rotations of small angles (typically ±5 to ±15 degrees) account for slight variations in patient positioning during X-ray capture, helping the model learn anatomical features that are rotation-invariant. Horizontal flipping creates mirror images that are anatomically valid for chest X-rays since the human body is roughly bilaterally symmetric, effectively doubling the training data. Random zooming and scaling (typically 90-110% of original size) simulate variations in patient distance from imaging equipment and help the model recognize pathologies at different scales. Brightness and contrast adjustments simulate different exposure settings and imaging equipment characteristics, making the model robust to lighting variations across different hospitals and machines.
Additional augmentation techniques specific to medical imaging can further improve generalization. Random cropping and resizing forces the model to recognize pathologies regardless of their exact position in the image. Slight elastic deformations can simulate natural variations in patient anatomy and positioning. Gaussian noise addition makes the model robust to sensor noise in imaging equipment. For X-ray images specifically, you should avoid aggressive augmentations like large rotations or perspective distortions that could create anatomically unrealistic images, as these might teach the model to recognize impossible patterns. The key is finding augmentations that create realistic variations the model might encounter in production while avoiding unrealistic transformations that inject invalid training signal.
Implementation of data augmentation in modern deep learning frameworks is straightforward and computationally efficient. In PyTorch, you define augmentation transforms in the dataset’s __getitem__ method or use torchvision.transforms to create augmentation pipelines. In TensorFlow/Keras, you use ImageDataGenerator or Keras preprocessing layers (such as tf.keras.layers.RandomRotation) to apply augmentations on-the-fly during training. Critically, augmentations are applied randomly during training only, not during validation or testing, ensuring that validation accuracy measures true generalization to unaugmented data. For the chest X-ray scenario with 100,000 images, aggressive augmentation could expose the model to millions of unique image variations during training, dramatically reducing overfitting. Combined with the existing dropout and L2 regularization, data augmentation provides complementary regularization that should significantly close the gap between training and validation accuracy.
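For concreteness, a conservative torchvision pipeline along these lines might look as follows; the specific angles, crop scale, and jitter strengths are illustrative and would be validated against radiologist guidance for the particular dataset.

```python
from torchvision import transforms

# Training-only augmentation for chest X-rays: small rotations, mild zoom via random
# resized crops, horizontal flips, and brightness/contrast jitter.
train_tf = transforms.Compose([
    transforms.RandomRotation(degrees=10),
    transforms.RandomResizedCrop(size=224, scale=(0.9, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# Validation/test images are only resized and center-cropped, never augmented.
eval_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```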
A is incorrect because reducing the number of convolutional layers decreases model capacity, which might reduce overfitting but would likely also reduce the model’s ability to learn complex visual features necessary for accurate medical image classification; the model is already achieving 92% training accuracy, indicating it has adequate capacity to learn useful features; reducing capacity risks underfitting where the model cannot capture important patterns in chest X-rays; the better approach is maintaining model capacity while improving generalization through data augmentation.
C is incorrect because increasing batch size primarily affects training dynamics and gradient estimation stability but does not directly address overfitting; larger batches provide more stable gradient estimates and can sometimes lead to better generalization through different optimization dynamics, but this effect is typically modest and inconsistent; increasing batch size also requires more memory and may necessitate learning rate adjustments; batch size adjustment is a hyperparameter tuning approach rather than fundamental solution to the large generalization gap.
D is incorrect because removing dropout and L2 regularization would almost certainly make overfitting worse, not better; these regularization techniques are specifically designed to reduce overfitting by constraining model complexity and preventing memorization of training data; removing them would give the model more flexibility to overfit the training set further, likely increasing training accuracy toward 100% while validation accuracy decreases even more; the problem is not that the model lacks flexibility but that it is already overfitting despite existing regularization.
Question 97
A machine learning team is building a natural language processing model to extract named entities (person names, organizations, locations, dates) from legal documents. The documents are lengthy, often exceeding 10,000 words, and contain domain-specific legal terminology. The team has 5,000 labeled documents. Which approach provides the BEST accuracy for this specialized NER task?
A) Train a CRF (Conditional Random Fields) model from scratch with hand-crafted features
B) Fine-tune a pre-trained BERT model on the legal documents with domain-specific vocabulary extension
C) Use Amazon Comprehend’s built-in entity recognition without customization
D) Build a rule-based system using regex patterns for each entity type
Correct Answer: B
Explanation:
The approach that provides the best accuracy for named entity recognition in specialized legal documents is fine-tuning a pre-trained BERT model on the legal documents with domain-specific vocabulary extension. This approach combines the powerful contextual understanding of pre-trained language models with adaptation to the specific domain of legal text and its unique terminology. BERT’s transformer architecture excels at capturing long-range dependencies and contextual relationships in text, which is crucial for understanding lengthy legal documents where entity references may depend on context established thousands of words earlier. Fine-tuning allows the model to specialize for legal domain vocabulary and entity patterns while leveraging general language understanding from pre-training.
Pre-trained BERT models like BERT-base or BERT-large have learned deep linguistic representations from billions of words of general text through masked language modeling and next sentence prediction objectives. This pre-training provides a strong foundation of language understanding including syntax, semantics, coreference resolution, and entity recognition capabilities. However, legal documents contain specialized terminology, formal language structures, Latin phrases, statutory references, and citation formats that differ significantly from general text. Domain-specific vocabulary extension addresses this gap by expanding BERT’s vocabulary and continuing pre-training on a large corpus of unlabeled legal documents before fine-tuning for NER. This domain adaptation step, sometimes called domain-adaptive pre-training, helps the model learn representations of legal terminology and language patterns.
The vocabulary extension process involves collecting a large corpus of legal documents (which can be unlabeled since pre-training is self-supervised), extracting domain-specific terms that are not well-represented in BERT’s original vocabulary (such as legal terminology like "appellant," "deposition," "tort," "habeas corpus," statutory references, and case citations), adding these terms to BERT’s tokenizer vocabulary to prevent them from being split into subword tokens that lose semantic meaning, initializing embeddings for new vocabulary tokens (often by averaging embeddings of their subword components), and continuing masked language modeling pre-training on legal text for several epochs to learn contextualized representations of legal language. This creates a legal-domain BERT model that understands legal terminology in context.
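With the Hugging Face transformers library, the tokenizer and embedding part of this process is brief; the term list below is a tiny illustrative sample, and in practice the new tokens would be mined from the legal corpus.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Illustrative legal terms that the default vocabulary would split into subwords.
legal_terms = ["appellant", "interlocutory", "estoppel", "habeas", "certiorari"]
num_added = tokenizer.add_tokens(legal_terms)

# Grow the embedding matrix so the new tokens get trainable vectors; continued
# masked-language-model pre-training on legal text then learns them in context.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```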
Fine-tuning the domain-adapted BERT for NER involves formatting the 5,000 labeled legal documents with entity annotations in BIO format (Begin-Inside-Outside) where each token is labeled as B-PERSON (beginning of person name), I-PERSON (inside person name), B-ORG (beginning of organization), B-LOC (beginning of location), B-DATE, or O (outside any entity). You add a token classification head on top of BERT’s encoder that predicts entity labels for each token, load the domain-adapted BERT weights as initialization, and fine-tune the entire model on labeled legal documents using a modest learning rate (typically 2e-5 to 5e-5) to avoid catastrophic forgetting of pre-trained knowledge. The bidirectional context in BERT is particularly valuable for legal NER because determining entity types often requires understanding both preceding and following context.
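A condensed fine-tuning sketch with the transformers Trainer is shown below; the checkpoint name is a placeholder for the domain-adapted model, train_ds and val_ds are assumed to be pre-tokenized datasets with aligned BIO label IDs, and the hyperparameters are only starting points.

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["O", "B-PERSON", "I-PERSON", "B-ORG", "I-ORG",
          "B-LOC", "I-LOC", "B-DATE", "I-DATE"]
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in id2label.items()}

checkpoint = "legal-bert-domain-adapted"   # placeholder for the domain-adapted model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=len(labels), id2label=id2label, label2id=label2id
)

args = TrainingArguments(
    output_dir="legal-ner",
    learning_rate=3e-5,                  # modest LR to avoid catastrophic forgetting
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=train_ds, eval_dataset=val_ds)   # pre-tokenized BIO datasets
trainer.train()
```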
A is incorrect because training a CRF model from scratch with hand-crafted features requires extensive manual feature engineering including word shapes, prefixes, suffixes, capitalization patterns, part-of-speech tags, dependency parse features, gazetteers of known entities, and context window features; this approach is labor-intensive and cannot capture the deep contextual understanding that BERT provides through its attention mechanisms; CRF models also struggle with long-range dependencies in 10,000-word documents since they typically use limited context windows; while CRFs were state-of-the-art for NER before deep learning, they are now significantly outperformed by transformer-based models.
C is incorrect because Amazon Comprehend’s built-in entity recognition is trained on general text and recognizes standard entity types like PERSON, LOCATION, ORGANIZATION, and DATE using patterns learned from general domains; legal documents contain specialized entity types, unique naming conventions, complex nested entities, and domain-specific language that Comprehend’s general model has not been trained to handle; without customization or training on legal documents, Comprehend would miss many legal-specific entities, misclassify entities due to unfamiliarity with legal terminology, and fail to achieve the accuracy needed for production legal document processing.
D is incorrect because rule-based systems using regex patterns are brittle, require exhaustive enumeration of all possible entity patterns, and cannot handle the enormous variation in how entities appear in natural language; legal documents contain entities in countless formats that would require thousands of regex patterns to capture; person names have infinite variations and cultural diversity that regex cannot enumerate; organizations may be referred to by abbreviations, full names, or informal references that change based on context; dates appear in numerous formats; rule-based systems also cannot handle ambiguity where same text could be different entity types depending on context.
Question 98
A retail company is building a demand forecasting model for inventory management across 500 stores and 10,000 products. The forecasts must account for store-specific trends, product seasonality, promotional events, and local weather conditions. The data science team needs to generate forecasts for the next 30 days updated daily. Which AWS service provides the MOST comprehensive solution with minimal operational overhead?
A) SageMaker DeepAR algorithm with custom implementation for related time series
B) Amazon Forecast with related time series for weather and promotions
C) SageMaker Linear Learner with manual feature engineering for temporal patterns
D) Custom LSTM model built with TensorFlow on SageMaker
Correct Answer: B
Explanation:
The AWS service that provides the most comprehensive solution with minimal operational overhead for this complex multi-variate demand forecasting scenario is Amazon Forecast with related time series for incorporating weather and promotional data. Amazon Forecast is a fully managed time series forecasting service specifically designed for scenarios exactly like this where you need to forecast many related time series (500 stores times 10,000 products equals 5 million time series) while incorporating multiple external factors. The service automates the entire forecasting workflow including data preprocessing, algorithm selection, model training, hyperparameter tuning, and inference, making it ideal for minimizing operational overhead.
Amazon Forecast excels at handling the complexity described in this scenario through its support for multiple data types and automatic feature engineering. The service accepts target time series data containing historical demand for each product-store combination with timestamps and quantities sold, related time series data that varies over time such as daily weather conditions for each store location (temperature, precipitation, humidity) and promotional calendar indicating when products were on sale or featured in marketing campaigns, and item metadata that is static such as product category, brand, price tier, and store characteristics like size, location, demographics. Forecast automatically processes these different data types and learns how they interact to influence demand patterns.
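A hedged sketch of registering the target and related datasets with boto3 is shown below; the dataset names, daily frequency, and custom attributes (temperature, promotion) are illustrative, and the schemas follow the RETAIL domain's reserved fields (item_id, timestamp, demand, optional location).

```python
import boto3

forecast = boto3.client("forecast")

# Target time series: historical demand per product-store combination.
forecast.create_dataset(
    DatasetName="store_product_demand",
    Domain="RETAIL",
    DatasetType="TARGET_TIME_SERIES",
    DataFrequency="D",
    Schema={"Attributes": [
        {"AttributeName": "item_id",   "AttributeType": "string"},
        {"AttributeName": "timestamp", "AttributeType": "timestamp"},
        {"AttributeName": "demand",    "AttributeType": "float"},
        {"AttributeName": "location",  "AttributeType": "string"},   # store identifier
    ]},
)

# Related time series: weather and promotion signals that vary over time.
forecast.create_dataset(
    DatasetName="weather_and_promotions",
    Domain="RETAIL",
    DatasetType="RELATED_TIME_SERIES",
    DataFrequency="D",
    Schema={"Attributes": [
        {"AttributeName": "item_id",     "AttributeType": "string"},
        {"AttributeName": "timestamp",   "AttributeType": "timestamp"},
        {"AttributeName": "temperature", "AttributeType": "float"},
        {"AttributeName": "promotion",   "AttributeType": "float"},
        {"AttributeName": "location",    "AttributeType": "string"},
    ]},
)
```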
A is incorrect because using SageMaker DeepAR algorithm requires manual implementation of significant infrastructure and workflows including writing custom preprocessing code to format time series data and related covariates, implementing training scripts with proper DeepAR configuration, managing training jobs and hyperparameter tuning manually, building custom inference pipelines to generate forecasts for 5 million time series, implementing scheduling and automation for daily forecast updates, and monitoring and maintaining the entire system; while DeepAR is a powerful algorithm (and is actually used within Amazon Forecast), implementing it directly on SageMaker requires substantially more operational effort.
C is incorrect because SageMaker Linear Learner is designed for classification and regression on tabular data, not for time series forecasting; using Linear Learner for forecasting would require extensive manual feature engineering to create lagged demand values, rolling statistics (moving averages, standard deviations), seasonal indicators for different periodicities (day of week, month, quarter), trend features, promotion encodings and interaction terms, weather features and their interactions with product categories, and calendar features for holidays; this feature engineering is complex, time-consuming, requires significant domain expertise, and would need to be manually updated and maintained.
D is incorrect because building a custom LSTM model with TensorFlow on SageMaker requires significant deep learning expertise and operational effort including designing the LSTM architecture (number of layers, hidden dimensions, attention mechanisms), implementing data preprocessing pipelines for sequential data, writing training code with proper sequence handling and batching, managing training infrastructure and hyperparameter optimization, implementing inference logic for generating multi-step forecasts, building automation for daily retraining and forecast generation, and maintaining and monitoring the entire custom system; while LSTMs can be effective for time series forecasting, building and operating a custom solution requires substantially more effort than using the purpose-built, fully managed Amazon Forecast service.
Question 99
A data scientist is evaluating a binary classification model for detecting fraudulent insurance claims. The test dataset contains 10,000 claims with 100 actual fraudulent cases (1% fraud rate). The model achieves 99% accuracy. However, the business team reports that the model is not useful in production. What is the MOST likely issue and appropriate evaluation metric?
A) The model is overfitting; use cross-validation for better evaluation
B) High accuracy is misleading due to class imbalance; evaluate using precision, recall, and F1-score for the fraud class
C) The test set is too small; collect more test data
D) The model architecture is wrong; switch to a different algorithm
Correct Answer: B
Explanation:
The most likely issue is that the high accuracy is misleading due to severe class imbalance, and the appropriate evaluation metrics are precision, recall, and F1-score specifically for the fraud class (positive class). With only 1% of claims being fraudulent, a naive model that predicts "not fraud" for every single claim would achieve 99% accuracy without detecting any fraud whatsoever. This scenario perfectly illustrates why accuracy is a poor metric for imbalanced classification problems and why the business team finds the model useless despite its seemingly impressive accuracy score. The model is likely predicting the majority class (legitimate claims) for almost all cases, missing the rare but critical fraudulent claims that the system is designed to detect.
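The effect is easy to reproduce; in the synthetic example below (numbers chosen to mimic the scenario), a model that flags only 10 of the 100 frauds still reaches about 99% accuracy while its fraud-class recall is just 0.10.

```python
import numpy as np
from sklearn.metrics import classification_report, precision_recall_fscore_support

# 10,000 claims, 100 of them fraudulent; the model catches only 10 frauds.
y_true = np.zeros(10_000, dtype=int)
y_true[:100] = 1
y_pred = np.zeros(10_000, dtype=int)
y_pred[:10] = 1

accuracy = (y_true == y_pred).mean()          # ~0.991 despite missing 90% of the fraud
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=[1], zero_division=0)
print(f"accuracy={accuracy:.3f} fraud precision={precision[0]:.2f} "
      f"recall={recall[0]:.2f} F1={f1[0]:.2f}")
print(classification_report(y_true, y_pred, target_names=["legit", "fraud"], zero_division=0))
```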
A is incorrect because the issue is not overfitting to the training set but rather the use of an inappropriate evaluation metric that hides poor performance on the minority class; cross-validation would still show high accuracy on each fold due to class imbalance, perpetuating the same problem; cross-validation is valuable for assessing generalization and reducing variance in performance estimates, but it does not solve the fundamental issue that accuracy is meaningless for imbalanced data; even with cross-validation, you would still need to use appropriate metrics like F1-score to properly evaluate fraud detection performance.
C is incorrect because the test set size of 10,000 claims with 100 fraudulent examples is actually reasonable for evaluation purposes; the 100 fraudulent cases provide sufficient sample size to calculate meaningful precision and recall statistics; the problem is not the test set size but the choice of evaluation metric; collecting more test data would still show 99% accuracy and would not reveal the model’s poor fraud detection performance unless you use appropriate metrics; increasing test set size without changing evaluation metrics does not address the root cause.
D is incorrect because there is no evidence that the model architecture or algorithm is the primary problem; the issue is that the model is being evaluated using an inappropriate metric that makes it appear successful when it is actually failing at its intended task; the same problem would occur with any algorithm if evaluated only on accuracy for imbalanced data; before concluding that the algorithm is wrong, you must first properly evaluate performance using metrics that focus on fraud detection (precision, recall, F1-score) to understand actual performance.
Question 100
A machine learning engineer is deploying a computer vision model for quality inspection in a manufacturing facility. The facility has limited and unreliable internet connectivity to AWS cloud services. The model must process images from production line cameras in real-time with latency under 50 milliseconds and continue operating during network outages. Which deployment strategy is MOST appropriate?
A) Deploy the model to SageMaker Real-time Inference endpoints in the AWS cloud
B) Use AWS IoT Greengrass to deploy the model to edge devices at the manufacturing facility
C) Deploy the model on Lambda functions triggered by S3 image uploads
D) Use SageMaker Batch Transform to process images periodically
Correct Answer: B
Explanation:
The most appropriate deployment strategy for this edge manufacturing scenario with connectivity constraints is using AWS IoT Greengrass to deploy the model to edge devices located at the manufacturing facility for local inference. IoT Greengrass extends AWS capabilities to edge locations, allowing machine learning models to run locally on on-premises hardware while maintaining optional cloud connectivity for management, monitoring, and updates. This architecture directly addresses the critical requirements of operating with unreliable internet connectivity, achieving real-time inference with sub-50 millisecond latency, and maintaining continuous operation during network outages that would disable cloud-dependent solutions.
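The inference component that Greengrass deploys can be as simple as a local loop over camera frames; the sketch below assumes the model has been exported to ONNX and that grab_frame stands in for the camera SDK, so every name here is a placeholder rather than a Greengrass or camera API.

```python
import time
import numpy as np
import onnxruntime as ort   # assumes the vision model was exported to ONNX for the edge device

session = ort.InferenceSession("defect_classifier.onnx")   # model file deployed locally
input_name = session.get_inputs()[0].name

def grab_frame():
    """Placeholder for the camera SDK call returning a preprocessed image tensor."""
    return np.random.rand(1, 3, 224, 224).astype(np.float32)

for _ in range(1000):                                       # production code would loop forever
    frame = grab_frame()
    start = time.perf_counter()
    (scores,) = session.run(None, {input_name: frame})      # local inference, no network hop
    latency_ms = (time.perf_counter() - start) * 1000
    is_defect = float(scores[0].max()) > 0.8                # hypothetical decision threshold
    # Act locally (e.g., signal the line controller); buffer results and sync to the
    # cloud opportunistically when connectivity is available.
```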
A is incorrect because deploying to SageMaker Real-time Inference endpoints in the AWS cloud requires continuous internet connectivity to send every image from the manufacturing facility to AWS for processing and receive results; with unreliable connectivity, the system would experience frequent failures whenever the internet connection drops, halting quality inspection and potentially allowing defective products through the production line; additionally, cloud-based inference introduces network latency from uploading images, cloud processing, and downloading results, which combined with variable network conditions would likely exceed the 50 millisecond latency requirement.
C is incorrect because deploying on Lambda functions triggered by S3 uploads requires uploading images to S3 over the internet, which depends on network connectivity and introduces significant latency from image upload time, S3 event processing delay, Lambda function initialization (cold starts), and result retrieval; this architecture could introduce latencies of seconds rather than the required sub-50 milliseconds; critically, Lambda deployment requires internet connectivity for every inference, making it completely non-functional during network outages.
D is incorrect because SageMaker Batch Transform is designed for asynchronous batch processing of large datasets with latencies of minutes to hours, not real-time inference with millisecond latency requirements; batch transform accumulates data and processes it periodically rather than providing immediate results for each input; in a manufacturing quality inspection context, batch processing would mean defective products continue through the production line for minutes or hours before detection, rendering the quality control system essentially useless.
Question 101
A data scientist is training a deep neural network for time series forecasting to predict server CPU utilization for the next 24 hours. The model uses the past 7 days of CPU measurements taken every 5 minutes as input features. During training, validation loss decreases for the first 20 epochs but then begins to increase while training loss continues to decrease. What is the PRIMARY cause and MOST effective solution?
A) Learning rate is too high causing unstable training; reduce the learning rate
B) The model is overfitting to training data; implement early stopping and add dropout layers
C) Insufficient training data; collect more historical server data
D) Input features are not normalized; apply standardization to the time series data
Correct Answer: B
Explanation:
The primary cause of validation loss decreasing initially then increasing while training loss continues to decrease is overfitting, and the most effective solution is implementing early stopping based on validation loss combined with adding dropout layers as regularization. This pattern is a textbook symptom of overfitting where the model transitions from learning generalizable patterns in the early epochs to memorizing training-specific patterns in later epochs. The divergence between training and validation performance reveals that the model is becoming increasingly specialized to the training data at the expense of generalization to new data, which is exactly what early stopping and dropout are designed to prevent.
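In Keras this combination is a callback plus dropout layers; the network below is a deliberately small illustration (7 days x 288 five-minute readings in, 24 hourly predictions out), not a recommended architecture.

```python
import tensorflow as tf

# Stop when validation loss has not improved for 5 epochs and keep the best weights.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(2016, 1)),   # 7 days x 288 five-minute samples
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(24),                         # next 24 hourly CPU predictions
])
model.compile(optimizer="adam", loss="mse")

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stop])
```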
A is incorrect because the learning rate being too high would cause unstable training with erratic, non-monotonic behavior in both training and validation loss, often with divergence where losses increase to infinity or NaN values; the described scenario shows training loss decreasing smoothly and validation loss initially decreasing before increasing, which indicates stable optimization that is converging on the training set; this pattern does not suggest learning rate issues; the model is successfully optimizing the training loss, the problem is that the optimization objective (training loss) is not properly aligned with the actual goal (validation performance) due to overfitting.
C is incorrect because while more training data can sometimes help reduce overfitting by providing more diverse examples that make memorization less effective, the described symptom (diverging training and validation loss after epoch 20) indicates that the model has sufficient data to learn but is not properly regularized to prevent overfitting; collecting more data is expensive and time-consuming, whereas early stopping and dropout can be implemented immediately with the existing dataset; additionally, for time series forecasting, the temporal nature means you cannot simply collect more historical data without waiting for time to pass.
D is incorrect because lack of input normalization would typically cause training difficulties from the beginning, manifesting as very slow convergence, unstable gradients, or failure to train at all; if normalization were the issue, you would see poor performance in early epochs rather than the described pattern where validation loss decreases successfully for 20 epochs before diverging; the fact that both training and validation loss decrease together initially indicates that the model is learning successfully and inputs are in a reasonable range; normalization is important for neural network training but is not the cause of the overfitting pattern described.
Question 102
A financial institution is building a credit scoring model to predict loan default risk. Regulatory requirements mandate that the institution must be able to explain every prediction to customers and provide specific reasons for credit denials. The model must achieve high accuracy while maintaining full interpretability. Which modeling approach BEST satisfies both accuracy and explainability requirements?
A) Deep neural network with attention mechanisms for interpretability
B) Gradient Boosting (XGBoost) with SHAP values for detailed feature attribution
C) Random Forest with global feature importance scores
D) Ensemble of multiple black-box models with voting
Correct Answer: B
Explanation:
The modeling approach that best satisfies both high accuracy and full explainability requirements is Gradient Boosting using XGBoost combined with SHAP (SHapley Additive exPlanations) values for detailed feature attribution. This combination provides excellent predictive performance that rivals or exceeds deep learning for structured financial data while offering rigorous, mathematically grounded explanations for individual predictions that satisfy regulatory requirements for transparency in credit decisions. SHAP values quantify exactly how much each feature contributed to a specific prediction, allowing the institution to provide customers with concrete, defensible explanations for credit decisions.
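A compact sketch of the pairing is below; the synthetic data stands in for real applicant features, and the hyperparameters are illustrative.

```python
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

# Synthetic stand-in for applicant features and default labels.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
feature_names = [f"feature_{i}" for i in range(10)]

model = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X, y)

# TreeExplainer gives per-prediction attributions: each value is how much that feature
# pushed this applicant's score above or below the baseline, which is the kind of
# reason code regulators expect for an individual denial.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:5])

for name, contribution in zip(feature_names, shap_values[0]):
    print(f"{name}: {contribution:+.3f}")
```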
A is incorrect because deep neural networks, even with attention mechanisms, remain fundamentally difficult to interpret and are generally considered black boxes for regulatory purposes; attention weights show which inputs the model focused on but do not provide clear quantitative attributions of how much each feature contributed to the prediction; attention mechanisms can be misleading as high attention does not necessarily mean high contribution to the output; for regulated credit decisions where regulators and customers need clear, defensible explanations, neural networks present significant interpretability challenges that make them less suitable than transparent tree-based models with SHAP explanations.
C is incorrect because while Random Forest provides global feature importance scores showing which features are generally important across all predictions, these global importance measures do not explain individual predictions with the specificity required for credit decision explanations; global importance shows that debt-to-income ratio is important overall but does not tell a specific applicant how much their particular debt-to-income ratio contributed to their denial versus other factors; for regulatory compliance, you need local explanations for individual decisions, not just global model behavior; SHAP provides both local and global interpretability.
D is incorrect because ensembles of multiple black-box models create even greater interpretability challenges than single black-box models; combining predictions from multiple complex models through voting makes it nearly impossible to provide clear explanations of why a specific decision was made; you cannot easily attribute the final decision to specific features when it results from averaging or voting across multiple opaque models; this approach maximizes accuracy at the expense of explainability, which is exactly the opposite of what regulatory requirements demand.
Question 103
A data scientist is building a recommendation system for an e-commerce platform with 10 million users and 500,000 products. The system must generate personalized product recommendations that update in real-time as users browse and purchase items. The data science team has limited expertise in recommendation algorithms and needs to deploy the solution within 3 weeks. Which approach provides the BEST balance of recommendation quality, real-time updates, and rapid implementation?
A) Build a custom collaborative filtering model using SageMaker Factorization Machines
B) Implement a content-based filtering system using product attributes and Elasticsearch
C) Use Amazon Personalize with USER_PERSONALIZATION recipe and real-time event tracking
D) Build a custom deep learning recommendation model using neural collaborative filtering with PyTorch
Correct Answer: C
Explanation:
The approach that provides the best balance of recommendation quality, real-time updates, and rapid implementation is Amazon Personalize with the USER_PERSONALIZATION recipe and real-time event tracking. Amazon Personalize is a fully managed machine learning service specifically designed for building recommendation systems at scale without requiring deep expertise in recommendation algorithms or extensive development time. The service handles all aspects of data processing, model training, hyperparameter optimization, deployment, and scaling automatically, making it ideal for teams with limited recommendation expertise who need to deploy production-quality systems quickly. The 3-week timeline is realistic with Personalize but would be extremely challenging with custom implementations.
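At runtime, the integration comes down to two API calls, sketched below with a hypothetical campaign ARN and event-tracker ID: GetRecommendations against a campaign trained with the USER_PERSONALIZATION recipe, and PutEvents to stream live interactions so recommendations adapt within the session.

```python
# Minimal sketch of the Personalize runtime calls; ARN, tracking ID, user/session/item
# IDs are placeholders, not values from the scenario.
import datetime
import boto3

personalize_runtime = boto3.client("personalize-runtime")
personalize_events = boto3.client("personalize-events")

CAMPAIGN_ARN = "arn:aws:personalize:us-east-1:123456789012:campaign/product-recs"  # hypothetical
TRACKING_ID = "11111111-2222-3333-4444-555555555555"                               # hypothetical

# Real-time personalized recommendations for a browsing user
response = personalize_runtime.get_recommendations(
    campaignArn=CAMPAIGN_ARN,
    userId="user-42",
    numResults=10,
)
recommended_items = [item["itemId"] for item in response["itemList"]]

# Stream the latest interaction so subsequent recommendations reflect it
personalize_events.put_events(
    trackingId=TRACKING_ID,
    userId="user-42",
    sessionId="session-abc",
    eventList=[{
        "eventType": "purchase",
        "itemId": "product-987",
        "sentAt": datetime.datetime.now(datetime.timezone.utc),
    }],
)
```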
A is incorrect because building a custom collaborative filtering model using SageMaker Factorization Machines requires significant development effort including implementing data preprocessing to create user-item interaction matrices, writing training scripts with proper Factorization Machines configuration, managing model training and hyperparameter tuning, building custom inference infrastructure to serve recommendations at scale for 10 million users, implementing real-time event processing to update recommendations as user behavior changes, and developing recommendation retrieval logic; this custom approach would require several months of development time, not 3 weeks, and demands substantial recommendation systems expertise.
B is incorrect because implementing a content-based filtering system using Elasticsearch focuses only on product attributes and individual user preferences without leveraging collaborative patterns across the user base; content-based approaches recommend items similar to what a user has liked based on product features, but miss the powerful collaborative signals that reveal what similar users enjoyed; for example, content-based filtering might recommend products with similar descriptions but would not discover that users who bought item A often also enjoy item B even when the items have different attributes; additionally, building a content-based system requires developing custom similarity algorithms, managing Elasticsearch infrastructure, engineering product features, and implementing recommendation logic.
D is incorrect because building a custom deep learning recommendation model using neural collaborative filtering with PyTorch requires extensive expertise in deep learning and recommendation systems including designing neural network architectures for collaborative filtering (embedding layers, interaction layers, hidden layers), implementing complex training pipelines for 10 million users and 500,000 products, managing distributed training infrastructure to handle the large-scale data, building custom serving infrastructure for real-time recommendations, implementing real-time event processing and model updates, and extensive testing and optimization; this approach would require months of development by experienced machine learning engineers, far exceeding the 3-week timeline.
Question 104
A healthcare organization is training a machine learning model to diagnose diseases from medical images. The training dataset contains 50,000 images, but 80% are normal cases while only 20% show various pathologies. After training a convolutional neural network, the model achieves 85% accuracy but performs poorly on pathology detection with recall of only 45%. What combination of techniques would MOST effectively improve pathology detection performance?
A) Increase training epochs and use a larger neural network architecture
B) Apply focal loss, class-weighted sampling, and targeted data augmentation on pathology images
C) Remove normal cases to balance the dataset perfectly
D) Switch to a simpler model like logistic regression for better generalization
Correct Answer: B
Explanation:
The combination of techniques that would most effectively improve pathology detection performance is applying focal loss, class-weighted sampling, and targeted data augmentation on pathology images. This multi-pronged approach addresses class imbalance from complementary angles: focal loss modifies the training objective to focus on hard-to-classify examples and down-weight easy examples, class-weighted sampling ensures pathology images appear more frequently in training batches despite being minority examples, and targeted data augmentation increases the diversity and effective quantity of pathology training data. Together, these techniques force the model to learn robust features for detecting pathologies rather than achieving high accuracy by predominantly predicting normal cases.
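A minimal PyTorch sketch of the three techniques follows; the class counts, sampling weights, and augmentation parameters are illustrative choices, not values prescribed by the scenario.

```python
# Sketch: binary focal loss, weighted sampling of the pathology class, and a
# heavier augmentation pipeline reserved for pathology images.
import torch
import torch.nn.functional as F
from torch.utils.data import WeightedRandomSampler
from torchvision import transforms

def focal_loss(logits, targets, alpha=0.75, gamma=2.0):
    """Binary focal loss: down-weights easy examples so training focuses on hard ones."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                                    # probability assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # weight the pathology class more heavily
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

# Example batch: 1 = pathology, 0 = normal
logits = torch.randn(8)
targets = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0])
loss = focal_loss(logits, targets)

# Class-weighted sampling: draw pathology images (label 1) four times as often
labels = torch.tensor([0] * 40000 + [1] * 10000)     # 80% normal, 20% pathology
sample_weights = 1.0 + 3.0 * (labels == 1).float()   # 1.0 for normal, 4.0 for pathology
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
# loader = DataLoader(dataset, batch_size=64, sampler=sampler)  # dataset assumed to exist

# Targeted augmentation: applied only to pathology images (PIL image input assumed)
pathology_augment = transforms.Compose([
    transforms.RandomRotation(15),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5),
])
```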
A is incorrect because simply increasing training epochs without addressing class imbalance would likely worsen the problem by allowing the model more time to overfit to the majority normal class; the model would become even more confident in predicting normal cases while continuing to struggle with pathologies; using a larger architecture increases model capacity but does not address the optimization dynamics that cause the model to ignore the minority class; more capacity without rebalancing the training signal would primarily be used to better fit normal cases rather than improving pathology detection.
C is incorrect because removing normal cases to achieve perfect balance would discard 60% of the training data (dropping 30,000 of the 40,000 normal images, reducing the dataset from 50,000 images to 20,000 with equal normal and pathology counts), severely limiting the amount of training data available; this sacrifices valuable information about what normal cases look like, which the model needs to learn to distinguish them from pathologies; additionally, the resulting model would be calibrated to a 50-50 distribution that does not match the real-world distribution where normal cases are much more common.
D is incorrect because switching to a simpler model like logistic regression would drastically reduce the model’s capacity to learn complex visual patterns in medical images that convolutional neural networks excel at detecting; CNNs use hierarchical feature learning through convolutional layers to identify edges, textures, shapes, and pathological patterns at multiple scales, which logistic regression cannot capture; the problem is not model complexity but rather class imbalance affecting training dynamics; a simpler model would likely perform even worse on pathology detection because it lacks the representational capacity to learn subtle visual indicators of disease.
Question 105
A machine learning team is deploying multiple versions of a sentiment analysis model to production simultaneously to conduct A/B testing. They want to route 70% of traffic to the current stable model (version 1) and 30% to an experimental model (version 2) while collecting performance metrics for both. The team needs to adjust traffic distribution dynamically based on performance without downtime. Which SageMaker feature enables this capability MOST effectively?
A) Deploy two separate endpoints and use Application Load Balancer weighted target groups
B) Use SageMaker Multi-Model Endpoints with custom routing logic in the application
C) Create a single endpoint with production variants configured with weights of 70 and 30
D) Deploy models separately and implement traffic routing in application code
Correct Answer: C
Explanation:
The SageMaker feature that enables A/B testing with dynamic traffic distribution most effectively is creating a single endpoint with production variants configured with weights of 70 and 30. Production variants are SageMaker’s native, purpose-built capability for A/B testing, canary deployments, and blue-green deployments. This feature allows deploying multiple model versions behind a single endpoint URL with automatic traffic distribution according to configured weights, integrated monitoring for each variant, and the ability to dynamically adjust traffic distribution without downtime or application changes. This is exactly what A/B testing requires and provides a clean, managed solution without complex infrastructure or custom routing logic.
Production variants work by allowing you to specify multiple model configurations within a single endpoint, each with its own model artifact, instance type, instance count, and traffic weight. When you create the endpoint configuration, you define variant 1 containing the stable sentiment model version 1 with weight 70, and variant 2 containing the experimental model version 2 with weight 30. The weights are relative, so 70 and 30 result in a 70-30 traffic distribution, while weights of 7 and 3 would achieve the same ratio. When you deploy this configuration to an endpoint, SageMaker automatically begins routing approximately 70% of incoming inference requests to variant 1 and 30% to variant 2. The routing happens probabilistically on the server side and is completely transparent to the client application, which continues calling the same endpoint URL without any awareness of multiple variants or traffic splitting.
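A configuration along these lines might look like the following boto3 sketch, where the model names, endpoint names, instance types, and instance counts are placeholders:

```python
# Sketch: one endpoint, two production variants with a 70/30 traffic split.
import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="sentiment-ab-config",
    ProductionVariants=[
        {
            "VariantName": "stable-v1",
            "ModelName": "sentiment-model-v1",   # hypothetical, created beforehand
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 2,
            "InitialVariantWeight": 70,
        },
        {
            "VariantName": "experimental-v2",
            "ModelName": "sentiment-model-v2",   # hypothetical, created beforehand
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 30,
        },
    ],
)

sm.create_endpoint(EndpointName="sentiment-ab", EndpointConfigName="sentiment-ab-config")
```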
The production variants approach provides several critical advantages for A/B testing. First, SageMaker automatically publishes detailed CloudWatch metrics for each variant separately, including invocation count, model latency (time spent in model inference), overhead latency (time spent in SageMaker infrastructure), error counts and rates, and any custom metrics you emit from your inference code. This automatic per-variant monitoring allows you to objectively compare performance between the stable and experimental models, measuring differences in prediction latency, error rates, and inference throughput. For sentiment analysis A/B testing, you can also log prediction results to S3 and perform offline analysis comparing model outputs on the same inputs to evaluate differences in sentiment classification accuracy, confidence scores, and disagreement rates.
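As a rough example, per-variant metrics can be pulled from the AWS/SageMaker CloudWatch namespace by filtering on the EndpointName and VariantName dimensions (the names below reuse the hypothetical ones from the previous sketch):

```python
# Sketch: average ModelLatency (reported in microseconds) for one variant over the last hour.
import datetime
import boto3

cw = boto3.client("cloudwatch")
now = datetime.datetime.now(datetime.timezone.utc)

stats = cw.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",
    Dimensions=[
        {"Name": "EndpointName", "Value": "sentiment-ab"},
        {"Name": "VariantName", "Value": "experimental-v2"},
    ],
    StartTime=now - datetime.timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Average"],
)
```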
Dynamic traffic adjustment is a key capability that makes production variants powerful for A/B testing. You can update the variant weights at any time by calling UpdateEndpointWeightsAndCapacities (or by applying a new endpoint configuration with UpdateEndpoint), which applies the new traffic distribution without any downtime. The traffic shift happens gracefully as new requests begin routing according to updated weights while in-flight requests complete. For the sentiment analysis scenario, if variant 2 (experimental model) performs well during the initial 30% testing phase, you can progressively increase its weight to 50%, then 70%, then 100% as confidence grows, implementing a safe gradual rollout. Conversely, if variant 2 shows problems like higher error rates, worse accuracy, or increased latency, you can immediately reduce its weight to 0% or delete the variant entirely, effectively rolling back to 100% variant 1 traffic without any service interruption.
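A minimal sketch of such a weight shift, reusing the hypothetical endpoint and variant names from above:

```python
# Sketch: shift to a 50/50 split with no downtime and no new endpoint configuration.
import boto3

sm = boto3.client("sagemaker")

sm.update_endpoint_weights_and_capacities(
    EndpointName="sentiment-ab",
    DesiredWeightsAndCapacities=[
        {"VariantName": "stable-v1", "DesiredWeight": 50},
        {"VariantName": "experimental-v2", "DesiredWeight": 50},
    ],
)
```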
A is incorrect because deploying two separate endpoints and using Application Load Balancer (ALB) with weighted target groups adds unnecessary infrastructure complexity, operational overhead, and additional cost; you must manage two independent SageMaker endpoints, configure and maintain an ALB layer in front of them, implement health checks and target group management, and modify client applications to route traffic through the ALB rather than directly to SageMaker; this approach also introduces additional latency from the ALB hop and loses the integrated monitoring advantages of production variants where SageMaker automatically tracks metrics per variant.
B is incorrect because Multi-Model Endpoints are designed for hosting many different models on shared instances to improve resource utilization and reduce costs when you have numerous models with sporadic traffic; MME loads models on-demand based on which model the client explicitly requests in the API call, but does not provide automatic probabilistic traffic distribution for A/B testing; to use MME for A/B testing, you would need to implement custom routing logic in the application that randomly selects between model versions with appropriate probabilities and explicitly requests each model by name; this requires application changes, custom probability logic, and loses the centralized traffic management and monitoring that production variants provide.
D is incorrect because implementing traffic routing in application code requires significant development effort, introduces complexity and potential bugs into the application layer, makes it difficult to adjust traffic percentages without code changes and redeployment, requires custom implementation of metrics collection and comparison logic, and places A/B testing concerns in the application rather than the infrastructure layer where they belong; this approach is the most complex and error-prone option, requiring substantial ongoing maintenance; production variants move A/B testing logic to the SageMaker infrastructure where it can be managed declaratively through endpoint configuration.