Amazon Exam MLS-C01 Topic 3 Question 104 Discussion

Actual exam question for Amazon's MLS-C01 exam
Question #: 104
Topic #: 3

A finance company needs to forecast the price of a commodity. The company has compiled a dataset of historical daily prices. A data scientist must train various forecasting models on 80% of the dataset and must validate the efficacy of those models on the remaining 20% of the dataset.

How should the data scientist split the dataset into a training dataset and a validation dataset to compare model performance?

Suggested Answer: A

Comprehensive Explanation: The best way to split the dataset into a training dataset and a validation dataset is to pick a date so that 80% of the data points precede the date and assign that group of data points as the training dataset. This method preserves the temporal order of the data and ensures that the validation dataset reflects the most recent trends and patterns in the commodity price. This is important for forecasting models that rely on time series analysis and sequential data. The other methods would either introduce bias or lose information by ignoring the temporal structure of the data.
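
The date-based split described above can be sketched in a few lines of plain Python. This is only an illustration, not part of the exam material: the function name `chronological_split`, the `(date, price)` record layout, and the sample prices are all assumptions made for the example, and it assumes one price per day so that the cutoff date separates the two sets cleanly.

```python
from datetime import date, timedelta


def chronological_split(records, train_fraction=0.8):
    """Split (date, price) records so the earliest share forms the training set.

    Hypothetical helper for illustration; assumes one record per date.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")

    # Sort oldest first so that slicing respects temporal order.
    ordered = sorted(records, key=lambda record: record[0])

    cutoff_index = int(len(ordered) * train_fraction)
    if cutoff_index == 0 or cutoff_index == len(ordered):
        raise ValueError("not enough records to form both datasets")

    # The cutoff date is the first date that belongs to the validation set.
    cutoff_date = ordered[cutoff_index][0]
    train = ordered[:cutoff_index]        # everything before the cutoff date
    validation = ordered[cutoff_index:]   # the cutoff date and everything after
    return train, validation, cutoff_date


# Made-up sample data: ten consecutive daily prices.
start = date(2024, 1, 1)
prices = [(start + timedelta(days=i), 100.0 + i) for i in range(10)]

train, validation, cutoff = chronological_split(prices)
```

With these ten sample days, the first eight land in the training set and the last two in the validation set, so every model being compared is validated only on prices that come after everything it was trained on.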


Contribute your Thoughts:

Johnna
23 days ago
Wait, are we sure the answer isn't B? Because if it's not, I'm going to be kicking myself for the rest of the day. Option B all the way!
upvoted 0 times
...
Daniel
25 days ago
Option D might sound tempting, but that would just be a random mess. We need to split the data in a way that mimics the real-world scenario the model will be used in.
upvoted 0 times
Jesusa
8 days ago
C: Definitely. Option A ensures that the model is trained on past data and validated on future data, just like in real life.
upvoted 0 times
...
Denise
9 days ago
B: I agree. Option D would not provide a realistic representation of the data. We need to split it properly.
upvoted 0 times
...
Donte
11 days ago
A: Option A seems like the best choice. We need to maintain the chronological order of the data for accurate forecasting.
upvoted 0 times
...
...
Catarina
1 month ago
Haha, I'm just picturing the data scientist flipping a coin to decide which data points go where. But in all seriousness, Option B is the clear winner here.
upvoted 0 times
Cherrie
1 day ago
Definitely, random sampling wouldn't be as effective as choosing a date for the split.
upvoted 0 times
...
Jovita
9 days ago
Yeah, it makes sense to use a specific date to divide the data points.
upvoted 0 times
...
Erick
16 days ago
I agree, Option B is the most logical choice for splitting the dataset.
upvoted 0 times
...
...
Lyndia
2 months ago
I think randomly sampling data points for the training dataset is also a valid approach. As long as it's done without replacement, it should provide a good representation of the dataset.
upvoted 0 times
...
James
2 months ago
I agree with Kimberely. It makes sense to split the dataset based on a specific date to ensure a fair comparison of model performance.
upvoted 0 times
...
Kimberely
2 months ago
I think the data scientist should pick a date so that 80% of the data points precede the date and assign them as the training dataset.
upvoted 0 times
...
Destiny
2 months ago
I agree with Stefany. Option B is the way to go. Forecasting models need to be trained on historical data and then tested on future data to see how well they perform.
upvoted 0 times
...
Stefany
2 months ago
Option B makes the most sense. We want the training data to come first in time, so the model can learn from the past and then be validated on the future data.
upvoted 0 times
Carissa
11 days ago
Stratified sampling could introduce bias and not represent the dataset accurately.
upvoted 0 times
...
Fannie
13 days ago
Randomly sampling data points might not capture the time sequence needed for accurate forecasting.
upvoted 0 times
...
Muriel
14 days ago
It's important for the model to learn from past data first before being validated on future data.
upvoted 0 times
...
Nydia
19 days ago
I agree, option B is the best choice for splitting the dataset.
upvoted 0 times
...
...