Amazon Exam MLS-C01 Topic 3 Question 104 Discussion

Actual exam question for Amazon's MLS-C01 exam
Question #: 104
Topic #: 3

A finance company needs to forecast the price of a commodity. The company has compiled a dataset of historical daily prices. A data scientist must train various forecasting models on 80% of the dataset and must validate the efficacy of those models on the remaining 20% of the dataset.

How should the data scientist split the dataset into a training dataset and a validation dataset to compare model performance?

Suggested Answer: A

Explanation: The best way to split the dataset is to pick a date so that 80% of the data points precede it, assign that group as the training dataset, and assign the remaining 20% (the data points on or after the date) as the validation dataset. This method preserves the temporal order of the data and ensures that the validation dataset reflects the most recent trends and patterns in the commodity price. That matters for forecasting models, which rely on time series analysis and sequential data: in production the model only ever sees the past and must predict the future, so validation should be set up the same way. The other methods ignore the temporal structure of the data. Random sampling, for example, lets a model train on prices that come after the ones it is validated on, which makes its validation score look better than its real forecasting performance.
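
As a rough illustration of the date-based split described above, here is a minimal Python sketch. The function name `chronological_split`, the `(date, price)` record layout, and the sample prices are all hypothetical and only meant to show the idea, not any particular library's API.

```python
from datetime import date, timedelta


def chronological_split(records, train_fraction=0.8):
    """Split (date, price) records by time.

    The earliest `train_fraction` of the records become the training set;
    everything after that point becomes the validation set.
    """
    # Sort by date so the split respects temporal order.
    ordered = sorted(records, key=lambda record: record[0])
    cutoff_index = int(len(ordered) * train_fraction)
    train = ordered[:cutoff_index]
    validation = ordered[cutoff_index:]
    return train, validation


# Hypothetical data: 10 consecutive daily prices.
start = date(2023, 1, 1)
prices = [(start + timedelta(days=i), 100.0 + i) for i in range(10)]

train, validation = chronological_split(prices)

# Every training date precedes every validation date.
print(len(train), len(validation))
print(train[-1][0] < validation[0][0])
```

Because the records are sorted before slicing, no validation point is older than any training point, which is the property a random split would not guarantee.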


Contribute your Thoughts:

Johnna
2 days ago
Wait, are we sure the answer isn't B? Because if it's not, I'm going to be kicking myself for the rest of the day. Option B all the way!
upvoted 0 times
...
Daniel
3 days ago
Option D might sound tempting, but that would just be a random mess. We need to split the data in a way that mimics the real-world scenario the model will be used in.
upvoted 0 times
...
Catarina
10 days ago
Haha, I'm just picturing the data scientist flipping a coin to decide which data points go where. But in all seriousness, Option B is the clear winner here.
upvoted 0 times
...
Lyndia
26 days ago
I think randomly sampling data points for the training dataset is also a valid approach. As long as it's done without replacement, it should provide a good representation of the dataset.
upvoted 0 times
...
James
28 days ago
I agree with Kimberely. It makes sense to split the dataset based on a specific date to ensure a fair comparison of model performance.
upvoted 0 times
...
Kimberely
30 days ago
I think the data scientist should pick a date so that 80% of the data points precede the date and assign them as the training dataset.
upvoted 0 times
...
Destiny
30 days ago
I agree with Stefany. Option B is the way to go. Forecasting models need to be trained on historical data and then tested on future data to see how well they perform.
upvoted 0 times
...
Stefany
1 month ago
Option B makes the most sense. We want the training data to come first in time, so the model can learn from the past and then be validated on the future data.
upvoted 0 times
...