Amazon MLS-C01 Exam - Topic 3 Question 104 Discussion

Actual exam question for Amazon's MLS-C01 exam

Question #: 104
Topic #: 3

A finance company needs to forecast the price of a commodity. The company has compiled a dataset of historical daily prices. A data scientist must train various forecasting models on 80% of the dataset and must validate the efficacy of those models on the remaining 20% of the dataset.

How should the data scientist split the dataset into a training dataset and a validation dataset to compare model performance?

A. Pick a date so that 80% of the data points precede the date. Assign that group of data points as the training dataset. Assign all the remaining data points to the validation dataset.

B. Pick a date so that 80% of the data points occur after the date. Assign that group of data points as the training dataset. Assign all the remaining data points to the validation dataset.

C. Starting from the earliest date in the dataset, pick eight data points for the training dataset and two data points for the validation dataset. Repeat this stratified sampling until no data points remain.

D. Sample data points randomly without replacement so that 80% of the data points are in the training dataset. Assign all the remaining data points to the validation dataset.

Suggested Answer: A

Comprehensive Explanation: The best way to split the dataset is to pick a date so that 80% of the data points precede it, and to assign that earlier group as the training dataset. This preserves the temporal order of the data, so every model is trained on the past and validated on the most recent 20% of prices, which is how a forecast is used in practice. The other options break that order. Option B trains on later data and validates on earlier data. Options C and D interleave training and validation points in time, so the training dataset contains prices dated after some validation points; that leaks future information and makes the validation scores look better than they would be on genuinely unseen data.
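A minimal sketch of the date-based split described in option A, using only the Python standard library. The function name `chronological_split`, the `(date, price)` row layout, and the sample prices are assumptions made for illustration; they do not come from the question or from any AWS service.

```python
from datetime import date, timedelta


def chronological_split(rows, train_fraction=0.8):
    """Split (date, price) rows so the earliest share becomes the training set."""
    ordered = sorted(rows, key=lambda row: row[0])  # oldest observation first
    cutoff = int(len(ordered) * train_fraction)     # index of the first validation row
    return ordered[:cutoff], ordered[cutoff:]


# Hypothetical data: ten consecutive daily prices, one row per date.
start = date(2024, 1, 1)
prices = [(start + timedelta(days=i), 100.0 + i) for i in range(10)]

train, validation = chronological_split(prices)

print(len(train), len(validation))       # sizes of the two datasets
print(train[-1][0] < validation[0][0])   # last training date precedes first validation date
```

The key property is that no training row is dated after any validation row. A random or interleaved split (options C and D) cannot guarantee this.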

by Albina at Sep 15, 2024, 01:55 PM

Contribute your Thoughts:

Herminia

3 months ago

C sounds too complicated for this kind of task.

upvoted 0 times

...

Shoshana

4 months ago

A is definitely the way to go!

upvoted 0 times

...

Odette

4 months ago

Wait, can you really just pick a random date?

upvoted 0 times

...

Omega

4 months ago

I disagree, B seems more logical to me.

upvoted 0 times

...

Yvette

4 months ago

Option A makes the most sense for time series data.

upvoted 0 times

...

Shawn

4 months ago

Random sampling sounds tempting, but I recall that for time series forecasting, we shouldn't randomize the order, so I think option D is not appropriate.

upvoted 0 times

...

Berry

5 months ago

I practiced a similar question where we had to split data for model validation, and I feel like option C is too small a sample size for training.

upvoted 0 times

...

Magdalene

5 months ago

I'm not entirely sure, but I think picking a date for the split is crucial, and option B seems wrong because it suggests using later data for training.

upvoted 0 times

...

Mi

5 months ago

I remember we discussed that for time series data, it's important to maintain the chronological order, so I think option A makes the most sense.

upvoted 0 times

...

Veronica

5 months ago

Hmm, this is a tricky one. I'm leaning towards option B, where the training set comes after the validation set. That way, the model can learn from the future and be tested on the past, which could potentially work better for forecasting. But I'm not 100% sure, so I might need to do some research to decide.

upvoted 0 times

...

Sol

5 months ago

For time series data, I think option A is the way to go. Splitting by date ensures that the training set comes before the validation set, which mimics the real-world scenario where you'd use historical data to forecast the future. The other options don't seem as appropriate for this type of problem.

upvoted 0 times

...

Angella

5 months ago

This seems like a straightforward time series forecasting problem. I'd go with option A - split the dataset by date so that the training set comes before the validation set. That way, the model can learn from the past and be tested on future data, which is the real-world scenario.

upvoted 0 times

...

Jolene

5 months ago

I'm a bit confused on the best approach here. Should I really just split by date, or is there a more sophisticated way to do the train-test split? Option D sounds interesting, but I'm not sure if random sampling is the right way to handle time series data.

upvoted 0 times

...

Shasta

5 months ago

Okay, let's see. Aspirin sensitivity is the key here. I think I know the right answer, but I'll double-check my reasoning.

upvoted 0 times

...

Johnna

1 year ago

Wait, are we sure the answer isn't B? Because if it's not, I'm going to be kicking myself for the rest of the day. Option B all the way!

upvoted 0 times

...

Daniel

1 year ago

Option D might sound tempting, but that would just be a random mess. We need to split the data in a way that mimics the real-world scenario the model will be used in.

upvoted 0 times

Jesusa

1 year ago

C: Definitely. Option A ensures that the model is trained on past data and validated on future data, just like in real life.

upvoted 0 times

...

Denise

1 year ago

B: I agree. Option D would not provide a realistic representation of the data. We need to split it properly.

upvoted 0 times

...

Donte

1 year ago

A: Option A seems like the best choice. We need to maintain the chronological order of the data for accurate forecasting.

upvoted 0 times

...

Catarina

1 year ago

Haha, I'm just picturing the data scientist flipping a coin to decide which data points go where. But in all seriousness, Option B is the clear winner here.

upvoted 0 times

Cherrie

1 year ago

Definitely, random sampling wouldn't be as effective as choosing a date for the split.

upvoted 0 times

...

Jovita

1 year ago

Yeah, it makes sense to use a specific date to divide the data points.

upvoted 0 times

...

Erick

1 year ago

I agree, Option B is the most logical choice for splitting the dataset.

upvoted 0 times

...

Lyndia

1 year ago

I think randomly sampling data points for the training dataset is also a valid approach. As long as it's done without replacement, it should provide a good representation of the dataset.

upvoted 0 times

...

James

1 year ago

I agree with Kimberely. It makes sense to split the dataset based on a specific date to ensure a fair comparison of model performance.

upvoted 0 times

...

Kimberely

1 year ago

I think the data scientist should pick a date so that 80% of the data points precede the date and assign them as the training dataset.

upvoted 0 times

...

Destiny

1 year ago

I agree with Stefany. Option B is the way to go. Forecasting models need to be trained on historical data and then tested on future data to see how well they perform.

upvoted 0 times

...

Stefany

1 year ago

Option B makes the most sense. We want the training data to come first in time, so the model can learn from the past and then be validated on the future data.

upvoted 0 times