
We’re going to explore a dataset which contains reviews of wines from various magazines (from Zackthoutt on Kaggle.com). We’re going to look at the data and formulate some hypotheses. Provide the code you use along the way.

(a) Using pandas, import the file winemag-data.csv to a DataFrame.
(b) Use len (which gives the number of observations/rows in the table) to find the total number of reviews in the dataset.
(c) Subset the data to get two new tables, french_wine and italian_wine, containing only French and Italian wines respectively (we saw how to do this in the previous lecture!).
(d) For each review, the reviewers have awarded points between 1 and 100 (although they’ve only bothered to include wines with a score above 80 in the magazine). We’re going to assume that the points column is normally distributed. Construct a null hypothesis and an alternative hypothesis using this score.
(e) Use scipy.stats.ttest_ind(X, Y) to do a two-sample t-test between the two groups at the 1% significance level (so p-values below this are significant). What is the p-value? Are the means statistically different?
(f) You might be surprised by the previous answer. Print out the means of the scores of each group. Why might you be surprised?
(g) This result is due to the fact that the hypothesis test doesn’t care about the size of any differences, only whether or not one exists. We can use Cohen’s d to see how large the effect size is: $d = \frac{\mu_1 - \mu_2}{\sigma}$. Taking σ to be the standard deviation of all of the wines (use pandas.DataFrame.describe() to obtain this - we’ll assume all the wines have the same standard deviation), compute Cohen’s d for these two groups.
(h) A Cohen’s d below 0.2 is usually called a small effect size; about 0.5 is medium, and 0.8 or more is a large effect size. How big is the effect size here?


Sure, let's go through the steps you've outlined using Python and pandas. I'll provide the code for each step.

(a) Import the file winemag-data.csv to a DataFrame

import pandas as pd

# Load the dataset
df = pd.read_csv('winemag-data.csv')

(b) Find the total number of reviews in the dataset

# Get the number of rows (observations)
total_reviews = len(df)
print(f"Total number of reviews: {total_reviews}")

(c) Subset the data to get two new tables, french_wine and italian_wine

# Subset for French wines
french_wine = df[df['country'] == 'France']

# Subset for Italian wines
italian_wine = df[df['country'] == 'Italy']
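
As a quick check of how this kind of boolean-mask subsetting behaves, here is a minimal sketch on a toy table. The rows and the names toy, toy_french and toy_italian are made up for illustration and are not taken from winemag-data.csv.

```python
import pandas as pd

# Toy stand-in for the wine table; these rows are invented for illustration.
toy = pd.DataFrame({
    "country": ["France", "Italy", "France", "US", None],
    "points": [90, 88, 85, 92, 87],
})

# A boolean mask keeps only the rows where the comparison is True.
# A row with a missing country compares as False, so it is dropped.
toy_french = toy[toy["country"] == "France"]
toy_italian = toy[toy["country"] == "Italy"]

print(len(toy_french), len(toy_italian))
```

Running len on each subset of the real data in the same way shows how many reviews each group contributes, which matters later when reading the p-value.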

(d) Construct a null hypothesis and an alternative hypothesis

The null hypothesis (H0) is that there is no difference in the mean scores of French and Italian wines. The alternative hypothesis (H1) is that there is a difference in the mean scores.

H0 = "The mean score of French wines is equal to the mean score of Italian wines."
H1 = "The mean score of French wines is not equal to the mean score of Italian wines."
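
In symbols, the same two-sided hypotheses could be written as follows. The labels \mu_F and \mu_I for the population mean scores of French and Italian wines are notation chosen here, not taken from the question.

```latex
H_0:\ \mu_F = \mu_I
\qquad\text{versus}\qquad
H_1:\ \mu_F \neq \mu_I
```

At the 1% significance level, H0 is rejected when the p-value from the test is below 0.01.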

(e) Perform a two-sample t-test

from scipy.stats import ttest_ind

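# Perform the t-test.
# equal_var=False selects Welch's t-test, which does not assume equal variances.
# Drop that argument to run the default Student's t-test, i.e. ttest_ind(X, Y) as in the question.
t_stat, p_value = ttest_ind(french_wine['points'], italian_wine['points'], equal_var=False)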

print(f"T-statistic: {t_stat}, P-value: {p_value}")

# Check if the means are statistically different
if p_value < 0.01:
    print("The means are statistically different.")
else:
    print("The means are not statistically different.")

(f) Print out the means of the scores of each group

mean_french = french_wine['points'].mean()
mean_italian = italian_wine['points'].mean()

print(f"Mean score of French wines: {mean_french}")
print(f"Mean score of Italian wines: {mean_italian}")
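
One reason the test result can look surprising next to the printed means: with very large groups, even a small gap between means can give a tiny p-value. The sketch below illustrates this with synthetic numbers only. The group sizes, means and standard deviation are assumptions picked for the demonstration, not values from the wine data.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Synthetic scores, NOT the wine data: two large groups whose true means
# differ by only 0.3 points, with a common standard deviation of 3.
group_a = rng.normal(loc=88.8, scale=3.0, size=20_000)
group_b = rng.normal(loc=88.5, scale=3.0, size=20_000)

_, demo_p = ttest_ind(group_a, group_b)

# Effect size relative to the spread of all scores combined.
demo_sigma = np.concatenate([group_a, group_b]).std(ddof=1)
demo_d = (group_a.mean() - group_b.mean()) / demo_sigma

print(f"difference in means: {group_a.mean() - group_b.mean():.3f}")
print(f"p-value: {demo_p:.3g}")
print(f"Cohen's d: {demo_d:.3f}")
```

In this synthetic setup the difference is statistically significant at the 1% level even though the effect size is small, which is the situation parts (g) and (h) of the question are pointing at.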

(g) Compute Cohen’s d

# Calculate the standard deviation of all wines
std_dev_all = df['points'].std()

# Calculate Cohen's d
d = (mean_french - mean_italian) / std_dev_all
print(f"Cohen's d: {d}")
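
The question asks for σ to be the standard deviation of all wines, which is what the code above uses. A commonly used alternative divides by the pooled standard deviation of just the two groups. The helper below is a sketch of that variant; the name cohens_d_pooled is hypothetical and the input numbers are made up.

```python
import numpy as np

def cohens_d_pooled(x, y):
    """Cohen's d using the pooled sample standard deviation of x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n_x, n_y = len(x), len(y)
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    pooled_var = ((n_x - 1) * x.var(ddof=1) + (n_y - 1) * y.var(ddof=1)) / (n_x + n_y - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

# Tiny made-up example: means 3 and 4, both sample variances 2.5.
print(cohens_d_pooled([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))
```

On the wine data it could be called as cohens_d_pooled(french_wine['points'], italian_wine['points']). If the two groups really do share the spread of the full dataset, both versions should give similar values.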

(h) Interpret the effect size

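# Interpret the effect size using its magnitude (the sign only shows direction)
abs_d = abs(d)
if abs_d < 0.2:
    print("Small effect size.")
elif abs_d < 0.5:
    print("Small-to-medium effect size.")
elif abs_d < 0.8:
    print("Medium effect size.")
else:
    print("Large effect size.")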

This code will help you explore the dataset, perform the necessary statistical tests, and interpret the results. Remember to adjust the file path if the dataset is not in the current working directory.
