Big Data and Hypothesis Testing

We have seen that interval estimates of the population mean μ and the population proportion p narrow as the sample size increases. This occurs because the standard error of the associated sampling distribution decreases as the sample size increases. Now consider the relationship between interval estimation and hypothesis testing that we discussed earlier in this chapter. If we construct a 100(1 – α)% interval estimate for the population mean, we reject H₀: μ = μ₀ if the 100(1 – α)% interval estimate does not contain μ₀. Thus, for a given level of confidence, as the sample size increases we will reject H₀: μ = μ₀ for increasingly smaller differences between the sample mean x̄ and the hypothesized population mean μ₀. When the sample size n is very large, almost any difference between the sample mean x̄ and the hypothesized population mean μ₀ results in rejection of the null hypothesis.
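The narrowing described above can be illustrated numerically. The following is a minimal sketch, assuming a sample standard deviation of 20 (a value chosen purely for illustration), a 95% confidence level, and the large-sample normal critical value. The function name `margin_of_error` is hypothetical.

```python
import math
from statistics import NormalDist

def margin_of_error(s, n, confidence=0.95):
    """Half-width of a large-sample interval estimate of a population mean."""
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * s / math.sqrt(n)

# Illustrative sample sizes; s = 20 is an assumed standard deviation.
for n in [100, 10_000, 1_000_000]:
    print(f"n = {n:>9,}  margin of error = {margin_of_error(20, n):.4f}")
```

Each hundredfold increase in n cuts the margin of error by a factor of ten, so a two-tailed test at the matching level of significance rejects H₀: μ = μ₀ for differences one-tenth as large.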

1. Big Data, Hypothesis Testing, and p Values

In this section, we will elaborate on how big data affects hypothesis testing and the magnitude of p values. Specifically, we will examine how rapidly the p value associated with a given difference between a point estimate and a hypothesized value of a parameter decreases as the sample size increases.

Let us consider the online news service PenningtonDailyTimes.com (PDT). PDT’s primary source of revenue is the sale of advertising, and prospective advertisers are willing to pay a premium to advertise on websites that have long visit times. To promote its news service, PDT’s management wants to promise potential advertisers that the mean time spent by customers when they visit PenningtonDailyTimes.com is greater than last year, that is, more than 84 seconds. PDT therefore decides to collect a sample tracking the amount of time spent by individual customers when they visit PDT’s website in order to test its null hypothesis H₀: μ ≤ 84.

For a sample mean of 84.1 seconds and a sample standard deviation of s = 20 seconds, Table 9.6 provides the values of the test statistic t and the p values for the test of the null hypothesis H₀: μ ≤ 84. The p value for this hypothesis test is essentially 0 for all samples in Table 9.6 with at least n = 1,000,000.
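The calculation behind such a table can be sketched as follows. This is a minimal illustration, not a reproduction of Table 9.6: the sample sizes in `SAMPLE_SIZES` are chosen for illustration, the function name `upper_tail_t_test` is hypothetical, and the p value uses the standard normal distribution as an approximation to the t distribution with n – 1 degrees of freedom, which is reasonable only because the sample sizes shown are large.

```python
import math

def upper_tail_t_test(x_bar, mu_0, s, n):
    """Return (t, approximate p value) for H0: mu <= mu_0 vs. Ha: mu > mu_0."""
    standard_error = s / math.sqrt(n)
    t = (x_bar - mu_0) / standard_error
    # Assumption: for large n the t distribution is close to standard normal,
    # so the upper-tail area is approximated by P(Z >= t).
    p_value = 0.5 * math.erfc(t / math.sqrt(2))
    return t, p_value

SAMPLE_SIZES = [10_000, 100_000, 1_000_000, 10_000_000]  # illustrative choices

for n in SAMPLE_SIZES:
    t, p = upper_tail_t_test(x_bar=84.1, mu_0=84, s=20, n=n)
    print(f"n = {n:>10,}  t = {t:7.3f}  p value = {p:.3g}")
```

Under these assumptions the test statistic grows in proportion to the square root of n: it is about 1.58 at n = 100,000 and 5 at n = 1,000,000, even though the observed difference of .1 second never changes.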

PDT’s management also wants to promise potential advertisers that the proportion of its website visitors who click on an ad this year exceeds the proportion of its website visitors who clicked on an ad last year, which was .50. PDT collects information from its sample on whether the visitor to its website clicked on any of the ads featured on the website, and it wants to use these data to test its null hypothesis H₀: p ≤ .50.

For a sample proportion of .51, Table 9.7 provides the values of the test statistic z and the p values for the test of the null hypothesis H₀: p ≤ .50. The p value for this hypothesis test is essentially 0 for all samples in Table 9.7 with at least n = 100,000.
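The same kind of calculation applies to the proportion test. The sketch below is illustrative rather than a reproduction of Table 9.7: the sample sizes are chosen for illustration and the function name `upper_tail_proportion_test` is hypothetical. It uses the usual large-sample z statistic, with the standard error computed from the hypothesized proportion.

```python
import math

def upper_tail_proportion_test(p_bar, p_0, n):
    """Return (z, p value) for H0: p <= p_0 vs. Ha: p > p_0."""
    # The standard error is computed from the hypothesized proportion p_0.
    standard_error = math.sqrt(p_0 * (1 - p_0) / n)
    z = (p_bar - p_0) / standard_error
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail area P(Z >= z)
    return z, p_value

# Illustrative sample sizes.
for n in [100, 1_000, 10_000, 100_000]:
    z, p = upper_tail_proportion_test(p_bar=0.51, p_0=0.50, n=n)
    print(f"n = {n:>8,}  z = {z:6.3f}  p value = {p:.3g}")
```

With a sample proportion of .51, the statistic is z = 2 at n = 10,000 and exceeds 6 at n = 100,000.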

We see in Tables 9.6 and 9.7 that the p value associated with a given difference between a point estimate and a hypothesized value of a parameter decreases as the sample size increases. As a result, if the sample mean time spent by customers when they visit PDT’s website is 84.1 seconds, PDT’s null hypothesis H₀: μ ≤ 84 is not rejected at α = .01 for samples with n ≤ 100,000, and is rejected at α = .01 for samples with n ≥ 1,000,000.

Similarly, if the sample proportion of visitors to its website who clicked on an ad featured on the website is .51, PDT’s null hypothesis H₀: p ≤ .50 is not rejected at α = .01 for samples with n ≤ 10,000, and is rejected at α = .01 for samples with n ≥ 100,000. In both instances, as the sample size becomes extremely large, the p value associated with the given difference between a point estimate and the hypothesized value of the parameter becomes extremely small.
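To see roughly where the switch from "do not reject" to "reject" occurs, the rejection rule can be solved for n. The sketch below assumes a large-sample upper-tail z test at the .01 level of significance, and for the mean it treats the sample standard deviation as fixed at 20 regardless of n. The function name `min_n_to_reject` is hypothetical.

```python
import math
from statistics import NormalDist

def min_n_to_reject(difference, std_dev, alpha):
    """Smallest n for which an upper-tail z test rejects H0 at level alpha."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # about 2.326 for alpha = .01
    # Reject when difference / (std_dev / sqrt(n)) > z_alpha,
    # that is, when n > (z_alpha * std_dev / difference) ** 2.
    threshold = (z_alpha * std_dev / difference) ** 2
    return math.floor(threshold) + 1

# Mean: observed difference of .1 second, assumed standard deviation of 20.
n_mean = min_n_to_reject(difference=0.1, std_dev=20, alpha=0.01)

# Proportion: difference of .01, standard deviation sqrt(p0 * (1 - p0)) with p0 = .50.
n_prop = min_n_to_reject(difference=0.01, std_dev=math.sqrt(0.5 * 0.5), alpha=0.01)

print(f"mean test rejects once n reaches about {n_mean:,}")
print(f"proportion test rejects once n reaches about {n_prop:,}")
```

Under these assumptions the thresholds fall at roughly 216,000 observations for the mean and roughly 13,500 for the proportion, consistent with the ranges of n described above.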

2. Implications of Big Data in Hypothesis Testing

Suppose PDT collects a sample of 1,000,000 visitors to its website and uses these data to test its null hypotheses H₀: μ ≤ 84 and H₀: p ≤ .50 at the .05 level of significance. The sample mean is 84.1 and the sample proportion is .51, so the null hypothesis is rejected in both tests, as Tables 9.6 and 9.7 show. As a result, PDT can promise potential advertisers that the mean time spent by individual customers who visit PDT’s website exceeds 84 seconds and the proportion of individual visitors to its website who click on an ad exceeds .50. These results suggest that for each of these hypothesis tests, the difference between the point estimate and the hypothesized value of the parameter being tested is not likely solely a consequence of sampling error. However, the results of any hypothesis test, no matter the sample size, are only reliable if the sample is relatively free of nonsampling error. If nonsampling error is introduced in the data collection process, the likelihood of making a Type I or Type II error may be higher than if the sample data are free of nonsampling error. Therefore, when testing a hypothesis, it is always important to think carefully about whether a random sample of the population of interest has been taken.

If PDT determines that it has introduced little or no nonsampling error into its sample data, the only remaining plausible explanation for these results is that these null hypotheses are false. At this point, PDT and the companies that advertise on PenningtonDailyTimes.com should also consider whether these statistically significant differences between the point estimates and the hypothesized values of the parameters being tested are of practical significance. Although a .1 second increase in the mean time spent by customers when they visit PDT’s website is statistically significant, it may not be meaningful to companies that might advertise on PenningtonDailyTimes.com. Similarly, although an increase of .01 in the proportion of visitors to its website who click on an ad is statistically significant, it may not be meaningful to those companies. PDT and its advertisers must determine whether these statistically significant differences have meaningful implications for their ensuing business decisions.
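One way to put the size of these differences in view is to compute interval estimates from the same sample. The sketch below is an illustration under stated assumptions: 95% confidence, large-sample normal critical values, and n = 1,000,000 with the sample results from the PDT example. The function names `mean_interval` and `proportion_interval` are hypothetical.

```python
import math
from statistics import NormalDist

Z_95 = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value, about 1.96

def mean_interval(x_bar, s, n, z=Z_95):
    """Large-sample interval estimate of a population mean."""
    margin = z * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

def proportion_interval(p_bar, n, z=Z_95):
    """Large-sample interval estimate of a population proportion."""
    margin = z * math.sqrt(p_bar * (1 - p_bar) / n)
    return p_bar - margin, p_bar + margin

n = 1_000_000
mean_low, mean_high = mean_interval(84.1, 20, n)
prop_low, prop_high = proportion_interval(0.51, n)

print(f"mean visit time: {mean_low:.3f} to {mean_high:.3f} seconds")
print(f"click proportion: {prop_low:.4f} to {prop_high:.4f}")
```

Both intervals lie entirely above the hypothesized values, yet they also indicate that the mean visit time exceeds 84 seconds by less than .15 second and that the click proportion exceeds .50 by only about .01. Magnitudes like these are what advertisers would weigh when judging practical significance.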

Ultimately, no business decision should be based solely on statistical inference. Practical significance should always be considered in conjunction with statistical significance. This is particularly important when the hypothesis test is based on an extremely large sample because even an extremely small difference between the point estimate and the hypothesized value of the parameter being tested will be statistically significant. When done properly, statistical inference provides evidence that should be considered in combination with information collected from other sources to make the most informed decision possible.

Source: Anderson David R., Sweeney Dennis J., Williams Thomas A. (2019), Statistics for Business & Economics, Cengage Learning; 14th edition.
