Math 105, Topics in Mathematics

Lesson 1: Data and Statistical Studies

Introduction

We define statistics as the study of a large population on the basis of a small data sample. We make inferences about the population based on the sample data. What is data? We are mainly interested in numerical data. For us, data are numbers that describe a numerical characteristic of a certain number of members of the population. We may talk about data on height, weight, number of typos, and so on. In this course we talk about data in the context of statistics.

1.1 What Is a Statistical Population?

In statistics we try to understand or make inferences or projections about a large collection of similar objects. Such a collection of individuals or objects under study is called a population.

Example 1.1. The following are examples of populations.
  1. If we are studying the income distribution of Americans, then the population is the whole American population.

  2. If we are studying the income distribution of the immigrant American population, then the population is the whole immigrant American population.

  3. If we are studying the growth of the fish population in Clinton Lake, then the population is the total fish population in Clinton Lake.

  4. If we are studying African elephants, then the population is the total population of African elephants.

The N-VALUE: The total number of members in the population under study is called the N-value of the population or the population size.

The N-value is often unknown because an accurate head count of all the members of the population is either impossible or too expensive. In that case, the N-value must be estimated by statistical methods. For example, it is not practical to count every salmon in a river, so the N-value of the salmon population in the river has to be estimated.

1.2 Statistical Studies: Census, Surveys, Public Opinion Polls, Clinical Studies

Census

Article I, Section 2 of the Constitution of the United States mandates that a national census be conducted every ten years. By census we mean an official enumeration of the population. Many other countries also conduct a census at regular intervals, commonly every ten years. Allow me to make a few comments about the census:

  1. Originally the intent of the census was to count heads for taxes and representation. That is why it can also become a political issue, as it did during the 2000 census.
  2. The census is a major source of data about the population, and the United Nations assumes a role in the worldwide census.
  3. A census often fails to count all members of the population. It is believed that a complete count is not really possible.
  4. For the 2000 census, the Census Bureau planned to supplement the head count with statistical sampling techniques. Congress and the administration fought over this plan, and it was challenged in the courts.

Surveys

A more realistic and economical alternative to a census is to collect data only from a small subgroup and then use this data to make inferences about the whole population. This approach is called a survey, and the subgroup of the population from which the data is collected is called a sample.

The basic idea behind a survey is that if we can find a "representative" sample of the whole population (that is, a sample that is not biased), then what we need to know about the population can be estimated from that sample.

Public Opinion Polls

We all know about public opinion polls, such as the Gallup poll and the Harris poll. Predictions made by opinion polls about election outcomes have sometimes gone wrong. Well-known examples are the presidential elections of 1936 (Franklin Roosevelt vs. Alfred Landon) and 1948 (Harry Truman vs. Thomas Dewey). In those years, sampling methods were not sophisticated, and the samples the pollsters drew failed to represent the whole population. For example, samples were drawn from phone books at a time when not everyone had a telephone. In 2000, pollsters recognized and acknowledged that the presidential election was too close to call. But on election evening, the news media could not resist the temptation and made a prediction that turned out to be wrong.

Clinical Studies

When a vaccine or a new drug is tested, the statistical methods used are interesting. I will not try to analyze these thoroughly but the main points of the process are:

  1. We pick two samples to be called the control group and the treatment group. The two samples need not have the same size.
  2. The treatment group receives the treatment, and the control group does not receive the treatment.
  3. Neither group knows who is receiving the treatment and who is not.
  4. Finally, the two groups are compared. If the treatment group does better than the control group, this is taken as evidence that the treatment is working.

1.3 Sampling Methods

Random Sampling

Developing a "representative sample" is a real challenge for a statistician. If a statistician picks the sample members by hand, human bias is almost bound to result in a "biased sample." Whatever method we use to select a sample, the selection of the sample members must be random. That means that mathematics and methods of chance, not personal judgment, must guide the selection of sample members. A sample picked in such a manner is called a random sample, and the method is called random sampling.

Another important concern regarding sampling is its cost. If resources are limited, it may become necessary to keep the sample size small.

We discuss two methods of random sampling here.

  1. In the method of simple random sampling, each member of the population has an equal chance of being selected for the sample.
  2. The other method of sampling is called stratified sampling: First, divide the population into categories, called strata, and randomly select a sample of these strata. Then divide each chosen stratum into categories, called substrata, and randomly select a sample of substrata from each chosen stratum. The process is repeated a number of times.
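The two methods above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 1000 labelled members, the ten "region" strata, and the function names are all hypothetical, and the second function stops after one level of strata and then draws members directly from each chosen stratum.

```python
import random

def simple_random_sample(population, size, seed=None):
    """Draw `size` distinct members; every member is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, size)

def stratified_sample(strata, n_strata, n_per_stratum, seed=None):
    """Randomly choose some strata, then randomly choose members within each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(strata), n_strata)  # random selection of strata names
    return {name: rng.sample(strata[name], n_per_stratum) for name in chosen}

# Hypothetical population: 1000 members labelled 0..999, split into 10 strata.
population = list(range(1000))
strata = {f"region_{i}": population[i * 100:(i + 1) * 100] for i in range(10)}

print(simple_random_sample(population, 5, seed=1))
print(stratified_sample(strata, n_strata=3, n_per_stratum=4, seed=1))
```

In both functions the choice is left entirely to the random number generator, so no human judgment enters the selection.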

Sample Size

The sample size required for statistical studies need not be very large, even when the population is large. In practice, it is often less than 1500. News media polls, for instance, typically sample between 700 and 1200 people.

Sampling: Terminology and Key Concepts

The job of a statistician is to make inferences about a large population on the basis of a (small) sample.

  1. Any numerical value computed from the sample data is called a statistic.
  2. Any numerical value computed from the whole population data is called a parameter.
  3. Unless the population is small, the actual value of a parameter is usually not known. However, because the samples are small, we can always compute the actual values of the statistics. The goal is to estimate the parameters by appropriate statistics.

Example. Suppose we want to understand the income distribution of the U.S. population, and we want to know the average income of the U.S. population.

  1. Here the average U.S. income is a parameter.
  2. Because it is almost impossible to compute the actual value of the (parameter) average U.S. income, we take a sample (say of size 1500) and compute the average income of the sample members. This sample average is a statistic.
  3. It is reasonable to use this (statistic) sample average income as an estimate for the (parameter) average U.S. income.
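The distinction between a parameter and a statistic can be seen in a small simulation. The sketch below assumes a made-up population of 100,000 incomes (generated at random, not real data), so that the parameter can actually be computed and compared with the statistic from a sample of size 1500.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population of 100,000 incomes (randomly generated, not real data).
population = [rng.lognormvariate(10.5, 0.8) for _ in range(100_000)]

# Parameter: computed from the whole population (rarely possible in practice).
parameter = statistics.mean(population)

# Statistic: computed from a random sample of 1500 members.
sample = rng.sample(population, 1500)
statistic = statistics.mean(sample)

# The gap between the two is the error made by using the statistic as an estimate.
sampling_error = statistic - parameter

print(f"population average (parameter): {parameter:.0f}")
print(f"sample average (statistic):     {statistic:.0f}")
print(f"difference:                     {sampling_error:.0f}")
```

In a real study only the second number is available; the simulation shows both only because the population here is artificial.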

Sampling Error

A statistic used to estimate a parameter is only an estimate. We do not expect that the value of the statistic will be exactly equal to the value of the parameter. In the above example, we would not expect the sample average income to be exactly equal to the average U.S. income. The difference between the actual value of the parameter and the computed/observed value of the statistic used to estimate the parameter is called the sampling error.

There are two types of sampling errors:

  1. Chance error: Because a sample is not the whole population, a statistic is not expected to be the exact value of the parameter. Even when two samples are drawn properly and in exactly the same way, they will generally produce two different values of the statistic. In other words, different samples give different estimates of the same parameter. The error in estimation that arises this way is called the chance error. It arises from sampling variability, because the choice of sample is left to chance. Statisticians are comfortable with chance error for various reasons.
    1. First, the very nature of statistical methods makes this error unavoidable.
    2. Second, this error can be reduced by increasing the sample size.
    3. Finally, the statistician can tell us how often this error exceeds the tolerable limit.

  2. Sampling bias: The error that arises from poor sampling is called sampling bias. Although many sophisticated methods of sampling are available, implementation is not easy. In principle, sampling bias can be avoided by strictly and properly implementing the sampling methods. The price, of course, is a higher cost of sampling.
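The claim that chance error shrinks as the sample size grows can be checked with a simulation. The following is a minimal sketch, assuming a hypothetical population of 20,000 values drawn uniformly between 0 and 100; the function name and the choice of 200 repeated samples are illustrative only.

```python
import random
import statistics

def typical_chance_error(population, sample_size, trials, rng):
    """Average absolute gap between the sample mean and the population mean."""
    true_mean = statistics.mean(population)
    errors = []
    for _ in range(trials):
        sample = rng.sample(population, sample_size)  # a fresh simple random sample
        errors.append(abs(statistics.mean(sample) - true_mean))
    return statistics.mean(errors)

rng = random.Random(0)
# Hypothetical population: 20,000 values between 0 and 100.
population = [rng.uniform(0, 100) for _ in range(20_000)]

for n in (25, 100, 400, 1600):
    err = typical_chance_error(population, n, trials=200, rng=rng)
    print(f"sample size {n:5d}: typical chance error {err:.2f}")
```

Every sample here is drawn at random, so there is no sampling bias; the differences that remain are chance error alone.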

1.4 The Capture-recapture Method: Estimating N-value

Suppose we want to estimate the number of fish in Clinton Lake. Let N be the number of fish in the lake. Using the capture-recapture method, we do the following.

Step 1. (The capture) Capture a sample of m fish, tag them, and release them back into the water.

Step 2. (The recapture) After everything has settled down, capture a new sample of n fish and count how many of them are already tagged. Suppose that k of them are tagged. If the second sample is representative, the fraction of tagged fish in it (k/n) should be close to the fraction of tagged fish in the whole lake (m/N), so it is reasonable to assume that m/N = k/n, approximately.

Solving for N, we obtain an estimate of N given by

N ≈ mn/k.
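The estimate is easy to compute by hand, but a short function makes the roles of m, n, and k explicit. This is a minimal sketch; the function name and the sample numbers (100 tagged, 80 recaptured, 20 of them tagged) are made up for illustration.

```python
def capture_recapture_estimate(m, n, k):
    """Estimate the population size N from the approximation m/N = k/n.

    m: number of members captured and tagged in the first sample
    n: size of the second (recapture) sample
    k: number of tagged members found in the second sample
    """
    if k <= 0:
        raise ValueError("k must be positive: no tagged members were recaptured")
    if k > m or k > n:
        raise ValueError("k cannot exceed m or n")
    return m * n / k

# Hypothetical numbers: tag 100, recapture 80, of which 20 carry a tag.
print(capture_recapture_estimate(100, 80, 20))  # 400.0
```

The check on k matters in practice: if no tagged members turn up in the second sample, the formula gives no estimate at all.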

Problems on 1.4: Estimating N-value

Exercise 1.4.1. As part of a project we made two trips to a local lake. The first day we caught m=325 fish and tagged them. On the second day we caught n=525 fish, and of those k=125 were tagged fish. Estimate the total number of fish in the lake.

Exercise 1.4.2. Last year you tagged m = 526 birds migrating through Lawrence. This year you captured n = 517 birds migrating through Lawrence, and of those k = 113 had been tagged last year. Estimate the total number of birds migrating through Lawrence every year.

Exercise 1.4.3. You want to estimate the number of homeless people in New York. One night you identify 376 homeless people in New York. After some time, on another night, you identify 497 homeless people. Of these 497, you find that 119 were identified the first time as well. Estimate the number of homeless people in New York.

Exercise 1.4.4. To estimate the number of tigers in Sunderban you capture 194 tigers and tag them. After some time you capture 212 tigers, and of those 87 were tagged. Estimate the number of tigers in Sunderban.


Homework

After you have completed all the exercises for this lesson, you should begin your homework assignment. Click on the homework link at the top of this lesson. (The homework link has been disabled for this preview.) To access your homework you will be asked for your user ID and password. For your user ID enter your seven-digit ID assigned by KU; do not use your old six-digit KUID. If you do not have an ID assigned by KU, enter your Student Code, which was provided in the welcome letter you received after you enrolled. Your password is your last name, with only the first letter capitalized.
