AP PREP

Extracting Information from Data

The ability to process data depends on the capabilities of the users and their tools. Data sets pose challenges regardless of size, such as:

  • the need to clean data
  • incomplete data
  • invalid data
  • the need to combine data sources
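Combining sources usually depends on a key that every source shares. The following is a minimal sketch, assuming two hypothetical sources (grade point averages and absences) keyed by a made-up student ID; the field names, IDs, and values are illustrative only and are not from these notes.

```python
# Hypothetical records from two separate sources, keyed by a made-up student ID.
gpa_records = [
    {"id": 101, "name": "Alex Kim", "gpa": 3.6},
    {"id": 102, "name": "Alex Kim", "gpa": 2.9},  # same name, different student
    {"id": 103, "name": "Sam Lee", "gpa": 3.1},
]
absence_records = [
    {"id": 101, "absences": 2},
    {"id": 102, "absences": 11},
    # No absence record for 103, so the combined data is incomplete.
]


def combine_by_id(gpas, absences):
    """Join the two sources on 'id'; a missing absence count becomes None."""
    absences_by_id = {row["id"]: row["absences"] for row in absences}
    return [{**row, "absences": absences_by_id.get(row["id"])} for row in gpas]


combined = combine_by_id(gpa_records, absence_records)
for record in combined:
    print(record)
```

Joining on the name instead of the ID would blend the two students named Alex Kim into one record, which is why a unique identifier matters when sources are combined.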

Collecting Data

  • Issues to consider:
    • Source
      • Do you need more sources?
    • Tools to analyze the data

Processing Data

  • Processing is affected by the size of the data set
    • Can one computer handle the task?
      • If not, parallel processing may be needed
      • Parallel processing uses two or more processors to handle different parts of the task
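The idea of splitting a task across processors can be sketched with Python's standard `concurrent.futures` module. This is only an illustration under assumptions: the helper names (`split_into_chunks`, `chunk_total`, `parallel_total`) and the placeholder data are hypothetical, and summing numbers stands in for whatever processing a real data set would need.

```python
from concurrent.futures import ProcessPoolExecutor


def split_into_chunks(values, parts):
    """Split a list into `parts` consecutive chunks of nearly equal size."""
    size, extra = divmod(len(values), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def chunk_total(chunk):
    """The work done by one processor: total one part of the data."""
    return sum(chunk)


def parallel_total(values, workers=2):
    """Total `values` by handing each worker process one chunk."""
    chunks = split_into_chunks(values, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partial_totals = list(pool.map(chunk_total, chunks))
    # Combine the partial results into the final answer.
    return sum(partial_totals)


if __name__ == "__main__":
    readings = list(range(1, 1001))  # placeholder data, not from the notes
    print(parallel_total(readings, workers=2))
```

Each worker process handles a different part of the list, and the partial totals are combined at the end. For a task this small, the cost of starting processes likely outweighs the benefit; the pattern pays off only when each part is expensive to process.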

Potential Bias

  • Intentional:
    • Who collected the data?
    • Do they have an agenda?
  • Unintentional:
    • How is the data collected?
    • Who collected the data?

Data Cleaning

  • Identifying incomplete, corrupt, duplicate, or inaccurate records
  • Replacing, modifying, or deleting the "dirty" data
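The two steps above can be sketched in plain Python. The records, field names, and the rule that a negative reading counts as inaccurate are all assumptions made for illustration; a real data set would need its own validity rules.

```python
# Hypothetical raw records; names and values are made up for illustration.
raw_records = [
    {"id": 1, "county": "Kent", "pollution": "12.5"},
    {"id": 2, "county": "Lake", "pollution": ""},      # incomplete
    {"id": 3, "county": "Polk", "pollution": "n/a"},   # corrupt
    {"id": 1, "county": "Kent", "pollution": "12.5"},  # duplicate
    {"id": 4, "county": "Dane", "pollution": "-3.0"},  # inaccurate (assumed rule)
]


def clean(records):
    """Return (clean_rows, rejected); each rejected row is paired with a reason."""
    seen_ids = set()
    clean_rows, rejected = [], []
    for row in records:
        if row["id"] in seen_ids:
            rejected.append((row, "duplicate"))
            continue
        seen_ids.add(row["id"])
        text = row["pollution"].strip()
        if not text:
            rejected.append((row, "incomplete"))
            continue
        try:
            value = float(text)
        except ValueError:
            rejected.append((row, "corrupt"))
            continue
        if value < 0:
            rejected.append((row, "inaccurate"))
            continue
        # Modify the record: store the reading as a number, not text.
        clean_rows.append({**row, "pollution": value})
    return clean_rows, rejected


clean_rows, rejected = clean(raw_records)
print(clean_rows)
for row, reason in rejected:
    print(reason, row)
```

Keeping the rejected rows with a reason, rather than silently dropping them, makes it possible to check later whether the cleaning itself introduced bias.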

College Board Quiz

  1. A researcher is analyzing data about students in a school district to determine whether there is a relationship between grade point average and number of absences. The researcher plans on compiling data from several sources to create a record for each student.

Answer: A. A unique identifier would be required in order to distinguish between two students with the same first and last names.

  2. A team of researchers wants to create a program to analyze the amount of pollution reported in roughly 3,000 counties across the United States. The program is intended to combine county data sets and then process the data. Which of the following is most likely to be a challenge in creating the program?

Answer: B. It will be a challenge to clean the data from the different counties to make the data uniform. The way pollution data is captured and organized may vary significantly from county to county.

  3. A student is creating a Web site that is intended to display information about a city based on a city name that a user enters in a text field. Which of the following are likely to be challenges associated with processing city names that users might provide as input?

Answer: B and C. Different users may abbreviate city names differently. This may require the student to clean the data to make it uniform before it can be processed. Misspelled city names will not be an exact match to information stored by the Web site. This may also require the student to clean the data before it can be processed.

  4. Which of the following additional pieces of information would be most useful in determining the artist with the greatest attendance during a particular month?

Answer: A. The attendance for a particular show can be calculated by dividing the total dollar amount of all tickets sold by the average ticket price.

  5. A camera mounted on the dashboard of a car captures an image of the view from the driver’s seat every second. Each image is stored as data. Along with each image, the camera also captures and stores the car’s speed, the date and time, and the car’s GPS location as metadata. Which of the following can best be determined using only the data and none of the metadata?

Answer: D. Determining the number of bicycles the car encountered would require the use of image recognition software to examine the images collected by the camera. The images are the data collected and no metadata would be required.

  6. Which of the following questions about the students who responded to the survey can the teacher answer by analyzing the survey results?

Answer: C. Question I can be answered because the teacher can detect a correlation between responses to questions 1 and 3 on the survey. Question II can be answered because the teacher can detect a correlation between responses to questions 1 and 2 on the survey. Question III cannot be answered because the survey is anonymous and the teacher cannot compare student grades with the responses to the survey questions.
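The city-name question in this quiz describes two input problems: inconsistent abbreviations and misspellings. One possible way to handle both is sketched below using the standard library's `difflib.get_close_matches`. The city list, the abbreviation table, the function name `normalize_city`, and the 0.8 similarity cutoff are all assumptions chosen for illustration.

```python
from difflib import get_close_matches

# Hypothetical reference data; a real site would load its own list of cities.
KNOWN_CITIES = ["San Francisco", "Los Angeles", "New York", "Chicago"]
ABBREVIATIONS = {"sf": "San Francisco", "la": "Los Angeles", "nyc": "New York"}


def normalize_city(user_input):
    """Map raw user input to a known city name, or None if nothing is close."""
    # Trim the ends, collapse repeated spaces, and ignore letter case.
    text = " ".join(user_input.split()).lower()
    if text in ABBREVIATIONS:
        return ABBREVIATIONS[text]
    by_lowercase = {city.lower(): city for city in KNOWN_CITIES}
    if text in by_lowercase:
        return by_lowercase[text]
    # Fall back to fuzzy matching to catch misspellings (cutoff is an assumption).
    close = get_close_matches(text, list(by_lowercase), n=1, cutoff=0.8)
    return by_lowercase[close[0]] if close else None


for raw in ["  NYC ", "chicago", "Chicgo", "Zzzz"]:
    print(repr(raw), "->", normalize_city(raw))
```

Returning None for input that matches nothing lets the site ask the user to re-enter the name instead of guessing.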

JSON Dataset

import pandas as pd
# reads the JSON file and converts it to a Pandas DataFrame
df = pd.read_json('files/iris.json')
# print dataframe
print(df)
     sepalLength  sepalWidth  petalLength  petalWidth    species
0            5.1         3.5          1.4         0.2     setosa
1            4.9         3.0          1.4         0.2     setosa
2            4.7         3.2          1.3         0.2     setosa
3            4.6         3.1          1.5         0.2     setosa
4            5.0         3.6          1.4         0.2     setosa
..           ...         ...          ...         ...        ...
145          6.7         3.0          5.2         2.3  virginica
146          6.3         2.5          5.0         1.9  virginica
147          6.5         3.0          5.2         2.0  virginica
148          6.2         3.4          5.4         2.3  virginica
149          5.9         3.0          5.1         1.8  virginica

[150 rows x 5 columns]
import pandas as pd
# Read the JSON file into a pandas DataFrame
df = pd.read_json('files/iris.json')
# Keep the five columns of interest, in display order
cols_to_print = ['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth', 'species']
df = df[cols_to_print]
# Keep only the first nine rows (positions 0 through 8)
df = df.iloc[:9]

print(df)
   sepalLength  sepalWidth  petalLength  petalWidth species
0          5.1         3.5          1.4         0.2  setosa
1          4.9         3.0          1.4         0.2  setosa
2          4.7         3.2          1.3         0.2  setosa
3          4.6         3.1          1.5         0.2  setosa
4          5.0         3.6          1.4         0.2  setosa
5          5.4         3.9          1.7         0.4  setosa
6          4.6         3.4          1.4         0.3  setosa
7          5.0         3.4          1.5         0.2  setosa
8          4.4         2.9          1.4         0.2  setosa
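# df holds only the first nine rows at this point, so each extreme
# below is the max or min of that subset, not of all 150 rows.

# Row(s) with the largest sepal length
print("--Max sepalLength-- \n",
    df[df.sepalLength == df.sepalLength.max()])

print("")

# Row(s) with the smallest sepal width
print("--Min sepalWidth-- \n",
    df[df.sepalWidth == df.sepalWidth.min()])

print("")

# Row(s) with the largest petal length
print("--Max petalLength-- \n",
    df[df.petalLength == df.petalLength.max()])

print("")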
--Max sepalLength-- 
    sepalLength  sepalWidth  petalLength  petalWidth species
5          5.4         3.9          1.7         0.4  setosa

--Min sepalWidth-- 
    sepalLength  sepalWidth  petalLength  petalWidth species
8          4.4         2.9          1.4         0.2  setosa

--Max petalLength-- 
    sepalLength  sepalWidth  petalLength  petalWidth species
5          5.4         3.9          1.7         0.4  setosa