Extracting Information From Data
AP PREP
The Extraction from Data
The ability to process data depends on the capabilities of the users and their tools. Data sets pose challenges regardless of size, such as:
- the need to clean data
- incomplete data
- invalid data
- the need to combine data sources
Collecting Data
- Issues to consider:
- Source
- Do you need more sources?
- Tools to analyze Data
Processing Data
- Processing is affected by the size of the data set
- Can one computer handle the task?
- May need to use parallel processing
- Use two or more processors to handle different parts of the task
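One way to picture parallel processing is to split a large list into chunks and let separate processes handle each chunk. The sketch below assumes a simple summing task; the function names (`split_into_chunks`, `partial_sum`, `parallel_sum`) and the data are made up for illustration and use only Python's standard `concurrent.futures` module.

```python
from concurrent.futures import ProcessPoolExecutor


def split_into_chunks(values, parts):
    """Split values into at most `parts` consecutive chunks of similar size."""
    size = max(1, -(-len(values) // parts))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_sum(chunk):
    """Add up one chunk; each worker process runs this on its own piece."""
    return sum(chunk)


def parallel_sum(values, workers=2):
    """Sum the chunks on separate processes, then combine the partial results."""
    chunks = split_into_chunks(values, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))


if __name__ == "__main__":
    data = list(range(1_000_000))  # made-up data for illustration
    print(parallel_sum(data, workers=2) == sum(data))
```

For a task this small, one processor is usually faster because starting processes has a cost; splitting the work tends to pay off only when each chunk takes real time to process.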
Potential Bias
- Intentional:
- Who collected the data?
- Do they have an agenda?
- Unintentional:
- How is the data collected?
Data Cleaning
- Identifying incomplete, corrupt, duplicate, or inaccurate records
- Replacing, modifying, or deleting the "dirty" data
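The two cleaning steps above can be sketched with pandas. The records below are made up, and the 0.0 to 4.0 GPA scale is an assumption chosen only to show what an "inaccurate" value might look like.

```python
import pandas as pd

# Made-up records for illustration only; not from any real data set.
raw = pd.DataFrame({
    "student_id": [101, 102, 102, 103, 104],
    "gpa": [3.5, 2.8, 2.8, None, 7.2],  # None is incomplete; 7.2 is off the assumed 0.0-4.0 scale
    "absences": [2, 5, 5, 1, 3],
})

# Step 1: identify the "dirty" records.
incomplete = raw[raw["gpa"].isna()]
duplicates = raw[raw.duplicated()]
out_of_range = raw[(raw["gpa"] < 0.0) | (raw["gpa"] > 4.0)]

# Step 2: delete duplicates and incomplete rows, then keep only in-range GPAs.
clean = raw.drop_duplicates().dropna(subset=["gpa"])
clean = clean[clean["gpa"].between(0.0, 4.0)].reset_index(drop=True)

print(clean)
```

Deleting is only one option. Depending on the question being asked, a missing value might instead be replaced (for example, with `fillna`) or corrected from another source.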
College Board Quiz
- A researcher is analyzing data about students in a school district to determine whether there is a relationship between grade point average and number of absences. The researcher plans on compiling data from several sources to create a record for each student.
Answer: A. A unique identifier would be required in order to distinguish between two students with the same first and last names.
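A small pandas sketch shows why the unique identifier matters when records from several sources are combined. The tables, the column names, and the students are all made up for illustration.

```python
import pandas as pd

# Made-up records; two different students share the name "Alex Kim".
grades = pd.DataFrame({
    "student_id": [1, 2, 3],
    "name": ["Alex Kim", "Alex Kim", "Sam Lee"],
    "gpa": [3.9, 2.1, 3.0],
})
attendance = pd.DataFrame({
    "student_id": [1, 2, 3],
    "name": ["Alex Kim", "Alex Kim", "Sam Lee"],
    "absences": [0, 12, 4],
})

# Joining on name pairs each "Alex Kim" with both attendance rows.
by_name = grades.merge(attendance, on="name")

# Joining on the unique identifier gives exactly one record per student.
by_id = grades.merge(attendance[["student_id", "absences"]], on="student_id")

print(len(by_name), "rows when joined on name")
print(len(by_id), "rows when joined on student_id")
```

The join on name produces extra, incorrect rows that mix one student's GPA with another student's absences, which would distort any relationship the researcher is trying to measure.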
- A team of researchers wants to create a program to analyze the amount of pollution reported in roughly 3,000 counties across the United States. The program is intended to combine county data sets and then process the data. Which of the following is most likely to be a challenge in creating the program?
Answer: B. It will be a challenge to clean the data from the different counties to make the data uniform. The way pollution data is captured and organized may vary significantly from county to county.
- A student is creating a Web site that is intended to display information about a city based on a city name that a user enters in a text field. Which of the following are likely to be challenges associated with processing city names that users might provide as input?
Answer: B and C. Different users may abbreviate city names differently, and misspelled city names will not exactly match the information stored by the Web site. Either case may require the student to clean the data to make it uniform before it can be processed.
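As a rough sketch of what that cleaning could look like, the function below normalizes spacing and capitalization, expands a few abbreviations, and then falls back to approximate matching for misspellings. The city list, the abbreviation table, the function name `clean_city_name`, and the 0.8 similarity cutoff are all assumptions made for this example; `difflib.get_close_matches` is part of Python's standard library.

```python
import difflib

# Hypothetical lookup tables; a real site would hold many more entries.
KNOWN_CITIES = ["Los Angeles", "New York", "San Francisco", "Saint Louis"]
ABBREVIATIONS = {
    "la": "Los Angeles",
    "nyc": "New York",
    "sf": "San Francisco",
    "st. louis": "Saint Louis",
}


def clean_city_name(user_input):
    """Return the stored city name that best matches the input, or None."""
    # Trim and collapse whitespace, then ignore capitalization.
    text = " ".join(user_input.split()).lower()

    # Handle known abbreviations first.
    if text in ABBREVIATIONS:
        return ABBREVIATIONS[text]

    # Fall back to approximate matching to catch misspellings.
    lowered = {city.lower(): city for city in KNOWN_CITIES}
    matches = difflib.get_close_matches(text, list(lowered), n=1, cutoff=0.8)
    return lowered[matches[0]] if matches else None


print(clean_city_name("  NYC "))
print(clean_city_name("San Fransisco"))
```

Approximate matching can guess wrong, so a stricter cutoff or a "did you mean?" prompt may be safer than silently correcting the input.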
- Which of the following additional pieces of information would be most useful in determining the artist with the greatest attendance during a particular month?
Answer: A. The attendance for a particular show can be calculated by dividing the total dollar amount of all tickets sold by the average ticket price.
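The division described in the answer can be worked through in a few lines. The artists, dollar amounts, and prices below are made up, and `estimated_attendance` is a hypothetical helper name.

```python
def estimated_attendance(total_ticket_sales, average_ticket_price):
    """Attendance = total dollar amount of tickets sold / average ticket price."""
    if average_ticket_price <= 0:
        raise ValueError("average ticket price must be positive")
    return total_ticket_sales / average_ticket_price


# Made-up shows for one month: (artist, total ticket sales, average ticket price)
shows = [
    ("Artist A", 45_000.00, 30.00),
    ("Artist B", 40_000.00, 25.00),
    ("Artist A", 15_000.00, 30.00),
]

# Add up the estimated attendance of every show for each artist.
totals = {}
for artist, sales, price in shows:
    totals[artist] = totals.get(artist, 0) + estimated_attendance(sales, price)

top_artist = max(totals, key=totals.get)
print(totals)
print(top_artist)
```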
- A camera mounted on the dashboard of a car captures an image of the view from the driver’s seat every second. Each image is stored as data. Along with each image, the camera also captures and stores the car’s speed, the date and time, and the car’s GPS location as metadata. Which of the following can best be determined using only the data and none of the metadata?
Answer: D. Determining the number of bicycles the car encountered would require the use of image recognition software to examine the images collected by the camera. The images are the data collected and no metadata would be required.
- Which of the following questions about the students who responded to the survey can the teacher answer by analyzing the survey results?
Answer: C. Question I can be answered because the teacher can detect a correlation between responses to questions 1 and 3 on the survey. Question II can be answered because the teacher can detect a correlation between responses to questions 1 and 2 on the survey. Question III cannot be answered because the survey is anonymous and the teacher cannot compare student grades with the responses to the survey questions.
import pandas as pd
# reads the JSON file and converts it to a Pandas DataFrame
df = pd.read_json('files/iris.json')
# print dataframe
print(df)
import pandas as pd

# Read the JSON file and convert it to a pandas DataFrame
df = pd.read_json('files/iris.json')

# Keep only the columns of interest
cols_to_print = ['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth', 'species']
df = df[cols_to_print]

# Keep only the first nine rows (positions 0 through 8)
rows_to_print = list(range(9))
df = df.iloc[rows_to_print]

print(df)

# For each measurement, show the row(s) holding the extreme value
# within the nine rows selected above
print("--Max sepalLength-- \n",
      df[df.sepalLength == df.sepalLength.max()])
print("")
print("--Min sepalWidth-- \n",
      df[df.sepalWidth == df.sepalWidth.min()])
print("")
print("--Max petalLength-- \n",
      df[df.petalLength == df.petalLength.max()])
print("")