All Flashcards
What is the definition of Data?
Raw, unorganized facts that need to be processed.
What is the definition of Information?
Data that has been processed to find trends, connections, and solutions.
What is the definition of Big Data?
Extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations.
What is the definition of Metadata?
Data that provides information about other data.
What is the definition of Scalability?
A system's ability to adapt to increasing or decreasing data loads.
What is the definition of Server farm?
A cluster of computer servers acting as a single system to meet intense processing needs.
What is the definition of Data center?
A facility used to house computer systems and associated components, such as telecommunications and storage systems.
What is the definition of Data Cleaning?
The process of making data uniform by eliminating inconsistencies and removing invalid data.
What is the definition of Data Bias?
A systematic error that skews results in a particular direction, often due to flawed sampling or data collection methods.
What is the definition of Correlation?
A statistical measure that expresses the extent to which two variables are linearly related.
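As an illustration of this definition, the Pearson correlation coefficient is one common measure of linear relationship. The sketch below computes it from scratch; the function name `pearson_r` and the study-hours data are made up for the example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: hours studied and test scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 70, 72]
r = pearson_r(hours, scores)
print(round(r, 3))
```

A value near 1 or -1 indicates a strong linear relationship and a value near 0 indicates a weak one. A high value here says nothing about whether studying caused the scores.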
What is the relationship between Data and Information?
Data is raw and unprocessed, while information is refined and provides insights.
Why is scalability crucial for efficient big data processing?
It allows the system to handle increasing data loads without changing its fundamental operation, ensuring efficiency.
Why are computers essential for processing big data?
Because they can process massive amounts of data quickly and accurately.
How does metadata help in data management?
It helps find, organize, sort, and group data by providing additional context.
Why is data cleaning important for accurate analysis?
It ensures data uniformity and eliminates inconsistencies, which can hinder analysis.
Why does collecting more data alone not fix bias?
Because bias is often systematic and requires specific steps to identify and correct, not just an increase in volume.
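A small simulation can make this point concrete. The sketch below assumes a hypothetical population of two equally sized groups and a flawed sampling method that only ever reaches one of them; all names and numbers are invented for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: two equally sized groups with different typical values.
group_a = [rng.gauss(10, 1) for _ in range(50_000)]
group_b = [rng.gauss(20, 1) for _ in range(50_000)]
population = group_a + group_b
true_mean = sum(population) / len(population)

def biased_sample_mean(n):
    """Mean of n values drawn only from group A (a flawed sampling method)."""
    sample = rng.sample(group_a, n)
    return sum(sample) / n

small = biased_sample_mean(100)
large = biased_sample_mean(10_000)

print(f"true mean:            {true_mean:.2f}")
print(f"biased mean, n=100:   {small:.2f}")
print(f"biased mean, n=10000: {large:.2f}")
```

Both biased estimates stay close to group A's typical value no matter how large the sample gets. Only changing the sampling method, so that group B is also reached, moves the estimate toward the true mean.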
Explain the importance of recognizing 'Correlation ≠ Causation'.
Just because two things happen together doesn't mean one causes the other; recognizing this prevents incorrect conclusions.
What is the role of parallel systems in big data processing?
Parallel systems and multiple computers are often needed for large-scale data processing to improve speed and efficiency.
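The underlying pattern is to split the data into chunks, process the chunks independently, and combine the partial results. The sketch below shows that pattern on a single machine with a thread pool; the helper names are hypothetical, and a real big data system would more likely spread the chunks across separate processes or computers.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def partial_sum(chunk):
    """Work done independently on one chunk."""
    return sum(chunk)

def parallel_sum(items, chunk_size=1000, workers=4):
    """Sum a list by handing chunks to a pool of workers, then combining."""
    chunks = chunked(items, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    return sum(partials)

total = parallel_sum(list(range(10_000)))
print(total)
```

The result matches a plain `sum` over the same data. Only the way the work is divided changes, which is what lets such a system scale by adding workers.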
How do server farms support intense processing needs?
They house many computers that work together to handle large-scale data processing.
What are some examples of metadata?
Title, author, date created, file size, and tags.
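To show how these fields help find, sort, and group data, here is a minimal sketch using a few made-up file records. The field names and values are assumptions chosen to mirror the examples above.

```python
from collections import defaultdict

# Hypothetical metadata records; none of this describes the file contents themselves.
files = [
    {"title": "Budget", "author": "Ana", "date_created": "2023-01-15",
     "file_size": 2048, "tags": ["finance"]},
    {"title": "Trip photos", "author": "Ben", "date_created": "2022-07-04",
     "file_size": 5_000_000, "tags": ["personal", "travel"]},
    {"title": "Invoice", "author": "Ana", "date_created": "2023-03-02",
     "file_size": 1024, "tags": ["finance", "work"]},
]

# Sort: oldest first (ISO dates sort correctly as plain strings).
by_date = [f["title"] for f in sorted(files, key=lambda f: f["date_created"])]

# Group: titles per author.
by_author = defaultdict(list)
for f in files:
    by_author[f["author"]].append(f["title"])

# Find: everything tagged "finance".
finance = [f["title"] for f in files if "finance" in f["tags"]]

print(by_date)
print(dict(by_author))
print(finance)
```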
What are the general steps of data cleaning?
1. Identify inconsistencies. 2. Eliminate inconsistencies. 3. Flag or remove invalid data. 4. Flag or remove incomplete data.
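The steps above can be sketched in code. This example assumes hypothetical records with a `name` and an `age` field and an assumed valid age range of 0 to 120; real cleaning rules depend on the dataset.

```python
def clean(records):
    """Return (cleaned, flagged) where flagged holds (record, reason) pairs."""
    cleaned, flagged = [], []
    for rec in records:
        # Make names uniform: trim whitespace and normalize capitalization.
        name = (rec.get("name") or "").strip().title()
        age_raw = rec.get("age")

        # Flag incomplete data: a missing name or age.
        if not name or age_raw is None or str(age_raw).strip() == "":
            flagged.append((rec, "incomplete"))
            continue

        # Flag invalid data: an age that is not an integer or is out of range.
        try:
            age = int(str(age_raw).strip())
        except ValueError:
            flagged.append((rec, "invalid"))
            continue
        if not 0 <= age <= 120:
            flagged.append((rec, "invalid"))
            continue

        cleaned.append({"name": name, "age": age})
    return cleaned, flagged

raw = [
    {"name": "  alice ", "age": "30"},  # inconsistent spacing and case
    {"name": "BOB", "age": 41},         # inconsistent case and type
    {"name": "Cara", "age": "-5"},      # invalid: out of range
    {"name": "", "age": "22"},          # incomplete: no name
    {"name": "Dan", "age": "n/a"},      # invalid: not a number
    {"name": "Eve", "age": None},       # incomplete: no age
]
cleaned, flagged = clean(raw)
print(cleaned)
print([reason for _, reason in flagged])
```

Flagging rather than silently deleting keeps a record of what was excluded, so the exclusions themselves can be checked for bias later.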
What are the steps to mitigate bias in a dataset?
1. Identify potential biases. 2. Collect data from diverse sources. 3. Adjust data collection methods. 4. Re-evaluate data representation.