What are the general steps of data cleaning?
1. Identify inconsistencies. 2. Eliminate inconsistencies. 3. Flag or remove invalid data. 4. Flag or remove incomplete data.
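The four steps above can be sketched in plain Python. This is a minimal illustration on a hypothetical list of records; the field names and rules (title-cased cities, non-negative ages) are illustrative assumptions:

```python
# Hypothetical raw records: inconsistent casing, an invalid age, a missing value.
raw_records = [
    {"name": "Alice", "city": "boston", "age": 34},
    {"name": "Bob", "city": "Boston", "age": -5},     # invalid: negative age
    {"name": "Cara", "city": "BOSTON", "age": None},  # incomplete: no age
    {"name": "Dan", "city": "Boston", "age": 41},
]

def clean(records):
    cleaned = []
    for rec in records:
        # Steps 1-2: identify and eliminate inconsistencies (normalize casing).
        rec = dict(rec, city=rec["city"].title())
        # Step 3: flag or remove invalid data (age must be non-negative).
        if rec["age"] is not None and rec["age"] < 0:
            continue
        # Step 4: flag or remove incomplete data (drop records missing an age).
        if rec["age"] is None:
            continue
        cleaned.append(rec)
    return cleaned

print(clean(raw_records))  # only Alice and Dan survive, cities normalized
```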
What are the steps to mitigate bias in a dataset?
1. Identify potential biases. 2. Collect data from diverse sources. 3. Adjust data collection methods. 4. Re-evaluate data representation.
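Steps 1 and 4 (identifying bias and re-evaluating representation) can be made concrete by measuring how each group appears in the data. A minimal sketch; the `group` attribute and the 80/20 split are hypothetical:

```python
from collections import Counter

# Hypothetical applicant records, skewed 80/20 between two groups.
applicants = [{"group": g} for g in ["A"] * 80 + ["B"] * 20]

def representation(records, key="group"):
    """Quantify each group's share of the dataset (steps 1 and 4)."""
    counts = Counter(rec[key] for rec in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

shares = representation(applicants)
print(shares)  # group B is under-represented, flagging a potential bias
```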
How is Big Data applied in real-world scenarios?
Analyzing customer behavior in e-commerce, tracking global shipping, predicting disease outbreaks.
How is Metadata applied in real-world scenarios?
Organizing digital photo libraries, managing music collections, improving search engine results.
How is Data Cleaning applied in real-world scenarios?
Ensuring accurate customer databases, validating survey responses, standardizing medical records.
How is Data Bias awareness applied in real-world scenarios?
Creating fair algorithms for loan applications, ensuring equitable hiring processes, developing unbiased AI systems.
What is the relationship between Data and Information?
Data is raw and unprocessed, while information is refined and provides insights.
Why is scalability crucial for efficient big data processing?
It allows the system to handle increasing data loads without changing its fundamental operation, ensuring efficiency.
Why are computers essential for processing big data?
Due to their speed and accuracy in handling massive amounts of data.
How does metadata help in data management?
It helps find, organize, sort, and group data by providing additional context.
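Each of those operations (find, sort, group) can be shown with metadata alone, never touching file contents. A small sketch; the files and fields are hypothetical:

```python
# Hypothetical file metadata: each entry describes a file, not its contents.
files = [
    {"title": "Budget", "author": "Ana", "created": "2023-01-15", "tags": ["finance"]},
    {"title": "Roadmap", "author": "Ben", "created": "2022-11-02", "tags": ["planning"]},
    {"title": "Q1 Review", "author": "Ana", "created": "2023-04-01", "tags": ["finance"]},
]

# Find: locate files carrying a given tag.
finance = [f["title"] for f in files if "finance" in f["tags"]]

# Sort: order files by creation date.
by_date = sorted(files, key=lambda f: f["created"])

# Group: collect file titles by author.
by_author = {}
for f in files:
    by_author.setdefault(f["author"], []).append(f["title"])

print(finance)    # ['Budget', 'Q1 Review']
print(by_author)  # {'Ana': ['Budget', 'Q1 Review'], 'Ben': ['Roadmap']}
```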
Why is data cleaning important for accurate analysis?
It ensures data uniformity and eliminates inconsistencies, which can hinder analysis.
Why does collecting more data alone not fix bias?
Because bias is often systematic and requires specific steps to identify and correct, not just an increase in volume.
Explain the importance of recognizing 'Correlation ≠ Causation'.
Just because two things happen together doesn't mean one causes the other; recognizing this prevents incorrect conclusions.
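The classic illustration: two quantities that both track a hidden third factor correlate strongly with no causal link. The figures below are invented for illustration (both series loosely follow monthly temperature, the confounder):

```python
# Hypothetical monthly figures: both rise with temperature (a confounder),
# so they correlate strongly even though neither causes the other.
ice_cream_sales = [20, 35, 50, 70, 90, 95]
drownings       = [ 2,  4,  6,  9, 12, 13]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(ice_cream_sales, drownings)
print(round(r, 3))  # near 1.0: strong correlation, zero causation
```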
What is the role of parallel systems in big data processing?
Large-scale data processing often requires parallel systems that spread work across multiple computers, improving speed and efficiency.
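The split-process-combine pattern behind parallel processing can be sketched on one machine. In this toy example, worker threads stand in for the separate computers of a cluster:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical workload: sum a large list of numbers.
data = list(range(1_000_000))

def partial_sum(chunk):
    return sum(chunk)

# Split the data into chunks and hand each to a parallel worker
# (threads here stand in for the separate machines of a cluster).
n_workers = 4
chunk_size = len(data) // n_workers
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    partials = list(pool.map(partial_sum, chunks))

total = sum(partials)  # combine the partial results
print(total == sum(data))  # True
```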
How do server farms support intense processing needs?
They house many computers that work together to handle large-scale data processing.
What are some examples of metadata?
Title, author, date created, file size, and tags.