mac40/BDC

Big Data Computing

Big Data phenomenon

  • Technological progress

    • storage capacity
    • communication bandwidth
    • computing power
    • Reduction of ICT costs
  • Digital Universe

    • Integration of digital technologies in every human activity
    • Scientific research (produces a lot of data)
    • Exponential growth of data
  • Data can be either structured (database records) or unstructured (textual data)

Application Domains

  • The analysis of large datasets arises in:
    • Retailing: product improvement, recommendation systems
    • Banking/Finance: fraud detection...
    • Telecommunications: user profiling
    • Science: validation methods
    • Medicine: diagnosis/therapy
    • Social studies: IoT

The Four V's of Data

  1. Volume
    • size of data poses several computational challenges and requires a data-centric perspective
  2. Velocity
    • the data arrive at such a high rate that they cannot be stored and processed offline; they must be processed as a stream
  3. Variety
    • large datasets are often unstructured and may relate to very different scenarios
  4. Veracity
    • large datasets coming from real-world applications are likely to contain noisy, uncertain data
  • All points above require a paradigm shift with respect to traditional computing
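
As a rough illustration of the Velocity point, the sketch below processes items one at a time and keeps only constant-size state, instead of storing the whole dataset first. It is a minimal example and not taken from the course material: the function name running_mean and the sample readings are hypothetical, chosen only for illustration.

```python
from typing import Iterable, Iterator


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all items seen so far, one value per incoming item.

    Hypothetical helper: only a counter and the current mean are kept,
    so memory use does not grow with the length of the stream.
    """
    count = 0
    mean = 0.0
    for x in stream:
        count += 1
        # Incremental update: past items are never stored or revisited.
        mean += (x - mean) / count
        yield mean


if __name__ == "__main__":
    # A short list stands in for a potentially unbounded data source.
    readings = iter([2.0, 4.0, 6.0, 8.0])
    for m in running_mean(readings):
        print(m)
```

An offline approach would first collect all readings and then average them; the streaming version gives an up-to-date answer after every item, which is the kind of shift in perspective the list above refers to.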

Course presentation

Main objectives

  • Novel computing/programming frameworks for big data processing: theory and practice
    • Spark
  • A sample of key primitives for data analysis
    • Rigorous setting (be able to analytically predict what's going to happen)
    • Algorithmic solutions with focus on large inputs

Specific Content

  • Computational Frameworks: MapReduce, Apache Spark
  • Clustering primitives (Professor's focus)
  • Graph analysis primitives
  • Association analysis primitives (Data mining)
  • Data stream processing

Evaluation

  • Written exam (26 points)
  • Homeworks (6+1 points)
    • groups of max 3/4 students
    • 4 assignments, one every 2/3 weeks
    • Use of Apache Spark on individual PCs (assignments 1-3) and CloudVeneto (assignment 4)

Online tools