Big data analytics

  • With all that scattered data, here are three techniques used to analyze it. Note that artificial intelligence techniques such as machine learning automate the analysis, which would otherwise involve too much data to carry out by hand.
  • MapReduce for the Hadoop Distributed File System (HDFS): a job is divided into three steps:
    1. Map step (code written by the user): each input record is turned into key-value pairs.
    2. Shuffle step (handled by the MapReduce framework): the pairs are grouped by key.
    3. Reduce step (code written by the user): each group is aggregated to produce the result.
  • The result is sent to HDFS for storage.
  • Example: Common Crawl, which crawls the web and provides freely available copies of the aggregated datasets.
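The three steps above can be sketched on a single machine with a word count, the usual introductory MapReduce example. This is only an illustration: the function names (map_step, shuffle_step, reduce_step, map_reduce) are made up for these notes and are not Hadoop's API, and a real job would spread the work across many nodes and write the result to HDFS.

```python
from collections import defaultdict


def map_step(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_step(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_step(key, values):
    """Reduce: aggregate the values of one group into a single result."""
    return key, sum(values)


def map_reduce(documents):
    """Run the three steps in order on a list of text documents."""
    pairs = [pair for doc in documents for pair in map_step(doc)]
    groups = shuffle_step(pairs)
    return dict(reduce_step(key, values) for key, values in groups.items())


counts = map_reduce(["big data is big", "data is everywhere"])
print(counts)
```

In Hadoop only the map and reduce functions would be supplied by the user; the grouping done here by shuffle_step is performed by the framework.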
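  • Bloom filters: a probabilistic approach developed in the 1970s to check whether a term is in a list. A filter can report a term as present when it is not (a false positive), so it is not 100% accurate, but it never misses a term that was actually added.
    • Example: spam filters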
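A minimal sketch of a Bloom filter, to show where the uncertainty comes from. The class name, the bit-array size, the number of hash functions and the use of salted SHA-256 are all assumptions chosen for illustration; production filters tune these values to the expected number of items.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: a bit array plus several hash functions."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive num_hashes positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        # Adding an item sets every one of its positions to True.
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # "Possibly present" only if all positions are set; a single unset
        # bit proves the item was never added.
        return all(self.bits[pos] for pos in self._positions(item))


blocked = BloomFilter()
blocked.add("spam@example.com")
print("spam@example.com" in blocked)  # an added item is always found
```

Because different items can set the same bits, a lookup for an item that was never added may still find all of its positions set. That is the false positive, and it is the price paid for storing only bits instead of the items themselves.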
  • PageRank: an approach developed by Google to rank websites in its search engine. It does this by measuring the links between websites; Google's wider ranking is also said to draw on user behaviour, tracked with cookies.
    • Scores are calculated from the number of links pointing to a website, with each link weighted by the score of the page it comes from; they do not depend on the number of visits a page has. Note that the scores used to be publicly available but have been hidden since 2016, though many indicators suggest that part of the system is still active.
    • PageRank is named after Larry Page, one of the founders of Google.
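The link-based scoring can be sketched with the textbook iterative formulation of PageRank. This is not Google's production system: the function name, the damping factor of 0.85, the fixed number of iterations and the tiny three-page graph are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score pages from a dict mapping each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}  # start with equal scores

    for _ in range(iterations):
        # Every page keeps a small base score regardless of its links.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its score equally among the pages it links to.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * ranks[page] / n
                for other in pages:
                    new_ranks[other] += share
        ranks = new_ranks
    return ranks


# Hypothetical graph: pages a and b both link to c, and c links back to a.
scores = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(scores, key=scores.get, reverse=True))
```

Page c ends up with the highest score because two pages link to it, and page b with the lowest because nothing links to it. No visit counts enter the calculation.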

Reference: Dawn E. Holmes, Big Data: A Very Short Introduction. Oxford University Press, 2017.

