Three types of data exist: structured, unstructured, and semi-structured.
- Structured data: until recently written by hand and kept in notebooks, this data is now stored electronically in spreadsheets and databases. It consists of spreadsheet-style tables with rows and columns, each row being a record and each column a well-defined field (name, date of birth…).
- We contribute to this structured data when, for example, we provide the information necessary to order goods online.
- Carefully structured and tabulated data is relatively easy to manage and is amenable to statistical analysis; indeed, until recently statistical analysis methods could be applied only to structured data.
- By contrast, unstructured data is not so easily categorized. It includes:
- word-processing documents
- Once the use of the World Wide Web became widespread, it transpired that many such potential sources of information remained inaccessible because they lacked the structure needed for existing analytic techniques to be applied.
- However, data that initially appears to be unstructured may not be completely without structure once its key features are identified.
- Emails, for example, contain structured metadata in the header as well as the actual unstructured message in the body, so they may be classified as semi-structured data.
- Metadata tags, which are essentially descriptive references, can be used to add some structure to unstructured data.
- Tagging an image on a website with a word makes it identifiable and subsequently easier to search for. Semi-structured data is also found on social networking websites, which use hashtags so that messages (which are unstructured data) on a particular topic can be identified.
- Dealing with unstructured data is challenging: since it cannot be stored in traditional databases or spreadsheets, special tools have to be developed to extract useful information.
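
The email and hashtag points above can be illustrated with a minimal Python sketch. It is not from the book: the sample message, the addresses, and the variable names are all invented for illustration, and it uses only the standard-library `email` and `re` modules. It shows how the header fields of an email behave like well-defined columns, while the body is free text from which hashtags can still be pulled out as tags.

```python
import re
from email import message_from_string

# Hypothetical sample message, made up for this illustration.
raw_email = """\
From: alice@example.com
To: bob@example.com
Subject: Quarterly report
Date: Mon, 06 Mar 2017 09:30:00 +0000

Hi Bob, the draft is attached. #report #q1
"""

msg = message_from_string(raw_email)

# Structured part: each header is a named field, like a column in a table.
metadata = {field: msg[field] for field in ("From", "To", "Subject", "Date")}

# Unstructured part: free text with no predefined fields.
body = msg.get_payload().strip()

# Hashtags act as metadata tags embedded in the free text.
hashtags = re.findall(r"#(\w+)", body)

print(metadata)
print(body)
print(hashtags)
```

The `metadata` dictionary could be stored as one row in a spreadsheet or database, whereas the body would need further processing before any traditional statistical method could be applied to it.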
Reference: Dawn E. Holmes, Big Data: A Very Short Introduction. Oxford University Press, 2017.