Challenges and solutions for managing unstructured data

What is unstructured data?

Unstructured data is information that is not stored according to a predefined data model or schema, such as the one imposed by a relational database management system (RDBMS) or even by a non-relational (NoSQL) database. The vast majority of data in the world is unstructured, encompassing text, rich media (video, images, and audio), sensor data from Internet of Things (IoT) devices, and more. Unstructured data can be created by humans or machines and is challenging to store or analyze using traditional data management strategies.

Why is unstructured data important?

Data is increasingly recognized as the most important asset that businesses possess. Yet few organizations have been able to reap full value from the immense volumes of unstructured data — estimated by analysts to be 80 percent of all data they generate or otherwise acquire during the course of doing business. Managing unstructured data at scale using conventional file services approaches with network attached storage (NAS) devices has proven difficult and costly because of data replication, physical limitations, and governance challenges.

With the right tools, organizations can extract tremendous value from unstructured data. For example, businesses could mine social media posts for data that reflects satisfaction with their brands. Clinicians at hospitals could share a common—and massive—repository of genomic sequences for research purposes.

But how and where to store all this unstructured data, as files or objects, has continued to challenge businesses. Traditional NAS infrastructure helps with performance, but it is costly and doesn’t scale out. Next-generation scale-out NAS is available but not yet widely deployed. Software-defined object storage is beginning to be deployed but most enterprise workloads weren’t designed to use object storage. Adoption has been slow and difficult. Enterprises need a more scalable and efficient way to manage unstructured data.

What is an example of unstructured data?

Examples of unstructured data include the following:

An invoice that comes into your finance department for processing that is of a unique (non-standard) design
A waiter’s handwritten customer orders that a restaurant chain is attempting to tally up for food inventory purposes
A photo displayed on your webpage to show what an item for sale looks like
A barcode that lets your cashier check out items for customers
An X-ray that a doctor can analyze to treat a patient
An email sent to you by a colleague
An office memo written in a word-processing document
A presentation deck containing both text and images

What are unstructured data sources?

Sources of unstructured data include the following:

Text files: Virtually every office file you’re used to handling is a source of unstructured data. This includes word-processing documents, presentations, and PDFs, or anything else that doesn’t have a predefined format.
Rich media files: Audio and video files do not fit into a structured data model, and neither do digital photographs. Each of these file types can come in its own format, making it even more difficult to analyze.
Email: Some parts of an email are considered semi-structured (the “to,” “from,” and “subject” lines, for example), but the message body is mostly unstructured text.
Social media: Social media is also a source of unstructured data, although, like email, some of it can be considered semi-structured.
IoT data: Device sensors generate an extraordinarily large volume of log files that are unstructured and difficult to analyze in conventional ways.
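
To make the email case concrete, here is a minimal Python sketch that separates the semi-structured header fields from the unstructured body using the standard library's email package. The message and the split_email helper are made up for illustration.

```python
from email import message_from_string

# A made-up message, used only for illustration.
RAW_MESSAGE = """\
From: alice@example.com
To: bob@example.com
Subject: Q3 invoice question

Hi Bob, can you confirm the total on last week's invoice?
I think the shipping charge was counted twice.
"""


def split_email(raw: str) -> tuple[dict[str, str], str]:
    """Return (semi-structured header fields, unstructured body text)."""
    msg = message_from_string(raw)
    # Header fields have known names, so they can be read like record fields.
    headers = {name: msg[name] for name in ("From", "To", "Subject")}
    # The body of a non-multipart message is free text with no fixed layout.
    body = msg.get_payload()
    return headers, body


headers, body = split_email(RAW_MESSAGE)
print(headers)
print(body)
```

The headers can be filtered or sorted like any other record, while the body still needs a person or a text-analysis technique to interpret it.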

What is unstructured data used for?

Unstructured data is used within every business function: finance (invoices), marketing (photos), IT (IoT data), sales (emails with customers), and customer service (social media).

Although it’s changing rapidly, at this point, much of the unstructured data collected and stored is processed manually, if at all. For example, email is mostly processed by a human reading it, extracting what is important (sometimes by copying and pasting into another email or into an application), and taking action based on its contents.

But with advancing AI technologies such as machine learning, machine vision, and natural language processing, more of this unstructured information can be harnessed and analyzed automatically, driving faster business insight.

What is structured vs. unstructured data?

Structured data lives in fixed, predefined fields within a file or record. It’s typically stored in a relational database management system (RDBMS) but can also be found in some NoSQL databases. Structured data can be text, dates, or numbers.

Unstructured data is not organized or stored in a predefined way. Although it most commonly consists of text, it can also include numbers, images, and audio.

How do you classify unstructured data?

Data classification is the process of analyzing data and sorting it into categories, typically based on its contents or on metadata (data about data) such as file type or date.

Classifying unstructured data by, for example, its sensitivity helps you manage it in line with your governance policies, because the classification determines where the data should be stored and who should have access to it.
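
As a rough illustration of rule-based classification, the Python sketch below assigns each file a type (from its extension) and a sensitivity level (from simple pattern matches on its text). The extension map, the patterns, and the "restricted" and "general" labels are all hypothetical; real rules would come from your own governance policies.

```python
import os
import re
from dataclasses import dataclass

# Hypothetical rules for illustration only.
TYPE_BY_EXTENSION = {
    ".docx": "document",
    ".pdf": "document",
    ".pptx": "presentation",
    ".jpg": "image",
    ".mp4": "video",
    ".log": "log",
}
SENSITIVE_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),            # shaped like a US Social Security number
    re.compile(r"\bconfidential\b", re.IGNORECASE),  # explicit marking in the text
]


@dataclass
class Classification:
    file_type: str
    sensitivity: str  # "restricted" or "general"


def classify(filename: str, text: str) -> Classification:
    """Classify a file by its extension (metadata) and its text (contents)."""
    extension = os.path.splitext(filename)[1].lower()
    file_type = TYPE_BY_EXTENSION.get(extension, "other")
    is_sensitive = any(p.search(text) for p in SENSITIVE_PATTERNS)
    return Classification(file_type, "restricted" if is_sensitive else "general")


print(classify("memo.docx", "Confidential: salary review"))
print(classify("photo.JPG", ""))
```

A storage or access policy could then key off the sensitivity field, for instance by keeping restricted files in a location with tighter access controls.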

Are files unstructured data?

Files can be either structured or unstructured data. Common examples of structured data are spreadsheets and SQL database files. Other files, like word-processing documents, presentations, and emails, are unstructured. Some files, such as invoices generated from a template that displays the exact same information in the exact same way every time, are called semi-structured because the information can be extracted from them without AI or machine-learning models. So it’s not a question of whether the data is in a file or not; the question is whether within that file the data is stored in a predefined format.
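
The semi-structured case can be sketched in a few lines of Python. The invoice layout below is hypothetical and assumed to be identical every time the template is used, which is what lets a plain regular expression pull the fields out without any AI model.

```python
import re

# Hypothetical invoice text produced by a fixed template.
INVOICE_TEXT = """\
Invoice Number: INV-1042
Invoice Date: 2024-03-15
Total Due: 1,250.00
"""

# Each line of the template has the shape "Label: value".
FIELD_PATTERN = re.compile(r"^(?P<label>[^:\n]+):\s*(?P<value>.+)$", re.MULTILINE)


def extract_fields(text: str) -> dict[str, str]:
    """Return every "Label: value" pair found in the text."""
    return {
        m["label"].strip(): m["value"].strip()
        for m in FIELD_PATTERN.finditer(text)
    }


fields = extract_fields(INVOICE_TEXT)
print(fields)
```

The same function returns an empty result for free-form text with no labels, which is the point at which techniques such as OCR or natural language processing become necessary.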

What are the characteristics of unstructured data?

Unstructured data is information that either does not have a predefined data model or is not organized in a predefined manner. That means that it:

Isn’t stored according to a data model
Doesn’t have any discernible structure
Doesn’t have a pattern to it
Can’t be stored as rows and columns

How much data is unstructured?

Analysts estimate that approximately 80% of all data is unstructured, and that share grows every year.

How is unstructured data processed?

There are several techniques that you can use to process unstructured data. Here are some of the most widely used:

Metadata analysis: This “data about data” is critical to analyzing unstructured data. For example, a blog post (unstructured text) has metadata consisting of title, author, URL, publishing date, any descriptive tags or keywords, and perhaps even a category name. No single metadata standard applies everywhere, so many businesses define their own.
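
The blog post example can be sketched in Python as a small metadata record plus a query over it. The PostMetadata fields mirror the ones listed above; the sample posts, URLs, and the find_posts helper are invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class PostMetadata:
    """Structured description of an otherwise unstructured blog post."""
    title: str
    author: str
    url: str
    published: date
    tags: list[str] = field(default_factory=list)
    category: str = ""


def find_posts(posts: list[PostMetadata], tag: str, since: date) -> list[PostMetadata]:
    """Select posts by tag and publishing date without reading their text."""
    return [p for p in posts if tag in p.tags and p.published >= since]


# Invented sample records.
POSTS = [
    PostMetadata("Backing up NAS", "A. Writer", "https://example.com/nas",
                 date(2024, 1, 10), ["nas", "backup"], "storage"),
    PostMetadata("Object storage basics", "B. Writer", "https://example.com/object",
                 date(2023, 6, 2), ["s3"], "storage"),
    PostMetadata("Archiving old backups", "A. Writer", "https://example.com/archive",
                 date(2022, 11, 5), ["backup"], "storage"),
]

recent_backup_posts = find_posts(POSTS, "backup", date(2023, 1, 1))
print([p.title for p in recent_backup_posts])
```

The query never touches the post bodies: the metadata alone is enough to narrow a large collection down to the items worth analyzing further.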

Image analysis: Images contain unstructured information that can be very valuable to extract for business, financial, medical, and scientific reasons. AI-based systems can match an unstructured image against known images with similar characteristics. For example, optical character recognition (OCR) technology converts text in image files into machine-readable characters by matching shapes in the image to the characters of a language.

Natural language processing (NLP): This is a subset of AI/ML that aids in analyzing unstructured textual data. NLP applies several techniques, such as grammatical and semantic analysis, to extract meaning from unstructured text.
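
Production NLP relies on trained language models, but the first steps can be sketched with the Python standard library alone. The example below is a deliberately crude stand-in, not a real NLP pipeline: it tokenizes a few made-up social media posts, drops common stop words, and counts the remaining terms. The stop-word list and sample posts are assumptions chosen for illustration.

```python
import re
from collections import Counter

# A tiny, hand-picked stop-word list; real systems use much larger ones.
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "it", "to", "of", "but", "so"}

# Made-up customer posts.
SAMPLE_POSTS = [
    "The delivery was late and the support was slow.",
    "Great support, but delivery is so slow!",
    "Slow delivery again.",
]


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_terms(texts: list[str], n: int = 3) -> list[tuple[str, int]]:
    """Count non-stop-word tokens across all texts and return the n most common."""
    counts = Counter(
        token
        for text in texts
        for token in tokenize(text)
        if token not in STOP_WORDS
    )
    return counts.most_common(n)


print(top_terms(SAMPLE_POSTS))
```

Even this simple count surfaces the recurring themes in the posts. Grammatical and semantic analysis go further by working out how the words relate to each other, for example whether "slow" describes the delivery or the support.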

Data visualization: When teams visualize data, they present it in graphical form, such as charts or dashboards, so that viewers can understand and analyze it at a glance.

A modern approach to managing files and objects

Cohesity’s software-defined, hyperscale platform simplifies data management by consolidating backups and unstructured data in the form of files and objects from multiple application workloads on a single platform. The platform is architected on Cohesity SpanFS, a unique globally distributed file system that supports various protocols, including NFS, SMB, and S3 object storage.

With Cohesity, your organization can protect existing NAS investments, and in fact optimize them, by using that storage only for higher-performance data while offloading infrequently accessed unstructured data to Cohesity SmartFiles. A modern approach to file and object management, SmartFiles eliminates legacy hardware forklift upgrades and costly, time-consuming manual infrastructure updates while guaranteeing that all of your unstructured data is protected wherever it resides: in the data center, in the cloud, or at the edge.

Cohesity SmartFiles also features:

Unlimited scaling in a pay-as-you-grow model
Global deduplication and compression
Global actionable search on all file and object metadata
User and file system quotas with audit logs
Small file optimization
Integration with Cohesity Marketplace apps for increased data visibility, cyber resilience, and analytics
Lower TCO for unstructured data management
