Where's the Metadata Part... The Importance of a Data ...

Where's the Metadata?

Oringinally posted on October 18, 2023 by Ted Martin in Data Governance.

Updated on January 09, 2025.

Reading time: 8 minutes

What is "Metadata"? Essentially it is data about data... It is the data that describes the characteristics, structure, categories, quality, origin, and usage of your data, helping to put it in its proper context. Metadata can help data users and data managers to understand, organize, and optimize their data assets. Metadata also helps facilitate data governance, data integration, and data analysis across different sources and platforms.

Metadata is like the part of the Data Iceberg that is underwater

Benefits of Metadata

Adding Metadata to your Data Catalog can provide many benefits, such as:

Data Discovery: Metadata can help find the relevant data they need for their tasks more quickly, and assist with determining the trustworthiness of the data. For example, Metadata can provide information about the name, description, age, type, format, location, and availability of a data set.
Data Quality: Metadata can help assess the quality and reliability of a data set. For example, Metadata can provide information about the accuracy, completeness, consistency, timeliness, and validity of a data set.
Data Security: Metadata can help data managers to protect the confidentiality and integrity of a data set. For example, Metadata can provide information about the ownership, access rights, permissions, and encryption of a data set.
Data Compliance: Metadata can help data managers to ensure the compliance and accountability of a data set. For example, Metadata can provide information about the lineage, usage, history, and audit trail of a data set.
Data Integration: Metadata can help to integrate and harmonize data from different sources and platforms. For example, Metadata can provide information about the schema, structure, semantics, and standards of a data set.

How to I get this Metadata?

Most data stores will directly or indirectly maintain Metadata about your data. Some Metadata can be queried directly (e.g. file name or column names) or accessed from an API, and some Metadata has to be derived from the data itself (e.g. number of Null values, min and max values). I will explore how to mine Metadata from several different systems, and has been split into different parts for readability:

1. Relational Database Management Systems (RDMS)
2. Dashboarding Platform (Power BI)
3. Data Lakes (Azure Data Lake Gen 2)

2. Dashboarding Platforms

The Metadata from Dashboards, like Power BI Online, can be accessed via API. Documentation on the Power BI REST APIs is available here.
Note: to query these REST APIs in the examples below, you will need to include a valid Bearer Token in the Authorization Header

Workspaces (Groups)

Example querying all Workspaces: GET https://api.powerbi.com/v1.0/myorg/groups?$filter={$filter}&$top={$top}&$skip={$skip}

Response:

{
  "@odata.context": "http://wabi-azure-location-primary-redirect.analysis.windows.net/v1.0/myorg/$metadata#groups",
  "@odata.count": 2,
  "value": [
    {
      "id": "{WorkspaceGuid1}",
      "isReadOnly": false,
      "isOnDedicatedCapacity": false,
      "type": "Workspace",
      "name": "My Reports"
    },
    {
      "id": "{WorkspaceGuid2}",
      "isReadOnly": false,
      "isOnDedicatedCapacity": false,
      "type": "Workspace",
      "name": "Sales"
    }
  ]
}

Data Sets

Example querying datasets in a Workspace: GET https://api.powerbi.com/v1.0/myorg/groups/{WorkspaceGuid}/datasets

Response:

{
  "@odata.context": "http://wabi-azure-location.analysis.windows.net/v1.0/myorg/groups/{WorkspaceGuid}/$metadata#datasets",
  "value": [
    {
      "id": "{DatasetGuid1}",
      "name": "Regional Sales Sample",
      "webUrl": "https://app.powerbi.com/groups/{WorkspaceGuid}/datasets/{DatasetGuid}",
      "addRowsAPIEnabled": false,
      "configuredBy": "admin@company.ca",
      "isRefreshable": true,
      "isEffectiveIdentityRequired": true,
      "isEffectiveIdentityRolesRequired": true,
      "isOnPremGatewayRequired": false,
      "targetStorageMode": "Abf",
      "createdDate": "2022-12-14T21:08:25.617Z",
      "createReportEmbedURL": "https://app.powerbi.com/reportEmbed?config",
      "qnaEmbedURL": "https://app.powerbi.com/qnaEmbed?config",
      "upstreamDatasets": [],
      "users": [],
      "queryScaleOutSettings": {
        "autoSyncReadOnlyReplicas": true,
        "maxReadOnlyReplicas": 0
      }
    }
  ]
}

Reports

To get the Reports in a Workspace you would use this endpoint: GET https://api.powerbi.com/v1.0/myorg/reports or GET https://api.powerbi.com/v1.0/groups/{WorkspaceGuid}/reports

{
  "@odata.context": "http://wabi-azure-location.analysis.windows.net/v1.0/myorg/groups/{WorkspaceGuid}/$metadata#reports",
  "value": [
    {
      "id": "{ReportGuid1}",
      "reportType": "PowerBIReport",
      "name": "SalesMarketingt",
      "webUrl": "https://app.powerbi.com/groups/{WorkspaceGuid}/reports/{ReportGuid1}",
      "embedUrl": "https://app.powerbi.com/reportEmbed?reportId={ReportGuid1}&groupId={WorkspaceGuid}",
      "isFromPbix": true,
      "isOwnedByMe": true,
      "datasetId": "{DatasetGuid}",
      "datasetWorkspaceId": "{WorkspaceGuid}",
      "users": [],
      "subscriptions": []
    },
    {
      "id": "{ReportGuid2}",
      "reportType": "PowerBIReport",
      "name": "CustomerDashboard",
      "webUrl": "https://app.powerbi.com/groups/{WorkspaceGuid}/reports/{ReportGuid2}",
      "embedUrl": "https://app.powerbi.com/reportEmbed?reportId={ReportGuid2}&groupId={WorkspaceGuid}",
      "isFromPbix": true,
      "isOwnedByMe": true,
      "datasetId": "{DatasetGuid}",
      "datasetWorkspaceId": "{WorkspaceGuid}",
      "users": [],
      "subscriptions": []
    }
  ]
}

Pages

Example querying pages from a specific Dashboard Report: GET https://api.powerbi.com/v1.0/myorg/reports/{ReportGuid}/pages

Response:

{
  "@odata.context": "http://wabi-azure-location.analysis.windows.net/v1.0/myorg/$metadata#pages",
  "value": [
    {
      "name": "ReportSection415ddae5451454e8571d",
      "displayName": "Regional Sales Analysis",
      "order": 0
    },
    {
      "name": "ReportSection0cbcb911a66680713a58",
      "displayName": "Geographic Analysis",
      "order": 1
    }
  ]
}

Data Sources

Example querying data sources for a specific Data Set: GET https://api.powerbi.com/v1.0/myorg/datasets/{datasetId}/datasources. This is great for linking back to your a data sources to build a lineage model!

Response:

{
  "@odata.context": "http://wabi-azure-location.analysis.windows.net/v1.0/myorg/$metadata#datasources",
  "value": [
    {
      "datasourceType": "Sql",
      "connectionDetails": {
        "server": "ServerName",
        "database": "DatabaseName"
      },
      "datasourceId": "{DatasourceGuid}",
      "gatewayId": "{GatewayGuid}"
    }
  ]
}

3. Data Lakes

More organizations are beginning to use Data Lakes. You can obtain some Metadata (such as file location, filename, date created, content type) from an Azure Data Lake (Storage Gen 2) via API . Documentation on the Azure Storage Service REST APIs are available here.

Read Structure of File System

It can be helpful to give Business Users access to the File Structure, without having to give them access to the Data Lake. Here are some helpful APIs to build a model of your File Structure

Example API call to list all Containers: GET https://{AccountName}}.dfs.core.windows.net/?resource=account Response:

{
    "filesystems": [
        {
            "DefaultEncryptionScope": "$account-encryption-key",
            "DenyEncryptionScopeOverride": "false",
            "etag": "0x8DB2B948CCC244F",
            "lastModified": "Fri, 10 Feb 2023 18:28:04 GMT",
            "name": "bronze"
        },
        {
            "DefaultEncryptionScope": "$account-encryption-key",
            "DenyEncryptionScopeOverride": "false",
            "etag": "0x8DA0DB66EB4FBBA",
            "lastModified": "Fri, 10 Feb 2023 18:28:04 GMT",
            "name": "gold"
        },
        {
            "DefaultEncryptionScope": "$account-encryption-key",
            "DenyEncryptionScopeOverride": "false",
            "etag": "0x8DA0DB867EA1751",
            "lastModified": "Fri, 10 Feb 2023 18:28:04 GMT",
            "name": "silver"
        }
    ]
}

Example API call to list all Directories and Files in a given Container: GET https://{AccountName}}.dfs.core.windows.net/{ContainerName}?resource=filesystem&directory={FilePath}&continuation={ContinuationToken}&recursive=true Response:

{
    "paths": [
        {
            "contentLength": "0",
            "creationTime": "133522433227406344",
            "etag": "0x8DC2C07FEE2A308",
            "expiryTime": "0",
            "group": "$superuser",
            "isDirectory": "true",
            "lastModified": "Mon, 12 Feb 2024 20:27:12 GMT",
            "name": "Folder1/Folder2",
            "owner": "9c1a7931-guid-46b5-8b09-3cbcd3b23c38",
            "permissions": "rwxr-x---"
        },
        {
            "contentLength": "553",
            "creationTime": "133517977461722225",
            "etag": "0x8DC29BA3FE20D8A",
            "expiryTime": "0",
            "group": "$superuser",
            "lastModified": "Fri, 09 Feb 2024 21:58:27 GMT",
            "name": "Folder1/data.csv",
            "owner": "9c1a7931-guid-46b5-8b09-3cbcd3b23c38",
            "permissions": "rw-r-----"
        }
    ]
}

Read Schema of Text files

If you are using delimited text files (e.g. CSV, JSON), it is easy to query the contents of the file, and then infer the schema. Note: If a file's Content-Type is set to text/plain, the API payload will return in plain text, where as if the Content-Type is set to something else, like application/vnd.ms-excel, the API payload will be Base64 encoded. The Response Header x-ms-properties will also include any additional custom Metadata key/value pairs, with the value being base64 encoded.

Example API call to get a File's Contents: GET https://{AccountName}.dfs.core.windows.net/{ContainerName}/{FilePath}.

Read Schema of Big Data Files

If you are using a compressed columnar file format (like Parquet, Delta, Avro), it is not as simple as an API call to get the schema, you will need to do some processing to unlock all the Metadata. There are libraries, such as PyArrow for Python, that allow you to do this.

import pyarrow.parquet as pq
import io
from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string(conn_str)
container_client = blob_service_client.get_container_client('test')
blob_client = container_client.get_blob_client('test.parquet')

with io.BytesIO() as f:
    download_stream = blob_client.download_blob(0)
    download_stream.readinto(f)
    schema = pq.read_schema(f)
    meta = pq.read_metadata(fname)
    print(schema)
    print(meta)

Add Custom Metadata to a File or Directory

You can add custom Key/Value Metadata to any File or Directory in your Data Lake! Add the x-ms-properties to your Request Header in a PATCH API Call. Multiple Key/Value pairs can be separated by a comma, and the Value must be Base64 encoded (the Key is normal plain text).

PATCH https://{AccountName}.dfs.core.windows.net/{ContainerName}/{FilePath}?action=setProperties --header "x-ms-properties: CustomValue=SGVyZSBpcyB0aGUgVmFsdWUgZm9yIDE=,CustomValue2=SGVyZSBpcyB0aGUgVmFsdWUgZm9yIDI="

About the Author:

Ted Martin

Ted currently works as the Senior Data Analyst on the Database Operations Team at CARFAX Canada. Ted is an enthusiastic, outside-the-box thinker, who masters new skills or technologies extremely quickly. Like many Data Professionals, his career path was not straight-cut. Ted started programming and building websites at the bright age of 12, and built his first database at age 16 (all self-taught). He studied at Western University (Ontario, Canada) with an Honours Degree in the Medical Sciences. To gain a greater global perspective, Ted spent 2 years traveling and studying overseas at in China. Before finding his calling working in Data, he banked a few years’ experience in Retail Sales Management and Business Development, giving him a balanced mind on both the business and technology side of solving problems.

DataFluent

Where's the Metadata?

Benefits of Metadata

How to I get this Metadata?

2. Dashboarding Platforms

Workspaces (Groups)

Data Sets

Reports

Pages

Data Sources

3. Data Lakes

Read Structure of File System

Read Schema of Text files

Read Schema of Big Data Files

Add Custom Metadata to a File or Directory

Ted Martin

Related Posts

Add a comment

Categories

Popular Tags

Archives

Where's the Metadata?

Benefits of Metadata

How to I get this Metadata?

2. Dashboarding Platforms

Workspaces (Groups)

Data Sets

Reports

Pages

Data Sources

3. Data Lakes

Read Structure of File System

Read Schema of Text files

Read Schema of Big Data Files

Add Custom Metadata to a File or Directory

Ted Martin

Related Posts