Adding Metadata to your Data Catalog can provide many benefits, such as:
Most data stores will directly or indirectly maintain Metadata about your data. Some Metadata can be queried directly (e.g. file name or column names) or accessed from an API, and some Metadata has to be derived from the data itself (e.g. number of Null values, min and max values). I will explore how to mine Metadata from several different systems, and has been split into different parts for readability:
The Metadata from Dashboards, like Power BI Online, can be accessed via API. Documentation on the Power BI REST APIs is available here.
Note: to query these REST APIs in the examples below, you will need to include a valid Bearer Token in the Authorization
Header
Example querying all Workspaces: GET https://api.powerbi.com/v1.0/myorg/groups?$filter={$filter}&$top={$top}&$skip={$skip}
Response:
{
"@odata.context": "http://wabi-azure-location-primary-redirect.analysis.windows.net/v1.0/myorg/$metadata#groups",
"@odata.count": 2,
"value": [
{
"id": "{WorkspaceGuid1}",
"isReadOnly": false,
"isOnDedicatedCapacity": false,
"type": "Workspace",
"name": "My Reports"
},
{
"id": "{WorkspaceGuid2}",
"isReadOnly": false,
"isOnDedicatedCapacity": false,
"type": "Workspace",
"name": "Sales"
}
]
}
Example querying datasets in a Workspace: GET https://api.powerbi.com/v1.0/myorg/groups/{WorkspaceGuid}/datasets
Response:
{
"@odata.context": "http://wabi-azure-location.analysis.windows.net/v1.0/myorg/groups/{WorkspaceGuid}/$metadata#datasets",
"value": [
{
"id": "{DatasetGuid1}",
"name": "Regional Sales Sample",
"webUrl": "https://app.powerbi.com/groups/{WorkspaceGuid}/datasets/{DatasetGuid}",
"addRowsAPIEnabled": false,
"configuredBy": "admin@company.ca",
"isRefreshable": true,
"isEffectiveIdentityRequired": true,
"isEffectiveIdentityRolesRequired": true,
"isOnPremGatewayRequired": false,
"targetStorageMode": "Abf",
"createdDate": "2022-12-14T21:08:25.617Z",
"createReportEmbedURL": "https://app.powerbi.com/reportEmbed?config",
"qnaEmbedURL": "https://app.powerbi.com/qnaEmbed?config",
"upstreamDatasets": [],
"users": [],
"queryScaleOutSettings": {
"autoSyncReadOnlyReplicas": true,
"maxReadOnlyReplicas": 0
}
}
]
}
To get the Reports in a Workspace you would use this endpoint: GET https://api.powerbi.com/v1.0/myorg/reports
or GET https://api.powerbi.com/v1.0/groups/{WorkspaceGuid}/reports
{
"@odata.context": "http://wabi-azure-location.analysis.windows.net/v1.0/myorg/groups/{WorkspaceGuid}/$metadata#reports",
"value": [
{
"id": "{ReportGuid1}",
"reportType": "PowerBIReport",
"name": "SalesMarketingt",
"webUrl": "https://app.powerbi.com/groups/{WorkspaceGuid}/reports/{ReportGuid1}",
"embedUrl": "https://app.powerbi.com/reportEmbed?reportId={ReportGuid1}&groupId={WorkspaceGuid}",
"isFromPbix": true,
"isOwnedByMe": true,
"datasetId": "{DatasetGuid}",
"datasetWorkspaceId": "{WorkspaceGuid}",
"users": [],
"subscriptions": []
},
{
"id": "{ReportGuid2}",
"reportType": "PowerBIReport",
"name": "CustomerDashboard",
"webUrl": "https://app.powerbi.com/groups/{WorkspaceGuid}/reports/{ReportGuid2}",
"embedUrl": "https://app.powerbi.com/reportEmbed?reportId={ReportGuid2}&groupId={WorkspaceGuid}",
"isFromPbix": true,
"isOwnedByMe": true,
"datasetId": "{DatasetGuid}",
"datasetWorkspaceId": "{WorkspaceGuid}",
"users": [],
"subscriptions": []
}
]
}
Example querying pages from a specific Dashboard Report: GET https://api.powerbi.com/v1.0/myorg/reports/{ReportGuid}/pages
Response:
{
"@odata.context": "http://wabi-azure-location.analysis.windows.net/v1.0/myorg/$metadata#pages",
"value": [
{
"name": "ReportSection415ddae5451454e8571d",
"displayName": "Regional Sales Analysis",
"order": 0
},
{
"name": "ReportSection0cbcb911a66680713a58",
"displayName": "Geographic Analysis",
"order": 1
}
]
}
Example querying data sources for a specific Data Set: GET https://api.powerbi.com/v1.0/myorg/datasets/{datasetId}/datasources
. This is great for linking back to your a data sources to build a lineage model!
Response:
{
"@odata.context": "http://wabi-azure-location.analysis.windows.net/v1.0/myorg/$metadata#datasources",
"value": [
{
"datasourceType": "Sql",
"connectionDetails": {
"server": "ServerName",
"database": "DatabaseName"
},
"datasourceId": "{DatasourceGuid}",
"gatewayId": "{GatewayGuid}"
}
]
}
More organizations are beginning to use Data Lakes. You can obtain some Metadata (such as file location, filename, date created, content type) from an Azure Data Lake (Storage Gen 2) via API . Documentation on the Azure Storage Service REST APIs are available here.
It can be helpful to give Business Users access to the File Structure, without having to give them access to the Data Lake. Here are some helpful APIs to build a model of your File Structure
Example API call to list all Containers: GET https://{AccountName}}.dfs.core.windows.net/?resource=account
Response:
{
"filesystems": [
{
"DefaultEncryptionScope": "$account-encryption-key",
"DenyEncryptionScopeOverride": "false",
"etag": "0x8DB2B948CCC244F",
"lastModified": "Fri, 10 Feb 2023 18:28:04 GMT",
"name": "bronze"
},
{
"DefaultEncryptionScope": "$account-encryption-key",
"DenyEncryptionScopeOverride": "false",
"etag": "0x8DA0DB66EB4FBBA",
"lastModified": "Fri, 10 Feb 2023 18:28:04 GMT",
"name": "gold"
},
{
"DefaultEncryptionScope": "$account-encryption-key",
"DenyEncryptionScopeOverride": "false",
"etag": "0x8DA0DB867EA1751",
"lastModified": "Fri, 10 Feb 2023 18:28:04 GMT",
"name": "silver"
}
]
}
Example API call to list all Directories and Files in a given Container: GET https://{AccountName}}.dfs.core.windows.net/{ContainerName}?resource=filesystem&directory={FilePath}&continuation={ContinuationToken}&recursive=true
Response:
{
"paths": [
{
"contentLength": "0",
"creationTime": "133522433227406344",
"etag": "0x8DC2C07FEE2A308",
"expiryTime": "0",
"group": "$superuser",
"isDirectory": "true",
"lastModified": "Mon, 12 Feb 2024 20:27:12 GMT",
"name": "Folder1/Folder2",
"owner": "9c1a7931-guid-46b5-8b09-3cbcd3b23c38",
"permissions": "rwxr-x---"
},
{
"contentLength": "553",
"creationTime": "133517977461722225",
"etag": "0x8DC29BA3FE20D8A",
"expiryTime": "0",
"group": "$superuser",
"lastModified": "Fri, 09 Feb 2024 21:58:27 GMT",
"name": "Folder1/data.csv",
"owner": "9c1a7931-guid-46b5-8b09-3cbcd3b23c38",
"permissions": "rw-r-----"
}
]
}
If you are using delimited text files (e.g. CSV, JSON), it is easy to query the contents of the file, and then infer the schema. Note: If a file's Content-Type
is set to text/plain
, the API payload will return in plain text, where as if the Content-Type
is set to something else, like application/vnd.ms-excel
, the API payload will be Base64 encoded. The Response Header x-ms-properties
will also include any additional custom Metadata key/value pairs, with the value being base64 encoded.
Example API call to get a File's Contents: GET https://{AccountName}.dfs.core.windows.net/{ContainerName}/{FilePath}
.
If you are using a compressed columnar file format (like Parquet, Delta, Avro), it is not as simple as an API call to get the schema, you will need to do some processing to unlock all the Metadata. There are libraries, such as PyArrow
for Python, that allow you to do this.
import pyarrow.parquet as pq
import io
from azure.storage.blob import BlobServiceClient
blob_service_client = BlobServiceClient.from_connection_string(conn_str)
container_client = blob_service_client.get_container_client('test')
blob_client = container_client.get_blob_client('test.parquet')
with io.BytesIO() as f:
download_stream = blob_client.download_blob(0)
download_stream.readinto(f)
schema = pq.read_schema(f)
meta = pq.read_metadata(fname)
print(schema)
print(meta)
You can add custom Key/Value Metadata to any File or Directory in your Data Lake! Add the x-ms-properties
to your Request Header in a PATCH
API Call. Multiple Key/Value pairs can be separated by a comma, and the Value must be Base64 encoded (the Key is normal plain text).
PATCH https://{AccountName}.dfs.core.windows.net/{ContainerName}/{FilePath}?action=setProperties --header "x-ms-properties: CustomValue=SGVyZSBpcyB0aGUgVmFsdWUgZm9yIDE=,CustomValue2=SGVyZSBpcyB0aGUgVmFsdWUgZm9yIDI="