Performance Analysis of V-Ordering in Fabric Warehouse: On or Off?

0

Overview

V-Order is Microsoft’s “special sauce” to further optimise data within Parquet files and is a read optimiser feature. V-Order also helps optimise Power BI DirectLake functionality. However, V-Order comes with an overhead when writing and can be deactivated/activated on a table in a Lakehouse. Now in Fabric Warehouse we can disable V-Ordering, however this is on the whole Warehouse. We can’t choose which tables, plus it can’t be reversed so once you switch if off in a Warehouse, that’s it.

I want to test the speed and capacity unit usage when loading a table when V-Order is on and off. The process for this test is to create 2 Warehouses, with one Warehouse V-Order deactivated and the other with V-Order activated (the default). I’ll then use a CTAS (CREATE TABLE AS SELECT) statement in each Warehouse to load 1.1billion rows from a Lakehouse table. All of these items are in the same workspace and this workspace is allocated to an F64 Fabric capacity.

Deactivating V-Order

When creating a Warehouse the default is V-Order is activated, we can deactivate and check by running the following SQL command in a Warehouse:

--deactivate V-Order
ALTER DATABASE CURRENT SET VORDER = OFF;

--check status
SELECT [name], [is_vorder_enabled] 
FROM sys.databases;

Why Deactivate V-Order

Well, that’s the big question…why keep it enabled and why deactivate it? Well, it’s for read optimisation and enables DirectLake in semantic models, therefore we want to have V-Order enabled in whatever data store is close to our business reporting. But, if read performance isn’t that important and if the overhead of writing to V-Order enabled tables is greater than the benefit, then perhaps deactivating may be in the best interest of both speed and capacity unit usage. Use cases have included transient “staging” data that is moving through a data loading process, why optimise for reading if it’s just being landed to be moved on somewhere else?


SQL Workloads

For the testing I’ll be running the following on each Warehouse. The CREATE TABLE AS SELECT (CTAS) query uses a 3 part naming convention to query the Lakehouse table in the same workspace.

--CTAS query
CREATE TABLE dbo.stgWebTelemetry
AS
SELECT 
    UserID,
    EventType,
    ProductKey,
    URL,
    Device,
    SessionViewSeconds,
    EventYear,
    EventMonth,
    EventDate
FROM LakehouseOne.dbo.raw_webtelemetry;

The dataset I’ll be using is my standard test data set of synthetic web telemetry data. If you’d like to use data for your own testing, check out my GitHub which has a SQL script to load a Warehouse with example data from Microsoft.


Duration of Writing

Let’s look at how long it takes to create and write a table, I’ll use a CTAS statement to create and load a table from a Lakehouse to a Warehouse. I ran the load 3 times and created an average plus minimum and maximum load times, there wasn’t a huge difference across the 3 load times,

Looking at the results, it shows that when V-Order is activated, the load is on average 50% slower to run than when its deactivated.


Duration of Reading

However, on the flip side it’s slower to read the table, with the average being twice as long to read. I ran 3 different SQL GROUP BY queries (see References in this blog), each query added more columns into the SELECT. I haven’t included a minimum graph as all queries generated a 0 minimum due to caching.


Capacity Units (CU) Usage when Writing

Time isn’t the only metric I’m interested in, I’d like to know how many capacity units are consumed when running the load query. Ideally we’d like all our workloads to consume as few of those precious capacity units as possible because that means we can run workloads at a lower compute (or more workloads). In order to get the capacity units used per query:

  • In each Warehouse: SELECT distributed_statement_id FROM queryinsights.exec_requests_history WHERE command LIKE ‘%CREATE TABLE%’
  • Open the Fabric Capacity Metrics App, drill down into timepoint detail and expose the operationId column in the Background chart.
  • Filter the visual using the operationId column and use the value from the distributed_statement_id from the queryinsights view.
  • Note the Total CU(s) column

In the results below we can see that on average when writing to the new table when V-Order is on uses twice as many CUs than when off. Minimum and Maximum values don’t vary that much. I actually ran these tests over and over again on a couple of different dataset sizes (10GB and 50GB), both with 1.1B rows and observed similar trends.


Capacity Units (CU) Usage when Reading

As V-Ordering is a read optimisation process, we need to see the impact when selecting from the table. What we save when writing data may not actually benefit us if reading the data becomes slow and CU usage becomes high. In my dataset testing, I didn’t see a huge jump in CU usage when reading the data.


Conclusion

We can see that when deactivating V-Order, speeds and capacity units usage decrease when writing data, when reading data speeds are slower but CU usage remains roughly the same.

So should you switch it off in a Warehouse? It depends! If it’s more beneficial for a Warehouse to have V-Order deactivated because read is less important than write, then go for it. Your mileage may vary, but I’m wondering why you’d want to switch off V-Order in a Warehouse…would you use a Warehouse for staging data? Perhaps…but to my mind and in reference to all the architecture patterns out there, the Warehouse is usually for the gold/presentation layer, so read would be important.


References


SQL SELECTs

--query 1
SELECT 
    EventType,
    COUNT(*)
FROM FactWebTelemetry_NoVOrder
GROUP BY 
    EventType

--query 2
SELECT 
    EventType,
    ProductKey,
    COUNT(*)
FROM FactWebTelemetry_NoVOrder
GROUP BY 
    EventType,
    ProductKey;

--query 3
SELECT 
    EventType,
    ProductKey,
    Device,
    COUNT(*)
FROM FactWebTelemetry_NoVOrder
GROUP BY 
    EventType,
    ProductKey,
    Device;

Leave a Reply

Your email address will not be published. Required fields are marked *