Analyzing NCU Reports
Learn how to open, navigate, and understand Nsight Compute performance reports in Wafer.
Opening a Report
1. Select NCU Profiler: Open the Wafer panel and select NCU Profiler from the tool dropdown.
2. Click Select File: Click the Select .ncu-rep file button in the tool panel. You’ll be able to choose between server-side and local analysis.
3. Choose Your Report: Navigate to and select an `.ncu-rep` file. The report will load and display in the panel.
Analysis Modes
Wafer offers two ways to analyze NCU reports:
Server-Side Analysis
When you select a report, Wafer uploads it to our B200 server for parsing. This is the default mode and works even if you don’t have NCU installed locally. How it works:
- Your `.ncu-rep` file is uploaded to Wafer’s analysis server
- The server parses the report using NCU on a B200 GPU
- Parsed metrics, kernel data, and recommendations are sent back to your editor
- No local NCU installation required
- Works on any machine (Mac, Windows, Linux without NVIDIA GPU)
- Consistent parsing with latest NCU version
- Upload progress shows during file transfer
- “Analyzing on B200 server…” appears during parsing (can take 2-5 minutes for large reports)
Local Analysis
If you have NCU installed locally, you can parse reports on your machine. Click Analyze Locally to use your local NCU installation. How it works:
- Wafer detects your local NCU installation
- Runs `ncu` commands to parse the report on your machine
- Results display immediately without network transfer
- No upload required; keeps data local
- Faster for small reports
- Works offline
Requirements:
- NVIDIA Nsight Compute installed with the `ncu` CLI on PATH
If NCU isn’t detected on your system, the local analysis option will show instructions for installing NCU.
Understanding the Interface
Kernel Summary
When you open a report, you’ll see a table listing all profiled kernels with key metrics:

| Column | Description |
|---|---|
| Kernel Name | The CUDA kernel function name |
| Duration | Execution time in microseconds (µs) |
| Memory % | Memory throughput as percentage of peak |
| Compute % | Compute throughput as percentage of peak |
| Occupancy | Achieved occupancy percentage |
| Registers | Registers used per thread |
| Block Size | Threads per block |
| Grid Size | Total number of blocks |
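
The Block Size and Grid Size columns simply echo the launch configuration the kernel was invoked with. A minimal sketch of where those numbers come from (the kernel and function names here are illustrative, not taken from any report):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel used only to illustrate the launch configuration
// that NCU reports as Block Size and Grid Size.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) data[i] *= factor;
}

void launchScale(float* d_data, float factor, int n) {
    int blockSize = 256;                              // reported as "Block Size" (threads per block)
    int gridSize  = (n + blockSize - 1) / blockSize;  // reported as "Grid Size" (number of blocks)
    scale<<<gridSize, blockSize>>>(d_data, factor, n);
}
```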
Sorting and Selection
- Click a column header to sort by that metric
- Click a kernel row to select it and view its diagnostics
- Right-click a kernel for additional options (copy to clipboard, save)
Performance Diagnostics
The diagnostics panel shows optimization recommendations for the selected kernel:
- Bottleneck identification — What’s limiting performance (compute, memory, latency)
- Actionable recommendations — Specific suggestions for improvement
- Metric context — Explanation of why certain metrics matter
Diagnostics are generated by analyzing the NCU report data. Expand the panel to see full recommendations.
Source View (PTX/SASS/CUDA)
The Source view is one of the most powerful features for understanding kernel performance. Click the Source tab to view assembly code for the selected kernel.
Available Views

| View | Description |
|---|---|
| CUDA | Your original source code with line-level metrics |
| PTX | Parallel Thread Execution assembly (NVIDIA’s intermediate representation) |
| SASS | GPU assembly (final machine code that runs on the GPU) |
| CUDA + SASS | Side-by-side view correlating source lines to assembly |
Understanding the Source View
The source view shows your code alongside performance metrics:
- Line-level timing — See which source lines contribute most to execution time
- Instruction counts — Understand how many instructions each line generates
- Hot spots — Lines with high execution counts are highlighted
- Assembly correlation — See exactly what PTX/SASS your source generates
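As an illustration of what the assembly correlation shows, here is a minimal kernel (not from any report) with comments noting the instructions its inner line typically lowers to; the exact PTX/SASS depends on the GPU architecture and compiler version:

```cuda
// Illustrative kernel for reading the CUDA + SASS view.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // PTX:  roughly two ld.global.f32, one add.f32, one st.global.f32
        // SASS: roughly two LDG.E loads, one FADD, one STG.E store
        c[i] = a[i] + b[i];
    }
}
```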
Using Source View for Optimization
- Find hot lines — Look for source lines with high timing or instruction counts
- Check instruction mix — Memory instructions are often more expensive than compute
- Identify register spills — Look for local memory access in SASS (indicates register pressure)
- Verify vectorization — Check if vector loads/stores are being used where expected
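For the last two checks, a hedged sketch of what to look for: a float4-based copy like the one below usually compiles to 128-bit loads and stores, while local-memory traffic in the same view signals spilling. The kernel is illustrative only.

```cuda
// Illustrative vectorized copy. In the SASS view, the accesses below usually
// appear as LDG.E.128 / STG.E.128 (128-bit loads/stores). If you instead see
// LDL/STL (local loads/stores), the kernel is spilling registers to local memory.
__global__ void copyVec4(const float4* __restrict__ in,
                         float4* __restrict__ out, int n4) {  // n4 = element count in float4 units
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        out[i] = in[i];  // typically one 128-bit load + one 128-bit store per thread
    }
}
```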
Source view requires that the NCU report was captured with source information. Use `--set full` when profiling (for example, `ncu --set full -o myreport ./my_app`, with the output name and application adjusted to your setup) to ensure source data is included.
Key Metrics Explained
Duration
The total execution time of the kernel. Lower is better, but raw duration alone doesn’t tell you if the kernel is efficient.
Memory Throughput %
How effectively the kernel uses memory bandwidth compared to the GPU’s theoretical peak. A high percentage means you’re memory-bound; optimize memory access patterns.
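In practice, optimizing memory access patterns usually means making the threads of a warp touch adjacent addresses. A simplified, illustrative comparison (not tied to any specific report):

```cuda
// Coalesced: consecutive threads read consecutive floats, so each warp's
// request is served by a few wide memory transactions.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read addresses `stride` elements apart, which
// scatters each warp's request across many transactions and wastes bandwidth.
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}
```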
Compute Throughput %
How effectively the kernel uses compute resources. A high percentage means you’re compute-bound; optimize arithmetic operations or increase occupancy.
Achieved Occupancy
The percentage of maximum possible warps that were active on average. Low occupancy may indicate:
- Too many registers per thread
- Too much shared memory per block
- Small grid size
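To cross-check the report, the CUDA occupancy API can print the theoretical ceiling implied by the kernel’s register, shared-memory, and block-size choices; achieved occupancy in NCU can only be at or below this bound. A sketch (the kernel here is a stand-in for whatever you profiled):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel representing the one selected in the report.
__global__ void myKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Report the theoretical occupancy upper bound for a given block size.
void printTheoreticalOccupancy(int blockSize) {
    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, myKernel, blockSize,
                                                  /*dynamicSMemSize=*/0);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float occupancy = (float)(maxBlocksPerSM * blockSize) / prop.maxThreadsPerMultiProcessor;
    printf("Theoretical occupancy at block size %d: %.0f%%\n", blockSize, occupancy * 100.0f);
}
```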
Registers per Thread
The number of registers used by each thread. High register usage can limit occupancy. Consider:
- Using `__launch_bounds__` to hint register limits
- Moving data to shared memory
- Simplifying computations
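Picking up the first suggestion above: `__launch_bounds__` tells the compiler the maximum block size (and, optionally, a minimum number of resident blocks per SM) so it can budget registers accordingly. The numbers below are examples, not recommendations; too tight a bound can introduce register spills, so re-profile after changing them.

```cuda
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor): the compiler
// caps register usage so that 4 blocks of 256 threads can be resident per SM.
__global__ void __launch_bounds__(256, 4)
boundedKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}
```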
Tips for Analysis
Start with the slowest kernels
Sort by duration to identify the kernels that take the most time. These are your optimization targets.
Look at throughput balance
If memory throughput is high but compute is low, you’re memory-bound. If compute is high but memory is low, you’re compute-bound. Balance both for optimal performance.
Check occupancy limiters
Low occupancy often indicates register pressure or shared memory limits. Check the registers-per-thread metric.
Use diagnostics for guidance
The diagnostics panel provides specific recommendations based on the metrics. Use these as starting points for optimization.
Exporting Results
You can export kernel data for use in other tools:
- Right-click a kernel and select Copy as CSV to copy metrics to clipboard
- Use the export button to save all kernel data
Exported data is useful for:
- Tracking performance over time
- Sharing results in issues or PRs
- Importing into spreadsheets for analysis