
Analyzing NCU Reports

Learn how to open, navigate, and understand Nsight Compute performance reports in Wafer.

Opening a Report

1. Select NCU Profiler. Open the Wafer panel and select NCU Profiler from the tool dropdown.
2. Click Select File. Click the Select .ncu-rep file button in the tool panel; you'll be able to choose between server-side and local analysis.
3. Choose Your Report. Navigate to and select an .ncu-rep file. The report will load and display in the panel.

Analysis Modes

Wafer offers two ways to analyze NCU reports:

Server-Side Analysis

When you select a report, Wafer uploads it to our B200 server for parsing. This is the default mode and works even if you don’t have NCU installed locally. How it works:
  1. Your .ncu-rep file is uploaded to Wafer’s analysis server
  2. The server parses the report using NCU on a B200 GPU
  3. Parsed metrics, kernel data, and recommendations are sent back to your editor
Advantages:
  • No local NCU installation required
  • Works on any machine (Mac, Windows, Linux without NVIDIA GPU)
  • Consistent parsing with the latest NCU version
Progress indicators:
  • Upload progress shows during file transfer
  • “Analyzing on B200 server…” appears during parsing (can take 2-5 minutes for large reports)

Local Analysis

If you have NCU installed locally, you can parse reports on your machine. Click Analyze Locally to use your local NCU installation. How it works:
  1. Wafer detects your local NCU installation
  2. Runs ncu commands to parse the report on your machine
  3. Results display immediately without network transfer
Advantages:
  • No upload required—keeps data local
  • Faster for small reports
  • Works offline
Requirements:
  • NVIDIA Nsight Compute installed with ncu CLI on PATH
If NCU isn’t detected on your system, the local analysis option will show instructions for installing NCU.

Understanding the Interface

Kernel Summary

When you open a report, you’ll see a table listing all profiled kernels with key metrics:
| Column | Description |
| --- | --- |
| Kernel Name | The CUDA kernel function name |
| Duration | Execution time in microseconds (µs) |
| Memory % | Memory throughput as a percentage of peak |
| Compute % | Compute throughput as a percentage of peak |
| Occupancy | Achieved occupancy percentage |
| Registers | Registers used per thread |
| Block Size | Threads per block |
| Grid Size | Total number of blocks |

Sorting and Selection

  • Click a column header to sort by that metric
  • Click a kernel row to select it and view its diagnostics
  • Right-click a kernel for additional options (copy to clipboard, save)

Performance Diagnostics

The diagnostics panel shows optimization recommendations for the selected kernel:
  • Bottleneck identification — What’s limiting performance (compute, memory, latency)
  • Actionable recommendations — Specific suggestions for improvement
  • Metric context — Explanation of why certain metrics matter
Diagnostics are generated by analyzing the NCU report data. Expand the panel to see full recommendations.

Source View (PTX/SASS/CUDA)

The Source view is one of the most powerful features for understanding kernel performance. Click the Source tab to view assembly code for the selected kernel.

Available Views

| View | Description |
| --- | --- |
| CUDA | Your original source code with line-level metrics |
| PTX | Parallel Thread Execution assembly (NVIDIA's intermediate representation) |
| SASS | GPU assembly (the final machine code that runs on the GPU) |
| CUDA + SASS | Side-by-side view correlating source lines to assembly |
Use the dropdown in the Source tab header to switch between views.

Understanding the Source View

The source view shows your code alongside performance metrics:
  • Line-level timing — See which source lines contribute most to execution time
  • Instruction counts — Understand how many instructions each line generates
  • Hot spots — Lines with high execution counts are highlighted
  • Assembly correlation — See exactly what PTX/SASS your source generates
The CUDA + SASS view is especially useful for optimization. You can see exactly which source lines generate expensive instruction sequences, helping you target optimizations precisely.
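As a concrete illustration (a sketch; the kernel name is made up), a one-line fused multiply-add in CUDA typically lowers to a single FFMA instruction in SASS, and the CUDA + SASS view lets you confirm that correlation:

```cuda
// Minimal sketch: a kernel whose hot line is a fused multiply-add.
// In the CUDA + SASS view, the y[i] = ... line typically correlates
// with a single FFMA (fused multiply-add) SASS instruction.
__global__ void saxpy(int n, float a, const float* __restrict__ x,
                      float* __restrict__ y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];  // hot line: check its SASS here
    }
}
```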

Using Source View for Optimization

  1. Find hot lines — Look for source lines with high timing or instruction counts
  2. Check instruction mix — Memory instructions are often more expensive than compute
  3. Identify register spills — Look for local memory access in SASS (indicates register pressure)
  4. Verify vectorization — Check if vector loads/stores are being used where expected
Source view requires that the NCU report was captured with source information. Profile with --set full (for example, ncu --set full -o report ./my_app, where ./my_app is your application) to ensure source data is included.
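Step 4 above can be sketched in CUDA: loading through float4 encourages the compiler to emit a single 128-bit load per thread, which shows up in the SASS view as wide load instructions (such as LDG.E.128) rather than four 32-bit loads. Names here are illustrative:

```cuda
// Sketch: vectorized 128-bit loads/stores via float4. Pointers must be
// 16-byte aligned, and n4 is the element count divided by 4.
__global__ void scale4(int n4, float a, const float4* __restrict__ in,
                       float4* __restrict__ out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = in[i];               // one wide load instead of four
        v.x *= a; v.y *= a; v.z *= a; v.w *= a;
        out[i] = v;                     // one wide store
    }
}
```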

Key Metrics Explained

Duration

The total execution time of the kernel. Lower is better, but raw duration alone doesn’t tell you if the kernel is efficient.

Memory Throughput %

How effectively the kernel uses memory bandwidth compared to the GPU's theoretical peak. A high percentage (especially alongside low compute throughput) means you're memory-bound; focus on optimizing memory access patterns.
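A common memory-access fix is replacing strided accesses with coalesced ones, so each warp's loads combine into a few wide transactions. A sketch (kernel names are illustrative):

```cuda
// Strided access: adjacent threads touch addresses `stride` floats
// apart, so each warp issues many separate memory transactions.
__global__ void copy_strided(int n, int stride,
                             const float* in, float* out) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}

// Coalesced access: adjacent threads touch adjacent addresses, so a
// warp's 32 loads combine into a small number of wide transactions.
__global__ void copy_coalesced(int n, const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}
```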

Compute Throughput %

How effectively the kernel uses compute resources compared to peak. A high percentage (alongside low memory throughput) means you're compute-bound; optimize arithmetic operations or increase occupancy.

Achieved Occupancy

The percentage of maximum possible warps that were active on average. Low occupancy may indicate:
  • Too many registers per thread
  • Too much shared memory per block
  • Small grid size
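You can cross-check NCU's occupancy numbers against the CUDA occupancy API. A host-side sketch (my_kernel is a placeholder):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float* data) { /* placeholder kernel */ }

int main() {
    int max_blocks_per_sm = 0;
    int block_size = 256;
    // Theoretical active blocks per SM for this kernel and block size;
    // compare against NCU's "Achieved Occupancy" for the same launch.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks_per_sm, my_kernel, block_size, /*dynamicSmem=*/0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int active_warps = max_blocks_per_sm * block_size / 32;
    int max_warps = prop.maxThreadsPerMultiProcessor / 32;
    printf("Theoretical occupancy: %.0f%%\n",
           100.0 * active_warps / max_warps);
    return 0;
}
```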

Registers per Thread

The number of registers used by each thread. High register usage can limit occupancy. Consider:
  • Using __launch_bounds__ to hint register limits
  • Moving data to shared memory
  • Simplifying computations
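A minimal sketch of the __launch_bounds__ hint (the values are illustrative; tune them against the Registers column in the kernel table):

```cuda
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
// asks the compiler to cap register usage so that at least two blocks
// of 256 threads can be resident per SM, trading registers (and
// possibly introducing spills) for higher occupancy.
__global__ void __launch_bounds__(256, 2)
scale(int n, float a, const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * in[i];
}
```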

Tips for Analysis

  • Sort by duration to identify the kernels that take the most time. These are your optimization targets.
  • If memory throughput is high but compute is low, you're memory-bound. If compute is high but memory is low, you're compute-bound. Balance both for optimal performance.
  • Low occupancy often indicates register pressure or shared memory limits. Check the registers-per-thread metric.
  • The diagnostics panel provides specific recommendations based on the metrics. Use these as starting points for optimization.

Exporting Results

You can export kernel data for use in other tools:
  • Right-click a kernel and select Copy as CSV to copy metrics to clipboard
  • Use the export button to save all kernel data
This is useful for:
  • Tracking performance over time
  • Sharing results in issues or PRs
  • Importing into spreadsheets for analysis

Next Steps