Analyzing NCU Reports
Learn how to open, navigate, and understand Nsight Compute performance reports in Wafer.
Opening a Report
1. Select NCU Profiler: Open the Wafer panel and select NCU Profiler from the tool dropdown.
2. Click Select File: Click the Select .ncu-rep file button in the tool panel. You’ll be able to choose between server-side and local analysis.
3. Choose Your Report: Navigate to and select an `.ncu-rep` file. The report will load and display in the panel.
Analysis Modes
Wafer offers two ways to analyze NCU reports:
Server-Side Analysis
When you select a report, Wafer uploads it to our B200 server for parsing. This is the default mode and works even if you don’t have NCU installed locally. How it works:
- Your `.ncu-rep` file is uploaded to Wafer’s analysis server
- The server parses the report using NCU on a B200 GPU
- Parsed metrics, kernel data, and recommendations are sent back to your editor
- No local NCU installation required
- Works on any machine (Mac, Windows, Linux without NVIDIA GPU)
- Consistent parsing with latest NCU version
- Upload progress shows during file transfer
- “Analyzing on B200 server…” appears during parsing (can take 2-5 minutes for large reports)
Local Analysis
If you have NCU installed locally, you can parse reports on your machine. Click Analyze Locally to use your local NCU installation. How it works:
- Wafer detects your local NCU installation
- Runs `ncu` commands to parse the report on your machine
- Results display immediately without network transfer
- No upload required; keeps data local
- Faster for small reports
- Works offline
Requirements:
- NVIDIA Nsight Compute installed with the `ncu` CLI on PATH
If NCU isn’t detected on your system, the local analysis option will show instructions for installing NCU.
Understanding the Interface
Kernel Summary
When you open a report, you’ll see a table listing all profiled kernels with key metrics:

| Column | Description |
|---|---|
| Kernel Name | The CUDA kernel function name |
| Duration | Execution time in microseconds (µs) |
| Memory % | Memory throughput as percentage of peak |
| Compute % | Compute throughput as percentage of peak |
| Occupancy | Achieved occupancy percentage |
| Registers | Registers used per thread |
| Block Size | Threads per block |
| Grid Size | Total number of blocks |
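
The Block Size and Grid Size columns simply echo the launch configuration the kernel was invoked with. A minimal sketch of where those numbers come from (the kernel and function names here are illustrative, not taken from any report):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel used only to illustrate the launch configuration
// that NCU reports as Block Size and Grid Size.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) data[i] *= factor;
}

void launchScale(float* d_data, float factor, int n) {
    int blockSize = 256;                              // reported as "Block Size" (threads per block)
    int gridSize  = (n + blockSize - 1) / blockSize;  // reported as "Grid Size" (number of blocks)
    scale<<<gridSize, blockSize>>>(d_data, factor, n);
}
```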
Sorting and Selection
- Click a column header to sort by that metric
- Click a kernel row to select it and view its diagnostics
- Right-click a kernel for additional options (copy to clipboard, save)
Performance Diagnostics
The diagnostics panel shows optimization recommendations for the selected kernel:
- Bottleneck identification — What’s limiting performance (compute, memory, latency)
- Actionable recommendations — Specific suggestions for improvement
- Metric context — Explanation of why certain metrics matter
Diagnostics are generated by analyzing the NCU report data. Expand the panel to see full recommendations.
Source View (PTX/SASS/CUDA)
The Source view is one of the most powerful features for understanding kernel performance. Click the Source tab to view assembly code for the selected kernel.
Available Views

| View | Description |
|---|---|
| CUDA | Your original source code with line-level metrics |
| PTX | Parallel Thread Execution assembly (NVIDIA’s intermediate representation) |
| SASS | GPU assembly (final machine code that runs on the GPU) |
| CUDA + SASS | Side-by-side view correlating source lines to assembly |
Understanding the Source View
The source view shows your code alongside performance metrics:
- Line-level timing — See which source lines contribute most to execution time
- Instruction counts — Understand how many instructions each line generates
- Hot spots — Lines with high execution counts are highlighted
- Assembly correlation — See exactly what PTX/SASS your source generates
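As an illustration of what the assembly correlation shows, here is a minimal kernel (not from any report) with comments noting the instructions its inner line typically lowers to; the exact PTX/SASS depends on the GPU architecture and compiler version:

```cuda
// Illustrative kernel for reading the CUDA + SASS view.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // PTX:  roughly two ld.global.f32, one add.f32, one st.global.f32
        // SASS: roughly two LDG.E loads, one FADD, one STG.E store
        c[i] = a[i] + b[i];
    }
}
```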
Using Source View for Optimization
- Find hot lines — Look for source lines with high timing or instruction counts
- Check instruction mix — Memory instructions are often more expensive than compute
- Identify register spills — Look for local memory access in SASS (indicates register pressure)
- Verify vectorization — Check if vector loads/stores are being used where expected
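For the last two checks, a hedged sketch of what to look for: a float4-based copy like the one below usually compiles to 128-bit loads and stores, while local-memory traffic in the same view signals spilling. The kernel is illustrative only.

```cuda
// Illustrative vectorized copy. In the SASS view, the accesses below usually
// appear as LDG.E.128 / STG.E.128 (128-bit loads/stores). If you instead see
// LDL/STL (local loads/stores), the kernel is spilling registers to local memory.
__global__ void copyVec4(const float4* __restrict__ in,
                         float4* __restrict__ out, int n4) {  // n4 = element count in float4 units
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        out[i] = in[i];  // typically one 128-bit load + one 128-bit store per thread
    }
}
```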
Source view requires that the NCU report was captured with source information. Use `--set full` when profiling (for example, `ncu --set full -o myreport ./my_app`, with the output name and application adjusted to your setup) to ensure source data is included.
Key Metrics Explained
Duration
The total execution time of the kernel. Lower is better, but raw duration alone doesn’t tell you if the kernel is efficient.
Memory Throughput %
How effectively the kernel uses memory bandwidth compared to the GPU’s theoretical peak. A high percentage means you’re memory-bound; optimize memory access patterns.
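In practice, optimizing memory access patterns usually means making the threads of a warp touch adjacent addresses. A simplified, illustrative comparison (not tied to any specific report):

```cuda
// Coalesced: consecutive threads read consecutive floats, so each warp's
// request is served by a few wide memory transactions.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read addresses `stride` elements apart, which
// scatters each warp's request across many transactions and wastes bandwidth.
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}
```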
Compute Throughput %
How effectively the kernel uses compute resources. A high percentage means you’re compute-bound; optimize arithmetic operations or increase occupancy.
Achieved Occupancy
The percentage of maximum possible warps that were active on average. Low occupancy may indicate:
- Too many registers per thread
- Too much shared memory per block
- Small grid size
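To cross-check the report, the CUDA occupancy API can print the theoretical ceiling implied by the kernel’s register, shared-memory, and block-size choices; achieved occupancy in NCU can only be at or below this bound. A sketch (the kernel here is a stand-in for whatever you profiled):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel representing the one selected in the report.
__global__ void myKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Report the theoretical occupancy upper bound for a given block size.
void printTheoreticalOccupancy(int blockSize) {
    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, myKernel, blockSize,
                                                  /*dynamicSMemSize=*/0);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float occupancy = (float)(maxBlocksPerSM * blockSize) / prop.maxThreadsPerMultiProcessor;
    printf("Theoretical occupancy at block size %d: %.0f%%\n", blockSize, occupancy * 100.0f);
}
```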
Registers per Thread
The number of registers used by each thread. High register usage can limit occupancy. Consider:
- Using `__launch_bounds__` to hint register limits
- Moving data to shared memory
- Simplifying computations
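Picking up the first suggestion above: `__launch_bounds__` tells the compiler the maximum block size (and, optionally, a minimum number of resident blocks per SM) so it can budget registers accordingly. The numbers below are examples, not recommendations; too tight a bound can introduce register spills, so re-profile after changing them.

```cuda
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor): the compiler
// caps register usage so that 4 blocks of 256 threads can be resident per SM.
__global__ void __launch_bounds__(256, 4)
boundedKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}
```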
Tips for Analysis
Start with the slowest kernels
Sort by duration to identify the kernels that take the most time. These are your optimization targets.
Look at throughput balance
If memory throughput is high but compute is low, you’re memory-bound. If compute is high but memory is low, you’re compute-bound. Balance both for optimal performance.
Check occupancy limiters
Low occupancy often indicates register pressure or shared memory limits. Check the registers-per-thread metric.
Use diagnostics for guidance
The diagnostics panel provides specific recommendations based on the metrics. Use these as starting points for optimization.
Exporting Results
You can export kernel data for use in other tools:
- Right-click a kernel and select Copy as CSV to copy metrics to clipboard
- Use the export button to save all kernel data
Exported data is useful for:
- Tracking performance over time
- Sharing results in issues or PRs
- Importing into spreadsheets for analysis