------------------------------------------------------------------------------
NVIDIA Compute Visual Profiler 
Linux Release Notes
Version 4.0
------------------------------------------------------------------------------

PLEASE REFER TO EULA.txt FOR THE LICENSE AGREEMENT FOR USING NVIDIA SOFTWARE.

Please refer to Changelog.txt for changes with respect to the previous version.

FILES IN THE RELEASE:
--------------------
* computeprof/bin/computeprof    : Compute Visual Profiler Executable

* computeprof/bin/libQt*.so.4    : Qt shared libraries

* computeprof/projects           : Directory containing sample profiler projects

* computeprof/doc                : Directory containing files for user documentation.


SUPPORTED LINUX DISTRIBUTIONS
-----------------------------
Compute Visual Profiler platform support is the same as that for the CUDA Toolkit.
Please refer to the CUDA Toolkit Linux release notes.


SYSTEM REQUIREMENTS
-------------------
. CUDA-enabled GPU
  See http://www.nvidia.com/object/cuda_learn_products.html
. NVIDIA Driver
. NVIDIA CUDA Toolkit


INSTALLATION AND SETUP
----------------------
The installation is part of the CUDA Toolkit installation. The files are
installed under "<CudaToolkitDir>/computeprof", where <CudaToolkitDir> is the
directory under which the CUDA Toolkit is installed.

Set up LD_LIBRARY_PATH to include the Compute Visual Profiler bin directory:
 > export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<CudaToolkitDir>/computeprof/bin

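The setup step above can also be written so that it is safe to run more than
once. The following is a minimal sketch, not part of the official installation:
the function name add_to_ld_library_path and the default path /usr/local/cuda
are assumptions made for illustration. It skips the directory if it is already
listed and avoids a leading ':' when LD_LIBRARY_PATH starts out empty (the
dynamic linker treats an empty entry as the current directory).

```shell
# Hypothetical default install location; replace it with your actual <CudaToolkitDir>.
CUDA_TOOLKIT_DIR="${CUDA_TOOLKIT_DIR:-/usr/local/cuda}"

# Append a directory to LD_LIBRARY_PATH only if it is not already listed,
# and without creating an empty entry when the variable is unset or empty.
add_to_ld_library_path() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) ;;    # already present, nothing to do
        *) LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$1" ;;
    esac
    export LD_LIBRARY_PATH
}

add_to_ld_library_path "$CUDA_TOOLKIT_DIR/computeprof/bin"
```

Placing these lines in a shell startup file would make the setting persistent
across sessions.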

RUNNING COMPUTE VISUAL PROFILER
-------------------------------
 > <CudaToolkitDir>/computeprof/bin/computeprof &

Refer to the Compute Visual Profiler User Guide
("Compute_Visual_Profiler_User_Guide.pdf") for more information.


KNOWN ISSUES
------------
1) Selecting profiler counters can cause the compute application to fail with an execution
   error, in which case no profiling data is generated and no profiler counter data can be
   collected. As a workaround, disable all profiler counters and collect only the basic
   profiling information.

2) The following are other known issues related to profiler counters:
   . The "warp serialize" counter for GPUs with compute capability 1.x is known
     to give incorrect, overly high values in some cases.

   . For GPUs with compute capability 2.0, the "instructions issued" and
     "instructions executed" counter values are incorrect in some cases.

   . Prof triggers do not work on G80 (compute capability 1.0).
   
   . The following counters can cause GPU kernels to run longer than the
     driver's watchdog timeout limit:
      "gld instructions 8bit"
      "gld instructions 16bit"
      "gld instructions 32bit"
      "gld instructions 64bit"
      "gld instructions 128bit"
      "gst instructions 8bit"
      "gst instructions 16bit"
      "gst instructions 32bit"
      "gst instructions 64bit"
      "gst instructions 128bit"
     In this case the driver terminates the GPU kernel, which results in an application
     error, and the profiling data is not available. Setting the X Config option
     'Interactive' to false is recommended when these counters are selected.
  
3) If some OpenCL resources (contexts, events, etc.) are not released by the program,
   the profiler output may be incomplete or empty, and the Visual Profiler will report
   the message 'Error in reading profiler output'. The program needs to be
   modified to release all OpenCL resources before it terminates.

4) When multiple threads use the same context, the API trace is incorrect.
   The API trace is correct when a single thread accesses a single context.

5) For certain kernel launches or memory transfers, the GPU start timestamp value in the
   Compute Visual Profiler output can have an incorrect value of zero. In such a case a
   warning is displayed and all profiler rows with an incorrect GPU start timestamp value
   are dropped, which results in incomplete profiling output.

6) For GPUs with compute capability 2.0 and higher, the potential occupancy is calculated assuming
   the default cache configuration, i.e. 48 KB of shared memory. If the cache configuration is changed,
   the 'Occupancy' in the profiler table will differ from the potential occupancy.
   The change also affects the following fields in the Kernel Occupancy Analysis: Shared Memory Ratio,
   Potential Occupancy, and Occupancy limiting factor. The 'Occupancy' in the profiler table is correctly
   calculated based on the actual cache configuration setting used for each kernel launch. The command line
   profiler option "cacheconfigexecuted" can be used to determine the cache configuration for each kernel
   launch. Refer to the latest Visual Profiler User Guide for details about this command line profiler option.

7) When the profiler gathers performance signals on G80-based products, the driver reduces
   the clock rate on the device. If the CUDA application crashes or otherwise exits uncleanly,
   the clocks are not reset to their previous values. The system must be rebooted to restore
   the original clock rate.

8) Due to improved memory coalescing hardware, the gld_incoherent and gst_incoherent signals will 
   always be zero on GTX 280 and GTX 260. This issue is specific to command line profiling. These 
   counters are not enabled for GTX 280 and GTX 260 in Visual Profiler.

9) Profiler data is flushed to the file only at certain synchronization points (such as
   cudaDeviceSynchronize() and cudaStreamSynchronize()) or when the profiler memory buffer is
   full. If the application exits without any such synchronization call, then the profiler
   output can be empty or incomplete.
   
10) Most of the CUDA/OpenCL SDK samples must be run with the command line
   argument "--noprompt". To give keyboard input to console-based CUDA/OpenCL
   programs, enable the "Run in separate window" checkbox in the Session
   settings dialog. The program then runs in a separate window; keyboard
   input is possible only with this option enabled.

11) In some cases, resizing, maximizing, or cascading the windows of multiple sessions
   can crash the Compute Visual Profiler.
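
For known issue 6, the cache configuration used by each kernel launch can be
recorded with the command line profiler. The following is a minimal sketch,
assuming the COMPUTE_PROFILE, COMPUTE_PROFILE_CONFIG and COMPUTE_PROFILE_LOG
environment variables behave as described in the Compute Visual Profiler User
Guide. The scratch directory and file names are hypothetical; check the User
Guide for the exact configuration file syntax.

```shell
# Hypothetical scratch directory for the profiler configuration and log.
PROFILE_DIR="${TMPDIR:-/tmp}/computeprof_example"
mkdir -p "$PROFILE_DIR"

# Configuration file with one option per line (assumed format);
# "cacheconfigexecuted" is the option named in known issue 6.
printf '%s\n' "cacheconfigexecuted" > "$PROFILE_DIR/profiler.cfg"

# Enable the command line profiler for applications started from this shell.
export COMPUTE_PROFILE=1
export COMPUTE_PROFILE_CONFIG="$PROFILE_DIR/profiler.cfg"
export COMPUTE_PROFILE_LOG="$PROFILE_DIR/profile.log"

# Next, run the CUDA application from this shell and inspect the file named
# by COMPUTE_PROFILE_LOG for the cache configuration of each kernel launch.
```

Unset COMPUTE_PROFILE afterwards to turn command line profiling off again.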
