= Name =
sxsort3d_new - 3D Clustering - SORT3D_NEW: UNDER DEVELOPMENT. Reproducible 3D Clustering of heterogeneous dataset. Sort out 3D heterogeneity of 2D data whose 3D reconstruction parameters have been determined already.


= Usage =

'' usage in command line''

sxsort3d_new.py  --refinement_dir=refinement_dir  --output_dir=output_dir  --niter_for_sorting=num_of_sorting_iterations  --mask3D=mask3d_file   --focus=focus3d_file  --radius=outer_radius  --sym=symmetry  --img_per_grp=num_of_images_per_group  --minimum_grp_size=minimum_grp_size  --memory_per_node=memory_per_node  --nindependent=indenpendent_runs   --comparison_method=comparison_method  --instack=input_stack_file  --nofinal_sharpen   --eqkmeans_angle_step=angle_step  --eqkmeans_tilt1=lower_tilt  --eqkmeans_tilt2=upper_tilt  --post_sorting_sharpen  --stop_eqkmeans_percentage=stop_eqkmeans_percentage  --minimum_ptl_number=minimum_ptl_number  --notapplybckgnoise  --mtf=mtf_file  --B_enhance=B_enhance  --fl=lpf_cutoff_freq  --aa=lpf_falloff  --B_start=B_start  --B_stop=B_stop  --nofsc_adj=nofsc_adj  


=== Typical usage ===
sxsort3d_new.py exists only in MPI version. In addition, it does not support single workstation and the minimum requirement for mpi is two nodes. 

    1. 3D sorting with focus mask

    ''' mpirun -np 48 sxsort3d_new.py --output_dir='outdir_sxsort3d_new' —focus='focus3d.hdf' --nindependent=3 --sym='c1' --radius=52 --img_per_grp=2000 --refinement_dir='outdir_sxmeridien'  --comparison_method='cross' --stop_eqkmeans_percentage=2.0 --minimum_grp_size=50 ''' <<BR>><<BR>>

    Note --focus option is require for focus clustering. The focus mask must be binary. 

    2. 3D sorting without focus mask

    ''' mpirun -np 48 sxsort3d_new.py --output_dir='outdir_sxsort3d_new' --nindependent=3 --sym='c5' --radius=120 --minimum_grp_size=200 --refinement_dir='outdir_sxmeridien' --comparison_method='cross' --stop_eqkmeans_percentage=3.0 --img_per_grp=1500 ''' <<BR>><<BR>>


    3. Do unfiltered reconstructions on sorted clusters and merge them

    ''' mpirun -np 48 sxsort3d_new.py --output_dir='outdir_sxsort3d_new' --post_sorting_sharpen —focus='focus3d.hdf' --nindependent=3 --sym='c1' --radius=52 --img_per_grp=2000 --refinement_dir='outdir_sxmeridien' --comparison_method='cross' --stop_eqkmeans_percentage=2.0 --minimum_grp_size=50 ''' <<BR>><<BR>>

    Note --post_sorting_sharpen is required to reconstruct unfiltered maps independent of sorting.


    How to continue sxmeridien refinement using sorting results:: The command line below continues previous meridien run from the 25th iteration using a subset of data associated to a selected group.

    ''' mpirun -np 80 sxmeridien.py 'outdir_sxmeridien_continue' --memory_per_node=64.0 --ctrefromsort3d --subset=outdir_sxsort3d_new/Cluster0.txt --oldrefdir='outdir_sxmeridien' --ctrefromiter=30 ''' <<BR>><<BR>>
    
    Note the output directory 'outdir_sxmeridien_continue' can be an existing one. The --ctrefromsort3d option is require for meridien continue run. Additional options for this use case are:
    -—subset       : Specify subset of data with a selection text file (i.e. Cluster#.txt) with particle ID numbers in one column (produced by sort3d or by other means)
    -—oldrefdir    : Specify previous meridien refinement directory
    --ctrefromiter : Specify iteration to continue refinement from. One does not have to use final iteration. Typically earlier ones work better, at least for initial sorting.


== Input ==
    refinement_dir:: Input 3D refinement directory: The master output directory of sxmeridien. (default required string)
    niter_for_sorting:: 3D refinement iteration: Specify an iteration number of 3D refinement where the 3D alignment parameters should be extracted for this sorting. By default, it uses iteration achieved best resolution. (default -1)
    mask3D:: 3D mask: File path of the global 3D mask for clustering. (default none)
    focus:: Focus 3D mask: File path of a binary 3D mask for focused clustering. (default none)
    radius:: Outer radius for rotational correlation [Pixels]: Particle radius in pixel for rotational correlation. Generally, use the radius of the particle. The value must be smaller than half the box size.  (default -1)
    sym:: Point-group symmetry: Point group symmetry of the macromolecular structure. It can be inherited from refinement. (default c1) 
    img_per_grp:: Images per group: The number of images per a group. This value is critical for successful 3D clustering. (default 1000) 
    minimum_grp_size:: Smallest group size: Minimum number of members for being identified as a group. This value must be smaller than the number of images per a group (img_per_grp). (default 500) 
    memory_per_node:: Memory per node [GB]: User provided information about memory per node in GB (NOT per CPU). It will be used to evaluate the number of CPUs per node from user-provided MPI setting. By default, it uses 2GB * (number of CPUs per node) (default -1.0)

    * The remaining parameters are optional and default values are given in parenthesis. There is rarely any need to modify them.
    nindependent:: Independent runs: Number of independent runs for Equal Sized K-means clustering. The value must be an odd number larger than 2. (default 3) 
    comparison_method:: Comparison method: Similarity measurement for the comparison between reprojected reference images and particle images. Valid values are 'cross' (cross-correlaton coefficients) and 'eucd' (Euclidean distance). (default cross) 
    instack:: Input images stack: File path of particle stack for sorting provided by user. If specified, sorting starts from a given data stack. This option is not currently supported by SHPIRE GUI (sxgui). (default none)
    nofinal_sharpen:: Do not reconstruct final maps: Do not reconstruct unfiltered final maps for post refinement process. (default False)
    eqkmeans_angle_step:: EQK-means Angular sampling step: Sampling anglular step used for EQKmeans orientation constraints. (default 15.0)
    eqkmeans_tilt1:: EQK-means Lower tilting bound: Lower bound of sampling tilting angle (theta) used for EQKmeans orientation constraints. (default 0.0)
    eqkmeans_tilt2:: EQK-means Upper tilting bound: Upper bound of sampling tilting angle (theta) used for EQKmeans orientation constraints. (default 180.0)
    post_sorting_sharpen:: Sharpen maps of each clusters: Sharpen maps generated from sorted clusters. (default False)
    stop_eqkmeans_percentage:: Stop EQK-means Percentage [%]: Particle change percentage for stopping Equal-Sized K-means. (default 2.0)
    minimum_ptl_number:: Smallest orientation group size: The smallest orientation group size wich equals number_of_groups multiplied by this number. The value have to be an integer. (default 20)
    notapplybckgnoise:: Do not apply noise: Do not apply background noise. (default False)
    mtf:: MTF file: File contains the MTF (modulation transfer function) of the detector used. (default none)
    B_enhance:: B-factor enhancement: -1.0: B-factor is not applied; 0.0: program estimates B-factor from options. B_start (usually 10 Angstrom) to the resolution determined by FSC143; 128.0: program use the given value 128.0 [A^2] to enhance map. (default 0.0)
    fl:: Low-pass filter frequency: 0.0: low-pass filter to resolution; A value > 0.5: low-pass filter to the value in Angstrom; A value > 0.0 and < 0.5: low-pass filter to the value in absolute frequency; -1.0: no low-pass filter. (default 0.0)
    aa:: Low-pass filter fall-off: Low-pass filter fall-off. (default 0.1)
    B_start:: B-factor lower limit [A]: Lower limit for B-factor estimation. (default 10.0)
    B_stop:: B-factor higher limit [A]: Higher limit for B-factor estimation. (default  0.0)
    nofsc_adj:: Do not apply FSC-based low-pass filter: Do not apply an FSC-based low-pass filter (square root of FSC) to the merged volume before the B-factor estimation. (default False)


== Output ==
    output_dir:: Output directory: The master output directory for sorting. The results will be written here. This directory will be created automatically if it does not exist. Here, you can find a log.txt that describes the sequences of computations in the program. (default required string)


= Description =
sxsort3d_new finds out stable members by carrying out two-way comparison of two independent sxsort3d runs.

For small tested datasets (real and simulated ribosome data around 10K particles), it gives 70%-90% reproducibility. However, this rate also depends on the choice of number of images per group and number of particles in the smallest group.


Note - 2017/04/25: About new version 
The new version is better integrated with meridien, and supports only sorting that imports parameters from meridien refinement and the 3D reconstruction in sorting uses all the information inhered from meridien, not just the final optimal X-form parameters. The options for post processing are also included. Continuation option of sxmeridien.py allows to continue refinement of a dataset subset determined by the 3D sorting.

This version does not support single workstation and the minimum requirement for mpi is two nodes. 

Sorting would cost much more time than the last version. 

Important Outputs:
The results are saved in the directory specified as output_dir  ('outdir_sxsort3d_new' in the example above). The final results are partitioned particles IDs saved in text files. Also, unfiltered maps of each cluster are reconstructed in the way of meridien does. One can use postprocess command to merge the two halves of maps of each group.

- Cluster*.txt
Sorting results. The number of cluster files is equal to the number of classes found. These selection files contain one column for particle indexes. Input projection EM data is assumed to be number 0 to n-1.

- vol_unfiltered_0_group*.hdf and vol_unfiltered_1_group*.hdf
Reconstructed halfset maps to be used for postprocessing.

- vol_final_cluster*.hdf 

- vol_final_nomask_cluster*.hdf  


Some examples for timing: 
-  10K data C1 of image size 128*128 with group size 1200 using 3 nodes 48 cpus: 3h 8 minutes 
-  18K data C5 of image size 384*384 with group size 3000 using 3 nodes 48 cpus: 5h 6 minutes 
-  36K data C4 of image size 360*360 with group size 5000 using 3 nodes 48 cpus: 6h 54 minutes 
- 100k  data C1 of image size 360*360 with group size 25000 using 5 nodes 80 cpus: 9h 50 minutes 

In general, reconstruction costs more than 80% of time for each sorting. 


= Method =
K-means, equal K-means, reproducibility, two-way comparison.

= Reference =
Not published yet.

= Author / Maintainer =
Zhong Huang

= Keywords =
    category 1:: APPLICATIONS

= Files =
sxsort3d_new.py

= See also =
[[http://sparx-em.org/sparxwiki/sxsort3d|sxsort3d]]

= Maturity =
    beta::    Under development. It has been tested, The test cases/examples are available upon request. Please let us know if there are any bugs.

= Bugs =
None so far.