Parameters#
Noise2read is a command-line interface (CLI) tool for eliminating PCR and sequencing errors in short reads. Its parameters are configured through CLI options and an INI file. Noise2read was developed mainly to correct short-read sequencing data, but it also provides several other modules. To run noise2read, first specify the module name from [“simplify_correction”, “correction”, “amplicon_correction”, “mimic_umi”, “real_umi”, “umi_correction”, “simulation”, “evaluation”]. Then set the parameters required by that module; noise2read uses the default value for any parameter that is not set.
1. Guidance on setting noise2read parameters:#
Noise2read has many parameters, but most of them do not need to change between datasets.
We strongly suggest setting tree_method to “gpu_hist”, which trains the model on a GPU. If you do not have GPU resources, please use the simplified version of noise2read for error correction, because CPU training on large datasets may take several days.
For large datasets, we do not suggest running noise2read with a large number of worker processes (num_workers). We have observed that this can suddenly consume a significant amount of memory, so that the program runs out of memory and terminates.
If noise2read runs out of memory and terminates while searching 1nt- or 2nt-edit-distance edges, please increase reads_chunks_num (default 1), decrease num_workers, and try again. Beware, a larger reads_chunks_num may slow noise2read down.
If noise2read runs out of memory and terminates during training, please increase chunks_num (default 100), decrease num_workers, and try again.
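The memory-related guidance above can be written into an INI file. The following is a minimal sketch using Python's configparser. It assumes that the INI section headers match the parameter group names used in this document (“General” and “ModelTuningSetup”); the helper name build_low_memory_config and the specific values are illustrative, not recommended settings.

```python
import configparser
import io


def build_low_memory_config(num_workers=4, reads_chunks_num=4, chunks_num=200):
    """Return INI text with a GPU tree method and conservative memory settings.

    The section names are assumed to mirror the parameter groups in this
    document; the numeric values are placeholders for illustration only.
    """
    config = configparser.ConfigParser()
    config.optionxform = str  # keep option names exactly as written
    config["General"] = {
        "num_workers": str(num_workers),            # fewer workers, less memory
        "reads_chunks_num": str(reads_chunks_num),  # raise if edge search runs out of memory
        "chunks_num": str(chunks_num),              # raise if training runs out of memory
    }
    config["ModelTuningSetup"] = {"tree_method": "gpu_hist"}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


ini_text = build_low_memory_config()
print(ini_text)
```

The resulting text can be saved as a configuration file and passed with -c | --config.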
2. The parameters required by different modules are summarised as follows:#
simplify_correction
result_dir, input_file, num_workers, chunks_num, reads_chunks_num, iso_change_detail, top_n, min_read_len
high_freq_thre, max_error_freq, save_graph, graph_visualization, drawing_graph_num
correction
All the parameters of “Paths”, “SourceInputData”, “General”, “GraphSetup”, “EmbeddingSetup”, “AmbiguousSetup” and “ModelTuningSetup”
amplicon_correction
All the parameters of “Paths”, “SourceInputData”, “General”, “GraphSetup”, “EmbeddingSetup”, “AmbiguousSetup”, “ModelTuningSetup” and “Amplicon”.
umi_correction
result_dir, input_file, ground_truth_data, num_workers, chunks_num, reads_chunks_num, top_n
high_freq_thre, max_error_freq, save_graph, graph_visualization, drawing_graph_num
real_umi
result_dir, input_file, num_workers
All the parameters of “real_umi”
mimic_umi
result_dir, input_file, ground_truth_data
simulation
All the parameters required by “simplify_correction” and those parameters of “simulation”
evaluation
result_dir, input_file, ground_truth_data, num_workers, chunks_num, reads_chunks_num
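For the modules whose parameters are listed by name above, the summary can be kept as a lookup table. The sketch below is a hypothetical helper (MODULE_PARAMETERS and unset_parameters are not part of noise2read) that reports which of a module's parameters have not been set explicitly and will therefore fall back to their defaults.

```python
# Parameters listed by name in the summary above. Modules described as
# "all the parameters of ..." are omitted from this illustrative table.
MODULE_PARAMETERS = {
    "simplify_correction": [
        "result_dir", "input_file", "num_workers", "chunks_num",
        "reads_chunks_num", "iso_change_detail", "top_n", "min_read_len",
        "high_freq_thre", "max_error_freq", "save_graph",
        "graph_visualization", "drawing_graph_num",
    ],
    "umi_correction": [
        "result_dir", "input_file", "ground_truth_data", "num_workers",
        "chunks_num", "reads_chunks_num", "top_n", "high_freq_thre",
        "max_error_freq", "save_graph", "graph_visualization",
        "drawing_graph_num",
    ],
    "mimic_umi": ["result_dir", "input_file", "ground_truth_data"],
    "evaluation": [
        "result_dir", "input_file", "ground_truth_data", "num_workers",
        "chunks_num", "reads_chunks_num",
    ],
}


def unset_parameters(module, provided):
    """Return the module's parameters that are absent from `provided`."""
    return [name for name in MODULE_PARAMETERS[module] if name not in provided]


print(unset_parameters("mimic_umi", {"input_file": "reads.fastq"}))
```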
3. Parameters Description#
Paths#
result_dir [default=working_directory/result/] The directory where all outputs are saved.
SourceInputData#
input_file The path to the input data required by noise2read.
ground_truth_data [default=None] The path to the ground truth data required by the noise2read evaluation module.
General#
num_workers [default=-1] The number of worker processes to use. If num_workers is -1, the number returned by os.cpu_count() is used.
chunks_num [default=100] The number of chunks the workload is split into. Increase it, and decrease num_workers, if noise2read runs out of memory during training.
reads_chunks_num [default=1] The number of chunks into which the list of reads is divided when searching 1nt- or 2nt-edit-distance edges with multiprocessing. It is also used by the multiprocessing module during evaluation. Beware, a larger reads_chunks_num may slow noise2read down.
verbose [default=False] If true, noise2read saves the extracted training instances, such as genuine and ambiguous errors and negative reads, as CSV files to result_dir.
iso_change_detail [default=False] If true, noise2read saves the reads of isolated nodes whose frequency changed between before and after correction as a .txt file to result_dir.
top_n [default=10] During evaluation, noise2read saves the frequency changes of the top_n most frequent sequences to a sheet of the output Excel file in result_dir.
min_read_len [default=30] The minimum sequence length threshold that determines whether 2nt-edit-distance-based error correction is performed.
negative_sample_num [default=300000] When the number of negative samples exceeds the preset threshold negative_sample_num, noise2read downsamples the negative samples for training.
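As a rough illustration of how num_workers and reads_chunks_num behave, the sketch below resolves the -1 default through os.cpu_count() and splits a list of reads into chunks. The function names and the ceiling-division chunking strategy are assumptions made for illustration; noise2read's internal chunking may differ.

```python
import os


def resolve_num_workers(num_workers=-1):
    """Map the -1 default to the CPU count reported by os.cpu_count()."""
    if num_workers == -1:
        return os.cpu_count() or 1  # os.cpu_count() may return None
    return num_workers


def split_into_chunks(reads, reads_chunks_num=1):
    """Split `reads` into at most `reads_chunks_num` contiguous chunks."""
    # Ceiling division; max(1, ...) keeps the step valid for an empty list.
    chunk_size = max(1, -(-len(reads) // reads_chunks_num))
    return [reads[i:i + chunk_size] for i in range(0, len(reads), chunk_size)]


chunks = split_into_chunks(["r1", "r2", "r3", "r4", "r5", "r6", "r7"], 3)
print(resolve_num_workers(), [len(chunk) for chunk in chunks])
```

Smaller chunks mean each worker holds fewer reads at a time, which is consistent with the advice to raise reads_chunks_num when memory is tight.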
GraphSetup#
high_freq_thre [default=4] The threshold τ that determines whether a read is high-frequency or low-frequency.
max_error_freq [default=4] A read is considered an error read when its frequency is smaller than the predefined maximum frequency threshold max_error_freq.
save_graph [default=False] If true, noise2read saves the constructed graph as ‘graph.gexf’ to result_dir.
graph_visualization [default=False] If true, noise2read visualizes the connected subgraphs as “.svg” files and saves them to result_dir.
drawing_graph_num [default=50] The number of connected subgraphs to draw.
EmbeddingSetup#
entropy_kmer [default=3] The kmer size used to calculate Shannon’s and Tsallis’s entropy during feature extraction.
entropy_q [default=2] The real number q used to calculate Tsallis’s entropy during feature extraction.
kmer_freq [default=3] The kmer frequency used to calculate Shannon’s and Tsallis’s entropy during feature extraction.
read_type [default=DNA] The read type of the sequencing data to be corrected. Set it to RNA when the nitrogenous base uracil (U) appears instead of thymine (T) in the reads of RNA sequencing data.
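To make the roles of entropy_kmer and entropy_q concrete, here is a minimal sketch of Shannon and Tsallis entropy over the k-mer distribution of a single read. It assumes overlapping k-mers, a base-2 logarithm for Shannon entropy, and the standard Tsallis form (1 - Σ p^q) / (q - 1); noise2read's actual feature extraction may differ in these details, and the function names are hypothetical.

```python
import math
from collections import Counter


def kmer_probabilities(read, k=3):
    """Relative frequencies of the overlapping k-mers in `read`."""
    kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    counts = Counter(kmers)
    return [count / len(kmers) for count in counts.values()]


def shannon_entropy(read, k=3):
    """Shannon entropy (in bits) of the k-mer distribution."""
    return -sum(p * math.log2(p) for p in kmer_probabilities(read, k))


def tsallis_entropy(read, k=3, q=2.0):
    """Tsallis entropy of order q (q must not equal 1)."""
    return (1 - sum(p ** q for p in kmer_probabilities(read, k))) / (q - 1)


read = "ACGT"  # 3-mers: ACG, CGT, each with probability 0.5
print(shannon_entropy(read), tsallis_entropy(read))
```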
AmbiguousSetup#
high_ambiguous [default=True] If true, noise2read corrects the potential ambiguous errors between high-frequency reads.
proba_deviation [default=0.95] Mutations observed between high-frequency reads are bidirectional. Therefore, of the two directional predictions, we only consider the one with the higher probability. Specifically, if the absolute difference between the probabilities of the two-way predictions is less than proba_deviation, we discard the prediction; otherwise, we choose the prediction with the higher probability.
iso_neg_high [default=False] If true, the high-frequency isolated nodes are also included as negative samples for high ambiguous prediction. This requires a considerable amount of computational resources (memory) for embedding and model training.
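The proba_deviation rule described above can be sketched as a small decision function. The function name and the direction labels "a_to_b" and "b_to_a" are hypothetical and only serve the illustration.

```python
def resolve_bidirectional(p_a_to_b, p_b_to_a, proba_deviation=0.95):
    """Pick a mutation direction between two high-frequency reads, or None.

    If the two directional probabilities differ by less than
    `proba_deviation`, the prediction is discarded (None). Otherwise the
    direction with the higher probability is returned.
    """
    if abs(p_a_to_b - p_b_to_a) < proba_deviation:
        return None
    return "a_to_b" if p_a_to_b > p_b_to_a else "b_to_a"


print(resolve_bidirectional(0.99, 0.01))  # difference 0.98, kept
print(resolve_bidirectional(0.90, 0.30))  # difference 0.60, discarded
```

With the default of 0.95, only predictions in which one direction is overwhelmingly more likely than the other are kept.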
ModelTuningSetup#
n_trials [default=20] An Optuna trial is a single evaluation of the objective function. n_trials is the number of trials used to search for the best model.
n_estimators [default=400] Number of boosting rounds.
test_size [default=0.1] The proportion of the dataset held out as an independent test set for evaluating the models.
random_state [default=42] Controls the shuffling applied to the data before the split performed by sklearn.model_selection.train_test_split. It is also the seed of the random number generator that controls the randomization of over-sampling with SMOTE.
tree_method [default=auto] The tree construction algorithm used in XGBoost. See the description in the XGBoost documentation.
Choices: auto, exact, approx, hist, gpu_hist. These are combinations of commonly used updaters. For other updaters like refresh, set the parameter updater directly.
auto: Use heuristic to choose the fastest method.
exact: Exact greedy algorithm. Enumerates all split candidates.
approx: Approximate greedy algorithm using quantile sketch and gradient histogram.
hist: Faster histogram optimized approximate greedy algorithm.
gpu_hist: GPU implementation of hist algorithm.
learning_rate_min [default=1e-3] The lower bound of the learning rate interval. Optuna chooses the learning rate from the predefined interval to find the best XGBoost model. The learning rate is the step size shrinkage used in updates to prevent overfitting.
range: (0,1]
learning_rate_max [default=1e-1] The upper bound of the learning rate interval. learning_rate_max > learning_rate_min.
range: (0,1]
max_depth_min [default=3] The lower bound of the interval for the maximum depth of a tree. Optuna chooses the maximum depth from the predefined interval to find the best XGBoost model. XGBoost aggressively consumes memory when training a deep tree.
range: [0,∞]
max_depth_max [default=15] The upper bound of the interval for the maximum depth of a tree. max_depth_max > max_depth_min.
range: [0,∞]
max_depth_step [default=1] The step size for choosing the max_depth of a tree from the interval [max_depth_min, max_depth_max].
subsample_min [default=0.8] The lower bound of the subsample ratio of the training instances. Optuna chooses the subsample ratio from the predefined interval to find the best XGBoost model.
range: (0,1]
subsample_max [default=1] The upper bound of the subsample ratio of the training instances. subsample_max > subsample_min.
range: (0,1]
colsample_bytree_min [default=0.8] The lower bound of the subsample ratio of columns when constructing each tree. Optuna chooses the subsample ratio from the predefined interval to find the best XGBoost model. Subsampling occurs once for every tree constructed.
range: (0,1]
colsample_bytree_max [default=1] The upper bound of the subsample ratio of columns when constructing each tree. colsample_bytree_max > colsample_bytree_min.
range: (0,1]
verbose_eval [default=False] If true and an evaluation set is used, writes the evaluation metric measured on the validation set to stderr.
xgboost_seed [default=42] Random number seed.
optuna_seed [default=42] Seed for the random number generator used in optuna.samplers.TPESampler.
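The interval constraints stated above (each maximum greater than its minimum, ratios and learning rates within (0,1]) can be checked before a long tuning run. The following sketch is a hypothetical pre-flight check, not part of noise2read; it also lists the max_depth candidates implied by max_depth_step, assuming the interval [max_depth_min, max_depth_max] is inclusive.

```python
DEFAULT_SEARCH_SPACE = {
    "learning_rate_min": 1e-3, "learning_rate_max": 1e-1,
    "max_depth_min": 3, "max_depth_max": 15, "max_depth_step": 1,
    "subsample_min": 0.8, "subsample_max": 1,
    "colsample_bytree_min": 0.8, "colsample_bytree_max": 1,
}


def validate_search_space(params):
    """Return a list of human-readable constraint violations (empty if valid)."""
    problems = []
    for name in ("learning_rate", "subsample", "colsample_bytree"):
        low, high = params[f"{name}_min"], params[f"{name}_max"]
        if not (0 < low <= 1 and 0 < high <= 1):
            problems.append(f"{name} bounds must lie in (0,1]")
        if not high > low:
            problems.append(f"{name}_max must be greater than {name}_min")
    if params["max_depth_min"] < 0:
        problems.append("max_depth_min must be non-negative")
    if not params["max_depth_max"] > params["max_depth_min"]:
        problems.append("max_depth_max must be greater than max_depth_min")
    if params["max_depth_step"] < 1:
        problems.append("max_depth_step must be at least 1")
    return problems


def max_depth_candidates(params):
    """Depths reachable from max_depth_min in steps of max_depth_step."""
    return list(range(params["max_depth_min"],
                      params["max_depth_max"] + 1,
                      params["max_depth_step"]))


print(validate_search_space(DEFAULT_SEARCH_SPACE))
print(max_depth_candidates(DEFAULT_SEARCH_SPACE))
```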
real umi#
umi_in_read [default=False] If true, the UMI sequences are contained in the reads.
umi_start [default=0] When umi_in_read is true, umi_start is the start position of the UMIs in the reads.
umi_end [default=12] When umi_in_read is true, umi_end is the end position of the UMIs in the reads.
non_umi_start [default=24] The start position of the sequenced target fragment, which excludes other sequences such as barcodes and UMIs.
group_read_number [default=10] The minimum number of reads a UMI cluster must contain to be selected for constructing the UMI-based ground truth dataset.
read_edit_dif [default=2] The edit distance threshold between each low-frequency read and the high-frequency read in a UMI cluster. If the edit distance <= read_edit_dif, the low-frequency read is retained for constructing the UMI-based ground truth data.
If umi_in_read is false, the UMI sequences are contained in the sequence description. In that case, two customized separators and indices can be used to split the description and extract the UMIs.
separator1 [default=_] The first separator, used to split the sequence description.
separator1_idx [default=2] The index of the string containing the UMI sequence in the list produced by the first split.
separator2 [default=_] The second separator, used to split the string containing the UMI sequence.
separator2_idx [default=0] The index of the UMI sequence in the list produced by the second split.
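The two ways of locating a UMI described in this section can be sketched as follows. The function name extract_umi and the example description strings are hypothetical, and the sketch assumes zero-based, end-exclusive positions for umi_start and umi_end, which the text does not state explicitly.

```python
def extract_umi(description, read, umi_in_read=False,
                umi_start=0, umi_end=12,
                separator1="_", separator1_idx=2,
                separator2="_", separator2_idx=0):
    """Return the UMI from the read itself or from its description line."""
    if umi_in_read:
        # Assumed convention: zero-based start, exclusive end.
        return read[umi_start:umi_end]
    # First split isolates the field holding the UMI; second split trims it.
    field = description.split(separator1)[separator1_idx]
    return field.split(separator2)[separator2_idx]


# UMI carried in a made-up description line, extracted with the defaults.
print(extract_umi("read1_BC01_ACGTACGTACGT", "TTTTGGGGCCCCAAAA"))
# UMI carried in the first 12 bases of the read.
print(extract_umi("read1", "AAAACCCCGGGGTTTTAAAA", umi_in_read=True))
```

With the default settings, both separators are "_", so the second split leaves the selected field unchanged; distinct separators are useful when the UMI is embedded in a field such as "UMI:ACGT".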
Amplicon#
amplicon_low_freq [default=50] A read is considered low-frequency when its frequency <= amplicon_low_freq in the additional amplicon sequencing correction.
amplicon_high_freq [default=1500] A read is considered high-frequency when its frequency >= amplicon_high_freq in the additional amplicon sequencing correction.
amplicon_threshold_proba [default=0.85] The probability threshold for deciding whether a potential amplicon error was mutated from its neighbouring high-frequency read. If the predicted probability >= amplicon_threshold_proba, noise2read retains the prediction; otherwise it discards it.
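The three amplicon thresholds can be read as simple comparisons. The sketch below uses hypothetical function names, and the "intermediate" label for reads between the two frequency thresholds is an illustrative choice rather than noise2read terminology.

```python
def amplicon_frequency_class(freq, amplicon_low_freq=50, amplicon_high_freq=1500):
    """Classify a read by its frequency for the amplicon correction step."""
    if freq <= amplicon_low_freq:
        return "low"
    if freq >= amplicon_high_freq:
        return "high"
    return "intermediate"  # illustrative label for everything in between


def keep_amplicon_prediction(proba, amplicon_threshold_proba=0.85):
    """Retain a prediction only if its probability reaches the threshold."""
    return proba >= amplicon_threshold_proba


print(amplicon_frequency_class(50), amplicon_frequency_class(1500))
print(keep_amplicon_prediction(0.85), keep_amplicon_prediction(0.80))
```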
simulation#
min_freq [default=5] The threshold used to filter out low-frequency reads after correction by noise2read simplify_correction, in order to eliminate noise for the simulation.
min_read_count [default=30] The minimum read count for selecting reads to constitute an error-prone subset. One or two errors are then randomly injected into these error-prone reads according to the predefined per-read error rates.
error_rate1 [default=0.09] The 1nt-based error rate per read.
error_rate2 [default=0.02] The 2nt-based error rate per read.
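The error injection controlled by min_read_count, error_rate1 and error_rate2 might look roughly like the sketch below. It is only an illustration under several assumptions: errors are modelled as substitutions only, a read is treated as error-prone when its count is at least min_read_count, and each copy independently receives one error with probability error_rate1 or two errors with probability error_rate2. The function names are hypothetical and the actual simulation module may work differently.

```python
import random


def inject_errors(read, n_errors, rng, alphabet="ACGT"):
    """Substitute `n_errors` distinct positions with a different base."""
    bases = list(read)
    for pos in rng.sample(range(len(bases)), n_errors):
        bases[pos] = rng.choice([b for b in alphabet if b != bases[pos]])
    return "".join(bases)


def simulate_reads(read_counts, min_read_count=30,
                   error_rate1=0.09, error_rate2=0.02, seed=0):
    """Expand {read: count} into a read list with injected errors."""
    rng = random.Random(seed)
    simulated = []
    for read, count in read_counts.items():
        error_prone = count >= min_read_count  # assumed inclusive threshold
        for _ in range(count):
            n_errors = 0
            if error_prone:
                draw = rng.random()
                if draw < error_rate1:
                    n_errors = 1
                elif draw < error_rate1 + error_rate2:
                    n_errors = 2
            simulated.append(inject_errors(read, n_errors, rng) if n_errors else read)
    return simulated


reads = simulate_reads({"ACGTACGTAC": 40, "TTTTGGGGCC": 5}, seed=1)
print(len(reads), len(set(reads)))
```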
Required CLI setting#
Module selection
To use noise2read, you must first select the module name from [“simplify_correction”, “correction”, “amplicon_correction”, “mimic_umi”, “real_umi”, “umi_correction”, “simulation”, “evaluation”].
-m | --module module_name
Setting configuration file or input dataset
Configuration file
-c | --config config.ini
Input read dataset
-i | --input data.fastq
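Putting the required options together, a command line can be assembled as in the sketch below. It assumes the executable is installed under the name noise2read; build_command is a hypothetical helper, and config.ini and data.fastq are the placeholder file names used above.

```python
import shlex


def build_command(module, config=None, input_file=None):
    """Assemble the required noise2read options into an argument list."""
    if config is None and input_file is None:
        raise ValueError("provide a configuration file or an input dataset")
    command = ["noise2read", "-m", module]  # executable name is an assumption
    if config is not None:
        command += ["-c", config]
    if input_file is not None:
        command += ["-i", input_file]
    return command


print(shlex.join(build_command("correction", config="config.ini")))
print(shlex.join(build_command("simplify_correction", input_file="data.fastq")))
```

The argument list can be passed to subprocess.run, or the joined string can be pasted into a shell.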
Optional CLI setting#
You can set some parameters on the command line, with or without an INI configuration file. The INI file can set all parameters except the module selection. When one of the following parameters is set on the command line, its setting in the INI file is ignored.
-u | --umi_file umi.fastq
-t | --true ground_truth.data.fastq
-r | --rectification corrected.data.fastq
-p | --parallel num_of_cpu_core
-a | --high_ambiguous True/False
-g | --tree_method gpu_hist/auto
-d | --directory */output_dir/
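The precedence of CLI options over INI settings can be pictured as a dictionary merge. This is a sketch of the behaviour described above, not noise2read's implementation; effective_settings is a hypothetical helper, and the correspondence between flags and parameter names (for example -g with tree_method and -p with num_workers) is an assumption inferred from the flag names.

```python
def effective_settings(ini_values, cli_values):
    """Merge settings so that any value given on the CLI overrides the INI file."""
    merged = dict(ini_values)
    # Options not passed on the command line are represented as None.
    merged.update({key: value for key, value in cli_values.items()
                   if value is not None})
    return merged


ini_values = {"tree_method": "auto", "num_workers": -1, "high_ambiguous": True}
cli_values = {"tree_method": "gpu_hist", "num_workers": None}
print(effective_settings(ini_values, cli_values))
```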