<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
		<id>http://wikis.cnic.es/proteomica/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Mtrevisan</id>
		<title>PROTEOMICA - User contributions [en]</title>
		<link rel="self" type="application/atom+xml" href="http://wikis.cnic.es/proteomica/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Mtrevisan"/>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php/Special:Contributions/Mtrevisan"/>
		<updated>2026-04-25T07:11:33Z</updated>
		<subtitle>User contributions</subtitle>
		<generator>MediaWiki 1.30.1</generator>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=File:SanXoT_commands_for_Unix_shell.zip&amp;diff=645</id>
		<title>File:SanXoT commands for Unix shell.zip</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=File:SanXoT_commands_for_Unix_shell.zip&amp;diff=645"/>
				<updated>2018-09-17T08:31:50Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: Mtrevisan uploaded a new version of &amp;amp;quot;File:SanXoT commands for Unix shell.zip&amp;amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Commands for the unit test used in Unix shell (instead of the DOS batch files).&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=Unit_tests_for_SanXoT&amp;diff=644</id>
		<title>Unit tests for SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=Unit_tests_for_SanXoT&amp;diff=644"/>
				<updated>2018-08-17T15:42:57Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: /* Unit test for Unix/Linux operating systems */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;We present here three unit tests for Windows that can be performed using the following programs in the '''[[SanXoT software package]]''':&lt;br /&gt;
&lt;br /&gt;
* '''[[Aljamia]]''' (data parser)&lt;br /&gt;
* '''[[Klibrate]]''' (weight calibrator)&lt;br /&gt;
* '''[[SanXoT]]''' (integration and variance calculation)&lt;br /&gt;
* '''[[SanXoTSieve]]''' (removal of outliers)&lt;br /&gt;
* '''[[Cardenio]]''' (merger of experiments)&lt;br /&gt;
* '''[[SanXoTSqueezer]]''' (category filter)&lt;br /&gt;
* '''[[SanXoTGauss]]''' (category cumulative Gaussian graphs)&lt;br /&gt;
* '''[[Sanson]]''' (category clustering)&lt;br /&gt;
* '''[[Coordinometer]]''' (calculator of the degree of coordination)&lt;br /&gt;
&lt;br /&gt;
For detailed explanations, check: '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
==Unit test for Windows==&lt;br /&gt;
&lt;br /&gt;
Links to the SanXoT Windows standalone executables are provided in this unit test. Additionally, download links to the source code and up-to-date versions of SanXoT are available at: '''[[SanXoT software package]]'''.&lt;br /&gt;
&lt;br /&gt;
To simplify the user experience, the Windows executables are ready to use, with no need to install anything (not even Python!).&lt;br /&gt;
&lt;br /&gt;
==Unit test for Unix/Linux operating systems==&lt;br /&gt;
&lt;br /&gt;
Although SanXoT has been mainly tested and used on Windows, it is developed in Python, a multi-platform language, so these unit tests can also be performed on Unix/Linux operating systems. To do so, you just need to:&lt;br /&gt;
&lt;br /&gt;
# Install Python 2.7.x,&lt;br /&gt;
# Install [https://en.wikipedia.org/wiki/NumPy NumPy] and [https://en.wikipedia.org/wiki/SciPy SciPy],&lt;br /&gt;
# Install [https://en.wikipedia.org/wiki/Graphviz Graphviz] (used to interpret the *.gv files, written in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) DOT language], that generate the '''[[Sanson]]''' clustering graphs in Test 3),&lt;br /&gt;
# Replace the *.bat files within the unit tests with the files contained here: '''[[File:SanXoT commands for Unix shell.zip]]'''.&lt;br /&gt;
# Use the source code from GitHub ('''[https://github.com/CNIC-Proteomics/SanXoT https://github.com/CNIC-Proteomics/SanXoT]'''), rather than the standalone executables provided below.&lt;br /&gt;
&lt;br /&gt;
Step #2 can be done using the following pip commands (note that Matplotlib is installed as well):&lt;br /&gt;
&lt;br /&gt;
 pip install numpy&lt;br /&gt;
 pip install matplotlib&lt;br /&gt;
 pip install scipy&lt;br /&gt;
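&lt;br /&gt;
As a quick sanity check before running the tests, the modules from steps #1 and #2 can be verified with a short Python sketch (this helper is illustrative and not part of the SanXoT package):&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch: report which of the required modules are missing from the
# current Python environment (run it with the same interpreter that
# will run the unit tests).
def missing_modules(names):
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing

if __name__ == "__main__":
    required = ["numpy", "scipy", "matplotlib"]
    print("missing modules: %s" % missing_modules(required))
```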
&lt;br /&gt;
==Test 1: The fundamental workflow==&lt;br /&gt;
&lt;br /&gt;
The '''[[fundamental workflow]]''' consists of the steps to quantify proteins using PSMs and quantitative information, based on the WSPP (''Weighted Spectrum, Peptide and Protein'') statistical model, as described&amp;lt;ref&amp;gt;Navarro, P., Trevisan-Herraz, M., Bonzon-Kulichenko, E., Núñez, E., Martínez-Acedo, P., Pérez-Hernández, D., Jorge, I., Mesa, R., Calvo, E., Carrascal, M., Hernáez, M.L., García, F., Bárcena, J.A., Ashman, K., Abian, J., Gil, C., Redondo, J.M. and Vázquez, J. (2014) '''[https://pubs.acs.org/doi/abs/10.1021/pr4006958 General statistical framework for quantitative proteomics by stable isotope labeling]'''. ''Journal of Proteome Research'', 13, 1234-1247.&amp;lt;/ref&amp;gt;, using the following steps:&lt;br /&gt;
&lt;br /&gt;
* calibration of spectra&lt;br /&gt;
* integrating from spectra to peptides&lt;br /&gt;
* integrating from peptides to proteins&lt;br /&gt;
* quantifying proteins&lt;br /&gt;
&lt;br /&gt;
This test makes use of four programs: '''[[Aljamia]]''', '''[[Klibrate]]''', '''[[SanXoT]]''', and '''[[SanXoTSieve]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this unit test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) download the Windows executables, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]'''. Unzip the whole content into a folder specific to the program.&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test1.zip SanXoT_test1.zip]'''. Unzip to a working folder. You should have two folders, one with the input for this test, and another with the expected results. In the former, you will find two text files:&lt;br /&gt;
&lt;br /&gt;
::* '''''commands_test1.bat''''', a Windows batch file with the command lines to run this sample analysis.&lt;br /&gt;
::* '''''170415_Marga_GBS_iTRAQ_PSMs.txt''''', a tab-separated-values text file with identifications and quantitative data from a proteomics experiment&amp;lt;ref&amp;gt;Mateos-Hernández, L., et al. (2016) '''[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5342688/ Quantitative proteomics reveals Piccolo as a candidate serological correlate of recovery from Guillain-Barré syndrome]'''. ''Oncotarget'', 7(46): 74582–74591.&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
3) get a) the path of your working folder (where you have unzipped the file '''''170415_Marga_GBS_iTRAQ_PSMs.txt'''''), b) the path of the folder where you have unzipped the Windows executables, and c) the drive letter of the latter (C:, D:, etc.); then modify the following lines in the '''''commands_test1.bat''''' file accordingly:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
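&lt;br /&gt;
These three lines can also be generated from your own folder paths; a minimal, hypothetical Python helper (the D:\ paths are only examples):&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch: build the three "set" lines of commands_test1.bat from your
# own folders; the paths used here are examples, not required locations.
import ntpath  # Windows-style path handling, available on any OS

def bat_settings(program_folder, working_folder):
    drive = ntpath.splitdrive(program_folder)[0]  # e.g. "D:"
    return [
        "set unit=%s" % drive,
        'set programFolder="%s"' % program_folder,
        'set workingFolder="%s"' % working_folder,
    ]

for line in bat_settings("D:\\SSP", "D:\\SanXoT_test1"):
    print(line)
```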
&lt;br /&gt;
4) execute the '''''commands_test1.bat''''' file: copy and paste (or drag and drop) its whole text into a [https://en.wikipedia.org/wiki/Cmd.exe command prompt window]. You can also just double-click the bat file; however, if an error arises, the CMD window will close immediately, so the text of the error will not be available.&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. It should take 30-60 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) compare your results with the data in the folder SanXoT_test1_results, inside SanXoT_test1.zip (for example using comparison software such as [https://en.wikipedia.org/wiki/Beyond_Compare Beyond Compare]). Only the files ending in *_log.txt and *_infoFile.txt should show differences, and only due to different file paths and timestamps; all other features, such as variances, Levenberg-Marquardt steps, and options, should be identical.&lt;br /&gt;
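&lt;br /&gt;
The comparison of step 6 can also be done programmatically; a minimal Python sketch that skips the files expected to differ (a flat folder layout, as in the test archive, is assumed):&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch: list the files that differ between your results folder and
# the expected-results folder, ignoring *_log.txt and *_infoFile.txt
# (which legitimately differ in file paths and timestamps).
import filecmp
import os

def differing_files(results_dir, expected_dir):
    diffs = []
    for name in sorted(os.listdir(expected_dir)):
        if name.endswith("_log.txt") or name.endswith("_infoFile.txt"):
            continue  # these are expected to differ between runs
        mine = os.path.join(results_dir, name)
        theirs = os.path.join(expected_dir, name)
        if not os.path.isfile(mine) or not filecmp.cmp(mine, theirs, shallow=False):
            diffs.append(name)
    return diffs
```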
&lt;br /&gt;
==Test 2: Experiment merging==&lt;br /&gt;
&lt;br /&gt;
Protein quantifications from different technical or biological replicates can be merged into experiment-independent quantifications, yielding the biological and technical variance. In this test we:&lt;br /&gt;
&lt;br /&gt;
* use the resulting data from Test 1&lt;br /&gt;
* perform the fundamental workflow for another parallel experiment&lt;br /&gt;
* merge both experiments&lt;br /&gt;
&lt;br /&gt;
This test makes use of four programs: '''[[Klibrate]]''', '''[[SanXoT]]''', '''[[SanXoTSieve]]''', and '''[[Cardenio]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this unit test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) use the Windows executables unzipped in step 1 of Test 1 ('''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]''').&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test2.zip SanXoT_test2.zip]'''. This contains:&lt;br /&gt;
&lt;br /&gt;
::* all the files generated in Test 1&lt;br /&gt;
::* the '''''tagFile.txt''''' (for '''[[Cardenio]]''')&lt;br /&gt;
::* the '''''commands_test2.bat'''''.&lt;br /&gt;
&lt;br /&gt;
3) remember to change the following lines in the '''''commands_test2.bat''''' file, as in step 3 of Test 1:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test2&amp;quot;&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test2.bat''''' file (do the same as in step 4 in Test 1).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. As in Test 1, this test should take 30-60 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) as in Test 1, the only files that are not identical should be the merge_logFile.txt from Cardenio, and the *_infoFile.txt from SanXoT, SanXoTSieve and Klibrate.&lt;br /&gt;
&lt;br /&gt;
==Test 3: Systems biology==&lt;br /&gt;
&lt;br /&gt;
The '''[[SanXoT software package]]''' can be used to perform systems biology analyses using the SBT (''Systems Biology Triangle'') model, as described.&amp;lt;ref&amp;gt;García-Marqués, F., Trevisan-Herraz, M., Martínez-Martínez, S., Camafeita, E., Jorge, I., Lopez, J.A., Méndez-Barbero, N., Méndez-Ferrer, S., del Pozo, M.A., Ibáñez, B., Andrés, V., Sánchez-Madrid, F., Redondo, J.M., Bonzon-Kulichenko, E. and Vázquez, J. (2016) '''[http://www.mcponline.org/content/15/5/1740.long A novel systems-biology algorithm for the analysis of coordinated protein responses using quantitative proteomics]'''. ''Molecular &amp;amp; Cellular Proteomics'', 15(5):1740-60.&amp;lt;/ref&amp;gt; In this test we:&lt;br /&gt;
&lt;br /&gt;
* generate a relations file using data from the '''[https://en.wikipedia.org/wiki/DAVID DAVID bioinformatics resource]''' (you can download the data from DAVID yourself, but for your convenience we have already included it)&lt;br /&gt;
* use the resulting data from Test 2&lt;br /&gt;
* calculate a category-level fold-change&lt;br /&gt;
* perform the systems biology analysis.&lt;br /&gt;
&lt;br /&gt;
This test makes use of seven programs: '''[[Camacho]]''', '''[[SanXoT]]''', '''[[SanXoTSieve]]''', '''[[SanXoTSqueezer]]''', '''[[Sanson]]''', '''[[SanXoTGauss]]''', and '''[[Coordinometer]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) use the Windows executables unzipped in step 1 of Test 1 ('''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]''').&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test3.zip SanXoT_test3.zip]'''. This contains:&lt;br /&gt;
&lt;br /&gt;
::* all the files generated in Test 1 and Test 2&lt;br /&gt;
::* the tab-separated text file '''''SB_Homo_19dic-2017.txt''''' (a set of data downloaded from DAVID)&lt;br /&gt;
::* the '''''commands_test3.bat'''''.&lt;br /&gt;
&lt;br /&gt;
3) remember to change the following lines in the '''''commands_test3.bat''''' file, as in step 3 of Test 1 and Test 2:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test3&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Additionally, if you want to get the similarity graph from the category clustering algorithm ('''[[Sanson]]'''), you will need:&lt;br /&gt;
&lt;br /&gt;
* to have installed '''[https://en.wikipedia.org/wiki/Graphviz Graphviz]''' (external open-source software, which includes the interpreter for the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) DOT language], used by Sanson to generate the graphs for the clusters)&lt;br /&gt;
* from the program folder, modify the ''dot.ini'' file, changing the following line so that it includes the path to the ''bin'' folder of Graphviz, which may differ between installations:&lt;br /&gt;
&lt;br /&gt;
 dotlocation = C:\Program Files (x86)\Graphviz2.36\bin&lt;br /&gt;
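&lt;br /&gt;
Whether the configured path actually contains the DOT executable can be checked with a small sketch (the Graphviz path below is only an example; it varies per installation):&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch: check that a dotlocation path (as set in dot.ini) contains
# the Graphviz "dot" executable; the path used here is illustrative.
import os

def has_dot(dotlocation):
    candidates = ("dot.exe", "dot")  # Windows and Unix executable names
    return any(os.path.isfile(os.path.join(dotlocation, name))
               for name in candidates)

print(has_dot("C:\\Program Files (x86)\\Graphviz2.36\\bin"))
```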
&lt;br /&gt;
4) execute the '''''commands_test3.bat''''' file (do the same as in step 4 in Test 1).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. This test should take 10-20 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) as in Test 1 and Test 2, the only files that are not identical should be the *_logFile.txt from Camacho, SanXoTSqueezer, Sanson, SanXoTGauss and Coordinometer, as well as the *_infoFile.txt from SanXoT and SanXoTSieve. Additionally, if you use a different version of Graphviz (this test used Graphviz 2.36), you might find differences in the file sanson_simGraph.pdf (a PDF containing the final graph with the category clusters).&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:SanXoT software package]]&lt;br /&gt;
[[Category:unit tests]]&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=Unit_tests_for_SanXoT&amp;diff=643</id>
		<title>Unit tests for SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=Unit_tests_for_SanXoT&amp;diff=643"/>
				<updated>2018-08-17T15:16:36Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: /* Unit test for Unix/Linux operating systems */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;We present here three unit tests for Windows that can be performed using the following programs in the '''[[SanXoT software package]]''':&lt;br /&gt;
&lt;br /&gt;
* '''[[Aljamia]]''' (data parser)&lt;br /&gt;
* '''[[Klibrate]]''' (weight calibrator)&lt;br /&gt;
* '''[[SanXoT]]''' (integration and variance calculation)&lt;br /&gt;
* '''[[SanXoTSieve]]''' (removal of outliers)&lt;br /&gt;
* '''[[Cardenio]]''' (merger of experiments)&lt;br /&gt;
* '''[[SanXoTSqueezer]]''' (category filter)&lt;br /&gt;
* '''[[SanXoTGauss]]''' (category cumulative Gaussian graphs)&lt;br /&gt;
* '''[[Sanson]]''' (category clustering)&lt;br /&gt;
* '''[[Coordinometer]]''' (calculator of the degree of coordination)&lt;br /&gt;
&lt;br /&gt;
For detailed explanations, check: '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
==Unit test for Windows==&lt;br /&gt;
&lt;br /&gt;
Links to the SanXoT Windows standalone executables are provided in this unit test. Additionally, download links to the source code and up-to-date versions of SanXoT are available at: '''[[SanXoT software package]]'''.&lt;br /&gt;
&lt;br /&gt;
To simplify the user experience, the Windows executables are ready to use, with no need to install anything (not even Python!).&lt;br /&gt;
&lt;br /&gt;
==Unit test for Unix/Linux operating systems==&lt;br /&gt;
&lt;br /&gt;
Although SanXoT has been mainly tested and used on Windows, it is developed in Python, a multi-platform language, so these unit tests can also be performed on Unix/Linux operating systems. To do so, you just need to:&lt;br /&gt;
&lt;br /&gt;
# Install Python 2.7.x,&lt;br /&gt;
# Install [https://en.wikipedia.org/wiki/NumPy NumPy] and [https://en.wikipedia.org/wiki/SciPy SciPy],&lt;br /&gt;
# Install [https://en.wikipedia.org/wiki/Graphviz Graphviz] (used to interpret the *.gv files, written in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) DOT language], that generate the '''[[Sanson]]''' clustering graphs in Test 3),&lt;br /&gt;
# Replace the *.bat files within the unit tests with the files contained here: '''[[File:SanXoT commands for Unix shell.zip]]'''.&lt;br /&gt;
# Use the source code from GitHub ('''[https://github.com/CNIC-Proteomics/SanXoT https://github.com/CNIC-Proteomics/SanXoT]'''), rather than the standalone executables provided below.&lt;br /&gt;
&lt;br /&gt;
Step #2 can be done using the following pip commands (note that Matplotlib is installed as well):&lt;br /&gt;
&lt;br /&gt;
 pip install numpy&lt;br /&gt;
 pip install matplotlib&lt;br /&gt;
 pip install scipy&lt;br /&gt;
&lt;br /&gt;
==Test 1: The fundamental workflow==&lt;br /&gt;
&lt;br /&gt;
The '''[[fundamental workflow]]''' consists of the steps to quantify proteins using PSMs and quantitative information, based on the WSPP (''Weighted Spectrum, Peptide and Protein'') statistical model, as described&amp;lt;ref&amp;gt;Navarro, P., Trevisan-Herraz, M., Bonzon-Kulichenko, E., Núñez, E., Martínez-Acedo, P., Pérez-Hernández, D., Jorge, I., Mesa, R., Calvo, E., Carrascal, M., Hernáez, M.L., García, F., Bárcena, J.A., Ashman, K., Abian, J., Gil, C., Redondo, J.M. and Vázquez, J. (2014) '''[https://pubs.acs.org/doi/abs/10.1021/pr4006958 General statistical framework for quantitative proteomics by stable isotope labeling]'''. ''Journal of Proteome Research'', 13, 1234-1247.&amp;lt;/ref&amp;gt;, using the following steps:&lt;br /&gt;
&lt;br /&gt;
* calibration of spectra&lt;br /&gt;
* integrating from spectra to peptides&lt;br /&gt;
* integrating from peptides to proteins&lt;br /&gt;
* quantifying proteins&lt;br /&gt;
&lt;br /&gt;
This test makes use of four programs: '''[[Aljamia]]''', '''[[Klibrate]]''', '''[[SanXoT]]''', and '''[[SanXoTSieve]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this unit test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) download the Windows executables, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]'''. Unzip the whole content into a folder specific to the program.&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test1.zip SanXoT_test1.zip]'''. Unzip to a working folder. You should have two folders, one with the input for this test, and another with the expected results. In the former, you will find two text files:&lt;br /&gt;
&lt;br /&gt;
::* '''''commands_test1.bat''''', a Windows batch file with the command lines to run this sample analysis.&lt;br /&gt;
::* '''''170415_Marga_GBS_iTRAQ_PSMs.txt''''', a tab-separated-values text file with identifications and quantitative data from a proteomics experiment&amp;lt;ref&amp;gt;Mateos-Hernández, L., et al. (2016) '''[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5342688/ Quantitative proteomics reveals Piccolo as a candidate serological correlate of recovery from Guillain-Barré syndrome]'''. ''Oncotarget'', 7(46): 74582–74591.&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
3) get a) the path of your working folder (where you have unzipped the file '''''170415_Marga_GBS_iTRAQ_PSMs.txt'''''), b) the path of the folder where you have unzipped the Windows executables, and c) the drive letter of the latter (C:, D:, etc.); then modify the following lines in the '''''commands_test1.bat''''' file accordingly:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test1.bat''''' file: copy and paste (or drag and drop) its whole text into a [https://en.wikipedia.org/wiki/Cmd.exe command prompt window]. You can also just double-click the bat file; however, if an error arises, the CMD window will close immediately, so the text of the error will not be available.&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. It should take 30-60 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) compare your results with the data in the folder SanXoT_test1_results, inside SanXoT_test1.zip (for example using comparison software such as [https://en.wikipedia.org/wiki/Beyond_Compare Beyond Compare]). Only the files ending in *_log.txt and *_infoFile.txt should show differences, and only due to different file paths and timestamps; all other features, such as variances, Levenberg-Marquardt steps, and options, should be identical.&lt;br /&gt;
&lt;br /&gt;
==Test 2: Experiment merging==&lt;br /&gt;
&lt;br /&gt;
Protein quantifications from different technical or biological replicates can be merged into experiment-independent quantifications, yielding the biological and technical variance. In this test we:&lt;br /&gt;
&lt;br /&gt;
* use the resulting data from Test 1&lt;br /&gt;
* perform the fundamental workflow for another parallel experiment&lt;br /&gt;
* merge both experiments&lt;br /&gt;
&lt;br /&gt;
This test makes use of four programs: '''[[Klibrate]]''', '''[[SanXoT]]''', '''[[SanXoTSieve]]''', and '''[[Cardenio]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this unit test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) use the Windows executables unzipped in step 1 of Test 1 ('''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]''').&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test2.zip SanXoT_test2.zip]'''. This contains:&lt;br /&gt;
&lt;br /&gt;
::* all the files generated in Test 1&lt;br /&gt;
::* the '''''tagFile.txt''''' (for '''[[Cardenio]]''')&lt;br /&gt;
::* the '''''commands_test2.bat'''''.&lt;br /&gt;
&lt;br /&gt;
3) remember to change the following lines in the '''''commands_test2.bat''''' file, as in step 3 of Test 1:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test2&amp;quot;&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test2.bat''''' file (do the same as in step 4 in Test 1).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. As in Test 1, this test should take 30-60 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) as in Test 1, the only files that are not identical should be the merge_logFile.txt from Cardenio, and the *_infoFile.txt from SanXoT, SanXoTSieve and Klibrate.&lt;br /&gt;
&lt;br /&gt;
==Test 3: Systems biology==&lt;br /&gt;
&lt;br /&gt;
The '''[[SanXoT software package]]''' can be used to perform systems biology analyses using the SBT (''Systems Biology Triangle'') model, as described.&amp;lt;ref&amp;gt;García-Marqués, F., Trevisan-Herraz, M., Martínez-Martínez, S., Camafeita, E., Jorge, I., Lopez, J.A., Méndez-Barbero, N., Méndez-Ferrer, S., del Pozo, M.A., Ibáñez, B., Andrés, V., Sánchez-Madrid, F., Redondo, J.M., Bonzon-Kulichenko, E. and Vázquez, J. (2016) '''[http://www.mcponline.org/content/15/5/1740.long A novel systems-biology algorithm for the analysis of coordinated protein responses using quantitative proteomics]'''. ''Molecular &amp;amp; Cellular Proteomics'', 15(5):1740-60.&amp;lt;/ref&amp;gt; In this test we:&lt;br /&gt;
&lt;br /&gt;
* generate a relations file using data from the '''[https://en.wikipedia.org/wiki/DAVID DAVID bioinformatics resource]''' (you can download the data from DAVID yourself, but for your convenience we have already included it)&lt;br /&gt;
* use the resulting data from Test 2&lt;br /&gt;
* calculate a category-level fold-change&lt;br /&gt;
* perform the systems biology analysis.&lt;br /&gt;
&lt;br /&gt;
This test makes use of seven programs: '''[[Camacho]]''', '''[[SanXoT]]''', '''[[SanXoTSieve]]''', '''[[SanXoTSqueezer]]''', '''[[Sanson]]''', '''[[SanXoTGauss]]''', and '''[[Coordinometer]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) use the Windows executables unzipped in step 1 of Test 1 ('''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]''').&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test3.zip SanXoT_test3.zip]'''. This contains:&lt;br /&gt;
&lt;br /&gt;
::* all the files generated in Test 1 and Test 2&lt;br /&gt;
::* the tab-separated text file '''''SB_Homo_19dic-2017.txt''''' (a set of data downloaded from DAVID)&lt;br /&gt;
::* the '''''commands_test3.bat'''''.&lt;br /&gt;
&lt;br /&gt;
3) remember to change the following lines in the '''''commands_test3.bat''''' file, as in step 3 of Test 1 and Test 2:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test3&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Additionally, if you want to get the similarity graph from the category clustering algorithm ('''[[Sanson]]'''), you will need:&lt;br /&gt;
&lt;br /&gt;
* to have installed '''[https://en.wikipedia.org/wiki/Graphviz Graphviz]''' (external open-source software, which includes the interpreter for the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) DOT language], used by Sanson to generate the graphs for the clusters)&lt;br /&gt;
* from the program folder, modify the ''dot.ini'' file, changing the following line so that it includes the path to the ''bin'' folder of Graphviz, which may differ between installations:&lt;br /&gt;
&lt;br /&gt;
 dotlocation = C:\Program Files (x86)\Graphviz2.36\bin&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test3.bat''''' file (do the same as in step 4 in Test 1).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. This test should take 10-20 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) as in Test 1 and Test 2, the only files that are not identical should be the *_logFile.txt from Camacho, SanXoTSqueezer, Sanson, SanXoTGauss and Coordinometer, as well as the *_infoFile.txt from SanXoT and SanXoTSieve. Additionally, if you use a different version of Graphviz (this test used Graphviz 2.36), you might find differences in the file sanson_simGraph.pdf (a PDF containing the final graph with the category clusters).&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:SanXoT software package]]&lt;br /&gt;
[[Category:unit tests]]&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=Unit_tests_for_SanXoT&amp;diff=642</id>
		<title>Unit tests for SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=Unit_tests_for_SanXoT&amp;diff=642"/>
				<updated>2018-08-17T15:14:25Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;We present here three unit tests for Windows that can be performed using the following programs in the '''[[SanXoT software package]]''':&lt;br /&gt;
&lt;br /&gt;
* '''[[Aljamia]]''' (data parser)&lt;br /&gt;
* '''[[Klibrate]]''' (weight calibrator)&lt;br /&gt;
* '''[[SanXoT]]''' (integration and variance calculation)&lt;br /&gt;
* '''[[SanXoTSieve]]''' (removal of outliers)&lt;br /&gt;
* '''[[Cardenio]]''' (merger of experiments)&lt;br /&gt;
* '''[[SanXoTSqueezer]]''' (category filter)&lt;br /&gt;
* '''[[SanXoTGauss]]''' (category cumulative Gaussian graphs)&lt;br /&gt;
* '''[[Sanson]]''' (category clustering)&lt;br /&gt;
* '''[[Coordinometer]]''' (calculator of the degree of coordination)&lt;br /&gt;
&lt;br /&gt;
For detailed explanations, check: '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
==Unit test for Windows==&lt;br /&gt;
&lt;br /&gt;
Links to the SanXoT Windows standalone executables are provided in this unit test. Additionally, download links to the source code and up-to-date versions of SanXoT are available at: '''[[SanXoT software package]]'''.&lt;br /&gt;
&lt;br /&gt;
To simplify the user experience, the Windows executables are ready to use, with no need to install anything (not even Python!).&lt;br /&gt;
&lt;br /&gt;
==Unit test for Unix/Linux operating systems==&lt;br /&gt;
&lt;br /&gt;
Although SanXoT has been mainly tested and used on Windows, it is developed in Python, a multi-platform language, so these unit tests can also be performed on Unix/Linux operating systems. To do so, you just need to:&lt;br /&gt;
&lt;br /&gt;
# Install Python 2.7.x,&lt;br /&gt;
# Install [https://en.wikipedia.org/wiki/NumPy NumPy] and [https://en.wikipedia.org/wiki/SciPy SciPy],&lt;br /&gt;
# Install [https://en.wikipedia.org/wiki/Graphviz Graphviz] (used to interpret the *.gv files, written in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) DOT language], that generate the '''[[Sanson]]''' clustering graphs in Test 3),&lt;br /&gt;
# Replace the *.bat files within the unit tests with the files contained here: '''[[File:SanXoT commands for Unix shell.zip]]'''.&lt;br /&gt;
# Use the source code from GitHub ('''[https://github.com/CNIC-Proteomics/SanXoT https://github.com/CNIC-Proteomics/SanXoT]'''), rather than the standalone executables provided below.&lt;br /&gt;
&lt;br /&gt;
==Test 1: The fundamental workflow==&lt;br /&gt;
&lt;br /&gt;
The '''[[fundamental workflow]]''' consists of the steps to quantify proteins using PSMs and quantitative information, based on the WSPP (''Weighted Spectrum, Peptide and Protein'') statistical model, as described&amp;lt;ref&amp;gt;Navarro, P., Trevisan-Herraz, M., Bonzon-Kulichenko, E., Núñez, E., Martínez-Acedo, P., Pérez-Hernández, D., Jorge, I., Mesa, R., Calvo, E., Carrascal, M., Hernáez, M.L., García, F., Bárcena, J.A., Ashman, K., Abian, J., Gil, C., Redondo, J.M. and Vázquez, J. (2014) '''[https://pubs.acs.org/doi/abs/10.1021/pr4006958 General statistical framework for quantitative proteomics by stable isotope labeling]'''. ''Journal of Proteome Research'', 13, 1234-1247.&amp;lt;/ref&amp;gt;, using the following steps:&lt;br /&gt;
&lt;br /&gt;
* calibration of spectra&lt;br /&gt;
* integrating from spectra to peptides&lt;br /&gt;
* integrating from peptides to proteins&lt;br /&gt;
* quantifying proteins&lt;br /&gt;
&lt;br /&gt;
This test makes use of four programs: '''[[Aljamia]]''', '''[[Klibrate]]''', '''[[SanXoT]]''', and '''[[SanXoTSieve]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this unit test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) download the Windows executables, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]'''. Unzip the whole content into a folder specific to the program.&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test1.zip SanXoT_test1.zip]'''. Unzip to a working folder. You should have two folders, one with the input for this test, and another with the expected results. In the former, you will find two text files:&lt;br /&gt;
&lt;br /&gt;
::* '''''commands_test1.bat''''', a Windows batch file with the command lines to run this sample analysis.&lt;br /&gt;
::* '''''170415_Marga_GBS_iTRAQ_PSMs.txt''''', a tab-separated-values text file with identifications and quantitative data from a proteomics experiment&amp;lt;ref&amp;gt;Mateos-Hernández, L., et al. (2016) '''[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5342688/ Quantitative proteomics reveals Piccolo as a candidate serological correlate of recovery from Guillain-Barré syndrome]'''. ''Oncotarget'', 7(46): 74582–74591.&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
3) get a) the path of your working folder (where you have unzipped the file '''''170415_Marga_GBS_iTRAQ_PSMs.txt'''''), b) the path of the folder where you have unzipped the Windows executables, and c) the drive letter of the latter (C:, D:, etc.); then modify the following lines in the '''''commands_test1.bat''''' file accordingly:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test1.bat''''' file: copy and paste (or drag and drop) the whole commands_test1.bat text into a [https://en.wikipedia.org/wiki/Cmd.exe command prompt window] (you can also just double-click the bat file; however, if an error arises, the CMD window created will close immediately, so the text of the error will not be available).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. It should take 30-60 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) compare your results with the data in the folder SanXoT_test1_results, inside SanXoT_test1.zip (for example using comparison software such as [https://en.wikipedia.org/wiki/Beyond_Compare Beyond Compare]). Only the files ending in *_log.txt and *_infoFile.txt should show differences (and only due to the different file paths and timestamps; all other features, such as variances, Levenberg-Marquardt steps, and options, should be identical).&lt;br /&gt;
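The folder comparison in step 6 can also be scripted; here is a minimal Python sketch (the function and folder arguments are our own, not part of SanXoT) that skips the files expected to differ:&lt;br /&gt;

```python
import filecmp
import os

def compare_results(got_dir, expected_dir):
    """Compare two result folders byte by byte, ignoring files that are
    expected to differ (logs and info files containing paths/timestamps)."""
    differing = []
    for name in sorted(os.listdir(expected_dir)):
        if name.endswith("_log.txt") or name.endswith("_infoFile.txt"):
            continue  # these legitimately differ (file paths, timestamps)
        got = os.path.join(got_dir, name)
        expected = os.path.join(expected_dir, name)
        if not os.path.exists(got) or not filecmp.cmp(got, expected, shallow=False):
            differing.append(name)
    return differing
```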
&lt;br /&gt;
==Test 2: Experiment merging==&lt;br /&gt;
&lt;br /&gt;
Protein quantifications from different technical or biological replicates can be merged into experiment-independent quantifications, yielding the biological and technical variance. In this test we:&lt;br /&gt;
&lt;br /&gt;
* use the resulting data from Test 1&lt;br /&gt;
* perform the fundamental workflow for another parallel experiment&lt;br /&gt;
* merge both experiments&lt;br /&gt;
&lt;br /&gt;
This test makes use of four programs: '''[[Klibrate]]''', '''[[SanXoT]]''', '''[[SanXoTSieve]]''', and '''[[Cardenio]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this unit test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) use the windows executables unzipped in step 1 of Test 1 ('''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]''').&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test2.zip SanXoT_test2.zip]'''. This contains:&lt;br /&gt;
&lt;br /&gt;
::* all the files generated in Test 1&lt;br /&gt;
::* the '''''tagFile.txt''''' (for '''[[Cardenio]]''')&lt;br /&gt;
::* the '''''commands_test2.bat'''''.&lt;br /&gt;
&lt;br /&gt;
3) remember to change the following lines in the '''''commands_test2.bat''''' file, as in step 3 of Test 1:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test2&amp;quot;&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test2.bat''''' file (do the same as in step 4 in Test 1).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. As in Test 1, this test should take 30-60 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) as in Test 1, the only files that are not identical should be the merge_logFile.txt from Cardenio, and the *_infoFile.txt from SanXoT, SanXoTSieve and Klibrate.&lt;br /&gt;
&lt;br /&gt;
==Test 3: Systems biology==&lt;br /&gt;
&lt;br /&gt;
The '''[[SanXoT software package]]''' can be used to perform systems-biology analyses using the SBT (''Systems Biology Triangle''), as described&amp;lt;ref&amp;gt;García-Marqués, F., Trevisan-Herraz, M., Martínez-Martínez, S., Camafeita, E., Jorge, I., Lopez, J.A., Méndez-Barbero, N., Méndez-Ferrer, S., del Pozo, M.A., Ibáñez, B., Andrés, V., Sánchez-Madrid, F., Redondo, J.M., Bonzon-Kulichenko, E. and Vázquez, J. (2016) '''[http://www.mcponline.org/content/15/5/1740.long A novel systems-biology algorithm for the analysis of coordinated protein responses using quantitative proteomics]'''. ''Molecular &amp;amp; Cellular Proteomics'', 15(5):1740-60.&amp;lt;/ref&amp;gt;. In this test we:&lt;br /&gt;
&lt;br /&gt;
* generate a relations file using data from the '''[https://en.wikipedia.org/wiki/DAVID DAVID bioinformatics resource]''' (you can download the data from them yourself, but for your convenience we have already included it)&lt;br /&gt;
* use the resulting data from Test 2&lt;br /&gt;
* calculate a category-level fold-change&lt;br /&gt;
* perform the systems-biology analysis.&lt;br /&gt;
&lt;br /&gt;
This test makes use of seven programs: '''[[Camacho]]''', '''[[SanXoT]]''', '''[[SanXoTSieve]]''', '''[[SanXoTSqueezer]]''', '''[[Sanson]]''', '''[[SanXoTGauss]]''', and '''[[Coordinometer]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) use the windows executables unzipped in step 1 of Test 1 ('''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]''').&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test3.zip SanXoT_test3.zip]'''. This contains:&lt;br /&gt;
&lt;br /&gt;
::* all the files generated in Test 1 and Test 2&lt;br /&gt;
::* the tab-separated text file '''''SB_Homo_19dic-2017.txt''''' (a set of data downloaded from DAVID)&lt;br /&gt;
::* the '''''commands_test3.bat'''''.&lt;br /&gt;
&lt;br /&gt;
3) remember to change the following lines in the '''''commands_test3.bat''''' file, as in step 3 of Test 1 and Test 2:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test3&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Additionally, if you want to obtain the similarity graph from the category clustering algorithm ('''[[Sanson]]'''), you will need:&lt;br /&gt;
&lt;br /&gt;
* to have '''[https://en.wikipedia.org/wiki/Graphviz Graphviz]''' installed (external open-source software, which includes the interpreter for the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) DOT language], used by Sanson to generate the graphs for the clusters)&lt;br /&gt;
* to modify, in the program folder, the ''dot.ini'' file, changing the following line to include the path to the ''bin'' folder of Graphviz, which may differ for each user:&lt;br /&gt;
&lt;br /&gt;
 dotlocation = C:\Program Files (x86)\Graphviz2.36\bin&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test3.bat''''' file (do the same as in step 4 in Test 1).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. This test should take 10-20 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) as in Test 1 and Test 2, the only files that are not identical should be the *_logFile.txt from Camacho, SanXoTSqueezer, Sanson, SanXoTGauss and the Coordinometer, as well as the *_infoFile.txt from SanXoT and SanXoTSieve. Additionally, if you use a different version of Graphviz (this test used Graphviz 2.36), you might find differences in the file sanson_simGraph.pdf (a PDF containing the final graph with the category clusters).&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:SanXoT software package]]&lt;br /&gt;
[[Category:unit tests]]&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=Exploring_SanXoT_features&amp;diff=641</id>
		<title>Exploring SanXoT features</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=Exploring_SanXoT_features&amp;diff=641"/>
				<updated>2018-08-17T10:36:05Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Here we will look at the details of three common workflows that use the SanXoT software package.&lt;br /&gt;
&lt;br /&gt;
For tests check '''[[Unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
For a general overview on how to use SanXoT and how it works, see '''[[How to use SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== Test 1: The fundamental workflow ==&lt;br /&gt;
&lt;br /&gt;
We describe here the '''[[fundamental workflow]]''', a workflow similar to the statistics done by '''[[QuiXoT]]''' using the WSPP (''Weighted Spectrum, Peptide and Protein'') statistical model&amp;lt;ref&amp;gt;Navarro, P., Trevisan-Herraz, M., Bonzon-Kulichenko, E., Núñez, E., Martínez-Acedo, P., Pérez-Hernández, D., Jorge, I., Mesa, R., Calvo, E., Carrascal, M., Hernáez, M.L., García, F., Bárcena, J.A., Ashman, K., Abian, J., Gil, C., Redondo, J.M. and Vázquez, J. (2014) '''[https://pubs.acs.org/doi/abs/10.1021/pr4006958 General statistical framework for quantitative proteomics by stable isotope labeling]'''. ''Journal of proteome research'', 13, 1234-1247.&amp;lt;/ref&amp;gt;. The following text is also available as a [https://en.wikipedia.org/wiki/Batch_file batch file] in '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test1.zip SanXoT_test1.zip]''':&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test1.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 1, FUNDAMENTAL WORKFLOW (SPECTRUM, PEPTIDE, PROTEIN)&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
 set startingFile=170415_Marga_GBS_iTRAQ_PSMs.txt&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM create the relation files&lt;br /&gt;
aljamia.exe -asprels -p%workingFolder% -x%startingFile% -i&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -j&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot;&lt;br /&gt;
aljamia.exe -apqrels -p%workingFolder% -x%startingFile% -i&amp;quot;'[Master Protein Accessions]'.split(';')[0]&amp;quot; -j&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;i&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM create the data file at the scan level&lt;br /&gt;
aljamia.exe -asdata114vs118 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([114]/[118], 2)&amp;quot; -k&amp;quot;max([114], [118])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
aljamia.exe -asdata115vs119 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([115]/[119], 2)&amp;quot; -k&amp;quot;max([115], [119])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM The following tab-separated text files (despite their extensions) are generated:&lt;br /&gt;
REM sprels_table.tsv, with two columns, having scan-to-peptide identifiers (rawfile-scanNumber-charge for scans, sequence for peptides)&lt;br /&gt;
REM pqrels_table.tsv, with two columns, having peptide-to-protein identifiers (text of their fasta header as protein identifier)&lt;br /&gt;
REM sdata_table.tsv, with three columns having {scan identifier, log2(foldchange), weight}&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart114vs118 -p%workingFolder% -dsdata114vs118_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxot.exe -asp114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -Vsp114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp114vs118_noOuts -p%workingFolder% -dstart114vs118_calibrated.tsv -rspouts114vs118_tagged.tsv -Vsp114vs118_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein&lt;br /&gt;
sanxot.exe -apq114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq114vs118_noOuts -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqouts114vs118_tagged.tsv -fg -Vpq114vs118_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa114vs118 -p%workingFolder% -dpq114vs118_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_1:_The_fundamental_workflow|Test 1]]'''.&lt;br /&gt;
&lt;br /&gt;
=== Modifications needed ===&lt;br /&gt;
&lt;br /&gt;
The only lines that need to be modified by the user are these (to match the user's file locations):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=6&amp;gt;&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
set startingFile=170415_Marga_GBS_iTRAQ_PSMs.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the first operations, we parse the data from the starting file (170415_Marga_GBS_iTRAQ_PSMs.txt) using '''[[Aljamia]]''' to generate the relations files. In this case, the input file is a tab-separated text file quantified using [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer 2.1], but any other tab-separated text file that arranges information in columns with headers can be used as input (for example [https://en.wikipedia.org/wiki/SQLite SQLite] query results from Proteome Discoverer's MSF files, or results from [https://en.wikipedia.org/wiki/Mascot_(software) Mascot]).&lt;br /&gt;
&lt;br /&gt;
=== Parsing the data ===&lt;br /&gt;
&lt;br /&gt;
Let us check the first '''[[Aljamia]]''' operation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=15&amp;gt;&lt;br /&gt;
aljamia.exe -asprels -p%workingFolder% -x%startingFile% -i&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -j&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a common workflow, the SanXoT software package can generate many output files. To make them easier to track, by default all the files generated in a single step share the same prefix, while each type of file shares a common suffix. The prefix is set by the ''-a'' argument (it is good practice to ''always'' include this argument). For example, in this case two files are generated: ''sprels_table.tsv'' (the relations table matching '''s'''cans and '''p'''eptides) and ''sprels_log.txt'' (a log file). Unless overridden (check '''[[Aljamia]]''''s help for more information), the filename of the generated relations table will always end in '''_table.tsv'''.&lt;br /&gt;
&lt;br /&gt;
The option -p sets the working folder: all output and input files are assumed to be in that folder unless a full path is specified. The generated table will include only two columns: the first (-i) holds a non-ambiguous identifier for the higher level, the peptides (in this case, the concatenation of the &amp;quot;Annotated Sequence&amp;quot; and &amp;quot;Modifications&amp;quot; columns), and the second (-j) merges the data from the &amp;quot;Spectrum File&amp;quot;, &amp;quot;First Scan&amp;quot; and &amp;quot;Charge&amp;quot; columns (a unique identifier for the scans, the lower level). Note that the column headers are case-sensitive.&lt;br /&gt;
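For illustration only, the scan-to-peptide relations table described above could be reproduced with a few lines of Python. This is a sketch of the idea, not Aljamia's actual implementation; the function name and file paths are our own:&lt;br /&gt;

```python
import csv

def build_scan_peptide_relations(in_path, out_path):
    """Write a two-column relations table: peptide (higher level) and
    scan (lower level) identifiers, keeping only confident PSMs."""
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.DictReader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t")
        for row in reader:
            # mimic the -f confidence filter
            if row["Confidence"] not in ("High", "Medium"):
                continue
            peptide = f'{row["Annotated Sequence"]}_{row["Modifications"]}'
            scan = f'{row["Spectrum File"]}-{row["First Scan"]}-{row["Charge"]}'
            writer.writerow([peptide, scan])
```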
&lt;br /&gt;
The operation to generate the file for the relations between proteins and peptides is similar:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=16&amp;gt;&lt;br /&gt;
aljamia.exe -apqrels -p%workingFolder% -x%startingFile% -i&amp;quot;'[Master Protein Accessions]'.split(';')[0]&amp;quot; -j&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;i&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notice that the prefix here is ''pqrels'' (some researchers use '''q''' for proteins, to differentiate them from '''p'''eptides); now the peptide identifiers (&amp;quot;Annotated Sequence&amp;quot; + &amp;quot;Modifications&amp;quot;) are in the second column (because they are now the lower level), while the proteins are in the first column (as they are now the higher level). We use UniProt accession numbers as protein identifiers. Proteome Discoverer lists several proteins where each peptide is found, but in this case we keep only the first one: we use a Python-style expression to split the string in the field &amp;quot;Master Protein Accessions&amp;quot; on &amp;quot;;&amp;quot;, and then take just the first element (hence the &amp;quot;[0]&amp;quot;). To tell '''[[Aljamia]]''' that this is an operation, and not just text to include, we add the argument -A&amp;quot;i&amp;quot; (to indicate that only column &amp;quot;i&amp;quot; contains operations). We have also included a filter to keep only the identifications that, according to Proteome Discoverer, have confidence either &amp;quot;Medium&amp;quot; or &amp;quot;High&amp;quot; (thereby excluding poor identifications).&lt;br /&gt;
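As a quick illustration of that expression (with a made-up accession string; these are not real data):&lt;br /&gt;

```python
# Hypothetical content of the [Master Protein Accessions] field:
field = "P12345;Q67890;A0A0A0"
# Splitting on ";" and keeping the first element, as in the -i expression:
first_accession = field.split(';')[0]
```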
&lt;br /&gt;
The next two lines parse the information from the same input file, ''170415_Marga_GBS_iTRAQ_PSMs.txt'', to generate two data files, one for the 114 vs 118 iTRAQ comparison and the other for 115 vs 119 (as you can see in the -a option for the prefix):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=19&amp;gt;&lt;br /&gt;
aljamia.exe -asdata114vs118 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([114]/[118], 2)&amp;quot; -k&amp;quot;max([114], [118])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
aljamia.exe -asdata115vs119 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([115]/[119], 2)&amp;quot; -k&amp;quot;max([115], [119])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Data files in SanXoT include a first column (-i) with a unique text identifier (in this case, the scan identifier mentioned above), a second column (-j) with the log2(A/B) ratio (written as a Python-style expression), and a third column (-k) with the weight of the scan (for iTRAQ we simply take the intensity of the more intense peak of the pair). As we did before when splitting the column with the protein accession numbers, we use -A&amp;quot;jk&amp;quot; to indicate that operations are to be performed on the second and third columns (but not the first, which is treated as text).&lt;br /&gt;
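The three columns can be sketched with a small Python helper of our own, mirroring the -i, -j and -k expressions above (not actual Aljamia code):&lt;br /&gt;

```python
import math

def scan_data_row(spectrum_file, first_scan, charge, a, b):
    """Build one data-file row: scan identifier (-i), log2(A/B) ratio (-j)
    and weight (-k, the intensity of the more intense reporter)."""
    identifier = f"{spectrum_file}-{first_scan}-{charge}"
    log2_ratio = math.log(a / b, 2)
    weight = max(a, b)
    return identifier, log2_ratio, weight
```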
&lt;br /&gt;
=== Calibrating the spectra ===&lt;br /&gt;
&lt;br /&gt;
The first step to integrate the data is calibrating the weights assigned to each spectrum (see '''[[How to use SanXoT#Calibrating the data|How to use SanXoT]]''' for more information on why this is important). This task is performed by the module '''[[Klibrate]]''':&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=27&amp;gt;&lt;br /&gt;
klibrate.exe -astart114vs118 -p%workingFolder% -dsdata114vs118_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As in all SanXoT programs, there is a prefix assigned to the operation (-a) and the default working folder is specified (-p). Additionally, -d and -r set the file names of the data and relations files generated by Aljamia (sdata114vs118_table.tsv and sprels_table.tsv, respectively). The -g option here tells the program not to stop to show the sigmoid graph with the comparison to the standard normal distribution.&lt;br /&gt;
&lt;br /&gt;
=== Integration from spectrum to peptide ===&lt;br /&gt;
&lt;br /&gt;
We are ready for the first integration (using the GIA, or ''Generic Integration Algorithm''), which is done by '''[[SanXoT]]''' with the help of '''[[SanXoTSieve]]'''.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=28&amp;gt;&lt;br /&gt;
sanxot.exe -asp114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -Vsp114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp114vs118_noOuts -p%workingFolder% -dstart114vs118_calibrated.tsv -rspouts114vs118_tagged.tsv -Vsp114vs118_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(The meaning of the -a, -p, -d, -r and -g options has already been described.)&lt;br /&gt;
&lt;br /&gt;
In the first of these three commands, SanXoT integrates the lower-level data (in this case, spectra) into a higher level (here, peptides). When doing so, SanXoT calculates the variance of the lower level with respect to the higher one, i.e., how the fold-changes of the different spectra are distributed when assigned to their peptides. The result is shown both in the console and in the [prefix]_infoFile.txt file (sp114vs118_infoFile.txt).&lt;br /&gt;
&lt;br /&gt;
Once the variance is known, SanXoTSieve tags the outlier spectra of each peptide (if any) by adding the tag &amp;quot;out&amp;quot; in a third column of the relations file, which is saved as a new file (to avoid overwriting the original). To provide the variance (essential to decide which spectra are outliers), SanXoTSieve takes the sp114vs118_infoFile.txt as input (option -V), reading it from the line &amp;quot;Variance = ...&amp;quot;; alternatively, the variance can be set directly with the lowercase option -v (for example, -v0.02) if you want to use a specific, known variance.&lt;br /&gt;
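Conceptually, the tagging only appends a third column to the relations; a simplified sketch of the idea follows (the FDR-based outlier decision itself is omitted, and the function name is ours, not SanXoTSieve's):&lt;br /&gt;

```python
def tag_outlier_relations(relations, outlier_scans):
    """Append a third, tag column to (peptide, scan) relations,
    marking with 'out' the scans flagged as outliers."""
    return [
        (peptide, scan, "out" if scan in outlier_scans else "")
        for peptide, scan in relations
    ]
```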
&lt;br /&gt;
For SanXoTSieve, -f defines the FDR below which a spectrum is considered an outlier. With -f0.01, a spectrum is tagged as an outlier when its quantification deviates from that of the other spectra of the same peptide with an FDR below 1%.&lt;br /&gt;
&lt;br /&gt;
Once SanXoTSieve has produced the relations file with the tagged outliers, SanXoT finally integrates the data; in this case, instead of calculating the variance, it uses the variance calculated when the outliers were still included (option -V to provide the file, or -v to provide it as a number). We say in this case that ''the variance is forced'', which is indicated using the -f parameter (note that -f has a different meaning here than in SanXoTSieve).&lt;br /&gt;
&lt;br /&gt;
The option --tags=&amp;quot;!out&amp;quot; means that the relations tagged as &amp;quot;out&amp;quot; (by SanXoTSieve) will be excluded (with the same effect as if they had been removed). More tags can be combined (see the specific help for more information).&lt;br /&gt;
&lt;br /&gt;
=== Integration from peptide to protein ===&lt;br /&gt;
&lt;br /&gt;
The integration from peptides to proteins is very similar to the previous one:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=33&amp;gt;&lt;br /&gt;
sanxot.exe -apq114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq114vs118_noOuts -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqouts114vs118_tagged.tsv -fg -Vpq114vs118_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The only differences are: 1) instead of taking as input the output file of Klibrate (which contained the calibrated spectrum information), it takes the output of the previous integration (sp114vs118_noOuts_higherLevel.tsv); and 2) the relations file (pqrels_table.tsv) contains the protein identifiers in the first column (here we use the accession number, although any other unique text string would do; in other cases the whole FASTA header can be used, which makes the proteins more human-readable) and the peptide identifiers in the second (the amino acid sequence with the post-translational modifications).&lt;br /&gt;
&lt;br /&gt;
The result will be a data file containing protein identifiers, protein-level fold-changes and protein weights, with no further reference to peptides or spectra.&lt;br /&gt;
&lt;br /&gt;
=== Getting the quantification of proteins ===&lt;br /&gt;
[[File:Qa114vs118 outGraph.png|thumb|300px|Output file of SanXoT (''qa114vs118_outGraph.png'') in Test 1, line 38, showing the sigmoid with protein expression changes (blue) compared to the theoretical sigmoid of the standard normal distribution.]]&lt;br /&gt;
&lt;br /&gt;
In the last step we compare each protein to the set of proteins as a whole. As you can see, the output data file from the previous integration is used as input (pq114vs118_noOuts_higherLevel.tsv); there is no relations file here, as in this case we are not integrating the proteins into different groups (we will see that in Test 2, for merging experiments, and Test 3, for systems biology).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=39&amp;gt;&lt;br /&gt;
sanxot.exe -aqa114vs118 -p%workingFolder% -dpq114vs118_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of providing a relations file, we include the option -C (for ''confluence''), which has the same effect as a relations file whose first column is a dummy identifier filled with a &amp;quot;1&amp;quot; in every row, and whose second column holds each protein identifier.&lt;br /&gt;
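The effect of -C can be illustrated with a tiny sketch (our own helper, not SanXoT code):&lt;br /&gt;

```python
def confluence_relations(protein_ids):
    """Emulate the -C option: a relations table with a single dummy
    higher-level node ('1') that contains every protein."""
    return [("1", protein) for protein in protein_ids]
```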
&lt;br /&gt;
The output will include a higher-level file (qa114vs118_higherLevel.tsv, where '''q''' means &amp;quot;protein&amp;quot; and '''a''' means &amp;quot;all&amp;quot;) with a single row in which the fold-change is the weighted arithmetic mean of all proteins (i.e. the grand mean of the whole set). You can see the protein changes in the file qa114vs118_outStats.tsv, where the FDR of the changing proteins is presented alongside their fold-change (Xinf), normalised fold-change (Z) and grand mean (Xsup).&lt;br /&gt;
&lt;br /&gt;
== Test 2: Experiment merging ==&lt;br /&gt;
&lt;br /&gt;
Here we perform a very simple experiment merge, combining two experiments with comparable information (i.e. if the first experiment was &amp;quot;treated vs control&amp;quot;, the second can be a biological replicate, such as another &amp;quot;treated vs control&amp;quot; with a different treated patient and a different control).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test2.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 2, EXPERIMENT MERGING&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test2&amp;quot;&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM now, same for 115vs119&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart115vs119 -p%workingFolder% -dsdata115vs119_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxot.exe -asp115vs119 -p%workingFolder% -dstart115vs119_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts115vs119 -p%workingFolder% -dstart115vs119_calibrated.tsv -rsprels_table.tsv -Vsp115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp115vs119_noOuts -p%workingFolder% -dstart115vs119_calibrated.tsv -rspouts115vs119_tagged.tsv -Vsp115vs119_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein for 115vs119&lt;br /&gt;
sanxot.exe -apq115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq115vs119_noOuts -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqouts115vs119_tagged.tsv -fg -Vpq115vs119_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes for 115vs119&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa115vs119 -p%workingFolder% -dpq115vs119_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM experiment merging: 115vs119 + 114vs118&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
cardenio.exe -amerge -p%workingFolder% -ttagFile.txt&lt;br /&gt;
sanxot.exe -amerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -g&lt;br /&gt;
sanxotsieve.exe -aoutsMerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -Vmerged_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -amerged_noOuts -p%workingFolder% -dmerge_newData.txt -routsMerged_tagged.tsv -fg -Vmerged_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqaMerged -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_2:_Experiment_merging|Test 2]]'''.&lt;br /&gt;
&lt;br /&gt;
=== A parallel analysis ===&lt;br /&gt;
&lt;br /&gt;
In this example we merge just two experiments for simplicity, but it is possible to merge even hundreds of them at once (such as hundreds of technical or biological replicates).&lt;br /&gt;
&lt;br /&gt;
Command lines 6-8 have already been discussed; they must be adapted to the paths on the user's computer. Lines 14-29 do the same as discussed in the previous section, but for the 115 vs 119 comparison:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=14&amp;gt;&lt;br /&gt;
REM now, same for 115vs119&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart115vs119 -p%workingFolder% -dsdata115vs119_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxot.exe -asp115vs119 -p%workingFolder% -dstart115vs119_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts115vs119 -p%workingFolder% -dstart115vs119_calibrated.tsv -rsprels_table.tsv -Vsp115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp115vs119_noOuts -p%workingFolder% -dstart115vs119_calibrated.tsv -rspouts115vs119_tagged.tsv -Vsp115vs119_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein for 115vs119&lt;br /&gt;
sanxot.exe -apq115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq115vs119_noOuts -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqouts115vs119_tagged.tsv -fg -Vpq115vs119_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes for 115vs119&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa115vs119 -p%workingFolder% -dpq115vs119_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Generating the files for the experiment integration ===&lt;br /&gt;
&lt;br /&gt;
The module for merging experiments, '''[[Cardenio]]''', actually does a very simple thing: it takes the protein identifiers, concatenates a suffix for each experiment (specified in an external file called tagFile) and generates both the data and relations files, ready for SanXoT to integrate in the usual way:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=36&amp;gt;&lt;br /&gt;
cardenio.exe -amerge -p%workingFolder% -ttagFile.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We have already discussed the -a and -p options across the SanXoT software package. Cardenio also needs -t, which points to a tab-separated file with this structure:&lt;br /&gt;
&lt;br /&gt;
 tag		file&lt;br /&gt;
 114vs118	qa114vs118_lowerNormV.tsv&lt;br /&gt;
 115vs119	qa115vs119_lowerNormV.tsv&lt;br /&gt;
&lt;br /&gt;
The &amp;quot;tag&amp;quot; column contains the suffix that will be appended to each protein (or other) identifier, while the &amp;quot;file&amp;quot; column contains the filenames of the data files. These files are similar to, but distinct from, the _higherLevel files: they are their centred versions. For example, the one with prefix qa114vs118 was generated in the last step of the 114vs118 integration, and the suffix _lowerNormV tells us it is a data file where the fold-changes have been centred by subtracting the grand mean, X'inf, while the weights have been kept unchanged. Files with suffix _lowerNorm'''W''' have identical first and second columns, but the third column, for the weights, has been modified to include the general variance of the protein level, as described in the WSPP statistical model.&lt;br /&gt;
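To make the _lowerNormV centring concrete, here is a minimal Python sketch (a hypothetical illustration assuming X'inf is the weighted grand mean of the fold-changes; the real SanXoT implementation may differ in details such as how the grand mean is estimated):&lt;br /&gt;

```python
# Sketch of the _lowerNormV centring: subtract the grand mean (X'inf)
# from every fold-change, leaving the weights untouched.
# Hypothetical rows: (identifier, log2 fold-change, weight).
rows = [
    ("protId1", 0.52, 20.8),
    ("protId2", 0.13, 10.4),
    ("protId3", 0.33, 13.7),
]

# Weighted grand mean of the fold-changes (assumption: X'inf is the
# weight-averaged mean; see the WSPP model paper for the exact form).
grand_mean = sum(w * x for _, x, w in rows) / sum(w for _, x, w in rows)

# Centred values keep the original weights (third column unchanged).
centred = [(pid, x - grand_mean, w) for pid, x, w in rows]
```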
&lt;br /&gt;
These identifiers are rearranged in the relations and data files as in the following example:&lt;br /&gt;
&lt;br /&gt;
In the original _lowerNormV format, the first file looks like this:&lt;br /&gt;
&lt;br /&gt;
 idinf	X'inf		Vinf&lt;br /&gt;
 protId1	0.420637682685	20.8442070633&lt;br /&gt;
 protId2	0.0309318252993	10.405882717&lt;br /&gt;
 protId3	0.235807597358	13.6742537388&lt;br /&gt;
 protId4	-0.939886653794	10.3887308051&lt;br /&gt;
 protId5	0.136998763785	111.910628299&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
And the second file looks like:&lt;br /&gt;
&lt;br /&gt;
 idinf	X'inf		Vinf&lt;br /&gt;
 protId1	0.12726383076	9.37044627116&lt;br /&gt;
 protId2	0.274864264955	4.64481829573&lt;br /&gt;
 protId3	-0.261668207812	5.73886230845&lt;br /&gt;
 protId4	0.40863124663	4.57552314502&lt;br /&gt;
 protId5	0.36753759182	46.0017027453&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
In the output of Cardenio, data is rearranged in merge_newData.txt:&lt;br /&gt;
&lt;br /&gt;
 idinf			X'inf		Vinf&lt;br /&gt;
 protId1'''_114vs118'''	0.420637682685	20.8442070633&lt;br /&gt;
 protId2'''_114vs118'''	0.0309318252993	10.405882717&lt;br /&gt;
 protId3'''_114vs118'''	0.235807597358	13.6742537388&lt;br /&gt;
 protId4'''_114vs118'''	-0.939886653794	10.3887308051&lt;br /&gt;
 protId5'''_114vs118'''	0.136998763785	111.910628299&lt;br /&gt;
 ...&lt;br /&gt;
 protId1'''_115vs119'''	0.12726383076	9.37044627116&lt;br /&gt;
 protId2'''_115vs119'''	0.274864264955	4.64481829573&lt;br /&gt;
 protId3'''_115vs119'''	-0.261668207812	5.73886230845&lt;br /&gt;
 protId4'''_115vs119'''	0.40863124663	4.57552314502&lt;br /&gt;
 protId5'''_115vs119'''	0.36753759182	46.0017027453&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
Along with the corresponding relations file merge_newRels.txt:&lt;br /&gt;
&lt;br /&gt;
 idsup	idinf&lt;br /&gt;
 protId1	protId1'''_114vs118'''&lt;br /&gt;
 protId2	protId2'''_114vs118'''&lt;br /&gt;
 protId3	protId3'''_114vs118'''&lt;br /&gt;
 protId4	protId4'''_114vs118'''&lt;br /&gt;
 protId5	protId5'''_114vs118'''&lt;br /&gt;
 ...&lt;br /&gt;
 protId1	protId1'''_115vs119'''&lt;br /&gt;
 protId2	protId2'''_115vs119'''&lt;br /&gt;
 protId3	protId3'''_115vs119'''&lt;br /&gt;
 protId4	protId4'''_115vs119'''&lt;br /&gt;
 protId5	protId5'''_115vs119'''&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
As can be seen, in the relations file the lower level (second column) links each protein to its experiment (via the suffix, matching the corresponding data in merge_newData.txt), while at the higher level (first column) proteins are independent of the experiment. This means that, after integrating, we obtain only experiment-independent, protein-level information (the information of each protein in each experiment averaged with its corresponding weight).&lt;br /&gt;
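The rearrangement described above can be sketched in a few lines of Python (a toy illustration with made-up identifiers; the real Cardenio module reads and writes tab-separated files):&lt;br /&gt;

```python
# Sketch of what Cardenio does: append the experiment tag to every
# protein identifier, pool the data rows, and emit relations that map
# each suffixed (experiment-dependent) id back to the plain protein id.
experiments = {
    "114vs118": [("protId1", 0.42, 20.8), ("protId2", 0.03, 10.4)],
    "115vs119": [("protId1", 0.13, 9.4), ("protId2", 0.27, 4.6)],
}

merged_data = []   # rows of merge_newData.txt: (idinf, X'inf, Vinf)
merged_rels = []   # rows of merge_newRels.txt: (idsup, idinf)
for tag, rows in experiments.items():
    for prot_id, x, v in rows:
        suffixed = prot_id + "_" + tag
        merged_data.append((suffixed, x, v))
        merged_rels.append((prot_id, suffixed))
```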
&lt;br /&gt;
=== Last step for experiment integration ===&lt;br /&gt;
&lt;br /&gt;
The last lines perform the integration of the two experiments taking the data and relations files from Cardenio (merge_newData.txt and merge_newRels.txt), as seen in the previous section.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=37&amp;gt;&lt;br /&gt;
sanxot.exe -amerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -g&lt;br /&gt;
sanxotsieve.exe -aoutsMerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -Vmerged_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -amerged_noOuts -p%workingFolder% -dmerge_newData.txt -routsMerged_tagged.tsv -fg -Vmerged_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqaMerged -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The approach here is the same as for the integration from spectrum to peptide or from peptide to protein (but in this case we integrate from &amp;quot;experiment-dependent protein data&amp;quot; to &amp;quot;experiment-independent protein data&amp;quot;):&lt;br /&gt;
:* the first line calculates the variance (storing it in the merged_infoFile.txt file)&lt;br /&gt;
:* then SanXoTSieve tags the outliers (generating the outsMerged_tagged.tsv file with the same relations as merge_newRels.txt, but including the tag &amp;quot;out&amp;quot; in the third column),&lt;br /&gt;
:* the third line integrates the proteins using the relations file and the variance from the two previous commands;&lt;br /&gt;
:* finally, all the experiment-independent proteins are compared to each other, getting the variance (stored in merged_noOuts_infoFile.txt) and providing an FDR that can be used to filter changing proteins (in file merged_noOuts_outStats.tsv).&lt;br /&gt;
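The integration step in the third line can be illustrated with a minimal sketch, assuming the higher-level value is simply the weight-averaged mean of the lower-level values (the actual WSPP integration also re-estimates variances and weights, so this shows only the core idea):&lt;br /&gt;

```python
# Minimal sketch of integrating experiment-dependent protein values
# into one experiment-independent value per protein: a weighted
# average, where each measurement contributes according to its weight.
# Hypothetical values for one protein in two experiments.
measurements = [
    (0.42, 20.8),   # (X'inf, weight) in experiment 114vs118
    (0.13, 9.4),    # (X'inf, weight) in experiment 115vs119
]

total_weight = sum(w for _, w in measurements)
integrated = sum(x * w for x, w in measurements) / total_weight
```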
&lt;br /&gt;
The last step will also be used in the Systems Biology Triangle method (SBT)&amp;lt;ref&amp;gt;García-Marqués, F., Trevisan-Herraz, M., Martínez-Martínez, S., Camafeita, E., Jorge, I., Lopez, J.A., Méndez-Barbero, N., Méndez-Ferrer, S., del Pozo, M.A., Ibáñez, B., Andrés, V., Sánchez-Madrid, F., Redondo, J.M., Bonzon-Kulichenko, E. and Vázquez, J. (2016) '''[http://www.mcponline.org/content/15/5/1740.long A novel systems-biology algorithm for the analysis of coordinated protein responses using quantitative proteomics]'''. ''Molecular &amp;amp; Cellular Proteomics''.&amp;lt;/ref&amp;gt; which we will see now (the &amp;quot;protein-to-all&amp;quot; operation is the &amp;quot;bottom side&amp;quot; of the triangle).&lt;br /&gt;
&lt;br /&gt;
== Test 3: Systems biology ==&lt;br /&gt;
&lt;br /&gt;
In this final test we perform a systems biology analysis of the previously merged experiment using the Systems Biology Triangle method. For that, we start with the files generated in Test 2, plus a tab-separated table directly downloaded from the '''[https://en.wikipedia.org/wiki/DAVID DAVID bioinformatics resource]''' (the file ''SB_Homo_19dic-2017.txt'').&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test3.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 3, SYSTEMS BIOLOGY&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test3&amp;quot;&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM conversion of ontological data from DAVID&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
camacho.exe -adbcats1 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 -c5 -fBioCarta_&lt;br /&gt;
camacho.exe -adbcats2 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 -x&amp;quot;dbcats1_rels.tsv&amp;quot; -c6 -fGOBP_&lt;br /&gt;
camacho.exe -adbcats3 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 -x&amp;quot;dbcats2_rels.tsv&amp;quot; -c7 -fGOCC_&lt;br /&gt;
camacho.exe -adbcats4 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 -x&amp;quot;dbcats3_rels.tsv&amp;quot; -c8 -fGOMF_&lt;br /&gt;
camacho.exe -adbcats5 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 -x&amp;quot;dbcats4_rels.tsv&amp;quot; -c9 -fKEGG_&lt;br /&gt;
camacho.exe -adbcats -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 -x&amp;quot;dbcats5_rels.tsv&amp;quot; -c10 -fPIR_&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM prot to cat&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqc -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -rdbcats_rels.tsv -g&lt;br /&gt;
sanxotsieve.exe -aoutsqc -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -rdbcats_rels.tsv -Vqc_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -aqc_noOuts -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -routsqc_tagged.tsv -fg -Vqc_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM cat to all&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aca -p%workingFolder% -dqc_noOuts_higherLevel.tsv -Cgf -v0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
sanxotsqueezer.exe -alist -p%workingFolder% -lqc_noOuts_outStats.tsv -uca_outStats.tsv -f0.05 -n3&lt;br /&gt;
sanson.exe -asanson -p%workingFolder% -rdbcats_rels.tsv -clist_outList.tsv -zqaMerged_outStats.tsv -gpdf&lt;br /&gt;
sanxotgauss.exe -agauss -p%workingFolder% -rdbcats_rels.tsv -clist_outList.tsv -zqaMerged_outStats.tsv -g&lt;br /&gt;
sanxotgauss.exe -agauss_noLegend -p%workingFolder% -rdbcats_rels.tsv -clist_outList.tsv -zqaMerged_outStats.tsv -gk&lt;br /&gt;
coordinometer.exe -acoordination -p%workingFolder% -qqc_noOuts_outStats.tsv -cca_outStats.tsv&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_3:_Systems_biology|Test 3]]'''.&lt;br /&gt;
&lt;br /&gt;
=== Getting the protein-to-category relations file ===&lt;br /&gt;
&lt;br /&gt;
We start by generating the relations file we are going to use in the systems biology analysis. It is a simple relations file, where the first column contains a string with the identifier of the category, and the second column a string with the identifier of the protein (following the same philosophy as the tab-separated table used to indicate the PSMs in the spectrum-to-peptide table, and as the peptide-to-protein assignments).&lt;br /&gt;
&lt;br /&gt;
Note that, although we are using DAVID to obtain the ontological categories, we can provide these relations using any tool, or even create custom relations, as long as we use the format mentioned above. We can also concatenate relations (such as our custom relations) to previous relations files.&lt;br /&gt;
&lt;br /&gt;
As we said, a good way to get this information from databases such as Gene Ontology or KEGG is using DAVID as an intermediary; in this case, we use the module '''[[Camacho]]''', which is dedicated to this operation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=17&amp;gt;&lt;br /&gt;
camacho.exe -adbcats1 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -q1 -c5 -fBioCarta_&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here, as in all SanXoT modules, we use options -a and -p as usual (to define a prefix for the output files of this operation and to set the working directory, respectively). Then we give the filename of the tab-separated file downloaded from DAVID (using -d), we indicate the column number (one-based) containing the protein accession number (option -q; in this case -q1 indicates it is the first column) and the column containing the category we want to parse into the relations file (using -c; in this case we want to parse the data from BioCarta, which are in column 5), and we define a prefix for the category (as these are categories from BioCarta, we use -f to set this to &amp;quot;BioCarta_&amp;quot;, where the underscore is used as a separator to make it more human-readable). The prefix is not mandatory, but it is highly recommended when importing data from different databases, as some categories may have the same name but different protein sets; using a prefix not only avoids overlapping, but also helps track which database each category comes from.&lt;br /&gt;
&lt;br /&gt;
As a result, we get the file dbcats1_rels.tsv. Now we concatenate the relations from the databases of [http://www.geneontology.org Gene Ontology] (for BP = biological process, CC = cellular component, and MF = molecular function), [http://www.genome.jp/kegg KEGG] and [https://pir.georgetown.edu PIR]. For each of them we use Camacho, changing only the -c option (to set the column) and -f (for the different category prefix), and including the file to concatenate to (using the option -x). Other users might prefer to simply overwrite the original file but, in order to keep track of the different operations, we use a different file for each output.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=18&amp;gt;&lt;br /&gt;
camacho.exe -adbcats2 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats1_rels.tsv&amp;quot; -c6 -fGOBP_&lt;br /&gt;
camacho.exe -adbcats3 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats2_rels.tsv&amp;quot; -c7 -fGOCC_&lt;br /&gt;
camacho.exe -adbcats4 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats3_rels.tsv&amp;quot; -c8 -fGOMF_&lt;br /&gt;
camacho.exe -adbcats5 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats4_rels.tsv&amp;quot; -c9 -fKEGG_&lt;br /&gt;
camacho.exe -adbcats -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats5_rels.tsv&amp;quot; -c10 -fPIR_&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We end up with the final file dbcats_rels.tsv; this relations file can now be used for as many Systems Biology Triangle analyses as needed, without having to run Camacho again (unless we want to add databases or an updated ontology from DAVID).&lt;br /&gt;
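For illustration, a Camacho-like parse could look like the following Python sketch (the column layout, the header row and the comma-separated category lists are assumptions made for this toy example; the real DAVID export and the real module may differ):&lt;br /&gt;

```python
# Sketch of a Camacho-like parse: take a tab-separated DAVID export,
# read the accession from column q and the category list from column c
# (both one-based), prefix each category, and accumulate relations.
def parse_categories(tsv_lines, q, c, prefix):
    rels = []
    for line in tsv_lines[1:]:          # skip the header row
        fields = line.rstrip("\n").split("\t")
        accession = fields[q - 1]
        for cat in fields[c - 1].split(","):
            cat = cat.strip()
            if cat:
                rels.append((prefix + cat, accession))
    return rels

# Hypothetical five-column table with BioCarta terms in column 5.
table = [
    "acc\tcol2\tcol3\tcol4\tBioCarta",
    "P12345\t-\t-\t-\tpathwayA,pathwayB",
    "Q67890\t-\t-\t-\tpathwayA",
]
rels = parse_categories(table, q=1, c=5, prefix="BioCarta_")
```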
&lt;br /&gt;
=== From proteins to categories ===&lt;br /&gt;
&lt;br /&gt;
We start using the relations file just generated, dbcats_rels.tsv, to integrate proteins into categories. Every protein has a different weight, depending on the quality of the scans, the number of peptides detected and quantified for each protein, and perhaps even the different experiments where it has been found. As a result, the information from each protein has a different weight in the category where it has been classified (in the same way as in the previous levels). Getting the category-level information (rather than a lengthy list of quantified proteins) will help us get the big picture of what is going on in the proteome. Additionally, using the Systems Biology Triangle, we integrate the information of all the quantified proteins, not just a list of proteins whose quantification exceeds a defined threshold (as in over-representation algorithms).&lt;br /&gt;
&lt;br /&gt;
We integrate in the usual manner: first getting the variance of the proteins, then tagging outliers (proteins that behave differently from other proteins classified in the same group) and finally performing the integration with the previously calculated variance, excluding the outlier proteins:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=28&amp;gt;&lt;br /&gt;
sanxot.exe -aqc -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -rdbcats_rels.tsv -g&lt;br /&gt;
sanxotsieve.exe -aoutsqc -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -rdbcats_rels.tsv -Vqc_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -aqc_noOuts -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -routsqc_tagged.tsv -fg -Vqc_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This operation is the &amp;quot;top-left&amp;quot; side of the Systems Biology Triangle.&lt;br /&gt;
&lt;br /&gt;
The last integration, where categories are compared to each other, is the &amp;quot;top-right&amp;quot; side of the triangle (see '''[[How to use SanXoT]]''' for more information):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=34&amp;gt;&lt;br /&gt;
sanxot.exe -aca -p%workingFolder% -dqc_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As we did in line 38 of Test 1, and line 41 of Test 2, this is a confluence, meaning that we don't need a relations file, because the whole set of categories is going to be considered the same group.&lt;br /&gt;
&lt;br /&gt;
=== Filtering the most relevant categories using SanXoTSqueezer ===&lt;br /&gt;
&lt;br /&gt;
Now we have the information needed to get some details about the analysis. Sometimes we have too many categories, so the analysis becomes difficult. We could make &amp;quot;supercategories&amp;quot; using a relations file between category and supercategory, but in this case we will just filter the categories with '''[[SanXoTSqueezer]]''', keeping those with a minimum of 3 quantified proteins (using the option -n) and a maximum FDR of 5% (using the option -f). We also need the two outStats files for the protein-to-category (&amp;quot;qc&amp;quot;, using option -l) and the category-to-all (&amp;quot;ca&amp;quot;, using option -u) operations, taking the files from the previous steps (qc_noOuts_outStats.tsv and ca_outStats.tsv, respectively):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=37&amp;gt;&lt;br /&gt;
sanxotsqueezer.exe -alist -p%workingFolder% -lqc_noOuts_outStats.tsv -uca_outStats.tsv -f0.05 -n3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output is a list of categories, with some information associated, in file list_outList.tsv; we will analyse the categories in this list.&lt;br /&gt;
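The filtering rule can be sketched as follows (hypothetical category names and numbers; the real outList also carries additional columns):&lt;br /&gt;

```python
# Sketch of the SanXoTSqueezer filter: keep only categories with at
# least n quantified proteins and an FDR at or below the -f threshold.
# (category, number of proteins, FDR) tuples are hypothetical.
categories = [
    ("BioCarta_pathwayA", 7, 0.003),
    ("GOBP_small_cat", 2, 0.001),    # too few proteins
    ("KEGG_pathwayB", 12, 0.20),     # FDR too high
]

def squeeze(cats, max_fdr=0.05, min_proteins=3):
    return [name for name, n, fdr in cats
            if n >= min_proteins and fdr <= max_fdr]

selected = squeeze(categories)
```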
&lt;br /&gt;
=== Clustering together similar categories using Sanson ===&lt;br /&gt;
&lt;br /&gt;
The point of integrating proteins into categories is managing less, but more meaningful, information. However, databases contain so many categories that we sometimes end up with too many of them to interpret the results biologically.&lt;br /&gt;
&lt;br /&gt;
In some cases we have duplicated categories that contain the same proteins but have different names, especially when we combine different databases. This is easy to detect, but most commonly categories have ''similar'' but not ''identical'' sets of proteins, leading to a more subtle problem. The module '''[[Sanson]]''' helps detect similar categories, generating clusters of similar sets. A threshold of &amp;quot;similarity&amp;quot; can be set (where the similarity is given by an FNumber, depending on how many proteins are shared by two given categories), but the algorithm uses a matrix of protein-to-category items (the similarity matrix), and maximises the number of categories in each cluster (the CNumber) and the number of clusters, partitioning them with a [https://en.wikipedia.org/wiki/Durfee_square Durfee square]-based calculation:&lt;br /&gt;
&lt;br /&gt;
 With this similarity-threshold, are there at least 3 clusters with at least 3 categories? --&amp;gt; true&lt;br /&gt;
 With this similarity-threshold, are there at least 4 clusters with at least 4 categories? --&amp;gt; true&lt;br /&gt;
 With this similarity-threshold, are there at least 5 clusters with at least 5 categories? --&amp;gt; false&lt;br /&gt;
 With this similarity-threshold, are there at least 6 clusters with at least 6 categories? --&amp;gt; false&lt;br /&gt;
&lt;br /&gt;
In this example, it could happen that there are (for example) 5 clusters with 4 categories, or 3 clusters with 6 categories, but not 5 clusters with 5 categories, so the best CNumber in this case would be 4 (note that this is the same logic used to calculate the h-index metric).&lt;br /&gt;
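This CNumber calculation can be sketched as follows (the cluster sizes are hypothetical; sizes [6, 6, 6, 4, 4] reproduce the example above, where there are 3 clusters with 6 categories and 5 clusters with 4, but not 5 clusters with 5):&lt;br /&gt;

```python
# Sketch of the Durfee-square (h-index-like) CNumber: the largest k
# such that at least k clusters contain at least k categories each.
def cnumber(cluster_sizes):
    sizes = sorted(cluster_sizes, reverse=True)
    k = 0
    for i, size in enumerate(sizes, start=1):
        if size >= i:
            k = i
        else:
            break
    return k
```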
&lt;br /&gt;
In this test, Sanson is executed with this command line:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=38&amp;gt;&lt;br /&gt;
sanson.exe -asanson -p%workingFolder% -rdbcats_rels.tsv -clist_outList.tsv -zqaMerged_outStats.tsv -gpdf&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We already know the relations file dbcats_rels.tsv (option -r); the list_outList.tsv file (option -c) generated in the previous step provides the categories that will be considered; and the outStats file qaMerged_outStats.tsv (option -z), generated in the last command of Test 2, provides the protein-level, category-independent quantifications, supplying the protein information for each category.&lt;br /&gt;
&lt;br /&gt;
The output of Sanson is a .gv file (here sanson_simGraph.gv), where category clusters are given in [http://en.wikipedia.org/wiki/DOT_language DOT language]; they can be interpreted by open-source programs like [http://graphviz.org/ Graphviz] to generate an image file. If Graphviz is installed, Sanson can launch the interpretation itself (if you are using Windows, you only need to specify in the dot.ini file the location of the dot.exe file used by Graphviz). In this test, we have used as output the PDF file sanson_simGraph.pdf (using the -g option), but it can also be interesting to generate an [https://en.wikipedia.org/wiki/Scalable_Vector_Graphics svg] file, which (when opened in suitable browsers) can give back the list of protein identifiers present in each category just by hovering the mouse over the category box.&lt;br /&gt;
&lt;br /&gt;
As you will see, each category box has a number of customisable colour columns. Each column represents a protein present in the category, and its colour depends on the standardised log fold-change of that protein (Z, the standardised version of X=logfold, meaning that X has been centred and corrected according to the variance). By default, white means no change, green means there is more quantity from sample B than from sample A (as logarithm(A/B) &amp;lt; 0) and, conversely, red means there is more quantity from sample A than from sample B. The arrows have numbers that indicate the total number of proteins in a category that are present in the category the arrow is pointing to. The absence of an arrow does not mean two categories do not share proteins; it just means that the ratio of shared proteins does not reach the calculated threshold specified above.&lt;br /&gt;
&lt;br /&gt;
=== Generating gaussians using SanXoTGauss ===&lt;br /&gt;
[[File:Gauss outGraph.png|thumb|300px|left|Output of SanXoTGauss in Test 3, line 39, showing the theoretical sigmoid for the standard normal distribution (red), the experimental sigmoid for all the proteins in the experiment (blue), and the sigmoids of proteins grouped by ontology (selecting with SanXoTSqueezer only categories that are changing expression with FDR &amp;lt;= 5% and having at least 3 proteins).]]&lt;br /&gt;
Representing sigmoids (here, cumulative gaussians) of the different categories, and comparing them to a theoretical normal distribution (thick red line) and to the bulk of proteins (blue line), can be helpful to gain better insight into the meaning of a change. A coordinated proteome is one in which proteins belonging to the same category change in a similar way, together. We can observe this behaviour by representing the proteins classified in the same category as a separate sigmoid. Since we usually represent several categories at once, we draw each with a different colour to better identify them. Categories with a coordinated change will be &amp;quot;parallel&amp;quot; to the normal distribution, but displaced (according to the category-level fold change), while categories behaving in an uncoordinated way will tend to cross the normal distribution (as they contain proteins with both increasing and decreasing expression changes).&lt;br /&gt;
&lt;br /&gt;
This operation is done by '''[[SanXoTGauss]]''', using a command like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=39&amp;gt;&lt;br /&gt;
sanxotgauss.exe -agauss -p%workingFolder% -rdbcats_rels.tsv -clist_outList.tsv -zqaMerged_outStats.tsv -g&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The options are the same as in the previous Sanson command, except for -g (which here asks the program not to show the graph, just to save it, which is important to keep the workflow going).&lt;br /&gt;
&lt;br /&gt;
Since in this particular case the legend covers two changing categories, we have added another version of the graph, asking SanXoTGauss not to show the legend (using the -k option):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=40&amp;gt;&lt;br /&gt;
sanxotgauss.exe -agauss_noLegend -p%workingFolder% -rdbcats_rels.tsv -clist_outList.tsv -zqaMerged_outStats.tsv -gk&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Calculating the degree of coordination using the Coordinometer ===&lt;br /&gt;
&lt;br /&gt;
We have defined a parameter called the '''''degree of coordination''''', which tells us objectively how coordinated a proteome is. This parameter takes into account the number of outlier proteins (proteins that are expressed differently from proteins classified within the same category) and how much the categories are changing. The '''[[Coordinometer]]''' calculates this number using this command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=41&amp;gt;&lt;br /&gt;
coordinometer.exe -acoordination -p%workingFolder% -qqc_noOuts_outStats.tsv -cca_outStats.tsv&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Coordinometer extracts the number of relations tagged as outliers (using the &amp;quot;out&amp;quot; tag in the third column) from file qc_noOuts_outStats.tsv (using option -q, for proteins expressed differently from other proteins in the same categories, a calculation from the &amp;quot;top-left&amp;quot; side of the triangle) and gets the number of changing categories from file ca_outStats.tsv (using option -c, reading the FDR column directly). The FDR threshold for the latter is 5% by default, but a different one can be set using the --cafdr option.&lt;br /&gt;
&lt;br /&gt;
The result is both printed in the console and written to the corresponding log file, coordination_logFile.txt, along with the information used to calculate it:&lt;br /&gt;
&lt;br /&gt;
 Total number of changing categories: 35&lt;br /&gt;
 Total number of relations pointing to changing categories: 93&lt;br /&gt;
 Total number of outlier relations pointing to changing categories: 1&lt;br /&gt;
 Total number of outlier relations pointing to non-changing categories: 18&lt;br /&gt;
 Degree of coordination: 0.828829&lt;br /&gt;
&lt;br /&gt;
Where 0.828829 is the result of:&lt;br /&gt;
 (numRelsChangingCats - numOutliersChangingCats) / (numRelsChangingCats + numOutliersNonChangingCats) =&lt;br /&gt;
 (93 - 1) / (93 + 18) = 0.828829&lt;br /&gt;
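This arithmetic can be checked directly:&lt;br /&gt;

```python
# Degree of coordination, reproducing the numbers from the log above.
num_rels_changing = 93       # relations pointing to changing categories
num_outs_changing = 1        # outlier relations to changing categories
num_outs_nonchanging = 18    # outlier relations to non-changing categories

coordination = (num_rels_changing - num_outs_changing) / (
    num_rels_changing + num_outs_nonchanging)

print(round(coordination, 6))   # 0.828829
```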
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:SanXoT software package]]&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=File:Gauss_outGraph.png&amp;diff=640</id>
		<title>File:Gauss outGraph.png</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=File:Gauss_outGraph.png&amp;diff=640"/>
				<updated>2018-08-17T10:29:39Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: Mtrevisan uploaded a new version of &amp;amp;quot;File:Gauss outGraph.png&amp;amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Output of SanXoTGauss in Test 3, showing the theoretical sigmoid for the standard normal distribution (red), the experimental sigmoid for all the proteins in the experiment (blue), and the sigmoids of proteins grouped by ontology (selecting with SanXoTSqueezer only categories that are changing expression with FDR &amp;lt;= 5% and having at least 3 proteins).&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=Exploring_SanXoT_features&amp;diff=639</id>
		<title>Exploring SanXoT features</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=Exploring_SanXoT_features&amp;diff=639"/>
				<updated>2018-08-17T10:29:12Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Here we will look at the details of three common workflows that use the SanXoT software package.&lt;br /&gt;
&lt;br /&gt;
For tests check '''[[Unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
For a general overview on how to use SanXoT and how it works, see '''[[How to use SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== Test 1: The fundamental workflow ==&lt;br /&gt;
&lt;br /&gt;
We describe here the '''[[fundamental workflow]]''', a workflow similar to the statistics done by '''[[QuiXoT]]''' using the WSPP (''Weighted Spectrum, Peptide and Protein'') statistical model&amp;lt;ref&amp;gt;Navarro, P., Trevisan-Herraz, M., Bonzon-Kulichenko, E., Núñez, E., Martínez-Acedo, P., Pérez-Hernández, D., Jorge, I., Mesa, R., Calvo, E., Carrascal, M., Hernáez, M.L., García, F., Bárcena, J.A., Ashman, K., Abian, J., Gil, C., Redondo, J.M. and Vázquez, J. (2014) '''[https://pubs.acs.org/doi/abs/10.1021/pr4006958 General statistical framework for quantitative proteomics by stable isotope labeling]'''. ''Journal of proteome research'', 13, 1234-1247.&amp;lt;/ref&amp;gt;. The following text is also available as a [https://en.wikipedia.org/wiki/Batch_file batch file] in '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test1.zip SanXoT_test1.zip]''':&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test1.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 1, FUNDAMENTAL WORKFLOW (SPECTRUM, PEPTIDE, PROTEIN)&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
set startingFile=170415_Marga_GBS_iTRAQ_PSMs.txt&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM create the relation files&lt;br /&gt;
aljamia.exe -asprels -p%workingFolder% -x%startingFile% -i&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -j&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot;&lt;br /&gt;
aljamia.exe -apqrels -p%workingFolder% -x%startingFile% -i&amp;quot;'[Master Protein Accessions]'.split(';')[0]&amp;quot; -j&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;i&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM create the data file at the scan level&lt;br /&gt;
aljamia.exe -asdata114vs118 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([114]/[118], 2)&amp;quot; -k&amp;quot;max([114], [118])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
aljamia.exe -asdata115vs119 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([115]/[119], 2)&amp;quot; -k&amp;quot;max([115], [119])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM The following tab-separated text files (despite their extensions) are generated:&lt;br /&gt;
REM sprels_table.tsv, with two columns, having scan-to-peptide identifiers (rawfile-scanNumber-charge for scans, sequence for peptides)&lt;br /&gt;
REM pqrels_table.tsv, with two columns, having peptide-to-protein identifiers (UniProt accession number as protein identifier)&lt;br /&gt;
REM sdata114vs118_table.tsv and sdata115vs119_table.tsv, with three columns having {scan identifier, log2(fold change), weight}&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart114vs118 -p%workingFolder% -dsdata114vs118_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxot.exe -asp114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -Vsp114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp114vs118_noOuts -p%workingFolder% -dstart114vs118_calibrated.tsv -rspouts114vs118_tagged.tsv -Vsp114vs118_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein&lt;br /&gt;
sanxot.exe -apq114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq114vs118_noOuts -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqouts114vs118_tagged.tsv -fg -Vpq114vs118_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa114vs118 -p%workingFolder% -dpq114vs118_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_1:_The_fundamental_workflow|Test 1]]'''.&lt;br /&gt;
&lt;br /&gt;
=== Modifications needed ===&lt;br /&gt;
&lt;br /&gt;
The only lines that need to be modified by the user are these (to match the user's file locations):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=6&amp;gt;&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
set startingFile=170415_Marga_GBS_iTRAQ_PSMs.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the first operations, we will use '''[[Aljamia]]''' to parse the data from the starting file (170415_Marga_GBS_iTRAQ_PSMs.txt) and generate the relations files. In this case, the input file is a tab-separated text file quantified using [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer 2.1], but any other tab-separated text file that arranges information in columns with headers can be used as input (for example, [https://en.wikipedia.org/wiki/SQLite SQLite] query results from Proteome Discoverer's MSF files, or results from [https://en.wikipedia.org/wiki/Mascot_(software) Mascot]).&lt;br /&gt;
&lt;br /&gt;
=== Parsing the data ===&lt;br /&gt;
&lt;br /&gt;
Let us check the first '''[[Aljamia]]''' operation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=15&amp;gt;&lt;br /&gt;
aljamia.exe -asprels -p%workingFolder% -x%startingFile% -i&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -j&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a common workflow, the SanXoT software package can generate many output files. To make them easier to track, by default all the files generated in a single step share the same prefix, while each type of file carries a common suffix. The prefix is set by the ''-a'' argument (it is good practice to ''always'' include this argument). For example, in this case two files are generated: ''sprels_table.tsv'' (the relations table matching '''s'''cans and '''p'''eptides) and ''sprels_log.txt'' (a log file). Unless overridden (check '''[[Aljamia]]''''s help for more information), the filename of the generated relations table will always end in '''_table.tsv'''.&lt;br /&gt;
&lt;br /&gt;
The -p option sets the working folder: all input and output files are assumed to be in that folder, unless a full path is specified. The generated table will include only two columns: the first (-i) containing an unambiguous identifier of the higher level, the peptides (in this case, concatenating the &amp;quot;Annotated Sequence&amp;quot; and &amp;quot;Modifications&amp;quot; columns), and the second (-j) merging the data from the &amp;quot;Spectrum File&amp;quot;, &amp;quot;First Scan&amp;quot; and &amp;quot;Charge&amp;quot; columns (an example of a unique identifier for scans, the lower level). Note that the column headers are case-sensitive.&lt;br /&gt;
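&lt;br /&gt;
The identifier-building logic of these two templates can be sketched in Python (an illustrative sketch with hypothetical column values, not Aljamia's actual code):&lt;br /&gt;

```python
# Illustrative sketch (not Aljamia's code) of how the -i and -j templates
# combine the values of named columns into a single identifier.
def build_id(columns, header, row, sep):
    """Join the values of the named columns with the given separator."""
    values = dict(zip(header, row))
    return sep.join(values[c] for c in columns)

# Hypothetical header and row from a tab-separated PSM file:
header = ["Annotated Sequence", "Modifications", "Spectrum File", "First Scan", "Charge"]
row = ["AGLLGLLEEMR", "Oxidation", "run01.raw", "2412", "2"]

# -i"[Annotated Sequence]_[Modifications]"  -> peptide identifier
peptide_id = build_id(["Annotated Sequence", "Modifications"], header, row, "_")
# -j"[Spectrum File]-[First Scan]-[Charge]" -> scan identifier
scan_id = build_id(["Spectrum File", "First Scan", "Charge"], header, row, "-")
```

Any combination of columns that yields unique strings works, since SanXoT treats identifiers as opaque text.&lt;br /&gt;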
&lt;br /&gt;
The operation to generate the file for the relations between proteins and peptides is similar:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=16&amp;gt;&lt;br /&gt;
aljamia.exe -apqrels -p%workingFolder% -x%startingFile% -i&amp;quot;'[Master Protein Accessions]'.split(';')[0]&amp;quot; -j&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;i&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notice that the prefix here is ''pqrels'' (some researchers use '''q''' for proteins, to differentiate them from '''p'''eptides); now the peptide identifiers (&amp;quot;Annotated Sequence&amp;quot; + &amp;quot;Modifications&amp;quot;) are in the second column (because now they are the lower level), while proteins are in the first column (as they are now the higher level). Here we use UniProt's accession numbers as protein identifiers. Proteome Discoverer lists several proteins where the peptide is found, but in this case we keep only the first protein. To do so, we use Python-style commands to split the string in the field &amp;quot;Master Protein Accessions&amp;quot; (using &amp;quot;;&amp;quot;), and then take just the first element (hence the &amp;quot;[0]&amp;quot;). To tell '''[[Aljamia]]''' that this is an operation, and not just text to include, we add the argument -A&amp;quot;i&amp;quot; (to indicate that only column &amp;quot;i&amp;quot; contains operations). We have also included a filter to keep only the identifications that, according to Proteome Discoverer, have confidence either &amp;quot;Medium&amp;quot; or &amp;quot;High&amp;quot; (thus excluding poor identifications).&lt;br /&gt;
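&lt;br /&gt;
The two Python-style ingredients of this command, the split operation in -i and the confidence filter in -f, behave as in this small sketch (hypothetical accession values, not Aljamia's code):&lt;br /&gt;

```python
# Sketch of the -i operation and the -f filter used above (hypothetical values).
def first_accession(master_accessions):
    # '[Master Protein Accessions]'.split(';')[0]: keep only the first protein
    return master_accessions.split(';')[0].strip()

def passes_filter(confidence):
    # [Confidence]==High || [Confidence]==Medium
    return confidence == "High" or confidence == "Medium"
```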
&lt;br /&gt;
The next two lines parse the information from the same input file, ''170415_Marga_GBS_iTRAQ_PSMs.txt'', to generate two data files, one for the 114 vs 118 iTRAQ comparison, and the other for 115 vs 119 (as you can observe in the -a option for the prefix):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=19&amp;gt;&lt;br /&gt;
aljamia.exe -asdata114vs118 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([114]/[118], 2)&amp;quot; -k&amp;quot;max([114], [118])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
aljamia.exe -asdata115vs119 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([115]/[119], 2)&amp;quot; -k&amp;quot;max([115], [119])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Data files in SanXoT include a first column (-i) with a unique text identifier (in this case, the scan identifier we have mentioned), a second column (-j) with the log2(A/B) ratio (computed with the enclosed Python-style expression), and a third column (-k) for the weight of the scan (in iTRAQ we simply select the intensity of the most intense peak of the pair). As we did before to split the column with the protein accession numbers, we use -A&amp;quot;jk&amp;quot; to indicate that there are operations to be performed in the second and third columns (but not in the first, which is treated as text).&lt;br /&gt;
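&lt;br /&gt;
For a single PSM, the -j and -k expressions amount to the following (a sketch with hypothetical reporter intensities):&lt;br /&gt;

```python
import math

# Sketch of the -j and -k expressions for one scan (hypothetical intensities).
def scan_ratio_and_weight(i114, i118):
    x = math.log(i114 / i118, 2)  # -j: log2 fold-change between the reporters
    w = max(i114, i118)           # -k: weight, the most intense peak of the pair
    return x, w

x, w = scan_ratio_and_weight(2000.0, 500.0)  # log2(4) = 2, weight 2000.0
```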
&lt;br /&gt;
=== Calibrating the spectra ===&lt;br /&gt;
&lt;br /&gt;
The first step to integrate the data is calibrating the weights assigned to each spectrum (see '''[[How to use SanXoT#Calibrating the data|How to use SanXoT]]''' for more information on why this is important). This task is performed by the module '''[[Klibrate]]''':&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=27&amp;gt;&lt;br /&gt;
klibrate.exe -astart114vs118 -p%workingFolder% -dsdata114vs118_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As in all SanXoT programs, there is a prefix assigned to the operation (-a) and the default working folder is specified (-p). Additionally, -d and -r set the file names of the data and relations files generated by Aljamia (sdata114vs118_table.tsv and sprels_table.tsv, respectively). The -g option tells the program not to pause to display the sigmoid graph comparing the data to the standard normal distribution.&lt;br /&gt;
&lt;br /&gt;
=== Integration from spectrum to peptide ===&lt;br /&gt;
&lt;br /&gt;
We are ready for the first integration (using the GIA, or ''Generic Integration Algorithm''), which is done by '''[[SanXoT]]''' with the help of '''[[SanXoTSieve]]'''.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=28&amp;gt;&lt;br /&gt;
sanxot.exe -asp114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -Vsp114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp114vs118_noOuts -p%workingFolder% -dstart114vs118_calibrated.tsv -rspouts114vs118_tagged.tsv -Vsp114vs118_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(We have already described the meaning of the -a, -p, -d, -r and -g options.)&lt;br /&gt;
&lt;br /&gt;
In the first of these three commands, SanXoT integrates the lower level data (in this case, spectra) into a higher level (here, peptides). When doing so, SanXoT calculates the variance of the lower level with respect to the higher level, i.e., how the fold-changes of the different spectra are distributed when assigned to the different peptides. The result is given both in the console and in the [prefix]_infoFile.txt file (sp114vs118_infoFile.txt).&lt;br /&gt;
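&lt;br /&gt;
The core of each integration step can be sketched as a weighted mean of the lower-level values grouped by their higher-level identifier (a deliberately minimal sketch: the actual Generic Integration Algorithm also estimates the variance and refines the weights accordingly):&lt;br /&gt;

```python
# Minimal sketch of one integration step: each higher-level value is the
# weighted mean of its lower-level values (variance estimation omitted).
def integrate(data, rels):
    """data: {lower_id: (x, weight)}; rels: {lower_id: higher_id}."""
    sums = {}
    for lower, (x, w) in data.items():
        higher = rels[lower]
        sx, sw = sums.get(higher, (0.0, 0.0))
        sums[higher] = (sx + w * x, sw + w)
    # Each higher-level element keeps the summed weight of its components.
    return {h: (sx / sw, sw) for h, (sx, sw) in sums.items()}

data = {"scan1": (1.0, 2.0), "scan2": (0.0, 2.0), "scan3": (0.5, 4.0)}
rels = {"scan1": "pepA", "scan2": "pepA", "scan3": "pepB"}
result = integrate(data, rels)  # {"pepA": (0.5, 4.0), "pepB": (0.5, 4.0)}
```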
&lt;br /&gt;
Once the variance is known, SanXoTSieve tags the outlier spectra for each peptide (if any) by adding the tag &amp;quot;out&amp;quot; to the relations file (in a third column), saving a new relations file (to avoid overwriting the original one). To provide the variance (essential to define which spectra are outliers), SanXoTSieve takes as input the sp114vs118_infoFile.txt (option -V), reading it from the line &amp;quot;Variance = ...&amp;quot;; alternatively, the variance can be set directly with the lowercase option -v (for example, -v0.02) if we want to use a specific, known variance.&lt;br /&gt;
&lt;br /&gt;
For SanXoTSieve, -f is used to define the FDR below which a spectrum will be considered an outlier. This means that spectra with FDR &amp;lt; 0.01 have a probability &amp;lt; 1% of showing a quantification that is statistically significant with respect to the other spectra in which the same peptide has been identified (i.e., of being outliers).&lt;br /&gt;
&lt;br /&gt;
Once SanXoTSieve has produced the relations file with tagged outliers, SanXoT finally integrates the data; in this case, instead of calculating the variance, it uses the variance that was calculated when the outliers were still included (option -V to provide the file, or -v to provide it as a number). We say in this case that ''the variance is forced'', which is indicated using the -f parameter (note that -f has a different meaning here, compared to SanXoTSieve).&lt;br /&gt;
&lt;br /&gt;
The option --tags=&amp;quot;!out&amp;quot; means that the relations tagged as &amp;quot;out&amp;quot; in the relations file (by SanXoTSieve) will be excluded (with the same effect as if they had been removed). More tags can be combined (see the specific help for more information).&lt;br /&gt;
&lt;br /&gt;
=== Integration from peptide to protein ===&lt;br /&gt;
&lt;br /&gt;
The integration of peptides to proteins is very similar to the previous one:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=33&amp;gt;&lt;br /&gt;
sanxot.exe -apq114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq114vs118_noOuts -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqouts114vs118_tagged.tsv -fg -Vpq114vs118_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The only differences are: 1) instead of taking as input the output file of Klibrate (which contained the calibrated spectrum information), it takes the output of the previous integration (sp114vs118_noOuts_higherLevel.tsv); and 2) the relations file (pqrels_table.tsv) contains the protein identifiers in the first column (here we use the accession number, but any other unique text string would do the job; in other cases the whole FASTA header can be used, which makes proteins more human-readable) and the peptide identifiers in the second column (the amino acid sequence with the post-translational modifications).&lt;br /&gt;
&lt;br /&gt;
The result will be a data file containing protein identifiers, fold-changes at the protein level and protein weights, with no further reference to the peptides or spectra.&lt;br /&gt;
&lt;br /&gt;
=== Getting the quantification of proteins ===&lt;br /&gt;
[[File:Qa114vs118 outGraph.png|thumb|300px|Output file of SanXoT (''qa114vs118_outGraph.png'') in Test 1, line 38, showing the sigmoid with protein expression changes (blue) compared to the theoretical sigmoid of the standard normal distribution.]]&lt;br /&gt;
&lt;br /&gt;
In the last step we compare the proteins to the set of proteins as a whole. As you can see, the output data file from the previous integration is used as input (pq114vs118_noOuts_higherLevel.tsv); here there is no relations file, as in this case we are not integrating proteins into different groups (we will see this in Test 2 for merging experiments, and in Test 3 for systems biology).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=39&amp;gt;&lt;br /&gt;
sanxot.exe -aqa114vs118 -p%workingFolder% -dpq114vs118_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of providing a relations file, we include the option -C (for ''confluence''), which has the same effect as a relations file whose first column is filled with a dummy &amp;quot;1&amp;quot; for every row and whose second column contains each protein identifier.&lt;br /&gt;
&lt;br /&gt;
The output will be a higher level file (qa114vs118_higherLevel.tsv, where '''q''' means &amp;quot;protein&amp;quot; and '''a''' means &amp;quot;all&amp;quot;) with only one row, where the fold-change is the weighted arithmetic mean of all proteins (i.e. the grand mean of the whole set). You can see the protein changes in the file qa114vs118_outStats.tsv, where the FDR of changing proteins is presented, alongside their fold-change (Xinf), normalised fold-change (Z) and the grand mean (Xsup).&lt;br /&gt;
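&lt;br /&gt;
Numerically, the protein-to-all step reduces to computing the weighted grand mean and, from it, the deviation of each protein. Under the simplifying assumption that the weights are already calibrated inverse variances, the standardised value Z can be sketched as the deviation from the grand mean scaled by the square root of the weight (illustrative only, not SanXoT's exact implementation):&lt;br /&gt;

```python
import math

# Sketch of the protein-to-all step (assumes calibrated weights; illustrative,
# not SanXoT's actual code).
def grand_mean_and_z(proteins):
    """proteins: {protein_id: (x, weight)} -> (xsup, {protein_id: z})."""
    total_w = sum(w for _, w in proteins.values())
    xsup = sum(w * x for x, w in proteins.values()) / total_w  # grand mean
    z = {p: (x - xsup) * math.sqrt(w) for p, (x, w) in proteins.items()}
    return xsup, z

xsup, z = grand_mean_and_z({"protId1": (1.0, 1.0), "protId2": (-1.0, 1.0)})
```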
&lt;br /&gt;
== Test 2: Experiment merging ==&lt;br /&gt;
&lt;br /&gt;
Here we perform a very simple experiment merging, in which we merge two experiments with comparable information (i.e. if the first experiment was &amp;quot;treated vs control&amp;quot;, the second experiment can be a biological replicate, such as another &amp;quot;treated vs control&amp;quot; for a different treated patient and a different control).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test2.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 2, EXPERIMENT MERGING&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test2&amp;quot;&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM now, same for 115vs119&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart115vs119 -p%workingFolder% -dsdata115vs119_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxot.exe -asp115vs119 -p%workingFolder% -dstart115vs119_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts115vs119 -p%workingFolder% -dstart115vs119_calibrated.tsv -rsprels_table.tsv -Vsp115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp115vs119_noOuts -p%workingFolder% -dstart115vs119_calibrated.tsv -rspouts115vs119_tagged.tsv -Vsp115vs119_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein for 115vs119&lt;br /&gt;
sanxot.exe -apq115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq115vs119_noOuts -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqouts115vs119_tagged.tsv -fg -Vpq115vs119_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes for 115vs119&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa115vs119 -p%workingFolder% -dpq115vs119_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM experiment merging: 115vs119 + 114vs118&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
cardenio.exe -amerge -p%workingFolder% -ttagFile.txt&lt;br /&gt;
sanxot.exe -amerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -g&lt;br /&gt;
sanxotsieve.exe -aoutsMerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -Vmerged_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -amerged_noOuts -p%workingFolder% -dmerge_newData.txt -routsMerged_tagged.tsv -fg -Vmerged_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqaMerged -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_2:_Experiment_merging|Test 2]]'''.&lt;br /&gt;
&lt;br /&gt;
=== A parallel analysis ===&lt;br /&gt;
&lt;br /&gt;
In this example we merge just two experiments for simplicity, but it is possible to merge even hundreds of them at once (such as hundreds of technical or biological replicates).&lt;br /&gt;
&lt;br /&gt;
Command lines 6-8 have already been discussed: they must be adapted to the user's file locations. Lines 14-29 do the same as discussed in the previous section, but for the 115 vs 119 comparison:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=14&amp;gt;&lt;br /&gt;
REM now, same for 115vs119&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart115vs119 -p%workingFolder% -dsdata115vs119_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxot.exe -asp115vs119 -p%workingFolder% -dstart115vs119_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts115vs119 -p%workingFolder% -dstart115vs119_calibrated.tsv -rsprels_table.tsv -Vsp115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp115vs119_noOuts -p%workingFolder% -dstart115vs119_calibrated.tsv -rspouts115vs119_tagged.tsv -Vsp115vs119_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein for 115vs119&lt;br /&gt;
sanxot.exe -apq115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq115vs119_noOuts -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqouts115vs119_tagged.tsv -fg -Vpq115vs119_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes for 115vs119&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa115vs119 -p%workingFolder% -dpq115vs119_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Generating the files for the experiment integration ===&lt;br /&gt;
&lt;br /&gt;
The module for merging experiments, '''[[Cardenio]]''', actually does a very simple thing: it takes the protein identifiers, concatenates a suffix for each experiment (specified in an external file called tagFile) and generates both the data and relations files, ready for SanXoT to integrate in the usual way:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=36&amp;gt;&lt;br /&gt;
cardenio.exe -amerge -p%workingFolder% -ttagFile.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We have already discussed the -a and -p options across the SanXoT software package. Cardenio also needs -t, a tab-separated file with this structure:&lt;br /&gt;
&lt;br /&gt;
 tag		file&lt;br /&gt;
 114vs118	qa114vs118_lowerNormV.tsv&lt;br /&gt;
 115vs119	qa115vs119_lowerNormV.tsv&lt;br /&gt;
&lt;br /&gt;
The column &amp;quot;tag&amp;quot; includes the suffix that will be appended to each protein (or other) identifier, while the column &amp;quot;file&amp;quot; contains the filenames of the data files. These files are similar to, but distinct from, the _higherLevel files: they are their centred versions. For example, the one with the prefix qa114vs118 was generated in the last step of the 114vs118 integration, and the suffix _lowerNormV tells us it is a data file where the fold-changes have been centred by subtracting the grand mean, X'inf, while the weights have been kept unchanged; files with the suffix _lowerNorm'''W''' have identical first and second columns, but their third column, for the weights, has been modified to include the general variance of the protein level, as described in the WSPP statistical model.&lt;br /&gt;
&lt;br /&gt;
These identifiers are rearranged into the data and relations files as in the following example:&lt;br /&gt;
&lt;br /&gt;
In the original _lowerNormV format, the first file looks like this:&lt;br /&gt;
&lt;br /&gt;
 idinf	X'inf		Vinf&lt;br /&gt;
 protId1	0.420637682685	20.8442070633&lt;br /&gt;
 protId2	0.0309318252993	10.405882717&lt;br /&gt;
 protId3	0.235807597358	13.6742537388&lt;br /&gt;
 protId4	-0.939886653794	10.3887308051&lt;br /&gt;
 protId5	0.136998763785	111.910628299&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
And the second file looks like:&lt;br /&gt;
&lt;br /&gt;
 idinf	X'inf		Vinf&lt;br /&gt;
 protId1	0.12726383076	9.37044627116&lt;br /&gt;
 protId2	0.274864264955	4.64481829573&lt;br /&gt;
 protId3	-0.261668207812	5.73886230845&lt;br /&gt;
 protId4	0.40863124663	4.57552314502&lt;br /&gt;
 protId5	0.36753759182	46.0017027453&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
In the output of Cardenio, data is rearranged in merge_newData.txt:&lt;br /&gt;
&lt;br /&gt;
 idinf			X'inf		Vinf&lt;br /&gt;
 protId1'''_114vs118'''	0.420637682685	20.8442070633&lt;br /&gt;
 protId2'''_114vs118'''	0.0309318252993	10.405882717&lt;br /&gt;
 protId3'''_114vs118'''	0.235807597358	13.6742537388&lt;br /&gt;
 protId4'''_114vs118'''	-0.939886653794	10.3887308051&lt;br /&gt;
 protId5'''_114vs118'''	0.136998763785	111.910628299&lt;br /&gt;
 ...&lt;br /&gt;
 protId1'''_115vs119'''	0.12726383076	9.37044627116&lt;br /&gt;
 protId2'''_115vs119'''	0.274864264955	4.64481829573&lt;br /&gt;
 protId3'''_115vs119'''	-0.261668207812	5.73886230845&lt;br /&gt;
 protId4'''_115vs119'''	0.40863124663	4.57552314502&lt;br /&gt;
 protId5'''_115vs119'''	0.36753759182	46.0017027453&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
Along with the corresponding relations file merge_newRels.txt:&lt;br /&gt;
&lt;br /&gt;
 idsup	idinf&lt;br /&gt;
 protId1	protId1'''_114vs118'''&lt;br /&gt;
 protId2	protId2'''_114vs118'''&lt;br /&gt;
 protId3	protId3'''_114vs118'''&lt;br /&gt;
 protId4	protId4'''_114vs118'''&lt;br /&gt;
 protId5	protId5'''_114vs118'''&lt;br /&gt;
 ...&lt;br /&gt;
 protId1	protId1'''_115vs119'''&lt;br /&gt;
 protId2	protId2'''_115vs119'''&lt;br /&gt;
 protId3	protId3'''_115vs119'''&lt;br /&gt;
 protId4	protId4'''_115vs119'''&lt;br /&gt;
 protId5	protId5'''_115vs119'''&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
As can be seen, in the relations file the lower level (second column) links each protein to its experiment (via the suffix, associated to the corresponding data in merge_newData.txt), while in the higher level (first column) proteins are independent of the experiment. This means that, after integrating, we will get only experiment-independent, protein-level information (the information of each protein in each experiment averaged with its corresponding weight).&lt;br /&gt;
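&lt;br /&gt;
The rearrangement described above can be sketched in a few lines (an illustration of the idea, not Cardenio's actual code):&lt;br /&gt;

```python
# Sketch (not Cardenio's code) of the merging rearrangement: suffix each
# protein identifier with its experiment tag to build the merged data file,
# and relate each tagged identifier back to the untagged protein.
def merge_experiments(experiments):
    """experiments: {tag: {protein_id: (x, weight)}} -> (new_data, new_rels)."""
    new_data, new_rels = {}, []
    for tag, data in experiments.items():
        for prot, (x, w) in data.items():
            tagged = prot + "_" + tag
            new_data[tagged] = (x, w)        # row of merge_newData.txt
            new_rels.append((prot, tagged))  # (idsup, idinf) of merge_newRels.txt
    return new_data, new_rels

data, rels = merge_experiments({
    "114vs118": {"protId1": (0.42, 20.8)},
    "115vs119": {"protId1": (0.13, 9.37)},
})
```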
&lt;br /&gt;
=== Last step for experiment integration ===&lt;br /&gt;
&lt;br /&gt;
The last lines perform the integration of the two experiments taking the data and relations files from Cardenio (merge_newData.txt and merge_newRels.txt), as seen in the previous section.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=37&amp;gt;&lt;br /&gt;
sanxot.exe -amerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -g&lt;br /&gt;
sanxotsieve.exe -aoutsMerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -Vmerged_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -amerged_noOuts -p%workingFolder% -dmerge_newData.txt -routsMerged_tagged.tsv -fg -Vmerged_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqaMerged -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The approach here is the same as for the integration from spectrum to peptide or from peptide to protein (but in this case we integrate from &amp;quot;experiment-dependent protein data&amp;quot; to &amp;quot;experiment-independent protein data&amp;quot;):&lt;br /&gt;
:* the first line calculates the variance (storing it in the merged_infoFile.txt file);&lt;br /&gt;
:* then SanXoTSieve tags the outliers (generating the outsMerged_tagged.tsv file with the same relations as merge_newRels.txt, but including the tag &amp;quot;out&amp;quot; in a third column);&lt;br /&gt;
:* the third line integrates the proteins using the relations file and the variance from the two previous commands;&lt;br /&gt;
:* finally, all the experiment-independent proteins are compared to each other, getting the variance (stored in merged_noOuts_infoFile.txt) and providing an FDR that can be used to filter the changing proteins (in the file merged_noOuts_outStats.tsv).&lt;br /&gt;
&lt;br /&gt;
The last step will also be used in the Systems Biology Triangle method (SBT)&amp;lt;ref&amp;gt;García-Marqués, F., Trevisan-Herraz, M., Martínez-Martínez, S., Camafeita, E., Jorge, I., Lopez, J.A., Méndez-Barbero, N., Méndez-Ferrer, S., del Pozo, M.A., Ibáñez, B., Andrés, V., Sánchez-Madrid, F., Redondo, J.M., Bonzon-Kulichenko, E. and Vázquez, J. (2016) '''[http://www.mcponline.org/content/15/5/1740.long A novel systems-biology algorithm for the analysis of coordinated protein responses using quantitative proteomics]'''. ''Molecular &amp;amp; Cellular Proteomics''.&amp;lt;/ref&amp;gt; which we will see now (the &amp;quot;protein-to-all&amp;quot; operation is the &amp;quot;bottom side&amp;quot; of the triangle).&lt;br /&gt;
&lt;br /&gt;
== Test 3: Systems biology ==&lt;br /&gt;
&lt;br /&gt;
In this final test we perform a systems biology analysis of the previously merged experiment using the Systems Biology Triangle method. For that we start with the files generated by Test 2, plus a tab-separated table downloaded directly from the '''[https://en.wikipedia.org/wiki/DAVID DAVID bioinformatics resource]''' (the file ''SB_Homo_19dic-2017.txt'').&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test3.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 3, SYSTEMS BIOLOGY&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test3&amp;quot;&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM conversion of ontological data from DAVID&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
camacho.exe -adbcats1 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 -c5 -fBioCarta_&lt;br /&gt;
camacho.exe -adbcats2 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 -x&amp;quot;dbcats1_rels.tsv&amp;quot; -c6 -fGOBP_&lt;br /&gt;
camacho.exe -adbcats3 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 -x&amp;quot;dbcats2_rels.tsv&amp;quot; -c7 -fGOCC_&lt;br /&gt;
camacho.exe -adbcats4 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 -x&amp;quot;dbcats3_rels.tsv&amp;quot; -c8 -fGOMF_&lt;br /&gt;
camacho.exe -adbcats5 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 -x&amp;quot;dbcats4_rels.tsv&amp;quot; -c9 -fKEGG_&lt;br /&gt;
camacho.exe -adbcats -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 -x&amp;quot;dbcats5_rels.tsv&amp;quot; -c10 -fPIR_&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM prot to cat&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqc -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -rdbcats_rels.tsv -g&lt;br /&gt;
sanxotsieve.exe -aoutsqc -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -rdbcats_rels.tsv -Vqc_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -aqc_noOuts -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -routsqc_tagged.tsv -fg -Vqc_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM cat to all&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aca -p%workingFolder% -dqc_noOuts_higherLevel.tsv -Cgf -v0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
sanxotsqueezer.exe -alist -p%workingFolder% -lqc_noOuts_outStats.tsv -uca_outStats.tsv -f0.05 -n3&lt;br /&gt;
sanson.exe -asanson -p%workingFolder% -rdbcats_rels.tsv -clist_outList.tsv -zqaMerged_outStats.tsv -gpdf&lt;br /&gt;
sanxotgauss.exe -agauss -p%workingFolder% -rdbcats_rels.tsv -clist_outList.tsv -zqaMerged_outStats.tsv -g&lt;br /&gt;
sanxotgauss.exe -agauss_noLegend -p%workingFolder% -rdbcats_rels.tsv -clist_outList.tsv -zqaMerged_outStats.tsv -gk&lt;br /&gt;
coordinometer.exe -acoordination -p%workingFolder% -qqc_noOuts_outStats.tsv -cca_outStats.tsv&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_3:_Systems_biology|Test 3]]'''.&lt;br /&gt;
&lt;br /&gt;
=== Getting the protein-to-category relations file ===&lt;br /&gt;
&lt;br /&gt;
We start by generating the relations file we are going to use in the systems biology analysis. It is a simple relations file, where the first column contains a string with the identifier of the category, and the second column a string with the identifier of the protein (following the same philosophy as the tab-separated tables used to indicate the PSMs in the spectrum-to-peptide relations, and the peptide-to-protein assignments).&lt;br /&gt;
&lt;br /&gt;
Note that, although we are using DAVID to obtain the ontological categories, these relations can be provided using any tool, or even created as custom relations, as long as we use the format mentioned. We can also concatenate relations (such as our custom relations) to previous relations files.&lt;br /&gt;
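For illustration, the expected two-column format can be sketched in a few lines of Python (the category and protein identifiers below are hypothetical, not taken from the real DAVID file):&lt;br /&gt;

```python
def build_relations_table(pairs):
    """Return a two-column, tab-separated relations table:
    category identifier first, protein identifier second."""
    return "\n".join(f"{cat}\t{prot}" for cat, prot in pairs) + "\n"

# Hypothetical category/protein pairs; custom relations are concatenated
# simply by appending more rows in the same two-column format.
pairs = [("KEGG_Apoptosis", "P10415"),
         ("GOBP_cell_cycle", "P06400"),
         ("Custom_my_pathway", "P04637")]
print(build_relations_table(pairs), end="")
```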
&lt;br /&gt;
As we said, a good way to get this information from databases such as Gene Ontology or KEGG is to use DAVID as an intermediary; in this case, we use the module '''[[Camacho]]''', dedicated to this operation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=17&amp;gt;&lt;br /&gt;
camacho.exe -adbcats1 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -q1 -c5 -fBioCarta_&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here, as in all SanXoT modules, we use the options -a and -p as usual (to define a prefix for the output files of this operation, and to set the working directory, respectively). Then we give the filename of the tab-separated file downloaded from DAVID (using -d), we indicate the one-based column number containing the protein accession number (option -q; in this case -q1 indicates it is the first column) and the column containing the category we want to parse into the relations file (using -c; in this case we want to parse the data from BioCarta, which are in column 5), and we define a prefix for the category (as these are categories from BioCarta, we use -f to set this to &amp;quot;BioCarta_&amp;quot;, where the underscore acts as a separator to make it more human-readable). The prefix is not mandatory, but it is highly recommended when we import data from different databases: some categories may share the same name but contain different protein sets, so using a prefix not only avoids collisions, but also helps track which database each category came from.&lt;br /&gt;
&lt;br /&gt;
As a result, we get the file dbcats1_rels.tsv. Now we concatenate the relations for the [http://www.geneontology.org Gene Ontology] databases (BP = biological process, CC = cellular component, and MF = molecular function), [http://www.genome.jp/kegg KEGG] and [https://pir.georgetown.edu PIR]. For each of them we use Camacho, changing only the -c option (to set the column), -f (for the different category prefix), and including the file to which they concatenate (using the option -x). Other users might prefer to simply overwrite the original file but, in order to keep track of the different operations, we use a different file for each output.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=18&amp;gt;&lt;br /&gt;
camacho.exe -adbcats2 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats1_rels.tsv&amp;quot; -c6 -fGOBP_&lt;br /&gt;
camacho.exe -adbcats3 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats2_rels.tsv&amp;quot; -c7 -fGOCC_&lt;br /&gt;
camacho.exe -adbcats4 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats3_rels.tsv&amp;quot; -c8 -fGOMF_&lt;br /&gt;
camacho.exe -adbcats5 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats4_rels.tsv&amp;quot; -c9 -fKEGG_&lt;br /&gt;
camacho.exe -adbcats -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats5_rels.tsv&amp;quot; -c10 -fPIR_&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We end up with the final file dbcats_rels.tsv; this relations file can now be used for as many Systems Biology Triangle analyses as needed, without running Camacho again (unless we want to add databases or an updated ontology from DAVID).&lt;br /&gt;
&lt;br /&gt;
=== From proteins to categories ===&lt;br /&gt;
&lt;br /&gt;
We start by using the relations file just generated, dbcats_rels.tsv, to integrate proteins into categories. Every protein has a different weight, depending on the quality of the scans, the number of peptides detected and quantified for the protein, and perhaps even the different experiments where it has been found. As a result, the information from each protein has a different weight in the category where it has been classified (in the same way as in the previous levels). Getting the category-level information (rather than a lengthy list of quantified proteins) will help us get the big picture of what is going on in the proteome. Additionally, using the Systems Biology Triangle, we integrate the information of all the quantified proteins, not just a list of proteins whose quantification exceeds a defined threshold (as in over-representation algorithms).&lt;br /&gt;
&lt;br /&gt;
We integrate in the usual manner: first getting the variance of the proteins, then tagging outliers (proteins that behave differently from other proteins classified in the same group), and finally performing the integration with the previously calculated variance, excluding the outlier proteins:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=28&amp;gt;&lt;br /&gt;
sanxot.exe -aqc -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -rdbcats_rels.tsv -g&lt;br /&gt;
sanxotsieve.exe -aoutsqc -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -rdbcats_rels.tsv -Vqc_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -aqc_noOuts -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -routsqc_tagged.tsv -fg -Vqc_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This operation is the &amp;quot;top-left&amp;quot; side of the Systems Biology Triangle.&lt;br /&gt;
&lt;br /&gt;
The last integration, where categories are compared to each other, is the &amp;quot;top-right&amp;quot; side of the triangle (see '''[[How to use SanXoT]]''' for more information):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=34&amp;gt;&lt;br /&gt;
sanxot.exe -aca -p%workingFolder% -dqc_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As we did in line 38 of Test 1 and line 41 of Test 2, this is a confluence, meaning that we don't need a relations file, because the whole set of categories is considered a single group.&lt;br /&gt;
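The idea behind a confluence can be sketched as follows: with no relations file, every element is treated as a member of the same single group. This toy Python snippet only illustrates that idea with a weighted grand mean; it is not the actual SanXoT computation:&lt;br /&gt;

```python
def confluence_mean(log2_ratios, weights):
    """Weighted grand mean of all elements, treating the whole set as
    one single group (a simplified sketch of what a confluence does)."""
    return sum(w * x for x, w in zip(log2_ratios, weights)) / sum(weights)

# Hypothetical category-level log2 fold changes and weights.
print(confluence_mean([0.2, -0.1, 0.4, 0.0], [10, 5, 2, 8]))
```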
&lt;br /&gt;
=== Filtering the most relevant categories using SanXoTSqueezer ===&lt;br /&gt;
&lt;br /&gt;
Now we have the information needed to extract some details about the analysis. Sometimes there are too many categories, which makes the analysis difficult. We could build &amp;quot;supercategories&amp;quot; using a category-to-supercategory relations file, but in this case we will simply filter the categories with '''[[SanXoTSqueezer]]''', keeping those with a minimum of 3 quantified proteins (using the option -n) and a maximum FDR of 5% (using the option -f). We also need the two outStats files for the protein-to-category (&amp;quot;qc&amp;quot;, using option -l) and the category-to-all (&amp;quot;ca&amp;quot;, using the option -u) operations, using files from previous steps (qc_noOuts_outStats.tsv and ca_outStats.tsv, respectively):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=37&amp;gt;&lt;br /&gt;
sanxotsqueezer.exe -alist -p%workingFolder% -lqc_noOuts_outStats.tsv -uca_outStats.tsv -f0.05 -n3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output is a list of categories, with some information associated, in file list_outList.tsv; we will analyse the categories in this list.&lt;br /&gt;
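The filtering logic can be sketched in Python (the category statistics and field names below are ours, chosen for illustration, not SanXoTSqueezer's):&lt;br /&gt;

```python
def squeeze(categories, max_fdr=0.05, min_proteins=3):
    """Keep categories with FDR at most max_fdr and at least
    min_proteins quantified proteins (the same filtering idea as the
    -f and -n options; a simplified sketch)."""
    return [c for c in categories
            if c["fdr"] <= max_fdr and c["n"] >= min_proteins]

# Hypothetical category stats: only the first passes both filters.
cats = [{"id": "KEGG_Apoptosis", "fdr": 0.01, "n": 7},
        {"id": "PIR_signal", "fdr": 0.20, "n": 12},
        {"id": "GOBP_cell_cycle", "fdr": 0.03, "n": 2}]
print([c["id"] for c in squeeze(cats)])  # -> ['KEGG_Apoptosis']
```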
&lt;br /&gt;
=== Clustering together similar categories using Sanson ===&lt;br /&gt;
&lt;br /&gt;
The point of integrating proteins into categories is to manage less, but more meaningful, information. However, databases can contain so many categories that the results are sometimes still too numerous to interpret biologically.&lt;br /&gt;
&lt;br /&gt;
In some cases there are duplicated categories that contain the same proteins under different names, especially when we combine different databases. This is easy to detect; more commonly, however, categories have ''similar'' but not ''identical'' sets of proteins, leading to a subtler problem. The module '''[[Sanson]]''' helps detect similar categories, generating clusters of similar sets. A &amp;quot;similarity&amp;quot; threshold can be set (where the similarity is given by an FNumber, which depends on how many proteins are shared by two given categories), but the algorithm uses a matrix of protein-to-category items (the similarity matrix), and maximises the number of categories in each cluster (the CNumber) and the number of clusters, partitioning them with a [https://en.wikipedia.org/wiki/Durfee_square Durfee square]-based calculation:&lt;br /&gt;
&lt;br /&gt;
 With this similarity-threshold, are there at least 3 clusters with at least 3 categories? --&amp;gt; true&lt;br /&gt;
 With this similarity-threshold, are there at least 4 clusters with at least 4 categories? --&amp;gt; true&lt;br /&gt;
 With this similarity-threshold, are there at least 5 clusters with at least 5 categories? --&amp;gt; false&lt;br /&gt;
 With this similarity-threshold, are there at least 6 clusters with at least 6 categories? --&amp;gt; false&lt;br /&gt;
&lt;br /&gt;
In this example, it could happen that there are (for example) 5 clusters with 4 categories, or 3 clusters with 6 categories, but not 5 clusters with 5 categories, so the best CNumber in this case would be 4 (note that this is the same logic used to calculate the h-index metric).&lt;br /&gt;
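This h-index-style search for the best CNumber can be sketched in Python (the cluster sizes below are hypothetical):&lt;br /&gt;

```python
def cnumber(cluster_sizes):
    """Largest n such that at least n clusters contain at least n
    categories each (the same logic as the h-index)."""
    sizes = sorted(cluster_sizes, reverse=True)
    n = 0
    while n < len(sizes) and sizes[n] >= n + 1:
        n += 1
    return n

# 4 clusters with >= 4 categories exist, but not 5 with >= 5.
print(cnumber([6, 5, 4, 4, 2]))  # -> 4
```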
&lt;br /&gt;
In this test, Sanson is executed with this command line:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=38&amp;gt;&lt;br /&gt;
sanson.exe -asanson -p%workingFolder% -rdbcats_rels.tsv -clist_outList.tsv -zqaMerged_outStats.tsv -gpdf&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We already know the relations file dbcats_rels.tsv (option -r). The file list_outList.tsv (option -c), generated in the previous step, provides the categories that will be considered. The outStats file qaMerged_outStats.tsv (option -z), generated in the last command of Test 2, provides the protein-level, category-independent quantifications, which supply the protein information for each category.&lt;br /&gt;
&lt;br /&gt;
The output of Sanson is a .gv file (here sanson_simGraph.gv), where category clusters are given in the [http://en.wikipedia.org/wiki/DOT_language DOT language]; they can be interpreted by open-source programs like [http://graphviz.org/ Graphviz] to generate an image file. If Graphviz is installed, Sanson can launch the interpretation itself (if you are using Windows, you only need to specify in the dot.ini file the location of the dot.exe file used by Graphviz). In this test we have produced as output the pdf file sanson_simGraph.pdf (using the -g option), but it can also be interesting to generate an [https://en.wikipedia.org/wiki/Scalable_Vector_Graphics svg] file, which (when opened in suitable browsers) can show the list of protein identifiers present in each category just by hovering the mouse over the category box.&lt;br /&gt;
&lt;br /&gt;
As you will see, each category box has a number of customisable colour columns. Each column represents a protein present in the category, and its colour depends on the protein's standardised log fold change (Z, the standardised version of X = log fold change, meaning that X has been centred and corrected according to the variance). By default, white means no change, green means there is more quantity in sample B than in sample A (as logarithm(A/B) &amp;lt; 0) and, conversely, red means there is more quantity in sample A than in sample B. The numbers on the arrows indicate the total number of proteins in a category that are also present in the category the arrow points to. The absence of an arrow does not mean two categories share no proteins; it just means that the ratio of shared proteins does not reach the calculated threshold specified above.&lt;br /&gt;
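The colour convention can be sketched in Python; the standardisation below is the textbook Z = (X - mean) / sd, given only as an illustration of the idea (Sanson itself performs the real calculation):&lt;br /&gt;

```python
def z_colour(x, mean_x, variance):
    """Standardise a log2 fold change and map its sign to the default
    colour convention described above (illustrative sketch only)."""
    z = (x - mean_x) / variance ** 0.5
    if abs(z) < 1e-12:
        return z, "white"   # no change
    if z < 0:
        return z, "green"   # more quantity in sample B (log2(A/B) < 0)
    return z, "red"         # more quantity in sample A

print(z_colour(-0.8, 0.0, 0.16))  # -> (-2.0, 'green')
```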
&lt;br /&gt;
=== Generating gaussians using SanXoTGauss ===&lt;br /&gt;
[[File:Gauss outGraph.png|thumb|300px|left|Output of SanXoTGauss in Test 3, line 37, showing the theoretical sigmoid for the standard normal distribution (red), the experimental sigmoid for all the proteins in the experiment (blue), and the sigmoids of proteins grouped by ontology (selecting with SanXoTSqueezer only categories that change expression with FDR &amp;lt;= 5% and contain at least 3 proteins).]]&lt;br /&gt;
Representing sigmoids (here, cumulative gaussians) of the different categories, and comparing them to a theoretical normal distribution (thick red line) and to the bulk of proteins (blue line), can help gain better insight into the meaning of a change. A coordinated proteome is one whose proteins belonging to the same categories change in a similar way, together. We can observe this behaviour by representing the proteins classified in the same category as a separate sigmoid. Since we usually represent several categories at once, we draw each in a different colour to better identify them. Categories with a coordinated change will be &amp;quot;parallel&amp;quot; to the normal distribution but displaced (according to the category-level fold change), while categories behaving in an uncoordinated way will tend to cross the normal distribution (as they contain proteins with both increasing and decreasing expression changes).&lt;br /&gt;
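A minimal sketch of this kind of sigmoid comparison, using only the Python standard library (the Z values are hypothetical; SanXoTGauss does the real plotting):&lt;br /&gt;

```python
from statistics import NormalDist

def empirical_sigmoid(z_values):
    """Cumulative curve of sorted standardised fold changes: plotted
    against the standard normal CDF, it gives sigmoids like the ones
    described above."""
    zs = sorted(z_values)
    n = len(zs)
    return [(z, (i + 1) / n) for i, z in enumerate(zs)]

# A "coordinated" category: all its Z values displaced to one side.
points = empirical_sigmoid([0.8, 1.1, 1.4, 0.9, 1.2])
theory = [round(NormalDist().cdf(z), 3) for z, _ in points]
print(points)
print(theory)
```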
&lt;br /&gt;
This operation is done by '''[[SanXoTGauss]]''', using a command like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=39&amp;gt;&lt;br /&gt;
sanxotgauss.exe -agauss -p%workingFolder% -rdbcats_rels.tsv -clist_outList.tsv -zqaMerged_outStats.tsv -g&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The options are the same as in the previous Sanson command, except for -g (which here asks the program not to display the graph but just to save it, so the workflow can keep running unattended).&lt;br /&gt;
&lt;br /&gt;
=== Calculating the degree of coordination using the Coordinometer ===&lt;br /&gt;
&lt;br /&gt;
We have defined a parameter called the '''''degree of coordination''''', which tells us objectively how coordinated a proteome is. This parameter takes into account the number of outlier proteins (proteins that are expressed differently, compared to proteins classified within the same category), and how much the categories are changing. The '''[[Coordinometer]]''' calculates this number using this command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=40&amp;gt;&lt;br /&gt;
coordinometer.exe -acoordination -p%workingFolder% -qqc_noOuts_outStats.tsv -cca_outStats.tsv&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Coordinometer extracts the number of relations tagged as outliers (using the &amp;quot;out&amp;quot; tag in the third column) from file qc_noOuts_outStats.tsv (option -q; these are proteins expressed differently compared to other proteins in the same categories, a calculation from the &amp;quot;top-left&amp;quot; side of the triangle) and gets the number of changing categories from file ca_outStats.tsv (option -c, reading directly the FDR column). The FDR threshold for the latter is 5% by default, but a different threshold can be set using the --cafdr option.&lt;br /&gt;
&lt;br /&gt;
The result is both printed in the console and written to the corresponding log file, coordination_logFile.txt, along with the information used to calculate it:&lt;br /&gt;
&lt;br /&gt;
 Total number of changing categories: 66&lt;br /&gt;
 Total number of relations pointing to changing categories: 106&lt;br /&gt;
 Total number of outlier relations pointing to changing categories: 2&lt;br /&gt;
 Total number of outlier relations pointing to non-changing categories: 46&lt;br /&gt;
 Degree of coordination: 0.684211&lt;br /&gt;
&lt;br /&gt;
Where 0.684211 is the result of:&lt;br /&gt;
 (numRelsChangingCats - numOutliersChangingCats) / (numRelsChangingCats + numOutliersNonChangingCats) =&lt;br /&gt;
 (106 - 2) / (106 + 46) = 0.684211&lt;br /&gt;
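The same arithmetic, as a quick Python check of the numbers reported in coordination_logFile.txt:&lt;br /&gt;

```python
def degree_of_coordination(rels_changing, outs_changing, outs_non_changing):
    """Formula shown above: relations pointing to changing categories
    minus their outliers, over relations pointing to changing categories
    plus outliers pointing to non-changing categories."""
    return (rels_changing - outs_changing) / (rels_changing + outs_non_changing)

# Numbers from the log file above: (106 - 2) / (106 + 46).
print(round(degree_of_coordination(106, 2, 46), 6))  # -> 0.684211
```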
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:SanXoT software package]]&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=Exploring_SanXoT_features&amp;diff=638</id>
		<title>Exploring SanXoT features</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=Exploring_SanXoT_features&amp;diff=638"/>
				<updated>2018-08-17T10:12:51Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Here we will look at the details of three common workflows that use the SanXoT software package.&lt;br /&gt;
&lt;br /&gt;
For tests check '''[[Unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
For a general overview on how to use SanXoT and how it works, see '''[[How to use SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== Test 1: The fundamental workflow ==&lt;br /&gt;
&lt;br /&gt;
We describe here the '''[[fundamental workflow]]''', a workflow similar to the statistics performed by '''[[QuiXoT]]''' using the WSPP (''Weighted Spectrum, Peptide and Protein'') statistical model&amp;lt;ref&amp;gt;Navarro, P., Trevisan-Herraz, M., Bonzon-Kulichenko, E., Núñez, E., Martínez-Acedo, P., Pérez-Hernández, D., Jorge, I., Mesa, R., Calvo, E., Carrascal, M., Hernáez, M.L., García, F., Bárcena, J.A., Ashman, K., Abian, J., Gil, C., Redondo, J.M. and Vázquez, J. (2014) '''[https://pubs.acs.org/doi/abs/10.1021/pr4006958 General statistical framework for quantitative proteomics by stable isotope labeling]'''. ''Journal of proteome research'', 13, 1234-1247.&amp;lt;/ref&amp;gt;. The following text is also available as a [https://en.wikipedia.org/wiki/Batch_file batch file] in '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test1.zip SanXoT_test1.zip]''':&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test1.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 1, FUNDAMENTAL WORKFLOW (SPECTRUM, PEPTIDE, PROTEIN)&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
set startingFile=170415_Marga_GBS_iTRAQ_PSMs.txt&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM create the relation files&lt;br /&gt;
aljamia.exe -asprels -p%workingFolder% -x%startingFile% -i&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -j&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot;&lt;br /&gt;
aljamia.exe -apqrels -p%workingFolder% -x%startingFile% -i&amp;quot;'[Master Protein Accessions]'.split(';')[0]&amp;quot; -j&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;i&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM create the data file at the scan level&lt;br /&gt;
aljamia.exe -asdata114vs118 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([114]/[118], 2)&amp;quot; -k&amp;quot;max([114], [118])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
aljamia.exe -asdata115vs119 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([115]/[119], 2)&amp;quot; -k&amp;quot;max([115], [119])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM The following tab-separated text files (despite their extensions) are generated:&lt;br /&gt;
REM sprels_table.tsv, with two columns, having scan-to-peptide identifiers (rawfile-scanNumber-charge for scans, sequence for peptides)&lt;br /&gt;
REM pqrels_table.tsv, with two columns, having peptide-to-protein identifiers (text of their fasta header as protein identifier)&lt;br /&gt;
REM sdata_table.tsv, with three columns having {scan identifier, log2(foldchange), weight}&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart114vs118 -p%workingFolder% -dsdata114vs118_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxot.exe -asp114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -Vsp114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp114vs118_noOuts -p%workingFolder% -dstart114vs118_calibrated.tsv -rspouts114vs118_tagged.tsv -Vsp114vs118_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein&lt;br /&gt;
sanxot.exe -apq114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq114vs118_noOuts -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqouts114vs118_tagged.tsv -fg -Vpq114vs118_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa114vs118 -p%workingFolder% -dpq114vs118_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_1:_The_fundamental_workflow|Test 1]]'''.&lt;br /&gt;
&lt;br /&gt;
=== Modifications needed ===&lt;br /&gt;
&lt;br /&gt;
The only lines that need to be modified by the user are these (to match the user's file locations):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=6&amp;gt;&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
set startingFile=170415_Marga_GBS_iTRAQ_PSMs.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the first operations, we will parse the data from the starting file (170415_Marga_GBS_iTRAQ_PSMs.txt) using '''[[Aljamia]]''' to generate the relations files. In this case, the input file is a tab-separated text file quantified with [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer 2.1], but any other tab-separated text file that arranges information in columns with headers can be used as input (for example, [https://en.wikipedia.org/wiki/SQLite SQLite] query results from Proteome Discoverer's MSF files, or results from [https://en.wikipedia.org/wiki/Mascot_(software) Mascot]).&lt;br /&gt;
&lt;br /&gt;
=== Parsing the data ===&lt;br /&gt;
&lt;br /&gt;
Let us check the first '''[[Aljamia]]''' operation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=15&amp;gt;&lt;br /&gt;
aljamia.exe -asprels -p%workingFolder% -x%startingFile% -i&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -j&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a common workflow, the SanXoT software package can generate many output files. To make them easier to track, by default all the files generated in a single step share the same prefix, while each type of file has a common suffix. The prefix is set by the ''-a'' argument (it is good practice to ''always'' include this argument). For example, in this case two files are generated: ''sprels_table.tsv'' (the relations table matching '''s'''cans and '''p'''eptides) and ''sprels_log.txt'' (a log file). Unless overridden (check '''[[Aljamia]]''''s help for more information), the filename of the generated relations table will always end in '''_table.tsv'''.&lt;br /&gt;
&lt;br /&gt;
The option -p sets the working folder: all output and input files are assumed to be in that folder, unless a full path is specified. The generated table will include only two columns: the first (-i) containing an unambiguous identifier of the higher level, the peptides (in this case, concatenating the columns &amp;quot;Annotated Sequence&amp;quot; and &amp;quot;Modifications&amp;quot;), and the second (-j) merging the data from the &amp;quot;Spectrum File&amp;quot;, &amp;quot;First Scan&amp;quot; and &amp;quot;Charge&amp;quot; columns (an example of a unique identifier for scans, the lower level). Note that the column headers are case-sensitive.&lt;br /&gt;
&lt;br /&gt;
The operation to generate the file for the relations between proteins and peptides is similar:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=16&amp;gt;&lt;br /&gt;
aljamia.exe -apqrels -p%workingFolder% -x%startingFile% -i&amp;quot;'[Master Protein Accessions]'.split(';')[0]&amp;quot; -j&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;i&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notice that the prefix here is ''pqrels'' (some researchers use '''q''' for proteins, to differentiate them from '''p'''eptides); now the peptide identifiers (&amp;quot;Annotated Sequence&amp;quot; + &amp;quot;Modifications&amp;quot;) are in the second column (because they are now the lower level), while proteins are in the first column (as they are now the higher level). We use UniProt accession numbers as protein identifiers. Proteome Discoverer lists several proteins where each peptide is found, but here we keep only the first one. To do so, we use Python-style commands to split the string in the field &amp;quot;Master Protein Accessions&amp;quot; (using &amp;quot;;&amp;quot;), and then take just the first element (hence the &amp;quot;[0]&amp;quot;). To tell '''[[Aljamia]]''' that this is an operation, and not just text to include, we add the argument -A&amp;quot;i&amp;quot; (to indicate that only column &amp;quot;i&amp;quot; contains operations). We have also included a filter to keep only the identifications that, according to Proteome Discoverer, have either &amp;quot;Medium&amp;quot; or &amp;quot;High&amp;quot; confidence (thus excluding bad identifications).&lt;br /&gt;
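What the -i expression does can be reproduced in plain Python (the accession values below are hypothetical):&lt;br /&gt;

```python
# The field "Master Protein Accessions" may list several accessions
# separated by ";"; splitting and taking element [0] keeps the first.
master_protein_accessions = "P10415; Q07812; P10415-2"
first_accession = master_protein_accessions.split(';')[0]
print(first_accession)  # -> P10415
```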
&lt;br /&gt;
The next two lines parse the information from the same input file, ''170415_Marga_GBS_iTRAQ_PSMs.txt'', to generate two data files, one for the 114 vs 118 iTRAQ comparison, and the other for 115 vs 119 (as you can see in the -a option for the prefix):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=19&amp;gt;&lt;br /&gt;
aljamia.exe -asdata114vs118 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([114]/[118], 2)&amp;quot; -k&amp;quot;max([114], [118])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
aljamia.exe -asdata115vs119 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([115]/[119], 2)&amp;quot; -k&amp;quot;max([115], [119])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Data files in SanXoT include a first column (-i) with a unique text identifier (in this case, the scan identifier mentioned above), a second column (-j) with the log2(A/B) ratio (coded as an enclosed Python-style expression), and a third column (-k) with the weight of the scan (in iTRAQ we simply take the intensity of the most intense peak of the pair). As we did before to split the column with the protein accession number, we use -A&amp;quot;jk&amp;quot; to indicate that there are operations to be performed in the second and third columns (but not the first, which is treated as text).&lt;br /&gt;
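The two computed columns can be reproduced in plain Python (the reporter-ion intensities below are hypothetical):&lt;br /&gt;

```python
import math

# Hypothetical reporter-ion intensities for iTRAQ channels 114 and 118.
i114, i118 = 2000.0, 500.0
log2_ratio = math.log(i114 / i118, 2)  # second column (-j): log2(A/B)
weight = max(i114, i118)               # third column (-k): highest intensity
print(log2_ratio, weight)
```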
&lt;br /&gt;
=== Calibrating the spectra ===&lt;br /&gt;
&lt;br /&gt;
The first step to integrate the data is calibrating the weights assigned to each spectrum (see '''[[How to use SanXoT#Calibrating the data|How to use SanXoT]]''' for more information on why this is important). This task is performed by the module '''[[Klibrate]]''':&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=27&amp;gt;&lt;br /&gt;
klibrate.exe -astart114vs118 -p%workingFolder% -dsdata114vs118_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As in all SanXoT programs, a prefix is assigned to the operation (-a) and the default working folder is specified (-p). Additionally, -d and -r set the file names of the data and relations files generated by Aljamia (sdata114vs118_table.tsv and sprels_table.tsv, respectively). The -g option tells the program not to stop to display the sigmoid graph comparing the data to the standard normal distribution.&lt;br /&gt;
&lt;br /&gt;
=== Integration from spectrum to peptide ===&lt;br /&gt;
&lt;br /&gt;
We are ready for the first integration (using the GIA, or ''Generic Integration Algorithm''), which is done by '''[[SanXoT]]''' with the help of '''[[SanXoTSieve]]'''.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=28&amp;gt;&lt;br /&gt;
sanxot.exe -asp114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -Vsp114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp114vs118_noOuts -p%workingFolder% -dstart114vs118_calibrated.tsv -rspouts114vs118_tagged.tsv -Vsp114vs118_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(We have already described the meaning of the -a, -p, -d, -r and -g options.)&lt;br /&gt;
&lt;br /&gt;
In the first of these three commands, SanXoT integrates the lower-level data (in this case, spectra) into a higher level (here, peptides). When doing so, SanXoT calculates the variance of the lower level with respect to the higher level, i.e., how the fold-changes of the different spectra are distributed when assigned to their peptides. The result is given both in the console and in the [prefix]_infoFile.txt file (sp114vs118_infoFile.txt).&lt;br /&gt;
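&lt;br /&gt;
As a rough illustration (not the actual WSPP statistics), the core of each integration step is a weighted average of the lower-level values; this toy sketch, with hypothetical scan identifiers, shows only that averaging:&lt;br /&gt;
&lt;br /&gt;
```python
def integrate(data, rels):
    """Toy illustration of one integration step: each higher-level value
    is the weighted average of its lower-level values. The real GIA also
    estimates a variance from the dispersion of the lower-level values;
    this sketch shows only the averaging. `data` maps lower-level ids to
    (x, weight); `rels` maps lower-level ids to higher-level ids."""
    groups = {}
    for low, high in rels.items():
        groups.setdefault(high, []).append(data[low])
    higher = {}
    for high, members in groups.items():
        wsum = sum(w for _, w in members)
        higher[high] = sum(x * w for x, w in members) / wsum
    return higher

# Two hypothetical spectra assigned to one peptide
data = {'scan1': (0.5, 3.0), 'scan2': (1.0, 1.0)}
rels = {'scan1': 'PEPTIDEK', 'scan2': 'PEPTIDEK'}
result = integrate(data, rels)  # {'PEPTIDEK': 0.625}
```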
&lt;br /&gt;
Once the variance is known, SanXoTSieve tags the outlier spectra of each peptide (if any) by adding the tag &amp;quot;out&amp;quot; in a third column of the relations file, saving the result as a new relations file (so as not to overwrite the original one). To obtain the variance (essential to decide which spectra are outliers), SanXoTSieve reads the sp114vs118_infoFile.txt given with the option -V, taking it from the line &amp;quot;Variance = ...&amp;quot;; alternatively, the variance can be set directly with the option -v (lowercase), for example as -v0.02, if we want to use a specific, known variance.&lt;br /&gt;
&lt;br /&gt;
For SanXoTSieve, -f defines the FDR threshold below which a spectrum is considered an outlier: spectra whose quantification differs in a statistically significant way (FDR &amp;lt; 0.01) from the other spectra where the same peptide was identified are tagged as outliers.&lt;br /&gt;
&lt;br /&gt;
When SanXoTSieve has produced the relations file with the tagged outliers, SanXoT finally integrates the data; but in this case, instead of calculating the variance, it reuses the variance that was calculated when the outliers were still included (option -V to provide it from the info file, or -v to provide it as a number). We say in this case that ''the variance is forced'', which is indicated with the -f parameter (note that -f has a different meaning here than in SanXoTSieve).&lt;br /&gt;
&lt;br /&gt;
The option --tags=&amp;quot;!out&amp;quot; means that the relations tagged as &amp;quot;out&amp;quot; by SanXoTSieve are excluded (they have the same effect as if they had been cut from the relations file). Several tags can be combined (see the specific help for more information).&lt;br /&gt;
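&lt;br /&gt;
A minimal sketch of the effect of this tag filter (using hypothetical in-memory rows; the real relations files are tab-separated):&lt;br /&gt;
&lt;br /&gt;
```python
def filter_tagged(rows, excluded='out'):
    """Illustrative sketch of the --tags='!out' filter: relations whose
    (optional) third column carries the tag are excluded, as if they had
    been cut from the relations file. Each row is (idsup, idinf) or
    (idsup, idinf, tag)."""
    kept = []
    for row in rows:
        tag = row[2] if len(row) == 3 else ''
        if tag != excluded:
            kept.append((row[0], row[1]))
    return kept

rows = [('PEPTIDEK', 'scan1'),
        ('PEPTIDEK', 'scan2', 'out')]   # scan2 was tagged as an outlier
kept = filter_tagged(rows)  # [('PEPTIDEK', 'scan1')]
```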
&lt;br /&gt;
=== Integration from peptide to protein ===&lt;br /&gt;
&lt;br /&gt;
The integration from peptides to proteins is very similar to the previous one:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=33&amp;gt;&lt;br /&gt;
sanxot.exe -apq114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq114vs118_noOuts -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqouts114vs118_tagged.tsv -fg -Vpq114vs118_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The only differences are: 1) instead of taking as input the output file of Klibrate (which contained the calibrated spectrum information), it takes the output of the previous integration (sp114vs118_noOuts_higherLevel.tsv); and 2) the relations file (pqrels_table.tsv) contains the protein identifiers in the first column (in our case the accession number, although any other unique text string would do the job; in other cases the whole FASTA header can be used, which makes the proteins more human-readable) and the peptide identifiers in the second column (the amino acid sequence including the post-translational modifications).&lt;br /&gt;
&lt;br /&gt;
The result will be a data file containing protein identifiers, fold-change at protein level and weights of proteins, with no more reference to the peptides or spectra.&lt;br /&gt;
&lt;br /&gt;
=== Getting the quantification of proteins ===&lt;br /&gt;
[[File:Qa114vs118 outGraph.png|thumb|300px|Output file of SanXoT (''qa114vs118_outGraph.png'') in Test 1, line 38, showing the sigmoid with protein expression changes (blue) compared to the theoretical sigmoid of the standard normal distribution.]]&lt;br /&gt;
&lt;br /&gt;
In the last step we compare each protein to the set of proteins as a whole. As you can see, the output data file from the previous integration is used as input (pq114vs118_noOuts_higherLevel.tsv); here there is no relations file, as in this case we are not integrating the proteins into different groups (we will see that in Test 2 for experiment merging, and in Test 3 for systems biology).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=39&amp;gt;&lt;br /&gt;
sanxot.exe -aqa114vs118 -p%workingFolder% -dpq114vs118_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of providing a relations file, we include the option -C (for ''confluence''), which has the same effect as a relations file whose first column is a dummy filled entirely with &amp;quot;1&amp;quot;, and whose second column contains one protein identifier per row.&lt;br /&gt;
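&lt;br /&gt;
A minimal sketch of the dummy relations table that -C is equivalent to:&lt;br /&gt;
&lt;br /&gt;
```python
def confluence_rels(identifiers):
    """What the -C option is equivalent to: a relations table whose first
    column is a constant dummy value ('1') for every row, so that all the
    lower-level elements are integrated into a single higher-level group."""
    return [('1', ident) for ident in identifiers]

rels = confluence_rels(['protId1', 'protId2'])
# [('1', 'protId1'), ('1', 'protId2')]
```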
&lt;br /&gt;
The output will include a higher-level file (qa114vs118_higherLevel.tsv, where '''q''' means &amp;quot;protein&amp;quot; and '''a''' means &amp;quot;all&amp;quot;) with only one row, whose fold-change is the weighted arithmetic mean of all proteins (i.e. the grand mean of the whole set). You can see the protein changes in the file qa114vs118_outStats.tsv, where the FDR of changing proteins is presented alongside their fold-change (Xinf), normalised fold-change (Z) and the grand mean (Xsup).&lt;br /&gt;
&lt;br /&gt;
== Test 2: Experiment merging ==&lt;br /&gt;
&lt;br /&gt;
Here we perform a very simple experiment merging, in which we merge two experiments with comparable information (i.e. if the first experiment was &amp;quot;treated vs control&amp;quot;, the second experiment can be a biological replicate, such as another &amp;quot;treated vs control&amp;quot; with a different treated patient and a different control).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test2.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 2, EXPERIMENT MERGING&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test2&amp;quot;&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM now, same for 115vs119&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart115vs119 -p%workingFolder% -dsdata115vs119_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxot.exe -asp115vs119 -p%workingFolder% -dstart115vs119_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts115vs119 -p%workingFolder% -dstart115vs119_calibrated.tsv -rsprels_table.tsv -Vsp115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp115vs119_noOuts -p%workingFolder% -dstart115vs119_calibrated.tsv -rspouts115vs119_tagged.tsv -Vsp115vs119_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein for 115vs119&lt;br /&gt;
sanxot.exe -apq115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq115vs119_noOuts -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqouts115vs119_tagged.tsv -fg -Vpq115vs119_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes for 115vs119&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa115vs119 -p%workingFolder% -dpq115vs119_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM experiment merging: 115vs119 + 114vs118&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
cardenio.exe -amerge -p%workingFolder% -ttagFile.txt&lt;br /&gt;
sanxot.exe -amerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -g&lt;br /&gt;
sanxotsieve.exe -aoutsMerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -Vmerged_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -amerged_noOuts -p%workingFolder% -dmerge_newData.txt -routsMerged_tagged.tsv -fg -Vmerged_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqaMerged -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_2:_Experiment_merging|Test 2]]'''.&lt;br /&gt;
&lt;br /&gt;
=== A parallel analysis ===&lt;br /&gt;
&lt;br /&gt;
In this example we merge just two experiments for simplicity, but it is possible to merge even hundreds of them at once (such as hundreds of technical or biological replicates).&lt;br /&gt;
&lt;br /&gt;
Command lines 6-8 have already been discussed; they must be adapted to the paths on the user's computer. Lines 14-29 do the same as discussed in the previous section, but for the 115 vs 119 comparison:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=14&amp;gt;&lt;br /&gt;
REM now, same for 115vs119&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart115vs119 -p%workingFolder% -dsdata115vs119_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxot.exe -asp115vs119 -p%workingFolder% -dstart115vs119_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts115vs119 -p%workingFolder% -dstart115vs119_calibrated.tsv -rsprels_table.tsv -Vsp115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp115vs119_noOuts -p%workingFolder% -dstart115vs119_calibrated.tsv -rspouts115vs119_tagged.tsv -Vsp115vs119_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein for 115vs119&lt;br /&gt;
sanxot.exe -apq115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq115vs119_noOuts -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqouts115vs119_tagged.tsv -fg -Vpq115vs119_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes for 115vs119&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa115vs119 -p%workingFolder% -dpq115vs119_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Generating the files for the experiment integration ===&lt;br /&gt;
&lt;br /&gt;
The module for merging experiments, '''[[Cardenio]]''', actually does a very simple thing: it takes the protein identifiers, concatenates a suffix for each experiment (specified in an external file called the tagFile) and generates both a data file and a relations file, ready for SanXoT to integrate in the usual way:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=36&amp;gt;&lt;br /&gt;
cardenio.exe -amerge -p%workingFolder% -ttagFile.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We have already discussed the -a and -p options across the SanXoT software package. Cardenio also needs -t, a tab-separated file with this structure:&lt;br /&gt;
&lt;br /&gt;
 tag		file&lt;br /&gt;
 114vs118	qa114vs118_lowerNormV.tsv&lt;br /&gt;
 115vs119	qa115vs119_lowerNormV.tsv&lt;br /&gt;
&lt;br /&gt;
The column &amp;quot;tag&amp;quot; contains the suffix that will be appended to each protein (or other) identifier, while the column &amp;quot;file&amp;quot; contains the file names of the data files. These files are similar to, but distinct from, the _higherLevel files: they are their centred versions. For example, the one with prefix qa114vs118 was generated in the last step of the 114vs118 integration, and the suffix _lowerNormV tells us it is a data file where the fold-changes have been centred by subtracting the grand mean, X'inf, while the weights have been kept unchanged; files with the suffix _lowerNorm'''W''' have identical first and second columns, but the third column, for the weights, has been modified to include the general variance of the protein level, as described in the WSPP statistical model.&lt;br /&gt;
&lt;br /&gt;
These identifiers are rearranged into the merged data and relations files as in the following example.&lt;br /&gt;
&lt;br /&gt;
The first of the original _lowerNormV files looks like this:&lt;br /&gt;
&lt;br /&gt;
 idinf	X'inf		Vinf&lt;br /&gt;
 protId1	0.420637682685	20.8442070633&lt;br /&gt;
 protId2	0.0309318252993	10.405882717&lt;br /&gt;
 protId3	0.235807597358	13.6742537388&lt;br /&gt;
 protId4	-0.939886653794	10.3887308051&lt;br /&gt;
 protId5	0.136998763785	111.910628299&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
And the second file looks like:&lt;br /&gt;
&lt;br /&gt;
 idinf	X'inf		Vinf&lt;br /&gt;
 protId1	0.12726383076	9.37044627116&lt;br /&gt;
 protId2	0.274864264955	4.64481829573&lt;br /&gt;
 protId3	-0.261668207812	5.73886230845&lt;br /&gt;
 protId4	0.40863124663	4.57552314502&lt;br /&gt;
 protId5	0.36753759182	46.0017027453&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
In the output of Cardenio, data is rearranged in merge_newData.txt:&lt;br /&gt;
&lt;br /&gt;
 idinf			X'inf		Vinf&lt;br /&gt;
 protId1'''_114vs118'''	0.420637682685	20.8442070633&lt;br /&gt;
 protId2'''_114vs118'''	0.0309318252993	10.405882717&lt;br /&gt;
 protId3'''_114vs118'''	0.235807597358	13.6742537388&lt;br /&gt;
 protId4'''_114vs118'''	-0.939886653794	10.3887308051&lt;br /&gt;
 protId5'''_114vs118'''	0.136998763785	111.910628299&lt;br /&gt;
 ...&lt;br /&gt;
 protId1'''_115vs119'''	0.12726383076	9.37044627116&lt;br /&gt;
 protId2'''_115vs119'''	0.274864264955	4.64481829573&lt;br /&gt;
 protId3'''_115vs119'''	-0.261668207812	5.73886230845&lt;br /&gt;
 protId4'''_115vs119'''	0.40863124663	4.57552314502&lt;br /&gt;
 protId5'''_115vs119'''	0.36753759182	46.0017027453&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
Along with the corresponding relations file merge_newRels.txt:&lt;br /&gt;
&lt;br /&gt;
 idsup	idinf&lt;br /&gt;
 protId1	protId1'''_114vs118'''&lt;br /&gt;
 protId2	protId2'''_114vs118'''&lt;br /&gt;
 protId3	protId3'''_114vs118'''&lt;br /&gt;
 protId4	protId4'''_114vs118'''&lt;br /&gt;
 protId5	protId5'''_114vs118'''&lt;br /&gt;
 ...&lt;br /&gt;
 protId1	protId1'''_115vs119'''&lt;br /&gt;
 protId2	protId2'''_115vs119'''&lt;br /&gt;
 protId3	protId3'''_115vs119'''&lt;br /&gt;
 protId4	protId4'''_115vs119'''&lt;br /&gt;
 protId5	protId5'''_115vs119'''&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
As can be seen, in the relations file the proteins of the lower level (second column) are linked to their experiment (via the suffix, matching the corresponding data in merge_newData.txt), while in the higher level (first column) the proteins are independent of the experiment. This means that, after integrating, we get only experiment-independent, protein-level information (the information of each protein in each experiment averaged with its corresponding weight).&lt;br /&gt;
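&lt;br /&gt;
The rearrangement above can be sketched as follows (a simplified illustration of what Cardenio does, not its actual implementation):&lt;br /&gt;
&lt;br /&gt;
```python
def cardenio(tag_file):
    """Sketch of the Cardenio rearrangement: every identifier in every
    experiment gets the experiment tag appended as a suffix, producing a
    merged data table plus a relations table mapping the suffixed
    (experiment-dependent) ids back to the plain (experiment-independent)
    ids. `tag_file` maps each tag to that experiment's {id: (x, v)} data."""
    new_data, new_rels = {}, []
    for tag, data in tag_file.items():
        for ident, (x, v) in data.items():
            suffixed = ident + '_' + tag
            new_data[suffixed] = (x, v)
            new_rels.append((ident, suffixed))
    return new_data, new_rels

# Hypothetical centred values for one protein in the two experiments
tag_file = {'114vs118': {'protId1': (0.42, 20.8)},
            '115vs119': {'protId1': (0.13, 9.37)}}
data, rels = cardenio(tag_file)
# rels: [('protId1', 'protId1_114vs118'), ('protId1', 'protId1_115vs119')]
```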
&lt;br /&gt;
=== Last step for experiment integration ===&lt;br /&gt;
&lt;br /&gt;
The last lines perform the integration of the two experiments taking the data and relations files from Cardenio (merge_newData.txt and merge_newRels.txt), as seen in the previous section.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=37&amp;gt;&lt;br /&gt;
sanxot.exe -amerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -g&lt;br /&gt;
sanxotsieve.exe -aoutsMerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -Vmerged_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -amerged_noOuts -p%workingFolder% -dmerge_newData.txt -routsMerged_tagged.tsv -fg -Vmerged_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqaMerged -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The approach here is the same as for the integration from spectrum to peptide or from peptide to protein (but in this case we integrate from &amp;quot;experiment-dependent protein data&amp;quot; to &amp;quot;experiment-independent protein data&amp;quot;):&lt;br /&gt;
:* the first line calculates the variance (storing it in the merged_infoFile.txt file);&lt;br /&gt;
:* then SanXoTSieve tags the outliers (generating the outsMerged_tagged.tsv file with the same relations as merge_newRels.txt, but including the tag &amp;quot;out&amp;quot; in a third column);&lt;br /&gt;
:* the third line integrates the proteins using the relations file and the variance from the two previous commands;&lt;br /&gt;
:* finally, all the experiment-independent proteins are compared to each other, obtaining the variance (stored in merged_noOuts_infoFile.txt) and an FDR that can be used to filter the changing proteins (in the file merged_noOuts_outStats.tsv).&lt;br /&gt;
&lt;br /&gt;
The last step will also be used in the Systems Biology Triangle method (SBT)&amp;lt;ref&amp;gt;García-Marqués, F., Trevisan-Herraz, M., Martínez-Martínez, S., Camafeita, E., Jorge, I., Lopez, J.A., Méndez-Barbero, N., Méndez-Ferrer, S., del Pozo, M.A., Ibáñez, B., Andrés, V., Sánchez-Madrid, F., Redondo, J.M., Bonzon-Kulichenko, E. and Vázquez, J. (2016) '''[http://www.mcponline.org/content/15/5/1740.long A novel systems-biology algorithm for the analysis of coordinated protein responses using quantitative proteomics]'''. ''Molecular &amp;amp; Cellular Proteomics''.&amp;lt;/ref&amp;gt; which we will see now (the &amp;quot;protein-to-all&amp;quot; operation is the &amp;quot;bottom side&amp;quot; of the triangle).&lt;br /&gt;
&lt;br /&gt;
== Test 3: Systems biology ==&lt;br /&gt;
&lt;br /&gt;
In this final test we perform a systems biology analysis of the previously merged experiment using the Systems Biology Triangle method. For that we start with the files generated in Test 2, plus a tab-separated table downloaded directly from the '''[https://en.wikipedia.org/wiki/DAVID DAVID bioinformatics resource]'''.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test3.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 3, SYSTEMS BIOLOGY&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test3&amp;quot;&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM conversion of ontological data from DAVID&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
camacho.exe -adbcats1 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 -c5 -fBioCarta_&lt;br /&gt;
camacho.exe -adbcats2 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 -x&amp;quot;dbcats1_rels.tsv&amp;quot; -c6 -fGOBP_&lt;br /&gt;
camacho.exe -adbcats3 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 -x&amp;quot;dbcats2_rels.tsv&amp;quot; -c7 -fGOCC_&lt;br /&gt;
camacho.exe -adbcats4 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 -x&amp;quot;dbcats3_rels.tsv&amp;quot; -c8 -fGOMF_&lt;br /&gt;
camacho.exe -adbcats5 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 -x&amp;quot;dbcats4_rels.tsv&amp;quot; -c9 -fKEGG_&lt;br /&gt;
camacho.exe -adbcats -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 -x&amp;quot;dbcats5_rels.tsv&amp;quot; -c10 -fPIR_&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM prot to cat&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqc -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -rdbcats_rels.tsv -g&lt;br /&gt;
sanxotsieve.exe -aoutsqc -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -rdbcats_rels.tsv -Vqc_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -aqc_noOuts -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -routsqc_tagged.tsv -fg -Vqc_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM cat to all&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aca -p%workingFolder% -dqc_noOuts_higherLevel.tsv -Cgf -v0&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
sanxotsqueezer.exe -alist -p%workingFolder% -lqc_noOuts_outStats.tsv -uca_outStats.tsv -f0.05 -n3&lt;br /&gt;
sanson.exe -asanson -p%workingFolder% -rdbcats_rels.tsv -clist_outList.tsv -zqaMerged_outStats.tsv -gpdf&lt;br /&gt;
sanxotgauss.exe -agauss -p%workingFolder% -rdbcats_rels.tsv -clist_outList.tsv -zqaMerged_outStats.tsv -g&lt;br /&gt;
sanxotgauss.exe -agauss_noLegend -p%workingFolder% -rdbcats_rels.tsv -clist_outList.tsv -zqaMerged_outStats.tsv -gk&lt;br /&gt;
coordinometer.exe -acoordination -p%workingFolder% -qqc_noOuts_outStats.tsv -cca_outStats.tsv&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_3:_Systems_biology|Test 3]]'''.&lt;br /&gt;
&lt;br /&gt;
=== Getting the protein-to-category relations file ===&lt;br /&gt;
&lt;br /&gt;
We start by generating the relations file we are going to use in the systems biology analysis. It is a simple relations file, where the first column contains a string with the identifier of the category, and the second column a string with the identifier of the protein (with the same philosophy as the tab-separated tables used to indicate the PSMs in the spectrum-to-peptide step, and the peptide-to-protein assignations).&lt;br /&gt;
&lt;br /&gt;
Note that, although we are using DAVID to obtain the ontological categories, we can provide these relations using any tool, or even create custom relations, as long as we use the format mentioned. We can also concatenate new relations (such as our custom relations) to previous relations files.&lt;br /&gt;
&lt;br /&gt;
As we said, a good way to get this information from databases such as Gene Ontology or KEGG is to use DAVID as an intermediary; in this case, we use the module '''[[Camacho]]''', dedicated to this operation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=17&amp;gt;&lt;br /&gt;
camacho.exe -adbcats1 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -q1 -c5 -fBioCarta_&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here, as in all SanXoT modules, we use the options -a and -p as usual (to define a prefix for the output files of this operation and to set the working directory, respectively). Then we give the file name of the tab-separated file downloaded from DAVID (using -d), we indicate the one-based column number containing the protein accession number (option -q; in this case -q1 indicates it is the first column) and the column containing the categories we want to parse into the relations file (using -c; in this case we want to parse the data from BioCarta, which is in column 5), and we define a prefix for the categories (as these are categories from BioCarta, we use -f to set it to &amp;quot;BioCarta_&amp;quot;, where the underscore acts as a separator to make the result more human-readable). The prefix is not mandatory, but it is highly recommended when importing data from different databases, as some categories may have the same name but different protein sets; using a prefix not only avoids such collisions, but also helps track back which database each category in use comes from.&lt;br /&gt;
&lt;br /&gt;
The file fastaHeadersNoSequences.fasta is not mandatory. In this case we include it because we want to use the full FASTA header as the protein identifier (to get more human-readable results). Obviously, it must contain the same identifiers as the ones used for DAVID. Typically, the FASTA file used to identify the proteins should be used, but here we have removed the sequences to avoid distributing an unnecessarily large file with this test.&lt;br /&gt;
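&lt;br /&gt;
A simplified sketch of this parsing step (assuming, purely for illustration, that each category cell holds a comma-separated list of category names; the real DAVID table layout may differ):&lt;br /&gt;
&lt;br /&gt;
```python
def camacho_rels(rows, q=1, c=5, prefix='BioCarta_'):
    """Illustrative sketch of the Camacho parsing: take the protein id
    from one-based column `q` and the category cell from one-based
    column `c` (assumed here to contain a comma-separated list of
    category names), and emit prefixed category-to-protein relations."""
    rels = []
    for row in rows:
        protein = row[q - 1]
        for cat in row[c - 1].split(','):
            cat = cat.strip()
            if cat:
                rels.append((prefix + cat, protein))
    return rels

# Hypothetical row: accession in column 1, BioCarta cell in column 5
row = ['P12345', '', '', '', 'pathwayA,pathwayB']
rels = camacho_rels([row])
# [('BioCarta_pathwayA', 'P12345'), ('BioCarta_pathwayB', 'P12345')]
```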
&lt;br /&gt;
As a result, we get the file dbcats1_rels.xls. Now we concatenate the relations corresponding to the databases of [http://www.geneontology.org Gene Ontology] (BP = biological process, CC = cellular component, and MF = molecular function), [http://www.genome.jp/kegg KEGG] and [https://pir.georgetown.edu PIR]. For each of them we run Camacho again, changing only the -c option (to set the column), -f (to set the category prefix) and the file to which the new relations are concatenated (using the option -x). Other users might prefer to simply overwrite the original file but, in order to keep track of the different operations, we use a different file for each output.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=18&amp;gt;&lt;br /&gt;
camacho.exe -adbcats2 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats1_rels.xls&amp;quot; -c6 -fGOBP_&lt;br /&gt;
camacho.exe -adbcats3 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats2_rels.xls&amp;quot; -c7 -fGOCC_&lt;br /&gt;
camacho.exe -adbcats4 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats3_rels.xls&amp;quot; -c8 -fGOMF_&lt;br /&gt;
camacho.exe -adbcats5 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats4_rels.xls&amp;quot; -c9 -fKEGG_&lt;br /&gt;
camacho.exe -adbcats -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats5_rels.xls&amp;quot; -c10 -fPIR_&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We end up with the final file dbcats_rels.xls; this relations file can now be used for as many Systems Biology Triangle analyses as needed, without having to run Camacho again (unless we want to add databases or an updated ontology from DAVID).&lt;br /&gt;
&lt;br /&gt;
=== From proteins to categories ===&lt;br /&gt;
&lt;br /&gt;
We start by using the relations file just generated, dbcats_rels.xls, to integrate the proteins into categories. Every protein has a different weight, depending on the quality of its scans, the number of peptides detected and quantified for it, and perhaps even the different experiments where it has been found. As a result, the information from each protein carries a different weight within the category where it has been classified (just as happened at the previous levels). Getting the category-level information (rather than a lengthy list of quantified proteins) will help us get the big picture of what is going on in the proteome. Additionally, using the Systems Biology Triangle, we integrate the information of all the quantified proteins, not just a list of proteins whose quantification exceeds a defined threshold (as in over-representation algorithms).&lt;br /&gt;
&lt;br /&gt;
We integrate in the usual manner: first obtaining the variance of the proteins, then tagging the outliers (proteins that behave differently from the other proteins classified in the same group), and finally performing the integration with the previously calculated variance, excluding the outlier proteins:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=28&amp;gt;&lt;br /&gt;
sanxot.exe -aqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -g&lt;br /&gt;
sanxotsieve.exe -aoutsqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -Vqc_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -aqc_noOuts -p%workingFolder% -dmerged_noOuts_higherLevel.xls -routsqc_tagged.xls -fg -Vqc_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This operation is the &amp;quot;top-left&amp;quot; side of the Systems Biology Triangle.&lt;br /&gt;
&lt;br /&gt;
The last integration, where categories are compared to each other, is the &amp;quot;top-right&amp;quot; side of the triangle:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=34&amp;gt;&lt;br /&gt;
sanxot.exe -aca -p%workingFolder% -dqc_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As we did in line 38 of Test 1 and line 41 of Test 2, this is a confluence, meaning that we don't need a relations file, because the whole set of categories is considered a single group.&lt;br /&gt;
&lt;br /&gt;
=== Filtering the most relevant categories using SanXoTSqueezer ===&lt;br /&gt;
&lt;br /&gt;
Now we have the information needed to extract some details from the analysis. Sometimes there are too many categories, which makes the analysis difficult. We could build &amp;quot;supercategories&amp;quot; using a relations file between category and supercategory, but in this case we will just filter the categories with '''[[SanXoTSqueezer]]''', keeping those with at least 3 quantified proteins (using the option -n) and an FDR of at most 5% (using the option -f). We also need the two outStats files, for the protein-to-category operation (&amp;quot;qc&amp;quot;, using the option -l) and the category-to-all operation (&amp;quot;ca&amp;quot;, using the option -u), taken from the previous steps (qc_noOuts_outStats.xls and ca_outStats.xls, respectively):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=37&amp;gt;&lt;br /&gt;
sanxotsqueezer.exe -alist -p%workingFolder% -lqc_noOuts_outStats.xls -uca_outStats.xls -f0.05 -n3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output is a list of categories, with some associated information, in the file list_outList.xls; we will analyse the categories in this list.&lt;br /&gt;
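&lt;br /&gt;
The filtering criteria can be sketched as follows (with hypothetical category entries; the real outStats files contain many more columns):&lt;br /&gt;
&lt;br /&gt;
```python
def squeeze(categories, max_fdr=0.05, min_n=3):
    """Sketch of the SanXoTSqueezer filter: keep the categories with at
    least `min_n` quantified proteins (-n) and an FDR not above
    `max_fdr` (-f). Each entry is (category, n_proteins, fdr)."""
    return [cat for cat, n, fdr in categories
            if n >= min_n and fdr <= max_fdr]

cats = [('GOBP_catA', 5, 0.01),
        ('GOBP_catB', 2, 0.01),   # too few quantified proteins
        ('KEGG_catC', 8, 0.20)]   # FDR too high
kept = squeeze(cats)  # ['GOBP_catA']
```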
&lt;br /&gt;
=== Clustering together similar categories using Sanson ===&lt;br /&gt;
&lt;br /&gt;
The point of integrating proteins into categories is managing less, but more meaningful, information. However, databases can contain so many categories that the results may still be too numerous to interpret biologically.&lt;br /&gt;
&lt;br /&gt;
In some cases there are duplicated categories that contain the same proteins under different names, especially when different databases are combined. This is easy to detect, but more commonly categories have ''similar'' but not ''identical'' sets of proteins, which is a subtler problem. The module '''[[Sanson]]''' helps to detect similar categories, generating clusters of similar sets. A &amp;quot;similarity&amp;quot; threshold can be set (the similarity is given by an FNumber, which depends on how many proteins are shared by two given categories); the algorithm uses a matrix of protein-to-category items (the similarity matrix) and maximises both the number of categories in each cluster (the CNumber) and the number of clusters, partitioning them with a [https://en.wikipedia.org/wiki/Durfee_square Durfee square]-based calculation:&lt;br /&gt;
&lt;br /&gt;
 With this similarity-threshold, are there at least 3 clusters with at least 3 categories? --&amp;gt; true&lt;br /&gt;
 With this similarity-threshold, are there at least 4 clusters with at least 4 categories? --&amp;gt; true&lt;br /&gt;
 With this similarity-threshold, are there at least 5 clusters with at least 5 categories? --&amp;gt; false&lt;br /&gt;
 With this similarity-threshold, are there at least 6 clusters with at least 6 categories? --&amp;gt; false&lt;br /&gt;
&lt;br /&gt;
In this example, it could happen that there are (for example) 5 clusters with 4 categories, or 3 clusters with 6 categories, but not 5 clusters with 5 categories, so the best CNumber in this case would be 4 (note that this is the same logic used to calculate the h-index metric).&lt;br /&gt;
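The CNumber selection described above can be sketched as an h-index-style calculation. The following illustrative Python snippet (not part of the SanXoT package) assumes only a list of cluster sizes; the real Sanson algorithm also works from the similarity matrix:&lt;br /&gt;

```python
def cnumber(cluster_sizes):
    """Largest h such that at least h clusters contain at least h categories.

    Illustrative sketch of the h-index-style logic described above.
    """
    sizes = sorted(cluster_sizes, reverse=True)
    h = 0
    for rank, size in enumerate(sizes, start=1):
        if size >= rank:
            h = rank  # at least `rank` clusters hold at least `rank` categories
    return h

# Example matching the text: there are 4 clusters with 4 or more categories,
# but not 5 clusters with 5 or more, so the CNumber is 4.
print(cnumber([6, 6, 6, 4, 4]))
```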
&lt;br /&gt;
In this test, Sanson is executed with the following command line:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=38&amp;gt;&lt;br /&gt;
sanson.exe -asanson -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -gpdf&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The relations file dbcats_rels.xls (option -r) is already known; the file list_outList.xls (option -c), generated in the previous step, provides the categories that will be considered; and the outStats file qaMerged_outStats.xls (option -z) provides the protein-level, category-independent quantifications (generated in the last command of Test 2), which supply the protein information for each category.&lt;br /&gt;
&lt;br /&gt;
The output of Sanson is a .gv file (here sanson_simGraph.gv), where category clusters are given in the [http://en.wikipedia.org/wiki/DOT_language DOT language]; it can be interpreted by open-source programs such as [http://graphviz.org/ Graphviz] to generate an image file. If Graphviz is installed, Sanson can launch the interpretation itself (you only need to specify in the dot.ini file the location of the dot.exe file used by Graphviz). In this test we have used as output the pdf file sanson_simGraph.pdf (using the -g option), but it can also be interesting to generate an [https://en.wikipedia.org/wiki/Scalable_Vector_Graphics svg] file, which (when opened in a suitable browser) shows the list of protein identifiers present in each category just by hovering the mouse over the category box.&lt;br /&gt;
&lt;br /&gt;
As you will see, each category box has a number of customisable colour columns. Each column represents a protein present in the category, and its colour depends on the standardised log fold-change of that protein (Z, the standardised version of X=logfold, meaning that X has been centred and scaled according to the variance). By default, white means no change, green means there is more quantity in sample B than in sample A (as logarithm(A/B) &amp;lt; 0), and, conversely, red means there is more quantity in sample A than in sample B. The arrows carry numbers that indicate the total number of proteins in a category that are also present in the category the arrow points to. The absence of an arrow does not mean the categories share no proteins; it just means that the ratio of shared proteins does not reach the calculated threshold specified above.&lt;br /&gt;
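As a sketch of how such a standardised value relates to the fold change, consider the following simplified Python example (the function name and inputs are illustrative; SanXoT's actual statistics also account for each element's weight):&lt;br /&gt;

```python
import math

def standardise(x, grand_mean, variance):
    """Centre a log2 fold-change and scale it by the standard deviation.

    Simplified sketch of the standardisation described above; in SanXoT
    the grand mean and variance come from the output/info files, and the
    full model also uses each element's weight.
    """
    return (x - grand_mean) / math.sqrt(variance)

# A protein with X = 1.5, given a grand mean of 0.5 and a variance of 4.0:
print(standardise(1.5, 0.5, 4.0))
```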
&lt;br /&gt;
=== Generating gaussians using SanXoTGauss ===&lt;br /&gt;
[[File:Gauss outGraph.png|thumb|300px|left|Output of SanXoTGauss in Test 3, line 37, showing the theoretical sigmoid for the standard normal distribution (red), the experimental sigmoid for all the proteins in the experiment (blue), and the sigmoids of proteins grouped by ontology (selecting with SanXoTSqueezer only categories that change expression with FDR &amp;lt;= 5% and contain at least 3 proteins).]]&lt;br /&gt;
Representing sigmoids (here, cumulative gaussians) of the different categories, and comparing them to a theoretical normal distribution (thick red line) and to the bulk of proteins (blue line), can help to gain better insight into the meaning of a change. A coordinated proteome is one whose proteins belonging to the same categories change in a similar way, together. We can observe this behaviour by representing the proteins classified in the same category as a separate sigmoid. Since we usually represent several categories at once, we draw each in a different colour to identify them more easily. Categories with a coordinated change will be &amp;quot;parallel&amp;quot; to the normal distribution, but displaced (according to the category-level fold change), while categories behaving in an uncoordinated way will tend to cross the normal distribution (as they contain proteins with both increasing and decreasing expression changes).&lt;br /&gt;
&lt;br /&gt;
This operation is done by '''[[SanXoTGauss]]''', using a command like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=39&amp;gt;&lt;br /&gt;
sanxotgauss.exe -agauss -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -g&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The options are the same as in the previous Sanson command, except -g (which here asks the program not to display the graph, just to save it, which is important to keep the workflow running).&lt;br /&gt;
&lt;br /&gt;
=== Calculating the degree of coordination using the Coordinometer ===&lt;br /&gt;
&lt;br /&gt;
We have defined a parameter called '''''degree of coordination''''', which tells us objectively how coordinated a proteome is. This parameter takes into account the number of outlier proteins (proteins that are expressed differently from the proteins classified within the same category) and how much the categories are changing. The '''[[Coordinometer]]''' calculates this number using this command:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=40&amp;gt;&lt;br /&gt;
coordinometer.exe -acoordination -p%workingFolder% -qqc_noOuts_outStats.xls -cca_outStats.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Coordinometer extracts the number of relations tagged as outliers (with the &amp;quot;out&amp;quot; tag in the third column) from the file qc_noOuts_outStats.xls (option -q; these are proteins expressed differently from the other proteins in the same categories, a calculation from the &amp;quot;top-left&amp;quot; side of the triangle), and gets the number of changing categories from the file ca_outStats.xls (option -c, reading the FDR column directly). The latter threshold is 5% by default, but a different one can be set using the --cafdr option.&lt;br /&gt;
&lt;br /&gt;
The result is both printed in the console and written to the corresponding log file, coordination_logFile.txt, along with the information used to calculate it:&lt;br /&gt;
&lt;br /&gt;
 Total number of changing categories: 66&lt;br /&gt;
 Total number of relations pointing to changing categories: 106&lt;br /&gt;
 Total number of outlier relations pointing to changing categories: 2&lt;br /&gt;
 Total number of outlier relations pointing to non-changing categories: 46&lt;br /&gt;
 Degree of coordination: 0.684211&lt;br /&gt;
&lt;br /&gt;
Where 0.684211 is the result of:&lt;br /&gt;
 (numRelsChangingCats - numOutliersChangingCats) / (numRelsChangingCats + numOutliersNonChangingCats) =&lt;br /&gt;
 (106 - 2) / (106 + 46) = 0.684211&lt;br /&gt;
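The same arithmetic can be reproduced with a minimal Python sketch (using the counts from the log file above; the function name is illustrative):&lt;br /&gt;

```python
def degree_of_coordination(rels_changing, outliers_changing, outliers_non_changing):
    """Degree of coordination as defined in the text:
    (numRelsChangingCats - numOutliersChangingCats) /
    (numRelsChangingCats + numOutliersNonChangingCats)
    """
    return (rels_changing - outliers_changing) / (rels_changing + outliers_non_changing)

# Values taken from coordination_logFile.txt in this test:
print(round(degree_of_coordination(106, 2, 46), 6))
```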
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:SanXoT software package]]&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=Exploring_SanXoT_features&amp;diff=637</id>
		<title>Exploring SanXoT features</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=Exploring_SanXoT_features&amp;diff=637"/>
				<updated>2018-08-17T10:11:32Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: /* Last step for experiment integration */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Here we will look at the details of three common workflows that use the SanXoT software package.&lt;br /&gt;
&lt;br /&gt;
For tests check '''[[Unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
For a general overview on how to use SanXoT and how it works, see '''[[How to use SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== Test 1: The fundamental workflow ==&lt;br /&gt;
&lt;br /&gt;
We describe here the '''[[fundamental workflow]]''', a workflow similar to the statistics done by '''[[QuiXoT]]''' using the WSPP (''Weighted Spectrum, Peptide and Protein'') statistical model&amp;lt;ref&amp;gt;Navarro, P., Trevisan-Herraz, M., Bonzon-Kulichenko, E., Núñez, E., Martínez-Acedo, P., Pérez-Hernández, D., Jorge, I., Mesa, R., Calvo, E., Carrascal, M., Hernáez, M.L., García, F., Bárcena, J.A., Ashman, K., Abian, J., Gil, C., Redondo, J.M. and Vázquez, J. (2014) '''[https://pubs.acs.org/doi/abs/10.1021/pr4006958 General statistical framework for quantitative proteomics by stable isotope labeling]'''. ''Journal of proteome research'', 13, 1234-1247.&amp;lt;/ref&amp;gt;. The following text is available also at a [https://en.wikipedia.org/wiki/Batch_file batch file] in '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test1.zip SanXoT_test1.zip]''':&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test1.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 1, FUNDAMENTAL WORKFLOW (SPECTRUM, PEPTIDE, PROTEIN)&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
 set startingFile=170415_Marga_GBS_iTRAQ_PSMs.txt&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM create the relation files&lt;br /&gt;
aljamia.exe -asprels -p%workingFolder% -x%startingFile% -i&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -j&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot;&lt;br /&gt;
aljamia.exe -apqrels -p%workingFolder% -x%startingFile% -i&amp;quot;'[Master Protein Accessions]'.split(';')[0]&amp;quot; -j&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;i&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM create the data file at the scan level&lt;br /&gt;
aljamia.exe -asdata114vs118 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([114]/[118], 2)&amp;quot; -k&amp;quot;max([114], [118])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
aljamia.exe -asdata115vs119 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([115]/[119], 2)&amp;quot; -k&amp;quot;max([115], [119])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM The following tab-separated text files (despite their extensions) are generated:&lt;br /&gt;
REM sprels_table.tsv, with two columns, having scan-to-peptide identifiers (rawfile-scanNumber-charge for scans, sequence for peptides)&lt;br /&gt;
REM pqrels_table.tsv, with two columns, having peptide-to-protein identifiers (text of their fasta header as protein identifier)&lt;br /&gt;
REM sdata_table.tsv, with three columns having {scan identifier, log2(foldchange), weight}&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart114vs118 -p%workingFolder% -dsdata114vs118_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxot.exe -asp114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -Vsp114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp114vs118_noOuts -p%workingFolder% -dstart114vs118_calibrated.tsv -rspouts114vs118_tagged.tsv -Vsp114vs118_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein&lt;br /&gt;
sanxot.exe -apq114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq114vs118_noOuts -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqouts114vs118_tagged.tsv -fg -Vpq114vs118_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa114vs118 -p%workingFolder% -dpq114vs118_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_1:_The_fundamental_workflow|Test 1]]'''.&lt;br /&gt;
&lt;br /&gt;
=== Modifications needed ===&lt;br /&gt;
&lt;br /&gt;
The only lines that need to be modified by the user are these (to match the user's file locations):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=6&amp;gt;&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
set startingFile=170415_Marga_GBS_iTRAQ_PSMs.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the first operations, we will parse the data from the starting file (170415_Marga_GBS_iTRAQ_PSMs.txt) with '''[[Aljamia]]''' to generate the relations files. In this case, the input file is a tab-separated text file quantified using [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer 2.1], but any other tab-separated text file which arranges information in columns with headers can be used as input (for example [https://en.wikipedia.org/wiki/SQLite SQLite] query results from Proteome Discoverer's MSF files, or results from [https://en.wikipedia.org/wiki/Mascot_(software) Mascot]).&lt;br /&gt;
&lt;br /&gt;
=== Parsing the data ===&lt;br /&gt;
&lt;br /&gt;
Let us check the first '''[[Aljamia]]''' operation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=15&amp;gt;&lt;br /&gt;
aljamia.exe -asprels -p%workingFolder% -x%startingFile% -i&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -j&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a common workflow, the SanXoT software package can generate many output files. To make them easier to track, by default all the files generated in a single step share the same prefix, while each type of file has a common suffix. The prefix is set by the ''-a'' argument (it is good practice to ''always'' include this argument). For example, in this case two files are generated: ''sprels_table.tsv'' (the relations table matching '''s'''cans and '''p'''eptides) and ''sprels_log.txt'' (a log file). Unless overridden (check '''[[Aljamia]]''''s help for more information), the filename of the generated relations table will always end in '''_table.tsv'''.&lt;br /&gt;
&lt;br /&gt;
The option -p sets the working folder: all output and input files are assumed to be in that folder, unless a full path is specified. The generated table will include only two columns: the first (-i) building an unambiguous identifier for the higher level, the peptides (in this case, concatenating the columns &amp;quot;Annotated Sequence&amp;quot; and &amp;quot;Modifications&amp;quot;), and the second (-j) merging the data from the &amp;quot;Spectrum File&amp;quot;, &amp;quot;First Scan&amp;quot; and &amp;quot;Charge&amp;quot; columns (an example of a unique identifier for the scans, the lower level). Note that the column headers are case-sensitive.&lt;br /&gt;
&lt;br /&gt;
The operation to generate the file for the relations between proteins and peptides is similar:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=16&amp;gt;&lt;br /&gt;
aljamia.exe -apqrels -p%workingFolder% -x%startingFile% -i&amp;quot;'[Master Protein Accessions]'.split(';')[0]&amp;quot; -j&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;i&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notice that the prefix here is ''pqrels'' (some researchers use '''q''' for proteins, to differentiate them from '''p'''eptides); now the peptide identifiers (&amp;quot;Annotated Sequence&amp;quot; + &amp;quot;Modifications&amp;quot;) are in the second column (because they are now the lower level), while the proteins are in the first column (as they are now the higher level). We use UniProt accession numbers as protein identifiers. Proteome Discoverer lists several proteins in which a peptide is found, but in this case we keep only the first one. To do so, we use a Python-style expression to split the string in the field &amp;quot;Master Protein Accessions&amp;quot; (using &amp;quot;;&amp;quot;), and then take just the first element (hence the &amp;quot;[0]&amp;quot;). To tell '''[[Aljamia]]''' that this is an operation, and not just text to include, we add the argument -A&amp;quot;i&amp;quot; (to indicate that only column &amp;quot;i&amp;quot; contains operations). We have also included a filter to keep only the identifications that, according to Proteome Discoverer, have either &amp;quot;Medium&amp;quot; or &amp;quot;High&amp;quot; confidence (thereby excluding bad identifications).&lt;br /&gt;
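The effect of that Python-style expression can be checked directly in Python (the accession string below is a hypothetical example of the field's content):&lt;br /&gt;

```python
# Simulate the -i expression: '[Master Protein Accessions]'.split(';')[0]
master_accessions = "P12345;Q67890;A12345"  # hypothetical field content
first_protein = master_accessions.split(';')[0]
print(first_protein)
```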
&lt;br /&gt;
The next two lines parse the information from the same input file, ''170415_Marga_GBS_iTRAQ_PSMs.txt'', to generate two data files, one for the 114 vs 118 iTRAQ comparison, and the other for 115 vs 119 (as you can see in the -a option for the prefix):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=19&amp;gt;&lt;br /&gt;
aljamia.exe -asdata114vs118 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([114]/[118], 2)&amp;quot; -k&amp;quot;max([114], [118])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
aljamia.exe -asdata115vs119 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([115]/[119], 2)&amp;quot; -k&amp;quot;max([115], [119])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Data files in SanXoT include a first column (-i) with a unique text identifier (in this case, the scan identifier mentioned above), a second column (-j) with the log2(A/B) ratio (coded as an enclosed Python-style expression), and a third column (-k) with the weight of the scan (in iTRAQ we simply select the intensity of the more intense peak of the pair). As we did before to split the column with the protein accession numbers, we use -A&amp;quot;jk&amp;quot; to indicate that there are operations to be performed in the second and third columns (but not the first, which is treated as text).&lt;br /&gt;
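For instance, the -j and -k expressions applied to a hypothetical pair of iTRAQ reporter intensities behave like this:&lt;br /&gt;

```python
import math

# Hypothetical reporter-ion intensities for channels 114 and 118
i114, i118 = 2000.0, 500.0

x = math.log(i114 / i118, 2)  # -j expression: log2(A/B) fold-change
w = max(i114, i118)           # -k expression: weight = intensity of the most intense peak
print(x, w)
```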
&lt;br /&gt;
=== Calibrating the spectra ===&lt;br /&gt;
&lt;br /&gt;
The first step to integrate the data is calibrating the weights assigned to each spectrum (see '''[[How to use SanXoT#Calibrating the data|How to use SanXoT]]''' for more information on why this is important). This task is performed by the module '''[[Klibrate]]''':&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=27&amp;gt;&lt;br /&gt;
klibrate.exe -astart114vs118 -p%workingFolder% -dsdata114vs118_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As in all SanXoT programs, there is a prefix assigned to the operation (-a) and the default working folder is specified (-p). Additionally, -d and -r set the file names of the data and relations files generated by Aljamia (sdata114vs118_table.tsv and sprels_table.tsv, respectively). The -g option here tells the program not to stop to show the sigmoid graph with the comparison to the standard normal distribution.&lt;br /&gt;
&lt;br /&gt;
=== Integration from spectrum to peptide ===&lt;br /&gt;
&lt;br /&gt;
We are ready for the first integration (using the GIA, or ''Generic Integration Algorithm''), which is done by '''[[SanXoT]]''' with the help of '''[[SanXoTSieve]]'''.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=28&amp;gt;&lt;br /&gt;
sanxot.exe -asp114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -Vsp114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp114vs118_noOuts -p%workingFolder% -dstart114vs118_calibrated.tsv -rspouts114vs118_tagged.tsv -Vsp114vs118_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(We have already described the meaning of the -a, -p, -d, -r and -g options.)&lt;br /&gt;
&lt;br /&gt;
In the first of these three commands, SanXoT integrates the lower-level data (in this case, spectra) into a higher level (here, peptides). When doing so, SanXoT calculates the variance of the lower level with respect to the higher level, i.e., how the fold-changes of the different spectra are distributed when assigned to different peptides. The result is given both in the console and in the [prefix]_infoFile.txt file (sp114vs118_infoFile.txt).&lt;br /&gt;
&lt;br /&gt;
Once the variance is known, SanXoTSieve tags the outlier spectra of each peptide (if any) by adding the tag &amp;quot;out&amp;quot; in a third column of the relations file, saving it as a new relations file (to avoid overwriting the original one). To provide the variance to SanXoTSieve (essential to define which spectra are outliers), it takes as input the sp114vs118_infoFile.txt file (option -V), reading it from the line &amp;quot;Variance = ...&amp;quot;; alternatively, the variance can be set directly with the option -v (lowercase), for example as -v0.02, if we want to use a specific, known variance.&lt;br /&gt;
&lt;br /&gt;
For SanXoTSieve, -f defines the FDR below which a spectrum is considered an outlier: spectra with FDR &amp;lt; 0.01 show a quantification that differs significantly from the other spectra in which the same peptide was identified (i.e., they are outliers).&lt;br /&gt;
&lt;br /&gt;
When the relations file with tagged outliers has been produced by SanXoTSieve, SanXoT finally integrates the data; but in this case, instead of calculating the variance, it uses the variance calculated when the outliers were still included (option -V to provide the file, or -v to provide it as a number). We say in this case that ''the variance is forced'', which is indicated by the -f parameter (note that -f has a different meaning here than in SanXoTSieve).&lt;br /&gt;
&lt;br /&gt;
The option --tags=&amp;quot;!out&amp;quot; means that relations in the relations file being tagged as &amp;quot;out&amp;quot; (by SanXoTSieve) will be excluded (they will have the same effect as if they had been cut). More tags can be combined (see the specific help for more information).&lt;br /&gt;
&lt;br /&gt;
=== Integration from peptide to protein ===&lt;br /&gt;
&lt;br /&gt;
The integration of peptides into proteins is very similar to the previous one:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=33&amp;gt;&lt;br /&gt;
sanxot.exe -apq114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq114vs118_noOuts -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqouts114vs118_tagged.tsv -fg -Vpq114vs118_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The only differences are: 1) instead of taking as input the output file of Klibrate (which contained the calibrated spectrum information), it takes the output of the previous integration (sp114vs118_noOuts_higherLevel.tsv); and 2) the relations file (pqrels_table.tsv) contains the protein identifiers in the first column (in our case we use the accession number, but any other unique text string can do the job; in other cases we can use the whole FASTA header, which makes the proteins more human-readable), and the peptide identifiers (the amino acid sequence with the post-translational modifications) in the second column.&lt;br /&gt;
&lt;br /&gt;
The result will be a data file containing protein identifiers, fold-change at protein level and weights of proteins, with no more reference to the peptides or spectra.&lt;br /&gt;
&lt;br /&gt;
=== Getting the quantification of proteins ===&lt;br /&gt;
[[File:Qa114vs118 outGraph.png|thumb|300px|Output file of SanXoT (''qa114vs118_outGraph.png'') in Test 1, line 38, showing the sigmoid with protein expression changes (blue) compared to the theoretical sigmoid of the standard normal distribution.]]&lt;br /&gt;
&lt;br /&gt;
In the last step we compare each protein to the set of proteins as a whole. As you can see, the output data file from the previous integration is used as input (pq114vs118_noOuts_higherLevel.tsv); here there is no relations file, as in this case we are not integrating proteins into different groups (we will see this in Test 2 for merging experiments, and in Test 3 for systems biology).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=39&amp;gt;&lt;br /&gt;
sanxot.exe -aqa114vs118 -p%workingFolder% -dpq114vs118_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of providing a relations file, we include the option -C (for ''confluence''), which has the same effect as a relations file whose first column is filled with a dummy &amp;quot;1&amp;quot; in every row and whose second column holds each protein identifier.&lt;br /&gt;
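A dummy relations table equivalent to the -C option could be built like this (the protein identifiers below are hypothetical):&lt;br /&gt;

```python
# Emulate the -C (confluence) option: every protein belongs to one group, "1"
protein_ids = ["P12345", "Q67890", "A12345"]  # hypothetical identifiers
rows = ["1\t" + protein_id for protein_id in protein_ids]
print("\n".join(rows))
```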
&lt;br /&gt;
The output will include a higher-level file (qa114vs118_higherLevel.tsv, where '''q''' means &amp;quot;protein&amp;quot; and '''a''' means &amp;quot;all&amp;quot;) with only one row, whose fold-change is the weighted arithmetic mean of all proteins (i.e. the grand mean of the whole set). You can see the protein changes in the file qa114vs118_outStats.tsv, where the FDR of changing proteins is presented, alongside their fold-change (Xinf), normalised fold-change (Z) and the grand mean (Xsup).&lt;br /&gt;
&lt;br /&gt;
== Test 2: Experiment merging ==&lt;br /&gt;
&lt;br /&gt;
Here we perform a very simple experiment merge, in which we combine two experiments with comparable information (i.e. if the first experiment was &amp;quot;treated vs control&amp;quot;, the second experiment can be a biological replicate, such as another &amp;quot;treated vs control&amp;quot; with a different treated patient and a different control).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test2.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 2, EXPERIMENT MERGING&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test2&amp;quot;&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM now, same for 115vs119&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart115vs119 -p%workingFolder% -dsdata115vs119_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxot.exe -asp115vs119 -p%workingFolder% -dstart115vs119_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts115vs119 -p%workingFolder% -dstart115vs119_calibrated.tsv -rsprels_table.tsv -Vsp115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp115vs119_noOuts -p%workingFolder% -dstart115vs119_calibrated.tsv -rspouts115vs119_tagged.tsv -Vsp115vs119_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein for 115vs119&lt;br /&gt;
sanxot.exe -apq115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq115vs119_noOuts -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqouts115vs119_tagged.tsv -fg -Vpq115vs119_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes for 115vs119&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa115vs119 -p%workingFolder% -dpq115vs119_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM experiment merging: 115vs119 + 114vs118&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
cardenio.exe -amerge -p%workingFolder% -ttagFile.txt&lt;br /&gt;
sanxot.exe -amerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -g&lt;br /&gt;
sanxotsieve.exe -aoutsMerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -Vmerged_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -amerged_noOuts -p%workingFolder% -dmerge_newData.txt -routsMerged_tagged.tsv -fg -Vmerged_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqaMerged -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_2:_Experiment_merging|Test 2]]'''.&lt;br /&gt;
&lt;br /&gt;
=== A parallel analysis ===&lt;br /&gt;
&lt;br /&gt;
In this example we merge just two experiments for simplicity, but it is possible to merge even hundreds of them at once (such as hundreds of technical or biological replicates).&lt;br /&gt;
&lt;br /&gt;
Command lines 6-8 have already been discussed; they must be adapted to the paths on the user's computer. Lines 14-29 do the same as discussed in the previous section, but for the 115 vs 119 comparison:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=14&amp;gt;&lt;br /&gt;
REM now, same for 115vs119&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart115vs119 -p%workingFolder% -dsdata115vs119_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxot.exe -asp115vs119 -p%workingFolder% -dstart115vs119_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts115vs119 -p%workingFolder% -dstart115vs119_calibrated.tsv -rsprels_table.tsv -Vsp115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp115vs119_noOuts -p%workingFolder% -dstart115vs119_calibrated.tsv -rspouts115vs119_tagged.tsv -Vsp115vs119_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein for 115vs119&lt;br /&gt;
sanxot.exe -apq115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq115vs119_noOuts -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqouts115vs119_tagged.tsv -fg -Vpq115vs119_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes for 115vs119&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa115vs119 -p%workingFolder% -dpq115vs119_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Generating the files for the experiment integration ===&lt;br /&gt;
&lt;br /&gt;
The module for merging experiments, '''[[Cardenio]]''', actually does a very simple thing: it just takes protein identifiers, concatenates a suffix for each experiment (specified in an external file called tagFile) and generates both the data file and relations files, ready for SanXoT to integrate in the usual way:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=36&amp;gt;&lt;br /&gt;
cardenio.exe -amerge -p%workingFolder% -ttagFile.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We have already discussed the -a and -p options across the SanXoT software package. Cardenio also needs -t, which points to a tab-separated file with this structure:&lt;br /&gt;
&lt;br /&gt;
 tag		file&lt;br /&gt;
 114vs118	qa114vs118_lowerNormV.tsv&lt;br /&gt;
 115vs119	qa115vs119_lowerNormV.tsv&lt;br /&gt;
&lt;br /&gt;
The column &amp;quot;tag&amp;quot; contains the suffix that will be appended to each protein (or other) identifier, while the column &amp;quot;file&amp;quot; contains the filenames of the data files. These files are similar to, but distinct from, the _higherLevel files: they are their centred versions. For example, the one with prefix qa114vs118 was generated in the last step of the 114vs118 integration, and the suffix _lowerNormV tells us it is a data file where the fold-changes have been centred by subtracting the grand mean, X'inf, while the weights have been kept unchanged. Files with suffix _lowerNorm'''W''' have identical first and second columns, but their third column, for weights, has been modified to include the general variance of the protein level, as described in the WSPP statistical model.&lt;br /&gt;
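&lt;br /&gt;
As a minimal sketch of the _lowerNormV centring just described (illustrative only; we assume here that the grand mean is the weight-weighted mean of the fold-changes, whereas the exact estimator is the one defined in the WSPP model):&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch of how a _lowerNormV file relates to its _higherLevel file:
# every fold-change is centred by subtracting the grand mean X'inf,
# while the weights (third column) stay untouched. Taking the grand
# mean as the weighted mean is an assumption for illustration.
def lower_norm_v(rows):
    """rows: list of (idinf, x, weight) tuples from a _higherLevel file."""
    total_w = sum(w for _, _, w in rows)
    grand_mean = sum(x * w for _, x, w in rows) / total_w
    return [(pid, x - grand_mean, w) for pid, x, w in rows]

centred = lower_norm_v([("p1", 1.0, 2.0), ("p2", 3.0, 2.0)])
```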
&lt;br /&gt;
These identifiers are rearranged into relations and data files as in the following example.&lt;br /&gt;
&lt;br /&gt;
The original _lowerNormV version of the first file looks like:&lt;br /&gt;
&lt;br /&gt;
 idinf	X'inf		Vinf&lt;br /&gt;
 protId1	0.420637682685	20.8442070633&lt;br /&gt;
 protId2	0.0309318252993	10.405882717&lt;br /&gt;
 protId3	0.235807597358	13.6742537388&lt;br /&gt;
 protId4	-0.939886653794	10.3887308051&lt;br /&gt;
 protId5	0.136998763785	111.910628299&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
And the second file looks like:&lt;br /&gt;
&lt;br /&gt;
 idinf	X'inf		Vinf&lt;br /&gt;
 protId1	0.12726383076	9.37044627116&lt;br /&gt;
 protId2	0.274864264955	4.64481829573&lt;br /&gt;
 protId3	-0.261668207812	5.73886230845&lt;br /&gt;
 protId4	0.40863124663	4.57552314502&lt;br /&gt;
 protId5	0.36753759182	46.0017027453&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
In the output of Cardenio, data is rearranged in merge_newData.txt:&lt;br /&gt;
&lt;br /&gt;
 idinf			X'inf		Vinf&lt;br /&gt;
 protId1'''_114vs118'''	0.420637682685	20.8442070633&lt;br /&gt;
 protId2'''_114vs118'''	0.0309318252993	10.405882717&lt;br /&gt;
 protId3'''_114vs118'''	0.235807597358	13.6742537388&lt;br /&gt;
 protId4'''_114vs118'''	-0.939886653794	10.3887308051&lt;br /&gt;
 protId5'''_114vs118'''	0.136998763785	111.910628299&lt;br /&gt;
 ...&lt;br /&gt;
 protId1'''_115vs119'''	0.12726383076	9.37044627116&lt;br /&gt;
 protId2'''_115vs119'''	0.274864264955	4.64481829573&lt;br /&gt;
 protId3'''_115vs119'''	-0.261668207812	5.73886230845&lt;br /&gt;
 protId4'''_115vs119'''	0.40863124663	4.57552314502&lt;br /&gt;
 protId5'''_115vs119'''	0.36753759182	46.0017027453&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
Along with the corresponding relations file merge_newRels.txt:&lt;br /&gt;
&lt;br /&gt;
 idsup	idinf&lt;br /&gt;
 protId1	protId1'''_114vs118'''&lt;br /&gt;
 protId2	protId2'''_114vs118'''&lt;br /&gt;
 protId3	protId3'''_114vs118'''&lt;br /&gt;
 protId4	protId4'''_114vs118'''&lt;br /&gt;
 protId5	protId5'''_114vs118'''&lt;br /&gt;
 ...&lt;br /&gt;
 protId1	protId1'''_115vs119'''&lt;br /&gt;
 protId2	protId2'''_115vs119'''&lt;br /&gt;
 protId3	protId3'''_115vs119'''&lt;br /&gt;
 protId4	protId4'''_115vs119'''&lt;br /&gt;
 protId5	protId5'''_115vs119'''&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
As can be seen, in the relations file the lower-level proteins (second column) are linked to the experiment via the suffix (matching the corresponding data in merge_newData.txt), while the higher-level proteins (first column) are independent from the experiment. This means that, after integrating, we will obtain only experiment-independent, protein-level information (the information of each protein in each experiment averaged with its corresponding weight).&lt;br /&gt;
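&lt;br /&gt;
The rearrangement above can be sketched in a few lines of Python (a hypothetical illustration, not Cardenio's actual code; the function name and data layout are our assumptions):&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical sketch of Cardenio's merging step: suffix every
# identifier with its experiment tag, then emit the merged data rows
# (merge_newData.txt) and the relations table (merge_newRels.txt).
def cardenio_merge(experiments):
    """experiments: list of (tag, rows); each row is (idinf, x, weight)."""
    new_data, new_rels = [], []
    for tag, rows in experiments:
        for idinf, x, w in rows:
            suffixed = idinf + "_" + tag        # e.g. protId1_114vs118
            new_data.append((suffixed, x, w))   # experiment-dependent row
            new_rels.append((idinf, suffixed))  # idsup, idinf
    return new_data, new_rels

data, rels = cardenio_merge([
    ("114vs118", [("protId1", 0.420637682685, 20.8442070633)]),
    ("115vs119", [("protId1", 0.12726383076, 9.37044627116)]),
])
```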
&lt;br /&gt;
=== Last step for experiment integration ===&lt;br /&gt;
&lt;br /&gt;
The last lines perform the integration of the two experiments taking the data and relations files from Cardenio (merge_newData.txt and merge_newRels.txt), as seen in the previous section.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=37&amp;gt;&lt;br /&gt;
sanxot.exe -amerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -g&lt;br /&gt;
sanxotsieve.exe -aoutsMerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -Vmerged_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -amerged_noOuts -p%workingFolder% -dmerge_newData.txt -routsMerged_tagged.tsv -fg -Vmerged_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqaMerged -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The approach here is the same as for the integration from spectrum to peptide or from peptide to protein (but in this case we integrate from &amp;quot;experiment-dependent protein data&amp;quot; to &amp;quot;experiment-independent protein data&amp;quot;):&lt;br /&gt;
:* the first line calculates the variance (storing it in the merged_infoFile.txt file)&lt;br /&gt;
:* then SanXoTSieve tags the outliers (generating the outsMerged_tagged.tsv file with the same relations as merge_newRels.txt, but including the tag &amp;quot;out&amp;quot; in the third column),&lt;br /&gt;
:* the third line integrates the proteins using the relations file and the variance from the two previous commands;&lt;br /&gt;
:* finally, all the experiment-independent proteins are compared to each other, getting the variance (stored in merged_noOuts_infoFile.txt) and providing an FDR that can be used to filter changing proteins (in file merged_noOuts_outStats.tsv).&lt;br /&gt;
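&lt;br /&gt;
The net effect of the integration in the third line can be approximated, per protein, by a weighted mean of its values across experiments. This is a simplified sketch that ignores the integration variance sanxot.exe estimates and folds into the weights (an assumption, for illustration only):&lt;br /&gt;
&lt;br /&gt;
```python
# Simplified sketch of the experiment integration: each protein's
# final value is its per-experiment values averaged with their
# weights (the third column of the data file). The real sanxot.exe
# also incorporates an estimated integration variance.
def weighted_mean(values):
    """values: list of (x, weight) pairs for one protein."""
    total_w = sum(w for _, w in values)
    return sum(x * w for x, w in values) / total_w

# protId1 as quantified in 114vs118 and 115vs119 (tables above)
merged = weighted_mean([(0.420637682685, 20.8442070633),
                        (0.12726383076, 9.37044627116)])
```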
&lt;br /&gt;
The last step will also be used in the Systems Biology Triangle method (SBT)&amp;lt;ref&amp;gt;García-Marqués, F., Trevisan-Herraz, M., Martínez-Martínez, S., Camafeita, E., Jorge, I., Lopez, J.A., Méndez-Barbero, N., Méndez-Ferrer, S., del Pozo, M.A., Ibáñez, B., Andrés, V., Sánchez-Madrid, F., Redondo, J.M., Bonzon-Kulichenko, E. and Vázquez, J. (2016) '''[http://www.mcponline.org/content/15/5/1740.long A novel systems-biology algorithm for the analysis of coordinated protein responses using quantitative proteomics]'''. ''Molecular &amp;amp; Cellular Proteomics''.&amp;lt;/ref&amp;gt; which we will see now (the &amp;quot;protein-to-all&amp;quot; operation is the &amp;quot;bottom side&amp;quot; of the triangle).&lt;br /&gt;
&lt;br /&gt;
== Test 3: Systems biology ==&lt;br /&gt;
&lt;br /&gt;
In this final test we perform a systems biology analysis of the previously merged experiment using the Systems Biology Triangle method. For that we start with the files generated in Test 2, plus a tab-separated table downloaded directly from the '''[https://en.wikipedia.org/wiki/DAVID DAVID bioinformatics resource]'''.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test3.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 3, SYSTEMS BIOLOGY&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test3&amp;quot;&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM conversion of ontological data from DAVID&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
camacho.exe -adbcats1 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -q1 -c5 -fBioCarta_&lt;br /&gt;
camacho.exe -adbcats2 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats1_rels.xls&amp;quot; -c6 -fGOBP_&lt;br /&gt;
camacho.exe -adbcats3 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats2_rels.xls&amp;quot; -c7 -fGOCC_&lt;br /&gt;
camacho.exe -adbcats4 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats3_rels.xls&amp;quot; -c8 -fGOMF_&lt;br /&gt;
camacho.exe -adbcats5 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats4_rels.xls&amp;quot; -c9 -fKEGG_&lt;br /&gt;
camacho.exe -adbcats -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats5_rels.xls&amp;quot; -c10 -fPIR_&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM prot to cat&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -g&lt;br /&gt;
sanxotsieve.exe -aoutsqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -Vqc_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -aqc_noOuts -p%workingFolder% -dmerged_noOuts_higherLevel.xls -routsqc_tagged.xls -fg -Vqc_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM cat to all&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aca -p%workingFolder% -dqc_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
sanxotsqueezer.exe -alist -p%workingFolder% -lqc_noOuts_outStats.xls -uca_outStats.xls -f0.05 -n3&lt;br /&gt;
sanson.exe -asanson -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -gpdf&lt;br /&gt;
sanxotgauss.exe -agauss -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -g&lt;br /&gt;
coordinometer.exe -acoordination -p%workingFolder% -qqc_noOuts_outStats.xls -cca_outStats.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_3:_Systems_biology|Test 3]]'''.&lt;br /&gt;
&lt;br /&gt;
=== Getting the protein-to-category relations file ===&lt;br /&gt;
&lt;br /&gt;
We start by generating the relations file we are going to use in the systems biology analysis. It is a simple relations file, where the first column contains a string with the identifier of the category, and the second column a string with the identifier of the protein (with the same philosophy as the tab-separated tables used to indicate the PSMs in the spectrum-to-peptide step, and the peptide-to-protein assignations).&lt;br /&gt;
&lt;br /&gt;
Note that, although we are using DAVID to obtain the ontological categories, we can provide these relations using any tool, or even create custom relations, as long as we use the format just mentioned. We can also concatenate relations (such as our custom relations) to previous relations files.&lt;br /&gt;
&lt;br /&gt;
As we said, a good way to get this information from databases such as Gene Ontology or KEGG is to use DAVID as an intermediary; in this case, we use the module '''[[Camacho]]''', dedicated to this operation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=17&amp;gt;&lt;br /&gt;
camacho.exe -adbcats1 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -q1 -c5 -fBioCarta_&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here we use, as in all SanXoT modules, the -a and -p options as usual (to define a prefix for the output files of this operation and to set the working directory, respectively). Then we give the filename of the tab-separated file downloaded from DAVID (using -d), we indicate the one-based number of the column containing the protein accession number (option -q; in this case -q1 indicates it is the first column) and the column containing the category we want to parse into the relations file (using -c; in this case we want to parse the data from BioCarta, which are in column 5), and we define a prefix for the category (as these are categories from BioCarta, we use -f to set it to &amp;quot;BioCarta_&amp;quot;, where the underscore acts as a separator to make it more human-readable). The prefix is not mandatory, but it is highly recommended when we import data from different databases, as categories from different databases may share a name while containing different protein sets; using a prefix not only avoids such collisions, but also helps track back from which database each category in use comes.&lt;br /&gt;
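&lt;br /&gt;
A rough sketch of this parsing step (hypothetical code, not Camacho itself; in particular, the assumption that a DAVID cell can list several categories separated by commas may not match the real file format):&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical sketch of Camacho's -q/-c/-f logic: read a tab-separated
# DAVID export, take the protein accession from column q and the
# categories from column c (both one-based), and emit prefixed
# category-to-protein relations.
def david_to_rels(lines, q=1, c=5, prefix="BioCarta_"):
    rels = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        protein = cols[q - 1]
        # assumption: several categories in one cell, comma-separated
        for cat in cols[c - 1].split(","):
            cat = cat.strip()
            if cat:
                rels.append((prefix + cat, protein))
    return rels

rels = david_to_rels(["P12345\tg\td\te\tcatA, catB\n"])
```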
&lt;br /&gt;
The file fastaHeadersNoSequences.fasta is not mandatory. In this case, as we want to use the full FASTA header as protein identifier (to make the results more human-readable), we include it. Obviously, it must contain the same identifiers as the ones used for DAVID. Typically, the FASTA file used to identify the proteins should be used, but here we have removed the sequences to avoid giving the users an unnecessarily large file for this test.&lt;br /&gt;
&lt;br /&gt;
As a result, we get the file dbcats1_rels.xls. Now we concatenate the relations related to the databases for [http://www.geneontology.org Gene Ontology] (for BP = biological process, CC = cellular component, and MF = molecular function), [http://www.genome.jp/kegg KEGG] and [https://pir.georgetown.edu PIR]. For each of them we use Camacho, changing only the -c option (to set the column), -f (for the different category prefix) and including the file to which they concatenate (using the option -x). Other users might just prefer to overwrite the original file, but, in order to keep track of the different operations, we just use a different file for each output.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=18&amp;gt;&lt;br /&gt;
camacho.exe -adbcats2 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats1_rels.xls&amp;quot; -c6 -fGOBP_&lt;br /&gt;
camacho.exe -adbcats3 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats2_rels.xls&amp;quot; -c7 -fGOCC_&lt;br /&gt;
camacho.exe -adbcats4 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats3_rels.xls&amp;quot; -c8 -fGOMF_&lt;br /&gt;
camacho.exe -adbcats5 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats4_rels.xls&amp;quot; -c9 -fKEGG_&lt;br /&gt;
camacho.exe -adbcats -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats5_rels.xls&amp;quot; -c10 -fPIR_&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We end up with the final file dbcats_rels.xls; this relations file can now be used for as many Systems Biology Triangle analyses as needed, without having to run Camacho again (unless we want to add databases or an updated ontology from DAVID).&lt;br /&gt;
&lt;br /&gt;
=== From proteins to categories ===&lt;br /&gt;
&lt;br /&gt;
We start using the relations file just generated, dbcats_rels.xls, to integrate proteins into categories. Every protein has a different weight, depending on the quality of the scans, the number of peptides detected and quantified for the protein, and perhaps even the different experiments where it has been found. As a result, the information from each protein carries a different weight in the category where it has been classified (in the same way as in the previous levels). Getting the category-level information (rather than a lengthy list of quantified proteins) will help us get the big picture of what is going on in the proteome. Additionally, using the Systems Biology Triangle, we integrate the information of all the quantified proteins, not just a list of proteins whose quantification exceeds a defined threshold (as in Over-Representation Algorithms).&lt;br /&gt;
&lt;br /&gt;
We integrate in the usual manner: first getting the variance of the proteins, then tagging outliers (proteins that behave differently from other proteins classified in the same group), and finally performing the integration with the previously calculated variance and excluding the outlier proteins:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=28&amp;gt;&lt;br /&gt;
sanxot.exe -aqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -g&lt;br /&gt;
sanxotsieve.exe -aoutsqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -Vqc_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -aqc_noOuts -p%workingFolder% -dmerged_noOuts_higherLevel.xls -routsqc_tagged.xls -fg -Vqc_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This operation is the &amp;quot;top-left&amp;quot; side of the Systems Biology Triangle.&lt;br /&gt;
&lt;br /&gt;
The last integration, where categories are compared to each other, is the &amp;quot;top-right&amp;quot; side of the triangle:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=34&amp;gt;&lt;br /&gt;
sanxot.exe -aca -p%workingFolder% -dqc_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As we did in line 38 of Test 1, and line 41 of Test 2, this is a confluence, meaning that we don't need a relations file, because the whole set of categories is going to be considered the same group.&lt;br /&gt;
&lt;br /&gt;
=== Filtering the most relevant categories using SanXoTSqueezer ===&lt;br /&gt;
&lt;br /&gt;
Now we have the information needed to get some details about the analysis. Sometimes we have too many categories, so the analysis becomes difficult. We could make &amp;quot;supercategories&amp;quot; using a relations file between category and supercategory, but in this case we will just filter the categories with '''[[SanXoTSqueezer]]''', keeping those with a minimum of 3 quantified proteins (using the option -n) and a maximum FDR of 5% (using the option -f). We also need the two outStats files for the protein-to-category (&amp;quot;qc&amp;quot;, using option -l) and the category-to-all (&amp;quot;ca&amp;quot;, using option -u) operations, taken from previous steps (qc_noOuts_outStats.xls and ca_outStats.xls, respectively):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=37&amp;gt;&lt;br /&gt;
sanxotsqueezer.exe -alist -p%workingFolder% -lqc_noOuts_outStats.xls -uca_outStats.xls -f0.05 -n3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output is a list of categories, with some information associated, in file list_outList.xls; we will analyse the categories in this list.&lt;br /&gt;
&lt;br /&gt;
=== Clustering together similar categories using Sanson ===&lt;br /&gt;
&lt;br /&gt;
The point of integrating proteins into categories is managing less, but more meaningful, information. However, databases can contain so many categories that we may still end up with too many of them to interpret the results biologically.&lt;br /&gt;
&lt;br /&gt;
In some cases we have duplicated categories that contain the same proteins under different names, especially when we combine different databases. This is easy to detect, but more commonly categories have ''similar'' but not ''identical'' sets of proteins, leading to a subtler problem. The module '''[[Sanson]]''' helps detect similar categories, generating clusters of similar sets. A threshold of &amp;quot;similarity&amp;quot; can be set (where the similarity is given by an FNumber, depending on how many proteins are shared by two given categories), but the algorithm uses a matrix of protein-to-category items (the similarity matrix), and maximises both the number of categories in each cluster (the CNumber) and the number of clusters, partitioning them with a [https://en.wikipedia.org/wiki/Durfee_square Durfee square]-based calculation:&lt;br /&gt;
&lt;br /&gt;
 With this similarity-threshold, are there at least 3 clusters with at least 3 categories? --&amp;gt; true&lt;br /&gt;
 With this similarity-threshold, are there at least 4 clusters with at least 4 categories? --&amp;gt; true&lt;br /&gt;
 With this similarity-threshold, are there at least 5 clusters with at least 5 categories? --&amp;gt; false&lt;br /&gt;
 With this similarity-threshold, are there at least 6 clusters with at least 6 categories? --&amp;gt; false&lt;br /&gt;
&lt;br /&gt;
In this example, it could happen that there are (for example) 5 clusters with 4 categories, or 3 clusters with 6 categories, but not 5 clusters with 5 categories, so the best CNumber in this case would be 4 (note that this is the same logic used to calculate the h-index metric).&lt;br /&gt;
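&lt;br /&gt;
The CNumber selection therefore follows the h-index logic; as a sketch (assuming we already know the size of each candidate cluster at a given similarity threshold):&lt;br /&gt;
&lt;br /&gt;
```python
# h-index-style CNumber: the largest k such that at least k clusters
# contain at least k categories each.
def cnumber(cluster_sizes):
    sizes = sorted(cluster_sizes, reverse=True)
    k = 0
    for rank, size in enumerate(sizes, start=1):
        if size >= rank:
            k = rank
        else:
            break
    return k

# e.g. clusters of sizes 6, 6, 6, 4 and 4: four clusters have at
# least 4 categories, but not 5 clusters with at least 5 categories
best = cnumber([6, 6, 6, 4, 4])
```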
&lt;br /&gt;
In this test, Sanson is executed with this command line:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=38&amp;gt;&lt;br /&gt;
sanson.exe -asanson -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -gpdf&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here we already know the relations file dbcats_rels.xls (option -r); the file list_outList.xls (option -c), generated in the previous step, provides the categories that will be considered; and the outStats file qaMerged_outStats.xls (option -z) provides the protein-level, category-independent quantifications (generated in the last command of Test 2), supplying the protein information for each category.&lt;br /&gt;
&lt;br /&gt;
The output of Sanson is a .gv file (here sanson_simGraph.gv), where the category clusters are given in the [http://en.wikipedia.org/wiki/DOT_language DOT language]; it can be interpreted by open-source programs like [http://graphviz.org/ Graphviz] to generate an image file. If Graphviz is installed, Sanson can launch the interpretation itself (you only need to specify, in the dot.ini file, the location of the dot.exe file used by Graphviz). In this test we have used as output the PDF file sanson_simGraph.pdf (using the -g option), but it can also be interesting to generate an [https://en.wikipedia.org/wiki/Scalable_Vector_Graphics svg] file, which (when opened by suitable browsers) can give back the list of protein identifiers present in each category just by hovering the mouse over the category box.&lt;br /&gt;
&lt;br /&gt;
As you will see, each category box has a number of customisable colour columns. Each column represents a protein present in the category, and its colour depends on the standardised log fold-change of that protein (Z, the standardised version of X=logfold, meaning that X has been centred and corrected according to the variance). By default, white means no change, green means there is more quantity in sample B than in sample A (as logarithm(A/B) &amp;lt; 0), and, conversely, red means there is more quantity in sample A than in sample B. The arrows carry numbers that indicate the total number of proteins in a category that are also present in the category the arrow points to. The absence of an arrow does not mean two categories share no proteins; it just means that the ratio of shared proteins does not reach the calculated threshold specified above.&lt;br /&gt;
&lt;br /&gt;
=== Generating gaussians using SanXoTGauss ===&lt;br /&gt;
[[File:Gauss outGraph.png|thumb|300px|left|Output of SanXoTGauss in Test 3, line 37, showing the theoretical sigmoid for the standard normal distribution (red), the experimental sigmoid for all the proteins in the experiment (blue), and the sigmoids of proteins grouped by ontology (selecting with SanXoTSqueezer only categories that are changing expression with an FDR of at most 5% and that have at least 3 proteins).]]&lt;br /&gt;
Representing sigmoids (here, cumulative gaussians) of the different categories, and comparing them to a theoretical normal distribution (thick red line) and to the bulk of proteins (blue line), can be helpful to gain better insight into the meaning of a change. A coordinated proteome is one whose proteins belonging to the same categories change in a similar way, together. We can observe this behaviour by representing the proteins classified in the same category as a separate sigmoid. Since we usually represent several categories at once, we draw each with a different colour to identify them more easily. Categories with a coordinated change will be &amp;quot;parallel&amp;quot; to the normal distribution, but displaced (according to the category-level fold change), while categories behaving in an uncoordinated fashion will tend to cross the normal distribution (as they contain proteins with both increasing and decreasing expression changes).&lt;br /&gt;
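&lt;br /&gt;
The data behind such a plot can be sketched as follows (illustrative only; plotting and scaling choices are our assumptions, not SanXoTGauss internals):&lt;br /&gt;
&lt;br /&gt;
```python
import math

# Sketch of the sigmoid data: the theoretical curve is the standard
# normal CDF; each category's curve is the empirical CDF of the
# standardised changes (Z) of its proteins. A coordinated, displaced
# category roughly follows the normal CDF shifted by its fold change.
def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def empirical_sigmoid(z_values):
    """Return (z, cumulative fraction) points, sorted by z."""
    ordered = sorted(z_values)
    n = len(ordered)
    return [(z, (rank + 1) / n) for rank, z in enumerate(ordered)]

points = empirical_sigmoid([1.0, -1.0, 0.0, 2.0])
```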
&lt;br /&gt;
This operation is done by '''[[SanXoTGauss]]''', using a command like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=39&amp;gt;&lt;br /&gt;
sanxotgauss.exe -agauss -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -g&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The options are the same as in the previous Sanson command, excluding -g (which here is used to tell the program not to display the graph but just to save it, important to keep the workflow running unattended).&lt;br /&gt;
&lt;br /&gt;
=== Calculating the degree of coordination using the Coordinometer ===&lt;br /&gt;
&lt;br /&gt;
We have defined a parameter called '''''degree of coordination''''', which tells us objectively how coordinated a proteome is. This parameter takes into account the number of outlier proteins (proteins that are expressed differently, compared to proteins classified within the same category) and how much the categories are changing. The '''[[Coordinometer]]''' calculates this number using this command:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=40&amp;gt;&lt;br /&gt;
coordinometer.exe -acoordination -p%workingFolder% -qqc_noOuts_outStats.xls -cca_outStats.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Coordinometer extracts the number of relations tagged as outliers (using the &amp;quot;out&amp;quot; tag in the third column) from the file qc_noOuts_outStats.xls (using option -q; these are proteins expressed differently from other proteins in the same categories, a calculation from the &amp;quot;top-left&amp;quot; side of the triangle), and gets the number of changing categories from the file ca_outStats.xls (using option -c, reading the FDR column directly). The FDR threshold for a changing category is set by default to 5%, but a different threshold can be set using the --cafdr option.&lt;br /&gt;
&lt;br /&gt;
The result is given both printed in the console, and in the corresponding log file, coordination_logFile.txt, along with the information used to calculate it:&lt;br /&gt;
&lt;br /&gt;
 Total number of changing categories: 66&lt;br /&gt;
 Total number of relations pointing to changing categories: 106&lt;br /&gt;
 Total number of outlier relations pointing to changing categories: 2&lt;br /&gt;
 Total number of outlier relations pointing to non-changing categories: 46&lt;br /&gt;
 Degree of coordination: 0.684211&lt;br /&gt;
&lt;br /&gt;
Where 0.684211 is the result of:&lt;br /&gt;
 (numRelsChangingCats - numOutliersChangingCats) / (numRelsChangingCats + numOutliersNonChangingCats) =&lt;br /&gt;
 (106 - 2) / (106 + 46) = 0.684211&lt;br /&gt;
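&lt;br /&gt;
In Python form (the function and variable names are ours, mirroring the log file fields above):&lt;br /&gt;
&lt;br /&gt;
```python
# Degree of coordination, computed from the counts in the log file:
# non-outlier relations pointing to changing categories, divided by
# those relations plus the outliers pointing to non-changing ones.
def degree_of_coordination(rels_changing, outs_changing, outs_nonchanging):
    return (rels_changing - outs_changing) / (rels_changing + outs_nonchanging)

degree = round(degree_of_coordination(106, 2, 46), 6)  # 0.684211
```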
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:SanXoT software package]]&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=Exploring_SanXoT_features&amp;diff=636</id>
		<title>Exploring SanXoT features</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=Exploring_SanXoT_features&amp;diff=636"/>
				<updated>2018-08-17T10:10:56Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: /* Modifications needed */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Here we will look at the details of three common workflows that use the SanXoT software package.&lt;br /&gt;
&lt;br /&gt;
For tests check '''[[Unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
For a general overview on how to use SanXoT and how it works, see '''[[How to use SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== Test 1: The fundamental workflow ==&lt;br /&gt;
&lt;br /&gt;
We describe here the '''[[fundamental workflow]]''', a workflow similar to the statistics done by '''[[QuiXoT]]''' using the WSPP (''Weighted Spectrum, Peptide and Protein'') statistical model&amp;lt;ref&amp;gt;Navarro, P., Trevisan-Herraz, M., Bonzon-Kulichenko, E., Núñez, E., Martínez-Acedo, P., Pérez-Hernández, D., Jorge, I., Mesa, R., Calvo, E., Carrascal, M., Hernáez, M.L., García, F., Bárcena, J.A., Ashman, K., Abian, J., Gil, C., Redondo, J.M. and Vázquez, J. (2014) '''[https://pubs.acs.org/doi/abs/10.1021/pr4006958 General statistical framework for quantitative proteomics by stable isotope labeling]'''. ''Journal of proteome research'', 13, 1234-1247.&amp;lt;/ref&amp;gt;. The following text is available also at a [https://en.wikipedia.org/wiki/Batch_file batch file] in '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test1.zip SanXoT_test1.zip]''':&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test1.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 1, FUNDAMENTAL WORKFLOW (SPECTRUM, PEPTIDE, PROTEIN)&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
 set startingFile=170415_Marga_GBS_iTRAQ_PSMs.txt&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM create the relation files&lt;br /&gt;
aljamia.exe -asprels -p%workingFolder% -x%startingFile% -i&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -j&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot;&lt;br /&gt;
aljamia.exe -apqrels -p%workingFolder% -x%startingFile% -i&amp;quot;'[Master Protein Accessions]'.split(';')[0]&amp;quot; -j&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;i&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM create the data file at the scan level&lt;br /&gt;
aljamia.exe -asdata114vs118 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([114]/[118], 2)&amp;quot; -k&amp;quot;max([114], [118])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
aljamia.exe -asdata115vs119 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([115]/[119], 2)&amp;quot; -k&amp;quot;max([115], [119])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM The following tab-separated text files (despite their extensions) are generated:&lt;br /&gt;
REM sprels_table.tsv, with two columns, having scan-to-peptide identifiers (rawfile-scanNumber-charge for scans, sequence for peptides)&lt;br /&gt;
REM pqrels_table.tsv, with two columns, having peptide-to-protein identifiers (text of their fasta header as protein identifier)&lt;br /&gt;
REM sdata_table.tsv, with three columns having {scan identifier, log2(foldchange), weight}&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart114vs118 -p%workingFolder% -dsdata114vs118_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxot.exe -asp114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -Vsp114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp114vs118_noOuts -p%workingFolder% -dstart114vs118_calibrated.tsv -rspouts114vs118_tagged.tsv -Vsp114vs118_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein&lt;br /&gt;
sanxot.exe -apq114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq114vs118_noOuts -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqouts114vs118_tagged.tsv -fg -Vpq114vs118_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa114vs118 -p%workingFolder% -dpq114vs118_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_1:_The_fundamental_workflow|Test 1]]'''.&lt;br /&gt;
&lt;br /&gt;
=== Modifications needed ===&lt;br /&gt;
&lt;br /&gt;
The only lines that need to be modified by the user are these (to match the user's file locations):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=6&amp;gt;&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
set startingFile=170415_Marga_GBS_iTRAQ_PSMs.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the first operations, we parse the data from the starting file (170415_Marga_GBS_iTRAQ_PSMs.txt) using '''[[Aljamia]]''' to generate the relations files. In this case, the input file is a tab-separated text file quantified using [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer 2.1], but any other tab-separated text file that arranges information in columns with headers can be used as input (for example, [https://en.wikipedia.org/wiki/SQLite SQLite] query results from Proteome Discoverer's MSF files, or results from [https://en.wikipedia.org/wiki/Mascot_(software) Mascot]).&lt;br /&gt;
&lt;br /&gt;
=== Parsing the data ===&lt;br /&gt;
&lt;br /&gt;
Let us check the first '''[[Aljamia]]''' operation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=15&amp;gt;&lt;br /&gt;
aljamia.exe -asprels -p%workingFolder% -x%startingFile% -i&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -j&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a common workflow, the SanXoT software package can generate many output files. To make them easier to track, by default all the files generated in a single step share the same prefix, while each type of file has its own common suffix. The prefix is set by the ''-a'' argument (it is good practice to ''always'' include this argument). For example, in this case two files are generated: ''sprels_table.tsv'' (the relations table matching '''s'''cans and '''p'''eptides) and ''sprels_log.txt'' (a log file). Unless overridden (check '''[[Aljamia]]''''s help for more information), the filename of the generated relations table will always end in '''_table.tsv'''.&lt;br /&gt;
&lt;br /&gt;
The -p option sets the working folder: all input and output files are assumed to be in that folder, unless a full path is specified. The generated table will contain only two columns: the first (-i) holds an unambiguous identifier for the higher level, the peptides (in this case, the concatenation of the &amp;quot;Annotated Sequence&amp;quot; and &amp;quot;Modifications&amp;quot; columns), and the second (-j) merges the &amp;quot;Spectrum File&amp;quot;, &amp;quot;First Scan&amp;quot; and &amp;quot;Charge&amp;quot; columns into a unique identifier for the scans, the lower level. Note that the column headers are case-sensitive.&lt;br /&gt;
&lt;br /&gt;
The operation to generate the file for the relations between proteins and peptides is similar:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=16&amp;gt;&lt;br /&gt;
aljamia.exe -apqrels -p%workingFolder% -x%startingFile% -i&amp;quot;'[Master Protein Accessions]'.split(';')[0]&amp;quot; -j&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;i&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notice that the prefix here is ''pqrels'' (some researchers use '''q''' for proteins, to differentiate them from '''p'''eptides); the peptide identifiers (&amp;quot;Annotated Sequence&amp;quot; + &amp;quot;Modifications&amp;quot;) are now in the second column (as peptides are now the lower level), while the proteins are in the first column (as they are now the higher level). We use UniProt accession numbers as protein identifiers. Proteome Discoverer lists every protein in which a peptide is found, but in this case we keep only the first protein of each list. To do so, we use a Python-style expression that splits the string in the &amp;quot;Master Protein Accessions&amp;quot; field (using &amp;quot;;&amp;quot;) and then takes just the first element (hence the &amp;quot;[0]&amp;quot;). To tell '''[[Aljamia]]''' that this is an operation, and not just text to include, we add the argument -A&amp;quot;i&amp;quot; (indicating that only column &amp;quot;i&amp;quot; contains operations). We have also included a filter to keep only the identifications that, according to Proteome Discoverer, have either &amp;quot;Medium&amp;quot; or &amp;quot;High&amp;quot; confidence (thereby excluding low-quality identifications).&lt;br /&gt;
&lt;br /&gt;
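As a side note, the Python-style expression that '''[[Aljamia]]''' evaluates for the protein column can be reproduced in plain Python (the field value below is invented, for illustration only):&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical row mimicking a Proteome Discoverer export; the accession
# list below is invented, not taken from the real input file.
row = {"Master Protein Accessions": "P12345; Q67890; A11111"}

# Equivalent of Aljamia's -i"'[Master Protein Accessions]'.split(';')[0]":
# split the accession list on ";" and keep only the first protein.
first_accession = row["Master Protein Accessions"].split(";")[0].strip()
print(first_accession)  # prints P12345
```
&lt;br /&gt;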
The next two lines parse the same input file, ''170415_Marga_GBS_iTRAQ_PSMs.txt'', to generate two data files: one for the 114 vs 118 iTRAQ comparison and another for the 115 vs 119 comparison (as can be seen in the -a option setting the prefix):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=19&amp;gt;&lt;br /&gt;
aljamia.exe -asdata114vs118 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([114]/[118], 2)&amp;quot; -k&amp;quot;max([114], [118])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
aljamia.exe -asdata115vs119 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([115]/[119], 2)&amp;quot; -k&amp;quot;max([115], [119])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Data files in SanXoT contain a first column (-i) with a unique text identifier (in this case, the scan identifier mentioned above), a second column (-j) with the log2(A/B) ratio (written as an enclosed Python-style expression), and a third column (-k) with the weight of the scan (for iTRAQ we simply take the intensity of the most intense peak of the pair). As we did before to split the column with the protein accession numbers, we use -A&amp;quot;jk&amp;quot; to indicate that the second and third columns contain operations to be evaluated (the first is treated as text).&lt;br /&gt;
&lt;br /&gt;
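For illustration, the values produced by the -j and -k expressions can be reproduced in plain Python (the reporter intensities below are invented):&lt;br /&gt;
&lt;br /&gt;
```python
import math

# Invented iTRAQ reporter intensities for one scan (illustration only).
intensity_114 = 12000.0
intensity_118 = 3000.0

# Column -j: the log2 ratio of the two channels, math.log([114]/[118], 2).
log2_ratio = math.log(intensity_114 / intensity_118, 2)

# Column -k: the weight of the scan, max([114], [118]).
weight = max(intensity_114, intensity_118)

print(log2_ratio, weight)  # prints 2.0 12000.0
```
&lt;br /&gt;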
=== Calibrating the spectra ===&lt;br /&gt;
&lt;br /&gt;
The first step to integrate the data is calibrating the weights assigned to each spectrum (see '''[[How to use SanXoT#Calibrating the data|How to use SanXoT]]''' for more information on why this is important). This task is performed by the module '''[[Klibrate]]''':&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=27&amp;gt;&lt;br /&gt;
klibrate.exe -astart114vs118 -p%workingFolder% -dsdata114vs118_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As in all SanXoT programs, a prefix is assigned to the operation (-a) and the default working folder is specified (-p). Additionally, -d and -r set the file names of the data and relations files generated by Aljamia (sdata114vs118_table.tsv and sprels_table.tsv, respectively). The -g option tells the program not to stop to show the sigmoid graph comparing the data to the standard normal distribution.&lt;br /&gt;
&lt;br /&gt;
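As a toy illustration of the idea behind calibration (this is not the actual '''[[Klibrate]]''' algorithm, which handles many peptides at once and also fits an additional variance term, as described in the WSPP paper), a constant k can be chosen so that the raw weights behave as inverse variances:&lt;br /&gt;
&lt;br /&gt;
```python
import math

# Invented log2 ratios and raw weights of spectra belonging to one peptide.
x = [0.10, 0.30, 0.22, 0.15]
w = [900.0, 1200.0, 700.0, 1000.0]

# Weighted mean of the peptide.
mean = sum(xi * wi for xi, wi in zip(x, w)) / sum(w)

# Pick k so that the standardised values z = (x - mean) * sqrt(k * w)
# have unit sample variance, i.e. so the calibrated weights k * w can be
# read as inverse variances.
s2 = sum(wi * (xi - mean) ** 2 for xi, wi in zip(x, w)) / (len(x) - 1)
k = 1.0 / s2
z = [(xi - mean) * math.sqrt(k * wi) for xi, wi in zip(x, w)]
```
&lt;br /&gt;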
=== Integration from spectrum to peptide ===&lt;br /&gt;
&lt;br /&gt;
We are ready for the first integration (using the GIA, or ''Generic Integration Algorithm''), which is done by '''[[SanXoT]]''' with the help of '''[[SanXoTSieve]]'''.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=28&amp;gt;&lt;br /&gt;
sanxot.exe -asp114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -Vsp114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp114vs118_noOuts -p%workingFolder% -dstart114vs118_calibrated.tsv -rspouts114vs118_tagged.tsv -Vsp114vs118_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(The meaning of the -a, -p, -d, -r and -g options has already been described.)&lt;br /&gt;
&lt;br /&gt;
In the first of these three commands, SanXoT integrates the lower-level data (in this case, spectra) into a higher level (here, peptides). When doing so, SanXoT calculates the variance of the lower level with respect to the higher level, i.e., how the fold-changes of the different spectra are distributed when assigned to different peptides. The result is reported both in the console and in the [prefix]_infoFile.txt file (sp114vs118_infoFile.txt).&lt;br /&gt;
&lt;br /&gt;
Once the variance is known, SanXoTSieve tags the outlier spectra of each peptide (if any) by adding the tag &amp;quot;out&amp;quot; in a third column of the relations file, saving the result as a new relations file (to avoid overwriting the original one). To obtain the variance (essential to decide which spectra are outliers), SanXoTSieve reads the sp114vs118_infoFile.txt given with the -V option, taking the value from its &amp;quot;Variance = ...&amp;quot; line; alternatively, the variance can be set directly with the lowercase -v option (for example, -v0.02) if you want to use a specific, known variance.&lt;br /&gt;
&lt;br /&gt;
For SanXoTSieve, -f defines the FDR below which a spectrum is considered an outlier. This means that spectra with FDR &amp;lt; 0.01 are taken to deviate significantly from the other spectra in which the same peptide was identified (i.e., to be outliers), with a false discovery rate below 1%.&lt;br /&gt;
&lt;br /&gt;
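A generic sketch of this kind of FDR-based outlier detection is shown below (illustrative only, with invented z-values and a simplified per-item Benjamini-Hochberg check; for the exact procedure used by '''[[SanXoTSieve]]''' see the WSPP paper):&lt;br /&gt;
&lt;br /&gt;
```python
import math

def two_tailed_p(z):
    """Two-tailed p-value of z under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Invented standardised deviations of four scans from their peptide means.
z_scores = {"scan1": 0.5, "scan2": -4.2, "scan3": 1.1, "scan4": 3.9}

# Simplified Benjamini-Hochberg: sort p-values ascending and flag items
# whose adjusted value p * n / rank falls below the FDR threshold (-f0.01).
items = sorted(z_scores.items(), key=lambda kv: two_tailed_p(kv[1]))
n = len(items)
outliers = [sid for i, (sid, z) in enumerate(items, start=1)
            if two_tailed_p(z) * n / i < 0.01]
print(outliers)  # prints ['scan2', 'scan4']
```
&lt;br /&gt;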
Once SanXoTSieve has produced the relations file with the tagged outliers, SanXoT finally integrates the data; in this case, instead of recalculating the variance, it reuses the variance calculated when the outliers were still included (option -V to provide the file containing it, or -v to provide it as a number). We say in this case that ''the variance is forced'', which is indicated with the -f parameter (note that -f has a different meaning here than in SanXoTSieve).&lt;br /&gt;
&lt;br /&gt;
The option --tags=&amp;quot;!out&amp;quot; means that relations tagged as &amp;quot;out&amp;quot; in the relations file (by SanXoTSieve) will be excluded (they have the same effect as if they had been removed). More tags can be combined (see the specific help for more information).&lt;br /&gt;
&lt;br /&gt;
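The effect of excluding the tagged relations can be sketched as follows (invented identifiers; the real files are tab-separated text):&lt;br /&gt;
&lt;br /&gt;
```python
# Each relation is (idsup, idinf, tags); values below are invented.
tagged_relations = [
    ("PEPTIDEK_1", "raw1-1001-2", ""),
    ("PEPTIDEK_1", "raw1-1002-2", "out"),   # tagged as outlier
    ("PEPTIDER_2", "raw1-1003-3", ""),
]

# Keep only relations NOT carrying the "out" tag, i.e. exclude outliers,
# as --tags="!out" does.
kept = [rel for rel in tagged_relations if "out" not in rel[2].split(";")]
print([r[1] for r in kept])  # prints ['raw1-1001-2', 'raw1-1003-3']
```
&lt;br /&gt;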
=== Integration from peptide to protein ===&lt;br /&gt;
&lt;br /&gt;
The integration of peptides into proteins is very similar to the previous one:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=33&amp;gt;&lt;br /&gt;
sanxot.exe -apq114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq114vs118_noOuts -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqouts114vs118_tagged.tsv -fg -Vpq114vs118_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The only differences are: 1) instead of taking as input the output file of Klibrate (which contained the calibrated spectrum information), it takes the output of the previous integration (sp114vs118_noOuts_higherLevel.tsv); and 2) the relations file (pqrels_table.tsv) contains the protein identifiers in the first column (here we use the accession number, but any other unique text string would do the job; in other cases the whole FASTA header can be used, which makes proteins more human-readable) and the peptide identifiers in the second column (the amino acid sequence with the post-translational modifications).&lt;br /&gt;
&lt;br /&gt;
The result will be a data file containing the protein identifiers, the fold-changes at the protein level and the protein weights, with no further reference to the peptides or spectra.&lt;br /&gt;
&lt;br /&gt;
=== Getting the quantification of proteins ===&lt;br /&gt;
[[File:Qa114vs118 outGraph.png|thumb|300px|Output file of SanXoT (''qa114vs118_outGraph.png'') in Test 1, line 38, showing the sigmoid with protein expression changes (blue) compared to the theoretical sigmoid of the standard normal distribution.]]&lt;br /&gt;
&lt;br /&gt;
In the last step we compare each protein to the set of proteins as a whole. As you can see, the output data file of the previous integration is used as input (pq114vs118_noOuts_higherLevel.tsv); here there is no relations file, as in this case we are not integrating the proteins into different groups (we will see that in Test 2, for merging experiments, and in Test 3, for systems biology).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=39&amp;gt;&lt;br /&gt;
sanxot.exe -aqa114vs118 -p%workingFolder% -dpq114vs118_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of providing a relations file, we include the -C option (for ''confluence''), which has the same effect as a relations file with a dummy first column containing &amp;quot;1&amp;quot; in every row and a second column with one protein identifier per row.&lt;br /&gt;
&lt;br /&gt;
The output will include a higher-level file (qa114vs118_higherLevel.tsv, where '''q''' means &amp;quot;protein&amp;quot; and '''a''' means &amp;quot;all&amp;quot;) with only one row, in which the fold-change is the weighted arithmetic mean of all proteins (i.e. the grand mean of the whole set). You can see the protein changes in the file qa114vs118_outStats.tsv, which presents the FDR of the changing proteins alongside their fold-changes (Xinf), normalised fold-changes (Z) and the grand mean (Xsup).&lt;br /&gt;
&lt;br /&gt;
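The grand mean computed in this step is simply a weighted arithmetic mean; a minimal sketch, with invented fold-changes and weights and a simplified standardisation that ignores the higher-level variance term of the full WSPP model:&lt;br /&gt;
&lt;br /&gt;
```python
import math

# Invented protein-level data: (identifier, log2 fold-change, weight).
proteins = [("protA", 0.40, 20.0), ("protB", -0.10, 10.0), ("protC", 0.10, 10.0)]

# Grand mean (Xsup): weighted arithmetic mean of all protein fold-changes.
total_weight = sum(w for _, _, w in proteins)
grand_mean = sum(x * w for _, x, w in proteins) / total_weight

# Simplified standardised value for each protein (the real Z also
# incorporates the variance estimated at the protein level).
z_scores = {pid: (x - grand_mean) * math.sqrt(w) for pid, x, w in proteins}
print(round(grand_mean, 3))  # prints 0.2
```
&lt;br /&gt;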
== Test 2: Experiment merging ==&lt;br /&gt;
&lt;br /&gt;
Here we perform a very simple experiment merge, in which we combine two experiments with comparable information (i.e. if the first experiment was &amp;quot;treated vs control&amp;quot;, the second experiment can be a biological replicate, such as another &amp;quot;treated vs control&amp;quot; with a different treated patient and a different control).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test2.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 2, EXPERIMENT MERGING&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test2&amp;quot;&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM now, same for 115vs119&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart115vs119 -p%workingFolder% -dsdata115vs119_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxot.exe -asp115vs119 -p%workingFolder% -dstart115vs119_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts115vs119 -p%workingFolder% -dstart115vs119_calibrated.tsv -rsprels_table.tsv -Vsp115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp115vs119_noOuts -p%workingFolder% -dstart115vs119_calibrated.tsv -rspouts115vs119_tagged.tsv -Vsp115vs119_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein for 115vs119&lt;br /&gt;
sanxot.exe -apq115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq115vs119_noOuts -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqouts115vs119_tagged.tsv -fg -Vpq115vs119_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes for 115vs119&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa115vs119 -p%workingFolder% -dpq115vs119_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM experiment merging: 115vs119 + 114vs118&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
cardenio.exe -amerge -p%workingFolder% -ttagFile.txt&lt;br /&gt;
sanxot.exe -amerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -g&lt;br /&gt;
sanxotsieve.exe -aoutsMerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -Vmerged_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -amerged_noOuts -p%workingFolder% -dmerge_newData.txt -routsMerged_tagged.tsv -fg -Vmerged_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqaMerged -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_2:_Experiment_merging|Test 2]]'''.&lt;br /&gt;
&lt;br /&gt;
=== A parallel analysis ===&lt;br /&gt;
&lt;br /&gt;
In this example we merge just two experiments for simplicity, but it is possible to merge even hundreds of them at once (such as hundreds of technical or biological replicates).&lt;br /&gt;
&lt;br /&gt;
Command lines 6-8 have already been discussed; they must be adapted to the paths on the user's computer. Lines 14-29 do the same as discussed in the previous section, but for the 115 vs 119 comparison:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=14&amp;gt;&lt;br /&gt;
REM now, same for 115vs119&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart115vs119 -p%workingFolder% -dsdata115vs119_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxot.exe -asp115vs119 -p%workingFolder% -dstart115vs119_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts115vs119 -p%workingFolder% -dstart115vs119_calibrated.tsv -rsprels_table.tsv -Vsp115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp115vs119_noOuts -p%workingFolder% -dstart115vs119_calibrated.tsv -rspouts115vs119_tagged.tsv -Vsp115vs119_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein for 115vs119&lt;br /&gt;
sanxot.exe -apq115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq115vs119_noOuts -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqouts115vs119_tagged.tsv -fg -Vpq115vs119_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes for 115vs119&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa115vs119 -p%workingFolder% -dpq115vs119_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Generating the files for the experiment integration ===&lt;br /&gt;
&lt;br /&gt;
The module for merging experiments, '''[[Cardenio]]''', actually does a very simple thing: it takes the protein identifiers, concatenates a suffix for each experiment (specified in an external file called the tagFile) and generates both the data and relations files, ready for SanXoT to integrate in the usual way:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=36&amp;gt;&lt;br /&gt;
cardenio.exe -amerge -p%workingFolder% -ttagFile.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We have already discussed the -a and -p options, which are common across the SanXoT software package. Cardenio also needs the -t option, which points to a tab-separated file with this structure:&lt;br /&gt;
&lt;br /&gt;
 tag		file&lt;br /&gt;
 114vs118	qa114vs118_lowerNormV.tsv&lt;br /&gt;
 115vs119	qa115vs119_lowerNormV.tsv&lt;br /&gt;
&lt;br /&gt;
The &amp;quot;tag&amp;quot; column contains the suffix that will be appended to each protein (or other) identifier, while the &amp;quot;file&amp;quot; column contains the filenames of the data files. These files are similar to, but distinct from, the _higherLevel files: they are their centred versions. For example, the one with the qa114vs118 prefix was generated in the last step of the 114vs118 integration, and the _lowerNormV suffix tells us it is a data file where the fold-changes have been centred by subtracting the grand mean, X'inf, while the weights were kept unchanged; files with the _lowerNorm'''W''' suffix have identical first and second columns, but their third column, for the weights, has been modified to include the general variance of the protein level, as described in the WSPP statistical model.&lt;br /&gt;
&lt;br /&gt;
These identifiers are rearranged into the data and relations files as in the following example.&lt;br /&gt;
&lt;br /&gt;
The first of the original _lowerNormV files looks like:&lt;br /&gt;
&lt;br /&gt;
 idinf	X'inf		Vinf&lt;br /&gt;
 protId1	0.420637682685	20.8442070633&lt;br /&gt;
 protId2	0.0309318252993	10.405882717&lt;br /&gt;
 protId3	0.235807597358	13.6742537388&lt;br /&gt;
 protId4	-0.939886653794	10.3887308051&lt;br /&gt;
 protId5	0.136998763785	111.910628299&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
And the second file looks like:&lt;br /&gt;
&lt;br /&gt;
 idinf	X'inf		Vinf&lt;br /&gt;
 protId1	0.12726383076	9.37044627116&lt;br /&gt;
 protId2	0.274864264955	4.64481829573&lt;br /&gt;
 protId3	-0.261668207812	5.73886230845&lt;br /&gt;
 protId4	0.40863124663	4.57552314502&lt;br /&gt;
 protId5	0.36753759182	46.0017027453&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
In the output of Cardenio, data is rearranged in merge_newData.txt:&lt;br /&gt;
&lt;br /&gt;
 idinf			X'inf		Vinf&lt;br /&gt;
 protId1'''_114vs118'''	0.420637682685	20.8442070633&lt;br /&gt;
 protId2'''_114vs118'''	0.0309318252993	10.405882717&lt;br /&gt;
 protId3'''_114vs118'''	0.235807597358	13.6742537388&lt;br /&gt;
 protId4'''_114vs118'''	-0.939886653794	10.3887308051&lt;br /&gt;
 protId5'''_114vs118'''	0.136998763785	111.910628299&lt;br /&gt;
 ...&lt;br /&gt;
 protId1'''_115vs119'''	0.12726383076	9.37044627116&lt;br /&gt;
 protId2'''_115vs119'''	0.274864264955	4.64481829573&lt;br /&gt;
 protId3'''_115vs119'''	-0.261668207812	5.73886230845&lt;br /&gt;
 protId4'''_115vs119'''	0.40863124663	4.57552314502&lt;br /&gt;
 protId5'''_115vs119'''	0.36753759182	46.0017027453&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
Along with the corresponding relations file merge_newRels.txt:&lt;br /&gt;
&lt;br /&gt;
 idsup	idinf&lt;br /&gt;
 protId1	protId1'''_114vs118'''&lt;br /&gt;
 protId2	protId2'''_114vs118'''&lt;br /&gt;
 protId3	protId3'''_114vs118'''&lt;br /&gt;
 protId4	protId4'''_114vs118'''&lt;br /&gt;
 protId5	protId5'''_114vs118'''&lt;br /&gt;
 ...&lt;br /&gt;
 protId1	protId1'''_115vs119'''&lt;br /&gt;
 protId2	protId2'''_115vs119'''&lt;br /&gt;
 protId3	protId3'''_115vs119'''&lt;br /&gt;
 protId4	protId4'''_115vs119'''&lt;br /&gt;
 protId5	protId5'''_115vs119'''&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
As can be seen, in the relations file the lower level (second column) links each protein to its experiment (via the suffix, matching the corresponding data in merge_newData.txt), while in the higher level (first column) the proteins are independent of the experiment. This means that, after integrating, we will obtain only experiment-independent, protein-level information (the information of each protein in each experiment averaged with its corresponding weight).&lt;br /&gt;
&lt;br /&gt;
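The rearrangement performed by '''[[Cardenio]]''' can be sketched as follows (simplified, with invented values; the real input and output files are tab-separated text):&lt;br /&gt;
&lt;br /&gt;
```python
# Invented per-experiment protein data: (identifier, X'inf, Vinf).
experiments = {
    "114vs118": [("protId1", 0.42, 20.8), ("protId2", 0.03, 10.4)],
    "115vs119": [("protId1", 0.13, 9.4), ("protId2", 0.27, 4.6)],
}

new_data, new_rels = [], []
for tag, rows in experiments.items():
    for prot, x, v in rows:
        suffixed = f"{prot}_{tag}"          # experiment-dependent identifier
        new_data.append((suffixed, x, v))   # rows of merge_newData.txt
        new_rels.append((prot, suffixed))   # rows of merge_newRels.txt

print(new_rels[0])  # prints ('protId1', 'protId1_114vs118')
```
&lt;br /&gt;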
=== Last step for experiment integration ===&lt;br /&gt;
&lt;br /&gt;
The last lines perform the integration of the two experiments taking the data and relations files from Cardenio (merge_newData.txt and merge_newRels.txt), as seen in the previous section.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=37&amp;gt;&lt;br /&gt;
sanxot.exe -amerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -g&lt;br /&gt;
sanxotsieve.exe -aoutsMerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -Vmerged_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -amerged_noOuts -p%workingFolder% -dmerge_newData.txt -routsMerged_tagged.tsv -fg -Vmerged_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqaMerged -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The approach here is the same as for the integrations from spectrum to peptide or from peptide to protein (but in this case we integrate from &amp;quot;experiment-dependent protein data&amp;quot; to &amp;quot;experiment-independent protein data&amp;quot;):&lt;br /&gt;
:* the first line calculates the variance (storing it in the merged_infoFile.txt file);&lt;br /&gt;
:* then SanXoTSieve tags the outliers (generating the outsMerged_tagged.tsv file, which contains the same relations as merge_newRels.txt but includes the tag &amp;quot;out&amp;quot; in a third column);&lt;br /&gt;
:* the third line integrates the proteins using the relations file and the variance from the two previous commands;&lt;br /&gt;
:* finally, all the experiment-independent proteins are compared to each other, obtaining the variance (stored in qaMerged_infoFile.txt) and an FDR that can be used to filter the changing proteins (in the qaMerged_outStats.tsv file).&lt;br /&gt;
&lt;br /&gt;
The last step will also be used in the Systems Biology Triangle method (SBT)&amp;lt;ref&amp;gt;García-Marqués, F., Trevisan-Herraz, M., Martínez-Martínez, S., Camafeita, E., Jorge, I., Lopez, J.A., Méndez-Barbero, N., Méndez-Ferrer, S., del Pozo, M.A., Ibáñez, B., Andrés, V., Sánchez-Madrid, F., Redondo, J.M., Bonzon-Kulichenko, E. and Vázquez, J. (2016) '''[http://www.mcponline.org/content/15/5/1740.long A novel systems-biology algorithm for the analysis of coordinated protein responses using quantitative proteomics]'''. ''Molecular &amp;amp; Cellular Proteomics''.&amp;lt;/ref&amp;gt; which we will see now (the &amp;quot;protein-to-all&amp;quot; operation is the &amp;quot;bottom side&amp;quot; of the triangle).&lt;br /&gt;
&lt;br /&gt;
== Test 3: Systems biology ==&lt;br /&gt;
&lt;br /&gt;
In this final test we perform a systems biology analysis of the previously merged experiment, using the Systems Biology Triangle method. For that, we start from the files generated in Test 2, plus a tab-separated table downloaded directly from the '''[https://en.wikipedia.org/wiki/DAVID DAVID bioinformatics resource]'''.&lt;br /&gt;
&lt;br /&gt;
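As an illustrative sketch of the kind of table conversion performed below by '''[[Camacho]]''' (the rows, column layout and behaviour shown here are assumptions for illustration, not taken from the real DAVID export or from Camacho itself):&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical rows: a protein identifier column (cf. -q1) and one
# annotation column (cf. -c) holding a ";"-separated list of categories.
rows = [
    ("P12345", "h_biopeptidesPathway;h_il1rPathway"),
    ("Q67890", "h_il1rPathway"),
]

prefix = "BioCarta_"   # cf. the -f option, which tags the category source
cat_rels = []
for prot, cats in rows:
    for cat in filter(None, cats.split(";")):
        # One category-to-protein relation per annotation term.
        cat_rels.append((prefix + cat.strip(), prot))

print(cat_rels[0])  # prints ('BioCarta_h_biopeptidesPathway', 'P12345')
```
&lt;br /&gt;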
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test3.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 3, SYSTEMS BIOLOGY&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test3&amp;quot;&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM conversion of ontological data from DAVID&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
camacho.exe -adbcats1 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -q1 -c5 -fBioCarta_&lt;br /&gt;
camacho.exe -adbcats2 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats1_rels.xls&amp;quot; -c6 -fGOBP_&lt;br /&gt;
camacho.exe -adbcats3 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats2_rels.xls&amp;quot; -c7 -fGOCC_&lt;br /&gt;
camacho.exe -adbcats4 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats3_rels.xls&amp;quot; -c8 -fGOMF_&lt;br /&gt;
camacho.exe -adbcats5 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats4_rels.xls&amp;quot; -c9 -fKEGG_&lt;br /&gt;
camacho.exe -adbcats -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats5_rels.xls&amp;quot; -c10 -fPIR_&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM prot to cat&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -g&lt;br /&gt;
sanxotsieve.exe -aoutsqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -Vqc_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -aqc_noOuts -p%workingFolder% -dmerged_noOuts_higherLevel.xls -routsqc_tagged.xls -fg -Vqc_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM cat to all&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aca -p%workingFolder% -dqc_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
sanxotsqueezer.exe -alist -p%workingFolder% -lqc_noOuts_outStats.xls -uca_outStats.xls -f0.05 -n3&lt;br /&gt;
sanson.exe -asanson -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -gpdf&lt;br /&gt;
sanxotgauss.exe -agauss -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -g&lt;br /&gt;
coordinometer.exe -acoordination -p%workingFolder% -qqc_noOuts_outStats.xls -cca_outStats.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_3:_Systems_biology|Test 3]]'''.&lt;br /&gt;
&lt;br /&gt;
=== Getting the protein-to-category relations file ===&lt;br /&gt;
&lt;br /&gt;
We start by generating the relations file we are going to use in the systems biology analysis. It is a simple relations file, where the first column contains a string with the identifier of the category, and the second column a string with the identifier of the protein (with the same philosophy as the tab-separated tables used to indicate the PSMs in the spectrum-to-peptide table, and the peptide-to-protein assignations).&lt;br /&gt;
&lt;br /&gt;
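As a hedged illustration of this two-column format, the following Python sketch writes a small relations file by hand (the category and protein identifiers are invented for this example, and the helper is not part of SanXoT):&lt;br /&gt;
&lt;br /&gt;

```python
# Sketch of a protein-to-category relations file: first column is the
# category identifier, second column the protein identifier.
# All identifiers below are invented for illustration.
rows = [
    ("KEGG_Glycolysis", "P04406"),
    ("KEGG_Glycolysis", "P00558"),
    ("GOBP_translation", "P62258"),
]
with open("custom_rels.tsv", "w") as f:
    for category, protein in rows:
        f.write(category + "\t" + protein + "\n")
```

A file built this way (or concatenated to one produced by Camacho) follows the same format as the relations files used in the integrations below.&lt;br /&gt;
&lt;br /&gt;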
Note that, although we are using DAVID to obtain the ontological categories, we can provide these relations using any tool, or even create custom relations, as long as we use the format mentioned. We can also concatenate relations (such as our custom relations) to previous relations files.&lt;br /&gt;
&lt;br /&gt;
As we said, a good way to get the information from databases such as Gene Ontology or KEGG is using DAVID as an intermediary; in this case, we use the module '''[[Camacho]]''', dedicated to this operation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=17&amp;gt;&lt;br /&gt;
camacho.exe -adbcats1 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -q1 -c5 -fBioCarta_&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here we use, as in all SanXoT modules, the options -a and -p as usual (to define a prefix for the output files of this operation and to set the working directory, respectively). Then we give the filename of the tab-separated file downloaded from DAVID (using -d), we indicate the column number (one-based) containing the protein accession number (option -q; in this case -q1 indicates it is the first column) and the column containing the category we want to parse to the relations file (using -c; in this case we want to parse the data from BioCarta, which are in column 5), and we define a prefix for the category (as these are categories from BioCarta, we use -f to set this to &amp;quot;BioCarta_&amp;quot;, where the underscore is used as a separator to make it more human-readable). The prefix is not mandatory, but it is highly recommended when we import data from different databases, as some categories may have the same name but different protein sets; using a prefix not only avoids overlapping, but also helps track which database each category in use comes from.&lt;br /&gt;
&lt;br /&gt;
The file fastaHeadersNoSequences.fasta is not mandatory. In this case, as we want to use the full FASTA header as the protein identifier (for more human-readable results), we include it. Obviously, it must contain the same identifiers as the ones used for DAVID. Typically, the FASTA file used to identify the proteins should be used, but in this case we have removed the sequences to avoid giving the users an excessively large file in this test.&lt;br /&gt;
&lt;br /&gt;
As a result, we get the file dbcats1_rels.xls. Now we concatenate the relations from the databases for [http://www.geneontology.org Gene Ontology] (BP = biological process, CC = cellular component, and MF = molecular function), [http://www.genome.jp/kegg KEGG] and [https://pir.georgetown.edu PIR]. For each of them we use Camacho, changing only the -c option (to set the column) and -f (for the different category prefix), and including the file to which they concatenate (using the option -x). Other users might prefer to overwrite the original file, but, in order to keep track of the different operations, we use a different file for each output.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=18&amp;gt;&lt;br /&gt;
camacho.exe -adbcats2 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats1_rels.xls&amp;quot; -c6 -fGOBP_&lt;br /&gt;
camacho.exe -adbcats3 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats2_rels.xls&amp;quot; -c7 -fGOCC_&lt;br /&gt;
camacho.exe -adbcats4 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats3_rels.xls&amp;quot; -c8 -fGOMF_&lt;br /&gt;
camacho.exe -adbcats5 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats4_rels.xls&amp;quot; -c9 -fKEGG_&lt;br /&gt;
camacho.exe -adbcats -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats5_rels.xls&amp;quot; -c10 -fPIR_&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We end up with the final file dbcats_rels.xls; this relations file can now be used for as many Systems Biology Triangle analyses as needed, without needing to use Camacho again (unless we want to add databases or an updated ontology from DAVID).&lt;br /&gt;
&lt;br /&gt;
=== From proteins to categories ===&lt;br /&gt;
&lt;br /&gt;
We start using the relations file just generated, dbcats_rels.xls, to integrate proteins into categories. Every protein has a different weight, depending on the quality of the scans, the number of peptides detected and quantified for the protein, and perhaps even the different experiments where it has been found. As a result, the information from each protein has a different weight in the category where it has been classified (in the same way as happened at the previous levels). Getting the category-level information (rather than a lengthy list of quantified proteins) will help us get the big picture of what is going on in the proteome. Additionally, using the Systems Biology Triangle, we integrate the information of all the proteins quantified, not just a list of proteins whose quantification exceeds a defined threshold (as in over-representation algorithms).&lt;br /&gt;
&lt;br /&gt;
We integrate in the usual manner: first getting the variance of the proteins, then tagging outliers (proteins that behave differently from other proteins classified in the same group) and finally performing the integration with the previously calculated variance, excluding the outlier proteins:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=28&amp;gt;&lt;br /&gt;
sanxot.exe -aqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -g&lt;br /&gt;
sanxotsieve.exe -aoutsqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -Vqc_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -aqc_noOuts -p%workingFolder% -dmerged_noOuts_higherLevel.xls -routsqc_tagged.xls -fg -Vqc_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This operation is the &amp;quot;top-left&amp;quot; side of the Systems Biology Triangle.&lt;br /&gt;
&lt;br /&gt;
The last integration, where categories are compared to each other, is the &amp;quot;top-right&amp;quot; side of the triangle:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=34&amp;gt;&lt;br /&gt;
sanxot.exe -aca -p%workingFolder% -dqc_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As we did in line 38 of Test 1, and line 41 of Test 2, this is a confluence, meaning that we don't need a relations file, because the whole set of categories is going to be considered the same group.&lt;br /&gt;
&lt;br /&gt;
=== Filtering the most relevant categories using SanXoTSqueezer ===&lt;br /&gt;
&lt;br /&gt;
Now we have the information needed to get some details about the analysis. Sometimes we have too many categories, so the analysis becomes difficult. We could make &amp;quot;supercategories&amp;quot; using a relations file between category and supercategory, but in this case we will just filter the categories with '''[[SanXoTSqueezer]]''', keeping those with a minimum of 3 quantified proteins (using the option -n) and a maximum FDR of 5% (using the option -f). We also need the two outStats files for the protein-to-category (&amp;quot;qc&amp;quot;, using option -l) and the category-to-all (&amp;quot;ca&amp;quot;, using option -u) operations, taken from previous steps (qc_noOuts_outStats.xls and ca_outStats.xls, respectively):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=37&amp;gt;&lt;br /&gt;
sanxotsqueezer.exe -alist -p%workingFolder% -lqc_noOuts_outStats.xls -uca_outStats.xls -f0.05 -n3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output is a list of categories, with some associated information, in the file list_outList.xls; we will analyse the categories in this list.&lt;br /&gt;
&lt;br /&gt;
=== Clustering together similar categories using Sanson ===&lt;br /&gt;
&lt;br /&gt;
The point of integrating proteins into categories is managing less, but more meaningful, information. However, databases can contain so many categories that the results sometimes become too numerous to interpret biologically.&lt;br /&gt;
&lt;br /&gt;
In some cases we have duplicated categories that contain the same proteins under different names, especially when we combine different databases. This is easy to detect, but more commonly categories have ''similar'' but not ''identical'' sets of proteins, leading to a more subtle problem. The module '''[[Sanson]]''' helps detect similar categories, generating clusters of similar sets. A &amp;quot;similarity&amp;quot; threshold can be set (where the similarity is given by an FNumber, depending on how many proteins are shared by two given categories), but the algorithm uses a matrix of protein-to-category items (the similarity matrix), and maximises the number of categories in each cluster (the CNumber) and the number of clusters, partitioning them with a [https://en.wikipedia.org/wiki/Durfee_square Durfee square]-based calculation:&lt;br /&gt;
&lt;br /&gt;
 With this similarity-threshold, are there at least 3 clusters with at least 3 categories? --&amp;gt; true&lt;br /&gt;
 With this similarity-threshold, are there at least 4 clusters with at least 4 categories? --&amp;gt; true&lt;br /&gt;
 With this similarity-threshold, are there at least 5 clusters with at least 5 categories? --&amp;gt; false&lt;br /&gt;
 With this similarity-threshold, are there at least 6 clusters with at least 6 categories? --&amp;gt; false&lt;br /&gt;
&lt;br /&gt;
In this example, it could happen that there are (for example) 5 clusters with 4 categories, or 3 clusters with 6 categories, but not 5 clusters with 5 categories, so the best CNumber in this case would be 4 (note that this is the same logic used to calculate the h-index metric).&lt;br /&gt;
&lt;br /&gt;
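The CNumber selection just described can be sketched in Python, assuming the clusters have already been formed for a given similarity threshold (the function below only mirrors the h-index-style logic, not Sanson's actual implementation):&lt;br /&gt;
&lt;br /&gt;

```python
def cnumber(cluster_sizes):
    # Largest n such that at least n clusters contain
    # at least n categories each (h-index-style logic).
    sizes = sorted(cluster_sizes, reverse=True)
    n = 0
    while n < len(sizes) and sizes[n] >= n + 1:
        n += 1
    return n

# 5 clusters of sizes 6, 5, 4, 4 and 2: there are at least 4 clusters
# with at least 4 categories, but not 5 clusters with at least 5.
print(cnumber([6, 5, 4, 4, 2]))  # 4
```
&lt;br /&gt;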
In this test, Sanson is executed with this command line:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=38&amp;gt;&lt;br /&gt;
sanson.exe -asanson -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -gpdf&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here the relations file dbcats_rels.xls (option -r) is already familiar; list_outList.xls (option -c), generated in the previous step, provides the categories that will be considered; and the outStats file qaMerged_outStats.xls (option -z), generated in the last command of Test 2, provides the protein-level, category-independent quantifications, supplying the protein information for each category.&lt;br /&gt;
&lt;br /&gt;
The output of Sanson is a .gv file (here sanson_simGraph.gv), where category clusters are given in the [http://en.wikipedia.org/wiki/DOT_language DOT language]; it can be interpreted by open-source programs like [http://graphviz.org/ Graphviz] to generate an image file. If Graphviz is installed, Sanson can launch the interpretation itself (you only need to specify in the dot.ini file the location of the dot.exe file used by Graphviz). In this test, we have used as output the PDF file sanson_simGraph.pdf (using the -g option), but it can also be interesting to generate an [https://en.wikipedia.org/wiki/Scalable_Vector_Graphics svg] file, which (when opened in suitable browsers) shows the list of protein identifiers present in each category just by hovering the mouse over the category box.&lt;br /&gt;
&lt;br /&gt;
As you will see, each category box has a number of customisable colour columns. Each column represents a protein present in the category, and its colour depends on the protein's standardised log-fold change (Z, the standardised version of X = log-fold change, meaning that X has been centred and corrected according to the variance). By default, white means no change, green means there is more quantity from sample B than from sample A (as logarithm(A/B) &amp;lt; 0), and, conversely, red means there is more quantity from sample A than from sample B. The arrows have numbers that indicate the total number of proteins in a category that are present in the category the arrow is pointing to. The absence of an arrow does not mean two categories do not share proteins; it just means that the ratio of proteins shared does not reach the calculated threshold specified above.&lt;br /&gt;
&lt;br /&gt;
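The standardisation mentioned above can be sketched as a plain z-score (a simplification: SanXoT's actual variance handling is more elaborate, and the values below are invented):&lt;br /&gt;
&lt;br /&gt;

```python
import statistics

# Invented log-fold changes (X) for the proteins of one category.
x = [0.8, -0.2, 1.1, 0.4, -0.5]

mean = statistics.mean(x)
sd = statistics.pstdev(x)

# Centre and scale: the standardised values (Z) drive the colours.
z = [(xi - mean) / sd for xi in x]
```
&lt;br /&gt;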
=== Generating gaussians using SanXoTGauss ===&lt;br /&gt;
[[File:Gauss outGraph.png|thumb|300px|left|Output of SanXoTGauss in Test 3, line 37, showing the theoretical sigmoid for the standard normal distribution (red), the experimental sigmoid for all the proteins in the experiment (blue), and the sigmoids of proteins grouped by ontology (selecting with SanXoTSqueezer only categories that are changing expression with FDR &amp;lt;= 5% and having at least 3 proteins).]]&lt;br /&gt;
Representing sigmoids (here, cumulative Gaussians) of the different categories, and comparing them to a theoretical normal distribution (thick red line) and the bulk of proteins (blue line), can be helpful to gain a better insight into the meaning of a change. A coordinated proteome is a proteome whose proteins belonging to the same categories change in a similar way, together. We can observe this behaviour by representing the proteins classified in the same category as a separate sigmoid. Since we usually represent several categories at once, we draw each with a different colour to better identify them. Categories with a coordinated change will be &amp;quot;parallel&amp;quot; to the normal distribution, but displaced (according to the category-level fold change), while categories behaving in an uncoordinated way will tend to cross the normal distribution (as they have both proteins with increasing and proteins with decreasing expression changes).&lt;br /&gt;
&lt;br /&gt;
This operation is done by '''[[SanXoTGauss]]''', using a command like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=39&amp;gt;&lt;br /&gt;
sanxotgauss.exe -agauss -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -g&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The options are the same as in the previous Sanson command, except -g (which here is used to ask the program not to show the graph but just to save it, which is important to keep the workflow going).&lt;br /&gt;
&lt;br /&gt;
=== Calculating the degree of coordination using the Coordinometer ===&lt;br /&gt;
&lt;br /&gt;
We have defined a parameter called the '''''degree of coordination''''', which tells us objectively how coordinated a proteome is. This parameter takes into account the number of outlier proteins (proteins that are expressed differently compared to proteins classified within the same category), and how much the categories are changing. The '''[[Coordinometer]]''' calculates this number using this command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=40&amp;gt;&lt;br /&gt;
coordinometer.exe -acoordination -p%workingFolder% -qqc_noOuts_outStats.xls -cca_outStats.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Coordinometer extracts the number of relations tagged as outliers (using the &amp;quot;out&amp;quot; tag in the third column) from the file qc_noOuts_outStats.xls (using option -q; these are proteins expressed differently compared to other proteins in the same categories, a calculation from the &amp;quot;top-left&amp;quot; side of the triangle), and gets the number of changing categories from the file ca_outStats.xls (using option -c, reading the FDR column directly). The FDR threshold for the latter is set by default to 5%, but a different threshold can be set using the --cafdr option.&lt;br /&gt;
&lt;br /&gt;
The result is both printed in the console and written to the corresponding log file, coordination_logFile.txt, along with the information used to calculate it:&lt;br /&gt;
&lt;br /&gt;
 Total number of changing categories: 66&lt;br /&gt;
 Total number of relations pointing to changing categories: 106&lt;br /&gt;
 Total number of outlier relations pointing to changing categories: 2&lt;br /&gt;
 Total number of outlier relations pointing to non-changing categories: 46&lt;br /&gt;
 Degree of coordination: 0.684211&lt;br /&gt;
&lt;br /&gt;
Where 0.684211 is the result of:&lt;br /&gt;
 (numRelsChangingCats - numOutliersChangingCats) / (numRelsChangingCats + numOutliersNonChangingCats) =&lt;br /&gt;
 (106 - 2) / (106 + 46) = 0.684211&lt;br /&gt;
&lt;br /&gt;
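The arithmetic above can be checked with a few lines of Python (only the quoted formula is reproduced; the counts are the ones reported in the log file):&lt;br /&gt;
&lt;br /&gt;

```python
# Counts reported in coordination_logFile.txt.
num_rels_changing = 106    # relations pointing to changing categories
num_outs_changing = 2      # outlier relations pointing to changing categories
num_outs_nonchanging = 46  # outlier relations pointing to non-changing categories

degree_of_coordination = (num_rels_changing - num_outs_changing) / (
    num_rels_changing + num_outs_nonchanging
)
print(round(degree_of_coordination, 6))  # 0.684211
```
&lt;br /&gt;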
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:SanXoT software package]]&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=Exploring_SanXoT_features&amp;diff=635</id>
		<title>Exploring SanXoT features</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=Exploring_SanXoT_features&amp;diff=635"/>
				<updated>2018-08-17T10:10:25Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Here we will look at the details of three common workflows that use the SanXoT software package.&lt;br /&gt;
&lt;br /&gt;
For tests check '''[[Unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
For a general overview on how to use SanXoT and how it works, see '''[[How to use SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== Test 1: The fundamental workflow ==&lt;br /&gt;
&lt;br /&gt;
We describe here the '''[[fundamental workflow]]''', a workflow similar to the statistics done by '''[[QuiXoT]]''' using the WSPP (''Weighted Spectrum, Peptide and Protein'') statistical model&amp;lt;ref&amp;gt;Navarro, P., Trevisan-Herraz, M., Bonzon-Kulichenko, E., Núñez, E., Martínez-Acedo, P., Pérez-Hernández, D., Jorge, I., Mesa, R., Calvo, E., Carrascal, M., Hernáez, M.L., García, F., Bárcena, J.A., Ashman, K., Abian, J., Gil, C., Redondo, J.M. and Vázquez, J. (2014) '''[https://pubs.acs.org/doi/abs/10.1021/pr4006958 General statistical framework for quantitative proteomics by stable isotope labeling]'''. ''Journal of proteome research'', 13, 1234-1247.&amp;lt;/ref&amp;gt;. The following text is also available as a [https://en.wikipedia.org/wiki/Batch_file batch file] in '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test1.zip SanXoT_test1.zip]''':&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test1.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 1, FUNDAMENTAL WORKFLOW (SPECTRUM, PEPTIDE, PROTEIN)&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
 set startingFile=170415_Marga_GBS_iTRAQ_PSMs.txt&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM create the relation files&lt;br /&gt;
aljamia.exe -asprels -p%workingFolder% -x%startingFile% -i&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -j&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot;&lt;br /&gt;
aljamia.exe -apqrels -p%workingFolder% -x%startingFile% -i&amp;quot;'[Master Protein Accessions]'.split(';')[0]&amp;quot; -j&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;i&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM create the data file at the scan level&lt;br /&gt;
aljamia.exe -asdata114vs118 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([114]/[118], 2)&amp;quot; -k&amp;quot;max([114], [118])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
aljamia.exe -asdata115vs119 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([115]/[119], 2)&amp;quot; -k&amp;quot;max([115], [119])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM The following tab-separated text files (despite their extensions) are generated:&lt;br /&gt;
REM sprels_table.tsv, with two columns, having scan-to-peptide identifiers (rawfile-scanNumber-charge for scans, sequence for peptides)&lt;br /&gt;
REM pqrels_table.tsv, with two columns, having peptide-to-protein identifiers (text of their fasta header as protein identifier)&lt;br /&gt;
REM sdata_table.tsv, with three columns having {scan identifier, log2(foldchange), weight}&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart114vs118 -p%workingFolder% -dsdata114vs118_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxot.exe -asp114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -Vsp114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp114vs118_noOuts -p%workingFolder% -dstart114vs118_calibrated.tsv -rspouts114vs118_tagged.tsv -Vsp114vs118_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein&lt;br /&gt;
sanxot.exe -apq114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq114vs118_noOuts -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqouts114vs118_tagged.tsv -fg -Vpq114vs118_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa114vs118 -p%workingFolder% -dpq114vs118_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_1:_The_fundamental_workflow|Test 1]]'''.&lt;br /&gt;
&lt;br /&gt;
=== Modifications needed ===&lt;br /&gt;
&lt;br /&gt;
The only lines that need to be modified by the user are these (to match the user's file locations):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=6&amp;gt;&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
set startingFile=170415_Marga_GBS_iTRAQ_PSMs.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the first operations, we will parse the data from the file defined in ''startingFile'' using '''[[Aljamia]]''' to generate the relations files. In this case, the input file is a tab-separated text file quantified using [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer 2.1], but any other tab-separated text file that arranges information in columns with headers can be used as input (for example [https://en.wikipedia.org/wiki/SQLite SQLite] query results from Proteome Discoverer's MSF files, or results from [https://en.wikipedia.org/wiki/Mascot_(software) Mascot]).&lt;br /&gt;
&lt;br /&gt;
=== Parsing the data ===&lt;br /&gt;
&lt;br /&gt;
Let us check the first '''[[Aljamia]]''' operation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=15&amp;gt;&lt;br /&gt;
aljamia.exe -asprels -p%workingFolder% -x%startingFile% -i&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -j&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a common workflow, the SanXoT software package can generate many output files. To make them easier to track, by default all the files generated in a single step will have the same prefix, while each type of file will have a common suffix. The prefix is set by the ''-a'' argument (it is good practice to ''always'' include this argument). For example, in this case two files are generated: ''sprels_table.tsv'' (the relations table matching '''s'''cans and '''p'''eptides) and ''sprels_log.txt'' (a log file). Unless overridden (check '''[[Aljamia]]''''s help for more information), the filename of the generated relations table will always end in '''_table.tsv'''.&lt;br /&gt;
&lt;br /&gt;
The option -p sets the working folder: this means that all output and input files are assumed to be in that folder, unless a full path is specified. The generated table will include only two columns: the first (-i) building an unambiguous identifier of the higher level, the peptides (in this case, concatenating the columns &amp;quot;Annotated Sequence&amp;quot; and &amp;quot;Modifications&amp;quot;), and the second (-j) merging the data from the &amp;quot;Spectrum File&amp;quot;, &amp;quot;First Scan&amp;quot; and &amp;quot;Charge&amp;quot; columns (an example of a unique identifier for scans, the lower level). Note that the column headers are case-sensitive.&lt;br /&gt;
&lt;br /&gt;
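The identifier expressions in -i and -j can be sketched in Python with invented values for the Proteome Discoverer columns:&lt;br /&gt;
&lt;br /&gt;

```python
# Invented values for the relevant Proteome Discoverer columns.
annotated_sequence = "AQLGGPEAAK"
modifications = "N-Term(iTRAQ4plex)"
spectrum_file, first_scan, charge = "run01.raw", "10423", "2"

# -i: unambiguous peptide identifier (higher level).
peptide_id = annotated_sequence + "_" + modifications

# -j: unique scan identifier (lower level).
scan_id = spectrum_file + "-" + first_scan + "-" + charge

print(peptide_id)  # AQLGGPEAAK_N-Term(iTRAQ4plex)
print(scan_id)     # run01.raw-10423-2
```
&lt;br /&gt;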
The operation to generate the file for the relations between proteins and peptides is similar:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=16&amp;gt;&lt;br /&gt;
aljamia.exe -apqrels -p%workingFolder% -x%startingFile% -i&amp;quot;'[Master Protein Accessions]'.split(';')[0]&amp;quot; -j&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;i&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notice that the prefix here is ''pqrels'' (some researchers use '''q''' for proteins, to differentiate them from '''p'''eptides); now the peptide identifiers (&amp;quot;Annotated Sequence&amp;quot; + &amp;quot;Modifications&amp;quot;) are in the second column (because now they are the lower level), while proteins are in the first column (as they are now the higher level). We use here UniProt's accession numbers as the protein identifier. Proteome Discoverer includes several proteins where the peptide is found, but in this case we use only the first protein listed. To do so, we use Python-style commands to split the string in the field &amp;quot;Master Protein Accessions&amp;quot; (using &amp;quot;;&amp;quot;), and then take just the first element (hence the &amp;quot;[0]&amp;quot;). To tell '''[[Aljamia]]''' that this is an operation, and not just text to include, we add the argument -A&amp;quot;i&amp;quot; (to indicate that only column &amp;quot;i&amp;quot; contains operations). We have also included a filter to keep only the identifications that, according to Proteome Discoverer, have confidence either &amp;quot;Medium&amp;quot; or &amp;quot;High&amp;quot; (excluding this way bad identifications).&lt;br /&gt;
&lt;br /&gt;
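The split expression used above behaves like plain Python string handling; a sketch with an invented accessions field:&lt;br /&gt;
&lt;br /&gt;

```python
# Invented content of the "Master Protein Accessions" field.
master_protein_accessions = "P04406; P00558; Q9Y6M4"

# Split on ";" and keep only the first element, as in the -i expression.
first_accession = master_protein_accessions.split(';')[0]
print(first_accession)  # P04406
```
&lt;br /&gt;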
The next two lines parse the information from the same input file, ''170415_Marga_GBS_iTRAQ_PSMs.txt'', to generate two data files, one for the 114 vs 118 iTRAQ comparison, and the other for 115 vs 119 (as you can observe in the -a option for the prefix):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=19&amp;gt;&lt;br /&gt;
aljamia.exe -asdata114vs118 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([114]/[118], 2)&amp;quot; -k&amp;quot;max([114], [118])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
aljamia.exe -asdata115vs119 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([115]/[119], 2)&amp;quot; -k&amp;quot;max([115], [119])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Data files in SanXoT include a first column (-i) with a unique text identifier (in this case, the scan identifier we have mentioned), a second column (-j) with the log2(A/B) ratio (coded with the enclosed Python-style commands), and a third column (-k) for the weight of the scan (in iTRAQ we just select the intensity of the most intense peak in the pair). As we did before to split the column with the protein accession number, we use -A&amp;quot;jk&amp;quot; to indicate that there are operations to be performed in the second and third columns (but not the first, which is treated as text).&lt;br /&gt;
&lt;br /&gt;
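The expressions given in -j and -k correspond to these plain-Python operations (the reporter intensities below are made up for illustration):&lt;br /&gt;

```python
import math

# Hypothetical iTRAQ reporter intensities for one scan
intensity_114, intensity_118 = 2400.0, 1200.0

# Column -j: log2 ratio of the two reporters
log2_ratio = math.log(intensity_114 / intensity_118, 2)

# Column -k: weight of the scan, the most intense reporter in the pair
weight = max(intensity_114, intensity_118)
```
&lt;br /&gt;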
=== Calibrating the spectra ===&lt;br /&gt;
&lt;br /&gt;
The first step to integrate the data is calibrating the weights assigned to each spectrum (see '''[[How to use SanXoT#Calibrating the data|How to use SanXoT]]''' for more information on why this is important). This task is performed by the module '''[[Klibrate]]''':&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=27&amp;gt;&lt;br /&gt;
klibrate.exe -astart114vs118 -p%workingFolder% -dsdata114vs118_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As in all SanXoT programs, a prefix is assigned to the operation (-a) and the default working folder is specified (-p). Additionally, -d and -r set the file names of the data and relations files generated by Aljamia (sdata114vs118_table.tsv and sprels_table.tsv, respectively). The -g option tells the program not to pause to display the sigmoid graph comparing the data to the standard normal distribution.&lt;br /&gt;
&lt;br /&gt;
=== Integration from spectrum to peptide ===&lt;br /&gt;
&lt;br /&gt;
We are ready for the first integration (using the GIA, or ''Generic Integration Algorithm''), which is done by '''[[SanXoT]]''' with the help of '''[[SanXoTSieve]]'''.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=28&amp;gt;&lt;br /&gt;
sanxot.exe -asp114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -Vsp114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp114vs118_noOuts -p%workingFolder% -dstart114vs118_calibrated.tsv -rspouts114vs118_tagged.tsv -Vsp114vs118_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(The meaning of the -a, -p, -d, -r and -g options has already been described.)&lt;br /&gt;
&lt;br /&gt;
In the first of these three commands, SanXoT integrates the lower-level data (in this case, spectra) into a higher level (here, peptides). When doing so, SanXoT calculates the variance of the lower level with respect to the higher one, i.e. how the fold-changes of the different spectra are distributed when assigned to their peptides. The result is shown both in the console and in the [prefix]_infoFile.txt file (here, sp114vs118_infoFile.txt).&lt;br /&gt;
&lt;br /&gt;
Once the variance is known, SanXoTSieve tags the outlier spectra of each peptide (if any) by adding the tag &amp;quot;out&amp;quot; in a third column of the relations file, saving the result as a new relations file (to avoid overwriting the original one). To obtain the variance (essential to determine which spectra are outliers), SanXoTSieve takes sp114vs118_infoFile.txt as input (option -V), reading it from the line &amp;quot;Variance = ...&amp;quot;; alternatively, the variance can be set directly with the option -v (lowercase), for example as -v0.02, if we want to use a specific, known variance.&lt;br /&gt;
&lt;br /&gt;
For SanXoTSieve, -f defines the FDR below which a spectrum is considered an outlier. This means that, among the spectra tagged as outliers at FDR &amp;lt; 0.01, less than 1% are expected to deviate merely by chance from the other spectra where the same peptide has been identified.&lt;br /&gt;
&lt;br /&gt;
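As a rough illustration of the idea (not SanXoTSieve's actual implementation), outlier tagging at a given FDR can be sketched as a Benjamini-Hochberg procedure over the per-spectrum deviations, using the fitted variance:&lt;br /&gt;

```python
import math

def normal_two_sided_p(z):
    """Two-sided p-value for a standard-normal z score (via erfc)."""
    return math.erfc(abs(z) / math.sqrt(2))

def tag_outliers(deviations, variance, fdr_threshold=0.01):
    """Benjamini-Hochberg tagging sketch.

    deviations: fold-change of each spectrum minus its peptide average.
    Returns a list of booleans, True where the spectrum is an outlier.
    """
    n = len(deviations)
    pvals = [normal_two_sided_p(d / math.sqrt(variance)) for d in deviations]
    order = sorted(range(n), key=lambda i: pvals[i])
    # BH step-up: largest rank k with p_(k) <= k/n * threshold
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / n * fdr_threshold:
            max_k = rank
    outlier = [False] * n
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            outlier[i] = True
    return outlier
```
With a variance of 0.01, a spectrum deviating by 5.0 log2 units from its peptide average is tagged, while small deviations are not.&lt;br /&gt;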
Once SanXoTSieve has produced the relations file with the tagged outliers, SanXoT performs the final integration; in this case, instead of recalculating the variance, it uses the variance that was calculated while the outliers were still included (option -V to provide the info file, or -v to provide the value directly). We say in this case that ''the variance is forced'', which is indicated with the -f parameter (note that -f has a different meaning here than in SanXoTSieve).&lt;br /&gt;
&lt;br /&gt;
The option --tags=&amp;quot;!out&amp;quot; means that the relations tagged as &amp;quot;out&amp;quot; by SanXoTSieve will be excluded (with the same effect as if they had been removed from the relations file). Several tags can be combined (see the specific help for more information).&lt;br /&gt;
&lt;br /&gt;
=== Integration from peptide to protein ===&lt;br /&gt;
&lt;br /&gt;
The integration of peptides into proteins is very similar to the previous one:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=33&amp;gt;&lt;br /&gt;
sanxot.exe -apq114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq114vs118_noOuts -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqouts114vs118_tagged.tsv -fg -Vpq114vs118_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The only differences are: 1) instead of taking as input the output file of Klibrate (which contained the calibrated spectrum information), it takes the output of the previous integration (sp114vs118_noOuts_higherLevel.tsv); and 2) the relations file (pqrels_table.tsv) contains the protein identifiers in the first column (in our case we use the accession number, although any other unique text string would do; in other cases the whole FASTA header can be used, which makes the proteins more human-readable) and the peptide identifiers (the amino acid sequence including the post-translational modifications) in the second column.&lt;br /&gt;
&lt;br /&gt;
The result will be a data file containing the protein identifiers, the fold-changes at the protein level and the protein weights, with no further reference to peptides or spectra.&lt;br /&gt;
&lt;br /&gt;
=== Getting the quantification of proteins ===&lt;br /&gt;
[[File:Qa114vs118 outGraph.png|thumb|300px|Output file of SanXoT (''qa114vs118_outGraph.png'') in Test 1, line 38, showing the sigmoid with protein expression changes (blue) compared to the theoretical sigmoid of the standard normal distribution.]]&lt;br /&gt;
&lt;br /&gt;
In the last step we compare each protein to the set of proteins as a whole. As you can see, the output data file from the previous integration is used as input (pq114vs118_noOuts_higherLevel.tsv); here there is no relations file, since in this case we are not integrating the proteins into different groups (we will see that in Test 2 for merging experiments, and in Test 3 for systems biology).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=39&amp;gt;&lt;br /&gt;
sanxot.exe -aqa114vs118 -p%workingFolder% -dpq114vs118_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of providing a relations file, we include the option -C (for ''confluence''), which has the same effect as a relations file whose first column is a dummy value (&amp;quot;1&amp;quot; in every row) and whose second column contains the protein identifiers.&lt;br /&gt;
&lt;br /&gt;
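The implicit relations table created by -C can be sketched as follows (the identifiers are hypothetical):&lt;br /&gt;

```python
# Sketch of the relations file that option -C implicitly creates: a
# dummy higher level "1" linked to every protein identifier, so the
# whole protein set is integrated as a single group.
protein_ids = ["P01857", "P01859", "P01860"]  # hypothetical identifiers

confluence_rels = [("1", protein_id) for protein_id in protein_ids]

# Tab-separated text, as in a SanXoT relations file
rels_text = "idsup\tidinf\n" + "\n".join(
    f"{sup}\t{inf}" for sup, inf in confluence_rels)
```
&lt;br /&gt;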
The output will include a higher-level file (qa114vs118_higherLevel.tsv, where '''q''' means &amp;quot;protein&amp;quot; and '''a''' means &amp;quot;all&amp;quot;) with a single row, whose fold-change is the weighted arithmetic mean of all proteins (i.e. the grand mean of the whole set). The protein changes can be seen in the file qa114vs118_outStats.tsv, where the FDR of changing proteins is presented alongside their fold-change (Xinf), normalised fold-change (Z) and the grand mean (Xsup).&lt;br /&gt;
&lt;br /&gt;
== Test 2: Experiment merging ==&lt;br /&gt;
&lt;br /&gt;
Here we perform a very simple experiment merging, in which we merge two experiments with comparable information (i.e. if the first experiment was &amp;quot;treated vs control&amp;quot;, the second experiment can be a biological replicate, such as another &amp;quot;treated vs control&amp;quot; for a different treated patient and a different control).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test2.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 2, EXPERIMENT MERGING&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test2&amp;quot;&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM now, same for 115vs119&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart115vs119 -p%workingFolder% -dsdata115vs119_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxot.exe -asp115vs119 -p%workingFolder% -dstart115vs119_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts115vs119 -p%workingFolder% -dstart115vs119_calibrated.tsv -rsprels_table.tsv -Vsp115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp115vs119_noOuts -p%workingFolder% -dstart115vs119_calibrated.tsv -rspouts115vs119_tagged.tsv -Vsp115vs119_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein for 115vs119&lt;br /&gt;
sanxot.exe -apq115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq115vs119_noOuts -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqouts115vs119_tagged.tsv -fg -Vpq115vs119_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes for 115vs119&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa115vs119 -p%workingFolder% -dpq115vs119_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM experiment merging: 115vs119 + 114vs118&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
cardenio.exe -amerge -p%workingFolder% -ttagFile.txt&lt;br /&gt;
sanxot.exe -amerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -g&lt;br /&gt;
sanxotsieve.exe -aoutsMerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -Vmerged_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -amerged_noOuts -p%workingFolder% -dmerge_newData.txt -routsMerged_tagged.tsv -fg -Vmerged_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqaMerged -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_2:_Experiment_merging|Test 2]]'''.&lt;br /&gt;
&lt;br /&gt;
=== A parallel analysis ===&lt;br /&gt;
&lt;br /&gt;
In this example we merge just two experiments for simplicity, but it is possible to merge even hundreds of them at once (such as hundreds of technical or biological replicates).&lt;br /&gt;
&lt;br /&gt;
Command lines 6-8 have already been discussed; they must be adapted to the paths on the user's computer. Lines 14-29 do the same as discussed in the previous section, but for the 115 vs 119 comparison:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=14&amp;gt;&lt;br /&gt;
REM now, same for 115vs119&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart115vs119 -p%workingFolder% -dsdata115vs119_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxot.exe -asp115vs119 -p%workingFolder% -dstart115vs119_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts115vs119 -p%workingFolder% -dstart115vs119_calibrated.tsv -rsprels_table.tsv -Vsp115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp115vs119_noOuts -p%workingFolder% -dstart115vs119_calibrated.tsv -rspouts115vs119_tagged.tsv -Vsp115vs119_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein for 115vs119&lt;br /&gt;
sanxot.exe -apq115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq115vs119_noOuts -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqouts115vs119_tagged.tsv -fg -Vpq115vs119_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes for 115vs119&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa115vs119 -p%workingFolder% -dpq115vs119_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Generating the files for the experiment integration ===&lt;br /&gt;
&lt;br /&gt;
The module for merging experiments, '''[[Cardenio]]''', actually does a very simple thing: it takes the protein identifiers, concatenates a suffix for each experiment (specified in an external file called the tagFile) and generates both the data and relations files, ready for SanXoT to integrate in the usual way:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=36&amp;gt;&lt;br /&gt;
cardenio.exe -amerge -p%workingFolder% -ttagFile.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We have already discussed the -a and -p options, common across the SanXoT software package. Cardenio also needs -t, a tab-separated file with this structure:&lt;br /&gt;
&lt;br /&gt;
 tag		file&lt;br /&gt;
 114vs118	qa114vs118_lowerNormV.tsv&lt;br /&gt;
 115vs119	qa115vs119_lowerNormV.tsv&lt;br /&gt;
&lt;br /&gt;
The column &amp;quot;tag&amp;quot; contains the suffix that will be appended to each protein (or other) identifier, while the column &amp;quot;file&amp;quot; contains the filenames of the data files. These files are similar to, but distinct from, the _higherLevel files: they are their centred versions. For example, the one with prefix qa114vs118 was generated in the last step of the 114vs118 integration, and the suffix _lowerNormV tells us it is a data file where the fold-changes have been centred by subtracting the grand mean (giving X'inf), while the weights have been kept unchanged; files with the suffix _lowerNorm'''W''' have identical first and second columns, but their third column (the weights) has been modified to include the general variance of the protein level, as described in the WSPP statistical model.&lt;br /&gt;
&lt;br /&gt;
These identifiers are rearranged into the relations and data files as in the following example.&lt;br /&gt;
&lt;br /&gt;
The first file, in its original _lowerNormV form, looks like this:&lt;br /&gt;
&lt;br /&gt;
 idinf	X'inf		Vinf&lt;br /&gt;
 protId1	0.420637682685	20.8442070633&lt;br /&gt;
 protId2	0.0309318252993	10.405882717&lt;br /&gt;
 protId3	0.235807597358	13.6742537388&lt;br /&gt;
 protId4	-0.939886653794	10.3887308051&lt;br /&gt;
 protId5	0.136998763785	111.910628299&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
And the second file looks like:&lt;br /&gt;
&lt;br /&gt;
 idinf	X'inf		Vinf&lt;br /&gt;
 protId1	0.12726383076	9.37044627116&lt;br /&gt;
 protId2	0.274864264955	4.64481829573&lt;br /&gt;
 protId3	-0.261668207812	5.73886230845&lt;br /&gt;
 protId4	0.40863124663	4.57552314502&lt;br /&gt;
 protId5	0.36753759182	46.0017027453&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
In the output of Cardenio, data is rearranged in merge_newData.txt:&lt;br /&gt;
&lt;br /&gt;
 idinf			X'inf		Vinf&lt;br /&gt;
 protId1'''_114vs118'''	0.420637682685	20.8442070633&lt;br /&gt;
 protId2'''_114vs118'''	0.0309318252993	10.405882717&lt;br /&gt;
 protId3'''_114vs118'''	0.235807597358	13.6742537388&lt;br /&gt;
 protId4'''_114vs118'''	-0.939886653794	10.3887308051&lt;br /&gt;
 protId5'''_114vs118'''	0.136998763785	111.910628299&lt;br /&gt;
 ...&lt;br /&gt;
 protId1'''_115vs119'''	0.12726383076	9.37044627116&lt;br /&gt;
 protId2'''_115vs119'''	0.274864264955	4.64481829573&lt;br /&gt;
 protId3'''_115vs119'''	-0.261668207812	5.73886230845&lt;br /&gt;
 protId4'''_115vs119'''	0.40863124663	4.57552314502&lt;br /&gt;
 protId5'''_115vs119'''	0.36753759182	46.0017027453&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
Along with the corresponding relations file merge_newRels.txt:&lt;br /&gt;
&lt;br /&gt;
 idsup	idinf&lt;br /&gt;
 protId1	protId1'''_114vs118'''&lt;br /&gt;
 protId2	protId2'''_114vs118'''&lt;br /&gt;
 protId3	protId3'''_114vs118'''&lt;br /&gt;
 protId4	protId4'''_114vs118'''&lt;br /&gt;
 protId5	protId5'''_114vs118'''&lt;br /&gt;
 ...&lt;br /&gt;
 protId1	protId1'''_115vs119'''&lt;br /&gt;
 protId2	protId2'''_115vs119'''&lt;br /&gt;
 protId3	protId3'''_115vs119'''&lt;br /&gt;
 protId4	protId4'''_115vs119'''&lt;br /&gt;
 protId5	protId5'''_115vs119'''&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
As can be seen, in the relations file the lower level (second column) links each protein to its experiment (via the suffix, matching the corresponding data in merge_newData.txt), while in the higher level (first column) the proteins are independent of the experiment. This means that, after integrating, we obtain only experiment-independent, protein-level information (the information of each protein in each experiment, averaged with its corresponding weight).&lt;br /&gt;
&lt;br /&gt;
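The rearrangement shown above can be sketched in a few lines of Python (values abbreviated from the example tables):&lt;br /&gt;

```python
# Sketch of what Cardenio does with the tagFile entries: append the
# experiment tag to each protein identifier in the data, and relate the
# tagged (lower-level) identifiers back to the untagged (higher-level)
# protein. The numeric values are illustrative only.
experiments = {
    "114vs118": [("protId1", 0.4206, 20.84), ("protId2", 0.0309, 10.41)],
    "115vs119": [("protId1", 0.1273, 9.37), ("protId2", 0.2749, 4.64)],
}

new_data = []   # rows of merge_newData.txt: (idinf, X'inf, Vinf)
new_rels = []   # rows of merge_newRels.txt: (idsup, idinf)
for tag, rows in experiments.items():
    for prot, x, v in rows:
        tagged = f"{prot}_{tag}"
        new_data.append((tagged, x, v))
        new_rels.append((prot, tagged))
```
&lt;br /&gt;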
=== Last step for experiment integration ===&lt;br /&gt;
&lt;br /&gt;
The last lines perform the integration of the two experiments taking the data and relations files from Cardenio (merge_newData.txt and merge_newRels.txt), as seen in the previous section.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=37&amp;gt;&lt;br /&gt;
sanxot.exe -amerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -g&lt;br /&gt;
sanxotsieve.exe -aoutsMerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -Vmerged_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -amerged_noOuts -p%workingFolder% -dmerge_newData.txt -routsMerged_tagged.tsv -fg -Vmerged_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqaMerged -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The approach here is the same as for the integrations from spectrum to peptide and from peptide to protein (but in this case we integrate from &amp;quot;experiment-dependent protein data&amp;quot; to &amp;quot;experiment-independent protein data&amp;quot;):&lt;br /&gt;
:* the first line calculates the variance (storing it in the merged_infoFile.txt file);&lt;br /&gt;
:* then SanXoTSieve tags the outliers (generating the outsMerged_tagged.tsv file with the same relations as merge_newRels.txt, but including the tag &amp;quot;out&amp;quot; in the third column);&lt;br /&gt;
:* the third line integrates the proteins using the relations file and the variance from the two previous commands;&lt;br /&gt;
:* finally, all the experiment-independent proteins are compared to each other, obtaining the variance (stored in merged_noOuts_infoFile.txt) and providing an FDR that can be used to filter the changing proteins (in the file merged_noOuts_outStats.xls).&lt;br /&gt;
&lt;br /&gt;
The last step will also be used in the Systems Biology Triangle method (SBT)&amp;lt;ref&amp;gt;García-Marqués, F., Trevisan-Herraz, M., Martínez-Martínez, S., Camafeita, E., Jorge, I., Lopez, J.A., Méndez-Barbero, N., Méndez-Ferrer, S., del Pozo, M.A., Ibáñez, B., Andrés, V., Sánchez-Madrid, F., Redondo, J.M., Bonzon-Kulichenko, E. and Vázquez, J. (2016) '''[http://www.mcponline.org/content/15/5/1740.long A novel systems-biology algorithm for the analysis of coordinated protein responses using quantitative proteomics]'''. ''Molecular &amp;amp; Cellular Proteomics''.&amp;lt;/ref&amp;gt; which we will see now (the &amp;quot;protein-to-all&amp;quot; operation is the &amp;quot;bottom side&amp;quot; of the triangle).&lt;br /&gt;
&lt;br /&gt;
== Test 3: Systems biology ==&lt;br /&gt;
&lt;br /&gt;
In this final test we perform a systems biology analysis of the previously merged experiment using the Systems Biology Triangle method. For that, we start with the files generated in Test 2, plus a tab-separated table downloaded directly from the '''[https://en.wikipedia.org/wiki/DAVID DAVID bioinformatics resource]'''.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test3.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 3, SYSTEMS BIOLOGY&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test3&amp;quot;&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM conversion of ontological data from DAVID&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
camacho.exe -adbcats1 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -q1 -c5 -fBioCarta_&lt;br /&gt;
camacho.exe -adbcats2 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats1_rels.xls&amp;quot; -c6 -fGOBP_&lt;br /&gt;
camacho.exe -adbcats3 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats2_rels.xls&amp;quot; -c7 -fGOCC_&lt;br /&gt;
camacho.exe -adbcats4 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats3_rels.xls&amp;quot; -c8 -fGOMF_&lt;br /&gt;
camacho.exe -adbcats5 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats4_rels.xls&amp;quot; -c9 -fKEGG_&lt;br /&gt;
camacho.exe -adbcats -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats5_rels.xls&amp;quot; -c10 -fPIR_&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM prot to cat&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -g&lt;br /&gt;
sanxotsieve.exe -aoutsqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -Vqc_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -aqc_noOuts -p%workingFolder% -dmerged_noOuts_higherLevel.xls -routsqc_tagged.xls -fg -Vqc_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM cat to all&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aca -p%workingFolder% -dqc_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
sanxotsqueezer.exe -alist -p%workingFolder% -lqc_noOuts_outStats.xls -uca_outStats.xls -f0.05 -n3&lt;br /&gt;
sanson.exe -asanson -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -gpdf&lt;br /&gt;
sanxotgauss.exe -agauss -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -g&lt;br /&gt;
coordinometer.exe -acoordination -p%workingFolder% -qqc_noOuts_outStats.xls -cca_outStats.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_3:_Systems_biology|Test 3]]'''.&lt;br /&gt;
&lt;br /&gt;
=== Getting the protein-to-category relations file ===&lt;br /&gt;
&lt;br /&gt;
We start by generating the relations file we are going to use in the systems biology analysis. It is a simple relations file, where the first column contains a string with the identifier of the category, and the second column a string with the identifier of the protein (following the same philosophy as the tab-separated tables used for the spectrum-to-peptide and peptide-to-protein assignments).&lt;br /&gt;
&lt;br /&gt;
Note that, although we use DAVID here to obtain the ontological categories, these relations can be provided using any tool, or even created as custom relations, as long as the format described is used. We can also concatenate new relations (such as our custom ones) to previous relations files.&lt;br /&gt;
&lt;br /&gt;
As mentioned, a good way to get information from databases such as Gene Ontology or KEGG is to use DAVID as an intermediary; in this case, we use the module '''[[Camacho]]''', dedicated to this operation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=17&amp;gt;&lt;br /&gt;
camacho.exe -adbcats1 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -q1 -c5 -fBioCarta_&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here we use, as in all SanXoT modules, the options -a and -p as usual (to define a prefix for the output files of this operation and to set the working directory, respectively). Then we give the filename of the tab-separated file downloaded from DAVID (using -d), we indicate the one-based column number containing the protein accession number (option -q; in this case -q1 indicates it is the first column) and the column containing the categories we want to parse into the relations file (using -c; in this case we want to parse the data from BioCarta, which are in column 5), and we define a prefix for the categories (as these come from BioCarta, we use -f to set it to &amp;quot;BioCarta_&amp;quot;, where the underscore acts as a separator to make the result more human-readable). The prefix is not mandatory, but it is highly recommended when importing data from different databases, as some categories may share the same name but have different protein sets; using a prefix not only avoids such collisions, but also helps trace which database each category comes from.&lt;br /&gt;
&lt;br /&gt;
The file fastaHeadersNoSequences.fasta is not mandatory. In this case we include it because we want to use the full FASTA header as the protein identifier (to make the results more human-readable). Obviously, it must contain the same identifiers as those used for DAVID. Typically, the FASTA file used to identify the proteins should be used, but here we have removed the sequences to avoid distributing an unnecessarily large file with this test.&lt;br /&gt;
&lt;br /&gt;
As a result, we get the file dbcats1_rels.xls. Now we concatenate the relations from [http://www.geneontology.org Gene Ontology] (BP = biological process, CC = cellular component, and MF = molecular function), [http://www.genome.jp/kegg KEGG] and [https://pir.georgetown.edu PIR]. For each of them we run Camacho again, changing only the -c option (to set the column), -f (for the different category prefix) and the file to which the relations are concatenated (option -x). Some users might prefer to overwrite the original file but, in order to keep track of the different operations, we use a different file for each output.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=18&amp;gt;&lt;br /&gt;
camacho.exe -adbcats2 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats1_rels.xls&amp;quot; -c6 -fGOBP_&lt;br /&gt;
camacho.exe -adbcats3 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats2_rels.xls&amp;quot; -c7 -fGOCC_&lt;br /&gt;
camacho.exe -adbcats4 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats3_rels.xls&amp;quot; -c8 -fGOMF_&lt;br /&gt;
camacho.exe -adbcats5 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats4_rels.xls&amp;quot; -c9 -fKEGG_&lt;br /&gt;
camacho.exe -adbcats -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats5_rels.xls&amp;quot; -c10 -fPIR_&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We end up with the final file dbcats_rels.xls; this relations file can now be used for as many Systems Biology Triangle analyses as needed, without running Camacho again (unless we want to add databases or an updated ontology from DAVID).&lt;br /&gt;
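&lt;br /&gt;
Camacho's core transformation can be sketched as follows (an illustration only; the exact column layout of the DAVID export and the comma separator for categories are assumptions):&lt;br /&gt;

```python
# Sketch of a Camacho-like parser: take the protein column and a
# category column from a DAVID-style tab-separated table, prefix each
# category with the database name, and collect (category, protein)
# relations.
def parse_relations(rows, protein_col, category_col, prefix):
    """rows: lists of fields; columns are one-based, as in -q and -c."""
    rels = []
    for fields in rows:
        protein = fields[protein_col - 1]
        for category in fields[category_col - 1].split(","):
            category = category.strip()
            if category:
                rels.append((prefix + category, protein))
    return rels

# Hypothetical row: accession in column 1, BioCarta terms in column 5
rows = [["P01857", "-", "-", "-", "classical pathway,lectin pathway"]]
rels = parse_relations(rows, protein_col=1, category_col=5,
                       prefix="BioCarta_")
```
&lt;br /&gt;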
&lt;br /&gt;
=== From proteins to categories ===&lt;br /&gt;
&lt;br /&gt;
We start by using the relations file just generated, dbcats_rels.xls, to integrate proteins into categories. Every protein has a different weight, depending on the quality of its scans, the number of peptides detected and quantified for it, and perhaps even the different experiments where it has been found. As a result, the information from each protein carries a different weight within the category where it has been classified (just as happened at the previous levels). Getting category-level information (rather than a lengthy list of quantified proteins) helps us get the big picture of what is going on in the proteome. Additionally, using the Systems Biology Triangle, we integrate the information of all the quantified proteins, not just a list of proteins whose quantification exceeds a defined threshold (as in over-representation algorithms).&lt;br /&gt;
&lt;br /&gt;
We integrate in the usual manner: first calculating the variance of the proteins, then tagging the outliers (proteins that behave differently from the other proteins classified in the same group) and finally performing the integration with the previously calculated variance, excluding the outlier proteins:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=28&amp;gt;&lt;br /&gt;
sanxot.exe -aqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -g&lt;br /&gt;
sanxotsieve.exe -aoutsqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -Vqc_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -aqc_noOuts -p%workingFolder% -dmerged_noOuts_higherLevel.xls -routsqc_tagged.xls -fg -Vqc_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This operation is the &amp;quot;top-left&amp;quot; side of the Systems Biology Triangle.&lt;br /&gt;
&lt;br /&gt;
The last integration, where categories are compared to each other, is the &amp;quot;top-right&amp;quot; side of the triangle:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=34&amp;gt;&lt;br /&gt;
sanxot.exe -aca -p%workingFolder% -dqc_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As in line 38 of Test 1 and line 41 of Test 2, this is a confluence, meaning that we do not need a relations file, because the whole set of categories is considered a single group.&lt;br /&gt;
&lt;br /&gt;
=== Filtering the most relevant categories using SanXoTSqueezer ===&lt;br /&gt;
&lt;br /&gt;
Now we have the information needed to extract some conclusions from the analysis. Sometimes there are too many categories, and the analysis becomes difficult. We could build &amp;quot;supercategories&amp;quot; using a category-to-supercategory relations file, but in this case we will simply filter the categories with '''[[SanXoTSqueezer]]''', keeping those with a minimum of 3 quantified proteins (option -n) and a maximum FDR of 5% (option -f). We also need the two outStats files from the previous steps: for the protein-to-category operation (&amp;quot;qc&amp;quot;, option -l) and for the category-to-all operation (&amp;quot;ca&amp;quot;, option -u), i.e. qc_noOuts_outStats.xls and ca_outStats.xls, respectively:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=37&amp;gt;&lt;br /&gt;
sanxotsqueezer.exe -alist -p%workingFolder% -lqc_noOuts_outStats.xls -uca_outStats.xls -f0.05 -n3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output, list_outList.xls, is a list of categories with some associated information; we will analyse the categories in this list.&lt;br /&gt;
&lt;br /&gt;
=== Clustering together similar categories using Sanson ===&lt;br /&gt;
&lt;br /&gt;
The point of integrating proteins into categories is managing less, but more meaningful, information. However, databases can contain so many categories that the results may still be too numerous to interpret biologically.&lt;br /&gt;
&lt;br /&gt;
In some cases we have duplicated categories that contain the same proteins under different names, especially when we combine different databases. This is easy to detect, but more commonly categories have ''similar'' but not ''identical'' sets of proteins, a subtler problem. The module '''[[Sanson]]''' helps detect similar categories by generating clusters of similar sets. A &amp;quot;similarity&amp;quot; threshold can be set (the similarity being given by an FNumber, which depends on how many proteins two given categories share), but the algorithm uses a matrix of protein-to-category items (the similarity matrix), and maximises both the number of categories in each cluster (the CNumber) and the number of clusters, partitioning them with a [https://en.wikipedia.org/wiki/Durfee_square Durfee square]-based calculation:&lt;br /&gt;
&lt;br /&gt;
 With this similarity-threshold, are there at least 3 clusters with at least 3 categories? --&amp;gt; true&lt;br /&gt;
 With this similarity-threshold, are there at least 4 clusters with at least 4 categories? --&amp;gt; true&lt;br /&gt;
 With this similarity-threshold, are there at least 5 clusters with at least 5 categories? --&amp;gt; false&lt;br /&gt;
 With this similarity-threshold, are there at least 6 clusters with at least 6 categories? --&amp;gt; false&lt;br /&gt;
&lt;br /&gt;
In this example, it could happen that there are (for example) 5 clusters with 4 categories, or 3 clusters with 6 categories, but not 5 clusters with 5 categories, so the best CNumber in this case would be 4 (note that this is the same logic used to calculate the h-index metric).&lt;br /&gt;
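This cluster-partitioning logic can be sketched in a few lines of Python (a minimal illustration of the h-index-style calculation, not Sanson's actual code; the cluster sizes are made up):&lt;br /&gt;

```python
def best_cnumber(cluster_sizes):
    """Largest n such that at least n clusters contain at least n
    categories each (the same logic as the h-index)."""
    sizes = sorted(cluster_sizes, reverse=True)
    n = 0
    while n < len(sizes) and sizes[n] >= n + 1:
        n += 1
    return n

# Three clusters of 6 categories plus two of 4: there are 5 clusters with
# at least 4 categories, but not 5 clusters with at least 5, so CNumber = 4.
print(best_cnumber([6, 6, 6, 4, 4]))  # 4
```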
&lt;br /&gt;
In this test, Sanson is executed with this command line:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=38&amp;gt;&lt;br /&gt;
sanson.exe -asanson -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -gpdf&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here, dbcats_rels.xls (option -r) is the relations file we already know; list_outList.xls (option -c), generated in the previous step, provides the categories that will be considered; and the outStats file qaMerged_outStats.xls (option -z), generated in the last command of Test 2, provides the protein-level, category-independent quantifications, i.e. the protein information for each category.&lt;br /&gt;
&lt;br /&gt;
The output of Sanson is a .gv file (here sanson_simGraph.gv), where category clusters are described in the [http://en.wikipedia.org/wiki/DOT_language DOT language]; it can be interpreted by open-source programs like [http://graphviz.org/ Graphviz] to generate an image file. If Graphviz is installed, Sanson can launch the interpretation itself (you only need to specify in the dot.ini file the location of the dot.exe file used by Graphviz). In this test we have used as output the pdf file sanson_simGraph.pdf (using the -g option), but it can also be interesting to generate an [https://en.wikipedia.org/wiki/Scalable_Vector_Graphics svg] file, which (when opened in a suitable browser) shows the list of protein identifiers present in each category just by hovering the mouse over the category box.&lt;br /&gt;
&lt;br /&gt;
As you will see, each category box has a number of customisable colour columns. Each column represents a protein present in the category, and its colour depends on the standardised log fold-change of that protein (Z, the standardised version of X=logfold, meaning that X has been centred and corrected according to the variance). By default, white means no change, green means there is more quantity in sample B than in sample A (as logarithm(A/B) &amp;lt; 0), and, conversely, red means there is more quantity in sample A than in sample B. The numbers on the arrows indicate the total number of proteins in a category that are also present in the category the arrow points to. The absence of an arrow does not mean the categories share no proteins; it just means that the ratio of shared proteins does not reach the calculated threshold specified above.&lt;br /&gt;
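The standardisation from X to Z mentioned above can be sketched as follows (a simplified illustration only; SanXoT's actual calculation also involves the statistical weights, and the function name and numbers here are made up):&lt;br /&gt;

```python
def standardise(x, grand_mean, variance):
    """Centre a log2 fold-change and scale it by the standard deviation."""
    return (x - grand_mean) / variance ** 0.5

# Hypothetical protein: X = 1.2, grand mean 0.2, variance 0.25
print(round(standardise(1.2, 0.2, 0.25), 6))  # 2.0
```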
&lt;br /&gt;
=== Generating gaussians using SanXoTGauss ===&lt;br /&gt;
[[File:Gauss outGraph.png|thumb|300px|left|Output of SanXoTGauss in Test 3, line 39, showing the theoretical sigmoid for the standard normal distribution (red), the experimental sigmoid for all the proteins in the experiment (blue), and the sigmoids of proteins grouped by ontology (selecting with SanXoTSqueezer only categories that change expression with FDR &amp;lt;= 5% and contain at least 3 proteins).]]&lt;br /&gt;
Representing sigmoids (here, cumulative gaussians) of the different categories, and comparing them to a theoretical normal distribution (thick red line) and to the bulk of proteins (blue line), can help to gain better insight into the meaning of a change. A coordinated proteome is one whose proteins belonging to the same category change in a similar way, together. We can observe this behaviour by representing the proteins classified in the same category as a separate sigmoid. Since we usually represent several categories at once, we draw each in a different colour to better identify them. Categories with a coordinated change will be &amp;quot;parallel&amp;quot; to the normal distribution, but displaced (according to the category-level fold change), while categories behaving in an uncoordinated way will tend to cross the normal distribution (as they contain proteins with both increasing and decreasing expression changes).&lt;br /&gt;
&lt;br /&gt;
This operation is done by '''[[SanXoTGauss]]''', using a command like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=39&amp;gt;&lt;br /&gt;
sanxotgauss.exe -agauss -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -g&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The options are the same as in the previous Sanson command, except -g (which here asks the program not to display the graph, just to save it; this is important to keep the workflow running).&lt;br /&gt;
&lt;br /&gt;
=== Calculating the degree of coordination using the Coordinometer ===&lt;br /&gt;
&lt;br /&gt;
We have defined a parameter called '''''degree of coordination''''', which tells us objectively how coordinated a proteome is. This parameter takes into account the number of outlier proteins (proteins that are expressed differently, compared to proteins classified within the same category), and how much the categories are changing. The '''[[Coordinometer]]''' calculates this number using this command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=40&amp;gt;&lt;br /&gt;
coordinometer.exe -acoordination -p%workingFolder% -qqc_noOuts_outStats.xls -cca_outStats.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Coordinometer extracts the number of relations tagged as outliers (the &amp;quot;out&amp;quot; tag in the third column) from the file qc_noOuts_outStats.xls (option -q; these are proteins expressed differently from the other proteins in the same categories, a calculation from the &amp;quot;top-left&amp;quot; side of the triangle), and gets the number of changing categories from the file ca_outStats.xls (option -c, reading the FDR column directly). The FDR threshold for a changing category is 5% by default, but a different threshold can be set using the --cafdr option.&lt;br /&gt;
&lt;br /&gt;
The result is both printed in the console and written to the corresponding log file, coordination_logFile.txt, along with the information used to calculate it:&lt;br /&gt;
&lt;br /&gt;
 Total number of changing categories: 66&lt;br /&gt;
 Total number of relations pointing to changing categories: 106&lt;br /&gt;
 Total number of outlier relations pointing to changing categories: 2&lt;br /&gt;
 Total number of outlier relations pointing to non-changing categories: 46&lt;br /&gt;
 Degree of coordination: 0.684211&lt;br /&gt;
&lt;br /&gt;
Where 0.684211 is the result of:&lt;br /&gt;
 (numRelsChangingCats - numOutliersChangingCats) / (numRelsChangingCats + numOutliersNonChangingCats) =&lt;br /&gt;
 (106 - 2) / (106 + 46) = 0.684211&lt;br /&gt;
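The same calculation, sketched in Python (the function and variable names are ours, mirroring the log file above):&lt;br /&gt;

```python
def degree_of_coordination(rels_changing, outliers_changing, outliers_non_changing):
    """(numRelsChangingCats - numOutliersChangingCats) /
       (numRelsChangingCats + numOutliersNonChangingCats)"""
    return (rels_changing - outliers_changing) / (rels_changing + outliers_non_changing)

# Values from coordination_logFile.txt in this test:
print(round(degree_of_coordination(106, 2, 46), 6))  # 0.684211
```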
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:SanXoT software package]]&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=Exploring_SanXoT_features&amp;diff=634</id>
		<title>Exploring SanXoT features</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=Exploring_SanXoT_features&amp;diff=634"/>
				<updated>2018-08-17T10:05:09Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Here we will look at the details of three common workflows that use the SanXoT software package.&lt;br /&gt;
&lt;br /&gt;
For tests check '''[[Unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
For a general overview on how to use SanXoT and how it works, see '''[[How to use SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== Test 1: The fundamental workflow ==&lt;br /&gt;
&lt;br /&gt;
We describe here the '''[[fundamental workflow]]''', a workflow similar to the statistics performed by '''[[QuiXoT]]''' using the WSPP (''Weighted Spectrum, Peptide and Protein'') statistical model&amp;lt;ref&amp;gt;Navarro, P., Trevisan-Herraz, M., Bonzon-Kulichenko, E., Núñez, E., Martínez-Acedo, P., Pérez-Hernández, D., Jorge, I., Mesa, R., Calvo, E., Carrascal, M., Hernáez, M.L., García, F., Bárcena, J.A., Ashman, K., Abian, J., Gil, C., Redondo, J.M. and Vázquez, J. (2014) '''[https://pubs.acs.org/doi/abs/10.1021/pr4006958 General statistical framework for quantitative proteomics by stable isotope labeling]'''. ''Journal of proteome research'', 13, 1234-1247.&amp;lt;/ref&amp;gt;. The following text is also available as a [https://en.wikipedia.org/wiki/Batch_file batch file] in '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test1.zip SanXoT_test1.zip]''':&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test1.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 1, FUNDAMENTAL WORKFLOW (SPECTRUM, PEPTIDE, PROTEIN)&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
 set startingFile=170415_Marga_GBS_iTRAQ_PSMs.txt&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM create the relation files&lt;br /&gt;
aljamia.exe -asprels -p%workingFolder% -x%startingFile% -i&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -j&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot;&lt;br /&gt;
aljamia.exe -apqrels -p%workingFolder% -x%startingFile% -i&amp;quot;'[Master Protein Accessions]'.split(';')[0]&amp;quot; -j&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;i&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM create the data file at the scan level&lt;br /&gt;
aljamia.exe -asdata114vs118 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([114]/[118], 2)&amp;quot; -k&amp;quot;max([114], [118])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
aljamia.exe -asdata115vs119 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([115]/[119], 2)&amp;quot; -k&amp;quot;max([115], [119])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM The following tab-separated text files are generated:&lt;br /&gt;
REM sprels_table.tsv, with two columns containing scan-to-peptide identifiers (rawfile-scanNumber-charge for scans, sequence for peptides)&lt;br /&gt;
REM pqrels_table.tsv, with two columns containing peptide-to-protein identifiers (UniProt accession as protein identifier)&lt;br /&gt;
REM sdata114vs118_table.tsv and sdata115vs119_table.tsv, with three columns: {scan identifier, log2(foldchange), weight}&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart114vs118 -p%workingFolder% -dsdata114vs118_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxot.exe -asp114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -Vsp114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp114vs118_noOuts -p%workingFolder% -dstart114vs118_calibrated.tsv -rspouts114vs118_tagged.tsv -Vsp114vs118_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein&lt;br /&gt;
sanxot.exe -apq114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq114vs118_noOuts -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqouts114vs118_tagged.tsv -fg -Vpq114vs118_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa114vs118 -p%workingFolder% -dpq114vs118_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_1:_The_fundamental_workflow|Test 1]]'''.&lt;br /&gt;
&lt;br /&gt;
=== Modifications needed ===&lt;br /&gt;
&lt;br /&gt;
The only lines that need to be modified by the user are these (to match the user's file locations):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=6&amp;gt;&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
set startingFile=170415_Marga_GBS_iTRAQ_PSMs.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the first operations, we will parse the data from the ''startingFile'' using '''[[Aljamia]]''' to generate the relations files. In this case, the input file is a tab-separated text file quantified using [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer 2.1], but any other tab-separated text file that arranges information in columns with headers can be used as input (for example, [https://en.wikipedia.org/wiki/SQLite SQLite] query results from Proteome Discoverer's MSF files, or results from [https://en.wikipedia.org/wiki/Mascot_(software) Mascot]).&lt;br /&gt;
&lt;br /&gt;
=== Parsing the data ===&lt;br /&gt;
&lt;br /&gt;
Let us check the first '''[[Aljamia]]''' operation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=15&amp;gt;&lt;br /&gt;
aljamia.exe -asprels -p%workingFolder% -x%startingFile% -i&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -j&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a common workflow, the SanXoT software package can generate many output files. To make them easier to track, by default all the files generated in a single step share the same prefix, while each type of file has a common suffix. The prefix is set by the ''-a'' argument (it is good practice to ''always'' include this argument). For example, in this case two files are generated: ''sprels_table.tsv'' (the relations table matching '''s'''cans and '''p'''eptides) and ''sprels_log.txt'' (a log file). Unless overridden (check '''[[Aljamia]]''''s help for more information), the filename of the generated relations table will always end in '''_table.tsv'''.&lt;br /&gt;
&lt;br /&gt;
The option -p sets the working folder: all output and input files are assumed to be in that folder, unless a full path is specified. The generated table will include only two columns: the first (-i) containing a non-ambiguous identifier of the higher level, the peptides (in this case, concatenating the columns &amp;quot;Annotated Sequence&amp;quot; and &amp;quot;Modifications&amp;quot;), and the second (-j) merging the data from the &amp;quot;Spectrum File&amp;quot;, &amp;quot;First Scan&amp;quot; and &amp;quot;Charge&amp;quot; columns (an example of a unique identifier for scans, the lower level). Note that the column headers are case-sensitive.&lt;br /&gt;
&lt;br /&gt;
The operation to generate the file for the relations between proteins and peptides is similar:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=16&amp;gt;&lt;br /&gt;
aljamia.exe -apqrels -p%workingFolder% -x%startingFile% -i&amp;quot;'[Master Protein Accessions]'.split(';')[0]&amp;quot; -j&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;i&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notice that the prefix here is ''pqrels'' (some researchers use '''q''' for proteins, to differentiate them from '''p'''eptides); now the peptide identifiers (&amp;quot;Annotated Sequence&amp;quot; + &amp;quot;Modifications&amp;quot;) are in the second column (because they are now the lower level), while proteins are in the first column (as they are now the higher level). We use UniProt accession numbers as protein identifiers. Proteome Discoverer lists several proteins where a peptide is found, but in this case we use only the first one. To do so, we use Python-style commands to split the string in the field &amp;quot;Master Protein Accessions&amp;quot; (using &amp;quot;;&amp;quot;) and then take just the first element (hence the &amp;quot;[0]&amp;quot;). To tell '''[[Aljamia]]''' that this is an operation, and not just text to include, we add the argument -A&amp;quot;i&amp;quot; (to indicate that only column &amp;quot;i&amp;quot; contains operations). We have also included a filter to keep only the identifications that, according to Proteome Discoverer, have confidence either &amp;quot;Medium&amp;quot; or &amp;quot;High&amp;quot; (excluding bad identifications this way).&lt;br /&gt;
&lt;br /&gt;
The next two lines parse the information from the same input file, ''170415_Marga_GBS_iTRAQ_PSMs.txt'', to generate two data files, one for the 114 vs 118 iTRAQ comparative, and the other for 115 vs 119 (as you can observe in the -a option for the prefix):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=19&amp;gt;&lt;br /&gt;
aljamia.exe -asdata114vs118 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([114]/[118], 2)&amp;quot; -k&amp;quot;max([114], [118])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
aljamia.exe -asdata115vs119 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([115]/[119], 2)&amp;quot; -k&amp;quot;max([115], [119])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Data files in SanXoT include a first column (-i) with a unique text identifier (in this case, the scan identifier mentioned above), a second column (-j) with the log2(A/B) ratio (written as a Python-style expression), and a third column (-k) with the weight of the scan (in iTRAQ we simply select the intensity of the most intense peak of the pair). As we did before to split the column with the protein accession number, we use -A&amp;quot;jk&amp;quot; to indicate that there are operations to be performed in the second and third columns (but not the first, which is treated as text).&lt;br /&gt;
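The Python-style expressions passed to Aljamia in -j and -k can be reproduced directly in Python (a sketch with made-up intensities, just to show what ends up in each column):&lt;br /&gt;

```python
import math

# Hypothetical reporter-ion intensities for a single scan (114 vs 118)
i114, i118 = 2400.0, 1200.0

log2_ratio = math.log(i114 / i118, 2)  # second column (-j): log2(A/B)
weight = max(i114, i118)               # third column (-k): most intense peak

print(log2_ratio, weight)  # 1.0 2400.0
```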
&lt;br /&gt;
=== Calibrating the spectra ===&lt;br /&gt;
&lt;br /&gt;
The first step to integrate the data is calibrating the weights assigned to each spectrum (see '''[[How to use SanXoT#Calibrating the data|How to use SanXoT]]''' for more information on why this is important). This task is performed by the module '''[[Klibrate]]''':&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=27&amp;gt;&lt;br /&gt;
klibrate.exe -astart114vs118 -p%workingFolder% -dsdata114vs118_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As in all SanXoT programs, there is a prefix assigned to the operation (-a) and the default working folder is specified (-p). Additionally, -d and -r set the file names of the data and relations files generated by Aljamia (sdata114vs118_table.tsv and sprels_table.tsv, respectively). The -g option here tells the program not to stop to show the sigmoid graph with the comparison to the standard normal distribution.&lt;br /&gt;
&lt;br /&gt;
=== Integration from spectrum to peptide ===&lt;br /&gt;
&lt;br /&gt;
We are ready for the first integration (using the GIA, or ''Generic Integration Algorithm''), which is done by '''[[SanXoT]]''' with the help of '''[[SanXoTSieve]]'''.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=28&amp;gt;&lt;br /&gt;
sanxot.exe -asp114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -Vsp114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp114vs118_noOuts -p%workingFolder% -dstart114vs118_calibrated.tsv -rspouts114vs118_tagged.tsv -Vsp114vs118_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(We have already described the meaning of the -a, -p, -d, -r and -g options.)&lt;br /&gt;
&lt;br /&gt;
In the first of these three commands, SanXoT integrates the lower-level data (in this case, spectra) into a higher level (here, peptides). When doing so, SanXoT calculates the variance of the lower level with respect to the higher level, i.e., how the fold-changes of the different spectra are distributed when assigned to different peptides. The result is given both in the console and in the [prefix]_infoFile.txt file (sp114vs118_infoFile.txt).&lt;br /&gt;
&lt;br /&gt;
Once the variance is known, SanXoTSieve tags the outlier spectra of each peptide (if any) by adding the tag &amp;quot;out&amp;quot; in a third column of the relations file, saving a new relations file (to avoid overwriting the original one). To give SanXoTSieve the variance (essential to define which spectra are outliers), it uses as input the sp114vs118_infoFile.txt (option -V), reading it from the line &amp;quot;Variance = ...&amp;quot;; alternatively, the variance can be set directly with the option -v (lowercase), for example as -v0.02, if we want to use a specific, known variance.&lt;br /&gt;
&lt;br /&gt;
For SanXoTSieve, -f defines the FDR below which a spectrum is considered an outlier. This means that spectra with FDR &amp;lt; 0.01 have a probability &amp;lt; 1% of deviating only by chance from the other spectra in which the same peptide was identified (i.e., they are considered outliers).&lt;br /&gt;
&lt;br /&gt;
When SanXoTSieve has produced the relations file with tagged outliers, SanXoT finally integrates the data; but in this case, instead of calculating the variance, it uses the variance calculated when the outliers were still included (option -V to provide the file, or -v to provide it as a number). We say in this case that ''the variance is forced'', which is indicated with the -f parameter (note that -f has a different meaning here than in SanXoTSieve).&lt;br /&gt;
&lt;br /&gt;
The option --tags=&amp;quot;!out&amp;quot; means that the relations tagged as &amp;quot;out&amp;quot; (by SanXoTSieve) will be excluded, with the same effect as if they had been removed from the relations file. More tags can be combined (see the specific help for more information).&lt;br /&gt;
&lt;br /&gt;
=== Integration from peptide to protein ===&lt;br /&gt;
&lt;br /&gt;
The integration of peptides into proteins is very similar to the previous one:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=33&amp;gt;&lt;br /&gt;
sanxot.exe -apq114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq114vs118_noOuts -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqouts114vs118_tagged.tsv -fg -Vpq114vs118_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The only differences are: 1) instead of taking as input the output file of Klibrate (which contained the calibrated spectrum information), it takes the output of the previous integration (sp114vs118_noOuts_higherLevel.tsv); and 2) the relations file (pqrels_table.tsv) contains the protein identifiers in the first column (in our case the accession number, although any other unique text string can do the job; in other cases the whole FASTA header can be used, which makes proteins more human-readable) and the peptide identifiers (the amino acid sequence with the post-translational modifications) in the second.&lt;br /&gt;
&lt;br /&gt;
The result will be a data file containing protein identifiers, fold-change at protein level and weights of proteins, with no more reference to the peptides or spectra.&lt;br /&gt;
&lt;br /&gt;
=== Getting the quantification of proteins ===&lt;br /&gt;
[[File:Qa114vs118 outGraph.png|thumb|300px|Output file of SanXoT (''qa114vs118_outGraph.png'') in Test 1, line 38, showing the sigmoid with protein expression changes (blue) compared to the theoretical sigmoid of the standard normal distribution.]]&lt;br /&gt;
&lt;br /&gt;
In the last step we compare each protein to the set of proteins as a whole. As you can see, the output data file from the previous integration is used as input (pq114vs118_noOuts_higherLevel.tsv); here there is no relations file, as in this case we are not integrating proteins into different groups (we will see that in Test 2 for merging experiments, and in Test 3 for systems biology).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=39&amp;gt;&lt;br /&gt;
sanxot.exe -aqa114vs118 -p%workingFolder% -dpq114vs118_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of providing a relations file, we include option -C (for ''confluence''), which has the same effect as a relations file with a dummy first column filled entirely with a &amp;quot;1&amp;quot; for every row, and a second column for each protein identifier.&lt;br /&gt;
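To make this concrete, the -C option behaves as if we had built a relations file like the following (a sketch; the accession numbers are hypothetical):&lt;br /&gt;

```python
# The -C ("confluence") option stands for a relations file with a constant
# "1" in the first column and one protein identifier per row in the second.
proteins = ["P12345", "Q67890", "A1B2C3"]  # hypothetical accession numbers

rows = ["1\t%s" % protein for protein in proteins]
relations_file = "\n".join(rows) + "\n"
print(relations_file)
```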
&lt;br /&gt;
The output will include a higher-level file (qa114vs118_higherLevel.tsv, where '''q''' means &amp;quot;protein&amp;quot; and '''a''' means &amp;quot;all&amp;quot;) with only one row, whose fold-change is the weighted arithmetic mean of all proteins (i.e. the grand mean of the whole set). You can see the protein changes in the file qa114vs118_outStats.tsv, where the FDR of changing proteins is presented, alongside their fold-change (Xinf), normalised fold-change (Z) and the grand mean (Xsup).&lt;br /&gt;
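The weighted arithmetic mean used for the grand mean can be sketched as follows (an illustration with made-up fold-changes and weights, not SanXoT's actual code):&lt;br /&gt;

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean, as used for the grand mean of all proteins."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical protein log2 fold-changes and their weights:
grand_mean = weighted_mean([0.5, -0.2, 0.1], [1.0, 2.0, 1.0])
print(round(grand_mean, 6))  # 0.05
```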
&lt;br /&gt;
== Test 2: Experiment merging ==&lt;br /&gt;
&lt;br /&gt;
Here we perform a very simple experiment merging, in which we merge two experiments with comparable information (i.e. if the first experiment was &amp;quot;treated vs control&amp;quot;, the second experiment can be a biological replicate, such as another &amp;quot;treated vs control&amp;quot; for a different treated patient and a different control).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test2.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 2, EXPERIMENT MERGING&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test2&amp;quot;&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM now, same for 115vs119&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart115vs119 -p%workingFolder% -dsdata115vs119_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxot.exe -asp115vs119 -p%workingFolder% -dstart115vs119_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts115vs119 -p%workingFolder% -dstart115vs119_calibrated.tsv -rsprels_table.tsv -Vsp115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp115vs119_noOuts -p%workingFolder% -dstart115vs119_calibrated.tsv -rspouts115vs119_tagged.tsv -Vsp115vs119_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein for 115vs119&lt;br /&gt;
sanxot.exe -apq115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq115vs119_noOuts -p%workingFolder% -dsp115vs119_noOuts_higherLevel.tsv -rpqouts115vs119_tagged.tsv -fg -Vpq115vs119_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes for 115vs119&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa115vs119 -p%workingFolder% -dpq115vs119_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM experiment merging: 115vs119 + 114vs118&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
cardenio.exe -amerge -p%workingFolder% -ttagFile.txt&lt;br /&gt;
sanxot.exe -amerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -g&lt;br /&gt;
sanxotsieve.exe -aoutsMerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -Vmerged_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -amerged_noOuts -p%workingFolder% -dmerge_newData.txt -routsMerged_tagged.tsv -fg -Vmerged_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqaMerged -p%workingFolder% -dmerged_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_2:_Experiment_merging|Test 2]]'''.&lt;br /&gt;
&lt;br /&gt;
=== A parallel analysis ===&lt;br /&gt;
&lt;br /&gt;
In this example we merge just two experiments for simplicity, but it is possible to merge even hundreds of them at once.&lt;br /&gt;
&lt;br /&gt;
Command lines 6-8 have already been discussed; they must be adapted to the paths on the user's computer. Lines 14-29 do the same as discussed in the previous section, but for the 115 vs 119 comparison:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=14&amp;gt;&lt;br /&gt;
REM now, same for 115vs119&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart115vs119 -p%workingFolder% -dsdata115vs119_table.xls -rsprels_table.xls -g&lt;br /&gt;
sanxot.exe -asp115vs119 -p%workingFolder% -dstart115vs119_calibrated.xls -rsprels_table.xls -g&lt;br /&gt;
sanxotsieve.exe -aspouts115vs119 -p%workingFolder% -dstart115vs119_calibrated.xls -rsprels_table.xls -Vsp115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp115vs119_noOuts -p%workingFolder% -dstart115vs119_calibrated.xls -rspouts115vs119_tagged.xls -Vsp115vs119_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein for 115vs119&lt;br /&gt;
sanxot.exe -apq115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqrels_table.xls -g&lt;br /&gt;
sanxotsieve.exe -apqouts115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqrels_table.xls -Vpq115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq115vs119_noOuts -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqouts115vs119_tagged.xls -fg -Vpq115vs119_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes for 115vs119&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa115vs119 -p%workingFolder% -dpq115vs119_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Generating the files for the experiment integration ===&lt;br /&gt;
&lt;br /&gt;
The module for merging experiments, '''[[Cardenio]]''', actually does a very simple thing: it just takes protein identifiers, concatenates a suffix for each experiment (specified in an external file called tagFile) and generates both the data file and relations files, ready for SanXoT to integrate in the usual way:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=36&amp;gt;&lt;br /&gt;
cardenio.exe -amerge -p%workingFolder% -ttagFile.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We have already discussed the -a and -p options across the SanXoT software package. Cardenio also needs -t, which points to a tab-separated file with this structure:&lt;br /&gt;
&lt;br /&gt;
 tag		file&lt;br /&gt;
 114vs118	qa114vs118_lowerNormV.xls&lt;br /&gt;
 115vs119	qa115vs119_lowerNormV.xls&lt;br /&gt;
&lt;br /&gt;
The column &amp;quot;tag&amp;quot; contains the suffix that will be appended to each protein (or other) identifier, while the column &amp;quot;file&amp;quot; contains the filenames of the data files. These files are similar to, but distinct from, the _higherLevel files: they are their centred versions. For example, the one with prefix qa114vs118 was generated in the last step of the 114vs118 integration, and the suffix _lowerNormV tells us it is a data file where the fold-changes have been centred by subtracting the grand mean, X'inf, while the weights have been kept unchanged; files with suffix _lowerNorm'''W''' have identical first and second columns, but the third column, for weights, has been modified to include the general variance of the protein level, as described in the WSPP statistical model.&lt;br /&gt;
&lt;br /&gt;
These identifiers are rearranged into the data and relations files; for example:&lt;br /&gt;
&lt;br /&gt;
The original _lowerNormV file of the first experiment looks like:&lt;br /&gt;
&lt;br /&gt;
 idinf	X'inf		Vinf&lt;br /&gt;
 protId1	0.420637682685	20.8442070633&lt;br /&gt;
 protId2	0.0309318252993	10.405882717&lt;br /&gt;
 protId3	0.235807597358	13.6742537388&lt;br /&gt;
 protId4	-0.939886653794	10.3887308051&lt;br /&gt;
 protId5	0.136998763785	111.910628299&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
And the second file looks like:&lt;br /&gt;
&lt;br /&gt;
 idinf	X'inf		Vinf&lt;br /&gt;
 protId1	0.12726383076	9.37044627116&lt;br /&gt;
 protId2	0.274864264955	4.64481829573&lt;br /&gt;
 protId3	-0.261668207812	5.73886230845&lt;br /&gt;
 protId4	0.40863124663	4.57552314502&lt;br /&gt;
 protId5	0.36753759182	46.0017027453&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
In the output of Cardenio, data is rearranged in merge_newData.txt:&lt;br /&gt;
&lt;br /&gt;
 idinf			X'inf		Vinf&lt;br /&gt;
 protId1'''_114vs118'''	0.420637682685	20.8442070633&lt;br /&gt;
 protId2'''_114vs118'''	0.0309318252993	10.405882717&lt;br /&gt;
 protId3'''_114vs118'''	0.235807597358	13.6742537388&lt;br /&gt;
 protId4'''_114vs118'''	-0.939886653794	10.3887308051&lt;br /&gt;
 protId5'''_114vs118'''	0.136998763785	111.910628299&lt;br /&gt;
 ...&lt;br /&gt;
 protId1'''_115vs119'''	0.12726383076	9.37044627116&lt;br /&gt;
 protId2'''_115vs119'''	0.274864264955	4.64481829573&lt;br /&gt;
 protId3'''_115vs119'''	-0.261668207812	5.73886230845&lt;br /&gt;
 protId4'''_115vs119'''	0.40863124663	4.57552314502&lt;br /&gt;
 protId5'''_115vs119'''	0.36753759182	46.0017027453&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
Along with the corresponding relations file merge_newRels.txt:&lt;br /&gt;
&lt;br /&gt;
 idsup	idinf&lt;br /&gt;
 protId1	protId1'''_114vs118'''&lt;br /&gt;
 protId2	protId2'''_114vs118'''&lt;br /&gt;
 protId3	protId3'''_114vs118'''&lt;br /&gt;
 protId4	protId4'''_114vs118'''&lt;br /&gt;
 protId5	protId5'''_114vs118'''&lt;br /&gt;
 ...&lt;br /&gt;
 protId1	protId1'''_115vs119'''&lt;br /&gt;
 protId2	protId2'''_115vs119'''&lt;br /&gt;
 protId3	protId3'''_115vs119'''&lt;br /&gt;
 protId4	protId4'''_115vs119'''&lt;br /&gt;
 protId5	protId5'''_115vs119'''&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
As can be seen, in the relations file the lower level (second column) links each protein to its experiment (via the suffix, associated with the corresponding data in merge_newData.txt), while at the higher level (first column) proteins are independent of the experiment. This means that, after integrating, we will get only experiment-independent, protein-level information (the information for each protein in each experiment, averaged with its corresponding weight).&lt;br /&gt;
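The suffixing scheme described above can be sketched in a few lines (our own illustration with made-up identifiers and values, not Cardenio's actual code):&lt;br /&gt;

```python
# Sketch of the suffixing logic: per-experiment data keyed by tag
# (identifiers and numbers here are hypothetical).
experiments = {
    "114vs118": {"protId1": (0.4206, 20.84), "protId2": (0.0309, 10.41)},
    "115vs119": {"protId1": (0.1273, 9.37), "protId2": (0.2749, 4.64)},
}

new_data = []  # rows of merge_newData.txt: idinf, X'inf, Vinf
new_rels = []  # rows of merge_newRels.txt: idsup, idinf
for tag, rows in experiments.items():
    for prot, (x, v) in rows.items():
        tagged = prot + "_" + tag        # lower level: experiment-dependent
        new_data.append((tagged, x, v))
        new_rels.append((prot, tagged))  # higher level: experiment-independent
```

Each suffixed identifier appears once in the data file, while the unsuffixed identifier in the first column of the relations file ties all its experiment copies together.&lt;br /&gt;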
&lt;br /&gt;
=== Last step for experiment integration ===&lt;br /&gt;
&lt;br /&gt;
The last lines perform the integration of the two experiments taking the data and relations files from Cardenio (merge_newData.txt and merge_newRels.txt), as seen in the previous section.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=37&amp;gt;&lt;br /&gt;
sanxot.exe -amerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -g&lt;br /&gt;
sanxotsieve.exe -aoutsMerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -Vmerged_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -amerged_noOuts -p%workingFolder% -dmerge_newData.txt -routsMerged_tagged.xls -fg -Vmerged_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqaMerged -p%workingFolder% -dmerged_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The approach here is the same as for the integration from spectrum to peptide or from peptide to protein (but in this case we integrate from &amp;quot;experiment-dependent protein data&amp;quot; to &amp;quot;experiment-independent protein data&amp;quot;):&lt;br /&gt;
:* the first line calculates the variance (storing it in the merged_infoFile.txt file);&lt;br /&gt;
:* then SanXoTSieve tags the outliers (generating the outsMerged_tagged.xls file with the same relations as merge_newRels.txt, but including the tag &amp;quot;out&amp;quot; in the third column);&lt;br /&gt;
:* the third line integrates the proteins using the relations file and the variance from the two previous commands;&lt;br /&gt;
:* finally, all the experiment-independent proteins are compared to each other, getting the variance (stored in merged_noOuts_infoFile.txt) and providing an FDR that can be used to filter changing proteins (in file merged_noOuts_outStats.xls).&lt;br /&gt;
&lt;br /&gt;
The last step will also be used in the Systems Biology Triangle method (SBT)&amp;lt;ref&amp;gt;García-Marqués, F., Trevisan-Herraz, M., Martínez-Martínez, S., Camafeita, E., Jorge, I., Lopez, J.A., Méndez-Barbero, N., Méndez-Ferrer, S., del Pozo, M.A., Ibáñez, B., Andrés, V., Sánchez-Madrid, F., Redondo, J.M., Bonzon-Kulichenko, E. and Vázquez, J. (2016) '''[http://www.mcponline.org/content/15/5/1740.long A novel systems-biology algorithm for the analysis of coordinated protein responses using quantitative proteomics]'''. ''Molecular &amp;amp; Cellular Proteomics''.&amp;lt;/ref&amp;gt; which we will see now (the &amp;quot;protein-to-all&amp;quot; operation is the &amp;quot;bottom side&amp;quot; of the triangle).&lt;br /&gt;
&lt;br /&gt;
== Test 3: Systems biology ==&lt;br /&gt;
&lt;br /&gt;
In this final test we perform a systems biology analysis of the previously merged experiment using the Systems Biology Triangle method. For that we start with the files generated in Test 2, plus a tab-separated table downloaded directly from the '''[https://en.wikipedia.org/wiki/DAVID DAVID bioinformatics resource]'''.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test3.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 3, SYSTEMS BIOLOGY&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test3&amp;quot;&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM conversion of ontological data from DAVID&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
camacho.exe -adbcats1 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -q1 -c5 -fBioCarta_&lt;br /&gt;
camacho.exe -adbcats2 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats1_rels.xls&amp;quot; -c6 -fGOBP_&lt;br /&gt;
camacho.exe -adbcats3 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats2_rels.xls&amp;quot; -c7 -fGOCC_&lt;br /&gt;
camacho.exe -adbcats4 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats3_rels.xls&amp;quot; -c8 -fGOMF_&lt;br /&gt;
camacho.exe -adbcats5 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats4_rels.xls&amp;quot; -c9 -fKEGG_&lt;br /&gt;
camacho.exe -adbcats -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats5_rels.xls&amp;quot; -c10 -fPIR_&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM prot to cat&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -g&lt;br /&gt;
sanxotsieve.exe -aoutsqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -Vqc_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -aqc_noOuts -p%workingFolder% -dmerged_noOuts_higherLevel.xls -routsqc_tagged.xls -fg -Vqc_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM cat to all&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aca -p%workingFolder% -dqc_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
sanxotsqueezer.exe -alist -p%workingFolder% -lqc_noOuts_outStats.xls -uca_outStats.xls -f0.05 -n3&lt;br /&gt;
sanson.exe -asanson -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -gpdf&lt;br /&gt;
sanxotgauss.exe -agauss -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -g&lt;br /&gt;
coordinometer.exe -acoordination -p%workingFolder% -qqc_noOuts_outStats.xls -cca_outStats.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_3:_Systems_biology|Test 3]]'''.&lt;br /&gt;
&lt;br /&gt;
=== Getting the protein-to-category relations file ===&lt;br /&gt;
&lt;br /&gt;
We start by generating the relations file we are going to use in the systems biology analysis. It is a simple relations file, where the first column contains a string with the identifier of the category, and the second column a string with the identifier of the protein (following the same philosophy as the tab-separated tables used for the spectrum-to-peptide assignment of the PSMs and for the peptide-to-protein assignments).&lt;br /&gt;
&lt;br /&gt;
Note that, although we are using DAVID to obtain the ontological categories, we can provide these relations using any tool, or even create custom relations, as long as we use the format mentioned. We can also concatenate relations (such as our custom relations) to previous relations files.&lt;br /&gt;
&lt;br /&gt;
As we said, a good way to get this information from databases such as Gene Ontology or KEGG is to use DAVID as an intermediary; in this case, we use the module '''[[Camacho]]''', which is dedicated to this operation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=17&amp;gt;&lt;br /&gt;
camacho.exe -adbcats1 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -q1 -c5 -fBioCarta_&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here, as in all SanXoT modules, we use the options -a and -p as usual (to define a prefix for the output files of this operation and to set the working directory, respectively). Then we give the filename of the tab-separated file downloaded from DAVID (using -d); we indicate the one-based column number containing the protein accession (option -q; here -q1 indicates it is the first column) and the column containing the category we want to parse into the relations file (using -c; here we want to parse the data from BioCarta, which are in column 5); and we define a prefix for the category (as these are categories from BioCarta, we use -f to set it to &amp;quot;BioCarta_&amp;quot;, with the underscore as a separator to make it more human-readable). The prefix is not mandatory, but it is highly recommended when importing data from different databases, as some categories may have the same name but different protein sets; using a prefix not only avoids overlaps, but also helps track back which database each category comes from.&lt;br /&gt;
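In essence, the -q/-c/-f parsing amounts to the following (a minimal sketch with hypothetical DAVID rows, not Camacho's actual code; we assume here, for illustration, that the categories within a cell are comma-separated):&lt;br /&gt;

```python
# Hypothetical rows of the tab-separated DAVID table: column 1 holds the
# accession, column 5 the BioCarta categories (comma-separated in this sketch).
rows = [
    ["P12345", "-", "-", "-", "h_aktPathway,h_il2Pathway"],
    ["Q67890", "-", "-", "-", "h_aktPathway"],
]
q, c, prefix = 1, 5, "BioCarta_"  # one-based columns, as in -q1 -c5 -fBioCarta_

rels = []  # (category, protein) pairs: the two columns of the relations file
for row in rows:
    protein = row[q - 1]
    for cat in filter(None, row[c - 1].split(",")):
        rels.append((prefix + cat, protein))
```
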
&lt;br /&gt;
The file fastaHeadersNoSequences.fasta is not mandatory. In this case, as we want to use the full FASTA header as the protein identifier (to obtain more human-readable results), we include it. Obviously, it must contain the same identifiers as the ones used for DAVID. Typically, the FASTA file used to identify the proteins should be used, but here we have removed the sequences to avoid distributing an excessively large file with this test.&lt;br /&gt;
&lt;br /&gt;
As a result, we get the file dbcats1_rels.xls. Now we concatenate the relations from the databases of [http://www.geneontology.org Gene Ontology] (BP = biological process, CC = cellular component, and MF = molecular function), [http://www.genome.jp/kegg KEGG] and [https://pir.georgetown.edu PIR]. For each of them we run Camacho again, changing only the -c option (to set the column), the -f option (for the different category prefix) and the file to which they are concatenated (using the option -x). Other users might prefer to simply overwrite the original file but, in order to keep track of the different operations, we use a different file for each output.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=18&amp;gt;&lt;br /&gt;
camacho.exe -adbcats2 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats1_rels.xls&amp;quot; -c6 -fGOBP_&lt;br /&gt;
camacho.exe -adbcats3 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats2_rels.xls&amp;quot; -c7 -fGOCC_&lt;br /&gt;
camacho.exe -adbcats4 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats3_rels.xls&amp;quot; -c8 -fGOMF_&lt;br /&gt;
camacho.exe -adbcats5 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats4_rels.xls&amp;quot; -c9 -fKEGG_&lt;br /&gt;
camacho.exe -adbcats -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats5_rels.xls&amp;quot; -c10 -fPIR_&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We end up with the final file dbcats_rels.xls; this relations file can now be used for as many Systems Biology Triangle analyses as needed, without having to run Camacho again (unless we want to add databases or an updated ontology from DAVID).&lt;br /&gt;
&lt;br /&gt;
=== From proteins to categories ===&lt;br /&gt;
&lt;br /&gt;
We start by using the relations file just generated, dbcats_rels.xls, to integrate proteins into categories. Every protein has a different weight, depending on the quality of the scans, the number of peptides detected and quantified for each protein, and perhaps even the different experiments where it has been found. As a result, the information from each protein carries a different weight in the category where it has been classified (in the same way as happened at the previous levels). Getting the category-level information (rather than a lengthy list of quantified proteins) will help us get the big picture of what is going on in the proteome. Additionally, using the Systems Biology Triangle, we integrate the information of all the quantified proteins, not just a list of proteins whose quantification exceeds a defined threshold (as in over-representation algorithms).&lt;br /&gt;
&lt;br /&gt;
We integrate in the usual manner: first getting the variance of the proteins, then tagging the outliers (proteins that behave differently from other proteins classified in the same group), and finally performing the integration with the previously calculated variance, excluding the outlier proteins:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=28&amp;gt;&lt;br /&gt;
sanxot.exe -aqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -g&lt;br /&gt;
sanxotsieve.exe -aoutsqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -Vqc_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -aqc_noOuts -p%workingFolder% -dmerged_noOuts_higherLevel.xls -routsqc_tagged.xls -fg -Vqc_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This operation is the &amp;quot;top-left&amp;quot; side of the Systems Biology Triangle.&lt;br /&gt;
&lt;br /&gt;
The last integration, where categories are compared to each other, is the &amp;quot;top-right&amp;quot; side of the triangle:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=34&amp;gt;&lt;br /&gt;
sanxot.exe -aca -p%workingFolder% -dqc_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As we did in line 38 of Test 1 and line 41 of Test 2, this is a confluence, meaning that we don't need a relations file, because the whole set of categories is going to be considered as a single group.&lt;br /&gt;
&lt;br /&gt;
=== Filtering the most relevant categories using SanXoTSqueezer ===&lt;br /&gt;
&lt;br /&gt;
Now we have the information needed to extract some details from the analysis. Sometimes there are too many categories, which makes the analysis difficult. We could build &amp;quot;supercategories&amp;quot; using a category-to-supercategory relations file, but in this case we will just filter the categories with '''[[SanXoTSqueezer]]''', keeping those with a minimum of 3 quantified proteins (using the option -n) and a maximum FDR of 5% (using the option -f). We also need the two outStats files, for the protein-to-category (&amp;quot;qc&amp;quot;, using option -l) and the category-to-all (&amp;quot;ca&amp;quot;, using option -u) operations, taken from previous steps (qc_noOuts_outStats.xls and ca_outStats.xls, respectively):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=37&amp;gt;&lt;br /&gt;
sanxotsqueezer.exe -alist -p%workingFolder% -lqc_noOuts_outStats.xls -uca_outStats.xls -f0.05 -n3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output is a list of categories, with some information associated, in file list_outList.xls; we will analyse the categories in this list.&lt;br /&gt;
&lt;br /&gt;
=== Clustering together similar categories using Sanson ===&lt;br /&gt;
&lt;br /&gt;
The point of integrating proteins into categories is managing less, but more meaningful, information. However, databases can contain so many categories that sometimes there are still too many to interpret the results biologically.&lt;br /&gt;
&lt;br /&gt;
In some cases there are duplicated categories that contain the same proteins under a different name, especially when we combine different databases. This is easy to detect; more commonly, though, categories have ''similar'' but not ''identical'' sets of proteins, which is a subtler problem. The module '''[[Sanson]]''' helps detect similar categories, generating clusters of similar sets. A &amp;quot;similarity&amp;quot; threshold can be set (where the similarity is given by an FNumber, depending on how many proteins are shared by two given categories), but the algorithm uses a matrix of protein-to-category items (the similarity matrix), and maximises the number of categories in each cluster (the CNumber) and the number of clusters, partitioning them with a [https://en.wikipedia.org/wiki/Durfee_square Durfee square]-based calculation:&lt;br /&gt;
&lt;br /&gt;
 With this similarity-threshold, are there at least 3 clusters with at least 3 categories? --&amp;gt; true&lt;br /&gt;
 With this similarity-threshold, are there at least 4 clusters with at least 4 categories? --&amp;gt; true&lt;br /&gt;
 With this similarity-threshold, are there at least 5 clusters with at least 5 categories? --&amp;gt; false&lt;br /&gt;
 With this similarity-threshold, are there at least 6 clusters with at least 6 categories? --&amp;gt; false&lt;br /&gt;
&lt;br /&gt;
In this example, it could happen that there are (say) 5 clusters with 4 categories, or 3 clusters with 6 categories, but not 5 clusters with 5 categories, so the best CNumber in this case would be 4 (note that this is the same logic used to calculate the h-index metric).&lt;br /&gt;
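This h-index-like rule can be written down explicitly (our own sketch of the idea, not Sanson's implementation): given the cluster sizes obtained at a given similarity threshold, the CNumber is the largest n such that at least n clusters contain at least n categories.&lt;br /&gt;

```python
def cnumber(cluster_sizes):
    """Largest n such that at least n clusters have at least n categories."""
    sizes = sorted(cluster_sizes, reverse=True)
    n = 0
    while n < len(sizes) and sizes[n] >= n + 1:
        n += 1
    return n
```

With the sizes from the example above, five clusters of four categories give a CNumber of 4, while three clusters of six categories give a CNumber of 3.&lt;br /&gt;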
&lt;br /&gt;
In this test, Sanson is executed with this command line:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=38&amp;gt;&lt;br /&gt;
sanson.exe -asanson -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -gpdf&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here we already know the relations file dbcats_rels.xls (option -r); the file list_outList.xls (option -c), generated in the previous step, provides the categories that will be considered; and the outStats file qaMerged_outStats.xls (option -z) provides the protein-level, category-independent quantifications (generated in the last command of Test 2), which supply the protein information for each category.&lt;br /&gt;
&lt;br /&gt;
The output of Sanson is a .gv file (here sanson_simGraph.gv), where category clusters are described in the [http://en.wikipedia.org/wiki/DOT_language DOT language]; it can be interpreted by open-source programs like [http://graphviz.org/ Graphviz] to generate an image file. If Graphviz is installed, Sanson can launch the interpretation itself (you only need to specify in the dot.ini file the location of the dot.exe file used by Graphviz). In this test, we have used as output the pdf file sanson_simGraph.pdf (using the -g option), but it can also be interesting to generate an [https://en.wikipedia.org/wiki/Scalable_Vector_Graphics svg] file, which (when opened in a suitable browser) shows the list of protein identifiers present in each category just by hovering the mouse over the category box.&lt;br /&gt;
&lt;br /&gt;
As you will see, each category box has a number of customisable colour columns. Each column represents a protein present in the category, and its colour depends on the standardised log-fold-change of the protein (Z, the standardised version of X=logfold, meaning that X has been centred and corrected according to the variance). By default, white means no change, green means there is more quantity in sample B than in sample A (as logarithm(A/B) &amp;lt; 0), and, conversely, red means there is more quantity in sample A than in sample B. The arrows carry numbers indicating the total number of proteins in a category that are also present in the category the arrow points to. The absence of an arrow does not mean two categories share no proteins; it just means that the ratio of shared proteins does not reach the calculated threshold specified above.&lt;br /&gt;
&lt;br /&gt;
=== Generating gaussians using SanXoTGauss ===&lt;br /&gt;
[[File:Gauss outGraph.png|thumb|300px|left|Output of SanXoTGauss in Test 3, line 39, showing the theoretical sigmoid for the standard normal distribution (red), the experimental sigmoid for all the proteins in the experiment (blue), and the sigmoids of proteins grouped by ontology (selecting with SanXoTSqueezer only categories that are changing expression with FDR &amp;lt;= 5% and having at least 3 proteins).]]&lt;br /&gt;
Representing sigmoids (here, cumulative gaussians) of the different categories, and comparing them to a theoretical normal distribution (thick red line) and to the bulk of proteins (blue line), can help gain better insight into the meaning of a change. A coordinated proteome is one whose proteins belonging to the same categories change in a similar way, together. We can observe this behaviour by representing the proteins classified in the same category as a separate sigmoid. Since we usually represent several categories at once, we draw each with a different colour to identify them more easily. Categories with a coordinated change will be &amp;quot;parallel&amp;quot; to the normal distribution, but displaced (according to the category-level fold change), while categories behaving in an uncoordinated way will tend to cross the normal distribution (as they contain proteins with both increasing and decreasing expression changes).&lt;br /&gt;
&lt;br /&gt;
This operation is done by '''[[SanXoTGauss]]''', using a command like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=39&amp;gt;&lt;br /&gt;
sanxotgauss.exe -agauss -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -g&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The options are the same as in the previous Sanson command, except for -g (which here asks the program not to show the graph, just to save it, which is important to keep the workflow going).&lt;br /&gt;
&lt;br /&gt;
=== Calculating the degree of coordination using the Coordinometer ===&lt;br /&gt;
&lt;br /&gt;
We have defined a parameter called the '''''degree of coordination''''', which tells us objectively how coordinated a proteome is. This parameter takes into account the number of outlier proteins (proteins that are expressed differently from proteins classified within the same category), and how much the categories are changing. The '''[[Coordinometer]]''' calculates this number using this command:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=40&amp;gt;&lt;br /&gt;
coordinometer.exe -acoordination -p%workingFolder% -qqc_noOuts_outStats.xls -cca_outStats.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Coordinometer extracts the number of relations tagged as outliers (using the &amp;quot;out&amp;quot; tag in the third column) from the file qc_noOuts_outStats.xls (using option -q; these are proteins expressed differently from other proteins in the same categories, a calculation from the &amp;quot;top-left&amp;quot; side of the triangle), and obtains the number of changing categories from the file ca_outStats.xls (using option -c and reading the FDR column directly). The FDR threshold is 5% by default, but a different value can be set using the --cafdr option.&lt;br /&gt;
&lt;br /&gt;
The result is printed both on the console and in the corresponding log file, coordination_logFile.txt, along with the information used to calculate it:&lt;br /&gt;
&lt;br /&gt;
 Total number of changing categories: 66&lt;br /&gt;
 Total number of relations pointing to changing categories: 106&lt;br /&gt;
 Total number of outlier relations pointing to changing categories: 2&lt;br /&gt;
 Total number of outlier relations pointing to non-changing categories: 46&lt;br /&gt;
 Degree of coordination: 0.684211&lt;br /&gt;
&lt;br /&gt;
Where 0.684211 is the result of:&lt;br /&gt;
 (numRelsChangingCats - numOutliersChangingCats) / (numRelsChangingCats + numOutliersNonChangingCats) =&lt;br /&gt;
 (106 - 2) / (106 + 46) = 0.684211&lt;br /&gt;
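The same calculation, written as a small function (an illustrative transcription of the formula above, not the Coordinometer's own code):&lt;br /&gt;

```python
def degree_of_coordination(rels_changing, outliers_changing, outliers_non_changing):
    """Transcription of the formula above, using the counts from the log file."""
    return (rels_changing - outliers_changing) / (rels_changing + outliers_non_changing)

# Counts from the log file above: 106 relations and 2 outlier relations pointing
# to changing categories, 46 outlier relations pointing to non-changing ones.
dc = round(degree_of_coordination(106, 2, 46), 6)  # -> 0.684211
```
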
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:SanXoT software package]]&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=Exploring_SanXoT_features&amp;diff=633</id>
		<title>Exploring SanXoT features</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=Exploring_SanXoT_features&amp;diff=633"/>
				<updated>2018-08-17T10:03:30Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: /* Getting the quantification of proteins */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Here we will look at the details of three common workflows that use the SanXoT software package.&lt;br /&gt;
&lt;br /&gt;
For tests check '''[[Unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
For a general overview on how to use SanXoT and how it works, see '''[[How to use SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== Test 1: The fundamental workflow ==&lt;br /&gt;
&lt;br /&gt;
We describe here the '''[[fundamental workflow]]''', a workflow similar to the statistics done by '''[[QuiXoT]]''' using the WSPP (''Weighted Spectrum, Peptide and Protein'') statistical model&amp;lt;ref&amp;gt;Navarro, P., Trevisan-Herraz, M., Bonzon-Kulichenko, E., Núñez, E., Martínez-Acedo, P., Pérez-Hernández, D., Jorge, I., Mesa, R., Calvo, E., Carrascal, M., Hernáez, M.L., García, F., Bárcena, J.A., Ashman, K., Abian, J., Gil, C., Redondo, J.M. and Vázquez, J. (2014) '''[https://pubs.acs.org/doi/abs/10.1021/pr4006958 General statistical framework for quantitative proteomics by stable isotope labeling]'''. ''Journal of proteome research'', 13, 1234-1247.&amp;lt;/ref&amp;gt;. The following text is available also at a [https://en.wikipedia.org/wiki/Batch_file batch file] in '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test1.zip SanXoT_test1.zip]''':&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test1.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 1, FUNDAMENTAL WORKFLOW (SPECTRUM, PEPTIDE, PROTEIN)&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
 set startingFile=170415_Marga_GBS_iTRAQ_PSMs.txt&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM create the relation files&lt;br /&gt;
aljamia.exe -asprels -p%workingFolder% -x%startingFile% -i&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -j&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot;&lt;br /&gt;
aljamia.exe -apqrels -p%workingFolder% -x%startingFile% -i&amp;quot;'[Master Protein Accessions]'.split(';')[0]&amp;quot; -j&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;i&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM create the data file at the scan level&lt;br /&gt;
aljamia.exe -asdata114vs118 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([114]/[118], 2)&amp;quot; -k&amp;quot;max([114], [118])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
aljamia.exe -asdata115vs119 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([115]/[119], 2)&amp;quot; -k&amp;quot;max([115], [119])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM The following tab-separated text files (despite their extensions) are generated:&lt;br /&gt;
REM sprels_table.tsv, with two columns, having scan-to-peptide identifiers (rawfile-scanNumber-charge for scans, sequence for peptides)&lt;br /&gt;
REM pqrels_table.tsv, with two columns, having peptide-to-protein identifiers (text of their fasta header as protein identifier)&lt;br /&gt;
REM sdata_table.tsv, with three columns having {scan identifier, log2(foldchange), weight}&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart114vs118 -p%workingFolder% -dsdata114vs118_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxot.exe -asp114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -Vsp114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp114vs118_noOuts -p%workingFolder% -dstart114vs118_calibrated.tsv -rspouts114vs118_tagged.tsv -Vsp114vs118_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein&lt;br /&gt;
sanxot.exe -apq114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq114vs118_noOuts -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqouts114vs118_tagged.tsv -fg -Vpq114vs118_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa114vs118 -p%workingFolder% -dpq114vs118_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_1:_The_fundamental_workflow|Test 1]]'''.&lt;br /&gt;
&lt;br /&gt;
=== Modifications needed ===&lt;br /&gt;
&lt;br /&gt;
The only lines that need to be modified by the user are these (to match the user's file locations):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=6&amp;gt;&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
set startingFile=170415_Marga_GBS_iTRAQ_PSMs.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the first operations, we will parse the data from the ''startingFile'' using '''[[Aljamia]]''' to generate the relations files. In this case, the input file is a tab-separated text file quantified using [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer 2.1], but any other tab-separated text file that arranges information in columns with headers can be used as input (for example, [https://en.wikipedia.org/wiki/SQLite SQLite] query results from Proteome Discoverer's MSF files, or results from [https://en.wikipedia.org/wiki/Mascot_(software) Mascot]).&lt;br /&gt;
&lt;br /&gt;
=== Parsing the data ===&lt;br /&gt;
&lt;br /&gt;
Let us check the first '''[[Aljamia]]''' operation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=15&amp;gt;&lt;br /&gt;
aljamia.exe -asprels -p%workingFolder% -x%startingFile% -i&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -j&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a common workflow, the SanXoT software package can generate many output files. To make them easier to track, by default all the files generated in a single step share the same prefix, while each type of file has a common suffix. The prefix is set by the ''-a'' argument (it is good practice to always include this argument). For example, in this case two files are generated: ''sprels_table.tsv'' (the relations table matching '''s'''cans and '''p'''eptides) and ''sprels_log.txt'' (a log file). Unless overridden (check '''[[Aljamia]]''''s help for more information), the filename of the generated relations table will always end in '''_table.tsv'''.&lt;br /&gt;
&lt;br /&gt;
The option -p sets the working folder: all output and input files are assumed to be in that folder, unless a full path is specified. The generated table will include only two columns: the first (-i) builds an unambiguous identifier for the higher level, the peptides (in this case, by concatenating the &amp;quot;Annotated Sequence&amp;quot; and &amp;quot;Modifications&amp;quot; columns), and the second (-j) merges the data from the &amp;quot;Spectrum File&amp;quot;, &amp;quot;First Scan&amp;quot; and &amp;quot;Charge&amp;quot; columns (a unique identifier for the scans, the lower level). Note that the column headers are case-sensitive.&lt;br /&gt;
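To make the -i and -j expressions more concrete, here is a minimal Python sketch (not Aljamia itself; the row values are hypothetical) of how these column-based identifiers and the confidence filter could be evaluated:&lt;br /&gt;

```python
# Illustrative sketch of what the -i/-j/-f expressions above compute;
# the row contents are hypothetical, column names are from the commands.
row = {"Annotated Sequence": "AAGVNVEPFWPGLFAK",
       "Modifications": "N-Term(iTRAQ4plex)",
       "Spectrum File": "run01.raw", "First Scan": "12345",
       "Charge": "2", "Confidence": "High"}

# -f"[Confidence]==High || [Confidence]==Medium"
keep = row["Confidence"] in ("High", "Medium")

# -i"[Annotated Sequence]_[Modifications]"  (peptide, higher level)
peptide_id = row["Annotated Sequence"] + "_" + row["Modifications"]

# -j"[Spectrum File]-[First Scan]-[Charge]"  (scan, lower level)
scan_id = "-".join([row["Spectrum File"], row["First Scan"], row["Charge"]])
```
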
&lt;br /&gt;
The operation to generate the file for the relations between proteins and peptides is similar:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=16&amp;gt;&lt;br /&gt;
aljamia.exe -apqrels -p%workingFolder% -x%startingFile% -i&amp;quot;'[Master Protein Accessions]'.split(';')[0]&amp;quot; -j&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;i&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notice that the prefix here is ''pqrels'' (some researchers use '''q''' for proteins, to differentiate them from '''p'''eptides); now the peptide identifiers (&amp;quot;Annotated Sequence&amp;quot; + &amp;quot;Modifications&amp;quot;) are in the second column (because they are now the lower level), while the proteins are in the first column (as they are now the higher level). We use UniProt's accession numbers as protein identifiers. Proteome Discoverer lists several proteins where each peptide is found, but here we keep only the first one. To do so, we use Python-style commands to split the string in the &amp;quot;Master Protein Accessions&amp;quot; field (using &amp;quot;;&amp;quot;), and then take just the first element (hence the &amp;quot;[0]&amp;quot;). To tell '''[[Aljamia]]''' that this is an operation, and not just text to include, we add the argument -A&amp;quot;i&amp;quot; (to indicate that only column &amp;quot;i&amp;quot; contains operations). We have also included a filter to keep only the identifications that, according to Proteome Discoverer, have either &amp;quot;Medium&amp;quot; or &amp;quot;High&amp;quot; confidence (thereby excluding poor identifications).&lt;br /&gt;
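As an illustration of that split operation (with a hypothetical accessions string):&lt;br /&gt;

```python
# Hypothetical value of the "Master Protein Accessions" field
accessions = "P12345; Q67890; A0A024"

# Python-style command used in the -i option above:
# split on ";" and keep only the first element
first_protein = accessions.split(';')[0]
```
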
&lt;br /&gt;
The next two lines parse the information from the same input file, ''170415_Marga_GBS_iTRAQ_PSMs.txt'', to generate two data files, one for the 114 vs 118 iTRAQ comparison, and the other for 115 vs 119 (as you can see in the -a option for the prefix):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=19&amp;gt;&lt;br /&gt;
aljamia.exe -asdata114vs118 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([114]/[118], 2)&amp;quot; -k&amp;quot;max([114], [118])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
aljamia.exe -asdata115vs119 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([115]/[119], 2)&amp;quot; -k&amp;quot;max([115], [119])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Data files in SanXoT include a first column (-i) with a unique text identifier (in this case, the scan identifier mentioned above), a second column (-j) with the log2(A/B) ratio (coded as an enclosed Python-style expression), and a third column (-k) with the weight of the scan (in iTRAQ we simply take the intensity of the most intense peak in the pair). As we did before to split the column with the protein accession numbers, we use -A&amp;quot;jk&amp;quot; to indicate that there are operations to be performed in the second and third columns (but not in the first, which is treated as text).&lt;br /&gt;
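For example, with hypothetical reporter-ion intensities, the -j and -k expressions evaluate as follows:&lt;br /&gt;

```python
import math

# Hypothetical reporter-ion intensities for one scan
i114, i118 = 2000.0, 500.0

x = math.log(i114 / i118, 2)  # -j: log2 fold-change
w = max(i114, i118)           # -k: weight = most intense peak of the pair
```
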
&lt;br /&gt;
=== Calibrating the spectra ===&lt;br /&gt;
&lt;br /&gt;
The first step to integrate the data is calibrating the weights assigned to each spectrum (see '''[[How to use SanXoT#Calibrating the data|How to use SanXoT]]''' for more information on why this is important). This task is performed by the module '''[[Klibrate]]''':&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=27&amp;gt;&lt;br /&gt;
klibrate.exe -astart114vs118 -p%workingFolder% -dsdata114vs118_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As in all SanXoT programs, a prefix is assigned to the operation (-a) and the default working folder is specified (-p). Additionally, -d and -r set the file names of the data and relations files generated by Aljamia (sdata114vs118_table.tsv and sprels_table.tsv, respectively). The -g option tells the program not to pause to display the sigmoid graph comparing the data to the standard normal distribution.&lt;br /&gt;
&lt;br /&gt;
=== Integration from spectrum to peptide ===&lt;br /&gt;
&lt;br /&gt;
We are ready for the first integration (using the GIA, or ''Generic Integration Algorithm''), which is done by '''[[SanXoT]]''' with the help of '''[[SanXoTSieve]]'''.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=28&amp;gt;&lt;br /&gt;
sanxot.exe -asp114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -Vsp114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp114vs118_noOuts -p%workingFolder% -dstart114vs118_calibrated.tsv -rspouts114vs118_tagged.tsv -Vsp114vs118_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(We have already described the meaning of the -a, -p, -d, -r and -g options.)&lt;br /&gt;
&lt;br /&gt;
In the first of these three commands, SanXoT integrates the lower-level data (in this case, spectra) into a higher level (here, peptides). When doing so, SanXoT calculates the variance of the lower level with respect to the higher level, i.e., how the fold-changes of the different spectra are distributed when assigned to their peptides. The result is shown both in the console and in the [prefix]_infoFile.txt (sp114vs118_infoFile.txt).&lt;br /&gt;
&lt;br /&gt;
Once the variance is known, SanXoTSieve tags the outlier spectra of each peptide (if any) by adding the tag &amp;quot;out&amp;quot; in a third column of the relations file, saving a new relations file (to avoid overwriting the original one). To obtain the variance (essential to determine which spectra are outliers), SanXoTSieve takes sp114vs118_infoFile.txt as input (option -V), reading it from the line &amp;quot;Variance = ...&amp;quot;; alternatively, the variance can be set directly with the option -v (lowercase), for example as -v0.02, if one wants to use a specific, known variance.&lt;br /&gt;
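A minimal sketch of how such a variance line could be parsed, assuming only what is stated above (a line of the form ''Variance = ...'' somewhere in the info file):&lt;br /&gt;

```python
def read_variance(lines):
    """Return the variance from a line like 'Variance = 0.0235', else None.

    Illustrative helper, not SanXoTSieve's actual parser; it assumes only
    the 'Variance = ...' line format described in the text.
    """
    for line in lines:
        if line.startswith("Variance ="):
            return float(line.split("=", 1)[1])
    return None
```
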
&lt;br /&gt;
For SanXoTSieve, -f defines the FDR threshold below which a spectrum is considered an outlier. This means that spectra with FDR &amp;lt; 0.01 have a quantification that deviates in a statistically significant way from the other spectra where the same peptide has been identified (i.e., they are treated as outliers).&lt;br /&gt;
&lt;br /&gt;
Once SanXoTSieve has produced the relations file with the tagged outliers, SanXoT finally integrates the data; but in this case, instead of calculating the variance, it uses the variance calculated when the outliers were still included (option -V to provide the file, or -v to provide it as a number). We say in this case that ''the variance is forced'', which is indicated with the -f parameter (note that -f has a different meaning here than in SanXoTSieve).&lt;br /&gt;
&lt;br /&gt;
The option --tags=&amp;quot;!out&amp;quot; means that the relations tagged as &amp;quot;out&amp;quot; (by SanXoTSieve) will be excluded (with the same effect as if they had been removed). More tags can be combined (see the specific help for more information).&lt;br /&gt;
&lt;br /&gt;
=== Integration from peptide to protein ===&lt;br /&gt;
&lt;br /&gt;
The integration of peptides into proteins is very similar to the previous one:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=33&amp;gt;&lt;br /&gt;
sanxot.exe -apq114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq114vs118_noOuts -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqouts114vs118_tagged.tsv -fg -Vpq114vs118_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The only differences are: 1) instead of taking as input the output file of Klibrate (which contained the calibrated spectrum information), it takes the output of the previous integration (sp114vs118_noOuts_higherLevel.tsv); and 2) the relations file (pqrels_table.tsv) contains the protein identifiers in the first column (in our case we use the accession number, although any other unique text string can do the job; in other cases the whole FASTA header can be used, which makes proteins more human-readable), and the peptide identifiers (the amino acid sequence with the post-translational modifications) in the second column.&lt;br /&gt;
&lt;br /&gt;
The result will be a data file containing protein identifiers, fold-change at protein level and weights of proteins, with no more reference to the peptides or spectra.&lt;br /&gt;
&lt;br /&gt;
=== Getting the quantification of proteins ===&lt;br /&gt;
[[File:Qa114vs118 outGraph.png|thumb|300px|Output file of SanXoT (''qa114vs118_outGraph.png'') in Test 1, line 38, showing the sigmoid with protein expression changes (blue) compared to the theoretical sigmoid of the standard normal distribution.]]&lt;br /&gt;
&lt;br /&gt;
In the last step we compare the proteins to the set of proteins as a whole. As you can see, the output data file from the previous integration is used as input (pq114vs118_noOuts_higherLevel.tsv); here there is no relations file, as in this case we are not integrating proteins into different groups (we will see this in Test 2 for merging experiments, and in Test 3 for systems biology).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=39&amp;gt;&lt;br /&gt;
sanxot.exe -aqa114vs118 -p%workingFolder% -dpq114vs118_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of providing a relations file, we include the option -C (for ''confluence''), which has the same effect as a relations file with a dummy first column filled entirely with a &amp;quot;1&amp;quot; in every row, and a second column with each protein identifier.&lt;br /&gt;
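The equivalent dummy relations table could be built like this (a sketch with hypothetical protein identifiers):&lt;br /&gt;

```python
# -C behaves like a relations table pairing the dummy higher-level
# identifier "1" with every protein identifier (lower level).
proteins = ["protId1", "protId2", "protId3"]
confluence_rels = [("1", prot) for prot in proteins]
```
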
&lt;br /&gt;
The output will include a higher-level file (qa114vs118_higherLevel.tsv, where '''q''' means &amp;quot;protein&amp;quot; and '''a''' means &amp;quot;all&amp;quot;) with only one row, where the fold-change is the weighted arithmetic mean of all proteins (i.e., the grand mean of the whole set). You can see the protein changes in the file qa114vs118_outStats.tsv, where the FDR of changing proteins is presented, alongside their fold-change (Xinf), normalised fold-change (Z) and the grand mean (Xsup).&lt;br /&gt;
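The grand mean mentioned above is simply a weighted arithmetic mean; a sketch with hypothetical fold-changes and weights:&lt;br /&gt;

```python
# Hypothetical protein fold-changes (x) and weights (w)
x = [0.4, -0.2, 0.1]
w = [20.0, 10.0, 30.0]

# Weighted arithmetic mean of all proteins (the grand mean)
grand_mean = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
```
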
&lt;br /&gt;
== Test 2: Experiment merging ==&lt;br /&gt;
&lt;br /&gt;
Here we perform a very simple experiment merge, in which we combine two experiments with comparable information (i.e. if the first experiment was &amp;quot;treated vs control&amp;quot;, the second experiment can be a biological replicate, such as another &amp;quot;treated vs control&amp;quot; with a different treated patient and a different control).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test2.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 2, EXPERIMENT MERGING&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test2&amp;quot;&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM now, same for 115vs119&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart115vs119 -p%workingFolder% -dsdata115vs119_table.xls -rsprels_table.xls -g&lt;br /&gt;
sanxot.exe -asp115vs119 -p%workingFolder% -dstart115vs119_calibrated.xls -rsprels_table.xls -g&lt;br /&gt;
sanxotsieve.exe -aspouts115vs119 -p%workingFolder% -dstart115vs119_calibrated.xls -rsprels_table.xls -Vsp115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp115vs119_noOuts -p%workingFolder% -dstart115vs119_calibrated.xls -rspouts115vs119_tagged.xls -Vsp115vs119_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein for 115vs119&lt;br /&gt;
sanxot.exe -apq115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqrels_table.xls -g&lt;br /&gt;
sanxotsieve.exe -apqouts115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqrels_table.xls -Vpq115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq115vs119_noOuts -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqouts115vs119_tagged.xls -fg -Vpq115vs119_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes for 115vs119&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa115vs119 -p%workingFolder% -dpq115vs119_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM experiment merging: 115vs119 + 114vs118&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
cardenio.exe -amerge -p%workingFolder% -ttagFile.txt&lt;br /&gt;
sanxot.exe -amerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -g&lt;br /&gt;
sanxotsieve.exe -aoutsMerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -Vmerged_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -amerged_noOuts -p%workingFolder% -dmerge_newData.txt -routsMerged_tagged.xls -fg -Vmerged_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqaMerged -p%workingFolder% -dmerged_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_2:_Experiment_merging|Test 2]]'''.&lt;br /&gt;
&lt;br /&gt;
=== A parallel analysis ===&lt;br /&gt;
&lt;br /&gt;
In this example we merge just two experiments for simplicity, but it is possible to merge even hundreds of them at once.&lt;br /&gt;
&lt;br /&gt;
Command lines 6-8 have already been discussed; they must be adapted to the paths on the user's computer. Lines 14-29 do the same as discussed in the previous section, but for the 115 vs 119 comparison:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=14&amp;gt;&lt;br /&gt;
REM now, same for 115vs119&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart115vs119 -p%workingFolder% -dsdata115vs119_table.xls -rsprels_table.xls -g&lt;br /&gt;
sanxot.exe -asp115vs119 -p%workingFolder% -dstart115vs119_calibrated.xls -rsprels_table.xls -g&lt;br /&gt;
sanxotsieve.exe -aspouts115vs119 -p%workingFolder% -dstart115vs119_calibrated.xls -rsprels_table.xls -Vsp115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp115vs119_noOuts -p%workingFolder% -dstart115vs119_calibrated.xls -rspouts115vs119_tagged.xls -Vsp115vs119_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein for 115vs119&lt;br /&gt;
sanxot.exe -apq115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqrels_table.xls -g&lt;br /&gt;
sanxotsieve.exe -apqouts115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqrels_table.xls -Vpq115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq115vs119_noOuts -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqouts115vs119_tagged.xls -fg -Vpq115vs119_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes for 115vs119&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa115vs119 -p%workingFolder% -dpq115vs119_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Generating the files for the experiment integration ===&lt;br /&gt;
&lt;br /&gt;
The module for merging experiments, '''[[Cardenio]]''', actually does a very simple thing: it takes the protein identifiers, concatenates a suffix for each experiment (specified in an external file called the tagFile), and generates both the data and relations files, ready for SanXoT to integrate in the usual way:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=36&amp;gt;&lt;br /&gt;
cardenio.exe -amerge -p%workingFolder% -ttagFile.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We have already discussed the -a and -p options across the SanXoT software package. Cardenio also needs -t, which points to a tab-separated file with this structure:&lt;br /&gt;
&lt;br /&gt;
 tag		file&lt;br /&gt;
 114vs118	qa114vs118_lowerNormV.xls&lt;br /&gt;
 115vs119	qa115vs119_lowerNormV.xls&lt;br /&gt;
&lt;br /&gt;
The column &amp;quot;tag&amp;quot; includes the suffix that will be appended to each protein (or other) identifier, while the column &amp;quot;file&amp;quot; contains the filenames of the data files. These files are similar to, but distinct from, the _higherLevel files: they are their centred versions. For example, the one with prefix qa114vs118 was generated in the last step of the 114vs118 integration, and the suffix _lowerNormV tells us it is a data file where the fold-changes have been centred by subtracting the grand mean, X'inf, while the weights have been kept unchanged; files with the suffix _lowerNorm'''W''' have identical first and second columns, but the third column, for the weights, has been modified to include the general variance of the protein level, as described in the WSPP statistical model.&lt;br /&gt;
&lt;br /&gt;
These identifiers are rearranged into the relations and data files as in the following example.&lt;br /&gt;
&lt;br /&gt;
The first file, in its original _lowerNormV version, looks like this:&lt;br /&gt;
&lt;br /&gt;
 idinf	X'inf		Vinf&lt;br /&gt;
 protId1	0.420637682685	20.8442070633&lt;br /&gt;
 protId2	0.0309318252993	10.405882717&lt;br /&gt;
 protId3	0.235807597358	13.6742537388&lt;br /&gt;
 protId4	-0.939886653794	10.3887308051&lt;br /&gt;
 protId5	0.136998763785	111.910628299&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
And the second file looks like:&lt;br /&gt;
&lt;br /&gt;
 idinf	X'inf		Vinf&lt;br /&gt;
 protId1	0.12726383076	9.37044627116&lt;br /&gt;
 protId2	0.274864264955	4.64481829573&lt;br /&gt;
 protId3	-0.261668207812	5.73886230845&lt;br /&gt;
 protId4	0.40863124663	4.57552314502&lt;br /&gt;
 protId5	0.36753759182	46.0017027453&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
In the output of Cardenio, data is rearranged in merge_newData.txt:&lt;br /&gt;
&lt;br /&gt;
 idinf			X'inf		Vinf&lt;br /&gt;
 protId1'''_114vs118'''	0.420637682685	20.8442070633&lt;br /&gt;
 protId2'''_114vs118'''	0.0309318252993	10.405882717&lt;br /&gt;
 protId3'''_114vs118'''	0.235807597358	13.6742537388&lt;br /&gt;
 protId4'''_114vs118'''	-0.939886653794	10.3887308051&lt;br /&gt;
 protId5'''_114vs118'''	0.136998763785	111.910628299&lt;br /&gt;
 ...&lt;br /&gt;
 protId1'''_115vs119'''	0.12726383076	9.37044627116&lt;br /&gt;
 protId2'''_115vs119'''	0.274864264955	4.64481829573&lt;br /&gt;
 protId3'''_115vs119'''	-0.261668207812	5.73886230845&lt;br /&gt;
 protId4'''_115vs119'''	0.40863124663	4.57552314502&lt;br /&gt;
 protId5'''_115vs119'''	0.36753759182	46.0017027453&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
Along with the corresponding relations file merge_newRels.txt:&lt;br /&gt;
&lt;br /&gt;
 idsup	idinf&lt;br /&gt;
 protId1	protId1'''_114vs118'''&lt;br /&gt;
 protId2	protId2'''_114vs118'''&lt;br /&gt;
 protId3	protId3'''_114vs118'''&lt;br /&gt;
 protId4	protId4'''_114vs118'''&lt;br /&gt;
 protId5	protId5'''_114vs118'''&lt;br /&gt;
 ...&lt;br /&gt;
 protId1	protId1'''_115vs119'''&lt;br /&gt;
 protId2	protId2'''_115vs119'''&lt;br /&gt;
 protId3	protId3'''_115vs119'''&lt;br /&gt;
 protId4	protId4'''_115vs119'''&lt;br /&gt;
 protId5	protId5'''_115vs119'''&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
As can be seen, in the relations file the lower level (second column) links each protein to its experiment (via the suffix, associated with the corresponding data in merge_newData.txt), while in the higher level (first column) proteins are independent of the experiment. This means that, after integrating, we will obtain only experiment-independent, protein-level information (the information of each protein in each experiment, averaged with its corresponding weight).&lt;br /&gt;
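The rearrangement performed by Cardenio can be sketched in a few lines of Python (with hypothetical data; this is an illustration, not Cardenio's actual code):&lt;br /&gt;

```python
# Each experiment contributes (protein, centred fold-change, weight) rows,
# as read from its _lowerNormV file; values here are hypothetical.
experiments = {
    "114vs118": [("protId1", 0.42, 20.8), ("protId2", 0.03, 10.4)],
    "115vs119": [("protId1", 0.13, 9.4), ("protId2", 0.27, 4.6)],
}

new_data, new_rels = [], []
for tag, rows in experiments.items():
    for prot, x, v in rows:
        suffixed = prot + "_" + tag          # e.g. protId1_114vs118
        new_data.append((suffixed, x, v))    # -> merge_newData.txt
        new_rels.append((prot, suffixed))    # -> merge_newRels.txt
```
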
&lt;br /&gt;
=== Last step for experiment integration ===&lt;br /&gt;
&lt;br /&gt;
The last lines perform the integration of the two experiments taking the data and relations files from Cardenio (merge_newData.txt and merge_newRels.txt), as seen in the previous section.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=37&amp;gt;&lt;br /&gt;
sanxot.exe -amerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -g&lt;br /&gt;
sanxotsieve.exe -aoutsMerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -Vmerged_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -amerged_noOuts -p%workingFolder% -dmerge_newData.txt -routsMerged_tagged.xls -fg -Vmerged_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqaMerged -p%workingFolder% -dmerged_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The approach here is the same as for the integration from spectrum to peptide or from peptide to protein (but in this case we integrate from &amp;quot;experiment-dependent protein data&amp;quot; into &amp;quot;experiment-independent protein data&amp;quot;):&lt;br /&gt;
:* the first line calculates the variance (storing it in the merged_infoFile.txt file)&lt;br /&gt;
:* then SanXoTSieve tags the outliers (generating the outsMerged_tagged.xls file with the same relations as merge_newRels.txt, but including the tag &amp;quot;out&amp;quot; in the third column),&lt;br /&gt;
:* the third line integrates the proteins using the relations file and the variance from the two previous commands;&lt;br /&gt;
:* finally, all the experiment-independent proteins are compared to each other, getting the variance (stored in merged_noOuts_infoFile.txt) and providing an FDR that can be used to filter changing proteins (in file merged_noOuts_outStats.xls).&lt;br /&gt;
&lt;br /&gt;
The last step will also be used in the Systems Biology Triangle method (SBT)&amp;lt;ref&amp;gt;García-Marqués, F., Trevisan-Herraz, M., Martínez-Martínez, S., Camafeita, E., Jorge, I., Lopez, J.A., Méndez-Barbero, N., Méndez-Ferrer, S., del Pozo, M.A., Ibáñez, B., Andrés, V., Sánchez-Madrid, F., Redondo, J.M., Bonzon-Kulichenko, E. and Vázquez, J. (2016) '''[http://www.mcponline.org/content/15/5/1740.long A novel systems-biology algorithm for the analysis of coordinated protein responses using quantitative proteomics]'''. ''Molecular &amp;amp; Cellular Proteomics''.&amp;lt;/ref&amp;gt; which we will see now (the &amp;quot;protein-to-all&amp;quot; operation is the &amp;quot;bottom side&amp;quot; of the triangle).&lt;br /&gt;
&lt;br /&gt;
== Test 3: Systems biology ==&lt;br /&gt;
&lt;br /&gt;
In this final test we perform a systems biology analysis of the previously merged experiment using the Systems Biology Triangle method. For that we start with the files generated in Test 2, plus a tab-separated table downloaded directly from the '''[https://en.wikipedia.org/wiki/DAVID DAVID bioinformatics resource]'''.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test3.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 3, SYSTEMS BIOLOGY&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test3&amp;quot;&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM conversion of ontological data from DAVID&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
camacho.exe -adbcats1 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -q1 -c5 -fBioCarta_&lt;br /&gt;
camacho.exe -adbcats2 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats1_rels.xls&amp;quot; -c6 -fGOBP_&lt;br /&gt;
camacho.exe -adbcats3 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats2_rels.xls&amp;quot; -c7 -fGOCC_&lt;br /&gt;
camacho.exe -adbcats4 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats3_rels.xls&amp;quot; -c8 -fGOMF_&lt;br /&gt;
camacho.exe -adbcats5 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats4_rels.xls&amp;quot; -c9 -fKEGG_&lt;br /&gt;
camacho.exe -adbcats -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats5_rels.xls&amp;quot; -c10 -fPIR_&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM prot to cat&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -g&lt;br /&gt;
sanxotsieve.exe -aoutsqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -Vqc_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -aqc_noOuts -p%workingFolder% -dmerged_noOuts_higherLevel.xls -routsqc_tagged.xls -fg -Vqc_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM cat to all&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aca -p%workingFolder% -dqc_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
sanxotsqueezer.exe -alist -p%workingFolder% -lqc_noOuts_outStats.xls -uca_outStats.xls -f0.05 -n3&lt;br /&gt;
sanson.exe -asanson -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -gpdf&lt;br /&gt;
sanxotgauss.exe -agauss -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -g&lt;br /&gt;
coordinometer.exe -acoordination -p%workingFolder% -qqc_noOuts_outStats.xls -cca_outStats.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_3:_Systems_biology|Test 3]]'''.&lt;br /&gt;
&lt;br /&gt;
=== Getting the protein-to-category relations file ===&lt;br /&gt;
&lt;br /&gt;
We start by generating the relations file we are going to use in the systems biology analysis. It is a simple relations file, where the first column contains a string with the identifier of the category, and the second column a string with the identifier of the protein (following the same philosophy as the tab-separated tables used to indicate the PSMs in the spectrum-to-peptide table, and the peptide-to-protein assignations).&lt;br /&gt;
&lt;br /&gt;
Note that, although we are using DAVID to obtain the ontological categories, we can provide these relations using any tool, or even create custom relations, as long as we use the format mentioned. We can also concatenate relations (such as our custom relations) to previous relations files.&lt;br /&gt;
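&lt;br /&gt;
As a minimal sketch of this format (our own illustration, not part of the SanXoT package; the category names and protein identifiers below are made-up examples), a custom relations file can be written like this:&lt;br /&gt;

```python
# Minimal sketch (not part of SanXoT): writing a custom protein-to-category
# relations file in the two-column, tab-separated format described above.
# Category names and protein identifiers are made-up examples.
custom_rels = [
    ("MyPrefix_heat_shock_response", "sp|P08107|HSP71_HUMAN"),
    ("MyPrefix_heat_shock_response", "sp|P11142|HSP7C_HUMAN"),
    ("MyPrefix_glycolysis",          "sp|P04406|G3P_HUMAN"),
]

with open("custom_rels.xls", "w") as out:
    for category, protein in custom_rels:
        out.write(category + "\t" + protein + "\n")
```

Concatenating such custom relations to a previously generated relations file is then just a matter of appending the same kind of lines to it.&lt;br /&gt;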
&lt;br /&gt;
As we said, a good way to get this information from databases such as Gene Ontology or KEGG is to use DAVID as an intermediary; in this case, we use the module '''[[Camacho]]''', dedicated to this operation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=17&amp;gt;&lt;br /&gt;
camacho.exe -adbcats1 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -q1 -c5 -fBioCarta_&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here we use the options -a and -p as usual in all SanXoT modules (to define a prefix for the output files of this operation, and to set the working directory, respectively). Then we give the filename of the tab-separated file downloaded from DAVID (using -d); we indicate the column number (one-based) containing the protein accession number (option -q; in this case -q1 indicates it is the first column) and the column containing the category we want to parse to the relations file (using -c; in this case we want to parse the data from BioCarta, which are in column 5); and we define a prefix for the category (as these are categories from BioCarta, we use -f to set this to &amp;quot;BioCarta_&amp;quot;, where the underscore is used as a separator to make it more human-readable). The prefix is not mandatory, but it is highly recommended when we import data from different databases, as some categories may have the same name but different protein sets; using a prefix not only avoids overlapping, but also helps track which database each category comes from.&lt;br /&gt;
&lt;br /&gt;
The file fastaHeadersNoSequences.fasta is not mandatory. In this case we include it because we want to use the full FASTA header as the protein identifier (for more human-readable results). Obviously, this file must contain the same identifiers as the ones used for DAVID. Typically, the FASTA file used to identify the proteins should be used, but in this case we have removed the sequences to avoid distributing an excessively large file with this test.&lt;br /&gt;
&lt;br /&gt;
As a result, we get the file dbcats1_rels.xls. Now we concatenate the relations from the databases for [http://www.geneontology.org Gene Ontology] (for BP = biological process, CC = cellular component, and MF = molecular function), [http://www.genome.jp/kegg KEGG] and [https://pir.georgetown.edu PIR]. For each of them we use Camacho, changing only the option -c (to set the column) and -f (for the different category prefix), and including the file to which they concatenate (using the option -x). Other users might prefer to simply overwrite the original file, but, in order to keep track of the different operations, we use a different file for each output.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=18&amp;gt;&lt;br /&gt;
camacho.exe -adbcats2 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats1_rels.xls&amp;quot; -c6 -fGOBP_&lt;br /&gt;
camacho.exe -adbcats3 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats2_rels.xls&amp;quot; -c7 -fGOCC_&lt;br /&gt;
camacho.exe -adbcats4 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats3_rels.xls&amp;quot; -c8 -fGOMF_&lt;br /&gt;
camacho.exe -adbcats5 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats4_rels.xls&amp;quot; -c9 -fKEGG_&lt;br /&gt;
camacho.exe -adbcats -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats5_rels.xls&amp;quot; -c10 -fPIR_&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We end up with the final file dbcats_rels.xls; this relations file can now be used for as many Systems Biology Triangle analyses as needed, without needing to use Camacho again (unless we want to add databases or an updated ontology from DAVID).&lt;br /&gt;
&lt;br /&gt;
=== From proteins to categories ===&lt;br /&gt;
&lt;br /&gt;
We start using the relations file just generated, dbcats_rels.xls, to integrate proteins into categories. Every protein has a different weight, depending on the quality of the scans, the number of peptides detected and quantified for the protein, and perhaps even the different experiments where it has been found. As a result, the information from each protein has a different weight in the category where it has been classified (in the same way as in the previous levels). Getting the category-level information (rather than a lengthy list of quantified proteins) will help us get the big picture of what is going on in the proteome. Additionally, using the Systems Biology Triangle, we integrate the information from all the quantified proteins, not just a list of proteins whose quantification exceeds a defined threshold (as in over-representation algorithms).&lt;br /&gt;
&lt;br /&gt;
We integrate in the usual manner: first getting the variance of the proteins, then tagging outliers (proteins that behave differently from other proteins classified in the same group), and finally performing the integration with the previously calculated variance, excluding the outlier proteins:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=28&amp;gt;&lt;br /&gt;
sanxot.exe -aqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -g&lt;br /&gt;
sanxotsieve.exe -aoutsqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -Vqc_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -aqc_noOuts -p%workingFolder% -dmerged_noOuts_higherLevel.xls -routsqc_tagged.xls -fg -Vqc_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This operation is the &amp;quot;top-left&amp;quot; side of the Systems Biology Triangle.&lt;br /&gt;
&lt;br /&gt;
The last integration, where categories are compared to each other, is the &amp;quot;top-right&amp;quot; side of the triangle:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=34&amp;gt;&lt;br /&gt;
sanxot.exe -aca -p%workingFolder% -dqc_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As we did in line 38 of Test 1 and line 41 of Test 2, this is a confluence, meaning that we do not need a relations file, because the whole set of categories is going to be considered as a single group.&lt;br /&gt;
&lt;br /&gt;
=== Filtering the most relevant categories using SanXoTSqueezer ===&lt;br /&gt;
&lt;br /&gt;
Now we have the information needed to get some details about the analysis. Sometimes we have too many categories, making the analysis difficult. We could build &amp;quot;supercategories&amp;quot; using a category-to-supercategory relations file, but in this case we will just filter the categories with '''[[SanXoTSqueezer]]''', keeping those with a minimum of 3 quantified proteins (using the option -n) and a maximum FDR of 5% (using the option -f). We also need the two outStats files, for the protein-to-category (&amp;quot;qc&amp;quot;, using option -l) and the category-to-all (&amp;quot;ca&amp;quot;, using option -u) operations, taken from previous steps (qc_noOuts_outStats.xls and ca_outStats.xls, respectively):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=37&amp;gt;&lt;br /&gt;
sanxotsqueezer.exe -alist -p%workingFolder% -lqc_noOuts_outStats.xls -uca_outStats.xls -f0.05 -n3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output is a list of categories, with associated information, in the file list_outList.xls; we will analyse the categories in this list.&lt;br /&gt;
&lt;br /&gt;
=== Clustering together similar categories using Sanson ===&lt;br /&gt;
&lt;br /&gt;
The point of integrating proteins into categories is managing less, but more meaningful, information. However, databases can contain so many categories that we may still end up with too many to interpret the results biologically.&lt;br /&gt;
&lt;br /&gt;
In some cases we have duplicated categories that contain the same proteins under different names, especially when we combine different databases. This is easy to detect, but most commonly categories have ''similar'' but not ''identical'' sets of proteins, leading to a more subtle problem. The module '''[[Sanson]]''' helps detect similar categories, generating clusters of similar sets. A &amp;quot;similarity&amp;quot; threshold can be set (where the similarity is given by an FNumber, which depends on how many proteins are shared by two given categories), but the algorithm uses a matrix of protein-to-category items (the similarity matrix), and maximises both the number of categories in each cluster (the CNumber) and the number of clusters, partitioning them with a [https://en.wikipedia.org/wiki/Durfee_square Durfee square]-based calculation:&lt;br /&gt;
&lt;br /&gt;
 With this similarity-threshold, are there at least 3 clusters with at least 3 categories? --&amp;gt; true&lt;br /&gt;
 With this similarity-threshold, are there at least 4 clusters with at least 4 categories? --&amp;gt; true&lt;br /&gt;
 With this similarity-threshold, are there at least 5 clusters with at least 5 categories? --&amp;gt; false&lt;br /&gt;
 With this similarity-threshold, are there at least 6 clusters with at least 6 categories? --&amp;gt; false&lt;br /&gt;
&lt;br /&gt;
In this example, it could happen that there are (for example) 5 clusters with 4 categories, or 3 clusters with 6 categories, but not 5 clusters with 5 categories, so the best CNumber in this case would be 4 (note that this is the same logic used to calculate the h-index metric).&lt;br /&gt;
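&lt;br /&gt;
This selection logic can be sketched as follows (our own illustration, not Sanson's actual code): given the sizes of the clusters found at a given similarity threshold, the CNumber is the largest n such that at least n clusters contain at least n categories each, exactly as in the h-index:&lt;br /&gt;

```python
def cnumber(cluster_sizes):
    """Largest n such that at least n clusters have at least n categories
    each (h-index-style logic; an illustration, not Sanson's actual code)."""
    sizes = sorted(cluster_sizes, reverse=True)
    n = 0
    while n < len(sizes) and sizes[n] >= n + 1:
        n += 1
    return n

# The example from the text: 3 clusters with 6 categories plus 2 clusters
# with 4 categories give a CNumber of 4 (no 5 clusters with 5 categories).
print(cnumber([6, 6, 6, 4, 4]))  # 4
```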
&lt;br /&gt;
In this test, Sanson is executed with this command line:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=38&amp;gt;&lt;br /&gt;
sanson.exe -asanson -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -gpdf&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here we already know the relations file dbcats_rels.xls (option -r); the file list_outList.xls (option -c), generated in the previous step, provides the categories that will be considered; and the outStats file qaMerged_outStats.xls (option -z), generated in the last command of Test 2, provides the protein-level, category-independent quantifications, which supply the protein information for each category.&lt;br /&gt;
&lt;br /&gt;
The output of Sanson is a .gv file (here sanson_simGraph.gv), where the category clusters are given in [http://en.wikipedia.org/wiki/DOT_language DOT language]; it can be interpreted by open-source programs like [http://graphviz.org/ Graphviz] to generate an image file. If Graphviz is installed, Sanson can launch the interpretation itself (you only need to specify the location of Graphviz's dot.exe in the dot.ini file). In this test, we have generated as output the pdf file sanson_simGraph.pdf (using the -g option), but it can also be interesting to generate an [https://en.wikipedia.org/wiki/Scalable_Vector_Graphics svg] file, which (when opened in a suitable browser) shows the list of protein identifiers present in each category just by hovering the mouse over the category box.&lt;br /&gt;
&lt;br /&gt;
As you will see, each category box has a number of customisable colour columns. Each column represents a protein present in the category, and its colour depends on the standardised log-fold change of that protein (Z, the standardised version of X = logfold, meaning that X has been centred and corrected according to the variance). By default, white means no change, green means there is more quantity in sample B than in sample A (as log(A/B) &amp;lt; 0), and, conversely, red means there is more quantity in sample A than in sample B. The arrows carry numbers that indicate the total number of proteins in a category that are also present in the category the arrow points to. The absence of an arrow does not mean two categories share no proteins; it just means that the ratio of shared proteins does not reach the calculated threshold specified above.&lt;br /&gt;
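&lt;br /&gt;
As a rough illustration of the standardisation mentioned above (a generic sketch; the exact variance weighting used by SanXoT is that of the WSPP model, described in the referenced papers), centring X and scaling it by its standard deviation yields a value comparable to a standard normal distribution:&lt;br /&gt;

```python
import math

def standardise(x, centre, variance):
    # Generic standardisation sketch: centre the log2 fold change X and
    # scale it by the standard deviation, so Z is distributed roughly as
    # N(0, 1) for unchanged proteins. The variance actually used by SanXoT
    # comes from the WSPP statistical model (see the references).
    return (x - centre) / math.sqrt(variance)

print(standardise(1.5, 0.5, 0.25))  # 2.0
```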
&lt;br /&gt;
=== Generating gaussians using SanXoTGauss ===&lt;br /&gt;
[[File:Gauss outGraph.png|thumb|300px|left|Output of SanXoTGauss in Test 3, line 39, showing the theoretical sigmoid for the standard normal distribution (red), the experimental sigmoid for all the proteins in the experiment (blue), and the sigmoids of proteins grouped by ontology (selecting with SanXoTSqueezer only categories that change expression with FDR &amp;lt;= 5% and contain at least 3 proteins).]]&lt;br /&gt;
Representing sigmoids (here, cumulative gaussians) of the different categories, and comparing them to a theoretical normal distribution (thick red line) and to the bulk of proteins (blue line), can be helpful to gain better insight into the meaning of a change. A coordinated proteome is one in which proteins belonging to the same category change in a similar, concerted way. We can observe this behaviour by representing the proteins classified in the same category as a separate sigmoid. Since we usually represent several categories at once, we draw each with a different colour to identify them more easily. Categories with a coordinated change will be &amp;quot;parallel&amp;quot; to the normal distribution, but displaced (according to the category-level fold change), while categories behaving in an uncoordinated way will tend to cross the normal distribution (as they contain proteins with both increasing and decreasing expression changes).&lt;br /&gt;
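&lt;br /&gt;
These sigmoids are simply cumulative distributions; a minimal sketch of how one can be computed and compared to the theoretical curve (our own illustration, not SanXoTGauss's actual code):&lt;br /&gt;

```python
import math

def normal_cdf(z):
    # Theoretical sigmoid: cumulative distribution of the standard normal.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def empirical_sigmoid(z_values):
    # Experimental sigmoid: each sorted Z value paired with its cumulative
    # rank. A coordinated category gives points parallel to normal_cdf but
    # displaced; an uncoordinated one tends to cross it.
    zs = sorted(z_values)
    n = len(zs)
    return [(z, (i + 1) / n) for i, z in enumerate(zs)]

points = empirical_sigmoid([-1.2, 0.1, 0.4, 1.8])
print(points[-1])  # (1.8, 1.0)
```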
&lt;br /&gt;
This operation is done by '''[[SanXoTGauss]]''', using a command like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=39&amp;gt;&lt;br /&gt;
sanxotgauss.exe -agauss -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -g&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The options are the same as in the previous Sanson command, except for -g (which here tells the program not to display the graph, only to save it, which is important to keep the workflow running unattended).&lt;br /&gt;
&lt;br /&gt;
=== Calculating the degree of coordination using the Coordinometer ===&lt;br /&gt;
&lt;br /&gt;
We have defined a parameter called the '''''degree of coordination''''', which tells us objectively how coordinated a proteome is. This parameter takes into account the number of outlier proteins (proteins that are expressed differently compared to proteins classified within the same category), and how much the categories are changing. The '''[[Coordinometer]]''' calculates this number using this command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=40&amp;gt;&lt;br /&gt;
coordinometer.exe -acoordination -p%workingFolder% -qqc_noOuts_outStats.xls -cca_outStats.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Coordinometer extracts the number of relations tagged as outliers (using the &amp;quot;out&amp;quot; tag in the third column) from the file qc_noOuts_outStats.xls (option -q; these are proteins expressed differently compared to other proteins in the same categories, a calculation from the &amp;quot;top-left&amp;quot; side of the triangle), and gets the number of changing categories from the file ca_outStats.xls (option -c, reading the FDR column directly). The FDR threshold for the latter is 5% by default, but a different threshold can be set using the --cafdr option.&lt;br /&gt;
&lt;br /&gt;
The result is both printed in the console and written to the corresponding log file, coordination_logFile.txt, along with the information used to calculate it:&lt;br /&gt;
&lt;br /&gt;
 Total number of changing categories: 66&lt;br /&gt;
 Total number of relations pointing to changing categories: 106&lt;br /&gt;
 Total number of outlier relations pointing to changing categories: 2&lt;br /&gt;
 Total number of outlier relations pointing to non-changing categories: 46&lt;br /&gt;
 Degree of coordination: 0.684211&lt;br /&gt;
&lt;br /&gt;
Where 0.684211 is the result of:&lt;br /&gt;
 (numRelsChangingCats - numOutliersChangingCats) / (numRelsChangingCats + numOutliersNonChangingCats) =&lt;br /&gt;
 (106 - 2) / (106 + 46) = 0.684211&lt;br /&gt;
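&lt;br /&gt;
The calculation can be reproduced directly from the numbers in the log file (a trivial sketch of the formula above):&lt;br /&gt;

```python
def degree_of_coordination(rels_changing, outliers_changing, outliers_non_changing):
    # (numRelsChangingCats - numOutliersChangingCats) /
    # (numRelsChangingCats + numOutliersNonChangingCats)
    return (rels_changing - outliers_changing) / (rels_changing + outliers_non_changing)

# Numbers from the log file shown above:
print(round(degree_of_coordination(106, 2, 46), 6))  # 0.684211
```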
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:SanXoT software package]]&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=Exploring_SanXoT_features&amp;diff=632</id>
		<title>Exploring SanXoT features</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=Exploring_SanXoT_features&amp;diff=632"/>
				<updated>2018-08-17T10:03:02Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Here we will look at the details of three common workflows that use the SanXoT software package.&lt;br /&gt;
&lt;br /&gt;
For tests check '''[[Unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
For a general overview on how to use SanXoT and how it works, see '''[[How to use SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== Test 1: The fundamental workflow ==&lt;br /&gt;
&lt;br /&gt;
We describe here the '''[[fundamental workflow]]''', a workflow similar to the statistics done by '''[[QuiXoT]]''' using the WSPP (''Weighted Spectrum, Peptide and Protein'') statistical model&amp;lt;ref&amp;gt;Navarro, P., Trevisan-Herraz, M., Bonzon-Kulichenko, E., Núñez, E., Martínez-Acedo, P., Pérez-Hernández, D., Jorge, I., Mesa, R., Calvo, E., Carrascal, M., Hernáez, M.L., García, F., Bárcena, J.A., Ashman, K., Abian, J., Gil, C., Redondo, J.M. and Vázquez, J. (2014) '''[https://pubs.acs.org/doi/abs/10.1021/pr4006958 General statistical framework for quantitative proteomics by stable isotope labeling]'''. ''Journal of proteome research'', 13, 1234-1247.&amp;lt;/ref&amp;gt;. The following commands are also available as a [https://en.wikipedia.org/wiki/Batch_file batch file] in '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test1.zip SanXoT_test1.zip]''':&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test1.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 1, FUNDAMENTAL WORKFLOW (SPECTRUM, PEPTIDE, PROTEIN)&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
set startingFile=170415_Marga_GBS_iTRAQ_PSMs.txt&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM create the relation files&lt;br /&gt;
aljamia.exe -asprels -p%workingFolder% -x%startingFile% -i&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -j&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot;&lt;br /&gt;
aljamia.exe -apqrels -p%workingFolder% -x%startingFile% -i&amp;quot;'[Master Protein Accessions]'.split(';')[0]&amp;quot; -j&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;i&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM create the data file at the scan level&lt;br /&gt;
aljamia.exe -asdata114vs118 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([114]/[118], 2)&amp;quot; -k&amp;quot;max([114], [118])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
aljamia.exe -asdata115vs119 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([115]/[119], 2)&amp;quot; -k&amp;quot;max([115], [119])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM The following tab-separated text files (despite their extensions) are generated:&lt;br /&gt;
REM sprels_table.tsv, with two columns, having scan-to-peptide identifiers (rawfile-scanNumber-charge for scans, sequence for peptides)&lt;br /&gt;
REM pqrels_table.tsv, with two columns, having peptide-to-protein identifiers (text of their fasta header as protein identifier)&lt;br /&gt;
REM sdata_table.tsv, with three columns having {scan identifier, log2(foldchange), weight}&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart114vs118 -p%workingFolder% -dsdata114vs118_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxot.exe -asp114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -Vsp114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp114vs118_noOuts -p%workingFolder% -dstart114vs118_calibrated.tsv -rspouts114vs118_tagged.tsv -Vsp114vs118_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein&lt;br /&gt;
sanxot.exe -apq114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq114vs118_noOuts -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqouts114vs118_tagged.tsv -fg -Vpq114vs118_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa114vs118 -p%workingFolder% -dpq114vs118_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_1:_The_fundamental_workflow|Test 1]]'''.&lt;br /&gt;
&lt;br /&gt;
=== Modifications needed ===&lt;br /&gt;
&lt;br /&gt;
The only lines that need to be modified by the user are these (to match the user's file locations):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=6&amp;gt;&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
set startingFile=170415_Marga_GBS_iTRAQ_PSMs.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the first operations, we will parse the data from the file defined in ''startingFile'', using '''[[Aljamia]]''' to generate the relations files. In this case, the input file is a tab-separated text file quantified using [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer 2.1], but any other tab-separated text file which arranges the information in columns with headers can be used as input (for example, [https://en.wikipedia.org/wiki/SQLite SQLite] query results from Proteome Discoverer's MSF files, or results from [https://en.wikipedia.org/wiki/Mascot_(software) Mascot]).&lt;br /&gt;
&lt;br /&gt;
=== Parsing the data ===&lt;br /&gt;
&lt;br /&gt;
Let us check the first '''[[Aljamia]]''' operation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=15&amp;gt;&lt;br /&gt;
aljamia.exe -asprels -p%workingFolder% -x%startingFile% -i&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -j&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a common workflow, the SanXoT software package can generate many output files. To make them easier to track, by default all the files generated in a single step share the same prefix, while each type of file carries a common suffix. The prefix is set by the ''-a'' argument (it is good practice to ''always'' include this argument). For example, in this case two files are generated: ''sprels_table.tsv'' (the relations table matching '''s'''cans and '''p'''eptides) and ''sprels_log.txt'' (a log file). Unless overridden (check '''[[Aljamia]]''''s help for more information), the filename of the generated relations table will always end in '''_table.tsv'''.&lt;br /&gt;
&lt;br /&gt;
The option -p sets the working folder: all output and input files are assumed to be in that folder, unless a full path is specified. The generated table will include only two columns: the first (-i) containing a non-ambiguous identifier of the higher level, the peptides (in this case, concatenating the columns &amp;quot;Annotated Sequence&amp;quot; and &amp;quot;Modifications&amp;quot;), and the second (-j) merging the data from the &amp;quot;Spectrum File&amp;quot;, &amp;quot;First Scan&amp;quot; and &amp;quot;Charge&amp;quot; columns (an example of a unique identifier for scans, the lower level). Note that the column headers are case-sensitive.&lt;br /&gt;
&lt;br /&gt;
The operation to generate the file for the relations between proteins and peptides is similar:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=16&amp;gt;&lt;br /&gt;
aljamia.exe -apqrels -p%workingFolder% -x%startingFile% -i&amp;quot;'[Master Protein Accessions]'.split(';')[0]&amp;quot; -j&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;i&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notice that the prefix here is ''pqrels'' (some researchers use '''q''' for proteins, to differentiate them from '''p'''eptides); now the peptide identifiers (&amp;quot;Annotated Sequence&amp;quot; + &amp;quot;Modifications&amp;quot;) are in the second column (because now they are the lower level), while the proteins are in the first column (as they are now the higher level). We use here UniProt's accession numbers as the protein identifier. Proteome Discoverer lists all the proteins where a peptide is found, but in this case we use only the first one. To do so, we use Python-style commands to split the string in the field &amp;quot;Master Protein Accessions&amp;quot; (using &amp;quot;;&amp;quot;), and then take just the first element (hence the &amp;quot;[0]&amp;quot;). To tell '''[[Aljamia]]''' that this is an operation, and not just text to include, we add the argument -A&amp;quot;i&amp;quot; (to indicate that only column &amp;quot;i&amp;quot; contains operations). We have also included a filter to keep only the identifications that, according to Proteome Discoverer, have either &amp;quot;Medium&amp;quot; or &amp;quot;High&amp;quot; confidence (thereby excluding bad identifications).&lt;br /&gt;
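&lt;br /&gt;
To see what the -i expression does, this is its effect on one value of the ''Master Protein Accessions'' field (the accession list below is a made-up example):&lt;br /&gt;

```python
# Effect of the Aljamia -i expression '[Master Protein Accessions]'.split(';')[0]
# on one (made-up) field value: keep only the first accession listed.
master_protein_accessions = "P68871; P69905; Q9NZD4"
first_accession = master_protein_accessions.split(';')[0]
print(first_accession)  # P68871
```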
&lt;br /&gt;
The next two lines parse the information from the same input file, ''170415_Marga_GBS_iTRAQ_PSMs.txt'', to generate two data files, one for the 114 vs 118 iTRAQ comparative, and the other for 115 vs 119 (as you can observe in the -a option for the prefix):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=19&amp;gt;&lt;br /&gt;
aljamia.exe -asdata114vs118 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([114]/[118], 2)&amp;quot; -k&amp;quot;max([114], [118])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
aljamia.exe -asdata115vs119 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([115]/[119], 2)&amp;quot; -k&amp;quot;max([115], [119])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Data files in SanXoT include a first column (-i) with a unique text identifier (in this case, the scan identifier we have mentioned), a second column (-j) with the log2(A/B) ratio (written using Python-style expressions), and a third column (-k) with the weight of the scan (for iTRAQ we simply select the intensity of the most intense peak of the pair). As we did before to split the column with the protein accession number, we use -A&amp;quot;jk&amp;quot; to indicate that there are operations to be performed in the second and third columns (but not the first, which is treated as text).&lt;br /&gt;
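&lt;br /&gt;
For one scan, the -j and -k expressions evaluate as follows (the reporter intensities are made-up example values):&lt;br /&gt;

```python
import math

# Made-up iTRAQ reporter intensities for one scan:
i114, i118 = 20000.0, 5000.0

log2_ratio = math.log(i114 / i118, 2)  # the -j expression: math.log([114]/[118], 2)
weight = max(i114, i118)               # the -k expression: max([114], [118])

print(log2_ratio, weight)  # 2.0 20000.0
```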
&lt;br /&gt;
=== Calibrating the spectra ===&lt;br /&gt;
&lt;br /&gt;
The first step to integrate the data is calibrating the weights assigned to each spectrum (see '''[[How to use SanXoT#Calibrating the data|How to use SanXoT]]''' for more information on why this is important). This task is performed by the module '''[[Klibrate]]''':&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=27&amp;gt;&lt;br /&gt;
klibrate.exe -astart114vs118 -p%workingFolder% -dsdata114vs118_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As in all SanXoT programs, there is a prefix assigned to the operation (-a) and the default working folder is specified (-p). Additionally, -d and -r set the file names of the data and relations files generated by Aljamia (sdata114vs118_table.tsv and sprels_table.tsv, respectively). The -g option tells the program not to stop to display the sigmoid graph comparing the data to the standard normal distribution.&lt;br /&gt;
&lt;br /&gt;
=== Integration from spectrum to peptide ===&lt;br /&gt;
&lt;br /&gt;
We are ready for the first integration (using the GIA, or ''Generic Integration Algorithm''), which is done by '''[[SanXoT]]''' with the help of '''[[SanXoTSieve]]'''.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=28&amp;gt;&lt;br /&gt;
sanxot.exe -asp114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -Vsp114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp114vs118_noOuts -p%workingFolder% -dstart114vs118_calibrated.tsv -rspouts114vs118_tagged.tsv -Vsp114vs118_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(We have already described the meaning of the -a, -p, -d, -r and -g options.)&lt;br /&gt;
&lt;br /&gt;
In the first of these three commands, SanXoT integrates the lower-level data (in this case, spectra) into a higher level (here, peptides). When doing so, SanXoT calculates the variance of the lower level with respect to the higher one, i.e., how the fold-changes of the different spectra are distributed when assigned to their peptides. The result is shown both in the console and in the [prefix]_infoFile.txt file (sp114vs118_infoFile.txt).&lt;br /&gt;
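In essence, the integrated higher-level value is a weighted average of the lower-level values; the following Python sketch (with hypothetical numbers) shows the core idea, leaving aside the iterative variance estimation that SanXoT actually performs:&lt;br /&gt;

```python
# Hypothetical spectra assigned to one peptide: (log2 ratio, weight).
spectra = [(0.40, 20.0), (0.55, 10.0), (0.35, 30.0)]

# Peptide-level fold-change: weighted arithmetic mean of its spectra.
total_weight = sum(w for _, w in spectra)
x_peptide = sum(x * w for x, w in spectra) / total_weight
```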
&lt;br /&gt;
Once the variance is known, SanXoTSieve tags the outlier spectra for each peptide (if any) by adding the tag &amp;quot;out&amp;quot; in a third column of the relations file, saving the result as a new relations file (to avoid overwriting the original one). To obtain the variance (essential to define which spectra are outliers), SanXoTSieve reads the sp114vs118_infoFile.txt given with the option -V, taking the value from the line &amp;quot;Variance = ...&amp;quot;; alternatively, the variance can be set directly with the option -v (lowercase), for example as -v0.02, if we want to use a specific, known variance.&lt;br /&gt;
&lt;br /&gt;
For SanXoTSieve, -f defines the FDR below which a spectrum is considered an outlier. This means that spectra tagged with FDR &amp;lt; 0.01 deviate, with statistical significance at that level, from the other spectra in which the same peptide has been identified (i.e., they are outliers).&lt;br /&gt;
&lt;br /&gt;
Once SanXoTSieve has produced the relations file with tagged outliers, SanXoT finally integrates the data; but in this case, instead of calculating the variance, it uses the variance calculated when the outliers were still included (option -V to provide the info file, or -v to provide the value directly). We say in this case that ''the variance is forced'', which is indicated using the -f parameter (note that -f has a different meaning here, compared to SanXoTSieve).&lt;br /&gt;
&lt;br /&gt;
The option --tags=&amp;quot;!out&amp;quot; means that relations in the relations file being tagged as &amp;quot;out&amp;quot; (by SanXoTSieve) will be excluded (they will have the same effect as if they had been cut). More tags can be combined (see the specific help for more information).&lt;br /&gt;
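The effect of --tags=&amp;quot;!out&amp;quot; can be sketched as follows (hypothetical relations; the column layout of the tagged file is simplified here):&lt;br /&gt;

```python
# Hypothetical rows of a tagged relations file: (idsup, idinf, tag).
tagged_rels = [
    ("PEPTIDE1", "scanA", ""),
    ("PEPTIDE1", "scanB", "out"),  # tagged as outlier by SanXoTSieve
    ("PEPTIDE2", "scanC", ""),
]

# --tags="!out": relations carrying the "out" tag are excluded, as if
# they had been cut from the file.
kept = [(sup, inf) for sup, inf, tag in tagged_rels if tag != "out"]
```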
&lt;br /&gt;
=== Integration from peptide to protein ===&lt;br /&gt;
&lt;br /&gt;
The integration of peptides into proteins is very similar to the previous one:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=33&amp;gt;&lt;br /&gt;
sanxot.exe -apq114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq114vs118_noOuts -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqouts114vs118_tagged.tsv -fg -Vpq114vs118_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The only differences are: 1) instead of taking as input the output file of Klibrate (which contained the calibrated spectrum information), it takes the output of the previous integration (sp114vs118_noOuts_higherLevel.tsv); and 2) the relations file (pqrels_table.tsv) contains the protein identifiers in the first column (in our case we use the accession number, although any other unique text string would do the job; in other cases we can use the whole FASTA header, which makes proteins more human-readable) and the peptide identifiers (the amino acid sequence with the post-translational modifications) in the second.&lt;br /&gt;
&lt;br /&gt;
The result will be a data file containing protein identifiers, fold-change at protein level and weights of proteins, with no more reference to the peptides or spectra.&lt;br /&gt;
&lt;br /&gt;
=== Getting the quantification of proteins ===&lt;br /&gt;
[[File:Qa114vs118 outGraph.png|thumb|300px|Output file of SanXoT (''qa114vs118_outGraph.png'') in Test 1, line 38, showing the sigmoid with protein expression changes (blue) compared to the theoretical sigmoid of the standard normal distribution.]]&lt;br /&gt;
&lt;br /&gt;
In the last step we compare each protein to the set of proteins as a whole. As you can see, the output data file from the previous integration is used as input (pq114vs118_noOuts_higherLevel.tsv); here there is no relations file, as in this case we are not integrating proteins into different groups (we will see this in Test 2 for merging experiments, and Test 3 for systems biology).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=38&amp;gt;&lt;br /&gt;
sanxot.exe -aqa114vs118 -p%workingFolder% -dpq114vs118_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of providing a relations file, we include the option -C (for ''confluence''), which has the same effect as a relations file with a dummy first column containing a &amp;quot;1&amp;quot; in every row, and a second column with each protein identifier.&lt;br /&gt;
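A sketch of the relations file that -C implicitly builds (hypothetical protein identifiers):&lt;br /&gt;

```python
# Protein identifiers coming from the previous integration (hypothetical).
proteins = ["P12345", "Q67890", "O11111"]

# -C behaves like a relations file whose higher level is a single dummy
# node, "1", containing every protein.
confluence_rels = [("1", prot) for prot in proteins]
```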
&lt;br /&gt;
The output will include a higher-level file (qa114vs118_higherLevel.tsv, where '''q''' means &amp;quot;protein&amp;quot; and '''a''' means &amp;quot;all&amp;quot;) with only one row, where the fold-change is the weighted arithmetic mean of all proteins (i.e. the grand mean of the whole set). You can see the protein changes in the file qa114vs118_outStats.tsv, where the FDR of changing proteins is presented alongside their fold-change (Xinf), normalised fold-change (Z) and the grand mean (Xsup).&lt;br /&gt;
&lt;br /&gt;
== Test 2: Experiment merging ==&lt;br /&gt;
&lt;br /&gt;
Here we perform a very simple experiment merging, in which we merge two experiments with comparable information (i.e. if the first experiment was &amp;quot;treated vs control&amp;quot;, the second experiment can be a biological replicate, such as another &amp;quot;treated vs control&amp;quot; for a different treated patient and a different control).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test2.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 2, EXPERIMENT MERGING&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test2&amp;quot;&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM now, same for 115vs119&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart115vs119 -p%workingFolder% -dsdata115vs119_table.xls -rsprels_table.xls -g&lt;br /&gt;
sanxot.exe -asp115vs119 -p%workingFolder% -dstart115vs119_calibrated.xls -rsprels_table.xls -g&lt;br /&gt;
sanxotsieve.exe -aspouts115vs119 -p%workingFolder% -dstart115vs119_calibrated.xls -rsprels_table.xls -Vsp115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp115vs119_noOuts -p%workingFolder% -dstart115vs119_calibrated.xls -rspouts115vs119_tagged.xls -Vsp115vs119_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein for 115vs119&lt;br /&gt;
sanxot.exe -apq115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqrels_table.xls -g&lt;br /&gt;
sanxotsieve.exe -apqouts115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqrels_table.xls -Vpq115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq115vs119_noOuts -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqouts115vs119_tagged.xls -fg -Vpq115vs119_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes for 115vs119&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa115vs119 -p%workingFolder% -dpq115vs119_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM experiment merging: 115vs119 + 114vs118&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
cardenio.exe -amerge -p%workingFolder% -ttagFile.txt&lt;br /&gt;
sanxot.exe -amerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -g&lt;br /&gt;
sanxotsieve.exe -aoutsMerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -Vmerged_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -amerged_noOuts -p%workingFolder% -dmerge_newData.txt -routsMerged_tagged.xls -fg -Vmerged_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqaMerged -p%workingFolder% -dmerged_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_2:_Experiment_merging|Test 2]]'''.&lt;br /&gt;
&lt;br /&gt;
=== A parallel analysis ===&lt;br /&gt;
&lt;br /&gt;
In this example we merge just two experiments for simplicity, but it is possible to merge even hundreds of them at once.&lt;br /&gt;
&lt;br /&gt;
Command lines 6-8 have already been discussed; they must be adapted to the paths on the user's computer. Lines 14-29 do the same as discussed in the previous section, but for the 115 vs 119 comparison:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=14&amp;gt;&lt;br /&gt;
REM now, same for 115vs119&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart115vs119 -p%workingFolder% -dsdata115vs119_table.xls -rsprels_table.xls -g&lt;br /&gt;
sanxot.exe -asp115vs119 -p%workingFolder% -dstart115vs119_calibrated.xls -rsprels_table.xls -g&lt;br /&gt;
sanxotsieve.exe -aspouts115vs119 -p%workingFolder% -dstart115vs119_calibrated.xls -rsprels_table.xls -Vsp115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp115vs119_noOuts -p%workingFolder% -dstart115vs119_calibrated.xls -rspouts115vs119_tagged.xls -Vsp115vs119_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein for 115vs119&lt;br /&gt;
sanxot.exe -apq115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqrels_table.xls -g&lt;br /&gt;
sanxotsieve.exe -apqouts115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqrels_table.xls -Vpq115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq115vs119_noOuts -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqouts115vs119_tagged.xls -fg -Vpq115vs119_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes for 115vs119&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa115vs119 -p%workingFolder% -dpq115vs119_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Generating the files for the experiment integration ===&lt;br /&gt;
&lt;br /&gt;
The module for merging experiments, '''[[Cardenio]]''', actually does a very simple thing: it just takes protein identifiers, concatenates a suffix for each experiment (specified in an external file called tagFile) and generates both the data file and relations files, ready for SanXoT to integrate in the usual way:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=36&amp;gt;&lt;br /&gt;
cardenio.exe -amerge -p%workingFolder% -ttagFile.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We have already discussed the -a and -p options across the SanXoT software package. Cardenio also needs -t, which points to a tab-separated file with this structure:&lt;br /&gt;
&lt;br /&gt;
 tag		file&lt;br /&gt;
 114vs118	qa114vs118_lowerNormV.xls&lt;br /&gt;
 115vs119	qa115vs119_lowerNormV.xls&lt;br /&gt;
&lt;br /&gt;
The column &amp;quot;tag&amp;quot; includes the suffix that will be appended to each protein (or other) identifier, while the column &amp;quot;file&amp;quot; contains the filenames of the data files. These files are similar to, but distinct from, the _higherLevel files: they are their centred versions. For example, the one with prefix qa114vs118 was generated in the last step of the 114vs118 integration, and the suffix _lowerNormV tells us it is a data file where the fold-changes have been centred by subtracting the grand mean, X'inf, while the weights have been kept unchanged; files with the suffix _lowerNorm'''W''' have identical first and second columns, but the third column, for the weights, has been modified to include the general variance of the protein level, as described in the WSPP statistical model.&lt;br /&gt;
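The centring performed to obtain a _lowerNormV file can be sketched like this (hypothetical fold-changes, weights and grand mean):&lt;br /&gt;

```python
# Hypothetical protein-level output: identifier -> (Xinf, Vinf).
higher_level = {"protId1": (0.42, 20.8), "protId2": (0.03, 10.4)}

# Grand mean obtained in the protein-to-all step (hypothetical value).
grand_mean = 0.10

# _lowerNormV: fold-changes centred by subtracting the grand mean;
# the weights are kept unchanged.
lower_norm_v = {pid: (x - grand_mean, v)
                for pid, (x, v) in higher_level.items()}
```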
&lt;br /&gt;
These identifiers are rearranged in the relations and data files as in the following example.&lt;br /&gt;
&lt;br /&gt;
The first file, in its original _lowerNormV form, looks like:&lt;br /&gt;
&lt;br /&gt;
 idinf	X'inf		Vinf&lt;br /&gt;
 protId1	0.420637682685	20.8442070633&lt;br /&gt;
 protId2	0.0309318252993	10.405882717&lt;br /&gt;
 protId3	0.235807597358	13.6742537388&lt;br /&gt;
 protId4	-0.939886653794	10.3887308051&lt;br /&gt;
 protId5	0.136998763785	111.910628299&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
And the second file looks like:&lt;br /&gt;
&lt;br /&gt;
 idinf	X'inf		Vinf&lt;br /&gt;
 protId1	0.12726383076	9.37044627116&lt;br /&gt;
 protId2	0.274864264955	4.64481829573&lt;br /&gt;
 protId3	-0.261668207812	5.73886230845&lt;br /&gt;
 protId4	0.40863124663	4.57552314502&lt;br /&gt;
 protId5	0.36753759182	46.0017027453&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
In the output of Cardenio, data is rearranged in merge_newData.txt:&lt;br /&gt;
&lt;br /&gt;
 idinf			X'inf		Vinf&lt;br /&gt;
 protId1'''_114vs118'''	0.420637682685	20.8442070633&lt;br /&gt;
 protId2'''_114vs118'''	0.0309318252993	10.405882717&lt;br /&gt;
 protId3'''_114vs118'''	0.235807597358	13.6742537388&lt;br /&gt;
 protId4'''_114vs118'''	-0.939886653794	10.3887308051&lt;br /&gt;
 protId5'''_114vs118'''	0.136998763785	111.910628299&lt;br /&gt;
 ...&lt;br /&gt;
 protId1'''_115vs119'''	0.12726383076	9.37044627116&lt;br /&gt;
 protId2'''_115vs119'''	0.274864264955	4.64481829573&lt;br /&gt;
 protId3'''_115vs119'''	-0.261668207812	5.73886230845&lt;br /&gt;
 protId4'''_115vs119'''	0.40863124663	4.57552314502&lt;br /&gt;
 protId5'''_115vs119'''	0.36753759182	46.0017027453&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
Along with the corresponding relations file merge_newRels.txt:&lt;br /&gt;
&lt;br /&gt;
 idsup	idinf&lt;br /&gt;
 protId1	protId1'''_114vs118'''&lt;br /&gt;
 protId2	protId2'''_114vs118'''&lt;br /&gt;
 protId3	protId3'''_114vs118'''&lt;br /&gt;
 protId4	protId4'''_114vs118'''&lt;br /&gt;
 protId5	protId5'''_114vs118'''&lt;br /&gt;
 ...&lt;br /&gt;
 protId1	protId1'''_115vs119'''&lt;br /&gt;
 protId2	protId2'''_115vs119'''&lt;br /&gt;
 protId3	protId3'''_115vs119'''&lt;br /&gt;
 protId4	protId4'''_115vs119'''&lt;br /&gt;
 protId5	protId5'''_115vs119'''&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
As can be seen, the relations file is such that for the lower level (second column) proteins are linked to the experiment (via suffix, associated to the corresponding data in the merge_newData.txt), while for the higher level (first column), proteins are independent from the experiment. This means that, after integrating, we will get only protein-level information, experiment-independent (the information of each protein in each experiment averaged with its corresponding weight).&lt;br /&gt;
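Cardenio's rearrangement can be sketched as follows (two hypothetical experiments; the underscore separator matches the examples above):&lt;br /&gt;

```python
# Hypothetical centred data, one dict per experiment: protein -> (x, v).
experiments = {
    "114vs118": {"protId1": (0.42, 20.8)},
    "115vs119": {"protId1": (0.13, 9.4)},
}

new_data, new_rels = [], []
for tag, data in experiments.items():
    for prot, (x, v) in data.items():
        suffixed = prot + "_" + tag        # protein tied to its experiment
        new_data.append((suffixed, x, v))  # rows of merge_newData.txt
        new_rels.append((prot, suffixed))  # rows of merge_newRels.txt
```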
&lt;br /&gt;
=== Last step for experiment integration ===&lt;br /&gt;
&lt;br /&gt;
The last lines perform the integration of the two experiments taking the data and relations files from Cardenio (merge_newData.txt and merge_newRels.txt), as seen in the previous section.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=37&amp;gt;&lt;br /&gt;
sanxot.exe -amerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -g&lt;br /&gt;
sanxotsieve.exe -aoutsMerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -Vmerged_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -amerged_noOuts -p%workingFolder% -dmerge_newData.txt -routsMerged_tagged.xls -fg -Vmerged_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqaMerged -p%workingFolder% -dmerged_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The approach here is the same as for the integration from spectrum to peptide or from peptide to protein (but in this case we integrate from &amp;quot;experiment-dependent protein data&amp;quot; to &amp;quot;experiment-independent protein data&amp;quot;):&lt;br /&gt;
:* the first line calculates the variance (storing it in the merged_infoFile.txt file)&lt;br /&gt;
:* then SanXoTSieve tags the outliers (generating the outsMerged_tagged.xls file with the same relations as merge_newRels.txt, but including the tag &amp;quot;out&amp;quot; in the third column),&lt;br /&gt;
:* the third line integrates the proteins using the relations file and the variance from the two previous commands;&lt;br /&gt;
:* finally, all the experiment-independent proteins are compared to each other, getting the variance (stored in merged_noOuts_infoFile.txt) and providing an FDR that can be used to filter changing proteins (in file merged_noOuts_outStats.xls).&lt;br /&gt;
&lt;br /&gt;
The last step will also be used in the Systems Biology Triangle method (SBT)&amp;lt;ref&amp;gt;García-Marqués, F., Trevisan-Herraz, M., Martínez-Martínez, S., Camafeita, E., Jorge, I., Lopez, J.A., Méndez-Barbero, N., Méndez-Ferrer, S., del Pozo, M.A., Ibáñez, B., Andrés, V., Sánchez-Madrid, F., Redondo, J.M., Bonzon-Kulichenko, E. and Vázquez, J. (2016) '''[http://www.mcponline.org/content/15/5/1740.long A novel systems-biology algorithm for the analysis of coordinated protein responses using quantitative proteomics]'''. ''Molecular &amp;amp; Cellular Proteomics''.&amp;lt;/ref&amp;gt; which we will see now (the &amp;quot;protein-to-all&amp;quot; operation is the &amp;quot;bottom side&amp;quot; of the triangle).&lt;br /&gt;
&lt;br /&gt;
== Test 3: Systems biology ==&lt;br /&gt;
&lt;br /&gt;
In this final test we perform a systems biology analysis of the previously merged experiment using the Systems Biology Triangle method. For that we start with the files generated in Test 2, plus a tab-separated table directly downloaded from the '''[https://en.wikipedia.org/wiki/DAVID DAVID bioinformatics resource]'''.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test3.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 3, SYSTEMS BIOLOGY&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test3&amp;quot;&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM conversion of ontological data from DAVID&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
camacho.exe -adbcats1 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -q1 -c5 -fBioCarta_&lt;br /&gt;
camacho.exe -adbcats2 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats1_rels.xls&amp;quot; -c6 -fGOBP_&lt;br /&gt;
camacho.exe -adbcats3 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats2_rels.xls&amp;quot; -c7 -fGOCC_&lt;br /&gt;
camacho.exe -adbcats4 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats3_rels.xls&amp;quot; -c8 -fGOMF_&lt;br /&gt;
camacho.exe -adbcats5 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats4_rels.xls&amp;quot; -c9 -fKEGG_&lt;br /&gt;
camacho.exe -adbcats -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats5_rels.xls&amp;quot; -c10 -fPIR_&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM prot to cat&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -g&lt;br /&gt;
sanxotsieve.exe -aoutsqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -Vqc_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -aqc_noOuts -p%workingFolder% -dmerged_noOuts_higherLevel.xls -routsqc_tagged.xls -fg -Vqc_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM cat to all&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aca -p%workingFolder% -dqc_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
sanxotsqueezer.exe -alist -p%workingFolder% -lqc_noOuts_outStats.xls -uca_outStats.xls -f0.05 -n3&lt;br /&gt;
sanson.exe -asanson -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -gpdf&lt;br /&gt;
sanxotgauss.exe -agauss -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -g&lt;br /&gt;
coordinometer.exe -acoordination -p%workingFolder% -qqc_noOuts_outStats.xls -cca_outStats.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_3:_Systems_biology|Test 3]]'''.&lt;br /&gt;
&lt;br /&gt;
=== Getting the protein-to-category relations file ===&lt;br /&gt;
&lt;br /&gt;
We start by generating the relations file we are going to use in the systems biology analysis. It is a simple relations file, where the first column contains a string with the identifier of the category, and the second column a string with the identifier of the protein (with the same philosophy as the tab-separated tables used for the spectrum-to-peptide and the peptide-to-protein assignations).&lt;br /&gt;
&lt;br /&gt;
Note that, although we are using DAVID to obtain the ontological categories, we can provide these relations using any tool, or even create custom relations, as long as we use the format mentioned. We can also concatenate relations (such as our custom relations) to previous relations files.&lt;br /&gt;
&lt;br /&gt;
As we said, a good way to get the information from databases such as Gene Ontology or KEGG is to use DAVID as an intermediary; in this case, we use the module '''[[Camacho]]''', dedicated to this operation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=17&amp;gt;&lt;br /&gt;
camacho.exe -adbcats1 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -q1 -c5 -fBioCarta_&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here we use, as in all SanXoT modules, the options -a and -p as usual (to define a prefix for the output files of this operation and to set the working directory, respectively). Then we give the filename of the tab-separated file downloaded from DAVID (using -d), we indicate the column number (one-based) containing the protein accession number (option -q; in this case -q1 indicates it is the first column) and the column containing the categories we want to parse into the relations file (using -c; in this case we want to parse the data from BioCarta, which are in column 5), and we define a prefix for the categories (as these are categories from BioCarta, we use -f to set it to &amp;quot;BioCarta_&amp;quot;, where the underscore acts as a separator to make it more human-readable). The prefix is not mandatory, but it is highly recommended when we import data from different databases, as categories from two databases may share a name while containing different protein sets; using a prefix not only avoids such collisions, but also helps track which database each category in use came from.&lt;br /&gt;
&lt;br /&gt;
The file fastaHeadersNoSequences.fasta is not mandatory. In this case we include it because we want to use the full FASTA header as the protein identifier (to obtain more human-readable results). Obviously, it must contain the same identifiers as the ones used for DAVID. Typically, the FASTA file used to identify the proteins should be used, but here we have removed the sequences to avoid distributing too large a file with this test.&lt;br /&gt;
&lt;br /&gt;
As a result, we get the file dbcats1_rels.xls. Now we concatenate the relations related to the databases for [http://www.geneontology.org Gene Ontology] (for BP = biological process, CC = cellular component, and MF = molecular function), [http://www.genome.jp/kegg KEGG] and [https://pir.georgetown.edu PIR]. For each of them we use Camacho, changing only the -c option (to set the column), -f (for the different category prefix) and including the file to which they concatenate (using the option -x). Other users might just prefer to overwrite the original file, but, in order to keep track of the different operations, we just use a different file for each output.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=18&amp;gt;&lt;br /&gt;
camacho.exe -adbcats2 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats1_rels.xls&amp;quot; -c6 -fGOBP_&lt;br /&gt;
camacho.exe -adbcats3 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats2_rels.xls&amp;quot; -c7 -fGOCC_&lt;br /&gt;
camacho.exe -adbcats4 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats3_rels.xls&amp;quot; -c8 -fGOMF_&lt;br /&gt;
camacho.exe -adbcats5 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats4_rels.xls&amp;quot; -c9 -fKEGG_&lt;br /&gt;
camacho.exe -adbcats -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats5_rels.xls&amp;quot; -c10 -fPIR_&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We end up with the final file dbcats_rels.xls; this relations file can now be used for as many Systems Biology Triangle analyses as needed, without needing to use Camacho again (unless we want to add databases or an updated ontology from DAVID).&lt;br /&gt;
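A sketch of the column parsing Camacho performs for each database (the DAVID row below is hypothetical, and we assume here that DAVID separates multiple categories in one cell with commas):&lt;br /&gt;

```python
# One hypothetical row of the DAVID table, already split on tabs: the
# accession is in column 1 (-q1) and the BioCarta cell in column 5 (-c5).
row = ["P12345", "x", "x", "x", "h_ptenPathway,h_il2Pathway"]

prefix = "BioCarta_"  # -fBioCarta_
protein = row[0]      # -q1 (columns are one-based)
rels = [(prefix + cat.strip(), protein)
        for cat in row[4].split(",") if cat.strip()]
```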
&lt;br /&gt;
=== From proteins to categories ===&lt;br /&gt;
&lt;br /&gt;
We start by using the relations file just generated, dbcats_rels.xls, to integrate proteins into categories. Every protein has a different weight, depending on the quality of the scans, the number of peptides detected and quantified for the protein, and perhaps even the different experiments where it has been found. As a result, the information from each protein weighs differently in the category where it has been classified (in the same way as happened at the previous levels). Getting the category-level information (rather than a lengthy list of quantified proteins) will help us get the big picture of what is going on in the proteome. Additionally, using the Systems Biology Triangle, we integrate the information of all the proteins quantified, not just a list of proteins whose quantification exceeds a defined threshold (as in over-representation algorithms).&lt;br /&gt;
&lt;br /&gt;
We integrate in the usual manner, first getting the variance of the proteins, then tagging outliers (proteins that behave differently than other proteins classified in the same group) and finally performing the integration with the formerly calculated variance and excluding the outlier proteins:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=28&amp;gt;&lt;br /&gt;
sanxot.exe -aqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -g&lt;br /&gt;
sanxotsieve.exe -aoutsqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -Vqc_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -aqc_noOuts -p%workingFolder% -dmerged_noOuts_higherLevel.xls -routsqc_tagged.xls -fg -Vqc_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This operation is the &amp;quot;top-left&amp;quot; side of the Systems Biology Triangle.&lt;br /&gt;
&lt;br /&gt;
The last integration, where categories are compared to each other, is the &amp;quot;top-right&amp;quot; side of the triangle:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=34&amp;gt;&lt;br /&gt;
sanxot.exe -aca -p%workingFolder% -dqc_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As we did in line 38 of Test 1, and line 41 of Test 2, this is a confluence, meaning that we don't need a relations file, because the whole set of categories is going to be considered the same group.&lt;br /&gt;
&lt;br /&gt;
=== Filtering the most relevant categories using SanXoTSqueezer ===&lt;br /&gt;
&lt;br /&gt;
Now we have the information needed to extract some conclusions from the analysis. Sometimes there are too many categories, making the analysis difficult. We could build &amp;quot;supercategories&amp;quot; using a relations file between category and supercategory, but in this case we will just filter the categories with '''[[SanXoTSqueezer]]''', keeping those with a minimum of 3 quantified proteins (using the option -n) and a maximum FDR of 5% (using the option -f). We also need the two outStats files, for the protein-to-category (&amp;quot;qc&amp;quot;, using the option -l) and the category-to-all (&amp;quot;ca&amp;quot;, using the option -u) operations, taken from previous steps (qc_noOuts_outStats.xls and ca_outStats.xls, respectively):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=37&amp;gt;&lt;br /&gt;
sanxotsqueezer.exe -alist -p%workingFolder% -lqc_noOuts_outStats.xls -uca_outStats.xls -f0.05 -n3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output is a list of categories, with associated information, in the file list_outList.xls; we will analyse the categories in this list.&lt;br /&gt;
&lt;br /&gt;
=== Clustering together similar categories using Sanson ===&lt;br /&gt;
&lt;br /&gt;
The point of integrating proteins into categories is managing less, but more meaningful, information. However, databases can contain so many categories that the results are sometimes too numerous to interpret biologically.&lt;br /&gt;
&lt;br /&gt;
In some cases there are duplicated categories that contain the same proteins under different names, especially when different databases are combined. This is easy to detect; more commonly, however, categories have ''similar'' but not ''identical'' sets of proteins, a subtler problem. The module '''[[Sanson]]''' helps detect similar categories, generating clusters of similar sets. A &amp;quot;similarity&amp;quot; threshold can be set (the similarity is given by an FNumber, which depends on how many proteins are shared by two given categories); the algorithm uses a matrix of protein-to-category items (the similarity matrix) and maximises both the number of categories in each cluster (the CNumber) and the number of clusters, partitioning them with a [https://en.wikipedia.org/wiki/Durfee_square Durfee square]-based calculation:&lt;br /&gt;
&lt;br /&gt;
 With this similarity-threshold, are there at least 3 clusters with at least 3 categories? --&amp;gt; true&lt;br /&gt;
 With this similarity-threshold, are there at least 4 clusters with at least 4 categories? --&amp;gt; true&lt;br /&gt;
 With this similarity-threshold, are there at least 5 clusters with at least 5 categories? --&amp;gt; false&lt;br /&gt;
 With this similarity-threshold, are there at least 6 clusters with at least 6 categories? --&amp;gt; false&lt;br /&gt;
&lt;br /&gt;
In this example, there could be (for instance) 5 clusters with 4 categories, or 3 clusters with 6 categories, but not 5 clusters with 5 categories, so the best CNumber in this case would be 4 (note that this is the same logic used to calculate the h-index metric).&lt;br /&gt;
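The CNumber selection described above can be sketched with the same procedure used for the h-index. The following is a minimal illustration, not the actual SanXoT code (the function name is hypothetical):&lt;br /&gt;

```python
def best_cnumber(cluster_sizes):
    """Largest n such that at least n clusters contain at least n categories
    (the same logic as the h-index)."""
    sizes = sorted(cluster_sizes, reverse=True)
    n = 0
    # Walk down the size-ranked clusters while the n-th largest still holds n+1 or more categories.
    while n < len(sizes) and sizes[n] >= n + 1:
        n += 1
    return n

# Matching the example in the text: 3 clusters of 6 categories give CNumber 3,
# while 5 clusters of 4 categories give CNumber 4.
print(best_cnumber([6, 6, 6]))        # 3
print(best_cnumber([4, 4, 4, 4, 4]))  # 4
```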
&lt;br /&gt;
In this test, Sanson is executed with the following command line:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=38&amp;gt;&lt;br /&gt;
sanson.exe -asanson -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -gpdf&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here, the relations file dbcats_rels.xls (option -r) is already known; list_outList.xls (option -c), generated in the previous step, provides the categories that will be considered; and the outStats file qaMerged_outStats.xls (option -z), generated by the last command of Test 2, provides the protein-level, category-independent quantifications, supplying the protein information for each category.&lt;br /&gt;
&lt;br /&gt;
The output of Sanson is a .gv file (here sanson_simGraph.gv), where the category clusters are given in the [http://en.wikipedia.org/wiki/DOT_language DOT language]; it can be interpreted by open-source programs like [http://graphviz.org/ Graphviz] to generate an image file. If Graphviz is installed, Sanson can launch the interpretation itself (you only need to specify in the dot.ini file the location of the dot.exe file used by Graphviz). In this test we have used as output the pdf file sanson_simGraph.pdf (using the -g option), but it can also be interesting to generate an [https://en.wikipedia.org/wiki/Scalable_Vector_Graphics svg] file, which (when opened in a suitable browser) shows the list of protein identifiers present in each category just by hovering the mouse over the category box.&lt;br /&gt;
&lt;br /&gt;
As you will see, each category box has a number of customisable colour columns. Each column represents a protein present in the category, and its colour depends on the standardised log-fold change of that protein (Z, the standardised version of X=logfold, meaning that X has been centred and scaled according to the variance). By default, white means no change, green means there is more quantity in sample B than in sample A (as logarithm(A/B) &amp;lt; 0), and, conversely, red means there is more quantity in sample A than in sample B. The numbers on the arrows indicate the total number of proteins in a category that are also present in the category the arrow points to. The absence of an arrow does not mean the categories share no proteins; it just means that the ratio of shared proteins does not reach the calculated threshold specified above.&lt;br /&gt;
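The standardisation mentioned above can be illustrated with a plain sample mean and standard deviation; note this is only a sketch, as the actual WSPP model weights each measurement by its estimated variance:&lt;br /&gt;

```python
import math

def standardise(logfolds):
    """Centre the log fold-changes X and scale them by the sample standard
    deviation, yielding the Z values used to colour the category boxes."""
    n = len(logfolds)
    mean = sum(logfolds) / n
    variance = sum((x - mean) ** 2 for x in logfolds) / (n - 1)
    sd = math.sqrt(variance)
    return [(x - mean) / sd for x in logfolds]

# Z < 0 corresponds to log(A/B) below the centre (more quantity in sample B),
# Z > 0 to more quantity in sample A.
z_values = standardise([-1.0, 0.0, 1.0])
```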
&lt;br /&gt;
=== Generating gaussians using SanXoTGauss ===&lt;br /&gt;
[[File:Gauss outGraph.png|thumb|300px|left|Output of SanXoTGauss in Test 3, line 37, showing the theoretical sigmoid for the standard normal distribution (red), the experimental sigmoid for all the proteins in the experiment (blue), and the sigmoids of proteins grouped by ontology (selecting with SanXoTSqueezer only categories that are changing expression with FDR &amp;lt;= 5% and having at least 3 proteins).]]&lt;br /&gt;
Representing the sigmoids (here, cumulative gaussians) of the different categories, and comparing them to a theoretical normal distribution (thick red line) and to the bulk of proteins (blue line), can help to gain better insight into the meaning of a change. A coordinated proteome is one whose proteins belonging to the same categories change in a similar way, together. We can observe this behaviour by representing the proteins classified in the same category as a separate sigmoid. Since we usually represent several categories at once, we draw each with a different colour to identify them more easily. Categories with a coordinated change will be &amp;quot;parallel&amp;quot; to the normal distribution, but displaced (according to the category-level fold change), while categories behaving in an uncoordinated way will tend to cross the normal distribution (as they contain proteins with both increasing and decreasing expression changes).&lt;br /&gt;
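The sigmoids above are cumulative distributions: the sorted Z values of a category's proteins plotted against their cumulative rank, compared with the standard normal CDF. A minimal standard-library sketch of both curves follows (illustrative only, not the SanXoTGauss implementation):&lt;br /&gt;

```python
import math

def normal_cdf(z):
    """Theoretical sigmoid: the standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def empirical_sigmoid(z_values):
    """Experimental sigmoid: each sorted Z value paired with the cumulative
    fraction of proteins at or below it."""
    zs = sorted(z_values)
    n = len(zs)
    return [(z, (i + 1) / n) for i, z in enumerate(zs)]

# A coordinated category looks like the normal curve displaced by its
# category-level fold change; here the displacement is +1.5:
points = empirical_sigmoid([z + 1.5 for z in (-1.0, -0.5, 0.0, 0.5, 1.0)])
```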
&lt;br /&gt;
This operation is done by '''[[SanXoTGauss]]''', using a command like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=39&amp;gt;&lt;br /&gt;
sanxotgauss.exe -agauss -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -g&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The options are the same as in the previous Sanson command, except -g (which here tells the program not to display the graph but just to save it, so that the workflow can keep going).&lt;br /&gt;
&lt;br /&gt;
=== Calculating the degree of coordination using the Coordinometer ===&lt;br /&gt;
&lt;br /&gt;
We have defined a parameter called '''''degree of coordination''''', which tells us objectively how coordinated a proteome is. This parameter takes into account the number of outlier proteins (proteins that are expressed differently from the proteins classified within the same category) and how much the categories are changing. The '''[[Coordinometer]]''' calculates this number using this command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=40&amp;gt;&lt;br /&gt;
coordinometer.exe -acoordination -p%workingFolder% -qqc_noOuts_outStats.xls -cca_outStats.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Coordinometer extracts the number of relations tagged as outliers (using the &amp;quot;out&amp;quot; tag in the third column) from the file qc_noOuts_outStats.xls (using the option -q; these are proteins expressed differently from other proteins in the same categories, a calculation from the &amp;quot;top-left&amp;quot; side of the triangle), and gets the number of changing categories from the file ca_outStats.xls (using the option -c, and reading the FDR column directly). The FDR threshold is set by default to 5%, but a different value can be set using the --cafdr option.&lt;br /&gt;
&lt;br /&gt;
The result is given both printed in the console, and in the corresponding log file, coordination_logFile.txt, along with the information used to calculate it:&lt;br /&gt;
&lt;br /&gt;
 Total number of changing categories: 66&lt;br /&gt;
 Total number of relations pointing to changing categories: 106&lt;br /&gt;
 Total number of outlier relations pointing to changing categories: 2&lt;br /&gt;
 Total number of outlier relations pointing to non-changing categories: 46&lt;br /&gt;
 Degree of coordination: 0.684211&lt;br /&gt;
&lt;br /&gt;
Where 0.684211 is the result of:&lt;br /&gt;
 (numRelsChangingCats - numOutliersChangingCats) / (numRelsChangingCats + numOutliersNonChangingCats) =&lt;br /&gt;
 (106 - 2) / (106 + 46) = 0.684211&lt;br /&gt;
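The same calculation can be checked with a few lines of Python, using the numbers from the log file above (the function name is ours, not part of the Coordinometer):&lt;br /&gt;

```python
def degree_of_coordination(rels_changing, outliers_changing, outliers_non_changing):
    """Relations pointing to changing categories, minus their outliers,
    over those relations plus outliers pointing to non-changing categories."""
    return (rels_changing - outliers_changing) / (rels_changing + outliers_non_changing)

# Numbers from coordination_logFile.txt in this test:
print(round(degree_of_coordination(106, 2, 46), 6))  # 0.684211
```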
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:SanXoT software package]]&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=Unit_tests_for_SanXoT&amp;diff=631</id>
		<title>Unit tests for SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=Unit_tests_for_SanXoT&amp;diff=631"/>
				<updated>2018-08-17T09:20:35Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;We present here three unit tests for Windows that can be performed using the following programs in the '''[[SanXoT software package]]''':&lt;br /&gt;
&lt;br /&gt;
* '''[[Aljamia]]''' (data parser)&lt;br /&gt;
* '''[[Klibrate]]''' (weight calibrator)&lt;br /&gt;
* '''[[SanXoT]]''' (integration and variance calculation)&lt;br /&gt;
* '''[[SanXoTSieve]]''' (removal of outliers)&lt;br /&gt;
* '''[[Cardenio]]''' (merger of experiments)&lt;br /&gt;
* '''[[SanXoTSqueezer]]''' (category filter)&lt;br /&gt;
* '''[[SanXoTGauss]]''' (category cumulative gaussian graphs)&lt;br /&gt;
* '''[[Sanson]]''' (category clustering)&lt;br /&gt;
* '''[[Coordinometer]]''' (calculator of the degree of coordination)&lt;br /&gt;
&lt;br /&gt;
For detailed explanations, check: '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
==Unit test for Windows==&lt;br /&gt;
&lt;br /&gt;
Links to the SanXoT Windows standalone executables are provided in this unit test. Additionally, download links to the source code and to up-to-date versions of SanXoT are available at: '''[[SanXoT software package]]'''.&lt;br /&gt;
&lt;br /&gt;
==Unit test for Unix/Linux operating systems==&lt;br /&gt;
&lt;br /&gt;
Although SanXoT has been mainly tested and used on Windows, it is developed in Python, a multi-platform language, so these unit tests can also be performed on Unix/Linux operating systems. To do so, you just need to:&lt;br /&gt;
&lt;br /&gt;
# Install Python 2.7.x,&lt;br /&gt;
# Install [https://en.wikipedia.org/wiki/NumPy NumPy] and [https://en.wikipedia.org/wiki/SciPy SciPy],&lt;br /&gt;
# Install [https://en.wikipedia.org/wiki/Graphviz Graphviz] (used to interpret the *.gv files, written in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) DOT language], that generate the '''[[Sanson]]''' clustering graphs in Test 3),&lt;br /&gt;
# Replace the *.bat files within the unit tests with the files contained here: '''[[File:SanXoT commands for Unix shell.zip]]'''.&lt;br /&gt;
# Use the source code from GitHub ('''[https://github.com/CNIC-Proteomics/SanXoT https://github.com/CNIC-Proteomics/SanXoT]'''), rather than the standalone exes provided below.&lt;br /&gt;
&lt;br /&gt;
To facilitate the user experience, the Windows executables are ready to be used with no need to install anything (not even Python!).&lt;br /&gt;
&lt;br /&gt;
==Test 1: The fundamental workflow==&lt;br /&gt;
&lt;br /&gt;
The '''[[fundamental workflow]]''' consists of the steps to quantify proteins using PSMs and quantitative information, based on the WSPP (''Weighted Spectrum, Peptide and Protein'') statistical model, as described&amp;lt;ref&amp;gt;Navarro, P., Trevisan-Herraz, M., Bonzon-Kulichenko, E., Núñez, E., Martínez-Acedo, P., Pérez-Hernández, D., Jorge, I., Mesa, R., Calvo, E., Carrascal, M., Hernáez, M.L., García, F., Bárcena, J.A., Ashman, K., Abian, J., Gil, C., Redondo, J.M. and Vázquez, J. (2014) '''[https://pubs.acs.org/doi/abs/10.1021/pr4006958 General statistical framework for quantitative proteomics by stable isotope labeling]'''. ''Journal of proteome research'', 13, 1234-1247.&amp;lt;/ref&amp;gt;, using the following steps:&lt;br /&gt;
&lt;br /&gt;
* calibration of spectra&lt;br /&gt;
* integrating from spectra to peptides&lt;br /&gt;
* integrating from peptides to proteins&lt;br /&gt;
* quantifying proteins&lt;br /&gt;
&lt;br /&gt;
This test makes use of four programs: '''[[Aljamia]]''', '''[[Klibrate]]''', '''[[SanXoT]]''', and '''[[SanXoTSieve]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this unit test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) download the Windows executables, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]'''. Unzip the whole content into a folder dedicated to the program.&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test1.zip SanXoT_test1.zip]'''. Unzip to a working folder. You should have two folders, one with the input for this test, and another with the expected results. In the former, you will find two text files:&lt;br /&gt;
&lt;br /&gt;
::* '''''commands_test1.bat''''', a Windows batch file with the command lines to run this sample analysis.&lt;br /&gt;
::* '''''170415_Marga_GBS_iTRAQ_PSMs.txt''''', a tab-separated-values text file with identifications and quantitative data from a proteomics experiment&amp;lt;ref&amp;gt;Mateos-Hernández, L., et al. (2016) '''[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5342688/ Quantitative proteomics reveals Piccolo as a candidate serological correlate of recovery from Guillain-Barré syndrome]'''. ''Oncotarget'', 7(46): 74582–74591.&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
3) get a) the path of your working folder (where you have unzipped the file '''''170415_Marga_GBS_iTRAQ_PSMs.txt'''''), b) the path of the folder where you have unzipped the Windows executables, and c) the drive of the latter (C:, D:, etc.); then modify the following lines in the command_test1.bat file accordingly:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test1.bat''''' file: copy and paste (or drag and drop) the whole commands_test1.bat text into a [https://en.wikipedia.org/wiki/Cmd.exe command prompt window]. (You can also just double-click the bat file; however, if an error arises, the CMD window created will close immediately, so the text of the error will not be available.)&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. It should take 30-60 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) compare your results with the data in the folder SanXoT_test1_results, inside SanXoT_test1.zip (for example, using comparison software such as [https://en.wikipedia.org/wiki/Beyond_Compare Beyond Compare]). Only the files ending in *_log.txt and *_infoFile.txt should show differences (and only due to the different file paths and timestamps; all other features, such as variances, Levenberg-Marquardt steps, and options, should be identical).&lt;br /&gt;
&lt;br /&gt;
==Test 2: Experiment merging==&lt;br /&gt;
&lt;br /&gt;
Protein quantifications from different technical or biological replicates can be merged into experiment-independent quantifications, yielding the biological and technical variance. In this test we:&lt;br /&gt;
&lt;br /&gt;
* use the resulting data from Test 1&lt;br /&gt;
* perform the fundamental workflow for another parallel experiment&lt;br /&gt;
* merge both experiments&lt;br /&gt;
&lt;br /&gt;
This test makes use of four programs: '''[[Klibrate]]''', '''[[SanXoT]]''', '''[[SanXoTSieve]]''', and '''[[Cardenio]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this unit test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) use the windows executables unzipped in step 1 of Test 1 ('''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]''').&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test2.zip SanXoT_test2.zip]'''. This contains:&lt;br /&gt;
&lt;br /&gt;
::* all the files generated in Test 1&lt;br /&gt;
::* the '''''tagFile.txt''''' (for '''[[Cardenio]]''')&lt;br /&gt;
::* the '''''commands_test2.bat'''''.&lt;br /&gt;
&lt;br /&gt;
3) remember to change the following lines in the '''''command_test2.bat''''' file, as in step 3 for Test 1:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test2&amp;quot;&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test2.bat''''' file (do the same as in step 4 in Test 1).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. As in Test 1, this test should take 30-60 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) as in Test 1, the only files that are not identical should be the merge_logFile.txt from Cardenio, and the *_infoFile.txt from SanXoT, SanXoTSieve and Klibrate.&lt;br /&gt;
&lt;br /&gt;
==Test 3: Systems biology==&lt;br /&gt;
&lt;br /&gt;
The '''[[SanXoT software package]]''' can be used to perform systems biology analyses using the SBT (''Systems Biology Triangle''), as described.&amp;lt;ref&amp;gt;García-Marqués, F., Trevisan-Herraz, M., Martínez-Martínez, S., Camafeita, E., Jorge, I., Lopez, J.A., Méndez-Barbero, N., Méndez-Ferrer, S., del Pozo, M.A., Ibáñez, B., Andrés, V., Sánchez-Madrid, F., Redondo, J.M., Bonzon-Kulichenko, E. and Vázquez, J. (2016) '''[http://www.mcponline.org/content/15/5/1740.long A novel systems-biology algorithm for the analysis of coordinated protein responses using quantitative proteomics]'''. ''Molecular &amp;amp; Cellular Proteomics'', 15(5):1740-60.&amp;lt;/ref&amp;gt; In this test we:&lt;br /&gt;
&lt;br /&gt;
* generate a relations file using data from the '''[https://en.wikipedia.org/wiki/DAVID DAVID bioinformatics resource]''' (you can download the data from them yourself, but for your convenience we have already included it)&lt;br /&gt;
* use the resulting data from Test 2&lt;br /&gt;
* calculate a category-level fold-change&lt;br /&gt;
* perform the systems biology analysis.&lt;br /&gt;
&lt;br /&gt;
This test makes use of seven programs: '''[[Camacho]]''', '''[[SanXoT]]''', '''[[SanXoTSieve]]''', '''[[SanXoTSqueezer]]''', '''[[Sanson]]''', '''[[SanXoTGauss]]''', and '''[[Coordinometer]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) use the windows executables unzipped in step 1 of Test 1 ('''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]''').&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test3.zip SanXoT_test3.zip]'''. This contains:&lt;br /&gt;
&lt;br /&gt;
::* all the files generated in Test 1 and Test 2&lt;br /&gt;
::* the tab-separated text file '''''SB_Homo_19dic-2017.txt''''' (a set of data downloaded from DAVID)&lt;br /&gt;
::* the '''''commands_test3.bat'''''.&lt;br /&gt;
&lt;br /&gt;
3) remember to change the following lines in the '''''command_test3.bat''''' file, as in step 3 for Test 1 and Test 2:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test3&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Additionally, if you want to get the similarity graph from the category clustering algorithm ('''[[Sanson]]'''), you will need:&lt;br /&gt;
&lt;br /&gt;
* to have '''[https://en.wikipedia.org/wiki/Graphviz Graphviz]''' installed (external open-source software, which includes the interpreter for the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) DOT language], used by Sanson to generate the graphs for the clusters)&lt;br /&gt;
* from the program folder, modify the ''dot.ini'' file, changing the following line to include the path to the ''bin'' folder of Graphviz, which may be different for each user:&lt;br /&gt;
&lt;br /&gt;
 dotlocation = C:\Program Files (x86)\Graphviz2.36\bin&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test3.bat''''' file (do the same as in step 4 in Test 1).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. This test should take 10-20 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) as in Test 1 and Test 2, the only files that are not identical should be the *_logFile.txt from Camacho, SanXoTSqueezer, Sanson, SanXoTGauss and the Coordinometer, as well as the *_infoFile.txt from SanXoT and SanXoTSieve. Additionally, if you use different versions of Graphviz (this test used Graphviz 2.36) you might find differences in the file sanson_simGraph.pdf (which is a pdf containing the final graph with the category clusters).&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:SanXoT software package]]&lt;br /&gt;
[[Category:unit tests]]&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=Unit_tests_for_SanXoT&amp;diff=630</id>
		<title>Unit tests for SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=Unit_tests_for_SanXoT&amp;diff=630"/>
				<updated>2018-08-16T14:33:49Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;We present here three unit tests for Windows that can be performed using the following programs in the '''[[SanXoT software package]]''':&lt;br /&gt;
&lt;br /&gt;
* '''[[Aljamia]]''' (data parser)&lt;br /&gt;
* '''[[Klibrate]]''' (weight calibrator)&lt;br /&gt;
* '''[[SanXoT]]''' (integration and variance calculation)&lt;br /&gt;
* '''[[SanXoTSieve]]''' (removal of outliers)&lt;br /&gt;
* '''[[Cardenio]]''' (merger of experiments)&lt;br /&gt;
* '''[[SanXoTSqueezer]]''' (category filter)&lt;br /&gt;
* '''[[SanXoTGauss]]''' (category cumulative gaussian graphs)&lt;br /&gt;
* '''[[Sanson]]''' (category clustering)&lt;br /&gt;
* '''[[Coordinometer]]''' (calculator of the degree of coordination)&lt;br /&gt;
&lt;br /&gt;
For detailed explanations, check: '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
==Source code==&lt;br /&gt;
&lt;br /&gt;
Note that links to the SanXoT Windows standalone executables are provided in this unit test, but download links to the source code and to up-to-date versions of SanXoT are available at: '''[[SanXoT software package]]'''.&lt;br /&gt;
&lt;br /&gt;
==Unit test for Unix/Linux operating systems==&lt;br /&gt;
&lt;br /&gt;
Although SanXoT has been mainly tested and used on Windows, it is developed in Python, a multi-platform language, so these unit tests can also be performed on Unix/Linux operating systems. To do so, you just need to:&lt;br /&gt;
&lt;br /&gt;
# Install Python 2.7.x,&lt;br /&gt;
# Install [https://en.wikipedia.org/wiki/NumPy NumPy] and [https://en.wikipedia.org/wiki/SciPy SciPy],&lt;br /&gt;
# Install [https://en.wikipedia.org/wiki/Graphviz Graphviz] (used to interpret the *.gv files, written in the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) DOT language], that generate the '''[[Sanson]]''' clustering graphs in Test 3),&lt;br /&gt;
# Replace the *.bat files within the unit tests with the files contained here: '''[[File:SanXoT commands for Unix shell.zip]]'''.&lt;br /&gt;
# Use the source code from GitHub ('''[https://github.com/CNIC-Proteomics/SanXoT https://github.com/CNIC-Proteomics/SanXoT]'''), rather than the standalone exes provided below.&lt;br /&gt;
&lt;br /&gt;
To facilitate the user experience, the Windows executables are ready to be used with no need to install anything (not even Python!).&lt;br /&gt;
&lt;br /&gt;
==Test 1: The fundamental workflow==&lt;br /&gt;
&lt;br /&gt;
The '''[[fundamental workflow]]''' consists of the steps to quantify proteins using PSMs and quantitative information, based on the WSPP (''Weighted Spectrum, Peptide and Protein'') statistical model, as described&amp;lt;ref&amp;gt;Navarro, P., Trevisan-Herraz, M., Bonzon-Kulichenko, E., Núñez, E., Martínez-Acedo, P., Pérez-Hernández, D., Jorge, I., Mesa, R., Calvo, E., Carrascal, M., Hernáez, M.L., García, F., Bárcena, J.A., Ashman, K., Abian, J., Gil, C., Redondo, J.M. and Vázquez, J. (2014) '''[https://pubs.acs.org/doi/abs/10.1021/pr4006958 General statistical framework for quantitative proteomics by stable isotope labeling]'''. ''Journal of proteome research'', 13, 1234-1247.&amp;lt;/ref&amp;gt;, using the following steps:&lt;br /&gt;
&lt;br /&gt;
* calibration of spectra&lt;br /&gt;
* integrating from spectra to peptides&lt;br /&gt;
* integrating from peptides to proteins&lt;br /&gt;
* quantifying proteins&lt;br /&gt;
&lt;br /&gt;
This test makes use of four programs: '''[[Aljamia]]''', '''[[Klibrate]]''', '''[[SanXoT]]''', and '''[[SanXoTSieve]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this unit test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) download the Windows executables, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]'''. Unzip the whole content into a folder dedicated to the program.&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test1.zip SanXoT_test1.zip]'''. Unzip to a working folder. You should have two folders, one with the input for this test, and another with the expected results. In the former, you will find two text files:&lt;br /&gt;
&lt;br /&gt;
::* '''''commands_test1.bat''''', a Windows batch file with the command lines to run this sample analysis.&lt;br /&gt;
::* '''''170415_Marga_GBS_iTRAQ_PSMs.txt''''', a tab-separated-values text file with identifications and quantitative data from a proteomics experiment&amp;lt;ref&amp;gt;Mateos-Hernández, L., et al. (2016) '''[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5342688/ Quantitative proteomics reveals Piccolo as a candidate serological correlate of recovery from Guillain-Barré syndrome]'''. ''Oncotarget'', 7(46): 74582–74591.&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
3) get a) the path of your working folder (where you have unzipped the file '''''170415_Marga_GBS_iTRAQ_PSMs.txt'''''), b) the path of the folder where you have unzipped the Windows executables, and c) the drive of the latter (C:, D:, etc.); then modify the following lines in the command_test1.bat file accordingly:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test1.bat''''' file: copy and paste (or drag and drop) the whole commands_test1.bat text into a [https://en.wikipedia.org/wiki/Cmd.exe command prompt window]. (You can also just double-click the bat file; however, if an error arises, the CMD window created will close immediately, so the text of the error will not be available.)&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. It should take 30-60 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) compare your results with the data in the folder SanXoT_test1_results, inside SanXoT_test1.zip (for example, using comparison software such as [https://en.wikipedia.org/wiki/Beyond_Compare Beyond Compare]). Only the files ending in *_log.txt and *_infoFile.txt should show differences (and only due to the different file paths and timestamps; all other features, such as variances, Levenberg-Marquardt steps, and options, should be identical).&lt;br /&gt;
&lt;br /&gt;
==Test 2: Experiment merging==&lt;br /&gt;
&lt;br /&gt;
Protein quantifications from different technical or biological replicates can be merged into experiment-independent quantifications, yielding the biological and technical variance. In this test we:&lt;br /&gt;
&lt;br /&gt;
* use the resulting data from Test 1&lt;br /&gt;
* perform the fundamental workflow for another parallel experiment&lt;br /&gt;
* merge both experiments&lt;br /&gt;
&lt;br /&gt;
This test makes use of four programs: '''[[Klibrate]]''', '''[[SanXoT]]''', '''[[SanXoTSieve]]''', and '''[[Cardenio]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this unit test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) use the windows executables unzipped in step 1 of Test 1 ('''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]''').&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test2.zip SanXoT_test2.zip]'''. This contains:&lt;br /&gt;
&lt;br /&gt;
::* all the files generated in Test 1&lt;br /&gt;
::* the '''''tagFile.txt''''' (for '''[[Cardenio]]''')&lt;br /&gt;
::* the '''''commands_test2.bat'''''.&lt;br /&gt;
&lt;br /&gt;
3) remember to change the following lines in the '''''command_test2.bat''''' file, as in step 3 for Test 1:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test2&amp;quot;&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test2.bat''''' file (do the same as in step 4 in Test 1).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. As in Test 1, this test should take 30-60 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) as in Test 1, the only files that are not identical should be the merge_logFile.txt from Cardenio, and the *_infoFile.txt from SanXoT, SanXoTSieve and Klibrate.&lt;br /&gt;
&lt;br /&gt;
==Test 3: Systems biology==&lt;br /&gt;
&lt;br /&gt;
The '''[[SanXoT software package]]''' can be used to perform systems biology analyses using the SBT (''Systems Biology Triangle''), as described.&amp;lt;ref&amp;gt;García-Marqués, F., Trevisan-Herraz, M., Martínez-Martínez, S., Camafeita, E., Jorge, I., Lopez, J.A., Méndez-Barbero, N., Méndez-Ferrer, S., del Pozo, M.A., Ibáñez, B., Andrés, V., Sánchez-Madrid, F., Redondo, J.M., Bonzon-Kulichenko, E. and Vázquez, J. (2016) '''[http://www.mcponline.org/content/15/5/1740.long A novel systems-biology algorithm for the analysis of coordinated protein responses using quantitative proteomics]'''. ''Molecular &amp;amp; Cellular Proteomics'', 15(5):1740-60.&amp;lt;/ref&amp;gt; In this test we:&lt;br /&gt;
&lt;br /&gt;
* generate a relations file using data from the '''[https://en.wikipedia.org/wiki/DAVID DAVID bioinformatics resource]''' (you can download the data from them yourself, but for your convenience we have already included it)&lt;br /&gt;
* use the resulting data from Test 2&lt;br /&gt;
* calculate a category-level fold-change&lt;br /&gt;
* perform the systems biology analysis.&lt;br /&gt;
&lt;br /&gt;
This test makes use of seven programs: '''[[Camacho]]''', '''[[SanXoT]]''', '''[[SanXoTSieve]]''', '''[[SanXoTSqueezer]]''', '''[[Sanson]]''', '''[[SanXoTGauss]]''', and '''[[Coordinometer]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) use the windows executables unzipped in step 1 of Test 1 ('''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]''').&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test3.zip SanXoT_test3.zip]'''. This contains:&lt;br /&gt;
&lt;br /&gt;
::* all the files generated in Test 1 and Test 2&lt;br /&gt;
::* the tab-separated text file '''''SB_Homo_19dic-2017.txt''''' (a set of data downloaded from DAVID)&lt;br /&gt;
::* the '''''commands_test3.bat'''''.&lt;br /&gt;
&lt;br /&gt;
3) remember to change the following lines in the '''''commands_test3.bat''''' file, as in step 3 for Test 1 and Test 2:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test3&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Additionally, if you want to get the similarity graph from the category clustering algorithm ('''[[Sanson]]'''), you will need:&lt;br /&gt;
&lt;br /&gt;
* to have '''[https://en.wikipedia.org/wiki/Graphviz Graphviz]''' installed (an open-source external program, which includes the interpreter for the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) DOT language], used by Sanson to generate the graphs for the clusters)&lt;br /&gt;
* from the program folder, modify the ''dot.ini'' file, changing the following line to include the path to the ''bin'' folder of Graphviz, which may differ for each user:&lt;br /&gt;
&lt;br /&gt;
 dotlocation = C:\Program Files (x86)\Graphviz2.36\bin&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test3.bat''''' file (do the same as in step 4 in Test 1).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. This test should take 10-20 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) as in Test 1 and Test 2, the only files that are not identical should be the *_logFile.txt from Camacho, SanXoTSqueezer, Sanson, SanXoTGauss and the Coordinometer, as well as the *_infoFile.txt from SanXoT and SanXoTSieve. Additionally, if you use a different version of Graphviz (this test used Graphviz 2.36), you might find differences in the file sanson_simGraph.pdf (a PDF containing the final graph with the category clusters).&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:SanXoT software package]]&lt;br /&gt;
[[Category:unit tests]]&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=File:SanXoT_commands_for_Unix_shell.zip&amp;diff=629</id>
		<title>File:SanXoT commands for Unix shell.zip</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=File:SanXoT_commands_for_Unix_shell.zip&amp;diff=629"/>
				<updated>2018-08-16T14:20:36Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: Commands for the unit test used in Unix shell (instead of the DOS batch files).&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Commands for the unit test used in Unix shell (instead of the DOS batch files).&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=Exploring_SanXoT_features&amp;diff=628</id>
		<title>Exploring SanXoT features</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=Exploring_SanXoT_features&amp;diff=628"/>
				<updated>2018-08-14T15:23:31Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Here we will look at the details of three common workflows that use the SanXoT software package.&lt;br /&gt;
&lt;br /&gt;
For tests check '''[[Unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
For a general overview on how to use SanXoT and how it works, see '''[[How to use SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== Test 1: The fundamental workflow ==&lt;br /&gt;
&lt;br /&gt;
We describe here the '''[[fundamental workflow]]''', a workflow similar to the statistics done by '''[[QuiXoT]]''' using the WSPP (''Weighted Spectrum, Peptide and Protein'') statistical model&amp;lt;ref&amp;gt;Navarro, P., Trevisan-Herraz, M., Bonzon-Kulichenko, E., Núñez, E., Martínez-Acedo, P., Pérez-Hernández, D., Jorge, I., Mesa, R., Calvo, E., Carrascal, M., Hernáez, M.L., García, F., Bárcena, J.A., Ashman, K., Abian, J., Gil, C., Redondo, J.M. and Vázquez, J. (2014) '''[https://pubs.acs.org/doi/abs/10.1021/pr4006958 General statistical framework for quantitative proteomics by stable isotope labeling]'''. ''Journal of proteome research'', 13, 1234-1247.&amp;lt;/ref&amp;gt;. The following text is also available as a [https://en.wikipedia.org/wiki/Batch_file batch file] in '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test1.zip SanXoT_test1.zip]''':&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test1.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 1, FUNDAMENTAL WORKFLOW (SPECTRUM, PEPTIDE, PROTEIN)&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
 set startingFile=170415_Marga_GBS_iTRAQ_PSMs.txt&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM create the relation files&lt;br /&gt;
aljamia.exe -asprels -p%workingFolder% -x%startingFile% -i&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -j&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot;&lt;br /&gt;
aljamia.exe -apqrels -p%workingFolder% -x%startingFile% -i&amp;quot;'[Master Protein Accessions]'.split(';')[0]&amp;quot; -j&amp;quot;[Annotated Sequence]_[Modifications]&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;i&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM create the data file at the scan level&lt;br /&gt;
aljamia.exe -asdata114vs118 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([114]/[118], 2)&amp;quot; -k&amp;quot;max([114], [118])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
aljamia.exe -asdata115vs119 -p%workingFolder% -x%startingFile% -i&amp;quot;[Spectrum File]-[First Scan]-[Charge]&amp;quot; -j&amp;quot;math.log([115]/[119], 2)&amp;quot; -k&amp;quot;max([115], [119])&amp;quot; -f&amp;quot;[Confidence]==High || [Confidence]==Medium&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM The following tab-separated text files (despite their extensions) are generated:&lt;br /&gt;
REM sprels_table.tsv, with two columns, having scan-to-peptide identifiers (rawfile-scanNumber-charge for scans, sequence for peptides)&lt;br /&gt;
REM pqrels_table.tsv, with two columns, having peptide-to-protein identifiers (text of their fasta header as protein identifier)&lt;br /&gt;
REM sdata_table.tsv, with three columns having {scan identifier, log2(foldchange), weight}&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart114vs118 -p%workingFolder% -dsdata114vs118_table.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxot.exe -asp114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -aspouts114vs118 -p%workingFolder% -dstart114vs118_calibrated.tsv -rsprels_table.tsv -Vsp114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp114vs118_noOuts -p%workingFolder% -dstart114vs118_calibrated.tsv -rspouts114vs118_tagged.tsv -Vsp114vs118_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein&lt;br /&gt;
sanxot.exe -apq114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -g&lt;br /&gt;
sanxotsieve.exe -apqouts114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqrels_table.tsv -Vpq114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq114vs118_noOuts -p%workingFolder% -dsp114vs118_noOuts_higherLevel.tsv -rpqouts114vs118_tagged.tsv -fg -Vpq114vs118_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa114vs118 -p%workingFolder% -dpq114vs118_noOuts_higherLevel.tsv -Cg&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_1:_The_fundamental_workflow|Test 1]]'''.&lt;br /&gt;
&lt;br /&gt;
=== Modifications needed ===&lt;br /&gt;
&lt;br /&gt;
The only lines that need to be modified by the user are these (to match the user's file locations):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=6&amp;gt;&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the first operations, we parse the data from ''startingFile.xls'' using '''[[Aljamia]]''' to generate the relations files. In this case, the input file is a tab-separated text file (despite its extension) quantified by '''[[QuiXoT]]''', but any tab-separated text file that arranges information in columns with headers can be used as input (for example, [https://en.wikipedia.org/wiki/SQLite SQLite] query results from Proteome Discoverer's MSF files).&lt;br /&gt;
&lt;br /&gt;
=== Parsing the data ===&lt;br /&gt;
&lt;br /&gt;
Let us check the first [[Aljamia]] operation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=14&amp;gt;&lt;br /&gt;
aljamia.exe -asprels -p%workingFolder% -xstartingFile.xls -R24 -i&amp;quot;[Sequence]&amp;quot; -j&amp;quot;[RAWFileName]-[FirstScan]-[Charge]&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a common workflow, the SanXoT software package can generate many output files. To make them easier to track, by default all the files generated in a single step will have the same prefix, while files of the same type will share a common suffix. The prefix is set by the ''-a'' argument (it is good practice to ''always'' include this argument). For example, in this case two files are generated: ''sprels_table.xls'' (the relations table for matching '''s'''cans and '''p'''eptides) and ''sprels_log.txt'' (a log file). Unless overridden (check '''[[Aljamia]]''''s help for more information), the filename of the generated relations table will always end in '''_table.xls'''.&lt;br /&gt;
&lt;br /&gt;
The -p option sets the working folder: all input and output files are assumed to be in that folder, unless a full path is specified. In this file, the headers first appear at line 24, which is specified with the -R24 option. The generated table will include only two columns: the first (-i) simply copies the information in the &amp;quot;Sequence&amp;quot; column (used as the identifier of the higher level), and the second (-j) merges the data from the &amp;quot;RAWFileName&amp;quot;, &amp;quot;FirstScan&amp;quot; and &amp;quot;Charge&amp;quot; columns (an example of a unique identifier for scans, at the lower level).&lt;br /&gt;
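The column-extraction idea behind these bracketed templates can be sketched in Python. The function below is illustrative only (Aljamia's actual options are much richer), and the column names and sample row are invented for the example:

```python
import csv
import io
import re

def parse_relations(tsv_text, i_template, j_template, header_line=1):
    """Build a two-column relations table (higher level, lower level) from
    tab-separated text, in the spirit of Aljamia's -i/-j options: column
    names in square brackets are replaced by each row's values."""
    # Headers may start below the first line (cf. the -R24 option).
    lines = tsv_text.splitlines()[header_line - 1:]
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
    fill = lambda tpl, row: re.sub(r"\[([^\]]+)\]", lambda m: row[m.group(1)], tpl)
    return [(fill(i_template, row), fill(j_template, row)) for row in reader]

# Invented miniature input, mimicking the scan-to-peptide case above.
tsv = "Sequence\tRAWFileName\tFirstScan\tCharge\nPEPTIDEK\trun01.raw\t1234\t2\n"
rels = parse_relations(tsv, "[Sequence]", "[RAWFileName]-[FirstScan]-[Charge]")
# rels == [("PEPTIDEK", "run01.raw-1234-2")]
```

Each output pair is one line of the relations table: peptide identifier (higher level) first, composite scan identifier (lower level) second.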
&lt;br /&gt;
The operation to generate the file for the relations between proteins and peptides is similar:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=15&amp;gt;&lt;br /&gt;
aljamia.exe -apqrels -p%workingFolder% -xstartingFile.xls -R24 -i&amp;quot;[FASTAProteinDescription]&amp;quot; -j&amp;quot;[Sequence]&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notice that the prefix here is ''pqrels'' (some researchers use '''q''' for proteins, to differentiate them from '''p'''eptides); now the peptide identifiers (&amp;quot;Sequence&amp;quot;) are in the second column (because they are now the lower level), while the proteins are in the first column (as they are now the higher level). We use here the FASTA header as the protein identifier (rather than just the accession number), as it is much more human-readable, which facilitates the biological interpretation later.&lt;br /&gt;
&lt;br /&gt;
The next two lines parse the information from the same input file, ''startingFile.xls'', to generate two data files, one for the 114 vs 118 iTRAQ comparison, and the other for 115 vs 119 (as you can see in the -a option for the prefix):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=18&amp;gt;&lt;br /&gt;
aljamia.exe -asdata114vs118 -p%workingFolder% -xstartingFile.xls -R24 -i&amp;quot;[RAWFileName]-[FirstScan]-[Charge]&amp;quot; -j&amp;quot;math.log([q_reporterIon_114]/[q_reporterIon_118], 2)&amp;quot; -k&amp;quot;max([q_reporterIon_114], [q_reporterIon_118])&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
aljamia.exe -asdata115vs119 -p%workingFolder% -xstartingFile.xls -R24 -i&amp;quot;[RAWFileName]-[FirstScan]-[Charge]&amp;quot; -j&amp;quot;math.log([q_reporterIon_115]/[q_reporterIon_119], 2)&amp;quot; -k&amp;quot;max([q_reporterIon_115], [q_reporterIon_119])&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Data files in SanXoT include a first column (-i) with a unique text identifier (in this case, the scan identifier mentioned above), a second column (-j) with the log2(A/B) ratio (coded as an enclosed Python-style expression), and a third column (-k) for the weight of the scan (for iTRAQ we simply use the intensity of the most intense peak of the pair). The option -A&amp;quot;jk&amp;quot; indicates that mathematical operations are to be performed on the second and third columns (but not the first, which is treated as text).&lt;br /&gt;
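As a minimal sketch of what one such data-file row contains, assuming the iTRAQ weighting just described (the function name is ours, for illustration, not part of the package):

```python
import math

def scan_data_row(scan_id, intensity_a, intensity_b):
    """One row of a SanXoT data file: text identifier, log2(A/B) ratio,
    and a weight; here, as in the iTRAQ example above, the weight is the
    larger of the two reporter-ion intensities."""
    ratio = math.log(intensity_a / intensity_b, 2)
    weight = max(intensity_a, intensity_b)
    return (scan_id, ratio, weight)

# Invented reporter intensities for a single scan.
row = scan_data_row("run01.raw-1234-2", 2000.0, 1000.0)
# row == ("run01.raw-1234-2", 1.0, 2000.0)
```

A 2:1 intensity ratio thus yields a log2 fold-change of 1.0, weighted by the stronger reporter peak.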
&lt;br /&gt;
=== Calibrating the spectra ===&lt;br /&gt;
&lt;br /&gt;
The first step to integrate the data is calibrating the weights assigned to each spectrum. This task is performed by the module [[Klibrate]]:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=26&amp;gt;&lt;br /&gt;
klibrate.exe -astart114vs118 -p%workingFolder% -dsdata114vs118_table.xls -rsprels_table.xls -g&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As in all SanXoT programs, there is a prefix assigned to the operation (-a) and the default working folder is specified (-p). Additionally, -d and -r set the file names of the data and relations files generated by Aljamia (sdata114vs118_table.xls and sprels_table.xls, respectively). The -g option here tells the program not to stop to show the sigmoid graph with the comparison to the standard normal distribution.&lt;br /&gt;
&lt;br /&gt;
=== Integration from spectrum to peptide ===&lt;br /&gt;
&lt;br /&gt;
We are ready for the first integration, which is done by [[SanXoT]] with the help of [[SanXoTSieve]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=27&amp;gt;&lt;br /&gt;
sanxot.exe -asp114vs118 -p%workingFolder% -dstart114vs118_calibrated.xls -rsprels_table.xls -g&lt;br /&gt;
sanxotsieve.exe -aspouts114vs118 -p%workingFolder% -dstart114vs118_calibrated.xls -rsprels_table.xls -Vsp114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp114vs118_noOuts -p%workingFolder% -dstart114vs118_calibrated.xls -rspouts114vs118_tagged.xls -Vsp114vs118_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(We have already described the meaning of the -a, -p, -d, -r and -g options.)&lt;br /&gt;
&lt;br /&gt;
In the first of these three commands, SanXoT integrates the lower-level data (in this case, spectra) into a higher level (here, peptides). When doing so, SanXoT calculates the variance of the lower level with respect to the higher level, i.e., how the fold-changes of the different spectra are distributed when assigned to different peptides. The result is given both in the console and in the [prefix]_infoFile.txt (sp114vs118_infoFile.txt).&lt;br /&gt;
&lt;br /&gt;
Once the variance is known, SanXoTSieve tags the outlier spectra for each peptide (if any) by adding the tag &amp;quot;out&amp;quot; in a third column of the relations file, saving a new relations file (to avoid modifying the original). To obtain the variance (essential to determine which spectra are outliers), SanXoTSieve takes the sp114vs118_infoFile.txt as input (option -V), reading it from the line &amp;quot;Variance = ...&amp;quot;; alternatively, the variance can be set directly with the option -v (lowercase).&lt;br /&gt;
&lt;br /&gt;
For SanXoTSieve, -f defines the FDR below which a spectrum is considered an outlier. This means that spectra flagged at FDR &amp;lt; 0.01 show a quantification that deviates significantly from the other spectra where the same peptide was identified (i.e., they are considered outliers), with an expected proportion of false positives below 1%.&lt;br /&gt;
&lt;br /&gt;
Once SanXoTSieve has produced the relations file with tagged outliers, SanXoT finally integrates the data; in this case, instead of calculating the variance, it uses the variance that was calculated when the outliers were still included (option -V to provide the file, or -v to provide it as a number). We say in this case that ''the variance is forced'', which is indicated with the -f parameter (note that -f has a different meaning here than in SanXoTSieve).&lt;br /&gt;
&lt;br /&gt;
The option --tags=&amp;quot;!out&amp;quot; means that the relations tagged as &amp;quot;out&amp;quot; by SanXoTSieve will be excluded (they have the same effect as if they had been cut). More tags can be combined (see the specific help for more information).&lt;br /&gt;
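The overall idea of the outlier-tagging step (each lower-level member compared against its higher-level mean using the known variance, a false-discovery-rate cut-off applied, and the flagged relations tagged "out") can be sketched as follows. This is only an illustration under simplifying assumptions: SanXoTSieve's actual statistics also use the weights and are more elaborate, and we use a plain Benjamini-Hochberg procedure here.

```python
import math
from statistics import NormalDist

def tag_outliers(relations, values, variance, fdr=0.01):
    """Illustrative sketch: within each higher-level element, flag
    lower-level members whose deviation from the group mean is significant
    at the given FDR (Benjamini-Hochberg on two-sided z-test p-values)."""
    groups = {}
    for sup, inf in relations:
        groups.setdefault(sup, []).append(inf)
    scored = []
    for sup, members in groups.items():
        mean = sum(values[m] for m in members) / len(members)
        for m in members:
            z = (values[m] - mean) / math.sqrt(variance)
            p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))  # two-sided p-value
            scored.append((p, sup, m))
    scored.sort()
    n = len(scored)
    cutoff_rank = 0  # largest rank passing the BH condition
    for rank, (p, _, _) in enumerate(scored, 1):
        if p <= fdr * rank / n:
            cutoff_rank = rank
    # Third column holds the tag, mimicking the *_tagged relations file.
    return [(sup, m, "out" if rank <= cutoff_rank else "")
            for rank, (p, sup, m) in enumerate(scored, 1)]

# Invented example: five scans of one peptide, one of them far off.
rels = [("pep1", f"scan{i}") for i in range(5)]
vals = {"scan0": 0.0, "scan1": 0.0, "scan2": 0.0, "scan3": 0.0, "scan4": 10.0}
tagged = tag_outliers(rels, vals, variance=1.0)
```

Only the extreme scan ends up tagged "out"; the subsequent integration with --tags="!out" would then ignore it.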
&lt;br /&gt;
=== Integration from peptide to protein ===&lt;br /&gt;
&lt;br /&gt;
The integration of peptides to proteins is very similar to the previous one:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=32&amp;gt;&lt;br /&gt;
sanxot.exe -apq114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.xls -rpqrels_table.xls -g&lt;br /&gt;
sanxotsieve.exe -apqouts114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.xls -rpqrels_table.xls -Vpq114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq114vs118_noOuts -p%workingFolder% -dsp114vs118_noOuts_higherLevel.xls -rpqouts114vs118_tagged.xls -fg -Vpq114vs118_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The only differences are: 1) instead of taking as input the output file of Klibrate (which contained the calibrated spectrum information), it takes the output of the previous integration (sp114vs118_noOuts_higherLevel.xls), and 2) the relations file (pqrels_table.xls) contains the protein identifiers in the first column (here we use the FASTA header, but the accession number or any other unique text string would do the job) and the peptide identifiers (the amino acid sequences) in the second column.&lt;br /&gt;
&lt;br /&gt;
The result will be a data file containing protein identifiers, fold-change at protein level and weights of proteins, with no more reference to the peptides or spectra.&lt;br /&gt;
&lt;br /&gt;
=== Getting the quantification of proteins ===&lt;br /&gt;
[[File:Qa114vs118 outGraph.png|thumb|300px|Output file of SanXoT (''qa114vs118_outGraph.png'') in Test 1, line 38, showing the sigmoid with protein expression changes (blue) compared to the theoretical sigmoid of the standard normal distribution.]]&lt;br /&gt;
&lt;br /&gt;
In the last step we compare the set of proteins as a whole. As you can see, the output data file from the previous integration is used as input (pq114vs118_noOuts_higherLevel.xls); here there is no relations file, as in this case we are not integrating proteins into different groups (we will see this in Test 2 for merging experiments, and in Test 3 for systems biology).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=38&amp;gt;&lt;br /&gt;
sanxot.exe -aqa114vs118 -p%workingFolder% -dpq114vs118_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of providing a relations file, we include option -C (for ''confluence''), which has the same effect as a relations file with a dummy first column filled entirely with a &amp;quot;1&amp;quot; for every row, and a second column for each protein identifier.&lt;br /&gt;
&lt;br /&gt;
The output will be a higher-level file (qa114vs118_higherLevel.xls, where '''q''' means &amp;quot;protein&amp;quot; and '''a''' means &amp;quot;all&amp;quot;) with only one row, where the fold-change is the weighted arithmetic mean of all proteins (i.e. the grand mean of the whole set). You can see the protein changes in the file qa114vs118_outStats.xls, where the FDR of changing proteins is presented, alongside their fold-change (Xinf), normalised fold-change (Z) and the grand mean (Xsup).&lt;br /&gt;
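The grand mean computed in this confluence step is simply a weighted arithmetic mean of the protein log2 fold-changes; a minimal sketch (the function name and sample values are ours, for illustration):

```python
def grand_mean(fold_changes, weights):
    """Weighted arithmetic mean of the protein log2 fold-changes: the
    single value reported in the one-row higher-level file of the
    protein-to-all (confluence) step."""
    return sum(x * w for x, w in zip(fold_changes, weights)) / sum(weights)

# Heavier (better-measured) proteins pull the grand mean more strongly.
xsup = grand_mean([1.0, 0.0, -0.5], [3.0, 1.0, 2.0])
```

Here the protein with weight 3.0 dominates, so the grand mean lands well above the unweighted average of the three fold-changes.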
&lt;br /&gt;
== Test 2: Experiment merging ==&lt;br /&gt;
&lt;br /&gt;
Here we perform a very simple experiment merge, combining two experiments with comparable information (i.e. if the first experiment was &amp;quot;treated vs control&amp;quot;, the second experiment can be a biological replicate, such as another &amp;quot;treated vs control&amp;quot; with a different treated patient and a different control).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test2.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 2, EXPERIMENT MERGING&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test2&amp;quot;&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM now, same for 115vs119&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart115vs119 -p%workingFolder% -dsdata115vs119_table.xls -rsprels_table.xls -g&lt;br /&gt;
sanxot.exe -asp115vs119 -p%workingFolder% -dstart115vs119_calibrated.xls -rsprels_table.xls -g&lt;br /&gt;
sanxotsieve.exe -aspouts115vs119 -p%workingFolder% -dstart115vs119_calibrated.xls -rsprels_table.xls -Vsp115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp115vs119_noOuts -p%workingFolder% -dstart115vs119_calibrated.xls -rspouts115vs119_tagged.xls -Vsp115vs119_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein for 115vs119&lt;br /&gt;
sanxot.exe -apq115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqrels_table.xls -g&lt;br /&gt;
sanxotsieve.exe -apqouts115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqrels_table.xls -Vpq115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq115vs119_noOuts -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqouts115vs119_tagged.xls -fg -Vpq115vs119_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes for 115vs119&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa115vs119 -p%workingFolder% -dpq115vs119_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM experiment merging: 115vs119 + 114vs118&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
cardenio.exe -amerge -p%workingFolder% -ttagFile.txt&lt;br /&gt;
sanxot.exe -amerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -g&lt;br /&gt;
sanxotsieve.exe -aoutsMerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -Vmerged_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -amerged_noOuts -p%workingFolder% -dmerge_newData.txt -routsMerged_tagged.xls -fg -Vmerged_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqaMerged -p%workingFolder% -dmerged_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_2:_Experiment_merging|Test 2]]'''.&lt;br /&gt;
&lt;br /&gt;
=== A parallel analysis ===&lt;br /&gt;
&lt;br /&gt;
In this example we merge just two experiments for simplicity, but it is possible to merge even hundreds of them at once.&lt;br /&gt;
&lt;br /&gt;
Command lines 6-8 have already been discussed; they must be adapted to the paths on the user's computer. Lines 14-29 do the same as discussed in the previous section, but for the 115 vs 119 comparison:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=14&amp;gt;&lt;br /&gt;
REM now, same for 115vs119&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart115vs119 -p%workingFolder% -dsdata115vs119_table.xls -rsprels_table.xls -g&lt;br /&gt;
sanxot.exe -asp115vs119 -p%workingFolder% -dstart115vs119_calibrated.xls -rsprels_table.xls -g&lt;br /&gt;
sanxotsieve.exe -aspouts115vs119 -p%workingFolder% -dstart115vs119_calibrated.xls -rsprels_table.xls -Vsp115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp115vs119_noOuts -p%workingFolder% -dstart115vs119_calibrated.xls -rspouts115vs119_tagged.xls -Vsp115vs119_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein for 115vs119&lt;br /&gt;
sanxot.exe -apq115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqrels_table.xls -g&lt;br /&gt;
sanxotsieve.exe -apqouts115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqrels_table.xls -Vpq115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq115vs119_noOuts -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqouts115vs119_tagged.xls -fg -Vpq115vs119_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes for 115vs119&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa115vs119 -p%workingFolder% -dpq115vs119_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Generating the files for the experiment integration ===&lt;br /&gt;
&lt;br /&gt;
The module for merging experiments, '''[[Cardenio]]''', actually does a very simple thing: it takes the protein identifiers, concatenates a suffix for each experiment (specified in an external file called the tagFile) and generates both the data and relations files, ready for SanXoT to integrate in the usual way:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=36&amp;gt;&lt;br /&gt;
cardenio.exe -amerge -p%workingFolder% -ttagFile.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We have already discussed the -a and -p options across the SanXoT software package. Cardenio also needs -t, which points to a tab-separated file with this structure:&lt;br /&gt;
&lt;br /&gt;
 tag		file&lt;br /&gt;
 114vs118	qa114vs118_lowerNormV.xls&lt;br /&gt;
 115vs119	qa115vs119_lowerNormV.xls&lt;br /&gt;
&lt;br /&gt;
The &amp;quot;tag&amp;quot; column includes the suffix that will be appended to each protein (or other) identifier, while the &amp;quot;file&amp;quot; column contains the filenames of the data files. These files are similar to, but distinct from, the _higherLevel files: they are their centred versions. For example, the one with prefix qa114vs118 was generated in the last step of the 114vs118 integration, and the suffix _lowerNormV tells us it is a data file where the fold-changes have been centred by subtracting the grand mean, X'inf, while the weights have been kept unchanged; files with suffix _lowerNorm'''W''' have identical first and second columns, but the third column, for the weights, has been modified to include the general variance of the protein level, as described in the WSPP statistical model.&lt;br /&gt;
&lt;br /&gt;
These identifiers are rearranged in the relations and data files as follows.&lt;br /&gt;
&lt;br /&gt;
The original _lowerNormV file of the first experiment looks like:&lt;br /&gt;
&lt;br /&gt;
 idinf	X'inf		Vinf&lt;br /&gt;
 protId1	0.420637682685	20.8442070633&lt;br /&gt;
 protId2	0.0309318252993	10.405882717&lt;br /&gt;
 protId3	0.235807597358	13.6742537388&lt;br /&gt;
 protId4	-0.939886653794	10.3887308051&lt;br /&gt;
 protId5	0.136998763785	111.910628299&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
And the second file looks like:&lt;br /&gt;
&lt;br /&gt;
 idinf	X'inf		Vinf&lt;br /&gt;
 protId1	0.12726383076	9.37044627116&lt;br /&gt;
 protId2	0.274864264955	4.64481829573&lt;br /&gt;
 protId3	-0.261668207812	5.73886230845&lt;br /&gt;
 protId4	0.40863124663	4.57552314502&lt;br /&gt;
 protId5	0.36753759182	46.0017027453&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
In the output of Cardenio, data is rearranged in merge_newData.txt:&lt;br /&gt;
&lt;br /&gt;
 idinf			X'inf		Vinf&lt;br /&gt;
 protId1'''_114vs118'''	0.420637682685	20.8442070633&lt;br /&gt;
 protId2'''_114vs118'''	0.0309318252993	10.405882717&lt;br /&gt;
 protId3'''_114vs118'''	0.235807597358	13.6742537388&lt;br /&gt;
 protId4'''_114vs118'''	-0.939886653794	10.3887308051&lt;br /&gt;
 protId5'''_114vs118'''	0.136998763785	111.910628299&lt;br /&gt;
 ...&lt;br /&gt;
 protId1'''_115vs119'''	0.12726383076	9.37044627116&lt;br /&gt;
 protId2'''_115vs119'''	0.274864264955	4.64481829573&lt;br /&gt;
 protId3'''_115vs119'''	-0.261668207812	5.73886230845&lt;br /&gt;
 protId4'''_115vs119'''	0.40863124663	4.57552314502&lt;br /&gt;
 protId5'''_115vs119'''	0.36753759182	46.0017027453&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
Along with the corresponding relations file merge_newRels.txt:&lt;br /&gt;
&lt;br /&gt;
 idsup	idinf&lt;br /&gt;
 protId1	protId1'''_114vs118'''&lt;br /&gt;
 protId2	protId2'''_114vs118'''&lt;br /&gt;
 protId3	protId3'''_114vs118'''&lt;br /&gt;
 protId4	protId4'''_114vs118'''&lt;br /&gt;
 protId5	protId5'''_114vs118'''&lt;br /&gt;
 ...&lt;br /&gt;
 protId1	protId1'''_115vs119'''&lt;br /&gt;
 protId2	protId2'''_115vs119'''&lt;br /&gt;
 protId3	protId3'''_115vs119'''&lt;br /&gt;
 protId4	protId4'''_115vs119'''&lt;br /&gt;
 protId5	protId5'''_115vs119'''&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
As can be seen, the relations file is such that at the lower level (second column) proteins are linked to their experiment (via the suffix, associated with the corresponding data in merge_newData.txt), while at the higher level (first column) proteins are independent of the experiment. This means that, after integrating, we will get only experiment-independent, protein-level information (the information of each protein in each experiment averaged with its corresponding weight).&lt;br /&gt;
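Cardenio's rearrangement can be sketched in a few lines of Python (function name and in-memory data layout are illustrative; the real program reads and writes the tab-separated files shown above):

```python
def cardenio_merge(experiments):
    """Sketch of Cardenio's rearrangement: each experiment's protein rows
    get the experiment tag appended as a suffix in the merged data file,
    and a relations file maps the suffixed (lower-level) identifiers back
    to the plain, experiment-independent protein identifiers."""
    new_data, new_rels = [], []
    for tag, rows in experiments:  # rows are (protein_id, X'inf, Vinf) triples
        for prot, x, v in rows:
            new_data.append((f"{prot}_{tag}", x, v))
            new_rels.append((prot, f"{prot}_{tag}"))
    return new_data, new_rels

# Miniature version of the two experiments merged above.
data, rels = cardenio_merge([
    ("114vs118", [("protId1", 0.42, 20.8)]),
    ("115vs119", [("protId1", 0.13, 9.4)]),
])
```

Integrating with this relations file then averages each protein's per-experiment values into a single experiment-independent value, exactly as described above.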
&lt;br /&gt;
=== Last step for experiment integration ===&lt;br /&gt;
&lt;br /&gt;
The last lines perform the integration of the two experiments taking the data and relations files from Cardenio (merge_newData.txt and merge_newRels.txt), as seen in the previous section.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=37&amp;gt;&lt;br /&gt;
sanxot.exe -amerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -g&lt;br /&gt;
sanxotsieve.exe -aoutsMerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -Vmerged_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -amerged_noOuts -p%workingFolder% -dmerge_newData.txt -routsMerged_tagged.xls -fg -Vmerged_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqaMerged -p%workingFolder% -dmerged_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The approach here is the same as for the integration from spectrum to peptide or from peptide to protein (but in this case we integrate from &amp;quot;experiment-dependent protein data&amp;quot; to &amp;quot;experiment-independent protein data&amp;quot;):&lt;br /&gt;
:* the first line calculates the variance (storing it in the merged_infoFile.txt file)&lt;br /&gt;
:* then SanXoTSieve tags the outliers (generating the outsMerged_tagged.xls file with the same relations as merge_newRels.txt, but including the tag &amp;quot;out&amp;quot; in the third column),&lt;br /&gt;
:* the third line integrates the proteins using the relations file and the variance from the two previous commands;&lt;br /&gt;
:* finally, all the experiment-independent proteins are compared to each other, getting the variance (stored in qaMerged_infoFile.txt) and providing an FDR that can be used to filter changing proteins (in file qaMerged_outStats.xls).&lt;br /&gt;
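Conceptually, the --tags=&amp;quot;!out&amp;quot; option used in the third command keeps only the relations whose third column does not carry the &amp;quot;out&amp;quot; tag; a minimal Python sketch of this idea (not the actual SanXoT implementation):&lt;br /&gt;

```python
# Conceptual sketch of the --tags="!out" filter: keep only relations
# whose third (tag) column does NOT carry the "out" tag.
def filter_relations(rows, excluded_tag="out"):
    kept = []
    for row in rows:
        tags = row[2] if len(row) > 2 else ""
        if excluded_tag not in tags.split(","):
            kept.append(row)
    return kept

tagged = [
    ("protId1", "protId1_114vs118", ""),
    ("protId2", "protId2_114vs118", "out"),
    ("protId3", "protId3_115vs119", ""),
]
kept = filter_relations(tagged)
```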
&lt;br /&gt;
The last step will also be used in the Systems Biology Triangle method (SBT)&amp;lt;ref&amp;gt;García-Marqués, F., Trevisan-Herraz, M., Martínez-Martínez, S., Camafeita, E., Jorge, I., Lopez, J.A., Méndez-Barbero, N., Méndez-Ferrer, S., del Pozo, M.A., Ibáñez, B., Andrés, V., Sánchez-Madrid, F., Redondo, J.M., Bonzon-Kulichenko, E. and Vázquez, J. (2016) '''[http://www.mcponline.org/content/15/5/1740.long A novel systems-biology algorithm for the analysis of coordinated protein responses using quantitative proteomics]'''. ''Molecular &amp;amp; Cellular Proteomics''.&amp;lt;/ref&amp;gt; which we will see now (the &amp;quot;protein-to-all&amp;quot; operation is the &amp;quot;bottom side&amp;quot; of the triangle).&lt;br /&gt;
&lt;br /&gt;
== Test 3: Systems biology ==&lt;br /&gt;
&lt;br /&gt;
In this final test we perform a systems biology analysis of the previously merged experiment using the Systems Biology Triangle method. For that we start with the files generated in Test 2, plus a tab-separated table downloaded directly from the '''[https://en.wikipedia.org/wiki/DAVID DAVID bioinformatics resource]'''.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test3.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 3, SYSTEMS BIOLOGY&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test3&amp;quot;&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM conversion of ontological data from DAVID&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
camacho.exe -adbcats1 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -q1 -c5 -fBioCarta_&lt;br /&gt;
camacho.exe -adbcats2 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats1_rels.xls&amp;quot; -c6 -fGOBP_&lt;br /&gt;
camacho.exe -adbcats3 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats2_rels.xls&amp;quot; -c7 -fGOCC_&lt;br /&gt;
camacho.exe -adbcats4 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats3_rels.xls&amp;quot; -c8 -fGOMF_&lt;br /&gt;
camacho.exe -adbcats5 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats4_rels.xls&amp;quot; -c9 -fKEGG_&lt;br /&gt;
camacho.exe -adbcats -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats5_rels.xls&amp;quot; -c10 -fPIR_&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM prot to cat&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -g&lt;br /&gt;
sanxotsieve.exe -aoutsqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -Vqc_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -aqc_noOuts -p%workingFolder% -dmerged_noOuts_higherLevel.xls -routsqc_tagged.xls -fg -Vqc_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM cat to all&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aca -p%workingFolder% -dqc_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
sanxotsqueezer.exe -alist -p%workingFolder% -lqc_noOuts_outStats.xls -uca_outStats.xls -f0.05 -n3&lt;br /&gt;
sanson.exe -asanson -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -gpdf&lt;br /&gt;
sanxotgauss.exe -agauss -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -g&lt;br /&gt;
coordinometer.exe -acoordination -p%workingFolder% -qqc_noOuts_outStats.xls -cca_outStats.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_3:_Systems_biology|Test 3]]'''.&lt;br /&gt;
&lt;br /&gt;
=== Getting the protein-to-category relations file ===&lt;br /&gt;
&lt;br /&gt;
We start by generating the relations file we are going to use in the systems biology analysis. It is a simple relations file, where the first column contains a string with the identifier of the category, and the second column a string with the identifier of the protein (with the same philosophy as the tab-separated tables used to indicate the PSMs in the spectrum-to-peptide step, and the peptide-to-protein assignations).&lt;br /&gt;
&lt;br /&gt;
Note that, although we are using DAVID to obtain the ontological categories, we can provide these relations using any tool, or even create custom relations, as long as we use the format mentioned. We can also concatenate relations (such as our custom relations) to previous relations files.&lt;br /&gt;
&lt;br /&gt;
As we said, a good way to get this information from databases such as Gene Ontology or KEGG is to use DAVID as an intermediary; in this case, we use the module '''[[Camacho]]''', dedicated to this operation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=17&amp;gt;&lt;br /&gt;
camacho.exe -adbcats1 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -q1 -c5 -fBioCarta_&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here, as in all SanXoT modules, we use the options -a and -p as usual (to define a prefix for the output files of this operation and to set the working directory, respectively). Then we give the filename of the tab-separated file downloaded from DAVID (using -d), we indicate the column number (one-based) containing the protein accession number (option -q; in this case -q1 indicates it is the first column) and the column containing the category we want to parse into the relations file (using -c; in this case we want to parse the data from BioCarta, which are in column 5), and we define a prefix for the category (as these are categories from BioCarta, we use -f to set this to &amp;quot;BioCarta_&amp;quot;, where the underscore acts as a separator to make it more human-readable). The prefix is not mandatory, but it is highly recommended when we import data from different databases, as some categories may have the same name but different protein sets; using a prefix not only avoids overlaps, but also helps to trace which database each category in use comes from.&lt;br /&gt;
&lt;br /&gt;
The file fastaHeadersNoSequences.fasta is not mandatory. In this case we include it because we want to use the full FASTA header as protein identifier (for more human-readable results). Obviously, it must contain the same identifiers as the ones used for DAVID. Typically, the FASTA file used to identify the proteins should be used, but here we have removed the sequences to avoid giving the users an excessively large file in this test.&lt;br /&gt;
&lt;br /&gt;
As a result, we get the file dbcats1_rels.xls. Now we concatenate the relations from the databases for [http://www.geneontology.org Gene Ontology] (BP = biological process, CC = cellular component, and MF = molecular function), [http://www.genome.jp/kegg KEGG] and [https://pir.georgetown.edu PIR]. For each of them we use Camacho, changing only the -c option (to set the column), -f (for the different category prefix) and the file to which they concatenate (using the option -x). Other users might prefer to simply overwrite the original file but, in order to keep track of the different operations, we use a different file for each output.&lt;br /&gt;
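The prefix-and-concatenate behaviour of these Camacho calls can be pictured with a short Python sketch (a hypothetical illustration; the real module reads DAVID's tab-separated table and, optionally, the FASTA headers):&lt;br /&gt;

```python
# Sketch: add a database prefix to each category name and append the
# new category-to-protein relations to an existing list (mimicking the
# role of Camacho's -f and -x options). Hypothetical illustration only.
def add_prefixed_relations(existing, cat_to_prots, prefix):
    merged = list(existing)
    for cat, prots in cat_to_prots.items():
        for prot in prots:
            merged.append((prefix + cat, prot))
    return merged

rels = add_prefixed_relations([], {"apoptosis": ["P1", "P2"]}, "GOBP_")
rels = add_prefixed_relations(rels, {"apoptosis": ["P1"]}, "KEGG_")
```

Note how the prefixes keep the two &amp;quot;apoptosis&amp;quot; categories apart even though they share a name.&lt;br /&gt;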
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=18&amp;gt;&lt;br /&gt;
camacho.exe -adbcats2 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats1_rels.xls&amp;quot; -c6 -fGOBP_&lt;br /&gt;
camacho.exe -adbcats3 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats2_rels.xls&amp;quot; -c7 -fGOCC_&lt;br /&gt;
camacho.exe -adbcats4 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats3_rels.xls&amp;quot; -c8 -fGOMF_&lt;br /&gt;
camacho.exe -adbcats5 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats4_rels.xls&amp;quot; -c9 -fKEGG_&lt;br /&gt;
camacho.exe -adbcats -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats5_rels.xls&amp;quot; -c10 -fPIR_&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We end up with the final file dbcats_rels.xls; this relations file can now be used for as many Systems Biology Triangle analyses as needed, without having to run Camacho again (unless we want to add databases or an updated ontology from DAVID).&lt;br /&gt;
&lt;br /&gt;
=== From proteins to categories ===&lt;br /&gt;
&lt;br /&gt;
We start using the relations file just generated, dbcats_rels.xls, to integrate proteins into categories. Every protein has a different weight, depending on the quality of the scans, the number of peptides detected and quantified for it, and perhaps even the different experiments where it has been found. As a result, the information from each protein has a different weight in the category where it has been classified (in the same way as happened at the previous levels). Getting the category-level information (rather than a lengthy list of quantified proteins) will help us get the big picture of what is going on in the proteome. Additionally, using the Systems Biology Triangle, we integrate the information of all the quantified proteins, not just a list of proteins whose quantification exceeds a defined threshold (as in Over-Representation Algorithms).&lt;br /&gt;
&lt;br /&gt;
We integrate in the usual manner, first getting the variance of the proteins, then tagging outliers (proteins that behave differently than other proteins classified in the same group) and finally performing the integration with the formerly calculated variance and excluding the outlier proteins:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=28&amp;gt;&lt;br /&gt;
sanxot.exe -aqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -g&lt;br /&gt;
sanxotsieve.exe -aoutsqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -Vqc_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -aqc_noOuts -p%workingFolder% -dmerged_noOuts_higherLevel.xls -routsqc_tagged.xls -fg -Vqc_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This operation is the &amp;quot;top-left&amp;quot; side of the Systems Biology Triangle.&lt;br /&gt;
&lt;br /&gt;
The last integration, where categories are compared to each other, is the &amp;quot;top-right&amp;quot; side of the triangle:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=34&amp;gt;&lt;br /&gt;
sanxot.exe -aca -p%workingFolder% -dqc_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As we did in line 38 of Test 1, and line 41 of Test 2, this is a confluence, meaning that we don't need a relations file, because the whole set of categories is going to be considered the same group.&lt;br /&gt;
&lt;br /&gt;
=== Filtering the most relevant categories using SanXoTSqueezer ===&lt;br /&gt;
&lt;br /&gt;
Now we have the information needed to get some details about the analysis. Sometimes we have too many categories, and the analysis becomes difficult. We could make &amp;quot;supercategories&amp;quot; using a relations file between category and supercategory, but in this case we will simply filter the categories with '''[[SanXoTSqueezer]]''', keeping those with a minimum of 3 quantified proteins (using the option -n) and a maximum FDR of 5% (using the option -f). We also need the two outStats files, for the protein-to-category (&amp;quot;qc&amp;quot;, using option -l) and the category-to-all (&amp;quot;ca&amp;quot;, using option -u) operations, taken from previous steps (qc_noOuts_outStats.xls and ca_outStats.xls, respectively):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=37&amp;gt;&lt;br /&gt;
sanxotsqueezer.exe -alist -p%workingFolder% -lqc_noOuts_outStats.xls -uca_outStats.xls -f0.05 -n3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output is a list of categories, with some information associated, in file list_outList.xls; we will analyse the categories in this list.&lt;br /&gt;
&lt;br /&gt;
=== Clustering together similar categories using Sanson ===&lt;br /&gt;
&lt;br /&gt;
The point of integrating proteins into categories is managing less, but more meaningful, information. However, databases can contain so many categories that we may still end up with too many to interpret the results biologically.&lt;br /&gt;
&lt;br /&gt;
In some cases we have many duplicated categories that contain the same proteins under different names, especially when we combine different databases. This is easy to detect, but most commonly categories have ''similar'' but not ''identical'' sets of proteins, leading to a more subtle problem. The module '''[[Sanson]]''' helps detect similar categories, generating clusters of similar sets. A &amp;quot;similarity&amp;quot; threshold can be set (where the similarity is given by an FNumber, depending on how many proteins are shared by two given categories), but the algorithm uses a matrix of protein-to-category items (the similarity matrix), and maximises the number of categories in each cluster (the CNumber) and the number of clusters, partitioning them with a [https://en.wikipedia.org/wiki/Durfee_square Durfee square]-based calculation:&lt;br /&gt;
&lt;br /&gt;
 With this similarity-threshold, are there at least 3 clusters with at least 3 categories? --&amp;gt; true&lt;br /&gt;
 With this similarity-threshold, are there at least 4 clusters with at least 4 categories? --&amp;gt; true&lt;br /&gt;
 With this similarity-threshold, are there at least 5 clusters with at least 5 categories? --&amp;gt; false&lt;br /&gt;
 With this similarity-threshold, are there at least 6 clusters with at least 6 categories? --&amp;gt; false&lt;br /&gt;
&lt;br /&gt;
In this example, it could happen that there are (for example) 5 clusters with 4 categories, or 3 clusters with 6 categories, but not 5 clusters with 5 categories, so the best CNumber in this case would be 4 (note that this is the same logic used to calculate the h-index metric).&lt;br /&gt;
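The CNumber selection just described can be sketched with the same logic as the h-index (a minimal Python illustration, assuming the cluster sizes for a given similarity threshold are already known):&lt;br /&gt;

```python
# Sketch of the h-index-like CNumber: the largest n such that there
# are at least n clusters containing at least n categories each.
def cnumber(cluster_sizes):
    sizes = sorted(cluster_sizes, reverse=True)
    n = 0
    for i, size in enumerate(sizes, start=1):
        if size >= i:
            n = i
        else:
            break
    return n

# e.g. five clusters with 6, 5, 4, 4 and 2 categories give CNumber 4
best = cnumber([6, 5, 4, 4, 2])
```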
&lt;br /&gt;
In this test, Sanson is executed with this command line:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=38&amp;gt;&lt;br /&gt;
sanson.exe -asanson -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -gpdf&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here we already know the relations file dbcats_rels.xls (option -r); the list_outList.xls file (option -c), generated in the previous step, provides the categories that will be considered; and the outStats file qaMerged_outStats.xls (option -z), generated in the last command of Test 2, provides the protein-level, category-independent quantifications, supplying the protein information for each category.&lt;br /&gt;
&lt;br /&gt;
The output of Sanson is a .gv file (here sanson_simGraph.gv), where the category clusters are given in the [http://en.wikipedia.org/wiki/DOT_language DOT language]; it can be interpreted by open-source programs like [http://graphviz.org/ Graphviz] to generate an image file. If Graphviz is installed, Sanson can launch the interpretation itself (you only need to specify the location of Graphviz's dot.exe in the dot.ini file). In this test we have generated as output the PDF file sanson_simGraph.pdf (using the -g option), but it can also be interesting to generate an [https://en.wikipedia.org/wiki/Scalable_Vector_Graphics SVG] file, which (when opened in a suitable browser) shows the list of protein identifiers present in each category just by hovering the mouse over the category box.&lt;br /&gt;
&lt;br /&gt;
As you will see, each category box has a number of customisable colour columns. Each column represents a protein present in the category, and its colour depends on the standardised log2-ratio of that protein (Z, the standardised version of X = log2-ratio, meaning that X has been centred and corrected according to the variance). By default, white means no change, green means there is more quantity in sample B than in sample A (as log2(A/B) &amp;lt; 0) and, conversely, red means there is more quantity in sample A than in sample B. The numbers on the arrows indicate the total number of proteins in a category that are also present in the category the arrow points to. The absence of an arrow does not mean two categories share no proteins; it just means that the ratio of shared proteins does not reach the calculated threshold explained above.&lt;br /&gt;
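The colouring logic can be pictured as follows (a minimal Python sketch of the standardisation and colour mapping; the threshold value is our assumption, not necessarily the one Sanson uses):&lt;br /&gt;

```python
import math

# Sketch: standardise log2-ratios (Z = (X - mean) / sd) and map their
# sign to the colours used in the category boxes. Hypothetical
# illustration of the colouring logic, not Sanson's actual code.
def standardise(xs):
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var) for x in xs]

def colour(z, threshold=1.0):  # threshold is an assumed value
    if z <= -threshold:
        return "green"   # more quantity in sample B (log2(A/B) < 0)
    if z >= threshold:
        return "red"     # more quantity in sample A
    return "white"       # no significant change

zs = standardise([-2.0, 0.0, 2.0])
```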
&lt;br /&gt;
=== Generating gaussians using SanXoTGauss ===&lt;br /&gt;
[[File:Gauss outGraph.png|thumb|300px|left|Output of SanXoTGauss in Test 3, line 39, showing the theoretical sigmoid for the standard normal distribution (red), the experimental sigmoid for all the proteins in the experiment (blue), and the sigmoids of proteins grouped by ontology (selecting with SanXoTSqueezer only categories that are changing expression with FDR &amp;lt;= 5% and have at least 3 proteins).]]&lt;br /&gt;
Representing sigmoids (here, cumulative Gaussians) of the different categories, and comparing them to a theoretical normal distribution (thick red line) and to the bulk of proteins (blue line), can help us gain better insight into the meaning of a change. A coordinated proteome is one whose proteins belonging to the same categories change together, in a similar way. We can observe this behaviour by representing the proteins classified in the same category as a separate sigmoid. Since we usually represent several categories at once, we draw each with a different colour to identify them more easily. Categories with a coordinated change will be &amp;quot;parallel&amp;quot; to the normal distribution but displaced (according to the category-level fold change), while categories behaving in an uncoordinated way will tend to cross the normal distribution (as they contain proteins with both increasing and decreasing expression changes).&lt;br /&gt;
&lt;br /&gt;
This operation is done by '''[[SanXoTGauss]]''', using a command like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=39&amp;gt;&lt;br /&gt;
sanxotgauss.exe -agauss -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -g&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The options are the same as in the previous Sanson command, except -g (which here asks the program not to display the graph, just to save it, which is important to keep the workflow going).&lt;br /&gt;
&lt;br /&gt;
=== Calculating the degree of coordination using the Coordinometer ===&lt;br /&gt;
&lt;br /&gt;
We have defined a parameter called the '''''degree of coordination''''', which tells us objectively how coordinated a proteome is. This parameter takes into account the number of outlier proteins (proteins expressed differently from other proteins classified within the same category) and how much the categories are changing. The '''[[Coordinometer]]''' calculates this number using this command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=40&amp;gt;&lt;br /&gt;
coordinometer.exe -acoordination -p%workingFolder% -qqc_noOuts_outStats.xls -cca_outStats.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Coordinometer extracts the number of relations tagged as outliers (using the &amp;quot;out&amp;quot; tag in the third column) from file qc_noOuts_outStats.xls (using option -q, for proteins expressed differently from other proteins in the same categories, a calculation from the &amp;quot;top-left&amp;quot; side of the triangle) and gets the number of changing categories from file ca_outStats.xls (using option -c, reading the FDR column directly). The FDR threshold is set by default to 5%, but a different one can be set using the --cafdr option.&lt;br /&gt;
&lt;br /&gt;
The result is both printed in the console and written to the corresponding log file, coordination_logFile.txt, along with the information used to calculate it:&lt;br /&gt;
&lt;br /&gt;
 Total number of changing categories: 66&lt;br /&gt;
 Total number of relations pointing to changing categories: 106&lt;br /&gt;
 Total number of outlier relations pointing to changing categories: 2&lt;br /&gt;
 Total number of outlier relations pointing to non-changing categories: 46&lt;br /&gt;
 Degree of coordination: 0.684211&lt;br /&gt;
&lt;br /&gt;
Where 0.684211 is the result of:&lt;br /&gt;
 (numRelsChangingCats - numOutliersChangingCats) / (numRelsChangingCats + numOutliersNonChangingCats) =&lt;br /&gt;
 (106 - 2) / (106 + 46) = 0.684211&lt;br /&gt;
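The formula above is straightforward to reproduce (a minimal Python sketch of the calculation; the function name is our own):&lt;br /&gt;

```python
# Degree of coordination, reproduced from the formula above:
# (numRelsChangingCats - numOutliersChangingCats)
#   / (numRelsChangingCats + numOutliersNonChangingCats)
def degree_of_coordination(rels_changing, outs_changing, outs_non_changing):
    return (rels_changing - outs_changing) / float(rels_changing + outs_non_changing)

# values from the coordination_logFile.txt example above
doc = degree_of_coordination(106, 2, 46)
```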
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:SanXoT software package]]&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=Exploring_SanXoT_features&amp;diff=627</id>
		<title>Exploring SanXoT features</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=Exploring_SanXoT_features&amp;diff=627"/>
				<updated>2018-08-14T15:06:57Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Here we will look at the details of three common workflows that use the SanXoT software package.&lt;br /&gt;
&lt;br /&gt;
For tests check '''[[Unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
For a general overview on how to use SanXoT and how it works, see '''[[How to use SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== Test 1: The fundamental workflow ==&lt;br /&gt;
&lt;br /&gt;
We describe here the '''[[fundamental workflow]]''', a workflow similar to the statistics performed by '''[[QuiXoT]]''' using the WSPP (''Weighted Spectrum, Peptide and Protein'') statistical model&amp;lt;ref&amp;gt;Navarro, P., Trevisan-Herraz, M., Bonzon-Kulichenko, E., Núñez, E., Martínez-Acedo, P., Pérez-Hernández, D., Jorge, I., Mesa, R., Calvo, E., Carrascal, M., Hernáez, M.L., García, F., Bárcena, J.A., Ashman, K., Abian, J., Gil, C., Redondo, J.M. and Vázquez, J. (2014) '''[https://pubs.acs.org/doi/abs/10.1021/pr4006958 General statistical framework for quantitative proteomics by stable isotope labeling]'''. ''Journal of proteome research'', 13, 1234-1247.&amp;lt;/ref&amp;gt;. The following commands are also available as a [https://en.wikipedia.org/wiki/Batch_file batch file] in '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test1.zip SanXoT_test1.zip]''':&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test1.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 1, FUNDAMENTAL WORKFLOW (SPECTRUM, PEPTIDE, PROTEIN)&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM create the relation files&lt;br /&gt;
aljamia.exe -asprels -p%workingFolder% -xstartingFile.xls -R24 -i&amp;quot;[Sequence]&amp;quot; -j&amp;quot;[RAWFileName]-[FirstScan]-[Charge]&amp;quot;&lt;br /&gt;
aljamia.exe -apqrels -p%workingFolder% -xstartingFile.xls -R24 -i&amp;quot;[FASTAProteinDescription]&amp;quot; -j&amp;quot;[Sequence]&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM create the data file at the scan level&lt;br /&gt;
aljamia.exe -asdata114vs118 -p%workingFolder% -xstartingFile.xls -R24 -i&amp;quot;[RAWFileName]-[FirstScan]-[Charge]&amp;quot; -j&amp;quot;math.log([q_reporterIon_114]/[q_reporterIon_118], 2)&amp;quot; -k&amp;quot;max([q_reporterIon_114], [q_reporterIon_118])&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
aljamia.exe -asdata115vs119 -p%workingFolder% -xstartingFile.xls -R24 -i&amp;quot;[RAWFileName]-[FirstScan]-[Charge]&amp;quot; -j&amp;quot;math.log([q_reporterIon_115]/[q_reporterIon_119], 2)&amp;quot; -k&amp;quot;max([q_reporterIon_115], [q_reporterIon_119])&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM The following tab-separated text files (despite their extensions) are generated:&lt;br /&gt;
REM sprels_table.xls, with two columns, having scan-to-peptide identifiers (rawfile-scanNumber-charge for scans, sequence for peptides)&lt;br /&gt;
REM pqrels_table.xls, with two columns, having peptide-to-protein identifiers (text of their fasta header as protein identifier)&lt;br /&gt;
REM sdata_table.xls, with three columns having {scan identifier, log2(foldchange), weight}&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart114vs118 -p%workingFolder% -dsdata114vs118_table.xls -rsprels_table.xls -g&lt;br /&gt;
sanxot.exe -asp114vs118 -p%workingFolder% -dstart114vs118_calibrated.xls -rsprels_table.xls -g&lt;br /&gt;
sanxotsieve.exe -aspouts114vs118 -p%workingFolder% -dstart114vs118_calibrated.xls -rsprels_table.xls -Vsp114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp114vs118_noOuts -p%workingFolder% -dstart114vs118_calibrated.xls -rspouts114vs118_tagged.xls -Vsp114vs118_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein&lt;br /&gt;
sanxot.exe -apq114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.xls -rpqrels_table.xls -g&lt;br /&gt;
sanxotsieve.exe -apqouts114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.xls -rpqrels_table.xls -Vpq114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq114vs118_noOuts -p%workingFolder% -dsp114vs118_noOuts_higherLevel.xls -rpqouts114vs118_tagged.xls -fg -Vpq114vs118_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa114vs118 -p%workingFolder% -dpq114vs118_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_1:_The_fundamental_workflow|Test 1]]'''.&lt;br /&gt;
&lt;br /&gt;
=== Modifications needed ===&lt;br /&gt;
&lt;br /&gt;
The only lines that need to be modified by the user are these (to match the user's file locations):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=6&amp;gt;&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the first operations, we will parse the data from ''startingFile.xls'' using '''[[Aljamia]]''' to generate the relations files. In this case, the input file is a tab-separated text file (despite its extension) quantified by '''[[QuiXoT]]''', but any tab-separated text file that arranges information in columns with headers can be used as input (for example, [https://en.wikipedia.org/wiki/SQLite SQLite] query results from Proteome Discoverer's MSF files).&lt;br /&gt;
&lt;br /&gt;
=== Parsing the data ===&lt;br /&gt;
&lt;br /&gt;
Let us check the first [[Aljamia]] operation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=14&amp;gt;&lt;br /&gt;
aljamia.exe -asprels -p%workingFolder% -xstartingFile.xls -R24 -i&amp;quot;[Sequence]&amp;quot; -j&amp;quot;[RAWFileName]-[FirstScan]-[Charge]&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a common workflow, the SanXoT software package can generate many output files. To make them easier to track, by default all the files generated in a single step share the same prefix, while each type of file has a common suffix. The prefix is set by the ''-a'' argument (it is good practice to ''always'' include this argument). For example, in this case two files are generated: ''sprels_table.xls'' (the relations table matching '''s'''cans and '''p'''eptides) and ''sprels_log.txt'' (a log file). Unless overridden (check '''[[Aljamia]]''''s help for more information), the filename of the generated relations table will always end in '''_table.xls'''.&lt;br /&gt;
&lt;br /&gt;
The option -p sets the working folder: all output and input files are assumed to be in that folder, unless a full path is specified. In this file, the headers appear at line 24, which is specified with the option -R24. The generated table will include only two columns: the first (-i) just copies the information in the &amp;quot;Sequence&amp;quot; column (using it as the identifier of the higher level), and the second (-j) merges the data from the &amp;quot;RAWFileName&amp;quot;, &amp;quot;FirstScan&amp;quot; and &amp;quot;Charge&amp;quot; columns (an example of a unique identifier for scans, the lower level).&lt;br /&gt;
&lt;br /&gt;
The operation to generate the file for the relations between proteins and peptides is similar:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=15&amp;gt;&lt;br /&gt;
aljamia.exe -apqrels -p%workingFolder% -xstartingFile.xls -R24 -i&amp;quot;[FASTAProteinDescription]&amp;quot; -j&amp;quot;[Sequence]&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notice that the prefix here is ''pqrels'' (some researchers use '''q''' for proteins, to differentiate them from '''p'''eptides); now the peptide identifier (&amp;quot;Sequence&amp;quot;) is in the second column (because it is now the lower level), while proteins are in the first column (as they are now the higher level). We use the FASTA header as protein identifier (rather than just the accession number), as it is much more human-readable, facilitating the biological interpretation later.&lt;br /&gt;
&lt;br /&gt;
The next two lines parse information from the same input file, ''startingFile.xls'', to generate two data files, one for the 114 vs 118 iTRAQ comparison and the other for 115 vs 119 (as you can see in the -a option for the prefix):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=18&amp;gt;&lt;br /&gt;
aljamia.exe -asdata114vs118 -p%workingFolder% -xstartingFile.xls -R24 -i&amp;quot;[RAWFileName]-[FirstScan]-[Charge]&amp;quot; -j&amp;quot;math.log([q_reporterIon_114]/[q_reporterIon_118], 2)&amp;quot; -k&amp;quot;max([q_reporterIon_114], [q_reporterIon_118])&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
aljamia.exe -asdata115vs119 -p%workingFolder% -xstartingFile.xls -R24 -i&amp;quot;[RAWFileName]-[FirstScan]-[Charge]&amp;quot; -j&amp;quot;math.log([q_reporterIon_115]/[q_reporterIon_119], 2)&amp;quot; -k&amp;quot;max([q_reporterIon_115], [q_reporterIon_119])&amp;quot; -A&amp;quot;jk&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Data files in SanXoT include a first column (-i) with a unique text identifier (in this case, the scan identifier mentioned above), a second column (-j) with the log2(A/B) ratio (coded as an enclosed Python-style expression), and a third column (-k) with the weight of the scan (in iTRAQ we simply take the intensity of the most intense peak of the pair). The option -A&amp;quot;jk&amp;quot; indicates that mathematical operations are to be performed on the second and third columns (but not the first, which is treated as text).&lt;br /&gt;
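The two computed columns can be reproduced directly in Python, since the -j and -k arguments above are Python-style expressions. The function name `data_row` is invented for this sketch; the formulas are exactly those passed to Aljamia in the example.

```python
# Sketch of the two computed columns in the Aljamia data files:
# the -j expression log2(A/B) and the -k weight max(A, B).
import math

def data_row(scan_id, intensity_a, intensity_b):
    """Return (identifier, log2 ratio, weight), as in columns -i, -j, -k."""
    ratio = math.log(intensity_a / intensity_b, 2)   # -j"math.log([A]/[B], 2)"
    weight = max(intensity_a, intensity_b)           # -k"max([A], [B])"
    return scan_id, ratio, weight

print(data_row("run01-1234-2", 2000.0, 1000.0))  # ratio 1.0, weight 2000.0
```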
&lt;br /&gt;
=== Calibrating the spectra ===&lt;br /&gt;
&lt;br /&gt;
The first step to integrate the data is calibrating the weights assigned to each spectrum. This task is performed by the module [[Klibrate]]:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=26&amp;gt;&lt;br /&gt;
klibrate.exe -astart114vs118 -p%workingFolder% -dsdata114vs118_table.xls -rsprels_table.xls -g&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As in all SanXoT programs, a prefix is assigned to the operation (-a) and the default working folder is specified (-p). Additionally, -d and -r set the file names of the data and relations files generated by Aljamia (sdata114vs118_table.xls and sprels_table.xls, respectively). The -g option tells the program not to stop to display the sigmoid graph comparing the data with the standard normal distribution.&lt;br /&gt;
&lt;br /&gt;
=== Integration from spectrum to peptide ===&lt;br /&gt;
&lt;br /&gt;
We are ready for the first integration, which is done by [[SanXoT]] with the help of [[SanXoTSieve]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=27&amp;gt;&lt;br /&gt;
sanxot.exe -asp114vs118 -p%workingFolder% -dstart114vs118_calibrated.xls -rsprels_table.xls -g&lt;br /&gt;
sanxotsieve.exe -aspouts114vs118 -p%workingFolder% -dstart114vs118_calibrated.xls -rsprels_table.xls -Vsp114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp114vs118_noOuts -p%workingFolder% -dstart114vs118_calibrated.xls -rspouts114vs118_tagged.xls -Vsp114vs118_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(We have already described the meaning of the -a, -p, -d, -r and -g options.)&lt;br /&gt;
&lt;br /&gt;
In the first of these three commands, SanXoT integrates the lower-level data (in this case, spectra) into a higher level (here, peptides). When doing so, SanXoT calculates the variance of the lower level with respect to the higher level, i.e., how the fold-changes of the different spectra are distributed when assigned to different peptides. The result is given both in the console and in the [prefix]_infoFile.txt file (sp114vs118_infoFile.txt).&lt;br /&gt;
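At its simplest, integrating a level means computing a weighted average of the lower-level values belonging to each higher-level element. The sketch below only illustrates that idea; the actual statistics used by SanXoT follow the WSPP model, which is considerably more elaborate than this.

```python
# Much-simplified sketch of "integration": the higher-level value is a
# weighted average of the lower-level values, each contributing according
# to its weight. This is an illustration of the concept only, not the
# WSPP model implemented in SanXoT.
def integrate(values_and_weights):
    """Weighted mean of (value, weight) pairs."""
    total_weight = sum(w for _, w in values_and_weights)
    return sum(x * w for x, w in values_and_weights) / total_weight

spectra = [(0.9, 3.0), (1.1, 1.0)]   # two spectra assigned to one peptide
print(integrate(spectra))
```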
&lt;br /&gt;
Once the variance is known, SanXoTSieve tags the outlier spectra of each peptide (if any), adding the tag &amp;quot;out&amp;quot; in a third column of the relations file and saving it as a new relations file (to avoid modifying the original). To obtain the variance (essential to define which spectra are outliers), SanXoTSieve takes as input the sp114vs118_infoFile.txt file (option -V), reading it from the line &amp;quot;Variance = ...&amp;quot;; alternatively, the variance can be set directly with the option -v (lowercase).&lt;br /&gt;
&lt;br /&gt;
For SanXoTSieve, -f defines the FDR below which a spectrum is considered an outlier. This means that spectra with FDR &amp;lt; 0.01 have a probability &amp;lt; 1% of deviating by chance alone from the other spectra in which the same peptide was identified (i.e., they are considered outliers).&lt;br /&gt;
&lt;br /&gt;
Once SanXoTSieve has produced the relations file with tagged outliers, SanXoT finally integrates the data; in this case, instead of calculating the variance, it uses the variance calculated when the outliers were still included (option -V to provide the file, or -v to provide it as a number). We say in this case that ''the variance is forced'', which is indicated with the -f parameter (note that -f has a different meaning here than in SanXoTSieve).&lt;br /&gt;
&lt;br /&gt;
The option --tags=&amp;quot;!out&amp;quot; means that relations tagged as &amp;quot;out&amp;quot; in the relations file (by SanXoTSieve) will be excluded (the same effect as if they had been removed). More tags can be combined (see the specific help for more information).&lt;br /&gt;
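The exclusion effect of --tags=&amp;quot;!out&amp;quot; can be sketched as a simple row filter over the tagged relations file. This is an illustration only; the real tag syntax supports combinations of tags that this sketch ignores, and `filter_relations` is an invented helper name.

```python
# Sketch of the effect of --tags="!out": relations whose third column
# carries the tag "out" are skipped during the final integration, as if
# they had been removed from the relations file.
def filter_relations(rows, excluded_tag="out"):
    """Keep (idsup, idinf, tag) rows whose tag differs from excluded_tag."""
    return [r for r in rows if (len(r) < 3 or r[2] != excluded_tag)]

rows = [("PEPTIDEK", "run01-1234-2", ""),
        ("PEPTIDEK", "run01-1300-2", "out")]   # tagged outlier spectrum
print(filter_relations(rows))
```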
&lt;br /&gt;
=== Integration from peptide to protein ===&lt;br /&gt;
&lt;br /&gt;
The integration of peptides into proteins is very similar to the previous one:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=32&amp;gt;&lt;br /&gt;
sanxot.exe -apq114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.xls -rpqrels_table.xls -g&lt;br /&gt;
sanxotsieve.exe -apqouts114vs118 -p%workingFolder% -dsp114vs118_noOuts_higherLevel.xls -rpqrels_table.xls -Vpq114vs118_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq114vs118_noOuts -p%workingFolder% -dsp114vs118_noOuts_higherLevel.xls -rpqouts114vs118_tagged.xls -fg -Vpq114vs118_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The only differences are: 1) instead of taking as input the output file of Klibrate (which contained the calibrated spectrum information), it takes the output of the previous integration (sp114vs118_noOuts_higherLevel.xls); and 2) the relations file (pqrels_table.xls) contains the protein identifiers in the first column (in our case the FASTA header, but the accession number or any other unique text string will do) and the peptide identifiers (the amino acid sequence) in the second column.&lt;br /&gt;
&lt;br /&gt;
The result will be a data file containing protein identifiers, fold-changes at the protein level and protein weights, with no further reference to the peptides or spectra.&lt;br /&gt;
&lt;br /&gt;
=== Getting the quantification of proteins ===&lt;br /&gt;
[[File:Qa114vs118 outGraph.png|thumb|300px|Output file of SanXoT (''qa114vs118_outGraph.png'') in Test 1, line 38, showing the sigmoid with protein expression changes (blue) compared to the theoretical sigmoid of the standard normal distribution.]]&lt;br /&gt;
&lt;br /&gt;
In the last step we compare the set of proteins as a whole. As you can see, the output data file from the previous integration is used as input (pq114vs118_noOuts_higherLevel.xls); here there is no relations file, as in this case we are not integrating proteins into different groups (we will see this in Test 2 for merging experiments, and Test 3 for systems biology).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=38&amp;gt;&lt;br /&gt;
sanxot.exe -aqa114vs118 -p%workingFolder% -dpq114vs118_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Instead of providing a relations file, we include the option -C (for ''confluence''), which has the same effect as a relations file whose first column is a dummy filled with &amp;quot;1&amp;quot; in every row and whose second column contains each protein identifier.&lt;br /&gt;
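The dummy relations table that -C stands in for is easy to picture in code. The helper name `confluence_relations` is invented for this sketch; the structure (a constant &amp;quot;1&amp;quot; paired with each identifier, so everything integrates into a single group) is exactly what the text describes.

```python
# Sketch of the dummy relations table implied by -C ("confluence"):
# a first column filled with "1" for every row, and a second column
# with each protein identifier, so the whole set forms one group.
def confluence_relations(identifiers):
    return [("1", ident) for ident in identifiers]

print(confluence_relations(["protId1", "protId2", "protId3"]))
```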
&lt;br /&gt;
The output is a higher-level file (qa114vs118_higherLevel.xls, where '''q''' means &amp;quot;protein&amp;quot; and '''a''' means &amp;quot;all&amp;quot;) with a single row whose fold-change is the weighted arithmetic mean of all proteins (i.e., the grand mean of the whole set). You can see the protein changes in the file qa114vs118_outStats.xls, where the FDR of changing proteins is presented alongside their fold-change (Xinf), normalised fold-change (Z) and the grand mean (Xsup).&lt;br /&gt;
&lt;br /&gt;
== Test 2: Experiment merging ==&lt;br /&gt;
&lt;br /&gt;
Here we perform a very simple experiment merging, combining two experiments with comparable information (i.e. if the first experiment was &amp;quot;treated vs control&amp;quot;, the second experiment can be a biological replicate, such as another &amp;quot;treated vs control&amp;quot; with a different treated patient and a different control).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test2.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 2, EXPERIMENT MERGING&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test2&amp;quot;&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM now, same for 115vs119&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart115vs119 -p%workingFolder% -dsdata115vs119_table.xls -rsprels_table.xls -g&lt;br /&gt;
sanxot.exe -asp115vs119 -p%workingFolder% -dstart115vs119_calibrated.xls -rsprels_table.xls -g&lt;br /&gt;
sanxotsieve.exe -aspouts115vs119 -p%workingFolder% -dstart115vs119_calibrated.xls -rsprels_table.xls -Vsp115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp115vs119_noOuts -p%workingFolder% -dstart115vs119_calibrated.xls -rspouts115vs119_tagged.xls -Vsp115vs119_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein for 115vs119&lt;br /&gt;
sanxot.exe -apq115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqrels_table.xls -g&lt;br /&gt;
sanxotsieve.exe -apqouts115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqrels_table.xls -Vpq115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq115vs119_noOuts -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqouts115vs119_tagged.xls -fg -Vpq115vs119_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes for 115vs119&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa115vs119 -p%workingFolder% -dpq115vs119_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM experiment merging: 115vs119 + 114vs118&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
cardenio.exe -amerge -p%workingFolder% -ttagFile.txt&lt;br /&gt;
sanxot.exe -amerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -g&lt;br /&gt;
sanxotsieve.exe -aoutsMerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -Vmerged_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -amerged_noOuts -p%workingFolder% -dmerge_newData.txt -routsMerged_tagged.xls -fg -Vmerged_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqaMerged -p%workingFolder% -dmerged_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_2:_Experiment_merging|Test 2]]'''.&lt;br /&gt;
&lt;br /&gt;
=== A parallel analysis ===&lt;br /&gt;
&lt;br /&gt;
In this example we merge just two experiments for simplicity, but it is possible to merge even hundreds of them at once.&lt;br /&gt;
&lt;br /&gt;
Command lines 6-8 have already been discussed; they must be adapted to the paths on the user's computer. Lines 14-29 do the same as discussed in the previous section, but for the 115 vs 119 comparison:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=14&amp;gt;&lt;br /&gt;
REM now, same for 115vs119&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
klibrate.exe -astart115vs119 -p%workingFolder% -dsdata115vs119_table.xls -rsprels_table.xls -g&lt;br /&gt;
sanxot.exe -asp115vs119 -p%workingFolder% -dstart115vs119_calibrated.xls -rsprels_table.xls -g&lt;br /&gt;
sanxotsieve.exe -aspouts115vs119 -p%workingFolder% -dstart115vs119_calibrated.xls -rsprels_table.xls -Vsp115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -asp115vs119_noOuts -p%workingFolder% -dstart115vs119_calibrated.xls -rspouts115vs119_tagged.xls -Vsp115vs119_infoFile.txt -fg --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM peptide to protein for 115vs119&lt;br /&gt;
sanxot.exe -apq115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqrels_table.xls -g&lt;br /&gt;
sanxotsieve.exe -apqouts115vs119 -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqrels_table.xls -Vpq115vs119_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -apq115vs119_noOuts -p%workingFolder% -dsp115vs119_noOuts_higherLevel.xls -rpqouts115vs119_tagged.xls -fg -Vpq115vs119_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM see protein changes for 115vs119&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqa115vs119 -p%workingFolder% -dpq115vs119_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Generating the files for the experiment integration ===&lt;br /&gt;
&lt;br /&gt;
The module for merging experiments, '''[[Cardenio]]''', actually does something very simple: it takes the protein identifiers, concatenates to each one a suffix identifying its experiment (specified in an external file called the tagFile) and generates both the data and relations files, ready for SanXoT to integrate in the usual way:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=36&amp;gt;&lt;br /&gt;
cardenio.exe -amerge -p%workingFolder% -ttagFile.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We have already discussed the -a and -p options across the SanXoT software package. Cardenio also needs -t, which specifies a tab-separated file with this structure:&lt;br /&gt;
&lt;br /&gt;
 tag		file&lt;br /&gt;
 114vs118	qa114vs118_lowerNormV.xls&lt;br /&gt;
 115vs119	qa115vs119_lowerNormV.xls&lt;br /&gt;
&lt;br /&gt;
The column &amp;quot;tag&amp;quot; contains the suffix that will be appended to each protein (or other) identifier, while the column &amp;quot;file&amp;quot; contains the filenames of the data files. These files are similar to, but distinct from, the _higherLevel files: they are their centred versions. For example, the one with prefix qa114vs118 was generated in the last step of the 114vs118 integration, and the suffix _lowerNormV tells us it is a data file where the fold-changes have been centred by subtracting the grand mean, X'inf, while the weights were kept unchanged; files with the suffix _lowerNorm'''W''' have identical first and second columns, but the third column, for the weights, has been modified to include the general variance of the protein level, as described in the WSPP statistical model.&lt;br /&gt;
&lt;br /&gt;
These identifiers are rearranged into the relations and data files as in the following example:&lt;br /&gt;
&lt;br /&gt;
In the original _lowerNormV format, the first file looks like:&lt;br /&gt;
&lt;br /&gt;
 idinf	X'inf		Vinf&lt;br /&gt;
 protId1	0.420637682685	20.8442070633&lt;br /&gt;
 protId2	0.0309318252993	10.405882717&lt;br /&gt;
 protId3	0.235807597358	13.6742537388&lt;br /&gt;
 protId4	-0.939886653794	10.3887308051&lt;br /&gt;
 protId5	0.136998763785	111.910628299&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
And the second file looks like:&lt;br /&gt;
&lt;br /&gt;
 idinf	X'inf		Vinf&lt;br /&gt;
 protId1	0.12726383076	9.37044627116&lt;br /&gt;
 protId2	0.274864264955	4.64481829573&lt;br /&gt;
 protId3	-0.261668207812	5.73886230845&lt;br /&gt;
 protId4	0.40863124663	4.57552314502&lt;br /&gt;
 protId5	0.36753759182	46.0017027453&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
In the output of Cardenio, data is rearranged in merge_newData.txt:&lt;br /&gt;
&lt;br /&gt;
 idinf			X'inf		Vinf&lt;br /&gt;
 protId1'''_114vs118'''	0.420637682685	20.8442070633&lt;br /&gt;
 protId2'''_114vs118'''	0.0309318252993	10.405882717&lt;br /&gt;
 protId3'''_114vs118'''	0.235807597358	13.6742537388&lt;br /&gt;
 protId4'''_114vs118'''	-0.939886653794	10.3887308051&lt;br /&gt;
 protId5'''_114vs118'''	0.136998763785	111.910628299&lt;br /&gt;
 ...&lt;br /&gt;
 protId1'''_115vs119'''	0.12726383076	9.37044627116&lt;br /&gt;
 protId2'''_115vs119'''	0.274864264955	4.64481829573&lt;br /&gt;
 protId3'''_115vs119'''	-0.261668207812	5.73886230845&lt;br /&gt;
 protId4'''_115vs119'''	0.40863124663	4.57552314502&lt;br /&gt;
 protId5'''_115vs119'''	0.36753759182	46.0017027453&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
Along with the corresponding relations file merge_newRels.txt:&lt;br /&gt;
&lt;br /&gt;
 idsup	idinf&lt;br /&gt;
 protId1	protId1'''_114vs118'''&lt;br /&gt;
 protId2	protId2'''_114vs118'''&lt;br /&gt;
 protId3	protId3'''_114vs118'''&lt;br /&gt;
 protId4	protId4'''_114vs118'''&lt;br /&gt;
 protId5	protId5'''_114vs118'''&lt;br /&gt;
 ...&lt;br /&gt;
 protId1	protId1'''_115vs119'''&lt;br /&gt;
 protId2	protId2'''_115vs119'''&lt;br /&gt;
 protId3	protId3'''_115vs119'''&lt;br /&gt;
 protId4	protId4'''_115vs119'''&lt;br /&gt;
 protId5	protId5'''_115vs119'''&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
As can be seen, in the relations file the lower level (second column) links each protein to its experiment (via the suffix, matching the corresponding data in merge_newData.txt), while in the higher level (first column) proteins are independent of the experiment. This means that, after integrating, we obtain only experiment-independent, protein-level information (the information of each protein in each experiment averaged with its corresponding weight).&lt;br /&gt;
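The rearrangement shown above can be sketched compactly. This is an illustration of what Cardenio does, not Cardenio itself; the function name `merge_experiments` and its input layout are invented, while the suffixing scheme and the idsup/idinf pairing follow the tables above.

```python
# Illustrative sketch of Cardenio's rearrangement: suffix each protein
# identifier with its experiment tag in the merged data file, and pair
# the suffixed identifier (idinf) with the bare one (idsup) in the
# merged relations file.
def merge_experiments(experiments):
    """experiments: {tag: [(protein_id, ratio, weight), ...]}"""
    data, rels = [], []
    for tag, rows in experiments.items():
        for prot, ratio, weight in rows:
            tagged = "{}_{}".format(prot, tag)
            data.append((tagged, ratio, weight))   # goes to merge_newData.txt
            rels.append((prot, tagged))            # goes to merge_newRels.txt
    return data, rels

data, rels = merge_experiments({"114vs118": [("protId1", 0.42, 20.8)]})
print(data)
print(rels)
```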
&lt;br /&gt;
=== Last step for experiment integration ===&lt;br /&gt;
&lt;br /&gt;
The last lines perform the integration of the two experiments taking the data and relations files from Cardenio (merge_newData.txt and merge_newRels.txt), as seen in the previous section.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=37&amp;gt;&lt;br /&gt;
sanxot.exe -amerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -g&lt;br /&gt;
sanxotsieve.exe -aoutsMerged -p%workingFolder% -dmerge_newData.txt -rmerge_newRels.txt -Vmerged_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -amerged_noOuts -p%workingFolder% -dmerge_newData.txt -routsMerged_tagged.xls -fg -Vmerged_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqaMerged -p%workingFolder% -dmerged_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The approach here is the same as for the integrations from spectrum to peptide or from peptide to protein (but in this case we integrate from &amp;quot;experiment-dependent protein data&amp;quot; to &amp;quot;experiment-independent protein data&amp;quot;):&lt;br /&gt;
:* the first line calculates the variance (storing it in the merged_infoFile.txt file)&lt;br /&gt;
:* then SanXoTSieve tags the outliers (generating the outsMerged_tagged.xls file with the same relations as merge_newRels.txt, but including the tag &amp;quot;out&amp;quot; in the third column),&lt;br /&gt;
:* the third line integrates the proteins using the relations file and the variance from the two previous commands;&lt;br /&gt;
:* finally, all the experiment-independent proteins are compared to each other, getting the variance (stored in merged_noOuts_infoFile.txt) and providing an FDR that can be used to filter changing proteins (in file merged_noOuts_outStats.xls).&lt;br /&gt;
&lt;br /&gt;
The last step will also be used in the Systems Biology Triangle method (SBT)&amp;lt;ref&amp;gt;García-Marqués, F., Trevisan-Herraz, M., Martínez-Martínez, S., Camafeita, E., Jorge, I., Lopez, J.A., Méndez-Barbero, N., Méndez-Ferrer, S., del Pozo, M.A., Ibáñez, B., Andrés, V., Sánchez-Madrid, F., Redondo, J.M., Bonzon-Kulichenko, E. and Vázquez, J. (2016) '''[http://www.mcponline.org/content/15/5/1740.long A novel systems-biology algorithm for the analysis of coordinated protein responses using quantitative proteomics]'''. ''Molecular &amp;amp; Cellular Proteomics''.&amp;lt;/ref&amp;gt; which we will see now (the &amp;quot;protein-to-all&amp;quot; operation is the &amp;quot;bottom side&amp;quot; of the triangle).&lt;br /&gt;
&lt;br /&gt;
== Test 3: Systems biology ==&lt;br /&gt;
&lt;br /&gt;
In this final test we perform a systems biology analysis of the previously merged experiment using the Systems Biology Triangle method. For this we start with the files generated in Test 2, plus a tab-separated table downloaded directly from the '''[https://en.wikipedia.org/wiki/DAVID DAVID bioinformatics resource]'''.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div class=&amp;quot;toccolours mw-collapsible mw-collapsed&amp;quot;&amp;gt;&lt;br /&gt;
contents of ''commands_test3.bat''&lt;br /&gt;
&amp;lt;div class=&amp;quot;mw-collapsible-content&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line&amp;gt;&lt;br /&gt;
REM PART 3, SYSTEMS BIOLOGY&lt;br /&gt;
REM&lt;br /&gt;
REM&lt;br /&gt;
REM commands&lt;br /&gt;
&lt;br /&gt;
set unit=D:&lt;br /&gt;
set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
set workingFolder=&amp;quot;D:\SanXoT_test3&amp;quot;&lt;br /&gt;
&lt;br /&gt;
%unit%&lt;br /&gt;
cd %programFolder%&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM conversion of ontological data from DAVID&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
camacho.exe -adbcats1 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -q1 -c5 -fBioCarta_&lt;br /&gt;
camacho.exe -adbcats2 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats1_rels.xls&amp;quot; -c6 -fGOBP_&lt;br /&gt;
camacho.exe -adbcats3 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats2_rels.xls&amp;quot; -c7 -fGOCC_&lt;br /&gt;
camacho.exe -adbcats4 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats3_rels.xls&amp;quot; -c8 -fGOMF_&lt;br /&gt;
camacho.exe -adbcats5 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats4_rels.xls&amp;quot; -c9 -fKEGG_&lt;br /&gt;
camacho.exe -adbcats -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats5_rels.xls&amp;quot; -c10 -fPIR_&lt;br /&gt;
&lt;br /&gt;
REM&lt;br /&gt;
REM prot to cat&lt;br /&gt;
REM&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -g&lt;br /&gt;
sanxotsieve.exe -aoutsqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -Vqc_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -aqc_noOuts -p%workingFolder% -dmerged_noOuts_higherLevel.xls -routsqc_tagged.xls -fg -Vqc_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&lt;br /&gt;
REM cat to all&lt;br /&gt;
&lt;br /&gt;
sanxot.exe -aca -p%workingFolder% -dqc_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
sanxotsqueezer.exe -alist -p%workingFolder% -lqc_noOuts_outStats.xls -uca_outStats.xls -f0.05 -n3&lt;br /&gt;
sanson.exe -asanson -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -gpdf&lt;br /&gt;
sanxotgauss.exe -agauss -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -g&lt;br /&gt;
coordinometer.exe -acoordination -p%workingFolder% -qqc_noOuts_outStats.xls -cca_outStats.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For files and details about how to run this test, check '''[[Unit_tests_for_SanXoT#Test_3:_Systems_biology|Test 3]]'''.&lt;br /&gt;
&lt;br /&gt;
=== Getting the protein-to-category relations file ===&lt;br /&gt;
&lt;br /&gt;
We start by generating the relations file we are going to use in the systems biology analysis. It is a simple relations file, where the first column contains a string identifying the category and the second column a string identifying the protein (following the same philosophy as the tab-separated tables used to indicate the PSMs in the spectrum-to-peptide relations and the peptide-to-protein assignations).&lt;br /&gt;
&lt;br /&gt;
Note that, although we are using DAVID to obtain the ontological categories, these relations can be provided using any tool, or even created as custom relations, as long as they follow the format described. We can also concatenate relations (such as our own custom ones) to previous relations files.&lt;br /&gt;
&lt;br /&gt;
As we said, a good way to get the information from databases such as Gene Ontology or KEGG is to use DAVID as an intermediary; in this case, we use the module '''[[Camacho]]''', dedicated to this operation:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=17&amp;gt;&lt;br /&gt;
camacho.exe -adbcats1 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -q1 -c5 -fBioCarta_&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here, as in all SanXoT modules, we use the options -a and -p as usual (to define a prefix for the output files of this operation and to set the working directory, respectively). Then we give the filename of the tab-separated file downloaded from DAVID (using -d), indicate the column number (one-based) containing the protein accession number (option -q; in this case -q1 indicates it is the first column) and the column containing the category we want to parse into the relations file (using -c; in this case we want to parse the data from BioCarta, which is in column 5), and we define a prefix for the category (as these are BioCarta categories, we use -f to set it to &amp;quot;BioCarta_&amp;quot;, with the underscore as a separator to keep it human-readable). The prefix is not mandatory, but it is highly recommended when importing data from different databases, as categories from different sources may share a name while containing different protein sets; a prefix not only avoids such collisions, but also helps track which database each category came from.&lt;br /&gt;
&lt;br /&gt;
The file fastaHeadersNoSequences.fasta is not mandatory. In this case, as we want to use the full FASTA header as the protein identifier (to get more human-readable results), we include it. Obviously, it must contain the same identifiers as those used for DAVID. Typically, the FASTA file used to identify the proteins should be used, but here we have removed the sequences to avoid distributing an overly large file with this test.&lt;br /&gt;
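The column extraction performed by Camacho can be sketched for a single row of the DAVID export. This is a hypothetical illustration: the helper `parse_david_row` is invented, and the assumption that a cell holds a comma-separated category list is ours; only the meaning of -q, -c and -f comes from the text.

```python
# Hypothetical sketch of Camacho's parsing: take the protein accession
# from column -q and the category list from column -c of a tab-separated
# DAVID export row, prefix each category (-f), and emit (category,
# protein) relations. The comma-separated cell format is an assumption.
def parse_david_row(fields, q=1, c=5, prefix="BioCarta_"):
    protein = fields[q - 1]                      # -q and -c are one-based
    categories = fields[c - 1].split(",")
    return [(prefix + cat.strip(), protein) for cat in categories if cat.strip()]

row = ["P12345", "x", "x", "x", "h_aktPathway,h_badPathway"]
print(parse_david_row(row))
```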
&lt;br /&gt;
As a result, we get the file dbcats1_rels.xls. Now we concatenate the relations from the databases of [http://www.geneontology.org Gene Ontology] (BP = biological process, CC = cellular component, and MF = molecular function), [http://www.genome.jp/kegg KEGG] and [https://pir.georgetown.edu PIR]. For each of them we run Camacho again, changing only the -c option (to set the column), -f (for the different category prefix) and the file to which the relations are concatenated (using the option -x). Other users might prefer to overwrite the original file but, in order to keep track of the different operations, we use a different file for each output.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=18&amp;gt;&lt;br /&gt;
camacho.exe -adbcats2 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats1_rels.xls&amp;quot; -c6 -fGOBP_&lt;br /&gt;
camacho.exe -adbcats3 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats2_rels.xls&amp;quot; -c7 -fGOCC_&lt;br /&gt;
camacho.exe -adbcats4 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats3_rels.xls&amp;quot; -c8 -fGOMF_&lt;br /&gt;
camacho.exe -adbcats5 -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats4_rels.xls&amp;quot; -c9 -fKEGG_&lt;br /&gt;
camacho.exe -adbcats -p%workingFolder% -d&amp;quot;SB_Homo_19dic-2017.txt&amp;quot; -q1 --fasta=&amp;quot;fastaHeadersNoSequences.fasta&amp;quot; -x&amp;quot;dbcats5_rels.xls&amp;quot; -c10 -fPIR_&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We end up with the final file dbcats_rels.xls; this relations file can now be used for as many Systems Biology Triangle analyses as needed, without needing to use Camacho again (unless we want to add databases or an updated ontology from DAVID).&lt;br /&gt;
&lt;br /&gt;
=== From proteins to categories ===&lt;br /&gt;
&lt;br /&gt;
We start by using the relations file just generated, dbcats_rels.xls, to integrate proteins into categories. Every protein has a different weight, depending on the quality of the scans, the number of peptides detected and quantified for the protein, and perhaps even the different experiments where it was found. As a result, the information from each protein carries a different weight in the category where it has been classified (just as happened at the previous levels). Getting the category-level information (rather than a lengthy list of quantified proteins) will help us get the big picture of what is going on in the proteome. Additionally, using the Systems Biology Triangle, we integrate the information of all the quantified proteins, not just a list of proteins whose quantification exceeds a defined threshold (as in over-representation algorithms).&lt;br /&gt;
&lt;br /&gt;
We integrate in the usual manner: first getting the variance of the proteins, then tagging outliers (proteins that behave differently from the other proteins classified in the same group) and finally performing the integration with the previously calculated variance, excluding the outlier proteins:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=28&amp;gt;&lt;br /&gt;
sanxot.exe -aqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -g&lt;br /&gt;
sanxotsieve.exe -aoutsqc -p%workingFolder% -dmerged_noOuts_higherLevel.xls -rdbcats_rels.xls -Vqc_infoFile.txt -f0.01&lt;br /&gt;
sanxot.exe -aqc_noOuts -p%workingFolder% -dmerged_noOuts_higherLevel.xls -routsqc_tagged.xls -fg -Vqc_infoFile.txt --tags=&amp;quot;!out&amp;quot;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This operation is the &amp;quot;top-left&amp;quot; side of the Systems Biology Triangle.&lt;br /&gt;
&lt;br /&gt;
The last integration, where categories are compared to each other, is the &amp;quot;top-right&amp;quot; side of the triangle:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=34&amp;gt;&lt;br /&gt;
sanxot.exe -aca -p%workingFolder% -dqc_noOuts_higherLevel.xls -Cg&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As we did in line 38 of Test 1, and line 41 of Test 2, this is a confluence, meaning that we don't need a relations file, because the whole set of categories is going to be considered the same group.&lt;br /&gt;
&lt;br /&gt;
=== Filtering the most relevant categories using SanXoTSqueezer ===&lt;br /&gt;
&lt;br /&gt;
Now we have the information needed to extract some details from the analysis. Sometimes there are too many categories, which makes the analysis difficult. We could build &amp;quot;supercategories&amp;quot; using a relations file between category and supercategory, but in this case we will simply filter the categories with '''[[SanXoTSqueezer]]''', keeping those with a minimum of 3 quantified proteins (using the option -n) and a maximum FDR of 5% (using the option -f). We also need the two outStats files for the protein-to-category (&amp;quot;qc&amp;quot;, using option -l) and the category-to-all (&amp;quot;ca&amp;quot;, using option -u) operations, using files from previous steps (qc_noOuts_outStats.xls and ca_outStats.xls, respectively):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=37&amp;gt;&lt;br /&gt;
sanxotsqueezer.exe -alist -p%workingFolder% -lqc_noOuts_outStats.xls -uca_outStats.xls -f0.05 -n3&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output, in file list_outList.xls, is a list of categories with some associated information; we will analyse the categories in this list.&lt;br /&gt;
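This filtering rule can be sketched with a minimal, illustrative snippet (not the actual SanXoTSqueezer implementation; the dictionary keys are hypothetical names):&lt;br /&gt;
&lt;br /&gt;
```python
# Illustrative sketch of the category filter described above (not
# SanXoTSqueezer itself): keep only categories with at least
# min_proteins quantified proteins and an FDR at or below the threshold.
# The "n_proteins" and "fdr" keys are hypothetical names.
def filter_categories(categories, max_fdr=0.05, min_proteins=3):
    return [c for c in categories
            if c["n_proteins"] >= min_proteins and c["fdr"] <= max_fdr]
```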
&lt;br /&gt;
=== Clustering together similar categories using Sanson ===&lt;br /&gt;
&lt;br /&gt;
The point of integrating proteins into categories is managing less, but more meaningful, information. However, databases can contain so many categories that the results are still too numerous to interpret biologically.&lt;br /&gt;
&lt;br /&gt;
In some cases there are duplicated categories that contain the same proteins under different names, especially when we combine different databases. This is easy to detect, but more commonly categories have ''similar'' but not ''identical'' sets of proteins, leading to a subtler problem. The module '''[[Sanson]]''' helps detect similar categories, generating clusters of similar sets. A &amp;quot;similarity&amp;quot; threshold can be set (where the similarity is given by an FNumber, depending on how many proteins are shared by two given categories), but the algorithm uses a matrix of protein-to-category items (the similarity matrix), and maximises both the number of categories in each cluster (the CNumber) and the number of clusters, partitioning them with a [https://en.wikipedia.org/wiki/Durfee_square Durfee square]-based calculation:&lt;br /&gt;
&lt;br /&gt;
 With this similarity-threshold, are there at least 3 clusters with at least 3 categories? --&amp;gt; true&lt;br /&gt;
 With this similarity-threshold, are there at least 4 clusters with at least 4 categories? --&amp;gt; true&lt;br /&gt;
 With this similarity-threshold, are there at least 5 clusters with at least 5 categories? --&amp;gt; false&lt;br /&gt;
 With this similarity-threshold, are there at least 6 clusters with at least 6 categories? --&amp;gt; false&lt;br /&gt;
&lt;br /&gt;
In this example, it could happen that there are (for instance) 5 clusters with 4 categories, or 3 clusters with 6 categories, but not 5 clusters with 5 categories, so the best CNumber in this case would be 4 (note that this is the same logic used to calculate the h-index metric).&lt;br /&gt;
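This CNumber selection can be sketched as an h-index-style calculation (an illustrative snippet, not Sanson's actual code; the cluster sizes in the example are hypothetical):&lt;br /&gt;
&lt;br /&gt;
```python
# Illustrative sketch: pick the largest n such that at least n clusters
# contain at least n categories each (the h-index-style logic above).
def best_cnumber(cluster_sizes):
    sizes = sorted(cluster_sizes, reverse=True)
    n = 0
    for i, size in enumerate(sizes, start=1):
        if size >= i:
            n = i
        else:
            break
    return n

# Hypothetical example matching the text: 5 clusters with >= 4 categories
# and 3 clusters with >= 6, but not 5 clusters with >= 5, so CNumber = 4.
print(best_cnumber([6, 6, 6, 4, 4]))  # -> 4
```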
&lt;br /&gt;
In this test, Sanson is executed with this command line:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=38&amp;gt;&lt;br /&gt;
sanson.exe -asanson -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -gpdf&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here, the relations file dbcats_rels.xls (option -r) is already familiar; the file list_outList.xls (option -c), generated in the previous step, provides the categories that will be considered; and the outStats file qaMerged_outStats.xls (option -z), generated in the last command of Test 2, provides the protein-level, category-independent quantifications, that is, the protein information for each category.&lt;br /&gt;
&lt;br /&gt;
The output of Sanson is a .gv file (here sanson_simGraph.gv), where category clusters are given in [http://en.wikipedia.org/wiki/DOT_language DOT language]; they can be interpreted by open-source programs like [http://graphviz.org/ Graphviz] to generate an image file. If Graphviz is installed, Sanson can launch the interpretation itself (you only need to specify in the dot.ini file the location of the dot.exe file used by Graphviz). In this test, we have used the pdf file sanson_simGraph.pdf as output (using the -g option), but it can also be interesting to generate an [https://en.wikipedia.org/wiki/Scalable_Vector_Graphics svg] file, which (when opened by suitable browsers) shows the list of protein identifiers present in each category just by hovering the mouse over the category box.&lt;br /&gt;
&lt;br /&gt;
As you will see, each category box has a number of customisable colour columns. Each column represents a protein present in the category, and its colour depends on the standardised log-fold change of that protein (Z, the standardised version of X=logfold, meaning that X has been centred and corrected according to the variance). By default, white means no change, green means there is more quantity in sample B than in sample A (as logarithm(A/B) &amp;lt; 0), and, conversely, red means there is more quantity in sample A than in sample B. The arrows carry numbers indicating the total number of proteins in a category that are also present in the category the arrow points to. The absence of an arrow does not mean two categories share no proteins; it just means that the ratio of shared proteins does not reach the calculated threshold specified above.&lt;br /&gt;
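The idea behind the standardisation can be illustrated with a generic, unweighted sketch (the actual SanXoT calculation also incorporates the weights from the WSPP statistical model, so this is only the basic idea):&lt;br /&gt;
&lt;br /&gt;
```python
# Generic standardisation sketch (unweighted; illustrative only):
# Z = (X - mean(X)) / sd(X), so X is centred and scaled by the variance.
def standardise(logfold_changes):
    n = len(logfold_changes)
    mean = sum(logfold_changes) / n
    sd = (sum((x - mean) ** 2 for x in logfold_changes) / (n - 1)) ** 0.5
    return [(x - mean) / sd for x in logfold_changes]

print(standardise([1.0, 2.0, 3.0]))  # -> [-1.0, 0.0, 1.0]
```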
&lt;br /&gt;
=== Generating gaussians using SanXoTGauss ===&lt;br /&gt;
[[File:Gauss outGraph.png|thumb|300px|left|Output of SanXoTGauss in Test 3, line 37, showing the theoretical sigmoid for the standard normal distribution (red), the experimental sigmoid for all the proteins in the experiment (blue), and the sigmoids of proteins grouped by ontology (selecting with SanXoTSqueezer only categories that are changing expression with FDR &amp;lt;= 5% and having at least 3 proteins).]]&lt;br /&gt;
Representing sigmoids (here, cumulative gaussians) of the different categories, and comparing them to a theoretical normal distribution (thick red line) and to the bulk of proteins (blue line), can give a better insight into the meaning of a change. A coordinated proteome is one whose proteins belonging to the same categories change in a similar way, together. We can observe this behaviour by representing the proteins classified in the same category as a separate sigmoid. Since we usually represent several categories at once, we draw each with a different colour to identify them more easily. Categories with a coordinated change will be &amp;quot;parallel&amp;quot; to the normal distribution, but displaced (according to the category-level fold change), while categories behaving in an uncoordinated way will tend to cross the normal distribution (as they contain proteins with both increasing and decreasing expression changes).&lt;br /&gt;
&lt;br /&gt;
This operation is done by '''[[SanXoTGauss]]''', using a command like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=39&amp;gt;&lt;br /&gt;
sanxotgauss.exe -agauss -p%workingFolder% -rdbcats_rels.xls -clist_outList.xls -zqaMerged_outStats.xls -g&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The options are the same as in the previous Sanson command, except -g (which here asks the program not to show the graph, just to save it, so the workflow can keep going unattended).&lt;br /&gt;
&lt;br /&gt;
=== Calculating the degree of coordination using the Coordinometer ===&lt;br /&gt;
&lt;br /&gt;
We have defined a parameter called '''''degree of coordination''''', which tells us objectively how coordinated a proteome is. This parameter takes into account the number of outlier proteins (proteins expressed differently from proteins classified within the same category) and how much the categories are changing. The '''[[Coordinometer]]''' calculates this number using this command:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;dos&amp;quot; line start=40&amp;gt;&lt;br /&gt;
coordinometer.exe -acoordination -p%workingFolder% -qqc_noOuts_outStats.xls -cca_outStats.xls&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Coordinometer extracts the number of relations tagged as outliers (using the &amp;quot;out&amp;quot; tag in the third column) from file qc_noOuts_outStats.xls (option -q; these are proteins expressed differently from other proteins in the same categories, a calculation from the &amp;quot;top-left&amp;quot; side of the triangle), and gets the number of changing categories from file ca_outStats.xls (option -c, reading the FDR column directly). The FDR threshold for the latter is 5% by default, but a different threshold can be set with the --cafdr option.&lt;br /&gt;
&lt;br /&gt;
The result is printed both in the console and in the corresponding log file, coordination_logFile.txt, along with the information used to calculate it:&lt;br /&gt;
&lt;br /&gt;
 Total number of changing categories: 66&lt;br /&gt;
 Total number of relations pointing to changing categories: 106&lt;br /&gt;
 Total number of outlier relations pointing to changing categories: 2&lt;br /&gt;
 Total number of outlier relations pointing to non-changing categories: 46&lt;br /&gt;
 Degree of coordination: 0.684211&lt;br /&gt;
&lt;br /&gt;
Where 0.684211 is the result of:&lt;br /&gt;
 (numRelsChangingCats - numOutliersChangingCats) / (numRelsChangingCats + numOutliersNonChangingCats) =&lt;br /&gt;
 (106 - 2) / (106 + 46) = 0.684211&lt;br /&gt;
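The same calculation can be reproduced with a short sketch (the function name is illustrative; the Coordinometer extracts these counts from the outStats files as described above):&lt;br /&gt;
&lt;br /&gt;
```python
# Degree of coordination, as defined above (illustrative function name):
# (numRelsChangingCats - numOutliersChangingCats) /
# (numRelsChangingCats + numOutliersNonChangingCats)
def degree_of_coordination(rels_changing, outliers_changing,
                           outliers_non_changing):
    return ((rels_changing - outliers_changing) /
            (rels_changing + outliers_non_changing))

# Values taken from the log file above:
print(round(degree_of_coordination(106, 2, 46), 6))  # -> 0.684211
```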
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:SanXoT software package]]&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=Unit_tests_for_SanXoT&amp;diff=626</id>
		<title>Unit tests for SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=Unit_tests_for_SanXoT&amp;diff=626"/>
				<updated>2018-08-14T15:02:28Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;We present here three unit tests for Windows that can be performed using the following programs in the '''[[SanXoT software package]]''':&lt;br /&gt;
&lt;br /&gt;
* '''[[Aljamia]]''' (data parser)&lt;br /&gt;
* '''[[Klibrate]]''' (weight calibrator)&lt;br /&gt;
* '''[[SanXoT]]''' (integration and variance calculation)&lt;br /&gt;
* '''[[SanXoTSieve]]''' (removal of outliers)&lt;br /&gt;
* '''[[Cardenio]]''' (merger of experiments)&lt;br /&gt;
* '''[[SanXoTSqueezer]]''' (category filter)&lt;br /&gt;
* '''[[SanXoTGauss]]''' (category cumulative gaussian graphs)&lt;br /&gt;
* '''[[Sanson]]''' (category clustering)&lt;br /&gt;
* '''[[Coordinometer]]''' (calculator of the degree of coordination)&lt;br /&gt;
&lt;br /&gt;
For detailed explanations, check: '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
Note that links to SanXoT Windows standalone executables are provided in this unit test, but download links to the source code and up-to-date versions of SanXoT are available at: '''[[SanXoT software package]]'''.&lt;br /&gt;
&lt;br /&gt;
==Test 1: The fundamental workflow==&lt;br /&gt;
&lt;br /&gt;
The '''[[fundamental workflow]]''' consists of the steps to quantify proteins using PSMs and quantitative information, based on the WSPP (''Weighted Spectrum, Peptide and Protein'') statistical model, as described&amp;lt;ref&amp;gt;Navarro, P., Trevisan-Herraz, M., Bonzon-Kulichenko, E., Núñez, E., Martínez-Acedo, P., Pérez-Hernández, D., Jorge, I., Mesa, R., Calvo, E., Carrascal, M., Hernáez, M.L., García, F., Bárcena, J.A., Ashman, K., Abian, J., Gil, C., Redondo, J.M. and Vázquez, J. (2014) '''[https://pubs.acs.org/doi/abs/10.1021/pr4006958 General statistical framework for quantitative proteomics by stable isotope labeling]'''. ''Journal of proteome research'', 13, 1234-1247.&amp;lt;/ref&amp;gt;, using the following steps:&lt;br /&gt;
&lt;br /&gt;
* calibration of spectra&lt;br /&gt;
* integrating from spectra to peptides&lt;br /&gt;
* integrating from peptides to proteins&lt;br /&gt;
* quantifying proteins&lt;br /&gt;
&lt;br /&gt;
This test makes use of four programs: '''[[Aljamia]]''', '''[[Klibrate]]''', '''[[SanXoT]]''', and '''[[SanXoTSieve]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this unit test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) download the Windows executables, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]'''. Unzip the whole content into a folder specific to the program.&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test1.zip SanXoT_test1.zip]'''. Unzip to a working folder. You should have two folders, one with the input for this test, and another with the expected results. In the former, you will find two text files:&lt;br /&gt;
&lt;br /&gt;
::* '''''commands_test1.bat''''', a windows batch file with command lines to run this sample analysis.&lt;br /&gt;
::* '''''170415_Marga_GBS_iTRAQ_PSMs.txt''''', a tab-separated-values text file with identifications and quantitative data from a proteomics experiment&amp;lt;ref&amp;gt;Mateos-Hernández, L., et al. (2016) '''[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5342688/ Quantitative proteomics reveals Piccolo as a candidate serological correlate of recovery from Guillain-Barré syndrome]'''. ''Oncotarget'', 7(46): 74582–74591.&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
3) get a) the path of your working folder (where you have unzipped the file '''''170415_Marga_GBS_iTRAQ_PSMs.txt'''''), b) the path of the folder where you have unzipped the Windows executables, and c) the drive letter of the latter (C:, D:, etc.); modify the following lines in the commands_test1.bat file accordingly:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
&lt;br /&gt;
4) to execute the '''''commands_test1.bat''''' file, copy and paste (or drag and drop) the whole commands_test1.bat text into a [https://en.wikipedia.org/wiki/Cmd.exe command prompt window] (you can also just double-click the bat file; however, if an error arises, the CMD window created will close immediately, so the text of the error will not be available).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. It should take 30-60 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) compare your results with the data in folder SanXoT_test1_results, inside SanXoT_test1.zip (for example using comparison software such as [https://en.wikipedia.org/wiki/Beyond_Compare Beyond Compare]). Only the files ending in *_log.txt and *_infoFile.txt should show differences (and only due to the different file paths and timestamps; all other features, such as variances, Levenberg-Marquardt steps, and options, should be identical).&lt;br /&gt;
&lt;br /&gt;
==Test 2: Experiment merging==&lt;br /&gt;
&lt;br /&gt;
Protein quantifications from different technical or biological replicates can be merged into experiment-independent quantifications, yielding the biological and technical variance. In this test we:&lt;br /&gt;
&lt;br /&gt;
* use the resulting data from Test 1&lt;br /&gt;
* perform the fundamental workflow for another parallel experiment&lt;br /&gt;
* merge both experiments&lt;br /&gt;
&lt;br /&gt;
This test makes use of four programs: '''[[Klibrate]]''', '''[[SanXoT]]''', '''[[SanXoTSieve]]''', and '''[[Cardenio]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this unit test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) use the windows executables unzipped in step 1 of Test 1 ('''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]''').&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test2.zip SanXoT_test2.zip]'''. This contains:&lt;br /&gt;
&lt;br /&gt;
::* all the files generated in Test 1&lt;br /&gt;
::* the '''''tagFile.txt''''' (for '''[[Cardenio]]''')&lt;br /&gt;
::* the '''''commands_test2.bat'''''.&lt;br /&gt;
&lt;br /&gt;
3) remember to change the following lines in the '''''commands_test2.bat''''' file, as in step 3 for Test 1:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test2&amp;quot;&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test2.bat''''' file (do the same as in step 4 in Test 1).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. As in Test 1, this test should take 30-60 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) as in Test 1, the only files that are not identical should be the merge_logFile.txt from Cardenio, and the *_infoFile.txt from SanXoT, SanXoTSieve and Klibrate.&lt;br /&gt;
&lt;br /&gt;
==Test 3: Systems biology==&lt;br /&gt;
&lt;br /&gt;
The '''[[SanXoT software package]]''' can be used to perform systems biology analyses using the SBT (''Systems Biology Triangle''), as described.&amp;lt;ref&amp;gt;García-Marqués, F., Trevisan-Herraz, M., Martínez-Martínez, S., Camafeita, E., Jorge, I., Lopez, J.A., Méndez-Barbero, N., Méndez-Ferrer, S., del Pozo, M.A., Ibáñez, B., Andrés, V., Sánchez-Madrid, F., Redondo, J.M., Bonzon-Kulichenko, E. and Vázquez, J. (2016) '''[http://www.mcponline.org/content/15/5/1740.long A novel systems-biology algorithm for the analysis of coordinated protein responses using quantitative proteomics]'''. ''Molecular &amp;amp; Cellular Proteomics'', 15(5):1740-60.&amp;lt;/ref&amp;gt; In this test we:&lt;br /&gt;
&lt;br /&gt;
* generate a relations file using data from the '''[https://en.wikipedia.org/wiki/DAVID DAVID bioinformatics resource]''' (you can download the data from them yourself, but for your convenience it is already included)&lt;br /&gt;
* use the resulting data from Test 2&lt;br /&gt;
* calculate a category-level fold-change&lt;br /&gt;
* perform the systems biology analysis.&lt;br /&gt;
&lt;br /&gt;
This test makes use of seven programs: '''[[Camacho]]''', '''[[SanXoT]]''', '''[[SanXoTSieve]]''', '''[[SanXoTSqueezer]]''', '''[[Sanson]]''', '''[[SanXoTGauss]]''', and '''[[Coordinometer]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) use the windows executables unzipped in step 1 of Test 1 ('''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]''').&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test3.zip SanXoT_test3.zip]'''. This contains:&lt;br /&gt;
&lt;br /&gt;
::* all the files generated in Test 1 and Test 2&lt;br /&gt;
::* the tab-separated text file '''''SB_Homo_19dic-2017.txt''''' (a set of data downloaded from DAVID)&lt;br /&gt;
::* the '''''commands_test3.bat'''''.&lt;br /&gt;
&lt;br /&gt;
3) remember to change the following lines in the '''''commands_test3.bat''''' file, as in step 3 for Test 1 and Test 2:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test3&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Additionally, if you want to get the similarity graph from the category clustering algorithm ('''[[Sanson]]'''), you will need:&lt;br /&gt;
&lt;br /&gt;
* to have installed '''[https://en.wikipedia.org/wiki/Graphviz Graphviz]''' (external open-source software, which includes the interpreter for the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) DOT language], used by Sanson to generate the graphs for the clusters)&lt;br /&gt;
* to modify, in the program folder, the ''dot.ini'' file, changing the following line to include the path to the ''bin'' folder of Graphviz, which may be different for each user:&lt;br /&gt;
&lt;br /&gt;
 dotlocation = C:\Program Files (x86)\Graphviz2.36\bin&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test3.bat''''' file (do the same as in step 4 in Test 1).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. This test should take 10-20 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) as in Test 1 and Test 2, the only files that are not identical should be the *_logFile.txt from Camacho, SanXoTSqueezer, Sanson, SanXoTGauss and the Coordinometer, as well as the *_infoFile.txt from SanXoT and SanXoTSieve. Additionally, if you use different versions of Graphviz (this test used Graphviz 2.36) you might find differences in the file sanson_simGraph.pdf (which is a pdf containing the final graph with the category clusters).&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:SanXoT software package]]&lt;br /&gt;
[[Category:unit tests]]&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=Unit_tests_for_SanXoT&amp;diff=625</id>
		<title>Unit tests for SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=Unit_tests_for_SanXoT&amp;diff=625"/>
				<updated>2018-08-14T15:01:30Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;We present here three unit tests for Windows that can be performed using the following programs in the '''[[SanXoT software package]]''':&lt;br /&gt;
&lt;br /&gt;
* '''[[Aljamia]]''' (data parser)&lt;br /&gt;
* '''[[Klibrate]]''' (weight calibrator)&lt;br /&gt;
* '''[[SanXoT]]''' (integration and variance calculation)&lt;br /&gt;
* '''[[SanXoTSieve]]''' (removal of outliers)&lt;br /&gt;
* '''[[Cardenio]]''' (merger of experiments)&lt;br /&gt;
* '''[[SanXoTSqueezer]]''' (category filter)&lt;br /&gt;
* '''[[SanXoTGauss]]''' (category cumulative gaussian graphs)&lt;br /&gt;
* '''[[Sanson]]''' (category clustering)&lt;br /&gt;
* '''[[Coordinometer]]''' (calculator of the degree of coordination)&lt;br /&gt;
&lt;br /&gt;
For detailed explanations, check: '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
Note that links to SanXoT Windows standalone executables are provided in this unit test, but download links to the source code and up-to-date versions of SanXoT are available at: '''[[SanXoT software package]]'''.&lt;br /&gt;
&lt;br /&gt;
==Test 1: The fundamental workflow==&lt;br /&gt;
&lt;br /&gt;
The '''[[fundamental workflow]]''' consists of the steps to quantify proteins using PSMs and quantitative information, based on the WSPP (''Weighted Spectrum, Peptide and Protein'') statistical model, as described&amp;lt;ref&amp;gt;Navarro, P., Trevisan-Herraz, M., Bonzon-Kulichenko, E., Núñez, E., Martínez-Acedo, P., Pérez-Hernández, D., Jorge, I., Mesa, R., Calvo, E., Carrascal, M., Hernáez, M.L., García, F., Bárcena, J.A., Ashman, K., Abian, J., Gil, C., Redondo, J.M. and Vázquez, J. (2014) '''[https://pubs.acs.org/doi/abs/10.1021/pr4006958 General statistical framework for quantitative proteomics by stable isotope labeling]'''. ''Journal of proteome research'', 13, 1234-1247.&amp;lt;/ref&amp;gt;, using the following steps:&lt;br /&gt;
&lt;br /&gt;
* calibration of spectra&lt;br /&gt;
* integrating from spectra to peptides&lt;br /&gt;
* integrating from peptides to proteins&lt;br /&gt;
* quantifying proteins&lt;br /&gt;
&lt;br /&gt;
This test makes use of four programs: '''[[Aljamia]]''', '''[[Klibrate]]''', '''[[SanXoT]]''', and '''[[SanXoTSieve]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this unit test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) download the Windows executables, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]'''. Unzip the whole content into a folder specific to the program.&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test1.zip SanXoT_test1.zip]'''. Unzip to a working folder. You should have two folders, one with the input for this test, and another with the expected results. In the former, you will find two text files:&lt;br /&gt;
&lt;br /&gt;
::* '''''commands_test1.bat''''', a windows batch file with command lines to run this sample analysis.&lt;br /&gt;
::* '''''170415_Marga_GBS_iTRAQ_PSMs.txt''''', a tab-separated-values text file with identifications and quantitative data from a proteomics experiment&amp;lt;ref&amp;gt;Mateos-Hernández, L., et al. (2016) '''Quantitative proteomics reveals Piccolo as a candidate serological correlate of recovery from Guillain-Barré syndrome'''. ''Oncotarget'', 7(46): 74582–74591.&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
3) get a) the path of your working folder (where you have unzipped the file '''''170415_Marga_GBS_iTRAQ_PSMs.txt'''''), b) the path of the folder where you have unzipped the Windows executables, and c) the drive letter of the latter (C:, D:, etc.); modify the following lines in the commands_test1.bat file accordingly:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
&lt;br /&gt;
4) to execute the '''''commands_test1.bat''''' file, copy and paste (or drag and drop) the whole commands_test1.bat text into a [https://en.wikipedia.org/wiki/Cmd.exe command prompt window] (you can also just double-click the bat file; however, if an error arises, the CMD window created will close immediately, so the text of the error will not be available).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. It should take 30-60 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) compare your results with the data in folder SanXoT_test1_results, inside SanXoT_test1.zip (for example using comparison software such as [https://en.wikipedia.org/wiki/Beyond_Compare Beyond Compare]). Only the files ending in *_log.txt and *_infoFile.txt should show differences (and only due to the different file paths and timestamps; all other features, such as variances, Levenberg-Marquardt steps, and options, should be identical).&lt;br /&gt;
&lt;br /&gt;
==Test 2: Experiment merging==&lt;br /&gt;
&lt;br /&gt;
Protein quantifications from different technical or biological replicates can be merged into experiment-independent quantifications, yielding the biological and technical variance. In this test we:&lt;br /&gt;
&lt;br /&gt;
* use the resulting data from Test 1&lt;br /&gt;
* perform the fundamental workflow for another parallel experiment&lt;br /&gt;
* merge both experiments&lt;br /&gt;
&lt;br /&gt;
This test makes use of four programs: '''[[Klibrate]]''', '''[[SanXoT]]''', '''[[SanXoTSieve]]''', and '''[[Cardenio]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this unit test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) use the windows executables unzipped in step 1 of Test 1 ('''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]''').&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test2.zip SanXoT_test2.zip]'''. This contains:&lt;br /&gt;
&lt;br /&gt;
::* all the files generated in Test 1&lt;br /&gt;
::* the '''''tagFile.txt''''' (for '''[[Cardenio]]''')&lt;br /&gt;
::* the '''''commands_test2.bat'''''.&lt;br /&gt;
&lt;br /&gt;
3) remember to change the following lines in the '''''commands_test2.bat''''' file, as in step 3 for Test 1:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test2&amp;quot;&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test2.bat''''' file (do the same as in step 4 in Test 1).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. As in Test 1, this test should take 30-60 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) as in Test 1, the only files that are not identical should be the merge_logFile.txt from Cardenio, and the *_infoFile.txt from SanXoT, SanXoTSieve and Klibrate.&lt;br /&gt;
&lt;br /&gt;
==Test 3: Systems biology==&lt;br /&gt;
&lt;br /&gt;
The '''[[SanXoT software package]]''' can be used to perform systems biology analyses using the SBT (''Systems Biology Triangle''), as described.&amp;lt;ref&amp;gt;García-Marqués, F., Trevisan-Herraz, M., Martínez-Martínez, S., Camafeita, E., Jorge, I., Lopez, J.A., Méndez-Barbero, N., Méndez-Ferrer, S., del Pozo, M.A., Ibáñez, B., Andrés, V., Sánchez-Madrid, F., Redondo, J.M., Bonzon-Kulichenko, E. and Vázquez, J. (2016) '''[http://www.mcponline.org/content/15/5/1740.long A novel systems-biology algorithm for the analysis of coordinated protein responses using quantitative proteomics]'''. ''Molecular &amp;amp; Cellular Proteomics'', 15(5):1740-60.&amp;lt;/ref&amp;gt; In this test we:&lt;br /&gt;
&lt;br /&gt;
* generate a relations file using data from the '''[https://en.wikipedia.org/wiki/DAVID DAVID bioinformatics resource]''' (you can download the data from them yourself, but for your convenience it is already included)&lt;br /&gt;
* use the resulting data from Test 2&lt;br /&gt;
* calculate a category-level fold-change&lt;br /&gt;
* perform the systems biology analysis.&lt;br /&gt;
&lt;br /&gt;
This test makes use of seven programs: '''[[Camacho]]''', '''[[SanXoT]]''', '''[[SanXoTSieve]]''', '''[[SanXoTSqueezer]]''', '''[[Sanson]]''', '''[[SanXoTGauss]]''', and '''[[Coordinometer]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) use the windows executables unzipped in step 1 of Test 1 ('''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]''').&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test3.zip SanXoT_test3.zip]'''. This contains:&lt;br /&gt;
&lt;br /&gt;
::* all the files generated in Test 1 and Test 2&lt;br /&gt;
::* the tab-separated text file '''''SB_Homo_19dic-2017.txt''''' (a set of data downloaded from DAVID)&lt;br /&gt;
::* the '''''commands_test3.bat'''''.&lt;br /&gt;
&lt;br /&gt;
3) remember to change the following lines in the '''''commands_test3.bat''''' file, as in step 3 for Test 1 and Test 2:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test3&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Additionally, if you want to get the similarity graph from the category clustering algorithm ('''[[Sanson]]'''), you will need:&lt;br /&gt;
&lt;br /&gt;
* to have installed '''[https://en.wikipedia.org/wiki/Graphviz Graphviz]''' (external open-source software, which includes the interpreter for the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) DOT language], used by Sanson to generate the graphs for the clusters)&lt;br /&gt;
* to modify, in the program folder, the ''dot.ini'' file, changing the following line to include the path to the ''bin'' folder of Graphviz, which may be different for each user:&lt;br /&gt;
&lt;br /&gt;
 dotlocation = C:\Program Files (x86)\Graphviz2.36\bin&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test3.bat''''' file (do the same as in step 4 in Test 1).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. This test should take 10-20 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) as in Test 1 and Test 2, the only files that are not identical should be the *_logFile.txt from Camacho, SanXoTSqueezer, Sanson, SanXoTGauss and Coordinometer, as well as the *_infoFile.txt from SanXoT and SanXoTSieve. Additionally, if you use a different version of Graphviz (this test used Graphviz 2.36), you might find differences in the file sanson_simGraph.pdf (a PDF containing the final graph with the category clusters).&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:SanXoT software package]]&lt;br /&gt;
[[Category:unit tests]]&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=Unit_tests_for_SanXoT&amp;diff=624</id>
		<title>Unit tests for SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=Unit_tests_for_SanXoT&amp;diff=624"/>
				<updated>2018-08-14T15:00:56Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;We present here three unit tests for Windows that can be performed using the following programs in the '''[[SanXoT software package]]''':&lt;br /&gt;
&lt;br /&gt;
* '''[[Aljamia]]''' (data parser)&lt;br /&gt;
* '''[[Klibrate]]''' (weight calibrator)&lt;br /&gt;
* '''[[SanXoT]]''' (integration and variance calculation)&lt;br /&gt;
* '''[[SanXoTSieve]]''' (removal of outliers)&lt;br /&gt;
* '''[[Cardenio]]''' (merger of experiments)&lt;br /&gt;
* '''[[SanXoTSqueezer]]''' (category filter)&lt;br /&gt;
* '''[[SanXoTGauss]]''' (category cumulative gaussian graphs)&lt;br /&gt;
* '''[[Sanson]]''' (category clustering)&lt;br /&gt;
* '''[[Coordinometer]]''' (calculator of the degree of coordination)&lt;br /&gt;
&lt;br /&gt;
For detailed explanations, check: '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
Note that links to the SanXoT Windows standalone executables are provided in this unit test, but download links to the source code and up-to-date versions of SanXoT are available at: '''[[SanXoT software package]]'''.&lt;br /&gt;
&lt;br /&gt;
==Test 1: The fundamental workflow==&lt;br /&gt;
&lt;br /&gt;
The '''[[fundamental workflow]]''' consists of the steps to quantify proteins using PSMs and quantitative information, based on the WSPP (''Weighted Spectrum, Peptide and Protein'') statistical model, as described&amp;lt;ref&amp;gt;Navarro, P., Trevisan-Herraz, M., Bonzon-Kulichenko, E., Núñez, E., Martínez-Acedo, P., Pérez-Hernández, D., Jorge, I., Mesa, R., Calvo, E., Carrascal, M., Hernáez, M.L., García, F., Bárcena, J.A., Ashman, K., Abian, J., Gil, C., Redondo, J.M. and Vázquez, J. (2014) '''[https://pubs.acs.org/doi/abs/10.1021/pr4006958 General statistical framework for quantitative proteomics by stable isotope labeling]'''. ''Journal of proteome research'', 13, 1234-1247.&amp;lt;/ref&amp;gt;, using the following steps:&lt;br /&gt;
&lt;br /&gt;
* calibration of spectra&lt;br /&gt;
* integrating from spectra to peptides&lt;br /&gt;
* integrating from peptides to proteins&lt;br /&gt;
* quantifying proteins&lt;br /&gt;
&lt;br /&gt;
This test makes use of four programs: '''[[Aljamia]]''', '''[[Klibrate]]''', '''[[SanXoT]]''', and '''[[SanXoTSieve]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this unit test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) download the Windows executables, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]'''. Unzip the whole content into a folder specific to the program.&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test1.zip SanXoT_test1.zip]'''. Unzip to a working folder. You should have two folders, one with the input for this test, and another with the expected results. In the former, you will find two text files:&lt;br /&gt;
&lt;br /&gt;
::* '''''commands_test1.bat''''', a Windows batch file with the command lines to run this sample analysis.&lt;br /&gt;
::* '''''170415_Marga_GBS_iTRAQ_PSMs.txt''''', a tab-separated-values text file with identifications and quantitative data from a proteomics experiment&amp;lt;ref&amp;gt;Mateos-Hernández, L., et al. (2016) '''Quantitative proteomics reveals Piccolo as a candidate serological correlate of recovery from Guillain-Barré syndrome'''. ''Oncotarget'', 7(46): 74582–74591.&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
3) get a) the path of your working folder (where you have unzipped the file '''''170415_Marga_GBS_iTRAQ_PSMs.txt'''''), b) the path of the folder where you have unzipped the Windows executables, and c) the drive letter of the latter (C:, D:, etc.); modify the following lines in the '''''commands_test1.bat''''' file accordingly:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test1.bat''''' file: copy and paste (or drag and drop) the whole commands_test1.bat text into a [https://en.wikipedia.org/wiki/Cmd.exe command prompt window]. You can also just double-click the bat file; however, if an error arises, the CMD window created will close immediately, so the text of the error will not be available.&lt;br /&gt;
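If you prefer not to lose the error text, the batch file can also be launched from a small script that captures its output. This is only a sketch; run_commands_file is a hypothetical helper, not part of the SanXoT package.

```python
import subprocess

def run_commands_file(command):
    """Run a command and capture its output, so that any error text
    survives even after the console window has closed.
    Hypothetical helper, not part of SanXoT."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stdout + result.stderr

# On Windows one would call, for example:
# run_commands_file(["cmd", "/c", "commands_test1.bat"])
```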
&lt;br /&gt;
5) wait until it finishes. It should take 30-60 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) compare your results with the data in the folder SanXoT_test1_results, inside SanXoT_test1.zip (for example, using comparison software such as [https://en.wikipedia.org/wiki/Beyond_Compare Beyond Compare]). Only the files ending in *_log.txt and *_infoFile.txt should have differences, and only due to the different file paths and timestamps; all other features, such as variances, Levenberg-Marquardt steps, and options, should be identical.&lt;br /&gt;
&lt;br /&gt;
==Test 2: Experiment merging==&lt;br /&gt;
&lt;br /&gt;
Protein quantifications from different technical or biological replicates can be merged into experiment-independent quantifications, yielding the biological and technical variances. In this test we:&lt;br /&gt;
&lt;br /&gt;
* use the resulting data from Test 1&lt;br /&gt;
* perform the fundamental workflow for another parallel experiment&lt;br /&gt;
* merge both experiments&lt;br /&gt;
&lt;br /&gt;
This test makes use of four programs: '''[[Klibrate]]''', '''[[SanXoT]]''', '''[[SanXoTSieve]]''', and '''[[Cardenio]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this unit test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) use the Windows executables unzipped in step 1 of Test 1 ('''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]''').&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test2.zip SanXoT_test2.zip]'''. This contains:&lt;br /&gt;
&lt;br /&gt;
::* all the files generated in Test 1&lt;br /&gt;
::* the '''''tagFile.txt''''' (for '''[[Cardenio]]''')&lt;br /&gt;
::* the '''''commands_test2.bat'''''.&lt;br /&gt;
&lt;br /&gt;
3) remember to change the following lines in the '''''commands_test2.bat''''' file, as in step 3 for Test 1:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test2&amp;quot;&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test2.bat''''' file (do the same as in step 4 in Test 1).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. As in Test 1, this test should take 30-60 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) as in Test 1, the only files that are not identical should be the merge_logFile.txt from Cardenio, and the *_infoFile.txt from SanXoT, SanXoTSieve and Klibrate.&lt;br /&gt;
&lt;br /&gt;
==Test 3: Systems biology==&lt;br /&gt;
&lt;br /&gt;
The '''[[SanXoT software package]]''' can be used to perform systems biology analyses using the SBT (''Systems Biology Triangle'') as described&amp;lt;ref&amp;gt;García-Marqués, F., Trevisan-Herraz, M., Martínez-Martínez, S., Camafeita, E., Jorge, I., Lopez, J.A., Méndez-Barbero, N., Méndez-Ferrer, S., del Pozo, M.A., Ibáñez, B., Andrés, V., Sánchez-Madrid, F., Redondo, J.M., Bonzon-Kulichenko, E. and Vázquez, J. (2016) '''A novel systems-biology algorithm for the analysis of coordinated protein responses using quantitative proteomics'''. ''Molecular &amp;amp; Cellular Proteomics'', 15(5):1740-60.&amp;lt;/ref&amp;gt;. In this test we:&lt;br /&gt;
&lt;br /&gt;
* generate a relations file using data from the '''[https://en.wikipedia.org/wiki/DAVID DAVID bioinformatics resource]''' (you can download the data from them yourself, but for convenience we have already included it)&lt;br /&gt;
* use the resulting data from Test 2&lt;br /&gt;
* calculate a category-level fold-change&lt;br /&gt;
* perform the systems biology analysis.&lt;br /&gt;
&lt;br /&gt;
This test makes use of seven programs: '''[[Camacho]]''', '''[[SanXoT]]''', '''[[SanXoTSieve]]''', '''[[SanXoTSqueezer]]''', '''[[Sanson]]''', '''[[SanXoTGauss]]''', and '''[[Coordinometer]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) use the Windows executables unzipped in step 1 of Test 1 ('''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]''').&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test3.zip SanXoT_test3.zip]'''. This contains:&lt;br /&gt;
&lt;br /&gt;
::* all the files generated in Test 1 and Test 2&lt;br /&gt;
::* the tab-separated text file '''''SB_Homo_19dic-2017.txt''''' (a set of data downloaded from DAVID)&lt;br /&gt;
::* the '''''commands_test3.bat'''''.&lt;br /&gt;
&lt;br /&gt;
3) remember to change the following lines in the '''''commands_test3.bat''''' file, as in step 3 for Test 1 and Test 2:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test3&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Additionally, if you want to get the similarity graph from the category clustering algorithm ('''[[Sanson]]'''), you will need:&lt;br /&gt;
&lt;br /&gt;
* to have installed '''[https://en.wikipedia.org/wiki/Graphviz Graphviz]''' (open-source external software, which includes the interpreter for the [https://en.wikipedia.org/wiki/DOT_%28graph_description_language%29 DOT language], used by Sanson to generate the graphs for the clusters)&lt;br /&gt;
* in the program folder, edit the ''dot.ini'' file, changing the following line so that it contains the path to the ''bin'' folder of your Graphviz installation (this path may differ between users):&lt;br /&gt;
&lt;br /&gt;
 dotlocation = C:\Program Files (x86)\Graphviz2.36\bin&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test3.bat''''' file (do the same as in step 4 in Test 1).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. This test should take 10-20 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) as in Test 1 and Test 2, the only files that are not identical should be the *_logFile.txt from Camacho, SanXoTSqueezer, Sanson, SanXoTGauss and Coordinometer, as well as the *_infoFile.txt from SanXoT and SanXoTSieve. Additionally, if you use a different version of Graphviz (this test used Graphviz 2.36), you might find differences in the file sanson_simGraph.pdf (a PDF containing the final graph with the category clusters).&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:SanXoT software package]]&lt;br /&gt;
[[Category:unit tests]]&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=Unit_tests_for_SanXoT&amp;diff=623</id>
		<title>Unit tests for SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=Unit_tests_for_SanXoT&amp;diff=623"/>
				<updated>2018-08-14T14:56:23Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;We present here three unit tests for Windows that can be performed using the following programs in the '''[[SanXoT software package]]''':&lt;br /&gt;
&lt;br /&gt;
* '''[[Aljamia]]''' (data parser)&lt;br /&gt;
* '''[[Klibrate]]''' (weight calibrator)&lt;br /&gt;
* '''[[SanXoT]]''' (integration and variance calculation)&lt;br /&gt;
* '''[[SanXoTSieve]]''' (removal of outliers)&lt;br /&gt;
* '''[[Cardenio]]''' (merger of experiments)&lt;br /&gt;
* '''[[SanXoTSqueezer]]''' (category filter)&lt;br /&gt;
* '''[[SanXoTGauss]]''' (category cumulative gaussian graphs)&lt;br /&gt;
* '''[[Sanson]]''' (category clustering)&lt;br /&gt;
* '''[[Coordinometer]]''' (calculator of the degree of coordination)&lt;br /&gt;
&lt;br /&gt;
For detailed explanations, check: '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
Note that links to the SanXoT Windows standalone executables are provided in this unit test, but download links to the source code and up-to-date versions of SanXoT are available at: '''[[SanXoT software package]]'''.&lt;br /&gt;
&lt;br /&gt;
==Test 1: The fundamental workflow==&lt;br /&gt;
&lt;br /&gt;
The '''[[fundamental workflow]]''' consists of the steps to quantify proteins using PSMs and quantitative information, based on the WSPP (''Weighted Spectrum, Peptide and Protein'') statistical model, as described&amp;lt;ref&amp;gt;Navarro, P., Trevisan-Herraz, M., Bonzon-Kulichenko, E., Núñez, E., Martínez-Acedo, P., Pérez-Hernández, D., Jorge, I., Mesa, R., Calvo, E., Carrascal, M., Hernáez, M.L., García, F., Bárcena, J.A., Ashman, K., Abian, J., Gil, C., Redondo, J.M. and Vázquez, J. (2014) '''General statistical framework for quantitative proteomics by stable isotope labeling'''. ''Journal of proteome research'', 13, 1234-1247.&amp;lt;/ref&amp;gt;, using the following steps:&lt;br /&gt;
&lt;br /&gt;
* calibration of spectra&lt;br /&gt;
* integrating from spectra to peptides&lt;br /&gt;
* integrating from peptides to proteins&lt;br /&gt;
* quantifying proteins&lt;br /&gt;
&lt;br /&gt;
This test makes use of four programs: '''[[Aljamia]]''', '''[[Klibrate]]''', '''[[SanXoT]]''', and '''[[SanXoTSieve]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this unit test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) download the Windows executables, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]'''. Unzip the whole content into a folder specific to the program.&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test1.zip SanXoT_test1.zip]'''. Unzip to a working folder. You should have two folders, one with the input for this test, and another with the expected results. In the former, you will find two text files:&lt;br /&gt;
&lt;br /&gt;
::* '''''commands_test1.bat''''', a Windows batch file with the command lines to run this sample analysis.&lt;br /&gt;
::* '''''170415_Marga_GBS_iTRAQ_PSMs.txt''''', a tab-separated-values text file with identifications and quantitative data from a proteomics experiment&amp;lt;ref&amp;gt;Mateos-Hernández, L., et al. (2016) '''Quantitative proteomics reveals Piccolo as a candidate serological correlate of recovery from Guillain-Barré syndrome'''. ''Oncotarget'', 7(46): 74582–74591.&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
3) get a) the path of your working folder (where you have unzipped the file '''''170415_Marga_GBS_iTRAQ_PSMs.txt'''''), b) the path of the folder where you have unzipped the Windows executables, and c) the drive letter of the latter (C:, D:, etc.); modify the following lines in the '''''commands_test1.bat''''' file accordingly:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test1.bat''''' file: copy and paste (or drag and drop) the whole commands_test1.bat text into a [https://en.wikipedia.org/wiki/Cmd.exe command prompt window]. You can also just double-click the bat file; however, if an error arises, the CMD window created will close immediately, so the text of the error will not be available.&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. It should take 30-60 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) compare your results with the data in the folder SanXoT_test1_results, inside SanXoT_test1.zip (for example, using comparison software such as [https://en.wikipedia.org/wiki/Beyond_Compare Beyond Compare]). Only the files ending in *_log.txt and *_infoFile.txt should have differences, and only due to the different file paths and timestamps; all other features, such as variances, Levenberg-Marquardt steps, and options, should be identical.&lt;br /&gt;
&lt;br /&gt;
==Test 2: Experiment merging==&lt;br /&gt;
&lt;br /&gt;
Protein quantifications from different technical or biological replicates can be merged into experiment-independent quantifications, yielding the biological and technical variances. In this test we:&lt;br /&gt;
&lt;br /&gt;
* use the resulting data from Test 1&lt;br /&gt;
* perform the fundamental workflow for another parallel experiment&lt;br /&gt;
* merge both experiments&lt;br /&gt;
&lt;br /&gt;
This test makes use of four programs: '''[[Klibrate]]''', '''[[SanXoT]]''', '''[[SanXoTSieve]]''', and '''[[Cardenio]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this unit test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) use the Windows executables unzipped in step 1 of Test 1 ('''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]''').&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test2.zip SanXoT_test2.zip]'''. This contains:&lt;br /&gt;
&lt;br /&gt;
::* all the files generated in Test 1&lt;br /&gt;
::* the '''''tagFile.txt''''' (for '''[[Cardenio]]''')&lt;br /&gt;
::* the '''''commands_test2.bat'''''.&lt;br /&gt;
&lt;br /&gt;
3) remember to change the following lines in the '''''commands_test2.bat''''' file, as in step 3 for Test 1:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test2&amp;quot;&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test2.bat''''' file (do the same as in step 4 in Test 1).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. As in Test 1, this test should take 30-60 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) as in Test 1, the only files that are not identical should be the merge_logFile.txt from Cardenio, and the *_infoFile.txt from SanXoT, SanXoTSieve and Klibrate.&lt;br /&gt;
&lt;br /&gt;
==Test 3: Systems biology==&lt;br /&gt;
&lt;br /&gt;
The '''[[SanXoT software package]]''' can be used to perform systems biology analyses using the SBT (''Systems Biology Triangle'') as described&amp;lt;ref&amp;gt;García-Marqués, F., Trevisan-Herraz, M., Martínez-Martínez, S., Camafeita, E., Jorge, I., Lopez, J.A., Méndez-Barbero, N., Méndez-Ferrer, S., del Pozo, M.A., Ibáñez, B., Andrés, V., Sánchez-Madrid, F., Redondo, J.M., Bonzon-Kulichenko, E. and Vázquez, J. (2016) '''A novel systems-biology algorithm for the analysis of coordinated protein responses using quantitative proteomics'''. ''Molecular &amp;amp; Cellular Proteomics'', 15(5):1740-60.&amp;lt;/ref&amp;gt;. In this test we:&lt;br /&gt;
&lt;br /&gt;
* generate a relations file using data from the '''[https://en.wikipedia.org/wiki/DAVID DAVID bioinformatics resource]''' (you can download the data from them yourself, but for convenience we have already included it)&lt;br /&gt;
* use the resulting data from Test 2&lt;br /&gt;
* calculate a category-level fold-change&lt;br /&gt;
* perform the systems biology analysis.&lt;br /&gt;
&lt;br /&gt;
This test makes use of seven programs: '''[[Camacho]]''', '''[[SanXoT]]''', '''[[SanXoTSieve]]''', '''[[SanXoTSqueezer]]''', '''[[Sanson]]''', '''[[SanXoTGauss]]''', and '''[[Coordinometer]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) use the Windows executables unzipped in step 1 of Test 1 ('''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]''').&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test3.zip SanXoT_test3.zip]'''. This contains:&lt;br /&gt;
&lt;br /&gt;
::* all the files generated in Test 1 and Test 2&lt;br /&gt;
::* the tab-separated text file '''''SB_Homo_19dic-2017.txt''''' (a set of data downloaded from DAVID)&lt;br /&gt;
::* the '''''commands_test3.bat'''''.&lt;br /&gt;
&lt;br /&gt;
3) remember to change the following lines in the '''''commands_test3.bat''''' file, as in step 3 for Test 1 and Test 2:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test3&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Additionally, if you want to get the similarity graph from the category clustering algorithm ('''[[Sanson]]'''), you will need:&lt;br /&gt;
&lt;br /&gt;
* to have installed '''[https://en.wikipedia.org/wiki/Graphviz Graphviz]''' (open-source external software, which includes the interpreter for the [https://en.wikipedia.org/wiki/DOT_%28graph_description_language%29 DOT language], used by Sanson to generate the graphs for the clusters)&lt;br /&gt;
* in the program folder, edit the ''dot.ini'' file, changing the following line so that it contains the path to the ''bin'' folder of your Graphviz installation (this path may differ between users):&lt;br /&gt;
&lt;br /&gt;
 dotlocation = C:\Program Files (x86)\Graphviz2.36\bin&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test3.bat''''' file (do the same as in step 4 in Test 1).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. This test should take 10-20 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) as in Test 1 and Test 2, the only files that are not identical should be the *_logFile.txt from Camacho, SanXoTSqueezer, Sanson, SanXoTGauss and Coordinometer, as well as the *_infoFile.txt from SanXoT and SanXoTSieve. Additionally, if you use a different version of Graphviz (this test used Graphviz 2.36), you might find differences in the file sanson_simGraph.pdf (a PDF containing the final graph with the category clusters).&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:SanXoT software package]]&lt;br /&gt;
[[Category:unit tests]]&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=Unit_tests_for_SanXoT&amp;diff=622</id>
		<title>Unit tests for SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=Unit_tests_for_SanXoT&amp;diff=622"/>
				<updated>2018-08-14T14:54:23Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: /* Test 3: Systems biology */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;We present here three unit tests for Windows that can be performed using the following programs in the '''[[SanXoT software package]]''':&lt;br /&gt;
&lt;br /&gt;
* '''[[Aljamia]]''' (data parser)&lt;br /&gt;
* '''[[Klibrate]]''' (weight calibrator)&lt;br /&gt;
* '''[[SanXoT]]''' (integration and variance calculation)&lt;br /&gt;
* '''[[SanXoTSieve]]''' (removal of outliers)&lt;br /&gt;
* '''[[Cardenio]]''' (merger of experiments)&lt;br /&gt;
* '''[[SanXoTSqueezer]]''' (category filter)&lt;br /&gt;
* '''[[SanXoTGauss]]''' (category cumulative gaussian graphs)&lt;br /&gt;
* '''[[Sanson]]''' (category clustering)&lt;br /&gt;
* '''[[Coordinometer]]''' (calculator of the degree of coordination)&lt;br /&gt;
&lt;br /&gt;
For detailed explanations, check: '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
Note that links to the SanXoT Windows standalone executables are provided in this unit test, but download links to the source code and up-to-date versions of SanXoT are available at: '''[[SanXoT software package]]'''.&lt;br /&gt;
&lt;br /&gt;
==Test 1: The fundamental workflow==&lt;br /&gt;
&lt;br /&gt;
The '''[[fundamental workflow]]''' consists of the steps to quantify proteins using PSMs and quantitative information, based on the WSPP (''Weighted Spectrum, Peptide and Protein'') statistical model, as described&amp;lt;ref&amp;gt;Navarro, P., Trevisan-Herraz, M., Bonzon-Kulichenko, E., Núñez, E., Martínez-Acedo, P., Pérez-Hernández, D., Jorge, I., Mesa, R., Calvo, E., Carrascal, M., Hernáez, M.L., García, F., Bárcena, J.A., Ashman, K., Abian, J., Gil, C., Redondo, J.M. and Vázquez, J. (2014) '''General statistical framework for quantitative proteomics by stable isotope labeling'''. ''Journal of proteome research'', 13, 1234-1247.&amp;lt;/ref&amp;gt;, using the following steps:&lt;br /&gt;
&lt;br /&gt;
* calibration of spectra&lt;br /&gt;
* integrating from spectra to peptides&lt;br /&gt;
* integrating from peptides to proteins&lt;br /&gt;
* quantifying proteins&lt;br /&gt;
&lt;br /&gt;
This test makes use of four programs: '''[[Aljamia]]''', '''[[Klibrate]]''', '''[[SanXoT]]''', and '''[[SanXoTSieve]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this unit test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) download the Windows executables, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]'''. Unzip the whole content into a folder specific to the program.&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test1.zip SanXoT_test1.zip]'''. Unzip to a working folder. You should have two folders, one with the input for this test, and another with the expected results. In the former, you will find two text files:&lt;br /&gt;
&lt;br /&gt;
::* '''''commands_test1.bat''''', a Windows batch file with the command lines to run this sample analysis.&lt;br /&gt;
::* '''''170415_Marga_GBS_iTRAQ_PSMs.txt''''', a tab-separated-values text file with identifications and quantitative data from a proteomics experiment&amp;lt;ref&amp;gt;Mateos-Hernández, L., et al. (2016) '''Quantitative proteomics reveals Piccolo as a candidate serological correlate of recovery from Guillain-Barré syndrome'''. ''Oncotarget'', 7(46): 74582–74591.&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
3) get a) the path of your working folder (where you have unzipped the file '''''170415_Marga_GBS_iTRAQ_PSMs.txt'''''), b) the path of the folder where you have unzipped the Windows executables, and c) the drive letter of the latter (C:, D:, etc.); modify the following lines in the '''''commands_test1.bat''''' file accordingly:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test1.bat''''' file: copy and paste (or drag and drop) the whole commands_test1.bat text into a [https://en.wikipedia.org/wiki/Cmd.exe command prompt window]. You can also just double-click the bat file; however, if an error arises, the CMD window created will close immediately, so the text of the error will not be available.&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. It should take 30-60 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) compare your results with the data in the folder SanXoT_test1_results, inside SanXoT_test1.zip (for example, using comparison software such as [https://en.wikipedia.org/wiki/Beyond_Compare Beyond Compare]). Only the files ending in *_log.txt and *_infoFile.txt should have differences, and only due to the different file paths and timestamps; all other features, such as variances, Levenberg-Marquardt steps, and options, should be identical.&lt;br /&gt;
&lt;br /&gt;
==Test 2: Experiment merging==&lt;br /&gt;
&lt;br /&gt;
Protein quantifications from different technical or biological replicates can be merged into experiment-independent quantifications, yielding the biological and technical variances. In this test we:&lt;br /&gt;
&lt;br /&gt;
* use the resulting data from Test 1&lt;br /&gt;
* perform the fundamental workflow for another parallel experiment&lt;br /&gt;
* merge both experiments&lt;br /&gt;
&lt;br /&gt;
This test makes use of four programs: '''[[Klibrate]]''', '''[[SanXoT]]''', '''[[SanXoTSieve]]''', and '''[[Cardenio]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this unit test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) use the Windows executables unzipped in step 1 of Test 1 ('''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]''').&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test2.zip SanXoT_test2.zip]'''. This contains:&lt;br /&gt;
&lt;br /&gt;
::* all the files generated in Test 1&lt;br /&gt;
::* the '''''tagFile.txt''''' (for '''[[Cardenio]]''')&lt;br /&gt;
::* the '''''commands_test2.bat'''''.&lt;br /&gt;
&lt;br /&gt;
3) remember to change the following lines in the '''''commands_test2.bat''''' file, as in step 3 for Test 1:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test2&amp;quot;&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test2.bat''''' file (do the same as in step 4 in Test 1).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. As in Test 1, this test should take 30-60 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) as in Test 1, the only files that are not identical should be the merge_logFile.txt from Cardenio, and the *_infoFile.txt from SanXoT, SanXoTSieve and Klibrate.&lt;br /&gt;
&lt;br /&gt;
==Test 3: Systems biology==&lt;br /&gt;
&lt;br /&gt;
The '''[[SanXoT software package]]''' can be used to perform systems biology analyses using the SBT (''Systems Biology Triangle'') as described&amp;lt;ref&amp;gt;García-Marqués, F., Trevisan-Herraz, M., Martínez-Martínez, S., Camafeita, E., Jorge, I., Lopez, J.A., Méndez-Barbero, N., Méndez-Ferrer, S., del Pozo, M.A., Ibáñez, B., Andrés, V., Sánchez-Madrid, F., Redondo, J.M., Bonzon-Kulichenko, E. and Vázquez, J. (2016) '''A novel systems-biology algorithm for the analysis of coordinated protein responses using quantitative proteomics'''. ''Molecular &amp;amp; Cellular Proteomics''.&amp;lt;/ref&amp;gt;. In this test we:&lt;br /&gt;
&lt;br /&gt;
* generate a relations file using data from the '''[https://en.wikipedia.org/wiki/DAVID DAVID bioinformatics resource]''' (you can download the data from them yourself, but for convenience we have already included it)&lt;br /&gt;
* use the resulting data from Test 2&lt;br /&gt;
* calculate a category-level fold-change&lt;br /&gt;
* perform the systems biology analysis.&lt;br /&gt;
&lt;br /&gt;
This test makes use of seven programs: '''[[Camacho]]''', '''[[SanXoT]]''', '''[[SanXoTSieve]]''', '''[[SanXoTSqueezer]]''', '''[[Sanson]]''', '''[[SanXoTGauss]]''', and '''[[Coordinometer]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) use the Windows executables unzipped in step 1 of Test 1 ('''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]''').&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test3.zip SanXoT_test3.zip]'''. This contains:&lt;br /&gt;
&lt;br /&gt;
::* all the files generated in Test 1 and Test 2&lt;br /&gt;
::* the tab-separated text file '''''SB_Homo_19dic-2017.txt''''' (a set of data downloaded from DAVID)&lt;br /&gt;
::* the '''''commands_test3.bat'''''.&lt;br /&gt;
&lt;br /&gt;
3) remember to change the following lines in the '''''commands_test3.bat''''' file, as in step 3 for Test 1 and Test 2:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test3&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Additionally, if you want to get the similarity graph from the category clustering algorithm ('''[[Sanson]]'''), you will need:&lt;br /&gt;
&lt;br /&gt;
* to have '''[https://en.wikipedia.org/wiki/Graphviz Graphviz]''' installed (external open-source software, which includes the interpreter for the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) DOT language], used by Sanson to generate the graphs for the clusters)&lt;br /&gt;
* from the program folder, modify the ''dot.ini'' file changing the following line to include the path to the ''bin''-folder of Graphviz, which could be different for each user:&lt;br /&gt;
&lt;br /&gt;
 dotlocation = C:\Program Files (x86)\Graphviz2.36\bin&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test3.bat''''' file (do the same as in step 4 in Test 1).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. This test should take 10-20 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) as in Test 1 and Test 2, the only files that are not identical should be the *_logFile.txt from Camacho, SanXoTSqueezer, Sanson, SanXoTGauss and Coordinometer, as well as the *_infoFile.txt from SanXoT and SanXoTSieve. Additionally, if you use a different version of Graphviz (this test used Graphviz 2.36), you might find differences in the file sanson_simGraph.pdf (a PDF containing the final graph with the category clusters).&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:SanXoT software package]]&lt;br /&gt;
[[Category:unit tests]]&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=Unit_tests_for_SanXoT&amp;diff=621</id>
		<title>Unit tests for SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=Unit_tests_for_SanXoT&amp;diff=621"/>
				<updated>2018-08-14T14:50:28Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: /* Test 2: Experiment merging */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;We present here three unit tests for Windows that can be performed using the following programs in the '''[[SanXoT software package]]''':&lt;br /&gt;
&lt;br /&gt;
* '''[[Aljamia]]''' (data parser)&lt;br /&gt;
* '''[[Klibrate]]''' (weight calibrator)&lt;br /&gt;
* '''[[SanXoT]]''' (integration and variance calculation)&lt;br /&gt;
* '''[[SanXoTSieve]]''' (removal of outliers)&lt;br /&gt;
* '''[[Cardenio]]''' (merger of experiments)&lt;br /&gt;
* '''[[SanXoTSqueezer]]''' (category filter)&lt;br /&gt;
* '''[[SanXoTGauss]]''' (category cumulative gaussian graphs)&lt;br /&gt;
* '''[[Sanson]]''' (category clustering)&lt;br /&gt;
* '''[[Coordinometer]]''' (calculator of the degree of coordination)&lt;br /&gt;
&lt;br /&gt;
For detailed explanations, check: '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
Note that links to the standalone Windows executables of SanXoT are provided in these unit tests; download links to the source code and to up-to-date versions of SanXoT are available at: '''[[SanXoT software package]]'''.&lt;br /&gt;
&lt;br /&gt;
==Test 1: The fundamental workflow==&lt;br /&gt;
&lt;br /&gt;
The '''[[fundamental workflow]]''' consists of the steps to quantify proteins using PSMs and quantitative information, based on the WSPP (''Weighted Spectrum, Peptide and Protein'') statistical model, as described&amp;lt;ref&amp;gt;Navarro, P., Trevisan-Herraz, M., Bonzon-Kulichenko, E., Núñez, E., Martínez-Acedo, P., Pérez-Hernández, D., Jorge, I., Mesa, R., Calvo, E., Carrascal, M., Hernáez, M.L., García, F., Bárcena, J.A., Ashman, K., Abian, J., Gil, C., Redondo, J.M. and Vázquez, J. (2014) '''General statistical framework for quantitative proteomics by stable isotope labeling'''. ''Journal of proteome research'', 13, 1234-1247.&amp;lt;/ref&amp;gt;, using the following steps:&lt;br /&gt;
&lt;br /&gt;
* calibrating spectra&lt;br /&gt;
* integrating from spectra to peptides&lt;br /&gt;
* integrating from peptides to proteins&lt;br /&gt;
* quantifying proteins&lt;br /&gt;
&lt;br /&gt;
This test makes use of four programs: '''[[Aljamia]]''', '''[[Klibrate]]''', '''[[SanXoT]]''', and '''[[SanXoTSieve]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this unit test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) download the Windows executables, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]'''. Unzip the whole content into a folder dedicated to the program.&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test1.zip SanXoT_test1.zip]'''. Unzip to a working folder. You should have two folders, one with the input for this test, and another with the expected results. In the former, you will find two text files:&lt;br /&gt;
&lt;br /&gt;
::* '''''commands_test1.bat''''', a Windows batch file with the command lines to run this sample analysis.&lt;br /&gt;
::* '''''170415_Marga_GBS_iTRAQ_PSMs.txt''''', a tab-separated-values text file with identifications and quantitative data from a proteomics experiment&amp;lt;ref&amp;gt;Mateos-Hernández, L., et al. (2016) '''Quantitative proteomics reveals Piccolo as a candidate serological correlate of recovery from Guillain-Barré syndrome'''. ''Oncotarget'', 7(46): 74582–74591.&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
3) get a) the path of your working folder (where you have unzipped the file '''''170415_Marga_GBS_iTRAQ_PSMs.txt'''''), b) the path of the folder where you have unzipped the Windows executables, and c) the drive unit of the latter (C:, D:, etc.); modify the following lines in the commands_test1.bat file accordingly:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test1.bat''''' file: copy and paste (or drag and drop) the whole commands_test1.bat text into a [https://en.wikipedia.org/wiki/Cmd.exe command prompt window] (you can also just double-click the .bat file; however, if an error arises, the CMD window created will close immediately, so the text of the error will not be available).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. It should take 30-60 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) compare your results with the data in the folder SanXoT_test1_results, included in SanXoT_test1.zip (for example, using comparison software such as [https://en.wikipedia.org/wiki/Beyond_Compare Beyond Compare]). Only the files ending in *_log.txt and *_infoFile.txt should show differences, and only in the file paths and the timestamps; all other features, such as variances, Levenberg-Marquardt steps, and options, should be identical.&lt;br /&gt;
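If you prefer to check the folders programmatically, the following Python sketch lists the files whose contents differ while skipping the log and info files that legitimately differ (this script is not part of the SanXoT package, and the folder names used below are just the ones from this test):&lt;br /&gt;

```python
# Hedged sketch (not part of SanXoT): list result files whose contents
# differ from the expected results, ignoring *_log.txt and *_infoFile.txt,
# which are expected to differ in paths and timestamps.
import filecmp
from pathlib import Path

def compare_results(produced, expected):
    """Return the names of files whose contents differ between two folders."""
    produced, expected = Path(produced), Path(expected)
    differing = []
    for ref in sorted(expected.glob("*")):
        if not ref.is_file() or ref.name.endswith(("_log.txt", "_infoFile.txt")):
            continue  # skip the files that legitimately differ
        candidate = produced / ref.name
        # shallow=False forces a byte-by-byte comparison
        if not candidate.is_file() or not filecmp.cmp(ref, candidate, shallow=False):
            differing.append(ref.name)
    return differing
```

For this test you would call, for example, compare_results("D:/SanXoT_test1", "D:/SanXoT_test1_results") and expect an empty list.&lt;br /&gt;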
&lt;br /&gt;
==Test 2: Experiment merging==&lt;br /&gt;
&lt;br /&gt;
Protein quantifications from different technical or biological replicates can be merged into experiment-independent quantifications, giving the biological and technical variance. In this test we:&lt;br /&gt;
&lt;br /&gt;
* use the resulting data from Test 1&lt;br /&gt;
* perform the fundamental workflow for another parallel experiment&lt;br /&gt;
* merge both experiments&lt;br /&gt;
&lt;br /&gt;
This test makes use of four programs: '''[[Klibrate]]''', '''[[SanXoT]]''', '''[[SanXoTSieve]]''', and '''[[Cardenio]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this unit test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) use the Windows executables unzipped in step 1 of Test 1 ('''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]''').&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test2.zip SanXoT_test2.zip]'''. This contains:&lt;br /&gt;
&lt;br /&gt;
::* all the files generated in Test 1&lt;br /&gt;
::* the '''''tagFile.txt''''' (for '''[[Cardenio]]''')&lt;br /&gt;
::* the '''''commands_test2.bat'''''.&lt;br /&gt;
&lt;br /&gt;
3) remember to change the following lines in the '''''commands_test2.bat''''' file, as in step 3 for Test 1:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test2&amp;quot;&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test2.bat''''' file (do the same as in step 4 in Test 1).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. As in Test 1, this test should take 30-60 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) as in Test 1, the only files that are not identical should be the merge_logFile.txt from Cardenio, and the *_infoFile.txt from SanXoT, SanXoTSieve and Klibrate.&lt;br /&gt;
&lt;br /&gt;
==Test 3: Systems biology==&lt;br /&gt;
&lt;br /&gt;
The SanXoT Software Package can be used to perform systems biology analyses using the SBT (''Systems Biology Triangle'') as described.&amp;lt;ref&amp;gt;García-Marqués, F., Trevisan-Herraz, M., Martínez-Martínez, S., Camafeita, E., Jorge, I., Lopez, J.A., Méndez-Barbero, N., Méndez-Ferrer, S., del Pozo, M.A., Ibáñez, B., Andrés, V., Sánchez-Madrid, F., Redondo, J.M., Bonzon-Kulichenko, E. and Vázquez, J. (2016) '''A novel systems-biology algorithm for the analysis of coordinated protein responses using quantitative proteomics'''. ''Molecular &amp;amp; Cellular Proteomics''.&amp;lt;/ref&amp;gt; In this test we:&lt;br /&gt;
&lt;br /&gt;
* generate a relations file using data from the '''[https://en.wikipedia.org/wiki/DAVID DAVID bioinformatics resource]''' (you can download the data yourself, but for your convenience it is already included)&lt;br /&gt;
* use the data from Test 2&lt;br /&gt;
* calculate a category-level fold-change&lt;br /&gt;
* perform the systems biology analysis.&lt;br /&gt;
&lt;br /&gt;
This test makes use of seven programs: '''[[Camacho]]''', '''[[SanXoT]]''', '''[[SanXoTSieve]]''', '''[[SanXoTSqueezer]]''', '''[[Sanson]]''', '''[[SanXoTGauss]]''', and '''[[Coordinometer]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) use the Windows executables unzipped in step 1 of Test 1 ('''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]''').&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test3.zip SanXoT_test3.zip]'''. This contains:&lt;br /&gt;
&lt;br /&gt;
::* all the files generated in Test 1 and Test 2&lt;br /&gt;
::* the tab-separated text file '''''SB_Homo_19dic-2017.txt''''' (a set of data downloaded from DAVID)&lt;br /&gt;
::* the text file '''''fastaHeadersNoSequences.fasta''''' (not essential, but important to provide FASTA headers as protein descriptors instead of accession numbers, making the results more human-readable for the biological interpretation; the whole FASTA file used for the identification can be used, but here we have removed the amino acid sequences to reduce file size)&lt;br /&gt;
::* the '''''commands_test3.bat'''''.&lt;br /&gt;
&lt;br /&gt;
3) remember to change the following lines in the '''''commands_test3.bat''''' file, as in step 3 for Test 1 and Test 2:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test3&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Additionally, if you want to get the similarity graph from the category clustering algorithm ('''[[Sanson]]'''), you will need:&lt;br /&gt;
&lt;br /&gt;
* to have '''[https://en.wikipedia.org/wiki/Graphviz Graphviz]''' installed (external open-source software, which includes the interpreter for the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) DOT language], used by Sanson to generate the graphs for the clusters)&lt;br /&gt;
* from the program folder, modify the ''dot.ini'' file changing the following line to include the path to the ''bin''-folder of Graphviz, which could be different for each user:&lt;br /&gt;
&lt;br /&gt;
 dotlocation = C:\Program Files (x86)\Graphviz2.36\bin&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test3.bat''''' file (do the same as in step 4 in Test 1).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. This test should take 10-20 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) as in Test 1 and Test 2, the only files that are not identical should be the *_logFile.txt from Camacho, SanXoTSqueezer, Sanson, SanXoTGauss and Coordinometer, as well as the *_infoFile.txt from SanXoT and SanXoTSieve. Additionally, if you use a different version of Graphviz (this test used Graphviz 2.36), you might find differences in the file sanson_simGraph.pdf (a PDF containing the final graph with the category clusters).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:SanXoT software package]]&lt;br /&gt;
[[Category:unit tests]]&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=Unit_tests_for_SanXoT&amp;diff=620</id>
		<title>Unit tests for SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=Unit_tests_for_SanXoT&amp;diff=620"/>
				<updated>2018-08-14T14:47:59Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: /* Test 1: The fundamental workflow */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;We present here three unit tests for Windows that can be performed using the following programs in the '''[[SanXoT software package]]''':&lt;br /&gt;
&lt;br /&gt;
* '''[[Aljamia]]''' (data parser)&lt;br /&gt;
* '''[[Klibrate]]''' (weight calibrator)&lt;br /&gt;
* '''[[SanXoT]]''' (integration and variance calculation)&lt;br /&gt;
* '''[[SanXoTSieve]]''' (removal of outliers)&lt;br /&gt;
* '''[[Cardenio]]''' (merger of experiments)&lt;br /&gt;
* '''[[SanXoTSqueezer]]''' (category filter)&lt;br /&gt;
* '''[[SanXoTGauss]]''' (category cumulative gaussian graphs)&lt;br /&gt;
* '''[[Sanson]]''' (category clustering)&lt;br /&gt;
* '''[[Coordinometer]]''' (calculator of the degree of coordination)&lt;br /&gt;
&lt;br /&gt;
For detailed explanations, check: '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
Note that links to the standalone Windows executables of SanXoT are provided in these unit tests; download links to the source code and to up-to-date versions of SanXoT are available at: '''[[SanXoT software package]]'''.&lt;br /&gt;
&lt;br /&gt;
==Test 1: The fundamental workflow==&lt;br /&gt;
&lt;br /&gt;
The '''[[fundamental workflow]]''' consists of the steps to quantify proteins using PSMs and quantitative information, based on the WSPP (''Weighted Spectrum, Peptide and Protein'') statistical model, as described&amp;lt;ref&amp;gt;Navarro, P., Trevisan-Herraz, M., Bonzon-Kulichenko, E., Núñez, E., Martínez-Acedo, P., Pérez-Hernández, D., Jorge, I., Mesa, R., Calvo, E., Carrascal, M., Hernáez, M.L., García, F., Bárcena, J.A., Ashman, K., Abian, J., Gil, C., Redondo, J.M. and Vázquez, J. (2014) '''General statistical framework for quantitative proteomics by stable isotope labeling'''. ''Journal of proteome research'', 13, 1234-1247.&amp;lt;/ref&amp;gt;, using the following steps:&lt;br /&gt;
&lt;br /&gt;
* calibrating spectra&lt;br /&gt;
* integrating from spectra to peptides&lt;br /&gt;
* integrating from peptides to proteins&lt;br /&gt;
* quantifying proteins&lt;br /&gt;
&lt;br /&gt;
This test makes use of four programs: '''[[Aljamia]]''', '''[[Klibrate]]''', '''[[SanXoT]]''', and '''[[SanXoTSieve]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this unit test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) download the Windows executables, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]'''. Unzip the whole content into a folder dedicated to the program.&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test1.zip SanXoT_test1.zip]'''. Unzip to a working folder. You should have two folders, one with the input for this test, and another with the expected results. In the former, you will find two text files:&lt;br /&gt;
&lt;br /&gt;
::* '''''commands_test1.bat''''', a Windows batch file with the command lines to run this sample analysis.&lt;br /&gt;
::* '''''170415_Marga_GBS_iTRAQ_PSMs.txt''''', a tab-separated-values text file with identifications and quantitative data from a proteomics experiment&amp;lt;ref&amp;gt;Mateos-Hernández, L., et al. (2016) '''Quantitative proteomics reveals Piccolo as a candidate serological correlate of recovery from Guillain-Barré syndrome'''. ''Oncotarget'', 7(46): 74582–74591.&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
3) get a) the path of your working folder (where you have unzipped the file '''''170415_Marga_GBS_iTRAQ_PSMs.txt'''''), b) the path of the folder where you have unzipped the Windows executables, and c) the drive unit of the latter (C:, D:, etc.); modify the following lines in the commands_test1.bat file accordingly:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test1&amp;quot;&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test1.bat''''' file: copy and paste (or drag and drop) the whole commands_test1.bat text into a [https://en.wikipedia.org/wiki/Cmd.exe command prompt window] (you can also just double-click the .bat file; however, if an error arises, the CMD window created will close immediately, so the text of the error will not be available).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. It should take 30-60 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) compare your results with the data in the folder SanXoT_test1_results, included in SanXoT_test1.zip (for example, using comparison software such as [https://en.wikipedia.org/wiki/Beyond_Compare Beyond Compare]). Only the files ending in *_log.txt and *_infoFile.txt should show differences, and only in the file paths and the timestamps; all other features, such as variances, Levenberg-Marquardt steps, and options, should be identical.&lt;br /&gt;
&lt;br /&gt;
==Test 2: Experiment merging==&lt;br /&gt;
&lt;br /&gt;
Protein quantifications from different technical or biological replicates can be merged into experiment-independent quantifications, giving the biological and technical variance. In this test we:&lt;br /&gt;
&lt;br /&gt;
* use the data from Test 1&lt;br /&gt;
* perform the fundamental workflow for another parallel experiment&lt;br /&gt;
* merge both experiments&lt;br /&gt;
&lt;br /&gt;
This test makes use of four programs: '''[[Klibrate]]''', '''[[SanXoT]]''', '''[[SanXoTSieve]]''', and '''[[Cardenio]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this unit test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) use the Windows executables unzipped in step 1 of Test 1 ('''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]''').&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test2.zip SanXoT_test2.zip]'''. This contains:&lt;br /&gt;
&lt;br /&gt;
::* all the files generated in Test 1&lt;br /&gt;
::* the '''''tagFile.txt''''' (for '''[[Cardenio]]''')&lt;br /&gt;
::* the '''''commands_test2.bat'''''.&lt;br /&gt;
&lt;br /&gt;
3) remember to change the following lines in the '''''commands_test2.bat''''' file, as in step 3 for Test 1:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test2&amp;quot;&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test2.bat''''' file (do the same as in step 4 in Test 1).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. As in Test 1, this test should take 30-60 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) as in Test 1, the only files that are not identical should be the merge_logFile.txt from Cardenio, and the *_infoFile.txt from SanXoT, SanXoTSieve and Klibrate.&lt;br /&gt;
&lt;br /&gt;
==Test 3: Systems biology==&lt;br /&gt;
&lt;br /&gt;
The SanXoT Software Package can be used to perform systems biology analyses using the SBT (''Systems Biology Triangle'') as described.&amp;lt;ref&amp;gt;García-Marqués, F., Trevisan-Herraz, M., Martínez-Martínez, S., Camafeita, E., Jorge, I., Lopez, J.A., Méndez-Barbero, N., Méndez-Ferrer, S., del Pozo, M.A., Ibáñez, B., Andrés, V., Sánchez-Madrid, F., Redondo, J.M., Bonzon-Kulichenko, E. and Vázquez, J. (2016) '''A novel systems-biology algorithm for the analysis of coordinated protein responses using quantitative proteomics'''. ''Molecular &amp;amp; Cellular Proteomics''.&amp;lt;/ref&amp;gt; In this test we:&lt;br /&gt;
&lt;br /&gt;
* generate a relations file using data from the '''[https://en.wikipedia.org/wiki/DAVID DAVID bioinformatics resource]''' (you can download the data yourself, but for your convenience it is already included)&lt;br /&gt;
* use the data from Test 2&lt;br /&gt;
* calculate a category-level fold-change&lt;br /&gt;
* perform the systems biology analysis.&lt;br /&gt;
&lt;br /&gt;
This test makes use of seven programs: '''[[Camacho]]''', '''[[SanXoT]]''', '''[[SanXoTSieve]]''', '''[[SanXoTSqueezer]]''', '''[[Sanson]]''', '''[[SanXoTGauss]]''', and '''[[Coordinometer]]'''.&lt;br /&gt;
&lt;br /&gt;
To run this test, follow these steps:&lt;br /&gt;
&lt;br /&gt;
1) use the Windows executables unzipped in step 1 of Test 1 ('''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT.zip SanXoT.zip]''').&lt;br /&gt;
&lt;br /&gt;
2) download the files for the unit test, '''[ftp://ftp.cnic.es/ftpsvc/pub/SanXoT_test3.zip SanXoT_test3.zip]'''. This contains:&lt;br /&gt;
&lt;br /&gt;
::* all the files generated in Test 1 and Test 2&lt;br /&gt;
::* the tab-separated text file '''''SB_Homo_19dic-2017.txt''''' (a set of data downloaded from DAVID)&lt;br /&gt;
::* the text file '''''fastaHeadersNoSequences.fasta''''' (not essential, but important to provide FASTA headers as protein descriptors instead of accession numbers, making the results more human-readable for the biological interpretation; the whole FASTA file used for the identification can be used, but here we have removed the amino acid sequences to reduce file size)&lt;br /&gt;
::* the '''''commands_test3.bat'''''.&lt;br /&gt;
&lt;br /&gt;
3) remember to change the following lines in the '''''commands_test3.bat''''' file, as in step 3 for Test 1 and Test 2:&lt;br /&gt;
&lt;br /&gt;
 set unit=D:&lt;br /&gt;
 set programFolder=&amp;quot;D:\SSP&amp;quot;&lt;br /&gt;
 set workingFolder=&amp;quot;D:\SanXoT_test3&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Additionally, if you want to get the similarity graph from the category clustering algorithm ('''[[Sanson]]'''), you will need:&lt;br /&gt;
&lt;br /&gt;
* to have '''[https://en.wikipedia.org/wiki/Graphviz Graphviz]''' installed (external open-source software, which includes the interpreter for the [https://en.wikipedia.org/wiki/DOT_(graph_description_language) DOT language], used by Sanson to generate the graphs for the clusters)&lt;br /&gt;
* from the program folder, modify the ''dot.ini'' file changing the following line to include the path to the ''bin''-folder of Graphviz, which could be different for each user:&lt;br /&gt;
&lt;br /&gt;
 dotlocation = C:\Program Files (x86)\Graphviz2.36\bin&lt;br /&gt;
&lt;br /&gt;
4) execute the '''''commands_test3.bat''''' file (do the same as in step 4 in Test 1).&lt;br /&gt;
&lt;br /&gt;
5) wait until it finishes. This test should take 10-20 seconds (for a regular PC with 64-bit Windows 10, 3.4 GHz, 32 GB RAM).&lt;br /&gt;
&lt;br /&gt;
6) as in Test 1 and Test 2, the only files that are not identical should be the *_logFile.txt from Camacho, SanXoTSqueezer, Sanson, SanXoTGauss and Coordinometer, as well as the *_infoFile.txt from SanXoT and SanXoTSieve. Additionally, if you use a different version of Graphviz (this test used Graphviz 2.36), you might find differences in the file sanson_simGraph.pdf (a PDF containing the final graph with the category clusters).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:SanXoT software package]]&lt;br /&gt;
[[Category:unit tests]]&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=Unit_test&amp;diff=619</id>
		<title>Unit test</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=Unit_test&amp;diff=619"/>
				<updated>2018-08-14T13:05:47Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: Redirected page to Unit tests&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;#redirect [[unit tests]]&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=618</id>
		<title>How to use SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=618"/>
				<updated>2018-08-13T09:14:00Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: /* Building the workflows */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Figure 1.png|thumb|400px|Examples of quantitative workflows constructed with modules from the SanXoT package: a '''[[fundamental workflow]]''', with the addition of PTM analysis and Systems Biology (above) and an integration of biological (or technical) replicates (below).]]&lt;br /&gt;
The '''[[SanXoT Software Package]]''' (SSP) allows the statistical analysis of quantitative proteomics data (obtained using other programs such as [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer]), using the WSPP model&amp;lt;ref name=navarro2014&amp;gt;Navarro, P., et al., ''General statistical framework for quantitative proteomics by stable isotope labeling''. J Proteome Res, 2014. 13(3): p. 1234-47.&amp;lt;/ref&amp;gt;. This model analyses relative (rather than absolute) quantitative values, e.g. the ratios between two quantitative measurements.&lt;br /&gt;
&lt;br /&gt;
The SSP contains a set of modules that can be used to assemble user-configurable quantitative workflows with very high flexibility.&lt;br /&gt;
&lt;br /&gt;
In the '''[[fundamental workflow]]''' the quantitative results at the scan level are integrated to quantify each one of the peptides. The quantitative data at the peptide level are then integrated to obtain quantitative protein values. Finally, the quantitative data at the protein level are integrated, providing the statistical significance of the abundance increase or decrease of each one of the proteins. An example of this basic workflow is provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
Among other possibilities, SanXoT allows building up workflows to automatically process very large numbers of quantitative experiments, to integrate results obtained from technical or biological replicates, or to perform Systems Biology analysis using the SBT model&amp;lt;ref name=garciamarques2016&amp;gt;Garcia-Marques, F., et al., ''A Novel Systems-Biology Algorithm for the Analysis of Coordinated Protein Responses Using Quantitative Proteomics''. Mol Cell Proteomics, 2016. 15(5): p. 1740-60.&amp;lt;/ref&amp;gt;. Examples of these applications are provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== General design ==&lt;br /&gt;
All SanXoT workflows contain two initial steps:&lt;br /&gt;
&lt;br /&gt;
# a first step where the raw quantitative information is extracted from the output of the quantification program and organized according to the '''[[How to use SanXoT#Data structure|data structure]]''' of SanXoT. You can see more about '''[[How to use SanXoT#preparing the data|how to prepare the data]]''' for SanXoT analysis.&lt;br /&gt;
# a second step where the data are processed to obtain prior weights. This process is called calibration. For more details, check [[How to use SanXoT#Calibrating the data|the section '''Calibrating the data''']].&lt;br /&gt;
&lt;br /&gt;
SanXoT then performs the WSPP statistics by iteratively applying the Generic Integration Algorithm (GIA)&amp;lt;ref name=garciamarques2016/&amp;gt; to the calibrated data. Each GIA integration performs an independent statistical analysis, and its output may be used as input for the following integration step.&lt;br /&gt;
&lt;br /&gt;
In the fundamental workflow, the GIA is applied to integrate scan-level data into peptide-level data, then peptide-level data into protein-level data, and finally to integrate the protein-level data.&lt;br /&gt;
&lt;br /&gt;
This modular structure allows users to design their own quantitative workflows by iteratively applying the GIA integrations. You can check some examples of advanced applications developed using this modular structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Fig1 calibration.PNG|thumb|Calibration of spectra using '''[[Klibrate]]'''.]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&lt;br /&gt;
[[File:Fig1 lowerLevel-to-higherLevel.PNG|thumb|210px|Integration from any lower level to any higher level, using the GIA.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td colspan=2&amp;gt;[[File:Fig1 fundamentalWorkflow.PNG|thumb|550px|The Fundamental Workflow, which uses the GIA from spectra to proteins.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Data structure ==&lt;br /&gt;
Each GIA integration needs two tab-separated text tables as input&amp;lt;ref name=garciamarques2016/&amp;gt;:&lt;br /&gt;
# '''a data file''' containing the quantitative data of the lower level. This file contains three columns: the identifier (a text string that is used to identify the element), the quantitative value (the base-2 logarithm of the ratio of the two quantitative measurements) and the prior weight (a parameter that measures the accuracy of the quantitative value before performing the integration).&lt;br /&gt;
# '''a relations table''', which establishes the correspondence between the identifiers in the lower level and those in the upper level. This file contains two columns: the first one contains the identifiers of the upper level, and the second one the identifiers of the lower level.&lt;br /&gt;
&lt;br /&gt;
Each GIA integration generates an output data file containing the quantitative data of the upper level. This file can be used as input for the next GIA integration. The relations table for the next integration, however, needs to be prepared in advance.&lt;br /&gt;
&lt;br /&gt;
Note that SanXoT checks the column position (not the column name). The first row, containing the header, is removed, so the user can use column names freely (as long as one header row is included).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Peptide dataFile.PNG|thumb|320px|Example of a data file for peptides. The first column contains the peptide identifiers (in this case, amino acid sequences using SEQUEST-style characters for PTM modifications), the second column the log2 of the ratio of intensities of two iTRAQ reporter ions, and the third column the prior weights (the inverse of the local variance of the quantitative value).]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;[[File:Pep2prot relationsTable.PNG|thumb|390px|Example of a peptide-to-protein relations table. The first column contains the higher-level identifiers (in this case, the FASTA header of each protein), the second column the lower-level identifiers (in this case, the same sequences used in the data table).]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
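Because both tables are plain tab-separated text, they can be generated with a few lines of code. As an illustrative sketch (all identifiers, ratios and weights below are invented, not taken from a real experiment), the following Python writes a minimal peptide-level data file and its matching peptide-to-protein relations table in the layout described above:&lt;br /&gt;

```python
# Illustrative sketch only: write a minimal GIA input pair in the
# tab-separated layout described above. All values are invented.
import csv

peptides = [
    # identifier, quantitative value (log2 ratio), prior weight
    ("AAGVNVEPFWPGLFAK", 0.12, 8.5),
    ("LVNEVTEFAK", -0.43, 12.1),
]
relations = [
    # higher-level identifier (protein), lower-level identifier (peptide)
    ("P12345", "AAGVNVEPFWPGLFAK"),
    ("P67890", "LVNEVTEFAK"),
]

with open("peptide_dataFile.txt", "w", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t")
    writer.writerow(["identifier", "log2ratio", "weight"])  # one header row
    writer.writerows(peptides)

with open("pep2prot_relationsTable.txt", "w", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t")
    writer.writerow(["protein", "peptide"])  # one header row
    writer.writerows(relations)
```

Since SanXoT checks column positions rather than column names, the header texts above can be chosen freely, but the column order must be the one described.&lt;br /&gt;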
&lt;br /&gt;
== Preparing the data ==&lt;br /&gt;
&lt;br /&gt;
The output of the software used to obtain the quantitative values must be converted to generate:&lt;br /&gt;
&lt;br /&gt;
# '''an uncalibrated data file''' containing three columns: identifier, quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and quality parameter. The quality parameter is a number used to rank the quantitative values according to their quality (in terms of variance, higher quality meaning a lower variance). In the case of isobaric labeling the third column contains the maximum intensity of the two reporter ions used to calculate the ratio; other kinds of quantification may require different parameters to account for the quality of the quantitative values&amp;lt;ref name=navarro2014/&amp;gt;.&lt;br /&gt;
# '''the relations tables''' of all the GIA integrations planned in the workflow. In the fundamental workflow these tables describe which scans will be integrated into each peptide and which peptides will be integrated into each protein.&lt;br /&gt;
These tables are easy to construct, thanks to their simple plain-text format. If the software used for quantification generates tab-separated text tables, one of the SanXoT modules ('''[[Aljamia]]''') may be used for this task. In the unit tests for SanXoT we include a workflow that uses Aljamia to generate all the required files from the output of Proteome Discoverer 2.1. Quantitative values produced in other formats may require simple user scripts to generate the files required by SanXoT.&lt;br /&gt;
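As an illustration of building the uncalibrated data file for isobaric labeling (a sketch under the assumption of one row per scan with two reporter-ion intensities; all names and values are hypothetical):

```python
import math

# Hypothetical raw rows: (scan identifier, reporter A intensity, reporter B intensity).
raw = [
    ("scan_001", 5000.0, 2500.0),
    ("scan_002",  400.0,  800.0),
]

uncalibrated = []
for scan, a, b in raw:
    x = math.log2(a / b)   # 2-base logarithm of the ratio of the two measurements
    quality = max(a, b)    # maximum reporter intensity as quality parameter
    uncalibrated.append((scan, x, quality))

# Tab-separated output: identifier, log2-ratio, quality parameter.
for row in uncalibrated:
    print("\t".join(str(v) for v in row))
```

Other quantification schemes would replace `max(a, b)` with whatever parameter best ranks their values by variance.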
&lt;br /&gt;
== Calibrating the data ==&lt;br /&gt;
Calibration is the process where the quality parameter (the parameter which is used to rank quantitative values according to their variance) is transformed into the prior weight, a parameter with full statistical meaning defined as the inverse of the local variance of the quantitative value in log2-ratio scale. This process is done using the '''[[Klibrate]]''' module and needs to be performed only in the first step of each workflow. For more information about calibration, see '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
Using prior weights is essential in GIA integrations, since this parameter not only reflects the local variance of each measurement, but also serves as the statistical weight used to calculate the weighted averages. Moreover, GIA integrations follow error propagation theory, so the output data file of each integration already contains true prior weights and does not need to be recalibrated.&lt;br /&gt;
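The role of prior weights can be illustrated with a minimal numerical sketch (not SanXoT's internal code; for simplicity the general variance of the integration is ignored):

```python
# Illustrative weighted average with weights defined as inverse local variances.
values  = [1.2, 0.8, 1.0]   # log2-ratios of three lower-level elements
weights = [4.0, 1.0, 2.0]   # prior weights (1 / local variance)

# Weighted average: more precise measurements pull the mean harder.
mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Error propagation: the integrated value's weight is the sum of the input
# weights, i.e. its variance is 1 / sum(weights), so the output of one
# integration carries a true prior weight and needs no recalibration.
out_weight = sum(weights)
print(mean, out_weight)
```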
&lt;br /&gt;
== Performing integrations and removing outliers ==&lt;br /&gt;
Each GIA integration takes the input data file from the lower level and the relations table, calculates the general variance of the integration (using a robust iterative method) and generates as output a data file containing the integrated quantifications of the upper level. The integration is done using the '''[[SanXoT]]''' module.&lt;br /&gt;
&lt;br /&gt;
In addition to the quantitative data, each GIA integration may generate several additional files which contain information about the integration. More detailed information may be found in '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
One of the advantages of the WSPP model is that it provides a straightforward method to eliminate outliers&amp;lt;ref name=navarro2014/&amp;gt; in each integration step. This is done by calculating standardized log2-ratios (z-values), which make it possible to estimate the probability that each element of the lower level is a significant outlier of the null z distribution, N(0,1), and also to obtain the associated FDR. The most extreme outliers may be removed sequentially and the integration repeated, until all outliers detected at a user-defined FDR level are removed.&lt;br /&gt;
&lt;br /&gt;
Although outlier removal is useful for increasing the confidence of the quantitative analysis, the proportion of outliers should be low (typically lower than 1%&amp;lt;ref name=navarro2014/&amp;gt;). A higher proportion may be indicative of technical problems in one of the quantification levels.&lt;br /&gt;
&lt;br /&gt;
Outlier removal is performed by the '''[[SanXoTSieve]]''' module by tagging the outlier elements in the relations table. &lt;br /&gt;
The entire process is done in three steps:&lt;br /&gt;
&lt;br /&gt;
# a first integration is done with '''[[SanXoT]]''', which serves to calculate the variance&lt;br /&gt;
# outliers are iteratively tagged in the relations table at a user-defined FDR-level using SanXoTSieve&lt;br /&gt;
# a second integration is done with SanXoT using the fixed variance calculated in the first step and removing from the integration the outliers tagged in the relations table.&lt;br /&gt;
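The tagging step above can be sketched as follows (a simplified illustration of z-value/FDR outlier detection, not the actual SanXoTSieve implementation; the z-values are hypothetical):

```python
import math

def two_tailed_p(z):
    # Two-tailed p-value of a z-score under the null N(0,1) distribution.
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical standardized log2-ratios (z-values) from one integration.
z_values = {"pep1": 0.3, "pep2": -1.1, "pep3": 4.2, "pep4": 0.7}

fdr_threshold = 0.01
pvals = sorted((two_tailed_p(z), name) for name, z in z_values.items())
n = len(pvals)

# Benjamini-Hochberg style rule: tag every element up to the largest rank k
# whose p-value satisfies p_(k) <= (k / n) * FDR threshold.
outliers = set()
for k, (p, name) in enumerate(pvals, start=1):
    if p <= k / n * fdr_threshold:
        outliers.update(nm for _, nm in pvals[:k])

print(sorted(outliers))
```

Tagged elements would then be excluded from the second SanXoT run, which reuses the variance fixed in the first step.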
&lt;br /&gt;
== Interpreting the results of GIA integrations ==&lt;br /&gt;
GIA integrations basically generate the quantitative data and their associated variances at the upper level. However, each integration also produces additional information that may be very useful for interpreting the results. The most important item is the general variance of the integration.&lt;br /&gt;
&lt;br /&gt;
The general variance is an estimate of the error introduced when the elements of the lower level are experimentally generated from those of the upper level. This parameter may have a different meaning depending on the nature of the data. For instance, when spectra are integrated into peptides, the general variance estimates the error associated with the measurement (independently of peptide and protein manipulation), and hence depends on the LC-MS setup. When peptides are integrated into proteins, the variance estimates the error produced at the time of peptide generation from their proteins (protein digestion, peptide desalting,...), independently of LC-MS analysis. When proteins are integrated into the grand mean, the variance mainly estimates the error produced during protein manipulation (extraction,…). Similarly, integrating technical and biological replicates allows the different error sources at the protein level to be estimated separately.&lt;br /&gt;
&lt;br /&gt;
== Building the workflows ==&lt;br /&gt;
Typically, workflows contain three main parts:&lt;br /&gt;
&lt;br /&gt;
# Data parsing to generate the uncalibrated data files and all relations tables needed. This may be performed using '''[[Aljamia]]''' or other scripts provided by the user.&lt;br /&gt;
# Data calibration to generate the (calibrated) data files. This is performed using '''[[Klibrate]]'''.&lt;br /&gt;
# A list of concatenated (or branched) GIA integrations. As explained above, each GIA is performed by sequentially running '''[[SanXoT]]''' (to calculate the general variance), '''[[SanXoTSieve]]''' (to tag outliers) and SanXoT (to integrate without outliers).&lt;br /&gt;
&lt;br /&gt;
Note that some specific GIA integrations require the use of other modules. These include experiment merging and Systems Biology analysis. See '''[[How to use SanXoT#advanced applications|advanced applications]]''' and '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
== Advanced applications ==&lt;br /&gt;
[[File:Fig1 replicateIntegration.PNG|thumb|600px|Integration of technical or biological replicates using the GIA.]]&lt;br /&gt;
SanXoT also allows the GIA to be used to integrate technical or biological replicates. In this case, scans are integrated into peptides and then into proteins separately for each replicate, and the protein-level data are then integrated to obtain protein averages (grouped data). A further GIA integration allows calculation of the statistical significance of the averaged protein changes.&lt;br /&gt;
&lt;br /&gt;
Merging of experiments may be performed with the aid of '''[[Cardenio]]''' (or using scripts prepared by the user). Cardenio is used to generate suitable relations tables to make protein averages from technical or biological replicates, and to normalize (&amp;quot;center&amp;quot;) the data to take into account the systematic quantitative error of each experiment before integration.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 SystemsBiologyTriangle.PNG|thumb|250px|The Systems Biology Triangle.]]&lt;br /&gt;
Systems biology analysis can be performed using the SBT model&amp;lt;ref name=garciamarques2016/&amp;gt; by building up a triangular workflow. Proteins are integrated into categories and categories are integrated to provide statistical significance.&lt;br /&gt;
&lt;br /&gt;
Systems biology information can be downloaded, for example, from [https://david.ncifcrf.gov/ DAVID]. The SanXoT Software Package includes the '''[[Camacho]]''' module to convert such DAVID tab-separated files into SanXoT relations files that can be used by the SBT model. Alternatively, users can generate relations files for systems biology (either custom or downloaded) just by creating tab-separated tables with two columns: the first one for categories, and the second one for proteins.&lt;br /&gt;
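For example, a custom two-column relations file for systems biology could be generated like this (a sketch; the category and protein names are hypothetical):

```python
import csv

# Hypothetical category-to-protein pairs: first column the category,
# second column the protein, one relation per row.
pairs = [
    ("GO:0006096 glycolysis", "sp|P04406|G3P_HUMAN"),
    ("GO:0006096 glycolysis", "sp|P00558|PGK1_HUMAN"),
    ("GO:0006457 protein folding", "sp|P10809|CH60_HUMAN"),
]

with open("cat2prot.tsv", "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(pairs)
```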
&lt;br /&gt;
Protein categories can be managed using '''[[SanXoTSqueezer]]''', '''[[Sanson]]''' and '''[[SanXoTGauss]]'''. For more detailed information see '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=617</id>
		<title>How to use SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=617"/>
				<updated>2018-08-13T09:12:16Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: /* Interpreting the results of GIA integrations */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Figure 1.png|thumb|400px|Examples of quantitative workflows constructed with modules from the SanXoT package: a '''[[fundamental workflow]]''', with the addition of PTM analysis and Systems Biology (above) and an integration of biological (or technical) replicates (below).]]&lt;br /&gt;
The '''[[SanXoT Software Package]]''' (SSP) allows the statistical analysis of quantitative proteomics data (obtained using other programs such as [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer]), using the WSPP model&amp;lt;ref name=navarro2014&amp;gt;Navarro, P., et al., ''General statistical framework for quantitative proteomics by stable isotope labeling''. J Proteome Res, 2014. 13(3): p. 1234-47.&amp;lt;/ref&amp;gt;. This model analyses relative (rather than absolute) quantitative values, e.g. the ratios between two quantitative measurements.&lt;br /&gt;
&lt;br /&gt;
The SSP contains a set of modules that can be used to assemble user-configurable quantitative workflows with very high flexibility.&lt;br /&gt;
&lt;br /&gt;
In the '''[[fundamental workflow]]''' the quantitative results at the scan level are integrated to quantify each one of the peptides. The quantitative data at the peptide level are then integrated to obtain quantitative protein values. Finally, the quantitative data at the protein level are integrated, providing the statistical significance of the abundance increase or decrease of each protein. An example of this basic workflow is provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
Among other possibilities, SanXoT allows building up workflows to automatically process very large numbers of quantitative experiments, to integrate results obtained from technical or biological replicates, or to perform Systems Biology analysis using the SBT model&amp;lt;ref name=garciamarques2016&amp;gt;Garcia-Marques, F., et al., ''A Novel Systems-Biology Algorithm for the Analysis of Coordinated Protein Responses Using Quantitative Proteomics''. Mol Cell Proteomics, 2016. 15(5): p. 1740-60.&amp;lt;/ref&amp;gt;. Examples of these applications are provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== General design ==&lt;br /&gt;
All SanXoT workflows contain two initial steps:&lt;br /&gt;
&lt;br /&gt;
# a first step where the raw quantitative information is extracted from the output of the quantification program and organized according to the '''[[How to use SanXoT#Data structure|data structure]]''' of SanXoT. You can see more about '''[[How to use SanXoT#preparing the data|how to prepare the data]]''' for SanXoT analysis.&lt;br /&gt;
# a second step where the data are processed to obtain prior weights. This process is called calibration. For more details, check [[How to use SanXoT#Calibrating the data|the section '''Calibrating the data''']].&lt;br /&gt;
&lt;br /&gt;
SanXoT then performs WSPP statistics by iteratively applying the Generic Integration Algorithm (GIA)&amp;lt;ref name=garciamarques2016/&amp;gt; to the calibrated data. Each GIA integration performs independent statistics, and its output may be used as input for the following integration step.&lt;br /&gt;
&lt;br /&gt;
In the fundamental workflow, the GIA is applied to integrate scan-level data into peptide-level data, to integrate peptide-level data into protein-level data, and finally to integrate the protein-level data.&lt;br /&gt;
&lt;br /&gt;
This modular structure allows users to design their own quantitative workflows by iteratively applying GIA integrations. You can check some examples of advanced applications developed using this modular structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Fig1 calibration.PNG|thumb|Calibration of spectra using '''[[Klibrate]]'''.]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&lt;br /&gt;
[[File:Fig1 lowerLevel-to-higherLevel.PNG|thumb|210px|Integration from any lower level to any higher level, using the GIA.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td colspan=2&amp;gt;[[File:Fig1 fundamentalWorkflow.PNG|thumb|550px|The Fundamental Workflow, which uses the GIA from spectra to proteins.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Data structure ==&lt;br /&gt;
Each GIA integration needs as input tab-separated text tables&amp;lt;ref name=garciamarques2016/&amp;gt;: &lt;br /&gt;
# '''a data file''' containing the quantitative data of the lower level. This file contains three columns: identifier (a text string that is used to identify the element), quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and prior weight (a parameter that measures the accuracy of the quantitative value before performing the integration).&lt;br /&gt;
# '''a relations table''', which establishes the correspondence of the identifiers in the lower level with those in the upper level. This file contains two columns: the first one contains the identifiers of the upper level and the second one those of the lower level.&lt;br /&gt;
&lt;br /&gt;
Each GIA integration generates an output data file containing the quantitative data of the upper level. This file can be used as input for the next GIA integration. The relations table for the next integration, however, needs to be prepared in advance.&lt;br /&gt;
&lt;br /&gt;
Note that SanXoT identifies columns by position (not by name). The first row, containing the header, is removed, so the user can choose column names freely (as long as one header row is included).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Peptide dataFile.PNG|thumb|320px|Example of data file for peptides. The first column contains peptide identifiers (in this case, amino acid sequences using SEQUEST-style characters for PTM modifications), the second the log2 of the ratio of intensities of two iTRAQ reporter ions, and the third the prior weights (inverse of the local variance of the quantitative value).]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;[[File:Pep2prot relationsTable.PNG|thumb|390px|Example of peptide-to-protein relations table. The first column contains the higher-level identifiers (in this case, the FASTA header of each protein) and the second the lower-level identifiers (in this case, the same sequences used in the data table).]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Preparing the data ==&lt;br /&gt;
&lt;br /&gt;
The output of the software used to obtain the quantitative values must be converted to generate:&lt;br /&gt;
&lt;br /&gt;
# '''an uncalibrated data file''' containing three columns: identifier, quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and quality parameter. The quality parameter is a number used to rank the quantitative values according to their quality (in terms of variance, higher quality meaning a lower variance). In the case of isobaric labeling the third column contains the maximum intensity of the two reporter ions used to calculate the ratio; other kinds of quantification may require different parameters to account for the quality of the quantitative values&amp;lt;ref name=navarro2014/&amp;gt;.&lt;br /&gt;
# '''the relations tables''' of all the GIA integrations planned in the workflow. In the fundamental workflow these tables describe which scans will be integrated into each peptide and which peptides will be integrated into each protein.&lt;br /&gt;
These tables are easy to construct, thanks to their simple plain-text format. If the software used for quantification generates tab-separated text tables, one of the SanXoT modules ('''[[Aljamia]]''') may be used for this task. In the unit tests for SanXoT we include a workflow that uses Aljamia to generate all the required files from the output of Proteome Discoverer 2.1. Quantitative values produced in other formats may require simple user scripts to generate the files required by SanXoT.&lt;br /&gt;
&lt;br /&gt;
== Calibrating the data ==&lt;br /&gt;
Calibration is the process where the quality parameter (the parameter which is used to rank quantitative values according to their variance) is transformed into the prior weight, a parameter with full statistical meaning defined as the inverse of the local variance of the quantitative value in log2-ratio scale. This process is done using the '''[[Klibrate]]''' module and needs to be performed only in the first step of each workflow. For more information about calibration, see '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
Using prior weights is essential in GIA integrations, since this parameter not only reflects the local variance of each measurement, but also serves as the statistical weight used to calculate the weighted averages. Moreover, GIA integrations follow error propagation theory, so the output data file of each integration already contains true prior weights and does not need to be recalibrated.&lt;br /&gt;
&lt;br /&gt;
== Performing integrations and removing outliers ==&lt;br /&gt;
Each GIA integration takes the input data file from the lower level and the relations table, calculates the general variance of the integration (using a robust iterative method) and generates as output a data file containing the integrated quantifications of the upper level. The integration is done using the '''[[SanXoT]]''' module.&lt;br /&gt;
&lt;br /&gt;
In addition to the quantitative data, each GIA integration may generate several additional files which contain information about the integration. More detailed information may be found in '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
One of the advantages of the WSPP model is that it provides a straightforward method to eliminate outliers&amp;lt;ref name=navarro2014/&amp;gt; in each integration step. This is done by calculating standardized log2-ratios (z-values), which make it possible to estimate the probability that each element of the lower level is a significant outlier of the null z distribution, N(0,1), and also to obtain the associated FDR. The most extreme outliers may be removed sequentially and the integration repeated, until all outliers detected at a user-defined FDR level are removed.&lt;br /&gt;
&lt;br /&gt;
Although outlier removal is useful for increasing the confidence of the quantitative analysis, the proportion of outliers should be low (typically lower than 1%&amp;lt;ref name=navarro2014/&amp;gt;). A higher proportion may be indicative of technical problems in one of the quantification levels.&lt;br /&gt;
&lt;br /&gt;
Outlier removal is performed by the '''[[SanXoTSieve]]''' module by tagging the outlier elements in the relations table. &lt;br /&gt;
The entire process is done in three steps:&lt;br /&gt;
&lt;br /&gt;
# a first integration is done with '''[[SanXoT]]''', which serves to calculate the variance&lt;br /&gt;
# outliers are iteratively tagged in the relations table at a user-defined FDR-level using SanXoTSieve&lt;br /&gt;
# a second integration is done with SanXoT using the fixed variance calculated in the first step and removing from the integration the outliers tagged in the relations table.&lt;br /&gt;
&lt;br /&gt;
== Interpreting the results of GIA integrations ==&lt;br /&gt;
GIA integrations basically generate the quantitative data and their associated variances at the upper level. However, each integration also produces additional information that may be very useful for interpreting the results. The most important item is the general variance of the integration.&lt;br /&gt;
&lt;br /&gt;
The general variance is an estimate of the error introduced when the elements of the lower level are experimentally generated from those of the upper level. This parameter may have a different meaning depending on the nature of the data. For instance, when spectra are integrated into peptides, the general variance estimates the error associated with the measurement (independently of peptide and protein manipulation), and hence depends on the LC-MS setup. When peptides are integrated into proteins, the variance estimates the error produced at the time of peptide generation from their proteins (protein digestion, peptide desalting,...), independently of LC-MS analysis. When proteins are integrated into the grand mean, the variance mainly estimates the error produced during protein manipulation (extraction,…). Similarly, integrating technical and biological replicates allows the different error sources at the protein level to be estimated separately.&lt;br /&gt;
&lt;br /&gt;
== Building the workflows ==&lt;br /&gt;
Typically, workflows contain three main parts:&lt;br /&gt;
&lt;br /&gt;
# Data parsing to generate the uncalibrated data files and all relations tables needed. This may be performed using '''[[Aljamia]]''' or other scripts provided by the user.&lt;br /&gt;
# Data calibration to generate the (calibrated) data files. This is performed using '''[[Klibrate]]'''.&lt;br /&gt;
# A list of concatenated (or branched) GIA integrations. As explained above, each GIA is performed by sequentially running '''[[SanXoT]]''' (to calculate the general variance), '''[[SanXoTSieve]]''' (to tag outliers) and SanXoT (to integrate without outliers).&lt;br /&gt;
&lt;br /&gt;
Note that some specific GIA integrations require the use of other modules. These include experiment merging and Systems Biology analysis. See '''[[How to use SanXoT#advanced applications|advanced applications]]''' and '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
== Advanced applications ==&lt;br /&gt;
[[File:Fig1 replicateIntegration.PNG|thumb|600px|Integration of technical or biological replicates using the GIA.]]&lt;br /&gt;
SanXoT also allows the GIA to be used to integrate technical or biological replicates. In this case, scans are integrated into peptides and then into proteins separately for each replicate, and the protein-level data are then integrated to obtain protein averages (grouped data). A further GIA integration allows calculation of the statistical significance of the averaged protein changes.&lt;br /&gt;
&lt;br /&gt;
Merging of experiments may be performed with the aid of '''[[Cardenio]]''' (or using scripts prepared by the user). Cardenio is used to generate suitable relations tables to make protein averages from technical or biological replicates, and to normalize (&amp;quot;center&amp;quot;) the data to take into account the systematic quantitative error of each experiment before integration.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 SystemsBiologyTriangle.PNG|thumb|250px|The Systems Biology Triangle.]]&lt;br /&gt;
Systems biology analysis can be performed using the SBT model&amp;lt;ref name=garciamarques2016/&amp;gt; by building up a triangular workflow. Proteins are integrated into categories and categories are integrated to provide statistical significance.&lt;br /&gt;
&lt;br /&gt;
Systems biology information can be downloaded, for example, from [https://david.ncifcrf.gov/ DAVID]. The SanXoT Software Package includes the '''[[Camacho]]''' module to convert such DAVID tab-separated files into SanXoT relations files that can be used by the SBT model. Alternatively, users can generate relations files for systems biology (either custom or downloaded) just by creating tab-separated tables with two columns: the first one for categories, and the second one for proteins.&lt;br /&gt;
&lt;br /&gt;
Protein categories can be managed using '''[[SanXoTSqueezer]]''', '''[[Sanson]]''' and '''[[SanXoTGauss]]'''. For more detailed information see '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=616</id>
		<title>How to use SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=616"/>
				<updated>2018-08-13T09:03:54Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: /* Performing integrations and removing outliers */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Figure 1.png|thumb|400px|Examples of quantitative workflows constructed with modules from the SanXoT package: a '''[[fundamental workflow]]''', with the addition of PTM analysis and Systems Biology (above) and an integration of biological (or technical) replicates (below).]]&lt;br /&gt;
The '''[[SanXoT Software Package]]''' (SSP) allows the statistical analysis of quantitative proteomics data (obtained using other programs such as [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer]), using the WSPP model&amp;lt;ref name=navarro2014&amp;gt;Navarro, P., et al., ''General statistical framework for quantitative proteomics by stable isotope labeling''. J Proteome Res, 2014. 13(3): p. 1234-47.&amp;lt;/ref&amp;gt;. This model analyses relative (rather than absolute) quantitative values, e.g. the ratios between two quantitative measurements.&lt;br /&gt;
&lt;br /&gt;
The SSP contains a set of modules that can be used to assemble user-configurable quantitative workflows with very high flexibility.&lt;br /&gt;
&lt;br /&gt;
In the '''[[fundamental workflow]]''' the quantitative results at the scan level are integrated to quantify each one of the peptides. The quantitative data at the peptide level are then integrated to obtain quantitative protein values. Finally, the quantitative data at the protein level are integrated, providing the statistical significance of the abundance increase or decrease of each protein. An example of this basic workflow is provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
Among other possibilities, SanXoT allows building up workflows to automatically process very large numbers of quantitative experiments, to integrate results obtained from technical or biological replicates, or to perform Systems Biology analysis using the SBT model&amp;lt;ref name=garciamarques2016&amp;gt;Garcia-Marques, F., et al., ''A Novel Systems-Biology Algorithm for the Analysis of Coordinated Protein Responses Using Quantitative Proteomics''. Mol Cell Proteomics, 2016. 15(5): p. 1740-60.&amp;lt;/ref&amp;gt;. Examples of these applications are provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== General design ==&lt;br /&gt;
All SanXoT workflows contain two initial steps:&lt;br /&gt;
&lt;br /&gt;
# a first step where the raw quantitative information is extracted from the output of the quantification program and organized according to the '''[[How to use SanXoT#Data structure|data structure]]''' of SanXoT. You can see more about '''[[How to use SanXoT#preparing the data|how to prepare the data]]''' for SanXoT analysis.&lt;br /&gt;
# a second step where the data are processed to obtain prior weights. This process is called calibration. For more details, check [[How to use SanXoT#Calibrating the data|the section '''Calibrating the data''']].&lt;br /&gt;
&lt;br /&gt;
SanXoT then performs WSPP statistics by iteratively applying the Generic Integration Algorithm (GIA)&amp;lt;ref name=garciamarques2016/&amp;gt; to the calibrated data. Each GIA integration performs independent statistics, and its output may be used as input for the following integration step.&lt;br /&gt;
&lt;br /&gt;
In the fundamental workflow, the GIA is applied to integrate scan-level data into peptide-level data, to integrate peptide-level data into protein-level data, and finally to integrate the protein-level data.&lt;br /&gt;
&lt;br /&gt;
This modular structure allows users to design their own quantitative workflows by iteratively applying GIA integrations. You can check some examples of advanced applications developed using this modular structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Fig1 calibration.PNG|thumb|Calibration of spectra using '''[[Klibrate]]'''.]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&lt;br /&gt;
[[File:Fig1 lowerLevel-to-higherLevel.PNG|thumb|210px|Integration from any lower level to any higher level, using the GIA.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td colspan=2&amp;gt;[[File:Fig1 fundamentalWorkflow.PNG|thumb|550px|The Fundamental Workflow, which uses the GIA from spectra to proteins.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Data structure ==&lt;br /&gt;
Each GIA integration needs as input tab-separated text tables&amp;lt;ref name=garciamarques2016/&amp;gt;: &lt;br /&gt;
# '''a data file''' containing the quantitative data of the lower level. This file contains three columns: identifier (a text string that is used to identify the element), quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and prior weight (a parameter that measures the accuracy of the quantitative value before performing the integration).&lt;br /&gt;
# '''a relations table''', which establishes the correspondence of the identifiers in the lower level with those in the upper level. This file contains two columns: the first one contains the identifiers of the upper level and the second one those of the lower level.&lt;br /&gt;
&lt;br /&gt;
Each GIA integration generates an output data file containing the quantitative data of the upper level. This file can be used as input for the next GIA integration. The relations table for the next integration, however, needs to be prepared in advance.&lt;br /&gt;
&lt;br /&gt;
Note that SanXoT identifies columns by position (not by name). The first row, containing the header, is removed, so the user can choose column names freely (as long as one header row is included).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Peptide dataFile.PNG|thumb|320px|Example of data file for peptides. The first column contains peptide identifiers (in this case, amino acid sequences using SEQUEST-style characters for PTM modifications), the second the log2 of the ratio of intensities of two iTRAQ reporter ions, and the third the prior weights (inverse of the local variance of the quantitative value).]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;[[File:Pep2prot relationsTable.PNG|thumb|390px|Example of peptide-to-protein relations table. The first column contains the higher-level identifiers (in this case, the FASTA header of each protein) and the second the lower-level identifiers (in this case, the same sequences used in the data table).]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Preparing the data ==&lt;br /&gt;
&lt;br /&gt;
The output of the software used to obtain the quantitative values must be converted to generate:&lt;br /&gt;
&lt;br /&gt;
# '''an uncalibrated data file''' containing three columns: identifier, quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and quality parameter. The quality parameter is a number used to rank the quantitative values according to their quality (in terms of variance, higher quality meaning a lower variance). In the case of isobaric labeling the third column contains the maximum intensity of the two reporter ions used to calculate the ratio; other kinds of quantification may need other parameters to account for the quality of the quantitative values&amp;lt;ref name=navarro2014/&amp;gt;.&lt;br /&gt;
# '''the relations tables''' of all the GIA integrations planned in the workflow. In the fundamental workflow these tables describe which scans will be integrated into each peptide and which peptides will be integrated into each protein.&lt;br /&gt;
These tables are easy to construct, due to their simple plain-text format. If the software used for quantification generates tab-separated text tables, one of the SanXoT modules ('''[[Aljamia]]''') may be used for this task. In the unit tests for SanXoT we include a workflow that uses Aljamia to generate all the required files from the output of Proteome Discoverer 2.1. Quantitative values produced in other formats may require other simple scripts to generate the files required by SanXoT.&lt;br /&gt;
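For isobaric labeling, the uncalibrated rows can be derived from the reporter-ion intensities as described above. The following sketch assumes a report with two reporter intensities per scan (the field names and values here are invented):

```python
import math

# Invented reporter-ion intensities from a quantification report.
scans = [
    {"scan": "Exp1-2034", "rep_a": 15200.0, "rep_b": 11800.0},
    {"scan": "Exp1-2051", "rep_a": 880.0,   "rep_b": 1210.0},
]

# Uncalibrated data file rows: identifier, 2-base logarithm of the
# ratio, and, for isobaric labeling, the larger of the two reporter
# intensities as the quality parameter.
uncalibrated = [
    (s["scan"],
     math.log2(s["rep_a"] / s["rep_b"]),
     max(s["rep_a"], s["rep_b"]))
    for s in scans
]
```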
&lt;br /&gt;
== Calibrating the data ==&lt;br /&gt;
Calibration is the process by which the quality parameter (the parameter used to rank quantitative values according to their variance) is transformed into the prior weight, a parameter with full statistical meaning, defined as the inverse of the local variance of the quantitative value in log2-ratio scale. This process is done using the '''[[Klibrate]]''' module and needs to be performed only in the first step of each workflow. For more information about calibration, see '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
Using prior weights is essential in GIA integrations, since this parameter not only reflects the local variance of the measurement, but is also used as the statistical weight to calculate the weighted averages. In addition, GIA integrations follow error propagation theory, so the output data file of each integration contains true prior weights and does not need to be recalibrated.&lt;br /&gt;
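To make the role of the prior weight concrete, here is a minimal sketch of the core arithmetic only (invented numbers): each value is weighted by the inverse of its local variance, and error propagation gives the integrated variance as the inverse of the summed weights. The full GIA additionally estimates a general variance and handles outliers, which this sketch omits.

```python
# Weighted average of lower-level values using prior weights
# (inverse local variances); returns the integrated value and its
# propagated variance. Core arithmetic only, not the full GIA.
def integrate(values, weights):
    total_w = sum(weights)
    mean = sum(w * x for x, w in zip(values, weights)) / total_w
    return mean, 1.0 / total_w

# Two invented peptide log2-ratios with prior weights 100 and 25;
# the better-measured value dominates the integrated result.
mean, var = integrate([0.40, 0.50], [100.0, 25.0])
```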
&lt;br /&gt;
== Performing integrations and removing outliers ==&lt;br /&gt;
Each GIA integration takes the input data file from the lower level and the relations table, calculates the general variance of the integration (using a robust iterative method) and generates as output a data file containing the integrated quantifications of the upper level. The integration is done using the '''[[SanXoT]]''' module.&lt;br /&gt;
&lt;br /&gt;
In addition to the quantitative data, each GIA integration may generate several additional files which contain information about the integration. More detailed information may be found in '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
One of the advantages of the WSPP model is that it provides a straightforward method to eliminate outliers&amp;lt;ref name=navarro2014/&amp;gt; in each integration step. This is done by calculating standardized log2-ratios (z-values), which make it possible to estimate the probability that each element of the lower level is a significant outlier of the z distribution (e.g., N(0,1)) and to obtain the associated FDR. The most extreme outliers may be removed sequentially and the integration repeated, until all outliers below a user-defined FDR level are removed.&lt;br /&gt;
&lt;br /&gt;
Although outlier removal is useful for increasing the confidence of the quantitative analysis, the proportion of outliers should be low (typically lower than 1%&amp;lt;ref name=navarro2014/&amp;gt;). A higher proportion may indicate technical problems in one of the quantification levels.&lt;br /&gt;
&lt;br /&gt;
Outlier removal is performed by the '''[[SanXoTSieve]]''' module by tagging the outlier elements in the relations table. &lt;br /&gt;
The entire process is done in three steps:&lt;br /&gt;
&lt;br /&gt;
# a first integration is done with '''[[SanXoT]]''', which serves to calculate the variance&lt;br /&gt;
# outliers are iteratively tagged in the relations table at a user-defined FDR-level using SanXoTSieve&lt;br /&gt;
# a second integration is done with SanXoT using the fixed variance calculated in the first step and removing from the integration the outliers tagged in the relations table.&lt;br /&gt;
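The z-value screening in step 2 can be sketched as follows. This is only an illustration with invented z-values; a Benjamini-Hochberg-style FDR rule is assumed here, and the exact procedure implemented by SanXoTSieve may differ:

```python
import math

def two_sided_p(z):
    # Two-sided p-value of a standardized log2-ratio under N(0,1).
    return math.erfc(abs(z) / math.sqrt(2.0))

def tag_outliers(zvalues, fdr=0.01):
    # Benjamini-Hochberg-style screen (an assumption for illustration):
    # rank elements by p-value and tag those under the FDR line.
    ranked = sorted(zvalues.items(), key=lambda kv: two_sided_p(kv[1]))
    n = len(ranked)
    tagged = set()
    for rank, (name, z) in enumerate(ranked, start=1):
        if two_sided_p(z) <= fdr * rank / n:
            tagged.add(name)
        else:
            break
    return tagged

# Invented z-values for four lower-level elements of one integration;
# only the most extreme one survives the FDR screen as an outlier.
outliers = tag_outliers({"pepA": 0.3, "pepB": -1.1, "pepC": 4.8, "pepD": 0.7})
```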
&lt;br /&gt;
== Interpreting the results of GIA integrations ==&lt;br /&gt;
GIA integrations basically generate the quantitative data and their associated variances at the upper level, and detect and eliminate outliers that can affect the values at the upper level. However, each integration also generates additional information that may be very useful for interpreting the results. The most important piece is the general variance of the integration.&lt;br /&gt;
&lt;br /&gt;
The general variance is an estimate of the error introduced when the elements of the lower level are experimentally generated from those of the upper level. This parameter may have different meanings depending on the nature of the data.&lt;br /&gt;
&lt;br /&gt;
When spectra are integrated into peptides, the general variance estimates the error associated with the measurement (independently of peptide and protein manipulation), and hence depends on the LC-MS setup. When peptides are integrated into proteins, the variance estimates the error produced at the time of peptide generation (protein digestion, peptide desalting), independently of the LC-MS analysis. When proteins are integrated into the grand mean, the variance mainly estimates the error produced during protein manipulation (extraction,…). Similarly, integrating technical and biological replicates allows the different error sources at the protein level to be estimated separately. Integration of proteins into categories gives useful insight into protein coordination&amp;lt;ref name=garciamarques2016/&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Building the workflows ==&lt;br /&gt;
Typically, workflows contain three main parts:&lt;br /&gt;
&lt;br /&gt;
# Data parsing to generate the uncalibrated data files and all relations tables needed. This may be performed using '''[[Aljamia]]''' or other scripts provided by the user.&lt;br /&gt;
# Data calibration to generate the (calibrated) data files. This is performed using '''[[Klibrate]]'''.&lt;br /&gt;
# A list of concatenated (or branched) GIA integrations. As explained above, each GIA integration is performed by sequentially running '''[[SanXoT]]''' (to calculate the general variance), '''[[SanXoTSieve]]''' (to tag outliers) and SanXoT again (to integrate without outliers).&lt;br /&gt;
&lt;br /&gt;
Note that some specific GIA integrations require the use of other modules. These include experiment merging and systems biology analysis. See the advanced applications below and '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
== Advanced applications ==&lt;br /&gt;
[[File:Fig1 replicateIntegration.PNG|thumb|600px|Integration of technical or biological replicates using the GIA.]]&lt;br /&gt;
SanXoT also allows the GIA to be used to integrate technical or biological replicates. In this case, scans are integrated into peptides and then into proteins separately for each replicate, and the protein-level data are then integrated to obtain protein averages (grouped data). A further GIA integration allows the statistical significance of the averaged protein changes to be calculated.&lt;br /&gt;
&lt;br /&gt;
Merging of experiments may be performed with the aid of '''[[Cardenio]]''' (or using scripts prepared by the user). Cardenio is used to generate suitable relations tables to compute protein averages from technical or biological replicates, and to normalize (&amp;quot;center&amp;quot;) the data to take into account the systematic quantitative error of each experiment before integration.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 SystemsBiologyTriangle.PNG|thumb|250px|The Systems Biology Triangle.]]&lt;br /&gt;
Systems biology analysis can be performed using the SBT model&amp;lt;ref name=garciamarques2016/&amp;gt; by building up a triangular workflow. Proteins are integrated into categories and categories are integrated to provide statistical significance.&lt;br /&gt;
&lt;br /&gt;
Systems biology information can be downloaded, for example, from [https://david.ncifcrf.gov/ DAVID]. The SanXoT Software Package includes the '''[[Camacho]]''' module to convert such DAVID tab-separated files into SanXoT relations files that can be used by the SBT model. Alternatively, users can generate relations files for systems biology (either custom or downloaded) just by creating tab-separated tables with two columns: the first one for categories, and the second one for proteins.&lt;br /&gt;
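A category-to-protein relations table of this kind is just two tab-separated columns, as in this sketch (the category names and protein accessions are invented; in practice the mapping would come from DAVID output, via Camacho, or a custom source):

```python
# Invented annotation mapping from categories to protein accessions.
annotations = {
    "GO:0006412 translation": ["P00001", "Q00002"],
    "GO:0006096 glycolysis": ["P00001"],
}

# Flatten into the two-column relations format: category TAB protein.
lines = [f"{category}\t{protein}"
         for category, proteins in sorted(annotations.items())
         for protein in proteins]
relations_table = "\n".join(lines)
```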
&lt;br /&gt;
Protein categories can be managed using '''[[SanXoTSqueezer]]''', '''[[Sanson]]''' and '''[[SanXoTGauss]]'''. For more detailed information see '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=615</id>
		<title>How to use SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=615"/>
				<updated>2018-08-13T08:53:22Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: /* Performing integrations and removing outliers */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Figure 1.png|thumb|400px|Examples of quantitative workflows constructed with modules from the SanXoT package: a '''[[fundamental workflow]]''', with the addition of PTM analysis and Systems Biology (above) and an integration of biological (or technical) replicates (below).]]&lt;br /&gt;
The '''[[SanXoT Software Package]]''' (SSP) allows the statistical analysis of quantitative proteomics data (obtained using other programs such as [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer]), using the WSPP model&amp;lt;ref name=navarro2014&amp;gt;Navarro, P., et al., ''General statistical framework for quantitative proteomics by stable isotope labeling''. J Proteome Res, 2014. 13(3): p. 1234-47.&amp;lt;/ref&amp;gt;. This model analyses relative (rather than absolute) quantitative values, e.g. the ratios between two quantitative measurements.&lt;br /&gt;
&lt;br /&gt;
The SSP contains a set of modules that can be used to assemble user-configurable quantitative workflows with very high flexibility.&lt;br /&gt;
&lt;br /&gt;
In the '''[[fundamental workflow]]''' the quantitative results at the scan level are integrated to quantify each one of the peptides. The quantitative data at the peptide level are then integrated to obtain quantitative protein values. Finally, the quantitative data at the protein level are integrated, providing the statistical significance of the increase or decrease in abundance of each protein. An example of this basic workflow is provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
Among other possibilities, SanXoT allows building workflows to automatically process very large numbers of quantitative experiments, to integrate results obtained from technical or biological replicates, or to perform Systems Biology analysis using the SBT model&amp;lt;ref name=garciamarques2016&amp;gt;Garcia-Marques, F., et al., ''A Novel Systems-Biology Algorithm for the Analysis of Coordinated Protein Responses Using Quantitative Proteomics''. Mol Cell Proteomics, 2016. 15(5): p. 1740-60.&amp;lt;/ref&amp;gt;. Examples of these applications are provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== General design ==&lt;br /&gt;
All SanXoT workflows contain two initial steps:&lt;br /&gt;
&lt;br /&gt;
# a first step where the raw quantitative information is extracted from the output of the quantification program and organized according to the '''[[How to use SanXoT#Data structure|data structure]]''' of SanXoT. You can read more about '''[[How to use SanXoT#Preparing the data|how to prepare the data]]''' for SanXoT analysis.&lt;br /&gt;
# a second step where the data are processed to obtain prior weights. This process is called calibration. For more details, check [[How to use SanXoT#Calibrating the data|the section '''Calibrating the data''']].&lt;br /&gt;
&lt;br /&gt;
SanXoT then performs the WSPP statistics by iteratively applying the Generic Integration Algorithm (GIA)&amp;lt;ref name=garciamarques2016/&amp;gt; to the calibrated data. Each GIA integration performs an independent statistical analysis, and its output may be used as input for the following integration step.&lt;br /&gt;
&lt;br /&gt;
In the fundamental workflow, the GIA is applied to integrate scan-level data into peptide-level data, peptide-level data into protein-level data, and finally to integrate the protein-level data.&lt;br /&gt;
&lt;br /&gt;
This modular structure allows users to design their own quantitative workflows by iteratively applying GIA integrations. You can check some examples of advanced applications developed using this modular structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Fig1 calibration.PNG|thumb|Calibration of spectra using '''[[Klibrate]]'''.]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&lt;br /&gt;
[[File:Fig1 lowerLevel-to-higherLevel.PNG|thumb|210px|Integration from any lower level to any higher level, using the GIA.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td colspan=2&amp;gt;[[File:Fig1 fundamentalWorkflow.PNG|thumb|550px|The Fundamental Workflow, which uses the GIA from spectra to proteins.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Data structure ==&lt;br /&gt;
Each GIA integration needs as input tab-separated text tables&amp;lt;ref name=garciamarques2016/&amp;gt;: &lt;br /&gt;
# '''a data file''' containing the quantitative data of the lower level. This file contains three columns: identifier (a text string that is used to identify the element), quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and prior weight (a parameter that measures the accuracy of the quantitative value before performing the integration).&lt;br /&gt;
# '''a relations table''', which establishes the correspondence between the identifiers in the lower level and those in the upper level. This file contains two columns: the first one contains the identifiers of the upper level and the second one the identifiers of the lower level.&lt;br /&gt;
&lt;br /&gt;
Each GIA integration generates an output data file containing the quantitative data of the upper level. This file can be used as input for the next GIA integration. The relations table for the next integration, however, needs to be prepared in advance.&lt;br /&gt;
&lt;br /&gt;
Note that SanXoT checks the column position (not the column name). The first row, containing the header, is removed, so the user can name the columns freely (as long as exactly one header row is included).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Peptide dataFile.PNG|thumb|320px|Example of a data file for peptides. The first column contains the peptide identifiers (in this case, amino acid sequences using SEQUEST-style characters for PTM modifications), the second column the log2 of the ratio of intensities of two iTRAQ reporter ions, and the third column the prior weights (the inverse of the local variance of the quantitative value).]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;[[File:Pep2prot relationsTable.PNG|thumb|390px|Example of a peptide-to-protein relations table. The first column contains the higher-level identifiers (in this case, the FASTA header of each protein) and the second column the lower-level identifiers (in this case, the same sequences used in the data file).]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Preparing the data ==&lt;br /&gt;
&lt;br /&gt;
The output of the software used to obtain the quantitative values must be converted to generate:&lt;br /&gt;
&lt;br /&gt;
# '''an uncalibrated data file''' containing three columns: identifier, quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and quality parameter. The quality parameter is a number used to rank the quantitative values according to their quality (in terms of variance, higher quality meaning a lower variance). In the case of isobaric labeling the third column contains the maximum intensity of the two reporter ions used to calculate the ratio; other kinds of quantification may need other parameters to account for the quality of the quantitative values&amp;lt;ref name=navarro2014/&amp;gt;.&lt;br /&gt;
# '''the relations tables''' of all the GIA integrations planned in the workflow. In the fundamental workflow these tables describe which scans will be integrated into each peptide and which peptides will be integrated into each protein.&lt;br /&gt;
These tables are easy to construct, due to their simple plain-text format. If the software used for quantification generates tab-separated text tables, one of the SanXoT modules ('''[[Aljamia]]''') may be used for this task. In the unit tests for SanXoT we include a workflow that uses Aljamia to generate all the required files from the output of Proteome Discoverer 2.1. Quantitative values produced in other formats may require other simple scripts to generate the files required by SanXoT.&lt;br /&gt;
&lt;br /&gt;
== Calibrating the data ==&lt;br /&gt;
Calibration is the process by which the quality parameter (the parameter used to rank quantitative values according to their variance) is transformed into the prior weight, a parameter with full statistical meaning, defined as the inverse of the local variance of the quantitative value in log2-ratio scale. This process is done using the '''[[Klibrate]]''' module and needs to be performed only in the first step of each workflow. For more information about calibration, see '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
Using prior weights is essential in GIA integrations, since this parameter not only reflects the local variance of the measurement, but is also used as the statistical weight to calculate the weighted averages. In addition, GIA integrations follow error propagation theory, so the output data file of each integration contains true prior weights and does not need to be recalibrated.&lt;br /&gt;
&lt;br /&gt;
== Performing integrations and removing outliers ==&lt;br /&gt;
Each GIA integration takes the input data file from the lower level and the relations table, calculates the general variance of the integration (using a robust iterative method) and generates as output a data file containing the integrated quantifications of the upper level. The integration is done using the '''[[SanXoT]]''' module.&lt;br /&gt;
&lt;br /&gt;
In addition to the quantitative data, each GIA integration may generate several additional files which contain information about the integration. More detailed information may be found in '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
One of the advantages of the WSPP model is that it provides a straightforward method to eliminate outliers&amp;lt;ref name=navarro2014/&amp;gt; in each integration step. This is done by calculating standardized log2-ratios (z-values), which make it possible to estimate the probability that each element of the lower level is a significant outlier of the z distribution (e.g., N(0,1)) and to obtain the associated FDR. The most extreme outliers may be removed sequentially and the integration repeated, until all outliers below a user-defined FDR level are removed.&lt;br /&gt;
&lt;br /&gt;
Outlier removal is performed by the '''[[SanXoTSieve]]''' module by tagging the outlier elements in the relations table. &lt;br /&gt;
The entire process is done in three steps:&lt;br /&gt;
&lt;br /&gt;
# a first integration is done with '''[[SanXoT]]''', which serves to calculate the variance&lt;br /&gt;
# outliers are iteratively tagged in the relations table at a user-defined FDR-level using SanXoTSieve&lt;br /&gt;
# a second integration is done with SanXoT using the fixed variance calculated in the first step and removing from the integration the outliers tagged in the relations table.&lt;br /&gt;
&lt;br /&gt;
== Interpreting the results of GIA integrations ==&lt;br /&gt;
GIA integrations basically generate the quantitative data and their associated variances at the upper level, and detect and eliminate outliers that can affect the values at the upper level. However, each integration also generates additional information that may be very useful for interpreting the results. The most important piece is the general variance of the integration.&lt;br /&gt;
&lt;br /&gt;
The general variance is an estimate of the error introduced when the elements of the lower level are experimentally generated from those of the upper level. This parameter may have different meanings depending on the nature of the data.&lt;br /&gt;
&lt;br /&gt;
When spectra are integrated into peptides, the general variance estimates the error associated with the measurement (independently of peptide and protein manipulation), and hence depends on the LC-MS setup. When peptides are integrated into proteins, the variance estimates the error produced at the time of peptide generation (protein digestion, peptide desalting), independently of the LC-MS analysis. When proteins are integrated into the grand mean, the variance mainly estimates the error produced during protein manipulation (extraction,…). Similarly, integrating technical and biological replicates allows the different error sources at the protein level to be estimated separately. Integration of proteins into categories gives useful insight into protein coordination&amp;lt;ref name=garciamarques2016/&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Building the workflows ==&lt;br /&gt;
Typically, workflows contain three main parts:&lt;br /&gt;
&lt;br /&gt;
# Data parsing to generate the uncalibrated data files and all relations tables needed. This may be performed using '''[[Aljamia]]''' or other scripts provided by the user.&lt;br /&gt;
# Data calibration to generate the (calibrated) data files. This is performed using '''[[Klibrate]]'''.&lt;br /&gt;
# A list of concatenated (or branched) GIA integrations. As explained above, each GIA integration is performed by sequentially running '''[[SanXoT]]''' (to calculate the general variance), '''[[SanXoTSieve]]''' (to tag outliers) and SanXoT again (to integrate without outliers).&lt;br /&gt;
&lt;br /&gt;
Note that some specific GIA integrations require the use of other modules. These include experiment merging and systems biology analysis. See the advanced applications below and '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
== Advanced applications ==&lt;br /&gt;
[[File:Fig1 replicateIntegration.PNG|thumb|600px|Integration of technical or biological replicates using the GIA.]]&lt;br /&gt;
SanXoT also allows the GIA to be used to integrate technical or biological replicates. In this case, scans are integrated into peptides and then into proteins separately for each replicate, and the protein-level data are then integrated to obtain protein averages (grouped data). A further GIA integration allows the statistical significance of the averaged protein changes to be calculated.&lt;br /&gt;
&lt;br /&gt;
Merging of experiments may be performed with the aid of '''[[Cardenio]]''' (or using scripts prepared by the user). Cardenio is used to generate suitable relations tables to compute protein averages from technical or biological replicates, and to normalize (&amp;quot;center&amp;quot;) the data to take into account the systematic quantitative error of each experiment before integration.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 SystemsBiologyTriangle.PNG|thumb|250px|The Systems Biology Triangle.]]&lt;br /&gt;
Systems biology analysis can be performed using the SBT model&amp;lt;ref name=garciamarques2016/&amp;gt; by building up a triangular workflow. Proteins are integrated into categories and categories are integrated to provide statistical significance.&lt;br /&gt;
&lt;br /&gt;
Systems biology information can be downloaded, for example, from [https://david.ncifcrf.gov/ DAVID]. The SanXoT Software Package includes the '''[[Camacho]]''' module to convert such DAVID tab-separated files into SanXoT relations files that can be used by the SBT model. Alternatively, users can generate relations files for systems biology (either custom or downloaded) just by creating tab-separated tables with two columns: the first one for categories, and the second one for proteins.&lt;br /&gt;
&lt;br /&gt;
Protein categories can be managed using '''[[SanXoTSqueezer]]''', '''[[Sanson]]''' and '''[[SanXoTGauss]]'''. For more detailed information see '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=614</id>
		<title>How to use SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=614"/>
				<updated>2018-08-13T08:52:21Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: /* Performing integrations and removing outliers */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Figure 1.png|thumb|400px|Examples of quantitative workflows constructed with modules from the SanXoT package: a '''[[fundamental workflow]]''', with the addition of PTM analysis and Systems Biology (above) and an integration of biological (or technical) replicates (below).]]&lt;br /&gt;
The '''[[SanXoT Software Package]]''' (SSP) allows the statistical analysis of quantitative proteomics data (obtained using other programs such as [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer]), using the WSPP model&amp;lt;ref name=navarro2014&amp;gt;Navarro, P., et al., ''General statistical framework for quantitative proteomics by stable isotope labeling''. J Proteome Res, 2014. 13(3): p. 1234-47.&amp;lt;/ref&amp;gt;. This model analyses relative (rather than absolute) quantitative values, e.g. the ratios between two quantitative measurements.&lt;br /&gt;
&lt;br /&gt;
The SSP contains a set of modules that can be used to assemble user-configurable quantitative workflows with very high flexibility.&lt;br /&gt;
&lt;br /&gt;
In the '''[[fundamental workflow]]''' the quantitative results at the scan level are integrated to quantify each one of the peptides. The quantitative data at the peptide level are then integrated to obtain quantitative protein values. Finally, the quantitative data at the protein level are integrated, providing the statistical significance of the increase or decrease in abundance of each protein. An example of this basic workflow is provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
Among other possibilities, SanXoT allows building workflows to automatically process very large numbers of quantitative experiments, to integrate results obtained from technical or biological replicates, or to perform Systems Biology analysis using the SBT model&amp;lt;ref name=garciamarques2016&amp;gt;Garcia-Marques, F., et al., ''A Novel Systems-Biology Algorithm for the Analysis of Coordinated Protein Responses Using Quantitative Proteomics''. Mol Cell Proteomics, 2016. 15(5): p. 1740-60.&amp;lt;/ref&amp;gt;. Examples of these applications are provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== General design ==&lt;br /&gt;
All SanXoT workflows contain two initial steps:&lt;br /&gt;
&lt;br /&gt;
# a first step where the raw quantitative information is extracted from the output of the quantification program and organized according to the '''[[How to use SanXoT#Data structure|data structure]]''' of SanXoT. You can read more about '''[[How to use SanXoT#Preparing the data|how to prepare the data]]''' for SanXoT analysis.&lt;br /&gt;
# a second step where the data are processed to obtain prior weights. This process is called calibration. For more details, check [[How to use SanXoT#Calibrating the data|the section '''Calibrating the data''']].&lt;br /&gt;
&lt;br /&gt;
SanXoT then performs the WSPP statistics by iteratively applying the Generic Integration Algorithm (GIA)&amp;lt;ref name=garciamarques2016/&amp;gt; to the calibrated data. Each GIA integration performs an independent statistical analysis, and its output may be used as input for the following integration step.&lt;br /&gt;
&lt;br /&gt;
In the fundamental workflow, the GIA is applied to integrate scan-level data into peptide-level data, peptide-level data into protein-level data, and finally to integrate the protein-level data.&lt;br /&gt;
&lt;br /&gt;
This modular structure allows users to design their own quantitative workflows by iteratively applying GIA integrations. You can check some examples of advanced applications developed using this modular structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Fig1 calibration.PNG|thumb|Calibration of spectra using '''[[Klibrate]]'''.]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&lt;br /&gt;
[[File:Fig1 lowerLevel-to-higherLevel.PNG|thumb|210px|Integration from any lower level to any higher level, using the GIA.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td colspan=2&amp;gt;[[File:Fig1 fundamentalWorkflow.PNG|thumb|550px|The Fundamental Workflow, which uses the GIA from spectra to proteins.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Data structure ==&lt;br /&gt;
Each GIA integration needs as input tab-separated text tables&amp;lt;ref name=garciamarques2016/&amp;gt;: &lt;br /&gt;
# '''a data file''' containing the quantitative data of the lower level. This file contains three columns: identifier (a text string that is used to identify the element), quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and prior weight (a parameter that measures the accuracy of the quantitative value before performing the integration).&lt;br /&gt;
# '''a relations table''', which establishes the correspondence between the identifiers in the lower level and those in the upper level. This file contains two columns: the first one contains the identifiers of the upper level and the second one the identifiers of the lower level.&lt;br /&gt;
&lt;br /&gt;
Each GIA integration generates an output data file containing the quantitative data of the upper level. This file can be used as input for the next GIA integration. The relations table for the next integration, however, needs to be prepared in advance.&lt;br /&gt;
&lt;br /&gt;
Note that SanXoT checks the column position (not the column name). The first row, containing the header, is removed, so the user can name the columns freely (as long as exactly one header row is included).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Peptide dataFile.PNG|thumb|320px|Example of a data file for peptides. The first column contains the peptide identifiers (in this case, amino acid sequences using SEQUEST-style characters for PTM modifications), the second column the log2 of the ratio of intensities of two iTRAQ reporter ions, and the third column the prior weights (the inverse of the local variance of the quantitative value).]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;[[File:Pep2prot relationsTable.PNG|thumb|390px|Example of a peptide-to-protein relations table. The first column contains the higher-level identifiers (in this case, the FASTA header of each protein) and the second column the lower-level identifiers (in this case, the same sequences used in the data file).]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Preparing the data ==&lt;br /&gt;
&lt;br /&gt;
The output of the software used to obtain the quantitative values must be converted to generate:&lt;br /&gt;
&lt;br /&gt;
# '''an uncalibrated data file''' containing three columns: identifier, quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and quality parameter. The quality parameter is a number that is used to rank the quantitative values according to their quality (in terms of variance, higher quality meaning a lower variance). In the case of isobaric labeling the third column contains the maximum intensity of the two reporter ions used to calculate the ratio; other kinds of quantification may require other parameters to account for the quality of quantitative values&amp;lt;ref name=navarro2014/&amp;gt;.&lt;br /&gt;
# '''the relations tables''' of all the GIA integrations planned in the workflow. In the fundamental workflow these tables describe which scans will be integrated into each peptide and which peptides will be integrated into each protein.&lt;br /&gt;
These tables are easy to construct, due to their simple plain-text format. If the software used for quantification generates tab-separated text tables, one of the SanXoT modules, '''[[Aljamia]]''', may be used for this task. In the unit tests for SanXoT we include a workflow that uses Aljamia to generate all the required files from the output of Proteome Discoverer 2.1. Quantitative values produced in other formats may require other simple scripts to generate the files required by SanXoT.&lt;br /&gt;
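For the isobaric-labeling convention described above (log2 ratio plus maximum reporter intensity as the quality parameter), the conversion can be sketched as follows. The input column names and values are hypothetical and do not reflect any particular quantification program's output format:

```python
import csv
import math

# Hypothetical quantification report: sequence plus two reporter intensities
rows = [
    {"Sequence": "PEPTIDER", "Intensity_A": 12000.0, "Intensity_B": 16000.0},
    {"Sequence": "SAMPLEK",  "Intensity_A":   900.0, "Intensity_B":   700.0},
]

with open("uncalibrated.tsv", "w", newline="") as f:
    w = csv.writer(f, delimiter="\t")
    w.writerow(["id", "log2ratio", "quality"])
    for r in rows:
        a, b = r["Intensity_A"], r["Intensity_B"]
        xi = math.log2(a / b)   # 2-base logarithm of the ratio
        vi = max(a, b)          # quality parameter: maximum reporter intensity
        w.writerow([r["Sequence"], xi, vi])
```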
&lt;br /&gt;
== Calibrating the data ==&lt;br /&gt;
Calibration is the process where the quality parameter (the parameter which is used to rank quantitative values according to their variance) is transformed into the prior weight, a parameter with full statistical meaning defined as the inverse of the local variance of the quantitative value in log2-ratio scale. This process is done using the '''[[Klibrate]]''' module and needs to be performed only in the first step of each workflow. For more information about calibration, see '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
Using prior weights is essential in GIA integrations, since this parameter not only reflects the local variances of the measurement, but is also used as statistical weight to calculate the weighted averages. Besides, GIA integrations follow error propagation theory, so that the output datafile of each integration contains true prior weights and does not need to be recalibrated.&lt;br /&gt;
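The notion of a prior weight as the inverse of the local variance can be illustrated with a deliberately simplified sketch: rank the values by their quality parameter, group neighbours into bins, and use each bin's variance locally. This is a conceptual illustration only, not Klibrate's actual fitting procedure, and all the values are hypothetical:

```python
import statistics

# (identifier, log2-ratio, quality parameter); hypothetical values where
# high-quality (high-intensity) measurements scatter less
data = [
    ("p1", -0.9,   800), ("p2", 0.7,    900), ("p3", -0.6,  1100), ("p4", 0.8,   1000),
    ("p5", -0.1,  9000), ("p6", 0.2,   9500), ("p7", -0.15, 10500), ("p8", 0.1, 11000),
]

def prior_weights(data, bin_size=4):
    # Rank by the quality parameter and estimate a local variance per bin
    ranked = sorted(data, key=lambda r: r[2])
    out = {}
    for i in range(0, len(ranked), bin_size):
        group = ranked[i:i + bin_size]
        var = statistics.pvariance([x for _, x, _ in group])
        for ident, _, _ in group:
            out[ident] = 1.0 / var   # prior weight = inverse local variance
    return out

w = prior_weights(data)
# Low-variance (high-intensity) values receive larger prior weights
```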
&lt;br /&gt;
== Performing integrations and removing outliers ==&lt;br /&gt;
Each GIA integration takes the input data file from the lower level and the relations table, calculates the general variance of the integration (using a robust iterative method) and generates as output a data file containing the integrated quantifications of the upper level. The integration is done using the '''[[SanXoT]]''' module.&lt;br /&gt;
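Stripped of the general variance estimation and outlier handling that the real module performs, the core of one integration step is a weighted average of the lower-level values related to each upper-level element, with the weights propagating by summation. A minimal sketch with hypothetical values:

```python
from collections import defaultdict

# Lower-level data: id -> (log2-ratio, weight); hypothetical values
data = {
    "pepA": (0.40, 10.0), "pepB": (0.10, 30.0), "pepC": (-0.50, 20.0),
}
# Relations table rows: (upper-level id, lower-level id)
relations = [("prot1", "pepA"), ("prot1", "pepB"), ("prot2", "pepC")]

grouped = defaultdict(list)
for upper, lower in relations:
    grouped[upper].append(data[lower])

integrated = {}
for upper, members in grouped.items():
    wsum = sum(w for _, w in members)
    xbar = sum(x * w for x, w in members) / wsum   # weighted average
    integrated[upper] = (xbar, wsum)               # value and propagated weight
```

The propagated weight shown here is exact only when the general variance is zero; the WSPP model adds the general variance to each element's local variance before averaging.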
&lt;br /&gt;
In addition to the quantitative data, each GIA integration may generate several additional files which contain information about the integration. More detailed information may be found in '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
One of the advantages of the WSPP model is that it provides a straightforward method to eliminate outliers&amp;lt;ref name=navarro2014/&amp;gt; in each integration step. This is done by calculating standardized log2-ratios (z-values), calculating the probability that each element of the lower level is a significant outlier of the z distribution (e.g., N(0,1)) and calculating the associated FDR. The most extreme outliers may be removed sequentially and the integration repeated, until all outliers below a user-defined FDR-level are removed.&lt;br /&gt;
&lt;br /&gt;
Outlier removal is performed by the '''[[SanXoTSieve]]''' module, which tags the outlier elements in the relations table.&lt;br /&gt;
The entire process is done in three steps:&lt;br /&gt;
&lt;br /&gt;
# a first integration is done with '''[[SanXoT]]''', which serves to calculate the variance&lt;br /&gt;
# outliers are iteratively tagged in the relations table at a user-defined FDR-level using SanXoTSieve&lt;br /&gt;
# a second integration is done with SanXoT using the fixed variance calculated in the first step and removing from the integration the outliers tagged in the relations table.&lt;br /&gt;
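The outlier test described above (standardize, compute p-values under N(0,1), control the FDR) can be sketched as follows. This is a simplified illustration with hypothetical z-values; SanXoTSieve's actual implementation may differ in detail:

```python
import math

def two_tailed_p(z):
    # Two-tailed p-value under N(0, 1), via the complementary error function
    return math.erfc(abs(z) / math.sqrt(2.0))

def tag_outliers(z_values, fdr=0.01):
    """Return indices rejected by the Benjamini-Hochberg procedure at `fdr`."""
    n = len(z_values)
    order = sorted(range(n), key=lambda i: two_tailed_p(z_values[i]))
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if two_tailed_p(z_values[i]) <= fdr * rank / n:
            max_k = rank   # largest rank satisfying the BH condition
    return set(order[:max_k])

z = [0.3, -1.1, 0.8, 6.5, -0.2]   # hypothetical standardized log2-ratios
outliers = tag_outliers(z)         # only the extreme value is tagged
```

In the real workflow the tagged elements are excluded from the second integration, which reuses the variance fixed in the first one.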
&lt;br /&gt;
== Interpreting the results of GIA integrations ==&lt;br /&gt;
GIA integrations basically generate the quantitative data and their associated variances at the upper level, and detect and eliminate outliers that can affect the values at the upper level. However, each integration also generates additional information that may be very useful for interpreting the results. The most important item is the general variance of the integration.&lt;br /&gt;
&lt;br /&gt;
The general variance is an estimate of the error introduced when the elements of the lower level are experimentally generated from those of the upper level. This parameter may have a different meaning depending on the nature of the data.&lt;br /&gt;
&lt;br /&gt;
When spectra are integrated into peptides, the general variance estimates the error associated with the measurement (independently of peptide and protein manipulation), and hence depends on the LC-MS setup. When peptides are integrated into proteins, the variance estimates the error produced at the time of peptide generation (protein digestion, peptide desalting), independently of LC-MS analysis. When proteins are integrated into the grand mean, the variance mainly estimates the error produced during protein manipulation (extraction,…). Similarly, integrating technical and biological replicates allows the different error sources at the protein level to be estimated separately. Integration of proteins into categories gives a useful insight into protein coordination&amp;lt;ref name=garciamarques2016/&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Building the workflows ==&lt;br /&gt;
Typically, workflows contain three main parts:&lt;br /&gt;
&lt;br /&gt;
# Data parsing to generate the uncalibrated data files and all relations tables needed. This may be performed using '''[[Aljamia]]''' or other scripts provided by the user.&lt;br /&gt;
# Data calibration to generate the (calibrated) data files. This is performed using '''[[Klibrate]]'''.&lt;br /&gt;
# A list of concatenated (or branched) GIA integrations. As explained above, each GIA is performed by running sequentially '''[[SanXoT]]''' (to calculate the general variance), '''[[SanXoTSieve]]''' (to tag outliers) and SanXoT again (to integrate without outliers).&lt;br /&gt;
&lt;br /&gt;
Note that some specific GIA integrations, such as experiment merging and systems biology analysis, require the use of other modules. See the advanced applications below and '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
== Advanced applications ==&lt;br /&gt;
[[File:Fig1 replicateIntegration.PNG|thumb|600px|Integration of technical or biological replicates using the GIA.]]&lt;br /&gt;
SanXoT also allows using the GIA to integrate technical or biological replicates. In this case scans are integrated into peptides and then into proteins separately for each replicate, and then the protein-level data are integrated to obtain protein averages (grouped data). A further GIA integration allows the statistical significance of the averaged protein changes to be calculated.&lt;br /&gt;
&lt;br /&gt;
Merging of experiments may be performed with the aid of '''[[Cardenio]]''' (or using scripts prepared by the user). Cardenio is used to generate suitable relations tables to make protein averages from technical or biological replicates, and to normalize (&amp;quot;center&amp;quot;) the data to take into account the systematic quantitative error of each experiment before integration.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 SystemsBiologyTriangle.PNG|thumb|250px|The Systems Biology Triangle.]]&lt;br /&gt;
Systems biology analysis can be performed using the SBT model&amp;lt;ref name=garciamarques2016/&amp;gt; by building up a triangular workflow. Proteins are integrated into categories and categories are integrated to provide statistical significance.&lt;br /&gt;
&lt;br /&gt;
Systems biology information can be downloaded, for example, from [https://david.ncifcrf.gov/ DAVID]. The SanXoT Software Package includes the '''[[Camacho]]''' module to convert such DAVID tab-separated files into SanXoT relations files that can be used by the SBT model. Alternatively, users can generate relations files for systems biology (either custom or downloaded) just by creating tab-separated tables with two columns: the first one for categories, and the second one for proteins.&lt;br /&gt;
&lt;br /&gt;
Protein categories can be managed using '''[[SanXoTSqueezer]]''', '''[[Sanson]]''' and '''[[SanXoTGauss]]'''. For more detailed information see '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=613</id>
		<title>How to use SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=613"/>
				<updated>2018-08-13T08:51:09Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: /* Calibrating the data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Figure 1.png|thumb|400px|Examples of quantitative workflows constructed with modules from the SanXoT package: a '''[[fundamental workflow]]''', with the addition of PTM analysis and Systems Biology (above) and an integration of biological (or technical) replicates (below).]]&lt;br /&gt;
The '''[[SanXoT Software Package]]''' (SSP) allows the statistical analysis of quantitative proteomics data (obtained using other programs such as [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer]), using the WSPP model&amp;lt;ref name=navarro2014&amp;gt;Navarro, P., et al., ''General statistical framework for quantitative proteomics by stable isotope labeling''. J Proteome Res, 2014. 13(3): p. 1234-47.&amp;lt;/ref&amp;gt;. This model analyses relative (rather than absolute) quantitative values, e.g. the ratios between two quantitative measurements.&lt;br /&gt;
&lt;br /&gt;
The SSP contains a set of modules that can be used to assemble user-configurable quantitative workflows with very high flexibility.&lt;br /&gt;
&lt;br /&gt;
In the '''[[fundamental workflow]]''' the quantitative results at the scan level are integrated to quantify each one of the peptides. The quantitative data at the peptide level are then integrated to obtain quantitative protein values. Finally, the quantitative data at the protein level are integrated, providing the statistical significance of the increase or decrease in abundance of each protein. An example of this basic workflow is provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
Among other possibilities, SanXoT allows building up workflows to automatically process very large numbers of quantitative experiments, to integrate results obtained from technical or biological replicates, or to perform Systems Biology analysis using the SBT model&amp;lt;ref name=garciamarques2016&amp;gt;Garcia-Marques, F., et al., ''A Novel Systems-Biology Algorithm for the Analysis of Coordinated Protein Responses Using Quantitative Proteomics''. Mol Cell Proteomics, 2016. 15(5): p. 1740-60.&amp;lt;/ref&amp;gt;. Examples of these applications are provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== General design ==&lt;br /&gt;
All SanXoT workflows contain two initial steps:&lt;br /&gt;
&lt;br /&gt;
# a first step where the raw quantitative information is extracted from the output of the quantification program and organized according to the '''[[How to use SanXoT#Data structure|data structure]]''' of SanXoT. You can see more about '''[[How to use SanXoT#Preparing the data|how to prepare the data]]''' for SanXoT analysis.&lt;br /&gt;
# a second step where the data are processed to obtain prior weights. This process is called calibration. For more details, check [[How to use SanXoT#Calibrating the data|the section '''Calibrating the data''']].&lt;br /&gt;
&lt;br /&gt;
SanXoT then performs WSPP statistics by iteratively applying the Generic Integration Algorithm (GIA)&amp;lt;ref name=garciamarques2016/&amp;gt; on the calibrated data. Each GIA integration performs an independent statistical analysis and its output may be used as input for the following integration step.&lt;br /&gt;
&lt;br /&gt;
In the fundamental workflow, the GIA is applied to integrate scan-level data into peptide-level data, to integrate peptide-level data into protein-level data and, finally, to integrate the protein-level data.&lt;br /&gt;
&lt;br /&gt;
This modular structure allows users to design their own quantitative workflows by applying the GIA integrations iteratively. You can check some examples of advanced applications developed using this modular structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Fig1 calibration.PNG|thumb|Calibration of spectra using '''[[Klibrate]]'''.]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&lt;br /&gt;
[[File:Fig1 lowerLevel-to-higherLevel.PNG|thumb|210px|Integration from any lower level to any higher level, using the GIA.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td colspan=2&amp;gt;[[File:Fig1 fundamentalWorkflow.PNG|thumb|550px|The Fundamental Workflow, which uses the GIA from spectra to proteins.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Data structure ==&lt;br /&gt;
Each GIA integration needs as input tab-separated text tables&amp;lt;ref name=garciamarques2016/&amp;gt;: &lt;br /&gt;
# '''a data file''' containing the quantitative data of the lower level. This file contains three columns: identifier (a text string that is used to identify the element), quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and prior weight (a parameter that measures the accuracy of the quantitative value before performing the integration).&lt;br /&gt;
# '''a relations table''', which establishes the correspondence between the identifiers of the lower level and those of the upper level. This file contains two columns: the first one contains the identifiers of the upper level and the second one the identifiers of the lower level.&lt;br /&gt;
&lt;br /&gt;
Each GIA integration generates an output data file containing the quantitative data of the upper level. This file can be used as input for the next GIA integration. The relations table for the next integration, however, needs to be prepared in advance.&lt;br /&gt;
&lt;br /&gt;
Note that SanXoT identifies columns by their position (not by their name). The first row, containing the header, is removed, so the user can choose column names freely (as long as one header row is included).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Peptide dataFile.PNG|thumb|320px|Example of data file for peptides. First column are peptide identifiers (in this case, amino acid sequences using SEQUEST-style characters for PTM modifications), second column are the log2 of the ratio of intensities of two iTRAQ reporter ions, third column are the prior weights (inverse of the local variance of the quantitative value)]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;[[File:Pep2prot relationsTable.PNG|thumb|390px|Example of peptide-to-protein relations table. First column is the higher level identifiers (in this case, the FASTA header of each protein), second column are the lower level identifiers (in this case, the same sequences used in the data table).]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Preparing the data ==&lt;br /&gt;
&lt;br /&gt;
The output of the software used to obtain the quantitative values must be converted to generate:&lt;br /&gt;
&lt;br /&gt;
# '''an uncalibrated data file''' containing three columns: identifier, quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and quality parameter. The quality parameter is a number that is used to rank the quantitative values according to their quality (in terms of variance, higher quality meaning a lower variance). In the case of isobaric labeling the third column contains the maximum intensity of the two reporter ions used to calculate the ratio; other kinds of quantification may require other parameters to account for the quality of quantitative values&amp;lt;ref name=navarro2014/&amp;gt;.&lt;br /&gt;
# '''the relations tables''' of all the GIA integrations planned in the workflow. In the fundamental workflow these tables describe which scans will be integrated into each peptide and which peptides will be integrated into each protein.&lt;br /&gt;
These tables are easy to construct, due to their simple plain-text format. If the software used for quantification generates tab-separated text tables, one of the SanXoT modules, '''[[Aljamia]]''', may be used for this task. In the unit tests for SanXoT we include a workflow that uses Aljamia to generate all the required files from the output of Proteome Discoverer 2.1. Quantitative values produced in other formats may require other simple scripts to generate the files required by SanXoT.&lt;br /&gt;
&lt;br /&gt;
== Calibrating the data ==&lt;br /&gt;
Calibration is the process where the quality parameter (the parameter which is used to rank quantitative values according to their variance) is transformed into the prior weight, a parameter with full statistical meaning defined as the inverse of the local variance of the quantitative value in log2-ratio scale. This process is done using the '''[[Klibrate]]''' module and needs to be performed only in the first step of each workflow. For more information about calibration, see '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
Using prior weights is essential in GIA integrations, since this parameter not only reflects the local variances of the measurement, but is also used as statistical weight to calculate the weighted averages. Besides, GIA integrations follow error propagation theory, so that the output datafile of each integration contains true prior weights and does not need to be recalibrated.&lt;br /&gt;
&lt;br /&gt;
== Performing integrations and removing outliers ==&lt;br /&gt;
Each GIA integration takes the input data file from the lower level and the relations table, calculates the general variance of the integration (using a robust iterative method) and generates as output a data file containing the integrated quantifications of the upper level. The integration is done using the '''[[SanXoT]]''' module.&lt;br /&gt;
&lt;br /&gt;
In addition to the quantitative data, each GIA integration may generate several additional files which contain information about the integration. More detailed information may be found in '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
One of the advantages of the WSPP model is that it provides a straightforward method to eliminate outliers&amp;lt;ref name=navarro2014/&amp;gt; in each integration step. This is done by calculating standardized log2-ratios (z-values), calculating the probability that each element of the lower level is a significant outlier of the z distribution (e.g., N(0,1)) and calculating the associated FDR. The most extreme outliers may be removed sequentially and the integration repeated, until all outliers below a user-defined FDR-level are removed.&lt;br /&gt;
&lt;br /&gt;
Outlier removal is performed by the '''[[SanXoTSieve]]''' module, which tags the outlier elements in the relations table.&lt;br /&gt;
The entire process is done in three steps:&lt;br /&gt;
&lt;br /&gt;
# a first integration is done with '''[[SanXoT]]''', which serves to calculate the variance&lt;br /&gt;
# outliers are iteratively tagged in the relations table at a user-defined FDR-level using SanXoTSieve&lt;br /&gt;
# a second integration is done with SanXoT using the fixed variance calculated in the first step and removing from the integration the outliers tagged in the relations table.&lt;br /&gt;
&lt;br /&gt;
== Interpreting the results of GIA integrations ==&lt;br /&gt;
GIA integrations basically generate the quantitative data and their associated variances at the upper level, and detect and eliminate outliers that can affect the values at the upper level. However, each integration also generates additional information that may be very useful for interpreting the results. The most important item is the general variance of the integration.&lt;br /&gt;
&lt;br /&gt;
The general variance is an estimate of the error introduced when the elements of the lower level are experimentally generated from those of the upper level. This parameter may have a different meaning depending on the nature of the data.&lt;br /&gt;
&lt;br /&gt;
When spectra are integrated into peptides, the general variance estimates the error associated with the measurement (independently of peptide and protein manipulation), and hence depends on the LC-MS setup. When peptides are integrated into proteins, the variance estimates the error produced at the time of peptide generation (protein digestion, peptide desalting), independently of LC-MS analysis. When proteins are integrated into the grand mean, the variance mainly estimates the error produced during protein manipulation (extraction,…). Similarly, integrating technical and biological replicates allows the different error sources at the protein level to be estimated separately. Integration of proteins into categories gives a useful insight into protein coordination&amp;lt;ref name=garciamarques2016/&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Building the workflows ==&lt;br /&gt;
Typically, workflows contain three main parts:&lt;br /&gt;
&lt;br /&gt;
# Data parsing to generate the uncalibrated data files and all relations tables needed. This may be performed using '''[[Aljamia]]''' or other scripts provided by the user.&lt;br /&gt;
# Data calibration to generate the (calibrated) data files. This is performed using '''[[Klibrate]]'''.&lt;br /&gt;
# A list of concatenated (or branched) GIA integrations. As explained above, each GIA is performed by running sequentially '''[[SanXoT]]''' (to calculate the general variance), '''[[SanXoTSieve]]''' (to tag outliers) and SanXoT again (to integrate without outliers).&lt;br /&gt;
&lt;br /&gt;
Note that some specific GIA integrations, such as experiment merging and systems biology analysis, require the use of other modules. See the advanced applications below and '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
== Advanced applications ==&lt;br /&gt;
[[File:Fig1 replicateIntegration.PNG|thumb|600px|Integration of technical or biological replicates using the GIA.]]&lt;br /&gt;
SanXoT also allows using the GIA to integrate technical or biological replicates. In this case scans are integrated into peptides and then into proteins separately for each replicate, and then the protein-level data are integrated to obtain protein averages (grouped data). A further GIA integration allows the statistical significance of the averaged protein changes to be calculated.&lt;br /&gt;
&lt;br /&gt;
Merging of experiments may be performed with the aid of '''[[Cardenio]]''' (or using scripts prepared by the user). Cardenio is used to generate suitable relations tables to make protein averages from technical or biological replicates, and to normalize (&amp;quot;center&amp;quot;) the data to take into account the systematic quantitative error of each experiment before integration.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 SystemsBiologyTriangle.PNG|thumb|250px|The Systems Biology Triangle.]]&lt;br /&gt;
Systems biology analysis can be performed using the SBT model&amp;lt;ref name=garciamarques2016/&amp;gt; by building up a triangular workflow. Proteins are integrated into categories and categories are integrated to provide statistical significance.&lt;br /&gt;
&lt;br /&gt;
Systems biology information can be downloaded, for example, from [https://david.ncifcrf.gov/ DAVID]. The SanXoT Software Package includes the '''[[Camacho]]''' module to convert such DAVID tab-separated files into SanXoT relations files that can be used by the SBT model. Alternatively, users can generate relations files for systems biology (either custom or downloaded) just by creating tab-separated tables with two columns: the first one for categories, and the second one for proteins.&lt;br /&gt;
&lt;br /&gt;
Protein categories can be managed using '''[[SanXoTSqueezer]]''', '''[[Sanson]]''' and '''[[SanXoTGauss]]'''. For more detailed information see '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=612</id>
		<title>How to use SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=612"/>
				<updated>2018-08-13T08:50:25Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: /* Calibrating the data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Figure 1.png|thumb|400px|Examples of quantitative workflows constructed with modules from the SanXoT package: a '''[[fundamental workflow]]''', with the addition of PTM analysis and Systems Biology (above) and an integration of biological (or technical) replicates (below).]]&lt;br /&gt;
The '''[[SanXoT Software Package]]''' (SSP) allows the statistical analysis of quantitative proteomics data (obtained using other programs such as [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer]), using the WSPP model&amp;lt;ref name=navarro2014&amp;gt;Navarro, P., et al., ''General statistical framework for quantitative proteomics by stable isotope labeling''. J Proteome Res, 2014. 13(3): p. 1234-47.&amp;lt;/ref&amp;gt;. This model analyses relative (rather than absolute) quantitative values, e.g. the ratios between two quantitative measurements.&lt;br /&gt;
&lt;br /&gt;
The SSP contains a set of modules that can be used to assemble user-configurable quantitative workflows with very high flexibility.&lt;br /&gt;
&lt;br /&gt;
In the '''[[fundamental workflow]]''' the quantitative results at the scan level are integrated to quantify each one of the peptides. The quantitative data at the peptide level are then integrated to obtain quantitative protein values. Finally, the quantitative data at the protein level are integrated, providing the statistical significance of the increase or decrease in abundance of each protein. An example of this basic workflow is provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
Among other possibilities, SanXoT allows building up workflows to automatically process very large numbers of quantitative experiments, to integrate results obtained from technical or biological replicates, or to perform Systems Biology analysis using the SBT model&amp;lt;ref name=garciamarques2016&amp;gt;Garcia-Marques, F., et al., ''A Novel Systems-Biology Algorithm for the Analysis of Coordinated Protein Responses Using Quantitative Proteomics''. Mol Cell Proteomics, 2016. 15(5): p. 1740-60.&amp;lt;/ref&amp;gt;. Examples of these applications are provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== General design ==&lt;br /&gt;
All SanXoT workflows contain two initial steps:&lt;br /&gt;
&lt;br /&gt;
# a first step where the raw quantitative information is extracted from the output of the quantification program and organized according to the '''[[How to use SanXoT#Data structure|data structure]]''' of SanXoT. You can see more about '''[[How to use SanXoT#Preparing the data|how to prepare the data]]''' for SanXoT analysis.&lt;br /&gt;
# a second step where the data are processed to obtain prior weights. This process is called calibration. For more details, check [[How to use SanXoT#Calibrating the data|the section '''Calibrating the data''']].&lt;br /&gt;
&lt;br /&gt;
SanXoT then performs WSPP statistics by iteratively applying the Generic Integration Algorithm (GIA)&amp;lt;ref name=garciamarques2016/&amp;gt; on the calibrated data. Each GIA integration performs an independent statistical analysis and its output may be used as input for the following integration step.&lt;br /&gt;
&lt;br /&gt;
In the fundamental workflow, the GIA is applied to integrate scan-level data into peptide-level data, to integrate peptide-level data into protein-level data and, finally, to integrate the protein-level data.&lt;br /&gt;
&lt;br /&gt;
This modular structure allows users to design their own quantitative workflows by applying the GIA integrations iteratively. You can check some examples of advanced applications developed using this modular structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Fig1 calibration.PNG|thumb|Calibration of spectra using '''[[Klibrate]]'''.]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&lt;br /&gt;
[[File:Fig1 lowerLevel-to-higherLevel.PNG|thumb|210px|Integration from any lower level to any higher level, using the GIA.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td colspan=2&amp;gt;[[File:Fig1 fundamentalWorkflow.PNG|thumb|550px|The Fundamental Workflow, which uses the GIA from spectra to proteins.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Data structure ==&lt;br /&gt;
Each GIA integration needs as input tab-separated text tables&amp;lt;ref name=garciamarques2016/&amp;gt;: &lt;br /&gt;
# '''a data file''' containing the quantitative data of the lower level. This file contains three columns: identifier (a text string that is used to identify the element), quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and prior weight (a parameter that measures the accuracy of the quantitative value before performing the integration).&lt;br /&gt;
# '''a relations table''', which establishes the correspondence between the identifiers of the lower level and those of the upper level. This file contains two columns: the first one contains the identifiers of the upper level and the second one the identifiers of the lower level.&lt;br /&gt;
&lt;br /&gt;
Each GIA integration generates an output data file containing the quantitative data of the upper level. This file can be used as input for the next GIA integration. The relations table for the next integration, however, needs to be prepared in advance.&lt;br /&gt;
&lt;br /&gt;
Note that SanXoT identifies columns by their position (not by their name). The first row, containing the header, is removed, so the user can choose column names freely (as long as one header row is included).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Peptide dataFile.PNG|thumb|320px|Example of data file for peptides. First column are peptide identifiers (in this case, amino acid sequences using SEQUEST-style characters for PTM modifications), second column are the log2 of the ratio of intensities of two iTRAQ reporter ions, third column are the prior weights (inverse of the local variance of the quantitative value)]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;[[File:Pep2prot relationsTable.PNG|thumb|390px|Example of peptide-to-protein relations table. First column is the higher level identifiers (in this case, the FASTA header of each protein), second column are the lower level identifiers (in this case, the same sequences used in the data table).]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Preparing the data ==&lt;br /&gt;
&lt;br /&gt;
The output of the quantification software must be converted to generate:&lt;br /&gt;
&lt;br /&gt;
# '''an uncalibrated data file''' containing three columns: identifier, quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and quality parameter. The quality parameter is a number used to rank the quantitative values according to their quality (in terms of variance, higher quality meaning lower variance). In the case of isobaric labeling the third column contains the maximum intensity of the two reporter ions used to calculate the ratio; other kinds of quantification may require other parameters to account for the quality of the quantitative values&amp;lt;ref name=navarro2014/&amp;gt;.&lt;br /&gt;
# '''the relations tables''' of all the GIA integrations planned in the workflow. In the fundamental workflow these tables describe which scans will be integrated into each peptide and which peptides will be integrated into each protein.&lt;br /&gt;
These tables are easy to construct, thanks to their simple plain-text format. If the quantification software generates tab-separated text tables, one of the SanXoT modules, '''[[Aljamia]]''', may be used for this task. The unit tests for SanXoT include a workflow that uses Aljamia to generate all the required files from the output of Proteome Discoverer 2.1. Quantitative values produced in other formats may require simple user-written scripts to generate the files required by SanXoT.&lt;br /&gt;
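For isobaric labeling, the conversion described above can be sketched as follows (hypothetical scan identifiers, reporter intensities and file name; in a real workflow the parsing is done by '''[[Aljamia]]''' or user scripts):&lt;br /&gt;

```python
import csv
import math

# Hypothetical rows exported by the quantification software:
# (scan id, intensity of reporter ion A, intensity of reporter ion B).
quant_rows = [
    ("scan_0001", 2400.0, 1200.0),
    ("scan_0002",  310.0,  620.0),
]

with open("uncalibrated_data.tsv", "w", newline="") as f:
    w = csv.writer(f, delimiter="\t")
    w.writerow(["id", "log2ratio", "quality"])
    for scan_id, a, b in quant_rows:
        x = math.log2(a / b)  # 2-base logarithm of the ratio
        v = max(a, b)         # quality: maximum reporter intensity
        w.writerow([scan_id, x, v])
```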
&lt;br /&gt;
== Calibrating the data ==&lt;br /&gt;
Calibration is the process where the quality parameter (the parameter which is used to rank quantitative values according to their variance) is transformed into the prior weight, a parameter with full statistical meaning defined as the inverse of the local variance of the quantitative value in log2-ratio scale. This process is done using the '''[[Klibrate]]''' module and needs to be performed only in the first step of each workflow. For more information about calibration, see '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
Using prior weights is essential in GIA integrations, since these parameters not only reflect the local variances of the measurements, but are also used as statistical weights to calculate the weighted averages. Moreover, GIA integrations follow error propagation theory, so the output data file of each integration contains true prior weights and does not need to be recalibrated.&lt;br /&gt;
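As an illustration of why the output needs no recalibration, here is a minimal sketch of the weighted-average step, assuming pure inverse-variance weighting and, for simplicity, a general variance of zero (the actual SanXoT module estimates the general variance iteratively and adds it to each element's variance):&lt;br /&gt;

```python
def integrate(lower):
    """lower: list of (log2-ratio, prior weight) pairs at the lower level.
    Returns the (log2-ratio, prior weight) of the upper-level element."""
    w_total = sum(w for _, w in lower)
    x_upper = sum(w * x for x, w in lower) / w_total
    # Error propagation: the variance of the weighted mean is 1 / sum(w),
    # so the output weight is already a true prior weight.
    return x_upper, w_total

x, w = integrate([(1.0, 3.0), (0.0, 1.0)])  # weighted mean 0.75, weight 4.0
```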
&lt;br /&gt;
== Performing integrations and removing outliers ==&lt;br /&gt;
Each GIA integration takes the input data file from the lower level and the relations table, calculates the general variance of the integration (using a robust iterative method) and generates as output a data file containing the integrated quantifications of the upper level. The integration is done using the '''[[SanXoT]]''' module.&lt;br /&gt;
&lt;br /&gt;
In addition to the quantitative data, each GIA integration may generate several additional files containing information about the integration. More detailed information may be found in '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
One of the advantages of the WSPP model is that it provides a straightforward method to eliminate outliers&amp;lt;ref name=navarro2014/&amp;gt; in each integration step. This is done by calculating standardized log2-ratios (z-values), computing the probability that each element of the lower level is an outlier of the null z distribution (N(0,1)), and calculating the associated FDR. The most extreme outliers may be removed sequentially and the integration repeated, until all outliers below a user-defined FDR level are removed.&lt;br /&gt;
&lt;br /&gt;
Outlier removal is performed by the '''[[SanXoTSieve]]''' module by tagging the outlier elements in the relations table. &lt;br /&gt;
The entire process is done in three steps:&lt;br /&gt;
&lt;br /&gt;
# a first integration is done with '''[[SanXoT]]''', which serves to calculate the variance&lt;br /&gt;
# outliers are iteratively tagged in the relations table at a user-defined FDR-level using SanXoTSieve&lt;br /&gt;
# a second integration is done with SanXoT using the fixed variance calculated in the first step and removing from the integration the outliers tagged in the relations table.&lt;br /&gt;
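The tagging idea behind step 2 can be illustrated with a short sketch (not the exact SanXoTSieve implementation): standardize each element, compute a two-tailed p-value against N(0,1), and apply a Benjamini-Hochberg threshold at the chosen FDR:&lt;br /&gt;

```python
import math

def tag_outliers(z_values, fdr=0.01):
    """Return the indices of elements tagged as outliers at the given FDR."""
    # Two-tailed p-value of each z-value under the standard normal N(0,1).
    p = [math.erfc(abs(z) / math.sqrt(2)) for z in z_values]
    order = sorted(range(len(p)), key=lambda i: p[i])
    n = len(p)
    # Benjamini-Hochberg: largest rank k with p_(k) <= fdr * k / n.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p[i] <= fdr * rank / n:
            k_max = rank
    return {order[j] for j in range(k_max)}

# The extreme element (z = 6.5) is tagged; the mild ones are kept.
outliers = tag_outliers([0.3, -0.8, 6.5, 1.1], fdr=0.01)
```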
&lt;br /&gt;
== Interpreting the results of GIA integrations ==&lt;br /&gt;
GIA integrations basically generate the quantitative data and their associated variances at the upper level, and detect and eliminate outliers that could affect the values at the upper level. However, each integration also generates additional information that can be very useful for interpreting the results. The most important item is the general variance of the integration.&lt;br /&gt;
&lt;br /&gt;
The general variance is an estimate of the error introduced when the elements of the lower level are experimentally generated from those of the upper level. This parameter may have different meanings depending on the nature of the data.&lt;br /&gt;
&lt;br /&gt;
When spectra are integrated into peptides, the general variance estimates the error associated with the measurement (independently of peptide and protein manipulation), and hence depends on the LC-MS setup. When peptides are integrated into proteins, the variance estimates the error produced at the time of peptide generation (protein digestion, peptide desalting), independently of the LC-MS analysis. When proteins are integrated into the grand mean, the variance mainly estimates the error produced during protein manipulation (extraction,…). Similarly, integrating technical and biological replicates allows the different error sources at the protein level to be estimated separately. Integration of proteins into categories gives useful insight into protein coordination&amp;lt;ref name=garciamarques2016/&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Building the workflows ==&lt;br /&gt;
Typically, workflows contain three main parts:&lt;br /&gt;
&lt;br /&gt;
# Data parsing to generate the uncalibrated data files and all relations tables needed. This may be performed using '''[[Aljamia]]''' or other scripts provided by the user.&lt;br /&gt;
# Data calibration to generate the (calibrated) data files. This is performed using '''[[Klibrate]]'''.&lt;br /&gt;
# A chain of concatenated (or branched) GIA integrations. As explained above, each GIA integration is performed by sequentially running '''[[SanXoT]]''' (to calculate the general variance), '''[[SanXoTSieve]]''' (to tag outliers) and SanXoT again (to integrate without the outliers).&lt;br /&gt;
&lt;br /&gt;
Note that some specific GIA integrations require the use of other modules; these include experiment merging and Systems Biology analysis. See the advanced applications below and '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
== Advanced applications ==&lt;br /&gt;
[[File:Fig1 replicateIntegration.PNG|thumb|600px|Integration of technical or biological replicates using the GIA.]]&lt;br /&gt;
SanXoT also allows the GIA to be used to integrate technical or biological replicates. In this case, scans are integrated into peptides and then into proteins separately for each replicate, and the protein-level data are then integrated to obtain protein averages (grouped data). A further GIA integration allows the statistical significance of the averaged protein changes to be calculated.&lt;br /&gt;
&lt;br /&gt;
Merging of experiments may be performed with the aid of '''[[Cardenio]]''' (or using scripts prepared by the user). Cardenio is used to generate suitable relations tables to compute protein averages from technical or biological replicates, and to normalize (&amp;quot;center&amp;quot;) the data to account for the systematic quantitative error of each experiment before integration.&lt;br /&gt;
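The centering idea can be sketched as subtracting each experiment's weighted grand mean from its log2-ratios, so that a systematic offset in one experiment does not bias the merged integration (hypothetical data; Cardenio itself works through relations tables rather than this helper):&lt;br /&gt;

```python
def center(data):
    """data: list of (log2-ratio, weight) pairs from one experiment.
    Returns the log2-ratios with the weighted grand mean subtracted."""
    mean = sum(w * x for x, w in data) / sum(w for _, w in data)
    return [x - mean for x, _ in data]

# An experiment with a systematic +1.0 offset is re-centered around zero.
centered = center([(1.5, 1.0), (0.5, 1.0)])
```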
&lt;br /&gt;
[[File:Fig1 SystemsBiologyTriangle.PNG|thumb|250px|The Systems Biology Triangle.]]&lt;br /&gt;
Systems biology analysis can be performed using the SBT model&amp;lt;ref name=garciamarques2016/&amp;gt; by building up a triangular workflow. Proteins are integrated into categories and categories are integrated to provide statistical significance.&lt;br /&gt;
&lt;br /&gt;
Systems biology information can be downloaded, for example, from [https://david.ncifcrf.gov/ DAVID]. The SanXoT Software Package includes the '''[[Camacho]]''' module to convert such DAVID tab-separated files into SanXoT relations files that can be used by the SBT model. Alternatively, users can generate relations files for systems biology (either custom or downloaded) just by creating tab-separated tables with two columns: the first one for categories, and the second one for proteins.&lt;br /&gt;
&lt;br /&gt;
Protein categories can be managed using '''[[SanXoTSqueezer]]''', '''[[Sanson]]''' and '''[[SanXoTGauss]]'''. For more detailed information see '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=611</id>
		<title>How to use SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=611"/>
				<updated>2018-08-13T08:49:38Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: /* Calibrating the data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Figure 1.png|thumb|400px|Examples of quantitative workflows constructed with modules from the SanXoT package: a '''[[fundamental workflow]]''', with the addition of PTM analysis and Systems Biology (above) and an integration of biological (or technical) replicates (below).]]&lt;br /&gt;
The '''[[SanXoT Software Package]]''' (SSP) allows the statistical analysis of quantitative proteomics data (obtained using other programs such as [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer]), using the WSPP model&amp;lt;ref name=navarro2014&amp;gt;Navarro, P., et al., ''General statistical framework for quantitative proteomics by stable isotope labeling''. J Proteome Res, 2014. 13(3): p. 1234-47.&amp;lt;/ref&amp;gt;. This model analyses relative (rather than absolute) quantitative values, e.g. the ratios between two quantitative measurements.&lt;br /&gt;
&lt;br /&gt;
The SSP contains a set of modules that can be used to assemble user-configurable quantitative workflows with very high flexibility.&lt;br /&gt;
&lt;br /&gt;
In the '''[[fundamental workflow]]''' the quantitative results at the scan level are integrated to quantify each of the peptides. The quantitative data at the peptide level are then integrated to obtain quantitative protein values. Finally, the quantitative data at the protein level are integrated, providing the statistical significance that the abundance of each protein is increased or decreased. An example of this basic workflow is provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
Among other possibilities, SanXoT allows building workflows to automatically process very large numbers of quantitative experiments, to integrate results obtained from technical or biological replicates, or to perform Systems Biology analysis using the SBT model&amp;lt;ref name=garciamarques2016&amp;gt;Garcia-Marques, F., et al., ''A Novel Systems-Biology Algorithm for the Analysis of Coordinated Protein Responses Using Quantitative Proteomics''. Mol Cell Proteomics, 2016. 15(5): p. 1740-60.&amp;lt;/ref&amp;gt;. Examples of these applications are provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== General design ==&lt;br /&gt;
All SanXoT workflows contain two initial steps:&lt;br /&gt;
&lt;br /&gt;
# a first step where the raw quantitative information is extracted from the output of the quantification program and organized according to the '''[[How to use SanXoT#Data structure|data structure]]''' of SanXoT. You can see more about '''[[How to use SanXoT#preparing the data|how to prepare the data]]''' for SanXoT analysis.&lt;br /&gt;
# a second step where the data are processed to obtain prior weights. This process is called calibration. For more details, check [[How to use SanXoT#Calibrating the data|the section '''Calibrating the data''']].&lt;br /&gt;
&lt;br /&gt;
SanXoT then performs the WSPP statistics by iteratively applying the Generic Integration Algorithm (GIA)&amp;lt;ref name=garciamarques2016/&amp;gt; to the calibrated data. Each GIA integration performs an independent statistical analysis, and its output may be used as input for the following integration step.&lt;br /&gt;
&lt;br /&gt;
In the fundamental workflow the GIA is applied to integrate scan-level data into peptide-level data, to integrate peptide-level data into protein-level data and, finally, to integrate the protein-level data.&lt;br /&gt;
&lt;br /&gt;
This modular structure allows users to design their own quantitative workflows by applying the GIA integrations iteratively. You can check some examples of advanced applications developed using this modular structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Fig1 calibration.PNG|thumb|Calibration of spectra using '''[[Klibrate]]'''.]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&lt;br /&gt;
[[File:Fig1 lowerLevel-to-higherLevel.PNG|thumb|210px|Integration from any lower level to any higher level, using the GIA.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td colspan=2&amp;gt;[[File:Fig1 fundamentalWorkflow.PNG|thumb|550px|The Fundamental Workflow, which uses the GIA from spectra to proteins.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Data structure ==&lt;br /&gt;
Each GIA integration needs as input tab-separated text tables&amp;lt;ref name=garciamarques2016/&amp;gt;: &lt;br /&gt;
# '''a data file''' containing the quantitative data of the lower level. This file contains three columns: identifier (a text string that is used to identify the element), quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and prior weight (a parameter that measures the accuracy of the quantitative value before performing the integration).&lt;br /&gt;
# '''a relations table''', which establishes the correspondence between the identifiers of the lower level and those of the upper level. This file contains two columns: the first one contains the identifiers of the upper level and the second one the identifiers of the lower level.&lt;br /&gt;
&lt;br /&gt;
Each GIA integration generates an output data file containing the quantitative data of the upper level. This file can be used as input for the next GIA integration. The relations table for the next integration, however, needs to be prepared in advance.&lt;br /&gt;
&lt;br /&gt;
Note that SanXoT identifies columns by position, not by name. The first row (the header) is discarded, so column names can be chosen freely (as long as exactly one header row is included).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Peptide dataFile.PNG|thumb|320px|Example of a data file for peptides. The first column contains the peptide identifiers (in this case, amino acid sequences using SEQUEST-style characters for PTM modifications), the second column the log2 of the ratio of the intensities of two iTRAQ reporter ions, and the third column the prior weights (inverse of the local variance of the quantitative value).]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;[[File:Pep2prot relationsTable.PNG|thumb|390px|Example of a peptide-to-protein relations table. The first column contains the higher-level identifiers (in this case, the FASTA header of each protein) and the second column the lower-level identifiers (in this case, the same sequences used in the data table).]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Preparing the data ==&lt;br /&gt;
&lt;br /&gt;
The output of the quantification software must be converted to generate:&lt;br /&gt;
&lt;br /&gt;
# '''an uncalibrated data file''' containing three columns: identifier, quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and quality parameter. The quality parameter is a number used to rank the quantitative values according to their quality (in terms of variance, higher quality meaning lower variance). In the case of isobaric labeling the third column contains the maximum intensity of the two reporter ions used to calculate the ratio; other kinds of quantification may require other parameters to account for the quality of the quantitative values&amp;lt;ref name=navarro2014/&amp;gt;.&lt;br /&gt;
# '''the relations tables''' of all the GIA integrations planned in the workflow. In the fundamental workflow these tables describe which scans will be integrated into each peptide and which peptides will be integrated into each protein.&lt;br /&gt;
These tables are easy to construct, thanks to their simple plain-text format. If the quantification software generates tab-separated text tables, one of the SanXoT modules, '''[[Aljamia]]''', may be used for this task. The unit tests for SanXoT include a workflow that uses Aljamia to generate all the required files from the output of Proteome Discoverer 2.1. Quantitative values produced in other formats may require simple user-written scripts to generate the files required by SanXoT.&lt;br /&gt;
&lt;br /&gt;
== Calibrating the data ==&lt;br /&gt;
Calibration is the process where the quality parameter (the parameter which is used to rank quantitative values according to their variance) is transformed into the prior weight, a parameter with full statistical meaning defined as the inverse of the local variance of the quantitative value in log2-ratio scale. This process is done using the '''[[Klibrate]]''' module and needs to be performed only in the first step of each workflow. For more information about calibration, see '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
Using prior weights is essential in GIA integrations, since these parameters not only reflect the local variances of the measurements, but are also used as statistical weights to calculate the weighted averages. Moreover, GIA integrations follow error propagation theory, so the output data file of each integration contains true prior weights and does not need to be recalibrated.&lt;br /&gt;
&lt;br /&gt;
== Performing integrations and removing outliers ==&lt;br /&gt;
Each GIA integration takes the input data file from the lower level and the relations table, calculates the general variance of the integration (using a robust iterative method) and generates as output a data file containing the integrated quantifications of the upper level. The integration is done using the '''[[SanXoT]]''' module.&lt;br /&gt;
&lt;br /&gt;
In addition to the quantitative data, each GIA integration may generate several additional files containing information about the integration. More detailed information may be found in '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
One of the advantages of the WSPP model is that it provides a straightforward method to eliminate outliers&amp;lt;ref name=navarro2014/&amp;gt; in each integration step. This is done by calculating standardized log2-ratios (z-values), computing the probability that each element of the lower level is an outlier of the null z distribution (N(0,1)), and calculating the associated FDR. The most extreme outliers may be removed sequentially and the integration repeated, until all outliers below a user-defined FDR level are removed.&lt;br /&gt;
&lt;br /&gt;
Outlier removal is performed by the '''[[SanXoTSieve]]''' module by tagging the outlier elements in the relations table. &lt;br /&gt;
The entire process is done in three steps:&lt;br /&gt;
&lt;br /&gt;
# a first integration is done with '''[[SanXoT]]''', which serves to calculate the variance&lt;br /&gt;
# outliers are iteratively tagged in the relations table at a user-defined FDR-level using SanXoTSieve&lt;br /&gt;
# a second integration is done with SanXoT using the fixed variance calculated in the first step and removing from the integration the outliers tagged in the relations table.&lt;br /&gt;
&lt;br /&gt;
== Interpreting the results of GIA integrations ==&lt;br /&gt;
GIA integrations basically generate the quantitative data and their associated variances at the upper level, and detect and eliminate outliers that could affect the values at the upper level. However, each integration also generates additional information that can be very useful for interpreting the results. The most important item is the general variance of the integration.&lt;br /&gt;
&lt;br /&gt;
The general variance is an estimate of the error introduced when the elements of the lower level are experimentally generated from those of the upper level. This parameter may have different meanings depending on the nature of the data.&lt;br /&gt;
&lt;br /&gt;
When spectra are integrated into peptides, the general variance estimates the error associated with the measurement (independently of peptide and protein manipulation), and hence depends on the LC-MS setup. When peptides are integrated into proteins, the variance estimates the error produced at the time of peptide generation (protein digestion, peptide desalting), independently of the LC-MS analysis. When proteins are integrated into the grand mean, the variance mainly estimates the error produced during protein manipulation (extraction,…). Similarly, integrating technical and biological replicates allows the different error sources at the protein level to be estimated separately. Integration of proteins into categories gives useful insight into protein coordination&amp;lt;ref name=garciamarques2016/&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Building the workflows ==&lt;br /&gt;
Typically, workflows contain three main parts:&lt;br /&gt;
&lt;br /&gt;
# Data parsing to generate the uncalibrated data files and all relations tables needed. This may be performed using '''[[Aljamia]]''' or other scripts provided by the user.&lt;br /&gt;
# Data calibration to generate the (calibrated) data files. This is performed using '''[[Klibrate]]'''.&lt;br /&gt;
# A chain of concatenated (or branched) GIA integrations. As explained above, each GIA integration is performed by sequentially running '''[[SanXoT]]''' (to calculate the general variance), '''[[SanXoTSieve]]''' (to tag outliers) and SanXoT again (to integrate without the outliers).&lt;br /&gt;
&lt;br /&gt;
Note that some specific GIA integrations require the use of other modules; these include experiment merging and Systems Biology analysis. See the advanced applications below and '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
== Advanced applications ==&lt;br /&gt;
[[File:Fig1 replicateIntegration.PNG|thumb|600px|Integration of technical or biological replicates using the GIA.]]&lt;br /&gt;
SanXoT also allows the GIA to be used to integrate technical or biological replicates. In this case, scans are integrated into peptides and then into proteins separately for each replicate, and the protein-level data are then integrated to obtain protein averages (grouped data). A further GIA integration allows the statistical significance of the averaged protein changes to be calculated.&lt;br /&gt;
&lt;br /&gt;
Merging of experiments may be performed with the aid of '''[[Cardenio]]''' (or using scripts prepared by the user). Cardenio is used to generate suitable relations tables to compute protein averages from technical or biological replicates, and to normalize (&amp;quot;center&amp;quot;) the data to account for the systematic quantitative error of each experiment before integration.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 SystemsBiologyTriangle.PNG|thumb|250px|The Systems Biology Triangle.]]&lt;br /&gt;
Systems biology analysis can be performed using the SBT model&amp;lt;ref name=garciamarques2016/&amp;gt; by building up a triangular workflow. Proteins are integrated into categories and categories are integrated to provide statistical significance.&lt;br /&gt;
&lt;br /&gt;
Systems biology information can be downloaded, for example, from [https://david.ncifcrf.gov/ DAVID]. The SanXoT Software Package includes the '''[[Camacho]]''' module to convert such DAVID tab-separated files into SanXoT relations files that can be used by the SBT model. Alternatively, users can generate relations files for systems biology (either custom or downloaded) just by creating tab-separated tables with two columns: the first one for categories, and the second one for proteins.&lt;br /&gt;
&lt;br /&gt;
Protein categories can be managed using '''[[SanXoTSqueezer]]''', '''[[Sanson]]''' and '''[[SanXoTGauss]]'''. For more detailed information see '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=610</id>
		<title>How to use SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=610"/>
				<updated>2018-08-13T08:49:09Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: /* Calibrating the data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Figure 1.png|thumb|400px|Examples of quantitative workflows constructed with modules from the SanXoT package: a '''[[fundamental workflow]]''', with the addition of PTM analysis and Systems Biology (above) and an integration of biological (or technical) replicates (below).]]&lt;br /&gt;
The '''[[SanXoT Software Package]]''' (SSP) allows the statistical analysis of quantitative proteomics data (obtained using other programs such as [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer]), using the WSPP model&amp;lt;ref name=navarro2014&amp;gt;Navarro, P., et al., ''General statistical framework for quantitative proteomics by stable isotope labeling''. J Proteome Res, 2014. 13(3): p. 1234-47.&amp;lt;/ref&amp;gt;. This model analyses relative (rather than absolute) quantitative values, e.g. the ratios between two quantitative measurements.&lt;br /&gt;
&lt;br /&gt;
The SSP contains a set of modules that can be used to assemble user-configurable quantitative workflows with very high flexibility.&lt;br /&gt;
&lt;br /&gt;
In the '''[[fundamental workflow]]''' the quantitative results at the scan level are integrated to quantify each of the peptides. The quantitative data at the peptide level are then integrated to obtain quantitative protein values. Finally, the quantitative data at the protein level are integrated, providing the statistical significance that the abundance of each protein is increased or decreased. An example of this basic workflow is provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
Among other possibilities, SanXoT allows building workflows to automatically process very large numbers of quantitative experiments, to integrate results obtained from technical or biological replicates, or to perform Systems Biology analysis using the SBT model&amp;lt;ref name=garciamarques2016&amp;gt;Garcia-Marques, F., et al., ''A Novel Systems-Biology Algorithm for the Analysis of Coordinated Protein Responses Using Quantitative Proteomics''. Mol Cell Proteomics, 2016. 15(5): p. 1740-60.&amp;lt;/ref&amp;gt;. Examples of these applications are provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== General design ==&lt;br /&gt;
All SanXoT workflows contain two initial steps:&lt;br /&gt;
&lt;br /&gt;
# a first step where the raw quantitative information is extracted from the output of the quantification program and organized according to the '''[[How to use SanXoT#Data structure|data structure]]''' of SanXoT. You can see more about '''[[How to use SanXoT#preparing the data|how to prepare the data]]''' for SanXoT analysis.&lt;br /&gt;
# a second step where the data are processed to obtain prior weights. This process is called calibration. For more details, check [[How to use SanXoT#Calibrating the data|the section '''Calibrating the data''']].&lt;br /&gt;
&lt;br /&gt;
SanXoT then performs the WSPP statistics by iteratively applying the Generic Integration Algorithm (GIA)&amp;lt;ref name=garciamarques2016/&amp;gt; to the calibrated data. Each GIA integration performs an independent statistical analysis, and its output may be used as input for the following integration step.&lt;br /&gt;
&lt;br /&gt;
In the fundamental workflow the GIA is applied to integrate scan-level data into peptide-level data, to integrate peptide-level data into protein-level data and, finally, to integrate the protein-level data.&lt;br /&gt;
&lt;br /&gt;
This modular structure allows users to design their own quantitative workflows by applying the GIA integrations iteratively. You can check some examples of advanced applications developed using this modular structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Fig1 calibration.PNG|thumb|Calibration of spectra using '''[[Klibrate]]'''.]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&lt;br /&gt;
[[File:Fig1 lowerLevel-to-higherLevel.PNG|thumb|210px|Integration from any lower level to any higher level, using the GIA.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td colspan=2&amp;gt;[[File:Fig1 fundamentalWorkflow.PNG|thumb|550px|The Fundamental Workflow, which uses the GIA from spectra to proteins.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Data structure ==&lt;br /&gt;
Each GIA integration needs as input tab-separated text tables&amp;lt;ref name=garciamarques2016/&amp;gt;: &lt;br /&gt;
# '''a data file''' containing the quantitative data of the lower level. This file contains three columns: identifier (a text string that is used to identify the element), quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and prior weight (a parameter that measures the accuracy of the quantitative value before performing the integration).&lt;br /&gt;
# '''a relations table''', which establishes the correspondence between the identifiers of the lower level and those of the upper level. This file contains two columns: the first one contains the identifiers of the upper level and the second one the identifiers of the lower level.&lt;br /&gt;
&lt;br /&gt;
Each GIA integration generates an output data file containing the quantitative data of the upper level. This file can be used as input for the next GIA integration. The relations table for the next integration, however, needs to be prepared in advance.&lt;br /&gt;
&lt;br /&gt;
Note that SanXoT identifies columns by position, not by name. The first row (the header) is discarded, so column names can be chosen freely (as long as exactly one header row is included).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Peptide dataFile.PNG|thumb|320px|Example of a data file for peptides. The first column contains the peptide identifiers (in this case, amino acid sequences using SEQUEST-style characters for PTM modifications), the second column the log2 of the ratio of the intensities of two iTRAQ reporter ions, and the third column the prior weights (inverse of the local variance of the quantitative value).]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;[[File:Pep2prot relationsTable.PNG|thumb|390px|Example of a peptide-to-protein relations table. The first column contains the higher-level identifiers (in this case, the FASTA header of each protein) and the second column the lower-level identifiers (in this case, the same sequences used in the data table).]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Preparing the data ==&lt;br /&gt;
&lt;br /&gt;
The data obtained by the software used to obtain the quantitative values must be converted to generate:&lt;br /&gt;
&lt;br /&gt;
# '''an uncalibrated data file''' containing three columns: identifier, quantitative value (base-2 logarithm of the ratio of the two quantitative measurements) and quality parameter. The quality parameter is a number that is used to rank the quantitative values according to their quality (in terms of variance, higher quality meaning a lower variance). In the case of isobaric labeling the third column contains the maximum intensity of the two reporter ions used to calculate the ratio; other kinds of quantification may require other parameters to account for the quality of the quantitative values&amp;lt;ref name=navarro2014/&amp;gt;.&lt;br /&gt;
# '''the relations tables''' of all the GIA integrations planned in the workflow. In the fundamental workflow these tables describe which scans will be integrated into each peptide and which peptides will be integrated into each protein.&lt;br /&gt;
These tables are easy to construct, due to their simple plain-text format. If the software used for quantification generates tab-separated text tables, one of the SanXoT modules, '''[[Aljamia]]''', may be used for this task. In the unit tests for SanXoT we include a workflow that uses Aljamia to generate all the required files from the output of Proteome Discoverer 2.1. Quantitative values produced by the quantification software in other formats may require simple scripts to generate the files required by SanXoT.&lt;br /&gt;
&lt;br /&gt;
== Calibrating the data ==&lt;br /&gt;
Calibration is the process where the quality parameter (the parameter which is used to rank quantitative values according to their variance) is transformed into the prior weight, a parameter with full statistical meaning defined as the inverse of the local variance of the quantitative value in log2-ratio scale. This process is done using the '''[[Klibrate]]''' module and needs to be performed only in the first step of each workflow. For more information about calibration, see exploring SanXoT features.&lt;br /&gt;
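A generic way to see what calibration accomplishes: group elements by their quality parameter, estimate the local variance of the log2-ratios within each group, and take its inverse as the prior weight. The binned sketch below only illustrates this idea; Klibrate fits the WSPP calibration model rather than binning, and all names here are invented for illustration:

```python
import statistics

def calibrate(values, quality, n_bins=5):
    """Toy calibration: prior weight = 1 / local variance of the
    log2-ratios, where "local" means elements with a similar quality
    parameter. NOT Klibrate's actual algorithm; a conceptual sketch."""
    order = sorted(range(len(values)), key=lambda i: quality[i])
    bin_size = max(1, len(order) // n_bins)
    weights = [0.0] * len(values)
    for start in range(0, len(order), bin_size):
        idx = order[start:start + bin_size]
        # population variance of the log2-ratios in this quality bin
        var = statistics.pvariance([values[i] for i in idx]) or 1.0
        for i in idx:
            weights[i] = 1.0 / var
    return weights
```

Higher-quality elements sit in bins with a lower local variance and therefore receive larger prior weights, which is exactly the role the prior weight plays in the integrations that follow.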
&lt;br /&gt;
Using prior weights is essential in GIA integrations, since these parameters not only reflect the local variances of the measurements, but are also used as statistical weights to calculate the weighted averages. Besides, GIA integrations follow error propagation theory, so the output data file of each integration contains true prior weights and does not need to be recalibrated.&lt;br /&gt;
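The arithmetic behind one integration step can be sketched as follows: each upper-level value is the weighted average of its lower-level elements, and, by error propagation, the sum of the lower-level prior weights serves as the upper-level prior weight. This minimal sketch omits the general-variance estimation that the real SanXoT module performs, and all names are illustrative:

```python
from collections import defaultdict

def integrate(lower_data, relations):
    """Sketch of one GIA-style integration step.
    lower_data: list of (identifier, log2-ratio, prior weight).
    relations:  list of (upper-level id, lower-level id).
    Returns a list of (upper-level id, weighted average, summed weight)."""
    lookup = {ident: (x, w) for ident, x, w in lower_data}
    groups = defaultdict(list)
    for upper, lower in relations:
        if lower in lookup:
            groups[upper].append(lookup[lower])
    result = []
    for upper, members in groups.items():
        wsum = sum(w for _, w in members)           # propagated prior weight
        xbar = sum(w * x for x, w in members) / wsum  # weighted average
        result.append((upper, xbar, wsum))
    return result
```

The output triples have the same three-column shape as the input data file, which is what lets GIA integrations be chained, each output feeding the next step.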
&lt;br /&gt;
== Performing integrations and removing outliers ==&lt;br /&gt;
Each GIA integration takes the input data file from the lower level and the relations table, calculates the general variance of the integration  (using a robust iterative method) and generates as output a data file containing the integrated quantifications of the upper level. The integration is done using the '''[[SanXoT]]''' module.&lt;br /&gt;
&lt;br /&gt;
In addition to the quantitative data, each GIA integration may generate several additional files which contain information about the integration. More detailed information may be found in exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
One of the advantages of the WSPP model is that it provides a straightforward method to eliminate outliers&amp;lt;ref name=navarro2014/&amp;gt; in each integration step. This is done by calculating standardized log2-ratios (z-values), calculating the probability that each element of the lower level is a significant outlier of the z distribution (e.g., N(0,1)) and calculating the associated FDR. The most extreme outliers may be removed sequentially and the integration repeated, until all outliers below a user-defined FDR-level are removed.&lt;br /&gt;
&lt;br /&gt;
Outlier removal is performed by the '''[[SanXoTSieve]]''' module by tagging the outlier elements in the relations table. &lt;br /&gt;
The entire process is done in three steps:&lt;br /&gt;
&lt;br /&gt;
# a first integration is done with '''[[SanXoT]]''', which serves to calculate the variance&lt;br /&gt;
# outliers are iteratively tagged in the relations table at a user-defined FDR-level using SanXoTSieve&lt;br /&gt;
# a second integration is done with SanXoT, using the fixed variance calculated in the first step and excluding from the integration the outliers tagged in the relations table.&lt;br /&gt;
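The statistical idea behind the tagging step can be illustrated with a single, non-iterative Benjamini-Hochberg pass over the standardized log2-ratios (SanXoTSieve itself works iteratively on the relations table; the function and parameter names below are hypothetical):

```python
import math

def tag_outliers(zvalues, fdr=0.01):
    """Flag elements whose standardized log2-ratio (z-value, assumed
    N(0,1) under the model) is a significant outlier at the given FDR,
    using a single Benjamini-Hochberg pass. Illustration only."""
    def two_tailed_p(z):
        # P(|Z| >= |z|) for a standard normal, via the complementary error function
        return math.erfc(abs(z) / math.sqrt(2.0))
    items = sorted(zvalues.items(), key=lambda kv: two_tailed_p(kv[1]))
    n = len(items)
    cutoff = 0
    for rank, (ident, z) in enumerate(items, start=1):
        # BH: keep the largest rank whose p-value clears the stepped threshold
        if two_tailed_p(z) <= fdr * rank / n:
            cutoff = rank
    return set(ident for ident, _ in items[:cutoff])
```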
&lt;br /&gt;
== Interpreting the results of GIA integrations ==&lt;br /&gt;
GIA integrations basically generate the quantitative data and their associated variances at the upper level, and detect and eliminate outliers that can affect the values at the upper level. However, each integration also generates additional information that may be very useful for interpreting the results. The most important is the general variance of the integration.&lt;br /&gt;
&lt;br /&gt;
The general variance is an estimate of the error introduced when the elements of the lower level are experimentally generated from those of the upper level. This parameter may have different meaning depending on the nature of the data.&lt;br /&gt;
&lt;br /&gt;
When spectra are integrated into peptides, the general variance estimates the error associated with the measurement (independently of peptide and protein manipulation), and hence depends on the LC-MS setup. When peptides are integrated into proteins, the variance estimates the error produced at the time of peptide generation (protein digestion, peptide desalting), independently of the LC-MS analysis. When proteins are integrated into the grand mean, the variance mainly estimates the error produced during protein manipulation (extraction,…). Similarly, integrating technical and biological replicates allows the different error sources at the protein level to be estimated separately. Integration of proteins into categories gives useful insight into protein coordination&amp;lt;ref name=garciamarques2016/&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Building the workflows ==&lt;br /&gt;
Typically, workflows contain three main parts:&lt;br /&gt;
&lt;br /&gt;
# Data parsing to generate the uncalibrated data files and all relations tables needed. This may be performed using '''[[Aljamia]]''' or other scripts provided by the user.&lt;br /&gt;
# Data calibration to generate the (calibrated) data files. This is performed using '''[[Klibrate]]'''.&lt;br /&gt;
# A list of concatenated (or branched) GIA integrations. As explained above, each GIA integration is performed by sequentially running '''[[SanXoT]]''' (to calculate the general variance), '''[[SanXoTSieve]]''' (to tag outliers) and SanXoT again (to integrate without outliers).&lt;br /&gt;
&lt;br /&gt;
Note that some specific GIA integrations require the use of other modules. These include experiment merging and Systems biology analysis. See advanced applications and exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
== Advanced applications ==&lt;br /&gt;
[[File:Fig1 replicateIntegration.PNG|thumb|600px|Integration of technical or biological replicates using the GIA.]]&lt;br /&gt;
SanXoT also allows the GIA to be used to integrate technical or biological replicates. In this case scans are integrated into peptides and then into proteins separately for each replicate, and then the protein-level data are integrated to obtain protein averages (grouped data). A further GIA integration allows calculation of the statistical significance of the averaged protein changes.&lt;br /&gt;
&lt;br /&gt;
Merging of experiments may be performed with the aid of '''[[Cardenio]]''' (or using scripts prepared by the user). Cardenio is used to generate suitable relations tables to make protein averages from technical or biological replicates, and to normalize (&amp;quot;center&amp;quot;) the data to take into account the systematic quantitative error of each experiment before integration.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 SystemsBiologyTriangle.PNG|thumb|250px|The Systems Biology Triangle.]]&lt;br /&gt;
Systems biology analysis can be performed using the SBT model&amp;lt;ref name=garciamarques2016/&amp;gt; by building up a triangular workflow. Proteins are integrated into categories and categories are integrated to provide statistical significance.&lt;br /&gt;
&lt;br /&gt;
Systems biology information can be downloaded, for example, from [https://david.ncifcrf.gov/ DAVID]. The SanXoT Software Package includes the '''[[Camacho]]''' module to convert such DAVID tab-separated files into SanXoT relations files that can be used by the SBT model. Alternatively, users can generate relations files for systems biology (either custom or downloaded) just by creating tab-separated tables with two columns: the first one for categories, and the second one for proteins.&lt;br /&gt;
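Whatever their origin (DAVID via Camacho, or custom annotations), such relations files reduce to a two-column tab-separated table. A sketch of writing one from a plain Python mapping (the mapping, file name and header labels are invented for illustration; the header row follows the one-header-row convention described above):

```python
def write_relations_table(category_to_proteins, path):
    """Write a relations table with the higher level (categories) in the
    first column and the lower level (proteins) in the second.
    `category_to_proteins` maps a category id to a list of protein ids."""
    with open(path, "w") as fh:
        fh.write("category\tprotein\n")  # header labels are free-form
        for category, proteins in category_to_proteins.items():
            for protein in proteins:
                fh.write(f"{category}\t{protein}\n")
```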
&lt;br /&gt;
Protein categories can be managed using '''[[SanXoTSqueezer]]''', '''[[Sanson]]''' and '''[[SanXoTGauss]]'''. For more detailed information see '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=609</id>
		<title>How to use SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=609"/>
				<updated>2018-08-13T08:47:53Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: /* Preparing the data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Figure 1.png|thumb|400px|Examples of quantitative workflows constructed with modules from the SanXoT package: a '''[[fundamental workflow]]''', with the addition of PTM analysis and Systems Biology (above) and an integration of biological (or technical) replicates (below).]]&lt;br /&gt;
The '''[[SanXoT Software Package]]''' (SSP) allows the statistical analysis of quantitative proteomics data (obtained using other programs such as [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer]), using the WSPP model&amp;lt;ref name=navarro2014&amp;gt;Navarro, P., et al., ''General statistical framework for quantitative proteomics by stable isotope labeling''. J Proteome Res, 2014. 13(3): p. 1234-47.&amp;lt;/ref&amp;gt;. This model analyses relative (rather than absolute) quantitative values, e.g. the ratios between two quantitative measurements.&lt;br /&gt;
&lt;br /&gt;
The SSP contains a set of modules that can be used to assemble user-configurable quantitative workflows with very high flexibility.&lt;br /&gt;
&lt;br /&gt;
In the '''[[fundamental workflow]]''' the quantitative results at the scan level are integrated to quantify each one of the peptides. The quantitative data at the peptide level are then integrated to obtain quantitative protein values. Finally, the quantitative data at the protein level are integrated, providing the statistical significance with which the abundance of each protein is increased or decreased. An example of this basic workflow is provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
Among other possibilities, SanXoT allows building up workflows to automatically process very large numbers of quantitative experiments, to integrate results obtained from technical or biological replicates, or to perform Systems Biology analysis using the SBT model&amp;lt;ref name=garciamarques2016&amp;gt;Garcia-Marques, F., et al., ''A Novel Systems-Biology Algorithm for the Analysis of Coordinated Protein Responses Using Quantitative Proteomics''. Mol Cell Proteomics, 2016. 15(5): p. 1740-60.&amp;lt;/ref&amp;gt;. Examples of these applications are provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== General design ==&lt;br /&gt;
All SanXoT workflows contain two initial steps:&lt;br /&gt;
&lt;br /&gt;
# a first step where the raw quantitative information is extracted from the output of the quantification program and organized according to the '''[[How to use SanXoT#Data structure|data structure]]''' of SanXoT. You can see more about '''[[How to use SanXoT#Preparing the data|how to prepare the data]]''' for SanXoT analysis.&lt;br /&gt;
# a second step where the data are processed to obtain prior weights. This process is called calibration. For more details, check [[How to use SanXoT#Calibrating the data|the section '''Calibrating the data''']].&lt;br /&gt;
&lt;br /&gt;
SanXoT then performs WSPP statistics by iteratively applying the Generic Integration Algorithm (GIA)&amp;lt;ref name=garciamarques2016/&amp;gt; to the calibrated data. Each GIA integration performs independent statistics and its output may be used as input for the following integration step.&lt;br /&gt;
&lt;br /&gt;
In the fundamental workflow the GIA is applied to integrate scan-level data into peptide-level data, to integrate peptide-level data into protein-level data, and finally to integrate the protein-level data.&lt;br /&gt;
&lt;br /&gt;
This modular structure allows users to design their own quantitative workflows by iteratively applying GIA integrations. You can check some examples of advanced applications developed using this modular structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Fig1 calibration.PNG|thumb|Calibration of spectra using '''[[Klibrate]]'''.]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&lt;br /&gt;
[[File:Fig1 lowerLevel-to-higherLevel.PNG|thumb|210px|Integration from any lower level to any higher level, using the GIA.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td colspan=2&amp;gt;[[File:Fig1 fundamentalWorkflow.PNG|thumb|550px|The Fundamental Workflow, which uses the GIA from spectra to proteins.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Data structure ==&lt;br /&gt;
Each GIA integration needs two tab-separated text tables as input&amp;lt;ref name=garciamarques2016/&amp;gt;:&lt;br /&gt;
# '''a data file''' containing the quantitative data of the lower level. This file contains three columns: identifier (a text string that is used to identify the element), quantitative value (base-2 logarithm of the ratio of the two quantitative measurements) and prior weight (a parameter that measures the accuracy of the quantitative value before performing the integration).&lt;br /&gt;
# '''a relations table''', which establishes the correspondence of the identifiers in the lower level with those in the upper level. This file contains two columns: the first one contains the identifiers of the upper level and the second one the identifiers of the lower level.&lt;br /&gt;
&lt;br /&gt;
Each GIA integration generates an output data file containing the quantitative data of the upper level. This file can be used as input for the next GIA integration. The relations table for the next integration, however, needs to be prepared in advance.&lt;br /&gt;
&lt;br /&gt;
Note that SanXoT checks the column position (not the column name). The first row, containing the header, is removed, so the user can use column names freely (as long as one header row is included).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Peptide dataFile.PNG|thumb|320px|Example of data file for peptides. The first column contains peptide identifiers (in this case, amino acid sequences using SEQUEST-style characters for PTM modifications), the second column contains the log2 of the ratio of intensities of two iTRAQ reporter ions, and the third column contains the prior weights (inverse of the local variance of the quantitative value).]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;[[File:Pep2prot relationsTable.PNG|thumb|390px|Example of peptide-to-protein relations table. The first column contains the higher-level identifiers (in this case, the FASTA header of each protein), and the second column contains the lower-level identifiers (in this case, the same sequences used in the data table).]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Preparing the data ==&lt;br /&gt;
&lt;br /&gt;
The data obtained by the software used to obtain the quantitative values must be converted to generate:&lt;br /&gt;
&lt;br /&gt;
# '''an uncalibrated data file''' containing three columns: identifier, quantitative value (base-2 logarithm of the ratio of the two quantitative measurements) and quality parameter. The quality parameter is a number that is used to rank the quantitative values according to their quality (in terms of variance, higher quality meaning a lower variance). In the case of isobaric labeling the third column contains the maximum intensity of the two reporter ions used to calculate the ratio; other kinds of quantification may require other parameters to account for the quality of the quantitative values&amp;lt;ref name=navarro2014/&amp;gt;.&lt;br /&gt;
# '''the relations tables''' of all the GIA integrations planned in the workflow. In the fundamental workflow these tables describe which scans will be integrated into each peptide and which peptides will be integrated into each protein.&lt;br /&gt;
These tables are easy to construct, due to their simple plain-text format. If the software used for quantification generates tab-separated text tables, one of the SanXoT modules, '''[[Aljamia]]''', may be used for this task. In the unit tests for SanXoT we include a workflow that uses Aljamia to generate all the required files from the output of Proteome Discoverer 2.1. Quantitative values produced by the quantification software in other formats may require simple scripts to generate the files required by SanXoT.&lt;br /&gt;
&lt;br /&gt;
== Calibrating the data ==&lt;br /&gt;
Calibration is the process where the quality parameter (the parameter which is used to rank quantitative values according to their variance) is transformed into the prior weight, a parameter with full statistical meaning defined as the inverse of the local variance of the quantitative value in log2-ratio scale. This process is done using the '''[[Klibrate]]''' module and needs to be performed only once in each workflow. For more information about calibration, see exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
Using prior weights is essential in GIA integrations, since these parameters not only reflect the local variances of the measurements, but are also used as statistical weights to calculate the weighted averages. Besides, GIA integrations follow error propagation theory, so the output data file of each integration contains true prior weights and does not need to be recalibrated.&lt;br /&gt;
&lt;br /&gt;
== Performing integrations and removing outliers ==&lt;br /&gt;
Each GIA integration takes the input data file from the lower level and the relations table, calculates the general variance of the integration  (using a robust iterative method) and generates as output a data file containing the integrated quantifications of the upper level. The integration is done using the '''[[SanXoT]]''' module.&lt;br /&gt;
&lt;br /&gt;
In addition to the quantitative data, each GIA integration may generate several additional files which contain information about the integration. More detailed information may be found in exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
One of the advantages of the WSPP model is that it provides a straightforward method to eliminate outliers&amp;lt;ref name=navarro2014/&amp;gt; in each integration step. This is done by calculating standardized log2-ratios (z-values), calculating the probability that each element of the lower level is a significant outlier of the z distribution (e.g., N(0,1)) and calculating the associated FDR. The most extreme outliers may be removed sequentially and the integration repeated, until all outliers below a user-defined FDR-level are removed.&lt;br /&gt;
&lt;br /&gt;
Outlier removal is performed by the '''[[SanXoTSieve]]''' module by tagging the outlier elements in the relations table. &lt;br /&gt;
The entire process is done in three steps:&lt;br /&gt;
&lt;br /&gt;
# a first integration is done with '''[[SanXoT]]''', which serves to calculate the variance&lt;br /&gt;
# outliers are iteratively tagged in the relations table at a user-defined FDR-level using SanXoTSieve&lt;br /&gt;
# a second integration is done with SanXoT using the fixed variance calculated in the first step and removing from the integration the outliers tagged in the relations table.&lt;br /&gt;
&lt;br /&gt;
== Interpreting the results of GIA integrations ==&lt;br /&gt;
GIA integrations basically generate the quantitative data and their associated variances at the upper level, and detect and eliminate outliers that can affect the values at the upper level. However, each integration also generates additional information that may be very useful for interpreting the results. The most important is the general variance of the integration.&lt;br /&gt;
&lt;br /&gt;
The general variance is an estimate of the error introduced when the elements of the lower level are experimentally generated from those of the upper level. This parameter may have different meaning depending on the nature of the data.&lt;br /&gt;
&lt;br /&gt;
When spectra are integrated into peptides, the general variance estimates the error associated with the measurement (independently of peptide and protein manipulation), and hence depends on the LC-MS setup. When peptides are integrated into proteins, the variance estimates the error produced at the time of peptide generation (protein digestion, peptide desalting), independently of the LC-MS analysis. When proteins are integrated into the grand mean, the variance mainly estimates the error produced during protein manipulation (extraction,…). Similarly, integrating technical and biological replicates allows the different error sources at the protein level to be estimated separately. Integration of proteins into categories gives useful insight into protein coordination&amp;lt;ref name=garciamarques2016/&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Building the workflows ==&lt;br /&gt;
Typically, workflows contain three main parts:&lt;br /&gt;
&lt;br /&gt;
# Data parsing to generate the uncalibrated data files and all relations tables needed. This may be performed using '''[[Aljamia]]''' or other scripts provided by the user.&lt;br /&gt;
# Data calibration to generate the (calibrated) data files. This is performed using '''[[Klibrate]]'''.&lt;br /&gt;
# A list of concatenated (or branched) GIA integrations. As explained above, each GIA integration is performed by sequentially running '''[[SanXoT]]''' (to calculate the general variance), '''[[SanXoTSieve]]''' (to tag outliers) and SanXoT again (to integrate without outliers).&lt;br /&gt;
&lt;br /&gt;
Note that some specific GIA integrations require the use of other modules. These include experiment merging and Systems biology analysis. See advanced applications and exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
== Advanced applications ==&lt;br /&gt;
[[File:Fig1 replicateIntegration.PNG|thumb|600px|Integration of technical or biological replicates using the GIA.]]&lt;br /&gt;
SanXoT also allows the GIA to be used to integrate technical or biological replicates. In this case scans are integrated into peptides and then into proteins separately for each replicate, and then the protein-level data are integrated to obtain protein averages (grouped data). A further GIA integration allows calculation of the statistical significance of the averaged protein changes.&lt;br /&gt;
&lt;br /&gt;
Merging of experiments may be performed with the aid of '''[[Cardenio]]''' (or using scripts prepared by the user). Cardenio is used to generate suitable relations tables to make protein averages from technical or biological replicates, and to normalize (&amp;quot;center&amp;quot;) the data to take into account the systematic quantitative error of each experiment before integration.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 SystemsBiologyTriangle.PNG|thumb|250px|The Systems Biology Triangle.]]&lt;br /&gt;
Systems biology analysis can be performed using the SBT model&amp;lt;ref name=garciamarques2016/&amp;gt; by building up a triangular workflow. Proteins are integrated into categories and categories are integrated to provide statistical significance.&lt;br /&gt;
&lt;br /&gt;
Systems biology information can be downloaded, for example, from [https://david.ncifcrf.gov/ DAVID]. The SanXoT Software Package includes the '''[[Camacho]]''' module to convert such DAVID tab-separated files into SanXoT relations files that can be used by the SBT model. Alternatively, users can generate relations files for systems biology (either custom or downloaded) just by creating tab-separated tables with two columns: the first one for categories, and the second one for proteins.&lt;br /&gt;
&lt;br /&gt;
Protein categories can be managed using '''[[SanXoTSqueezer]]''', '''[[Sanson]]''' and '''[[SanXoTGauss]]'''. For more detailed information see '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=608</id>
		<title>How to use SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=608"/>
				<updated>2018-08-13T08:45:28Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Figure 1.png|thumb|400px|Examples of quantitative workflows constructed with modules from the SanXoT package: a '''[[fundamental workflow]]''', with the addition of PTM analysis and Systems Biology (above) and an integration of biological (or technical) replicates (below).]]&lt;br /&gt;
The '''[[SanXoT Software Package]]''' (SSP) allows the statistical analysis of quantitative proteomics data (obtained using other programs such as [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer]), using the WSPP model&amp;lt;ref name=navarro2014&amp;gt;Navarro, P., et al., ''General statistical framework for quantitative proteomics by stable isotope labeling''. J Proteome Res, 2014. 13(3): p. 1234-47.&amp;lt;/ref&amp;gt;. This model analyses relative (rather than absolute) quantitative values, e.g. the ratios between two quantitative measurements.&lt;br /&gt;
&lt;br /&gt;
The SSP contains a set of modules that can be used to assemble user-configurable quantitative workflows with very high flexibility.&lt;br /&gt;
&lt;br /&gt;
In the '''[[fundamental workflow]]''' the quantitative results at the scan level are integrated to quantify each one of the peptides. The quantitative data at the peptide level are then integrated to obtain quantitative protein values. Finally, the quantitative data at the protein level are integrated, providing the statistical significance with which the abundance of each protein is increased or decreased. An example of this basic workflow is provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
Among other possibilities, SanXoT allows building up workflows to automatically process very large numbers of quantitative experiments, to integrate results obtained from technical or biological replicates, or to perform Systems Biology analysis using the SBT model&amp;lt;ref name=garciamarques2016&amp;gt;Garcia-Marques, F., et al., ''A Novel Systems-Biology Algorithm for the Analysis of Coordinated Protein Responses Using Quantitative Proteomics''. Mol Cell Proteomics, 2016. 15(5): p. 1740-60.&amp;lt;/ref&amp;gt;. Examples of these applications are provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== General design ==&lt;br /&gt;
All SanXoT workflows contain two initial steps:&lt;br /&gt;
&lt;br /&gt;
# a first step where the raw quantitative information is extracted from the output of the quantification program and organized according to the '''[[How to use SanXoT#Data structure|data structure]]''' of SanXoT. You can see more about '''[[How to use SanXoT#Preparing the data|how to prepare the data]]''' for SanXoT analysis.&lt;br /&gt;
# a second step where the data are processed to obtain prior weights. This process is called calibration. For more details, check [[How to use SanXoT#Calibrating the data|the section '''Calibrating the data''']].&lt;br /&gt;
&lt;br /&gt;
SanXoT then performs WSPP statistics by iteratively applying the Generic Integration Algorithm (GIA)&amp;lt;ref name=garciamarques2016/&amp;gt; to the calibrated data. Each GIA integration performs independent statistics and its output may be used as input for the following integration step.&lt;br /&gt;
&lt;br /&gt;
In the fundamental workflow the GIA is applied to integrate scan-level data into peptide-level data, to integrate peptide-level data into protein-level data, and finally to integrate the protein-level data.&lt;br /&gt;
&lt;br /&gt;
This modular structure allows users to design their own quantitative workflows by iteratively applying GIA integrations. You can check some examples of advanced applications developed using this modular structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Fig1 calibration.PNG|thumb|Calibration of spectra using '''[[Klibrate]]'''.]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&lt;br /&gt;
[[File:Fig1 lowerLevel-to-higherLevel.PNG|thumb|210px|Integration from any lower level to any higher level, using the GIA.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td colspan=2&amp;gt;[[File:Fig1 fundamentalWorkflow.PNG|thumb|550px|The Fundamental Workflow, which uses the GIA from spectra to proteins.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Data structure ==&lt;br /&gt;
Each GIA integration needs two tab-separated text tables as input&amp;lt;ref name=garciamarques2016/&amp;gt;:&lt;br /&gt;
# '''a data file''' containing the quantitative data of the lower level. This file contains three columns: identifier (a text string that is used to identify the element), quantitative value (base-2 logarithm of the ratio of the two quantitative measurements) and prior weight (a parameter that measures the accuracy of the quantitative value before performing the integration).&lt;br /&gt;
# '''a relations table''', which establishes the correspondence of the identifiers in the lower level with those in the upper level. This file contains two columns: the first one contains the identifiers of the upper level and the second one the identifiers of the lower level.&lt;br /&gt;
&lt;br /&gt;
Each GIA integration generates an output data file containing the quantitative data of the upper level. This file can be used as input for the next GIA integration. The relations table for the next integration, however, needs to be prepared in advance.&lt;br /&gt;
&lt;br /&gt;
Note that SanXoT checks the column position (not the column name). The first row, containing the header, is removed, so the user can use column names freely (as long as one header row is included).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Peptide dataFile.PNG|thumb|320px|Example of data file for peptides. The first column contains peptide identifiers (in this case, amino acid sequences using SEQUEST-style characters for PTM modifications), the second column contains the log2 of the ratio of intensities of two iTRAQ reporter ions, and the third column contains the prior weights (inverse of the local variance of the quantitative value).]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;[[File:Pep2prot relationsTable.PNG|thumb|390px|Example of peptide-to-protein relations table. The first column contains the higher-level identifiers (in this case, the FASTA header of each protein), and the second column contains the lower-level identifiers (in this case, the same sequences used in the data table).]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Preparing the data ==&lt;br /&gt;
&lt;br /&gt;
The output of the software used to obtain the quantitative values must be converted to generate:&lt;br /&gt;
&lt;br /&gt;
# '''an uncalibrated data file''' containing three columns: identifier, quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and quality parameter. The quality parameter is a number that is used to rank the quantitative values according to their quality (in terms of variance, higher quality meaning a lower variance). In the case of isobaric labeling the third column contains the intensity of the most intense of the two reporter ions used to calculate the ratio; other kinds of quantification may need other parameters to account for the quality of the quantitative values&amp;lt;ref name=navarro2014/&amp;gt;.&lt;br /&gt;
# '''the relations tables''' of all the GIA integrations planned in the workflow. In the fundamental workflow these tables describe which scans will be integrated into each peptide and which peptides will be integrated into each protein.&lt;br /&gt;
These tables are easy to construct, due to their simple plain-text format. If the software used for quantification generates tab-separated text tables, one of the SanXoT modules ('''[[Aljamia]]''') may be used for this task. In the unit tests for SanXoT we include a workflow that uses Aljamia to generate all the required files from the output of Proteome Discoverer 2.1. Quantitative values produced in other formats may require simple custom scripts to generate the files required by SanXoT.&lt;br /&gt;
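For the isobaric-labeling case described above, one row of the uncalibrated data file can be derived from a pair of reporter-ion intensities. A sketch (the function and identifier names are hypothetical, not Aljamia's API):

```python
import math

def uncalibrated_row(identifier, intensity_a, intensity_b):
    """Return (id, log2-ratio, quality); the quality parameter is the
    intensity of the most intense of the two reporter ions."""
    xq = math.log2(intensity_a / intensity_b)
    quality = max(intensity_a, intensity_b)
    return (identifier, xq, quality)

row = uncalibrated_row("scan_00123", 2000.0, 1000.0)
# -> ("scan_00123", 1.0, 2000.0)
```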
&lt;br /&gt;
== Calibrating the data ==&lt;br /&gt;
Calibration is the process where the quality parameter (the parameter which is used to rank quantitative values according to their variance) is transformed into the prior weight, a parameter with full statistical meaning defined as the inverse of the local variance of the quantitative value in log2-ratio scale. This process is done using the '''[[Klibrate]]''' module and needs to be performed only once in each workflow. For more information about calibration, see exploring SanXoT features.&lt;br /&gt;
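The idea can be pictured with a toy model (an illustration of the concept only, not Klibrate's actual fitting procedure): assume the prior weight is proportional to the quality parameter, w = k*q, and choose k so that the standardized values z = (x - mean) * sqrt(w) have unit variance.

```python
def calibrate_k(xs, qs):
    """Toy calibration: with w = k*q we have
    var(z) = k * mean(q * (x - mean)^2), so pick the k that
    makes var(z) equal to 1."""
    mean = sum(xs) / len(xs)
    s = sum(q * (x - mean) ** 2 for x, q in zip(xs, qs)) / len(xs)
    return 1.0 / s

xs = [0.1, -0.2, 0.05, -0.1, 0.15]   # log2-ratios
qs = [1000, 800, 1200, 900, 1100]    # quality parameters
k = calibrate_k(xs, qs)
weights = [k * q for q in qs]        # prior weights = inverse local variances
```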
&lt;br /&gt;
Using prior weights is essential in GIA integrations, since these parameters not only reflect the local variances of the measurement, but are also used as statistical weights to calculate the weighted averages. Besides, GIA integrations follow error propagation theory, so that the output data file of each integration contains true prior weights and does not need to be recalibrated.&lt;br /&gt;
&lt;br /&gt;
== Performing integrations and removing outliers ==&lt;br /&gt;
Each GIA integration takes the input data file from the lower level and the relations table, calculates the general variance of the integration (using a robust iterative method) and generates as output a data file containing the integrated quantifications of the upper level. The integration is done using the '''[[SanXoT]]''' module.&lt;br /&gt;
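The core of such a step can be sketched as a weighted average with error propagation. This is a deliberately simplified picture: the general variance term that SanXoT adds to each weight, and the outlier handling, are omitted here.

```python
def integrate(lower, relations):
    """One simplified integration step: each upper-level value is the
    weighted mean of its lower-level values, and its weight is the sum
    of the lower-level weights, following error propagation.

    lower:     {lower_id: (x, w)}
    relations: [(upper_id, lower_id), ...]
    """
    groups = {}
    for upper, low in relations:
        groups.setdefault(upper, []).append(lower[low])
    result = {}
    for upper, members in groups.items():
        wsum = sum(w for _, w in members)
        xbar = sum(w * x for x, w in members) / wsum
        result[upper] = (xbar, wsum)
    return result
```

The output dictionary has the same shape as the input, so it can feed the next integration directly, as described above.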
&lt;br /&gt;
In addition to the quantitative data, each GIA integration may generate several additional files which contain information about the integration. More detailed information may be found in exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
One of the advantages of the WSPP model is that it provides a straightforward method to eliminate outliers&amp;lt;ref name=navarro2014/&amp;gt; in each integration step. This is done by calculating standardized log2-ratios (z-values), calculating the probability that each element of the lower level is a significant outlier of the z distribution (i.e., N(0,1)) and calculating the associated FDR. The most extreme outliers may be removed sequentially and the integration repeated, until all outliers below a user-defined FDR level are removed.&lt;br /&gt;
&lt;br /&gt;
Outlier removal is performed by the '''[[SanXoTSieve]]''' module, which tags the outlier elements in the relations table.&lt;br /&gt;
The entire process is done in three steps:&lt;br /&gt;
&lt;br /&gt;
# a first integration is done with '''[[SanXoT]]''', which serves to calculate the variance&lt;br /&gt;
# outliers are iteratively tagged in the relations table at a user-defined FDR-level using SanXoTSieve&lt;br /&gt;
# a second integration is done with SanXoT using the fixed variance calculated in the first step and removing from the integration the outliers tagged in the relations table.&lt;br /&gt;
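Step 2 can be illustrated with a standard Benjamini-Hochberg pass over the z-values. This is a sketch of the idea only; SanXoTSieve's exact iterative procedure differs.

```python
import math

def two_tailed_p(z):
    """P(|Z| >= |z|) under the standard normal N(0,1)."""
    return math.erfc(abs(z) / math.sqrt(2))

def tag_outliers(zs, fdr=0.01):
    """Return the indices of z-values tagged as outliers at the given
    FDR, using a Benjamini-Hochberg threshold on two-tailed p-values."""
    ranked = sorted((two_tailed_p(z), i) for i, z in enumerate(zs))
    n = len(ranked)
    cutoff = 0
    for rank, (p, _) in enumerate(ranked, start=1):
        if p <= fdr * rank / n:
            cutoff = rank
    return {i for rank, (_, i) in enumerate(ranked, start=1) if rank <= cutoff}
```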
&lt;br /&gt;
== Interpreting the results of GIA integrations ==&lt;br /&gt;
GIA integrations basically generate the quantitative data and their associated variances at the upper level, and detect and eliminate outliers that can affect the values at the upper level. However, each integration also generates additional information that may be very useful for interpreting the results. The most important is the general variance of the integration.&lt;br /&gt;
&lt;br /&gt;
The general variance is an estimate of the error introduced when the elements of the lower level are experimentally generated from those of the upper level. This parameter may have a different meaning depending on the nature of the data.&lt;br /&gt;
&lt;br /&gt;
When spectra are integrated into peptides, the general variance estimates the error associated with the measurement (independently of peptide and protein manipulation), and hence depends on the LC-MS setup. When peptides are integrated into proteins, the variance estimates the error produced at the time of peptide generation (protein digestion, peptide desalting), independently of LC-MS analysis. When proteins are integrated into the grand mean, the variance mainly estimates the error produced during protein manipulation (extraction,…). Similarly, integrating technical and biological replicates allows the different error sources at the protein level to be estimated separately. Integration of proteins into categories gives a useful insight into protein coordination&amp;lt;ref name=garciamarques2016/&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Building the workflows ==&lt;br /&gt;
Typically, workflows contain three main parts:&lt;br /&gt;
&lt;br /&gt;
# Data parsing to generate the uncalibrated data files and all relations tables needed. This may be performed using '''[[Aljamia]]''' or other scripts provided by the user.&lt;br /&gt;
# Data calibration to generate the (calibrated) data files. This is performed using '''[[Klibrate]]'''.&lt;br /&gt;
# A list of concatenated (or branched) GIA integrations. As explained above, each GIA is performed by sequentially running '''[[SanXoT]]''' (to calculate the general variance), '''[[SanXoTSieve]]''' (to tag outliers) and SanXoT (to integrate without outliers).&lt;br /&gt;
&lt;br /&gt;
Note that some specific GIA integrations require the use of other modules. These include experiment merging and Systems biology analysis. See advanced applications and exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
== Advanced applications ==&lt;br /&gt;
[[File:Fig1 replicateIntegration.PNG|thumb|600px|Integration of technical or biological replicates using the GIA.]]&lt;br /&gt;
SanXoT also allows using the GIA to integrate technical or biological replicates. In this case, scans are integrated into peptides and then into proteins separately for each replicate, and the protein-level data are then integrated to obtain protein averages (grouped data). A further GIA integration allows calculation of the statistical significance of the averaged protein changes.&lt;br /&gt;
&lt;br /&gt;
Merging of experiments may be performed with the aid of '''[[Cardenio]]''' (or using scripts prepared by the user). Cardenio is used to generate suitable relations tables to make protein averages from technical or biological replicates, and to normalize (&amp;quot;center&amp;quot;) the data to take into account the systematic quantitative error of each experiment before integration.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 SystemsBiologyTriangle.PNG|thumb|250px|The Systems Biology Triangle.]]&lt;br /&gt;
Systems biology analysis can be performed using the SBT model&amp;lt;ref name=garciamarques2016/&amp;gt; by building up a triangular workflow. Proteins are integrated into categories and categories are integrated to provide statistical significance.&lt;br /&gt;
&lt;br /&gt;
Systems biology information can be downloaded, for example, from [https://david.ncifcrf.gov/ DAVID]. The SanXoT Software Package includes the '''[[Camacho]]''' module to convert such DAVID tab-separated files into SanXoT relations files that can be used by the SBT model. Alternatively, users can generate relations files for systems biology (either custom or downloaded) just by creating tab-separated tables with two columns: the first one for categories, and the second one for proteins.&lt;br /&gt;
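For the hand-made route, flattening a category-to-protein-list table into that two-column format is straightforward. The input layout below is hypothetical (real DAVID exports differ, which is what Camacho handles):

```python
def to_relations(rows):
    """Flatten (category, 'P1, P2, ...') rows into the two-column
    relations format: category first, protein second."""
    out = []
    for category, proteins in rows:
        for prot in proteins.split(","):
            out.append((category, prot.strip()))
    return out

# Hypothetical category rows with comma-separated protein lists.
david_like = [
    ("GO:0006096~glycolysis", "P04406, P00558"),
    ("GO:0006119~oxidative phosphorylation", "P25705"),
]
relations = to_relations(david_like)
```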
&lt;br /&gt;
Protein categories can be managed using '''[[SanXoTSqueezer]]''', '''[[Sanson]]''' and '''[[SanXoTGauss]]'''. For more detailed information see '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=607</id>
		<title>How to use SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=607"/>
				<updated>2018-08-13T08:33:44Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Figure 1.png|thumb|400px|Examples of quantitative workflows constructed with modules from the SanXoT package: a '''[[fundamental workflow]]''', with the addition of PTM analysis and Systems Biology (above) and an integration of biological (or technical) replicates (below).]]&lt;br /&gt;
The '''[[SanXoT Software Package]]''' (SSP) allows the statistical analysis of quantitative proteomics data (obtained using other programs such as [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer]), using the WSPP model&amp;lt;ref name=navarro2014&amp;gt;Navarro, P., et al., ''General statistical framework for quantitative proteomics by stable isotope labeling''. J Proteome Res, 2014. 13(3): p. 1234-47.&amp;lt;/ref&amp;gt;. This model analyses relative (rather than absolute) quantitative values, e.g. the ratios between two quantitative measurements.&lt;br /&gt;
&lt;br /&gt;
The SSP contains a set of modules that can be used to assemble user-configurable quantitative workflows with very high flexibility.&lt;br /&gt;
&lt;br /&gt;
In the '''[[fundamental workflow]]''' the quantitative results at the scan level are integrated to quantify each one of the peptides. The quantitative data at the peptide level are then integrated to obtain quantitative protein values. Finally, the quantitative data at the protein level are integrated, providing the statistical significance of the increase or decrease in the abundance of each one of the proteins. An example of this basic workflow is provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
Among other possibilities, SanXoT allows building up workflows to automatically process very large numbers of quantitative experiments, to integrate results obtained from technical or biological replicates, or to perform Systems Biology analysis using the SBT model&amp;lt;ref name=garciamarques2016&amp;gt;Garcia-Marques, F., et al., ''A Novel Systems-Biology Algorithm for the Analysis of Coordinated Protein Responses Using Quantitative Proteomics''. Mol Cell Proteomics, 2016. 15(5): p. 1740-60.&amp;lt;/ref&amp;gt;. Examples of these applications are provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== General design ==&lt;br /&gt;
All SanXoT workflows contain two initial steps:&lt;br /&gt;
&lt;br /&gt;
# a first step where the raw quantitative information is extracted from the output of the quantification program and organized according to the '''[[How to use SanXoT#Data structure|data structure]]''' of SanXoT. You can see more about '''[[How to use SanXoT#preparing the data|how to prepare the data]]''' for SanXoT analysis.&lt;br /&gt;
# a second step where the data are processed to obtain prior weights. This process is called calibration. For more details, check [[How to use SanXoT#Calibrating the data|the section '''Calibrating the data''']].&lt;br /&gt;
&lt;br /&gt;
SanXoT then performs WSPP statistics by iteratively applying the Generic Integration Algorithm (GIA)&amp;lt;ref name=garciamarques2016/&amp;gt; on the calibrated data. Each GIA integration performs independent statistics, and its output may be used as input for the following integration step.&lt;br /&gt;
&lt;br /&gt;
In the fundamental workflow the GIA is applied to integrate scan-level data to peptide-level data, to integrate peptide-level data to the protein level, and finally to integrate protein-level data.&lt;br /&gt;
&lt;br /&gt;
This modular structure allows users to design their own quantitative workflows by applying the GIA integrations iteratively. You can check some examples of advanced applications developed using this modular structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Fig1 calibration.PNG|thumb|Calibration of spectra using '''[[Klibrate]]'''.]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&lt;br /&gt;
[[File:Fig1 lowerLevel-to-higherLevel.PNG|thumb|210px|Integration from any lower level to any higher level, using the GIA.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td colspan=2&amp;gt;[[File:Fig1 fundamentalWorkflow.PNG|thumb|550px|The Fundamental Workflow, which uses the GIA from spectra to proteins.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Data structure ==&lt;br /&gt;
Each GIA integration needs as input tab-separated text tables&amp;lt;ref name=garciamarques2016/&amp;gt;: &lt;br /&gt;
# '''a data file''' containing the quantitative data of the lower level. This file contains three columns: identifier (a text string that is used to identify the element), quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and prior weight (a parameter that measures the accuracy of the quantitative value before performing the integration).&lt;br /&gt;
# '''a relations table''', which establishes the correspondence of the identifiers in the lower level with those in the upper level. This file contains two columns: the first one contains the identifiers of the upper level and the second one the identifiers of the lower level.&lt;br /&gt;
&lt;br /&gt;
Each GIA integration generates an output data file containing the quantitative data of the upper level. This file can be used as input for the next GIA integration. The relations table for the next integration, however, needs to be prepared in advance.&lt;br /&gt;
&lt;br /&gt;
Note that SanXoT checks the column number (not the column name). The first row, containing the header, is removed, so the user can use column names freely (as long as one header row is included).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Peptide dataFile.PNG|thumb|320px|Example of data file for peptides. The first column contains peptide identifiers (in this case, amino acid sequences using SEQUEST-style characters for PTM modifications), the second the log2(A/B) of the intensities of two iTRAQ reporter ions, and the third the weight of each quantification (in this case max(A,B), i.e. the maximum intensity of the two reporter ions being compared).]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;[[File:Pep2prot relationsTable.PNG|thumb|390px|Example of peptide-to-protein relations table. The first column contains the higher-level identifiers (in this case, the FASTA header of each protein), and the second the lower-level identifiers (in this case, the same sequences used in the data table).]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Preparing the data ==&lt;br /&gt;
&lt;br /&gt;
The output of the software used to obtain the quantitative values must be converted to generate:&lt;br /&gt;
&lt;br /&gt;
# '''an uncalibrated data file''' containing three columns: identifier, quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and quality parameter. The quality parameter is a number that is used to rank the quantitative values according to their quality (in terms of variance, higher quality meaning a lower variance). In the case of isobaric labeling the third column contains the intensity of the most intense of the two reporter ions used to calculate the ratio; other kinds of quantification may need other parameters to account for the quality of the quantitative values&amp;lt;ref name=navarro2014/&amp;gt;.&lt;br /&gt;
# '''the relations tables''' of all the GIA integrations planned in the workflow. In the fundamental workflow these tables describe which scans will be integrated into each peptide and which peptides will be integrated into each protein.&lt;br /&gt;
These tables are easy to construct, due to their simple plain-text format. If the software used for quantification generates tab-separated text tables, one of the SanXoT modules ('''[[Aljamia]]''') may be used for this task. In the unit tests for SanXoT we include a workflow that uses Aljamia to generate all the required files from the output of Proteome Discoverer 2.1. Quantitative values produced in other formats may require simple custom scripts to generate the files required by SanXoT.&lt;br /&gt;
&lt;br /&gt;
== Calibrating the data ==&lt;br /&gt;
Calibration is the process where the quality parameter (the parameter which is used to rank quantitative values according to their variance) is transformed into the prior weight, a parameter with full statistical meaning defined as the inverse of the local variance of the quantitative value in log2-ratio scale. This process is done using the '''[[Klibrate]]''' module and needs to be performed only once in each workflow. For more information about calibration, see exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
Using prior weights is essential in GIA integrations, since these parameters not only reflect the local variances of the measurement, but are also used as statistical weights to calculate the weighted averages. Besides, GIA integrations follow error propagation theory, so that the output data file of each integration contains true prior weights and does not need to be recalibrated.&lt;br /&gt;
&lt;br /&gt;
== Performing integrations and removing outliers ==&lt;br /&gt;
Each GIA integration takes the input data file from the lower level and the relations table, calculates the general variance of the integration (using a robust iterative method) and generates as output a data file containing the integrated quantifications of the upper level. The integration is done using the '''[[SanXoT]]''' module.&lt;br /&gt;
&lt;br /&gt;
In addition to the quantitative data, each GIA integration may generate several additional files which contain information about the integration. More detailed information may be found in exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
One of the advantages of the WSPP model is that it provides a straightforward method to eliminate outliers&amp;lt;ref name=navarro2014/&amp;gt; in each integration step. This is done by calculating standardized log2-ratios (z-values), calculating the probability that each element of the lower level is a significant outlier of the z distribution (i.e., N(0,1)) and calculating the associated FDR. The most extreme outliers may be removed sequentially and the integration repeated, until all outliers below a user-defined FDR level are removed.&lt;br /&gt;
&lt;br /&gt;
Outlier removal is performed by the '''[[SanXoTSieve]]''' module, which tags the outlier elements in the relations table.&lt;br /&gt;
The entire process is done in three steps:&lt;br /&gt;
&lt;br /&gt;
# a first integration is done with '''[[SanXoT]]''', which serves to calculate the variance&lt;br /&gt;
# outliers are iteratively tagged in the relations table at a user-defined FDR-level using SanXoTSieve&lt;br /&gt;
# a second integration is done with SanXoT using the fixed variance calculated in the first step and removing from the integration the outliers tagged in the relations table.&lt;br /&gt;
&lt;br /&gt;
== Interpreting the results of GIA integrations ==&lt;br /&gt;
GIA integrations basically generate the quantitative data and their associated variances at the upper level, and detect and eliminate outliers that can affect the values at the upper level. However, each integration also generates additional information that may be very useful for interpreting the results. The most important is the general variance of the integration.&lt;br /&gt;
&lt;br /&gt;
The general variance is an estimate of the error introduced when the elements of the lower level are experimentally generated from those of the upper level. This parameter may have a different meaning depending on the nature of the data.&lt;br /&gt;
&lt;br /&gt;
When spectra are integrated into peptides, the general variance estimates the error associated with the measurement (independently of peptide and protein manipulation), and hence depends on the LC-MS setup. When peptides are integrated into proteins, the variance estimates the error produced at the time of peptide generation (protein digestion, peptide desalting), independently of LC-MS analysis. When proteins are integrated into the grand mean, the variance mainly estimates the error produced during protein manipulation (extraction,…). Similarly, integrating technical and biological replicates allows the different error sources at the protein level to be estimated separately. Integration of proteins into categories gives a useful insight into protein coordination&amp;lt;ref name=garciamarques2016/&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Building the workflows ==&lt;br /&gt;
Typically, workflows contain three main parts:&lt;br /&gt;
&lt;br /&gt;
# Data parsing to generate the uncalibrated data files and all relations tables needed. This may be performed using '''[[Aljamia]]''' or other scripts provided by the user.&lt;br /&gt;
# Data calibration to generate the (calibrated) data files. This is performed using '''[[Klibrate]]'''.&lt;br /&gt;
# A list of concatenated (or branched) GIA integrations. As explained above, each GIA is performed by sequentially running '''[[SanXoT]]''' (to calculate the general variance), '''[[SanXoTSieve]]''' (to tag outliers) and SanXoT (to integrate without outliers).&lt;br /&gt;
&lt;br /&gt;
Note that some specific GIA integrations require the use of other modules. These include experiment merging and Systems biology analysis. See advanced applications and exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
== Advanced applications ==&lt;br /&gt;
[[File:Fig1 replicateIntegration.PNG|thumb|600px|Integration of technical or biological replicates using the GIA.]]&lt;br /&gt;
SanXoT also allows using the GIA to integrate technical or biological replicates. In this case, scans are integrated into peptides and then into proteins separately for each replicate, and the protein-level data are then integrated to obtain protein averages (grouped data). A further GIA integration allows calculation of the statistical significance of the averaged protein changes.&lt;br /&gt;
&lt;br /&gt;
Merging of experiments may be performed with the aid of '''[[Cardenio]]''' (or using scripts prepared by the user). Cardenio is used to generate suitable relations tables to make protein averages from technical or biological replicates, and to normalize (&amp;quot;center&amp;quot;) the data to take into account the systematic quantitative error of each experiment before integration.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 SystemsBiologyTriangle.PNG|thumb|250px|The Systems Biology Triangle.]]&lt;br /&gt;
Systems biology analysis can be performed using the SBT model&amp;lt;ref name=garciamarques2016/&amp;gt; by building up a triangular workflow. Proteins are integrated into categories and categories are integrated to provide statistical significance.&lt;br /&gt;
&lt;br /&gt;
Systems biology information can be downloaded, for example, from [https://david.ncifcrf.gov/ DAVID]. The SanXoT Software Package includes the '''[[Camacho]]''' module to convert such DAVID tab-separated files into SanXoT relations files that can be used by the SBT model. Alternatively, users can generate relations files for systems biology (either custom or downloaded) just by creating tab-separated tables with two columns: the first one for categories, and the second one for proteins.&lt;br /&gt;
&lt;br /&gt;
Protein categories can be managed using '''[[SanXoTSqueezer]]''', '''[[Sanson]]''' and '''[[SanXoTGauss]]'''. For more detailed information see '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=606</id>
		<title>How to use SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=606"/>
				<updated>2018-08-10T10:18:27Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: /* Advanced applications */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Figure 1.png|thumb|400px|Examples of quantitative workflows constructed with modules from the SanXoT package: a '''[[fundamental workflow]]''' including PTM analysis and Systems Biology (above) and an integration of biological (or technical) replicates (below).]]&lt;br /&gt;
The '''[[SanXoT Software Package]]''' (SSP) allows the statistical analysis of quantitative proteomics data (obtained using other programs such as [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer]), using the WSPP model&amp;lt;ref name=navarro2014&amp;gt;Navarro, P., et al., ''General statistical framework for quantitative proteomics by stable isotope labeling''. J Proteome Res, 2014. 13(3): p. 1234-47.&amp;lt;/ref&amp;gt;. This model analyses relative (rather than absolute) quantitative values, e.g. the ratios between two quantitative measurements.&lt;br /&gt;
&lt;br /&gt;
The SSP contains a set of modules that can be used to assemble user-configurable quantitative workflows with very high flexibility.&lt;br /&gt;
&lt;br /&gt;
In the '''[[fundamental workflow]]''' the quantitative results at the scan level are integrated to quantify each one of the peptides. The quantitative data at the peptide level are then integrated to obtain quantitative protein values. Finally, the quantitative data at the protein level are integrated, providing the statistical significance of the increase or decrease in the abundance of each one of the proteins. An example of this basic workflow is provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
Among other possibilities, SanXoT allows building up workflows to automatically process very large numbers of quantitative experiments, to integrate results obtained from technical or biological replicates, or to perform Systems Biology analysis using the SBT model&amp;lt;ref name=garciamarques2016&amp;gt;Garcia-Marques, F., et al., ''A Novel Systems-Biology Algorithm for the Analysis of Coordinated Protein Responses Using Quantitative Proteomics''. Mol Cell Proteomics, 2016. 15(5): p. 1740-60.&amp;lt;/ref&amp;gt;. Examples of these applications are provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== General design ==&lt;br /&gt;
All SanXoT workflows contain two initial steps:&lt;br /&gt;
&lt;br /&gt;
# a first step where the raw quantitative information is extracted from the output of the quantification program and organized according to the '''[[How to use SanXoT#Data structure|data structure]]''' of SanXoT. You can see more about '''[[How to use SanXoT#preparing the data|how to prepare the data]]''' for SanXoT analysis.&lt;br /&gt;
# a second step where the raw quantitative information is processed to obtain prior weights. This process is called calibration. For more details, check [[How to use SanXoT#Calibrating the data|the section '''Calibrating the data''']].&lt;br /&gt;
&lt;br /&gt;
SanXoT then performs WSPP-statistics by applying iteratively the Generic Integration Algorithm (GIA)&amp;lt;ref name=garciamarques2016/&amp;gt; on the calibrated data. Each GIA integration performs an independent statistics and its output may be used as input for the following integration step.&lt;br /&gt;
&lt;br /&gt;
In the fundamental workflow the GIA is applied to integrate scan-level data to peptide-level data, to integrate peptide-level data to the protein level, and finally to integrate protein-level data.&lt;br /&gt;
&lt;br /&gt;
This modular structure allows users to design their own quantitative workflows by applying GIA integrations iteratively. You can check some examples of advanced applications developed using this modular structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Fig1 calibration.PNG|thumb|Calibration of spectra using '''[[Klibrate]]'''.]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&lt;br /&gt;
[[File:Fig1 lowerLevel-to-higherLevel.PNG|thumb|210px|Integration from any lower level to any higher level, using the GIA.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td colspan=2&amp;gt;[[File:Fig1 fundamentalWorkflow.PNG|thumb|550px|The Fundamental Workflow, which uses the GIA from spectra to proteins.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Data structure ==&lt;br /&gt;
Each GIA integration needs as input tab-separated text tables&amp;lt;ref name=garciamarques2016/&amp;gt;: &lt;br /&gt;
# '''a data file''' containing the quantitative data of the lower level. This file contains three columns: identifier (a text string that is used to identify the element), quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and prior weight (a parameter that measures the accuracy of the quantitative value before performing the integration).&lt;br /&gt;
# '''a relations table''', which establishes the correspondence of the identifiers in the lower level with those in the upper level. This file contains two columns: the first one contains the identifiers of the upper level and the second one the identifiers of the lower level.&lt;br /&gt;
&lt;br /&gt;
Each GIA integration generates an output data file containing the quantitative data of the upper level. This file can be used as input for the next GIA integration. The relations table for the next integration, however, needs to be prepared in advance.&lt;br /&gt;
&lt;br /&gt;
Note that SanXoT checks the column number (not the column name). The first row, containing the header, is removed, so the user can use column names freely (as long as one header row is included).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Peptide dataFile.PNG|thumb|320px|Example of data file for peptides. The first column contains peptide identifiers (in this case, amino acid sequences using SEQUEST-style characters for PTM modifications), the second the log2(A/B) of the intensities of two iTRAQ reporter ions, and the third the weight of each quantification (in this case max(A,B), i.e. the maximum intensity of the two reporter ions being compared).]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;[[File:Pep2prot relationsTable.PNG|thumb|390px|Example of peptide-to-protein relations table. The first column contains the higher-level identifiers (in this case, the FASTA header of each protein), and the second the lower-level identifiers (in this case, the same sequences used in the data table).]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
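The two input tables described above can be produced with any tool that writes tab-separated text. A minimal sketch in Python (file names and column headers here are hypothetical; SanXoT only relies on column order):&lt;br /&gt;

```python
import csv

def write_data_file(path, rows):
    """Write a SanXoT-style data file.
    rows: (identifier, log2_ratio, prior_weight) tuples."""
    with open(path, "w", newline="") as f:
        w = csv.writer(f, delimiter="\t")
        w.writerow(["id", "Xi", "Vi"])  # exactly one free-form header row
        w.writerows(rows)

def write_relations_table(path, pairs):
    """Write a relations table.
    pairs: (upper_level_id, lower_level_id) tuples."""
    with open(path, "w", newline="") as f:
        w = csv.writer(f, delimiter="\t")
        w.writerow(["higher", "lower"])
        w.writerows(pairs)

# Hypothetical example: one peptide related to one protein.
write_data_file("peptides.tsv", [("PEPTIDEK", -0.42, 35000.0)])
write_relations_table("pep2prot.tsv", [("sp|P12345|EXAMPLE", "PEPTIDEK")])
```

Any spreadsheet or script that emits the same tab-separated layout works equally well.&lt;br /&gt;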
== Preparing the data ==&lt;br /&gt;
&lt;br /&gt;
The output of the software used to obtain the quantitative values must be converted to generate:&lt;br /&gt;
&lt;br /&gt;
# '''an uncalibrated data file''' containing three columns: identifier, quantitative value (base-2 logarithm of the ratio of the two quantitative measurements) and quality parameter. The quality parameter is a number that is used to rank the quantitative values according to their quality (in terms of variance, higher quality meaning a lower variance). In the case of isobaric labeling, the third column contains the intensity of the most intense of the two reporter ions used to calculate the ratio; other kinds of quantification may need other parameters to account for the quality of the quantitative values&amp;lt;ref name=navarro2014/&amp;gt;.&lt;br /&gt;
# '''the relations tables''' of all the GIA integrations planned in the workflow. In the fundamental workflow these tables describe which scans will be integrated into each peptide and which peptides will be integrated into each protein.&lt;br /&gt;
These tables are easy to construct, due to their simple plain-text format. If the software used for quantification generates tab-separated text tables, one of the SanXoT modules ('''[[Aljamia]]''') may be used for this task. In the unit tests for SanXoT we include a workflow that uses Aljamia to generate all the required files from the output of Proteome Discoverer 2.1. Quantitative values produced in other formats may require other simple scripts to generate the files required by SanXoT.&lt;br /&gt;
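For isobaric labeling, one row of the uncalibrated data file can be derived from a pair of reporter-ion intensities as sketched below (the function name is hypothetical; the formulas follow the description above):&lt;br /&gt;

```python
import math

def uncalibrated_row(identifier, a, b):
    """Build one row of the uncalibrated data file from two reporter-ion
    intensities A and B: the base-2 log-ratio, plus the quality parameter
    (here, the intensity of the most intense of the two reporter ions)."""
    return identifier, math.log2(a / b), max(a, b)

row = uncalibrated_row("PEPTIDEK", 40000.0, 10000.0)
# row == ("PEPTIDEK", 2.0, 40000.0), since log2(40000/10000) = 2
```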
&lt;br /&gt;
== Calibrating the data ==&lt;br /&gt;
Calibration is the process whereby the quality parameter (the parameter used to rank quantitative values according to their variance) is transformed into the prior weight, a parameter with full statistical meaning, defined as the inverse of the local variance of the quantitative value in log2-ratio scale. This process is done using the '''[[Klibrate]]''' module and needs to be performed only once in each workflow. For more information about calibration, see '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
Using prior weights is essential in GIA integrations, since these parameters not only reflect the local variances of the measurements, but are also used as statistical weights to calculate the weighted averages. Moreover, GIA integrations follow error propagation theory, so the output data file of each integration contains true prior weights and does not need to be recalibrated.&lt;br /&gt;
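To illustrate the definition (prior weight = inverse of the local variance of the log2-ratio), here is a toy sketch, not Klibrate's actual algorithm: the variance is estimated over a small window of values with similar quality, then inverted:&lt;br /&gt;

```python
import statistics

def prior_weights(quality_ranked_log2_ratios, window=5):
    """Toy calibration sketch (not Klibrate's real procedure): for each
    log2-ratio, estimate the local variance from a window of neighbours
    with similar quality, and take its inverse as the prior weight."""
    xs = quality_ranked_log2_ratios
    weights = []
    for i in range(len(xs)):
        lo = max(0, i - window // 2)
        hi = min(len(xs), lo + window)
        local_var = statistics.pvariance(xs[lo:hi])
        weights.append(1.0 / local_var)  # weight = 1 / local variance
    return weights
```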
&lt;br /&gt;
== Performing integrations and removing outliers ==&lt;br /&gt;
Each GIA integration takes the input data file from the lower level and the relations table, calculates the general variance of the integration (using a robust iterative method) and generates as output a data file containing the integrated quantifications of the upper level. The integration is done using the '''[[SanXoT]]''' module.&lt;br /&gt;
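Conceptually, one integration step reduces to a weighted average whose weight accumulates according to error propagation. A minimal sketch (ignoring the general variance and the outlier handling that the real SanXoT module adds):&lt;br /&gt;

```python
from collections import defaultdict

def integrate(data, relations):
    """Minimal sketch of one integration step.
    data: {lower_id: (x, w)} with x the log2-ratio and w the weight.
    relations: iterable of (upper_id, lower_id) pairs.
    Lower-level values are combined into a weighted average, and the
    weights add up, following error propagation."""
    grouped = defaultdict(list)
    for upper, lower in relations:
        if lower in data:
            grouped[upper].append(data[lower])
    result = {}
    for upper, values in grouped.items():
        w_total = sum(w for _, w in values)
        x_mean = sum(x * w for x, w in values) / w_total
        result[upper] = (x_mean, w_total)
    return result
```

For example, two peptides with values (1.0, weight 1.0) and (3.0, weight 3.0) related to one protein yield the protein value (2.5, weight 4.0).&lt;br /&gt;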
&lt;br /&gt;
In addition to the quantitative data, each GIA integration may generate several additional files which contain information about the integration. More detailed information may be found in '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
One of the advantages of the WSPP model is that it provides a straightforward method to eliminate outliers&amp;lt;ref name=navarro2014/&amp;gt; in each integration step. This is done by calculating standardized log2-ratios (z-values), calculating the probability that each element of the lower level is a significant outlier of the z distribution (e.g., N(0,1)) and calculating the associated FDR. The most extreme outliers may be removed sequentially and the integration repeated, until all outliers below a user-defined FDR-level are removed.&lt;br /&gt;
&lt;br /&gt;
Outlier removal is performed by the '''[[SanXoTSieve]]''' module by tagging the outlier elements in the relations table. &lt;br /&gt;
The entire process is done in three steps:&lt;br /&gt;
&lt;br /&gt;
# a first integration is done with '''[[SanXoT]]''', which serves to calculate the variance&lt;br /&gt;
# outliers are iteratively tagged in the relations table at a user-defined FDR-level using SanXoTSieve&lt;br /&gt;
# a second integration is done with SanXoT using the fixed variance calculated in the first step and removing from the integration the outliers tagged in the relations table.&lt;br /&gt;
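The tagging idea in step 2 can be sketched as follows (a simplified stand-in, not SanXoTSieve's exact procedure): two-sided p-values are computed from the N(0,1) z-values and thresholded with Benjamini-Hochberg at the chosen FDR:&lt;br /&gt;

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tag_outliers(z_values, fdr=0.01):
    """Sketch of outlier tagging (not SanXoTSieve's exact algorithm):
    two-sided p-values from N(0,1) z-values, Benjamini-Hochberg control
    at the chosen FDR, returning the indices flagged as outliers."""
    pvals = [2.0 * (1.0 - phi(abs(z))) for z in z_values]
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    # Largest rank k whose p-value passes the BH threshold k*fdr/n.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * fdr / n:
            k_max = rank
    # All elements up to that rank are tagged as outliers.
    return {i for rank, i in enumerate(order, start=1) if rank <= k_max}

# A z-value of 8 is an extreme outlier of N(0,1); the others are not.
outliers = tag_outliers([0.1, -0.3, 0.2, 8.0], fdr=0.05)
```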
&lt;br /&gt;
== Interpreting the results of GIA integrations ==&lt;br /&gt;
GIA integrations basically generate the quantitative data and their associated variances at the upper level, and detect and eliminate outliers that could affect the values at the upper level. However, each integration also generates additional information that may be very useful for interpreting the results. The most important is the general variance of the integration.&lt;br /&gt;
&lt;br /&gt;
The general variance is an estimate of the error introduced when the elements of the lower level are experimentally generated from those of the upper level. This parameter may have different meaning depending on the nature of the data.&lt;br /&gt;
&lt;br /&gt;
When spectra are integrated into peptides, the general variance estimates the error associated with the measurement (independently of peptide and protein manipulation), and hence depends on the LC-MS setup. When peptides are integrated into proteins, the variance estimates the error produced at the time of peptide generation (protein digestion, peptide desalting), independently of the LC-MS analysis. When proteins are integrated into the grand mean, the variance mainly estimates the error produced during protein manipulation (extraction,…). Similarly, integrating technical and biological replicates allows the different error sources at the protein level to be estimated separately. Integration of proteins into categories gives a useful insight into protein coordination&amp;lt;ref name=garciamarques2016/&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Building the workflows ==&lt;br /&gt;
Typically, workflows contain three main parts:&lt;br /&gt;
&lt;br /&gt;
# Data parsing to generate the uncalibrated data files and all relations tables needed. This may be performed using '''[[Aljamia]]''' or other scripts provided by the user.&lt;br /&gt;
# Data calibration to generate the (calibrated) data files. This is performed using '''[[Klibrate]]'''.&lt;br /&gt;
# A list of concatenated (or branched) GIA integrations. As explained above, each GIA is performed by running sequentially '''[[SanXoT]]''' (to calculate the general variance), '''[[SanXoTSieve]]''' (to tag outliers) and SanXoT again (to integrate without outliers).&lt;br /&gt;
&lt;br /&gt;
Note that some specific GIA integrations require the use of other modules. These include experiment merging and Systems Biology analysis. See advanced applications and '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
== Advanced applications ==&lt;br /&gt;
[[File:Fig1 replicateIntegration.PNG|thumb|600px|Integration of technical or biological replicates using the GIA.]]&lt;br /&gt;
SanXoT also allows using the GIA to integrate technical or biological replicates. In this case, scans are integrated into peptides and then into proteins separately for each replicate, and the protein-level data are then integrated to obtain protein averages (grouped data). A further GIA integration allows calculation of the statistical significance of the averaged protein changes.&lt;br /&gt;
&lt;br /&gt;
Merging of experiments may be performed with the aid of '''[[Cardenio]]''' (or using scripts prepared by the user). Cardenio is used to generate suitable relations tables to make protein averages from technical or biological replicates, and to normalize (&amp;quot;center&amp;quot;) the data to take into account the systematic quantitative error of each experiment before integration.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 SystemsBiologyTriangle.PNG|thumb|250px|The Systems Biology Triangle.]]&lt;br /&gt;
Systems biology analysis can be performed using the SBT model&amp;lt;ref name=garciamarques2016/&amp;gt; by building up a triangular workflow. Proteins are integrated into categories and categories are integrated to provide statistical significance.&lt;br /&gt;
&lt;br /&gt;
Systems biology information can be downloaded, for example, from [https://david.ncifcrf.gov/ DAVID]. The SanXoT Software Package includes the '''[[Camacho]]''' module to convert such DAVID tab-separated files into SanXoT relations files that can be used by the SBT model. Alternatively, users can generate relations files for systems biology (either custom or downloaded) just by creating tab-separated tables with two columns: the first one for categories, and the second one for proteins.&lt;br /&gt;
&lt;br /&gt;
Protein categories can be managed using '''[[SanXoTSqueezer]]''', '''[[Sanson]]''' and '''[[SanXoTGauss]]'''. For more detailed information see '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=605</id>
		<title>How to use SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=605"/>
				<updated>2018-08-10T10:18:12Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: /* Advanced applications */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Figure 1.png|thumb|400px|Examples of quantitative workflows constructed with modules from the SanXoT package: a '''[[fundamental workflow]]''' including PTM analysis and Systems Biology (above) and an integration of biological (or technical) replicates (below).]]&lt;br /&gt;
The '''[[SanXoT Software Package]]''' (SSP) allows the statistical analysis of quantitative proteomics data (obtained using other programs such as [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer]), using the WSPP model&amp;lt;ref name=navarro2014&amp;gt;Navarro, P., et al., ''General statistical framework for quantitative proteomics by stable isotope labeling''. J Proteome Res, 2014. 13(3): p. 1234-47.&amp;lt;/ref&amp;gt;. This model analyses relative (rather than absolute) quantitative values, e.g. the ratios between two quantitative measurements.&lt;br /&gt;
&lt;br /&gt;
The SSP contains a set of modules that can be used to assemble user-configurable quantitative workflows with very high flexibility.&lt;br /&gt;
&lt;br /&gt;
In the '''[[fundamental workflow]]''' the quantitative results at the scan level are integrated to quantify each of the peptides. The quantitative data at the peptide level are then integrated to obtain quantitative protein values. Finally, the quantitative data at the protein level are integrated, providing the statistical significance that the abundance of each protein is increased or decreased. An example of this basic workflow is provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
Among other possibilities, SanXoT allows building up workflows to automatically process very large numbers of quantitative experiments, to integrate results obtained from technical or biological replicates, or to perform Systems Biology analysis using the SBT model&amp;lt;ref name=garciamarques2016&amp;gt;Garcia-Marques, F., et al., ''A Novel Systems-Biology Algorithm for the Analysis of Coordinated Protein Responses Using Quantitative Proteomics''. Mol Cell Proteomics, 2016. 15(5): p. 1740-60.&amp;lt;/ref&amp;gt;. Examples of these applications are provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== General design ==&lt;br /&gt;
All SanXoT workflows contain two initial steps:&lt;br /&gt;
&lt;br /&gt;
# a first step where the raw quantitative information is extracted from the output of the quantification program and organized according to the '''[[How to use SanXoT#Data structure|data structure]]''' of SanXoT. You can see more about '''[[How to use SanXoT#Preparing the data|how to prepare the data]]''' for SanXoT analysis.&lt;br /&gt;
# a second step where the raw quantitative information is processed to obtain prior weights. This process is called calibration. For more details, check [[How to use SanXoT#Calibrating the data|the section '''Calibrating the data''']].&lt;br /&gt;
&lt;br /&gt;
SanXoT then performs WSPP statistics by iteratively applying the Generic Integration Algorithm (GIA)&amp;lt;ref name=garciamarques2016/&amp;gt; to the calibrated data. Each GIA integration performs an independent statistical analysis, and its output may be used as input for the following integration step.&lt;br /&gt;
&lt;br /&gt;
In the fundamental workflow, the GIA is applied to integrate scan-level data into peptide-level data, peptide-level data into protein-level data, and finally the protein-level data themselves.&lt;br /&gt;
&lt;br /&gt;
This modular structure allows users to design their own quantitative workflows by applying GIA integrations iteratively. You can check some examples of advanced applications developed using this modular structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Fig1 calibration.PNG|thumb|Calibration of spectra using '''[[Klibrate]]'''.]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&lt;br /&gt;
[[File:Fig1 lowerLevel-to-higherLevel.PNG|thumb|210px|Integration from any lower level to any higher level, using the GIA.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td colspan=2&amp;gt;[[File:Fig1 fundamentalWorkflow.PNG|thumb|550px|The Fundamental Workflow, which uses the GIA from spectra to proteins.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Data structure ==&lt;br /&gt;
Each GIA integration needs as input tab-separated text tables&amp;lt;ref name=garciamarques2016/&amp;gt;: &lt;br /&gt;
# '''a data file''' containing the quantitative data of the lower level. This file contains three columns: identifier (a text string that is used to identify the element), quantitative value (base-2 logarithm of the ratio of the two quantitative measurements) and prior weight (a parameter that measures the accuracy of the quantitative value before performing the integration).&lt;br /&gt;
# '''a relations table''', which establishes the correspondence of the identifiers in the lower level with those in the upper level. This file contains two columns: the first one contains the identifiers of the upper level and the second one the identifiers of the lower level.&lt;br /&gt;
&lt;br /&gt;
Each GIA integration generates an output data file containing the quantitative data of the upper level. This file can be used as input for the next GIA integration. The relations table for the next integration, however, needs to be prepared in advance.&lt;br /&gt;
&lt;br /&gt;
Note that SanXoT identifies columns by their position (not by their name). The first row, which contains the header, is removed, so the user can choose column names freely (as long as exactly one header row is included).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Peptide dataFile.PNG|thumb|320px|Example of a data file for peptides. The first column contains the peptide identifiers (in this case, amino acid sequences using SEQUEST-style characters for PTM modifications), the second column contains the log2(A/B) of the intensities of two iTRAQ reporter ions, and the third column contains the weight of each quantification (in this case, max(A,B), i.e. the maximum of the two reporter-ion intensities being compared).]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;[[File:Pep2prot relationsTable.PNG|thumb|390px|Example of a peptide-to-protein relations table. The first column contains the higher-level identifiers (in this case, the FASTA header of each protein), and the second column contains the lower-level identifiers (in this case, the same sequences used in the data table).]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Preparing the data ==&lt;br /&gt;
&lt;br /&gt;
The output of the software used to obtain the quantitative values must be converted to generate:&lt;br /&gt;
&lt;br /&gt;
# '''an uncalibrated data file''' containing three columns: identifier, quantitative value (base-2 logarithm of the ratio of the two quantitative measurements) and quality parameter. The quality parameter is a number that is used to rank the quantitative values according to their quality (in terms of variance, higher quality meaning a lower variance). In the case of isobaric labeling, the third column contains the intensity of the most intense of the two reporter ions used to calculate the ratio; other kinds of quantification may need other parameters to account for the quality of the quantitative values&amp;lt;ref name=navarro2014/&amp;gt;.&lt;br /&gt;
# '''the relations tables''' of all the GIA integrations planned in the workflow. In the fundamental workflow these tables describe which scans will be integrated into each peptide and which peptides will be integrated into each protein.&lt;br /&gt;
These tables are easy to construct, due to their simple plain-text format. If the software used for quantification generates tab-separated text tables, one of the SanXoT modules ('''[[Aljamia]]''') may be used for this task. In the unit tests for SanXoT we include a workflow that uses Aljamia to generate all the required files from the output of Proteome Discoverer 2.1. Quantitative values produced in other formats may require other simple scripts to generate the files required by SanXoT.&lt;br /&gt;
&lt;br /&gt;
== Calibrating the data ==&lt;br /&gt;
Calibration is the process whereby the quality parameter (the parameter used to rank quantitative values according to their variance) is transformed into the prior weight, a parameter with full statistical meaning, defined as the inverse of the local variance of the quantitative value in log2-ratio scale. This process is done using the '''[[Klibrate]]''' module and needs to be performed only once in each workflow. For more information about calibration, see '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
Using prior weights is essential in GIA integrations, since these parameters not only reflect the local variances of the measurements, but are also used as statistical weights to calculate the weighted averages. Moreover, GIA integrations follow error propagation theory, so the output data file of each integration contains true prior weights and does not need to be recalibrated.&lt;br /&gt;
&lt;br /&gt;
== Performing integrations and removing outliers ==&lt;br /&gt;
Each GIA integration takes the input data file from the lower level and the relations table, calculates the general variance of the integration (using a robust iterative method) and generates as output a data file containing the integrated quantifications of the upper level. The integration is done using the '''[[SanXoT]]''' module.&lt;br /&gt;
&lt;br /&gt;
In addition to the quantitative data, each GIA integration may generate several additional files which contain information about the integration. More detailed information may be found in '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
One of the advantages of the WSPP model is that it provides a straightforward method to eliminate outliers&amp;lt;ref name=navarro2014/&amp;gt; in each integration step. This is done by calculating standardized log2-ratios (z-values), calculating the probability that each element of the lower level is a significant outlier of the z distribution (e.g., N(0,1)) and calculating the associated FDR. The most extreme outliers may be removed sequentially and the integration repeated, until all outliers below a user-defined FDR-level are removed.&lt;br /&gt;
&lt;br /&gt;
Outlier removal is performed by the '''[[SanXoTSieve]]''' module by tagging the outlier elements in the relations table. &lt;br /&gt;
The entire process is done in three steps:&lt;br /&gt;
&lt;br /&gt;
# a first integration is done with '''[[SanXoT]]''', which serves to calculate the variance&lt;br /&gt;
# outliers are iteratively tagged in the relations table at a user-defined FDR-level using SanXoTSieve&lt;br /&gt;
# a second integration is done with SanXoT using the fixed variance calculated in the first step and removing from the integration the outliers tagged in the relations table.&lt;br /&gt;
&lt;br /&gt;
== Interpreting the results of GIA integrations ==&lt;br /&gt;
GIA integrations basically generate the quantitative data and their associated variances at the upper level, and detect and eliminate outliers that could affect the values at the upper level. However, each integration also generates additional information that may be very useful for interpreting the results. The most important is the general variance of the integration.&lt;br /&gt;
&lt;br /&gt;
The general variance is an estimate of the error introduced when the elements of the lower level are experimentally generated from those of the upper level. This parameter may have different meaning depending on the nature of the data.&lt;br /&gt;
&lt;br /&gt;
When spectra are integrated into peptides, the general variance estimates the error associated with the measurement (independently of peptide and protein manipulation), and hence depends on the LC-MS setup. When peptides are integrated into proteins, the variance estimates the error produced at the time of peptide generation (protein digestion, peptide desalting), independently of the LC-MS analysis. When proteins are integrated into the grand mean, the variance mainly estimates the error produced during protein manipulation (extraction,…). Similarly, integrating technical and biological replicates allows the different error sources at the protein level to be estimated separately. Integration of proteins into categories gives a useful insight into protein coordination&amp;lt;ref name=garciamarques2016/&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Building the workflows ==&lt;br /&gt;
Typically, workflows contain three main parts:&lt;br /&gt;
&lt;br /&gt;
# Data parsing to generate the uncalibrated data files and all relations tables needed. This may be performed using '''[[Aljamia]]''' or other scripts provided by the user.&lt;br /&gt;
# Data calibration to generate the (calibrated) data files. This is performed using '''[[Klibrate]]'''.&lt;br /&gt;
# A list of concatenated (or branched) GIA integrations. As explained above, each GIA is performed by running sequentially '''[[SanXoT]]''' (to calculate the general variance), '''[[SanXoTSieve]]''' (to tag outliers) and SanXoT again (to integrate without outliers).&lt;br /&gt;
&lt;br /&gt;
Note that some specific GIA integrations require the use of other modules. These include experiment merging and Systems Biology analysis. See advanced applications and '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
== Advanced applications ==&lt;br /&gt;
[[File:Fig1 replicateIntegration.PNG|thumb|600px|Integration of technical or biological replicates using the GIA.]]&lt;br /&gt;
SanXoT also allows using the GIA to integrate technical or biological replicates. In this case, scans are integrated into peptides and then into proteins separately for each replicate, and the protein-level data are then integrated to obtain protein averages (grouped data). A further GIA integration allows calculation of the statistical significance of the averaged protein changes.&lt;br /&gt;
&lt;br /&gt;
Merging of experiments may be performed with the aid of '''[[Cardenio]]''' (or using scripts prepared by the user). Cardenio is used to generate suitable relations tables to make protein averages from technical or biological replicates, and to normalize (&amp;quot;center&amp;quot;) the data to take into account the systematic quantitative error of each experiment before integration.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 SystemsBiologyTriangle.PNG|thumb|250px|The Systems Biology Triangle.]]&lt;br /&gt;
Systems biology analysis can be performed using the SBT model&amp;lt;ref name=garciamarques2016/&amp;gt; by building up a triangular workflow. Proteins are integrated into categories and categories are integrated to provide statistical significance.&lt;br /&gt;
&lt;br /&gt;
Systems biology information can be downloaded, for example, from [https://david.ncifcrf.gov/ DAVID]. The SanXoT Software Package includes the '''[[Camacho]]''' module to convert such DAVID tab-separated files into SanXoT relations files that can be used by the SBT model. Alternatively, users can generate relations files for systems biology (either custom or downloaded) just by creating tab-separated tables with two columns: the first one for categories, and the second one for proteins.&lt;br /&gt;
&lt;br /&gt;
Protein categories can be managed using '''[[SanXoTSqueezer]]''', '''[[Sanson]]''' and '''[[SanXoTGauss]]'''. For more detailed information see '''[[exploring SanXoT features]]'''.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=604</id>
		<title>How to use SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=604"/>
				<updated>2018-08-10T10:08:56Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: /* Performing integrations and removing outliers */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Figure 1.png|thumb|400px|Examples of quantitative workflows constructed with modules from the SanXoT package: a '''[[fundamental workflow]]''' including PTM analysis and Systems Biology (above) and an integration of biological (or technical) replicates (below).]]&lt;br /&gt;
The '''[[SanXoT Software Package]]''' (SSP) allows the statistical analysis of quantitative proteomics data (obtained using other programs such as [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer]), using the WSPP model&amp;lt;ref name=navarro2014&amp;gt;Navarro, P., et al., ''General statistical framework for quantitative proteomics by stable isotope labeling''. J Proteome Res, 2014. 13(3): p. 1234-47.&amp;lt;/ref&amp;gt;. This model analyses relative (rather than absolute) quantitative values, e.g. the ratios between two quantitative measurements.&lt;br /&gt;
&lt;br /&gt;
The SSP contains a set of modules that can be used to assemble user-configurable quantitative workflows with very high flexibility.&lt;br /&gt;
&lt;br /&gt;
In the '''[[fundamental workflow]]''' the quantitative results at the scan level are integrated to quantify each of the peptides. The quantitative data at the peptide level are then integrated to obtain quantitative protein values. Finally, the quantitative data at the protein level are integrated, providing the statistical significance that the abundance of each protein is increased or decreased. An example of this basic workflow is provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
Among other possibilities, SanXoT allows building up workflows to automatically process very large numbers of quantitative experiments, to integrate results obtained from technical or biological replicates, or to perform Systems Biology analysis using the SBT model&amp;lt;ref name=garciamarques2016&amp;gt;Garcia-Marques, F., et al., ''A Novel Systems-Biology Algorithm for the Analysis of Coordinated Protein Responses Using Quantitative Proteomics''. Mol Cell Proteomics, 2016. 15(5): p. 1740-60.&amp;lt;/ref&amp;gt;. Examples of these applications are provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== General design ==&lt;br /&gt;
All SanXoT workflows contain two initial steps:&lt;br /&gt;
&lt;br /&gt;
# a first step where the raw quantitative information is extracted from the output of the quantification program and organized according to the '''[[How to use SanXoT#Data structure|data structure]]''' of SanXoT. You can see more about '''[[How to use SanXoT#Preparing the data|how to prepare the data]]''' for SanXoT analysis.&lt;br /&gt;
# a second step where the raw quantitative information is processed to obtain prior weights. This process is called calibration. For more details, check [[How to use SanXoT#Calibrating the data|the section '''Calibrating the data''']].&lt;br /&gt;
&lt;br /&gt;
SanXoT then performs WSPP statistics by iteratively applying the Generic Integration Algorithm (GIA)&amp;lt;ref name=garciamarques2016/&amp;gt; to the calibrated data. Each GIA integration performs an independent statistical analysis, and its output may be used as input for the following integration step.&lt;br /&gt;
&lt;br /&gt;
In the fundamental workflow, the GIA is applied to integrate scan-level data into peptide-level data, peptide-level data into protein-level data, and finally the protein-level data themselves.&lt;br /&gt;
&lt;br /&gt;
This modular structure allows users to design their own quantitative workflows by applying GIA integrations iteratively. You can check some examples of advanced applications developed using this modular structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Fig1 calibration.PNG|thumb|Calibration of spectra using '''[[Klibrate]]'''.]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&lt;br /&gt;
[[File:Fig1 lowerLevel-to-higherLevel.PNG|thumb|210px|Integration from any lower level to any higher level, using the GIA.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td colspan=2&amp;gt;[[File:Fig1 fundamentalWorkflow.PNG|thumb|550px|The Fundamental Workflow, which uses the GIA from spectra to proteins.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Data structure ==&lt;br /&gt;
Each GIA integration needs as input tab-separated text tables&amp;lt;ref name=garciamarques2016/&amp;gt;: &lt;br /&gt;
# '''a data file''' containing the quantitative data of the lower level. This file contains three columns: identifier (a text string that is used to identify the element), quantitative value (base-2 logarithm of the ratio of the two quantitative measurements) and prior weight (a parameter that measures the accuracy of the quantitative value before performing the integration).&lt;br /&gt;
# '''a relations table''', which establishes the correspondence of the identifiers in the lower level with those in the upper level. This file contains two columns: the first one contains the identifiers of the upper level and the second one the identifiers of the lower level.&lt;br /&gt;
&lt;br /&gt;
Each GIA integration generates an output data file containing the quantitative data of the upper level. This file can be used as input for the next GIA integration. The relations table for the next integration, however, needs to be prepared in advance.&lt;br /&gt;
&lt;br /&gt;
Note that SanXoT identifies columns by their position (not by their name). The first row, which contains the header, is removed, so the user can choose column names freely (as long as exactly one header row is included).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Peptide dataFile.PNG|thumb|320px|Example of a data file for peptides. The first column contains the peptide identifiers (in this case, amino acid sequences using SEQUEST-style characters for PTM modifications), the second column contains the log2(A/B) of the intensities of two iTRAQ reporter ions, and the third column contains the weight of each quantification (in this case, max(A,B), i.e. the maximum of the two reporter-ion intensities being compared).]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;[[File:Pep2prot relationsTable.PNG|thumb|390px|Example of a peptide-to-protein relations table. The first column contains the higher-level identifiers (in this case, the FASTA header of each protein), and the second column contains the lower-level identifiers (in this case, the same sequences used in the data table).]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Preparing the data ==&lt;br /&gt;
&lt;br /&gt;
The output of the software used to obtain the quantitative values must be converted to generate:&lt;br /&gt;
&lt;br /&gt;
# '''an uncalibrated data file''' containing three columns: identifier, quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and quality parameter. The quality parameter is a number used to rank the quantitative values according to their quality (in terms of variance, higher quality meaning lower variance). In the case of isobaric labeling the third column contains the intensity of the most intense of the two reporter ions used to calculate the ratio; other kinds of quantification may need other parameters to account for the quality of the quantitative values&amp;lt;ref name=navarro2014/&amp;gt;.&lt;br /&gt;
# '''the relations tables''' of all the GIA integrations planned in the workflow. In the fundamental workflow these tables describe which scans will be integrated into each peptide and which peptides will be integrated into each protein.&lt;br /&gt;
These tables are easy to construct, due to their simple plain-text format. If the software used for quantification generates tab-separated text tables, one of the SanXoT modules, '''[[Aljamia]]''', may be used for this task. In the unit tests for SanXoT we include a workflow that uses Aljamia to generate all the required files from the output of Proteome Discoverer 2.1. Quantitative values produced in other formats may require simple scripts to generate the files required by SanXoT.&lt;br /&gt;
&lt;br /&gt;
== Calibrating the data ==&lt;br /&gt;
Calibration is the process where the quality parameter (the parameter which is used to rank quantitative values according to their variance) is transformed into the prior weight, a parameter with full statistical meaning defined as the inverse of the local variance of the quantitative value in log2-ratio scale. This process is done using the '''[[Klibrate]]''' module and needs to be performed only once in each workflow. For more information about calibration, see exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
Using prior weights is essential in GIA integrations, since these parameters not only reflect the local variances of the measurements, but are also used as statistical weights to calculate the weighted averages. Moreover, GIA integrations follow error propagation theory, so that the output data file of each integration contains true prior weights and does not need to be recalibrated.&lt;br /&gt;
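The calibration idea can be sketched in a few lines. This is only an illustration of the concept (Klibrate fits a proper statistical model); the quality-ranked binning, the bin size and the function interface are assumptions made for the example:

```python
import statistics

# Illustrative sketch only: rank measurements by their quality parameter,
# bin them, and take the prior weight as the inverse of the within-bin
# ("local") variance of the log2-ratios. Klibrate does this with a fitted
# model rather than naive binning.
def calibrate(records, bin_size=4):
    """records: list of (identifier, log2_ratio, quality_parameter) tuples."""
    # Higher quality parameter is assumed to mean lower expected variance.
    ranked = sorted(records, key=lambda r: r[2], reverse=True)
    calibrated = []
    for i in range(0, len(ranked), bin_size):
        chunk = ranked[i:i + bin_size]
        ratios = [r[1] for r in chunk]
        local_var = statistics.pvariance(ratios) if len(ratios) > 1 else 1.0
        weight = 1.0 / local_var if local_var > 0 else float("inf")
        for ident, ratio, _ in chunk:
            calibrated.append((ident, ratio, weight))  # weight replaces quality
    return calibrated
```

The returned triples have the calibrated data-file layout: identifier, log2-ratio and prior weight (inverse local variance).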
&lt;br /&gt;
== Performing integrations and removing outliers ==&lt;br /&gt;
Each GIA integration takes the input data file from the lower level and the relations table, calculates the general variance of the integration (using a robust iterative method) and generates as output a data file containing the integrated quantifications of the upper level. The integration is done using the '''[[SanXoT]]''' module.&lt;br /&gt;
&lt;br /&gt;
In addition to the quantitative data, each GIA integration may generate several additional files which contain information about the integration. More detailed information may be found in exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
One of the advantages of the WSPP model is that it provides a straightforward method to eliminate outliers&amp;lt;ref name=navarro2014/&amp;gt; in each integration step. This is done by calculating standardized log2-ratios (z-values), calculating the probability that each element of the lower level is a significant outlier of the z distribution (e.g., N(0,1)) and calculating the associated FDR. The most extreme outliers may be removed sequentially and the integration repeated, until all outliers below a user-defined FDR-level are removed.&lt;br /&gt;
&lt;br /&gt;
Outlier removal is performed by the '''[[SanXoTSieve]]''' module by tagging the outlier elements in the relations table. &lt;br /&gt;
The entire process is done in three steps:&lt;br /&gt;
&lt;br /&gt;
# a first integration is done with '''[[SanXoT]]''', which serves to calculate the variance&lt;br /&gt;
# outliers are iteratively tagged in the relations table at a user-defined FDR-level using SanXoTSieve&lt;br /&gt;
# a second integration is done with SanXoT using the fixed variance calculated in the first step and removing from the integration the outliers tagged in the relations table.&lt;br /&gt;
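The z-value/FDR tagging described above can be sketched as follows. The formulas are illustrative of the idea, not SanXoTSieve's exact implementation:

```python
import math

# Sketch of the outlier-tagging idea: each standardized log2-ratio (z-value)
# gets a two-tailed p-value under N(0, 1), and the Benjamini-Hochberg
# procedure tags the elements whose estimated FDR falls below the
# user-defined threshold.
def tag_outliers(z_values, fdr_threshold=0.01):
    """z_values: dict mapping a lower-level identifier to its z-value."""
    # Two-tailed p-value of a standard normal variate.
    pvals = sorted(
        ((ident, math.erfc(abs(z) / math.sqrt(2))) for ident, z in z_values.items()),
        key=lambda kv: kv[1],
    )
    n = len(pvals)
    k = 0  # largest rank i whose p-value passes the BH criterion
    for i, (_, p) in enumerate(pvals, start=1):
        if p <= fdr_threshold * i / n:
            k = i
    return {ident for ident, _ in pvals[:k]}
```

Tagged elements would then be excluded from the second integration, which reuses the variance fixed in the first one.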
&lt;br /&gt;
== Interpreting the results of GIA integrations ==&lt;br /&gt;
GIA integrations basically generate the quantitative data and their associated variances at the upper level, and detect and eliminate outliers that could affect the values at the upper level. However, each integration also generates additional information that may be very useful for interpreting the results. The most important is the general variance of the integration.&lt;br /&gt;
&lt;br /&gt;
The general variance is an estimate of the error introduced when the elements of the lower level are experimentally generated from those of the upper level. This parameter may have different meanings depending on the nature of the data.&lt;br /&gt;
&lt;br /&gt;
When spectra are integrated into peptides, the general variance estimates the error associated with the measurement (independently of peptide and protein manipulation), and hence depends on the LC-MS setup. When peptides are integrated into proteins, the variance estimates the error produced at the time of peptide generation (protein digestion, peptide desalting), independently of LC-MS analysis. When proteins are integrated into the grand mean, the variance mainly estimates the error produced during protein manipulation (extraction,…). Similarly, integrating technical and biological replicates allows the different error sources at the protein level to be estimated separately. Integration of proteins into categories gives a useful insight into protein coordination&amp;lt;ref name=garciamarques2016/&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Building the workflows ==&lt;br /&gt;
Typically, workflows contain three main parts:&lt;br /&gt;
&lt;br /&gt;
# Data parsing to generate the uncalibrated data files and all relations tables needed. This may be performed using '''[[Aljamia]]''' or other scripts provided by the user.&lt;br /&gt;
# Data calibration to generate the (calibrated) data files. This is performed using '''[[Klibrate]]'''.&lt;br /&gt;
# A list of concatenated (or branched) GIA integrations. As explained above, each GIA is performed by running sequentially '''[[SanXoT]]''' (to calculate the general variance), '''[[SanXoTSieve]]''' (to tag outliers) and SanXoT (to integrate without outliers).&lt;br /&gt;
&lt;br /&gt;
Note that some specific GIA integrations need the use of other modules. These include experiment merging and Systems biology analysis. See advanced applications and exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
== Advanced applications ==&lt;br /&gt;
[[File:Fig1 replicateIntegration.PNG|thumb|600px|Integration of technical or biological replicates using the GIA.]]&lt;br /&gt;
SanXoT also allows using the GIA to integrate technical or biological replicates. In this case, scans are integrated into peptides and then into proteins separately for each replicate, and the protein-level data are then integrated to obtain protein averages (grouped data). A further GIA integration allows calculation of the statistical significance of the averaged protein changes.&lt;br /&gt;
&lt;br /&gt;
Merging of experiments may be performed with the aid of Cardenio (or using scripts prepared by the user). Cardenio is used to generate suitable relations tables to make protein averages from technical or biological replicates, and to normalize (&amp;quot;center&amp;quot;) the data to take into account the systematic quantitative error of each experiment before integration.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 SystemsBiologyTriangle.PNG|thumb|250px|The Systems Biology Triangle.]]&lt;br /&gt;
Systems biology analysis can be performed using the SBT model&amp;lt;ref name=garciamarques2016/&amp;gt; by building up a triangular workflow. Proteins are integrated into categories and categories are integrated to provide statistical significance.&lt;br /&gt;
&lt;br /&gt;
The protein-to-category relations files may be generated using Camacho (or using scripts provided by the user). Protein categories can be managed using SanXoTSqueezer, Sanson and SanXoTGauss. For more detailed information see exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=603</id>
		<title>How to use SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=603"/>
				<updated>2018-08-10T10:07:23Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Figure 1.png|thumb|400px|Examples of quantitative workflows constructed with modules from the SanXoT package: a '''[[fundamental workflow]]''' including PTM analysis and Systems Biology (above) and an integration of biological (or technical) replicates (below).]]&lt;br /&gt;
The '''[[SanXoT Software Package]]''' (SSP) allows the statistical analysis of quantitative proteomics data (obtained using other programs such as [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer]), using the WSPP model&amp;lt;ref name=navarro2014&amp;gt;Navarro, P., et al., ''General statistical framework for quantitative proteomics by stable isotope labeling''. J Proteome Res, 2014. 13(3): p. 1234-47.&amp;lt;/ref&amp;gt;. This model analyses relative (rather than absolute) quantitative values, e.g. the ratios between two quantitative measurements.&lt;br /&gt;
&lt;br /&gt;
The SSP contains a set of modules that can be used to assemble user-configurable quantitative workflows with very high flexibility.&lt;br /&gt;
&lt;br /&gt;
In the '''[[fundamental workflow]]''' the quantitative results at the scan level are integrated to quantify each one of the peptides. The quantitative data at the peptide level are then integrated to obtain quantitative protein values. Finally, the quantitative data at the protein level are integrated, providing the statistical significance of the increase or decrease in abundance of each protein. An example of this basic workflow is provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
Among other possibilities, SanXoT allows building up workflows to automatically process very large numbers of quantitative experiments, to integrate results obtained from technical or biological replicates, or to perform Systems Biology analysis using the SBT model&amp;lt;ref name=garciamarques2016&amp;gt;Garcia-Marques, F., et al., ''A Novel Systems-Biology Algorithm for the Analysis of Coordinated Protein Responses Using Quantitative Proteomics''. Mol Cell Proteomics, 2016. 15(5): p. 1740-60.&amp;lt;/ref&amp;gt;. Examples of these applications are provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
== General design ==&lt;br /&gt;
All SanXoT workflows contain two initial steps:&lt;br /&gt;
&lt;br /&gt;
# a first step where the raw quantitative information is extracted from the output of the quantification program and organized according to the '''[[How to use SanXoT#Data structure|data structure]]''' of SanXoT. You can see more about '''[[How to use SanXoT#preparing the data|how to prepare the data]]''' for SanXoT analysis.&lt;br /&gt;
# a second step where the raw quantitative information is processed to obtain prior weights. This process is called calibration. For more details, check [[How to use SanXoT#Calibrating the data|the section '''Calibrating the data''']].&lt;br /&gt;
&lt;br /&gt;
SanXoT then performs WSPP statistics by iteratively applying the Generic Integration Algorithm (GIA)&amp;lt;ref name=garciamarques2016/&amp;gt; to the calibrated data. Each GIA integration performs an independent statistical analysis, and its output may be used as input for the following integration step.&lt;br /&gt;
&lt;br /&gt;
In the fundamental workflow, the GIA is applied to integrate scan-level data to the peptide level, peptide-level data to the protein level and, finally, the protein-level data.&lt;br /&gt;
&lt;br /&gt;
This modular structure allows users to design their own quantitative workflows by iteratively applying GIA integrations. You can check some examples of advanced applications developed using this modular structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Fig1 calibration.PNG|thumb|Calibration of spectra using '''[[Klibrate]]'''.]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&lt;br /&gt;
[[File:Fig1 lowerLevel-to-higherLevel.PNG|thumb|210px|Integration from any lower level to any higher level, using the GIA.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td colspan=2&amp;gt;[[File:Fig1 fundamentalWorkflow.PNG|thumb|550px|The Fundamental Workflow, which uses the GIA from spectra to proteins.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Data structure ==&lt;br /&gt;
Each GIA integration needs as input tab-separated text tables&amp;lt;ref name=garciamarques2016/&amp;gt;: &lt;br /&gt;
# '''a data file''' containing the quantitative data of the lower level. This file contains three columns: identifier (a text string that is used to identify the element), quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and prior weight (a parameter that measures the accuracy of the quantitative value before performing the integration).&lt;br /&gt;
# '''a relations table''', which establishes the correspondence between the identifiers in the lower level and those in the upper level. This file contains two columns: the first one contains the identifiers of the upper level and the second one the identifiers of the lower level.&lt;br /&gt;
&lt;br /&gt;
Each GIA integration generates an output data file containing the quantitative data of the upper level. This file can be used as input for the next GIA integration. The relations table for the next integration, however, needs to be prepared in advance.&lt;br /&gt;
&lt;br /&gt;
Note that SanXoT identifies columns by their position (not by their name). The first row, containing the header, is removed, so the user can choose column names freely (as long as one header row is included).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Peptide dataFile.PNG|thumb|320px|Example of data file for peptides. First column are peptide identifiers (in this case, amino acid sequences using SEQUEST-style characters for PTM modifications), second column are the log2(A/B) of the intensities of two iTRAQ reporter ions, third column are the weights of each quantification (in this case, the max(A,B), ie the maximum weight of the two reporter ions being compared)]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;[[File:Pep2prot relationsTable.PNG|thumb|390px|Example of peptide-to-protein relations table. First column is the higher level identifiers (in this case, the FASTA header of each protein), second column are the lower level identifiers (in this case, the same sequences used in the data table).]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
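As a concrete illustration of the two input tables, the following sketch writes a minimal data file and relations table; the file names, scan identifiers and intensities are made up for the example and are not part of the SanXoT package:

```python
import math

# Hypothetical scan-level quantifications: (identifier, intensity_A, intensity_B).
scans = [
    ("scan_001", 1500.0, 1200.0),
    ("scan_002", 980.0, 1010.0),
]

# Data file: identifier, log2(A/B) and a weight column (here max(A, B),
# as in the isobaric-labeling example above). One header row is included.
with open("scans_data.tsv", "w") as f:
    f.write("id\tlog2_ratio\tweight\n")
    for ident, a, b in scans:
        f.write(f"{ident}\t{math.log2(a / b)}\t{max(a, b)}\n")

# Relations table: upper-level identifier first, lower-level identifier second
# (here both scans map to the same peptide).
with open("scan2peptide.tsv", "w") as f:
    f.write("peptide\tscan\n")
    for ident, _, _ in scans:
        f.write(f"PEPTIDEK\t{ident}\n")
```

Because only column positions matter, the header names can be chosen freely.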
&lt;br /&gt;
== Preparing the data ==&lt;br /&gt;
&lt;br /&gt;
The output of the software used to obtain the quantitative values must be converted to generate:&lt;br /&gt;
&lt;br /&gt;
# '''an uncalibrated data file''' containing three columns: identifier, quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and quality parameter. The quality parameter is a number used to rank the quantitative values according to their quality (in terms of variance, higher quality meaning lower variance). In the case of isobaric labeling the third column contains the intensity of the most intense of the two reporter ions used to calculate the ratio; other kinds of quantification may need other parameters to account for the quality of the quantitative values&amp;lt;ref name=navarro2014/&amp;gt;.&lt;br /&gt;
# '''the relations tables''' of all the GIA integrations planned in the workflow. In the fundamental workflow these tables describe which scans will be integrated into each peptide and which peptides will be integrated into each protein.&lt;br /&gt;
These tables are easy to construct, due to their simple plain-text format. If the software used for quantification generates tab-separated text tables, one of the SanXoT modules, '''[[Aljamia]]''', may be used for this task. In the unit tests for SanXoT we include a workflow that uses Aljamia to generate all the required files from the output of Proteome Discoverer 2.1. Quantitative values produced in other formats may require simple scripts to generate the files required by SanXoT.&lt;br /&gt;
&lt;br /&gt;
== Calibrating the data ==&lt;br /&gt;
Calibration is the process where the quality parameter (the parameter which is used to rank quantitative values according to their variance) is transformed into the prior weight, a parameter with full statistical meaning defined as the inverse of the local variance of the quantitative value in log2-ratio scale. This process is done using the '''[[Klibrate]]''' module and needs to be performed only once in each workflow. For more information about calibration, see exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
Using prior weights is essential in GIA integrations, since these parameters not only reflect the local variances of the measurements, but are also used as statistical weights to calculate the weighted averages. Moreover, GIA integrations follow error propagation theory, so that the output data file of each integration contains true prior weights and does not need to be recalibrated.&lt;br /&gt;
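The calibration idea can be sketched in a few lines. This is only an illustration of the concept (Klibrate fits a proper statistical model); the quality-ranked binning, the bin size and the function interface are assumptions made for the example:

```python
import statistics

# Illustrative sketch only: rank measurements by their quality parameter,
# bin them, and take the prior weight as the inverse of the within-bin
# ("local") variance of the log2-ratios. Klibrate does this with a fitted
# model rather than naive binning.
def calibrate(records, bin_size=4):
    """records: list of (identifier, log2_ratio, quality_parameter) tuples."""
    # Higher quality parameter is assumed to mean lower expected variance.
    ranked = sorted(records, key=lambda r: r[2], reverse=True)
    calibrated = []
    for i in range(0, len(ranked), bin_size):
        chunk = ranked[i:i + bin_size]
        ratios = [r[1] for r in chunk]
        local_var = statistics.pvariance(ratios) if len(ratios) > 1 else 1.0
        weight = 1.0 / local_var if local_var > 0 else float("inf")
        for ident, ratio, _ in chunk:
            calibrated.append((ident, ratio, weight))  # weight replaces quality
    return calibrated
```

The returned triples have the calibrated data-file layout: identifier, log2-ratio and prior weight (inverse local variance).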
&lt;br /&gt;
== Performing integrations and removing outliers ==&lt;br /&gt;
Each GIA integration takes the input data file from the lower level and the relations table, calculates the general variance of the integration (using a robust iterative method) and generates as output a data file containing the integrated quantifications of the upper level. The integration is done using the SanXoT module.&lt;br /&gt;
&lt;br /&gt;
In addition to the quantitative data, each GIA integration may generate several additional files which contain information about the integration. More detailed information may be found in exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
One of the advantages of the WSPP model is that it provides a straightforward method to eliminate outliers&amp;lt;ref name=navarro2014/&amp;gt; in each integration step. This is done by calculating standardized log2-ratios (z-values), calculating the probability that each element of the lower level is a significant outlier of the z distribution (e.g., N(0,1)) and calculating the associated FDR. The most extreme outliers may be removed sequentially and the integration repeated, until all outliers below a user-defined FDR-level are removed.&lt;br /&gt;
&lt;br /&gt;
Outlier removal is performed by the SanXoTSieve module by tagging the outlier elements in the relations table. &lt;br /&gt;
The entire process is done in three steps:&lt;br /&gt;
&lt;br /&gt;
# a first integration is done with SanXoT, which serves to calculate the variance&lt;br /&gt;
# outliers are iteratively tagged in the relations table at a user-defined FDR-level using SanXoTSieve&lt;br /&gt;
# a second integration is done with SanXoT using the fixed variance calculated in the first step and removing from the integration the outliers tagged in the relations table.&lt;br /&gt;
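The z-value/FDR tagging described above can be sketched as follows. The formulas are illustrative of the idea, not SanXoTSieve's exact implementation:

```python
import math

# Sketch of the outlier-tagging idea: each standardized log2-ratio (z-value)
# gets a two-tailed p-value under N(0, 1), and the Benjamini-Hochberg
# procedure tags the elements whose estimated FDR falls below the
# user-defined threshold.
def tag_outliers(z_values, fdr_threshold=0.01):
    """z_values: dict mapping a lower-level identifier to its z-value."""
    # Two-tailed p-value of a standard normal variate.
    pvals = sorted(
        ((ident, math.erfc(abs(z) / math.sqrt(2))) for ident, z in z_values.items()),
        key=lambda kv: kv[1],
    )
    n = len(pvals)
    k = 0  # largest rank i whose p-value passes the BH criterion
    for i, (_, p) in enumerate(pvals, start=1):
        if p <= fdr_threshold * i / n:
            k = i
    return {ident for ident, _ in pvals[:k]}
```

Tagged elements would then be excluded from the second integration, which reuses the variance fixed in the first one.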
&lt;br /&gt;
== Interpreting the results of GIA integrations ==&lt;br /&gt;
GIA integrations basically generate the quantitative data and their associated variances at the upper level, and detect and eliminate outliers that could affect the values at the upper level. However, each integration also generates additional information that may be very useful for interpreting the results. The most important is the general variance of the integration.&lt;br /&gt;
&lt;br /&gt;
The general variance is an estimate of the error introduced when the elements of the lower level are experimentally generated from those of the upper level. This parameter may have different meanings depending on the nature of the data.&lt;br /&gt;
&lt;br /&gt;
When spectra are integrated into peptides, the general variance estimates the error associated with the measurement (independently of peptide and protein manipulation), and hence depends on the LC-MS setup. When peptides are integrated into proteins, the variance estimates the error produced at the time of peptide generation (protein digestion, peptide desalting), independently of LC-MS analysis. When proteins are integrated into the grand mean, the variance mainly estimates the error produced during protein manipulation (extraction,…). Similarly, integrating technical and biological replicates allows the different error sources at the protein level to be estimated separately. Integration of proteins into categories gives a useful insight into protein coordination&amp;lt;ref name=garciamarques2016/&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Building the workflows ==&lt;br /&gt;
Typically, workflows contain three main parts:&lt;br /&gt;
&lt;br /&gt;
# Data parsing to generate the uncalibrated data files and all relations tables needed. This may be performed using '''[[Aljamia]]''' or other scripts provided by the user.&lt;br /&gt;
# Data calibration to generate the (calibrated) data files. This is performed using '''[[Klibrate]]'''.&lt;br /&gt;
# A list of concatenated (or branched) GIA integrations. As explained above, each GIA is performed by running sequentially '''[[SanXoT]]''' (to calculate the general variance), '''[[SanXoTSieve]]''' (to tag outliers) and SanXoT (to integrate without outliers).&lt;br /&gt;
&lt;br /&gt;
Note that some specific GIA integrations need the use of other modules. These include experiment merging and Systems biology analysis. See advanced applications and exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
== Advanced applications ==&lt;br /&gt;
[[File:Fig1 replicateIntegration.PNG|thumb|600px|Integration of technical or biological replicates using the GIA.]]&lt;br /&gt;
SanXoT also allows using the GIA to integrate technical or biological replicates. In this case, scans are integrated into peptides and then into proteins separately for each replicate, and the protein-level data are then integrated to obtain protein averages (grouped data). A further GIA integration allows calculation of the statistical significance of the averaged protein changes.&lt;br /&gt;
&lt;br /&gt;
Merging of experiments may be performed with the aid of Cardenio (or using scripts prepared by the user). Cardenio is used to generate suitable relations tables to make protein averages from technical or biological replicates, and to normalize (&amp;quot;center&amp;quot;) the data to take into account the systematic quantitative error of each experiment before integration.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 SystemsBiologyTriangle.PNG|thumb|250px|The Systems Biology Triangle.]]&lt;br /&gt;
Systems biology analysis can be performed using the SBT model&amp;lt;ref name=garciamarques2016/&amp;gt; by building up a triangular workflow. Proteins are integrated into categories and categories are integrated to provide statistical significance.&lt;br /&gt;
&lt;br /&gt;
The protein-to-category relations files may be generated using Camacho (or using scripts provided by the user). Protein categories can be managed using SanXoTSqueezer, Sanson and SanXoTGauss. For more detailed information see exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=602</id>
		<title>How to use SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=602"/>
				<updated>2018-08-10T09:58:26Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Figure 1.png|thumb|400px|Examples of quantitative workflows constructed with modules from the SanXoT package: a '''[[fundamental workflow]]''' including PTM analysis and Systems Biology (above) and an integration of biological (or technical) replicates (below).]]&lt;br /&gt;
The '''[[SanXoT Software Package]]''' (SSP) allows the statistical analysis of quantitative proteomics data (obtained using other programs such as [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer]), using the WSPP model&amp;lt;ref name=navarro2014&amp;gt;Navarro, P., et al., ''General statistical framework for quantitative proteomics by stable isotope labeling''. J Proteome Res, 2014. 13(3): p. 1234-47.&amp;lt;/ref&amp;gt;. This model analyses relative (and not absolute) quantitative values, e.g. the ratios between two quantitative measurements.&lt;br /&gt;
&lt;br /&gt;
The SSP contains a set of modules that can be used to assemble user-configurable quantitative workflows with very high flexibility.&lt;br /&gt;
&lt;br /&gt;
In the '''[[fundamental workflow]]''' the quantitative results at the scan level are integrated to quantify each one of the peptides. The quantitative data at the peptide level are then integrated to obtain quantitative protein values. Finally, the quantitative data at the protein level are integrated, providing the statistical significance of the increase or decrease in abundance of each protein. An example of this basic workflow is provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
Among other possibilities, SanXoT allows building up workflows to automatically process very large numbers of quantitative experiments, to integrate results obtained from technical or biological replicates, or to perform Systems Biology analysis using the SBT model&amp;lt;ref name=garciamarques2016&amp;gt;Garcia-Marques, F., et al., ''A Novel Systems-Biology Algorithm for the Analysis of Coordinated Protein Responses Using Quantitative Proteomics''. Mol Cell Proteomics, 2016. 15(5): p. 1740-60.&amp;lt;/ref&amp;gt;. Examples of these applications are provided in the unit tests for SanXoT.&lt;br /&gt;
&lt;br /&gt;
== General design ==&lt;br /&gt;
All SanXoT workflows contain two initial steps:&lt;br /&gt;
&lt;br /&gt;
# a first step where the raw quantitative information is extracted from the output of the quantification program and organized according to the data structure of SanXoT. You can see more about '''[[How to use SanXoT#preparing the data|preparing the data]]''' for SanXoT analysis.&lt;br /&gt;
# a second step where the raw quantitative information is processed to obtain prior weights. This process is called calibration. For more details, check [[How to use SanXoT#Calibrating the data|the section '''Calibrating the data''']].&lt;br /&gt;
&lt;br /&gt;
SanXoT then performs WSPP statistics by iteratively applying the Generic Integration Algorithm (GIA)&amp;lt;ref name=garciamarques2016/&amp;gt; to the calibrated data. Each GIA integration performs an independent statistical analysis, and its output may be used as input for the following integration step.&lt;br /&gt;
&lt;br /&gt;
In the fundamental workflow, the GIA is applied to integrate scan-level data to the peptide level, peptide-level data to the protein level and, finally, the protein-level data.&lt;br /&gt;
&lt;br /&gt;
This modular structure allows users to design their own quantitative workflows by iteratively applying GIA integrations. You can check some examples of advanced applications developed using this modular structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Fig1 calibration.PNG|thumb|Calibration of spectra using '''[[Klibrate]]'''.]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&lt;br /&gt;
[[File:Fig1 lowerLevel-to-higherLevel.PNG|thumb|210px|Integration from any lower level to any higher level, using the GIA.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td colspan=2&amp;gt;[[File:Fig1 fundamentalWorkflow.PNG|thumb|550px|The Fundamental Workflow, which uses the GIA from spectra to proteins.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Data structure ==&lt;br /&gt;
Each GIA integration needs as input two plain-text files&amp;lt;ref name=garciamarques2016/&amp;gt;: &lt;br /&gt;
# '''a data file''' containing the quantitative data of the lower level. This file contains three columns: identifier (a text string that is used to identify the element), quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and prior weight (a parameter that measures the accuracy of the quantitative value before performing the integration).&lt;br /&gt;
# '''a relations table''', which establishes the correspondence between the identifiers in the lower level and those in the upper level. This file contains two columns: the first one contains the identifiers of the upper level and the second one the identifiers of the lower level.&lt;br /&gt;
&lt;br /&gt;
Each GIA integration generates an output data file containing the quantitative data of the upper level. This file can be used as input for the next GIA integration. The relations table for the next integration, however, needs to be prepared in advance.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Peptide dataFile.PNG|thumb|320px|Example of data file for peptides. First column are peptide identifiers (in this case, amino acid sequences using SEQUEST-style characters for PTM modifications), second column are the log2(A/B) of the intensities of two iTRAQ reporter ions, third column are the weights of each quantification (in this case, the max(A,B), ie the maximum weight of the two reporter ions being compared)]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;[[File:Pep2prot relationsTable.PNG|thumb|390px|Example of peptide-to-protein relations table. First column is the higher level identifiers (in this case, the FASTA header of each protein), second column are the lower level identifiers (in this case, the same sequences used in the data table).]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
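As a concrete illustration of the two input tables, the following sketch writes a minimal data file and relations table; the file names, scan identifiers and intensities are made up for the example and are not part of the SanXoT package:

```python
import math

# Hypothetical scan-level quantifications: (identifier, intensity_A, intensity_B).
scans = [
    ("scan_001", 1500.0, 1200.0),
    ("scan_002", 980.0, 1010.0),
]

# Data file: identifier, log2(A/B) and a weight column (here max(A, B),
# as in the isobaric-labeling example above). One header row is included.
with open("scans_data.tsv", "w") as f:
    f.write("id\tlog2_ratio\tweight\n")
    for ident, a, b in scans:
        f.write(f"{ident}\t{math.log2(a / b)}\t{max(a, b)}\n")

# Relations table: upper-level identifier first, lower-level identifier second
# (here both scans map to the same peptide).
with open("scan2peptide.tsv", "w") as f:
    f.write("peptide\tscan\n")
    for ident, _, _ in scans:
        f.write(f"PEPTIDEK\t{ident}\n")
```

Because only column positions matter, the header names can be chosen freely.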
&lt;br /&gt;
== Preparing the data ==&lt;br /&gt;
&lt;br /&gt;
The output of the software used to obtain the quantitative values must be converted to generate:&lt;br /&gt;
&lt;br /&gt;
# '''an uncalibrated data file''' containing three columns: identifier, quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and quality parameter. The quality parameter is a number used to rank the quantitative values according to their quality (in terms of variance, higher quality meaning lower variance). In the case of isobaric labeling the third column contains the intensity of the most intense of the two reporter ions used to calculate the ratio; other kinds of quantification may need other parameters to account for the quality of the quantitative values&amp;lt;ref name=navarro2014/&amp;gt;.&lt;br /&gt;
# '''the relations tables''' of all the GIA integrations planned in the workflow. In the fundamental workflow these tables describe which scans will be integrated into each peptide and which peptides will be integrated into each protein.&lt;br /&gt;
These tables are easy to construct thanks to their simple plain-text format. If the software used for quantification generates plain-text files, one of the SanXoT modules (Aljamia) may be used for this task. In the unit tests for SanXoT we include a workflow that uses Aljamia to generate all the required files from the output of Proteome Discoverer 2.1. Quantitative values produced in other formats may require other simple scripts to generate the files required by SanXoT.&lt;br /&gt;
&lt;br /&gt;
== Calibrating the data ==&lt;br /&gt;
Calibration is the process where the quality parameter (the parameter which is used to rank quantitative values according to their variance) is transformed into the prior weight, a parameter with full statistical meaning defined as the inverse of the local variance of the quantitative value in log2-ratio scale. This process is done using the '''[[Klibrate]]''' module and needs to be performed only once in each workflow. For more information about calibration, see exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
Using prior weights is essential in GIA integrations, since these parameters not only reflect the local variances of the measurements, but are also used as statistical weights to calculate the weighted averages. Moreover, GIA integrations follow error propagation theory, so that the output data file of each integration contains true prior weights and does not need to be recalibrated.&lt;br /&gt;
&lt;br /&gt;
== Performing integrations and removing outliers ==&lt;br /&gt;
Each GIA integration takes as input the data file of the lower level and the relations table, calculates the general variance of the integration (using a robust iterative method), and generates as output a data file containing the integrated quantifications of the upper level. The integration is done using the SanXoT module.&lt;br /&gt;
&lt;br /&gt;
In addition to the quantitative data, each GIA integration may generate several additional files which contain information about the integration. More detailed information may be found in exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
One of the advantages of the WSPP model is that it provides a straightforward method to eliminate outliers&amp;lt;ref name=navarro2014/&amp;gt; in each integration step. This is done by calculating standardized log2-ratios (z-values), calculating the probability that each element of the lower level is a significant outlier of the z distribution (e.g., N(0,1)) and calculating the associated FDR. The most extreme outliers may be removed sequentially and the integration repeated, until all outliers below a user-defined FDR-level are removed.&lt;br /&gt;
&lt;br /&gt;
Outlier removal is performed by the SanXoTSieve module by tagging the outlier elements in the relations table. &lt;br /&gt;
The entire process is done in three steps:&lt;br /&gt;
&lt;br /&gt;
# a first integration is done with SanXoT, which serves to calculate the general variance&lt;br /&gt;
# outliers are iteratively tagged in the relations table at a user-defined FDR-level using SanXoTSieve&lt;br /&gt;
# a second integration is done with SanXoT using the fixed variance calculated in the first step and removing from the integration the outliers tagged in the relations table.&lt;br /&gt;
&lt;br /&gt;
== Interpreting the results of GIA integrations ==&lt;br /&gt;
GIA integrations basically generate the quantitative data and their associated variances at the upper level, and detect and eliminate outliers that can affect the values at the upper level. However, each integration also generates additional information that may be very useful for interpreting the results. The most important is the general variance of the integration.&lt;br /&gt;
&lt;br /&gt;
The general variance is an estimate of the error introduced when the elements of the lower level are experimentally generated from those of the upper level. This parameter may have different meanings depending on the nature of the data.&lt;br /&gt;
&lt;br /&gt;
When spectra are integrated into peptides, the general variance estimates the error associated with the measurement (independently of peptide and protein manipulation), and hence depends on the LC-MS setup. When peptides are integrated into proteins, the variance estimates the error produced at the time of peptide generation (protein digestion, peptide desalting), independently of the LC-MS analysis. When proteins are integrated into the grand mean, the variance mainly estimates the error produced during protein manipulation (extraction,…). Similarly, integrating technical and biological replicates allows the different error sources at the protein level to be estimated separately. Integration of proteins into categories gives useful insight into protein coordination&amp;lt;ref name=garciamarques2016/&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Building the workflows ==&lt;br /&gt;
Typically, workflows contain three main parts:&lt;br /&gt;
&lt;br /&gt;
# Data parsing to generate the uncalibrated data files and all relations tables needed. This may be performed using '''[[Aljamia]]''' or other scripts provided by the user.&lt;br /&gt;
# Data calibration to generate the (calibrated) data files. This is performed using '''[[Klibrate]]'''.&lt;br /&gt;
# A list of concatenated (or branched) GIA integrations. As explained above, each GIA integration is performed by running sequentially '''[[SanXoT]]''' (to calculate the general variance), '''[[SanXoTSieve]]''' (to tag outliers) and SanXoT (to integrate without outliers).&lt;br /&gt;
&lt;br /&gt;
Note that some specific GIA integrations need the use of other modules. These include experiment merging and Systems biology analysis. See advanced applications and exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
== Advanced applications ==&lt;br /&gt;
[[File:Fig1 replicateIntegration.PNG|thumb|600px|Integration of technical or biological replicates using the GIA.]]&lt;br /&gt;
SanXoT also allows using the GIA to integrate technical or biological replicates. In this case, scans are integrated into peptides and then into proteins separately for each replicate, and the protein-level data are then integrated to obtain protein averages (grouped data). A further GIA integration allows calculation of the statistical significance of the averaged protein changes.&lt;br /&gt;
&lt;br /&gt;
Merging of experiments may be performed with the aid of Cardenio (or using scripts prepared by the user). Cardenio is used to generate suitable relations tables to make protein averages from technical or biological replicates, and to normalize (&amp;quot;center&amp;quot;) the data to take into account the systematic quantitative error of each experiment before integration.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 SystemsBiologyTriangle.PNG|thumb|250px|The Systems Biology Triangle.]]&lt;br /&gt;
Systems biology analysis can be performed using the SBT model&amp;lt;ref name=garciamarques2016/&amp;gt; by building up a triangular workflow. Proteins are integrated into categories and categories are integrated to provide statistical significance.&lt;br /&gt;
&lt;br /&gt;
The protein-to-category relations files may be generated using Camacho (or using scripts provided by the user). Protein categories can be managed using SanXoTSqueezer, Sanson, and SanXoTGauss. For more detailed information see exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=File:Figure_1.png&amp;diff=601</id>
		<title>File:Figure 1.png</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=File:Figure_1.png&amp;diff=601"/>
				<updated>2018-08-10T09:54:34Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=600</id>
		<title>How to use SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=600"/>
				<updated>2018-08-10T09:49:57Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: /* Preparing the data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The '''[[SanXoT Software Package]]''' (SSP) allows the statistical analysis of quantitative proteomics data (obtained using other programs such as [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer]), using the WSPP model&amp;lt;ref name=navarro2014&amp;gt;Navarro, P., et al., ''General statistical framework for quantitative proteomics by stable isotope labeling''. J Proteome Res, 2014. 13(3): p. 1234-47.&amp;lt;/ref&amp;gt;. This model analyses relative (and not absolute) quantitative values, e.g. the ratios between two quantitative measurements.&lt;br /&gt;
&lt;br /&gt;
The SSP contains a set of modules that can be used to assemble user-configurable quantitative workflows with very high flexibility.&lt;br /&gt;
&lt;br /&gt;
In the '''[[fundamental workflow]]''' the quantitative results at the scan level are integrated to quantify each one of the peptides. The quantitative data at the peptide level are then integrated to obtain quantitative protein values. Finally, the quantitative data at the protein level are integrated, providing the statistical significance of the increase or decrease in abundance of each protein. An example of this basic workflow is provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
Among other possibilities, SanXoT allows building up workflows to automatically process very large numbers of quantitative experiments, to integrate results obtained from technical or biological replicates, or to perform Systems Biology analysis using the SBT model&amp;lt;ref name=garciamarques2016&amp;gt;Garcia-Marques, F., et al., ''A Novel Systems-Biology Algorithm for the Analysis of Coordinated Protein Responses Using Quantitative Proteomics''. Mol Cell Proteomics, 2016. 15(5): p. 1740-60.&amp;lt;/ref&amp;gt;. Examples of these applications are provided in the unit tests for SanXoT.&lt;br /&gt;
&lt;br /&gt;
== General design ==&lt;br /&gt;
All SanXoT workflows contain two initial steps:&lt;br /&gt;
&lt;br /&gt;
# a first step where the raw quantitative information is extracted from the output of the quantification program and organized according to the data structure of SanXoT. You can see more about '''[[How to use SanXoT#preparing the data|preparing the data]]''' for SanXoT analysis.&lt;br /&gt;
# a second step where the raw quantitative information is processed to obtain prior weights. This process is called calibration. For more details, check [[How to use SanXoT#Calibrating the data|the section '''Calibrating the data''']].&lt;br /&gt;
&lt;br /&gt;
SanXoT then performs WSPP-statistics by applying iteratively the Generic Integration Algorithm (GIA)&amp;lt;ref name=garciamarques2016/&amp;gt; on the calibrated data. Each GIA integration performs an independent statistical analysis, and its output may be used as input for the following integration step.&lt;br /&gt;
&lt;br /&gt;
In the fundamental workflow, the GIA is applied to integrate scan-level data into peptide-level data, peptide-level data into protein-level data, and finally protein-level data into the grand mean.&lt;br /&gt;
&lt;br /&gt;
This modular structure allows users to design their own quantitative workflows by applying GIA integrations iteratively. You can check some examples of advanced applications developed using this modular structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Fig1 calibration.PNG|thumb|Calibration of spectra using '''[[Klibrate]]'''.]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&lt;br /&gt;
[[File:Fig1 lowerLevel-to-higherLevel.PNG|thumb|210px|Integration from any lower level to any higher level, using the GIA.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td colspan=2&amp;gt;[[File:Fig1 fundamentalWorkflow.PNG|thumb|550px|The Fundamental Workflow, which uses the GIA from spectra to proteins.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Data structure ==&lt;br /&gt;
Each GIA integration needs as input two plain-text files&amp;lt;ref name=garciamarques2016/&amp;gt;: &lt;br /&gt;
# '''a data file''' containing the quantitative data of the lower level. This file contains three columns: identifier (a text string that identifies the element), quantitative value (base-2 logarithm of the ratio of the two quantitative measurements) and prior weight (a parameter that measures the accuracy of the quantitative value before performing the integration).&lt;br /&gt;
# '''a relations table''', which establishes the correspondence of the identifiers in the lower level with those in the upper level. This file contains two columns: the first one contains the identifiers of the upper level, and the second one the identifiers of the lower level.&lt;br /&gt;
&lt;br /&gt;
Each GIA integration generates an output data file containing the quantitative data of the upper level. This file can be used as input for the next GIA integration. The relations table for the next integration, however, needs to be prepared in advance.&lt;br /&gt;
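As an illustration of these two input files, the following sketch writes a toy lower-level data file and its peptide-to-protein relations table. The tab-separated layout, file names and all identifiers are hypothetical; they only mirror the three-column and two-column structures described above.

```python
# Toy lower-level data file rows: identifier, log2-ratio, prior weight.
# All identifiers and values are invented for illustration only.
data_rows = [
    ("PEPTIDEK", -0.12, 35.2),
    ("SAMPLEK", 0.87, 12.9),
    ("QUANTIFIER", 0.05, 51.0),
]

# Toy relations table rows: upper-level id, lower-level id.
relations = [
    ("ProteinA", "PEPTIDEK"),
    ("ProteinA", "SAMPLEK"),
    ("ProteinB", "QUANTIFIER"),
]

# Write both plain-text files, one tab-separated record per line.
with open("peptide_data.tsv", "w") as fh:
    for ident, xvalue, weight in data_rows:
        fh.write(f"{ident}\t{xvalue}\t{weight}\n")

with open("pep2prot_relations.tsv", "w") as fh:
    for upper, lower in relations:
        fh.write(f"{upper}\t{lower}\n")
```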
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Peptide dataFile.PNG|thumb|320px|Example of a data file for peptides. The first column contains the peptide identifiers (in this case, amino acid sequences using SEQUEST-style characters for PTM modifications), the second column the log2(A/B) of the intensities of two iTRAQ reporter ions, and the third column the weight of each quantification (in this case max(A,B), i.e. the maximum of the two reporter-ion intensities being compared).]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;[[File:Pep2prot relationsTable.PNG|thumb|390px|Example of a peptide-to-protein relations table. The first column contains the higher-level identifiers (in this case, the FASTA header of each protein), and the second column the lower-level identifiers (in this case, the same sequences used in the data table).]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Preparing the data ==&lt;br /&gt;
&lt;br /&gt;
The output of the software used to obtain the quantitative values must be converted to generate:&lt;br /&gt;
&lt;br /&gt;
# '''an uncalibrated data file''' containing three columns: identifier, quantitative value (base-2 logarithm of the ratio of the two quantitative measurements), and quality parameter. The quality parameter is a number used to rank the quantitative values according to their quality (in terms of variance, higher quality meaning lower variance). In the case of isobaric labeling, the third column contains the intensity of the most intense of the two reporter ions used to calculate the ratio; other kinds of quantification may require other parameters to account for the quality of the quantitative values&amp;lt;ref name=navarro2014/&amp;gt;.&lt;br /&gt;
# '''the relations tables''' of all the GIA integrations planned in the workflow. In the fundamental workflow these tables describe which scans will be integrated into each peptide and which peptides will be integrated into each protein.&lt;br /&gt;
These tables are easy to construct thanks to their simple plain-text format. If the software used for quantification generates plain-text files, one of the SanXoT modules (Aljamia) may be used for this task. In the unit tests for SanXoT we include a workflow that uses Aljamia to generate all the required files from the output of Proteome Discoverer 2.1. Quantitative values produced in other formats may require other simple scripts to generate the files required by SanXoT.&lt;br /&gt;
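For quantifications based on two reporter-ion intensities, the uncalibrated data can be derived exactly as described above: the quantitative value is log2(A/B) and the quality parameter is the intensity of the most intense reporter ion. A minimal sketch (the record layout and function name are hypothetical, not part of Aljamia):

```python
import math

def build_uncalibrated(rows):
    """Turn (identifier, intensity_A, intensity_B) records into
    three-column uncalibrated data: identifier, log2(A/B), and
    max(A, B) as the quality parameter."""
    out = []
    for ident, a, b in rows:
        out.append((ident, math.log2(a / b), max(a, b)))
    return out
```

For example, an element quantified with intensities 8 and 2 yields the quantitative value log2(4) = 2 and the quality parameter 8.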
&lt;br /&gt;
== Calibrating the data ==&lt;br /&gt;
Calibration is the process where the quality parameter (the parameter which is used to rank quantitative values according to their variance) is transformed into the prior weight, a parameter with full statistical meaning defined as the inverse of the local variance of the quantitative value in log2-ratio scale. This process is done using the '''[[Klibrate]]''' module and needs to be performed only once in each workflow. For more information about calibration, see exploring SanXoT features.&lt;br /&gt;
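The idea behind calibration can be sketched as follows: rank the values by their quality parameter, estimate the local variance of the log2-ratios within each quality bin, and take the prior weight as the inverse of that variance. This is only a crude binned illustration of the concept, not the fitting procedure actually implemented by Klibrate:

```python
import statistics

def binned_prior_weights(quality, log2_ratios, n_bins=2):
    # Rank the quantitative values by their quality parameter.
    order = sorted(range(len(quality)), key=lambda i: quality[i])
    size = len(order) // n_bins
    bins = [order[b * size : b * size + size] for b in range(n_bins)]
    weights = [0.0] * len(quality)
    for idx in bins:
        # Local variance of the log2-ratios in this quality bin;
        # the prior weight of each value is its inverse.
        var = statistics.pvariance([log2_ratios[i] for i in idx])
        for i in idx:
            weights[i] = 1.0 / var
    return weights
```

Values with a higher quality parameter fall into bins with lower local variance and therefore receive larger prior weights.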
&lt;br /&gt;
Using prior weights is essential in GIA integrations, since these parameters not only reflect the local variances of the measurements, but are also used as statistical weights to calculate the weighted averages. Moreover, GIA integrations follow error propagation theory, so that the output data file of each integration contains true prior weights and does not need to be recalibrated.&lt;br /&gt;
&lt;br /&gt;
== Performing integrations and removing outliers ==&lt;br /&gt;
Each GIA integration takes as input the data file of the lower level and the relations table, calculates the general variance of the integration (using a robust iterative method), and generates as output a data file containing the integrated quantifications of the upper level. The integration is done using the SanXoT module.&lt;br /&gt;
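In simplified form, each integration computes, for every upper-level element, the weighted average of its lower-level values together with the propagated weight. The sketch below sets the general variance to zero for brevity; the actual SanXoT module estimates it with a robust iterative method and folds it into each element's variance:

```python
def integrate(data, relations):
    """Toy GIA-style integration step (general variance taken as zero).
    data: dict mapping lower_id to (log2_ratio, prior_weight)
    relations: dict mapping upper_id to a list of lower_ids
    Returns a dict mapping upper_id to (weighted average, weight)."""
    out = {}
    for upper, lowers in relations.items():
        total_w = sum(data[l][1] for l in lowers)
        # Weighted average of the lower-level values; by error
        # propagation, the weight of the average is the summed weight.
        avg = sum(data[l][0] * data[l][1] for l in lowers) / total_w
        out[upper] = (avg, total_w)
    return out
```

The output has the same identifier/value/weight shape as the input data, which is why each integration can feed the next one directly.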
&lt;br /&gt;
In addition to the quantitative data, each GIA integration may generate several additional files which contain information about the integration. More detailed information may be found in exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
One of the advantages of the WSPP model is that it provides a straightforward method to eliminate outliers&amp;lt;ref name=navarro2014/&amp;gt; in each integration step. This is done by calculating standardized log2-ratios (z-values), calculating the probability that each element of the lower level is a significant outlier of the z distribution (e.g., N(0,1)) and calculating the associated FDR. The most extreme outliers may be removed sequentially and the integration repeated, until all outliers below a user-defined FDR-level are removed.&lt;br /&gt;
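The z-value step can be sketched as follows: compute a two-tailed p-value for each standardized log2-ratio against N(0,1), apply a Benjamini-Hochberg adjustment, and tag the values falling within the user-defined FDR level. This is a simplified single-pass illustration; SanXoTSieve removes the most extreme outliers sequentially and repeats the integration:

```python
import math

def flag_outliers(z_values, fdr=0.01):
    """Tag indices whose standardized log2-ratio is a significant
    outlier of N(0,1) at the given FDR (Benjamini-Hochberg)."""
    n = len(z_values)
    # Two-tailed p-value of each z against the standard normal.
    p = [math.erfc(abs(z) / math.sqrt(2.0)) for z in z_values]
    order = sorted(range(n), key=lambda i: p[i])
    # BH step-up: the adjusted p-value at rank k is the minimum of
    # p_j * n / j over ranks j from k upwards.
    q = [0.0] * n
    running = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running = min(running, p[i] * n / rank)
        q[i] = running
    # Keep the indices whose adjusted p-value stays within the FDR.
    return sorted(i for i in range(n) if min(q[i], fdr) == q[i])
```

For example, among z-values of 0.1, -0.2, 6.0 and 0.3, only the element with z = 6.0 is tagged at a 1% FDR.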
&lt;br /&gt;
Outlier removal is performed by the SanXoTSieve module by tagging the outlier elements in the relations table. &lt;br /&gt;
The entire process is done in three steps:&lt;br /&gt;
&lt;br /&gt;
# a first integration is done with SanXoT, which serves to calculate the general variance&lt;br /&gt;
# outliers are iteratively tagged in the relations table at a user-defined FDR-level using SanXoTSieve&lt;br /&gt;
# a second integration is done with SanXoT using the fixed variance calculated in the first step and removing from the integration the outliers tagged in the relations table.&lt;br /&gt;
&lt;br /&gt;
== Interpreting the results of GIA integrations ==&lt;br /&gt;
GIA integrations basically generate the quantitative data and their associated variances at the upper level, and detect and eliminate outliers that can affect the values at the upper level. However, each integration also generates additional information that may be very useful for interpreting the results. The most important is the general variance of the integration.&lt;br /&gt;
&lt;br /&gt;
The general variance is an estimate of the error introduced when the elements of the lower level are experimentally generated from those of the upper level. This parameter may have different meanings depending on the nature of the data.&lt;br /&gt;
&lt;br /&gt;
When spectra are integrated into peptides, the general variance estimates the error associated with the measurement (independently of peptide and protein manipulation), and hence depends on the LC-MS setup. When peptides are integrated into proteins, the variance estimates the error produced at the time of peptide generation (protein digestion, peptide desalting), independently of the LC-MS analysis. When proteins are integrated into the grand mean, the variance mainly estimates the error produced during protein manipulation (extraction,…). Similarly, integrating technical and biological replicates allows the different error sources at the protein level to be estimated separately. Integration of proteins into categories gives useful insight into protein coordination&amp;lt;ref name=garciamarques2016/&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Building the workflows ==&lt;br /&gt;
Typically, workflows contain three main parts:&lt;br /&gt;
&lt;br /&gt;
# Data parsing to generate the uncalibrated data files and all relations tables needed. This may be performed using '''[[Aljamia]]''' or other scripts provided by the user.&lt;br /&gt;
# Data calibration to generate the (calibrated) data files. This is performed using '''[[Klibrate]]'''.&lt;br /&gt;
# A list of concatenated (or branched) GIA integrations. As explained above, each GIA integration is performed by running sequentially '''[[SanXoT]]''' (to calculate the general variance), '''[[SanXoTSieve]]''' (to tag outliers) and SanXoT (to integrate without outliers).&lt;br /&gt;
&lt;br /&gt;
Note that some specific GIA integrations need the use of other modules. These include experiment merging and Systems biology analysis. See advanced applications and exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
== Advanced applications ==&lt;br /&gt;
[[File:Fig1 replicateIntegration.PNG|thumb|600px|Integration of technical or biological replicates using the GIA.]]&lt;br /&gt;
SanXoT also allows using the GIA to integrate technical or biological replicates. In this case, scans are integrated into peptides and then into proteins separately for each replicate, and the protein-level data are then integrated to obtain protein averages (grouped data). A further GIA integration allows calculation of the statistical significance of the averaged protein changes.&lt;br /&gt;
&lt;br /&gt;
Merging of experiments may be performed with the aid of Cardenio (or using scripts prepared by the user). Cardenio is used to generate suitable relations tables to make protein averages from technical or biological replicates, and to normalize (&amp;quot;center&amp;quot;) the data to take into account the systematic quantitative error of each experiment before integration.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 SystemsBiologyTriangle.PNG|thumb|250px|The Systems Biology Triangle.]]&lt;br /&gt;
Systems biology analysis can be performed using the SBT model&amp;lt;ref name=garciamarques2016/&amp;gt; by building up a triangular workflow. Proteins are integrated into categories and categories are integrated to provide statistical significance.&lt;br /&gt;
&lt;br /&gt;
The protein-to-category relations files may be generated using Camacho (or using scripts provided by the user). Protein categories can be managed using SanXoTSqueezer, Sanson, and SanXoTGauss. For more detailed information see exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=599</id>
		<title>How to use SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=599"/>
				<updated>2018-08-10T09:48:49Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: /* Building the workflows */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The '''[[SanXoT Software Package]]''' (SSP) allows the statistical analysis of quantitative proteomics data (obtained using other programs such as [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer]), using the WSPP model&amp;lt;ref name=navarro2014&amp;gt;Navarro, P., et al., ''General statistical framework for quantitative proteomics by stable isotope labeling''. J Proteome Res, 2014. 13(3): p. 1234-47.&amp;lt;/ref&amp;gt;. This model analyses relative (and not absolute) quantitative values, e.g. the ratios between two quantitative measurements.&lt;br /&gt;
&lt;br /&gt;
The SSP contains a set of modules that can be used to assemble user-configurable quantitative workflows with very high flexibility.&lt;br /&gt;
&lt;br /&gt;
In the '''[[fundamental workflow]]''' the quantitative results at the scan level are integrated to quantify each one of the peptides. The quantitative data at the peptide level are then integrated to obtain quantitative protein values. Finally, the quantitative data at the protein level are integrated, providing the statistical significance of the increase or decrease in abundance of each protein. An example of this basic workflow is provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
Among other possibilities, SanXoT allows building up workflows to automatically process very large numbers of quantitative experiments, to integrate results obtained from technical or biological replicates, or to perform Systems Biology analysis using the SBT model&amp;lt;ref name=garciamarques2016&amp;gt;Garcia-Marques, F., et al., ''A Novel Systems-Biology Algorithm for the Analysis of Coordinated Protein Responses Using Quantitative Proteomics''. Mol Cell Proteomics, 2016. 15(5): p. 1740-60.&amp;lt;/ref&amp;gt;. Examples of these applications are provided in the unit tests for SanXoT.&lt;br /&gt;
&lt;br /&gt;
== General design ==&lt;br /&gt;
All SanXoT workflows contain two initial steps:&lt;br /&gt;
&lt;br /&gt;
# a first step where the raw quantitative information is extracted from the output of the quantification program and organized according to the data structure of SanXoT. You can see more about '''[[How to use SanXoT#preparing the data|preparing the data]]''' for SanXoT analysis.&lt;br /&gt;
# a second step where the raw quantitative information is processed to obtain prior weights. This process is called calibration. For more details, check [[How to use SanXoT#Calibrating the data|the section '''Calibrating the data''']].&lt;br /&gt;
&lt;br /&gt;
SanXoT then performs WSPP-statistics by applying iteratively the Generic Integration Algorithm (GIA)&amp;lt;ref name=garciamarques2016/&amp;gt; on the calibrated data. Each GIA integration performs an independent statistical analysis, and its output may be used as input for the following integration step.&lt;br /&gt;
&lt;br /&gt;
In the fundamental workflow, the GIA is applied to integrate scan-level data into peptide-level data, peptide-level data into protein-level data, and finally protein-level data into the grand mean.&lt;br /&gt;
&lt;br /&gt;
This modular structure allows users to design their own quantitative workflows by applying GIA integrations iteratively. You can check some examples of advanced applications developed using this modular structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Fig1 calibration.PNG|thumb|Calibration of spectra using '''[[Klibrate]]'''.]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&lt;br /&gt;
[[File:Fig1 lowerLevel-to-higherLevel.PNG|thumb|210px|Integration from any lower level to any higher level, using the GIA.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td colspan=2&amp;gt;[[File:Fig1 fundamentalWorkflow.PNG|thumb|550px|The Fundamental Workflow, which uses the GIA from spectra to proteins.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Data structure ==&lt;br /&gt;
Each GIA integration needs as input two plain-text files&amp;lt;ref name=garciamarques2016/&amp;gt;: &lt;br /&gt;
# '''a data file''' containing the quantitative data of the lower level. This file contains three columns: identifier (a text string that identifies the element), quantitative value (base-2 logarithm of the ratio of the two quantitative measurements) and prior weight (a parameter that measures the accuracy of the quantitative value before performing the integration).&lt;br /&gt;
# '''a relations table''', which establishes the correspondence of the identifiers in the lower level with those in the upper level. This file contains two columns: the first one contains the identifiers of the upper level, and the second one the identifiers of the lower level.&lt;br /&gt;
&lt;br /&gt;
Each GIA integration generates an output data file containing the quantitative data of the upper level. This file can be used as input for the next GIA integration. The relations table for the next integration, however, needs to be prepared in advance.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Peptide dataFile.PNG|thumb|320px|Example of a data file for peptides. The first column contains the peptide identifiers (in this case, amino acid sequences using SEQUEST-style characters for PTM modifications), the second column the log2(A/B) of the intensities of two iTRAQ reporter ions, and the third column the weight of each quantification (in this case max(A,B), i.e. the maximum of the two reporter-ion intensities being compared).]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;[[File:Pep2prot relationsTable.PNG|thumb|390px|Example of a peptide-to-protein relations table. The first column contains the higher-level identifiers (in this case, the FASTA header of each protein), and the second column the lower-level identifiers (in this case, the same sequences used in the data table).]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Preparing the data ==&lt;br /&gt;
&lt;br /&gt;
The output of the software used to obtain the quantitative values must be converted to generate:&lt;br /&gt;
&lt;br /&gt;
# an uncalibrated data file containing three columns: identifier, quantitative value (base-2 logarithm of the ratio of the two quantitative measurements), and quality parameter. The quality parameter is a number used to rank the quantitative values according to their quality (in terms of variance, higher quality meaning lower variance). In the case of isobaric labeling, the third column contains the intensity of the most intense of the two reporter ions used to calculate the ratio; other kinds of quantification may require other parameters to account for the quality of the quantitative values&amp;lt;ref name=navarro2014/&amp;gt;.&lt;br /&gt;
# the relations tables of all the GIA integrations planned in the workflow. In the fundamental workflow these tables describe which scans will be integrated into each peptide and which peptides will be integrated into each protein.&lt;br /&gt;
These tables are easy to construct thanks to their simple plain-text format. If the software used for quantification generates plain-text files, one of the SanXoT modules (Aljamia) may be used for this task. In the unit tests for SanXoT we include a workflow that uses Aljamia to generate all the required files from the output of Proteome Discoverer 2.1. Quantitative values produced in other formats may require other simple scripts to generate the files required by SanXoT.&lt;br /&gt;
&lt;br /&gt;
== Calibrating the data ==&lt;br /&gt;
Calibration is the process where the quality parameter (the parameter which is used to rank quantitative values according to their variance) is transformed into the prior weight, a parameter with full statistical meaning defined as the inverse of the local variance of the quantitative value in log2-ratio scale. This process is done using the '''[[Klibrate]]''' module and needs to be performed only once in each workflow. For more information about calibration, see exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
Using prior weights is essential in GIA integrations, since these parameters not only reflect the local variances of the measurements, but are also used as statistical weights to calculate the weighted averages. Moreover, GIA integrations follow error propagation theory, so that the output data file of each integration contains true prior weights and does not need to be recalibrated.&lt;br /&gt;
&lt;br /&gt;
== Performing integrations and removing outliers ==&lt;br /&gt;
Each GIA integration takes as input the data file from the lower level and the relations table, calculates the general variance of the integration (using a robust iterative method) and generates as output a data file containing the integrated quantifications of the upper level. The integration is done using the SanXoT module.&lt;br /&gt;
&lt;br /&gt;
In addition to the quantitative data, each GIA integration may generate several additional files which contain information about the integration. More detailed information may be found in exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
One of the advantages of the WSPP model is that it provides a straightforward method to eliminate outliers&amp;lt;ref name=navarro2014/&amp;gt; in each integration step. This is done by calculating standardized log2-ratios (z-values), calculating the probability that each element of the lower level is a significant outlier of the z distribution (i.e., N(0,1)), and calculating the associated FDR. The most extreme outliers may be removed sequentially and the integration repeated, until all outliers below a user-defined FDR level are removed.&lt;br /&gt;
&lt;br /&gt;
Outlier removal is performed by the SanXoTSieve module, which tags the outlier elements in the relations table.&lt;br /&gt;
The entire process is done in three steps:&lt;br /&gt;
&lt;br /&gt;
# a first integration is done with SanXoT, which serves to calculate the variance&lt;br /&gt;
# outliers are iteratively tagged in the relations table at a user-defined FDR-level using SanXoTSieve&lt;br /&gt;
# a second integration is done with SanXoT using the fixed variance calculated in the first step and removing from the integration the outliers tagged in the relations table.&lt;br /&gt;
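The outlier-tagging logic of step 2 can be sketched conceptually as follows. This is not SanXoTSieve itself: it applies a one-pass Benjamini-Hochberg FDR control to two-sided normal p-values of the z-values, whereas the module tags the most extreme outliers sequentially, as described above; function and parameter names are our own.&lt;br /&gt;

```python
from statistics import NormalDist

def tag_outliers(z_values, fdr_threshold=0.01):
    """Return the indices of elements tagged as significant outliers
    of the N(0,1) z distribution at the given FDR level.

    Conceptual sketch only (one-pass Benjamini-Hochberg), not the
    sequential procedure implemented by SanXoTSieve.
    """
    nd = NormalDist()
    # Two-sided p-value for each z under the standard normal.
    pvals = sorted(
        (2.0 * (1.0 - nd.cdf(abs(z))), i) for i, z in enumerate(z_values)
    )
    n = len(pvals)
    tagged = set()
    # Benjamini-Hochberg: largest k with p_(k) <= (k / n) * q.
    max_k = 0
    for k, (p, _) in enumerate(pvals, start=1):
        if p <= k / n * fdr_threshold:
            max_k = k
    for k in range(max_k):
        tagged.add(pvals[k][1])
    return tagged
```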
&lt;br /&gt;
== Interpreting the results of GIA integrations ==&lt;br /&gt;
GIA integrations basically generate the quantitative data and their associated variances at the upper level, and detect and eliminate outliers that could affect the upper-level values. However, each integration also generates additional information that may be very useful for interpreting the results. The most important of these is the general variance of the integration.&lt;br /&gt;
&lt;br /&gt;
The general variance is an estimate of the error introduced when the elements of the lower level are experimentally generated from those of the upper level. This parameter may have different meanings depending on the nature of the data.&lt;br /&gt;
&lt;br /&gt;
When spectra are integrated into peptides, the general variance estimates the error associated with the measurement (independently of peptide and protein manipulation), and hence depends on the LC-MS setup. When peptides are integrated into proteins, the variance estimates the error produced during peptide generation (protein digestion, peptide desalting), independently of the LC-MS analysis. When proteins are integrated into the grand mean, the variance mainly estimates the error produced during protein manipulation (extraction,…). Similarly, integrating technical and biological replicates allows the different error sources at the protein level to be estimated separately. Integration of proteins into categories gives useful insight into protein coordination&amp;lt;ref name=garciamarques2016/&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Building the workflows ==&lt;br /&gt;
Typically, workflows contain three main parts:&lt;br /&gt;
&lt;br /&gt;
# Data parsing to generate the uncalibrated data files and all relations tables needed. This may be performed using '''[[Aljamia]]''' or other scripts provided by the user.&lt;br /&gt;
# Data calibration to generate the (calibrated) data files. This is performed using '''[[Klibrate]]'''.&lt;br /&gt;
# A list of concatenated (or branched) GIA integrations. As explained above, each GIA is performed by running sequentially '''[[SanXoT]]''' (to calculate the general variance), '''[[SanXoTSieve]]''' (to tag outliers) and SanXoT again (to integrate without outliers).&lt;br /&gt;
&lt;br /&gt;
Note that some specific GIA integrations require the use of other modules. These include experiment merging and Systems biology analysis. See advanced applications and exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
== Advanced applications ==&lt;br /&gt;
[[File:Fig1 replicateIntegration.PNG|thumb|600px|Integration of technical or biological replicates using the GIA.]]&lt;br /&gt;
SanXoT also allows the GIA to be used to integrate technical or biological replicates. In this case, scans are integrated into peptides and then into proteins separately for each replicate, and the protein-level data are then integrated to obtain protein averages (grouped data). A further GIA integration allows the statistical significance of the averaged protein changes to be calculated.&lt;br /&gt;
&lt;br /&gt;
Merging of experiments may be performed with the aid of Cardenio (or using scripts prepared by the user). Cardenio is used to generate suitable relations tables to compute protein averages from technical or biological replicates, and to normalize (&amp;quot;center&amp;quot;) the data before integration, taking into account the systematic quantitative error of each experiment.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 SystemsBiologyTriangle.PNG|thumb|250px|The Systems Biology Triangle.]]&lt;br /&gt;
Systems biology analysis can be performed using the SBT model&amp;lt;ref name=garciamarques2016/&amp;gt; by building up a triangular workflow. Proteins are integrated into categories and categories are integrated to provide statistical significance.&lt;br /&gt;
&lt;br /&gt;
The protein-to-category relations files may be generated using Camacho (or using scripts provided by the user). Protein categories can be managed using SanXoTSqueezer, Sanson and SanXoTGauss. For more detailed information see exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=598</id>
		<title>How to use SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=598"/>
				<updated>2018-08-10T09:47:17Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: /* Calibrating the data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The '''[[SanXoT Software Package]]''' (SSP) allows the statistical analysis of quantitative proteomics data (obtained using other programs such as [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer]), using the WSPP model&amp;lt;ref name=navarro2014&amp;gt;Navarro, P., et al., ''General statistical framework for quantitative proteomics by stable isotope labeling''. J Proteome Res, 2014. 13(3): p. 1234-47.&amp;lt;/ref&amp;gt;. This model analyses relative (and not absolute) quantitative values, e.g. the ratios between two quantitative measurements.&lt;br /&gt;
&lt;br /&gt;
The SSP contains a set of modules that can be used to assemble user-configurable quantitative workflows with very high flexibility.&lt;br /&gt;
&lt;br /&gt;
In the '''[[fundamental workflow]]''' the quantitative results at the scan level are integrated to quantify each of the peptides. The quantitative data at the peptide level are then integrated to obtain quantitative protein values. Finally, the quantitative data at the protein level are integrated, providing the statistical significance of the increase or decrease in abundance of each protein. An example of this basic workflow is provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
Among other possibilities, SanXoT allows building up workflows to automatically process very large numbers of quantitative experiments, to integrate results obtained from technical or biological replicates, or to perform Systems Biology analysis using the SBT model&amp;lt;ref name=garciamarques2016&amp;gt;Garcia-Marques, F., et al., ''A Novel Systems-Biology Algorithm for the Analysis of Coordinated Protein Responses Using Quantitative Proteomics''. Mol Cell Proteomics, 2016. 15(5): p. 1740-60.&amp;lt;/ref&amp;gt;. Examples of these applications are provided in the unit tests for SanXoT.&lt;br /&gt;
&lt;br /&gt;
== General design ==&lt;br /&gt;
All SanXoT workflows contain two initial steps:&lt;br /&gt;
&lt;br /&gt;
# a first step where the raw quantitative information is extracted from the output of the quantification program and organized according to the data structure of SanXoT. You can see more about '''[[How to use SanXoT#Preparing the data|preparing the data]]''' for SanXoT analysis.&lt;br /&gt;
# a second step where the raw quantitative information is processed to obtain prior weights. This process is called calibration. For more details, check [[How to use SanXoT#Calibrating the data|the section '''Calibrating the data''']].&lt;br /&gt;
&lt;br /&gt;
SanXoT then performs WSPP statistics by iteratively applying the Generic Integration Algorithm (GIA)&amp;lt;ref name=garciamarques2016/&amp;gt; to the calibrated data. Each GIA integration performs an independent statistical analysis, and its output may be used as input for the following integration step.&lt;br /&gt;
&lt;br /&gt;
In the fundamental workflow the GIA is applied to integrate scan-level data into peptide-level data, to integrate peptide-level data into protein-level data, and finally to integrate the protein-level data into the grand mean.&lt;br /&gt;
&lt;br /&gt;
This modular structure allows users to design their own quantitative workflows by iteratively applying GIA integrations. You can check some examples of advanced applications developed using this modular structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Fig1 calibration.PNG|thumb|Calibration of spectra using '''[[Klibrate]]'''.]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&lt;br /&gt;
[[File:Fig1 lowerLevel-to-higherLevel.PNG|thumb|210px|Integration from any lower level to any higher level, using the GIA.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td colspan=2&amp;gt;[[File:Fig1 fundamentalWorkflow.PNG|thumb|550px|The Fundamental Workflow, which uses the GIA from spectra to proteins.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Data structure ==&lt;br /&gt;
Each GIA integration needs as input two plain-text files&amp;lt;ref name=garciamarques2016/&amp;gt;: &lt;br /&gt;
# '''a data file''' containing the quantitative data of the lower level. This file contains three columns: identifier (a text string that is used to identify the element), quantitative value (base-2 logarithm of the ratio of the two quantitative measurements) and prior weight (a parameter that measures the accuracy of the quantitative value before performing the integration).&lt;br /&gt;
# '''a relations table''', which establishes the correspondence between the identifiers in the lower level and those in the upper level. This file contains two columns: the first one contains the identifiers of the upper level and the second one the identifiers of the lower level.&lt;br /&gt;
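Because both files are plain tab-separated text, they can be produced and consumed with a few lines of code. A minimal Python sketch follows; the protein headers, peptide sequences and file name are invented for illustration and are not part of SanXoT.&lt;br /&gt;

```python
# Peptide-to-protein relations: upper-level identifier first,
# lower-level identifier second, one pair per line.
relations = [
    ("sp|P00001|EXAMPLE1", "AAGHK"),
    ("sp|P00001|EXAMPLE1", "LVNEVTEFAK"),
    ("sp|P00002|EXAMPLE2", "VNVDEVGGEALGR"),
]

with open("pep2prot_relations.tsv", "w") as fh:
    for upper, lower in relations:
        fh.write(f"{upper}\t{lower}\n")

# Reading it back: map each lower-level element to its upper-level one.
lower_to_upper = {}
with open("pep2prot_relations.tsv") as fh:
    for line in fh:
        upper, lower = line.rstrip("\n").split("\t")
        lower_to_upper[lower] = upper
```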
&lt;br /&gt;
Each GIA integration generates an output data file containing the quantitative data of the upper level. This file can be used as input for the next GIA integration. The relations table for the next integration, however, needs to be prepared in advance.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Peptide dataFile.PNG|thumb|320px|Example of a data file for peptides. The first column contains the peptide identifiers (in this case, amino acid sequences using SEQUEST-style characters for PTM modifications), the second column the log2(A/B) of the intensities of two iTRAQ reporter ions, and the third column the weight of each quantification (in this case max(A,B), i.e. the maximum of the two reporter ion intensities being compared).]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;[[File:Pep2prot relationsTable.PNG|thumb|390px|Example of a peptide-to-protein relations table. The first column contains the higher-level identifiers (in this case, the FASTA header of each protein) and the second column the lower-level identifiers (in this case, the same sequences used in the data file).]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Preparing the data ==&lt;br /&gt;
&lt;br /&gt;
The output of the software used to obtain the quantitative values must be converted to generate:&lt;br /&gt;
&lt;br /&gt;
# an uncalibrated data file containing three columns: identifier, quantitative value (base-2 logarithm of the ratio of the two quantitative measurements) and quality parameter. The quality parameter is a number used to rank the quantitative values according to their quality (in terms of variance, higher quality meaning lower variance). In the case of isobaric labeling the third column contains the intensity of the most intense of the two reporter ions used to calculate the ratio; other kinds of quantification may require other parameters to account for the quality of the quantitative values&amp;lt;ref name=navarro2014/&amp;gt;.&lt;br /&gt;
# the relations tables of all the GIA integrations planned in the workflow. In the fundamental workflow these tables describe which scans will be integrated into each peptide and which peptides will be integrated into each protein.&lt;br /&gt;
These tables are easy to construct, due to their simple plain-text format. If the software used for quantification generates plain-text files, one of the SanXoT modules (Aljamia) may be used for this task. In the unit tests for SanXoT we include a workflow that uses Aljamia to generate all the required files from the output of Proteome Discoverer 2.1. Quantitative values produced in other formats may require simple custom scripts to generate the files required by SanXoT.&lt;br /&gt;
&lt;br /&gt;
== Calibrating the data ==&lt;br /&gt;
Calibration is the process by which the quality parameter (the parameter used to rank quantitative values according to their variance) is transformed into the prior weight, a parameter with full statistical meaning, defined as the inverse of the local variance of the quantitative value in log2-ratio scale. This process is done using the '''[[Klibrate]]''' module and needs to be performed only once in each workflow. For more information about calibration, see exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
Using prior weights is essential in GIA integrations, since these parameters not only reflect the local variances of the measurements but are also used as statistical weights to calculate the weighted averages. In addition, GIA integrations follow error propagation theory, so the output data file of each integration contains true prior weights and does not need to be recalibrated.&lt;br /&gt;
&lt;br /&gt;
== Performing integrations and removing outliers ==&lt;br /&gt;
Each GIA integration takes as input the data file from the lower level and the relations table, calculates the general variance of the integration (using a robust iterative method) and generates as output a data file containing the integrated quantifications of the upper level. The integration is done using the SanXoT module.&lt;br /&gt;
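The core of such an integration is an inverse-variance weighted average: since the prior weight is the inverse of the local variance, the weighted mean of the lower-level values has, by error propagation theory, an upper-level weight equal to the sum of the lower-level weights. The Python sketch below illustrates only this core step; it is not the SanXoT module itself, which additionally estimates the general variance by a robust iterative method and handles outliers.&lt;br /&gt;

```python
def integrate(lower_data, relations):
    """Inverse-variance weighted integration of lower-level values.

    lower_data: {lower_id: (x, w)} with x the log2-ratio and w the weight
    relations:  iterable of (upper_id, lower_id) pairs
    Returns {upper_id: (weighted mean, summed weight)}.

    Simplified sketch: the real SanXoT module also estimates a general
    variance and removes outliers, both omitted here.
    """
    grouped = {}
    for upper, lower in relations:
        grouped.setdefault(upper, []).append(lower_data[lower])
    result = {}
    for upper, values in grouped.items():
        sw = sum(w for _, w in values)
        mean = sum(w * x for x, w in values) / sw
        # By error propagation, the weight of the mean is the summed weight.
        result[upper] = (mean, sw)
    return result
```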
&lt;br /&gt;
In addition to the quantitative data, each GIA integration may generate several additional files which contain information about the integration. More detailed information may be found in exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
One of the advantages of the WSPP model is that it provides a straightforward method to eliminate outliers&amp;lt;ref name=navarro2014/&amp;gt; in each integration step. This is done by calculating standardized log2-ratios (z-values), calculating the probability that each element of the lower level is a significant outlier of the z distribution (i.e., N(0,1)), and calculating the associated FDR. The most extreme outliers may be removed sequentially and the integration repeated, until all outliers below a user-defined FDR level are removed.&lt;br /&gt;
&lt;br /&gt;
Outlier removal is performed by the SanXoTSieve module, which tags the outlier elements in the relations table.&lt;br /&gt;
The entire process is done in three steps:&lt;br /&gt;
&lt;br /&gt;
# a first integration is done with SanXoT, which serves to calculate the variance&lt;br /&gt;
# outliers are iteratively tagged in the relations table at a user-defined FDR-level using SanXoTSieve&lt;br /&gt;
# a second integration is done with SanXoT using the fixed variance calculated in the first step and removing from the integration the outliers tagged in the relations table.&lt;br /&gt;
&lt;br /&gt;
== Interpreting the results of GIA integrations ==&lt;br /&gt;
GIA integrations basically generate the quantitative data and their associated variances at the upper level, and detect and eliminate outliers that could affect the upper-level values. However, each integration also generates additional information that may be very useful for interpreting the results. The most important of these is the general variance of the integration.&lt;br /&gt;
&lt;br /&gt;
The general variance is an estimate of the error introduced when the elements of the lower level are experimentally generated from those of the upper level. This parameter may have different meanings depending on the nature of the data.&lt;br /&gt;
&lt;br /&gt;
When spectra are integrated into peptides, the general variance estimates the error associated with the measurement (independently of peptide and protein manipulation), and hence depends on the LC-MS setup. When peptides are integrated into proteins, the variance estimates the error produced during peptide generation (protein digestion, peptide desalting), independently of the LC-MS analysis. When proteins are integrated into the grand mean, the variance mainly estimates the error produced during protein manipulation (extraction,…). Similarly, integrating technical and biological replicates allows the different error sources at the protein level to be estimated separately. Integration of proteins into categories gives useful insight into protein coordination&amp;lt;ref name=garciamarques2016/&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Building the workflows ==&lt;br /&gt;
Typically, workflows contain three main parts:&lt;br /&gt;
&lt;br /&gt;
# Data parsing to generate the uncalibrated data files and all relations tables needed. This may be performed using Aljamia or other scripts provided by the user.&lt;br /&gt;
# Data calibration to generate the (calibrated) data files. This is performed using Klibrate.&lt;br /&gt;
# A list of concatenated (or branched) GIA integrations. As explained above, each GIA is performed by running sequentially SanXoT (to calculate the general variance), SanXoTSieve (to tag outliers) and SanXoT again (to integrate without outliers).&lt;br /&gt;
&lt;br /&gt;
Note that some specific GIA integrations require the use of other modules. These include experiment merging and Systems biology analysis. See advanced applications and exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
== Advanced applications ==&lt;br /&gt;
[[File:Fig1 replicateIntegration.PNG|thumb|600px|Integration of technical or biological replicates using the GIA.]]&lt;br /&gt;
SanXoT also allows the GIA to be used to integrate technical or biological replicates. In this case, scans are integrated into peptides and then into proteins separately for each replicate, and the protein-level data are then integrated to obtain protein averages (grouped data). A further GIA integration allows the statistical significance of the averaged protein changes to be calculated.&lt;br /&gt;
&lt;br /&gt;
Merging of experiments may be performed with the aid of Cardenio (or using scripts prepared by the user). Cardenio is used to generate suitable relations tables to compute protein averages from technical or biological replicates, and to normalize (&amp;quot;center&amp;quot;) the data before integration, taking into account the systematic quantitative error of each experiment.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 SystemsBiologyTriangle.PNG|thumb|250px|The Systems Biology Triangle.]]&lt;br /&gt;
Systems biology analysis can be performed using the SBT model&amp;lt;ref name=garciamarques2016/&amp;gt; by building up a triangular workflow. Proteins are integrated into categories and categories are integrated to provide statistical significance.&lt;br /&gt;
&lt;br /&gt;
The protein-to-category relations files may be generated using Camacho (or using scripts provided by the user). Protein categories can be managed using SanXoTSqueezer, Sanson and SanXoTGauss. For more detailed information see exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=597</id>
		<title>How to use SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=597"/>
				<updated>2018-08-10T09:46:23Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The '''[[SanXoT Software Package]]''' (SSP) allows the statistical analysis of quantitative proteomics data (obtained using other programs such as [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer]), using the WSPP model&amp;lt;ref name=navarro2014&amp;gt;Navarro, P., et al., ''General statistical framework for quantitative proteomics by stable isotope labeling''. J Proteome Res, 2014. 13(3): p. 1234-47.&amp;lt;/ref&amp;gt;. This model analyses relative (and not absolute) quantitative values, e.g. the ratios between two quantitative measurements.&lt;br /&gt;
&lt;br /&gt;
The SSP contains a set of modules that can be used to assemble user-configurable quantitative workflows with very high flexibility.&lt;br /&gt;
&lt;br /&gt;
In the '''[[fundamental workflow]]''' the quantitative results at the scan level are integrated to quantify each of the peptides. The quantitative data at the peptide level are then integrated to obtain quantitative protein values. Finally, the quantitative data at the protein level are integrated, providing the statistical significance of the increase or decrease in abundance of each protein. An example of this basic workflow is provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
Among other possibilities, SanXoT allows building up workflows to automatically process very large numbers of quantitative experiments, to integrate results obtained from technical or biological replicates, or to perform Systems Biology analysis using the SBT model&amp;lt;ref name=garciamarques2016&amp;gt;Garcia-Marques, F., et al., ''A Novel Systems-Biology Algorithm for the Analysis of Coordinated Protein Responses Using Quantitative Proteomics''. Mol Cell Proteomics, 2016. 15(5): p. 1740-60.&amp;lt;/ref&amp;gt;. Examples of these applications are provided in the unit tests for SanXoT.&lt;br /&gt;
&lt;br /&gt;
== General design ==&lt;br /&gt;
All SanXoT workflows contain two initial steps:&lt;br /&gt;
&lt;br /&gt;
# a first step where the raw quantitative information is extracted from the output of the quantification program and organized according to the data structure of SanXoT. You can see more about '''[[How to use SanXoT#Preparing the data|preparing the data]]''' for SanXoT analysis.&lt;br /&gt;
# a second step where the raw quantitative information is processed to obtain prior weights. This process is called calibration. For more details, check [[How to use SanXoT#Calibrating the data|the section '''Calibrating the data''']].&lt;br /&gt;
&lt;br /&gt;
SanXoT then performs WSPP statistics by iteratively applying the Generic Integration Algorithm (GIA)&amp;lt;ref name=garciamarques2016/&amp;gt; to the calibrated data. Each GIA integration performs an independent statistical analysis, and its output may be used as input for the following integration step.&lt;br /&gt;
&lt;br /&gt;
In the fundamental workflow the GIA is applied to integrate scan-level data into peptide-level data, to integrate peptide-level data into protein-level data, and finally to integrate the protein-level data into the grand mean.&lt;br /&gt;
&lt;br /&gt;
This modular structure allows users to design their own quantitative workflows by iteratively applying GIA integrations. You can check some examples of advanced applications developed using this modular structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Fig1 calibration.PNG|thumb|Calibration of spectra using '''[[Klibrate]]'''.]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&lt;br /&gt;
[[File:Fig1 lowerLevel-to-higherLevel.PNG|thumb|210px|Integration from any lower level to any higher level, using the GIA.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td colspan=2&amp;gt;[[File:Fig1 fundamentalWorkflow.PNG|thumb|550px|The Fundamental Workflow, which uses the GIA from spectra to proteins.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Data structure ==&lt;br /&gt;
Each GIA integration needs as input two plain-text files&amp;lt;ref name=garciamarques2016/&amp;gt;: &lt;br /&gt;
# '''a data file''' containing the quantitative data of the lower level. This file contains three columns: identifier (a text string that is used to identify the element), quantitative value (base-2 logarithm of the ratio of the two quantitative measurements) and prior weight (a parameter that measures the accuracy of the quantitative value before performing the integration).&lt;br /&gt;
# '''a relations table''', which establishes the correspondence between the identifiers in the lower level and those in the upper level. This file contains two columns: the first one contains the identifiers of the upper level and the second one the identifiers of the lower level.&lt;br /&gt;
&lt;br /&gt;
Each GIA integration generates an output data file containing the quantitative data of the upper level. This file can be used as input for the next GIA integration. The relations table for the next integration, however, needs to be prepared in advance.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Peptide dataFile.PNG|thumb|320px|Example of a data file for peptides. The first column contains the peptide identifiers (in this case, amino acid sequences using SEQUEST-style characters for PTM modifications), the second column the log2(A/B) of the intensities of two iTRAQ reporter ions, and the third column the weight of each quantification (in this case max(A,B), i.e. the maximum of the two reporter ion intensities being compared).]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;[[File:Pep2prot relationsTable.PNG|thumb|390px|Example of a peptide-to-protein relations table. The first column contains the higher-level identifiers (in this case, the FASTA header of each protein) and the second column the lower-level identifiers (in this case, the same sequences used in the data file).]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Preparing the data ==&lt;br /&gt;
&lt;br /&gt;
The output of the software used to obtain the quantitative values must be converted to generate:&lt;br /&gt;
&lt;br /&gt;
# an uncalibrated data file containing three columns: identifier, quantitative value (base-2 logarithm of the ratio of the two quantitative measurements) and quality parameter. The quality parameter is a number used to rank the quantitative values according to their quality (in terms of variance, higher quality meaning lower variance). In the case of isobaric labeling the third column contains the intensity of the most intense of the two reporter ions used to calculate the ratio; other kinds of quantification may require other parameters to account for the quality of the quantitative values&amp;lt;ref name=navarro2014/&amp;gt;.&lt;br /&gt;
# the relations tables of all the GIA integrations planned in the workflow. In the fundamental workflow these tables describe which scans will be integrated into each peptide and which peptides will be integrated into each protein.&lt;br /&gt;
These tables are easy to construct, due to their simple plain-text format. If the software used for quantification generates plain-text files, one of the SanXoT modules (Aljamia) may be used for this task. In the unit tests for SanXoT we include a workflow that uses Aljamia to generate all the required files from the output of Proteome Discoverer 2.1. Quantitative values produced in other formats may require simple custom scripts to generate the files required by SanXoT.&lt;br /&gt;
&lt;br /&gt;
== Calibrating the data ==&lt;br /&gt;
Calibration is the process by which the quality parameter (the parameter used to rank quantitative values according to their variance) is transformed into the prior weight, a parameter with full statistical meaning, defined as the inverse of the local variance of the quantitative value in log2-ratio scale. This process is done using the Klibrate module and needs to be performed only once in each workflow. For more information about calibration, see exploring SanXoT features.&lt;br /&gt;
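Conceptually, calibration estimates the local variance of the log2-ratios as a function of the quality parameter and takes its inverse as the prior weight. The sketch below is a simplification of our own (a sliding-window variance over the quality-parameter rank); Klibrate fits the actual relation under the WSPP model.&lt;br /&gt;

```python
def sliding_window_prior_weights(data, k=50):
    """Estimate prior weights as the inverse local variance of the
    log2-ratios among the k nearest neighbours in quality-parameter rank.

    data: list of (identifier, log2_ratio, quality_parameter)
    Returns {identifier: prior_weight}. Conceptual sketch only; not the
    model-based fit performed by Klibrate.
    """
    # Rank elements by their quality parameter.
    ordered = sorted(data, key=lambda row: row[2])
    n = len(ordered)
    weights = {}
    for i, (ident, _, _) in enumerate(ordered):
        # Window of k elements centred on i, clamped to the list ends.
        lo = max(0, min(i - k // 2, n - k))
        window = [x for _, x, _ in ordered[lo:lo + k]]
        mean = sum(window) / len(window)
        var = sum((x - mean) ** 2 for x in window) / (len(window) - 1)
        weights[ident] = 1.0 / var   # prior weight = inverse local variance
    return weights
```

Elements with a higher quality parameter fall in windows of lower ratio spread, so they receive larger prior weights, matching the convention that higher quality means lower variance.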
&lt;br /&gt;
Using prior weights is essential in GIA integrations, since these parameters not only reflect the local variances of the measurements but are also used as statistical weights to calculate the weighted averages. In addition, GIA integrations follow error propagation theory, so the output data file of each integration contains true prior weights and does not need to be recalibrated.&lt;br /&gt;
&lt;br /&gt;
== Performing integrations and removing outliers ==&lt;br /&gt;
Each GIA integration takes as input the data file from the lower level and the relations table, calculates the general variance of the integration (using a robust iterative method) and generates as output a data file containing the integrated quantifications of the upper level. The integration is done using the SanXoT module.&lt;br /&gt;
&lt;br /&gt;
In addition to the quantitative data, each GIA integration may generate several additional files which contain information about the integration. More detailed information may be found in exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
One of the advantages of the WSPP model is that it provides a straightforward method to eliminate outliers&amp;lt;ref name=navarro2014/&amp;gt; in each integration step. This is done by calculating standardized log2-ratios (z-values), calculating the probability that each element of the lower level is a significant outlier of the z distribution (i.e., N(0,1)), and calculating the associated FDR. The most extreme outliers may be removed sequentially and the integration repeated, until all outliers below a user-defined FDR level are removed.&lt;br /&gt;
&lt;br /&gt;
Outlier removal is performed by the SanXoTSieve module, which tags the outlier elements in the relations table.&lt;br /&gt;
The entire process is done in three steps:&lt;br /&gt;
&lt;br /&gt;
# a first integration is done with SanXoT, which serves to calculate the variance&lt;br /&gt;
# outliers are iteratively tagged in the relations table at a user-defined FDR-level using SanXoTSieve&lt;br /&gt;
# a second integration is done with SanXoT using the fixed variance calculated in the first step and removing from the integration the outliers tagged in the relations table.&lt;br /&gt;
&lt;br /&gt;
== Interpreting the results of GIA integrations ==&lt;br /&gt;
GIA integrations basically generate the quantitative data and their associated variances at the upper level, and detect and eliminate outliers that could affect the upper-level values. However, each integration also generates additional information that may be very useful for interpreting the results. The most important of these is the general variance of the integration.&lt;br /&gt;
&lt;br /&gt;
The general variance is an estimate of the error introduced when the elements of the lower level are experimentally generated from those of the upper level. This parameter may have different meaning depending on the nature of the data.&lt;br /&gt;
&lt;br /&gt;
When spectra are integrated into peptides, the general variance estimates the error associated with the measurement (independently of peptide and protein manipulation), and hence depends on the LC-MS setup. When peptides are integrated into proteins, the variance estimates the error produced at the time of peptide generation (protein digestion, peptide desalting), independently of the LC-MS analysis. When proteins are integrated into the grand mean, the variance mainly estimates the error produced during protein manipulation (extraction,…). Similarly, integrating technical and biological replicates allows the different error sources at the protein level to be estimated separately. Integration of proteins into categories gives a useful insight into protein coordination&amp;lt;ref name=garciamarques2016/&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Building the workflows ==&lt;br /&gt;
Typically, workflows contain three main parts:&lt;br /&gt;
&lt;br /&gt;
# Data parsing to generate the uncalibrated data files and all relations tables needed. This may be performed using Aljamia or other scripts provided by the user.&lt;br /&gt;
# Data calibration to generate the (calibrated) data files. This is performed using Klibrate.&lt;br /&gt;
# A list of concatenated (or branched) GIA integrations. As explained above, each GIA is performed by sequentially running SanXoT (to calculate the general variance), SanXoTSieve (to tag outliers) and SanXoT again (to integrate without outliers).&lt;br /&gt;
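The concatenation in part 3 can be sketched with a toy weighted-mean integration chained across two levels. All names and values here are invented, and the general-variance estimation and outlier step of a real GIA triple are omitted:

```python
# Schematic sketch of concatenated integrations (scan -> peptide -> protein)
# using a weighted mean with summed weights (simple error propagation).
# The general-variance estimation and the SanXoTSieve outlier step of a real
# GIA triple are omitted; all identifiers and values are invented.
from collections import defaultdict

def toy_gia(lower, relations):
    """lower: {id: (log2_ratio, weight)}; relations: [(upper_id, lower_id)].
    Returns {upper_id: (weighted mean, summed weight)}."""
    groups = defaultdict(list)
    for upper_id, lower_id in relations:
        groups[upper_id].append(lower[lower_id])
    return {u: (sum(x * w for x, w in m) / sum(w for _, w in m),
                sum(w for _, w in m))
            for u, m in groups.items()}

scans = {"s1": (0.4, 2.0), "s2": (0.6, 2.0), "s3": (1.0, 1.0)}
scan2pep = [("PEPA", "s1"), ("PEPA", "s2"), ("PEPB", "s3")]
pep2prot = [("PROT1", "PEPA"), ("PROT1", "PEPB")]

peptides = toy_gia(scans, scan2pep)      # first integration: scan -> peptide
proteins = toy_gia(peptides, pep2prot)   # second integration: peptide -> protein
```

Note how the output dictionary of one integration is directly usable as the input of the next, mirroring how each GIA output data file feeds the following step.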
&lt;br /&gt;
Note that some specific GIA integrations require the use of other modules; these include experiment merging and systems biology analysis. See advanced applications and exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
== Advanced applications ==&lt;br /&gt;
[[File:Fig1 replicateIntegration.PNG|thumb|600px|Integration of technical or biological replicates using the GIA.]]&lt;br /&gt;
SanXoT also allows using the GIA to integrate technical or biological replicates. In this case, scans are integrated into peptides and then into proteins separately for each replicate, and the protein-level data are then integrated to obtain protein averages (grouped data). A further GIA integration allows the statistical significance of the averaged protein changes to be calculated.&lt;br /&gt;
&lt;br /&gt;
Merging of experiments may be performed with the aid of Cardenio (or using scripts prepared by the user). Cardenio is used to generate suitable relations tables to compute protein averages from technical or biological replicates, and to normalize (&amp;quot;center&amp;quot;) the data before integration to take into account the systematic quantitative error of each experiment.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 SystemsBiologyTriangle.PNG|thumb|250px|The Systems Biology Triangle.]]&lt;br /&gt;
Systems biology analysis can be performed using the SBT model&amp;lt;ref name=garciamarques2016/&amp;gt; by building up a triangular workflow. Proteins are integrated into categories and categories are integrated to provide statistical significance.&lt;br /&gt;
&lt;br /&gt;
The protein-to-category relations files may be generated using Camacho (or scripts provided by the user). Protein categories can be managed using SanXoTSqueezer, Sanson and SanXoTGauss. For more detailed information see exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	<entry>
		<id>http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=596</id>
		<title>How to use SanXoT</title>
		<link rel="alternate" type="text/html" href="http://wikis.cnic.es/proteomica/index.php?title=How_to_use_SanXoT&amp;diff=596"/>
				<updated>2018-08-10T09:45:41Z</updated>
		
		<summary type="html">&lt;p&gt;Mtrevisan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The '''[[SanXoT Software Package]]''' (SSP) allows the statistical analysis of quantitative proteomics data (obtained using other programs such as [http://www.thermofisher.com/order/catalog/product/OPTON-30795 Proteome Discoverer]), using the WSPP model&amp;lt;ref name=navarro2014&amp;gt;Navarro, P., et al., ''General statistical framework for quantitative proteomics by stable isotope labeling''. J Proteome Res, 2014. 13(3): p. 1234-47.&amp;lt;/ref&amp;gt;. This model analyses relative (and not absolute) quantitative values, e.g. the ratios between two quantitative measurements.&lt;br /&gt;
&lt;br /&gt;
The SSP contains a set of modules that can be used to assemble user-configurable quantitative workflows with very high flexibility.&lt;br /&gt;
&lt;br /&gt;
In the '''[[fundamental workflow]]''' the quantitative results at the scan level are integrated to quantify each one of the peptides. The quantitative data at the peptide level are then integrated to obtain quantitative protein values. Finally, the quantitative data at the protein level are integrated, providing the statistical significance with which the abundance of each protein is increased or decreased. An example of this basic workflow is provided in the '''[[unit tests for SanXoT]]'''.&lt;br /&gt;
&lt;br /&gt;
Among other possibilities, SanXoT allows building up workflows to automatically process very large numbers of quantitative experiments, to integrate results obtained from technical or biological replicates, or to perform Systems Biology analysis using the SBT model&amp;lt;ref name=garciamarques2016&amp;gt;Garcia-Marques, F., et al., ''A Novel Systems-Biology Algorithm for the Analysis of Coordinated Protein Responses Using Quantitative Proteomics''. Mol Cell Proteomics, 2016. 15(5): p. 1740-60.&amp;lt;/ref&amp;gt;. Examples of these applications are provided in the unit tests for SanXoT.&lt;br /&gt;
&lt;br /&gt;
== General design ==&lt;br /&gt;
All SanXoT workflows contain two initial steps:&lt;br /&gt;
&lt;br /&gt;
# a first step where the raw quantitative information is extracted from the output of the quantification program and organized according to the data structure of SanXoT. You can see more about '''[[How to use SanXoT#preparing the data|preparing the data]]''' for SanXoT analysis.&lt;br /&gt;
# a second step where the raw quantitative information is processed to obtain prior weights. This process is called calibration. For more details, check [[How to use SanXoT#Calibrating the data|the section ''Calibrating the data'']].&lt;br /&gt;
&lt;br /&gt;
SanXoT then performs WSPP statistics by iteratively applying the Generic Integration Algorithm (GIA)&amp;lt;ref name=garciamarques2016/&amp;gt; to the calibrated data. Each GIA integration performs an independent statistical analysis and its output may be used as input for the following integration step.&lt;br /&gt;
&lt;br /&gt;
In the fundamental workflow, the GIA is applied to integrate scan-level data into peptide-level data, to integrate peptide-level data into the protein level, and finally to integrate the protein-level data.&lt;br /&gt;
&lt;br /&gt;
This modular structure allows users to design their own quantitative workflows by applying GIA integrations iteratively. You can check some examples of advanced applications developed using this modular structure.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Fig1 calibration.PNG|thumb|Calibration of spectra using '''[[Klibrate]]'''.]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&lt;br /&gt;
[[File:Fig1 lowerLevel-to-higherLevel.PNG|thumb|210px|Integration from any lower level to any higher level, using the GIA.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&lt;br /&gt;
&amp;lt;tr&amp;gt;&amp;lt;td colspan=2&amp;gt;[[File:Fig1 fundamentalWorkflow.PNG|thumb|550px|The Fundamental Workflow, which uses the GIA from spectra to proteins.]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Data structure ==&lt;br /&gt;
Each GIA integration needs as input two plain-text files&amp;lt;ref name=garciamarques2016/&amp;gt;: &lt;br /&gt;
# '''a data file''' containing the quantitative data of the lower level. This file contains three columns: identifier (a text string that is used to identify the element), quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and prior weight (a parameter that measures the accuracy of the quantitative value before performing the integration).&lt;br /&gt;
# '''a relations table''', which establishes the correspondence of the identifiers in the lower level with those in the upper level. This file contains two columns: the first one contains the identifiers of the upper level and the second one those of the lower level.&lt;br /&gt;
&lt;br /&gt;
Each GIA integration generates an output data file containing the quantitative data of the upper level. This file can be used as input for the next GIA integration. The relations table for the next integration, however, needs to be prepared in advance.&lt;br /&gt;
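As a sketch of the two file formats described above (the file names, identifiers and values are invented for illustration and are not produced by any SanXoT module), the following writes a minimal tab-separated data file and relations table:

```python
# Illustrative sketch only: writes a minimal peptide data file (identifier,
# log2-ratio, prior weight) and a peptide-to-protein relations table
# (upper-level identifier, lower-level identifier) as tab-separated text.
# File names and example values are invented for illustration.
import csv

data_rows = [                       # identifier, log2-ratio, prior weight
    ("PEPTIDEA", 0.52, 310000.0),
    ("PEPTIDEB", -0.11, 88000.0),
]
relation_rows = [                   # upper-level id, lower-level id
    ("sp|P12345|PROT1", "PEPTIDEA"),
    ("sp|P12345|PROT1", "PEPTIDEB"),
]

with open("peptides_data.tsv", "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(data_rows)
with open("pep2prot_relations.tsv", "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(relation_rows)
```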
&lt;br /&gt;
&amp;lt;table&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td&amp;gt;[[File:Peptide dataFile.PNG|thumb|320px|Example of a data file for peptides. The first column contains the peptide identifiers (in this case, amino acid sequences using SEQUEST-style characters for PTM modifications), the second column the log2(A/B) of the intensities of the two iTRAQ reporter ions, and the third column the weight of each quantification (in this case max(A,B), i.e. the intensity of the most intense of the two reporter ions being compared).]]&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;[[File:Pep2prot relationsTable.PNG|thumb|390px|Example of a peptide-to-protein relations table. The first column contains the higher-level identifiers (in this case, the FASTA header of each protein) and the second column the lower-level identifiers (in this case, the same sequences used in the data file).]]&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Preparing the data ==&lt;br /&gt;
&lt;br /&gt;
The output of the software used to obtain the quantitative values must be converted to generate:&lt;br /&gt;
&lt;br /&gt;
# an uncalibrated data file containing three columns: identifier, quantitative value (2-base logarithm of the ratio of the two quantitative measurements) and quality parameter. The quality parameter is a number that is used to rank the quantitative values according to their quality (in terms of variance, higher quality meaning lower variance). In the case of isobaric labeling the third column contains the intensity of the most intense of the two reporter ions used to calculate the ratio; other kinds of quantification may require other parameters to account for the quality of the quantitative values&amp;lt;ref name=navarro2014/&amp;gt;.&lt;br /&gt;
# the relations tables of all the GIA integrations planned in the workflow. In the fundamental workflow these tables describe which scans will be integrated into each peptide and which peptides will be integrated into each protein.&lt;br /&gt;
These tables are easy to construct, due to their simple plain-text format. If the software used for quantification generates plain-text files, one of the SanXoT modules (Aljamia) may be used for this task. In the unit tests for SanXoT we include a workflow that uses Aljamia to generate all the files required from the output of Proteome Discoverer 2.1. Quantitative values produced by software packages in other formats may require other simple scripts to generate the files required by SanXoT.&lt;br /&gt;
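A parser of this kind can be sketched as below. This is a hypothetical stand-in for Aljamia's role: the report column names ("scan_id", "intensity_A", "intensity_B") are invented for the example, since a real quantification report uses its own headers.

```python
# Hypothetical parser sketch: converts a tab-separated quantification report
# (column names invented for this example) into the three-column uncalibrated
# data file: identifier, log2-ratio, quality parameter.
import csv
import math

def make_uncalibrated(report_path, out_path):
    with open(report_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t")
        for row in reader:
            a, b = float(row["intensity_A"]), float(row["intensity_B"])
            if a <= 0 or b <= 0:
                continue                    # skip missing quantifications
            x = math.log2(a / b)            # quantitative value (log2-ratio)
            q = max(a, b)                   # quality parameter (isobaric labeling)
            writer.writerow([row["scan_id"], x, q])
```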
&lt;br /&gt;
== Calibrating the data ==&lt;br /&gt;
Calibration is the process whereby the quality parameter (the parameter used to rank quantitative values according to their variance) is transformed into the prior weight, a parameter with full statistical meaning defined as the inverse of the local variance of the quantitative value in log2-ratio scale. This process is done using the Klibrate module and needs to be performed only once in each workflow. For more information about calibration, see exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
Using prior weights is essential in GIA integrations, since these parameters not only reflect the local variances of the measurements, but are also used as statistical weights to calculate the weighted averages. Besides, GIA integrations follow error propagation theory, so that the output data file of each integration contains true prior weights and does not need to be recalibrated.&lt;br /&gt;
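The calibration idea can be illustrated with a conceptual sketch (this is not Klibrate's actual algorithm, and the function name and bin count are invented): elements are ranked by the quality parameter, grouped into bins, and each element's prior weight is taken as the inverse of the local variance of the log2-ratios in its bin.

```python
# Conceptual sketch of calibration, not Klibrate's actual algorithm:
# prior weight = 1 / (local variance of log2-ratios among elements of
# similar quality). Assumes every bin has some spread (nonzero variance).
from statistics import pvariance

def calibrate(data, n_bins=3):
    """data: list of (identifier, log2_ratio, quality).
    Returns a list of (identifier, log2_ratio, prior_weight)."""
    ranked = sorted(data, key=lambda r: r[2])       # lowest quality first
    size = -(-len(ranked) // n_bins)                # ceiling division
    out = []
    for i in range(0, len(ranked), size):
        bin_rows = ranked[i:i + size]
        local_var = pvariance([x for _, x, _ in bin_rows])
        out.extend((ident, x, 1.0 / local_var) for ident, x, _ in bin_rows)
    return out
```

Under this sketch, high-quality elements sit in bins with a small spread of log2-ratios and therefore receive large prior weights, matching the definition above.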
&lt;br /&gt;
== Performing integrations and removing outliers ==&lt;br /&gt;
Each GIA integration takes as input the data file from the lower level and the relations table, calculates the general variance of the integration (using a robust iterative method), and generates as output a data file containing the integrated quantifications of the upper level. The integration is done using the SanXoT module.&lt;br /&gt;
&lt;br /&gt;
In addition to the quantitative data, each GIA integration may generate several additional files which contain information about the integration. More detailed information may be found in exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
One of the advantages of the WSPP model is that it provides a straightforward method to eliminate outliers&amp;lt;ref name=navarro2014/&amp;gt; in each integration step. This is done by computing standardized log2-ratios (z-values), calculating the probability that each element of the lower level is a significant outlier of the z distribution (i.e., N(0,1)), and calculating the associated FDR. The most extreme outliers may be removed sequentially and the integration repeated until all outliers below a user-defined FDR level are removed.&lt;br /&gt;
&lt;br /&gt;
Outlier removal is performed by the SanXoTSieve module by tagging the outlier elements in the relations table. &lt;br /&gt;
The entire process is done in three steps:&lt;br /&gt;
&lt;br /&gt;
# a first integration is done with SanXoT, which serves to calculate the variance&lt;br /&gt;
# outliers are iteratively tagged in the relations table at a user-defined FDR-level using SanXoTSieve&lt;br /&gt;
# a second integration is done with SanXoT using the fixed variance calculated in the first step and removing from the integration the outliers tagged in the relations table.&lt;br /&gt;
&lt;br /&gt;
== Interpreting the results of GIA integrations ==&lt;br /&gt;
GIA integrations basically generate the quantitative data and their associated variances at the upper level, and detect and eliminate outliers that could affect the upper-level values. However, each integration also generates additional information that may be very useful for interpreting the results. The most important is the general variance of the integration.&lt;br /&gt;
&lt;br /&gt;
The general variance is an estimate of the error introduced when the elements of the lower level are experimentally generated from those of the upper level. This parameter may have different meaning depending on the nature of the data.&lt;br /&gt;
&lt;br /&gt;
When spectra are integrated into peptides, the general variance estimates the error associated with the measurement (independently of peptide and protein manipulation), and hence depends on the LC-MS setup. When peptides are integrated into proteins, the variance estimates the error produced at the time of peptide generation (protein digestion, peptide desalting), independently of the LC-MS analysis. When proteins are integrated into the grand mean, the variance mainly estimates the error produced during protein manipulation (extraction,…). Similarly, integrating technical and biological replicates allows the different error sources at the protein level to be estimated separately. Integration of proteins into categories gives a useful insight into protein coordination&amp;lt;ref name=garciamarques2016/&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Building the workflows ==&lt;br /&gt;
Typically, workflows contain three main parts:&lt;br /&gt;
&lt;br /&gt;
# Data parsing to generate the uncalibrated data files and all relations tables needed. This may be performed using Aljamia or other scripts provided by the user.&lt;br /&gt;
# Data calibration to generate the (calibrated) data files. This is performed using Klibrate.&lt;br /&gt;
# A list of concatenated (or branched) GIA integrations. As explained above, each GIA is performed by sequentially running SanXoT (to calculate the general variance), SanXoTSieve (to tag outliers) and SanXoT again (to integrate without outliers).&lt;br /&gt;
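The concatenation in part 3 can be sketched with a toy weighted-mean integration chained across two levels. All names and values here are invented, and the general-variance estimation and outlier step of a real GIA triple are omitted:

```python
# Schematic sketch of concatenated integrations (scan -> peptide -> protein)
# using a weighted mean with summed weights (simple error propagation).
# The general-variance estimation and the SanXoTSieve outlier step of a real
# GIA triple are omitted; all identifiers and values are invented.
from collections import defaultdict

def toy_gia(lower, relations):
    """lower: {id: (log2_ratio, weight)}; relations: [(upper_id, lower_id)].
    Returns {upper_id: (weighted mean, summed weight)}."""
    groups = defaultdict(list)
    for upper_id, lower_id in relations:
        groups[upper_id].append(lower[lower_id])
    return {u: (sum(x * w for x, w in m) / sum(w for _, w in m),
                sum(w for _, w in m))
            for u, m in groups.items()}

scans = {"s1": (0.4, 2.0), "s2": (0.6, 2.0), "s3": (1.0, 1.0)}
scan2pep = [("PEPA", "s1"), ("PEPA", "s2"), ("PEPB", "s3")]
pep2prot = [("PROT1", "PEPA"), ("PROT1", "PEPB")]

peptides = toy_gia(scans, scan2pep)      # first integration: scan -> peptide
proteins = toy_gia(peptides, pep2prot)   # second integration: peptide -> protein
```

Note how the output dictionary of one integration is directly usable as the input of the next, mirroring how each GIA output data file feeds the following step.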
&lt;br /&gt;
Note that some specific GIA integrations require the use of other modules; these include experiment merging and systems biology analysis. See advanced applications and exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
== Advanced applications ==&lt;br /&gt;
[[File:Fig1 replicateIntegration.PNG|thumb|600px|Integration of technical or biological replicates using the GIA.]]&lt;br /&gt;
SanXoT also allows using the GIA to integrate technical or biological replicates. In this case, scans are integrated into peptides and then into proteins separately for each replicate, and the protein-level data are then integrated to obtain protein averages (grouped data). A further GIA integration allows the statistical significance of the averaged protein changes to be calculated.&lt;br /&gt;
&lt;br /&gt;
Merging of experiments may be performed with the aid of Cardenio (or using scripts prepared by the user). Cardenio is used to generate suitable relations tables to compute protein averages from technical or biological replicates, and to normalize (&amp;quot;center&amp;quot;) the data before integration to take into account the systematic quantitative error of each experiment.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig1 SystemsBiologyTriangle.PNG|thumb|250px|The Systems Biology Triangle.]]&lt;br /&gt;
Systems biology analysis can be performed using the SBT model&amp;lt;ref name=garciamarques2016/&amp;gt; by building up a triangular workflow. Proteins are integrated into categories and categories are integrated to provide statistical significance.&lt;br /&gt;
&lt;br /&gt;
The protein-to-category relations files may be generated using Camacho (or scripts provided by the user). Protein categories can be managed using SanXoTSqueezer, Sanson and SanXoTGauss. For more detailed information see exploring SanXoT features.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mtrevisan</name></author>	</entry>

	</feed>