Research Areas
Bioinformatics, biostatistics, data mining, and machine learning. We currently focus on developing and applying computational and statistical algorithms and software tools for mass spectrometry-based proteomics, e.g., protein identification; identification, discovery, and localization of post-translational modifications; protein quantification; proteotypic peptide prediction; false discovery rate control; and multiple hypothesis testing.
Our research is interdisciplinary, spanning statistics, computation, and biology, with a current focus on computational and statistical proteomics. Our representative research results are summarized below.
1. Algorithms and software for protein and post-translational modification identification and quantification
Searching mass spectrometry data against protein databases to identify protein sequences and post-translational modifications is central to proteomics research. In 2004, we proposed a new scoring function named "kernelized Spectral Vector Dot Product (KSDP)" and developed pFind 1.0, the first protein identification search engine in China (Bioinformatics, 2004, 20:1948-1954). Since then, pFind has been developed continuously and has evolved into the well-known pFind protein identification system and the pFind research group (http://pfind.ict.ac.cn).
The huge number of unexpected post-translational modifications on proteins is considered the "dark matter" of proteomic data. We have developed a variety of modification discovery algorithms. We proposed pMatch, an open mass spectral library search algorithm that discovers unexpected modifications by comparing the similarity between modified and unmodified spectra. The pMatch paper was accepted and presented at ISMB (2010), one of the top conferences in bioinformatics, and was published in Bioinformatics (2010). pMatch is now frequently cited in the field of spectral library search and modification discovery. Building on pMatch, we recently developed pMatchGlyco, a glycosylation identification algorithm (BioMed Research International, 2018).
We developed DeltAMT, an algorithm that clusters mass spectra using peptide mass and retention time information to discover high-abundance modification types (Molecular & Cellular Proteomics, 2011). In a study of core fucosylated glycoprotein identification conducted in collaboration with the State Key Laboratory of Proteomics of China, DeltAMT and other data analysis methods were used to identify the largest set of core fucosylated sites reported at that time (Molecular & Cellular Proteomics, 2009).
We developed PTMiner, a high-accuracy probabilistic algorithm for modification localization and quality control in open (mass-tolerant) database search (Molecular & Cellular Proteomics, 2019). Through an iterative process, the algorithm automatically learns the prior probability, the mass-matching error distribution, and the matched-peak intensity distribution from the mass spectral data, and uses the continuously updated prior and the two distributions to estimate the posterior probability of each candidate modification site more accurately. We used PTMiner to analyze the modifications present in the massive data set of the draft human proteome and localized more than one million modifications at 1% FDR, systematically characterizing known and unknown modifications in the human proteome. The paper was at one point the second "most read" paper after it was published online. Based on the PTMiner algorithm, we developed SAVControl, a quality control method for protein amino acid mutations (which can be treated as a special type of modification), published in Journal of Proteomics (2018).
In protein quantification, mass spectrometry measurements are subject to considerable randomness: 1) some peptides are detected while others are not, and 2) peptides at the same concentration may differ greatly in mass spectrometry signal intensity. This randomness seriously reduces the accuracy of protein quantification. To address these problems, we proposed the concept of the quantitative mass-spectrometry efficiency of peptides and developed LFAQ, a new absolute protein quantification algorithm based on predicted peptide quantitative efficiencies (Analytical Chemistry, 2019a). We then proposed incorporating peptide digestibility into the peptide detectability prediction model and developed AP3, a peptide detectability prediction algorithm based on the random forest machine learning method (Analytical Chemistry, 2019b).
2. Proteomics data FDR control methods and applications
While big data offer great opportunities to discover new knowledge, they also carry serious risks and pitfalls of false discoveries. False discovery rate (FDR) analysis in high-dimensional statistical inference is considered one of the most important advances in statistics. In multiple hypothesis testing, the FDR is defined as the expected proportion of falsely rejected hypotheses among all rejected hypotheses. The original paper proposing the FDR (Benjamini and Hochberg, J. R. Stat. Soc. B, 1995) has been cited more than 57,000 times, which reflects its importance and influence. Leading researchers on FDR include the statisticians Bradley Efron, John Storey, and Emmanuel Candes.
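To make the FDR definition concrete, the following is a minimal sketch of the Benjamini-Hochberg step-up procedure from the 1995 paper cited above. The function name and the example p-values are illustrative choices, not taken from our software.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected by the BH step-up procedure.

    With m tests and sorted p-values p_(1) <= ... <= p_(m), find the largest
    rank k such that p_(k) <= alpha * k / m and reject the k smallest.
    """
    m = len(p_values)
    # Indices of the hypotheses ordered by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            k = rank  # keep the largest rank that passes (step-up)
    return sorted(order[:k])


# Illustrative p-values (made up for this example).
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p, alpha=0.05))
```

Because the procedure is step-up, a p-value that fails its own rank threshold can still be rejected when a larger p-value passes at a higher rank.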
In particular, accurately estimating the FDR of subgroups of hypothesis tests is a difficult problem, first raised by Bradley Efron (Ann. Appl. Stat. 2:197-223, 2008). This problem is of practical importance in proteomics. We were the first to study mathematically the problem of FDR estimation for subgroups of peptide identifications (such as modified peptides) in proteomic data analysis. Using Bayesian analysis, we proved theoretically that the subgroup FDR and the combined FDR are in general not equal under the same score threshold. We therefore proposed the principle of separate subgroup filtering and FDR estimation, and derived a series of theoretical results (Statistics and Its Interface, 2012).
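The inequality between the two FDRs can be seen from a short application of Bayes' rule. The notation here is introduced only for illustration and is not necessarily that of the paper: S is the identification score, x the threshold, F the event that an identification is false, and G the event that it belongs to the subgroup (e.g., modified peptides). FDRs are written in their Bayesian (posterior probability) form.

```latex
\begin{align}
\mathrm{FDR}(x)   &= P(F \mid S \ge x), \\
\mathrm{FDR}_G(x) &= P(F \mid G,\, S \ge x)
  = \frac{P(G \mid F,\, S \ge x)\; P(F \mid S \ge x)}{P(G \mid S \ge x)}
  = \mathrm{FDR}(x)\,\frac{P(G \mid F,\, S \ge x)}{P(G \mid S \ge x)} .
\end{align}
```

The two FDRs coincide only when P(G | F, S ≥ x) = P(G | S ≥ x), that is, when subgroup membership is independent of correctness among the accepted identifications. If false identifications fall into the subgroup more often than accepted identifications do overall, the subgroup FDR exceeds the combined FDR at the same threshold.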
Based on the above theoretical analysis, we proposed a simpler and more intuitive relationship between the subgroup FDR and the combined FDR, and further developed Transfer FDR, an accurate FDR estimation method for small subgroups of peptide identifications (Molecular & Cellular Proteomics, 2014). The rationale of Transfer FDR is as follows. When the abundance of the modification to be identified is low, direct FDR estimation is severely inaccurate because the sample size is insufficient. Based on observation and analysis of real data, we devised an estimation method for the conditional probability that an erroneously identified peptide is a modified peptide. From this estimate, a quantitative relationship between the subgroup FDR of modified peptides and the combined FDR of all peptides is obtained. Through this relationship, the subgroup FDR can be predicted indirectly from the combined FDR, which can usually be estimated accurately. This overcomes the difficulty of estimating the FDR of small subgroups with too few samples.
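The idea of predicting the subgroup FDR from the combined FDR can be sketched as follows. This is a simplified illustration, not the published implementation: it assumes the conditional probability gamma (that a false identification belongs to the subgroup) is a single constant estimated as the subgroup fraction among decoys collected at a relaxed threshold, whereas the published method estimates this probability from real data in its own way. All function names, parameter names, and counts are hypothetical.

```python
def estimate_gamma(decoy_in_subgroup):
    """Fraction of decoy (known-false) identifications that fall in the subgroup.

    Hypothetical helper: pools many decoys (e.g., from a relaxed threshold)
    so that the fraction is stable even when the subgroup is small.
    """
    return sum(decoy_in_subgroup) / len(decoy_in_subgroup)


def transferred_subgroup_fdr(n_target_all, n_decoy_all, n_target_sub, gamma):
    """Predict the subgroup FDR from the combined FDR above one threshold.

    combined FDR  ~ n_decoy_all / n_target_all
    subgroup FDR  ~ combined FDR * gamma * n_target_all / n_target_sub
    """
    if n_target_all == 0 or n_target_sub == 0:
        return 0.0
    combined_fdr = n_decoy_all / n_target_all
    subgroup_fdr = combined_fdr * gamma * n_target_all / n_target_sub
    return min(1.0, subgroup_fdr)


# Made-up counts: 10,000 accepted targets, 100 accepted decoys (combined
# FDR about 1%), of which only 200 targets are modified peptides.
decoy_flags = [True] * 50 + [False] * 450   # 10% of pooled decoys are modified
gamma = estimate_gamma(decoy_flags)
fdr_sub = transferred_subgroup_fdr(10_000, 100, 200, gamma)
print(f"gamma = {gamma:.2f}, predicted subgroup FDR = {fdr_sub:.3f}")
```

In this made-up example the predicted subgroup FDR is about five times the combined FDR, which illustrates why filtering all peptides together at 1% can leave a rare subgroup with a much higher error rate.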
We applied the above subgroup FDR analysis and the Transfer FDR method to a number of special identification problems. For example, in a study of FDR estimation for novel genes identified by six-frame translation in proteogenomics, we found that when the combined FDR is used, the gene annotation ratio is the dominant factor affecting the actual FDR of novel genes (novel peptides) (Bioinformatics, 2015). The Transfer FDR method was also applied successfully to the quality control of open modification search (Molecular & Cellular Proteomics, 2019) and amino acid mutation identification (Journal of Proteomics, 2018). In addition, it was used in a collaborative study of primate-specific gene identification (Genome Research, 2019).
3. Statistical inference and data mining
In the course of analyzing biological data, we developed several general statistical inference and data mining methods, going one step beyond applied research into methodological and theoretical research.
The target-decoy competition (TDC) strategy is the gold standard method for FDR control of proteomic data. It has been used for many years, but it remained an empirical method lacking a theoretical foundation. In this method, the ratio of the number of decoy results to the number of target results is usually used as an estimate of the FDR, but whether this estimate controls the FDR (that is, keeps the actual FDR below a specified threshold) was unknown. We found that a +1 correction to this estimate (adding 1 to the decoy count) strictly controls the FDR, and we gave a theoretical proof of this conclusion (arXiv, 2015).
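A minimal sketch of filtering with the +1 corrected estimate is shown below. It assumes the input is the list of winners after target-decoy competition, each given as a (score, is_decoy) pair with distinct scores; the function name and the toy data are hypothetical.

```python
def tdc_accept(psms, alpha=0.01):
    """Accept targets at the most permissive cut where (D + 1) / T <= alpha.

    psms: list of (score, is_decoy) pairs, one per spectrum, after
          target-decoy competition. Higher scores are better.
    Returns the scores of the accepted target identifications.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best_cut = 0
    for i, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Corrected estimate: decoy count plus 1, divided by target count.
        if targets > 0 and (decoys + 1) / targets <= alpha:
            best_cut = i
    return [score for score, is_decoy in ranked[:best_cut] if not is_decoy]


# Toy ranking: 40 targets, 1 decoy, 20 targets, 3 decoys, 10 targets.
psms = (
    [(s, False) for s in range(100, 60, -1)]
    + [(60, True)]
    + [(s, False) for s in range(59, 39, -1)]
    + [(39, True), (38, True), (37, True)]
    + [(s, False) for s in range(36, 26, -1)]
)
accepted = tdc_accept(psms, alpha=0.05)
print(len(accepted))  # prints 60
```

Compared with the uncorrected ratio D / T, the corrected estimate is slightly more conservative, and the difference matters most when the number of accepted targets is small.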
More importantly, we extended the corrected TDC method to the general multiple hypothesis testing problem (arXiv, 2018). Previous FDR control methods in multiple hypothesis testing were usually based on a null distribution of the test statistic. However, all types of null distributions, including theoretical, permutation-based, and empirical ones, have inherent drawbacks. For example, the theoretical null distribution fails if the assumptions about the sample distribution are wrong. In addition, many FDR control methods require estimating the proportion of true null hypotheses, which is difficult and has not been resolved satisfactorily. We proposed a general TDC-based FDR control method using random permutations. Our method does not need to estimate the null distribution of the statistic or the proportion of true null hypotheses; it relies only on the ranking of the tests by some statistic or score. It constructs competitive decoy hypotheses from random sample permutations. We proved that this method rigorously controls the FDR. Simulation experiments show that our method controls the FDR more effectively than the Bayes and empirical Bayes methods and has greater statistical power.
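One way to read the description above is sketched below for a two-group comparison. This is an illustrative interpretation under stated assumptions, not the published algorithm: each feature gets a target score from the real group labels and a decoy score from a single random permutation of the labels, the two compete, and the winners are filtered with the +1 corrected ratio. The statistic (absolute difference of group means), the function names, and the simulated data are all hypothetical choices.

```python
import random
from statistics import mean


def mean_diff(values, labels):
    """Absolute difference between the two group means (labels are 0 or 1)."""
    group1 = [v for v, g in zip(values, labels) if g == 1]
    group0 = [v for v, g in zip(values, labels) if g == 0]
    return abs(mean(group1) - mean(group0))


def decoy_permutation_select(data, labels, alpha=0.1, seed=0):
    """Select features by target-decoy competition with permuted labels.

    data:   list of per-feature value lists (one value per sample)
    labels: 0/1 group label per sample; both groups must be non-empty
    Returns the sorted indices of the selected features.
    """
    rng = random.Random(seed)
    winners = []  # (score, is_decoy, feature index)
    for j, values in enumerate(data):
        target_score = mean_diff(values, labels)
        permuted = labels[:]
        rng.shuffle(permuted)  # one random sample permutation per feature
        decoy_score = mean_diff(values, permuted)
        if target_score > decoy_score:
            winners.append((target_score, False, j))
        else:  # ties go to the decoy, which is the conservative choice
            winners.append((decoy_score, True, j))

    winners.sort(key=lambda w: w[0], reverse=True)
    targets = decoys = cut = 0
    for i, (_, is_decoy, _) in enumerate(winners, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and (decoys + 1) / targets <= alpha:
            cut = i
    return sorted(j for _, is_decoy, j in winners[:cut] if not is_decoy)


# Simulated data: 200 features, the first 20 with a real group difference.
rng = random.Random(42)
labels = [1] * 10 + [0] * 10
data = [
    [rng.gauss(3.0 * g if j < 20 else 0.0, 1.0) for g in labels]
    for j in range(200)
]
selected = decoy_permutation_select(data, labels, alpha=0.1)
print(len(selected), "features selected")
```

Note that no null distribution or p-value is computed anywhere in this sketch: only the ordering of the winning scores and their target or decoy labels are used.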
Recent Publications
Qingbo Shu#, Mengjie Li#, Lian Shu#, Zhiwu An, Jifeng Wang, Hao Lv, Ming Yang, Tanxi Cai, Tony Hu, Yan Fu* and Fuquan Yang*. Large-scale Identification of N-linked Glycopeptides in Human Serum using HILIC Enrichment and Spectral Library Search. Molecular & Cellular Proteomics, 19:672-689, 2020.
Zhiqiang Gao#, Cheng Chang#, Jinghan Yang, Yunping Zhu*, Yan Fu*. AP3: An Advanced Proteotypic Peptide Predictor for Targeted Proteomics by Incorporating Peptide Digestibility. Analytical Chemistry, 2019, 91, 8705-8711.
Zhiwu An#, Linhui Zhai#, Wantao Ying, Xiaohong Qian, Fuzhou Gong*, Minjia Tan* and Yan Fu*. PTMiner: Localization and Quality Control of Protein Modifications Detected in an Open Search and Its Application to Comprehensive Post-translational Modification Characterization in Human Proteome. Molecular & Cellular Proteomics, 2019, 18(2):391-405.
Cheng Chang#, Zhiqiang Gao#, Wantao Ying#, Yan Fu*, Yan Zhao, Songfeng Wu, Mengjie Li, Guibin Wang, Xiaohong Qian*, Yunping Zhu*, Fuchu He*. LFAQ: towards unbiased label-free absolute protein quantification by predicting peptide quantitative factors. Analytical Chemistry, 2019, 91, 1335-1343.
Yi Shao, Chunyan Chen, Hao Shen, Bin Z He, Daqi Yu, Shuai Jiang, Shilei Zhao, Zhiqiang Gao, Zhenglin Zhu, Xi Chen, Yan Fu, Hua Chen, Ge Gao, Manyuan Long, Yong E Zhang. GenTree, an integrated resource for analyzing the evolution and function of primate-specific coding genes. Genome Research, 2019, 29(4):682-696.
Xinpei Yi#, Bo Wang#, Zhiwu An, Fuzhou Gong*, Jing Li*, Yan Fu*, Quality control of single amino acid variations detected by tandem mass spectrometry, Journal of Proteomics, 187:144-151, 2018.
Zhiwu An#, Qingbo Shu#, Hao Lv, Lian Shu, Jifeng Wang, Fuquan Yang*, Yan Fu*, N-Linked Glycopeptide Identification Based on Open Mass Spectral Library Search, BioMed Research International, doi.org/10.1155/2018/1564136, 2018.
Yan Fu, Data Analysis Strategies for Protein Modification Identification, In Klaus Jung (Ed.): Statistical Analysis in Proteomics, Humana Press, New York, NY, pp 1362:265-75, 2016.
Kun Zhang#, Yan Fu*, Wen-Feng Zeng, Kun He, Hao Chi, Chao Liu, Yan-Chang Li, Yuan Gao, Ping Xu*, Si-Min He*, A note on the false discovery rate of novel peptides in proteogenomics, Bioinformatics, 2015, 3249-3253.
Shan Lu, Sheng-Bo Fan, Bing Yang, Yu-Xin Li, Jia-Ming Meng, Long Wu, Pin Li, Kun Zhang, Mei-Jun Zhang, Yan Fu, Jin-Cai Luo, Rui-Xiang Sun, Si-Min He, Meng-Qiu Dong, Mapping native disulfide bonds at a proteome scale, Nature Methods, 2015, 12:329-331.
Yan Fu*, Xiaohong Qian, Transferred subgroup false discovery rate for rare post-translational modifications detected by mass spectrometry, Molecular & Cellular Proteomics, 2014, 13(5):1359-1368.
Yan Fu, Kernel Methods and Applications in Bioinformatics. In Kasabov, Nikola K. (Ed.): Handbook of Bio-/Neuro-Informatics, Springer-Verlag Berlin and Heidelberg GmbH & Co. K, pp 275-285, 2013.
Yan Fu*, Bayesian false discovery rates for post-translational modification proteomics, Statistics and Its Interface, 2012, 5(1):47-59.
Yan Fu*, Li-Yun Xiu, Wei Jia, Ding Ye, Rui-Xiang Sun, Xiao-Hong Qian, Si-Min He, DeltAMT: A Statistical Algorithm for Fast Detection of Protein Modifications From LC-MS/MS Data, Molecular & Cellular Proteomics, 2011, 10(5):1-15.
Yan Fu#*, Rong Pan, Qiang Yang, Wen Gao. Query-Adaptive Ranking with Support Vector Machines for Protein Homology Prediction. In Proceedings of the 7th International Symposium on Bioinformatics Research and Applications (ISBRA 2011). Lecture Notes in Bioinformatics, 6674:320-331, 2011.
Ding Ye#, Yan Fu*, Rui-Xiang Sun*, Hai-Peng Wang, Zuo-Fei Yuan, Hao Chi, Si-Min He, Open MS/MS spectral library search to identify unanticipated post-translational modifications and increase spectral identification rate. In Proceedings of the 18th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB 2010). Bioinformatics, 26(12):i399-i406, 2010.
Wei Jia#, Zhuang Lu#, Yan Fu#, Hai-Peng Wang, Le-Heng Wang, Hao Chi, Zuo-Fei Yuan, Zhao-Bin Zheng, Li-Na Song, Huan-Huan Han, Yi-Min Liang, Jing-Lan Wang, Yun Cai, Yu-Kui Zhang, Yu-Lin Deng, Wan-Tao Ying*, Si-Min He*, Xiao-Hong Qian*, A Strategy for Precise and Large Scale Identification of Core Fucosylated Glycoproteins, Molecular & Cellular Proteomics, 2009, 8(5):913-923.
Yan Fu#*, Wei Jia, Zhuang Lu, Haipeng Wang, Zuofei Yuan, Hao Chi, You Li, Liyun Xiu, Wenping Wang, Chao Liu, Leheng Wang, Ruixiang Sun, Wen Gao, Xiaohong Qian, Si-Min He, Efficient discovery of abundant post-translational modifications and spectral pairs using peptide mass and retention time differences. In Proceedings of the 7th Asia-Pacific Bioinformatics Conference (APBC 2009), BMC Bioinformatics, 2009, 10:S50.
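Zhiwu An#, Yan Fu*, Mass spectrometry-based algorithms for protein modification localization (in Chinese), Chemistry of Life (生命的化学), 2017, 37(1):104-112.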