XsltStream

XSLT transformation for large XML files. The stylesheet is applied only to the nodes selected by name (option -n), not to the whole document.
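
The idea of transforming only selected nodes can be sketched with the StAX and JAXP APIs that ship with the JDK. The class below is a hypothetical illustration, not the jvarkit source: the names StreamingXsltSketch, toElement, readSubtree and stream are invented here, and it assumes that each matching element is copied into its own small DOM document before the stylesheet is applied. Comments, processing instructions and namespace declaration attributes are ignored for brevity.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Iterator;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical sketch of per-node XSLT streaming; not the jvarkit implementation. */
final class StreamingXsltSketch {
    static final String DEMO_XML =
        "<set><item id='1'><name>a</name></item><other/><item id='2'><name>b</name></item></set>";
    static final String DEMO_XSL =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'><xsl:apply-templates select='item'/></xsl:template>"
        + "<xsl:template match='item'><xsl:value-of select='@id'/>:<xsl:value-of select='name'/>"
        + "<xsl:text>&#10;</xsl:text></xsl:template>"
        + "</xsl:stylesheet>";

    static String nsOrNull(QName n) {
        return n.getNamespaceURI().isEmpty() ? null : n.getNamespaceURI();
    }

    static String qualified(QName n) {
        return n.getPrefix().isEmpty() ? n.getLocalPart() : n.getPrefix() + ":" + n.getLocalPart();
    }

    /** Creates a DOM element, with its attributes, from a StAX start tag. */
    static Element toElement(Document doc, StartElement start) {
        Element e = doc.createElementNS(nsOrNull(start.getName()), qualified(start.getName()));
        for (Iterator<?> it = start.getAttributes(); it.hasNext();) {
            Attribute a = (Attribute) it.next();
            e.setAttributeNS(nsOrNull(a.getName()), qualified(a.getName()), a.getValue());
        }
        return e;
    }

    /** Reads the subtree opened by 'start' into a fresh document holding only that element. */
    static Document readSubtree(XMLEventReader in, StartElement start, DocumentBuilder db) throws Exception {
        Document doc = db.newDocument();
        Node current = doc.appendChild(toElement(doc, start));
        while (current != doc) { // back at the document node: the subtree is closed
            XMLEvent ev = in.nextEvent();
            if (ev.isStartElement()) {
                current = current.appendChild(toElement(doc, ev.asStartElement()));
            } else if (ev.isCharacters()) {
                current.appendChild(doc.createTextNode(ev.asCharacters().getData()));
            } else if (ev.isEndElement()) {
                current = current.getParentNode();
            }
        }
        return doc;
    }

    /** Applies the stylesheet to every element named 'target'; all other content is skipped. */
    static void stream(Reader xml, Reader xsl, QName target, Writer out) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        Transformer transformer = TransformerFactory.newInstance().newTransformer(new StreamSource(xsl));
        XMLEventReader in = XMLInputFactory.newFactory().createXMLEventReader(xml);
        while (in.hasNext()) {
            XMLEvent ev = in.nextEvent();
            if (ev.isStartElement() && ev.asStartElement().getName().equals(target)) {
                Document fragment = readSubtree(in, ev.asStartElement(), db);
                transformer.transform(new DOMSource(fragment), new StreamResult(out));
            }
        }
        in.close();
    }

    public static void main(String[] args) throws Exception {
        StringWriter out = new StringWriter();
        stream(new StringReader(DEMO_XML), new StringReader(DEMO_XSL), new QName("item"), out);
        System.out.print(out); // one "id:name" line per <item>
    }
}
```

In this sketch only one selected element is held in memory at a time, and a template matching "/" sees that element as the document element of its fragment. Whether xsltstream builds its fragments in exactly this way is an assumption.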

Usage

This program is now part of the main jvarkit tool. See jvarkit for compiling.

Usage: java -jar dist/jvarkit.jar xsltstream  [options] Files

Usage: xsltstream [options] Files
  Options:
    -h, --help
      print help and exit
    --helpFormat
      What kind of help. One of [usage,markdown,xml].
    -n, --tag, --name, -tag, -name
      XML node name. name has syntax '{ns}prefix:localName' or 
      'prefix:localName' or 'localName' or '{ns}localName'
      Default: []
    -o, --output
      Output file. Optional . Default: stdout
    -skip, --skip
      Ignore those names
      Default: []
    --version
      print version and exit
  * -t, -template
      XSLT template file.
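
The name syntax accepted by -n can be illustrated with a small Java parser. This is only a sketch under assumptions: NodeNameSketch and parse are hypothetical names, and how xsltstream itself matches a name given without a namespace is not described in this document. The sketch simply splits the string into its namespace, prefix and local parts.

```java
import javax.xml.namespace.QName;

/** Hypothetical parser for the '-n' name syntax; not taken from the jvarkit source. */
final class NodeNameSketch {

    /** Parses '{ns}prefix:localName', 'prefix:localName', 'localName' or '{ns}localName'. */
    static QName parse(String s) {
        String ns = "";
        String rest = s;
        if (s.startsWith("{")) {
            int close = s.indexOf('}');
            if (close < 0) {
                throw new IllegalArgumentException("missing '}' in " + s);
            }
            ns = s.substring(1, close);      // text between the braces
            rest = s.substring(close + 1);   // what follows the closing brace
        }
        int colon = rest.indexOf(':');
        String prefix = colon < 0 ? "" : rest.substring(0, colon);
        String local = rest.substring(colon + 1);
        if (local.isEmpty()) {
            throw new IllegalArgumentException("empty local name in " + s);
        }
        return new QName(ns, local, prefix);
    }

    public static void main(String[] args) {
        QName q = parse("{http://www.drugbank.ca}d:drug");
        System.out.println(q.getNamespaceURI() + " | " + q.getPrefix() + " | " + q.getLocalPart());
    }
}
```

Note that javax.xml.namespace.QName compares only the namespace URI and the local part, so in this sketch the prefix is kept for display but plays no role in equality.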

Keywords

  • xml
  • xslt
  • xsl
  • stylesheet

Source code

https://github.com/lindenb/jvarkit/tree/master/src/main/java/com/github/lindenb/jvarkit/tools/misc/XsltStream.java

License

The project is licensed under the MIT license.

Citing

Should you cite xsltstream? See https://github.com/mr-c/shouldacite/blob/master/should-I-cite-this-software.md

The current reference is:

http://dx.doi.org/10.6084/m9.figshare.1425030

Lindenbaum, Pierre (2015): JVarkit: java-based utilities for Bioinformatics. figshare. http://dx.doi.org/10.6084/m9.figshare.1425030

Example:

Dumping the ORCID identifiers from PubMed:

 java  -jar dist/pubmeddump.jar 'orcid[AUID]' |\
    java -jar dist/jvarkit.jar xsltstream -t pubmed2orcid.xsl -n "PubmedArticle"

The XSLT stylesheet:

<?xml version='1.0'  encoding="UTF-8" ?>
<xsl:stylesheet
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    version='1.0'
    >

<xsl:output method="text" />


<xsl:template match="/">
<xsl:apply-templates select="PubmedArticle"/>
</xsl:template>

<xsl:template match="PubmedArticle">
<xsl:apply-templates select="MedlineCitation/Article/AuthorList/Author[Identifier/@Source='ORCID']"/>
</xsl:template>

<xsl:template match="Author">
<xsl:value-of select="LastName"/>
<xsl:text>  </xsl:text>
<xsl:value-of select="ForeName"/>
<xsl:text>  </xsl:text>
<xsl:value-of select="Initials"/>
<xsl:text>  </xsl:text>
<xsl:call-template name="orcid"><xsl:with-param name="s" select="Identifier[@Source='ORCID']"/></xsl:call-template>
<xsl:text>  </xsl:text>
<xsl:for-each select="Affiliation"><xsl:text> </xsl:text></xsl:for-each>
<xsl:text>  </xsl:text>
<xsl:value-of select="../../../PMID[1]"/>
<xsl:text>  </xsl:text>
<xsl:value-of select="../../../DateCreated/Year"/>
<xsl:text>  </xsl:text>
<xsl:value-of select="../../Journal/ISOAbbreviation"/>
<xsl:text>  </xsl:text>
<xsl:value-of select="../../ArticleTitle"/>
<xsl:text>
</xsl:text>
</xsl:template>

<xsl:template name="orcid">
<xsl:param name="s"/>
<xsl:choose>
    <xsl:when test="starts-with($s,'http://orcid.org/')">
        <xsl:call-template name="orcid">
            <xsl:with-param name="s" select="substring($s,18)"/>
        </xsl:call-template>
    </xsl:when>
    <xsl:when test="starts-with($s,'https://orcid.org/')">
        <xsl:call-template name="orcid">
            <xsl:with-param name="s" select="substring($s,19)"/>
        </xsl:call-template>
    </xsl:when>
    <xsl:when test="starts-with($s,'https://')">
        <xsl:call-template name="orcid">
            <xsl:with-param name="s" select="substring($s,9)"/>
        </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
        <xsl:value-of select="translate($s,'-','')"/>
    </xsl:otherwise>
</xsl:choose>
</xsl:template>

</xsl:stylesheet>

output:

Kerkis  I   I   0000000344337580        28618452    2017    Cell Prolif.    Murine melanoma cells incomplete reprogramming using non-viral vector.
Zhang   Shuijun S   0000000205993289        28618450    2017    Cell Prolif.    SAV1 represses the development of human colorectal cancer by regulating the Akt-mTOR pathway in a YAP-dependent manner.
Nguyen  Ha Trong    HT  0000000222408942        28618448    2017    Health Econ Out of sight but not out of mind: Home countries' macroeconomic volatilities and immigrants' mental health.
Lee Jeongmi J   0000000299487554        28618213    2017    J Sep Sci   Solid-phase-extraction-assisted dispersive liquid-liquid microextraction based on solidification of floating organic droplet to determine sildenafil and its analogues in dietary supplements.
Kwon    Sung Won    SW  0000000171614737        28618213    2017    J Sep Sci   Solid-phase-extraction-assisted dispersive liquid-liquid microextraction based on solidification of floating organic droplet to determine sildenafil and its analogues in dietary supplements.
Villaverde  Juan J  JJ  000000025911792X        28618212    2017    Pest Manag. Sci.    Quantum chemistry in environmental pesticide risk assessment.
Pollard Thomas D    TD  0000000217852969        28618211    2017    Cytoskeleton (Hoboken)  Tribute to Fumio Oosawa the pioneer in actin biophysics.
Xiao    Bingxiu B   0000000285929251        28618205    2017    J. Clin. Lab. Anal. Reduced expression of circRNA hsa_circ_0003159 in gastric cancer and its clinical significance.
Heal    M Elisabeth ME  0000000150571141        28618202    2017    Congenit Heart Dis  Effects of persistent Fontan fenestration patency on cardiopulmonary exercise testing variables.
Hrubec  Terry C TC  0000000239619201        28618200    2017    Birth Defects Res   Ambient and dosed exposure to quaternary ammonium disinfectants causes neural tube defects in rodents.
Somri   Mostafa M   0000000238141402        28618198    2017    Int J Paediatr Dent Effect of intravenous paracetamol as pre-emptive compared to preventive analgesia in a pediatric dental setting: a prospective randomized study.
Haggblom    Max M   MM  0000000163077863        28618195    2017    Environ Microbiol Rep   Novel Reductive Dehalogenases from the Marine Sponge Associated Bacterium Desulfoluna spongiiphila.
Ehl Stefan  S   0000000162861234        28618194    2017    Insect Sci. Sexual dimorphism in the alpine butterflies Boloria pales and Boloria napaea: Differences in movement and foraging behaviour (Lepidoptera: Nymphalidae).
Gautam  Nischal K   NK  0000000224916705        28618193    2017    Paediatr Anaesth    Introduction of color-flow injection test to confirm intravascular location of peripherally placed intravenous catheters.
Kaymaz  Dicle   D   0000000179512065        28618190    2017    Clin Respir J   RELATION BETWEEN UPPER-LIMB MUSCLE STRENGTH WITH EXERCISE CAPACITY, QUALITY OF LIFE, AND DYSPNEA IN PATIENTS WITH SEVERE CHRONIC OBSTRUCTIVE PULMONARY DISEASE.
Erill   Ivan    I   0000000272807191        28618189    2017    Environ. Microbiol. Comparative genomics of the DNA-damage inducible network in the Patescibacteria.
Ii  Satoshi S   0000000254285385        28618187    2017    Int J Numer Method Biomed Eng   Physically consistent data assimilation method based on feedback control for patient-specific blood flow analysis.
Arzi    Boaz    B   0000000272898994        28618186    2017    Stem Cells Transl Med   Therapeutic Efficacy of Fresh, Allogeneic Mesenchymal Stem Cells for Severe Refractory Feline Chronic Gingivostomatitis.
Clark   Kaitlin C   KC  0000000260959382        28618186    2017    Stem Cells Transl Med   Therapeutic Efficacy of Fresh, Allogeneic Mesenchymal Stem Cells for Severe Refractory Feline Chronic Gingivostomatitis.
Friedrich   Anja    A   0000000297356286        28618185    2017    J Sleep Res Let's talk about sleep: a systematic review of psychological interventions to improve sleep in college students.
Lee Yun Hee YH  0000000150273988        28618180    2017    Clin Respir J   Neutrophil-lymphocyte ratio and a dosimetric factor for predicting symptomatic radiation pneumonitis in non-small-cell lung cancer patients treated with concurrent chemoradiotherapy.
Dhatariya   Ketan   K   0000000336199579        28618177    2017    Int. J. Clin. Pract.    Assessing the quality of primary care referrals to surgery of patients with diabetes in the East of England: A multi-centre cross-sectional cohort study.
Tougeron    Kevin   K   0000000348973787        28618174    2017    Insect Sci. Intraspecific maternal competition induces summer diapause in insect parasitoids.
De Paepe    Kim K   0000000279486765        28618173    2017    Environ. Microbiol. Inter-individual differences determine the outcome of wheat bran colonization by the human gut microbiome.
Daria   Dzema   D   0000000194181022        28618171    2017    J Sep Sci   Highly fluorinated polymers with sulfonate, sulfamide and N,N-diethylamino groups for the capillary electromigration separation of proteins and steroid hormones.
Tinoco  Adelita A   0000000221905824        28618169    2017    Ann Noninvasive Electrocardiol  ECG-derived Cheyne-Stokes respiration and periodic breathing in healthy and hospitalized populations.
Doyle   Zelda   Z   0000000186481383        28618161    2017    Aust J Rural Health Prevention of osteoporotic refractures in regional Australia.
Locker  Jacomine Krijnse    JK  0000000186582977        28618160    2017    Cell. Microbiol.    VACCINIA VIRUS A11 IS REQUIRED FOR MEMBRANE RUPTURE AND VIRAL MEMBRANE ASSEMBLY.
Davis   Adam S  AS  0000000271961197        28618159    2017    Pest Manag. Sci.    Are herbicides a once in a century method of weed control?
Nelson  C E CE  0000000325253496        28618153    2017    Environ. Microbiol. Cascading influence of inorganic nitrogen sources on DOM production, composition, lability and microbial community structure in the open ocean.
(...)

Example:

Extracting sample identifiers from NCBI BioSample.

The XSLT stylesheet:

<?xml version='1.0' encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>
<xsl:output method="text"  encoding="UTF-8"/>
<xsl:template match="BioSample">
<xsl:copy>
<xsl:apply-templates select="Ids"/>
</xsl:copy>
</xsl:template>

<xsl:template match="Ids">
<xsl:value-of select="Id[@db='BioSample']/text()"/>
<xsl:text>  </xsl:text>
<xsl:value-of select="Id[@db='SRA']/text()"/>
<xsl:text>
</xsl:text>
</xsl:template>

</xsl:stylesheet>

execute:

curl -s "ftp://ftp.ncbi.nlm.nih.gov/biosample/biosample_set.xml.gz" |\
  gunzip -c |\
  java -jar dist/jvarkit.jar xsltstream -n BioSample -t transform.xsl |\
  nl


(...)
7211789 SAMN07945461    SRS2643051
7211790 SAMN07945462    SRS2643052
7211791 SAMN07945463    SRS2643049
7211792 SAMN07945464    SRS2643050
7211793 SAMN07945465    
7211794 SAMN07945466    
7211795 SAMN07945467    
7211796 SAMN07945468    

Example:

List of rs / ss identifiers from dbSNP. The XSLT stylesheet:

<?xml version='1.0'  encoding="UTF-8" ?>
<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0' xmlns:r="https://www.ncbi.nlm.nih.gov/SNP/docsum">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:apply-templates select="//r:Rs/r:Ss"/>
</xsl:template>

<xsl:template match="r:Ss">rs<xsl:value-of select="../@rsId"/> ss<xsl:value-of select="@ssId"/><xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>

execute:

 wget -O - -q  "ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/XML/ds_ch1.xml.gz" |\
  gunzip -c |\
  java -jar dist/jvarkit.jar xsltstream -t transform.xsl --tag '{https://www.ncbi.nlm.nih.gov/SNP/docsum}Rs'

output:

rs171 ss41715810
rs171 ss43026199
rs171 ss96405203
rs242 ss242
rs242 ss287669350
rs242 ss326012704
rs242 ss326012781
rs242 ss498801024
rs242 ss550913725
rs242 ss552749651
(...)

Example:

Rough exploration of non-coding variants with pathogenic consequences in ClinVar. The XSLT stylesheet:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns="http://www.w3.org/1999/xhtml">
<xsl:output method="text"/>
<xsl:template match="/"><xsl:apply-templates/></xsl:template>
<xsl:template match="ReleaseSet"><xsl:apply-templates/></xsl:template>
<xsl:template match="ClinVarSet">
<xsl:if test="count(.//Attribute[@Type='MolecularConsequence'])=1">
<xsl:variable name="csq" select=".//Attribute[@Type='MolecularConsequence']/text()"/>
<xsl:if test="($csq = 'non-coding transcript variant' or $csq  = 'intergenic variant'  or $csq  = '2kb upstream variant'  or $csq  = '5 prime utr variant' ) and .//Description[contains(text(),'athogenic')]">
<xsl:value-of select="ReferenceClinVarAssertion/ClinVarAccession/@Acc"/>
<xsl:text>  </xsl:text>
<xsl:for-each select=".//Citation/ID[@Source = 'Pubmed']">pmid<xsl:value-of select="text()"/>;</xsl:for-each>
<xsl:text>
</xsl:text>
</xsl:if>
</xsl:if>
</xsl:template>
</xsl:stylesheet>

invoke:

$ curl -s  "ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/ClinVarFullRelease_00-latest.xml.gz" | gunzip -c |\
   java -jar dist/jvarkit.jar xsltstream -t transform.xsl -n ClinVarSet

RCV000203227    
RCV000256194    
RCV000256207    
RCV000000913    
RCV000000914    
RCV000006518

Example

Convert the DrugBank XML to TSV. The XSLT stylesheet (drugbank2tsv.xsl):

<?xml version='1.0'  encoding="UTF-8" ?>
<xsl:stylesheet xmlns:d="http://www.drugbank.ca" xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>
<xsl:output method="text"/>

<xsl:template match="d:drugbank">
<xsl:apply-templates select="d:drug"/>
</xsl:template>

<xsl:template match="d:drug">
<xsl:value-of select="d:name/text()"/>
<xsl:text>&#9;</xsl:text>
<xsl:for-each select="d:groups/d:group">
    <xsl:if test='position()>1'>-&gt;</xsl:if>
    <xsl:value-of select="./text()"/>
</xsl:for-each>
<xsl:text>&#9;</xsl:text>
<xsl:for-each select="d:calculated-properties/d:property[d:kind/text()='InChIKey']/d:value">
    <xsl:if test='position()>1'> </xsl:if>
    <xsl:value-of select="./text()"/>
</xsl:for-each>
<xsl:text>&#9;</xsl:text>
<xsl:for-each select="d:external-identifiers/d:external-identifier[d:resource/text()='ChEMBL']/d:identifier">
    <xsl:if test='position()>1'> </xsl:if>
    <xsl:value-of select="./text()"/>
</xsl:for-each>
<xsl:text>&#9;</xsl:text>
<xsl:for-each select="d:external-identifiers/d:external-identifier[d:resource/text()='PubChem Compound']/d:identifier">
    <xsl:if test='position()>1'> </xsl:if>
    <xsl:value-of select="./text()"/>
</xsl:for-each>
<xsl:text>&#9;</xsl:text>
<xsl:for-each select="d:external-identifiers/d:external-identifier[d:resource/text()='PubChem Substance']/d:identifier">
    <xsl:if test='position()>1'> </xsl:if>
    <xsl:value-of select="./text()"/>
</xsl:for-each>
<xsl:text>
</xsl:text>
</xsl:template>

</xsl:stylesheet>

Then run it on the DrugBank XML file:

$ java -jar dist/jvarkit.jar xsltstream \
    -n '{http://www.drugbank.ca}drug' \
    -t drugbank2tsv.xsl \
    full_database.xml

Example

https://www.biostars.org/p/335867/#335885

How to download database of Human protein sequences with sub cellular locations?

see https://gist.github.com/lindenb/b3c726adecde90e37acd92bc940dfdd5

Example

https://www.biostars.org/p/365479/ "Bioinformatics word cloud to use in classes bioinformatics"

<?xml version='1.0'  encoding="UTF-8" ?>
<xsl:stylesheet  xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0' >
<xsl:output method="text" encoding="UTF-8"/>

<xsl:template match="/">
<xsl:apply-templates select="*"/>
</xsl:template>

<xsl:template match="*">
<xsl:apply-templates select="PubmedArticle"/>
</xsl:template>

<xsl:template match="PubmedArticle">
<xsl:variable name="year" select="MedlineCitation/Article/Journal/JournalIssue/PubDate/Year/text()"/>
<xsl:for-each select="MedlineCitation/MeshHeadingList/MeshHeading/DescriptorName">
<xsl:value-of select="$year"/>
<xsl:text>  </xsl:text>
<xsl:value-of select="./text()"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>

</xsl:stylesheet>

see https://gist.github.com/lindenb/5d7773a93d8c2b0edbd4c01bf8834919