Journal of Diabetology

: 2020  |  Volume : 11  |  Issue : 3  |  Page : 198--203

Muscle biopsies differ in relation to expression of fiber-type specific genes

Rakesh Kumar1, Krishna Kumar Ojha1, Pooja Kushwaha2, Vijay Kumar Singh1,  
1 Department of Bioinformatics, Central University of South Bihar, Gaya, Bihar, India
2 Department of Statistics, Central University of South Bihar, Gaya, Bihar, India

Correspondence Address:
Dr. Vijay Kumar Singh
Department of Bioinformatics, School of Earth, Biological and Environmental Sciences, Central University of South Bihar, SH-7, Gaya Panchanpur Road, Village – Karhara, Post. Fatehpur, Gaya, Bihar.


Background: Skeletal muscle transcriptome has been analyzed to report muscle-specific biomarkers, but muscle heterogeneity has largely been overlooked in pursuit of formulating a balanced design. Given the heterogeneity of muscle tissue in terms of both function and fiber-type composition, there could be several unaccounted sources of variation affecting the gene expression profile of skeletal muscles. Categorization of muscle transcriptome according to the source of variation will not only improve the power of transcriptome comparison tests but also will help to identify unaccounted biological sources of variation. Materials and Methods: Gene expression profile of normal skeletal muscle subjects (GSE18732) were analyzed with R-statistical software and Bioconductor packages. Gene-sets were prepared by grouping Affymetrix probes according to biological processes they were annotated. Coherence score and associated P values were calculated for each gene-set. All gene-sets having P < 0.05 were selected as coherent gene-set. Results: We have analyzed gene-sets and used coherence scores to measure the degree of coregulation between genes of a gene-set. We have shown that coherent gene-sets have a better chance to classify samples into biologically relevant subgroups as compared to noncoherent gene-sets. Further, we have applied the developed method to the muscle gene expression profiles and found that muscle fiber-type proportion in collected biopsies is one of the most prominent unaccounted “source of variations” affecting gene expression measurements. Conclusion: The sample classification produced based on the expression profile of genes belonging to coherent gene-sets has a better chance to result in biologically meaningful clusters.

How to cite this article:
Kumar R, Ojha KK, Kushwaha P, Singh VK. Muscle biopsies differ in relation to expression of fiber-type specific genes.J Diabetol 2020;11:198-203

How to cite this URL:
Kumar R, Ojha KK, Kushwaha P, Singh VK. Muscle biopsies differ in relation to expression of fiber-type specific genes. J Diabetol [serial online] 2020 [cited 2020 Oct 24 ];11:198-203
Available from:

Full Text


The transcriptional programs of skeletal muscles are known to be affected by heterogeneous cell population[1],[2],[3] and different conditions to which muscles are exposed.[4],[5],[6] Due to multilayered control over gene expression, there tend to be source of variations that are unmeasured, unknown, or simply unaccounted. This can lead to reduced statistical power or untoward correlation between genes in addition to counterfeit expression signals. This behavior is true even for well-balanced, randomized studies. Muscle gene expression profile (GEP) comparison has been the experiment of choice to identify altered genes and pathways in disorders affecting muscle physiology.[7],[8],[9],[10],[11] The current practice of reporting muscle biomarkers primarily focuses on the formulation of balanced design, but do not takes into account within-class heterogeneity. Therefore, given the GEPs, a method that can identify a potential source of variation and group muscle biopsies accordingly would be of great interest. This will not only improve the power of transcriptome comparison test but also will help to identify unaccounted source of variations that cast definite expression patterns. The objective of this study was to examine the muscle GEPs for unaccounted source of variation. The biological source of variation, affecting gene expression, has its characteristic coherent expression signature, and based on this belief we have hypothesized that grouping of GEPs according to “source to variation” can be achieved by identifying coherent gene-sets. Here we have described “bottom–up approach” which uses a coherence score to identify gene-sets capable of producing biologically relevant sample classification with unsupervised clustering. We have shown that coherent gene-sets have a better chance to classify samples into biologically relevant subgroups as compared to noncoherent gene-sets. Further, we have applied “bottom–up approach” to the muscle GEPs and found that muscle fiber-type proportion in collected muscle biopsies is one of the most prominent unaccounted “source of variations” affecting gene expression measurements.

 Materials and Methods

The datasets GSE18732 and GSE25462 were analyzed in this study as test and validation datasets, respectively. Details about the samples and experimental conditions are given in studies conducted by Gallagher et al.[12] and Lerin et al.[13] All analysis was performed with R-statistical software, version 3.5.3 using Bioconductor packages. Biologically relevant subclusters of muscle samples were obtained by applying a bottom–up approach to analyze GEPs. The downloaded CELL files were loaded into R and RMA normalised with help of R-package “Affymetrix.”[14] The differential expression and gene ontology analysis were performed with help of R-packages “limma”[15] and “GOstats.”[16]

Bottom–up approach for analysis of gene expression profiles

The GEPs record expression values of genes (1…m) across samples (1…n) in an m × n matrix. The samples usually have a class label associated with them and the data are analyzed in a top–down approach to find differentially expressed genes between the classes. The approach works only if the class labels of samples are known. The class discovery using the unsupervised clustering approach has been proven difficult as clustering algorithms produce results that are heavily dependent on gene-set selected for classification.[17],[18] A gene-set in principle could be any collection of genes and therefore for 100 genes the possible combination of gene-sets could be ≅1030. Testing each of these gene-sets to find one which produces biologically relevant clusters would demand huge computational power and time. The problem becomes even difficult when we consider modern-day GEPs that usually contain 2k–5k informative genes. Therefore, a heuristic method that provides a good solution to the above problem within the reasonable time is required. Here, we have proposed bottom–up approach of analyzing gene expression data to discover biologically relevant classes of samples given gene-set a priori. The bottom–up approach works by selecting a set of genes to classify samples into biologically relevant subgroups [Figure 1]. Because genes generally work in a coordinated manner to accomplish a biological process (BP)/function, we have defined the obtained subgroups as biologically relevant if most of the genes annotated to a BP are coordinately up/downregulated in one of the clusters. Therefore, despite creating all possible combination of gene-sets, the bottom–up approach creates gene-sets by grouping genes that are annotated to a BP. The method calculates the coherence score (Sc) of the gene-set given the gene expression data. We have hypothesized that a coherent gene-set will classify samples into groups that will show differential regulation of associated BP or, in other words, the coherent gene-sets will have a better chance to classify GEPs into biologically relevant clusters compared to noncoherent gene-sets.{Figure 1}

Gene-set preparation

Gene-set was defined as any collection of genes based on some common properties. The common properties used to group genes could be gene function, the involvement of genes in BP/pathways, coexpression, or any other properties assigned to genes and gene products. Gene-sets were created based on gene ontology (GO) BP terms.[19],[20] The R-bioconductor package “GSEABase” was used to create gene-sets corresponding to GO BP terms by grouping Affymetrix probes together that were annotated to the particular BP terms. The “GSEABase” requires microarray platform-specific annotation packages (available at Bioconductor) to obtain mapping of probes to GO terms. The package “GSEABase” depends on bioconductor package “GO.db” for information about the structure of GO terms and their ontology relations. It was important to keep only one probe per gene for further analysis as the presence of multiple probes for a gene in a gene-set may result in an inflated coherence score (these probes may have similar expression pattern and thus higher correlation). The function “featureFilter” as provided by package “genefilter” was used to select a probe with the highest variance among all probes corresponding to a particular Entrez gene identifier. The function featureFilter used array-specific annotation package to obtain the mapping of all probes to its corresponding Entrez gene ID. The probes that have interquartile range (IQR) <0.4 across all samples or log intensity value <6.64 for 25% of the total sample were defined as noninformative and were removed from further analysis. The informative probes thus selected were grouped into gene-set using GSEABase package. In this work, we have focused our attention on gene-sets that correspond to BP GO terms and the rest were discarded.

Gene-set internal coherence score (Sc)

A gene-set was defined as coherent if Sc was found significantly higher than expected by chance (P < 0.05). The Sc of a gene-set was defined as the median of all pairwise correlations between the genes of a gene-set.[21] The following equation summarizes the Sc of a gene-set:


where gi is the ith gene of the gene-set, gj is the jth gene of the gene-set, and N is the total number of genes in a gene-set.

Assessing the significance of internal coherence score

A null distribution of Sc was created with steps listed below.

A test gene-set was prepared by randomly assigning an equal number of genes, to the test gene-set, as contained by gene-set for which P value has to be calculated.

The Sc of test gene-set was calculated with the method as discussed in the section “Gene-set internal coherence score (Sc).”

Steps 1 and 2 were repeated 2000 times to generate a null distribution for the concerned gene-set.

The nominal P value associated with Sc of gene-set was calculated by counting the number of times a test gene-set obtained Sc greater than the observed internal coherence score of the concern gene-set and divining this number by the total number of times the permutation test was run. Gene-sets having P < 0.05 were considered as coherent gene-set.


The GEPs of 47 normal glucose tolerant (NGT) subjects of GSE18732 were analyzed. After applying the filtering criteria described in the section “Gene-set preparation”, a total of 3221 genes were selected for gene-set preparation. Finally, the genes were grouped into 513 gene-sets each corresponding to a unique GO term belonging to the BP domain only.

Coherent gene-set relates to muscle-specific biological processes

A total of 70 gene-sets, of 513 analyzed, were found as coherent based on criterion defined in the section “Assessing the significance of internal coherence score” [Supplementary Table S1]. A directed acyclic graph (DAG) shows that a good number of coherent gene-sets contain genes involved in cellular metabolic processes [Supplementary Figure S1]. This indicates that metabolic profiles are not uniform for NGT samples and there may be a possible classification of NGT samples based on metabolic profile.{Table 1} {Figure 5}

Coherent gene-sets classify samples into biologically relevant subgroups

To prove that coherent gene-set has a better chance to classify GEPs into biologically relevant subclusters, we conducted an experiment where a gene-set was randomly selected to classify 47 NGT subjects of GSE18732 into three groups using k-means classifier. The three groups were labeled as L, M and H groups based on the median expression of genes belonging to selected gene-set [Supplementary Section S1]. We wanted to separate samples into groups having substantial fold change for genes differentially expressed between them and therefore clustering algorithm was asked to produce three clusters instead of two (L and H groups). We assumed that the third cluster (M group) will contain samples having expression values in between the samples of the L and H groups and thus will increase the observed fold change for differential expression analysis. We used a score Vk to assess the biological relevance of the created subgroups. The score Vk was defined as the percentage of genes from the selected gene-set, upregulated (adjusted P < 0.05) in group-H subjects compared to group-L subjects. The score


was obtained for coherent and


for noncoherent gene-sets. A Mann–Whitney U test for unpaired data was carried out to test the null hypothesis H0:




against the alternate hypothesis H1:




. The Mann–Whitney U test P value was found to be 2.932e–09 confirming that the score Vk was significantly higher for coherent gene-sets compared to noncoherent gene-sets [Figure 2]A.{Figure 2}

To further support the findings, we conducted another experiment where we have randomly selected a coherent gene-set and tested if the score Vk is greater than a fixed cutoff, say Zi. If the score Vk of the selected coherent gene-set was found to be greater than Zi we considered the outcome of the trail as success and failure otherwise. The random variable (Sn), defined as the number of coherent gene-sets having score Vk≥ Zi out of n randomly selected gene-sets, will follow a binomial distribution with parameters, n (number of trails) and P (probability of success in a trial). The probability P was calculated by counting the number of noncoherent gene-sets with score Vk greater than Zi and dividing the value by the total number of noncoherent gene-sets. We tested the null hypothesis H0: π = P against the alternate hypothesis H1: π > P, where π denotes the probability of success when trial involves coherent gene-sets. The hypothesis was tested for different settings of Zi, by calculating the binomial probability P(Sn≥ x), where x was the observed number of success in n trial involving coherent gene-sets. Using significance level 0.05, we found that in each case the parameter π was significantly higher for coherent gene-sets [Figure 3]A.{Figure 3}

Coherent gene-sets classify muscle biopsies into groups differing in expression of fiber-type-related genes

Knowing that human skeletal muscle is a heterogeneous tissue with respect to both function and fiber composition, we speculated muscle biopsies to differ in terms of fiber-type specific gene expression. To test this, genes belonging to the most coherent gene-set were selected to classify 47 NGT subjects of GSE18732 into three groups using k-means classifier. The resultant three groups contain 24, 19, and 4 NGT samples and were named as H-cluster, M-cluster, and L-cluster, respectively [Figure 4]. A close inspection of GO terms enriched in a separate analysis of up- and downregulated genes reveals that H-cluster samples have higher expression of genes related to aerobic respiration, tricarboxylic acid (TCA) cycle, mitochondrion organization, mitochondrial fission, mitochondrial electron transport chain, and energy derivation by oxidation of organic compounds [Supplementary Table S2]. These BPs are characteristic of slow muscle fiber type and therefore it shows that GEPs derived from muscle biopsies differ in terms of fiber-type specific gene expression.{Figure 4} {Table 2}

Validation with another dataset

The procedure was repeated with another independent dataset to validate the results obtained with the analysis of GSE18732. The 40 normoglycemic muscle biopsies of GSE25462 were analyzed to obtain score Vk for each of the prepared gene-sets. The Mann–Whitney U test showed that score


was significantly higher than


confirming the results obtained with GSE18732 [Figure 2B]. In the second experiment with the validation dataset, the probability of success (P) defined as the probability of having score Vk≥ Zi for a randomly selected gene-set was found to be significantly higher for coherent gene-set as compared to noncoherent gene-set [Figure 3B]. The 40 selected biopsies of GSE25462 were classified into three classes following the method as described in the section “Coherent gene-sets classifying muscle biopsies into groups differing in fiber-type-related genes.” Similar to the analysis of GSE18732, a group of samples with elevated expression of genes characteristic of slow fiber type [Supplementary Table S3] was observed.{Table 3}


Transcriptome analysis experiments aim to find set of genes with the coordinated change of expression across the GEPs. To understand the biological meaning associated with set of genes, thus identified, the downstream analyses such as GO, gene-set enrichment, or pathway analysis are performed.[22],[23],[24] Keeping in mind the end result of the transcriptome analysis, we have developed “bottom–up approach” of analyzing GEPs. The bottom–up approach is a feature selection method, which provides educated guesses about the gene-sets which can produce biologically meaningful subclusters. The bottom–up approach operates in reverse order to general procedure applied to analyze GEPs, where given the class labels the set of genes up/downregulated between the classes were identified. “Bottom–up approach” first decides about the gene-set and then determines the class labels of GEPs. The “bottom–up approach” was applied to analyze GEPs of 47 NGT skeletal muscle biopsies of dataset GSE18732 to discover biologically relevant subclusters. Most of the coherent gene-sets identified in this analysis were related to skeletal muscle metabolic process [Supplementary Figure S1]. This indicates that metabolic profiles were not uniform for NGT samples and there may be a possible classification of NGT samples based on expression profiles of genes related to metabolism. Classification of NGT samples using k-means classifier based on TCA cycle genes (most coherent gene-set) resulted in three groups. The body mass index (BMI), myoglobin level, basal insulin, and glucose level were not found different using one-way analysis of variance (ANOVA) between three groups of samples [Supplementary Figure S2]. However, the expression of TCA cycle genes was considerably different between the three groups [Supplementary Figure S3]. Further, we have shown that coherent gene-sets have significantly higher score Vk as compared to noncoherent gene-sets and thus have a better chance to classify NGT samples into biologically relevant groups. To find definitive subcluster of NGT samples, genes annotated to the TCA cycle were used to classify 47 NGT samples of dataset GSE18732. The k-means classification resulted in three groups of samples viz. “L”, “H,” and “M” cluster. The differential expression analysis between “H” and “L” cluster samples showed that genes related to slow fiber-specific BP were upregulated in H-cluster. Chemello et al.[3] in a study analyzing mouse muscle GEPs at single fiber level have shown that type-I fiber has vastly different transcriptome than type-II fibers. This indicates that muscle fiber-type proportion in collected biopsies is one of the most prominent unaccounted “source of variations” affecting gene expression measurements.{Figure 6} {Figure 7}


The two major findings of this study are as follows: first, classification of GEPs over genes belonging to coherent gene-sets has a better chance to result in biologically meaningful clusters. Second, muscle fiber-type proportion in collected biopsies is one of the most prominent unaccounted “source of variations” affecting gene expression measurements.

Data availability statement

The datasets used for this work are publicly available from Gene Expression Omnibus (GEO).


We would like to thank the Science and Engineering Research Board (SERB), Department of Science and Technology, Government of India. Mr. Rakesh Kumar had received financial assistance from UGC, New Delhi in the form of Rajiv Gandhi National Fellowship.

Financial support and sponsorship

This work was supported by the young scientist grant (YSS/2015/01759) of Science and Engineering Research Board (SERB), Department of Science and Technology, Government of India.

Conflicts of interest

Authors declare no conflicts of interest.


1Cho DS, Doles JD Single cell transcriptome analysis of muscle satellite cells reveals widespread transcriptional heterogeneity. Gene 2017;636:54-63.
2Lindholm ME, Huss M, Solnestam BW, Kjellqvist S, Lundeberg J, Sundberg CJ The human skeletal muscle transcriptome: Sex differences, alternative splicing, and tissue homogeneity assessed with RNA sequencing. Faseb J 2014;28:4571-81.
3Chemello F, Bean C, Cancellara P, Laveder P, Reggiani C, Lanfranchi G Microgenomic analysis in skeletal muscle: Expression signatures of individual fast and slow myofibers. PLos One 2011;6:e16807.
4Keller P, Vollaard NB, Gustafsson T, Gallagher IJ, Sundberg CJ, Rankinen T, et al. A transcriptional map of the impact of endurance exercise training on skeletal muscle phenotype. J Appl Physiol (1985) 2011;110:46-59.
5Laughlin MH, Padilla J, Jenkins NT, Thorne PK, Martin JS, Rector RS, et al. Exercise-induced differential changes in gene expression among arterioles of skeletal muscles of obese rats. J Appl Physiol (1985) 2015;119:583-603.
6Hassan-Smith ZK, Jenkinson C, Smith DJ, Hernandez I, Morgan SA, Crabtree NJ, et al. 25-hydroxyvitamin D3 and 1,25-dihydroxyvitamin D3 exert distinct effects on human skeletal muscle function and gene expression. PLos One 2017;12: e0170665.
7Somel M, Khaitovich P, Bahn S, Pääbo S, Lachmann M Gene expression becomes heterogeneous with age. Curr Biol 2006;16:R359-60.
8Haslett JN, Sanoudou D, Kho AT, Bennett RR, Greenberg SA, Kohane IS, et al. Gene expression comparison of biopsies from Duchenne muscular dystrophy (DMD) and normal skeletal muscle. Proc Natl Acad Sci USA 2002;99:15000-5.
9Crimi M, Bordoni A, Menozzi G, Riva L, Fortunato F, Galbiati S, et al. Skeletal muscle gene expression profiling in mitochondrial disorders. Faseb J 2005;19:866-8.
10Strand AD, Aragaki AK, Shaw D, Bird T, Holton J, Turner C, et al. Gene expression in Huntington’s disease skeletal muscle: A potential biomarker. Hum Mol Genet 2005;14:1863-76.
11Sreekumar R, Halvatsiotis P, Schimke JC, Nair KS Gene expression profile in skeletal muscle of type 2 diabetes and the effect of insulin treatment. Diabetes 2002;51:1913-20.
12Gallagher IJ, Scheele C, Keller P, Nielsen AR, Remenyi J, Fischer CP, et al. Integration of microRNA changes in vivo identifies novel molecular features of muscle insulin resistance in type 2 diabetes. Genome Med 2010;2:9.
13Lerin C, Goldfine AB, Boes T, Liu M, Kasif S, Dreyfuss JM, et al. Defects in muscle branched-chain amino acid oxidation contribute to impaired lipid metabolism. Mol Metab 2016;5:926-36.
14Gautier L, Cope L, Bolstad BM, Irizarry RA Affy–analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 2004;20:307-15.
15Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 2015;43:e47.
16Falcon S, Gentleman R Using GOstats to test gene lists for GO term association. Bioinformatics 2007;23:257-8.
17D’haeseleer P How does gene expression clustering work? Nat Biotechnol 2005;23:1499-501.
18Kerr G, Ruskin HJ, Crane M, Doolan P Techniques for clustering gene expression data. Comput Biol Med 2008;38: 283-93.
19Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: Tool for the unification of biology. The gene ontology consortium. Nat Genet 2000;25:25-9.
20Gene Ontology Consortium. Expansion of the gene ontology knowledgebase and resources. Nucleic Acids Res 2017;45: D331-8.
21Montaner D, Minguez P, Al-Shahrour F, Dopazo J Gene-set internal coherence in the context of functional profiling. BMC Genom 2009;10:197.
22Quackenbush J Computational analysis of microarray data. Nat Rev Genet 2001;2:418-27.
23Selvaraj S, Natarajan J Microarray data analysis and mining tools. Bioinformation 2011;6:95-9.
24Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 2005;102:15545-50.