Climate solutions GPT model
To measure a firm’s engagement in climate solutions, we fine-tune a GPT model to detect climate solutions sentences in the Item 1 Business Description section of 10-K filings from the SEC. We extract textual data for the universe of US public firms that report SEC 10-K filings in the EDGAR database from fiscal year 2005 to 2022. We focus on industries that are pivotal to climate solutions, where our LLM is likely more accurate in identifying climate solutions. Based on reviewing Project Drawdown, we keep 13 (out of 25) GICS industry groups that are central to climate solutions: energy, materials, capital goods, transportation, automobiles & components, consumer durables & apparel, food beverage & tobacco, household & personal products, technology hardware & equipment, semiconductors & semiconductor equipment, utilities, equity real estate investment trusts (REITs), real estate management & development.
We define climate solutions as products and services that develop or deploy new technologies in a transition to a low-carbon economy. We identify climate solution technologies based on guidance from Project Drawdown, which maintains a list of technologies that can reduce greenhouse gases in the atmosphere, compiled by a network of scientists and researchers [56].
To fine-tune our climate solutions GPT model, we label 3,508 sentences as either climate solutions sentences or not, and use them as our training set. These sentences are drawn from 10-K Item 1 sentences that are representative of each of the 13 industry groups, together with sentences that the model finds more difficult to classify, selected through an active learning approach. We also use the fine-tuned GPT model to classify each climate solutions sentence into one of 88 climate solutions topics. Supplementary Note 1 describes in detail how we create the Climate Solutions LLM and the topic model, and Supplementary Note 2 describes the labeling process used to create our training set.
Our primary climate solutions measure is the number of sentences identified as climate solutions divided by the total number of sentences in the 10-K Item 1 Business Description, multiplied by 100 to express the measure as a percentage. Variable definitions and summary statistics for the climate solutions measure and other variables used in this study are presented in Supplementary Tables 1 and 2, respectively. This measure assumes that the relative proportion of climate solutions sentences reflects a firm’s product or service focus on climate solutions. Our results remain similar under three sets of alternative measures of a firm’s climate solutions. The first set is CS measure top 50 and CS measure top 100, which are the percent of climate solutions sentences in the top 50 and top 100 sentences of 10-K Item 1, respectively. The second set is CS measure weighted 50 and CS measure weighted 100, where we place a larger weight on earlier sentences in 50- and 100-sentence increments, respectively. The third set is CS measure rolling 50 and CS measure rolling 100, which are the percent of climate solutions sentences in the 50- and 100-sentence segments with the highest ratio of climate solutions sentences, respectively. We present the correlation matrix of these six alternative measures and our primary climate solutions measure in Supplementary Table 3. Our primary climate solutions measure is highly correlated with each of the six alternative measures (correlations above 90%), which suggests that our measures are not sensitive to the choice of construction method.
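The construction of the primary, top-n, and rolling-n measures can be sketched as follows. This is a minimal illustration, assuming each filing is already reduced to a list of 0/1 sentence labels in document order and reading "top n" as the first n sentences; the function names are hypothetical, and the weighted variant is omitted because the exact weights are not stated here.

```python
def cs_measure(labels):
    """Percent of sentences flagged as climate solutions (labels are 0/1)."""
    if not labels:
        return 0.0
    return 100.0 * sum(labels) / len(labels)


def cs_measure_top(labels, n):
    """Same percentage, restricted to the first n sentences of Item 1."""
    return cs_measure(labels[:n])


def cs_measure_rolling(labels, n):
    """Highest percentage over all contiguous n-sentence segments."""
    if len(labels) <= n:
        return cs_measure(labels)
    window = sum(labels[:n])          # count in the first segment
    best = window
    for i in range(n, len(labels)):   # slide the segment one sentence at a time
        window += labels[i] - labels[i - n]
        best = max(best, window)
    return 100.0 * best / n


# Toy 10-sentence filing (made-up labels, not real data)
labels = [0, 1, 1, 0, 0, 0, 1, 0, 0, 0]
primary = cs_measure(labels)            # 3 of 10 sentences
top5 = cs_measure_top(labels, 5)        # 2 of the first 5 sentences
rolling3 = cs_measure_rolling(labels, 3)  # best 3-sentence segment holds 2
```

In the toy filing, the rolling measure exceeds the primary measure because it isolates the densest segment, which is the intended behavior for filings where climate solutions text is concentrated in one part of Item 1.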
Validation analysis
We conduct three sets of validation analyses to demonstrate that the climate solutions measure reflects companies developing or deploying products or services that help reduce emissions.
In the first two validation analyses in Fig. 2, we estimate the following ordinary least squares (OLS) regression to compare the climate solutions measure with MSCI green revenue, green patents data, and innovation measures for firm i in year t:
$$\begin{aligned}\mathrm{Validation\ measures}_{i,t} ={}& \beta_{0}+\beta_{1}\,\mathrm{Climate\ Solutions\ Measure}_{i,t} \\ &+\sum \mathrm{Controls}+\sum \mathrm{Fixed\ effects}+\epsilon \end{aligned} \tag{1}$$
In the first validation analysis, the dependent variables are the percent of revenue and the percent of patent value related to low-carbon products, as classified by MSCI. These two MSCI variables are available only for the most recent year, fiscal 2022, which results in a smaller sample. To directly compare how changes in the climate solutions measure relate to changes in the MSCI green revenue and green patent percentages, we do not include control variables, but the results remain robust to including controls for firm size and age. For all regressions, we include one specification without fixed effects and one with industry-year fixed effects. Industry-year fixed effects are GICS industry indicators interacted with year indicators, which control for time-varying differences in the outcome variable across industries. This research design compares the outcome for firms with a higher climate solutions measure with that for firms with a lower measure, within each industry-year. Standard errors are clustered at the firm level to address potential correlation across years within a firm. The results are presented in Supplementary Table 4 Panel A.
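To illustrate what the industry-year fixed effects do, the sketch below estimates the slope on a single regressor after demeaning both variables within each industry-year cell, which is numerically equivalent to including a full set of industry-year indicators. It is a simplified illustration with made-up data and a hypothetical function name; it omits control variables and the firm-clustered standard errors used in the actual regressions.

```python
from collections import defaultdict


def within_slope(y, x, groups):
    """OLS slope of y on x after absorbing group fixed effects.

    Demeans y and x within each group (here, an industry-year cell) and
    regresses demeaned y on demeaned x.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])   # sum of y, sum of x, count
    for yi, xi, g in zip(y, x, groups):
        s = sums[g]
        s[0] += yi
        s[1] += xi
        s[2] += 1
    num = den = 0.0
    for yi, xi, g in zip(y, x, groups):
        sy, sx, n = sums[g]
        dy, dx = yi - sy / n, xi - sx / n
        num += dx * dy
        den += dx * dx
    return num / den


# Toy data: outcome = 2 * measure + an industry-year effect (all values made up)
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
groups = [("Energy", 2022)] * 3 + [("Utilities", 2022)] * 3
effects = {("Energy", 2022): 10.0, ("Utilities", 2022): -5.0}
y = [2.0 * xi + effects[g] for xi, g in zip(x, groups)]

beta1 = within_slope(y, x, groups)   # recovers the slope of 2 despite the level shifts
```

Because the industry-year level shifts are removed before estimation, the coefficient is identified only from differences between firms in the same industry and year.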
In the second validation analysis, we employ four complementary measures of firm-level innovation. R&D expenditures are research and development expenses (xrd) as disclosed in financial statements, retrieved from Compustat. Following prior literature, we treat missing R&D expenditures as zero and include a dummy variable for firms with missing R&D data [57]. To mitigate concerns about the prevalence of missing data, we also report results limited to industries where the average rate of missing R&D data is below the median, and we show results using specifications where missing values are not replaced with zero. Knowledge capital quantifies the stock of accumulated R&D investment by capitalizing R&D expenditures using industry-specific depreciation rates, reflecting firms’ long-term innovation assets [58]. RDC (Research & Development Capitalization) similarly captures the capitalized value of R&D but is estimated based on its expected contribution to future revenues, providing an alternative perspective on the persistence of innovation efforts [59]. Trade secret is the number of trade secrets explicitly referenced in a firm’s 10-K filings, serving as a proxy for proprietary knowledge protection strategies [60]. We scale all innovation measures by firm size, using revenue retrieved from Compustat, and winsorize the variables at the 1st and 99th percentiles. Together, these measures provide a comprehensive assessment of firms’ innovation strategies, spanning both tangible R&D investments and intellectual property protection mechanisms. We control for firm size and age: firm size is the natural logarithm of revenue in year t-1, and firm age is the number of years since the firm’s inclusion in Compustat, which allows us to compare firms at a similar point in their lifecycle. Standard errors are clustered at the firm level. The results are presented in Supplementary Table 4 Panels B, C.
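The data preparation steps for the R&D measure (zero-filling with a missing-data dummy, scaling by revenue, and winsorizing) might look like the sketch below. The function names and toy values are hypothetical, and the percentile interpolation rule (the "inclusive" method of `statistics.quantiles`) is an assumption, since the text does not specify one.

```python
import statistics


def fill_missing_rd(xrd):
    """Replace missing R&D (None) with 0 and return a missing-data dummy."""
    filled = [0.0 if v is None else v for v in xrd]
    missing = [1 if v is None else 0 for v in xrd]
    return filled, missing


def winsorize(values, lower_pct=1, upper_pct=99):
    """Clip values at the given percentiles (assumed interpolation rule)."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, lo), hi) for v in values]


# Toy firm-years (made-up numbers): R&D expense and revenue
xrd = [5.0, None, 12.0, 0.5]
revenue = [100.0, 80.0, 60.0, 10.0]

filled, missing = fill_missing_rd(xrd)
rd_intensity = [r / s for r, s in zip(filled, revenue)]   # scale by firm size
rd_intensity_w = winsorize(rd_intensity)                  # clip the tails
```

The dummy lets the regression distinguish firms that report zero R&D from firms whose R&D is simply not disclosed.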
In the third validation test, we use difference-in-differences models to examine how the climate solutions measure changes after two major climate policy interventions for firm i in year t:
$$\begin{aligned}\mathrm{Relevant\ Climate\ Solutions\ Measure}_{i,t} ={}& \beta_{0}+\beta_{1}\,\mathrm{Post}_{t}\times \mathrm{Relevant\ Industries}_{i} \\ &+\sum \mathrm{Fixed\ effects}+\epsilon \end{aligned} \tag{2}$$
For the IRA event, the dependent variable is the IRA climate solutions measure, which is the percent of sentences from 10-K Item 1 containing climate solutions topics that are covered by the IRA. We manually identify topics covered by the IRA, referencing the Guidebook on the Inflation Reduction Act released by the White House [61]. The variable of interest is the interaction term Post-IRA × High IRA Industries. Post-IRA is an indicator for fiscal years 2021 and 2022, for which financial reports are released in 2022 and 2023, after the announcement of the IRA in 2022. To identify industries relevant to the IRA, we calculate the average IRA climate solutions measure by industry using observations before fiscal year 2021, and classify industries above the median as High IRA Industries. The coefficient on the interaction term represents the change in the IRA climate solutions measure in fiscal years 2021 and 2022 relative to earlier years, for High IRA Industries relative to low IRA industries. We include firm fixed effects and year fixed effects, which control for firm-level time-invariant characteristics and annual time trends, respectively. Standard errors are clustered at the firm level.
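The classification of High IRA Industries and the construction of the interaction term can be sketched as below. This is an illustration under stated assumptions: the field names (`industry`, `fyear`, `ira_measure`) and the toy rows are hypothetical, and an industry exactly at the median is treated as not high.

```python
import statistics
from collections import defaultdict


def high_ira_industries(rows, cutoff_year=2021):
    """Industries whose pre-period mean IRA measure is above the median industry."""
    by_industry = defaultdict(list)
    for r in rows:
        if r["fyear"] < cutoff_year:            # pre-period observations only
            by_industry[r["industry"]].append(r["ira_measure"])
    means = {ind: statistics.mean(v) for ind, v in by_industry.items()}
    median = statistics.median(means.values())
    return {ind for ind, m in means.items() if m > median}


def add_did_terms(rows, high, post_years=(2021, 2022)):
    """Attach Post, High, and Post x High indicators to each firm-year."""
    for r in rows:
        r["post"] = int(r["fyear"] in post_years)
        r["high"] = int(r["industry"] in high)
        r["post_x_high"] = r["post"] * r["high"]
    return rows


# Toy firm-years (made-up values)
rows = [
    {"industry": "Utilities", "fyear": 2019, "ira_measure": 4.0},
    {"industry": "Utilities", "fyear": 2020, "ira_measure": 6.0},
    {"industry": "Energy", "fyear": 2020, "ira_measure": 2.0},
    {"industry": "Materials", "fyear": 2020, "ira_measure": 0.5},
    {"industry": "Utilities", "fyear": 2022, "ira_measure": 9.0},
    {"industry": "Materials", "fyear": 2022, "ira_measure": 0.6},
]
high = high_ira_industries(rows)
rows = add_did_terms(rows, high)
```

Classifying industries on pre-period data only keeps the grouping from being affected by the policy response itself.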
We conduct a similar analysis for the renewable portfolio standards (RPS) events, with modifications because RPS adoption is staggered over time and varies by state. The dependent variable is the renewable energy topic, which is the percent of sentences from 10-K Item 1 containing climate solutions topics under the renewable energy topic group. The variable of interest is the interaction term Post-RPS (weighted) × High Renewable Industries. Post-RPS (weighted) is a weighted average of indicators for states after they pass renewable portfolio standards, where the weight is based on a firm’s distribution of employees in each state, obtained from InfoGroup. The coefficient on the interaction term represents the change in the renewable energy topic for high renewable industries relative to low renewable industries as Post-RPS (weighted) increases by one unit. We include firm fixed effects and year fixed effects, and standard errors are clustered at the firm level. The results are presented in Supplementary Table 4 Panel D.
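A minimal sketch of the Post-RPS (weighted) variable follows. The state codes and adoption years are placeholders rather than actual RPS dates, and the function name is hypothetical; the only logic taken from the text is the employee-weighted average of post-adoption indicators.

```python
def post_rps_weighted(employee_share, rps_year, year):
    """Employee-weighted share of a firm's states that have passed an RPS by `year`.

    employee_share: state -> share (or count) of the firm's employees
    rps_year:       state -> year the state passed its RPS (absent if never)
    """
    total = sum(employee_share.values())
    exposed = sum(
        share
        for state, share in employee_share.items()
        if state in rps_year and year >= rps_year[state]
    )
    return exposed / total


# Placeholder states and adoption years (illustrative only)
employee_share = {"AA": 0.6, "BB": 0.3, "CC": 0.1}
rps_year = {"AA": 2007, "BB": 2015}

before = post_rps_weighted(employee_share, rps_year, 2005)   # no state treated yet
middle = post_rps_weighted(employee_share, rps_year, 2010)   # only AA treated
after = post_rps_weighted(employee_share, rps_year, 2016)    # AA and BB treated
```

The variable ranges from 0 to 1, so a one-unit increase corresponds to moving from no employees in post-RPS states to all employees in post-RPS states.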
Climate solutions and greenhouse gas emissions
In Fig. 3, we conduct the following OLS regression to compare the climate solutions measure to greenhouse gas emissions for firm i in year t:
$$\begin{aligned}\mathrm{Greenhouse\ Gas\ Emissions}_{i,t} ={}& \beta_{0}+\beta_{1}\,\mathrm{Climate\ Solutions\ Measure}_{i,t} \\ &+\sum \mathrm{Controls}+\sum \mathrm{Fixed\ effects}+\epsilon \end{aligned} \tag{3}$$
The dependent variables are greenhouse gas emissions measures. We retrieve greenhouse gas emissions data from TruCost, which provides firm-year-level data on absolute emissions and emissions intensity for scope 1, scope 2, scope 3 upstream, and scope 3 downstream, drawing on both firm disclosures and estimates derived from a firm’s industry activities. As the greenhouse gas measures are skewed to the right, we take the logarithm so that their distribution more closely resembles a normal distribution. As with prior specifications, we control for firm size and age, include specifications with and without industry-year fixed effects, and cluster standard errors at the firm level. The results are presented in Supplementary Table 5 Panels A, B. To address potential concerns about the quality of corporate greenhouse gas emissions data, we repeat the analysis using a subsample of firms that likely report more reliable data. Specifically, we identify firms with a sustainability committee on their board, as this reflects stronger governance over environmental reporting. The results remain similar, as shown in Supplementary Table 5 Panels C, D. In Supplementary Table 5 Panel E, as additional measures related to climate risks, we use emissions scores from Refinitiv and MSCI to capture a firm’s climate risk management. For ease of comparison, we standardize the emissions scores to have a mean of 0 and a standard deviation of 1.
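The two transformations described above, the log transform of emissions and the standardization of emissions scores, can be sketched as follows. The use of the natural logarithm and of the population standard deviation are assumptions, and the values are made up.

```python
import math
import statistics


def log_transform(values):
    """Natural log of strictly positive values (assumed log base)."""
    return [math.log(v) for v in values]


def standardize(values):
    """Rescale to mean 0 and standard deviation 1 (population SD assumed)."""
    mu = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd for v in values]


# Toy right-skewed emissions (tCO2e) and toy emissions scores
emissions = [1.0, 10.0, 100.0, 10000.0]
log_emissions = log_transform(emissions)

scores = [20.0, 50.0, 65.0, 90.0]
z_scores = standardize(scores)
```

After standardization, a regression coefficient can be read as the change in the score, in standard deviations, associated with a one-unit change in the climate solutions measure.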
Carbon abatement potential and costs
Project Drawdown provides estimates of each climate solution’s abatement potential and of the costs of achieving that abatement [56]. These estimates come from analytical models, backed by extensive literature and data, that estimate the relevant impact under two scenarios for the period from 2020 to 2050. The first scenario is more conservative and estimates a two-degree temperature increase by 2100. The second scenario is more ambitious and estimates a 1.5-degree temperature increase by 2100. We use the estimates from the first scenario to stay on the conservative side.
For each solution, abatement potential is the CO2 equivalent reduction brought by the technology between 2020 and 2050. In terms of costs, Project Drawdown provides the first cost and the lifetime cost. We focus on the first cost, as data on lifetime cost are less widely available. The first cost reflects the upfront capital expenditure required to implement the climate solution between 2020 and 2050, measured relative to a baseline scenario in which the climate solution does not exist. For example, the baseline for onshore wind turbines is electricity generated by fossil fuel power plants. As such, costs can be negative if the climate solution is cheaper than the baseline (as with LED lighting). To ensure that costs are comparable across climate solutions, we focus on the first cost divided by the abatement potential, which gives each solution’s cost per unit of carbon reduction. Figure 6a plots each topic, where the x-axis shows the average cost per abatement (USD billion per Gt CO2e abatement) and the y-axis shows the abatement potential (Gt CO2e) of the climate solutions technology. Figure 6b plots a similar chart at the industry level. To create this, we first calculate the average climate solutions measure for each topic within an industry. Then, we compute the industry’s weighted average cost per abatement and abatement potential, using each topic’s relative share within the industry as weights.
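The cost-per-abatement ratio and the industry-level weighted averages can be sketched as below. The topic names, costs, and abatement figures are invented for illustration and are not Project Drawdown estimates; the function names are hypothetical.

```python
def cost_per_abatement(first_cost_usd_bn, abatement_gt):
    """First cost per unit of abatement (USD billion per Gt CO2e)."""
    return first_cost_usd_bn / abatement_gt


def industry_weighted(topic_share, topic_value):
    """Average of topic_value weighted by each topic's share within the industry."""
    total = sum(topic_share.values())
    return sum(share * topic_value[t] for t, share in topic_share.items()) / total


# Made-up topics: (first cost in USD billion, abatement potential in Gt CO2e)
topics = {"topic_a": (100.0, 50.0), "topic_b": (-20.0, 10.0)}
cpa = {t: cost_per_abatement(c, a) for t, (c, a) in topics.items()}
potential = {t: a for t, (c, a) in topics.items()}

# Average climate solutions measure by topic within one hypothetical industry
industry_topic_share = {"topic_a": 3.0, "topic_b": 1.0}
industry_cpa = industry_weighted(industry_topic_share, cpa)
industry_potential = industry_weighted(industry_topic_share, potential)
```

Here `topic_b` has a negative cost per abatement, mirroring the case where a solution is cheaper than its baseline.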
Application 1: Revenue growth
To examine the revenue growth of firms with higher climate solutions measures in Fig. 7, we estimate the following OLS specification for firm i in year t:
$$\begin{aligned}\mathrm{Revenue\ Growth}_{i,t} ={}& \beta_{0}+\beta_{1}\,\mathrm{Climate\ Solutions\ Measure}_{i,t} \\ &+\sum \mathrm{Controls}+\sum \mathrm{Fixed\ effects}+\epsilon \end{aligned} \tag{4}$$
The dependent variable is year-over-year revenue growth, calculated using annual revenue data (revt) from Compustat. We include specifications with and without industry-year fixed effects, and cluster standard errors at the firm level. To mitigate the concern that confounding variables correlated with the climate solutions measure explain its relationship with revenue growth, we control for the following firm characteristics: firm size, age, leverage, capital intensity, and profitability. We winsorize financial variables at the 1st and 99th percentiles.
We conduct cross-sectional analysis separating firms in industries with high or low patent protection. For each industry, we create a weighted average patent count using firm-level patent counts in MSCI data for fiscal 2022, and weight each observation based on the firm’s revenue. In other words, this variable is equal to the sum of a firm’s patent counts multiplied by the revenues of a firm over the revenues of the whole industry, across all firms in that industry. We then repeat the baseline revenue growth regression separately for subsamples of firms in industries with low and high patents based on the median value. The baseline and cross-sectional results are presented in Supplementary Table 6 Panel A.
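The revenue-weighted industry patent count and the median split can be sketched as follows. Field names and numbers are hypothetical, and placing an industry exactly at the median in the low group is an assumption.

```python
import statistics
from collections import defaultdict


def revenue_weighted_patents(firms):
    """Per industry: sum over firms of patents * (firm revenue / industry revenue)."""
    by_industry = defaultdict(list)
    for f in firms:
        by_industry[f["industry"]].append(f)
    result = {}
    for ind, members in by_industry.items():
        total_rev = sum(m["revenue"] for m in members)
        result[ind] = sum(m["patents"] * m["revenue"] / total_rev for m in members)
    return result


# Toy fiscal-2022 firms (made-up values)
firms = [
    {"industry": "A", "patents": 10, "revenue": 100.0},
    {"industry": "A", "patents": 30, "revenue": 300.0},
    {"industry": "B", "patents": 2, "revenue": 50.0},
    {"industry": "C", "patents": 8, "revenue": 10.0},
]
weighted = revenue_weighted_patents(firms)
median = statistics.median(weighted.values())
high_patent = {ind for ind, v in weighted.items() if v > median}
low_patent = set(weighted) - high_patent
```

Weighting by revenue means that the patent count of an industry's largest firms drives its classification.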
We then repeat the revenue growth analysis, separating topics based on the two climate solutions characteristics: cost per abatement and abatement potential. We first group climate solutions topics into high versus low groups for each of the two characteristics based on the median. We then create scaled measures for each of the four groups of topics, calculated as the percent of sentences from 10-K Item 1 containing climate solutions topics that belong to that group. To enhance comparability across these topic groups, we standardize the measures to have a mean of 0 and a standard deviation of 1. For example, High Cost per Abatement is defined as the percent of sentences from 10-K Item 1 containing climate solutions topics that belong to the high cost per abatement group, standardized. The results are presented in Supplementary Table 6 Panel B.
We conduct four robustness tests to mitigate potential concerns related to the revenue growth results. First, to mitigate concerns about reverse causality, we replace revenue growth with one-year forward revenue growth and with three-year moving average revenue growth from t = 0 to t = 2, and our results remain robust. Second, we include firm fixed effects to control for time-invariant firm characteristics, and the results remain robust when we use the three-year moving average revenue growth. We focus on the three-year moving average because revenue growth tends to be slow-moving, and using firm fixed effects with one-year revenue growth may absorb too much of the variation, limiting our ability to detect meaningful effects. Third, to ensure comparability between firms with and without climate solutions, we apply entropy balancing on the first three moments of the control variables [62]. As entropy balancing requires discrete treatment variables, we classify firms as having climate solutions if the climate solutions measure is above 1% or 5%, which also helps mitigate concerns about the right skew of the main climate solutions measure. Our results remain robust when we repeat the main specifications with these alternative measures for climate solutions, and when using entropy-balanced samples. These robustness tests are presented in Supplementary Table 6 Panels C, D.
Finally, we acknowledge that the R-squared is relatively low in this analysis, which likely reflects the wide range of factors that influence revenue growth. We benchmark our R-squared against prior studies and find that our explanatory power falls within a reasonable range of the literature (R-squared of around 5–15%) [63,64]. To further assess potential concerns related to confounding variables, we follow Oster (2019) to examine coefficient stability in the presence of unobservable factors [65]. Specifically, the method compares the relative movements of coefficients and R-squared values from regressions with and without control variables to estimate δ, which captures how influential unobservable factors would need to be, relative to observable factors, to produce a treatment effect of zero. Using a maximum R-squared ($R^{2}_{\max}$) set to 1.3 times the R-squared in Table 6, Panel A, Column 2, as recommended by Oster (2019), our estimated δ is 5. A δ of 5 suggests that unobservable factors would need to be five times as influential as observable ones to explain away the estimated effect, providing some comfort over confounding concerns.
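For intuition, a commonly used approximation to Oster's δ is shown below. The notation is introduced here for illustration and is not taken from the text: $\mathring{\beta}$ and $\mathring{R}^{2}$ denote the coefficient and R-squared from the regression without controls, and $\tilde{\beta}$ and $\tilde{R}^{2}$ those from the regression with controls. The reported δ may instead come from the exact estimator, which solves a more general expression, so this should be read as a sketch rather than the calculation actually performed.

$$\delta \approx \frac{\tilde{\beta}\,\left(\tilde{R}^{2}-\mathring{R}^{2}\right)}{\left(\mathring{\beta}-\tilde{\beta}\right)\left(R^{2}_{\max}-\tilde{R}^{2}\right)},\qquad R^{2}_{\max}=1.3\,\tilde{R}^{2}$$

The expression follows from setting the bias-adjusted coefficient $\tilde{\beta}-\delta\,(\mathring{\beta}-\tilde{\beta})\,(R^{2}_{\max}-\tilde{R}^{2})/(\tilde{R}^{2}-\mathring{R}^{2})$ to zero and solving for δ. It shows why δ is large when adding controls moves the coefficient little relative to how much it raises the R-squared.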
Application 2: Political affiliation of firm locations
To capture the political environment of a firm’s operations, we develop a weighted measure based on state-level voting patterns weighted by the geographic distribution of its employees. For each state, we measure Republican vote share as the percentage of votes for the Republican candidate in the 2020 presidential election. We also repeat the analysis using the 2016 presidential election outcomes and find similar results. Employee distribution by state is obtained from InfoGroup, and we compute the weighted average of state-level Republican vote shares, with weights given by the proportion of the firm’s employees in each state. We define a firm as being located more in Republican voting states (firms in Republican states) if the weighted measure is above 50%.
To examine how the climate solutions measure differs by the political affiliation of a firm’s operations, we estimate the following OLS specification for firm i in year t:
$$\begin{aligned}\mathrm{Climate\ Solutions\ Measure}_{i,t} ={}& \beta_{0}+\beta_{1}\,\mathrm{Firms\ in\ Republican\ States}_{i,t} \\ &+\sum \mathrm{Controls}+\sum \mathrm{Fixed\ effects}+\epsilon \end{aligned} \tag{5}$$
The dependent variable is the climate solutions measure, and we also show results separately for climate solutions topics in the high or low groups in two characteristics, cost per abatement and abatement potential, as defined above. As with prior specifications, we control for firm size and age, include specifications with and without industry-year fixed effects, and cluster standard errors at the firm level. The results are presented in Supplementary Table 7.
Application 3: Climate solutions topics and industries
For each climate solutions sentence, we plot the two-dimensional embeddings in Fig. 9. To do so, we first generate high-dimensional sentence embeddings using a transformer-based language model from the SentenceTransformers library, trained on large English text corpora to capture semantic meaning. These embeddings are then projected into two dimensions with uniform manifold approximation and projection (UMAP).
In the figure, the reduced two-dimensional embeddings are clustered using hierarchical density-based spatial clustering of applications with noise (HDBSCAN) to identify distinct topic clusters. We identify the 15 climate solutions topics from Project Drawdown with the highest frequency, and then use the HDBSCAN clusters to annotate each solution in the cluster where it is most prevalent. For better visualization, we exclude the labels for nuclear power and hydropower from the figure, as these topics are located far from the main concentration of sentences. We also include the topic plant-rich diets as an example of a climate solution that is more isolated in one industry.
To examine stock return synchronicity for industry group pairs that are becoming more similar in climate solution topics, we estimate the following OLS specification for each pair of industry groups j and k in year-month t:
$$\begin{aligned}\mathrm{Synchronicity}_{j,k,t} ={}& \beta_{0}+\beta_{1}\,\mathrm{Climate\ Solutions\ Similarity}_{j,k,t} \\ &+\sum \mathrm{Fixed\ effects}+\epsilon \end{aligned} \tag{6}$$
The dependent variable, Synchronicity, is the stock return synchronicity between two industries in a given month, constructed following prior literature [46,66]. Specifically, we compute value-weighted monthly stock returns at the four-digit GICS industry group level, then estimate a rolling 36-month time-series regression of the focal industry’s returns on the connected industry’s returns. The adjusted R-squared from this regression is transformed using the log odds ratio, which turns a variable bounded between 0 and 1 into an unbounded continuous variable:
$$\mathrm{Synchronicity}_{j,k,t}=\ln \left(\frac{R^{2}}{1-R^{2}}\right)$$
This measure is computed for each unique industry pair and month, with higher values reflecting stronger co-movement in returns, suggesting greater alignment in economic fundamentals.
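The transformation can be sketched for a single estimation window as below. The return series are made up and much shorter than the 36 months used in the text, the function names are hypothetical, and the handling of windows where the adjusted R-squared is not strictly between 0 and 1 is not described in the text, so this sketch simply raises an error in that case.

```python
import math


def r_squared(y, x):
    """R-squared from a univariate OLS regression of y on x with an intercept."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy ** 2 / (sxx * syy)


def adjusted_r_squared(r2, n, k=1):
    """Adjusted R-squared for n observations and k regressors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def synchronicity(r2):
    """Log-odds transform ln(R2 / (1 - R2)); defined only for 0 < R2 < 1."""
    if not 0.0 < r2 < 1.0:
        raise ValueError("R-squared must lie strictly between 0 and 1")
    return math.log(r2 / (1.0 - r2))


# Toy returns for a focal industry (y) and a connected industry (x)
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]
r2 = r_squared(y, x)
adj_r2 = adjusted_r_squared(r2, n=len(y))
sync = synchronicity(adj_r2)
```

An R-squared of 0.5 maps to a synchronicity of zero, with values above 0.5 mapping to positive numbers and values below 0.5 to negative numbers.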
The independent variable, climate solutions similarity, is measured as the cosine similarity between the climate solution topic vectors of each industry pair in each year. Specifically, for each industry-year, we first compute the average share of 10-K Item 1 sentences that contain each of the 88 topics across the firms. We then calculate the cosine similarity between these vectors containing the 88 topics for each pair of industries in the same year. A higher value indicates that the two industries disclose more similar climate solution topics in their 10-K filings. We include one specification without fixed effects, and one with year-month fixed effects to account for common shocks or macroeconomic events that may influence all industries’ stock return co-movements in a given month. We cluster standard errors at the industry group pair level. The results are presented in Supplementary Table 8.
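The similarity calculation can be sketched as follows, using three-element vectors in place of the 88-topic vectors for brevity. The function name and values are illustrative, and returning 0 when an industry has no climate solutions sentences (a zero vector) is an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two topic-share vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0   # assumed convention for an industry with no topic mentions
    return dot / (norm_u * norm_v)


# Toy industry-year topic vectors (average share of sentences per topic)
industry_j = [2.0, 0.0, 1.0]
industry_k = [4.0, 0.0, 2.0]   # same topic mix as j, twice the intensity
industry_m = [0.0, 3.0, 0.0]   # entirely different topic

sim_jk = cosine_similarity(industry_j, industry_k)
sim_jm = cosine_similarity(industry_j, industry_m)
```

Because cosine similarity ignores vector length, two industries with the same topic mix are maximally similar even if one discusses climate solutions far more often than the other.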
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.