Scientific Text Analysis with Robots applied to observatory proposals

An in-depth analysis of the ESO experiment exploring the impact of AI, specifically Large Language Models like ChatGPT, on scientific proposal writing and evaluation. This study reveals critical insights into AI's capabilities and limitations in academic contexts.

The European Southern Observatory (ESO) conducted an experiment for its P112 Call for Proposals to evaluate the disruptive potential of transformer-based large language models (e.g., ChatGPT) in proposal reviewing and grading. Key findings indicate that ChatGPT-adjusted proposals generally receive lower grades from human reviewers. ChatGPT 3.5 was found unreliable for scientific references, while newer versions showed improvement but remained imperfect. When assessing proposals, ChatGPT appears to generate reasonable summaries but struggles to identify weaknesses, and it tends to assign more favorable grades than human reviewers, often favoring its own AI-generated content. This highlights the need for observatories to adapt to evolving AI technologies and for researchers to exercise caution, particularly regarding scientific rigor and confidentiality.


Executive Impact: Key Findings at a Glance

The ESO experiment reveals significant insights into AI's current state and future potential in academic research. Here are the core metrics that highlight its immediate implications:

-0.18 Mean Grade Difference (AI vs. Original)

5 Proposals Tested

4 LLM Versions Examined

10 AI Reviews Generated (per proposal)

Deep Analysis & Enterprise Applications



Grade Discrepancy: AI vs. Human

ChatGPT-adjusted proposals generally received lower grades from human reviewers than the original proposals.

-0.18 Mean Grade Difference (AI vs. Original)
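A mean grade difference like this one is an average over paired grades, one pair per proposal. A minimal Python sketch, assuming ESO's scale where 1 is best and taking the difference as adjusted minus original; the source does not state its sign convention, so that choice is an assumption:

```python
from statistics import mean


def mean_grade_difference(original: list[float], adjusted: list[float]) -> float:
    """Mean of per-proposal differences (adjusted - original).

    Assumes ESO-style grades where 1 is best, so a positive value means the
    adjusted versions were graded worse on average. This sign convention is
    an illustrative assumption, not the study's stated definition.
    """
    if not original or len(original) != len(adjusted):
        raise ValueError("need one original and one adjusted grade per proposal")
    # Pair grades proposal by proposal so each text is compared with itself.
    return mean(a - o for o, a in zip(original, adjusted))
```

Pairing matters here: with only a handful of proposals, comparing each text against its own adjusted version removes the large proposal-to-proposal variation in quality.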

LLM Reliability: Reference Generation
LLM Version	Reference Accuracy	Notes
ChatGPT 3.5	Poor: Creates non-existent references with convincing format.	Likely random association of authors, dates, titles, journals.
ChatGPT 4	Improved: Gives links to actual papers, but not always up-to-date or comprehensive.	Requires human verification; can still be misleading if reviewer lacks field knowledge.
Gemini	Mixed: Correctly identifies papers but often with wrong titles/details.	Admitted to providing alternative links if original was inaccessible, raising concerns about integrity.
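Because every version still needs human verification, one cheap safeguard is to flag any cited title that does not appear in a trusted bibliography before a reviewer reads the text. A minimal sketch, assuming a hypothetical list of trusted titles (for example, exported from a curated reference manager); the function names are illustrative:

```python
import re


def normalize_title(title: str) -> str:
    # Lowercase and collapse runs of punctuation/whitespace to single spaces
    # so trivial formatting differences do not count as mismatches.
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def flag_unverified(cited: list[str], trusted: list[str]) -> list[str]:
    """Return cited titles with no exact normalized match in the trusted list."""
    known = {normalize_title(t) for t in trusted}
    return [t for t in cited if normalize_title(t) not in known]
```

The exact-match rule is deliberately strict: a real paper cited with a wrong title, the failure mode noted for Gemini, is flagged too, which routes it to a human rather than letting it pass.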

Case Study: Abstract Rewriting by LLMs

An example of how ChatGPT and Gemini modify an abstract, highlighting changes in verbosity, content accuracy, and potential misinterpretations.

Original Abstract Snippet

After the recent discovery of prominent spiral arms in the disc of the accreting binary IP Pegasi, we aim to study additional systems using Doppler imaging methods to resolve the detailed behaviour of the accretion disc and the impact region between the disc and the mass transfer stream. Using high-resolution spectroscopy, we can use the dynamical information provided in the broad strong emission lines to spatially resolve the structure of the accretion flow. Several recent theoretical efforts on the properties of such spiral waves and on the disc-stream impact region provide various conclusions/predictions which can be directly tested using these observations.

Gemini Version (170 words, +70% increase)

The recent discovery of prominent spiral arms in the accreting binary IP Pegasi has opened a new window into the dynamics of accretion discs and their interaction with mass transfer streams. We propose high-resolution, optical spectroscopy with the XSHOOTER instrument on the ESO VLT to conduct Doppler imaging of these spiral arms. This cutting-edge approach will allow us to spatially resolve the detailed behavior of the accretion disc in IP Pegasi, including the crucial impact region where the disc and the mass transfer stream collide. By analyzing the broad, strong emission lines in the spectra, we can extract crucial dynamical information and map the gas flows within the disc. This high-resolution data will enable us to directly test and differentiate between the various theoretical predictions on the properties of spiral waves and the disc-stream interaction region. Our observations will not only provide groundbreaking insights into the specific case of IP Pegasi but also establish a powerful new technique for studying accretion disc dynamics in a wider range of binary systems.

Note: Gemini inserted XSHOOTER, drawing on earlier exchanges in the conversation, even though the original proposal used UVES. It also shifted the focus from 'additional systems' to IP Pegasi specifically.

ChatGPT 3.5 Version (205 words, +100% increase)

Recent discoveries have revealed striking spiral arms within the accretion disc of the binary system IP Pegasi. This breakthrough opens a new avenue for understanding the intricate dynamics of accretion flows. We propose to extend this exploration to additional systems, employing high-resolution Doppler imaging to capture the fine details of the accretion disc and the critical interaction zone between the disc and the mass transfer stream. Our goal is to harness the power of high-resolution spectroscopy to decode the complex dynamical information embedded in the broad, strong emission lines characteristic of these systems. This approach will allow us to spatially resolve the structure of the accretion flow with unprecedented clarity. Recent theoretical advancements have made significant strides in predicting the properties of spiral waves and the disc-stream impact region. However, these predictions vary and require empirical validation. Our observations will provide the critical data needed to test these theoretical models, bridging the gap between theory and observation. By securing telescope time, we aim to deliver transformative insights into the physics of accretion discs, offering a clearer understanding of the mechanisms driving these cosmic phenomena. The implications of our study extend beyond IP Pegasi, potentially informing models of accretion in a wide range of astrophysical contexts.

Note: Very verbose; implies a 'recent discovery' when the original was already established. The final sentence, while useful for impact, doesn't fully align with the original intent. When asked for a concise version, ChatGPT produced a better summary, highlighting the importance of precise prompting.
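The length increases quoted for each version are relative word counts. A minimal sketch using a plain whitespace split (an assumption; the source does not say how words were counted). By this count the original snippet is roughly 100 words, consistent with the +70% and roughly +100% figures:

```python
def word_count(text: str) -> int:
    # Whitespace split: "high-resolution" and "conclusions/predictions"
    # each count as a single word.
    return len(text.split())


def percent_increase(original: str, rewritten: str) -> float:
    """Relative change in word count, in percent, from original to rewritten."""
    before = word_count(original)
    if before == 0:
        raise ValueError("original text is empty")
    return 100.0 * (word_count(rewritten) - before) / before
```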

AI vs Human Grading Discrepancy

ChatGPT systematically assigns better grades than human reviewers (numerically lower on ESO's scale, where 1 is best) and prefers its own AI-generated content.

-0.58 Mean Grade Difference (Distributed Peer Review (DPR) vs. ChatGPT - Original)

AI Self-Preference in Grading

The difference becomes even greater when comparing human reviews of original proposals to ChatGPT's reviews of its own adjusted proposals.

-0.99 Mean Grade Difference (DPR vs. ChatGPT - Adjusted)

ChatGPT Versions as Reviewers
ChatGPT Version	Grading Consistency	Average Grade Tendency	Key Observation
ChatGPT 3.5	Highly coherent	Lower grades (higher numerical values)	Less advanced model, but still coherent.
ChatGPT 4	Highly coherent	Higher grades (lower numerical values)	More advanced model; grades more generously than 3.5.
ChatGPT 4o	Highly coherent	Higher grades (lower numerical values)	More advanced model; grades more generously than 3.5.
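"Highly coherent" can be made measurable by repeating the review several times per proposal (10 AI reviews per proposal in this experiment) and reporting the spread of the grades. A minimal sketch; the model names used as dictionary keys are hypothetical:

```python
from statistics import mean, stdev


def grade_summary(grades_by_model: dict[str, list[float]]) -> dict[str, tuple[float, float]]:
    """Mean and sample standard deviation of repeated grades for each model.

    A small standard deviation indicates coherent (repeatable) grading; a
    lower mean indicates more favorable grading on a 1-is-best scale.
    """
    summary = {}
    for model, grades in grades_by_model.items():
        if len(grades) < 2:
            raise ValueError(f"{model}: need at least two grades to measure spread")
        summary[model] = (mean(grades), stdev(grades))
    return summary
```

Reporting both numbers separates the two columns of the table: consistency is the spread, while grade tendency is the mean.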

Case Study: ChatGPT-3.5 vs. ChatGPT-4o Proposal Reviews

Two reviews of the same proposal, by ChatGPT-3.5 and ChatGPT-4o, demonstrating their structure, analysis, and assigned grades.

ChatGPT-3.5 Review

Review: The proposal aims to utilize UVES to confirm the presence of certain structures in a specific stellar system, employing a methodology previously tested via theoretical simulations. The scientific rationale is well-grounded, emphasizing the importance of such stellar systems in studying stellar evolution and galactic dynamics. The detection of unique structures around such systems provides valuable insights into their formation and dissolution processes, contributing to our understanding of galactic evolution. The proposal's strength lies in its innovative approach, combining observational data with theoretical simulations to identify and study these structures. By targeting a subset of stars within the system, the authors plan to derive radial velocities and stellar abundances to confirm their membership in these structures. This confirmation would not only validate their methodology but also extend its applicability to other similar systems. However, some potential limitations need addressing. While the proposal acknowledges the role of certain observational data in detecting these structures, it could provide more detailed discussion on how data limitations affect the identification and confirmation of candidate stars. Additionally, the proposal would benefit from a clearer explanation of how the selected subset of stars was chosen and how potential binary systems will be accounted for in the analysis. Overall, the proposal presents a compelling scientific case with the potential for significant contributions to the field of stellar and galactic evolution. With careful consideration of the outlined limitations and further refinement of the experimental approach, this research has the potential for high impact.

Grade: 2.3 (Very Good: No significant weaknesses)

ChatGPT-4o Review

Science Relevance: The proposed study addresses a critical aspect of galactic astronomy: the dynamics and dissolution of stellar clusters within the Milky Way. The confirmation of specific stellar structures using advanced spectroscopic techniques is of high scientific relevance, as it provides insights into the processes governing cluster evolution and the interaction between clusters and the Galactic potential. This research has the potential to significantly enhance our understanding of stellar evolution, cluster dynamics, and the structure of the Milky Way.

Quality: The proposal is well-structured and builds on solid preliminary work, including simulations and data from high-precision astrometric surveys. The choice of the spectrograph for deriving radial velocities and stellar abundances is appropriate, and the methodology appears sound. The sample selection of 38 stars ensures a comprehensive analysis. The proposal addresses the limitations of existing data for faint stars and proposes a feasible observational strategy with a signal-to-noise ratio of 50 for accurate measurements.

Impact: Confirming the proposed stellar structures would provide crucial constraints on cluster dissolution rates and the initial mass function. The validation of the methodology and simulations can be extended to other clusters, offering a broader impact on the field. This work has the potential to advance our knowledge of the physical processes in the interstellar medium and the Galactic potential, contributing to a deeper understanding of the Milky Way's structure and evolution.

Grade Evaluation: The proposal is robust, with a clear scientific case and a well-defined methodology. The potential impact on the field of galactic astronomy is significant, and the use of advanced spectroscopic techniques for this study is well-justified. While there are minor uncertainties regarding the confirmation of some stars as cluster members due to the possible presence of binaries, the overall quality and relevance of the proposal are outstanding.

Grade: 1.2

This grade reflects the proposal's high scientific value, solid methodological approach, and potential for significant impact on our understanding of cluster dynamics and galactic structure.

Grade Discrepancy & Implications

ChatGPT-3.5 gave a grade of 2.3 ('Very Good'), while ChatGPT-4o assigned 1.2 ('Outstanding'). This large gap shows how differently AI versions can interpret and evaluate the same proposal, with the 4o version assigning a much higher priority. It has direct implications for scheduling: a 1.2 would likely place the proposal among the top-ranked, while a 2.3 would put it in a lower tier. The more structured format of the ChatGPT-4o assessment is also notable.
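To see why the gap matters, the sketch below places a single proposal's grade within a pool of competing grades. The pool values in the checks are hypothetical, and ranking purely by grade is a simplification of real scheduling:

```python
def rank_of(candidate_grade: float, pool: list[float]) -> int:
    """1-based rank of a grade among competing grades (lower grade is better).

    Ties are placed ahead of the candidate, a conservative choice.
    """
    return 1 + sum(1 for g in pool if g <= candidate_grade)
```

With a hypothetical pool of [1.5, 1.8, 2.0, 2.5], the same proposal ranks first with ChatGPT-4o's 1.2 but fourth with ChatGPT-3.5's 2.3.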

ESO Proposal Experiment Methodology

The structured approach taken by ESO to evaluate the impact of ChatGPT on scientific proposal processes.

Step 1: Selection of Test Proposals (5 proposals, diverse topics)

Step 2: Defining Commands for ChatGPT (improve text, cite, structure)

Step 3: Processing Proposals with ChatGPT (adjusted rationales, titles, abstracts)

Step 4: Evaluation (DPR grades of original vs. AI-adjusted proposals)
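The four steps can be sketched as a small pipeline. This is a minimal sketch, not the study's code: `rewrite` and `grade` are hypothetical callables standing in for an LLM call and a reviewer's grade, and the instruction string is illustrative rather than the prompt ESO used:

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable


@dataclass(frozen=True)
class Proposal:
    title: str
    abstract: str
    rationale: str


# Hypothetical stand-ins: rewrite(instruction, text) -> text, grade(p) -> grade (1 = best).
Rewrite = Callable[[str, str], str]
Grade = Callable[[Proposal], float]

INSTRUCTION = "Improve the text, add citations and structure."  # illustrative only


def adjust(p: Proposal, rewrite: Rewrite, instruction: str = INSTRUCTION) -> Proposal:
    # Step 3: produce AI-adjusted title, abstract and rationale; the original is left untouched.
    return replace(
        p,
        title=rewrite(instruction, p.title),
        abstract=rewrite(instruction, p.abstract),
        rationale=rewrite(instruction, p.rationale),
    )


def grade_pairs(proposals: Iterable[Proposal], rewrite: Rewrite, grade: Grade) -> list[tuple[float, float]]:
    # Step 4: grade each original and its adjusted version for a paired comparison.
    return [(grade(p), grade(adjust(p, rewrite))) for p in proposals]
```

Keeping the proposal immutable means the original and adjusted texts can be graded side by side without one overwriting the other.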

AI vs. Human in Instrument Knowledge
LLM	FORS2 Details	XSHOOTER Details	Overall Accuracy
ChatGPT 3.5	Proposed high-time resolution mode (decommissioned since 2018).	Recommended over FORS2 for faint targets (incorrect).	Mixed: Provided some reasonable distinctions but with critical factual errors.
Gemini	Correctly mentioned better for faint targets; provided links to web pages and calculators.	Claimed max resolution 55,000 (actual 18,000); limited set of grisms (incorrect, has extended range).	Mixed: Better on links and some details, but still significant inaccuracies and made-up information.
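Errors like these suggest checking an LLM's numeric instrument claims against a curated table maintained from official instrument documentation. A minimal sketch: the single reference entry mirrors the XSHOOTER figure quoted in the table above and is illustrative only, and the 10% tolerance is an arbitrary assumption:

```python
# Hypothetical curated reference table, keyed by (instrument, quantity).
# Entries should come from official instrument documentation; this one is illustrative.
REFERENCE: dict[tuple[str, str], float] = {
    ("XSHOOTER", "max_resolution"): 18_000,
}


def check_claim(instrument: str, quantity: str, claimed: float, rel_tol: float = 0.1) -> str:
    """Classify an LLM's numeric claim as 'ok', 'mismatch', or 'unknown'."""
    key = (instrument.upper(), quantity)
    if key not in REFERENCE:
        return "unknown"  # nothing to check against: route to a human
    expected = REFERENCE[key]
    return "ok" if abs(claimed - expected) <= rel_tol * expected else "mismatch"
```

Returning "unknown" rather than "ok" for unlisted quantities keeps the check conservative: anything the table cannot confirm still goes to a person.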


Strategic AI Integration Roadmap for Observatories

A phased approach to responsibly integrate AI into your proposal workflow, from initial assessment to full-scale, ethical implementation.

Phase 1: Pilot & Policy Definition

Conduct internal pilot programs with LLMs for proposal drafting and initial review. Develop clear guidelines and ethical policies for AI use, focusing on data confidentiality and integrity. Train core teams on best practices and limitations.

Phase 2: Tool Validation & Customization

Rigorously validate AI tools for scientific accuracy and bias detection against real-world data. Explore fine-tuning LLMs with observatory-specific knowledge to improve relevance and reduce factual errors. Integrate AI into internal, secure platforms.

Phase 3: Community Engagement & Adaptation

Engage with the broader scientific community to share findings and evolve best practices. Adapt proposal submission and review processes to account for AI-generated content, potentially introducing AI detection tools and explicit disclosure requirements.


Ready to Transform Your Research Workflow with AI?

Leverage our expertise to integrate advanced AI solutions, ensuring efficiency and scientific integrity. Book a personalized strategy session today.




Contact Us

1 888 985 3025

Solutions@OwnYourAi.com
