🤦🏽‍♂️

No, GPT4 can’t ace MIT: A Rebuttal of a Popular MIT Paper

Date: Jun 17, 2023
What follows is a critical analysis of “Exploring the MIT Mathematics and EECS Curriculum Using Large Language Models”
 
👤
This is a joint document written by three MIT EECS seniors (Class of 2024): Raunak Chowdhuri, Neil Deshmukh, and David Koplow.
🚨
On June 24th, Armando Solar-Lezama (Professor in EECS and COO and Associate Director of CSAIL, MIT), Tonio Buonassisi (Professor of Mechanical Engineering, MIT), and Yoon Kim (Assistant Professor in EECS and CSAIL, MIT) released a public statement regarding the paper. Please read it below.
[Image: the professors' public statement]
📌
Update: we've run preliminary replication experiments for all zero-shot testing here and have reviewed about 33% of the pure zero-shot data set. See the histogram page in the Google Sheet for the latest results; on the subset of 96 questions graded so far, roughly 32% are incorrect, 58% are correct, and the rest are invalid or mostly correct.
We’ve also run and released the expert-prompting zero-shot experiments, but haven’t had time to validate these yet. Some of the clear issues with this prompting system are detailed in the section below.
 
⚠️ We wanted to make clear that our grading process involved both our own manual grading and crowdsourcing some of the manual grading effort to PhD-level experts who reached out to us after our original post. As the manual grading process is still underway, we cannot verify that every question is graded correctly until we have finished grading entirely and double-checked the grades. We've made the grading spreadsheet public and enabled commenting so that we can welcome corrections from the community in the meantime. Thank you for your patience!
 
 

Summary

 
A paper seemingly demonstrating that GPT-4 could ace the MIT EECS + Math curriculum recently went viral on Twitter, getting over 500 retweets in a single day. Like most, we were excited to read the analysis behind such a feat, but what we found left us surprised and disappointed. Even though the authors of the paper said they manually reviewed the published dataset for quality, we found clear signs that a significant portion of the evaluation dataset was contaminated in a way that let the model cheat like a student who was fed the answers to a test right before taking it. We think this should call into greater question the recent flurry of academic work that uses Large Language Models (LLMs) like GPT to shortcut data validation, a foundational principle in any kind of science, and especially in machine learning. These papers are often uploaded to arXiv and widely shared on Twitter before any legitimate peer review, in this case potentially spreading bad information and setting a poor precedent for future work.
 
🕊️
Several of the authors listed on the discussed paper are undergraduate researchers. Consequently, we believe it's inappropriate to hold these individuals accountable for any lapses present in the work.
Instead, we believe the responsibility should lie with the supervising authors. They are the ones who are expected to ensure that the work meets the rigorous standards of public scholarship within their field.

Background

We discovered the paper on the morning of Friday, June 16, after it started to go viral on Twitter with headlines like "GPT-4 scores 100% on MIT EECS Curriculum." We decided to dig a little deeper. Within an hour, we were skeptical of the paper's methodology. Within two, we had realized that the dataset itself was suspect.
 
In the paper, it's reported that "a test set of 288 questions is randomly selected amongst questions without images and with solutions." This dataset (excluding the training set used to fine-tune the open-source LLMs) was published to GitHub with the paper's release, along with the code used to generate the reported test performance. However, it was thereafter removed by Professor Drori in this commit.
 
We have released an annotated copy of this test set in a Google Sheet.
 
We are confident that this file represents the test set analyzed in the paper: all file paths to data in the evaluation code point to it, there is no provided code that modifies its contents in any way, and it was available in the originally published GitHub repository. Additionally, the file satisfies all schema requirements specified in the paper (number of rows, etc.). This evidence strongly supports all of the claims below, but we acknowledge the possibility that this file was swapped out for a different one used for testing. If that is the case, we believe the burden of proof lies with the authors to publicly release this data and all analysis done with it.
 
What follows is an analysis of the dataset as well as some of the claims that were made within the research paper. Note that the computational experiments we were able to run were limited because:
  1. We are undergrads and don't have many GPT-4 credits with which to replicate this work.
  2. The paper just came out on arXiv today, and replicating all of the results would likely take so long that the "hype cycle" would have moved on to the next paper.
 
Nonetheless, there are many issues we were able to highlight with the elements of the work that we could analyze. Although it is only a small fraction of the entire 4,550-question dataset, this 288-problem test set was used for all experiments and was stated to have been sampled "randomly" from the full dataset, so it should be representative of the larger set. We will aim to follow up with a more detailed analysis of the paper and the dataset as we have more time and resources to dig into it.

Problems with the data

Unsolvable Questions (~4% of the test set)

We were surprised that any form of GPT-4 could produce a perfect score on the test set, so we began examining individual data points. It quickly became clear that no perfect score should be possible: at least 10 questions in the data set were unsolvable with the information provided, and another few weren't valid questions in this context at all. At least 4% of the data falls into this category.
 
Let’s begin with some egregious examples:
Below you are given the delays for the different gates you were permitted to use in part above. Compute the propagation delay of your circuit from D.
Which invocations run in parallel? (Assuming there are enough cores.)
Both of these are actual (full) questions from the dataset; neither provides the context necessary to give a valid answer.
 
Continuing:
This problem is a variation on problem 2. Once again, suppose we have a disk of radius 3 meters. This time, near a point at distance $r$ from the center, the disk has density $r+1$ kilograms per square meter. This time, the densest part of the disk is on the edge, where the density is $4 \mathrm{~kg} / \mathrm{m}^{2}$. The least dense part of the disk is at the center, where the density is $1 \mathrm{~kg} / \mathrm{m}^{2}$. Before computing the mass of the disk, guess how it compares to the mass of the disk from Problem 2. Which one is bigger or are they equal? Briefly explain why.
 
Since no information about problem 2 is provided, an answer to this question is impossible. But this is a relatively tame example. Maybe one would be satisfied that calculating the mass of the disk gets to the point of this problem. But you can’t defend a problem like this:
 
"At the command prompt, type: traceroute 18.31.0.200 Describe what is strange about the observed output, and why traceroute gives you such an output. Refer to the traceroute man page for useful hints. Copy/paste any of the relevant portions of output below."
 
The true answer cannot be found given this information, because the context is too limited, and without access to an interactive terminal (no such access was given in this work), it would be impossible for an LLM agent to answer. If GPT somehow knew about this IP address's strange output beforehand, it would be evidence of data leakage, since all the references to this IP address we could find online are in the context of the 6.033 (new: 6.1800) MIT class.
 
There are many such examples. We have annotated the dataset linked above in the spreadsheet. Red rows are questions that we believe to be unsolvable with the provided information. Yellow rows are suspect or strange questions that don't quite make sense as part of this dataset or that reference information not given in the problem, though that information might not be necessary to solve it. There are doubtless other examples we didn't notice in our first pass.
 
Here are all of the unsolvable problems we found (note: each of these quote blocks is the entire question provided to the models).
Now, we will use the sentence you chose in part 5.1.3 as an assumption, also assume that Dr. Evil does not own a dog, and then prove from these assumptions that Dr. Evil is not an animal lover. The first step in doing a resolution proof is to convert the premises to clausal form. You already did part of that! Below, you will select each of the premises that we will need for a proof. Clause 7 We then resolve clauses 3 and 5. Enter a clause as a list of lists of literal strings.
The above question is unsolvable because the model doesn't have access to the answer chosen in part 5.1.3, or even to the question from which to choose one.
There is no formula for the antiderivative of $e^{-x^{2}}$, and so there is no formula for integrals like $\int_{0}^{\cdot 1} e^{-x^{2}} d x$. This type of integral is actually quite important in probability and statistics as we will see later. In this problem, we use Taylor series to approximate it.
The above is unsolvable because there is no closed-form formula to compute and, more importantly, no question is actually being asked. It is a background-information blurb for a later question.
At the command prompt, type: traceroute 18.31.0.200 Describe what is strange about the observed output, and why traceroute gives you such an output. Refer to the traceroute man page for useful hints. Copy/paste any of the relevant portions of output below.
As described before, the system lacks the ability to access the terminal.
Below you are given the delays for the different gates you were permitted to use in part $\mathrm{D}$ above. Compute the propagation delay of your circuit from D.
No circuits are given.
Now suppose we start to gamble with this spinner. If I spin $x$, I win $x^{2}-4$ dollars. Notice that $x^{2}-4$ could be negative, and then I lose money. For instance, if I spin the number 1 , then I lose 3 dollars. If I spin once, what is the probability that I lose money?
What is the spinner? What possible values could it take?
Which invocations run in parallel? (Assuming there are enough cores.)
What even are the invocations?
Now suppose we start to gamble with this spinner. If I spin $x$, I win $x^{2}-4$ dollars. Notice that $x^{2}-4$ could be negative, and then I lose money. For instance, if I spin the number 1 , then I lose 3 dollars. Estimate the probability that I win at least 4 dollars. Is it closest to $1 / 6$ or $1 / 9$ or $1 / 18 ?$ Explain your reasoning.
Again, no definition of the spinner's properties!
Let's now go back to running our program with N = 16 and A = 0x300. We will use a 64 word cache that is 2-way set associative but now the block size is 2. This means that there are 16 sets, each of which has 2 cache lines with 2 words per line. Using this new cache configuration, what is the index of the j test instruction? Provide your answer in hexadecimal.
This problem presumably has a diagram that explains what the j test instruction is.
Now, we will use the sentence you chose in part 5.1.3 as an assumption, also assume that Dr. Evil does not own a dog, and then prove from these assumptions that Dr. Evil is not an animal lover. The first step in doing a resolution proof is to convert the premises to clausal form. You already did part of that! Below, you will select each of the premises that we will need for a proof. Clause 5 Now, we can actually do the proof, using FOL resoluton, starting with the four clauses above. We get another clause by resolving clauses 1 and 4. Enter a clause as a list of lists of literal strings.
Again, there is no provided selection from 5.1.3, nor even the question from which the model could make a selection.
This problem is a variation on problem 2. Once again, suppose we have a disk of radius 3 meters. This time, near a point at distance $r$ from the center, the disk has density $r+1$ kilograms per square meter. This time, the densest part of the disk is on the edge, where the density is $4 \mathrm{~kg} / \mathrm{m}^{2}$. The least dense part of the disk is at the center, where the density is $1 \mathrm{~kg} / \mathrm{m}^{2}$. Before computing the mass of the disk, guess how it compares to the mass of the disk from Problem 2. Which one is bigger or are they equal? Briefly explain why.
No answer from Problem 2 to compare to.
 
Some of the questions in the dataset aren't even actually questions. For example:
There is no formula for the antiderivative of $e^{-x^{2}}$, and so there is no formula for integrals like $\int_{0}^{\cdot 1} e^{-x^{2}} d x$. This type of integral is actually quite important in probability and statistics as we will see later. In this problem, we use Taylor series to approximate it.
 
Additionally, here is an example question from the dataset which is an assignment to create an NLP project proposal:
As mentioned on the course website and in class, for Phase 0, each student is expected to identify a distinct potential research problem they would like to work on. Even if you already have a self-formed team of classmates you would like to work with, each student must draft a unique idea for Phase 0. The idea should center around a tangible research question you would like to answer. We ask each student to submit a short write-up (plain text or upload a .txt or .pdf) that includes the following details. We encourage you to adopt the style of list format when submitting your items: • research problem (~2 sentences) • dataset(s) you plan to use (1 sentence) • metric(s) to assess the success of your idea/model (1 sentence) • idea(s) for an initial approach or model (1-2 sentences)
Notice that no question was actually asked; this is a description of an assignment that would need to be written. Additionally, even assuming the model was supposed to produce an output, there is no valid grading criterion for the model here.
 
Using text similarity, we found 14 questions (7 pairs) that were duplicates within the set of 288 questions examined; in these cases, the question strings were either entirely identical or differed only by extremely minor character-level noise.
✏️
We wanted to share a minor correction to the above sentence. We used text similarity to identify "duplicate" questions within the dataset that had only minor differences between them. Upon closer inspection, the "character-level noise" referred to above did actually carry some semantic meaning differentiating the questions. However, we still find it odd that sampling 288 questions from the original, diverse dataset of 4,500 yields 40 questions (~14%) that have a corresponding pair in the sample with >75% text similarity.
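For readers who want to reproduce this check, here is a minimal sketch of the kind of pairwise text-similarity pass described above. It is our own illustration, not the paper's or our exact script; the file name and column name are hypothetical placeholders for the released CSV's schema.
```python
# Minimal sketch of a pairwise duplicate check over the 288-question test set.
# Assumptions: the released CSV is named "test_set.csv" and stores the question
# text in a column called "question". Both are hypothetical placeholders.
import difflib
import itertools

import pandas as pd

df = pd.read_csv("test_set.csv")
questions = df["question"].astype(str).tolist()

# Compare every pair of questions and keep those above a 75% similarity ratio.
near_duplicates = [
    (i, j, ratio)
    for (i, a), (j, b) in itertools.combinations(enumerate(questions), 2)
    if (ratio := difflib.SequenceMatcher(None, a, b).ratio()) > 0.75
]

print(f"{len(near_duplicates)} question pairs exceed 75% text similarity")
```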
 
Given these unsolvable questions, the fact that GPT-4 is able to get 100% accuracy by any means is incredibly suspect: either solutions are leaking into the prompts at some stage, or the questions are not being graded correctly. These initial findings motivated us to investigate further, starting with the few-shot examples that are used as a secondary step if the model fails in zero-shot. Eventually, we found that both are true: solution information is leaked, and there are issues with the method used for grading the model's outputs.

Information Leak in Few Shot Examples

First, let's understand what the paper means by few-shot examples. In short, the authors perform a cosine-similarity search over OpenAI embeddings to find similar questions within the dataset and include those problems and their solutions as additional context in the model's prompt to help it solve the problem. This would be all right, so long as the examples were sufficiently different from the problem in question that they did not reveal unfair information.
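To make the mechanism concrete, here is a minimal sketch of this kind of embedding-based retrieval. This is our own illustration, not the paper's code; the embedding vectors are assumed to be precomputed (e.g. from OpenAI's embedding endpoint) and passed in as an array.
```python
# Sketch: few-shot retrieval by cosine similarity over question embeddings.
# `embeddings` is assumed to be an (n_questions, dim) array of precomputed
# embeddings; nothing here is copied from the paper's repository.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_few_shot(idx: int, embeddings: np.ndarray, k: int = 3) -> list[int]:
    """Return the indices of the k most similar *other* questions.

    Note that excluding only the question itself, as done here, still leaves
    the other parts of the same multi-part question eligible, which is exactly
    how near-identical examples (and their solutions) can end up in the prompt.
    """
    sims = [(cosine(embeddings[idx], embeddings[j]), j)
            for j in range(len(embeddings)) if j != idx]
    sims.sort(reverse=True)
    return [j for _, j in sims[:k]]
```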
 
When scanning the published test dataset at random, we noticed something odd: many of the "few-shot examples" provided to the model were almost word for word the same as the problems themselves. To understand this further, we wrote a simple script to measure the overlap between the provided few-shot example problem statements and the problem statements being tested:
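A minimal sketch of that idea follows. This is an illustration rather than our exact script; the file path and the few-shot column names are assumptions about the released CSV, not the paper's actual schema.
```python
# Sketch: for each test question, compute its best character-level similarity
# against the few-shot example questions included in its prompt.
# "test_set.csv" and the column names are hypothetical placeholders.
import difflib

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("test_set.csv")
few_shot_cols = ["few_shot_question_1", "few_shot_question_2", "few_shot_question_3"]

def best_overlap(row: pd.Series) -> float:
    question = str(row["question"])
    return max(
        difflib.SequenceMatcher(None, question, str(row[col])).ratio()
        for col in few_shot_cols
    )

overlaps = df.apply(best_overlap, axis=1)

plt.hist(overlaps, bins=20)
plt.xlabel("Best similarity between a question and its few-shot examples")
plt.ylabel("Number of questions")
plt.show()
```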
 
Plotting this overlap as a histogram yields the following:
[Histogram of text similarity between each question and its few-shot examples]
Many of the provided few-shot examples are nearly, if not entirely, identical to the problems themselves. This is an issue because it means the model is being given the answer to the question, or to a question very similar to it. Usually, this comes from multi-part questions that repeat a lot of shared context. We believe that, for a proper evaluation of GPT's solving ability, the other parts of a multi-part question should have been excluded entirely from the few-shot examples for that question. In fact, in our exploration, we found that these multi-part solutions often directly refer to, or outright give, the solution for the other part of the question that the model was tasked with solving.
 
Not only that: in digging through some of this data, we found instances of the entire question being repeated exactly. For example:
Question 42
Suppose that: • two different gene-editing processes A and B are running asynchronously using the same Lab • A and B both call lab.get(dnaX) for the same precursor dnaX • no other asynchronous processes are using lab For each of the following interleavings, referring to the line numbers 1-7 in get() in the provided code, decide whether the interleaving is impossible, leads to a race condition or deadlock, or runs safely; then explain your answer in one sentence. A runs lines 1, 4, 5, 6a; then B runs lines 1, 4, 5, 6a; then A finishes lines 6b and 7, then B finishes lines 6b and 7. Solution: impossible; After A runs line 5, there will be at least one tube in tubeMap, so B must proceed from line 1 to line 2.
Question #42 Few Shot (3)
Suppose that: • two different gene-editing processes A and B are running asynchronously using the same Lab • A and B both call lab.get(dnaX) for the same precursor dnaX • no other asynchronous processes are using lab For each of the following interleavings, referring to the line numbers 1-7 in get() in the provided code, decide whether the interleaving is impossible, leads to a race condition or deadlock, or runs safely; then explain your answer in one sentence. A runs lines 1, 4, 5, 6, 7, then B runs lines 1, 4, 5, 6, 7. Solution: impossible; After A runs line 5, there will be at least one tube in tubeMap, so B must proceed from line 1 to line 2.
As you can see, the questions are in fact identical, except for some meaningless stray characters: A runs lines 1, 4, 5, 6a; then B runs lines 1, 4, 5, 6a; then A finishes lines 6b and 7, then B finishes lines 6b and 7. vs. A runs lines 1, 4, 5, 6, 7, then B runs lines 1, 4, 5, 6, 7.
✏️
Thanks to user @SemicolonExp on Twitter, we realized there is a semantic difference in the above characters that we missed on our first pass. We note, however, that the exact word-for-word solution is still being passed as input in the few-shot prompt for this case, so we believe the criticism in this section is still relevant.
 
Even the answers are identical in these two cases.
 
Here’s a list of some of the worst answer leaks from few-shot prompting we identified, where the question and answer are (essentially) exactly the same:
Question
This problem is a variation on problem 2. Once again, suppose we have a disk of radius 3 meters. This time, near a point at distance $r$ from the center, the disk has density $r+1$ kilograms per square meter. This time, the densest part of the disk is on the edge, where the density is $4 \mathrm{~kg} / \mathrm{m}^{2}$. The least dense part of the disk is at the center, where the density is $1 \mathrm{~kg} / \mathrm{m}^{2}$. Before computing the mass of the disk, guess how it compares to the mass of the disk from Problem 2. Which one is bigger or are they equal? Briefly explain why.
Few shot question 1:
This problem is a variation on problem 2. Once again, suppose we have a disk of radius 3 meters. This time, near a point at distance $r$ from the center, the disk has density $r+1$ kilograms per square meter. This time, the densest part of the disk is on the edge, where the density is $4 \mathrm{~kg} / \mathrm{m}^{2}$. The least dense part of the disk is at the center, where the density is $1 \mathrm{~kg} / \mathrm{m}^{2}$. If your guess in part i. was wrong, then reflect on it again and make a second attempt to explain which disk has more mass.
The only difference is the final part: "Before computing the mass of the disk, guess how it compares to the mass of the disk from Problem 2. Which one is bigger or are they equal? Briefly explain why." vs. "If your guess in part i. was wrong, then reflect on it again and make a second attempt to explain which disk has more mass."
Solution === Few shot solution 1:
We might guess that they are close to equal since the smallest density on both is 1 and the largest density is 4. However, the area corresponding to the largest density in the first problem was a small disk around the origin (say of radius $0.1$, so the area is $2 \pi (0.1)^{2}$), whereas the area corresponding to density $\approx 4$ in the second problem is the annulus from $2.9$ to 3, which has area approximately $2 \pi (2.9)(0.1)$ (so the second disk has high density over a larger area).
 
[Problem 42 from above]:
Question === Few shot question 3
Solution === Few shot solution 3
 
Question:
"Our eventual goal is to do gradient descent on the logistic regression objective $J_{\\text {nll }}$. In this problem, we'll take the first step toward deriving that gradient update. We'll focus on the gradient of the loss at a single point with respect to parameters $\\theta$ and $\\theta_{0}$.\nWhat is the derivative of $L_{\\text {nll }}$ with respect to $\\theta_{0}$ ? Enter a Python expression involving $\\mathrm{x}, \\mathrm{y}$, and $\\mathrm{g}$."
Few shot question 1:
"Our eventual goal is to do gradient descent on the logistic regression objective $J_{\\text {nll }}$. In this problem, we'll take the first step toward deriving that gradient update. We'll focus on the gradient of the loss at a single point with respect to parameters $\\theta$ and $\\theta_{0}$.\nWhat is the derivative of $L_{\\mathrm{nll}}$ with respect to $\\theta$ ? Enter a Python expression involving $\\mathrm{x}, \\mathrm{y}$, and $\\mathrm{g}$.\nHint: Use the chain rule and your expression from $5.3$."
The only difference is the trailing "Hint: Use the chain rule and your expression from $5.3$."
Solution === Few shot solution 1:
(g-y)*x
 
 
There are plenty more leaks that are not as blatant as these.

Approach

Below is the open-sourced GitHub code for the main experiment that was run in the paper:
[Code listing from the paper's GitHub repository]
The first major issue to notice here is how the flow handles grading: it asks GPT-4 to check the answer, with a) the original question, b) the *ground-truth solution*, and c) GPT's own answer as parameters in the grading prompt. In more technical fields, where GPT is more likely to have implicit misunderstandings, this automated grading is more likely to produce self-congratulatory results.
 
Additionally, while the prompt cascade is a common technique in many recent GPT papers, the potential for data leakage here is abundant. Each level not only provides binary information based on the ground truth, but also keeps prompting until the correct answer is reached.
Though these later prompts don't see the actual solution, the binary feedback of re-prompting until a correct answer is reached is enough, especially for the multiple-choice problems that make up 16% of the test set, where unlimited tries (nearly) guarantee a correct answer. This is analogous to someone with the answer sheet telling a student whether they've gotten the answer right, over and over, until they do.
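To make the concern concrete, here is a minimal sketch of the grading-and-cascading loop described above. It is our own illustration, not the paper's code; `ask` is a stand-in for any chat-model call, and the prompt wording is purely illustrative.
```python
from typing import Callable

def grade(ask: Callable[[str], str], question: str, ground_truth: str, answer: str) -> bool:
    """GPT grades GPT: the grader sees the ground-truth solution next to the
    model's answer, so how lenient the grading is depends entirely on the model."""
    verdict = ask(
        f"Question: {question}\nCorrect solution: {ground_truth}\n"
        f"Student answer: {answer}\nIs the student answer correct? Reply yes or no."
    )
    return verdict.strip().lower().startswith("yes")

def solve_with_cascade(ask: Callable[[str], str], question: str, ground_truth: str,
                       prompts: list[str]) -> bool:
    # Each stage runs only if the previous attempt was graded incorrect, so the
    # ground truth leaks binary "keep trying" feedback into the cascade; for
    # multiple-choice questions, enough retries all but guarantee a pass.
    for prompt in prompts:
        answer = ask(f"{prompt}\n\nQuestion: {question}")
        if grade(ask, question, ground_truth, answer):
            return True
    return False
```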
 

Swapped parameters (system, question) in the zero-shot code:

We additionally found some typos and errors in the source code that led to prompting different from what was described in the paper or likely expected by the authors.
These are the function parameters for the zero_shot function, and this is how it gets called in the code:
[Code snippets from the paper's repository: the zero_shot signature and its call site]
As a result, all of the zero-shot results are mis-prompted: the "expert" prompt is put in the place of the question, and the question is inserted where the expert persona belongs, i.e. into You are {question}.
Therefore, the final prompt takes the form: You are {question} Your task is to answer the following question: Question: {system}
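To illustrate the failure mode, here is a minimal reconstruction of the kind of argument swap described above. The function and parameter names are our own assumptions for illustration; they are not copied from the paper's repository.
```python
# Minimal reconstruction of the kind of argument swap described above.
# NOTE: names here are hypothetical, chosen only to illustrate the bug.

def zero_shot(system: str, question: str) -> str:
    """Builds the zero-shot prompt: the first parameter is meant to be the
    expert/system persona, the second the actual question."""
    return (f"You are {system}. Your task is to answer the following question: "
            f"Question: {question}")

expert = "an MIT professor of EECS"
question = "Compute the propagation delay of your circuit from D."

# Intended call:
print(zero_shot(expert, question))
# "You are an MIT professor of EECS. Your task is to answer ... Question: Compute ..."

# Swapped call (the bug described above):
print(zero_shot(question, expert))
# "You are Compute the propagation delay of your circuit from D. Your task is to
#  answer the following question: Question: an MIT professor of EECS"
```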
 

Issues with expert prompting:

Malformed “experts”:
Beyond the swap typo detailed above, the expert-selection mechanism also produces incoherent text, because it is difficult to get the model to generate text in a specific format (i.e., a list of experts separated by commas). As a result, many of the "expert prompts" that get fed through the system are of the form:
It is difficult to pinpoint specific individuals who would be most capable of solving this question, as many computer scientists and mathematicians with expertise in digital logic, computer arithmetic, and number systems would be well-equipped to handle this problem. However, here are three experts who have made significant contributions to the fields of computer science and mathematics and would likely be able to solve this question:
  1. Donald Knuth, a renowned computer scientist and mathematician, known for his work on the analysis of algorithms and the author of "The Art of Computer Programming" series.
  2. John Hennessy, a computer scientist and electrical engineer, who co-authored the widely-used textbook "Computer Organization and Design: The Hardware/Software Interface" and is a pioneer in computer architecture.
  3. Peter Denning, a computer scientist with expertise in operating systems, computer architecture, and performance analysis, who has authored several textbooks and made significant contributions to the field of computer science.
These experts have a deep understanding of the underlying concepts required to solve this problem and would likely be able to provide a solution.
And since the paper's code splits this "list of experts" on commas, the first three "expert prompts" end up being: a) It is difficult to pinpoint specific individuals who would be most capable of solving this question b) as many computer scientists and mathematicians with expertise in digital logic c) and number systems would be well-equipped to handle this problem. However
[not a typo, the prompts get cut off there]
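A tiny illustration (our own, not the paper's code) of why a comma split over that kind of response yields sentence fragments rather than expert names:
```python
# Splitting a free-form "list of experts" response on commas.
response = (
    "It is difficult to pinpoint specific individuals who would be most capable "
    "of solving this question, as many computer scientists and mathematicians "
    "with expertise in digital logic, computer arithmetic, and number systems "
    "would be well-equipped to handle this problem. However, here are three experts..."
)
expert_prompts = [part.strip() for part in response.split(",")][:3]
# Each "expert prompt" is now a sentence shard (the first one being "It is
# difficult to pinpoint specific individuals who would be most capable of
# solving this question"), not the name of an expert.
```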
This makes the expert-prompting system extremely hard to control, and it frequently adds no useful information. While a differently constructed prompt might have worked, the one described in the paper does not produce consistently structured outputs, making the effect of this "expert prompting" approach unclear.
 
Effects of swapped parameters:
In the original paper's code, the parameters of the zero-shot function were swapped, meaning that the question text and the system (expert) prompt ended up in the wrong locations. The effect is clearly visible in the generated solutions when the swap is left unfixed; many generations take the form:
It seems that the question is not related to the problem statement provided. Please provide a relevant question related to the problem statement, and I will be happy to help you with that.
It seems that the question is not related to the integral problem mentioned earlier. Terence Tao is a mathematician who has made significant contributions to various fields, including harmonic analysis, partial differential equations, and combinatorics. He is a professor at UCLA and has received numerous awards, including the Fields Medal in 2006. If you have a specific question about Terence Tao or his work, please feel free to ask.
It seems like you have not provided a question related to the given numpy array. Please provide a question related to the numpy array A = np.array([[5,7,10,14],[2,4,8,9]]) and the expression A[:, 1:2].
It seems like you have provided a name, Jeffrey Dean, instead of a question related to the given information about the MapReduce paper and incast issues. Please provide a clear question related to the topic, and I will be happy to help you with the answer.
Many more generations, seemingly the majority, are similarly confused responses. These could not be properly graded, as they are not even attempts at solutions. For a full view of these generations, please look through the spreadsheet linked in the update above and scroll through the [Actual Paper implementation] column.

Claims of verification:

The site for the paper states that "[they] double-verify manually that the grading on the test set is correct," but (assuming that the released test set is the same one referenced there) it is unclear how any of the issues raised in this document went unnoticed during that verification.
[Screenshot of the paper's website claiming that grading on the test set was manually double-verified]
 

Relevant analysis by Davis (2022) of another paper by Drori et al. (2022):

Fifth, and perhaps the most serious: The way in which problems are “automatically” chosen for few-shot learning is altogether unclear and may be illegitimate. The paper says (figure 2) “If zero-shot learning does not work, perform few-shot learning,” and (p. 4) “If the question is not solved [by zero-shot learning], we do the following [description of the few-shot procedure]”. The question is, how does the system know that zero-shot learning has not succeeded? As far as I can see the question is not answered in the paper. Perhaps the system uses some legitimate method; e.g. the Codex system fails to produce executable code. However, if that were the criterion, one would expect that some fraction of the time, zero-shot learning would produce code that executes but is erroneous; and there is no suggestion of that in the paper. What seems much more likely is that the system moves to few shot learning when zero-shot learning has produced an answer that is incorrect. That is, the program is using the recorded correct answer to guide its actions. That would be cheating and if that is the case, then all of the results relative to few-shot learning must be thrown out, or at least interpreted with a very large asterisk.

Conclusions

The observations presented here are just a few of the most glaring problems we found in a few hours of review. We are certain that, as others look more closely at the methodology, more holes will be revealed. We encourage readers to download the dataset and start examining it themselves. Affirmation can only come through peer evaluation.
 
Our observations about the integrity of the data and the methods of analysis are worrying. This paper speaks to a larger recent trend in AI research: as the field progresses faster and faster, the timelines for discovery shrink, which often invites shortcuts. One particularly worrying trend is evaluating a model's accuracy using a language model like GPT-4. While a useful tool, its conclusions should never be overstated or treated as ground truth. Recent work has shown that GPT-4 evaluators cannot be relied upon for validation without accurate ground-truth information. At the very least, a random subset of the dataset should be selected to compare GPT-4's evaluations against those of a human counterpart. Language models cannot yet be treated as oracles that generate ground truth. It is also extremely important to re-evaluate every data point and perform basic sanity checks before using data at all, whether for training, inference, benchmarking, or anything else. Given the small size of the dataset in question, a simple manual validation would have been easily within the scope of the work.
 
Our critiques are largely of the methodology and rigor of this study, not of its content. We make no claim about the ability of large language models to actually solve MIT curricula, only that this paper fails to prove it in a scientifically rigorous way. Though, as MIT undergraduates ourselves, we can at least say that the test set we accessed does not, in our experience, accurately represent the breadth and depth of understanding required to complete an EECS degree at MIT. 😄