How AI assistants are already changing the way code gets made

via MIT Technology Review https://ift.tt/FKbyXM8

Two weeks into the coding class he was teaching at Duke University in North Carolina this spring, Noah Gift told his students to throw out the course materials he’d given them. Instead of working with Python, one of the most popular entry-level programming languages, the students would now be using Rust, a language that was newer, more powerful, and much harder to learn.

Gift, a software developer with 25 years of experience, had only just learned Rust himself. But he was confident his students would be fine with the last-minute switch-up. That’s because they’d also each get a special new sidekick: an AI tool called Copilot, a turbocharged autocomplete for computer code, built on top of OpenAI’s latest large language models, GPT-3.5 and GPT-4.

Copilot is made by GitHub, a firm that runs an online software development platform used by more than 100 million programmers. The tool monitors every keystroke you make, predicts what you are trying to do on the fly, and offers up a nonstop stream of code snippets you could use to do it. Gift, who had been told about Copilot by someone he knew at GitHub’s parent company, Microsoft, saw its potential at once.

“There’s no way I could have learned Rust as quickly as I did without Copilot,” he says. “I basically had a supersmart assistant next to me that could answer my questions while I tried to level up. It was pretty obvious to me that we should start using it in class.”

Gift isn’t alone. Ask a room of computer science students or programmers if they use Copilot, and many now raise a hand. All the people interviewed for this article said they used Copilot themselves—even those who pointed out problems with the tool.

Like ChatGPT with education, Copilot is upending an entire profession by giving people new ways to perform old tasks. Packaged as a paid-for plug-in for Microsoft’s Visual Studio software (a kind of industry-standard multi-tool for writing, debugging, and deploying code), Copilot is the slickest version of this tech. But it’s not the only tool available to coders. In August, Meta released a free code-generation model called Code Llama, based on Llama 2, Meta’s answer to GPT-4. The same month, Stability AI—the firm behind the image-making model Stable Diffusion—put out StableCode. And, of course, there’s ChatGPT, which OpenAI has pitched from the start as a chatbot that can help write and debug code.

“It’s the first time that machine-learning models have been really useful for a lot of people,” says Gabriel Synnaeve, who led the team behind Code Llama at Meta. “It’s not just nerding out—it’s actually useful.”

With Microsoft and Google about to stir similar generative models into office software used by billions around the world (Microsoft has started using Copilot as a brand name across Office 365), it’s worth asking exactly what these tools do for programmers. How are they changing the basics of a decades-old job? Will they help programmers make more and better software? Or will they get bogged down in legal fights over IP and copyright?

Cranking out code

On the surface, writing code involves typing statements and instructions in some programming language into a text file. This then gets translated into machine code that a computer can run—a level up from the 0s and 1s of binary. In practice, programmers also spend a lot of time googling, looking up workarounds for common problems or skimming online forums for faster ways to write an algorithm. Existing chunks of prewritten code then get repurposed, and new software often comes together like a collage.

But these look-ups take time and pull programmers out of the flow of converting thoughts into code, says Thomas Dohmke, GitHub’s CEO: “You’ve got a lot of tabs open, you’re planning a vacation, maybe you’re reading the news. At last you copy the text you need and go back to your code, but it’s 20 minutes later and you lost the flow.”

The key idea behind Copilot and other programs like it, sometimes called code assistants, is to put the information that programmers need right next to the code they are writing. The tool tracks the code and comments (descriptions or notes written in natural language) in the file that a programmer is working on, as well as other files that it links to or that have been edited in the same project, and sends all this text to the large language model behind Copilot as a prompt. (GitHub co-developed Copilot’s model, called Codex, with OpenAI. It is a large language model fine-tuned on code.) Copilot then predicts what the programmer is trying to do and suggests code to do it.

This round trip between code and Codex happens multiple times a second, the prompt updating as the programmer types. At any moment, the programmer can accept what Copilot suggests by hitting the tab key, or ignore it and carry on typing.
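To make the round trip concrete, here is a minimal sketch of how an editor plug-in might assemble such a prompt. It is not GitHub’s implementation: the names `OpenFile` and `build_prompt`, the `# File:` header layout, and the character budget are all assumptions made for illustration. The sketch keeps the text just before the cursor first, then adds related files only while they fit.

```python
from dataclasses import dataclass


@dataclass
class OpenFile:
    """A file in the project (hypothetical type for this sketch)."""
    path: str
    text: str


def build_prompt(current: OpenFile, cursor: int,
                 neighbors: list[OpenFile], max_chars: int = 2000) -> str:
    """Assemble prompt text from related files plus the code before the cursor.

    Illustrative only: the layout and the budget are assumptions, not a
    description of any real code assistant.
    """
    header = f"# File: {current.path}\n"
    budget = max(max_chars - len(header), 0)

    # Text before the cursor matters most, so it gets the budget first.
    prefix = current.text[:cursor]
    if len(prefix) > budget:
        prefix = prefix[len(prefix) - budget:]
    remaining = budget - len(prefix)

    # Add whole neighboring files, in order, while they still fit.
    parts = []
    for other in neighbors:
        snippet = f"# File: {other.path}\n{other.text}\n"
        if len(snippet) > remaining:
            break
        parts.append(snippet)
        remaining -= len(snippet)

    parts.append(header + prefix)
    return "".join(parts)


current = OpenFile("main.py", "def add(a, b):\n    return a + b\n")
neighbors = [OpenFile("util.py", "X = 1")]
cursor = len("def add(a, b):\n")  # cursor sits at the start of the function body
prompt = build_prompt(current, cursor, neighbors, max_chars=200)
```

In this sketch the prompt would be rebuilt each time the programmer types, and the model’s reply would be shown as a suggestion to accept or ignore.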

The tab key seems to get hit a lot. A study of almost a million Copilot users published by GitHub and the consulting firm Keystone Strategy in June, a year after the tool’s general release, found that programmers accepted on average around 30% of its suggestions, according to GitHub’s user data.

“In the last year Copilot has suggested—and had okayed by developers—more than a billion lines of code,” says Dohmke. “Out there, running inside computers, is code generated by a stochastic parrot.”

Copilot has changed the basic skills of coding. As with ChatGPT or image makers like Stable Diffusion, the tool’s output is often not exactly what’s wanted—but it can be close. “Maybe it’s correct, maybe it’s not—but it’s a good start,” says Arghavan Moradi Dakhel, a researcher at Polytechnique Montréal in Canada who studies the use of machine-learning tools in software development. Programming becomes prompting: rather than coming up with code from scratch, the work involves tweaking half-formed code and nudging a large language model to produce something more on point.

But Copilot isn’t everywhere yet. Some firms, including Apple, have asked employees not to use it, wary of leaking IP and other private data to competitors. For Justin Gottschlich, CEO of Merly, a startup that uses AI to analyze code across large software projects, that will always be a deal-breaker: “If I’m Google or Intel and my IP is my source code, I’m never going to use it,” he says. “Why don’t I just send you all my trade secrets too? It’s just put-your-pants-on-before-you-leave-the-house kind of obvious.” Dohmke is aware this is a turn-off for key customers and says that the firm is working on a version of Copilot that businesses can run in-house, so that code isn’t sent to Microsoft’s servers.

Copilot is also at the center of a lawsuit filed by programmers unhappy that their code was used to train the models behind it without their consent. Microsoft has offered indemnity to users of its models who are wary of potential litigation. But the legal issues will take years to play out in the courts.

Dohmke is bullish, confident that the pros outweigh the cons: “We will adjust to whatever US, UK, or European lawmakers tell us to do,” he says. “But there is a middle balance here between protecting rights—and protecting privacy—and us as humanity making a step forward.” That’s the kind of fighting talk you’d expect from a CEO. But this is new, uncharted territory. If nothing else, GitHub is leading a brazen experiment that could pave the way for a wider range of AI-powered professional assistants.

Code whisperer

GitHub started working on Copilot in June 2020, soon after OpenAI released GPT-3. Programmers have always been on the lookout for shortcuts and speedups. “It’s part of the DNA of being a software developer,” says Dohmke. “We wanted to solve this problem of boilerplate code—can we generate code that is no fun to write but takes up time?”

The first sign they were onto something came when they asked programmers at the company to submit coding tests that they might ask somebody at a job interview: “Here’s some code—finish it off.” GitHub gave these to an early version of the tool and let it try each test 150 times. Given that many attempts, they found that the tool could solve 92% of them. They tried again with 50,000 problems taken from GitHub’s online platform, and the tool solved just over half of them. “That gave us confidence that we could build what ultimately became Copilot,” says Dohmke.
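One way to read that kind of score is that a problem counts as solved if at least one of its attempts passes. That reading is an assumption; the article does not spell out GitHub’s scoring. A minimal sketch, with a hypothetical function name and made-up sample data:

```python
def solved_fraction(attempts_by_problem: dict[str, list[bool]]) -> float:
    """Return the share of problems with at least one passing attempt.

    Assumes a problem is "solved" if any single attempt passed; this is an
    illustrative reading, not GitHub's documented method.
    """
    if not attempts_by_problem:
        raise ValueError("need at least one problem")
    solved = sum(1 for attempts in attempts_by_problem.values() if any(attempts))
    return solved / len(attempts_by_problem)


# Made-up results: True means that attempt passed the test.
sample = {
    "reverse_list": [False, True],
    "parse_date": [False, False],
    "fizzbuzz": [True],
    "binary_search": [False, False, True],
}
rate = solved_fraction(sample)  # 3 of 4 problems have a passing attempt
```

Under this reading, more attempts per problem can only raise the score, which is why a figure like 92% needs to be read alongside the number of tries allowed.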

In 2023, a team of GitHub and Microsoft researchers tested the impact of Copilot on programmers in a small study. They asked 95 people to build a web server (a non-trivial task, but one involving the kind of common, boilerplate code that Dohmke refers to) and gave half access to Copilot. Those using Copilot completed the task on average 55% faster.

A powerful AI that replaces the need for googling is useful—but is it a game changer? Opinion is split.

“The way that I would think about it is that you have an experienced developer sitting next to you whispering recommendations,” says Marco Iansiti, a Keystone Strategy cofounder and a professor at Harvard Business School, where he studies digital transformation. “You used to have to look things up on your own, and now—whammo—here comes the suggestion automatically.”

Gottschlich, who has been working on automatic code generation for years, is less impressed. “To be frank, code assistants are fairly uninteresting in the larger scheme of things,” he says, referring to the new wave of tools based on large language models, like Copilot. “They are principally bound by what the human programmer is capable of doing. They’ll never likely at this stage be able to do something miraculous beyond what the human programmer is doing.”

Gottschlich, who claims that Merly’s tech finds bugs in code and fixes them by itself (but who doesn’t shed light on how that works), is thinking bigger. He sees AI one day taking on the management of vast and complex libraries of code, directing human engineers in how to maintain it. But he doesn’t think large language models are the right tech for that job.

Even so, small changes to a task that millions of people do all the time can add up fast. Iansiti, for example, makes a huge claim: he believes that the impact of Copilot—and tools like it—could add $1.5 trillion to the global economy by 2030. “It’s more of a back-of-the-envelope thing, not really an academic estimate, but it could be a lot larger than that as well,” he says. “There’s so much stuff that hinges on software. If you move the needle on how software development really works, it will have an infinite impact on the economy.”

For Iansiti, it’s not just about getting existing developers to produce more code. He argues that these tools will increase the demand for programmers because companies could get more code for less money. At the same time, there will be more coders because these tools lower the barrier to entry. “We’re going to see an expansion in who can contribute to software development,” he says.

Or as Idan Gazit, GitHub’s senior director of research, puts it: imagine if anyone who picked up a guitar could play a basic tune straight away. There would be a lot more guitar players and a lot more music.

Many agree that Copilot makes it easier to pick up programming—as Gift found. “Rust has got a reputation for being a very difficult language,” he says. “But I was pleasantly shocked at how well the students did and at the projects they built—how complex and useful they were.” Gift says they were able to build complete web apps with chatbots in them.

Not everyone was happy with Gift’s syllabus change, however. He says that one of his teaching assistants told new students not to use Copilot because it was a crutch that would stop them from learning properly. Gift accepts that Copilot is like training wheels that you might not want to take off. But he doesn’t think that’s a problem: “What are we trying to do? We’re trying to build complex systems.” And to do that, he argues, programmers should use whatever tools are available.

It is true that the history of computing has seen programmers rely on more and more layers of software between themselves and the machine code that computers can run. They have gone from punch cards and assembly code to programming languages like Python that are relatively easy to read and write. That’s possible because software such as compilers and interpreters translates such languages into instructions the machine can run. “When I started coding in the ’80s and ’90s, you still had to know how a CPU worked,” says Dohmke. “Now when you write a web application, you almost never think about the CPU or the web server.”

Add in a long list of bug-catching and code-testing tools, and programmers are used to a large amount of automated support. In many ways, Copilot and others are just the latest wave. “I used Python for 25 years because it was written to be readable by humans,” says Gift. “In my opinion, that doesn’t matter anymore.”

But he points out that Copilot isn’t a free pass. “Copilot reflects your ability,” he says. “It lifts everyone up a little bit, but if you’re a poor programmer you’ll still have weaknesses.”

Work to be done

A big problem with assessing the true impact of such tools is that most of the data is still anecdotal. GitHub’s study showed that programmers were accepting 30% of suggestions (“30% is out of this world in any kind of industry scenario,” says Dohmke), but it is not clear why the programmers accepted those suggestions and rejected others.

The same study also revealed that less experienced programmers accepted more suggestions and that programmers accepted more suggestions as they grew used to the tool—but, again, not why. “We need to go a lot deeper to understand what that means,” says Iansiti. “There’s work to be done to really get a sense of how the coding process itself is developing, and that work is all TBD.”

Most independent studies of tools like Copilot have focused on the correctness of the code that they suggest. Like all large language models, these tools can produce nonsense. With code it can be hard to tell—especially for less experienced users, who also seem to rely on Copilot the most.

Several teams of researchers in the last couple of years have found that Copilot can insert bugs or security flaws into code. GitHub has been busy improving the accuracy of Copilot’s suggestions. It claims that the latest version of the tool runs code through a second model trained to filter out common security bugs before making a suggestion to users.

But there are other quality issues beyond bugs, says Dakhel. She and her colleagues have found that Copilot can suggest code that is overly complex or doesn’t adhere to what professionals consider best practices, which is a problem because complex or unclear code is harder for other people to read, check, and extend.

The problem is that models are only as good as their training data. And Copilot’s models were trained on a vast library of code taken from GitHub’s online repository, which goes back 15 years. This code contains not only bugs but also security flaws that were not known about when the code was written.

Add to this the fact that inexperienced programmers use the tool more than experienced ones, and it could make more work for software development teams in the long run, says Dakhel. Expert programmers may have to spend more time double-checking the code put through by non-experts.

Dakhel now hopes to study the gap between expert and non-expert programmers more fully. Before Copilot was released, she and her colleagues were using machine learning to detect expert programmers by their code. But Copilot messed with her data because now it was harder to tell whether code had been written by an expert programmer or a less experienced one with AI help.

Now, having played around with Copilot herself, she plans to use her approach to study what kind of boost it gives. “I’m curious to know if junior developers using such a tool will be predicted to be expert developers or if it’s still detectable that they are junior developers,” she says. “It could be a way of measuring how big a level up these tools give people.”

Ultimately, we might not have to wait long before the jury is in. Software development is one of the best-documented and most thoroughly measured of business activities. If Copilot works, it will get used. If it doesn’t, it won’t. In the meantime, these tools are getting better all the time.

Yet it is worth noting that programming—typing text onto a screen—is a small part of the overall job of software development. It involves managing multiple parts of a complex puzzle, including designing the code, testing it, and deploying it. Copilot, like many programs before it, can make parts of that job faster, but it won’t reinvent it completely.

“There’s always going to be programmers,” says Synnaeve. “They will get a lot of help, but in the end what matters is understanding which problems need solving. To do that really well and translate that into a program—that’s the job of programmers.”
