Sony Publishes Patent For In-Game Asset Generation Using Voice

We may soon be able to create enticing in-game assets and backgrounds using our speech alone.

By Shahmeer Sarfaraz November 14, 2022

The PlayStation Logo

Sony is innovating across many industries, and gaming is arguably among the biggest. Sony Interactive Entertainment has published several interesting patents, such as in-game sound generation based on the user’s environment and revamped haptic feedback using a proposed ultrasonic interface technology.

Now, Sony has presented another alluring prospect that may revolutionize in-game world generation as we know it.

We came across a newly published Sony patent titled “VOICE DRIVEN 3D STATIC ASSET CREATION IN COMPUTER SIMULATIONS,” which uses natural-language text or speech as input to generate the requested assets. “Computer simulations” is the formal term used for video games in patents.

Major Takeaway

Sony has published a new patent to create in-game assets by taking text or speech as input and letting AI generate the requested assets.
The assets include terrains, static items on a map, furniture, and anything implemented to increase the visual appeal of an in-game environment.
The user input will first be rendered into a 2D image before converting it into a 3D asset to match the mentioned specifications.
Many variables, including length, width, design type, terrain type, physics, and specific details, can be specified when generating assets using machine learning models.

The patent by Sony will utilize natural human language such as voice or text as an input to aid asset creation in games by letting machine learning models generate them. The word “Asset” means any “common background objects” that are static and usually not interactable in video games.

Assets include terrains, backgrounds, furniture, and any part of the in-game map created within worlds. They make up most of the game maps and enhance the visual appeal of every genre of games, and the patent seeks to create them by using natural human language through neural networks.

Sony’s patent clarifies, “Present principles allow content creators to describe the asset they want as a natural language input, and create a 2D or 3D asset from that (voice) input. Creating initial prototype assets for artists to iterate on is also facilitated.” Artists can also use the generated assets as prototypes to improve upon them instead of outright using them.

An example shows a person entering speech to generate a chair asset of a computer game.

Moreover, artists can improve or modify the generated assets: “The method may include using an artist computer for modifying the 3D asset prior to presenting the 3D asset. A microphone may be used to input modification of the 3D asset to the artist computer.”

Various modifications can be made to the assets, including “changes to size, shape, color, style of certain parts of the asset (but not to all parts of the asset), texture of the surface of the asset, etc.”

The patent also explains how the method works. First, the user input is rendered into a two-dimensional image using neural networks. Then, the 2D image is converted into a three-dimensional asset according to the requested details before being presented.

“A method includes receiving text such as from speech conversion and processing the text using at least one neural network to render a two dimensional (2D) image of a computer simulation asset. The method also includes converting the 2D image to a three dimensional (3D) asset.”
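The two-stage flow quoted above can be sketched roughly as follows. This is only an illustration: `text_to_image` and `image_to_mesh` are hypothetical stand-ins for the neural networks, which the patent does not describe in this form, and the data they return is placeholder data.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Image2D:
    prompt: str
    width: int
    height: int


@dataclass
class Asset3D:
    prompt: str
    vertices: List[Tuple[float, float, float]]


def text_to_image(text: str) -> Image2D:
    """Hypothetical stage 1: a neural network would render a 2D image here."""
    return Image2D(prompt=text, width=512, height=512)


def image_to_mesh(image: Image2D) -> Asset3D:
    """Hypothetical stage 2: lift the 2D image to 3D (placeholder unit cube)."""
    cube = [(x, y, z) for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]
    return Asset3D(prompt=image.prompt, vertices=cube)


def create_asset(text: str) -> Asset3D:
    """Text (typed or transcribed from speech) -> 2D image -> 3D asset."""
    return image_to_mesh(text_to_image(text))


asset = create_asset("a wooden chair")
```

Keeping the two stages as separate functions mirrors the patent’s description, where the 2D image is an intermediate result rather than the final output.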

An example of a flow chart describing the overall process of converting speech to text to a 3D asset.

The input can also mention a location, in which case the generated asset is made consistent with it. The input can be given as either text or speech in natural human language, and multiple assets can be created at a time if necessary.

Sony clarifies, “The text may be input from a keyboard or from speech and may indicate at least one location and the 3D asset is consistent with the location. The text/speech may indicate at least plural objects and the 3D asset is consistent with the plural objects.”

Furthermore, the system can first search a library of existing assets for a match to the input keywords. If no match is found, the AI generates a new asset. The patent states, “The image may be generated from scratch or may be selected by accessing a library of assets.”
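A library-first lookup of this kind might look like the sketch below. The function name `get_asset`, the dictionary-based library, and the keyword normalization are all assumptions made for illustration; the patent does not say how the library is searched.

```python
from typing import Callable, Dict


def get_asset(keywords: str, library: Dict[str, str], generate: Callable[[str], str]) -> str:
    """Return a library asset if the keywords match, otherwise generate one."""
    # Assumed normalization: lowercase and collapse repeated whitespace.
    key = " ".join(keywords.lower().split())
    if key in library:
        return library[key]
    # No match in the library: fall back to generating from scratch.
    return generate(key)


# Hypothetical library contents and generator, for demonstration only.
library = {"wooden chair": "library/wooden_chair.mesh"}
found = get_asset("Wooden  Chair", library, lambda k: "generated:" + k)
created = get_asset("palace marble", library, lambda k: "generated:" + k)
```

Here `found` comes from the library, while `created` is produced by the fallback generator because no entry matches.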

Sony also discusses creating background terrains using the proposed system: “an artist may also vocally describe a desired background terrain, e.g., ‘dirt’ or ‘palace marble’ or other terrain.”

Attributes such as size, width, and other factors can be specified in the input. For instance, “the artist may specify a chair that is twenty feet high.” The method will also be capable of dealing with problems that arise from how the various assets interact with each other.

For example, if the chair interferes with the roof, “the roof may be caused to automatically appear as deforming to accommodate the chair. This may entail human-AI collaborative methods.” Artists can also fix any issues that occur, as mentioned above, and collaborate with the AI.

An example diagram describing a scene of a chair and a sofa and then modifying the chair to another style.

The AI engine can also control how various structures react to physics by imposing constraints. For instance, “if the asset is a piece of furniture, it must be generated with attributes that prevent it from tipping over no matter how top heavy the 3D asset may be emulated to be.”
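One simple way to picture such a constraint is the standard static tipping criterion: an object resting on a flat floor tips once it is tilted past the angle whose tangent is the half-width of its base divided by the height of its center of mass. The sketch below is not taken from the patent; the function names and the 30-degree threshold are assumptions chosen for illustration.

```python
import math


def critical_tilt_deg(half_base: float, com_height: float) -> float:
    """Tilt angle (degrees) beyond which the object tips over."""
    return math.degrees(math.atan2(half_base, com_height))


def enforce_stability(half_base: float, com_height: float, min_tilt_deg: float = 30.0) -> float:
    """Lower the simulated center of mass until the object survives min_tilt_deg.

    Returns the (possibly reduced) center-of-mass height to use in the physics engine.
    """
    max_height = half_base / math.tan(math.radians(min_tilt_deg))
    return min(com_height, max_height)


# A very tall, top-heavy chair: 0.5 units wide, center of mass 3 units up.
adjusted = enforce_stability(half_base=0.25, com_height=3.0)
# A squat stool that is already stable is left unchanged.
unchanged = enforce_stability(half_base=0.25, com_height=0.3)
```

The visible model can stay as tall as the artist asked for; only the simulated center of mass is clamped, which is one plausible reading of generating an asset “with attributes that prevent it from tipping over.”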

The AI engine can also define how assets respond to forces and to other assets.

For instance, “to establish properties of an asset for how the asset absorbs force, e.g., does the asset shatter or crack if hit with a bullet, or does it absorb the bullet. An asset representing a grenade may be designed to have different kinds of explosions in the presence of different assets.”

Implementing a way to vocally create in-game assets may significantly reduce development times for massive AAA games. This feature could also be used by modders or simple players looking to create a new game world to enjoy. Imagine being able to generate dynamically modifiable in-game worlds before playing in them.

What are your thoughts about Sony seeking to implement asset and background generation using natural human speech? Would you like to see it implemented as a feature in games in the near future? Do let us know your opinions in the comments below.
