This article walks through practical TensorFlow preprocessing techniques using Keras layers. It covers image data augmentation, normalization of numerical features, one-hot encoding for string and integer categories, the hashing trick for high-cardinality features, and text preprocessing with TextVectorization for embeddings, N-grams, and TF-IDF. Each example includes short code recipes you can adapt directly to your own training pipelines, helping you streamline model preparation and improve performance.

The Only TensorFlow Preprocessing Guide You Need

2025/09/19 04:25

Content Overview

  • Quick recipes
  • Image data augmentation
  • Normalizing numerical features
  • Encoding string categorical features via one-hot encoding
  • Encoding integer categorical features via one-hot encoding
  • Applying the hashing trick to an integer categorical feature
  • Encoding text as a sequence of token indices
  • Encoding text as a dense matrix of N-grams with multi-hot encoding
  • Encoding text as a dense matrix of N-grams with TF-IDF weighting
  • Important gotchas
  • Working with lookup layers with very large vocabularies
  • Using lookup layers on a TPU pod or with ParameterServerStrategy

Quick recipes

Image data augmentation

Note that image data augmentation layers are active only during training (similar to the Dropout layer).


import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Create a data augmentation stage with horizontal flipping, rotations, zooms
data_augmentation = keras.Sequential(
    [
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(0.1),
        layers.RandomZoom(0.1),
    ]
)

# Load some data
(x_train, y_train), _ = keras.datasets.cifar10.load_data()
input_shape = x_train.shape[1:]
classes = 10

# Create a tf.data pipeline of augmented images (and their labels)
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.batch(16).map(lambda x, y: (data_augmentation(x), y))

# Create a model and train it on the augmented image data
inputs = keras.Input(shape=input_shape)
x = layers.Rescaling(1.0 / 255)(inputs)  # Rescale inputs
outputs = keras.applications.ResNet50(  # Add the rest of the model
    weights=None, input_shape=input_shape, classes=classes
)(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
model.fit(train_dataset, steps_per_epoch=5)


Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
170498071/170498071 [==============================] - 5s 0us/step
5/5 [==============================] - 25s 31ms/step - loss: 9.0505
<keras.src.callbacks.History at 0x7fdb34287820>

You can see a similar setup in action in the example image classification from scratch.

Normalizing numerical features


# Load some data
(x_train, y_train), _ = keras.datasets.cifar10.load_data()
x_train = x_train.reshape((len(x_train), -1))
input_shape = x_train.shape[1:]
classes = 10

# Create a Normalization layer and set its internal state using the training data
normalizer = layers.Normalization()
normalizer.adapt(x_train)

# Create a model that includes the normalization layer
inputs = keras.Input(shape=input_shape)
x = normalizer(inputs)
outputs = layers.Dense(classes, activation="softmax")(x)
model = keras.Model(inputs, outputs)

# Train the model
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x_train, y_train)


1563/1563 [==============================] - 3s 2ms/step - loss: 2.1271
<keras.src.callbacks.History at 0x7fda8c6f0730>

Encoding string categorical features via one-hot encoding


# Define some toy data
data = tf.constant([["a"], ["b"], ["c"], ["b"], ["c"], ["a"]])

# Use StringLookup to build an index of the feature values and encode output.
lookup = layers.StringLookup(output_mode="one_hot")
lookup.adapt(data)

# Convert new test data (which includes unknown feature values)
test_data = tf.constant([["a"], ["b"], ["c"], ["d"], ["e"], [""]])
encoded_data = lookup(test_data)
print(encoded_data)


tf.Tensor(
[[0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]], shape=(6, 4), dtype=float32)

Note that, here, index 0 is reserved for out-of-vocabulary values (values that were not seen during adapt()).
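If you expect many distinct unseen values at inference time, you can reserve more than one out-of-vocabulary slot. The sketch below is illustrative only; it reuses the toy data tensor from above and relies on the standard num_oov_indices argument of StringLookup.

# Minimal sketch (assuming the toy `data` tensor defined above):
# allocate two out-of-vocabulary slots instead of the default one.
lookup_2oov = layers.StringLookup(output_mode="one_hot", num_oov_indices=2)
lookup_2oov.adapt(data)

# Unseen values such as "d" and "z" are hashed into one of the leading OOV slots.
print(lookup_2oov(tf.constant([["d"], ["z"]])))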

You can see the StringLookup in action in the Structured data classification from scratch example.

Encoding integer categorical features via one-hot encoding

# Define some toy data
data = tf.constant([[10], [20], [20], [10], [30], [0]])

# Use IntegerLookup to build an index of the feature values and encode output.
lookup = layers.IntegerLookup(output_mode="one_hot")
lookup.adapt(data)

# Convert new test data (which includes unknown feature values)
test_data = tf.constant([[10], [10], [20], [50], [60], [0]])
encoded_data = lookup(test_data)
print(encoded_data)


tf.Tensor(
[[0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1.]], shape=(6, 5), dtype=float32)

Note that index 0 is reserved for missing values (which you should specify as the value 0), and index 1 is reserved for out-of-vocabulary values (values that were not seen during adapt()). You can configure this by using the mask_token and oov_token constructor arguments of IntegerLookup.
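As a hedged illustration of those constructor arguments, the following sketch (reusing the toy integer data above) drops the dedicated missing-value slot by passing mask_token=None and keeps -1 as the out-of-vocabulary token; the exact defaults can vary across TensorFlow versions.

# Illustrative sketch only: customize the reserved indices.
# Passing mask_token=None removes the dedicated "missing value" slot;
# oov_token controls which input value is treated as out-of-vocabulary.
lookup_no_mask = layers.IntegerLookup(
    output_mode="one_hot", mask_token=None, oov_token=-1
)
lookup_no_mask.adapt(data)
print(lookup_no_mask(tf.constant([[10], [50]])))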

You can see the IntegerLookup in action in the example structured data classification from scratch.

Applying the hashing trick to an integer categorical feature

If you have a categorical feature that can take many different values (on the order of 1e4 or higher), where each value only appears a few times in the data, it becomes impractical and ineffective to index and one-hot encode the feature values. Instead, it can be a good idea to apply the "hashing trick": hash the values to a vector of fixed size. This keeps the size of the feature space manageable, and removes the need for explicit indexing.


# Sample data: 10,000 random integers with values between 0 and 100,000
data = np.random.randint(0, 100000, size=(10000, 1))

# Use the Hashing layer to hash the values into 64 bins (range [0, 64))
hasher = layers.Hashing(num_bins=64, salt=1337)

# Use the CategoryEncoding layer to multi-hot encode the hashed values
encoder = layers.CategoryEncoding(num_tokens=64, output_mode="multi_hot")
encoded_data = encoder(hasher(data))
print(encoded_data.shape)


(10000, 64) 

Encoding text as a sequence of token indices

This is how you should preprocess text to be passed to an Embedding layer.


# Define some text data to adapt the layer
adapt_data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)

# Create a TextVectorization layer
text_vectorizer = layers.TextVectorization(output_mode="int")
# Index the vocabulary via `adapt()`
text_vectorizer.adapt(adapt_data)

# Try out the layer
print(
    "Encoded text:\n",
    text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
)

# Create a simple model
inputs = keras.Input(shape=(None,), dtype="int64")
x = layers.Embedding(input_dim=text_vectorizer.vocabulary_size(), output_dim=16)(inputs)
x = layers.GRU(8)(x)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)

# Create a labeled dataset (which includes unknown tokens)
train_dataset = tf.data.Dataset.from_tensor_slices(
    (["The Brain is deeper than the sea", "for if they are held Blue to Blue"], [1, 0])
)

# Preprocess the string inputs, turning them into int sequences
train_dataset = train_dataset.batch(2).map(lambda x, y: (text_vectorizer(x), y))
# Train the model on the int sequences
print("\nTraining model...")
model.compile(optimizer="rmsprop", loss="mse")
model.fit(train_dataset)

# For inference, you can export a model that accepts strings as input
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
outputs = model(x)
end_to_end_model = keras.Model(inputs, outputs)

# Call the end-to-end model on test data (which includes unknown tokens)
print("\nCalling end-to-end model on test string...")
test_data = tf.constant(["The one the other will absorb"])
test_output = end_to_end_model(test_data)
print("Model output:", test_output)


Encoded text:
 [[ 2 19 14  1  9  2  1]]

Training model...
1/1 [==============================] - 2s 2s/step - loss: 0.5296

Calling end-to-end model on test string...
Model output: tf.Tensor([[0.01208781]], shape=(1, 1), dtype=float32)

You can see the TextVectorization layer in action, combined with an Embedding layer, in the example text classification from scratch.

Note that when training such a model, for best performance, you should always use the TextVectorization layer as part of the input pipeline.
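For instance, a minimal sketch of such a pipeline, assuming the text_vectorizer and model defined above (the num_parallel_calls and prefetch settings are illustrative, not prescribed by the original guide):

# Apply TextVectorization inside tf.data so string preprocessing runs
# asynchronously on the CPU and overlaps with model training.
AUTOTUNE = tf.data.AUTOTUNE
train_dataset = (
    tf.data.Dataset.from_tensor_slices(
        (["The Brain is deeper than the sea", "for if they are held Blue to Blue"], [1, 0])
    )
    .batch(2)
    .map(lambda x, y: (text_vectorizer(x), y), num_parallel_calls=AUTOTUNE)
    .prefetch(AUTOTUNE)
)
model.fit(train_dataset)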

Encoding text as a dense matrix of N-grams with multi-hot encoding

This is how you should preprocess text to be passed to a Dense layer.


# Define some text data to adapt the layer
adapt_data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)
# Instantiate TextVectorization with "multi_hot" output_mode
# and ngrams=2 (index all bigrams)
text_vectorizer = layers.TextVectorization(output_mode="multi_hot", ngrams=2)
# Index the bigrams via `adapt()`
text_vectorizer.adapt(adapt_data)

# Try out the layer
print(
    "Encoded text:\n",
    text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
)

# Create a simple model
inputs = keras.Input(shape=(text_vectorizer.vocabulary_size(),))
outputs = layers.Dense(1)(inputs)
model = keras.Model(inputs, outputs)

# Create a labeled dataset (which includes unknown tokens)
train_dataset = tf.data.Dataset.from_tensor_slices(
    (["The Brain is deeper than the sea", "for if they are held Blue to Blue"], [1, 0])
)

# Preprocess the string inputs, turning them into multi-hot n-gram vectors
train_dataset = train_dataset.batch(2).map(lambda x, y: (text_vectorizer(x), y))
# Train the model on the multi-hot vectors
print("\nTraining model...")
model.compile(optimizer="rmsprop", loss="mse")
model.fit(train_dataset)

# For inference, you can export a model that accepts strings as input
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
outputs = model(x)
end_to_end_model = keras.Model(inputs, outputs)

# Call the end-to-end model on test data (which includes unknown tokens)
print("\nCalling end-to-end model on test string...")
test_data = tf.constant(["The one the other will absorb"])
test_output = end_to_end_model(test_data)
print("Model output:", test_output)


WARNING:tensorflow:5 out of the last 1567 calls to <function PreprocessingLayer.make_adapt_function.<locals>.adapt_step at 0x7fda8c3463a0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.
Encoded text:
 [[1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0.
   0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0.]]

Training model...
1/1 [==============================] - 0s 392ms/step - loss: 0.0805

Calling end-to-end model on test string...
Model output: tf.Tensor([[0.58644605]], shape=(1, 1), dtype=float32)

Encoding text as a dense matrix of N-grams with TF-IDF weighting

This is an alternative way of preprocessing text before passing it to a Dense layer.


# Define some text data to adapt the layer
adapt_data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)
# Instantiate TextVectorization with "tf-idf" output_mode
# (multi-hot with TF-IDF weighting) and ngrams=2 (index all bigrams)
text_vectorizer = layers.TextVectorization(output_mode="tf-idf", ngrams=2)
# Index the bigrams and learn the TF-IDF weights via `adapt()`
text_vectorizer.adapt(adapt_data)

# Try out the layer
print(
    "Encoded text:\n",
    text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
)

# Create a simple model
inputs = keras.Input(shape=(text_vectorizer.vocabulary_size(),))
outputs = layers.Dense(1)(inputs)
model = keras.Model(inputs, outputs)

# Create a labeled dataset (which includes unknown tokens)
train_dataset = tf.data.Dataset.from_tensor_slices(
    (["The Brain is deeper than the sea", "for if they are held Blue to Blue"], [1, 0])
)

# Preprocess the string inputs, turning them into TF-IDF weighted n-gram vectors
train_dataset = train_dataset.batch(2).map(lambda x, y: (text_vectorizer(x), y))
# Train the model on the TF-IDF vectors
print("\nTraining model...")
model.compile(optimizer="rmsprop", loss="mse")
model.fit(train_dataset)

# For inference, you can export a model that accepts strings as input
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
outputs = model(x)
end_to_end_model = keras.Model(inputs, outputs)

# Call the end-to-end model on test data (which includes unknown tokens)
print("\nCalling end-to-end model on test string...")
test_data = tf.constant(["The one the other will absorb"])
test_output = end_to_end_model(test_data)
print("Model output:", test_output)


WARNING:tensorflow:6 out of the last 1568 calls to <function PreprocessingLayer.make_adapt_function.<locals>.adapt_step at 0x7fda8c0569d0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.
Encoded text:
 [[5.4616475 1.6945957 0.        0.        0.        0.        0.
   0.        0.        0.        0.        0.        0.        0.
   0.        0.        1.0986123 1.0986123 1.0986123 0.        0.
   0.        0.        0.        0.        0.        0.        0.
   1.0986123 0.        0.        0.        0.        0.        0.
   0.        1.0986123 1.0986123 0.        0.        0.       ]]

Training model...
1/1 [==============================] - 0s 363ms/step - loss: 6.8945

Calling end-to-end model on test string...
Model output: tf.Tensor([[0.25758243]], shape=(1, 1), dtype=float32)

Important gotchas

Working with lookup layers with very large vocabularies

You may find yourself working with a very large vocabulary in a TextVectorization, StringLookup, or IntegerLookup layer. Typically, a vocabulary larger than 500MB would be considered "very large".

In such a case, for best performance, you should avoid using adapt(). Instead, pre-compute your vocabulary in advance (you could use Apache Beam or TF Transform for this) and store it in a file. Then load the vocabulary into the layer at construction time by passing the file path as the vocabulary argument.
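For instance, a minimal sketch assuming a pre-computed vocabulary stored one token per line in a hypothetical file named vocab.txt (the file name and output_mode are placeholders, not values from the original guide):

# Hypothetical vocabulary file: one token per line.
vocab_path = "vocab.txt"

# Pass the file path instead of calling adapt() on the full dataset.
lookup = layers.StringLookup(vocabulary=vocab_path, output_mode="int")

# The same pattern works for TextVectorization.
text_vectorizer = layers.TextVectorization(vocabulary=vocab_path, output_mode="int")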

Using lookup layers on a TPU pod or with ParameterServerStrategy

There is an outstanding issue that causes performance to degrade when using a TextVectorization, StringLookup, or IntegerLookup layer while training on a TPU pod or on multiple machines via ParameterServerStrategy. This is slated to be fixed in TensorFlow 2.7.


:::info Originally published on the TensorFlow website, this article appears here under a new headline and is licensed under CC BY 4.0. Code samples shared under the Apache 2.0 License.

:::

