Natural Language Processing AI Technology: A Quick Intro with Examples

Natural Language Processing AI

The company that I currently work for, Nautilus Cyberneering, has a 5-year project for which so-called Natural Language Processing AI is key. We essentially want to create a virtual artificial intelligence assistant that you can run on your own local computer and that communicates with you through a command line interface.

The assistant we envision will do all sorts of things that a private user may consider of value. The user will basically interact with the “machine”, indicating what they want to achieve or do, and the “machine” will respond to their input.

As you can imagine, such an application will require a good understanding of human language, and it could look like this:

Clearly not exactly like in the picture, but you get the point. : )

Human communication and understanding are rather complex, as you well know. Hence, to achieve this we will employ “Natural Language Processing” (NLP) artificial intelligence models.

Starting My Research

Given this, I wanted to begin forming my own opinion, test a few models, and ask around whether anyone in my network had used any Natural Language Processing AI so far. It turned out to be the case.

Some good friends of mine were already using GPT-3. They told me that to them it was like another employee in their company. Knowing them, I knew it was no overstatement when they told me that they used it for code review and research, especially since their business happens to be consulting on machine learning, AI, and automation solutions. Consequently, I became even more interested.

However, before you keep on reading, let me first say that I do not consider myself an expert in this field, so please forgive any mistakes I may make in this post. Still, you may find it interesting if you are also new to the topic.

In this post I will do the following:

  • Briefly explain what NLP is
  • Explain how NLP models work
  • Outline NLP creation techniques
  • Present some well-known NLP models
  • Share some links to the ones I found most interesting
  • Give you some examples of their replies to my input
  • Share some already usable tools

What is Natural Language Processing (NLP)

These are two definitions from different sources:

Wikipedia

“Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of “understanding” the contents of documents, including the contextual nuances of the language within them.”

wikipedia.org

IBM

“Natural language processing strives to build machines that understand and respond to text or voice data—and respond with text or speech of their own—in much the same way humans do.”

ibm.com

So to sum up:

NLP is an artificial intelligence technology meant to power machines: it processes human language input, written or spoken, understanding and responding to it.

How do NLPs Work

The summary above sounds very simple, but it is not. An NLP system needs to do all of the following (a small code sketch after the list illustrates a few of these steps):

  • Recognize speech, that is, convert voice data into text data, no matter how people speak, where they come from, or what accents or mistakes they make.
  • Tag words, be they nouns, verbs, articles, etc.
  • Decide on the intended meaning of a word, out of many possible meanings, based on the context.
  • Differentiate between block elements such as sentences.
  • Identify relevant words, for example names of a person, a state, etc.
  • Resolve contextual cross-references, for example from pronouns or descriptive words.
  • Infer the emotional load within a text, such as subjectivity, objectivity, sarcasm, etc.
  • Generate “human” responses from structured information.
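
To make a few of these steps concrete, here is a minimal sketch using the open-source spaCy library (my own choice for illustration; the post does not prescribe any particular tool). It tags words, splits sentences, and extracts named entities:

    # Requires: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Alice moved to Berlin in 2020. She loves it there.")

    # Tag words (nouns, verbs, articles, etc.)
    print([(token.text, token.pos_) for token in doc])

    # Differentiate block elements such as sentences
    print([sent.text for sent in doc.sents])

    # Identify relevant words (names of people, places, dates, ...)
    print([(ent.text, ent.label_) for ent in doc.ents])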

I do not know about you, but I think this is difficult even for a human. Recall, for instance, learning a new language: all the different accents, double meanings, the different senses of humor, etc. Complex indeed.

If you are curious, you can read more here.

Natural Language Processing AI Model Creation Techniques

Creating a single working NLP model is difficult and evidently takes a lot of effort. Over the years, different approaches have come into existence to optimize and test this process; research in this field has been going on for over half a century. You can get a brief overview of the past models on Wikipedia.

Two machine learning methods are currently in use. Both require extensive computational power, and they can be used in combination.

One could write a book on each of them, but that is not my intent, so I will briefly describe how I have understood them and include a link to more information.

Feature or Representation Learning

A system is set up to automatically discover and learn from prepared sets of labeled or unlabeled data. It essentially learns to recognize and associate features (common patterns) within a context and to make associations of meaning. For more information, see here.
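
As a toy illustration of representation learning (again my own sketch, not tied to any model mentioned here), gensim's Word2Vec discovers word features purely from co-occurrence patterns in a corpus:

    # Requires: pip install gensim
    from gensim.models import Word2Vec

    # A tiny made-up corpus; real training uses billions of tokens
    corpus = [
        ["the", "king", "rules", "the", "kingdom"],
        ["the", "queen", "rules", "the", "kingdom"],
        ["dogs", "and", "cats", "are", "animals"],
    ]

    # Each word is mapped to a 32-dimensional feature vector
    model = Word2Vec(sentences=corpus, vector_size=32, window=2,
                     min_count=1, epochs=100)

    # Words used in similar contexts end up with similar vectors
    # (with a corpus this small, the result is of course meaningless)
    print(model.wv.most_similar("king"))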

Deep Neural Network Learning

This is an approach in which there are different layers of interconnected nodes. Nodes are computational sets of rules whose weights get adjusted during the training phase, and they pass information along. The data that you input into the system proceeds through this network of decision rules, progressing through the different layers like a decision tree. For more information, see here.
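
Here is a minimal PyTorch sketch of such a network (my own illustration): layers of interconnected nodes whose weights get adjusted during a training phase while data flows through the layers:

    # Requires: pip install torch
    import torch
    import torch.nn as nn

    # Two layers of interconnected nodes
    model = nn.Sequential(
        nn.Linear(8, 16),   # input layer  -> hidden layer (weights)
        nn.ReLU(),
        nn.Linear(16, 2),   # hidden layer -> output layer (weights)
    )

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    x = torch.randn(4, 8)            # a dummy batch of inputs
    y = torch.tensor([0, 1, 0, 1])   # dummy labels

    # Training phase: the weights of the nodes get adjusted step by step
    for _ in range(10):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()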

Neural Network

Known Natural Language Processing AI Models

There currently exist many NLP models, and it would seem that there is a race to develop the most powerful one. You will find Wu Dao, GPT-3, GPT-J, Meta AI, BERT, etc.

One of the challenges researchers face with such models is determining whether the models have learned to reason or simply memorize training examples.

Clearly, as you can imagine, some are open-source and others are not. Through the use of and access to these available models, many solutions are being built. I will briefly highlight some facts about the ones that I have looked at most, for which I found demo implementations or solutions built on top of them that you can try.

GPT Group

GPT stands for “Generative Pre-trained Transformer”. These are models trained to autonomously predict the next token in a sequence of tokens, a token being, in the case of text, a short sequence of characters.
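
To see what such tokens look like, here is a small sketch using the GPT-2 tokenizer from the Hugging Face transformers library (GPT-2 is the openly available predecessor; I use it only for illustration):

    # Requires: pip install transformers
    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    text = "Natural language processing is fascinating."
    tokens = tokenizer.tokenize(text)

    print(tokens)                                   # short character sequences, not always whole words
    print(tokenizer.convert_tokens_to_ids(tokens))  # the numeric IDs the model actually predicts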

GPT-3

This is the model that has created a lot of buzz since it came out in 2020, when it was the largest model ever trained. It has already been used by different companies to implement commercial solutions.

The model was developed by OpenAI. It started out as an open-source project; however, nowadays its code base has been licensed exclusively to Microsoft.

It has been trained to perform generalist and niche tasks, such as writing code in different programming languages like Python.

                GPT-2              GPT-3
Date            2019-02            2020-05
Parameters      1.5 Billion        125 Million – 175 Billion
Training Data   10 Billion tokens  499 Billion tokens

Model Progression, OpenAI

Here are two interesting links:

GPT-J, GPT-Neo & GPT-NeoX

These three models have been developed by EleutherAI, an open-source project run by a grassroots collective of researchers working on open-source AI research. From what I read, the models can be considered generalist models, good for most purposes.

             GPT-Neo             GPT-J      GPT-NeoX
Date         2021-03             2021-06    2022-02
Parameters   1.3 to 2.7 Billion  6 Billion  20 Billion

Model Progression, EleutherAI

Interesting Responses from GPT-J

Below you will find several screenshots of the responses that I got from their online test interface, so that you can judge for yourself.

AI responding to “Who is the greatest musician of all times?”
AI responding to “which is the best beginner programming language in your opinion?”
AI responding to “what is more important to work or to live?”

Here is the link to the online test instance where I got the responses from if you are interested: https://6b.eleuther.ai/

You can also get paid access at goose.ai and test the different EleutherAI models at very reasonable prices.
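
Since the models are open-source, you can also run the smaller checkpoints locally via the Hugging Face transformers library. Here is a sketch with the 125M-parameter GPT-Neo (the model name is the one EleutherAI published on the Hugging Face hub; the larger GPT-J and GPT-NeoX checkpoints need far more memory):

    # Requires: pip install transformers torch
    from transformers import pipeline

    generator = pipeline("text-generation", model="EleutherAI/gpt-neo-125M")

    result = generator(
        "Which is the best beginner programming language?",
        max_length=60,     # total length in tokens, prompt included
        do_sample=True,    # sample instead of greedy decoding
    )
    print(result[0]["generated_text"])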

Wu Dao 2.0 – China’s Monster Natural Language Processing AI

This Natural Language Processing AI model is considered the “monster”: the largest NLP model ever. It was created by the Beijing Academy of Artificial Intelligence in June 2021. Its code base is open-source and based on PyTorch, and it is “multi-modal”, being able to process images and text at the same time and to learn from both, something the others are not capable of.

It was trained on:

  • 1.2 TB of Chinese text data in the Wu Dao Corpora.
  • 2.5 TB of Chinese graphic data.
  • 1.2 TB of English text data from the Pile dataset.

It is supposedly capable of all the standard tasks, translation etc., but also of composing poetry, drawing, singing, and more.

                Wu Dao 2.0
Date            2021-06
Parameters      1.75 Trillion
Training Data   4.9 TB

Model Specs, Wu Dao 2.0

Some Implemented Solutions

Here you will find some interesting implementations that you can start using today if you want.

Jasper

This is a tool that I think many digital copywriters will find handy to ease their work.

Thoughts

The same applies to this solution, which helps you speed up writing your tweets in your own style.

DeepGenX

This is a solution for developers to write code faster and easier.

Nevertheless, these are just three of many more. Here is a more extensive list of such solutions.

Final Reflections

As the examples above show, technology never ceases to amaze me. Evidently, there is great potential in the use of these models. Yet what are the resulting disadvantages?

OpenAI, for instance, decided when they developed their GPT-2 model not to make it fully available, due to its potential for creating fake news. Later, OpenAI went one step further and called for general collaboration on AI safety in this post.

I agree with this line of thought. We have to weigh AI's possibilities and dangers and check them against our values and beliefs. Technology, in the end, is nothing but a tool, albeit a powerful one, which is why this old adage from before Christ rings true again:

“With great power comes great responsibility.”

Not from Marvel Comics : )

AI has only just started, and we will see much more of it in the coming years. If you want to read another interesting example of Natural Language Processing AI at work, here is another post of mine.


How to use GPG Keys the Right Way With GitHub

GPG Keys

Assuring the authenticity of work submitted to GitHub has become increasingly important. A common policy that organizations use to secure the commits made by developers is to require that Git commits be signed with GPG keys.

Both GitHub and Git have long natively supported cryptographically signed commits:

When commits are signed by each of their respective authors, it is much harder for an attacker to successfully pull off an impersonation attack.

My Experience

When I was asked to follow the Nautilus policy of having GPG signatures for commits, I followed the GitHub and Git guidelines blindly, without putting much thought into it. Later, after some internal discussion with my colleagues, it became evident that there are some additional aspects to be considered when using GPG for Git, GitHub, or any other use.

In this post I will walk you through:

  • How the default GPG keys are set up when you create them
  • Why this practice can be improved
  • Recommended best practices
  • How to implement them
  • How to use them with Git or GitHub
  • Some other recommendations (expiration date, key rotation, etc.)

GPG Keys

Like all asymmetric cryptographic keys, GPG keys are made of two parts: the “Private Key” and the “Public Key” (which is derived from the Private Key).

With GPG, the common practice is to generate a set of keys that are grouped together, with an extensive set of metadata, into a so-called OpenPGP key.

An OpenPGP key typically consists of:

  • Keys
    • Primary Key (Certify, and optionally other capabilities)
    • Supplementary Keys (Any of: Authentication, Signing, Encrypting)
  • User-ID [Name, Email, Comment, etc]
    • Primary IDs
    • Additional IDs
  • Key capabilities, signed metadata that is included in the public key, are listed in the brackets above.
  • All keys can be set with expiry dates.
  • Sub-keys and User-IDs can be independently revoked or retired.
    • If the primary key is revoked, then the entire OpenPGP key is considered compromised.

GPG Defaults

There are many arrangements and possible combinations of keys, sub-keys, user-IDs, and so on. When you use GPG to generate your keys, by default it generates them following a standard template:

  • Keys
    • Primary Key (Certify and Signing)
    • Supplementary Key (Encrypting)
  • User-ID
    • Primary ID (Name, Email, and Comment)

Notice that the primary key has been set up with the dual capabilities of Certifying (to make new supplementary keys) and Signing (such as signing a Git commit).
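
As an illustration (with a made-up key and user), this default template looks like the following in a key listing; the [SC] on the “pub” line marks the Sign+Certify primary key, and the [E] on the “sub” line marks the encryption sub-key:

    $ gpg --list-keys
    pub   rsa4096 2022-03-01 [SC]
          0123456789ABCDEF0123456789ABCDEF01234567
    uid           [ultimate] Jane Doe <jane@example.com>
    sub   rsa4096 2022-03-01 [E]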

This basic structure was chosen on the premise that keys used for encryption need to be (or at least should be) rotated regularly, whereas signing keys can remain constant over the lifetime of the OpenPGP key.

However, in many cases this is not what the user would want if given the choice.

Why is this not optimal?

The default setup still leaves some room for improvement, because it does not take advantage of the possibility of creating an individual sub-key for each capability.

The idea is that you essentially disconnect all the capabilities of your choice from your primary key and use only your sub-keys, avoiding any routine use of your primary key. The only times you then use the primary key are to cancel (revoke) existing sub-keys or to generate new ones.

Separating the primary key from the supplementary keys in this way is an advanced practice.

The advantage of this approach is that if any of these sub-keys gets compromised, you can revoke it individually and generate a new one, all while keeping your primary key valid.

If you do not do this, you will probably end up someday with your primary key compromised and will have to generate a whole new primary key, etc.

How to Create Further Sub-Keys

In order to create additional sub-keys, you need to use the GPG command-line interface.

A colleague of mine, Jose Celano, wrote a very clear step-by-step guide for internal use in one of our company’s repositories, here.

I base the following summary of the command-line steps on his work; a condensed example session follows the list.

  1. Type: gpg --list-keys --fingerprint --with-keygrip --with-subkey-fingerprints
  2. The list gives you an overview of each primary key and its existing sub-keys. Copy the second line of your public key, the fingerprint, made up of 10 groups of 4 hexadecimal characters.
  3. Using the noted fingerprint, type: gpg --edit-key <public key 40 digits without spaces>
  4. You will get a display of the associated private key and a new prompt, so type: addkey
  5. Select your applicable key type, most likely option (4) RSA (sign only).
  6. It will ask you to specify the key size and duration; I recommend 4096 bits and “0” for “does not expire”.
  7. Confirm the creation.
  8. You get a new overview of the secret keys, showing the newly generated sub-key and the changed capabilities of the primary key.
  9. To see the equivalent public keys for export, type: gpg --list-keys --fingerprint --with-keygrip --with-subkey-fingerprints <public key 40 digits without spaces>
  10. You should now see the new sub-key and the changed primary key capabilities.
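
Condensed, the session looks roughly like this (the fingerprint below is a made-up placeholder):

    # Step 1: list keys and note the 40-character fingerprint
    gpg --list-keys --fingerprint --with-keygrip --with-subkey-fingerprints

    # Steps 3-7: edit the key and add a signing-only sub-key
    gpg --edit-key 0123456789ABCDEF0123456789ABCDEF01234567
    #   gpg> addkey    (choose (4) RSA (sign only), 4096 bits, expiry "0")
    #   gpg> save

    # Steps 9-10: check the public view of the key
    gpg --list-keys --fingerprint --with-keygrip --with-subkey-fingerprints 0123456789ABCDEF0123456789ABCDEF01234567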

Removing Primary Key Rights

The last step is to remove all capabilities except “certify” from the primary key. For this, you will continue using the command line, but in “expert” mode; again, a condensed session follows the steps.

  1. Type: gpg --expert --edit-key <public key 40 digits without spaces>
  2. You will get an overview of the primary key’s capabilities.
  3. Type: change-usage
  4. Use the toggle options to take away the capabilities for which you have already created new sub-keys.
  5. Once you are done, you get a new overview of the primary key’s capabilities.
  6. Type save and you are done working on the keys.
  7. Type: gpg --list-keys --fingerprint --with-keygrip --with-subkey-fingerprints <public key 40 digits without spaces>
  8. You will see the new public key capabilities, where you should only see the “C” (certify) flag on the “pub” key.
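
Condensed (placeholder fingerprint again):

    # Steps 1-6: enter expert mode and strip the extra capabilities
    gpg --expert --edit-key 0123456789ABCDEF0123456789ABCDEF01234567
    #   gpg> change-usage   (toggle Sign off, keeping only Certify)
    #   gpg> save

    # Steps 7-8: the "pub" line should now show [C] only
    gpg --list-keys --fingerprint --with-keygrip --with-subkey-fingerprints 0123456789ABCDEF0123456789ABCDEF01234567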

Configuring Git with Your New Key

In order to set up your new key for signing your commits, follow these steps (the equivalent one-line commands are shown after the list):

  1. In the command prompt, type: git config --global --edit
  2. This will open the Git config file in your default editor. In my case it opens in Visual Studio Code.
  3. Once there, look for the signing key entry and update it with the last 16 digits of your new signing sub-key.
  4. Save it.
  5. If you are using GitHub, you will need to export your new public key and import it there, following the steps shown in the GitHub Documentation – Signing commits.
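
If you prefer not to edit the config file by hand, the equivalent one-line commands look like this (<KEYID> stands for the last 16 hex digits of your signing sub-key):

    # Tell Git which key to sign with
    git config --global user.signingkey <KEYID>

    # Optionally, sign every commit by default
    git config --global commit.gpgsign true

    # Export the public key in ASCII armor, ready to paste into GitHub
    gpg --armor --export <KEYID>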

Always use a Passphrase

When creating the set of keys, you are asked for a passphrase. Set it and remember it, or even better, write it down somewhere safe. This is another safety measure, and an essential one.

Backing up Your Revocation Certificate

Make sure that you keep a backup of your revocation certificate, or print it out and store it somewhere safe, in case you ever have to use it.
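
If you have not generated one yet, you can do so from the command line (<KEYID> is your key’s ID or fingerprint):

    # Generate a revocation certificate and store it somewhere safe
    gpg --output revoke.asc --gen-revoke <KEYID>

    # Note: modern GnuPG (>= 2.1) also creates one automatically under
    # ~/.gnupg/openpgp-revocs.d/ when the key is generated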

Rotating Your Encryption Keys

Encryption is one of the most used capabilities. It is recommended that you rotate encryption keys to prevent anyone from gaining lasting access to your encrypted information, creating new keys on events such as a computer change, for example. It is important, though, to back the old keys up in case you still have files encrypted with them.

Setting an Expiration Date

Another good idea is to set an expiration date not too far in the future, in case you are ever unable to revoke a key because you have lost your revocation certificate.
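
You can set or change the expiry at any time; for example, with a reasonably recent GnuPG (placeholder fingerprint):

    # Set the primary key to expire in one year (GnuPG >= 2.1.17)
    gpg --quick-set-expire 0123456789ABCDEF0123456789ABCDEF01234567 1y

    # Or interactively: gpg --edit-key <fingerprint>, then "expire" and "save"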

Some After Thoughts

Using GPG is good for security, but if you work through all these steps to do things properly, I am convinced you will agree with me that it could be more user-friendly.

Things could be made easier, especially with the default key setup, which could already create all the keys separately and avoid the need for all of this.

If this is too advanced, you can just go back to the basics, as in my previous post.