Git Basics
Table of Contents
- Control of Git
- Basic Git Commands
- Getting Started with Git
- Basic Git Commands
- Git Log
- Exploring the Log
- Formatting the Log
- Making Changes
- Commits: Options, Changing, and Syncing
- Writing Good Commit Messages
- Changing Commits
- Syncing Commits
- Viewing Differences
- Configuring Git
- The .git/config File
- Bisecting
- Starting a bisect in Git
- Edge Cases
- Tagging
- Annotated Tags
- Implementation of Annotated Tags
- Software Supply Chain Attacks
- Sub Modules
- Stashing
- Patching
Control of Git
There are many reasons to use a version control system, but an important one is that a project often has multiple versions of its history that do not agree with one another. In software development, multiple versions of a program routinely exist at the same time.
Two things are under Git's control:
- The object database, which records the history of project development, where the project is modeled as a tree of files
- basically everything that everyone did to the project
- The index (or cache), which records the future of the project, or your plans for part of that future
- it keeps track of what you are going to do next: you update the index and then commit it, so that its contents become part of the past
- the motivation is that you never have to commit anything that does not work yet
Basic Git Commands
Getting Started with Git
git init
creates a repo from scratch
git clone URL
clones an existing repo
git init followed by git remote add NAME URL
creates a repo locally and then links it to a remote
NOTE: We don't necessarily have a "boss and servant" relationship between upstream and downstream repositories. Oftentimes, a clone can become more active/popular than the original, in which case the latter will start to sync with the former instead of vice versa.
Basic Git Commands
git status
tells you about updates to the index and the current state of the repo
git ls-files
lists, on standard output, the names of all working files that are managed by Git
git checkout
arranges for the working files to be what they were at a certain branch
git grep "pattern"
is basically the same as grep pattern $(git ls-files): it searches the whole repo
git blame FILENAME
shows, for each line of the file, the author of the commit that last changed it, the time of that commit, and the shortened commit ID (unique prefix)
Git Log
git log
gives the history of the changes that brought the current repository to its present state, listed in reverse chronological order
SHIFT-G
(in the default pager, less) jumps to the end of the log, i.e. the oldest commit
- allows us to inspect the object database (code) and the reason for the changes (message within the commit)
- Each log entry contains: COMMIT_ID (BRANCH), AUTHOR, DATE, COMMIT_MESSAGE
- the commit ID is a checksum of the commit contents
- A checksum is a fixed-width integer that is a function of the bytes of the content it covers (so it acts like a pointer to that content)
- This function should, in practice, never produce collisions, where different contents produce the same checksum
- because the ID is a 160-bit number, the chance that two given different contents collide is about 1/(2^160)
- when you clone the database, the same IDs represent the same commits (this is why the ID is a checksum rather than an arbitrary number)
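The checksum idea can be tried directly with git hash-object, which prints the ID Git would give to some content. This is a minimal sketch; the sample strings are arbitrary, and it only demonstrates that the ID is a pure function of the bytes.

```shell
set -e
# The same bytes always produce the same ID, on any machine.
id_a=$(printf 'hello\n' | git hash-object --stdin)
id_b=$(printf 'hello\n' | git hash-object --stdin)
# Changing even one byte produces a different ID.
id_c=$(printf 'hello!\n' | git hash-object --stdin)
echo "same content:      $id_a"
echo "same content:      $id_b"
echo "different content: $id_c"
```

Because the ID depends only on the content, two clones never need to coordinate in order to agree on what a given ID refers to.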
Exploring the Log
- You can use the special .. syntax to specify a commit range.
git log A..B # A Exclusive B Inclusive
git log B # Getting the _entire_ history _up to_ commit with ID B
You can also use "version arithmetic" to get references to commits based on aliases such as branch names, tag names, HEAD, etc.
The special pointer HEAD references the current version of the repository. This is kind of like a "you are here" pointer. You can move it around, for example by checking out another branch or a specific commit, in which case HEAD moves to the corresponding commit and Git changes your project directory to match the snapshot of that commit.
- You can use the ^ syntax to specify the parent of a reference, so HEAD^ means the commit just before HEAD, HEAD^^ means the grandparent commit, etc.
- Example about showing the most recent commit:
git log HEAD^..HEAD
git log HEAD^! # shorthand
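The range and parent syntax can be checked in a throwaway repository. The sketch below assumes a scratch directory from mktemp and made-up commit messages; nothing in it comes from a real project.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
for n in 1 2 3; do
  echo "change $n" > notes.txt
  git add notes.txt
  git commit -q -m "commit $n"
done
# A..B lists commits reachable from B but not from A (A exclusive, B inclusive).
range=$(git log --format=%s HEAD^^..HEAD)
# HEAD^..HEAD therefore contains only the most recent commit.
latest=$(git log --format=%s HEAD^..HEAD)
# A single reference means the entire history up to that commit.
total=$(git rev-list --count HEAD)
echo "$range"
echo "latest: $latest, total commits: $total"
```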
Formatting the Log
git log --pretty=fuller
The fuller format shows that every commit has an author AND a committer, each with their own dates. Git distinguishes between these two contributors. This separation is routine for many larger projects. The author is the person who writes the code, and the committer is the overseer who reviews and confirms the changes. These fields establish responsibility for changes, a major reason for using version control in the first place.
Making Changes
- Edit the working files.
- Run git add FILES... to add the specified file contents to the index (the staging area, the cache).
- if you edit a file that you have already added, there will be three versions of the file: the one in HEAD, the one in the staging area, and the one in the working directory
- when you commit, what gets committed is the version that you most recently added, while the unstaged changes remain in the working directory
- Run one of the git diff commands to verify that the changes are what you want.
- Run git commit, which takes your index, makes a new commit, and puts it into the object database with the auto-generated checksum. In effect, it changes the commit that HEAD references.
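The "three versions of a file" situation from the steps above can be reproduced in a scratch repository. This is a sketch with a hypothetical file.txt; git show HEAD:PATH prints the committed version and git show :PATH prints the staged one.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
echo "v1" > file.txt
git add file.txt
git commit -q -m "add file"        # HEAD now holds v1
echo "v2" > file.txt
git add file.txt                   # the index now holds v2
echo "v3" > file.txt               # the working file holds v3, not staged
in_head=$(git show HEAD:file.txt)
in_index=$(git show :file.txt)
in_worktree=$(cat file.txt)
echo "HEAD=$in_head index=$in_index working=$in_worktree"
git commit -q -m "update file"     # commits what was staged
committed=$(git show HEAD:file.txt)
echo "committed=$committed, still in working directory: $(cat file.txt)"
```

The second commit records v2, and the unstaged v3 edit stays in the working directory until it is added.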
Commits: Options, Changing, and Syncing
- Commit messages are important because in essence, they help "market" your changes.
- tell readers of the repository why certain commits were made and whether it was a "good" commit by explaining the motivation behind the changes
- The rationale behind commit messages is similar to why you should comment your code.
- There is overlap between comments in the source code and commit messages, but the primary distinction is the audience.
- Commit messages are more historically oriented, what you would tell the "software historian," people interested in the development of the repository as a whole
- Comments in the source code are for the "current developer," people interested in having to study or change your code
Writing Good Commit Messages
- The first line should be at most 50 characters; it acts as the "subject line" for the commit, like an elevator pitch
- it gives any reader the gist of the commit
- The second line should be empty, separating the subject line from the body.
- The remaining lines should be a longer description of the change made in the commit
Changing Commits
git clean
removes untracked files
- This is useful for removing files created as part of some build process. If you're not sure, you can run a "what if" with the -n option. The --dry-run / -n switch is common to a lot of Git (and Unix in general) commands. It's a good way to "preview" the effects of a potentially destructive command instead of running it blindly right away.
- There's also the -x option, which also cleans files that would otherwise be ignored.
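A sketch of the dry run and of the -x option in a scratch repository follows. The file names are made up, and -f is passed for the real runs because Git by default refuses to clean without it.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
echo "*.o" > .gitignore
echo "int main(void) { return 0; }" > main.c
git add .gitignore main.c
git commit -q -m "initial"
touch main.o scratch.tmp           # one ignored file, one plain untracked file
preview=$(git clean -n)            # dry run: reports, removes nothing
echo "$preview"
git clean -f -q                    # removes untracked files, keeps ignored ones
if [ -e main.o ]; then ignored_survived=yes; else ignored_survived=no; fi
git clean -f -q -x                 # -x also removes ignored files
echo "ignored file survived the plain clean: $ignored_survived"
```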
Syncing Commits
- The upstream repository is where you cloned the downstream repo from (probably in the cloud)
- the .git directory is the downstream repository
- Suppose you want to sync the downstream repo with changes that were made to the upstream repo; you would use git fetch
git fetch
brings the latest state of the upstream repository into your downstream repository
- Git keeps two branches: "main" (your own branch, which your working directory follows) and "origin/main" (the latest known state of upstream); this separates your current changes from the changes that are upstream
- git fetch changes the repository's idea of what upstream is and does not change your current state (it syncs our opinion of upstream with what is actually upstream)
git pull
is effectively a fetch followed by merging the result into your version of the repository
- this is good when you have no changes on your side of the repository
- if you have made your own changes, there is a chance that they collide with upstream, causing issues when using this method
Suppose you clone a repository and then make a commit (c3) in your local repository.
c1 <- c2 (master, origin/master)
c1 <- c2 (origin/master) <- c3 (master)
Now suppose someone else pushed a change (c4) to the remote and you pulled. If the pull is done as a merge, it creates a merge commit whose parents are c3 and c4.
c1 <- c2 <- c4 (origin/master) <- merge (master)
      ^--- c3 <-------------------'
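The difference between fetching and pulling can be observed with two local repositories standing in for upstream and downstream. This is a sketch under that assumption; the directory names are hypothetical, and @{upstream} is used so the example does not depend on the default branch name.

```shell
set -e
base=$(mktemp -d)
git init -q "$base/upstream"
cd "$base/upstream"
git config user.name "Example Author"
git config user.email "author@example.com"
echo one > file.txt
git add file.txt
git commit -q -m "c1"
git clone -q "$base/upstream" "$base/downstream"
echo two > file.txt
git commit -q -a -m "c2"           # upstream moves ahead after the clone
upstream_head=$(git rev-parse HEAD)
cd "$base/downstream"
local_before=$(git rev-parse HEAD)
git fetch -q                       # updates only our opinion of upstream
local_after=$(git rev-parse HEAD)
remote_tracking=$(git rev-parse '@{upstream}')
git merge -q --ff-only '@{upstream}'   # the second half of what a pull does
local_merged=$(git rev-parse HEAD)
echo "branch moved by fetch: $([ "$local_before" = "$local_after" ] && echo no || echo yes)"
```

Here the downstream side has no commits of its own, so the merge is a simple fast-forward rather than a merge commit.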
Viewing Differences
git diff
is similar to the GNU diff command, and is like a more detailed version of git status
Viewing the difference between the index and the working files, Δ(index vs working files):
git diff
Viewing the difference between the latest commit and the index, Δ(latest commit vs index):
git diff --cached
git diff --staged # equivalent
And this is Δ(latest commit vs working files):
git diff HEAD
Compare the grandparent commit to the latest commit:
git diff HEAD^^..HEAD
More examples of viewing the difference between two commits:
# Typically with SHAs of the specific commits you want
git diff REF..REF
# But you can also abbreviate the hashes:
git diff 5c6cb30..53bf6bd
git diff 5c6c..53bf
# But this has a limit: an abbreviation needs at least 4 hex digits and must be unique. This fails:
git diff 5c6..53b
# As usual you can use the HEAD ref to reference commits relative to
# the last commit:
git diff HEAD~..HEAD
git diff HEAD~4..HEAD
git diff HEAD^..HEAD
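To see which pair of versions each form of git diff compares, it helps to give the commit, the index, and the working file three different contents. A minimal sketch in a scratch repository, with a hypothetical file.txt:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
echo "v1" > file.txt
git add file.txt
git commit -q -m "add file"        # latest commit: v1
echo "v2" > file.txt
git add file.txt                   # index: v2
echo "v3" > file.txt               # working file: v3
echo "== index vs working files =="
git diff
echo "== latest commit vs index =="
git diff --cached
echo "== latest commit vs working files =="
git diff HEAD
```

Each diff shows one removed line and one added line, and the pair of contents tells you which two versions were compared.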
Configuring Git
.gitignore is a special file inside the repository containing file patterns that Git should not track. The pattern syntax is similar to the familiar globbing patterns of the shell.
.gitignore is like a configuration file that instructs how Git behaves for everyone working on the project. It is itself under Git's control, i.e. it shows up in the output of git ls-files.
What files should be ignored?
Files that we do not want to put under version control. Obvious candidates include:
- Temporary files, such as \#*
- Machine-dependent files, such as object code matched by *.o
- Imported files (from other packages)
- Authentication information (passwords/keys/etc.)
- Hashes of passwords? If they are intended for authentication, leaking them would be nearly as bad as leaking raw passwords, so ignore them too. Exposed hashes enable rainbow table attacks, in which attackers use precomputed tables to recover passwords from their hashes.
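The patterns above can be tested with git check-ignore, which reports whether a path matches an ignore rule. This sketch uses made-up file names; the backslash in \#* is needed because a line starting with # is otherwise a comment in .gitignore.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
cat > .gitignore <<'EOF'
# Editor temporary files such as #notes.txt#
\#*
# Compiled object files
*.o
EOF
touch main.c main.o '#notes.txt#'
git add .gitignore main.c
git commit -q -m "initial"
for name in main.o '#notes.txt#' main.c; do
  if git check-ignore -q "$name"; then
    echo "$name: ignored"
  else
    echo "$name: not ignored"
  fi
done
git ls-files                       # .gitignore itself is tracked
```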
The .git/config File
You can view the current configuration of the Git program with:
git config -l
Its output includes the settings stored in the editable .git/config file, which is specific to the current repository. Cloning a repository does not copy this file; each clone gets its own .git/config.
CAUTION: One notable problem (which is standard across any software) is that if there is a syntax error in the configuration file, Git stops working altogether.
.git/config is NOT under version control because it determines how Git itself functions and because it would introduce the problem of recursion. .gitignore IS under version control because it's like a message from the developer and contains information about how to manage the project actually being version controlled. You also don't need to worry about what's in .gitignore to use Git itself.
Bisecting
Suppose you have a linear piece of history where somewhere between a stable version and the most recent commit, something went wrong. You can think of this problem of finding the first faulty commit as partitioning the timeline into OK and NG ("not good") sections, hence bisecting.
The timeline is "sorted" in that if you think of OK=0 and NG=1, the history will always be such that all NGs follow OKs.
- This then becomes a classic binary search problem, where we can identify the first NG commit in O(logN) time.
Starting a bisect in Git
git bisect start HEAD v4.3
- Then we tell Git to run your check script on each commit it visits and use the exit status to determine whether the commit is OK or NG
git bisect run COMMAND
where COMMAND is any shell command that tests the checked-out commit: exit status 0 marks the commit as OK, a failing status (1 to 127, except 125) marks it as NG, and Git keeps narrowing the range until it can report the first NG commit
git bisect run make check
keeps running the bisection with your test cases and reports the first bad commit once it is done
- make check recompiles everything and then runs the tests, since the source code changes from commit to commit
- the makefiles might change as well, and Git runs whatever makefile belongs to the commit being tested
Edge Cases
git bisect run timeout 100 make check
- times out after 100 seconds, so that you do not get caught in an infinite loop
git bisect run make -j check
runs the build and tests in parallel (the -j option belongs to make, not to git bisect)
- An edge case to consider is a bug that is fixed and then reintroduced; the history is then no longer "sorted", which can mislead the bisect
Of course, this also introduces the problem that if your test cases are buggy, then you may get false alarms. If you know ahead of time that a commit, say v3, will produce unreliable test results, you can skip it with:
git bisect skip v3
Tagging
Suppose you want to get all of the source code of a specific version. You want a symbol that stands for a specific commit and never moves or changes over time.
git checkout v27
git tag
lists all of the tags in alphabetical order
git tag TAG_NAME COMMIT_ID
creates a tag for the specific commit
Tagging is a big event within Git. You want users to know why the tag was created in the first place. Thus, you can create an annotated tag, which carries some metadata about the tag.
- you can have a branch and a tag with the same name
- you can have many tags pointing to the same commit
Annotated Tags
git tag -a theirversion -m "They liked the previous version" myvers^
creates an annotated tag named theirversion that points at the parent of the commit myvers (that is what the trailing ^ selects)
Implementation of Annotated Tags
ls -ltr $(find .git -type f) | less
shows a bunch of files that have been updated by the git tag command
- when you look into the files, you can see that a plain tag's file contains only the commit ID
- a plain tag is not an object in the Git object database; it is only a name for a commit
- an annotated tag is represented by its own type of object within the Git repository
- the tag itself is an object that carries extra data (such as the message), and it in turn points to the commit object it tags
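The difference between the two kinds of tag can be confirmed with git cat-file -t, which prints the type of the object a name resolves to. A minimal sketch with hypothetical tag names:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
echo "release" > file.txt
git add file.txt
git commit -q -m "first release"
git tag plain-tag                              # plain (lightweight) tag
git tag -a annotated-tag -m "First release"    # annotated tag
plain_type=$(git cat-file -t plain-tag)
annotated_type=$(git cat-file -t annotated-tag)
echo "plain-tag resolves to a:     $plain_type"
echo "annotated-tag resolves to a: $annotated_type"
# The annotated tag object in turn points at the tagged commit.
tagged_commit=$(git rev-parse 'annotated-tag^{commit}')
git cat-file -p annotated-tag                  # shows tagger and message
```

The plain tag name resolves straight to the commit, while the annotated tag resolves to a separate tag object with its own ID.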
Software Supply Chain Attacks
- people will write some code in a package that seems useful, but will actually exploit your system
- people rely on tags to decide whether a package is secure, trusting a version if its tag comes from someone trustworthy
- to prevent faking identities, you would use a signed tag that has some sort of authentication (uses a cryptographic key)
Sub Modules
The reason a project would want to use a submodule is to tie another package into it and pin that package to a commit ID or specific version.
- A submodule is basically a pointer to another project, containing a commit ID within the other project
- the other project can keep changing, and you can later pull the latest versions of the submodules that you have included
git submodule add https://github.com/a/b/c
- This is similar to cloning, but instead of just creating a new standalone repository complete with its own .git, it:
- Creates a subdirectory for the submodule.
- Creates a .gitmodules file in the main project that establishes the relationship between the main project and the subdirectory
- Furthermore, updating a submodule is very simple. Instead of making edits to possibly many files, as you would if you kept the dependency as part of your repository, the only change you make to the main project is which commit ID to use for the submodule
- This keeps both your code and cognitive load modularized.
git submodule foreach git pull origin master
for each submodule, pulls the latest upstream changes
Stashing
Suppose you are working on a change but have to turn your attention to something else.
- if you switch to a different branch with changes in the working directory, Git may complain that you have uncommitted work
- if you commit the half-finished work instead, the result will not be a clean commit
git stash push
saves the state of your working files on a stack of stashes and resets the working files to the last commit. When you want to retrieve this state, you can get it from the stash stack with:
git stash apply
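The stash round trip can be sketched in a scratch repository. The file name and its contents are made up; the point is only what the working file looks like before and after each command.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
echo "stable" > work.txt
git add work.txt
git commit -q -m "stable version"
echo "half-finished idea" >> work.txt    # uncommitted work in progress
git stash push -q                        # set the work aside
after_stash=$(cat work.txt)              # back to the committed content
git stash apply -q                       # bring the work back
after_apply=$(cat work.txt)
stash_entries=$(git stash list | wc -l | tr -d ' ')
echo "after stash: $after_stash"
echo "after apply: $after_apply"
```

Note that git stash apply leaves the entry on the stash stack, so the same saved state could be applied again.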
Patching
- share the repo over the network (bad/inefficient practice)
- email patches: a plain diff only tells you the change to the code but not the metadata (author, details)
git format-patch
outputs the diff together with the metadata of the changes in the commit
git am FILES...
imports commits from the email patches in the mbox format
- The reason for doing this is to ensure that people actually read the messages, unlike when pull requests are made
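The patch workflow can be sketched with two local repositories playing sender and receiver, with a file on disk standing in for the email. The names, identities, and commit message below are all hypothetical.

```shell
set -e
base=$(mktemp -d)
git init -q "$base/sender"
cd "$base/sender"
git config user.name "Sender"
git config user.email "sender@example.com"
echo "base" > file.txt
git add file.txt
git commit -q -m "base"
git clone -q "$base/sender" "$base/receiver"   # both sides share the base commit
echo "fix" >> file.txt
git commit -q -a -m "Fix the widget"
# Write the latest commit (diff plus metadata) as an mbox-style patch.
git format-patch -1 --stdout > "$base/fix.mbox"
cd "$base/receiver"
git config user.name "Receiver"
git config user.email "receiver@example.com"
git am -q "$base/fix.mbox"                     # turn the patch back into a commit
applied_subject=$(git log -1 --format=%s)
applied_author=$(git log -1 --format=%an)
applied_committer=$(git log -1 --format=%cn)
echo "$applied_subject (author: $applied_author, committer: $applied_committer)"
```

The imported commit keeps the sender as its author while the receiver becomes the committer, which matches the author and committer distinction shown by git log --pretty=fuller.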
- Authors
- Name
- Apurva Shah
- Website
- apurvashah.org